A cloud native system for high-performance workloads
Volcano is CNCF’s first cloud native batch computing project, focusing on high-performance computing scenarios such as AI, big data, and genomics analysis. Its core capabilities include:
• Unified Scheduling: Supports integrated job scheduling for both Kubernetes-native workloads and mainstream computing frameworks such as TensorFlow, Spark, PyTorch, Ray, and Flink.
• Queue Management: Provides multi-level queue management, enabling fine-grained resource quota control and task priority scheduling (a minimal submission sketch follows this list).
• Heterogeneous Device Support: Efficiently schedules heterogeneous devices such as GPUs and NPUs, making full use of their compute capacity.
• Network Topology Aware Scheduling: Places jobs according to the cluster’s network topology, greatly improving model training efficiency in distributed AI training scenarios.
• Multi-cluster Scheduling: Supports cross-cluster job scheduling, improving resource pool management and enabling large-scale load balancing.
• Online and Offline Workload Colocation: Colocates online and offline workloads, improving cluster resource utilization through intelligent scheduling strategies.
• Load-Aware Descheduling: Rebalances workloads based on actual node utilization, optimizing cluster load distribution and enhancing system stability.
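As a concrete illustration of the unified scheduling and queue management capabilities above, the sketch below creates a Volcano Queue and submits a gang-scheduled, GPU-requesting Job to it with the official Kubernetes Python client. This is a minimal sketch rather than an official example: it assumes Volcano is installed in the cluster and kubeconfig access is configured, and the queue name, image, replica count, and resource figures are purely illustrative.

```python
# Minimal sketch: create a Volcano Queue and submit a Job to it.
# Assumes Volcano is installed and `pip install kubernetes`; names and
# resource numbers below are illustrative, not recommended values.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# A queue whose capability caps the total resources its jobs may consume.
queue = {
    "apiVersion": "scheduling.volcano.sh/v1beta1",
    "kind": "Queue",
    "metadata": {"name": "ml-training"},
    "spec": {"weight": 1, "capability": {"cpu": "8", "memory": "32Gi"}},
}
# Queues are cluster-scoped objects.
api.create_cluster_custom_object(
    group="scheduling.volcano.sh", version="v1beta1", plural="queues", body=queue
)

# A Volcano Job with gang scheduling: minAvailable=2 means the job only starts
# when both worker replicas can be placed; each worker requests one GPU to
# illustrate heterogeneous device scheduling.
job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "tf-demo", "namespace": "default"},
    "spec": {
        "minAvailable": 2,
        "schedulerName": "volcano",
        "queue": "ml-training",
        "tasks": [
            {
                "replicas": 2,
                "name": "worker",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "worker",
                                "image": "tensorflow/tensorflow:latest",
                                "resources": {"limits": {"nvidia.com/gpu": 1}},
                            }
                        ],
                        "restartPolicy": "Never",
                    }
                },
            }
        ],
    },
}
api.create_namespaced_custom_object(
    group="batch.volcano.sh",
    version="v1alpha1",
    namespace="default",
    plural="jobs",
    body=job,
)
```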
As the industry’s first cloud native batch computing engine, Volcano has been widely adopted in high-performance computing scenarios such as artificial intelligence, big data, and genome sequencing, helping enterprises build elastic, efficient, and intelligent computing platforms.
A powerful batch scheduler that lets you run multi-architecture, compute-intensive jobs as Kubernetes workloads
Apache Spark™ is a unified analytics engine for large-scale data processing.
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
TensorFlow is an end-to-end open source machine learning platform.
PyTorch is an open source machine learning framework that accelerates the path from research prototyping to production deployment.
Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.
MindSpore is an all-scenario deep learning framework developed by Huawei.
Ray is a high-performance distributed computing framework that supports machine learning, deep learning, and distributed applications.
The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable.
The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners.
Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Apache MXNet is an open source deep learning framework suited for flexible research prototyping and production.
PaddlePaddle is an open source deep learning platform initiated by Baidu and derived from industrial practice.