What is Volcano
Volcano is a cloud native system for high-performance workloads and has been accepted by the Cloud Native Computing Foundation (CNCF) as its first and only official container batch scheduling project. Volcano supports popular computing frameworks such as Spark, TensorFlow, PyTorch, Flink, Argo, MindSpore, PaddlePaddle, and Ray. It also provides a rich set of scheduling capabilities, including heterogeneous device scheduling, network topology-aware scheduling, multi-cluster scheduling, and online-offline workload colocation.
Why Volcano
Job scheduling and management are becoming increasingly complex and critical in high-performance batch computing. Common requirements include:
- Support for diverse scheduling algorithms
- More efficient scheduling
- Non-intrusive support for mainstream computing frameworks
- Support for multi-architecture computing
Volcano is designed to meet these requirements. In addition, it inherits the design of the Kubernetes APIs, allowing you to easily run high-performance computing applications on Kubernetes.
Features
Unified Scheduling
- Support native Kubernetes workload scheduling
- Provide complete support for frameworks such as PyTorch, TensorFlow, Spark, Flink, and Ray through VolcanoJob (see the sketch after this list)
- Unified scheduling for both online microservices and offline batch jobs to improve cluster resource utilization
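As a rough illustration of how these pieces fit together, the sketch below submits a two-task (master/worker) VolcanoJob through the official Kubernetes Python client. It assumes a cluster with Volcano installed and the kubernetes package available; the job name, image, and resource requests are placeholders, and the field names follow the batch.volcano.sh/v1alpha1 Job CRD.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
api = client.CustomObjectsApi()

# A VolcanoJob with one master and two workers. minAvailable equals the total
# replica count, so the gang plugin starts all three pods together or none.
vcjob = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "pytorch-demo", "namespace": "default"},
    "spec": {
        "schedulerName": "volcano",
        "queue": "default",
        "minAvailable": 3,
        "tasks": [
            {
                "name": "master",
                "replicas": 1,
                "template": {
                    "spec": {
                        "restartPolicy": "OnFailure",
                        "containers": [{
                            "name": "master",
                            "image": "pytorch-train:latest",  # placeholder image
                            "resources": {"limits": {"cpu": "4", "memory": "8Gi"}},
                        }],
                    }
                },
            },
            {
                "name": "worker",
                "replicas": 2,
                "template": {
                    "spec": {
                        "restartPolicy": "OnFailure",
                        "containers": [{
                            "name": "worker",
                            "image": "pytorch-train:latest",  # placeholder image
                            "resources": {"limits": {"cpu": "4", "memory": "8Gi",
                                                     "nvidia.com/gpu": "1"}},
                        }],
                    }
                },
            },
        ],
    },
}

api.create_namespaced_custom_object(
    group="batch.volcano.sh",
    version="v1alpha1",
    namespace="default",
    plural="jobs",
    body=vcjob,
)
```

Because minAvailable matches the total replica count, the job is gang-scheduled: it stays pending until all three pods can be placed at once.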
Rich Scheduling Policies
- Gang Scheduling: Ensure all tasks of a job start simultaneously, suitable for distributed training and big data scenarios
- Binpack Scheduling: Optimize resource utilization through compact task allocation
- Heterogeneous Device Scheduling: Efficiently share GPU resources, supporting both CUDA and MIG modes for GPU scheduling, as well as NPU scheduling
- Proportion/Capacity Scheduling: Resource sharing/preemption/reclaim based on queue quotas
- NodeGroup Scheduling: Support node group affinity scheduling, implementing binding between queues and node groups
- DRF Scheduling: Support fair scheduling of multi-dimensional resources
- SLA Scheduling: Scheduling guarantees based on service-level agreements
- Task-topology Scheduling: Support task topology-aware scheduling, optimizing performance for communication-intensive applications
- NUMA-Aware Scheduling: Support NUMA-aware scheduling, optimizing resource allocation for tasks on multi-core processors and enhancing memory access efficiency and computational performance
- …
Volcano supports custom plugins and actions to implement more scheduling algorithms.
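The actions and plugin tiers are defined in the scheduler's configuration, which a default installation ships as a ConfigMap. The sketch below updates that configuration with the Kubernetes Python client; the volcano-scheduler-configmap name, the volcano-system namespace, and the volcano-scheduler.conf key reflect a typical Helm install and may differ in your deployment.

```python
from kubernetes import client, config

# Default-style actions and a two-tier plugin list; enabling or tuning the
# policies above (gang, binpack, drf, proportion, ...) is done by editing
# this document.
SCHEDULER_CONF = """\
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
"""

config.load_kube_config()
core = client.CoreV1Api()

# Patch only the scheduler configuration key; depending on the Volcano
# version, a scheduler restart may be needed to pick up the change.
core.patch_namespaced_config_map(
    name="volcano-scheduler-configmap",   # assumed default name
    namespace="volcano-system",           # assumed default namespace
    body={"data": {"volcano-scheduler.conf": SCHEDULER_CONF}},
)
```

Adding, removing, or reordering plugin entries in this configuration is how most of the policies listed above are enabled and tuned.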
Queue Resource Management
- Support multi-dimensional resource quota control (CPU, Memory, GPU, etc.)
- Provide multi-level queue structure and resource inheritance
- Support resource borrowing, reclaiming and preemption between queues
- Implement multi-tenant resource isolation and priority control
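As a concrete, hedged sketch of queue-based quota control with the Kubernetes Python client: the team-a name and the quota figures below are hypothetical, while weight, capability, and reclaimable come from the scheduling.volcano.sh/v1beta1 Queue CRD (fields for hierarchical queues and deserved quotas exist in newer releases and are omitted here).

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# A tenant queue: "weight" drives proportional sharing of idle capacity,
# "capability" is a hard ceiling, and "reclaimable" lets the scheduler take
# back resources this queue has borrowed when other queues need them.
queue = {
    "apiVersion": "scheduling.volcano.sh/v1beta1",
    "kind": "Queue",
    "metadata": {"name": "team-a"},  # hypothetical tenant queue
    "spec": {
        "weight": 2,
        "reclaimable": True,
        "capability": {
            "cpu": "64",
            "memory": "256Gi",
            "nvidia.com/gpu": "8",
        },
    },
}

# Queues are cluster-scoped objects.
api.create_cluster_custom_object(
    group="scheduling.volcano.sh",
    version="v1beta1",
    plural="queues",
    body=queue,
)
```

Jobs then target the queue by setting spec.queue (as in the VolcanoJob sketch above), and the scheduler enforces sharing, borrowing, and reclaim between queues according to these settings.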
Multi-architecture computing
Volcano can schedule computing resources from multiple architectures:
- x86
- Arm
- Kunpeng
- Ascend
- GPU
Network Topology-aware Scheduling
- Supports network topology-aware scheduling, taking full account of the network bandwidth characteristics between nodes. In AI scenarios, this optimizes data transmission for communication-intensive distributed training tasks, significantly reducing communication overhead and improving model training speed and overall efficiency.
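For illustration only: in recent releases that model the physical network with HyperNode objects, a job can opt in to topology-aware placement through a networkTopology block in its spec. The field names below (mode, highestTierAllowed) are assumptions taken from that design and should be checked against the documentation for the Volcano version in use.

```python
# Fragment to merge into the "spec" of a VolcanoJob manifest (see the earlier
# VolcanoJob sketch) before submitting it. Field names are assumptions based on
# Volcano's HyperNode-based network topology-aware scheduling.
network_topology_spec = {
    "networkTopology": {
        "mode": "hard",            # require all pods to share one topology domain
        "highestTierAllowed": 1,   # keep the job within the lowest (fastest) tier
    }
}
```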
Online and Offline Workloads Colocation
- Supports colocation of online and offline workloads, improving resource utilization while ensuring the QoS of online workloads through unified scheduling, dynamic resource overcommitment, CPU burst, and resource isolation.
Multi-cluster Scheduling
- Support cross-cluster job scheduling for larger-scale resource pool management and load balancing
For more details about multi-cluster scheduling, see: volcano-global
Descheduling
- Support dynamic descheduling to optimize cluster load distribution and improve system stability
For more details about descheduling, see: descheduler
Monitoring and Observability
- Complete logging system
- Rich monitoring metrics
- Provides a dashboard that gives users a graphical interface for operations
For more details about dashboard, see: dashboard
For more details about volcano metrics, see: metrics
Ecosystem
Volcano has become the de facto standard in batch computing scenarios and is widely used with the following high-performance computing frameworks:
- Spark
- TensorFlow
- PyTorch
- Flink
- Argo
- Ray
- MindSpore
- PaddlePaddle
- OpenMPI
- Horovod
- MXNet
- Kubeflow
- KubeGene
- Cromwell
Additionally, Volcano has been widely adopted by enterprises and organizations in the AI and big data fields. With its powerful resource management capabilities, efficient job management mechanisms, and rich scheduling strategies (such as gang scheduling, heterogeneous device scheduling, and topology-aware scheduling), it effectively meets the complex demands of distributed training and data analysis tasks. At the same time, Volcano improves scheduling performance while keeping task scheduling flexible and reliable, providing strong support for enterprises building systems with efficient resource utilization.
Future Outlook
Volcano will continue to expand its functional boundaries through community collaboration and technical innovation, becoming a leader in high-performance computing and cloud-native batch scheduling.