This article was firstly released at
Container Cube
on September 6th, 2021, refer toVolcano v1.4.0-Beta发布,支持NUMA-Aware等多个重要特性
Volcano, CNCF’s first batch computing project, is now available with a new version, v1.4 (Beta). This version includes multiple important features, such as resource ratio-based partitions on GPU nodes, NUMA-aware, mixed deployment of multiple schedulers, and greatly improved stability.
Resource ratio-based partitions on GPU nodes is developed to avoid idle GPUs while GPU-consuming jobs are starving. This is an important feature contributed by Leinao Cloud, a Volcano community member.
Previously, a scheduler had separate rules for allocating scarce resources such as GPUs and common resources such as CPUs. That is, CPU-consuming jobs can be directly allocated to GPU nodes to consume CPU and memory resources without considering the upcoming GPU jobs and reserving no resources for them. Alternatively, an independent scheduler was configured for GPU nodes, which did not allow CPU-consuming jobs to be scheduled to GPU nodes.
Now with resource ratio-based partitions, you can set a dominant resource (usually GPU) and configure a resource ratio (for example, GPU:CPU:Memory = 1:4:32) for the dominant resource. The scheduler ensures that the ratio of idle GPU, CPU, and memory resources on a GPU node is greater than or equal to the value you set.
In this way, GPU-consuming jobs that meet the ratio requirement can be scheduled to the node at any time, preventing GPU wastes. Compared with other solutions in the industry, this more flexible method improves node resource utilization.
For details about the feature design and usage, you can visit https://github.com/volcano-sh/volcano/blob/master/docs/design/proportional.md.
CPU NUMA-aware is another important feature of this version. For computing-intensive jobs such as AI and big data jobs, enabling NUMA will significantly improve the computing efficiency. With CPU NUMA-aware scheduling, you can configure the NUMA policy to determine whether to enable NUMA for workloads. The scheduler will select a node that meets the NUMA requirements.
For details about the feature design and usage, you can visit https://github.com/volcano-sh/volcano/blob/master/docs/design/numa-aware.md.
You can now deploy different types of schedulers in a Kubernetes cluster to properly schedule resources. The most common use case is deploying default-scheduler and Volcano together. Native Kubernetes resource objects, such as Deployments and StatefulSets, can be scheduled by default-scheduler, and high-performance computing workloads, such as Volcano Jobs, TensorFlow Jobs, and Spark Jobs, can be scheduled by Volcano. This solution can make the best possible use of each type of schedulers and reduce the concurrency pressure of a single scheduler.
For details about the feature design and usage, you can visit https://github.com/volcano-sh/volcano/blob/master/docs/design/multi-scheduler.md.
In addition to the preceding features, Volcano v1.4 (Beta) adds the stress testing automation framework and fixes bugs introduced by the resource comparison function robustness.
The community is collecting roadmap features for Volcano v1.5. We have received requirements on support for cluster resource monitoring, hierarchical queues, enhanced Spark integration, and task dependency. Every piece of your suggestions and issues is welcome.