- enhanced plugin for PyTorch Jobs
- Ray on Volcano
- enhanced scheduling for general Kubernetes services
- multi-architecture images of Volcano
- optimized queue status info
Volcano is the industry-first cloud native batch computing project. Open-sourced at KubeCon Shanghai in June 2019, it became an official CNCF project in April 2020. In April 2022, Volcano was promoted to a CNCF incubating project. By now, more than 490 global developers have committed code to the project. The community is seeing growing popularity among developers, partners, and users.
1. Enhanced Plugin for PyTorch Jobs
As one of the most popular AI frameworks, PyTorch has been widely used in deep learning fields such as computer vision and natural language processing. More and more users turn to Kubernetes to run PyTorch in containers for higher resource utilization and parallel processing efficiency.
Volcano 1.7 enhanced the plugin for PyTorch Jobs, freeing you from the manual configuration of container ports, MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK environment variables.
Other enhanced plugins include those for TensorFlow, MPI, and PyTorch Jobs. They are designed to help you run computing jobs on desired training frameworks with ease.
Volcano also provides an extended development framework for you to tailor Job plugins to your needs.
Design Documentation: Pytorch-plugin
User Guide: Pytorch-plugin-user-guide
2. Ray on Volcano
Ray is a unified framework for extending AI and Python applications. It can run on any machine, cluster, cloud, and Kubernetes cluster. Its community and ecosystem are growing steadily.
As machine learning workloads are hosting computing jobs at a density higher than ever before, single-node environments are failing in providing enough resources for training tasks. Here’s where Ray comes in, which seamlessly coordinates resources of the entire cluster, instead of a single node, to run the same set of code. Ray is designed for common scenarios and any type of workloads.
For users running multiple types of Jobs, Volcano partners with Ray to provide high-performance batch scheduling. Ray on Volcano has been released in KubeRay v0.4.
User Guide: KubeRay-integration-with-Volcano
Issue: #2429， #213
3. Enhance Scheduling for General Kubernetes Services
Schedulers have their own advantages according to the use case. For example, in batch computing, Volcano provides more scheduling policies and capabilities. In general scheduling, the Kubernetes default scheduler is more balanced. However, it’s often the case that a user runs multiple types of tasks in the same cluster. When there are both batch computing and general tasks, scheduling can be a challenge.
Starting from version 1.7, Volcano becomes fully compatible with the Kubernetes default scheduler to schedule and manage long-running services. Now you can use Volcano to centrally schedule both batch computing and general workloads.
- Supports multiple types of schedulers for Volcano scheduler and webhook.
- Supports NodeVolumeLimits plugin.
- Supports VolumeZone plugin.
- Supports PodTopologySpread plugin.
- Supports SelectorSpread plugin.
Support for Kubernetes 1.25 is also available in Volcano 1.7.
4. Multi-architecture Images
You can now compile multi-architecture Volcano images by a few clicks through cross compilation. For example, you can compile the base images of the amd64 and arm64 architectures on an amd64 host and push the images to the image repository. During installation and deployment, the system automatically selects a proper image based on the host architecture for you, more user-friendly than before.
User Guide: building-docker-images
5. Optimized Queue Status Info
Volcano can now collect statistics on allocated resources in real time to the queue status info, which eases dynamic resource adjustment and puts cluster resources into good use.
Volcano allocates and manages cluster resources by queues. The Capability field limits the resource use for each queue, which is a hard ceiling.
Before, users had no clear view on the allocated resources in queues and idle resources among those defined by Capability. Creating a large number of workloads against insufficient resources may cause job suspension and unexpected cluster scale-out triggered by autoscaler, increasing the cloud resource costs. Now with more detailed status info, you can manage cluster resources more efficiently and avoid excess costs.
Volcano 1.7.0 is brought into being from hundreds of code commits from 29 contributors. Thanks for your contributions.
Contributors on GitHub:
Volcano is designed for high-performance computing applications such as AI, big data, gene sequencing, and rendering, and supports mainstream general computing frameworks. More than 26,000 global developers joined us, among whom the in-house ones come from companies such as Huawei, AWS, Baidu, Tencent, JD, and Xiaohongshu. There are 2,800 Stars and 670 Forks for the project. Volcano has been proven feasible for mass data computing and analytics, such as AI, big data, and gene sequencing. Supported frameworks include Spark, Flink, TensorFlow, PyTorch, Argo, MindSpore, Paddlepaddle, Kubeflow, MPI, Horovod, MXNet, KubeGene, and Ray. The ecosystem is thriving with more developers and use cases coming up.