As artificial intelligence (AI) technologies advance and large language models (LLMs) grow more popular, demand for AI compute has boomed. This in turn has created strong demand for high-performance scheduling of AI workloads and of hardware such as AI chips.
Volcano is the first cloud native batch computing project in the industry. In 2019, it was donated by Huawei Cloud to the Cloud Native Computing Foundation (CNCF) and became CNCF’s first and only batch computing incubator project. Volcano provides unified high-performance job management for AI, big data, and high-performance computing (HPC) and supports a variety of high-level scheduling policies, including online and offline scheduling, AI elastic training scheduling, service level agreement (SLA), topology-based scheduling, fairness, load aware scheduling, rescheduling, preemption, and reclamation. It offers unified lifecycle management, job dependency management, and task dependency management for workloads like Spark, Flink, PyTorch, MPI, and TensorFlow. In terms of fine-grained resource management, Volcano supports min-max queue resource management, queue resource reservation, and dynamic resource sharing for multi-tenant resource leasing or preemption. Additionally, Volcano schedules heterogeneous resources including x86, Arm, GPUs, and Ascend, and provides refined scheduling of CPUs and GPUs. Users can allocate resources based on their requirements and significantly improve cost-effectiveness using Volcano.
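To make the idea of min-max queue resource management with dynamic sharing more concrete, here is a minimal Python sketch. It is an illustration of the general concept only, not Volcano's actual algorithm or API: the `Queue` fields, the `allocate` function, and the round-robin lending rule are all hypothetical names and simplifications chosen for this example. Each queue first receives its guaranteed minimum (or less, if it asks for less), and any idle CPUs are then lent to queues that still have unmet demand, never exceeding a queue's maximum.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Queue:
    # All field names are hypothetical and exist only for this sketch.
    name: str
    min_cpus: int  # guaranteed share
    max_cpus: int  # hard cap
    demand: int    # CPUs currently requested by the queue's jobs


def allocate(queues: list[Queue], cluster_cpus: int) -> dict[str, int]:
    """Grant each queue its guaranteed minimum, then lend idle CPUs up to each cap."""
    # Pass 1: guaranteed share, never more than the queue actually wants.
    grant = {q.name: min(q.min_cpus, q.demand) for q in queues}
    idle = cluster_cpus - sum(grant.values())
    if idle < 0:
        raise ValueError("guaranteed minimums exceed cluster capacity")

    # Pass 2: lend idle CPUs one at a time, round-robin, to queues that
    # still have unmet demand and are below their cap.
    progress = True
    while idle > 0 and progress:
        progress = False
        for q in queues:
            ceiling = min(q.max_cpus, q.demand)
            if idle > 0 and grant[q.name] < ceiling:
                grant[q.name] += 1
                idle -= 1
                progress = True
    return grant


queues = [
    Queue("training", min_cpus=40, max_cpus=80, demand=100),
    Queue("spark", min_cpus=40, max_cpus=60, demand=10),
    Queue("hpc", min_cpus=20, max_cpus=100, demand=50),
]
result = allocate(queues, cluster_cpus=100)
print(result)
```

In this example the lightly loaded `spark` queue uses only part of its guarantee, so the CPUs it leaves idle are shared between `training` and `hpc`. A real scheduler would also need to reclaim lent resources when the lending queue's demand rises, which this sketch does not model.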
The Volcano community has attracted more than 58,000 developers worldwide and earned more than 3,200 stars and over 730 forks on GitHub. The contributors include Huawei, AWS, IBM, Baidu, Tencent, JD, Xiaohongshu, 4Paradigm, BoCloud, DaoCloud, Ruitian Capital, Qiniu Cloud, Yinqing Technology, ByteDance, Kuaishou, Unisound, Infosys, Visa, NetEase, Red Hat, Kingsoft Cloud, Inspur, ZTE, Oracle, and iQIYI.
More than 50 Volcano use cases have been implemented. They span industries such as the Internet, advanced manufacturing, finance, life sciences, scientific research, autonomous driving, and medicine, and they cover massive data computing and analysis scenarios like AI, big data, genomic sequencing, and rendering. Major users include Tencent, Amazon, ING Bank, Baidu, Xiaohongshu, DiDi, 360, iQIYI, Leinao, Pengcheng Laboratory, Cruise, Li Auto, Unisound, Ximalaya, Vipshop, GrandOmics, and BOSS Zhipin. As the Volcano ecosystem expands, more and more users want to join the community. Huawei Cloud has worked with 11 partners to launch the Volcano community co-construction program and cultivate a more prosperous Volcano ecosystem.
According to Deng Mingkun, General Manager of Huawei Cloud Open Source Services, “The cloud native batch computing project, Volcano, has been widely adopted in domains such as AI, big data, genomic sequencing, rendering, transcoding, multimedia, and finance, since June 2019. A group of industry users not only actively promote the implementation of Volcano in production environments, but also contribute a lot to the Volcano community based on their own experience. Huawei Cloud intends to work with partners to launch the Volcano community co-construction program to create a more prosperous Volcano ecosystem and help more enterprises accelerate their cloud native progress.”
The first batch of members to join the program are Baidu, BoCloud, 4Paradigm, Vipshop, Ruitian Capital, Leinao, Pinlan, 360, NetEase Shufan, Ximalaya, and BOSS Zhipin.
According to Zhou Ti, the tech lead of Baidu’s PaddlePaddle open source ecosystem, “PaddlePaddle and Volcano jointly released the PaddlePaddle on Volcano solution to improve PaddlePaddle’s computing efficiency. As a platform for high-performance computing, Volcano makes up for Kubernetes’ lack of basic capabilities in machine learning, deep learning, HPC, and big data computing. Additionally, Volcano enhances the batch creation and lifecycle management of computing tasks, fair-share scheduling and other aspects on the basis of the native Kubernetes capability. These features meet PaddlePaddle’s basic requirements.”
Zhao Anquan, General Manager of BoCloud PaaS, said, “BoCloud’s HPC solution, based on CNCF’s Volcano scheduling engine, a product well respected by many customers, provides a high-concurrency computing platform that runs AI, big data, and simulation calculation applications, resolving many pain points in the industry. We also donated the industry’s first HPC job orchestration component JobFlow to the Volcano community so that users can better apply cloud native technologies.”
Li Mengxuan, head of heterogeneous computing virtualization at 4Paradigm, said, “The Volcano project enables us to solve, at a low cost, the pain points encountered when implementing cloud native technologies in AI projects, especially in terms of device reuse. Using Volcano will significantly improve cluster resource utilization. 4Paradigm will continuously contribute code to the community to build Volcano into a reuse platform that supports all mainstream forms of heterogeneous compute such as NPUs, GPUs, MLUs, and DCUs.”
He Yingpeng, head of Vipshop’s AI cloud platform, said, “As a top e-commerce platform in China, Vipshop faces problems associated with rapid growth, rapid product iteration, and maintaining a diverse product portfolio. A Volcano-based AI training platform with advanced scheduling policies like queue and gang scheduling can support scheduling of more than 100,000 vCPUs, accelerating Vipshop’s service innovation.”
Chang Feng, head of the Leinao R&D Center, said, “Volcano is one of the first open source cloud native projects for batch computing. It has dynamically configurable advanced scheduling policies and excellent resource management capabilities, which can address multiple challenges, like job scheduling, lifecycle management, and heterogeneous hardware support in AI scenarios. During the implementation, we expanded Volcano’s capabilities to effectively improve our system stability and resource utilization.”
Peng Jingtian, co-founder and CTO of Pinlan, said, “CNCF’s Volcano project has been successfully applied to AlphaDraw, our cloud native intelligent building design platform. Volcano provides AlphaDraw’s algorithm services with batch processing and auto scaling capabilities in scenarios like AI-based model flipping of Computer Aided Design (CAD) two-dimensional drawings and intelligent design of three-dimensional building models, greatly improving Kubernetes cluster resource utilization and optimizing workload performance. As one of the first members of the Volcano community co-construction program, Pinlan continuously contributes best practices for Cloud+AI in the architectural design field to the community. We expect AlphaDraw and Volcano to develop together and to keep providing better products and solutions for intelligent cloud computing and the industry’s cloud native progress.”
Wang Xinyong, a cloud native technology expert from NetEase Shufan, said, “Volcano provides many useful supplements to Kubernetes’ native capabilities, enabling it to better orchestrate batch processing tasks like AI training and big data computing. Volcano’s excellent task abstraction and management capabilities, multiple scenario-based scheduling mechanisms, and out-of-the-box integration with multiple common open source computing frameworks enable us to focus more on providing business value for users without spending a lot of effort on reinventing systems.”
The owner of the Ruitian Capital Infrastructure Team said, “Volcano supplements native Kubernetes capabilities such as batch task scheduling, resource sharing, and fair scheduling policies, and provides unified interfaces to reduce learning and maintenance costs. In the production environment, Volcano works with our proprietary level-2 scheduling to meet the requirements of tens of thousands of tasks per day, greatly improving the efficiency of policy research.”
The leader of the 360 container team said, “Volcano makes up for Kubernetes’ lack of basic scheduling capabilities in machine learning and big data computing tasks. It provides various plug-ins to schedule tasks in different scenarios, greatly improving the cluster utilization. Additionally, Volcano supports most mainstream computing frameworks like Spark, TensorFlow, and Flink. The overall design of Volcano follows the design and mechanisms of Kubernetes, which reduces our learning costs.”
The head of the Ximalaya AI cloud team said, “Volcano enhances Kubernetes’ capabilities like batch task scheduling, resource sharing, and fair scheduling, and provides elastic scheduling. As a basic component for resource scheduling of the machine learning platform, Volcano improves GPU utilization in the production environment.”
The owner of the BOSS Zhipin AI fundamental platform team said, “BOSS Zhipin builds infrastructure based on Volcano in AI and big data computing scenarios. Volcano’s powerful batch processing and robust scheduling policies are very convenient for us. They help support complex service scenarios and greatly improve BOSS Zhipin’s cluster resource utilization and stability. With the support of its robust ecosystem and the community, Volcano has greatly helped our technological and business development.”
We look forward to working with more organizations to build a more inclusive Volcano community.
The Volcano community launched the co-construction program to bring users into the community more quickly, to accelerate cloud native progress, and to ensure a diverse Volcano ecosystem.
Through this program, you will have opportunities for technological guidance, promotion, and online and offline technological sharing. If your company or organization recognizes the value that Volcano has to offer, wants help using Volcano, or wants to exert its technological influence, consider joining the program.
For details about the requirements and benefits, see https://github.com/volcano-sh/community/blob/master/community-building-program.md.
Application to the program
- Scan the QR code or click to read the full text and fill in the application form.
- The result will be sent by email. Please wait.
If you have any questions, please contact the Volcano community at wang.platform@gmail.com.