A cloud native system for high-performance workloads
Volcano is CNCF’s first cloud native batch computing project, focusing on high-performance computing scenarios such as AI, big data, and genomics analysis. Its core capabilities include:
• Unified Scheduling: Supports integrated job scheduling for both Kubernetes-native workloads and mainstream computing frameworks such as TensorFlow, Spark, PyTorch, Ray, and Flink.
• Queue Management: Provides multi-level queue management, enabling fine-grained resource quota control and task priority scheduling (a minimal submission sketch follows this list).
• Heterogeneous Device Support: Efficiently schedules heterogeneous devices such as GPUs and NPUs, making full use of their compute capacity.
• Network Topology Aware Scheduling: Places jobs according to the cluster’s network topology, greatly improving model training efficiency in distributed AI training scenarios.
• Multi-cluster Scheduling: Supports cross-cluster job scheduling, improving resource pool management and enabling large-scale load balancing.
• Online and Offline Workload Colocation: Colocates online and offline workloads, improving cluster resource utilization through intelligent scheduling strategies.
• Load-Aware Descheduling: Rebalances workloads based on actual node utilization, optimizing cluster load distribution and enhancing system stability.
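As a concrete illustration of the unified scheduling and queue management capabilities above, the sketch below creates a Volcano Queue and submits a gang-scheduled, GPU-requesting Job to it with the official Kubernetes Python client. This is a minimal sketch rather than an official example: it assumes Volcano is installed in the cluster and kubeconfig access is configured, and the queue name, image, replica count, and resource figures are purely illustrative.

```python
# Minimal sketch: create a Volcano Queue and submit a Job to it.
# Assumes Volcano is installed and `pip install kubernetes`; names and
# resource numbers below are illustrative, not recommended values.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# A queue whose capability caps the total resources its jobs may consume.
queue = {
    "apiVersion": "scheduling.volcano.sh/v1beta1",
    "kind": "Queue",
    "metadata": {"name": "ml-training"},
    "spec": {"weight": 1, "capability": {"cpu": "8", "memory": "32Gi"}},
}
# Queues are cluster-scoped objects.
api.create_cluster_custom_object(
    group="scheduling.volcano.sh", version="v1beta1", plural="queues", body=queue
)

# A Volcano Job with gang scheduling: minAvailable=2 means the job only starts
# when both worker replicas can be placed; each worker requests one GPU to
# illustrate heterogeneous device scheduling.
job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "tf-demo", "namespace": "default"},
    "spec": {
        "minAvailable": 2,
        "schedulerName": "volcano",
        "queue": "ml-training",
        "tasks": [
            {
                "replicas": 2,
                "name": "worker",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "worker",
                                "image": "tensorflow/tensorflow:latest",
                                "resources": {"limits": {"nvidia.com/gpu": 1}},
                            }
                        ],
                        "restartPolicy": "Never",
                    }
                },
            }
        ],
    },
}
api.create_namespaced_custom_object(
    group="batch.volcano.sh",
    version="v1alpha1",
    namespace="default",
    plural="jobs",
    body=job,
)
```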
As the industry’s first cloud native batch computing engine, Volcano has been widely adopted in high-performance computing scenarios such as artificial intelligence, big data, and genome sequencing, helping enterprises build elastic, efficient, and intelligent computing platforms.
A powerful batch scheduler that lets you run multi-architecture, compute-intensive jobs as Kubernetes workloads
Apache Spark™ is a unified analytics engine for large-scale data processing.
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
TensorFlow is an end-to-end open source machine learning platform.
PyTorch is an open source machine learning framework that accelerates the path from research prototyping to production deployment.
Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.
MindSpore is an all-scenario deep learning framework developed by Huawei.
Ray is a high-performance distributed computing framework that supports machine learning, deep learning, and distributed applications.
The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable.
The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners.
Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Apache MXNet is an open source deep learning framework suited for flexible research prototyping and production.
PaddlePaddle is an open source deep learning platform initiated by Baidu and derived from industrial practice.