Supports integrated job scheduling for both Kubernetes-native workloads and mainstream computing frameworks such as TensorFlow, Spark, PyTorch, Ray, and Flink.
Provides multi-level queue management capabilities, enabling fine-grained resource quota control and task priority scheduling.
Efficiently schedules heterogeneous devices like GPU and NPU, fully unleashing hardware computing potential.
Greatly enhances model training efficiency in distributed AI training scenarios.
Supports cross-cluster job scheduling, improving resource pool management and achieving large-scale load balancing.
Enables colocation of online and offline workloads, improving cluster resource utilization through intelligent scheduling strategies.
Optimizes cluster load distribution and enhances system stability.
Supports a range of scheduling strategies, including Gang scheduling, Fair-Share, Binpack, DeviceShare, NUMA-aware scheduling, and Task Topology.
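To make one of the strategies above concrete, here is a minimal Python sketch of the all-or-nothing idea behind Gang scheduling: a job is admitted only when every pod it needs to start together fits in the free capacity, so no job is left holding a partial allocation. This is an illustration under simplifying assumptions, not the project's actual scheduler code. The `Job` class, its fields, and the `gang_admit` function are hypothetical names, and CPU is the only resource modeled.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical model of a job for illustration only.
    name: str
    min_available: int  # number of pods that must start together
    cpu_per_pod: int    # CPU cores requested by each pod


def gang_admit(jobs, free_cpu):
    """Admit each job only if all of its min_available pods fit at once."""
    admitted = []
    for job in jobs:
        needed = job.min_available * job.cpu_per_pod
        if needed <= free_cpu:
            # The whole gang fits: reserve capacity for every pod.
            free_cpu -= needed
            admitted.append(job.name)
        # Otherwise nothing is reserved: no partial allocation for this job.
    return admitted, free_cpu


jobs = [Job("train-a", 4, 2), Job("train-b", 8, 2), Job("etl-c", 2, 1)]
admitted, remaining = gang_admit(jobs, free_cpu=12)
print(admitted, remaining)  # ['train-a', 'etl-c'] 2
```

In this sketch `train-b` needs 16 cores but only 4 remain after `train-a` is admitted, so it receives nothing rather than a partial set of pods, which leaves room for the smaller `etl-c` job.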
Seamlessly integrates with mainstream computing frameworks for AI, big data, and scientific computing.
Apache Spark™ is a unified analytics engine for large-scale data processing.
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
TensorFlow is an end-to-end open source machine learning platform.
PyTorch is an open source machine learning framework that accelerates the path from research prototyping to production deployment.
Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.
MindSpore is an all-scenario deep learning framework developed by Huawei.
Ray is a high-performance distributed computing framework that supports machine learning, deep learning, and distributed applications.
The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable.
The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners.
Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Apache MXNet is an open source deep learning framework suited for flexible research prototyping and production.
PaddlePaddle is an open source deep learning platform initiated by Baidu and derived from industrial practice.