Volcano Blog

Introducing Kthena: Redefining LLM Inference for the Cloud-Native Era

Tue, 06 Jan 2026 00:00:00 GMT

Today, the Volcano community is proud to announce the launch of Kthena, a new sub-project designed for global developers and MLOps engineers.

Kthena is a cloud-native, high-performance system for LLM inference routing, orchestration, and scheduling, tailored specifically for Kubernetes. Engineered to address the complexity of serving LLMs at production scale, Kthena delivers granular control and enhanced flexibility. Through features like topology-aware scheduling, KV Cache-aware routing, and Prefill-Decode (PD) disaggregation, it significantly improves GPU/NPU utilization and throughput while minimizing latency.

As a sub-project of Volcano, Kthena extends Volcano’s capabilities beyond AI training, creating a unified, end-to-end solution for the entire AI lifecycle.

The "Last Mile" Challenge of LLM Serving

While Large Language Models (LLMs) are reshaping industries, deploying them efficiently on Kubernetes remains a complex systems engineering challenge. Developers face four critical hurdles:

Low Resource Utilization: The dynamic memory footprint of LLM inference—especially the KV Cache—creates massive pressure on GPU/NPU resources. Traditional Round-Robin load balancers fail to perceive these characteristics, leading to a mix of idle resources and queued requests that drives up costs.
The Latency vs. Throughput Trade-off: Inference consists of two distinct phases: Prefill (compute-intensive) and Decode (memory-bound). Scheduling both phases together on the same instances limits how far either can be optimized. While PD Disaggregation is the industry-standard solution, efficient routing and scheduling for it remain difficult.
Complex Multi-Model Management: Enterprises often serve multiple models, versions, and LoRA adapters simultaneously. Implementing fair scheduling, priority management, and dynamic routing is difficult, leading some to resort to rigid 1:1 mappings between AI Gateways and models.
Lack of Native K8s Integration: Many existing solutions are either fragmented from the Kubernetes ecosystem or too complex for standard platform operations.

Kthena: The Intelligent Brain for Cloud-Native Inference

Kthena was built to conquer these challenges. Rather than replacing existing inference engines (like vLLM or SGLang), Kthena acts as an intelligent orchestration layer atop them, deeply integrated into Kubernetes.

Kthena consists of two core components:

Kthena Router: A high-performance, multi-model router that acts as the entry point for all inference requests. It intelligently distributes traffic to backend ModelServers based on ModelRoute rules.
Kthena Controller Manager: The control plane responsible for workload orchestration and lifecycle management. It reconciles Custom Resource Definitions (CRDs) like ModelBooster, ModelServing, and AutoScalingPolicy to convert declarative intent into runtime resources.
- It orchestrates ServingGroups and roles (Prefill/Decode).
- It handles topology-aware affinity, Gang scheduling, rolling updates, and failure recovery.
- It drives elastic scaling based on defined policies.

Core Features and Advantages

1. Production-Grade Inference Orchestration (ModelServing)

Kthena introduces a Hierarchical Workload Architecture (ModelServing -> ServingGroup -> Role).

Unified API: A single API supports diverse patterns, from standalone deployments to complex PD Disaggregation and Expert Parallelism (EP).
Simplified Management: For example, a massive PD deployment is managed as a single ModelServing resource containing multiple ServingGroups.
Native PD Disaggregation: Kthena optimizes hardware usage by routing compute-intensive Prefill tasks to high-compute nodes and memory-bound Decode tasks to High Bandwidth Memory (HBM) nodes. It supports independent scaling to dynamically adjust the Prefill/Decode ratio.
Topology Awareness & Gang Scheduling: Gang scheduling guarantees that pods in a ServingGroup are scheduled as an atomic unit, preventing deadlocks. Topology awareness minimizes data transmission latency by placing related pods closer together in the network fabric.
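To make the all-or-nothing idea behind Gang scheduling concrete, here is a minimal Python sketch. It is not Kthena's or Volcano's scheduler: the `Node` type, the first-fit placement, and the GPU-only resource model are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_gpus: int


def gang_schedule(pod_gpu_requests, nodes):
    """Place every pod of a group, or none of them (all-or-nothing).

    Returns a {pod_index: node_name} mapping, or None when the whole group
    cannot be placed. Placement is simple first-fit on a copy of the free
    capacity, so a failed attempt leaves no partial reservation behind.
    """
    free = {n.name: n.free_gpus for n in nodes}
    placement = {}
    for i, need in enumerate(pod_gpu_requests):
        target = next((name for name, gpus in free.items() if gpus >= need), None)
        if target is None:
            return None  # one pod does not fit, so the whole group is rejected
        free[target] -= need
        placement[i] = target
    return placement


nodes = [Node("node-a", 4), Node("node-b", 2)]
print(gang_schedule([2, 2, 2], nodes))  # {0: 'node-a', 1: 'node-a', 2: 'node-b'}
print(gang_schedule([4, 4], nodes))     # None
```

Without the all-or-nothing rule, the second group would hold 4 GPUs on node-a while waiting indefinitely for capacity that never arrives, which is the deadlock pattern Gang scheduling avoids.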

2. Out-of-the-Box Deployment (ModelBooster)

Templates: Provides built-in templates for mainstream models (including PD separation), automatically generating necessary routing and lifecycle resources.
Flexibility: Covers general scenarios while allowing granular control via ModelServing for complex needs.

3. Intelligent, Model-Aware Routing

Multi-Model Routing: OpenAI API compatible. Routes traffic based on headers or body content.
Pluggable Algorithms: Includes Least Request, Least Latency, KV Cache Awareness, Prefix Cache Awareness, LoRA Affinity, and Fairness Scheduling.
LoRA Hot-Swapping: Detects loaded LoRA adapters for non-disruptive hot-swapping and routing.
Traffic Governance: Supports canary releases, token-level rate limiting, and failover.
All-in-One Architecture: Eliminates the need for a separate Envoy Gateway by natively handling routing logic.
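As a rough illustration of how a cache-aware plugin and a load-aware plugin might be combined, the sketch below scores each backend by its prefix cache hit ratio minus its relative load. This is not the Kthena Router's algorithm: the `Backend` type, the block size, the hashing scheme, and the weights are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    in_flight: int = 0                                  # requests being served
    cached_prefixes: set = field(default_factory=set)   # hashes of cached prompt prefixes


def block_hashes(prompt_tokens, block_size=4):
    """Hash each block-aligned prefix of the prompt (token IDs are ints)."""
    return [hash(tuple(prompt_tokens[:end]))
            for end in range(block_size, len(prompt_tokens) + 1, block_size)]


def cached_blocks(backend, hashes):
    """Count the leading prompt blocks this backend already holds."""
    count = 0
    for h in hashes:
        if h not in backend.cached_prefixes:
            break
        count += 1
    return count


def pick_backend(backends, prompt_tokens, cache_weight=1.0, load_weight=1.0):
    """Pick the backend with the best (cache hit ratio - relative load) score."""
    hashes = block_hashes(prompt_tokens)
    max_load = max(b.in_flight for b in backends) or 1

    def score(b):
        hit_ratio = cached_blocks(b, hashes) / len(hashes) if hashes else 0.0
        return cache_weight * hit_ratio - load_weight * (b.in_flight / max_load)

    return max(backends, key=score)


system_prompt = list(range(8))  # shared 8-token system prompt
a = Backend("pod-a", in_flight=3, cached_prefixes=set(block_hashes(system_prompt)))
b = Backend("pod-b", in_flight=2)
request = system_prompt + [100, 101, 102, 103]
print(pick_backend([a, b], request).name)               # pod-a (cache hit outweighs load)
print(pick_backend([a, b], [200, 201, 202, 203]).name)  # pod-b (no cache hit anywhere)
```

The busier pod still wins the first request because reusing its cached prefix avoids recomputing the system prompt. Pure Least Request would have sent it to pod-b.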

4. Cost-Driven Autoscaling

Homogeneous Scaling: Scales precisely based on business metrics (CPU/GPU/Memory/Custom).
Heterogeneous Optimization: Optimizes resource allocation across different accelerators based on a "Cost-Performance" ratio.

5. Broad Hardware & Engine Support

Inference Engines: Supports vLLM, SGLang, Triton/TGI, and more via a unified API abstraction.
Heterogeneous Compute: Enables co-location of GPU and NPU resources to balance cost and Service Level Objectives (SLOs).

6. Built-in Flow Control & Fairness

Fairness Scheduling: Prioritizes traffic based on usage history to prevent "starvation" of low-priority users.
Flow Control: Granular limits based on user, model, and token length.
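Token-based limits per user and model can be pictured as a token bucket in which each request costs its LLM token count. The sketch below is an assumption about the general technique, not Kthena's implementation; the class names, rates, and the explicit `now` argument (used to keep the example deterministic) are illustrative.

```python
class TokenBucket:
    """Token bucket where each request costs its LLM token count."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.level = capacity
        self.updated = now

    def allow(self, cost, now):
        # Refill in proportion to elapsed time, capped at capacity.
        self.level = min(self.capacity, self.level + (now - self.updated) * self.rate)
        self.updated = now
        if cost <= self.level:
            self.level -= cost
            return True
        return False


class FlowController:
    """Keeps one bucket per (user, model) pair."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.buckets = {}

    def allow(self, user, model, tokens, now):
        key = (user, model)
        if key not in self.buckets:
            self.buckets[key] = TokenBucket(self.rate, self.capacity, now)
        return self.buckets[key].allow(tokens, now)


fc = FlowController(rate_per_sec=100, capacity=1000)
print(fc.allow("alice", "model-x", 800, now=0.0))  # True  (200 tokens left)
print(fc.allow("alice", "model-x", 300, now=0.5))  # False (only 250 available)
print(fc.allow("alice", "model-x", 300, now=1.0))  # True  (refilled to 300)
print(fc.allow("bob", "model-x", 900, now=1.0))    # True  (separate bucket)
```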

Performance Benchmarks

In scenarios with long system prompts (e.g., 4096 tokens), Kthena's "KV Cache Awareness + Least Request" strategy delivers significant gains compared to a random baseline:

Throughput: Increased to ~2.73x the baseline
TTFT (Time To First Token): Reduced by ~73.5%
End-to-End Latency: Reduced by >60%

Plugin Configuration	Throughput (req/s)	E2E Latency (s)	TTFT (s)
Least Request + KVCacheAware	32.22	9.22	0.57
Least Request + Prefix Cache	23.87	12.47	0.83
Random	11.81	25.23	2.15

Note: While gaps narrow with short prompts, KV Cache awareness offers decisive advantages for multi-turn conversations and template-heavy workloads.
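The headline figures can be recomputed from the table rows. The short Python check below uses only the numbers published above; the helper names are illustrative.

```python
# Rows copied from the benchmark table:
# (throughput in req/s, first latency column in s, second latency column in s)
kv_aware = (32.22, 9.22, 0.57)   # Least Request + KVCacheAware
random_ = (11.81, 25.23, 2.15)   # Random baseline


def speedup(new, old):
    """How many times larger `new` is than `old`."""
    return new / old


def reduction_pct(new, old):
    """Percentage by which `new` is lower than `old`."""
    return (old - new) / old * 100


print(f"throughput: {speedup(kv_aware[0], random_[0]):.2f}x")           # 2.73x
print(f"9.22 s vs 25.23 s: -{reduction_pct(kv_aware[1], random_[1]):.1f}%")  # -63.5%
print(f"0.57 s vs 2.15 s:  -{reduction_pct(kv_aware[2], random_[2]):.1f}%")  # -73.5%
```

The ~73.5% reduction therefore corresponds to the 0.57 s vs 2.15 s column, and the >60% reduction to the 9.22 s vs 25.23 s column.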

Community & Industry Support

From the outset, Kthena has attracted widespread attention and support from industry leaders.

"Open source is the lifeblood of technical innovation and the primary driver of industry standardization. As the initiator of Volcano, Huawei Cloud is proud to launch Kthena alongside our community partners.

This release marks not only a significant milestone in Volcano's technical evolution but also underscores Huawei Cloud's enduring commitment to Cloud Native AI. By deeply integrating with infrastructure like Huawei Cloud CCE and CCI, Kthena unlocks the full potential of diverse computing power—including Ascend—delivering superior cost-efficiency to our customers.

Through Kthena, we look forward to collaborating with global developers to build an open, thriving ecosystem that lays a robust foundation for the intelligent transformation of industries worldwide."

Xiaobo Qi, Director of General Computing Services, Huawei Cloud

"Kthena further solidifies Volcano's leadership in intelligent workload scheduling. By leveraging Volcano's unified scheduling and resource pooling capabilities, our platform addresses diverse compute requirements—spanning general-purpose computing, AI training, and inference—within a single, unified framework.

This enables dynamic resource allocation across different scenarios, effectively eliminating resource silos. Looking ahead, we are excited to combine Kthena with Volcano’s elastic scaling and Volcano Global’s cross-cluster scheduling to drive resource utilization to new heights."

Lei Yang, PaaS R&D Director, China Telecom AI

"Since its inception, Volcano has evolved in lockstep with the community to address diverse AI scenarios, establishing a comprehensive ecosystem for AI batch processing.

The launch of Kthena marks a major milestone, extending Volcano's capabilities into the critical realm of Large Model inference. It crystallizes years of Volcano’s best practices in scheduling, elasticity, and multi-architecture support into a powerful engine for unified orchestration and intelligent routing.

By leveraging the existing Kubernetes and Volcano ecosystems, teams can achieve smarter scheduling decisions and higher compute efficiency at a lower cost. For DaoCloud, Kthena not only solves tangible inference challenges but also embodies the future of Cloud Native AI—an open, intelligent ecosystem worthy of our long-term investment and deep engagement."

Paco Xu, Open Source Team Lead at DaoCloud, Member of Kubernetes Steering Committee

"Deploying and managing self-hosted LLM inference services at production scale is a complex systems engineering challenge. It encompasses the entire lifecycle—deployment, operations, elasticity, and recovery—alongside critical requirements like GPU stability, scheduling efficiency, and AI observability. Kthena is engineered specifically to address these complexities.

During Kthena’s planning phase, the Xiaohongshu Cloud Native team engaged deeply with contributors to co-design various intelligent traffic scheduling strategies. Moving forward, we will continue our collaboration on the AI Gateway front. By leveraging Xiaohongshu’s production insights, we aim to provide the community with production-ready capabilities, including granular traffic scheduling, model API management, and MCP protocol support."

Kong Gu (Huachang Chen), Cloud Native Business Gateway Lead, Xiaohongshu

"After an in-depth evaluation of Kthena, China Unicom Cloud is impressed by its forward-looking design. We are particularly excited about its joint scheduling capabilities with Volcano.

Features like topology awareness and Gang Scheduling directly address the critical efficiency and reliability challenges inherent in large-scale distributed inference, offering a promising solution to complex scheduling bottlenecks.

We believe Kthena’s superior low latency, high throughput, and intelligent routing will provide the open-source community with a truly production-ready solution, empowering developers to build and manage cloud-native AI applications with greater efficiency."

Zhaoxu Lu, Team Lead, Intelligent Computing Center, China Unicom Cloud

"Openness and collaboration fuel innovation. Within the CNCF ecosystem, we are dedicated to driving infrastructure towards an 'AI Native' future.

By launching the Kthena sub-project, the Volcano community applies its proven expertise in batch computing—like topology awareness and Gang scheduling—to online LLM inference. Kthena introduces essential cloud-native scheduling primitives, enabling complex LLM workloads to run efficiently as first-class citizens in Kubernetes.

We invite developers worldwide to join us in refining this critical infrastructure and accelerating the AI Native era."

Kevin Wang, Volcano Maintainer, CNCF TOC Vice Chair

Start Exploring Kthena Today

This is just the beginning. We plan to support more efficient scheduling algorithms and broader best practices for large model deployment.

GitHub Repository: https://github.com/volcano-sh/kthena
Official Website: https://kthena.volcano.sh/
Community: Join our Slack

Join us to unlock the full potential of Cloud Native LLMs!

Volcano v1.13 Released: Comprehensive Enhancement of Scheduling Capabilities for LLM Training and Inference

Mon, 29 Sep 2025 00:00:00 GMT

On September 29, 2025 (Beijing Time), Volcano v1.13 (https://github.com/volcano-sh/volcano/releases/tag/v1.13.0) was officially released. This update brings functional enhancements across multiple areas, providing users with a more comprehensive cloud-native batch computing solution.

Release Highlights

The v1.13.0 release includes the following major updates:

AI Training and Inference Enhancements

Resource Management and Scheduling Enhancements

Colocation Enhancements

Support LeaderWorkerSet for Large Model Inference Scenarios

LeaderWorkerSet (LWS) is an API for deploying a group of Pods on Kubernetes. It primarily targets multi-host inference for AI/ML workloads, especially scenarios that require sharding large language models (LLMs) and running them across multiple devices on multiple nodes.

Since its open-source release, Volcano has actively integrated with upstream and downstream ecosystems, building a comprehensive community ecosystem for batch computing workloads such as AI and big data. LWS v0.7 natively integrates Volcano's AI scheduling capabilities. When used with the new version of Volcano, LWS automatically creates PodGroups, which are then scheduled and managed by Volcano, enabling advanced capabilities like Gang scheduling for large model inference scenarios.

Looking ahead, Volcano will continue to expand its ecosystem integration capabilities, providing robust scheduling and resource management support for more projects dedicated to enabling distributed inference on Kubernetes.

Usage documentation: LeaderWorkerSet With Gang.

Sincere thanks to community developer: @JesseStutler

Introduce Cron VolcanoJob

This release introduces support for Cron Volcano Jobs. Users can now periodically create and run Volcano Jobs based on a predefined schedule, similar to native Kubernetes CronJobs, to achieve periodic execution of batch computing tasks like AI and big data. Detailed features are as follows:

Scheduled Execution: Define the execution cycle of jobs using standard Cron expressions (spec.schedule).
Timezone Support: Set the timezone in spec.timeZone to ensure jobs execute at the expected local time.
Concurrency Policy: Control concurrent behavior via spec.concurrencyPolicy:
- AllowConcurrent: Allows concurrent execution of multiple jobs (default).
- ForbidConcurrent: Skips the current scheduled execution if the previous job has not completed.
- ReplaceConcurrent: Terminates the previous job if it is still running and starts a new one.
History Management: Configure the number of successful (successfulJobsHistoryLimit) and failed (failedJobsHistoryLimit) job history records to retain; old jobs are automatically cleaned up.
Missed Schedule Handling: The startingDeadlineSeconds field tolerates scheduling delays within a set window; a run that cannot start within that window is treated as a missed execution.
Status Tracking: The CronJob status (status) tracks currently active jobs, the last scheduled time, and the last successful completion time for easier monitoring and management.
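The three concurrency policies listed above reduce to a small decision made each time the schedule fires. The Python sketch below restates that behavior; the function and action names are hypothetical and this is not the Volcano controller's code.

```python
POLICIES = ("AllowConcurrent", "ForbidConcurrent", "ReplaceConcurrent")


def on_schedule_tick(policy, previous_running):
    """Return the actions to take when a scheduled run is due.

    policy: one of the spec.concurrencyPolicy values described above.
    previous_running: True if the job from the previous run is still active.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown concurrency policy: {policy}")
    if not previous_running or policy == "AllowConcurrent":
        return ["create_new_job"]
    if policy == "ForbidConcurrent":
        return []  # skip this run; the previous job keeps going
    return ["terminate_previous_job", "create_new_job"]  # ReplaceConcurrent


for policy in POLICIES:
    print(policy, on_schedule_tick(policy, previous_running=True))
```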

Sincere thanks to community developers: @GoingCharlie, @hwdef, @Monokaix

Usage example: Cron Volcano Job Example.

Support Label-based HyperNode Auto-Discovery

Volcano officially launched network topology-aware scheduling capability in v1.12 and pioneered the UFM auto-discovery mechanism based on InfiniBand (IB) networks. However, for hardware clusters that do not support IB networks or use other network architectures (such as Ethernet), manually maintaining the network topology remains cumbersome.

To address this issue, the new version introduces a Label-based HyperNode auto-discovery mechanism. This feature provides users with a universal and flexible way to describe network topology, transforming complex topology management tasks into simple node label management.

This mechanism allows users to define the correspondence between topology levels and node labels in the volcano-controller-configmap. The Volcano controller periodically scans all nodes in the cluster and automatically performs the following tasks based on their labels:

Automatic Topology Construction: Automatically builds multi-layer HyperNode topology structures from top to bottom (e.g., rack -> switch -> node) based on a set of labels on the nodes.
Dynamic Maintenance: When node labels change, or nodes are added or removed, the controller automatically updates the members and structure of the HyperNodes, ensuring the topology information remains consistent with the cluster state.
Support for Multiple Topology Types: Allows users to define multiple independent network topologies simultaneously to adapt to different hardware clusters (e.g., GPU clusters, NPU clusters) or different network partitions.

Configuration example:

# volcano-controller-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-controller-configmap
  namespace: volcano-system
data:
  volcano-controller.conf: |
    networkTopologyDiscovery:
      - source: label
        enabled: true
        interval: 10m # Discovery interval
        config:
          networkTopologyTypes:
            # Define a topology type named topology-A
            topology-A:
              # Define topology levels, ordered from top to bottom
              - nodeLabel: "volcano.sh/hypercluster" # Top-level HyperNode
              - nodeLabel: "volcano.sh/hypernode"   # Middle-level HyperNode
              - nodeLabel: "kubernetes.io/hostname" # Bottom-level physical node

This feature is enabled by adding the label source to the Volcano controller's ConfigMap. The above configuration defines a three-layer topology structure named topology-A:

Top Level (Tier 2): Defined by the volcano.sh/hypercluster label.
Middle Level (Tier 1): Defined by the volcano.sh/hypernode label.
Bottom Level: Physical nodes, identified by the Kubernetes built-in kubernetes.io/hostname label.

When a node is labeled as follows, it will be automatically recognized and classified into the topology path cluster-s4 -> node-group-s0:

# Labels for node node-0
labels:
  kubernetes.io/hostname: node-0
  volcano.sh/hypernode: node-group-s0
  volcano.sh/hypercluster: cluster-s4

The label-based network topology auto-discovery feature offers excellent generality and flexibility. It does not depend on specific network hardware (like IB), making it suitable for various heterogeneous clusters, and allows users to flexibly define hierarchical structures of any depth through labels. It turns complex topology maintenance into simple node label management, significantly reducing operational costs and the risk of errors. Furthermore, this mechanism dynamically adapts to changes in cluster nodes and labels, keeping topology information accurate in real time without manual intervention.
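The way an ordered list of label keys turns into a multi-layer topology can be sketched in a few lines of Python. This is an illustration of the idea only, not the Volcano controller's logic; in particular, skipping nodes that lack one of the level labels is an assumption, and nodes other than node-0 are made up for the example.

```python
def build_topology(node_labels, level_labels):
    """Group nodes into a nested tree following the label keys, top to bottom.

    node_labels: {node_name: {label_key: label_value}}
    level_labels: label keys ordered from the top level to the bottom level
    """
    tree = {}
    for _, labels in sorted(node_labels.items()):
        if any(key not in labels for key in level_labels):
            continue  # assumption: nodes missing a level label are left out
        cursor = tree
        for key in level_labels:
            cursor = cursor.setdefault(labels[key], {})
    return tree


levels = ["volcano.sh/hypercluster", "volcano.sh/hypernode", "kubernetes.io/hostname"]
cluster = {
    "node-0": {"kubernetes.io/hostname": "node-0",
               "volcano.sh/hypernode": "node-group-s0",
               "volcano.sh/hypercluster": "cluster-s4"},
    "node-1": {"kubernetes.io/hostname": "node-1",
               "volcano.sh/hypernode": "node-group-s0",
               "volcano.sh/hypercluster": "cluster-s4"},
    "node-2": {"kubernetes.io/hostname": "node-2",
               "volcano.sh/hypernode": "node-group-s1",
               "volcano.sh/hypercluster": "cluster-s4"},
    "node-3": {"kubernetes.io/hostname": "node-3"},  # unlabeled, ignored here
}
topology = build_topology(cluster, levels)
print(topology)
```

Re-running the function after a label change yields the updated tree, which mirrors the periodic rescan described above.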

Sincere thanks to community developer: @zhaoqi612

Usage documentation: HyperNode Auto Discovery.

Add Native Ray Framework Support

Ray is an open-source unified distributed computing framework whose core goal is to simplify parallel computing from single machines to large-scale clusters, especially suitable for scaling Python and AI applications. To manage and run Ray on Kubernetes, the community provides KubeRay—an operator specifically designed for Kubernetes. It acts as a bridge between Kubernetes and the Ray framework, greatly simplifying the deployment and management of Ray clusters and jobs.

Historically, running Ray workloads on Kubernetes primarily relied on the KubeRay Operator. KubeRay integrated Volcano in its v0.4.0 release (released in 2022) for scheduling and resource management of Ray Clusters, addressing issues like resource deadlocks in distributed training scenarios. With this new version of Volcano, users can now directly create and manage Ray clusters and submit computational tasks through native Volcano Jobs. This provides Ray users with an alternative usage scheme, allowing them to more directly utilize Volcano's capabilities such as Gang Scheduling, queue management and fair scheduling, and job lifecycle management for running Ray workloads.

Sincere thanks to community developer: @Wonki4

Design documentation: Ray Framework Plugin Design Doc.

Usage documentation: Ray Plugin User Guide.

Introduce HCCL Plugin Support

The new version adds an HCCL Rank plugin (hcclrank) to Volcano Jobs, used for automatically assigning HCCL Ranks to Pods in distributed tasks. This includes:

New implementation of the hcclrank plugin for Volcano Jobs, supporting automatic calculation and injection of HCCL Rank into Pod annotations based on task type (master/worker) and index.
The plugin supports custom master/worker task names, allowing users to specify the master/worker roles in distributed tasks.

This feature enhances Volcano's native support for HCCL communication scenarios, such as Huawei Ascend, facilitating automatic management and assignment of Ranks in AI training tasks.
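One plausible way to derive a rank from task type and index is to number master Pods first and continue with worker Pods. The sketch below shows that scheme purely as an assumption; the actual numbering used by the hcclrank plugin may differ, and the function name and defaults are hypothetical.

```python
def hccl_rank(task_name, index, master_replicas,
              master_name="master", worker_name="worker"):
    """Assumed rank scheme: masters take 0..M-1, workers continue from M.

    task_name: name of the task the Pod belongs to
    index: the Pod's index within its task
    master_replicas: number of Pods in the master task (M)
    """
    if task_name == master_name:
        return index
    if task_name == worker_name:
        return master_replicas + index
    return None  # Pods of other tasks get no rank in this sketch


print(hccl_rank("master", 0, master_replicas=1))  # 0
print(hccl_rank("worker", 0, master_replicas=1))  # 1
print(hccl_rank("worker", 2, master_replicas=1))  # 3
# Custom task names, as the plugin allows roles to be renamed:
print(hccl_rank("trainer", 1, 1, master_name="chief", worker_name="trainer"))  # 2
```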

Sincere thanks to community developer: @kingeasternsun

Enhance NodeGroup Functionality

In hierarchical queue structures, each sub-queue previously had to repeat the node group affinity (nodeGroupAffinity) already configured on its parent queue, which made configurations redundant and hard to maintain.

To solve this problem, the Nodegroup plugin adds support for inheriting affinity within hierarchical queues. Once enabled, the scheduler resolves the effective affinity for a queue according to the following rules:

Prioritize Self-Configuration: If the queue has defined spec.affinity, it uses this configuration directly.
Upward Inheritance: If the queue has not defined spec.affinity, it searches upward through its parents and inherits the affinity configuration defined by the nearest ancestor queue.
Override Capability: A child queue can override the inherited configuration by defining its own spec.affinity, ensuring flexibility.

This feature allows administrators to set unified node group affinity at a parent queue (e.g., department level), and all child queues (e.g., team level) will automatically inherit this setting, simplifying management.
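The resolution rules amount to walking up the queue hierarchy until a queue with its own affinity is found. A minimal Python sketch follows; the dictionary layout, queue names, and node group names are hypothetical and stand in for the real Queue objects.

```python
def effective_affinity(queue, queues):
    """Return the queue's own affinity, else the nearest ancestor's, else None.

    queues: {name: {"parent": parent_name_or_None, "affinity": value_or_None}}
    """
    seen = set()
    current = queue
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental parent cycles
        spec = queues[current]
        if spec.get("affinity") is not None:
            return spec["affinity"]
        current = spec.get("parent")
    return None


queues = {
    "root":   {"parent": None,     "affinity": None},
    "dept-a": {"parent": "root",   "affinity": ["nodegroup-a"]},
    "team-1": {"parent": "dept-a", "affinity": None},             # inherits
    "team-2": {"parent": "dept-a", "affinity": ["nodegroup-b"]},  # overrides
}
print(effective_affinity("team-1", queues))  # ['nodegroup-a']
print(effective_affinity("team-2", queues))  # ['nodegroup-b']
print(effective_affinity("root", queues))    # None
```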

For queues without NodeAffinity configuration, the "strict" parameter in the plugin controls scheduling behavior. When strict is set to true (the default value), tasks in these queues cannot be scheduled to any nodes. When strict is set to false, these tasks are allowed to be scheduled to regular nodes that do not have the volcano.sh/nodegroup-name label.

In the nodegroup plugin parameters of the scheduler configuration file, setting enableHierarchy: true enables hierarchical queue mode, and setting strict: false configures non-strict mode. Example configuration is as follows:

actions: "allocate, backfill, preempt, reclaim"
tiers:
- plugins:
  - name: nodegroup
    arguments:
      enableHierarchy: true # Enable hierarchical support
      strict: false # Set to non-strict mode, allowing tasks in the queue to be scheduled to nodes without the "volcano.sh/nodegroup-name" label

Sincere thanks to community developers: @JesseStutler, @wuyueandrew

NodeGroup design documentation: NodeGroup Design.

NodeGroup usage documentation: NodeGroup User Guide.

Introduce ResourceStrategyFit Plugin

In the native Kubernetes noderesources fit strategy, only a single aggregated (MostAllocated) or dispersed (LeastAllocated) strategy can be applied to all resources. This has limitations in complex heterogeneous computing environments (like AI/ML clusters). To meet differentiated scheduling requirements, Volcano introduces the enhanced ResourceStrategyFit plugin.

This plugin now integrates two core features: Independent scoring strategies by resource type and Scarce Resource Avoidance (SRA).

Independent Scoring Strategy by Resource Type

This feature allows users to specify MostAllocated (binpack) or LeastAllocated (spread) strategies for different resources (e.g., cpu, memory, nvidia.com/gpu) independently, and assign different weights to them. The scheduler then computes each node's score from the independent configuration of each resource.
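A per-resource strategy with weights can be pictured as a weighted average of per-resource scores. The formulas below follow the usual shape of most-allocated and least-allocated scoring (utilization versus free fraction, scaled to 0-100). They are an assumption for illustration, not the plugin's exact arithmetic, and the node figures are made up.

```python
def resource_score(strategy, requested, allocatable):
    """Score one resource on a 0-100 scale.

    requested: amount in use on the node after placing the pod
    allocatable: total amount the node offers
    """
    if allocatable <= 0:
        return 0.0
    used = min(requested, allocatable) / allocatable
    if strategy == "MostAllocated":   # binpack: fuller nodes score higher
        return used * 100
    if strategy == "LeastAllocated":  # spread: emptier nodes score higher
        return (1 - used) * 100
    raise ValueError(f"unknown strategy: {strategy}")


def node_score(config, requested, allocatable):
    """Weighted average over the configured resources that the node offers."""
    total = 0.0
    weight_sum = 0
    for resource, rule in config.items():
        if resource not in allocatable:
            continue
        total += rule["weight"] * resource_score(
            rule["type"], requested.get(resource, 0), allocatable[resource])
        weight_sum += rule["weight"]
    return total / weight_sum if weight_sum else 0.0


config = {
    "nvidia.com/gpu": {"type": "MostAllocated", "weight": 2},
    "cpu": {"type": "LeastAllocated", "weight": 1},
}
capacity = {"nvidia.com/gpu": 8, "cpu": 64}
busy_gpu_node = node_score(config, {"nvidia.com/gpu": 6, "cpu": 32}, capacity)
idle_gpu_node = node_score(config, {"nvidia.com/gpu": 2, "cpu": 32}, capacity)
print(round(busy_gpu_node, 2), round(idle_gpu_node, 2))  # 66.67 33.33
```

With GPUs set to binpack at twice the CPU weight, the node whose GPUs are already mostly in use wins, which keeps other GPU nodes free for large jobs.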

To simplify the management of resources within the same family (e.g., different model GPUs from the same vendor), this feature also supports suffix wildcard (*) matching for resource names.

Syntax Rules: Only suffix wildcards are supported, e.g., nvidia.com/gpu/*. Patterns like * or vendor.*/gpu are considered invalid.
Matching Priority: Uses the "longest prefix match" principle. Exact matches have the highest priority; when no exact match exists, the wildcard pattern with the longest prefix is selected.
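The syntax and priority rules can be expressed compactly. The Python sketch below is an illustration of those two rules rather than the plugin's code; the `nvidia.com/*` pattern and the resource names used in the lookups are added here only to exercise the longest-prefix rule.

```python
def is_valid_pattern(pattern):
    """Accept exact names, or a single trailing '*' with a non-empty prefix."""
    if "*" not in pattern:
        return bool(pattern)
    return pattern.count("*") == 1 and pattern.endswith("*") and len(pattern) > 1


def match_rule(resource, patterns):
    """Return the pattern that applies to `resource`, or None.

    Exact matches win; otherwise the wildcard with the longest prefix wins.
    Invalid patterns are ignored.
    """
    valid = [p for p in patterns if is_valid_pattern(p)]
    if resource in valid:
        return resource
    wildcards = [p for p in valid
                 if p.endswith("*") and resource.startswith(p[:-1])]
    return max(wildcards, key=len, default=None)


rules = ["nvidia.com/gpu-v100", "nvidia.com/gpu/*", "nvidia.com/*", "cpu",
         "*", "vendor.*/gpu"]  # the last two are invalid and get ignored
print(match_rule("nvidia.com/gpu-v100", rules))  # nvidia.com/gpu-v100 (exact)
print(match_rule("nvidia.com/gpu/a100", rules))  # nvidia.com/gpu/* (longest prefix)
print(match_rule("nvidia.com/mig-1g", rules))    # nvidia.com/*
print(match_rule("memory", rules))               # None
```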

Configuration Example: The following configuration sets a high-priority binpack strategy for a specific V100 GPU model, a generic binpack strategy for all other NVIDIA GPUs, and a spread strategy for CPU resources. Pod-level resource scoring strategy configuration is also supported.

actions: "enqueue, allocate, backfill, reclaim, preempt"
tiers:
- plugins:
  - name: resource-strategy-fit
    arguments:
      resourceStrategyFitWeight: 10
      resources:
        # Exact match, highest priority
        nvidia.com/gpu-v100:
          type: MostAllocated
          weight: 3
        # Wildcard match, applies to all other NVIDIA GPUs
        nvidia.com/gpu/*:
          type: MostAllocated
          weight: 2
        # Exact match for CPU resource
        cpu:
          type: LeastAllocated
          weight: 1

Scarce Resource Avoidance (SRA)

SRA is a "soft" strategy designed to improve the overall utilization of expensive or scarce resources (like GPUs). It influences node scoring to guide ordinary tasks that do not require specific scarce resources (e.g., CPU-only tasks) to avoid nodes containing those resources where possible. This helps "reserve" scarce resource nodes for tasks that truly need them, thereby reducing resource contention and task waiting time.

Mechanism:

Users define a set of "scarce resources" (e.g., nvidia.com/gpu) in the configuration.
When scheduling a Pod that does not request any of the defined scarce resources, the SRA policy takes effect.
The scheduler reduces the score of nodes that possess these scarce resources. The more types of scarce resources a node has, the lower its score.
For Pods that do request scarce resources, the SRA policy does not negatively impact their scheduling decisions.
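The mechanism can be sketched as a penalty that grows with the weighted share of scarce resource types present on a node. The exact scoring formula is not given above, so the linear penalty, the 0-100 scale, and the `example.com/npu` resource in this Python sketch are assumptions.

```python
def sra_score(pod_requests, node_allocatable, scarce_weights, max_score=100.0):
    """Lower the score of nodes holding scarce resources the pod does not need.

    pod_requests: {resource: amount} requested by the pod
    node_allocatable: {resource: amount} offered by the node
    scarce_weights: {resource: weight} for resources declared scarce
    """
    if any(pod_requests.get(r, 0) > 0 for r in scarce_weights):
        return max_score  # the pod needs a scarce resource: no penalty
    total = sum(scarce_weights.values())
    if total == 0:
        return max_score
    present = sum(w for r, w in scarce_weights.items()
                  if node_allocatable.get(r, 0) > 0)
    return max_score * (1 - present / total)


scarce = {"nvidia.com/gpu": 1, "example.com/npu": 1}
cpu_pod = {"cpu": 2}
gpu_pod = {"cpu": 2, "nvidia.com/gpu": 1}
cpu_node = {"cpu": 64}
gpu_node = {"cpu": 64, "nvidia.com/gpu": 8}
mixed_node = {"cpu": 64, "nvidia.com/gpu": 8, "example.com/npu": 8}

print(sra_score(cpu_pod, cpu_node, scarce))    # 100.0
print(sra_score(cpu_pod, gpu_node, scarce))    # 50.0
print(sra_score(cpu_pod, mixed_node, scarce))  # 0.0
print(sra_score(gpu_pod, gpu_node, scarce))    # 100.0
```

Because this is only one term in the node's total score, a CPU-only pod can still land on a GPU node when no other node fits, which is what makes the strategy "soft".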

Configuration Example: The following configuration defines nvidia.com/gpu as a scarce resource. When scheduling a CPU-only task, nodes with GPUs will have their scores reduced, making the task more likely to be scheduled onto nodes without GPUs.

actions: "enqueue, allocate, backfill, reclaim, preempt"
tiers:
- plugins:
  - name: resource-strategy-fit
    arguments:
      # ... binpack/spread strategy configuration for resourceStrategyFit ...
      resources:
        nvidia.com/gpu:
          type: MostAllocated
          weight: 2
        cpu:
          type: LeastAllocated
          weight: 1
      # SRA policy configuration
      sra:
        enable: true
        resources: "nvidia.com/gpu" # Define scarce resource list, comma-separated
        weight: 10 # Weight of the SRA policy in the total score
        resourceWeight:
          nvidia.com/gpu: 1 # Define nvidia.com/gpu as a scarce resource and its weight

By combining the binpack/spread strategies of ResourceStrategyFit with the avoidance strategy of SRA, users can achieve more refined and efficient scheduling of heterogeneous resources.

Sincere thanks to community developers: @LY-today, @XbaoWu, @ditingdapeng, @kingeasternsun

Design documentation: ResourceStrategyFit Design

Usage documentation: ResourceStrategyFit User Guide

Decouple Colocation from OS

Volcano's colocation capability consists of two parts: application-level and kernel-level. Application-level colocation provides unified scheduling for online and offline workloads, dynamic resource overcommitment, node pressure eviction, and more. Kernel-level colocation provides QoS guarantees for resources like CPU, memory, and network at the kernel level, which typically requires support from a specific OS (such as openEuler). In the new version, Volcano decouples the colocation capability from the OS. Users whose OS does not support kernel-level colocation can still use Volcano's application-level colocation capabilities to achieve unified scheduling of online and offline tasks, dynamic resource overcommitment, and high-priority task guarantees.

Specific usage: When installing the Volcano agent, specify the --supported-features parameter:

helm install volcano . --create-namespace -n volcano-system --set custom.colocation_enable=true --set "custom.agent_supported_features=OverSubscription\,Eviction\,Resources"

Sincere thanks to community developers: @ShuhanYan, @Monokaix

Colocation documentation: Cloud Native Colocation

Support Custom OverSubscription Resource Names

The Volcano colocation Agent adds parameters --extend-resource-cpu-name and --extend-resource-memory-name, allowing users to customize the names of overcommitted resources. This supports custom naming for CPU and memory resources (defaults are kubernetes.io/batch-cpu and kubernetes.io/batch-memory respectively), enhancing flexibility in setting overcommitted resource names.

Specific usage: When installing Volcano, specify the --extend-resource-cpu-name and --extend-resource-memory-name parameters:

helm install volcano . --create-namespace -n volcano-system --set custom.colocation_enable=true --set custom.agent_extend_resource_cpu_name=example.com/cpu --set custom.agent_extend_resource_memory_name=example.com/memory

Sincere thanks to community developers: @ShuhanYan, @Monokaix

Colocation documentation: Cloud Native Colocation

Add Kubernetes 1.33 Support

Volcano releases keep pace with Kubernetes community releases. v1.13 supports the latest Kubernetes v1.33 release, with functionality and reliability verified through comprehensive unit test (UT) and end-to-end (E2E) test cases.

To take part in adapting Volcano to new Kubernetes versions, refer to: adapt-k8s-todo.

Sincere thanks to community developer: @mahdikhashan

Conclusion: Volcano v1.13.0 Continues to Lead Cloud-Native Batch Computing

Volcano v1.13.0 is not just a technological advancement but a continuation of innovation in cloud-native batch computing. Whether for AI large model training and inference, Big Data scheduling, or resource optimization, Volcano v1.13.0 delivers powerful features and flexible solutions. We believe Volcano v1.13.0 will help users achieve greater heights in cloud-native batch computing, ushering in a new era of AI and Big Data scheduling!

Experience Volcano v1.13.0 now and step into a new era of efficient computing!

v1.13.0 release: https://github.com/volcano-sh/volcano/releases/tag/v1.13.0

Acknowledgments

Volcano v1.13.0 includes contributions from 36 community members. Sincere thanks to all contributors:

@ElectricFish7	@philandstuff	@junzebao
@ShuhanYan	@GautamBytes	@coldzerofear
@houyuting	@lhlxc	@cyf-2002
@neo502721	@suyiiyii	@dafu-wu
@ditingdapeng	@GoingCharlie	@Wonki4
@zhaoqi612	@huntersman	@JesseStutler
@LY-today	@XbaoWu	@kingeasternsun
@Monokaix	@wuyueandrew	@mahdikhashan
@bibibox	@archlitchi	@guoqinwill
@ouyangshengjia	@Poor12	@dongjiang1989
@zhifei92	@halcyon-r	@Xu-Wentao
@hajnalmt	@kevin-wangzefeng	@linuxfhy

iFlytek Enhances AI Infrastructure with Volcano, Wins CNCF End-User Case Study Award

Fri, 13 Jun 2025 00:00:00 GMT

[HONG KONG, CHINA — June 10, 2025] — The Cloud Native Computing Foundation (CNCF) today announced that iFlytek has won the CNCF End-User Case Study Competition. The CNCF, which is committed to building a sustainable ecosystem for cloud native software, recognized iFlytek for its innovative use of Volcano. The company shared its success in large-scale AI model training at the KubeCon + CloudNativeCon China conference, held in Hong Kong from June 10-11.

iFlytek's Challenges

As a leading Chinese technology company specializing in voice and language AI, iFlytek faced significant scaling challenges amid its rapid business growth. Inefficient scheduling led to underutilized GPU resources, while complex workflow management and intense resource contention among teams slowed down research and development, straining the company's infrastructure.

By adopting Volcano, iFlytek implemented elastic scheduling, DAG-based workflows, and multi-tenancy isolation, which simplified operations and significantly improved resource utilization.

"Before using Volcano, coordinating training across our large-scale GPU clusters was a constant exercise in firefighting, with frequent resource bottlenecks, task failures, and complex pipeline debugging," said DongJiang, Senior Platform Architect at iFlytek. "Volcano gives us the flexible control we need to scale our AI training efficiently and reliably. We are honored to be recognized by the CNCF and look forward to sharing our experiences with the community at KubeCon + CloudNativeCon China."

About Volcano

Volcano is a cloud-native batch computing system built on Kubernetes. It is designed for high-performance workloads, including AI/machine learning, big data processing, and scientific computing. Volcano offers advanced scheduling capabilities such as job orchestration, fair-share resource allocation, and queue management to efficiently handle large-scale distributed tasks. After joining the CNCF as a Sandbox project in 2020 and graduating to the Incubating stage in 2022, Volcano has become a critical tool for compute-intensive workloads.

Significant Results iFlytek Achieved with Volcano

As the demand for AI grew, iFlytek turned to Volcano to manage its increasingly large and complex training infrastructure. The engineering team required a more efficient way to allocate resources, handle complex multi-stage training workflows, minimize job interruptions, and ensure fair resource access across teams. With Volcano, they achieved:

A 40% improvement in GPU utilization, leading to significantly lower infrastructure costs and reduced resource idling.
A 70% faster recovery from task failures, ensuring continuous training operations.
A 50% reduction in resource interference, ensuring service stability and resource usage flexibility.

Chris Aniszczyk, CTO of the CNCF, commented, "iFlytek's story is a great example of how open source technology can solve complex and critical challenges at scale. By leveraging Volcano to improve GPU efficiency and streamline their training workflows, they have reduced costs, accelerated development, and built a more reliable AI infrastructure on Kubernetes—a critical advantage for any organization at the forefront of AI."

As AI workloads become more complex and resource-intensive, iFlytek's success demonstrates that cloud-native tools like Volcano are essential for teams looking to simplify operations and enhance scalability. Their presentation at KubeCon + CloudNativeCon China [1] offers practical insights into managing distributed training more effectively in a Kubernetes environment.

References

[1] Presentation: https://kccncchn2025.sched.com/event/23EWS?iframe=no

Volcano v1.12.0 Available Now

Thu, 12 Jun 2025 00:00:00 GMT

Volcano v1.12 released: Advancing Cloud-Native AI and Batch Computing

As large AI model technology rapidly evolves, enterprises are placing higher demands on computing resource efficiency and application performance. For complex application scenarios such as AI, big data, and high-performance computing (HPC), the Volcano community's continuous innovation focuses on three areas: using accelerators like GPUs efficiently, ensuring high system availability, and managing resources with fine granularity.

Each version of Volcano is an active response to these challenges. More than 1,000 developers from over 30 countries have made nearly 40,000 contributions, and Volcano has been adopted in production environments by more than 60 enterprises worldwide. Its scheduling performance and resource management capabilities have been widely proven in practice.

Today, the Volcano community officially releases v1.12. This new version focuses on the core requirements of modern AI and big data scenarios, and introduces a series of key features and usability improvements:

Highlights of v1.12

Network Topology-Aware Scheduling (Alpha): Optimizes the deployment of large-scale AI training and inference tasks by using network topology awareness to reduce cross-switch communication and improve runtime efficiency.
Enhanced GPU Virtualization: Adds support for NVIDIA GPU dynamic MIG partitioning alongside the existing vCUDA solution. This provides users with both software and hardware virtualization options for more flexible and efficient GPU resource sharing.
DRA Support: Enhances the flexibility and capabilities of heterogeneous resource management.
Queue Capacity Management in Volcano Global: Supports unified limits and management of resource quotas (capabilities) for tenant queues in a multi-cluster environment.
Comprehensive Security Enhancements: Implements multi-dimensional security hardening, from API access control to container runtime permissions, to improve system robustness.
Performance Optimization for Large-Scale Scenarios: Improves concurrent task processing efficiency by reducing unnecessary webhook calls.
Enhanced Gang Scheduling for Generic Workloads: Adds support for custom minimum member counts (minMember) for Gang scheduling of generic workloads like Deployments and StatefulSets via annotations, providing more fine-grained Gang Scheduling strategies.
Job Flow Enhancements: Improves the robustness and observability of the built-in workflow orchestration engine.
And many other stability and usability improvements.

We believe these updates in v1.12 will further enhance intelligent task scheduling, resource utilization, and overall system performance, helping users to better meet the challenges of the AI and big data era.

Core Feature Details

Network Topology-Aware Scheduling (Alpha Release)

Previously a preview feature in v1.11, Volcano's Network Topology-Aware Scheduling is now an Alpha release in v1.12. This feature is designed to optimize the deployment of AI tasks in large-scale training and inference scenarios (e.g., model-parallel training, leader-worker inference). By scheduling tasks within the same network topology performance domain, it reduces cross-switch communication, thereby significantly improving task efficiency. Volcano uses the HyperNode CRD to abstract and represent heterogeneous hardware network topologies and supports a hierarchical structure for easier management.

Version 1.12 integrates the following key features:

HyperNode Auto-Discovery: Volcano can now automatically discover the cluster's network topology. Users can configure the discovery type, and the system will automatically create and maintain hierarchical HyperNodes that reflect the cluster's actual network topology. It currently supports obtaining topology information from InfiniBand (IB) networks via the UFM (Unified Fabric Manager) interface to automatically update HyperNodes. Support for more network protocols like RoCE is planned for the future.
Prioritized HyperNode Selection: This version introduces a scoring strategy based on a combination of node-level and HyperNode-level scores to determine the final priority of a HyperNode.
- Node-level: It is recommended to configure the BinPack plugin to pack nodes within a HyperNode first, reducing resource fragmentation.
- HyperNode-level: Lower-level HyperNodes are prioritized for better performance, as they involve fewer cross-switch traversals. For HyperNodes at the same level, those containing more tasks receive a higher score to reduce HyperNode-level resource fragmentation.
Node Matching with Label Selectors: HyperNode leaf nodes are associated with physical nodes in the cluster and support the following three matching strategies:
- Exact Match: Directly matches node names.
- Regex Match: Matches node names using regular expressions.
- Label Match: Matches nodes using standard Label Selectors.
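
The three matching strategies above can be illustrated with a small Python sketch. This is not Volcano's implementation: the Node class, the function names, and the choice to require a full-name regex match are assumptions made for illustration only.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: dict = field(default_factory=dict)


def exact_match(node: Node, name: str) -> bool:
    """Exact Match: the selector names a single node."""
    return node.name == name


def regex_match(node: Node, pattern: str) -> bool:
    """Regex Match: the whole node name must match (assumed semantics)."""
    return re.fullmatch(pattern, node.name) is not None


def label_match(node: Node, match_labels: dict) -> bool:
    """Label Match: every requested key/value pair must be on the node."""
    return all(node.labels.get(k) == v for k, v in match_labels.items())


def select_members(nodes, *, exact=None, regex=None, labels=None):
    """Return the names of the nodes picked by exactly one strategy."""
    if sum(s is not None for s in (exact, regex, labels)) != 1:
        raise ValueError("specify exactly one matching strategy")
    if exact is not None:
        return [n.name for n in nodes if exact_match(n, exact)]
    if regex is not None:
        return [n.name for n in nodes if regex_match(n, regex)]
    return [n.name for n in nodes if label_match(n, labels)]


# Hypothetical cluster with an illustrative rack label.
nodes = [
    Node("node-0", {"topology-rack": "rack-1"}),
    Node("node-1", {"topology-rack": "rack-1"}),
    Node("node-2", {"topology-rack": "rack-2"}),
]
print(select_members(nodes, exact="node-0"))                      # ['node-0']
print(select_members(nodes, regex=r"node-[01]"))                  # ['node-0', 'node-1']
print(select_members(nodes, labels={"topology-rack": "rack-2"}))  # ['node-2']
```

Label matching tends to be the most maintainable of the three, since nodes added later join the HyperNode as soon as they carry the label.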

Dynamic MIG Partitioning for GPU Virtualization

Volcano's GPU virtualization feature allows users to request partial GPU resources based on memory and compute power. It works with a Device Plugin to achieve hardware isolation and improve GPU utilization. While traditional GPU virtualization limits GPU usage by intercepting CUDA APIs, the MIG (Multi-Instance GPU) technology in the NVIDIA Ampere architecture allows a single physical GPU to be partitioned into multiple independent instances. However, typical MIG solutions often use pre-configured, fixed-size instances, which can lead to resource waste and inflexibility.

Volcano v1.12 introduces dynamic MIG partitioning and scheduling capabilities. It selects the appropriate MIG instance size in real time based on the user's requested GPU amount and uses a Best-Fit algorithm to reduce resource waste. It also supports GPU scoring strategies like BinPack and Spread to minimize resource fragmentation and improve GPU utilization. Users can request resources using the unified APIs volcano.sh/vgpu-number, volcano.sh/vgpu-cores, and volcano.sh/vgpu-memory, without needing to be aware of the underlying implementation details.
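
As a rough illustration of the Best-Fit idea, the Python sketch below picks the smallest MIG profile whose memory covers a request. The profile table is illustrative (the real set depends on the GPU model), the function name is hypothetical, and Volcano's actual selection logic may weigh more than memory.

```python
from typing import Optional

# Illustrative MIG profiles as (name, memory in GiB); not read from any real device.
PROFILES = [("1g.5gb", 5), ("2g.10gb", 10), ("3g.20gb", 20), ("7g.40gb", 40)]


def best_fit(requested_gib: float, profiles=PROFILES) -> Optional[str]:
    """Return the smallest profile that satisfies the request, or None."""
    candidates = [(mem, name) for name, mem in profiles if mem >= requested_gib]
    if not candidates:
        return None  # no single MIG instance is large enough
    # min() on (memory, name) tuples picks the least memory first,
    # which minimizes the memory left unused inside the instance.
    return min(candidates)[1]


print(best_fit(8))   # 2g.10gb
print(best_fit(12))  # 3g.20gb
print(best_fit(64))  # None
```

With fixed, pre-configured instances, the 8 GiB request might land on a 20 GiB slice; choosing the instance size per request is what keeps the unused remainder small.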

Design Document: Dynamic MIG Design Document
Usage Guide: Dynamic MIG Usage Guide

Thanks to the following community developers for their contributions to this feature: @sailorvii, @archlitchi.

Support for Dynamic Resource Allocation (DRA)

Kubernetes DRA (Dynamic Resource Allocation) is a native feature that provides a more flexible and powerful way to manage heterogeneous hardware resources in a cluster, such as GPUs, FPGAs, and high-performance network cards. It addresses the limitations of the traditional Device Plugin model in some advanced scenarios. Volcano v1.12 adds support for DRA, allowing the cluster to dynamically allocate and manage external resources, which enhances Volcano's integration with the Kubernetes ecosystem and improves the flexibility of resource management.

Usage Guide: Enabling DRA in Volcano

Related PR:

https://github.com/volcano-sh/volcano/pull/3799

Thanks to community developer @JesseStutler for their contribution to this feature.

Queue Capacity Management in Volcano Global

Queues are a core concept in Volcano. To support quota management in multi-cluster and multi-tenant environments, Volcano v1.12 extends its global queue capacity management capabilities. Users can now uniformly limit tenant resource usage in a multi-cluster environment. The configuration is consistent with the single-cluster scenario: the capability field in the queue configuration is used to limit tenant quotas.

Related PR:

https://github.com/volcano-sh/volcano-global/pull/16

Thanks to community developer @tanberBro for their contribution to this feature.

Security Enhancements

The Volcano community is committed to security. In v1.12, in addition to fine-grained control over sensitive permissions like ClusterRoles, the following security risks have been addressed:

Set Timeouts for HTTP Servers: The Metric and Healthz endpoints of all Volcano components now have server-side ReadHeader, Read, and Write timeouts to prevent prolonged resource occupation. (PR: https://github.com/volcano-sh/volcano/pull/4208)
Add Warning for Skipping SSL Certificate Verification: When a client request sets insecureSkipVerify to true, a warning is logged to recommend enabling SSL certificate verification in production environments. (PR: https://github.com/volcano-sh/volcano/pull/4211)
Disable Volcano Scheduler's pprof Endpoint by Default: To prevent the leakage of sensitive program information, the profiling data port used for troubleshooting is now disabled by default. (PR: https://github.com/volcano-sh/volcano/pull/4173)
Remove Unnecessary File Permissions: Unnecessary execute permissions have been removed from Go source files to follow the principle of least privilege. (PR: https://github.com/volcano-sh/volcano/pull/4171)
Set Security Context for Containers and Run as Non-Root: All Volcano components now run with non-root privileges. Security contexts have been added with seccompProfile and SELinuxOptions, and allowPrivilegeEscalation is set to false to prevent container privilege escalation. Only necessary Linux capabilities are retained, comprehensively restricting container permissions. (PR: https://github.com/volcano-sh/volcano/pull/4207)
Limit HTTP Response Body Size: For HTTP requests sent by the Extender Plugin and ElasticSearch Service, the response body size is limited to prevent issues like OOM caused by excessive resource consumption. (Advisory: https://github.com/volcano-sh/volcano/security/advisories/GHSA-hg79-fw4p-25p8)
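
The response-size limit in the last item follows a common pattern: read at most one byte more than the limit, and fail if that extra byte arrives. The Python sketch below shows the pattern on any file-like stream. It is only an illustration; Volcano itself is written in Go, and the names read_limited and ResponseTooLarge are hypothetical.

```python
import io


class ResponseTooLarge(Exception):
    """Raised when a response body exceeds the configured limit."""


def read_limited(stream, max_bytes: int) -> bytes:
    """Read a body from stream, refusing to buffer more than max_bytes."""
    chunks = []
    remaining = max_bytes + 1  # one extra byte reveals an oversized body
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            break  # end of stream
        chunks.append(chunk)
        remaining -= len(chunk)
    data = b"".join(chunks)
    if len(data) > max_bytes:
        raise ResponseTooLarge(f"response body exceeds {max_bytes} bytes")
    return data


print(read_limited(io.BytesIO(b"ok"), 10))  # b'ok'
```

Because the reader never holds more than max_bytes + 1 bytes, a misbehaving or compromised endpoint cannot push the caller into an out-of-memory condition through a single response.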

Performance Improvements for Large-Scale Scenarios

Volcano's performance is continuously being optimized. The new version removes and disables some non-essential webhooks by default without affecting functionality, improving performance in large-scale batch creation scenarios:

PodGroup Mutating Webhook Disabled by Default: Previously, when a PodGroup was created without a specified queue, the queue could be populated from the Namespace. Since this scenario is uncommon, this webhook is now disabled by default. Users can enable it if needed.
Queue Status Check Moved from Pod to PodGroup: Task submission is not allowed when a queue is in a closed state. The original validation logic was performed at the Pod creation stage. Since Volcano's basic scheduling unit is the PodGroup, moving the validation to the PodGroup creation stage is more efficient. As the number of PodGroups is less than the number of Pods, this change reduces webhook calls and improves performance.

Thanks to community developer @Monokaix for their contribution to this feature.

Gang Scheduling for Various Workload Types

Gang scheduling is a core capability of Volcano. For Volcano Job and PodGroup objects, users can directly set minMember to define the required minimum number of replicas. In the new version, users can specify this minimum by setting the annotation scheduling.volcano.sh/group-min-member on other types of workloads such as Deployments, StatefulSets, and Jobs. This means that when using Volcano for scheduling, either the specified number of replicas are all scheduled successfully, or none are scheduled at all, enabling Gang scheduling for a wider variety of workload types.

For example, to set minMember=10 for a Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: volcano-group-deployment
  annotations:
    # Set min member=10
    scheduling.volcano.sh/group-min-member: "10"
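
The all-or-nothing behavior that minMember implies can be sketched in a few lines of Python. This is a deliberately simplified model, assuming every Pod needs exactly one free slot; the function name and parameters are hypothetical and do not mirror Volcano's scheduler code.

```python
def gang_admit(replicas: int, min_member: int, free_slots: int) -> int:
    """Return how many Pods to place in one scheduling attempt.

    Gang semantics: if fewer than min_member Pods fit, place none,
    so a partially started group never holds resources it cannot use.
    """
    placeable = min(replicas, free_slots)
    return placeable if placeable >= min_member else 0


print(gang_admit(replicas=12, min_member=10, free_slots=8))   # 0
print(gang_admit(replicas=12, min_member=10, free_slots=11))  # 11
print(gang_admit(replicas=12, min_member=10, free_slots=20))  # 12
```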

Related PR:

https://github.com/volcano-sh/volcano/pull/4000

Thanks to community developer @sceneryback for their contribution to this feature.

Job Flow Enhancements

Job Flow is a lightweight workflow orchestration framework for Volcano Jobs. In v1.12, Job Flow has been enhanced with the following improvements:

New Monitoring Metrics: Added metrics for the number of successful and failed Job Flows.
DAG Validity Check: Introduced a function to validate the structure of a Job Flow's Directed Acyclic Graph (DAG).
State Synchronization Fix: Resolved an issue that caused inaccurate Job Flow state synchronization.
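
A DAG validity check of the kind described above is commonly built on a topological sort: if not every job can be ordered, the dependencies contain a cycle. The Python sketch below uses Kahn's algorithm. The input shape (job name mapped to the jobs it depends on) and the function name are assumptions for illustration, not the Job Flow API.

```python
from collections import deque


def validate_dag(deps: dict) -> list:
    """Return a valid run order, or raise ValueError if deps is not a DAG.

    deps maps each job name to the list of jobs it depends on.
    """
    unknown = {d for targets in deps.values() for d in targets} - set(deps)
    if unknown:
        raise ValueError(f"undefined dependencies: {sorted(unknown)}")

    # Count unmet dependencies per job and index the reverse edges.
    indegree = {job: len(set(targets)) for job, targets in deps.items()}
    dependents = {job: [] for job in deps}
    for job, targets in deps.items():
        for target in set(targets):
            dependents[target].append(job)

    ready = deque(sorted(job for job, n in indegree.items() if n == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in dependents[job]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    # Jobs on a cycle never reach indegree 0, so they are missing from order.
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order


print(validate_dag({"a": [], "b": ["a"], "c": ["a", "b"]}))  # ['a', 'b', 'c']
```

Running such a check when a flow is submitted surfaces a malformed graph immediately instead of leaving jobs waiting on dependencies that can never complete.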

Thanks to community developer @dongjiang1989 for their contribution to this feature.

Finer-Grained Permission Control in Multi-Tenant Scenarios

Volcano natively supports multi-tenant environments and emphasizes permission control in such scenarios. In the new version, Volcano has enhanced permission control for Volcano Jobs by adding read-only and read-write ClusterRoles. Users can assign different permissions to tenants as needed to achieve better isolation.

Related PR:

https://github.com/volcano-sh/volcano/pull/4174

Thanks to community developer @Hcryw for their contribution to this feature.

Kubernetes 1.32 Support

Volcano stays current with Kubernetes releases. Version 1.12 supports the latest Kubernetes v1.32 and ensures functionality and reliability through comprehensive unit and end-to-end (E2E) tests.

To participate in Volcano's adaptation work for new Kubernetes versions, please refer to: adapt-k8s-todo.

Related PR:

https://github.com/volcano-sh/volcano/pull/4099

Thanks to community developers @guoqinwill and @danish9039 for their contributions to this feature.

Enhanced Queue Monitoring Metrics

Volcano queues now include several new key resource metrics. The system now supports monitoring and visualization of request, allocated, deserved, capacity, and real_capacity for CPU, memory, and extended resources, providing a detailed view of the status of key queue resources.

Related PR:

https://github.com/volcano-sh/volcano/pull/3937

Thanks to community developer @zedongh for their contribution to this feature.

Fuzz Testing Support

Fuzz testing is an automated software testing technique. In this release, Volcano introduces a fuzz testing framework to test key function units. It uses Google's open-source OSS-Fuzz framework for continuous testing, which helps to discover potential vulnerabilities and defects early, enhancing the security and robustness of Volcano.

Related PR:

https://github.com/volcano-sh/volcano/pull/4205

Thanks to community developer @AdamKorcz for their contribution to this feature.

Stability Enhancements

This release includes several stability fixes, addressing issues such as panics caused by improper queue capacity settings, hierarchical queue validation failures, unnecessary PodGroup refreshes, and StatefulSets with zero replicas consuming queue resources. These improvements further enhance the stability of the system in complex scenarios.

Thanks to the following community developers for their contributions: @halcyon-r, @guoqinwill, @JackyTYang, @JesseStutler, @zhutong196, @Wang-Kai, @HalfBuddhist.

Pre-Upgrade Notes

Before upgrading to Volcano v1.12, please note the following changes:

PodGroup Mutating Webhook Disabled by Default: In v1.12, the PodGroup's Mutating Webhook is disabled by default. If you have workflows that rely on the webhook to automatically populate a PodGroup's queue from its Namespace, you must manually enable this webhook after upgrading.
Queue Status Check Migration and Behavioral Change: The queue status validation logic for task submission has been moved from the Pod creation stage to the PodGroup creation stage. Now, when a queue is closed, the system will prevent task submission at the PodGroup creation time. However, individual Pods (not submitted via a PodGroup) can still be submitted to a closed queue, but they will not be scheduled by the Volcano Scheduler.
Volcano Scheduler pprof Endpoint Disabled by Default: For security reasons, the Volcano Scheduler's pprof endpoint is disabled by default in this version. If needed, it can be explicitly enabled via the Helm parameter custom.scheduler_pprof_enable=true or the command-line argument --enable-pprof=true.

Summary and Future Work

The release of Volcano v1.12 is the result of the joint efforts of community contributors and users. This version brings enhancements to AI task scheduling, GPU resource utilization, heterogeneous resource management, security, and performance and stability in large-scale scenarios.

Version 1.12 aims to improve the performance and efficiency of running AI, big data, and other batch computing tasks in cloud-native environments. We recommend that users upgrade to the new version and welcome feedback and suggestions for improvement through our community channels.

In the future, the Volcano community will continue to focus on the core needs of cloud-native AI (CNAI), big data, and other fields, iterating continuously.

Roadmap and Call for Contributions

The Volcano community is committed to building a more powerful, flexible, and user-friendly batch computing platform while actively responding to evolving technology trends and user needs. In upcoming releases, we plan to focus on the following areas:

Deepen Network Topology-Aware Scheduling Capabilities: Building on the v1.12 Alpha version, we will continue to enhance our network topology-aware capabilities. Key areas include providing automatic discovery support for RoCE networks, intelligent identification and use of node labels, and moving towards more fine-grained, task-level topology-aware scheduling. We will also explore and implement more advanced scheduling features to meet the performance requirements of complex AI training scenarios.
Introduce Advanced Resource Management Mechanisms: We will focus on developing and improving job rescheduling and resource reservation functions. This will help to more flexibly respond to dynamic changes in cluster load, ensure resource guarantees for critical tasks, and further improve overall cluster resource utilization. Related issue:
- GPU fragmentation across nodes and Job/Pod rescheduling strategy request
Enhance Queue Scheduling Flexibility: We will provide fine-grained configuration for queue-level scheduling policies. This will allow users to customize scheduling behavior and resource allocation strategies based on the characteristics, priorities, and SLA requirements of different business queues. Related issue:
- volcano supports queue-level scheduling policies
Deepen Ecosystem Collaboration and Integration: We will actively promote collaboration with the upstream Kubernetes community and other cloud-native projects, such as integrating LWS (Leader Worker Set) with Volcano to better provide Gang Scheduling capabilities for distributed applications. Related issue:
- Support custom scheduler to enable gang scheduling
We warmly welcome other open-source projects to join Volcano in building and enriching the cloud-native batch computing ecosystem.
Expand Heterogeneous Hardware Support and Cooperation: We will strengthen cooperation with hardware ecosystem partners, such as adapting and optimizing Ascend's Device Plugin and DRA Driver, and collaborating with major GPU vendors on DRA Drivers. This will ensure that Volcano can efficiently and stably schedule and manage various cutting-edge heterogeneous accelerator resources.
Improve JobFlow Workflow Capabilities: We will continue to optimize Volcano's built-in lightweight workflow engine, JobFlow. Plans include enhancing its capabilities in complex job dependency management, status monitoring, error handling, and user-defined extensions to provide users with a more powerful and user-friendly workflow orchestration solution. Related issues:
- Support JobFlowTemplate CRD
- Enhance JobFlow Functionality
Introduce Volcano Scheduler Simulator to Enhance Scheduling Transparency and Testability: To improve the transparency of the scheduling process and simplify testing, Volcano plans to introduce a scheduling simulator. This tool will allow users to accurately reproduce Volcano's core scheduling process in a lightweight environment by flexibly configuring a simulated cluster state (nodes, Pods, queue configurations, etc.). By outputting detailed scheduling logs and optional performance analysis, the simulator will make it easier for developers to test new features, help users understand and validate Volcano's scheduling behavior in different scenarios, and efficiently evaluate the impact of various scheduling policies. Related issue:
- Implement Volcano Scheduler Simulator

Community Engagement

The above roadmap is a preliminary plan. We welcome developers and users to participate in discussions and contribute ideas and suggestions for the future of Volcano.

GitHub Issues: Create a kind/feature issue in the Volcano GitHub repository, detailing your use case and feature expectations.
Community Communication: Participate in community meetings, or start a discussion in the WeChat group, Slack channel, or mailing list to communicate with developers and community members.
Roadmap Contribution: Feel free to make suggestions regarding our proposed roadmap or other features you consider important.

Acknowledgments

Volcano v1.12 includes hundreds of code commits from 43 community contributors. We would like to express our sincere thanks to all of them for their contributions. Their GitHub IDs are listed below:

@AdamKorcz	@HalfBuddhist	@Hcryw
@JackyTYang	@JesseStutler	@MondayCha
@Monokaix	@Poor12	@SataQiu
@Wang-Kai	@archlitchi	@baddoub
@cnmcavoy	@co63oc	@de6p
@dongjiang1989	@ecosysbin	@fengruotj
@feyounger	@fjq123123	@googs1025
@guoqinwill	@halcyon-r	@hansongChina
@hiwangzhihui	@hwdef	@kingeasternsun
@linuxfhy	@mahdikhashan	@mahmut-Abi
@murali1539	@ouyangshengjia	@qGentry
@sailorvii	@sceneryback	@sfc-gh-raravena
@wangyang0616	@weapons97	@xieyanke
@ytcisme	@yuyue9284	@zedongh
@zhutong196

Volcano completes security audit

Fri, 30 May 2025 00:00:00 GMT

Volcano is excited to announce the completion of our CNCF-funded security audit, carried out by Ada Logics and facilitated by OSTIF in collaboration with the Volcano maintainers. The audit was scoped to cover the Volcano source code, supply-chain risks, and fuzzing. The auditing team identified 10 security issues, all of which the Volcano security team had fixed by the time the audit was completed.

Volcano has addressed several infrastructure-level security issues by making targeted configuration changes that reduce risk and improve the security posture of its default deployment. Below is a breakdown of each issue, the associated risks, and how Volcano resolved them, along with the resulting security improvements.

One issue involved several Volcano components running with root privileges by default. Containers running as root pose an increased security risk in that if compromised, an attacker gains access to capabilities they can use to escalate their privileges. Volcano fixed this by configuring all components - including the scheduler, admission controller, controllers, and dashboard - to run as non-root by default. This change limits the scope of what an attacker can do inside a container and helps contain breaches more effectively.

Another issue was the absence of seccomp profiles across Volcano’s workloads. Without seccomp, containers can invoke any Linux system call which increases the attack surface for kernel-level attacks and container escapes. Volcano addressed this by adding seccomp profiles, specifically using RuntimeDefault, which restricts containers to a safe subset of system calls. This reduces the kernel’s exposure and strengthens runtime isolation.

Volcano's containers also lacked SELinux configuration. SELinux manages access control at the kernel level and limits how processes can interact with files, system resources, and other processes. Volcano added SELinux options to the security contexts of all its pods and containers.

In addition, Volcano had previously granted its containers unnecessary Linux capabilities, the fine-grained permissions that determine what a containerized process can do. For example, capabilities like CAP_NET_ADMIN or CAP_SYS_ADMIN grant significant power and are often unnecessary for typical application logic. Volcano mitigated this risk by removing non-essential capabilities using a “drop all” approach and only adding back specific permissions if needed. This reduces the attack surface and enforces the principle of least privilege.

Prior to the audit, Volcano allowed containers to escalate privileges during execution, which could permit non-privileged processes to gain additional privileges. Such privilege escalation increases the risk of bypassing container security controls. Volcano resolved this by setting allowPrivilegeEscalation: false in its containers and pods, ensuring that processes run only with the privileges they were initially assigned.

These changes help contain potential attacks, reduce the avenues for privilege escalation or container breakout, and enhance the overall resilience of the system in multi-tenant and production environments.

On the application side, the auditors identified 5 issues. The most notable allowed an attacker who had compromised an elastic service or an extender plugin in the cluster to cause a denial of service of the Volcano scheduler. This issue was assigned CVE-2025-32777 and rated HIGH severity.

Fuzzing

During the audit, Ada Logics integrated Volcano into Google's OSS-Fuzz project with two initial fuzz tests. OSS-Fuzz is an open source project that other critical open source projects can integrate into. Google runs integrated projects' fuzzers on vast amounts of compute and reports any findings to the project's team via email. OSS-Fuzz's reports contain information such as stack traces, steps to reproduce, which fuzz harness found the issue, and more. Periodically, OSS-Fuzz tries to reproduce each issue to confirm that it still exists. If it can't reproduce it, OSS-Fuzz automatically marks the issue as fixed.

Getting involved in Volcano

Volcano is the industry's first cloud-native batch computing engine and the sole batch computing project within the CNCF. It operates as a Kubernetes-native batch scheduling system, enhancing the standard kube-scheduler. Volcano provides comprehensive features to manage and optimize diverse batch and elastic workloads, including AI/ML/DL, Bioinformatics/Genomics, and other "Big Data" applications. It offers robust integration with frameworks such as Spark, Flink, Ray, TensorFlow, PyTorch, Argo, MindSpore, PaddlePaddle, Kubeflow, MPI, Horovod, MXNet, and KubeGene. Drawing from over fifteen years of experience in high-performance workload operations, Volcano combines proven practices and innovative concepts to deliver a powerful and flexible scheduling solution.

We encourage you to join our community and contribute to Volcano's development. Your participation is valuable, whether you're asking questions, sharing experiences, or contributing code.

GitHub: Access our main repository to contribute code or report issues: https://github.com/volcano-sh/volcano.
Website & Documentation: Find comprehensive documentation and news on our official website: https://volcano.sh/en/.
Contributing Code: Our Contribution Guide offers detailed instructions on finding good first issues and submitting pull requests. We welcome contributions of all sizes.
Slack Channel: Join our Slack workspace for real-time discussions and support. First, join the CNCF Slack at https://slack.cncf.io/, then navigate to the #volcano channel: https://cloud-native.slack.com/archives/C011GJDQS0N.
Community Meetings: Participate in our regular community meetings to discuss project updates, roadmaps, and proposals.
- Meeting Link
- Meeting Notes
Mailing List: Subscribe to our mailing list for important announcements and broader discussions.

You can find the audit report here. We would like to thank all involved parties in the audit for their great work.

How Volcano boosts distributed training and inference performance

Tue, 01 Apr 2025 00:00:00 GMT

The Growing Demand for LLM Workloads and Associated Challenges

The increasing adoption of large language models (LLMs) has led to heightened demand for efficient AI training and inference workloads. As model size and complexity grow, distributed training and inference have become essential. However, this expansion introduces challenges in network communication, resource allocation, and fault recovery within large-scale distributed environments. These issues often create performance bottlenecks that hinder scalability.

Addressing Network Bottlenecks Through Topology-Aware Scheduling

In LLM training, model parallelism distributes workloads across multiple nodes, requiring frequent data exchanges. Network communication can become a bottleneck, particularly in heterogeneous environments with InfiniBand (IB), RoCE, or NVSwitch configurations. Communication efficiency depends on network topology—fewer switches between nodes typically result in lower latency and higher throughput. One approach to mitigating this challenge is Network Topology-Aware Scheduling, which optimizes workload placement to minimize cross-switch communication. A key component of this strategy is the HyperNode, an abstraction for representing network topology via Custom Resource Definitions (CRDs). Unlike label-based methods, HyperNode provides a hierarchical structure that reflects actual network layouts, improving management and optimization. Nodes within the same HyperNode communicate more efficiently than those spanning multiple layers.

Topology constraints can also be specified for jobs through the networkTopology field, with options for strict (Hard Mode) or flexible (Soft Mode) enforcement. This granular control helps ensure workloads are deployed in optimal network environments, reducing latency and improving throughput.

Managing Multi-Cluster Environments for Scalability

As AI workloads expand, single Kubernetes clusters may no longer suffice for large-scale training and inference. While multiple clusters can address this limitation, managing them efficiently presents challenges. The Volcano Global subproject extends scheduling capabilities to multi-cluster environments, integrating with Karmada to enable cross-cluster scheduling for distributed workloads. Features such as Queue Priority Scheduling, Job Priority Scheduling, and Multi-Tenant Fair Scheduling help optimize resource allocation and ensure equitable access across tenants. This approach simplifies multi-cluster management while supporting scalable AI workloads.

Improving Stability with Fine-Grained Fault Recovery

Fault recovery is critical in distributed AI training and inference. Traditional methods often restart entire jobs upon a single Pod failure, leading to resource inefficiencies. With checkpointing and resume-from-checkpoint techniques, full restarts are often unnecessary. Fine-Grained Job Fault Recovery allows policies to restart only failed Pods or associated tasks, reducing unnecessary disruptions. Timeout configurations can further minimize interventions—if a Pod recovers within the allotted time, no restart is triggered. This approach enhances stability and efficiency in distributed workloads.

Future Developments in Distributed Workload Management

Ongoing advancements in distributed workload management include:

Task-Level Network Topology Affinity Scheduling: Support for distributed inference scenarios, such as integration with lws.
HyperNode Auto-Discovery and Status Updates: Automation for HyperNode lifecycle management.
Dynamic Resource Allocation (DRA): Improved management of heterogeneous resources.
Dynamic GPU Partitioning: Support for MIG and MPS to enhance GPU utilization.

More information for Volcano:

Website: https://volcano.sh/
GitHub: https://github.com/volcano-sh/volcano
Slack: Join the conversation on Volcano Slack.
Weekly Meetings: Attend our weekly meetings and review meeting notes:
- Meeting Link: Zoom
- Meeting Notes: Google Docs
Twitter: Follow us on X (formerly Twitter) for the latest updates.

Volcano v1.11.0 Available Now

Fri, 07 Feb 2025 00:00:00 GMT

As the de facto standard in cloud-native batch computing, Volcano has been widely adopted across various scenarios, including AI, Big Data, and High-Performance Computing (HPC). With over 800 contributors from more than 30 countries and tens of thousands of code commits, Volcano has been deployed in production environments by over 60 enterprises worldwide. It provides the industry with excellent practical standards and solutions for cloud native batch computing.

As user scenarios grow increasingly complex, especially for LLMs, there is a heightened demand for performance, GPU resource utilization, and availability in both training and inference workloads. This has driven Volcano to continuously expand its capabilities and address core user needs. Over the course of 28 releases, Volcano has introduced a series of enhancements and optimizations tailored to batch computing scenarios, helping users seamlessly migrate their workloads to cloud-native platforms. These improvements have resolved numerous pain points, earning the platform widespread praise and fostering a vibrant community with over 30 approvers and reviewers, creating a win-win ecosystem.

This release marks a new milestone for 2025: the community is introducing a series of major features that deepen Volcano’s focus on areas such as CNAI (Cloud Native AI) and Big Data. Key features include:

AI Scenarios:

Network Topology-Aware Scheduling: Reduces network communication overhead between training tasks, optimizing performance for large AI model training.
NPU Scheduling and Virtualization: Enhances NPU resource utilization.
GPU Dynamic Partitioning: Introduces MIG and MPS dynamic partitioning to improve GPU resource utilization.
Volcano Global Multi-Cluster AI Job Scheduling: Supports multi-cluster AI job deployment and distribution.
Checkpointing and Fault Recovery Optimization: Enables finer-grained job restart policies.
Dynamic Resource Allocation (DRA): Supports flexible and efficient management of heterogeneous resources.

Big Data Scenarios:

Elastic Hierarchical Queues: Facilitates smooth migration of Big Data workloads to cloud-native platforms.

Microservices Scenarios:

Online and Offline Workload Colocation with Dynamic Resource Oversubscription: Boosts resource utilization while ensuring QoS for online workloads.
Load-Aware Scheduling and Descheduling: Provides resource defragmentation and load balancing capabilities.

The official release of Volcano v1.11 marks a new chapter in cloud-native batch computing! This update focuses on the core needs of AI and Big Data, introducing network topology-aware scheduling and multi-cluster AI job scheduling, significantly enhancing the performance of AI training and inference tasks. Additionally, online and offline workload colocation with dynamic resource oversubscription and load-aware descheduling further optimize resource utilization, ensuring high availability for online services. The introduction of elastic hierarchical queues offers more flexible scheduling strategies for Big Data scenarios.

Deep Dive into Key Features

Volcano v1.11 delivers a series of major feature updates for AI, Big Data, and resource utilization scenarios, including:

Network Topology-Aware Scheduling: Optimizing AI Large Model Training Performance

In AI large model training, model parallelism splits the model across multiple nodes, requiring frequent data exchange between nodes. Network communication often becomes a bottleneck, significantly impacting training efficiency. Data centers feature diverse network types like InfiniBand (IB), RoCE, and NVSwitch, with complex multi-layer switch topologies. The fewer switches spanned between two nodes, the lower the communication latency and the higher the throughput. Thus, users aim to schedule workloads in the optimal performance domain with the highest throughput and lowest latency.

To address this, Volcano introduces Network Topology-Aware Scheduling, leveraging a unified network topology API and intelligent scheduling strategies to tackle network communication performance issues in large-scale AI training jobs.

Unified Network Topology API: Precise Network Structure Representation

To abstract away the differences in data center network types, Volcano defines a new CRD, HyperNode, to represent network topology, providing a standardized API. Compared to traditional label-based approaches, HyperNode offers several advantages:

Semantic Consistency: HyperNode provides a standardized way to describe network topology, avoiding inconsistencies in label semantics.
Hierarchical Structure: HyperNode supports tree-like hierarchies, accurately reflecting actual network topologies.
Ease of Management: Cluster administrators can manually create HyperNodes or use automated network topology discovery tools to maintain them.

A HyperNode represents a network topology performance domain, typically mapped to a switch. Multiple HyperNodes connect hierarchically to form a tree structure. For example:

Leaf HyperNodes (s0, s1, s2, s3): Have actual cluster nodes as members.
Non-Leaf HyperNodes (s4, s5, s6): Have other HyperNodes as members.

In this structure, communication efficiency between nodes depends on the number of HyperNode layers they span. For instance:

node0 and node1 within s0 have the highest communication efficiency.
node1 and node2 spanning two HyperNode layers (s0→s4→s1) have lower efficiency.
node0 and node4 spanning three HyperNode layers (s0→s4→s6) have the lowest efficiency.
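
The relationship between tree position and communication cost can be sketched in a few lines of Python. This is only an illustration: the membership map is an assumption reconstructed from the example (two nodes per leaf HyperNode), and the names `PARENT`, `TIER`, `NODE_TO_LEAF`, and `lowest_common_tier` are hypothetical helpers, not Volcano APIs. The tier of the lowest HyperNode containing both nodes is used as a rough proxy for communication cost.

```python
# Illustrative topology based on the example above; the exact node
# placement (two nodes per leaf HyperNode) is an assumption.
PARENT = {"s0": "s4", "s1": "s4", "s2": "s5", "s3": "s5",
          "s4": "s6", "s5": "s6", "s6": None}
TIER = {"s0": 1, "s1": 1, "s2": 1, "s3": 1, "s4": 2, "s5": 2, "s6": 3}
NODE_TO_LEAF = {"node0": "s0", "node1": "s0", "node2": "s1", "node3": "s1",
                "node4": "s2", "node5": "s2", "node6": "s3", "node7": "s3"}


def ancestors(hypernode):
    """Return the chain from a HyperNode up to the root, inclusive."""
    chain = []
    while hypernode is not None:
        chain.append(hypernode)
        hypernode = PARENT[hypernode]
    return chain


def lowest_common_tier(node_a, node_b):
    """Tier of the lowest HyperNode containing both nodes (lower is better)."""
    chain_b = set(ancestors(NODE_TO_LEAF[node_b]))
    for hypernode in ancestors(NODE_TO_LEAF[node_a]):
        if hypernode in chain_b:
            return TIER[hypernode]
    raise ValueError("nodes share no common HyperNode")


print(lowest_common_tier("node0", "node1"))  # 1
print(lowest_common_tier("node1", "node2"))  # 2
print(lowest_common_tier("node0", "node4"))  # 3
```

A lower result means the two nodes meet lower in the tree and, under the model described here, communicate more efficiently.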

HyperNode Configuration Example

Here’s an example of leaf and non-leaf HyperNode configurations:

Leaf HyperNode Example:

apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: s0
spec:
  tier: 1  # Lower tiers indicate higher communication efficiency
  members: # List of child nodes
  - type: Node  # Child node type
    selector:
      exactMatch: # Exact match
        name: node-0
  - type: Node
    selector:
      regexMatch: # Regex match
        pattern: node-[01]

Non-Leaf HyperNode Example:

apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: s6
spec:
  tier: 3  # HyperNode tier
  members: # List of child nodes
  - type: HyperNode  # Child node type
    selector:
      exactMatch: # Exact match
        name: s4
  - type: HyperNode
    selector:
      exactMatch: # Exact match
        name: s5

Network Topology-Aware Scheduling Strategy

Volcano Job and PodGroup can set topology constraints via the networkTopology field, supporting the following configurations:

mode: Supports hard and soft modes.
- hard: Enforces strict constraints, requiring tasks within a job to be deployed within the same HyperNode.
- soft: Prefers deploying tasks within the same HyperNode but allows flexibility.
highestTierAllowed: Used with hard mode to specify the maximum HyperNode tier a job can span.

For example, the following configuration restricts a job to HyperNodes of tier 2 or lower (e.g., s4 or s5). If no such placement is possible, the job remains in a Pending state:

spec:
  networkTopology:
    mode: hard
    highestTierAllowed: 2

This scheduling strategy allows users to precisely control job topology constraints, ensuring optimal performance and significantly improving training efficiency.
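
To make the effect of highestTierAllowed concrete, here is a minimal Python sketch that checks whether a given placement would satisfy a hard-mode constraint. It is not the scheduler's algorithm, only an after-the-fact check. The `HYPERNODES` table, its node membership, and the function names are assumptions made for illustration.

```python
# Hypothetical HyperNode table: name -> (tier, all cluster nodes beneath it).
# Membership is assumed for illustration and mirrors the s0..s6 example tree.
HYPERNODES = {
    "s0": (1, {"node0", "node1"}),
    "s1": (1, {"node2", "node3"}),
    "s2": (1, {"node4", "node5"}),
    "s3": (1, {"node6", "node7"}),
    "s4": (2, {"node0", "node1", "node2", "node3"}),
    "s5": (2, {"node4", "node5", "node6", "node7"}),
    "s6": (3, {"node0", "node1", "node2", "node3",
               "node4", "node5", "node6", "node7"}),
}


def min_spanning_tier(nodes):
    """Lowest tier of any HyperNode that contains every node in `nodes`."""
    wanted = set(nodes)
    tiers = [tier for tier, members in HYPERNODES.values() if wanted <= members]
    if not tiers:
        raise ValueError("no HyperNode contains all requested nodes")
    return min(tiers)


def satisfies_hard_mode(nodes, highest_tier_allowed):
    """True if the placement stays within the allowed HyperNode tier."""
    return min_spanning_tier(nodes) <= highest_tier_allowed


print(satisfies_hard_mode(["node0", "node3"], 2))  # True: both sit under s4
print(satisfies_hard_mode(["node0", "node4"], 2))  # False: only s6 (tier 3) spans both
```

In hard mode, a placement like the second one would not be accepted, so the job would stay Pending until a conforming placement exists.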

Future Plans

Volcano will continue to refine network topology-aware scheduling, with future plans including:

Automating the conversion of node labels to HyperNode CRs to simplify migration.
Integrating network topology discovery tools to streamline HyperNode management.
Providing CLI tools for easier HyperNode hierarchy visualization and management.

For detailed design and user guide, please refer to:

Design Document: Network Topology Aware Scheduling.

Usage Document: Network Topology Aware Scheduling | Volcano.

Sincere thanks to community developers: @ecosysbin, @weapons97, @Xu-Wentao, @penggu, @JesseStutler, @Monokaix for their contributions!

Elastic Hierarchical Queues: Flexible Multi-Tenant Resource Management

In multi-tenant environments, fair resource allocation, isolation, and job prioritization are critical. Different departments or teams often share cluster resources while ensuring their jobs receive resources on demand, avoiding contention or waste. Volcano v1.11 introduces Elastic Hierarchical Queues, significantly enhancing queue resource management. Hierarchical queues enable finer-grained resource quota management, cross-level resource sharing and reclamation, and flexible preemption policies, creating an efficient and fair unified scheduling platform. For users migrating from YARN, Volcano seamlessly transitions Big Data workloads to Kubernetes clusters.

Core Capabilities of Elastic Hierarchical Queues

Volcano’s elastic hierarchical queues offer the following key features to meet complex multi-tenant demands:

Configurable Queue Hierarchies: Users can create multi-level queues in a tree structure, each with independent resource quotas and priorities.
Cross-Level Resource Sharing and Reclamation: Idle resources in child queues can be shared with sibling queues and reclaimed when needed.
Fine-Grained Resource Quota Management: Each queue can set parameters like:
- capability: Maximum resource capacity.
- deserved: Fair share of resources; excess can be reclaimed.
- guarantee: Reserved resources, ensuring minimum guarantees.
Flexible Preemption Policies: Supports priority-based preemption to ensure high-priority tasks receive resources promptly.
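
The interplay of the three quota parameters can be sketched as follows. This is a simplified single-dimension model (for example, CPU cores) written for illustration. The function name, the returned keys, and the ordering check `guarantee <= deserved <= capability` are assumptions, not Volcano's implementation.

```python
def queue_status(allocated, guarantee, deserved, capability):
    """Classify a queue's usage of one resource dimension (e.g. CPU cores)."""
    # Assumed sanity check: reserved <= fair share <= hard cap.
    if not guarantee <= deserved <= capability:
        raise ValueError("expected guarantee <= deserved <= capability")
    return {
        # Usage above the fair share may be reclaimed by other queues.
        "reclaimable": max(0, allocated - deserved),
        # The queue cannot grow beyond its capability.
        "headroom": max(0, capability - allocated),
        # Part of the reservation that the queue has not used yet.
        "unused_guarantee": max(0, guarantee - allocated),
    }


# A queue borrowing beyond its fair share: 16 cores could be reclaimed.
print(queue_status(allocated=80, guarantee=16, deserved=64, capability=96))
# A lightly used queue: nothing reclaimable, 8 reserved cores still unused.
print(queue_status(allocated=8, guarantee=16, deserved=64, capability=96))
```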

Hierarchical Queue Example

A simple hierarchical queue structure might look like this:

Root Queue: Manages global resource allocation.
Department Queues: Represent resource pools for different departments or teams.
Child Queues: Represent specific projects or tasks, where users submit jobs.

Use Cases

Multi-Department Resource Sharing: Large enterprises can use hierarchical queues to fairly allocate and isolate resources across departments.
Big Data Task Scheduling: Users migrating from YARN to Kubernetes can leverage hierarchical queues for seamless Big Data workload migration.
AI Training and Inference: Hierarchical queues enable dynamic resource allocation and reclamation for AI tasks.

For detailed design and user guide, please refer to:

Design Document: hierarchical-queue-on-capacity-plugin.

Usage Document: Hierarchical Queue | Volcano.

Sincere thanks to community developer: @Rui-Gan for this contribution!

Multi-Cluster AI Job Scheduling: Unified Management and Efficient Scheduling Across Clusters

As enterprise workloads grow, single Kubernetes clusters often fall short of meeting the demands of large-scale AI training and inference jobs. Users typically manage multiple Kubernetes clusters to achieve unified workload distribution, deployment, and management. Many users already deploy Volcano across multiple clusters, managed by Karmada. To better support AI jobs in multi-cluster environments, Volcano has incubated the Volcano Global sub-project, extending Volcano’s powerful scheduling capabilities to multi-cluster scenarios. This project provides a unified scheduling platform for multi-cluster AI jobs, supporting cross-cluster job distribution, resource management, and priority control.

Core Capabilities

Volcano Global enhances Karmada with the following features to meet the complex demands of multi-cluster AI job scheduling:

Cross-Cluster Volcano Job Scheduling: Users can deploy and schedule Volcano Jobs across multiple clusters, maximizing resource utilization.
Queue Priority Scheduling: Supports cross-cluster queue priority management, ensuring high-priority queues receive resources first.
Job Priority Scheduling and Queuing: Enables job-level priority scheduling and queuing across clusters, ensuring critical tasks are executed promptly.
Multi-Tenant Fair Scheduling: Provides fair resource allocation across tenants, preventing resource contention.

For detailed deployment and user guide, please refer to: Multi-Cluster AI Job Scheduling | Volcano.

Sincere thanks to community developers: @Vacant2333, @MondayCha, @lowang-bh, @Monokaix for their contributions!

Online and Offline Workload Colocation with Dynamic Resource Oversubscription: Maximizing Resource Utilization While Ensuring SLO

Background: The Challenge of Resource Utilization

As cloud-native technologies advance, Kubernetes has become the "operating system" of the cloud-native era, with more workloads migrating to Kubernetes platforms. However, despite the flexibility and scalability of cloud-native technologies, data center resource utilization remains low. Online workloads (e.g., microservices) often exhibit peak-and-trough patterns, leaving resources idle during troughs and insufficient during peaks. To improve resource utilization while ensuring high-priority workload SLOs (Service Level Objectives), Volcano introduces a cloud-native colocation solution, combining online and offline workloads with dynamic resource oversubscription to maximize cluster resource utilization while ensuring online workload stability.

Cloud-native colocation involves deploying online workloads (e.g., real-time services) and offline workloads (e.g., batch jobs) on the same cluster. During online workload troughs, offline workloads utilize idle resources; during peaks, offline workloads are throttled to ensure online workload resource needs. This dynamic resource allocation mechanism not only improves resource utilization but also ensures online workload quality of service.

Industry Practices: Volcano’s Unique Advantages

While many companies have explored colocation technologies, existing solutions often fall short, such as being tightly coupled with Kubernetes, using rough oversubscription calculations, or offering inconsistent user experiences. Volcano addresses these issues with the following unique advantages:

Native Support for Offline Job Scheduling: Volcano Scheduler natively supports offline job scheduling without additional adaptation.
Non-Invasive Design: No modifications to Kubernetes are required, allowing users to adopt Volcano without altering existing cluster architectures.
Dynamic Resource Oversubscription: Real-time calculation of the resources that can be oversold ensures a balance between resource utilization and QoS.
OS-Level Isolation and Guarantees: Kernel-level resource isolation ensures online workload priority and stability.

Volcano Cloud-Native Colocation Solution: End-to-End Resource Optimization

Volcano’s cloud-native colocation solution provides end-to-end resource isolation and sharing mechanisms, including the following core components:

Volcano Scheduler: Manages unified scheduling of online and offline workloads, offering abstractions like queues, groups, job priorities, fair scheduling, and resource reservations to meet the needs of microservices, Big Data, and AI workloads.

Volcano SLO Agent: A daemonset running on each node, the SLO Agent monitors node resource usage, dynamically calculates resources that can be oversold, and allocates them to offline workloads. It also detects CPU/memory pressure and evicts offline workloads when necessary to ensure online workload priority.

Enhanced OS: Volcano implements fine-grained QoS guarantees at the kernel level, using cgroups to set resource limits for online and offline workloads, ensuring online workloads receive sufficient resources even under high load.

Core Capabilities: Balancing Resource Utilization and Stability

Volcano’s cloud-native colocation solution offers the following key capabilities to achieve both resource utilization and workload stability:

Unified Scheduling: Supports unified scheduling of microservices, batch and AI jobs.
QoS-Based Resource Model: Provides QoS-based resource management for online and offline workloads, ensuring high-priority workload stability.
Dynamic Resource Oversubscription: Dynamically calculates oversellable resources based on real-time CPU/memory usage, maximizing resource utilization.
CPU Burst: Allows containers to temporarily exceed CPU limits, avoiding throttling during critical moments and improving responsiveness.
Network Bandwidth Isolation: Supports node-level network bandwidth limits, ensuring online workload network requirements.
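
The idea behind dynamic oversubscription can be illustrated with a short Python sketch. The formula below (allocatable minus measured online usage minus a safety reserve) and the `reserve_ratio` parameter are assumptions chosen for illustration. They are not the calculation the SLO Agent actually performs.

```python
def oversellable(allocatable, online_usage, reserve_ratio=0.25):
    """Estimate resources that could be lent to offline workloads.

    allocatable   -- total schedulable amount on the node (e.g. CPU cores)
    online_usage  -- measured usage of online workloads right now
    reserve_ratio -- hypothetical safety margin kept back for online bursts
    """
    reserve = allocatable * reserve_ratio
    # Never report a negative amount: at peak load nothing is lent out.
    return max(0.0, allocatable - online_usage - reserve)


print(oversellable(allocatable=64, online_usage=20))  # 28.0 during a trough
print(oversellable(allocatable=64, online_usage=60))  # 0.0 during a peak
```

Recomputing this value as online usage changes is what makes the oversubscription dynamic: offline workloads gain capacity during troughs and lose it during peaks.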

For detailed design and user guide, please refer to: Cloud Native Colocation | Volcano.

Sincere thanks to community developer: @william-wang for this contribution!

Load-Aware Descheduling: Intelligent Cluster Resource Balancing

In Kubernetes clusters, dynamic workload changes often lead to uneven node resource utilization, causing hotspots that impact cluster stability and efficiency. To address this, Volcano v1.11 introduces Load-Aware Descheduling, dynamically adjusting Pod distribution based on real node load to ensure balanced resource utilization and avoid hotspots, improving overall cluster performance and reliability. Load-aware descheduling is incubated in the sub-project: https://github.com/volcano-sh/descheduler.

Core Capabilities:

Load-Aware Scheduling: Monitors real CPU and memory load metrics to dynamically adjust Pod distribution, avoiding reliance on Pod Request-based scheduling.
Timed and Dynamic Triggers: Supports CronTab-based or fixed-interval descheduling to adapt to different scenarios.

Use Cases:

Uneven Node Resource Utilization: Balances node load when some nodes are overutilized while others are underutilized.
Hotspot Node Management: Migrates Pods from overloaded nodes to ensure stability.

Technical Highlights:

Descheduling Based on Actual Load:
Unlike traditional scheduling strategies based on Pod Requests, Volcano's load-aware descheduling is more precise, accurately reflecting the actual resource usage of nodes.
Seamless Integration with Kubernetes Ecosystem:
Compatible with the native Kubernetes scheduler, enabling load-aware descheduling without requiring additional configurations.
Flexible Policy Configuration:
Users can customize descheduling intervals or trigger conditions based on business requirements, ensuring flexibility and controllability in scheduling.
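
As a rough illustration of deciding on actual load rather than Pod Requests, the sketch below separates hotspot nodes from underutilized ones using measured utilization. The thresholds and the function name are hypothetical, and the real descheduler's policy and eviction logic are more involved.

```python
def plan_descheduling(node_load, high=0.8, low=0.5):
    """Split nodes into hotspots and underutilized candidates.

    node_load maps node name -> measured utilization in the range 0.0-1.0.
    `high` and `low` are hypothetical thresholds chosen for illustration.
    """
    # Most loaded hotspots first: these are relieved first.
    hotspots = sorted((n for n, u in node_load.items() if u > high),
                      key=lambda n: -node_load[n])
    # Least loaded nodes first: these have the most room for new Pods.
    underutilized = sorted((n for n, u in node_load.items() if u < low),
                           key=lambda n: node_load[n])
    return hotspots, underutilized


load = {"a": 0.95, "b": 0.3, "c": 0.6, "d": 0.85, "e": 0.1}
print(plan_descheduling(load))  # (['a', 'd'], ['e', 'b'])
```

Node `c` falls between the two thresholds, so it is neither relieved nor treated as spare capacity.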

For detailed user guide, please refer to: Load-aware Descheduling | Volcano.

Sincere thanks to community developer: @Monokaix for this contribution!

Fine-Grained Job Fault Recovery: Efficient Task Interruption Handling

In AI, Big Data, and HPC scenarios, job stability and fault recovery are critical. Traditional fault recovery strategies often restart entire Jobs when a single Pod fails, wasting resources and potentially restarting training from scratch. With the rise of checkpointing and resume-from-checkpoint techniques, single Pod failures no longer require full Job restarts. Volcano v1.11 introduces the Fine-Grained Job Fault Recovery feature, offering flexible fault handling mechanisms to efficiently manage task interruptions and improve training efficiency.

Core Capabilities:

Supporting Pod-Granular Restart Policies

Users can configure policies to restart only failed Pods or their associated Tasks, avoiding unnecessary Job restarts and reducing resource waste.

Restarting a Single Pod:
When a specific Pod fails, only that Pod is restarted, leaving other running tasks unaffected.
```
policies:
  - event: PodFailed
    action: RestartPod
```
Restarting an Entire Task:
When a Pod fails, the entire Task (a group of Pods) to which it belongs is restarted. This is suitable for scenarios requiring consistency within a task group.
```
policies:
  - event: PodFailed
    action: RestartTask
```

Support for Setting Timeouts for Actions

Pod failures may be caused by transient issues (e.g., network fluctuations or hardware problems). Volcano allows users to set timeout periods for failure recovery actions. If the Pod recovers within the timeout period, no restart is performed, avoiding unnecessary intervention.

Example Configuration: If a Pod fails and is restarted but does not recover within 10 minutes, the entire Job is restarted.

policies:
  - event: PodFailed
    action: RestartPod
  - event: PodEvicted
    action: RestartJob
    timeout: 10m

New PodPending Event Handling

When a Pod remains in the Pending state for an extended period due to insufficient resources or topological constraints, users can set a timeout for the Pending event. If the Pod does not start running after the timeout, the entire Job can be terminated to avoid resource waste.

Example Configuration:
If a Pod remains in the Pending state for more than 10 minutes, the Job will be terminated.

policies:
  - event: PodPending
    action: TerminateJob
    timeout: 10m
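
The timeout semantics described above can be modeled with a small Python sketch. It is a simplified model for illustration, not Volcano's controller logic: the `decide` function, the "Wait" return value, and the single-unit duration parser are assumptions.

```python
import re

UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_timeout(text):
    """Parse a simple duration such as '10m' (single unit only) into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def decide(policies, event, elapsed_seconds, recovered=False):
    """Return the action to take for `event`, 'Wait', or None if nothing applies."""
    for policy in policies:
        if policy["event"] != event:
            continue
        if "timeout" not in policy:
            return policy["action"]   # no timeout: act immediately
        if recovered:
            return None               # transient issue resolved itself in time
        if elapsed_seconds >= parse_timeout(policy["timeout"]):
            return policy["action"]   # timeout expired without recovery
        return "Wait"                 # still inside the grace period
    return None


policies = [
    {"event": "PodPending", "action": "TerminateJob", "timeout": "10m"},
    {"event": "PodFailed", "action": "RestartPod"},
]
print(decide(policies, "PodPending", elapsed_seconds=300))  # Wait
print(decide(policies, "PodPending", elapsed_seconds=600))  # TerminateJob
print(decide(policies, "PodFailed", elapsed_seconds=0))     # RestartPod
```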

Applicable Scenarios:

AI Large Model Training:
In distributed training, the failure of a single Pod does not affect the overall training progress. Fine-grained failure recovery strategies enable quick task recovery, avoiding the need to restart training from scratch.
Big Data Processing:
In batch processing tasks, failures of individual tasks can be resolved by restarting a single Pod or Task, eliminating the need to restart the entire Job and improving processing efficiency.
High-Performance Computing (HPC):
In HPC scenarios, task stability and efficient recovery are critical. Fine-grained failure recovery strategies minimize task interruption time.

Technical Highlights:

Flexible Policy Configuration:
Users can customize failure recovery policies based on business requirements, supporting Pod, Task, and Job-level restart operations.
Timeout Mechanism:
By setting timeout periods, unnecessary restarts due to transient issues are avoided, enhancing Job stability.
Seamless Compatibility with Checkpointing:
Perfectly integrates with checkpointing and resumption technologies in AI scenarios, ensuring efficient recovery of training tasks.

For detailed design and user guide, please refer to: How to use job policy.

Sincere thanks to community developer: @bibibox for this contribution!

Volcano Dashboard: A Resource Visualization Component

The Volcano Dashboard is an official resource visualization component for Volcano. After deploying Volcano, users can deploy the dashboard to view and manage cluster resources through a graphical interface. The project is available at: https://github.com/volcano-sh/dashboard.

Current features include:

Cluster overview, including Job counts, statuses, completion rates, Queue counts, and resource utilization.
Job and Queue lists with filtering, sorting, and search capabilities.
Pod lists with filtering, sorting, and search capabilities.

Sincere thanks to community developers: @WY-Dev0, @Monokaix for their contributions!

Volcano Supports Kubernetes v1.31

Volcano closely follows Kubernetes releases, with full support for Kubernetes v1.31, including comprehensive UT and E2E testing to ensure functionality and reliability.

To contribute to Volcano’s Kubernetes version adaptation, please refer to: adapt-k8s-todo.

Sincere thanks to community developers: @vie-serendipity, @dongjiang1989 for their contributions!

Volcano Job Supports Preemption Policy

Volcano Jobs now support PreemptionPolicy, allowing users to configure whether Jobs can preempt other Pods. Jobs with PreemptionPolicy: Never will not preempt resources, ensuring stability.

For configuration examples, please refer to: how to configure priorityclass for job.

Sincere thanks to community developer: @JesseStut for this contribution!

Performance Optimization: Efficient Scheduling at Scale

In Volcano, Queue is one of the most fundamental and critical resources. The status field of a Queue records the states of PodGroups, such as Unknown, Pending, Running, Inqueue, and Completed. However, in large-scale scenarios, frequent changes in PodGroups within a Queue (e.g., when a large number of short-lived tasks are repeatedly submitted) can cause many PodGroups to transition from Running to Completed. In such cases, the Volcano Controller needs to frequently refresh the status field of the Queue, placing significant pressure on the APIServer. Additionally, the Volcano Scheduler updates the status.allocated field of the Queue after Job scheduling, which can lead to Queue update conflicts in large-scale environments, further impacting system performance.

To completely resolve the issues of frequent Queue refreshes and update conflicts in large-scale scenarios, Volcano v1.11 has optimized the Queue management mechanism by migrating PodGroup statistics to Metrics, eliminating the need for persistent storage. This optimization significantly reduces the pressure on the APIServer while improving the overall performance and stability of the system.

Key Improvements After Optimization

Migration of PodGroup Statistics to Metrics
PodGroup state data (e.g., Unknown, Pending, Running) is no longer stored in the status field of the Queue. Instead, it is recorded and displayed through the metrics system. Users can view the statistics of PodGroups in a Queue using the following commands:

View statistics for a specific Queue:
```
vcctl queue get -n [name]
```
View statistics for all Queues:
```
vcctl queue list
```

Reduced APIServer Pressure
By migrating PodGroup statistics to Metrics, frequent updates to the status field of the Queue are avoided, significantly reducing the load on the APIServer and improving system throughput.

Resolved Queue Update Conflicts
In large-scale scenarios, Queue update conflicts have been effectively mitigated, ensuring the efficient operation of the scheduler.

For detailed design and metric names related to the migration of PodGroup state statistics to Metrics, please refer to: Queue podgroup statistics.

Sincere thanks to community developer: @JesseStutler for this contribution!

Conclusion: Volcano v1.11, A New Era of Cloud-Native Batch Computing

Volcano v1.11 is not just a technological leap but a new chapter in cloud-native batch computing. Whether for AI large model training, Big Data scheduling, or resource optimization, Volcano v1.11 delivers powerful features and flexible solutions. We believe Volcano v1.11 will help users achieve greater heights in cloud-native batch computing, ushering in a new era of AI and Big Data scheduling!

Experience Volcano v1.11.0 now and step into a new era of efficient computing!

v1.11.0 release: https://github.com/volcano-sh/volcano/releases/tag/v1.11.0

Acknowledgments

Volcano v1.11.0 includes contributions from 39 community members. Sincere thanks to all contributors:

@QingyaFan	@JesseStutler	@bogo-y
@bibibox	@zedongh	@archlitchi
@dongjiang1989	@william-wang	@fengruotj
@SataQiu	@lowang-bh	@Rui-Gan
@xovoxy	@wangyang0616	@PigNatovsky
@Yanping-io	@lishangyuzi	@hwdef
@bood	@kerthcet	@WY-Dev0
@raravena80	@SherlockShemol	@zhifanggao
@conghuhu	@MondayCha	@vie-serendipity
@Prepmachine4	@Monokaix	@lengrongfu
@jasondrogba	@sceneryback	@TymonLee
@liuyuanchun11	@Vacant2333	@matbme
@lekaf974	@kursataktas	@lut777

Volcano v1.10.0 Available Now

Sun, 29 Sep 2024 00:00:00 GMT

On Sep 19, 2024, UTC+8, Volcano version v1.10.0 was officially released. This version introduced the following new features:

Support Queue Priority Scheduling Strategy
Enable Fine-Grained GPU Resource Sharing and Reclaim
Introduce Pod Scheduling Readiness Support
Add Sidecar Container Scheduling Capabilities
Enhance Vcctl Command Line Tool
Ensure Compatibility with Kubernetes v1.30
Strengthen Volcano Security Measures
Optimize Volcano for Large-Scale Performance
Improve GPU Monitoring Function
Optimize Helm Chart Installation And Upgrade Processes

Volcano is the industry-first cloud native batch computing project. Open-sourced at KubeCon Shanghai in June 2019, it became an official CNCF project in April 2020. In April 2022, Volcano was promoted to a CNCF incubating project. By now, more than 600 global developers have committed code to the project. The community is seeing growing popularity among developers, partners, and users.

Key Features

Support Queue Priority Scheduling Strategy

In traditional big data processing scenarios, users can directly set queue priorities to control the scheduling order of jobs. To ease the migration from Hadoop/YARN to cloud-native platforms, Volcano supports setting priorities at the queue level, reducing migration costs for big data users while enhancing user experience and resource utilization efficiency.

Queues are a fundamental resource in Volcano, each with its own priority. By default, a queue's priority is determined by its share value, which is calculated by dividing the resources allocated to the queue by its total capacity. This is done automatically, with no manual configuration needed. The smaller the share value, the fewer resources the queue has, making it less saturated and more likely to receive resources first. Thus, queues with smaller share values have higher priority, ensuring fairness in resource allocation.

In production environments—especially in big data scenarios—users often prefer to manually set queue priorities to have a clearer understanding of the order in which queues are scheduled. Since the share value is dynamic and changes in real-time as resources are allocated, Volcano introduces a priority field to allow users to set queue priorities more intuitively. The higher the priority, the higher the queue's standing. High-priority queues receive resources first, while low-priority queues have their jobs reclaimed earlier when resources need to be recycled.

Queue Priority Definition:

type QueueSpec struct {
...
  // Priority defines the priority of the queue. Higher values are prioritized for scheduling and considered later during reclamation.
  // +optional
  Priority int32 `json:"priority,omitempty" protobuf:"bytes,10,opt,name=priority"`
}

To ensure compatibility with the share mechanism, Volcano also considers the share value when calculating queue priorities. By default, if a user has not set a specific queue priority or if priorities are equal, Volcano will fall back to comparing share values. In this case, the queue with the smaller share has higher priority. Users have the flexibility to choose between different priority strategies based on their specific needs—either by using the priority or the share method.
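The ordering rule described above (priority first, share as the tie-breaker) can be sketched in a few lines of Python. This is an illustrative model only, not Volcano's scheduler code: the Queue class, its fields, and scheduling_order are hypothetical names, and share is simplified to a single allocated/capacity ratio.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    priority: int      # user-set priority; 0 stands for "not set" in this sketch
    allocated: float   # resources currently allocated to the queue
    capacity: float    # total capacity of the queue

    @property
    def share(self) -> float:
        # share = allocated resources / total capacity, as described in the text
        return self.allocated / self.capacity


def scheduling_order(queues):
    # Higher priority comes first; on equal priority the smaller share wins.
    return sorted(queues, key=lambda q: (-q.priority, q.share))


queues = [
    Queue("batch", priority=0, allocated=30, capacity=100),
    Queue("online", priority=10, allocated=90, capacity=100),
    Queue("adhoc", priority=0, allocated=10, capacity=100),
]
order = [q.name for q in scheduling_order(queues)]
print(order)
```

Here "online" is served first despite its high share because its priority is higher, while "adhoc" and "batch" have equal priority and fall back to the share comparison.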

For the queue priority design document, please refer to: Queue priority

Volcano introduced the elastic queue capacity scheduling feature in version v1.9, allowing users to directly set the capacity for each resource dimension within a queue. This feature also supports elastic scheduling based on the deserved value, enabling more fine-grained resource sharing and recycling across queues.

For detailed design information on elastic queue capacity scheduling, refer to the Capacity Scheduling Design Document.

For a step-by-step guide on using the capacity plugin, see the Capacity Plugin User Guide.

Example of setting the deserved resources for each dimension of a queue:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: demo-queue
spec:
  reclaimable: true
  deserved: # set the deserved field.
    cpu: 64
    memory: 128Gi
    nvidia.com/a100: 40
    nvidia.com/v100: 80

In version v1.10, Volcano extends its support to include reporting different types of GPU resources within elastic queue capacities. NVIDIA's default Device Plugin does not distinguish between GPU models, instead reporting all resources uniformly as nvidia.com/gpu. This limits AI training and inference tasks from selecting specific GPU models, such as A100 or T4, based on their particular needs. To address this, Volcano now supports reporting distinct GPU models at the Device Plugin level, working with the capacity plugin to enable more precise GPU resource sharing and recycling.

For instructions on using the Device Plugin to report various GPU models, please refer to the GPU Resource Naming Guide.

Note:

In version v1.10.0, the capacity plugin is the default for queue management. Note that the capacity and proportion plugins are incompatible, so after upgrading to v1.10.0, you must set the deserved field for queues to ensure proper functionality.

For detailed instructions, please refer to the Capacity Plugin User Guide.

The capacity plugin allocates cluster resources based on the deserved value set by the user, while the proportion plugin dynamically allocates resources according to queue weight. Users can select either the capacity or proportion plugin for queue management based on their specific needs.

For more details on the proportion plugin, please visit: Proportion Plugin.

Introduce Pod Scheduling Readiness Support

Once a Pod is created, it is considered ready for scheduling, and kube-scheduler does its best to find a suitable node for every pending Pod. In practice, however, some Pods may remain in a "lack of necessary resources" state for a long time. These Pods needlessly interfere with the decisions and operation of the scheduler (and of downstream components such as Cluster Autoscaler), causing problems such as resource waste. Pod Scheduling Readiness is a kube-scheduler feature that reached GA (stable) in Kubernetes v1.30. It controls when a Pod may be scheduled through the Pod's schedulingGates field.

In previous versions, Volcano has integrated all algorithms of the K8s default scheduler, fully covering the native scheduling functions of Kube-scheduler. Therefore, Volcano can completely replace Kube-scheduler as a unified scheduler under the cloud native platform, supporting unified scheduling of microservices and AI/big data workloads. In the latest version v1.10, Volcano has introduced Pod Scheduling Readiness scheduling capability to further meet users' scheduling needs in diverse scenarios.

For the documentation of Pod Scheduling Readiness features, please refer to: Pod Scheduling Readiness | Kubernetes

For Volcano's Pod Scheduling Readiness design doc, please refer to: Proposal for Support of Pod Scheduling Readiness by ykcai-daniel · Pull Request #3581 · volcano-sh/volcano (github.com)
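To make the gating behaviour concrete, here is a minimal Python sketch of the check a scheduler performs on the schedulingGates field. The Pod is modelled as a plain dictionary, and the function name and the gate name example.com/quota-check are hypothetical, chosen only for illustration.

```python
def is_ready_for_scheduling(pod: dict) -> bool:
    # A Pod is held back for as long as spec.schedulingGates is non-empty.
    gates = pod.get("spec", {}).get("schedulingGates") or []
    return len(gates) == 0


gated_pod = {
    "metadata": {"name": "demo-pod"},
    "spec": {
        # hypothetical gate name, set by whoever creates the Pod
        "schedulingGates": [{"name": "example.com/quota-check"}],
        "containers": [{"name": "app", "image": "busybox"}],
    },
}

# An external controller removes the gate once its precondition is met.
released_pod = {**gated_pod, "spec": {**gated_pod["spec"], "schedulingGates": []}}

print(is_ready_for_scheduling(gated_pod))
print(is_ready_for_scheduling(released_pod))
```

Only the second Pod would be handed to the scheduling cycle; the first one stays out of the scheduler's way until its gate is removed.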

Add Sidecar Container Scheduling Capabilities

A Sidecar container is an auxiliary container designed to support the main business container by handling tasks such as logging, monitoring, and network initialization.

Prior to Kubernetes v1.28, the concept of Sidecar containers existed only informally, with no dedicated API to distinguish them from business containers. Both types of containers were treated equally, which meant that Sidecar containers could be started after the business container and might end before it. Ideally, Sidecar containers should start before and finish after the business container to ensure complete collection of logs and monitoring data.

Kubernetes v1.28 introduced formal support for Sidecar containers at the API level, implementing unified lifecycle management for init containers, Sidecar containers, and business containers. This update also adjusted how resource requests and limits are calculated for Pods, and the feature entered Beta status in v1.29.

The development of this feature involved extensive discussions, mainly focusing on maintaining compatibility with existing APIs and minimizing disruptive changes. Rather than introducing a new container type, Kubernetes reuses the init container type and designates Sidecar containers by setting the init container’s restartPolicy to Always. This approach addresses both API compatibility and lifecycle management issues effectively.

With this update, the scheduling of Pods now considers the Sidecar container’s resource requests as part of the business container’s total requests. Consequently, the Volcano scheduler has been updated to support this new calculation method, allowing users to schedule Sidecar containers with Volcano.

For more information on Sidecar containers, visit Sidecar Containers | Kubernetes.
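The changed resource calculation can be illustrated with a small Python sketch. It follows the formula from the Kubernetes sidecar design as we read it: the Pod-level request is the larger of the peak during initialization (each regular init container plus the sidecars declared before it) and the steady state (all sidecars plus all business containers). It handles one resource only and ignores Pod overhead; the function name and the sample numbers are hypothetical, and this is not Volcano's implementation.

```python
def effective_cpu_request(init_containers, app_containers):
    """Return the Pod-level CPU request in millicores.

    init_containers: list of (cpu_millicores, is_sidecar) in declaration order;
                     a sidecar is an init container with restartPolicy: Always.
    app_containers:  list of cpu_millicores for the business containers.
    """
    sidecar_sum = 0   # sidecars keep running once they have started
    init_peak = 0     # highest request seen while init containers run
    for cpu, is_sidecar in init_containers:
        if is_sidecar:
            sidecar_sum += cpu
        else:
            # a regular init container runs next to the sidecars declared before it
            init_peak = max(init_peak, sidecar_sum + cpu)
    steady_state = sidecar_sum + sum(app_containers)
    return max(init_peak, steady_state)


# A log-collector sidecar (100m), a heavy one-off init step (500m), two business containers.
with_heavy_init = effective_cpu_request([(100, True), (500, False)], [200, 100])
# Only a sidecar and one business container.
sidecar_only = effective_cpu_request([(100, True)], [200])
print(with_heavy_init, sidecar_only)
```

In the first case the initialization peak (600m) dominates; in the second the steady state does (300m), which is the sidecar's request added to the business container's request.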

Enhance Vcctl Command Line Tool

vcctl is a command line tool for operating Volcano's built-in CRD resources. It can be conveniently used to view/delete/pause/resume vcjob resources, and supports viewing/deleting/opening/closing/updating queue resources. Volcano has enhanced vcctl in the new version, adding the following features:

Support creating/deleting/viewing/describing jobflow and jobtemplate resources
Support querying vcjob in a specified queue
Support querying Pods by queue and vcjob filtering

For detailed guidance documents on vcctl, please refer to: vcctl Command Line Enhancement.

Ensure Compatibility with Kubernetes v1.30

Volcano closely follows the pace of Kubernetes community releases and supports every major version of Kubernetes. The latest supported version is v1.30, for which complete UT and E2E test cases are run to ensure functionality and reliability.

If you want to participate in the development of Volcano adapting to the new version of Kubernetes, please refer to: adapt-k8s-todo for community contributions.

Strengthen Volcano Security Measures

Volcano has always attached great importance to the security of the open source software supply chain. It follows the specifications defined by OpenSSF for license compliance, security vulnerability disclosure and repair, repository branch protection, CI checks, and so on. Volcano recently added a new GitHub Actions workflow that runs OpenSSF security checks when code is merged and updates the software security score in real time to continuously improve software security.

At the same time, Volcano has reduced the RBAC permissions of each component, retaining only the necessary permissions, avoiding potential risks of unauthorized access and improving the security of the system.

Shrink permissions of vc scheduler & controller by Monokaix · Pull Request #3545 · volcano-sh/volcano (github.com)

Add pre-install&pre-upgrade hook for admission-init job by Monokaix · Pull Request #3504 · volcano-sh/volcano (github.com)

Optimize Volcano for Large-Scale Performance

In large-scale scenarios, Volcano has done a lot of performance optimization work, mainly including:

Optimize the vcjob update strategy to reduce the frequency of vcjob updates and synchronization, reduce API Server pressure, and improve the QPS of submitted tasks
Add a controller gate switch to the vc controller so that users can disable unnecessary controllers, reducing memory usage and CPU load
All controllers use a shared informer to reduce memory usage

Improve GPU Monitoring Function

The new version of Volcano optimizes and enhances GPU monitoring metrics, fixes inaccurate GPU monitoring, and adds node information to the GPU computing power and video memory metrics. Users can now view more intuitively the computing power of each GPU on each node, as well as the total and allocated amounts of video memory.

Optimize Helm Chart Installation And Upgrade Processes

Volcano has optimized the installation and upgrade process of its helm chart and supports more custom parameters at installation time, mainly including:

By using the helm hook mechanism, the volcano-admission-init job is automatically deleted after Volcano is installed successfully, so that a subsequent helm upgrade does not fail, related PR: Add pre-install&pre-upgrade hook for admission-init job by Monokaix · Pull Request #3504 · volcano-sh/volcano (github.com)
Update the secret required by Volcano admission after each successful installation, so that the Volcano admission process does not fail when Volcano is repeatedly installed and uninstalled without specifying the helm package name, related PR: Update volcano-admission secret when it already exists by Monokaix · Pull Request #3653 · volcano-sh/volcano (github.com)
Support setting common labels for resource objects in helm packages, related PR: Add common labels for chart objects by Aakcht · Pull Request #3511 · volcano-sh/volcano (github.com)
Support setting log level for Volcano components through helm, related PR: Expose volcano components (controller, scheduler, etc.) log level control to the helm chat values by chenshiwei-io · Pull Request #3656 · volcano-sh/volcano (github.com)
Support specifying the image registry of Volcano components through helm, related PR: add image registry for helm by calvin0327 · Pull Request #3436 · volcano-sh/volcano (github.com)
Support setting container-level securityContext through helm, related PR: feat: Add securityContext support at container level in helm chart templates by lekaf974 · Pull Request #3704 · volcano-sh/volcano (github.com)

Contributors

Volcano v1.10.0 includes hundreds of contributions from 36 community contributors. Thank you for your contributions.

Contributors on GitHub:

@googs1025	@WulixuanS	@SataQiu
@guoqinwill	@lowang-bh	@shruti2522
@lukasboettcher	@wangyysde	@bibibox
@Wang-Kai	@y-ykcir	@lekaf974
@yeahdongcn	@Monokaix	@Aakcht
@yxxhero	@babugeet	@liuyuanchun11
@MichaelXcc	@william-wang	@lengrongfu
@xieyanker	@lx1036	@archlitchi
@hwdef	@wangyang0616	@microyahoo
@snappyyouth	@harshitasao	@chenshiwei-io
@TaiPark	@ykcai-daniel	@JesseStutler
@belo4ya

Reference

Release note: v1.10.0

https://github.com/volcano-sh/volcano/releases/tag/v1.10.0

Branch: release-1.10

https://github.com/volcano-sh/volcano/tree/release-1.10

About Volcano

Volcano is designed for high-performance computing applications such as AI, big data, gene sequencing, and rendering, and supports mainstream general computing frameworks. More than 58,000 global developers joined us, among whom the in-house ones come from companies such as Huawei, AWS, Baidu, Tencent, JD, and Xiaohongshu. There are 4.1k+ Stars and 900+ Forks for the project. Volcano has been proven feasible for mass data computing and analytics, such as AI, big data, and gene sequencing. Supported frameworks include Spark, Flink, TensorFlow, PyTorch, Argo, MindSpore, Paddlepaddle, Kubeflow, MPI, Horovod, MXNet, KubeGene, and Ray. The ecosystem is thriving with more developers and use cases coming up.

Volcano v1.9.0 Available Now

Tue, 21 May 2024 00:00:00 GMT

On May 21, 2024, UTC+8, Volcano version v1.9.0 was officially released. This version added the following new features:

Support elastic queue capacity scheduling
Supports affinity scheduling between queues and nodes
GPU sharing feature supports node scoring scheduling
Volcano Support for Kubernetes v1.29
Enhance scheduler metrics
Add license compliance check
Improve scheduling stability

Key Features

Support elastic queue capacity scheduling

Volcano currently uses the proportion plugin for queue management. Users can set the guarantee, capacity, and other fields of a queue to define its reserved resources and capacity limit. Resource sharing within the cluster is achieved by setting the queue's weight: cluster resources are divided among queues in proportion to their weights. However, this queue management method has the following problems:

The capacity of the resources divided by the queue is reflected by the weight, which is not intuitive enough.
All resources in the queue are divided using the same ratio, and the capacity cannot be set separately for each dimension of the queue.
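The second problem is easy to see in a small Python sketch of weight-based division. This is a simplified model of proportional sharing, not the proportion plugin's actual code; the function name, the queue names, and the cluster figures are hypothetical.

```python
def split_by_weight(cluster: dict, weights: dict) -> dict:
    """Divide every resource dimension among queues in proportion to their weights."""
    total = sum(weights.values())
    return {
        queue: {res: amount * w / total for res, amount in cluster.items()}
        for queue, w in weights.items()
    }


cluster = {"cpu": 100, "memory_gi": 400, "gpu": 16}
shares = split_by_weight(cluster, {"training": 3, "inference": 1})
for queue, resources in shares.items():
    print(queue, resources)
```

With weights 3:1, the training queue receives 75% of the CPU, 75% of the memory, and 75% of the GPUs. There is no way to give it, say, most of the GPUs but only a small part of the CPU.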

Based on the above considerations, Volcano implements a new elastic queue capacity management capability. It supports:

Allowing users to directly set the capacity of each resource dimension for the queue instead of setting a weight value.
Elastic capacity scheduling based on deserved resources, so that a queue's resources can be shared with other queues and reclaimed when needed.

For example, in an AI large-model training scenario, users can set different resource capacities for different GPU models in the queue, such as A100 and V100. When the cluster has idle resources, a queue can use the resources of other idle queues; when needed, it can reclaim the resources the user set for it, that is, its deserved amount. This realizes elastic capacity scheduling.

To use this feature, you need to set the deserved field of the queue and set the amount of resources to be deserved for each dimension. At the same time, you need to turn on the capacity plugin and turn off the proportion plugin in the scheduling configuration.

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: demo-queue
spec:
  reclaimable: true
  deserved: # set the deserved field.
    cpu: 64
    memory: 128Gi
    nvidia.com/a100: 40
    nvidia.com/v100: 80

For a complete usage example of queue elastic capacity scheduling, please refer to: How to use capacity plugin.

For the elastic queue capacity design document, please refer to: Capacity scheduling Design.
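The sharing and reclaim idea can be sketched as follows. This is an assumption-laden simplification rather than the capacity plugin's algorithm: a queue that borrowed idle resources holds more than its deserved amount, and the surplus in each dimension is what other queues may reclaim. The function name and the allocated figures are hypothetical.

```python
def reclaimable(allocated: dict, deserved: dict) -> dict:
    # Anything a queue holds beyond its deserved amount was borrowed and can be reclaimed.
    return {
        res: max(allocated.get(res, 0) - deserved.get(res, 0), 0)
        for res in allocated
    }


deserved = {"cpu": 64, "nvidia.com/a100": 40, "nvidia.com/v100": 80}
# The queue borrowed idle CPU and V100 cards from other queues.
allocated = {"cpu": 80, "nvidia.com/a100": 40, "nvidia.com/v100": 96}

surplus = reclaimable(allocated, deserved)
print(surplus)
```

Because deserved is set per dimension, the A100 allocation is untouched while 16 CPUs and 16 V100 cards can be handed back when their owners need them.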

Supports affinity scheduling between queues and nodes

Queues are usually associated with departments within a company, and different departments usually need different heterogeneous resource types. For example, the large model training team needs NVIDIA's Tesla GPUs, and the recommendation team needs AMD's GPUs. When users submit jobs to a queue, the jobs need to be automatically scheduled to nodes of the corresponding resource type according to the attributes of the queue.

Volcano has implemented affinity scheduling between queues and nodes. Users only need to set the node labels that require affinity in the affinity field of the queue, and Volcano will automatically schedule jobs submitted to that queue to the nodes associated with it. Users do not need to set the affinity of each job separately; setting the affinity of the queue once is enough.

This feature supports hard affinity, soft affinity, and anti-affinity scheduling at the same time. To use it, set a label with the key volcano.sh/nodegroup-name on the nodes, and then set the affinity field of the queue to specify the label values for hard and soft affinity.

For example, the following queue setting means that jobs submitted to the queue need to be scheduled to nodes with the label value groupname1 or groupname2, with nodes labeled groupname2 preferred. At the same time, jobs cannot be scheduled to nodes with the label value groupname3 or groupname4; when resources are insufficient, however, they can still be scheduled to nodes with the label value groupname3.

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: default
spec:
  reclaimable: true
  weight: 1
  affinity:            # added field
    nodeGroupAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - groupname1
      - groupname2
      preferredDuringSchedulingIgnoredDuringExecution:
      - groupname2
    nodeGroupAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - groupname3
      - groupname4
      preferredDuringSchedulingIgnoredDuringExecution:
      - groupname3

The scheduling plugin for this feature is called nodegroup. For a complete example of its use, see: [How to use nodegroup plugin](https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_nodegroup_plugin.md).

For detailed design documentation, see The nodegroup design.
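As a rough illustration of how queue-level affinity narrows down nodes, the Python sketch below filters nodes by the hard rules and then puts the preferred group first. It is a deliberately simplified model, not the nodegroup plugin itself: soft anti-affinity is left out, and the function name and node names are hypothetical.

```python
def filter_and_rank(nodes, required, preferred, anti_required):
    """nodes: dict of node name -> value of the volcano.sh/nodegroup-name label."""
    candidates = [
        name for name, group in nodes.items()
        if group in required and group not in anti_required
    ]
    # Nodes in a preferred group sort first; sorted() is stable, so ties keep input order.
    return sorted(candidates, key=lambda name: nodes[name] not in preferred)


nodes = {
    "node-a": "groupname1",
    "node-b": "groupname2",
    "node-c": "groupname3",
    "node-d": "groupname4",
}
ranked = filter_and_rank(
    nodes,
    required={"groupname1", "groupname2"},
    preferred={"groupname2"},
    anti_required={"groupname3", "groupname4"},
)
print(ranked)
```

Every job in the queue inherits this result, which is why no per-job affinity needs to be written.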

GPU sharing feature supports node scoring scheduling

GPU Sharing is a GPU sharing and isolation solution introduced in Volcano v1.8. It provides GPU sharing and device memory control capabilities to enhance GPU resource utilization in AI training and inference scenarios. v1.9 adds a new scoring strategy for GPU nodes on top of this feature, so that the optimal node can be selected during job assignment to further enhance resource utilization. Users can set different scoring strategies. Currently, the following two strategies are supported:

Binpack: Provides a binpack algorithm at GPU card granularity, preferring to fill up nodes whose GPU cards already have resources allocated, to avoid resource fragmentation and waste.
Spread: Prioritizes the use of idle GPU cards over shared cards that have already been allocated resources.

For detailed usage documentation, please refer to: How to use gpu sharing.
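The difference between the two strategies can be sketched at card level in Python. This is a toy model under stated assumptions, not Volcano's scoring code: each card is described only by the fraction already allocated, and pick_gpu is a hypothetical helper.

```python
def pick_gpu(used_fraction, request, strategy):
    """used_fraction: already allocated fraction (0.0 to 1.0) of each GPU card."""
    fits = [i for i, used in enumerate(used_fraction) if used + request <= 1.0]
    if not fits:
        return None
    if strategy == "binpack":
        # prefer the fullest card that still fits, leaving idle cards untouched
        return max(fits, key=lambda i: used_fraction[i])
    if strategy == "spread":
        # prefer the emptiest card, avoiding cards that are already shared
        return min(fits, key=lambda i: used_fraction[i])
    raise ValueError(f"unknown strategy: {strategy}")


cards = [0.5, 0.0, 0.75]   # card 1 is idle, cards 0 and 2 are partly allocated
binpack_choice = pick_gpu(cards, 0.25, "binpack")
spread_choice = pick_gpu(cards, 0.25, "spread")
print(binpack_choice, spread_choice)
```

Binpack fills card 2 completely and keeps card 1 free for a job that needs a whole GPU, while spread places the job on the idle card 1.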

Volcano Support for Kubernetes v1.29

Volcano releases follow the Kubernetes community release tempo and support every base version of Kubernetes. The latest supported version is v1.29, for which the full set of UT and E2E test cases has been run to ensure functionality and reliability. If you would like to participate in adapting Volcano to new versions of Kubernetes, please refer to: https://github.com/volcano-sh/volcano/pull/3459 to make community contributions.

Enhance scheduler metrics

Volcano uses client-go to talk to Kubernetes. Although the client can set a QPS limit to avoid being rate-limited, it is difficult to observe how much QPS the client actually uses. To make the request frequency observable in real time, Volcano has added client-go metrics, which show the number of GET, POST, and other requests per second. Users can thus see the QPS actually used and decide whether the client's QPS setting needs to be adjusted. The client-go metrics also include client certificate rotation cycle statistics, per-request response size statistics, and more.

Users can use curl http://$volcano_scheduler_pod_ip:8080/metrics to get all the detailed metrics of volcano scheduler.
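One way to turn two scrapes of that endpoint into an actual QPS figure is sketched below in Python. It assumes the request counter is exposed under the standard client-go name rest_client_requests_total with a method label; whether Volcano exposes it under exactly this name should be checked against the real output. The sample scrape lines are made up for illustration.

```python
def total_requests(metrics_text: str, metric: str = "rest_client_requests_total") -> dict:
    """Sum a counter per HTTP method from Prometheus text-format output."""
    totals = {}
    for line in metrics_text.splitlines():
        if not line.startswith(metric + "{") or 'method="' not in line:
            continue
        labels, value = line.rsplit(" ", 1)
        method = labels.split('method="', 1)[1].split('"', 1)[0]
        totals[method] = totals.get(method, 0.0) + float(value)
    return totals


def qps(before: dict, after: dict, interval_seconds: float) -> dict:
    # Counters only grow, so the rate is the difference divided by the interval.
    return {m: (after[m] - before.get(m, 0.0)) / interval_seconds for m in after}


# Two made-up scrapes taken 30 seconds apart.
scrape_1 = "\n".join([
    'rest_client_requests_total{code="200",host="10.0.0.1:443",method="GET"} 1200',
    'rest_client_requests_total{code="201",host="10.0.0.1:443",method="POST"} 300',
])
scrape_2 = "\n".join([
    'rest_client_requests_total{code="200",host="10.0.0.1:443",method="GET"} 1500',
    'rest_client_requests_total{code="201",host="10.0.0.1:443",method="POST"} 360',
])
rates = qps(total_requests(scrape_1), total_requests(scrape_2), 30)
print(rates)
```

Comparing these rates with the configured client QPS limit shows whether the limit is close to being hit. In practice a monitoring system such as Prometheus would compute the same rate from the scraped counter.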

Related PR: #3274 (@Monokaix)

Add license compliance check

To strengthen open source license compliance governance in the Volcano community and avoid the risks of introducing copyleft ("infectious") open source licenses, the community has introduced an open source license compliance checking tool. An infectious license is one under which derivative works produced by modifying, using, or copying the licensed software must also be open sourced under the same license. If a third-party library introduced by a developer's PR uses an infectious open source license such as GPL or LGPL, the CI gate will block it. The developer needs to replace the third-party library with one under a permissive license such as MIT, Apache 2.0, or BSD to pass the open source license compliance check.
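The gate can be pictured as a simple deny-list check over the licenses of newly introduced dependencies. The Python sketch below is only a model of that idea: it does not reproduce the tool the community uses, and the function name and dependency names are hypothetical.

```python
COPYLEFT_PREFIXES = ("GPL", "LGPL")   # license families named in the text


def violations(dependencies: dict) -> list:
    """dependencies: module name -> license identifier; returns the offending modules."""
    return sorted(
        name for name, license_id in dependencies.items()
        if license_id.upper().startswith(COPYLEFT_PREFIXES)
    )


# Hypothetical dependencies introduced by a PR.
new_deps = {
    "example.com/fastjson": "MIT",
    "example.com/netlib": "LGPL-2.1",
    "example.com/k8s-helper": "Apache-2.0",
    "example.com/compressor": "GPL-3.0",
}
blocked = violations(new_deps)
print(blocked)
```

A CI job built on this idea would fail whenever the returned list is non-empty.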

Improve scheduling stability

Volcano v1.9.0 has done more optimization in preemption, retry for scheduling failure, avoiding memory leaks, security enhancement, etc. The details include:

Fix the problem of pods not being able to be scheduled due to frequent scale-out and scale-in of a deployment in extreme cases. See PR #3376 (@guoqinwill).
Fix Pod preemption. See PR #3458 (@LivingCcj).
Optimize the Pod scheduling failure retry mechanism. See PR #3435 (@bibibox).
Metrics optimization: #3463 (@Monokaix).
Security enhancements: #3449 (@lekaf974).

Contributors

Volcano v1.9.0 is the result of hundreds of code commits from many contributors. Thank you for your contributions.

Contributors on GitHub:

@daniel-hutao	@wuyueandrew	@googs1025
@7sunarni	@flyingfang	@LivingCcj
@guoqinwill	@panoswoo	@william-wang
@lekaf974	@yangqz	@lowang-bh
@loheagn	@hwdef	@archlitchi
@Lily922	@bibibox	@Monokaix
@belo4ya

Reference

Release note: v1.9.0

https://github.com/volcano-sh/volcano/releases/tag/v1.9.0

Branch: release-1.9

https://github.com/volcano-sh/volcano/tree/release-1.9

About Volcano

Volcano is designed for high-performance computing applications such as AI, big data, gene sequencing, and rendering, and supports mainstream general computing frameworks. More than 58,000 global developers joined us, among whom the in-house ones come from companies such as Huawei, AWS, Baidu, Tencent, JD, and Xiaohongshu. There are 3.8k+ Stars and 800+ Forks for the project. Volcano has been proven feasible for mass data computing and analytics, such as AI, big data, and gene sequencing. Supported frameworks include Spark, Flink, TensorFlow, PyTorch, Argo, MindSpore, Paddlepaddle, Kubeflow, MPI, Horovod, MXNet, KubeGene, and Ray. The ecosystem is thriving with more developers and use cases coming up.

Meet Cloud Native Batch Computing with Volcano in AI & Big Data Scenarios

Fri, 08 Mar 2024 00:00:00 GMT

Cloud native batch computing engine Volcano is designed for high-performance computing applications such as AI, big data, gene sequencing, and rendering, and supports mainstream general computing frameworks. More than 58,000 global developers joined us, among whom the in-house ones come from companies such as Huawei, AWS, Baidu, Tencent, JD, and Xiaohongshu. There are 3.7k+ Stars and 800+ Forks for the project. Volcano has been proven feasible for mass data computing and analytics, such as AI, big data, and gene sequencing. Supported frameworks include Spark, Flink, TensorFlow, PyTorch, Argo, MindSpore, Paddlepaddle, Kubeflow, MPI, Horovod, MXNet, KubeGene, and Ray. The ecosystem is thriving with more developers and use cases coming up.

As the industry-first cloud native batch computing project, Volcano was open-sourced at KubeCon Shanghai in June 2019 and became an official CNCF project in April 2020. In April 2022, Volcano was promoted to a CNCF incubating project. By now, more than 600 global developers have committed code to the project. The community is seeing growing popularity among developers, partners, and users.

Try new features in Volcano v1.8.2

In Volcano's latest release, v1.8.2, the following new features are added:

Support for vGPU scheduling and isolation
Support for vGPU and user-defined resource preemption capabilities
Addition of JobFlow workflow scheduling engine
Node load-aware scheduling and rescheduling support for diverse monitoring systems
Optimization of Volcano’s ability to schedule microservices
Optimization of Volcano charts packages for publishing and archiving

Try Volcano v1.8.2: https://github.com/volcano-sh/volcano/releases/tag/v1.8.2

Join Volcano Community Co-construction Program

Recently, more than 50 cases related to Volcano have been implemented. These cases are widely distributed in industries such as Internet, advanced manufacturing, finance, life sciences, scientific research, autonomous driving, and medicine. They cover massive data computing and analysis scenarios like AI, big data, genomic sequencing, and rendering. The main users are Tencent, Amazon, ING Bank, Baidu, Xiaohongshu, DiDi, 360, iQIYI, Leinao, Pengcheng Laboratory, Cruise, Li Auto, Unisound, Ximalaya, Vipshop, GrandOmics, BOSS Zhipin, and so on. As the Volcano ecosystem expands, more and more users are eager to join the community.

The Volcano community launched the co-construction program to welcome users into the Volcano community, to accelerate cloud native progress, and to ensure a diverse Volcano ecosystem.

Through this program, you will have opportunities for technological guidance, promotion, as well as online and offline technological sharing. If your company or organization recognizes the value that Volcano has to offer, wants help using Volcano, or wants to exert their technological influence, consider joining the program. For details about the requirements and benefits, see https://github.com/volcano-sh/community/blob/master/community-building-program.md

Join Volcano at KubeCon + CloudNativeCon Europe, 19-22 March in Paris!

Volcano will participate in several activities, including:

Speech Schedule
- March 19, 14:05 - 14:30 CET: Level 7.3 | Room S03. Volcano Maintainer Kevin Wang, Huawei, presents “Efficient Multi-Cluster GPU Workload Management with Karmada and Volcano”
- March 22, 11:55 - 12:30 CET: Pavilion 7 | Level 7.3 | N03. Volcano Maintainer William Wang, Huawei & Mengxuan Li, 4paradigm present “Cloud Native Batch Computing with Volcano: Updates and Future”
- March 22, 16:00 - 16:35 CET: Pavilion 7 | Level 7.3 | Paris Room. Volcano Maintainer William Wang & Hongcai Ren, Huawei present “Maximizing GPU Utilization Over Multi-Cluster: Challenges and Solutions for Cloud-Native AI Platform”
Booth Hours:
- March 20-22 PM (W, Th, F): Stop by CNCF Project Pavilion Booth PP18-B at KubeCon + CloudNativeCon Europe to speak with an expert or see a demo!

Volcano v1.8.2 Available Now

Wed, 31 Jan 2024 00:00:00 GMT

On January 9, 2024, UTC+8, Volcano version v1.8.2 was officially released. This version added the following new features:

Support for vGPU scheduling and isolation
Support for vGPU and user-defined resource preemption capabilities
Addition of JobFlow workflow scheduling engine
Node load-aware scheduling and rescheduling support for diverse monitoring systems
Optimization of Volcano's ability to schedule microservices
Optimization of Volcano charts packages for publishing and archiving

Key Features

Support for vGPU scheduling and isolation

Since ChatGPT became popular, research and development of large AI models has surged, and new models are launched one after another. Because training them requires enormous computing power, the supply of GPU-centered computing power has become key infrastructure for the large model industry. In practice, users see low GPU utilization and inflexible resource allocation, and must purchase a large amount of redundant heterogeneous computing power to meet business needs. Heterogeneous computing power is itself costly, which places a heavy burden on enterprises. Starting from version 1.8, Volcano provides a general abstraction framework for shareable devices (GPU, NPU, FPGA...), on which developers can build multiple types of shared devices. Volcano has currently implemented GPU virtualization on this framework, supporting GPU device multiplexing, resource isolation, and other capabilities, as follows:

GPU Sharing: Each task can request to use part of the resources of a GPU card, and GPU cards can be shared among multiple tasks.
Device Video Memory Control: GPUs can be allocated according to memory (e.g., 3000M) or proportionally (e.g., 50%) to achieve GPU virtualization resource isolation capability.
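The two ways of requesting device memory can be illustrated with a short Python sketch. It is a simplified model of sharing one card, not Volcano's vGPU implementation; the function names, task names, and the 16384 MB card size are hypothetical.

```python
def requested_mb(card_total_mb: int, memory_mb=None, percent=None) -> int:
    """Translate a request given in megabytes or as a percentage into megabytes."""
    if (memory_mb is None) == (percent is None):
        raise ValueError("specify exactly one of memory_mb or percent")
    if memory_mb is not None:
        return memory_mb
    return card_total_mb * percent // 100


def place(tasks, card_total_mb: int):
    """Admit tasks onto one card, in order, while device memory remains."""
    free = card_total_mb
    admitted = []
    for name, request in tasks:
        need = requested_mb(card_total_mb, **request)
        if need <= free:
            free -= need
            admitted.append(name)
    return admitted, free


tasks = [
    ("train", {"percent": 50}),        # half of the card
    ("infer-a", {"memory_mb": 3000}),  # a fixed amount
    ("infer-b", {"percent": 50}),      # no longer fits
]
admitted, free_mb = place(tasks, card_total_mb=16384)
print(admitted, free_mb)
```

Two tasks share the card, each confined to its own slice of device memory, and the third is turned away because the remaining memory is smaller than its request.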

For more information about vGPU, please refer to:

How to use the vGPU feature:

/docs/KeyFeatures/GPUVirtualization

How to add new heterogeneous computing power sharing strategies:

https://github.com/volcano-sh/volcano/blob/master/docs/design/device-sharing.md

Support for vGPU and user-defined resource preemption capabilities

Previously, Volcano supported preemption of basic resources such as CPU and memory, but not preemption of GPU resources or of resources that users manage themselves through scheduling plug-ins developed on the Volcano framework (e.g., NPU, network resources). In version 1.8, Volcano restructured the node filtering logic (the PredicateFn callback function) and added a Status type to its return value, which indicates whether the current node meets the conditions for placing a job in scheduling, preemption, and other scenarios. GPU preemption has been released on top of this optimized framework, and scheduling plug-ins that users have developed on Volcano can be adapted and upgraded to fit their business scenarios. In version 1.8.2, Volcano also supports preemption based on the number of node CSIs and the number of node Pods.

For more information on supporting extended resource preemption, please refer to:

https://github.com/volcano-sh/volcano/pull/2916

Addition of JobFlow workflow scheduling engine

The JobFlow orchestration engine is widely used in high-performance computing, AI biomedicine, image processing and beautification, game AGI, scientific computing, and other scenarios. It helps users simplify the management of parallel jobs and their dependencies, and significantly improves overall computing efficiency. JobFlow is a lightweight task flow orchestration engine focused on Volcano job orchestration. It provides Volcano with diverse job dependency types such as job probes, job completion dependencies, and job failure rate tolerance, and supports complex process control primitives. Its specific capabilities are:

Supports large-scale job management and complex task flow scheduling.
Supports real-time queries of all related jobs and task progress.
Supports automatic operation of jobs and timed startup to reduce labor costs.
Supports a variety of action strategies for different tasks: when a task meets specific conditions, the corresponding action is triggered, such as retry on timeout or drift on node failure.

[Figure: demonstration of a running JobFlow task]

For more information about JobFlow, please refer to:

https://github.com/volcano-sh/volcano/blob/master/docs/design/jobflow/README.md
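The core of dependency-based orchestration is deciding which jobs may start once their predecessors have completed. The Python sketch below shows this with the standard library's graphlib module (Python 3.9+). It is a generic illustration of completion dependencies, not JobFlow's implementation, and the four-job flow is hypothetical.

```python
from graphlib import TopologicalSorter

# job -> set of jobs that must complete first (hypothetical flow)
flow = {
    "prepare-data": set(),
    "train-model-a": {"prepare-data"},
    "train-model-b": {"prepare-data"},
    "evaluate": {"train-model-a", "train-model-b"},
}

sorter = TopologicalSorter(flow)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())   # jobs whose dependencies are all complete
    waves.append(ready)
    sorter.done(*ready)                  # pretend the whole wave finished successfully

for number, jobs in enumerate(waves, start=1):
    print(number, jobs)
```

The two training jobs land in the same wave and can run in parallel, while the evaluation job waits for both to finish.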

Node load-aware scheduling and rescheduling support for diverse monitoring systems

The state of a Kubernetes cluster changes in real time as tasks are created and finished. In some scenarios (e.g., adding or removing nodes, changing the affinity of Pods and Nodes, dynamic changes in the job lifecycle), problems arise such as unbalanced resource utilization across cluster nodes and node performance bottlenecks. Scheduling and rescheduling based on real load can help solve these problems. Before Volcano version 1.8, metrics for real-load scheduling and rescheduling could only be obtained from Prometheus. From version 1.8 onwards, Volcano optimizes the monitoring metrics acquisition framework, adds support for the ElasticSearch monitoring system, and allows more types of monitoring systems to be integrated smoothly with little adaptation work.

For more information on supporting multiple monitoring systems, please refer to:

Node load-aware scheduling:

https://github.com/volcano-sh/volcano/blob/master/docs/design/usage-based-scheduling.md

Rescheduling:

https://github.com/volcano-sh/volcano/blob/master/docs/design/rescheduling.md

Optimization of Volcano's ability to schedule microservices

Add Kubernetes default scheduler plugin switch

Volcano is a unified scheduling system that supports not only AI, big data, and other computing jobs but also microservice workloads. It is compatible with the scheduling plugins of the Kubernetes default scheduler, such as PodTopologySpread, VolumeZone, VolumeLimits, NodeAffinity, and PodAffinity, and these plugin capabilities are enabled by default in Volcano. Since Volcano 1.8, the Kubernetes default scheduling plugins can be turned on and off individually through the configuration file, and all of them remain on by default. To disable some plugins, such as PodTopologySpread and VolumeZone, set the corresponding values in the predicates plugin to false:

actions: "allocate, backfill, preempt"
tiers:
- plugins:
    - name: priority
    - name: gang
    - name: conformance
- plugins:
    - name: drf
    - name: predicates
      arguments:
        predicate.VolumeZoneEnable: false
        predicate.PodTopologySpreadEnable: false
    - name: proportion
    - name: nodeorder

For more information, please refer to:

https://github.com/volcano-sh/volcano/issues/2748

Enhanced Cluster Autoscaling Compatibility

On Kubernetes, Volcano is increasingly used as a scheduler for general-purpose services in addition to batch computing. Node autoscaling is one of the core features of Kubernetes and plays an important role in handling surges in user traffic and saving operational costs. Volcano optimizes job scheduling and related logic to improve compatibility and interaction with Cluster Autoscaler, mainly in the following two areas:

Scale-out is triggered promptly for pods that enter the pipeline state during the scheduling phase.
Candidate nodes are scored in gradients to reduce the impact of terminating pods on scheduling, which prevents pods from entering an invalid pipeline state and causing unnecessary cluster scale-out.

For more information, please refer to:

https://github.com/volcano-sh/volcano/issues/3000

https://github.com/volcano-sh/volcano/issues/2782

Fine-grained management of Node resources for increased resilience

When a node's total resources are less than its allocated resources, for example because a device plugin reports abnormal data, Volcano considers the node's data inconsistent, isolates the node, and stops scheduling new workloads to it. Version 1.8 refines this node resource management. For example, when the total GPU capacity of a node is less than the amount of GPU resources already allocated, pods requesting GPU resources are not scheduled to that node, while jobs requesting only non-GPU resources can still be scheduled to it normally.
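
The refined behavior can be sketched as a per-resource consistency check. This is an illustration only, not Volcano's code: the function name `can_schedule`, the dictionaries, and the resource quantities are assumptions, and the sketch deliberately ignores whether enough free capacity remains.

```python
def can_schedule(pod_requests, node_total, node_allocated):
    """Reject a pod only if it requests a resource whose accounting is inconsistent.

    A resource is treated as inconsistent when the node reports a total
    capacity smaller than the amount already allocated.
    """
    inconsistent = {
        name
        for name, total in node_total.items()
        if total < node_allocated.get(name, 0)
    }
    return not any(
        amount > 0 and name in inconsistent
        for name, amount in pod_requests.items()
    )


# Hypothetical node: the device plugin reports 2 GPUs, but 4 are allocated.
node_total = {"cpu": 32, "memory": 128, "nvidia.com/gpu": 2}
node_allocated = {"cpu": 8, "memory": 32, "nvidia.com/gpu": 4}

gpu_pod = {"cpu": 4, "memory": 16, "nvidia.com/gpu": 1}
cpu_pod = {"cpu": 4, "memory": 16}

print(can_schedule(gpu_pod, node_total, node_allocated))  # GPU pod is refused
print(can_schedule(cpu_pod, node_total, node_allocated))  # CPU-only pod is allowed
```

Under the node-level isolation described first, both pods would have been refused; checking each resource separately keeps the healthy resources usable.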

For more information, please refer to:

https://github.com/volcano-sh/volcano/issues/2999

Optimization of Volcano charts packages for publishing and archiving

As Volcano is used in more and more production and cloud environments, it is important to have a clean and standardized installation process. Starting from version 1.8, Volcano optimizes how the charts package is published and archived, standardizes the installation and usage process, and has migrated historical versions (v1.6, v1.7) to the new Helm repository. The repository is used as follows:

Add the Volcano charts repository

helm repo add volcano-sh https://volcano-sh.github.io/helm-charts

Search for all installable versions of Volcano

helm search repo volcano -l

Install the latest version of Volcano

helm install volcano volcano-sh/volcano -n volcano-system --create-namespace

Install the specified version of Volcano, e.g. 1.8.2.

helm install volcano volcano-sh/volcano -n volcano-system --create-namespace --version 1.8.2

For more information on the Volcano charts package, please refer to:

https://github.com/volcano-sh/helm-charts

Contributors

Volcano 1.8.2 is the result of hundreds of code commits from 33 contributors. Thank you for your contributions.

Contributors on GitHub:

@shaobo76	@william-wang	@gengwg
@kingeasternsun	@Aakcht	@waiterQ
@Shoothzj	@hwdef	@halegreen
@wulixuan	@Monokaix	@medicharlachiranjeevi
@WulixuanS	@rayoluo	@lowang-bh
@gj199575	@noyoshi	@Tongruizhe
@jinzhejz	@Cdayz	@Mufengzhe
@renwenlong-github	@wangyang0616	@jiamin13579
@zbbkeepgoing	@jiangkaihua	@z2Zhang
@archlitchi	@lixin963	@xiao-jay
@Yanping-io	@Lily922	@shusley244

Reference

Release note: v1.8.0

https://github.com/volcano-sh/volcano/releases/tag/v1.8.0

Release note: v1.8.1

https://github.com/volcano-sh/volcano/releases/tag/v1.8.1

Release note: v1.8.2

https://github.com/volcano-sh/volcano/releases/tag/v1.8.2

Branch: release-1.8

https://github.com/volcano-sh/volcano/tree/release-1.8

About Volcano

Volcano is designed for high-performance computing applications such as AI, big data, gene sequencing, and rendering, and supports mainstream general computing frameworks. More than 58,000 developers worldwide have joined the community, including contributors from companies such as Huawei, AWS, Baidu, Tencent, JD, and Xiaohongshu. The project has 3.5k+ stars and 800+ forks. Volcano has proven itself in large-scale data computing and analytics, such as AI, big data, and gene sequencing. Supported frameworks include Spark, Flink, TensorFlow, PyTorch, Argo, MindSpore, PaddlePaddle, Kubeflow, MPI, Horovod, MXNet, KubeGene, and Ray. The ecosystem is thriving, with more developers and use cases emerging.

Volcano Community Co-construction Program

Fri, 11 Aug 2023 00:00:00 GMT

As artificial intelligence (AI) technologies advance and large language models (LLMs) grow more popular, the demand for AI compute has been booming. This has generated huge demand for high-performance scheduling of AI workloads and for hardware such as AI chips.

Volcano is the first cloud native batch computing project in the industry. In 2019, it was donated by Huawei Cloud to the Cloud Native Computing Foundation (CNCF) and became CNCF's first and only batch computing incubator project. Volcano provides unified high-performance job management for AI, big data, and high-performance computing (HPC) and supports a variety of high-level scheduling policies, including online and offline scheduling, AI elastic training scheduling, service level agreement (SLA), topology-based scheduling, fairness, load aware scheduling, rescheduling, preemption, and reclamation. It offers unified lifecycle management, job dependency management, and task dependency management for workloads like Spark, Flink, PyTorch, MPI, and TensorFlow. In terms of fine-grained resource management, Volcano supports min-max queue resource management, queue resource reservation, and dynamic resource sharing for multi-tenant resource leasing or preemption. Additionally, Volcano schedules heterogeneous resources including x86, Arm, GPUs, and Ascend, and provides refined scheduling of CPUs and GPUs. Users can allocate resources based on their requirements and significantly improve cost-effectiveness using Volcano.

The Volcano community has attracted more than 58,000 developers worldwide and won more than 3,200 Stars and over 730 Forks on GitHub. The contributors include Huawei, AWS, IBM, Baidu, Tencent, JD, Xiaohongshu, 4Paradigm, BoCloud, DaoCloud, Ruitian Capital, Qiniu Cloud, Yinqing Technology, ByteDance, Kuaishou, Unisound, Infosys, Visa, NetEase, Red Hat, Kingsoft Cloud, Inspur, ZTE, Oracle, and iQIYI.

More than 50 Volcano use cases have been implemented across industries such as the Internet, advanced manufacturing, finance, life sciences, scientific research, autonomous driving, and medicine. They cover large-scale data computing and analysis scenarios like AI, big data, genomic sequencing, and rendering. Main users include Tencent, Amazon, ING Bank, Baidu, Xiaohongshu, DiDi, 360, iQIYI, Leinao, Pengcheng Laboratory, Cruise, Li Auto, Unisound, Ximalaya, Vipshop, GrandOmics, and BOSS Zhipin. As the Volcano ecosystem expands, more and more users want to join the community. Huawei Cloud has worked with 11 partners to launch the Volcano community co-construction program and cultivate a more prosperous Volcano ecosystem.

According to Deng Mingkun, General Manager of Huawei Cloud Open Source Services, "The cloud native batch computing project, Volcano, has been widely adopted in domains such as AI, big data, genomic sequencing, rendering, transcoding, multimedia, and finance, since June 2019. A group of industry users not only actively promote the implementation of Volcano in production environments, but also contribute a lot to the Volcano community based on their own experience. Huawei Cloud intends to work with partners to launch the Volcano community co-construction program to create a more prosperous Volcano ecosystem and help more enterprises accelerate their cloud native progress."

The first batch of members to join the program are Baidu, BoCloud, 4Paradigm, Vipshop, Ruitian Capital, Leinao, Pinlan, 360, NetEase Shufan, Ximalaya, and BOSS Zhipin.

According to Zhou Ti, the tech lead of Baidu's PaddlePaddle open source ecosystem, "PaddlePaddle and Volcano jointly released the PaddlePaddle on Volcano solution to improve PaddlePaddle's computing efficiency. As a platform for high-performance computing, Volcano makes up for Kubernetes' lack of basic capabilities in machine learning, deep learning, HPC, and big data computing. Additionally, Volcano enhances the batch creation and lifecycle management of computing tasks, fair-share scheduling and other aspects on the basis of the native Kubernetes capability. These features meet PaddlePaddle's basic requirements."

Zhao Anquan, General Manager of BoCloud PaaS, said, "BoCloud's HPC solution, based on CNCF's Volcano scheduling engine, a product well respected by many customers, provides a high-concurrency computing platform that runs AI, big data, and simulation calculation applications, resolving many pain points in the industry. We also donated the industry's first HPC job orchestration component JobFlow to the Volcano community so that users can better apply cloud native technologies."

Li Mengxuan, head of heterogeneous computing virtualization in 4Paradigm, said, "The Volcano project enables us to solve the pain points encountered during the implementation of cloud native technologies in AI projects at a low cost, especially in terms of device reuse. The use of Volcano will significantly improve the cluster resource utilization. 4Paradigm will continuously contribute code to the community to build Volcano into a reuse platform that supports all mainstream forms of heterogeneous compute such as NPUs, GPUs, MLUs, and DCUs."

He Yingpeng, head of Vipshop's AI cloud platform, said, "As a top e-commerce platform in China, Vipshop faces problems associated with rapid growth, rapid product iteration, and maintaining a diverse product portfolio. A Volcano-based AI training platform with advanced scheduling policies like queue and gang scheduling can support scheduling of more than 100,000 vCPUs, accelerating Vipshop's service innovation."

Chang Feng, head of the Leinao R&D Center, said, "Volcano is one of the first open source cloud native projects for batch computing. It has dynamically configurable advanced scheduling policies and excellent resource management capabilities, which can address multiple challenges, like job scheduling, lifecycle management, and heterogeneous hardware support in AI scenarios. During the implementation, we expanded Volcano's capabilities to effectively improve our system stability and resource utilization."

Peng Jingtian, co-founder and CTO of Pinlan, said, "CNCF's Volcano project has been successfully applied to our cloud native intelligent building design platform — AlphaDraw. Volcano provides AlphaDraw's algorithm services with batch processing and auto scaling capabilities in scenarios like AI-based model flipping of Computer Aided Design (CAD) two-dimensional drawings and intelligent design of three-dimensional building models, greatly improving Kubernetes cluster resource utilization and optimizing workload performance. As the first member of the Volcano community co-construction program, Pinlan continuously contributes best practices for Cloud+AI in the architectural design field to the community. We expect AlphaDraw and Volcano to develop together to continuously provide more excellent products and solutions for intelligent cloud computing and the cloud native progress of the industry in the future."

Wang Xinyong, a cloud native technology expert from NetEase Shufan, said, "Volcano provides many useful supplements to Kubernetes' native capabilities, enabling it to better orchestrate batch processing tasks like AI training and big data computing. Volcano's excellent task abstraction and management capabilities, multiple scenario-based scheduling mechanisms, and out-of-the-box integration with multiple common open source computing frameworks enable us to focus more on providing business value for users without spending a lot of efforts on reinventing systems."

The owner of the Ruitian Capital Infrastructure Team said, "Volcano supplements native Kubernetes capabilities such as batch task scheduling, resource sharing, and fair scheduling policies, and provides unified interfaces to reduce learning and maintenance costs. In the production environment, Volcano works with our proprietary level-2 scheduling to meet the requirements of tens of thousands of tasks per day, greatly improving the efficiency of policy research."

The leader of the 360 container team said, "Volcano makes up for Kubernetes' lack of basic scheduling capabilities in machine learning and big data computing tasks. It provides various plug-ins to schedule tasks in different scenarios, greatly improving the cluster utilization. Additionally, Volcano supports most mainstream computing frameworks like Spark, TensorFlow, and Flink. The overall design of Volcano follows the design and mechanisms of Kubernetes, which reduces our learning costs."

The head of the Ximalaya AI cloud team said, "Volcano enhances Kubernetes' capabilities like batch task scheduling, resource sharing, and fair scheduling, and provides elastic scheduling. As a basic component for resource scheduling of the machine learning platform, Volcano improves GPU utilization in the production environment."

The owner of BOSS Zhipin AI fundamental platform team said, "BOSS Zhipin builds infrastructures based on Volcano in AI and big data computing scenarios. Volcano's powerful batch processing and robust scheduling policies are very convenient for us. They help support complex service scenarios and greatly improve BOSS Zhipin's cluster resource utilization and stability. With the support of its robust ecosystem and the community, Volcano has greatly helped our technological and business development."

We look forward to working with more organizations to build a more inclusive Volcano community.

Introduction to the Volcano Community Co-construction Program

The Volcano community launched the co-construction program to bring users into the Volcano community more quickly, accelerate cloud native progress, and ensure a diverse Volcano ecosystem.

For details about the requirements and benefits, see https://github.com/volcano-sh/community/blob/master/community-building-program.md.

Application to the program

Scan the QR code or click to read the full text and fill in the application form.

The result will be sent by email. Please wait.

Volcano 1.7.0 Available Now

Thu, 12 Jan 2023 00:00:00 GMT

Volcano 1.7.0 is now available with the following new features:

enhanced plugin for PyTorch Jobs
Ray on Volcano
enhanced scheduling for general Kubernetes services
multi-architecture images of Volcano
optimized queue status info

Volcano is the industry-first cloud native batch computing project. Open-sourced at KubeCon Shanghai in June 2019, it became an official CNCF project in April 2020. In April 2022, Volcano was promoted to a CNCF incubating project. By now, more than 490 global developers have committed code to the project. The community is seeing growing popularity among developers, partners, and users.

Key Features

1. Enhanced Plugin for PyTorch Jobs

As one of the most popular AI frameworks, PyTorch has been widely used in deep learning fields such as computer vision and natural language processing. More and more users turn to Kubernetes to run PyTorch in containers for higher resource utilization and parallel processing efficiency.

Volcano 1.7 enhanced the plugin for PyTorch Jobs, freeing you from the manual configuration of container ports, MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK environment variables.
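
To show what this means for a training script, here is a minimal sketch of reading those variables at startup. The helper `read_dist_config` and the example values are hypothetical; only the variable names come from the text above, and the sketch assumes they are already present in the container environment.

```python
import os


def read_dist_config(env=os.environ):
    """Collect the distributed-training settings from environment variables."""
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "world_size": int(env["WORLD_SIZE"]),
        "rank": int(env["RANK"]),
    }


# Hypothetical values, standing in for what a worker container might see.
example_env = {
    "MASTER_ADDR": "example-master-0",
    "MASTER_PORT": "23456",
    "WORLD_SIZE": "3",
    "RANK": "1",
}

config = read_dist_config(example_env)
print(config)
```

Because the script only reads the environment, the same image can be reused for the master and every worker without per-pod edits.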

Volcano now offers enhanced plugins for TensorFlow, MPI, and PyTorch Jobs. They are designed to help you run computing jobs on your chosen training frameworks with ease.

Volcano also provides an extended development framework for you to tailor Job plugins to your needs.

Design Documentation: Pytorch-plugin
User Guide: Pytorch Plugin User Guide
Issue: #2292

2. Ray on Volcano

Ray is a unified framework for scaling AI and Python applications. It can run on any machine, cluster, cloud, or Kubernetes cluster. Its community and ecosystem are growing steadily.

As machine learning workloads become more compute-intensive than ever, single-node environments can no longer provide enough resources for training tasks. This is where Ray comes in: it seamlessly coordinates the resources of an entire cluster, instead of a single node, to run the same code. Ray is designed as a general-purpose framework for any type of workload.

For users running multiple types of Jobs, Volcano partners with Ray to provide high-performance batch scheduling. Ray on Volcano has been released in KubeRay v0.4.

User Guide: KubeRay-integration-with-Volcano
Issue: #2429, #213

3. Enhanced Scheduling for General Kubernetes Services

Schedulers have their own advantages according to the use case. For example, in batch computing, Volcano provides more scheduling policies and capabilities. In general scheduling, the Kubernetes default scheduler is more balanced. However, it's often the case that a user runs multiple types of tasks in the same cluster. When there are both batch computing and general tasks, scheduling can be a challenge.

Starting from version 1.7, Volcano becomes fully compatible with the Kubernetes default scheduler to schedule and manage long-running services. Now you can use Volcano to centrally schedule both batch computing and general workloads.

Enhancements:

Supports multiple types of schedulers for Volcano scheduler and webhook.
Supports NodeVolumeLimits plugin.
Supports VolumeZone plugin.
Supports PodTopologySpread plugin.
Supports SelectorSpread plugin.

Support for Kubernetes 1.25 is also available in Volcano 1.7.

Issue: #2394, #2510

4. Multi-architecture Images

You can now build multi-architecture Volcano images in a few steps through cross-compilation. For example, you can build the base images for the amd64 and arm64 architectures on an amd64 host and push them to the image repository. During installation and deployment, the system automatically selects the proper image for the host architecture, which makes Volcano easier to use than before.

User Guide: building-docker-images
Issue: #2435

5. Optimized Queue Status Info

Volcano can now collect real-time statistics on allocated resources and report them in the queue status info, which eases dynamic resource adjustment and puts cluster resources to good use.

Volcano allocates and manages cluster resources by queues. The Capability field sets a hard ceiling on the resources each queue can use.

Previously, users had no clear view of the resources allocated in a queue or of the idle resources remaining under its Capability. Creating a large number of workloads against insufficient resources could cause job suspension and unexpected cluster scale-out triggered by the autoscaler, increasing cloud resource costs. Now, with more detailed status info, you can manage cluster resources more efficiently and avoid excess costs.
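
As a rough illustration of how the new status info can be used, the sketch below computes the headroom left under a queue's Capability and checks whether a workload fits before it is submitted. The function names, the dictionaries, and the numbers are hypothetical and do not reflect the actual queue status schema.

```python
def queue_headroom(capability, allocated):
    """Return the resources still available under the queue's hard ceiling."""
    return {name: cap - allocated.get(name, 0) for name, cap in capability.items()}


def fits(request, headroom):
    """Return True if every requested resource fits into the remaining headroom."""
    return all(amount <= headroom.get(name, 0) for name, amount in request.items())


# Hypothetical queue: CPU in cores, memory in GB.
capability = {"cpu": 100, "memory": 400}
allocated = {"cpu": 64, "memory": 150}

headroom = queue_headroom(capability, allocated)
print(headroom)
print(fits({"cpu": 32, "memory": 128}, headroom))
print(fits({"cpu": 48, "memory": 128}, headroom))
```

Checking headroom first avoids submitting jobs that would only wait in the queue or trigger an unwanted scale-out.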

Issue: #2571

Contributors

Volcano 1.7.0 is the result of hundreds of code commits from 29 contributors. Thank you for your contributions.

Contributors on GitHub:

@xiaoxubeii	@jsolbrig	@Yikun
@tgaddair	@william-wang	@elinx
@Abirdcfly	@xiaoanyunfei	@qiankunli
@wpeng102	@waiterQ	@hwdef
@WingkaiHo	@Monokaix	@kerthcet
@WulixuanS	@autumn0207	@jinzhejz
@lucming	@jimoosciuc	@LY-today
@dontan001	@wangyang0616	@Akiqqqqqqq
@zhoumingcheng	@jiangkaihua	@Thor-wl
@ccchenjiahuan	@zhifanggao

Links

Release note: v1.7.0
Branch: release-1.7

About Volcano

Volcano is designed for high-performance computing applications such as AI, big data, gene sequencing, and rendering, and supports mainstream general computing frameworks. More than 26,000 developers worldwide have joined the community, including contributors from companies such as Huawei, AWS, Baidu, Tencent, JD, and Xiaohongshu. The project has 2,800 stars and 670 forks. Volcano has proven itself in large-scale data computing and analytics, such as AI, big data, and gene sequencing. Supported frameworks include Spark, Flink, TensorFlow, PyTorch, Argo, MindSpore, PaddlePaddle, Kubeflow, MPI, Horovod, MXNet, KubeGene, and Ray. The ecosystem is thriving, with more developers and use cases emerging.

ING Bank: How Volcano Empowers Their Big Data Analytics Platform

Wed, 28 Dec 2022 00:00:00 GMT

On October 26, 2022, Krzysztof Adamski and Tinco Boekestijn from ING Group delivered a keynote speech "Efficient Scheduling Of High Performance Batch Computing For Analytics Workloads With Volcano" at KubeCon North America. The speech focused on how Volcano, a cloud native batch computing project, supports high-performance scheduling for big data analytics jobs on ING's data management platform. More details: KubeCon + CloudNativeCon North America

Introduction to ING

Internationale Nederlanden Groep (ING), a global financial institution of Dutch origin, was created in 1991 with the merger of Dutch insurer Nationale-Nederlanden and national postal bank NMB Postbank.

ING provides services in more than 40 countries around the world. Its core businesses are banking, insurance, and asset management. Its 56,000 employees serve 53.2 million customers worldwide, including natural persons, families, businesses, governments, and organizations such as the IMF.

Business Background

Regulations and restrictions on banking vary depending on the country/region. Data silos, data security, and compliance requirements can be really challenging, and it is not easy to introduce new technologies. Therefore, ING built their Data Analytics Platform (DAP) to provide secure, self-service functionality for employees to manage services throughout the entire process.

The data platform was conceptualized in 2013. In 2018, ING introduced cloud native technologies to upgrade their infrastructure platform. Since then, more and more employees and departments have turned to the platform, and there are now more than 400 projects on the data index platform.

They aim to meet all analytics needs in a highly secure, self-service platform that has the following features:

Open source tool model
Powerful computing
Strict security and compliance measures
One platform for all
Both global and local

Challenges and Solutions

ING is shifting from Hadoop to Kubernetes. They met some challenges in job management and multi-framework support. For example:

Job management
- Pod scheduling: Unaware of upper-layer applications.
- Lack of fine-grained lifecycle management
- Lack of dependencies of tasks and jobs
Scheduling
- Lack of job-based scheduling, such as sorting, priority, preemption, fair scheduling, and resource reservation
- No advanced scheduling algorithms, such as those based on CPU topology, task topology, IO-awareness, and backfilling
- Lack of resource sharing among jobs, queues, and namespaces
Multi-framework support
- Insufficient support for frameworks such as TensorFlow and PyTorch
- Complex management of each framework (such as resource planning and sharing)

Managing applications (stateless and even stateful ones) with Kubernetes would be a perfect choice if Kubernetes were as user-friendly as Yarn in scheduling and managing batch computing jobs. However, Yarn itself provides only limited support for frameworks such as TensorFlow and PyTorch. Therefore, ING looked for a better solution.

Kubernetes + Hadoop

ING once managed Hadoop and Kubernetes clusters separately. They ran almost all Spark jobs in Hadoop clusters, and other tasks and algorithms in Kubernetes clusters. They wanted to run all jobs in Kubernetes clusters to simplify management.

When Kubernetes and Yarn work together, Kubernetes and Hadoop resources are statically divided. During office hours, Hadoop applications and Kubernetes use their own resources. Spark tasks, when heavily pressured, cannot be allocated extra resources. At night, there are only batch processing tasks in clusters. All Kubernetes resources are idle but cannot be allocated to Hadoop. In this case, resources are not fully used.

Kubernetes with Volcano

When managing clusters with Kubernetes and scheduling Spark tasks with Volcano, resources do not need to be statically divided. Cluster resources can be dynamically re-allocated based on the priorities and resource pressure of pods, batch tasks, and interactive tasks, which greatly improves the overall utilization of cluster resources.

For example, during office hours, idle resources of common service applications can be used temporarily by batch and interactive applications. During holidays or at night, batch applications can use all cluster resources for data computing.

Volcano is a batch scheduling engine developed for Kubernetes with the following capabilities:

Job queues with weighted priority
Able to commit above queue limits if the cluster has spare capacity
Able to preempt pods when more pods come in
Configurable strategies to deal with competing workloads
Compatible with Yarn scheduling

Volcano supplements Kubernetes in batch scheduling. Since Apache Spark 3.3, Volcano has become the default batch scheduler of Spark on Kubernetes, making it easier to install and use.

Highlighted Features

Redundancy and Local Affinity

Volcano retains the affinity and anti-affinity policies for pods in Kubernetes, and adds those for tasks.

The idea of Dominant Resource Fairness (DRF) is that in a multi-resource environment, resource allocation should be determined by the dominant share of an entity (user or queue). The volcano-scheduler observes the dominant resource requested by each job and uses it as a measure of cluster resource usage. Based on this dominant resource, the volcano-scheduler calculates the share of the job. The job with a lower share has a higher scheduling priority.

For example, a cluster has 18 CPUs and 72 GB memory in total. User1 and User2 are each allocated one queue. Any submitted job will get its scheduling priority based on the dominant resource.

For User1, the CPU share is 0.33 (6/18), the memory share is 0.33 (24/72), and the final share is 0.33.
For User2, the CPU share is 0.67 (12/18), the memory share is 0.33 (24/72), and the final share is 0.67.

Under a DRF policy, the job with the lower share is scheduled first, that is, the job submitted by User1.
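
The arithmetic in this example can be reproduced with a short sketch. It is not the volcano-scheduler implementation; the `Usage` class and `dominant_share` helper are illustrative names, and the per-user requests (6 CPUs and 24 GB for User1, 12 CPUs and 24 GB for User2) are read off the shares quoted above.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    cpu: float     # CPUs
    memory: float  # memory in GB


def dominant_share(usage: Usage, total: Usage) -> float:
    """Return the largest per-resource share of the cluster."""
    return max(usage.cpu / total.cpu, usage.memory / total.memory)


cluster = Usage(cpu=18, memory=72)
requests = {
    "User1": Usage(cpu=6, memory=24),
    "User2": Usage(cpu=12, memory=24),
}

shares = {user: dominant_share(u, cluster) for user, u in requests.items()}

# A lower dominant share means a higher scheduling priority.
order = sorted(shares, key=shares.get)

for user in order:
    print(user, round(shares[user], 2))
```

For User2 the CPU share (12/18) exceeds the memory share (24/72), so CPU is the dominant resource and determines the final share.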

Queue resources in a cluster can be divided by configuring weights. However, overcommitted tasks in one queue can use idle resources in other queues. In this example, after using up the CPUs of its own queue, User2 can use the idle CPUs of User1. When User1 submits a new task, resource preemption is triggered and User1's queue reclaims the resources occupied by other queues.

Resource Reservation

Batch computing tasks and other services may compete for resources and cause conflicts. Assume a cluster has two available nodes and we need to deploy a unified service layer in it to provide services externally, such as Presto or a cache service like Alluxio. If batch computing tasks have already taken all resources, we cannot deploy or upgrade that service layer. Therefore, ING's platform now allows users to reserve some resources for other services.

DRF Dashboard

ING built a DRF scheduling dashboard based on the monitoring data from Volcano to obtain scheduling data at different layers. In the service cluster, ING stores the tasks of interactive users in one queue, and the computing tasks of all key projects running on the data platform in another queue. ING can take certain resources from other queues to the key project queue, but that won't do any good to the tasks of interactive users.

ING is considering displaying the peak hours of cluster use to provide users with more information. With this, users can decide when to start their tasks based on the cluster resource readiness, improving computing performance without complex configurations in the background.

Summary

Volcano abstracts batch task scheduling, allowing Kubernetes to better serve ING in task scheduling. ING will contribute their developed functions to the community, such as the DRF dashboard, idle resource reservation on each node, auto queue management, new Prometheus monitoring metrics, Grafana dashboard updates, kube-state-metrics update, and cluster role restrictions.

Volcano v1.4 (Beta) Release Note

Mon, 13 Sep 2021 00:00:00 GMT

This article was first published on Container Cube on September 6th, 2021. See Volcano v1.4.0-Beta发布，支持NUMA-Aware等多个重要特性 (Volcano v1.4.0-Beta Released with NUMA-Aware Scheduling and Other Important Features).

Volcano, CNCF's first batch computing project, is now available with a new version, v1.4 (Beta). This version includes multiple important features, such as resource ratio-based partitions on GPU nodes, NUMA-aware, mixed deployment of multiple schedulers, and greatly improved stability.

Resource ratio-based partitioning on GPU nodes was developed to keep GPUs from sitting idle while GPU-consuming jobs starve. This is an important feature contributed by Leinao Cloud, a Volcano community member.

Previously, a scheduler had separate rules for allocating scarce resources such as GPUs and common resources such as CPUs. That is, CPU-consuming jobs could be allocated directly to GPU nodes and consume CPU and memory resources there, without considering upcoming GPU jobs or reserving resources for them. Alternatively, an independent scheduler was configured for GPU nodes, which did not allow CPU-consuming jobs to be scheduled to GPU nodes.

Now with resource ratio-based partitions, you can set a dominant resource (usually GPU) and configure a resource ratio (for example, GPU:CPU:Memory = 1:4:32) for the dominant resource. The scheduler ensures that the ratio of idle GPU, CPU, and memory resources on a GPU node is greater than or equal to the value you set.

In this way, GPU-consuming jobs that meet the ratio requirement can be scheduled to the node at any time, preventing GPU waste. Compared with other solutions in the industry, this more flexible method improves node resource utilization.
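
The ratio rule can be sketched as a simple admission check for a CPU-only job on a GPU node. This is a simplified illustration under stated assumptions, not the plugin's code: the function name and parameters are hypothetical, and the ratio 1:4:32 is interpreted here as 4 CPUs and 32 GB of memory per idle GPU.

```python
def fits_without_starving_gpus(idle_gpu, idle_cpu, idle_mem_gb,
                               req_cpu, req_mem_gb,
                               cpu_per_gpu=4, mem_gb_per_gpu=32):
    """Return True if a CPU-only request still leaves the configured
    amount of CPU and memory available for every idle GPU on the node."""
    cpu_left = idle_cpu - req_cpu
    mem_left = idle_mem_gb - req_mem_gb
    if cpu_left < 0 or mem_left < 0:
        return False  # the node cannot hold the request at all
    return (cpu_left >= cpu_per_gpu * idle_gpu
            and mem_left >= mem_gb_per_gpu * idle_gpu)


# Hypothetical node with 2 idle GPUs, 16 idle CPUs, and 128 GB of idle memory.
print(fits_without_starving_gpus(2, 16, 128, req_cpu=8, req_mem_gb=64))
print(fits_without_starving_gpus(2, 16, 128, req_cpu=10, req_mem_gb=16))
```

The first request leaves exactly 8 CPUs and 64 GB for the two idle GPUs and is admitted; the second would leave only 6 CPUs and is rejected.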

For details about the feature design and usage, you can visit https://github.com/volcano-sh/volcano/blob/master/docs/design/proportional.md.

CPU NUMA-aware scheduling is another important feature of this version. For compute-intensive jobs such as AI and big data jobs, enabling NUMA awareness can significantly improve computing efficiency. With CPU NUMA-aware scheduling, you can configure the NUMA policy to determine whether to enable NUMA for workloads. The scheduler will select a node that meets the NUMA requirements.

For details about the feature design and usage, you can visit https://github.com/volcano-sh/volcano/blob/master/docs/design/numa-aware.md.

You can now deploy different types of schedulers in a Kubernetes cluster to properly schedule resources. The most common use case is deploying default-scheduler and Volcano together. Native Kubernetes resource objects, such as Deployments and StatefulSets, can be scheduled by default-scheduler, and high-performance computing workloads, such as Volcano Jobs, TensorFlow Jobs, and Spark Jobs, can be scheduled by Volcano. This solution makes the best possible use of each type of scheduler and reduces the concurrency pressure on any single scheduler.

For details about the feature design and usage, you can visit https://github.com/volcano-sh/volcano/blob/master/docs/design/multi-scheduler.md.

In addition to the preceding features, Volcano v1.4 (Beta) adds a stress testing automation framework and fixes bugs related to the robustness of the resource comparison function.

The community is collecting roadmap features for Volcano v1.5. We have received requirements for cluster resource monitoring, hierarchical queues, enhanced Spark integration, and task dependency. All suggestions and issues are welcome.

OpenI-Octopus: How to Avoid Resource Preemption in Kubernetes Clusters

Thu, 26 Aug 2021 00:00:00 GMT

This article was first published by Container Cube on September 30, 2020. See 鹏城实验室启智章鱼教你彻底摆脱Kubernetes集群资源抢占难题.

Introduction to OpenI-Octopus

OpenI-Octopus is a cluster management and resource scheduling system developed and maintained by Peng Cheng Laboratory, Peking University, and University of Science and Technology of China.

This system is completely open-source, complying with the Open-Intelligence license.
It deploys, manages, and schedules jobs using Kubernetes.
AI jobs can be run in clusters. Hardware such as GPUs, NPUs, FPGAs, Huawei Ascend chips, and Cambricon MLUs are supported.
It provides high-performance networks for AI and supports InfiniBand networks.
Monitoring and analysis tools for networks, platforms, and AI jobs are available.
Mainstream deep learning frameworks are supported.
A microservice-based architecture is used.

The service architecture of OpenI-Octopus is illustrated above. The bottom layer is hardware. OpenI-Octopus supports various types of heterogeneous hardware, including CPUs, GPUs, NPUs, and FPGAs. Different hardware types are adapted so that the upper-layer Kubernetes services can identify and manage them.

The second layer is the platform layer. The blue panels on the left cover the node management functions. OpenI-Octopus employs native Kubernetes functions, including orchestration planning and controllers, and enhances scheduling by integrating Volcano.

Management components communicate with integrated development services through the API server.

The rest-server module developed by OpenI-Octopus carries the core functions of the system and integrates monitoring tools such as Grafana and Prometheus, Elasticsearch, Jupyterlab proxy, and model repository.

The panel on the right covers the capabilities of compute nodes, including image factory, O&M, job monitoring, kubebox client, and Jupyterlab client for users to log in to containers.

The top layer is the services provided by the system, such as data engine, model repository, and project center. With remote interconnection, remote users can also enjoy cluster services.

Business Scenarios and Challenges

OpenI-Octopus is built for research teams and laboratories. They develop and train models in fields such as transportation, healthcare, and finance, and perform model inference. These models are used for vehicle tracking, medical image recognition, auxiliary diagnosis, quantitative finance, and many other applications. Deep learning algorithms are used throughout this work, which requires strong compute resources.

OpenI-Octopus aims to break the model, data, and compute resource silos on the traditional platforms, and provide computing power through a single platform.

At the model layer, OpenI-Octopus provides a multi-architecture, heterogeneous model engine, which supports common open-source computing frameworks and provides model conversion for them.
At the data layer, OpenI-Octopus provides a multi-source, heterogeneous data engine, which supports heterogeneous data convergence and semi-automatic data labeling.
At the resource layer, OpenI-Octopus provides a distributed AI computing engine for job scheduling and the unified representation of heterogeneous hardware.

Service Requirements:

Excellent performance in scientific research and applications of AI, including algorithm training and inference in fields such as smart transportation, healthcare, and finance
High-end heterogeneous hardware resources, clusters with 150P+ computing, and 10 PB-level high-speed storage
Rapid and flexible deployment. The system runs reliably and stably for external teams to use.

Challenges:

No high-performance computing platform to meet service requirements in complex scenarios
Heterogeneous hardware resources need to be efficiently used and flexible scheduling policies must be supported. Resource preemption problems need to be resolved to avoid starvation of key task resources.
The system architecture must be scalable and services must be highly available.

Why Volcano?

At the beginning, OpenI-Octopus looked into several existing open source projects in the community that could basically satisfy the service requirements and reduce the development workload. The OpenI-Octopus team narrowed down their choices to four resource schedulers. The first one was the default Kubernetes scheduler, which is not friendly to batch scheduling. The second choice was the Yarn scheduler, which is based on Hadoop. However, the architecture had already moved to Kubernetes, so Yarn did not fit. The last two were kube-batch and Volcano. Volcano is developed from kube-batch, and better supports deep learning and common computing frameworks. Volcano implements scheduling policies through plugins that can be easily customized to develop scenario-specific scheduling policies. That's why OpenI-Octopus chose to integrate Volcano.

Volcano brings the following benefits:

Complete architecture and ecosystem; timely feedback from the fast growing community
Customizable plugins for scenario-specific scheduling policies. Take the binpack plug-in as an example. Its packing algorithm can reduce resource fragments, allowing your cluster resources to be fully used.
Job queue mechanism. Job queues allow clusters to be logically grouped. Users can configure compute resource quotas for different projects, and allocate different types of jobs to different queues for management. In this way, job and compute resource management can be finer-grained.

Secondary Development Based on Volcano - Resource Status Statistics and Management

OpenI-Octopus performed secondary development on Volcano and added some new capabilities.

The first capability is to collect statistics on resources and manage resource status. These resources include both cluster compute resources and resources such as jobs, tasks, and pods generated by Kubernetes after a user submits a job.

OpenI-Octopus tracks the status of all these resources. It also allows users to customize the conditions and callback events of resource status transitions, and to subscribe to the customized events and corresponding policies at the service logic layer.

Assume that there is a training job that uses an ensemble learning algorithm. Generally, such a job is trained in a distributed manner. It has a combination module and several individual learners, all of which can be regarded as tasks. Each individual learner is trained using one type of algorithm, and the combination module combines the results of each individual learner to output the final result. Once the final result is obtained, the entire training job is complete. In this job-task implementation based on Kubernetes, a user can create one or more pods for a task. If you want the entire job to exit as soon as the combination module runs to completion, instead of waiting until all tasks are successfully executed, you can customize a job exit policy in the scheduler and use the policy at the service layer. Different scenarios may require different policies, and that's why secondary development is needed.

This flowchart shows how job state information is transferred among OpenI-Octopus, Kubernetes, and Volcano.

First, both Volcano and OpenI-Octopus listen on all Kubernetes jobs. After the user submits a job to Kubernetes, Volcano updates the job state based on the monitored state of the pods started by the job. OpenI-Octopus then handles the job state changes. The key is how Volcano updates the job states to Kubernetes.

OpenI-Octopus worked out its solution:

1) Develop state machines for Jobs, Tasks, and Replicas.

More detailed resource state statistics and command output
Finer-grained job lifecycle management

2) Customize events and policies. Back to the ensemble learning example: the entire job can be marked complete when the scheduler emits a customized event (for example, MainTaskEvent) indicating that the specific task has executed successfully.

3) Implement lifecycle callback hooks, which can be added to any state transition event in any state machine. For example, the billing function collects statistics on the running duration of a job based on the start event and end event of the job.
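The three steps can be sketched together as a toy state machine. This is only an illustration under assumptions: apart from MainTaskEvent, which is named above, the state names, event names, and the JobStateMachine class are hypothetical and are not the OpenI-Octopus or Volcano implementation.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Hook = Callable[[str, str, str], None]  # (job_id, old_state, new_state)


class JobStateMachine:
    """Toy job state machine with callback hooks on state entry."""

    # Allowed transitions: (current state, event) -> next state.
    # "MainTaskEvent" is the customized event from the ensemble example;
    # the other event names are made up for this sketch.
    TRANSITIONS: Dict[Tuple[str, str], str] = {
        ("Pending", "PodsScheduled"): "Running",
        ("Running", "MainTaskEvent"): "Completed",
        ("Running", "AllTasksSucceeded"): "Completed",
        ("Running", "PodFailed"): "Failed",
    }

    def __init__(self, job_id: str) -> None:
        self.job_id = job_id
        self.state = "Pending"
        self._hooks: Dict[str, List[Hook]] = defaultdict(list)

    def on_enter(self, state: str, hook: Hook) -> None:
        """Register a lifecycle hook that fires when the job enters `state`."""
        self._hooks[state].append(hook)

    def fire(self, event: str) -> bool:
        """Apply an event; return False if it is not valid in the current state."""
        nxt = self.TRANSITIONS.get((self.state, event))
        if nxt is None:
            return False
        old, self.state = self.state, nxt
        for hook in self._hooks[nxt]:
            hook(self.job_id, old, nxt)
        return True


# Billing-style hooks: record when the job starts and when it ends.
timeline: List[Tuple[str, str]] = []
job = JobStateMachine("ensemble-train")
job.on_enter("Running", lambda j, old, new: timeline.append((j, "start")))
job.on_enter("Completed", lambda j, old, new: timeline.append((j, "end")))

job.fire("PodsScheduled")   # Pending -> Running
job.fire("MainTaskEvent")   # combination module finished: Running -> Completed
```

Keeping the transitions in a table means that a new exit policy is a new table entry rather than new control flow, which is one plausible reason to model job lifecycles this way.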

Volcano-based Secondary Development - Privilege Action

Issues:

Resources are starved, and a large number of jobs in the queue keep waiting.
Urgent and key jobs need to be preferentially scheduled.
Users' jobs may be developed online and cannot be terminated unless allowed.

Existing Capabilities of Volcano:

Jobs with different priorities in the same queue can be preempted.
Pod-based eviction
Immediate preemption

Requirements:

When jobs in the same queue are from different tenants, different tenants should have different priorities and preemption permissions.
Job-based eviction
Delayed preemption

This flowchart shows how the delayed preemption plugin works. On the left is the running logic of the plugin in the scheduler. Kubernetes services lie in the middle, and the right part is the core OpenI-Octopus modules.

Specifically, the plugin finds the jobs that need to be preempted in Volcano. The compute resources occupied by these jobs must be sufficient for the high-priority jobs that are waiting. Then, the plugin updates the states of these jobs in Kubernetes. As soon as the core OpenI-Octopus modules detect the state changes, they start a timer to prepare for eviction of these jobs. If the preemption is canceled before the timer expires, or the required resources are released in the meantime, the timer is also canceled.

The following chart shows the service logic.

A Boolean attribute called Preempt is added to each job, indicating whether the job is a preempted job.

Only jobs with lower priorities in the same queue can be preempted.

The eviction is performed by job instead of by pod.

Pods are evicted based on the ID of the jobs to which the pods belong to reduce the number of affected jobs.
The scheduler notifies OpenI-Octopus to stop the jobs at the service layer.

Delayed preemption

The Privileged and WillEvicted states are added for the job state machine.
Jobs in Privileged or WillEvicted state cannot be preempted by other jobs.
If the state of a preempting or preempted job changes, the state of the other party changes accordingly.
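The job-based, delayed preemption described above can be sketched as follows. This is a simplified model under assumptions: resources are reduced to a GPU count, victims are taken lowest priority first, and the Job, pick_victims, and DelayedEvictor names are hypothetical. Only the WillEvicted state name comes from the text.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Job:
    job_id: str
    queue: str
    priority: int
    gpus: int
    state: str = "Running"


def pick_victims(jobs: List[Job], preemptor: Job) -> List[Job]:
    """Choose whole jobs to evict until the preemptor's GPU demand is covered."""
    candidates = sorted(
        (j for j in jobs
         if j.queue == preemptor.queue           # same queue only
         and j.priority < preemptor.priority     # lower priority only
         and j.state == "Running"),              # Privileged/WillEvicted are skipped
        key=lambda j: j.priority,
    )
    victims, freed = [], 0
    for j in candidates:
        if freed >= preemptor.gpus:
            break
        victims.append(j)
        freed += j.gpus
    # Preempt nothing if even all candidates together would not be enough.
    return victims if freed >= preemptor.gpus else []


class DelayedEvictor:
    """Marks a victim WillEvicted and evicts it only after a grace period."""

    def __init__(self, delay_s: float, evict: Callable[[Job], None]) -> None:
        self.delay_s, self.evict = delay_s, evict
        self.timers: Dict[str, threading.Timer] = {}

    def schedule(self, victim: Job) -> None:
        victim.state = "WillEvicted"
        timer = threading.Timer(self.delay_s, self.evict, args=(victim,))
        self.timers[victim.job_id] = timer
        timer.start()

    def cancel(self, victim: Job) -> None:
        timer = self.timers.pop(victim.job_id, None)
        if timer is not None:
            timer.cancel()
            victim.state = "Running"


jobs = [Job("a", "q1", 1, 4), Job("b", "q1", 2, 4), Job("c", "q2", 0, 8)]
urgent = Job("urgent", "q1", 9, 6)
victims = pick_victims(jobs, urgent)

evictor = DelayedEvictor(delay_s=60.0, evict=lambda j: None)
evictor.schedule(victims[0])
state_while_waiting = victims[0].state
evictor.cancel(victims[0])   # e.g. the required resources were released meanwhile
```

Job "c" is never considered because it lives in another queue, and eviction is decided per job rather than per pod, so a victim is either left fully intact or stopped as a whole.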

Benefits

Enhanced capabilities

Large-scale distributed training jobs can run efficiently.
Multiple AI computing frameworks are supported.
Plugin-based scheduler supports customized development to satisfy scenario-specific requirements.
Multi-queue scheduling makes possible hardware resource grouping and dynamic resource allocation between groups.

Performance tuning

Hardware resource utilization is greatly improved to 90% or higher.
The average job scheduling latency is greatly reduced. The average job waiting time is reduced from 60 seconds (using the Yarn scheduler) to 10 seconds (using Volcano).
System stability is enhanced, cluster node resources are used in balance, and O&M workloads are reduced.

With 120+ nodes managed and 1100+ GPU cards in total, the GPU utilization can reach 90% or higher when the system is overloaded.
The resource usage of each node is balanced, and the difference is less than 20%.
Since the rollout in 2019, more than 120,000 jobs have been run.

iQIYI: Volcano-based Cloud Native Migration Practices

Wed, 25 Aug 2021 00:00:00 GMT

This article was first published by Container Cube on September 30, 2020. See 揭秘爱奇艺深度学习平台云原生迁移实践.

Introduction to iQIYI Jarvis Deep Learning Platform

Overall Architecture of the Platform

The platform supports GPU- and CPU-based training and inference. S3, HDFS, and NFS can be used for storing training data and models. The platform supports TensorFlow, PyTorch, Caffe, Caffe2, and MXNet, of which TensorFlow and PyTorch are the ones mainly used. TensorFlow versions 1.x through 2.x are supported.

The platform can be used in advertising, search, recommendation, NLP, and other services. iQIYI uses Mesos + Marathon as its elastic container platform. When iQIYI started building the platform, Kubernetes was not mature enough to be considered, so the containers do not run on Kubernetes.

One-stop Platform Service

Four small platforms are used to provide the service. The first is the data preprocessing platform. It analyzes the training data in a visualized manner, helps users adjust parameters, and detects abnormal data in a timely manner.

The second is the training code compilation platform. You can use the RunOnce or notebook training to obtain an environment that is the same as the training environment. You then can compile the training code, and commit the code to GitLab.

The third is the training job execution platform. You can run a training job by using the Jarvis training platform, and then the training code will be executed. An algorithm model will be output.

The last is the Jarvis inference platform. You can create an inference service using the platform and provide the inference service for external systems.

Platform Development

iQIYI started from the inference platform. iQIYI first enabled the models to provide services for external systems, and then gradually extended the platform functions to support training, development, and data preprocessing. Currently, iQIYI is migrating the elastic container platform from Mesos + Marathon to K8s + Volcano.

Training Platform Architecture Before Volcano Is Used

The following figure shows the training platform architecture before Volcano is used.

The process is as follows:

A. Compile training code and commit it to GitLab.

B. Create a training job on the web page or using the command line tool. To create a training job, you need to enter the following information:

Required resources
Images. Each version of each framework is supported by an image. Selecting an image means selecting an algorithm framework.
There may be multiple clusters. You need to specify the cluster where the job is expected to run.
The URL of the GitLab project. The project contains the training code you compiled.

C. The Jarvis CLI/web converts the request into gRPC and sends it to the Jarvis core.

D. The core converts the request and calls the Marathon API to create a container.

E. The container is started in the specified cluster and executes the training job.

Challenges of Migrating the Training Platform to Kubernetes

The challenges are as follows:

Native pods, Deployments, and jobs cannot meet the requirements of distributed training.
Queue and quota management are not supported.
Lack of scheduling capabilities, such as Gang Scheduling.

Introducing Volcano

The three most important concepts of Volcano are VolcanoJob, queue, and PodGroup. VolcanoJob, referred to as vcjob, is an extension of Kubernetes jobs or an encapsulation of pods.

Queues can be used to manage quotas.

PodGroup is a group of pods and can be used for advanced upper-layer scheduling.

So far:

Volcano is the native batch system of Kubernetes and is highly suitable for AI training.
It is non-intrusive to the Kubernetes source code and complies with the Kubernetes development specifications, facilitating secondary development.
It has been accepted by Cloud Native Computing Foundation (CNCF) and is mature.

Power of Volcano

How Does Volcano Solve Problems of Migrating to Kubernetes?

1.1 Gang Scheduling

With gang scheduling, either all pods of a job run simultaneously or none of them run. This is important for AI training, especially distributed training. Distributed training typically starts a large number of pods, for example, 40 or 50 pods, at a time. If only some pods of a task are scheduled, the task cannot run properly. This also causes resource waste, or even deadlocks.

For example, there are only four GPUs in a resource pool and tasks A and B. Each task has four pods, and each pod requires one GPU. When tasks A and B are created at the same time, without gang scheduling, each task may obtain only two GPUs. In this case, neither of the tasks can be completed, resulting in a deadlock. Unless resources are added to the pool, the deadlock cannot be resolved.

Volcano schedules jobs in the unit of PodGroup to implement gang scheduling, avoiding the preceding problem.
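The four-GPU example above can be reproduced with a small simulation. It is a minimal sketch, not Volcano's scheduler: the non-gang variant assumes GPUs are handed out pod by pod in round-robin order, and both function names are hypothetical.

```python
from typing import Dict, List


def schedule_without_gang(free_gpus: int, jobs: Dict[str, int]) -> Dict[str, int]:
    """Hand out GPUs pod by pod, round-robin across jobs (no gang constraint)."""
    granted = {name: 0 for name in jobs}
    while free_gpus > 0:
        progressed = False
        for name, wanted in jobs.items():
            if free_gpus > 0 and granted[name] < wanted:
                granted[name] += 1
                free_gpus -= 1
                progressed = True
        if not progressed:
            break
    return granted


def schedule_with_gang(free_gpus: int, jobs: Dict[str, int]) -> Dict[str, int]:
    """Admit a job only if all of its pods fit at once (all-or-nothing)."""
    granted = {}
    for name, wanted in jobs.items():
        if wanted <= free_gpus:
            granted[name] = wanted
            free_gpus -= wanted
        else:
            granted[name] = 0
    return granted


jobs = {"A": 4, "B": 4}          # each job has 4 pods, 1 GPU per pod
no_gang = schedule_without_gang(4, jobs)
gang = schedule_with_gang(4, jobs)
runnable_no_gang: List[str] = [j for j, n in no_gang.items() if n == jobs[j]]
runnable_gang: List[str] = [j for j, n in gang.items() if n == jobs[j]]
```

Without the gang constraint, A and B each hold two GPUs and neither can start. With it, A receives all four GPUs and runs, while B waits without holding any resources.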

1.2 Native Support for Distributed Tasks

Take TensorFlow distributed training as an example. It has the following roles: Parameter Server (PS), master, and worker. PS is used to store parameters. Master and worker are used to calculate gradients. In each iteration, master and worker obtain parameters from PS and update the calculated gradients to PS. PS aggregates the gradients returned from master and worker, updates the parameters, and broadcasts the updated parameters to master and worker.

Let's focus on one of its network structures. If master or worker needs to communicate with PS, problems will occur. When creating a pod, a user may not know the IP address of the pod. Multiple pods created in a deployment may not know the IP address or domain name of each other. Without Volcano, solutions to these problems are complicated.

Each role must know the IP addresses or domain names of the other roles, the role it plays, and its own index. A TF_CONFIG configuration is required to include the IP addresses or domain names of master, worker, and PS. These are difficult to implement in Kubernetes. However, with Volcano, the solutions become simple.

Volcano can easily build TF_CONFIG through file injection to support TensorFlow distributed training. Volcano injects a folder (/etc/volcano) into the pods under a vcjob. The folder includes all domain names of master, worker, and PS. In this way, each pod knows the peers in the entire cluster, and the TensorFlow distributed training can be performed.

Currently, TensorFlow provides some high-level APIs, such as TF estimator. The single-node code and distributed code in the estimator are the same, but the configuration of TF_CONFIG is different. If the environment variables or configuration files in such a format are passed, distributed training can be performed. If platforms can build TF_CONFIG, users can run their training code directly.
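As a rough illustration of how a platform might assemble TF_CONFIG from the injected folder, the sketch below assumes that the folder holds one file per role, named ps.host, master.host, and worker.host, with one domain name per line. The file naming, the port 2222, and the function names are assumptions made for this example and should be checked against the Volcano version in use.

```python
import json
import os
import tempfile
from typing import Dict, List


def read_hosts(host_dir: str, role: str, port: int) -> List[str]:
    """Read one host name per line from <host_dir>/<role>.host (assumed layout)."""
    path = os.path.join(host_dir, f"{role}.host")
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [f"{line.strip()}:{port}" for line in f if line.strip()]


def build_tf_config(host_dir: str, role: str, index: int, port: int = 2222) -> str:
    """Assemble the TF_CONFIG JSON string for one pod of the job."""
    cluster: Dict[str, List[str]] = {}
    for r in ("ps", "master", "worker"):
        hosts = read_hosts(host_dir, r, port)
        if hosts:
            cluster[r] = hosts
    return json.dumps({"cluster": cluster, "task": {"type": role, "index": index}})


# Demo with a temporary directory standing in for the injected folder.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "ps.host"), "w") as f:
        f.write("job-ps-0.job\n")
    with open(os.path.join(d, "worker.host"), "w") as f:
        f.write("job-worker-0.job\njob-worker-1.job\n")
    tf_config = json.loads(build_tf_config(d, role="worker", index=1))
```

In a real pod, the resulting string would be exported as the TF_CONFIG environment variable before the training script starts, with the role and index taken from the pod's own identity.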

1.3 Horovod/MPI

Volcano supports Horovod, which is similar to TensorFlow. They are both used for distributed training but differ in the way of updating parameters.

Horovod uses the ring allreduce method to update parameters. What does the ring allreduce architecture require when we build a basic environment for upper-layer applications to use?

First, we need to ensure that each node knows the domain names of all other nodes, as mentioned earlier. Second, each node must be able to log in to any other node over SSH through port 22 without a password. This passwordless SSH can be automatically implemented with Volcano's SSH plugin, saving a lot of trouble.

1.4 Quotas and Queues

Volcano uses queues (CRD objects) to schedule jobs. Let's assume that we have two queues, as shown in the following figure. Queue1 has a quota of 20 GPUs and queue2 has a quota of 10 GPUs. The resources of queue1 are abundant, so new jobs in queue1 can be scheduled. However, all resources in queue2 have been used, so new jobs in queue2 cannot be scheduled and have to wait in the queue. As a result, their PodGroups change to the Pending state.

The teams in our platform are similar to the Volcano queues. How? Each team has a quota, and quotas are independent between teams. When the resource usage reaches the quota of a team, the jobs in the team have to wait in queue. When resources are available, the queued jobs will be executed based on the priority, which means the jobs with a higher priority will be run first. Considering this similarity, the interconnection between Volcano and iQIYI’s platform can be fairly easy.
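The quota behavior shared by Volcano queues and the platform's teams can be sketched as a toy model. It assumes a single GPU quota per queue and strict priority order for waiting jobs. The Queue class here is hypothetical and is not the Volcano Queue CRD.

```python
import heapq
from typing import Dict, List, Tuple


class Queue:
    """Toy queue with a GPU quota; jobs beyond the quota wait as pending."""

    def __init__(self, name: str, gpu_quota: int) -> None:
        self.name, self.gpu_quota, self.used = name, gpu_quota, 0
        # Heap entries: (-priority, submission order, job name, GPUs requested)
        self._pending: List[Tuple[int, int, str, int]] = []
        self._seq = 0
        self.running: Dict[str, int] = {}

    def submit(self, job: str, gpus: int, priority: int = 0) -> None:
        heapq.heappush(self._pending, (-priority, self._seq, job, gpus))
        self._seq += 1
        self._dispatch()

    def finish(self, job: str) -> None:
        self.used -= self.running.pop(job)
        self._dispatch()

    def _dispatch(self) -> None:
        # Start waiting jobs in priority order while the head job fits the quota.
        while self._pending and self.used + self._pending[0][3] <= self.gpu_quota:
            _, _, job, gpus = heapq.heappop(self._pending)
            self.running[job] = gpus
            self.used += gpus

    @property
    def pending(self) -> List[str]:
        return [job for _, _, job, _ in sorted(self._pending)]


queue2 = Queue("queue2", gpu_quota=10)
queue2.submit("train-1", gpus=10)
queue2.submit("low", gpus=4, priority=1)
queue2.submit("high", gpus=4, priority=5)
waiting = queue2.pending            # quota exhausted, both jobs wait
queue2.finish("train-1")            # quota freed: the queued jobs are started
```

Once train-1 finishes, the higher-priority job is started first, followed by the lower-priority one, because both now fit within the 10-GPU quota.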

1.5 Integration with Volcano

iQIYI has added the volcano_plugin, which encapsulates the RESTful APIs of vcjob, queue, and PodGroup. It converts the gRPC requests into YAML configurations that comply with the Kubernetes API specifications, and calls the Kubernetes API to create containers.

Jarvis Core determines which backend to use based on the passed cluster information.
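To make the conversion step more concrete, the sketch below builds a vcjob manifest as a Python dictionary from a simplified training request. The to_vcjob function, its parameters, and the image name are hypothetical and do not represent iQIYI's volcano_plugin; the manifest fields follow the batch.volcano.sh/v1alpha1 Job layout as commonly documented, which should be verified against the installed Volcano version.

```python
import json
from typing import Any, Dict, List


def to_vcjob(name: str, queue: str, image: str, command: List[str],
             roles: Dict[str, int], gpus_per_pod: int = 0) -> Dict[str, Any]:
    """Translate a (hypothetical) training request into a vcjob manifest."""
    resources: Dict[str, Any] = {}
    if gpus_per_pod:
        resources = {"limits": {"nvidia.com/gpu": gpus_per_pod}}
    tasks = [
        {
            "name": role,
            "replicas": replicas,
            "template": {"spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": role,
                    "image": image,
                    "command": command,
                    "resources": resources,
                }],
            }},
        }
        for role, replicas in roles.items()
    ]
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": "volcano",
            "queue": queue,
            # Gang scheduling: start only when every pod can be placed.
            "minAvailable": sum(roles.values()),
            "plugins": {"ssh": [], "svc": [], "env": []},
            "tasks": tasks,
        },
    }


manifest = to_vcjob("tf-dist", "team-a", "example/tf:latest",
                    ["python", "train.py"], {"ps": 1, "worker": 2}, gpus_per_pod=1)
body = json.dumps(manifest, indent=2)   # payload for the Kubernetes API call
```

Mapping a platform team to the queue field and each framework role to a task keeps the translation from the existing request format mostly mechanical.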

Encountered Issues

Issue 1

Symptom: During Volcano upgrade, the image in https://github.com/volcano-sh/volcano/blob/master/installer/volcano-development.yaml was directly modified, and kubectl apply -f was executed. The existing queues and vcjobs all disappeared.

Cause: volcano-admission-init in the YAML file was executed repeatedly. As a result, Volcano was reset.

Solution: Upgrade only the necessary components.

Issue 2

Symptom: When list_and_watch was used to monitor vcjob status, the watch connection broke every 80 to 90 seconds when there were no new events, and the disconnection duration varied. Such issue did not occur when the same code was used to monitor pods.

Cause: The default HTTP timeout for CRD objects in Kubernetes is time.Duration(float64(minRequestTimeout) * (rand.Float64() + 1.0)), where minRequestTimeout is set to 1 minute. You can specify timeoutSeconds on the client to avoid this issue.
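The timeout formula can be mirrored in a few lines to see which range of values it produces. This is only a numerical sketch of the expression quoted above, with a hypothetical function name. It does not call the Kubernetes API.

```python
import random

MIN_REQUEST_TIMEOUT_S = 60.0  # minRequestTimeout of 1 minute, as stated above


def server_watch_timeout(rng: random.Random) -> float:
    """Mirror of minRequestTimeout * (rand.Float64() + 1.0), in seconds."""
    return MIN_REQUEST_TIMEOUT_S * (rng.random() + 1.0)


rng = random.Random(0)
samples = [server_watch_timeout(rng) for _ in range(1000)]
shortest, longest = min(samples), max(samples)
```

Because the random factor lies in [1.0, 2.0), every sampled timeout falls between 60 and 120 seconds, which is why an idle watch ends after a varying interval unless the client sets its own timeout or reconnects.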

Issue 3

Symptom: The container entrypoint in Jarvis is a bash script. When the script was run in Kubernetes, a container did not exit until 30 seconds after the stop command was delivered.

Cause: Bash did not pass the signal to child processes. When the graceful stop timeout was reached, the daemon process detected that the container had not exited and sent a SIGKILL signal to kill the bash script and exit the container. However, other processes in the container had no chance to clean up.

Solution: Use dumb-init to run the script, as in the following entry script:

#!/usr/bin/dumb-init /bin/bash

my-web-server & # launch a process in the background

my-other-server # launch another process in the foreground

1.6 Modifications on Volcano

The SVC plugin now supports the input parameter nodeport. It means when we create a vcjob and pass the SVC parameter, a nodeport will be created, so our TensorBoard and other services can be accessed externally.
We have fixed the bug that creation fails when the name of the SSH plugin exceeds 63 bytes.
Volcano has fixed the bug in the queue capability that resources can be used over the capability. For details, see https://github.com/volcano-sh/volcano/issues/921.
After a vcjob is annotated, if a pod fails, the vcjob deletion is not triggered. For details, see https://github.com/volcano-sh/volcano/issues/805.

Summary

Volcano makes up for the lack of basic deep learning capabilities in Kubernetes.

Gang Scheduler
Queue management

Volcano code complies with the Kubernetes standards and is non-intrusive.

Lower development and interconnection costs
Easy for secondary development

Volcano-based Jarvis has been released and is running properly.

Using Volcano in Large-Scale, Distributed Offline Computing

Wed, 25 Aug 2021 00:00:00 GMT

This article was first published by Container Cube on December 24, 2020. See 锐天投资基于Volcano的大规模分布式离线计算平台的应用实践.

Service Scenarios and Solution Selection

Service Scenarios

VMs for research and development for policy personnel
AI training and inference
Data ETL
General-purpose, distributed batch processing jobs

Why Use Kubernetes?

A distributed batch processing platform can be used to manage compute and storage resources. In this use case, Ruitian decided to use Kubernetes to manage compute resources due to the following reasons:

Containers streamline development for users in different environments. Ruitian has four to five groups of users who use different development environments and policies. Environment isolation posed a great challenge to resource management and development efficiency. Now with containers, environments are encapsulated in containers that can be scheduled using Kubernetes.
Heterogeneous devices such as GPUs can be supported through Device Plugins.
Data storage can be centralized by using etcd.
Kubernetes has a robust technology ecosystem.
The Go language fits Ruitian's technology stack.

Why Use CephFS?

CephFS is a type of distributed file storage interface provided by Ceph. Ceph provides three types of storage interfaces: S3, block storage, and CephFS. The reasons for using CephFS are as follows:

POSIX file system permissions and interface: Local file systems are widely used in our businesses and CephFS provides stable file system mounting. In multi-tenant scenarios, each user has a UID, and the data of each user can be accessed only by themselves. The POSIX file system provides a permission mechanism that allows users to seamlessly migrate their file permissions to SAP.
Strong consistency: A file written to node A can be directly read on node B.
Small file access at scale and high-bandwidth I/O
Hierarchical hardware support
Kubernetes ReadWriteMany PV

Why Volcano

Why not default-scheduler

Ruitian did not choose the default-scheduler, because it cannot provide queue scheduling, fair scheduling, multi-tenant isolation, and advanced scheduling policies such as gang scheduling. Fair scheduling and advanced scheduling policies are the most important factors. Fair scheduling decides which job to run first when there are too many jobs in a queue or when the cluster has available resources. To achieve this, each queue must be mapped to a team, and each namespace must correspond to a user. The default-scheduler cannot meet the preceding requirements.

Another option was kube-batch, a batch processing scheduler of the community. However, it is only a scheduler and does not provide any solution other than scheduling. What Ruitian needed was a batch processing solution that takes care of scheduling and processing for the environment and CRDs.

Why Volcano?

Supports fair scheduling.
Supports advanced scheduling policies, such as gang scheduling and binpack.
Supports mutual access between pods through SSH plug-ins.
Supports injecting job dependencies to pods via ENV plug-ins and supports TensorFlow Worker Sharding.
Provides services externally via SVC plug-ins.

Such a scheduling platform satisfies Ruitian's requirements.

System Architecture

Service Architecture

Ceph-based high-performance storage
Kubernetes-based heterogeneous hardware management
Loki + Grafana for user and monitoring panel
Hybrid deployment of middleware and application layer, making full use of cluster resources
Extended service scenarios with Batch Jobs

Multi-tenancy

When a user submits a job, multi-tenancy can be a problem. For example, when a user adds a pod to a cluster, the cluster needs to know the running user and the UID. By default, the UID of a running user is that of the image builder, which means the UIDs of the pods submitted by all users can be the same. This is not allowed because the data obtained and generated by a user should not be accessible to other users.

In this case, Ruitian uses Kubernetes namespaces to isolate all resources. One namespace corresponds to one user. Namespaces interconnect with the development information through the existing LDAP service and OIDC to authenticate users and authorize them through RBAC to use pod security policies (PSPs). A PSP requires users to specify UID and GID in SecurityContext when submitting a pod to a cluster. The entire runtime environment of the user is subject to these settings.

With PSPs, users can be isolated when accessing data, which is all stored in Ceph. Multi-tenancy is thereby easily managed.

Workflows

What comes next is basic workflows. The local configurations are rendered into a job YAML and then submitted. All dependency data of the user is synchronized to CephFS, and the pod is mounted with a PVC. Each user has the PVC permissions of their own directory in their own namespace. The permissions are managed and controlled through IBS. In this way, jobs are submitted to the cluster to run.

In-depth Customization on Volcano

In the basic submission framework, Ruitian provides libraries for users and is developing a submission tool, Jobctl. This tool can be used as a command line tool or as a Python library imported into a notebook or directly into the user's Python script. Jobctl supports asynchronous and synchronous submissions. In the asynchronous mode, jobs are continuously submitted to the entire cluster. After the jobs are submitted, Jobctl exits directly. In the synchronous mode, Jobctl submits and watches jobs, and returns the execution results to the user only after the jobs are complete.

With Jobctl, Kubernetes complexities can be shielded for users. In addition, command line submission and Python Lib integration are supported, and the most basic parallel execution by replicas and by day is provided.

OOM Auto Scale Up

The first customization is to scale up the resources of the entire job after an OOM. Users may not be able to configure the exact memory required, and would otherwise need to submit the job again for verification after the OOM. Therefore, Ruitian customized OOMKill Auto Scale-Up by modifying the Volcano Controller to automatically scale up the resources requested by OOM-killed pods. After the scale-up, the jobs are automatically submitted again, and the user is informed upon successful submission. Combined with the Volcano policy event mechanism, this function guarantees reasonable memory requests without manual intervention.
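The scale-up decision can be sketched as follows. This is a hedged illustration rather than Ruitian's controller change: the doubling factor, the 512 GiB ceiling, and the PodResult, next_memory_request, and resubmit names are all assumptions. OOMKilled is the termination reason Kubernetes reports for containers killed for exceeding their memory limit.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PodResult:
    name: str
    memory_request_gib: float
    termination_reason: str   # e.g. "Completed" or "OOMKilled"


def next_memory_request(pod: PodResult, factor: float = 2.0,
                        ceiling_gib: float = 512.0) -> Optional[float]:
    """Return the memory request for a retry, or None if no retry should happen."""
    if pod.termination_reason != "OOMKilled":
        return None
    if pod.memory_request_gib >= ceiling_gib:
        return None                      # already at the cap: give up and alert
    return min(pod.memory_request_gib * factor, ceiling_gib)


def resubmit(pod: PodResult) -> Optional[PodResult]:
    """Build the retried pod with the scaled-up request (submission itself omitted)."""
    new_request = next_memory_request(pod)
    if new_request is None:
        return None
    return replace(pod, memory_request_gib=new_request, termination_reason="Pending")


retry = resubmit(PodResult("etl-0", memory_request_gib=8, termination_reason="OOMKilled"))
no_retry = resubmit(PodResult("etl-1", memory_request_gib=8, termination_reason="Completed"))
```

A ceiling keeps a leaking job from escalating its request without bound, so at some point the failure is surfaced to the user instead of being retried again.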

MinSuccess

Original behavior:

If the number of pods that run to completion reaches minAvailable, the job is complete.
Non-Gang jobs cannot be flexibly scheduled.

Customized behavior:

If the number of pods that run to completion reaches minSuccess, the job is complete.
The number of pods required for gang scheduling is decoupled from the number of pods required for job completion.

NodeZone

Original behavior:

One Volcano instance manages all nodes.
The noisy neighbor problem cannot be resolved.
Resources cannot be reserved for emergencies.

Customized behavior:

Multiple Volcano instances manage multiple zones.
Certain jobs are physically isolated.

Volcano Namespace Quota

The default Kubernetes quotas cannot satisfy Ruitian's system. When the native namespace quota is triggered, pods directly fail. Therefore, Ruitian re-designed the quota in Volcano.

When the Volcano namespace quota is triggered, pod creation in a queue will be suspended.

Volcano Monitoring and Alarming

Volcano Exporter

Outputs the queue label of the job.
Outputs the queue capability.
Outputs the job start time and end time.

WatchDog

Registers the Informer and collects metrics.
Reports job failure and usage alarms.
Automatically updates the queue capability.

Job dashboard

The upper panel covers the information about all jobs and provides a state table to display the job completion status. The panels below display the CPU, memory, and network resource usage. The negative axes refer to wasted cluster resources, which are allocated to pods (jobs) but not actually used during job running. These time series tables can provide resource insights to users in real time.

Cluster resource dashboard

Graphs show the usage of overall queue resources, including CPU and memory. For jobs that consume a large amount of resources, for example, 300 or 500 GiB of memory, users need to know whether there is any node that can run such jobs. Therefore, we need to display the resource usage of each node available.

Challenges and Solutions in High-Concurrency Scenarios

At Ruitian, a single cluster has reached 200 compute nodes, and long-running jobs (1 week) coexist with short jobs (1 minute). The total storage capacity is 1.5 PB, the read/write bandwidth is 15 GB/s, and 100,000 to 300,000 new pods are created every day. This scale brought the following challenges.

Challenge 1: Too Large Jobs

Issues:

The job object exceeds the Max Request Size (1.5 MB) of etcd when it contains a large number of pods.
Adjusting Max Request Size will impact etcd due to a large number of objects.

Solution:

Submit a job in the form of multiple replicas for a single task.
The information provided by ENV plug-ins in a pod is read in Sharding mode.

Challenge 2: Out of CPU/Memory

Issues:

There are a limited number of nodes, and a large number of short-term jobs keep being scheduled.
Kubelet PLEG is under great pressure, and the pod binding takes too long.
The default session interval of Volcano is 1s. As a result, cache snapshots are inconsistent.
Out of CPU + Out of Memory

Solution:

Add binding task numbers for nodes.
When a snapshot is being created for a session, the nodes whose binding task number is greater than 0 are skipped.
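The idea can be sketched as a filter applied while the session snapshot is built. This sketch assumes a node is skipped while it still has bindings in flight, that is, a positive count. The `Node` class and `snapshot_nodes` function are hypothetical names, not Volcano's scheduler cache API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    binding_tasks: int = 0  # pods assigned to this node whose binding has not finished


def snapshot_nodes(nodes: List[Node]) -> List[Node]:
    """Keep only nodes with no in-flight bindings for the next scheduling session."""
    return [node for node in nodes if node.binding_tasks == 0]


if __name__ == "__main__":
    cluster = [Node("node-1"), Node("node-2", binding_tasks=3), Node("node-3")]
    print([node.name for node in snapshot_nodes(cluster)])
```

Skipping busy nodes keeps the scheduler from placing more pods on a node whose real free capacity is not yet reflected in the cache.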

HPC on Volcano: How Containers Support HPC Applications in the Meteorological Industry

Tue, 24 Aug 2021 00:00:00 GMT

This article was first published on Container Cube on October 27, 2020. See HPC on Volcano：容器在气象行业HPC高性能计算场景的应用

Kubernetes has become the de facto standard for cloud native application orchestration and management. An increasing number of applications are being reconstructed or built to employ Kubernetes. High performance computing (HPC) is a popular distributed computing mode and is widely used in many fields. For users who have deployed HPC applications and are eager to containerize and manage their applications using Kubernetes, Volcano, CNCF's first distributed scheduling system for batch computing, is a good choice. Volcano supports multiple types of computing frameworks, such as Spark, TensorFlow, and Message Passing Interface (MPI). This article uses a traditional HPC application, the Weather Research and Forecasting (WRF) model, as an example to describe how Volcano works for HPC applications.

About HPC

HPC and HPCC are two common terms in the area of computing jobs. HPCC is short for high performance computer cluster, which integrates a large amount of computer software and hardware to conduct parallel computing on large computing jobs. HPC is widely used in CAE simulation, animation rendering, physics, chemistry, oil exploration, and life, meteorological, and environmental science.

An HPCC consists of three parts:

Portable Batch System (PBS): A resource manager that manages all node resources in a cluster. Other common resource management systems include Slurm and Platform Load Sharing Facility (or simply LSF).
Maui: A third-party job scheduler that supports multiple priority-based scheduling policies, resource reservations, and preemption mechanisms. Maui provides more advanced scheduling services than the default schedulers embedded in most resource managers.
Open MPI: An upper-layer communication environment that provides a communication library and compilation functions and starts distributed tasks.

PBS and Maui are imperceptible to users. Users only need to submit jobs in the mode defined by PBS and do not need to understand internal implementation details. However, users are required to learn how to use Open MPI to compile parallel computing applications.

The following uses mpirun -np 4 ./mpi_hello_world as an example to illustrate how an MPI job runs.

Write the source code using Open MPI or another MPI library. In this example, Hello World! is printed.
Use a compiler that supports MPI to compile the executable program mpi_hello_world.
Distribute mpi_hello_world to each node. You can also make mpi_hello_world accessible by sharing the file system.
Run mpirun to execute mpi_hello_world in parallel.

About WRF

The Weather Research and Forecasting (WRF) model is a common HPC application. WRF is a mesoscale numerical weather prediction (NWP) system designed for both atmospheric research and forecasting. It allows researchers to produce simulations based on real or hypothetical atmospheric conditions.

WRF consists of multiple modules with different processing flows. The following illustrates a WRF process.

As shown in the figure above, this WRF process has four parts:

External data sources
WRF Pre-Processing System (WPS)
WRF, which is the core simulation system
Post-processing system

External Data Sources

The WRF model data includes static geographical data and gridded data. Geographical data refers to geographical information in a domain, such as mountains, rivers, lakes, and forests. Gridded data refers to the meteorological environment data in a domain, such as temperature, wind speed, wind direction, air humidity, and rainfall.

WPS

WPS loads geographical and meteorological data, interpolates meteorological data to grids, and finally provides data input for the WRF. It consists of three main programs:

geogrid.exe: defines model projections, domain range, and nesting relationships, interpolates terrestrial parameters, and processes terrain and gridded data.
ungrib.exe: extracts required meteorological parameters from the GRIB data.
metgrid.exe: interpolates meteorological parameters to simulation domains.

The three programs work together to generate data used for meteorological simulation. Currently, the three programs do not support MPI parallel computing.

WRF

As the core module of the WRF model, WRF performs simulation and prediction based on the meteorological information generated by WPS. WRF consists of two main programs:

real.exe: initializes the actual meteorological data.
wrf.exe: simulates and predicts results.

real.exe and wrf.exe can run as MPI parallel jobs to improve the computing speed.

As shown in the preceding figure, wrfinput_d0X and wrfbdy_d0X are the calculation results generated by real.exe. wrf.exe performs meteorological simulation based on these results to generate the final result wrfout_dxx_yyyy-mm-dd_hh:mm:ss, which is verified and displayed by the post-processing system.

Post-Processing System

The post-processing system verifies and displays the calculation results generated by WRF. It consists of various third-party images and verification tools. The following figure shows the simulation and prediction results of the relative humidity in each area in CONUS 2.5km case.

CONUS 2.5km refers to the 2.5 km resolution case covering the Continental U.S. (CONUS) domain. (In this case, the entire domain is divided into multiple cubes of 2.5 km x 2.5 km x 2.5 km. The meteorological information in each cube is considered consistent.)

HPC on Volcano

As mentioned above, an HPCC consists of a resource manager, scheduler, and MPI parallel computing library. In the container context, Kubernetes functions as the resource manager and Volcano functions as the scheduler.

To run HPC applications in the Kubernetes+Volcano environment is to run HPC jobs in containers, as shown in the following figure.

Two types of containers are involved: master and worker. The master container runs the mpirun or mpiexec command, and the worker containers run computing jobs.

To support MPI jobs, Volcano has been enhanced to provide the following functions:

Multiple pod templates, which are used to define master and worker pods at the same time
Gang scheduling, which ensures that all pods in a job are simultaneously started
Mapping of host IP addresses of the master and worker pods
SSH password-free login between the master and worker pods
Job lifecycle management

The following is an example of running an MPI job on Volcano.

Define a Volcano MPI job by the mpi_sample.yaml file.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mpi-job
  labels:
    # Set the job type based on service requirements.
    "volcano.sh/job-type": "MPI"
spec:
  # Set the minimum number of required pods (no more than the total number of replicas).
  # For this example, set it to the total number of mpimaster and mpiworker replicas.
  minAvailable: 3
  # Specify Volcano as the scheduler.
  schedulerName: volcano
  plugins:
    # Configure SSH password-free authentication.
    ssh: []
    # Configure the network information, such as hosts file and headless Service, required for running the job.
    svc: []
  # Define a policy in which the entire MPI job will be restarted when a pod is evicted.
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: mpimaster
      # Define another policy in which the entire MPI job will be considered as complete when mpiexec execution completes.
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          # The Volcano-related information will be stored in the /etc/volcano directory.
          containers:
            # The master container will perform the following operations:
            # 1. Start the sshd service.
            # 2. Obtain the mpiworker container list from /etc/volcano/mpiworker.host.
            # 3. Run mpirun/mpiexec.
            - command:
                - /bin/sh
                - -c
                - |
                  MPI_HOST=`cat /etc/volcano/mpiworker.host | tr "\n" ","`;
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 mpi_hello_world;
              image: volcanosh/example-mpi:0.0.1
              imagePullPolicy: IfNotPresent
              name: mpimaster
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /home
              resources:
                requests:
                  cpu: "100m"
                  memory: "1024Mi"
                limits:
                  cpu: "100m"
                  memory: "1024Mi"
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
    - replicas: 2
      name: mpiworker
      template:
        spec:
          containers:
            # The worker containers will only start the sshd service.
            - command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: volcanosh/example-mpi:0.0.1
              imagePullPolicy: IfNotPresent
              name: mpiworker
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /home
              resources:
                requests:
                  cpu: "100m"
                  memory: "2048Mi"
                limits:
                  cpu: "100m"
                  memory: "2048Mi"
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
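To make the master command easier to follow, here is a rough Python equivalent of the step that turns the worker host file into the value passed to --host. The function name is hypothetical. Unlike the shell one-liner, this sketch drops the trailing comma that tr leaves behind.

```python
def build_host_arg(host_file_text: str) -> str:
    """Join the non-empty lines of a host file into a comma-separated host list."""
    hosts = [line.strip() for line in host_file_text.splitlines() if line.strip()]
    return ",".join(hosts)


if __name__ == "__main__":
    # Example content with two worker host names, one per line (illustrative names).
    sample = "mpi-job-mpiworker-0.mpi-job\nmpi-job-mpiworker-1.mpi-job\n"
    print(build_host_arg(sample))
```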

Submit the Volcano MPI job.

The job is executed.

Check the execution result of the master pod.

The preceding execution result shows that Volcano clears only the worker pods and retains the master pod after the job completes. In this way, you can run the kubectl command to obtain the execution result.

Note that there may be latency in the container network. When a job starts, the master pod may fail to connect to the worker pods. If this happens, Volcano will automatically restart the master pod to make the job run properly.

If you intend to use Volcano to run a WRF job, you need to replace mpi_hello_world with real.exe and wrf.exe and perform the following operations:

Build Docker images, which must include a complete WRF running environment.
Mount the data (original or intermediate data) required for calculation to the corresponding container.

In this way, you can run meteorological simulation jobs in the Kubernetes+Volcano environment.

How Does Volcano Empower a Content Recommendation Engine in Xiaohongshu

Tue, 24 Aug 2021 00:00:00 GMT

This article was first published on Container Cube on May 27, 2021. See 小红书基于Volcano的大规模离线与在线推荐模型训练实践

Introduction to Xiaohongshu

Xiaohongshu is a leading life-sharing community in China. Popular among female users and increasingly among young men, Xiaohongshu now has more than 100 million monthly active users. This UGC community receives hundreds of thousands of new notes and nearly 10 billion views every day.

The recommendation on the homepage is in the charge of our recommendation team and is one of the core service scenarios of Xiaohongshu. In the first years when Xiaohongshu was established, all of the recommended notes were manually selected without assistance of any machine learning models. As a result, we recommended the same content to almost every user.

In 2016, we began to explore personalized recommendation for different users. In 2018, we introduced our first machine learning recommendation model, based on SparkML and GBDT, with only tens of thousands of parameters. From the end of 2018, we accelerated model iteration. By the second half of 2020, our models had reached hundreds of billions of parameters. We also introduced online learning, which allowed the model to be updated within hours. Since April this year, the model has been updated every few minutes, so it can capture users' behavior within one or two minutes, learn their short-term interests, and generate more appealing recommendations.

Big Data Architecture in Xiaohongshu Search, Recommendation, and Ad Scenarios

The architecture consists of four parts. The upper left corner shows the interaction between the client and the real-time service/tracking data service. After being started, the Xiaohongshu app requests an online service for recommendation. The online service caches the recommended notes and requested features, and returns the recommendation results to the client. When the user browses the notes recommended to him/her, a series of interaction behaviors are generated. The interaction behaviors become data flows that pass through the tracking data service and go to the original tracking data flow.

In the lower left corner, attribution and summary tasks clean and process user behavior data in real time to generate the label data flow. The label data flow and the feature data flow are combined to generate training samples and the three major big data products of Xiaohongshu: training data, the OLAP database, and offline Hive tables.

The upper right corner shows online and offline training. Online training trains the data in real time to generate the updated data of the model. Offline training generates a full model and uploads it to the online service.

Behavior Attributions and Labels

This task for user behaviors can be divided into two parts: attribution and label. Attribution is to associate each behavior captured in data tracking with the past behaviors of the user. For example, you browse 4 notes in the app, one found on the Discovery page, one on the search results page, and two on a blogger's homepage. You click Like for the final piece. Your browsing and clicking Like are tracked. The tracking data does not tell us what happened before the like is given.

We can determine the cause of the like behavior based on the user behavior flow and user's historical browsing records prior to the like. This process is called attribution. Through the attribution task, we can also add labels about why a user follows a blogger, and which of the blogger's notes the user has viewed before following the blogger.

Opposite to attribution, label calculation summarizes the actions performed by a user after a certain behavior. If the user browsed four notes on the Discovery page, for each note Xiaohongshu makes several labels about whether the user liked the notes after browsing, or whether the user tapped to enter the note details page and how long the user stayed on this page. This label data is important for subsequent model training and the generation of daily user reports.
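The two directions can be illustrated with a toy event list. This is a minimal sketch under assumed field names (`type`, `note_id`, `page`); the production tasks run in Spark and Flink and are certainly richer than this.

```python
from typing import Dict, List, Optional, Tuple

Event = Dict[str, str]


def attribute_like(events: List[Event], like_index: int) -> Optional[str]:
    """Attribution: look backward for the page where the liked note was last viewed."""
    like = events[like_index]
    for event in reversed(events[:like_index]):
        if event["type"] == "view" and event["note_id"] == like["note_id"]:
            return event["page"]
    return None


def liked_after_view(events: List[Event]) -> List[Tuple[str, bool]]:
    """Label calculation: for each view, look forward for a like on the same note."""
    labels = []
    for i, event in enumerate(events):
        if event["type"] != "view":
            continue
        liked = any(later["type"] == "like" and later["note_id"] == event["note_id"]
                    for later in events[i + 1:])
        labels.append((event["note_id"], liked))
    return labels


if __name__ == "__main__":
    session = [
        {"type": "view", "note_id": "n1", "page": "discovery"},
        {"type": "view", "note_id": "n2", "page": "search"},
        {"type": "view", "note_id": "n3", "page": "blogger_homepage"},
        {"type": "view", "note_id": "n4", "page": "blogger_homepage"},
        {"type": "like", "note_id": "n4", "page": "blogger_homepage"},
    ]
    print(attribute_like(session, 4))
    print(liked_after_view(session))
```

The sample session mirrors the example of four browsed notes with a like on the last one.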

Real-Time Big Data Products for Search, Recommendation, and Ad

After the label data is generated, the above three big data products are provided for the service.

The model training data is used to train models in real time, and provide more accurate and real-time information about users' latest interests.

Both ad hoc data analysis and the offline data warehouse perform analysis based on the label data. Ad hoc data analysis is real-time. For example, after a change to the system or a policy, its effects should be observable immediately across multiple dimensions. In contrast, the offline data warehouse provides daily or weekly reports, or shows how certain metrics have changed over the past six months.

Online and Offline Model Training

Training data generated after the combination of label data and feature data is used for both offline and online training.

Though the same data source is used, there are differences between online and offline training. For online training, the data source is provided to Kafka for online consumption. After that, a model update data flow is output, which actually means that the last batch of model changes is released online in real time. Offline training is performed on a batch and daily basis. A full migration parameter model is released, and data is migrated back to the online service.

Evolution from Offline Batch Computing to Online Stream Computing

Offline Batch Only

The preceding figure shows the earliest offline batch label calculation process. Click behaviors of users are collected and recorded in the ODS table, that is, the original log table. The attribution and label calculation are performed by the Spark task, which is an offline batch task. The processed data forms a data warehouse table to generate daily reports and experiment reports as big data products. In the batch environment, the report is generated on a T+1 cycle. Generally, we cannot obtain the complete reports until the second or third day after each experiment.

Offline Batch + Online Streaming

Increase in the number of developers poses higher requirements on service implementation. Therefore, we introduced the real-time link, which is completely based on the Flink framework. The real-time link inputs data through streaming Kafka, outputs the data to Kafka, and sends the data to the OLAP database and real-time experiment analysis. However, the challenge is that Spark and Flink are two different programming frameworks. For example, the logic for determining whether a click on an ad is valid is complex, because an interaction behavior or a stay of at least three seconds after the click is required before the click can be called a valid click.

If there are two data flows, the logic is implemented both in the offline service and the Flink framework. Many problems may occur when the same logic is implemented twice. One problem is that development needs to be repeated in both the Spark and Flink frameworks. A bigger problem is that the logic may be changed after being developed in both of the frameworks.
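For a sense of the kind of rule that would otherwise be written twice, the valid-click check described above can be sketched as a single function. The parameter names are assumptions; only the three-second threshold and the "interaction or stay" condition come from the text.

```python
def is_valid_click(stay_seconds: float, interactions_after_click: int,
                   min_stay_seconds: float = 3.0) -> bool:
    """A click counts as valid if the user interacted afterwards or stayed long enough."""
    return interactions_after_click > 0 or stay_seconds >= min_stay_seconds


if __name__ == "__main__":
    print(is_valid_click(stay_seconds=1.2, interactions_after_click=0))
    print(is_valid_click(stay_seconds=1.2, interactions_after_click=1))
    print(is_valid_click(stay_seconds=3.0, interactions_after_click=0))
```

Keeping a rule like this in one place is what removes the risk of the Spark and Flink versions drifting apart.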

For some complex scenarios, the change may cause inconsistency between reports and requests. In other scenarios, the data warehouse makes a request offline and the logic is implemented only in the Spark task, not in Flink. If we then want to view the task in real time, we need to implement the logic again, which causes extra workload.

After the upgrade we made, all new labels are calculated in real time, not in Spark. However, interruption may occur in the real time mode. After the interruption, calculation may start from the latest data, and the earlier data may have changed. This problem is simple to solve in the offline mode, as we can re-run data of each hour to form the complete data.

We actually need to solve a technical problem: How to convert the real-time Flink training task from an offline data warehouse table to a real-time flow table, use the same computing logic to generate new data and then backfill the new data to the real-time flow table? In this way, the core logic of the real time and offline modes only needs to be implemented in real time, which solves the problem of inconsistent logics between the two developments.

Offline Training

The preceding figure shows the training process of the machine learning model. At the earliest, there was only one batch data calculation task. Feature data and user behavior data are stored in offline tables. The Spark task is used to combine the data to generate a training task, and then release the learning model. The entire process may be performed once a day.

Online + Offline Training

To capture users' real-time interests and make judgment on newly released notes more quickly, more real-time update is required for our models. Therefore, Flink is used for real-time model training. After Flink generates data, the Volcano scheduling system is used to update models in real time and in batches.

Optimization and Multi-cloud Management of Offline Training

The preceding figure shows the technology stack of Xiaohongshu in machine learning and big data. Xiaohongshu is a relatively new company without on-premises equipment rooms. All of our services are deployed on clouds provided by cloud vendors, and most of the services are managed through Kubernetes.

We have two important platforms. One is the stream computing platform called Baichuan, which is used to manage the Flink tasks of real-time label computing and online learning mentioned above. The other is the task management platform for machine learning, which is called Firefly. Our model training is based on TensorFlow and runs on the machine learning platform. For sparse and discrete large-scale model training of recommendation, search and advertisement, we also developed a TensorFlow-based training framework, LarC. The framework models of TensorFlow and LarC run on the machine learning platform through Firefly.

The key point between these two platforms is how to schedule tasks to the Kubernetes clusters. In fact, the native Kubernetes has a big problem in this scenario, because it performs scheduling based on individual pods.

However, stream computing and machine learning tasks are not single-pod tasks; they run on a group of pods. We therefore encountered a big challenge at the beginning. Assume that there are two jobs, each containing 10 pods, and each pod requires one CPU core. That is, 20 cores are required for the two jobs to run simultaneously. If the cluster has only 15 available cores and we use the native Kubernetes scheduler, the scheduler may allocate 7 cores to one job and 8 cores to the other, so that both jobs obtain some resources. However, neither job can complete because neither has received the 10 cores it requires. As a result, a deadlock occurs. This is caused by the limits of native Kubernetes scheduling.

To solve this, we need to first schedule 10 cores to one job to ensure that it can be properly completed and exited. After that, the 10 cores are released and allocated to another job so that both jobs can be properly completed.
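The difference between the two behaviors can be reproduced with a toy allocator. This is a minimal sketch with hypothetical function names; it models neither the Kubernetes scheduler nor Volcano, only the arithmetic of the 15-core example.

```python
from typing import Dict, List


def schedule_per_pod(jobs: Dict[str, int], free_cores: int) -> Dict[str, int]:
    """Hand out one core at a time across jobs, with no notion of job groups."""
    allocated = {name: 0 for name in jobs}
    progress = True
    while free_cores > 0 and progress:
        progress = False
        for name, needed in jobs.items():
            if free_cores > 0 and allocated[name] < needed:
                allocated[name] += 1
                free_cores -= 1
                progress = True
    return allocated


def schedule_gang(jobs: Dict[str, int], free_cores: int) -> Dict[str, int]:
    """Admit a job only if all of its pods fit; otherwise give it nothing yet."""
    allocated = {}
    for name, needed in jobs.items():
        if needed <= free_cores:
            allocated[name] = needed
            free_cores -= needed
        else:
            allocated[name] = 0
    return allocated


def runnable(jobs: Dict[str, int], allocated: Dict[str, int]) -> List[str]:
    """Jobs that received every core they need."""
    return [name for name, needed in jobs.items() if allocated[name] == needed]


if __name__ == "__main__":
    jobs = {"job-a": 10, "job-b": 10}
    print(runnable(jobs, schedule_per_pod(jobs, 15)))
    print(runnable(jobs, schedule_gang(jobs, 15)))
```

With pod-by-pod allocation the 15 cores are split 8 and 7 and no job can run. With all-or-nothing admission, job-a runs to completion first and job-b can be admitted once its cores are released.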

Based on these requirements, we evaluated available products and found Volcano. Its predecessor is kube-batch, and it completely meets our requirements. We therefore participated in the kube-batch community and became a loyal user of Volcano.

Enhanced Volcano Scheduling: binpack -> task-topology

The scheduling algorithm supported by native Volcano is binpack. Machine learning training tasks are classified into two types: worker, a compute-intensive task that performs forward and backward computation, and ps (parameter server), a memory-intensive service whose main task is storing parameters. With the native open-source Volcano, the default scheduling algorithm optimizes resource usage to reduce fragmentation. It therefore schedules as many tasks as possible to the same node, placing all worker and ps tasks of a job on that node. When that node does not have capacity for one of the ps tasks (ps1 in this example), that task can only be placed on another node.

In this scenario, workers and ps0 are on the same node. The I/O between them does not cross nodes, leading to fast I/O and large storage capacity. But the running speed is slow because ps1 is on another node.

With the task-topology algorithm, tasks are scheduled to nodes in a balanced manner, the speed is balanced, and the overall storage capacity is greatly improved. The optimization from binpack to task-topology can increase the throughput of task training by 10% to 20%.
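The contrast between the two placements can be shown with a toy model. This sketch is a deliberate simplification, not the scoring used by Volcano's binpack or task-topology plugins. It assumes two nodes that each hold three equally sized tasks, and it "balances" by putting each task on the node that holds the fewest tasks of the same role.

```python
from typing import Callable, Dict, List

Placement = Dict[str, List[str]]


def role(task: str) -> str:
    """Strip the trailing index: 'worker0' -> 'worker', 'ps1' -> 'ps'."""
    return task.rstrip("0123456789")


def binpack_choice(task: str, candidates: List[str], placement: Placement) -> str:
    """Prefer the fullest node that still has room (ties go to the first node)."""
    return max(candidates, key=lambda node: len(placement[node]))


def balanced_choice(task: str, candidates: List[str], placement: Placement) -> str:
    """Prefer the node with the fewest tasks of the same role (ties go to the first node)."""
    return min(candidates,
               key=lambda node: sum(role(t) == role(task) for t in placement[node]))


def place(tasks: List[str], nodes: List[str], capacity: int,
          choose: Callable[[str, List[str], Placement], str]) -> Placement:
    placement: Placement = {node: [] for node in nodes}
    for task in tasks:
        candidates = [node for node in nodes if len(placement[node]) < capacity]
        placement[choose(task, candidates, placement)].append(task)
    return placement


if __name__ == "__main__":
    tasks = ["worker0", "worker1", "ps0", "ps1"]
    nodes = ["node-a", "node-b"]
    print(place(tasks, nodes, 3, binpack_choice))
    print(place(tasks, nodes, 3, balanced_choice))
```

The packed placement leaves ps1 alone on the second node, as in the scenario described above, while the balanced placement gives each node one worker and one ps.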

Data Transfer Between Multiple Clouds

In the online mode, users are distributed to different AZs. The feature cache of the recommendation service is stored in the local AZ. After user data tracking, users are distributed to different clusters based on their requests, and the label system performs computing for each user. Finally, all label system computing is transferred to the cloud vendors that provide offline training and services for us to combine and generate data training, and perform distributed model training. The trained models are distributed to different AZs for online services.

How do we implement transfer learning under this architecture? Xiaohongshu users consume traffic on the homepage, a scenario in which a large amount of data is generated and accumulated and a model with hundreds of billions of parameters is trained. How do we use this large recommendation model for the search and advertisement scenarios? After the recommendation model is trained, it is synchronized to the search training cluster. The search training uses the search data to build on the recommendation model, and releases the final search model online. In this way, small-scale data training can obtain the features learned by the large recommendation model, so that the large recommendation model can be utilized by the search scenario.
