Scheduling Gates Queue Admission User Guide

Overview

This page describes how to enable and use the SchedulingGatesQueueAdmission feature to prevent cluster autoscalers (such as Cluster Autoscaler or Karpenter) from triggering unnecessary scale-ups when pods are waiting for Volcano queue capacity.

Problem

Volcano marks pods as Unschedulable for any allocation failure, whether it’s due to insufficient cluster resources (where autoscaling is appropriate) or queue capacity limits (where autoscaling is not needed). Cluster autoscalers cannot distinguish between these scenarios, causing unnecessary node scale-ups.

The problem is described in detail in the design document.

Solution

This feature uses Kubernetes schedulingGates to hold pods until the queue has capacity. While gated, pods are invisible to autoscalers. The gate is removed only after the queue capacity check passes. If the pod then cannot be scheduled because no suitable node exists, it is marked as Unschedulable, which lets autoscalers respond correctly.

Prerequisites

  • Volcano v1.15+ with the SchedulingGatesQueueAdmission feature gate enabled.
  • The capacity plugin configured in the scheduler (the feature is currently implemented only in the capacity plugin; integration with the proportion plugin is planned).

1. Enable the Feature Gate

The feature is Alpha and disabled by default. Enable it on both the scheduler and webhook-manager.

Using Helm

helm install volcano volcano/volcano --namespace volcano-system --create-namespace \
  --set custom.scheduler_feature_gates="SchedulingGatesQueueAdmission=true" \
  --set custom.admission_feature_gates="SchedulingGatesQueueAdmission=true"

Using kubectl apply

Add the following flag to both the volcano-scheduler and volcano-admission deployments:

--feature-gates=SchedulingGatesQueueAdmission=true

Optionally, configure the number of async gate removal workers (default: 5):

--gate-removal-worker-num=10

These workers process gate removals asynchronously: each worker picks up a pod whose queue capacity check has passed and removes its scheduling gate, allowing the pod to proceed to scheduling. Increasing this number can improve throughput when many pods are ungated concurrently.
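To confirm that the flag is present on both components, one option is to inspect the container arguments of each deployment. The following is a minimal sketch, not an official Volcano tool. It assumes the volcano-system namespace from the Helm command and that the flag is passed in the first container's args; the function names are hypothetical.

```shell
# Hypothetical helper: succeeds if the text on stdin contains the enabled gate.
# The flag may list several gates, so the gate name can appear anywhere
# after --feature-gates=.
has_queue_admission_gate() {
  grep -Eq -- '--feature-gates=[^" ]*SchedulingGatesQueueAdmission=true'
}

# Hypothetical wrapper: reports the status for both deployments.
# Assumption: namespace volcano-system, flag in the first container's args.
check_volcano_feature_gate() {
  for deploy in volcano-scheduler volcano-admission; do
    if kubectl -n volcano-system get deployment "$deploy" \
         -o jsonpath='{.spec.template.spec.containers[0].args}' \
         | has_queue_admission_gate; then
      echo "$deploy: enabled"
    else
      echo "$deploy: not enabled"
    fi
  done
}
```

Running check_volcano_feature_gate against a cluster prints one line per deployment. If your installation passes flags through the container command rather than args, adjust the jsonpath expression accordingly.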

2. Configure the Capacity Plugin

Ensure the capacity plugin is enabled in your scheduler configuration. The reserved resource tracking that prevents race conditions between gate removal and pod allocation is implemented in this plugin.

Example scheduler configuration:

actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
- plugins:
  - name: predicates
  - name: capacity
  - name: nodeorder
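As a quick sanity check, the configuration can be scanned for the capacity plugin entry. This is a rough sketch with a hypothetical helper name. The ConfigMap name volcano-scheduler-configmap and the volcano-system namespace in the usage comment are assumptions about a default installation and may differ in yours.

```shell
# Hypothetical helper: succeeds if a scheduler configuration read from stdin
# contains a "- name: capacity" plugin entry (any indentation).
has_capacity_plugin() {
  grep -Eq '^[[:space:]]*-[[:space:]]*name:[[:space:]]*capacity[[:space:]]*$'
}

# Usage against a live cluster (ConfigMap name and namespace are assumptions):
#   kubectl -n volcano-system get configmap volcano-scheduler-configmap -o yaml \
#     | has_capacity_plugin && echo "capacity plugin configured"
```

The check only matches the plugin name on its own line, so it will not detect a configuration that is stored as a single escaped string.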

3. Opt-in Pods

The feature is opt-in per pod. To use it, add the following annotation to each pod that should use gate-controlled queue admission:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  annotations:
    # Opt-in annotation
    scheduling.volcano.sh/queue-allocation-gate: "true"
spec:
  schedulerName: volcano
  containers:
  - name: worker
    image: nginx
    resources:
      requests:
        cpu: "1"
        memory: "1Gi"

When this pod is created:

  1. The Volcano webhook injects a scheduling.volcano.sh/queue-allocation-gate scheduling gate.
  2. The pod stays gated (invisible to autoscalers) until the queue has capacity.
  3. Once capacity is available, the scheduler removes the gate.
  4. If the pod can be placed on a node, it gets scheduled normally.
  5. If no node matches (e.g., the pod needs a specific node type that is not yet available), it is marked Unschedulable, correctly triggering the autoscaler.
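The steps above can be mapped to what is observable on the pod: its scheduling gates and the reason on its PodScheduled condition. The sketch below is illustrative only, with hypothetical function names and state labels. It relies on standard Kubernetes behavior of reporting SchedulingGated as the PodScheduled reason while a gate is present, and assumes the pod is in the current namespace.

```shell
# Hypothetical helper: map observable pod fields to a lifecycle step.
#   $1: space-separated scheduling gate names (may be empty)
#   $2: reason of the PodScheduled condition (may be empty)
admission_state() {
  gates=$1
  reason=$2
  case " $gates " in
    *" scheduling.volcano.sh/queue-allocation-gate "*)
      echo "gated: waiting for queue capacity"
      return ;;
  esac
  case "$reason" in
    Unschedulable) echo "ungated: unschedulable, autoscaler may react" ;;
    "")            echo "ungated: scheduled or not yet attempted" ;;
    *)             echo "ungated by Volcano: PodScheduled reason is $reason" ;;
  esac
}

# Hypothetical wrapper: read both values from the cluster for one pod.
pod_admission_state() {
  admission_state \
    "$(kubectl get pod "$1" -o jsonpath='{.spec.schedulingGates[*].name}')" \
    "$(kubectl get pod "$1" -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].reason}')"
}
```

For example, pod_admission_state my-pod would be expected to report the gated state right after creation and one of the ungated states once the queue has capacity.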

4. Verify the Feature is Working

After creating an opted-in pod, verify the gate was injected by the mutating webhook:

kubectl get pod my-pod -o jsonpath='{.spec.schedulingGates}'

Expected output (while waiting for queue capacity):

[{"name":"scheduling.volcano.sh/queue-allocation-gate"}]

Once the queue has capacity and the scheduler removes the gate, the field will be empty:

kubectl get pod my-pod -o jsonpath='{.spec.schedulingGates}'
# empty output

Interaction with Other Scheduling Gates

If a pod has additional scheduling gates from other controllers (e.g., example.com/my-gate), Volcano removes its own gate only once it is the last gate remaining on the pod. This ensures Volcano does not interfere with other gate controllers and avoids reserving queue capacity for pods that are still blocked by external dependencies.
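When a pod stays gated longer than expected, it can help to check whether another controller's gate is still present. The helpers below are a small sketch of that check, not Volcano's implementation; the function and variable names are hypothetical. They operate on the space-separated gate names that kubectl prints for the jsonpath expression {.spec.schedulingGates[*].name}.

```shell
VOLCANO_GATE="scheduling.volcano.sh/queue-allocation-gate"

# Hypothetical helper: given space-separated gate names, succeed only when
# the Volcano gate is the sole remaining gate.
only_volcano_gate_remains() {
  [ "$1" = "$VOLCANO_GATE" ]
}

# Hypothetical helper: print the gates, other than Volcano's, still on the pod.
other_gates() {
  for g in $1; do
    [ "$g" = "$VOLCANO_GATE" ] || echo "$g"
  done
}

# Usage against a live cluster:
#   gates=$(kubectl get pod my-pod -o jsonpath='{.spec.schedulingGates[*].name}')
#   only_volcano_gate_remains "$gates" || other_gates "$gates"
```

Any name printed by other_gates belongs to a gate that its own controller must remove before Volcano considers the pod for queue admission.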

Limitations

  • Once a pod’s gate is removed, it reserves queue capacity until it is scheduled or deleted. If the pod remains unschedulable (e.g., waiting for the autoscaler to add nodes), it continues to hold queue capacity, potentially blocking other pods. The feature currently has no timeout for reserved capacity, so pods that have been ungated but remain unschedulable can hold queue capacity indefinitely.
  • The feature is only implemented in the capacity plugin. Users relying on the proportion plugin for queue resource management will still see unnecessary autoscaler scale-ups, as the scheduling gates mechanism is not yet integrated with proportion. Tracking issue: #5271.
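Because reserved capacity has no timeout, operators may want to look periodically for opted-in pods that are ungated but still Unschedulable. The sketch below is one possible way to do that with kubectl and awk. It is not part of Volcano; the function names are hypothetical, and it assumes the opt-in annotation stays on the pod after the gate is removed.

```shell
# Hypothetical filter: reads tab-separated lines of
#   namespace, name, opt-in annotation, gate names, PodScheduled reason
# and prints opted-in pods that have no gates left but are Unschedulable.
holding_capacity() {
  awk -F '\t' '$3 == "true" && $4 == "" && $5 == "Unschedulable" { print $1 "/" $2 }'
}

# Hypothetical wrapper: build that input for all namespaces.
list_pods_holding_capacity() {
  kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.scheduling\.volcano\.sh/queue-allocation-gate}{"\t"}{.spec.schedulingGates[*].name}{"\t"}{.status.conditions[?(@.type=="PodScheduled")].reason}{"\n"}{end}' \
    | holding_capacity
}
```

Pods listed by list_pods_holding_capacity are candidates for review: each one may be holding queue capacity while it waits for a node, and deleting it releases that capacity.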