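Quick Start

This section walks you through getting started with Volcano, from deploying a basic Volcano Job or Deployment to integrating workloads with Volcano queues.

Prerequisites

You need a Kubernetes cluster with the Volcano components installed. If Volcano is not installed yet, see the installation documentation.

Quick start: deploy a Volcano Job

This guide walks you through deploying a simple Volcano Job. If no queue is specified, a Volcano Job uses the default queue.

Step 1: Create a Volcano Job

Create a file named vcjob-quickstart.yaml with the following content: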

# vcjob-quickstart.yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: quickstart-job
spec:
  minAvailable: 3
  schedulerName: volcano
  # If the 'queue' field is omitted, the 'default' queue is used.
  # queue: default
  policies:
    # If a Pod fails (for example, because of an application error), restart the whole job.
    - event: PodFailed
      action: RestartJob
  tasks:
    - replicas: 3
      name: completion-task
      policies:
      # When this task completes successfully, mark the whole job as Completed.
      - event: TaskCompleted
        action: CompleteJob
      template:
        spec:
          containers:
            - command:
              - sh
              - -c
              - 'echo "Job is running and will complete!"; sleep 100; echo "Job done!"'
              image: busybox:latest
              name: busybox-container
              resources:
                requests:
                  cpu: 1
                limits:
                  cpu: 1
          restartPolicy: Never

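After you apply the file (for example, with kubectl apply -f vcjob-quickstart.yaml), this Job creates three Pods and schedules them as a group. The Pod template runs a simple busybox container that sleeps for 100 seconds. When the Pods complete, the Job's phase also becomes Completed.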

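Step 2: Monitor the Job and Pod status

You can follow the progress of the Volcano Job and its Pods.

First, check the status of the Volcano Job. You should see output similar to the following (the exact timestamps and UIDs will differ):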

# kubectl get vcjob quickstart-job -oyaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  # ... (metadata details) ...
  name: quickstart-job
  namespace: default
  # ...
spec:
  maxRetry: 3
  minAvailable: 3
  policies:
  - action: RestartJob
    event: PodFailed
  queue: default
  schedulerName: volcano
  tasks:
  - maxRetry: 3
    minAvailable: 3
    name: completion-task
    policies:
    - action: CompleteJob
      event: TaskCompleted
    replicas: 3
    template:
      metadata: {}
      spec:
        containers:
        - command:
          - sh
          - -c
          - echo "Job is running and will complete!"; sleep 100; echo "Job done!"
          image: busybox:latest
          name: busybox-container
          resources:
            limits:
              cpu: "1"
            requests:
              cpu: "1"
        restartPolicy: Never
status:
  conditions:
  - lastTransitionTime: "2025-05-28T08:39:22Z"
    status: Pending
  - lastTransitionTime: "2025-05-28T08:39:23Z"
    status: Pending
  - lastTransitionTime: "2025-05-28T08:39:27Z"
    status: Pending
  - lastTransitionTime: "2025-05-28T08:39:28Z"
    status: Pending
  - lastTransitionTime: "2025-05-28T08:39:30Z"
    status: Running
  minAvailable: 3
  running: 3
  state:
    lastTransitionTime: "2025-05-28T08:39:30Z"
    phase: Running
  taskStatusCount:
    completion-task:
      phase:
        Running: 3

Next, check the status of the Pods created by the Volcano Job:

kubectl get pod -l volcano.sh/job-name=quickstart-job

Once started, the Pods are in the Running state. After about 100 seconds the busybox container exits and each Pod's status changes to Completed:

NAME                               READY   STATUS      RESTARTS   AGE
quickstart-job-completion-task-0   0/1     Completed   0          3m59s
quickstart-job-completion-task-1   0/1     Completed   0          3m59s
quickstart-job-completion-task-2   0/1     Completed   0          3m59s

Once the Pods complete, the TaskCompleted policy in the Volcano Job triggers the CompleteJob action, which moves the job's phase to Completed:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  # ... (metadata details) ...
  name: quickstart-job
  namespace: default
  # ...
status:
  #...
  minAvailable: 3
  runningDuration: 1m49s
  state:
    lastTransitionTime: "2025-05-28T08:41:11Z"
    phase: Completed
  version: 3
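
Rather than re-reading the full YAML, the phase can be read directly from .status.state.phase, the field shown in the output above. The following is a minimal sketch, assuming kubectl is configured for the cluster; job_phase and wait_for_phase are hypothetical helper names for illustration, not part of Volcano or kubectl.

```shell
# Print the current phase of a Volcano Job (for example Pending, Running, Completed).
job_phase() {
  kubectl get vcjob "$1" -o jsonpath='{.status.state.phase}'
}

# Poll until the job reaches the wanted phase or the attempts run out.
# Arguments: job name, wanted phase, max attempts (default 60), delay in seconds (default 5).
wait_for_phase() {
  local job=$1 want=$2 tries=${3:-60} delay=${4:-5} i=0
  while [ "$i" -lt "$tries" ]; do
    [ "$(job_phase "$job")" = "$want" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage: wait_for_phase quickstart-job Completed
```

The function returns a non-zero status if the phase is not reached in time, so it can gate later steps in a script.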

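Deploying standard Kubernetes workloads (Deployment, StatefulSet, and so on)

Volcano integrates with standard Kubernetes workloads such as Deployment and StatefulSet and extends their scheduling capabilities, so you can use advanced Volcano features such as gang scheduling. With gang scheduling, you specify a minimum number of Pods that must be schedulable together as a group before any Pod of the workload is allowed to start.

Step 1: Create a Deployment with the group-min-member annotation

Let's create a Deployment that wants 3 replicas but requires that at least 2 Pods can be scheduled by Volcano as a group.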

# deployment-with-minmember.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-deployment
  annotations:
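    # Essential for gang scheduling: this annotation tells Volcano to treat the Deployment as a group
    # and to require that at least 2 Pods can be scheduled together before any Pod starts.
    scheduling.volcano.sh/group-min-member: "2"
    # Optional: you can also assign the PodGroup created for this Deployment to a specific Volcano queue.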
    # scheduling.volcano.sh/queue-name: "my-deployment-queue"
  labels:
    app: my-app
spec:
  replicas: 3 # We want 3 replicas of the application
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      schedulerName: volcano # Required: makes the Pods of this Deployment use the Volcano scheduler
      containers:
        - name: my-container
          image: busybox
          command: ["sh", "-c", "echo 'Hello Volcano from Deployment'; sleep 3600"] # A long-running command, for demonstration
          resources:
            requests:
              cpu: 1
            limits:
              cpu: 1

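Step 2: Observe the automatically created PodGroup and Pods

When you apply a Deployment (or StatefulSet) that carries the scheduling.volcano.sh/group-min-member annotation, for example with kubectl apply -f deployment-with-minmember.yaml, Volcano automatically creates a PodGroup resource. This PodGroup enforces the gang scheduling constraint for the Pods that belong to the workload.

Check the status of the PodGroup:

kubectl get pg podgroup-[ReplicaSet UID] -oyaml

You should see output similar to the following: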

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  # ...
  name: podgroup-09e95eb0-e520-4b50-a15c-c14cad844674
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: my-app-deployment-74644c8849
    uid: 09e95eb0-e520-4b50-a15c-c14cad844674
  # ...
spec:
  minMember: 2
  minResources:
    count/pods: "2"
    cpu: "2"
    limits.cpu: "2"
    pods: "2"
    requests.cpu: "2"
  queue: default
status:
  conditions:
  - lastTransitionTime: "2025-05-28T09:08:13Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: e0b1508e-4b77-4dea-836f-0b14f9ca58df
    type: Scheduled
  phase: Running
  running: 3
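
In the output above, the PodGroup name is the prefix podgroup- followed by the UID of the owning ReplicaSet. One way to build that name without copying the UID by hand is sketched below. It assumes kubectl is configured for the cluster and that the label selector matches exactly one ReplicaSet; podgroup_name is a hypothetical helper name.

```shell
# Print the name of the PodGroup created for a workload's ReplicaSet.
# Argument: a label selector that matches exactly one ReplicaSet, e.g. app=my-app.
# Assumption: the first matching ReplicaSet is the one that owns the PodGroup.
podgroup_name() {
  local uid
  uid=$(kubectl get rs -l "$1" -o jsonpath='{.items[0].metadata.uid}')
  echo "podgroup-$uid"
}

# Usage: kubectl get pg "$(podgroup_name app=my-app)" -o yaml
```

If a rollout leaves several ReplicaSets with the same label, the first item may not be the active one, so check the result against kubectl get rs.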

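The Volcano scheduler ensures that at least minMember Pods (2 in this example) can be scheduled together before it lets any Pod of this Deployment start. If the cluster lacks the resources for those Pods, they remain in the Pending state.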
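
The all-or-nothing rule can be illustrated with a little arithmetic. The sketch below is a deliberately simplified model, not Volcano's scheduling algorithm: it treats the cluster as a single pool of whole CPUs and ignores per-node placement, memory, and other plugins. gang_fits is a hypothetical helper name.

```shell
# Simplified model: a gang can start only if min_member Pods,
# each requesting cpu_per_pod whole CPUs, fit into free_cpu.
gang_fits() {
  local free_cpu=$1 min_member=$2 cpu_per_pod=$3
  [ $((min_member * cpu_per_pod)) -le "$free_cpu" ]
}

# With minMember 2 and 1 CPU per Pod (the Deployment in this guide):
if gang_fits 1 2 1; then echo "start"; else echo "pending"; fi   # prints "pending"
if gang_fits 2 2 1; then echo "start"; else echo "pending"; fi   # prints "start"
```

With only 1 free CPU, not even one Pod starts, although a single Pod would fit on its own. That is the difference from default Pod-by-Pod scheduling.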

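Deploying workloads with a custom queue

Step 1: Create a custom queue

Let's create a queue named development-queue and give it a specific CPU capability. Jobs assigned to this queue compete for resources within the capability the queue defines.

Create a file named queue.yaml: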

# queue.yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: development-queue
spec:
  weight: 1 # Relative weight of this queue's scheduling priority among queues
  reclaimable: false # If true, jobs in other queues can reclaim resources from this queue
  capability:
    cpu: 2

Create the queue in the cluster:

kubectl create -f queue.yaml

The new queue is created and enters the Open state:

# kubectl get queue development-queue -oyaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  # ...
  name: development-queue
  # ...
spec:
  capability:
    cpu: 2
  parent: root
  reclaimable: false
  weight: 1
status:
  allocated:
    cpu: "0"
    memory: "0"
  state: Open

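Step 2: Create a Volcano Job that uses the custom queue

Now let's create a Volcano Job that explicitly uses development-queue.

Create a file named vcjob-with-queue.yaml with the following content and apply it with kubectl apply -f vcjob-with-queue.yaml: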

# vcjob-with-queue.yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-with-custom-queue
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: development-queue # Assign this job to our custom queue
  tasks:
    - replicas: 1
      name: custom-queue-task
      policies:
      - event: TaskCompleted
        action: CompleteJob
      template:
        spec:
          containers:
            - command:
              - sh
              - -c
              - 'echo "Running in custom queue"; sleep 100; echo "Done!"'
              image: busybox:latest
              name: busybox-in-queue
              resources:
                requests:
                  cpu: 1
                limits:
                  cpu: 1
          restartPolicy: Never

Step 3: Check the status of the custom queue

You can monitor the custom queue to see how many resources have been allocated:

kubectl get queue development-queue -oyaml

Expected output:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  # ...
  name: development-queue
  # ...
spec:
  capability:
    cpu: 2
  parent: root
  reclaimable: false
  weight: 1
status:
  allocated:
    cpu: "1"
    memory: "0"
    pods: "1"
  state: Open
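
The remaining room in the queue is the difference between spec.capability and status.allocated. Here that is 2 - 1 = 1 CPU, enough for one more Pod like the one above. A minimal sketch of that calculation follows, assuming kubectl is configured for the cluster and both values are whole CPUs (a value such as "500m" would need extra parsing). queue_field and queue_cpu_headroom are hypothetical helper names.

```shell
# Print one field of a Queue using kubectl's jsonpath output.
queue_field() {
  kubectl get queue "$1" -o jsonpath="$2"
}

# CPU headroom = capability - allocated.
# Assumption: both values are whole CPUs (e.g. "2"), not milli-CPU strings.
queue_cpu_headroom() {
  local cap alloc
  cap=$(queue_field "$1" '{.spec.capability.cpu}')
  alloc=$(queue_field "$1" '{.status.allocated.cpu}')
  echo $((cap - alloc))
}

# Usage: queue_cpu_headroom development-queue
```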