Background
Policy provides an API of volcano job and task lifecycle management for users. For example, in some scenarios, especially
in AI, big data and HPC field, it is required to restart a job if any master or worker fails. Users can easily achieve that
by configuring policy for the volcano job under job.spec.
Key Points
- Volcano allows users to configure a pair of
Event(Events) andActionfor a volcano job or a task. If the specified event(events) happens, the target action will be triggered. If timeout is configured, the target action will be executed after the timeout delay. - If the policy is configured under
job.speconly, it will work for all tasks by default. If the policy is configured undertask.speconly, it will only work for the task. If the policy is configured in both job and task level, it will obey the task policy. - Users can set multiple policy for a job or a task.
- Currently, Volcano provides 6 built-in events for users. The details are as follows.
| ID | Event | Description |
|---|---|---|
| 1 | PodFailed |
Check whether there is any pod’ status is Failed. |
| 2 | PodEvicted |
Check whether there is any pod is evicted. |
| 3 | PodPending |
Check whether there is any pod is pending. It is usually used with timeout. If the pod is not pending, the timeout action will be canceled. |
| 4 | TaskCompleted |
Check whether there is a task whose all pods are succeed. If minsuccess is configured for a task, it will also be regarded as task completes. |
| 5 | Unknown |
Check whether the status of a volcano job is Unknown. The most possible factor is task unschedulable. It is triggered when part pods can’t be scheduled while some are already running in gang-scheduling case. |
| 6 | * |
It means all the events, which is not so common used. |
- Currently, Volcano provides 5 built-in actions for users. The details are as follows.
| ID | Action | Description |
|---|---|---|
| 1 | AbortJob |
Abort the whole job, but it can be resumed. All pods will be evicted and no pod will be recreated. |
| 2 | RestartJob |
Restart the whole job. |
| 3 | RestartTask |
The task will be restarted. This action cannot work with job level events such as Unknown. |
| 4 | RestartPod |
The pod will be restarted. This action cannot work with job level events such as Unknown. |
| 5 | RestartPartition |
The partition will be restarted. This action cannot work with job level events such as Unknown. |
| 6 | TerminateJob |
Terminate the whole job and it cannot be resumed. All pods will be evicted and no pod will be recreated. |
| 7 | CompleteJob |
Regard the job as completed. The unfinished pods will be killed. |
Examples
Set a pair of
eventandaction.apiVersion: batch.volcano.sh/v1alpha1 kind: Job metadata: name: tensorflow-dist-mnist spec: minAvailable: 3 schedulerName: volcano plugins: env: [] svc: [] policies: - event: PodEvicted # Job level policy. If any pod is evicted, restart the job. action: RestartJob queue: default tasks: - replicas: 1 name: ps template: spec: containers: - command: - sh - -c - | PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`; WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`; export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}; ## Get the index from the environment variable and configure it in the TF job. python /var/tf_dist_mnist/dist_mnist.py image: volcanosh/dist-mnist-tf-example:0.0.1 name: tensorflow ports: - containerPort: 2222 name: tfjob-port resources: {} restartPolicy: Never - replicas: 2 name: worker policies: - event: TaskCompleted # Task level policy. If this task completes, complete the job. action: CompleteJob template: spec: containers: - command: - sh - -c - | PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`; WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`; export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}; python /var/tf_dist_mnist/dist_mnist.py image: volcanosh/dist-mnist-tf-example:0.0.1 name: tensorflow ports: - containerPort: 2222 name: tfjob-port resources: {} restartPolicy: NeverSet a pair of
eventsandaction.apiVersion: batch.volcano.sh/v1alpha1 kind: Job metadata: name: tensorflow-dist-mnist spec: minAvailable: 3 schedulerName: volcano plugins: env: [] svc: [] queue: default tasks: - replicas: 1 name: ps policies: - events: [PodEvicted, PodFailed] # Task level policy. If any pod is evicted or fails in this task, restart the job. action: RestartJob template: spec: containers: - command: - sh - -c - | PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`; WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`; export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}; ## Get the index from the environment variable and configure it in the TF job. python /var/tf_dist_mnist/dist_mnist.py image: volcanosh/dist-mnist-tf-example:0.0.1 name: tensorflow ports: - containerPort: 2222 name: tfjob-port resources: {} restartPolicy: Never - replicas: 2 name: worker policies: - event: TaskCompleted # Task level policy. If this task completes, complete the job. action: CompleteJob template: spec: containers: - command: - sh - -c - | PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`; WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`; export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}; python /var/tf_dist_mnist/dist_mnist.py image: volcanosh/dist-mnist-tf-example:0.0.1 name: tensorflow ports: - containerPort: 2222 name: tfjob-port resources: {} restartPolicy: NeverSet a pair of
eventsandaction.apiVersion: batch.volcano.sh/v1alpha1 kind: Job metadata: name: tensorflow-dist-mnist spec: minAvailable: 3 schedulerName: volcano plugins: env: [] svc: [] queue: default tasks: - replicas: 1 name: ps policies: - events: PodFailed # Task level policy. If any pod fails in this task, restart the pod. action: RestartPod - events: PodEvicted # Task level policy. If any pod is evicted in this task, restart the job after 10 minutes. action: RestartJob timeout: 10m template: spec: containers: - command: - sh - -c - | PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`; WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`; export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}; ## Get the index from the environment variable and configure it in the TF job. python /var/tf_dist_mnist/dist_mnist.py image: volcanosh/dist-mnist-tf-example:0.0.1 name: tensorflow ports: - containerPort: 2222 name: tfjob-port resources: {} restartPolicy: Never - replicas: 2 name: worker policies: - event: TaskCompleted # Task level policy. If this task completes, complete the job. action: CompleteJob template: spec: containers: - command: - sh - -c - | PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`; WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`; export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}; python /var/tf_dist_mnist/dist_mnist.py image: volcanosh/dist-mnist-tf-example:0.0.1 name: tensorflow ports: - containerPort: 2222 name: tfjob-port resources: {} restartPolicy: Never