Background
Env Plugin is designed for business that a pod should be aware of its index in the task such as MPI and TensorFlow. The indices will be registered as environment variables automatically when the Volcano job is created. For example, a tensorflow job consists of 1 ps and 2 workers. And each worker maps to a slice of raw data. In order to make the workers be aware of its target slice, they get their index in the environment variables.
Key Points
- The index keys of the environment variables are
VK_TASK_INDEXandVC_TASK_INDEX, they have the same value. - The value of the indices is a number which ranges from
0tolength - 1. Thelengthequals to the number of replicas of the task. It is also the index of the pod in the task.
Examples
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: tensorflow-dist-mnist
spec:
minAvailable: 3
schedulerName: volcano
plugins:
env: [] ## Env plugin register, note that no values are needed in the array.
svc: []
policies:
- event: PodEvicted
action: RestartJob
queue: default
tasks:
- replicas: 1
name: ps
template:
spec:
containers:
- command:
- sh
- -c
- |
PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"}; ## Get the index from the environment variable and configure it in the TF job.
python /var/tf_dist_mnist/dist_mnist.py
image: volcanosh/dist-mnist-tf-example:0.0.1
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
resources: {}
restartPolicy: Never
- replicas: 2
name: worker
policies:
- event: TaskCompleted
action: CompleteJob
template:
spec:
containers:
- command:
- sh
- -c
- |
PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
python /var/tf_dist_mnist/dist_mnist.py
image: volcanosh/dist-mnist-tf-example:0.0.1
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
resources: {}
restartPolicy: Never
Note:
When config env plugin in the tensorflow job above, all the pods will have 2 environment variables
VK_TASK_INDEXandVC_TASK_INDEX. The environment variables registered in thepspod are as follows.[root@tensorflow-dist-mnist-ps-0 /] env | grep TASK_INDEX VK_TASK_INDEX=0 VC_TASK_INDEX=0Considering the 2 workers, you will get that their names are
tensorflow-dist-mnist-worker-0andtensorflow-dist-mnist-worker-1. And the corresponding values of the index environment variables are0and1.[root@tensorflow-dist-mnist-worker-0 /] env | grep TASK_INDEX VK_TASK_INDEX=0 VC_TASK_INDEX=0[root@tensorflow-dist-mnist-worker-1 /] env | grep TASK_INDEX VK_TASK_INDEX=1 VC_TASK_INDEX=1
Note:
- Because of historical reasons, environment variables
VK_TASK_INDEXandVC_TASK_INDEXboth exist,VK_TASK_INDEXwill be deprecated in the future releases. - No value are needed when register env plugin in the volcano job.