Introduction
The PyTorch plugin is designed to improve the user experience when running PyTorch jobs: it lets users write less YAML and ensures that PyTorch jobs run normally.
How the PyTorch Plugin Works
The PyTorch plugin does the following:
- Opens the ports used by PyTorch for all containers of the job
- Forces the `svc` plugin to be enabled
- Automatically adds the environment variables that PyTorch distributed training needs, such as `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK`, to the containers
- When `--wait-master-enabled=true`, adds an init container to worker pods that waits for the master node to be ready before the workers start (ensures the master starts first)
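To make the injected variables concrete, here is a minimal Python sketch of how such values could be derived for each pod. It is an illustration, not the plugin's code. The function name `pytorch_env` is hypothetical, and the sketch assumes a single master replica with rank 0, worker ranks equal to pod index plus one, and a master hostname of the form `<job>-<master>-0.<job>`; none of these details are stated in this document.

```python
from typing import Dict, List, Tuple


def pytorch_env(
    job_name: str,
    tasks: List[Tuple[str, int]],  # (task name, replicas) pairs from the job spec
    task_name: str,
    pod_index: int,
    master: str = "master",
    worker: str = "worker",
    port: int = 23456,
) -> Dict[str, str]:
    """Return the distributed-training env vars for one pod (illustrative only)."""
    replicas = dict(tasks)
    # WORLD_SIZE counts every master and worker replica of the job.
    world_size = replicas.get(master, 0) + replicas.get(worker, 0)
    # Assumption: the single master is rank 0 and workers follow as 1, 2, ...
    rank = 0 if task_name == master else pod_index + 1
    # Assumption: the master is reachable under a stable per-pod DNS name.
    master_addr = f"{job_name}-{master}-0.{job_name}"
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
    }


if __name__ == "__main__":
    job_tasks = [("master", 1), ("worker", 2)]
    for name, count in job_tasks:
        for index in range(count):
            print(name, index, pytorch_env("pytorch-job", job_tasks, name, index))
```

Under these assumptions, a job with one master and two workers gets a `WORLD_SIZE` of 3, and the second worker (pod index 1) gets `RANK` 2.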
Parameters of the PyTorch Plugin
Arguments
| ID | Name | Type | Default Value | Required | Description | Example |
|---|---|---|---|---|---|---|
| 1 | master | string | master | No | Name of the PyTorch master task | `--master=master` |
| 2 | worker | string | worker | No | Name of the PyTorch worker task | `--worker=worker` |
| 3 | port | int | 23456 | No | The port to open for the container | `--port=23456` |
| 4 | wait-master-enabled | bool | false | No | Enable the init container that waits for the master | `--wait-master-enabled=true` |
| 5 | wait-master-timeout | int | 300 | No | Timeout in seconds for waiting for the master (only effective when wait-master-enabled=true) | `--wait-master-timeout=600` |
| 6 | wait-master-image | string | busybox:1.36.1 | No | Image for the wait-for-master init container (only effective when wait-master-enabled=true) | `--wait-master-image=busybox:latest` |
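The defaults in the table can be summarized with a small Python sketch that parses the same argument strings using the standard `argparse` module. This is only an illustration of the documented names, types, and defaults; the plugin itself is not written in Python, and `parse_pytorch_args` and `parse_bool` are hypothetical helper names.

```python
import argparse
from typing import Sequence


def parse_bool(value: str) -> bool:
    """Accept true/false style values such as --wait-master-enabled=true."""
    lowered = value.strip().lower()
    if lowered in ("true", "1"):
        return True
    if lowered in ("false", "0"):
        return False
    raise argparse.ArgumentTypeError(f"not a boolean: {value!r}")


def parse_pytorch_args(args: Sequence[str]) -> argparse.Namespace:
    """Parse plugin arguments, applying the defaults listed in the table."""
    parser = argparse.ArgumentParser(prog="pytorch", allow_abbrev=False)
    parser.add_argument("--master", default="master")
    parser.add_argument("--worker", default="worker")
    parser.add_argument("--port", type=int, default=23456)
    parser.add_argument("--wait-master-enabled", type=parse_bool, default=False)
    parser.add_argument("--wait-master-timeout", type=int, default=300)
    parser.add_argument("--wait-master-image", default="busybox:1.36.1")
    return parser.parse_args(list(args))


if __name__ == "__main__":
    opts = parse_pytorch_args(["--port=23456", "--wait-master-enabled=true"])
    print(opts)
```

Every argument is optional, so an empty list yields the default values from the table.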
Examples
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: pytorch-job
    spec:
      minAvailable: 1
      schedulerName: volcano
      plugins:
        pytorch: [
          "--master=master",
          "--worker=worker",
          "--port=23456",
          "--wait-master-enabled=true",        # Enable init container to wait for master (optional, default: false)
          "--wait-master-timeout=600",         # Timeout in seconds (optional, default: 300)
          "--wait-master-image=busybox:1.36.1" # Init container image (optional, default: busybox:1.36.1)
          ]
      tasks:
        - replicas: 1
          name: master
          policies:
            - event: TaskCompleted
              action: CompleteJob
          template:
            spec:
              containers:
                - image: gcr.io/kubeflow-ci/pytorch-dist-sendrecv-test:1.0
                  imagePullPolicy: IfNotPresent
                  name: master
              restartPolicy: OnFailure
        - replicas: 2
          name: worker
          template:
            spec:
              containers:
                - image: gcr.io/kubeflow-ci/pytorch-dist-sendrecv-test:1.0
                  imagePullPolicy: IfNotPresent
                  name: worker
                  workingDir: /home
              restartPolicy: OnFailure
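Inside the containers, a training script can pick up the variables that the plugin injects. The following Python sketch shows one way to read and validate them. `read_dist_config` is a hypothetical helper, and the sketch avoids importing `torch` so that it stays self-contained.

```python
import os
from typing import Dict, Mapping, Optional, Union

REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_dist_config(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Union[str, int]]:
    """Read the injected variables and fail early if any of them is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    addr = env["MASTER_ADDR"]
    port = int(env["MASTER_PORT"])
    return {
        "master_addr": addr,
        "master_port": port,
        "world_size": int(env["WORLD_SIZE"]),
        "rank": int(env["RANK"]),
        # Explicit TCP rendezvous URL, equivalent to the env-based setup.
        "init_method": f"tcp://{addr}:{port}",
    }


if __name__ == "__main__":
    # With PyTorch installed, the same variables are consumed directly by:
    #   torch.distributed.init_process_group(backend="gloo", init_method="env://")
    print(read_dist_config())
```

Validating the variables up front gives a clearer error than a rendezvous timeout when a pod is started outside the job, for example during local debugging.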
Notes
- The `wait-for-master` init container feature is disabled by default. Enable it by setting `--wait-master-enabled=true`
- When enabled, an init container is added to worker pods to ensure the master is ready before the workers start
- The default init container image is `busybox:1.36.1`; it can be customized via `--wait-master-image`
- Workers wait for the master to become ready with a configurable timeout (default 300 seconds / 5 minutes)
- If the master doesn't become ready within the timeout, the worker pod fails with an error message
- The init container checks the master's port connectivity using multiple fallback methods:
  - `nc -z` (netcat) if available
  - `/dev/tcp` with the `timeout` command if available
  - `/dev/tcp` direct connection as a fallback
- Note: The parameters `--wait-master-timeout` and `--wait-master-image` are only effective when `--wait-master-enabled=true`
- Image Requirements: The custom image should have at least one of the following:
  - `nc` (netcat) command - recommended, available in busybox and alpine
  - `/dev/tcp` support in the shell - available in bash (not in every `sh` implementation)
  - Recommended images: `busybox:1.36.1`, `alpine:latest`, `bash:latest`
- Customization examples:
  - Enable feature: `--wait-master-enabled=true`
  - Custom timeout: `--wait-master-enabled=true --wait-master-timeout=600` (10 minutes)
  - Custom image: `--wait-master-enabled=true --wait-master-image=busybox:latest`
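The wait-for-master behaviour described in these notes (use the first available connectivity check, retry until the master's port answers, give up after the timeout) can be sketched in Python as follows. The real init container runs a shell script based on `nc` or `/dev/tcp`; this sketch swaps in Python's `socket` module to show the same control flow. The names `tcp_probe`, `check_once`, and `wait_for_master` are hypothetical.

```python
import socket
import time
from typing import Callable, Sequence

Probe = Callable[[str, int], bool]


def tcp_probe(host: str, port: int) -> bool:
    """Return True if a TCP connection to host:port succeeds within one second."""
    try:
        with socket.create_connection((host, port), timeout=1.0):
            return True
    except OSError:
        return False


def check_once(host: str, port: int, probes: Sequence[Probe]) -> bool:
    """Use the first available probe; NotImplementedError marks an unavailable one."""
    for probe in probes:
        try:
            return probe(host, port)
        except NotImplementedError:
            continue  # fall back to the next method
    raise RuntimeError("no connectivity probe is available")


def wait_for_master(
    host: str,
    port: int,
    timeout_s: float = 300.0,  # mirrors the documented default of 300 seconds
    interval_s: float = 1.0,   # assumption: the retry interval is not documented
    probes: Sequence[Probe] = (tcp_probe,),
) -> bool:
    """Poll until the master's port is reachable or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check_once(host, port, probes):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    ready = wait_for_master("pytorch-job-master-0.pytorch-job", 23456, timeout_s=5)
    print("master ready" if ready else "timed out waiting for master")
```

A `False` return corresponds to the case where the worker pod fails because the master did not become ready in time.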