A PodGroup is a group of strongly associated pods and is used mainly in batch scheduling, for example, for the ps and worker tasks in TensorFlow. PodGroup is a custom resource defined through a Custom Resource Definition (CRD).
- apiVersion: batch.volcano.sh/v1alpha1
- lastTransitionTime: "2020-08-11T12:28:57Z"
  message: '1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable.'
- lastTransitionTime: "2020-08-11T12:29:02Z"
  reason: tasks in gang are ready to be scheduled
minMember indicates the minimum number of pods or tasks that must run under the PodGroup. If cluster resources cannot support running that minimum number of pods or tasks, no pod or task in the PodGroup is scheduled.
queue indicates the queue to which the PodGroup belongs. The queue must be in the Open state.
priorityClassName represents the priority of the PodGroup and is used by the scheduler to sort all the PodGroups in the queue during scheduling. Note that system-node-critical and system-cluster-critical are reserved values that denote the highest priorities. If priorityClassName is not specified, the default priority is used.
minResources indicates the minimum resources for running the PodGroup. If available resources in the cluster cannot satisfy the requirement, no pod or task in the PodGroup will be scheduled.
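The four spec fields described above can be combined in a single manifest. The following is a minimal sketch rather than a definitive example: the apiVersion `scheduling.volcano.sh/v1beta1` is an assumption that may differ between Volcano releases, and the names `example-podgroup` and `high-priority`, as well as all numeric values, are hypothetical.

```yaml
# Illustrative PodGroup manifest; the apiVersion, names, and values are assumptions.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: example-podgroup             # hypothetical name
  namespace: default
spec:
  minMember: 3                       # nothing is scheduled unless 3 pods or tasks can run
  queue: default                     # the referenced queue must be in the Open state
  priorityClassName: high-priority   # hypothetical class; omit to use the default priority
  minResources:                      # nothing is scheduled unless these resources are available
    cpu: "3"
    memory: 2048Mi
```

With these values, the scheduler would hold back every pod in the group until both the pod count and the resource floor can be met.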
phase indicates the current status of the PodGroup.
conditions represents the status log of the PodGroup, including the key events that occurred in the lifecycle of the PodGroup.
running indicates the number of running pods or tasks in the PodGroup.
succeeded indicates the number of pods or tasks in the PodGroup that have completed successfully.
failed indicates the number of failed pods or tasks in the PodGroup.
The phase pending indicates that the PodGroup has been accepted by Volcano but its resource requirement has not been satisfied yet. Once the requirement is satisfied, the phase changes to running.
The phase running indicates that at least minMember pods or tasks are running under the PodGroup.
The phase unknown indicates that, of the minMember pods or tasks, some are running while others have not been scheduled, possibly because of a lack of resources. The scheduler waits until ControllerManager starts these pods or tasks again.
The phase inqueue indicates that the PodGroup has passed validation and is waiting to be bound to a node. It is a transient state between pending and running.
In some scenarios, such as machine learning training, not all tasks of a job need to complete. Instead, the job is considered accomplished once a specified number of tasks have completed. The minMember field suits this case.
priorityClassName is used in preemptive priority scheduling.
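The value given in priorityClassName refers to a Kubernetes PriorityClass object by name. As a hedged sketch, a class that a PodGroup could reference might look like the following; the name `high-priority`, the value, and the description are hypothetical and chosen only for illustration.

```yaml
# Illustrative Kubernetes PriorityClass; name, value, and description are hypothetical.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority        # referenced from a PodGroup via spec.priorityClassName
value: 1000000               # a larger value means a higher priority
globalDefault: false         # do not apply this class to objects that name no class
description: "Illustrative class for batch jobs that should be scheduled first."
```

A PodGroup whose priorityClassName is set to `high-priority` would then be sorted ahead of PodGroups that use a lower-valued class in the same queue.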
In some scenarios, such as big data analytics, a job can run only when available resources meet the minimum requirement. minResources is suitable for such scenarios.
If no PodGroup is specified when a VolcanoJob is created, Volcano will create a PodGroup with the same name as the VolcanoJob.
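A minimal sketch of a VolcanoJob that names no PodGroup is shown below, so Volcano would create one for it. The job name, task names, replica counts, and image references are hypothetical placeholders. The sketch assumes that the job's `minAvailable` value is what becomes `minMember` on the generated PodGroup.

```yaml
# Illustrative VolcanoJob without an explicit PodGroup; names, counts, and images are hypothetical.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-training                  # hypothetical job name
spec:
  schedulerName: volcano
  queue: default
  minAvailable: 3                    # assumed to become minMember on the generated PodGroup
  tasks:
  - name: ps
    replicas: 1
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: ps
          image: example.com/tf-ps:latest       # placeholder image
  - name: worker
    replicas: 4
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: worker
          image: example.com/tf-worker:latest   # placeholder image
```

Here the job defines five pods in total but can start once three of them can be scheduled together, which matches the machine learning scenario described above.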