Pod Group Status

In the Coscheduling v1alpha1 design, PodGroup’s status only includes counters of related pods, which is not enough for PodGroup lifecycle management. This design doc introduces more information about PodGroup’s status for lifecycle management, e.g. PodGroupPhase.

Function Detail

To describe the current status/phase of a PodGroup in more detail, the following types are introduced:

// PodGroupPhase is the phase of a pod group at the current time.
type PodGroupPhase string

// These are the valid phases of a PodGroup.
const (
    // PodGroupPending means the pod group has been accepted by the system, but the scheduler
    // cannot allocate enough resources to it.
    PodGroupPending PodGroupPhase = "Pending"

    // PodGroupRunning means `spec.minMember` pods of the PodGroup are in the running phase.
    PodGroupRunning PodGroupPhase = "Running"

    // PodGroupUnknown means some of the `spec.minMember` pods are running but the rest cannot
    // be scheduled, e.g. not enough resources; the scheduler waits for the related controller to recover it.
    PodGroupUnknown PodGroupPhase = "Unknown"
)

type PodGroupConditionType string

const (
    PodGroupUnschedulableType PodGroupConditionType = "Unschedulable"
)

// PodGroupCondition contains details for the current state of this pod group.
type PodGroupCondition struct {
    // Type is the type of the condition
    Type PodGroupConditionType `json:"type,omitempty" protobuf:"bytes,1,opt,name=type"`

    // Status is the status of the condition.
    Status v1.ConditionStatus `json:"status,omitempty" protobuf:"bytes,2,opt,name=status"`

    // The ID of condition transition.
    TransitionID string `json:"transitionID,omitempty" protobuf:"bytes,3,opt,name=transitionID"`

    // Last time the phase transitioned from another to current phase.
    // +optional
    LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty" protobuf:"bytes,4,opt,name=lastTransitionTime"`

    // Unique, one-word, CamelCase reason for the phase's last transition.
    // +optional
    Reason string `json:"reason,omitempty" protobuf:"bytes,5,opt,name=reason"`

    // Human-readable message indicating details about last transition.
    // +optional
    Message string `json:"message,omitempty" protobuf:"bytes,6,opt,name=message"`
}

const (
    // PodFailedReason is reported if a pod of the PodGroup failed
    PodFailedReason string = "PodFailed"

    // PodDeletedReason is reported if a pod of the PodGroup was deleted
    PodDeletedReason string = "PodDeleted"

    // NotEnoughResourcesReason is reported if there are not enough resources to schedule pods
    NotEnoughResourcesReason string = "NotEnoughResources"

    // NotEnoughPodsReason is reported if there are not enough pods (tasks) compared to `spec.minMember`
    NotEnoughPodsReason string = "NotEnoughTasks"
)

// PodGroupStatus represents the current state of a pod group.
type PodGroupStatus struct {
    // Current phase of PodGroup.
    Phase PodGroupPhase `json:"phase,omitempty" protobuf:"bytes,1,opt,name=phase"`

    // The conditions of PodGroup.
    // +optional
    Conditions []PodGroupCondition `json:"conditions,omitempty" protobuf:"bytes,2,opt,name=conditions"`

    // The number of actively running pods.
    // +optional
    Running int32 `json:"running,omitempty" protobuf:"varint,3,opt,name=running"`

    // The number of pods which reached phase Succeeded.
    // +optional
    Succeeded int32 `json:"succeeded,omitempty" protobuf:"varint,4,opt,name=succeeded"`

    // The number of pods which reached phase Failed.
    // +optional
    Failed int32 `json:"failed,omitempty" protobuf:"varint,5,opt,name=failed"`
}
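
As an illustration of how these types might be used, the following minimal sketch records an Unschedulable condition on a PodGroupStatus. It is not taken from the scheduler: `setCondition` and `markUnschedulable` are hypothetical helpers, the JSON/protobuf tags are omitted, and `ConditionStatus` and `time.Time` are local stand-ins for `v1.ConditionStatus` and `metav1.Time` so that the example compiles without the Kubernetes API packages.

```go
package main

import "time"

// Local stand-ins (assumption): the design uses v1.ConditionStatus and metav1.Time.
type ConditionStatus string

const ConditionTrue ConditionStatus = "True"

type PodGroupPhase string

const PodGroupPending PodGroupPhase = "Pending"

type PodGroupConditionType string

const PodGroupUnschedulableType PodGroupConditionType = "Unschedulable"

const NotEnoughResourcesReason string = "NotEnoughResources"

type PodGroupCondition struct {
	Type               PodGroupConditionType
	Status             ConditionStatus
	TransitionID       string
	LastTransitionTime time.Time
	Reason             string
	Message            string
}

type PodGroupStatus struct {
	Phase      PodGroupPhase
	Conditions []PodGroupCondition
}

// setCondition (hypothetical helper) replaces the condition of the same Type
// if one exists, otherwise appends the new condition.
func setCondition(status *PodGroupStatus, cond PodGroupCondition) {
	for i := range status.Conditions {
		if status.Conditions[i].Type == cond.Type {
			status.Conditions[i] = cond
			return
		}
	}
	status.Conditions = append(status.Conditions, cond)
}

// markUnschedulable (hypothetical helper) records that the group could not be
// scheduled because of insufficient resources.
func markUnschedulable(status *PodGroupStatus, transitionID, message string) {
	setCondition(status, PodGroupCondition{
		Type:               PodGroupUnschedulableType,
		Status:             ConditionTrue,
		TransitionID:       transitionID,
		LastTransitionTime: time.Now(),
		Reason:             NotEnoughResourcesReason,
		Message:            message,
	})
}
```

Keeping one condition per Type, as this sketch does, is a design choice rather than something the types above require; a controller reading the status then only has to look at a single Unschedulable entry.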

According to the PodGroup’s lifecycle, the following phase/state transitions are reasonable, and the related reason is recorded in the Reason field.

From     To       Reason
-------  -------  ---------------------------------------------------------------------------
Pending  Running  When all spec.minMember pods are running
Running  Unknown  When some of the spec.minMember pods are restarted but cannot be rescheduled
Unknown  Pending  When all pods (spec.minMember) in the PodGroup are deleted
Feature Interaction

Cluster Autoscaler

Cluster Autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster when one of the following conditions is true:

  • there are pods that failed to run in the cluster due to insufficient resources,
  • there are nodes in the cluster that have been underutilized for an extended period of time and their pods can be placed on other existing nodes.

When Cluster Autoscaler scales out a new node, it uses the scheduler’s predicates to check whether pending pods can be scheduled on the new node. Coscheduling is not implemented as a predicate for now, so it will not work well together with Cluster Autoscaler. An alternative solution will be proposed later.

Operators/Controllers

The lifecycle of a PodGroup is managed by operators/controllers; the scheduler only reports the related state to controllers. For example, if the PodGroup of an MPI job is Unknown, the controller needs to restart all pods in the PodGroup.
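
A controller's reaction to the Unknown phase might look like the following minimal sketch. It only illustrates the division of responsibility described above: `PodRestarter`, `reconcilePodGroup`, and `recordingRestarter` are hypothetical names, not part of any existing controller or of the scheduler.

```go
package main

type PodGroupPhase string

const (
	PodGroupPending PodGroupPhase = "Pending"
	PodGroupRunning PodGroupPhase = "Running"
	PodGroupUnknown PodGroupPhase = "Unknown"
)

// PodRestarter is a hypothetical interface a controller could implement to
// restart every pod that belongs to a PodGroup.
type PodRestarter interface {
	RestartAll(podGroup string) error
}

// reconcilePodGroup restarts all pods of the group when its phase is Unknown
// and does nothing for other phases. It reports whether a restart was issued.
func reconcilePodGroup(name string, phase PodGroupPhase, r PodRestarter) (bool, error) {
	if phase != PodGroupUnknown {
		return false, nil
	}
	if err := r.RestartAll(name); err != nil {
		return false, err
	}
	return true, nil
}

// recordingRestarter is a fake used for illustration; it only records the
// names of the groups it was asked to restart.
type recordingRestarter struct {
	restarted []string
}

func (r *recordingRestarter) RestartAll(podGroup string) error {
	r.restarted = append(r.restarted, podGroup)
	return nil
}
```

The scheduler never appears in this sketch: it only sets the phase, and the decision to restart stays with the controller.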

Reference