Requirements
- Before reading the guidance, please make sure you are aware of basic concepts such as
actionpluginsessiontiervolcano jobpodgroupqueueand so on. If they are still strange to you, please refer to Volcano Docs for more details. - Before reading the guidance, please make sure you have general understanding of Volcano scheduling workflow.
Background
In order to adjust the scheduling process and algorithms to different scenarios, Volcano allows users to configure actions
and plugins for volcano scheduler. The scheduling pipeline consists of a series of actions. The plugins implement the algorithms,
which will be called in actions as registered session functions. As what you can see in the configmap volcano-scheduler-configmap,
plugins are divided into 2 tiers by default. It may confuse some users. Besides, it is necessary to provide a guidance about
how to configure volcano scheduler.
Key Points
- All the configurations are configured in the configmap
volcano-scheduler-configmap, which is under the namespacevolcano-system. - The configuration is made up of 2 parts:
actionsandtiers. actionsdefines the scheduling pipeline. They will be executed in order in each session.tiersdivides the plugins into several categories. All the functions defined in the plugins will be registered when a session is open and called when actions are executed.- In some scenarios, users may configure different plugins which registers the same functions. It will depend on the business
requirement to decide how to combine these functions. That’s why
tieris required.
Actions
Actionimplements the main logic of scheduling.- Volcano allow users to make self-defined actions.
- Volcano provides 7 built-in actions until April 2022. The details are as follows.
| ID | Name | Required | Description |
|---|---|---|---|
| 1 | enqueue | Y | Judge whether the idle resource in the cluster can satisfy the basic demand of a workload. If yes, set the podgroup of the workload to be inqueue, otherwise keep the podgroup pending. Notice that the default value for parameter overcommit-factor is 1.2. |
| 2 | allocate | Y | Try to allocate resource to workloads whose corresponding podgroup status is inqueue. |
| 3 | backfill | N | Try to allocate resource to workloads whose pods are BestEffort. |
| 4 | preempt | N | Recognise workloads with high priority. Try to evict pods with low priority and allocate the resource to them. |
| 5 | reclaim | N | Pick out queues whose resources have been borrowed by other queues and reclaim them back. |
| 6 | elect | N | Select a workload satisfying some conditions. It is designed to work with resource reservation for target workload. Will deprecated at future releases. |
| 7 | reserve | N | Select a series of nodes and reserve resource. It is designed to work with resource reservation for target workload. Will deprecated at future releases. |
Tiers and Plugins
Pluginprovides implementation details about scheduling algorithms by registering a series of functions. These functions will be called during actions are executed.- In general, a plugin mainly consists of 3 functions:
NameOnSessionOpenOnSessionClose.Nameprovides the name of the plugin.OnSessionOpenexecutes some operations when a session starts and register some functions about scheduling details.OnSessionCloseclean up some resource when a session finishes. - Some plugins provide arguments for users to match their custom scenarios.
- Different plugins may register same functions with different logic. Please make sure they can work together when configuring plugins.
- Volcano provides 15 built-in plugins until April 2022. The details are as follows.
| ID | Name | Arguments | Registered Functions | Description |
|---|---|---|---|---|
| 1 | binpack | * binpack.weight * binpack.cpu * binpack.memory * binpack.resources |
* nodeOrderFn | Try to bind pods to nodes with high resource usage to reduce fragmentation. |
| 2 | conformance | / | * preemptableFn * reclaimableFn |
Skip critical pods and not evict them. |
| 3 | drf | / | * preemptableFn * queueOrderFn * reclaimFn * jobOrderFn * namespaceOrderFn |
Provide fair resource shares for all queues. |
| 4 | extender | * extender.urlPrefix * extender.httpTimeout * extender.onSessionOpenVerb * extender.onSessionCloseVerb * extender.predicateVerb * extender.prioritizeVerb * extender.preemptableVerb * extender.reclaimableVerb * extender.queueOverusedVerb * extender.jobEnqueueableVerb * extender.ignorable |
* predicateFn * batchNodeOrderFn * preemptableFn * reclaimableFn * jobEnqueueableFn * overusedFn |
Add outer http server to execute custom actions. |
| 5 | gang | / | * jobValidFn * reclaimableFn * preemptableFn * jobOrderFn * JobReadyFn * jobPipelineFn * jobStarvingFn |
Consider the minimal resource requirement or member number for a workload when allocate resource to it. |
| 6 | nodeorder | * nodeaffinity.weight * podaffinity.weight * leastrequested.weight * balancedresource.weight * mostrequested.weight * tainttoleration.weight * imagelocality.weight |
* nodeOrderFn * batchNodeOrderFn |
Sort all nodes in custom way. |
| 7 | numaaware | * weight | * predicateFn * batchNodeOrderFn |
Consider CPU Numa as a key factor when binding a pod to a node. |
| 8 | overcommit | * overcommit-factor | * jobEnqueueableFn * jobEnqueuedFn |
Set the available resource as the given times of the whole resource of the cluster. |
| 9 | predicate | * predicate.GPUSharingEnable * predicate.CacheEnable * predicate.ProportionalEnable * predicate.resources * predicate.resources.nvidia.com/gpu.cpu * predicate.resources.nvidia.com/gpu.memory |
* predicateFn |
Add custom functions about how to filter nodes for pods. |
| 10 | priority | / | * taskOrderFn * jobOrderFn * preemptableFn * jobStarvingFn |
Defines priority for workloads. |
| 11 | proportion | / | * queueOrderFn * reclaimableFn * overusedFn * allocatableFn * jobEnqueueableFn |
Divide the whole resources of the cluster to all queues as proportion according to queues’ configurations |
| 12 | reservation | / | * targetJobFn * reservedNodesFn |
Sort nodes as resource usage and lock parts for target workload as reservation. |
| 13 | sla | * sla-waiting-time | * jobOrderFn * jobEnqueueableFn * JobPipelinedFn |
Sort workloads according to the SLA settings. |
| 14 | task-topology | / | * taskOrderFn * nodeOrderFn |
Bind pods with different roles to nodes according to the given policy. |
| 15 | tdm | * tdm.revocable-zone.rz1 * tdm.revocable-zone.rz2 * tdm.evict.period |
* predicateFn * nodeOrderFn * preemptableFn * victimTasksFn * jobOrderFn * jobPipelinedFn * jobStarvingFn |
Enable part of nodes to be in the charge of K8s and other clusters in different period. |
Examples
# default configuration for scheduler
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: priority
- name: gang
- name: conformance
- plugins:
- name: overcommit
- name: drf
- name: predicates
- name: proportion
- name: nodeorder
- name: binpack
Note:
According to the default configuration, the scheduling process works as follows at a session. The scheduler will run the following pipeline regularly. The period is
1sby default.graph LR 1(Start) --> 2(OpenSession) --> 3(enqueue) --> 4(allocate) --> 5(backfill) --> 6(CloseSession) --> 7(End)All the functions in the configured plugins will be registered when executing
OpenSessionand called when executing the configured actions. For example,jobEnqueueablefunction, which is registered inovercommitplugin and called atenqueueaction, aims to judge whether the idle resource of the cluster can satisfy the minimal demand of a workload.Both
overcommitandproportionplugin have registered functionjobEnqueueableFn, which will be called in the functionJobEnqueueable. Besides,overcommitandproportionare in the same tier. According to the implementation ofJobEnqueueable, it will get through all the plugins in different tiers in order. If anyjobEnqueueableFnreturns a value belows0, it stops executing thejobEnqueueableFnregistered in the following plugins and returnsfalse. Namely, if thejobEnqueueableFnregistered inovercommitreturns a value belows0,jobEnqueueableFn, which is called inenqueueaction, will returnfalseand never call thejobEnqueueableFnregistered in theproportionplugin.
FAQ
- How can I decide which plugins should be grouped into a tier? How many tiers should I set for my business?
> In most scenarios, users should not concern about how to divide plugins to different tiers. It’s OK to configure all
plugins within a single tier. Only on condition that it is related with eviction will you need to think of how to organize
the plugins in different tiers. For example, when enable
reclaimaction, the scheduler will try to collect a set of victims. In order to reduce the influence to users’ business, it is reasonable to pick out victims as less as possible. Then you can configure the plugins evicting the least pods in the first tier. And configure other plugins with eviction at the second tier. If the first tier can pick out victims, it will not call the functions registered in the plugins, which is configured at the second tier.