Skip to main content
Version: v1.13.0 (Latest)

Numa-aware

Overview

When a node runs many CPU-bound pods, the workload can move to different CPU cores depending on whether the pod is throttled and which CPU cores are available at scheduling time. Many workloads are not sensitive to this migration and work fine without any intervention. However, in workloads where CPU cache affinity and scheduling latency significantly affect workload performance, special CPU management policies are needed to determine placement preferences on the node.

The Challenge

The CPU Manager and Topology Manager are Kubelet components that help with CPU placement. However, they have the following limitations:

  1. Scheduler Unawareness: The scheduler is not topology-aware. This means a pod might be scheduled on a node only to fail due to the Topology Manager. This is unacceptable for TensorFlow jobs—if any worker or parameter server fails on a node, the entire job will fail.

  2. Node-level Only: These managers operate at the node level, which results in an inability to match the best node for NUMA topology across the entire cluster.

How Numa-aware Plugin Works

The Numa-aware plugin aims to address these limitations:

  • CPU Resource Topology Scheduling: Supports scheduling based on CPU topology
  • Pod-level Topology Policies: Supports topology policies at the pod level

The plugin:

  1. Collects NUMA topology information from nodes
  2. Evaluates CPU and memory placement requirements
  3. Scores nodes based on NUMA affinity
  4. Ensures tasks are placed on nodes that can satisfy their topology requirements

Scenario

Common scenarios for NUMA-aware scheduling are computation-intensive jobs that are sensitive to CPU parameters and scheduling delays:

Scientific Computing

High-performance scientific calculations benefit from NUMA-aware scheduling to ensure optimal memory access patterns.

Video Processing

Video decoding workloads can achieve better performance when scheduled with NUMA awareness.

Animation Rendering

Animation rendering jobs that are CPU-intensive benefit from optimized CPU and memory placement.

Big Data Offline Processing

Large-scale data processing jobs can achieve better throughput with NUMA-optimized scheduling.

Configuration

Enable Topology Manager on Nodes

First, ensure the Kubelet is configured with topology management:

# kubelet configuration
topologyManagerPolicy: single-numa-node
cpuManagerPolicy: static

Scheduler Configuration

Enable the Numa-aware plugin:

tiers:
- plugins:
- name: priority
- name: gang
- plugins:
- name: predicates
- name: nodeorder
- name: numa-aware
arguments:
numa-aware.weight: 10

Configuration Parameters

ParameterDescriptionDefault
numa-aware.weightWeight of NUMA-aware scoring1

Example

Node with NUMA Topology

A typical NUMA node might have:

  • 2 NUMA nodes
  • Each with 16 CPU cores
  • Each with 64GB memory
NUMA Node 0: CPU 0-15, 64GB Memory
NUMA Node 1: CPU 16-31, 64GB Memory

Job Requiring NUMA Awareness

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: numa-aware-job
spec:
schedulerName: volcano
minAvailable: 1
tasks:
- replicas: 1
name: compute
template:
metadata:
annotations:
volcano.sh/numa-topology-policy: single-numa-node
spec:
containers:
- name: compute
image: compute-intensive-app:latest
resources:
requests:
cpu: "8"
memory: "32Gi"
limits:
cpu: "8"
memory: "32Gi"

In this example:

  • The job requests 8 CPUs and 32GB memory
  • The NUMA policy requires all resources from a single NUMA node
  • The scheduler will find a node that can satisfy this requirement from a single NUMA node

Pod with Topology Policy Annotation

apiVersion: v1
kind: Pod
metadata:
name: numa-sensitive-pod
annotations:
volcano.sh/numa-topology-policy: best-effort
spec:
schedulerName: volcano
containers:
- name: app
image: myapp:latest
resources:
requests:
cpu: "4"
memory: "16Gi"
limits:
cpu: "4"
memory: "16Gi"

NUMA Topology Policies

The plugin supports several topology policies:

PolicyDescription
noneNo NUMA preference
best-effortTry to place on optimal NUMA node, but don't fail if not possible
restrictedOnly place on nodes that can satisfy the NUMA requirement
single-numa-nodeAll resources must come from a single NUMA node