Troubleshooting Topology Management
Kubernetes keeps many aspects of how pods execute on nodes abstracted
from the user. This is by design. However, some workloads require
stronger guarantees in terms of latency and/or performance in order to operate
acceptably. The kubelet provides methods to enable more complex workload
placement policies while keeping the abstraction free from explicit placement
directives.
You can manage topology within nodes. This means helping the kubelet to configure the host operating system so that Pods and containers are placed on the correct side of inner boundaries, such as NUMA domains. (NUMA is an abbreviation of non-uniform memory access, and refers to an idea that CPUs might be topologically closer to specific regions of memory, due to the physical layout of the hardware components and the way that these are connected).
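If you want to see the NUMA layout of a particular node before looking at the kubelet's behaviour, you can log in to that node and inspect it with a tool such as numactl (assuming the tool is installed; lscpu reports similar information):
numactl --hardware
The output lists each NUMA node together with the CPUs attached to it and its local memory size.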
Sources of troubleshooting information
In the context of topology management, you can use the following sources of information to troubleshoot why a pod could not be deployed to, or was rejected by, a node:
- Pod status - indicates topology affinity errors
- system logs - include valuable information for debugging; for example, about generated hints
- kubelet state file - the dump of internal state of the Memory Manager (including the node map and memory maps)
- device plugin resource API - provides information about the memory reserved for containers
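As a quick orientation, the following commands show where each of these sources typically lives, assuming a systemd-managed kubelet and the default kubelet root directory:
# Pod status, including any topology affinity error
kubectl describe pod <pod-name>
# kubelet system logs
journalctl -u kubelet
# Memory Manager state file (run on the node)
cat /var/lib/kubelet/memory_manager_state
# the device plugin resource API is served over this Unix socket (run on the node)
ls -l /var/lib/kubelet/pod-resources/kubelet.sock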
Troubleshoot TopologyAffinityError
This error typically occurs in the following situations:
- a node does not have enough resources available to satisfy the pod's request
- the pod's request is rejected due to particular Topology Manager policy constraints
The error appears in the status of a pod:
kubectl get pods
NAME         READY   STATUS                  RESTARTS   AGE
guaranteed   0/1     TopologyAffinityError   0          113s
Use kubectl describe pod <id> or kubectl events to obtain a detailed error message:
Warning TopologyAffinityError 10m kubelet, dell8 Resources cannot be allocated with Topology locality
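If you are not sure which pod was affected, you can also list recent events filtered by reason (this assumes the events have not yet expired from the cluster's event retention window):
kubectl get events --field-selector reason=TopologyAffinityError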
Examine system logs
Search the system logs for entries related to a particular pod.
The set of hints that the CPU Manager generated for the pod should be present in the logs, as should the set of hints generated by the Memory Manager.
Topology Manager merges these hints to calculate a single best hint. The best hint should also be present in the logs.
The best hint indicates where to allocate all the resources. Topology Manager tests this hint against its current policy, and based on the verdict, it either admits the pod to the node or rejects it.
Also, search the logs for entries associated with the Memory Manager,
for example to find information about cgroups and cpuset.mems updates.
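For example, on a node where the kubelet runs as a systemd service, searches along the following lines can surface the generated hints and the Memory Manager activity for a given pod (the exact wording of kubelet log lines varies between versions, so treat these patterns as a starting point rather than exact matches):
# hints generated by the CPU Manager, Memory Manager and Topology Manager
journalctl -u kubelet | grep -i hint
# entries that mention the pod by name
journalctl -u kubelet | grep guaranteed
# Memory Manager activity, including cgroups and cpuset.mems updates
journalctl -u kubelet | grep -i memory_manager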
Examples
Examine the memory manager state on a node
Let us first deploy a sample Guaranteed pod whose specification is as follows:
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed
spec:
  containers:
  - name: guaranteed
    image: consumer
    imagePullPolicy: Never
    resources:
      limits:
        cpu: "2"
        memory: 150Gi
      requests:
        cpu: "2"
        memory: 150Gi
    command: ["sleep","infinity"]
Next, log into the node where it was deployed and examine the state file in
/var/lib/kubelet/memory_manager_state:
{
  "policyName":"Static",
  "machineState":{
    "0":{
      "numberOfAssignments":1,
      "memoryMap":{
        "hugepages-1Gi":{
          "total":0,
          "systemReserved":0,
          "allocatable":0,
          "reserved":0,
          "free":0
        },
        "memory":{
          "total":134987354112,
          "systemReserved":3221225472,
          "allocatable":131766128640,
          "reserved":131766128640,
          "free":0
        }
      },
      "nodes":[
        0,
        1
      ]
    },
    "1":{
      "numberOfAssignments":1,
      "memoryMap":{
        "hugepages-1Gi":{
          "total":0,
          "systemReserved":0,
          "allocatable":0,
          "reserved":0,
          "free":0
        },
        "memory":{
          "total":135286722560,
          "systemReserved":2252341248,
          "allocatable":133034381312,
          "reserved":29295144960,
          "free":103739236352
        }
      },
      "nodes":[
        0,
        1
      ]
    }
  },
  "entries":{
    "fa9bdd38-6df9-4cf9-aa67-8c4814da37a8":{
      "guaranteed":[
        {
          "numaAffinity":[
            0,
            1
          ],
          "type":"memory",
          "size":161061273600
        }
      ]
    }
  },
  "checksum":4142013182
}
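On disk, this state file is not necessarily formatted for readability. If jq is available on the node, you can pretty-print it or extract individual fields, for example:
jq . /var/lib/kubelet/memory_manager_state
jq '.entries' /var/lib/kubelet/memory_manager_state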
It can be deduced from the state file that the pod was pinned to both NUMA nodes, i.e.:
"numaAffinity":[
0,
1
],
The term pinned means that the pod's memory consumption is constrained (through the cgroups configuration)
to these NUMA nodes.
This also implies that the Memory Manager instantiated a new group that
comprises these two NUMA nodes, that is, the NUMA nodes with indices 0 and 1.
To analyse the memory resources available in a group, the corresponding entries from the NUMA nodes belonging to that group must be added up.
For example, the total amount of free "conventional" memory in the group can be computed
by adding up the free memory available at every NUMA node in the group,
i.e., in the "memory" section of NUMA node 0 ("free":0) and NUMA node 1 ("free":103739236352).
So, the total amount of free "conventional" memory in this group is equal to 0 + 103739236352 bytes.
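If jq is available on the node, this sum can be computed directly from the state file. The sketch below adds up the free conventional memory across all NUMA nodes recorded in machineState, which in this example is exactly the group spanning nodes 0 and 1:
jq '[.machineState[].memoryMap.memory.free] | add' /var/lib/kubelet/memory_manager_state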
The line "systemReserved":3221225472 indicates that the administrator of this node reserved
3221225472 bytes (i.e. 3Gi) to serve kubelet and system processes at NUMA node 0,
by using the --reserved-memory flag.
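The same reservation can also be expressed through the reservedMemory field of the kubelet configuration file. The sketch below only mirrors the values seen in this state file (3Gi on NUMA node 0 and 2148Mi, that is 2252341248 bytes, on NUMA node 1); the reserved totals must stay consistent with the node's kube-reserved, system-reserved and eviction-threshold settings:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
memoryManagerPolicy: Static
reservedMemory:
- numaNode: 0
  limits:
    memory: 3Gi
- numaNode: 1
  limits:
    memory: 2148Mi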
Check the device plugin resource API
The kubelet provides a PodResourceLister gRPC service to enable discovery of resources and associated metadata.
By using its List gRPC endpoint,
information about the memory reserved for each container can be retrieved; it is contained
in the protobuf ContainerMemory message.
This information can only be retrieved for pods in the Guaranteed QoS class.
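As a minimal sketch, the Go program below queries the List endpoint over the kubelet's pod-resources socket and prints the memory reserved for each container. It assumes the default socket path /var/lib/kubelet/pod-resources/kubelet.sock and uses the published k8s.io/kubelet/pkg/apis/podresources/v1 client package; run it on the node itself, or in a pod that mounts the socket, with sufficient privileges:
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// Default location of the kubelet's pod-resources socket.
const socket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
	// Connect to the kubelet's pod-resources gRPC service over the Unix socket.
	conn, err := grpc.Dial(socket, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("cannot connect to %s: %v", socket, err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// List the resources assigned to all pods on this node.
	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("List call failed: %v", err)
	}

	// Print the memory reserved for each container; each entry in the Memory
	// field is a ContainerMemory message.
	for _, pod := range resp.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, mem := range container.GetMemory() {
				fmt.Printf("pod=%s/%s container=%s memory=%+v\n",
					pod.GetNamespace(), pod.GetName(), container.GetName(), mem)
			}
		}
	}
}
Each ContainerMemory message carries the memory type, the reserved size in bytes, and the NUMA topology it was assigned to; containers that are not in Guaranteed QoS pods simply have no such entries.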