Troubleshooting Topology Management
Kubernetes keeps many aspects of how pods execute on nodes abstracted
from the user. This is by design. However, some workloads require
stronger guarantees in terms of latency and/or performance in order to operate
acceptably. The kubelet provides methods to enable more complex workload
placement policies while keeping the abstraction free from explicit placement
directives.
You can manage topology within nodes. This means helping the kubelet to configure the host operating system so that Pods and containers are placed on the correct side of inner boundaries, such as NUMA domains. (NUMA is an abbreviation of non-uniform memory access, and refers to an idea that CPUs might be topologically closer to specific regions of memory, due to the physical layout of the hardware components and the way that these are connected).
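If you want to see the NUMA layout of a particular node before looking at the kubelet's behaviour, you can log in to that node and inspect it with a tool such as numactl (assuming the tool is installed; lscpu reports similar information):
numactl --hardware
The output lists each NUMA node together with the CPUs attached to it and its local memory size.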
Sources of troubleshooting information
In the context of topology management, you can use the following sources of information to troubleshoot why a pod could not be deployed to, or was rejected by, a node:
- Pod status - indicates topology affinity errors
- system logs - include valuable information for debugging; for example, about generated hints
- kubelet state file - the dump of internal state of the Memory Manager (including the node map and memory maps)
- device plugin resource API - provides information about the memory reserved for containers
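As a quick orientation, the following commands show where each of these sources typically lives, assuming a systemd-managed kubelet and the default kubelet root directory:
# Pod status, including any topology affinity error
kubectl describe pod <pod-name>
# kubelet system logs
journalctl -u kubelet
# Memory Manager state file (run on the node)
cat /var/lib/kubelet/memory_manager_state
# the device plugin resource API is served over this Unix socket (run on the node)
ls -l /var/lib/kubelet/pod-resources/kubelet.sock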
Troubleshoot TopologyAffinityError
This error typically occurs in the following situations:
- a node does not have enough resources available to satisfy the pod's request
- the pod's request is rejected due to particular Topology Manager policy constraints
The error appears in the status of a pod:
kubectl get pods
NAME         READY   STATUS                  RESTARTS   AGE
guaranteed   0/1     TopologyAffinityError   0          113s
Use kubectl describe pod <id> or kubectl events to obtain a detailed error message:
Warning TopologyAffinityError 10m kubelet, dell8 Resources cannot be allocated with Topology locality
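If you are not sure which pod was affected, you can also list recent events filtered by reason (this assumes the events have not yet expired from the cluster's event retention window):
kubectl get events --field-selector reason=TopologyAffinityError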
Examine system logs
Search the system logs for entries related to a particular pod.
The set of hints that the CPU Manager generated for the pod should be present in the logs, as should the set of hints generated by the Memory Manager.
Topology Manager merges these hints to calculate a single best hint. The best hint should also be present in the logs.
The best hint indicates where to allocate all the resources. Topology Manager tests this hint against its current policy, and based on the verdict, it either admits the pod to the node or rejects it.
Also, search the logs for entries associated with the Memory Manager,
for example to find information about cgroups and cpuset.mems updates.
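For example, on a node where the kubelet runs as a systemd service, searches along the following lines can surface the generated hints and the Memory Manager activity for a given pod (the exact wording of kubelet log lines varies between versions, so treat these patterns as a starting point rather than exact matches):
# hints generated by the CPU Manager, Memory Manager and Topology Manager
journalctl -u kubelet | grep -i hint
# entries that mention the pod by name
journalctl -u kubelet | grep guaranteed
# Memory Manager activity, including cgroups and cpuset.mems updates
journalctl -u kubelet | grep -i memory_manager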
Examples
Examine the memory manager state on a node
Let us first deploy a sample Guaranteed pod whose specification is as follows:
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed
spec:
  containers:
  - name: guaranteed
    image: consumer
    imagePullPolicy: Never
    resources:
      limits:
        cpu: "2"
        memory: 150Gi
      requests:
        cpu: "2"
        memory: 150Gi
    command: ["sleep","infinity"]
Next, log into the node where it was deployed and examine the state file in
/var/lib/kubelet/memory_manager_state:
{
  "policyName":"Static",
  "machineState":{
    "0":{
      "numberOfAssignments":1,
      "memoryMap":{
        "hugepages-1Gi":{
          "total":0,
          "systemReserved":0,
          "allocatable":0,
          "reserved":0,
          "free":0
        },
        "memory":{
          "total":134987354112,
          "systemReserved":3221225472,
          "allocatable":131766128640,
          "reserved":131766128640,
          "free":0
        }
      },
      "nodes":[
        0,
        1
      ]
    },
    "1":{
      "numberOfAssignments":1,
      "memoryMap":{
        "hugepages-1Gi":{
          "total":0,
          "systemReserved":0,
          "allocatable":0,
          "reserved":0,
          "free":0
        },
        "memory":{
          "total":135286722560,
          "systemReserved":2252341248,
          "allocatable":133034381312,
          "reserved":29295144960,
          "free":103739236352
        }
      },
      "nodes":[
        0,
        1
      ]
    }
  },
  "entries":{
    "fa9bdd38-6df9-4cf9-aa67-8c4814da37a8":{
      "guaranteed":[
        {
          "numaAffinity":[
            0,
            1
          ],
          "type":"memory",
          "size":161061273600
        }
      ]
    }
  },
  "checksum":4142013182
}
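On disk, this state file is not necessarily formatted for readability. If jq is available on the node, you can pretty-print it or extract individual fields, for example:
jq . /var/lib/kubelet/memory_manager_state
jq '.entries' /var/lib/kubelet/memory_manager_state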
It can be deduced from the state file that the pod was pinned to both NUMA nodes, i.e.:
"numaAffinity":[
0,
1
],
The term pinned means that the pod's memory consumption is constrained (through the cgroups configuration)
to these NUMA nodes.
This also implies that the Memory Manager instantiated a new group that
comprises these two NUMA nodes, that is, the NUMA nodes with indices 0 and 1.
To analyse the memory resources available in a group, the corresponding entries from the NUMA nodes belonging to that group must be added up.
For example, the total amount of free "conventional" memory in the group can be computed
by adding up the free memory available at every NUMA node in the group,
i.e., in the "memory" section of NUMA node 0 ("free":0) and NUMA node 1 ("free":103739236352).
So, the total amount of free "conventional" memory in this group is equal to 0 + 103739236352 bytes.
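If jq is available on the node, this sum can be computed directly from the state file. The sketch below adds up the free conventional memory across all NUMA nodes recorded in machineState, which in this example is exactly the group spanning nodes 0 and 1:
jq '[.machineState[].memoryMap.memory.free] | add' /var/lib/kubelet/memory_manager_state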
The line "systemReserved":3221225472 indicates that the administrator of this node reserved
3221225472 bytes (i.e. 3Gi) to serve kubelet and system processes at NUMA node 0,
by using the --reserved-memory flag.
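The same reservation can also be expressed through the reservedMemory field of the kubelet configuration file. The sketch below only mirrors the values seen in this state file (3Gi on NUMA node 0 and 2148Mi, that is 2252341248 bytes, on NUMA node 1); the reserved totals must stay consistent with the node's kube-reserved, system-reserved and eviction-threshold settings:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
memoryManagerPolicy: Static
reservedMemory:
- numaNode: 0
  limits:
    memory: 3Gi
- numaNode: 1
  limits:
    memory: 2148Mi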
Check the device plugin resource API
The kubelet provides a PodResourceLister gRPC service to enable discovery of resources and associated metadata.
By using its List gRPC endpoint,
information about the memory reserved for each container can be retrieved; it is contained
in the protobuf ContainerMemory message.
This information can only be retrieved for pods in the Guaranteed QoS class.
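As a minimal sketch, the Go program below queries the List endpoint over the kubelet's pod-resources socket and prints the memory reserved for each container. It assumes the default socket path /var/lib/kubelet/pod-resources/kubelet.sock and uses the published k8s.io/kubelet/pkg/apis/podresources/v1 client package; run it on the node itself, or in a pod that mounts the socket, with sufficient privileges:
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// Default location of the kubelet's pod-resources socket.
const socket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
	// Connect to the kubelet's pod-resources gRPC service over the Unix socket.
	conn, err := grpc.Dial(socket, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("cannot connect to %s: %v", socket, err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// List the resources assigned to all pods on this node.
	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("List call failed: %v", err)
	}

	// Print the memory reserved for each container; each entry in the Memory
	// field is a ContainerMemory message.
	for _, pod := range resp.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, mem := range container.GetMemory() {
				fmt.Printf("pod=%s/%s container=%s memory=%+v\n",
					pod.GetNamespace(), pod.GetName(), container.GetName(), mem)
			}
		}
	}
}
Each ContainerMemory message carries the memory type, the reserved size in bytes, and the NUMA topology it was assigned to; containers that are not in Guaranteed QoS pods simply have no such entries.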