Documentation
- 1: Kubernetes Documentation
- 2: Getting started
- 2.1: Learning environment
- 2.2: Production environment
- 2.2.1: Container Runtimes
- 2.2.2: Installing Kubernetes with deployment tools
- 2.2.2.1: Bootstrapping clusters with kubeadm
- 2.2.2.1.1: Installing kubeadm
- 2.2.2.1.2: Troubleshooting kubeadm
- 2.2.2.1.3: Creating a cluster with kubeadm
- 2.2.2.1.4: Customizing components with the kubeadm API
- 2.2.2.1.5: Options for Highly Available Topology
- 2.2.2.1.6: Creating Highly Available Clusters with kubeadm
- 2.2.2.1.7: Set up a High Availability etcd Cluster with kubeadm
- 2.2.2.1.8: Configuring each kubelet in your cluster using kubeadm
- 2.2.2.1.9: Dual-stack support with kubeadm
- 2.2.3: Turnkey Cloud Solutions
- 2.3: Best practices
- 3: Concepts
- 3.1: Overview
- 3.1.1: Kubernetes Components
- 3.1.2: Objects In Kubernetes
- 3.1.2.1: Kubernetes Object Management
- 3.1.2.2: Object Names and IDs
- 3.1.2.3: Labels and Selectors
- 3.1.2.4: Namespaces
- 3.1.2.5: Annotations
- 3.1.2.6: Field Selectors
- 3.1.2.7: Finalizers
- 3.1.2.8: Owners and Dependents
- 3.1.2.9: Recommended Labels
- 3.1.3: The Kubernetes API
- 3.2: Cluster Architecture
- 3.2.1: Nodes
- 3.2.2: Communication between Nodes and the Control Plane
- 3.2.3: Controllers
- 3.2.4: Leases
- 3.2.5: Cloud Controller Manager
- 3.2.6: About cgroup v2
- 3.2.7: Container Runtime Interface (CRI)
- 3.2.8: Garbage Collection
- 3.2.9: Mixed Version Proxy
- 3.3: Containers
- 3.3.1: Images
- 3.3.2: Container Environment
- 3.3.3: Runtime Class
- 3.3.4: Container Lifecycle Hooks
- 3.4: Workloads
- 3.4.1: Pods
- 3.4.1.1: Pod Lifecycle
- 3.4.1.2: Init Containers
- 3.4.1.3: Sidecar Containers
- 3.4.1.4: Ephemeral Containers
- 3.4.1.5: Disruptions
- 3.4.1.6: Pod Quality of Service Classes
- 3.4.1.7: User Namespaces
- 3.4.1.8: Downward API
- 3.4.2: Workload Management
- 3.4.2.1: Deployments
- 3.4.2.2: ReplicaSet
- 3.4.2.3: StatefulSets
- 3.4.2.4: DaemonSet
- 3.4.2.5: Jobs
- 3.4.2.6: Automatic Cleanup for Finished Jobs
- 3.4.2.7: CronJob
- 3.4.2.8: ReplicationController
- 3.4.3: Autoscaling Workloads
- 3.4.4: Managing Workloads
- 3.5: Services, Load Balancing, and Networking
- 3.5.1: Service
- 3.5.2: Ingress
- 3.5.3: Ingress Controllers
- 3.5.4: Gateway API
- 3.5.5: EndpointSlices
- 3.5.6: Network Policies
- 3.5.7: DNS for Services and Pods
- 3.5.8: IPv4/IPv6 dual-stack
- 3.5.9: Topology Aware Routing
- 3.5.10: Networking on Windows
- 3.5.11: Service ClusterIP allocation
- 3.5.12: Service Internal Traffic Policy
- 3.6: Storage
- 3.6.1: Volumes
- 3.6.2: Persistent Volumes
- 3.6.3: Projected Volumes
- 3.6.4: Ephemeral Volumes
- 3.6.5: Storage Classes
- 3.6.6: Volume Attributes Classes
- 3.6.7: Dynamic Volume Provisioning
- 3.6.8: Volume Snapshots
- 3.6.9: Volume Snapshot Classes
- 3.6.10: CSI Volume Cloning
- 3.6.11: Storage Capacity
- 3.6.12: Node-specific Volume Limits
- 3.6.13: Volume Health Monitoring
- 3.6.14: Windows Storage
- 3.7: Configuration
- 3.7.1: Configuration Best Practices
- 3.7.2: ConfigMaps
- 3.7.3: Secrets
- 3.7.4: Liveness, Readiness, and Startup Probes
- 3.7.5: Resource Management for Pods and Containers
- 3.7.6: Organizing Cluster Access Using kubeconfig Files
- 3.7.7: Resource Management for Windows nodes
- 3.8: Security
- 3.8.1: Cloud Native Security and Kubernetes
- 3.8.2: Pod Security Standards
- 3.8.3: Pod Security Admission
- 3.8.4: Service Accounts
- 3.8.5: Pod Security Policies
- 3.8.6: Security For Windows Nodes
- 3.8.7: Controlling Access to the Kubernetes API
- 3.8.8: Role Based Access Control Good Practices
- 3.8.9: Good practices for Kubernetes Secrets
- 3.8.10: Multi-tenancy
- 3.8.11: Hardening Guide - Authentication Mechanisms
- 3.8.12: Kubernetes API Server Bypass Risks
- 3.8.13: Linux kernel security constraints for Pods and containers
- 3.8.14: Security Checklist
- 3.8.15: Application Security Checklist
- 3.9: Policies
- 3.9.1: Limit Ranges
- 3.9.2: Resource Quotas
- 3.9.3: Process ID Limits And Reservations
- 3.9.4: Node Resource Managers
- 3.10: Scheduling, Preemption and Eviction
- 3.10.1: Kubernetes Scheduler
- 3.10.2: Assigning Pods to Nodes
- 3.10.3: Pod Overhead
- 3.10.4: Pod Scheduling Readiness
- 3.10.5: Pod Topology Spread Constraints
- 3.10.6: Taints and Tolerations
- 3.10.7: Scheduling Framework
- 3.10.8: Dynamic Resource Allocation
- 3.10.9: Scheduler Performance Tuning
- 3.10.10: Resource Bin Packing
- 3.10.11: Pod Priority and Preemption
- 3.10.12: Node-pressure Eviction
- 3.10.13: API-initiated Eviction
- 3.11: Cluster Administration
- 3.11.1: Node Shutdowns
- 3.11.2: Certificates
- 3.11.3: Cluster Networking
- 3.11.4: Logging Architecture
- 3.11.5: Compatibility Version For Kubernetes Control Plane Components
- 3.11.6: Metrics For Kubernetes System Components
- 3.11.7: Metrics for Kubernetes Object States
- 3.11.8: System Logs
- 3.11.9: Traces For Kubernetes System Components
- 3.11.10: Proxies in Kubernetes
- 3.11.11: API Priority and Fairness
- 3.11.12: Cluster Autoscaling
- 3.11.13: Installing Addons
- 3.11.14: Coordinated Leader Election
- 3.12: Windows in Kubernetes
- 3.13: Extending Kubernetes
- 3.13.1: Compute, Storage, and Networking Extensions
- 3.13.1.1: Network Plugins
- 3.13.1.2: Device Plugins
- 3.13.2: Extending the Kubernetes API
- 3.13.2.1: Custom Resources
- 3.13.2.2: Kubernetes API Aggregation Layer
- 3.13.3: Operator pattern
- 4: Tasks
- 4.1: Install Tools
- 4.1.1: Install and Set Up kubectl on Linux
- 4.1.2: Install and Set Up kubectl on macOS
- 4.1.3: Install and Set Up kubectl on Windows
- 4.2: Administer a Cluster
- 4.2.1: Administration with kubeadm
- 4.2.1.1: Adding Linux worker nodes
- 4.2.1.2: Adding Windows worker nodes
- 4.2.1.3: Upgrading kubeadm clusters
- 4.2.1.4: Upgrading Linux nodes
- 4.2.1.5: Upgrading Windows nodes
- 4.2.1.6: Configuring a cgroup driver
- 4.2.1.7: Certificate Management with kubeadm
- 4.2.1.8: Reconfiguring a kubeadm cluster
- 4.2.1.9: Changing The Kubernetes Package Repository
- 4.2.2: Overprovision Node Capacity For A Cluster
- 4.2.3: Migrating from dockershim
- 4.2.3.1: Changing the Container Runtime on a Node from Docker Engine to containerd
- 4.2.3.2: Migrate Docker Engine nodes from dockershim to cri-dockerd
- 4.2.3.3: Find Out What Container Runtime is Used on a Node
- 4.2.3.4: Troubleshooting CNI plugin-related errors
- 4.2.3.5: Check whether dockershim removal affects you
- 4.2.3.6: Migrating telemetry and security agents from dockershim
- 4.2.4: Generate Certificates Manually
- 4.2.5: Manage Memory, CPU, and API Resources
- 4.2.5.1: Configure Default Memory Requests and Limits for a Namespace
- 4.2.5.2: Configure Default CPU Requests and Limits for a Namespace
- 4.2.5.3: Configure Minimum and Maximum Memory Constraints for a Namespace
- 4.2.5.4: Configure Minimum and Maximum CPU Constraints for a Namespace
- 4.2.5.5: Configure Memory and CPU Quotas for a Namespace
- 4.2.5.6: Configure a Pod Quota for a Namespace
- 4.2.6: Install a Network Policy Provider
- 4.2.6.1: Use Antrea for NetworkPolicy
- 4.2.6.2: Use Calico for NetworkPolicy
- 4.2.6.3: Use Cilium for NetworkPolicy
- 4.2.6.4: Use Kube-router for NetworkPolicy
- 4.2.6.5: Romana for NetworkPolicy
- 4.2.6.6: Weave Net for NetworkPolicy
- 4.2.7: Access Clusters Using the Kubernetes API
- 4.2.8: Advertise Extended Resources for a Node
- 4.2.9: Autoscale the DNS Service in a Cluster
- 4.2.10: Change the Access Mode of a PersistentVolume to ReadWriteOncePod
- 4.2.11: Change the default StorageClass
- 4.2.12: Switching from Polling to CRI Event-based Updates to Container Status
- 4.2.13: Change the Reclaim Policy of a PersistentVolume
- 4.2.14: Cloud Controller Manager Administration
- 4.2.15: Configure a kubelet image credential provider
- 4.2.16: Configure Quotas for API Objects
- 4.2.17: Control CPU Management Policies on the Node
- 4.2.18: Control Topology Management Policies on a node
- 4.2.19: Customizing DNS Service
- 4.2.20: Debugging DNS Resolution
- 4.2.21: Declare Network Policy
- 4.2.22: Developing Cloud Controller Manager
- 4.2.23: Enable Or Disable A Kubernetes API
- 4.2.24: Encrypting Confidential Data at Rest
- 4.2.25: Decrypt Confidential Data that is Already Encrypted at Rest
- 4.2.26: Guaranteed Scheduling For Critical Add-On Pods
- 4.2.27: IP Masquerade Agent User Guide
- 4.2.28: Limit Storage Consumption
- 4.2.29: Migrate Replicated Control Plane To Use Cloud Controller Manager
- 4.2.30: Namespaces Walkthrough
- 4.2.31: Operating etcd clusters for Kubernetes
- 4.2.32: Reserve Compute Resources for System Daemons
- 4.2.33: Running Kubernetes Node Components as a Non-root User
- 4.2.34: Safely Drain a Node
- 4.2.35: Securing a Cluster
- 4.2.36: Set Kubelet Parameters Via A Configuration File
- 4.2.37: Share a Cluster with Namespaces
- 4.2.38: Upgrade A Cluster
- 4.2.39: Use Cascading Deletion in a Cluster
- 4.2.40: Using a KMS provider for data encryption
- 4.2.41: Using CoreDNS for Service Discovery
- 4.2.42: Using NodeLocal DNSCache in Kubernetes Clusters
- 4.2.43: Using sysctls in a Kubernetes Cluster
- 4.2.44: Utilizing the NUMA-aware Memory Manager
- 4.2.45: Verify Signed Kubernetes Artifacts
- 4.3: Configure Pods and Containers
- 4.3.1: Assign Memory Resources to Containers and Pods
- 4.3.2: Assign CPU Resources to Containers and Pods
- 4.3.3: Assign Pod-level CPU and memory resources
- 4.3.4: Configure GMSA for Windows Pods and containers
- 4.3.5: Resize CPU and Memory Resources assigned to Containers
- 4.3.6: Configure RunAsUserName for Windows pods and containers
- 4.3.7: Create a Windows HostProcess Pod
- 4.3.8: Configure Quality of Service for Pods
- 4.3.9: Assign Extended Resources to a Container
- 4.3.10: Configure a Pod to Use a Volume for Storage
- 4.3.11: Configure a Pod to Use a PersistentVolume for Storage
- 4.3.12: Configure a Pod to Use a Projected Volume for Storage
- 4.3.13: Configure a Security Context for a Pod or Container
- 4.3.14: Configure Service Accounts for Pods
- 4.3.15: Pull an Image from a Private Registry
- 4.3.16: Configure Liveness, Readiness and Startup Probes
- 4.3.17: Assign Pods to Nodes
- 4.3.18: Assign Pods to Nodes using Node Affinity
- 4.3.19: Configure Pod Initialization
- 4.3.20: Attach Handlers to Container Lifecycle Events
- 4.3.21: Configure a Pod to Use a ConfigMap
- 4.3.22: Share Process Namespace between Containers in a Pod
- 4.3.23: Use a User Namespace With a Pod
- 4.3.24: Use an Image Volume With a Pod
- 4.3.25: Create static Pods
- 4.3.26: Translate a Docker Compose File to Kubernetes Resources
- 4.3.27: Enforce Pod Security Standards by Configuring the Built-in Admission Controller
- 4.3.28: Enforce Pod Security Standards with Namespace Labels
- 4.3.29: Migrate from PodSecurityPolicy to the Built-In PodSecurity Admission Controller
- 4.4: Monitoring, Logging, and Debugging
- 4.4.1: Troubleshooting Applications
- 4.4.1.1: Debug Pods
- 4.4.1.2: Debug Services
- 4.4.1.3: Debug a StatefulSet
- 4.4.1.4: Determine the Reason for Pod Failure
- 4.4.1.5: Debug Init Containers
- 4.4.1.6: Debug Running Pods
- 4.4.1.7: Get a Shell to a Running Container
- 4.4.2: Troubleshooting Clusters
- 4.4.2.1: Troubleshooting kubectl
- 4.4.2.2: Resource metrics pipeline
- 4.4.2.3: Tools for Monitoring Resources
- 4.4.2.4: Monitor Node Health
- 4.4.2.5: Debugging Kubernetes nodes with crictl
- 4.4.2.6: Auditing
- 4.4.2.7: Debugging Kubernetes Nodes With Kubectl
- 4.4.2.8: Developing and debugging services locally using telepresence
- 4.4.2.9: Windows debugging tips
- 4.5: Manage Kubernetes Objects
- 4.5.1: Declarative Management of Kubernetes Objects Using Configuration Files
- 4.5.2: Declarative Management of Kubernetes Objects Using Kustomize
- 4.5.3: Managing Kubernetes Objects Using Imperative Commands
- 4.5.4: Imperative Management of Kubernetes Objects Using Configuration Files
- 4.5.5: Update API Objects in Place Using kubectl patch
- 4.5.6: Migrate Kubernetes Objects Using Storage Version Migration
- 4.6: Managing Secrets
- 4.6.1: Managing Secrets using kubectl
- 4.6.2: Managing Secrets using Configuration File
- 4.6.3: Managing Secrets using Kustomize
- 4.7: Inject Data Into Applications
- 4.7.1: Define a Command and Arguments for a Container
- 4.7.2: Define Dependent Environment Variables
- 4.7.3: Define Environment Variables for a Container
- 4.7.4: Expose Pod Information to Containers Through Environment Variables
- 4.7.5: Expose Pod Information to Containers Through Files
- 4.7.6: Distribute Credentials Securely Using Secrets
- 4.8: Run Applications
- 4.8.1: Run a Stateless Application Using a Deployment
- 4.8.2: Run a Single-Instance Stateful Application
- 4.8.3: Run a Replicated Stateful Application
- 4.8.4: Scale a StatefulSet
- 4.8.5: Delete a StatefulSet
- 4.8.6: Force Delete StatefulSet Pods
- 4.8.7: Horizontal Pod Autoscaling
- 4.8.8: HorizontalPodAutoscaler Walkthrough
- 4.8.9: Specifying a Disruption Budget for your Application
- 4.8.10: Accessing the Kubernetes API from a Pod
- 4.9: Run Jobs
- 4.9.1: Running Automated Tasks with a CronJob
- 4.9.2: Coarse Parallel Processing Using a Work Queue
- 4.9.3: Fine Parallel Processing Using a Work Queue
- 4.9.4: Indexed Job for Parallel Processing with Static Work Assignment
- 4.9.5: Job with Pod-to-Pod Communication
- 4.9.6: Parallel Processing using Expansions
- 4.9.7: Handling retriable and non-retriable pod failures with Pod failure policy
- 4.10: Access Applications in a Cluster
- 4.10.1: Deploy and Access the Kubernetes Dashboard
- 4.10.2: Accessing Clusters
- 4.10.3: Configure Access to Multiple Clusters
- 4.10.4: Use Port Forwarding to Access Applications in a Cluster
- 4.10.5: Use a Service to Access an Application in a Cluster
- 4.10.6: Connect a Frontend to a Backend Using Services
- 4.10.7: Create an External Load Balancer
- 4.10.8: List All Container Images Running in a Cluster
- 4.10.9: Set up Ingress on Minikube with the NGINX Ingress Controller
- 4.10.10: Communicate Between Containers in the Same Pod Using a Shared Volume
- 4.10.11: Configure DNS for a Cluster
- 4.10.12: Access Services Running on Clusters
- 4.11: Extend Kubernetes
- 4.11.1: Configure the Aggregation Layer
- 4.11.2: Use Custom Resources
- 4.11.2.1: Extend the Kubernetes API with CustomResourceDefinitions
- 4.11.2.2: Versions in CustomResourceDefinitions
- 4.11.3: Set up an Extension API Server
- 4.11.4: Configure Multiple Schedulers
- 4.11.5: Use an HTTP Proxy to Access the Kubernetes API
- 4.11.6: Use a SOCKS5 Proxy to Access the Kubernetes API
- 4.11.7: Set up Konnectivity service
- 4.12: TLS
- 4.12.1: Configure Certificate Rotation for the Kubelet
- 4.12.2: Manage TLS Certificates in a Cluster
- 4.12.3: Manual Rotation of CA Certificates
- 4.13: Manage Cluster Daemons
- 4.13.1: Building a Basic DaemonSet
- 4.13.2: Perform a Rolling Update on a DaemonSet
- 4.13.3: Perform a Rollback on a DaemonSet
- 4.13.4: Running Pods on Only Some Nodes
- 4.14: Networking
- 4.14.1: Adding entries to Pod /etc/hosts with HostAliases
- 4.14.2: Extend Service IP Ranges
- 4.14.3: Validate IPv4/IPv6 dual-stack
- 4.15: Extend kubectl with plugins
- 4.16: Manage HugePages
- 4.17: Schedule GPUs
- 5: Tutorials
- 5.1: Hello Minikube
- 5.2: Learn Kubernetes Basics
- 5.2.1: Create a Cluster
- 5.2.1.1: Using Minikube to Create a Cluster
- 5.2.2: Deploy an App
- 5.2.2.1: Using kubectl to Create a Deployment
- 5.2.3: Explore Your App
- 5.2.3.1: Viewing Pods and Nodes
- 5.2.4: Expose Your App Publicly
- 5.2.4.1: Using a Service to Expose Your App
- 5.2.5: Scale Your App
- 5.2.6: Update Your App
- 5.2.6.1: Performing a Rolling Update
- 5.3: Configuration
- 5.3.1: Updating Configuration via a ConfigMap
- 5.3.2: Configuring Redis using a ConfigMap
- 5.3.3: Adopting Sidecar Containers
- 5.4: Security
- 5.4.1: Apply Pod Security Standards at the Cluster Level
- 5.4.2: Apply Pod Security Standards at the Namespace Level
- 5.4.3: Restrict a Container's Access to Resources with AppArmor
- 5.4.4: Restrict a Container's Syscalls with seccomp
- 5.5: Stateless Applications
- 5.5.1: Exposing an External IP Address to Access an Application in a Cluster
- 5.5.2: Example: Deploying PHP Guestbook application with Redis
- 5.6: Stateful Applications
- 5.6.1: StatefulSet Basics
- 5.6.2: Example: Deploying WordPress and MySQL with Persistent Volumes
- 5.6.3: Example: Deploying Cassandra with a StatefulSet
- 5.6.4: Running ZooKeeper, A Distributed System Coordinator
- 5.7: Cluster Management
- 5.8: Services
- 6: Reference
- 6.1: Glossary
- 6.2: API Overview
- 6.2.1: Kubernetes API Concepts
- 6.2.2: Server-Side Apply
- 6.2.3: Client Libraries
- 6.2.4: Common Expression Language in Kubernetes
- 6.2.5: Kubernetes Deprecation Policy
- 6.2.6: Deprecated API Migration Guide
- 6.2.7: Kubernetes API health endpoints
- 6.3: API Access Control
- 6.3.1: Authenticating
- 6.3.2: Authenticating with Bootstrap Tokens
- 6.3.3: Authorization
- 6.3.4: Using RBAC Authorization
- 6.3.5: Using Node Authorization
- 6.3.6: Webhook Mode
- 6.3.7: Using ABAC Authorization
- 6.3.8: Admission Control in Kubernetes
- 6.3.9: Dynamic Admission Control
- 6.3.10: Managing Service Accounts
- 6.3.11: Certificates and Certificate Signing Requests
- 6.3.12: Mapping PodSecurityPolicies to Pod Security Standards
- 6.3.13: Kubelet authentication/authorization
- 6.3.14: TLS bootstrapping
- 6.3.15: Mutating Admission Policy
- 6.3.16: Validating Admission Policy
- 6.4: Well-Known Labels, Annotations and Taints
- 6.4.1: Audit Annotations
- 6.5: Kubernetes API
- 6.5.1: Workload Resources
- 6.5.1.1: Pod
- 6.5.1.2: Binding
- 6.5.1.3: PodTemplate
- 6.5.1.4: ReplicationController
- 6.5.1.5: ReplicaSet
- 6.5.1.6: Deployment
- 6.5.1.7: StatefulSet
- 6.5.1.8: ControllerRevision
- 6.5.1.9: DaemonSet
- 6.5.1.10: Job
- 6.5.1.11: CronJob
- 6.5.1.12: HorizontalPodAutoscaler
- 6.5.1.13: HorizontalPodAutoscaler
- 6.5.1.14: PriorityClass
- 6.5.1.15: PodSchedulingContext v1alpha3
- 6.5.1.16: ResourceClaim v1alpha3
- 6.5.1.17: ResourceClaim v1beta1
- 6.5.1.18: ResourceClaimTemplate v1alpha3
- 6.5.1.19: ResourceClaimTemplate v1beta1
- 6.5.1.20: ResourceSlice v1alpha3
- 6.5.1.21: ResourceSlice v1beta1
- 6.5.2: Service Resources
- 6.5.2.1: Service
- 6.5.2.2: Endpoints
- 6.5.2.3: EndpointSlice
- 6.5.2.4: Ingress
- 6.5.2.5: IngressClass
- 6.5.3: Config and Storage Resources
- 6.5.3.1: ConfigMap
- 6.5.3.2: Secret
- 6.5.3.3: CSIDriver
- 6.5.3.4: CSINode
- 6.5.3.5: CSIStorageCapacity
- 6.5.3.6: PersistentVolumeClaim
- 6.5.3.7: PersistentVolume
- 6.5.3.8: StorageClass
- 6.5.3.9: StorageVersionMigration v1alpha1
- 6.5.3.10: Volume
- 6.5.3.11: VolumeAttachment
- 6.5.3.12: VolumeAttributesClass v1beta1
- 6.5.4: Authentication Resources
- 6.5.4.1: ServiceAccount
- 6.5.4.2: TokenRequest
- 6.5.4.3: TokenReview
- 6.5.4.4: CertificateSigningRequest
- 6.5.4.5: ClusterTrustBundle v1alpha1
- 6.5.4.6: SelfSubjectReview
- 6.5.5: Authorization Resources
- 6.5.5.1: LocalSubjectAccessReview
- 6.5.5.2: SelfSubjectAccessReview
- 6.5.5.3: SelfSubjectRulesReview
- 6.5.5.4: SubjectAccessReview
- 6.5.5.5: ClusterRole
- 6.5.5.6: ClusterRoleBinding
- 6.5.5.7: Role
- 6.5.5.8: RoleBinding
- 6.5.6: Policy Resources
- 6.5.6.1: FlowSchema
- 6.5.6.2: LimitRange
- 6.5.6.3: ResourceQuota
- 6.5.6.4: NetworkPolicy
- 6.5.6.5: PodDisruptionBudget
- 6.5.6.6: PriorityLevelConfiguration
- 6.5.6.7: ValidatingAdmissionPolicy
- 6.5.6.8: ValidatingAdmissionPolicyBinding
- 6.5.7: Extend Resources
- 6.5.7.1: CustomResourceDefinition
- 6.5.7.2: DeviceClass v1alpha3
- 6.5.7.3: DeviceClass v1beta1
- 6.5.7.4: MutatingWebhookConfiguration
- 6.5.7.5: ValidatingWebhookConfiguration
- 6.5.8: Cluster Resources
- 6.5.8.1: APIService
- 6.5.8.2: ComponentStatus
- 6.5.8.3: Event
- 6.5.8.4: IPAddress v1beta1
- 6.5.8.5: Lease
- 6.5.8.6: LeaseCandidate v1alpha1
- 6.5.8.7: Namespace
- 6.5.8.8: Node
- 6.5.8.9: RuntimeClass
- 6.5.8.10: ServiceCIDR v1beta1
- 6.5.9: Common Definitions
- 6.5.9.1: DeleteOptions
- 6.5.9.2: LabelSelector
- 6.5.9.3: ListMeta
- 6.5.9.4: LocalObjectReference
- 6.5.9.5: NodeSelectorRequirement
- 6.5.9.6: ObjectFieldSelector
- 6.5.9.7: ObjectMeta
- 6.5.9.8: ObjectReference
- 6.5.9.9: Patch
- 6.5.9.10: Quantity
- 6.5.9.11: ResourceFieldSelector
- 6.5.9.12: Status
- 6.5.9.13: TypedLocalObjectReference
- 6.5.10: Common Parameters
- 6.6: Instrumentation
- 6.6.1: Kubernetes Component SLI Metrics
- 6.6.2: CRI Pod & Container Metrics
- 6.6.3: Node metrics data
- 6.6.4: Kubernetes z-pages
- 6.6.5: Kubernetes Metrics Reference
- 6.7: Kubernetes Issues and Security
- 6.7.1: Kubernetes Issue Tracker
- 6.7.2: Kubernetes Security and Disclosure Information
- 6.7.3: Official CVE Feed
- 6.8: Node Reference Information
- 6.8.1: Kubelet Checkpoint API
- 6.8.2: Linux Kernel Version Requirements
- 6.8.3: Articles on dockershim Removal and on Using CRI-compatible Runtimes
- 6.8.4: Node Labels Populated By The Kubelet
- 6.8.5: Local Files And Paths Used By The Kubelet
- 6.8.6: Kubelet Configuration Directory Merging
- 6.8.7: Kubelet Device Manager API Versions
- 6.8.8: Kubelet Systemd Watchdog
- 6.8.9: Node Status
- 6.8.10: Seccomp and Kubernetes
- 6.9: Networking Reference
- 6.9.1: Protocols for Services
- 6.9.2: Ports and Protocols
- 6.9.3: Virtual IPs and Service Proxies
- 6.10: Setup tools
- 6.10.1: Kubeadm
- 6.10.1.1: Kubeadm Generated
- 6.10.1.2: kubeadm init
- 6.10.1.3: kubeadm join
- 6.10.1.4: kubeadm upgrade
- 6.10.1.5: kubeadm upgrade phases
- 6.10.1.6: kubeadm config
- 6.10.1.7: kubeadm reset
- 6.10.1.8: kubeadm token
- 6.10.1.9: kubeadm version
- 6.10.1.10: kubeadm alpha
- 6.10.1.11: kubeadm certs
- 6.10.1.12: kubeadm init phase
- 6.10.1.13: kubeadm join phase
- 6.10.1.14: kubeadm kubeconfig
- 6.10.1.15: kubeadm reset phase
- 6.10.1.16: Implementation details
- 6.11: Command line tool (kubectl)
- 6.11.1: Introduction to kubectl
- 6.11.2: kubectl Quick Reference
- 6.11.3: kubectl reference
- 6.11.3.1: kubectl
- 6.11.3.2: kubectl annotate
- 6.11.3.3: kubectl api-resources
- 6.11.3.4: kubectl api-versions
- 6.11.3.5: kubectl apply
- 6.11.3.5.1: kubectl apply edit-last-applied
- 6.11.3.5.2: kubectl apply set-last-applied
- 6.11.3.5.3: kubectl apply view-last-applied
- 6.11.3.6: kubectl attach
- 6.11.3.7: kubectl auth
- 6.11.3.7.1: kubectl auth can-i
- 6.11.3.7.2: kubectl auth reconcile
- 6.11.3.7.3: kubectl auth whoami
- 6.11.3.8: kubectl autoscale
- 6.11.3.9: kubectl certificate
- 6.11.3.9.1: kubectl certificate approve
- 6.11.3.9.2: kubectl certificate deny
- 6.11.3.10: kubectl cluster-info
- 6.11.3.10.1: kubectl cluster-info dump
- 6.11.3.11: kubectl completion
- 6.11.3.12: kubectl config
- 6.11.3.12.1: kubectl config current-context
- 6.11.3.12.2: kubectl config delete-cluster
- 6.11.3.12.3: kubectl config delete-context
- 6.11.3.12.4: kubectl config delete-user
- 6.11.3.12.5: kubectl config get-clusters
- 6.11.3.12.6: kubectl config get-contexts
- 6.11.3.12.7: kubectl config get-users
- 6.11.3.12.8: kubectl config rename-context
- 6.11.3.12.9: kubectl config set
- 6.11.3.12.10: kubectl config set-cluster
- 6.11.3.12.11: kubectl config set-context
- 6.11.3.12.12: kubectl config set-credentials
- 6.11.3.12.13: kubectl config unset
- 6.11.3.12.14: kubectl config use-context
- 6.11.3.12.15: kubectl config view
- 6.11.3.13: kubectl cordon
- 6.11.3.14: kubectl cp
- 6.11.3.15: kubectl create
- 6.11.3.15.1: kubectl create clusterrole
- 6.11.3.15.2: kubectl create clusterrolebinding
- 6.11.3.15.3: kubectl create configmap
- 6.11.3.15.4: kubectl create cronjob
- 6.11.3.15.5: kubectl create deployment
- 6.11.3.15.6: kubectl create ingress
- 6.11.3.15.7: kubectl create job
- 6.11.3.15.8: kubectl create namespace
- 6.11.3.15.9: kubectl create poddisruptionbudget
- 6.11.3.15.10: kubectl create priorityclass
- 6.11.3.15.11: kubectl create quota
- 6.11.3.15.12: kubectl create role
- 6.11.3.15.13: kubectl create rolebinding
- 6.11.3.15.14: kubectl create secret
- 6.11.3.15.15: kubectl create secret docker-registry
- 6.11.3.15.16: kubectl create secret generic
- 6.11.3.15.17: kubectl create secret tls
- 6.11.3.15.18: kubectl create service
- 6.11.3.15.19: kubectl create service clusterip
- 6.11.3.15.20: kubectl create service externalname
- 6.11.3.15.21: kubectl create service loadbalancer
- 6.11.3.15.22: kubectl create service nodeport
- 6.11.3.15.23: kubectl create serviceaccount
- 6.11.3.15.24: kubectl create token
- 6.11.3.16: kubectl debug
- 6.11.3.17: kubectl delete
- 6.11.3.18: kubectl describe
- 6.11.3.19: kubectl diff
- 6.11.3.20: kubectl drain
- 6.11.3.21: kubectl edit
- 6.11.3.22: kubectl events
- 6.11.3.23: kubectl exec
- 6.11.3.24: kubectl explain
- 6.11.3.25: kubectl expose
- 6.11.3.26: kubectl get
- 6.11.3.27: kubectl kustomize
- 6.11.3.28: kubectl label
- 6.11.3.29: kubectl logs
- 6.11.3.30: kubectl options
- 6.11.3.31: kubectl patch
- 6.11.3.32: kubectl plugin
- 6.11.3.32.1: kubectl plugin list
- 6.11.3.33: kubectl port-forward
- 6.11.3.34: kubectl proxy
- 6.11.3.35: kubectl replace
- 6.11.3.36: kubectl rollout
- 6.11.3.36.1: kubectl rollout history
- 6.11.3.36.2: kubectl rollout pause
- 6.11.3.36.3: kubectl rollout restart
- 6.11.3.36.4: kubectl rollout resume
- 6.11.3.36.5: kubectl rollout status
- 6.11.3.36.6: kubectl rollout undo
- 6.11.3.37: kubectl run
- 6.11.3.38: kubectl scale
- 6.11.3.39: kubectl set
- 6.11.3.39.1: kubectl set env
- 6.11.3.39.2: kubectl set image
- 6.11.3.39.3: kubectl set resources
- 6.11.3.39.4: kubectl set selector
- 6.11.3.39.5: kubectl set serviceaccount
- 6.11.3.39.6: kubectl set subject
- 6.11.3.40: kubectl taint
- 6.11.3.41: kubectl top
- 6.11.3.41.1: kubectl top node
- 6.11.3.41.2: kubectl top pod
- 6.11.3.42: kubectl uncordon
- 6.11.3.43: kubectl version
- 6.11.3.44: kubectl wait
- 6.11.4: kubectl Commands
- 6.11.5: kubectl
- 6.11.6: JSONPath Support
- 6.11.7: kubectl for Docker Users
- 6.11.8: kubectl Usage Conventions
- 6.12: Component tools
- 6.12.1: Feature Gates
- 6.12.2: Feature Gates (removed)
- 6.12.3: kubelet
- 6.12.4: kube-apiserver
- 6.12.5: kube-controller-manager
- 6.12.6: kube-proxy
- 6.12.7: kube-scheduler
- 6.13: Debug cluster
- 6.13.1: Flow control
- 6.14: Configuration APIs
- 6.14.1: Client Authentication (v1)
- 6.14.2: Client Authentication (v1beta1)
- 6.14.3: Event Rate Limit Configuration (v1alpha1)
- 6.14.4: Image Policy API (v1alpha1)
- 6.14.5: kube-apiserver Admission (v1)
- 6.14.6: kube-apiserver Audit Configuration (v1)
- 6.14.7: kube-apiserver Configuration (v1)
- 6.14.8: kube-apiserver Configuration (v1alpha1)
- 6.14.9: kube-apiserver Configuration (v1beta1)
- 6.14.10: kube-controller-manager Configuration (v1alpha1)
- 6.14.11: kube-proxy Configuration (v1alpha1)
- 6.14.12: kube-scheduler Configuration (v1)
- 6.14.13: kubeadm Configuration (v1beta3)
- 6.14.14: kubeadm Configuration (v1beta4)
- 6.14.15: kubeconfig (v1)
- 6.14.16: Kubelet Configuration (v1)
- 6.14.17: Kubelet Configuration (v1alpha1)
- 6.14.18: Kubelet Configuration (v1beta1)
- 6.14.19: Kubelet CredentialProvider (v1)
- 6.14.20: WebhookAdmission Configuration (v1)
- 6.15: External APIs
- 6.15.1: Kubernetes Custom Metrics (v1beta2)
- 6.15.2: Kubernetes External Metrics (v1beta1)
- 6.15.3: Kubernetes Metrics (v1beta1)
- 6.16: Scheduling
- 6.16.1: Scheduler Configuration
- 6.16.2: Scheduling Policies
- 6.17: Other Tools
- 7: Contribute to Kubernetes
- 7.1: Contribute to Kubernetes Documentation
- 7.2: Suggesting content improvements
- 7.3: Contributing new content
- 7.3.1: Opening a pull request
- 7.3.2: Documenting a feature for a release
- 7.3.3: Submitting blog posts and case studies
- 7.4: Reviewing changes
- 7.5: Localizing Kubernetes documentation
- 7.6: Participating in SIG Docs
- 7.6.1: Roles and responsibilities
- 7.6.2: Issue Wranglers
- 7.6.3: PR wranglers
- 7.7: Documentation style overview
- 7.7.1: Documentation Content Guide
- 7.7.2: Documentation Style Guide
- 7.7.3: Diagram Guide
- 7.7.4: Writing a new topic
- 7.7.5: Page content types
- 7.7.6: Content organization
- 7.7.7: Custom Hugo Shortcodes
- 7.8: Updating Reference Documentation
- 7.8.1: Reference Documentation Quickstart
- 7.8.2: Contributing to the Upstream Kubernetes Code
- 7.8.3: Generating Reference Documentation for the Kubernetes API
- 7.8.4: Generating Reference Documentation for kubectl Commands
- 7.8.5: Generating Reference Documentation for Metrics
- 7.8.6: Generating Reference Pages for Kubernetes Components and Tools
- 7.9: Advanced contributing
- 7.10: Viewing Site Analytics
- 8: Docs smoke test page
1 - Kubernetes Documentation
1.1 - Available Documentation Versions
This website contains documentation for the current version of Kubernetes and the four previous versions of Kubernetes.
The availability of documentation for a Kubernetes version is separate from whether that release is currently supported. Read Support period to learn about which versions of Kubernetes are officially supported, and for how long.
2 - Getting started
This section lists the different ways to set up and run Kubernetes. When you install Kubernetes, choose an installation type based on: ease of maintenance, security, control, available resources, and expertise required to operate and manage a cluster.
You can download Kubernetes to deploy a Kubernetes cluster on a local machine, into the cloud, or for your own datacenter.
Several Kubernetes components such as kube-apiserver or kube-proxy can also be deployed as container images within the cluster.
It is recommended to run Kubernetes components as container images wherever that is possible, and to have Kubernetes manage those components. Components that run containers - notably, the kubelet - can't be included in this category.
If you don't want to manage a Kubernetes cluster yourself, you could pick a managed service, including certified platforms. There are also other standardized and custom solutions across a wide range of cloud and bare metal environments.
Learning environment
If you're learning Kubernetes, use the tools supported by the Kubernetes community, or tools in the ecosystem to set up a Kubernetes cluster on a local machine. See Install tools.
Production environment
When evaluating a solution for a production environment, consider which aspects of operating a Kubernetes cluster (or abstractions) you want to manage yourself and which you prefer to hand off to a provider.
For a cluster you're managing yourself, the officially supported tool for deploying Kubernetes is kubeadm.
What's next
- Download Kubernetes
- Download and install tools including kubectl
- Select a container runtime for your new cluster
- Learn about best practices for cluster setup
Kubernetes is designed for its control plane to run on Linux. Within your cluster you can run applications on Linux or other operating systems, including Windows.
- Learn to set up clusters with Windows nodes
2.1 - Learning environment
2.2 - Production environment
A production-quality Kubernetes cluster requires planning and preparation. If your Kubernetes cluster is to run critical workloads, it must be configured to be resilient. This page explains steps you can take to set up a production-ready cluster, or to promote an existing cluster for production use. If you're already familiar with production setup and want the links, skip to What's next.
Production considerations
Typically, a production Kubernetes cluster environment has more requirements than a personal learning, development, or test environment Kubernetes. A production environment may require secure access by many users, consistent availability, and the resources to adapt to changing demands.
As you decide where you want your production Kubernetes environment to live (on premises or in a cloud) and the amount of management you want to take on or hand to others, consider how your requirements for a Kubernetes cluster are influenced by the following issues:
Availability: A single-machine Kubernetes learning environment has a single point of failure. Creating a highly available cluster means considering:
- Separating the control plane from the worker nodes.
- Replicating the control plane components on multiple nodes.
- Load balancing traffic to the cluster’s API server.
- Having enough worker nodes available, or able to quickly become available, as changing workloads warrant it.
Scale: If you expect your production Kubernetes environment to receive a stable amount of demand, you might be able to set up for the capacity you need and be done. However, if you expect demand to grow over time or change dramatically based on things like season or special events, you need to plan how to scale to relieve increased pressure from more requests to the control plane and worker nodes or scale down to reduce unused resources.
Security and access management: You have full admin privileges on your own Kubernetes learning cluster. But shared clusters with important workloads, and more than one or two users, require a more refined approach to who and what can access cluster resources. You can use role-based access control (RBAC) and other security mechanisms to make sure that users and workloads can get access to the resources they need, while keeping workloads, and the cluster itself, secure. You can set limits on the resources that users and workloads can access by managing policies and container resources.
Before building a Kubernetes production environment on your own, consider handing off some or all of this job to Turnkey Cloud Solutions providers or other Kubernetes Partners. Options include:
- Serverless: Just run workloads on third-party equipment without managing a cluster at all. You will be charged for things like CPU usage, memory, and disk requests.
- Managed control plane: Let the provider manage the scale and availability of the cluster's control plane, as well as handle patches and upgrades.
- Managed worker nodes: Configure pools of nodes to meet your needs, then the provider makes sure those nodes are available and ready to implement upgrades when needed.
- Integration: There are providers that integrate Kubernetes with other services you may need, such as storage, container registries, authentication methods, and development tools.
Whether you build a production Kubernetes cluster yourself or work with partners, review the following sections to evaluate your needs as they relate to your cluster’s control plane, worker nodes, user access, and workload resources.
Production cluster setup
In a production-quality Kubernetes cluster, the control plane manages the cluster from services that can be spread across multiple computers in different ways. Each worker node, however, represents a single entity that is configured to run Kubernetes pods.
Production control plane
The simplest Kubernetes cluster has the entire control plane and worker node services running on the same machine. You can grow that environment by adding worker nodes, as reflected in the diagram illustrated in Kubernetes Components. If the cluster is meant to be available for a short period of time, or can be discarded if something goes seriously wrong, this might meet your needs.
If you need a more permanent, highly available cluster, however, you should consider ways of extending the control plane. By design, control plane services running on a single machine are not highly available. If keeping the cluster up and running and ensuring that it can be repaired if something goes wrong is important, consider these steps:
- Choose deployment tools: You can deploy a control plane using tools such as kubeadm, kops, and kubespray. See Installing Kubernetes with deployment tools to learn tips for production-quality deployments using each of those deployment methods. Different Container Runtimes are available to use with your deployments.
- Manage certificates: Secure communications between control plane services are implemented using certificates. Certificates are automatically generated during deployment or you can generate them using your own certificate authority. See PKI certificates and requirements for details.
- Configure load balancer for apiserver: Configure a load balancer to distribute external API requests to the apiserver service instances running on different nodes. See Create an External Load Balancer for details.
- Separate and backup etcd service: The etcd services can either run on the same machines as other control plane services or run on separate machines, for extra security and availability. Because etcd stores cluster configuration data, backing up the etcd database should be done regularly to ensure that you can repair that database if needed. See the etcd FAQ for details on configuring and using etcd. See Operating etcd clusters for Kubernetes and Set up a High Availability etcd cluster with kubeadm for details.
- Create multiple control plane systems: For high availability, the control plane should not be limited to a single machine. If the control plane services are run by an init service (such as systemd), each service should run on at least three machines. However, running control plane services as pods in Kubernetes ensures that the replicated number of services that you request will always be available. The scheduler should be fault tolerant, but not highly available. Some deployment tools set up the Raft consensus algorithm to do leader election of Kubernetes services. If the primary goes away, another service elects itself and takes over.
- Span multiple zones: If keeping your cluster available at all times is critical, consider creating a cluster that runs across multiple data centers, referred to as zones in cloud environments. Groups of zones are referred to as regions. Spreading a cluster across multiple zones in the same region improves the chances that your cluster will continue to function even if one zone becomes unavailable. See Running in multiple zones for details.
- Manage on-going features: If you plan to keep your cluster over time, there are tasks you need to do to maintain its health and security. For example, if you installed with kubeadm, there are instructions to help you with Certificate Management and Upgrading kubeadm clusters. See Administer a Cluster for a longer list of Kubernetes administrative tasks.
To learn about available options when you run control plane services, see kube-apiserver, kube-controller-manager, and kube-scheduler component pages. For highly available control plane examples, see Options for Highly Available topology, Creating Highly Available clusters with kubeadm, and Operating etcd clusters for Kubernetes. See Backing up an etcd cluster for information on making an etcd backup plan.
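Backing up etcd is often scripted directly with etcdctl. A minimal sketch follows; the endpoint and certificate paths assume kubeadm's default layout and the snapshot destination is an arbitrary example, so adjust both for your cluster.
# Take a snapshot of etcd (paths assume kubeadm defaults; adjust as needed)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/backups/etcd-snapshot.db
Store the resulting snapshot somewhere outside the cluster so it survives the failure scenarios it is meant to guard against.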
Production worker nodes
Production-quality workloads need to be resilient and anything they rely on needs to be resilient (such as CoreDNS). Whether you manage your own control plane or have a cloud provider do it for you, you still need to consider how you want to manage your worker nodes (also referred to simply as nodes).
- Configure nodes: Nodes can be physical or virtual machines. If you want to create and manage your own nodes, you can install a supported operating system, then add and run the appropriate Node services. Consider:
- The demands of your workloads when you set up nodes by having appropriate memory, CPU, and disk speed and storage capacity available.
- Whether generic computer systems will do or you have workloads that need GPU processors, Windows nodes, or VM isolation.
- Validate nodes: See Valid node setup for information on how to ensure that a node meets the requirements to join a Kubernetes cluster.
- Add nodes to the cluster: If you are managing your own cluster you can add nodes by setting up your own machines and either adding them manually or having them register themselves to the cluster’s apiserver. See the Nodes section for information on how to set up Kubernetes to add nodes in these ways.
- Scale nodes: Have a plan for expanding the capacity your cluster will eventually need. See Considerations for large clusters to help determine how many nodes you need, based on the number of pods and containers you need to run. If you are managing nodes yourself, this can mean purchasing and installing your own physical equipment.
- Autoscale nodes: Read Cluster Autoscaling to learn about the tools available to automatically manage your nodes and the capacity they provide.
- Set up node health checks: For important workloads, you want to make sure that the nodes and pods running on those nodes are healthy. Using the Node Problem Detector daemon, you can ensure your nodes are healthy.
Production user management
In production, you may be moving from a model where you or a small group of people are accessing the cluster to where there may potentially be dozens or hundreds of people. In a learning environment or platform prototype, you might have a single administrative account for everything you do. In production, you will want more accounts with different levels of access to different namespaces.
Taking on a production-quality cluster means deciding how you want to selectively allow access by other users. In particular, you need to select strategies for validating the identities of those who try to access your cluster (authentication) and deciding if they have permissions to do what they are asking (authorization):
- Authentication: The apiserver can authenticate users using client certificates, bearer tokens, an authenticating proxy, or HTTP basic auth. You can choose which authentication methods you want to use. Using plugins, the apiserver can leverage your organization’s existing authentication methods, such as LDAP or Kerberos. See Authentication for a description of these different methods of authenticating Kubernetes users.
- Authorization: When you set out to authorize your regular users, you will probably choose between RBAC and ABAC authorization. See Authorization Overview to review different modes for authorizing user accounts (as well as service account access to your cluster):
- Role-based access control (RBAC): Lets you assign access to your cluster by allowing specific sets of permissions to authenticated users. Permissions can be assigned for a specific namespace (Role) or across the entire cluster (ClusterRole). Then using RoleBindings and ClusterRoleBindings, those permissions can be attached to particular users.
- Attribute-based access control (ABAC): Lets you create policies based on resource attributes in the cluster and will allow or deny access based on those attributes. Each line of a policy file identifies versioning properties (apiVersion and kind) and a map of spec properties to match the subject (user or group), resource property, non-resource property (/version or /apis), and readonly. See Examples for details.
As someone setting up authentication and authorization on your production Kubernetes cluster, here are some things to consider:
- Set the authorization mode: When the Kubernetes API server (kube-apiserver) starts, supported authorization modes must be set using an --authorization-config file or the --authorization-mode flag. For example, that flag in the kube-apiserver.yaml file (in /etc/kubernetes/manifests) could be set to Node,RBAC. This would allow Node and RBAC authorization for authenticated requests.
- Create user certificates and role bindings (RBAC): If you are using RBAC authorization, users can create a CertificateSigningRequest (CSR) that can be signed by the cluster CA. Then you can bind Roles and ClusterRoles to each user; a minimal manifest sketch follows this list. See Certificate Signing Requests for details.
- Create policies that combine attributes (ABAC): If you are using ABAC authorization, you can assign combinations of attributes to form policies to authorize selected users or groups to access particular resources (such as a pod), namespace, or apiGroup. For more information, see Examples.
- Consider Admission Controllers: Additional forms of authorization for requests that can come in through the API server include Webhook Token Authentication. Webhooks and other special authorization types need to be enabled by adding Admission Controllers to the API server.
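As a minimal RBAC sketch (the namespace, user name, and verbs below are illustrative placeholders, not values from this guide), a Role plus a RoleBinding that grants one authenticated user read access to Pods in a single namespace could look like this:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: dev        # hypothetical namespace
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: dev
  name: read-pods
subjects:
- kind: User
  name: jane            # must match the identity the apiserver authenticates, e.g. the certificate CN
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
Granting cluster-wide access works the same way with ClusterRole and ClusterRoleBinding.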
Set limits on workload resources
Demands from production workloads can cause pressure both inside and outside of the Kubernetes control plane. Consider these items when setting up for the needs of your cluster's workloads:
- Set namespace limits: Set per-namespace quotas on things like memory and CPU; a ResourceQuota sketch follows this list. See Manage Memory, CPU, and API Resources for details. You can also set Hierarchical Namespaces for inheriting limits.
- Prepare for DNS demand: If you expect workloads to massively scale up, your DNS service must be ready to scale up as well. See Autoscale the DNS service in a Cluster.
- Create additional service accounts: User accounts determine what users can do on a cluster, while a service account defines pod access within a particular namespace. By default, a pod takes on the default service account from its namespace. See Managing Service Accounts for information on creating a new service account. For example, you might want to:
- Add secrets that a pod could use to pull images from a particular container registry. See Configure Service Accounts for Pods for an example.
- Assign RBAC permissions to a service account. See ServiceAccount permissions for details.
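For the per-namespace limits mentioned in the first item of this list, a minimal ResourceQuota sketch (the namespace name and quantities are illustrative placeholders) could be:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: team-a     # hypothetical namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
Apply it with kubectl apply -f; the apiserver then rejects new Pods in that namespace once their aggregate requests or limits would exceed the quota.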
What's next
- Decide if you want to build your own production Kubernetes or obtain one from available Turnkey Cloud Solutions or Kubernetes Partners.
- If you choose to build your own cluster, plan how you want to handle certificates and set up high availability for features such as etcd and the API server.
- Choose from kubeadm, kops or Kubespray deployment methods.
- Configure user management by determining your Authentication and Authorization methods.
- Prepare for application workloads by setting up resource limits, DNS autoscaling and service accounts.
2.2.1 - Container Runtimes
You need to install a container runtime into each node in the cluster so that Pods can run there. This page outlines what is involved and describes related tasks for setting up nodes.
Kubernetes 1.32 requires that you use a runtime that conforms with the Container Runtime Interface (CRI).
See CRI version support for more information.
This page provides an outline of how to use several common container runtimes with Kubernetes.
Note:
Kubernetes releases before v1.24 included a direct integration with Docker Engine, using a component named dockershim. That special direct integration is no longer part of Kubernetes (this removal was announced as part of the v1.20 release). You can read Check whether Dockershim removal affects you to understand how this removal might affect you. To learn about migrating from using dockershim, see Migrating from dockershim.
If you are running a version of Kubernetes other than v1.32, check the documentation for that version.
Install and configure prerequisites
Network configuration
By default, the Linux kernel does not allow IPv4 packets to be routed between interfaces. Most Kubernetes cluster networking implementations will change this setting (if needed), but some might expect the administrator to do it for them. (Some might also expect other sysctl parameters to be set, kernel modules to be loaded, etc; consult the documentation for your specific network implementation.)
Enable IPv4 packet forwarding
To manually enable IPv4 packet forwarding:
# sysctl params required by setup, params persist across reboots
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.ipv4.ip_forward = 1
EOF
# Apply sysctl params without reboot
sudo sysctl --system
Verify that net.ipv4.ip_forward is set to 1 with:
sysctl net.ipv4.ip_forward
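If forwarding is enabled, the command prints:
net.ipv4.ip_forward = 1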
cgroup drivers
On Linux, control groups are used to constrain resources that are allocated to processes.
Both the kubelet and the underlying container runtime need to interface with control groups to enforce resource management for pods and containers and set resources such as cpu/memory requests and limits. To interface with control groups, the kubelet and the container runtime need to use a cgroup driver. It's critical that the kubelet and the container runtime use the same cgroup driver and are configured the same.
There are two cgroup drivers available:
cgroupfs driver
The cgroupfs driver is the default cgroup driver in the kubelet. When the cgroupfs driver is used, the kubelet and the container runtime directly interface with the cgroup filesystem to configure cgroups.
The cgroupfs driver is not recommended when systemd is the init system because systemd expects a single cgroup manager on the system. Additionally, if you use cgroup v2, use the systemd cgroup driver instead of cgroupfs.
systemd cgroup driver
When systemd is chosen as the init system for a Linux distribution, the init process generates and consumes a root control group (cgroup) and acts as a cgroup manager.
systemd has a tight integration with cgroups and allocates a cgroup per systemd unit. As a result, if you use systemd as the init system with the cgroupfs driver, the system gets two different cgroup managers.
Two cgroup managers result in two views of the available and in-use resources in the system. In some cases, nodes that are configured to use cgroupfs for the kubelet and container runtime, but use systemd for the rest of the processes, become unstable under resource pressure.
The approach to mitigate this instability is to use systemd as the cgroup driver for the kubelet and the container runtime when systemd is the selected init system.
To set systemd as the cgroup driver, edit the KubeletConfiguration option of cgroupDriver and set it to systemd. For example:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
...
cgroupDriver: systemd
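If you edit the kubelet configuration file on a node directly (kubeadm writes it to /var/lib/kubelet/config.yaml by default), restart the kubelet afterwards so the change takes effect; on a systemd-based node that is typically:
sudo systemctl restart kubelet
See the caution below before changing the cgroup driver on a node that has already joined a cluster.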
Note:
In v1.22 and later, when creating a cluster with kubeadm, if the user does not set the cgroupDriver field under KubeletConfiguration, kubeadm defaults it to systemd.
If you configure systemd as the cgroup driver for the kubelet, you must also configure systemd as the cgroup driver for the container runtime. Refer to the documentation for your container runtime for instructions. For example:
In Kubernetes 1.32, with the KubeletCgroupDriverFromCRI feature gate enabled and a container runtime that supports the RuntimeConfig CRI RPC, the kubelet automatically detects the appropriate cgroup driver from the runtime, and ignores the cgroupDriver setting within the kubelet configuration.
Caution:
Changing the cgroup driver of a Node that has joined a cluster is a sensitive operation. If the kubelet has created Pods using the semantics of one cgroup driver, changing the container runtime to another cgroup driver can cause errors when trying to re-create the Pod sandbox for such existing Pods. Restarting the kubelet may not solve such errors.
If you have automation that makes it feasible, replace the node with another using the updated configuration, or reinstall it using automation.
Migrating to the systemd driver in kubeadm managed clusters
If you wish to migrate to the systemd cgroup driver in existing kubeadm managed clusters, follow configuring a cgroup driver.
CRI version support
Your container runtime must support at least v1alpha2 of the container runtime interface.
Starting with v1.26, Kubernetes works only with v1 of the CRI API. Earlier Kubernetes versions default to the v1 API; however, if a container runtime does not support the v1 API, the kubelet falls back to using the (deprecated) v1alpha2 API instead.
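One way to confirm which CRI API version your runtime actually serves (this assumes you have crictl installed and configured to talk to your runtime's CRI socket) is:
crictl version
The RuntimeApiVersion field in the output reports the CRI API version negotiated with the runtime.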
Container runtimes
containerd
This section outlines the necessary steps to use containerd as CRI runtime.
To install containerd on your system, follow the instructions on getting started with containerd.
Return to this step once you've created a valid config.toml configuration file.
On Linux, you can find this file under the path /etc/containerd/config.toml.
On Windows, you can find this file under the path C:\Program Files\containerd\config.toml.
On Linux the default CRI socket for containerd is /run/containerd/containerd.sock.
On Windows the default CRI endpoint is npipe://./pipe/containerd-containerd.
Configuring the systemd cgroup driver
To use the systemd cgroup driver in /etc/containerd/config.toml with runc, set
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
...
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
The systemd cgroup driver is recommended if you use cgroup v2.
Note:
If you installed containerd from a package (for example, RPM or .deb), you may find that the CRI integration plugin is disabled by default.
You need CRI support enabled to use containerd with Kubernetes. Make sure that cri is not included in the disabled_plugins list within /etc/containerd/config.toml; if you made changes to that file, also restart containerd.
If you experience container crash loops after the initial cluster installation or after installing a CNI, the containerd configuration provided with the package might contain incompatible configuration parameters. Consider resetting the containerd configuration with containerd config default > /etc/containerd/config.toml as specified in getting-started.md and then set the configuration parameters specified above accordingly.
If you apply this change, make sure to restart containerd:
sudo systemctl restart containerd
When using kubeadm, manually configure the cgroup driver for kubelet.
In Kubernetes v1.28, you can enable automatic detection of the cgroup driver as an alpha feature. See systemd cgroup driver for more details.
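As a sketch of how the kubelet cgroup driver can be passed through kubeadm (the file name and the exact API versions shown are assumptions; check the kubeadm Configuration reference for your release), you can append a KubeletConfiguration document to the configuration file handed to kubeadm init:
# kubeadm-config.yaml (hypothetical file name)
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
Then run kubeadm init --config kubeadm-config.yaml.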
Overriding the sandbox (pause) image
In your containerd config you can overwrite the sandbox image by setting the following config:
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "registry.k8s.io/pause:3.10"
You might need to restart containerd as well once you've updated the config file: systemctl restart containerd.
CRI-O
This section contains the necessary steps to install CRI-O as a container runtime.
To install CRI-O, follow CRI-O Install Instructions.
cgroup driver
CRI-O uses the systemd cgroup driver by default, which is likely to work fine for you. To switch to the cgroupfs cgroup driver, either edit /etc/crio/crio.conf or place a drop-in configuration in /etc/crio/crio.conf.d/02-cgroup-manager.conf, for example:
[crio.runtime]
conmon_cgroup = "pod"
cgroup_manager = "cgroupfs"
You should also note the changed conmon_cgroup, which has to be set to the value pod when using CRI-O with cgroupfs. It is generally necessary to keep the cgroup driver configuration of the kubelet (usually done via kubeadm) and CRI-O in sync.
In Kubernetes v1.28, you can enable automatic detection of the cgroup driver as an alpha feature. See systemd cgroup driver for more details.
For CRI-O, the CRI socket is /var/run/crio/crio.sock by default.
Overriding the sandbox (pause) image
In your CRI-O config you can set the following config value:
[crio.image]
pause_image="registry.k8s.io/pause:3.10"
This config option supports live configuration reload: to apply the change, run systemctl reload crio or send SIGHUP to the crio process.
Docker Engine
Note:
These instructions assume that you are using the cri-dockerd adapter to integrate Docker Engine with Kubernetes.
On each of your nodes, install Docker for your Linux distribution as per Install Docker Engine.
Install cri-dockerd, following the directions in the install section of the documentation.
For cri-dockerd, the CRI socket is /run/cri-dockerd.sock by default.
Mirantis Container Runtime
Mirantis Container Runtime (MCR) is a commercially available container runtime that was formerly known as Docker Enterprise Edition.
You can use Mirantis Container Runtime with Kubernetes using the open source
cri-dockerd
component, included with MCR.
To learn more about how to install Mirantis Container Runtime, visit MCR Deployment Guide.
Check the systemd unit named cri-docker.socket
to find out the path to the CRI
socket.
Overriding the sandbox (pause) image
The cri-dockerd
adapter accepts a command line argument for
specifying which container image to use as the Pod infrastructure container (“pause image”).
The command line argument to use is --pod-infra-container-image.
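As a hedged sketch (the drop-in path is an assumption, and your packaged unit's existing ExecStart flags may differ; check them with systemctl cat cri-docker.service), the flag could be supplied through a systemd drop-in:
# /etc/systemd/system/cri-docker.service.d/10-pause-image.conf (assumed drop-in location)
[Service]
ExecStart=
# Keep whatever flags your packaged unit already passes, then add the pause image override.
ExecStart=/usr/bin/cri-dockerd --container-runtime-endpoint fd:// --pod-infra-container-image=registry.k8s.io/pause:3.10
After adding the drop-in, run sudo systemctl daemon-reload and restart the cri-docker service.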
What's next
As well as a container runtime, your cluster will need a working network plugin.
2.2.2 - Installing Kubernetes with deployment tools
There are many methods and tools for setting up your own production Kubernetes cluster. For example:
- Cluster API: A Kubernetes sub-project focused on providing declarative APIs and tooling to simplify provisioning, upgrading, and operating multiple Kubernetes clusters.
- kops: An automated cluster provisioning tool. For tutorials, best practices, configuration options and information on reaching out to the community, please check the kOps website for details.
- kubespray: A composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes clusters configuration management tasks. You can reach out to the community on Slack channel #kubespray.
2.2.2.1 - Bootstrapping clusters with kubeadm
2.2.2.1.1 - Installing kubeadm
This page shows how to install the kubeadm
toolbox.
For information on how to create a cluster with kubeadm once you have performed this installation process,
see the Creating a cluster with kubeadm page.
This installation guide is for Kubernetes v1.32. If you want to use a different Kubernetes version, please refer to the following pages instead:
Before you begin
- A compatible Linux host. The Kubernetes project provides generic instructions for Linux distributions based on Debian and Red Hat, and those distributions without a package manager.
- 2 GB or more of RAM per machine (any less will leave little room for your apps).
- 2 CPUs or more for control plane machines.
- Full network connectivity between all machines in the cluster (public or private network is fine).
- Unique hostname, MAC address, and product_uuid for every node. See here for more details.
- Certain ports are open on your machines. See here for more details.
Note:
The kubeadm installation is done via binaries that use dynamic linking and assumes that your target system provides glibc.
This is a reasonable assumption on many Linux distributions (including Debian, Ubuntu, Fedora, CentOS, etc.)
but it is not always the case with custom and lightweight distributions which don't include glibc by default, such as Alpine Linux.
The expectation is that the distribution either includes glibc or a compatibility layer that provides the expected symbols.
Verify the MAC address and product_uuid are unique for every node
- You can get the MAC address of the network interfaces using the command ip link or ifconfig -a
- The product_uuid can be checked by using the command sudo cat /sys/class/dmi/id/product_uuid
It is very likely that hardware devices will have unique addresses, although some virtual machines may have identical values. Kubernetes uses these values to uniquely identify the nodes in the cluster. If these values are not unique to each node, the installation process may fail.
Check network adapters
If you have more than one network adapter, and your Kubernetes components are not reachable on the default route, we recommend you add IP route(s) so Kubernetes cluster addresses go via the appropriate adapter.
Check required ports
These required ports need to be open in order for Kubernetes components to communicate with each other. You can use tools like netcat to check if a port is open. For example:
nc 127.0.0.1 6443 -v
The pod network plugin you use may also require certain ports to be open. Since this differs with each pod network plugin, please see the documentation for the plugins about what port(s) those need.
Swap configuration
The default behavior of a kubelet is to fail to start if swap memory is detected on a node. This means that swap should either be disabled or tolerated by kubelet.
- To tolerate swap, add failSwapOn: false to the kubelet configuration or as a command line argument. Note: even if failSwapOn: false is provided, workloads wouldn't have swap access by default. This can be changed by setting a swapBehavior, again in the kubelet configuration file. To use swap, set a swapBehavior other than the default NoSwap setting. See Swap memory management for more details, and the sketch after this list.
- To disable swap, sudo swapoff -a can be used to disable swapping temporarily. To make this change persistent across reboots, make sure swap is disabled in config files like /etc/fstab, systemd.swap, depending on how it was configured on your system.
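As a minimal sketch (the LimitedSwap value is one possible choice, not a recommendation, and depending on your Kubernetes version the NodeSwap feature gate may also need to be enabled), tolerating swap and letting workloads use it could look like this in the kubelet configuration file:
# Fragment of a KubeletConfiguration (for example /var/lib/kubelet/config.yaml on kubeadm-provisioned nodes)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
memorySwap:
  swapBehavior: LimitedSwap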
Installing a container runtime
To run containers in Pods, Kubernetes uses a container runtime.
By default, Kubernetes uses the Container Runtime Interface (CRI) to interface with your chosen container runtime.
If you don't specify a runtime, kubeadm automatically tries to detect an installed container runtime by scanning through a list of known endpoints.
If multiple or no container runtimes are detected kubeadm will throw an error and will request that you specify which one you want to use.
See container runtimes for more information.
Note:
Docker Engine does not implement the CRI which is a requirement for a container runtime to work with Kubernetes. For that reason, an additional service cri-dockerd has to be installed. cri-dockerd is a project based on the legacy built-in Docker Engine support that was removed from the kubelet in version 1.24.
The tables below include the known endpoints for supported operating systems:
Runtime | Path to Unix domain socket |
---|---|
containerd | unix:///var/run/containerd/containerd.sock |
CRI-O | unix:///var/run/crio/crio.sock |
Docker Engine (using cri-dockerd) | unix:///var/run/cri-dockerd.sock |
Runtime | Path to Windows named pipe |
---|---|
containerd | npipe:////./pipe/containerd-containerd |
Docker Engine (using cri-dockerd) | npipe:////./pipe/cri-dockerd |
Installing kubeadm, kubelet and kubectl
You will install these packages on all of your machines:
- kubeadm: the command to bootstrap the cluster.
- kubelet: the component that runs on all of the machines in your cluster and does things like starting pods and containers.
- kubectl: the command line util to talk to your cluster.
kubeadm will not install or manage kubelet
or kubectl
for you, so you will
need to ensure they match the version of the Kubernetes control plane you want
kubeadm to install for you. If you do not, there is a risk of a version skew occurring that
can lead to unexpected, buggy behaviour. However, one minor version skew between the
kubelet and the control plane is supported, but the kubelet version may never exceed the API
server version. For example, the kubelet running 1.7.0 should be fully compatible with a 1.8.0 API server,
but not vice versa.
For information about installing kubectl
, see Install and set up kubectl.
Warning:
These instructions exclude all Kubernetes packages from any system upgrades. This is because kubeadm and Kubernetes require special attention to upgrade.
For more information on version skews, see:
- Kubernetes version and version-skew policy
- Kubeadm-specific version skew policy
Note:
The legacy package repositories (apt.kubernetes.io and yum.kubernetes.io) have been deprecated and frozen starting from September 13, 2023.
Using the new package repositories hosted at pkgs.k8s.io is strongly recommended and required in order to install Kubernetes versions released after September 13, 2023.
The deprecated legacy repositories, and their contents, might be removed at any time in the future and without a further notice period. The new package repositories provide downloads for Kubernetes versions starting with v1.24.0.
Note:
There's a dedicated package repository for each Kubernetes minor version. If you want to install a minor version other than v1.32, please see the installation guide for your desired minor version.
These instructions are for Kubernetes v1.32.
Update the apt package index and install packages needed to use the Kubernetes apt repository:
sudo apt-get update
# apt-transport-https may be a dummy package; if so, you can skip that package
sudo apt-get install -y apt-transport-https ca-certificates curl gpg
Download the public signing key for the Kubernetes package repositories. The same signing key is used for all repositories so you can disregard the version in the URL:
# If the directory `/etc/apt/keyrings` does not exist, it should be created before the curl command, read the note below.
# sudo mkdir -p -m 755 /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.32/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
Note:
In releases older than Debian 12 and Ubuntu 22.04, the directory /etc/apt/keyrings does not exist by default, and it should be created before the curl command.
Add the appropriate Kubernetes apt repository. Please note that this repository has packages only for Kubernetes 1.32; for other Kubernetes minor versions, you need to change the Kubernetes minor version in the URL to match your desired minor version (you should also check that you are reading the documentation for the version of Kubernetes that you plan to install).
# This overwrites any existing configuration in /etc/apt/sources.list.d/kubernetes.list
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.32/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
Update the apt package index, install kubelet, kubeadm and kubectl, and pin their version:
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
(Optional) Enable the kubelet service before running kubeadm:
sudo systemctl enable --now kubelet
These instructions are for Kubernetes v1.32.
Set SELinux to permissive mode:
# Set SELinux in permissive mode (effectively disabling it)
sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
Caution:
- Setting SELinux in permissive mode by running setenforce 0 and sed ... effectively disables it. This is required to allow containers to access the host filesystem; for example, some cluster network plugins require that. You have to do this until SELinux support is improved in the kubelet.
- You can leave SELinux enabled if you know how to configure it but it may require settings that are not supported by kubeadm.
Add the Kubernetes yum repository. The exclude parameter in the repository definition ensures that the packages related to Kubernetes are not upgraded upon running yum update as there's a special procedure that must be followed for upgrading Kubernetes. Please note that this repository has packages only for Kubernetes 1.32; for other Kubernetes minor versions, you need to change the Kubernetes minor version in the URL to match your desired minor version (you should also check that you are reading the documentation for the version of Kubernetes that you plan to install).
# This overwrites any existing configuration in /etc/yum.repos.d/kubernetes.repo
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.32/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.32/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
EOF
Install kubelet, kubeadm and kubectl:
sudo yum install -y kubelet kubeadm kubectl --disableexcludes=kubernetes
(Optional) Enable the kubelet service before running kubeadm:
sudo systemctl enable --now kubelet
Install CNI plugins (required for most pod networks):
CNI_PLUGINS_VERSION="v1.3.0"
ARCH="amd64"
DEST="/opt/cni/bin"
sudo mkdir -p "$DEST"
curl -L "https://github.com/containernetworking/plugins/releases/download/${CNI_PLUGINS_VERSION}/cni-plugins-linux-${ARCH}-${CNI_PLUGINS_VERSION}.tgz" | sudo tar -C "$DEST" -xz
Define the directory to download command files:
Note:
The DOWNLOAD_DIR variable must be set to a writable directory.
If you are running Flatcar Container Linux, set DOWNLOAD_DIR="/opt/bin".
DOWNLOAD_DIR="/usr/local/bin"
sudo mkdir -p "$DOWNLOAD_DIR"
Optionally install crictl (required for interaction with the Container Runtime Interface (CRI), optional for kubeadm):
CRICTL_VERSION="v1.31.0"
ARCH="amd64"
curl -L "https://github.com/kubernetes-sigs/cri-tools/releases/download/${CRICTL_VERSION}/crictl-${CRICTL_VERSION}-linux-${ARCH}.tar.gz" | sudo tar -C $DOWNLOAD_DIR -xz
Install kubeadm, kubelet and add a kubelet systemd service:
RELEASE="$(curl -sSL https://dl.k8s.io/release/stable.txt)"
ARCH="amd64"
cd $DOWNLOAD_DIR
sudo curl -L --remote-name-all https://dl.k8s.io/release/${RELEASE}/bin/linux/${ARCH}/{kubeadm,kubelet}
sudo chmod +x {kubeadm,kubelet}
RELEASE_VERSION="v0.16.2"
curl -sSL "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/krel/templates/latest/kubelet/kubelet.service" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | sudo tee /usr/lib/systemd/system/kubelet.service
sudo mkdir -p /usr/lib/systemd/system/kubelet.service.d
curl -sSL "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/krel/templates/latest/kubeadm/10-kubeadm.conf" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | sudo tee /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf
Note:
Please refer to the note in the Before you begin section for Linux distributions that do not include glibc by default.
Install kubectl by following the instructions on the Install Tools page.
Optionally, enable the kubelet service before running kubeadm:
sudo systemctl enable --now kubelet
Note:
The Flatcar Container Linux distribution mounts the /usr directory as a read-only filesystem.
Before bootstrapping your cluster, you need to take additional steps to configure a writable directory.
See the Kubeadm Troubleshooting guide to learn how to set up a writable directory.
The kubelet is now restarting every few seconds, as it waits in a crashloop for kubeadm to tell it what to do.
Configuring a cgroup driver
Both the container runtime and the kubelet have a property called "cgroup driver", which is important for the management of cgroups on Linux machines.
Warning:
Matching the container runtime and kubelet cgroup drivers is required; otherwise the kubelet process will fail.
See Configuring a cgroup driver for more details.
Troubleshooting
If you are running into difficulties with kubeadm, please consult our troubleshooting docs.
What's next
2.2.2.1.2 - Troubleshooting kubeadm
As with any program, you might run into an error installing or running kubeadm. This page lists some common failure scenarios and provides steps that can help you understand and fix the problem.
If your problem is not listed below, please follow the following steps:
If you think your problem is a bug with kubeadm:
- Go to github.com/kubernetes/kubeadm and search for existing issues.
- If no issue exists, please open one and follow the issue template.
If you are unsure about how kubeadm works, you can ask on Slack in #kubeadm, or open a question on StackOverflow. Please include relevant tags like #kubernetes and #kubeadm so folks can help you.
Not possible to join a v1.18 Node to a v1.17 cluster due to missing RBAC
In v1.18 kubeadm added prevention for joining a Node in the cluster if a Node with the same name already exists. This required adding RBAC for the bootstrap-token user to be able to GET a Node object.
However this causes an issue where kubeadm join
from v1.18 cannot join a cluster created by kubeadm v1.17.
To work around the issue, you have two options:
Execute kubeadm init phase bootstrap-token
on a control-plane node using kubeadm v1.18.
Note that this enables the rest of the bootstrap-token permissions as well.
or
Apply the following RBAC manually using kubectl apply -f ...
:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kubeadm:get-nodes
rules:
- apiGroups:
- ""
resources:
- nodes
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kubeadm:get-nodes
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kubeadm:get-nodes
subjects:
- apiGroup: rbac.authorization.k8s.io
kind: Group
name: system:bootstrappers:kubeadm:default-node-token
ebtables or some similar executable not found during installation
If you see the following warnings while running kubeadm init
[preflight] WARNING: ebtables not found in system path
[preflight] WARNING: ethtool not found in system path
Then you may be missing ebtables, ethtool or a similar executable on your node.
You can install them with the following commands:
- For Ubuntu/Debian users, run apt install ebtables ethtool.
- For CentOS/Fedora users, run yum install ebtables ethtool.
kubeadm blocks waiting for control plane during installation
If you notice that kubeadm init
hangs after printing out the following line:
[apiclient] Created API client, waiting for the control plane to become ready
This may be caused by a number of problems. The most common are:
- network connection problems. Check that your machine has full network connectivity before continuing.
- the cgroup driver of the container runtime differs from that of the kubelet. To understand how to configure it properly, see Configuring a cgroup driver.
- control plane containers are crashlooping or hanging. You can check this by running docker ps and investigating each container by running docker logs. For other container runtimes, see Debugging Kubernetes nodes with crictl and the sketch after this list.
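A hedged sketch of the crictl equivalent (the containerd socket path is an assumption; substitute your runtime's endpoint, and <container-id> is a placeholder):
# List all containers, including exited ones, then inspect the logs of a failing one.
sudo crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a
sudo crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock logs <container-id>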
kubeadm blocks when removing managed containers
The following could happen if the container runtime halts and does not remove any Kubernetes-managed containers:
sudo kubeadm reset
[preflight] Running pre-flight checks
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Removing kubernetes-managed containers
(block)
A possible solution is to restart the container runtime and then re-run kubeadm reset
.
You can also use crictl
to debug the state of the container runtime. See
Debugging Kubernetes nodes with crictl.
Pods in RunContainerError, CrashLoopBackOff or Error state
Right after kubeadm init
there should not be any pods in these states.
- If there are pods in one of these states right after kubeadm init, please open an issue in the kubeadm repo. coredns (or kube-dns) should be in the Pending state until you have deployed the network add-on.
- If you see Pods in the RunContainerError, CrashLoopBackOff or Error state after deploying the network add-on and nothing happens to coredns (or kube-dns), it's very likely that the Pod Network add-on that you installed is somehow broken. You might have to grant it more RBAC privileges or use a newer version. Please file an issue in the Pod Network providers' issue tracker and get the issue triaged there.
coredns is stuck in the Pending state
This is expected and part of the design. kubeadm is network provider-agnostic, so the admin
should install the pod network add-on
of choice. You have to install a Pod Network
before CoreDNS may be deployed fully. Hence the Pending
state before the network is set up.
HostPort services do not work
The HostPort
and HostIP
functionality is available depending on your Pod Network
provider. Please contact the author of the Pod Network add-on to find out whether
HostPort
and HostIP
functionality are available.
Calico, Canal, and Flannel CNI providers are verified to support HostPort.
For more information, see the CNI portmap documentation.
If your network provider does not support the portmap CNI plugin, you may need to use the
NodePort feature of services
or use HostNetwork=true
.
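For illustration only (the Pod name and image are assumptions), a Pod that opts into the host's network namespace instead of relying on HostPort could look like:
apiVersion: v1
kind: Pod
metadata:
  name: hostnetwork-example  # hypothetical name
spec:
  hostNetwork: true          # the Pod shares the node's network namespace
  containers:
  - name: web
    image: nginx             # example image
    ports:
    - containerPort: 80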
Pods are not accessible via their Service IP
Many network add-ons do not yet enable hairpin mode which allows pods to access themselves via their Service IP. This is an issue related to CNI. Please contact the network add-on provider to get the latest status of their support for hairpin mode.
If you are using VirtualBox (directly or via Vagrant), you will need to ensure that
hostname -i
returns a routable IP address. By default, the first interface is connected to a non-routable host-only network. A workaround is to modify /etc/hosts, see this Vagrantfile for an example.
TLS certificate errors
The following error indicates a possible certificate mismatch.
# kubectl get pods
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
- Verify that the $HOME/.kube/config file contains a valid certificate, and regenerate a certificate if necessary. The certificates in a kubeconfig file are base64 encoded. The base64 --decode command can be used to decode the certificate and openssl x509 -text -noout can be used for viewing the certificate information.
- Unset the KUBECONFIG environment variable using:
unset KUBECONFIG
Or set it to the default KUBECONFIG location:
export KUBECONFIG=/etc/kubernetes/admin.conf
- Another workaround is to overwrite the existing kubeconfig for the "admin" user:
mv $HOME/.kube $HOME/.kube.bak
mkdir $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Kubelet client certificate rotation fails
By default, kubeadm configures a kubelet with automatic rotation of client certificates by using the
/var/lib/kubelet/pki/kubelet-client-current.pem
symlink specified in /etc/kubernetes/kubelet.conf
.
If this rotation process fails you might see errors such as x509: certificate has expired or is not yet valid
in kube-apiserver logs. To fix the issue you must follow these steps:
- Backup and delete /etc/kubernetes/kubelet.conf and /var/lib/kubelet/pki/kubelet-client* from the failed node.
- From a working control plane node in the cluster that has /etc/kubernetes/pki/ca.key execute kubeadm kubeconfig user --org system:nodes --client-name system:node:$NODE > kubelet.conf. $NODE must be set to the name of the existing failed node in the cluster. Modify the resulting kubelet.conf manually to adjust the cluster name and server endpoint, or pass kubeconfig user --config (see Generating kubeconfig files for additional users). If your cluster does not have the ca.key you must sign the embedded certificates in the kubelet.conf externally.
- Copy this resulting kubelet.conf to /etc/kubernetes/kubelet.conf on the failed node.
- Restart the kubelet (systemctl restart kubelet) on the failed node and wait for /var/lib/kubelet/pki/kubelet-client-current.pem to be recreated.
- Manually edit the kubelet.conf to point to the rotated kubelet client certificates, by replacing client-certificate-data and client-key-data with:
client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
- Restart the kubelet.
- Make sure the node becomes Ready.
Default NIC When using flannel as the pod network in Vagrant
The following error might indicate that something was wrong in the pod network:
Error from server (NotFound): the server could not find the requested resource
If you're using flannel as the pod network inside Vagrant, then you will have to specify the default interface name for flannel.
- Vagrant typically assigns two interfaces to all VMs. The first, for which all hosts are assigned the IP address 10.0.2.15, is for external traffic that gets NATed.
- This may lead to problems with flannel, which defaults to the first interface on a host. This leads to all hosts thinking they have the same public IP address. To prevent this, pass the --iface eth1 flag to flannel so that the second interface is chosen.
Non-public IP used for containers
In some situations kubectl logs
and kubectl run
commands may return with the
following errors in an otherwise functional cluster:
Error from server: Get https://10.19.0.41:10250/containerLogs/default/mysql-ddc65b868-glc5m/mysql: dial tcp 10.19.0.41:10250: getsockopt: no route to host
This may be due to Kubernetes using an IP that can not communicate with other IPs on the seemingly same subnet, possibly by policy of the machine provider.
- DigitalOcean assigns a public IP to eth0 as well as a private one to be used internally as an anchor for their floating IP feature, yet kubelet will pick the latter as the node's InternalIP instead of the public one.
- Use ip addr show to check for this scenario instead of ifconfig because ifconfig will not display the offending alias IP address. Alternatively, an API endpoint specific to DigitalOcean allows you to query for the anchor IP from the droplet:
curl http://169.254.169.254/metadata/v1/interfaces/public/0/anchor_ipv4/address
- The workaround is to tell kubelet which IP to use using --node-ip. When using DigitalOcean, it can be the public one (assigned to eth0) or the private one (assigned to eth1) should you want to use the optional private network. The kubeletExtraArgs section of the kubeadm NodeRegistrationOptions structure can be used for this (see the sketch after this list).
- Then restart kubelet:
systemctl daemon-reload
systemctl restart kubelet
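A minimal sketch of that kubeadm configuration (the IP address is an assumption; use the address the node should advertise), using the v1beta4 name/value form of kubeletExtraArgs:
apiVersion: kubeadm.k8s.io/v1beta4
kind: JoinConfiguration
nodeRegistration:
  kubeletExtraArgs:
  - name: "node-ip"
    value: "203.0.113.10"  # assumed example address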
coredns pods have CrashLoopBackOff or Error state
If you have nodes that are running SELinux with an older version of Docker, you might experience a scenario
where the coredns
pods are not starting. To solve that, you can try one of the following options:
- Upgrade to a newer version of Docker.
- Modify the coredns deployment to set allowPrivilegeEscalation to true:
kubectl -n kube-system get deployment coredns -o yaml | \
sed 's/allowPrivilegeEscalation: false/allowPrivilegeEscalation: true/g' | \
kubectl apply -f -
Another cause for CoreDNS to have CrashLoopBackOff
is when a CoreDNS Pod deployed in Kubernetes detects a loop.
A number of workarounds
are available to avoid Kubernetes trying to restart the CoreDNS Pod every time CoreDNS detects the loop and exits.
Warning:
Disabling SELinux or setting allowPrivilegeEscalation to true can compromise the security of your cluster.
etcd pods restart continually
If you encounter the following error:
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:110: decoding init error from pipe caused \"read parent: connection reset by peer\""
This issue appears if you run CentOS 7 with Docker 1.13.1.84. This version of Docker can prevent the kubelet from executing into the etcd container.
To work around the issue, choose one of these options:
Roll back to an earlier version of Docker, such as 1.13.1-75
yum downgrade docker-1.13.1-75.git8633870.el7.centos.x86_64 docker-client-1.13.1-75.git8633870.el7.centos.x86_64 docker-common-1.13.1-75.git8633870.el7.centos.x86_64
Install one of the more recent recommended versions, such as 18.06:
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum install docker-ce-18.06.1.ce-3.el7.x86_64
Not possible to pass a comma separated list of values to arguments inside a --component-extra-args flag
kubeadm init
flags such as --component-extra-args
allow you to pass custom arguments to a control-plane
component like the kube-apiserver. However, this mechanism is limited due to the underlying type used for parsing
the values (mapStringString
).
If you decide to pass an argument that supports multiple, comma-separated values such as
--apiserver-extra-args "enable-admission-plugins=LimitRanger,NamespaceExists"
this flag will fail with
flag: malformed pair, expect string=string
. This happens because the list of arguments for
--apiserver-extra-args
expects key=value
pairs and in this case NamespaceExists
is considered
as a key that is missing a value.
Alternatively, you can try separating the key=value
pairs like so:
--apiserver-extra-args "enable-admission-plugins=LimitRanger,enable-admission-plugins=NamespaceExists"
but this will result in the key enable-admission-plugins
only having the value of NamespaceExists
.
A known workaround is to use the kubeadm configuration file.
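A hedged sketch of that workaround (field names follow the v1beta4 kubeadm API; the admission plugins listed are simply the values from the example above):
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  extraArgs:
  - name: "enable-admission-plugins"
    value: "LimitRanger,NamespaceExists"
Because the value is an ordinary string here, the comma-separated list is passed through to the kube-apiserver unchanged.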
kube-proxy scheduled before node is initialized by cloud-controller-manager
In cloud provider scenarios, kube-proxy can end up being scheduled on new worker nodes before the cloud-controller-manager has initialized the node addresses. This causes kube-proxy to fail to pick up the node's IP address properly and has knock-on effects to the proxy function managing load balancers.
The following error can be seen in kube-proxy Pods:
server.go:610] Failed to retrieve node IP: host IP unknown; known addresses: []
proxier.go:340] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
A known solution is to patch the kube-proxy DaemonSet to allow scheduling it on control-plane nodes regardless of their conditions, keeping it off of other nodes until their initial guarding conditions abate:
kubectl -n kube-system patch ds kube-proxy -p='{
"spec": {
"template": {
"spec": {
"tolerations": [
{
"key": "CriticalAddonsOnly",
"operator": "Exists"
},
{
"effect": "NoSchedule",
"key": "node-role.kubernetes.io/control-plane"
}
]
}
}
}
}'
The tracking issue for this problem is here.
/usr is mounted read-only on nodes
On Linux distributions such as Fedora CoreOS or Flatcar Container Linux, the directory /usr
is mounted as a read-only filesystem.
For flex-volume support,
Kubernetes components like the kubelet and kube-controller-manager use the default path of
/usr/libexec/kubernetes/kubelet-plugins/volume/exec/
, yet the flex-volume directory must be writeable
for the feature to work.
Note:
FlexVolume was deprecated in the Kubernetes v1.23 release.
To work around this issue, you can configure the flex-volume directory using the kubeadm configuration file.
On the primary control-plane Node (created using kubeadm init
), pass the following
file using --config
:
apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
nodeRegistration:
kubeletExtraArgs:
- name: "volume-plugin-dir"
value: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
controllerManager:
extraArgs:
- name: "flex-volume-plugin-dir"
value: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
On joining Nodes:
apiVersion: kubeadm.k8s.io/v1beta4
kind: JoinConfiguration
nodeRegistration:
kubeletExtraArgs:
- name: "volume-plugin-dir"
value: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
Alternatively, you can modify /etc/fstab
to make the /usr
mount writeable, but please
be advised that this is modifying a design principle of the Linux distribution.
kubeadm upgrade plan prints out context deadline exceeded error message
This error message is shown when upgrading a Kubernetes cluster with kubeadm
in
the case of running an external etcd. This is not a critical bug and happens because
older versions of kubeadm perform a version check on the external etcd cluster.
You can proceed with kubeadm upgrade apply ...
.
This issue is fixed as of version 1.19.
kubeadm reset unmounts /var/lib/kubelet
If /var/lib/kubelet
is being mounted, performing a kubeadm reset
will effectively unmount it.
To work around the issue, re-mount the /var/lib/kubelet
directory after performing the kubeadm reset
operation.
This is a regression introduced in kubeadm 1.15. The issue is fixed in 1.20.
Cannot use the metrics-server securely in a kubeadm cluster
In a kubeadm cluster, the metrics-server can be used insecurely by passing the --kubelet-insecure-tls flag to it. This is not recommended for production clusters.
If you want to use TLS between the metrics-server and the kubelet there is a problem, since kubeadm deploys a self-signed serving certificate for the kubelet. This can cause the following errors on the side of the metrics-server:
x509: certificate signed by unknown authority
x509: certificate is valid for IP-foo not IP-bar
See Enabling signed kubelet serving certificates to understand how to configure the kubelets in a kubeadm cluster to have properly signed serving certificates.
Also see How to run the metrics-server securely.
Upgrade fails due to etcd hash not changing
Only applicable to upgrading a control plane node with a kubeadm binary v1.28.3 or later, where the node is currently managed by kubeadm versions v1.28.0, v1.28.1 or v1.28.2.
Here is the error message you may encounter:
[upgrade/etcd] Failed to upgrade etcd: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
[upgrade/etcd] Waiting for previous etcd to become available
I0907 10:10:09.109104 3704 etcd.go:588] [etcd] attempting to see if all cluster endpoints ([https://172.17.0.6:2379/ https://172.17.0.4:2379/ https://172.17.0.3:2379/]) are available 1/10
[upgrade/etcd] Etcd was rolled back and is now available
static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.rollbackOldManifests
cmd/kubeadm/app/phases/upgrade/staticpods.go:525
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.upgradeComponent
cmd/kubeadm/app/phases/upgrade/staticpods.go:254
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.performEtcdStaticPodUpgrade
cmd/kubeadm/app/phases/upgrade/staticpods.go:338
...
The reason for this failure is that the affected versions generate an etcd manifest file with unwanted defaults in the PodSpec. This will result in a diff from the manifest comparison, and kubeadm will expect a change in the Pod hash, but the kubelet will never update the hash.
There are two ways to work around this issue if you see it in your cluster:
The etcd upgrade can be skipped between the affected versions and v1.28.3 (or later) by using:
kubeadm upgrade {apply|node} [version] --etcd-upgrade=false
This is not recommended in case a new etcd version was introduced by a later v1.28 patch version.
Before upgrade, patch the manifest for the etcd static pod, to remove the problematic defaulted attributes:
diff --git a/etc/kubernetes/manifests/etcd_defaults.yaml b/etc/kubernetes/manifests/etcd_origin.yaml
index d807ccbe0aa..46b35f00e15 100644
--- a/etc/kubernetes/manifests/etcd_defaults.yaml
+++ b/etc/kubernetes/manifests/etcd_origin.yaml
@@ -43,7 +43,6 @@ spec:
         scheme: HTTP
       initialDelaySeconds: 10
       periodSeconds: 10
-      successThreshold: 1
       timeoutSeconds: 15
     name: etcd
     resources:
@@ -59,26 +58,18 @@ spec:
         scheme: HTTP
       initialDelaySeconds: 10
       periodSeconds: 10
-      successThreshold: 1
       timeoutSeconds: 15
-    terminationMessagePath: /dev/termination-log
-    terminationMessagePolicy: File
     volumeMounts:
     - mountPath: /var/lib/etcd
       name: etcd-data
     - mountPath: /etc/kubernetes/pki/etcd
       name: etcd-certs
-  dnsPolicy: ClusterFirst
-  enableServiceLinks: true
   hostNetwork: true
   priority: 2000001000
   priorityClassName: system-node-critical
-  restartPolicy: Always
-  schedulerName: default-scheduler
   securityContext:
     seccompProfile:
       type: RuntimeDefault
-  terminationGracePeriodSeconds: 30
   volumes:
   - hostPath:
       path: /etc/kubernetes/pki/etcd
More information can be found in the tracking issue for this bug.
2.2.2.1.3 - Creating a cluster with kubeadm
Using kubeadm
, you can create a minimum viable Kubernetes cluster that conforms to best practices.
In fact, you can use kubeadm
to set up a cluster that will pass the
Kubernetes Conformance tests.
kubeadm
also supports other cluster lifecycle functions, such as
bootstrap tokens and cluster upgrades.
The kubeadm
tool is good if you need:
- A simple way for you to try out Kubernetes, possibly for the first time.
- A way for existing users to automate setting up a cluster and test their application.
- A building block in other ecosystem and/or installer tools with a larger scope.
You can install and use kubeadm
on various machines: your laptop, a set
of cloud servers, a Raspberry Pi, and more. Whether you're deploying into the
cloud or on-premises, you can integrate kubeadm
into provisioning systems such
as Ansible or Terraform.
Before you begin
To follow this guide, you need:
- One or more machines running a deb/rpm-compatible Linux OS; for example: Ubuntu or CentOS.
- 2 GiB or more of RAM per machine--any less leaves little room for your apps.
- At least 2 CPUs on the machine that you use as a control-plane node.
- Full network connectivity among all machines in the cluster. You can use either a public or a private network.
You also need to use a version of kubeadm
that can deploy the version
of Kubernetes that you want to use in your new cluster.
Kubernetes' version and version skew support policy
applies to kubeadm
as well as to Kubernetes overall.
Check that policy to learn about what versions of Kubernetes and kubeadm
are supported. This page is written for Kubernetes v1.32.
The kubeadm
tool's overall feature state is General Availability (GA). Some sub-features are
still under active development. The implementation of creating the cluster may change
slightly as the tool evolves, but the overall implementation should be pretty stable.
Note:
Any commands under kubeadm alpha are, by definition, supported on an alpha level.
Objectives
- Install a single control-plane Kubernetes cluster
- Install a Pod network on the cluster so that your Pods can talk to each other
Instructions
Preparing the hosts
Component installation
Install a container runtime and kubeadm on all the hosts. For detailed instructions and other prerequisites, see Installing kubeadm.
Note:
If you have already installed kubeadm, see the first two steps of the Upgrading Linux nodes document for instructions on how to upgrade kubeadm.
When you upgrade, the kubelet restarts every few seconds as it waits in a crashloop for kubeadm to tell it what to do. This crashloop is expected and normal. After you initialize your control-plane, the kubelet runs normally.
Network setup
kubeadm, similarly to other Kubernetes components, tries to find a usable IP on the network interfaces associated with a default gateway on a host. Such an IP is then used for the advertising and/or listening performed by a component.
To find out what this IP is on a Linux host you can use:
ip route show # Look for a line starting with "default via"
Note:
If two or more default gateways are present on the host, a Kubernetes component will try to use the first one it encounters that has a suitable global unicast IP address. While making this choice, the exact ordering of gateways might vary between different operating systems and kernel versions.
Kubernetes components do not accept a custom network interface as an option, therefore a custom IP address must be passed as a flag to all component instances that need such a custom configuration.
Note:
If the host does not have a default gateway and if a custom IP address is not passed to a Kubernetes component, the component may exit with an error.
To configure the API server advertise address for control plane nodes created with both
init
and join
, the flag --apiserver-advertise-address
can be used.
Preferably, this option can be set in the kubeadm API
as InitConfiguration.localAPIEndpoint
and JoinConfiguration.controlPlane.localAPIEndpoint
.
For kubelets on all nodes, the --node-ip option can be passed in .nodeRegistration.kubeletExtraArgs inside a kubeadm configuration file (InitConfiguration or JoinConfiguration).
For dual-stack see Dual-stack support with kubeadm.
The IP addresses that you assign to control plane components become part of their X.509 certificates' subject alternative name fields. Changing these IP addresses would require signing new certificates and restarting the affected components, so that the change in certificate files is reflected. See Manual certificate renewal for more details on this topic.
Warning:
The Kubernetes project recommends against this approach (configuring all component instances with custom IP addresses). Instead, the Kubernetes maintainers recommend to setup the host network, so that the default gateway IP is the one that Kubernetes components auto-detect and use. On Linux nodes, you can use commands such asip route
to configure networking; your operating
system might also provide higher level network management tools. If your node's default gateway
is a public IP address, you should configure packet filtering or other security measures that
protect the nodes and your cluster.
Preparing the required container images
This step is optional and only applies in case you wish kubeadm init
and kubeadm join
to not download the default container images which are hosted at registry.k8s.io
.
Kubeadm has commands that can help you pre-pull the required images when creating a cluster without an internet connection on its nodes. See Running kubeadm without an internet connection for more details.
Kubeadm allows you to use a custom image repository for the required images. See Using custom images for more details.
Initializing your control-plane node
The control-plane node is the machine where the control plane components run, including etcd (the cluster database) and the API Server (which the kubectl command line tool communicates with).
- (Recommended) If you have plans to upgrade this single control-plane kubeadm cluster to high availability you should specify the --control-plane-endpoint to set the shared endpoint for all control-plane nodes. Such an endpoint can be either a DNS name or an IP address of a load-balancer.
- Choose a Pod network add-on, and verify whether it requires any arguments to be passed to kubeadm init. Depending on which third-party provider you choose, you might need to set the --pod-network-cidr to a provider-specific value. See Installing a Pod network add-on.
- (Optional) kubeadm tries to detect the container runtime by using a list of well known endpoints. To use a different container runtime or if there is more than one installed on the provisioned node, specify the --cri-socket argument to kubeadm. See Installing a runtime.
To initialize the control-plane node run:
kubeadm init <args>
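For illustration (the endpoint name, Pod CIDR, and CRI socket below are assumptions chosen to match the examples on this page, not required values), a typical invocation combining the options above might look like:
kubeadm init \
  --control-plane-endpoint=cluster-endpoint \
  --pod-network-cidr=10.244.0.0/16 \
  --cri-socket=unix:///var/run/containerd/containerd.sock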
Considerations about apiserver-advertise-address and ControlPlaneEndpoint
While --apiserver-advertise-address
can be used to set the advertised address for this particular
control-plane node's API server, --control-plane-endpoint
can be used to set the shared endpoint
for all control-plane nodes.
--control-plane-endpoint
allows both IP addresses and DNS names that can map to IP addresses.
Please contact your network administrator to evaluate possible solutions with respect to such mapping.
Here is an example mapping:
192.168.0.102 cluster-endpoint
Where 192.168.0.102
is the IP address of this node and cluster-endpoint
is a custom DNS name that maps to this IP.
This will allow you to pass --control-plane-endpoint=cluster-endpoint
to kubeadm init
and pass the same DNS name to
kubeadm join
. Later you can modify cluster-endpoint
to point to the address of your load-balancer in a
high availability scenario.
Turning a single control plane cluster created without --control-plane-endpoint
into a highly available cluster
is not supported by kubeadm.
More information
For more information about kubeadm init
arguments, see the kubeadm reference guide.
To configure kubeadm init
with a configuration file see
Using kubeadm init with a configuration file.
To customize control plane components, including optional IPv6 assignment to liveness probe for control plane components and etcd server, provide extra arguments to each component as documented in custom arguments.
To reconfigure a cluster that has already been created see Reconfiguring a kubeadm cluster.
To run kubeadm init
again, you must first tear down the cluster.
If you join a node with a different architecture to your cluster, make sure that your deployed DaemonSets have container image support for this architecture.
kubeadm init
first runs a series of prechecks to ensure that the machine
is ready to run Kubernetes. These prechecks expose warnings and exit on errors. kubeadm init
then downloads and installs the cluster control plane components. This may take several minutes.
After it finishes you should see:
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You should now deploy a Pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
/docs/concepts/cluster-administration/addons/
You can now join any number of machines by running the following on each node
as root:
kubeadm join <control-plane-host>:<control-plane-port> --token <token> --discovery-token-ca-cert-hash sha256:<hash>
To make kubectl work for your non-root user, run these commands, which are
also part of the kubeadm init
output:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Alternatively, if you are the root
user, you can run:
export KUBECONFIG=/etc/kubernetes/admin.conf
Warning:
The kubeconfig file admin.conf
that kubeadm init
generates contains a certificate with
Subject: O = kubeadm:cluster-admins, CN = kubernetes-admin
. The group kubeadm:cluster-admins
is bound to the built-in cluster-admin
ClusterRole.
Do not share the admin.conf
file with anyone.
kubeadm init
generates another kubeconfig file super-admin.conf
that contains a certificate with
Subject: O = system:masters, CN = kubernetes-super-admin
.
system:masters
is a break-glass, super user group that bypasses the authorization layer (for example RBAC).
Do not share the super-admin.conf
file with anyone. It is recommended to move the file to a safe location.
See
Generating kubeconfig files for additional users
on how to use kubeadm kubeconfig user
to generate kubeconfig files for additional users.
Make a record of the kubeadm join
command that kubeadm init
outputs. You
need this command to join nodes to your cluster.
The token is used for mutual authentication between the control-plane node and the joining
nodes. The token included here is secret. Keep it safe, because anyone with this
token can add authenticated nodes to your cluster. These tokens can be listed,
created, and deleted with the kubeadm token
command. See the
kubeadm reference guide.
Installing a Pod network add-on
Caution:
This section contains important information about networking setup and deployment order. Read all of this advice carefully before proceeding.
You must deploy a Container Network Interface (CNI) based Pod network add-on so that your Pods can communicate with each other. Cluster DNS (CoreDNS) will not start up before a network is installed.
- Take care that your Pod network must not overlap with any of the host networks: you are likely to see problems if there is any overlap. (If you find a collision between your network plugin's preferred Pod network and some of your host networks, you should think of a suitable CIDR block to use instead, then use that during kubeadm init with --pod-network-cidr and as a replacement in your network plugin's YAML).
- By default, kubeadm sets up your cluster to use and enforce use of RBAC (role based access control). Make sure that your Pod network plugin supports RBAC, and so do any manifests that you use to deploy it.
- If you want to use IPv6--either dual-stack, or single-stack IPv6 only networking--for your cluster, make sure that your Pod network plugin supports IPv6. IPv6 support was added to CNI in v0.6.0.
Note:
Kubeadm should be CNI agnostic and the validation of CNI providers is out of the scope of our current e2e testing. If you find an issue related to a CNI plugin you should log a ticket in its respective issue tracker instead of the kubeadm or kubernetes issue trackers.
Several external projects provide Kubernetes Pod networks using CNI, some of which also support Network Policy.
See a list of add-ons that implement the Kubernetes networking model.
Please refer to the Installing Addons page for a non-exhaustive list of networking addons supported by Kubernetes. You can install a Pod network add-on with the following command on the control-plane node or a node that has the kubeconfig credentials:
kubectl apply -f <add-on.yaml>
Note:
Only a few CNI plugins support Windows. More details and setup instructions can be found in Adding Windows worker nodes.
You can install only one Pod network per cluster.
Once a Pod network has been installed, you can confirm that it is working by
checking that the CoreDNS Pod is Running
in the output of kubectl get pods --all-namespaces
.
And once the CoreDNS Pod is up and running, you can continue by joining your nodes.
If your network is not working or CoreDNS is not in the Running
state, check out the
troubleshooting guide
for kubeadm
.
Managed node labels
By default, kubeadm enables the NodeRestriction
admission controller that restricts what labels can be self-applied by kubelets on node registration.
The admission controller documentation covers what labels are permitted to be used with the kubelet --node-labels
option.
The node-role.kubernetes.io/control-plane label is such a restricted label and kubeadm manually applies it using a privileged client after a node has been created. To do that manually, you can use kubectl label and ensure it is using a privileged kubeconfig such as the kubeadm managed /etc/kubernetes/admin.conf.
Control plane node isolation
By default, your cluster will not schedule Pods on the control plane nodes for security reasons. If you want to be able to schedule Pods on the control plane nodes, for example for a single machine Kubernetes cluster, run:
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
The output will look something like:
node "test-01" untainted
...
This will remove the node-role.kubernetes.io/control-plane:NoSchedule
taint
from any nodes that have it, including the control plane nodes, meaning that the
scheduler will then be able to schedule Pods everywhere.
Additionally, you can execute the following command to remove the
node.kubernetes.io/exclude-from-external-load-balancers
label
from the control plane node, which excludes it from the list of backend servers:
kubectl label nodes --all node.kubernetes.io/exclude-from-external-load-balancers-
Adding more control plane nodes
See Creating Highly Available Clusters with kubeadm for steps on creating a high availability kubeadm cluster by adding more control plane nodes.
Adding worker nodes
The worker nodes are where your workloads run.
The following pages show how to add Linux and Windows worker nodes to the cluster by using
the kubeadm join
command:
(Optional) Controlling your cluster from machines other than the control-plane node
In order to get a kubectl on some other computer (e.g. laptop) to talk to your cluster, you need to copy the administrator kubeconfig file from your control-plane node to your workstation like this:
scp root@<control-plane-host>:/etc/kubernetes/admin.conf .
kubectl --kubeconfig ./admin.conf get nodes
Note:
The example above assumes SSH access is enabled for root. If that is not the
case, you can copy the admin.conf
file to be accessible by some other user
and scp
using that other user instead.
The admin.conf
file gives the user superuser privileges over the cluster.
This file should be used sparingly. For normal users, it's recommended to
generate a unique credential to which you grant privileges. You can do
this with the kubeadm kubeconfig user --client-name <CN>
command. That command will print out a KubeConfig file to STDOUT which you
should save to a file and distribute to your user. After that, grant
privileges by using kubectl create (cluster)rolebinding
.
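As a hedged sketch (the user name alice and the read-only view role binding are assumptions for illustration):
# On the control-plane node: generate a kubeconfig for the new user.
kubeadm kubeconfig user --client-name alice > alice.conf
# Then grant that user limited privileges, for example read-only access.
kubectl create clusterrolebinding alice-view --clusterrole=view --user=alice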
(Optional) Proxying API Server to localhost
If you want to connect to the API Server from outside the cluster, you can use
kubectl proxy
:
scp root@<control-plane-host>:/etc/kubernetes/admin.conf .
kubectl --kubeconfig ./admin.conf proxy
You can now access the API Server locally at http://localhost:8001/api/v1
Clean up
If you used disposable servers for your cluster, for testing, you can
switch those off and do no further clean up. You can use
kubectl config delete-cluster
to delete your local references to the
cluster.
However, if you want to deprovision your cluster more cleanly, you should first drain the node and make sure that the node is empty, then deconfigure the node.
Remove the node
Talking to the control-plane node with the appropriate credentials, run:
kubectl drain <node name> --delete-emptydir-data --force --ignore-daemonsets
Before removing the node, reset the state installed by kubeadm
:
kubeadm reset
The reset process does not reset or clean up iptables rules or IPVS tables. If you wish to reset iptables, you must do so manually:
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
If you want to reset the IPVS tables, you must run the following command:
ipvsadm -C
Now remove the node:
kubectl delete node <node name>
If you wish to start over, run kubeadm init
or kubeadm join
with the
appropriate arguments.
Clean up the control plane
You can use kubeadm reset
on the control plane host to trigger a best-effort
clean up.
See the kubeadm reset
reference documentation for more information about this subcommand and its
options.
Version skew policy
While kubeadm allows version skew against some components that it manages, it is recommended that you match the kubeadm version with the versions of the control plane components, kube-proxy and kubelet.
kubeadm's skew against the Kubernetes version
kubeadm can be used with Kubernetes components that are the same version as kubeadm
or one version older. The Kubernetes version can be specified to kubeadm by using the
--kubernetes-version
flag of kubeadm init
or the
ClusterConfiguration.kubernetesVersion
field when using --config
. This option will control the versions
of kube-apiserver, kube-controller-manager, kube-scheduler and kube-proxy.
Example:
- kubeadm is at 1.32
- kubernetesVersion must be at 1.32 or 1.31
kubeadm's skew against the kubelet
Similarly to the Kubernetes version, kubeadm can be used with a kubelet version that is the same version as kubeadm or three versions older.
Example:
- kubeadm is at 1.32
- kubelet on the host must be at 1.32, 1.31, 1.30 or 1.29
kubeadm's skew against kubeadm
There are certain limitations on how kubeadm commands can operate on existing nodes or whole clusters managed by kubeadm.
If new nodes are joined to the cluster, the kubeadm binary used for kubeadm join
must match
the last version of kubeadm used to either create the cluster with kubeadm init
or to upgrade
the same node with kubeadm upgrade
. Similar rules apply to the rest of the kubeadm commands
with the exception of kubeadm upgrade
.
Example for kubeadm join
:
- kubeadm version 1.32 was used to create a cluster with
kubeadm init
- Joining nodes must use a kubeadm binary that is at version 1.32
Nodes that are being upgraded must use a version of kubeadm that is the same MINOR version or one MINOR version newer than the version of kubeadm used for managing the node.
Example for kubeadm upgrade
:
- kubeadm version 1.31 was used to create or upgrade the node
- The version of kubeadm used for upgrading the node must be at 1.31 or 1.32
To learn more about the version skew between the different Kubernetes component see the Version Skew Policy.
Limitations
Cluster resilience
The cluster created here has a single control-plane node, with a single etcd database running on it. This means that if the control-plane node fails, your cluster may lose data and may need to be recreated from scratch.
Workarounds:
- Regularly back up etcd. The etcd data directory configured by kubeadm is at /var/lib/etcd on the control-plane node (a minimal backup sketch follows this list).
- Use multiple control-plane nodes. You can read Options for Highly Available topology to pick a cluster topology that provides high-availability.
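As a minimal sketch of the first workaround (the endpoints and certificate paths assume a default stacked-etcd kubeadm layout, and the output path is illustrative), an etcd snapshot can be taken on the control-plane node with etcdctl:
# Take a snapshot of etcd using the certificates generated by kubeadm.
# /backup/etcd-snapshot.db is an illustrative output path.
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key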
Platform compatibility
kubeadm deb/rpm packages and binaries are built for amd64, arm (32-bit), arm64, ppc64le, and s390x following the multi-platform proposal.
Multiplatform container images for the control plane and addons are also supported since v1.12.
Only some of the network providers offer solutions for all platforms. Please consult the list of network providers above or the documentation from each provider to figure out whether the provider supports your chosen platform.
Troubleshooting
If you are running into difficulties with kubeadm, please consult our troubleshooting docs.
What's next
- Verify that your cluster is running properly with Sonobuoy
- See Upgrading kubeadm clusters for details about upgrading your cluster using kubeadm.
- Learn about advanced kubeadm usage in the kubeadm reference documentation
- Learn more about Kubernetes concepts and kubectl.
- See the Cluster Networking page for a bigger list of Pod network add-ons.
- See the list of add-ons to explore other add-ons, including tools for logging, monitoring, network policy, visualization & control of your Kubernetes cluster.
- Configure how your cluster handles logs for cluster events and from applications running in Pods. See Logging Architecture for an overview of what is involved.
Feedback
- For bugs, visit the kubeadm GitHub issue tracker
- For support, visit the #kubeadm Slack channel
- General SIG Cluster Lifecycle development Slack channel: #sig-cluster-lifecycle
- SIG Cluster Lifecycle SIG information
- SIG Cluster Lifecycle mailing list: kubernetes-sig-cluster-lifecycle
2.2.2.1.4 - Customizing components with the kubeadm API
This page covers how to customize the components that kubeadm deploys. For control plane components
you can use flags in the ClusterConfiguration
structure or patches per-node. For the kubelet
and kube-proxy you can use KubeletConfiguration
and KubeProxyConfiguration
, accordingly.
All of these options are possible via the kubeadm configuration API. For more details on each field in the configuration you can navigate to our API reference pages.
Note:
Customizing the CoreDNS deployment of kubeadm is currently not supported. You must manually patch the kube-system/coredns ConfigMap and recreate the CoreDNS Pods after that. Alternatively, you can skip the default CoreDNS deployment and deploy your own variant. For more details on that see Using init phases with kubeadm.
Note:
To reconfigure a cluster that has already been created see Reconfiguring a kubeadm cluster.
Customizing the control plane with flags in ClusterConfiguration
The kubeadm ClusterConfiguration
object exposes a way for users to override the default
flags passed to control plane components such as the APIServer, ControllerManager, Scheduler and Etcd.
The components are defined using the following structures:
apiServer
controllerManager
scheduler
etcd
These structures contain a common extraArgs field, which consists of name / value pairs.
To override a flag for a control plane component:
- Add the appropriate extraArgs to your configuration.
- Add flags to the extraArgs field.
- Run kubeadm init with --config <YOUR CONFIG YAML>.
Note:
You can generate a ClusterConfiguration object with default values by running kubeadm config print init-defaults and saving the output to a file of your choice.
Note:
The ClusterConfiguration object is currently global in kubeadm clusters. This means that any flags that you add will apply to all instances of the same component on different nodes. To apply individual configuration per component on different nodes you can use patches.
Note:
Duplicate flags (keys), or passing the same flag --foo multiple times, is currently not supported. To work around that you must use patches.
APIServer flags
For details, see the reference documentation for kube-apiserver.
Example usage:
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: v1.16.0
apiServer:
extraArgs:
- name: "enable-admission-plugins"
value: "AlwaysPullImages,DefaultStorageClass"
- name: "audit-log-path"
value: "/home/johndoe/audit.log"
ControllerManager flags
For details, see the reference documentation for kube-controller-manager.
Example usage:
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: v1.16.0
controllerManager:
extraArgs:
- name: "cluster-signing-key-file"
value: "/home/johndoe/keys/ca.key"
- name: "deployment-controller-sync-period"
value: "50"
Scheduler flags
For details, see the reference documentation for kube-scheduler.
Example usage:
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: v1.16.0
scheduler:
extraArgs:
- name: "config"
value: "/etc/kubernetes/scheduler-config.yaml"
extraVolumes:
- name: schedulerconfig
hostPath: /home/johndoe/schedconfig.yaml
mountPath: /etc/kubernetes/scheduler-config.yaml
readOnly: true
pathType: "File"
Etcd flags
For details, see the etcd server documentation.
Example usage:
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
etcd:
local:
extraArgs:
- name: "election-timeout"
value: "1000"
Customizing with patches
Kubernetes v1.22 [beta]
Kubeadm allows you to pass a directory with patch files to InitConfiguration
and JoinConfiguration
on individual nodes. These patches can be used as the last customization step before component configuration
is written to disk.
You can pass this file to kubeadm init
with --config <YOUR CONFIG YAML>
:
apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
patches:
directory: /home/user/somedir
Note:
For kubeadm init you can pass a file containing both a ClusterConfiguration and InitConfiguration separated by ---.
You can pass this file to kubeadm join with --config <YOUR CONFIG YAML>:
apiVersion: kubeadm.k8s.io/v1beta4
kind: JoinConfiguration
patches:
directory: /home/user/somedir
The directory must contain files named target[suffix][+patchtype].extension.
For example, kube-apiserver0+merge.yaml or just etcd.json.
- target can be one of kube-apiserver, kube-controller-manager, kube-scheduler, etcd and kubeletconfiguration.
- suffix is an optional string that can be used to determine which patches are applied first alpha-numerically.
- patchtype can be one of strategic, merge or json and these must match the patching formats supported by kubectl. The default patchtype is strategic.
- extension must be either json or yaml.
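For illustration, a hypothetical strategic patch file named kube-apiserver0+strategic.yaml in that directory might adjust the resource requests of the generated kube-apiserver static Pod; the file name and values below are examples only, and a patch needs to contain only the fields you want to change:
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    resources:
      requests:
        # Illustrative values; choose requests that match your hardware.
        cpu: "300m"
        memory: "512Mi"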
Note:
If you are using kubeadm upgrade to upgrade your kubeadm nodes you must again provide the same patches, so that the customization is preserved after upgrade. To do that you can use the --patches flag, which must point to the same directory. kubeadm upgrade currently does not support a configuration API structure that can be used for the same purpose.
Customizing the kubelet
To customize the kubelet you can add a KubeletConfiguration
next to the ClusterConfiguration
or InitConfiguration
separated by ---
within the same configuration file.
This file can then be passed to kubeadm init
and kubeadm will apply the same base KubeletConfiguration
to all nodes in the cluster.
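For example, a single file passed to kubeadm init --config might look like the following minimal sketch; the kubernetesVersion and kubelet settings shown are illustrative:
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: v1.32.0
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# This base kubelet configuration is applied to every node in the cluster.
cgroupDriver: systemd
serializeImagePulls: false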
For applying instance-specific configuration over the base KubeletConfiguration
you can use the
kubeletconfiguration
patch target.
Alternatively, you can use kubelet flags as overrides by passing them in the
nodeRegistration.kubeletExtraArgs
field supported by both InitConfiguration
and JoinConfiguration
.
Some kubelet flags are deprecated, so check their status in the
kubelet reference documentation before using them.
For additional details see Configuring each kubelet in your cluster using kubeadm
Customizing kube-proxy
To customize kube-proxy you can pass a KubeProxyConfiguration next to your ClusterConfiguration or InitConfiguration to kubeadm init, separated by ---.
For more details you can navigate to our API reference pages.
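As a minimal sketch (choosing IPVS mode here is only an illustration), such a combined file might look like:
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# Example override: run kube-proxy in IPVS mode instead of the default iptables mode.
mode: ipvs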
Note:
kubeadm deploys kube-proxy as a DaemonSet, which means that the KubeProxyConfiguration would apply to all instances of kube-proxy in the cluster.
2.2.2.1.5 - Options for Highly Available Topology
This page explains the two options for configuring the topology of your highly available (HA) Kubernetes clusters.
You can set up an HA cluster:
- With stacked control plane nodes, where etcd nodes are colocated with control plane nodes
- With external etcd nodes, where etcd runs on separate nodes from the control plane
You should carefully consider the advantages and disadvantages of each topology before setting up an HA cluster.
Note:
kubeadm bootstraps the etcd cluster statically. Read the etcd Clustering Guide for more details.
Stacked etcd topology
A stacked HA cluster is a topology where the distributed data storage cluster provided by etcd is stacked on top of the cluster formed by the nodes managed by kubeadm that run control plane components.
Each control plane node runs an instance of the kube-apiserver
, kube-scheduler
, and kube-controller-manager
.
The kube-apiserver
is exposed to worker nodes using a load balancer.
Each control plane node creates a local etcd member and this etcd member communicates only with
the kube-apiserver
of this node. The same applies to the local kube-controller-manager
and kube-scheduler
instances.
This topology couples the control planes and etcd members on the same nodes. It is simpler to set up than a cluster with external etcd nodes, and simpler to manage for replication.
However, a stacked cluster runs the risk of failed coupling. If one node goes down, both an etcd member and a control plane instance are lost, and redundancy is compromised. You can mitigate this risk by adding more control plane nodes.
You should therefore run a minimum of three stacked control plane nodes for an HA cluster.
This is the default topology in kubeadm. A local etcd member is created automatically
on control plane nodes when using kubeadm init
and kubeadm join --control-plane
.
External etcd topology
An HA cluster with external etcd is a topology where the distributed data storage cluster provided by etcd is external to the cluster formed by the nodes that run control plane components.
Like the stacked etcd topology, each control plane node in an external etcd topology runs
an instance of the kube-apiserver
, kube-scheduler
, and kube-controller-manager
.
And the kube-apiserver
is exposed to worker nodes using a load balancer. However,
etcd members run on separate hosts, and each etcd host communicates with the
kube-apiserver
of each control plane node.
This topology decouples the control plane and etcd member. It therefore provides an HA setup where losing a control plane instance or an etcd member has less impact and does not affect the cluster redundancy as much as the stacked HA topology.
However, this topology requires twice the number of hosts as the stacked HA topology. A minimum of three hosts for control plane nodes and three hosts for etcd nodes are required for an HA cluster with this topology.
What's next
2.2.2.1.6 - Creating Highly Available Clusters with kubeadm
This page explains two different approaches to setting up a highly available Kubernetes cluster using kubeadm:
- With stacked control plane nodes. This approach requires less infrastructure. The etcd members and control plane nodes are co-located.
- With an external etcd cluster. This approach requires more infrastructure. The control plane nodes and etcd members are separated.
Before proceeding, you should carefully consider which approach best meets the needs of your applications and environment. Options for Highly Available topology outlines the advantages and disadvantages of each.
If you encounter issues with setting up the HA cluster, please report these in the kubeadm issue tracker.
See also the upgrade documentation.
Caution:
This page does not address running your cluster on a cloud provider. In a cloud environment, neither approach documented here works with Service objects of type LoadBalancer, or with dynamic PersistentVolumes.
Before you begin
The prerequisites depend on which topology you have selected for your cluster's control plane:
You need:
- Three or more machines that meet kubeadm's minimum requirements for
the control-plane nodes. Having an odd number of control plane nodes can help
with leader selection in the case of machine or zone failure.
- including a container runtime, already set up and working
- Three or more machines that meet kubeadm's minimum
requirements for the workers
- including a container runtime, already set up and working
- Full network connectivity between all machines in the cluster (public or private network)
- Superuser privileges on all machines using sudo
  - You can use a different tool; this guide uses sudo in the examples.
- SSH access from one device to all nodes in the system
- kubeadm and kubelet already installed on all machines.
See Stacked etcd topology for context.
You need:
- Three or more machines that meet kubeadm's minimum requirements for
the control-plane nodes. Having an odd number of control plane nodes can help
with leader selection in the case of machine or zone failure.
- including a container runtime, already set up and working
- Three or more machines that meet kubeadm's minimum
requirements for the workers
- including a container runtime, already set up and working
- Full network connectivity between all machines in the cluster (public or private network)
- Superuser privileges on all machines using sudo
  - You can use a different tool; this guide uses sudo in the examples.
- SSH access from one device to all nodes in the system
- kubeadm and kubelet already installed on all machines.
And you also need:
- Three or more additional machines, that will become etcd cluster members.
Having an odd number of members in the etcd cluster is a requirement for achieving
optimal voting quorum.
  - These machines again need to have kubeadm and kubelet installed.
  - These machines also require a container runtime, that is already set up and working.
See External etcd topology for context.
Container images
Each host should have access to read and fetch images from the Kubernetes container image registry,
registry.k8s.io
. If you want to deploy a highly-available cluster where the hosts do not have
access to pull images, this is possible. You must ensure by some other means that the correct
container images are already available on the relevant hosts.
Command line interface
To manage Kubernetes once your cluster is set up, you should
install kubectl on your PC. It is also useful
to install the kubectl
tool on each control plane node, as this can be
helpful for troubleshooting.
First steps for both methods
Create load balancer for kube-apiserver
Note:
There are many configurations for load balancers. The following example is only one option. Your cluster requirements may need a different configuration.
Create a kube-apiserver load balancer with a name that resolves to DNS.
In a cloud environment you should place your control plane nodes behind a TCP forwarding load balancer. This load balancer distributes traffic to all healthy control plane nodes in its target list. The health check for an apiserver is a TCP check on the port the kube-apiserver listens on (default value :6443). It is not recommended to use an IP address directly in a cloud environment.
The load balancer must be able to communicate with all control plane nodes on the apiserver port. It must also allow incoming traffic on its listening port.
Make sure the address of the load balancer always matches the address of kubeadm's ControlPlaneEndpoint.
Read the Options for Software Load Balancing guide for more details.
Add the first control plane node to the load balancer, and test the connection:
nc -v <LOAD_BALANCER_IP> <PORT>
A connection refused error is expected because the API server is not yet running. A timeout, however, means the load balancer cannot communicate with the control plane node. If a timeout occurs, reconfigure the load balancer to communicate with the control plane node.
Add the remaining control plane nodes to the load balancer target group.
Stacked control plane and etcd nodes
Steps for the first control plane node
Initialize the control plane:
sudo kubeadm init --control-plane-endpoint "LOAD_BALANCER_DNS:LOAD_BALANCER_PORT" --upload-certs
- You can use the --kubernetes-version flag to set the Kubernetes version to use. It is recommended that the versions of kubeadm, kubelet, kubectl and Kubernetes match.
- The --control-plane-endpoint flag should be set to the address or DNS and port of the load balancer.
- The --upload-certs flag is used to upload the certificates that should be shared across all the control-plane instances to the cluster. If instead, you prefer to copy certs across control-plane nodes manually or using automation tools, please remove this flag and refer to the Manual certificate distribution section below.
Note:
The kubeadm init flags --config and --certificate-key cannot be mixed, therefore if you want to use the kubeadm configuration you must add the certificateKey field in the appropriate config locations (under InitConfiguration and JoinConfiguration: controlPlane).
Note:
Some CNI network plugins require additional configuration, for example specifying the pod IP CIDR, while others do not. See the CNI network documentation. To add a pod CIDR pass the flag --pod-network-cidr, or if you are using a kubeadm configuration file set the podSubnet field under the networking object of ClusterConfiguration.
The output looks similar to:
...
You can now join any number of control-plane node by running the following command on each as a root:

  kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866 --control-plane --certificate-key f8902e114ef118304e561c3ecd4d0b543adc226b7a07f675f56564185ffe0c07

Please note that the certificate-key gives access to cluster sensitive data, keep it secret!
As a safeguard, uploaded-certs will be deleted in two hours; If necessary, you can use kubeadm init phase upload-certs to reload certs afterward.

Then you can join any number of worker nodes by running the following on each as root:

  kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866
Copy this output to a text file. You will need it later to join control plane and worker nodes to the cluster.
When --upload-certs is used with kubeadm init, the certificates of the primary control plane are encrypted and uploaded in the kubeadm-certs Secret.
To re-upload the certificates and generate a new decryption key, use the following command on a control plane node that is already joined to the cluster:
sudo kubeadm init phase upload-certs --upload-certs
You can also specify a custom --certificate-key during init that can later be used by join. To generate such a key you can use the following command:
kubeadm certs certificate-key
The certificate key is a hex encoded string that is an AES key of size 32 bytes.
Note:
The kubeadm-certs Secret and the decryption key expire after two hours.
Caution:
As stated in the command output, the certificate key gives access to cluster sensitive data, keep it secret!
Apply the CNI plugin of your choice: Follow these instructions to install the CNI provider. Make sure the configuration corresponds to the Pod CIDR specified in the kubeadm configuration file (if applicable).
Note:
You must pick a network plugin that suits your use case and deploy it before you move on to the next step. If you don't do this, you will not be able to launch your cluster properly.
Type the following and watch the pods of the control plane components get started:
kubectl get pod -n kube-system -w
Steps for the rest of the control plane nodes
For each additional control plane node you should:
Execute the join command that was previously given to you by the kubeadm init output on the first node. It should look something like this:
sudo kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866 --control-plane --certificate-key f8902e114ef118304e561c3ecd4d0b543adc226b7a07f675f56564185ffe0c07
- The --control-plane flag tells kubeadm join to create a new control plane.
- The --certificate-key ... will cause the control plane certificates to be downloaded from the kubeadm-certs Secret in the cluster and be decrypted using the given key.
You can join multiple control-plane nodes in parallel.
External etcd nodes
Setting up a cluster with external etcd nodes is similar to the procedure used for stacked etcd with the exception that you should set up etcd first, and you should pass the etcd information in the kubeadm config file.
Set up the etcd cluster
Follow these instructions to set up the etcd cluster.
Set up SSH as described here.
Copy the following files from any etcd node in the cluster to the first control plane node:
export CONTROL_PLANE="ubuntu@10.0.0.7"
scp /etc/kubernetes/pki/etcd/ca.crt "${CONTROL_PLANE}":
scp /etc/kubernetes/pki/apiserver-etcd-client.crt "${CONTROL_PLANE}":
scp /etc/kubernetes/pki/apiserver-etcd-client.key "${CONTROL_PLANE}":
- Replace the value of CONTROL_PLANE with the user@host of the first control-plane node.
Set up the first control plane node
Create a file called kubeadm-config.yaml with the following contents:
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: stable
controlPlaneEndpoint: "LOAD_BALANCER_DNS:LOAD_BALANCER_PORT" # change this (see below)
etcd:
  external:
    endpoints:
      - https://ETCD_0_IP:2379 # change ETCD_0_IP appropriately
      - https://ETCD_1_IP:2379 # change ETCD_1_IP appropriately
      - https://ETCD_2_IP:2379 # change ETCD_2_IP appropriately
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
Note:
The difference between stacked etcd and external etcd here is that the external etcd setup requires a configuration file with the etcd endpoints under the external object for etcd. In the case of the stacked etcd topology, this is managed automatically.
Replace the following variables in the config template with the appropriate values for your cluster:
LOAD_BALANCER_DNS
LOAD_BALANCER_PORT
ETCD_0_IP
ETCD_1_IP
ETCD_2_IP
The following steps are similar to the stacked etcd setup:
Run sudo kubeadm init --config kubeadm-config.yaml --upload-certs on this node.
Write the output join commands that are returned to a text file for later use.
Apply the CNI plugin of your choice.
Note:
You must pick a network plugin that suits your use case and deploy it before you move on to the next step. If you don't do this, you will not be able to launch your cluster properly.
Steps for the rest of the control plane nodes
The steps are the same as for the stacked etcd setup:
- Make sure the first control plane node is fully initialized.
- Join each control plane node with the join command you saved to a text file. It's recommended to join the control plane nodes one at a time.
- Don't forget that the decryption key from
--certificate-key
expires after two hours, by default.
Common tasks after bootstrapping control plane
Install workers
Worker nodes can be joined to the cluster with the command you stored previously
as the output from the kubeadm init
command:
sudo kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866
Manual certificate distribution
If you choose to not use kubeadm init
with the --upload-certs
flag this means that
you are going to have to manually copy the certificates from the primary control plane node to the
joining control plane nodes.
There are many ways to do this. The following example uses ssh
and scp
:
SSH is required if you want to control all nodes from a single machine.
Enable ssh-agent on your main device that has access to all other nodes in the system:
eval $(ssh-agent)
Add your SSH identity to the session:
ssh-add ~/.ssh/path_to_private_key
SSH between nodes to check that the connection is working correctly.
When you SSH to any node, add the -A flag. This flag allows the node that you have logged into via SSH to access the SSH agent on your PC. Consider alternative methods if you do not fully trust the security of your user session on the node.
ssh -A 10.0.0.7
When using sudo on any node, make sure to preserve the environment so SSH forwarding works:
sudo -E -s
After configuring SSH on all the nodes you should run the following script on the first control plane node after running kubeadm init. This script will copy the certificates from the first control plane node to the other control plane nodes:
In the following example, replace CONTROL_PLANE_IPS with the IP addresses of the other control plane nodes.
USER=ubuntu # customizable
CONTROL_PLANE_IPS="10.0.0.7 10.0.0.8"
for host in ${CONTROL_PLANE_IPS}; do
    scp /etc/kubernetes/pki/ca.crt "${USER}"@$host:
    scp /etc/kubernetes/pki/ca.key "${USER}"@$host:
    scp /etc/kubernetes/pki/sa.key "${USER}"@$host:
    scp /etc/kubernetes/pki/sa.pub "${USER}"@$host:
    scp /etc/kubernetes/pki/front-proxy-ca.crt "${USER}"@$host:
    scp /etc/kubernetes/pki/front-proxy-ca.key "${USER}"@$host:
    scp /etc/kubernetes/pki/etcd/ca.crt "${USER}"@$host:etcd-ca.crt
    # Skip the next line if you are using external etcd
    scp /etc/kubernetes/pki/etcd/ca.key "${USER}"@$host:etcd-ca.key
done
Caution:
Copy only the certificates in the above list. kubeadm will take care of generating the rest of the certificates with the required SANs for the joining control-plane instances. If you copy all the certificates by mistake, the creation of additional nodes could fail due to a lack of required SANs.
Then on each joining control plane node you have to run the following script before running kubeadm join. This script will move the previously copied certificates from the home directory to /etc/kubernetes/pki:
USER=ubuntu # customizable
mkdir -p /etc/kubernetes/pki/etcd
mv /home/${USER}/ca.crt /etc/kubernetes/pki/
mv /home/${USER}/ca.key /etc/kubernetes/pki/
mv /home/${USER}/sa.pub /etc/kubernetes/pki/
mv /home/${USER}/sa.key /etc/kubernetes/pki/
mv /home/${USER}/front-proxy-ca.crt /etc/kubernetes/pki/
mv /home/${USER}/front-proxy-ca.key /etc/kubernetes/pki/
mv /home/${USER}/etcd-ca.crt /etc/kubernetes/pki/etcd/ca.crt
# Skip the next line if you are using external etcd
mv /home/${USER}/etcd-ca.key /etc/kubernetes/pki/etcd/ca.key
2.2.2.1.7 - Set up a High Availability etcd Cluster with kubeadm
By default, kubeadm runs a local etcd instance on each control plane node. It is also possible to treat the etcd cluster as external and provision etcd instances on separate hosts. The differences between the two approaches are covered in the Options for Highly Available topology page.
This task walks through the process of creating a high availability external etcd cluster of three members that can be used by kubeadm during cluster creation.
Before you begin
- Three hosts that can talk to each other over TCP ports 2379 and 2380. This document assumes these default ports. However, they are configurable through the kubeadm config file.
- Each host must have systemd and a bash compatible shell installed.
- Each host must have a container runtime, kubelet, and kubeadm installed.
- Each host should have access to the Kubernetes container image registry (registry.k8s.io) or list/pull the required etcd image using kubeadm config images list/pull. This guide will set up etcd instances as static pods managed by a kubelet.
- Some infrastructure to copy files between hosts. For example ssh and scp can satisfy this requirement.
Setting up the cluster
The general approach is to generate all certs on one node and only distribute the necessary files to the other nodes.
Note:
kubeadm contains all the necessary cryptographic machinery to generate the certificates described below; no other cryptographic tooling is required for this example.
Note:
The examples below use IPv4 addresses but you can also configure kubeadm, the kubelet and etcd to use IPv6 addresses. Dual-stack is supported by some Kubernetes options, but not by etcd. For more details on Kubernetes dual-stack support see Dual-stack support with kubeadm.
Configure the kubelet to be a service manager for etcd.
Since etcd was created first, you must override the service priority by creating a new unit file that has higher precedence than the kubeadm-provided kubelet unit file.
Note:
You must do this on every host where etcd should be running.
cat << EOF > /etc/systemd/system/kubelet.service.d/kubelet.conf
# Replace "systemd" with the cgroup driver of your container runtime. The default value in the kubelet is "cgroupfs".
# Replace the value of "containerRuntimeEndpoint" for a different container runtime if needed.
#
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: false
authorization:
  mode: AlwaysAllow
cgroupDriver: systemd
address: 127.0.0.1
containerRuntimeEndpoint: unix:///var/run/containerd/containerd.sock
staticPodPath: /etc/kubernetes/manifests
EOF

cat << EOF > /etc/systemd/system/kubelet.service.d/20-etcd-service-manager.conf
[Service]
ExecStart=
ExecStart=/usr/bin/kubelet --config=/etc/systemd/system/kubelet.service.d/kubelet.conf
Restart=always
EOF

systemctl daemon-reload
systemctl restart kubelet
Check the kubelet status to ensure it is running.
systemctl status kubelet
Create configuration files for kubeadm.
Generate one kubeadm configuration file for each host that will have an etcd member running on it using the following script.
# Update HOST0, HOST1 and HOST2 with the IPs of your hosts
export HOST0=10.0.0.6
export HOST1=10.0.0.7
export HOST2=10.0.0.8

# Update NAME0, NAME1 and NAME2 with the hostnames of your hosts
export NAME0="infra0"
export NAME1="infra1"
export NAME2="infra2"

# Create temp directories to store files that will end up on other hosts
mkdir -p /tmp/${HOST0}/ /tmp/${HOST1}/ /tmp/${HOST2}/

HOSTS=(${HOST0} ${HOST1} ${HOST2})
NAMES=(${NAME0} ${NAME1} ${NAME2})

for i in "${!HOSTS[@]}"; do
HOST=${HOSTS[$i]}
NAME=${NAMES[$i]}
cat << EOF > /tmp/${HOST}/kubeadmcfg.yaml
---
apiVersion: "kubeadm.k8s.io/v1beta4"
kind: InitConfiguration
nodeRegistration:
  name: ${NAME}
localAPIEndpoint:
  advertiseAddress: ${HOST}
---
apiVersion: "kubeadm.k8s.io/v1beta4"
kind: ClusterConfiguration
etcd:
  local:
    serverCertSANs:
    - "${HOST}"
    peerCertSANs:
    - "${HOST}"
    extraArgs:
    - name: initial-cluster
      value: ${NAMES[0]}=https://${HOSTS[0]}:2380,${NAMES[1]}=https://${HOSTS[1]}:2380,${NAMES[2]}=https://${HOSTS[2]}:2380
    - name: initial-cluster-state
      value: new
    - name: name
      value: ${NAME}
    - name: listen-peer-urls
      value: https://${HOST}:2380
    - name: listen-client-urls
      value: https://${HOST}:2379
    - name: advertise-client-urls
      value: https://${HOST}:2379
    - name: initial-advertise-peer-urls
      value: https://${HOST}:2380
EOF
done
Generate the certificate authority.
If you already have a CA then the only action required is copying the CA's crt and key file to /etc/kubernetes/pki/etcd/ca.crt and /etc/kubernetes/pki/etcd/ca.key. After those files have been copied, proceed to the next step, "Create certificates for each member".
If you do not already have a CA then run this command on $HOST0 (where you generated the configuration files for kubeadm).
kubeadm init phase certs etcd-ca
This creates two files:
- /etc/kubernetes/pki/etcd/ca.crt
- /etc/kubernetes/pki/etcd/ca.key
Create certificates for each member.
kubeadm init phase certs etcd-server --config=/tmp/${HOST2}/kubeadmcfg.yaml
kubeadm init phase certs etcd-peer --config=/tmp/${HOST2}/kubeadmcfg.yaml
kubeadm init phase certs etcd-healthcheck-client --config=/tmp/${HOST2}/kubeadmcfg.yaml
kubeadm init phase certs apiserver-etcd-client --config=/tmp/${HOST2}/kubeadmcfg.yaml
cp -R /etc/kubernetes/pki /tmp/${HOST2}/
# cleanup non-reusable certificates
find /etc/kubernetes/pki -not -name ca.crt -not -name ca.key -type f -delete

kubeadm init phase certs etcd-server --config=/tmp/${HOST1}/kubeadmcfg.yaml
kubeadm init phase certs etcd-peer --config=/tmp/${HOST1}/kubeadmcfg.yaml
kubeadm init phase certs etcd-healthcheck-client --config=/tmp/${HOST1}/kubeadmcfg.yaml
kubeadm init phase certs apiserver-etcd-client --config=/tmp/${HOST1}/kubeadmcfg.yaml
cp -R /etc/kubernetes/pki /tmp/${HOST1}/
find /etc/kubernetes/pki -not -name ca.crt -not -name ca.key -type f -delete

kubeadm init phase certs etcd-server --config=/tmp/${HOST0}/kubeadmcfg.yaml
kubeadm init phase certs etcd-peer --config=/tmp/${HOST0}/kubeadmcfg.yaml
kubeadm init phase certs etcd-healthcheck-client --config=/tmp/${HOST0}/kubeadmcfg.yaml
kubeadm init phase certs apiserver-etcd-client --config=/tmp/${HOST0}/kubeadmcfg.yaml
# No need to move the certs because they are for HOST0

# clean up certs that should not be copied off this host
find /tmp/${HOST2} -name ca.key -type f -delete
find /tmp/${HOST1} -name ca.key -type f -delete
Copy certificates and kubeadm configs.
The certificates have been generated and now they must be moved to their respective hosts.
USER=ubuntu
HOST=${HOST1}
scp -r /tmp/${HOST}/* ${USER}@${HOST}:
ssh ${USER}@${HOST}
USER@HOST $ sudo -Es
root@HOST $ chown -R root:root pki
root@HOST $ mv pki /etc/kubernetes/
Ensure all expected files exist.
The complete list of required files on $HOST0 is:
/tmp/${HOST0}
└── kubeadmcfg.yaml
---
/etc/kubernetes/pki
├── apiserver-etcd-client.crt
├── apiserver-etcd-client.key
└── etcd
    ├── ca.crt
    ├── ca.key
    ├── healthcheck-client.crt
    ├── healthcheck-client.key
    ├── peer.crt
    ├── peer.key
    ├── server.crt
    └── server.key
On $HOST1:
$HOME
└── kubeadmcfg.yaml
---
/etc/kubernetes/pki
├── apiserver-etcd-client.crt
├── apiserver-etcd-client.key
└── etcd
    ├── ca.crt
    ├── healthcheck-client.crt
    ├── healthcheck-client.key
    ├── peer.crt
    ├── peer.key
    ├── server.crt
    └── server.key
On $HOST2:
$HOME
└── kubeadmcfg.yaml
---
/etc/kubernetes/pki
├── apiserver-etcd-client.crt
├── apiserver-etcd-client.key
└── etcd
    ├── ca.crt
    ├── healthcheck-client.crt
    ├── healthcheck-client.key
    ├── peer.crt
    ├── peer.key
    ├── server.crt
    └── server.key
Create the static pod manifests.
Now that the certificates and configs are in place it's time to create the manifests. On each host run the kubeadm command to generate a static manifest for etcd.
root@HOST0 $ kubeadm init phase etcd local --config=/tmp/${HOST0}/kubeadmcfg.yaml
root@HOST1 $ kubeadm init phase etcd local --config=$HOME/kubeadmcfg.yaml
root@HOST2 $ kubeadm init phase etcd local --config=$HOME/kubeadmcfg.yaml
Optional: Check the cluster health.
If etcdctl isn't available, you can run this tool inside a container image. You would do that directly with your container runtime using a tool such as crictl run and not through Kubernetes.
ETCDCTL_API=3 etcdctl \
--cert /etc/kubernetes/pki/etcd/peer.crt \
--key /etc/kubernetes/pki/etcd/peer.key \
--cacert /etc/kubernetes/pki/etcd/ca.crt \
--endpoints https://${HOST0}:2379 endpoint health
...
https://[HOST0 IP]:2379 is healthy: successfully committed proposal: took = 16.283339ms
https://[HOST1 IP]:2379 is healthy: successfully committed proposal: took = 19.44402ms
https://[HOST2 IP]:2379 is healthy: successfully committed proposal: took = 35.926451ms
- Set ${HOST0} to the IP address of the host you are testing.
What's next
Once you have an etcd cluster with 3 working members, you can continue setting up a highly available control plane using the external etcd method with kubeadm.
2.2.2.1.8 - Configuring each kubelet in your cluster using kubeadm
Kubernetes v1.11 [stable]
The lifecycle of the kubeadm CLI tool is decoupled from the kubelet, which is a daemon that runs on each node within the Kubernetes cluster. The kubeadm CLI tool is executed by the user when Kubernetes is initialized or upgraded, whereas the kubelet is always running in the background.
Since the kubelet is a daemon, it needs to be maintained by some kind of an init system or service manager. When the kubelet is installed using DEBs or RPMs, systemd is configured to manage the kubelet. You can use a different service manager instead, but you need to configure it manually.
Some kubelet configuration details need to be the same across all kubelets involved in the cluster, while
other configuration aspects need to be set on a per-kubelet basis to accommodate the different
characteristics of a given machine (such as OS, storage, and networking). You can manage the configuration
of your kubelets manually, but kubeadm now provides a KubeletConfiguration
API type for
managing your kubelet configurations centrally.
Kubelet configuration patterns
The following sections describe patterns to kubelet configuration that are simplified by using kubeadm, rather than managing the kubelet configuration for each Node manually.
Propagating cluster-level configuration to each kubelet
You can provide the kubelet with default values to be used by kubeadm init
and kubeadm join
commands. Interesting examples include using a different container runtime or setting the default subnet
used by services.
If you want your services to use the subnet 10.96.0.0/12
as the default for services, you can pass
the --service-cidr
parameter to kubeadm:
kubeadm init --service-cidr 10.96.0.0/12
Virtual IPs for services are now allocated from this subnet. You also need to set the DNS address used
by the kubelet, using the --cluster-dns
flag. This setting needs to be the same for every kubelet
on every manager and Node in the cluster. The kubelet provides a versioned, structured API object
that can configure most parameters in the kubelet and push out this configuration to each running
kubelet in the cluster. This object is called
KubeletConfiguration
.
The KubeletConfiguration
allows the user to specify flags such as the cluster DNS IP addresses expressed as
a list of values to a camelCased key, illustrated by the following example:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
- 10.96.0.10
For more details on the KubeletConfiguration
have a look at this section.
Providing instance-specific configuration details
Some hosts require specific kubelet configurations due to differences in hardware, operating system, networking, or other host-specific parameters. The following list provides a few examples.
- The path to the DNS resolution file, as specified by the --resolv-conf kubelet configuration flag, may differ among operating systems, or depending on whether you are using systemd-resolved. If this path is wrong, DNS resolution will fail on the Node whose kubelet is configured incorrectly.
- The Node API object .metadata.name is set to the machine's hostname by default, unless you are using a cloud provider. You can use the --hostname-override flag to override the default behavior if you need to specify a Node name different from the machine's hostname.
- Currently, the kubelet cannot automatically detect the cgroup driver used by the container runtime, but the value of --cgroup-driver must match the cgroup driver used by the container runtime to ensure the health of the kubelet.
- To specify the container runtime you must set its endpoint with the --container-runtime-endpoint=<path> flag.
The recommended way of applying such instance-specific configuration is by using
KubeletConfiguration
patches.
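For instance, a hypothetical patch file kubeletconfiguration0+strategic.yaml in the patches directory could override only the resolvConf path on one host; the path shown assumes systemd-resolved and is illustrative:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Only the fields you want to override on this node need to be present.
resolvConf: /run/systemd/resolve/resolv.conf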
Configure kubelets using kubeadm
It is possible to configure the kubelet that kubeadm will start if a custom
KubeletConfiguration
API object is passed with a configuration file like so kubeadm ... --config some-config-file.yaml
.
By calling kubeadm config print init-defaults --component-configs KubeletConfiguration
you can
see all the default values for this structure.
It is also possible to apply instance-specific patches over the base KubeletConfiguration
.
Have a look at Customizing the kubelet
for more details.
Workflow when using kubeadm init
When you call kubeadm init
, the kubelet configuration is marshalled to disk
at /var/lib/kubelet/config.yaml
, and also uploaded to a kubelet-config
ConfigMap in the kube-system
namespace of the cluster. A kubelet configuration file is also written to /etc/kubernetes/kubelet.conf
with the baseline cluster-wide configuration for all kubelets in the cluster. This configuration file
points to the client certificates that allow the kubelet to communicate with the API server. This
addresses the need to
propagate cluster-level configuration to each kubelet.
To address the second pattern of
providing instance-specific configuration details,
kubeadm writes an environment file to /var/lib/kubelet/kubeadm-flags.env
, which contains a list of
flags to pass to the kubelet when it starts. The flags are presented in the file like this:
KUBELET_KUBEADM_ARGS="--flag1=value1 --flag2=value2 ..."
In addition to the flags used when starting the kubelet, the file also contains dynamic
parameters such as the cgroup driver and whether to use a different container runtime socket
(--cri-socket
).
After marshalling these two files to disk, kubeadm attempts to run the following two commands, if you are using systemd:
systemctl daemon-reload && systemctl restart kubelet
If the reload and restart are successful, the normal kubeadm init
workflow continues.
Workflow when using kubeadm join
When you run kubeadm join
, kubeadm uses the Bootstrap Token credential to perform
a TLS bootstrap, which fetches the credential needed to download the
kubelet-config
ConfigMap and writes it to /var/lib/kubelet/config.yaml
. The dynamic
environment file is generated in exactly the same way as kubeadm init
.
Next, kubeadm
runs the following two commands to load the new configuration into the kubelet:
systemctl daemon-reload && systemctl restart kubelet
After the kubelet loads the new configuration, kubeadm writes the
/etc/kubernetes/bootstrap-kubelet.conf
KubeConfig file, which contains a CA certificate and Bootstrap
Token. These are used by the kubelet to perform the TLS Bootstrap and obtain a unique
credential, which is stored in /etc/kubernetes/kubelet.conf
.
When the /etc/kubernetes/kubelet.conf
file is written, the kubelet has finished performing the TLS Bootstrap.
Kubeadm deletes the /etc/kubernetes/bootstrap-kubelet.conf
file after completing the TLS Bootstrap.
The kubelet drop-in file for systemd
kubeadm
ships with configuration for how systemd should run the kubelet.
Note that the kubeadm CLI command never touches this drop-in file.
This configuration file installed by the kubeadm
package is written to
/usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf
and is used by systemd.
It augments the basic
kubelet.service
.
If you want to override that further, you can make a directory /etc/systemd/system/kubelet.service.d/
(not /usr/lib/systemd/system/kubelet.service.d/
) and put your own customizations into a file there.
For example, you might add a new local file /etc/systemd/system/kubelet.service.d/local-overrides.conf
to override the unit settings configured by kubeadm
.
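A minimal sketch of such a drop-in (the directives shown are illustrative; for kubelet flags, prefer the kubeadm configuration's nodeRegistration.kubeletExtraArgs or the KUBELET_EXTRA_ARGS file mentioned below):
# /etc/systemd/system/kubelet.service.d/local-overrides.conf
[Service]
# Example overrides: enable resource accounting for the kubelet service.
CPUAccounting=true
MemoryAccounting=true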
Here is what you are likely to find in /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf
:
Note:
The contents below are just an example. If you don't want to use a package manager follow the guide outlined in the (Without a package manager) section.
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generate at runtime, populating
# the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably,
# the user should use the .NodeRegistration.KubeletExtraArgs object in the configuration files instead.
# KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
This file specifies the default locations for all of the files managed by kubeadm for the kubelet.
- The KubeConfig file to use for the TLS Bootstrap is
/etc/kubernetes/bootstrap-kubelet.conf
, but it is only used if/etc/kubernetes/kubelet.conf
does not exist. - The KubeConfig file with the unique kubelet identity is
/etc/kubernetes/kubelet.conf
. - The file containing the kubelet's ComponentConfig is
/var/lib/kubelet/config.yaml
. - The dynamic environment file that contains
KUBELET_KUBEADM_ARGS
is sourced from/var/lib/kubelet/kubeadm-flags.env
. - The file that can contain user-specified flag overrides with
KUBELET_EXTRA_ARGS
is sourced from/etc/default/kubelet
(for DEBs), or/etc/sysconfig/kubelet
(for RPMs).KUBELET_EXTRA_ARGS
is last in the flag chain and has the highest priority in the event of conflicting settings.
Kubernetes binaries and package contents
The DEB and RPM packages shipped with the Kubernetes releases are:
| Package name | Description |
|---|---|
| kubeadm | Installs the /usr/bin/kubeadm CLI tool and the kubelet drop-in file for the kubelet. |
| kubelet | Installs the /usr/bin/kubelet binary. |
| kubectl | Installs the /usr/bin/kubectl binary. |
| cri-tools | Installs the /usr/bin/crictl binary from the cri-tools git repository. |
| kubernetes-cni | Installs the /opt/cni/bin binaries from the plugins git repository. |
2.2.2.1.9 - Dual-stack support with kubeadm
Kubernetes v1.23 [stable]
Your Kubernetes cluster includes dual-stack networking, which means that cluster networking lets you use either address family. In a cluster, the control plane can assign both an IPv4 address and an IPv6 address to a single Pod or a Service.
Before you begin
You need to have installed the kubeadm tool, following the steps from Installing kubeadm.
For each server that you want to use as a node, make sure it allows IPv6 forwarding.
Enable IPv6 packet forwarding
To check if IPv6 packet forwarding is enabled:
sysctl net.ipv6.conf.all.forwarding
If the output is net.ipv6.conf.all.forwarding = 1
it is already enabled.
Otherwise it is not enabled yet.
To manually enable IPv6 packet forwarding:
# sysctl params required by setup, params persist across reboots
cat <<EOF | sudo tee -a /etc/sysctl.d/k8s.conf
net.ipv6.conf.all.forwarding = 1
EOF
# Apply sysctl params without reboot
sudo sysctl --system
You need to have an IPv4 and an IPv6 address range to use. Cluster operators typically
use private address ranges for IPv4. For IPv6, a cluster operator typically chooses a global
unicast address block from within 2000::/3
, using a range that is assigned to the operator.
You don't have to route the cluster's IP address ranges to the public internet.
The size of the IP address allocations should be suitable for the number of Pods and Services that you are planning to run.
Note:
If you are upgrading an existing cluster with the kubeadm upgrade command, kubeadm does not support making modifications to the pod IP address range (“cluster CIDR”) nor to the cluster's Service address range (“Service CIDR”).
Create a dual-stack cluster
To create a dual-stack cluster with kubeadm init
you can pass command line arguments
similar to the following example:
# These address ranges are examples
kubeadm init --pod-network-cidr=10.244.0.0/16,2001:db8:42:0::/56 --service-cidr=10.96.0.0/16,2001:db8:42:1::/112
To make things clearer, here is an example kubeadm
configuration file
kubeadm-config.yaml
for the primary dual-stack control plane node.
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
networking:
podSubnet: 10.244.0.0/16,2001:db8:42:0::/56
serviceSubnet: 10.96.0.0/16,2001:db8:42:1::/112
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
localAPIEndpoint:
advertiseAddress: "10.100.0.1"
bindPort: 6443
nodeRegistration:
kubeletExtraArgs:
- name: "node-ip"
value: "10.100.0.2,fd00:1:2:3::2"
advertiseAddress
in InitConfiguration specifies the IP address that the API Server
will advertise it is listening on. The value of advertiseAddress
equals the
--apiserver-advertise-address
flag of kubeadm init
.
Run kubeadm to initiate the dual-stack control plane node:
kubeadm init --config=kubeadm-config.yaml
The kube-controller-manager flags --node-cidr-mask-size-ipv4|--node-cidr-mask-size-ipv6
are set with default values. See configure IPv4/IPv6 dual stack.
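If those defaults do not fit your address plan, you can override them through ClusterConfiguration before initializing the cluster; a minimal sketch follows, where the mask sizes shown are examples and should be chosen to fit your Pod CIDRs:
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
controllerManager:
  extraArgs:
  # Per-node Pod CIDR mask sizes; example values only.
  - name: "node-cidr-mask-size-ipv4"
    value: "25"
  - name: "node-cidr-mask-size-ipv6"
    value: "64"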
Note:
The --apiserver-advertise-address flag does not support dual-stack.
Join a node to dual-stack cluster
Before joining a node, make sure that the node has IPv6 routable network interface and allows IPv6 forwarding.
Here is an example kubeadm configuration file
kubeadm-config.yaml
for joining a worker node to the cluster.
apiVersion: kubeadm.k8s.io/v1beta4
kind: JoinConfiguration
discovery:
bootstrapToken:
apiServerEndpoint: 10.100.0.1:6443
token: "clvldh.vjjwg16ucnhp94qr"
caCertHashes:
- "sha256:a4863cde706cfc580a439f842cc65d5ef112b7b2be31628513a9881cf0d9fe0e"
# change auth info above to match the actual token and CA certificate hash for your cluster
nodeRegistration:
kubeletExtraArgs:
- name: "node-ip"
value: "10.100.0.2,fd00:1:2:3::3"
Also, here is an example kubeadm configuration file
kubeadm-config.yaml
for joining another control plane node to the cluster.
apiVersion: kubeadm.k8s.io/v1beta4
kind: JoinConfiguration
controlPlane:
localAPIEndpoint:
advertiseAddress: "10.100.0.2"
bindPort: 6443
discovery:
bootstrapToken:
apiServerEndpoint: 10.100.0.1:6443
token: "clvldh.vjjwg16ucnhp94qr"
caCertHashes:
- "sha256:a4863cde706cfc580a439f842cc65d5ef112b7b2be31628513a9881cf0d9fe0e"
# change auth info above to match the actual token and CA certificate hash for your cluster
nodeRegistration:
kubeletExtraArgs:
- name: "node-ip"
value: "10.100.0.2,fd00:1:2:3::4"
advertiseAddress
in JoinConfiguration.controlPlane specifies the IP address that the
API Server will advertise it is listening on. The value of advertiseAddress
equals
the --apiserver-advertise-address
flag of kubeadm join
.
kubeadm join --config=kubeadm-config.yaml
Create a single-stack cluster
Note:
Dual-stack support doesn't mean that you need to use dual-stack addressing. You can deploy a single-stack cluster that has the dual-stack networking feature enabled.
To make things clearer, here is an example kubeadm
configuration file
kubeadm-config.yaml
for the single-stack control plane node.
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
networking:
podSubnet: 10.244.0.0/16
serviceSubnet: 10.96.0.0/16
What's next
- Validate IPv4/IPv6 dual-stack networking
- Read about Dual-stack cluster networking
- Learn more about the kubeadm configuration format
2.2.3 - Turnkey Cloud Solutions
This page provides a list of Kubernetes certified solution providers. From each provider page, you can learn how to install and setup production ready clusters.
2.3 - Best practices
2.3.1 - Considerations for large clusters
A cluster is a set of nodes (physical or virtual machines) running Kubernetes agents, managed by the control plane. Kubernetes v1.32 supports clusters with up to 5,000 nodes. More specifically, Kubernetes is designed to accommodate configurations that meet all of the following criteria:
- No more than 110 pods per node
- No more than 5,000 nodes
- No more than 150,000 total pods
- No more than 300,000 total containers
You can scale your cluster by adding or removing nodes. The way you do this depends on how your cluster is deployed.
Cloud provider resource quotas
To avoid running into cloud provider quota issues, when creating a cluster with many nodes, consider:
- Requesting a quota increase for cloud resources such as:
- Compute instances
- CPUs
- Storage volumes
- In-use IP addresses
- Packet filtering rule sets
- Number of load balancers
- Network subnets
- Log streams
- Gating the cluster scaling actions to bring up new nodes in batches, with a pause between batches, because some cloud providers rate limit the creation of new instances.
Control plane components
For a large cluster, you need a control plane with sufficient compute and other resources.
Typically you would run one or two control plane instances per failure zone, scaling those instances vertically first and then scaling horizontally after reaching the point of diminishing returns from (vertical) scaling.
You should run at least one instance per failure zone to provide fault-tolerance. Kubernetes nodes do not automatically steer traffic towards control-plane endpoints that are in the same failure zone; however, your cloud provider might have its own mechanisms to do this.
For example, using a managed load balancer, you configure the load balancer to send traffic that originates from the kubelet and Pods in failure zone A, and direct that traffic only to the control plane hosts that are also in zone A. If a single control-plane host or the endpoint in failure zone A goes offline, that means that all the control-plane traffic for nodes in zone A is now being sent between zones. Running multiple control plane hosts in each zone makes that outcome less likely.
etcd storage
To improve performance of large clusters, you can store Event objects in a separate dedicated etcd instance.
When creating a cluster, you can (using custom tooling):
- start and configure an additional etcd instance
- configure the API server to use it for storing events
See Operating etcd clusters for Kubernetes and Set up a High Availability etcd cluster with kubeadm for details on configuring and managing etcd for a large cluster.
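As an illustration of pointing the API server at a dedicated etcd instance for events (the endpoints shown are hypothetical), the kube-apiserver accepts an --etcd-servers-overrides flag of the form <group/resource>#<servers>:
# Illustrative kube-apiserver flags: keep most objects in the main etcd cluster,
# but store Event objects in a separate, dedicated etcd instance.
kube-apiserver \
  --etcd-servers=https://etcd-main.example.internal:2379 \
  --etcd-servers-overrides=/events#https://etcd-events.example.internal:2379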
Addon resources
Kubernetes resource limits help to minimize the impact of memory leaks and other ways that pods and containers can impact on other components. These resource limits apply to addon resources just as they apply to application workloads.
For example, you can set CPU and memory limits for a logging component:
...
containers:
- name: fluentd-cloud-logging
image: fluent/fluentd-kubernetes-daemonset:v1
resources:
limits:
cpu: 100m
memory: 200Mi
Addons' default limits are typically based on data collected from experience running each addon on small or medium Kubernetes clusters. When running on large clusters, addons often consume more of some resources than their default limits. If a large cluster is deployed without adjusting these values, the addon(s) may continuously get killed because they keep hitting the memory limit. Alternatively, the addon may run but with poor performance due to CPU time slice restrictions.
To avoid running into cluster addon resource issues, when creating a cluster with many nodes, consider the following:
- Some addons scale vertically - there is one replica of the addon for the cluster or serving a whole failure zone. For these addons, increase requests and limits as you scale out your cluster.
- Many addons scale horizontally - you add capacity by running more pods - but with a very large cluster you may also need to raise CPU or memory limits slightly. The Vertical Pod Autoscaler can run in recommender mode to provide suggested figures for requests and limits.
- Some addons run as one copy per node, controlled by a DaemonSet: for example, a node-level log aggregator. Similar to the case with horizontally-scaled addons, you may also need to raise CPU or memory limits slightly.
What's next
- VerticalPodAutoscaler is a custom resource that you can deploy into your cluster to help you manage resource requests and limits for pods. Learn more about Vertical Pod Autoscaler and how you can use it to scale cluster components, including cluster-critical addons.
- Read about cluster autoscaling.
- The addon resizer helps you in resizing the addons automatically as your cluster's scale changes.
2.3.2 - Running in multiple zones
This page describes running Kubernetes across multiple zones.
Background
Kubernetes is designed so that a single Kubernetes cluster can run across multiple failure zones, typically where these zones fit within a logical grouping called a region. Major cloud providers define a region as a set of failure zones (also called availability zones) that provide a consistent set of features: within a region, each zone offers the same APIs and services.
Typical cloud architectures aim to minimize the chance that a failure in one zone also impairs services in another zone.
Control plane behavior
All control plane components support running as a pool of interchangeable resources, replicated per component.
When you deploy a cluster control plane, place replicas of control plane components across multiple failure zones. If availability is an important concern, select at least three failure zones and replicate each individual control plane component (API server, scheduler, etcd, cluster controller manager) across at least three failure zones. If you are running a cloud controller manager then you should also replicate this across all the failure zones you selected.
Note:
Kubernetes does not provide cross-zone resilience for the API server endpoints. You can use various techniques to improve availability for the cluster API server, including DNS round-robin, SRV records, or a third-party load balancing solution with health checking.
Node behavior
Kubernetes automatically spreads the Pods for workload resources (such as Deployment or StatefulSet) across different nodes in a cluster. This spreading helps reduce the impact of failures.
When nodes start up, the kubelet on each node automatically adds labels to the Node object that represents that specific kubelet in the Kubernetes API. These labels can include zone information.
If your cluster spans multiple zones or regions, you can use node labels in conjunction with Pod topology spread constraints to control how Pods are spread across your cluster among fault domains: regions, zones, and even specific nodes. These hints enable the scheduler to place Pods for better expected availability, reducing the risk that a correlated failure affects your whole workload.
For example, you can set a constraint to make sure that the 3 replicas of a StatefulSet are all running in different zones to each other, whenever that is feasible. You can define this declaratively without explicitly defining which availability zones are in use for each workload.
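For example, a sketch of a Pod template fragment that asks the scheduler to spread replicas across zones; the app label and the soft ScheduleAnyway setting are illustrative choices:
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway   # best effort; use DoNotSchedule for a hard requirement
    labelSelector:
      matchLabels:
        app: my-app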
Distributing nodes across zones
Kubernetes' core does not create nodes for you; you need to do that yourself, or use a tool such as the Cluster API to manage nodes on your behalf.
Using tools such as the Cluster API you can define sets of machines to run as worker nodes for your cluster across multiple failure domains, and rules to automatically heal the cluster in case of whole-zone service disruption.
Manual zone assignment for Pods
You can apply node selector constraints to Pods that you create, as well as to Pod templates in workload resources such as Deployment, StatefulSet, or Job.
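For example, a sketch of a Pod pinned to a single zone using the well-known topology.kubernetes.io/zone node label; the zone value and image are illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: zone-pinned-pod
spec:
  nodeSelector:
    topology.kubernetes.io/zone: europe-west1-b   # use a zone name your nodes actually report
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9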
Storage access for zones
When persistent volumes are created, Kubernetes automatically adds zone labels
to any PersistentVolumes that are linked to a specific zone.
The scheduler then ensures,
through its NoVolumeZoneConflict
predicate, that pods which claim a given PersistentVolume
are only placed into the same zone as that volume.
Please note that the method of adding zone labels can depend on your cloud provider and the storage provisioner you’re using. Always refer to the specific documentation for your environment to ensure correct configuration.
You can specify a StorageClass for PersistentVolumeClaims that specifies the failure domains (zones) that the storage in that class may use. To learn about configuring a StorageClass that is aware of failure domains or zones, see Allowed topologies.
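For example, a sketch of a StorageClass restricted to two zones; the provisioner and zone names are illustrative and depend on your CSI driver and cloud:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zoned-standard
provisioner: csi.example.com          # substitute your storage provisioner
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values:
    - europe-west1-b
    - europe-west1-c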
Networking
By itself, Kubernetes does not include zone-aware networking. You can use a
network plugin
to configure cluster networking, and that network solution might have zone-specific
elements. For example, if your cloud provider supports Services with
type=LoadBalancer
, the load balancer might only send traffic to Pods running in the
same zone as the load balancer element processing a given connection.
Check your cloud provider's documentation for details.
For custom or on-premises deployments, similar considerations apply. Service and Ingress behavior, including handling of different failure zones, does vary depending on exactly how your cluster is set up.
Fault recovery
When you set up your cluster, you might also need to consider whether and how
your setup can restore service if all the failure zones in a region go
off-line at the same time. For example, do you rely on there being at least
one node able to run Pods in a zone?
Make sure that any cluster-critical repair work does not rely
on there being at least one healthy node in your cluster. For example: if all nodes
are unhealthy, you might need to run a repair Job with a special
toleration so that the repair
can complete enough to bring at least one node into service.
Kubernetes doesn't come with an answer for this challenge; however, it's something to consider.
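As an illustration only, a sketch of a repair Job that tolerates the taints Kubernetes adds to unhealthy nodes so that it can still be scheduled; the image and command are placeholders for your own repair logic:
apiVersion: batch/v1
kind: Job
metadata:
  name: cluster-repair
spec:
  template:
    spec:
      restartPolicy: Never
      tolerations:
      - key: node.kubernetes.io/not-ready
        operator: Exists
      - key: node.kubernetes.io/unreachable
        operator: Exists
      containers:
      - name: repair
        image: busybox:1.36
        command: ["sh", "-c", "echo replace this with your repair steps"]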
What's next
To learn how the scheduler places Pods in a cluster, honoring the configured constraints, visit Scheduling and Eviction.
2.3.3 - Validate node setup
Node Conformance Test
Node conformance test is a containerized test framework that provides a system verification and functionality test for a node. The test validates whether the node meets the minimum requirements for Kubernetes; a node that passes the test is qualified to join a Kubernetes cluster.
Node Prerequisite
To run node conformance test, a node must satisfy the same prerequisites as a standard Kubernetes node. At a minimum, the node should have the following daemons installed:
- CRI-compatible container runtimes such as Docker, containerd and CRI-O
- kubelet
Running Node Conformance Test
To run the node conformance test, perform the following steps:
Work out the value of the --kubeconfig option for the kubelet; for example: --kubeconfig=/var/lib/kubelet/config.yaml. Because the test framework starts a local control plane to test the kubelet, use http://localhost:8080 as the URL of the API server. There are some other kubelet command line parameters you may want to use:
--cloud-provider: If you are using --cloud-provider=gce, you should remove the flag to run the test.
Run the node conformance test with the following command:
# $CONFIG_DIR is the pod manifest path of your kubelet.
# $LOG_DIR is the test output path.
sudo docker run -it --rm --privileged --net=host \
  -v /:/rootfs -v $CONFIG_DIR:$CONFIG_DIR -v $LOG_DIR:/var/result \
  registry.k8s.io/node-test:0.2
Running Node Conformance Test for Other Architectures
Kubernetes also provides node conformance test docker images for other architectures:
Arch | Image |
---|---|
amd64 | node-test-amd64 |
arm | node-test-arm |
arm64 | node-test-arm64 |
Running Selected Test
To run specific tests, overwrite the environment variable FOCUS
with the
regular expression of tests you want to run.
# Only run the MirrorPod test
sudo docker run -it --rm --privileged --net=host \
  -v /:/rootfs:ro -v $CONFIG_DIR:$CONFIG_DIR -v $LOG_DIR:/var/result \
  -e FOCUS=MirrorPod \
  registry.k8s.io/node-test:0.2
To skip specific tests, overwrite the environment variable SKIP
with the
regular expression of tests you want to skip.
# Run all conformance tests but skip the MirrorPod test
sudo docker run -it --rm --privileged --net=host \
  -v /:/rootfs:ro -v $CONFIG_DIR:$CONFIG_DIR -v $LOG_DIR:/var/result \
  -e SKIP=MirrorPod \
  registry.k8s.io/node-test:0.2
Node conformance test is a containerized version of node e2e test. By default, it runs all conformance tests.
Theoretically, you can run any node e2e test if you configure the container and mount the required volumes properly. However, it is strongly recommended to run only the conformance tests, because running non-conformance tests requires much more complex configuration.
Caveats
- The test leaves some docker images on the node, including the node conformance test image and images of containers used in the functionality test.
- The test leaves dead containers on the node. These containers are created during the functionality test.
2.3.4 - Enforcing Pod Security Standards
This page provides an overview of best practices when it comes to enforcing Pod Security Standards.
Using the built-in Pod Security Admission Controller
Kubernetes v1.25 [stable]
The Pod Security Admission Controller intends to replace the deprecated PodSecurityPolicies.
Configure all cluster namespaces
Namespaces that lack any configuration at all should be considered significant gaps in your cluster security model. We recommend taking the time to analyze the types of workloads occurring in each namespace, and by referencing the Pod Security Standards, decide on an appropriate level for each of them. Unlabeled namespaces should only indicate that they've yet to be evaluated.
In the scenario that all workloads in all namespaces have the same security requirements, we provide an example that illustrates how the PodSecurity labels can be applied in bulk.
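For instance, a sketch of labeling every namespace at once, assuming the baseline level is appropriate for all of them:
kubectl label --overwrite ns --all \
  pod-security.kubernetes.io/audit=baseline \
  pod-security.kubernetes.io/warn=baseline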
Embrace the principle of least privilege
In an ideal world, every pod in every namespace would meet the requirements of the restricted
policy. However, this is neither possible nor practical, as some workloads will require elevated
privileges for legitimate reasons.
- Namespaces allowing privileged workloads should establish and enforce appropriate access controls.
- For workloads running in those permissive namespaces, maintain documentation about their unique security requirements. If at all possible, consider how those requirements could be further constrained.
Adopt a multi-mode strategy
The audit
and warn
modes of the Pod Security Standards admission controller make it easy to
collect important security insights about your pods without breaking existing workloads.
It is good practice to enable these modes for all namespaces, setting them to the desired level
and version you would eventually like to enforce
. The warnings and audit annotations generated in
this phase can guide you toward that state. If you expect workload authors to make changes to fit
within the desired level, enable the warn
mode. If you expect to use audit logs to monitor/drive
changes to fit within the desired level, enable the audit
mode.
When you have the enforce
mode set to your desired value, these modes can still be useful in a
few different ways:
- By setting warn to the same level as enforce, clients will receive warnings when attempting to create Pods (or resources that have Pod templates) that do not pass validation. This will help them update those resources to become compliant.
- In Namespaces that pin enforce to a specific non-latest version, setting the audit and warn modes to the same level as enforce, but to the latest version, gives visibility into settings that were allowed by previous versions but are not allowed per current best practices. A sketch of such a Namespace follows this list.
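A minimal sketch of that second pattern, assuming the restricted level; the namespace name and the pinned version are illustrative:
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
  labels:
    # enforce is pinned to a specific policy version...
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.25
    # ...while audit and warn track the latest policy definitions.
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest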
Third-party alternatives
Other alternatives for enforcing security profiles are being developed in the Kubernetes ecosystem:
The decision to go with a built-in solution (e.g. PodSecurity admission controller) versus a third-party tool is entirely dependent on your own situation. When evaluating any solution, trust of your supply chain is crucial. Ultimately, using any of the aforementioned approaches will be better than doing nothing.
2.3.5 - PKI certificates and requirements
Kubernetes requires PKI certificates for authentication over TLS. If you install Kubernetes with kubeadm, the certificates that your cluster requires are automatically generated. You can also generate your own certificates -- for example, to keep your private keys more secure by not storing them on the API server. This page explains the certificates that your cluster requires.
How certificates are used by your cluster
Kubernetes requires PKI for the following operations:
Server certificates
- Server certificate for the API server endpoint
- Server certificate for the etcd server
- Server certificates for each kubelet (every node runs a kubelet)
- Optional server certificate for the front-proxy
Client certificates
- Client certificates for each kubelet, used to authenticate to the API server as a client of the Kubernetes API
- Client certificate for each API server, used to authenticate to etcd
- Client certificate for the controller manager to securely communicate with the API server
- Client certificate for the scheduler to securely communicate with the API server
- Client certificates, one for each node, for kube-proxy to authenticate to the API server
- Optional client certificates for administrators of the cluster to authenticate to the API server
- Optional client certificate for the front-proxy
Kubelet's server and client certificates
To establish a secure connection and authenticate itself to the kubelet, the API Server requires a client certificate and key pair.
In this scenario, there are two approaches for certificate usage:
- Shared Certificates: The kube-apiserver can utilize the same certificate and key pair it uses to authenticate its clients. This means that the existing certificates, such as apiserver.crt and apiserver.key, can be used for communicating with the kubelet servers.
- Separate Certificates: Alternatively, the kube-apiserver can generate a new client certificate and key pair to authenticate its communication with the kubelet servers. In this case, a distinct certificate named kubelet-client.crt and its corresponding private key, kubelet-client.key, are created.
Note:
front-proxy certificates are required only if you run kube-proxy to support an extension API server.
etcd also implements mutual TLS to authenticate clients and peers.
Where certificates are stored
If you install Kubernetes with kubeadm, most certificates are stored in /etc/kubernetes/pki
.
All paths in this documentation are relative to that directory, with the exception of user account
certificates which kubeadm places in /etc/kubernetes
.
Configure certificates manually
If you don't want kubeadm to generate the required certificates, you can create them using a single root CA or by providing all certificates. See Certificates for details on creating your own certificate authority. See Certificate Management with kubeadm for more on managing certificates.
Single root CA
You can create a single root CA, controlled by an administrator. This root CA can then create multiple intermediate CAs, and delegate all further creation to Kubernetes itself.
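For illustration, a minimal sketch of creating such a root CA with openssl; the file names follow the kubeadm layout, while the key size and validity period are arbitrary choices:
openssl genrsa -out ca.key 2048
openssl req -x509 -new -nodes -key ca.key \
  -subj "/CN=kubernetes-ca" -days 3650 -out ca.crt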
Required CAs:
Path | Default CN | Description |
---|---|---|
ca.crt,key | kubernetes-ca | Kubernetes general CA |
etcd/ca.crt,key | etcd-ca | For all etcd-related functions |
front-proxy-ca.crt,key | kubernetes-front-proxy-ca | For the front-end proxy |
On top of the above CAs, it is also necessary to get a public/private key pair for service account
management, sa.key
and sa.pub
.
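For example, a sketch of generating this pair with openssl (an RSA key is one common choice; the key size is illustrative):
openssl genrsa -out sa.key 2048
openssl rsa -in sa.key -pubout -out sa.pub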
The following example illustrates the CA key and certificate files shown in the previous table:
/etc/kubernetes/pki/ca.crt
/etc/kubernetes/pki/ca.key
/etc/kubernetes/pki/etcd/ca.crt
/etc/kubernetes/pki/etcd/ca.key
/etc/kubernetes/pki/front-proxy-ca.crt
/etc/kubernetes/pki/front-proxy-ca.key
All certificates
If you don't wish to copy the CA private keys to your cluster, you can generate all certificates yourself.
Required certificates:
Default CN | Parent CA | O (in Subject) | kind | hosts (SAN) |
---|---|---|---|---|
kube-etcd | etcd-ca | | server, client | <hostname>, <Host_IP>, localhost, 127.0.0.1 |
kube-etcd-peer | etcd-ca | | server, client | <hostname>, <Host_IP>, localhost, 127.0.0.1 |
kube-etcd-healthcheck-client | etcd-ca | | client | |
kube-apiserver-etcd-client | etcd-ca | | client | |
kube-apiserver | kubernetes-ca | | server | <hostname>, <Host_IP>, <advertise_IP> 1 |
kube-apiserver-kubelet-client | kubernetes-ca | system:masters | client | |
front-proxy-client | kubernetes-front-proxy-ca | | client | |
Note:
Instead of using the super-user group system:masters for kube-apiserver-kubelet-client, a less privileged group can be used. kubeadm uses the kubeadm:cluster-admins group for that purpose.
The kind column maps to one or more x509 key usages, which are also documented in the .spec.usages of a CertificateSigningRequest type:
kind | Key usage |
---|---|
server | digital signature, key encipherment, server auth |
client | digital signature, key encipherment, client auth |
Note:
The hosts/SANs listed above are the recommended ones for getting a working cluster; if required by a specific setup, it is possible to add additional SANs on all the server certificates.
Note:
For kubeadm users only:
- The scenario where you copy CA certificates to your cluster without their private keys is referred to as external CA in the kubeadm documentation.
- If you are comparing the above list with a kubeadm generated PKI, please be aware that the kube-etcd, kube-etcd-peer and kube-etcd-healthcheck-client certificates are not generated in the case of external etcd.
Certificate paths
Certificates should be placed in a recommended path (as used by kubeadm). Paths should be specified using the given argument regardless of location.
Default CN | recommended key path | recommended cert path | command | key argument | cert argument |
---|---|---|---|---|---|
etcd-ca | etcd/ca.key | etcd/ca.crt | kube-apiserver | | --etcd-cafile |
kube-apiserver-etcd-client | apiserver-etcd-client.key | apiserver-etcd-client.crt | kube-apiserver | --etcd-keyfile | --etcd-certfile |
kubernetes-ca | ca.key | ca.crt | kube-apiserver | | --client-ca-file |
kubernetes-ca | ca.key | ca.crt | kube-controller-manager | --cluster-signing-key-file | --client-ca-file, --root-ca-file, --cluster-signing-cert-file |
kube-apiserver | apiserver.key | apiserver.crt | kube-apiserver | --tls-private-key-file | --tls-cert-file |
kube-apiserver-kubelet-client | apiserver-kubelet-client.key | apiserver-kubelet-client.crt | kube-apiserver | --kubelet-client-key | --kubelet-client-certificate |
front-proxy-ca | front-proxy-ca.key | front-proxy-ca.crt | kube-apiserver | | --requestheader-client-ca-file |
front-proxy-ca | front-proxy-ca.key | front-proxy-ca.crt | kube-controller-manager | | --requestheader-client-ca-file |
front-proxy-client | front-proxy-client.key | front-proxy-client.crt | kube-apiserver | --proxy-client-key-file | --proxy-client-cert-file |
etcd-ca | etcd/ca.key | etcd/ca.crt | etcd | | --trusted-ca-file, --peer-trusted-ca-file |
kube-etcd | etcd/server.key | etcd/server.crt | etcd | --key-file | --cert-file |
kube-etcd-peer | etcd/peer.key | etcd/peer.crt | etcd | --peer-key-file | --peer-cert-file |
etcd-ca | | etcd/ca.crt | etcdctl | | --cacert |
kube-etcd-healthcheck-client | etcd/healthcheck-client.key | etcd/healthcheck-client.crt | etcdctl | --key | --cert |
Same considerations apply for the service account key pair:
private key path | public key path | command | argument |
---|---|---|---|
sa.key | | kube-controller-manager | --service-account-private-key-file |
| sa.pub | kube-apiserver | --service-account-key-file |
The following example illustrates the file paths from the previous tables you need to provide if you are generating all of your own keys and certificates:
/etc/kubernetes/pki/etcd/ca.key
/etc/kubernetes/pki/etcd/ca.crt
/etc/kubernetes/pki/apiserver-etcd-client.key
/etc/kubernetes/pki/apiserver-etcd-client.crt
/etc/kubernetes/pki/ca.key
/etc/kubernetes/pki/ca.crt
/etc/kubernetes/pki/apiserver.key
/etc/kubernetes/pki/apiserver.crt
/etc/kubernetes/pki/apiserver-kubelet-client.key
/etc/kubernetes/pki/apiserver-kubelet-client.crt
/etc/kubernetes/pki/front-proxy-ca.key
/etc/kubernetes/pki/front-proxy-ca.crt
/etc/kubernetes/pki/front-proxy-client.key
/etc/kubernetes/pki/front-proxy-client.crt
/etc/kubernetes/pki/etcd/server.key
/etc/kubernetes/pki/etcd/server.crt
/etc/kubernetes/pki/etcd/peer.key
/etc/kubernetes/pki/etcd/peer.crt
/etc/kubernetes/pki/etcd/healthcheck-client.key
/etc/kubernetes/pki/etcd/healthcheck-client.crt
/etc/kubernetes/pki/sa.key
/etc/kubernetes/pki/sa.pub
Configure certificates for user accounts
You must manually configure these administrator accounts and service accounts:
Filename | Credential name | Default CN | O (in Subject) |
---|---|---|---|
admin.conf | default-admin | kubernetes-admin | <admin-group> |
super-admin.conf | default-super-admin | kubernetes-super-admin | system:masters |
kubelet.conf | default-auth | system:node:<nodeName> (see note) | system:nodes |
controller-manager.conf | default-controller-manager | system:kube-controller-manager | |
scheduler.conf | default-scheduler | system:kube-scheduler |
Note:
The value of <nodeName> for kubelet.conf must match precisely the value of the node name provided by the kubelet as it registers with the apiserver. For further details, read Node Authorization.
Note:
In the above example <admin-group> is implementation specific. Some tools sign the certificate in the default admin.conf to be part of the system:masters group.
system:masters is a break-glass, super user group that can bypass the authorization layer of Kubernetes, such as RBAC. Also, some tools do not generate a separate super-admin.conf with a certificate bound to this super user group.
kubeadm generates two separate administrator certificates in kubeconfig files.
One is in admin.conf
and has Subject: O = kubeadm:cluster-admins, CN = kubernetes-admin
.
kubeadm:cluster-admins
is a custom group bound to the cluster-admin
ClusterRole.
This file is generated on all kubeadm managed control plane machines.
Another is in super-admin.conf
that has Subject: O = system:masters, CN = kubernetes-super-admin
.
This file is generated only on the node where kubeadm init
was called.
For each configuration, generate an x509 certificate/key pair with the given Common Name (CN) and Organization (O).
Run kubectl as follows for each configuration:
KUBECONFIG=<filename> kubectl config set-cluster default-cluster --server=https://<host ip>:6443 --certificate-authority <path-to-kubernetes-ca> --embed-certs
KUBECONFIG=<filename> kubectl config set-credentials <credential-name> --client-key <path-to-key>.pem --client-certificate <path-to-cert>.pem --embed-certs
KUBECONFIG=<filename> kubectl config set-context default-system --cluster default-cluster --user <credential-name>
KUBECONFIG=<filename> kubectl config use-context default-system
These files are used as follows:
Filename | Command | Comment |
---|---|---|
admin.conf | kubectl | Configures administrator user for the cluster |
super-admin.conf | kubectl | Configures super administrator user for the cluster |
kubelet.conf | kubelet | One required for each node in the cluster. |
controller-manager.conf | kube-controller-manager | Must be added to manifest in manifests/kube-controller-manager.yaml |
scheduler.conf | kube-scheduler | Must be added to manifest in manifests/kube-scheduler.yaml |
The following files illustrate full paths to the files listed in the previous table:
/etc/kubernetes/admin.conf
/etc/kubernetes/super-admin.conf
/etc/kubernetes/kubelet.conf
/etc/kubernetes/controller-manager.conf
/etc/kubernetes/scheduler.conf
3 - Concepts
The Concepts section helps you learn about the parts of the Kubernetes system and the abstractions Kubernetes uses to represent your cluster, and helps you obtain a deeper understanding of how Kubernetes works.
3.1 - Overview
This page is an overview of Kubernetes.
The name Kubernetes originates from Greek, meaning helmsman or pilot. K8s as an abbreviation results from counting the eight letters between the "K" and the "s". Google open-sourced the Kubernetes project in 2014. Kubernetes combines over 15 years of Google's experience running production workloads at scale with best-of-breed ideas and practices from the community.
Why you need Kubernetes and what it can do
Containers are a good way to bundle and run your applications. In a production environment, you need to manage the containers that run the applications and ensure that there is no downtime. For example, if a container goes down, another container needs to start. Wouldn't it be easier if this behavior was handled by a system?
That's how Kubernetes comes to the rescue! Kubernetes provides you with a framework to run distributed systems resiliently. It takes care of scaling and failover for your application, provides deployment patterns, and more. For example: Kubernetes can easily manage a canary deployment for your system.
Kubernetes provides you with:
- Service discovery and load balancing Kubernetes can expose a container using the DNS name or using their own IP address. If traffic to a container is high, Kubernetes is able to load balance and distribute the network traffic so that the deployment is stable.
- Storage orchestration Kubernetes allows you to automatically mount a storage system of your choice, such as local storages, public cloud providers, and more.
- Automated rollouts and rollbacks You can describe the desired state for your deployed containers using Kubernetes, and it can change the actual state to the desired state at a controlled rate. For example, you can automate Kubernetes to create new containers for your deployment, remove existing containers and adopt all their resources to the new container.
- Automatic bin packing You provide Kubernetes with a cluster of nodes that it can use to run containerized tasks. You tell Kubernetes how much CPU and memory (RAM) each container needs. Kubernetes can fit containers onto your nodes to make the best use of your resources.
- Self-healing Kubernetes restarts containers that fail, replaces containers, kills containers that don't respond to your user-defined health check, and doesn't advertise them to clients until they are ready to serve.
- Secret and configuration management Kubernetes lets you store and manage sensitive information, such as passwords, OAuth tokens, and SSH keys. You can deploy and update secrets and application configuration without rebuilding your container images, and without exposing secrets in your stack configuration.
- Batch execution In addition to services, Kubernetes can manage your batch and CI workloads, replacing containers that fail, if desired.
- Horizontal scaling Scale your application up and down with a simple command, with a UI, or automatically based on CPU usage.
- IPv4/IPv6 dual-stack Allocation of IPv4 and IPv6 addresses to Pods and Services
- Designed for extensibility Add features to your Kubernetes cluster without changing upstream source code.
What Kubernetes is not
Kubernetes is not a traditional, all-inclusive PaaS (Platform as a Service) system. Since Kubernetes operates at the container level rather than at the hardware level, it provides some generally applicable features common to PaaS offerings, such as deployment, scaling, load balancing, and lets users integrate their logging, monitoring, and alerting solutions. However, Kubernetes is not monolithic, and these default solutions are optional and pluggable. Kubernetes provides the building blocks for building developer platforms, but preserves user choice and flexibility where it is important.
Kubernetes:
- Does not limit the types of applications supported. Kubernetes aims to support an extremely diverse variety of workloads, including stateless, stateful, and data-processing workloads. If an application can run in a container, it should run great on Kubernetes.
- Does not deploy source code and does not build your application. Continuous Integration, Delivery, and Deployment (CI/CD) workflows are determined by organization cultures and preferences as well as technical requirements.
- Does not provide application-level services, such as middleware (for example, message buses), data-processing frameworks (for example, Spark), databases (for example, MySQL), caches, nor cluster storage systems (for example, Ceph) as built-in services. Such components can run on Kubernetes, and/or can be accessed by applications running on Kubernetes through portable mechanisms, such as the Open Service Broker.
- Does not dictate logging, monitoring, or alerting solutions. It provides some integrations as proof of concept, and mechanisms to collect and export metrics.
- Does not provide nor mandate a configuration language/system (for example, Jsonnet). It provides a declarative API that may be targeted by arbitrary forms of declarative specifications.
- Does not provide nor adopt any comprehensive machine configuration, maintenance, management, or self-healing systems.
- Additionally, Kubernetes is not a mere orchestration system. In fact, it eliminates the need for orchestration. The technical definition of orchestration is execution of a defined workflow: first do A, then B, then C. In contrast, Kubernetes comprises a set of independent, composable control processes that continuously drive the current state towards the provided desired state. It shouldn't matter how you get from A to C. Centralized control is also not required. This results in a system that is easier to use and more powerful, robust, resilient, and extensible.
Historical context for Kubernetes
Let's take a look at why Kubernetes is so useful by going back in time.
Traditional deployment era:
Early on, organizations ran applications on physical servers. There was no way to define resource boundaries for applications in a physical server, and this caused resource allocation issues. For example, if multiple applications run on a physical server, there can be instances where one application would take up most of the resources, and as a result, the other applications would underperform. A solution for this would be to run each application on a different physical server. But this did not scale as resources were underutilized, and it was expensive for organizations to maintain many physical servers.
Virtualized deployment era:
As a solution, virtualization was introduced. It allows you to run multiple Virtual Machines (VMs) on a single physical server's CPU. Virtualization allows applications to be isolated between VMs and provides a level of security as the information of one application cannot be freely accessed by another application.
Virtualization allows better utilization of resources in a physical server and allows better scalability because an application can be added or updated easily, reduces hardware costs, and much more. With virtualization you can present a set of physical resources as a cluster of disposable virtual machines.
Each VM is a full machine running all the components, including its own operating system, on top of the virtualized hardware.
Container deployment era:
Containers are similar to VMs, but they have relaxed isolation properties to share the Operating System (OS) among the applications. Therefore, containers are considered lightweight. Similar to a VM, a container has its own filesystem, share of CPU, memory, process space, and more. As they are decoupled from the underlying infrastructure, they are portable across clouds and OS distributions.
Containers have become popular because they provide extra benefits, such as:
- Agile application creation and deployment: increased ease and efficiency of container image creation compared to VM image use.
- Continuous development, integration, and deployment: provides for reliable and frequent container image build and deployment with quick and efficient rollbacks (due to image immutability).
- Dev and Ops separation of concerns: create application container images at build/release time rather than deployment time, thereby decoupling applications from infrastructure.
- Observability: not only surfaces OS-level information and metrics, but also application health and other signals.
- Environmental consistency across development, testing, and production: runs the same on a laptop as it does in the cloud.
- Cloud and OS distribution portability: runs on Ubuntu, RHEL, CoreOS, on-premises, on major public clouds, and anywhere else.
- Application-centric management: raises the level of abstraction from running an OS on virtual hardware to running an application on an OS using logical resources.
- Loosely coupled, distributed, elastic, liberated micro-services: applications are broken into smaller, independent pieces and can be deployed and managed dynamically – not a monolithic stack running on one big single-purpose machine.
- Resource isolation: predictable application performance.
- Resource utilization: high efficiency and density.
What's next
- Take a look at the Kubernetes Components
- Take a look at the The Kubernetes API
- Take a look at the Cluster Architecture
- Ready to Get Started?
3.1.1 - Kubernetes Components
This page provides a high-level overview of the essential components that make up a Kubernetes cluster.
Core Components
A Kubernetes cluster consists of a control plane and one or more worker nodes. Here's a brief overview of the main components:
Control Plane Components
Manage the overall state of the cluster:
- kube-apiserver
- The core component server that exposes the Kubernetes HTTP API
- etcd
- Consistent and highly-available key value store for all API server data
- kube-scheduler
- Looks for Pods not yet bound to a node, and assigns each Pod to a suitable node.
- kube-controller-manager
- Runs controllers to implement Kubernetes API behavior.
- cloud-controller-manager (optional)
- Integrates with underlying cloud provider(s).
Node Components
Run on every node, maintaining running pods and providing the Kubernetes runtime environment:
- kubelet
- Ensures that Pods are running, including their containers.
- kube-proxy (optional)
- Maintains network rules on nodes to implement Services.
- Container runtime
- Software responsible for running containers. Read Container Runtimes to learn more.
Your cluster may require additional software on each node; for example, you might also run systemd on a Linux node to supervise local components.
Addons
Addons extend the functionality of Kubernetes. A few important examples include:
- DNS
- For cluster-wide DNS resolution
- Web UI (Dashboard)
- For cluster management via a web interface
- Container Resource Monitoring
- For collecting and storing container metrics
- Cluster-level Logging
- For saving container logs to a central log store
Flexibility in Architecture
Kubernetes allows for flexibility in how these components are deployed and managed. The architecture can be adapted to various needs, from small development environments to large-scale production deployments.
For more detailed information about each component and various ways to configure your cluster architecture, see the Cluster Architecture page.
3.1.2 - Objects In Kubernetes
This page explains how Kubernetes objects are represented in the Kubernetes API, and how you can
express them in .yaml
format.
Understanding Kubernetes objects
Kubernetes objects are persistent entities in the Kubernetes system. Kubernetes uses these entities to represent the state of your cluster. Specifically, they can describe:
- What containerized applications are running (and on which nodes)
- The resources available to those applications
- The policies around how those applications behave, such as restart policies, upgrades, and fault-tolerance
A Kubernetes object is a "record of intent"--once you create the object, the Kubernetes system will constantly work to ensure that the object exists. By creating an object, you're effectively telling the Kubernetes system what you want your cluster's workload to look like; this is your cluster's desired state.
To work with Kubernetes objects—whether to create, modify, or delete them—you'll need to use the
Kubernetes API. When you use the kubectl
command-line
interface, for example, the CLI makes the necessary Kubernetes API calls for you. You can also use
the Kubernetes API directly in your own programs using one of the
Client Libraries.
Object spec and status
Almost every Kubernetes object includes two nested object fields that govern
the object's configuration: the object spec
and the object status
.
For objects that have a spec
, you have to set this when you create the object,
providing a description of the characteristics you want the resource to have:
its desired state.
The status
describes the current state of the object, supplied and updated
by the Kubernetes system and its components. The Kubernetes
control plane continually
and actively manages every object's actual state to match the desired state you
supplied.
For example: in Kubernetes, a Deployment is an object that can represent an
application running on your cluster. When you create the Deployment, you
might set the Deployment spec
to specify that you want three replicas of
the application to be running. The Kubernetes system reads the Deployment
spec and starts three instances of your desired application--updating
the status to match your spec. If any of those instances should fail
(a status change), the Kubernetes system responds to the difference
between spec and status by making a correction--in this case, starting
a replacement instance.
For more information on the object spec, status, and metadata, see the Kubernetes API Conventions.
Describing a Kubernetes object
When you create an object in Kubernetes, you must provide the object spec that describes its
desired state, as well as some basic information about the object (such as a name). When you use
the Kubernetes API to create the object (either directly or via kubectl
), that API request must
include that information as JSON in the request body.
Most often, you provide the information to kubectl
in a file known as a manifest.
By convention, manifests are YAML (you could also use JSON format).
Tools such as kubectl
convert the information from a manifest into JSON or another supported
serialization format when making the API request over HTTP.
Here's an example manifest that shows the required fields and object spec for a Kubernetes Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
replicas: 2 # tells deployment to run 2 pods matching the template
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
One way to create a Deployment using a manifest file like the one above is to use the
kubectl apply
command
in the kubectl
command-line interface, passing the .yaml
file as an argument. Here's an example:
kubectl apply -f https://k8s.io/examples/application/deployment.yaml
The output is similar to this:
deployment.apps/nginx-deployment created
Required fields
In the manifest (YAML or JSON file) for the Kubernetes object you want to create, you'll need to set values for the following fields:
apiVersion
- Which version of the Kubernetes API you're using to create this objectkind
- What kind of object you want to createmetadata
- Data that helps uniquely identify the object, including aname
string,UID
, and optionalnamespace
spec
- What state you desire for the object
The precise format of the object spec
is different for every Kubernetes object, and contains
nested fields specific to that object. The Kubernetes API Reference
can help you find the spec format for all of the objects you can create using Kubernetes.
For example, see the spec
field
for the Pod API reference.
For each Pod, the .spec
field specifies the pod and its desired state (such as the container image name for
each container within that pod).
Another example of an object specification is the
spec
field
for the StatefulSet API. For StatefulSet, the .spec
field specifies the StatefulSet and
its desired state.
Within the .spec
of a StatefulSet is a template
for Pod objects. That template describes Pods that the StatefulSet controller will create in order to
satisfy the StatefulSet specification.
Different kinds of objects can also have different .status
; again, the API reference pages
detail the structure of that .status
field, and its content for each different type of object.
Note:
See Configuration Best Practices for additional information on writing YAML configuration files.
Server side field validation
Starting with Kubernetes v1.25, the API server offers server side
field validation
that detects unrecognized or duplicate fields in an object. It provides all the functionality
of kubectl --validate
on the server side.
The kubectl
tool uses the --validate
flag to set the level of field validation. It accepts the
values ignore
, warn
, and strict
while also accepting the values true
(equivalent to strict
)
and false
(equivalent to ignore
). The default validation setting for kubectl
is --validate=true
.
Strict
- Strict field validation, errors on validation failure
Warn
- Field validation is performed, but errors are exposed as warnings rather than failing the request
Ignore
- No server side field validation is performed
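For example, a sketch of selecting these levels from kubectl; the manifest path is illustrative:
# Reject manifests that contain unknown or duplicated fields.
kubectl apply --validate=strict -f ./deployment.yaml
# Accept the manifest, but surface unknown or duplicated fields as warnings.
kubectl apply --validate=warn -f ./deployment.yaml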
When kubectl
cannot connect to an API server that supports field validation it will fall back
to using client-side validation. Kubernetes 1.27 and later versions always offer field validation;
older Kubernetes releases might not. If your cluster is older than v1.27, check the documentation
for your version of Kubernetes.
What's next
If you're new to Kubernetes, read more about the following:
- Pods which are the most important basic Kubernetes objects.
- Deployment objects.
- Controllers in Kubernetes.
- kubectl and kubectl commands.
Kubernetes Object Management
explains how to use kubectl
to manage objects.
You might need to install kubectl if you don't already have it available.
To learn about the Kubernetes API in general, visit:
To learn about objects in Kubernetes in more depth, read other pages in this section:
3.1.2.1 - Kubernetes Object Management
The kubectl
command-line tool supports several different ways to create and manage
Kubernetes objects. This document provides an overview of the different
approaches. Read the Kubectl book for
details of managing objects by Kubectl.
Management techniques
Warning:
A Kubernetes object should be managed using only one technique. Mixing and matching techniques for the same object results in undefined behavior.
Management technique | Operates on | Recommended environment | Supported writers | Learning curve |
---|---|---|---|---|
Imperative commands | Live objects | Development projects | 1+ | Lowest |
Imperative object configuration | Individual files | Production projects | 1 | Moderate |
Declarative object configuration | Directories of files | Production projects | 1+ | Highest |
Imperative commands
When using imperative commands, a user operates directly on live objects
in a cluster. The user provides operations to
the kubectl
command as arguments or flags.
This is the recommended way to get started or to run a one-off task in a cluster. Because this technique operates directly on live objects, it provides no history of previous configurations.
Examples
Run an instance of the nginx container by creating a Deployment object:
kubectl create deployment nginx --image nginx
Trade-offs
Advantages compared to object configuration:
- Commands are expressed as a single action word.
- Commands require only a single step to make changes to the cluster.
Disadvantages compared to object configuration:
- Commands do not integrate with change review processes.
- Commands do not provide an audit trail associated with changes.
- Commands do not provide a source of records except for what is live.
- Commands do not provide a template for creating new objects.
Imperative object configuration
In imperative object configuration, the kubectl command specifies the operation (create, replace, etc.), optional flags and at least one file name. The file specified must contain a full definition of the object in YAML or JSON format.
See the API reference for more details on object definitions.
Warning:
The imperative replace command replaces the existing spec with the newly provided one, dropping all changes to the object missing from the configuration file. This approach should not be used with resource types whose specs are updated independently of the configuration file. Services of type LoadBalancer, for example, have their externalIPs field updated independently from the configuration by the cluster.
Examples
Create the objects defined in a configuration file:
kubectl create -f nginx.yaml
Delete the objects defined in two configuration files:
kubectl delete -f nginx.yaml -f redis.yaml
Update the objects defined in a configuration file by overwriting the live configuration:
kubectl replace -f nginx.yaml
Trade-offs
Advantages compared to imperative commands:
- Object configuration can be stored in a source control system such as Git.
- Object configuration can integrate with processes such as reviewing changes before push and audit trails.
- Object configuration provides a template for creating new objects.
Disadvantages compared to imperative commands:
- Object configuration requires basic understanding of the object schema.
- Object configuration requires the additional step of writing a YAML file.
Advantages compared to declarative object configuration:
- Imperative object configuration behavior is simpler and easier to understand.
- As of Kubernetes version 1.5, imperative object configuration is more mature.
Disadvantages compared to declarative object configuration:
- Imperative object configuration works best on files, not directories.
- Updates to live objects must be reflected in configuration files, or they will be lost during the next replacement.
Declarative object configuration
When using declarative object configuration, a user operates on object
configuration files stored locally, however the user does not define the
operations to be taken on the files. Create, update, and delete operations
are automatically detected per-object by kubectl
. This enables working on
directories, where different operations might be needed for different objects.
Note:
Declarative object configuration retains changes made by other writers, even if the changes are not merged back to the object configuration file. This is possible by using the patch API operation to write only observed differences, instead of using the replace API operation to replace the entire object configuration.
Examples
Process all object configuration files in the configs
directory, and create or
patch the live objects. You can first diff
to see what changes are going to be
made, and then apply:
kubectl diff -f configs/
kubectl apply -f configs/
Recursively process directories:
kubectl diff -R -f configs/
kubectl apply -R -f configs/
Trade-offs
Advantages compared to imperative object configuration:
- Changes made directly to live objects are retained, even if they are not merged back into the configuration files.
- Declarative object configuration has better support for operating on directories and automatically detecting operation types (create, patch, delete) per-object.
Disadvantages compared to imperative object configuration:
- Declarative object configuration is harder to debug and understand results when they are unexpected.
- Partial updates using diffs create complex merge and patch operations.
What's next
- Managing Kubernetes Objects Using Imperative Commands
- Imperative Management of Kubernetes Objects Using Configuration Files
- Declarative Management of Kubernetes Objects Using Configuration Files
- Declarative Management of Kubernetes Objects Using Kustomize
- Kubectl Command Reference
- Kubectl Book
- Kubernetes API Reference
3.1.2.2 - Object Names and IDs
Each object in your cluster has a Name that is unique for that type of resource. Every Kubernetes object also has a UID that is unique across your whole cluster.
For example, you can only have one Pod named myapp-1234
within the same namespace, but you can have one Pod and one Deployment that are each named myapp-1234
.
For non-unique user-provided attributes, Kubernetes provides labels and annotations.
Names
A client-provided string that refers to an object in a resource URL, such as /api/v1/pods/some-name
.
Only one object of a given kind can have a given name at a time. However, if you delete the object, you can make a new object with the same name.
Names must be unique across all API versions of the same resource. API resources are distinguished by their API group, resource type, namespace (for namespaced resources), and name. In other words, API version is irrelevant in this context.
Note:
In cases where objects represent a physical entity, like a Node representing a physical host, if the host is re-created under the same name without deleting and re-creating the Node, Kubernetes treats the new host as the old one, which may lead to inconsistencies.
The server may generate a name when generateName is provided instead of name in a resource create request.
When generateName is used, the provided value is used as a name prefix, to which the server appends a generated suffix. Even though the name is generated, it may conflict with existing names, resulting in an HTTP 409 response. This became far less likely in Kubernetes v1.31 and later, since the server will make up to 8 attempts to generate a unique name before returning an HTTP 409 response.
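For example, a sketch of a manifest that uses generateName; the prefix and image are illustrative, and such a manifest is typically submitted with kubectl create rather than kubectl apply:
apiVersion: v1
kind: Pod
metadata:
  generateName: test-pod-   # the server appends a random suffix, for example test-pod-7k9x2
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9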
Below are four types of commonly used name constraints for resources.
DNS Subdomain Names
Most resource types require a name that can be used as a DNS subdomain name as defined in RFC 1123. This means the name must:
- contain no more than 253 characters
- contain only lowercase alphanumeric characters, '-' or '.'
- start with an alphanumeric character
- end with an alphanumeric character
RFC 1123 Label Names
Some resource types require their names to follow the DNS label standard as defined in RFC 1123. This means the name must:
- contain at most 63 characters
- contain only lowercase alphanumeric characters or '-'
- start with an alphanumeric character
- end with an alphanumeric character
RFC 1035 Label Names
Some resource types require their names to follow the DNS label standard as defined in RFC 1035. This means the name must:
- contain at most 63 characters
- contain only lowercase alphanumeric characters or '-'
- start with an alphabetic character
- end with an alphanumeric character
Note:
The only difference between the RFC 1035 and RFC 1123 label standards is that RFC 1123 labels are allowed to start with a digit, whereas RFC 1035 labels can start with a lowercase alphabetic character only.
Path Segment Names
Some resource types require their names to be able to be safely encoded as a path segment. In other words, the name may not be "." or ".." and the name may not contain "/" or "%".
Here's an example manifest for a Pod named nginx-demo
.
apiVersion: v1
kind: Pod
metadata:
name: nginx-demo
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
Note:
Some resource types have additional restrictions on their names.
UIDs
A Kubernetes systems-generated string to uniquely identify objects.
Every object created over the whole lifetime of a Kubernetes cluster has a distinct UID. It is intended to distinguish between historical occurrences of similar entities.
Kubernetes UIDs are universally unique identifiers (also known as UUIDs). UUIDs are standardized as ISO/IEC 9834-8 and as ITU-T X.667.
What's next
- Read about labels and annotations in Kubernetes.
- See the Identifiers and Names in Kubernetes design document.
3.1.2.3 - Labels and Selectors
Labels are key/value pairs that are attached to objects such as Pods. Labels are intended to be used to specify identifying attributes of objects that are meaningful and relevant to users, but do not directly imply semantics to the core system. Labels can be used to organize and to select subsets of objects. Labels can be attached to objects at creation time and subsequently added and modified at any time. Each object can have a set of key/value labels defined. Each Key must be unique for a given object.
"metadata": {
"labels": {
"key1" : "value1",
"key2" : "value2"
}
}
Labels allow for efficient queries and watches and are ideal for use in UIs and CLIs. Non-identifying information should be recorded using annotations.
Motivation
Labels enable users to map their own organizational structures onto system objects in a loosely coupled fashion, without requiring clients to store these mappings.
Service deployments and batch processing pipelines are often multi-dimensional entities (e.g., multiple partitions or deployments, multiple release tracks, multiple tiers, multiple micro-services per tier). Management often requires cross-cutting operations, which breaks encapsulation of strictly hierarchical representations, especially rigid hierarchies determined by the infrastructure rather than by users.
Example labels:
"release" : "stable"
,"release" : "canary"
"environment" : "dev"
,"environment" : "qa"
,"environment" : "production"
"tier" : "frontend"
,"tier" : "backend"
,"tier" : "cache"
"partition" : "customerA"
,"partition" : "customerB"
"track" : "daily"
,"track" : "weekly"
These are examples of commonly used labels; you are free to develop your own conventions. Keep in mind that label keys must be unique for a given object.
Syntax and character set
Labels are key/value pairs. Valid label keys have two segments: an optional
prefix and name, separated by a slash (/
). The name segment is required and
must be 63 characters or less, beginning and ending with an alphanumeric
character ([a-z0-9A-Z]
) with dashes (-
), underscores (_
), dots (.
),
and alphanumerics between. The prefix is optional. If specified, the prefix
must be a DNS subdomain: a series of DNS labels separated by dots (.
),
not longer than 253 characters in total, followed by a slash (/
).
If the prefix is omitted, the label Key is presumed to be private to the user.
Automated system components (e.g. kube-scheduler
, kube-controller-manager
,
kube-apiserver
, kubectl
, or other third-party automation) which add labels
to end-user objects must specify a prefix.
The kubernetes.io/
and k8s.io/
prefixes are
reserved for Kubernetes core components.
Valid label value:
- must be 63 characters or less (can be empty),
- unless empty, must begin and end with an alphanumeric character (
[a-z0-9A-Z]
), - could contain dashes (
-
), underscores (_
), dots (.
), and alphanumerics between.
For example, here's a manifest for a Pod that has two labels
environment: production
and app: nginx
:
apiVersion: v1
kind: Pod
metadata:
name: label-demo
labels:
environment: production
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
Label selectors
Unlike names and UIDs, labels do not provide uniqueness. In general, we expect many objects to carry the same label(s).
Via a label selector, the client/user can identify a set of objects. The label selector is the core grouping primitive in Kubernetes.
The API currently supports two types of selectors: equality-based and set-based.
A label selector can be made of multiple requirements which are comma-separated.
In the case of multiple requirements, all must be satisfied so the comma separator
acts as a logical AND (&&
) operator.
The semantics of empty or non-specified selectors are dependent on the context, and API types that use selectors should document the validity and meaning of them.
Note:
For some API types, such as ReplicaSets, the label selectors of two instances must not overlap within a namespace, or the controller can see that as conflicting instructions and fail to determine how many replicas should be present.
Caution:
For both equality-based and set-based conditions there is no logical OR (||) operator. Ensure your filter statements are structured accordingly.
Equality-based requirement
Equality- or inequality-based requirements allow filtering by label keys and values.
Matching objects must satisfy all of the specified label constraints, though they may
have additional labels as well. Three kinds of operators are admitted =
,==
,!=
.
The first two represent equality (and are synonyms), while the latter represents inequality.
For example:
environment = production
tier != frontend
The former selects all resources with key equal to environment
and value equal to production
.
The latter selects all resources with key equal to tier
and value distinct from frontend
,
and all resources with no labels with the tier
key. One could filter for resources in production
excluding frontend
using the comma operator: environment=production,tier!=frontend
One usage scenario for equality-based label requirement is for Pods to specify
node selection criteria. For example, the sample Pod below selects nodes where
the accelerator
label exists and is set to nvidia-tesla-p100
.
apiVersion: v1
kind: Pod
metadata:
name: cuda-test
spec:
containers:
- name: cuda-test
image: "registry.k8s.io/cuda-vector-add:v0.1"
resources:
limits:
nvidia.com/gpu: 1
nodeSelector:
accelerator: nvidia-tesla-p100
Set-based requirement
Set-based label requirements allow filtering keys according to a set of values.
Three kinds of operators are supported: in
,notin
and exists
(only the key identifier).
For example:
environment in (production, qa)
tier notin (frontend, backend)
partition
!partition
- The first example selects all resources with key equal to
environment
and value equal toproduction
orqa
. - The second example selects all resources with key equal to
tier
and values other thanfrontend
andbackend
, and all resources with no labels with thetier
key. - The third example selects all resources including a label with key
partition
; no values are checked. - The fourth example selects all resources without a label with key
partition
; no values are checked.
Similarly, the comma separator acts as an AND operator. So filtering resources with a partition key (no matter the value) and with environment different from qa can be achieved using partition,environment notin (qa).
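As an illustrative check, assuming some Pods in the current namespace carry a partition label, the same combined selector can be passed to kubectl:
kubectl get pods -l 'partition,environment notin (qa)'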
The set-based label selector is a general form of equality since environment=production is equivalent to environment in (production); similarly for != and notin.
Set-based requirements can be mixed with equality-based requirements. For example: partition in (customerA, customerB),environment!=qa.
API
LIST and WATCH filtering
For list and watch operations, you can specify label selectors to filter the sets of objects returned; you specify the filter using a query parameter. (To learn in detail about watches in Kubernetes, read efficient detection of changes). Both requirements are permitted (presented here as they would appear in a URL query string):
- equality-based requirements:
?labelSelector=environment%3Dproduction,tier%3Dfrontend
- set-based requirements:
?labelSelector=environment+in+%28production%2Cqa%29%2Ctier+in+%28frontend%29
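For example, here is a sketch of sending the equality-based query above directly to the API server with kubectl get --raw (the default namespace is assumed):
kubectl get --raw "/api/v1/namespaces/default/pods?labelSelector=environment%3Dproduction,tier%3Dfrontend"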
Both label selector styles can be used to list or watch resources via a REST client.
For example, targeting apiserver with kubectl and using equality-based requirements, one may write:
kubectl get pods -l environment=production,tier=frontend
or using set-based requirements:
kubectl get pods -l 'environment in (production),tier in (frontend)'
As already mentioned, set-based requirements are more expressive. For instance, they can implement the OR operator on values:
kubectl get pods -l 'environment in (production, qa)'
or restricting negative matching via notin operator:
kubectl get pods -l 'environment,environment notin (frontend)'
Set references in API objects
Some Kubernetes objects, such as services and replicationcontrollers, also use label selectors to specify sets of other resources, such as pods.
Service and ReplicationController
The set of pods that a service targets is defined with a label selector. Similarly, the population of pods that a replicationcontroller should manage is also defined with a label selector.
Label selectors for both objects are defined in json or yaml files using maps, and only equality-based requirement selectors are supported:
"selector": {
"component" : "redis",
}
or
selector:
  component: redis
This selector (respectively in json or yaml format) is equivalent to component=redis or component in (redis).
Resources that support set-based requirements
Newer resources, such as Job, Deployment, ReplicaSet, and DaemonSet, support set-based requirements as well.
selector:
  matchLabels:
    component: redis
  matchExpressions:
    - { key: tier, operator: In, values: [cache] }
    - { key: environment, operator: NotIn, values: [dev] }
matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent to an element of matchExpressions, whose key field is "key", the operator is "In", and the values array contains only "value".
matchExpressions is a list of pod selector requirements. Valid operators include In, NotIn, Exists, and DoesNotExist. The values set must be non-empty in the case of In and NotIn. All of the requirements, from both matchLabels and matchExpressions, are ANDed together -- they must all be satisfied in order to match.
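For intuition, the selector above matches the same Pods that the following label-selector string would select when listing with kubectl (assuming such Pods exist in the namespace):
kubectl get pods -l 'component=redis,tier in (cache),environment notin (dev)'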
Selecting sets of nodes
One use case for selecting over labels is to constrain the set of nodes onto which a pod can schedule. See the documentation on node selection for more information.
Using labels effectively
You can apply a single label to any resource, but this is not always the best practice. There are many scenarios where multiple labels should be used to distinguish resource sets from one another.
For instance, different applications would use different values for the app label, but a multi-tier application, such as the guestbook example, would additionally need to distinguish each tier. The frontend could carry the following labels:
labels:
  app: guestbook
  tier: frontend
while the Redis master and replica would have different tier labels, and perhaps even an additional role label:
labels:
  app: guestbook
  tier: backend
  role: master
and
labels:
  app: guestbook
  tier: backend
  role: replica
The labels allow for slicing and dicing the resources along any dimension specified by a label:
kubectl apply -f examples/guestbook/all-in-one/guestbook-all-in-one.yaml
kubectl get pods -Lapp -Ltier -Lrole
NAME                            READY   STATUS    RESTARTS   AGE   APP         TIER       ROLE
guestbook-fe-4nlpb              1/1     Running   0          1m    guestbook   frontend   <none>
guestbook-fe-ght6d              1/1     Running   0          1m    guestbook   frontend   <none>
guestbook-fe-jpy62              1/1     Running   0          1m    guestbook   frontend   <none>
guestbook-redis-master-5pg3b    1/1     Running   0          1m    guestbook   backend    master
guestbook-redis-replica-2q2yf   1/1     Running   0          1m    guestbook   backend    replica
guestbook-redis-replica-qgazl   1/1     Running   0          1m    guestbook   backend    replica
my-nginx-divi2                  1/1     Running   0          29m   nginx       <none>     <none>
my-nginx-o0ef1                  1/1     Running   0          29m   nginx       <none>     <none>
kubectl get pods -lapp=guestbook,role=replica
NAME                            READY   STATUS    RESTARTS   AGE
guestbook-redis-replica-2q2yf   1/1     Running   0          3m
guestbook-redis-replica-qgazl   1/1     Running   0          3m
Updating labels
Sometimes you may want to relabel existing pods and other resources before creating new resources. This can be done with kubectl label.
For example, if you want to label all your NGINX Pods as frontend tier, run:
kubectl label pods -l app=nginx tier=fe
pod/my-nginx-2035384211-j5fhi labeled
pod/my-nginx-2035384211-u2c7e labeled
pod/my-nginx-2035384211-u3t6x labeled
This first filters all pods with the label "app=nginx", and then labels them with the "tier=fe". To see the pods you labeled, run:
kubectl get pods -l app=nginx -L tier
NAME                        READY   STATUS    RESTARTS   AGE   TIER
my-nginx-2035384211-j5fhi   1/1     Running   0          23m   fe
my-nginx-2035384211-u2c7e   1/1     Running   0          23m   fe
my-nginx-2035384211-u3t6x   1/1     Running   0          23m   fe
This outputs all "app=nginx" pods, with an additional label column of pods' tier (specified with -L or --label-columns).
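If a pod already has a value for the tier key, changing that value requires the --overwrite flag; for example, relabeling the same set of NGINX Pods:
kubectl label pods -l app=nginx tier=frontend --overwrite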
For more information, please see kubectl label.
What's next
3.1.2.4 - Namespaces
In Kubernetes, namespaces provide a mechanism for isolating groups of resources within a single cluster. Names of resources need to be unique within a namespace, but not across namespaces. Namespace-based scoping is applicable only for namespaced objects (e.g. Deployments, Services, etc.) and not for cluster-wide objects (e.g. StorageClass, Nodes, PersistentVolumes, etc.).
When to Use Multiple Namespaces
Namespaces are intended for use in environments with many users spread across multiple teams, or projects. For clusters with a few to tens of users, you should not need to create or think about namespaces at all. Start using namespaces when you need the features they provide.
Namespaces provide a scope for names. Names of resources need to be unique within a namespace, but not across namespaces. Namespaces cannot be nested inside one another and each Kubernetes resource can only be in one namespace.
Namespaces are a way to divide cluster resources between multiple users (via resource quota).
It is not necessary to use multiple namespaces to separate slightly different resources, such as different versions of the same software: use labels to distinguish resources within the same namespace.
Note:
For a production cluster, consider not using the default namespace. Instead, make other namespaces and use those.
Initial namespaces
Kubernetes starts with four initial namespaces:
- default: Kubernetes includes this namespace so that you can start using your new cluster without first creating a namespace.
- kube-node-lease: This namespace holds Lease objects associated with each node. Node leases allow the kubelet to send heartbeats so that the control plane can detect node failure.
- kube-public: This namespace is readable by all clients (including those not authenticated). This namespace is mostly reserved for cluster usage, in case that some resources should be visible and readable publicly throughout the whole cluster. The public aspect of this namespace is only a convention, not a requirement.
- kube-system: The namespace for objects created by the Kubernetes system.
Working with Namespaces
Creation and deletion of namespaces are described in the Admin Guide documentation for namespaces.
Note:
Avoid creating namespaces with the prefix kube-, since it is reserved for Kubernetes system namespaces.
Viewing namespaces
You can list the current namespaces in a cluster using:
kubectl get namespace
NAME              STATUS   AGE
default           Active   1d
kube-node-lease   Active   1d
kube-public       Active   1d
kube-system       Active   1d
Setting the namespace for a request
To set the namespace for a current request, use the --namespace flag. For example:
kubectl run nginx --image=nginx --namespace=<insert-namespace-name-here>
kubectl get pods --namespace=<insert-namespace-name-here>
Setting the namespace preference
You can permanently save the namespace for all subsequent kubectl commands in that context.
kubectl config set-context --current --namespace=<insert-namespace-name-here>
# Validate it
kubectl config view --minify | grep namespace:
Namespaces and DNS
When you create a Service, it creates a corresponding DNS entry. This entry is of the form <service-name>.<namespace-name>.svc.cluster.local, which means that if a container only uses <service-name>, it will resolve to the service which is local to a namespace. This is useful for using the same configuration across multiple namespaces such as Development, Staging and Production. If you want to reach across namespaces, you need to use the fully qualified domain name (FQDN).
As a result, all namespace names must be valid RFC 1123 DNS labels.
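For illustration, assuming a Service named db in a namespace named staging (both names hypothetical), a client in a different namespace would use the fully qualified name:
db.staging.svc.cluster.local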
Warning:
By creating namespaces with the same name as public top-level domains, Services in these namespaces can have short DNS names that overlap with public DNS records. Workloads from any namespace performing a DNS lookup without a trailing dot will be redirected to those services, taking precedence over public DNS.
To mitigate this, limit privileges for creating namespaces to trusted users. If required, you could additionally configure third-party security controls, such as admission webhooks, to block creating any namespace with the name of public TLDs.
Not all objects are in a namespace
Most Kubernetes resources (e.g. pods, services, replication controllers, and others) are in some namespaces. However namespace resources are not themselves in a namespace. And low-level resources, such as nodes and persistentVolumes, are not in any namespace.
To see which Kubernetes resources are and aren't in a namespace:
# In a namespace
kubectl api-resources --namespaced=true
# Not in a namespace
kubectl api-resources --namespaced=false
Automatic labelling
Kubernetes 1.22 [stable]
The Kubernetes control plane sets an immutable label kubernetes.io/metadata.name on all namespaces. The value of the label is the namespace name.
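This label makes it possible to select a namespace by name with a label selector; for example, to match only the kube-system namespace:
kubectl get namespaces -l kubernetes.io/metadata.name=kube-system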
What's next
- Learn more about creating a new namespace.
- Learn more about deleting a namespace.
3.1.2.5 - Annotations
You can use Kubernetes annotations to attach arbitrary non-identifying metadata to objects. Clients such as tools and libraries can retrieve this metadata.
Attaching metadata to objects
You can use either labels or annotations to attach metadata to Kubernetes objects. Labels can be used to select objects and to find collections of objects that satisfy certain conditions. In contrast, annotations are not used to identify and select objects. The metadata in an annotation can be small or large, structured or unstructured, and can include characters not permitted by labels. It is possible to use labels as well as annotations in the metadata of the same object.
Annotations, like labels, are key/value maps:
"metadata": {
"annotations": {
"key1" : "value1",
"key2" : "value2"
}
}
Note:
The keys and the values in the map must be strings. In other words, you cannot use numeric, boolean, list or other types for either the keys or the values.
Here are some examples of information that could be recorded in annotations:
Fields managed by a declarative configuration layer. Attaching these fields as annotations distinguishes them from default values set by clients or servers, and from auto-generated fields and fields set by auto-sizing or auto-scaling systems.
Build, release, or image information like timestamps, release IDs, git branch, PR numbers, image hashes, and registry address.
Pointers to logging, monitoring, analytics, or audit repositories.
Client library or tool information that can be used for debugging purposes: for example, name, version, and build information.
User or tool/system provenance information, such as URLs of related objects from other ecosystem components.
Lightweight rollout tool metadata: for example, config or checkpoints.
Phone or pager numbers of persons responsible, or directory entries that specify where that information can be found, such as a team web site.
Directives from the end-user to the implementations to modify behavior or engage non-standard features.
Instead of using annotations, you could store this type of information in an external database or directory, but that would make it much harder to produce shared client libraries and tools for deployment, management, introspection, and the like.
Syntax and character set
Annotations are key/value pairs. Valid annotation keys have two segments: an optional prefix and name, separated by a slash (/). The name segment is required and must be 63 characters or less, beginning and ending with an alphanumeric character ([a-z0-9A-Z]) with dashes (-), underscores (_), dots (.), and alphanumerics between. The prefix is optional. If specified, the prefix must be a DNS subdomain: a series of DNS labels separated by dots (.), not longer than 253 characters in total, followed by a slash (/).
If the prefix is omitted, the annotation key is presumed to be private to the user. Automated system components (e.g. kube-scheduler, kube-controller-manager, kube-apiserver, kubectl, or other third-party automation) which add annotations to end-user objects must specify a prefix.
The kubernetes.io/ and k8s.io/ prefixes are reserved for Kubernetes core components.
For example, here's a manifest for a Pod that has the annotation imageregistry: https://hub.docker.com/:
apiVersion: v1
kind: Pod
metadata:
  name: annotations-demo
  annotations:
    imageregistry: "https://hub.docker.com/"
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
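You can also add or change an annotation on an existing object with kubectl annotate; as a sketch, using the Pod above and a hypothetical prefixed key:
kubectl annotate pods annotations-demo example.com/build-id="2024-11-02"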
What's next
- Learn more about Labels and Selectors.
- Find Well-known labels, Annotations and Taints
3.1.2.6 - Field Selectors
Field selectors let you select Kubernetes objects based on the value of one or more resource fields. Here are some examples of field selector queries:
metadata.name=my-service
metadata.namespace!=default
status.phase=Pending
This kubectl command selects all Pods for which the value of the status.phase field is Running:
kubectl get pods --field-selector status.phase=Running
Note:
Field selectors are essentially resource filters. By default, no selectors/filters are applied, meaning that all resources of the specified type are selected. This makes the kubectl queries kubectl get pods and kubectl get pods --field-selector "" equivalent.
Supported fields
Supported field selectors vary by Kubernetes resource type. All resource types support the metadata.name and metadata.namespace fields. Using unsupported field selectors produces an error. For example:
kubectl get ingress --field-selector foo.bar=baz
Error from server (BadRequest): Unable to find "ingresses" that match label selector "", field selector "foo.bar=baz": "foo.bar" is not a known field selector: only "metadata.name", "metadata.namespace"
List of supported fields
Kind | Fields |
---|---|
Pod | spec.nodeName spec.restartPolicy spec.schedulerName spec.serviceAccountName spec.hostNetwork status.phase status.podIP status.nominatedNodeName |
Event | involvedObject.kind involvedObject.namespace involvedObject.name involvedObject.uid involvedObject.apiVersion involvedObject.resourceVersion involvedObject.fieldPath reason reportingComponent source type |
Secret | type |
Namespace | status.phase |
ReplicaSet | status.replicas |
ReplicationController | status.replicas |
Job | status.successful |
Node | spec.unschedulable |
CertificateSigningRequest | spec.signerName |
Custom resources fields
All custom resource types support the metadata.name and metadata.namespace fields. Additionally, the spec.versions[*].selectableFields field of a CustomResourceDefinition declares which other fields in a custom resource may be used in field selectors. See selectable fields for custom resources for more information about how to use field selectors with CustomResourceDefinitions.
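As a rough sketch of what that declaration looks like, a CustomResourceDefinition version might expose a field for selection like this (the field name .spec.color is hypothetical):
# Excerpt from a hypothetical CustomResourceDefinition
spec:
  versions:
  - name: v1
    served: true
    storage: true
    selectableFields:
    - jsonPath: .spec.color   # lets clients filter with --field-selector spec.color=...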
Supported operators
You can use the =, ==, and != operators with field selectors (= and == mean the same thing). This kubectl command, for example, selects all Kubernetes Services that aren't in the default namespace:
kubectl get services --all-namespaces --field-selector metadata.namespace!=default
Chained selectors
As with label and other selectors, field selectors can be chained together as a comma-separated list. This kubectl command selects all Pods for which the status.phase does not equal Running and the spec.restartPolicy field equals Always:
kubectl get pods --field-selector=status.phase!=Running,spec.restartPolicy=Always
Multiple resource types
You can use field selectors across multiple resource types. This kubectl command selects all StatefulSets and Services that are not in the default namespace:
kubectl get statefulsets,services --all-namespaces --field-selector metadata.namespace!=default
3.1.2.7 - Finalizers
Finalizers are namespaced keys that tell Kubernetes to wait until specific conditions are met before it fully deletes resources marked for deletion. Finalizers alert controllers to clean up resources the deleted object owned.
When you tell Kubernetes to delete an object that has finalizers specified for it, the Kubernetes API marks the object for deletion by populating .metadata.deletionTimestamp, and returns a 202 status code (HTTP "Accepted"). The target object remains in a terminating state while the control plane, or other components, take the actions defined by the finalizers. After these actions are complete, the controller removes the relevant finalizers from the target object. When the metadata.finalizers field is empty, Kubernetes considers the deletion complete and deletes the object.
You can use finalizers to control garbage collection of resources by alerting controllers to perform specific cleanup tasks before deleting the target resource. For example, you can define a finalizer to clean up related resources or infrastructure before the controller deletes the target object.
Finalizers don't usually specify the code to execute. Instead, they are typically lists of keys on a specific resource similar to annotations. Kubernetes specifies some finalizers automatically, but you can also specify your own.
How finalizers work
When you create a resource using a manifest file, you can specify finalizers in the metadata.finalizers field. When you attempt to delete the resource, the API server handling the delete request notices the values in the finalizers field and does the following:
- Modifies the object to add a metadata.deletionTimestamp field with the time you started the deletion.
- Prevents the object from being removed until all items are removed from its metadata.finalizers field.
- Returns a 202 status code (HTTP "Accepted").
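For reference, here is a minimal sketch of a manifest that sets a finalizer; the ConfigMap name and the finalizer key are hypothetical, and a controller you run would be responsible for removing the key once its cleanup is done:
apiVersion: v1
kind: ConfigMap
metadata:
  name: demo-config                  # hypothetical object
  finalizers:
    - example.com/cleanup-external   # hypothetical finalizer key handled by your own controller
data:
  setting: "true"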
The controller managing that finalizer notices the update to the object setting the metadata.deletionTimestamp, indicating deletion of the object has been requested. The controller then attempts to satisfy the requirements of the finalizers specified for that resource. Each time a finalizer condition is satisfied, the controller removes that key from the resource's finalizers field. When the finalizers field is emptied, an object with a deletionTimestamp field set is automatically deleted. You can also use finalizers to prevent deletion of unmanaged resources.
A common example of a finalizer is kubernetes.io/pv-protection, which prevents accidental deletion of PersistentVolume objects. When a PersistentVolume object is in use by a Pod, Kubernetes adds the pv-protection finalizer. If you try to delete the PersistentVolume, it enters a Terminating status, but the controller can't delete it because the finalizer exists. When the Pod stops using the PersistentVolume, Kubernetes clears the pv-protection finalizer, and the controller deletes the volume.
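To see which finalizers are currently set on an object, you can read the field directly; for example, for a hypothetical PersistentVolume named task-pv:
kubectl get persistentvolume task-pv -o jsonpath='{.metadata.finalizers}'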
Note:
When you DELETE an object, Kubernetes adds the deletion timestamp for that object and then immediately starts to restrict changes to the .metadata.finalizers field for the object that is now pending deletion. You can remove existing finalizers (deleting an entry from the finalizers list) but you cannot add a new finalizer. You also cannot modify the deletionTimestamp for an object once it is set.
After the deletion is requested, you cannot resurrect this object. The only way is to delete it and make a new similar object.
Owner references, labels, and finalizers
Like labels, owner references describe the relationships between objects in Kubernetes, but are used for a different purpose. When a controller manages objects like Pods, it uses labels to track changes to groups of related objects. For example, when a Job creates one or more Pods, the Job controller applies labels to those pods and tracks changes to any Pods in the cluster with the same label.
The Job controller also adds owner references to those Pods, pointing at the Job that created the Pods. If you delete the Job while these Pods are running, Kubernetes uses the owner references (not labels) to determine which Pods in the cluster need cleanup.
Kubernetes also processes finalizers when it identifies owner references on a resource targeted for deletion.
In some situations, finalizers can block the deletion of dependent objects, which can cause the targeted owner object to remain for longer than expected without being fully deleted. In these situations, you should check finalizers and owner references on the target owner and dependent objects to troubleshoot the cause.
Note:
In cases where objects are stuck in a deleting state, avoid manually removing finalizers to allow deletion to continue. Finalizers are usually added to resources for a reason, so forcefully removing them can lead to issues in your cluster. This should only be done when the purpose of the finalizer is understood and is accomplished in another way (for example, manually cleaning up some dependent object).
What's next
- Read Using Finalizers to Control Deletion on the Kubernetes blog.
3.1.2.8 - Owners and Dependents
In Kubernetes, some objects are owners of other objects. For example, a ReplicaSet is the owner of a set of Pods. These owned objects are dependents of their owner.
Ownership is different from the labels and selectors mechanism that some resources also use. For example, consider a Service that creates EndpointSlice objects. The Service uses labels to allow the control plane to determine which EndpointSlice objects are used for that Service. In addition to the labels, each EndpointSlice that is managed on behalf of a Service has an owner reference. Owner references help different parts of Kubernetes avoid interfering with objects they don't control.
Owner references in object specifications
Dependent objects have a metadata.ownerReferences field that references their owner object. A valid owner reference consists of the object name and a UID within the same namespace as the dependent object. Kubernetes sets the value of this field automatically for objects that are dependents of other objects like ReplicaSets, DaemonSets, Deployments, Jobs and CronJobs, and ReplicationControllers. You can also configure these relationships manually by changing the value of this field. However, you usually don't need to and can allow Kubernetes to automatically manage the relationships.
Dependent objects also have an ownerReferences.blockOwnerDeletion field that takes a boolean value and controls whether specific dependents can block garbage collection from deleting their owner object. Kubernetes automatically sets this field to true if a controller (for example, the Deployment controller) sets the value of the metadata.ownerReferences field. You can also set the value of the blockOwnerDeletion field manually to control which dependents block garbage collection.
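As an illustration, an owner reference on a dependent object looks roughly like the following excerpt; the ReplicaSet name and UID shown are placeholders:
# Excerpt from a dependent Pod's metadata
metadata:
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: my-replicaset                          # placeholder owner name
    uid: d9607e19-f88f-11e6-a518-42010a800195    # placeholder UID
    controller: true
    blockOwnerDeletion: true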
A Kubernetes admission controller controls user access to change this field for dependent resources, based on the delete permissions of the owner. This control prevents unauthorized users from delaying owner object deletion.
Note:
Cross-namespace owner references are disallowed by design. Namespaced dependents can specify cluster-scoped or namespaced owners. A namespaced owner must exist in the same namespace as the dependent. If it does not, the owner reference is treated as absent, and the dependent is subject to deletion once all owners are verified absent.
Cluster-scoped dependents can only specify cluster-scoped owners. In v1.20+, if a cluster-scoped dependent specifies a namespaced kind as an owner, it is treated as having an unresolvable owner reference, and is not able to be garbage collected.
In v1.20+, if the garbage collector detects an invalid cross-namespace ownerReference, or a cluster-scoped dependent with an ownerReference referencing a namespaced kind, a warning Event with a reason of OwnerRefInvalidNamespace and an involvedObject of the invalid dependent is reported. You can check for that kind of Event by running kubectl get events -A --field-selector=reason=OwnerRefInvalidNamespace.
Ownership and finalizers
When you tell Kubernetes to delete a resource, the API server allows the managing controller to process any finalizer rules for the resource. Finalizers prevent accidental deletion of resources your cluster may still need to function correctly. For example, if you try to delete a PersistentVolume that is still in use by a Pod, the deletion does not happen immediately because the PersistentVolume has the kubernetes.io/pv-protection finalizer on it. Instead, the volume remains in the Terminating status until Kubernetes clears the finalizer, which only happens after the PersistentVolume is no longer bound to a Pod.
Kubernetes also adds finalizers to an owner resource when you use either foreground or orphan cascading deletion. In foreground deletion, it adds the foreground finalizer so that the controller must delete dependent resources that also have ownerReferences.blockOwnerDeletion=true before it deletes the owner. If you specify an orphan deletion policy, Kubernetes adds the orphan finalizer so that the controller ignores dependent resources after it deletes the owner object.
What's next
- Learn more about Kubernetes finalizers.
- Learn about garbage collection.
- Read the API reference for object metadata.
3.1.2.9 - Recommended Labels
You can visualize and manage Kubernetes objects with more tools than kubectl and the dashboard. A common set of labels allows tools to work interoperably, describing objects in a common manner that all tools can understand.
In addition to supporting tooling, the recommended labels describe applications in a way that can be queried.
The metadata is organized around the concept of an application. Kubernetes is not a platform as a service (PaaS) and doesn't have or enforce a formal notion of an application. Instead, applications are informal and described with metadata. The definition of what an application contains is loose.
Note:
These are recommended labels. They make it easier to manage applications but aren't required for any core tooling.
Shared labels and annotations share a common prefix: app.kubernetes.io. Labels without a prefix are private to users. The shared prefix ensures that shared labels do not interfere with custom user labels.
Labels
In order to take full advantage of using these labels, they should be applied on every resource object.
Key | Description | Example | Type |
---|---|---|---|
app.kubernetes.io/name | The name of the application | mysql | string |
app.kubernetes.io/instance | A unique name identifying the instance of an application | mysql-abcxyz | string |
app.kubernetes.io/version | The current version of the application (e.g., a SemVer 1.0, revision hash, etc.) | 5.7.21 | string |
app.kubernetes.io/component | The component within the architecture | database | string |
app.kubernetes.io/part-of | The name of a higher level application this one is part of | wordpress | string |
app.kubernetes.io/managed-by | The tool being used to manage the operation of an application | Helm | string |
To illustrate these labels in action, consider the following StatefulSet object:
# This is an excerpt
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: mysql
    app.kubernetes.io/instance: mysql-abcxyz
    app.kubernetes.io/version: "5.7.21"
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: wordpress
    app.kubernetes.io/managed-by: Helm
Applications And Instances Of Applications
An application can be installed one or more times into a Kubernetes cluster and, in some cases, the same namespace. For example, WordPress can be installed more than once where different websites are different installations of WordPress.
The name of an application and the instance name are recorded separately. For example, WordPress has an app.kubernetes.io/name of wordpress while it has an instance name, represented as app.kubernetes.io/instance with a value of wordpress-abcxyz. This enables the application and instance of the application to be identifiable. Every instance of an application must have a unique name.
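Because the instance label is unique per installation, it is a convenient handle for inspecting everything that belongs to one deployment; for example (instance name hypothetical):
kubectl get all -l app.kubernetes.io/instance=wordpress-abcxyz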
Examples
To illustrate different ways to use these labels the following examples have varying complexity.
A Simple Stateless Service
Consider the case for a simple stateless service deployed using Deployment and Service objects. The following two snippets represent how the labels could be used in their simplest form.
The Deployment is used to oversee the pods running the application itself.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: myservice
    app.kubernetes.io/instance: myservice-abcxyz
...
The Service is used to expose the application.
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: myservice
    app.kubernetes.io/instance: myservice-abcxyz
...
Web Application With A Database
Consider a slightly more complicated application: a web application (WordPress) using a database (MySQL), installed using Helm. The following snippets illustrate the start of objects used to deploy this application.
The start to the following Deployment is used for WordPress:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: wordpress
    app.kubernetes.io/instance: wordpress-abcxyz
    app.kubernetes.io/version: "4.9.4"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: server
    app.kubernetes.io/part-of: wordpress
...
The Service is used to expose WordPress:
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: wordpress
    app.kubernetes.io/instance: wordpress-abcxyz
    app.kubernetes.io/version: "4.9.4"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: server
    app.kubernetes.io/part-of: wordpress
...
MySQL is exposed as a StatefulSet with metadata for both it and the larger application it belongs to:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: mysql
    app.kubernetes.io/instance: mysql-abcxyz
    app.kubernetes.io/version: "5.7.21"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: wordpress
...
The Service is used to expose MySQL as part of WordPress:
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: mysql
    app.kubernetes.io/instance: mysql-abcxyz
    app.kubernetes.io/version: "5.7.21"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: wordpress
...
With the MySQL StatefulSet and Service you'll notice information about both MySQL and WordPress, the broader application, is included.
3.1.3 - The Kubernetes API
The core of Kubernetes' control plane is the API server. The API server exposes an HTTP API that lets end users, different parts of your cluster, and external components communicate with one another.
The Kubernetes API lets you query and manipulate the state of API objects in Kubernetes (for example: Pods, Namespaces, ConfigMaps, and Events).
Most operations can be performed through the kubectl command-line interface or other command-line tools, such as kubeadm, which in turn use the API. However, you can also access the API directly using REST calls. Kubernetes provides a set of client libraries for those looking to write applications using the Kubernetes API.
Each Kubernetes cluster publishes the specification of the APIs that the cluster serves. There are two mechanisms that Kubernetes uses to publish these API specifications; both are useful to enable automatic interoperability. For example, the kubectl tool fetches and caches the API specification for enabling command-line completion and other features. The two supported mechanisms are as follows:
The Discovery API provides information about the Kubernetes APIs: API names, resources, versions, and supported operations. This is a Kubernetes specific term as it is a separate API from the Kubernetes OpenAPI. It is intended to be a brief summary of the available resources and it does not detail specific schema for the resources. For reference about resource schemas, please refer to the OpenAPI document.
The Kubernetes OpenAPI Document provides (full) OpenAPI v2.0 and 3.0 schemas for all Kubernetes API endpoints. OpenAPI v3 is the preferred method for accessing OpenAPI as it provides a more comprehensive and accurate view of the API. It includes all the available API paths, as well as all resources consumed and produced for every operation on every endpoint. It also includes any extensibility components that a cluster supports. The data is a complete specification and is significantly larger than that from the Discovery API.
Discovery API
Kubernetes publishes a list of all group versions and resources supported via the Discovery API. This includes the following for each resource:
- Name
- Cluster or namespaced scope
- Endpoint URL and supported verbs
- Alternative names
- Group, version, kind
The API is available in both aggregated and unaggregated form. The aggregated discovery serves two endpoints, while the unaggregated discovery serves a separate endpoint for each group version.
Aggregated discovery
Kubernetes v1.30 [stable] (enabled by default: true)
Kubernetes offers stable support for aggregated discovery, publishing all resources supported by a cluster through two endpoints (/api and /apis). Requesting these endpoints drastically reduces the number of requests sent to fetch the discovery data from the cluster. You can access the data by requesting the respective endpoints with an Accept header indicating the aggregated discovery resource: Accept: application/json;v=v2;g=apidiscovery.k8s.io;as=APIGroupDiscoveryList.
Without indicating the resource type using the Accept header, the default response for the /api and /apis endpoints is an unaggregated discovery document.
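A quick way to try this from a workstation is to proxy the API server locally with kubectl and set the Accept header yourself; the local port below is arbitrary:
kubectl proxy --port=8001 &
curl -H 'Accept: application/json;v=v2;g=apidiscovery.k8s.io;as=APIGroupDiscoveryList' http://localhost:8001/apis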
The discovery document for the built-in resources can be found in the Kubernetes GitHub repository. This GitHub document can be used as a reference of the base set of the available resources if a Kubernetes cluster is not available to query.
The endpoint also supports ETag and protobuf encoding.
Unaggregated discovery
Without discovery aggregation, discovery is published in levels, with the root endpoints publishing discovery information for downstream documents.
A list of all group versions supported by a cluster is published at the /api and /apis endpoints. Example:
{
  "kind": "APIGroupList",
  "apiVersion": "v1",
  "groups": [
    {
      "name": "apiregistration.k8s.io",
      "versions": [
        {
          "groupVersion": "apiregistration.k8s.io/v1",
          "version": "v1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "apiregistration.k8s.io/v1",
        "version": "v1"
      }
    },
    {
      "name": "apps",
      "versions": [
        {
          "groupVersion": "apps/v1",
          "version": "v1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "apps/v1",
        "version": "v1"
      }
    },
    ...
  ]
}
Additional requests are needed to obtain the discovery document for each group version at /apis/<group>/<version> (for example: /apis/rbac.authorization.k8s.io/v1alpha1), which advertises the list of resources served under a particular group version. These endpoints are used by kubectl to fetch the list of resources supported by a cluster.
OpenAPI interface definition
For details about the OpenAPI specifications, see the OpenAPI documentation.
Kubernetes serves both OpenAPI v2.0 and OpenAPI v3.0. OpenAPI v3 is the preferred method of accessing the OpenAPI because it offers a more comprehensive (lossless) representation of Kubernetes resources. Due to limitations of OpenAPI version 2, certain fields are dropped from the published OpenAPI, including but not limited to default, nullable, and oneOf.
OpenAPI V2
The Kubernetes API server serves an aggregated OpenAPI v2 spec via the /openapi/v2 endpoint. You can request the response format using request headers as follows:
Header | Possible values | Notes
---|---|---
Accept-Encoding | gzip | not supplying this header is also acceptable
Accept | application/com.github.proto-openapi.spec.v2@v1.0+protobuf | mainly for intra-cluster use
Accept | application/json | default
Accept | * | serves application/json
Warning:
The validation rules published as part of OpenAPI schemas may not be complete, and usually aren't. Additional validation occurs within the API server. If you want precise and complete verification, kubectl apply --dry-run=server runs all the applicable validation (and also activates admission-time checks).
OpenAPI V3
Kubernetes v1.27 [stable] (enabled by default: true)
Kubernetes supports publishing a description of its APIs as OpenAPI v3.
A discovery endpoint /openapi/v3 is provided to see a list of all group/versions available. This endpoint only returns JSON. These group/versions are provided in the following format:
{
  "paths": {
    ...,
    "api/v1": {
      "serverRelativeURL": "/openapi/v3/api/v1?hash=CC0E9BFD992D8C59AEC98A1E2336F899E8318D3CF4C68944C3DEC640AF5AB52D864AC50DAA8D145B3494F75FA3CFF939FCBDDA431DAD3CA79738B297795818CF"
    },
    "apis/admissionregistration.k8s.io/v1": {
      "serverRelativeURL": "/openapi/v3/apis/admissionregistration.k8s.io/v1?hash=E19CC93A116982CE5422FC42B590A8AFAD92CDE9AE4D59B5CAAD568F083AD07946E6CB5817531680BCE6E215C16973CD39003B0425F3477CFD854E89A9DB6597"
    },
    ....
  }
}
The relative URLs are pointing to immutable OpenAPI descriptions, in order to improve client-side caching. The proper HTTP caching headers are also set by the API server for that purpose (Expires to 1 year in the future, and Cache-Control to immutable). When an obsolete URL is used, the API server returns a redirect to the newest URL.
The Kubernetes API server publishes an OpenAPI v3 spec per Kubernetes group version at the /openapi/v3/apis/<group>/<version>?hash=<hash> endpoint. Refer to the table below for accepted request headers.
Header | Possible values | Notes
---|---|---
Accept-Encoding | gzip | not supplying this header is also acceptable
Accept | application/com.github.proto-openapi.spec.v3@v1.0+protobuf | mainly for intra-cluster use
Accept | application/json | default
Accept | * | serves application/json
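To explore these endpoints from the command line, you can ask the API server directly; the group/version in the second command is just an example:
kubectl get --raw /openapi/v3
kubectl get --raw /openapi/v3/apis/apps/v1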
A Golang implementation to fetch the OpenAPI V3 is provided in the package k8s.io/client-go/openapi3.
Kubernetes 1.32 publishes OpenAPI v2.0 and v3.0; there are no plans to support 3.1 in the near future.
Protobuf serialization
Kubernetes implements an alternative Protobuf based serialization format that is primarily intended for intra-cluster communication. For more information about this format, see the Kubernetes Protobuf serialization design proposal and the Interface Definition Language (IDL) files for each schema located in the Go packages that define the API objects.
Persistence
Kubernetes stores the serialized state of objects by writing them into etcd.
API groups and versioning
To make it easier to eliminate fields or restructure resource representations, Kubernetes supports multiple API versions, each at a different API path, such as /api/v1 or /apis/rbac.authorization.k8s.io/v1alpha1.
Versioning is done at the API level rather than at the resource or field level to ensure that the API presents a clear, consistent view of system resources and behavior, and to enable controlling access to end-of-life and/or experimental APIs.
To make it easier to evolve and to extend its API, Kubernetes implements API groups that can be enabled or disabled.
API resources are distinguished by their API group, resource type, namespace (for namespaced resources), and name. The API server handles the conversion between API versions transparently: all the different versions are actually representations of the same persisted data. The API server may serve the same underlying data through multiple API versions.
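You can see which API groups, versions, and resources your own cluster serves with the standard discovery commands:
kubectl api-versions
kubectl api-resources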
For example, suppose there are two API versions, v1 and v1beta1, for the same resource. If you originally created an object using the v1beta1 version of its API, you can later read, update, or delete that object using either the v1beta1 or the v1 API version, until the v1beta1 version is deprecated and removed. At that point you can continue accessing and modifying the object using the v1 API.
API changes
Any system that is successful needs to grow and change as new use cases emerge or existing ones change. Therefore, Kubernetes has designed the Kubernetes API to continuously change and grow. The Kubernetes project aims to not break compatibility with existing clients, and to maintain that compatibility for a length of time so that other projects have an opportunity to adapt.
In general, new API resources and new resource fields can be added frequently. Elimination of resources or fields requires following the API deprecation policy.
Kubernetes makes a strong commitment to maintain compatibility for official Kubernetes APIs once they reach general availability (GA), typically at API version v1. Additionally, Kubernetes maintains compatibility with data persisted via beta API versions of official Kubernetes APIs, and ensures that data can be converted and accessed via GA API versions when the feature goes stable.
If you adopt a beta API version, you will need to transition to a subsequent beta or stable API version once the API graduates. The best time to do this is while the beta API is in its deprecation period, since objects are simultaneously accessible via both API versions. Once the beta API completes its deprecation period and is no longer served, the replacement API version must be used.
Note:
Although Kubernetes also aims to maintain compatibility for alpha API versions, in some circumstances this is not possible. If you use any alpha API versions, check the release notes for Kubernetes when upgrading your cluster, in case the API did change in incompatible ways that require deleting all existing alpha objects prior to upgrade.
Refer to API versions reference for more details on the API version level definitions.
API Extension
The Kubernetes API can be extended in one of two ways:
- Custom resources let you declaratively define how the API server should provide your chosen resource API.
- You can also extend the Kubernetes API by implementing an aggregation layer.
What's next
- Learn how to extend the Kubernetes API by adding your own CustomResourceDefinition.
- Controlling Access To The Kubernetes API describes how the cluster manages authentication and authorization for API access.
- Learn about API endpoints, resource types and samples by reading API Reference.
- Learn about what constitutes a compatible change, and how to change the API, from API changes.
3.2 - Cluster Architecture
A Kubernetes cluster consists of a control plane plus a set of worker machines, called nodes, that run containerized applications. Every cluster needs at least one worker node in order to run Pods.
The worker node(s) host the Pods that are the components of the application workload. The control plane manages the worker nodes and the Pods in the cluster. In production environments, the control plane usually runs across multiple computers and a cluster usually runs multiple nodes, providing fault-tolerance and high availability.
This document outlines the various components you need to have for a complete and working Kubernetes cluster.
About this architecture
The diagram in Figure 1 presents an example reference architecture for a Kubernetes cluster. The actual distribution of components can vary based on specific cluster setups and requirements.
In the diagram, each node runs the kube-proxy component. You need a network proxy component on each node to ensure that the Service API and associated behaviors are available on your cluster network. However, some network plugins provide their own, third party implementation of proxying. When you use that kind of network plugin, the node does not need to run kube-proxy.
Control plane components
The control plane's components make global decisions about the cluster (for example, scheduling), as well as detecting and responding to cluster events (for example, starting up a new pod when a Deployment's replicas field is unsatisfied).
Control plane components can be run on any machine in the cluster. However, for simplicity, setup scripts typically start all control plane components on the same machine, and do not run user containers on this machine. See Creating Highly Available clusters with kubeadm for an example control plane setup that runs across multiple machines.
kube-apiserver
The API server is a component of the Kubernetes control plane that exposes the Kubernetes API. The API server is the front end for the Kubernetes control plane.
The main implementation of a Kubernetes API server is kube-apiserver. kube-apiserver is designed to scale horizontally—that is, it scales by deploying more instances. You can run several instances of kube-apiserver and balance traffic between those instances.
etcd
Consistent and highly-available key value store used as Kubernetes' backing store for all cluster data.
If your Kubernetes cluster uses etcd as its backing store, make sure you have a back up plan for the data.
You can find in-depth information about etcd in the official documentation.
kube-scheduler
Control plane component that watches for newly created Pods with no assigned node, and selects a node for them to run on.
Factors taken into account for scheduling decisions include: individual and collective resource requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, and deadlines.
kube-controller-manager
Control plane component that runs controller processes.
Logically, each controller is a separate process, but to reduce complexity, they are all compiled into a single binary and run in a single process.
There are many different types of controllers. Some examples of them are:
- Node controller: Responsible for noticing and responding when nodes go down.
- Job controller: Watches for Job objects that represent one-off tasks, then creates Pods to run those tasks to completion.
- EndpointSlice controller: Populates EndpointSlice objects (to provide a link between Services and Pods).
- ServiceAccount controller: Creates default ServiceAccounts for new namespaces.
The above is not an exhaustive list.
cloud-controller-manager
A Kubernetes control plane component that embeds cloud-specific control logic. The cloud controller manager lets you link your cluster into your cloud provider's API, and separates out the components that interact with that cloud platform from components that only interact with your cluster.
The cloud-controller-manager only runs controllers that are specific to your cloud provider. If you are running Kubernetes on your own premises, or in a learning environment inside your own PC, the cluster does not have a cloud controller manager.
As with the kube-controller-manager, the cloud-controller-manager combines several logically independent control loops into a single binary that you run as a single process. You can scale horizontally (run more than one copy) to improve performance or to help tolerate failures.
The following controllers can have cloud provider dependencies:
- Node controller: For checking the cloud provider to determine if a node has been deleted in the cloud after it stops responding
- Route controller: For setting up routes in the underlying cloud infrastructure
- Service controller: For creating, updating and deleting cloud provider load balancers
Node components
Node components run on every node, maintaining running pods and providing the Kubernetes runtime environment.
kubelet
An agent that runs on each node in the cluster. It makes sure that containers are running in a Pod.
The kubelet takes a set of PodSpecs that are provided through various mechanisms and ensures that the containers described in those PodSpecs are running and healthy. The kubelet doesn't manage containers which were not created by Kubernetes.
kube-proxy (optional)
kube-proxy is a network proxy that runs on each node in your cluster, implementing part of the Kubernetes Service concept.
kube-proxy maintains network rules on nodes. These network rules allow network communication to your Pods from network sessions inside or outside of your cluster.
kube-proxy uses the operating system packet filtering layer if there is one and it's available. Otherwise, kube-proxy forwards the traffic itself.
If you use a network plugin that implements packet forwarding for Services by itself, providing equivalent behavior to kube-proxy, then you do not need to run kube-proxy on the nodes in your cluster.
Container runtime
A fundamental component that empowers Kubernetes to run containers effectively. It is responsible for managing the execution and lifecycle of containers within the Kubernetes environment.
Kubernetes supports container runtimes such as containerd, CRI-O, and any other implementation of the Kubernetes CRI (Container Runtime Interface).
Addons
Addons use Kubernetes resources (DaemonSet, Deployment, etc) to implement cluster features. Because these provide cluster-level features, namespaced resources for addons belong within the kube-system namespace.
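As a quick check, you can list the system and addon components running in that namespace (the exact set varies by cluster and installed addons):
kubectl get pods -n kube-system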
Selected addons are described below; for an extended list of available addons, please see Addons.
DNS
While the other addons are not strictly required, all Kubernetes clusters should have cluster DNS, as many examples rely on it.
Cluster DNS is a DNS server, in addition to the other DNS server(s) in your environment, which serves DNS records for Kubernetes services.
Containers started by Kubernetes automatically include this DNS server in their DNS searches.
Web UI (Dashboard)
Dashboard is a general purpose, web-based UI for Kubernetes clusters. It allows users to manage and troubleshoot applications running in the cluster, as well as the cluster itself.
Container resource monitoring
Container Resource Monitoring records generic time-series metrics about containers in a central database, and provides a UI for browsing that data.
Cluster-level Logging
A cluster-level logging mechanism is responsible for saving container logs to a central log store with a search/browsing interface.
Network plugins
Network plugins are software components that implement the container network interface (CNI) specification. They are responsible for allocating IP addresses to pods and enabling them to communicate with each other within the cluster.
Architecture variations
While the core components of Kubernetes remain consistent, the way they are deployed and managed can vary. Understanding these variations is crucial for designing and maintaining Kubernetes clusters that meet specific operational needs.
Control plane deployment options
The control plane components can be deployed in several ways:
- Traditional deployment: Control plane components run directly on dedicated machines or VMs, often managed as systemd services.
- Static Pods: Control plane components are deployed as static Pods, managed by the kubelet on specific nodes. This is a common approach used by tools like kubeadm (see the example after this list).
- Self-hosted: The control plane runs as Pods within the Kubernetes cluster itself, managed by Deployments and StatefulSets or other Kubernetes primitives.
- Managed Kubernetes services: Cloud providers often abstract away the control plane, managing its components as part of their service offering.
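For the static Pod approach, the kubelet reads manifests from a directory on the node; on a kubeadm-provisioned control plane node this is typically /etc/kubernetes/manifests, though the path is configurable:
ls /etc/kubernetes/manifests
# typical kubeadm output: etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml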
Workload placement considerations
The placement of workloads, including the control plane components, can vary based on cluster size, performance requirements, and operational policies:
- In smaller or development clusters, control plane components and user workloads might run on the same nodes.
- Larger production clusters often dedicate specific nodes to control plane components, separating them from user workloads.
- Some organizations run critical add-ons or monitoring tools on control plane nodes.
Cluster management tools
Tools like kubeadm, kops, and Kubespray offer different approaches to deploying and managing clusters, each with its own method of component layout and management.
Customization and extensibility
Kubernetes architecture allows for significant customization:
- Custom schedulers can be deployed to work alongside the default Kubernetes scheduler or to replace it entirely.
- API servers can be extended with CustomResourceDefinitions and API Aggregation.
- Cloud providers can integrate deeply with Kubernetes using the cloud-controller-manager.
The flexibility of Kubernetes architecture allows organizations to tailor their clusters to specific needs, balancing factors such as operational complexity, performance, and management overhead.
What's next
Learn more about the following:
- Nodes and their communication with the control plane.
- Kubernetes controllers.
- kube-scheduler which is the default scheduler for Kubernetes.
- Etcd's official documentation.
- Several container runtimes in Kubernetes.
- Integrating with cloud providers using cloud-controller-manager.
- kubectl commands.
3.2.1 - Nodes
Kubernetes runs your workload by placing containers into Pods to run on Nodes. A node may be a virtual or physical machine, depending on the cluster. Each node is managed by the control plane and contains the services necessary to run Pods.
Typically you have several nodes in a cluster; in a learning or resource-limited environment, you might have only one node.
The components on a node include the kubelet, a container runtime, and the kube-proxy.
Management
There are two main ways to have Nodes added to the API server:
- The kubelet on a node self-registers to the control plane
- You (or another human user) manually add a Node object
After you create a Node object, or the kubelet on a node self-registers, the control plane checks whether the new Node object is valid. For example, if you try to create a Node from the following JSON manifest:
{
  "kind": "Node",
  "apiVersion": "v1",
  "metadata": {
    "name": "10.240.79.157",
    "labels": {
      "name": "my-first-k8s-node"
    }
  }
}
Kubernetes creates a Node object internally (the representation). Kubernetes checks that a kubelet has registered to the API server that matches the metadata.name field of the Node. If the node is healthy (i.e. all necessary services are running), then it is eligible to run a Pod. Otherwise, that node is ignored for any cluster activity until it becomes healthy.
Note:
Kubernetes keeps the object for the invalid Node and continues checking to see whether it becomes healthy.
You, or a controller, must explicitly delete the Node object to stop that health checking.
The name of a Node object must be a valid DNS subdomain name.
Node name uniqueness
The name identifies a Node. Two Nodes cannot have the same name at the same time. Kubernetes also assumes that a resource with the same name is the same object. In case of a Node, it is implicitly assumed that an instance using the same name will have the same state (e.g. network settings, root disk contents) and attributes like node labels. This may lead to inconsistencies if an instance was modified without changing its name. If the Node needs to be replaced or updated significantly, the existing Node object needs to be removed from API server first and re-added after the update.
Self-registration of Nodes
When the kubelet flag --register-node is true (the default), the kubelet will attempt to register itself with the API server. This is the preferred pattern, used by most distros.
For self-registration, the kubelet is started with the following options:
- --kubeconfig - Path to credentials to authenticate itself to the API server.
- --cloud-provider - How to talk to a cloud provider to read metadata about itself.
- --register-node - Automatically register with the API server.
- --register-with-taints - Register the node with the given list of taints (comma separated <key>=<value>:<effect>). No-op if register-node is false.
- --node-ip - Optional comma-separated list of the IP addresses for the node. You can only specify a single address for each address family. For example, in a single-stack IPv4 cluster, you set this value to be the IPv4 address that the kubelet should use for the node. See configure IPv4/IPv6 dual stack for details of running a dual-stack cluster. If you don't provide this argument, the kubelet uses the node's default IPv4 address, if any; if the node has no IPv4 addresses then the kubelet uses the node's default IPv6 address.
- --node-labels - Labels to add when registering the node in the cluster (see label restrictions enforced by the NodeRestriction admission plugin).
- --node-status-update-frequency - Specifies how often kubelet posts its node status to the API server.
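As an illustration, a self-registering kubelet might be started with flags like the following; the paths, addresses, and label values here are placeholders, not recommendations:
kubelet \
  --kubeconfig=/var/lib/kubelet/kubeconfig \
  --register-node=true \
  --register-with-taints=dedicated=gpu:NoSchedule \
  --node-ip=10.240.79.157 \
  --node-labels=topology.kubernetes.io/zone=zone-a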
When the Node authorization mode and NodeRestriction admission plugin are enabled, kubelets are only authorized to create/modify their own Node resource.
Note:
As mentioned in the Node name uniqueness section, when Node configuration needs to be updated, it is a good practice to re-register the node with the API server. For example, if the kubelet is being restarted with a new set of --node-labels but the same Node name is used, the change will not take effect, because labels are only set (or modified) upon Node registration with the API server.
Pods already scheduled on the Node may misbehave or cause issues if the Node configuration is changed on a kubelet restart without re-registration. For example, an already running Pod may be tainted against the new labels assigned to the Node, while other Pods that are incompatible with that Pod will be scheduled based on the new label. Node re-registration ensures all Pods will be drained and properly re-scheduled.
Manual Node administration
You can create and modify Node objects using kubectl.
When you want to create Node objects manually, set the kubelet flag --register-node=false.
You can modify Node objects regardless of the setting of --register-node.
For example, you can set labels on an existing Node or mark it unschedulable.
You can use labels on Nodes in conjunction with node selectors on Pods to control scheduling. For example, you can constrain a Pod to only be eligible to run on a subset of the available nodes.
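For example (the node name and label here are illustrative), you could label a node and then constrain a Pod to nodes carrying that label:
kubectl label nodes worker-1 disktype=ssd
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    disktype: ssd
  containers:
  - name: nginx
    image: nginx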
Marking a node as unschedulable prevents the scheduler from placing new pods onto that Node but does not affect existing Pods on the Node. This is useful as a preparatory step before a node reboot or other maintenance.
To mark a Node unschedulable, run:
kubectl cordon $NODENAME
See Safely Drain a Node for more details.
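For example, a typical maintenance sequence might look like the following; the drain flags shown are common choices, adjust them to your workloads:
kubectl cordon $NODENAME
kubectl drain $NODENAME --ignore-daemonsets --delete-emptydir-data
# ...perform maintenance, then allow scheduling again:
kubectl uncordon $NODENAME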
Note:
Pods that are part of a DaemonSet tolerate being run on an unschedulable Node. DaemonSets typically provide node-local services that should run on the Node even if it is being drained of workload applications.
Node status
A Node's status contains information such as the node's addresses, conditions, capacity and allocatable resources, and general info (for example, kernel and kubelet versions).
You can use kubectl to view a Node's status and other details:
kubectl describe node <insert-node-name-here>
See Node Status for more details.
Node heartbeats
Heartbeats, sent by Kubernetes nodes, help your cluster determine the availability of each node and take action when failures are detected.
For nodes there are two forms of heartbeats:
- Updates to the .status of a Node.
- Lease objects within the kube-node-lease namespace. Each Node has an associated Lease object.
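For example, you can inspect the Lease object associated with a node (the node name here is a placeholder):
kubectl -n kube-node-lease get lease <node-name> -o yaml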
Node controller
The node controller is a Kubernetes control plane component that manages various aspects of nodes.
The node controller has multiple roles in a node's life. The first is assigning a CIDR block to the node when it is registered (if CIDR assignment is turned on).
The second is keeping the node controller's internal list of nodes up to date with the cloud provider's list of available machines. When running in a cloud environment and whenever a node is unhealthy, the node controller asks the cloud provider if the VM for that node is still available. If not, the node controller deletes the node from its list of nodes.
The third is monitoring the nodes' health. The node controller is responsible for:
- In the case that a node becomes unreachable, updating the Ready condition in the Node's .status field. In this case the node controller sets the Ready condition to Unknown.
- If a node remains unreachable: triggering API-initiated eviction for all of the Pods on the unreachable node. By default, the node controller waits 5 minutes between marking the node as Unknown and submitting the first eviction request.
By default, the node controller checks the state of each node every 5 seconds. This period can be configured using the --node-monitor-period flag on the kube-controller-manager component.
Rate limits on eviction
In most cases, the node controller limits the eviction rate to --node-eviction-rate (default 0.1) per second, meaning it won't evict pods from more than 1 node per 10 seconds.
The node eviction behavior changes when a node in a given availability zone becomes unhealthy. The node controller checks what percentage of nodes in the zone are unhealthy (the Ready condition is Unknown or False) at the same time:
- If the fraction of unhealthy nodes is at least --unhealthy-zone-threshold (default 0.55), then the eviction rate is reduced.
- If the cluster is small (i.e. has less than or equal to --large-cluster-size-threshold nodes - default 50), then evictions are stopped.
- Otherwise, the eviction rate is reduced to --secondary-node-eviction-rate (default 0.01) per second.
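As a sketch, tuning these settings might look like the following kube-controller-manager invocation; the values shown are the documented defaults, repeated here only for illustration:
kube-controller-manager \
  --node-monitor-period=5s \
  --node-eviction-rate=0.1 \
  --secondary-node-eviction-rate=0.01 \
  --unhealthy-zone-threshold=0.55 \
  --large-cluster-size-threshold=50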
The reason these policies are implemented per availability zone is because one availability zone might become partitioned from the control plane while the others remain connected. If your cluster does not span multiple cloud provider availability zones, then the eviction mechanism does not take per-zone unavailability into account.
A key reason for spreading your nodes across availability zones is so that the
workload can be shifted to healthy zones when one entire zone goes down.
Therefore, if all nodes in a zone are unhealthy, then the node controller evicts at the normal rate of --node-eviction-rate. The corner case is when all zones are completely unhealthy (none of the nodes in the cluster are healthy). In such a case, the node controller assumes that there is some problem with connectivity between the control plane and the nodes, and doesn't perform any evictions. (If there has been an outage and some nodes reappear, the node controller does evict pods from the remaining nodes that are unhealthy or unreachable).
The node controller is also responsible for evicting pods running on nodes with NoExecute taints, unless those pods tolerate that taint. The node controller also adds taints corresponding to node problems like node unreachable or not ready. This means that the scheduler won't place Pods onto unhealthy nodes.
Resource capacity tracking
Node objects track information about the Node's resource capacity: for example, the amount of memory available and the number of CPUs. Nodes that self-register report their capacity during registration. If you manually add a Node, then you need to set the node's capacity information when you add it.
The Kubernetes scheduler ensures that there are enough resources for all the Pods on a Node. The scheduler checks that the sum of the requests of containers on the node is no greater than the node's capacity. That sum of requests includes all containers managed by the kubelet, but excludes any containers started directly by the container runtime, and also excludes any processes running outside of the kubelet's control.
Note:
If you want to explicitly reserve resources for non-Pod processes, see reserve resources for system daemons.
Node topology
Kubernetes v1.27 [stable] (enabled by default: true)
If you have enabled the TopologyManager feature gate, then the kubelet can use topology hints when making resource assignment decisions. See Control Topology Management Policies on a Node for more information.
Swap memory management
Kubernetes v1.30 [beta] (enabled by default: true)
To enable swap on a node, the NodeSwap feature gate must be enabled on the kubelet (default is true), and the --fail-swap-on command line flag or failSwapOn configuration setting must be set to false.
To allow Pods to utilize swap, swapBehavior should not be set to NoSwap (which is the default behavior) in the kubelet config.
Warning:
When the memory swap feature is turned on, Kubernetes data such as the content of Secret objects that were written to tmpfs now could be swapped to disk.
A user can also optionally configure memorySwap.swapBehavior in order to specify how a node will use swap memory. For example,
memorySwap:
  swapBehavior: LimitedSwap
- NoSwap (default): Kubernetes workloads will not use swap.
- LimitedSwap: The utilization of swap memory by Kubernetes workloads is subject to limitations. Only Pods of Burstable QoS are permitted to employ swap.
If configuration for memorySwap is not specified and the feature gate is enabled, by default the kubelet will apply the same behaviour as the NoSwap setting.
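Putting these settings together, a minimal kubelet configuration sketch for limited swap (assuming swap space has already been provisioned on the node) could look like:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# allow the kubelet to start on a node that has swap enabled
failSwapOn: false
featureGates:
  NodeSwap: true
# let Burstable QoS Pods use a limited amount of swap
memorySwap:
  swapBehavior: LimitedSwap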
With LimitedSwap, Pods that do not fall under the Burstable QoS classification (i.e. BestEffort/Guaranteed QoS Pods) are prohibited from utilizing swap memory. To maintain the aforementioned security and node health guarantees, these Pods are not permitted to use swap memory when LimitedSwap is in effect.
Prior to detailing the calculation of the swap limit, it is necessary to define the following terms:
- nodeTotalMemory: The total amount of physical memory available on the node.
- totalPodsSwapAvailable: The total amount of swap memory on the node that is available for use by Pods (some swap memory may be reserved for system use).
- containerMemoryRequest: The container's memory request.
Swap limitation is configured as: (containerMemoryRequest / nodeTotalMemory) * totalPodsSwapAvailable.
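As a worked example (the numbers are illustrative): on a node with 16 GiB of physical memory and 4 GiB of swap available to Pods, a container requesting 2 GiB of memory gets a swap limit of (2 / 16) * 4 GiB = 0.5 GiB, that is 512 MiB.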
It is important to note that, for containers within Burstable QoS Pods, it is possible to opt-out of swap usage by specifying memory requests that are equal to memory limits. Containers configured in this manner will not have access to swap memory.
Swap is supported only with cgroup v2; cgroup v1 is not supported.
For more information, and to assist with testing and provide feedback, please see the blog-post about Kubernetes 1.28: NodeSwap graduates to Beta1, KEP-2400 and its design proposal.
What's next
Learn more about the following:
- Components that make up a node.
- API definition for Node.
- Node section of the architecture design document.
- Graceful/non-graceful node shutdown.
- Cluster autoscaling to manage the number and size of nodes in your cluster.
- Taints and Tolerations.
- Node Resource Managers.
- Resource Management for Windows nodes.
3.2.2 - Communication between Nodes and the Control Plane
This document catalogs the communication paths between the API server and the Kubernetes cluster. The intent is to allow users to customize their installation to harden the network configuration such that the cluster can be run on an untrusted network (or on fully public IPs on a cloud provider).
Node to Control Plane
Kubernetes has a "hub-and-spoke" API pattern. All API usage from nodes (or the pods they run) terminates at the API server. None of the other control plane components are designed to expose remote services. The API server is configured to listen for remote connections on a secure HTTPS port (typically 443) with one or more forms of client authentication enabled. One or more forms of authorization should be enabled, especially if anonymous requests or service account tokens are allowed.
Nodes should be provisioned with the public root certificate for the cluster such that they can connect securely to the API server along with valid client credentials. A good approach is that the client credentials provided to the kubelet are in the form of a client certificate. See kubelet TLS bootstrapping for automated provisioning of kubelet client certificates.
Pods that wish to connect to the API server can do so securely by leveraging a service account so
that Kubernetes will automatically inject the public root certificate and a valid bearer token
into the pod when it is instantiated.
The kubernetes service (in the default namespace) is configured with a virtual IP address that is redirected (via kube-proxy) to the HTTPS endpoint on the API server.
The control plane components also communicate with the API server over the secure port.
As a result, the default operating mode for connections from the nodes and pods running on the nodes to the control plane is secured by default and can run over untrusted and/or public networks.
Control plane to node
There are two primary communication paths from the control plane (the API server) to the nodes. The first is from the API server to the kubelet process which runs on each node in the cluster. The second is from the API server to any node, pod, or service through the API server's proxy functionality.
API server to kubelet
The connections from the API server to the kubelet are used for:
- Fetching logs for pods.
- Attaching (usually through kubectl) to running pods.
- Providing the kubelet's port-forwarding functionality.
These connections terminate at the kubelet's HTTPS endpoint. By default, the API server does not verify the kubelet's serving certificate, which makes the connection subject to man-in-the-middle attacks and unsafe to run over untrusted and/or public networks.
To verify this connection, use the --kubelet-certificate-authority flag to provide the API server with a root certificate bundle to use to verify the kubelet's serving certificate.
If that is not possible, use SSH tunneling between the API server and kubelet if required to avoid connecting over an untrusted or public network.
Finally, Kubelet authentication and/or authorization should be enabled to secure the kubelet API.
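For example (the certificate paths below are placeholders), the API server might be started with a kubelet CA bundle and client credentials like:
kube-apiserver \
  --kubelet-certificate-authority=/etc/kubernetes/pki/kubelet-ca.crt \
  --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt \
  --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key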
API server to nodes, pods, and services
The connections from the API server to a node, pod, or service default to plain HTTP connections
and are therefore neither authenticated nor encrypted. They can be run over a secure HTTPS
connection by prefixing https:
to the node, pod, or service name in the API URL, but they will
not validate the certificate provided by the HTTPS endpoint nor provide client credentials. So
while the connection will be encrypted, it will not provide any guarantees of integrity. These
connections are not currently safe to run over untrusted or public networks.
SSH tunnels
Kubernetes supports SSH tunnels to protect the control plane to nodes communication paths. In this configuration, the API server initiates an SSH tunnel to each node in the cluster (connecting to the SSH server listening on port 22) and passes all traffic destined for a kubelet, node, pod, or service through the tunnel. This tunnel ensures that the traffic is not exposed outside of the network in which the nodes are running.
Note:
SSH tunnels are currently deprecated, so you shouldn't opt to use them unless you know what you are doing. The Konnectivity service is a replacement for this communication channel.
Konnectivity service
Kubernetes v1.18 [beta]
As a replacement to the SSH tunnels, the Konnectivity service provides TCP level proxy for the control plane to cluster communication. The Konnectivity service consists of two parts: the Konnectivity server in the control plane network and the Konnectivity agents in the nodes network. The Konnectivity agents initiate connections to the Konnectivity server and maintain the network connections. After enabling the Konnectivity service, all control plane to nodes traffic goes through these connections.
Follow the Konnectivity service task to set up the Konnectivity service in your cluster.
What's next
- Read about the Kubernetes control plane components
- Learn more about Hubs and Spoke model
- Learn how to Secure a Cluster
- Learn more about the Kubernetes API
- Set up Konnectivity service
- Use Port Forwarding to Access Applications in a Cluster
- Learn how to Fetch logs for Pods, use kubectl port-forward
3.2.3 - Controllers
In robotics and automation, a control loop is a non-terminating loop that regulates the state of a system.
Here is one example of a control loop: a thermostat in a room.
When you set the temperature, that's telling the thermostat about your desired state. The actual room temperature is the current state. The thermostat acts to bring the current state closer to the desired state, by turning equipment on or off.
In Kubernetes, controllers are control loops that watch the state of your cluster, then make or request changes where needed. Each controller tries to move the current cluster state closer to the desired state.
Controller pattern
A controller tracks at least one Kubernetes resource type. These objects have a spec field that represents the desired state. The controller(s) for that resource are responsible for making the current state come closer to that desired state.
The controller might carry the action out itself; more commonly, in Kubernetes, a controller will send messages to the API server that have useful side effects. You'll see examples of this below.
Control via API server
The Job controller is an example of a Kubernetes built-in controller. Built-in controllers manage state by interacting with the cluster API server.
Job is a Kubernetes resource that runs a Pod, or perhaps several Pods, to carry out a task and then stop.
(Once scheduled, Pod objects become part of the desired state for a kubelet).
When the Job controller sees a new task it makes sure that, somewhere in your cluster, the kubelets on a set of Nodes are running the right number of Pods to get the work done. The Job controller does not run any Pods or containers itself. Instead, the Job controller tells the API server to create or remove Pods. Other components in the control plane act on the new information (there are new Pods to schedule and run), and eventually the work is done.
After you create a new Job, the desired state is for that Job to be completed. The Job controller makes the current state for that Job be nearer to your desired state: creating Pods that do the work you wanted for that Job, so that the Job is closer to completion.
Controllers also update the objects that configure them.
For example: once the work is done for a Job, the Job controller updates that Job object to mark it Finished.
(This is a bit like how some thermostats turn a light off to indicate that your room is now at the temperature you set).
Direct control
In contrast with Job, some controllers need to make changes to things outside of your cluster.
For example, if you use a control loop to make sure there are enough Nodes in your cluster, then that controller needs something outside the current cluster to set up new Nodes when needed.
Controllers that interact with external state find their desired state from the API server, then communicate directly with an external system to bring the current state closer in line.
(There actually is a controller that horizontally scales the nodes in your cluster.)
The important point here is that the controller makes some changes to bring about your desired state, and then reports the current state back to your cluster's API server. Other control loops can observe that reported data and take their own actions.
In the thermostat example, if the room is very cold then a different controller might also turn on a frost protection heater. With Kubernetes clusters, the control plane indirectly works with IP address management tools, storage services, cloud provider APIs, and other services by extending Kubernetes to implement that.
Desired versus current state
Kubernetes takes a cloud-native view of systems, and is able to handle constant change.
Your cluster could be changing at any point as work happens and control loops automatically fix failures. This means that, potentially, your cluster never reaches a stable state.
As long as the controllers for your cluster are running and able to make useful changes, it doesn't matter if the overall state is stable or not.
Design
As a tenet of its design, Kubernetes uses lots of controllers that each manage a particular aspect of cluster state. Most commonly, a particular control loop (controller) uses one kind of resource as its desired state, and has a different kind of resource that it manages to make that desired state happen. For example, a controller for Jobs tracks Job objects (to discover new work) and Pod objects (to run the Jobs, and then to see when the work is finished). In this case something else creates the Jobs, whereas the Job controller creates Pods.
It's useful to have simple controllers rather than one, monolithic set of control loops that are interlinked. Controllers can fail, so Kubernetes is designed to allow for that.
Note:
There can be several controllers that create or update the same kind of object. Behind the scenes, Kubernetes controllers make sure that they only pay attention to the resources linked to their controlling resource.
For example, you can have Deployments and Jobs; these both create Pods. The Job controller does not delete the Pods that your Deployment created, because there is information (labels) the controllers can use to tell those Pods apart.
Ways of running controllers
Kubernetes comes with a set of built-in controllers that run inside the kube-controller-manager. These built-in controllers provide important core behaviors.
The Deployment controller and Job controller are examples of controllers that come as part of Kubernetes itself ("built-in" controllers). Kubernetes lets you run a resilient control plane, so that if any of the built-in controllers were to fail, another part of the control plane will take over the work.
You can find controllers that run outside the control plane, to extend Kubernetes. Or, if you want, you can write a new controller yourself. You can run your own controller as a set of Pods, or externally to Kubernetes. What fits best will depend on what that particular controller does.
What's next
- Read about the Kubernetes control plane
- Discover some of the basic Kubernetes objects
- Learn more about the Kubernetes API
- If you want to write your own controller, see Kubernetes extension patterns and the sample-controller repository.
3.2.4 - Leases
Distributed systems often have a need for leases, which provide a mechanism to lock shared resources
and coordinate activity between members of a set.
In Kubernetes, the lease concept is represented by Lease objects in the coordination.k8s.io API Group, which are used for system-critical capabilities such as node heartbeats and component-level leader election.
Node heartbeats
Kubernetes uses the Lease API to communicate kubelet node heartbeats to the Kubernetes API server.
For every Node, there is a Lease object with a matching name in the kube-node-lease namespace. Under the hood, every kubelet heartbeat is an update request to this Lease object, updating the spec.renewTime field for the Lease. The Kubernetes control plane uses the time stamp of this field to determine the availability of this Node.
See Node Lease objects for more details.
Leader election
Kubernetes also uses Leases to ensure only one instance of a component is running at any given time.
This is used by control plane components like kube-controller-manager and kube-scheduler in HA configurations, where only one instance of the component should be actively running while the other instances are on stand-by.
Read coordinated leader election to learn about how Kubernetes builds on the Lease API to select which component instance acts as leader.
API server identity
Kubernetes v1.26 [beta] (enabled by default: true)
Starting in Kubernetes v1.26, each kube-apiserver uses the Lease API to publish its identity to the rest of the system. While not particularly useful on its own, this provides a mechanism for clients to discover how many instances of kube-apiserver are operating the Kubernetes control plane.
Existence of kube-apiserver leases enables future capabilities that may require coordination between each kube-apiserver.
You can inspect Leases owned by each kube-apiserver by checking for lease objects in the kube-system namespace with the name apiserver-<sha256-hash>. Alternatively you can use the label selector apiserver.kubernetes.io/identity=kube-apiserver:
kubectl -n kube-system get lease -l apiserver.kubernetes.io/identity=kube-apiserver
NAME HOLDER AGE
apiserver-07a5ea9b9b072c4a5f3d1c3702 apiserver-07a5ea9b9b072c4a5f3d1c3702_0c8914f7-0f35-440e-8676-7844977d3a05 5m33s
apiserver-7be9e061c59d368b3ddaf1376e apiserver-7be9e061c59d368b3ddaf1376e_84f2a85d-37c1-4b14-b6b9-603e62e4896f 4m23s
apiserver-1dfef752bcb36637d2763d1868 apiserver-1dfef752bcb36637d2763d1868_c5ffa286-8a9a-45d4-91e7-61118ed58d2e 4m43s
The SHA256 hash used in the lease name is based on the OS hostname as seen by that API server. Each kube-apiserver should be configured to use a hostname that is unique within the cluster. New instances of kube-apiserver that use the same hostname will take over existing Leases using a new holder identity, as opposed to instantiating new Lease objects. You can check the hostname used by kube-apiserver by checking the value of the kubernetes.io/hostname label:
kubectl -n kube-system get lease apiserver-07a5ea9b9b072c4a5f3d1c3702 -o yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2023-07-02T13:16:48Z"
  labels:
    apiserver.kubernetes.io/identity: kube-apiserver
    kubernetes.io/hostname: master-1
  name: apiserver-07a5ea9b9b072c4a5f3d1c3702
  namespace: kube-system
  resourceVersion: "334899"
  uid: 90870ab5-1ba9-4523-b215-e4d4e662acb1
spec:
  holderIdentity: apiserver-07a5ea9b9b072c4a5f3d1c3702_0c8914f7-0f35-440e-8676-7844977d3a05
  leaseDurationSeconds: 3600
  renewTime: "2023-07-04T21:58:48.065888Z"
Expired leases from kube-apiservers that no longer exist are garbage collected by new kube-apiservers after 1 hour.
You can disable API server identity leases by disabling the APIServerIdentity feature gate.
Workloads
Your own workload can define its own use of Leases. For example, you might run a custom
controller where a primary or leader member
performs operations that its peers do not. You define a Lease so that the controller replicas can select
or elect a leader, using the Kubernetes API for coordination.
If you do use a Lease, it's a good practice to define a name for the Lease that is obviously linked to the product or component. For example, if you have a component named Example Foo, use a Lease named example-foo.
If a cluster operator or another end user could deploy multiple instances of a component, select a name prefix and pick a mechanism (such as hash of the name of the Deployment) to avoid name collisions for the Leases.
You can use another approach so long as it achieves the same outcome: different software products do not conflict with one another.
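A minimal sketch of such a Lease follows; the name, namespace, and holder identity are hypothetical and would normally be created and renewed by your controller's leader election logic rather than by hand:
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: example-foo
  namespace: example-system
spec:
  holderIdentity: example-foo-6d9f8-abcde
  leaseDurationSeconds: 15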
3.2.5 - Cloud Controller Manager
Kubernetes v1.11 [beta]
Cloud infrastructure technologies let you run Kubernetes on public, private, and hybrid clouds. Kubernetes believes in automated, API-driven infrastructure without tight coupling between components.
The cloud-controller-manager is a Kubernetes control plane component that embeds cloud-specific control logic. The cloud controller manager lets you link your cluster into your cloud provider's API, and separates out the components that interact with that cloud platform from components that only interact with your cluster.
By decoupling the interoperability logic between Kubernetes and the underlying cloud infrastructure, the cloud-controller-manager component enables cloud providers to release features at a different pace compared to the main Kubernetes project.
The cloud-controller-manager is structured using a plugin mechanism that allows different cloud providers to integrate their platforms with Kubernetes.
Design
The cloud controller manager runs in the control plane as a replicated set of processes (usually, these are containers in Pods). Each cloud-controller-manager implements multiple controllers in a single process.
Note:
You can also run the cloud controller manager as a Kubernetes addon rather than as part of the control plane.
Cloud controller manager functions
The controllers inside the cloud controller manager include:
Node controller
The node controller is responsible for updating Node objects when new servers are created in your cloud infrastructure. The node controller obtains information about the hosts running inside your tenancy with the cloud provider. The node controller performs the following functions:
- Updating a Node object with the corresponding server's unique identifier obtained from the cloud provider API.
- Annotating and labelling the Node object with cloud-specific information, such as the region the node is deployed into and the resources (CPU, memory, etc) that it has available.
- Obtaining the node's hostname and network addresses.
- Verifying the node's health. In case a node becomes unresponsive, this controller checks with your cloud provider's API to see if the server has been deactivated / deleted / terminated. If the node has been deleted from the cloud, the controller deletes the Node object from your Kubernetes cluster.
Some cloud provider implementations split this into a node controller and a separate node lifecycle controller.
Route controller
The route controller is responsible for configuring routes in the cloud appropriately so that containers on different nodes in your Kubernetes cluster can communicate with each other.
Depending on the cloud provider, the route controller might also allocate blocks of IP addresses for the Pod network.
Service controller
Services integrate with cloud infrastructure components such as managed load balancers, IP addresses, network packet filtering, and target health checking. The service controller interacts with your cloud provider's APIs to set up load balancers and other infrastructure components when you declare a Service resource that requires them.
Authorization
This section breaks down the access that the cloud controller manager requires on various API objects, in order to perform its operations.
Node controller
The Node controller only works with Node objects. It requires full access to read and modify Node objects.
v1/Node:
- get
- list
- create
- update
- patch
- watch
- delete
Route controller
The route controller listens to Node object creation and configures routes appropriately. It requires Get access to Node objects.
v1/Node:
- get
Service controller
The service controller watches for Service object create, update and delete events and then configures Endpoints for those Services appropriately (for EndpointSlices, the kube-controller-manager manages these on demand).
To access Services, it requires list and watch access. To update Services, it requires patch and update access.
To set up Endpoints resources for the Services, it requires access to create, list, get, watch, and update.
v1/Service:
- list
- get
- watch
- patch
- update
Others
The implementation of the core of the cloud controller manager requires access to create Event objects, and to ensure secure operation, it requires access to create ServiceAccounts.
v1/Event:
- create
- patch
- update
v1/ServiceAccount:
- create
The RBAC ClusterRole for the cloud controller manager looks like:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cloud-controller-manager
rules:
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - serviceaccounts
  verbs:
  - create
- apiGroups:
  - ""
  resources:
  - persistentvolumes
  verbs:
  - get
  - list
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - endpoints
  verbs:
  - create
  - get
  - list
  - watch
  - update
What's next
Cloud Controller Manager Administration has instructions on running and managing the cloud controller manager.
To upgrade a HA control plane to use the cloud controller manager, see Migrate Replicated Control Plane To Use Cloud Controller Manager.
Want to know how to implement your own cloud controller manager, or extend an existing project?
- The cloud controller manager uses Go interfaces, specifically, the CloudProvider interface defined in cloud.go from kubernetes/cloud-provider to allow implementations from any cloud to be plugged in.
- The implementation of the shared controllers highlighted in this document (Node, Route, and Service), and some scaffolding along with the shared cloudprovider interface, is part of the Kubernetes core. Implementations specific to cloud providers are outside the core of Kubernetes and implement the CloudProvider interface.
- For more information about developing plugins, see Developing Cloud Controller Manager.
3.2.6 - About cgroup v2
On Linux, control groups constrain resources that are allocated to processes.
The kubelet and the underlying container runtime need to interface with cgroups to enforce resource management for pods and containers, which includes CPU and memory requests and limits for containerized workloads.
There are two versions of cgroups in Linux: cgroup v1 and cgroup v2. cgroup v2 is the new generation of the cgroup API.
What is cgroup v2?
Kubernetes v1.25 [stable]
cgroup v2 is the next version of the Linux cgroup
API. cgroup v2 provides a
unified control system with enhanced resource management
capabilities.
cgroup v2 offers several improvements over cgroup v1, such as the following:
- Single unified hierarchy design in API
- Safer sub-tree delegation to containers
- Newer features like Pressure Stall Information
- Enhanced resource allocation management and isolation across multiple resources
- Unified accounting for different types of memory allocations (network memory, kernel memory, etc)
- Accounting for non-immediate resource changes such as page cache write backs
Some Kubernetes features exclusively use cgroup v2 for enhanced resource management and isolation. For example, the MemoryQoS feature improves memory QoS and relies on cgroup v2 primitives.
Using cgroup v2
The recommended way to use cgroup v2 is to use a Linux distribution that enables and uses cgroup v2 by default.
To check if your distribution uses cgroup v2, refer to Identify cgroup version on Linux nodes.
Requirements
cgroup v2 has the following requirements:
- OS distribution enables cgroup v2
- Linux Kernel version is 5.8 or later
- Container runtime supports cgroup v2. For example:
- containerd v1.4 and later
- cri-o v1.20 and later
- The kubelet and the container runtime are configured to use the systemd cgroup driver
Linux Distribution cgroup v2 support
For a list of Linux distributions that use cgroup v2, refer to the cgroup v2 documentation
- Container Optimized OS (since M97)
- Ubuntu (since 21.10, 22.04+ recommended)
- Debian GNU/Linux (since Debian 11 bullseye)
- Fedora (since 31)
- Arch Linux (since April 2021)
- RHEL and RHEL-like distributions (since 9)
To check if your distribution is using cgroup v2, refer to your distribution's documentation or follow the instructions in Identify the cgroup version on Linux nodes.
You can also enable cgroup v2 manually on your Linux distribution by modifying the kernel cmdline boot arguments. If your distribution uses GRUB, systemd.unified_cgroup_hierarchy=1 should be added in GRUB_CMDLINE_LINUX under /etc/default/grub, followed by sudo update-grub. However, the recommended approach is to use a distribution that already enables cgroup v2 by default.
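A sketch of that manual change, assuming a GRUB-based system, might look like:
# In /etc/default/grub, add the parameter to GRUB_CMDLINE_LINUX, for example:
#   GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"
# then regenerate the GRUB configuration and reboot:
sudo update-grub
sudo reboot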
Migrating to cgroup v2
To migrate to cgroup v2, ensure that you meet the requirements, then upgrade to a kernel version that enables cgroup v2 by default.
The kubelet automatically detects that the OS is running on cgroup v2 and performs accordingly with no additional configuration required.
There should not be any noticeable difference in the user experience when switching to cgroup v2, unless users are accessing the cgroup file system directly, either on the node or from within the containers.
cgroup v2 uses a different API than cgroup v1, so if there are any applications that directly access the cgroup file system, they need to be updated to newer versions that support cgroup v2. For example:
- Some third-party monitoring and security agents may depend on the cgroup filesystem. Update these agents to versions that support cgroup v2.
- If you run cAdvisor as a stand-alone DaemonSet for monitoring pods and containers, update it to v0.43.0 or later.
- If you deploy Java applications, prefer to use versions which fully support cgroup v2:
- OpenJDK / HotSpot: jdk8u372, 11.0.16, 15 and later
- IBM Semeru Runtimes: 8.0.382.0, 11.0.20.0, 17.0.8.0, and later
- IBM Java: 8.0.8.6 and later
- If you are using the uber-go/automaxprocs package, make sure the version you use is v1.5.1 or higher.
Identify the cgroup version on Linux Nodes
The cgroup version depends on the Linux distribution being used and the default cgroup version configured on the OS. To check which cgroup version your distribution uses, run the stat -fc %T /sys/fs/cgroup/ command on the node:
stat -fc %T /sys/fs/cgroup/
For cgroup v2, the output is cgroup2fs.
For cgroup v1, the output is tmpfs.
What's next
- Learn more about cgroups
- Learn more about container runtime
- Learn more about cgroup drivers
3.2.7 - Container Runtime Interface (CRI)
The CRI is a plugin interface which enables the kubelet to use a wide variety of container runtimes, without having a need to recompile the cluster components.
You need a working container runtime on each Node in your cluster, so that the kubelet can launch Pods and their containers.
The Container Runtime Interface (CRI) is the main protocol for the communication between the kubelet and Container Runtime.
The Kubernetes Container Runtime Interface (CRI) defines the main gRPC protocol for the communication between the node components kubelet and container runtime.
The API
Kubernetes v1.23 [stable]
The kubelet acts as a client when connecting to the container runtime via gRPC.
The runtime and image service endpoints have to be available in the container runtime, which can be configured separately within the kubelet by using the --image-service-endpoint command line flag.
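For example (the socket paths below assume containerd's conventional defaults and are illustrative), the endpoints can be set when starting the kubelet:
kubelet \
  --container-runtime-endpoint=unix:///run/containerd/containerd.sock \
  --image-service-endpoint=unix:///run/containerd/containerd.sock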
For Kubernetes v1.32, the kubelet prefers to use CRI v1. If a container runtime does not support v1 of the CRI, then the kubelet tries to negotiate any older supported version. The v1.32 kubelet can also negotiate CRI v1alpha2, but this version is considered as deprecated. If the kubelet cannot negotiate a supported CRI version, the kubelet gives up and doesn't register as a node.
Upgrading
When upgrading Kubernetes, the kubelet tries to automatically select the latest CRI version on restart of the component. If that fails, then the fallback will take place as mentioned above. If a gRPC re-dial was required because the container runtime has been upgraded, then the container runtime must also support the initially selected version or the redial is expected to fail. This requires a restart of the kubelet.
What's next
- Learn more about the CRI protocol definition
3.2.8 - Garbage Collection
Garbage collection is a collective term for the various mechanisms Kubernetes uses to clean up cluster resources. This allows the clean up of resources like the following:
- Terminated pods
- Completed Jobs
- Objects without owner references
- Unused containers and container images
- Dynamically provisioned PersistentVolumes with a StorageClass reclaim policy of Delete
- Stale or expired CertificateSigningRequests (CSRs)
- Nodes deleted in the following scenarios:
- On a cloud when the cluster uses a cloud controller manager
- On-premises when the cluster uses an addon similar to a cloud controller manager
- Node Lease objects
Owners and dependents
Many objects in Kubernetes link to each other through owner references. Owner references tell the control plane which objects are dependent on others. Kubernetes uses owner references to give the control plane, and other API clients, the opportunity to clean up related resources before deleting an object. In most cases, Kubernetes manages owner references automatically.
Ownership is different from the labels and selectors
mechanism that some resources also use. For example, consider a
Service that creates
EndpointSlice
objects. The Service uses labels to allow the control plane to
determine which EndpointSlice
objects are used for that Service. In addition
to the labels, each EndpointSlice
that is managed on behalf of a Service has
an owner reference. Owner references help different parts of Kubernetes avoid
interfering with objects they don’t control.
Note:
Cross-namespace owner references are disallowed by design. Namespaced dependents can specify cluster-scoped or namespaced owners. A namespaced owner must exist in the same namespace as the dependent. If it does not, the owner reference is treated as absent, and the dependent is subject to deletion once all owners are verified absent.
Cluster-scoped dependents can only specify cluster-scoped owners. In v1.20+, if a cluster-scoped dependent specifies a namespaced kind as an owner, it is treated as having an unresolvable owner reference, and is not able to be garbage collected.
In v1.20+, if the garbage collector detects an invalid cross-namespace ownerReference, or a cluster-scoped dependent with an ownerReference referencing a namespaced kind, a warning Event with a reason of OwnerRefInvalidNamespace and an involvedObject of the invalid dependent is reported. You can check for that kind of Event by running:
kubectl get events -A --field-selector=reason=OwnerRefInvalidNamespace
Cascading deletion
Kubernetes checks for and deletes objects that no longer have owner references, like the pods left behind when you delete a ReplicaSet. When you delete an object, you can control whether Kubernetes deletes the object's dependents automatically, in a process called cascading deletion. There are two types of cascading deletion, as follows:
- Foreground cascading deletion
- Background cascading deletion
You can also control how and when garbage collection deletes resources that have owner references using Kubernetes finalizers.
Foreground cascading deletion
In foreground cascading deletion, the owner object you're deleting first enters a deletion in progress state. In this state, the following happens to the owner object:
- The Kubernetes API server sets the object's metadata.deletionTimestamp field to the time the object was marked for deletion.
- The Kubernetes API server also sets the metadata.finalizers field to foregroundDeletion.
- The object remains visible through the Kubernetes API until the deletion process is complete.
After the owner object enters the deletion in progress state, the controller deletes dependents it knows about. After deleting all the dependent objects it knows about, the controller deletes the owner object. At this point, the object is no longer visible in the Kubernetes API.
During foreground cascading deletion, the only dependents that block owner
deletion are those that have the ownerReference.blockOwnerDeletion=true
field
and are in the garbage collection controller cache. The garbage collection controller
cache may not contain objects whose resource type cannot be listed / watched successfully,
or objects that are created concurrent with deletion of an owner object.
See Use foreground cascading deletion
to learn more.
Background cascading deletion
In background cascading deletion, the Kubernetes API server deletes the owner object immediately and the garbage collector controller (custom or default) cleans up the dependent objects in the background. If a finalizer exists, it ensures that objects are not deleted until all necessary clean-up tasks are completed. By default, Kubernetes uses background cascading deletion unless you manually use foreground deletion or choose to orphan the dependent objects.
See Use background cascading deletion to learn more.
Orphaned dependents
When Kubernetes deletes an owner object, the dependents left behind are called orphan objects. By default, Kubernetes deletes dependent objects. To learn how to override this behaviour, see Delete owner objects and orphan dependents.
Garbage collection of unused containers and images
The kubelet performs garbage collection on unused images every five minutes and on unused containers every minute. You should avoid using external garbage collection tools, as these can break the kubelet behavior and remove containers that should exist.
To configure options for unused container and image garbage collection, tune the
kubelet using a configuration file
and change the parameters related to garbage collection using the
KubeletConfiguration
resource type.
Container image lifecycle
Kubernetes manages the lifecycle of all images through its image manager, which is part of the kubelet, with the cooperation of cadvisor. The kubelet considers the following disk usage limits when making garbage collection decisions:
- HighThresholdPercent
- LowThresholdPercent
Disk usage above the configured HighThresholdPercent value triggers garbage collection, which deletes images in order based on the last time they were used, starting with the oldest first. The kubelet deletes images until disk usage reaches the LowThresholdPercent value.
Garbage collection for unused container images
Kubernetes v1.30 [beta] (enabled by default: true)
As a beta feature, you can specify the maximum time a local image can be unused for, regardless of disk usage. This is a kubelet setting that you configure for each node.
To configure the setting, you need to set a value for the imageMaximumGCAge field in the kubelet configuration file. The value is specified as a Kubernetes duration. See duration in the glossary for more details. For example, you can set the configuration field to 12h45m, which means 12 hours and 45 minutes.
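A hedged KubeletConfiguration sketch that combines the disk-usage thresholds described above with the image age setting (the values shown are illustrative, not recommendations):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# start image garbage collection above this disk usage
imageGCHighThresholdPercent: 85
# free space until disk usage drops below this value
imageGCLowThresholdPercent: 80
# also remove images unused for longer than this duration
imageMaximumGCAge: 12h45m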
Note:
This feature does not track image usage across kubelet restarts. If the kubelet is restarted, the tracked image age is reset, causing the kubelet to wait the full imageMaximumGCAge duration before qualifying images for garbage collection based on image age.
Container garbage collection
The kubelet garbage collects unused containers based on the following variables, which you can define:
- MinAge: the minimum age at which the kubelet can garbage collect a container. Disable by setting to 0.
- MaxPerPodContainer: the maximum number of dead containers each Pod can have. Disable by setting to less than 0.
- MaxContainers: the maximum number of dead containers the cluster can have. Disable by setting to less than 0.
In addition to these variables, the kubelet garbage collects unidentified and deleted containers, typically starting with the oldest first.
MaxPerPodContainer and MaxContainers may potentially conflict with each other in situations where retaining the maximum number of containers per Pod (MaxPerPodContainer) would go outside the allowable total of global dead containers (MaxContainers). In this situation, the kubelet adjusts MaxPerPodContainer to address the conflict. A worst-case scenario would be to downgrade MaxPerPodContainer to 1 and evict the oldest containers. Additionally, containers owned by pods that have been deleted are removed once they are older than MinAge.
Note:
The kubelet only garbage collects the containers it manages.
Configuring garbage collection
You can tune garbage collection of resources by configuring options specific to the controllers managing those resources. The following pages show you how to configure garbage collection:
What's next
- Learn more about ownership of Kubernetes objects.
- Learn more about Kubernetes finalizers.
- Learn about the TTL controller that cleans up finished Jobs.
3.2.9 - Mixed Version Proxy
Kubernetes v1.28 [alpha] (enabled by default: false)
Kubernetes 1.32 includes an alpha feature that lets an API Server proxy resource requests to other peer API servers. This is useful when there are multiple API servers running different versions of Kubernetes in one cluster (for example, during a long-lived rollout to a new release of Kubernetes).
This enables cluster administrators to configure highly available clusters that can be upgraded more safely, by directing resource requests (made during the upgrade) to the correct kube-apiserver. That proxying prevents users from seeing unexpected 404 Not Found errors that stem from the upgrade process.
This mechanism is called the Mixed Version Proxy.
Enabling the Mixed Version Proxy
Ensure that the UnknownVersionInteroperabilityProxy feature gate is enabled when you start the API Server:
kube-apiserver \
--feature-gates=UnknownVersionInteroperabilityProxy=true \
# required command line arguments for this feature
--peer-ca-file=<path to kube-apiserver CA cert>
--proxy-client-cert-file=<path to aggregator proxy cert>,
--proxy-client-key-file=<path to aggregator proxy key>,
--requestheader-client-ca-file=<path to aggregator CA cert>,
# requestheader-allowed-names can be set to blank to allow any Common Name
--requestheader-allowed-names=<valid Common Names to verify proxy client cert against>,
# optional flags for this feature
--peer-advertise-ip=`IP of this kube-apiserver that should be used by peers to proxy requests`
--peer-advertise-port=`port of this kube-apiserver that should be used by peers to proxy requests`
# …and other flags as usual
Proxy transport and authentication between API servers
- The source kube-apiserver reuses the existing APIserver client authentication flags --proxy-client-cert-file and --proxy-client-key-file to present its identity that will be verified by its peer (the destination kube-apiserver). The destination API server verifies that peer connection based on the configuration you specify using the --requestheader-client-ca-file command line argument.
- To authenticate the destination server's serving certs, you must configure a certificate authority bundle by specifying the --peer-ca-file command line argument to the source API server.
Configuration for peer API server connectivity
To set the network location of a kube-apiserver that peers will use to proxy requests, use the --peer-advertise-ip and --peer-advertise-port command line arguments to kube-apiserver or specify these fields in the API server configuration file. If these flags are unspecified, peers will use the value from either the --advertise-address or --bind-address command line argument to the kube-apiserver. If those are also unset, the host's default interface is used.
Mixed version proxying
When you enable mixed version proxying, the aggregation layer loads a special filter that does the following:
- When a resource request reaches an API server that cannot serve that API (either because it is at a version pre-dating the introduction of the API or the API is turned off on the API server) the API server attempts to send the request to a peer API server that can serve the requested API. It does so by identifying API groups / versions / resources that the local server doesn't recognise, and tries to proxy those requests to a peer API server that is capable of handling the request.
- If the peer API server fails to respond, the source API server responds with 503 ("Service Unavailable") error.
How it works under the hood
When an API Server receives a resource request, it first checks which API servers can serve the requested resource. This check happens using the internal StorageVersion API.
- If the resource is known to the API server that received the request (for example, GET /api/v1/pods/some-pod), the request is handled locally.
- If there is no internal StorageVersion object found for the requested resource (for example, GET /my-api/v1/my-resource) and the configured APIService specifies proxying to an extension API server, that proxying happens following the usual flow for extension APIs.
- If a valid internal StorageVersion object is found for the requested resource (for example, GET /batch/v1/jobs) and the API server trying to handle the request (the handling API server) has the batch API disabled, then the handling API server fetches the peer API servers that do serve the relevant API group / version / resource (api/v1/batch in this case) using the information in the fetched StorageVersion object. The handling API server then proxies the request to one of the matching peer kube-apiservers that are aware of the requested resource. If there is no peer known for that API group / version / resource, the handling API server passes the request to its own handler chain which should eventually return a 404 ("Not Found") response.
- If the handling API server has identified and selected a peer API server, but that peer fails to respond (for reasons such as network connectivity issues, or a data race between the request being received and a controller registering the peer's info into the control plane), then the handling API server responds with a 503 ("Service Unavailable") error.
3.3 - Containers
This page will discuss containers and container images, as well as their use in operations and solution development.
The word container is an overloaded term. Whenever you use the word, check whether your audience uses the same definition.
Each container that you run is repeatable; the standardization from having dependencies included means that you get the same behavior wherever you run it.
Containers decouple applications from the underlying host infrastructure. This makes deployment easier in different cloud or OS environments.
Each node in a Kubernetes cluster runs the containers that form the Pods assigned to that node. Containers in a Pod are co-located and co-scheduled to run on the same node.
Container images
A container image is a ready-to-run software package containing everything needed to run an application: the code and any runtime it requires, application and system libraries, and default values for any essential settings.
Containers are intended to be stateless and immutable: you should not change the code of a container that is already running. If you have a containerized application and want to make changes, the correct process is to build a new image that includes the change, then recreate the container to start from the updated image.
Container runtimes
A container runtime is a fundamental component that empowers Kubernetes to run containers effectively. It is responsible for managing the execution and lifecycle of containers within the Kubernetes environment.
Kubernetes supports container runtimes such as containerd, CRI-O, and any other implementation of the Kubernetes CRI (Container Runtime Interface).
Usually, you can allow your cluster to pick the default container runtime for a Pod. If you need to use more than one container runtime in your cluster, you can specify the RuntimeClass for a Pod to make sure that Kubernetes runs those containers using a particular container runtime.
You can also use RuntimeClass to run different Pods with the same container runtime but with different settings.
3.3.1 - Images
A container image represents binary data that encapsulates an application and all its software dependencies. Container images are executable software bundles that can run standalone and that make very well defined assumptions about their runtime environment.
You typically create a container image of your application and push it to a registry before referring to it in a Pod.
This page provides an outline of the container image concept.
Note:
If you are looking for the container images for a Kubernetes release (such as v1.32, the latest minor release), visit Download Kubernetes.
Image names
Container images are usually given a name such as pause, example/mycontainer, or kube-apiserver. Images can also include a registry hostname; for example: fictional.registry.example/imagename, and possibly a port number as well; for example: fictional.registry.example:10443/imagename.
If you don't specify a registry hostname, Kubernetes assumes that you mean the Docker public registry. You can change this behaviour by setting default image registry in container runtime configuration.
After the image name part you can add a tag or digest (in the same way you would with commands like docker or podman). Tags let you identify different versions of the same series of images.
Digests are a unique identifier for a specific version of an image. Digests are hashes of the image's content, and are immutable. Tags can be moved to point to different images, but digests are fixed.
Image tags consist of lowercase and uppercase letters, digits, underscores (_), periods (.), and dashes (-). A tag can be up to 128 characters long and must match the regex pattern [a-zA-Z0-9_][a-zA-Z0-9._-]{0,127}. You can read more about tags, and find the validation regex, in the OCI Distribution Specification.
If you don't specify a tag, Kubernetes assumes you mean the tag latest.
Image digests consist of a hash algorithm (such as sha256) and a hash value. For example:
sha256:1ff6c18fbef2045af6b9c16bf034cc421a29027b800e4f9b68ae9b1cb3e9ae07
You can find more information about the digest format in the OCI Image Specification.
Some image name examples that Kubernetes can use are:
- busybox - Image name only, no tag or digest. Kubernetes will use the Docker public registry and the latest tag. (Same as docker.io/library/busybox:latest)
- busybox:1.32.0 - Image name with tag. Kubernetes will use the Docker public registry. (Same as docker.io/library/busybox:1.32.0)
- registry.k8s.io/pause:latest - Image name with a custom registry and the latest tag.
- registry.k8s.io/pause:3.5 - Image name with a custom registry and a non-latest tag.
- registry.k8s.io/pause@sha256:1ff6c18fbef2045af6b9c16bf034cc421a29027b800e4f9b68ae9b1cb3e9ae07 - Image name with digest.
- registry.k8s.io/pause:3.5@sha256:1ff6c18fbef2045af6b9c16bf034cc421a29027b800e4f9b68ae9b1cb3e9ae07 - Image name with tag and digest. Only the digest will be used for pulling.
Updating images
When you first create a Deployment,
StatefulSet, Pod, or other
object that includes a Pod template, then by default the pull policy of all
containers in that pod will be set to IfNotPresent
if it is not explicitly
specified. This policy causes the
kubelet to skip pulling an
image if it already exists.
Image pull policy
The imagePullPolicy
for a container and the tag of the image affect when the
kubelet attempts to pull (download) the specified image.
Here's a list of the values you can set for imagePullPolicy
and the effects
these values have:
- IfNotPresent - the image is pulled only if it is not already present locally.
- Always - every time the kubelet launches a container, the kubelet queries the container image registry to resolve the name to an image digest. If the kubelet has a container image with that exact digest cached locally, the kubelet uses its cached image; otherwise, the kubelet pulls the image with the resolved digest, and uses that image to launch the container.
- Never - the kubelet does not try fetching the image. If the image is somehow already present locally, the kubelet attempts to start the container; otherwise, startup fails. See pre-pulled images for more details.
The caching semantics of the underlying image provider make even
imagePullPolicy: Always
efficient, as long as the registry is reliably accessible.
Your container runtime can notice that the image layers already exist on the node
so that they don't need to be downloaded again.
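As a minimal sketch (the Pod name and image here are illustrative placeholders, not taken from this page), the pull policy is set per container in the Pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: pull-policy-demo   # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example/app:v1.42.0   # placeholder image with a meaningful tag
    imagePullPolicy: IfNotPresent          # pull only if the image is not already present on the node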
Note:
You should avoid using the :latest
tag when deploying containers in production as
it is harder to track which version of the image is running and more difficult to
roll back properly.
Instead, specify a meaningful tag such as v1.42.0
and/or a digest.
To make sure the Pod always uses the same version of a container image, you can specify
the image's digest;
replace <image-name>:<tag>
with <image-name>@<digest>
(for example, image@sha256:45b23dee08af5e43a7fea6c4cf9c25ccf269ee113168c19722f87876677c5cb2
).
When using image tags, if the image registry were to change the code that the tag on that image represents, you might end up with a mix of Pods running the old and new code. An image digest uniquely identifies a specific version of the image, so Kubernetes runs the same code every time it starts a container with that image name and digest specified. Specifying an image by digest fixes the code that you run so that a change at the registry cannot lead to that mix of versions.
There are third-party admission controllers that mutate Pods (and pod templates) when they are created, so that the running workload is defined based on an image digest rather than a tag. That might be useful if you want to make sure that all your workload is running the same code no matter what tag changes happen at the registry.
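A minimal sketch of a container pinned by digest might look like this (the Pod name and image repository are placeholders; the digest is the example value shown earlier on this page):
apiVersion: v1
kind: Pod
metadata:
  name: digest-pinned-demo   # hypothetical name
spec:
  containers:
  - name: app
    # Pinned by digest: a change at the registry cannot alter what this Pod runs.
    image: registry.example/app@sha256:1ff6c18fbef2045af6b9c16bf034cc421a29027b800e4f9b68ae9b1cb3e9ae07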
Default image pull policy
When you (or a controller) submit a new Pod to the API server, your cluster sets the
imagePullPolicy
field when specific conditions are met:
- if you omit the imagePullPolicy field, and you specify the digest for the container image, the imagePullPolicy is automatically set to IfNotPresent;
- if you omit the imagePullPolicy field, and the tag for the container image is :latest, imagePullPolicy is automatically set to Always;
- if you omit the imagePullPolicy field, and you don't specify the tag for the container image, imagePullPolicy is automatically set to Always;
- if you omit the imagePullPolicy field, and you specify a tag for the container image that isn't :latest, the imagePullPolicy is automatically set to IfNotPresent.
Note:
The value of imagePullPolicy
of the container is always set when the object is
first created, and is not updated if the image's tag or digest later changes.
For example, if you create a Deployment with an image whose tag is not
:latest
, and later update that Deployment's image to a :latest
tag, the
imagePullPolicy
field will not change to Always
. You must manually change
the pull policy of any object after its initial creation.
Required image pull
If you would like to always force a pull, you can do one of the following:
- Set the imagePullPolicy of the container to Always.
- Omit the imagePullPolicy and use :latest as the tag for the image to use; Kubernetes will set the policy to Always when you submit the Pod.
- Omit the imagePullPolicy and the tag for the image to use; Kubernetes will set the policy to Always when you submit the Pod.
- Enable the AlwaysPullImages admission controller.
ImagePullBackOff
When a kubelet starts creating containers for a Pod using a container runtime, it is possible that the container ends up in the Waiting state because of ImagePullBackOff.
The status ImagePullBackOff
means that a container could not start because Kubernetes
could not pull a container image (for reasons such as invalid image name, or pulling
from a private registry without imagePullSecret
). The BackOff
part indicates
that Kubernetes will keep trying to pull the image, with an increasing back-off delay.
Kubernetes raises the delay between each attempt until it reaches a compiled-in limit, which is 300 seconds (5 minutes).
Image pull per runtime class
Kubernetes v1.29 [alpha] (enabled by default: false)
If you enable the RuntimeClassInImageCriApi feature gate, the kubelet references container images by a tuple of (image name, runtime handler) rather than just the image name or digest. Your container runtime may adapt its behavior based on the selected runtime handler.
Pulling images based on runtime class is helpful for VM-based containers, such as Windows Hyper-V containers.
Serial and parallel image pulls
By default, kubelet pulls images serially. In other words, kubelet sends only one image pull request to the image service at a time. Other image pull requests have to wait until the one being processed is complete.
Nodes make image pull decisions in isolation. Even when you use serialized image pulls, two different nodes can pull the same image in parallel.
If you would like to enable parallel image pulls, you can set the field
serializeImagePulls
to false in the kubelet configuration.
With serializeImagePulls
set to false, image pull requests will be sent to the image service immediately,
and multiple images will be pulled at the same time.
When enabling parallel image pulls, please make sure the image service of your container runtime can handle parallel image pulls.
The kubelet never pulls multiple images in parallel on behalf of one Pod. For example, if you have a Pod that has an init container and an application container, the image pulls for the two containers will not be parallelized. However, if you have two Pods that use different images, the kubelet pulls the images in parallel on behalf of the two different Pods, when parallel image pulls is enabled.
Maximum parallel image pulls
Kubernetes v1.32 [beta]
When serializeImagePulls
is set to false, the kubelet defaults to no limit on the
maximum number of images being pulled at the same time. If you would like to
limit the number of parallel image pulls, you can set the field maxParallelImagePulls
in kubelet configuration. With maxParallelImagePulls
set to n, only n images
can be pulled at the same time, and any image pull beyond n will have to wait
until at least one ongoing image pull is complete.
Limiting the number of parallel image pulls prevents image pulling from consuming too much network bandwidth or disk I/O when parallel image pulling is enabled.
You can set maxParallelImagePulls
to a positive number that is greater than or
equal to 1. If you set maxParallelImagePulls
to be greater than or equal to 2, you
must set serializeImagePulls to false. The kubelet will fail to start with an invalid maxParallelImagePulls setting.
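As a sketch, a kubelet configuration that enables parallel pulls and caps them might look like the following; the limit of 5 is an arbitrary example, and where this file lives depends on how your kubelet is configured:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Allow the kubelet to send several image pull requests to the image service at once.
serializeImagePulls: false
# Cap the number of concurrent pulls (arbitrary example value).
maxParallelImagePulls: 5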
Multi-architecture images with image indexes
As well as providing binary images, a container registry can also serve a
container image index.
An image index can point to multiple image manifests
for architecture-specific versions of a container. The idea is that you can have a name for an image
(for example: pause
, example/mycontainer
, kube-apiserver
) and allow different systems to
fetch the right binary image for the machine architecture they are using.
Kubernetes itself typically names container images with a suffix -$(ARCH). For backward compatibility, please generate the older images with suffixes. The idea is to generate, say, a pause image which has the manifest for all the architectures, and, say, a pause-amd64 image which is backwards compatible for older configurations or YAML files that may have hard-coded the images with suffixes.
Using a private registry
Private registries may require keys to read images from them.
Credentials can be provided in several ways:
- Configuring Nodes to Authenticate to a Private Registry
  - all pods can read any configured private registries
  - requires node configuration by cluster administrator
- Kubelet Credential Provider to dynamically fetch credentials for private registries
  - kubelet can be configured to use a credential provider exec plugin for the respective private registry.
- Pre-pulled Images
  - all pods can use any images cached on a node
  - requires root access to all nodes to set up
- Specifying ImagePullSecrets on a Pod
  - only pods which provide their own keys can access the private registry
- Vendor-specific or local extensions
  - if you're using a custom node configuration, you (or your cloud provider) can implement your mechanism for authenticating the node to the container registry.
These options are explained in more detail below.
Configuring nodes to authenticate to a private registry
Specific instructions for setting credentials depend on the container runtime and registry you chose to use. You should refer to your solution's documentation for the most accurate information.
For an example of configuring a private container image registry, see the Pull an Image from a Private Registry task. That example uses a private registry in Docker Hub.
Kubelet credential provider for authenticated image pulls
Note:
This approach is especially suitable when kubelet needs to fetch registry credentials dynamically. Most commonly used for registries provided by cloud providers where auth tokens are short-lived.
You can configure the kubelet to invoke a plugin binary to dynamically fetch registry credentials for a container image. This is the most robust and versatile way to fetch credentials for private registries, but also requires kubelet-level configuration to enable.
See Configure a kubelet image credential provider for more details.
Interpretation of config.json
The interpretation of config.json varies between the original Docker implementation and the Kubernetes interpretation. In Docker, the auths keys can only specify root URLs, whereas Kubernetes allows glob URLs as well as prefix-matched paths. The only limitation is that glob patterns (*) have to include the dot (.) for each subdomain. The number of matched subdomains has to be equal to the number of glob patterns (*.), for example:
- *.kubernetes.io will not match kubernetes.io, but will match abc.kubernetes.io
- *.*.kubernetes.io will not match abc.kubernetes.io, but will match abc.def.kubernetes.io
- prefix.*.io will match prefix.kubernetes.io
- *-good.kubernetes.io will match prefix-good.kubernetes.io
This means that a config.json
like this is valid:
{
"auths": {
"my-registry.io/images": { "auth": "…" },
"*.my-registry.io/images": { "auth": "…" }
}
}
Image pull operations would now pass the credentials to the CRI container runtime for every valid pattern. For example the following container image names would match successfully:
my-registry.io/images
my-registry.io/images/my-image
my-registry.io/images/another-image
sub.my-registry.io/images/my-image
But not:
a.sub.my-registry.io/images/my-image
a.b.sub.my-registry.io/images/my-image
The kubelet performs image pulls sequentially for every found credential. This means that multiple entries in config.json for different paths are possible, too:
{
"auths": {
"my-registry.io/images": {
"auth": "…"
},
"my-registry.io/images/subpath": {
"auth": "…"
}
}
}
If a container now specifies an image my-registry.io/images/subpath/my-image to be pulled, then the kubelet will try to download it using both authentication sources if one of them fails.
Pre-pulled images
Note:
This approach is suitable if you can control node configuration. It will not work reliably if your cloud provider manages nodes and replaces them automatically.
By default, the kubelet tries to pull each image from the specified registry.
However, if the imagePullPolicy
property of the container is set to IfNotPresent
or Never
,
then a local image is used (preferentially or exclusively, respectively).
If you want to rely on pre-pulled images as a substitute for registry authentication, you must ensure all nodes in the cluster have the same pre-pulled images.
This can be used to preload certain images for speed or as an alternative to authenticating to a private registry.
All pods will have read access to any pre-pulled images.
Specifying imagePullSecrets on a Pod
Note:
This is the recommended approach to run containers based on images in private registries.
Kubernetes supports specifying container image registry keys on a Pod. imagePullSecrets must all be in the same namespace as the Pod. The referenced Secrets must be of type kubernetes.io/dockercfg or kubernetes.io/dockerconfigjson.
Creating a Secret with a Docker config
You need to know the username, registry password and client email address for authenticating to the registry, as well as its hostname. Run the following command, substituting the appropriate uppercase values:
kubectl create secret docker-registry <name> \
--docker-server=DOCKER_REGISTRY_SERVER \
--docker-username=DOCKER_USER \
--docker-password=DOCKER_PASSWORD \
--docker-email=DOCKER_EMAIL
If you already have a Docker credentials file then, rather than using the above command, you can import the credentials file as a Kubernetes Secret.
Create a Secret based on existing Docker credentials
explains how to set this up.
This is particularly useful if you are using multiple private container
registries, as kubectl create secret docker-registry
creates a Secret that
only works with a single private registry.
Note:
Pods can only reference image pull secrets in their own namespace, so this process needs to be done one time per namespace.
Referring to imagePullSecrets on a Pod
Now, you can create pods which reference that secret by adding an imagePullSecrets
section to a Pod definition. Each item in the imagePullSecrets
array can only
reference a Secret in the same namespace.
For example:
cat <<EOF > pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: foo
namespace: awesomeapps
spec:
containers:
- name: foo
image: janedoe/awesomeapp:v1
imagePullSecrets:
- name: myregistrykey
EOF
cat <<EOF >> ./kustomization.yaml
resources:
- pod.yaml
EOF
This needs to be done for each pod that is using a private registry.
However, setting of this field can be automated by setting the imagePullSecrets in a ServiceAccount resource.
Check Add ImagePullSecrets to a Service Account for detailed instructions.
You can use this in conjunction with a per-node .docker/config.json. The credentials will be merged.
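As a sketch of that ServiceAccount-based automation (reusing the placeholder namespace and Secret name from the example above), the Secret is listed once on the ServiceAccount, and Pods that use the ServiceAccount inherit it:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default              # patching the default ServiceAccount is one common choice
  namespace: awesomeapps     # placeholder namespace from the Pod example above
imagePullSecrets:
- name: myregistrykey        # the Secret created with kubectl create secret docker-registry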
Use cases
There are a number of solutions for configuring private registries. Here are some common use cases and suggested solutions.
- Cluster running only non-proprietary (e.g. open-source) images. No need to hide images.
  - Use public images from a public registry
    - No configuration required.
    - Some cloud providers automatically cache or mirror public images, which improves availability and reduces the time to pull images.
- Cluster running some proprietary images which should be hidden to those outside the company, but visible to all cluster users.
  - Use a hosted private registry
    - Manual configuration may be required on the nodes that need to access the private registry
  - Or, run an internal private registry behind your firewall with open read access.
    - No Kubernetes configuration is required.
  - Use a hosted container image registry service that controls image access
    - It will work better with cluster autoscaling than manual node configuration.
  - Or, on a cluster where changing the node configuration is inconvenient, use imagePullSecrets.
- Cluster with proprietary images, a few of which require stricter access control.
  - Ensure the AlwaysPullImages admission controller is active. Otherwise, all Pods potentially have access to all images.
  - Move sensitive data into a "Secret" resource, instead of packaging it in an image.
- A multi-tenant cluster where each tenant needs its own private registry.
  - Ensure the AlwaysPullImages admission controller is active. Otherwise, all Pods of all tenants potentially have access to all images.
  - Run a private registry with authorization required.
  - Generate a registry credential for each tenant, put it into a Secret, and populate the Secret to each tenant namespace.
  - The tenant adds that Secret to imagePullSecrets of each namespace.
If you need access to multiple registries, you can create one secret for each registry.
Legacy built-in kubelet credential provider
In older versions of Kubernetes, the kubelet had a direct integration with cloud provider credentials. This gave it the ability to dynamically fetch credentials for image registries.
There were three built-in implementations of the kubelet credential provider integration: ACR (Azure Container Registry), ECR (Elastic Container Registry), and GCR (Google Container Registry).
For more information on the legacy mechanism, read the documentation for the version of Kubernetes that you are using. Kubernetes v1.26 through to v1.32 do not include the legacy mechanism, so you would need to either:
- configure a kubelet image credential provider on each node
- specify image pull credentials using
imagePullSecrets
and at least one Secret
What's next
- Read the OCI Image Manifest Specification.
- Learn about container image garbage collection.
- Learn more about pulling an Image from a Private Registry.
3.3.2 - Container Environment
This page describes the resources available to Containers in the Container environment.
Container environment
The Kubernetes Container environment provides several important resources to Containers:
- A filesystem, which is a combination of an image and one or more volumes.
- Information about the Container itself.
- Information about other objects in the cluster.
Container information
The hostname of a Container is the name of the Pod in which the Container is running.
It is available through the hostname
command or the
gethostname
function call in libc.
The Pod name and namespace are available as environment variables through the downward API.
User defined environment variables from the Pod definition are also available to the Container, as are any environment variables specified statically in the container image.
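A minimal sketch (the Pod name, container name, and command are placeholders) that uses the downward API to expose the Pod name and namespace as environment variables:
apiVersion: v1
kind: Pod
metadata:
  name: downward-api-env-demo   # hypothetical name
spec:
  containers:
  - name: app
    image: busybox:1.28
    command: ['sh', '-c', 'echo "Running as $MY_POD_NAME in $MY_POD_NAMESPACE" && sleep 3600']
    env:
    # The kubelet fills these in from the Pod's own metadata (downward API).
    - name: MY_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: MY_POD_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace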
Cluster information
A list of all services that were running when a Container was created is available to that Container as environment variables. This list is limited to services within the same namespace as the new Container's Pod and Kubernetes control plane services.
For a service named foo that maps to a Container named bar, the following variables are defined:
FOO_SERVICE_HOST=<the host the service is running on>
FOO_SERVICE_PORT=<the port the service is running on>
Services have dedicated IP addresses and are available to the Container via DNS, if the DNS addon is enabled.
What's next
- Learn more about Container lifecycle hooks.
- Get hands-on experience attaching handlers to Container lifecycle events.
3.3.3 - Runtime Class
Kubernetes v1.20 [stable]
This page describes the RuntimeClass resource and runtime selection mechanism.
RuntimeClass is a feature for selecting the container runtime configuration. The container runtime configuration is used to run a Pod's containers.
Motivation
You can set a different RuntimeClass between different Pods to provide a balance of performance versus security. For example, if part of your workload deserves a high level of information security assurance, you might choose to schedule those Pods so that they run in a container runtime that uses hardware virtualization. You'd then benefit from the extra isolation of the alternative runtime, at the expense of some additional overhead.
You can also use RuntimeClass to run different Pods with the same container runtime but with different settings.
Setup
- Configure the CRI implementation on nodes (runtime dependent)
- Create the corresponding RuntimeClass resources
1. Configure the CRI implementation on nodes
The configurations available through RuntimeClass are Container Runtime Interface (CRI) implementation dependent. See the corresponding documentation (below) for your CRI implementation for how to configure.
Note:
RuntimeClass assumes a homogeneous node configuration across the cluster by default (which means that all nodes are configured the same way with respect to container runtimes). To support heterogeneous node configurations, see Scheduling below.
The configurations have a corresponding handler name, referenced by the RuntimeClass. The handler must be a valid DNS label name.
2. Create the corresponding RuntimeClass resources
The configurations setup in step 1 should each have an associated handler
name, which identifies
the configuration. For each handler, create a corresponding RuntimeClass object.
The RuntimeClass resource currently only has 2 significant fields: the RuntimeClass name
(metadata.name
) and the handler (handler
). The object definition looks like this:
# RuntimeClass is defined in the node.k8s.io API group
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
# The name the RuntimeClass will be referenced by.
# RuntimeClass is a non-namespaced resource.
name: myclass
# The name of the corresponding CRI configuration
handler: myconfiguration
The name of a RuntimeClass object must be a valid DNS subdomain name.
Note:
It is recommended that RuntimeClass write operations (create/update/patch/delete) be restricted to the cluster administrator. This is typically the default. See Authorization Overview for more details.
Usage
Once RuntimeClasses are configured for the cluster, you can specify a
runtimeClassName
in the Pod spec to use it. For example:
apiVersion: v1
kind: Pod
metadata:
name: mypod
spec:
runtimeClassName: myclass
# ...
This will instruct the kubelet to use the named RuntimeClass to run this pod. If the named
RuntimeClass does not exist, or the CRI cannot run the corresponding handler, the pod will enter the
Failed
terminal phase. Look for a
corresponding event for an
error message.
If no runtimeClassName
is specified, the default RuntimeHandler will be used, which is equivalent
to the behavior when the RuntimeClass feature is disabled.
CRI Configuration
For more details on setting up CRI runtimes, see CRI installation.
containerd
Runtime handlers are configured through containerd's configuration at
/etc/containerd/config.toml
. Valid handlers are configured under the runtimes section:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.${HANDLER_NAME}]
See containerd's config documentation for more details.
CRI-O
Runtime handlers are configured through CRI-O's configuration at /etc/crio/crio.conf
. Valid
handlers are configured under the
crio.runtime table:
[crio.runtime.runtimes.${HANDLER_NAME}]
runtime_path = "${PATH_TO_BINARY}"
See CRI-O's config documentation for more details.
Scheduling
Kubernetes v1.16 [beta]
By specifying the scheduling
field for a RuntimeClass, you can set constraints to
ensure that Pods running with this RuntimeClass are scheduled to nodes that support it.
If scheduling
is not set, this RuntimeClass is assumed to be supported by all nodes.
To ensure pods land on nodes supporting a specific RuntimeClass, that set of nodes should have a
common label which is then selected by the runtimeclass.scheduling.nodeSelector
field. The
RuntimeClass's nodeSelector is merged with the pod's nodeSelector in admission, effectively taking
the intersection of the set of nodes selected by each. If there is a conflict, the pod will be
rejected.
If the supported nodes are tainted to prevent other RuntimeClass pods from running on the node, you
can add tolerations
to the RuntimeClass. As with the nodeSelector
, the tolerations are merged
with the pod's tolerations in admission, effectively taking the union of the set of nodes tolerated
by each.
To learn more about configuring the node selector and tolerations, see Assigning Pods to Nodes.
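As an illustrative sketch (the label key and taint key are made-up examples, not standard Kubernetes labels), a RuntimeClass with scheduling constraints might look like this:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: myclass
handler: myconfiguration
scheduling:
  # Only nodes carrying this (example) label can run Pods that use this RuntimeClass.
  nodeSelector:
    example.com/runtime: myconfiguration
  # Tolerate an (example) taint applied to nodes reserved for this runtime.
  tolerations:
  - key: example.com/runtime
    operator: Exists
    effect: NoSchedule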
Pod Overhead
Kubernetes v1.24 [stable]
You can specify overhead resources that are associated with running a Pod. Declaring overhead allows the cluster (including the scheduler) to account for it when making decisions about Pods and resources.
Pod overhead is defined in RuntimeClass through the overhead
field. Through the use of this field,
you can specify the overhead of running pods utilizing this RuntimeClass and ensure these overheads
are accounted for in Kubernetes.
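For illustration, the overhead for the example RuntimeClass above might be declared like this (the resource amounts are arbitrary examples):
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: myclass
handler: myconfiguration
overhead:
  podFixed:
    # Extra resources accounted per Pod using this RuntimeClass,
    # on top of the container requests themselves.
    memory: "120Mi"
    cpu: "250m"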
What's next
- RuntimeClass Design
- RuntimeClass Scheduling Design
- Read about the Pod Overhead concept
- PodOverhead Feature Design
3.3.4 - Container Lifecycle Hooks
This page describes how kubelet managed Containers can use the Container lifecycle hook framework to run code triggered by events during their management lifecycle.
Overview
Analogous to many programming language frameworks that have component lifecycle hooks, such as Angular, Kubernetes provides Containers with lifecycle hooks. The hooks enable Containers to be aware of events in their management lifecycle and run code implemented in a handler when the corresponding lifecycle hook is executed.
Container hooks
There are two hooks that are exposed to Containers:
PostStart
This hook is executed immediately after a container is created. However, there is no guarantee that the hook will execute before the container ENTRYPOINT. No parameters are passed to the handler.
PreStop
This hook is called immediately before a container is terminated due to an API request or management
event such as a liveness/startup probe failure, preemption, resource contention and others. A call
to the PreStop
hook fails if the container is already in a terminated or completed state and the
hook must complete before the TERM signal to stop the container can be sent. The Pod's termination
grace period countdown begins before the PreStop
hook is executed, so regardless of the outcome of
the handler, the container will eventually terminate within the Pod's termination grace period. No
parameters are passed to the handler.
A more detailed description of the termination behavior can be found in Termination of Pods.
Hook handler implementations
Containers can access a hook by implementing and registering a handler for that hook. There are three types of hook handlers that can be implemented for Containers:
- Exec - Executes a specific command, such as pre-stop.sh, inside the cgroups and namespaces of the Container. Resources consumed by the command are counted against the Container.
- Sleep - Pauses the container for a specified duration.
This is a beta-level feature, enabled by default through the PodLifecycleSleepAction feature gate.
Note:
Enable the PodLifecycleSleepActionAllowZero feature gate if you want to set a sleep duration of zero seconds (effectively a no-op) for your Sleep lifecycle hooks.
Hook handler execution
When a Container lifecycle management hook is called, the Kubernetes management system executes the handler according to the hook action: httpGet, tcpSocket, and sleep handlers are executed by the kubelet process, and exec is executed in the container.
The PostStart
hook handler call is initiated when a container is created,
meaning the container ENTRYPOINT and the PostStart
hook are triggered simultaneously.
However, if the PostStart
hook takes too long to execute or if it hangs,
it can prevent the container from transitioning to a running
state.
PreStop
hooks are not executed asynchronously from the signal to stop the Container; the hook must
complete its execution before the TERM signal can be sent. If a PreStop
hook hangs during
execution, the Pod's phase will be Terminating
and remain there until the Pod is killed after its
terminationGracePeriodSeconds
expires. This grace period applies to the total time it takes for
both the PreStop
hook to execute and for the Container to stop normally. If, for example,
terminationGracePeriodSeconds
is 60, and the hook takes 55 seconds to complete, and the Container
takes 10 seconds to stop normally after receiving the signal, then the Container will be killed
before it can stop normally, since terminationGracePeriodSeconds
is less than the total time
(55+10) it takes for these two things to happen.
If either a PostStart
or PreStop
hook fails,
it kills the Container.
Users should make their hook handlers as lightweight as possible. There are cases, however, when long running commands make sense, such as when saving state prior to stopping a Container.
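As a sketch (the commands are placeholders; see the lifecycle-events.yaml example referenced below for the task-oriented version), a Pod that registers both hooks might look like this:
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  containers:
  - name: lifecycle-demo-container
    image: nginx
    lifecycle:
      postStart:
        exec:
          # Runs inside the container, roughly in parallel with the ENTRYPOINT.
          command: ["/bin/sh", "-c", "echo started > /tmp/poststart"]
      preStop:
        exec:
          # Runs before the TERM signal is sent; must finish within the grace period.
          command: ["/bin/sh", "-c", "nginx -s quit; sleep 5"]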
Hook delivery guarantees
Hook delivery is intended to be at least once,
which means that a hook may be called multiple times for any given event,
such as for PostStart
or PreStop
.
It is up to the hook implementation to handle this correctly.
Generally, only single deliveries are made. If, for example, an HTTP hook receiver is down and is unable to take traffic, there is no attempt to resend. In some rare cases, however, double delivery may occur. For instance, if a kubelet restarts in the middle of sending a hook, the hook might be resent after the kubelet comes back up.
Debugging Hook handlers
The logs for a Hook handler are not exposed in Pod events.
If a handler fails for some reason, it broadcasts an event.
For PostStart
, this is the FailedPostStartHook
event,
and for PreStop
, this is the FailedPreStopHook
event.
To generate a failed FailedPostStartHook
event yourself, modify the lifecycle-events.yaml file to change the postStart command to "badcommand" and apply it.
Here is some example output of the resulting events you see from running kubectl describe pod lifecycle-demo
:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7s default-scheduler Successfully assigned default/lifecycle-demo to ip-XXX-XXX-XX-XX.us-east-2...
Normal Pulled 6s kubelet Successfully pulled image "nginx" in 229.604315ms
Normal Pulling 4s (x2 over 6s) kubelet Pulling image "nginx"
Normal Created 4s (x2 over 5s) kubelet Created container lifecycle-demo-container
Normal Started 4s (x2 over 5s) kubelet Started container lifecycle-demo-container
Warning FailedPostStartHook 4s (x2 over 5s) kubelet Exec lifecycle hook ([badcommand]) for Container "lifecycle-demo-container" in Pod "lifecycle-demo_default(30229739-9651-4e5a-9a32-a8f1688862db)" failed - error: command 'badcommand' exited with 126: , message: "OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: \"badcommand\": executable file not found in $PATH: unknown\r\n"
Normal Killing 4s (x2 over 5s) kubelet FailedPostStartHook
Normal Pulled 4s kubelet Successfully pulled image "nginx" in 215.66395ms
Warning BackOff 2s (x2 over 3s) kubelet Back-off restarting failed container
What's next
- Learn more about the Container environment.
- Get hands-on experience attaching handlers to Container lifecycle events.
3.4 - Workloads
A workload is an application running on Kubernetes. Whether your workload is a single component or several that work together, on Kubernetes you run it inside a set of pods. In Kubernetes, a Pod represents a set of running containers on your cluster.
Kubernetes pods have a defined lifecycle. For example, once a pod is running in your cluster then a critical fault on the node where that pod is running means that all the pods on that node fail. Kubernetes treats that level of failure as final: you would need to create a new Pod to recover, even if the node later becomes healthy.
However, to make life considerably easier, you don't need to manage each Pod directly. Instead, you can use workload resources that manage a set of pods on your behalf. These resources configure controllers that make sure the right number of the right kind of pod are running, to match the state you specified.
Kubernetes provides several built-in workload resources:
- Deployment and ReplicaSet (replacing the legacy resource ReplicationController). Deployment is a good fit for managing a stateless application workload on your cluster, where any Pod in the Deployment is interchangeable and can be replaced if needed.
- StatefulSet lets you run one or more related Pods that do track state somehow. For example, if your workload records data persistently, you can run a StatefulSet that matches each Pod with a PersistentVolume. Your code, running in the Pods for that StatefulSet, can replicate data to other Pods in the same StatefulSet to improve overall resilience.
- DaemonSet defines Pods that provide facilities that are local to nodes. Every time you add a node to your cluster that matches the specification in a DaemonSet, the control plane schedules a Pod for that DaemonSet onto the new node. Each pod in a DaemonSet performs a job similar to a system daemon on a classic Unix / POSIX server. A DaemonSet might be fundamental to the operation of your cluster, such as a plugin to run cluster networking, it might help you to manage the node, or it could provide optional behavior that enhances the container platform you are running.
- Job and CronJob provide different ways to define tasks that run to completion and then stop. You can use a Job to define a task that runs to completion, just once. You can use a CronJob to run the same Job multiple times according to a schedule.
In the wider Kubernetes ecosystem, you can find third-party workload resources that provide additional behaviors. Using a custom resource definition, you can add in a third-party workload resource if you want a specific behavior that's not part of Kubernetes' core. For example, if you wanted to run a group of Pods for your application but stop work unless all the Pods are available (perhaps for some high-throughput distributed task), then you can implement or install an extension that does provide that feature.
What's next
As well as reading about each API kind for workload management, you can read how to do specific tasks:
- Run a stateless application using a Deployment
- Run a stateful application either as a single instance or as a replicated set
- Run automated tasks with a CronJob
To learn about Kubernetes' mechanisms for separating code from configuration, visit Configuration.
There are two supporting concepts that provide backgrounds about how Kubernetes manages pods for applications:
- Garbage collection tidies up objects from your cluster after their owning resource has been removed.
- The time-to-live after finished controller removes Jobs once a defined time has passed since they completed.
Once your application is running, you might want to make it available on the internet as a Service or, for web applications only, using an Ingress.
3.4.1 - Pods
Pods are the smallest deployable units of computing that you can create and manage in Kubernetes.
A Pod (as in a pod of whales or pea pod) is a group of one or more containers, with shared storage and network resources, and a specification for how to run the containers. A Pod's contents are always co-located and co-scheduled, and run in a shared context. A Pod models an application-specific "logical host": it contains one or more application containers which are relatively tightly coupled. In non-cloud contexts, applications executed on the same physical or virtual machine are analogous to cloud applications executed on the same logical host.
As well as application containers, a Pod can contain init containers that run during Pod startup. You can also inject ephemeral containers for debugging a running Pod.
What is a Pod?
Note:
You need to install a container runtime into each node in the cluster so that Pods can run there.
The shared context of a Pod is a set of Linux namespaces, cgroups, and potentially other facets of isolation - the same things that isolate a container. Within a Pod's context, the individual applications may have further sub-isolations applied.
A Pod is similar to a set of containers with shared namespaces and shared filesystem volumes.
Pods in a Kubernetes cluster are used in two main ways:
Pods that run a single container. The "one-container-per-Pod" model is the most common Kubernetes use case; in this case, you can think of a Pod as a wrapper around a single container; Kubernetes manages Pods rather than managing the containers directly.
Pods that run multiple containers that need to work together. A Pod can encapsulate an application composed of multiple co-located containers that are tightly coupled and need to share resources. These co-located containers form a single cohesive unit.
Grouping multiple co-located and co-managed containers in a single Pod is a relatively advanced use case. You should use this pattern only in specific instances in which your containers are tightly coupled.
You don't need to run multiple containers to provide replication (for resilience or capacity); if you need multiple replicas, see Workload management.
Using Pods
The following is an example of a Pod which consists of a container running the image nginx:1.14.2
.
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
To create the Pod shown above, run the following command:
kubectl apply -f https://k8s.io/examples/pods/simple-pod.yaml
Pods are generally not created directly; instead, they are created using workload resources. See Working with Pods for more information on how Pods are used with workload resources.
Workload resources for managing pods
Usually you don't need to create Pods directly, even singleton Pods. Instead, create them using workload resources such as Deployment or Job. If your Pods need to track state, consider the StatefulSet resource.
Each Pod is meant to run a single instance of a given application. If you want to scale your application horizontally (to provide more overall resources by running more instances), you should use multiple Pods, one for each instance. In Kubernetes, this is typically referred to as replication. Replicated Pods are usually created and managed as a group by a workload resource and its controller.
See Pods and controllers for more information on how Kubernetes uses workload resources, and their controllers, to implement application scaling and auto-healing.
Pods natively provide two kinds of shared resources for their constituent containers: networking and storage.
Working with Pods
You'll rarely create individual Pods directly in Kubernetes—even singleton Pods. This is because Pods are designed as relatively ephemeral, disposable entities. When a Pod gets created (directly by you, or indirectly by a controller), the new Pod is scheduled to run on a Node in your cluster. The Pod remains on that node until the Pod finishes execution, the Pod object is deleted, the Pod is evicted for lack of resources, or the node fails.
Note:
Restarting a container in a Pod should not be confused with restarting a Pod. A Pod is not a process, but an environment for running container(s). A Pod persists until it is deleted.
The name of a Pod must be a valid DNS subdomain value, but this can produce unexpected results for the Pod hostname. For best compatibility, the name should follow the more restrictive rules for a DNS label.
Pod OS
Kubernetes v1.25 [stable]
You should set the .spec.os.name
field to either windows
or linux
to indicate the OS on
which you want the pod to run. These two are the only operating systems supported for now by
Kubernetes. In the future, this list may be expanded.
In Kubernetes v1.32, the value of .spec.os.name
does not affect
how the kube-scheduler
picks a node for the Pod to run on. In any cluster where there is more than one operating system for
running nodes, you should set the
kubernetes.io/os
label correctly on each node, and define pods with a nodeSelector
based on the operating system
label. The kube-scheduler assigns your pod to a node based on other criteria and may or may not
succeed in picking a suitable node placement where the node OS is right for the containers in that Pod.
The Pod security standards also use this
field to avoid enforcing policies that aren't relevant to the operating system.
Pods and controllers
You can use workload resources to create and manage multiple Pods for you. A controller for the resource handles replication and rollout and automatic healing in case of Pod failure. For example, if a Node fails, a controller notices that Pods on that Node have stopped working and creates a replacement Pod. The scheduler places the replacement Pod onto a healthy Node.
Here are some examples of workload resources that manage one or more Pods:
Pod templates
Controllers for workload resources create Pods from a pod template and manage those Pods on your behalf.
PodTemplates are specifications for creating Pods, and are included in workload resources such as Deployments, Jobs, and DaemonSets.
Each controller for a workload resource uses the PodTemplate
inside the workload
object to make actual Pods. The PodTemplate
is part of the desired state of whatever
workload resource you used to run your app.
When you create a Pod, you can include environment variables in the Pod template for the containers that run in the Pod.
The sample below is a manifest for a simple Job with a template
that starts one
container. The container in that Pod prints a message then pauses.
apiVersion: batch/v1
kind: Job
metadata:
name: hello
spec:
template:
# This is the pod template
spec:
containers:
- name: hello
image: busybox:1.28
command: ['sh', '-c', 'echo "Hello, Kubernetes!" && sleep 3600']
restartPolicy: OnFailure
# The pod template ends here
Modifying the pod template or switching to a new pod template has no direct effect on the Pods that already exist. If you change the pod template for a workload resource, that resource needs to create replacement Pods that use the updated template.
For example, the StatefulSet controller ensures that the running Pods match the current pod template for each StatefulSet object. If you edit the StatefulSet to change its pod template, the StatefulSet starts to create new Pods based on the updated template. Eventually, all of the old Pods are replaced with new Pods, and the update is complete.
Each workload resource implements its own rules for handling changes to the Pod template. If you want to read more about StatefulSet specifically, read Update strategy in the StatefulSet Basics tutorial.
On Nodes, the kubelet does not directly observe or manage any of the details around pod templates and updates; those details are abstracted away. That abstraction and separation of concerns simplifies system semantics, and makes it feasible to extend the cluster's behavior without changing existing code.
Pod update and replacement
As mentioned in the previous section, when the Pod template for a workload resource is changed, the controller creates new Pods based on the updated template instead of updating or patching the existing Pods.
Kubernetes doesn't prevent you from managing Pods directly. It is possible to
update some fields of a running Pod, in place. However, Pod update operations
like patch and replace have some limitations:
- Most of the metadata about a Pod is immutable. For example, you cannot change the namespace, name, uid, or creationTimestamp fields; the generation field is unique. It only accepts updates that increment the field's current value.
- If the metadata.deletionTimestamp is set, no new entry can be added to the metadata.finalizers list.
- Pod updates may not change fields other than spec.containers[*].image, spec.initContainers[*].image, spec.activeDeadlineSeconds or spec.tolerations. For spec.tolerations, you can only add new entries.
- When updating the spec.activeDeadlineSeconds field, two types of updates are allowed:
  - setting the unassigned field to a positive number;
  - updating the field from a positive number to a smaller, non-negative number.
Resource sharing and communication
Pods enable data sharing and communication among their constituent containers.
Storage in Pods
A Pod can specify a set of shared storage volumes. All containers in the Pod can access the shared volumes, allowing those containers to share data. Volumes also allow persistent data in a Pod to survive in case one of the containers within needs to be restarted. See Storage for more information on how Kubernetes implements shared storage and makes it available to Pods.
Pod networking
Each Pod is assigned a unique IP address for each address family. Every
container in a Pod shares the network namespace, including the IP address and
network ports. Inside a Pod (and only then), the containers that belong to the Pod
can communicate with one another using localhost
. When containers in a Pod communicate
with entities outside the Pod,
they must coordinate how they use the shared network resources (such as ports).
Within a Pod, containers share an IP address and port space, and
can find each other via localhost
. The containers in a Pod can also communicate
with each other using standard inter-process communications like SystemV semaphores
or POSIX shared memory. Containers in different Pods have distinct IP addresses
and can not communicate by OS-level IPC without special configuration.
Containers that want to interact with a container running in a different Pod can
use IP networking to communicate.
Containers within the Pod see the system hostname as being the same as the configured
name
for the Pod. There's more about this in the networking
section.
Pod security settings
To set security constraints on Pods and containers, you use the
securityContext
field in the Pod specification. This field gives you
granular control over what a Pod or individual containers can do. For example:
- Drop specific Linux capabilities to avoid the impact of a CVE.
- Force all processes in the Pod to run as a non-root user or as a specific user or group ID.
- Set a specific seccomp profile.
- Set Windows security options, such as whether containers run as HostProcess.
Caution:
You can also use the Pod securityContext to enable privileged mode in Linux containers. Privileged mode overrides many of the other security settings in the securityContext. Avoid using this setting unless you can't grant the equivalent permissions by using other fields in the securityContext. In Kubernetes 1.26 and later, you can run Windows containers in a similarly privileged mode by setting the windowsOptions.hostProcess flag on the security context of the Pod spec. For details and instructions, see Create a Windows HostProcess Pod.
- To learn about kernel-level security constraints that you can use, see Linux kernel security constraints for Pods and containers.
- To learn more about the Pod security context, see Configure a Security Context for a Pod or Container.
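As an illustrative sketch (the Pod name, image, and specific values are placeholders), a Pod that applies some of the constraints listed above might look like this:
apiVersion: v1
kind: Pod
metadata:
  name: secured-pod   # hypothetical name
spec:
  securityContext:
    # Pod-level settings: run every container as a non-root user and group.
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 3000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: busybox:1.28
    command: ['sh', '-c', 'sleep 3600']
    securityContext:
      # Container-level settings: drop all Linux capabilities and forbid privilege escalation.
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]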
Static Pods
Static Pods are managed directly by the kubelet daemon on a specific node, without the API server observing them. Whereas most Pods are managed by the control plane (for example, a Deployment), for static Pods, the kubelet directly supervises each static Pod (and restarts it if it fails).
Static Pods are always bound to one Kubelet on a specific node. The main use for static Pods is to run a self-hosted control plane: in other words, using the kubelet to supervise the individual control plane components.
The kubelet automatically tries to create a mirror Pod on the Kubernetes API server for each static Pod. This means that the Pods running on a node are visible on the API server, but cannot be controlled from there. See the guide Create static Pods for more information.
Note:
The spec of a static Pod cannot refer to other API objects (e.g., ServiceAccount, ConfigMap, Secret, etc).
Pods with multiple containers
Pods are designed to support multiple cooperating processes (as containers) that form a cohesive unit of service. The containers in a Pod are automatically co-located and co-scheduled on the same physical or virtual machine in the cluster. The containers can share resources and dependencies, communicate with one another, and coordinate when and how they are terminated.
Pods in a Kubernetes cluster are used in two main ways:
- Pods that run a single container. The "one-container-per-Pod" model is the most common Kubernetes use case; in this case, you can think of a Pod as a wrapper around a single container; Kubernetes manages Pods rather than managing the containers directly.
- Pods that run multiple containers that need to work together. A Pod can encapsulate an application composed of multiple co-located containers that are tightly coupled and need to share resources. These co-located containers form a single cohesive unit of service—for example, one container serving data stored in a shared volume to the public, while a separate sidecar container refreshes or updates those files. The Pod wraps these containers, storage resources, and an ephemeral network identity together as a single unit.
For example, you might have a container that acts as a web server for files in a shared volume, and a separate sidecar container that updates those files from a remote source, as in the following diagram:
Some Pods have init containers as well as app containers. By default, init containers run and complete before the app containers are started.
You can also have sidecar containers that provide auxiliary services to the main application Pod (for example: a service mesh).
Kubernetes v1.29 [beta]
Enabled by default, the SidecarContainers
feature gate
allows you to specify restartPolicy: Always
for init containers.
Setting the Always
restart policy ensures that the containers where you set it are
treated as sidecars that are kept running during the entire lifetime of the Pod.
Containers that you explicitly define as sidecar containers
start up before the main application Pod and remain running until the Pod is
shut down.
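As a sketch (the names, images, commands, and the shared volume are placeholders), a Pod with a sidecar declared as an init container with restartPolicy: Always might look like this:
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar   # hypothetical name
spec:
  initContainers:
  - name: log-shipper
    image: busybox:1.28
    # restartPolicy: Always marks this init container as a sidecar:
    # it starts before the app container and keeps running for the Pod's lifetime.
    restartPolicy: Always
    command: ['sh', '-c', 'touch /var/log/app/app.log && tail -f /var/log/app/app.log']
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
  containers:
  - name: app
    image: busybox:1.28
    command: ['sh', '-c', 'while true; do echo hello >> /var/log/app/app.log; sleep 5; done']
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
  volumes:
  - name: logs
    emptyDir: {}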
Container probes
A probe is a diagnostic performed periodically by the kubelet on a container. To perform a diagnostic, the kubelet can invoke different actions:
- ExecAction (performed with the help of the container runtime)
- TCPSocketAction (checked directly by the kubelet)
- HTTPGetAction (checked directly by the kubelet)
You can read more about probes in the Pod Lifecycle documentation.
What's next
- Learn about the lifecycle of a Pod.
- Learn about RuntimeClass and how you can use it to configure different Pods with different container runtime configurations.
- Read about PodDisruptionBudget and how you can use it to manage application availability during disruptions.
- Pod is a top-level resource in the Kubernetes REST API. The Pod object definition describes the object in detail.
- The Distributed System Toolkit: Patterns for Composite Containers explains common layouts for Pods with more than one container.
- Read about Pod topology spread constraints
To understand the context for why Kubernetes wraps a common Pod API in other resources (such as StatefulSets or Deployments), you can read about the prior art, including:
3.4.1.1 - Pod Lifecycle
This page describes the lifecycle of a Pod. Pods follow a defined lifecycle, starting
in the Pending
phase, moving through Running
if at least one
of its primary containers starts OK, and then through either the Succeeded
or
Failed
phases depending on whether any container in the Pod terminated in failure.
Like individual application containers, Pods are considered to be relatively ephemeral (rather than durable) entities. Pods are created, assigned a unique ID (UID), and scheduled to run on nodes where they remain until termination (according to restart policy) or deletion. If a Node dies, the Pods running on (or scheduled to run on) that node are marked for deletion. The control plane marks the Pods for removal after a timeout period.
Pod lifetime
Whilst a Pod is running, the kubelet is able to restart containers to handle some kind of faults. Within a Pod, Kubernetes tracks different container states and determines what action to take to make the Pod healthy again.
In the Kubernetes API, Pods have both a specification and an actual status. The status for a Pod object consists of a set of Pod conditions. You can also inject custom readiness information into the condition data for a Pod, if that is useful to your application.
Pods are only scheduled once in their lifetime; assigning a Pod to a specific node is called binding, and the process of selecting which node to use is called scheduling. Once a Pod has been scheduled and is bound to a node, Kubernetes tries to run that Pod on the node. The Pod runs on that node until it stops, or until the Pod is terminated; if Kubernetes isn't able to start the Pod on the selected node (for example, if the node crashes before the Pod starts), then that particular Pod never starts.
You can use Pod Scheduling Readiness to delay scheduling for a Pod until all its scheduling gates are removed. For example, you might want to define a set of Pods but only trigger scheduling once all the Pods have been created.
Pods and fault recovery
If one of the containers in the Pod fails, then Kubernetes may try to restart that specific container. Read How Pods handle problems with containers to learn more.
Pods can however fail in a way that the cluster cannot recover from, and in that case Kubernetes does not attempt to heal the Pod further; instead, Kubernetes deletes the Pod and relies on other components to provide automatic healing.
If a Pod is scheduled to a node and that node then fails, the Pod is treated as unhealthy and Kubernetes eventually deletes the Pod. A Pod won't survive an eviction due to a lack of resources or Node maintenance.
Kubernetes uses a higher-level abstraction, called a controller, that handles the work of managing the relatively disposable Pod instances.
A given Pod (as defined by a UID) is never "rescheduled" to a different node; instead,
that Pod can be replaced by a new, near-identical Pod. If you make a replacement Pod, it can
even have same name (as in .metadata.name
) that the old Pod had, but the replacement
would have a different .metadata.uid
from the old Pod.
Kubernetes does not guarantee that a replacement for an existing Pod would be scheduled to the same node as the old Pod that was being replaced.
Associated lifetimes
When something is said to have the same lifetime as a Pod, such as a volume, that means that the thing exists as long as that specific Pod (with that exact UID) exists. If that Pod is deleted for any reason, and even if an identical replacement is created, the related thing (a volume, in this example) is also destroyed and created anew.
Pod phase
A Pod's status field is a PodStatus object, which has a phase field.
The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle. The phase is not intended to be a comprehensive rollup of observations of container or Pod state, nor is it intended to be a comprehensive state machine.
The number and meanings of Pod phase values are tightly guarded. Other than what is documented here, nothing should be assumed about Pods that have a given phase value.
Here are the possible values for phase:
Value | Description |
---|---|
Pending | The Pod has been accepted by the Kubernetes cluster, but one or more of the containers has not been set up and made ready to run. This includes time a Pod spends waiting to be scheduled as well as the time spent downloading container images over the network. |
Running | The Pod has been bound to a node, and all of the containers have been created. At least one container is still running, or is in the process of starting or restarting. |
Succeeded | All containers in the Pod have terminated in success, and will not be restarted. |
Failed | All containers in the Pod have terminated, and at least one container has terminated in failure. That is, the container either exited with non-zero status or was terminated by the system, and is not set for automatic restarting. |
Unknown | For some reason the state of the Pod could not be obtained. This phase typically occurs due to an error in communicating with the node where the Pod should be running. |
Note:
When a Pod is failing to start repeatedly, CrashLoopBackOff may appear in the Status field of some kubectl commands. Similarly, when a Pod is being deleted, Terminating may appear in the Status field of some kubectl commands.
Make sure not to confuse Status, a kubectl display field for user intuition, with the Pod's phase. Pod phase is an explicit part of the Kubernetes data model and of the Pod API.
NAMESPACE NAME READY STATUS RESTARTS AGE
alessandras-namespace alessandras-pod 0/1 CrashLoopBackOff 200 2d9h
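The STATUS column in that output is a kubectl summary; to read the phase recorded in the API, you can query the field directly. A minimal sketch, reusing the namespace and Pod name from the example output above:
kubectl get pod alessandras-pod --namespace alessandras-namespace -o jsonpath='{.status.phase}'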
A Pod is granted a term to terminate gracefully, which defaults to 30 seconds. You can use the flag --force to terminate a Pod by force.
Since Kubernetes 1.27, the kubelet transitions deleted Pods, except for static Pods and force-deleted Pods without a finalizer, to a terminal phase (Failed or Succeeded depending on the exit statuses of the Pod containers) before their deletion from the API server.
If a node dies or is disconnected from the rest of the cluster, Kubernetes applies a policy for setting the phase of all Pods on the lost node to Failed.
Container states
As well as the phase of the Pod overall, Kubernetes tracks the state of each container inside a Pod. You can use container lifecycle hooks to trigger events to run at certain points in a container's lifecycle.
Once the scheduler assigns a Pod to a Node, the kubelet starts creating containers for that Pod using a container runtime.
There are three possible container states: Waiting, Running, and Terminated.
To check the state of a Pod's containers, you can use kubectl describe pod <name-of-pod>. The output shows the state for each container within that Pod.
Each state has a specific meaning:
Waiting
If a container is not in either the Running or Terminated state, it is Waiting. A container in the Waiting state is still running the operations it requires in order to complete start up: for example, pulling the container image from a container image registry, or applying Secret data.
When you use kubectl to query a Pod with a container that is Waiting, you also see a Reason field to summarize why the container is in that state.
Running
The Running status indicates that a container is executing without issues. If there was a postStart hook configured, it has already executed and finished. When you use kubectl to query a Pod with a container that is Running, you also see information about when the container entered the Running state.
Terminated
A container in the Terminated state began execution and then either ran to completion or failed for some reason. When you use kubectl to query a Pod with a container that is Terminated, you see a reason, an exit code, and the start and finish time for that container's period of execution.
If a container has a preStop hook configured, this hook runs before the container enters the Terminated state.
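If you prefer the raw API data over the kubectl describe summary, you can also read the state objects directly. A minimal sketch, using the same Pod name placeholder as above:
kubectl get pod <name-of-pod> -o jsonpath='{.status.containerStatuses[0].state}'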
How Pods handle problems with containers
Kubernetes manages container failures within Pods using a restartPolicy defined in the Pod spec. This policy determines how Kubernetes reacts to containers exiting due to errors or other reasons, in the following sequence:
- Initial crash: Kubernetes attempts an immediate restart based on the Pod restartPolicy.
- Repeated crashes: After the initial crash, Kubernetes applies an exponential backoff delay for subsequent restarts, described in restartPolicy. This prevents rapid, repeated restart attempts from overloading the system.
- CrashLoopBackOff state: This indicates that the backoff delay mechanism is currently in effect for a given container that is in a crash loop, failing and restarting repeatedly.
- Backoff reset: If a container runs successfully for a certain duration (for example, 10 minutes), Kubernetes resets the backoff delay, treating any new crash as the first one.
In practice, a CrashLoopBackOff is a condition or event that might be seen as output from the kubectl command, while describing or listing Pods, when a container in the Pod fails to start properly and then continually tries and fails in a loop.
In other words, when a container enters the crash loop, Kubernetes applies the exponential backoff delay mentioned in the Container restart policy. This mechanism prevents a faulty container from overwhelming the system with continuous failed start attempts.
The CrashLoopBackOff can be caused by issues like the following:
- Application errors that cause the container to exit.
- Configuration errors, such as incorrect environment variables or missing configuration files.
- Resource constraints, where the container might not have enough memory or CPU to start properly.
- Health checks failing if the application doesn't start serving within the expected time.
- Container liveness probes or startup probes returning a Failure result as mentioned in the probes section.
To investigate the root cause of a CrashLoopBackOff issue, a user can:
- Check logs: Use kubectl logs <name-of-pod> to check the logs of the container. This is often the most direct way to diagnose the issue causing the crashes.
- Inspect events: Use kubectl describe pod <name-of-pod> to see events for the Pod, which can provide hints about configuration or resource issues.
- Review configuration: Ensure that the Pod configuration, including environment variables and mounted volumes, is correct and that all required external resources are available.
- Check resource limits: Make sure that the container has enough CPU and memory allocated. Sometimes, increasing the resources in the Pod definition can resolve the issue.
- Debug application: There might exist bugs or misconfigurations in the application code. Running this container image locally or in a development environment can help diagnose application specific issues.
Container restart policy
The spec of a Pod has a restartPolicy field with possible values Always, OnFailure, and Never. The default value is Always.
The restartPolicy for a Pod applies to app containers in the Pod and to regular init containers.
Sidecar containers ignore the Pod-level restartPolicy field: in Kubernetes, a sidecar is defined as an entry inside initContainers that has its container-level restartPolicy set to Always.
For init containers that exit with an error, the kubelet restarts the init container if the Pod-level restartPolicy is either OnFailure or Always:
- Always: Automatically restarts the container after any termination.
- OnFailure: Only restarts the container if it exits with an error (non-zero exit status).
- Never: Does not automatically restart the terminated container.
When the kubelet handles container restarts according to the configured restart policy, that only applies to restarts that create replacement containers inside the same Pod, running on the same node. After containers in a Pod exit, the kubelet restarts them with an exponential backoff delay (10s, 20s, 40s, …), which is capped at 300 seconds (5 minutes). Once a container has executed for 10 minutes without any problems, the kubelet resets the restart backoff timer for that container.
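As a minimal sketch of how this field is set (the Pod and container names here are hypothetical), the following Pod restarts its container only after a failure, and the failing command triggers the backoff behaviour described above:
apiVersion: v1
kind: Pod
metadata:
  name: restart-demo
spec:
  restartPolicy: OnFailure   # Pod-level policy: restart containers only on non-zero exit
  containers:
  - name: worker
    image: busybox:1.28
    command: ['sh', '-c', 'echo working; exit 1']   # exits with an error, so the kubelet restarts it with backoff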
Sidecar containers and Pod lifecycle explains the behaviour of init containers when a restartPolicy field is specified on them.
Configurable container restart delay
Kubernetes v1.32 [alpha] (enabled by default: false)
With the alpha feature gate KubeletCrashLoopBackOffMax enabled, you can reconfigure the maximum delay between container start retries from the default of 300s (5 minutes). This configuration is set per node using kubelet configuration. In your kubelet configuration, under crashLoopBackOff, set the maxContainerRestartPeriod field between "1s" and "300s". As described above in Container restart policy, delays on that node will still start at 10s and increase exponentially by 2x each restart, but will now be capped at your configured maximum. If the maxContainerRestartPeriod you configure is less than the default initial value of 10s, the initial delay will instead be set to the configured maximum.
See the following kubelet configuration examples:
# container restart delays will start at 10s, increasing
# 2x each time they are restarted, to a maximum of 100s
kind: KubeletConfiguration
crashLoopBackOff:
maxContainerRestartPeriod: "100s"
# delays between container restarts will always be 2s
kind: KubeletConfiguration
crashLoopBackOff:
maxContainerRestartPeriod: "2s"
Pod conditions
A Pod has a PodStatus, which has an array of PodConditions through which the Pod has or has not passed. Kubelet manages the following PodConditions:
- PodScheduled: the Pod has been scheduled to a node.
- PodReadyToStartContainers: (beta feature; enabled by default) the Pod sandbox has been successfully created and networking configured.
- ContainersReady: all containers in the Pod are ready.
- Initialized: all init containers have completed successfully.
- Ready: the Pod is able to serve requests and should be added to the load balancing pools of all matching Services.
Field name | Description |
---|---|
type | Name of this Pod condition. |
status | Indicates whether that condition is applicable, with possible values "True ", "False ", or "Unknown ". |
lastProbeTime | Timestamp of when the Pod condition was last probed. |
lastTransitionTime | Timestamp for when the Pod last transitioned from one status to another. |
reason | Machine-readable, UpperCamelCase text indicating the reason for the condition's last transition. |
message | Human-readable message indicating details about the last status transition. |
Pod readiness
Kubernetes v1.14 [stable]
Your application can inject extra feedback or signals into PodStatus: Pod readiness. To use this, set readinessGates in the Pod's spec to specify a list of additional conditions that the kubelet evaluates for Pod readiness.
Readiness gates are determined by the current state of status.condition fields for the Pod. If Kubernetes cannot find such a condition in the status.conditions field of a Pod, the status of the condition is defaulted to "False".
Here is an example:
kind: Pod
...
spec:
readinessGates:
- conditionType: "www.example.com/feature-1"
status:
conditions:
- type: Ready # a built in PodCondition
status: "False"
lastProbeTime: null
lastTransitionTime: 2018-01-01T00:00:00Z
- type: "www.example.com/feature-1" # an extra PodCondition
status: "False"
lastProbeTime: null
lastTransitionTime: 2018-01-01T00:00:00Z
containerStatuses:
- containerID: docker://abcd...
ready: true
...
The Pod conditions you add must have names that meet the Kubernetes label key format.
Status for Pod readiness
The kubectl patch command does not support patching object status. To set these status.conditions for the Pod, applications and operators should use the PATCH action.
You can use a Kubernetes client library to write code that sets custom Pod conditions for Pod readiness.
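As a minimal sketch of that PATCH action (the Pod name my-pod, the default namespace, and the local proxy port are assumptions for illustration; the condition type matches the readiness gate example above):
# in one terminal, open a local proxy to the API server
kubectl proxy --port=8001
# in another terminal, patch the Pod's status subresource
curl -X PATCH http://localhost:8001/api/v1/namespaces/default/pods/my-pod/status \
  -H "Content-Type: application/strategic-merge-patch+json" \
  --data '{"status":{"conditions":[{"type":"www.example.com/feature-1","status":"True"}]}}'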
For a Pod that uses custom conditions, that Pod is evaluated to be ready only when both the following statements apply:
- All containers in the Pod are ready.
- All conditions specified in readinessGates are True.
When a Pod's containers are Ready but at least one custom condition is missing or False, the kubelet sets the Pod's condition to ContainersReady.
Pod network readiness
Kubernetes v1.29 [beta]
Note:
During its early development, this condition was named PodHasNetwork.
After a Pod gets scheduled on a node, it needs to be admitted by the kubelet and to have any required storage volumes mounted. Once these phases are complete, the kubelet works with a container runtime (using Container Runtime Interface (CRI)) to set up a runtime sandbox and configure networking for the Pod. If the PodReadyToStartContainersCondition feature gate is enabled (it is enabled by default for Kubernetes 1.32), the PodReadyToStartContainers condition will be added to the status.conditions field of a Pod.
The PodReadyToStartContainers condition is set to False by the kubelet when it detects that a Pod does not have a runtime sandbox with networking configured. This occurs in the following scenarios:
- Early in the lifecycle of the Pod, when the kubelet has not yet begun to set up a sandbox for the Pod using the container runtime.
- Later in the lifecycle of the Pod, when the Pod sandbox has been destroyed due to either:
- the node rebooting, without the Pod getting evicted
- for container runtimes that use virtual machines for isolation, the Pod sandbox virtual machine rebooting, which then requires creating a new sandbox and fresh container network configuration.
The PodReadyToStartContainers condition is set to True by the kubelet after the successful completion of sandbox creation and network configuration for the Pod by the runtime plugin. The kubelet can start pulling container images and create containers after the PodReadyToStartContainers condition has been set to True.
For a Pod with init containers, the kubelet sets the Initialized condition to True after the init containers have successfully completed (which happens after successful sandbox creation and network configuration by the runtime plugin). For a Pod without init containers, the kubelet sets the Initialized condition to True before sandbox creation and network configuration starts.
Container probes
A probe is a diagnostic performed periodically by the kubelet on a container. To perform a diagnostic, the kubelet either executes code within the container, or makes a network request.
Check mechanisms
There are four different ways to check a container using a probe. Each probe must define exactly one of these four mechanisms:
- exec: Executes a specified command inside the container. The diagnostic is considered successful if the command exits with a status code of 0.
- grpc: Performs a remote procedure call using gRPC. The target should implement gRPC health checks. The diagnostic is considered successful if the status of the response is SERVING.
- httpGet: Performs an HTTP GET request against the Pod's IP address on a specified port and path. The diagnostic is considered successful if the response has a status code greater than or equal to 200 and less than 400.
- tcpSocket: Performs a TCP check against the Pod's IP address on a specified port. The diagnostic is considered successful if the port is open. If the remote system (the container) closes the connection immediately after it opens, this counts as healthy.
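Here is a minimal sketch of how these mechanisms appear in a container spec; the image, paths, ports, and command are placeholders for illustration rather than values from this page:
containers:
- name: app
  image: registry.example/app:1.0
  livenessProbe:
    httpGet:              # HTTP GET against the Pod's IP, port 8080, path /healthz
      path: /healthz
      port: 8080
  readinessProbe:
    exec:                 # runs inside the container; exit code 0 counts as success
      command: ["cat", "/tmp/ready"]
  startupProbe:
    tcpSocket:            # succeeds if the port accepts a TCP connection
      port: 8080
  # a grpc check would instead name a port served by a gRPC health-checking endpoint:
  #   grpc:
  #     port: 9090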
Caution:
Unlike the other mechanisms, the exec probe's implementation involves creating/forking multiple processes each time it is executed. As a result, on clusters with higher Pod densities and lower initialDelaySeconds and periodSeconds intervals, configuring any probe with the exec mechanism might introduce overhead on the CPU usage of the node. In such scenarios, consider using the alternative probe mechanisms to avoid the overhead.
Probe outcome
Each probe has one of three results:
- Success: The container passed the diagnostic.
- Failure: The container failed the diagnostic.
- Unknown: The diagnostic failed (no action should be taken, and the kubelet will make further checks).
Types of probe
The kubelet can optionally perform and react to three kinds of probes on running containers:
- livenessProbe: Indicates whether the container is running. If the liveness probe fails, the kubelet kills the container, and the container is subjected to its restart policy. If a container does not provide a liveness probe, the default state is Success.
- readinessProbe: Indicates whether the container is ready to respond to requests. If the readiness probe fails, the endpoints controller removes the Pod's IP address from the endpoints of all Services that match the Pod. The default state of readiness before the initial delay is Failure. If a container does not provide a readiness probe, the default state is Success.
- startupProbe: Indicates whether the application within the container is started. All other probes are disabled if a startup probe is provided, until it succeeds. If the startup probe fails, the kubelet kills the container, and the container is subjected to its restart policy. If a container does not provide a startup probe, the default state is Success.
For more information about how to set up a liveness, readiness, or startup probe, see Configure Liveness, Readiness and Startup Probes.
When should you use a liveness probe?
If the process in your container is able to crash on its own whenever it encounters an issue or becomes unhealthy, you do not necessarily need a liveness probe; the kubelet will automatically perform the correct action in accordance with the Pod's restartPolicy.
If you'd like your container to be killed and restarted if a probe fails, then specify a liveness probe, and specify a restartPolicy of Always or OnFailure.
When should you use a readiness probe?
If you'd like to start sending traffic to a Pod only when a probe succeeds, specify a readiness probe. In this case, the readiness probe might be the same as the liveness probe, but the existence of the readiness probe in the spec means that the Pod will start without receiving any traffic and only start receiving traffic after the probe starts succeeding.
If you want your container to be able to take itself down for maintenance, you can specify a readiness probe that checks an endpoint specific to readiness that is different from the liveness probe.
If your app has a strict dependency on back-end services, you can implement both a liveness and a readiness probe. The liveness probe passes when the app itself is healthy, but the readiness probe additionally checks that each required back-end service is available. This helps you avoid directing traffic to Pods that can only respond with error messages.
If your container needs to work on loading large data, configuration files, or migrations during startup, you can use a startup probe. However, if you want to detect the difference between an app that has failed and an app that is still processing its startup data, you might prefer a readiness probe.
Note:
If you want to be able to drain requests when the Pod is deleted, you do not necessarily need a readiness probe; on deletion, the Pod automatically puts itself into an unready state regardless of whether the readiness probe exists. The Pod remains in the unready state while it waits for the containers in the Pod to stop.
When should you use a startup probe?
Startup probes are useful for Pods that have containers that take a long time to come into service. Rather than set a long liveness interval, you can configure a separate configuration for probing the container as it starts up, allowing a time longer than the liveness interval would allow.
If your container usually starts in more than initialDelaySeconds + failureThreshold × periodSeconds, you should specify a startup probe that checks the same endpoint as the liveness probe. The default for periodSeconds is 10s. You should then set its failureThreshold high enough to allow the container to start, without changing the default values of the liveness probe. This helps to protect against deadlocks.
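For example, with the default periodSeconds of 10s, a startup probe with failureThreshold: 30 gives the application up to 300 seconds (30 × 10s) to start before the liveness probe takes over. A minimal sketch, where the /healthz path and port 8080 are assumptions for illustration:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3      # ordinary liveness settings stay unchanged
startupProbe:
  httpGet:
    path: /healthz         # same endpoint as the liveness probe
    port: 8080
  periodSeconds: 10
  failureThreshold: 30     # allows up to 30 × 10s = 300s for startup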
Termination of Pods
Because Pods represent processes running on nodes in the cluster, it is important to allow those processes to gracefully terminate when they are no longer needed (rather than being abruptly stopped with a KILL signal and having no chance to clean up).
The design aim is for you to be able to request deletion and know when processes terminate, but also be able to ensure that deletes eventually complete. When you request deletion of a Pod, the cluster records and tracks the intended grace period before the Pod is allowed to be forcefully killed. With that forceful shutdown tracking in place, the kubelet attempts graceful shutdown.
Typically, with this graceful termination of the Pod, the kubelet makes requests to the container runtime to attempt to stop the containers in the Pod by first sending a TERM (also known as SIGTERM) signal, with a grace period timeout, to the main process in each container. The requests to stop the containers are processed by the container runtime asynchronously. There is no guarantee to the order of processing for these requests. Many container runtimes respect the STOPSIGNAL value defined in the container image and, if different, send the container image configured STOPSIGNAL instead of TERM.
Once the grace period has expired, the KILL signal is sent to any remaining
processes, and the Pod is then deleted from the
API Server. If the kubelet or the
container runtime's management service is restarted while waiting for processes to terminate, the
cluster retries from the start including the full original grace period.
Pod termination flow, illustrated with an example:
- You use the kubectl tool to manually delete a specific Pod, with the default grace period (30 seconds).
- The Pod in the API server is updated with the time beyond which the Pod is considered "dead" along with the grace period. If you use kubectl describe to check the Pod you're deleting, that Pod shows up as "Terminating". On the node where the Pod is running: as soon as the kubelet sees that a Pod has been marked as terminating (a graceful shutdown duration has been set), the kubelet begins the local Pod shutdown process.
  - If one of the Pod's containers has defined a preStop hook and the terminationGracePeriodSeconds in the Pod spec is not set to 0, the kubelet runs that hook inside of the container. The default terminationGracePeriodSeconds setting is 30 seconds. If the preStop hook is still running after the grace period expires, the kubelet requests a small, one-off grace period extension of 2 seconds.
    Note: If the preStop hook needs longer to complete than the default grace period allows, you must modify terminationGracePeriodSeconds to suit this.
  - The kubelet triggers the container runtime to send a TERM signal to process 1 inside each container.
    There is special ordering if the Pod has any sidecar containers defined. Otherwise, the containers in the Pod receive the TERM signal at different times and in an arbitrary order. If the order of shutdowns matters, consider using a preStop hook to synchronize (or switch to using sidecar containers).
- At the same time as the kubelet is starting graceful shutdown of the Pod, the control plane evaluates whether to remove that shutting-down Pod from EndpointSlice (and Endpoints) objects, where those objects represent a Service with a configured selector. ReplicaSets and other workload resources no longer treat the shutting-down Pod as a valid, in-service replica.
  Pods that shut down slowly should not continue to serve regular traffic and should start terminating and finish processing open connections. Some applications need to go beyond finishing open connections and need more graceful termination, for example, session draining and completion.
  Any endpoints that represent the terminating Pods are not immediately removed from EndpointSlices, and a status indicating terminating state is exposed from the EndpointSlice API (and the legacy Endpoints API). Terminating endpoints always have their ready status as false (for backward compatibility with versions before 1.26), so load balancers will not use them for regular traffic.
  If traffic draining on a terminating Pod is needed, the actual readiness can be checked as a condition serving. You can find more details on how to implement connections draining in the tutorial Pods And Endpoints Termination Flow.
- The kubelet ensures the Pod is shut down and terminated:
  - When the grace period expires, if there is still any container running in the Pod, the kubelet triggers forcible shutdown. The container runtime sends SIGKILL to any processes still running in any container in the Pod. The kubelet also cleans up a hidden pause container if that container runtime uses one.
  - The kubelet transitions the Pod into a terminal phase (Failed or Succeeded depending on the end state of its containers).
  - The kubelet triggers forcible removal of the Pod object from the API server, by setting grace period to 0 (immediate deletion).
- The API server deletes the Pod's API object, which is then no longer visible from any client.
Forced Pod termination
Caution:
Forced deletions can be potentially disruptive for some workloads and their Pods.
By default, all deletes are graceful within 30 seconds. The kubectl delete command supports the --grace-period=<seconds> option which allows you to override the default and specify your own value.
Setting the grace period to 0 forcibly and immediately deletes the Pod from the API server. If the Pod was still running on a node, that forcible deletion triggers the kubelet to begin immediate cleanup.
Using kubectl, you must specify an additional flag --force along with --grace-period=0 in order to perform force deletions.
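For example (the Pod name is a placeholder):
kubectl delete pod <pod-name> --grace-period=0 --force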
When a force deletion is performed, the API server does not wait for confirmation from the kubelet that the Pod has been terminated on the node it was running on. It removes the Pod in the API immediately so a new Pod can be created with the same name. On the node, Pods that are set to terminate immediately will still be given a small grace period before being force killed.
Caution:
Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
If you need to force-delete Pods that are part of a StatefulSet, refer to the task documentation for deleting Pods from a StatefulSet.
Pod shutdown and sidecar containers
If your Pod includes one or more sidecar containers (init containers with an Always restart policy), the kubelet will delay sending the TERM signal to these sidecar containers until the last main container has fully terminated. The sidecar containers will be terminated in the reverse order they are defined in the Pod spec. This ensures that sidecar containers continue serving the other containers in the Pod until they are no longer needed.
This means that slow termination of a main container will also delay the termination of the sidecar containers. If the grace period expires before the termination process is complete, the Pod may enter forced termination. In this case, all remaining containers in the Pod will be terminated simultaneously with a short grace period.
Similarly, if the Pod has a preStop hook that exceeds the termination grace period, emergency termination may occur.
In general, if you have used preStop hooks to control the termination order without sidecar containers, you can now remove them and allow the kubelet to manage sidecar termination automatically.
Garbage collection of Pods
For failed Pods, the API objects remain in the cluster's API until a human or controller process explicitly removes them.
The Pod garbage collector (PodGC), which is a controller in the control plane, cleans up terminated Pods (with a phase of Succeeded or Failed), when the number of Pods exceeds the configured threshold (determined by terminated-pod-gc-threshold in the kube-controller-manager). This avoids a resource leak as Pods are created and terminated over time.
Additionally, PodGC cleans up any Pods which satisfy any of the following conditions:
- are orphan Pods - bound to a node which no longer exists,
- are unscheduled terminating Pods,
- are terminating Pods, bound to a non-ready node tainted with node.kubernetes.io/out-of-service.
Along with cleaning up the Pods, PodGC will also mark them as failed if they are in a non-terminal phase. Also, PodGC adds a Pod disruption condition when cleaning up an orphan Pod. See Pod disruption conditions for more details.
What's next
Get hands-on experience attaching handlers to container lifecycle events.
Get hands-on experience configuring Liveness, Readiness and Startup Probes.
Learn more about container lifecycle hooks.
Learn more about sidecar containers.
For detailed information about Pod and container status in the API, see the API reference documentation covering status for Pod.
3.4.1.2 - Init Containers
This page provides an overview of init containers: specialized containers that run before app containers in a Pod. Init containers can contain utilities or setup scripts not present in an app image.
You can specify init containers in the Pod specification alongside the containers
array (which describes app containers).
In Kubernetes, a sidecar container is a container that starts before the main application container and continues to run. This document is about init containers: containers that run to completion during Pod initialization.
Understanding init containers
A Pod can have multiple containers running apps within it, but it can also have one or more init containers, which are run before the app containers are started.
Init containers are exactly like regular containers, except:
- Init containers always run to completion.
- Each init container must complete successfully before the next one starts.
If a Pod's init container fails, the kubelet repeatedly restarts that init container until it succeeds.
However, if the Pod has a restartPolicy of Never, and an init container fails during startup of that Pod, Kubernetes treats the overall Pod as failed.
To specify an init container for a Pod, add the initContainers field into the Pod specification, as an array of container items (similar to the app containers field and its contents). See Container in the API reference for more details.
The status of the init containers is returned in the .status.initContainerStatuses field as an array of the container statuses (similar to the .status.containerStatuses field).
Differences from regular containers
Init containers support all the fields and features of app containers, including resource limits, volumes, and security settings. However, the resource requests and limits for an init container are handled differently, as documented in Resource sharing within containers.
Regular init containers (in other words: excluding sidecar containers) do not support the lifecycle, livenessProbe, readinessProbe, or startupProbe fields. Init containers must run to completion before the Pod can be ready; sidecar containers continue running during a Pod's lifetime, and do support some probes. See sidecar containers for further details about sidecar containers.
If you specify multiple init containers for a Pod, kubelet runs each init container sequentially. Each init container must succeed before the next can run. When all of the init containers have run to completion, kubelet initializes the application containers for the Pod and runs them as usual.
Differences from sidecar containers
Init containers run and complete their tasks before the main application container starts. Unlike sidecar containers, init containers are not continuously running alongside the main containers.
Init containers run to completion sequentially, and the main container does not start until all the init containers have successfully completed.
Init containers do not support lifecycle, livenessProbe, readinessProbe, or startupProbe, whereas sidecar containers support all these probes to control their lifecycle.
Init containers share the same resources (CPU, memory, network) with the main application containers but do not interact directly with them. They can, however, use shared volumes for data exchange.
Using init containers
Because init containers have separate images from app containers, they have some advantages for start-up related code:
- Init containers can contain utilities or custom code for setup that are not present in an app image. For example, there is no need to make an image FROM another image just to use a tool like sed, awk, python, or dig during setup.
- The application image builder and deployer roles can work independently without the need to jointly build a single app image.
- Init containers can run with a different view of the filesystem than app containers in the same Pod. Consequently, they can be given access to Secrets that app containers cannot access.
- Because init containers run to completion before any app containers start, init containers offer a mechanism to block or delay app container startup until a set of preconditions are met. Once preconditions are met, all of the app containers in a Pod can start in parallel.
- Init containers can securely run utilities or custom code that would otherwise make an app container image less secure. By keeping unnecessary tools separate you can limit the attack surface of your app container image.
Examples
Here are some ideas for how to use init containers:
- Wait for a Service to be created, using a shell one-line command like:
  for i in {1..100}; do sleep 1; if nslookup myservice; then exit 0; fi; done; exit 1
- Register this Pod with a remote server from the downward API with a command like:
  curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d 'instance=$(<POD_NAME>)&ip=$(<POD_IP>)'
- Wait for some time before starting the app container with a command like sleep 60.
- Clone a Git repository into a Volume.
- Place values into a configuration file and run a template tool to dynamically generate a configuration file for the main app container. For example, place the POD_IP value in a configuration and generate the main app configuration file using Jinja.
Init containers in use
This example defines a simple Pod that has two init containers.
The first waits for myservice, and the second waits for mydb. Once both init containers complete, the Pod runs the app container from its spec section.
apiVersion: v1
kind: Pod
metadata:
name: myapp-pod
labels:
app.kubernetes.io/name: MyApp
spec:
containers:
- name: myapp-container
image: busybox:1.28
command: ['sh', '-c', 'echo The app is running! && sleep 3600']
initContainers:
- name: init-myservice
image: busybox:1.28
command: ['sh', '-c', "until nslookup myservice.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
- name: init-mydb
image: busybox:1.28
command: ['sh', '-c', "until nslookup mydb.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for mydb; sleep 2; done"]
You can start this Pod by running:
kubectl apply -f myapp.yaml
The output is similar to this:
pod/myapp-pod created
And check on its status with:
kubectl get -f myapp.yaml
The output is similar to this:
NAME READY STATUS RESTARTS AGE
myapp-pod 0/1 Init:0/2 0 6m
or for more details:
kubectl describe -f myapp.yaml
The output is similar to this:
Name: myapp-pod
Namespace: default
[...]
Labels: app.kubernetes.io/name=MyApp
Status: Pending
[...]
Init Containers:
init-myservice:
[...]
State: Running
[...]
init-mydb:
[...]
State: Waiting
Reason: PodInitializing
Ready: False
[...]
Containers:
myapp-container:
[...]
State: Waiting
Reason: PodInitializing
Ready: False
[...]
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
16s 16s 1 {default-scheduler } Normal Scheduled Successfully assigned myapp-pod to 172.17.4.201
16s 16s 1 {kubelet 172.17.4.201} spec.initContainers{init-myservice} Normal Pulling pulling image "busybox"
13s 13s 1 {kubelet 172.17.4.201} spec.initContainers{init-myservice} Normal Pulled Successfully pulled image "busybox"
13s 13s 1 {kubelet 172.17.4.201} spec.initContainers{init-myservice} Normal Created Created container init-myservice
13s 13s 1 {kubelet 172.17.4.201} spec.initContainers{init-myservice} Normal Started Started container init-myservice
To see logs for the init containers in this Pod, run:
kubectl logs myapp-pod -c init-myservice # Inspect the first init container
kubectl logs myapp-pod -c init-mydb # Inspect the second init container
At this point, those init containers will be waiting to discover Services named
mydb
and myservice
.
Here's a configuration you can use to make those Services appear:
---
apiVersion: v1
kind: Service
metadata:
name: myservice
spec:
ports:
- protocol: TCP
port: 80
targetPort: 9376
---
apiVersion: v1
kind: Service
metadata:
name: mydb
spec:
ports:
- protocol: TCP
port: 80
targetPort: 9377
To create the mydb and myservice services:
kubectl apply -f services.yaml
The output is similar to this:
service/myservice created
service/mydb created
You'll then see that those init containers complete, and that the myapp-pod Pod moves into the Running state:
kubectl get -f myapp.yaml
The output is similar to this:
NAME READY STATUS RESTARTS AGE
myapp-pod 1/1 Running 0 9m
This simple example should provide some inspiration for you to create your own init containers. What's next contains a link to a more detailed example.
Detailed behavior
During Pod startup, the kubelet delays running init containers until the networking and storage are ready. Then the kubelet runs the Pod's init containers in the order they appear in the Pod's spec.
Each init container must exit successfully before the next container starts. If a container fails to start due to the runtime or exits with failure, it is retried according to the Pod restartPolicy. However, if the Pod restartPolicy is set to Always, the init containers use restartPolicy OnFailure.
A Pod cannot be Ready until all init containers have succeeded. The ports on an init container are not aggregated under a Service. A Pod that is initializing is in the Pending state but should have a condition Initialized set to false.
If the Pod restarts, or is restarted, all init containers must execute again.
Changes to the init container spec are limited to the container image field. Directly altering the image field of an init container does not restart the Pod or trigger its recreation. If the Pod has yet to start, that change may have an effect on how the Pod boots up.
For a pod template you can typically change any field for an init container; the impact of making that change depends on where the pod template is used.
Because init containers can be restarted, retried, or re-executed, init container code should be idempotent. In particular, code that writes into any emptyDir volume should be prepared for the possibility that an output file already exists.
Init containers have all of the fields of an app container. However, Kubernetes prohibits readinessProbe from being used because init containers cannot define readiness distinct from completion. This is enforced during validation.
Use activeDeadlineSeconds on the Pod to prevent init containers from failing forever. The active deadline includes init containers. However, it is recommended to use activeDeadlineSeconds only if teams deploy their application as a Job, because activeDeadlineSeconds has an effect even after the init containers have finished: a Pod that is already running correctly would be killed once the deadline is reached.
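As a minimal sketch of that recommendation (the Job name, container names, and timings are hypothetical), a Job whose Pods, including their init container, must finish within two minutes could look like this:
apiVersion: batch/v1
kind: Job
metadata:
  name: init-deadline-demo
spec:
  template:
    spec:
      activeDeadlineSeconds: 120   # Pod-level deadline; time spent in init containers counts too
      restartPolicy: Never
      initContainers:
      - name: init-wait
        image: busybox:1.28
        command: ['sh', '-c', 'sleep 10']
      containers:
      - name: main
        image: busybox:1.28
        command: ['sh', '-c', 'echo done']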
The name of each app and init container in a Pod must be unique; a validation error is thrown for any container sharing a name with another.
Resource sharing within containers
Given the order of execution for init, sidecar and app containers, the following rules for resource usage apply:
- The highest of any particular resource request or limit defined on all init containers is the effective init request/limit. If any resource has no resource limit specified this is considered as the highest limit.
- The Pod's effective request/limit for a resource is the higher of:
- the sum of all app containers request/limit for a resource
- the effective init request/limit for a resource
- Scheduling is done based on effective requests/limits, which means init containers can reserve resources for initialization that are not used during the life of the Pod.
- The Pod's effective QoS (quality of service) tier is the QoS tier for init containers and app containers alike.
Quota and limits are applied based on the effective Pod request and limit.
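As a worked illustration of these rules, consider the following Pod spec fragment (the container names and values are hypothetical):
spec:
  initContainers:
  - name: init
    image: busybox:1.28
    resources:
      requests:
        cpu: 500m          # highest init request, so the effective init request is 500m CPU / 128Mi
        memory: 128Mi
  containers:
  - name: app-1
    image: busybox:1.28
    resources:
      requests:
        cpu: 200m
        memory: 256Mi
  - name: app-2
    image: busybox:1.28
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
  # sum of app container requests: 300m CPU, 384Mi memory
  # effective Pod request: max(500m, 300m) = 500m CPU and max(128Mi, 384Mi) = 384Mi memory,
  # which is what the scheduler uses when placing the Pod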
Init containers and Linux cgroups
On Linux, resource allocations for Pod level control groups (cgroups) are based on the effective Pod request and limit, the same as the scheduler.
Pod restart reasons
A Pod can restart, causing re-execution of init containers, for the following reasons:
- The Pod infrastructure container is restarted. This is uncommon and would have to be done by someone with root access to nodes.
- All containers in a Pod are terminated while restartPolicy is set to Always, forcing a restart, and the init container completion record has been lost due to garbage collection.
The Pod will not be restarted when the init container image is changed, or the init container completion record has been lost due to garbage collection. This applies for Kubernetes v1.20 and later. If you are using an earlier version of Kubernetes, consult the documentation for the version you are using.
What's next
Learn more about the following:
- Creating a Pod that has an init container.
- Debug init containers.
- Overview of kubelet and kubectl.
- Types of probes: liveness, readiness, startup probe.
- Sidecar containers.
3.4.1.3 - Sidecar Containers
Kubernetes v1.29 [beta]
Sidecar containers are the secondary containers that run along with the main application container within the same Pod. These containers are used to enhance or to extend the functionality of the primary app container by providing additional services, or functionality such as logging, monitoring, security, or data synchronization, without directly altering the primary application code.
Typically, you only have one app container in a Pod. For example, if you have a web application that requires a local webserver, the local webserver is a sidecar and the web application itself is the app container.
Sidecar containers in Kubernetes
Kubernetes implements sidecar containers as a special case of init containers; sidecar containers remain running after Pod startup. This document uses the term regular init containers to clearly refer to containers that only run during Pod startup.
Provided that your cluster has the SidecarContainers feature gate enabled (the feature is active by default since Kubernetes v1.29), you can specify a restartPolicy for containers listed in a Pod's initContainers field. These restartable sidecar containers are independent from other init containers and from the main application container(s) within the same Pod. They can be started, stopped, or restarted without affecting the main application container and other init containers.
You can also run a Pod with multiple containers that are not marked as init or sidecar containers. This is appropriate if the containers within the Pod are required for the Pod to work overall, but you don't need to control which containers start or stop first. You could also do this if you need to support older versions of Kubernetes that don't support a container-level restartPolicy field.
Example application
Here's an example of a Deployment with two containers, one of which is a sidecar:
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
labels:
app: myapp
spec:
replicas: 1
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
spec:
containers:
- name: myapp
image: alpine:latest
command: ['sh', '-c', 'while true; do echo "logging" >> /opt/logs.txt; sleep 1; done']
volumeMounts:
- name: data
mountPath: /opt
initContainers:
- name: logshipper
image: alpine:latest
restartPolicy: Always
command: ['sh', '-c', 'tail -F /opt/logs.txt']
volumeMounts:
- name: data
mountPath: /opt
volumes:
- name: data
emptyDir: {}
Sidecar containers and Pod lifecycle
If an init container is created with its restartPolicy set to Always, it will start and remain running during the entire life of the Pod. This can be helpful for running supporting services separated from the main application containers.
If a readinessProbe is specified for this init container, its result will be used to determine the ready state of the Pod.
Since these containers are defined as init containers, they benefit from the same ordering and sequential guarantees as regular init containers, allowing you to mix sidecar containers with regular init containers for complex Pod initialization flows.
Compared to regular init containers, sidecars defined within initContainers continue to run after they have started. This is important when there is more than one entry inside .spec.initContainers for a Pod. After a sidecar-style init container is running (the kubelet has set the started status for that init container to true), the kubelet then starts the next init container from the ordered .spec.initContainers list. That status either becomes true because there is a process running in the container and no startup probe defined, or as a result of its startupProbe succeeding.
Upon Pod termination, the kubelet postpones terminating sidecar containers until the main application container has fully stopped. The sidecar containers are then shut down in the opposite order of their appearance in the Pod specification. This approach ensures that the sidecars remain operational, supporting other containers within the Pod, until their service is no longer required.
Jobs with sidecar containers
If you define a Job that uses sidecar using Kubernetes-style init containers, the sidecar container in each Pod does not prevent the Job from completing after the main container has finished.
Here's an example of a Job with two containers, one of which is a sidecar:
apiVersion: batch/v1
kind: Job
metadata:
name: myjob
spec:
template:
spec:
containers:
- name: myjob
image: alpine:latest
command: ['sh', '-c', 'echo "logging" > /opt/logs.txt']
volumeMounts:
- name: data
mountPath: /opt
initContainers:
- name: logshipper
image: alpine:latest
restartPolicy: Always
command: ['sh', '-c', 'tail -F /opt/logs.txt']
volumeMounts:
- name: data
mountPath: /opt
restartPolicy: Never
volumes:
- name: data
emptyDir: {}
Differences from application containers
Sidecar containers run alongside app containers in the same pod. However, they do not execute the primary application logic; instead, they provide supporting functionality to the main application.
Sidecar containers have their own independent lifecycles. They can be started, stopped, and restarted independently of app containers. This means you can update, scale, or maintain sidecar containers without affecting the primary application.
Sidecar containers share the same network and storage namespaces with the primary container. This co-location allows them to interact closely and share resources.
From the Kubernetes perspective, graceful termination of sidecar containers is less important. When other containers have used up all of the allotted graceful termination time, the sidecar containers receive the SIGTERM signal, followed by SIGKILL, sooner than might otherwise be expected. Exit codes different from 0 (0 indicates successful exit) are therefore normal for sidecar containers on Pod termination and should generally be ignored by external tooling.
Differences from init containers
Sidecar containers work alongside the main container, extending its functionality and providing additional services.
Sidecar containers run concurrently with the main application container. They are active throughout the lifecycle of the pod and can be started and stopped independently of the main container. Unlike init containers, sidecar containers support probes to control their lifecycle.
Sidecar containers can interact directly with the main application containers, because like init containers they always share the same network, and can optionally also share volumes (filesystems).
Init containers stop before the main containers start up, so init containers cannot
exchange messages with the app container in a Pod. Any data passing is one-way
(for example, an init container can put information inside an emptyDir
volume).
Changing the image of a sidecar container will not cause the Pod to restart, but will trigger a container restart.
Resource sharing within containers
Given the order of execution for init, sidecar and app containers, the following rules for resource usage apply:
- The highest of any particular resource request or limit defined on all init containers is the effective init request/limit. If any resource has no resource limit specified this is considered as the highest limit.
- The Pod's effective request/limit for a resource is the sum of
pod overhead and the higher of:
- the sum of all non-init containers(app and sidecar containers) request/limit for a resource
- the effective init request/limit for a resource
- Scheduling is done based on effective requests/limits, which means init containers can reserve resources for initialization that are not used during the life of the Pod.
- The Pod's effective QoS (quality of service) tier is the QoS tier for all init, sidecar and app containers alike.
Quota and limits are applied based on the effective Pod request and limit.
Sidecar containers and Linux cgroups
On Linux, resource allocations for Pod level control groups (cgroups) are based on the effective Pod request and limit, the same as the scheduler.
What's next
- Learn how to Adopt Sidecar Containers
- Read a blog post on native sidecar containers.
- Read about creating a Pod that has an init container.
- Learn about the types of probes: liveness, readiness, startup probe.
- Learn about pod overhead.
3.4.1.4 - Ephemeral Containers
Kubernetes v1.25 [stable]
This page provides an overview of ephemeral containers: a special type of container that runs temporarily in an existing Pod to accomplish user-initiated actions such as troubleshooting. You use ephemeral containers to inspect services rather than to build applications.
Understanding ephemeral containers
Pods are the fundamental building block of Kubernetes applications. Since Pods are intended to be disposable and replaceable, you cannot add a container to a Pod once it has been created. Instead, you usually delete and replace Pods in a controlled fashion using deployments.
Sometimes it's necessary to inspect the state of an existing Pod, however, for example to troubleshoot a hard-to-reproduce bug. In these cases you can run an ephemeral container in an existing Pod to inspect its state and run arbitrary commands.
What is an ephemeral container?
Ephemeral containers differ from other containers in that they lack guarantees
for resources or execution, and they will never be automatically restarted, so
they are not appropriate for building applications. Ephemeral containers are
described using the same ContainerSpec
as regular containers, but many fields
are incompatible and disallowed for ephemeral containers.
- Ephemeral containers may not have ports, so fields such as ports, livenessProbe, and readinessProbe are disallowed.
- Pod resource allocations are immutable, so setting resources is disallowed.
- For a complete list of allowed fields, see the EphemeralContainer reference documentation.
Ephemeral containers are created using a special ephemeralcontainers handler in the API rather than by adding them directly to pod.spec, so it's not possible to add an ephemeral container using kubectl edit.
Like regular containers, you may not change or remove an ephemeral container after you have added it to a Pod.
Note:
Ephemeral containers are not supported by static pods.
Uses for ephemeral containers
Ephemeral containers are useful for interactive troubleshooting when kubectl exec is insufficient because a container has crashed or a container image doesn't include debugging utilities.
In particular, distroless images enable you to deploy minimal container images that reduce attack surface and exposure to bugs and vulnerabilities. Since distroless images do not include a shell or any debugging utilities, it's difficult to troubleshoot distroless images using kubectl exec alone.
When using ephemeral containers, it's helpful to enable process namespace sharing so you can view processes in other containers.
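For example, kubectl debug can add an ephemeral container to a running Pod; the Pod and target container names below are placeholders, and --target shares the process namespace with the named container where the container runtime supports it:
kubectl debug -it <pod-name> --image=busybox:1.28 --target=<app-container-name>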
What's next
- Learn how to debug pods using ephemeral containers.
3.4.1.5 - Disruptions
This guide is for application owners who want to build highly available applications, and thus need to understand what types of disruptions can happen to Pods.
It is also for cluster administrators who want to perform automated cluster actions, like upgrading and autoscaling clusters.
Voluntary and involuntary disruptions
Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.
We call these unavoidable cases involuntary disruptions to an application. Examples are:
- a hardware failure of the physical machine backing the node
- cluster administrator deletes VM (instance) by mistake
- cloud provider or hypervisor failure makes VM disappear
- a kernel panic
- the node disappears from the cluster due to cluster network partition
- eviction of a pod due to the node being out-of-resources.
Except for the out-of-resources condition, all these conditions should be familiar to most users; they are not specific to Kubernetes.
We call other cases voluntary disruptions. These include both actions initiated by the application owner and those initiated by a Cluster Administrator. Typical application owner actions include:
- deleting the deployment or other controller that manages the pod
- updating a deployment's pod template causing a restart
- directly deleting a pod (e.g. by accident)
Cluster administrator actions include:
- Draining a node for repair or upgrade.
- Draining a node from a cluster to scale the cluster down (learn about Cluster Autoscaling).
- Removing a pod from a node to permit something else to fit on that node.
These actions might be taken directly by the cluster administrator, or by automation run by the cluster administrator, or by your cluster hosting provider.
Ask your cluster administrator or consult your cloud provider or distribution documentation to determine if any sources of voluntary disruptions are enabled for your cluster. If none are enabled, you can skip creating Pod Disruption Budgets.
Caution:
Not all voluntary disruptions are constrained by Pod Disruption Budgets. For example, deleting deployments or pods bypasses Pod Disruption Budgets.
Dealing with disruptions
Here are some ways to mitigate involuntary disruptions:
- Ensure your pod requests the resources it needs.
- Replicate your application if you need higher availability. (Learn about running replicated stateless and stateful applications.)
- For even higher availability when running replicated applications, spread applications across racks (using anti-affinity) or across zones (if using a multi-zone cluster.)
The frequency of voluntary disruptions varies. On a basic Kubernetes cluster, there are no automated voluntary disruptions (only user-triggered ones). However, your cluster administrator or hosting provider may run some additional services which cause voluntary disruptions. For example, rolling out node software updates can cause voluntary disruptions. Also, some implementations of cluster (node) autoscaling may cause voluntary disruptions to defragment and compact nodes. Your cluster administrator or hosting provider should have documented what level of voluntary disruptions, if any, to expect. Certain configuration options, such as using PriorityClasses in your pod spec can also cause voluntary (and involuntary) disruptions.
Pod disruption budgets
Kubernetes v1.21 [stable]
Kubernetes offers features to help you run highly available applications even when you introduce frequent voluntary disruptions.
As an application owner, you can create a PodDisruptionBudget (PDB) for each application. A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions. For example, a quorum-based application would like to ensure that the number of replicas running is never brought below the number needed for a quorum. A web front end might want to ensure that the number of replicas serving load never falls below a certain percentage of the total.
Cluster managers and hosting providers should use tools which respect PodDisruptionBudgets by calling the Eviction API instead of directly deleting pods or deployments.
For example, the kubectl drain subcommand lets you mark a node as going out of service. When you run kubectl drain, the tool tries to evict all of the Pods on the Node you're taking out of service. The eviction request that kubectl submits on your behalf may be temporarily rejected, so the tool periodically retries all failed requests until all Pods on the target node are terminated, or until a configurable timeout is reached.
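For instance, a typical drain invocation looks like this; node-1 is a placeholder, and the flags shown are commonly used options to skip DaemonSet-managed Pods and to allow removal of Pods that use emptyDir volumes:
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data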
A PDB specifies the number of replicas that an application can tolerate having, relative to how many it is intended to have. For example, a Deployment which has a .spec.replicas: 5 is supposed to have 5 pods at any given time. If its PDB allows for there to be 4 at a time, then the Eviction API will allow voluntary disruption of one (but not two) pods at a time.
The group of pods that comprise the application is specified using a label selector, the same as the one used by the application's controller (deployment, stateful-set, etc).
The "intended" number of pods is computed from the .spec.replicas of the workload resource that is managing those pods. The control plane discovers the owning workload resource by examining the .metadata.ownerReferences of the Pod.
Involuntary disruptions cannot be prevented by PDBs; however they do count against the budget.
Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but workload resources (such as Deployment and StatefulSet) are not limited by PDBs when doing rolling upgrades. Instead, the handling of failures during application updates is configured in the spec for the specific workload resource.
It is recommended to set the AlwaysAllow Unhealthy Pod Eviction Policy on your PodDisruptionBudgets to support eviction of misbehaving applications during a node drain. The default behavior is to wait for the application pods to become healthy before the drain can proceed.
When a pod is evicted using the eviction API, it is gracefully terminated, honoring the terminationGracePeriodSeconds setting in its PodSpec.
PodDisruptionBudget example
Consider a cluster with 3 nodes, node-1 through node-3. The cluster is running several applications. One of them has 3 replicas, initially called pod-a, pod-b, and pod-c. Another, unrelated pod without a PDB, called pod-x, is also shown.
Initially, the pods are laid out as follows:
| node-1          | node-2          | node-3          |
|-----------------|-----------------|-----------------|
| pod-a available | pod-b available | pod-c available |
| pod-x available |                 |                 |
All 3 pods are part of a deployment, and they collectively have a PDB which requires at least 2 of the 3 pods to be available at all times.
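A minimal PodDisruptionBudget matching this example might look like the following sketch; the object name and the app: example-app label are assumptions, and the selector must match the labels used by the deployment's Pods:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb            # hypothetical name
spec:
  minAvailable: 2              # at least 2 of the 3 pods must stay available
  selector:
    matchLabels:
      app: example-app         # hypothetical label; must match the deployment's Pod labels
  # Optional, as recommended above: allow eviction of unhealthy (not Ready) pods during a drain
  unhealthyPodEvictionPolicy: AlwaysAllow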
For example, assume the cluster administrator wants to reboot into a new kernel version to fix a bug in the kernel.
The cluster administrator first tries to drain node-1 using the kubectl drain command. That tool tries to evict pod-a and pod-x. This succeeds immediately. Both pods go into the terminating state at the same time.
This puts the cluster in this state:
| node-1 draining   | node-2          | node-3          |
|-------------------|-----------------|-----------------|
| pod-a terminating | pod-b available | pod-c available |
| pod-x terminating |                 |                 |
The deployment notices that one of the pods is terminating, so it creates a replacement called pod-d. Since node-1 is cordoned, it lands on another node. Something has also created pod-y as a replacement for pod-x.
(Note: for a StatefulSet, pod-a, which would be called something like pod-0, would need to terminate completely before its replacement, which is also called pod-0 but has a different UID, could be created. Otherwise, the example applies to a StatefulSet as well.)
Now the cluster is in this state:
| node-1 draining   | node-2          | node-3          |
|-------------------|-----------------|-----------------|
| pod-a terminating | pod-b available | pod-c available |
| pod-x terminating | pod-d starting  | pod-y           |
At some point, the pods terminate, and the cluster looks like this:
| node-1 drained | node-2          | node-3          |
|----------------|-----------------|-----------------|
|                | pod-b available | pod-c available |
|                | pod-d starting  | pod-y           |
At this point, if an impatient cluster administrator tries to drain node-2 or node-3, the drain command will block, because there are only 2 available pods for the deployment, and its PDB requires at least 2. After some time passes, pod-d becomes available.
The cluster state now looks like this:
| node-1 drained | node-2          | node-3          |
|----------------|-----------------|-----------------|
|                | pod-b available | pod-c available |
|                | pod-d available | pod-y           |
Now, the cluster administrator tries to drain node-2. The drain command will try to evict the two pods in some order, say pod-b first and then pod-d. It will succeed at evicting pod-b. But, when it tries to evict pod-d, it will be refused because that would leave only one pod available for the deployment. The deployment then creates a replacement for pod-b called pod-e.
Because there are not enough resources in the cluster to schedule pod-e, the drain will again block. The cluster may end up in this state:
| node-1 drained | node-2            | node-3          | no node       |
|----------------|-------------------|-----------------|---------------|
|                | pod-b terminating | pod-c available | pod-e pending |
|                | pod-d available   | pod-y           |               |
At this point, the cluster administrator needs to add a node back to the cluster to proceed with the upgrade.
You can see how Kubernetes varies the rate at which disruptions can happen, according to:
- how many replicas an application needs
- how long it takes to gracefully shut down an instance
- how long it takes a new instance to start up
- the type of controller
- the cluster's resource capacity
Pod disruption conditions
Kubernetes v1.31 [stable]
(enabled by default: true)
A dedicated Pod condition DisruptionTarget is added to indicate that the Pod is about to be deleted due to a disruption. The reason field of the condition additionally indicates one of the following reasons for the Pod termination:
- PreemptionByScheduler: Pod is due to be preempted by a scheduler in order to accommodate a new Pod with a higher priority. For more information, see Pod priority preemption.
- DeletionByTaintManager: Pod is due to be deleted by Taint Manager (which is part of the node lifecycle controller within kube-controller-manager) due to a NoExecute taint that the Pod does not tolerate; see taint-based evictions.
- EvictionByEvictionAPI: Pod has been marked for eviction using the Kubernetes API.
- DeletionByPodGC: Pod, which is bound to a no longer existing Node, is due to be deleted by Pod garbage collection.
- TerminationByKubelet: Pod has been terminated by the kubelet, because of either node pressure eviction, the graceful node shutdown, or preemption for system critical pods.
In all other disruption scenarios, like eviction due to exceeding Pod container limits, Pods don't receive the DisruptionTarget condition because the disruptions were probably caused by the Pod and would reoccur on retry.
Note:
A Pod disruption might be interrupted. The control plane might re-attempt to continue the disruption of the same Pod, but it is not guaranteed. As a result, the DisruptionTarget condition might be added to a Pod, but that Pod might then not actually be deleted. In such a situation, after some time, the Pod disruption condition will be cleared.
Along with cleaning up the pods, the Pod garbage collector (PodGC) will also mark them as failed if they are in a non-terminal phase (see also Pod garbage collection).
When using a Job (or CronJob), you may want to use these Pod disruption conditions as part of your Job's Pod failure policy.
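For instance, a Job could ignore failures of Pods that carry the DisruptionTarget condition so that disruptions do not count against its backoff limit. This is a sketch only; the Job name and image are hypothetical:
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job                        # hypothetical name
spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: Ignore                       # don't count disrupted Pods against backoffLimit
      onPodConditions:
      - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never                 # required when using a Pod failure policy
      containers:
      - name: main
        image: registry.example/batch-task:1.0   # hypothetical image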
Separating Cluster Owner and Application Owner Roles
Often, it is useful to think of the Cluster Manager and Application Owner as separate roles with limited knowledge of each other. This separation of responsibilities may make sense in these scenarios:
- when there are many application teams sharing a Kubernetes cluster, and there is natural specialization of roles
- when third-party tools or services are used to automate cluster management
Pod Disruption Budgets support this separation of roles by providing an interface between the roles.
If you do not have such a separation of responsibilities in your organization, you may not need to use Pod Disruption Budgets.
How to perform Disruptive Actions on your Cluster
If you are a Cluster Administrator, and you need to perform a disruptive action on all the nodes in your cluster, such as a node or system software upgrade, here are some options:
- Accept downtime during the upgrade.
- Failover to another complete replica cluster.
  - No downtime, but may be costly both for the duplicated nodes and for human effort to orchestrate the switchover.
- Write disruption tolerant applications and use PDBs.
  - No downtime.
  - Minimal resource duplication.
  - Allows more automation of cluster administration.
  - Writing disruption-tolerant applications is tricky, but the work to tolerate voluntary disruptions largely overlaps with work to support autoscaling and tolerating involuntary disruptions.
What's next
- Follow steps to protect your application by configuring a Pod Disruption Budget.
- Learn more about draining nodes.
- Learn about updating a deployment including steps to maintain its availability during the rollout.
3.4.1.6 - Pod Quality of Service Classes
This page introduces Quality of Service (QoS) classes in Kubernetes, and explains how Kubernetes assigns a QoS class to each Pod as a consequence of the resource constraints that you specify for the containers in that Pod. Kubernetes relies on this classification to make decisions about which Pods to evict when there are not enough available resources on a Node.
Quality of Service classes
Kubernetes classifies the Pods that you run and allocates each Pod into a specific
quality of service (QoS) class. Kubernetes uses that classification to influence how different
pods are handled. Kubernetes does this classification based on the
resource requests
of the Containers in that Pod, along with
how those requests relate to resource limits.
This is known as the Quality of Service (QoS) class. Kubernetes assigns every Pod a QoS class based on the resource requests and limits of its component Containers. QoS classes are used by Kubernetes to decide which Pods to evict from a Node experiencing Node Pressure. The possible QoS classes are Guaranteed, Burstable, and BestEffort. When a Node runs out of resources, Kubernetes will first evict BestEffort Pods running on that Node, followed by Burstable and finally Guaranteed Pods. When this eviction is due to resource pressure, only Pods exceeding resource requests are candidates for eviction.
Guaranteed
Pods that are Guaranteed have the strictest resource limits and are least likely to face eviction. They are guaranteed not to be killed until they exceed their limits or there are no lower-priority Pods that can be preempted from the Node. They may not acquire resources beyond their specified limits. These Pods can also make use of exclusive CPUs using the static CPU management policy.
Criteria
For a Pod to be given a QoS class of Guaranteed (see the example after this list):
- Every Container in the Pod must have a memory limit and a memory request.
- For every Container in the Pod, the memory limit must equal the memory request.
- Every Container in the Pod must have a CPU limit and a CPU request.
- For every Container in the Pod, the CPU limit must equal the CPU request.
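For instance, the following sketch of a Pod would be classified as Guaranteed because every container sets CPU and memory limits equal to its requests; the name, image, and values are illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed-example            # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example/app:1.0       # hypothetical image
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "500m"                       # equal to the CPU request
        memory: "256Mi"                   # equal to the memory request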
Burstable
Pods that are Burstable have some lower-bound resource guarantees based on the request, but do not require a specific limit. If a limit is not specified, it defaults to a limit equivalent to the capacity of the Node, which allows the Pods to flexibly increase their resources if resources are available. In the event of Pod eviction due to Node resource pressure, these Pods are evicted only after all BestEffort Pods are evicted. Because a Burstable Pod can include a Container that has no resource limits or requests, a Pod that is Burstable can try to use any amount of node resources.
Criteria
A Pod is given a QoS class of Burstable if (see the example after this list):
- The Pod does not meet the criteria for QoS class Guaranteed.
- At least one Container in the Pod has a memory or CPU request or limit.
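For instance, this sketch of a Pod would be classified as Burstable because it sets a memory request without equal limits for every resource; the name, image, and values are illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable-example             # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example/app:1.0       # hypothetical image
    resources:
      requests:
        memory: "128Mi"                   # a request without matching limits makes the Pod Burstable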
BestEffort
Pods in the BestEffort QoS class can use node resources that aren't specifically assigned to Pods in other QoS classes. For example, if you have a node with 16 CPU cores available to the kubelet, and you assign 4 CPU cores to a Guaranteed Pod, then a Pod in the BestEffort QoS class can try to use any amount of the remaining 12 CPU cores.
The kubelet prefers to evict BestEffort Pods if the node comes under resource pressure.
Criteria
A Pod has a QoS class of BestEffort if it doesn't meet the criteria for either Guaranteed or Burstable. In other words, a Pod is BestEffort only if none of the Containers in the Pod have a memory limit or a memory request, and none of the Containers in the Pod have a CPU limit or a CPU request. Containers in a Pod can request other resources (not CPU or memory) and still be classified as BestEffort.
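You can check which QoS class Kubernetes assigned to a running Pod by reading its status.qosClass field; the Pod name below is a placeholder:
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
# Example output: Guaranteed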
Memory QoS with cgroup v2
Kubernetes v1.22 [alpha]
(enabled by default: false)
Memory QoS uses the memory controller of cgroup v2 to guarantee memory resources in Kubernetes. Memory requests and limits of containers in a pod are used to set specific interfaces, memory.min and memory.high, provided by the memory controller. When memory.min is set to the memory requests, memory resources are reserved and never reclaimed by the kernel; this is how Memory QoS ensures memory availability for Kubernetes pods. And if memory limits are set in the container, this means that the system needs to limit container memory usage; Memory QoS uses memory.high to throttle a workload approaching its memory limit, ensuring that the system is not overwhelmed by instantaneous memory allocation.
Memory QoS relies on QoS class to determine which settings to apply; however, these are different mechanisms that both provide controls over quality of service.
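As a rough illustration of the mapping described above (this is a sketch only, not the kubelet's exact calculation), a container's memory resources relate to the cgroup v2 interfaces like this:
# Sketch: container resources and the cgroup v2 memory interfaces they influence
resources:
  requests:
    memory: "256Mi"    # memory.min is derived from the memory request (reserved, not reclaimed)
  limits:
    memory: "512Mi"    # memory.high is used to throttle usage approaching this limit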
Some behavior is independent of QoS class
Certain behavior is independent of the QoS class assigned by Kubernetes. For example:
Any Container exceeding a resource limit will be killed and restarted by the kubelet without affecting other Containers in that Pod.
If a Container exceeds its resource request and the node it runs on faces resource pressure, the Pod it is in becomes a candidate for eviction. If this occurs, all Containers in the Pod will be terminated. Kubernetes may create a replacement Pod, usually on a different node.
The resource request of a Pod is equal to the sum of the resource requests of its component Containers, and the resource limit of a Pod is equal to the sum of the resource limits of its component Containers.
The kube-scheduler does not consider QoS class when selecting which Pods to preempt. Preemption can occur when a cluster does not have enough resources to run all the Pods you defined.
What's next
- Learn about resource management for Pods and Containers.
- Learn about Node-pressure eviction.
- Learn about Pod priority and preemption.
- Learn about Pod disruptions.
- Learn how to assign memory resources to containers and pods.
- Learn how to assign CPU resources to containers and pods.
- Learn how to configure Quality of Service for Pods.
3.4.1.7 - User Namespaces
Kubernetes v1.30 [beta]
This page explains how user namespaces are used in Kubernetes pods. A user namespace isolates the user running inside the container from the one in the host.
A process running as root in a container can run as a different (non-root) user in the host; in other words, the process has full privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace.
You can use this feature to reduce the damage a compromised container can do to the host or to other pods on the same node. There are several security vulnerabilities rated either HIGH or CRITICAL that were not exploitable when user namespaces are active. It is expected that user namespaces will mitigate some future vulnerabilities too.
Before you begin
This is a Linux-only feature and support is needed in Linux for idmap mounts on the filesystems used. This means:
- On the node, the filesystem you use for /var/lib/kubelet/pods/, or the custom directory you configure for this, needs idmap mount support.
- All the filesystems used in the pod's volumes must support idmap mounts.
In practice this means you need at least Linux 6.3, as tmpfs started supporting idmap mounts in that version. This is usually needed as several Kubernetes features use tmpfs (the service account token that is mounted by default uses a tmpfs, Secrets use a tmpfs, etc.)
Some popular filesystems that support idmap mounts in Linux 6.3 are: btrfs, ext4, xfs, fat, tmpfs, overlayfs.
In addition, the container runtime and its underlying OCI runtime must support user namespaces. The following OCI runtimes offer support:
Note:
Some OCI runtimes do not include the support needed for using user namespaces in Linux pods. If you use a managed Kubernetes offering, or have downloaded it from packages and set it up, it's possible that nodes in your cluster use a runtime that doesn't include this support.
To use user namespaces with Kubernetes pods, you also need a CRI container runtime that supports this feature:
- containerd: version 2.0 (and later) supports user namespaces for containers.
- CRI-O: version 1.25 (and later) supports user namespaces for containers.
You can see the status of user namespaces support in cri-dockerd tracked in an issue on GitHub.
Introduction
User namespaces is a Linux feature that allows you to map users in the container to different users in the host. Furthermore, the capabilities granted to a pod in a user namespace are valid only in the namespace and void outside of it.
A pod can opt in to using user namespaces by setting the pod.spec.hostUsers field to false.
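For example, a Pod that opts in to user namespaces could look like this sketch; the name and image are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: userns-example                    # hypothetical name
spec:
  hostUsers: false                        # opt in to a user namespace for this Pod
  containers:
  - name: app
    image: registry.example/app:1.0       # hypothetical image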
The kubelet will pick host UIDs/GIDs a pod is mapped to, and will do so in a way to guarantee that no two pods on the same node use the same mapping.
The runAsUser, runAsGroup, fsGroup, etc. fields in the pod.spec always refer to the user inside the container.
The valid UIDs/GIDs when this feature is enabled are in the range 0-65535. This applies to files and processes (runAsUser, runAsGroup, etc.). Files using a UID/GID outside this range will be seen as belonging to the overflow ID, usually 65534 (configured in /proc/sys/kernel/overflowuid and /proc/sys/kernel/overflowgid). However, it is not possible to modify those files, even by running as the 65534 user/group.
Most applications that need to run as root but don't access other host namespaces or resources should continue to run fine without any changes if user namespaces are activated.
Understanding user namespaces for pods
Several container runtimes with their default configuration (like Docker Engine, containerd, CRI-O) use Linux namespaces for isolation. Other technologies exist and can be used with those runtimes too (e.g. Kata Containers uses VMs instead of Linux namespaces). This page is applicable for container runtimes using Linux namespaces for isolation.
When creating a pod, by default, several new namespaces are used for isolation: a network namespace to isolate the network of the container, a PID namespace to isolate the view of processes, etc. If a user namespace is used, this will isolate the users in the container from the users in the node.
This means containers can run as root and be mapped to a non-root user on the host. Inside the container the process will think it is running as root (and therefore tools like apt, yum, etc. work fine), while in reality the process doesn't have privileges on the host. You can verify this, for example, by checking which user the container process is running as when you execute ps aux from the host. The user ps shows is not the same as the user you see if you run the id command inside the container.
This abstraction limits what can happen if, for example, the container manages to escape to the host. Given that the container is running as a non-privileged user on the host, what it can do to the host is limited. Furthermore, as users on each pod will be mapped to different non-overlapping users on the host, what they can do to other pods is limited too. Capabilities granted to a pod are also limited to the pod user namespace and mostly invalid outside of it; some are even completely void. Here are two examples:
- CAP_SYS_MODULE does not have any effect if granted to a pod using user namespaces; the pod isn't able to load kernel modules.
- CAP_SYS_ADMIN is limited to the pod's user namespace and invalid outside of it.
Without using a user namespace, a container running as root has, in the case of a container breakout, root privileges on the node. And if some capabilities were granted to the container, those capabilities are valid on the host too. None of this is true when user namespaces are in use.
If you want to know more details about what changes when user namespaces are in use, see man 7 user_namespaces.
Set up a node to support user namespaces
By default, the kubelet assigns pods UIDs/GIDs above the range 0-65535, based on the assumption that the host's files and processes use UIDs/GIDs within this range, which is standard for most Linux distributions. This approach prevents any overlap between the UIDs/GIDs of the host and those of the pods.
Avoiding the overlap is important to mitigate the impact of vulnerabilities such as CVE-2021-25741, where a pod can potentially read arbitrary files on the host. If the UIDs/GIDs of the pod and the host don't overlap, what a pod can do is limited: the pod's UID/GID won't match the host's file owner/group.
The kubelet can use a custom range for user IDs and group IDs for pods. To configure a custom range, the node needs to have:
- A user kubelet in the system (you cannot use any other username here)
- The binary getsubids installed (part of shadow-utils) and in the PATH for the kubelet binary.
- A configuration of subordinate UIDs/GIDs for the kubelet user (see man 5 subuid and man 5 subgid).
This setting only gathers the UID/GID range configuration and does not change the user executing the kubelet.
You must follow some constraints for the subordinate ID range that you assign to the kubelet user:
- The subordinate user ID that starts the UID range for Pods must be a multiple of 65536 and must also be greater than or equal to 65536. In other words, you cannot use any ID from the range 0-65535 for Pods; the kubelet imposes this restriction to make it difficult to create an accidentally insecure configuration.
- The subordinate ID count must be a multiple of 65536 (for Kubernetes 1.32 the subordinate ID count for each Pod is hard-coded to 65536).
- The subordinate ID count must be at least 65536 x <maxPods>, where <maxPods> is the maximum number of pods that can run on the node.
- You must assign the same range for both user IDs and for group IDs. It doesn't matter if other users have user ID ranges that don't align with the group ID ranges.
- None of the assigned ranges should overlap with any other assignment.
- The subordinate configuration must be only one line. In other words, you can't have multiple ranges.
For example, you could define /etc/subuid and /etc/subgid to both have these entries for the kubelet user:
# The format is
# name:firstID:count of IDs
# where
# - firstID is 65536 (the minimum value possible)
# - count of IDs is 110 * 65536
# (110 is the default limit for number of pods on the node)
kubelet:65536:7208960
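Assuming getsubids from shadow-utils is installed, you can list the subordinate ranges assigned to the kubelet user to verify the configuration; the exact output format may differ between versions:
getsubids kubelet        # subordinate UIDs for the kubelet user
getsubids -g kubelet     # subordinate GIDs for the kubelet user
# Expect a single range starting at 65536 with a count of 7208960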
Integration with Pod security admission checks
Kubernetes v1.29 [alpha]
For Linux Pods that enable user namespaces, Kubernetes relaxes the application of
Pod Security Standards in a controlled way.
This behavior can be controlled by the feature gate UserNamespacesPodSecurityStandards, which allows an early opt-in for end users. Admins have to ensure that user namespaces are enabled by all nodes within the cluster if using the feature gate.
If you enable the associated feature gate and create a Pod that uses user namespaces, the following fields won't be constrained even in contexts that enforce the Baseline or Restricted pod security standard. This behavior does not present a security concern because root inside a Pod with user namespaces actually refers to the user inside the container, which is never mapped to a privileged user on the host. Here's the list of fields that are not checked for Pods in those circumstances:
spec.securityContext.runAsNonRoot
spec.containers[*].securityContext.runAsNonRoot
spec.initContainers[*].securityContext.runAsNonRoot
spec.ephemeralContainers[*].securityContext.runAsNonRoot
spec.securityContext.runAsUser
spec.containers[*].securityContext.runAsUser
spec.initContainers[*].securityContext.runAsUser
spec.ephemeralContainers[*].securityContext.runAsUser
Limitations
When using a user namespace for the pod, it is disallowed to use other host namespaces. In particular, if you set hostUsers: false then you are not allowed to set any of:
- hostNetwork: true
- hostIPC: true
- hostPID: true
What's next
- Take a look at Use a User Namespace With a Pod
3.4.1.8 - Downward API
It is sometimes useful for a container to have information about itself, without being overly coupled to Kubernetes. The downward API allows containers to consume information about themselves or the cluster without using the Kubernetes client or API server.
An example is an existing application that assumes a particular well-known environment variable holds a unique identifier. One possibility is to wrap the application, but that is tedious and error-prone, and it violates the goal of low coupling. A better option would be to use the Pod's name as an identifier, and inject the Pod's name into the well-known environment variable.
In Kubernetes, there are two ways to expose Pod and container fields to a running container:
Together, these two ways of exposing Pod and container fields are called the downward API.
Available fields
Only some Kubernetes API fields are available through the downward API. This section lists which fields you can make available.
You can pass information from available Pod-level fields using fieldRef. At the API level, the spec for a Pod always defines at least one Container. You can pass information from available Container-level fields using resourceFieldRef.
Information available via fieldRef
For some Pod-level fields, you can provide them to a container either as an environment variable or using a downwardAPI volume. The fields available via either mechanism are:
- metadata.name: the pod's name
- metadata.namespace: the pod's namespace
- metadata.uid: the pod's unique ID
- metadata.annotations['<KEY>']: the value of the pod's annotation named <KEY> (for example, metadata.annotations['myannotation'])
- metadata.labels['<KEY>']: the text value of the pod's label named <KEY> (for example, metadata.labels['mylabel'])
The following information is available through environment variables but not as a downwardAPI volume fieldRef:
- spec.serviceAccountName: the name of the pod's service account
- spec.nodeName: the name of the node where the Pod is executing
- status.hostIP: the primary IP address of the node to which the Pod is assigned
- status.hostIPs: a dual-stack version of status.hostIP; the first IP address is always the same as status.hostIP
- status.podIP: the pod's primary IP address (usually, its IPv4 address)
- status.podIPs: a dual-stack version of status.podIP; the first IP address is always the same as status.podIP
The following information is available through a downwardAPI volume fieldRef, but not as environment variables:
- metadata.labels: all of the pod's labels, formatted as label-key="escaped-label-value" with one label per line
- metadata.annotations: all of the pod's annotations, formatted as annotation-key="escaped-annotation-value" with one annotation per line
Information available via resourceFieldRef
These container-level fields allow you to provide information about requests and limits for resources such as CPU and memory.
- resource: limits.cpu - A container's CPU limit
- resource: requests.cpu - A container's CPU request
- resource: limits.memory - A container's memory limit
- resource: requests.memory - A container's memory request
- resource: limits.hugepages-* - A container's hugepages limit
- resource: requests.hugepages-* - A container's hugepages request
- resource: limits.ephemeral-storage - A container's ephemeral-storage limit
- resource: requests.ephemeral-storage - A container's ephemeral-storage request
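As a combined illustration, a container could expose its Pod name via fieldRef and its CPU limit via resourceFieldRef as environment variables; the Pod name, container name, and image below are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: downward-api-example              # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example/app:1.0       # hypothetical image
    resources:
      limits:
        cpu: "500m"
        memory: "128Mi"
    env:
    - name: MY_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name        # Pod-level field
    - name: MY_CPU_LIMIT
      valueFrom:
        resourceFieldRef:
          containerName: app
          resource: limits.cpu            # Container-level field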
Fallback information for resource limits
If CPU and memory limits are not specified for a container, and you use the downward API to try to expose that information, then the kubelet defaults to exposing the maximum allocatable value for CPU and memory based on the node allocatable calculation.
What's next
You can read about downwardAPI volumes.
You can try using the downward API to expose container- or Pod-level information:
3.4.2 - Workload Management
Kubernetes provides several built-in APIs for declarative management of your workloads and the components of those workloads.
Ultimately, your applications run as containers inside Pods; however, managing individual Pods would be a lot of effort. For example, if a Pod fails, you probably want to run a new Pod to replace it. Kubernetes can do that for you.
You use the Kubernetes API to create a workload object that represents a higher abstraction level than a Pod, and then the Kubernetes control plane automatically manages Pod objects on your behalf, based on the specification for the workload object you defined.
The built-in APIs for managing workloads are:
Deployment (and, indirectly, ReplicaSet), the most common way to run an application on your cluster. Deployment is a good fit for managing a stateless application workload on your cluster, where any Pod in the Deployment is interchangeable and can be replaced if needed. (Deployments are a replacement for the legacy ReplicationController API).
A StatefulSet lets you manage one or more Pods – all running the same application code – where the Pods rely on having a distinct identity. This is different from a Deployment where the Pods are expected to be interchangeable. The most common use for a StatefulSet is to be able to make a link between its Pods and their persistent storage. For example, you can run a StatefulSet that associates each Pod with a PersistentVolume. If one of the Pods in the StatefulSet fails, Kubernetes makes a replacement Pod that is connected to the same PersistentVolume.
A DaemonSet defines Pods that provide facilities that are local to a specific node; for example, a driver that lets containers on that node access a storage system. You use a DaemonSet when the driver, or other node-level service, has to run on the node where it's useful. Each Pod in a DaemonSet performs a role similar to a system daemon on a classic Unix / POSIX server. A DaemonSet might be fundamental to the operation of your cluster, such as a plugin to let that node access cluster networking, it might help you to manage the node, or it could provide less essential facilities that enhance the container platform you are running. You can run DaemonSets (and their pods) across every node in your cluster, or across just a subset (for example, only install the GPU accelerator driver on nodes that have a GPU installed).
You can use a Job and / or a CronJob to define tasks that run to completion and then stop. A Job represents a one-off task, whereas each CronJob repeats according to a schedule.
Other topics in this section:
3.4.2.1 - Deployments
A Deployment provides declarative updates for Pods and ReplicaSets.
You describe a desired state in a Deployment, and the Deployment Controller changes the actual state to the desired state at a controlled rate. You can define Deployments to create new ReplicaSets, or to remove existing Deployments and adopt all their resources with new Deployments.
Note:
Do not manage ReplicaSets owned by a Deployment. Consider opening an issue in the main Kubernetes repository if your use case is not covered below.
Use Case
The following are typical use cases for Deployments:
- Create a Deployment to rollout a ReplicaSet. The ReplicaSet creates Pods in the background. Check the status of the rollout to see if it succeeds or not.
- Declare the new state of the Pods by updating the PodTemplateSpec of the Deployment. A new ReplicaSet is created and the Deployment manages moving the Pods from the old ReplicaSet to the new one at a controlled rate. Each new ReplicaSet updates the revision of the Deployment.
- Rollback to an earlier Deployment revision if the current state of the Deployment is not stable. Each rollback updates the revision of the Deployment.
- Scale up the Deployment to facilitate more load.
- Pause the rollout of a Deployment to apply multiple fixes to its PodTemplateSpec and then resume it to start a new rollout.
- Use the status of the Deployment as an indicator that a rollout has stuck.
- Clean up older ReplicaSets that you don't need anymore.
Creating a Deployment
The following is an example of a Deployment. It creates a ReplicaSet to bring up three nginx Pods:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
labels:
app: nginx
spec:
replicas: 3
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
In this example:
- A Deployment named nginx-deployment is created, indicated by the .metadata.name field. This name will become the basis for the ReplicaSets and Pods which are created later. See Writing a Deployment Spec for more details.
- The Deployment creates a ReplicaSet that creates three replicated Pods, indicated by the .spec.replicas field.
- The .spec.selector field defines how the created ReplicaSet finds which Pods to manage. In this case, you select a label that is defined in the Pod template (app: nginx). However, more sophisticated selection rules are possible, as long as the Pod template itself satisfies the rule.
  Note: The .spec.selector.matchLabels field is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent to an element of matchExpressions, whose key field is "key", the operator is "In", and the values array contains only "value". All of the requirements, from both matchLabels and matchExpressions, must be satisfied in order to match.
- The template field contains the following sub-fields:
  - The Pods are labeled app: nginx using the .metadata.labels field.
  - The Pod template's specification, or .template.spec field, indicates that the Pods run one container, nginx, which runs the nginx Docker Hub image at version 1.14.2.
  - Create one container and name it nginx using the .spec.template.spec.containers[0].name field.
Before you begin, make sure your Kubernetes cluster is up and running. Follow the steps given below to create the above Deployment:
Create the Deployment by running the following command:
kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml
Run kubectl get deployments to check if the Deployment was created. If the Deployment is still being created, the output is similar to the following:
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/3     0            0           1s
When you inspect the Deployments in your cluster, the following fields are displayed:
- NAME lists the names of the Deployments in the namespace.
- READY displays how many replicas of the application are available to your users. It follows the pattern ready/desired.
- UP-TO-DATE displays the number of replicas that have been updated to achieve the desired state.
- AVAILABLE displays how many replicas of the application are available to your users.
- AGE displays the amount of time that the application has been running.
Notice how the number of desired replicas is 3 according to the .spec.replicas field.
To see the Deployment rollout status, run kubectl rollout status deployment/nginx-deployment.
The output is similar to:
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
deployment "nginx-deployment" successfully rolled out
Run the kubectl get deployments again a few seconds later. The output is similar to this:
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3/3     3            3           18s
Notice that the Deployment has created all three replicas, and all replicas are up-to-date (they contain the latest Pod template) and available.
To see the ReplicaSet (rs) created by the Deployment, run kubectl get rs. The output is similar to this:
NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-75675f5897   3         3         3       18s
ReplicaSet output shows the following fields:
- NAME lists the names of the ReplicaSets in the namespace.
- DESIRED displays the desired number of replicas of the application, which you define when you create the Deployment. This is the desired state.
- CURRENT displays how many replicas are currently running.
- READY displays how many replicas of the application are available to your users.
- AGE displays the amount of time that the application has been running.
Notice that the name of the ReplicaSet is always formatted as [DEPLOYMENT-NAME]-[HASH]. This name will become the basis for the Pods which are created.
The HASH string is the same as the pod-template-hash label on the ReplicaSet.
To see the labels automatically generated for each Pod, run kubectl get pods --show-labels. The output is similar to:
NAME                                READY   STATUS    RESTARTS   AGE   LABELS
nginx-deployment-75675f5897-7ci7o   1/1     Running   0          18s   app=nginx,pod-template-hash=75675f5897
nginx-deployment-75675f5897-kzszj   1/1     Running   0          18s   app=nginx,pod-template-hash=75675f5897
nginx-deployment-75675f5897-qqcnn   1/1     Running   0          18s   app=nginx,pod-template-hash=75675f5897
The created ReplicaSet ensures that there are three nginx Pods.
Note:
You must specify an appropriate selector and Pod template labels in a Deployment (in this case, app: nginx).
Do not overlap labels or selectors with other controllers (including other Deployments and StatefulSets). Kubernetes doesn't stop you from overlapping, and if multiple controllers have overlapping selectors those controllers might conflict and behave unexpectedly.
Pod-template-hash label
Caution:
Do not change this label.
The pod-template-hash label is added by the Deployment controller to every ReplicaSet that a Deployment creates or adopts.
This label ensures that child ReplicaSets of a Deployment do not overlap. It is generated by hashing the PodTemplate of the ReplicaSet and using the resulting hash as the label value that is added to the ReplicaSet selector, Pod template labels, and in any existing Pods that the ReplicaSet might have.
Updating a Deployment
Note:
A Deployment's rollout is triggered if and only if the Deployment's Pod template (that is, .spec.template) is changed, for example if the labels or container images of the template are updated. Other updates, such as scaling the Deployment, do not trigger a rollout.
Follow the steps given below to update your Deployment:
Let's update the nginx Pods to use the nginx:1.16.1 image instead of the nginx:1.14.2 image.
kubectl set image deployment.v1.apps/nginx-deployment nginx=nginx:1.16.1
or use the following command:
kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
where deployment/nginx-deployment indicates the Deployment, nginx indicates the Container the update will take place on, and nginx:1.16.1 indicates the new image and its tag.
The output is similar to:
deployment.apps/nginx-deployment image updated
Alternatively, you can edit the Deployment and change .spec.template.spec.containers[0].image from nginx:1.14.2 to nginx:1.16.1:
kubectl edit deployment/nginx-deployment
The output is similar to:
deployment.apps/nginx-deployment edited
To see the rollout status, run:
kubectl rollout status deployment/nginx-deployment
The output is similar to this:
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
or
deployment "nginx-deployment" successfully rolled out
Get more details on your updated Deployment:
After the rollout succeeds, you can view the Deployment by running kubectl get deployments. The output is similar to this:
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3/3     3            3           36s
Run kubectl get rs to see that the Deployment updated the Pods by creating a new ReplicaSet and scaling it up to 3 replicas, as well as scaling down the old ReplicaSet to 0 replicas.
kubectl get rs
The output is similar to this:
NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-1564180365   3         3         3       6s
nginx-deployment-2035384211   0         0         0       36s
Running get pods should now show only the new Pods:
kubectl get pods
The output is similar to this:
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-1564180365-khku8   1/1     Running   0          14s
nginx-deployment-1564180365-nacti   1/1     Running   0          14s
nginx-deployment-1564180365-z9gth   1/1     Running   0          14s
Next time you want to update these Pods, you only need to update the Deployment's Pod template again.
Deployment ensures that only a certain number of Pods are down while they are being updated. By default, it ensures that at least 75% of the desired number of Pods are up (25% max unavailable).
Deployment also ensures that only a certain number of Pods are created above the desired number of Pods. By default, it ensures that at most 125% of the desired number of Pods are up (25% max surge).
For example, if you look at the above Deployment closely, you will see that it first creates a new Pod, then deletes an old Pod, and creates another new one. It does not kill old Pods until a sufficient number of new Pods have come up, and does not create new Pods until a sufficient number of old Pods have been killed. It makes sure that at least 3 Pods are available and that at max 4 Pods in total are available. In case of a Deployment with 4 replicas, the number of Pods would be between 3 and 5.
Get details of your Deployment:
kubectl describe deployments
The output is similar to this:
Name: nginx-deployment Namespace: default CreationTimestamp: Thu, 30 Nov 2017 10:56:25 +0000 Labels: app=nginx Annotations: deployment.kubernetes.io/revision=2 Selector: app=nginx Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable StrategyType: RollingUpdate MinReadySeconds: 0 RollingUpdateStrategy: 25% max unavailable, 25% max surge Pod Template: Labels: app=nginx Containers: nginx: Image: nginx:1.16.1 Port: 80/TCP Environment: <none> Mounts: <none> Volumes: <none> Conditions: Type Status Reason ---- ------ ------ Available True MinimumReplicasAvailable Progressing True NewReplicaSetAvailable OldReplicaSets: <none> NewReplicaSet: nginx-deployment-1564180365 (3/3 replicas created) Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal ScalingReplicaSet 2m deployment-controller Scaled up replica set nginx-deployment-2035384211 to 3 Normal ScalingReplicaSet 24s deployment-controller Scaled up replica set nginx-deployment-1564180365 to 1 Normal ScalingReplicaSet 22s deployment-controller Scaled down replica set nginx-deployment-2035384211 to 2 Normal ScalingReplicaSet 22s deployment-controller Scaled up replica set nginx-deployment-1564180365 to 2 Normal ScalingReplicaSet 19s deployment-controller Scaled down replica set nginx-deployment-2035384211 to 1 Normal ScalingReplicaSet 19s deployment-controller Scaled up replica set nginx-deployment-1564180365 to 3 Normal ScalingReplicaSet 14s deployment-controller Scaled down replica set nginx-deployment-2035384211 to 0
Here you see that when you first created the Deployment, it created a ReplicaSet (nginx-deployment-2035384211) and scaled it up to 3 replicas directly. When you updated the Deployment, it created a new ReplicaSet (nginx-deployment-1564180365) and scaled it up to 1 and waited for it to come up. Then it scaled down the old ReplicaSet to 2 and scaled up the new ReplicaSet to 2 so that at least 3 Pods were available and at most 4 Pods were created at all times. It then continued scaling up and down the new and the old ReplicaSet, with the same rolling update strategy. Finally, you'll have 3 available replicas in the new ReplicaSet, and the old ReplicaSet is scaled down to 0.
Note:
Kubernetes doesn't count terminating Pods when calculating the number of availableReplicas, which must be between replicas - maxUnavailable and replicas + maxSurge. As a result, you might notice that there are more Pods than expected during a rollout, and that the total resources consumed by the Deployment are more than replicas + maxSurge until the terminationGracePeriodSeconds of the terminating Pods expires.
Rollover (aka multiple updates in-flight)
Each time a new Deployment is observed by the Deployment controller, a ReplicaSet is created to bring up the desired Pods. If the Deployment is updated, the existing ReplicaSet that controls Pods whose labels match .spec.selector but whose template does not match .spec.template is scaled down. Eventually, the new ReplicaSet is scaled to .spec.replicas and all old ReplicaSets are scaled to 0.
If you update a Deployment while an existing rollout is in progress, the Deployment creates a new ReplicaSet as per the update and starts scaling that up, and rolls over the ReplicaSet that it was scaling up previously -- it will add it to its list of old ReplicaSets and start scaling it down.
For example, suppose you create a Deployment to create 5 replicas of nginx:1.14.2, but then update the Deployment to create 5 replicas of nginx:1.16.1, when only 3 replicas of nginx:1.14.2 had been created. In that case, the Deployment immediately starts killing the 3 nginx:1.14.2 Pods that it had created, and starts creating nginx:1.16.1 Pods. It does not wait for the 5 replicas of nginx:1.14.2 to be created before changing course.
Label selector updates
It is generally discouraged to make label selector updates and it is suggested to plan your selectors up front. In any case, if you need to perform a label selector update, exercise great caution and make sure you have grasped all of the implications.
Note:
In API version apps/v1, a Deployment's label selector is immutable after it gets created.
- Selector additions require the Pod template labels in the Deployment spec to be updated with the new label too, otherwise a validation error is returned. This change is a non-overlapping one, meaning that the new selector does not select ReplicaSets and Pods created with the old selector, resulting in orphaning all old ReplicaSets and creating a new ReplicaSet.
- Selector updates change the existing value in a selector key and result in the same behavior as additions.
- Selector removals remove an existing key from the Deployment selector and do not require any changes in the Pod template labels. Existing ReplicaSets are not orphaned, and a new ReplicaSet is not created, but note that the removed label still exists in any existing Pods and ReplicaSets.
Rolling Back a Deployment
Sometimes, you may want to roll back a Deployment; for example, when the Deployment is not stable, such as crash looping. By default, all of the Deployment's rollout history is kept in the system so that you can roll back anytime you want (you can change that by modifying the revision history limit).
Note:
A Deployment's revision is created when a Deployment's rollout is triggered. This means that the new revision is created if and only if the Deployment's Pod template (.spec.template) is changed, for example if you update the labels or container images of the template. Other updates, such as scaling the Deployment, do not create a Deployment revision, so that you can facilitate simultaneous manual- or auto-scaling. This means that when you roll back to an earlier revision, only the Deployment's Pod template part is rolled back.
Suppose that you made a typo while updating the Deployment, by putting the image name as nginx:1.161 instead of nginx:1.16.1:
kubectl set image deployment/nginx-deployment nginx=nginx:1.161
The output is similar to this:
deployment.apps/nginx-deployment image updated
The rollout gets stuck. You can verify it by checking the rollout status:
kubectl rollout status deployment/nginx-deployment
The output is similar to this:
Waiting for rollout to finish: 1 out of 3 new replicas have been updated...
Press Ctrl-C to stop the above rollout status watch. For more information on stuck rollouts, read more here.
You see that the number of old replicas (adding the replica count from nginx-deployment-1564180365 and nginx-deployment-2035384211) is 3, and the number of new replicas (from nginx-deployment-3066724191) is 1.
kubectl get rs
The output is similar to this:
NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-1564180365   3         3         3       25s
nginx-deployment-2035384211   0         0         0       36s
nginx-deployment-3066724191   1         1         0       6s
Looking at the Pods created, you see that 1 Pod created by new ReplicaSet is stuck in an image pull loop.
kubectl get pods
The output is similar to this:
NAME                                READY   STATUS             RESTARTS   AGE
nginx-deployment-1564180365-70iae   1/1     Running            0          25s
nginx-deployment-1564180365-jbqqo   1/1     Running            0          25s
nginx-deployment-1564180365-hysrc   1/1     Running            0          25s
nginx-deployment-3066724191-08mng   0/1     ImagePullBackOff   0          6s
Note:
The Deployment controller stops the bad rollout automatically, and stops scaling up the new ReplicaSet. This depends on the rollingUpdate parameters (maxUnavailable specifically) that you have specified. Kubernetes by default sets the value to 25%.
Get the description of the Deployment:
kubectl describe deployment
The output is similar to this:
Name: nginx-deployment Namespace: default CreationTimestamp: Tue, 15 Mar 2016 14:48:04 -0700 Labels: app=nginx Selector: app=nginx Replicas: 3 desired | 1 updated | 4 total | 3 available | 1 unavailable StrategyType: RollingUpdate MinReadySeconds: 0 RollingUpdateStrategy: 25% max unavailable, 25% max surge Pod Template: Labels: app=nginx Containers: nginx: Image: nginx:1.161 Port: 80/TCP Host Port: 0/TCP Environment: <none> Mounts: <none> Volumes: <none> Conditions: Type Status Reason ---- ------ ------ Available True MinimumReplicasAvailable Progressing True ReplicaSetUpdated OldReplicaSets: nginx-deployment-1564180365 (3/3 replicas created) NewReplicaSet: nginx-deployment-3066724191 (1/1 replicas created) Events: FirstSeen LastSeen Count From SubObjectPath Type Reason Message --------- -------- ----- ---- ------------- -------- ------ ------- 1m 1m 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-2035384211 to 3 22s 22s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-1564180365 to 1 22s 22s 1 {deployment-controller } Normal ScalingReplicaSet Scaled down replica set nginx-deployment-2035384211 to 2 22s 22s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-1564180365 to 2 21s 21s 1 {deployment-controller } Normal ScalingReplicaSet Scaled down replica set nginx-deployment-2035384211 to 1 21s 21s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-1564180365 to 3 13s 13s 1 {deployment-controller } Normal ScalingReplicaSet Scaled down replica set nginx-deployment-2035384211 to 0 13s 13s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-3066724191 to 1
To fix this, you need to roll back to a previous revision of the Deployment that is stable.
Checking Rollout History of a Deployment
Follow the steps given below to check the rollout history:
First, check the revisions of this Deployment:
kubectl rollout history deployment/nginx-deployment
The output is similar to this:
deployments "nginx-deployment"
REVISION  CHANGE-CAUSE
1         kubectl apply --filename=https://k8s.io/examples/controllers/nginx-deployment.yaml
2         kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
3         kubectl set image deployment/nginx-deployment nginx=nginx:1.161
CHANGE-CAUSE is copied from the Deployment annotation kubernetes.io/change-cause to its revisions upon creation. You can specify the CHANGE-CAUSE message by:
- Annotating the Deployment with kubectl annotate deployment/nginx-deployment kubernetes.io/change-cause="image updated to 1.16.1"
- Manually editing the manifest of the resource.
To see the details of each revision, run:
kubectl rollout history deployment/nginx-deployment --revision=2
The output is similar to this:
deployments "nginx-deployment" revision 2 Labels: app=nginx pod-template-hash=1159050644 Annotations: kubernetes.io/change-cause=kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1 Containers: nginx: Image: nginx:1.16.1 Port: 80/TCP QoS Tier: cpu: BestEffort memory: BestEffort Environment Variables: <none> No volumes.
Rolling Back to a Previous Revision
Follow the steps given below to roll back the Deployment from the current version to the previous version, which is version 2.
Now you've decided to undo the current rollout and roll back to the previous revision:
kubectl rollout undo deployment/nginx-deployment
The output is similar to this:
deployment.apps/nginx-deployment rolled back
Alternatively, you can roll back to a specific revision by specifying it with --to-revision:
kubectl rollout undo deployment/nginx-deployment --to-revision=2
The output is similar to this:
deployment.apps/nginx-deployment rolled back
For more details about rollout related commands, read kubectl rollout.
The Deployment is now rolled back to a previous stable revision. As you can see, a DeploymentRollback event for rolling back to revision 2 is generated from the Deployment controller.
To check if the rollback was successful and the Deployment is running as expected, run:
kubectl get deployment nginx-deployment
The output is similar to this:
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3/3     3            3           30m
Get the description of the Deployment:
kubectl describe deployment nginx-deployment
The output is similar to this:
Name: nginx-deployment Namespace: default CreationTimestamp: Sun, 02 Sep 2018 18:17:55 -0500 Labels: app=nginx Annotations: deployment.kubernetes.io/revision=4 kubernetes.io/change-cause=kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1 Selector: app=nginx Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable StrategyType: RollingUpdate MinReadySeconds: 0 RollingUpdateStrategy: 25% max unavailable, 25% max surge Pod Template: Labels: app=nginx Containers: nginx: Image: nginx:1.16.1 Port: 80/TCP Host Port: 0/TCP Environment: <none> Mounts: <none> Volumes: <none> Conditions: Type Status Reason ---- ------ ------ Available True MinimumReplicasAvailable Progressing True NewReplicaSetAvailable OldReplicaSets: <none> NewReplicaSet: nginx-deployment-c4747d96c (3/3 replicas created) Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal ScalingReplicaSet 12m deployment-controller Scaled up replica set nginx-deployment-75675f5897 to 3 Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-c4747d96c to 1 Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-deployment-75675f5897 to 2 Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-c4747d96c to 2 Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-deployment-75675f5897 to 1 Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-c4747d96c to 3 Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-deployment-75675f5897 to 0 Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-595696685f to 1 Normal DeploymentRollback 15s deployment-controller Rolled back deployment "nginx-deployment" to revision 2 Normal ScalingReplicaSet 15s deployment-controller Scaled down replica set nginx-deployment-595696685f to 0
Scaling a Deployment
You can scale a Deployment by using the following command:
kubectl scale deployment/nginx-deployment --replicas=10
The output is similar to this:
deployment.apps/nginx-deployment scaled
Assuming horizontal Pod autoscaling is enabled in your cluster, you can set up an autoscaler for your Deployment and choose the minimum and maximum number of Pods you want to run based on the CPU utilization of your existing Pods.
kubectl autoscale deployment/nginx-deployment --min=10 --max=15 --cpu-percent=80
The output is similar to this:
deployment.apps/nginx-deployment scaled
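The kubectl autoscale command creates a HorizontalPodAutoscaler object for you. As a rough sketch of the equivalent manifest (the object name is assumed here, and your cluster may serve it under a different autoscaling API version), it looks something like this:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-deployment   # name assumed; kubectl autoscale typically reuses the Deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 10
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80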
Proportional scaling
RollingUpdate Deployments support running multiple versions of an application at the same time. When you or an autoscaler scales a RollingUpdate Deployment that is in the middle of a rollout (either in progress or paused), the Deployment controller balances the additional replicas in the existing active ReplicaSets (ReplicaSets with Pods) in order to mitigate risk. This is called proportional scaling.
For example, you are running a Deployment with 10 replicas, maxSurge=3, and maxUnavailable=2.
Ensure that the 10 replicas in your Deployment are running.
kubectl get deploy
The output is similar to this:
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   10        10        10           10          50s
You update to a new image which happens to be unresolvable from inside the cluster.
kubectl set image deployment/nginx-deployment nginx=nginx:sometag
The output is similar to this:
deployment.apps/nginx-deployment image updated
The image update starts a new rollout with ReplicaSet nginx-deployment-1989198191, but it's blocked due to the maxUnavailable requirement that you mentioned above. Check out the rollout status:
kubectl get rs
The output is similar to this:
NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-1989198191   5         5         0       9s
nginx-deployment-618515232    8         8         8       1m
Then a new scaling request for the Deployment comes along. The autoscaler increments the Deployment replicas to 15. The Deployment controller needs to decide where to add these new 5 replicas. If you weren't using proportional scaling, all 5 of them would be added to the new ReplicaSet. With proportional scaling, you spread the additional replicas across all ReplicaSets. Bigger proportions go to the ReplicaSets with the most replicas and lower proportions go to ReplicaSets with fewer replicas. Any leftovers are added to the ReplicaSet with the most replicas. ReplicaSets with zero replicas are not scaled up.
In our example above, 3 replicas are added to the old ReplicaSet and 2 replicas are added to the new ReplicaSet. The rollout process should eventually move all replicas to the new ReplicaSet, assuming the new replicas become healthy. To confirm this, run:
kubectl get deploy
The output is similar to this:
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
nginx-deployment 15 18 7 8 7m
The rollout status confirms how the replicas were added to each ReplicaSet.
kubectl get rs
The output is similar to this:
NAME DESIRED CURRENT READY AGE
nginx-deployment-1989198191 7 7 0 7m
nginx-deployment-618515232 11 11 11 7m
Pausing and Resuming a rollout of a Deployment
When you update a Deployment, or plan to, you can pause rollouts for that Deployment before you trigger one or more updates. When you're ready to apply those changes, you resume rollouts for the Deployment. This approach allows you to apply multiple fixes in between pausing and resuming without triggering unnecessary rollouts.
For example, with a Deployment that was created:
Get the Deployment details:
kubectl get deploy
The output is similar to this:
NAME    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx   3         3         3            3           1m
Get the rollout status:
kubectl get rs
The output is similar to this:
NAME               DESIRED   CURRENT   READY   AGE
nginx-2142116321   3         3         3       1m
Pause by running the following command:
kubectl rollout pause deployment/nginx-deployment
The output is similar to this:
deployment.apps/nginx-deployment paused
Then update the image of the Deployment:
kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
The output is similar to this:
deployment.apps/nginx-deployment image updated
Notice that no new rollout started:
kubectl rollout history deployment/nginx-deployment
The output is similar to this:
deployments "nginx"
REVISION  CHANGE-CAUSE
1         <none>
Get the rollout status to verify that the existing ReplicaSet has not changed:
kubectl get rs
The output is similar to this:
NAME               DESIRED   CURRENT   READY   AGE
nginx-2142116321   3         3         3       2m
You can make as many updates as you wish, for example, update the resources that will be used:
kubectl set resources deployment/nginx-deployment -c=nginx --limits=cpu=200m,memory=512Mi
The output is similar to this:
deployment.apps/nginx-deployment resource requirements updated
The initial state of the Deployment prior to pausing its rollout will continue its function, but new updates to the Deployment will not have any effect as long as the Deployment rollout is paused.
Eventually, resume the Deployment rollout and observe a new ReplicaSet coming up with all the new updates:
kubectl rollout resume deployment/nginx-deployment
The output is similar to this:
deployment.apps/nginx-deployment resumed
Watch the status of the rollout until it's done.
kubectl get rs --watch
The output is similar to this:
NAME               DESIRED   CURRENT   READY   AGE
nginx-2142116321   2         2         2       2m
nginx-3926361531   2         2         0       6s
nginx-3926361531   2         2         1       18s
nginx-2142116321   1         2         2       2m
nginx-2142116321   1         2         2       2m
nginx-3926361531   3         2         1       18s
nginx-3926361531   3         2         1       18s
nginx-2142116321   1         1         1       2m
nginx-3926361531   3         3         1       18s
nginx-3926361531   3         3         2       19s
nginx-2142116321   0         1         1       2m
nginx-2142116321   0         1         1       2m
nginx-2142116321   0         0         0       2m
nginx-3926361531   3         3         3       20s
Get the status of the latest rollout:
kubectl get rs
The output is similar to this:
NAME               DESIRED   CURRENT   READY   AGE
nginx-2142116321   0         0         0       2m
nginx-3926361531   3         3         3       28s
Note:
You cannot roll back a paused Deployment until you resume it.
Deployment status
A Deployment enters various states during its lifecycle. It can be progressing while rolling out a new ReplicaSet, it can be complete, or it can fail to progress.
Progressing Deployment
Kubernetes marks a Deployment as progressing when one of the following tasks is performed:
- The Deployment creates a new ReplicaSet.
- The Deployment is scaling up its newest ReplicaSet.
- The Deployment is scaling down its older ReplicaSet(s).
- New Pods become ready or available (ready for at least MinReadySeconds).
When the rollout becomes “progressing”, the Deployment controller adds a condition with the following attributes to the Deployment's .status.conditions:
- type: Progressing
- status: "True"
- reason: NewReplicaSetCreated | reason: FoundNewReplicaSet | reason: ReplicaSetUpdated
You can monitor the progress for a Deployment by using kubectl rollout status.
Complete Deployment
Kubernetes marks a Deployment as complete when it has the following characteristics:
- All of the replicas associated with the Deployment have been updated to the latest version you've specified, meaning any updates you've requested have been completed.
- All of the replicas associated with the Deployment are available.
- No old replicas for the Deployment are running.
When the rollout becomes “complete”, the Deployment controller sets a condition with the following attributes to the Deployment's .status.conditions:
- type: Progressing
- status: "True"
- reason: NewReplicaSetAvailable
This Progressing condition will retain a status value of "True" until a new rollout is initiated. The condition holds even when availability of replicas changes (which does instead affect the Available condition).
You can check if a Deployment has completed by using kubectl rollout status. If the rollout completed successfully, kubectl rollout status returns a zero exit code.
kubectl rollout status deployment/nginx-deployment
The output is similar to this:
Waiting for rollout to finish: 2 of 3 updated replicas are available...
deployment "nginx-deployment" successfully rolled out
and the exit status from kubectl rollout is 0 (success):
echo $?
0
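Because kubectl rollout status exits with a non-zero code when a rollout fails or times out, it can serve as a gate in automation. A minimal sketch, assuming a 5 minute budget and an automatic undo on failure (both choices are illustrative, not part of this walkthrough):
# Wait up to 5 minutes for the rollout to finish; undo it if it does not complete in time.
if ! kubectl rollout status deployment/nginx-deployment --timeout=5m; then
  kubectl rollout undo deployment/nginx-deployment
fi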
Failed Deployment
Your Deployment may get stuck trying to deploy its newest ReplicaSet without ever completing. This can occur due to some of the following factors:
- Insufficient quota
- Readiness probe failures
- Image pull errors
- Insufficient permissions
- Limit ranges
- Application runtime misconfiguration
One way you can detect this condition is to specify a deadline parameter in your Deployment spec: (.spec.progressDeadlineSeconds). .spec.progressDeadlineSeconds denotes the number of seconds the Deployment controller waits before indicating (in the Deployment status) that the Deployment progress has stalled.
The following kubectl command sets the spec with progressDeadlineSeconds to make the controller report lack of progress of a rollout for a Deployment after 10 minutes:
kubectl patch deployment/nginx-deployment -p '{"spec":{"progressDeadlineSeconds":600}}'
The output is similar to this:
deployment.apps/nginx-deployment patched
Once the deadline has been exceeded, the Deployment controller adds a DeploymentCondition with the following attributes to the Deployment's .status.conditions:
- type: Progressing
- status: "False"
- reason: ProgressDeadlineExceeded
This condition can also fail early and is then set to a status value of "False" due to reasons such as ReplicaSetCreateError.
Also, the deadline is not taken into account anymore once the Deployment rollout completes.
See the Kubernetes API conventions for more information on status conditions.
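If you want to read a single condition rather than the full status, one option is a JSONPath query against the condition layout shown above (the query below is a sketch using the nginx-deployment example):
kubectl get deployment nginx-deployment -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}'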
Note:
Kubernetes takes no action on a stalled Deployment other than to report a status condition with reason: ProgressDeadlineExceeded. Higher level orchestrators can take advantage of it and act accordingly, for example, roll back the Deployment to its previous version.
Note:
If you pause a Deployment rollout, Kubernetes does not check progress against your specified deadline. You can safely pause a Deployment rollout in the middle of a rollout and resume without triggering the condition for exceeding the deadline.
You may experience transient errors with your Deployments, either due to a low timeout that you have set or due to any other kind of error that can be treated as transient. For example, let's suppose you have insufficient quota. If you describe the Deployment you will notice the following section:
kubectl describe deployment nginx-deployment
The output is similar to this:
<...>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True ReplicaSetUpdated
ReplicaFailure True FailedCreate
<...>
If you run kubectl get deployment nginx-deployment -o yaml, the Deployment status is similar to this:
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: 2016-10-04T12:25:39Z
    lastUpdateTime: 2016-10-04T12:25:39Z
    message: Replica set "nginx-deployment-4262182780" is progressing.
    reason: ReplicaSetUpdated
    status: "True"
    type: Progressing
  - lastTransitionTime: 2016-10-04T12:25:42Z
    lastUpdateTime: 2016-10-04T12:25:42Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: 2016-10-04T12:25:39Z
    lastUpdateTime: 2016-10-04T12:25:39Z
    message: 'Error creating: pods "nginx-deployment-4262182780-" is forbidden: exceeded quota:
      object-counts, requested: pods=1, used: pods=3, limited: pods=2'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure
  observedGeneration: 3
  replicas: 2
  unavailableReplicas: 2
Eventually, once the Deployment progress deadline is exceeded, Kubernetes updates the status and the reason for the Progressing condition:
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing False ProgressDeadlineExceeded
ReplicaFailure True FailedCreate
You can address an issue of insufficient quota by scaling down your Deployment, by scaling down other
controllers you may be running, or by increasing quota in your namespace. If you satisfy the quota
conditions and the Deployment controller then completes the Deployment rollout, you'll see the
Deployment's status update with a successful condition (status: "True" and reason: NewReplicaSetAvailable).
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
type: Available with status: "True" means that your Deployment has minimum availability. Minimum availability is dictated by the parameters specified in the deployment strategy. type: Progressing with status: "True" means that your Deployment is either in the middle of a rollout and it is progressing or that it has successfully completed its progress and the minimum required new replicas are available (see the Reason of the condition for the particulars - in our case reason: NewReplicaSetAvailable means that the Deployment is complete).
You can check if a Deployment has failed to progress by using kubectl rollout status. kubectl rollout status returns a non-zero exit code if the Deployment has exceeded the progression deadline.
kubectl rollout status deployment/nginx-deployment
The output is similar to this:
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
error: deployment "nginx" exceeded its progress deadline
and the exit status from kubectl rollout is 1 (indicating an error):
echo $?
1
Operating on a failed deployment
All actions that apply to a complete Deployment also apply to a failed Deployment. You can scale it up/down, roll back to a previous revision, or even pause it if you need to apply multiple tweaks in the Deployment Pod template.
Clean up Policy
You can set the .spec.revisionHistoryLimit field in a Deployment to specify how many old ReplicaSets for this Deployment you want to retain. The rest will be garbage-collected in the background. By default, it is 10.
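For example, to keep only the five most recent old ReplicaSets (the value 5 is purely illustrative), you could patch the field in place:
kubectl patch deployment/nginx-deployment -p '{"spec":{"revisionHistoryLimit":5}}'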
Note:
Explicitly setting this field to 0 will result in cleaning up all the history of your Deployment, so that Deployment will not be able to roll back.
Canary Deployment
If you want to roll out releases to a subset of users or servers using the Deployment, you can create multiple Deployments, one for each release, following the canary pattern described in managing resources.
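As a sketch of that pattern (the names and the track label below are illustrative conventions, not anything Kubernetes requires), you might run a stable and a canary Deployment side by side and front them with one Service that selects only the label they share:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: nginx
      track: stable
  template:
    metadata:
      labels:
        app: nginx
        track: stable
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
      track: canary
  template:
    metadata:
      labels:
        app: nginx
        track: canary
    spec:
      containers:
      - name: nginx
        image: nginx:1.16.1
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  selector:
    app: nginx   # selects both tracks, so roughly 10% of traffic reaches the canary
  ports:
  - port: 80
    targetPort: 80
Promoting the canary would then mean updating the stable Deployment to the new image and scaling the canary Deployment down or deleting it.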
Writing a Deployment Spec
As with all other Kubernetes configs, a Deployment needs .apiVersion, .kind, and .metadata fields. For general information about working with config files, see the deploying applications, configuring containers, and using kubectl to manage resources documents.
When the control plane creates new Pods for a Deployment, the .metadata.name of the Deployment is part of the basis for naming those Pods. The name of a Deployment must be a valid DNS subdomain value, but this can produce unexpected results for the Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a DNS label.
A Deployment also needs a .spec section.
Pod Template
The .spec.template and .spec.selector are the only required fields of the .spec.
The .spec.template is a Pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion or kind.
In addition to required fields for a Pod, a Pod template in a Deployment must specify appropriate labels and an appropriate restart policy. For labels, make sure not to overlap with other controllers. See selector.
Only a .spec.template.spec.restartPolicy equal to Always is allowed, which is the default if not specified.
Replicas
.spec.replicas is an optional field that specifies the number of desired Pods. It defaults to 1.
Should you manually scale a Deployment, for example via kubectl scale deployment deployment --replicas=X, and then you update that Deployment based on a manifest (for example: by running kubectl apply -f deployment.yaml), then applying that manifest overwrites the manual scaling that you previously did.
If a HorizontalPodAutoscaler (or any similar API for horizontal scaling) is managing scaling for a Deployment, don't set .spec.replicas. Instead, allow the Kubernetes control plane to manage the .spec.replicas field automatically.
Selector
.spec.selector is a required field that specifies a label selector for the Pods targeted by this Deployment.
.spec.selector must match .spec.template.metadata.labels, or it will be rejected by the API.
In API version apps/v1, .spec.selector and .metadata.labels do not default to .spec.template.metadata.labels if not set. So they must be set explicitly. Also note that .spec.selector is immutable after creation of the Deployment in apps/v1.
A Deployment may terminate Pods whose labels match the selector if their template is different from .spec.template or if the total number of such Pods exceeds .spec.replicas. It brings up new Pods with .spec.template if the number of Pods is less than the desired number.
Note:
You should not create other Pods whose labels match this selector, either directly, by creating another Deployment, or by creating another controller such as a ReplicaSet or a ReplicationController. If you do so, the first Deployment thinks that it created these other Pods. Kubernetes does not stop you from doing this.
If you have multiple controllers that have overlapping selectors, the controllers will fight with each other and won't behave correctly.
Strategy
.spec.strategy specifies the strategy used to replace old Pods by new ones. .spec.strategy.type can be "Recreate" or "RollingUpdate". "RollingUpdate" is the default value.
Recreate Deployment
All existing Pods are killed before new ones are created when .spec.strategy.type==Recreate.
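Shown in isolation (the rest of the Deployment spec is omitted), the corresponding strategy stanza is:
spec:
  strategy:
    type: Recreate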
Note:
This will only guarantee Pod termination previous to creation for upgrades. If you upgrade a Deployment, all Pods of the old revision will be terminated immediately. Successful removal is awaited before any Pod of the new revision is created. If you manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the old Pod is still in a Terminating state). If you need an "at most" guarantee for your Pods, you should consider using a StatefulSet.
Rolling Update Deployment
The Deployment updates Pods in a rolling update fashion when .spec.strategy.type==RollingUpdate. You can specify maxUnavailable and maxSurge to control the rolling update process.
Max Unavailable
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from percentage by rounding down. The value cannot be 0 if .spec.strategy.rollingUpdate.maxSurge is 0. The default value is 25%.
For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of desired Pods immediately when the rolling update starts. Once new Pods are ready, old ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.
Max Surge
.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created over the desired number of Pods. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The value cannot be 0 if MaxUnavailable is 0. The absolute number is calculated from the percentage by rounding up. The default value is 25%.
For example, when this value is set to 30%, the new ReplicaSet can be scaled up immediately when the rolling update starts, such that the total number of old and new Pods does not exceed 130% of desired Pods. Once old Pods have been killed, the new ReplicaSet can be scaled up further, ensuring that the total number of Pods running at any time during the update is at most 130% of desired Pods.
Here are some Rolling Update Deployment examples that use the maxUnavailable and maxSurge:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
Progress Deadline Seconds
.spec.progressDeadlineSeconds is an optional field that specifies the number of seconds you want to wait for your Deployment to progress before the system reports back that the Deployment has failed progressing - surfaced as a condition with type: Progressing, status: "False", and reason: ProgressDeadlineExceeded in the status of the resource. The Deployment controller will keep retrying the Deployment. This defaults to 600. In the future, once automatic rollback is implemented, the Deployment controller will roll back a Deployment as soon as it observes such a condition.
If specified, this field needs to be greater than .spec.minReadySeconds.
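Shown in context, both fields sit at the top level of the Deployment spec; the values below are illustrative (600 is the default deadline):
spec:
  minReadySeconds: 5
  progressDeadlineSeconds: 600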