The Concepts section helps you learn about the parts of the Kubernetes system and the abstractions Kubernetes uses to represent your cluster, and helps you obtain a deeper understanding of how Kubernetes works.
This the multi-page printable view of this section. Click here to print.
Concepts
- 1: Overview
- 1.1: What is Kubernetes?
- 1.2: Kubernetes Components
- 1.3: The Kubernetes API
- 1.4: Working with Kubernetes Objects
- 1.4.1: Understanding Kubernetes Objects
- 1.4.2: Kubernetes Object Management
- 1.4.3: Object Names and IDs
- 1.4.4: Namespaces
- 1.4.5: Labels and Selectors
- 1.4.6: Annotations
- 1.4.7: Field Selectors
- 1.4.8: Recommended Labels
- 2: Cluster Architecture
- 2.1: Nodes
- 2.2: Control Plane-Node Communication
- 2.3: Controllers
- 2.4: Cloud Controller Manager
- 3: Containers
- 3.1: Images
- 3.2: Container Environment
- 3.3: Runtime Class
- 3.4: Container Lifecycle Hooks
- 4: Workloads
- 4.1: Pods
- 4.1.1: Pod Lifecycle
- 4.1.2: Init Containers
- 4.1.3: Pod Topology Spread Constraints
- 4.1.4: Disruptions
- 4.1.5: Ephemeral Containers
- 4.2: Workload Resources
- 4.2.1: Deployments
- 4.2.2: ReplicaSet
- 4.2.3: StatefulSets
- 4.2.4: DaemonSet
- 4.2.5: Jobs
- 4.2.6: Garbage Collection
- 4.2.7: TTL Controller for Finished Resources
- 4.2.8: CronJob
- 4.2.9: ReplicationController
- 5: Services, Load Balancing, and Networking
- 5.1: Service
- 5.2: Topology-aware traffic routing with topology keys
- 5.3: DNS for Services and Pods
- 5.4: Connecting Applications with Services
- 5.5: Ingress
- 5.6: Ingress Controllers
- 5.7: EndpointSlices
- 5.8: Service Internal Traffic Policy
- 5.9: Topology Aware Hints
- 5.10: Network Policies
- 5.11: Adding entries to Pod /etc/hosts with HostAliases
- 5.12: IPv4/IPv6 dual-stack
- 6: Storage
- 6.1: Volumes
- 6.2: Persistent Volumes
- 6.3: Volume Snapshots
- 6.4: CSI Volume Cloning
- 6.5: Storage Classes
- 6.6: Volume Snapshot Classes
- 6.7: Dynamic Volume Provisioning
- 6.8: Storage Capacity
- 6.9: Ephemeral Volumes
- 6.10: Node-specific Volume Limits
- 6.11: Volume Health Monitoring
- 7: Configuration
- 7.1: Configuration Best Practices
- 7.2: ConfigMaps
- 7.3: Secrets
- 7.4: Managing Resources for Containers
- 7.5: Organizing Cluster Access Using kubeconfig Files
- 7.6: Pod Priority and Preemption
- 8: Security
- 8.1: Overview of Cloud Native Security
- 8.2: Pod Security Standards
- 8.3: Controlling Access to the Kubernetes API
- 9: Policies
- 9.1: Limit Ranges
- 9.2: Resource Quotas
- 9.3: Pod Security Policies
- 9.4: Process ID Limits And Reservations
- 9.5: Node Resource Managers
- 10: Scheduling and Eviction
- 10.1: Kubernetes Scheduler
- 10.2: Assigning Pods to Nodes
- 10.3: Resource Bin Packing for Extended Resources
- 10.4: Taints and Tolerations
- 10.5: Pod Overhead
- 10.6: Eviction Policy
- 10.7: Scheduling Framework
- 10.8: Scheduler Performance Tuning
- 11: Cluster Administration
- 11.1: Certificates
- 11.2: Managing Resources
- 11.3: Cluster Networking
- 11.4: Logging Architecture
- 11.5: Metrics For Kubernetes System Components
- 11.6: System Logs
- 11.7: Garbage collection for container images
- 11.8: Proxies in Kubernetes
- 11.9: API Priority and Fairness
- 11.10: Installing Addons
- 12: Extending Kubernetes
- 12.1: Extending the Kubernetes API
- 12.2: Compute, Storage, and Networking Extensions
- 12.2.1: Network Plugins
- 12.2.2: Device Plugins
- 12.3: Operator pattern
- 12.4: Service Catalog
1 - Overview
1.1 - What is Kubernetes?
This page is an overview of Kubernetes.
Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.
The name Kubernetes originates from Greek, meaning helmsman or pilot. Google open-sourced the Kubernetes project in 2014. Kubernetes combines over 15 years of Google's experience running production workloads at scale with best-of-breed ideas and practices from the community.
Going back in time
Let's take a look at why Kubernetes is so useful by going back in time.
Traditional deployment era: Early on, organizations ran applications on physical servers. There was no way to define resource boundaries for applications in a physical server, and this caused resource allocation issues. For example, if multiple applications run on a physical server, there can be instances where one application would take up most of the resources, and as a result, the other applications would underperform. A solution for this would be to run each application on a different physical server. But this did not scale as resources were underutilized, and it was expensive for organizations to maintain many physical servers.
Virtualized deployment era: As a solution, virtualization was introduced. It allows you to run multiple Virtual Machines (VMs) on a single physical server's CPU. Virtualization allows applications to be isolated between VMs and provides a level of security as the information of one application cannot be freely accessed by another application.
Virtualization allows better utilization of resources in a physical server and allows better scalability because an application can be added or updated easily, reduces hardware costs, and much more. With virtualization you can present a set of physical resources as a cluster of disposable virtual machines.
Each VM is a full machine running all the components, including its own operating system, on top of the virtualized hardware.
Container deployment era: Containers are similar to VMs, but they have relaxed isolation properties to share the Operating System (OS) among the applications. Therefore, containers are considered lightweight. Similar to a VM, a container has its own filesystem, share of CPU, memory, process space, and more. As they are decoupled from the underlying infrastructure, they are portable across clouds and OS distributions.
Containers have become popular because they provide extra benefits, such as:
- Agile application creation and deployment: increased ease and efficiency of container image creation compared to VM image use.
- Continuous development, integration, and deployment: provides for reliable and frequent container image build and deployment with quick and efficient rollbacks (due to image immutability).
- Dev and Ops separation of concerns: create application container images at build/release time rather than deployment time, thereby decoupling applications from infrastructure.
- Observability not only surfaces OS-level information and metrics, but also application health and other signals.
- Environmental consistency across development, testing, and production: Runs the same on a laptop as it does in the cloud.
- Cloud and OS distribution portability: Runs on Ubuntu, RHEL, CoreOS, on-premises, on major public clouds, and anywhere else.
- Application-centric management: Raises the level of abstraction from running an OS on virtual hardware to running an application on an OS using logical resources.
- Loosely coupled, distributed, elastic, liberated micro-services: applications are broken into smaller, independent pieces and can be deployed and managed dynamically – not a monolithic stack running on one big single-purpose machine.
- Resource isolation: predictable application performance.
- Resource utilization: high efficiency and density.
Why you need Kubernetes and what it can do
Containers are a good way to bundle and run your applications. In a production environment, you need to manage the containers that run the applications and ensure that there is no downtime. For example, if a container goes down, another container needs to start. Wouldn't it be easier if this behavior was handled by a system?
That's how Kubernetes comes to the rescue! Kubernetes provides you with a framework to run distributed systems resiliently. It takes care of scaling and failover for your application, provides deployment patterns, and more. For example, Kubernetes can easily manage a canary deployment for your system.
Kubernetes provides you with:
- Service discovery and load balancing Kubernetes can expose a container using the DNS name or using their own IP address. If traffic to a container is high, Kubernetes is able to load balance and distribute the network traffic so that the deployment is stable.
- Storage orchestration Kubernetes allows you to automatically mount a storage system of your choice, such as local storages, public cloud providers, and more.
- Automated rollouts and rollbacks You can describe the desired state for your deployed containers using Kubernetes, and it can change the actual state to the desired state at a controlled rate. For example, you can automate Kubernetes to create new containers for your deployment, remove existing containers and adopt all their resources to the new container.
- Automatic bin packing You provide Kubernetes with a cluster of nodes that it can use to run containerized tasks. You tell Kubernetes how much CPU and memory (RAM) each container needs. Kubernetes can fit containers onto your nodes to make the best use of your resources.
- Self-healing Kubernetes restarts containers that fail, replaces containers, kills containers that don't respond to your user-defined health check, and doesn't advertise them to clients until they are ready to serve.
- Secret and configuration management Kubernetes lets you store and manage sensitive information, such as passwords, OAuth tokens, and SSH keys. You can deploy and update secrets and application configuration without rebuilding your container images, and without exposing secrets in your stack configuration.
What Kubernetes is not
Kubernetes is not a traditional, all-inclusive PaaS (Platform as a Service) system. Since Kubernetes operates at the container level rather than at the hardware level, it provides some generally applicable features common to PaaS offerings, such as deployment, scaling, load balancing, and lets users integrate their logging, monitoring, and alerting solutions. However, Kubernetes is not monolithic, and these default solutions are optional and pluggable. Kubernetes provides the building blocks for building developer platforms, but preserves user choice and flexibility where it is important.
Kubernetes:
- Does not limit the types of applications supported. Kubernetes aims to support an extremely diverse variety of workloads, including stateless, stateful, and data-processing workloads. If an application can run in a container, it should run great on Kubernetes.
- Does not deploy source code and does not build your application. Continuous Integration, Delivery, and Deployment (CI/CD) workflows are determined by organization cultures and preferences as well as technical requirements.
- Does not provide application-level services, such as middleware (for example, message buses), data-processing frameworks (for example, Spark), databases (for example, MySQL), caches, nor cluster storage systems (for example, Ceph) as built-in services. Such components can run on Kubernetes, and/or can be accessed by applications running on Kubernetes through portable mechanisms, such as the Open Service Broker.
- Does not dictate logging, monitoring, or alerting solutions. It provides some integrations as proof of concept, and mechanisms to collect and export metrics.
- Does not provide nor mandate a configuration language/system (for example, Jsonnet). It provides a declarative API that may be targeted by arbitrary forms of declarative specifications.
- Does not provide nor adopt any comprehensive machine configuration, maintenance, management, or self-healing systems.
- Additionally, Kubernetes is not a mere orchestration system. In fact, it eliminates the need for orchestration. The technical definition of orchestration is execution of a defined workflow: first do A, then B, then C. In contrast, Kubernetes comprises a set of independent, composable control processes that continuously drive the current state towards the provided desired state. It shouldn't matter how you get from A to C. Centralized control is also not required. This results in a system that is easier to use and more powerful, robust, resilient, and extensible.
What's next
- Take a look at the Kubernetes Components
- Ready to Get Started?
1.2 - Kubernetes Components
When you deploy Kubernetes, you get a cluster.
A Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized applications. Every cluster has at least one worker node.
The worker node(s) host the Pods that are the components of the application workload. The control plane manages the worker nodes and the Pods in the cluster. In production environments, the control plane usually runs across multiple computers and a cluster usually runs multiple nodes, providing fault-tolerance and high availability.
This document outlines the various components you need to have a complete and working Kubernetes cluster.
Here's the diagram of a Kubernetes cluster with all the components tied together.
Control Plane Components
The control plane's components make global decisions about the cluster (for example, scheduling), as well as detecting and responding to cluster events (for example, starting up a new pod when a deployment's replicas
field is unsatisfied).
Control plane components can be run on any machine in the cluster. However, for simplicity, set up scripts typically start all control plane components on the same machine, and do not run user containers on this machine. See Creating Highly Available clusters with kubeadm for an example control plane setup that runs across multiple VMs.
kube-apiserver
The API server is a component of the Kubernetes control plane that exposes the Kubernetes API. The API server is the front end for the Kubernetes control plane.
The main implementation of a Kubernetes API server is kube-apiserver. kube-apiserver is designed to scale horizontally—that is, it scales by deploying more instances. You can run several instances of kube-apiserver and balance traffic between those instances.
etcd
Consistent and highly-available key value store used as Kubernetes' backing store for all cluster data.
If your Kubernetes cluster uses etcd as its backing store, make sure you have a back up plan for those data.
You can find in-depth information about etcd in the official documentation.
kube-scheduler
Control plane component that watches for newly created Pods with no assigned node, and selects a node for them to run on.
Factors taken into account for scheduling decisions include: individual and collective resource requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, and deadlines.
kube-controller-manager
Control Plane component that runs controller processes.
Logically, each controller is a separate process, but to reduce complexity, they are all compiled into a single binary and run in a single process.
Some types of these controllers are:
- Node controller: Responsible for noticing and responding when nodes go down.
- Job controller: Watches for Job objects that represent one-off tasks, then creates Pods to run those tasks to completion.
- Endpoints controller: Populates the Endpoints object (that is, joins Services & Pods).
- Service Account & Token controllers: Create default accounts and API access tokens for new namespaces.
cloud-controller-manager
A Kubernetes control plane component that embeds cloud-specific control logic. The cloud controller manager lets you link your cluster into your cloud provider's API, and separates out the components that interact with that cloud platform from components that only interact with your cluster.The cloud-controller-manager only runs controllers that are specific to your cloud provider. If you are running Kubernetes on your own premises, or in a learning environment inside your own PC, the cluster does not have a cloud controller manager.
As with the kube-controller-manager, the cloud-controller-manager combines several logically independent control loops into a single binary that you run as a single process. You can scale horizontally (run more than one copy) to improve performance or to help tolerate failures.
The following controllers can have cloud provider dependencies:
- Node controller: For checking the cloud provider to determine if a node has been deleted in the cloud after it stops responding
- Route controller: For setting up routes in the underlying cloud infrastructure
- Service controller: For creating, updating and deleting cloud provider load balancers
Node Components
Node components run on every node, maintaining running pods and providing the Kubernetes runtime environment.
kubelet
An agent that runs on each node in the cluster. It makes sure that containers are running in a Pod.
The kubelet takes a set of PodSpecs that are provided through various mechanisms and ensures that the containers described in those PodSpecs are running and healthy. The kubelet doesn't manage containers which were not created by Kubernetes.
kube-proxy
kube-proxy is a network proxy that runs on each node in your cluster, implementing part of the Kubernetes Service concept.
kube-proxy maintains network rules on nodes. These network rules allow network communication to your Pods from network sessions inside or outside of your cluster.
kube-proxy uses the operating system packet filtering layer if there is one and it's available. Otherwise, kube-proxy forwards the traffic itself.
Container runtime
The container runtime is the software that is responsible for running containers.
Kubernetes supports several container runtimes: Docker, containerd, CRI-O, and any implementation of the Kubernetes CRI (Container Runtime Interface).
Addons
Addons use Kubernetes resources (DaemonSet,
Deployment, etc)
to implement cluster features. Because these are providing cluster-level features, namespaced resources
for addons belong within the kube-system
namespace.
Selected addons are described below; for an extended list of available addons, please see Addons.
DNS
While the other addons are not strictly required, all Kubernetes clusters should have cluster DNS, as many examples rely on it.
Cluster DNS is a DNS server, in addition to the other DNS server(s) in your environment, which serves DNS records for Kubernetes services.
Containers started by Kubernetes automatically include this DNS server in their DNS searches.
Web UI (Dashboard)
Dashboard is a general purpose, web-based UI for Kubernetes clusters. It allows users to manage and troubleshoot applications running in the cluster, as well as the cluster itself.
Container Resource Monitoring
Container Resource Monitoring records generic time-series metrics about containers in a central database, and provides a UI for browsing that data.
Cluster-level Logging
A cluster-level logging mechanism is responsible for saving container logs to a central log store with search/browsing interface.
What's next
- Learn about Nodes
- Learn about Controllers
- Learn about kube-scheduler
- Read etcd's official documentation
1.3 - The Kubernetes API
The core of Kubernetes' control plane is the API server. The API server exposes an HTTP API that lets end users, different parts of your cluster, and external components communicate with one another.
The Kubernetes API lets you query and manipulate the state of API objects in Kubernetes (for example: Pods, Namespaces, ConfigMaps, and Events).
Most operations can be performed through the kubectl command-line interface or other command-line tools, such as kubeadm, which in turn use the API. However, you can also access the API directly using REST calls.
Consider using one of the client libraries if you are writing an application using the Kubernetes API.
OpenAPI specification
Complete API details are documented using OpenAPI.
The Kubernetes API server serves an OpenAPI spec via the /openapi/v2
endpoint.
You can request the response format using request headers as follows:
Header | Possible values | Notes |
---|---|---|
Accept-Encoding | gzip | not supplying this header is also acceptable |
Accept | application/com.github.proto-openapi.spec.v2@v1.0+protobuf | mainly for intra-cluster use |
application/json | default | |
* | serves application/json |
Kubernetes implements an alternative Protobuf based serialization format that is primarily intended for intra-cluster communication. For more information about this format, see the Kubernetes Protobuf serialization design proposal and the Interface Definition Language (IDL) files for each schema located in the Go packages that define the API objects.
Persistence
Kubernetes stores the serialized state of objects by writing them into etcd.
API groups and versioning
To make it easier to eliminate fields or restructure resource representations,
Kubernetes supports multiple API versions, each at a different API path, such
as /api/v1
or /apis/rbac.authorization.k8s.io/v1alpha1
.
Versioning is done at the API level rather than at the resource or field level to ensure that the API presents a clear, consistent view of system resources and behavior, and to enable controlling access to end-of-life and/or experimental APIs.
To make it easier to evolve and to extend its API, Kubernetes implements API groups that can be enabled or disabled.
API resources are distinguished by their API group, resource type, namespace (for namespaced resources), and name. The API server handles the conversion between API versions transparently: all the different versions are actually representations of the same persisted data. The API server may serve the same underlying data through multiple API versions.
For example, suppose there are two API versions, v1
and v1beta1
, for the same
resource. If you originally created an object using the v1beta1
version of its
API, you can later read, update, or delete that object
using either the v1beta1
or the v1
API version.
API changes
Any system that is successful needs to grow and change as new use cases emerge or existing ones change. Therefore, Kubernetes has designed the Kubernetes API to continuously change and grow. The Kubernetes project aims to not break compatibility with existing clients, and to maintain that compatibility for a length of time so that other projects have an opportunity to adapt.
In general, new API resources and new resource fields can be added often and frequently. Elimination of resources or fields requires following the API deprecation policy.
Kubernetes makes a strong commitment to maintain compatibility for official Kubernetes APIs
once they reach general availability (GA), typically at API version v1
. Additionally,
Kubernetes keeps compatibility even for beta API versions wherever feasible:
if you adopt a beta API you can continue to interact with your cluster using that API,
even after the feature goes stable.
Note: Although Kubernetes also aims to maintain compatibility for alpha APIs versions, in some circumstances this is not possible. If you use any alpha API versions, check the release notes for Kubernetes when upgrading your cluster, in case the API did change.
Refer to API versions reference for more details on the API version level definitions.
API Extension
The Kubernetes API can be extended in one of two ways:
- Custom resources let you declaratively define how the API server should provide your chosen resource API.
- You can also extend the Kubernetes API by implementing an aggregation layer.
What's next
- Learn how to extend the Kubernetes API by adding your own CustomResourceDefinition.
- Controlling Access To The Kubernetes API describes how the cluster manages authentication and authorization for API access.
- Learn about API endpoints, resource types and samples by reading API Reference.
- Learn about what constitutes a compatible change, and how to change the API, from API changes.
1.4 - Working with Kubernetes Objects
1.4.1 - Understanding Kubernetes Objects
This page explains how Kubernetes objects are represented in the Kubernetes API, and how you can express them in .yaml
format.
Understanding Kubernetes objects
Kubernetes objects are persistent entities in the Kubernetes system. Kubernetes uses these entities to represent the state of your cluster. Specifically, they can describe:
- What containerized applications are running (and on which nodes)
- The resources available to those applications
- The policies around how those applications behave, such as restart policies, upgrades, and fault-tolerance
A Kubernetes object is a "record of intent"--once you create the object, the Kubernetes system will constantly work to ensure that object exists. By creating an object, you're effectively telling the Kubernetes system what you want your cluster's workload to look like; this is your cluster's desired state.
To work with Kubernetes objects--whether to create, modify, or delete them--you'll need to use the Kubernetes API. When you use the kubectl
command-line interface, for example, the CLI makes the necessary Kubernetes API calls for you. You can also use the Kubernetes API directly in your own programs using one of the Client Libraries.
Object Spec and Status
Almost every Kubernetes object includes two nested object fields that govern
the object's configuration: the object spec
and the object status
.
For objects that have a spec
, you have to set this when you create the object,
providing a description of the characteristics you want the resource to have:
its desired state.
The status
describes the current state of the object, supplied and updated
by the Kubernetes system and its components. The Kubernetes
control plane continually
and actively manages every object's actual state to match the desired state you
supplied.
For example: in Kubernetes, a Deployment is an object that can represent an
application running on your cluster. When you create the Deployment, you
might set the Deployment spec
to specify that you want three replicas of
the application to be running. The Kubernetes system reads the Deployment
spec and starts three instances of your desired application--updating
the status to match your spec. If any of those instances should fail
(a status change), the Kubernetes system responds to the difference
between spec and status by making a correction--in this case, starting
a replacement instance.
For more information on the object spec, status, and metadata, see the Kubernetes API Conventions.
Describing a Kubernetes object
When you create an object in Kubernetes, you must provide the object spec that describes its desired state, as well as some basic information about the object (such as a name). When you use the Kubernetes API to create the object (either directly or via kubectl
), that API request must include that information as JSON in the request body. Most often, you provide the information to kubectl
in a .yaml file. kubectl
converts the information to JSON when making the API request.
Here's an example .yaml
file that shows the required fields and object spec for a Kubernetes Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
replicas: 2 # tells deployment to run 2 pods matching the template
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
One way to create a Deployment using a .yaml
file like the one above is to use the
kubectl apply
command
in the kubectl
command-line interface, passing the .yaml
file as an argument. Here's an example:
kubectl apply -f https://k8s.io/examples/application/deployment.yaml --record
The output is similar to this:
deployment.apps/nginx-deployment created
Required Fields
In the .yaml
file for the Kubernetes object you want to create, you'll need to set values for the following fields:
apiVersion
- Which version of the Kubernetes API you're using to create this objectkind
- What kind of object you want to createmetadata
- Data that helps uniquely identify the object, including aname
string,UID
, and optionalnamespace
spec
- What state you desire for the object
The precise format of the object spec
is different for every Kubernetes object, and contains nested fields specific to that object. The Kubernetes API Reference can help you find the spec format for all of the objects you can create using Kubernetes.
For example, the spec
format for a Pod can be found in
PodSpec v1 core,
and the spec
format for a Deployment can be found in
DeploymentSpec v1 apps.
What's next
- Learn about the most important basic Kubernetes objects, such as Pod.
- Learn about controllers in Kubernetes.
- Using the Kubernetes API explains some more API concepts.
1.4.2 - Kubernetes Object Management
The kubectl
command-line tool supports several different ways to create and manage
Kubernetes objects. This document provides an overview of the different
approaches. Read the Kubectl book for
details of managing objects by Kubectl.
Management techniques
Warning: A Kubernetes object should be managed using only one technique. Mixing and matching techniques for the same object results in undefined behavior.
Management technique | Operates on | Recommended environment | Supported writers | Learning curve |
---|---|---|---|---|
Imperative commands | Live objects | Development projects | 1+ | Lowest |
Imperative object configuration | Individual files | Production projects | 1 | Moderate |
Declarative object configuration | Directories of files | Production projects | 1+ | Highest |
Imperative commands
When using imperative commands, a user operates directly on live objects
in a cluster. The user provides operations to
the kubectl
command as arguments or flags.
This is the recommended way to get started or to run a one-off task in a cluster. Because this technique operates directly on live objects, it provides no history of previous configurations.
Examples
Run an instance of the nginx container by creating a Deployment object:
kubectl create deployment nginx --image nginx
Trade-offs
Advantages compared to object configuration:
- Commands are expressed as a single action word.
- Commands require only a single step to make changes to the cluster.
Disadvantages compared to object configuration:
- Commands do not integrate with change review processes.
- Commands do not provide an audit trail associated with changes.
- Commands do not provide a source of records except for what is live.
- Commands do not provide a template for creating new objects.
Imperative object configuration
In imperative object configuration, the kubectl command specifies the operation (create, replace, etc.), optional flags and at least one file name. The file specified must contain a full definition of the object in YAML or JSON format.
See the API reference for more details on object definitions.
Warning: The imperativereplace
command replaces the existing spec with the newly provided one, dropping all changes to the object missing from the configuration file. This approach should not be used with resource types whose specs are updated independently of the configuration file. Services of typeLoadBalancer
, for example, have theirexternalIPs
field updated independently from the configuration by the cluster.
Examples
Create the objects defined in a configuration file:
kubectl create -f nginx.yaml
Delete the objects defined in two configuration files:
kubectl delete -f nginx.yaml -f redis.yaml
Update the objects defined in a configuration file by overwriting the live configuration:
kubectl replace -f nginx.yaml
Trade-offs
Advantages compared to imperative commands:
- Object configuration can be stored in a source control system such as Git.
- Object configuration can integrate with processes such as reviewing changes before push and audit trails.
- Object configuration provides a template for creating new objects.
Disadvantages compared to imperative commands:
- Object configuration requires basic understanding of the object schema.
- Object configuration requires the additional step of writing a YAML file.
Advantages compared to declarative object configuration:
- Imperative object configuration behavior is simpler and easier to understand.
- As of Kubernetes version 1.5, imperative object configuration is more mature.
Disadvantages compared to declarative object configuration:
- Imperative object configuration works best on files, not directories.
- Updates to live objects must be reflected in configuration files, or they will be lost during the next replacement.
Declarative object configuration
When using declarative object configuration, a user operates on object
configuration files stored locally, however the user does not define the
operations to be taken on the files. Create, update, and delete operations
are automatically detected per-object by kubectl
. This enables working on
directories, where different operations might be needed for different objects.
Note: Declarative object configuration retains changes made by other writers, even if the changes are not merged back to the object configuration file. This is possible by using thepatch
API operation to write only observed differences, instead of using thereplace
API operation to replace the entire object configuration.
Examples
Process all object configuration files in the configs
directory, and create or
patch the live objects. You can first diff
to see what changes are going to be
made, and then apply:
kubectl diff -f configs/
kubectl apply -f configs/
Recursively process directories:
kubectl diff -R -f configs/
kubectl apply -R -f configs/
Trade-offs
Advantages compared to imperative object configuration:
- Changes made directly to live objects are retained, even if they are not merged back into the configuration files.
- Declarative object configuration has better support for operating on directories and automatically detecting operation types (create, patch, delete) per-object.
Disadvantages compared to imperative object configuration:
- Declarative object configuration is harder to debug and understand results when they are unexpected.
- Partial updates using diffs create complex merge and patch operations.
What's next
- Managing Kubernetes Objects Using Imperative Commands
- Managing Kubernetes Objects Using Object Configuration (Imperative)
- Managing Kubernetes Objects Using Object Configuration (Declarative)
- Managing Kubernetes Objects Using Kustomize (Declarative)
- Kubectl Command Reference
- Kubectl Book
- Kubernetes API Reference
1.4.3 - Object Names and IDs
Each object in your cluster has a Name that is unique for that type of resource. Every Kubernetes object also has a UID that is unique across your whole cluster.
For example, you can only have one Pod named myapp-1234
within the same namespace, but you can have one Pod and one Deployment that are each named myapp-1234
.
For non-unique user-provided attributes, Kubernetes provides labels and annotations.
Names
A client-provided string that refers to an object in a resource URL, such as /api/v1/pods/some-name
.
Only one object of a given kind can have a given name at a time. However, if you delete the object, you can make a new object with the same name.
Note: In cases when objects represent a physical entity, like a Node representing a physical host, when the host is re-created under the same name without deleting and re-creating the Node, Kubernetes treats the new host as the old one, which may lead to inconsistencies.
Below are three types of commonly used name constraints for resources.
DNS Subdomain Names
Most resource types require a name that can be used as a DNS subdomain name as defined in RFC 1123. This means the name must:
- contain no more than 253 characters
- contain only lowercase alphanumeric characters, '-' or '.'
- start with an alphanumeric character
- end with an alphanumeric character
DNS Label Names
Some resource types require their names to follow the DNS label standard as defined in RFC 1123. This means the name must:
- contain at most 63 characters
- contain only lowercase alphanumeric characters or '-'
- start with an alphanumeric character
- end with an alphanumeric character
Path Segment Names
Some resource types require their names to be able to be safely encoded as a path segment. In other words, the name may not be "." or ".." and the name may not contain "/" or "%".
Here's an example manifest for a Pod named nginx-demo
.
apiVersion: v1
kind: Pod
metadata:
name: nginx-demo
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
Note: Some resource types have additional restrictions on their names.
UIDs
A Kubernetes systems-generated string to uniquely identify objects.
Every object created over the whole lifetime of a Kubernetes cluster has a distinct UID. It is intended to distinguish between historical occurrences of similar entities.
Kubernetes UIDs are universally unique identifiers (also known as UUIDs). UUIDs are standardized as ISO/IEC 9834-8 and as ITU-T X.667.
What's next
- Read about labels in Kubernetes.
- See the Identifiers and Names in Kubernetes design document.
1.4.4 - Namespaces
Kubernetes supports multiple virtual clusters backed by the same physical cluster. These virtual clusters are called namespaces.
When to Use Multiple Namespaces
Namespaces are intended for use in environments with many users spread across multiple teams, or projects. For clusters with a few to tens of users, you should not need to create or think about namespaces at all. Start using namespaces when you need the features they provide.
Namespaces provide a scope for names. Names of resources need to be unique within a namespace, but not across namespaces. Namespaces cannot be nested inside one another and each Kubernetes resource can only be in one namespace.
Namespaces are a way to divide cluster resources between multiple users (via resource quota).
It is not necessary to use multiple namespaces to separate slightly different resources, such as different versions of the same software: use labels to distinguish resources within the same namespace.
Working with Namespaces
Creation and deletion of namespaces are described in the Admin Guide documentation for namespaces.
Note: Avoid creating namespace with prefixkube-
, since it is reserved for Kubernetes system namespaces.
Viewing namespaces
You can list the current namespaces in a cluster using:
kubectl get namespace
NAME STATUS AGE
default Active 1d
kube-node-lease Active 1d
kube-public Active 1d
kube-system Active 1d
Kubernetes starts with four initial namespaces:
default
The default namespace for objects with no other namespacekube-system
The namespace for objects created by the Kubernetes systemkube-public
This namespace is created automatically and is readable by all users (including those not authenticated). This namespace is mostly reserved for cluster usage, in case that some resources should be visible and readable publicly throughout the whole cluster. The public aspect of this namespace is only a convention, not a requirement.kube-node-lease
This namespace for the lease objects associated with each node which improves the performance of the node heartbeats as the cluster scales.
Setting the namespace for a request
To set the namespace for a current request, use the --namespace
flag.
For example:
kubectl run nginx --image=nginx --namespace=<insert-namespace-name-here>
kubectl get pods --namespace=<insert-namespace-name-here>
Setting the namespace preference
You can permanently save the namespace for all subsequent kubectl commands in that context.
kubectl config set-context --current --namespace=<insert-namespace-name-here>
# Validate it
kubectl config view --minify | grep namespace:
Namespaces and DNS
When you create a Service,
it creates a corresponding DNS entry.
This entry is of the form <service-name>.<namespace-name>.svc.cluster.local
, which means
that if a container only uses <service-name>
, it will resolve to the service which
is local to a namespace. This is useful for using the same configuration across
multiple namespaces such as Development, Staging and Production. If you want to reach
across namespaces, you need to use the fully qualified domain name (FQDN).
Not All Objects are in a Namespace
Most Kubernetes resources (e.g. pods, services, replication controllers, and others) are in some namespaces. However namespace resources are not themselves in a namespace. And low-level resources, such as nodes and persistentVolumes, are not in any namespace.
To see which Kubernetes resources are and aren't in a namespace:
# In a namespace
kubectl api-resources --namespaced=true
# Not in a namespace
kubectl api-resources --namespaced=false
Automatic labelling
Kubernetes 1.21 [beta]
The Kubernetes control plane sets an immutable label
kubernetes.io/metadata.name
on all namespaces, provided that the NamespaceDefaultLabelName
feature gate is enabled.
The value of the label is the namespace name.
What's next
- Learn more about creating a new namespace.
- Learn more about deleting a namespace.
1.4.5 - Labels and Selectors
Labels are key/value pairs that are attached to objects, such as pods. Labels are intended to be used to specify identifying attributes of objects that are meaningful and relevant to users, but do not directly imply semantics to the core system. Labels can be used to organize and to select subsets of objects. Labels can be attached to objects at creation time and subsequently added and modified at any time. Each object can have a set of key/value labels defined. Each Key must be unique for a given object.
"metadata": {
"labels": {
"key1" : "value1",
"key2" : "value2"
}
}
Labels allow for efficient queries and watches and are ideal for use in UIs and CLIs. Non-identifying information should be recorded using annotations.
Motivation
Labels enable users to map their own organizational structures onto system objects in a loosely coupled fashion, without requiring clients to store these mappings.
Service deployments and batch processing pipelines are often multi-dimensional entities (e.g., multiple partitions or deployments, multiple release tracks, multiple tiers, multiple micro-services per tier). Management often requires cross-cutting operations, which breaks encapsulation of strictly hierarchical representations, especially rigid hierarchies determined by the infrastructure rather than by users.
Example labels:
"release" : "stable"
,"release" : "canary"
"environment" : "dev"
,"environment" : "qa"
,"environment" : "production"
"tier" : "frontend"
,"tier" : "backend"
,"tier" : "cache"
"partition" : "customerA"
,"partition" : "customerB"
"track" : "daily"
,"track" : "weekly"
These are examples of commonly used labels; you are free to develop your own conventions. Keep in mind that label Key must be unique for a given object.
Syntax and character set
Labels are key/value pairs. Valid label keys have two segments: an optional prefix and name, separated by a slash (/
). The name segment is required and must be 63 characters or less, beginning and ending with an alphanumeric character ([a-z0-9A-Z]
) with dashes (-
), underscores (_
), dots (.
), and alphanumerics between. The prefix is optional. If specified, the prefix must be a DNS subdomain: a series of DNS labels separated by dots (.
), not longer than 253 characters in total, followed by a slash (/
).
If the prefix is omitted, the label Key is presumed to be private to the user. Automated system components (e.g. kube-scheduler
, kube-controller-manager
, kube-apiserver
, kubectl
, or other third-party automation) which add labels to end-user objects must specify a prefix.
The kubernetes.io/
and k8s.io/
prefixes are reserved for Kubernetes core components.
Valid label value:
- must be 63 characters or less (cannot be empty),
- must begin and end with an alphanumeric character (
[a-z0-9A-Z]
), - could contain dashes (
-
), underscores (_
), dots (.
), and alphanumerics between.
For example, here's the configuration file for a Pod that has two labels environment: production
and app: nginx
:
apiVersion: v1
kind: Pod
metadata:
name: label-demo
labels:
environment: production
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
Label selectors
Unlike names and UIDs, labels do not provide uniqueness. In general, we expect many objects to carry the same label(s).
Via a label selector, the client/user can identify a set of objects. The label selector is the core grouping primitive in Kubernetes.
The API currently supports two types of selectors: equality-based and set-based.
A label selector can be made of multiple requirements which are comma-separated. In the case of multiple requirements, all must be satisfied so the comma separator acts as a logical AND (&&
) operator.
The semantics of empty or non-specified selectors are dependent on the context, and API types that use selectors should document the validity and meaning of them.
Note: For some API types, such as ReplicaSets, the label selectors of two instances must not overlap within a namespace, or the controller can see that as conflicting instructions and fail to determine how many replicas should be present.
Caution: For both equality-based and set-based conditions there is no logical OR (||
) operator. Ensure your filter statements are structured accordingly.
Equality-based requirement
Equality- or inequality-based requirements allow filtering by label keys and values. Matching objects must satisfy all of the specified label constraints, though they may have additional labels as well.
Three kinds of operators are admitted =
,==
,!=
. The first two represent equality (and are synonyms), while the latter represents inequality. For example:
environment = production
tier != frontend
The former selects all resources with key equal to environment
and value equal to production
.
The latter selects all resources with key equal to tier
and value distinct from frontend
, and all resources with no labels with the tier
key.
One could filter for resources in production
excluding frontend
using the comma operator: environment=production,tier!=frontend
One usage scenario for equality-based label requirement is for Pods to specify
node selection criteria. For example, the sample Pod below selects nodes with
the label "accelerator=nvidia-tesla-p100
".
apiVersion: v1
kind: Pod
metadata:
name: cuda-test
spec:
containers:
- name: cuda-test
image: "k8s.gcr.io/cuda-vector-add:v0.1"
resources:
limits:
nvidia.com/gpu: 1
nodeSelector:
accelerator: nvidia-tesla-p100
Set-based requirement
Set-based label requirements allow filtering keys according to a set of values. Three kinds of operators are supported: in
,notin
and exists
(only the key identifier). For example:
environment in (production, qa)
tier notin (frontend, backend)
partition
!partition
- The first example selects all resources with key equal to
environment
and value equal toproduction
orqa
. - The second example selects all resources with key equal to
tier
and values other thanfrontend
andbackend
, and all resources with no labels with thetier
key. - The third example selects all resources including a label with key
partition
; no values are checked. - The fourth example selects all resources without a label with key
partition
; no values are checked.
Similarly the comma separator acts as an AND operator. So filtering resources with a partition
key (no matter the value) and with environment
different than qa
can be achieved using partition,environment notin (qa)
.
The set-based label selector is a general form of equality since environment=production
is equivalent to environment in (production)
; similarly for !=
and notin
.
Set-based requirements can be mixed with equality-based requirements. For example: partition in (customerA, customerB),environment!=qa
.
API
LIST and WATCH filtering
LIST and WATCH operations may specify label selectors to filter the sets of objects returned using a query parameter. Both requirements are permitted (presented here as they would appear in a URL query string):
- equality-based requirements:
?labelSelector=environment%3Dproduction,tier%3Dfrontend
- set-based requirements:
?labelSelector=environment+in+%28production%2Cqa%29%2Ctier+in+%28frontend%29
Both label selector styles can be used to list or watch resources via a REST client. For example, targeting apiserver
with kubectl
and using equality-based one may write:
kubectl get pods -l environment=production,tier=frontend
or using set-based requirements:
kubectl get pods -l 'environment in (production),tier in (frontend)'
As already mentioned set-based requirements are more expressive. For instance, they can implement the OR operator on values:
kubectl get pods -l 'environment in (production, qa)'
or restricting negative matching via exists operator:
kubectl get pods -l 'environment,environment notin (frontend)'
Set references in API objects
Some Kubernetes objects, such as services
and replicationcontrollers
,
also use label selectors to specify sets of other resources, such as
pods.
Service and ReplicationController
The set of pods that a service
targets is defined with a label selector. Similarly, the population of pods that a replicationcontroller
should manage is also defined with a label selector.
Labels selectors for both objects are defined in json
or yaml
files using maps, and only equality-based requirement selectors are supported:
"selector": {
"component" : "redis",
}
or
selector:
component: redis
this selector (respectively in json
or yaml
format) is equivalent to component=redis
or component in (redis)
.
Resources that support set-based requirements
Newer resources, such as Job
,
Deployment
,
ReplicaSet
, and
DaemonSet
,
support set-based requirements as well.
selector:
matchLabels:
component: redis
matchExpressions:
- {key: tier, operator: In, values: [cache]}
- {key: environment, operator: NotIn, values: [dev]}
matchLabels
is a map of {key,value}
pairs. A single {key,value}
in the matchLabels
map is equivalent to an element of matchExpressions
, whose key
field is "key", the operator
is "In", and the values
array contains only "value". matchExpressions
is a list of pod selector requirements. Valid operators include In, NotIn, Exists, and DoesNotExist. The values set must be non-empty in the case of In and NotIn. All of the requirements, from both matchLabels
and matchExpressions
are ANDed together -- they must all be satisfied in order to match.
Selecting sets of nodes
One use case for selecting over labels is to constrain the set of nodes onto which a pod can schedule. See the documentation on node selection for more information.
1.4.6 - Annotations
You can use Kubernetes annotations to attach arbitrary non-identifying metadata to objects. Clients such as tools and libraries can retrieve this metadata.
Attaching metadata to objects
You can use either labels or annotations to attach metadata to Kubernetes objects. Labels can be used to select objects and to find collections of objects that satisfy certain conditions. In contrast, annotations are not used to identify and select objects. The metadata in an annotation can be small or large, structured or unstructured, and can include characters not permitted by labels.
Annotations, like labels, are key/value maps:
"metadata": {
"annotations": {
"key1" : "value1",
"key2" : "value2"
}
}
Here are some examples of information that could be recorded in annotations:
Fields managed by a declarative configuration layer. Attaching these fields as annotations distinguishes them from default values set by clients or servers, and from auto-generated fields and fields set by auto-sizing or auto-scaling systems.
Build, release, or image information like timestamps, release IDs, git branch, PR numbers, image hashes, and registry address.
Pointers to logging, monitoring, analytics, or audit repositories.
Client library or tool information that can be used for debugging purposes: for example, name, version, and build information.
User or tool/system provenance information, such as URLs of related objects from other ecosystem components.
Lightweight rollout tool metadata: for example, config or checkpoints.
Phone or pager numbers of persons responsible, or directory entries that specify where that information can be found, such as a team web site.
Directives from the end-user to the implementations to modify behavior or engage non-standard features.
Instead of using annotations, you could store this type of information in an external database or directory, but that would make it much harder to produce shared client libraries and tools for deployment, management, introspection, and the like.
Syntax and character set
Annotations are key/value pairs. Valid annotation keys have two segments: an optional prefix and name, separated by a slash (/
). The name segment is required and must be 63 characters or less, beginning and ending with an alphanumeric character ([a-z0-9A-Z]
) with dashes (-
), underscores (_
), dots (.
), and alphanumerics between. The prefix is optional. If specified, the prefix must be a DNS subdomain: a series of DNS labels separated by dots (.
), not longer than 253 characters in total, followed by a slash (/
).
If the prefix is omitted, the annotation Key is presumed to be private to the user. Automated system components (e.g. kube-scheduler
, kube-controller-manager
, kube-apiserver
, kubectl
, or other third-party automation) which add annotations to end-user objects must specify a prefix.
The kubernetes.io/
and k8s.io/
prefixes are reserved for Kubernetes core components.
For example, here's the configuration file for a Pod that has the annotation imageregistry: https://hub.docker.com/
:
apiVersion: v1
kind: Pod
metadata:
name: annotations-demo
annotations:
imageregistry: "https://hub.docker.com/"
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
What's next
Learn more about Labels and Selectors.
1.4.7 - Field Selectors
Field selectors let you select Kubernetes resources based on the value of one or more resource fields. Here are some examples of field selector queries:
metadata.name=my-service
metadata.namespace!=default
status.phase=Pending
This kubectl
command selects all Pods for which the value of the status.phase
field is Running
:
kubectl get pods --field-selector status.phase=Running
Note: Field selectors are essentially resource filters. By default, no selectors/filters are applied, meaning that all resources of the specified type are selected. This makes thekubectl
querieskubectl get pods
andkubectl get pods --field-selector ""
equivalent.
Supported fields
Supported field selectors vary by Kubernetes resource type. All resource types support the metadata.name
and metadata.namespace
fields. Using unsupported field selectors produces an error. For example:
kubectl get ingress --field-selector foo.bar=baz
Error from server (BadRequest): Unable to find "ingresses" that match label selector "", field selector "foo.bar=baz": "foo.bar" is not a known field selector: only "metadata.name", "metadata.namespace"
Supported operators
You can use the =
, ==
, and !=
operators with field selectors (=
and ==
mean the same thing). This kubectl
command, for example, selects all Kubernetes Services that aren't in the default
namespace:
kubectl get services --all-namespaces --field-selector metadata.namespace!=default
Chained selectors
As with label and other selectors, field selectors can be chained together as a comma-separated list. This kubectl
command selects all Pods for which the status.phase
does not equal Running
and the spec.restartPolicy
field equals Always
:
kubectl get pods --field-selector=status.phase!=Running,spec.restartPolicy=Always
Multiple resource types
You use field selectors across multiple resource types. This kubectl
command selects all Statefulsets and Services that are not in the default
namespace:
kubectl get statefulsets,services --all-namespaces --field-selector metadata.namespace!=default
1.4.8 - Recommended Labels
You can visualize and manage Kubernetes objects with more tools than kubectl and the dashboard. A common set of labels allows tools to work interoperably, describing objects in a common manner that all tools can understand.
In addition to supporting tooling, the recommended labels describe applications in a way that can be queried.
The metadata is organized around the concept of an application. Kubernetes is not a platform as a service (PaaS) and doesn't have or enforce a formal notion of an application. Instead, applications are informal and described with metadata. The definition of what an application contains is loose.
Note: These are recommended labels. They make it easier to manage applications but aren't required for any core tooling.
Shared labels and annotations share a common prefix: app.kubernetes.io
. Labels
without a prefix are private to users. The shared prefix ensures that shared labels
do not interfere with custom user labels.
Labels
In order to take full advantage of using these labels, they should be applied on every resource object.
Key | Description | Example | Type |
---|---|---|---|
app.kubernetes.io/name | The name of the application | mysql | string |
app.kubernetes.io/instance | A unique name identifying the instance of an application | mysql-abcxzy | string |
app.kubernetes.io/version | The current version of the application (e.g., a semantic version, revision hash, etc.) | 5.7.21 | string |
app.kubernetes.io/component | The component within the architecture | database | string |
app.kubernetes.io/part-of | The name of a higher level application this one is part of | wordpress | string |
app.kubernetes.io/managed-by | The tool being used to manage the operation of an application | helm | string |
To illustrate these labels in action, consider the following StatefulSet object:
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
app.kubernetes.io/name: mysql
app.kubernetes.io/instance: mysql-abcxzy
app.kubernetes.io/version: "5.7.21"
app.kubernetes.io/component: database
app.kubernetes.io/part-of: wordpress
app.kubernetes.io/managed-by: helm
Applications And Instances Of Applications
An application can be installed one or more times into a Kubernetes cluster and, in some cases, the same namespace. For example, WordPress can be installed more than once where different websites are different installations of WordPress.
The name of an application and the instance name are recorded separately. For
example, WordPress has a app.kubernetes.io/name
of wordpress
while it has
an instance name, represented as app.kubernetes.io/instance
with a value of
wordpress-abcxzy
. This enables the application and instance of the application
to be identifiable. Every instance of an application must have a unique name.
Examples
To illustrate different ways to use these labels the following examples have varying complexity.
A Simple Stateless Service
Consider the case for a simple stateless service deployed using Deployment
and Service
objects. The following two snippets represent how the labels could be used in their simplest form.
The Deployment
is used to oversee the pods running the application itself.
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app.kubernetes.io/name: myservice
app.kubernetes.io/instance: myservice-abcxzy
...
The Service
is used to expose the application.
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/name: myservice
app.kubernetes.io/instance: myservice-abcxzy
...
Web Application With A Database
Consider a slightly more complicated application: a web application (WordPress) using a database (MySQL), installed using Helm. The following snippets illustrate the start of objects used to deploy this application.
The start to the following Deployment
is used for WordPress:
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app.kubernetes.io/name: wordpress
app.kubernetes.io/instance: wordpress-abcxzy
app.kubernetes.io/version: "4.9.4"
app.kubernetes.io/managed-by: helm
app.kubernetes.io/component: server
app.kubernetes.io/part-of: wordpress
...
The Service
is used to expose WordPress:
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/name: wordpress
app.kubernetes.io/instance: wordpress-abcxzy
app.kubernetes.io/version: "4.9.4"
app.kubernetes.io/managed-by: helm
app.kubernetes.io/component: server
app.kubernetes.io/part-of: wordpress
...
MySQL is exposed as a StatefulSet
with metadata for both it and the larger application it belongs to:
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
app.kubernetes.io/name: mysql
app.kubernetes.io/instance: mysql-abcxzy
app.kubernetes.io/version: "5.7.21"
app.kubernetes.io/managed-by: helm
app.kubernetes.io/component: database
app.kubernetes.io/part-of: wordpress
...
The Service
is used to expose MySQL as part of WordPress:
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/name: mysql
app.kubernetes.io/instance: mysql-abcxzy
app.kubernetes.io/version: "5.7.21"
app.kubernetes.io/managed-by: helm
app.kubernetes.io/component: database
app.kubernetes.io/part-of: wordpress
...
With the MySQL StatefulSet
and Service
you'll notice information about both MySQL and WordPress, the broader application, are included.
2 - Cluster Architecture
2.1 - Nodes
Kubernetes runs your workload by placing containers into Pods to run on Nodes. A node may be a virtual or physical machine, depending on the cluster. Each node is managed by the control plane and contains the services necessary to run Pods
Typically you have several nodes in a cluster; in a learning or resource-limited environment, you might have only one node.
The components on a node include the kubelet, a container runtime, and the kube-proxy.
Management
There are two main ways to have Nodes added to the API server:
- The kubelet on a node self-registers to the control plane
- You (or another human user) manually add a Node object
After you create a Node object, or the kubelet on a node self-registers, the control plane checks whether the new Node object is valid. For example, if you try to create a Node from the following JSON manifest:
{
"kind": "Node",
"apiVersion": "v1",
"metadata": {
"name": "10.240.79.157",
"labels": {
"name": "my-first-k8s-node"
}
}
}
Kubernetes creates a Node object internally (the representation). Kubernetes checks
that a kubelet has registered to the API server that matches the metadata.name
field of the Node. If the node is healthy (i.e. all necessary services are running),
then it is eligible to run a Pod. Otherwise, that node is ignored for any cluster activity
until it becomes healthy.
Note:Kubernetes keeps the object for the invalid Node and continues checking to see whether it becomes healthy.
You, or a controller, must explicitly delete the Node object to stop that health checking.
The name of a Node object must be a valid DNS subdomain name.
Node name uniqueness
The name identifies a Node. Two Nodes cannot have the same name at the same time. Kubernetes also assumes that a resource with the same name is the same object. In case of a Node, it is implicitly assumed that an instance using the same name will have the same state (e.g. network settings, root disk contents). This may lead to inconsistencies if an instance was modified without changing its name. If the Node needs to be replaced or updated significantly, the existing Node object needs to be removed from API server first and re-added after the update.
Self-registration of Nodes
When the kubelet flag --register-node
is true (the default), the kubelet will attempt to
register itself with the API server. This is the preferred pattern, used by most distros.
For self-registration, the kubelet is started with the following options:
--kubeconfig
- Path to credentials to authenticate itself to the API server.--cloud-provider
- How to talk to a cloud provider to read metadata about itself.--register-node
- Automatically register with the API server.--register-with-taints
- Register the node with the given list of taints (comma separated<key>=<value>:<effect>
).No-op if
register-node
is false.--node-ip
- IP address of the node.--node-labels
- Labels to add when registering the node in the cluster (see label restrictions enforced by the NodeRestriction admission plugin).--node-status-update-frequency
- Specifies how often kubelet posts node status to master.
When the Node authorization mode and NodeRestriction admission plugin are enabled, kubelets are only authorized to create/modify their own Node resource.
Manual Node administration
You can create and modify Node objects using kubectl.
When you want to create Node objects manually, set the kubelet flag --register-node=false
.
You can modify Node objects regardless of the setting of --register-node
.
For example, you can set labels on an existing Node or mark it unschedulable.
You can use labels on Nodes in conjunction with node selectors on Pods to control scheduling. For example, you can constrain a Pod to only be eligible to run on a subset of the available nodes.
Marking a node as unschedulable prevents the scheduler from placing new pods onto that Node but does not affect existing Pods on the Node. This is useful as a preparatory step before a node reboot or other maintenance.
To mark a Node unschedulable, run:
kubectl cordon $NODENAME
Note: Pods that are part of a DaemonSet tolerate being run on an unschedulable Node. DaemonSets typically provide node-local services that should run on the Node even if it is being drained of workload applications.
Node status
A Node's status contains the following information:
You can use kubectl
to view a Node's status and other details:
kubectl describe node <insert-node-name-here>
Each section of the output is described below.
Addresses
The usage of these fields varies depending on your cloud provider or bare metal configuration.
- HostName: The hostname as reported by the node's kernel. Can be overridden via the kubelet
--hostname-override
parameter. - ExternalIP: Typically the IP address of the node that is externally routable (available from outside the cluster).
- InternalIP: Typically the IP address of the node that is routable only within the cluster.
Conditions
The conditions
field describes the status of all Running
nodes. Examples of conditions include:
Node Condition | Description |
---|---|
Ready | True if the node is healthy and ready to accept pods, False if the node is not healthy and is not accepting pods, and Unknown if the node controller has not heard from the node in the last node-monitor-grace-period (default is 40 seconds) |
DiskPressure | True if pressure exists on the disk size--that is, if the disk capacity is low; otherwise False |
MemoryPressure | True if pressure exists on the node memory--that is, if the node memory is low; otherwise False |
PIDPressure | True if pressure exists on the processes—that is, if there are too many processes on the node; otherwise False |
NetworkUnavailable | True if the network for the node is not correctly configured, otherwise False |
Note: If you use command-line tools to print details of a cordoned Node, the Condition includesSchedulingDisabled
.SchedulingDisabled
is not a Condition in the Kubernetes API; instead, cordoned nodes are marked Unschedulable in their spec.
The node condition is represented as a JSON object. For example, the following structure describes a healthy node:
"conditions": [
{
"type": "Ready",
"status": "True",
"reason": "KubeletReady",
"message": "kubelet is posting ready status",
"lastHeartbeatTime": "2019-06-05T18:38:35Z",
"lastTransitionTime": "2019-06-05T11:41:27Z"
}
]
If the Status of the Ready condition remains Unknown
or False
for longer than the pod-eviction-timeout
(an argument passed to the kube-controller-manager), then all the Pods on the node are scheduled for deletion by the node controller. The default eviction timeout duration is five minutes. In some cases when the node is unreachable, the API server is unable to communicate with the kubelet on the node. The decision to delete the pods cannot be communicated to the kubelet until communication with the API server is re-established. In the meantime, the pods that are scheduled for deletion may continue to run on the partitioned node.
The node controller does not force delete pods until it is confirmed that they have stopped
running in the cluster. You can see the pods that might be running on an unreachable node as
being in the Terminating
or Unknown
state. In cases where Kubernetes cannot deduce from the
underlying infrastructure if a node has permanently left a cluster, the cluster administrator
may need to delete the node object by hand. Deleting the node object from Kubernetes causes
all the Pod objects running on the node to be deleted from the API server and frees up their
names.
The node lifecycle controller automatically creates taints that represent conditions. The scheduler takes the Node's taints into consideration when assigning a Pod to a Node. Pods can also have tolerations which let them tolerate a Node's taints.
See Taint Nodes by Condition for more details.
Capacity and Allocatable
Describes the resources available on the node: CPU, memory, and the maximum number of pods that can be scheduled onto the node.
The fields in the capacity block indicate the total amount of resources that a Node has. The allocatable block indicates the amount of resources on a Node that is available to be consumed by normal Pods.
You may read more about capacity and allocatable resources while learning how to reserve compute resources on a Node.
Info
Describes general information about the node, such as kernel version, Kubernetes version (kubelet and kube-proxy version), Docker version (if used), and OS name. This information is gathered by Kubelet from the node.
Node controller
The node controller is a Kubernetes control plane component that manages various aspects of nodes.
The node controller has multiple roles in a node's life. The first is assigning a CIDR block to the node when it is registered (if CIDR assignment is turned on).
The second is keeping the node controller's internal list of nodes up to date with the cloud provider's list of available machines. When running in a cloud environment and whenever a node is unhealthy, the node controller asks the cloud provider if the VM for that node is still available. If not, the node controller deletes the node from its list of nodes.
The third is monitoring the nodes' health. The node controller is responsible for:
- Updating the NodeReady condition of NodeStatus to ConditionUnknown when a node becomes unreachable, as the node controller stops receiving heartbeats for some reason such as the node being down.
- Evicting all the pods from the node using graceful termination if the node continues to be unreachable. The default timeouts are 40s to start reporting ConditionUnknown and 5m after that to start evicting pods.
The node controller checks the state of each node every --node-monitor-period
seconds.
Heartbeats
Heartbeats, sent by Kubernetes nodes, help determine the availability of a node.
There are two forms of heartbeats: updates of NodeStatus
and the
Lease object.
Each Node has an associated Lease object in the kube-node-lease
namespace.
Lease is a lightweight resource, which improves the performance
of the node heartbeats as the cluster scales.
The kubelet is responsible for creating and updating the NodeStatus
and
a Lease object.
- The kubelet updates the
NodeStatus
either when there is change in status or if there has been no update for a configured interval. The default interval forNodeStatus
updates is 5 minutes, which is much longer than the 40 second default timeout for unreachable nodes. - The kubelet creates and then updates its Lease object every 10 seconds
(the default update interval). Lease updates occur independently from the
NodeStatus
updates. If the Lease update fails, the kubelet retries with exponential backoff starting at 200 milliseconds and capped at 7 seconds.
Reliability
In most cases, the node controller limits the eviction rate to
--node-eviction-rate
(default 0.1) per second, meaning it won't evict pods
from more than 1 node per 10 seconds.
The node eviction behavior changes when a node in a given availability zone becomes unhealthy. The node controller checks what percentage of nodes in the zone are unhealthy (NodeReady condition is ConditionUnknown or ConditionFalse) at the same time:
- If the fraction of unhealthy nodes is at least
--unhealthy-zone-threshold
(default 0.55), then the eviction rate is reduced. - If the cluster is small (i.e. has less than or equal to
--large-cluster-size-threshold
nodes - default 50), then evictions are stopped. - Otherwise, the eviction rate is reduced to
--secondary-node-eviction-rate
(default 0.01) per second.
The reason these policies are implemented per availability zone is because one availability zone might become partitioned from the master while the others remain connected. If your cluster does not span multiple cloud provider availability zones, then there is only one availability zone (i.e. the whole cluster).
A key reason for spreading your nodes across availability zones is so that the
workload can be shifted to healthy zones when one entire zone goes down.
Therefore, if all nodes in a zone are unhealthy, then the node controller evicts at
the normal rate of --node-eviction-rate
. The corner case is when all zones are
completely unhealthy (i.e. there are no healthy nodes in the cluster). In such a
case, the node controller assumes that there is some problem with master
connectivity and stops all evictions until some connectivity is restored.
The node controller is also responsible for evicting pods running on nodes with
NoExecute
taints, unless those pods tolerate that taint.
The node controller also adds taints
corresponding to node problems like node unreachable or not ready. This means
that the scheduler won't place Pods onto unhealthy nodes.
Caution:kubectl cordon
marks a node as 'unschedulable', which has the side effect of the service controller removing the node from any LoadBalancer node target lists it was previously eligible for, effectively removing incoming load balancer traffic from the cordoned node(s).
Node capacity
Node objects track information about the Node's resource capacity: for example, the amount of memory available and the number of CPUs. Nodes that self register report their capacity during registration. If you manually add a Node, then you need to set the node's capacity information when you add it.
The Kubernetes scheduler ensures that there are enough resources for all the Pods on a Node. The scheduler checks that the sum of the requests of containers on the node is no greater than the node's capacity. That sum of requests includes all containers managed by the kubelet, but excludes any containers started directly by the container runtime, and also excludes any processes running outside of the kubelet's control.
Note: If you want to explicitly reserve resources for non-Pod processes, see reserve resources for system daemons.
Node topology
Kubernetes v1.16 [alpha]
If you have enabled the TopologyManager
feature gate, then
the kubelet can use topology hints when making resource assignment decisions.
See Control Topology Management Policies on a Node
for more information.
Graceful node shutdown
Kubernetes v1.21 [beta]
The kubelet attempts to detect node system shutdown and terminates pods running on the node.
Kubelet ensures that pods follow the normal pod termination process during the node shutdown.
The Graceful node shutdown feature depends on systemd since it takes advantage of systemd inhibitor locks to delay the node shutdown with a given duration.
Graceful node shutdown is controlled with the GracefulNodeShutdown
feature gate which is
enabled by default in 1.21.
Note that by default, both configuration options described below,
ShutdownGracePeriod
and ShutdownGracePeriodCriticalPods
are set to zero,
thus not activating Graceful node shutdown functionality.
To activate the feature, the two kubelet config settings should be configured appropriately and set to non-zero values.
During a graceful shutdown, kubelet terminates pods in two phases:
- Terminate regular pods running on the node.
- Terminate critical pods running on the node.
Graceful node shutdown feature is configured with two KubeletConfiguration
options:
ShutdownGracePeriod
:- Specifies the total duration that the node should delay the shutdown by. This is the total grace period for pod termination for both regular and critical pods.
ShutdownGracePeriodCriticalPods
:- Specifies the duration used to terminate critical pods during a node shutdown. This value should be less than
ShutdownGracePeriod
.
- Specifies the duration used to terminate critical pods during a node shutdown. This value should be less than
For example, if ShutdownGracePeriod=30s
, and
ShutdownGracePeriodCriticalPods=10s
, kubelet will delay the node shutdown by
30 seconds. During the shutdown, the first 20 (30-10) seconds would be reserved
for gracefully terminating normal pods, and the last 10 seconds would be
reserved for terminating critical pods.
What's next
- Learn about the components that make up a node.
- Read the API definition for Node.
- Read the Node section of the architecture design document.
- Read about taints and tolerations.
2.2 - Control Plane-Node Communication
This document catalogs the communication paths between the control plane (apiserver) and the Kubernetes cluster. The intent is to allow users to customize their installation to harden the network configuration such that the cluster can be run on an untrusted network (or on fully public IPs on a cloud provider).
Node to Control Plane
Kubernetes has a "hub-and-spoke" API pattern. All API usage from nodes (or the pods they run) terminates at the apiserver. None of the other control plane components are designed to expose remote services. The apiserver is configured to listen for remote connections on a secure HTTPS port (typically 443) with one or more forms of client authentication enabled. One or more forms of authorization should be enabled, especially if anonymous requests or service account tokens are allowed.
Nodes should be provisioned with the public root certificate for the cluster such that they can connect securely to the apiserver along with valid client credentials. A good approach is that the client credentials provided to the kubelet are in the form of a client certificate. See kubelet TLS bootstrapping for automated provisioning of kubelet client certificates.
Pods that wish to connect to the apiserver can do so securely by leveraging a service account so that Kubernetes will automatically inject the public root certificate and a valid bearer token into the pod when it is instantiated.
The kubernetes
service (in default
namespace) is configured with a virtual IP address that is redirected (via kube-proxy) to the HTTPS endpoint on the apiserver.
The control plane components also communicate with the cluster apiserver over the secure port.
As a result, the default operating mode for connections from the nodes and pods running on the nodes to the control plane is secured by default and can run over untrusted and/or public networks.
Control Plane to node
There are two primary communication paths from the control plane (apiserver) to the nodes. The first is from the apiserver to the kubelet process which runs on each node in the cluster. The second is from the apiserver to any node, pod, or service through the apiserver's proxy functionality.
apiserver to kubelet
The connections from the apiserver to the kubelet are used for:
- Fetching logs for pods.
- Attaching (through kubectl) to running pods.
- Providing the kubelet's port-forwarding functionality.
These connections terminate at the kubelet's HTTPS endpoint. By default, the apiserver does not verify the kubelet's serving certificate, which makes the connection subject to man-in-the-middle attacks and unsafe to run over untrusted and/or public networks.
To verify this connection, use the --kubelet-certificate-authority
flag to provide the apiserver with a root certificate bundle to use to verify the kubelet's serving certificate.
If that is not possible, use SSH tunneling between the apiserver and kubelet if required to avoid connecting over an untrusted or public network.
Finally, Kubelet authentication and/or authorization should be enabled to secure the kubelet API.
apiserver to nodes, pods, and services
The connections from the apiserver to a node, pod, or service default to plain HTTP connections and are therefore neither authenticated nor encrypted. They can be run over a secure HTTPS connection by prefixing https:
to the node, pod, or service name in the API URL, but they will not validate the certificate provided by the HTTPS endpoint nor provide client credentials. So while the connection will be encrypted, it will not provide any guarantees of integrity. These connections are not currently safe to run over untrusted or public networks.
SSH tunnels
Kubernetes supports SSH tunnels to protect the control plane to nodes communication paths. In this configuration, the apiserver initiates an SSH tunnel to each node in the cluster (connecting to the ssh server listening on port 22) and passes all traffic destined for a kubelet, node, pod, or service through the tunnel. This tunnel ensures that the traffic is not exposed outside of the network in which the nodes are running.
SSH tunnels are currently deprecated, so you shouldn't opt to use them unless you know what you are doing. The Konnectivity service is a replacement for this communication channel.
Konnectivity service
Kubernetes v1.18 [beta]
As a replacement to the SSH tunnels, the Konnectivity service provides TCP level proxy for the control plane to cluster communication. The Konnectivity service consists of two parts: the Konnectivity server in the control plane network and the Konnectivity agents in the nodes network. The Konnectivity agents initiate connections to the Konnectivity server and maintain the network connections. After enabling the Konnectivity service, all control plane to nodes traffic goes through these connections.
Follow the Konnectivity service task to set up the Konnectivity service in your cluster.
2.3 - Controllers
In robotics and automation, a control loop is a non-terminating loop that regulates the state of a system.
Here is one example of a control loop: a thermostat in a room.
When you set the temperature, that's telling the thermostat about your desired state. The actual room temperature is the current state. The thermostat acts to bring the current state closer to the desired state, by turning equipment on or off.
In Kubernetes, controllers are control loops that watch the state of your cluster, then make or request changes where needed. Each controller tries to move the current cluster state closer to the desired state.Controller pattern
A controller tracks at least one Kubernetes resource type. These objects have a spec field that represents the desired state. The controller(s) for that resource are responsible for making the current state come closer to that desired state.
The controller might carry the action out itself; more commonly, in Kubernetes, a controller will send messages to the API server that have useful side effects. You'll see examples of this below.
Control via API server
The Job controller is an example of a Kubernetes built-in controller. Built-in controllers manage state by interacting with the cluster API server.
Job is a Kubernetes resource that runs a Pod, or perhaps several Pods, to carry out a task and then stop.
(Once scheduled, Pod objects become part of the desired state for a kubelet).
When the Job controller sees a new task it makes sure that, somewhere in your cluster, the kubelets on a set of Nodes are running the right number of Pods to get the work done. The Job controller does not run any Pods or containers itself. Instead, the Job controller tells the API server to create or remove Pods. Other components in the control plane act on the new information (there are new Pods to schedule and run), and eventually the work is done.
After you create a new Job, the desired state is for that Job to be completed. The Job controller makes the current state for that Job be nearer to your desired state: creating Pods that do the work you wanted for that Job, so that the Job is closer to completion.
Controllers also update the objects that configure them.
For example: once the work is done for a Job, the Job controller
updates that Job object to mark it Finished
.
(This is a bit like how some thermostats turn a light off to indicate that your room is now at the temperature you set).
Direct control
By contrast with Job, some controllers need to make changes to things outside of your cluster.
For example, if you use a control loop to make sure there are enough Nodes in your cluster, then that controller needs something outside the current cluster to set up new Nodes when needed.
Controllers that interact with external state find their desired state from the API server, then communicate directly with an external system to bring the current state closer in line.
(There actually is a controller that horizontally scales the nodes in your cluster.)
The important point here is that the controller makes some change to bring about your desired state, and then reports current state back to your cluster's API server. Other control loops can observe that reported data and take their own actions.
In the thermostat example, if the room is very cold then a different controller might also turn on a frost protection heater. With Kubernetes clusters, the control plane indirectly works with IP address management tools, storage services, cloud provider APIs, and other services by extending Kubernetes to implement that.
Desired versus current state
Kubernetes takes a cloud-native view of systems, and is able to handle constant change.
Your cluster could be changing at any point as work happens and control loops automatically fix failures. This means that, potentially, your cluster never reaches a stable state.
As long as the controllers for your cluster are running and able to make useful changes, it doesn't matter if the overall state is stable or not.
Design
As a tenet of its design, Kubernetes uses lots of controllers that each manage a particular aspect of cluster state. Most commonly, a particular control loop (controller) uses one kind of resource as its desired state, and has a different kind of resource that it manages to make that desired state happen. For example, a controller for Jobs tracks Job objects (to discover new work) and Pod objects (to run the Jobs, and then to see when the work is finished). In this case something else creates the Jobs, whereas the Job controller creates Pods.
It's useful to have simple controllers rather than one, monolithic set of control loops that are interlinked. Controllers can fail, so Kubernetes is designed to allow for that.
Note:There can be several controllers that create or update the same kind of object. Behind the scenes, Kubernetes controllers make sure that they only pay attention to the resources linked to their controlling resource.
For example, you can have Deployments and Jobs; these both create Pods. The Job controller does not delete the Pods that your Deployment created, because there is information (labels) the controllers can use to tell those Pods apart.
Ways of running controllers
Kubernetes comes with a set of built-in controllers that run inside the kube-controller-manager. These built-in controllers provide important core behaviors.
The Deployment controller and Job controller are examples of controllers that come as part of Kubernetes itself ("built-in" controllers). Kubernetes lets you run a resilient control plane, so that if any of the built-in controllers were to fail, another part of the control plane will take over the work.
You can find controllers that run outside the control plane, to extend Kubernetes. Or, if you want, you can write a new controller yourself. You can run your own controller as a set of Pods, or externally to Kubernetes. What fits best will depend on what that particular controller does.
What's next
- Read about the Kubernetes control plane
- Discover some of the basic Kubernetes objects
- Learn more about the Kubernetes API
- If you want to write your own controller, see Extension Patterns in Extending Kubernetes.
2.4 - Cloud Controller Manager
Kubernetes v1.11 [beta]
Cloud infrastructure technologies let you run Kubernetes on public, private, and hybrid clouds. Kubernetes believes in automated, API-driven infrastructure without tight coupling between components.
The cloud-controller-manager is a Kubernetes control plane component that embeds cloud-specific control logic. The cloud controller manager lets you link your cluster into your cloud provider's API, and separates out the components that interact with that cloud platform from components that only interact with your cluster.
By decoupling the interoperability logic between Kubernetes and the underlying cloud infrastructure, the cloud-controller-manager component enables cloud providers to release features at a different pace compared to the main Kubernetes project.
The cloud-controller-manager is structured using a plugin mechanism that allows different cloud providers to integrate their platforms with Kubernetes.
Design
The cloud controller manager runs in the control plane as a replicated set of processes (usually, these are containers in Pods). Each cloud-controller-manager implements multiple controllers in a single process.
Note: You can also run the cloud controller manager as a Kubernetes addon rather than as part of the control plane.
Cloud controller manager functions
The controllers inside the cloud controller manager include:
Node controller
The node controller is responsible for creating Node objects when new servers are created in your cloud infrastructure. The node controller obtains information about the hosts running inside your tenancy with the cloud provider. The node controller performs the following functions:
- Initialize a Node object for each server that the controller discovers through the cloud provider API.
- Annotating and labelling the Node object with cloud-specific information, such as the region the node is deployed into and the resources (CPU, memory, etc) that it has available.
- Obtain the node's hostname and network addresses.
- Verifying the node's health. In case a node becomes unresponsive, this controller checks with your cloud provider's API to see if the server has been deactivated / deleted / terminated. If the node has been deleted from the cloud, the controller deletes the Node object from your Kubernetes cluster.
Some cloud provider implementations split this into a node controller and a separate node lifecycle controller.
Route controller
The route controller is responsible for configuring routes in the cloud appropriately so that containers on different nodes in your Kubernetes cluster can communicate with each other.
Depending on the cloud provider, the route controller might also allocate blocks of IP addresses for the Pod network.
Service controller
Services integrate with cloud infrastructure components such as managed load balancers, IP addresses, network packet filtering, and target health checking. The service controller interacts with your cloud provider's APIs to set up load balancers and other infrastructure components when you declare a Service resource that requires them.
Authorization
This section breaks down the access that the cloud controller managers requires on various API objects, in order to perform its operations.
Node controller
The Node controller only works with Node objects. It requires full access to read and modify Node objects.
v1/Node
:
- Get
- List
- Create
- Update
- Patch
- Watch
- Delete
Route controller
The route controller listens to Node object creation and configures routes appropriately. It requires Get access to Node objects.
v1/Node
:
- Get
Service controller
The service controller listens to Service object Create, Update and Delete events and then configures Endpoints for those Services appropriately.
To access Services, it requires List, and Watch access. To update Services, it requires Patch and Update access.
To set up Endpoints resources for the Services, it requires access to Create, List, Get, Watch, and Update.
v1/Service
:
- List
- Get
- Watch
- Patch
- Update
Others
The implementation of the core of the cloud controller manager requires access to create Event objects, and to ensure secure operation, it requires access to create ServiceAccounts.
v1/Event
:
- Create
- Patch
- Update
v1/ServiceAccount
:
- Create
The RBAC ClusterRole for the cloud controller manager looks like:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: cloud-controller-manager
rules:
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
- update
- apiGroups:
- ""
resources:
- nodes
verbs:
- '*'
- apiGroups:
- ""
resources:
- nodes/status
verbs:
- patch
- apiGroups:
- ""
resources:
- services
verbs:
- list
- patch
- update
- watch
- apiGroups:
- ""
resources:
- serviceaccounts
verbs:
- create
- apiGroups:
- ""
resources:
- persistentvolumes
verbs:
- get
- list
- update
- watch
- apiGroups:
- ""
resources:
- endpoints
verbs:
- create
- get
- list
- watch
- update
What's next
Cloud Controller Manager Administration has instructions on running and managing the cloud controller manager.
To upgrade a HA control plane to use the cloud controller manager, see Migrate Replicated Control Plane To Use Cloud Controller Manager.
Want to know how to implement your own cloud controller manager, or extend an existing project?
The cloud controller manager uses Go interfaces to allow implementations from any cloud to be plugged in. Specifically, it uses the CloudProvider
interface defined in cloud.go
from kubernetes/cloud-provider.
The implementation of the shared controllers highlighted in this document (Node, Route, and Service), and some scaffolding along with the shared cloudprovider interface, is part of the Kubernetes core. Implementations specific to cloud providers are outside the core of Kubernetes and implement the CloudProvider
interface.
For more information about developing plugins, see Developing Cloud Controller Manager.
3 - Containers
Each container that you run is repeatable; the standardization from having dependencies included means that you get the same behavior wherever you run it.
Containers decouple applications from underlying host infrastructure. This makes deployment easier in different cloud or OS environments.
Container images
A container image is a ready-to-run software package, containing everything needed to run an application: the code and any runtime it requires, application and system libraries, and default values for any essential settings.
By design, a container is immutable: you cannot change the code of a container that is already running. If you have a containerized application and want to make changes, you need to build a new image that includes the change, then recreate the container to start from the updated image.
Container runtimes
The container runtime is the software that is responsible for running containers.
Kubernetes supports several container runtimes: Docker, containerd, CRI-O, and any implementation of the Kubernetes CRI (Container Runtime Interface).
What's next
- Read about container images
- Read about Pods
3.1 - Images
A container image represents binary data that encapsulates an application and all its software dependencies. Container images are executable software bundles that can run standalone and that make very well defined assumptions about their runtime environment.
You typically create a container image of your application and push it to a registry before referring to it in a Pod
This page provides an outline of the container image concept.
Image names
Container images are usually given a name such as pause
, example/mycontainer
, or kube-apiserver
.
Images can also include a registry hostname; for example: fictional.registry.example/imagename
,
and possible a port number as well; for example: fictional.registry.example:10443/imagename
.
If you don't specify a registry hostname, Kubernetes assumes that you mean the Docker public registry.
After the image name part you can add a tag (as also using with commands such
as docker
and podman
).
Tags let you identify different versions of the same series of images.
Image tags consist of lowercase and uppercase letters, digits, underscores (_
),
periods (.
), and dashes (-
).
There are additional rules about where you can place the separator
characters (_
, -
, and .
) inside an image tag.
If you don't specify a tag, Kubernetes assumes you mean the tag latest
.
Caution:You should avoid using the
latest
tag when deploying containers in production, as it is harder to track which version of the image is running and more difficult to roll back to a working version.Instead, specify a meaningful tag such as
v1.42.0
.
Updating images
When you first create a Deployment,
StatefulSet, Pod, or other
object that includes a Pod template, then by default the pull policy of all
containers in that pod will be set to IfNotPresent
if it is not explicitly
specified. This policy causes the
kubelet to skip pulling an
image if it already exists.
If you would like to always force a pull, you can do one of the following:
- set the
imagePullPolicy
of the container toAlways
. - omit the
imagePullPolicy
and use:latest
as the tag for the image to use; Kubernetes will set the policy toAlways
. - omit the
imagePullPolicy
and the tag for the image to use. - enable the AlwaysPullImages admission controller.
Note:The value of
imagePullPolicy
of the container is always set when the object is first created, and is not updated if the image's tag later changes.For example, if you create a Deployment with an image whose tag is not
:latest
, and later update that Deployment's image to a:latest
tag, theimagePullPolicy
field will not change toAlways
. You must manually change the pull policy of any object after its initial creation.
When imagePullPolicy
is defined without a specific value, it is also set to Always
.
Multi-architecture images with image indexes
As well as providing binary images, a container registry can also serve a container image index. An image index can point to multiple image manifests for architecture-specific versions of a container. The idea is that you can have a name for an image (for example: pause
, example/mycontainer
, kube-apiserver
) and allow different systems to fetch the right binary image for the machine architecture they are using.
Kubernetes itself typically names container images with a suffix -$(ARCH)
. For backward compatibility, please generate the older images with suffixes. The idea is to generate say pause
image which has the manifest for all the arch(es) and say pause-amd64
which is backwards compatible for older configurations or YAML files which may have hard coded the images with suffixes.
Using a private registry
Private registries may require keys to read images from them.
Credentials can be provided in several ways:
- Configuring Nodes to Authenticate to a Private Registry
- all pods can read any configured private registries
- requires node configuration by cluster administrator
- Pre-pulled Images
- all pods can use any images cached on a node
- requires root access to all nodes to setup
- Specifying ImagePullSecrets on a Pod
- only pods which provide own keys can access the private registry
- Vendor-specific or local extensions
- if you're using a custom node configuration, you (or your cloud provider) can implement your mechanism for authenticating the node to the container registry.
These options are explained in more detail below.
Configuring nodes to authenticate to a private registry
If you run Docker on your nodes, you can configure the Docker container runtime to authenticate to a private container registry.
This approach is suitable if you can control node configuration.
Note: Default Kubernetes only supports theauths
andHttpHeaders
section in Docker configuration. Docker credential helpers (credHelpers
orcredsStore
) are not supported.
Docker stores keys for private registries in the $HOME/.dockercfg
or $HOME/.docker/config.json
file. If you put the same file
in the search paths list below, kubelet uses it as the credential provider when pulling images.
{--root-dir:-/var/lib/kubelet}/config.json
{cwd of kubelet}/config.json
${HOME}/.docker/config.json
/.docker/config.json
{--root-dir:-/var/lib/kubelet}/.dockercfg
{cwd of kubelet}/.dockercfg
${HOME}/.dockercfg
/.dockercfg
Note: You may have to setHOME=/root
explicitly in the environment of the kubelet process.
Here are the recommended steps to configuring your nodes to use a private registry. In this example, run these on your desktop/laptop:
- Run
docker login [server]
for each set of credentials you want to use. This updates$HOME/.docker/config.json
on your PC. - View
$HOME/.docker/config.json
in an editor to ensure it contains only the credentials you want to use. - Get a list of your nodes; for example:
- if you want the names:
nodes=$( kubectl get nodes -o jsonpath='{range.items[*].metadata}{.name} {end}' )
- if you want to get the IP addresses:
nodes=$( kubectl get nodes -o jsonpath='{range .items[*].status.addresses[?(@.type=="ExternalIP")]}{.address} {end}' )
- if you want the names:
- Copy your local
.docker/config.json
to one of the search paths list above.- for example, to test this out:
for n in $nodes; do scp ~/.docker/config.json root@"$n":/var/lib/kubelet/config.json; done
- for example, to test this out:
Note: For production clusters, use a configuration management tool so that you can apply this setting to all the nodes where you need it.
Verify by creating a Pod that uses a private image; for example:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: private-image-test-1
spec:
containers:
- name: uses-private-image
image: $PRIVATE_IMAGE_NAME
imagePullPolicy: Always
command: [ "echo", "SUCCESS" ]
EOF
pod/private-image-test-1 created
If everything is working, then, after a few moments, you can run:
kubectl logs private-image-test-1
and see that the command outputs:
SUCCESS
If you suspect that the command failed, you can run:
kubectl describe pods/private-image-test-1 | grep 'Failed'
In case of failure, the output is similar to:
Fri, 26 Jun 2015 15:36:13 -0700 Fri, 26 Jun 2015 15:39:13 -0700 19 {kubelet node-i2hq} spec.containers{uses-private-image} failed Failed to pull image "user/privaterepo:v1": Error: image user/privaterepo:v1 not found
You must ensure all nodes in the cluster have the same .docker/config.json
. Otherwise, pods will run on
some nodes and fail to run on others. For example, if you use node autoscaling, then each instance
template needs to include the .docker/config.json
or mount a drive that contains it.
All pods will have read access to images in any private registry once private
registry keys are added to the .docker/config.json
.
Pre-pulled images
Note: This approach is suitable if you can control node configuration. It will not work reliably if your cloud provider manages nodes and replaces them automatically.
By default, the kubelet tries to pull each image from the specified registry.
However, if the imagePullPolicy
property of the container is set to IfNotPresent
or Never
,
then a local image is used (preferentially or exclusively, respectively).
If you want to rely on pre-pulled images as a substitute for registry authentication, you must ensure all nodes in the cluster have the same pre-pulled images.
This can be used to preload certain images for speed or as an alternative to authenticating to a private registry.
All pods will have read access to any pre-pulled images.
Specifying imagePullSecrets on a Pod
Note: This is the recommended approach to run containers based on images in private registries.
Kubernetes supports specifying container image registry keys on a Pod.
Creating a Secret with a Docker config
Run the following command, substituting the appropriate uppercase values:
kubectl create secret docker-registry <name> --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
If you already have a Docker credentials file then, rather than using the above
command, you can import the credentials file as a Kubernetes
Secrets.
Create a Secret based on existing Docker credentials explains how to set this up.
This is particularly useful if you are using multiple private container
registries, as kubectl create secret docker-registry
creates a Secret that
only works with a single private registry.
Note: Pods can only reference image pull secrets in their own namespace, so this process needs to be done one time per namespace.
Referring to an imagePullSecrets on a Pod
Now, you can create pods which reference that secret by adding an imagePullSecrets
section to a Pod definition.
For example:
cat <<EOF > pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: foo
namespace: awesomeapps
spec:
containers:
- name: foo
image: janedoe/awesomeapp:v1
imagePullSecrets:
- name: myregistrykey
EOF
cat <<EOF >> ./kustomization.yaml
resources:
- pod.yaml
EOF
This needs to be done for each pod that is using a private registry.
However, setting of this field can be automated by setting the imagePullSecrets in a ServiceAccount resource.
Check Add ImagePullSecrets to a Service Account for detailed instructions.
You can use this in conjunction with a per-node .docker/config.json
. The credentials
will be merged.
Use cases
There are a number of solutions for configuring private registries. Here are some common use cases and suggested solutions.
- Cluster running only non-proprietary (e.g. open-source) images. No need to hide images.
- Use public images on the Docker hub.
- No configuration required.
- Some cloud providers automatically cache or mirror public images, which improves availability and reduces the time to pull images.
- Use public images on the Docker hub.
- Cluster running some proprietary images which should be hidden to those outside the company, but
visible to all cluster users.
- Use a hosted private Docker registry.
- It may be hosted on the Docker Hub, or elsewhere.
- Manually configure .docker/config.json on each node as described above.
- Or, run an internal private registry behind your firewall with open read access.
- No Kubernetes configuration is required.
- Use a hosted container image registry service that controls image access
- It will work better with cluster autoscaling than manual node configuration.
- Or, on a cluster where changing the node configuration is inconvenient, use
imagePullSecrets
.
- Use a hosted private Docker registry.
- Cluster with proprietary images, a few of which require stricter access control.
- Ensure AlwaysPullImages admission controller is active. Otherwise, all Pods potentially have access to all images.
- Move sensitive data into a "Secret" resource, instead of packaging it in an image.
- A multi-tenant cluster where each tenant needs own private registry.
- Ensure AlwaysPullImages admission controller is active. Otherwise, all Pods of all tenants potentially have access to all images.
- Run a private registry with authorization required.
- Generate registry credential for each tenant, put into secret, and populate secret to each tenant namespace.
- The tenant adds that secret to imagePullSecrets of each namespace.
If you need access to multiple registries, you can create one secret for each registry.
Kubelet will merge any imagePullSecrets
into a single virtual .docker/config.json
What's next
- Read the OCI Image Manifest Specification
3.2 - Container Environment
This page describes the resources available to Containers in the Container environment.
Container environment
The Kubernetes Container environment provides several important resources to Containers:
- A filesystem, which is a combination of an image and one or more volumes.
- Information about the Container itself.
- Information about other objects in the cluster.
Container information
The hostname of a Container is the name of the Pod in which the Container is running.
It is available through the hostname
command or the
gethostname
function call in libc.
The Pod name and namespace are available as environment variables through the downward API.
User defined environment variables from the Pod definition are also available to the Container, as are any environment variables specified statically in the Docker image.
Cluster information
A list of all services that were running when a Container was created is available to that Container as environment variables. This list is limited to services within the same namespace as the new Container's Pod and Kubernetes control plane services. Those environment variables match the syntax of Docker links.
For a service named foo that maps to a Container named bar, the following variables are defined:
FOO_SERVICE_HOST=<the host the service is running on>
FOO_SERVICE_PORT=<the port the service is running on>
Services have dedicated IP addresses and are available to the Container via DNS, if DNS addon is enabled.
What's next
- Learn more about Container lifecycle hooks.
- Get hands-on experience attaching handlers to Container lifecycle events.
3.3 - Runtime Class
Kubernetes v1.20 [stable]
This page describes the RuntimeClass resource and runtime selection mechanism.
RuntimeClass is a feature for selecting the container runtime configuration. The container runtime configuration is used to run a Pod's containers.
Motivation
You can set a different RuntimeClass between different Pods to provide a balance of performance versus security. For example, if part of your workload deserves a high level of information security assurance, you might choose to schedule those Pods so that they run in a container runtime that uses hardware virtualization. You'd then benefit from the extra isolation of the alternative runtime, at the expense of some additional overhead.
You can also use RuntimeClass to run different Pods with the same container runtime but with different settings.
Setup
- Configure the CRI implementation on nodes (runtime dependent)
- Create the corresponding RuntimeClass resources
1. Configure the CRI implementation on nodes
The configurations available through RuntimeClass are Container Runtime Interface (CRI) implementation dependent. See the corresponding documentation (below) for your CRI implementation for how to configure.
Note: RuntimeClass assumes a homogeneous node configuration across the cluster by default (which means that all nodes are configured the same way with respect to container runtimes). To support heterogeneous node configurations, see Scheduling below.
The configurations have a corresponding handler
name, referenced by the RuntimeClass. The
handler must be a valid DNS 1123 label (alpha-numeric + -
characters).
2. Create the corresponding RuntimeClass resources
The configurations setup in step 1 should each have an associated handler
name, which identifies
the configuration. For each handler, create a corresponding RuntimeClass object.
The RuntimeClass resource currently only has 2 significant fields: the RuntimeClass name
(metadata.name
) and the handler (handler
). The object definition looks like this:
apiVersion: node.k8s.io/v1 # RuntimeClass is defined in the node.k8s.io API group
kind: RuntimeClass
metadata:
name: myclass # The name the RuntimeClass will be referenced by
# RuntimeClass is a non-namespaced resource
handler: myconfiguration # The name of the corresponding CRI configuration
The name of a RuntimeClass object must be a valid DNS subdomain name.
Note: It is recommended that RuntimeClass write operations (create/update/patch/delete) be restricted to the cluster administrator. This is typically the default. See Authorization Overview for more details.
Usage
Once RuntimeClasses are configured for the cluster, using them is very simple. Specify a
runtimeClassName
in the Pod spec. For example:
apiVersion: v1
kind: Pod
metadata:
name: mypod
spec:
runtimeClassName: myclass
# ...
This will instruct the kubelet to use the named RuntimeClass to run this pod. If the named
RuntimeClass does not exist, or the CRI cannot run the corresponding handler, the pod will enter the
Failed
terminal phase. Look for a
corresponding event for an
error message.
If no runtimeClassName
is specified, the default RuntimeHandler will be used, which is equivalent
to the behavior when the RuntimeClass feature is disabled.
CRI Configuration
For more details on setting up CRI runtimes, see CRI installation.
dockershim
RuntimeClasses with dockershim must set the runtime handler to docker
. Dockershim does not support
custom configurable runtime handlers.
containerd
Runtime handlers are configured through containerd's configuration at
/etc/containerd/config.toml
. Valid handlers are configured under the runtimes section:
[plugins.cri.containerd.runtimes.${HANDLER_NAME}]
See containerd's config documentation for more details: https://github.com/containerd/cri/blob/master/docs/config.md
CRI-O
Runtime handlers are configured through CRI-O's configuration at /etc/crio/crio.conf
. Valid
handlers are configured under the crio.runtime
table:
[crio.runtime.runtimes.${HANDLER_NAME}]
runtime_path = "${PATH_TO_BINARY}"
See CRI-O's config documentation for more details.
Scheduling
Kubernetes v1.16 [beta]
By specifying the scheduling
field for a RuntimeClass, you can set constraints to
ensure that Pods running with this RuntimeClass are scheduled to nodes that support it.
If scheduling
is not set, this RuntimeClass is assumed to be supported by all nodes.
To ensure pods land on nodes supporting a specific RuntimeClass, that set of nodes should have a
common label which is then selected by the runtimeclass.scheduling.nodeSelector
field. The
RuntimeClass's nodeSelector is merged with the pod's nodeSelector in admission, effectively taking
the intersection of the set of nodes selected by each. If there is a conflict, the pod will be
rejected.
If the supported nodes are tainted to prevent other RuntimeClass pods from running on the node, you
can add tolerations
to the RuntimeClass. As with the nodeSelector
, the tolerations are merged
with the pod's tolerations in admission, effectively taking the union of the set of nodes tolerated
by each.
To learn more about configuring the node selector and tolerations, see Assigning Pods to Nodes.
Pod Overhead
Kubernetes v1.18 [beta]
You can specify overhead resources that are associated with running a Pod. Declaring overhead allows the cluster (including the scheduler) to account for it when making decisions about Pods and resources. To use Pod overhead, you must have the PodOverhead feature gate enabled (it is on by default).
Pod overhead is defined in RuntimeClass through the overhead
fields. Through the use of these fields,
you can specify the overhead of running pods utilizing this RuntimeClass and ensure these overheads
are accounted for in Kubernetes.
What's next
- RuntimeClass Design
- RuntimeClass Scheduling Design
- Read about the Pod Overhead concept
- PodOverhead Feature Design
3.4 - Container Lifecycle Hooks
This page describes how kubelet managed Containers can use the Container lifecycle hook framework to run code triggered by events during their management lifecycle.
Overview
Analogous to many programming language frameworks that have component lifecycle hooks, such as Angular, Kubernetes provides Containers with lifecycle hooks. The hooks enable Containers to be aware of events in their management lifecycle and run code implemented in a handler when the corresponding lifecycle hook is executed.
Container hooks
There are two hooks that are exposed to Containers:
PostStart
This hook is executed immediately after a container is created. However, there is no guarantee that the hook will execute before the container ENTRYPOINT. No parameters are passed to the handler.
PreStop
This hook is called immediately before a container is terminated due to an API request or management
event such as a liveness/startup probe failure, preemption, resource contention and others. A call
to the PreStop
hook fails if the container is already in a terminated or completed state and the
hook must complete before the TERM signal to stop the container can be sent. The Pod's termination
grace period countdown begins before the PreStop
hook is executed, so regardless of the outcome of
the handler, the container will eventually terminate within the Pod's termination grace period. No
parameters are passed to the handler.
A more detailed description of the termination behavior can be found in Termination of Pods.
Hook handler implementations
Containers can access a hook by implementing and registering a handler for that hook. There are two types of hook handlers that can be implemented for Containers:
- Exec - Executes a specific command, such as
pre-stop.sh
, inside the cgroups and namespaces of the Container. Resources consumed by the command are counted against the Container. - HTTP - Executes an HTTP request against a specific endpoint on the Container.
Hook handler execution
When a Container lifecycle management hook is called,
the Kubernetes management system execute the handler according to the hook action,
httpGet
and tcpSocket
are executed by the kubelet process, and exec
is executed in the container.
Hook handler calls are synchronous within the context of the Pod containing the Container.
This means that for a PostStart
hook,
the Container ENTRYPOINT and hook fire asynchronously.
However, if the hook takes too long to run or hangs,
the Container cannot reach a running
state.
PreStop
hooks are not executed asynchronously from the signal to stop the Container; the hook must
complete its execution before the TERM signal can be sent. If a PreStop
hook hangs during
execution, the Pod's phase will be Terminating
and remain there until the Pod is killed after its
terminationGracePeriodSeconds
expires. This grace period applies to the total time it takes for
both the PreStop
hook to execute and for the Container to stop normally. If, for example,
terminationGracePeriodSeconds
is 60, and the hook takes 55 seconds to complete, and the Container
takes 10 seconds to stop normally after receiving the signal, then the Container will be killed
before it can stop normally, since terminationGracePeriodSeconds
is less than the total time
(55+10) it takes for these two things to happen.
If either a PostStart
or PreStop
hook fails,
it kills the Container.
Users should make their hook handlers as lightweight as possible. There are cases, however, when long running commands make sense, such as when saving state prior to stopping a Container.
Hook delivery guarantees
Hook delivery is intended to be at least once,
which means that a hook may be called multiple times for any given event,
such as for PostStart
or PreStop
.
It is up to the hook implementation to handle this correctly.
Generally, only single deliveries are made. If, for example, an HTTP hook receiver is down and is unable to take traffic, there is no attempt to resend. In some rare cases, however, double delivery may occur. For instance, if a kubelet restarts in the middle of sending a hook, the hook might be resent after the kubelet comes back up.
Debugging Hook handlers
The logs for a Hook handler are not exposed in Pod events.
If a handler fails for some reason, it broadcasts an event.
For PostStart
, this is the FailedPostStartHook
event,
and for PreStop
, this is the FailedPreStopHook
event.
You can see these events by running kubectl describe pod <pod_name>
.
Here is some example output of events from running this command:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1m 1m 1 {default-scheduler } Normal Scheduled Successfully assigned test-1730497541-cq1d2 to gke-test-cluster-default-pool-a07e5d30-siqd
1m 1m 1 {kubelet gke-test-cluster-default-pool-a07e5d30-siqd} spec.containers{main} Normal Pulling pulling image "test:1.0"
1m 1m 1 {kubelet gke-test-cluster-default-pool-a07e5d30-siqd} spec.containers{main} Normal Created Created container with docker id 5c6a256a2567; Security:[seccomp=unconfined]
1m 1m 1 {kubelet gke-test-cluster-default-pool-a07e5d30-siqd} spec.containers{main} Normal Pulled Successfully pulled image "test:1.0"
1m 1m 1 {kubelet gke-test-cluster-default-pool-a07e5d30-siqd} spec.containers{main} Normal Started Started container with docker id 5c6a256a2567
38s 38s 1 {kubelet gke-test-cluster-default-pool-a07e5d30-siqd} spec.containers{main} Normal Killing Killing container with docker id 5c6a256a2567: PostStart handler: Error executing in Docker Container: 1
37s 37s 1 {kubelet gke-test-cluster-default-pool-a07e5d30-siqd} spec.containers{main} Normal Killing Killing container with docker id 8df9fdfd7054: PostStart handler: Error executing in Docker Container: 1
38s 37s 2 {kubelet gke-test-cluster-default-pool-a07e5d30-siqd} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "main" with RunContainerError: "PostStart handler: Error executing in Docker Container: 1"
1m 22s 2 {kubelet gke-test-cluster-default-pool-a07e5d30-siqd} spec.containers{main} Warning FailedPostStartHook
What's next
- Learn more about the Container environment.
- Get hands-on experience attaching handlers to Container lifecycle events.
4 - Workloads
A workload is an application running on Kubernetes.
Whether your workload is a single component or several that work together, on Kubernetes you run
it inside a set of pods.
In Kubernetes, a Pod
represents a set of running
containers on your cluster.
Kubernetes pods have a defined lifecycle.
For example, once a pod is running in your cluster then a critical fault on the
node where that pod is running means that
all the pods on that node fail. Kubernetes treats that level of failure as final: you
would need to create a new Pod
to recover, even if the node later becomes healthy.
However, to make life considerably easier, you don't need to manage each Pod
directly.
Instead, you can use workload resources that manage a set of pods on your behalf.
These resources configure controllers
that make sure the right number of the right kind of pod are running, to match the state
you specified.
Kubernetes provides several built-in workload resources:
Deployment
andReplicaSet
(replacing the legacy resource ReplicationController).Deployment
is a good fit for managing a stateless application workload on your cluster, where anyPod
in theDeployment
is interchangeable and can be replaced if needed.StatefulSet
lets you run one or more related Pods that do track state somehow. For example, if your workload records data persistently, you can run aStatefulSet
that matches eachPod
with aPersistentVolume
. Your code, running in thePods
for thatStatefulSet
, can replicate data to otherPods
in the sameStatefulSet
to improve overall resilience.DaemonSet
definesPods
that provide node-local facilities. These might be fundamental to the operation of your cluster, such as a networking helper tool, or be part of an add-on.
Every time you add a node to your cluster that matches the specification in aDaemonSet
, the control plane schedules aPod
for thatDaemonSet
onto the new node.Job
andCronJob
define tasks that run to completion and then stop. Jobs represent one-off tasks, whereasCronJobs
recur according to a schedule.
In the wider Kubernetes ecosystem, you can find third-party workload resources that provide
additional behaviors. Using a
custom resource definition,
you can add in a third-party workload resource if you want a specific behavior that's not part
of Kubernetes' core. For example, if you wanted to run a group of Pods
for your application but
stop work unless all the Pods are available (perhaps for some high-throughput distributed task),
then you can implement or install an extension that does provide that feature.
What's next
As well as reading about each resource, you can learn about specific tasks that relate to them:
- Run a stateless application using a
Deployment
- Run a stateful application either as a single instance or as a replicated set
- Run automated tasks with a
CronJob
To learn about Kubernetes' mechanisms for separating code from configuration, visit Configuration.
There are two supporting concepts that provide backgrounds about how Kubernetes manages pods for applications:
- Garbage collection tidies up objects from your cluster after their owning resource has been removed.
- The time-to-live after finished controller removes Jobs once a defined time has passed since they completed.
Once your application is running, you might want to make it available on the internet as
a Service
or, for web application only,
using an Ingress
.
4.1 - Pods
Pods are the smallest deployable units of computing that you can create and manage in Kubernetes.
A Pod (as in a pod of whales or pea pod) is a group of one or more containers, with shared storage and network resources, and a specification for how to run the containers. A Pod's contents are always co-located and co-scheduled, and run in a shared context. A Pod models an application-specific "logical host": it contains one or more application containers which are relatively tightly coupled. In non-cloud contexts, applications executed on the same physical or virtual machine are analogous to cloud applications executed on the same logical host.
As well as application containers, a Pod can contain init containers that run during Pod startup. You can also inject ephemeral containers for debugging if your cluster offers this.
What is a Pod?
Note: While Kubernetes supports more container runtimes than just Docker, Docker is the most commonly known runtime, and it helps to describe Pods using some terminology from Docker.
The shared context of a Pod is a set of Linux namespaces, cgroups, and potentially other facets of isolation - the same things that isolate a Docker container. Within a Pod's context, the individual applications may have further sub-isolations applied.
In terms of Docker concepts, a Pod is similar to a group of Docker containers with shared namespaces and shared filesystem volumes.
Using Pods
Usually you don't need to create Pods directly, even singleton Pods. Instead, create them using workload resources such as Deployment or Job. If your Pods need to track state, consider the StatefulSet resource.
Pods in a Kubernetes cluster are used in two main ways:
Pods that run a single container. The "one-container-per-Pod" model is the most common Kubernetes use case; in this case, you can think of a Pod as a wrapper around a single container; Kubernetes manages Pods rather than managing the containers directly.
Pods that run multiple containers that need to work together. A Pod can encapsulate an application composed of multiple co-located containers that are tightly coupled and need to share resources. These co-located containers form a single cohesive unit of service—for example, one container serving data stored in a shared volume to the public, while a separate sidecar container refreshes or updates those files. The Pod wraps these containers, storage resources, and an ephemeral network identity together as a single unit.
Note: Grouping multiple co-located and co-managed containers in a single Pod is a relatively advanced use case. You should use this pattern only in specific instances in which your containers are tightly coupled.
Each Pod is meant to run a single instance of a given application. If you want to scale your application horizontally (to provide more overall resources by running more instances), you should use multiple Pods, one for each instance. In Kubernetes, this is typically referred to as replication. Replicated Pods are usually created and managed as a group by a workload resource and its controller.
See Pods and controllers for more information on how Kubernetes uses workload resources, and their controllers, to implement application scaling and auto-healing.
How Pods manage multiple containers
Pods are designed to support multiple cooperating processes (as containers) that form a cohesive unit of service. The containers in a Pod are automatically co-located and co-scheduled on the same physical or virtual machine in the cluster. The containers can share resources and dependencies, communicate with one another, and coordinate when and how they are terminated.
For example, you might have a container that acts as a web server for files in a shared volume, and a separate "sidecar" container that updates those files from a remote source, as in the following diagram:
Some Pods have init containers as well as app containers. Init containers run and complete before the app containers are started.
Pods natively provide two kinds of shared resources for their constituent containers: networking and storage.
Working with Pods
You'll rarely create individual Pods directly in Kubernetes—even singleton Pods. This is because Pods are designed as relatively ephemeral, disposable entities. When a Pod gets created (directly by you, or indirectly by a controller), the new Pod is scheduled to run on a Node in your cluster. The Pod remains on that node until the Pod finishes execution, the Pod object is deleted, the Pod is evicted for lack of resources, or the node fails.
Note: Restarting a container in a Pod should not be confused with restarting a Pod. A Pod is not a process, but an environment for running container(s). A Pod persists until it is deleted.
When you create the manifest for a Pod object, make sure the name specified is a valid DNS subdomain name.
Pods and controllers
You can use workload resources to create and manage multiple Pods for you. A controller for the resource handles replication and rollout and automatic healing in case of Pod failure. For example, if a Node fails, a controller notices that Pods on that Node have stopped working and creates a replacement Pod. The scheduler places the replacement Pod onto a healthy Node.
Here are some examples of workload resources that manage one or more Pods:
Pod templates
Controllers for workload resources create Pods from a pod template and manage those Pods on your behalf.
PodTemplates are specifications for creating Pods, and are included in workload resources such as Deployments, Jobs, and DaemonSets.
Each controller for a workload resource uses the PodTemplate
inside the workload
object to make actual Pods. The PodTemplate
is part of the desired state of whatever
workload resource you used to run your app.
The sample below is a manifest for a simple Job with a template
that starts one
container. The container in that Pod prints a message then pauses.
apiVersion: batch/v1
kind: Job
metadata:
name: hello
spec:
template:
# This is the pod template
spec:
containers:
- name: hello
image: busybox
command: ['sh', '-c', 'echo "Hello, Kubernetes!" && sleep 3600']
restartPolicy: OnFailure
# The pod template ends here
Modifying the pod template or switching to a new pod template has no direct effect on the Pods that already exist. If you change the pod template for a workload resource, that resource needs to create replacement Pods that use the updated template.
For example, the StatefulSet controller ensures that the running Pods match the current pod template for each StatefulSet object. If you edit the StatefulSet to change its pod template, the StatefulSet starts to create new Pods based on the updated template. Eventually, all of the old Pods are replaced with new Pods, and the update is complete.
Each workload resource implements its own rules for handling changes to the Pod template. If you want to read more about StatefulSet specifically, read Update strategy in the StatefulSet Basics tutorial.
On Nodes, the kubelet does not directly observe or manage any of the details around pod templates and updates; those details are abstracted away. That abstraction and separation of concerns simplifies system semantics, and makes it feasible to extend the cluster's behavior without changing existing code.
Pod update and replacement
As mentioned in the previous section, when the Pod template for a workload resource is changed, the controller creates new Pods based on the updated template instead of updating or patching the existing Pods.
Kubernetes doesn't prevent you from managing Pods directly. It is possible to
update some fields of a running Pod, in place. However, Pod update operations
like
patch
, and
replace
have some limitations:
Most of the metadata about a Pod is immutable. For example, you cannot change the
namespace
,name
,uid
, orcreationTimestamp
fields; thegeneration
field is unique. It only accepts updates that increment the field's current value.If the
metadata.deletionTimestamp
is set, no new entry can be added to themetadata.finalizers
list.Pod updates may not change fields other than
spec.containers[*].image
,spec.initContainers[*].image
,spec.activeDeadlineSeconds
orspec.tolerations
. Forspec.tolerations
, you can only add new entries.When updating the
spec.activeDeadlineSeconds
field, two types of updates are allowed:- setting the unassigned field to a positive number;
- updating the field from a positive number to a smaller, non-negative number.
Resource sharing and communication
Pods enable data sharing and communication among their constituent containers.
Storage in Pods
A Pod can specify a set of shared storage volumes. All containers in the Pod can access the shared volumes, allowing those containers to share data. Volumes also allow persistent data in a Pod to survive in case one of the containers within needs to be restarted. See Storage for more information on how Kubernetes implements shared storage and makes it available to Pods.
Pod networking
Each Pod is assigned a unique IP address for each address family. Every
container in a Pod shares the network namespace, including the IP address and
network ports. Inside a Pod (and only then), the containers that belong to the Pod
can communicate with one another using localhost
. When containers in a Pod communicate
with entities outside the Pod,
they must coordinate how they use the shared network resources (such as ports).
Within a Pod, containers share an IP address and port space, and
can find each other via localhost
. The containers in a Pod can also communicate
with each other using standard inter-process communications like SystemV semaphores
or POSIX shared memory. Containers in different Pods have distinct IP addresses
and can not communicate by IPC without
special configuration.
Containers that want to interact with a container running in a different Pod can
use IP networking to communicate.
Containers within the Pod see the system hostname as being the same as the configured
name
for the Pod. There's more about this in the networking
section.
Privileged mode for containers
Any container in a Pod can enable privileged mode, using the privileged
flag on the security context of the container spec. This is useful for containers that want to use operating system administrative capabilities such as manipulating the network stack or accessing hardware devices.
Processes within a privileged container get almost the same privileges that are available to processes outside a container.
Note: Your container runtime must support the concept of a privileged container for this setting to be relevant.
Static Pods
Static Pods are managed directly by the kubelet daemon on a specific node, without the API server observing them. Whereas most Pods are managed by the control plane (for example, a Deployment), for static Pods, the kubelet directly supervises each static Pod (and restarts it if it fails).
Static Pods are always bound to one Kubelet on a specific node. The main use for static Pods is to run a self-hosted control plane: in other words, using the kubelet to supervise the individual control plane components.
The kubelet automatically tries to create a mirror Pod on the Kubernetes API server for each static Pod. This means that the Pods running on a node are visible on the API server, but cannot be controlled from there.
What's next
- Learn about the lifecycle of a Pod.
- Learn about RuntimeClass and how you can use it to configure different Pods with different container runtime configurations.
- Read about Pod topology spread constraints.
- Read about PodDisruptionBudget and how you can use it to manage application availability during disruptions.
- Pod is a top-level resource in the Kubernetes REST API. The Pod object definition describes the object in detail.
- The Distributed System Toolkit: Patterns for Composite Containers explains common layouts for Pods with more than one container.
To understand the context for why Kubernetes wraps a common Pod API in other resources (such as StatefulSets or Deployments), you can read about the prior art, including:
4.1.1 - Pod Lifecycle
This page describes the lifecycle of a Pod. Pods follow a defined lifecycle, starting
in the Pending
phase, moving through Running
if at least one
of its primary containers starts OK, and then through either the Succeeded
or
Failed
phases depending on whether any container in the Pod terminated in failure.
Whilst a Pod is running, the kubelet is able to restart containers to handle some kind of faults. Within a Pod, Kubernetes tracks different container states and determines what action to take to make the Pod healthy again.
In the Kubernetes API, Pods have both a specification and an actual status. The status for a Pod object consists of a set of Pod conditions. You can also inject custom readiness information into the condition data for a Pod, if that is useful to your application.
Pods are only scheduled once in their lifetime. Once a Pod is scheduled (assigned) to a Node, the Pod runs on that Node until it stops or is terminated.
Pod lifetime
Like individual application containers, Pods are considered to be relatively ephemeral (rather than durable) entities. Pods are created, assigned a unique ID (UID), and scheduled to nodes where they remain until termination (according to restart policy) or deletion. If a Node dies, the Pods scheduled to that node are scheduled for deletion after a timeout period.
Pods do not, by themselves, self-heal. If a Pod is scheduled to a node that then fails, the Pod is deleted; likewise, a Pod won't survive an eviction due to a lack of resources or Node maintenance. Kubernetes uses a higher-level abstraction, called a controller, that handles the work of managing the relatively disposable Pod instances.
A given Pod (as defined by a UID) is never "rescheduled" to a different node; instead, that Pod can be replaced by a new, near-identical Pod, with even the same name if desired, but with a different UID.
When something is said to have the same lifetime as a Pod, such as a volume, that means that the thing exists as long as that specific Pod (with that exact UID) exists. If that Pod is deleted for any reason, and even if an identical replacement is created, the related thing (a volume, in this example) is also destroyed and created anew.
Pod diagram
A multi-container Pod that contains a file puller and a web server that uses a persistent volume for shared storage between the containers.
Pod phase
A Pod's status
field is a
PodStatus
object, which has a phase
field.
The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle. The phase is not intended to be a comprehensive rollup of observations of container or Pod state, nor is it intended to be a comprehensive state machine.
The number and meanings of Pod phase values are tightly guarded.
Other than what is documented here, nothing should be assumed about Pods that
have a given phase
value.
Here are the possible values for phase
:
Value | Description |
---|---|
Pending | The Pod has been accepted by the Kubernetes cluster, but one or more of the containers has not been set up and made ready to run. This includes time a Pod spends waiting to be scheduled as well as the time spent downloading container images over the network. |
Running | The Pod has been bound to a node, and all of the containers have been created. At least one container is still running, or is in the process of starting or restarting. |
Succeeded | All containers in the Pod have terminated in success, and will not be restarted. |
Failed | All containers in the Pod have terminated, and at least one container has terminated in failure. That is, the container either exited with non-zero status or was terminated by the system. |
Unknown | For some reason the state of the Pod could not be obtained. This phase typically occurs due to an error in communicating with the node where the Pod should be running. |
Note: When a Pod is being deleted, it is shown asTerminating
by some kubectl commands. ThisTerminating
status is not one of the Pod phases. A Pod is granted a term to terminate gracefully, which defaults to 30 seconds. You can use the flag--force
to terminate a Pod by force.
If a node dies or is disconnected from the rest of the cluster, Kubernetes
applies a policy for setting the phase
of all Pods on the lost node to Failed.
Container states
As well as the phase of the Pod overall, Kubernetes tracks the state of each container inside a Pod. You can use container lifecycle hooks to trigger events to run at certain points in a container's lifecycle.
Once the scheduler
assigns a Pod to a Node, the kubelet starts creating containers for that Pod
using a container runtime.
There are three possible container states: Waiting
, Running
, and Terminated
.
To check the state of a Pod's containers, you can use
kubectl describe pod <name-of-pod>
. The output shows the state for each container
within that Pod.
Each state has a specific meaning:
Waiting
If a container is not in either the Running
or Terminated
state, it is Waiting
.
A container in the Waiting
state is still running the operations it requires in
order to complete start up: for example, pulling the container image from a container
image registry, or applying Secret
data.
When you use kubectl
to query a Pod with a container that is Waiting
, you also see
a Reason field to summarize why the container is in that state.
Running
The Running
status indicates that a container is executing without issues. If there
was a postStart
hook configured, it has already executed and finished. When you use
kubectl
to query a Pod with a container that is Running
, you also see information
about when the container entered the Running
state.
Terminated
A container in the Terminated
state began execution and then either ran to
completion or failed for some reason. When you use kubectl
to query a Pod with
a container that is Terminated
, you see a reason, an exit code, and the start and
finish time for that container's period of execution.
If a container has a preStop
hook configured, that runs before the container enters
the Terminated
state.
Container restart policy
The spec
of a Pod has a restartPolicy
field with possible values Always, OnFailure,
and Never. The default value is Always.
The restartPolicy
applies to all containers in the Pod. restartPolicy
only
refers to restarts of the containers by the kubelet on the same node. After containers
in a Pod exit, the kubelet restarts them with an exponential back-off delay (10s, 20s,
40s, …), that is capped at five minutes. Once a container has executed for 10 minutes
without any problems, the kubelet resets the restart backoff timer for that container.
Pod conditions
A Pod has a PodStatus, which has an array of PodConditions through which the Pod has or has not passed:
PodScheduled
: the Pod has been scheduled to a node.ContainersReady
: all containers in the Pod are ready.Initialized
: all init containers have started successfully.Ready
: the Pod is able to serve requests and should be added to the load balancing pools of all matching Services.
Field name | Description |
---|---|
type | Name of this Pod condition. |
status | Indicates whether that condition is applicable, with possible values "True ", "False ", or "Unknown ". |
lastProbeTime | Timestamp of when the Pod condition was last probed. |
lastTransitionTime | Timestamp for when the Pod last transitioned from one status to another. |
reason | Machine-readable, UpperCamelCase text indicating the reason for the condition's last transition. |
message | Human-readable message indicating details about the last status transition. |
Pod readiness
Kubernetes v1.14 [stable]
Your application can inject extra feedback or signals into PodStatus:
Pod readiness. To use this, set readinessGates
in the Pod's spec
to
specify a list of additional conditions that the kubelet evaluates for Pod readiness.
Readiness gates are determined by the current state of status.condition
fields for the Pod. If Kubernetes cannot find such a condition in the
status.conditions
field of a Pod, the status of the condition
is defaulted to "False
".
Here is an example:
kind: Pod
...
spec:
readinessGates:
- conditionType: "www.example.com/feature-1"
status:
conditions:
- type: Ready # a built in PodCondition
status: "False"
lastProbeTime: null
lastTransitionTime: 2018-01-01T00:00:00Z
- type: "www.example.com/feature-1" # an extra PodCondition
status: "False"
lastProbeTime: null
lastTransitionTime: 2018-01-01T00:00:00Z
containerStatuses:
- containerID: docker://abcd...
ready: true
...
The Pod conditions you add must have names that meet the Kubernetes label key format.
Status for Pod readiness
The kubectl patch
command does not support patching object status.
To set these status.conditions
for the pod, applications and
operators should use
the PATCH
action.
You can use a Kubernetes client library to
write code that sets custom Pod conditions for Pod readiness.
For a Pod that uses custom conditions, that Pod is evaluated to be ready only when both the following statements apply:
- All containers in the Pod are ready.
- All conditions specified in
readinessGates
areTrue
.
When a Pod's containers are Ready but at least one custom condition is missing or
False
, the kubelet sets the Pod's condition to ContainersReady
.
Container probes
A Probe is a diagnostic performed periodically by the kubelet on a Container. To perform a diagnostic, the kubelet calls a Handler implemented by the container. There are three types of handlers:
ExecAction: Executes a specified command inside the container. The diagnostic is considered successful if the command exits with a status code of 0.
TCPSocketAction: Performs a TCP check against the Pod's IP address on a specified port. The diagnostic is considered successful if the port is open.
HTTPGetAction: Performs an HTTP
GET
request against the Pod's IP address on a specified port and path. The diagnostic is considered successful if the response has a status code greater than or equal to 200 and less than 400.
Each probe has one of three results:
Success
: The container passed the diagnostic.Failure
: The container failed the diagnostic.Unknown
: The diagnostic failed, so no action should be taken.
The kubelet can optionally perform and react to three kinds of probes on running containers:
livenessProbe
: Indicates whether the container is running. If the liveness probe fails, the kubelet kills the container, and the container is subjected to its restart policy. If a Container does not provide a liveness probe, the default state isSuccess
.readinessProbe
: Indicates whether the container is ready to respond to requests. If the readiness probe fails, the endpoints controller removes the Pod's IP address from the endpoints of all Services that match the Pod. The default state of readiness before the initial delay isFailure
. If a Container does not provide a readiness probe, the default state isSuccess
.startupProbe
: Indicates whether the application within the container is started. All other probes are disabled if a startup probe is provided, until it succeeds. If the startup probe fails, the kubelet kills the container, and the container is subjected to its restart policy. If a Container does not provide a startup probe, the default state isSuccess
.
For more information about how to set up a liveness, readiness, or startup probe, see Configure Liveness, Readiness and Startup Probes.
When should you use a liveness probe?
Kubernetes v1.0 [stable]
If the process in your container is able to crash on its own whenever it
encounters an issue or becomes unhealthy, you do not necessarily need a liveness
probe; the kubelet will automatically perform the correct action in accordance
with the Pod's restartPolicy
.
If you'd like your container to be killed and restarted if a probe fails, then
specify a liveness probe, and specify a restartPolicy
of Always or OnFailure.
When should you use a readiness probe?
Kubernetes v1.0 [stable]
If you'd like to start sending traffic to a Pod only when a probe succeeds, specify a readiness probe. In this case, the readiness probe might be the same as the liveness probe, but the existence of the readiness probe in the spec means that the Pod will start without receiving any traffic and only start receiving traffic after the probe starts succeeding. If your container needs to work on loading large data, configuration files, or migrations during startup, specify a readiness probe.
If you want your container to be able to take itself down for maintenance, you can specify a readiness probe that checks an endpoint specific to readiness that is different from the liveness probe.
Note: If you want to be able to drain requests when the Pod is deleted, you do not necessarily need a readiness probe; on deletion, the Pod automatically puts itself into an unready state regardless of whether the readiness probe exists. The Pod remains in the unready state while it waits for the containers in the Pod to stop.
When should you use a startup probe?
Kubernetes v1.20 [stable]
Startup probes are useful for Pods that have containers that take a long time to come into service. Rather than set a long liveness interval, you can configure a separate configuration for probing the container as it starts up, allowing a time longer than the liveness interval would allow.
If your container usually starts in more than
initialDelaySeconds + failureThreshold × periodSeconds
, you should specify a
startup probe that checks the same endpoint as the liveness probe. The default for
periodSeconds
is 10s. You should then set its failureThreshold
high enough to
allow the container to start, without changing the default values of the liveness
probe. This helps to protect against deadlocks.
Termination of Pods
Because Pods represent processes running on nodes in the cluster, it is important to
allow those processes to gracefully terminate when they are no longer needed (rather
than being abruptly stopped with a KILL
signal and having no chance to clean up).
The design aim is for you to be able to request deletion and know when processes terminate, but also be able to ensure that deletes eventually complete. When you request deletion of a Pod, the cluster records and tracks the intended grace period before the Pod is allowed to be forcefully killed. With that forceful shutdown tracking in place, the kubelet attempts graceful shutdown.
Typically, the container runtime sends a TERM signal to the main process in each
container. Many container runtimes respect the STOPSIGNAL
value defined in the container
image and send this instead of TERM.
Once the grace period has expired, the KILL signal is sent to any remaining
processes, and the Pod is then deleted from the
API Server. If the kubelet or the
container runtime's management service is restarted while waiting for processes to terminate, the
cluster retries from the start including the full original grace period.
An example flow:
- You use the
kubectl
tool to manually delete a specific Pod, with the default grace period (30 seconds). - The Pod in the API server is updated with the time beyond which the Pod is considered "dead"
along with the grace period.
If you use
kubectl describe
to check on the Pod you're deleting, that Pod shows up as "Terminating". On the node where the Pod is running: as soon as the kubelet sees that a Pod has been marked as terminating (a graceful shutdown duration has been set), the kubelet begins the local Pod shutdown process.- If one of the Pod's containers has defined a
preStop
hook, the kubelet runs that hook inside of the container. If thepreStop
hook is still running after the grace period expires, the kubelet requests a small, one-off grace period extension of 2 seconds.Note: If thepreStop
hook needs longer to complete than the default grace period allows, you must modifyterminationGracePeriodSeconds
to suit this. - The kubelet triggers the container runtime to send a TERM signal to process 1 inside each
container.Note: The containers in the Pod receive the TERM signal at different times and in an arbitrary order. If the order of shutdowns matters, consider using a
preStop
hook to synchronize.
- If one of the Pod's containers has defined a
- At the same time as the kubelet is starting graceful shutdown, the control plane removes that shutting-down Pod from Endpoints (and, if enabled, EndpointSlice) objects where these represent a Service with a configured selector. ReplicaSets and other workload resources no longer treat the shutting-down Pod as a valid, in-service replica. Pods that shut down slowly cannot continue to serve traffic as load balancers (like the service proxy) remove the Pod from the list of endpoints as soon as the termination grace period begins.
- When the grace period expires, the kubelet triggers forcible shutdown. The container runtime sends
SIGKILL
to any processes still running in any container in the Pod. The kubelet also cleans up a hiddenpause
container if that container runtime uses one. - The kubelet triggers forcible removal of Pod object from the API server, by setting grace period to 0 (immediate deletion).
- The API server deletes the Pod's API object, which is then no longer visible from any client.
Forced Pod termination
Caution: Forced deletions can be potentially disruptive for some workloads and their Pods.
By default, all deletes are graceful within 30 seconds. The kubectl delete
command supports
the --grace-period=<seconds>
option which allows you to override the default and specify your
own value.
Setting the grace period to 0
forcibly and immediately deletes the Pod from the API
server. If the pod was still running on a node, that forcible deletion triggers the kubelet to
begin immediate cleanup.
Note: You must specify an additional flag--force
along with--grace-period=0
in order to perform force deletions.
When a force deletion is performed, the API server does not wait for confirmation from the kubelet that the Pod has been terminated on the node it was running on. It removes the Pod in the API immediately so a new Pod can be created with the same name. On the node, Pods that are set to terminate immediately will still be given a small grace period before being force killed.
If you need to force-delete Pods that are part of a StatefulSet, refer to the task documentation for deleting Pods from a StatefulSet.
Garbage collection of failed Pods
For failed Pods, the API objects remain in the cluster's API until a human or controller process explicitly removes them.
The control plane cleans up terminated Pods (with a phase of Succeeded
or
Failed
), when the number of Pods exceeds the configured threshold
(determined by terminated-pod-gc-threshold
in the kube-controller-manager).
This avoids a resource leak as Pods are created and terminated over time.
What's next
Get hands-on experience attaching handlers to Container lifecycle events.
Get hands-on experience configuring Liveness, Readiness and Startup Probes.
Learn more about container lifecycle hooks.
For detailed information about Pod / Container status in the API, see PodStatus and ContainerStatus.
4.1.2 - Init Containers
This page provides an overview of init containers: specialized containers that run before app containers in a Pod. Init containers can contain utilities or setup scripts not present in an app image.
You can specify init containers in the Pod specification alongside the containers
array (which describes app containers).
Understanding init containers
A Pod can have multiple containers running apps within it, but it can also have one or more init containers, which are run before the app containers are started.
Init containers are exactly like regular containers, except:
- Init containers always run to completion.
- Each init container must complete successfully before the next one starts.
If a Pod's init container fails, the kubelet repeatedly restarts that init container until it succeeds.
However, if the Pod has a restartPolicy
of Never, and an init container fails during startup of that Pod, Kubernetes treats the overall Pod as failed.
To specify an init container for a Pod, add the initContainers
field into
the Pod specification, as an array of objects of type
Container,
alongside the app containers
array.
The status of the init containers is returned in .status.initContainerStatuses
field as an array of the container statuses (similar to the .status.containerStatuses
field).
Differences from regular containers
Init containers support all the fields and features of app containers, including resource limits, volumes, and security settings. However, the resource requests and limits for an init container are handled differently, as documented in Resources.
Also, init containers do not support lifecycle
, livenessProbe
, readinessProbe
, or
startupProbe
because they must run to completion before the Pod can be ready.
If you specify multiple init containers for a Pod, kubelet runs each init container sequentially. Each init container must succeed before the next can run. When all of the init containers have run to completion, kubelet initializes the application containers for the Pod and runs them as usual.
Using init containers
Because init containers have separate images from app containers, they have some advantages for start-up related code:
- Init containers can contain utilities or custom code for setup that are not present in an app
image. For example, there is no need to make an image
FROM
another image just to use a tool likesed
,awk
,python
, ordig
during setup. - The application image builder and deployer roles can work independently without the need to jointly build a single app image.
- Init containers can run with a different view of the filesystem than app containers in the same Pod. Consequently, they can be given access to Secrets that app containers cannot access.
- Because init containers run to completion before any app containers start, init containers offer a mechanism to block or delay app container startup until a set of preconditions are met. Once preconditions are met, all of the app containers in a Pod can start in parallel.
- Init containers can securely run utilities or custom code that would otherwise make an app container image less secure. By keeping unnecessary tools separate you can limit the attack surface of your app container image.
Examples
Here are some ideas for how to use init containers:
Wait for a Service to be created, using a shell one-line command like:
for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; done; exit 1
Register this Pod with a remote server from the downward API with a command like:
curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d 'instance=$(<POD_NAME>)&ip=$(<POD_IP>)'
Wait for some time before starting the app container with a command like
sleep 60
Clone a Git repository into a Volume
Place values into a configuration file and run a template tool to dynamically generate a configuration file for the main app container. For example, place the
POD_IP
value in a configuration and generate the main app configuration file using Jinja.
Init containers in use
This example defines a simple Pod that has two init containers.
The first waits for myservice
, and the second waits for mydb
. Once both
init containers complete, the Pod runs the app container from its spec
section.
apiVersion: v1
kind: Pod
metadata:
name: myapp-pod
labels:
app: myapp
spec:
containers:
- name: myapp-container
image: busybox:1.28
command: ['sh', '-c', 'echo The app is running! && sleep 3600']
initContainers:
- name: init-myservice
image: busybox:1.28
command: ['sh', '-c', "until nslookup myservice.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
- name: init-mydb
image: busybox:1.28
command: ['sh', '-c', "until nslookup mydb.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for mydb; sleep 2; done"]
You can start this Pod by running:
kubectl apply -f myapp.yaml
The output is similar to this:
pod/myapp-pod created
And check on its status with:
kubectl get -f myapp.yaml
The output is similar to this:
NAME READY STATUS RESTARTS AGE
myapp-pod 0/1 Init:0/2 0 6m
or for more details:
kubectl describe -f myapp.yaml
The output is similar to this:
Name: myapp-pod
Namespace: default
[...]
Labels: app=myapp
Status: Pending
[...]
Init Containers:
init-myservice:
[...]
State: Running
[...]
init-mydb:
[...]
State: Waiting
Reason: PodInitializing
Ready: False
[...]
Containers:
myapp-container:
[...]
State: Waiting
Reason: PodInitializing
Ready: False
[...]
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
16s 16s 1 {default-scheduler } Normal Scheduled Successfully assigned myapp-pod to 172.17.4.201
16s 16s 1 {kubelet 172.17.4.201} spec.initContainers{init-myservice} Normal Pulling pulling image "busybox"
13s 13s 1 {kubelet 172.17.4.201} spec.initContainers{init-myservice} Normal Pulled Successfully pulled image "busybox"
13s 13s 1 {kubelet 172.17.4.201} spec.initContainers{init-myservice} Normal Created Created container with docker id 5ced34a04634; Security:[seccomp=unconfined]
13s 13s 1 {kubelet 172.17.4.201} spec.initContainers{init-myservice} Normal Started Started container with docker id 5ced34a04634
To see logs for the init containers in this Pod, run:
kubectl logs myapp-pod -c init-myservice # Inspect the first init container
kubectl logs myapp-pod -c init-mydb # Inspect the second init container
At this point, those init containers will be waiting to discover Services named
mydb
and myservice
.
Here's a configuration you can use to make those Services appear:
---
apiVersion: v1
kind: Service
metadata:
name: myservice
spec:
ports:
- protocol: TCP
port: 80
targetPort: 9376
---
apiVersion: v1
kind: Service
metadata:
name: mydb
spec:
ports:
- protocol: TCP
port: 80
targetPort: 9377
To create the mydb
and myservice
services:
kubectl apply -f services.yaml
The output is similar to this:
service/myservice created
service/mydb created
You'll then see that those init containers complete, and that the myapp-pod
Pod moves into the Running state:
kubectl get -f myapp.yaml
The output is similar to this:
NAME READY STATUS RESTARTS AGE
myapp-pod 1/1 Running 0 9m
This simple example should provide some inspiration for you to create your own init containers. What's next contains a link to a more detailed example.
Detailed behavior
During Pod startup, the kubelet delays running init containers until the networking and storage are ready. Then the kubelet runs the Pod's init containers in the order they appear in the Pod's spec.
Each init container must exit successfully before
the next container starts. If a container fails to start due to the runtime or
exits with failure, it is retried according to the Pod restartPolicy
. However,
if the Pod restartPolicy
is set to Always, the init containers use
restartPolicy
OnFailure.
A Pod cannot be Ready
until all init containers have succeeded. The ports on an
init container are not aggregated under a Service. A Pod that is initializing
is in the Pending
state but should have a condition Initialized
set to false.
If the Pod restarts, or is restarted, all init containers must execute again.
Changes to the init container spec are limited to the container image field. Altering an init container image field is equivalent to restarting the Pod.
Because init containers can be restarted, retried, or re-executed, init container
code should be idempotent. In particular, code that writes to files on EmptyDirs
should be prepared for the possibility that an output file already exists.
Init containers have all of the fields of an app container. However, Kubernetes
prohibits readinessProbe
from being used because init containers cannot
define readiness distinct from completion. This is enforced during validation.
Use activeDeadlineSeconds
on the Pod and livenessProbe
on the container to
prevent init containers from failing forever. The active deadline includes init
containers.
The name of each app and init container in a Pod must be unique; a validation error is thrown for any container sharing a name with another.
Resources
Given the ordering and execution for init containers, the following rules for resource usage apply:
- The highest of any particular resource request or limit defined on all init containers is the effective init request/limit
- The Pod's effective request/limit for a resource is the higher of:
- the sum of all app containers request/limit for a resource
- the effective init request/limit for a resource
- Scheduling is done based on effective requests/limits, which means init containers can reserve resources for initialization that are not used during the life of the Pod.
- The QoS (quality of service) tier of the Pod's effective QoS tier is the QoS tier for init containers and app containers alike.
Quota and limits are applied based on the effective Pod request and limit.
Pod level control groups (cgroups) are based on the effective Pod request and limit, the same as the scheduler.
Pod restart reasons
A Pod can restart, causing re-execution of init containers, for the following reasons:
- The Pod infrastructure container is restarted. This is uncommon and would have to be done by someone with root access to nodes.
- All containers in a Pod are terminated while
restartPolicy
is set to Always, forcing a restart, and the init container completion record has been lost due to garbage collection.
The Pod will not be restarted when the init container image is changed, or the init container completion record has been lost due to garbage collection. This applies for Kubernetes v1.20 and later. If you are using an earlier version of Kubernetes, consult the documentation for the version you are using.
What's next
- Read about creating a Pod that has an init container
- Learn how to debug init containers
4.1.3 - Pod Topology Spread Constraints
Kubernetes v1.19 [stable]
You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.
Note: In versions of Kubernetes before v1.19, you must enable theEvenPodsSpread
feature gate on the API server and the scheduler in order to use Pod topology spread constraints.
Prerequisites
Node Labels
Topology spread constraints rely on node labels to identify the topology domain(s) that each Node is in. For example, a Node might have labels: node=node1,zone=us-east-1a,region=us-east-1
Suppose you have a 4-node cluster with the following labels:
NAME STATUS ROLES AGE VERSION LABELS
node1 Ready <none> 4m26s v1.16.0 node=node1,zone=zoneA
node2 Ready <none> 3m58s v1.16.0 node=node2,zone=zoneA
node3 Ready <none> 3m17s v1.16.0 node=node3,zone=zoneB
node4 Ready <none> 2m43s v1.16.0 node=node4,zone=zoneB
Then the cluster is logically viewed as below:
Instead of manually applying labels, you can also reuse the well-known labels that are created and populated automatically on most clusters.
Spread Constraints for Pods
API
The API field pod.spec.topologySpreadConstraints
is defined as below:
apiVersion: v1
kind: Pod
metadata:
name: mypod
spec:
topologySpreadConstraints:
- maxSkew: <integer>
topologyKey: <string>
whenUnsatisfiable: <string>
labelSelector: <object>
You can define one or multiple topologySpreadConstraint
to instruct the kube-scheduler how to place each incoming Pod in relation to the existing Pods across your cluster. The fields are:
- maxSkew describes the degree to which Pods may be unevenly distributed.
It's the maximum permitted difference between the number of matching Pods in
any two topology domains of a given topology type. It must be greater than
zero. Its semantics differs according to the value of
whenUnsatisfiable
:- when
whenUnsatisfiable
equals to "DoNotSchedule",maxSkew
is the maximum permitted difference between the number of matching pods in the target topology and the global minimum. - when
whenUnsatisfiable
equals to "ScheduleAnyway", scheduler gives higher precedence to topologies that would help reduce the skew.
- when
- topologyKey is the key of node labels. If two Nodes are labelled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology. The scheduler tries to place a balanced number of Pods into each topology domain.
- whenUnsatisfiable indicates how to deal with a Pod if it doesn't satisfy the spread constraint:
DoNotSchedule
(default) tells the scheduler not to schedule it.ScheduleAnyway
tells the scheduler to still schedule it while prioritizing nodes that minimize the skew.
- labelSelector is used to find matching Pods. Pods that match this label selector are counted to determine the number of Pods in their corresponding topology domain. See Label Selectors for more details.
You can read more about this field by running kubectl explain Pod.spec.topologySpreadConstraints
.
Example: One TopologySpreadConstraint
Suppose you have a 4-node cluster where 3 Pods labeled foo:bar
are located in node1, node2 and node3 respectively:
If we want an incoming Pod to be evenly spread with existing Pods across zones, the spec can be given as:
kind: Pod
apiVersion: v1
metadata:
name: mypod
labels:
foo: bar
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
containers:
- name: pause
image: k8s.gcr.io/pause:3.1
topologyKey: zone
implies the even distribution will only be applied to the nodes which have label pair "zone:<any value>" present. whenUnsatisfiable: DoNotSchedule
tells the scheduler to let it stay pending if the incoming Pod can't satisfy the constraint.
If the scheduler placed this incoming Pod into "zoneA", the Pods distribution would become [3, 1], hence the actual skew is 2 (3 - 1) - which violates maxSkew: 1
. In this example, the incoming Pod can only be placed onto "zoneB":
OR
You can tweak the Pod spec to meet various kinds of requirements:
- Change
maxSkew
to a bigger value like "2" so that the incoming Pod can be placed onto "zoneA" as well. - Change
topologyKey
to "node" so as to distribute the Pods evenly across nodes instead of zones. In the above example, ifmaxSkew
remains "1", the incoming Pod can only be placed onto "node4". - Change
whenUnsatisfiable: DoNotSchedule
towhenUnsatisfiable: ScheduleAnyway
to ensure the incoming Pod to be always schedulable (suppose other scheduling APIs are satisfied). However, it's preferred to be placed onto the topology domain which has fewer matching Pods. (Be aware that this preferability is jointly normalized with other internal scheduling priorities like resource usage ratio, etc.)
Example: Multiple TopologySpreadConstraints
This builds upon the previous example. Suppose you have a 4-node cluster where 3 Pods labeled foo:bar
are located in node1, node2 and node3 respectively:
You can use 2 TopologySpreadConstraints to control the Pods spreading on both zone and node:
kind: Pod
apiVersion: v1
metadata:
name: mypod
labels:
foo: bar
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
- maxSkew: 1
topologyKey: node
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
containers:
- name: pause
image: k8s.gcr.io/pause:3.1
In this case, to match the first constraint, the incoming Pod can only be placed onto "zoneB"; while in terms of the second constraint, the incoming Pod can only be placed onto "node4". Then the results of 2 constraints are ANDed, so the only viable option is to place on "node4".
Multiple constraints can lead to conflicts. Suppose you have a 3-node cluster across 2 zones:
If you apply "two-constraints.yaml" to this cluster, you will notice "mypod" stays in Pending
state. This is because: to satisfy the first constraint, "mypod" can only be put to "zoneB"; while in terms of the second constraint, "mypod" can only put to "node2". Then a joint result of "zoneB" and "node2" returns nothing.
To overcome this situation, you can either increase the maxSkew
or modify one of the constraints to use whenUnsatisfiable: ScheduleAnyway
.
Conventions
There are some implicit conventions worth noting here:
Only the Pods holding the same namespace as the incoming Pod can be matching candidates.
Nodes without
topologySpreadConstraints[*].topologyKey
present will be bypassed. It implies that:- the Pods located on those nodes do not impact
maxSkew
calculation - in the above example, suppose "node1" does not have label "zone", then the 2 Pods will be disregarded, hence the incoming Pod will be scheduled into "zoneA". - the incoming Pod has no chances to be scheduled onto this kind of nodes - in the above example, suppose a "node5" carrying label
{zone-typo: zoneC}
joins the cluster, it will be bypassed due to the absence of label key "zone".
- the Pods located on those nodes do not impact
Be aware of what will happen if the incomingPod's
topologySpreadConstraints[*].labelSelector
doesn't match its own labels. In the above example, if we remove the incoming Pod's labels, it can still be placed onto "zoneB" since the constraints are still satisfied. However, after the placement, the degree of imbalance of the cluster remains unchanged - it's still zoneA having 2 Pods which hold label {foo:bar}, and zoneB having 1 Pod which holds label {foo:bar}. So if this is not what you expect, we recommend the workload'stopologySpreadConstraints[*].labelSelector
to match its own labels.If the incoming Pod has
spec.nodeSelector
orspec.affinity.nodeAffinity
defined, nodes not matching them will be bypassed.Suppose you have a 5-node cluster ranging from zoneA to zoneC:
graph BT subgraph "zoneB" p3(Pod) --> n3(Node3) n4(Node4) end subgraph "zoneA" p1(Pod) --> n1(Node1) p2(Pod) --> n2(Node2) end classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000; classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff; classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5; class n1,n2,n3,n4,p1,p2,p3 k8s; class p4 plain; class zoneA,zoneB cluster;graph BT subgraph "zoneC" n5(Node5) end classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000; classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff; classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5; class n5 k8s; class zoneC cluster;and you know that "zoneC" must be excluded. In this case, you can compose the yaml as below, so that "mypod" will be placed onto "zoneB" instead of "zoneC". Similarly
spec.nodeSelector
is also respected.kind: Pod apiVersion: v1 metadata: name: mypod labels: foo: bar spec: topologySpreadConstraints: - maxSkew: 1 topologyKey: zone whenUnsatisfiable: DoNotSchedule labelSelector: matchLabels: foo: bar affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: zone operator: NotIn values: - zoneC containers: - name: pause image: k8s.gcr.io/pause:3.1
Cluster-level default constraints
It is possible to set default topology spread constraints for a cluster. Default topology spread constraints are applied to a Pod if, and only if:
- It doesn't define any constraints in its
.spec.topologySpreadConstraints
. - It belongs to a service, replication controller, replica set or stateful set.
Default constraints can be set as part of the PodTopologySpread
plugin args
in a scheduling profile.
The constraints are specified with the same API above, except that
labelSelector
must be empty. The selectors are calculated from the services,
replication controllers, replica sets or stateful sets that the Pod belongs to.
An example configuration might look like follows:
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
- name: PodTopologySpread
args:
defaultConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
defaultingType: List
Note: The score produced by default scheduling constraints might conflict with the score produced by theSelectorSpread
plugin. It is recommended that you disable this plugin in the scheduling profile when using default constraints forPodTopologySpread
.
Internal default constraints
Kubernetes v1.20 [beta]
With the DefaultPodTopologySpread
feature gate, enabled by default, the
legacy SelectorSpread
plugin is disabled.
kube-scheduler uses the following default topology constraints for the
PodTopologySpread
plugin configuration:
defaultConstraints:
- maxSkew: 3
topologyKey: "kubernetes.io/hostname"
whenUnsatisfiable: ScheduleAnyway
- maxSkew: 5
topologyKey: "topology.kubernetes.io/zone"
whenUnsatisfiable: ScheduleAnyway
Also, the legacy SelectorSpread
plugin, which provides an equivalent behavior,
is disabled.
Note:If your nodes are not expected to have both
kubernetes.io/hostname
andtopology.kubernetes.io/zone
labels set, define your own constraints instead of using the Kubernetes defaults.The
PodTopologySpread
plugin does not score the nodes that don't have the topology keys specified in the spreading constraints.
If you don't want to use the default Pod spreading constraints for your cluster,
you can disable those defaults by setting defaultingType
to List
and leaving
empty defaultConstraints
in the PodTopologySpread
plugin configuration:
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
- name: PodTopologySpread
args:
defaultConstraints: []
defaultingType: List
Comparison with PodAffinity/PodAntiAffinity
In Kubernetes, directives related to "Affinity" control how Pods are scheduled - more packed or more scattered.
- For
PodAffinity
, you can try to pack any number of Pods into qualifying topology domain(s) - For
PodAntiAffinity
, only one Pod can be scheduled into a single topology domain.
For finer control, you can specify topology spread constraints to distribute Pods across different topology domains - to achieve either high availability or cost-saving. This can also help on rolling update workloads and scaling out replicas smoothly. See Motivation for more details.
Known Limitations
- Scaling down a Deployment may result in imbalanced Pods distribution.
- Pods matched on tainted nodes are respected. See Issue 80921
What's next
- Blog: Introducing PodTopologySpread
explains
maxSkew
in details, as well as bringing up some advanced usage examples.
4.1.4 - Disruptions
This guide is for application owners who want to build highly available applications, and thus need to understand what types of disruptions can happen to Pods.
It is also for cluster administrators who want to perform automated cluster actions, like upgrading and autoscaling clusters.
Voluntary and involuntary disruptions
Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.
We call these unavoidable cases involuntary disruptions to an application. Examples are:
- a hardware failure of the physical machine backing the node
- cluster administrator deletes VM (instance) by mistake
- cloud provider or hypervisor failure makes VM disappear
- a kernel panic
- the node disappears from the cluster due to cluster network partition
- eviction of a pod due to the node being out-of-resources.
Except for the out-of-resources condition, all these conditions should be familiar to most users; they are not specific to Kubernetes.
We call other cases voluntary disruptions. These include both actions initiated by the application owner and those initiated by a Cluster Administrator. Typical application owner actions include:
- deleting the deployment or other controller that manages the pod
- updating a deployment's pod template causing a restart
- directly deleting a pod (e.g. by accident)
Cluster administrator actions include:
- Draining a node for repair or upgrade.
- Draining a node from a cluster to scale the cluster down (learn about Cluster Autoscaling ).
- Removing a pod from a node to permit something else to fit on that node.
These actions might be taken directly by the cluster administrator, or by automation run by the cluster administrator, or by your cluster hosting provider.
Ask your cluster administrator or consult your cloud provider or distribution documentation to determine if any sources of voluntary disruptions are enabled for your cluster. If none are enabled, you can skip creating Pod Disruption Budgets.
Caution: Not all voluntary disruptions are constrained by Pod Disruption Budgets. For example, deleting deployments or pods bypasses Pod Disruption Budgets.
Dealing with disruptions
Here are some ways to mitigate involuntary disruptions:
- Ensure your pod requests the resources it needs.
- Replicate your application if you need higher availability. (Learn about running replicated stateless and stateful applications.)
- For even higher availability when running replicated applications, spread applications across racks (using anti-affinity) or across zones (if using a multi-zone cluster.)
The frequency of voluntary disruptions varies. On a basic Kubernetes cluster, there are no voluntary disruptions at all. However, your cluster administrator or hosting provider may run some additional services which cause voluntary disruptions. For example, rolling out node software updates can cause voluntary disruptions. Also, some implementations of cluster (node) autoscaling may cause voluntary disruptions to defragment and compact nodes. Your cluster administrator or hosting provider should have documented what level of voluntary disruptions, if any, to expect.
Pod disruption budgets
Kubernetes v1.21 [stable]
Kubernetes offers features to help you run highly available applications even when you introduce frequent voluntary disruptions.
As an application owner, you can create a PodDisruptionBudget (PDB) for each application. A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions. For example, a quorum-based application would like to ensure that the number of replicas running is never brought below the number needed for a quorum. A web front end might want to ensure that the number of replicas serving load never falls below a certain percentage of the total.
Cluster managers and hosting providers should use tools which respect PodDisruptionBudgets by calling the Eviction API instead of directly deleting pods or deployments.
For example, the kubectl drain
subcommand lets you mark a node as going out of
service. When you run kubectl drain
, the tool tries to evict all of the Pods on
the Node you're taking out of service. The eviction request that kubectl
submits on
your behalf may be temporarily rejected, so the tool periodically retries all failed
requests until all Pods on the target node are terminated, or until a configurable timeout
is reached.
A PDB specifies the number of replicas that an application can tolerate having, relative to how
many it is intended to have. For example, a Deployment which has a .spec.replicas: 5
is
supposed to have 5 pods at any given time. If its PDB allows for there to be 4 at a time,
then the Eviction API will allow voluntary disruption of one (but not two) pods at a time.
The group of pods that comprise the application is specified using a label selector, the same as the one used by the application's controller (deployment, stateful-set, etc).
The "intended" number of pods is computed from the .spec.replicas
of the workload resource
that is managing those pods. The control plane discovers the owning workload resource by
examining the .metadata.ownerReferences
of the Pod.
PDBs cannot prevent involuntary disruptions from occurring, but they do count against the budget.
Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but workload resources (such as Deployment and StatefulSet) are not limited by PDBs when doing rolling upgrades. Instead, the handling of failures during application updates is configured in the spec for the specific workload resource.
When a pod is evicted using the eviction API, it is gracefully
terminated, honoring the
terminationGracePeriodSeconds
setting in its PodSpec.
PodDisruptionBudget example
Consider a cluster with 3 nodes, node-1
through node-3
.
The cluster is running several applications. One of them has 3 replicas initially called
pod-a
, pod-b
, and pod-c
. Another, unrelated pod without a PDB, called pod-x
, is also shown.
Initially, the pods are laid out as follows:
node-1 | node-2 | node-3 |
---|---|---|
pod-a available | pod-b available | pod-c available |
pod-x available |
All 3 pods are part of a deployment, and they collectively have a PDB which requires there be at least 2 of the 3 pods to be available at all times.
For example, assume the cluster administrator wants to reboot into a new kernel version to fix a bug in the kernel.
The cluster administrator first tries to drain node-1
using the kubectl drain
command.
That tool tries to evict pod-a
and pod-x
. This succeeds immediately.
Both pods go into the terminating
state at the same time.
This puts the cluster in this state:
node-1 draining | node-2 | node-3 |
---|---|---|
pod-a terminating | pod-b available | pod-c available |
pod-x terminating |
The deployment notices that one of the pods is terminating, so it creates a replacement
called pod-d
. Since node-1
is cordoned, it lands on another node. Something has
also created pod-y
as a replacement for pod-x
.
(Note: for a StatefulSet, pod-a
, which would be called something like pod-0
, would need
to terminate completely before its replacement, which is also called pod-0
but has a
different UID, could be created. Otherwise, the example applies to a StatefulSet as well.)
Now the cluster is in this state:
node-1 draining | node-2 | node-3 |
---|---|---|
pod-a terminating | pod-b available | pod-c available |
pod-x terminating | pod-d starting | pod-y |
At some point, the pods terminate, and the cluster looks like this:
node-1 drained | node-2 | node-3 |
---|---|---|
pod-b available | pod-c available | |
pod-d starting | pod-y |
At this point, if an impatient cluster administrator tries to drain node-2
or
node-3
, the drain command will block, because there are only 2 available
pods for the deployment, and its PDB requires at least 2. After some time passes, pod-d
becomes available.
The cluster state now looks like this:
node-1 drained | node-2 | node-3 |
---|---|---|
pod-b available | pod-c available | |
pod-d available | pod-y |
Now, the cluster administrator tries to drain node-2
.
The drain command will try to evict the two pods in some order, say
pod-b
first and then pod-d
. It will succeed at evicting pod-b
.
But, when it tries to evict pod-d
, it will be refused because that would leave only
one pod available for the deployment.
The deployment creates a replacement for pod-b
called pod-e
.
Because there are not enough resources in the cluster to schedule
pod-e
the drain will again block. The cluster may end up in this
state:
node-1 drained | node-2 | node-3 | no node |
---|---|---|---|
pod-b terminating | pod-c available | pod-e pending | |
pod-d available | pod-y |
At this point, the cluster administrator needs to add a node back to the cluster to proceed with the upgrade.
You can see how Kubernetes varies the rate at which disruptions can happen, according to:
- how many replicas an application needs
- how long it takes to gracefully shutdown an instance
- how long it takes a new instance to start up
- the type of controller
- the cluster's resource capacity
Separating Cluster Owner and Application Owner Roles
Often, it is useful to think of the Cluster Manager and Application Owner as separate roles with limited knowledge of each other. This separation of responsibilities may make sense in these scenarios:
- when there are many application teams sharing a Kubernetes cluster, and there is natural specialization of roles
- when third-party tools or services are used to automate cluster management
Pod Disruption Budgets support this separation of roles by providing an interface between the roles.
If you do not have such a separation of responsibilities in your organization, you may not need to use Pod Disruption Budgets.
How to perform Disruptive Actions on your Cluster
If you are a Cluster Administrator, and you need to perform a disruptive action on all the nodes in your cluster, such as a node or system software upgrade, here are some options:
- Accept downtime during the upgrade.
- Failover to another complete replica cluster.
- No downtime, but may be costly both for the duplicated nodes and for human effort to orchestrate the switchover.
- Write disruption tolerant applications and use PDBs.
- No downtime.
- Minimal resource duplication.
- Allows more automation of cluster administration.
- Writing disruption-tolerant applications is tricky, but the work to tolerate voluntary disruptions largely overlaps with work to support autoscaling and tolerating involuntary disruptions.
What's next
Follow steps to protect your application by configuring a Pod Disruption Budget.
Learn more about draining nodes
Learn about updating a deployment including steps to maintain its availability during the rollout.
4.1.5 - Ephemeral Containers
Kubernetes v1.16 [alpha]
This page provides an overview of ephemeral containers: a special type of container that runs temporarily in an existing Pod to accomplish user-initiated actions such as troubleshooting. You use ephemeral containers to inspect services rather than to build applications.
Warning: Ephemeral containers are in early alpha state and are not suitable for production clusters. In accordance with the Kubernetes Deprecation Policy, this alpha feature could change significantly in the future or be removed entirely.
Understanding ephemeral containers
Pods are the fundamental building block of Kubernetes applications. Since Pods are intended to be disposable and replaceable, you cannot add a container to a Pod once it has been created. Instead, you usually delete and replace Pods in a controlled fashion using deployments.
Sometimes it's necessary to inspect the state of an existing Pod, however, for example to troubleshoot a hard-to-reproduce bug. In these cases you can run an ephemeral container in an existing Pod to inspect its state and run arbitrary commands.
What is an ephemeral container?
Ephemeral containers differ from other containers in that they lack guarantees
for resources or execution, and they will never be automatically restarted, so
they are not appropriate for building applications. Ephemeral containers are
described using the same ContainerSpec
as regular containers, but many fields
are incompatible and disallowed for ephemeral containers.
- Ephemeral containers may not have ports, so fields such as
ports
,livenessProbe
,readinessProbe
are disallowed. - Pod resource allocations are immutable, so setting
resources
is disallowed. - For a complete list of allowed fields, see the EphemeralContainer reference documentation.
Ephemeral containers are created using a special ephemeralcontainers
handler
in the API rather than by adding them directly to pod.spec
, so it's not
possible to add an ephemeral container using kubectl edit
.
Like regular containers, you may not change or remove an ephemeral container after you have added it to a Pod.
Uses for ephemeral containers
Ephemeral containers are useful for interactive troubleshooting when kubectl exec
is insufficient because a container has crashed or a container image
doesn't include debugging utilities.
In particular, distroless images
enable you to deploy minimal container images that reduce attack surface
and exposure to bugs and vulnerabilities. Since distroless images do not include a
shell or any debugging utilities, it's difficult to troubleshoot distroless
images using kubectl exec
alone.
When using ephemeral containers, it's helpful to enable process namespace sharing so you can view processes in other containers.
See Debugging with Ephemeral Debug Container for examples of troubleshooting using ephemeral containers.
Ephemeral containers API
Note: The examples in this section require theEphemeralContainers
feature gate to be enabled, and Kubernetes client and server version v1.16 or later.
The examples in this section demonstrate how ephemeral containers appear in
the API. You would normally use kubectl debug
or another kubectl
plugin to automate these steps
rather than invoking the API directly.
Ephemeral containers are created using the ephemeralcontainers
subresource
of Pod, which can be demonstrated using kubectl --raw
. First describe
the ephemeral container to add as an EphemeralContainers
list:
{
"apiVersion": "v1",
"kind": "EphemeralContainers",
"metadata": {
"name": "example-pod"
},
"ephemeralContainers": [{
"command": [
"sh"
],
"image": "busybox",
"imagePullPolicy": "IfNotPresent",
"name": "debugger",
"stdin": true,
"tty": true,
"terminationMessagePolicy": "File"
}]
}
To update the ephemeral containers of the already running example-pod
:
kubectl replace --raw /api/v1/namespaces/default/pods/example-pod/ephemeralcontainers -f ec.json
This will return the new list of ephemeral containers:
{
"kind":"EphemeralContainers",
"apiVersion":"v1",
"metadata":{
"name":"example-pod",
"namespace":"default",
"selfLink":"/api/v1/namespaces/default/pods/example-pod/ephemeralcontainers",
"uid":"a14a6d9b-62f2-4119-9d8e-e2ed6bc3a47c",
"resourceVersion":"15886",
"creationTimestamp":"2019-08-29T06:41:42Z"
},
"ephemeralContainers":[
{
"name":"debugger",
"image":"busybox",
"command":[
"sh"
],
"resources":{
},
"terminationMessagePolicy":"File",
"imagePullPolicy":"IfNotPresent",
"stdin":true,
"tty":true
}
]
}
You can view the state of the newly created ephemeral container using kubectl describe
:
kubectl describe pod example-pod
...
Ephemeral Containers:
debugger:
Container ID: docker://cf81908f149e7e9213d3c3644eda55c72efaff67652a2685c1146f0ce151e80f
Image: busybox
Image ID: docker-pullable://busybox@sha256:9f1003c480699be56815db0f8146ad2e22efea85129b5b5983d0e0fb52d9ab70
Port: <none>
Host Port: <none>
Command:
sh
State: Running
Started: Thu, 29 Aug 2019 06:42:21 +0000
Ready: False
Restart Count: 0
Environment: <none>
Mounts: <none>
...
You can interact with the new ephemeral container in the same way as other
containers using kubectl attach
, kubectl exec
, and kubectl logs
, for
example:
kubectl attach -it example-pod -c debugger
4.2 - Workload Resources
4.2.1 - Deployments
A Deployment provides declarative updates for Pods and ReplicaSets.
You describe a desired state in a Deployment, and the Deployment Controller changes the actual state to the desired state at a controlled rate. You can define Deployments to create new ReplicaSets, or to remove existing Deployments and adopt all their resources with new Deployments.
Note: Do not manage ReplicaSets owned by a Deployment. Consider opening an issue in the main Kubernetes repository if your use case is not covered below.
Use Case
The following are typical use cases for Deployments:
- Create a Deployment to rollout a ReplicaSet. The ReplicaSet creates Pods in the background. Check the status of the rollout to see if it succeeds or not.
- Declare the new state of the Pods by updating the PodTemplateSpec of the Deployment. A new ReplicaSet is created and the Deployment manages moving the Pods from the old ReplicaSet to the new one at a controlled rate. Each new ReplicaSet updates the revision of the Deployment.
- Rollback to an earlier Deployment revision if the current state of the Deployment is not stable. Each rollback updates the revision of the Deployment.
- Scale up the Deployment to facilitate more load.
- Pause the Deployment to apply multiple fixes to its PodTemplateSpec and then resume it to start a new rollout.
- Use the status of the Deployment as an indicator that a rollout has stuck.
- Clean up older ReplicaSets that you don't need anymore.
Creating a Deployment
The following is an example of a Deployment. It creates a ReplicaSet to bring up three nginx
Pods:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
labels:
app: nginx
spec:
replicas: 3
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
In this example:
A Deployment named
nginx-deployment
is created, indicated by the.metadata.name
field.The Deployment creates three replicated Pods, indicated by the
.spec.replicas
field.The
.spec.selector
field defines how the Deployment finds which Pods to manage. In this case, you select a label that is defined in the Pod template (app: nginx
). However, more sophisticated selection rules are possible, as long as the Pod template itself satisfies the rule.Note: The.spec.selector.matchLabels
field is a map of {key,value} pairs. A single {key,value} in thematchLabels
map is equivalent to an element ofmatchExpressions
, whosekey
field is "key", theoperator
is "In", and thevalues
array contains only "value". All of the requirements, from bothmatchLabels
andmatchExpressions
, must be satisfied in order to match.The
template
field contains the following sub-fields:- The Pods are labeled
app: nginx
using the.metadata.labels
field. - The Pod template's specification, or
.template.spec
field, indicates that the Pods run one container,nginx
, which runs thenginx
Docker Hub image at version 1.14.2. - Create one container and name it
nginx
using the.spec.template.spec.containers[0].name
field.
- The Pods are labeled
Before you begin, make sure your Kubernetes cluster is up and running. Follow the steps given below to create the above Deployment:
Create the Deployment by running the following command:
kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml
Note: You can specify the--record
flag to write the command executed in the resource annotationkubernetes.io/change-cause
. The recorded change is useful for future introspection. For example, to see the commands executed in each Deployment revision.
Run
kubectl get deployments
to check if the Deployment was created.If the Deployment is still being created, the output is similar to the following:
NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 0/3 0 0 1s
When you inspect the Deployments in your cluster, the following fields are displayed:
NAME
lists the names of the Deployments in the namespace.READY
displays how many replicas of the application are available to your users. It follows the pattern ready/desired.UP-TO-DATE
displays the number of replicas that have been updated to achieve the desired state.AVAILABLE
displays how many replicas of the application are available to your users.AGE
displays the amount of time that the application has been running.
Notice how the number of desired replicas is 3 according to
.spec.replicas
field.To see the Deployment rollout status, run
kubectl rollout status deployment/nginx-deployment
.The output is similar to:
Waiting for rollout to finish: 2 out of 3 new replicas have been updated... deployment "nginx-deployment" successfully rolled out
Run the
kubectl get deployments
again a few seconds later. The output is similar to this:NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 3/3 3 3 18s
Notice that the Deployment has created all three replicas, and all replicas are up-to-date (they contain the latest Pod template) and available.
To see the ReplicaSet (
rs
) created by the Deployment, runkubectl get rs
. The output is similar to this:NAME DESIRED CURRENT READY AGE nginx-deployment-75675f5897 3 3 3 18s
ReplicaSet output shows the following fields:
NAME
lists the names of the ReplicaSets in the namespace.DESIRED
displays the desired number of replicas of the application, which you define when you create the Deployment. This is the desired state.CURRENT
displays how many replicas are currently running.READY
displays how many replicas of the application are available to your users.AGE
displays the amount of time that the application has been running.
Notice that the name of the ReplicaSet is always formatted as
[DEPLOYMENT-NAME]-[RANDOM-STRING]
. The random string is randomly generated and uses thepod-template-hash
as a seed.To see the labels automatically generated for each Pod, run
kubectl get pods --show-labels
. The output is similar to:NAME READY STATUS RESTARTS AGE LABELS nginx-deployment-75675f5897-7ci7o 1/1 Running 0 18s app=nginx,pod-template-hash=3123191453 nginx-deployment-75675f5897-kzszj 1/1 Running 0 18s app=nginx,pod-template-hash=3123191453 nginx-deployment-75675f5897-qqcnn 1/1 Running 0 18s app=nginx,pod-template-hash=3123191453
The created ReplicaSet ensures that there are three
nginx
Pods.
Note:You must specify an appropriate selector and Pod template labels in a Deployment (in this case,
app: nginx
).Do not overlap labels or selectors with other controllers (including other Deployments and StatefulSets). Kubernetes doesn't stop you from overlapping, and if multiple controllers have overlapping selectors those controllers might conflict and behave unexpectedly.
Pod-template-hash label
Caution: Do not change this label.
The pod-template-hash
label is added by the Deployment controller to every ReplicaSet that a Deployment creates or adopts.
This label ensures that child ReplicaSets of a Deployment do not overlap. It is generated by hashing the PodTemplate
of the ReplicaSet and using the resulting hash as the label value that is added to the ReplicaSet selector, Pod template labels,
and in any existing Pods that the ReplicaSet might have.
Updating a Deployment
Note: A Deployment's rollout is triggered if and only if the Deployment's Pod template (that is,.spec.template
) is changed, for example if the labels or container images of the template are updated. Other updates, such as scaling the Deployment, do not trigger a rollout.
Follow the steps given below to update your Deployment:
Let's update the nginx Pods to use the
nginx:1.16.1
image instead of thenginx:1.14.2
image.kubectl --record deployment.apps/nginx-deployment set image deployment.v1.apps/nginx-deployment nginx=nginx:1.16.1
or use the following command:
kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1 --record
The output is similar to:
deployment.apps/nginx-deployment image updated
Alternatively, you can
edit
the Deployment and change.spec.template.spec.containers[0].image
fromnginx:1.14.2
tonginx:1.16.1
:kubectl edit deployment.v1.apps/nginx-deployment
The output is similar to:
deployment.apps/nginx-deployment edited
To see the rollout status, run:
kubectl rollout status deployment/nginx-deployment
The output is similar to this:
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
or
deployment "nginx-deployment" successfully rolled out
Get more details on your updated Deployment:
After the rollout succeeds, you can view the Deployment by running
kubectl get deployments
. The output is similar to this:NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 3/3 3 3 36s
Run
kubectl get rs
to see that the Deployment updated the Pods by creating a new ReplicaSet and scaling it up to 3 replicas, as well as scaling down the old ReplicaSet to 0 replicas.kubectl get rs
The output is similar to this:
NAME DESIRED CURRENT READY AGE nginx-deployment-1564180365 3 3 3 6s nginx-deployment-2035384211 0 0 0 36s
Running
get pods
should now show only the new Pods:kubectl get pods
The output is similar to this:
NAME READY STATUS RESTARTS AGE nginx-deployment-1564180365-khku8 1/1 Running 0 14s nginx-deployment-1564180365-nacti 1/1 Running 0 14s nginx-deployment-1564180365-z9gth 1/1 Running 0 14s
Next time you want to update these Pods, you only need to update the Deployment's Pod template again.
Deployment ensures that only a certain number of Pods are down while they are being updated. By default, it ensures that at least 75% of the desired number of Pods are up (25% max unavailable).
Deployment also ensures that only a certain number of Pods are created above the desired number of Pods. By default, it ensures that at most 125% of the desired number of Pods are up (25% max surge).
For example, if you look at the above Deployment closely, you will see that it first created a new Pod, then deleted some old Pods, and created new ones. It does not kill old Pods until a sufficient number of new Pods have come up, and does not create new Pods until a sufficient number of old Pods have been killed. It makes sure that at least 2 Pods are available and that at max 4 Pods in total are available.
Get details of your Deployment:
kubectl describe deployments
The output is similar to this:
Name: nginx-deployment Namespace: default CreationTimestamp: Thu, 30 Nov 2017 10:56:25 +0000 Labels: app=nginx Annotations: deployment.kubernetes.io/revision=2 Selector: app=nginx Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable StrategyType: RollingUpdate MinReadySeconds: 0 RollingUpdateStrategy: 25% max unavailable, 25% max surge Pod Template: Labels: app=nginx Containers: nginx: Image: nginx:1.16.1 Port: 80/TCP Environment: <none> Mounts: <none> Volumes: <none> Conditions: Type Status Reason ---- ------ ------ Available True MinimumReplicasAvailable Progressing True NewReplicaSetAvailable OldReplicaSets: <none> NewReplicaSet: nginx-deployment-1564180365 (3/3 replicas created) Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal ScalingReplicaSet 2m deployment-controller Scaled up replica set nginx-deployment-2035384211 to 3 Normal ScalingReplicaSet 24s deployment-controller Scaled up replica set nginx-deployment-1564180365 to 1 Normal ScalingReplicaSet 22s deployment-controller Scaled down replica set nginx-deployment-2035384211 to 2 Normal ScalingReplicaSet 22s deployment-controller Scaled up replica set nginx-deployment-1564180365 to 2 Normal ScalingReplicaSet 19s deployment-controller Scaled down replica set nginx-deployment-2035384211 to 1 Normal ScalingReplicaSet 19s deployment-controller Scaled up replica set nginx-deployment-1564180365 to 3 Normal ScalingReplicaSet 14s deployment-controller Scaled down replica set nginx-deployment-2035384211 to 0
Here you see that when you first created the Deployment, it created a ReplicaSet (nginx-deployment-2035384211) and scaled it up to 3 replicas directly. When you updated the Deployment, it created a new ReplicaSet (nginx-deployment-1564180365) and scaled it up to 1 and then scaled down the old ReplicaSet to 2, so that at least 2 Pods were available and at most 4 Pods were created at all times. It then continued scaling up and down the new and the old ReplicaSet, with the same rolling update strategy. Finally, you'll have 3 available replicas in the new ReplicaSet, and the old ReplicaSet is scaled down to 0.
Rollover (aka multiple updates in-flight)
Each time a new Deployment is observed by the Deployment controller, a ReplicaSet is created to bring up
the desired Pods. If the Deployment is updated, the existing ReplicaSet that controls Pods whose labels
match .spec.selector
but whose template does not match .spec.template
are scaled down. Eventually, the new
ReplicaSet is scaled to .spec.replicas
and all old ReplicaSets is scaled to 0.
If you update a Deployment while an existing rollout is in progress, the Deployment creates a new ReplicaSet as per the update and start scaling that up, and rolls over the ReplicaSet that it was scaling up previously -- it will add it to its list of old ReplicaSets and start scaling it down.
For example, suppose you create a Deployment to create 5 replicas of nginx:1.14.2
,
but then update the Deployment to create 5 replicas of nginx:1.16.1
, when only 3
replicas of nginx:1.14.2
had been created. In that case, the Deployment immediately starts
killing the 3 nginx:1.14.2
Pods that it had created, and starts creating
nginx:1.16.1
Pods. It does not wait for the 5 replicas of nginx:1.14.2
to be created
before changing course.
Label selector updates
It is generally discouraged to make label selector updates and it is suggested to plan your selectors up front. In any case, if you need to perform a label selector update, exercise great caution and make sure you have grasped all of the implications.
Note: In API versionapps/v1
, a Deployment's label selector is immutable after it gets created.
- Selector additions require the Pod template labels in the Deployment spec to be updated with the new label too, otherwise a validation error is returned. This change is a non-overlapping one, meaning that the new selector does not select ReplicaSets and Pods created with the old selector, resulting in orphaning all old ReplicaSets and creating a new ReplicaSet.
- Selector updates changes the existing value in a selector key -- result in the same behavior as additions.
- Selector removals removes an existing key from the Deployment selector -- do not require any changes in the Pod template labels. Existing ReplicaSets are not orphaned, and a new ReplicaSet is not created, but note that the removed label still exists in any existing Pods and ReplicaSets.
Rolling Back a Deployment
Sometimes, you may want to rollback a Deployment; for example, when the Deployment is not stable, such as crash looping. By default, all of the Deployment's rollout history is kept in the system so that you can rollback anytime you want (you can change that by modifying revision history limit).
Note: A Deployment's revision is created when a Deployment's rollout is triggered. This means that the new revision is created if and only if the Deployment's Pod template (.spec.template
) is changed, for example if you update the labels or container images of the template. Other updates, such as scaling the Deployment, do not create a Deployment revision, so that you can facilitate simultaneous manual- or auto-scaling. This means that when you roll back to an earlier revision, only the Deployment's Pod template part is rolled back.
Suppose that you made a typo while updating the Deployment, by putting the image name as
nginx:1.161
instead ofnginx:1.16.1
:kubectl set image deployment.v1.apps/nginx-deployment nginx=nginx:1.161 --record=true
The output is similar to this:
deployment.apps/nginx-deployment image updated
The rollout gets stuck. You can verify it by checking the rollout status:
kubectl rollout status deployment/nginx-deployment
The output is similar to this:
Waiting for rollout to finish: 1 out of 3 new replicas have been updated...
Press Ctrl-C to stop the above rollout status watch. For more information on stuck rollouts, read more here.
You see that the number of old replicas (
nginx-deployment-1564180365
andnginx-deployment-2035384211
) is 2, and new replicas (nginx-deployment-3066724191) is 1.kubectl get rs
The output is similar to this:
NAME DESIRED CURRENT READY AGE nginx-deployment-1564180365 3 3 3 25s nginx-deployment-2035384211 0 0 0 36s nginx-deployment-3066724191 1 1 0 6s
Looking at the Pods created, you see that 1 Pod created by new ReplicaSet is stuck in an image pull loop.
kubectl get pods
The output is similar to this:
NAME READY STATUS RESTARTS AGE nginx-deployment-1564180365-70iae 1/1 Running 0 25s nginx-deployment-1564180365-jbqqo 1/1 Running 0 25s nginx-deployment-1564180365-hysrc 1/1 Running 0 25s nginx-deployment-3066724191-08mng 0/1 ImagePullBackOff 0 6s
Note: The Deployment controller stops the bad rollout automatically, and stops scaling up the new ReplicaSet. This depends on the rollingUpdate parameters (maxUnavailable
specifically) that you have specified. Kubernetes by default sets the value to 25%.Get the description of the Deployment:
kubectl describe deployment
The output is similar to this:
Name: nginx-deployment Namespace: default CreationTimestamp: Tue, 15 Mar 2016 14:48:04 -0700 Labels: app=nginx Selector: app=nginx Replicas: 3 desired | 1 updated | 4 total | 3 available | 1 unavailable StrategyType: RollingUpdate MinReadySeconds: 0 RollingUpdateStrategy: 25% max unavailable, 25% max surge Pod Template: Labels: app=nginx Containers: nginx: Image: nginx:1.161 Port: 80/TCP Host Port: 0/TCP Environment: <none> Mounts: <none> Volumes: <none> Conditions: Type Status Reason ---- ------ ------ Available True MinimumReplicasAvailable Progressing True ReplicaSetUpdated OldReplicaSets: nginx-deployment-1564180365 (3/3 replicas created) NewReplicaSet: nginx-deployment-3066724191 (1/1 replicas created) Events: FirstSeen LastSeen Count From SubObjectPath Type Reason Message --------- -------- ----- ---- ------------- -------- ------ ------- 1m 1m 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-2035384211 to 3 22s 22s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-1564180365 to 1 22s 22s 1 {deployment-controller } Normal ScalingReplicaSet Scaled down replica set nginx-deployment-2035384211 to 2 22s 22s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-1564180365 to 2 21s 21s 1 {deployment-controller } Normal ScalingReplicaSet Scaled down replica set nginx-deployment-2035384211 to 1 21s 21s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-1564180365 to 3 13s 13s 1 {deployment-controller } Normal ScalingReplicaSet Scaled down replica set nginx-deployment-2035384211 to 0 13s 13s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-3066724191 to 1
To fix this, you need to rollback to a previous revision of Deployment that is stable.
Checking Rollout History of a Deployment
Follow the steps given below to check the rollout history:
First, check the revisions of this Deployment:
kubectl rollout history deployment.v1.apps/nginx-deployment
The output is similar to this:
deployments "nginx-deployment" REVISION CHANGE-CAUSE 1 kubectl apply --filename=https://k8s.io/examples/controllers/nginx-deployment.yaml --record=true 2 kubectl set image deployment.v1.apps/nginx-deployment nginx=nginx:1.16.1 --record=true 3 kubectl set image deployment.v1.apps/nginx-deployment nginx=nginx:1.161 --record=true
CHANGE-CAUSE
is copied from the Deployment annotationkubernetes.io/change-cause
to its revisions upon creation. You can specify theCHANGE-CAUSE
message by:- Annotating the Deployment with
kubectl annotate deployment.v1.apps/nginx-deployment kubernetes.io/change-cause="image updated to 1.16.1"
- Append the
--record
flag to save thekubectl
command that is making changes to the resource. - Manually editing the manifest of the resource.
- Annotating the Deployment with
To see the details of each revision, run:
kubectl rollout history deployment.v1.apps/nginx-deployment --revision=2
The output is similar to this:
deployments "nginx-deployment" revision 2 Labels: app=nginx pod-template-hash=1159050644 Annotations: kubernetes.io/change-cause=kubectl set image deployment.v1.apps/nginx-deployment nginx=nginx:1.16.1 --record=true Containers: nginx: Image: nginx:1.16.1 Port: 80/TCP QoS Tier: cpu: BestEffort memory: BestEffort Environment Variables: <none> No volumes.
Rolling Back to a Previous Revision
Follow the steps given below to rollback the Deployment from the current version to the previous version, which is version 2.
Now you've decided to undo the current rollout and rollback to the previous revision:
kubectl rollout undo deployment.v1.apps/nginx-deployment
The output is similar to this:
deployment.apps/nginx-deployment rolled back
Alternatively, you can rollback to a specific revision by specifying it with
--to-revision
:kubectl rollout undo deployment.v1.apps/nginx-deployment --to-revision=2
The output is similar to this:
deployment.apps/nginx-deployment rolled back
For more details about rollout related commands, read
kubectl rollout
.The Deployment is now rolled back to a previous stable revision. As you can see, a
DeploymentRollback
event for rolling back to revision 2 is generated from Deployment controller.Check if the rollback was successful and the Deployment is running as expected, run:
kubectl get deployment nginx-deployment
The output is similar to this:
NAME READY UP-TO-DATE AVAILABLE AGE nginx-deployment 3/3 3 3 30m
Get the description of the Deployment:
kubectl describe deployment nginx-deployment
The output is similar to this:
Name: nginx-deployment Namespace: default CreationTimestamp: Sun, 02 Sep 2018 18:17:55 -0500 Labels: app=nginx Annotations: deployment.kubernetes.io/revision=4 kubernetes.io/change-cause=kubectl set image deployment.v1.apps/nginx-deployment nginx=nginx:1.16.1 --record=true Selector: app=nginx Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable StrategyType: RollingUpdate MinReadySeconds: 0 RollingUpdateStrategy: 25% max unavailable, 25% max surge Pod Template: Labels: app=nginx Containers: nginx: Image: nginx:1.16.1 Port: 80/TCP Host Port: 0/TCP Environment: <none> Mounts: <none> Volumes: <none> Conditions: Type Status Reason ---- ------ ------ Available True MinimumReplicasAvailable Progressing True NewReplicaSetAvailable OldReplicaSets: <none> NewReplicaSet: nginx-deployment-c4747d96c (3/3 replicas created) Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal ScalingReplicaSet 12m deployment-controller Scaled up replica set nginx-deployment-75675f5897 to 3 Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-c4747d96c to 1 Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-deployment-75675f5897 to 2 Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-c4747d96c to 2 Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-deployment-75675f5897 to 1 Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-c4747d96c to 3 Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-deployment-75675f5897 to 0 Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-595696685f to 1 Normal DeploymentRollback 15s deployment-controller Rolled back deployment "nginx-deployment" to revision 2 Normal ScalingReplicaSet 15s deployment-controller Scaled down replica set nginx-deployment-595696685f to 0
Scaling a Deployment
You can scale a Deployment by using the following command:
kubectl scale deployment.v1.apps/nginx-deployment --replicas=10
The output is similar to this:
deployment.apps/nginx-deployment scaled
Assuming horizontal Pod autoscaling is enabled in your cluster, you can setup an autoscaler for your Deployment and choose the minimum and maximum number of Pods you want to run based on the CPU utilization of your existing Pods.
kubectl autoscale deployment.v1.apps/nginx-deployment --min=10 --max=15 --cpu-percent=80
The output is similar to this:
deployment.apps/nginx-deployment scaled
Proportional scaling
RollingUpdate Deployments support running multiple versions of an application at the same time. When you or an autoscaler scales a RollingUpdate Deployment that is in the middle of a rollout (either in progress or paused), the Deployment controller balances the additional replicas in the existing active ReplicaSets (ReplicaSets with Pods) in order to mitigate risk. This is called proportional scaling.
For example, you are running a Deployment with 10 replicas, maxSurge=3, and maxUnavailable=2.
Ensure that the 10 replicas in your Deployment are running.
kubectl get deploy
The output is similar to this:
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE nginx-deployment 10 10 10 10 50s
You update to a new image which happens to be unresolvable from inside the cluster.
kubectl set image deployment.v1.apps/nginx-deployment nginx=nginx:sometag
The output is similar to this:
deployment.apps/nginx-deployment image updated
The image update starts a new rollout with ReplicaSet nginx-deployment-1989198191, but it's blocked due to the
maxUnavailable
requirement that you mentioned above. Check out the rollout status:kubectl get rs
The output is similar to this:
NAME DESIRED CURRENT READY AGE nginx-deployment-1989198191 5 5 0 9s nginx-deployment-618515232 8 8 8 1m
Then a new scaling request for the Deployment comes along. The autoscaler increments the Deployment replicas to 15. The Deployment controller needs to decide where to add these new 5 replicas. If you weren't using proportional scaling, all 5 of them would be added in the new ReplicaSet. With proportional scaling, you spread the additional replicas across all ReplicaSets. Bigger proportions go to the ReplicaSets with the most replicas and lower proportions go to ReplicaSets with less replicas. Any leftovers are added to the ReplicaSet with the most replicas. ReplicaSets with zero replicas are not scaled up.
In our example above, 3 replicas are added to the old ReplicaSet and 2 replicas are added to the new ReplicaSet. The rollout process should eventually move all replicas to the new ReplicaSet, assuming the new replicas become healthy. To confirm this, run:
kubectl get deploy
The output is similar to this:
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
nginx-deployment 15 18 7 8 7m
The rollout status confirms how the replicas were added to each ReplicaSet.
kubectl get rs
The output is similar to this:
NAME DESIRED CURRENT READY AGE
nginx-deployment-1989198191 7 7 0 7m
nginx-deployment-618515232 11 11 11 7m
Pausing and Resuming a Deployment
You can pause a Deployment before triggering one or more updates and then resume it. This allows you to apply multiple fixes in between pausing and resuming without triggering unnecessary rollouts.
For example, with a Deployment that was created: Get the Deployment details:
kubectl get deploy
The output is similar to this:
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE nginx 3 3 3 3 1m
Get the rollout status:
kubectl get rs
The output is similar to this:
NAME DESIRED CURRENT READY AGE nginx-2142116321 3 3 3 1m
Pause by running the following command:
kubectl rollout pause deployment.v1.apps/nginx-deployment
The output is similar to this:
deployment.apps/nginx-deployment paused
Then update the image of the Deployment:
kubectl set image deployment.v1.apps/nginx-deployment nginx=nginx:1.16.1
The output is similar to this:
deployment.apps/nginx-deployment image updated
Notice that no new rollout started:
kubectl rollout history deployment.v1.apps/nginx-deployment
The output is similar to this:
deployments "nginx" REVISION CHANGE-CAUSE 1 <none>
Get the rollout status to ensure that the Deployment is updated successfully:
kubectl get rs
The output is similar to this:
NAME DESIRED CURRENT READY AGE nginx-2142116321 3 3 3 2m
You can make as many updates as you wish, for example, update the resources that will be used:
kubectl set resources deployment.v1.apps/nginx-deployment -c=nginx --limits=cpu=200m,memory=512Mi
The output is similar to this:
deployment.apps/nginx-deployment resource requirements updated
The initial state of the Deployment prior to pausing it will continue its function, but new updates to the Deployment will not have any effect as long as the Deployment is paused.
Eventually, resume the Deployment and observe a new ReplicaSet coming up with all the new updates:
kubectl rollout resume deployment.v1.apps/nginx-deployment
The output is similar to this:
deployment.apps/nginx-deployment resumed
Watch the status of the rollout until it's done.
kubectl get rs -w
The output is similar to this:
NAME DESIRED CURRENT READY AGE nginx-2142116321 2 2 2 2m nginx-3926361531 2 2 0 6s nginx-3926361531 2 2 1 18s nginx-2142116321 1 2 2 2m nginx-2142116321 1 2 2 2m nginx-3926361531 3 2 1 18s nginx-3926361531 3 2 1 18s nginx-2142116321 1 1 1 2m nginx-3926361531 3 3 1 18s nginx-3926361531 3 3 2 19s nginx-2142116321 0 1 1 2m nginx-2142116321 0 1 1 2m nginx-2142116321 0 0 0 2m nginx-3926361531 3 3 3 20s
Get the status of the latest rollout:
kubectl get rs
The output is similar to this:
NAME DESIRED CURRENT READY AGE nginx-2142116321 0 0 0 2m nginx-3926361531 3 3 3 28s
Note: You cannot rollback a paused Deployment until you resume it.
Deployment status
A Deployment enters various states during its lifecycle. It can be progressing while rolling out a new ReplicaSet, it can be complete, or it can fail to progress.
Progressing Deployment
Kubernetes marks a Deployment as progressing when one of the following tasks is performed:
- The Deployment creates a new ReplicaSet.
- The Deployment is scaling up its newest ReplicaSet.
- The Deployment is scaling down its older ReplicaSet(s).
- New Pods become ready or available (ready for at least MinReadySeconds).
You can monitor the progress for a Deployment by using kubectl rollout status
.
Complete Deployment
Kubernetes marks a Deployment as complete when it has the following characteristics:
- All of the replicas associated with the Deployment have been updated to the latest version you've specified, meaning any updates you've requested have been completed.
- All of the replicas associated with the Deployment are available.
- No old replicas for the Deployment are running.
You can check if a Deployment has completed by using kubectl rollout status
. If the rollout completed
successfully, kubectl rollout status
returns a zero exit code.
kubectl rollout status deployment/nginx-deployment
The output is similar to this:
Waiting for rollout to finish: 2 of 3 updated replicas are available...
deployment "nginx-deployment" successfully rolled out
and the exit status from kubectl rollout
is 0 (success):
echo $?
0
Failed Deployment
Your Deployment may get stuck trying to deploy its newest ReplicaSet without ever completing. This can occur due to some of the following factors:
- Insufficient quota
- Readiness probe failures
- Image pull errors
- Insufficient permissions
- Limit ranges
- Application runtime misconfiguration
One way you can detect this condition is to specify a deadline parameter in your Deployment spec:
(.spec.progressDeadlineSeconds
). .spec.progressDeadlineSeconds
denotes the
number of seconds the Deployment controller waits before indicating (in the Deployment status) that the
Deployment progress has stalled.
The following kubectl
command sets the spec with progressDeadlineSeconds
to make the controller report
lack of progress for a Deployment after 10 minutes:
kubectl patch deployment.v1.apps/nginx-deployment -p '{"spec":{"progressDeadlineSeconds":600}}'
The output is similar to this:
deployment.apps/nginx-deployment patched
Once the deadline has been exceeded, the Deployment controller adds a DeploymentCondition with the following
attributes to the Deployment's .status.conditions
:
- Type=Progressing
- Status=False
- Reason=ProgressDeadlineExceeded
See the Kubernetes API conventions for more information on status conditions.
Note: Kubernetes takes no action on a stalled Deployment other than to report a status condition withReason=ProgressDeadlineExceeded
. Higher level orchestrators can take advantage of it and act accordingly, for example, rollback the Deployment to its previous version.
Note: If you pause a Deployment, Kubernetes does not check progress against your specified deadline. You can safely pause a Deployment in the middle of a rollout and resume without triggering the condition for exceeding the deadline.
You may experience transient errors with your Deployments, either due to a low timeout that you have set or due to any other kind of error that can be treated as transient. For example, let's suppose you have insufficient quota. If you describe the Deployment you will notice the following section:
kubectl describe deployment nginx-deployment
The output is similar to this:
<...>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True ReplicaSetUpdated
ReplicaFailure True FailedCreate
<...>
If you run kubectl get deployment nginx-deployment -o yaml
, the Deployment status is similar to this:
status:
availableReplicas: 2
conditions:
- lastTransitionTime: 2016-10-04T12:25:39Z
lastUpdateTime: 2016-10-04T12:25:39Z
message: Replica set "nginx-deployment-4262182780" is progressing.
reason: ReplicaSetUpdated
status: "True"
type: Progressing
- lastTransitionTime: 2016-10-04T12:25:42Z
lastUpdateTime: 2016-10-04T12:25:42Z
message: Deployment has minimum availability.
reason: MinimumReplicasAvailable
status: "True"
type: Available
- lastTransitionTime: 2016-10-04T12:25:39Z
lastUpdateTime: 2016-10-04T12:25:39Z
message: 'Error creating: pods "nginx-deployment-4262182780-" is forbidden: exceeded quota:
object-counts, requested: pods=1, used: pods=3, limited: pods=2'
reason: FailedCreate
status: "True"
type: ReplicaFailure
observedGeneration: 3
replicas: 2
unavailableReplicas: 2
Eventually, once the Deployment progress deadline is exceeded, Kubernetes updates the status and the reason for the Progressing condition:
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing False ProgressDeadlineExceeded
ReplicaFailure True FailedCreate
You can address an issue of insufficient quota by scaling down your Deployment, by scaling down other
controllers you may be running, or by increasing quota in your namespace. If you satisfy the quota
conditions and the Deployment controller then completes the Deployment rollout, you'll see the
Deployment's status update with a successful condition (Status=True
and Reason=NewReplicaSetAvailable
).
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
Type=Available
with Status=True
means that your Deployment has minimum availability. Minimum availability is dictated
by the parameters specified in the deployment strategy. Type=Progressing
with Status=True
means that your Deployment
is either in the middle of a rollout and it is progressing or that it has successfully completed its progress and the minimum
required new replicas are available (see the Reason of the condition for the particulars - in our case
Reason=NewReplicaSetAvailable
means that the Deployment is complete).
You can check if a Deployment has failed to progress by using kubectl rollout status
. kubectl rollout status
returns a non-zero exit code if the Deployment has exceeded the progression deadline.
kubectl rollout status deployment/nginx-deployment
The output is similar to this:
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
error: deployment "nginx" exceeded its progress deadline
and the exit status from kubectl rollout
is 1 (indicating an error):
echo $?
1
Operating on a failed deployment
All actions that apply to a complete Deployment also apply to a failed Deployment. You can scale it up/down, roll back to a previous revision, or even pause it if you need to apply multiple tweaks in the Deployment Pod template.
Clean up Policy
You can set .spec.revisionHistoryLimit
field in a Deployment to specify how many old ReplicaSets for
this Deployment you want to retain. The rest will be garbage-collected in the background. By default,
it is 10.
Note: Explicitly setting this field to 0, will result in cleaning up all the history of your Deployment thus that Deployment will not be able to roll back.
Canary Deployment
If you want to roll out releases to a subset of users or servers using the Deployment, you can create multiple Deployments, one for each release, following the canary pattern described in managing resources.
Writing a Deployment Spec
As with all other Kubernetes configs, a Deployment needs .apiVersion
, .kind
, and .metadata
fields.
For general information about working with config files, see
deploying applications,
configuring containers, and using kubectl to manage resources documents.
The name of a Deployment object must be a valid
DNS subdomain name.
A Deployment also needs a .spec
section.
Pod Template
The .spec.template
and .spec.selector
are the only required field of the .spec
.
The .spec.template
is a Pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind
.
In addition to required fields for a Pod, a Pod template in a Deployment must specify appropriate labels and an appropriate restart policy. For labels, make sure not to overlap with other controllers. See selector.
Only a .spec.template.spec.restartPolicy
equal to Always
is
allowed, which is the default if not specified.
Replicas
.spec.replicas
is an optional field that specifies the number of desired Pods. It defaults to 1.
Selector
.spec.selector
is a required field that specifies a label selector
for the Pods targeted by this Deployment.
.spec.selector
must match .spec.template.metadata.labels
, or it will be rejected by the API.
In API version apps/v1
, .spec.selector
and .metadata.labels
do not default to .spec.template.metadata.labels
if not set. So they must be set explicitly. Also note that .spec.selector
is immutable after creation of the Deployment in apps/v1
.
A Deployment may terminate Pods whose labels match the selector if their template is different
from .spec.template
or if the total number of such Pods exceeds .spec.replicas
. It brings up new
Pods with .spec.template
if the number of Pods is less than the desired number.
Note: You should not create other Pods whose labels match this selector, either directly, by creating another Deployment, or by creating another controller such as a ReplicaSet or a ReplicationController. If you do so, the first Deployment thinks that it created these other Pods. Kubernetes does not stop you from doing this.
If you have multiple controllers that have overlapping selectors, the controllers will fight with each other and won't behave correctly.
Strategy
.spec.strategy
specifies the strategy used to replace old Pods by new ones.
.spec.strategy.type
can be "Recreate" or "RollingUpdate". "RollingUpdate" is
the default value.
Recreate Deployment
All existing Pods are killed before new ones are created when .spec.strategy.type==Recreate
.
Note: This will only guarantee Pod termination previous to creation for upgrades. If you upgrade a Deployment, all Pods of the old revision will be terminated immediately. Successful removal is awaited before any Pod of the new revision is created. If you manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the old Pod is still in a Terminating state). If you need an "at most" guarantee for your Pods, you should consider using a StatefulSet.
Rolling Update Deployment
The Deployment updates Pods in a rolling update
fashion when .spec.strategy.type==RollingUpdate
. You can specify maxUnavailable
and maxSurge
to control
the rolling update process.
Max Unavailable
.spec.strategy.rollingUpdate.maxUnavailable
is an optional field that specifies the maximum number
of Pods that can be unavailable during the update process. The value can be an absolute number (for example, 5)
or a percentage of desired Pods (for example, 10%). The absolute number is calculated from percentage by
rounding down. The value cannot be 0 if .spec.strategy.rollingUpdate.maxSurge
is 0. The default value is 25%.
For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of desired Pods immediately when the rolling update starts. Once new Pods are ready, old ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.
Max Surge
.spec.strategy.rollingUpdate.maxSurge
is an optional field that specifies the maximum number of Pods
that can be created over the desired number of Pods. The value can be an absolute number (for example, 5) or a
percentage of desired Pods (for example, 10%). The value cannot be 0 if MaxUnavailable
is 0. The absolute number
is calculated from the percentage by rounding up. The default value is 25%.
For example, when this value is set to 30%, the new ReplicaSet can be scaled up immediately when the rolling update starts, such that the total number of old and new Pods does not exceed 130% of desired Pods. Once old Pods have been killed, the new ReplicaSet can be scaled up further, ensuring that the total number of Pods running at any time during the update is at most 130% of desired Pods.
Progress Deadline Seconds
.spec.progressDeadlineSeconds
is an optional field that specifies the number of seconds you want
to wait for your Deployment to progress before the system reports back that the Deployment has
failed progressing - surfaced as a condition with Type=Progressing
, Status=False
.
and Reason=ProgressDeadlineExceeded
in the status of the resource. The Deployment controller will keep
retrying the Deployment. This defaults to 600. In the future, once automatic rollback will be implemented, the Deployment
controller will roll back a Deployment as soon as it observes such a condition.
If specified, this field needs to be greater than .spec.minReadySeconds
.
Min Ready Seconds
.spec.minReadySeconds
is an optional field that specifies the minimum number of seconds for which a newly
created Pod should be ready without any of its containers crashing, for it to be considered available.
This defaults to 0 (the Pod will be considered available as soon as it is ready). To learn more about when
a Pod is considered ready, see Container Probes.
Revision History Limit
A Deployment's revision history is stored in the ReplicaSets it controls.
.spec.revisionHistoryLimit
is an optional field that specifies the number of old ReplicaSets to retain
to allow rollback. These old ReplicaSets consume resources in etcd
and crowd the output of kubectl get rs
. The configuration of each Deployment revision is stored in its ReplicaSets; therefore, once an old ReplicaSet is deleted, you lose the ability to rollback to that revision of Deployment. By default, 10 old ReplicaSets will be kept, however its ideal value depends on the frequency and stability of new Deployments.
More specifically, setting this field to zero means that all old ReplicaSets with 0 replicas will be cleaned up. In this case, a new Deployment rollout cannot be undone, since its revision history is cleaned up.
Paused
.spec.paused
is an optional boolean field for pausing and resuming a Deployment. The only difference between
a paused Deployment and one that is not paused, is that any changes into the PodTemplateSpec of the paused
Deployment will not trigger new rollouts as long as it is paused. A Deployment is not paused by default when
it is created.
4.2.2 - ReplicaSet
A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time. As such, it is often used to guarantee the availability of a specified number of identical Pods.
How a ReplicaSet works
A ReplicaSet is defined with fields, including a selector that specifies how to identify Pods it can acquire, a number of replicas indicating how many Pods it should be maintaining, and a pod template specifying the data of new Pods it should create to meet the number of replicas criteria. A ReplicaSet then fulfills its purpose by creating and deleting Pods as needed to reach the desired number. When a ReplicaSet needs to create new Pods, it uses its Pod template.
A ReplicaSet is linked to its Pods via the Pods' metadata.ownerReferences field, which specifies what resource the current object is owned by. All Pods acquired by a ReplicaSet have their owning ReplicaSet's identifying information within their ownerReferences field. It's through this link that the ReplicaSet knows of the state of the Pods it is maintaining and plans accordingly.
A ReplicaSet identifies new Pods to acquire by using its selector. If there is a Pod that has no OwnerReference or the OwnerReference is not a Controller and it matches a ReplicaSet's selector, it will be immediately acquired by said ReplicaSet.
When to use a ReplicaSet
A ReplicaSet ensures that a specified number of pod replicas are running at any given time. However, a Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to Pods along with a lot of other useful features. Therefore, we recommend using Deployments instead of directly using ReplicaSets, unless you require custom update orchestration or don't require updates at all.
This actually means that you may never need to manipulate ReplicaSet objects: use a Deployment instead, and define your application in the spec section.
Example
apiVersion: apps/v1
kind: ReplicaSet
metadata:
name: frontend
labels:
app: guestbook
tier: frontend
spec:
# modify replicas according to your case
replicas: 3
selector:
matchLabels:
tier: frontend
template:
metadata:
labels:
tier: frontend
spec:
containers:
- name: php-redis
image: gcr.io/google_samples/gb-frontend:v3
Saving this manifest into frontend.yaml
and submitting it to a Kubernetes cluster will
create the defined ReplicaSet and the Pods that it manages.
kubectl apply -f https://kubernetes.io/examples/controllers/frontend.yaml
You can then get the current ReplicaSets deployed:
kubectl get rs
And see the frontend one you created:
NAME DESIRED CURRENT READY AGE
frontend 3 3 3 6s
You can also check on the state of the ReplicaSet:
kubectl describe rs/frontend
And you will see output similar to:
Name: frontend
Namespace: default
Selector: tier=frontend
Labels: app=guestbook
tier=frontend
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"apps/v1","kind":"ReplicaSet","metadata":{"annotations":{},"labels":{"app":"guestbook","tier":"frontend"},"name":"frontend",...
Replicas: 3 current / 3 desired
Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: tier=frontend
Containers:
php-redis:
Image: gcr.io/google_samples/gb-frontend:v3
Port: <none>
Host Port: <none>
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 117s replicaset-controller Created pod: frontend-wtsmm
Normal SuccessfulCreate 116s replicaset-controller Created pod: frontend-b2zdv
Normal SuccessfulCreate 116s replicaset-controller Created pod: frontend-vcmts
And lastly you can check for the Pods brought up:
kubectl get pods
You should see Pod information similar to:
NAME READY STATUS RESTARTS AGE
frontend-b2zdv 1/1 Running 0 6m36s
frontend-vcmts 1/1 Running 0 6m36s
frontend-wtsmm 1/1 Running 0 6m36s
You can also verify that the owner reference of these pods is set to the frontend ReplicaSet. To do this, get the yaml of one of the Pods running:
kubectl get pods frontend-b2zdv -o yaml
The output will look similar to this, with the frontend ReplicaSet's info set in the metadata's ownerReferences field:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: "2020-02-12T07:06:16Z"
generateName: frontend-
labels:
tier: frontend
name: frontend-b2zdv
namespace: default
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: frontend
uid: f391f6db-bb9b-4c09-ae74-6a1f77f3d5cf
...
Non-Template Pod acquisitions
While you can create bare Pods with no problems, it is strongly recommended to make sure that the bare Pods do not have labels which match the selector of one of your ReplicaSets. The reason for this is because a ReplicaSet is not limited to owning Pods specified by its template-- it can acquire other Pods in the manner specified in the previous sections.
Take the previous frontend ReplicaSet example, and the Pods specified in the following manifest:
apiVersion: v1
kind: Pod
metadata:
name: pod1
labels:
tier: frontend
spec:
containers:
- name: hello1
image: gcr.io/google-samples/hello-app:2.0
---
apiVersion: v1
kind: Pod
metadata:
name: pod2
labels:
tier: frontend
spec:
containers:
- name: hello2
image: gcr.io/google-samples/hello-app:1.0
As those Pods do not have a Controller (or any object) as their owner reference and match the selector of the frontend ReplicaSet, they will immediately be acquired by it.
Suppose you create the Pods after the frontend ReplicaSet has been deployed and has set up its initial Pod replicas to fulfill its replica count requirement:
kubectl apply -f https://kubernetes.io/examples/pods/pod-rs.yaml
The new Pods will be acquired by the ReplicaSet, and then immediately terminated as the ReplicaSet would be over its desired count.
Fetching the Pods:
kubectl get pods
The output shows that the new Pods are either already terminated, or in the process of being terminated:
NAME READY STATUS RESTARTS AGE
frontend-b2zdv 1/1 Running 0 10m
frontend-vcmts 1/1 Running 0 10m
frontend-wtsmm 1/1 Running 0 10m
pod1 0/1 Terminating 0 1s
pod2 0/1 Terminating 0 1s
If you create the Pods first:
kubectl apply -f https://kubernetes.io/examples/pods/pod-rs.yaml
And then create the ReplicaSet however:
kubectl apply -f https://kubernetes.io/examples/controllers/frontend.yaml
You shall see that the ReplicaSet has acquired the Pods and has only created new ones according to its spec until the number of its new Pods and the original matches its desired count. As fetching the Pods:
kubectl get pods
Will reveal in its output:
NAME READY STATUS RESTARTS AGE
frontend-hmmj2 1/1 Running 0 9s
pod1 1/1 Running 0 36s
pod2 1/1 Running 0 36s
In this manner, a ReplicaSet can own a non-homogenous set of Pods
Writing a ReplicaSet manifest
As with all other Kubernetes API objects, a ReplicaSet needs the apiVersion
, kind
, and metadata
fields.
For ReplicaSets, the kind
is always a ReplicaSet.
In Kubernetes 1.9 the API version apps/v1
on the ReplicaSet kind is the current version and is enabled by default. The API version apps/v1beta2
is deprecated.
Refer to the first lines of the frontend.yaml
example for guidance.
The name of a ReplicaSet object must be a valid DNS subdomain name.
A ReplicaSet also needs a .spec
section.
Pod Template
The .spec.template
is a pod template which is also
required to have labels in place. In our frontend.yaml
example we had one label: tier: frontend
.
Be careful not to overlap with the selectors of other controllers, lest they try to adopt this Pod.
For the template's restart policy field,
.spec.template.spec.restartPolicy
, the only allowed value is Always
, which is the default.
Pod Selector
The .spec.selector
field is a label selector. As discussed
earlier these are the labels used to identify potential Pods to acquire. In our
frontend.yaml
example, the selector was:
matchLabels:
tier: frontend
In the ReplicaSet, .spec.template.metadata.labels
must match spec.selector
, or it will
be rejected by the API.
Note: For 2 ReplicaSets specifying the same.spec.selector
but different.spec.template.metadata.labels
and.spec.template.spec
fields, each ReplicaSet ignores the Pods created by the other ReplicaSet.
Replicas
You can specify how many Pods should run concurrently by setting .spec.replicas
. The ReplicaSet will create/delete
its Pods to match this number.
If you do not specify .spec.replicas
, then it defaults to 1.
Working with ReplicaSets
Deleting a ReplicaSet and its Pods
To delete a ReplicaSet and all of its Pods, use kubectl delete
. The Garbage collector automatically deletes all of the dependent Pods by default.
When using the REST API or the client-go
library, you must set propagationPolicy
to Background
or Foreground
in
the -d option.
For example:
kubectl proxy --port=8080
curl -X DELETE 'localhost:8080/apis/apps/v1/namespaces/default/replicasets/frontend' \
> -d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Foreground"}' \
> -H "Content-Type: application/json"
Deleting just a ReplicaSet
You can delete a ReplicaSet without affecting any of its Pods using kubectl delete
with the --cascade=orphan
option.
When using the REST API or the client-go
library, you must set propagationPolicy
to Orphan
.
For example:
kubectl proxy --port=8080
curl -X DELETE 'localhost:8080/apis/apps/v1/namespaces/default/replicasets/frontend' \
> -d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Orphan"}' \
> -H "Content-Type: application/json"
Once the original is deleted, you can create a new ReplicaSet to replace it. As long
as the old and new .spec.selector
are the same, then the new one will adopt the old Pods.
However, it will not make any effort to make existing Pods match a new, different pod template.
To update Pods to a new spec in a controlled way, use a
Deployment, as ReplicaSets do not support a rolling update directly.
Isolating Pods from a ReplicaSet
You can remove Pods from a ReplicaSet by changing their labels. This technique may be used to remove Pods from service for debugging, data recovery, etc. Pods that are removed in this way will be replaced automatically ( assuming that the number of replicas is not also changed).
Scaling a ReplicaSet
A ReplicaSet can be easily scaled up or down by simply updating the .spec.replicas
field. The ReplicaSet controller
ensures that a desired number of Pods with a matching label selector are available and operational.
When scaling down, the ReplicaSet controller chooses which pods to delete by sorting the available pods to prioritize scaling down pods based on the following general algorithm:
- Pending (and unschedulable) pods are scaled down first
- If controller.kubernetes.io/pod-deletion-cost annotation is set, then the pod with the lower value will come first.
- Pods on nodes with more replicas come before pods on nodes with fewer replicas.
- If the pods' creation times differ, the pod that was created more recently
comes before the older pod (the creation times are bucketed on an integer log scale
when the
LogarithmicScaleDown
feature gate is enabled)
If all of the above match, then selection is random.
Pod deletion cost
Kubernetes v1.21 [alpha]
Using the controller.kubernetes.io/pod-deletion-cost
annotation, users can set a preference regarding which pods to remove first when downscaling a ReplicaSet.
The annotation should be set on the pod, the range is [-2147483647, 2147483647]. It represents the cost of deleting a pod compared to other pods belonging to the same ReplicaSet. Pods with lower deletion cost are preferred to be deleted before pods with higher deletion cost.
The implicit value for this annotation for pods that don't set it is 0; negative values are permitted. Invalid values will be rejected by the API server.
This feature is alpha and disabled by default. You can enable it by setting the
feature gate
PodDeletionCost
in both kube-apiserver and kube-controller-manager.
Note:
- This is honored on a best-effort basis, so it does not offer any guarantees on pod deletion order.
- Users should avoid updating the annotation frequently, such as updating it based on a metric value, because doing so will generate a significant number of pod updates on the apiserver.
Example Use Case
The different pods of an application could have different utilization levels. On scale down, the application
may prefer to remove the pods with lower utilization. To avoid frequently updating the pods, the application
should update controller.kubernetes.io/pod-deletion-cost
once before issuing a scale down (setting the
annotation to a value proportional to pod utilization level). This works if the application itself controls
the down scaling; for example, the driver pod of a Spark deployment.
ReplicaSet as a Horizontal Pod Autoscaler Target
A ReplicaSet can also be a target for Horizontal Pod Autoscalers (HPA). That is, a ReplicaSet can be auto-scaled by an HPA. Here is an example HPA targeting the ReplicaSet we created in the previous example.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: frontend-scaler
spec:
scaleTargetRef:
kind: ReplicaSet
name: frontend
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 50
Saving this manifest into hpa-rs.yaml
and submitting it to a Kubernetes cluster should
create the defined HPA that autoscales the target ReplicaSet depending on the CPU usage
of the replicated Pods.
kubectl apply -f https://k8s.io/examples/controllers/hpa-rs.yaml
Alternatively, you can use the kubectl autoscale
command to accomplish the same
(and it's easier!)
kubectl autoscale rs frontend --max=10 --min=3 --cpu-percent=50
Alternatives to ReplicaSet
Deployment (recommended)
Deployment
is an object which can own ReplicaSets and update
them and their Pods via declarative, server-side rolling updates.
While ReplicaSets can be used independently, today they're mainly used by Deployments as a mechanism to orchestrate Pod
creation, deletion and updates. When you use Deployments you don't have to worry about managing the ReplicaSets that
they create. Deployments own and manage their ReplicaSets.
As such, it is recommended to use Deployments when you want ReplicaSets.
Bare Pods
Unlike the case where a user directly created Pods, a ReplicaSet replaces Pods that are deleted or terminated for any reason, such as in the case of node failure or disruptive node maintenance, such as a kernel upgrade. For this reason, we recommend that you use a ReplicaSet even if your application requires only a single Pod. Think of it similarly to a process supervisor, only it supervises multiple Pods across multiple nodes instead of individual processes on a single node. A ReplicaSet delegates local container restarts to some agent on the node (for example, Kubelet or Docker).
Job
Use a Job
instead of a ReplicaSet for Pods that are expected to terminate on their own
(that is, batch jobs).
DaemonSet
Use a DaemonSet
instead of a ReplicaSet for Pods that provide a
machine-level function, such as machine monitoring or machine logging. These Pods have a lifetime that is tied
to a machine lifetime: the Pod needs to be running on the machine before other Pods start, and are
safe to terminate when the machine is otherwise ready to be rebooted/shutdown.
ReplicationController
ReplicaSets are the successors to ReplicationControllers. The two serve the same purpose, and behave similarly, except that a ReplicationController does not support set-based selector requirements as described in the labels user guide. As such, ReplicaSets are preferred over ReplicationControllers
4.2.3 - StatefulSets
StatefulSet is the workload API object used to manage stateful applications.
Manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods.
Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of their Pods. These pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.
If you want to use storage volumes to provide persistence for your workload, you can use a StatefulSet as part of the solution. Although individual Pods in a StatefulSet are susceptible to failure, the persistent Pod identifiers make it easier to match existing volumes to the new Pods that replace any that have failed.
Using StatefulSets
StatefulSets are valuable for applications that require one or more of the following.
- Stable, unique network identifiers.
- Stable, persistent storage.
- Ordered, graceful deployment and scaling.
- Ordered, automated rolling updates.
In the above, stable is synonymous with persistence across Pod (re)scheduling. If an application doesn't require any stable identifiers or ordered deployment, deletion, or scaling, you should deploy your application using a workload object that provides a set of stateless replicas. Deployment or ReplicaSet may be better suited to your stateless needs.
Limitations
- The storage for a given Pod must either be provisioned by a PersistentVolume Provisioner based on the requested
storage class
, or pre-provisioned by an admin. - Deleting and/or scaling a StatefulSet down will not delete the volumes associated with the StatefulSet. This is done to ensure data safety, which is generally more valuable than an automatic purge of all related StatefulSet resources.
- StatefulSets currently require a Headless Service to be responsible for the network identity of the Pods. You are responsible for creating this Service.
- StatefulSets do not provide any guarantees on the termination of pods when a StatefulSet is deleted. To achieve ordered and graceful termination of the pods in the StatefulSet, it is possible to scale the StatefulSet down to 0 prior to deletion.
- When using Rolling Updates with the default
Pod Management Policy (
OrderedReady
), it's possible to get into a broken state that requires manual intervention to repair.
Components
The example below demonstrates the components of a StatefulSet.
apiVersion: v1
kind: Service
metadata:
name: nginx
labels:
app: nginx
spec:
ports:
- port: 80
name: web
clusterIP: None
selector:
app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: web
spec:
selector:
matchLabels:
app: nginx # has to match .spec.template.metadata.labels
serviceName: "nginx"
replicas: 3 # by default is 1
template:
metadata:
labels:
app: nginx # has to match .spec.selector.matchLabels
spec:
terminationGracePeriodSeconds: 10
containers:
- name: nginx
image: k8s.gcr.io/nginx-slim:0.8
ports:
- containerPort: 80
name: web
volumeMounts:
- name: www
mountPath: /usr/share/nginx/html
volumeClaimTemplates:
- metadata:
name: www
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "my-storage-class"
resources:
requests:
storage: 1Gi
In the above example:
- A Headless Service, named
nginx
, is used to control the network domain. - The StatefulSet, named
web
, has a Spec that indicates that 3 replicas of the nginx container will be launched in unique Pods. - The
volumeClaimTemplates
will provide stable storage using PersistentVolumes provisioned by a PersistentVolume Provisioner.
The name of a StatefulSet object must be a valid DNS subdomain name.
Pod Selector
You must set the .spec.selector
field of a StatefulSet to match the labels of its .spec.template.metadata.labels
. Prior to Kubernetes 1.8, the .spec.selector
field was defaulted when omitted. In 1.8 and later versions, failing to specify a matching Pod Selector will result in a validation error during StatefulSet creation.
Pod Identity
StatefulSet Pods have a unique identity that is comprised of an ordinal, a stable network identity, and stable storage. The identity sticks to the Pod, regardless of which node it's (re)scheduled on.
Ordinal Index
For a StatefulSet with N replicas, each Pod in the StatefulSet will be assigned an integer ordinal, from 0 up through N-1, that is unique over the Set.
Stable Network ID
Each Pod in a StatefulSet derives its hostname from the name of the StatefulSet
and the ordinal of the Pod. The pattern for the constructed hostname
is $(statefulset name)-$(ordinal)
. The example above will create three Pods
named web-0,web-1,web-2
.
A StatefulSet can use a Headless Service
to control the domain of its Pods. The domain managed by this Service takes the form:
$(service name).$(namespace).svc.cluster.local
, where "cluster.local" is the
cluster domain.
As each Pod is created, it gets a matching DNS subdomain, taking the form:
$(podname).$(governing service domain)
, where the governing service is defined
by the serviceName
field on the StatefulSet.
Depending on how DNS is configured in your cluster, you may not be able to look up the DNS name for a newly-run Pod immediately. This behavior can occur when other clients in the cluster have already sent queries for the hostname of the Pod before it was created. Negative caching (normal in DNS) means that the results of previous failed lookups are remembered and reused, even after the Pod is running, for at least a few seconds.
If you need to discover Pods promptly after they are created, you have a few options:
- Query the Kubernetes API directly (for example, using a watch) rather than relying on DNS lookups.
- Decrease the time of caching in your Kubernetes DNS provider (typically this means editing the config map for CoreDNS, which currently caches for 30 seconds).
As mentioned in the limitations section, you are responsible for creating the Headless Service responsible for the network identity of the pods.
Here are some examples of choices for Cluster Domain, Service name, StatefulSet name, and how that affects the DNS names for the StatefulSet's Pods.
Cluster Domain | Service (ns/name) | StatefulSet (ns/name) | StatefulSet Domain | Pod DNS | Pod Hostname |
---|---|---|---|---|---|
cluster.local | default/nginx | default/web | nginx.default.svc.cluster.local | web-{0..N-1}.nginx.default.svc.cluster.local | web-{0..N-1} |
cluster.local | foo/nginx | foo/web | nginx.foo.svc.cluster.local | web-{0..N-1}.nginx.foo.svc.cluster.local | web-{0..N-1} |
kube.local | foo/nginx | foo/web | nginx.foo.svc.kube.local | web-{0..N-1}.nginx.foo.svc.kube.local | web-{0..N-1} |
Note: Cluster Domain will be set tocluster.local
unless otherwise configured.
Stable Storage
Kubernetes creates one PersistentVolume for each
VolumeClaimTemplate. In the nginx example above, each Pod will receive a single PersistentVolume
with a StorageClass of my-storage-class
and 1 Gib of provisioned storage. If no StorageClass
is specified, then the default StorageClass will be used. When a Pod is (re)scheduled
onto a node, its volumeMounts
mount the PersistentVolumes associated with its
PersistentVolume Claims. Note that, the PersistentVolumes associated with the
Pods' PersistentVolume Claims are not deleted when the Pods, or StatefulSet are deleted.
This must be done manually.
Pod Name Label
When the StatefulSet Controller creates a Pod,
it adds a label, statefulset.kubernetes.io/pod-name
, that is set to the name of
the Pod. This label allows you to attach a Service to a specific Pod in
the StatefulSet.
Deployment and Scaling Guarantees
- For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}.
- When Pods are being deleted, they are terminated in reverse order, from {N-1..0}.
- Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready.
- Before a Pod is terminated, all of its successors must be completely shutdown.
The StatefulSet should not specify a pod.Spec.TerminationGracePeriodSeconds
of 0. This practice is unsafe and strongly discouraged. For further explanation, please refer to force deleting StatefulSet Pods.
When the nginx example above is created, three Pods will be deployed in the order web-0, web-1, web-2. web-1 will not be deployed before web-0 is Running and Ready, and web-2 will not be deployed until web-1 is Running and Ready. If web-0 should fail, after web-1 is Running and Ready, but before web-2 is launched, web-2 will not be launched until web-0 is successfully relaunched and becomes Running and Ready.
If a user were to scale the deployed example by patching the StatefulSet such that
replicas=1
, web-2 would be terminated first. web-1 would not be terminated until web-2
is fully shutdown and deleted. If web-0 were to fail after web-2 has been terminated and
is completely shutdown, but prior to web-1's termination, web-1 would not be terminated
until web-0 is Running and Ready.
Pod Management Policies
In Kubernetes 1.7 and later, StatefulSet allows you to relax its ordering guarantees while
preserving its uniqueness and identity guarantees via its .spec.podManagementPolicy
field.
OrderedReady Pod Management
OrderedReady
pod management is the default for StatefulSets. It implements the behavior
described above.
Parallel Pod Management
Parallel
pod management tells the StatefulSet controller to launch or
terminate all Pods in parallel, and to not wait for Pods to become Running
and Ready or completely terminated prior to launching or terminating another
Pod. This option only affects the behavior for scaling operations. Updates are not
affected.
Update Strategies
In Kubernetes 1.7 and later, StatefulSet's .spec.updateStrategy
field allows you to configure
and disable automated rolling updates for containers, labels, resource request/limits, and
annotations for the Pods in a StatefulSet.
On Delete
The OnDelete
update strategy implements the legacy (1.6 and prior) behavior. When a StatefulSet's
.spec.updateStrategy.type
is set to OnDelete
, the StatefulSet controller will not automatically
update the Pods in a StatefulSet. Users must manually delete Pods to cause the controller to
create new Pods that reflect modifications made to a StatefulSet's .spec.template
.
Rolling Updates
The RollingUpdate
update strategy implements automated, rolling update for the Pods in a
StatefulSet. It is the default strategy when .spec.updateStrategy
is left unspecified. When a StatefulSet's .spec.updateStrategy.type
is set to RollingUpdate
, the
StatefulSet controller will delete and recreate each Pod in the StatefulSet. It will proceed
in the same order as Pod termination (from the largest ordinal to the smallest), updating
each Pod one at a time. It will wait until an updated Pod is Running and Ready prior to
updating its predecessor.
Partitions
The RollingUpdate
update strategy can be partitioned, by specifying a
.spec.updateStrategy.rollingUpdate.partition
. If a partition is specified, all Pods with an
ordinal that is greater than or equal to the partition will be updated when the StatefulSet's
.spec.template
is updated. All Pods with an ordinal that is less than the partition will not
be updated, and, even if they are deleted, they will be recreated at the previous version. If a
StatefulSet's .spec.updateStrategy.rollingUpdate.partition
is greater than its .spec.replicas
,
updates to its .spec.template
will not be propagated to its Pods.
In most cases you will not need to use a partition, but they are useful if you want to stage an
update, roll out a canary, or perform a phased roll out.
Forced Rollback
When using Rolling Updates with the default
Pod Management Policy (OrderedReady
),
it's possible to get into a broken state that requires manual intervention to repair.
If you update the Pod template to a configuration that never becomes Running and Ready (for example, due to a bad binary or application-level configuration error), StatefulSet will stop the rollout and wait.
In this state, it's not enough to revert the Pod template to a good configuration. Due to a known issue, StatefulSet will continue to wait for the broken Pod to become Ready (which never happens) before it will attempt to revert it back to the working configuration.
After reverting the template, you must also delete any Pods that StatefulSet had already attempted to run with the bad configuration. StatefulSet will then begin to recreate the Pods using the reverted template.
What's next
- Follow an example of deploying a stateful application.
- Follow an example of deploying Cassandra with Stateful Sets.
- Follow an example of running a replicated stateful application.
4.2.4 - DaemonSet
A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.
Some typical uses of a DaemonSet are:
- running a cluster storage daemon on every node
- running a logs collection daemon on every node
- running a node monitoring daemon on every node
In a simple case, one DaemonSet, covering all nodes, would be used for each type of daemon. A more complex setup might use multiple DaemonSets for a single type of daemon, but with different flags and/or different memory and cpu requests for different hardware types.
Writing a DaemonSet Spec
Create a DaemonSet
You can describe a DaemonSet in a YAML file. For example, the daemonset.yaml
file below describes a DaemonSet that runs the fluentd-elasticsearch Docker image:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd-elasticsearch
namespace: kube-system
labels:
k8s-app: fluentd-logging
spec:
selector:
matchLabels:
name: fluentd-elasticsearch
template:
metadata:
labels:
name: fluentd-elasticsearch
spec:
tolerations:
# this toleration is to have the daemonset runnable on master nodes
# remove it if your masters can't run pods
- key: node-role.kubernetes.io/master
effect: NoSchedule
containers:
- name: fluentd-elasticsearch
image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
resources:
limits:
memory: 200Mi
requests:
cpu: 100m
memory: 200Mi
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
terminationGracePeriodSeconds: 30
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
Create a DaemonSet based on the YAML file:
kubectl apply -f https://k8s.io/examples/controllers/daemonset.yaml
Required Fields
As with all other Kubernetes config, a DaemonSet needs apiVersion
, kind
, and metadata
fields. For
general information about working with config files, see
running stateless applications,
configuring containers, and object management using kubectl documents.
The name of a DaemonSet object must be a valid DNS subdomain name.
A DaemonSet also needs a .spec
section.
Pod Template
The .spec.template
is one of the required fields in .spec
.
The .spec.template
is a pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind
.
In addition to required fields for a Pod, a Pod template in a DaemonSet has to specify appropriate labels (see pod selector).
A Pod Template in a DaemonSet must have a RestartPolicy
equal to Always
, or be unspecified, which defaults to Always
.
Pod Selector
The .spec.selector
field is a pod selector. It works the same as the .spec.selector
of
a Job.
As of Kubernetes 1.8, you must specify a pod selector that matches the labels of the
.spec.template
. The pod selector will no longer be defaulted when left empty. Selector
defaulting was not compatible with kubectl apply
. Also, once a DaemonSet is created,
its .spec.selector
can not be mutated. Mutating the pod selector can lead to the
unintentional orphaning of Pods, and it was found to be confusing to users.
The .spec.selector
is an object consisting of two fields:
matchLabels
- works the same as the.spec.selector
of a ReplicationController.matchExpressions
- allows to build more sophisticated selectors by specifying key, list of values and an operator that relates the key and values.
When the two are specified the result is ANDed.
If the .spec.selector
is specified, it must match the .spec.template.metadata.labels
. Config with these not matching will be rejected by the API.
Running Pods on select Nodes
If you specify a .spec.template.spec.nodeSelector
, then the DaemonSet controller will
create Pods on nodes which match that node
selector. Likewise if you specify a .spec.template.spec.affinity
,
then DaemonSet controller will create Pods on nodes which match that node affinity.
If you do not specify either, then the DaemonSet controller will create Pods on all nodes.
How Daemon Pods are scheduled
Scheduled by default scheduler
Kubernetes v1.21 [stable]
A DaemonSet ensures that all eligible nodes run a copy of a Pod. Normally, the node that a Pod runs on is selected by the Kubernetes scheduler. However, DaemonSet pods are created and scheduled by the DaemonSet controller instead. That introduces the following issues:
- Inconsistent Pod behavior: Normal Pods waiting to be scheduled are created
and in
Pending
state, but DaemonSet pods are not created inPending
state. This is confusing to the user. - Pod preemption is handled by default scheduler. When preemption is enabled, the DaemonSet controller will make scheduling decisions without considering pod priority and preemption.
ScheduleDaemonSetPods
allows you to schedule DaemonSets using the default
scheduler instead of the DaemonSet controller, by adding the NodeAffinity
term
to the DaemonSet pods, instead of the .spec.nodeName
term. The default
scheduler is then used to bind the pod to the target host. If node affinity of
the DaemonSet pod already exists, it is replaced (the original node affinity was taken into account before selecting the target host). The DaemonSet controller only
performs these operations when creating or modifying DaemonSet pods, and no
changes are made to the spec.template
of the DaemonSet.
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchFields:
- key: metadata.name
operator: In
values:
- target-host-name
In addition, node.kubernetes.io/unschedulable:NoSchedule
toleration is added
automatically to DaemonSet Pods. The default scheduler ignores
unschedulable
Nodes when scheduling DaemonSet Pods.
Taints and Tolerations
Although Daemon Pods respect taints and tolerations, the following tolerations are added to DaemonSet Pods automatically according to the related features.
Toleration Key | Effect | Version | Description |
---|---|---|---|
node.kubernetes.io/not-ready | NoExecute | 1.13+ | DaemonSet pods will not be evicted when there are node problems such as a network partition. |
node.kubernetes.io/unreachable | NoExecute | 1.13+ | DaemonSet pods will not be evicted when there are node problems such as a network partition. |
node.kubernetes.io/disk-pressure | NoSchedule | 1.8+ | DaemonSet pods tolerate disk-pressure attributes by default scheduler. |
node.kubernetes.io/memory-pressure | NoSchedule | 1.8+ | DaemonSet pods tolerate memory-pressure attributes by default scheduler. |
node.kubernetes.io/unschedulable | NoSchedule | 1.12+ | DaemonSet pods tolerate unschedulable attributes by default scheduler. |
node.kubernetes.io/network-unavailable | NoSchedule | 1.12+ | DaemonSet pods, who uses host network, tolerate network-unavailable attributes by default scheduler. |
Communicating with Daemon Pods
Some possible patterns for communicating with Pods in a DaemonSet are:
- Push: Pods in the DaemonSet are configured to send updates to another service, such as a stats database. They do not have clients.
- NodeIP and Known Port: Pods in the DaemonSet can use a
hostPort
, so that the pods are reachable via the node IPs. Clients know the list of node IPs somehow, and know the port by convention. - DNS: Create a headless service with the same pod selector,
and then discover DaemonSets using the
endpoints
resource or retrieve multiple A records from DNS. - Service: Create a service with the same Pod selector, and use the service to reach a daemon on a random node. (No way to reach specific node.)
Updating a DaemonSet
If node labels are changed, the DaemonSet will promptly add Pods to newly matching nodes and delete Pods from newly not-matching nodes.
You can modify the Pods that a DaemonSet creates. However, Pods do not allow all fields to be updated. Also, the DaemonSet controller will use the original template the next time a node (even with the same name) is created.
You can delete a DaemonSet. If you specify --cascade=false
with kubectl
, then the Pods
will be left on the nodes. If you subsequently create a new DaemonSet with the same selector,
the new DaemonSet adopts the existing Pods. If any Pods need replacing the DaemonSet replaces
them according to its updateStrategy
.
You can perform a rolling update on a DaemonSet.
Alternatives to DaemonSet
Init scripts
It is certainly possible to run daemon processes by directly starting them on a node (e.g. using
init
, upstartd
, or systemd
). This is perfectly fine. However, there are several advantages to
running such processes via a DaemonSet:
- Ability to monitor and manage logs for daemons in the same way as applications.
- Same config language and tools (e.g. Pod templates,
kubectl
) for daemons and applications. - Running daemons in containers with resource limits increases isolation between daemons from app containers. However, this can also be accomplished by running the daemons in a container but not in a Pod (e.g. start directly via Docker).
Bare Pods
It is possible to create Pods directly which specify a particular node to run on. However, a DaemonSet replaces Pods that are deleted or terminated for any reason, such as in the case of node failure or disruptive node maintenance, such as a kernel upgrade. For this reason, you should use a DaemonSet rather than creating individual Pods.
Static Pods
It is possible to create Pods by writing a file to a certain directory watched by Kubelet. These are called static pods. Unlike DaemonSet, static Pods cannot be managed with kubectl or other Kubernetes API clients. Static Pods do not depend on the apiserver, making them useful in cluster bootstrapping cases. Also, static Pods may be deprecated in the future.
Deployments
DaemonSets are similar to Deployments in that they both create Pods, and those Pods have processes which are not expected to terminate (e.g. web servers, storage servers).
Use a Deployment for stateless services, like frontends, where scaling up and down the number of replicas and rolling out updates are more important than controlling exactly which host the Pod runs on. Use a DaemonSet when it is important that a copy of a Pod always run on all or certain hosts, and when it needs to start before other Pods.
4.2.5 - Jobs
A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate. As pods successfully complete, the Job tracks the successful completions. When a specified number of successful completions is reached, the task (ie, Job) is complete. Deleting a Job will clean up the Pods it created. Suspending a Job will delete its active Pods until the Job is resumed again.
A simple case is to create one Job object in order to reliably run one Pod to completion. The Job object will start a new Pod if the first Pod fails or is deleted (for example due to a node hardware failure or a node reboot).
You can also use a Job to run multiple Pods in parallel.
Running an example Job
Here is an example Job config. It computes π to 2000 places and prints it out. It takes around 10s to complete.
apiVersion: batch/v1
kind: Job
metadata:
name: pi
spec:
template:
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restartPolicy: Never
backoffLimit: 4
You can run the example with this command:
kubectl apply -f https://kubernetes.io/examples/controllers/job.yaml
The output is similar to this:
job.batch/pi created
Check on the status of the Job with kubectl
:
kubectl describe jobs/pi
The output is similar to this:
Name: pi
Namespace: default
Selector: controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
Labels: controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
job-name=pi
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"batch/v1","kind":"Job","metadata":{"annotations":{},"name":"pi","namespace":"default"},"spec":{"backoffLimit":4,"template":...
Parallelism: 1
Completions: 1
Start Time: Mon, 02 Dec 2019 15:20:11 +0200
Completed At: Mon, 02 Dec 2019 15:21:16 +0200
Duration: 65s
Pods Statuses: 0 Running / 1 Succeeded / 0 Failed
Pod Template:
Labels: controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
job-name=pi
Containers:
pi:
Image: perl
Port: <none>
Host Port: <none>
Command:
perl
-Mbignum=bpi
-wle
print bpi(2000)
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 14m job-controller Created pod: pi-5rwd7
To view completed Pods of a Job, use kubectl get pods
.
To list all the Pods that belong to a Job in a machine readable form, you can use a command like this:
pods=$(kubectl get pods --selector=job-name=pi --output=jsonpath='{.items[*].metadata.name}')
echo $pods
The output is similar to this:
pi-5rwd7
Here, the selector is the same as the selector for the Job. The --output=jsonpath
option specifies an expression
with the name from each Pod in the returned list.
View the standard output of one of the pods:
kubectl logs $pods
The output is similar to this:
3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185950244594553469083026425223082533446850352619311881710100031378387528865875332083814206171776691473035982534904287554687311595628638823537875937519577818577805321712268066130019278766111959092164201989380952572010654858632788659361533818279682303019520353018529689957736225994138912497217752834791315155748572424541506959508295331168617278558890750983817546374649393192550604009277016711390098488240128583616035637076601047101819429555961989467678374494482553797747268471040475346462080466842590694912933136770289891521047521620569660240580381501935112533824300355876402474964732639141992726042699227967823547816360093417216412199245863150302861829745557067498385054945885869269956909272107975093029553211653449872027559602364806654991198818347977535663698074265425278625518184175746728909777727938000816470600161452491921732172147723501414419735685481613611573525521334757418494684385233239073941433345477624168625189835694855620992192221842725502542568876717904946016534668049886272327917860857843838279679766814541009538837863609506800642251252051173929848960841284886269456042419652850222106611863067442786220391949450471237137869609563643719172874677646575739624138908658326459958133904780275901
Writing a Job spec
As with all other Kubernetes config, a Job needs apiVersion
, kind
, and metadata
fields.
Its name must be a valid DNS subdomain name.
A Job also needs a .spec
section.
Pod Template
The .spec.template
is the only required field of the .spec
.
The .spec.template
is a pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind
.
In addition to required fields for a Pod, a pod template in a Job must specify appropriate labels (see pod selector) and an appropriate restart policy.
Only a RestartPolicy
equal to Never
or OnFailure
is allowed.
Pod selector
The .spec.selector
field is optional. In almost all cases you should not specify it.
See section specifying your own pod selector.
Parallel execution for Jobs
There are three main types of task suitable to run as a Job:
- Non-parallel Jobs
- normally, only one Pod is started, unless the Pod fails.
- the Job is complete as soon as its Pod terminates successfully.
- Parallel Jobs with a fixed completion count:
- specify a non-zero positive value for
.spec.completions
. - the Job represents the overall task, and is complete when there are
.spec.completions
successful Pods. - when using
.spec.completionMode="Indexed"
, each Pod gets a different index in the range 0 to.spec.completions-1
.
- specify a non-zero positive value for
- Parallel Jobs with a work queue:
- do not specify
.spec.completions
, default to.spec.parallelism
. - the Pods must coordinate amongst themselves or an external service to determine what each should work on. For example, a Pod might fetch a batch of up to N items from the work queue.
- each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done.
- when any Pod from the Job terminates with success, no new Pods are created.
- once at least one Pod has terminated with success and all Pods are terminated, then the Job is completed with success.
- once any Pod has exited with success, no other Pod should still be doing any work for this task or writing any output. They should all be in the process of exiting.
- do not specify
For a non-parallel Job, you can leave both .spec.completions
and .spec.parallelism
unset. When both are
unset, both are defaulted to 1.
For a fixed completion count Job, you should set .spec.completions
to the number of completions needed.
You can set .spec.parallelism
, or leave it unset and it will default to 1.
For a work queue Job, you must leave .spec.completions
unset, and set .spec.parallelism
to
a non-negative integer.
For more information about how to make use of the different types of job, see the job patterns section.
Controlling parallelism
The requested parallelism (.spec.parallelism
) can be set to any non-negative value.
If it is unspecified, it defaults to 1.
If it is specified as 0, then the Job is effectively paused until it is increased.
Actual parallelism (number of pods running at any instant) may be more or less than requested parallelism, for a variety of reasons:
- For fixed completion count Jobs, the actual number of pods running in parallel will not exceed the number of
remaining completions. Higher values of
.spec.parallelism
are effectively ignored. - For work queue Jobs, no new Pods are started after any Pod has succeeded -- remaining Pods are allowed to complete, however.
- If the Job Controller has not had time to react.
- If the Job controller failed to create Pods for any reason (lack of
ResourceQuota
, lack of permission, etc.), then there may be fewer pods than requested. - The Job controller may throttle new Pod creation due to excessive previous pod failures in the same Job.
- When a Pod is gracefully shut down, it takes time to stop.
Completion mode
Kubernetes v1.21 [alpha]
Note: To be able to create Indexed Jobs, make sure to enable theIndexedJob
feature gate on the API server and the controller manager.
Jobs with fixed completion count - that is, jobs that have non null
.spec.completions
- can have a completion mode that is specified in .spec.completionMode
:
NonIndexed
(default): the Job is considered complete when there have been.spec.completions
successfully completed Pods. In other words, each Pod completion is homologous to each other. Note that Jobs that have null.spec.completions
are implicitlyNonIndexed
.Indexed
: the Pods of a Job get an associated completion index from 0 to.spec.completions-1
, available in the annotationbatch.kubernetes.io/job-completion-index
. The Job is considered complete when there is one successfully completed Pod for each index. For more information about how to use this mode, see Indexed Job for Parallel Processing with Static Work Assignment. Note that, although rare, more than one Pod could be started for the same index, but only one of them will count towards the completion count.
Handling Pod and container failures
A container in a Pod may fail for a number of reasons, such as because the process in it exited with
a non-zero exit code, or the container was killed for exceeding a memory limit, etc. If this
happens, and the .spec.template.spec.restartPolicy = "OnFailure"
, then the Pod stays
on the node, but the container is re-run. Therefore, your program needs to handle the case when it is
restarted locally, or else specify .spec.template.spec.restartPolicy = "Never"
.
See pod lifecycle for more information on restartPolicy
.
An entire Pod can also fail, for a number of reasons, such as when the pod is kicked off the node
(node is upgraded, rebooted, deleted, etc.), or if a container of the Pod fails and the
.spec.template.spec.restartPolicy = "Never"
. When a Pod fails, then the Job controller
starts a new Pod. This means that your application needs to handle the case when it is restarted in a new
pod. In particular, it needs to handle temporary files, locks, incomplete output and the like
caused by previous runs.
Note that even if you specify .spec.parallelism = 1
and .spec.completions = 1
and
.spec.template.spec.restartPolicy = "Never"
, the same program may
sometimes be started twice.
If you do specify .spec.parallelism
and .spec.completions
both greater than 1, then there may be
multiple pods running at once. Therefore, your pods must also be tolerant of concurrency.
Pod backoff failure policy
There are situations where you want to fail a Job after some amount of retries
due to a logical error in configuration etc.
To do so, set .spec.backoffLimit
to specify the number of retries before
considering a Job as failed. The back-off limit is set by default to 6. Failed
Pods associated with the Job are recreated by the Job controller with an
exponential back-off delay (10s, 20s, 40s ...) capped at six minutes. The
back-off count is reset when a Job's Pod is deleted or successful without any
other Pods for the Job failing around that time.
Note: If your job hasrestartPolicy = "OnFailure"
, keep in mind that your container running the Job will be terminated once the job backoff limit has been reached. This can make debugging the Job's executable more difficult. We suggest settingrestartPolicy = "Never"
when debugging the Job or using a logging system to ensure output from failed Jobs is not lost inadvertently.
Job termination and cleanup
When a Job completes, no more Pods are created, but the Pods are not deleted either. Keeping them around
allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output.
The job object also remains after it is completed so that you can view its status. It is up to the user to delete
old jobs after noting their status. Delete the job with kubectl
(e.g. kubectl delete jobs/pi
or kubectl delete -f ./job.yaml
). When you delete the job using kubectl
, all the pods it created are deleted too.
By default, a Job will run uninterrupted unless a Pod fails (restartPolicy=Never
) or a Container exits in error (restartPolicy=OnFailure
), at which point the Job defers to the
.spec.backoffLimit
described above. Once .spec.backoffLimit
has been reached the Job will be marked as failed and any running Pods will be terminated.
Another way to terminate a Job is by setting an active deadline.
Do this by setting the .spec.activeDeadlineSeconds
field of the Job to a number of seconds.
The activeDeadlineSeconds
applies to the duration of the job, no matter how many Pods are created.
Once a Job reaches activeDeadlineSeconds
, all of its running Pods are terminated and the Job status will become type: Failed
with reason: DeadlineExceeded
.
Note that a Job's .spec.activeDeadlineSeconds
takes precedence over its .spec.backoffLimit
. Therefore, a Job that is retrying one or more failed Pods will not deploy additional Pods once it reaches the time limit specified by activeDeadlineSeconds
, even if the backoffLimit
is not yet reached.
Example:
apiVersion: batch/v1
kind: Job
metadata:
name: pi-with-timeout
spec:
backoffLimit: 5
activeDeadlineSeconds: 100
template:
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restartPolicy: Never
Note that both the Job spec and the Pod template spec within the Job have an activeDeadlineSeconds
field. Ensure that you set this field at the proper level.
Keep in mind that the restartPolicy
applies to the Pod, and not to the Job itself: there is no automatic Job restart once the Job status is type: Failed
.
That is, the Job termination mechanisms activated with .spec.activeDeadlineSeconds
and .spec.backoffLimit
result in a permanent Job failure that requires manual intervention to resolve.
Clean up finished jobs automatically
Finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
TTL mechanism for finished Jobs
Kubernetes v1.12 [alpha]
Another way to clean up finished Jobs (either Complete
or Failed
)
automatically is to use a TTL mechanism provided by a
TTL controller for
finished resources, by specifying the .spec.ttlSecondsAfterFinished
field of
the Job.
When the TTL controller cleans up the Job, it will delete the Job cascadingly, i.e. delete its dependent objects, such as Pods, together with the Job. Note that when the Job is deleted, its lifecycle guarantees, such as finalizers, will be honored.
For example:
apiVersion: batch/v1
kind: Job
metadata:
name: pi-with-ttl
spec:
ttlSecondsAfterFinished: 100
template:
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restartPolicy: Never
The Job pi-with-ttl
will be eligible to be automatically deleted, 100
seconds after it finishes.
If the field is set to 0
, the Job will be eligible to be automatically deleted
immediately after it finishes. If the field is unset, this Job won't be cleaned
up by the TTL controller after it finishes.
Note that this TTL mechanism is alpha, with feature gate TTLAfterFinished
. For
more information, see the documentation for
TTL controller for
finished resources.
Job patterns
The Job object can be used to support reliable parallel execution of Pods. The Job object is not designed to support closely-communicating parallel processes, as commonly found in scientific computing. It does support parallel processing of a set of independent but related work items. These might be emails to be sent, frames to be rendered, files to be transcoded, ranges of keys in a NoSQL database to scan, and so on.
In a complex system, there may be multiple different sets of work items. Here we are just considering one set of work items that the user wants to manage together — a batch job.
There are several different patterns for parallel computation, each with strengths and weaknesses. The tradeoffs are:
- One Job object for each work item, vs. a single Job object for all work items. The latter is better for large numbers of work items. The former creates some overhead for the user and for the system to manage large numbers of Job objects.
- Number of pods created equals number of work items, vs. each Pod can process multiple work items. The former typically requires less modification to existing code and containers. The latter is better for large numbers of work items, for similar reasons to the previous bullet.
- Several approaches use a work queue. This requires running a queue service, and modifications to the existing program or container to make it use the work queue. Other approaches are easier to adapt to an existing containerised application.
The tradeoffs are summarized here, with columns 2 to 4 corresponding to the above tradeoffs. The pattern names are also links to examples and more detailed description.
Pattern | Single Job object | Fewer pods than work items? | Use app unmodified? |
---|---|---|---|
Queue with Pod Per Work Item | ✓ | sometimes | |
Queue with Variable Pod Count | ✓ | ✓ | |
Indexed Job with Static Work Assignment | ✓ | ✓ | |
Job Template Expansion | ✓ |
When you specify completions with .spec.completions
, each Pod created by the Job controller
has an identical spec
. This means that
all pods for a task will have the same command line and the same
image, the same volumes, and (almost) the same environment variables. These patterns
are different ways to arrange for pods to work on different things.
This table shows the required settings for .spec.parallelism
and .spec.completions
for each of the patterns.
Here, W
is the number of work items.
Pattern | .spec.completions | .spec.parallelism |
---|---|---|
Queue with Pod Per Work Item | W | any |
Queue with Variable Pod Count | null | any |
Indexed Job with Static Work Assignment | W | any |
Job Template Expansion | 1 | should be 1 |
Advanced usage
Suspending a Job
Kubernetes v1.21 [alpha]
Note: Suspending Jobs is available in Kubernetes versions 1.21 and above. You must enable theSuspendJob
feature gate on the API server and the controller manager in order to use this feature.
When a Job is created, the Job controller will immediately begin creating Pods
to satisfy the Job's requirements and will continue to do so until the Job is
complete. However, you may want to temporarily suspend a Job's execution and
resume it later. To suspend a Job, you can update the .spec.suspend
field of
the Job to true; later, when you want to resume it again, update it to false.
Creating a Job with .spec.suspend
set to true will create it in the suspended
state.
When a Job is resumed from suspension, its .status.startTime
field will be
reset to the current time. This means that the .spec.activeDeadlineSeconds
timer will be stopped and reset when a Job is suspended and resumed.
Remember that suspending a Job will delete all active Pods. When the Job is
suspended, your Pods will be terminated
with a SIGTERM signal. The Pod's graceful termination period will be honored and
your Pod must handle this signal in this period. This may involve saving
progress for later or undoing changes. Pods terminated this way will not count
towards the Job's completions
count.
An example Job definition in the suspended state can be like so:
kubectl get job myjob -o yaml
apiVersion: batch/v1
kind: Job
metadata:
name: myjob
spec:
suspend: true
parallelism: 1
completions: 5
template:
spec:
...
The Job's status can be used to determine if a Job is suspended or has been suspended in the past:
kubectl get jobs/myjob -o yaml
apiVersion: batch/v1
kind: Job
# .metadata and .spec omitted
status:
conditions:
- lastProbeTime: "2021-02-05T13:14:33Z"
lastTransitionTime: "2021-02-05T13:14:33Z"
status: "True"
type: Suspended
startTime: "2021-02-05T13:13:48Z"
The Job condition of type "Suspended" with status "True" means the Job is
suspended; the lastTransitionTime
field can be used to determine how long the
Job has been suspended for. If the status of that condition is "False", then the
Job was previously suspended and is now running. If such a condition does not
exist in the Job's status, the Job has never been stopped.
Events are also created when the Job is suspended and resumed:
kubectl describe jobs/myjob
Name: myjob
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 12m job-controller Created pod: myjob-hlrpl
Normal SuccessfulDelete 11m job-controller Deleted pod: myjob-hlrpl
Normal Suspended 11m job-controller Job suspended
Normal SuccessfulCreate 3s job-controller Created pod: myjob-jvb44
Normal Resumed 3s job-controller Job resumed
The last four events, particularly the "Suspended" and "Resumed" events, are
directly a result of toggling the .spec.suspend
field. In the time between
these two events, we see that no Pods were created, but Pod creation restarted
as soon as the Job was resumed.
Specifying your own Pod selector
Normally, when you create a Job object, you do not specify .spec.selector
.
The system defaulting logic adds this field when the Job is created.
It picks a selector value that will not overlap with any other jobs.
However, in some cases, you might need to override this automatically set selector.
To do this, you can specify the .spec.selector
of the Job.
Be very careful when doing this. If you specify a label selector which is not
unique to the pods of that Job, and which matches unrelated Pods, then pods of the unrelated
job may be deleted, or this Job may count other Pods as completing it, or one or both
Jobs may refuse to create Pods or run to completion. If a non-unique selector is
chosen, then other controllers (e.g. ReplicationController) and their Pods may behave
in unpredictable ways too. Kubernetes will not stop you from making a mistake when
specifying .spec.selector
.
Here is an example of a case when you might want to use this feature.
Say Job old
is already running. You want existing Pods
to keep running, but you want the rest of the Pods it creates
to use a different pod template and for the Job to have a new name.
You cannot update the Job because these fields are not updatable.
Therefore, you delete Job old
but leave its pods
running, using kubectl delete jobs/old --cascade=false
.
Before deleting it, you make a note of what selector it uses:
kubectl get job old -o yaml
The output is similar to this:
kind: Job
metadata:
name: old
...
spec:
selector:
matchLabels:
controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
...
Then you create a new Job with name new
and you explicitly specify the same selector.
Since the existing Pods have label controller-uid=a8f3d00d-c6d2-11e5-9f87-42010af00002
,
they are controlled by Job new
as well.
You need to specify manualSelector: true
in the new Job since you are not using
the selector that the system normally generates for you automatically.
kind: Job
metadata:
name: new
...
spec:
manualSelector: true
selector:
matchLabels:
controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
...
The new Job itself will have a different uid from a8f3d00d-c6d2-11e5-9f87-42010af00002
. Setting
manualSelector: true
tells the system to that you know what you are doing and to allow this
mismatch.
Alternatives
Bare Pods
When the node that a Pod is running on reboots or fails, the pod is terminated and will not be restarted. However, a Job will create new Pods to replace terminated ones. For this reason, we recommend that you use a Job rather than a bare Pod, even if your application requires only a single Pod.
Replication Controller
Jobs are complementary to Replication Controllers. A Replication Controller manages Pods which are not expected to terminate (e.g. web servers), and a Job manages Pods that are expected to terminate (e.g. batch tasks).
As discussed in Pod Lifecycle, Job
is only appropriate
for pods with RestartPolicy
equal to OnFailure
or Never
.
(Note: If RestartPolicy
is not set, the default value is Always
.)
Single Job starts controller Pod
Another pattern is for a single Job to create a Pod which then creates other Pods, acting as a sort of custom controller for those Pods. This allows the most flexibility, but may be somewhat complicated to get started with and offers less integration with Kubernetes.
One example of this pattern would be a Job which starts a Pod which runs a script that in turn starts a Spark master controller (see spark example), runs a spark driver, and then cleans up.
An advantage of this approach is that the overall process gets the completion guarantee of a Job object, but maintains complete control over what Pods are created and how work is assigned to them.
Cron Jobs
You can use a CronJob
to create a Job that will run at specified times/dates, similar to the Unix tool cron
.
4.2.6 - Garbage Collection
The role of the Kubernetes garbage collector is to delete certain objects that once had an owner, but no longer have an owner.
Owners and dependents
Some Kubernetes objects are owners of other objects. For example, a ReplicaSet
is the owner of a set of Pods. The owned objects are called dependents of the
owner object. Every dependent object has a metadata.ownerReferences
field that
points to the owning object.
Sometimes, Kubernetes sets the value of ownerReference
automatically. For
example, when you create a ReplicaSet, Kubernetes automatically sets the
ownerReference
field of each Pod in the ReplicaSet. In 1.8, Kubernetes
automatically sets the value of ownerReference
for objects created or adopted
by ReplicationController, ReplicaSet, StatefulSet, DaemonSet, Deployment, Job
and CronJob.
You can also specify relationships between owners and dependents by manually
setting the ownerReference
field.
Here's a configuration file for a ReplicaSet that has three Pods:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
name: my-repset
spec:
replicas: 3
selector:
matchLabels:
pod-is-for: garbage-collection-example
template:
metadata:
labels:
pod-is-for: garbage-collection-example
spec:
containers:
- name: nginx
image: nginx
If you create the ReplicaSet and then view the Pod metadata, you can see OwnerReferences field:
kubectl apply -f https://k8s.io/examples/controllers/replicaset.yaml
kubectl get pods --output=yaml
The output shows that the Pod owner is a ReplicaSet named my-repset
:
apiVersion: v1
kind: Pod
metadata:
...
ownerReferences:
- apiVersion: apps/v1
controller: true
blockOwnerDeletion: true
kind: ReplicaSet
name: my-repset
uid: d9607e19-f88f-11e6-a518-42010a800195
...
Note:Cross-namespace owner references are disallowed by design.
Namespaced dependents can specify cluster-scoped or namespaced owners. A namespaced owner must exist in the same namespace as the dependent. If it does not, the owner reference is treated as absent, and the dependent is subject to deletion once all owners are verified absent.
Cluster-scoped dependents can only specify cluster-scoped owners. In v1.20+, if a cluster-scoped dependent specifies a namespaced kind as an owner, it is treated as having an unresolveable owner reference, and is not able to be garbage collected.
In v1.20+, if the garbage collector detects an invalid cross-namespace
ownerReference
, or a cluster-scoped dependent with anownerReference
referencing a namespaced kind, a warning Event with a reason ofOwnerRefInvalidNamespace
and aninvolvedObject
of the invalid dependent is reported. You can check for that kind of Event by runningkubectl get events -A --field-selector=reason=OwnerRefInvalidNamespace
.
Controlling how the garbage collector deletes dependents
When you delete an object, you can specify whether the object's dependents are also deleted automatically. Deleting dependents automatically is called cascading deletion. There are two modes of cascading deletion: background and foreground.
If you delete an object without deleting its dependents automatically, the dependents are said to be orphaned.
Foreground cascading deletion
In foreground cascading deletion, the root object first enters a "deletion in progress" state. In the "deletion in progress" state, the following things are true:
- The object is still visible via the REST API
- The object's
deletionTimestamp
is set - The object's
metadata.finalizers
contains the value "foregroundDeletion".
Once the "deletion in progress" state is set, the garbage
collector deletes the object's dependents. Once the garbage collector has deleted all
"blocking" dependents (objects with ownerReference.blockOwnerDeletion=true
), it deletes
the owner object.
Note that in the "foregroundDeletion", only dependents with
ownerReference.blockOwnerDeletion=true
block the deletion of the owner object.
Kubernetes version 1.7 added an admission controller that controls user access to set
blockOwnerDeletion
to true based on delete permissions on the owner object, so that
unauthorized dependents cannot delay deletion of an owner object.
If an object's ownerReferences
field is set by a controller (such as Deployment or ReplicaSet),
blockOwnerDeletion is set automatically and you do not need to manually modify this field.
Background cascading deletion
In background cascading deletion, Kubernetes deletes the owner object immediately and the garbage collector then deletes the dependents in the background.
Setting the cascading deletion policy
To control the cascading deletion policy, set the propagationPolicy
field on the deleteOptions
argument when deleting an Object. Possible values include "Orphan",
"Foreground", or "Background".
Here's an example that deletes dependents in background:
kubectl proxy --port=8080
curl -X DELETE localhost:8080/apis/apps/v1/namespaces/default/replicasets/my-repset \
-d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Background"}' \
-H "Content-Type: application/json"
Here's an example that deletes dependents in foreground:
kubectl proxy --port=8080
curl -X DELETE localhost:8080/apis/apps/v1/namespaces/default/replicasets/my-repset \
-d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Foreground"}' \
-H "Content-Type: application/json"
Here's an example that orphans dependents:
kubectl proxy --port=8080
curl -X DELETE localhost:8080/apis/apps/v1/namespaces/default/replicasets/my-repset \
-d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Orphan"}' \
-H "Content-Type: application/json"
kubectl also supports cascading deletion.
To delete dependents in the foreground using kubectl, set --cascade=foreground
. To
orphan dependents, set --cascade=orphan
.
The default behavior is to delete the dependents in the background which is the
behavior when --cascade
is omitted or explicitly set to background
.
Here's an example that orphans the dependents of a ReplicaSet:
kubectl delete replicaset my-repset --cascade=orphan
Additional note on Deployments
Prior to 1.7, When using cascading deletes with Deployments you must use propagationPolicy: Foreground
to delete not only the ReplicaSets created, but also their Pods. If this type of propagationPolicy
is not used, only the ReplicaSets will be deleted, and the Pods will be orphaned.
See kubeadm/#149 for more information.
Known issues
Tracked at #26120
What's next
4.2.7 - TTL Controller for Finished Resources
Kubernetes v1.21 [beta]
The TTL controller provides a TTL (time to live) mechanism to limit the lifetime of resource objects that have finished execution. TTL controller only handles Jobs for now, and may be expanded to handle other resources that will finish execution, such as Pods and custom resources.
This feature is currently beta and enabled by default, and can be disabled via
feature gate
TTLAfterFinished
in both kube-apiserver and kube-controller-manager.
TTL Controller
The TTL controller only supports Jobs for now. A cluster operator can use this feature to clean
up finished Jobs (either Complete
or Failed
) automatically by specifying the
.spec.ttlSecondsAfterFinished
field of a Job, as in this
example.
The TTL controller will assume that a resource is eligible to be cleaned up
TTL seconds after the resource has finished, in other words, when the TTL has expired. When the
TTL controller cleans up a resource, it will delete it cascadingly, that is to say it will delete
its dependent objects together with it. Note that when the resource is deleted,
its lifecycle guarantees, such as finalizers, will be honored.
The TTL seconds can be set at any time. Here are some examples for setting the
.spec.ttlSecondsAfterFinished
field of a Job:
- Specify this field in the resource manifest, so that a Job can be cleaned up automatically some time after it finishes.
- Set this field of existing, already finished resources, to adopt this new feature.
- Use a mutating admission webhook to set this field dynamically at resource creation time. Cluster administrators can use this to enforce a TTL policy for finished resources.
- Use a mutating admission webhook to set this field dynamically after the resource has finished, and choose different TTL values based on resource status, labels, etc.
Caveat
Updating TTL Seconds
Note that the TTL period, e.g. .spec.ttlSecondsAfterFinished
field of Jobs,
can be modified after the resource is created or has finished. However, once the
Job becomes eligible to be deleted (when the TTL has expired), the system won't
guarantee that the Jobs will be kept, even if an update to extend the TTL
returns a successful API response.
Time Skew
Because TTL controller uses timestamps stored in the Kubernetes resources to determine whether the TTL has expired or not, this feature is sensitive to time skew in the cluster, which may cause TTL controller to clean up resource objects at the wrong time.
In Kubernetes, it's required to run NTP on all nodes (see #6159) to avoid time skew. Clocks aren't always correct, but the difference should be very small. Please be aware of this risk when setting a non-zero TTL.
What's next
4.2.8 - CronJob
Kubernetes v1.21 [stable]
A CronJob creates Jobs on a repeating schedule.
One CronJob object is like one line of a crontab (cron table) file. It runs a job periodically on a given schedule, written in Cron format.
Caution:All CronJob
schedule:
times are based on the timezone of the kube-controller-manager.If your control plane runs the kube-controller-manager in Pods or bare containers, the timezone set for the kube-controller-manager container determines the timezone that the cron job controller uses.
When creating the manifest for a CronJob resource, make sure the name you provide is a valid DNS subdomain name. The name must be no longer than 52 characters. This is because the CronJob controller will automatically append 11 characters to the job name provided and there is a constraint that the maximum length of a Job name is no more than 63 characters.
CronJob
CronJobs are useful for creating periodic and recurring tasks, like running backups or sending emails. CronJobs can also schedule individual tasks for a specific time, such as scheduling a Job for when your cluster is likely to be idle.
Example
This example CronJob manifest prints the current time and a hello message every minute:
apiVersion: batch/v1
kind: CronJob
metadata:
name: hello
spec:
schedule: "*/1 * * * *"
jobTemplate:
spec:
template:
spec:
containers:
- name: hello
image: busybox
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- -c
- date; echo Hello from the Kubernetes cluster
restartPolicy: OnFailure
(Running Automated Tasks with a CronJob takes you through this example in more detail).
Cron schedule syntax
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday;
# │ │ │ │ │ 7 is also Sunday on some systems)
# │ │ │ │ │
# │ │ │ │ │
# * * * * *
Entry | Description | Equivalent to |
---|---|---|
@yearly (or @annually) | Run once a year at midnight of 1 January | 0 0 1 1 * |
@monthly | Run once a month at midnight of the first day of the month | 0 0 1 * * |
@weekly | Run once a week at midnight on Sunday morning | 0 0 * * 0 |
@daily (or @midnight) | Run once a day at midnight | 0 0 * * * |
@hourly | Run once an hour at the beginning of the hour | 0 * * * * |
For example, the line below states that the task must be started every Friday at midnight, as well as on the 13th of each month at midnight:
0 0 13 * 5
To generate CronJob schedule expressions, you can also use web tools like crontab.guru.
CronJob limitations
A cron job creates a job object about once per execution time of its schedule. We say "about" because there are certain circumstances where two jobs might be created, or no job might be created. We attempt to make these rare, but do not completely prevent them. Therefore, jobs should be idempotent.
If startingDeadlineSeconds
is set to a large value or left unset (the default)
and if concurrencyPolicy
is set to Allow
, the jobs will always run
at least once.
Caution: IfstartingDeadlineSeconds
is set to a value less than 10 seconds, the CronJob may not be scheduled. This is because the CronJob controller checks things every 10 seconds.
For every CronJob, the CronJob Controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, then it does not start the job and logs the error
Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.
It is important to note that if the startingDeadlineSeconds
field is set (not nil
), the controller counts how many missed jobs occurred from the value of startingDeadlineSeconds
until now rather than from the last scheduled time until now. For example, if startingDeadlineSeconds
is 200
, the controller counts how many missed jobs occurred in the last 200 seconds.
A CronJob is counted as missed if it has failed to be created at its scheduled time. For example, If concurrencyPolicy
is set to Forbid
and a CronJob was attempted to be scheduled when there was a previous schedule still running, then it would count as missed.
For example, suppose a CronJob is set to schedule a new Job every one minute beginning at 08:30:00
, and its
startingDeadlineSeconds
field is not set. If the CronJob controller happens to
be down from 08:29:00
to 10:21:00
, the job will not start as the number of missed jobs which missed their schedule is greater than 100.
To illustrate this concept further, suppose a CronJob is set to schedule a new Job every one minute beginning at 08:30:00
, and its
startingDeadlineSeconds
is set to 200 seconds. If the CronJob controller happens to
be down for the same period as the previous example (08:29:00
to 10:21:00
,) the Job will still start at 10:22:00. This happens as the controller now checks how many missed schedules happened in the last 200 seconds (ie, 3 missed schedules), rather than from the last scheduled time until now.
The CronJob is only responsible for creating Jobs that match its schedule, and the Job in turn is responsible for the management of the Pods it represents.
Controller version
Starting with Kubernetes v1.21 the second version of the CronJob controller
is the default implementation. To disable the default CronJob controller
and use the original CronJob controller instead, one pass the CronJobControllerV2
feature gate
flag to the kube-controller-manager,
and set this flag to false
. For example:
--feature-gates="CronJobControllerV2=false"
What's next
Cron expression format
documents the format of CronJob schedule
fields.
For instructions on creating and working with cron jobs, and for an example of CronJob manifest, see Running automated tasks with cron jobs.
4.2.9 - ReplicationController
Note: ADeployment
that configures aReplicaSet
is now the recommended way to set up replication.
A ReplicationController ensures that a specified number of pod replicas are running at any one time. In other words, a ReplicationController makes sure that a pod or a homogeneous set of pods is always up and available.
How a ReplicationController Works
If there are too many pods, the ReplicationController terminates the extra pods. If there are too few, the ReplicationController starts more pods. Unlike manually created pods, the pods maintained by a ReplicationController are automatically replaced if they fail, are deleted, or are terminated. For example, your pods are re-created on a node after disruptive maintenance such as a kernel upgrade. For this reason, you should use a ReplicationController even if your application requires only a single pod. A ReplicationController is similar to a process supervisor, but instead of supervising individual processes on a single node, the ReplicationController supervises multiple pods across multiple nodes.
ReplicationController is often abbreviated to "rc" in discussion, and as a shortcut in kubectl commands.
A simple case is to create one ReplicationController object to reliably run one instance of a Pod indefinitely. A more complex use case is to run several identical replicas of a replicated service, such as web servers.
Running an example ReplicationController
This example ReplicationController config runs three copies of the nginx web server.
apiVersion: v1
kind: ReplicationController
metadata:
name: nginx
spec:
replicas: 3
selector:
app: nginx
template:
metadata:
name: nginx
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx
ports:
- containerPort: 80
Run the example job by downloading the example file and then running this command:
kubectl apply -f https://k8s.io/examples/controllers/replication.yaml
The output is similar to this:
replicationcontroller/nginx created
Check on the status of the ReplicationController using this command:
kubectl describe replicationcontrollers/nginx
The output is similar to this:
Name: nginx
Namespace: default
Selector: app=nginx
Labels: app=nginx
Annotations: <none>
Replicas: 3 current / 3 desired
Pods Status: 0 Running / 3 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx
Port: 80/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- ---- ------ -------
20s 20s 1 {replication-controller } Normal SuccessfulCreate Created pod: nginx-qrm3m
20s 20s 1 {replication-controller } Normal SuccessfulCreate Created pod: nginx-3ntk0
20s 20s 1 {replication-controller } Normal SuccessfulCreate Created pod: nginx-4ok8v
Here, three pods are created, but none is running yet, perhaps because the image is being pulled. A little later, the same command may show:
Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed
To list all the pods that belong to the ReplicationController in a machine readable form, you can use a command like this:
pods=$(kubectl get pods --selector=app=nginx --output=jsonpath={.items..metadata.name})
echo $pods
The output is similar to this:
nginx-3ntk0 nginx-4ok8v nginx-qrm3m
Here, the selector is the same as the selector for the ReplicationController (seen in the
kubectl describe
output), and in a different form in replication.yaml
. The --output=jsonpath
option
specifies an expression with the name from each pod in the returned list.
Writing a ReplicationController Spec
As with all other Kubernetes config, a ReplicationController needs apiVersion
, kind
, and metadata
fields.
The name of a ReplicationController object must be a valid
DNS subdomain name.
For general information about working with configuration files, see object management.
A ReplicationController also needs a .spec
section.
Pod Template
The .spec.template
is the only required field of the .spec
.
The .spec.template
is a pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind
.
In addition to required fields for a Pod, a pod template in a ReplicationController must specify appropriate labels and an appropriate restart policy. For labels, make sure not to overlap with other controllers. See pod selector.
Only a .spec.template.spec.restartPolicy
equal to Always
is allowed, which is the default if not specified.
For local container restarts, ReplicationControllers delegate to an agent on the node, for example the Kubelet or Docker.
Labels on the ReplicationController
The ReplicationController can itself have labels (.metadata.labels
). Typically, you
would set these the same as the .spec.template.metadata.labels
; if .metadata.labels
is not specified
then it defaults to .spec.template.metadata.labels
. However, they are allowed to be
different, and the .metadata.labels
do not affect the behavior of the ReplicationController.
Pod Selector
The .spec.selector
field is a label selector. A ReplicationController
manages all the pods with labels that match the selector. It does not distinguish
between pods that it created or deleted and pods that another person or process created or
deleted. This allows the ReplicationController to be replaced without affecting the running pods.
If specified, the .spec.template.metadata.labels
must be equal to the .spec.selector
, or it will
be rejected by the API. If .spec.selector
is unspecified, it will be defaulted to
.spec.template.metadata.labels
.
Also you should not normally create any pods whose labels match this selector, either directly, with another ReplicationController, or with another controller such as Job. If you do so, the ReplicationController thinks that it created the other pods. Kubernetes does not stop you from doing this.
If you do end up with multiple controllers that have overlapping selectors, you will have to manage the deletion yourself (see below).
Multiple Replicas
You can specify how many pods should run concurrently by setting .spec.replicas
to the number
of pods you would like to have running concurrently. The number running at any time may be higher
or lower, such as if the replicas were just increased or decreased, or if a pod is gracefully
shutdown, and a replacement starts early.
If you do not specify .spec.replicas
, then it defaults to 1.
Working with ReplicationControllers
Deleting a ReplicationController and its Pods
To delete a ReplicationController and all its pods, use kubectl delete
. Kubectl will scale the ReplicationController to zero and wait
for it to delete each pod before deleting the ReplicationController itself. If this kubectl
command is interrupted, it can be restarted.
When using the REST API or Go client library, you need to do the steps explicitly (scale replicas to 0, wait for pod deletions, then delete the ReplicationController).
Deleting only a ReplicationController
You can delete a ReplicationController without affecting any of its pods.
Using kubectl, specify the --cascade=false
option to kubectl delete
.
When using the REST API or Go client library, you can delete the ReplicationController object.
Once the original is deleted, you can create a new ReplicationController to replace it. As long
as the old and new .spec.selector
are the same, then the new one will adopt the old pods.
However, it will not make any effort to make existing pods match a new, different pod template.
To update pods to a new spec in a controlled way, use a rolling update.
Isolating pods from a ReplicationController
Pods may be removed from a ReplicationController's target set by changing their labels. This technique may be used to remove pods from service for debugging and data recovery. Pods that are removed in this way will be replaced automatically (assuming that the number of replicas is not also changed).
Common usage patterns
Rescheduling
As mentioned above, whether you have 1 pod you want to keep running, or 1000, a ReplicationController will ensure that the specified number of pods exists, even in the event of node failure or pod termination (for example, due to an action by another control agent).
Scaling
The ReplicationController enables scaling the number of replicas up or down, either manually or by an auto-scaling control agent, by updating the replicas
field.
Rolling updates
The ReplicationController is designed to facilitate rolling updates to a service by replacing pods one-by-one.
As explained in #1353, the recommended approach is to create a new ReplicationController with 1 replica, scale the new (+1) and old (-1) controllers one by one, and then delete the old controller after it reaches 0 replicas. This predictably updates the set of pods regardless of unexpected failures.
Ideally, the rolling update controller would take application readiness into account, and would ensure that a sufficient number of pods were productively serving at any given time.
The two ReplicationControllers would need to create pods with at least one differentiating label, such as the image tag of the primary container of the pod, since it is typically image updates that motivate rolling updates.
Multiple release tracks
In addition to running multiple releases of an application while a rolling update is in progress, it's common to run multiple releases for an extended period of time, or even continuously, using multiple release tracks. The tracks would be differentiated by labels.
For instance, a service might target all pods with tier in (frontend), environment in (prod)
. Now say you have 10 replicated pods that make up this tier. But you want to be able to 'canary' a new version of this component. You could set up a ReplicationController with replicas
set to 9 for the bulk of the replicas, with labels tier=frontend, environment=prod, track=stable
, and another ReplicationController with replicas
set to 1 for the canary, with labels tier=frontend, environment=prod, track=canary
. Now the service is covering both the canary and non-canary pods. But you can mess with the ReplicationControllers separately to test things out, monitor the results, etc.
Using ReplicationControllers with Services
Multiple ReplicationControllers can sit behind a single service, so that, for example, some traffic goes to the old version, and some goes to the new version.
A ReplicationController will never terminate on its own, but it isn't expected to be as long-lived as services. Services may be composed of pods controlled by multiple ReplicationControllers, and it is expected that many ReplicationControllers may be created and destroyed over the lifetime of a service (for instance, to perform an update of pods that run the service). Both services themselves and their clients should remain oblivious to the ReplicationControllers that maintain the pods of the services.
Writing programs for Replication
Pods created by a ReplicationController are intended to be fungible and semantically identical, though their configurations may become heterogeneous over time. This is an obvious fit for replicated stateless servers, but ReplicationControllers can also be used to maintain availability of master-elected, sharded, and worker-pool applications. Such applications should use dynamic work assignment mechanisms, such as the RabbitMQ work queues, as opposed to static/one-time customization of the configuration of each pod, which is considered an anti-pattern. Any pod customization performed, such as vertical auto-sizing of resources (for example, cpu or memory), should be performed by another online controller process, not unlike the ReplicationController itself.
Responsibilities of the ReplicationController
The ReplicationController ensures that the desired number of pods matches its label selector and are operational. Currently, only terminated pods are excluded from its count. In the future, readiness and other information available from the system may be taken into account, we may add more controls over the replacement policy, and we plan to emit events that could be used by external clients to implement arbitrarily sophisticated replacement and/or scale-down policies.
The ReplicationController is forever constrained to this narrow responsibility. It itself will not perform readiness nor liveness probes. Rather than performing auto-scaling, it is intended to be controlled by an external auto-scaler (as discussed in #492), which would change its replicas
field. We will not add scheduling policies (for example, spreading) to the ReplicationController. Nor should it verify that the pods controlled match the currently specified template, as that would obstruct auto-sizing and other automated processes. Similarly, completion deadlines, ordering dependencies, configuration expansion, and other features belong elsewhere. We even plan to factor out the mechanism for bulk pod creation (#170).
The ReplicationController is intended to be a composable building-block primitive. We expect higher-level APIs and/or tools to be built on top of it and other complementary primitives for user convenience in the future. The "macro" operations currently supported by kubectl (run, scale) are proof-of-concept examples of this. For instance, we could imagine something like Asgard managing ReplicationControllers, auto-scalers, services, scheduling policies, canaries, etc.
API Object
Replication controller is a top-level resource in the Kubernetes REST API. More details about the API object can be found at: ReplicationController API object.
Alternatives to ReplicationController
ReplicaSet
ReplicaSet
is the next-generation ReplicationController that supports the new set-based label selector.
It's mainly used by Deployment as a mechanism to orchestrate pod creation, deletion and updates.
Note that we recommend using Deployments instead of directly using Replica Sets, unless you require custom update orchestration or don't require updates at all.
Deployment (Recommended)
Deployment
is a higher-level API object that updates its underlying Replica Sets and their Pods. Deployments are recommended if you want this rolling update functionality because, they are declarative, server-side, and have additional features.
Bare Pods
Unlike in the case where a user directly created pods, a ReplicationController replaces pods that are deleted or terminated for any reason, such as in the case of node failure or disruptive node maintenance, such as a kernel upgrade. For this reason, we recommend that you use a ReplicationController even if your application requires only a single pod. Think of it similarly to a process supervisor, only it supervises multiple pods across multiple nodes instead of individual processes on a single node. A ReplicationController delegates local container restarts to some agent on the node (for example, Kubelet or Docker).
Job
Use a Job
instead of a ReplicationController for pods that are expected to terminate on their own
(that is, batch jobs).
DaemonSet
Use a DaemonSet
instead of a ReplicationController for pods that provide a
machine-level function, such as machine monitoring or machine logging. These pods have a lifetime that is tied
to a machine lifetime: the pod needs to be running on the machine before other pods start, and are
safe to terminate when the machine is otherwise ready to be rebooted/shutdown.
For more information
5 - Services, Load Balancing, and Networking
Kubernetes networking addresses four concerns:
- Containers within a Pod use networking to communicate via loopback.
- Cluster networking provides communication between different Pods.
- The Service resource lets you expose an application running in Pods to be reachable from outside your cluster.
- You can also use Services to publish services only for consumption inside your cluster.
5.1 - Service
An abstract way to expose an application running on a set of Pods as a network service.With Kubernetes you don't need to modify your application to use an unfamiliar service discovery mechanism. Kubernetes gives Pods their own IP addresses and a single DNS name for a set of Pods, and can load-balance across them.
Motivation
Kubernetes Pods are created and destroyed to match the state of your cluster. Pods are nonpermanent resources. If you use a Deployment to run your app, it can create and destroy Pods dynamically.
Each Pod gets its own IP address, however in a Deployment, the set of Pods running in one moment in time could be different from the set of Pods running that application a moment later.
This leads to a problem: if some set of Pods (call them "backends") provides functionality to other Pods (call them "frontends") inside your cluster, how do the frontends find out and keep track of which IP address to connect to, so that the frontend can use the backend part of the workload?
Enter Services.
Service resources
In Kubernetes, a Service is an abstraction which defines a logical set of Pods and a policy by which to access them (sometimes this pattern is called a micro-service). The set of Pods targeted by a Service is usually determined by a selector. To learn about other ways to define Service endpoints, see Services without selectors.
For example, consider a stateless image-processing backend which is running with 3 replicas. Those replicas are fungible—frontends do not care which backend they use. While the actual Pods that compose the backend set may change, the frontend clients should not need to be aware of that, nor should they need to keep track of the set of backends themselves.
The Service abstraction enables this decoupling.
Cloud-native service discovery
If you're able to use Kubernetes APIs for service discovery in your application, you can query the API server for Endpoints, that get updated whenever the set of Pods in a Service changes.
For non-native applications, Kubernetes offers ways to place a network port or load balancer in between your application and the backend Pods.
Defining a Service
A Service in Kubernetes is a REST object, similar to a Pod. Like all of the
REST objects, you can POST
a Service definition to the API server to create
a new instance.
The name of a Service object must be a valid
DNS label name.
For example, suppose you have a set of Pods where each listens on TCP port 9376
and contains a label app=MyApp
:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
targetPort: 9376
This specification creates a new Service object named "my-service", which
targets TCP port 9376 on any Pod with the app=MyApp
label.
Kubernetes assigns this Service an IP address (sometimes called the "cluster IP"), which is used by the Service proxies (see Virtual IPs and service proxies below).
The controller for the Service selector continuously scans for Pods that match its selector, and then POSTs any updates to an Endpoint object also named "my-service".
Note: A Service can map any incomingport
to atargetPort
. By default and for convenience, thetargetPort
is set to the same value as theport
field.
Port definitions in Pods have names, and you can reference these names in the
targetPort
attribute of a Service. This works even if there is a mixture
of Pods in the Service using a single configured name, with the same network
protocol available via different port numbers.
This offers a lot of flexibility for deploying and evolving your Services.
For example, you can change the port numbers that Pods expose in the next
version of your backend software, without breaking clients.
The default protocol for Services is TCP; you can also use any other supported protocol.
As many Services need to expose more than one port, Kubernetes supports multiple
port definitions on a Service object.
Each port definition can have the same protocol
, or a different one.
Services without selectors
Services most commonly abstract access to Kubernetes Pods, but they can also abstract other kinds of backends. For example:
- You want to have an external database cluster in production, but in your test environment you use your own databases.
- You want to point your Service to a Service in a different Namespace or on another cluster.
- You are migrating a workload to Kubernetes. While evaluating the approach, you run only a portion of your backends in Kubernetes.
In any of these scenarios you can define a Service without a Pod selector. For example:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
ports:
- protocol: TCP
port: 80
targetPort: 9376
Because this Service has no selector, the corresponding Endpoints object is not created automatically. You can manually map the Service to the network address and port where it's running, by adding an Endpoints object manually:
apiVersion: v1
kind: Endpoints
metadata:
name: my-service
subsets:
- addresses:
- ip: 192.0.2.42
ports:
- port: 9376
The name of the Endpoints object must be a valid DNS subdomain name.
Note:The endpoint IPs must not be: loopback (127.0.0.0/8 for IPv4, ::1/128 for IPv6), or link-local (169.254.0.0/16 and 224.0.0.0/24 for IPv4, fe80::/64 for IPv6).
Endpoint IP addresses cannot be the cluster IPs of other Kubernetes Services, because kube-proxy doesn't support virtual IPs as a destination.
Accessing a Service without a selector works the same as if it had a selector.
In the example above, traffic is routed to the single endpoint defined in
the YAML: 192.0.2.42:9376
(TCP).
An ExternalName Service is a special case of Service that does not have selectors and uses DNS names instead. For more information, see the ExternalName section later in this document.
Over Capacity Endpoints
If an Endpoints resource has more than 1000 endpoints then a Kubernetes v1.21 (or later)
cluster annotates that Endpoints with endpoints.kubernetes.io/over-capacity: warning
.
This annotation indicates that the affected Endpoints object is over capacity.
EndpointSlices
Kubernetes v1.21 [stable]
EndpointSlices are an API resource that can provide a more scalable alternative to Endpoints. Although conceptually quite similar to Endpoints, EndpointSlices allow for distributing network endpoints across multiple resources. By default, an EndpointSlice is considered "full" once it reaches 100 endpoints, at which point additional EndpointSlices will be created to store any additional endpoints.
EndpointSlices provide additional attributes and functionality which is described in detail in EndpointSlices.
Application protocol
Kubernetes v1.20 [stable]
The appProtocol
field provides a way to specify an application protocol for
each Service port. The value of this field is mirrored by the corresponding
Endpoints and EndpointSlice objects.
This field follows standard Kubernetes label syntax. Values should either be
IANA standard service names or
domain prefixed names such as mycompany.com/my-custom-protocol
.
Virtual IPs and service proxies
Every node in a Kubernetes cluster runs a kube-proxy
. kube-proxy
is
responsible for implementing a form of virtual IP for Services
of type other
than ExternalName
.
Why not use round-robin DNS?
A question that pops up every now and then is why Kubernetes relies on proxying to forward inbound traffic to backends. What about other approaches? For example, would it be possible to configure DNS records that have multiple A values (or AAAA for IPv6), and rely on round-robin name resolution?
There are a few reasons for using proxying for Services:
- There is a long history of DNS implementations not respecting record TTLs, and caching the results of name lookups after they should have expired.
- Some apps do DNS lookups only once and cache the results indefinitely.
- Even if apps and libraries did proper re-resolution, the low or zero TTLs on the DNS records could impose a high load on DNS that then becomes difficult to manage.
User space proxy mode
In this mode, kube-proxy watches the Kubernetes control plane for the addition and
removal of Service and Endpoint objects. For each Service it opens a
port (randomly chosen) on the local node. Any connections to this "proxy port"
are proxied to one of the Service's backend Pods (as reported via
Endpoints). kube-proxy takes the SessionAffinity
setting of the Service into
account when deciding which backend Pod to use.
Lastly, the user-space proxy installs iptables rules which capture traffic to
the Service's clusterIP
(which is virtual) and port
. The rules
redirect that traffic to the proxy port which proxies the backend Pod.
By default, kube-proxy in userspace mode chooses a backend via a round-robin algorithm.
iptables
proxy mode
In this mode, kube-proxy watches the Kubernetes control plane for the addition and
removal of Service and Endpoint objects. For each Service, it installs
iptables rules, which capture traffic to the Service's clusterIP
and port
,
and redirect that traffic to one of the Service's
backend sets. For each Endpoint object, it installs iptables rules which
select a backend Pod.
By default, kube-proxy in iptables mode chooses a backend at random.
Using iptables to handle traffic has a lower system overhead, because traffic is handled by Linux netfilter without the need to switch between userspace and the kernel space. This approach is also likely to be more reliable.
If kube-proxy is running in iptables mode and the first Pod that's selected does not respond, the connection fails. This is different from userspace mode: in that scenario, kube-proxy would detect that the connection to the first Pod had failed and would automatically retry with a different backend Pod.
You can use Pod readiness probes to verify that backend Pods are working OK, so that kube-proxy in iptables mode only sees backends that test out as healthy. Doing this means you avoid having traffic sent via kube-proxy to a Pod that's known to have failed.
IPVS proxy mode
Kubernetes v1.11 [stable]
In ipvs
mode, kube-proxy watches Kubernetes Services and Endpoints,
calls netlink
interface to create IPVS rules accordingly and synchronizes
IPVS rules with Kubernetes Services and Endpoints periodically.
This control loop ensures that IPVS status matches the desired
state.
When accessing a Service, IPVS directs traffic to one of the backend Pods.
The IPVS proxy mode is based on netfilter hook function that is similar to iptables mode, but uses a hash table as the underlying data structure and works in the kernel space. That means kube-proxy in IPVS mode redirects traffic with lower latency than kube-proxy in iptables mode, with much better performance when synchronising proxy rules. Compared to the other proxy modes, IPVS mode also supports a higher throughput of network traffic.
IPVS provides more options for balancing traffic to backend Pods; these are:
rr
: round-robinlc
: least connection (smallest number of open connections)dh
: destination hashingsh
: source hashingsed
: shortest expected delaynq
: never queue
Note:To run kube-proxy in IPVS mode, you must make IPVS available on the node before starting kube-proxy.
When kube-proxy starts in IPVS proxy mode, it verifies whether IPVS kernel modules are available. If the IPVS kernel modules are not detected, then kube-proxy falls back to running in iptables proxy mode.
In these proxy models, the traffic bound for the Service's IP:Port is proxied to an appropriate backend without the clients knowing anything about Kubernetes or Services or Pods.
If you want to make sure that connections from a particular client
are passed to the same Pod each time, you can select the session affinity based
on the client's IP addresses by setting service.spec.sessionAffinity
to "ClientIP"
(the default is "None").
You can also set the maximum session sticky time by setting
service.spec.sessionAffinityConfig.clientIP.timeoutSeconds
appropriately.
(the default value is 10800, which works out to be 3 hours).
Multi-Port Services
For some Services, you need to expose more than one port. Kubernetes lets you configure multiple port definitions on a Service object. When using multiple ports for a Service, you must give all of your ports names so that these are unambiguous. For example:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: MyApp
ports:
- name: http
protocol: TCP
port: 80
targetPort: 9376
- name: https
protocol: TCP
port: 443
targetPort: 9377
Note:As with Kubernetes names in general, names for ports must only contain lowercase alphanumeric characters and
-
. Port names must also start and end with an alphanumeric character.For example, the names
123-abc
andweb
are valid, but123_abc
and-web
are not.
Choosing your own IP address
You can specify your own cluster IP address as part of a Service
creation
request. To do this, set the .spec.clusterIP
field. For example, if you
already have an existing DNS entry that you wish to reuse, or legacy systems
that are configured for a specific IP address and difficult to re-configure.
The IP address that you choose must be a valid IPv4 or IPv6 address from within the
service-cluster-ip-range
CIDR range that is configured for the API server.
If you try to create a Service with an invalid clusterIP address value, the API
server will return a 422 HTTP status code to indicate that there's a problem.
Discovering services
Kubernetes supports 2 primary modes of finding a Service - environment variables and DNS.
Environment variables
When a Pod is run on a Node, the kubelet adds a set of environment variables
for each active Service. It supports both Docker links
compatible variables (see
makeLinkVariables)
and simpler {SVCNAME}_SERVICE_HOST
and {SVCNAME}_SERVICE_PORT
variables,
where the Service name is upper-cased and dashes are converted to underscores.
For example, the Service redis-master
which exposes TCP port 6379 and has been
allocated cluster IP address 10.0.0.11, produces the following environment
variables:
REDIS_MASTER_SERVICE_HOST=10.0.0.11
REDIS_MASTER_SERVICE_PORT=6379
REDIS_MASTER_PORT=tcp://10.0.0.11:6379
REDIS_MASTER_PORT_6379_TCP=tcp://10.0.0.11:6379
REDIS_MASTER_PORT_6379_TCP_PROTO=tcp
REDIS_MASTER_PORT_6379_TCP_PORT=6379
REDIS_MASTER_PORT_6379_TCP_ADDR=10.0.0.11
Note:When you have a Pod that needs to access a Service, and you are using the environment variable method to publish the port and cluster IP to the client Pods, you must create the Service before the client Pods come into existence. Otherwise, those client Pods won't have their environment variables populated.
If you only use DNS to discover the cluster IP for a Service, you don't need to worry about this ordering issue.
DNS
You can (and almost always should) set up a DNS service for your Kubernetes cluster using an add-on.
A cluster-aware DNS server, such as CoreDNS, watches the Kubernetes API for new Services and creates a set of DNS records for each one. If DNS has been enabled throughout your cluster then all Pods should automatically be able to resolve Services by their DNS name.
For example, if you have a Service called my-service
in a Kubernetes
namespace my-ns
, the control plane and the DNS Service acting together
create a DNS record for my-service.my-ns
. Pods in the my-ns
namespace
should be able to find the service by doing a name lookup for my-service
(my-service.my-ns
would also work).
Pods in other namespaces must qualify the name as my-service.my-ns
. These names
will resolve to the cluster IP assigned for the Service.
Kubernetes also supports DNS SRV (Service) records for named ports. If the
my-service.my-ns
Service has a port named http
with the protocol set to
TCP
, you can do a DNS SRV query for _http._tcp.my-service.my-ns
to discover
the port number for http
, as well as the IP address.
The Kubernetes DNS server is the only way to access ExternalName
Services.
You can find more information about ExternalName
resolution in
DNS Pods and Services.
Headless Services
Sometimes you don't need load-balancing and a single Service IP. In
this case, you can create what are termed "headless" Services, by explicitly
specifying "None"
for the cluster IP (.spec.clusterIP
).
You can use a headless Service to interface with other service discovery mechanisms, without being tied to Kubernetes' implementation.
For headless Services
, a cluster IP is not allocated, kube-proxy does not handle
these Services, and there is no load balancing or proxying done by the platform
for them. How DNS is automatically configured depends on whether the Service has
selectors defined:
With selectors
For headless Services that define selectors, the endpoints controller creates
Endpoints
records in the API, and modifies the DNS configuration to return
A records (IP addresses) that point directly to the Pods
backing the Service
.
Without selectors
For headless Services that do not define selectors, the endpoints controller does
not create Endpoints
records. However, the DNS system looks for and configures
either:
- CNAME records for
ExternalName
-type Services. - A records for any
Endpoints
that share a name with the Service, for all other types.
Publishing Services (ServiceTypes)
For some parts of your application (for example, frontends) you may want to expose a Service onto an external IP address, that's outside of your cluster.
Kubernetes ServiceTypes
allow you to specify what kind of Service you want.
The default is ClusterIP
.
Type
values and their behaviors are:
ClusterIP
: Exposes the Service on a cluster-internal IP. Choosing this value makes the Service only reachable from within the cluster. This is the defaultServiceType
.NodePort
: Exposes the Service on each Node's IP at a static port (theNodePort
). AClusterIP
Service, to which theNodePort
Service routes, is automatically created. You'll be able to contact theNodePort
Service, from outside the cluster, by requesting<NodeIP>:<NodePort>
.LoadBalancer
: Exposes the Service externally using a cloud provider's load balancer.NodePort
andClusterIP
Services, to which the external load balancer routes, are automatically created.ExternalName
: Maps the Service to the contents of theexternalName
field (e.g.foo.bar.example.com
), by returning aCNAME
record with its value. No proxying of any kind is set up.Note: You need eitherkube-dns
version 1.7 or CoreDNS version 0.0.8 or higher to use theExternalName
type.
You can also use Ingress to expose your Service. Ingress is not a Service type, but it acts as the entry point for your cluster. It lets you consolidate your routing rules into a single resource as it can expose multiple services under the same IP address.
Type NodePort
If you set the type
field to NodePort
, the Kubernetes control plane
allocates a port from a range specified by --service-node-port-range
flag (default: 30000-32767).
Each node proxies that port (the same port number on every Node) into your Service.
Your Service reports the allocated port in its .spec.ports[*].nodePort
field.
If you want to specify particular IP(s) to proxy the port, you can set the
--nodeport-addresses
flag for kube-proxy or the equivalent nodePortAddresses
field of the
kube-proxy configuration file
to particular IP block(s).
This flag takes a comma-delimited list of IP blocks (e.g. 10.0.0.0/8
, 192.0.2.0/25
) to specify IP address ranges that kube-proxy should consider as local to this node.
For example, if you start kube-proxy with the --nodeport-addresses=127.0.0.0/8
flag, kube-proxy only selects the loopback interface for NodePort Services. The default for --nodeport-addresses
is an empty list. This means that kube-proxy should consider all available network interfaces for NodePort. (That's also compatible with earlier Kubernetes releases).
If you want a specific port number, you can specify a value in the nodePort
field. The control plane will either allocate you that port or report that
the API transaction failed.
This means that you need to take care of possible port collisions yourself.
You also have to use a valid port number, one that's inside the range configured
for NodePort use.
Using a NodePort gives you the freedom to set up your own load balancing solution, to configure environments that are not fully supported by Kubernetes, or even to expose one or more nodes' IPs directly.
Note that this Service is visible as <NodeIP>:spec.ports[*].nodePort
and .spec.clusterIP:spec.ports[*].port
.
If the --nodeport-addresses
flag for kube-proxy or the equivalent field
in the kube-proxy configuration file is set, <NodeIP>
would be filtered node IP(s).
For example:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
type: NodePort
selector:
app: MyApp
ports:
# By default and for convenience, the `targetPort` is set to the same value as the `port` field.
- port: 80
targetPort: 80
# Optional field
# By default and for convenience, the Kubernetes control plane will allocate a port from a range (default: 30000-32767)
nodePort: 30007
Type LoadBalancer
On cloud providers which support external load balancers, setting the type
field to LoadBalancer
provisions a load balancer for your Service.
The actual creation of the load balancer happens asynchronously, and
information about the provisioned balancer is published in the Service's
.status.loadBalancer
field.
For example:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
targetPort: 9376
clusterIP: 10.0.171.239
type: LoadBalancer
status:
loadBalancer:
ingress:
- ip: 192.0.2.127
Traffic from the external load balancer is directed at the backend Pods. The cloud provider decides how it is load balanced.
Some cloud providers allow you to specify the loadBalancerIP
. In those cases, the load-balancer is created
with the user-specified loadBalancerIP
. If the loadBalancerIP
field is not specified,
the loadBalancer is set up with an ephemeral IP address. If you specify a loadBalancerIP
but your cloud provider does not support the feature, the loadbalancerIP
field that you
set is ignored.
Note:On Azure, if you want to use a user-specified public type
loadBalancerIP
, you first need to create a static type public IP address resource. This public IP address resource should be in the same resource group of the other automatically created resources of the cluster. For example,MC_myResourceGroup_myAKSCluster_eastus
.Specify the assigned IP address as loadBalancerIP. Ensure that you have updated the securityGroupName in the cloud provider configuration file. For information about troubleshooting
CreatingLoadBalancerFailed
permission issues see, Use a static IP address with the Azure Kubernetes Service (AKS) load balancer or CreatingLoadBalancerFailed on AKS cluster with advanced networking.
Load balancers with mixed protocol types
Kubernetes v1.20 [alpha]
By default, for LoadBalancer type of Services, when there is more than one port defined, all ports must have the same protocol, and the protocol must be one which is supported by the cloud provider.
If the feature gate MixedProtocolLBService
is enabled for the kube-apiserver it is allowed to use different protocols when there is more than one port defined.
Note: The set of protocols that can be used for LoadBalancer type of Services is still defined by the cloud provider.
Disabling load balancer NodePort allocation
Kubernetes v1.20 [alpha]
Starting in v1.20, you can optionally disable node port allocation for a Service Type=LoadBalancer by setting
the field spec.allocateLoadBalancerNodePorts
to false
. This should only be used for load balancer implementations
that route traffic directly to pods as opposed to using node ports. By default, spec.allocateLoadBalancerNodePorts
is true
and type LoadBalancer Services will continue to allocate node ports. If spec.allocateLoadBalancerNodePorts
is set to false
on an existing Service with allocated node ports, those node ports will NOT be de-allocated automatically.
You must explicitly remove the nodePorts
entry in every Service port to de-allocate those node ports.
You must enable the ServiceLBNodePortControl
feature gate to use this field.
Specifying class of load balancer implementation
Kubernetes v1.21 [alpha]
Starting in v1.21, you can optionally specify the class of a load balancer implementation for
LoadBalancer
type of Service by setting the field spec.loadBalancerClass
.
By default, spec.loadBalancerClass
is nil
and a LoadBalancer
type of Service uses
the cloud provider's default load balancer implementation.
If spec.loadBalancerClass
is specified, it is assumed that a load balancer
implementation that matches the specified class is watching for Services.
Any default load balancer implementation (for example, the one provided by
the cloud provider) will ignore Services that have this field set.
spec.loadBalancerClass
can be set on a Service of type LoadBalancer
only.
Once set, it cannot be changed.
The value of spec.loadBalancerClass
must be a label-style identifier,
with an optional prefix such as "internal-vip
" or "example.com/internal-vip
".
Unprefixed names are reserved for end-users.
You must enable the ServiceLoadBalancerClass
feature gate to use this field.
Internal load balancer
In a mixed environment it is sometimes necessary to route traffic from Services inside the same (virtual) network address block.
In a split-horizon DNS environment you would need two Services to be able to route both external and internal traffic to your endpoints.
To set an internal load balancer, add one of the following annotations to your Service depending on the cloud Service provider you're using.
Select one of the tabs.
[...]
metadata:
name: my-service
annotations:
cloud.google.com/load-balancer-type: "Internal"
[...]
[...]
metadata:
name: my-service
annotations:
service.beta.kubernetes.io/aws-load-balancer-internal: "true"
[...]
[...]
metadata:
name: my-service
annotations:
service.beta.kubernetes.io/azure-load-balancer-internal: "true"
[...]
[...]
metadata:
name: my-service
annotations:
service.kubernetes.io/ibm-load-balancer-cloud-provider-ip-type: "private"
[...]
[...]
metadata:
name: my-service
annotations:
service.beta.kubernetes.io/openstack-internal-load-balancer: "true"
[...]
[...]
metadata:
name: my-service
annotations:
service.beta.kubernetes.io/cce-load-balancer-internal-vpc: "true"
[...]
[...]
metadata:
annotations:
service.kubernetes.io/qcloud-loadbalancer-internal-subnetid: subnet-xxxxx
[...]
[...]
metadata:
annotations:
service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: "intranet"
[...]
TLS support on AWS
For partial TLS / SSL support on clusters running on AWS, you can add three
annotations to a LoadBalancer
service:
metadata:
name: my-service
annotations:
service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012
The first specifies the ARN of the certificate to use. It can be either a certificate from a third party issuer that was uploaded to IAM or one created within AWS Certificate Manager.
metadata:
name: my-service
annotations:
service.beta.kubernetes.io/aws-load-balancer-backend-protocol: (https|http|ssl|tcp)
The second annotation specifies which protocol a Pod speaks. For HTTPS and SSL, the ELB expects the Pod to authenticate itself over the encrypted connection, using a certificate.
HTTP and HTTPS selects layer 7 proxying: the ELB terminates
the connection with the user, parses headers, and injects the X-Forwarded-For
header with the user's IP address (Pods only see the IP address of the
ELB at the other end of its connection) when forwarding requests.
TCP and SSL selects layer 4 proxying: the ELB forwards traffic without modifying the headers.
In a mixed-use environment where some ports are secured and others are left unencrypted, you can use the following annotations:
metadata:
name: my-service
annotations:
service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443,8443"
In the above example, if the Service contained three ports, 80
, 443
, and
8443
, then 443
and 8443
would use the SSL certificate, but 80
would be proxied HTTP.
From Kubernetes v1.9 onwards you can use predefined AWS SSL policies with HTTPS or SSL listeners for your Services.
To see which policies are available for use, you can use the aws
command line tool:
aws elb describe-load-balancer-policies --query 'PolicyDescriptions[].PolicyName'
You can then specify any one of those policies using the
"service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy
"
annotation; for example:
metadata:
name: my-service
annotations:
service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy: "ELBSecurityPolicy-TLS-1-2-2017-01"
PROXY protocol support on AWS
To enable PROXY protocol support for clusters running on AWS, you can use the following service annotation:
metadata:
name: my-service
annotations:
service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
Since version 1.3.0, the use of this annotation applies to all ports proxied by the ELB and cannot be configured otherwise.
ELB Access Logs on AWS
There are several annotations to manage access logs for ELB Services on AWS.
The annotation service.beta.kubernetes.io/aws-load-balancer-access-log-enabled
controls whether access logs are enabled.
The annotation service.beta.kubernetes.io/aws-load-balancer-access-log-emit-interval
controls the interval in minutes for publishing the access logs. You can specify
an interval of either 5 or 60 minutes.
The annotation service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-name
controls the name of the Amazon S3 bucket where load balancer access logs are
stored.
The annotation service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-prefix
specifies the logical hierarchy you created for your Amazon S3 bucket.
metadata:
name: my-service
annotations:
service.beta.kubernetes.io/aws-load-balancer-access-log-enabled: "true"
# Specifies whether access logs are enabled for the load balancer
service.beta.kubernetes.io/aws-load-balancer-access-log-emit-interval: "60"
# The interval for publishing the access logs. You can specify an interval of either 5 or 60 (minutes).
service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-name: "my-bucket"
# The name of the Amazon S3 bucket where the access logs are stored
service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-prefix: "my-bucket-prefix/prod"
# The logical hierarchy you created for your Amazon S3 bucket, for example `my-bucket-prefix/prod`
Connection Draining on AWS
Connection draining for Classic ELBs can be managed with the annotation
service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled
set
to the value of "true"
. The annotation
service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout
can
also be used to set maximum time, in seconds, to keep the existing connections open before deregistering the instances.
metadata:
name: my-service
annotations:
service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60"
Other ELB annotations
There are other annotations to manage Classic Elastic Load Balancers that are described below.
metadata:
name: my-service
annotations:
service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "60"
# The time, in seconds, that the connection is allowed to be idle (no data has been sent over the connection) before it is closed by the load balancer
service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
# Specifies whether cross-zone load balancing is enabled for the load balancer
service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags: "environment=prod,owner=devops"
# A comma-separated list of key-value pairs which will be recorded as
# additional tags in the ELB.
service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: ""
# The number of successive successful health checks required for a backend to
# be considered healthy for traffic. Defaults to 2, must be between 2 and 10
service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
# The number of unsuccessful health checks required for a backend to be
# considered unhealthy for traffic. Defaults to 6, must be between 2 and 10
service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "20"
# The approximate interval, in seconds, between health checks of an
# individual instance. Defaults to 10, must be between 5 and 300
service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"
# The amount of time, in seconds, during which no response means a failed
# health check. This value must be less than the service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval
# value. Defaults to 5, must be between 2 and 60
service.beta.kubernetes.io/aws-load-balancer-security-groups: "sg-53fae93f"
# A list of existing security groups to be configured on the ELB created. Unlike the annotation
# service.beta.kubernetes.io/aws-load-balancer-extra-security-groups, this replaces all other security groups previously assigned to the ELB and also overrides the creation
# of a uniquely generated security group for this ELB.
# The first security group ID on this list is used as a source to permit incoming traffic to target worker nodes (service traffic and health checks).
# If multiple ELBs are configured with the same security group ID, only a single permit line will be added to the worker node security groups, that means if you delete any
# of those ELBs it will remove the single permit line and block access for all ELBs that shared the same security group ID.
# This can cause a cross-service outage if not used properly
service.beta.kubernetes.io/aws-load-balancer-extra-security-groups: "sg-53fae93f,sg-42efd82e"
# A list of additional security groups to be added to the created ELB, this leaves the uniquely generated security group in place, this ensures that every ELB
# has a unique security group ID and a matching permit line to allow traffic to the target worker nodes (service traffic and health checks).
# Security groups defined here can be shared between services.
service.beta.kubernetes.io/aws-load-balancer-target-node-labels: "ingress-gw,gw-name=public-api"
# A comma separated list of key-value pairs which are used
# to select the target nodes for the load balancer
Network Load Balancer support on AWS
Kubernetes v1.15 [beta]
To use a Network Load Balancer on AWS, use the annotation service.beta.kubernetes.io/aws-load-balancer-type
with the value set to nlb
.
metadata:
name: my-service
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
Note: NLB only works with certain instance classes; see the AWS documentation on Elastic Load Balancing for a list of supported instance types.
Unlike Classic Elastic Load Balancers, Network Load Balancers (NLBs) forward the
client's IP address through to the node. If a Service's .spec.externalTrafficPolicy
is set to Cluster
, the client's IP address is not propagated to the end
Pods.
By setting .spec.externalTrafficPolicy
to Local
, the client IP addresses is
propagated to the end Pods, but this could result in uneven distribution of
traffic. Nodes without any Pods for a particular LoadBalancer Service will fail
the NLB Target Group's health check on the auto-assigned
.spec.healthCheckNodePort
and not receive any traffic.
In order to achieve even traffic, either use a DaemonSet or specify a pod anti-affinity to not locate on the same node.
You can also use NLB Services with the internal load balancer annotation.
In order for client traffic to reach instances behind an NLB, the Node security groups are modified with the following IP rules:
Rule | Protocol | Port(s) | IpRange(s) | IpRange Description |
---|---|---|---|---|
Health Check | TCP | NodePort(s) (.spec.healthCheckNodePort for .spec.externalTrafficPolicy = Local ) | Subnet CIDR | kubernetes.io/rule/nlb/health=<loadBalancerName> |
| Client Traffic | TCP | NodePort(s) | .spec.loadBalancerSourceRanges
(defaults to 0.0.0.0/0
) | kubernetes.io/rule/nlb/client=<loadBalancerName> |
| MTU Discovery | ICMP | 3,4 | .spec.loadBalancerSourceRanges
(defaults to 0.0.0.0/0
) | kubernetes.io/rule/nlb/mtu=<loadBalancerName> |
In order to limit which client IP's can access the Network Load Balancer,
specify loadBalancerSourceRanges
.
spec:
loadBalancerSourceRanges:
- "143.231.0.0/16"
Note: If.spec.loadBalancerSourceRanges
is not set, Kubernetes allows traffic from0.0.0.0/0
to the Node Security Group(s). If nodes have public IP addresses, be aware that non-NLB traffic can also reach all instances in those modified security groups.
Other CLB annotations on Tencent Kubernetes Engine (TKE)
There are other annotations for managing Cloud Load Balancers on TKE as shown below.
metadata:
name: my-service
annotations:
# Bind Loadbalancers with specified nodes
service.kubernetes.io/qcloud-loadbalancer-backends-label: key in (value1, value2)
# ID of an existing load balancer
service.kubernetes.io/tke-existed-lbid:lb-6swtxxxx
# Custom parameters for the load balancer (LB), does not support modification of LB type yet
service.kubernetes.io/service.extensiveParameters: ""
# Custom parameters for the LB listener
service.kubernetes.io/service.listenerParameters: ""
# Specifies the type of Load balancer;
# valid values: classic (Classic Cloud Load Balancer) or application (Application Cloud Load Balancer)
service.kubernetes.io/loadbalance-type: xxxxx
# Specifies the public network bandwidth billing method;
# valid values: TRAFFIC_POSTPAID_BY_HOUR(bill-by-traffic) and BANDWIDTH_POSTPAID_BY_HOUR (bill-by-bandwidth).
service.kubernetes.io/qcloud-loadbalancer-internet-charge-type: xxxxxx
# Specifies the bandwidth value (value range: [1,2000] Mbps).
service.kubernetes.io/qcloud-loadbalancer-internet-max-bandwidth-out: "10"
# When this annotation is set,the loadbalancers will only register nodes
# with pod running on it, otherwise all nodes will be registered.
service.kubernetes.io/local-svc-only-bind-node-with-pod: true
Type ExternalName
Services of type ExternalName map a Service to a DNS name, not to a typical selector such as
my-service
or cassandra
. You specify these Services with the spec.externalName
parameter.
This Service definition, for example, maps
the my-service
Service in the prod
namespace to my.database.example.com
:
apiVersion: v1
kind: Service
metadata:
name: my-service
namespace: prod
spec:
type: ExternalName
externalName: my.database.example.com
Note: ExternalName accepts an IPv4 address string, but as a DNS names comprised of digits, not as an IP address. ExternalNames that resemble IPv4 addresses are not resolved by CoreDNS or ingress-nginx because ExternalName is intended to specify a canonical DNS name. To hardcode an IP address, consider using headless Services.
When looking up the host my-service.prod.svc.cluster.local
, the cluster DNS Service
returns a CNAME
record with the value my.database.example.com
. Accessing
my-service
works in the same way as other Services but with the crucial
difference that redirection happens at the DNS level rather than via proxying or
forwarding. Should you later decide to move your database into your cluster, you
can start its Pods, add appropriate selectors or endpoints, and change the
Service's type
.
Warning:You may have trouble using ExternalName for some common protocols, including HTTP and HTTPS. If you use ExternalName then the hostname used by clients inside your cluster is different from the name that the ExternalName references.
For protocols that use hostnames this difference may lead to errors or unexpected responses. HTTP requests will have a
Host:
header that the origin server does not recognize; TLS servers will not be able to provide a certificate matching the hostname that the client connected to.
Note: This section is indebted to the Kubernetes Tips - Part 1 blog post from Alen Komljen.
External IPs
If there are external IPs that route to one or more cluster nodes, Kubernetes Services can be exposed on those
externalIPs
. Traffic that ingresses into the cluster with the external IP (as destination IP), on the Service port,
will be routed to one of the Service endpoints. externalIPs
are not managed by Kubernetes and are the responsibility
of the cluster administrator.
In the Service spec, externalIPs
can be specified along with any of the ServiceTypes
.
In the example below, "my-service
" can be accessed by clients on "80.11.12.10:80
" (externalIP:port
)
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: MyApp
ports:
- name: http
protocol: TCP
port: 80
targetPort: 9376
externalIPs:
- 80.11.12.10
Shortcomings
Using the userspace proxy for VIPs works at small to medium scale, but will not scale to very large clusters with thousands of Services. The original design proposal for portals has more details on this.
Using the userspace proxy obscures the source IP address of a packet accessing a Service. This makes some kinds of network filtering (firewalling) impossible. The iptables proxy mode does not obscure in-cluster source IPs, but it does still impact clients coming through a load balancer or node-port.
The Type
field is designed as nested functionality - each level adds to the
previous. This is not strictly required on all cloud providers (e.g. Google Compute Engine does
not need to allocate a NodePort
to make LoadBalancer
work, but AWS does)
but the current API requires it.
Virtual IP implementation
The previous information should be sufficient for many people who want to use Services. However, there is a lot going on behind the scenes that may be worth understanding.
Avoiding collisions
One of the primary philosophies of Kubernetes is that you should not be exposed to situations that could cause your actions to fail through no fault of your own. For the design of the Service resource, this means not making you choose your own port number if that choice might collide with someone else's choice. That is an isolation failure.
In order to allow you to choose a port number for your Services, we must ensure that no two Services can collide. Kubernetes does that by allocating each Service its own IP address.
To ensure each Service receives a unique IP, an internal allocator atomically updates a global allocation map in etcd prior to creating each Service. The map object must exist in the registry for Services to get IP address assignments, otherwise creations will fail with a message indicating an IP address could not be allocated.
In the control plane, a background controller is responsible for creating that map (needed to support migrating from older versions of Kubernetes that used in-memory locking). Kubernetes also uses controllers to check for invalid assignments (eg due to administrator intervention) and for cleaning up allocated IP addresses that are no longer used by any Services.
Service IP addresses
Unlike Pod IP addresses, which actually route to a fixed destination, Service IPs are not actually answered by a single host. Instead, kube-proxy uses iptables (packet processing logic in Linux) to define virtual IP addresses which are transparently redirected as needed. When clients connect to the VIP, their traffic is automatically transported to an appropriate endpoint. The environment variables and DNS for Services are actually populated in terms of the Service's virtual IP address (and port).
kube-proxy supports three proxy modes—userspace, iptables and IPVS—which each operate slightly differently.
Userspace
As an example, consider the image processing application described above. When the backend Service is created, the Kubernetes master assigns a virtual IP address, for example 10.0.0.1. Assuming the Service port is 1234, the Service is observed by all of the kube-proxy instances in the cluster. When a proxy sees a new Service, it opens a new random port, establishes an iptables redirect from the virtual IP address to this new port, and starts accepting connections on it.
When a client connects to the Service's virtual IP address, the iptables rule kicks in, and redirects the packets to the proxy's own port. The "Service proxy" chooses a backend, and starts proxying traffic from the client to the backend.
This means that Service owners can choose any port they want without risk of collision. Clients can connect to an IP and port, without being aware of which Pods they are actually accessing.
iptables
Again, consider the image processing application described above. When the backend Service is created, the Kubernetes control plane assigns a virtual IP address, for example 10.0.0.1. Assuming the Service port is 1234, the Service is observed by all of the kube-proxy instances in the cluster. When a proxy sees a new Service, it installs a series of iptables rules which redirect from the virtual IP address to per-Service rules. The per-Service rules link to per-Endpoint rules which redirect traffic (using destination NAT) to the backends.
When a client connects to the Service's virtual IP address the iptables rule kicks in. A backend is chosen (either based on session affinity or randomly) and packets are redirected to the backend. Unlike the userspace proxy, packets are never copied to userspace, the kube-proxy does not have to be running for the virtual IP address to work, and Nodes see traffic arriving from the unaltered client IP address.
This same basic flow executes when traffic comes in through a node-port or through a load-balancer, though in those cases the client IP does get altered.
IPVS
iptables operations slow down dramatically in large scale cluster e.g 10,000 Services. IPVS is designed for load balancing and based on in-kernel hash tables. So you can achieve performance consistency in large number of Services from IPVS-based kube-proxy. Meanwhile, IPVS-based kube-proxy has more sophisticated load balancing algorithms (least conns, locality, weighted, persistence).
API Object
Service is a top-level resource in the Kubernetes REST API. You can find more details about the API object at: Service API object.
Supported protocols
TCP
You can use TCP for any kind of Service, and it's the default network protocol.
UDP
You can use UDP for most Services. For type=LoadBalancer Services, UDP support depends on the cloud provider offering this facility.
SCTP
Kubernetes v1.20 [stable]
When using a network plugin that supports SCTP traffic, you can use SCTP for most Services. For type=LoadBalancer Services, SCTP support depends on the cloud provider offering this facility. (Most do not).
Warnings
Support for multihomed SCTP associations
Warning:The support of multihomed SCTP associations requires that the CNI plugin can support the assignment of multiple interfaces and IP addresses to a Pod.
NAT for multihomed SCTP associations requires special logic in the corresponding kernel modules.
Windows
Note: SCTP is not supported on Windows based nodes.
Userspace kube-proxy
Warning: The kube-proxy does not support the management of SCTP associations when it is in userspace mode.
HTTP
If your cloud provider supports it, you can use a Service in LoadBalancer mode to set up external HTTP / HTTPS reverse proxying, forwarded to the Endpoints of the Service.
Note: You can also use Ingress in place of Service to expose HTTP/HTTPS Services.
PROXY protocol
If your cloud provider supports it, you can use a Service in LoadBalancer mode to configure a load balancer outside of Kubernetes itself, that will forward connections prefixed with PROXY protocol.
The load balancer will send an initial series of octets describing the incoming connection, similar to this example
PROXY TCP4 192.0.2.202 10.0.42.7 12345 7\r\n
followed by the data from the client.
What's next
- Read Connecting Applications with Services
- Read about Ingress
- Read about EndpointSlices
5.2 - Topology-aware traffic routing with topology keys
Kubernetes v1.21 [deprecated]
Note: This feature, specifically the alphatopologyKeys
API, is deprecated since Kubernetes v1.21. Topology Aware Hints, introduced in Kubernetes v1.21, provide similar functionality.
Service Topology enables a service to route traffic based upon the Node topology of the cluster. For example, a service can specify that traffic be preferentially routed to endpoints that are on the same Node as the client, or in the same availability zone.
Topology-aware traffic routing
By default, traffic sent to a ClusterIP
or NodePort
Service may be routed to
any backend address for the Service. Kubernetes 1.7 made it possible to
route "external" traffic to the Pods running on the same Node that received the
traffic. For ClusterIP
Services, the equivalent same-node preference for
routing wasn't possible; nor could you configure your cluster to favor routing
to endpoints within the same zone.
By setting topologyKeys
on a Service, you're able to define a policy for routing
traffic based upon the Node labels for the originating and destination Nodes.
The label matching between the source and destination lets you, as a cluster operator, designate sets of Nodes that are "closer" and "farther" from one another. You can define labels to represent whatever metric makes sense for your own requirements. In public clouds, for example, you might prefer to keep network traffic within the same zone, because interzonal traffic has a cost associated with it (and intrazonal traffic typically does not). Other common needs include being able to route traffic to a local Pod managed by a DaemonSet, or directing traffic to Nodes connected to the same top-of-rack switch for the lowest latency.
Using Service Topology
If your cluster has the ServiceTopology
feature gate enabled, you can control Service traffic
routing by specifying the topologyKeys
field on the Service spec. This field
is a preference-order list of Node labels which will be used to sort endpoints
when accessing this Service. Traffic will be directed to a Node whose value for
the first label matches the originating Node's value for that label. If there is
no backend for the Service on a matching Node, then the second label will be
considered, and so forth, until no labels remain.
If no match is found, the traffic will be rejected, as if there were no
backends for the Service at all. That is, endpoints are chosen based on the first
topology key with available backends. If this field is specified and all entries
have no backends that match the topology of the client, the service has no
backends for that client and connections should fail. The special value "*"
may
be used to mean "any topology". This catch-all value, if used, only makes sense
as the last value in the list.
If topologyKeys
is not specified or empty, no topology constraints will be applied.
Consider a cluster with Nodes that are labeled with their hostname, zone name,
and region name. Then you can set the topologyKeys
values of a service to direct
traffic as follows.
- Only to endpoints on the same node, failing if no endpoint exists on the node:
["kubernetes.io/hostname"]
. - Preferentially to endpoints on the same node, falling back to endpoints in the
same zone, followed by the same region, and failing otherwise:
["kubernetes.io/hostname", "topology.kubernetes.io/zone", "topology.kubernetes.io/region"]
. This may be useful, for example, in cases where data locality is critical. - Preferentially to the same zone, but fallback on any available endpoint if
none are available within this zone:
["topology.kubernetes.io/zone", "*"]
.
Constraints
Service topology is not compatible with
externalTrafficPolicy=Local
, and therefore a Service cannot use both of these features. It is possible to use both features in the same cluster on different Services, only not on the same Service.Valid topology keys are currently limited to
kubernetes.io/hostname
,topology.kubernetes.io/zone
, andtopology.kubernetes.io/region
, but will be generalized to other node labels in the future.Topology keys must be valid label keys and at most 16 keys may be specified.
The catch-all value,
"*"
, must be the last value in the topology keys, if it is used.
Examples
The following are common examples of using the Service Topology feature.
Only Node Local Endpoints
A Service that only routes to node local endpoints. If no endpoints exist on the node, traffic is dropped:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: my-app
ports:
- protocol: TCP
port: 80
targetPort: 9376
topologyKeys:
- "kubernetes.io/hostname"
Prefer Node Local Endpoints
A Service that prefers node local Endpoints but falls back to cluster wide endpoints if node local endpoints do not exist:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: my-app
ports:
- protocol: TCP
port: 80
targetPort: 9376
topologyKeys:
- "kubernetes.io/hostname"
- "*"
Only Zonal or Regional Endpoints
A Service that prefers zonal then regional endpoints. If no endpoints exist in either, traffic is dropped.
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: my-app
ports:
- protocol: TCP
port: 80
targetPort: 9376
topologyKeys:
- "topology.kubernetes.io/zone"
- "topology.kubernetes.io/region"
Prefer Node Local, Zonal, then Regional Endpoints
A Service that prefers node local, zonal, then regional endpoints but falls back to cluster wide endpoints.
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: my-app
ports:
- protocol: TCP
port: 80
targetPort: 9376
topologyKeys:
- "kubernetes.io/hostname"
- "topology.kubernetes.io/zone"
- "topology.kubernetes.io/region"
- "*"
What's next
- Read about enabling Service Topology
- Read Connecting Applications with Services
5.3 - DNS for Services and Pods
Kubernetes creates DNS records for services and pods. You can contact services with consistent DNS names instead of IP addresses.
Introduction
Kubernetes DNS schedules a DNS Pod and Service on the cluster, and configures the kubelets to tell individual containers to use the DNS Service's IP to resolve DNS names.
Every Service defined in the cluster (including the DNS server itself) is assigned a DNS name. By default, a client Pod's DNS search list includes the Pod's own namespace and the cluster's default domain.
Namespaces of Services
A DNS query may return different results based on the namespace of the pod making it. DNS queries that don't specify a namespace are limited to the pod's namespace. Access services in other namespaces by specifying it in the DNS query.
For example, consider a pod in a test
namespace. A data
service is in
the prod
namespace.
A query for data
returns no results, because it uses the pod's test
namespace.
A query for data.prod
returns the intended result, because it specifies the
namespace.
DNS queries may be expanded using the pod's /etc/resolv.conf
. Kubelet
sets this file for each pod. For example, a query for just data
may be
expanded to data.test.cluster.local
. The values of the search
option
are used to expand queries. To learn more about DNS queries, see
the resolv.conf
manual page.
nameserver 10.32.0.10
search <namespace>.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
In summary, a pod in the test namespace can successfully resolve either
data.prod
or data.prod.cluster.local
.
DNS Records
What objects get DNS records?
- Services
- Pods
The following sections detail the supported DNS record types and layout that is supported. Any other layout or names or queries that happen to work are considered implementation details and are subject to change without warning. For more up-to-date specification, see Kubernetes DNS-Based Service Discovery.
Services
A/AAAA records
"Normal" (not headless) Services are assigned a DNS A or AAAA record,
depending on the IP family of the service, for a name of the form
my-svc.my-namespace.svc.cluster-domain.example
. This resolves to the cluster IP
of the Service.
"Headless" (without a cluster IP) Services are also assigned a DNS A or AAAA record,
depending on the IP family of the service, for a name of the form
my-svc.my-namespace.svc.cluster-domain.example
. Unlike normal
Services, this resolves to the set of IPs of the pods selected by the Service.
Clients are expected to consume the set or else use standard round-robin
selection from the set.
SRV records
SRV Records are created for named ports that are part of normal or Headless
Services.
For each named port, the SRV record would have the form
_my-port-name._my-port-protocol.my-svc.my-namespace.svc.cluster-domain.example
.
For a regular service, this resolves to the port number and the domain name:
my-svc.my-namespace.svc.cluster-domain.example
.
For a headless service, this resolves to multiple answers, one for each pod
that is backing the service, and contains the port number and the domain name of the pod
of the form auto-generated-name.my-svc.my-namespace.svc.cluster-domain.example
.
Pods
A/AAAA records
In general a pod has the following DNS resolution:
pod-ip-address.my-namespace.pod.cluster-domain.example
.
For example, if a pod in the default
namespace has the IP address 172.17.0.3,
and the domain name for your cluster is cluster.local
, then the Pod has a DNS name:
172-17-0-3.default.pod.cluster.local
.
Any pods created by a Deployment or DaemonSet exposed by a Service have the following DNS resolution available:
pod-ip-address.deployment-name.my-namespace.svc.cluster-domain.example
.
Pod's hostname and subdomain fields
Currently when a pod is created, its hostname is the Pod's metadata.name
value.
The Pod spec has an optional hostname
field, which can be used to specify the
Pod's hostname. When specified, it takes precedence over the Pod's name to be
the hostname of the pod. For example, given a Pod with hostname
set to
"my-host
", the Pod will have its hostname set to "my-host
".
The Pod spec also has an optional subdomain
field which can be used to specify
its subdomain. For example, a Pod with hostname
set to "foo
", and subdomain
set to "bar
", in namespace "my-namespace
", will have the fully qualified
domain name (FQDN) "foo.bar.my-namespace.svc.cluster-domain.example
".
Example:
apiVersion: v1
kind: Service
metadata:
name: default-subdomain
spec:
selector:
name: busybox
clusterIP: None
ports:
- name: foo # Actually, no port is needed.
port: 1234
targetPort: 1234
---
apiVersion: v1
kind: Pod
metadata:
name: busybox1
labels:
name: busybox
spec:
hostname: busybox-1
subdomain: default-subdomain
containers:
- image: busybox:1.28
command:
- sleep
- "3600"
name: busybox
---
apiVersion: v1
kind: Pod
metadata:
name: busybox2
labels:
name: busybox
spec:
hostname: busybox-2
subdomain: default-subdomain
containers:
- image: busybox:1.28
command:
- sleep
- "3600"
name: busybox
If there exists a headless service in the same namespace as the pod and with
the same name as the subdomain, the cluster's DNS Server also returns an A or AAAA
record for the Pod's fully qualified hostname.
For example, given a Pod with the hostname set to "busybox-1
" and the subdomain set to
"default-subdomain
", and a headless Service named "default-subdomain
" in
the same namespace, the pod will see its own FQDN as
"busybox-1.default-subdomain.my-namespace.svc.cluster-domain.example
". DNS serves an
A or AAAA record at that name, pointing to the Pod's IP. Both pods "busybox1
" and
"busybox2
" can have their distinct A or AAAA records.
The Endpoints object can specify the hostname
for any endpoint addresses,
along with its IP.
Note: Because A or AAAA records are not created for Pod names,hostname
is required for the Pod's A or AAAA record to be created. A Pod with nohostname
but withsubdomain
will only create the A or AAAA record for the headless service (default-subdomain.my-namespace.svc.cluster-domain.example
), pointing to the Pod's IP address. Also, Pod needs to become ready in order to have a record unlesspublishNotReadyAddresses=True
is set on the Service.
Pod's setHostnameAsFQDN field
Kubernetes v1.20 [beta]
When a Pod is configured to have fully qualified domain name (FQDN), its hostname is the short hostname. For example, if you have a Pod with the fully qualified domain name busybox-1.default-subdomain.my-namespace.svc.cluster-domain.example
, then by default the hostname
command inside that Pod returns busybox-1
and the hostname --fqdn
command returns the FQDN.
When you set setHostnameAsFQDN: true
in the Pod spec, the kubelet writes the Pod's FQDN into the hostname for that Pod's namespace. In this case, both hostname
and hostname --fqdn
return the Pod's FQDN.
Note:In Linux, the hostname field of the kernel (the
nodename
field ofstruct utsname
) is limited to 64 characters.If a Pod enables this feature and its FQDN is longer than 64 character, it will fail to start. The Pod will remain in
Pending
status (ContainerCreating
as seen bykubectl
) generating error events, such as Failed to construct FQDN from pod hostname and cluster domain, FQDNlong-FQDN
is too long (64 characters is the max, 70 characters requested). One way of improving user experience for this scenario is to create an admission webhook controller to control FQDN size when users create top level objects, for example, Deployment.
Pod's DNS Policy
DNS policies can be set on a per-pod basis. Currently Kubernetes supports the
following pod-specific DNS policies. These policies are specified in the
dnsPolicy
field of a Pod Spec.
- "
Default
": The Pod inherits the name resolution configuration from the node that the pods run on. See related discussion for more details. - "
ClusterFirst
": Any DNS query that does not match the configured cluster domain suffix, such as "www.kubernetes.io
", is forwarded to the upstream nameserver inherited from the node. Cluster administrators may have extra stub-domain and upstream DNS servers configured. See related discussion for details on how DNS queries are handled in those cases. - "
ClusterFirstWithHostNet
": For Pods running with hostNetwork, you should explicitly set its DNS policy "ClusterFirstWithHostNet
". - "
None
": It allows a Pod to ignore DNS settings from the Kubernetes environment. All DNS settings are supposed to be provided using thednsConfig
field in the Pod Spec. See Pod's DNS config subsection below.
Note: "Default" is not the default DNS policy. IfdnsPolicy
is not explicitly specified, then "ClusterFirst" is used.
The example below shows a Pod with its DNS policy set to
"ClusterFirstWithHostNet
" because it has hostNetwork
set to true
.
apiVersion: v1
kind: Pod
metadata:
name: busybox
namespace: default
spec:
containers:
- image: busybox:1.28
command:
- sleep
- "3600"
imagePullPolicy: IfNotPresent
name: busybox
restartPolicy: Always
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
Pod's DNS Config
Pod's DNS Config allows users more control on the DNS settings for a Pod.
The dnsConfig
field is optional and it can work with any dnsPolicy
settings.
However, when a Pod's dnsPolicy
is set to "None
", the dnsConfig
field has
to be specified.
Below are the properties a user can specify in the dnsConfig
field:
nameservers
: a list of IP addresses that will be used as DNS servers for the Pod. There can be at most 3 IP addresses specified. When the Pod'sdnsPolicy
is set to "None
", the list must contain at least one IP address, otherwise this property is optional. The servers listed will be combined to the base nameservers generated from the specified DNS policy with duplicate addresses removed.searches
: a list of DNS search domains for hostname lookup in the Pod. This property is optional. When specified, the provided list will be merged into the base search domain names generated from the chosen DNS policy. Duplicate domain names are removed. Kubernetes allows for at most 6 search domains.options
: an optional list of objects where each object may have aname
property (required) and avalue
property (optional). The contents in this property will be merged to the options generated from the specified DNS policy. Duplicate entries are removed.
The following is an example Pod with custom DNS settings:
apiVersion: v1
kind: Pod
metadata:
namespace: default
name: dns-example
spec:
containers:
- name: test
image: nginx
dnsPolicy: "None"
dnsConfig:
nameservers:
- 1.2.3.4
searches:
- ns1.svc.cluster-domain.example
- my.dns.search.suffix
options:
- name: ndots
value: "2"
- name: edns0
When the Pod above is created, the container test
gets the following contents
in its /etc/resolv.conf
file:
nameserver 1.2.3.4
search ns1.svc.cluster-domain.example my.dns.search.suffix
options ndots:2 edns0
For IPv6 setup, search path and name server should be setup like this:
kubectl exec -it dns-example -- cat /etc/resolv.conf
The output is similar to this:
nameserver fd00:79:30::a
search default.svc.cluster-domain.example svc.cluster-domain.example cluster-domain.example
options ndots:5
Feature availability
The availability of Pod DNS Config and DNS Policy "None
" is shown as below.
k8s version | Feature support |
---|---|
1.14 | Stable |
1.10 | Beta (on by default) |
1.9 | Alpha |
What's next
For guidance on administering DNS configurations, check Configure DNS Service
5.4 - Connecting Applications with Services
The Kubernetes model for connecting containers
Now that you have a continuously running, replicated application you can expose it on a network. Before discussing the Kubernetes approach to networking, it is worthwhile to contrast it with the "normal" way networking works with Docker.
By default, Docker uses host-private networking, so containers can talk to other containers only if they are on the same machine. In order for Docker containers to communicate across nodes, there must be allocated ports on the machine's own IP address, which are then forwarded or proxied to the containers. This obviously means that containers must either coordinate which ports they use very carefully or ports must be allocated dynamically.
Coordinating port allocations across multiple developers or teams that provide containers is very difficult to do at scale, and exposes users to cluster-level issues outside of their control. Kubernetes assumes that pods can communicate with other pods, regardless of which host they land on. Kubernetes gives every pod its own cluster-private IP address, so you do not need to explicitly create links between pods or map container ports to host ports. This means that containers within a Pod can all reach each other's ports on localhost, and all pods in a cluster can see each other without NAT. The rest of this document elaborates on how you can run reliable services on such a networking model.
This guide uses a simple nginx server to demonstrate proof of concept.
Exposing pods to the cluster
We did this in a previous example, but let's do it once again and focus on the networking perspective. Create an nginx Pod, and note that it has a container port specification:
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
selector:
matchLabels:
run: my-nginx
replicas: 2
template:
metadata:
labels:
run: my-nginx
spec:
containers:
- name: my-nginx
image: nginx
ports:
- containerPort: 80
This makes it accessible from any node in your cluster. Check the nodes the Pod is running on:
kubectl apply -f ./run-my-nginx.yaml
kubectl get pods -l run=my-nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE
my-nginx-3800858182-jr4a2 1/1 Running 0 13s 10.244.3.4 kubernetes-minion-905m
my-nginx-3800858182-kna2y 1/1 Running 0 13s 10.244.2.5 kubernetes-minion-ljyd
Check your pods' IPs:
kubectl get pods -l run=my-nginx -o yaml | grep podIP
podIP: 10.244.3.4
podIP: 10.244.2.5
You should be able to ssh into any node in your cluster and curl both IPs. Note that the containers are not using port 80 on the node, nor are there any special NAT rules to route traffic to the pod. This means you can run multiple nginx pods on the same node all using the same containerPort and access them from any other pod or node in your cluster using IP. Like Docker, ports can still be published to the host node's interfaces, but the need for this is radically diminished because of the networking model.
You can read more about how we achieve this if you're curious.
Creating a Service
So we have pods running nginx in a flat, cluster wide, address space. In theory, you could talk to these pods directly, but what happens when a node dies? The pods die with it, and the Deployment will create new ones, with different IPs. This is the problem a Service solves.
A Kubernetes Service is an abstraction which defines a logical set of Pods running somewhere in your cluster, that all provide the same functionality. When created, each Service is assigned a unique IP address (also called clusterIP). This address is tied to the lifespan of the Service, and will not change while the Service is alive. Pods can be configured to talk to the Service, and know that communication to the Service will be automatically load-balanced out to some pod that is a member of the Service.
You can create a Service for your 2 nginx replicas with kubectl expose
:
kubectl expose deployment/my-nginx
service/my-nginx exposed
This is equivalent to kubectl apply -f
the following yaml:
apiVersion: v1
kind: Service
metadata:
name: my-nginx
labels:
run: my-nginx
spec:
ports:
- port: 80
protocol: TCP
selector:
run: my-nginx
This specification will create a Service which targets TCP port 80 on any Pod
with the run: my-nginx
label, and expose it on an abstracted Service port
(targetPort
: is the port the container accepts traffic on, port
: is the
abstracted Service port, which can be any port other pods use to access the
Service).
View Service
API object to see the list of supported fields in service definition.
Check your Service:
kubectl get svc my-nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
my-nginx ClusterIP 10.0.162.149 <none> 80/TCP 21s
As mentioned previously, a Service is backed by a group of Pods. These Pods are
exposed through endpoints
. The Service's selector will be evaluated continuously
and the results will be POSTed to an Endpoints object also named my-nginx
.
When a Pod dies, it is automatically removed from the endpoints, and new Pods
matching the Service's selector will automatically get added to the endpoints.
Check the endpoints, and note that the IPs are the same as the Pods created in
the first step:
kubectl describe svc my-nginx
Name: my-nginx
Namespace: default
Labels: run=my-nginx
Annotations: <none>
Selector: run=my-nginx
Type: ClusterIP
IP: 10.0.162.149
Port: <unset> 80/TCP
Endpoints: 10.244.2.5:80,10.244.3.4:80
Session Affinity: None
Events: <none>
kubectl get ep my-nginx
NAME ENDPOINTS AGE
my-nginx 10.244.2.5:80,10.244.3.4:80 1m
You should now be able to curl the nginx Service on <CLUSTER-IP>:<PORT>
from
any node in your cluster. Note that the Service IP is completely virtual, it
never hits the wire. If you're curious about how this works you can read more
about the service proxy.
Accessing the Service
Kubernetes supports 2 primary modes of finding a Service - environment variables and DNS. The former works out of the box while the latter requires the CoreDNS cluster addon.
Note: If the service environment variables are not desired (because possible clashing with expected program ones, too many variables to process, only using DNS, etc) you can disable this mode by setting theenableServiceLinks
flag tofalse
on the pod spec.
Environment Variables
When a Pod runs on a Node, the kubelet adds a set of environment variables for each active Service. This introduces an ordering problem. To see why, inspect the environment of your running nginx Pods (your Pod name will be different):
kubectl exec my-nginx-3800858182-jr4a2 -- printenv | grep SERVICE
KUBERNETES_SERVICE_HOST=10.0.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443
Note there's no mention of your Service. This is because you created the replicas before the Service. Another disadvantage of doing this is that the scheduler might put both Pods on the same machine, which will take your entire Service down if it dies. We can do this the right way by killing the 2 Pods and waiting for the Deployment to recreate them. This time around the Service exists before the replicas. This will give you scheduler-level Service spreading of your Pods (provided all your nodes have equal capacity), as well as the right environment variables:
kubectl scale deployment my-nginx --replicas=0; kubectl scale deployment my-nginx --replicas=2;
kubectl get pods -l run=my-nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE
my-nginx-3800858182-e9ihh 1/1 Running 0 5s 10.244.2.7 kubernetes-minion-ljyd
my-nginx-3800858182-j4rm4 1/1 Running 0 5s 10.244.3.8 kubernetes-minion-905m
You may notice that the pods have different names, since they are killed and recreated.
kubectl exec my-nginx-3800858182-e9ihh -- printenv | grep SERVICE
KUBERNETES_SERVICE_PORT=443
MY_NGINX_SERVICE_HOST=10.0.162.149
KUBERNETES_SERVICE_HOST=10.0.0.1
MY_NGINX_SERVICE_PORT=80
KUBERNETES_SERVICE_PORT_HTTPS=443
DNS
Kubernetes offers a DNS cluster addon Service that automatically assigns dns names to other Services. You can check if it's running on your cluster:
kubectl get services kube-dns --namespace=kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.0.0.10 <none> 53/UDP,53/TCP 8m
The rest of this section will assume you have a Service with a long lived IP
(my-nginx), and a DNS server that has assigned a name to that IP. Here we use the CoreDNS cluster addon (application name kube-dns
), so you can talk to the Service from any pod in your cluster using standard methods (e.g. gethostbyname()
). If CoreDNS isn't running, you can enable it referring to the CoreDNS README or Installing CoreDNS. Let's run another curl application to test this:
kubectl run curl --image=radial/busyboxplus:curl -i --tty
Waiting for pod default/curl-131556218-9fnch to be running, status is Pending, pod ready: false
Hit enter for command prompt
Then, hit enter and run nslookup my-nginx
:
[ root@curl-131556218-9fnch:/ ]$ nslookup my-nginx
Server: 10.0.0.10
Address 1: 10.0.0.10
Name: my-nginx
Address 1: 10.0.162.149
Securing the Service
Till now we have only accessed the nginx server from within the cluster. Before exposing the Service to the internet, you want to make sure the communication channel is secure. For this, you will need:
- Self signed certificates for https (unless you already have an identity certificate)
- An nginx server configured to use the certificates
- A secret that makes the certificates accessible to pods
You can acquire all these from the nginx https example. This requires having go and make tools installed. If you don't want to install those, then follow the manual steps later. In short:
make keys KEY=/tmp/nginx.key CERT=/tmp/nginx.crt
kubectl create secret tls nginxsecret --key /tmp/nginx.key --cert /tmp/nginx.crt
secret/nginxsecret created
kubectl get secrets
NAME TYPE DATA AGE
default-token-il9rc kubernetes.io/service-account-token 1 1d
nginxsecret kubernetes.io/tls 2 1m
And also the configmap:
kubectl create configmap nginxconfigmap --from-file=default.conf
configmap/nginxconfigmap created
kubectl get configmaps
NAME DATA AGE
nginxconfigmap 1 114s
Following are the manual steps to follow in case you run into problems running make (on windows for example):
# Create a public private key pair
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout /d/tmp/nginx.key -out /d/tmp/nginx.crt -subj "/CN=my-nginx/O=my-nginx"
# Convert the keys to base64 encoding
cat /d/tmp/nginx.crt | base64
cat /d/tmp/nginx.key | base64
Use the output from the previous commands to create a yaml file as follows. The base64 encoded value should all be on a single line.
apiVersion: "v1"
kind: "Secret"
metadata:
name: "nginxsecret"
namespace: "default"
type: kubernetes.io/tls
data:
tls.crt: "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURIekNDQWdlZ0F3SUJBZ0lKQUp5M3lQK0pzMlpJTUEwR0NTcUdTSWIzRFFFQkJRVUFNQ1l4RVRBUEJnTlYKQkFNVENHNW5hVzU0YzNaak1SRXdEd1lEVlFRS0V3aHVaMmx1ZUhOMll6QWVGdzB4TnpFd01qWXdOekEzTVRKYQpGdzB4T0RFd01qWXdOekEzTVRKYU1DWXhFVEFQQmdOVkJBTVRDRzVuYVc1NGMzWmpNUkV3RHdZRFZRUUtFd2h1CloybHVlSE4yWXpDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBSjFxSU1SOVdWM0IKMlZIQlRMRmtobDRONXljMEJxYUhIQktMSnJMcy8vdzZhU3hRS29GbHlJSU94NGUrMlN5ajBFcndCLzlYTnBwbQppeW1CL3JkRldkOXg5UWhBQUxCZkVaTmNiV3NsTVFVcnhBZW50VWt1dk1vLzgvMHRpbGhjc3paenJEYVJ4NEo5Ci82UVRtVVI3a0ZTWUpOWTVQZkR3cGc3dlVvaDZmZ1Voam92VG42eHNVR0M2QURVODBpNXFlZWhNeVI1N2lmU2YKNHZpaXdIY3hnL3lZR1JBRS9mRTRqakxCdmdONjc2SU90S01rZXV3R0ljNDFhd05tNnNTSzRqYUNGeGpYSnZaZQp2by9kTlEybHhHWCtKT2l3SEhXbXNhdGp4WTRaNVk3R1ZoK0QrWnYvcW1mMFgvbVY0Rmo1NzV3ajFMWVBocWtsCmdhSXZYRyt4U1FVQ0F3RUFBYU5RTUU0d0hRWURWUjBPQkJZRUZPNG9OWkI3YXc1OUlsYkROMzhIYkduYnhFVjcKTUI4R0ExVWRJd1FZTUJhQUZPNG9OWkI3YXc1OUlsYkROMzhIYkduYnhFVjdNQXdHQTFVZEV3UUZNQU1CQWY4dwpEUVlKS29aSWh2Y05BUUVGQlFBRGdnRUJBRVhTMW9FU0lFaXdyMDhWcVA0K2NwTHI3TW5FMTducDBvMm14alFvCjRGb0RvRjdRZnZqeE04Tzd2TjB0clcxb2pGSW0vWDE4ZnZaL3k4ZzVaWG40Vm8zc3hKVmRBcStNZC9jTStzUGEKNmJjTkNUekZqeFpUV0UrKzE5NS9zb2dmOUZ3VDVDK3U2Q3B5N0M3MTZvUXRUakViV05VdEt4cXI0Nk1OZWNCMApwRFhWZmdWQTRadkR4NFo3S2RiZDY5eXM3OVFHYmg5ZW1PZ05NZFlsSUswSGt0ejF5WU4vbVpmK3FqTkJqbWZjCkNnMnlwbGQ0Wi8rUUNQZjl3SkoybFIrY2FnT0R4elBWcGxNSEcybzgvTHFDdnh6elZPUDUxeXdLZEtxaUMwSVEKQ0I5T2wwWW5scE9UNEh1b2hSUzBPOStlMm9KdFZsNUIyczRpbDlhZ3RTVXFxUlU9Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K"
tls.key: "LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0tCk1JSUV2UUlCQURBTkJna3Foa2lHOXcwQkFRRUZBQVNDQktjd2dnU2pBZ0VBQW9JQkFRQ2RhaURFZlZsZHdkbFIKd1V5eFpJWmVEZWNuTkFhbWh4d1NpeWF5N1AvOE9ta3NVQ3FCWmNpQ0RzZUh2dGtzbzlCSzhBZi9WemFhWm9zcApnZjYzUlZuZmNmVUlRQUN3WHhHVFhHMXJKVEVGSzhRSHA3VkpMcnpLUC9QOUxZcFlYTE0yYzZ3MmtjZUNmZitrCkU1bEVlNUJVbUNUV09UM3c4S1lPNzFLSWVuNEZJWTZMMDUrc2JGQmd1Z0ExUE5JdWFubm9UTWtlZTRuMG4rTDQKb3NCM01ZUDhtQmtRQlAzeE9JNHl3YjREZXUraURyU2pKSHJzQmlIT05Xc0RadXJFaXVJMmdoY1kxeWIyWHI2UAozVFVOcGNSbC9pVG9zQngxcHJHclk4V09HZVdPeGxZZmcvbWIvNnBuOUYvNWxlQlkrZStjSTlTMkQ0YXBKWUdpCkwxeHZzVWtGQWdNQkFBRUNnZ0VBZFhCK0xkbk8ySElOTGo5bWRsb25IUGlHWWVzZ294RGQwci9hQ1Zkank4dlEKTjIwL3FQWkUxek1yall6Ry9kVGhTMmMwc0QxaTBXSjdwR1lGb0xtdXlWTjltY0FXUTM5SjM0VHZaU2FFSWZWNgo5TE1jUHhNTmFsNjRLMFRVbUFQZytGam9QSFlhUUxLOERLOUtnNXNrSE5pOWNzMlY5ckd6VWlVZWtBL0RBUlBTClI3L2ZjUFBacDRuRWVBZmI3WTk1R1llb1p5V21SU3VKdlNyblBESGtUdW1vVlVWdkxMRHRzaG9reUxiTWVtN3oKMmJzVmpwSW1GTHJqbGtmQXlpNHg0WjJrV3YyMFRrdWtsZU1jaVlMbjk4QWxiRi9DSmRLM3QraTRoMTVlR2ZQegpoTnh3bk9QdlVTaDR2Q0o3c2Q5TmtEUGJvS2JneVVHOXBYamZhRGR2UVFLQmdRRFFLM01nUkhkQ1pKNVFqZWFKClFGdXF4cHdnNzhZTjQyL1NwenlUYmtGcVFoQWtyczJxWGx1MDZBRzhrZzIzQkswaHkzaE9zSGgxcXRVK3NHZVAKOWRERHBsUWV0ODZsY2FlR3hoc0V0L1R6cEdtNGFKSm5oNzVVaTVGZk9QTDhPTm1FZ3MxMVRhUldhNzZxelRyMgphRlpjQ2pWV1g0YnRSTHVwSkgrMjZnY0FhUUtCZ1FEQmxVSUUzTnNVOFBBZEYvL25sQVB5VWs1T3lDdWc3dmVyClUycXlrdXFzYnBkSi9hODViT1JhM05IVmpVM25uRGpHVHBWaE9JeXg5TEFrc2RwZEFjVmxvcG9HODhXYk9lMTAKMUdqbnkySmdDK3JVWUZiRGtpUGx1K09IYnRnOXFYcGJMSHBzUVpsMGhucDBYSFNYVm9CMUliQndnMGEyOFVadApCbFBtWmc2d1BRS0JnRHVIUVV2SDZHYTNDVUsxNFdmOFhIcFFnMU16M2VvWTBPQm5iSDRvZUZKZmcraEppSXlnCm9RN3hqWldVR3BIc3AyblRtcHErQWlSNzdyRVhsdlhtOElVU2FsbkNiRGlKY01Pc29RdFBZNS9NczJMRm5LQTQKaENmL0pWb2FtZm1nZEN0ZGtFMXNINE9MR2lJVHdEbTRpb0dWZGIwMllnbzFyb2htNUpLMUI3MkpBb0dBUW01UQpHNDhXOTVhL0w1eSt5dCsyZ3YvUHM2VnBvMjZlTzRNQ3lJazJVem9ZWE9IYnNkODJkaC8xT2sybGdHZlI2K3VuCnc1YytZUXRSTHlhQmd3MUtpbGhFZDBKTWU3cGpUSVpnQWJ0LzVPbnlDak9OVXN2aDJjS2lrQ1Z2dTZsZlBjNkQKckliT2ZIaHhxV0RZK2Q1TGN1YSt2NzJ0RkxhenJsSlBsRzlOZHhrQ2dZRUF5elIzT3UyMDNRVVV6bUlCRkwzZAp4Wm5XZ0JLSEo3TnNxcGFWb2RjL0d5aGVycjFDZzE2MmJaSjJDV2RsZkI0VEdtUjZZdmxTZEFOOFRwUWhFbUtKCnFBLzVzdHdxNWd0WGVLOVJmMWxXK29xNThRNTBxMmk1NVdUTThoSDZhTjlaMTltZ0FGdE5VdGNqQUx2dFYxdEYKWSs4WFJkSHJaRnBIWll2NWkwVW1VbGc9Ci0tLS0tRU5EIFBSSVZBVEUgS0VZLS0tLS0K"
Now create the secrets using the file:
kubectl apply -f nginxsecrets.yaml
kubectl get secrets
NAME TYPE DATA AGE
default-token-il9rc kubernetes.io/service-account-token 1 1d
nginxsecret kubernetes.io/tls 2 1m
Now modify your nginx replicas to start an https server using the certificate in the secret, and the Service, to expose both ports (80 and 443):
apiVersion: v1
kind: Service
metadata:
name: my-nginx
labels:
run: my-nginx
spec:
type: NodePort
ports:
- port: 8080
targetPort: 80
protocol: TCP
name: http
- port: 443
protocol: TCP
name: https
selector:
run: my-nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
selector:
matchLabels:
run: my-nginx
replicas: 1
template:
metadata:
labels:
run: my-nginx
spec:
volumes:
- name: secret-volume
secret:
secretName: nginxsecret
- name: configmap-volume
configMap:
name: nginxconfigmap
containers:
- name: nginxhttps
image: bprashanth/nginxhttps:1.0
ports:
- containerPort: 443
- containerPort: 80
volumeMounts:
- mountPath: /etc/nginx/ssl
name: secret-volume
- mountPath: /etc/nginx/conf.d
name: configmap-volume
Noteworthy points about the nginx-secure-app manifest:
- It contains both Deployment and Service specification in the same file.
- The nginx server serves HTTP traffic on port 80 and HTTPS traffic on 443, and nginx Service exposes both ports.
- Each container has access to the keys through a volume mounted at
/etc/nginx/ssl
. This is setup before the nginx server is started.
kubectl delete deployments,svc my-nginx; kubectl create -f ./nginx-secure-app.yaml
At this point you can reach the nginx server from any node.
kubectl get pods -o yaml | grep -i podip
podIP: 10.244.3.5
node $ curl -k https://10.244.3.5
...
<h1>Welcome to nginx!</h1>
Note how we supplied the -k
parameter to curl in the last step, this is because we don't know anything about the pods running nginx at certificate generation time,
so we have to tell curl to ignore the CName mismatch. By creating a Service we linked the CName used in the certificate with the actual DNS name used by pods during Service lookup.
Let's test this from a pod (the same secret is being reused for simplicity, the pod only needs nginx.crt to access the Service):
apiVersion: apps/v1
kind: Deployment
metadata:
name: curl-deployment
spec:
selector:
matchLabels:
app: curlpod
replicas: 1
template:
metadata:
labels:
app: curlpod
spec:
volumes:
- name: secret-volume
secret:
secretName: nginxsecret
containers:
- name: curlpod
command:
- sh
- -c
- while true; do sleep 1; done
image: radial/busyboxplus:curl
volumeMounts:
- mountPath: /etc/nginx/ssl
name: secret-volume
kubectl apply -f ./curlpod.yaml
kubectl get pods -l app=curlpod
NAME READY STATUS RESTARTS AGE
curl-deployment-1515033274-1410r 1/1 Running 0 1m
kubectl exec curl-deployment-1515033274-1410r -- curl https://my-nginx --cacert /etc/nginx/ssl/tls.crt
...
<title>Welcome to nginx!</title>
...
Exposing the Service
For some parts of your applications you may want to expose a Service onto an
external IP address. Kubernetes supports two ways of doing this: NodePorts and
LoadBalancers. The Service created in the last section already used NodePort
,
so your nginx HTTPS replica is ready to serve traffic on the internet if your
node has a public IP.
kubectl get svc my-nginx -o yaml | grep nodePort -C 5
uid: 07191fb3-f61a-11e5-8ae5-42010af00002
spec:
clusterIP: 10.0.162.149
ports:
- name: http
nodePort: 31704
port: 8080
protocol: TCP
targetPort: 80
- name: https
nodePort: 32453
port: 443
protocol: TCP
targetPort: 443
selector:
run: my-nginx
kubectl get nodes -o yaml | grep ExternalIP -C 1
- address: 104.197.41.11
type: ExternalIP
allocatable:
--
- address: 23.251.152.56
type: ExternalIP
allocatable:
...
$ curl https://<EXTERNAL-IP>:<NODE-PORT> -k
...
<h1>Welcome to nginx!</h1>
Let's now recreate the Service to use a cloud load balancer. Change the Type
of my-nginx
Service from NodePort
to LoadBalancer
:
kubectl edit svc my-nginx
kubectl get svc my-nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
my-nginx LoadBalancer 10.0.162.149 xx.xxx.xxx.xxx 8080:30163/TCP 21s
curl https://<EXTERNAL-IP> -k
...
<title>Welcome to nginx!</title>
The IP address in the EXTERNAL-IP
column is the one that is available on the public internet. The CLUSTER-IP
is only available inside your
cluster/private cloud network.
Note that on AWS, type LoadBalancer
creates an ELB, which uses a (long)
hostname, not an IP. It's too long to fit in the standard kubectl get svc
output, in fact, so you'll need to do kubectl describe service my-nginx
to
see it. You'll see something like this:
kubectl describe service my-nginx
...
LoadBalancer Ingress: a320587ffd19711e5a37606cf4a74574-1142138393.us-east-1.elb.amazonaws.com
...
What's next
- Learn more about Using a Service to Access an Application in a Cluster
- Learn more about Connecting a Front End to a Back End Using a Service
- Learn more about Creating an External Load Balancer
5.5 - Ingress
Kubernetes v1.19 [stable]
An API object that manages external access to the services in a cluster, typically HTTP.
Ingress may provide load balancing, SSL termination and name-based virtual hosting.
Terminology
For clarity, this guide defines the following terms:
- Node: A worker machine in Kubernetes, part of a cluster.
- Cluster: A set of Nodes that run containerized applications managed by Kubernetes. For this example, and in most common Kubernetes deployments, nodes in the cluster are not part of the public internet.
- Edge router: A router that enforces the firewall policy for your cluster. This could be a gateway managed by a cloud provider or a physical piece of hardware.
- Cluster network: A set of links, logical or physical, that facilitate communication within a cluster according to the Kubernetes networking model.
- Service: A Kubernetes Service that identifies a set of Pods using label selectors. Unless mentioned otherwise, Services are assumed to have virtual IPs only routable within the cluster network.
What is Ingress?
Ingress exposes HTTP and HTTPS routes from outside the cluster to services within the cluster. Traffic routing is controlled by rules defined on the Ingress resource.
Here is a simple example where an Ingress sends all its traffic to one Service:
load balancer .->ingress[Ingress]; ingress-->|routing rule|service[Service]; subgraph cluster ingress; service-->pod1[Pod]; service-->pod2[Pod]; end classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000; classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff; classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5; class ingress,service,pod1,pod2 k8s; class client plain; class cluster cluster;
An Ingress may be configured to give Services externally-reachable URLs, load balance traffic, terminate SSL / TLS, and offer name-based virtual hosting. An Ingress controller is responsible for fulfilling the Ingress, usually with a load balancer, though it may also configure your edge router or additional frontends to help handle the traffic.
An Ingress does not expose arbitrary ports or protocols. Exposing services other than HTTP and HTTPS to the internet typically uses a service of type Service.Type=NodePort or Service.Type=LoadBalancer.
Prerequisites
You must have an Ingress controller to satisfy an Ingress. Only creating an Ingress resource has no effect.
You may need to deploy an Ingress controller such as ingress-nginx. You can choose from a number of Ingress controllers.
Ideally, all Ingress controllers should fit the reference specification. In reality, the various Ingress controllers operate slightly differently.
Note: Make sure you review your Ingress controller's documentation to understand the caveats of choosing it.
The Ingress resource
A minimal Ingress resource example:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: minimal-ingress
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
spec:
rules:
- http:
paths:
- path: /testpath
pathType: Prefix
backend:
service:
name: test
port:
number: 80
As with all other Kubernetes resources, an Ingress needs apiVersion
, kind
, and metadata
fields.
The name of an Ingress object must be a valid
DNS subdomain name.
For general information about working with config files, see deploying applications, configuring containers, managing resources.
Ingress frequently uses annotations to configure some options depending on the Ingress controller, an example of which
is the rewrite-target annotation.
Different Ingress controller support different annotations. Review the documentation for
your choice of Ingress controller to learn which annotations are supported.
The Ingress spec has all the information needed to configure a load balancer or proxy server. Most importantly, it contains a list of rules matched against all incoming requests. Ingress resource only supports rules for directing HTTP(S) traffic.
Ingress rules
Each HTTP rule contains the following information:
- An optional host. In this example, no host is specified, so the rule applies to all inbound HTTP traffic through the IP address specified. If a host is provided (for example, foo.bar.com), the rules apply to that host.
- A list of paths (for example,
/testpath
), each of which has an associated backend defined with aservice.name
and aservice.port.name
orservice.port.number
. Both the host and path must match the content of an incoming request before the load balancer directs traffic to the referenced Service. - A backend is a combination of Service and port names as described in the Service doc or a custom resource backend by way of a CRD. HTTP (and HTTPS) requests to the Ingress that matches the host and path of the rule are sent to the listed backend.
A defaultBackend
is often configured in an Ingress controller to service any requests that do not
match a path in the spec.
DefaultBackend
An Ingress with no rules sends all traffic to a single default backend. The defaultBackend
is conventionally a configuration option
of the Ingress controller and is not specified in your Ingress resources.
If none of the hosts or paths match the HTTP request in the Ingress objects, the traffic is routed to your default backend.
Resource backends
A Resource
backend is an ObjectRef to another Kubernetes resource within the
same namespace as the Ingress object. A Resource
is a mutually exclusive
setting with Service, and will fail validation if both are specified. A common
usage for a Resource
backend is to ingress data to an object storage backend
with static assets.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ingress-resource-backend
spec:
defaultBackend:
resource:
apiGroup: k8s.example.com
kind: StorageBucket
name: static-assets
rules:
- http:
paths:
- path: /icons
pathType: ImplementationSpecific
backend:
resource:
apiGroup: k8s.example.com
kind: StorageBucket
name: icon-assets
After creating the Ingress above, you can view it with the following command:
kubectl describe ingress ingress-resource-backend
Name: ingress-resource-backend
Namespace: default
Address:
Default backend: APIGroup: k8s.example.com, Kind: StorageBucket, Name: static-assets
Rules:
Host Path Backends
---- ---- --------
*
/icons APIGroup: k8s.example.com, Kind: StorageBucket, Name: icon-assets
Annotations: <none>
Events: <none>
Path types
Each path in an Ingress is required to have a corresponding path type. Paths
that do not include an explicit pathType
will fail validation. There are three
supported path types:
ImplementationSpecific
: With this path type, matching is up to the IngressClass. Implementations can treat this as a separatepathType
or treat it identically toPrefix
orExact
path types.Exact
: Matches the URL path exactly and with case sensitivity.Prefix
: Matches based on a URL path prefix split by/
. Matching is case sensitive and done on a path element by element basis. A path element refers to the list of labels in the path split by the/
separator. A request is a match for path p if every p is an element-wise prefix of p of the request path.Note: If the last element of the path is a substring of the last element in request path, it is not a match (for example:/foo/bar
matches/foo/bar/baz
, but does not match/foo/barbaz
).
Examples
Kind | Path(s) | Request path(s) | Matches? |
---|---|---|---|
Prefix | / | (all paths) | Yes |
Exact | /foo | /foo | Yes |
Exact | /foo | /bar | No |
Exact | /foo | /foo/ | No |
Exact | /foo/ | /foo | No |
Prefix | /foo | /foo , /foo/ | Yes |
Prefix | /foo/ | /foo , /foo/ | Yes |
Prefix | /aaa/bb | /aaa/bbb | No |
Prefix | /aaa/bbb | /aaa/bbb | Yes |
Prefix | /aaa/bbb/ | /aaa/bbb | Yes, ignores trailing slash |
Prefix | /aaa/bbb | /aaa/bbb/ | Yes, matches trailing slash |
Prefix | /aaa/bbb | /aaa/bbb/ccc | Yes, matches subpath |
Prefix | /aaa/bbb | /aaa/bbbxyz | No, does not match string prefix |
Prefix | / , /aaa | /aaa/ccc | Yes, matches /aaa prefix |
Prefix | / , /aaa , /aaa/bbb | /aaa/bbb | Yes, matches /aaa/bbb prefix |
Prefix | / , /aaa , /aaa/bbb | /ccc | Yes, matches / prefix |
Prefix | /aaa | /ccc | No, uses default backend |
Mixed | /foo (Prefix), /foo (Exact) | /foo | Yes, prefers Exact |
Multiple matches
In some cases, multiple paths within an Ingress will match a request. In those cases precedence will be given first to the longest matching path. If two paths are still equally matched, precedence will be given to paths with an exact path type over prefix path type.
Hostname wildcards
Hosts can be precise matches (for example “foo.bar.com
”) or a wildcard (for
example “*.foo.com
”). Precise matches require that the HTTP host
header
matches the host
field. Wildcard matches require the HTTP host
header is
equal to the suffix of the wildcard rule.
Host | Host header | Match? |
---|---|---|
*.foo.com | bar.foo.com | Matches based on shared suffix |
*.foo.com | baz.bar.foo.com | No match, wildcard only covers a single DNS label |
*.foo.com | foo.com | No match, wildcard only covers a single DNS label |
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ingress-wildcard-host
spec:
rules:
- host: "foo.bar.com"
http:
paths:
- pathType: Prefix
path: "/bar"
backend:
service:
name: service1
port:
number: 80
- host: "*.foo.com"
http:
paths:
- pathType: Prefix
path: "/foo"
backend:
service:
name: service2
port:
number: 80
Ingress class
Ingresses can be implemented by different controllers, often with different configuration. Each Ingress should specify a class, a reference to an IngressClass resource that contains additional configuration including the name of the controller that should implement the class.
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: external-lb
spec:
controller: example.com/ingress-controller
parameters:
apiGroup: k8s.example.com
kind: IngressParameters
name: external-lb
IngressClass resources contain an optional parameters field. This can be used to reference additional implementation-specific configuration for this class.
Namespace-scoped parameters
Kubernetes v1.21 [alpha]
Parameters
field has a scope
and namespace
field that can be used to
reference a namespace-specific resource for configuration of an Ingress class.
Scope
field defaults to Cluster
, meaning, the default is cluster-scoped
resource. Setting Scope
to Namespace
and setting the Namespace
field
will reference a paramters resource in a specific namespace:
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: external-lb
spec:
controller: example.com/ingress-controller
parameters:
apiGroup: k8s.example.com
kind: IngressParameters
name: external-lb
namespace: external-configuration
scope: Namespace
Deprecated annotation
Before the IngressClass resource and ingressClassName
field were added in
Kubernetes 1.18, Ingress classes were specified with a
kubernetes.io/ingress.class
annotation on the Ingress. This annotation was
never formally defined, but was widely supported by Ingress controllers.
The newer ingressClassName
field on Ingresses is a replacement for that
annotation, but is not a direct equivalent. While the annotation was generally
used to reference the name of the Ingress controller that should implement the
Ingress, the field is a reference to an IngressClass resource that contains
additional Ingress configuration, including the name of the Ingress controller.
Default IngressClass
You can mark a particular IngressClass as default for your cluster. Setting the
ingressclass.kubernetes.io/is-default-class
annotation to true
on an
IngressClass resource will ensure that new Ingresses without an
ingressClassName
field specified will be assigned this default IngressClass.
Caution: If you have more than one IngressClass marked as the default for your cluster, the admission controller prevents creating new Ingress objects that don't have aningressClassName
specified. You can resolve this by ensuring that at most 1 IngressClass is marked as default in your cluster.
Types of Ingress
Ingress backed by a single Service
There are existing Kubernetes concepts that allow you to expose a single Service (see alternatives). You can also do this with an Ingress by specifying a default backend with no rules.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: test-ingress
spec:
defaultBackend:
service:
name: test
port:
number: 80
If you create it using kubectl apply -f
you should be able to view the state
of the Ingress you added:
kubectl get ingress test-ingress
NAME CLASS HOSTS ADDRESS PORTS AGE
test-ingress external-lb * 203.0.113.123 80 59s
Where 203.0.113.123
is the IP allocated by the Ingress controller to satisfy
this Ingress.
Note: Ingress controllers and load balancers may take a minute or two to allocate an IP address. Until that time, you often see the address listed as<pending>
.
Simple fanout
A fanout configuration routes traffic from a single IP address to more than one Service, based on the HTTP URI being requested. An Ingress allows you to keep the number of load balancers down to a minimum. For example, a setup like:
load balancer .->ingress[Ingress, 178.91.123.132]; ingress-->|/foo|service1[Service service1:4200]; ingress-->|/bar|service2[Service service2:8080]; subgraph cluster ingress; service1-->pod1[Pod]; service1-->pod2[Pod]; service2-->pod3[Pod]; service2-->pod4[Pod]; end classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000; classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff; classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5; class ingress,service1,service2,pod1,pod2,pod3,pod4 k8s; class client plain; class cluster cluster;
would require an Ingress such as:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: simple-fanout-example
spec:
rules:
- host: foo.bar.com
http:
paths:
- path: /foo
pathType: Prefix
backend:
service:
name: service1
port:
number: 4200
- path: /bar
pathType: Prefix
backend:
service:
name: service2
port:
number: 8080
When you create the Ingress with kubectl apply -f
:
kubectl describe ingress simple-fanout-example
Name: simple-fanout-example
Namespace: default
Address: 178.91.123.132
Default backend: default-http-backend:80 (10.8.2.3:8080)
Rules:
Host Path Backends
---- ---- --------
foo.bar.com
/foo service1:4200 (10.8.0.90:4200)
/bar service2:8080 (10.8.0.91:8080)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ADD 22s loadbalancer-controller default/test
The Ingress controller provisions an implementation-specific load balancer
that satisfies the Ingress, as long as the Services (service1
, service2
) exist.
When it has done so, you can see the address of the load balancer at the
Address field.
Note: Depending on the Ingress controller you are using, you may need to create a default-http-backend Service.
Name based virtual hosting
Name-based virtual hosts support routing HTTP traffic to multiple host names at the same IP address.
load balancer .->ingress[Ingress, 178.91.123.132]; ingress-->|Host: foo.bar.com|service1[Service service1:80]; ingress-->|Host: bar.foo.com|service2[Service service2:80]; subgraph cluster ingress; service1-->pod1[Pod]; service1-->pod2[Pod]; service2-->pod3[Pod]; service2-->pod4[Pod]; end classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000; classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff; classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5; class ingress,service1,service2,pod1,pod2,pod3,pod4 k8s; class client plain; class cluster cluster;
The following Ingress tells the backing load balancer to route requests based on the Host header.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: name-virtual-host-ingress
spec:
rules:
- host: foo.bar.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service1
port:
number: 80
- host: bar.foo.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service2
port:
number: 80
If you create an Ingress resource without any hosts defined in the rules, then any web traffic to the IP address of your Ingress controller can be matched without a name based virtual host being required.
For example, the following Ingress routes traffic
requested for first.bar.com
to service1
, second.bar.com
to service2
, and any traffic
to the IP address without a hostname defined in request (that is, without a request header being
presented) to service3
.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: name-virtual-host-ingress-no-third-host
spec:
rules:
- host: first.bar.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service1
port:
number: 80
- host: second.bar.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service2
port:
number: 80
- http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service3
port:
number: 80
TLS
You can secure an Ingress by specifying a Secret
that contains a TLS private key and certificate. The Ingress resource only
supports a single TLS port, 443, and assumes TLS termination at the ingress point
(traffic to the Service and its Pods is in plaintext).
If the TLS configuration section in an Ingress specifies different hosts, they are
multiplexed on the same port according to the hostname specified through the
SNI TLS extension (provided the Ingress controller supports SNI). The TLS secret
must contain keys named tls.crt
and tls.key
that contain the certificate
and private key to use for TLS. For example:
apiVersion: v1
kind: Secret
metadata:
name: testsecret-tls
namespace: default
data:
tls.crt: base64 encoded cert
tls.key: base64 encoded key
type: kubernetes.io/tls
Referencing this secret in an Ingress tells the Ingress controller to
secure the channel from the client to the load balancer using TLS. You need to make
sure the TLS secret you created came from a certificate that contains a Common
Name (CN), also known as a Fully Qualified Domain Name (FQDN) for https-example.foo.com
.
Note: Keep in mind that TLS will not work on the default rule because the certificates would have to be issued for all the possible sub-domains. Therefore,hosts
in thetls
section need to explicitly match thehost
in therules
section.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: tls-example-ingress
spec:
tls:
- hosts:
- https-example.foo.com
secretName: testsecret-tls
rules:
- host: https-example.foo.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: service1
port:
number: 80
Load balancing
An Ingress controller is bootstrapped with some load balancing policy settings that it applies to all Ingress, such as the load balancing algorithm, backend weight scheme, and others. More advanced load balancing concepts (e.g. persistent sessions, dynamic weights) are not yet exposed through the Ingress. You can instead get these features through the load balancer used for a Service.
It's also worth noting that even though health checks are not exposed directly through the Ingress, there exist parallel concepts in Kubernetes such as readiness probes that allow you to achieve the same end result. Please review the controller specific documentation to see how they handle health checks (for example: nginx, or GCE).
Updating an Ingress
To update an existing Ingress to add a new Host, you can update it by editing the resource:
kubectl describe ingress test
Name: test
Namespace: default
Address: 178.91.123.132
Default backend: default-http-backend:80 (10.8.2.3:8080)
Rules:
Host Path Backends
---- ---- --------
foo.bar.com
/foo service1:80 (10.8.0.90:80)
Annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ADD 35s loadbalancer-controller default/test
kubectl edit ingress test
This pops up an editor with the existing configuration in YAML format. Modify it to include the new Host:
spec:
rules:
- host: foo.bar.com
http:
paths:
- backend:
service:
name: service1
port:
number: 80
path: /foo
pathType: Prefix
- host: bar.baz.com
http:
paths:
- backend:
service:
name: service2
port:
number: 80
path: /foo
pathType: Prefix
..
After you save your changes, kubectl updates the resource in the API server, which tells the Ingress controller to reconfigure the load balancer.
Verify this:
kubectl describe ingress test
Name: test
Namespace: default
Address: 178.91.123.132
Default backend: default-http-backend:80 (10.8.2.3:8080)
Rules:
Host Path Backends
---- ---- --------
foo.bar.com
/foo service1:80 (10.8.0.90:80)
bar.baz.com
/foo service2:80 (10.8.0.91:80)
Annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ADD 45s loadbalancer-controller default/test
You can achieve the same outcome by invoking kubectl replace -f
on a modified Ingress YAML file.
Failing across availability zones
Techniques for spreading traffic across failure domains differ between cloud providers. Please check the documentation of the relevant Ingress controller for details.
Alternatives
You can expose a Service in multiple ways that don't directly involve the Ingress resource:
What's next
- Learn about the Ingress API
- Learn about Ingress controllers
- Set up Ingress on Minikube with the NGINX Controller
5.6 - Ingress Controllers
In order for the Ingress resource to work, the cluster must have an ingress controller running.
Unlike other types of controllers which run as part of the kube-controller-manager
binary, Ingress controllers
are not started automatically with a cluster. Use this page to choose the ingress controller implementation
that best fits your cluster.
Kubernetes as a project supports and maintains AWS, GCE, and nginx ingress controllers.
Additional controllers
Caution: This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects. This page follows CNCF website guidelines by listing projects alphabetically. To add a project to this list, read the content guide before submitting a change.
- AKS Application Gateway Ingress Controller is an ingress controller that configures the Azure Application Gateway.
- Ambassador API Gateway is an Envoy-based ingress controller.
- Apache APISIX ingress controller is an Apache APISIX-based ingress controller.
- Avi Kubernetes Operator provides L4-L7 load-balancing using VMware NSX Advanced Load Balancer.
- The Citrix ingress controller works with Citrix Application Delivery Controller.
- Contour is an Envoy based ingress controller.
- EnRoute is an Envoy based API gateway that can run as an ingress controller.
- F5 BIG-IP Container Ingress Services for Kubernetes lets you use an Ingress to configure F5 BIG-IP virtual servers.
- Gloo is an open-source ingress controller based on Envoy, which offers API gateway functionality.
- HAProxy Ingress is an ingress controller for HAProxy.
- The HAProxy Ingress Controller for Kubernetes is also an ingress controller for HAProxy.
- Istio Ingress is an Istio based ingress controller.
- The Kong Ingress Controller for Kubernetes is an ingress controller driving Kong Gateway.
- The NGINX Ingress Controller for Kubernetes works with the NGINX webserver (as a proxy).
- Skipper HTTP router and reverse proxy for service composition, including use cases like Kubernetes Ingress, designed as a library to build your custom proxy.
- The Traefik Kubernetes Ingress provider is an ingress controller for the Traefik proxy.
- Tyk Operator extends Ingress with Custom Resources to bring API Management capabilities to Ingress. Tyk Operator works with the Open Source Tyk Gateway & Tyk Cloud control plane.
- Voyager is an ingress controller for HAProxy.
Using multiple Ingress controllers
You may deploy any number of ingress controllers
within a cluster. When you create an ingress, you should annotate each ingress with the appropriate
ingress.class
to indicate which ingress controller should be used if more than one exists within your cluster.
If you do not define a class, your cloud provider may use a default ingress controller.
Ideally, all ingress controllers should fulfill this specification, but the various ingress controllers operate slightly differently.
Note: Make sure you review your ingress controller's documentation to understand the caveats of choosing it.
What's next
- Learn more about Ingress.
- Set up Ingress on Minikube with the NGINX Controller.
5.7 - EndpointSlices
Kubernetes v1.21 [stable]
EndpointSlices provide a simple way to track network endpoints within a Kubernetes cluster. They offer a more scalable and extensible alternative to Endpoints.
Motivation
The Endpoints API has provided a simple and straightforward way of tracking network endpoints in Kubernetes. Unfortunately as Kubernetes clusters and Services have grown to handle and send more traffic to more backend Pods, limitations of that original API became more visible. Most notably, those included challenges with scaling to larger numbers of network endpoints.
Since all network endpoints for a Service were stored in a single Endpoints resource, those resources could get quite large. That affected the performance of Kubernetes components (notably the master control plane) and resulted in significant amounts of network traffic and processing when Endpoints changed. EndpointSlices help you mitigate those issues as well as provide an extensible platform for additional features such as topological routing.
EndpointSlice resources
In Kubernetes, an EndpointSlice contains references to a set of network endpoints. The control plane automatically creates EndpointSlices for any Kubernetes Service that has a selector specified. These EndpointSlices include references to all the Pods that match the Service selector. EndpointSlices group network endpoints together by unique combinations of protocol, port number, and Service name. The name of a EndpointSlice object must be a valid DNS subdomain name.
As an example, here's a sample EndpointSlice resource for the example
Kubernetes Service.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
name: example-abc
labels:
kubernetes.io/service-name: example
addressType: IPv4
ports:
- name: http
protocol: TCP
port: 80
endpoints:
- addresses:
- "10.1.2.3"
conditions:
ready: true
hostname: pod-1
nodeName: node-1
zone: us-west2-a
By default, the control plane creates and manages EndpointSlices to have no
more than 100 endpoints each. You can configure this with the
--max-endpoints-per-slice
kube-controller-manager
flag, up to a maximum of 1000.
EndpointSlices can act as the source of truth for kube-proxy when it comes to how to route internal traffic. When enabled, they should provide a performance improvement for services with large numbers of endpoints.
Address types
EndpointSlices support three address types:
- IPv4
- IPv6
- FQDN (Fully Qualified Domain Name)
Conditions
The EndpointSlice API stores conditions about endpoints that may be useful for consumers.
The three conditions are ready
, serving
, and terminating
.
Ready
ready
is a condition that maps to a Pod's Ready
condition. A running Pod with the Ready
condition set to True
should have this EndpointSlice condition also set to true
. For
compatibility reasons, ready
is NEVER true
when a Pod is terminating. Consumers should refer
to the serving
condition to inspect the readiness of terminating Pods. The only exception to
this rule is for Services with spec.publishNotReadyAddresses
set to true
. Endpoints for these
Services will always have the ready
condition set to true
.
Serving
Kubernetes v1.20 [alpha]
serving
is identical to the ready
condition, except it does not account for terminating states.
Consumers of the EndpointSlice API should check this condition if they care about pod readiness while
the pod is also terminating.
Note: Althoughserving
is almost identical toready
, it was added to prevent break the existing meaning ofready
. It may be unexpected for existing clients ifready
could betrue
for terminating endpoints, since historically terminating endpoints were never included in the Endpoints or EndpointSlice API to begin with. For this reason,ready
is alwaysfalse
for terminating endpoints, and a new conditionserving
was added in v1.20 so that clients can track readiness for terminating pods independent of the existing semantics forready
.
Terminating
Kubernetes v1.20 [alpha]
Terminating
is a condition that indicates whether an endpoint is terminating.
For pods, this is any pod that has a deletion timestamp set.
Topology information
Each endpoint within an EndpointSlice can contain relevant topology information. The topology information includes the location of the endpoint and information about the corresponding Node and zone. These are available in the following per endpoint fields on EndpointSlices:
nodeName
- The name of the Node this endpoint is on.zone
- The zone this endpoint is in.
Note:In the v1 API, the per endpoint
topology
was effectively removed in favor of the dedicated fieldsnodeName
andzone
.Setting arbitrary topology fields on the
endpoint
field of anEndpointSlice
resource has been deprecated and is not be supported in the v1 API. Instead, the v1 API supports setting individualnodeName
andzone
fields. These fields are automatically translated between API versions. For example, the value of the"topology.kubernetes.io/zone"
key in thetopology
field in the v1beta1 API is accessible as thezone
field in the v1 API.
Management
Most often, the control plane (specifically, the endpoint slice controller) creates and manages EndpointSlice objects. There are a variety of other use cases for EndpointSlices, such as service mesh implementations, that could result in other entities or controllers managing additional sets of EndpointSlices.
To ensure that multiple entities can manage EndpointSlices without interfering
with each other, Kubernetes defines the
label
endpointslice.kubernetes.io/managed-by
, which indicates the entity managing
an EndpointSlice.
The endpoint slice controller sets endpointslice-controller.k8s.io
as the value
for this label on all EndpointSlices it manages. Other entities managing
EndpointSlices should also set a unique value for this label.
Ownership
In most use cases, EndpointSlices are owned by the Service that the endpoint
slice object tracks endpoints for. This ownership is indicated by an owner
reference on each EndpointSlice as well as a kubernetes.io/service-name
label that enables simple lookups of all EndpointSlices belonging to a Service.
EndpointSlice mirroring
In some cases, applications create custom Endpoints resources. To ensure that these applications do not need to concurrently write to both Endpoints and EndpointSlice resources, the cluster's control plane mirrors most Endpoints resources to corresponding EndpointSlices.
The control plane mirrors Endpoints resources unless:
- the Endpoints resource has a
endpointslice.kubernetes.io/skip-mirror
label set totrue
. - the Endpoints resource has a
control-plane.alpha.kubernetes.io/leader
annotation. - the corresponding Service resource does not exist.
- the corresponding Service resource has a non-nil selector.
Individual Endpoints resources may translate into multiple EndpointSlices. This will occur if an Endpoints resource has multiple subsets or includes endpoints with multiple IP families (IPv4 and IPv6). A maximum of 1000 addresses per subset will be mirrored to EndpointSlices.
Distribution of EndpointSlices
Each EndpointSlice has a set of ports that applies to all endpoints within the resource. When named ports are used for a Service, Pods may end up with different target port numbers for the same named port, requiring different EndpointSlices. This is similar to the logic behind how subsets are grouped with Endpoints.
The control plane tries to fill EndpointSlices as full as possible, but does not actively rebalance them. The logic is fairly straightforward:
- Iterate through existing EndpointSlices, remove endpoints that are no longer desired and update matching endpoints that have changed.
- Iterate through EndpointSlices that have been modified in the first step and fill them up with any new endpoints needed.
- If there's still new endpoints left to add, try to fit them into a previously unchanged slice and/or create new ones.
Importantly, the third step prioritizes limiting EndpointSlice updates over a perfectly full distribution of EndpointSlices. As an example, if there are 10 new endpoints to add and 2 EndpointSlices with room for 5 more endpoints each, this approach will create a new EndpointSlice instead of filling up the 2 existing EndpointSlices. In other words, a single EndpointSlice creation is preferrable to multiple EndpointSlice updates.
With kube-proxy running on each Node and watching EndpointSlices, every change to an EndpointSlice becomes relatively expensive since it will be transmitted to every Node in the cluster. This approach is intended to limit the number of changes that need to be sent to every Node, even if it may result with multiple EndpointSlices that are not full.
In practice, this less than ideal distribution should be rare. Most changes processed by the EndpointSlice controller will be small enough to fit in an existing EndpointSlice, and if not, a new EndpointSlice is likely going to be necessary soon anyway. Rolling updates of Deployments also provide a natural repacking of EndpointSlices with all Pods and their corresponding endpoints getting replaced.
Duplicate endpoints
Due to the nature of EndpointSlice changes, endpoints may be represented in more
than one EndpointSlice at the same time. This naturally occurs as changes to
different EndpointSlice objects can arrive at the Kubernetes client watch/cache
at different times. Implementations using EndpointSlice must be able to have the
endpoint appear in more than one slice. A reference implementation of how to
perform endpoint deduplication can be found in the EndpointSliceCache
implementation in kube-proxy
.
What's next
- Learn about Enabling EndpointSlices
- Read Connecting Applications with Services
5.8 - Service Internal Traffic Policy
Kubernetes v1.21 [alpha]
Service Internal Traffic Policy enables internal traffic restrictions to only route internal traffic to endpoints within the node the traffic originated from. The "internal" traffic here refers to traffic originated from Pods in the current cluster. This can help to reduce costs and improve performance.
Using Service Internal Traffic Policy
Once you have enabled the ServiceInternalTrafficPolicy
feature gate,
you can enable an internal-only traffic policy for a
Services, by setting its
.spec.internalTrafficPolicy
to Local
.
This tells kube-proxy to only use node local endpoints for cluster internal traffic.
Note: For pods on nodes with no endpoints for a given Service, the Service behaves as if it has zero endpoints (for Pods on this node) even if the service does have endpoints on other nodes.
The following example shows what a Service looks like when you set
.spec.internalTrafficPolicy
to Local
:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
targetPort: 9376
internalTrafficPolicy: Local
How it works
The kube-proxy filters the endpoints it routes to based on the
spec.internalTrafficPolicy
setting. When it's set to Local
, only node local
endpoints are considered. When it's Cluster
or missing, all endpoints are
considered.
When the feature gate
ServiceInternalTrafficPolicy
is enabled, spec.internalTrafficPolicy
defaults to "Cluster".
Constraints
- Service Internal Traffic Policy is not used when
externalTrafficPolicy
is set toLocal
on a Service. It is possible to use both features in the same cluster on different Services, just not on the same Service.
What's next
- Read about enabling Topology Aware Hints
- Read about Service External Traffic Policy
- Read Connecting Applications with Services
5.9 - Topology Aware Hints
Kubernetes v1.21 [alpha]
Topology Aware Hints enable topology aware routing by including suggestions for how clients should consume endpoints. This approach adds metadata to enable consumers of EndpointSlice and / or and Endpoints objects, so that traffic to those network endpoints can be routed closer to where it originated.
For example, you can route traffic within a locality to reduce costs, or to improve network performance.
Motivation
Kubernetes clusters are increasingly deployed in multi-zone environments. Topology Aware Hints provides a mechanism to help keep traffic within the zone it originated from. This concept is commonly referred to as "Topology Aware Routing". When calculating the endpoints for a Service, the EndpointSlice controller considers the topology (region and zone) of each endpoint and populates the hints field to allocate it to a zone. Cluster components such as the kube-proxy can then consume those hints, and use them to influence how traffic to is routed (favoring topologically closer endpoints).
Using Topology Aware Hints
If you have enabled the
overall feature, you can activate Topology Aware Hints for a Service by setting the
service.kubernetes.io/topology-aware-hints
annotation to auto
. This tells
the EndpointSlice controller to set topology hints if it is deemed safe.
Importantly, this does not guarantee that hints will always be set.
How it works
The functionality enabling this feature is split into two components: The EndpointSlice controller and the kube-proxy. This section provides a high level overview of how each component implements this feature.
EndpointSlice controller
The EndpointSlice controller is responsible for setting hints on EndpointSlices when this feature is enabled. The controller allocates a proportional amount of endpoints to each zone. This proportion is based on the allocatable CPU cores for nodes running in that zone. For example, if one zone had 2 CPU cores and another zone only had 1 CPU core, the controller would allocated twice as many endpoints to the zone with 2 CPU cores.
The following example shows what an EndpointSlice looks like when hints have been populated:
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
name: example-hints
labels:
kubernetes.io/service-name: example-svc
addressType: IPv4
ports:
- name: http
protocol: TCP
port: 80
endpoints:
- addresses:
- "10.1.2.3"
conditions:
ready: true
hostname: pod-1
zone: zone-a
hints:
forZones:
- name: "zone-a"
kube-proxy
The kube-proxy component filters the endpoints it routes to based on the hints set by the EndpointSlice controller. In most cases, this means that the kube-proxy is able to route traffic to endpoints in the same zone. Sometimes the controller allocates endpoints from a different zone to ensure more even distribution of endpoints between zones. This would result in some traffic being routed to other zones.
Safeguards
The Kubernetes control plane and the kube-proxy on each node apply some safeguard rules before using Topology Aware Hints. If these don't check out, the kube-proxy selects endpoints from anywhere in your cluster, regardless of the zone.
Insufficient number of endpoints: If there are less endpoints than zones in a cluster, the controller will not assign any hints.
Impossible to achieve balanced allocation: In some cases, it will be impossible to achieve a balanced allocation of endpoints among zones. For example, if zone-a is twice as large as zone-b, but there are only 2 endpoints, an endpoint allocated to zone-a may receive twice as much traffic as zone-b. The controller does not assign hints if it can't get this "expected overload" value below an acceptable threshold for each zone. Importantly this is not based on real-time feedback. It is still possible for individual endpoints to become overloaded.
One or more Nodes has insufficient information: If any node does not have a
topology.kubernetes.io/zone
label or is not reporting a value for allocatable CPU, the control plane does not set any topology-aware endpoint hints and so kube-proxy does not filter endpoints by zone.One or more endpoints does not have a zone hint: When this happens, the kube-proxy assumes that a transition from or to Topology Aware Hints is underway. Filtering endpoints for a Service in this state would be dangerous so the kube-proxy falls back to using all endpoints.
A zone is not represented in hints: If the kube-proxy is unable to find at least one endpoint with a hint targeting the zone it is running in, it falls to using endpoints from all zones. This is most likely to happen as you add a new zone into your existing cluster.
Constraints
Topology Aware Hints are not used when either
externalTrafficPolicy
orinternalTrafficPolicy
is set toLocal
on a Service. It is possible to use both features in the same cluster on different Services, just not on the same Service.This approach will not work well for Services that have a large proportion of traffic originating from a subset of zones. Instead this assumes that incoming traffic will be roughly proportional to the capacity of the Nodes in each zone.
The EndpointSlice controller ignores unready nodes as it calculates the proportions of each zone. This could have unintended consequences if a large portion of nodes are unready.
The EndpointSlice controller does not take into account tolerations when deploying calculating the proportions of each zone. If the Pods backing a Service are limited to a subset of Nodes in the cluster, this will not be taken into account.
This may not work well with autoscaling. For example, if a lot of traffic is originating from a single zone, only the endpoints allocated to that zone will be handling that traffic. That could result in Horizontal Pod Autoscaler either not picking up on this event, or newly added pods starting in a different zone.
What's next
5.10 - Network Policies
If you want to control traffic flow at the IP address or port level (OSI layer 3 or 4), then you might consider using Kubernetes NetworkPolicies for particular applications in your cluster. NetworkPolicies are an application-centric construct which allow you to specify how a pod is allowed to communicate with various network "entities" (we use the word "entity" here to avoid overloading the more common terms such as "endpoints" and "services", which have specific Kubernetes connotations) over the network.
The entities that a Pod can communicate with are identified through a combination of the following 3 identifiers:
- Other pods that are allowed (exception: a pod cannot block access to itself)
- Namespaces that are allowed
- IP blocks (exception: traffic to and from the node where a Pod is running is always allowed, regardless of the IP address of the Pod or the node)
When defining a pod- or namespace- based NetworkPolicy, you use a selector to specify what traffic is allowed to and from the Pod(s) that match the selector.
Meanwhile, when IP based NetworkPolicies are created, we define policies based on IP blocks (CIDR ranges).
Prerequisites
Network policies are implemented by the network plugin. To use network policies, you must be using a networking solution which supports NetworkPolicy. Creating a NetworkPolicy resource without a controller that implements it will have no effect.
Isolated and Non-isolated Pods
By default, pods are non-isolated; they accept traffic from any source.
Pods become isolated by having a NetworkPolicy that selects them. Once there is any NetworkPolicy in a namespace selecting a particular pod, that pod will reject any connections that are not allowed by any NetworkPolicy. (Other pods in the namespace that are not selected by any NetworkPolicy will continue to accept all traffic.)
Network policies do not conflict; they are additive. If any policy or policies select a pod, the pod is restricted to what is allowed by the union of those policies' ingress/egress rules. Thus, order of evaluation does not affect the policy result.
For a network flow between two pods to be allowed, both the egress policy on the source pod and the ingress policy on the destination pod need to allow the traffic. If either the egress policy on the source, or the ingress policy on the destination denies the traffic, the traffic will be denied.
The NetworkPolicy resource
See the NetworkPolicy reference for a full definition of the resource.
An example NetworkPolicy might look like this:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: test-network-policy
namespace: default
spec:
podSelector:
matchLabels:
role: db
policyTypes:
- Ingress
- Egress
ingress:
- from:
- ipBlock:
cidr: 172.17.0.0/16
except:
- 172.17.1.0/24
- namespaceSelector:
matchLabels:
project: myproject
- podSelector:
matchLabels:
role: frontend
ports:
- protocol: TCP
port: 6379
egress:
- to:
- ipBlock:
cidr: 10.0.0.0/24
ports:
- protocol: TCP
port: 5978
Note: POSTing this to the API server for your cluster will have no effect unless your chosen networking solution supports network policy.
Mandatory Fields: As with all other Kubernetes config, a NetworkPolicy
needs apiVersion
, kind
, and metadata
fields. For general information
about working with config files, see
Configure Containers Using a ConfigMap,
and Object Management.
spec: NetworkPolicy spec has all the information needed to define a particular network policy in the given namespace.
podSelector: Each NetworkPolicy includes a podSelector
which selects the grouping of pods to which the policy applies. The example policy selects pods with the label "role=db". An empty podSelector
selects all pods in the namespace.
policyTypes: Each NetworkPolicy includes a policyTypes
list which may include either Ingress
, Egress
, or both. The policyTypes
field indicates whether or not the given policy applies to ingress traffic to selected pod, egress traffic from selected pods, or both. If no policyTypes
are specified on a NetworkPolicy then by default Ingress
will always be set and Egress
will be set if the NetworkPolicy has any egress rules.
ingress: Each NetworkPolicy may include a list of allowed ingress
rules. Each rule allows traffic which matches both the from
and ports
sections. The example policy contains a single rule, which matches traffic on a single port, from one of three sources, the first specified via an ipBlock
, the second via a namespaceSelector
and the third via a podSelector
.
egress: Each NetworkPolicy may include a list of allowed egress
rules. Each rule allows traffic which matches both the to
and ports
sections. The example policy contains a single rule, which matches traffic on a single port to any destination in 10.0.0.0/24
.
So, the example NetworkPolicy:
isolates "role=db" pods in the "default" namespace for both ingress and egress traffic (if they weren't already isolated)
(Ingress rules) allows connections to all pods in the "default" namespace with the label "role=db" on TCP port 6379 from:
- any pod in the "default" namespace with the label "role=frontend"
- any pod in a namespace with the label "project=myproject"
- IP addresses in the ranges 172.17.0.0–172.17.0.255 and 172.17.2.0–172.17.255.255 (ie, all of 172.17.0.0/16 except 172.17.1.0/24)
(Egress rules) allows connections from any pod in the "default" namespace with the label "role=db" to CIDR 10.0.0.0/24 on TCP port 5978
See the Declare Network Policy walkthrough for further examples.
Behavior of to
and from
selectors
There are four kinds of selectors that can be specified in an ingress
from
section or egress
to
section:
podSelector: This selects particular Pods in the same namespace as the NetworkPolicy which should be allowed as ingress sources or egress destinations.
namespaceSelector: This selects particular namespaces for which all Pods should be allowed as ingress sources or egress destinations.
namespaceSelector and podSelector: A single to
/from
entry that specifies both namespaceSelector
and podSelector
selects particular Pods within particular namespaces. Be careful to use correct YAML syntax; this policy:
...
ingress:
- from:
- namespaceSelector:
matchLabels:
user: alice
podSelector:
matchLabels:
role: client
...
contains a single from
element allowing connections from Pods with the label role=client
in namespaces with the label user=alice
. But this policy:
...
ingress:
- from:
- namespaceSelector:
matchLabels:
user: alice
- podSelector:
matchLabels:
role: client
...
contains two elements in the from
array, and allows connections from Pods in the local Namespace with the label role=client
, or from any Pod in any namespace with the label user=alice
.
When in doubt, use kubectl describe
to see how Kubernetes has interpreted the policy.
ipBlock: This selects particular IP CIDR ranges to allow as ingress sources or egress destinations. These should be cluster-external IPs, since Pod IPs are ephemeral and unpredictable.
Cluster ingress and egress mechanisms often require rewriting the source or destination IP
of packets. In cases where this happens, it is not defined whether this happens before or
after NetworkPolicy processing, and the behavior may be different for different
combinations of network plugin, cloud provider, Service
implementation, etc.
In the case of ingress, this means that in some cases you may be able to filter incoming
packets based on the actual original source IP, while in other cases, the "source IP" that
the NetworkPolicy acts on may be the IP of a LoadBalancer
or of the Pod's node, etc.
For egress, this means that connections from pods to Service
IPs that get rewritten to
cluster-external IPs may or may not be subject to ipBlock
-based policies.
Default policies
By default, if no policies exist in a namespace, then all ingress and egress traffic is allowed to and from pods in that namespace. The following examples let you change the default behavior in that namespace.
Default deny all ingress traffic
You can create a "default" isolation policy for a namespace by creating a NetworkPolicy that selects all pods but does not allow any ingress traffic to those pods.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
spec:
podSelector: {}
policyTypes:
- Ingress
This ensures that even pods that aren't selected by any other NetworkPolicy will still be isolated. This policy does not change the default egress isolation behavior.
Default allow all ingress traffic
If you want to allow all traffic to all pods in a namespace (even if policies are added that cause some pods to be treated as "isolated"), you can create a policy that explicitly allows all traffic in that namespace.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-all-ingress
spec:
podSelector: {}
ingress:
- {}
policyTypes:
- Ingress
Default deny all egress traffic
You can create a "default" egress isolation policy for a namespace by creating a NetworkPolicy that selects all pods but does not allow any egress traffic from those pods.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-egress
spec:
podSelector: {}
policyTypes:
- Egress
This ensures that even pods that aren't selected by any other NetworkPolicy will not be allowed egress traffic. This policy does not change the default ingress isolation behavior.
Default allow all egress traffic
If you want to allow all traffic from all pods in a namespace (even if policies are added that cause some pods to be treated as "isolated"), you can create a policy that explicitly allows all egress traffic in that namespace.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-all-egress
spec:
podSelector: {}
egress:
- {}
policyTypes:
- Egress
Default deny all ingress and all egress traffic
You can create a "default" policy for a namespace which prevents all ingress AND egress traffic by creating the following NetworkPolicy in that namespace.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
This ensures that even pods that aren't selected by any other NetworkPolicy will not be allowed ingress or egress traffic.
SCTP support
Kubernetes v1.19 [beta]
As a beta feature, this is enabled by default. To disable SCTP at a cluster level, you (or your cluster administrator) will need to disable the SCTPSupport
feature gate for the API server with --feature-gates=SCTPSupport=false,…
.
When the feature gate is enabled, you can set the protocol
field of a NetworkPolicy to SCTP
.
Note: You must be using a CNI plugin that supports SCTP protocol NetworkPolicies.
Targeting a range of Ports
Kubernetes v1.21 [alpha]
When writing a NetworkPolicy, you can target a range of ports instead of a single port.
This is achievable with the usage of the endPort
field, as the following example:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: multi-port-egress
namespace: default
spec:
podSelector:
matchLabels:
role: db
policyTypes:
- Egress
egress:
- to:
- ipBlock:
cidr: 10.0.0.0/24
ports:
- protocol: TCP
port: 32000
endPort: 32768
The above rule allows any Pod with label db
on the namespace default
to communicate with any IP within the range 10.0.0.0/24
over TCP, provided that the target port is between the range 32000 and 32768.
The following restrictions apply when using this field:
- As an alpha feature, this is disabled by default. To enable the
endPort
field at a cluster level, you (or your cluster administrator) need to enable theNetworkPolicyEndPort
feature gate for the API server with--feature-gates=NetworkPolicyEndPort=true,…
. - The
endPort
field must be equal than or greater to theport
field. endPort
can only be defined ifport
is also defined.- Both ports must be numeric.
Note: Your cluster must be using a CNI plugin that supports theendPort
field in NetworkPolicy specifications.
Targeting a Namespace by its name
Kubernetes 1.21 [beta]
The Kubernetes control plane sets an immutable label kubernetes.io/metadata.name
on all
namespaces, provided that the NamespaceDefaultLabelName
feature gate is enabled.
The value of the label is the namespace name.
While NetworkPolicy cannot target a namespace by its name with some object field, you can use the standardized label to target a specific namespace.
What you can't do with network policies (at least, not yet)
As of Kubernetes 1.21, the following functionality does not exist in the NetworkPolicy API, but you might be able to implement workarounds using Operating System components (such as SELinux, OpenVSwitch, IPTables, and so on) or Layer 7 technologies (Ingress controllers, Service Mesh implementations) or admission controllers. In case you are new to network security in Kubernetes, its worth noting that the following User Stories cannot (yet) be implemented using the NetworkPolicy API.
- Forcing internal cluster traffic to go through a common gateway (this might be best served with a service mesh or other proxy).
- Anything TLS related (use a service mesh or ingress controller for this).
- Node specific policies (you can use CIDR notation for these, but you cannot target nodes by their Kubernetes identities specifically).
- Targeting of services by name (you can, however, target pods or namespaces by their labels, which is often a viable workaround).
- Creation or management of "Policy requests" that are fulfilled by a third party.
- Default policies which are applied to all namespaces or pods (there are some third party Kubernetes distributions and projects which can do this).
- Advanced policy querying and reachability tooling.
- The ability to log network security events (for example connections that are blocked or accepted).
- The ability to explicitly deny policies (currently the model for NetworkPolicies are deny by default, with only the ability to add allow rules).
- The ability to prevent loopback or incoming host traffic (Pods cannot currently block localhost access, nor do they have the ability to block access from their resident node).
What's next
- See the Declare Network Policy walkthrough for further examples.
- See more recipes for common scenarios enabled by the NetworkPolicy resource.
5.11 - Adding entries to Pod /etc/hosts with HostAliases
Adding entries to a Pod's /etc/hosts
file provides Pod-level override of hostname resolution when DNS and other options are not applicable. You can add these custom entries with the HostAliases field in PodSpec.
Modification not using HostAliases is not suggested because the file is managed by the kubelet and can be overwritten on during Pod creation/restart.
Default hosts file content
Start an Nginx Pod which is assigned a Pod IP:
kubectl run nginx --image nginx
pod/nginx created
Examine a Pod IP:
kubectl get pods --output=wide
NAME READY STATUS RESTARTS AGE IP NODE
nginx 1/1 Running 0 13s 10.200.0.4 worker0
The hosts file content would look like this:
kubectl exec nginx -- cat /etc/hosts
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.200.0.4 nginx
By default, the hosts
file only includes IPv4 and IPv6 boilerplates like
localhost
and its own hostname.
Adding additional entries with hostAliases
In addition to the default boilerplate, you can add additional entries to the
hosts
file.
For example: to resolve foo.local
, bar.local
to 127.0.0.1
and foo.remote
,
bar.remote
to 10.1.2.3
, you can configure HostAliases for a Pod under
.spec.hostAliases
:
apiVersion: v1
kind: Pod
metadata:
name: hostaliases-pod
spec:
restartPolicy: Never
hostAliases:
- ip: "127.0.0.1"
hostnames:
- "foo.local"
- "bar.local"
- ip: "10.1.2.3"
hostnames:
- "foo.remote"
- "bar.remote"
containers:
- name: cat-hosts
image: busybox
command:
- cat
args:
- "/etc/hosts"
You can start a Pod with that configuration by running:
kubectl apply -f https://k8s.io/examples/service/networking/hostaliases-pod.yaml
pod/hostaliases-pod created
Examine a Pod's details to see its IPv4 address and its status:
kubectl get pod --output=wide
NAME READY STATUS RESTARTS AGE IP NODE
hostaliases-pod 0/1 Completed 0 6s 10.200.0.5 worker0
The hosts
file content looks like this:
kubectl logs hostaliases-pod
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.200.0.5 hostaliases-pod
# Entries added by HostAliases.
127.0.0.1 foo.local bar.local
10.1.2.3 foo.remote bar.remote
with the additional entries specified at the bottom.
Why does the kubelet manage the hosts file?
The kubelet manages the
hosts
file for each container of the Pod to prevent Docker from
modifying the file after the
containers have already been started.
Caution:Avoid making manual changes to the hosts file inside a container.
If you make manual changes to the hosts file, those changes are lost when the container exits.
5.12 - IPv4/IPv6 dual-stack
Kubernetes v1.21 [beta]
IPv4/IPv6 dual-stack networking enables the allocation of both IPv4 and IPv6 addresses to Pods and Services.
IPv4/IPv6 dual-stack networking is enabled by default for your Kubernetes cluster starting in 1.21, allowing the simultaneous assignment of both IPv4 and IPv6 addresses.
Supported Features
IPv4/IPv6 dual-stack on your Kubernetes cluster provides the following features:
- Dual-stack Pod networking (a single IPv4 and IPv6 address assignment per Pod)
- IPv4 and IPv6 enabled Services
- Pod off-cluster egress routing (eg. the Internet) via both IPv4 and IPv6 interfaces
Prerequisites
The following prerequisites are needed in order to utilize IPv4/IPv6 dual-stack Kubernetes clusters:
- Kubernetes 1.20 or later
For information about using dual-stack services with earlier Kubernetes versions, refer to the documentation for that version of Kubernetes. - Provider support for dual-stack networking (Cloud provider or otherwise must be able to provide Kubernetes nodes with routable IPv4/IPv6 network interfaces)
- A network plugin that supports dual-stack (such as Kubenet or Calico)
Configure IPv4/IPv6 dual-stack
To use IPv4/IPv6 dual-stack, ensure the IPv6DualStack
feature gate is enabled for the relevant components of your cluster. (Starting in 1.21, IPv4/IPv6 dual-stack defaults to enabled.)
To configure IPv4/IPv6 dual-stack, set dual-stack cluster network assignments:
- kube-apiserver:
--service-cluster-ip-range=<IPv4 CIDR>,<IPv6 CIDR>
- kube-controller-manager:
--cluster-cidr=<IPv4 CIDR>,<IPv6 CIDR>
--service-cluster-ip-range=<IPv4 CIDR>,<IPv6 CIDR>
--node-cidr-mask-size-ipv4|--node-cidr-mask-size-ipv6
defaults to /24 for IPv4 and /64 for IPv6
- kube-proxy:
--cluster-cidr=<IPv4 CIDR>,<IPv6 CIDR>
Note:An example of an IPv4 CIDR:
10.244.0.0/16
(though you would supply your own address range)An example of an IPv6 CIDR:
fdXY:IJKL:MNOP:15::/64
(this shows the format but is not a valid address - see RFC 4193)Starting in 1.21, IPv4/IPv6 dual-stack defaults to enabled. You can disable it when necessary by specifying
--feature-gates="IPv6DualStack=false"
on the kube-apiserver, kube-controller-manager, kubelet, and kube-proxy command line.
Services
You can create Services which can use IPv4, IPv6, or both.
The address family of a Service defaults to the address family of the first service cluster IP range (configured via the --service-cluster-ip-range
flag to the kube-apiserver).
When you define a Service you can optionally configure it as dual stack. To specify the behavior you want, you
set the .spec.ipFamilyPolicy
field to one of the following values:
SingleStack
: Single-stack service. The control plane allocates a cluster IP for the Service, using the first configured service cluster IP range.PreferDualStack
:- Allocates IPv4 and IPv6 cluster IPs for the Service. (If the cluster has
--feature-gates="IPv6DualStack=false"
, this setting follows the same behavior asSingleStack
.)
- Allocates IPv4 and IPv6 cluster IPs for the Service. (If the cluster has
RequireDualStack
: Allocates Service.spec.ClusterIPs
from both IPv4 and IPv6 address ranges.- Selects the
.spec.ClusterIP
from the list of.spec.ClusterIPs
based on the address family of the first element in the.spec.ipFamilies
array.
- Selects the
If you would like to define which IP family to use for single stack or define the order of IP families for dual-stack, you can choose the address families by setting an optional field, .spec.ipFamilies
, on the Service.
Note: The.spec.ipFamilies
field is immutable because the.spec.ClusterIP
cannot be reallocated on a Service that already exists. If you want to change.spec.ipFamilies
, delete and recreate the Service.
You can set .spec.ipFamilies
to any of the following array values:
["IPv4"]
["IPv6"]
["IPv4","IPv6"]
(dual stack)["IPv6","IPv4"]
(dual stack)
The first family you list is used for the legacy .spec.ClusterIP
field.
Dual-stack Service configuration scenarios
These examples demonstrate the behavior of various dual-stack Service configuration scenarios.
Dual-stack options on new Services
- This Service specification does not explicitly define
.spec.ipFamilyPolicy
. When you create this Service, Kubernetes assigns a cluster IP for the Service from the first configuredservice-cluster-ip-range
and sets the.spec.ipFamilyPolicy
toSingleStack
. (Services without selectors and headless Services with selectors will behave in this same way.)
apiVersion: v1
kind: Service
metadata:
name: my-service
labels:
app: MyApp
spec:
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
This Service specification explicitly defines
PreferDualStack
in.spec.ipFamilyPolicy
. When you create this Service on a dual-stack cluster, Kubernetes assigns both IPv4 and IPv6 addresses for the service. The control plane updates the.spec
for the Service to record the IP address assignments. The field.spec.ClusterIPs
is the primary field, and contains both assigned IP addresses;.spec.ClusterIP
is a secondary field with its value calculated from.spec.ClusterIPs
.- For the
.spec.ClusterIP
field, the control plane records the IP address that is from the same address family as the first service cluster IP range. - On a single-stack cluster, the
.spec.ClusterIPs
and.spec.ClusterIP
fields both only list one address. - On a cluster with dual-stack enabled, specifying
RequireDualStack
in.spec.ipFamilyPolicy
behaves the same asPreferDualStack
.
- For the
apiVersion: v1
kind: Service
metadata:
name: my-service
labels:
app: MyApp
spec:
ipFamilyPolicy: PreferDualStack
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
- This Service specification explicitly defines
IPv6
andIPv4
in.spec.ipFamilies
as well as definingPreferDualStack
in.spec.ipFamilyPolicy
. When Kubernetes assigns an IPv6 and IPv4 address in.spec.ClusterIPs
,.spec.ClusterIP
is set to the IPv6 address because that is the first element in the.spec.ClusterIPs
array, overriding the default.
apiVersion: v1
kind: Service
metadata:
name: my-service
labels:
app: MyApp
spec:
ipFamilyPolicy: PreferDualStack
ipFamilies:
- IPv6
- IPv4
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
Dual-stack defaults on existing Services
These examples demonstrate the default behavior when dual-stack is newly enabled on a cluster where Services already exist. (Upgrading an existing cluster to 1.21 will enable dual-stack unless --feature-gates="IPv6DualStack=false"
is set.)
- When dual-stack is enabled on a cluster, existing Services (whether
IPv4
orIPv6
) are configured by the control plane to set.spec.ipFamilyPolicy
toSingleStack
and set.spec.ipFamilies
to the address family of the existing Service. The existing Service cluster IP will be stored in.spec.ClusterIPs
.
apiVersion: v1
kind: Service
metadata:
name: my-service
labels:
app: MyApp
spec:
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
You can validate this behavior by using kubectl to inspect an existing service.
kubectl get svc my-service -o yaml
apiVersion: v1
kind: Service
metadata:
labels:
app: MyApp
name: my-service
spec:
clusterIP: 10.0.197.123
clusterIPs:
- 10.0.197.123
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
ports:
- port: 80
protocol: TCP
targetPort: 80
selector:
app: MyApp
type: ClusterIP
status:
loadBalancer: {}
- When dual-stack is enabled on a cluster, existing headless Services with selectors are configured by the control plane to set
.spec.ipFamilyPolicy
toSingleStack
and set.spec.ipFamilies
to the address family of the first service cluster IP range (configured via the--service-cluster-ip-range
flag to the kube-apiserver) even though.spec.ClusterIP
is set toNone
.
apiVersion: v1
kind: Service
metadata:
name: my-service
labels:
app: MyApp
spec:
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
You can validate this behavior by using kubectl to inspect an existing headless service with selectors.
kubectl get svc my-service -o yaml
apiVersion: v1
kind: Service
metadata:
labels:
app: MyApp
name: my-service
spec:
clusterIP: None
clusterIPs:
- None
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
ports:
- port: 80
protocol: TCP
targetPort: 80
selector:
app: MyApp
Switching Services between single-stack and dual-stack
Services can be changed from single-stack to dual-stack and from dual-stack to single-stack.
To change a Service from single-stack to dual-stack, change
.spec.ipFamilyPolicy
fromSingleStack
toPreferDualStack
orRequireDualStack
as desired. When you change this Service from single-stack to dual-stack, Kubernetes assigns the missing address family so that the Service now has IPv4 and IPv6 addresses.Edit the Service specification updating the
.spec.ipFamilyPolicy
fromSingleStack
toPreferDualStack
.
Before:
spec:
ipFamilyPolicy: SingleStack
After:
spec:
ipFamilyPolicy: PreferDualStack
- To change a Service from dual-stack to single-stack, change
.spec.ipFamilyPolicy
fromPreferDualStack
orRequireDualStack
toSingleStack
. When you change this Service from dual-stack to single-stack, Kubernetes retains only the first element in the.spec.ClusterIPs
array, and sets.spec.ClusterIP
to that IP address and sets.spec.ipFamilies
to the address family of.spec.ClusterIPs
.
Headless Services without selector
For Headless Services without selectors and without .spec.ipFamilyPolicy
explicitly set, the .spec.ipFamilyPolicy
field defaults to RequireDualStack
.
Service type LoadBalancer
To provision a dual-stack load balancer for your Service:
- Set the
.spec.type
field toLoadBalancer
- Set
.spec.ipFamilyPolicy
field toPreferDualStack
orRequireDualStack
Note: To use a dual-stackLoadBalancer
type Service, your cloud provider must support IPv4 and IPv6 load balancers.
Egress traffic
If you want to enable egress traffic in order to reach off-cluster destinations (eg. the public Internet) from a Pod that uses non-publicly routable IPv6 addresses, you need to enable the Pod to use a publicly routed IPv6 address via a mechanism such as transparent proxying or IP masquerading. The ip-masq-agent project supports IP masquerading on dual-stack clusters.
Note: Ensure your CNI provider supports IPv6.
What's next
6 - Storage
6.1 - Volumes
On-disk files in a container are ephemeral, which presents some problems for
non-trivial applications when running in containers. One problem
is the loss of files when a container crashes. The kubelet restarts the container
but with a clean state. A second problem occurs when sharing files
between containers running together in a Pod
.
The Kubernetes volume abstraction
solves both of these problems.
Familiarity with Pods is suggested.
Background
Docker has a concept of volumes, though it is somewhat looser and less managed. A Docker volume is a directory on disk or in another container. Docker provides volume drivers, but the functionality is somewhat limited.
Kubernetes supports many types of volumes. A Pod can use any number of volume types simultaneously. Ephemeral volume types have a lifetime of a pod, but persistent volumes exist beyond the lifetime of a pod. Consequently, a volume outlives any containers that run within the pod, and data is preserved across container restarts. When a pod ceases to exist, Kubernetes destroys ephemeral volumes; however, Kubernetes does not destroy persistent volumes.
At its core, a volume is a directory, possibly with some data in it, which is accessible to the containers in a pod. How that directory comes to be, the medium that backs it, and the contents of it are determined by the particular volume type used.
To use a volume, specify the volumes to provide for the Pod in .spec.volumes
and declare where to mount those volumes into containers in .spec.containers[*].volumeMounts
.
A process in a container sees a filesystem view composed from their Docker
image and volumes. The Docker image
is at the root of the filesystem hierarchy. Volumes mount at the specified paths within
the image. Volumes can not mount onto other volumes or have hard links to
other volumes. Each Container in the Pod's configuration must independently specify where to
mount each volume.
Types of Volumes
Kubernetes supports several types of volumes.
awsElasticBlockStore
An awsElasticBlockStore
volume mounts an Amazon Web Services (AWS)
EBS volume into your pod. Unlike
emptyDir
, which is erased when a pod is removed, the contents of an EBS
volume are persisted and the volume is unmounted. This means that an
EBS volume can be pre-populated with data, and that data can be shared between pods.
Note: You must create an EBS volume by usingaws ec2 create-volume
or the AWS API before you can use it.
There are some restrictions when using an awsElasticBlockStore
volume:
- the nodes on which pods are running must be AWS EC2 instances
- those instances need to be in the same region and availability zone as the EBS volume
- EBS only supports a single EC2 instance mounting a volume
Creating an AWS EBS volume
Before you can use an EBS volume with a pod, you need to create it.
aws ec2 create-volume --availability-zone=eu-west-1a --size=10 --volume-type=gp2
Make sure the zone matches the zone you brought up your cluster in. Check that the size and EBS volume type are suitable for your use.
AWS EBS configuration example
apiVersion: v1
kind: Pod
metadata:
name: test-ebs
spec:
containers:
- image: k8s.gcr.io/test-webserver
name: test-container
volumeMounts:
- mountPath: /test-ebs
name: test-volume
volumes:
- name: test-volume
# This AWS EBS volume must already exist.
awsElasticBlockStore:
volumeID: "<volume id>"
fsType: ext4
If the EBS volume is partitioned, you can supply the optional field partition: "<partition number>"
to specify which parition to mount on.
AWS EBS CSI migration
Kubernetes v1.17 [beta]
The CSIMigration
feature for awsElasticBlockStore
, when enabled, redirects
all plugin operations from the existing in-tree plugin to the ebs.csi.aws.com
Container
Storage Interface (CSI) driver. In order to use this feature, the AWS EBS CSI
driver
must be installed on the cluster and the CSIMigration
and CSIMigrationAWS
beta features must be enabled.
AWS EBS CSI migration complete
Kubernetes v1.17 [alpha]
To disable the awsElasticBlockStore
storage plugin from being loaded by the controller manager
and the kubelet, set the CSIMigrationAWSComplete
flag to true
. This feature requires the ebs.csi.aws.com
Container Storage Interface (CSI) driver installed on all worker nodes.
azureDisk
The azureDisk
volume type mounts a Microsoft Azure Data Disk into a pod.
For more details, see the azureDisk
volume plugin.
azureDisk CSI migration
Kubernetes v1.19 [beta]
The CSIMigration
feature for azureDisk
, when enabled, redirects all plugin operations
from the existing in-tree plugin to the disk.csi.azure.com
Container
Storage Interface (CSI) Driver. In order to use this feature, the Azure Disk CSI
Driver
must be installed on the cluster and the CSIMigration
and CSIMigrationAzureDisk
features must be enabled.
azureFile
The azureFile
volume type mounts a Microsoft Azure File volume (SMB 2.1 and 3.0)
into a pod.
For more details, see the azureFile
volume plugin.
azureFile CSI migration
Kubernetes v1.21 [beta]
The CSIMigration
feature for azureFile
, when enabled, redirects all plugin operations
from the existing in-tree plugin to the file.csi.azure.com
Container
Storage Interface (CSI) Driver. In order to use this feature, the Azure File CSI
Driver
must be installed on the cluster and the CSIMigration
and CSIMigrationAzureFile
feature gates must be enabled.
Azure File CSI driver does not support using same volume with different fsgroups, if Azurefile CSI migration is enabled, using same volume with different fsgroups won't be supported at all.
cephfs
A cephfs
volume allows an existing CephFS volume to be
mounted into your Pod. Unlike emptyDir
, which is erased when a pod is
removed, the contents of a cephfs
volume are preserved and the volume is merely
unmounted. This means that a cephfs
volume can be pre-populated with data, and
that data can be shared between pods. The cephfs
volume can be mounted by multiple
writers simultaneously.
Note: You must have your own Ceph server running with the share exported before you can use it.
See the CephFS example for more details.
cinder
Note: Kubernetes must be configured with the OpenStack cloud provider.
The cinder
volume type is used to mount the OpenStack Cinder volume into your pod.
Cinder volume configuration example
apiVersion: v1
kind: Pod
metadata:
name: test-cinder
spec:
containers:
- image: k8s.gcr.io/test-webserver
name: test-cinder-container
volumeMounts:
- mountPath: /test-cinder
name: test-volume
volumes:
- name: test-volume
# This OpenStack volume must already exist.
cinder:
volumeID: "<volume id>"
fsType: ext4
OpenStack CSI migration
Kubernetes v1.21 [beta]
The CSIMigration
feature for Cinder is enabled by default in Kubernetes 1.21.
It redirects all plugin operations from the existing in-tree plugin to the
cinder.csi.openstack.org
Container Storage Interface (CSI) Driver.
OpenStack Cinder CSI Driver
must be installed on the cluster.
You can disable Cinder CSI migration for your cluster by setting the CSIMigrationOpenStack
feature gate to false
.
If you disable the CSIMigrationOpenStack
feature, the in-tree Cinder volume plugin takes responsibility
for all aspects of Cinder volume storage management.
configMap
A ConfigMap
provides a way to inject configuration data into pods.
The data stored in a ConfigMap can be referenced in a volume of type
configMap
and then consumed by containerized applications running in a pod.
When referencing a ConfigMap, you provide the name of the ConfigMap in the
volume. You can customize the path to use for a specific
entry in the ConfigMap. The following configuration shows how to mount
the log-config
ConfigMap onto a Pod called configmap-pod
:
apiVersion: v1
kind: Pod
metadata:
name: configmap-pod
spec:
containers:
- name: test
image: busybox
volumeMounts:
- name: config-vol
mountPath: /etc/config
volumes:
- name: config-vol
configMap:
name: log-config
items:
- key: log_level
path: log_level
The log-config
ConfigMap is mounted as a volume, and all contents stored in
its log_level
entry are mounted into the Pod at path /etc/config/log_level
.
Note that this path is derived from the volume's mountPath
and the path
keyed with log_level
.
downwardAPI
A downwardAPI
volume makes downward API data available to applications.
It mounts a directory and writes the requested data in plain text files.
Note: A container using the downward API as asubPath
volume mount will not receive downward API updates.
See the downward API example for more details.
emptyDir
An emptyDir
volume is first created when a Pod is assigned to a node, and
exists as long as that Pod is running on that node. As the name says, the
emptyDir
volume is initially empty. All containers in the Pod can read and write the same
files in the emptyDir
volume, though that volume can be mounted at the same
or different paths in each container. When a Pod is removed from a node for
any reason, the data in the emptyDir
is deleted permanently.
Note: A container crashing does not remove a Pod from a node. The data in anemptyDir
volume is safe across container crashes.
Some uses for an emptyDir
are:
- scratch space, such as for a disk-based merge sort
- checkpointing a long computation for recovery from crashes
- holding files that a content-manager container fetches while a webserver container serves the data
Depending on your environment, emptyDir
volumes are stored on whatever medium that backs the
node such as disk or SSD, or network storage. However, if you set the emptyDir.medium
field
to "Memory"
, Kubernetes mounts a tmpfs (RAM-backed filesystem) for you instead.
While tmpfs is very fast, be aware that unlike disks, tmpfs is cleared on
node reboot and any files you write count against your container's
memory limit.
Note: If theSizeMemoryBackedVolumes
feature gate is enabled, you can specify a size for memory backed volumes. If no size is specified, memory backed volumes are sized to 50% of the memory on a Linux host.
emptyDir configuration example
apiVersion: v1
kind: Pod
metadata:
name: test-pd
spec:
containers:
- image: k8s.gcr.io/test-webserver
name: test-container
volumeMounts:
- mountPath: /cache
name: cache-volume
volumes:
- name: cache-volume
emptyDir: {}
fc (fibre channel)
An fc
volume type allows an existing fibre channel block storage volume
to mount in a Pod. You can specify single or multiple target world wide names (WWNs)
using the parameter targetWWNs
in your Volume configuration. If multiple WWNs are specified,
targetWWNs expect that those WWNs are from multi-path connections.
Note: You must configure FC SAN Zoning to allocate and mask those LUNs (volumes) to the target WWNs beforehand so that Kubernetes hosts can access them.
See the fibre channel example for more details.
flocker (deprecated)
Flocker is an open-source, clustered container data volume manager. Flocker provides management and orchestration of data volumes backed by a variety of storage backends.
A flocker
volume allows a Flocker dataset to be mounted into a Pod. If the
dataset does not already exist in Flocker, it needs to be first created with the Flocker
CLI or by using the Flocker API. If the dataset already exists it will be
reattached by Flocker to the node that the pod is scheduled. This means data
can be shared between pods as required.
Note: You must have your own Flocker installation running before you can use it.
See the Flocker example for more details.
gcePersistentDisk
A gcePersistentDisk
volume mounts a Google Compute Engine (GCE)
persistent disk (PD) into your Pod.
Unlike emptyDir
, which is erased when a pod is removed, the contents of a PD are
preserved and the volume is merely unmounted. This means that a PD can be
pre-populated with data, and that data can be shared between pods.
Note: You must create a PD usinggcloud
or the GCE API or UI before you can use it.
There are some restrictions when using a gcePersistentDisk
:
- the nodes on which Pods are running must be GCE VMs
- those VMs need to be in the same GCE project and zone as the persistent disk
One feature of GCE persistent disk is concurrent read-only access to a persistent disk.
A gcePersistentDisk
volume permits multiple consumers to simultaneously
mount a persistent disk as read-only. This means that you can pre-populate a PD with your dataset
and then serve it in parallel from as many Pods as you need. Unfortunately,
PDs can only be mounted by a single consumer in read-write mode. Simultaneous
writers are not allowed.
Using a GCE persistent disk with a Pod controlled by a ReplicaSet will fail unless the PD is read-only or the replica count is 0 or 1.
Creating a GCE persistent disk
Before you can use a GCE persistent disk with a Pod, you need to create it.
gcloud compute disks create --size=500GB --zone=us-central1-a my-data-disk
GCE persistent disk configuration example
apiVersion: v1
kind: Pod
metadata:
name: test-pd
spec:
containers:
- image: k8s.gcr.io/test-webserver
name: test-container
volumeMounts:
- mountPath: /test-pd
name: test-volume
volumes:
- name: test-volume
# This GCE PD must already exist.
gcePersistentDisk:
pdName: my-data-disk
fsType: ext4
Regional persistent disks
The Regional persistent disks feature allows the creation of persistent disks that are available in two zones within the same region. In order to use this feature, the volume must be provisioned as a PersistentVolume; referencing the volume directly from a pod is not supported.
Manually provisioning a Regional PD PersistentVolume
Dynamic provisioning is possible using a StorageClass for GCE PD. Before creating a PersistentVolume, you must create the persistent disk:
gcloud compute disks create --size=500GB my-data-disk
--region us-central1
--replica-zones us-central1-a,us-central1-b
Regional persistent disk configuration example
apiVersion: v1
kind: PersistentVolume
metadata:
name: test-volume
spec:
capacity:
storage: 400Gi
accessModes:
- ReadWriteOnce
gcePersistentDisk:
pdName: my-data-disk
fsType: ext4
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: failure-domain.beta.kubernetes.io/zone
operator: In
values:
- us-central1-a
- us-central1-b
GCE CSI migration
Kubernetes v1.17 [beta]
The CSIMigration
feature for GCE PD, when enabled, redirects all plugin operations
from the existing in-tree plugin to the pd.csi.storage.gke.io
Container
Storage Interface (CSI) Driver. In order to use this feature, the GCE PD CSI
Driver
must be installed on the cluster and the CSIMigration
and CSIMigrationGCE
beta features must be enabled.
gitRepo (deprecated)
A gitRepo
volume is an example of a volume plugin. This plugin
mounts an empty directory and clones a git repository into this directory
for your Pod to use.
Here is an example of a gitRepo
volume:
apiVersion: v1
kind: Pod
metadata:
name: server
spec:
containers:
- image: nginx
name: nginx
volumeMounts:
- mountPath: /mypath
name: git-volume
volumes:
- name: git-volume
gitRepo:
repository: "git@somewhere:me/my-git-repository.git"
revision: "22f1d8406d464b0c0874075539c1f2e96c253775"
glusterfs
A glusterfs
volume allows a Glusterfs (an open
source networked filesystem) volume to be mounted into your Pod. Unlike
emptyDir
, which is erased when a Pod is removed, the contents of a
glusterfs
volume are preserved and the volume is merely unmounted. This
means that a glusterfs volume can be pre-populated with data, and that data can
be shared between pods. GlusterFS can be mounted by multiple writers
simultaneously.
Note: You must have your own GlusterFS installation running before you can use it.
See the GlusterFS example for more details.
hostPath
A hostPath
volume mounts a file or directory from the host node's filesystem
into your Pod. This is not something that most Pods will need, but it offers a
powerful escape hatch for some applications.
For example, some uses for a hostPath
are:
- running a container that needs access to Docker internals; use a
hostPath
of/var/lib/docker
- running cAdvisor in a container; use a
hostPath
of/sys
- allowing a Pod to specify whether a given
hostPath
should exist prior to the Pod running, whether it should be created, and what it should exist as
In addition to the required path
property, you can optionally specify a type
for a hostPath
volume.
The supported values for field type
are:
Value | Behavior |
---|---|
Empty string (default) is for backward compatibility, which means that no checks will be performed before mounting the hostPath volume. | |
DirectoryOrCreate | If nothing exists at the given path, an empty directory will be created there as needed with permission set to 0755, having the same group and ownership with Kubelet. |
Directory | A directory must exist at the given path |
FileOrCreate | If nothing exists at the given path, an empty file will be created there as needed with permission set to 0644, having the same group and ownership with Kubelet. |
File | A file must exist at the given path |
Socket | A UNIX socket must exist at the given path |
CharDevice | A character device must exist at the given path |
BlockDevice | A block device must exist at the given path |
Watch out when using this type of volume, because:
- Pods with identical configuration (such as created from a PodTemplate) may behave differently on different nodes due to different files on the nodes
- The files or directories created on the underlying hosts are only writable by root. You
either need to run your process as root in a
privileged Container or modify the file
permissions on the host to be able to write to a
hostPath
volume
hostPath configuration example
apiVersion: v1
kind: Pod