This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Reference

This section of the Kubernetes documentation contains references.

API Reference

Officially supported client libraries

To call the Kubernetes API from a programming language, you can use client libraries. Officially supported client libraries:

CLI

  • kubectl - Main CLI tool for running commands and managing Kubernetes clusters.
  • kubeadm - CLI tool to easily provision a secure Kubernetes cluster.

Components

  • kubelet - The primary agent that runs on each node. The kubelet takes a set of PodSpecs and ensures that the described containers are running and healthy.

  • kube-apiserver - REST API that validates and configures data for API objects such as pods, services, replication controllers.

  • kube-controller-manager - Daemon that embeds the core control loops shipped with Kubernetes.

  • kube-proxy - Can do simple TCP/UDP stream forwarding or round-robin TCP/UDP forwarding across a set of back-ends.

  • kube-scheduler - Scheduler that manages availability, performance, and capacity.

  • List of ports and protocols that should be open on control plane and worker nodes

Config APIs

This section hosts the documentation for "unpublished" APIs which are used to configure kubernetes components or tools. Most of these APIs are not exposed by the API server in a RESTful way though they are essential for a user or an operator to use or manage a cluster.

Config API for kubeadm

Design Docs

An archive of the design docs for Kubernetes functionality. Good starting points are Kubernetes Architecture and Kubernetes Design Overview.

1 - Glossary

2 - API Overview

This section provides reference information for the Kubernetes API.

The REST API is the fundamental fabric of Kubernetes. All operations and communications between components, and external user commands are REST API calls that the API Server handles. Consequently, everything in the Kubernetes platform is treated as an API object and has a corresponding entry in the API.

The Kubernetes API reference lists the API for Kubernetes version v1.25.

For general background information, read The Kubernetes API. Controlling Access to the Kubernetes API describes how clients can authenticate to the Kubernetes API server, and how their requests are authorized.

API versioning

The JSON and Protobuf serialization schemas follow the same guidelines for schema changes. The following descriptions cover both formats.

The API versioning and software versioning are indirectly related. The API and release versioning proposal describes the relationship between API versioning and software versioning.

Different API versions indicate different levels of stability and support. You can find more information about the criteria for each level in the API Changes documentation.

Here's a summary of each level:

  • Alpha:

    • The version names contain alpha (for example, v1alpha1).
    • Built-in alpha API versions are disabled by default and must be explicitly enabled in the kube-apiserver configuration to be used.
    • The software may contain bugs. Enabling a feature may expose bugs.
    • Support for an alpha API may be dropped at any time without notice.
    • The API may change in incompatible ways in a later software release without notice.
    • The software is recommended for use only in short-lived testing clusters, due to increased risk of bugs and lack of long-term support.
  • Beta:

    • The version names contain beta (for example, v2beta3).

    • Built-in beta API versions are disabled by default and must be explicitly enabled in the kube-apiserver configuration to be used (except for beta versions of APIs introduced prior to Kubernetes 1.22, which were enabled by default).

    • Built-in beta API versions have a maximum lifetime of 9 months or 3 minor releases (whichever is longer) from introduction to deprecation, and 9 months or 3 minor releases (whichever is longer) from deprecation to removal.

    • The software is well tested. Enabling a feature is considered safe.

    • The support for a feature will not be dropped, though the details may change.

    • The schema and/or semantics of objects may change in incompatible ways in a subsequent beta or stable API version. When this happens, migration instructions are provided. Adapting to a subsequent beta or stable API version may require editing or re-creating API objects, and may not be straightforward. The migration may require downtime for applications that rely on the feature.

    • The software is not recommended for production uses. Subsequent releases may introduce incompatible changes. Use of beta API versions is required to transition to subsequent beta or stable API versions once the beta API version is deprecated and no longer served.

  • Stable:

    • The version name is vX where X is an integer.
    • Stable API versions remain available for all future releases within a Kubernetes major version, and there are no current plans for a major version revision of Kubernetes that removes stable APIs.

API groups

API groups make it easier to extend the Kubernetes API. The API group is specified in a REST path and in the apiVersion field of a serialized object.

There are several API groups in Kubernetes:

  • The core (also called legacy) group is found at REST path /api/v1. The core group is not specified as part of the apiVersion field, for example, apiVersion: v1.
  • The named groups are at REST path /apis/$GROUP_NAME/$VERSION and use apiVersion: $GROUP_NAME/$VERSION (for example, apiVersion: batch/v1). You can find the full list of supported API groups in Kubernetes API reference.

Enabling or disabling API groups

Certain resources and API groups are enabled by default. You can enable or disable them by setting --runtime-config on the API server. The --runtime-config flag accepts comma separated <key>[=<value>] pairs describing the runtime configuration of the API server. If the =<value> part is omitted, it is treated as if =true is specified. For example:

  • to disable batch/v1, set --runtime-config=batch/v1=false
  • to enable batch/v2alpha1, set --runtime-config=batch/v2alpha1
  • to enable a specific version of an API, such as storage.k8s.io/v1beta1/csistoragecapacities, set --runtime-config=storage.k8s.io/v1beta1/csistoragecapacities

Persistence

Kubernetes stores its serialized state in terms of the API resources by writing them into etcd.

What's next

2.1 - Kubernetes API Concepts

The Kubernetes API is a resource-based (RESTful) programmatic interface provided via HTTP. It supports retrieving, creating, updating, and deleting primary resources via the standard HTTP verbs (POST, PUT, PATCH, DELETE, GET).

For some resources, the API includes additional subresources that allow fine grained authorization (such as separate views for Pod details and log retrievals), and can accept and serve those resources in different representations for convenience or efficiency.

Kubernetes supports efficient change notifications on resources via watches. Kubernetes also provides consistent list operations so that API clients can effectively cache, track, and synchronize the state of resources.

You can view the API reference online, or read on to learn about the API in general.

Kubernetes API terminology

Kubernetes generally leverages common RESTful terminology to describe the API concepts:

  • A resource type is the name used in the URL (pods, namespaces, services)
  • All resource types have a concrete representation (their object schema) which is called a kind
  • A list of instances of a resource is known as a collection
  • A single instance of a resource type is called a resource, and also usually represents an object
  • For some resource types, the API includes one or more sub-resources, which are represented as URI paths below the resource

Most Kubernetes API resource types are objects: they represent a concrete instance of a concept on the cluster, like a pod or namespace. A smaller number of API resource types are virtual in that they often represent operations on objects, rather than objects, such as a permission check (use a POST with a JSON-encoded body of SubjectAccessReview to the subjectaccessreviews resource), or the eviction sub-resource of a Pod (used to trigger API-initiated eviction).

Object names

All objects you can create via the API have a unique object name to allow idempotent creation and retrieval, except that virtual resource types may not have unique names if they are not retrievable, or do not rely on idempotency. Within a namespace, only one object of a given kind can have a given name at a time. However, if you delete the object, you can make a new object with the same name. Some objects are not namespaced (for example: Nodes), and so their names must be unique across the whole cluster.

API verbs

Almost all object resource types support the standard HTTP verbs - GET, POST, PUT, PATCH, and DELETE. Kubernetes also uses its own verbs, which are often written lowercase to distinguish them from HTTP verbs.

Kubernetes uses the term list to describe returning a collection of resources to distinguish from retrieving a single resource which is usually called a get. If you sent an HTTP GET request with the ?watch query parameter, Kubernetes calls this a watch and not a get (see Efficient detection of changes for more details).

For PUT requests, Kubernetes internally classifies these as either create or update based on the state of the existing object. An update is different from a patch; the HTTP verb for a patch is PATCH.

Resource URIs

All resource types are either scoped by the cluster (/apis/GROUP/VERSION/*) or to a namespace (/apis/GROUP/VERSION/namespaces/NAMESPACE/*). A namespace-scoped resource type will be deleted when its namespace is deleted and access to that resource type is controlled by authorization checks on the namespace scope.

You can also access collections of resources (for example: listing all Nodes). The following paths are used to retrieve collections and resources:

  • Cluster-scoped resources:

    • GET /apis/GROUP/VERSION/RESOURCETYPE - return the collection of resources of the resource type
    • GET /apis/GROUP/VERSION/RESOURCETYPE/NAME - return the resource with NAME under the resource type
  • Namespace-scoped resources:

    • GET /apis/GROUP/VERSION/RESOURCETYPE - return the collection of all instances of the resource type across all namespaces
    • GET /apis/GROUP/VERSION/namespaces/NAMESPACE/RESOURCETYPE - return collection of all instances of the resource type in NAMESPACE
    • GET /apis/GROUP/VERSION/namespaces/NAMESPACE/RESOURCETYPE/NAME - return the instance of the resource type with NAME in NAMESPACE

Since a namespace is a cluster-scoped resource type, you can retrieve the list (“collection”) of all namespaces with GET /api/v1/namespaces and details about a particular namespace with GET /api/v1/namespaces/NAME.

  • Cluster-scoped subresource: GET /apis/GROUP/VERSION/RESOURCETYPE/NAME/SUBRESOURCE
  • Namespace-scoped subresource: GET /apis/GROUP/VERSION/namespaces/NAMESPACE/RESOURCETYPE/NAME/SUBRESOURCE

The verbs supported for each subresource will differ depending on the object - see the API reference for more information. It is not possible to access sub-resources across multiple resources - generally a new virtual resource type would be used if that becomes necessary.

Efficient detection of changes

The Kubernetes API allows clients to make an initial request for an object or a collection, and then to track changes since that initial request: a watch. Clients can send a list or a get and then make a follow-up watch request.

To make this change tracking possible, every Kubernetes object has a resourceVersion field representing the version of that resource as stored in the underlying persistence layer. When retrieving a collection of resources (either namespace or cluster scoped), the response from the API server contains a resourceVersion value. The client can use that resourceVersion to initiate a watch against the API server.

When you send a watch request, the API server responds with a stream of changes. These changes itemize the outcome of operations (such as create, delete, and update) that occurred after the resourceVersion you specified as a parameter to the watch request. The overall watch mechanism allows a client to fetch the current state and then subscribe to subsequent changes, without missing any events.

If a client watch is disconnected then that client can start a new watch from the last returned resourceVersion; the client could also perform a fresh get / list request and begin again. See Resource Version Semantics for more detail.

For example:

  1. List all of the pods in a given namespace.

    GET /api/v1/namespaces/test/pods
    ---
    200 OK
    Content-Type: application/json
    
    {
      "kind": "PodList",
      "apiVersion": "v1",
      "metadata": {"resourceVersion":"10245"},
      "items": [...]
    }
    
  2. Starting from resource version 10245, receive notifications of any API operations (such as create, delete, apply or update) that affect Pods in the test namespace. Each change notification is a JSON document. The HTTP response body (served as application/json) consists a series of JSON documents.

    GET /api/v1/namespaces/test/pods?watch=1&resourceVersion=10245
    ---
    200 OK
    Transfer-Encoding: chunked
    Content-Type: application/json
    
    {
      "type": "ADDED",
      "object": {"kind": "Pod", "apiVersion": "v1", "metadata": {"resourceVersion": "10596", ...}, ...}
    }
    {
      "type": "MODIFIED",
      "object": {"kind": "Pod", "apiVersion": "v1", "metadata": {"resourceVersion": "11020", ...}, ...}
    }
    ...
    

A given Kubernetes server will only preserve a historical record of changes for a limited time. Clusters using etcd 3 preserve changes in the last 5 minutes by default. When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a new get or list operation, and starting the watch from the resourceVersion that was returned.

For subscribing to collections, Kubernetes client libraries typically offer some form of standard tool for this list-then-watch logic. (In the Go client library, this is called a Reflector and is located in the k8s.io/client-go/tools/cache package.)

Watch bookmarks

To mitigate the impact of short history window, the Kubernetes API provides a watch event named BOOKMARK. It is a special kind of event to mark that all changes up to a given resourceVersion the client is requesting have already been sent. The document representing the BOOKMARK event is of the type requested by the request, but only includes a .metadata.resourceVersion field. For example:

GET /api/v1/namespaces/test/pods?watch=1&resourceVersion=10245&allowWatchBookmarks=true
---
200 OK
Transfer-Encoding: chunked
Content-Type: application/json

{
  "type": "ADDED",
  "object": {"kind": "Pod", "apiVersion": "v1", "metadata": {"resourceVersion": "10596", ...}, ...}
}
...
{
  "type": "BOOKMARK",
  "object": {"kind": "Pod", "apiVersion": "v1", "metadata": {"resourceVersion": "12746"} }
}

As a client, you can request BOOKMARK events by setting the allowWatchBookmarks=true query parameter to a watch request, but you shouldn't assume bookmarks are returned at any specific interval, nor can clients assume that the API server will send any BOOKMARK event even when requested.

Retrieving large results sets in chunks

FEATURE STATE: Kubernetes v1.9 [beta]

On large clusters, retrieving the collection of some resource types may result in very large responses that can impact the server and client. For instance, a cluster may have tens of thousands of Pods, each of which is equivalent to roughly 2 KiB of encoded JSON. Retrieving all pods across all namespaces may result in a very large response (10-20MB) and consume a large amount of server resources.

Provided that you don't explicitly disable the APIListChunking feature gate, the Kubernetes API server supports the ability to break a single large collection request into many smaller chunks while preserving the consistency of the total request. Each chunk can be returned sequentially which reduces both the total size of the request and allows user-oriented clients to display results incrementally to improve responsiveness.

You can request that the API server handles a list by serving single collection using pages (which Kubernetes calls chunks). To retrieve a single collection in chunks, two query parameters limit and continue are supported on requests against collections, and a response field continue is returned from all list operations in the collection's metadata field. A client should specify the maximum results they wish to receive in each chunk with limit and the server will return up to limit resources in the result and include a continue value if there are more resources in the collection.

As an API client, you can then pass this continue value to the API server on the next request, to instruct the server to return the next page (chunk) of results. By continuing until the server returns an empty continue value, you can retrieve the entire collection.

Like a watch operation, a continue token will expire after a short amount of time (by default 5 minutes) and return a 410 Gone if more results cannot be returned. In this case, the client will need to start from the beginning or omit the limit parameter.

For example, if there are 1,253 pods on the cluster and you wants to receive chunks of 500 pods at a time, request those chunks as follows:

  1. List all of the pods on a cluster, retrieving up to 500 pods each time.

    GET /api/v1/pods?limit=500
    ---
    200 OK
    Content-Type: application/json
    
    {
      "kind": "PodList",
      "apiVersion": "v1",
      "metadata": {
        "resourceVersion":"10245",
        "continue": "ENCODED_CONTINUE_TOKEN",
        "remainingItemCount": 753,
        ...
      },
      "items": [...] // returns pods 1-500
    }
    
  2. Continue the previous call, retrieving the next set of 500 pods.

    GET /api/v1/pods?limit=500&continue=ENCODED_CONTINUE_TOKEN
    ---
    200 OK
    Content-Type: application/json
    
    {
      "kind": "PodList",
      "apiVersion": "v1",
      "metadata": {
        "resourceVersion":"10245",
        "continue": "ENCODED_CONTINUE_TOKEN_2",
        "remainingItemCount": 253,
        ...
      },
      "items": [...] // returns pods 501-1000
    }
    
  3. Continue the previous call, retrieving the last 253 pods.

    GET /api/v1/pods?limit=500&continue=ENCODED_CONTINUE_TOKEN_2
    ---
    200 OK
    Content-Type: application/json
    
    {
      "kind": "PodList",
      "apiVersion": "v1",
      "metadata": {
        "resourceVersion":"10245",
        "continue": "", // continue token is empty because we have reached the end of the list
        ...
      },
      "items": [...] // returns pods 1001-1253
    }
    

Notice that the resourceVersion of the collection remains constant across each request, indicating the server is showing you a consistent snapshot of the pods. Pods that are created, updated, or deleted after version 10245 would not be shown unless you make a separate list request without the continue token. This allows you to break large requests into smaller chunks and then perform a watch operation on the full set without missing any updates.

remainingItemCount is the number of subsequent items in the collection that are not included in this response. If the list request contained label or field selectors then the number of remaining items is unknown and the API server does not include a remainingItemCount field in its response. If the list is complete (either because it is not chunking, or because this is the last chunk), then there are no more remaining items and the API server does not include a remainingItemCount field in its response. The intended use of the remainingItemCount is estimating the size of a collection.

Collections

In Kubernetes terminology, the response you get from a list is a collection. However, Kubernetes defines concrete kinds for collections of different types of resource. Collections have a kind named for the resource kind, with List appended.

When you query the API for a particular type, all items returned by that query are of that type. For example, when you list Services, the collection response has kind set to ServiceList; each item in that collection represents a single Service. For example:

GET /api/v1/services
{
  "kind": "ServiceList",
  "apiVersion": "v1",
  "metadata": {
    "resourceVersion": "2947301"
  },
  "items": [
    {
      "metadata": {
        "name": "kubernetes",
        "namespace": "default",
...
      "metadata": {
        "name": "kube-dns",
        "namespace": "kube-system",
...

There are dozens of collection types (such as PodList, ServiceList, and NodeList) defined in the Kubernetes API. You can get more information about each collection type from the Kubernetes API documentation.

Some tools, such as kubectl, represent the Kubernetes collection mechanism slightly differently from the Kubernetes API itself. Because the output of kubectl might include the response from multiple list operations at the API level, kubectl represents a list of items using kind: List. For example:

kubectl get services -A -o yaml
apiVersion: v1
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
items:
- apiVersion: v1
  kind: Service
  metadata:
    creationTimestamp: "2021-06-03T14:54:12Z"
    labels:
      component: apiserver
      provider: kubernetes
    name: kubernetes
    namespace: default
...
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      prometheus.io/port: "9153"
      prometheus.io/scrape: "true"
    creationTimestamp: "2021-06-03T14:54:14Z"
    labels:
      k8s-app: kube-dns
      kubernetes.io/cluster-service: "true"
      kubernetes.io/name: CoreDNS
    name: kube-dns
    namespace: kube-system

Receiving resources as Tables

When you run kubectl get, the default output format is a simple tabular representation of one or more instances of a particular resource type. In the past, clients were required to reproduce the tabular and describe output implemented in kubectl to perform simple lists of objects. A few limitations of that approach include non-trivial logic when dealing with certain objects. Additionally, types provided by API aggregation or third party resources are not known at compile time. This means that generic implementations had to be in place for types unrecognized by a client.

In order to avoid potential limitations as described above, clients may request the Table representation of objects, delegating specific details of printing to the server. The Kubernetes API implements standard HTTP content type negotiation: passing an Accept header containing a value of application/json;as=Table;g=meta.k8s.io;v=v1 with a GET call will request that the server return objects in the Table content type.

For example, list all of the pods on a cluster in the Table format.

GET /api/v1/pods
Accept: application/json;as=Table;g=meta.k8s.io;v=v1
---
200 OK
Content-Type: application/json

{
    "kind": "Table",
    "apiVersion": "meta.k8s.io/v1",
    ...
    "columnDefinitions": [
        ...
    ]
}

For API resource types that do not have a custom Table definition known to the control plane, the API server returns a default Table response that consists of the resource's name and creationTimestamp fields.

GET /apis/crd.example.com/v1alpha1/namespaces/default/resources
---
200 OK
Content-Type: application/json
...

{
    "kind": "Table",
    "apiVersion": "meta.k8s.io/v1",
    ...
    "columnDefinitions": [
        {
            "name": "Name",
            "type": "string",
            ...
        },
        {
            "name": "Created At",
            "type": "date",
            ...
        }
    ]
}

Not all API resource types support a Table response; for example, a CustomResourceDefinitions might not define field-to-table mappings, and an APIService that extends the core Kubernetes API might not serve Table responses at all. If you are implementing a client that uses the Table information and must work against all resource types, including extensions, you should make requests that specify multiple content types in the Accept header. For example:

Accept: application/json;as=Table;g=meta.k8s.io;v=v1, application/json

Alternate representations of resources

By default, Kubernetes returns objects serialized to JSON with content type application/json. This is the default serialization format for the API. However, clients may request the more efficient Protobuf representation of these objects for better performance at scale. The Kubernetes API implements standard HTTP content type negotiation: passing an Accept header with a GET call will request that the server tries to return a response in your preferred media type, while sending an object in Protobuf to the server for a PUT or POST call means that you must set the Content-Type header appropriately.

The server will return a response with a Content-Type header if the requested format is supported, or the 406 Not acceptable error if none of the media types you requested are supported. All built-in resource types support the application/json media type.

See the Kubernetes API reference for a list of supported content types for each API.

For example:

  1. List all of the pods on a cluster in Protobuf format.

    GET /api/v1/pods
    Accept: application/vnd.kubernetes.protobuf
    ---
    200 OK
    Content-Type: application/vnd.kubernetes.protobuf
    
    ... binary encoded PodList object
    
  2. Create a pod by sending Protobuf encoded data to the server, but request a response in JSON.

    POST /api/v1/namespaces/test/pods
    Content-Type: application/vnd.kubernetes.protobuf
    Accept: application/json
    ... binary encoded Pod object
    ---
    200 OK
    Content-Type: application/json
    
    {
      "kind": "Pod",
      "apiVersion": "v1",
      ...
    }
    

Not all API resource types support Protobuf; specifically, Protobuf isn't available for resources that are defined as CustomResourceDefinitions or are served via the aggregation layer. As a client, if you might need to work with extension types you should specify multiple content types in the request Accept header to support fallback to JSON. For example:

Accept: application/vnd.kubernetes.protobuf, application/json

Kubernetes Protobuf encoding

Kubernetes uses an envelope wrapper to encode Protobuf responses. That wrapper starts with a 4 byte magic number to help identify content in disk or in etcd as Protobuf (as opposed to JSON), and then is followed by a Protobuf encoded wrapper message, which describes the encoding and type of the underlying object and then contains the object.

The wrapper format is:

A four byte magic number prefix:
  Bytes 0-3: "k8s\x00" [0x6b, 0x38, 0x73, 0x00]

An encoded Protobuf message with the following IDL:
  message Unknown {
    // typeMeta should have the string values for "kind" and "apiVersion" as set on the JSON object
    optional TypeMeta typeMeta = 1;

    // raw will hold the complete serialized object in protobuf. See the protobuf definitions in the client libraries for a given kind.
    optional bytes raw = 2;

    // contentEncoding is encoding used for the raw data. Unspecified means no encoding.
    optional string contentEncoding = 3;

    // contentType is the serialization method used to serialize 'raw'. Unspecified means application/vnd.kubernetes.protobuf and is usually
    // omitted.
    optional string contentType = 4;
  }

  message TypeMeta {
    // apiVersion is the group/version for this type
    optional string apiVersion = 1;
    // kind is the name of the object schema. A protobuf definition should exist for this object.
    optional string kind = 2;
  }

Resource deletion

When you delete a resource this takes place in two phases.

  1. finalization
  2. removal
{
  "kind": "ConfigMap",
  "apiVersion": "v1",
  "metadata": {
    "finalizers": {"url.io/neat-finalization", "other-url.io/my-finalizer"},
    "deletionTimestamp": nil,
  }
}

When a client first sends a delete to request removal of a resource, the .metadata.deletionTimestamp is set to the current time. Once the .metadata.deletionTimestamp is set, external controllers that act on finalizers may start performing their cleanup work at any time, in any order.

Order is not enforced between finalizers because it would introduce significant risk of stuck .metadata.finalizers.

The .metadata.finalizers field is shared: any actor with permission can reorder it. If the finalizer list were processed in order, then this might lead to a situation in which the component responsible for the first finalizer in the list is waiting for some signal (field value, external system, or other) produced by a component responsible for a finalizer later in the list, resulting in a deadlock.

Without enforced ordering, finalizers are free to order amongst themselves and are not vulnerable to ordering changes in the list.

Once the last finalizer is removed, the resource is actually removed from etcd.

Single resource API

The Kubernetes API verbs get, create, apply, update, patch, delete and proxy support single resources only. These verbs with single resource support have no support for submitting multiple resources together in an ordered or unordered list or transaction.

When clients (including kubectl) act on a set of resources, the client makes a series of single-resource API requests, then aggregates the responses if needed.

By contrast, the Kubernetes API verbs list and watch allow getting multiple resources, and deletecollection allows deleting multiple resources.

Field validation

Kubernetes always validates the type of fields. For example, if a field in the API is defined as a number, you cannot set the field to a text value. If a field is defined as an array of strings, you can only provide an array. Some fields allow you to omit them, other fields are required. Omitting a required field from an API request is an error.

If you make a request with an extra field, one that the cluster's control plane does not recognize, then the behavior of the API server is more complicated.

By default, the API server drops fields that it does not recognize from an input that it receives (for example, the JSON body of a PUT request).

There are two situations where the API server drops fields that you supplied in an HTTP request.

These situations are:

  1. The field is unrecognized because it is not in the resource's OpenAPI schema. (One exception to this is for CRDs that explicitly choose not to prune unknown fields via x-kubernetes-preserve-unknown-fields).
  2. The field is duplicated in the object.

Setting the field validation level

FEATURE STATE: Kubernetes v1.25 [beta]

Provided that the ServerSideFieldValidation feature gate is enabled (disabled by default in 1.23 and 1.24, enabled by default starting in 1.25), you can take advantage of server side field validation to catch these unrecognized fields.

When you use HTTP verbs that can submit data (POST, PUT, and PATCH), field validation gives you the option to choose how you would like to be notified of these fields that are being dropped by the API server. Possible levels of validation are Ignore, Warn, and Strict.

Field validation is set by the fieldValidation query parameter. The three values that you can provide for this parameter are:

Ignore
The API server succeeds in handling the request as it would without the erroneous fields being set, dropping all unknown and duplicate fields and giving no indication it has done so.
Warn
(Default) The API server succeeds in handling the request, and reports a warning to the client. The warning is sent using the Warning: response header, adding one warning item for each unknown or duplicate field. For more information about warnings and the Kubernetes API, see the blog article Warning: Helpful Warnings Ahead.
Strict
The API server rejects the request with a 400 Bad Request error when it detects any unknown or duplicate fields. The response message from the API server specifies all the unknown or duplicate fields that the API server has detected.

Tools that submit requests to the server (such as kubectl), might set their own defaults that are different from the Warn validation level that the API server uses by default.

The kubectl tool uses the --validate flag to set the level of field validation. Historically --validate was used to toggle client-side validation on or off as a boolean flag. Since Kubernetes 1.25, kubectl uses server-side field validation when sending requests to a serer with this feature enabled. Validation will fall back to client-side only when it cannot connect to an API server with field validation enabled. It accepts the values ignore, warn, and strict while also accepting the values true (equivalent to strict) and false (equivalent to ignore). The default validation setting for kubectl is --validate=true, which means strict server-side field validation.

Dry-run

FEATURE STATE: Kubernetes v1.18 [stable]

When you use HTTP verbs that can modify resources (POST, PUT, PATCH, and DELETE), you can submit your request in a dry run mode. Dry run mode helps to evaluate a request through the typical request stages (admission chain, validation, merge conflicts) up until persisting objects to storage. The response body for the request is as close as possible to a non-dry-run response. Kubernetes guarantees that dry-run requests will not be persisted in storage or have any other side effects.

Make a dry-run request

Dry-run is triggered by setting the dryRun query parameter. This parameter is a string, working as an enum, and the only accepted values are:

[no value set]
Allow side effects. You request this with a query string such as ?dryRun or ?dryRun&pretty=true. The response is the final object that would have been persisted, or an error if the request could not be fulfilled.
All
Every stage runs as normal, except for the final storage stage where side effects are prevented.

When you set ?dryRun=All, any relevant admission controllers are run, validating admission controllers check the request post-mutation, merge is performed on PATCH, fields are defaulted, and schema validation occurs. The changes are not persisted to the underlying storage, but the final object which would have been persisted is still returned to the user, along with the normal status code.

If the non-dry-run version of a request would trigger an admission controller that has side effects, the request will be failed rather than risk an unwanted side effect. All built in admission control plugins support dry-run. Additionally, admission webhooks can declare in their configuration object that they do not have side effects, by setting their sideEffects field to None.

Here is an example dry-run request that uses ?dryRun=All:

POST /api/v1/namespaces/test/pods?dryRun=All
Content-Type: application/json
Accept: application/json

The response would look the same as for non-dry-run request, but the values of some generated fields may differ.

Generated values

Some values of an object are typically generated before the object is persisted. It is important not to rely upon the values of these fields set by a dry-run request, since these values will likely be different in dry-run mode from when the real request is made. Some of these fields are:

  • name: if generateName is set, name will have a unique random name
  • creationTimestamp / deletionTimestamp: records the time of creation/deletion
  • UID: uniquely identifies the object and is randomly generated (non-deterministic)
  • resourceVersion: tracks the persisted version of the object
  • Any field set by a mutating admission controller
  • For the Service resource: Ports or IP addresses that the kube-apiserver assigns to Service objects

Dry-run authorization

Authorization for dry-run and non-dry-run requests is identical. Thus, to make a dry-run request, you must be authorized to make the non-dry-run request.

For example, to run a dry-run patch for a Deployment, you must be authorized to perform that patch. Here is an example of a rule for Kubernetes RBAC that allows patching Deployments:

rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["patch"]

See Authorization Overview.

Server Side Apply

Kubernetes' Server Side Apply feature allows the control plane to track managed fields for newly created objects. Server Side Apply provides a clear pattern for managing field conflicts, offers server-side Apply and Update operations, and replaces the client-side functionality of kubectl apply.

The API verb for Server-Side Apply is apply. See Server Side Apply for more details.

Resource versions

Resource versions are strings that identify the server's internal version of an object. Resource versions can be used by clients to determine when objects have changed, or to express data consistency requirements when getting, listing and watching resources. Resource versions must be treated as opaque by clients and passed unmodified back to the server.

You must not assume resource versions are numeric or collatable. API clients may only compare two resource versions for equality (this means that you must not compare resource versions for greater-than or less-than relationships).

resourceVersion fields in metadata

Clients find resource versions in resources, including the resources from the response stream for a watch, or when using list to enumerate resources.

v1.meta/ObjectMeta - The metadata.resourceVersion of a resource instance identifies the resource version the instance was last modified at.

v1.meta/ListMeta - The metadata.resourceVersion of a resource collection (the response to a list) identifies the resource version at which the collection was constructed.

resourceVersion parameters in query strings

The get, list, and watch operations support the resourceVersion parameter. From version v1.19, Kubernetes API servers also support the resourceVersionMatch parameter on list requests.

The API server interprets the resourceVersion parameter differently depending on the operation you request, and on the value of resourceVersion. If you set resourceVersionMatch then this also affects the way matching happens.

Semantics for get and list

For get and list, the semantics of resourceVersion are:

get:

resourceVersion unsetresourceVersion="0"resourceVersion="{value other than 0}"
Most RecentAnyNot older than

list:

From version v1.19, Kubernetes API servers support the resourceVersionMatch parameter on list requests. If you set both resourceVersion and resourceVersionMatch, the resourceVersionMatch parameter determines how the API server interprets resourceVersion.

You should always set the resourceVersionMatch parameter when setting resourceVersion on a list request. However, be prepared to handle the case where the API server that responds is unaware of resourceVersionMatch and ignores it.

Unless you have strong consistency requirements, using resourceVersionMatch=NotOlderThan and a known resourceVersion is preferable since it can achieve better performance and scalability of your cluster than leaving resourceVersion and resourceVersionMatch unset, which requires quorum read to be served.

Setting the resourceVersionMatch parameter without setting resourceVersion is not valid.

This table explains the behavior of list requests with various combinations of resourceVersion and resourceVersionMatch:

resourceVersionMatch and paging parameters for list
resourceVersionMatch parampaging paramsresourceVersion not setresourceVersion="0"resourceVersion="{value other than 0}"
unsetlimit unsetMost RecentAnyNot older than
unsetlimit=<n>, continue unsetMost RecentAnyExact
unsetlimit=<n>, continue=<token>Continue Token, ExactInvalid, treated as Continue Token, ExactInvalid, HTTP 400 Bad Request
resourceVersionMatch=Exactlimit unsetInvalidInvalidExact
resourceVersionMatch=Exactlimit=<n>, continue unsetInvalidInvalidExact
resourceVersionMatch=NotOlderThanlimit unsetInvalidAnyNot older than
resourceVersionMatch=NotOlderThanlimit=<n>, continue unsetInvalidAnyNot older than

The meaning of the get and list semantics are:

Any
Return data at any resource version. The newest available resource version is preferred, but strong consistency is not required; data at any resource version may be served. It is possible for the request to return data at a much older resource version that the client has previously observed, particularly in high availability configurations, due to partitions or stale caches. Clients that cannot tolerate this should not use this semantic.
Most recent
Return data at the most recent resource version. The returned data must be consistent (in detail: served from etcd via a quorum read).
Not older than
Return data at least as new as the provided resourceVersion. The newest available data is preferred, but any data not older than the provided resourceVersion may be served. For list requests to servers that honor the resourceVersionMatch parameter, this guarantees that the collection's .metadata.resourceVersion is not older than the requested resourceVersion, but does not make any guarantee about the .metadata.resourceVersion of any of the items in that collection.
Exact
Return data at the exact resource version provided. If the provided resourceVersion is unavailable, the server responds with HTTP 410 "Gone". For list requests to servers that honor the resourceVersionMatch parameter, this guarantees that the collection's .metadata.resourceVersion is the same as the resourceVersion you requested in the query string. That guarantee does not apply to the .metadata.resourceVersion of any items within that collection.
Continue Token, Exact
Return data at the resource version of the initial paginated list call. The returned continue tokens are responsible for keeping track of the initially provided resource version for all paginated list calls after the initial paginated list.

When using resourceVersionMatch=NotOlderThan and limit is set, clients must handle HTTP 410 "Gone" responses. For example, the client might retry with a newer resourceVersion or fall back to resourceVersion="".

When using resourceVersionMatch=Exact and limit is unset, clients must verify that the collection's .metadata.resourceVersion matches the requested resourceVersion, and handle the case where it does not. For example, the client might fall back to a request with limit set.

Semantics for watch

For watch, the semantics of resource version are:

watch:

resourceVersion for watch
resourceVersion unsetresourceVersion="0"resourceVersion="{value other than 0}"
Get State and Start at Most RecentGet State and Start at AnyStart at Exact

The meaning of those watch semantics are:

Get State and Start at Any
Start a watch at any resource version; the most recent resource version available is preferred, but not required. Any starting resource version is allowed. It is possible for the watch to start at a much older resource version that the client has previously observed, particularly in high availability configurations, due to partitions or stale caches. Clients that cannot tolerate this apparent rewinding should not start a watch with this semantic. To establish initial state, the watch begins with synthetic "Added" events for all resource instances that exist at the starting resource version. All following watch events are for all changes that occurred after the resource version the watch started at.
Get State and Start at Most Recent
Start a watch at the most recent resource version, which must be consistent (in detail: served from etcd via a quorum read). To establish initial state, the watch begins with synthetic "Added" events of all resources instances that exist at the starting resource version. All following watch events are for all changes that occurred after the resource version the watch started at.
Start at Exact
Start a watch at an exact resource version. The watch events are for all changes after the provided resource version. Unlike "Get State and Start at Most Recent" and "Get State and Start at Any", the watch is not started with synthetic "Added" events for the provided resource version. The client is assumed to already have the initial state at the starting resource version since the client provided the resource version.

"410 Gone" responses

Servers are not required to serve all older resource versions and may return a HTTP 410 (Gone) status code if a client requests a resourceVersion older than the server has retained. Clients must be able to tolerate 410 (Gone) responses. See Efficient detection of changes for details on how to handle 410 (Gone) responses when watching resources.

If you request a resourceVersion outside the applicable limit then, depending on whether a request is served from cache or not, the API server may reply with a 410 Gone HTTP response.

Unavailable resource versions

Servers are not required to serve unrecognized resource versions. If you request list or get for a resource version that the API server does not recognize, then the API server may either:

  • wait briefly for the resource version to become available, then timeout with a 504 (Gateway Timeout) if the provided resource versions does not become available in a reasonable amount of time;
  • respond with a Retry-After response header indicating how many seconds a client should wait before retrying the request.

If you request a resource version that an API server does not recognize, the kube-apiserver additionally identifies its error responses with a "Too large resource version" message.

If you make a watch request for an unrecognized resource version, the API server may wait indefinitely (until the request timeout) for the resource version to become available.

2.2 - Server-Side Apply

FEATURE STATE: Kubernetes v1.22 [stable]

Introduction

Server-Side Apply helps users and controllers manage their resources through declarative configurations. Clients can create and modify their objects declaratively by sending their fully specified intent.

A fully specified intent is a partial object that only includes the fields and values for which the user has an opinion. That intent either creates a new object or is combined, by the server, with the existing object.

The system supports multiple appliers collaborating on a single object.

Changes to an object's fields are tracked through a "field management" mechanism. When a field's value changes, ownership moves from its current manager to the manager making the change. When trying to apply an object, fields that have a different value and are owned by another manager will result in a conflict. This is done in order to signal that the operation might undo another collaborator's changes. Conflicts can be forced, in which case the value will be overridden, and the ownership will be transferred.

If you remove a field from a configuration and apply the configuration, Server-Side Apply checks if there are any other field managers that also own the field. If the field is not owned by any other field managers, it is either deleted from the live object or reset to its default value, if it has one. The same rule applies to associative list or map items.

Server-Side Apply is meant both as a replacement for the original kubectl apply and as a simpler mechanism for controllers to enact their changes.

If you have Server-Side Apply enabled, the control plane tracks managed fields for all newly created objects.

Field Management

Compared to the last-applied annotation managed by kubectl, Server-Side Apply uses a more declarative approach, which tracks a user's field management, rather than a user's last applied state. This means that as a side effect of using Server-Side Apply, information about which field manager manages each field in an object also becomes available.

For a user to manage a field, in the Server-Side Apply sense, means that the user relies on and expects the value of the field not to change. The user who last made an assertion about the value of a field will be recorded as the current field manager. This can be done either by changing the value with POST, PUT, or non-apply PATCH, or by including the field in a config sent to the Server-Side Apply endpoint. When using Server-Side Apply, trying to change a field which is managed by someone else will result in a rejected request (if not forced, see Conflicts).

When two or more appliers set a field to the same value, they share ownership of that field. Any subsequent attempt to change the value of the shared field, by any of the appliers, results in a conflict. Shared field owners may give up ownership of a field by removing it from their configuration.

Field management is stored in amanagedFields field that is part of an object's metadata.

A simple example of an object created by Server-Side Apply could look like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: test-cm
  namespace: default
  labels:
    test-label: test
  managedFields:
  - manager: kubectl
    operation: Apply
    apiVersion: v1
    time: "2010-10-10T0:00:00Z"
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:test-label: {}
      f:data:
        f:key: {}
data:
  key: some value

The above object contains a single manager in metadata.managedFields. The manager consists of basic information about the managing entity itself, like operation type, API version, and the fields managed by it.

Nevertheless it is possible to change metadata.managedFields through an Update operation. Doing so is highly discouraged, but might be a reasonable option to try if, for example, the managedFields get into an inconsistent state (which clearly should not happen).

The format of the managedFields is described in the API.

Conflicts

A conflict is a special status error that occurs when an Apply operation tries to change a field, which another user also claims to manage. This prevents an applier from unintentionally overwriting the value set by another user. When this occurs, the applier has 3 options to resolve the conflicts:

  • Overwrite value, become sole manager: If overwriting the value was intentional (or if the applier is an automated process like a controller) the applier should set the force query parameter to true (in kubectl, it can be done by using the --force-conflicts flag with the apply command) and make the request again. This forces the operation to succeed, changes the value of the field, and removes the field from all other managers' entries in managedFields.

  • Don't overwrite value, give up management claim: If the applier doesn't care about the value of the field anymore, they can remove it from their config and make the request again. This leaves the value unchanged, and causes the field to be removed from the applier's entry in managedFields.

  • Don't overwrite value, become shared manager: If the applier still cares about the value of the field, but doesn't want to overwrite it, they can change the value of the field in their config to match the value of the object on the server, and make the request again. This leaves the value unchanged, and causes the field's management to be shared by the applier and all other field managers that already claimed to manage it.

Managers

Managers identify distinct workflows that are modifying the object (especially useful on conflicts!), and can be specified through the fieldManager query parameter as part of a modifying request. It is required for the apply endpoint, though kubectl will default it to kubectl. For other updates, its default is computed from the user-agent.

Apply and Update

The two operation types considered by this feature are Apply (PATCH with content type application/apply-patch+yaml) and Update (all other operations which modify the object). Both operations update the managedFields, but behave a little differently.

For instance, only the apply operation fails on conflicts while update does not. Also, apply operations are required to identify themselves by providing a fieldManager query parameter, while the query parameter is optional for update operations. Finally, when using the apply operation you cannot have managedFields in the object that is being applied.

An example object with multiple managers could look like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: test-cm
  namespace: default
  labels:
    test-label: test
  managedFields:
  - manager: kubectl
    operation: Apply
    apiVersion: v1
    fields:
      f:metadata:
        f:labels:
          f:test-label: {}
  - manager: kube-controller-manager
    operation: Update
    apiVersion: v1
    time: '2019-03-30T16:00:00.000Z'
    fields:
      f:data:
        f:key: {}
data:
  key: new value

In this example, a second operation was run as an Update by the manager called kube-controller-manager. The update changed a value in the data field which caused the field's management to change to the kube-controller-manager.

If this update would have been an Apply operation, the operation would have failed due to conflicting ownership.

Merge strategy

The merging strategy, implemented with Server-Side Apply, provides a generally more stable object lifecycle. Server-Side Apply tries to merge fields based on the actor who manages them instead of overruling based on values. This way multiple actors can update the same object without causing unexpected interference.

When a user sends a "fully-specified intent" object to the Server-Side Apply endpoint, the server merges it with the live object favoring the value in the applied config if it is specified in both places. If the set of items present in the applied config is not a superset of the items applied by the same user last time, each missing item not managed by any other appliers is removed. For more information about how an object's schema is used to make decisions when merging, see sigs.k8s.io/structured-merge-diff.

A number of markers were added in Kubernetes 1.16 and 1.17, to allow API developers to describe the merge strategy supported by lists, maps, and structs. These markers can be applied to objects of the respective type, in Go files or in the OpenAPI schema definition of the CRD:

Golang markerOpenAPI extensionAccepted valuesDescriptionIntroduced in
//+listTypex-kubernetes-list-typeatomic/set/mapApplicable to lists. set applies to lists that include only scalar elements. These elements must be unique. map applies to lists of nested types only. The key values (see listMapKey) must be unique in the list. atomic can apply to any list. If configured as atomic, the entire list is replaced during merge. At any point in time, a single manager owns the list. If set or map, different managers can manage entries separately.1.16
//+listMapKeyx-kubernetes-list-map-keysList of field names, e.g. ["port", "protocol"]Only applicable when +listType=map. A list of field names whose values uniquely identify entries in the list. While there can be multiple keys, listMapKey is singular because keys need to be specified individually in the Go type. The key fields must be scalars.1.16
//+mapTypex-kubernetes-map-typeatomic/granularApplicable to maps. atomic means that the map can only be entirely replaced by a single manager. granular means that the map supports separate managers updating individual fields.1.17
//+structTypex-kubernetes-map-typeatomic/granularApplicable to structs; otherwise same usage and OpenAPI annotation as //+mapType.1.17

If listType is missing, the API server interprets a patchMergeStrategy=merge marker as a listType=map and the corresponding patchMergeKey marker as a listMapKey.

The atomic list type is recursive.

These markers are specified as comments and don't have to be repeated as field tags.

Compatibility across topology changes

On rare occurrences, a CRD or built-in type author may want to change the specific topology of a field in their resource without incrementing its version. Changing the topology of types, by upgrading the cluster or updating the CRD, has different consequences when updating existing objects. There are two categories of changes: when a field goes from map/set/granular to atomic and the other way around.

When the listType, mapType, or structType changes from map/set/granular to atomic, the whole list, map, or struct of existing objects will end-up being owned by actors who owned an element of these types. This means that any further change to these objects would cause a conflict.

When a list, map, or struct changes from atomic to map/set/granular, the API server won't be able to infer the new ownership of these fields. Because of that, no conflict will be produced when objects have these fields updated. For that reason, it is not recommended to change a type from atomic to map/set/granular.

Take for example, the custom resource:

apiVersion: example.com/v1
kind: Foo
metadata:
  name: foo-sample
  managedFields:
  - manager: manager-one
    operation: Apply
    apiVersion: example.com/v1
    fields:
      f:spec:
        f:data: {}
spec:
  data:
    key1: val1
    key2: val2

Before spec.data gets changed from atomic to granular, manager-one owns the field spec.data, and all the fields within it (key1 and key2). When the CRD gets changed to make spec.data granular, manager-one continues to own the top-level field spec.data (meaning no other managers can delete the map called data without a conflict), but it no longer owns key1 and key2, so another manager can then modify or delete those fields without conflict.

Custom Resources

By default, Server-Side Apply treats custom resources as unstructured data. All keys are treated the same as struct fields, and all lists are considered atomic.

If the Custom Resource Definition defines a schema that contains annotations as defined in the previous "Merge Strategy" section, these annotations will be used when merging objects of this type.

Using Server-Side Apply in a controller

As a developer of a controller, you can use server-side apply as a way to simplify the update logic of your controller. The main differences with a read-modify-write and/or patch are the following:

  • the applied object must contain all the fields that the controller cares about.
  • there is no way to remove fields that haven't been applied by the controller before (controller can still send a PATCH/UPDATE for these use-cases).
  • the object doesn't have to be read beforehand, resourceVersion doesn't have to be specified.

It is strongly recommended for controllers to always "force" conflicts, since they might not be able to resolve or act on these conflicts.

Transferring Ownership

In addition to the concurrency controls provided by conflict resolution, Server-Side Apply provides ways to perform coordinated field ownership transfers from users to controllers.

This is best explained by example. Let's look at how to safely transfer ownership of the replicas field from a user to a controller while enabling automatic horizontal scaling for a Deployment, using the HorizontalPodAutoscaler resource and its accompanying controller.

Say a user has defined deployment with replicas set to the desired value:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2

And the user has created the deployment using Server-Side Apply like so:

kubectl apply -f https://k8s.io/examples/application/ssa/nginx-deployment.yaml --server-side

Then later, HPA is enabled for the deployment, e.g.:

kubectl autoscale deployment nginx-deployment --cpu-percent=50 --min=1 --max=10

Now, the user would like to remove replicas from their configuration, so they don't accidentally fight with the HPA controller. However, there is a race: it might take some time before HPA feels the need to adjust replicas, and if the user removes replicas before the HPA writes to the field and becomes its owner, then apiserver will set replicas to 1, its default value. This is not what the user wants to happen, even temporarily.

There are two solutions:

  • (basic) Leave replicas in the configuration; when HPA eventually writes to that field, the system gives the user a conflict over it. At that point, it is safe to remove from the configuration.

  • (more advanced) If, however, the user doesn't want to wait, for example because they want to keep the cluster legible to coworkers, then they can take the following steps to make it safe to remove replicas from their configuration:

First, the user defines a new configuration containing only the replicas field:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3

The user applies that configuration using the field manager name handover-to-hpa:

kubectl apply -f https://k8s.io/examples/application/ssa/nginx-deployment-replicas-only.yaml \
  --server-side --field-manager=handover-to-hpa \
  --validate=false

If the apply results in a conflict with the HPA controller, then do nothing. The conflict indicates the controller has claimed the field earlier in the process than it sometimes does.

At this point the user may remove the replicas field from their configuration.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2

Note that whenever the HPA controller sets the replicas field to a new value, the temporary field manager will no longer own any fields and will be automatically deleted. No clean up is required.

Transferring Ownership Between Users

Users can transfer ownership of a field between each other by setting the field to the same value in both of their applied configs, causing them to share ownership of the field. Once the users share ownership of the field, one of them can remove the field from their applied configuration to give up ownership and complete the transfer to the other user.

Comparison with Client Side Apply

A consequence of the conflict detection and resolution implemented by Server-Side Apply is that an applier always has up to date field values in their local state. If they don't, they get a conflict the next time they apply. Any of the three options to resolve conflicts results in the applied configuration being an up to date subset of the object on the server's fields.

This is different from Client Side Apply, where outdated values which have been overwritten by other users are left in an applier's local config. These values only become accurate when the user updates that specific field, if ever, and an applier has no way of knowing whether their next apply will overwrite other users' changes.

Another difference is that an applier using Client Side Apply is unable to change the API version they are using, but Server-Side Apply supports this use case.

Upgrading from client-side apply to server-side apply

Client-side apply users who manage a resource with kubectl apply can start using server-side apply with the following flag.

kubectl apply --server-side [--dry-run=server]

By default, field management of the object transfers from client-side apply to kubectl server-side apply without encountering conflicts.

This behavior applies to server-side apply with the kubectl field manager. As an exception, you can opt-out of this behavior by specifying a different, non-default field manager, as seen in the following example. The default field manager for kubectl server-side apply is kubectl.

kubectl apply --server-side --field-manager=my-manager [--dry-run=server]

Downgrading from server-side apply to client-side apply

If you manage a resource with kubectl apply --server-side, you can downgrade to client-side apply directly with kubectl apply.

Downgrading works because kubectl server-side apply keeps the last-applied-configuration annotation up-to-date if you use kubectl apply.

This behavior applies to server-side apply with the kubectl field manager. As an exception, you can opt-out of this behavior by specifying a different, non-default field manager, as seen in the following example. The default field manager for kubectl server-side apply is kubectl.

kubectl apply --server-side --field-manager=my-manager [--dry-run=server]

API Endpoint

With the Server-Side Apply feature enabled, the PATCH endpoint accepts the additional application/apply-patch+yaml content type. Users of Server-Side Apply can send partially specified objects as YAML to this endpoint. When applying a configuration, one should always include all the fields that they have an opinion about.

RBAC and permissions

Since Server-Side Apply is a type of PATCH, a role will require the PATCH permission to edit resources, but will also need the CREATE verb permission in order to create resources with Server-Side Apply.

Clearing ManagedFields

It is possible to strip all managedFields from an object by overwriting them using MergePatch, StrategicMergePatch, JSONPatch, or Update, so every non-apply operation. This can be done by overwriting the managedFields field with an empty entry. Two examples are:

PATCH /api/v1/namespaces/default/configmaps/example-cm
Content-Type: application/merge-patch+json
Accept: application/json
Data: {"metadata":{"managedFields": [{}]}}
PATCH /api/v1/namespaces/default/configmaps/example-cm
Content-Type: application/json-patch+json
Accept: application/json
Data: [{"op": "replace", "path": "/metadata/managedFields", "value": [{}]}]

This will overwrite the managedFields with a list containing a single empty entry that then results in the managedFields being stripped entirely from the object. Note that setting the managedFields to an empty list will not reset the field. This is on purpose, so managedFields never get stripped by clients not aware of the field.

In cases where the reset operation is combined with changes to other fields than the managedFields, this will result in the managedFields being reset first and the other changes being processed afterwards. As a result the applier takes ownership of any fields updated in the same request.

2.3 - Client Libraries

This page contains an overview of the client libraries for using the Kubernetes API from various programming languages.

To write applications using the Kubernetes REST API, you do not need to implement the API calls and request/response types yourself. You can use a client library for the programming language you are using.

Client libraries often handle common tasks such as authentication for you. Most client libraries can discover and use the Kubernetes Service Account to authenticate if the API client is running inside the Kubernetes cluster, or can understand the kubeconfig file format to read the credentials and the API Server address.

Officially-supported Kubernetes client libraries

The following client libraries are officially maintained by Kubernetes SIG API Machinery.

LanguageClient LibrarySample Programs
Cgithub.com/kubernetes-client/cbrowse
dotnetgithub.com/kubernetes-client/csharpbrowse
Gogithub.com/kubernetes/client-go/browse
Haskellgithub.com/kubernetes-client/haskellbrowse
Javagithub.com/kubernetes-client/javabrowse
JavaScriptgithub.com/kubernetes-client/javascriptbrowse
Perlgithub.com/kubernetes-client/perl/browse
Pythongithub.com/kubernetes-client/python/browse
Rubygithub.com/kubernetes-client/ruby/browse

Community-maintained client libraries

The following Kubernetes API client libraries are provided and maintained by their authors, not the Kubernetes team.

LanguageClient Library
Clojuregithub.com/yanatan16/clj-kubernetes-api
DotNetgithub.com/tonnyeremin/kubernetes_gen
DotNet (RestSharp)github.com/masroorhasan/Kubernetes.DotNet
Elixirgithub.com/obmarg/kazan
Elixirgithub.com/coryodaniel/k8s
Gogithub.com/ericchiang/k8s
Java (OSGi)bitbucket.org/amdatulabs/amdatu-kubernetes
Java (Fabric8, OSGi)github.com/fabric8io/kubernetes-client
Javagithub.com/manusa/yakc
Lispgithub.com/brendandburns/cl-k8s
Lispgithub.com/xh4/cube
Node.js (TypeScript)github.com/Goyoo/node-k8s-client
Node.jsgithub.com/ajpauwels/easy-k8s
Node.jsgithub.com/godaddy/kubernetes-client
Node.jsgithub.com/tenxcloud/node-kubernetes-client
Perlmetacpan.org/pod/Net::Kubernetes
PHPgithub.com/allansun/kubernetes-php-client
PHPgithub.com/maclof/kubernetes-client
PHPgithub.com/travisghansen/kubernetes-client-php
PHPgithub.com/renoki-co/php-k8s
Pythongithub.com/fiaas/k8s
Pythongithub.com/gtsystem/lightkube
Pythongithub.com/mnubo/kubernetes-py
Pythongithub.com/tomplus/kubernetes_asyncio
Pythongithub.com/Frankkkkk/pykorm
Rubygithub.com/abonas/kubeclient
Rubygithub.com/k8s-ruby/k8s-ruby
Rubygithub.com/kontena/k8s-client
Rustgithub.com/clux/kube-rs
Rustgithub.com/ynqa/kubernetes-rust
Scalagithub.com/hagay3/skuber
Scalagithub.com/hnaderi/scala-k8s
Scalagithub.com/joan38/kubernetes-client
Swiftgithub.com/swiftkube/client

2.4 - Kubernetes Deprecation Policy

This document details the deprecation policy for various facets of the system.

Kubernetes is a large system with many components and many contributors. As with any such software, the feature set naturally evolves over time, and sometimes a feature may need to be removed. This could include an API, a flag, or even an entire feature. To avoid breaking existing users, Kubernetes follows a deprecation policy for aspects of the system that are slated to be removed.

Deprecating parts of the API

Since Kubernetes is an API-driven system, the API has evolved over time to reflect the evolving understanding of the problem space. The Kubernetes API is actually a set of APIs, called "API groups", and each API group is independently versioned. API versions fall into 3 main tracks, each of which has different policies for deprecation:

ExampleTrack
v1GA (generally available, stable)
v1beta1Beta (pre-release)
v1alpha1Alpha (experimental)

A given release of Kubernetes can support any number of API groups and any number of versions of each.

The following rules govern the deprecation of elements of the API. This includes:

  • REST resources (aka API objects)
  • Fields of REST resources
  • Annotations on REST resources, including "beta" annotations but not including "alpha" annotations.
  • Enumerated or constant values
  • Component config structures

These rules are enforced between official releases, not between arbitrary commits to master or release branches.

Rule #1: API elements may only be removed by incrementing the version of the API group.

Once an API element has been added to an API group at a particular version, it can not be removed from that version or have its behavior significantly changed, regardless of track.

Rule #2: API objects must be able to round-trip between API versions in a given release without information loss, with the exception of whole REST resources that do not exist in some versions.

For example, an object can be written as v1 and then read back as v2 and converted to v1, and the resulting v1 resource will be identical to the original. The representation in v2 might be different from v1, but the system knows how to convert between them in both directions. Additionally, any new field added in v2 must be able to round-trip to v1 and back, which means v1 might have to add an equivalent field or represent it as an annotation.

Rule #3: An API version in a given track may not be deprecated in favor of a less stable API version.

  • GA API versions can replace beta and alpha API versions.
  • Beta API versions can replace earlier beta and alpha API versions, but may not replace GA API versions.
  • Alpha API versions can replace earlier alpha API versions, but may not replace GA or beta API versions.

Rule #4a: API lifetime is determined by the API stability level

  • GA API versions may be marked as deprecated, but must not be removed within a major version of Kubernetes
  • Beta API versions are deprecated no more than 9 months or 3 minor releases after introduction (whichever is longer), and are no longer served 9 months or 3 minor releases after deprecation (whichever is longer)
  • Alpha API versions may be removed in any release without prior deprecation notice

This ensures beta API support covers the maximum supported version skew of 2 releases, and that APIs don't stagnate on unstable beta versions, accumulating production usage that will be disrupted when support for the beta API ends.

Rule #4b: The "preferred" API version and the "storage version" for a given group may not advance until after a release has been made that supports both the new version and the previous version

Users must be able to upgrade to a new release of Kubernetes and then roll back to a previous release, without converting anything to the new API version or suffering breakages (unless they explicitly used features only available in the newer version). This is particularly evident in the stored representation of objects.

All of this is best illustrated by examples. Imagine a Kubernetes release, version X, which introduces a new API group. A new Kubernetes release is made every approximately 4 months (3 per year). The following table describes which API versions are supported in a series of subsequent releases.

ReleaseAPI VersionsPreferred/Storage VersionNotes
Xv1alpha1v1alpha1
X+1v1alpha2v1alpha2
  • v1alpha1 is removed, "action required" relnote
X+2v1beta1v1beta1
  • v1alpha2 is removed, "action required" relnote
X+3v1beta2, v1beta1 (deprecated)v1beta1
  • v1beta1 is deprecated, "action required" relnote
X+4v1beta2, v1beta1 (deprecated)v1beta2
X+5v1, v1beta1 (deprecated), v1beta2 (deprecated)v1beta2
  • v1beta2 is deprecated, "action required" relnote
X+6v1, v1beta2 (deprecated)v1
  • v1beta1 is removed, "action required" relnote
X+7v1, v1beta2 (deprecated)v1
X+8v2alpha1, v1v1
  • v1beta2 is removed, "action required" relnote
X+9v2alpha2, v1v1
  • v2alpha1 is removed, "action required" relnote
X+10v2beta1, v1v1
  • v2alpha2 is removed, "action required" relnote
X+11v2beta2, v2beta1 (deprecated), v1v1
  • v2beta1 is deprecated, "action required" relnote
X+12v2, v2beta2 (deprecated), v2beta1 (deprecated), v1 (deprecated)v1
  • v2beta2 is deprecated, "action required" relnote
  • v1 is deprecated in favor of v2, but will not be removed
X+13v2, v2beta1 (deprecated), v2beta2 (deprecated), v1 (deprecated)v2
X+14v2, v2beta2 (deprecated), v1 (deprecated)v2
  • v2beta1 is removed, "action required" relnote
X+15v2, v1 (deprecated)v2
  • v2beta2 is removed, "action required" relnote

REST resources (aka API objects)

Consider a hypothetical REST resource named Widget, which was present in API v1 in the above timeline, and which needs to be deprecated. We document and announce the deprecation in sync with release X+1. The Widget resource still exists in API version v1 (deprecated) but not in v2alpha1. The Widget resource continues to exist and function in releases up to and including X+8. Only in release X+9, when API v1 has aged out, does the Widget resource cease to exist, and the behavior get removed.

Starting in Kubernetes v1.19, making an API request to a deprecated REST API endpoint:

  1. Returns a Warning header (as defined in RFC7234, Section 5.5) in the API response.

  2. Adds a "k8s.io/deprecated":"true" annotation to the audit event recorded for the request.

  3. Sets an apiserver_requested_deprecated_apis gauge metric to 1 in the kube-apiserver process. The metric has labels for group, version, resource, subresource that can be joined to the apiserver_request_total metric, and a removed_release label that indicates the Kubernetes release in which the API will no longer be served. The following Prometheus query returns information about requests made to deprecated APIs which will be removed in v1.22:

    apiserver_requested_deprecated_apis{removed_release="1.22"} * on(group,version,resource,subresource) group_right() apiserver_request_total
    

Fields of REST resources

As with whole REST resources, an individual field which was present in API v1 must exist and function until API v1 is removed. Unlike whole resources, the v2 APIs may choose a different representation for the field, as long as it can be round-tripped. For example a v1 field named "magnitude" which was deprecated might be named "deprecatedMagnitude" in API v2. When v1 is eventually removed, the deprecated field can be removed from v2.

Enumerated or constant values

As with whole REST resources and fields thereof, a constant value which was supported in API v1 must exist and function until API v1 is removed.

Component config structures

Component configs are versioned and managed similar to REST resources.

Future work

Over time, Kubernetes will introduce more fine-grained API versions, at which point these rules will be adjusted as needed.

Deprecating a flag or CLI

The Kubernetes system is comprised of several different programs cooperating. Sometimes, a Kubernetes release might remove flags or CLI commands (collectively "CLI elements") in these programs. The individual programs naturally sort into two main groups - user-facing and admin-facing programs, which vary slightly in their deprecation policies. Unless a flag is explicitly prefixed or documented as "alpha" or "beta", it is considered GA.

CLI elements are effectively part of the API to the system, but since they are not versioned in the same way as the REST API, the rules for deprecation are as follows:

Rule #5a: CLI elements of user-facing components (e.g. kubectl) must function after their announced deprecation for no less than:

  • GA: 12 months or 2 releases (whichever is longer)
  • Beta: 3 months or 1 release (whichever is longer)
  • Alpha: 0 releases

Rule #5b: CLI elements of admin-facing components (e.g. kubelet) must function after their announced deprecation for no less than:

  • GA: 6 months or 1 release (whichever is longer)
  • Beta: 3 months or 1 release (whichever is longer)
  • Alpha: 0 releases

Rule #6: Deprecated CLI elements must emit warnings (optionally disable) when used.

Deprecating a feature or behavior

Occasionally a Kubernetes release needs to deprecate some feature or behavior of the system that is not controlled by the API or CLI. In this case, the rules for deprecation are as follows:

Rule #7: Deprecated behaviors must function for no less than 1 year after their announced deprecation.

This does not imply that all changes to the system are governed by this policy. This applies only to significant, user-visible behaviors which impact the correctness of applications running on Kubernetes or that impact the administration of Kubernetes clusters, and which are being removed entirely.

An exception to the above rule is feature gates. Feature gates are key=value pairs that allow for users to enable/disable experimental features.

Feature gates are intended to cover the development life cycle of a feature - they are not intended to be long-term APIs. As such, they are expected to be deprecated and removed after a feature becomes GA or is dropped.

As a feature moves through the stages, the associated feature gate evolves. The feature life cycle matched to its corresponding feature gate is:

  • Alpha: the feature gate is disabled by default and can be enabled by the user.
  • Beta: the feature gate is enabled by default and can be disabled by the user.
  • GA: the feature gate is deprecated (see "Deprecation") and becomes non-operational.
  • GA, deprecation window complete: the feature gate is removed and calls to it are no longer accepted.

Deprecation

Features can be removed at any point in the life cycle prior to GA. When features are removed prior to GA, their associated feature gates are also deprecated.

When an invocation tries to disable a non-operational feature gate, the call fails in order to avoid unsupported scenarios that might otherwise run silently.

In some cases, removing pre-GA features requires considerable time. Feature gates can remain operational until their associated feature is fully removed, at which point the feature gate itself can be deprecated.

When removing a feature gate for a GA feature also requires considerable time, calls to feature gates may remain operational if the feature gate has no effect on the feature, and if the feature gate causes no errors.

Features intended to be disabled by users should include a mechanism for disabling the feature in the associated feature gate.

Versioning for feature gates is different from the previously discussed components, therefore the rules for deprecation are as follows:

Rule #8: Feature gates must be deprecated when the corresponding feature they control transitions a lifecycle stage as follows. Feature gates must function for no less than:

  • Beta feature to GA: 6 months or 2 releases (whichever is longer)
  • Beta feature to EOL: 3 months or 1 release (whichever is longer)
  • Alpha feature to EOL: 0 releases

Rule #9: Deprecated feature gates must respond with a warning when used. When a feature gate is deprecated it must be documented in both in the release notes and the corresponding CLI help. Both warnings and documentation must indicate whether a feature gate is non-operational.

Deprecating a metric

Each component of the Kubernetes control-plane exposes metrics (usually the /metrics endpoint), which are typically ingested by cluster administrators. Not all metrics are the same: some metrics are commonly used as SLIs or used to determine SLOs, these tend to have greater import. Other metrics are more experimental in nature or are used primarily in the Kubernetes development process.

Accordingly, metrics fall under two stability classes (ALPHA and STABLE); this impacts removal of a metric during a Kubernetes release. These classes are determined by the perceived importance of the metric. The rules for deprecating and removing a metric are as follows:

Rule #9a: Metrics, for the corresponding stability class, must function for no less than:

  • STABLE: 4 releases or 12 months (whichever is longer)
  • ALPHA: 0 releases

Rule #9b: Metrics, after their announced deprecation, must function for no less than:

  • STABLE: 3 releases or 9 months (whichever is longer)
  • ALPHA: 0 releases

Deprecated metrics will have their description text prefixed with a deprecation notice string '(Deprecated from x.y)' and a warning log will be emitted during metric registration. Like their stable undeprecated counterparts, deprecated metrics will be automatically registered to the metrics endpoint and therefore visible.

On a subsequent release (when the metric's deprecatedVersion is equal to current_kubernetes_version - 3)), a deprecated metric will become a hidden metric. Unlike their deprecated counterparts, hidden metrics will no longer be automatically registered to the metrics endpoint (hence hidden). However, they can be explicitly enabled through a command line flag on the binary (--show-hidden-metrics-for-version=). This provides cluster admins an escape hatch to properly migrate off of a deprecated metric, if they were not able to react to the earlier deprecation warnings. Hidden metrics should be deleted after one release.

Exceptions

No policy can cover every possible situation. This policy is a living document, and will evolve over time. In practice, there will be situations that do not fit neatly into this policy, or for which this policy becomes a serious impediment. Such situations should be discussed with SIGs and project leaders to find the best solutions for those specific cases, always bearing in mind that Kubernetes is committed to being a stable system that, as much as possible, never breaks users. Exceptions will always be announced in all relevant release notes.

2.5 - Deprecated API Migration Guide

As the Kubernetes API evolves, APIs are periodically reorganized or upgraded. When APIs evolve, the old API is deprecated and eventually removed. This page contains information you need to know when migrating from deprecated API versions to newer and more stable API versions.

Removed APIs by release

v1.27

The v1.27 release will stop serving the following deprecated API versions:

CSIStorageCapacity

The storage.k8s.io/v1beta1 API version of CSIStorageCapacity will no longer be served in v1.27.

  • Migrate manifests and API clients to use the storage.k8s.io/v1 API version, available since v1.24.
  • All existing persisted objects are accessible via the new API
  • No notable changes

v1.26

The v1.26 release will stop serving the following deprecated API versions:

Flow control resources

The flowcontrol.apiserver.k8s.io/v1beta1 API version of FlowSchema and PriorityLevelConfiguration will no longer be served in v1.26.

  • Migrate manifests and API clients to use the flowcontrol.apiserver.k8s.io/v1beta2 API version, available since v1.23.
  • All existing persisted objects are accessible via the new API
  • No notable changes

HorizontalPodAutoscaler

The autoscaling/v2beta2 API version of HorizontalPodAutoscaler will no longer be served in v1.26.

  • Migrate manifests and API clients to use the autoscaling/v2 API version, available since v1.23.
  • All existing persisted objects are accessible via the new API

v1.25

The v1.25 release will stop serving the following deprecated API versions:

CronJob

The batch/v1beta1 API version of CronJob will no longer be served in v1.25.

  • Migrate manifests and API clients to use the batch/v1 API version, available since v1.21.
  • All existing persisted objects are accessible via the new API
  • No notable changes

EndpointSlice

The discovery.k8s.io/v1beta1 API version of EndpointSlice will no longer be served in v1.25.

  • Migrate manifests and API clients to use the discovery.k8s.io/v1 API version, available since v1.21.
  • All existing persisted objects are accessible via the new API
  • Notable changes in discovery.k8s.io/v1:
    • use per Endpoint nodeName field instead of deprecated topology["kubernetes.io/hostname"] field
    • use per Endpoint zone field instead of deprecated topology["topology.kubernetes.io/zone"] field
    • topology is replaced with the deprecatedTopology field which is not writable in v1

Event

The events.k8s.io/v1beta1 API version of Event will no longer be served in v1.25.

  • Migrate manifests and API clients to use the events.k8s.io/v1 API version, available since v1.19.
  • All existing persisted objects are accessible via the new API
  • Notable changes in events.k8s.io/v1:
    • type is limited to Normal and Warning
    • involvedObject is renamed to regarding
    • action, reason, reportingController, and reportingInstance are required when creating new events.k8s.io/v1 Events
    • use eventTime instead of the deprecated firstTimestamp field (which is renamed to deprecatedFirstTimestamp and not permitted in new events.k8s.io/v1 Events)
    • use series.lastObservedTime instead of the deprecated lastTimestamp field (which is renamed to deprecatedLastTimestamp and not permitted in new events.k8s.io/v1 Events)
    • use series.count instead of the deprecated count field (which is renamed to deprecatedCount and not permitted in new events.k8s.io/v1 Events)
    • use reportingController instead of the deprecated source.component field (which is renamed to deprecatedSource.component and not permitted in new events.k8s.io/v1 Events)
    • use reportingInstance instead of the deprecated source.host field (which is renamed to deprecatedSource.host and not permitted in new events.k8s.io/v1 Events)

HorizontalPodAutoscaler

The autoscaling/v2beta1 API version of HorizontalPodAutoscaler will no longer be served in v1.25.

  • Migrate manifests and API clients to use the autoscaling/v2 API version, available since v1.23.
  • All existing persisted objects are accessible via the new API

PodDisruptionBudget

The policy/v1beta1 API version of PodDisruptionBudget will no longer be served in v1.25.

  • Migrate manifests and API clients to use the policy/v1 API version, available since v1.21.
  • All existing persisted objects are accessible via the new API
  • Notable changes in policy/v1:
    • an empty spec.selector ({}) written to a policy/v1 PodDisruptionBudget selects all pods in the namespace (in policy/v1beta1 an empty spec.selector selected no pods). An unset spec.selector selects no pods in either API version.

PodSecurityPolicy

PodSecurityPolicy in the policy/v1beta1 API version will no longer be served in v1.25, and the PodSecurityPolicy admission controller will be removed.

Migrate to Pod Security Admission or a 3rd party admission webhook. For a migration guide, see Migrate from PodSecurityPolicy to the Built-In PodSecurity Admission Controller. For more information on the deprecation, see PodSecurityPolicy Deprecation: Past, Present, and Future.

RuntimeClass

RuntimeClass in the node.k8s.io/v1beta1 API version will no longer be served in v1.25.

  • Migrate manifests and API clients to use the node.k8s.io/v1 API version, available since v1.20.
  • All existing persisted objects are accessible via the new API
  • No notable changes

v1.22

The v1.22 release stopped serving the following deprecated API versions:

Webhook resources

The admissionregistration.k8s.io/v1beta1 API version of MutatingWebhookConfiguration and ValidatingWebhookConfiguration is no longer served as of v1.22.

  • Migrate manifests and API clients to use the admissionregistration.k8s.io/v1 API version, available since v1.16.
  • All existing persisted objects are accessible via the new APIs
  • Notable changes:
    • webhooks[*].failurePolicy default changed from Ignore to Fail for v1
    • webhooks[*].matchPolicy default changed from Exact to Equivalent for v1
    • webhooks[*].timeoutSeconds default changed from 30s to 10s for v1
    • webhooks[*].sideEffects default value is removed, and the field made required, and only None and NoneOnDryRun are permitted for v1
    • webhooks[*].admissionReviewVersions default value is removed and the field made required for v1 (supported versions for AdmissionReview are v1 and v1beta1)
    • webhooks[*].name must be unique in the list for objects created via admissionregistration.k8s.io/v1

CustomResourceDefinition

The apiextensions.k8s.io/v1beta1 API version of CustomResourceDefinition is no longer served as of v1.22.

  • Migrate manifests and API clients to use the apiextensions.k8s.io/v1 API version, available since v1.16.
  • All existing persisted objects are accessible via the new API
  • Notable changes:
    • spec.scope is no longer defaulted to Namespaced and must be explicitly specified
    • spec.version is removed in v1; use spec.versions instead
    • spec.validation is removed in v1; use spec.versions[*].schema instead
    • spec.subresources is removed in v1; use spec.versions[*].subresources instead
    • spec.additionalPrinterColumns is removed in v1; use spec.versions[*].additionalPrinterColumns instead
    • spec.conversion.webhookClientConfig is moved to spec.conversion.webhook.clientConfig in v1
    • spec.conversion.conversionReviewVersions is moved to spec.conversion.webhook.conversionReviewVersions in v1
    • spec.versions[*].schema.openAPIV3Schema is now required when creating v1 CustomResourceDefinition objects, and must be a structural schema
    • spec.preserveUnknownFields: true is disallowed when creating v1 CustomResourceDefinition objects; it must be specified within schema definitions as x-kubernetes-preserve-unknown-fields: true
    • In additionalPrinterColumns items, the JSONPath field was renamed to jsonPath in v1 (fixes #66531)

APIService

The apiregistration.k8s.io/v1beta1 API version of APIService is no longer served as of v1.22.

  • Migrate manifests and API clients to use the apiregistration.k8s.io/v1 API version, available since v1.10.
  • All existing persisted objects are accessible via the new API
  • No notable changes

TokenReview

The authentication.k8s.io/v1beta1 API version of TokenReview is no longer served as of v1.22.

  • Migrate manifests and API clients to use the authentication.k8s.io/v1 API version, available since v1.6.
  • No notable changes

SubjectAccessReview resources

The authorization.k8s.io/v1beta1 API version of LocalSubjectAccessReview, SelfSubjectAccessReview, SubjectAccessReview, and SelfSubjectRulesReview is no longer served as of v1.22.

  • Migrate manifests and API clients to use the authorization.k8s.io/v1 API version, available since v1.6.
  • Notable changes:
    • spec.group was renamed to spec.groups in v1 (fixes #32709)

CertificateSigningRequest

The certificates.k8s.io/v1beta1 API version of CertificateSigningRequest is no longer served as of v1.22.

  • Migrate manifests and API clients to use the certificates.k8s.io/v1 API version, available since v1.19.
  • All existing persisted objects are accessible via the new API
  • Notable changes in certificates.k8s.io/v1:
    • For API clients requesting certificates:
      • spec.signerName is now required (see known Kubernetes signers), and requests for kubernetes.io/legacy-unknown are not allowed to be created via the certificates.k8s.io/v1 API
      • spec.usages is now required, may not contain duplicate values, and must only contain known usages
    • For API clients approving or signing certificates:
      • status.conditions may not contain duplicate types
      • status.conditions[*].status is now required
      • status.certificate must be PEM-encoded, and contain only CERTIFICATE blocks

Lease

The coordination.k8s.io/v1beta1 API version of Lease is no longer served as of v1.22.

  • Migrate manifests and API clients to use the coordination.k8s.io/v1 API version, available since v1.14.
  • All existing persisted objects are accessible via the new API
  • No notable changes

Ingress

The extensions/v1beta1 and networking.k8s.io/v1beta1 API versions of Ingress is no longer served as of v1.22.

  • Migrate manifests and API clients to use the networking.k8s.io/v1 API version, available since v1.19.
  • All existing persisted objects are accessible via the new API
  • Notable changes:
    • spec.backend is renamed to spec.defaultBackend
    • The backend serviceName field is renamed to service.name
    • Numeric backend servicePort fields are renamed to service.port.number
    • String backend servicePort fields are renamed to service.port.name
    • pathType is now required for each specified path. Options are Prefix, Exact, and ImplementationSpecific. To match the undefined v1beta1 behavior, use ImplementationSpecific.

IngressClass

The networking.k8s.io/v1beta1 API version of IngressClass is no longer served as of v1.22.

  • Migrate manifests and API clients to use the networking.k8s.io/v1 API version, available since v1.19.
  • All existing persisted objects are accessible via the new API
  • No notable changes

RBAC resources

The rbac.authorization.k8s.io/v1beta1 API version of ClusterRole, ClusterRoleBinding, Role, and RoleBinding is no longer served as of v1.22.

  • Migrate manifests and API clients to use the rbac.authorization.k8s.io/v1 API version, available since v1.8.
  • All existing persisted objects are accessible via the new APIs
  • No notable changes

PriorityClass

The scheduling.k8s.io/v1beta1 API version of PriorityClass is no longer served as of v1.22.

  • Migrate manifests and API clients to use the scheduling.k8s.io/v1 API version, available since v1.14.
  • All existing persisted objects are accessible via the new API
  • No notable changes

Storage resources

The storage.k8s.io/v1beta1 API version of CSIDriver, CSINode, StorageClass, and VolumeAttachment is no longer served as of v1.22.

  • Migrate manifests and API clients to use the storage.k8s.io/v1 API version
    • CSIDriver is available in storage.k8s.io/v1 since v1.19.
    • CSINode is available in storage.k8s.io/v1 since v1.17
    • StorageClass is available in storage.k8s.io/v1 since v1.6
    • VolumeAttachment is available in storage.k8s.io/v1 v1.13
  • All existing persisted objects are accessible via the new APIs
  • No notable changes

v1.16

The v1.16 release stopped serving the following deprecated API versions:

NetworkPolicy

The extensions/v1beta1 API version of NetworkPolicy is no longer served as of v1.16.

  • Migrate manifests and API clients to use the networking.k8s.io/v1 API version, available since v1.8.
  • All existing persisted objects are accessible via the new API

DaemonSet

The extensions/v1beta1 and apps/v1beta2 API versions of DaemonSet are no longer served as of v1.16.

  • Migrate manifests and API clients to use the apps/v1 API version, available since v1.9.
  • All existing persisted objects are accessible via the new API
  • Notable changes:
    • spec.templateGeneration is removed
    • spec.selector is now required and immutable after creation; use the existing template labels as the selector for seamless upgrades
    • spec.updateStrategy.type now defaults to RollingUpdate (the default in extensions/v1beta1 was OnDelete)

Deployment

The extensions/v1beta1, apps/v1beta1, and apps/v1beta2 API versions of Deployment are no longer served as of v1.16.

  • Migrate manifests and API clients to use the apps/v1 API version, available since v1.9.
  • All existing persisted objects are accessible via the new API
  • Notable changes:
    • spec.rollbackTo is removed
    • spec.selector is now required and immutable after creation; use the existing template labels as the selector for seamless upgrades
    • spec.progressDeadlineSeconds now defaults to 600 seconds (the default in extensions/v1beta1 was no deadline)
    • spec.revisionHistoryLimit now defaults to 10 (the default in apps/v1beta1 was 2, the default in extensions/v1beta1 was to retain all)
    • maxSurge and maxUnavailable now default to 25% (the default in extensions/v1beta1 was 1)

StatefulSet

The apps/v1beta1 and apps/v1beta2 API versions of StatefulSet are no longer served as of v1.16.

  • Migrate manifests and API clients to use the apps/v1 API version, available since v1.9.
  • All existing persisted objects are accessible via the new API
  • Notable changes:
    • spec.selector is now required and immutable after creation; use the existing template labels as the selector for seamless upgrades
    • spec.updateStrategy.type now defaults to RollingUpdate (the default in apps/v1beta1 was OnDelete)

ReplicaSet

The extensions/v1beta1, apps/v1beta1, and apps/v1beta2 API versions of ReplicaSet are no longer served as of v1.16.

  • Migrate manifests and API clients to use the apps/v1 API version, available since v1.9.
  • All existing persisted objects are accessible via the new API
  • Notable changes:
    • spec.selector is now required and immutable after creation; use the existing template labels as the selector for seamless upgrades

PodSecurityPolicy

The extensions/v1beta1 API version of PodSecurityPolicy is no longer served as of v1.16.

  • Migrate manifests and API client to use the policy/v1beta1 API version, available since v1.10.
  • Note that the policy/v1beta1 API version of PodSecurityPolicy will be removed in v1.25.

What to do

Test with deprecated APIs disabled

You can test your clusters by starting an API server with specific API versions disabled to simulate upcoming removals. Add the following flag to the API server startup arguments:

--runtime-config=<group>/<version>=false

For example:

--runtime-config=admissionregistration.k8s.io/v1beta1=false,apiextensions.k8s.io/v1beta1,...

Locate use of deprecated APIs

Use client warnings, metrics, and audit information available in 1.19+ to locate use of deprecated APIs.

Migrate to non-deprecated APIs

  • Update custom integrations and controllers to call the non-deprecated APIs

  • Change YAML files to reference the non-deprecated APIs

    You can use the kubectl-convert command (kubectl convert prior to v1.20) to automatically convert an existing object:

    kubectl-convert -f <file> --output-version <group>/<version>.

    For example, to convert an older Deployment to apps/v1, you can run:

    kubectl-convert -f ./my-deployment.yaml --output-version apps/v1

    Note that this may use non-ideal default values. To learn more about a specific resource, check the Kubernetes API reference.

2.6 - Kubernetes API health endpoints

The Kubernetes API server provides API endpoints to indicate the current status of the API server. This page describes these API endpoints and explains how you can use them.

API endpoints for health

The Kubernetes API server provides 3 API endpoints (healthz, livez and readyz) to indicate the current status of the API server. The healthz endpoint is deprecated (since Kubernetes v1.16), and you should use the more specific livez and readyz endpoints instead. The livez endpoint can be used with the --livez-grace-period flag to specify the startup duration. For a graceful shutdown you can specify the --shutdown-delay-duration flag with the /readyz endpoint. Machines that check the healthz/livez/readyz of the API server should rely on the HTTP status code. A status code 200 indicates the API server is healthy/live/ready, depending on the called endpoint. The more verbose options shown below are intended to be used by human operators to debug their cluster or understand the state of the API server.

The following examples will show how you can interact with the health API endpoints.

For all endpoints, you can use the verbose parameter to print out the checks and their status. This can be useful for a human operator to debug the current status of the API server, it is not intended to be consumed by a machine:

curl -k https://localhost:6443/livez?verbose

or from a remote host with authentication:

kubectl get --raw='/readyz?verbose'

The output will look like this:

[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check passed

The Kubernetes API server also supports to exclude specific checks. The query parameters can also be combined like in this example:

curl -k 'https://localhost:6443/readyz?verbose&exclude=etcd'

The output show that the etcd check is excluded:

[+]ping ok
[+]log ok
[+]etcd excluded: ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]shutdown ok
healthz check passed

Individual health checks

FEATURE STATE: Kubernetes v1.25 [alpha]

Each individual health check exposes an HTTP endpoint and can be checked individually. The schema for the individual health checks is /livez/<healthcheck-name> where livez and readyz and be used to indicate if you want to check the liveness or the readiness of the API server. The <healthcheck-name> path can be discovered using the verbose flag from above and take the path between [+] and ok. These individual health checks should not be consumed by machines but can be helpful for a human operator to debug a system:

curl -k https://localhost:6443/livez/etcd

3.1 - Authenticating

This page provides an overview of authenticating.

Users in Kubernetes

All Kubernetes clusters have two categories of users: service accounts managed by Kubernetes, and normal users.

It is assumed that a cluster-independent service manages normal users in the following ways:

  • an administrator distributing private keys
  • a user store like Keystone or Google Accounts
  • a file with a list of usernames and passwords

In this regard, Kubernetes does not have objects which represent normal user accounts. Normal users cannot be added to a cluster through an API call.

Even though a normal user cannot be added via an API call, any user that presents a valid certificate signed by the cluster's certificate authority (CA) is considered authenticated. In this configuration, Kubernetes determines the username from the common name field in the 'subject' of the cert (e.g., "/CN=bob"). From there, the role based access control (RBAC) sub-system would determine whether the user is authorized to perform a specific operation on a resource. For more details, refer to the normal users topic in certificate request for more details about this.

In contrast, service accounts are users managed by the Kubernetes API. They are bound to specific namespaces, and created automatically by the API server or manually through API calls. Service accounts are tied to a set of credentials stored as Secrets, which are mounted into pods allowing in-cluster processes to talk to the Kubernetes API.

API requests are tied to either a normal user or a service account, or are treated as anonymous requests. This means every process inside or outside the cluster, from a human user typing kubectl on a workstation, to kubelets on nodes, to members of the control plane, must authenticate when making requests to the API server, or be treated as an anonymous user.

Authentication strategies

Kubernetes uses client certificates, bearer tokens, or an authenticating proxy to authenticate API requests through authentication plugins. As HTTP requests are made to the API server, plugins attempt to associate the following attributes with the request:

  • Username: a string which identifies the end user. Common values might be kube-admin or jane@example.com.
  • UID: a string which identifies the end user and attempts to be more consistent and unique than username.
  • Groups: a set of strings, each of which indicates the user's membership in a named logical collection of users. Common values might be system:masters or devops-team.
  • Extra fields: a map of strings to list of strings which holds additional information authorizers may find useful.

All values are opaque to the authentication system and only hold significance when interpreted by an authorizer.

You can enable multiple authentication methods at once. You should usually use at least two methods:

  • service account tokens for service accounts
  • at least one other method for user authentication.

When multiple authenticator modules are enabled, the first module to successfully authenticate the request short-circuits evaluation. The API server does not guarantee the order authenticators run in.

The system:authenticated group is included in the list of groups for all authenticated users.

Integrations with other authentication protocols (LDAP, SAML, Kerberos, alternate x509 schemes, etc) can be accomplished using an authenticating proxy or the authentication webhook.

X509 Client Certs

Client certificate authentication is enabled by passing the --client-ca-file=SOMEFILE option to API server. The referenced file must contain one or more certificate authorities to use to validate client certificates presented to the API server. If a client certificate is presented and verified, the common name of the subject is used as the user name for the request. As of Kubernetes 1.4, client certificates can also indicate a user's group memberships using the certificate's organization fields. To include multiple group memberships for a user, include multiple organization fields in the certificate.

For example, using the openssl command line tool to generate a certificate signing request:

openssl req -new -key jbeda.pem -out jbeda-csr.pem -subj "/CN=jbeda/O=app1/O=app2"

This would create a CSR for the username "jbeda", belonging to two groups, "app1" and "app2".

See Managing Certificates for how to generate a client cert.

Static Token File

The API server reads bearer tokens from a file when given the --token-auth-file=SOMEFILE option on the command line. Currently, tokens last indefinitely, and the token list cannot be changed without restarting the API server.

The token file is a csv file with a minimum of 3 columns: token, user name, user uid, followed by optional group names.

Putting a Bearer Token in a Request

When using bearer token authentication from an http client, the API server expects an Authorization header with a value of Bearer <token>. The bearer token must be a character sequence that can be put in an HTTP header value using no more than the encoding and quoting facilities of HTTP. For example: if the bearer token is 31ada4fd-adec-460c-809a-9e56ceb75269 then it would appear in an HTTP header as shown below.

Authorization: Bearer 31ada4fd-adec-460c-809a-9e56ceb75269

Bootstrap Tokens

FEATURE STATE: Kubernetes v1.18 [stable]

To allow for streamlined bootstrapping for new clusters, Kubernetes includes a dynamically-managed Bearer token type called a Bootstrap Token. These tokens are stored as Secrets in the kube-system namespace, where they can be dynamically managed and created. Controller Manager contains a TokenCleaner controller that deletes bootstrap tokens as they expire.

The tokens are of the form [a-z0-9]{6}.[a-z0-9]{16}. The first component is a Token ID and the second component is the Token Secret. You specify the token in an HTTP header as follows:

Authorization: Bearer 781292.db7bc3a58fc5f07e

You must enable the Bootstrap Token Authenticator with the --enable-bootstrap-token-auth flag on the API Server. You must enable the TokenCleaner controller via the --controllers flag on the Controller Manager. This is done with something like --controllers=*,tokencleaner. kubeadm will do this for you if you are using it to bootstrap a cluster.

The authenticator authenticates as system:bootstrap:<Token ID>. It is included in the system:bootstrappers group. The naming and groups are intentionally limited to discourage users from using these tokens past bootstrapping. The user names and group can be used (and are used by kubeadm) to craft the appropriate authorization policies to support bootstrapping a cluster.

Please see Bootstrap Tokens for in depth documentation on the Bootstrap Token authenticator and controllers along with how to manage these tokens with kubeadm.

Service Account Tokens

A service account is an automatically enabled authenticator that uses signed bearer tokens to verify requests. The plugin takes two optional flags:

  • --service-account-key-file File containing PEM-encoded x509 RSA or ECDSA private or public keys, used to verify ServiceAccount tokens. The specified file can contain multiple keys, and the flag can be specified multiple times with different files. If unspecified, --tls-private-key-file is used.
  • --service-account-lookup If enabled, tokens which are deleted from the API will be revoked.

Service accounts are usually created automatically by the API server and associated with pods running in the cluster through the ServiceAccount Admission Controller. Bearer tokens are mounted into pods at well-known locations, and allow in-cluster processes to talk to the API server. Accounts may be explicitly associated with pods using the serviceAccountName field of a PodSpec.

apiVersion: apps/v1 # this apiVersion is relevant as of Kubernetes 1.9
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: default
spec:
  replicas: 3
  template:
    metadata:
    # ...
    spec:
      serviceAccountName: bob-the-bot
      containers:
      - name: nginx
        image: nginx:1.14.2

Service account bearer tokens are perfectly valid to use outside the cluster and can be used to create identities for long standing jobs that wish to talk to the Kubernetes API. To manually create a service account, use the kubectl create serviceaccount (NAME) command. This creates a service account in the current namespace.

kubectl create serviceaccount jenkins
serviceaccount/jenkins created

Create an associated token:

kubectl create token jenkins
eyJhbGciOiJSUzI1NiIsImtp...

The created token is a signed JSON Web Token (JWT).

The signed JWT can be used as a bearer token to authenticate as the given service account. See above for how the token is included in a request. Normally these tokens are mounted into pods for in-cluster access to the API server, but can be used from outside the cluster as well.

Service accounts authenticate with the username system:serviceaccount:(NAMESPACE):(SERVICEACCOUNT), and are assigned to the groups system:serviceaccounts and system:serviceaccounts:(NAMESPACE).

OpenID Connect Tokens

OpenID Connect is a flavor of OAuth2 supported by some OAuth2 providers, notably Azure Active Directory, Salesforce, and Google. The protocol's main extension of OAuth2 is an additional field returned with the access token called an ID Token. This token is a JSON Web Token (JWT) with well known fields, such as a user's email, signed by the server.

To identify the user, the authenticator uses the id_token (not the access_token) from the OAuth2 token response as a bearer token. See above for how the token is included in a request.

sequenceDiagram participant user as User participant idp as Identity Provider participant kube as Kubectl participant api as API Server user ->> idp: 1. Login to IdP activate idp idp -->> user: 2. Provide access_token,
id_token, and refresh_token deactivate idp activate user user ->> kube: 3. Call Kubectl
with --token being the id_token
OR add tokens to .kube/config deactivate user activate kube kube ->> api: 4. Authorization: Bearer... deactivate kube activate api api ->> api: 5. Is JWT signature valid? api ->> api: 6. Has the JWT expired? (iat+exp) api ->> api: 7. User authorized? api -->> kube: 8. Authorized: Perform
action and return result deactivate api activate kube kube --x user: 9. Return result deactivate kube
  1. Login to your identity provider
  2. Your identity provider will provide you with an access_token, id_token and a refresh_token
  3. When using kubectl, use your id_token with the --token flag or add it directly to your kubeconfig
  4. kubectl sends your id_token in a header called Authorization to the API server
  5. The API server will make sure the JWT signature is valid by checking against the certificate named in the configuration
  6. Check to make sure the id_token hasn't expired
  7. Make sure the user is authorized
  8. Once authorized the API server returns a response to kubectl
  9. kubectl provides feedback to the user

Since all of the data needed to validate who you are is in the id_token, Kubernetes doesn't need to "phone home" to the identity provider. In a model where every request is stateless this provides a very scalable solution for authentication. It does offer a few challenges:

  1. Kubernetes has no "web interface" to trigger the authentication process. There is no browser or interface to collect credentials which is why you need to authenticate to your identity provider first.
  2. The id_token can't be revoked, it's like a certificate so it should be short-lived (only a few minutes) so it can be very annoying to have to get a new token every few minutes.
  3. To authenticate to the Kubernetes dashboard, you must use the kubectl proxy command or a reverse proxy that injects the id_token.

Configuring the API Server

To enable the plugin, configure the following flags on the API server:

ParameterDescriptionExampleRequired
--oidc-issuer-urlURL of the provider which allows the API server to discover public signing keys. Only URLs which use the https:// scheme are accepted. This is typically the provider's discovery URL without a path, for example "https://accounts.google.com" or "https://login.salesforce.com". This URL should point to the level below .well-known/openid-configurationIf the discovery URL is https://accounts.google.com/.well-known/openid-configuration, the value should be https://accounts.google.comYes
--oidc-client-idA client id that all tokens must be issued for.kubernetesYes
--oidc-username-claimJWT claim to use as the user name. By default sub, which is expected to be a unique identifier of the end user. Admins can choose other claims, such as email or name, depending on their provider. However, claims other than email will be prefixed with the issuer URL to prevent naming clashes with other plugins.subNo
--oidc-username-prefixPrefix prepended to username claims to prevent clashes with existing names (such as system: users). For example, the value oidc: will create usernames like oidc:jane.doe. If this flag isn't provided and --oidc-username-claim is a value other than email the prefix defaults to ( Issuer URL )# where ( Issuer URL ) is the value of --oidc-issuer-url. The value - can be used to disable all prefixing.oidc:No
--oidc-groups-claimJWT claim to use as the user's group. If the claim is present it must be an array of strings.groupsNo
--oidc-groups-prefixPrefix prepended to group claims to prevent clashes with existing names (such as system: groups). For example, the value oidc: will create group names like oidc:engineering and oidc:infra.oidc:No
--oidc-required-claimA key=value pair that describes a required claim in the ID Token. If set, the claim is verified to be present in the ID Token with a matching value. Repeat this flag to specify multiple claims.claim=valueNo
--oidc-ca-fileThe path to the certificate for the CA that signed your identity provider's web certificate. Defaults to the host's root CAs./etc/kubernetes/ssl/kc-ca.pemNo
--oidc-signing-algsThe signing algorithms accepted. Default is "RS256".RS512No

Importantly, the API server is not an OAuth2 client, rather it can only be configured to trust a single issuer. This allows the use of public providers, such as Google, without trusting credentials issued to third parties. Admins who wish to utilize multiple OAuth clients should explore providers which support the azp (authorized party) claim, a mechanism for allowing one client to issue tokens on behalf of another.

Kubernetes does not provide an OpenID Connect Identity Provider. You can use an existing public OpenID Connect Identity Provider (such as Google, or others). Or, you can run your own Identity Provider, such as dex, Keycloak, CloudFoundry UAA, or Tremolo Security's OpenUnison.

For an identity provider to work with Kubernetes it must:

  1. Support OpenID connect discovery; not all do.
  2. Run in TLS with non-obsolete ciphers
  3. Have a CA signed certificate (even if the CA is not a commercial CA or is self signed)

A note about requirement #3 above, requiring a CA signed certificate. If you deploy your own identity provider (as opposed to one of the cloud providers like Google or Microsoft) you MUST have your identity provider's web server certificate signed by a certificate with the CA flag set to TRUE, even if it is self signed. This is due to GoLang's TLS client implementation being very strict to the standards around certificate validation. If you don't have a CA handy, you can use this script from the Dex team to create a simple CA and a signed certificate and key pair. Or you can use this similar script that generates SHA256 certs with a longer life and larger key size.

Setup instructions for specific systems:

Using kubectl

Option 1 - OIDC Authenticator

The first option is to use the kubectl oidc authenticator, which sets the id_token as a bearer token for all requests and refreshes the token once it expires. After you've logged into your provider, use kubectl to add your id_token, refresh_token, client_id, and client_secret to configure the plugin.

Providers that don't return an id_token as part of their refresh token response aren't supported by this plugin and should use "Option 2" below.

kubectl config set-credentials USER_NAME \
   --auth-provider=oidc \
   --auth-provider-arg=idp-issuer-url=( issuer url ) \
   --auth-provider-arg=client-id=( your client id ) \
   --auth-provider-arg=client-secret=( your client secret ) \
   --auth-provider-arg=refresh-token=( your refresh token ) \
   --auth-provider-arg=idp-certificate-authority=( path to your ca certificate ) \
   --auth-provider-arg=id-token=( your id_token )

As an example, running the below command after authenticating to your identity provider:

kubectl config set-credentials mmosley  \
        --auth-provider=oidc  \
        --auth-provider-arg=idp-issuer-url=https://oidcidp.tremolo.lan:8443/auth/idp/OidcIdP  \
        --auth-provider-arg=client-id=kubernetes  \
        --auth-provider-arg=client-secret=1db158f6-177d-4d9c-8a8b-d36869918ec5  \
        --auth-provider-arg=refresh-token=q1bKLFOyUiosTfawzA93TzZIDzH2TNa2SMm0zEiPKTUwME6BkEo6Sql5yUWVBSWpKUGphaWpxSVAfekBOZbBhaEW+VlFUeVRGcluyVF5JT4+haZmPsluFoFu5XkpXk5BXqHega4GAXlF+ma+vmYpFcHe5eZR+slBFpZKtQA= \
        --auth-provider-arg=idp-certificate-authority=/root/ca.pem \
        --auth-provider-arg=id-token=eyJraWQiOiJDTj1vaWRjaWRwLnRyZW1vbG8ubGFuLCBPVT1EZW1vLCBPPVRybWVvbG8gU2VjdXJpdHksIEw9QXJsaW5ndG9uLCBTVD1WaXJnaW5pYSwgQz1VUy1DTj1rdWJlLWNhLTEyMDIxNDc5MjEwMzYwNzMyMTUyIiwiYWxnIjoiUlMyNTYifQ.eyJpc3MiOiJodHRwczovL29pZGNpZHAudHJlbW9sby5sYW46ODQ0My9hdXRoL2lkcC9PaWRjSWRQIiwiYXVkIjoia3ViZXJuZXRlcyIsImV4cCI6MTQ4MzU0OTUxMSwianRpIjoiMm96US15TXdFcHV4WDlHZUhQdy1hZyIsImlhdCI6MTQ4MzU0OTQ1MSwibmJmIjoxNDgzNTQ5MzMxLCJzdWIiOiI0YWViMzdiYS1iNjQ1LTQ4ZmQtYWIzMC0xYTAxZWU0MWUyMTgifQ.w6p4J_6qQ1HzTG9nrEOrubxIMb9K5hzcMPxc9IxPx2K4xO9l-oFiUw93daH3m5pluP6K7eOE6txBuRVfEcpJSwlelsOsW8gb8VJcnzMS9EnZpeA0tW_p-mnkFc3VcfyXuhe5R3G7aa5d8uHv70yJ9Y3-UhjiN9EhpMdfPAoEB9fYKKkJRzF7utTTIPGrSaSU6d2pcpfYKaxIwePzEkT4DfcQthoZdy9ucNvvLoi1DIC-UocFD8HLs8LYKEqSxQvOcvnThbObJ9af71EwmuE21fO5KzMW20KtAeget1gnldOosPtz1G5EwvaQ401-RPQzPGMVBld0_zMCAwZttJ4knw

Which would produce the below configuration:

users:
- name: mmosley
  user:
    auth-provider:
      config:
        client-id: kubernetes
        client-secret: 1db158f6-177d-4d9c-8a8b-d36869918ec5
        id-token: eyJraWQiOiJDTj1vaWRjaWRwLnRyZW1vbG8ubGFuLCBPVT1EZW1vLCBPPVRybWVvbG8gU2VjdXJpdHksIEw9QXJsaW5ndG9uLCBTVD1WaXJnaW5pYSwgQz1VUy1DTj1rdWJlLWNhLTEyMDIxNDc5MjEwMzYwNzMyMTUyIiwiYWxnIjoiUlMyNTYifQ.eyJpc3MiOiJodHRwczovL29pZGNpZHAudHJlbW9sby5sYW46ODQ0My9hdXRoL2lkcC9PaWRjSWRQIiwiYXVkIjoia3ViZXJuZXRlcyIsImV4cCI6MTQ4MzU0OTUxMSwianRpIjoiMm96US15TXdFcHV4WDlHZUhQdy1hZyIsImlhdCI6MTQ4MzU0OTQ1MSwibmJmIjoxNDgzNTQ5MzMxLCJzdWIiOiI0YWViMzdiYS1iNjQ1LTQ4ZmQtYWIzMC0xYTAxZWU0MWUyMTgifQ.w6p4J_6qQ1HzTG9nrEOrubxIMb9K5hzcMPxc9IxPx2K4xO9l-oFiUw93daH3m5pluP6K7eOE6txBuRVfEcpJSwlelsOsW8gb8VJcnzMS9EnZpeA0tW_p-mnkFc3VcfyXuhe5R3G7aa5d8uHv70yJ9Y3-UhjiN9EhpMdfPAoEB9fYKKkJRzF7utTTIPGrSaSU6d2pcpfYKaxIwePzEkT4DfcQthoZdy9ucNvvLoi1DIC-UocFD8HLs8LYKEqSxQvOcvnThbObJ9af71EwmuE21fO5KzMW20KtAeget1gnldOosPtz1G5EwvaQ401-RPQzPGMVBld0_zMCAwZttJ4knw
        idp-certificate-authority: /root/ca.pem
        idp-issuer-url: https://oidcidp.tremolo.lan:8443/auth/idp/OidcIdP
        refresh-token: q1bKLFOyUiosTfawzA93TzZIDzH2TNa2SMm0zEiPKTUwME6BkEo6Sql5yUWVBSWpKUGphaWpxSVAfekBOZbBhaEW+VlFUeVRGcluyVF5JT4+haZmPsluFoFu5XkpXk5BXq
      name: oidc

Once your id_token expires, kubectl will attempt to refresh your id_token using your refresh_token and client_secret storing the new values for the refresh_token and id_token in your .kube/config.

Option 2 - Use the --token Option

The kubectl command lets you pass in a token using the --token option. Copy and paste the id_token into this option:

kubectl --token=eyJhbGciOiJSUzI1NiJ9.eyJpc3MiOiJodHRwczovL21sYi50cmVtb2xvLmxhbjo4MDQzL2F1dGgvaWRwL29pZGMiLCJhdWQiOiJrdWJlcm5ldGVzIiwiZXhwIjoxNDc0NTk2NjY5LCJqdGkiOiI2RDUzNXoxUEpFNjJOR3QxaWVyYm9RIiwiaWF0IjoxNDc0NTk2MzY5LCJuYmYiOjE0NzQ1OTYyNDksInN1YiI6Im13aW5kdSIsInVzZXJfcm9sZSI6WyJ1c2VycyIsIm5ldy1uYW1lc3BhY2Utdmlld2VyIl0sImVtYWlsIjoibXdpbmR1QG5vbW9yZWplZGkuY29tIn0.f2As579n9VNoaKzoF-dOQGmXkFKf1FMyNV0-va_B63jn-_n9LGSCca_6IVMP8pO-Zb4KvRqGyTP0r3HkHxYy5c81AnIh8ijarruczl-TK_yF5akjSTHFZD-0gRzlevBDiH8Q79NAr-ky0P4iIXS8lY9Vnjch5MF74Zx0c3alKJHJUnnpjIACByfF2SCaYzbWFMUNat-K1PaUk5-ujMBG7yYnr95xD-63n8CO8teGUAAEMx6zRjzfhnhbzX-ajwZLGwGUBT4WqjMs70-6a7_8gZmLZb2az1cZynkFRj2BaCkVT3A2RrjeEwZEtGXlMqKJ1_I2ulrOVsYx01_yD35-rw get nodes

Webhook Token Authentication

Webhook authentication is a hook for verifying bearer tokens.

  • --authentication-token-webhook-config-file a configuration file describing how to access the remote webhook service.
  • --authentication-token-webhook-cache-ttl how long to cache authentication decisions. Defaults to two minutes.
  • --authentication-token-webhook-version determines whether to use authentication.k8s.io/v1beta1 or authentication.k8s.io/v1 TokenReview objects to send/receive information from the webhook. Defaults to v1beta1.

The configuration file uses the kubeconfig file format. Within the file, clusters refers to the remote service and users refers to the API server webhook. An example would be:

# Kubernetes API version
apiVersion: v1
# kind of the API object
kind: Config
# clusters refers to the remote service.
clusters:
  - name: name-of-remote-authn-service
    cluster:
      certificate-authority: /path/to/ca.pem         # CA for verifying the remote service.
      server: https://authn.example.com/authenticate # URL of remote service to query. 'https' recommended for production.

# users refers to the API server's webhook configuration.
users:
  - name: name-of-api-server
    user:
      client-certificate: /path/to/cert.pem # cert for the webhook plugin to use
      client-key: /path/to/key.pem          # key matching the cert

# kubeconfig files require a context. Provide one for the API server.
current-context: webhook
contexts:
- context:
    cluster: name-of-remote-authn-service
    user: name-of-api-server
  name: webhook

When a client attempts to authenticate with the API server using a bearer token as discussed above, the authentication webhook POSTs a JSON-serialized TokenReview object containing the token to the remote service.

Note that webhook API objects are subject to the same versioning compatibility rules as other Kubernetes API objects. Implementers should check the apiVersion field of the request to ensure correct deserialization, and must respond with a TokenReview object of the same version as the request.

{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "spec": {
    # Opaque bearer token sent to the API server
    "token": "014fbff9a07c...",
   
    # Optional list of the audience identifiers for the server the token was presented to.
    # Audience-aware token authenticators (for example, OIDC token authenticators) 
    # should verify the token was intended for at least one of the audiences in this list,
    # and return the intersection of this list and the valid audiences for the token in the response status.
    # This ensures the token is valid to authenticate to the server it was presented to.
    # If no audiences are provided, the token should be validated to authenticate to the Kubernetes API server.
    "audiences": ["https://myserver.example.com", "https://myserver.internal.example.com"]
  }
}

{
  "apiVersion": "authentication.k8s.io/v1beta1",
  "kind": "TokenReview",
  "spec": {
    # Opaque bearer token sent to the API server
    "token": "014fbff9a07c...",
   
    # Optional list of the audience identifiers for the server the token was presented to.
    # Audience-aware token authenticators (for example, OIDC token authenticators) 
    # should verify the token was intended for at least one of the audiences in this list,
    # and return the intersection of this list and the valid audiences for the token in the response status.
    # This ensures the token is valid to authenticate to the server it was presented to.
    # If no audiences are provided, the token should be validated to authenticate to the Kubernetes API server.
    "audiences": ["https://myserver.example.com", "https://myserver.internal.example.com"]
  }
}

The remote service is expected to fill the status field of the request to indicate the success of the login. The response body's spec field is ignored and may be omitted. The remote service must return a response using the same TokenReview API version that it received. A successful validation of the bearer token would return:

{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "status": {
    "authenticated": true,
    "user": {
      # Required
      "username": "janedoe@example.com",
      # Optional
      "uid": "42",
      # Optional group memberships
      "groups": ["developers", "qa"],
      # Optional additional information provided by the authenticator.
      # This should not contain confidential data, as it can be recorded in logs
      # or API objects, and is made available to admission webhooks.
      "extra": {
        "extrafield1": [
          "extravalue1",
          "extravalue2"
        ]
      }
    },
    # Optional list audience-aware token authenticators can return,
    # containing the audiences from the `spec.audiences` list for which the provided token was valid.
    # If this is omitted, the token is considered to be valid to authenticate to the Kubernetes API server.
    "audiences": ["https://myserver.example.com"]
  }
}

{
  "apiVersion": "authentication.k8s.io/v1beta1",
  "kind": "TokenReview",
  "status": {
    "authenticated": true,
    "user": {
      # Required
      "username": "janedoe@example.com",
      # Optional
      "uid": "42",
      # Optional group memberships
      "groups": ["developers", "qa"],
      # Optional additional information provided by the authenticator.
      # This should not contain confidential data, as it can be recorded in logs
      # or API objects, and is made available to admission webhooks.
      "extra": {
        "extrafield1": [
          "extravalue1",
          "extravalue2"
        ]
      }
    },
    # Optional list audience-aware token authenticators can return,
    # containing the audiences from the `spec.audiences` list for which the provided token was valid.
    # If this is omitted, the token is considered to be valid to authenticate to the Kubernetes API server.
    "audiences": ["https://myserver.example.com"]
  }
}

An unsuccessful request would return:

{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "status": {
    "authenticated": false,
    # Optionally include details about why authentication failed.
    # If no error is provided, the API will return a generic Unauthorized message.
    # The error field is ignored when authenticated=true.
    "error": "Credentials are expired"
  }
}

{
  "apiVersion": "authentication.k8s.io/v1beta1",
  "kind": "TokenReview",
  "status": {
    "authenticated": false,
    # Optionally include details about why authentication failed.
    # If no error is provided, the API will return a generic Unauthorized message.
    # The error field is ignored when authenticated=true.
    "error": "Credentials are expired"
  }
}

Authenticating Proxy

The API server can be configured to identify users from request header values, such as X-Remote-User. It is designed for use in combination with an authenticating proxy, which sets the request header value.

  • --requestheader-username-headers Required, case-insensitive. Header names to check, in order, for the user identity. The first header containing a value is used as the username.
  • --requestheader-group-headers 1.6+. Optional, case-insensitive. "X-Remote-Group" is suggested. Header names to check, in order, for the user's groups. All values in all specified headers are used as group names.
  • --requestheader-extra-headers-prefix 1.6+. Optional, case-insensitive. "X-Remote-Extra-" is suggested. Header prefixes to look for to determine extra information about the user (typically used by the configured authorization plugin). Any headers beginning with any of the specified prefixes have the prefix removed. The remainder of the header name is lowercased and percent-decoded and becomes the extra key, and the header value is the extra value.

For example, with this configuration:

--requestheader-username-headers=X-Remote-User
--requestheader-group-headers=X-Remote-Group
--requestheader-extra-headers-prefix=X-Remote-Extra-

this request:

GET / HTTP/1.1
X-Remote-User: fido
X-Remote-Group: dogs
X-Remote-Group: dachshunds
X-Remote-Extra-Acme.com%2Fproject: some-project
X-Remote-Extra-Scopes: openid
X-Remote-Extra-Scopes: profile

would result in this user info:

name: fido
groups:
- dogs
- dachshunds
extra:
  acme.com/project:
  - some-project
  scopes:
  - openid
  - profile

In order to prevent header spoofing, the authenticating proxy is required to present a valid client certificate to the API server for validation against the specified CA before the request headers are checked. WARNING: do not reuse a CA that is used in a different context unless you understand the risks and the mechanisms to protect the CA's usage.

  • --requestheader-client-ca-file Required. PEM-encoded certificate bundle. A valid client certificate must be presented and validated against the certificate authorities in the specified file before the request headers are checked for user names.
  • --requestheader-allowed-names Optional. List of Common Name values (CNs). If set, a valid client certificate with a CN in the specified list must be presented before the request headers are checked for user names. If empty, any CN is allowed.

Anonymous requests

When enabled, requests that are not rejected by other configured authentication methods are treated as anonymous requests, and given a username of system:anonymous and a group of system:unauthenticated.

For example, on a server with token authentication configured, and anonymous access enabled, a request providing an invalid bearer token would receive a 401 Unauthorized error. A request providing no bearer token would be treated as an anonymous request.

In 1.5.1-1.5.x, anonymous access is disabled by default, and can be enabled by passing the --anonymous-auth=true option to the API server.

In 1.6+, anonymous access is enabled by default if an authorization mode other than AlwaysAllow is used, and can be disabled by passing the --anonymous-auth=false option to the API server. Starting in 1.6, the ABAC and RBAC authorizers require explicit authorization of the system:anonymous user or the system:unauthenticated group, so legacy policy rules that grant access to the * user or * group do not include anonymous users.

User impersonation

A user can act as another user through impersonation headers. These let requests manually override the user info a request authenticates as. For example, an admin could use this feature to debug an authorization policy by temporarily impersonating another user and seeing if a request was denied.

Impersonation requests first authenticate as the requesting user, then switch to the impersonated user info.

  • A user makes an API call with their credentials and impersonation headers.
  • API server authenticates the user.
  • API server ensures the authenticated users have impersonation privileges.
  • Request user info is replaced with impersonation values.
  • Request is evaluated, authorization acts on impersonated user info.

The following HTTP headers can be used to performing an impersonation request:

  • Impersonate-User: The username to act as.
  • Impersonate-Group: A group name to act as. Can be provided multiple times to set multiple groups. Optional. Requires "Impersonate-User".
  • Impersonate-Extra-( extra name ): A dynamic header used to associate extra fields with the user. Optional. Requires "Impersonate-User". In order to be preserved consistently, ( extra name ) must be lower-case, and any characters which aren't legal in HTTP header labels MUST be utf8 and percent-encoded.
  • Impersonate-Uid: A unique identifier that represents the user being impersonated. Optional. Requires "Impersonate-User". Kubernetes does not impose any format requirements on this string.

An example of the impersonation headers used when impersonating a user with groups:

Impersonate-User: jane.doe@example.com
Impersonate-Group: developers
Impersonate-Group: admins

An example of the impersonation headers used when impersonating a user with a UID and extra fields:

Impersonate-User: jane.doe@example.com
Impersonate-Extra-dn: cn=jane,ou=engineers,dc=example,dc=com
Impersonate-Extra-acme.com%2Fproject: some-project
Impersonate-Extra-scopes: view
Impersonate-Extra-scopes: development
Impersonate-Uid: 06f6ce97-e2c5-4ab8-7ba5-7654dd08d52b

When using kubectl set the --as flag to configure the Impersonate-User header, set the --as-group flag to configure the Impersonate-Group header.

kubectl drain mynode
Error from server (Forbidden): User "clark" cannot get nodes at the cluster scope. (get nodes mynode)

Set the --as and --as-group flag:

kubectl drain mynode --as=superman --as-group=system:masters
node/mynode cordoned
node/mynode drained

To impersonate a user, group, user identifier (UID) or extra fields, the impersonating user must have the ability to perform the "impersonate" verb on the kind of attribute being impersonated ("user", "group", "uid", etc.). For clusters that enable the RBAC authorization plugin, the following ClusterRole encompasses the rules needed to set user and group impersonation headers:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: impersonator
rules:
- apiGroups: [""]
  resources: ["users", "groups", "serviceaccounts"]
  verbs: ["impersonate"]

For impersonation, extra fields and impersonated UIDs are both under the "authentication.k8s.io" apiGroup. Extra fields are evaluated as sub-resources of the resource "userextras". To allow a user to use impersonation headers for the extra field "scopes" and for UIDs, a user should be granted the following role:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: scopes-and-uid-impersonator
rules:
# Can set "Impersonate-Extra-scopes" header and the "Impersonate-Uid" header.
- apiGroups: ["authentication.k8s.io"]
  resources: ["userextras/scopes", "uids"]
  verbs: ["impersonate"]

The values of impersonation headers can also be restricted by limiting the set of resourceNames a resource can take.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: limited-impersonator
rules:
# Can impersonate the user "jane.doe@example.com"
- apiGroups: [""]
  resources: ["users"]
  verbs: ["impersonate"]
  resourceNames: ["jane.doe@example.com"]

# Can impersonate the groups "developers" and "admins"
- apiGroups: [""]
  resources: ["groups"]
  verbs: ["impersonate"]
  resourceNames: ["developers","admins"]

# Can impersonate the extras field "scopes" with the values "view" and "development"
- apiGroups: ["authentication.k8s.io"]
  resources: ["userextras/scopes"]
  verbs: ["impersonate"]
  resourceNames: ["view", "development"]

# Can impersonate the uid "06f6ce97-e2c5-4ab8-7ba5-7654dd08d52b"
- apiGroups: ["authentication.k8s.io"]
  resources: ["uids"]
  verbs: ["impersonate"]
  resourceNames: ["06f6ce97-e2c5-4ab8-7ba5-7654dd08d52b"]

client-go credential plugins

FEATURE STATE: Kubernetes v1.22 [stable]

k8s.io/client-go and tools using it such as kubectl and kubelet are able to execute an external command to receive user credentials.

This feature is intended for client side integrations with authentication protocols not natively supported by k8s.io/client-go (LDAP, Kerberos, OAuth2, SAML, etc.). The plugin implements the protocol specific logic, then returns opaque credentials to use. Almost all credential plugin use cases require a server side component with support for the webhook token authenticator to interpret the credential format produced by the client plugin.

Example use case

In a hypothetical use case, an organization would run an external service that exchanges LDAP credentials for user specific, signed tokens. The service would also be capable of responding to webhook token authenticator requests to validate the tokens. Users would be required to install a credential plugin on their workstation.

To authenticate against the API:

  • The user issues a kubectl command.
  • Credential plugin prompts the user for LDAP credentials, exchanges credentials with external service for a token.
  • Credential plugin returns token to client-go, which uses it as a bearer token against the API server.
  • API server uses the webhook token authenticator to submit a TokenReview to the external service.
  • External service verifies the signature on the token and returns the user's username and groups.

Configuration

Credential plugins are configured through kubectl config files as part of the user fields.

apiVersion: v1
kind: Config
users:
- name: my-user
  user:
    exec:
      # Command to execute. Required.
      command: "example-client-go-exec-plugin"

      # API version to use when decoding the ExecCredentials resource. Required.
      #
      # The API version returned by the plugin MUST match the version listed here.
      #
      # To integrate with tools that support multiple versions (such as client.authentication.k8s.io/v1beta1),
      # set an environment variable, pass an argument to the tool that indicates which version the exec plugin expects,
      # or read the version from the ExecCredential object in the KUBERNETES_EXEC_INFO environment variable.
      apiVersion: "client.authentication.k8s.io/v1"

      # Environment variables to set when executing the plugin. Optional.
      env:
      - name: "FOO"
        value: "bar"

      # Arguments to pass when executing the plugin. Optional.
      args:
      - "arg1"
      - "arg2"

      # Text shown to the user when the executable doesn't seem to be present. Optional.
      installHint: |
        example-client-go-exec-plugin is required to authenticate
        to the current cluster.  It can be installed:

        On macOS: brew install example-client-go-exec-plugin

        On Ubuntu: apt-get install example-client-go-exec-plugin

        On Fedora: dnf install example-client-go-exec-plugin

        ...        

      # Whether or not to provide cluster information, which could potentially contain
      # very large CA data, to this exec plugin as a part of the KUBERNETES_EXEC_INFO
      # environment variable.
      provideClusterInfo: true

      # The contract between the exec plugin and the standard input I/O stream. If the
      # contract cannot be satisfied, this plugin will not be run and an error will be
      # returned. Valid values are "Never" (this exec plugin never uses standard input),
      # "IfAvailable" (this exec plugin wants to use standard input if it is available),
      # or "Always" (this exec plugin requires standard input to function). Required.
      interactiveMode: Never
clusters:
- name: my-cluster
  cluster:
    server: "https://172.17.4.100:6443"
    certificate-authority: "/etc/kubernetes/ca.pem"
    extensions:
    - name: client.authentication.k8s.io/exec # reserved extension name for per cluster exec config
      extension:
        arbitrary: config
        this: can be provided via the KUBERNETES_EXEC_INFO environment variable upon setting provideClusterInfo
        you: ["can", "put", "anything", "here"]
contexts:
- name: my-cluster
  context:
    cluster: my-cluster
    user: my-user
current-context: my-cluster

apiVersion: v1
kind: Config
users:
- name: my-user
  user:
    exec:
      # Command to execute. Required.
      command: "example-client-go-exec-plugin"

      # API version to use when decoding the ExecCredentials resource. Required.
      #
      # The API version returned by the plugin MUST match the version listed here.
      #
      # To integrate with tools that support multiple versions (such as client.authentication.k8s.io/v1),
      # set an environment variable, pass an argument to the tool that indicates which version the exec plugin expects,
      # or read the version from the ExecCredential object in the KUBERNETES_EXEC_INFO environment variable.
      apiVersion: "client.authentication.k8s.io/v1beta1"

      # Environment variables to set when executing the plugin. Optional.
      env:
      - name: "FOO"
        value: "bar"

      # Arguments to pass when executing the plugin. Optional.
      args:
      - "arg1"
      - "arg2"

      # Text shown to the user when the executable doesn't seem to be present. Optional.
      installHint: |
        example-client-go-exec-plugin is required to authenticate
        to the current cluster.  It can be installed:

        On macOS: brew install example-client-go-exec-plugin

        On Ubuntu: apt-get install example-client-go-exec-plugin

        On Fedora: dnf install example-client-go-exec-plugin

        ...        

      # Whether or not to provide cluster information, which could potentially contain
      # very large CA data, to this exec plugin as a part of the KUBERNETES_EXEC_INFO
      # environment variable.
      provideClusterInfo: true

      # The contract between the exec plugin and the standard input I/O stream. If the
      # contract cannot be satisfied, this plugin will not be run and an error will be
      # returned. Valid values are "Never" (this exec plugin never uses standard input),
      # "IfAvailable" (this exec plugin wants to use standard input if it is available),
      # or "Always" (this exec plugin requires standard input to function). Optional.
      # Defaults to "IfAvailable".
      interactiveMode: Never
clusters:
- name: my-cluster
  cluster:
    server: "https://172.17.4.100:6443"
    certificate-authority: "/etc/kubernetes/ca.pem"
    extensions:
    - name: client.authentication.k8s.io/exec # reserved extension name for per cluster exec config
      extension:
        arbitrary: config
        this: can be provided via the KUBERNETES_EXEC_INFO environment variable upon setting provideClusterInfo
        you: ["can", "put", "anything", "here"]
contexts:
- name: my-cluster
  context:
    cluster: my-cluster
    user: my-user
current-context: my-cluster

Relative command paths are interpreted as relative to the directory of the config file. If KUBECONFIG is set to /home/jane/kubeconfig and the exec command is ./bin/example-client-go-exec-plugin, the binary /home/jane/bin/example-client-go-exec-plugin is executed.

- name: my-user
  user:
    exec:
      # Path relative to the directory of the kubeconfig
      command: "./bin/example-client-go-exec-plugin"
      apiVersion: "client.authentication.k8s.io/v1"
      interactiveMode: Never

Input and output formats

The executed command prints an ExecCredential object to stdout. k8s.io/client-go authenticates against the Kubernetes API using the returned credentials in the status. The executed command is passed an ExecCredential object as input via the KUBERNETES_EXEC_INFO environment variable. This input contains helpful information like the expected API version of the returned ExecCredential object and whether or not the plugin can use stdin to interact with the user.

When run from an interactive session (i.e., a terminal), stdin can be exposed directly to the plugin. Plugins should use the spec.interactive field of the input ExecCredential object from the KUBERNETES_EXEC_INFO environment variable in order to determine if stdin has been provided. A plugin's stdin requirements (i.e., whether stdin is optional, strictly required, or never used in order for the plugin to run successfully) is declared via the user.exec.interactiveMode field in the kubeconfig (see table below for valid values). The user.exec.interactiveMode field is optional in client.authentication.k8s.io/v1beta1 and required in client.authentication.k8s.io/v1.

interactiveMode values
interactiveMode ValueMeaning
NeverThis exec plugin never needs to use standard input, and therefore the exec plugin will be run regardless of whether standard input is available for user input.
IfAvailableThis exec plugin would like to use standard input if it is available, but can still operate if standard input is not available. Therefore, the exec plugin will be run regardless of whether stdin is available for user input. If standard input is available for user input, then it will be provided to this exec plugin.
AlwaysThis exec plugin requires standard input in order to run, and therefore the exec plugin will only be run if standard input is available for user input. If standard input is not available for user input, then the exec plugin will not be run and an error will be returned by the exec plugin runner.

To use bearer token credentials, the plugin returns a token in the status of the ExecCredential

{
  "apiVersion": "client.authentication.k8s.io/v1",
  "kind": "ExecCredential",
  "status": {
    "token": "my-bearer-token"
  }
}

{
  "apiVersion": "client.authentication.k8s.io/v1beta1",
  "kind": "ExecCredential",
  "status": {
    "token": "my-bearer-token"
  }
}

Alternatively, a PEM-encoded client certificate and key can be returned to use TLS client auth. If the plugin returns a different certificate and key on a subsequent call, k8s.io/client-go will close existing connections with the server to force a new TLS handshake.

If specified, clientKeyData and clientCertificateData must both must be present.

clientCertificateData may contain additional intermediate certificates to send to the server.

{
  "apiVersion": "client.authentication.k8s.io/v1",
  "kind": "ExecCredential",
  "status": {
    "clientCertificateData": "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----",
    "clientKeyData": "-----BEGIN RSA PRIVATE KEY-----\n...\n-----END RSA PRIVATE KEY-----"
  }
}

{
  "apiVersion": "client.authentication.k8s.io/v1beta1",
  "kind": "ExecCredential",
  "status": {
    "clientCertificateData": "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----",
    "clientKeyData": "-----BEGIN RSA PRIVATE KEY-----\n...\n-----END RSA PRIVATE KEY-----"
  }
}

Optionally, the response can include the expiry of the credential formatted as a RFC 3339 timestamp.

Presence or absence of an expiry has the following impact:

  • If an expiry is included, the bearer token and TLS credentials are cached until the expiry time is reached, or if the server responds with a 401 HTTP status code, or when the process exits.
  • If an expiry is omitted, the bearer token and TLS credentials are cached until the server responds with a 401 HTTP status code or until the process exits.

{
  "apiVersion": "client.authentication.k8s.io/v1",
  "kind": "ExecCredential",
  "status": {
    "token": "my-bearer-token",
    "expirationTimestamp": "2018-03-05T17:30:20-08:00"
  }
}

{
  "apiVersion": "client.authentication.k8s.io/v1beta1",
  "kind": "ExecCredential",
  "status": {
    "token": "my-bearer-token",
    "expirationTimestamp": "2018-03-05T17:30:20-08:00"
  }
}

To enable the exec plugin to obtain cluster-specific information, set provideClusterInfo on the user.exec field in the kubeconfig. The plugin will then be supplied this cluster-specific information in the KUBERNETES_EXEC_INFO environment variable. Information from this environment variable can be used to perform cluster-specific credential acquisition logic. The following ExecCredential manifest describes a cluster information sample.

{
  "apiVersion": "client.authentication.k8s.io/v1",
  "kind": "ExecCredential",
  "spec": {
    "cluster": {
      "server": "https://172.17.4.100:6443",
      "certificate-authority-data": "LS0t...",
      "config": {
        "arbitrary": "config",
        "this": "can be provided via the KUBERNETES_EXEC_INFO environment variable upon setting provideClusterInfo",
        "you": ["can", "put", "anything", "here"]
      }
    },
    "interactive": true
  }
}

{
  "apiVersion": "client.authentication.k8s.io/v1beta1",
  "kind": "ExecCredential",
  "spec": {
    "cluster": {
      "server": "https://172.17.4.100:6443",
      "certificate-authority-data": "LS0t...",
      "config": {
        "arbitrary": "config",
        "this": "can be provided via the KUBERNETES_EXEC_INFO environment variable upon setting provideClusterInfo",
        "you": ["can", "put", "anything", "here"]
      }
    },
    "interactive": true
  }
}

What's next

3.2 - Authenticating with Bootstrap Tokens

FEATURE STATE: Kubernetes v1.18 [stable]

Bootstrap tokens are a simple bearer token that is meant to be used when creating new clusters or joining new nodes to an existing cluster. It was built to support kubeadm, but can be used in other contexts for users that wish to start clusters without kubeadm. It is also built to work, via RBAC policy, with the Kubelet TLS Bootstrapping system.

Bootstrap Tokens Overview

Bootstrap Tokens are defined with a specific type (bootstrap.kubernetes.io/token) of secrets that lives in the kube-system namespace. These Secrets are then read by the Bootstrap Authenticator in the API Server. Expired tokens are removed with the TokenCleaner controller in the Controller Manager. The tokens are also used to create a signature for a specific ConfigMap used in a "discovery" process through a BootstrapSigner controller.

Token Format

Bootstrap Tokens take the form of abcdef.0123456789abcdef. More formally, they must match the regular expression [a-z0-9]{6}\.[a-z0-9]{16}.

The first part of the token is the "Token ID" and is considered public information. It is used when referring to a token without leaking the secret part used for authentication. The second part is the "Token Secret" and should only be shared with trusted parties.

Enabling Bootstrap Token Authentication

The Bootstrap Token authenticator can be enabled using the following flag on the API server:

--enable-bootstrap-token-auth

When enabled, bootstrapping tokens can be used as bearer token credentials to authenticate requests against the API server.

Authorization: Bearer 07401b.f395accd246ae52d

Tokens authenticate as the username system:bootstrap:<token id> and are members of the group system:bootstrappers. Additional groups may be specified in the token's Secret.

Expired tokens can be deleted automatically by enabling the tokencleaner controller on the controller manager.

--controllers=*,tokencleaner

Bootstrap Token Secret Format

Each valid token is backed by a secret in the kube-system namespace. You can find the full design doc here.

Here is what the secret looks like.

apiVersion: v1
kind: Secret
metadata:
  # Name MUST be of form "bootstrap-token-<token id>"
  name: bootstrap-token-07401b
  namespace: kube-system

# Type MUST be 'bootstrap.kubernetes.io/token'
type: bootstrap.kubernetes.io/token
stringData:
  # Human readable description. Optional.
  description: "The default bootstrap token generated by 'kubeadm init'."

  # Token ID and secret. Required.
  token-id: 07401b
  token-secret: f395accd246ae52d

  # Expiration. Optional.
  expiration: 2017-03-10T03:22:11Z

  # Allowed usages.
  usage-bootstrap-authentication: "true"
  usage-bootstrap-signing: "true"

  # Extra groups to authenticate the token as. Must start with "system:bootstrappers:"
  auth-extra-groups: system:bootstrappers:worker,system:bootstrappers:ingress

The type of the secret must be bootstrap.kubernetes.io/token and the name must be bootstrap-token-<token id>. It must also exist in the kube-system namespace.

The usage-bootstrap-* members indicate what this secret is intended to be used for. A value must be set to true to be enabled.

  • usage-bootstrap-authentication indicates that the token can be used to authenticate to the API server as a bearer token.
  • usage-bootstrap-signing indicates that the token may be used to sign the cluster-info ConfigMap as described below.

The expiration field controls the expiry of the token. Expired tokens are rejected when used for authentication and ignored during ConfigMap signing. The expiry value is encoded as an absolute UTC time using RFC3339. Enable the tokencleaner controller to automatically delete expired tokens.

Token Management with kubeadm

You can use the kubeadm tool to manage tokens on a running cluster. See the kubeadm token docs for details.

ConfigMap Signing

In addition to authentication, the tokens can be used to sign a ConfigMap. This is used early in a cluster bootstrap process before the client trusts the API server. The signed ConfigMap can be authenticated by the shared token.

Enable ConfigMap signing by enabling the bootstrapsigner controller on the Controller Manager.

--controllers=*,bootstrapsigner

The ConfigMap that is signed is cluster-info in the kube-public namespace. The typical flow is that a client reads this ConfigMap while unauthenticated and ignoring TLS errors. It then validates the payload of the ConfigMap by looking at a signature embedded in the ConfigMap.

The ConfigMap may look like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-info
  namespace: kube-public
data:
  jws-kubeconfig-07401b: eyJhbGciOiJIUzI1NiIsImtpZCI6IjA3NDAxYiJ9..tYEfbo6zDNo40MQE07aZcQX2m3EB2rO3NuXtxVMYm9U
  kubeconfig: |
    apiVersion: v1
    clusters:
    - cluster:
        certificate-authority-data: <really long certificate data>
        server: https://10.138.0.2:6443
      name: ""
    contexts: []
    current-context: ""
    kind: Config
    preferences: {}
    users: []    

The kubeconfig member of the ConfigMap is a config file with only the cluster information filled out. The key thing being communicated here is the certificate-authority-data. This may be expanded in the future.

The signature is a JWS signature using the "detached" mode. To validate the signature, the user should encode the kubeconfig payload according to JWS rules (base64 encoded while discarding any trailing =). That encoded payload is then used to form a whole JWS by inserting it between the 2 dots. You can verify the JWS using the HS256 scheme (HMAC-SHA256) with the full token (e.g. 07401b.f395accd246ae52d) as the shared secret. Users must verify that HS256 is used.

Consult the kubeadm implementation details section for more information.

3.3 - Certificate Signing Requests

FEATURE STATE: Kubernetes v1.19 [stable]

The Certificates API enables automation of X.509 credential provisioning by providing a programmatic interface for clients of the Kubernetes API to request and obtain X.509 certificates from a Certificate Authority (CA).

A CertificateSigningRequest (CSR) resource is used to request that a certificate be signed by a denoted signer, after which the request may be approved or denied before finally being signed.

Request signing process

The CertificateSigningRequest resource type allows a client to ask for an X.509 certificate be issued, based on a signing request. The CertificateSigningRequest object includes a PEM-encoded PKCS#10 signing request in the spec.request field. The CertificateSigningRequest denotes the signer (the recipient that the request is being made to) using the spec.signerName field. Note that spec.signerName is a required key after API version certificates.k8s.io/v1. In Kubernetes v1.22 and later, clients may optionally set the spec.expirationSeconds field to request a particular lifetime for the issued certificate. The minimum valid value for this field is 600, i.e. ten minutes.

Once created, a CertificateSigningRequest must be approved before it can be signed. Depending on the signer selected, a CertificateSigningRequest may be automatically approved by a controller. Otherwise, a CertificateSigningRequest must be manually approved either via the REST API (or client-go) or by running kubectl certificate approve. Likewise, a CertificateSigningRequest may also be denied, which tells the configured signer that it must not sign the request.

For certificates that have been approved, the next step is signing. The relevant signing controller first validates that the signing conditions are met and then creates a certificate. The signing controller then updates the CertificateSigningRequest, storing the new certificate into the status.certificate field of the existing CertificateSigningRequest object. The status.certificate field is either empty or contains a X.509 certificate, encoded in PEM format. The CertificateSigningRequest status.certificate field is empty until the signer does this.

Once the status.certificate field has been populated, the request has been completed and clients can now fetch the signed certificate PEM data from the CertificateSigningRequest resource. The signers can instead deny certificate signing if the approval conditions are not met.

In order to reduce the number of old CertificateSigningRequest resources left in a cluster, a garbage collection controller runs periodically. The garbage collection removes CertificateSigningRequests that have not changed state for some duration:

  • Approved requests: automatically deleted after 1 hour
  • Denied requests: automatically deleted after 1 hour
  • Failed requests: automatically deleted after 1 hour
  • Pending requests: automatically deleted after 24 hours
  • All requests: automatically deleted after the issued certificate has expired

Signers

Custom signerNames can also be specified. All signers should provide information about how they work so that clients can predict what will happen to their CSRs. This includes:

  1. Trust distribution: how trust (CA bundles) are distributed.
  2. Permitted subjects: any restrictions on and behavior when a disallowed subject is requested.
  3. Permitted x509 extensions: including IP subjectAltNames, DNS subjectAltNames, Email subjectAltNames, URI subjectAltNames etc, and behavior when a disallowed extension is requested.
  4. Permitted key usages / extended key usages: any restrictions on and behavior when usages different than the signer-determined usages are specified in the CSR.
  5. Expiration/certificate lifetime: whether it is fixed by the signer, configurable by the admin, determined by the CSR spec.expirationSeconds field, etc and the behavior when the signer-determined expiration is different from the CSR spec.expirationSeconds field.
  6. CA bit allowed/disallowed: and behavior if a CSR contains a request a for a CA certificate when the signer does not permit it.

Commonly, the status.certificate field contains a single PEM-encoded X.509 certificate once the CSR is approved and the certificate is issued. Some signers store multiple certificates into the status.certificate field. In that case, the documentation for the signer should specify the meaning of additional certificates; for example, this might be the certificate plus intermediates to be presented during TLS handshakes.

The PKCS#10 signing request format does not have a standard mechanism to specify a certificate expiration or lifetime. The expiration or lifetime therefore has to be set through the spec.expirationSeconds field of the CSR object. The built-in signers use the ClusterSigningDuration configuration option, which defaults to 1 year, (the --cluster-signing-duration command-line flag of the kube-controller-manager) as the default when no spec.expirationSeconds is specified. When spec.expirationSeconds is specified, the minimum of spec.expirationSeconds and ClusterSigningDuration is used.

Kubernetes signers

Kubernetes provides built-in signers that each have a well-known signerName:

  1. kubernetes.io/kube-apiserver-client: signs certificates that will be honored as client certificates by the API server. Never auto-approved by kube-controller-manager.

    1. Trust distribution: signed certificates must be honored as client certificates by the API server. The CA bundle is not distributed by any other means.
    2. Permitted subjects - no subject restrictions, but approvers and signers may choose not to approve or sign. Certain subjects like cluster-admin level users or groups vary between distributions and installations, but deserve additional scrutiny before approval and signing. The CertificateSubjectRestriction admission plugin is enabled by default to restrict system:masters, but it is often not the only cluster-admin subject in a cluster.
    3. Permitted x509 extensions - honors subjectAltName and key usage extensions and discards other extensions.
    4. Permitted key usages - must include ["client auth"]. Must not include key usages beyond ["digital signature", "key encipherment", "client auth"].
    5. Expiration/certificate lifetime - for the kube-controller-manager implementation of this signer, set to the minimum of the --cluster-signing-duration option or, if specified, the spec.expirationSeconds field of the CSR object.
    6. CA bit allowed/disallowed - not allowed.
  2. kubernetes.io/kube-apiserver-client-kubelet: signs client certificates that will be honored as client certificates by the API server. May be auto-approved by kube-controller-manager.

    1. Trust distribution: signed certificates must be honored as client certificates by the API server. The CA bundle is not distributed by any other means.
    2. Permitted subjects - organizations are exactly ["system:nodes"], common name starts with "system:node:".
    3. Permitted x509 extensions - honors key usage extensions, forbids subjectAltName extensions and drops other extensions.
    4. Permitted key usages - exactly ["key encipherment", "digital signature", "client auth"].
    5. Expiration/certificate lifetime - for the kube-controller-manager implementation of this signer, set to the minimum of the --cluster-signing-duration option or, if specified, the spec.expirationSeconds field of the CSR object.
    6. CA bit allowed/disallowed - not allowed.
  3. kubernetes.io/kubelet-serving: signs serving certificates that are honored as a valid kubelet serving certificate by the API server, but has no other guarantees. Never auto-approved by kube-controller-manager.

    1. Trust distribution: signed certificates must be honored by the API server as valid to terminate connections to a kubelet. The CA bundle is not distributed by any other means.
    2. Permitted subjects - organizations are exactly ["system:nodes"], common name starts with "system:node:".
    3. Permitted x509 extensions - honors key usage and DNSName/IPAddress subjectAltName extensions, forbids EmailAddress and URI subjectAltName extensions, drops other extensions. At least one DNS or IP subjectAltName must be present.
    4. Permitted key usages - exactly ["key encipherment", "digital signature", "server auth"].
    5. Expiration/certificate lifetime - for the kube-controller-manager implementation of this signer, set to the minimum of the --cluster-signing-duration option or, if specified, the spec.expirationSeconds field of the CSR object.
    6. CA bit allowed/disallowed - not allowed.
  4. kubernetes.io/legacy-unknown: has no guarantees for trust at all. Some third-party distributions of Kubernetes may honor client certificates signed by it. The stable CertificateSigningRequest API (version certificates.k8s.io/v1 and later) does not allow to set the signerName as kubernetes.io/legacy-unknown. Never auto-approved by kube-controller-manager.

    1. Trust distribution: None. There is no standard trust or distribution for this signer in a Kubernetes cluster.
    2. Permitted subjects - any
    3. Permitted x509 extensions - honors subjectAltName and key usage extensions and discards other extensions.
    4. Permitted key usages - any
    5. Expiration/certificate lifetime - for the kube-controller-manager implementation of this signer, set to the minimum of the --cluster-signing-duration option or, if specified, the spec.expirationSeconds field of the CSR object.
    6. CA bit allowed/disallowed - not allowed.

Distribution of trust happens out of band for these signers. Any trust outside of those described above are strictly coincidental. For instance, some distributions may honor kubernetes.io/legacy-unknown as client certificates for the kube-apiserver, but this is not a standard. None of these usages are related to ServiceAccount token secrets .data[ca.crt] in any way. That CA bundle is only guaranteed to verify a connection to the API server using the default service (kubernetes.default.svc).

Authorization

To allow creating a CertificateSigningRequest and retrieving any CertificateSigningRequest:

  • Verbs: create, get, list, watch, group: certificates.k8s.io, resource: certificatesigningrequests

For example:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: csr-creator
rules:
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - create
  - get
  - list
  - watch

To allow approving a CertificateSigningRequest:

  • Verbs: get, list, watch, group: certificates.k8s.io, resource: certificatesigningrequests
  • Verbs: update, group: certificates.k8s.io, resource: certificatesigningrequests/approval
  • Verbs: approve, group: certificates.k8s.io, resource: signers, resourceName: <signerNameDomain>/<signerNamePath> or <signerNameDomain>/*

For example:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: csr-approver
rules:
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests/approval
  verbs:
  - update
- apiGroups:
  - certificates.k8s.io
  resources:
  - signers
  resourceNames:
  - example.com/my-signer-name # example.com/* can be used to authorize for all signers in the 'example.com' domain
  verbs:
  - approve

To allow signing a CertificateSigningRequest:

  • Verbs: get, list, watch, group: certificates.k8s.io, resource: certificatesigningrequests
  • Verbs: update, group: certificates.k8s.io, resource: certificatesigningrequests/status
  • Verbs: sign, group: certificates.k8s.io, resource: signers, resourceName: <signerNameDomain>/<signerNamePath> or <signerNameDomain>/*
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: csr-signer
rules:
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests/status
  verbs:
  - update
- apiGroups:
  - certificates.k8s.io
  resources:
  - signers
  resourceNames:
  - example.com/my-signer-name # example.com/* can be used to authorize for all signers in the 'example.com' domain
  verbs:
  - sign

Normal user

A few steps are required in order to get a normal user to be able to authenticate and invoke an API. First, this user must have a certificate issued by the Kubernetes cluster, and then present that certificate to the Kubernetes API.

Create private key

The following scripts show how to generate PKI private key and CSR. It is important to set CN and O attribute of the CSR. CN is the name of the user and O is the group that this user will belong to. You can refer to RBAC for standard groups.

openssl genrsa -out myuser.key 2048
openssl req -new -key myuser.key -out myuser.csr

Create CertificateSigningRequest

Create a CertificateSigningRequest and submit it to a Kubernetes Cluster via kubectl. Below is a script to generate the CertificateSigningRequest.

cat <<EOF | kubectl apply -f -
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: myuser
spec:
  request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQ1ZqQ0NBVDRDQVFBd0VURVBNQTBHQTFVRUF3d0dZVzVuWld4aE1JSUJJakFOQmdrcWhraUc5dzBCQVFFRgpBQU9DQVE4QU1JSUJDZ0tDQVFFQTByczhJTHRHdTYxakx2dHhWTTJSVlRWMDNHWlJTWWw0dWluVWo4RElaWjBOCnR2MUZtRVFSd3VoaUZsOFEzcWl0Qm0wMUFSMkNJVXBGd2ZzSjZ4MXF3ckJzVkhZbGlBNVhwRVpZM3ExcGswSDQKM3Z3aGJlK1o2MVNrVHF5SVBYUUwrTWM5T1Nsbm0xb0R2N0NtSkZNMUlMRVI3QTVGZnZKOEdFRjJ6dHBoaUlFMwpub1dtdHNZb3JuT2wzc2lHQ2ZGZzR4Zmd4eW8ybmlneFNVekl1bXNnVm9PM2ttT0x1RVF6cXpkakJ3TFJXbWlECklmMXBMWnoyalVnald4UkhCM1gyWnVVV1d1T09PZnpXM01LaE8ybHEvZi9DdS8wYk83c0x0MCt3U2ZMSU91TFcKcW90blZtRmxMMytqTy82WDNDKzBERHk5aUtwbXJjVDBnWGZLemE1dHJRSURBUUFCb0FBd0RRWUpLb1pJaHZjTgpBUUVMQlFBRGdnRUJBR05WdmVIOGR4ZzNvK21VeVRkbmFjVmQ1N24zSkExdnZEU1JWREkyQTZ1eXN3ZFp1L1BVCkkwZXpZWFV0RVNnSk1IRmQycVVNMjNuNVJsSXJ3R0xuUXFISUh5VStWWHhsdnZsRnpNOVpEWllSTmU3QlJvYXgKQVlEdUI5STZXT3FYbkFvczFqRmxNUG5NbFpqdU5kSGxpT1BjTU1oNndLaTZzZFhpVStHYTJ2RUVLY01jSVUyRgpvU2djUWdMYTk0aEpacGk3ZnNMdm1OQUxoT045UHdNMGM1dVJVejV4T0dGMUtCbWRSeEgvbUNOS2JKYjFRQm1HCkkwYitEUEdaTktXTU0xMzhIQXdoV0tkNjVoVHdYOWl4V3ZHMkh4TG1WQzg0L1BHT0tWQW9FNkpsYWFHdTlQVmkKdjlOSjVaZlZrcXdCd0hKbzZXdk9xVlA3SVFjZmg3d0drWm89Ci0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=
  signerName: kubernetes.io/kube-apiserver-client
  expirationSeconds: 86400  # one day
  usages:
  - client auth
EOF

Some points to note:

  • usages has to be 'client auth'
  • expirationSeconds could be made longer (i.e. 864000 for ten days) or shorter (i.e. 3600 for one hour)
  • request is the base64 encoded value of the CSR file content. You can get the content using this command: cat myuser.csr | base64 | tr -d "\n"

Approve certificate signing request

Use kubectl to create a CSR and approve it.

Get the list of CSRs:

kubectl get csr

Approve the CSR:

kubectl certificate approve myuser

Get the certificate

Retrieve the certificate from the CSR:

kubectl get csr/myuser -o yaml

The certificate value is in Base64-encoded format under status.certificate.

Export the issued certificate from the CertificateSigningRequest.

kubectl get csr myuser -o jsonpath='{.status.certificate}'| base64 -d > myuser.crt

Create Role and RoleBinding

With the certificate created it is time to define the Role and RoleBinding for this user to access Kubernetes cluster resources.

This is a sample command to create a Role for this new user:

kubectl create role developer --verb=create --verb=get --verb=list --verb=update --verb=delete --resource=pods

This is a sample command to create a RoleBinding for this new user:

kubectl create rolebinding developer-binding-myuser --role=developer --user=myuser

Add to kubeconfig

The last step is to add this user into the kubeconfig file.

First, you need to add new credentials:

kubectl config set-credentials myuser --client-key=myuser.key --client-certificate=myuser.crt --embed-certs=true

Then, you need to add the context:

kubectl config set-context myuser --cluster=kubernetes --user=myuser

To test it, change the context to myuser:

kubectl config use-context myuser

Approval or rejection

Control plane automated approval

The kube-controller-manager ships with a built-in approver for certificates with a signerName of kubernetes.io/kube-apiserver-client-kubelet that delegates various permissions on CSRs for node credentials to authorization. The kube-controller-manager POSTs SubjectAccessReview resources to the API server in order to check authorization for certificate approval.

Approval or rejection using kubectl

A Kubernetes administrator (with appropriate permissions) can manually approve (or deny) CertificateSigningRequests by using the kubectl certificate approve and kubectl certificate deny commands.

To approve a CSR with kubectl:

kubectl certificate approve <certificate-signing-request-name>

Likewise, to deny a CSR:

kubectl certificate deny <certificate-signing-request-name>

Approval or rejection using the Kubernetes API

Users of the REST API can approve CSRs by submitting an UPDATE request to the approval subresource of the CSR to be approved. For example, you could write an operator that watches for a particular kind of CSR and then sends an UPDATE to approve them.

When you make an approval or rejection request, set either the Approved or Denied status condition based on the state you determine:

For Approved CSRs:

apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
...
status:
  conditions:
  - lastUpdateTime: "2020-02-08T11:37:35Z"
    lastTransitionTime: "2020-02-08T11:37:35Z"
    message: Approved by my custom approver controller
    reason: ApprovedByMyPolicy # You can set this to any string
    type: Approved

For Denied CSRs:

apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
...
status:
  conditions:
  - lastUpdateTime: "2020-02-08T11:37:35Z"
    lastTransitionTime: "2020-02-08T11:37:35Z"
    message: Denied by my custom approver controller
    reason: DeniedByMyPolicy # You can set this to any string
    type: Denied

It's usual to set status.conditions.reason to a machine-friendly reason code using TitleCase; this is a convention but you can set it to anything you like. If you want to add a note for human consumption, use the status.conditions.message field.

Signing

Control plane signer

The Kubernetes control plane implements each of the Kubernetes signers, as part of the kube-controller-manager.

API-based signers

Users of the REST API can sign CSRs by submitting an UPDATE request to the status subresource of the CSR to be signed.

As part of this request, the status.certificate field should be set to contain the signed certificate. This field contains one or more PEM-encoded certificates.

All PEM blocks must have the "CERTIFICATE" label, contain no headers, and the encoded data must be a BER-encoded ASN.1 Certificate structure as described in section 4 of RFC5280.

Example certificate content:

-----BEGIN CERTIFICATE-----
MIIDgjCCAmqgAwIBAgIUC1N1EJ4Qnsd322BhDPRwmg3b/oAwDQYJKoZIhvcNAQEL
BQAwXDELMAkGA1UEBhMCeHgxCjAIBgNVBAgMAXgxCjAIBgNVBAcMAXgxCjAIBgNV
BAoMAXgxCjAIBgNVBAsMAXgxCzAJBgNVBAMMAmNhMRAwDgYJKoZIhvcNAQkBFgF4
MB4XDTIwMDcwNjIyMDcwMFoXDTI1MDcwNTIyMDcwMFowNzEVMBMGA1UEChMMc3lz
dGVtOm5vZGVzMR4wHAYDVQQDExVzeXN0ZW06bm9kZToxMjcuMC4wLjEwggEiMA0G
CSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDne5X2eQ1JcLZkKvhzCR4Hxl9+ZmU3
+e1zfOywLdoQxrPi+o4hVsUH3q0y52BMa7u1yehHDRSaq9u62cmi5ekgXhXHzGmm
kmW5n0itRECv3SFsSm2DSghRKf0mm6iTYHWDHzUXKdm9lPPWoSOxoR5oqOsm3JEh
Q7Et13wrvTJqBMJo1GTwQuF+HYOku0NF/DLqbZIcpI08yQKyrBgYz2uO51/oNp8a
sTCsV4OUfyHhx2BBLUo4g4SptHFySTBwlpRWBnSjZPOhmN74JcpTLB4J5f4iEeA7
2QytZfADckG4wVkhH3C2EJUmRtFIBVirwDn39GXkSGlnvnMgF3uLZ6zNAgMBAAGj
YTBfMA4GA1UdDwEB/wQEAwIFoDATBgNVHSUEDDAKBggrBgEFBQcDAjAMBgNVHRMB
Af8EAjAAMB0GA1UdDgQWBBTREl2hW54lkQBDeVCcd2f2VSlB1DALBgNVHREEBDAC
ggAwDQYJKoZIhvcNAQELBQADggEBABpZjuIKTq8pCaX8dMEGPWtAykgLsTcD2jYr
L0/TCrqmuaaliUa42jQTt2OVsVP/L8ofFunj/KjpQU0bvKJPLMRKtmxbhXuQCQi1
qCRkp8o93mHvEz3mTUN+D1cfQ2fpsBENLnpS0F4G/JyY2Vrh19/X8+mImMEK5eOy
o0BMby7byUj98WmcUvNCiXbC6F45QTmkwEhMqWns0JZQY+/XeDhEcg+lJvz9Eyo2
aGgPsye1o3DpyXnyfJWAWMhOz7cikS5X2adesbgI86PhEHBXPIJ1v13ZdfCExmdd
M1fLPhLyR54fGaY+7/X8P9AZzPefAkwizeXwe9ii6/a08vWoiE4=
-----END CERTIFICATE-----

Non-PEM content may appear before or after the CERTIFICATE PEM blocks and is unvalidated, to allow for explanatory text as described in section 5.2 of RFC7468.

When encoded in JSON or YAML, this field is base-64 encoded. A CertificateSigningRequest containing the example certificate above would look like this:

apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
...
status:
  certificate: "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JS..."

What's next

3.4 - Admission Controllers Reference

This page provides an overview of Admission Controllers.

What are they?

An admission controller is a piece of code that intercepts requests to the Kubernetes API server prior to persistence of the object, but after the request is authenticated and authorized.

Admission controllers may be validating, mutating, or both. Mutating controllers may modify related objects to the requests they admit; validating controllers may not.

Admission controllers limit requests to create, delete, modify objects. Admission controllers can also block custom verbs, such as a request connect to a Pod via an API server proxy. Admission controllers do not (and cannot) block requests to read (get, watch or list) objects.

The admission controllers in Kubernetes 1.25 consist of the list below, are compiled into the kube-apiserver binary, and may only be configured by the cluster administrator. In that list, there are two special controllers: MutatingAdmissionWebhook and ValidatingAdmissionWebhook. These execute the mutating and validating (respectively) admission control webhooks which are configured in the API.

Admission control phases

The admission control process proceeds in two phases. In the first phase, mutating admission controllers are run. In the second phase, validating admission controllers are run. Note again that some of the controllers are both.

If any of the controllers in either phase reject the request, the entire request is rejected immediately and an error is returned to the end-user.

Finally, in addition to sometimes mutating the object in question, admission controllers may sometimes have side effects, that is, mutate related resources as part of request processing. Incrementing quota usage is the canonical example of why this is necessary. Any such side-effect needs a corresponding reclamation or reconciliation process, as a given admission controller does not know for sure that a given request will pass all of the other admission controllers.

Why do I need them?

Several important features of Kubernetes require an admission controller to be enabled in order to properly support the feature. As a result, a Kubernetes API server that is not properly configured with the right set of admission controllers is an incomplete server and will not support all the features you expect.

How do I turn on an admission controller?

The Kubernetes API server flag enable-admission-plugins takes a comma-delimited list of admission control plugins to invoke prior to modifying objects in the cluster. For example, the following command line enables the NamespaceLifecycle and the LimitRanger admission control plugins:

kube-apiserver --enable-admission-plugins=NamespaceLifecycle,LimitRanger ...

How do I turn off an admission controller?

The Kubernetes API server flag disable-admission-plugins takes a comma-delimited list of admission control plugins to be disabled, even if they are in the list of plugins enabled by default.

kube-apiserver --disable-admission-plugins=PodNodeSelector,AlwaysDeny ...

Which plugins are enabled by default?

To see which admission plugins are enabled:

kube-apiserver -h | grep enable-admission-plugins

In Kubernetes 1.25, the default ones are:

CertificateApproval, CertificateSigning, CertificateSubjectRestriction, DefaultIngressClass, DefaultStorageClass, DefaultTolerationSeconds, LimitRanger, MutatingAdmissionWebhook, NamespaceLifecycle, PersistentVolumeClaimResize, PodSecurity, Priority, ResourceQuota, RuntimeClass, ServiceAccount, StorageObjectInUseProtection, TaintNodesByCondition, ValidatingAdmissionWebhook

What does each admission controller do?

AlwaysAdmit

FEATURE STATE: Kubernetes v1.13 [deprecated]

This admission controller allows all pods into the cluster. It is deprecated because its behavior is the same as if there were no admission controller at all.

AlwaysDeny

FEATURE STATE: Kubernetes v1.13 [deprecated]

Rejects all requests. AlwaysDeny is deprecated as it has no real meaning.

AlwaysPullImages

This admission controller modifies every new Pod to force the image pull policy to Always. This is useful in a multitenant cluster so that users can be assured that their private images can only be used by those who have the credentials to pull them. Without this admission controller, once an image has been pulled to a node, any pod from any user can use it by knowing the image's name (assuming the Pod is scheduled onto the right node), without any authorization check against the image. When this admission controller is enabled, images are always pulled prior to starting containers, which means valid credentials are required.

CertificateApproval

This admission controller observes requests to approve CertificateSigningRequest resources and performs additional authorization checks to ensure the approving user has permission to approve certificate requests with the spec.signerName requested on the CertificateSigningRequest resource.

See Certificate Signing Requests for more information on the permissions required to perform different actions on CertificateSigningRequest resources.

CertificateSigning

This admission controller observes updates to the status.certificate field of CertificateSigningRequest resources and performs an additional authorization checks to ensure the signing user has permission to sign certificate requests with the spec.signerName requested on the CertificateSigningRequest resource.

See Certificate Signing Requests for more information on the permissions required to perform different actions on CertificateSigningRequest resources.

CertificateSubjectRestriction

This admission controller observes creation of CertificateSigningRequest resources that have a spec.signerName of kubernetes.io/kube-apiserver-client. It rejects any request that specifies a 'group' (or 'organization attribute') of system:masters.

DefaultIngressClass

This admission controller observes creation of Ingress objects that do not request any specific ingress class and automatically adds a default ingress class to them. This way, users that do not request any special ingress class do not need to care about them at all and they will get the default one.

This admission controller does not do anything when no default ingress class is configured. When more than one ingress class is marked as default, it rejects any creation of Ingress with an error and an administrator must revisit their IngressClass objects and mark only one as default (with the annotation "ingressclass.kubernetes.io/is-default-class"). This admission controller ignores any Ingress updates; it acts only on creation.

See the Ingress documentation for more about ingress classes and how to mark one as default.

DefaultStorageClass

This admission controller observes creation of PersistentVolumeClaim objects that do not request any specific storage class and automatically adds a default storage class to them. This way, users that do not request any special storage class do not need to care about them at all and they will get the default one.

This admission controller does not do anything when no default storage class is configured. When more than one storage class is marked as default, it rejects any creation of PersistentVolumeClaim with an error and an administrator must revisit their StorageClass objects and mark only one as default. This admission controller ignores any PersistentVolumeClaim updates; it acts only on creation.

See persistent volume documentation about persistent volume claims and storage classes and how to mark a storage class as default.

DefaultTolerationSeconds

This admission controller sets the default forgiveness toleration for pods to tolerate the taints notready:NoExecute and unreachable:NoExecute based on the k8s-apiserver input parameters default-not-ready-toleration-seconds and default-unreachable-toleration-seconds if the pods don't already have toleration for taints node.kubernetes.io/not-ready:NoExecute or node.kubernetes.io/unreachable:NoExecute. The default value for default-not-ready-toleration-seconds and default-unreachable-toleration-seconds is 5 minutes.

DenyServiceExternalIPs

This admission controller rejects all net-new usage of the Service field externalIPs. This feature is very powerful (allows network traffic interception) and not well controlled by policy. When enabled, users of the cluster may not create new Services which use externalIPs and may not add new values to externalIPs on existing Service objects. Existing uses of externalIPs are not affected, and users may remove values from externalIPs on existing Service objects.

Most users do not need this feature at all, and cluster admins should consider disabling it. Clusters that do need to use this feature should consider using some custom policy to manage usage of it.

This admission controller is disabled by default.

EventRateLimit

FEATURE STATE: Kubernetes v1.13 [alpha]

This admission controller mitigates the problem where the API server gets flooded by requests to store new Events. The cluster admin can specify event rate limits by:

  • Enabling the EventRateLimit admission controller;
  • Referencing an EventRateLimit configuration file from the file provided to the API server's command line flag --admission-control-config-file:
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: EventRateLimit
    path: eventconfig.yaml
...

There are four types of limits that can be specified in the configuration:

  • Server: All Event requests (creation or modifications) received by the API server share a single bucket.
  • Namespace: Each namespace has a dedicated bucket.
  • User: Each user is allocated a bucket.
  • SourceAndObject: A bucket is assigned by each combination of source and involved object of the event.

Below is a sample eventconfig.yaml for such a configuration:

apiVersion: eventratelimit.admission.k8s.io/v1alpha1
kind: Configuration
limits:
  - type: Namespace
    qps: 50
    burst: 100
    cacheSize: 2000
  - type: User
    qps: 10
    burst: 50

See the EventRateLimit Config API (v1alpha1) for more details.

This admission controller is disabled by default.

ExtendedResourceToleration

This plug-in facilitates creation of dedicated nodes with extended resources. If operators want to create dedicated nodes with extended resources (like GPUs, FPGAs etc.), they are expected to taint the node with the extended resource name as the key. This admission controller, if enabled, automatically adds tolerations for such taints to pods requesting extended resources, so users don't have to manually add these tolerations.

This admission controller is disabled by default.

ImagePolicyWebhook

The ImagePolicyWebhook admission controller allows a backend webhook to make admission decisions.

This admission controller is disabled by default.

Configuration file format

ImagePolicyWebhook uses a configuration file to set options for the behavior of the backend. This file may be json or yaml and has the following format:

imagePolicy:
  kubeConfigFile: /path/to/kubeconfig/for/backend
  # time in s to cache approval
  allowTTL: 50
  # time in s to cache denial
  denyTTL: 50
  # time in ms to wait between retries
  retryBackoff: 500
  # determines behavior if the webhook backend fails
  defaultAllow: true

Reference the ImagePolicyWebhook configuration file from the file provided to the API server's command line flag --admission-control-config-file:

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: ImagePolicyWebhook
    path: imagepolicyconfig.yaml
...

Alternatively, you can embed the configuration directly in the file:

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: ImagePolicyWebhook
    configuration:
      imagePolicy:
        kubeConfigFile: <path-to-kubeconfig-file>
        allowTTL: 50
        denyTTL: 50
        retryBackoff: 500
        defaultAllow: true

The ImagePolicyWebhook config file must reference a kubeconfig formatted file which sets up the connection to the backend. It is required that the backend communicate over TLS.

The kubeconfig file's cluster field must point to the remote service, and the user field must contain the returned authorizer.

# clusters refers to the remote service.
clusters:
  - name: name-of-remote-imagepolicy-service
    cluster:
      certificate-authority: /path/to/ca.pem    # CA for verifying the remote service.
      server: https://images.example.com/policy # URL of remote service to query. Must use 'https'.

# users refers to the API server's webhook configuration.
users:
  - name: name-of-api-server
    user:
      client-certificate: /path/to/cert.pem # cert for the webhook admission controller to use
      client-key: /path/to/key.pem          # key matching the cert

For additional HTTP configuration, refer to the kubeconfig documentation.

Request payloads

When faced with an admission decision, the API Server POSTs a JSON serialized imagepolicy.k8s.io/v1alpha1 ImageReview object describing the action. This object contains fields describing the containers being admitted, as well as any pod annotations that match *.image-policy.k8s.io/*.

An example request body:

{
  "apiVersion":"imagepolicy.k8s.io/v1alpha1",
  "kind":"ImageReview",
  "spec":{
    "containers":[
      {
        "image":"myrepo/myimage:v1"
      },
      {
        "image":"myrepo/myimage@sha256:beb6bd6a68f114c1dc2ea4b28db81bdf91de202a9014972bec5e4d9171d90ed"
      }
    ],
    "annotations":{
      "mycluster.image-policy.k8s.io/ticket-1234": "break-glass"
    },
    "namespace":"mynamespace"
  }
}

The remote service is expected to fill the status field of the request and respond to either allow or disallow access. The response body's spec field is ignored, and may be omitted. A permissive response would return:

{
  "apiVersion": "imagepolicy.k8s.io/v1alpha1",
  "kind": "ImageReview",
  "status": {
    "allowed": true
  }
}

To disallow access, the service would return:

{
  "apiVersion": "imagepolicy.k8s.io/v1alpha1",
  "kind": "ImageReview",
  "status": {
    "allowed": false,
    "reason": "image currently blacklisted"
  }
}

For further documentation refer to the imagepolicy.v1alpha1 API.

Extending with Annotations

All annotations on a Pod that match *.image-policy.k8s.io/* are sent to the webhook. Sending annotations allows users who are aware of the image policy backend to send extra information to it, and for different backends implementations to accept different information.

Examples of information you might put here are:

  • request to "break glass" to override a policy, in case of emergency.
  • a ticket number from a ticket system that documents the break-glass request
  • provide a hint to the policy server as to the imageID of the image being provided, to save it a lookup

In any case, the annotations are provided by the user and are not validated by Kubernetes in any way.

LimitPodHardAntiAffinityTopology

This admission controller denies any pod that defines AntiAffinity topology key other than kubernetes.io/hostname in requiredDuringSchedulingRequiredDuringExecution.

This admission controller is disabled by default.

LimitRanger

This admission controller will observe the incoming request and ensure that it does not violate any of the constraints enumerated in the LimitRange object in a Namespace. If you are using LimitRange objects in your Kubernetes deployment, you MUST use this admission controller to enforce those constraints. LimitRanger can also be used to apply default resource requests to Pods that don't specify any; currently, the default LimitRanger applies a 0.1 CPU requirement to all Pods in the default namespace.

See the LimitRange API reference and the example of LimitRange for more details.

MutatingAdmissionWebhook

This admission controller calls any mutating webhooks which match the request. Matching webhooks are called in serial; each one may modify the object if it desires.

This admission controller (as implied by the name) only runs in the mutating phase.

If a webhook called by this has side effects (for example, decrementing quota) it must have a reconciliation system, as it is not guaranteed that subsequent webhooks or validating admission controllers will permit the request to finish.

If you disable the MutatingAdmissionWebhook, you must also disable the MutatingWebhookConfiguration object in the admissionregistration.k8s.io/v1 group/version via the --runtime-config flag, both are on by default.

Use caution when authoring and installing mutating webhooks

  • Users may be confused when the objects they try to create are different from what they get back.
  • Built in control loops may break when the objects they try to create are different when read back.
    • Setting originally unset fields is less likely to cause problems than overwriting fields set in the original request. Avoid doing the latter.
  • Future changes to control loops for built-in resources or third-party resources may break webhooks that work well today. Even when the webhook installation API is finalized, not all possible webhook behaviors will be guaranteed to be supported indefinitely.

NamespaceAutoProvision

This admission controller examines all incoming requests on namespaced resources and checks if the referenced namespace does exist. It creates a namespace if it cannot be found. This admission controller is useful in deployments that do not want to restrict creation of a namespace prior to its usage.

NamespaceExists

This admission controller checks all requests on namespaced resources other than Namespace itself. If the namespace referenced from a request doesn't exist, the request is rejected.

NamespaceLifecycle

This admission controller enforces that a Namespace that is undergoing termination cannot have new objects created in it, and ensures that requests in a non-existent Namespace are rejected. This admission controller also prevents deletion of three system reserved namespaces default, kube-system, kube-public.

A Namespace deletion kicks off a sequence of operations that remove all objects (pods, services, etc.) in that namespace. In order to enforce integrity of that process, we strongly recommend running this admission controller.

NodeRestriction

This admission controller limits the Node and Pod objects a kubelet can modify. In order to be limited by this admission controller, kubelets must use credentials in the system:nodes group, with a username in the form system:node:<nodeName>. Such kubelets will only be allowed to modify their own Node API object, and only modify Pod API objects that are bound to their node. kubelets are not allowed to update or remove taints from their Node API object.

The NodeRestriction admission plugin prevents kubelets from deleting their Node API object, and enforces kubelet modification of labels under the kubernetes.io/ or k8s.io/ prefixes as follows:

  • Prevents kubelets from adding/removing/updating labels with a node-restriction.kubernetes.io/ prefix. This label prefix is reserved for administrators to label their Node objects for workload isolation purposes, and kubelets will not be allowed to modify labels with that prefix.
  • Allows kubelets to add/remove/update these labels and label prefixes:
    • kubernetes.io/hostname
    • kubernetes.io/arch
    • kubernetes.io/os
    • beta.kubernetes.io/instance-type
    • node.kubernetes.io/instance-type
    • failure-domain.beta.kubernetes.io/region (deprecated)
    • failure-domain.beta.kubernetes.io/zone (deprecated)
    • topology.kubernetes.io/region
    • topology.kubernetes.io/zone
    • kubelet.kubernetes.io/-prefixed labels
    • node.kubernetes.io/-prefixed labels

Use of any other labels under the kubernetes.io or k8s.io prefixes by kubelets is reserved, and may be disallowed or allowed by the NodeRestriction admission plugin in the future.

Future versions may add additional restrictions to ensure kubelets have the minimal set of permissions required to operate correctly.

OwnerReferencesPermissionEnforcement

This admission controller protects the access to the metadata.ownerReferences of an object so that only users with delete permission to the object can change it. This admission controller also protects the access to metadata.ownerReferences[x].blockOwnerDeletion of an object, so that only users with update permission to the finalizers subresource of the referenced owner can change it.

PersistentVolumeClaimResize

FEATURE STATE: Kubernetes v1.24 [stable]

This admission controller implements additional validations for checking incoming PersistentVolumeClaim resize requests.

Enabling the PersistentVolumeClaimResize admission controller is recommended. This admission controller prevents resizing of all claims by default unless a claim's StorageClass explicitly enables resizing by setting allowVolumeExpansion to true.

For example: all PersistentVolumeClaims created from the following StorageClass support volume expansion:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gluster-vol-default
provisioner: kubernetes.io/glusterfs
parameters:
  resturl: "http://192.168.10.100:8080"
  restuser: ""
  secretNamespace: ""
  secretName: ""
allowVolumeExpansion: true

For more information about persistent volume claims, see PersistentVolumeClaims.

PersistentVolumeLabel

FEATURE STATE: Kubernetes v1.13 [deprecated]

This admission controller automatically attaches region or zone labels to PersistentVolumes as defined by the cloud provider (for example, Azure or GCP). It helps ensure the Pods and the PersistentVolumes mounted are in the same region and/or zone. If the admission controller doesn't support automatic labelling your PersistentVolumes, you may need to add the labels manually to prevent pods from mounting volumes from a different zone. PersistentVolumeLabel is deprecated as labeling for persistent volumes has been taken over by the cloud-controller-manager.

This admission controller is disabled by default.

PodNodeSelector

FEATURE STATE: Kubernetes v1.5 [alpha]

This admission controller defaults and limits what node selectors may be used within a namespace by reading a namespace annotation and a global configuration.

This admission controller is disabled by default.

Configuration file format

PodNodeSelector uses a configuration file to set options for the behavior of the backend. Note that the configuration file format will move to a versioned file in a future release. This file may be json or yaml and has the following format:

podNodeSelectorPluginConfig:
 clusterDefaultNodeSelector: name-of-node-selector
 namespace1: name-of-node-selector
 namespace2: name-of-node-selector

Reference the PodNodeSelector configuration file from the file provided to the API server's command line flag --admission-control-config-file:

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodNodeSelector
  path: podnodeselector.yaml
...

Configuration Annotation Format

PodNodeSelector uses the annotation key scheduler.alpha.kubernetes.io/node-selector to assign node selectors to namespaces.

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/node-selector: name-of-node-selector
  name: namespace3

Internal Behavior

This admission controller has the following behavior:

  1. If the Namespace has an annotation with a key scheduler.alpha.kubernetes.io/node-selector, use its value as the node selector.
  2. If the namespace lacks such an annotation, use the clusterDefaultNodeSelector defined in the PodNodeSelector plugin configuration file as the node selector.
  3. Evaluate the pod's node selector against the namespace node selector for conflicts. Conflicts result in rejection.
  4. Evaluate the pod's node selector against the namespace-specific allowed selector defined the plugin configuration file. Conflicts result in rejection.

PodSecurity

FEATURE STATE: Kubernetes v1.25 [stable]

This is the replacement for the deprecated PodSecurityPolicy admission controller defined in the next section. This admission controller acts on creation and modification of the pod and determines if it should be admitted based on the requested security context and the Pod Security Standards.

See the Pod Security Admission documentation for more information.

PodSecurityPolicy

FEATURE STATE: Kubernetes v1.21 [deprecated]

This admission controller acts on creation and modification of the pod and determines if it should be admitted based on the requested security context and the available Pod Security Policies.

See also the PodSecurityPolicy documentation for more information.

PodTolerationRestriction

FEATURE STATE: Kubernetes v1.7 [alpha]

The PodTolerationRestriction admission controller verifies any conflict between tolerations of a pod and the tolerations of its namespace. It rejects the pod request if there is a conflict. It then merges the tolerations annotated on the namespace into the tolerations of the pod. The resulting tolerations are checked against a list of allowed tolerations annotated to the namespace. If the check succeeds, the pod request is admitted otherwise it is rejected.

If the namespace of the pod does not have any associated default tolerations or allowed tolerations annotated, the cluster-level default tolerations or cluster-level list of allowed tolerations are used instead if they are specified.

Tolerations to a namespace are assigned via the scheduler.alpha.kubernetes.io/defaultTolerations annotation key. The list of allowed tolerations can be added via the scheduler.alpha.kubernetes.io/tolerationsWhitelist annotation key.

Example for namespace annotations:

apiVersion: v1
kind: Namespace
metadata:
  name: apps-that-need-nodes-exclusively
  annotations:
    scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": "NoSchedule", "key": "dedicated-node"}]'
    scheduler.alpha.kubernetes.io/tolerationsWhitelist: '[{"operator": "Exists", "effect": "NoSchedule", "key": "dedicated-node"}]'

This admission controller is disabled by default.

Priority

The priority admission controller uses the priorityClassName field and populates the integer value of the priority. If the priority class is not found, the Pod is rejected.

ResourceQuota

This admission controller will observe the incoming request and ensure that it does not violate any of the constraints enumerated in the ResourceQuota object in a Namespace. If you are using ResourceQuota objects in your Kubernetes deployment, you MUST use this admission controller to enforce quota constraints.

See the ResourceQuota API reference and the example of Resource Quota for more details.

RuntimeClass

If you define a RuntimeClass with Pod overhead configured, this admission controller checks incoming Pods. When enabled, this admission controller rejects any Pod create requests that have the overhead already set. For Pods that have a RuntimeClass configured and selected in their .spec, this admission controller sets .spec.overhead in the Pod based on the value defined in the corresponding RuntimeClass.

See also Pod Overhead for more information.

SecurityContextDeny

This admission controller will deny any Pod that attempts to set certain escalating SecurityContext fields, as shown in the Configure a Security Context for a Pod or Container task. If you don't use Pod Security admission, PodSecurityPolicies, nor any external enforcement mechanism, then you could use this admission controller to restrict the set of values a security context can take.

See Pod Security Standards for more context on restricting pod privileges.

ServiceAccount

This admission controller implements automation for serviceAccounts. The Kubernetes project strongly recommends enabling this admission controller. You should enable this admission controller if you intend to make any use of Kubernetes ServiceAccount objects.

StorageObjectInUseProtection

The StorageObjectInUseProtection plugin adds the kubernetes.io/pvc-protection or kubernetes.io/pv-protection finalizers to newly created Persistent Volume Claims (PVCs) or Persistent Volumes (PV). In case a user deletes a PVC or PV the PVC or PV is not removed until the finalizer is removed from the PVC or PV by PVC or PV Protection Controller. Refer to the Storage Object in Use Protection for more detailed information.

TaintNodesByCondition

This admission controller taints newly created Nodes as NotReady and NoSchedule. That tainting avoids a race condition that could cause Pods to be scheduled on new Nodes before their taints were updated to accurately reflect their reported conditions.

ValidatingAdmissionWebhook

This admission controller calls any validating webhooks which match the request. Matching webhooks are called in parallel; if any of them rejects the request, the request fails. This admission controller only runs in the validation phase; the webhooks it calls may not mutate the object, as opposed to the webhooks called by the MutatingAdmissionWebhook admission controller.

If a webhook called by this has side effects (for example, decrementing quota) it must have a reconciliation system, as it is not guaranteed that subsequent webhooks or other validating admission controllers will permit the request to finish.

If you disable the ValidatingAdmissionWebhook, you must also disable the ValidatingWebhookConfiguration object in the admissionregistration.k8s.io/v1 group/version via the --runtime-config flag.

Yes. The recommended admission controllers are enabled by default (shown here), so you do not need to explicitly specify them. You can enable additional admission controllers beyond the default set using the --enable-admission-plugins flag (order doesn't matter).

3.5 - Dynamic Admission Control

In addition to compiled-in admission plugins, admission plugins can be developed as extensions and run as webhooks configured at runtime. This page describes how to build, configure, use, and monitor admission webhooks.

What are admission webhooks?

Admission webhooks are HTTP callbacks that receive admission requests and do something with them. You can define two types of admission webhooks, validating admission webhook and mutating admission webhook. Mutating admission webhooks are invoked first, and can modify objects sent to the API server to enforce custom defaults. After all object modifications are complete, and after the incoming object is validated by the API server, validating admission webhooks are invoked and can reject requests to enforce custom policies.

Experimenting with admission webhooks

Admission webhooks are essentially part of the cluster control-plane. You should write and deploy them with great caution. Please read the user guides for instructions if you intend to write/deploy production-grade admission webhooks. In the following, we describe how to quickly experiment with admission webhooks.

Prerequisites

  • Ensure that MutatingAdmissionWebhook and ValidatingAdmissionWebhook admission controllers are enabled. Here is a recommended set of admission controllers to enable in general.

  • Ensure that the admissionregistration.k8s.io/v1 API is enabled.

Write an admission webhook server

Please refer to the implementation of the admission webhook server that is validated in a Kubernetes e2e test. The webhook handles the AdmissionReview request sent by the API servers, and sends back its decision as an AdmissionReview object in the same version it received.

See the webhook request section for details on the data sent to webhooks.

See the webhook response section for the data expected from webhooks.

The example admission webhook server leaves the ClientAuth field empty, which defaults to NoClientCert. This means that the webhook server does not authenticate the identity of the clients, supposedly API servers. If you need mutual TLS or other ways to authenticate the clients, see how to authenticate API servers.

Deploy the admission webhook service

The webhook server in the e2e test is deployed in the Kubernetes cluster, via the deployment API. The test also creates a service as the front-end of the webhook server. See code.

You may also deploy your webhooks outside of the cluster. You will need to update your webhook configurations accordingly.

Configure admission webhooks on the fly

You can dynamically configure what resources are subject to what admission webhooks via ValidatingWebhookConfiguration or MutatingWebhookConfiguration.

The following is an example ValidatingWebhookConfiguration, a mutating webhook configuration is similar. See the webhook configuration section for details about each config field.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: "pod-policy.example.com"
webhooks:
- name: "pod-policy.example.com"
  rules:
  - apiGroups:   [""]
    apiVersions: ["v1"]
    operations:  ["CREATE"]
    resources:   ["pods"]
    scope:       "Namespaced"
  clientConfig:
    service:
      namespace: "example-namespace"
      name: "example-service"
    caBundle: <CA_BUNDLE>
  admissionReviewVersions: ["v1"]
  sideEffects: None
  timeoutSeconds: 5

The scope field specifies if only cluster-scoped resources ("Cluster") or namespace-scoped resources ("Namespaced") will match this rule. "∗" means that there are no scope restrictions.

When an API server receives a request that matches one of the rules, the API server sends an admissionReview request to webhook as specified in the clientConfig.

After you create the webhook configuration, the system will take a few seconds to honor the new configuration.

Authenticate API servers

If your admission webhooks require authentication, you can configure the API servers to use basic auth, bearer token, or a cert to authenticate itself to the webhooks. There are three steps to complete the configuration.

  • When starting the API server, specify the location of the admission control configuration file via the --admission-control-config-file flag.

  • In the admission control configuration file, specify where the MutatingAdmissionWebhook controller and ValidatingAdmissionWebhook controller should read the credentials. The credentials are stored in kubeConfig files (yes, the same schema that's used by kubectl), so the field name is kubeConfigFile. Here is an example admission control configuration file:

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: ValidatingAdmissionWebhook
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: WebhookAdmissionConfiguration
    kubeConfigFile: "<path-to-kubeconfig-file>"
- name: MutatingAdmissionWebhook
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: WebhookAdmissionConfiguration
    kubeConfigFile: "<path-to-kubeconfig-file>"

# Deprecated in v1.17 in favor of apiserver.config.k8s.io/v1
apiVersion: apiserver.k8s.io/v1alpha1
kind: AdmissionConfiguration
plugins:
- name: ValidatingAdmissionWebhook
  configuration:
    # Deprecated in v1.17 in favor of apiserver.config.k8s.io/v1, kind=WebhookAdmissionConfiguration
    apiVersion: apiserver.config.k8s.io/v1alpha1
    kind: WebhookAdmission
    kubeConfigFile: "<path-to-kubeconfig-file>"
- name: MutatingAdmissionWebhook
  configuration:
    # Deprecated in v1.17 in favor of apiserver.config.k8s.io/v1, kind=WebhookAdmissionConfiguration
    apiVersion: apiserver.config.k8s.io/v1alpha1
    kind: WebhookAdmission
    kubeConfigFile: "<path-to-kubeconfig-file>"

For more information about AdmissionConfiguration, see the AdmissionConfiguration (v1) reference. See the webhook configuration section for details about each config field.

In the kubeConfig file, provide the credentials:

apiVersion: v1
kind: Config
users:
# name should be set to the DNS name of the service or the host (including port) of the URL the webhook is configured to speak to.
# If a non-443 port is used for services, it must be included in the name when configuring 1.16+ API servers.
#
# For a webhook configured to speak to a service on the default port (443), specify the DNS name of the service:
# - name: webhook1.ns1.svc
#   user: ...
#
# For a webhook configured to speak to a service on non-default port (e.g. 8443), specify the DNS name and port of the service in 1.16+:
# - name: webhook1.ns1.svc:8443
#   user: ...
# and optionally create a second stanza using only the DNS name of the service for compatibility with 1.15 API servers:
# - name: webhook1.ns1.svc
#   user: ...
#
# For webhooks configured to speak to a URL, match the host (and port) specified in the webhook's URL. Examples:
# A webhook with `url: https://www.example.com`:
# - name: www.example.com
#   user: ...
#
# A webhook with `url: https://www.example.com:443`:
# - name: www.example.com:443
#   user: ...
#
# A webhook with `url: https://www.example.com:8443`:
# - name: www.example.com:8443
#   user: ...
#
- name: 'webhook1.ns1.svc'
  user:
    client-certificate-data: "<pem encoded certificate>"
    client-key-data: "<pem encoded key>"
# The `name` supports using * to wildcard-match prefixing segments.
- name: '*.webhook-company.org'
  user:
    password: "<password>"
    username: "<name>"
# '*' is the default match.
- name: '*'
  user:
    token: "<token>"

Of course you need to set up the webhook server to handle these authentication requests.

Webhook request and response

Request

Webhooks are sent as POST requests, with Content-Type: application/json, with an AdmissionReview API object in the admission.k8s.io API group serialized to JSON as the body.

Webhooks can specify what versions of AdmissionReview objects they accept with the admissionReviewVersions field in their configuration:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
webhooks:
- name: my-webhook.example.com
  admissionReviewVersions: ["v1", "v1beta1"]

admissionReviewVersions is a required field when creating webhook configurations. Webhooks are required to support at least one AdmissionReview version understood by the current and previous API server.

API servers send the first AdmissionReview version in the admissionReviewVersions list they support. If none of the versions in the list are supported by the API server, the configuration will not be allowed to be created. If an API server encounters a webhook configuration that was previously created and does not support any of the AdmissionReview versions the API server knows how to send, attempts to call to the webhook will fail and be subject to the failure policy.

This example shows the data contained in an AdmissionReview object for a request to update the scale subresource of an apps/v1 Deployment:

apiVersion: admission.k8s.io/v1
kind: AdmissionReview
request:
  # Random uid uniquely identifying this admission call
  uid: 705ab4f5-6393-11e8-b7cc-42010a800002

  # Fully-qualified group/version/kind of the incoming object
  kind:
    group: autoscaling
    version: v1
    kind: Scale

  # Fully-qualified group/version/kind of the resource being modified
  resource:
    group: apps
    version: v1
    resource: deployments

  # subresource, if the request is to a subresource
  subResource: scale

  # Fully-qualified group/version/kind of the incoming object in the original request to the API server.
  # This only differs from `kind` if the webhook specified `matchPolicy: Equivalent` and the
  # original request to the API server was converted to a version the webhook registered for.
  requestKind:
    group: autoscaling
    version: v1
    kind: Scale

  # Fully-qualified group/version/kind of the resource being modified in the original request to the API server.
  # This only differs from `resource` if the webhook specified `matchPolicy: Equivalent` and the
  # original request to the API server was converted to a version the webhook registered for.
  requestResource:
    group: apps
    version: v1
    resource: deployments

  # subresource, if the request is to a subresource
  # This only differs from `subResource` if the webhook specified `matchPolicy: Equivalent` and the
  # original request to the API server was converted to a version the webhook registered for.
  requestSubResource: scale

  # Name of the resource being modified
  name: my-deployment

  # Namespace of the resource being modified, if the resource is namespaced (or is a Namespace object)
  namespace: my-namespace

  # operation can be CREATE, UPDATE, DELETE, or CONNECT
  operation: UPDATE

  userInfo:
    # Username of the authenticated user making the request to the API server
    username: admin

    # UID of the authenticated user making the request to the API server
    uid: 014fbff9a07c

    # Group memberships of the authenticated user making the request to the API server
    groups:
      - system:authenticated
      - my-admin-group
    # Arbitrary extra info associated with the user making the request to the API server.
    # This is populated by the API server authentication layer and should be included
    # if any SubjectAccessReview checks are performed by the webhook.
    extra:
      some-key:
        - some-value1
        - some-value2

  # object is the new object being admitted.
  # It is null for DELETE operations.
  object:
    apiVersion: autoscaling/v1
    kind: Scale

  # oldObject is the existing object.
  # It is null for CREATE and CONNECT operations.
  oldObject:
    apiVersion: autoscaling/v1
    kind: Scale

  # options contains the options for the operation being admitted, like meta.k8s.io/v1 CreateOptions, UpdateOptions, or DeleteOptions.
  # It is null for CONNECT operations.
  options:
    apiVersion: meta.k8s.io/v1
    kind: UpdateOptions

  # dryRun indicates the API request is running in dry run mode and will not be persisted.
  # Webhooks with side effects should avoid actuating those side effects when dryRun is true.
  # See http://k8s.io/docs/reference/using-api/api-concepts/#make-a-dry-run-request for more details.
  dryRun: False

Response

Webhooks respond with a 200 HTTP status code, Content-Type: application/json, and a body containing an AdmissionReview object (in the same version they were sent), with the response stanza populated, serialized to JSON.

At a minimum, the response stanza must contain the following fields:

  • uid, copied from the request.uid sent to the webhook
  • allowed, either set to true or false

Example of a minimal response from a webhook to allow a request:

{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "<value from request.uid>",
    "allowed": true
  }
}

Example of a minimal response from a webhook to forbid a request:

{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "<value from request.uid>",
    "allowed": false
  }
}

When rejecting a request, the webhook can customize the http code and message returned to the user using the status field. The specified status object is returned to the user. See the API documentation for details about the status type. Example of a response to forbid a request, customizing the HTTP status code and message presented to the user:

{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "<value from request.uid>",
    "allowed": false,
    "status": {
      "code": 403,
      "message": "You cannot do this because it is Tuesday and your name starts with A"
    }
  }
}

When allowing a request, a mutating admission webhook may optionally modify the incoming object as well. This is done using the patch and patchType fields in the response. The only currently supported patchType is JSONPatch. See JSON patch documentation for more details. For patchType: JSONPatch, the patch field contains a base64-encoded array of JSON patch operations.

As an example, a single patch operation that would set spec.replicas would be [{"op": "add", "path": "/spec/replicas", "value": 3}]

Base64-encoded, this would be W3sib3AiOiAiYWRkIiwgInBhdGgiOiAiL3NwZWMvcmVwbGljYXMiLCAidmFsdWUiOiAzfV0=

So a webhook response to add that label would be:

{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "<value from request.uid>",
    "allowed": true,
    "patchType": "JSONPatch",
    "patch": "W3sib3AiOiAiYWRkIiwgInBhdGgiOiAiL3NwZWMvcmVwbGljYXMiLCAidmFsdWUiOiAzfV0="
  }
}

Admission webhooks can optionally return warning messages that are returned to the requesting client in HTTP Warning headers with a warning code of 299. Warnings can be sent with allowed or rejected admission responses.

If you're implementing a webhook that returns a warning:

  • Don't include a "Warning:" prefix in the message
  • Use warning messages to describe problems the client making the API request should correct or be aware of
  • Limit warnings to 120 characters if possible
{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {
    "uid": "<value from request.uid>",
    "allowed": true,
    "warnings": [
      "duplicate envvar entries specified with name MY_ENV",
      "memory request less than 4MB specified for container mycontainer, which will not start successfully"
    ]
  }
}

Webhook configuration

To register admission webhooks, create MutatingWebhookConfiguration or ValidatingWebhookConfiguration API objects. The name of a MutatingWebhookConfiguration or a ValidatingWebhookConfiguration object must be a valid DNS subdomain name.

Each configuration can contain one or more webhooks. If multiple webhooks are specified in a single configuration, each must be given a unique name. This is required in order to make resulting audit logs and metrics easier to match up to active configurations.

Each webhook defines the following things.

Matching requests: rules

Each webhook must specify a list of rules used to determine if a request to the API server should be sent to the webhook. Each rule specifies one or more operations, apiGroups, apiVersions, and resources, and a resource scope:

  • operations lists one or more operations to match. Can be "CREATE", "UPDATE", "DELETE", "CONNECT", or "*" to match all.

  • apiGroups lists one or more API groups to match. "" is the core API group. "*" matches all API groups.

  • apiVersions lists one or more API versions to match. "*" matches all API versions.

  • resources lists one or more resources to match.

    • "*" matches all resources, but not subresources.
    • "*/*" matches all resources and subresources.
    • "pods/*" matches all subresources of pods.
    • "*/status" matches all status subresources.
  • scope specifies a scope to match. Valid values are "Cluster", "Namespaced", and "*". Subresources match the scope of their parent resource. Default is "*".

    • "Cluster" means that only cluster-scoped resources will match this rule (Namespace API objects are cluster-scoped).
    • "Namespaced" means that only namespaced resources will match this rule.
    • "*" means that there are no scope restrictions.

If an incoming request matches one of the specified operations, groups, versions, resources, and scope for any of a webhook's rules, the request is sent to the webhook.

Here are other examples of rules that could be used to specify which resources should be intercepted.

Match CREATE or UPDATE requests to apps/v1 and apps/v1beta1 deployments and replicasets:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
  rules:
  - operations: ["CREATE", "UPDATE"]
    apiGroups: ["apps"]
    apiVersions: ["v1", "v1beta1"]
    resources: ["deployments", "replicasets"]
    scope: "Namespaced"
  ...

Match create requests for all resources (but not subresources) in all API groups and versions:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
webhooks:
  - name: my-webhook.example.com
    rules:
      - operations: ["CREATE"]
        apiGroups: ["*"]
        apiVersions: ["*"]
        resources: ["*"]
        scope: "*"

Match update requests for all status subresources in all API groups and versions:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
webhooks:
  - name: my-webhook.example.com
    rules:
      - operations: ["UPDATE"]
        apiGroups: ["*"]
        apiVersions: ["*"]
        resources: ["*/status"]
        scope: "*"

Matching requests: objectSelector

Webhooks may optionally limit which requests are intercepted based on the labels of the objects they would be sent, by specifying an objectSelector. If specified, the objectSelector is evaluated against both the object and oldObject that would be sent to the webhook, and is considered to match if either object matches the selector.

A null object (oldObject in the case of create, or newObject in the case of delete), or an object that cannot have labels (like a DeploymentRollback or a PodProxyOptions object) is not considered to match.

Use the object selector only if the webhook is opt-in, because end users may skip the admission webhook by setting the labels.

This example shows a mutating webhook that would match a CREATE of any resource with the label foo: bar:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
webhooks:
- name: my-webhook.example.com
  objectSelector:
    matchLabels:
      foo: bar
  rules:
  - operations: ["CREATE"]
    apiGroups: ["*"]
    apiVersions: ["*"]
    resources: ["*"]
    scope: "*"

See labels concept for more examples of label selectors.

Matching requests: namespaceSelector

Webhooks may optionally limit which requests for namespaced resources are intercepted, based on the labels of the containing namespace, by specifying a namespaceSelector.

The namespaceSelector decides whether to run the webhook on a request for a namespaced resource (or a Namespace object), based on whether the namespace's labels match the selector. If the object itself is a namespace, the matching is performed on object.metadata.labels. If the object is a cluster scoped resource other than a Namespace, namespaceSelector has no effect.

This example shows a mutating webhook that matches a CREATE of any namespaced resource inside a namespace that does not have a "runlevel" label of "0" or "1":

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
webhooks:
  - name: my-webhook.example.com
    namespaceSelector:
      matchExpressions:
        - key: runlevel
          operator: NotIn
          values: ["0","1"]
    rules:
      - operations: ["CREATE"]
        apiGroups: ["*"]
        apiVersions: ["*"]
        resources: ["*"]
        scope: "Namespaced"

This example shows a validating webhook that matches a CREATE of any namespaced resource inside a namespace that is associated with the "environment" of "prod" or "staging":

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
webhooks:
  - name: my-webhook.example.com
    namespaceSelector:
      matchExpressions:
        - key: environment
          operator: In
          values: ["prod","staging"]
    rules:
      - operations: ["CREATE"]
        apiGroups: ["*"]
        apiVersions: ["*"]
        resources: ["*"]
        scope: "Namespaced"

See labels concept for more examples of label selectors.

Matching requests: matchPolicy

API servers can make objects available via multiple API groups or versions.

For example, if a webhook only specified a rule for some API groups/versions (like apiGroups:["apps"], apiVersions:["v1","v1beta1"]), and a request was made to modify the resource via another API group/version (like extensions/v1beta1), the request would not be sent to the webhook.

The matchPolicy lets a webhook define how its rules are used to match incoming requests. Allowed values are Exact or Equivalent.

  • Exact means a request should be intercepted only if it exactly matches a specified rule.
  • Equivalent means a request should be intercepted if modifies a resource listed in rules, even via another API group or version.

In the example given above, the webhook that only registered for apps/v1 could use matchPolicy:

  • matchPolicy: Exact would mean the extensions/v1beta1 request would not be sent to the webhook
  • matchPolicy: Equivalent means the extensions/v1beta1 request would be sent to the webhook (with the objects converted to a version the webhook had specified: apps/v1)

Specifying Equivalent is recommended, and ensures that webhooks continue to intercept the resources they expect when upgrades enable new versions of the resource in the API server.

When a resource stops being served by the API server, it is no longer considered equivalent to other versions of that resource that are still served. For example, extensions/v1beta1 deployments were first deprecated and then removed (in Kubernetes v1.16).

Since that removal, a webhook with a apiGroups:["extensions"], apiVersions:["v1beta1"], resources:["deployments"] rule does not intercept deployments created via apps/v1 APIs. For that reason, webhooks should prefer registering for stable versions of resources.

This example shows a validating webhook that intercepts modifications to deployments (no matter the API group or version), and is always sent an apps/v1 Deployment object:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
webhooks:
- name: my-webhook.example.com
  matchPolicy: Equivalent
  rules:
  - operations: ["CREATE","UPDATE","DELETE"]
    apiGroups: ["apps"]
    apiVersions: ["v1"]
    resources: ["deployments"]
    scope: "Namespaced"

The matchPolicy for an admission webhooks defaults to Equivalent.

Contacting the webhook

Once the API server has determined a request should be sent to a webhook, it needs to know how to contact the webhook. This is specified in the clientConfig stanza of the webhook configuration.

Webhooks can either be called via a URL or a service reference, and can optionally include a custom CA bundle to use to verify the TLS connection.

URL

url gives the location of the webhook, in standard URL form (scheme://host:port/path).

The host should not refer to a service running in the cluster; use a service reference by specifying the service field instead. The host might be resolved via external DNS in some API servers (e.g., kube-apiserver cannot resolve in-cluster DNS as that would be a layering violation). host may also be an IP address.

Please note that using localhost or 127.0.0.1 as a host is risky unless you take great care to run this webhook on all hosts which run an API server which might need to make calls to this webhook. Such installations are likely to be non-portable or not readily run in a new cluster.

The scheme must be "https"; the URL must begin with "https://".

Attempting to use a user or basic auth (for example user:password@) is not allowed. Fragments (#...) and query parameters (?...) are also not allowed.

Here is an example of a mutating webhook configured to call a URL (and expects the TLS certificate to be verified using system trust roots, so does not specify a caBundle):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
webhooks:
- name: my-webhook.example.com
  clientConfig:
    url: "https://my-webhook.example.com:9443/my-webhook-path"

Service reference

The service stanza inside clientConfig is a reference to the service for this webhook. If the webhook is running within the cluster, then you should use service instead of url. The service namespace and name are required. The port is optional and defaults to 443. The path is optional and defaults to "/".

Here is an example of a mutating webhook configured to call a service on port "1234" at the subpath "/my-path", and to verify the TLS connection against the ServerName my-service-name.my-service-namespace.svc using a custom CA bundle:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
webhooks:
- name: my-webhook.example.com
  clientConfig:
    caBundle: <CA_BUNDLE>
    service:
      namespace: my-service-namespace
      name: my-service-name
      path: /my-path
      port: 1234

Side effects

Webhooks typically operate only on the content of the AdmissionReview sent to them. Some webhooks, however, make out-of-band changes as part of processing admission requests.

Webhooks that make out-of-band changes ("side effects") must also have a reconciliation mechanism (like a controller) that periodically determines the actual state of the world, and adjusts the out-of-band data modified by the admission webhook to reflect reality. This is because a call to an admission webhook does not guarantee the admitted object will be persisted as is, or at all. Later webhooks can modify the content of the object, a conflict could be encountered while writing to storage, or the server could power off before persisting the object.

Additionally, webhooks with side effects must skip those side-effects when dryRun: true admission requests are handled. A webhook must explicitly indicate that it will not have side-effects when run with dryRun, or the dry-run request will not be sent to the webhook and the API request will fail instead.

Webhooks indicate whether they have side effects using the sideEffects field in the webhook configuration:

  • None: calling the webhook will have no side effects.
  • NoneOnDryRun: calling the webhook will possibly have side effects, but if a request with dryRun: true is sent to the webhook, the webhook will suppress the side effects (the webhook is dryRun-aware).

Here is an example of a validating webhook indicating it has no side effects on dryRun: true requests:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
webhooks:
  - name: my-webhook.example.com
    sideEffects: NoneOnDryRun

Timeouts

Because webhooks add to API request latency, they should evaluate as quickly as possible. timeoutSeconds allows configuring how long the API server should wait for a webhook to respond before treating the call as a failure.

If the timeout expires before the webhook responds, the webhook call will be ignored or the API call will be rejected based on the failure policy.

The timeout value must be between 1 and 30 seconds.

Here is an example of a validating webhook with a custom timeout of 2 seconds:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
webhooks:
  - name: my-webhook.example.com
    timeoutSeconds: 2

The timeout for an admission webhook defaults to 10 seconds.

Reinvocation policy

A single ordering of mutating admissions plugins (including webhooks) does not work for all cases (see https://issue.k8s.io/64333 as an example). A mutating webhook can add a new sub-structure to the object (like adding a container to a pod), and other mutating plugins which have already run may have opinions on those new structures (like setting an imagePullPolicy on all containers).

To allow mutating admission plugins to observe changes made by other plugins, built-in mutating admission plugins are re-run if a mutating webhook modifies an object, and mutating webhooks can specify a reinvocationPolicy to control whether they are reinvoked as well.

reinvocationPolicy may be set to Never or IfNeeded. It defaults to Never.

  • Never: the webhook must not be called more than once in a single admission evaluation.
  • IfNeeded: the webhook may be called again as part of the admission evaluation if the object being admitted is modified by other admission plugins after the initial webhook call.

The important elements to note are:

  • The number of additional invocations is not guaranteed to be exactly one.
  • If additional invocations result in further modifications to the object, webhooks are not guaranteed to be invoked again.
  • Webhooks that use this option may be reordered to minimize the number of additional invocations.
  • To validate an object after all mutations are guaranteed complete, use a validating admission webhook instead (recommended for webhooks with side-effects).

Here is an example of a mutating webhook opting into being re-invoked if later admission plugins modify the object:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
webhooks:
- name: my-webhook.example.com
  reinvocationPolicy: IfNeeded

Mutating webhooks must be idempotent, able to successfully process an object they have already admitted and potentially modified. This is true for all mutating admission webhooks, since any change they can make in an object could already exist in the user-provided object, but it is essential for webhooks that opt into reinvocation.

Failure policy

failurePolicy defines how unrecognized errors and timeout errors from the admission webhook are handled. Allowed values are Ignore or Fail.

  • Ignore means that an error calling the webhook is ignored and the API request is allowed to continue.
  • Fail means that an error calling the webhook causes the admission to fail and the API request to be rejected.

Here is a mutating webhook configured to reject an API request if errors are encountered calling the admission webhook:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
webhooks:
- name: my-webhook.example.com
  failurePolicy: Fail

The default failurePolicy for an admission webhooks is Fail.

Monitoring admission webhooks

The API server provides ways to monitor admission webhook behaviors. These monitoring mechanisms help cluster admins to answer questions like:

  1. Which mutating webhook mutated the object in a API request?

  2. What change did the mutating webhook applied to the object?

  3. Which webhooks are frequently rejecting API requests? What's the reason for a rejection?

Mutating webhook auditing annotations

Sometimes it's useful to know which mutating webhook mutated the object in a API request, and what change did the webhook apply.

The Kubernetes API server performs auditing on each mutating webhook invocation. Each invocation generates an auditing annotation capturing if a request object is mutated by the invocation, and optionally generates an annotation capturing the applied patch from the webhook admission response. The annotations are set in the audit event for given request on given stage of its execution, which is then pre-processed according to a certain policy and written to a backend.

The audit level of a event determines which annotations get recorded:

  • At Metadata audit level or higher, an annotation with key mutation.webhook.admission.k8s.io/round_{round idx}_index_{order idx} gets logged with JSON payload indicating a webhook gets invoked for given request and whether it mutated the object or not.

    For example, the following annotation gets recorded for a webhook being reinvoked. The webhook is ordered the third in the mutating webhook chain, and didn't mutated the request object during the invocation.

    # the audit event recorded
    {
        "kind": "Event",
        "apiVersion": "audit.k8s.io/v1",
        "annotations": {
            "mutation.webhook.admission.k8s.io/round_1_index_2": "{\"configuration\":\"my-mutating-webhook-configuration.example.com\",\"webhook\":\"my-webhook.example.com\",\"mutated\": false}"
            # other annotations
            ...
        }
        # other fields
        ...
    }
    
    # the annotation value deserialized
    {
        "configuration": "my-mutating-webhook-configuration.example.com",
        "webhook": "my-webhook.example.com",
        "mutated": false
    }
    

    The following annotation gets recorded for a webhook being invoked in the first round. The webhook is ordered the first in the mutating webhook chain, and mutated the request object during the invocation.

    # the audit event recorded
    {
        "kind": "Event",
        "apiVersion": "audit.k8s.io/v1",
        "annotations": {
            "mutation.webhook.admission.k8s.io/round_0_index_0": "{\"configuration\":\"my-mutating-webhook-configuration.example.com\",\"webhook\":\"my-webhook-always-mutate.example.com\",\"mutated\": true}"
            # other annotations
            ...
        }
        # other fields
        ...
    }
    
    # the annotation value deserialized
    {
        "configuration": "my-mutating-webhook-configuration.example.com",
        "webhook": "my-webhook-always-mutate.example.com",
        "mutated": true
    }
    
  • At Request audit level or higher, an annotation with key patch.webhook.admission.k8s.io/round_{round idx}_index_{order idx} gets logged with JSON payload indicating a webhook gets invoked for given request and what patch gets applied to the request object.

    For example, the following annotation gets recorded for a webhook being reinvoked. The webhook is ordered the fourth in the mutating webhook chain, and responded with a JSON patch which got applied to the request object.

    # the audit event recorded
    {
        "kind": "Event",
        "apiVersion": "audit.k8s.io/v1",
        "annotations": {
            "patch.webhook.admission.k8s.io/round_1_index_3": "{\"configuration\":\"my-other-mutating-webhook-configuration.example.com\",\"webhook\":\"my-webhook-always-mutate.example.com\",\"patch\":[{\"op\":\"add\",\"path\":\"/data/mutation-stage\",\"value\":\"yes\"}],\"patchType\":\"JSONPatch\"}"
            # other annotations
            ...
        }
        # other fields
        ...
    }
    
    # the annotation value deserialized
    {
        "configuration": "my-other-mutating-webhook-configuration.example.com",
        "webhook": "my-webhook-always-mutate.example.com",
        "patchType": "JSONPatch",
        "patch": [
            {
                "op": "add",
                "path": "/data/mutation-stage",
                "value": "yes"
            }
        ]
    }
    

Admission webhook metrics

The API server exposes Prometheus metrics from the /metrics endpoint, which can be used for monitoring and diagnosing API server status. The following metrics record status related to admission webhooks.

API server admission webhook rejection count

Sometimes it's useful to know which admission webhooks are frequently rejecting API requests, and the reason for a rejection.

The API server exposes a Prometheus counter metric recording admission webhook rejections. The metrics are labelled to identify the causes of webhook rejection(s):

  • name: the name of the webhook that rejected a request.

  • operation: the operation type of the request, can be one of CREATE, UPDATE, DELETE and CONNECT.

  • type: the admission webhook type, can be one of admit and validating.

  • error_type: identifies if an error occurred during the webhook invocation that caused the rejection. Its value can be one of:

    • calling_webhook_error: unrecognized errors or timeout errors from the admission webhook happened and the webhook's Failure policy is set to Fail.
    • no_error: no error occurred. The webhook rejected the request with allowed: false in the admission response. The metrics label rejection_code records the .status.code set in the admission response.
    • apiserver_internal_error: an API server internal error happened.
  • rejection_code: the HTTP status code set in the admission response when a webhook rejected a request.

Example of the rejection count metrics:

# HELP apiserver_admission_webhook_rejection_count [ALPHA] Admission webhook rejection count, identified by name and broken out for each admission type (validating or admit) and operation. Additional labels specify an error type (calling_webhook_error or apiserver_internal_error if an error occurred; no_error otherwise) and optionally a non-zero rejection code if the webhook rejects the request with an HTTP status code (honored by the apiserver when the code is greater or equal to 400). Codes greater than 600 are truncated to 600, to keep the metrics cardinality bounded.
# TYPE apiserver_admission_webhook_rejection_count counter
apiserver_admission_webhook_rejection_count{error_type="calling_webhook_error",name="always-timeout-webhook.example.com",operation="CREATE",rejection_code="0",type="validating"} 1
apiserver_admission_webhook_rejection_count{error_type="calling_webhook_error",name="invalid-admission-response-webhook.example.com",operation="CREATE",rejection_code="0",type="validating"} 1
apiserver_admission_webhook_rejection_count{error_type="no_error",name="deny-unwanted-configmap-data.example.com",operation="CREATE",rejection_code="400",type="validating"} 13

Best practices and warnings

Idempotence

An idempotent mutating admission webhook is able to successfully process an object it has already admitted and potentially modified. The admission can be applied multiple times without changing the result beyond the initial application.

Example of idempotent mutating admission webhooks:

  1. For a CREATE pod request, set the field .spec.securityContext.runAsNonRoot of the pod to true, to enforce security best practices.

  2. For a CREATE pod request, if the field .spec.containers[].resources.limits of a container is not set, set default resource limits.

  3. For a CREATE pod request, inject a sidecar container with name foo-sidecar if no container with the name foo-sidecar already exists.

In the cases above, the webhook can be safely reinvoked, or admit an object that already has the fields set.

Example of non-idempotent mutating admission webhooks:

  1. For a CREATE pod request, inject a sidecar container with name foo-sidecar suffixed with the current timestamp (e.g. foo-sidecar-19700101-000000).

  2. For a CREATE/UPDATE pod request, reject if the pod has label "env" set, otherwise add an "env": "prod" label to the pod.

  3. For a CREATE pod request, blindly append a sidecar container named foo-sidecar without looking to see if there is already a foo-sidecar container in the pod.

In the first case above, reinvoking the webhook can result in the same sidecar being injected multiple times to a pod, each time with a different container name. Similarly the webhook can inject duplicated containers if the sidecar already exists in a user-provided pod.

In the second case above, reinvoking the webhook will result in the webhook failing on its own output.

In the third case above, reinvoking the webhook will result in duplicated containers in the pod spec, which makes the request invalid and rejected by the API server.

Intercepting all versions of an object

It is recommended that admission webhooks should always intercept all versions of an object by setting .webhooks[].matchPolicy to Equivalent. It is also recommended that admission webhooks should prefer registering for stable versions of resources. Failure to intercept all versions of an object can result in admission policies not being enforced for requests in certain versions. See Matching requests: matchPolicy for examples.

Availability

It is recommended that admission webhooks should evaluate as quickly as possible (typically in milliseconds), since they add to API request latency. It is encouraged to use a small timeout for webhooks. See Timeouts for more detail.

It is recommended that admission webhooks should leverage some format of load-balancing, to provide high availability and performance benefits. If a webhook is running within the cluster, you can run multiple webhook backends behind a service to leverage the load-balancing that service supports.

Guaranteeing the final state of the object is seen

Admission webhooks that need to guarantee they see the final state of the object in order to enforce policy should use a validating admission webhook, since objects can be modified after being seen by mutating webhooks.

For example, a mutating admission webhook is configured to inject a sidecar container with name "foo-sidecar" on every CREATE pod request. If the sidecar must be present, a validating admisson webhook should also be configured to intercept CREATE pod requests, and validate that a container with name "foo-sidecar" with the expected configuration exists in the to-be-created object.

Avoiding deadlocks in self-hosted webhooks

A webhook running inside the cluster might cause deadlocks for its own deployment if it is configured to intercept resources required to start its own pods.

For example, a mutating admission webhook is configured to admit CREATE pod requests only if a certain label is set in the pod (e.g. "env": "prod"). The webhook server runs in a deployment which doesn't set the "env" label. When a node that runs the webhook server pods becomes unhealthy, the webhook deployment will try to reschedule the pods to another node. However the requests will get rejected by the existing webhook server since the "env" label is unset, and the migration cannot happen.

It is recommended to exclude the namespace where your webhook is running with a namespaceSelector.

Side effects

It is recommended that admission webhooks should avoid side effects if possible, which means the webhooks operate only on the content of the AdmissionReview sent to them, and do not make out-of-band changes. The .webhooks[].sideEffects field should be set to None if a webhook doesn't have any side effect.

If side effects are required during the admission evaluation, they must be suppressed when processing an AdmissionReview object with dryRun set to true, and the .webhooks[].sideEffects field should be set to NoneOnDryRun. See Side effects for more detail.

Avoiding operating on the kube-system namespace

The kube-system namespace contains objects created by the Kubernetes system, e.g. service accounts for the control plane components, pods like kube-dns. Accidentally mutating or rejecting requests in the kube-system namespace may cause the control plane components to stop functioning or introduce unknown behavior. If your admission webhooks don't intend to modify the behavior of the Kubernetes control plane, exclude the kube-system namespace from being intercepted using a namespaceSelector.

3.6 - Managing Service Accounts

A ServiceAccount provides an identity for processes that run in a Pod.

A process inside a Pod can use the identity of its associated service account to authenticate to the cluster's API server.

For an introduction to service accounts, read configure service accounts.

This task guide explains some of the concepts behind ServiceAccounts. The guide also explains how to obtain or revoke tokens that represent ServiceAccounts.

Before you begin

You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:

To be able to follow these steps exactly, ensure you have a namespace named examplens. If you don't, create one by running:

kubectl create namespace examplens

User accounts versus service accounts

Kubernetes distinguishes between the concept of a user account and a service account for a number of reasons:

  • User accounts are for humans. Service accounts are for application processes, which (for Kubernetes) run in containers that are part of pods.
  • User accounts are intended to be global: names must be unique across all namespaces of a cluster. No matter what namespace you look at, a particular username that represents a user represents the same user. In Kubernetes, service accounts are namespaced: two different namespaces can contain ServiceAccounts that have identical names.
  • Typically, a cluster's user accounts might be synchronised from a corporate database, where new user account creation requires special privileges and is tied to complex business processes. By contrast, service account creation is intended to be more lightweight, allowing cluster users to create service accounts for specific tasks on demand. Separating ServiceAccount creation from the steps to onboard human users makes it easier for workloads to following the principle of least privilege.
  • Auditing considerations for humans and service accounts may differ; the separation makes that easier to achieve.
  • A configuration bundle for a complex system may include definition of various service accounts for components of that system. Because service accounts can be created without many constraints and have namespaced names, such configuration is usually portable.

Bound service account token volume mechanism

FEATURE STATE: Kubernetes v1.22 [stable]

By default, the Kubernetes control plane (specifically, the ServiceAccount admission controller) adds a projected volume to Pods, and this volume includes a token for Kubernetes API access.

Here's an example of how that looks for a launched Pod:

...
  - name: kube-api-access-<random-suffix>
    projected:
      sources:
        - serviceAccountToken:
            path: token # must match the path the app expects
        - configMap:
            items:
              - key: ca.crt
                path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
              - fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
                path: namespace

That manifest snippet defines a projected volume that consists of three sources. In this case, each source also represents a single path within that volume. The three sources are:

  1. A serviceAccountToken source, that contains a token that the kubelet acquires from kube-apiserver. The kubelet fetches time-bound tokens using the TokenRequest API. A token served for a TokenRequest expires either when the pod is deleted or after a defined lifespan (by default, that is 1 hour). The token is bound to the specific Pod and has the kube-apiserver as its audience. This mechanism superseded an earlier mechanism that added a volume based on a Secret, where the Secret represented the ServiceAccount for the Pod, but did not expire.
  2. A configMap source. The ConfigMap contains a bundle of certificate authority data. Pods can use these certificates to make sure that they are connecting to your cluster's kube-apiserver (and not to middlebox or an accidentally misconfigured peer).
  3. A downwardAPI source that looks up the name of the namespace containing the Pod, and makes that name information available to application code running inside the Pod.

Any container within the Pod that mounts this particular volume can access the above information.

Manual Secret management for ServiceAccounts

Versions of Kubernetes before v1.22 automatically created credentials for accessing the Kubernetes API. This older mechanism was based on creating token Secrets that could then be mounted into running Pods.

In more recent versions, including Kubernetes v1.25, API credentials are obtained directly using the TokenRequest API, and are mounted into Pods using a projected volume. The tokens obtained using this method have bounded lifetimes, and are automatically invalidated when the Pod they are mounted into is deleted.

You can still manually create a Secret to hold a service account token; for example, if you need a token that never expires.

Once you manually create a Secret and link it to a ServiceAccount, the Kubernetes control plane automatically populates the token into that Secret.

Control plane details

Token controller

The service account token controller runs as part of kube-controller-manager. This controller acts asynchronously. It:

  • watches for ServiceAccount deletion and deletes all corresponding ServiceAccount token Secrets.
  • watches for ServiceAccount token Secret addition, and ensures the referenced ServiceAccount exists, and adds a token to the Secret if needed.
  • watches for Secret deletion and removes a reference from the corresponding ServiceAccount if needed.

You must pass a service account private key file to the token controller in the kube-controller-manager using the --service-account-private-key-file flag. The private key is used to sign generated service account tokens. Similarly, you must pass the corresponding public key to the kube-apiserver using the --service-account-key-file flag. The public key will be used to verify the tokens during authentication.

ServiceAccount admission controller

The modification of pods is implemented via a plugin called an Admission Controller. It is part of the API server. This admission controller acts synchronously to modify pods as they are created. When this plugin is active (and it is by default on most distributions), then it does the following when a Pod is created:

  1. If the pod does not have a .spec.serviceAccountName set, the admission controller sets the name of the ServiceAccount for this incoming Pod to default.
  2. The admission controller ensures that the ServiceAccount referenced by the incoming Pod exists. If there is no ServiceAccount with a matching name, the admission controller rejects the incoming Pod. That check applies even for the default ServiceAccount.
  3. Provided that neither the ServiceAccount's automountServiceAccountToken field nor the Pod's automountServiceAccountToken field is set to false:
    • the admission controller mutates the incoming Pod, adding an extra volume that contains a token for API access.
    • the admission controller adds a volumeMount to each container in the Pod, skipping any containers that already have a volume mount defined for the path /var/run/secrets/kubernetes.io/serviceaccount. For Linux containers, that volume is mounted at /var/run/secrets/kubernetes.io/serviceaccount; on Windows nodes, the mount is at the equivalent path.
  4. If the spec of the incoming Pod does already contain any imagePullSecrets, then the admission controller adds imagePullSecrets, copying them from the ServiceAccount.

TokenRequest API

FEATURE STATE: Kubernetes v1.22 [stable]

You use the TokenRequest subresource of a ServiceAccount to obtain a time-bound token for that ServiceAccount. You don't need to call this to obtain an API token for use within a container, since the kubelet sets this up for you using a projected volume.

If you want to use the TokenRequest API from kubectl, see Manually create an API token for a ServiceAccount.

The Kubernetes control plane (specifically, the ServiceAccount admission controller) adds a projected volume to Pods, and the kubelet ensures that this volume contains a token that lets containers authenticate as the right ServiceAccount.

(This mechanism superseded an earlier mechanism that added a volume based on a Secret, where the Secret represented the ServiceAccount for the Pod but did not expire.)

Here's an example of how that looks for a launched Pod:

...
  - name: kube-api-access-<random-suffix>
    projected:
      defaultMode: 420 # decimal equivalent of octal 0644
      sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
              - key: ca.crt
                path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
              - fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
                path: namespace

That manifest snippet defines a projected volume that combines information from three sources:

  1. A serviceAccountToken source, that contains a token that the kubelet acquires from kube-apiserver. The kubelet fetches time-bound tokens using the TokenRequest API. A token served for a TokenRequest expires either when the pod is deleted or after a defined lifespan (by default, that is 1 hour). The token is bound to the specific Pod and has the kube-apiserver as its audience.
  2. A configMap source. The ConfigMap contains a bundle of certificate authority data. Pods can use these certificates to make sure that they are connecting to your cluster's kube-apiserver (and not to middlebox or an accidentally misconfigured peer).
  3. A downwardAPI source. This downwardAPI volume makes the name of the namespace containing the Pod available to application code running inside the Pod.

Any container within the Pod that mounts this volume can access the above information.

Create additional API tokens

To create a non-expiring, persisted API token for a ServiceAccount, create a Secret of type kubernetes.io/service-account-token with an annotation referencing the ServiceAccount. The control plane then generates a long-lived token and updates that Secret with that generated token data.

Here is a sample manifest for such a Secret:

apiVersion: v1
kind: Secret
type: kubernetes.io/service-account-token
metadata:
  name: mysecretname
  annotations:
    - kubernetes.io/service-account.name: myserviceaccount

To create a Secret based on this example, run:

kubectl -n examplens create -f https://k8s.io/examples/secret/serviceaccount/mysecretname.yaml

To see the details for that Secret, run:

kubectl -n examplens describe secret mysecretname

The output is similar to:

Name:           mysecretname
Namespace:      examplens
Labels:         <none>
Annotations:    kubernetes.io/service-account.name=myserviceaccount
                kubernetes.io/service-account.uid=8a85c4c4-8483-11e9-bc42-526af7764f64

Type:   kubernetes.io/service-account-token

Data
====
ca.crt:         1362 bytes
namespace:      9 bytes
token:          ...

If you launch a new Pod into the examplens namespace, it can use the myserviceaccount service-account-token Secret that you just created.

Delete/invalidate a ServiceAccount token

If you know the name of the Secret that contains the token you want to remove:

kubectl delete secret name-of-secret

Otherwise, first find the Secret for the ServiceAccount.

# This assumes that you already have a namespace named 'examplens'
kubectl -n examplens get serviceaccount/example-automated-thing -o yaml

The output is similar to:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
            {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{},"name":"example-automated-thing","namespace":"examplens"}}
  creationTimestamp: "2019-07-21T07:07:07Z"
  name: example-automated-thing
  namespace: examplens
  resourceVersion: "777"
  selfLink: /api/v1/namespaces/examplens/serviceaccounts/example-automated-thing
  uid: f23fd170-66f2-4697-b049-e1e266b7f835
secrets:
  - name: example-automated-thing-token-zyxwv

Then, delete the Secret you now know the name of:

kubectl -n examplens delete secret/example-automated-thing-token-zyxwv

The control plane spots that the ServiceAccount is missing its Secret, and creates a replacement:

kubectl -n examplens get serviceaccount/example-automated-thing -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
            {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{},"name":"example-automated-thing","namespace":"examplens"}}
  creationTimestamp: "2019-07-21T07:07:07Z"
  name: example-automated-thing
  namespace: examplens
  resourceVersion: "1026"
  selfLink: /api/v1/namespaces/examplens/serviceaccounts/example-automated-thing
  uid: f23fd170-66f2-4697-b049-e1e266b7f835
secrets:
  - name: example-automated-thing-token-4rdrh

Clean up

If you created a namespace examplens to experiment with, you can remove it:

kubectl delete namespace examplens

Control plane details

ServiceAccount controller

A ServiceAccount controller manages the ServiceAccounts inside namespaces, and ensures a ServiceAccount named "default" exists in every active namespace.

Token controller

The service account token controller runs as part of kube-controller-manager. This controller acts asynchronously. It:

  • watches for ServiceAccount creation and creates a corresponding ServiceAccount token Secret to allow API access.
  • watches for ServiceAccount deletion and deletes all corresponding ServiceAccount token Secrets.
  • watches for ServiceAccount token Secret addition, and ensures the referenced ServiceAccount exists, and adds a token to the Secret if needed.
  • watches for Secret deletion and removes a reference from the corresponding ServiceAccount if needed.

You must pass a service account private key file to the token controller in the kube-controller-manager using the --service-account-private-key-file flag. The private key is used to sign generated service account tokens. Similarly, you must pass the corresponding public key to the kube-apiserver using the --service-account-key-file flag. The public key will be used to verify the tokens during authentication.

What's next

3.7 - Authorization Overview

Learn more about Kubernetes authorization, including details about creating policies using the supported authorization modules.

In Kubernetes, you must be authenticated (logged in) before your request can be authorized (granted permission to access). For information about authentication, see Controlling Access to the Kubernetes API.

Kubernetes expects attributes that are common to REST API requests. This means that Kubernetes authorization works with existing organization-wide or cloud-provider-wide access control systems which may handle other APIs besides the Kubernetes API.

Determine Whether a Request is Allowed or Denied

Kubernetes authorizes API requests using the API server. It evaluates all of the request attributes against all policies and allows or denies the request. All parts of an API request must be allowed by some policy in order to proceed. This means that permissions are denied by default.

(Although Kubernetes uses the API server, access controls and policies that depend on specific fields of specific kinds of objects are handled by Admission Controllers.)

When multiple authorization modules are configured, each is checked in sequence. If any authorizer approves or denies a request, that decision is immediately returned and no other authorizer is consulted. If all modules have no opinion on the request, then the request is denied. A deny returns an HTTP status code 403.

Review Your Request Attributes

Kubernetes reviews only the following API request attributes:

  • user - The user string provided during authentication.
  • group - The list of group names to which the authenticated user belongs.
  • extra - A map of arbitrary string keys to string values, provided by the authentication layer.
  • API - Indicates whether the request is for an API resource.
  • Request path - Path to miscellaneous non-resource endpoints like /api or /healthz.
  • API request verb - API verbs like get, list, create, update, patch, watch, delete, and deletecollection are used for resource requests. To determine the request verb for a resource API endpoint, see Determine the request verb.
  • HTTP request verb - Lowercased HTTP methods like get, post, put, and delete are used for non-resource requests.
  • Resource - The ID or name of the resource that is being accessed (for resource requests only) -- For resource requests using get, update, patch, and delete verbs, you must provide the resource name.
  • Subresource - The subresource that is being accessed (for resource requests only).
  • Namespace - The namespace of the object that is being accessed (for namespaced resource requests only).
  • API group - The API Group being accessed (for resource requests only). An empty string designates the core API group.

Determine the Request Verb

Non-resource requests Requests to endpoints other than /api/v1/... or /apis/<group>/<version>/... are considered "non-resource requests", and use the lower-cased HTTP method of the request as the verb. For example, a GET request to endpoints like /api or /healthz would use get as the verb.

Resource requests To determine the request verb for a resource API endpoint, review the HTTP verb used and whether or not the request acts on an individual resource or a collection of resources:

HTTP verbrequest verb
POSTcreate
GET, HEADget (for individual resources), list (for collections, including full object content), watch (for watching an individual resource or collection of resources)
PUTupdate
PATCHpatch
DELETEdelete (for individual resources), deletecollection (for collections)

Kubernetes sometimes checks authorization for additional permissions using specialized verbs. For example:

  • RBAC
    • bind and escalate verbs on roles and clusterroles resources in the rbac.authorization.k8s.io API group.
  • Authentication
    • impersonate verb on users, groups, and serviceaccounts in the core API group, and the userextras in the authentication.k8s.io API group.

Authorization Modes

The Kubernetes API server may authorize a request using one of several authorization modes:

  • Node - A special-purpose authorization mode that grants permissions to kubelets based on the pods they are scheduled to run. To learn more about using the Node authorization mode, see Node Authorization.
  • ABAC - Attribute-based access control (ABAC) defines an access control paradigm whereby access rights are granted to users through the use of policies which combine attributes together. The policies can use any type of attributes (user attributes, resource attributes, object, environment attributes, etc). To learn more about using the ABAC mode, see ABAC Mode.
  • RBAC - Role-based access control (RBAC) is a method of regulating access to computer or network resources based on the roles of individual users within an enterprise. In this context, access is the ability of an individual user to perform a specific task, such as view, create, or modify a file. To learn more about using the RBAC mode, see RBAC Mode
    • When specified RBAC (Role-Based Access Control) uses the rbac.authorization.k8s.io API group to drive authorization decisions, allowing admins to dynamically configure permission policies through the Kubernetes API.
    • To enable RBAC, start the apiserver with --authorization-mode=RBAC.
  • Webhook - A WebHook is an HTTP callback: an HTTP POST that occurs when something happens; a simple event-notification via HTTP POST. A web application implementing WebHooks will POST a message to a URL when certain things happen. To learn more about using the Webhook mode, see Webhook Mode.

Checking API Access

kubectl provides the auth can-i subcommand for quickly querying the API authorization layer. The command uses the SelfSubjectAccessReview API to determine if the current user can perform a given action, and works regardless of the authorization mode used.

kubectl auth can-i create deployments --namespace dev

The output is similar to this:

yes
kubectl auth can-i create deployments --namespace prod

The output is similar to this:

no

Administrators can combine this with user impersonation to determine what action other users can perform.

kubectl auth can-i list secrets --namespace dev --as dave

The output is similar to this:

no

Similarly, to check whether a ServiceAccount named dev-sa in Namespace dev can list Pods in the Namespace target:

kubectl auth can-i list pods \
	--namespace target \
	--as system:serviceaccount:dev:dev-sa

The output is similar to this:

yes

SelfSubjectAccessReview is part of the authorization.k8s.io API group, which exposes the API server authorization to external services. Other resources in this group include:

  • SubjectAccessReview - Access review for any user, not only the current one. Useful for delegating authorization decisions to the API server. For example, the kubelet and extension API servers use this to determine user access to their own APIs.
  • LocalSubjectAccessReview - Like SubjectAccessReview but restricted to a specific namespace.
  • SelfSubjectRulesReview - A review which returns the set of actions a user can perform within a namespace. Useful for users to quickly summarize their own access, or for UIs to hide/show actions.

These APIs can be queried by creating normal Kubernetes resources, where the response "status" field of the returned object is the result of the query.

kubectl create -f - -o yaml << EOF
apiVersion: authorization.k8s.io/v1
kind: SelfSubjectAccessReview
spec:
  resourceAttributes:
    group: apps
    resource: deployments
    verb: create
    namespace: dev
EOF

The generated SelfSubjectAccessReview is:

apiVersion: authorization.k8s.io/v1
kind: SelfSubjectAccessReview
metadata:
  creationTimestamp: null
spec:
  resourceAttributes:
    group: apps
    resource: deployments
    namespace: dev
    verb: create
status:
  allowed: true
  denied: false

Using Flags for Your Authorization Module

You must include a flag in your policy to indicate which authorization module your policies include:

The following flags can be used:

  • --authorization-mode=ABAC Attribute-Based Access Control (ABAC) mode allows you to configure policies using local files.
  • --authorization-mode=RBAC Role-based access control (RBAC) mode allows you to create and store policies using the Kubernetes API.
  • --authorization-mode=Webhook WebHook is an HTTP callback mode that allows you to manage authorization using a remote REST endpoint.
  • --authorization-mode=Node Node authorization is a special-purpose authorization mode that specifically authorizes API requests made by kubelets.
  • --authorization-mode=AlwaysDeny This flag blocks all requests. Use this flag only for testing.
  • --authorization-mode=AlwaysAllow This flag allows all requests. Use this flag only if you do not require authorization for your API requests.

You can choose more than one authorization module. Modules are checked in order so an earlier module has higher priority to allow or deny a request.

Privilege escalation via workload creation or edits

Users who can create/edit pods in a namespace, either directly or through a controller such as an operator, could escalate their privileges in that namespace.

Escalation paths

  • Mounting arbitrary secrets in that namespace
    • Can be used to access secrets meant for other workloads
    • Can be used to obtain a more privileged service account's service account token
  • Using arbitrary Service Accounts in that namespace
    • Can perform Kubernetes API actions as another workload (impersonation)
    • Can perform any privileged actions that Service Account has
  • Mounting configmaps meant for other workloads in that namespace
    • Can be used to obtain information meant for other workloads, such as DB host names.
  • Mounting volumes meant for other workloads in that namespace
    • Can be used to obtain information meant for other workloads, and change it.

What's next

3.8 - Using RBAC Authorization

Role-based access control (RBAC) is a method of regulating access to computer or network resources based on the roles of individual users within your organization.

RBAC authorization uses the rbac.authorization.k8s.io API group to drive authorization decisions, allowing you to dynamically configure policies through the Kubernetes API.

To enable RBAC, start the API server with the --authorization-mode flag set to a comma-separated list that includes RBAC; for example:

kube-apiserver --authorization-mode=Example,RBAC --other-options --more-options

API objects

The RBAC API declares four kinds of Kubernetes object: Role, ClusterRole, RoleBinding and ClusterRoleBinding. You can describe objects, or amend them, using tools such as kubectl, just like any other Kubernetes object.

Role and ClusterRole

An RBAC Role or ClusterRole contains rules that represent a set of permissions. Permissions are purely additive (there are no "deny" rules).

A Role always sets permissions within a particular namespace; when you create a Role, you have to specify the namespace it belongs in.

ClusterRole, by contrast, is a non-namespaced resource. The resources have different names (Role and ClusterRole) because a Kubernetes object always has to be either namespaced or not namespaced; it can't be both.

ClusterRoles have several uses. You can use a ClusterRole to:

  1. define permissions on namespaced resources and be granted access within individual namespace(s)
  2. define permissions on namespaced resources and be granted access across all namespaces
  3. define permissions on cluster-scoped resources

If you want to define a role within a namespace, use a Role; if you want to define a role cluster-wide, use a ClusterRole.

Role example

Here's an example Role in the "default" namespace that can be used to grant read access to pods:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["pods"]
  verbs: ["get", "watch", "list"]

ClusterRole example

A ClusterRole can be used to grant the same permissions as a Role. Because ClusterRoles are cluster-scoped, you can also use them to grant access to:

  • cluster-scoped resources (like nodes)

  • non-resource endpoints (like /healthz)

  • namespaced resources (like Pods), across all namespaces

    For example: you can use a ClusterRole to allow a particular user to run kubectl get pods --all-namespaces

Here is an example of a ClusterRole that can be used to grant read access to secrets in any particular namespace, or across all namespaces (depending on how it is bound):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  # "namespace" omitted since ClusterRoles are not namespaced
  name: secret-reader
rules:
- apiGroups: [""]
  #
  # at the HTTP level, the name of the resource for accessing Secret
  # objects is "secrets"
  resources: ["secrets"]
  verbs: ["get", "watch", "list"]

The name of a Role or a ClusterRole object must be a valid path segment name.

RoleBinding and ClusterRoleBinding

A role binding grants the permissions defined in a role to a user or set of users. It holds a list of subjects (users, groups, or service accounts), and a reference to the role being granted. A RoleBinding grants permissions within a specific namespace whereas a ClusterRoleBinding grants that access cluster-wide.

A RoleBinding may reference any Role in the same namespace. Alternatively, a RoleBinding can reference a ClusterRole and bind that ClusterRole to the namespace of the RoleBinding. If you want to bind a ClusterRole to all the namespaces in your cluster, you use a ClusterRoleBinding.

The name of a RoleBinding or ClusterRoleBinding object must be a valid path segment name.

RoleBinding examples

Here is an example of a RoleBinding that grants the "pod-reader" Role to the user "jane" within the "default" namespace. This allows "jane" to read pods in the "default" namespace.

apiVersion: rbac.authorization.k8s.io/v1
# This role binding allows "jane" to read pods in the "default" namespace.
# You need to already have a Role named "pod-reader" in that namespace.
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
# You can specify more than one "subject"
- kind: User
  name: jane # "name" is case sensitive
  apiGroup: rbac.authorization.k8s.io
roleRef:
  # "roleRef" specifies the binding to a Role / ClusterRole
  kind: Role #this must be Role or ClusterRole
  name: pod-reader # this must match the name of the Role or ClusterRole you wish to bind to
  apiGroup: rbac.authorization.k8s.io

A RoleBinding can also reference a ClusterRole to grant the permissions defined in that ClusterRole to resources inside the RoleBinding's namespace. This kind of reference lets you define a set of common roles across your cluster, then reuse them within multiple namespaces.

For instance, even though the following RoleBinding refers to a ClusterRole, "dave" (the subject, case sensitive) will only be able to read Secrets in the "development" namespace, because the RoleBinding's namespace (in its metadata) is "development".

apiVersion: rbac.authorization.k8s.io/v1
# This role binding allows "dave" to read secrets in the "development" namespace.
# You need to already have a ClusterRole named "secret-reader".
kind: RoleBinding
metadata:
  name: read-secrets
  #
  # The namespace of the RoleBinding determines where the permissions are granted.
  # This only grants permissions within the "development" namespace.
  namespace: development
subjects:
- kind: User
  name: dave # Name is case sensitive
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: secret-reader
  apiGroup: rbac.authorization.k8s.io

ClusterRoleBinding example

To grant permissions across a whole cluster, you can use a ClusterRoleBinding. The following ClusterRoleBinding allows any user in the group "manager" to read secrets in any namespace.

apiVersion: rbac.authorization.k8s.io/v1
# This cluster role binding allows anyone in the "manager" group to read secrets in any namespace.
kind: ClusterRoleBinding
metadata:
  name: read-secrets-global
subjects:
- kind: Group
  name: manager # Name is case sensitive
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: secret-reader
  apiGroup: rbac.authorization.k8s.io

After you create a binding, you cannot change the Role or ClusterRole that it refers to. If you try to change a binding's roleRef, you get a validation error. If you do want to change the roleRef for a binding, you need to remove the binding object and create a replacement.

There are two reasons for this restriction:

  1. Making roleRef immutable allows granting someone update permission on an existing binding object, so that they can manage the list of subjects, without being able to change the role that is granted to those subjects.
  2. A binding to a different role is a fundamentally different binding. Requiring a binding to be deleted/recreated in order to change the roleRef ensures the full list of subjects in the binding is intended to be granted the new role (as opposed to enabling or accidentally modifying only the roleRef without verifying all of the existing subjects should be given the new role's permissions).

The kubectl auth reconcile command-line utility creates or updates a manifest file containing RBAC objects, and handles deleting and recreating binding objects if required to change the role they refer to. See command usage and examples for more information.

Referring to resources

In the Kubernetes API, most resources are represented and accessed using a string representation of their object name, such as pods for a Pod. RBAC refers to resources using exactly the same name that appears in the URL for the relevant API endpoint. Some Kubernetes APIs involve a subresource, such as the logs for a Pod. A request for a Pod's logs looks like:

GET /api/v1/namespaces/{namespace}/pods/{name}/log

In this case, pods is the namespaced resource for Pod resources, and log is a subresource of pods. To represent this in an RBAC role, use a slash (/) to delimit the resource and subresource. To allow a subject to read pods and also access the log subresource for each of those Pods, you write:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-and-pod-logs-reader
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list"]

You can also refer to resources by name for certain requests through the resourceNames list. When specified, requests can be restricted to individual instances of a resource. Here is an example that restricts its subject to only get or update a ConfigMap named my-configmap:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: configmap-updater
rules:
- apiGroups: [""]
  #
  # at the HTTP level, the name of the resource for accessing ConfigMap
  # objects is "configmaps"
  resources: ["configmaps"]
  resourceNames: ["my-configmap"]
  verbs: ["update", "get"]

Rather than referring to individual resources and verbs you can use the wildcard * symbol to refer to all such objects. For nonResourceURLs you can use the wildcard * symbol as a suffix glob match and for apiGroups and resourceNames an empty set means that everything is allowed. Here is an example that allows access to perform any current and future action on all current and future resources (note, this is similar to the built-in cluster-admin role).

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: example.com-superuser  # DO NOT USE THIS ROLE, IT IS JUST AN EXAMPLE
rules:
- apiGroups: ["example.com"]
  resources: ["*"]
  verbs: ["*"]

Aggregated ClusterRoles

You can aggregate several ClusterRoles into one combined ClusterRole. A controller, running as part of the cluster control plane, watches for ClusterRole objects with an aggregationRule set. The aggregationRule defines a label selector that the controller uses to match other ClusterRole objects that should be combined into the rules field of this one.

Here is an example aggregated ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring
aggregationRule:
  clusterRoleSelectors:
  - matchLabels:
      rbac.example.com/aggregate-to-monitoring: "true"
rules: [] # The control plane automatically fills in the rules

If you create a new ClusterRole that matches the label selector of an existing aggregated ClusterRole, that change triggers adding the new rules into the aggregated ClusterRole. Here is an example that adds rules to the "monitoring" ClusterRole, by creating another ClusterRole labeled rbac.example.com/aggregate-to-monitoring: true.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-endpoints
  labels:
    rbac.example.com/aggregate-to-monitoring: "true"
# When you create the "monitoring-endpoints" ClusterRole,
# the rules below will be added to the "monitoring" ClusterRole.
rules:
- apiGroups: [""]
  resources: ["services", "endpointslices", "pods"]
  verbs: ["get", "list", "watch"]

The default user-facing roles use ClusterRole aggregation. This lets you, as a cluster administrator, include rules for custom resources, such as those served by CustomResourceDefinitions or aggregated API servers, to extend the default roles.

For example: the following ClusterRoles let the "admin" and "edit" default roles manage the custom resource named CronTab, whereas the "view" role can perform only read actions on CronTab resources. You can assume that CronTab objects are named "crontabs" in URLs as seen by the API server.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aggregate-cron-tabs-edit
  labels:
    # Add these permissions to the "admin" and "edit" default roles.
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
rules:
- apiGroups: ["stable.example.com"]
  resources: ["crontabs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: aggregate-cron-tabs-view
  labels:
    # Add these permissions to the "view" default role.
    rbac.authorization.k8s.io/aggregate-to-view: "true"
rules:
- apiGroups: ["stable.example.com"]
  resources: ["crontabs"]
  verbs: ["get", "list", "watch"]

Role examples

The following examples are excerpts from Role or ClusterRole objects, showing only the rules section.

Allow reading "pods" resources in the core API Group:

rules:
- apiGroups: [""]
  #
  # at the HTTP level, the name of the resource for accessing Pod
  # objects is "pods"
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

Allow reading/writing Deployments (at the HTTP level: objects with "deployments" in the resource part of their URL) in the "apps" API groups:

rules:
- apiGroups: ["apps"]
  #
  # at the HTTP level, the name of the resource for accessing Deployment
  # objects is "deployments"
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

Allow reading Pods in the core API group, as well as reading or writing Job resources in the "batch" API group:

rules:
- apiGroups: [""]
  #
  # at the HTTP level, the name of the resource for accessing Pod
  # objects is "pods"
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["batch"]
  #
  # at the HTTP level, the name of the resource for accessing Job
  # objects is "jobs"
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

Allow reading a ConfigMap named "my-config" (must be bound with a RoleBinding to limit to a single ConfigMap in a single namespace):

rules:
- apiGroups: [""]
  #
  # at the HTTP level, the name of the resource for accessing ConfigMap
  # objects is "configmaps"
  resources: ["configmaps"]
  resourceNames: ["my-config"]
  verbs: ["get"]

Allow reading the resource "nodes" in the core group (because a Node is cluster-scoped, this must be in a ClusterRole bound with a ClusterRoleBinding to be effective):

rules:
- apiGroups: [""]
  #
  # at the HTTP level, the name of the resource for accessing Node
  # objects is "nodes"
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]

Allow GET and POST requests to the non-resource endpoint /healthz and all subpaths (must be in a ClusterRole bound with a ClusterRoleBinding to be effective):

rules:
- nonResourceURLs: ["/healthz", "/healthz/*"] # '*' in a nonResourceURL is a suffix glob match
  verbs: ["get", "post"]

Referring to subjects

A RoleBinding or ClusterRoleBinding binds a role to subjects. Subjects can be groups, users or ServiceAccounts.

Kubernetes represents usernames as strings. These can be: plain names, such as "alice"; email-style names, like "bob@example.com"; or numeric user IDs represented as a string. It is up to you as a cluster administrator to configure the authentication modules so that authentication produces usernames in the format you want.

In Kubernetes, Authenticator modules provide group information. Groups, like users, are represented as strings, and that string has no format requirements, other than that the prefix system: is reserved.

ServiceAccounts have names prefixed with system:serviceaccount:, and belong to groups that have names prefixed with system:serviceaccounts:.

RoleBinding examples

The following examples are RoleBinding excerpts that only show the subjects section.

For a user named alice@example.com:

subjects:
- kind: User
  name: "alice@example.com"
  apiGroup: rbac.authorization.k8s.io

For a group named frontend-admins:

subjects:
- kind: Group
  name: "frontend-admins"
  apiGroup: rbac.authorization.k8s.io

For the default service account in the "kube-system" namespace:

subjects:
- kind: ServiceAccount
  name: default
  namespace: kube-system

For all service accounts in the "qa" namespace:

subjects:
- kind: Group
  name: system:serviceaccounts:qa
  apiGroup: rbac.authorization.k8s.io

For all service accounts in any namespace:

subjects:
- kind: Group
  name: system:serviceaccounts
  apiGroup: rbac.authorization.k8s.io

For all authenticated users:

subjects:
- kind: Group
  name: system:authenticated
  apiGroup: rbac.authorization.k8s.io

For all unauthenticated users:

subjects:
- kind: Group
  name: system:unauthenticated
  apiGroup: rbac.authorization.k8s.io

For all users:

subjects:
- kind: Group
  name: system:authenticated
  apiGroup: rbac.authorization.k8s.io
- kind: Group
  name: system:unauthenticated
  apiGroup: rbac.authorization.k8s.io

Default roles and role bindings

API servers create a set of default ClusterRole and ClusterRoleBinding objects. Many of these are system: prefixed, which indicates that the resource is directly managed by the cluster control plane. All of the default ClusterRoles and ClusterRoleBindings are labeled with kubernetes.io/bootstrapping=rbac-defaults.

Auto-reconciliation

At each start-up, the API server updates default cluster roles with any missing permissions, and updates default cluster role bindings with any missing subjects. This allows the cluster to repair accidental modifications, and helps to keep roles and role bindings up-to-date as permissions and subjects change in new Kubernetes releases.

To opt out of this reconciliation, set the rbac.authorization.kubernetes.io/autoupdate annotation on a default cluster role or rolebinding to false. Be aware that missing default permissions and subjects can result in non-functional clusters.

Auto-reconciliation is enabled by default if the RBAC authorizer is active.

API discovery roles

Default role bindings authorize unauthenticated and authenticated users to read API information that is deemed safe to be publicly accessible (including CustomResourceDefinitions). To disable anonymous unauthenticated access, add --anonymous-auth=false to the API server configuration.

To view the configuration of these roles via kubectl run:

kubectl get clusterroles system:discovery -o yaml
Kubernetes RBAC API discovery roles
Default ClusterRoleDefault ClusterRoleBindingDescription
system:basic-usersystem:authenticated groupAllows a user read-only access to basic information about themselves. Prior to v1.14, this role was also bound to system:unauthenticated by default.
system:discoverysystem:authenticated groupAllows read-only access to API discovery endpoints needed to discover and negotiate an API level. Prior to v1.14, this role was also bound to system:unauthenticated by default.
system:public-info-viewersystem:authenticated and system:unauthenticated groupsAllows read-only access to non-sensitive information about the cluster. Introduced in Kubernetes v1.14.

User-facing roles

Some of the default ClusterRoles are not system: prefixed. These are intended to be user-facing roles. They include super-user roles (cluster-admin), roles intended to be granted cluster-wide using ClusterRoleBindings, and roles intended to be granted within particular namespaces using RoleBindings (admin, edit, view).

User-facing ClusterRoles use ClusterRole aggregation to allow admins to include rules for custom resources on these ClusterRoles. To add rules to the admin, edit, or view roles, create a ClusterRole with one or more of the following labels:

metadata:
  labels:
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
    rbac.authorization.k8s.io/aggregate-to-view: "true"

Default ClusterRoleDefault ClusterRoleBindingDescription
cluster-adminsystem:masters groupAllows super-user access to perform any action on any resource. When used in a ClusterRoleBinding, it gives full control over every resource in the cluster and in all namespaces. When used in a RoleBinding, it gives full control over every resource in the role binding's namespace, including the namespace itself.
adminNoneAllows admin access, intended to be granted within a namespace using a RoleBinding.

If used in a RoleBinding, allows read/write access to most resources in a namespace, including the ability to create roles and role bindings within the namespace. This role does not allow write access to resource quota or to the namespace itself. This role also does not allow write access to EndpointSlices (or Endpoints) in clusters created using Kubernetes v1.22+. More information is available in the "Write Access for EndpointSlices and Endpoints" section.

editNoneAllows read/write access to most objects in a namespace.

This role does not allow viewing or modifying roles or role bindings. However, this role allows accessing Secrets and running Pods as any ServiceAccount in the namespace, so it can be used to gain the API access levels of any ServiceAccount in the namespace. This role also does not allow write access to EndpointSlices (or Endpoints) in clusters created using Kubernetes v1.22+. More information is available in the "Write Access for EndpointSlices and Endpoints" section.

viewNoneAllows read-only access to see most objects in a namespace. It does not allow viewing roles or role bindings.

This role does not allow viewing Secrets, since reading the contents of Secrets enables access to ServiceAccount credentials in the namespace, which would allow API access as any ServiceAccount in the namespace (a form of privilege escalation).

Core component roles

Default ClusterRoleDefault ClusterRoleBindingDescription
system:kube-schedulersystem:kube-scheduler userAllows access to the resources required by the scheduler component.
system:volume-schedulersystem:kube-scheduler userAllows access to the volume resources required by the kube-scheduler component.
system:kube-controller-managersystem:kube-controller-manager userAllows access to the resources required by the controller manager component. The permissions required by individual controllers are detailed in the controller roles.
system:nodeNoneAllows access to resources required by the kubelet, including read access to all secrets, and write access to all pod status objects.

You should use the Node authorizer and NodeRestriction admission plugin instead of the system:node role, and allow granting API access to kubelets based on the Pods scheduled to run on them.

The system:node role only exists for compatibility with Kubernetes clusters upgraded from versions prior to v1.8.

system:node-proxiersystem:kube-proxy userAllows access to the resources required by the kube-proxy component.

Other component roles

Default ClusterRoleDefault ClusterRoleBindingDescription
system:auth-delegatorNoneAllows delegated authentication and authorization checks. This is commonly used by add-on API servers for unified authentication and authorization.
system:heapsterNoneRole for the Heapster component (deprecated).
system:kube-aggregatorNoneRole for the kube-aggregator component.
system:kube-dnskube-dns service account in the kube-system namespaceRole for the kube-dns component.
system:kubelet-api-adminNoneAllows full access to the kubelet API.
system:node-bootstrapperNoneAllows access to the resources required to perform kubelet TLS bootstrapping.
system:node-problem-detectorNoneRole for the node-problem-detector component.
system:persistent-volume-provisionerNoneAllows access to the resources required by most dynamic volume provisioners.
system:monitoringsystem:monitoring groupAllows read access to control-plane monitoring endpoints (i.e. kube-apiserver liveness and readiness endpoints (/healthz, /livez, /readyz), the individual health-check endpoints (/healthz/*, /livez/*, /readyz/*), and /metrics). Note that individual health check endpoints and the metric endpoint may expose sensitive information.

Roles for built-in controllers

The Kubernetes controller manager runs controllers that are built in to the Kubernetes control plane. When invoked with --use-service-account-credentials, kube-controller-manager starts each controller using a separate service account. Corresponding roles exist for each built-in controller, prefixed with system:controller:. If the controller manager is not started with --use-service-account-credentials, it runs all control loops using its own credential, which must be granted all the relevant roles. These roles include:

  • system:controller:attachdetach-controller
  • system:controller:certificate-controller
  • system:controller:clusterrole-aggregation-controller
  • system:controller:cronjob-controller
  • system:controller:daemon-set-controller
  • system:controller:deployment-controller
  • system:controller:disruption-controller
  • system:controller:endpoint-controller
  • system:controller:expand-controller
  • system:controller:generic-garbage-collector
  • system:controller:horizontal-pod-autoscaler
  • system:controller:job-controller
  • system:controller:namespace-controller
  • system:controller:node-controller
  • system:controller:persistent-volume-binder
  • system:controller:pod-garbage-collector
  • system:controller:pv-protection-controller
  • system:controller:pvc-protection-controller
  • system:controller:replicaset-controller
  • system:controller:replication-controller
  • system:controller:resourcequota-controller
  • system:controller:root-ca-cert-publisher
  • system:controller:route-controller
  • system:controller:service-account-controller
  • system:controller:service-controller
  • system:controller:statefulset-controller
  • system:controller:ttl-controller

Privilege escalation prevention and bootstrapping

The RBAC API prevents users from escalating privileges by editing roles or role bindings. Because this is enforced at the API level, it applies even when the RBAC authorizer is not in use.

Restrictions on role creation or update

You can only create/update a role if at least one of the following things is true:

  1. You already have all the permissions contained in the role, at the same scope as the object being modified (cluster-wide for a ClusterRole, within the same namespace or cluster-wide for a Role).
  2. You are granted explicit permission to perform the escalate verb on the roles or clusterroles resource in the rbac.authorization.k8s.io API group.

For example, if user-1 does not have the ability to list Secrets cluster-wide, they cannot create a ClusterRole containing that permission. To allow a user to create/update roles:

  1. Grant them a role that allows them to create/update Role or ClusterRole objects, as desired.
  2. Grant them permission to include specific permissions in the roles they create/update:
    • implicitly, by giving them those permissions (if they attempt to create or modify a Role or ClusterRole with permissions they themselves have not been granted, the API request will be forbidden)
    • or explicitly allow specifying any permission in a Role or ClusterRole by giving them permission to perform the escalate verb on roles or clusterroles resources in the rbac.authorization.k8s.io API group

Restrictions on role binding creation or update

You can only create/update a role binding if you already have all the permissions contained in the referenced role (at the same scope as the role binding) or if you have been authorized to perform the bind verb on the referenced role. For example, if user-1 does not have the ability to list Secrets cluster-wide, they cannot create a ClusterRoleBinding to a role that grants that permission. To allow a user to create/update role bindings:

  1. Grant them a role that allows them to create/update RoleBinding or ClusterRoleBinding objects, as desired.
  2. Grant them permissions needed to bind a particular role:
    • implicitly, by giving them the permissions contained in the role.
    • explicitly, by giving them permission to perform the bind verb on the particular Role (or ClusterRole).

For example, this ClusterRole and RoleBinding would allow user-1 to grant other users the admin, edit, and view roles in the namespace user-1-namespace:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: role-grantor
rules:
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["rolebindings"]
  verbs: ["create"]
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["clusterroles"]
  verbs: ["bind"]
  # omit resourceNames to allow binding any ClusterRole
  resourceNames: ["admin","edit","view"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: role-grantor-binding
  namespace: user-1-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: role-grantor
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: user-1

When bootstrapping the first roles and role bindings, it is necessary for the initial user to grant permissions they do not yet have. To bootstrap initial roles and role bindings:

  • Use a credential with the "system:masters" group, which is bound to the "cluster-admin" super-user role by the default bindings.

Command-line utilities

kubectl create role

Creates a Role object defining permissions within a single namespace. Examples:

  • Create a Role named "pod-reader" that allows users to perform get, watch and list on pods:

    kubectl create role pod-reader --verb=get --verb=list --verb=watch --resource=pods
    
  • Create a Role named "pod-reader" with resourceNames specified:

    kubectl create role pod-reader --verb=get --resource=pods --resource-name=readablepod --resource-name=anotherpod
    
  • Create a Role named "foo" with apiGroups specified:

    kubectl create role foo --verb=get,list,watch --resource=replicasets.apps
    
  • Create a Role named "foo" with subresource permissions:

    kubectl create role foo --verb=get,list,watch --resource=pods,pods/status
    
  • Create a Role named "my-component-lease-holder" with permissions to get/update a resource with a specific name:

    kubectl create role my-component-lease-holder --verb=get,list,watch,update --resource=lease --resource-name=my-component
    

kubectl create clusterrole

Creates a ClusterRole. Examples:

  • Create a ClusterRole named "pod-reader" that allows user to perform get, watch and list on pods:

    kubectl create clusterrole pod-reader --verb=get,list,watch --resource=pods
    
  • Create a ClusterRole named "pod-reader" with resourceNames specified:

    kubectl create clusterrole pod-reader --verb=get --resource=pods --resource-name=readablepod --resource-name=anotherpod
    
  • Create a ClusterRole named "foo" with apiGroups specified:

    kubectl create clusterrole foo --verb=get,list,watch --resource=replicasets.apps
    
  • Create a ClusterRole named "foo" with subresource permissions:

    kubectl create clusterrole foo --verb=get,list,watch --resource=pods,pods/status
    
  • Create a ClusterRole named "foo" with nonResourceURL specified:

    kubectl create clusterrole "foo" --verb=get --non-resource-url=/logs/*
    
  • Create a ClusterRole named "monitoring" with an aggregationRule specified:

    kubectl create clusterrole monitoring --aggregation-rule="rbac.example.com/aggregate-to-monitoring=true"
    

kubectl create rolebinding

Grants a Role or ClusterRole within a specific namespace. Examples:

  • Within the namespace "acme", grant the permissions in the "admin" ClusterRole to a user named "bob":

    kubectl create rolebinding bob-admin-binding --clusterrole=admin --user=bob --namespace=acme
    
  • Within the namespace "acme", grant the permissions in the "view" ClusterRole to the service account in the namespace "acme" named "myapp":

    kubectl create rolebinding myapp-view-binding --clusterrole=view --serviceaccount=acme:myapp --namespace=acme
    
  • Within the namespace "acme", grant the permissions in the "view" ClusterRole to a service account in the namespace "myappnamespace" named "myapp":

    kubectl create rolebinding myappnamespace-myapp-view-binding --clusterrole=view --serviceaccount=myappnamespace:myapp --namespace=acme
    

kubectl create clusterrolebinding

Grants a ClusterRole across the entire cluster (all namespaces). Examples:

  • Across the entire cluster, grant the permissions in the "cluster-admin" ClusterRole to a user named "root":

    kubectl create clusterrolebinding root-cluster-admin-binding --clusterrole=cluster-admin --user=root
    
  • Across the entire cluster, grant the permissions in the "system:node-proxier" ClusterRole to a user named "system:kube-proxy":

    kubectl create clusterrolebinding kube-proxy-binding --clusterrole=system:node-proxier --user=system:kube-proxy
    
  • Across the entire cluster, grant the permissions in the "view" ClusterRole to a service account named "myapp" in the namespace "acme":

    kubectl create clusterrolebinding myapp-view-binding --clusterrole=view --serviceaccount=acme:myapp
    

kubectl auth reconcile

Creates or updates rbac.authorization.k8s.io/v1 API objects from a manifest file.

Missing objects are created, and the containing namespace is created for namespaced objects, if required.

Existing roles are updated to include the permissions in the input objects, and remove extra permissions if --remove-extra-permissions is specified.

Existing bindings are updated to include the subjects in the input objects, and remove extra subjects if --remove-extra-subjects is specified.

Examples:

  • Test applying a manifest file of RBAC objects, displaying changes that would be made:

    kubectl auth reconcile -f my-rbac-rules.yaml --dry-run=client
    
  • Apply a manifest file of RBAC objects, preserving any extra permissions (in roles) and any extra subjects (in bindings):

    kubectl auth reconcile -f my-rbac-rules.yaml
    
  • Apply a manifest file of RBAC objects, removing any extra permissions (in roles) and any extra subjects (in bindings):

    kubectl auth reconcile -f my-rbac-rules.yaml --remove-extra-subjects --remove-extra-permissions
    

ServiceAccount permissions

Default RBAC policies grant scoped permissions to control-plane components, nodes, and controllers, but grant no permissions to service accounts outside the kube-system namespace (beyond discovery permissions given to all authenticated users).

This allows you to grant particular roles to particular ServiceAccounts as needed. Fine-grained role bindings provide greater security, but require more effort to administrate. Broader grants can give unnecessary (and potentially escalating) API access to ServiceAccounts, but are easier to administrate.

In order from most secure to least secure, the approaches are:

  1. Grant a role to an application-specific service account (best practice)

    This requires the application to specify a serviceAccountName in its pod spec, and for the service account to be created (via the API, application manifest, kubectl create serviceaccount, etc.).

    For example, grant read-only permission within "my-namespace" to the "my-sa" service account:

    kubectl create rolebinding my-sa-view \
      --clusterrole=view \
      --serviceaccount=my-namespace:my-sa \
      --namespace=my-namespace
    
  2. Grant a role to the "default" service account in a namespace

    If an application does not specify a serviceAccountName, it uses the "default" service account.

    For example, grant read-only permission within "my-namespace" to the "default" service account:

    kubectl create rolebinding default-view \
      --clusterrole=view \
      --serviceaccount=my-namespace:default \
      --namespace=my-namespace
    

    Many add-ons run as the "default" service account in the kube-system namespace. To allow those add-ons to run with super-user access, grant cluster-admin permissions to the "default" service account in the kube-system namespace.

    kubectl create clusterrolebinding add-on-cluster-admin \
      --clusterrole=cluster-admin \
      --serviceaccount=kube-system:default
    
  3. Grant a role to all service accounts in a namespace

    If you want all applications in a namespace to have a role, no matter what service account they use, you can grant a role to the service account group for that namespace.

    For example, grant read-only permission within "my-namespace" to all service accounts in that namespace:

    kubectl create rolebinding serviceaccounts-view \
      --clusterrole=view \
      --group=system:serviceaccounts:my-namespace \
      --namespace=my-namespace
    
  4. Grant a limited role to all service accounts cluster-wide (discouraged)

    If you don't want to manage permissions per-namespace, you can grant a cluster-wide role to all service accounts.

    For example, grant read-only permission across all namespaces to all service accounts in the cluster:

    kubectl create clusterrolebinding serviceaccounts-view \
      --clusterrole=view \
     --group=system:serviceaccounts
    
  5. Grant super-user access to all service accounts cluster-wide (strongly discouraged)

    If you don't care about partitioning permissions at all, you can grant super-user access to all service accounts.

    kubectl create clusterrolebinding serviceaccounts-cluster-admin \
      --clusterrole=cluster-admin \
      --group=system:serviceaccounts
    

Write access for EndpointSlices and Endpoints

Kubernetes clusters created before Kubernetes v1.22 include write access to EndpointSlices (and Endpoints) in the aggregated "edit" and "admin" roles. As a mitigation for CVE-2021-25740, this access is not part of the aggregated roles in clusters that you create using Kubernetes v1.22 or later.

Existing clusters that have been upgraded to Kubernetes v1.22 will not be subject to this change. The CVE announcement includes guidance for restricting this access in existing clusters.

If you want new clusters to retain this level of access in the aggregated roles, you can create the following ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    kubernetes.io/description: |-
      Add endpoints write permissions to the edit and admin roles. This was
      removed by default in 1.22 because of CVE-2021-25740. See
      https://issue.k8s.io/103675. This can allow writers to direct LoadBalancer
      or Ingress implementations to expose backend IPs that would not otherwise
      be accessible, and can circumvent network policies or security controls
      intended to prevent/isolate access to those backends.
      EndpointSlices were never included in the edit or admin roles, so there
      is nothing to restore for the EndpointSlice API.      
  labels:
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
  name: custom:aggregate-to-edit:endpoints # you can change this if you wish
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["create", "delete", "deletecollection", "patch", "update"]

Upgrading from ABAC

Clusters that originally ran older Kubernetes versions often used permissive ABAC policies, including granting full API access to all service accounts.

Default RBAC policies grant scoped permissions to control-plane components, nodes, and controllers, but grant no permissions to service accounts outside the kube-system namespace (beyond discovery permissions given to all authenticated users).

While far more secure, this can be disruptive to existing workloads expecting to automatically receive API permissions. Here are two approaches for managing this transition:

Parallel authorizers

Run both the RBAC and ABAC authorizers, and specify a policy file that contains the legacy ABAC policy:

--authorization-mode=...,RBAC,ABAC --authorization-policy-file=mypolicy.json

To explain that first command line option in detail: if earlier authorizers, such as Node, deny a request, then the RBAC authorizer attempts to authorize the API request. If RBAC also denies that API request, the ABAC authorizer is then run. This means that any request allowed by either the RBAC or ABAC policies is allowed.

When the kube-apiserver is run with a log level of 5 or higher for the RBAC component (--vmodule=rbac*=5 or --v=5), you can see RBAC denials in the API server log (prefixed with RBAC). You can use that information to determine which roles need to be granted to which users, groups, or service accounts.

Once you have granted roles to service accounts and workloads are running with no RBAC denial messages in the server logs, you can remove the ABAC authorizer.

Permissive RBAC permissions

You can replicate a permissive ABAC policy using RBAC role bindings.

After you have transitioned to use RBAC, you should adjust the access controls for your cluster to ensure that these meet your information security needs.

3.9 - Using ABAC Authorization

Attribute-based access control (ABAC) defines an access control paradigm whereby access rights are granted to users through the use of policies which combine attributes together.

Policy File Format

To enable ABAC mode, specify --authorization-policy-file=SOME_FILENAME and --authorization-mode=ABAC on startup.

The file format is one JSON object per line. There should be no enclosing list or map, only one map per line.

Each line is a "policy object", where each such object is a map with the following properties:

  • Versioning properties:
    • apiVersion, type string; valid values are "abac.authorization.kubernetes.io/v1beta1". Allows versioning and conversion of the policy format.
    • kind, type string: valid values are "Policy". Allows versioning and conversion of the policy format.
  • spec property set to a map with the following properties:
    • Subject-matching properties:
      • user, type string; the user-string from --token-auth-file. If you specify user, it must match the username of the authenticated user.
      • group, type string; if you specify group, it must match one of the groups of the authenticated user. system:authenticated matches all authenticated requests. system:unauthenticated matches all unauthenticated requests.
    • Resource-matching properties:
      • apiGroup, type string; an API group.
        • Ex: apps, networking.k8s.io
        • Wildcard: * matches all API groups.
      • namespace, type string; a namespace.
        • Ex: kube-system
        • Wildcard: * matches all resource requests.
      • resource, type string; a resource type
        • Ex: pods, deployments
        • Wildcard: * matches all resource requests.
    • Non-resource-matching properties:
      • nonResourcePath, type string; non-resource request paths.
        • Ex: /version or /apis
        • Wildcard:
          • * matches all non-resource requests.
          • /foo/* matches all subpaths of /foo/.
    • readonly, type boolean, when true, means that the Resource-matching policy only applies to get, list, and watch operations, Non-resource-matching policy only applies to get operation.

Authorization Algorithm

A request has attributes which correspond to the properties of a policy object.

When a request is received, the attributes are determined. Unknown attributes are set to the zero value of its type (e.g. empty string, 0, false).

A property set to "*" will match any value of the corresponding attribute.

The tuple of attributes is checked for a match against every policy in the policy file. If at least one line matches the request attributes, then the request is authorized (but may fail later validation).

To permit any authenticated user to do something, write a policy with the group property set to "system:authenticated".

To permit any unauthenticated user to do something, write a policy with the group property set to "system:unauthenticated".

To permit a user to do anything, write a policy with the apiGroup, namespace, resource, and nonResourcePath properties set to "*".

Kubectl

Kubectl uses the /api and /apis endpoints of apiserver to discover served resource types, and validates objects sent to the API by create/update operations using schema information located at /openapi/v2.

When using ABAC authorization, those special resources have to be explicitly exposed via the nonResourcePath property in a policy (see examples below):

  • /api, /api/*, /apis, and /apis/* for API version negotiation.
  • /version for retrieving the server version via kubectl version.
  • /swaggerapi/* for create/update operations.

To inspect the HTTP calls involved in a specific kubectl operation you can turn up the verbosity:

kubectl --v=8 version

Examples

  1. Alice can do anything to all resources:

    {"apiVersion": "abac.authorization.kubernetes.io/v1beta1", "kind": "Policy", "spec": {"user": "alice", "namespace": "*", "resource": "*", "apiGroup": "*"}}
    
  2. The Kubelet can read any pods:

    {"apiVersion": "abac.authorization.kubernetes.io/v1beta1", "kind": "Policy", "spec": {"user": "kubelet", "namespace": "*", "resource": "pods", "readonly": true}}
    
  3. The Kubelet can read and write events:

    {"apiVersion": "abac.authorization.kubernetes.io/v1beta1", "kind": "Policy", "spec": {"user": "kubelet", "namespace": "*", "resource": "events"}}
    
  4. Bob can just read pods in namespace "projectCaribou":

    {"apiVersion": "abac.authorization.kubernetes.io/v1beta1", "kind": "Policy", "spec": {"user": "bob", "namespace": "projectCaribou", "resource": "pods", "readonly": true}}
    
  5. Anyone can make read-only requests to all non-resource paths:

    {"apiVersion": "abac.authorization.kubernetes.io/v1beta1", "kind": "Policy", "spec": {"group": "system:authenticated", "readonly": true, "nonResourcePath": "*"}}
    {"apiVersion": "abac.authorization.kubernetes.io/v1beta1", "kind": "Policy", "spec": {"group": "system:unauthenticated", "readonly": true, "nonResourcePath": "*"}}
    

Complete file example

A quick note on service accounts

Every service account has a corresponding ABAC username, and that service account's username is generated according to the naming convention:

system:serviceaccount:<namespace>:<serviceaccountname>

Creating a new namespace leads to the creation of a new service account in the following format:

system:serviceaccount:<namespace>:default

For example, if you wanted to grant the default service account (in the kube-system namespace) full privilege to the API using ABAC, you would add this line to your policy file:

{"apiVersion":"abac.authorization.kubernetes.io/v1beta1","kind":"Policy","spec":{"user":"system:serviceaccount:kube-system:default","namespace":"*","resource":"*","apiGroup":"*"}}

The apiserver will need to be restarted to pick up the new policy lines.

3.10 - Using Node Authorization

Node authorization is a special-purpose authorization mode that specifically authorizes API requests made by kubelets.

Overview

The Node authorizer allows a kubelet to perform API operations. This includes:

Read operations:

  • services
  • endpoints
  • nodes
  • pods
  • secrets, configmaps, persistent volume claims and persistent volumes related to pods bound to the kubelet's node

Write operations:

  • nodes and node status (enable the NodeRestriction admission plugin to limit a kubelet to modify its own node)
  • pods and pod status (enable the NodeRestriction admission plugin to limit a kubelet to modify pods bound to itself)
  • events

Auth-related operations:

  • read/write access to the CertificateSigningRequests API for TLS bootstrapping
  • the ability to create TokenReviews and SubjectAccessReviews for delegated authentication/authorization checks

In future releases, the node authorizer may add or remove permissions to ensure kubelets have the minimal set of permissions required to operate correctly.

In order to be authorized by the Node authorizer, kubelets must use a credential that identifies them as being in the system:nodes group, with a username of system:node:<nodeName>. This group and user name format match the identity created for each kubelet as part of kubelet TLS bootstrapping.

The value of <nodeName> must match precisely the name of the node as registered by the kubelet. By default, this is the host name as provided by hostname, or overridden via the kubelet option --hostname-override. However, when using the --cloud-provider kubelet option, the specific hostname may be determined by the cloud provider, ignoring the local hostname and the --hostname-override option. For specifics about how the kubelet determines the hostname, see the kubelet options reference.

To enable the Node authorizer, start the apiserver with --authorization-mode=Node.

To limit the API objects kubelets are able to write, enable the NodeRestriction admission plugin by starting the apiserver with --enable-admission-plugins=...,NodeRestriction,...

Migration considerations

Kubelets outside the system:nodes group

Kubelets outside the system:nodes group would not be authorized by the Node authorization mode, and would need to continue to be authorized via whatever mechanism currently authorizes them. The node admission plugin would not restrict requests from these kubelets.

Kubelets with undifferentiated usernames

In some deployments, kubelets have credentials that place them in the system:nodes group, but do not identify the particular node they are associated with, because they do not have a username in the system:node:... format. These kubelets would not be authorized by the Node authorization mode, and would need to continue to be authorized via whatever mechanism currently authorizes them.

The NodeRestriction admission plugin would ignore requests from these kubelets, since the default node identifier implementation would not consider that a node identity.

Upgrades from previous versions using RBAC

Upgraded pre-1.7 clusters using RBAC will continue functioning as-is because the system:nodes group binding will already exist.

If a cluster admin wishes to start using the Node authorizer and NodeRestriction admission plugin to limit node access to the API, that can be done non-disruptively:

  1. Enable the Node authorization mode (--authorization-mode=Node,RBAC) and the NodeRestriction admission plugin
  2. Ensure all kubelets' credentials conform to the group/username requirements
  3. Audit apiserver logs to ensure the Node authorizer is not rejecting requests from kubelets (no persistent NODE DENY messages logged)
  4. Delete the system:node cluster role binding

RBAC Node Permissions

In 1.6, the system:node cluster role was automatically bound to the system:nodes group when using the RBAC Authorization mode.

In 1.7, the automatic binding of the system:nodes group to the system:node role is deprecated because the node authorizer accomplishes the same purpose with the benefit of additional restrictions on secret and configmap access. If the Node and RBAC authorization modes are both enabled, the automatic binding of the system:nodes group to the system:node role is not created in 1.7.

In 1.8, the binding will not be created at all.

When using RBAC, the system:node cluster role will continue to be created, for compatibility with deployment methods that bind other users or groups to that role.

3.11 - Mapping PodSecurityPolicies to Pod Security Standards

The tables below enumerate the configuration parameters on PodSecurityPolicy objects, whether the field mutates and/or validates pods, and how the configuration values map to the Pod Security Standards.

For each applicable parameter, the allowed values for the Baseline and Restricted profiles are listed. Anything outside the allowed values for those profiles would fall under the Privileged profile. "No opinion" means all values are allowed under all Pod Security Standards.

For a step-by-step migration guide, see Migrate from PodSecurityPolicy to the Built-In PodSecurity Admission Controller.

PodSecurityPolicy Spec

The fields enumerated in this table are part of the PodSecurityPolicySpec, which is specified under the .spec field path.

Mapping PodSecurityPolicySpec fields to Pod Security Standards
PodSecurityPolicySpecTypePod Security Standards Equivalent
privilegedValidatingBaseline & Restricted: false / undefined / nil
defaultAddCapabilitiesMutating & ValidatingRequirements match allowedCapabilities below.
allowedCapabilitiesValidating

Baseline: subset of

  • AUDIT_WRITE
  • CHOWN
  • DAC_OVERRIDE
  • FOWNER
  • FSETID
  • KILL
  • MKNOD
  • NET_BIND_SERVICE
  • SETFCAP
  • SETGID
  • SETPCAP
  • SETUID
  • SYS_CHROOT

Restricted: empty / undefined / nil OR a list containing only NET_BIND_SERVICE

requiredDropCapabilitiesMutating & Validating

Baseline: no opinion

Restricted: must include ALL

volumesValidating

Baseline: anything except

  • hostPath
  • *

Restricted: subset of

  • configMap
  • csi
  • downwardAPI
  • emptyDir
  • ephemeral
  • persistentVolumeClaim
  • projected
  • secret
hostNetworkValidatingBaseline & Restricted: false / undefined / nil
hostPortsValidatingBaseline & Restricted: undefined / nil / empty
hostPIDValidatingBaseline & Restricted: false / undefined / nil
hostIPCValidatingBaseline & Restricted: false / undefined / nil
seLinuxMutating & Validating

Baseline & Restricted: seLinux.rule is MustRunAs, with the following options

  • user is unset ("" / undefined / nil)
  • role is unset ("" / undefined / nil)
  • type is unset or one of: container_t, container_init_t, container_kvm_t
  • level is anything
runAsUserMutating & Validating

Baseline: Anything

Restricted: rule is MustRunAsNonRoot

runAsGroupMutating (MustRunAs) & ValidatingNo opinion
supplementalGroupsMutating & ValidatingNo opinion
fsGroupMutating & ValidatingNo opinion
readOnlyRootFilesystemMutating & ValidatingNo opinion
defaultAllowPrivilegeEscalationMutatingNo opinion (non-validating)
allowPrivilegeEscalationMutating & Validating

Only mutating if set to false

Baseline: No opinion

Restricted: false

allowedHostPathsValidatingNo opinion (volumes takes precedence)
allowedFlexVolumesValidatingNo opinion (volumes takes precedence)
allowedCSIDriversValidatingNo opinion (volumes takes precedence)
allowedUnsafeSysctlsValidatingBaseline & Restricted: undefined / nil / empty
forbiddenSysctlsValidatingNo opinion
allowedProcMountTypes
(alpha feature)
ValidatingBaseline & Restricted: ["Default"] OR undefined / nil / empty
runtimeClass
 .defaultRuntimeClassName
MutatingNo opinion
runtimeClass
 .allowedRuntimeClassNames
ValidatingNo opinion

PodSecurityPolicy annotations

The annotations enumerated in this table can be specified under .metadata.annotations on the PodSecurityPolicy object.

Mapping PodSecurityPolicy annotations to Pod Security Standards
PSP AnnotationTypePod Security Standards Equivalent
seccomp.security.alpha.kubernetes.io
/defaultProfileName
MutatingNo opinion
seccomp.security.alpha.kubernetes.io
/allowedProfileNames
Validating

Baseline: "runtime/default," (Trailing comma to allow unset)

Restricted: "runtime/default" (No trailing comma)

localhost/* values are also permitted for both Baseline & Restricted.

apparmor.security.beta.kubernetes.io
/defaultProfileName
MutatingNo opinion
apparmor.security.beta.kubernetes.io
/allowedProfileNames
Validating

Baseline: "runtime/default," (Trailing comma to allow unset)

Restricted: "runtime/default" (No trailing comma)

localhost/* values are also permitted for both Baseline & Restricted.

3.12 - Webhook Mode

A WebHook is an HTTP callback: an HTTP POST that occurs when something happens; a simple event-notification via HTTP POST. A web application implementing WebHooks will POST a message to a URL when certain things happen.

When specified, mode Webhook causes Kubernetes to query an outside REST service when determining user privileges.

Configuration File Format

Mode Webhook requires a file for HTTP configuration, specify by the --authorization-webhook-config-file=SOME_FILENAME flag.

The configuration file uses the kubeconfig file format. Within the file "users" refers to the API Server webhook and "clusters" refers to the remote service.

A configuration example which uses HTTPS client auth:

# Kubernetes API version
apiVersion: v1
# kind of the API object
kind: Config
# clusters refers to the remote service.
clusters:
  - name: name-of-remote-authz-service
    cluster:
      # CA for verifying the remote service.
      certificate-authority: /path/to/ca.pem
      # URL of remote service to query. Must use 'https'. May not include parameters.
      server: https://authz.example.com/authorize

# users refers to the API Server's webhook configuration.
users:
  - name: name-of-api-server
    user:
      client-certificate: /path/to/cert.pem # cert for the webhook plugin to use
      client-key: /path/to/key.pem          # key matching the cert

# kubeconfig files require a context. Provide one for the API Server.
current-context: webhook
contexts:
- context:
    cluster: name-of-remote-authz-service
    user: name-of-api-server
  name: webhook

Request Payloads

When faced with an authorization decision, the API Server POSTs a JSON- serialized authorization.k8s.io/v1beta1 SubjectAccessReview object describing the action. This object contains fields describing the user attempting to make the request, and either details about the resource being accessed or requests attributes.

Note that webhook API objects are subject to the same versioning compatibility rules as other Kubernetes API objects. Implementers should be aware of looser compatibility promises for beta objects and check the "apiVersion" field of the request to ensure correct deserialization. Additionally, the API Server must enable the authorization.k8s.io/v1beta1 API extensions group (--runtime-config=authorization.k8s.io/v1beta1=true).

An example request body:

{
  "apiVersion": "authorization.k8s.io/v1beta1",
  "kind": "SubjectAccessReview",
  "spec": {
    "resourceAttributes": {
      "namespace": "kittensandponies",
      "verb": "get",
      "group": "unicorn.example.org",
      "resource": "pods"
    },
    "user": "jane",
    "group": [
      "group1",
      "group2"
    ]
  }
}

The remote service is expected to fill the status field of the request and respond to either allow or disallow access. The response body's spec field is ignored and may be omitted. A permissive response would return:

{
  "apiVersion": "authorization.k8s.io/v1beta1",
  "kind": "SubjectAccessReview",
  "status": {
    "allowed": true
  }
}

For disallowing access there are two methods.

The first method is preferred in most cases, and indicates the authorization webhook does not allow, or has "no opinion" about the request, but if other authorizers are configured, they are given a chance to allow the request. If there are no other authorizers, or none of them allow the request, the request is forbidden. The webhook would return:

{
  "apiVersion": "authorization.k8s.io/v1beta1",
  "kind": "SubjectAccessReview",
  "status": {
    "allowed": false,
    "reason": "user does not have read access to the namespace"
  }
}

The second method denies immediately, short-circuiting evaluation by other configured authorizers. This should only be used by webhooks that have detailed knowledge of the full authorizer configuration of the cluster. The webhook would return:

{
  "apiVersion": "authorization.k8s.io/v1beta1",
  "kind": "SubjectAccessReview",
  "status": {
    "allowed": false,
    "denied": true,
    "reason": "user does not have read access to the namespace"
  }
}

Access to non-resource paths are sent as:

{
  "apiVersion": "authorization.k8s.io/v1beta1",
  "kind": "SubjectAccessReview",
  "spec": {
    "nonResourceAttributes": {
      "path": "/debug",
      "verb": "get"
    },
    "user": "jane",
    "group": [
      "group1",
      "group2"
    ]
  }
}

Non-resource paths include: /api, /apis, /metrics, /logs, /debug, /healthz, /livez, /openapi/v2, /readyz, and /version. Clients require access to /api, /api/*, /apis, /apis/*, and /version to discover what resources and versions are present on the server. Access to other non-resource paths can be disallowed without restricting access to the REST api.

For further documentation refer to the authorization.v1beta1 API objects and webhook.go.

3.13 - Kubelet authentication/authorization

Overview

A kubelet's HTTPS endpoint exposes APIs which give access to data of varying sensitivity, and allow you to perform operations with varying levels of power on the node and within containers.

This document describes how to authenticate and authorize access to the kubelet's HTTPS endpoint.

Kubelet authentication

By default, requests to the kubelet's HTTPS endpoint that are not rejected by other configured authentication methods are treated as anonymous requests, and given a username of system:anonymous and a group of system:unauthenticated.

To disable anonymous access and send 401 Unauthorized responses to unauthenticated requests:

  • start the kubelet with the --anonymous-auth=false flag

To enable X509 client certificate authentication to the kubelet's HTTPS endpoint:

  • start the kubelet with the --client-ca-file flag, providing a CA bundle to verify client certificates with
  • start the apiserver with --kubelet-client-certificate and --kubelet-client-key flags
  • see the apiserver authentication documentation for more details

To enable API bearer tokens (including service account tokens) to be used to authenticate to the kubelet's HTTPS endpoint:

  • ensure the authentication.k8s.io/v1beta1 API group is enabled in the API server
  • start the kubelet with the --authentication-token-webhook and --kubeconfig flags
  • the kubelet calls the TokenReview API on the configured API server to determine user information from bearer tokens

Kubelet authorization

Any request that is successfully authenticated (including an anonymous request) is then authorized. The default authorization mode is AlwaysAllow, which allows all requests.

There are many possible reasons to subdivide access to the kubelet API:

  • anonymous auth is enabled, but anonymous users' ability to call the kubelet API should be limited
  • bearer token auth is enabled, but arbitrary API users' (like service accounts) ability to call the kubelet API should be limited
  • client certificate auth is enabled, but only some of the client certificates signed by the configured CA should be allowed to use the kubelet API

To subdivide access to the kubelet API, delegate authorization to the API server:

  • ensure the authorization.k8s.io/v1beta1 API group is enabled in the API server
  • start the kubelet with the --authorization-mode=Webhook and the --kubeconfig flags
  • the kubelet calls the SubjectAccessReview API on the configured API server to determine whether each request is authorized

The kubelet authorizes API requests using the same request attributes approach as the apiserver.

The verb is determined from the incoming request's HTTP verb:

HTTP verbrequest verb
POSTcreate
GET, HEADget
PUTupdate
PATCHpatch
DELETEdelete

The resource and subresource is determined from the incoming request's path:

Kubelet APIresourcesubresource
/stats/*nodesstats
/metrics/*nodesmetrics
/logs/*nodeslog
/spec/*nodesspec
all othersnodesproxy

The namespace and API group attributes are always an empty string, and the resource name is always the name of the kubelet's Node API object.

When running in this mode, ensure the user identified by the --kubelet-client-certificate and --kubelet-client-key flags passed to the apiserver is authorized for the following attributes:

  • verb=*, resource=nodes, subresource=proxy
  • verb=*, resource=nodes, subresource=stats
  • verb=*, resource=nodes, subresource=log
  • verb=*, resource=nodes, subresource=spec
  • verb=*, resource=nodes, subresource=metrics

3.14 - TLS bootstrapping

In a Kubernetes cluster, the components on the worker nodes - kubelet and kube-proxy - need to communicate with Kubernetes control plane components, specifically kube-apiserver. In order to ensure that communication is kept private, not interfered with, and ensure that each component of the cluster is talking to another trusted component, we strongly recommend using client TLS certificates on nodes.

The normal process of bootstrapping these components, especially worker nodes that need certificates so they can communicate safely with kube-apiserver, can be a challenging process as it is often outside of the scope of Kubernetes and requires significant additional work. This in turn, can make it challenging to initialize or scale a cluster.

In order to simplify the process, beginning in version 1.4, Kubernetes introduced a certificate request and signing API. The proposal can be found here.

This document describes the process of node initialization, how to set up TLS client certificate bootstrapping for kubelets, and how it works.

Initialization Process

When a worker node starts up, the kubelet does the following:

  1. Look for its kubeconfig file
  2. Retrieve the URL of the API server and credentials, normally a TLS key and signed certificate from the kubeconfig file
  3. Attempt to communicate with the API server using the credentials.

Assuming that the kube-apiserver successfully validates the kubelet's credentials, it will treat the kubelet as a valid node, and begin to assign pods to it.

Note that the above process depends upon:

  • Existence of a key and certificate on the local host in the kubeconfig
  • The certificate having been signed by a Certificate Authority (CA) trusted by the kube-apiserver

All of the following are responsibilities of whoever sets up and manages the cluster:

  1. Creating the CA key and certificate
  2. Distributing the CA certificate to the control plane nodes, where kube-apiserver is running
  3. Creating a key and certificate for each kubelet; strongly recommended to have a unique one, with a unique CN, for each kubelet
  4. Signing the kubelet certificate using the CA key
  5. Distributing the kubelet key and signed certificate to the specific node on which the kubelet is running

The TLS Bootstrapping described in this document is intended to simplify, and partially or even completely automate, steps 3 onwards, as these are the most common when initializing or scaling a cluster.

Bootstrap Initialization

In the bootstrap initialization process, the following occurs:

  1. kubelet begins
  2. kubelet sees that it does not have a kubeconfig file
  3. kubelet searches for and finds a bootstrap-kubeconfig file
  4. kubelet reads its bootstrap file, retrieving the URL of the API server and a limited usage "token"
  5. kubelet connects to the API server, authenticates using the token
  6. kubelet now has limited credentials to create and retrieve a certificate signing request (CSR)
  7. kubelet creates a CSR for itself with the signerName set to kubernetes.io/kube-apiserver-client-kubelet
  8. CSR is approved in one of two ways:
    • If configured, kube-controller-manager automatically approves the CSR
    • If configured, an outside process, possibly a person, approves the CSR using the Kubernetes API or via kubectl
  9. Certificate is created for the kubelet
  10. Certificate is issued to the kubelet
  11. kubelet retrieves the certificate
  12. kubelet creates a proper kubeconfig with the key and signed certificate
  13. kubelet begins normal operation
  14. Optional: if configured, kubelet automatically requests renewal of the certificate when it is close to expiry
  15. The renewed certificate is approved and issued, either automatically or manually, depending on configuration.

The rest of this document describes the necessary steps to configure TLS Bootstrapping, and its limitations.

Configuration

To configure for TLS bootstrapping and optional automatic approval, you must configure options on the following components:

  • kube-apiserver
  • kube-controller-manager
  • kubelet
  • in-cluster resources: ClusterRoleBinding and potentially ClusterRole

In addition, you need your Kubernetes Certificate Authority (CA).

Certificate Authority

As without bootstrapping, you will need a Certificate Authority (CA) key and certificate. As without bootstrapping, these will be used to sign the kubelet certificate. As before, it is your responsibility to distribute them to control plane nodes.

For the purposes of this document, we will assume these have been distributed to control plane nodes at /var/lib/kubernetes/ca.pem (certificate) and /var/lib/kubernetes/ca-key.pem (key). We will refer to these as "Kubernetes CA certificate and key".

All Kubernetes components that use these certificates - kubelet, kube-apiserver, kube-controller-manager - assume the key and certificate to be PEM-encoded.

kube-apiserver configuration

The kube-apiserver has several requirements to enable TLS bootstrapping:

  • Recognizing CA that signs the client certificate
  • Authenticating the bootstrapping kubelet to the system:bootstrappers group
  • Authorize the bootstrapping kubelet to create a certificate signing request (CSR)

Recognizing client certificates

This is normal for all client certificate authentication. If not already set, add the --client-ca-file=FILENAME flag to the kube-apiserver command to enable client certificate authentication, referencing a certificate authority bundle containing the signing certificate, for example --client-ca-file=/var/lib/kubernetes/ca.pem.

Initial bootstrap authentication

In order for the bootstrapping kubelet to connect to kube-apiserver and request a certificate, it must first authenticate to the server. You can use any authenticator that can authenticate the kubelet.

While any authentication strategy can be used for the kubelet's initial bootstrap credentials, the following two authenticators are recommended for ease of provisioning.

  1. Bootstrap Tokens
  2. Token authentication file

Using bootstrap tokens is a simpler and more easily managed method to authenticate kubelets, and does not require any additional flags when starting kube-apiserver.

Whichever method you choose, the requirement is that the kubelet be able to authenticate as a user with the rights to:

  1. create and retrieve CSRs
  2. be automatically approved to request node client certificates, if automatic approval is enabled.

A kubelet authenticating using bootstrap tokens is authenticated as a user in the group system:bootstrappers, which is the standard method to use.

As this feature matures, you should ensure tokens are bound to a Role Based Access Control (RBAC) policy which limits requests (using the bootstrap token) strictly to client requests related to certificate provisioning. With RBAC in place, scoping the tokens to a group allows for great flexibility. For example, you could disable a particular bootstrap group's access when you are done provisioning the nodes.

Bootstrap tokens

Bootstrap tokens are described in detail here. These are tokens that are stored as secrets in the Kubernetes cluster, and then issued to the individual kubelet. You can use a single token for an entire cluster, or issue one per worker node.

The process is two-fold:

  1. Create a Kubernetes secret with the token ID, secret and scope(s).
  2. Issue the token to the kubelet

From the kubelet's perspective, one token is like another and has no special meaning. From the kube-apiserver's perspective, however, the bootstrap token is special. Due to its type, namespace and name, kube-apiserver recognizes it as a special token, and grants anyone authenticating with that token special bootstrap rights, notably treating them as a member of the system:bootstrappers group. This fulfills a basic requirement for TLS bootstrapping.

The details for creating the secret are available here.

If you want to use bootstrap tokens, you must enable it on kube-apiserver with the flag:

--enable-bootstrap-token-auth=true

Token authentication file

kube-apiserver has the ability to accept tokens as authentication. These tokens are arbitrary but should represent at least 128 bits of entropy derived from a secure random number generator (such as /dev/urandom on most modern Linux systems). There are multiple ways you can generate a token. For example:

head -c 16 /dev/urandom | od -An -t x | tr -d ' '

This will generate tokens that look like 02b50b05283e98dd0fd71db496ef01e8.

The token file should look like the following example, where the first three values can be anything and the quoted group name should be as depicted:

02b50b05283e98dd0fd71db496ef01e8,kubelet-bootstrap,10001,"system:bootstrappers"

Add the --token-auth-file=FILENAME flag to the kube-apiserver command (in your systemd unit file perhaps) to enable the token file. See docs here for further details.

Authorize kubelet to create CSR

Now that the bootstrapping node is authenticated as part of the system:bootstrappers group, it needs to be authorized to create a certificate signing request (CSR) as well as retrieve it when done. Fortunately, Kubernetes ships with a ClusterRole with precisely these (and only these) permissions, system:node-bootstrapper.

To do this, you only need to create a ClusterRoleBinding that binds the system:bootstrappers group to the cluster role system:node-bootstrapper.

# enable bootstrapping nodes to create CSR
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: create-csrs-for-bootstrapping
subjects:
- kind: Group
  name: system:bootstrappers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: system:node-bootstrapper
  apiGroup: rbac.authorization.k8s.io

kube-controller-manager configuration

While the apiserver receives the requests for certificates from the kubelet and authenticates those requests, the controller-manager is responsible for issuing actual signed certificates.

The controller-manager performs this function via a certificate-issuing control loop. This takes the form of a cfssl local signer using assets on disk. Currently, all certificates issued have one year validity and a default set of key usages.

In order for the controller-manager to sign certificates, it needs the following:

  • access to the "Kubernetes CA key and certificate" that you created and distributed
  • enabling CSR signing

Access to key and certificate

As described earlier, you need to create a Kubernetes CA key and certificate, and distribute it to the control plane nodes. These will be used by the controller-manager to sign the kubelet certificates.

Since these signed certificates will, in turn, be used by the kubelet to authenticate as a regular kubelet to kube-apiserver, it is important that the CA provided to the controller-manager at this stage also be trusted by kube-apiserver for authentication. This is provided to kube-apiserver with the flag --client-ca-file=FILENAME (for example, --client-ca-file=/var/lib/kubernetes/ca.pem), as described in the kube-apiserver configuration section.

To provide the Kubernetes CA key and certificate to kube-controller-manager, use the following flags:

--cluster-signing-cert-file="/etc/path/to/kubernetes/ca/ca.crt" --cluster-signing-key-file="/etc/path/to/kubernetes/ca/ca.key"

For example:

--cluster-signing-cert-file="/var/lib/kubernetes/ca.pem" --cluster-signing-key-file="/var/lib/kubernetes/ca-key.pem"

The validity duration of signed certificates can be configured with flag:

--cluster-signing-duration

Approval

In order to approve CSRs, you need to tell the controller-manager that it is acceptable to approve them. This is done by granting RBAC permissions to the correct group.

There are two distinct sets of permissions:

  • nodeclient: If a node is creating a new certificate for a node, then it does not have a certificate yet. It is authenticating using one of the tokens listed above, and thus is part of the group system:bootstrappers.
  • selfnodeclient: If a node is renewing its certificate, then it already has a certificate (by definition), which it uses continuously to authenticate as part of the group system:nodes.

To enable the kubelet to request and receive a new certificate, create a ClusterRoleBinding that binds the group in which the bootstrapping node is a member system:bootstrappers to the ClusterRole that grants it permission, system:certificates.k8s.io:certificatesigningrequests:nodeclient:

# Approve all CSRs for the group "system:bootstrappers"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: auto-approve-csrs-for-group
subjects:
- kind: Group
  name: system:bootstrappers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: system:certificates.k8s.io:certificatesigningrequests:nodeclient
  apiGroup: rbac.authorization.k8s.io

To enable the kubelet to renew its own client certificate, create a ClusterRoleBinding that binds the group in which the fully functioning node is a member system:nodes to the ClusterRole that grants it permission, system:certificates.k8s.io:certificatesigningrequests:selfnodeclient:

# Approve renewal CSRs for the group "system:nodes"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: auto-approve-renewals-for-nodes
subjects:
- kind: Group
  name: system:nodes
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: system:certificates.k8s.io:certificatesigningrequests:selfnodeclient
  apiGroup: rbac.authorization.k8s.io

The csrapproving controller that ships as part of kube-controller-manager and is enabled by default. The controller uses the SubjectAccessReview API to determine if a given user is authorized to request a CSR, then approves based on the authorization outcome. To prevent conflicts with other approvers, the built-in approver doesn't explicitly deny CSRs. It only ignores unauthorized requests. The controller also prunes expired certificates as part of garbage collection.

kubelet configuration

Finally, with the control plane nodes properly set up and all of the necessary authentication and authorization in place, we can configure the kubelet.

The kubelet requires the following configuration to bootstrap:

  • A path to store the key and certificate it generates (optional, can use default)
  • A path to a kubeconfig file that does not yet exist; it will place the bootstrapped config file here
  • A path to a bootstrap kubeconfig file to provide the URL for the server and bootstrap credentials, e.g. a bootstrap token
  • Optional: instructions to rotate certificates

The bootstrap kubeconfig should be in a path available to the kubelet, for example /var/lib/kubelet/bootstrap-kubeconfig.

Its format is identical to a normal kubeconfig file. A sample file might look as follows:

apiVersion: v1
kind: Config
clusters:
- cluster:
    certificate-authority: /var/lib/kubernetes/ca.pem
    server: https://my.server.example.com:6443
  name: bootstrap
contexts:
- context:
    cluster: bootstrap
    user: kubelet-bootstrap
  name: bootstrap
current-context: bootstrap
preferences: {}
users:
- name: kubelet-bootstrap
  user:
    token: 07401b.f395accd246ae52d

The important elements to note are:

  • certificate-authority: path to a CA file, used to validate the server certificate presented by kube-apiserver
  • server: URL to kube-apiserver
  • token: the token to use

The format of the token does not matter, as long as it matches what kube-apiserver expects. In the above example, we used a bootstrap token. As stated earlier, any valid authentication method can be used, not only tokens.

Because the bootstrap kubeconfig is a standard kubeconfig, you can use kubectl to generate it. To create the above example file:

kubectl config --kubeconfig=/var/lib/kubelet/bootstrap-kubeconfig set-cluster bootstrap --server='https://my.server.example.com:6443' --certificate-authority=/var/lib/kubernetes/ca.pem
kubectl config --kubeconfig=/var/lib/kubelet/bootstrap-kubeconfig set-credentials kubelet-bootstrap --token=07401b.f395accd246ae52d
kubectl config --kubeconfig=/var/lib/kubelet/bootstrap-kubeconfig set-context bootstrap --user=kubelet-bootstrap --cluster=bootstrap
kubectl config --kubeconfig=/var/lib/kubelet/bootstrap-kubeconfig use-context bootstrap

To indicate to the kubelet to use the bootstrap kubeconfig, use the following kubelet flag:

--bootstrap-kubeconfig="/var/lib/kubelet/bootstrap-kubeconfig" --kubeconfig="/var/lib/kubelet/kubeconfig"

When starting the kubelet, if the file specified via --kubeconfig does not exist, the bootstrap kubeconfig specified via --bootstrap-kubeconfig is used to request a client certificate from the API server. On approval of the certificate request and receipt back by the kubelet, a kubeconfig file referencing the generated key and obtained certificate is written to the path specified by --kubeconfig. The certificate and key file will be placed in the directory specified by --cert-dir.

Client and Serving Certificates

All of the above relate to kubelet client certificates, specifically, the certificates a kubelet uses to authenticate to kube-apiserver.

A kubelet also can use serving certificates. The kubelet itself exposes an https endpoint for certain features. To secure these, the kubelet can do one of:

  • use provided key and certificate, via the --tls-private-key-file and --tls-cert-file flags
  • create self-signed key and certificate, if a key and certificate are not provided
  • request serving certificates from the cluster server, via the CSR API

The client certificate provided by TLS bootstrapping is signed, by default, for client auth only, and thus cannot be used as serving certificates, or server auth.

However, you can enable its server certificate, at least partially, via certificate rotation.

Certificate Rotation

Kubernetes v1.8 and higher kubelet implements beta features for enabling rotation of its client and/or serving certificates. These can be enabled through the respective RotateKubeletClientCertificate and RotateKubeletServerCertificate feature flags on the kubelet and are enabled by default.

RotateKubeletClientCertificate causes the kubelet to rotate its client certificates by creating new CSRs as its existing credentials expire. To enable this feature pass the following flag to the kubelet:

--rotate-certificates

RotateKubeletServerCertificate causes the kubelet both to request a serving certificate after bootstrapping its client credentials and to rotate that certificate. To enable this feature pass the following flag to the kubelet:

--rotate-server-certificates

Other authenticating components

All of TLS bootstrapping described in this document relates to the kubelet. However, other components may need to communicate directly with kube-apiserver. Notable is kube-proxy, which is part of the Kubernetes node components and runs on every node, but may also include other components such as monitoring or networking.

Like the kubelet, these other components also require a method of authenticating to kube-apiserver. You have several options for generating these credentials:

  • The old way: Create and distribute certificates the same way you did for kubelet before TLS bootstrapping
  • DaemonSet: Since the kubelet itself is loaded on each node, and is sufficient to start base services, you can run kube-proxy and other node-specific services not as a standalone process, but rather as a daemonset in the kube-system namespace. Since it will be in-cluster, you can give it a proper service account with appropriate permissions to perform its activities. This may be the simplest way to configure such services.

kubectl approval

CSRs can be approved outside of the approval flows built into the controller manager.

The signing controller does not immediately sign all certificate requests. Instead, it waits until they have been flagged with an "Approved" status by an appropriately-privileged user. This flow is intended to allow for automated approval handled by an external approval controller or the approval controller implemented in the core controller-manager. However cluster administrators can also manually approve certificate requests using kubectl. An administrator can list CSRs with kubectl get csr and describe one in detail with kubectl describe csr <name>. An administrator can approve or deny a CSR with kubectl certificate approve <name> and kubectl certificate deny <name>.

4 - Well-Known Labels, Annotations and Taints

Kubernetes reserves all labels and annotations in the kubernetes.io and k8s.io namespaces.

This document serves both as a reference to the values and as a coordination point for assigning values.

Labels, annotations and taints used on API objects

app.kubernetes.io/component

Example: app.kubernetes.io/component: "database"

Used on: All Objects (typically used on workload resources).

The component within the architecture.

One of the recommended labels.

app.kubernetes.io/created-by (deprecated)

Example: app.kubernetes.io/created-by: "controller-manager"

Used on: All Objects (typically used on workload resources).

The controller/user who created this resource.

app.kubernetes.io/instance

Example: app.kubernetes.io/instance: "mysql-abcxzy"

Used on: All Objects (typically used on workload resources).

A unique name identifying the instance of an application. To assign a non-unique name, use app.kubernetes.io/name.

One of the recommended labels.

app.kubernetes.io/managed-by

Example: app.kubernetes.io/managed-by: "helm"

Used on: All Objects (typically used on workload resources).

The tool being used to manage the operation of an application.

One of the recommended labels.

app.kubernetes.io/name

Example: app.kubernetes.io/name: "mysql"

Used on: All Objects (typically used on workload resources).

The name of the application.

One of the recommended labels.

app.kubernetes.io/part-of

Example: app.kubernetes.io/part-of: "wordpress"

Used on: All Objects (typically used on workload resources).

The name of a higher-level application this one is part of.

One of the recommended labels.

app.kubernetes.io/version

Example: app.kubernetes.io/version: "5.7.21"

Used on: All Objects (typically used on workload resources).

The current version of the application.

Common forms of values include:

One of the recommended labels.

cluster-autoscaler.kubernetes.io/safe-to-evict

Example: cluster-autoscaler.kubernetes.io/safe-to-evict: "true"

Used on: Pod

When this annotation is set to "true", the cluster autoscaler is allowed to evict a Pod even if other rules would normally prevent that. The cluster autoscaler never evicts Pods that have this annotation explicitly set to "false"; you could set that on an important Pod that you want to keep running. If this annotation is not set then the cluster autoscaler follows its Pod-level behavior.

kubernetes.io/arch

Example: kubernetes.io/arch: "amd64"

Used on: Node

The Kubelet populates this with runtime.GOARCH as defined by Go. This can be handy if you are mixing arm and x86 nodes.

kubernetes.io/os

Example: kubernetes.io/os: "linux"

Used on: Node

The Kubelet populates this with runtime.GOOS as defined by Go. This can be handy if you are mixing operating systems in your cluster (for example: mixing Linux and Windows nodes).

kubernetes.io/metadata.name

Example: kubernetes.io/metadata.name: "mynamespace"

Used on: Namespaces

The Kubernetes API server (part of the control plane) sets this label on all namespaces. The label value is set to the name of the namespace. You can't change this label's value.

This is useful if you want to target a specific namespace with a label selector.

kubernetes.io/limit-ranger

Example: kubernetes.io/limit-ranger: "LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx"

Used on: Pod

Kubernetes by default doesn't provide any resource limit, that means unless you explicitly define limits, your container can consume unlimited CPU and memory. You can define a default request or default limit for pods. You do this by creating a LimitRange in the relevant namespace. Pods deployed after you define a LimitRange will have these limits applied to them. The annotation kubernetes.io/limit-ranger records that resource defaults were specified for the Pod, and they were applied successfully. For more details, read about LimitRanges.

beta.kubernetes.io/arch (deprecated)

This label has been deprecated. Please use kubernetes.io/arch instead.

beta.kubernetes.io/os (deprecated)

This label has been deprecated. Please use kubernetes.io/os instead.

kubernetes.io/hostname

Example: kubernetes.io/hostname: "ip-172-20-114-199.ec2.internal"

Used on: Node

The Kubelet populates this label with the hostname. Note that the hostname can be changed from the "actual" hostname by passing the --hostname-override flag to the kubelet.

This label is also used as part of the topology hierarchy. See topology.kubernetes.io/zone for more information.

kubernetes.io/change-cause

Example: kubernetes.io/change-cause: "kubectl edit --record deployment foo"

Used on: All Objects

This annotation is a best guess at why something was changed.

It is populated when adding --record to a kubectl command that may change an object.

kubernetes.io/description

Example: kubernetes.io/description: "Description of K8s object."

Used on: All Objects

This annotation is used for describing specific behaviour of given object.

kubernetes.io/enforce-mountable-secrets

Example: kubernetes.io/enforce-mountable-secrets: "true"

Used on: ServiceAccount

The value for this annotation must be true to take effect. This annotation indicates that pods running as this service account may only reference Secret API objects specified in the service account's secrets field.

node.kubernetes.io/exclude-from-external-load-balancer

Example: node.kubernetes.io/exclude-from-external-load-balancer

Used on: Node

Kubernetes automatically enables the ServiceNodeExclusion feature gate on the clusters it creates. With this feature gate enabled on a cluster, you can add labels to particular worker nodes to exclude them from the list of backend servers. The following command can be used to exclude a worker node from the list of backend servers in a backend set- kubectl label nodes <node-name> node.kubernetes.io/exclude-from-external-load-balancers=true

controller.kubernetes.io/pod-deletion-cost

Example: controller.kubernetes.io/pod-deletion-cost: "10"

Used on: Pod

This annotation is used to set Pod Deletion Cost which allows users to influence ReplicaSet downscaling order. The annotation parses into an int32 type.

cluster-autoscaler.kubernetes.io/enable-ds-eviction

Example: cluster-autoscaler.kubernetes.io/enable-ds-eviction: "true"

Used on: Pod

This annotation controls whether a DaemonSet pod should be evicted by a ClusterAutoscaler. This annotation needs to be specified on DaemonSet pods in a DaemonSet manifest. When this annotation is set to "true", the ClusterAutoscaler is allowed to evict a DaemonSet Pod, even if other rules would normally prevent that. To disallow the ClusterAutoscaler from evicting DaemonSet pods, you can set this annotation to "false" for important DaemonSet pods. If this annotation is not set, then the Cluster Autoscaler follows its overall behaviour (i.e evict the DaemonSets based on its configuration).

kubernetes.io/ingress-bandwidth

Example: kubernetes.io/ingress-bandwidth: 10M

Used on: Pod

You can apply quality-of-service traffic shaping to a pod and effectively limit its available bandwidth. Ingress traffic (to the pod) is handled by shaping queued packets to effectively handle data. To limit the bandwidth on a pod, write an object definition JSON file and specify the data traffic speed using kubernetes.io/ingress-bandwidth annotation. The unit used for specifying ingress rate is bits per second, as a Quantity. For example, 10M means 10 megabits per second.

kubernetes.io/egress-bandwidth

Example: kubernetes.io/egress-bandwidth: 10M

Used on: Pod

Egress traffic (from the pod) is handled by policing, which simply drops packets in excess of the configured rate. The limits you place on a pod do not affect the bandwidth of other pods. To limit the bandwidth on a pod, write an object definition JSON file and specify the data traffic speed using kubernetes.io/egress-bandwidth annotation. The unit used for specifying egress rate is bits per second, as a Quantity. For example, 10M means 10 megabits per second.

beta.kubernetes.io/instance-type (deprecated)

node.kubernetes.io/instance-type

Example: node.kubernetes.io/instance-type: "m3.medium"

Used on: Node

The Kubelet populates this with the instance type as defined by the cloudprovider. This will be set only if you are using a cloudprovider. This setting is handy if you want to target certain workloads to certain instance types, but typically you want to rely on the Kubernetes scheduler to perform resource-based scheduling. You should aim to schedule based on properties rather than on instance types (for example: require a GPU, instead of requiring a g2.2xlarge).

failure-domain.beta.kubernetes.io/region (deprecated)

See topology.kubernetes.io/region.

failure-domain.beta.kubernetes.io/zone (deprecated)

See topology.kubernetes.io/zone.

statefulset.kubernetes.io/pod-name

Example:

statefulset.kubernetes.io/pod-name: "mystatefulset-7"

When a StatefulSet controller creates a Pod for the StatefulSet, the control plane sets this label on that Pod. The value of the label is the name of the Pod being created.

See Pod Name Label in the StatefulSet topic for more details.

scheduler.alpha.kubernetes.io/node-selector

Example: scheduler.alpha.kubernetes.io/node-selector: "name-of-node-selector"

Used on: Namespace

The PodNodeSelector uses this annotation key to assign node selectors to pods in namespaces.

topology.kubernetes.io/region

Example:

topology.kubernetes.io/region: "us-east-1"

See topology.kubernetes.io/zone.

topology.kubernetes.io/zone

Example:

topology.kubernetes.io/zone: "us-east-1c"

Used on: Node, PersistentVolume

On Node: The kubelet or the external cloud-controller-manager populates this with the information as provided by the cloudprovider. This will be set only if you are using a cloudprovider. However, you should consider setting this on nodes if it makes sense in your topology.

On PersistentVolume: topology-aware volume provisioners will automatically set node affinity constraints on PersistentVolumes.

A zone represents a logical failure domain. It is common for Kubernetes clusters to span multiple zones for increased availability. While the exact definition of a zone is left to infrastructure implementations, common properties of a zone include very low network latency within a zone, no-cost network traffic within a zone, and failure independence from other zones. For example, nodes within a zone might share a network switch, but nodes in different zones should not.

A region represents a larger domain, made up of one or more zones. It is uncommon for Kubernetes clusters to span multiple regions, While the exact definition of a zone or region is left to infrastructure implementations, common properties of a region include higher network latency between them than within them, non-zero cost for network traffic between them, and failure independence from other zones or regions. For example, nodes within a region might share power infrastructure (e.g. a UPS or generator), but nodes in different regions typically would not.

Kubernetes makes a few assumptions about the structure of zones and regions:

  1. regions and zones are hierarchical: zones are strict subsets of regions and no zone can be in 2 regions
  2. zone names are unique across regions; for example region "africa-east-1" might be comprised of zones "africa-east-1a" and "africa-east-1b"

It should be safe to assume that topology labels do not change. Even though labels are strictly mutable, consumers of them can assume that a given node is not going to be moved between zones without being destroyed and recreated.

Kubernetes can use this information in various ways. For example, the scheduler automatically tries to spread the Pods in a ReplicaSet across nodes in a single-zone cluster (to reduce the impact of node failures, see kubernetes.io/hostname). With multiple-zone clusters, this spreading behavior also applies to zones (to reduce the impact of zone failures). This is achieved via SelectorSpreadPriority.

SelectorSpreadPriority is a best effort placement. If the zones in your cluster are heterogeneous (for example: different numbers of nodes, different types of nodes, or different pod resource requirements), this placement might prevent equal spreading of your Pods across zones. If desired, you can use homogenous zones (same number and types of nodes) to reduce the probability of unequal spreading.

The scheduler (through the VolumeZonePredicate predicate) also will ensure that Pods, that claim a given volume, are only placed into the same zone as that volume. Volumes cannot be attached across zones.

If PersistentVolumeLabel does not support automatic labeling of your PersistentVolumes, you should consider adding the labels manually (or adding support for PersistentVolumeLabel). With PersistentVolumeLabel, the scheduler prevents Pods from mounting volumes in a different zone. If your infrastructure doesn't have this constraint, you don't need to add the zone labels to the volumes at all.

volume.beta.kubernetes.io/storage-provisioner (deprecated)

Example: volume.beta.kubernetes.io/storage-provisioner: "k8s.io/minikube-hostpath"

Used on: PersistentVolumeClaim

This annotation has been deprecated.

volume.beta.kubernetes.io/mount-options (deprecated)

Example : volume.beta.kubernetes.io/mount-options: "ro,soft"

Used on: PersistentVolume

A Kubernetes administrator can specify additional mount options for when a PersistentVolume is mounted on a node.

This annotation has been deprecated.

volume.kubernetes.io/storage-provisioner

Used on: PersistentVolumeClaim

This annotation will be added to dynamic provisioning required PVC.

node.kubernetes.io/windows-build

Example: node.kubernetes.io/windows-build: "10.0.17763"

Used on: Node

When the kubelet is running on Microsoft Windows, it automatically labels its node to record the version of Windows Server in use.

The label's value is in the format "MajorVersion.MinorVersion.BuildNumber".

service.kubernetes.io/headless

Example: service.kubernetes.io/headless: ""

Used on: Service

The control plane adds this label to an Endpoints object when the owning Service is headless.

kubernetes.io/service-name

Example: kubernetes.io/service-name: "my-website"

Used on: EndpointSlice

Kubernetes associates EndpointSlices with Services using this label.

This label records the name of the Service that the EndpointSlice is backing. All EndpointSlices should have this label set to the name of their associated Service.

kubernetes.io/service-account.name

Example: kubernetes.io/service-account.name: "sa-name"

Used on: Secret

This annotation records the name of the ServiceAccount that the token (stored in the Secret of type kubernetes.io/service-account-token) represents.

kubernetes.io/service-account.uid

Example: kubernetes.io/service-account.uid: da68f9c6-9d26-11e7-b84e-002dc52800da

Used on: Secret

This annotation records the unique ID of the ServiceAccount that the token (stored in the Secret of type kubernetes.io/service-account-token) represents.

endpointslice.kubernetes.io/managed-by

Example: endpointslice.kubernetes.io/managed-by: "controller"

Used on: EndpointSlices

The label is used to indicate the controller or entity that manages an EndpointSlice. This label aims to enable different EndpointSlice objects to be managed by different controllers or entities within the same cluster.

endpointslice.kubernetes.io/skip-mirror

Example: endpointslice.kubernetes.io/skip-mirror: "true"

Used on: Endpoints

The label can be set to "true" on an Endpoints resource to indicate that the EndpointSliceMirroring controller should not mirror this resource with EndpointSlices.

service.kubernetes.io/service-proxy-name

Example: service.kubernetes.io/service-proxy-name: "foo-bar"

Used on: Service

The kube-proxy has this label for custom proxy, which delegates service control to custom proxy.

experimental.windows.kubernetes.io/isolation-type (deprecated)

Example: experimental.windows.kubernetes.io/isolation-type: "hyperv"

Used on: Pod

The annotation is used to run Windows containers with Hyper-V isolation. To use Hyper-V isolation feature and create a Hyper-V isolated container, the kubelet should be started with feature gates HyperVContainer=true and the Pod should include the annotation experimental.windows.kubernetes.io/isolation-type: hyperv.

ingressclass.kubernetes.io/is-default-class

Example: ingressclass.kubernetes.io/is-default-class: "true"

Used on: IngressClass

When a single IngressClass resource has this annotation set to "true", new Ingress resource without a class specified will be assigned this default class.

kubernetes.io/ingress.class (deprecated)

storageclass.kubernetes.io/is-default-class

Example: storageclass.kubernetes.io/is-default-class: "true"

Used on: StorageClass

When a single StorageClass resource has this annotation set to "true", new PersistentVolumeClaim resource without a class specified will be assigned this default class.

alpha.kubernetes.io/provided-node-ip

Example: alpha.kubernetes.io/provided-node-ip: "10.0.0.1"

Used on: Node

The kubelet can set this annotation on a Node to denote its configured IPv4 address.

When kubelet is started with the --cloud-provider flag set to any value (includes both external and legacy in-tree cloud providers), it sets this annotation on the Node to denote an IP address set from the command line flag (--node-ip). This IP is verified with the cloud provider as valid by the cloud-controller-manager.

batch.kubernetes.io/job-completion-index

Example: batch.kubernetes.io/job-completion-index: "3"

Used on: Pod

The Job controller in the kube-controller-manager sets this annotation for Pods created with Indexed completion mode.

kubectl.kubernetes.io/default-container

Example: kubectl.kubernetes.io/default-container: "front-end-app"

The value of the annotation is the container name that is default for this Pod. For example, kubectl logs or kubectl exec without -c or --container flag will use this default container.

endpoints.kubernetes.io/over-capacity

Example: endpoints.kubernetes.io/over-capacity:truncated

Used on: Endpoints

The control plane adds this annotation to an Endpoints object if the associated Service has more than 1000 backing endpoints. The annotation indicates that the Endpoints object is over capacity and the number of endpoints has been truncated to 1000.

If the number of backend endpoints falls below 1000, the control plane removes this annotation.

batch.kubernetes.io/job-tracking

Example: batch.kubernetes.io/job-tracking: ""

Used on: Jobs

The presence of this annotation on a Job indicates that the control plane is tracking the Job status using finalizers. You should not manually add or remove this annotation.

scheduler.alpha.kubernetes.io/defaultTolerations

Example: scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Equal", "value": "value1", "effect": "NoSchedule", "key": "dedicated-node"}]'

Used on: Namespace

This annotation requires the PodTolerationRestriction admission controller to be enabled. This annotation key allows assigning tolerations to a namespace and any new pods created in this namespace would get these tolerations added.

scheduler.alpha.kubernetes.io/preferAvoidPods (deprecated)

Used on: Nodes

This annotation requires the NodePreferAvoidPods scheduling plugin to be enabled. The plugin is deprecated since Kubernetes 1.22. Use Taints and Tolerations instead.

The taints listed below are always used on Nodes

node.kubernetes.io/not-ready

Example: node.kubernetes.io/not-ready: "NoExecute"

The node controller detects whether a node is ready by monitoring its health and adds or removes this taint accordingly.

node.kubernetes.io/unreachable

Example: node.kubernetes.io/unreachable: "NoExecute"

The node controller adds the taint to a node corresponding to the NodeCondition Ready being Unknown.

node.kubernetes.io/unschedulable

Example: node.kubernetes.io/unschedulable: "NoSchedule"

The taint will be added to a node when initializing the node to avoid race condition.

node.kubernetes.io/memory-pressure

Example: node.kubernetes.io/memory-pressure: "NoSchedule"

The kubelet detects memory pressure based on memory.available and allocatableMemory.available observed on a Node. The observed values are then compared to the corresponding thresholds that can be set on the kubelet to determine if the Node condition and taint should be added/removed.

node.kubernetes.io/disk-pressure

Example: node.kubernetes.io/disk-pressure :"NoSchedule"

The kubelet detects disk pressure based on imagefs.available, imagefs.inodesFree, nodefs.available and nodefs.inodesFree(Linux only) observed on a Node. The observed values are then compared to the corresponding thresholds that can be set on the kubelet to determine if the Node condition and taint should be added/removed.

node.kubernetes.io/network-unavailable

Example: node.kubernetes.io/network-unavailable: "NoSchedule"

This is initially set by the kubelet when the cloud provider used indicates a requirement for additional network configuration. Only when the route on the cloud is configured properly will the taint be removed by the cloud provider.

node.kubernetes.io/pid-pressure

Example: node.kubernetes.io/pid-pressure: "NoSchedule"

The kubelet checks D-value of the size of /proc/sys/kernel/pid_max and the PIDs consumed by Kubernetes on a node to get the number of available PIDs that referred to as the pid.available metric. The metric is then compared to the corresponding threshold that can be set on the kubelet to determine if the node condition and taint should be added/removed.

node.kubernetes.io/out-of-service

Example: node.kubernetes.io/out-of-service:NoExecute

A user can manually add the taint to a Node marking it out-of-service. If the NodeOutOfServiceVolumeDetach feature gate is enabled on kube-controller-manager, and a Node is marked out-of-service with this taint, the pods on the node will be forcefully deleted if there are no matching tolerations on it and volume detach operations for the pods terminating on the node will happen immediately. This allows the Pods on the out-of-service node to recover quickly on a different node.

node.cloudprovider.kubernetes.io/uninitialized

Example: node.cloudprovider.kubernetes.io/uninitialized: "NoSchedule"

Sets this taint on a node to mark it as unusable, when kubelet is started with the "external" cloud provider, until a controller from the cloud-controller-manager initializes this node, and then removes the taint.

node.cloudprovider.kubernetes.io/shutdown

Example: node.cloudprovider.kubernetes.io/shutdown: "NoSchedule"

If a Node is in a cloud provider specified shutdown state, the Node gets tainted accordingly with node.cloudprovider.kubernetes.io/shutdown and the taint effect of NoSchedule.

pod-security.kubernetes.io/enforce

Example: pod-security.kubernetes.io/enforce: "baseline"

Used on: Namespace

Value must be one of privileged, baseline, or restricted which correspond to Pod Security Standard levels. Specifically, the enforce label prohibits the creation of any Pod in the labeled Namespace which does not meet the requirements outlined in the indicated level.

See Enforcing Pod Security at the Namespace Level for more information.

pod-security.kubernetes.io/enforce-version

Example: pod-security.kubernetes.io/enforce-version: "1.25"

Used on: Namespace

Value must be latest or a valid Kubernetes version in the format v<MAJOR>.<MINOR>. This determines the version of the Pod Security Standard policies to apply when validating a submitted Pod.

See Enforcing Pod Security at the Namespace Level for more information.

pod-security.kubernetes.io/audit

Example: pod-security.kubernetes.io/audit: "baseline"

Used on: Namespace

Value must be one of privileged, baseline, or restricted which correspond to Pod Security Standard levels. Specifically, the audit label does not prevent the creation of a Pod in the labeled Namespace which does not meet the requirements outlined in the indicated level, but adds an audit annotation to that Pod.

See Enforcing Pod Security at the Namespace Level for more information.

pod-security.kubernetes.io/audit-version

Example: pod-security.kubernetes.io/audit-version: "1.25"

Used on: Namespace

Value must be latest or a valid Kubernetes version in the format v<MAJOR>.<MINOR>. This determines the version of the Pod Security Standard policies to apply when validating a submitted Pod.

See Enforcing Pod Security at the Namespace Level for more information.

pod-security.kubernetes.io/warn

Example: pod-security.kubernetes.io/warn: "baseline"

Used on: Namespace

Value must be one of privileged, baseline, or restricted which correspond to Pod Security Standard levels. Specifically, the warn label does not prevent the creation of a Pod in the labeled Namespace which does not meet the requirements outlined in the indicated level, but returns a warning to the user after doing so. Note that warnings are also displayed when creating or updating objects that contain Pod templates, such as Deployments, Jobs, StatefulSets, etc.

See Enforcing Pod Security at the Namespace Level for more information.

pod-security.kubernetes.io/warn-version

Example: pod-security.kubernetes.io/warn-version: "1.25"

Used on: Namespace

Value must be latest or a valid Kubernetes version in the format v<MAJOR>.<MINOR>. This determines the version of the Pod Security Standard policies to apply when validating a submitted Pod. Note that warnings are also displayed when creating or updating objects that contain Pod templates, such as Deployments, Jobs, StatefulSets, etc.

See Enforcing Pod Security at the Namespace Level for more information.

kubernetes.io/psp (deprecated)

Example: kubernetes.io/psp: restricted

Used on: Pod

This annotation was only relevant if you were using PodSecurityPolicies. Kubernetes v1.25 does not support the PodSecurityPolicy API.

When the PodSecurityPolicy admission controller admitted a Pod, the admission controller modified the Pod to have this annotation. The value of the annotation was the name of the PodSecurityPolicy that was used for validation.

seccomp.security.alpha.kubernetes.io/pod (deprecated)

This annotation has been deprecated since Kubernetes v1.19 and will become non-functional in a future release. please use the corresponding pod or container securityContext.seccompProfile field instead. To specify security settings for a Pod, include the securityContext field in the Pod specification. The securityContext field within a Pod's .spec defines pod-level security attributes. When you specify the security context for a Pod, the settings you specify apply to all containers in that Pod.

container.seccomp.security.alpha.kubernetes.io/[NAME] (deprecated)

This annotation has been deprecated since Kubernetes v1.19 and will become non-functional in a future release. please use the corresponding pod or container securityContext.seccompProfile field instead. The tutorial Restrict a Container's Syscalls with seccomp takes you through the steps you follow to apply a seccomp profile to a Pod or to one of its containers. That tutorial covers the supported mechanism for configuring seccomp in Kubernetes, based on setting securityContext within the Pod's .spec.

snapshot.storage.kubernetes.io/allowVolumeModeChange

Example: snapshot.storage.kubernetes.io/allowVolumeModeChange: "true"

Used on: VolumeSnapshotContent

Value can either be true or false. This determines whether a user can modify the mode of the source volume when a PersistentVolumeClaim is being created from a VolumeSnapshot.

Refer to Converting the volume mode of a Snapshot and the Kubernetes CSI Developer Documentation for more information.

Annotations used for audit

See more details on the Audit Annotations page.

kubeadm

kubeadm.alpha.kubernetes.io/cri-socket

Example: kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/container.sock

Used on: Node

Annotation that kubeadm uses to preserve the CRI socket information given to kubeadm at init/join time for later use. kubeadm annotates the Node object with this information. The annotation remains "alpha", since ideally this should be a field in KubeletConfiguration instead.

kubeadm.kubernetes.io/etcd.advertise-client-urls

Example: kubeadm.kubernetes.io/etcd.advertise-client-urls: https://172.17.0.18:2379

Used on: Pod

Annotation that kubeadm places on locally managed etcd pods to keep track of a list of URLs where etcd clients should connect to. This is used mainly for etcd cluster health check purposes.

kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint

Example: kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint: https//172.17.0.18:6443

Used on: Pod

Annotation that kubeadm places on locally managed kube-apiserver pods to keep track of the exposed advertise address/port endpoint for that API server instance.

kubeadm.kubernetes.io/component-config.hash

Used on: ConfigMap

Example: kubeadm.kubernetes.io/component-config.hash: 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae

Annotation that kubeadm places on ConfigMaps that it manages for configuring components. It contains a hash (SHA-256) used to determine if the user has applied settings different from the kubeadm defaults for a particular component.

node-role.kubernetes.io/control-plane

Used on: Node

Label that kubeadm applies on the control plane nodes that it manages.

node-role.kubernetes.io/control-plane

Used on: Node

Example: node-role.kubernetes.io/control-plane:NoSchedule

Taint that kubeadm applies on control plane nodes to allow only critical workloads to schedule on them.

node-role.kubernetes.io/master (deprecated)

Used on: Node

Example: node-role.kubernetes.io/master:NoSchedule

Taint that kubeadm previously applied on control plane nodes to allow only critical workloads to schedule on them. Replaced by node-role.kubernetes.io/control-plane; kubeadm no longer sets or uses this deprecated taint.

4.1 - Audit Annotations

This page serves as a reference for the audit annotations of the kubernetes.io namespace. These annotations apply to Event object from API group audit.k8s.io.

pod-security.kubernetes.io/exempt

Example: pod-security.kubernetes.io/exempt: namespace

Value must be one of user, namespace, or runtimeClass which correspond to Pod Security Exemption dimensions. This annotation indicates on which dimension was based the exemption from the PodSecurity enforcement.

pod-security.kubernetes.io/enforce-policy

Example: pod-security.kubernetes.io/enforce-policy: restricted:latest

Value must be privileged:<version>, baseline:<version>, restricted:<version> which correspond to Pod Security Standard levels accompanied by a version which must be latest or a valid Kubernetes version in the format v<MAJOR>.<MINOR>. This annotations informs about the enforcement level that allowed or denied the pod during PodSecurity admission.

See Pod Security Standards for more information.

pod-security.kubernetes.io/audit-violations

Example: pod-security.kubernetes.io/audit-violations: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "example" must set securityContext.allowPrivilegeEscalation=false), ...

Value details an audit policy violation, it contains the Pod Security Standard level that was transgressed as well as the specific policies on the fields that were violated from the PodSecurity enforcement.

See Pod Security Standards for more information.

authorization.k8s.io/decision

Example: authorization.k8s.io/decision: "forbid"

This annotation indicates whether or not a request was authorized in Kubernetes audit logs.

See Auditing for more information.

authorization.k8s.io/reason

Example: authorization.k8s.io/reason: "Human-readable reason for the decision"

This annotation gives reason for the decision in Kubernetes audit logs.

See Auditing for more information.

missing-san.invalid-cert.kubernetes.io/$hostname

Example: missing-san.invalid-cert.kubernetes.io/example-svc.example-namespace.svc: "relies on a legacy Common Name field instead of the SAN extension for subject validation"

Used by Kubernetes version v1.24 and later

This annotation indicates a webhook or aggregated API server is using an invalid certificate that is missing subjectAltNames. Support for these certificates was disabled by default in Kubernetes 1.19, and removed in Kubernetes 1.23.

Requests to endpoints using these certificates will fail. Services using these certificates should replace them as soon as possible to avoid disruption when running in Kubernetes 1.23+ environments.

There's more information about this in the Go documentation: X.509 CommonName deprecation.

insecure-sha1.invalid-cert.kubernetes.io/$hostname

Example: insecure-sha1.invalid-cert.kubernetes.io/example-svc.example-namespace.svc: "uses an insecure SHA-1 signature"

Used by Kubernetes version v1.24 and later

This annotation indicates a webhook or aggregated API server is using an insecure certificate signed with a SHA-1 hash. Support for these insecure certificates is disabled by default in Kubernetes 1.24, and will be removed in a future release.

Services using these certificates should replace them as soon as possible, to ensure connections are secured properly and to avoid disruption in future releases.

There's more information about this in the Go documentation: Rejecting SHA-1 certificates.

5 - Kubernetes API

Kubernetes' API is the application that serves Kubernetes functionality through a RESTful interface and stores the state of the cluster.

Kubernetes resources and "records of intent" are all stored as API objects, and modified via RESTful calls to the API. The API allows configuration to be managed in a declarative way. Users can interact with the Kubernetes API directly, or via tools like kubectl. The core Kubernetes API is flexible and can also be extended to support custom resources.

5.1 - Workload Resources

5.1.1 - Pod

Pod is a collection of containers that can run on a host.

apiVersion: v1

import "k8s.io/api/core/v1"

Pod

Pod is a collection of containers that can run on a host. This resource is created by clients and scheduled onto hosts.


PodSpec

PodSpec is a description of a pod.


Containers

  • containers ([]Container), required

    Patch strategy: merge on key name

    List of containers belonging to the pod. Containers cannot currently be added or removed. There must be at least one container in a Pod. Cannot be updated.

  • initContainers ([]Container)

    Patch strategy: merge on key name

    List of initialization containers belonging to the pod. Init containers are executed in order prior to containers being started. If any init container fails, the pod is considered to have failed and is handled according to its restartPolicy. The name for an init container or normal container must be unique among all containers. Init containers may not have Lifecycle actions, Readiness probes, Liveness probes, or Startup probes. The resourceRequirements of an init container are taken into account during scheduling by finding the highest request/limit for each resource type, and then using the max of of that value or the sum of the normal containers. Limits are applied to init containers in a similar fashion. Init containers cannot currently be added or removed. Cannot be updated. More info: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/

  • ephemeralContainers ([]EphemeralContainer)

    Patch strategy: merge on key name

    List of ephemeral containers run in this pod. Ephemeral containers may be run in an existing pod to perform user-initiated actions such as debugging. This list cannot be specified when creating a pod, and it cannot be modified by updating the pod spec. In order to add an ephemeral container to an existing pod, use the pod's ephemeralcontainers subresource.

  • imagePullSecrets ([]LocalObjectReference)

    Patch strategy: merge on key name

    ImagePullSecrets is an optional list of references to secrets in the same namespace to use for pulling any of the images used by this PodSpec. If specified, these secrets will be passed to individual puller implementations for them to use. More info: https://kubernetes.io/docs/concepts/containers/images#specifying-imagepullsecrets-on-a-pod

  • enableServiceLinks (boolean)

    EnableServiceLinks indicates whether information about services should be injected into pod's environment variables, matching the syntax of Docker links. Optional: Defaults to true.

  • os (PodOS)

    Specifies the OS of the containers in the pod. Some pod and container fields are restricted if this is set.

    If the OS field is set to linux, the following fields must be unset: -securityContext.windowsOptions

    If the OS field is set to windows, following fields must be unset: - spec.hostPID - spec.hostIPC - spec.hostUsers - spec.securityContext.seLinuxOptions - spec.securityContext.seccompProfile - spec.securityContext.fsGroup - spec.securityContext.fsGroupChangePolicy - spec.securityContext.sysctls - spec.shareProcessNamespace - spec.securityContext.runAsUser - spec.securityContext.runAsGroup - spec.securityContext.supplementalGroups - spec.containers[].securityContext.seLinuxOptions - spec.containers[].securityContext.seccompProfile - spec.containers[].securityContext.capabilities - spec.containers[].securityContext.readOnlyRootFilesystem - spec.containers[].securityContext.privileged - spec.containers[].securityContext.allowPrivilegeEscalation - spec.containers[].securityContext.procMount - spec.containers[].securityContext.runAsUser - spec.containers[*].securityContext.runAsGroup

    PodOS defines the OS parameters of a pod.

Volumes

Scheduling

  • nodeSelector (map[string]string)

    NodeSelector is a selector which must be true for the pod to fit on a node. Selector which must match a node's labels for the pod to be scheduled on that node. More info: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/

  • nodeName (string)

    NodeName is a request to schedule this pod onto a specific node. If it is non-empty, the scheduler simply schedules this pod onto that node, assuming that it fits resource requirements.

  • affinity (Affinity)

    If specified, the pod's scheduling constraints

    Affinity is a group of affinity scheduling rules.

    • affinity.nodeAffinity (NodeAffinity)

      Describes node affinity scheduling rules for the pod.

    • affinity.podAffinity (PodAffinity)

      Describes pod affinity scheduling rules (e.g. co-locate this pod in the same node, zone, etc. as some other pod(s)).

    • affinity.podAntiAffinity (PodAntiAffinity)

      Describes pod anti-affinity scheduling rules (e.g. avoid putting this pod in the same node, zone, etc. as some other pod(s)).

  • tolerations ([]Toleration)

    If specified, the pod's tolerations.

    The pod this Toleration is attached to tolerates any taint that matches the triple <key,value,effect> using the matching operator .

    • tolerations.key (string)

      Key is the taint key that the toleration applies to. Empty means match all taint keys. If the key is empty, operator must be Exists; this combination means to match all values and all keys.

    • tolerations.operator (string)

      Operator represents a key's relationship to the value. Valid operators are Exists and Equal. Defaults to Equal. Exists is equivalent to wildcard for value, so that a pod can tolerate all taints of a particular category.

    • tolerations.value (string)

      Value is the taint value the toleration matches to. If the operator is Exists, the value should be empty, otherwise just a regular string.

    • tolerations.effect (string)

      Effect indicates the taint effect to match. Empty means match all taint effects. When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute.

    • tolerations.tolerationSeconds (int64)

      TolerationSeconds represents the period of time the toleration (which must be of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default, it is not set, which means tolerate the taint forever (do not evict). Zero and negative values will be treated as 0 (evict immediately) by the system.

  • schedulerName (string)

    If specified, the pod will be dispatched by specified scheduler. If not specified, the pod will be dispatched by default scheduler.

  • runtimeClassName (string)

    RuntimeClassName refers to a RuntimeClass object in the node.k8s.io group, which should be used to run this pod. If no RuntimeClass resource matches the named class, the pod will not be run. If unset or empty, the "legacy" RuntimeClass will be used, which is an implicit class with an empty definition that uses the default runtime handler. More info: https://git.k8s.io/enhancements/keps/sig-node/585-runtime-class

  • priorityClassName (string)

    If specified, indicates the pod's priority. "system-node-critical" and "system-cluster-critical" are two special keywords which indicate the highest priorities with the former being the highest priority. Any other name must be defined by creating a PriorityClass object with that name. If not specified, the pod priority will be default or zero if there is no default.

  • priority (int32)

    The priority value. Various system components use this field to find the priority of the pod. When Priority Admission Controller is enabled, it prevents users from setting this field. The admission controller populates this field from PriorityClassName. The higher the value, the higher the priority.

  • preemptionPolicy (string)

    PreemptionPolicy is the Policy for preempting pods with lower priority. One of Never, PreemptLowerPriority. Defaults to PreemptLowerPriority if unset.

  • topologySpreadConstraints ([]TopologySpreadConstraint)

    Patch strategy: merge on key topologyKey

    Map: unique values on keys topologyKey, whenUnsatisfiable will be kept during a merge

    TopologySpreadConstraints describes how a group of pods ought to spread across topology domains. Scheduler will schedule pods in a way which abides by the constraints. All topologySpreadConstraints are ANDed.

    TopologySpreadConstraint specifies how to spread matching pods among the given topology.

    • topologySpreadConstraints.maxSkew (int32), required

      MaxSkew describes the degree to which pods may be unevenly distributed. When whenUnsatisfiable=DoNotSchedule, it is the maximum permitted difference between the number of matching pods in the target topology and the global minimum. The global minimum is the minimum number of matching pods in an eligible domain or zero if the number of eligible domains is less than MinDomains. For example, in a 3-zone cluster, MaxSkew is set to 1, and pods with the same labelSelector spread as 2/2/1: In this case, the global minimum is 1. | zone1 | zone2 | zone3 | | P P | P P | P | - if MaxSkew is 1, incoming pod can only be scheduled to zone3 to become 2/2/2; scheduling it onto zone1(zone2) would make the ActualSkew(3-1) on zone1(zone2) violate MaxSkew(1). - if MaxSkew is 2, incoming pod can be scheduled onto any zone. When whenUnsatisfiable=ScheduleAnyway, it is used to give higher precedence to topologies that satisfy it. It's a required field. Default value is 1 and 0 is not allowed.

    • topologySpreadConstraints.topologyKey (string), required

      TopologyKey is the key of node labels. Nodes that have a label with this key and identical values are considered to be in the same topology. We consider each <key, value> as a "bucket", and try to put balanced number of pods into each bucket. We define a domain as a particular instance of a topology. Also, we define an eligible domain as a domain whose nodes meet the requirements of nodeAffinityPolicy and nodeTaintsPolicy. e.g. If TopologyKey is "kubernetes.io/hostname", each Node is a domain of that topology. And, if TopologyKey is "topology.kubernetes.io/zone", each zone is a domain of that topology. It's a required field.

    • topologySpreadConstraints.whenUnsatisfiable (string), required

      WhenUnsatisfiable indicates how to deal with a pod if it doesn't satisfy the spread constraint. - DoNotSchedule (default) tells the scheduler not to schedule it. - ScheduleAnyway tells the scheduler to schedule the pod in any location, but giving higher precedence to topologies that would help reduce the skew. A constraint is considered "Unsatisfiable" for an incoming pod if and only if every possible node assignment for that pod would violate "MaxSkew" on some topology. For example, in a 3-zone cluster, MaxSkew is set to 1, and pods with the same labelSelector spread as 3/1/1: | zone1 | zone2 | zone3 | | P P P | P | P | If WhenUnsatisfiable is set to DoNotSchedule, incoming pod can only be scheduled to zone2(zone3) to become 3/2/1(3/1/2) as ActualSkew(2-1) on zone2(zone3) satisfies MaxSkew(1). In other words, the cluster can still be imbalanced, but scheduler won't make it more imbalanced. It's a required field.

    • topologySpreadConstraints.labelSelector (LabelSelector)

      LabelSelector is used to find matching pods. Pods that match this label selector are counted to determine the number of pods in their corresponding topology domain.

    • topologySpreadConstraints.matchLabelKeys ([]string)

      Atomic: will be replaced during a merge

      MatchLabelKeys is a set of pod label keys to select the pods over which spreading will be calculated. The keys are used to lookup values from the incoming pod labels, those key-value labels are ANDed with labelSelector to select the group of existing pods over which spreading will be calculated for the incoming pod. Keys that don't exist in the incoming pod labels will be ignored. A null or empty list means only match against labelSelector.

    • topologySpreadConstraints.minDomains (int32)

      MinDomains indicates a minimum number of eligible domains. When the number of eligible domains with matching topology keys is less than minDomains, Pod Topology Spread treats "global minimum" as 0, and then the calculation of Skew is performed. And when the number of eligible domains with matching topology keys equals or greater than minDomains, this value has no effect on scheduling. As a result, when the number of eligible domains is less than minDomains, scheduler won't schedule more than maxSkew Pods to those domains. If value is nil, the constraint behaves as if MinDomains is equal to 1. Valid values are integers greater than 0. When value is not nil, WhenUnsatisfiable must be DoNotSchedule.

      For example, in a 3-zone cluster, MaxSkew is set to 2, MinDomains is set to 5 and pods with the same labelSelector spread as 2/2/2: | zone1 | zone2 | zone3 | | P P | P P | P P | The number of domains is less than 5(MinDomains), so "global minimum" is treated as 0. In this situation, new pod with the same labelSelector cannot be scheduled, because computed skew will be 3(3 - 0) if new Pod is scheduled to any of the three zones, it will violate MaxSkew.

      This is a beta field and requires the MinDomainsInPodTopologySpread feature gate to be enabled (enabled by default).

    • topologySpreadConstraints.nodeAffinityPolicy (string)

      NodeAffinityPolicy indicates how we will treat Pod's nodeAffinity/nodeSelector when calculating pod topology spread skew. Options are: - Honor: only nodes matching nodeAffinity/nodeSelector are included in the calculations. - Ignore: nodeAffinity/nodeSelector are ignored. All nodes are included in the calculations.

      If this value is nil, the behavior is equivalent to the Honor policy. This is a alpha-level feature enabled by the NodeInclusionPolicyInPodTopologySpread feature flag.

    • topologySpreadConstraints.nodeTaintsPolicy (string)

      NodeTaintsPolicy indicates how we will treat node taints when calculating pod topology spread skew. Options are: - Honor: nodes without taints, along with tainted nodes for which the incoming pod has a toleration, are included. - Ignore: node taints are ignored. All nodes are included.

      If this value is nil, the behavior is equivalent to the Ignore policy. This is a alpha-level feature enabled by the NodeInclusionPolicyInPodTopologySpread feature flag.

  • overhead (map[string]Quantity)

    Overhead represents the resource overhead associated with running a pod for a given RuntimeClass. This field will be autopopulated at admission time by the RuntimeClass admission controller. If the RuntimeClass admission controller is enabled, overhead must not be set in Pod create requests. The RuntimeClass admission controller will reject Pod create requests which have the overhead already set. If RuntimeClass is configured and selected in the PodSpec, Overhead will be set to the value defined in the corresponding RuntimeClass, otherwise it will remain unset and treated as zero. More info: https://git.k8s.io/enhancements/keps/sig-node/688-pod-overhead/README.md

Lifecycle

  • restartPolicy (string)

    Restart policy for all containers within the pod. One of Always, OnFailure, Never. Default to Always. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy

  • terminationGracePeriodSeconds (int64)

    Optional duration in seconds the pod needs to terminate gracefully. May be decreased in delete request. Value must be non-negative integer. The value zero indicates stop immediately via the kill signal (no opportunity to shut down). If this value is nil, the default grace period will be used instead. The grace period is the duration in seconds after the processes running in the pod are sent a termination signal and the time when the processes are forcibly halted with a kill signal. Set this value longer than the expected cleanup time for your process. Defaults to 30 seconds.

  • activeDeadlineSeconds (int64)

    Optional duration in seconds the pod may be active on the node relative to StartTime before the system will actively try to mark it failed and kill associated containers. Value must be a positive integer.

  • readinessGates ([]PodReadinessGate)

    If specified, all readiness gates will be evaluated for pod readiness. A pod is ready when all its containers are ready AND all conditions specified in the readiness gates have status equal to "True" More info: https://git.k8s.io/enhancements/keps/sig-network/580-pod-readiness-gates

    PodReadinessGate contains the reference to a pod condition

    • readinessGates.conditionType (string), required

      ConditionType refers to a condition in the pod's condition list with matching type.

Hostname and Name resolution

  • hostname (string)

    Specifies the hostname of the Pod If not specified, the pod's hostname will be set to a system-defined value.

  • setHostnameAsFQDN (boolean)

    If true the pod's hostname will be configured as the pod's FQDN, rather than the leaf name (the default). In Linux containers, this means setting the FQDN in the hostname field of the kernel (the nodename field of struct utsname). In Windows containers, this means setting the registry value of hostname for the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters to FQDN. If a pod does not have FQDN, this has no effect. Default to false.

  • subdomain (string)

    If specified, the fully qualified Pod hostname will be "<hostname>.<subdomain>.<pod namespace>.svc.<cluster domain>". If not specified, the pod will not have a domainname at all.

  • hostAliases ([]HostAlias)

    Patch strategy: merge on key ip

    HostAliases is an optional list of hosts and IPs that will be injected into the pod's hosts file if specified. This is only valid for non-hostNetwork pods.

    HostAlias holds the mapping between IP and hostnames that will be injected as an entry in the pod's hosts file.

    • hostAliases.hostnames ([]string)

      Hostnames for the above IP address.

    • hostAliases.ip (string)

      IP address of the host file entry.

  • dnsConfig (PodDNSConfig)

    Specifies the DNS parameters of a pod. Parameters specified here will be merged to the generated DNS configuration based on DNSPolicy.

    PodDNSConfig defines the DNS parameters of a pod in addition to those generated from DNSPolicy.

    • dnsConfig.nameservers ([]string)

      A list of DNS name server IP addresses. This will be appended to the base nameservers generated from DNSPolicy. Duplicated nameservers will be removed.

    • dnsConfig.options ([]PodDNSConfigOption)

      A list of DNS resolver options. This will be merged with the base options generated from DNSPolicy. Duplicated entries will be removed. Resolution options given in Options will override those that appear in the base DNSPolicy.

      PodDNSConfigOption defines DNS resolver options of a pod.

      • dnsConfig.options.name (string)

        Required.

      • dnsConfig.options.value (string)

    • dnsConfig.searches ([]string)

      A list of DNS search domains for host-name lookup. This will be appended to the base search paths generated from DNSPolicy. Duplicated search paths will be removed.

  • dnsPolicy (string)

    Set DNS policy for the pod. Defaults to "ClusterFirst". Valid values are 'ClusterFirstWithHostNet', 'ClusterFirst', 'Default' or 'None'. DNS parameters given in DNSConfig will be merged with the policy selected with DNSPolicy. To have DNS options set along with hostNetwork, you have to specify DNS policy explicitly to 'ClusterFirstWithHostNet'.

Hosts namespaces

  • hostNetwork (boolean)

    Host networking requested for this pod. Use the host's network namespace. If this option is set, the ports that will be used must be specified. Default to false.

  • hostPID (boolean)

    Use the host's pid namespace. Optional: Default to false.

  • hostIPC (boolean)

    Use the host's ipc namespace. Optional: Default to false.

  • shareProcessNamespace (boolean)

    Share a single process namespace between all of the containers in a pod. When this is set containers will be able to view and signal processes from other containers in the same pod, and the first process in each container will not be assigned PID 1. HostPID and ShareProcessNamespace cannot both be set. Optional: Default to false.

Service account

Security context

  • securityContext (PodSecurityContext)

    SecurityContext holds pod-level security attributes and common container settings. Optional: Defaults to empty. See type description for default values of each field.

    PodSecurityContext holds pod-level security attributes and common container settings. Some fields are also present in container.securityContext. Field values of container.securityContext take precedence over field values of PodSecurityContext.

    • securityContext.runAsUser (int64)

      The UID to run the entrypoint of the container process. Defaults to user specified in image metadata if unspecified. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence for that container. Note that this field cannot be set when spec.os.name is windows.

    • securityContext.runAsNonRoot (boolean)

      Indicates that the container must run as a non-root user. If true, the Kubelet will validate the image at runtime to ensure that it does not run as UID 0 (root) and fail to start the container if it does. If unset or false, no such validation will be performed. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.

    • securityContext.runAsGroup (int64)

      The GID to run the entrypoint of the container process. Uses runtime default if unset. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence for that container. Note that this field cannot be set when spec.os.name is windows.

    • securityContext.supplementalGroups ([]int64)

      A list of groups applied to the first process run in each container, in addition to the container's primary GID. If unspecified, no groups will be added to any container. Note that this field cannot be set when spec.os.name is windows.

    • securityContext.fsGroup (int64)

      A special supplemental group that applies to all containers in a pod. Some volume types allow the Kubelet to change the ownership of that volume to be owned by the pod:

      1. The owning GID will be the FSGroup 2. The setgid bit is set (new files created in the volume will be owned by FSGroup) 3. The permission bits are OR'd with rw-rw----

      If unset, the Kubelet will not modify the ownership and permissions of any volume. Note that this field cannot be set when spec.os.name is windows.

    • securityContext.fsGroupChangePolicy (string)

      fsGroupChangePolicy defines behavior of changing ownership and permission of the volume before being exposed inside Pod. This field will only apply to volume types which support fsGroup based ownership(and permissions). It will have no effect on ephemeral volume types such as: secret, configmaps and emptydir. Valid values are "OnRootMismatch" and "Always". If not specified, "Always" is used. Note that this field cannot be set when spec.os.name is windows.

    • securityContext.seccompProfile (SeccompProfile)

      The seccomp options to use by the containers in this pod. Note that this field cannot be set when spec.os.name is windows.

      SeccompProfile defines a pod/container's seccomp profile settings. Only one profile source may be set.

      • securityContext.seccompProfile.type (string), required

        type indicates which kind of seccomp profile will be applied. Valid options are:

        Localhost - a profile defined in a file on the node should be used. RuntimeDefault - the container runtime default profile should be used. Unconfined - no profile should be applied.

      • securityContext.seccompProfile.localhostProfile (string)

        localhostProfile indicates a profile defined in a file on the node should be used. The profile must be preconfigured on the node to work. Must be a descending path, relative to the kubelet's configured seccomp profile location. Must only be set if type is "Localhost".

    • securityContext.seLinuxOptions (SELinuxOptions)

      The SELinux context to be applied to all containers. If unspecified, the container runtime will allocate a random SELinux context for each container. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence for that container. Note that this field cannot be set when spec.os.name is windows.

      SELinuxOptions are the labels to be applied to the container

      • securityContext.seLinuxOptions.level (string)

        Level is SELinux level label that applies to the container.

      • securityContext.seLinuxOptions.role (string)

        Role is a SELinux role label that applies to the container.

      • securityContext.seLinuxOptions.type (string)

        Type is a SELinux type label that applies to the container.

      • securityContext.seLinuxOptions.user (string)

        User is a SELinux user label that applies to the container.

    • securityContext.sysctls ([]Sysctl)

      Sysctls hold a list of namespaced sysctls used for the pod. Pods with unsupported sysctls (by the container runtime) might fail to launch. Note that this field cannot be set when spec.os.name is windows.

      Sysctl defines a kernel parameter to be set

      • securityContext.sysctls.name (string), required

        Name of a property to set

      • securityContext.sysctls.value (string), required

        Value of a property to set

    • securityContext.windowsOptions (WindowsSecurityContextOptions)

      The Windows specific settings applied to all containers. If unspecified, the options within a container's SecurityContext will be used. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is linux.

      WindowsSecurityContextOptions contain Windows-specific options and credentials.

      • securityContext.windowsOptions.gmsaCredentialSpec (string)

        GMSACredentialSpec is where the GMSA admission webhook (https://github.com/kubernetes-sigs/windows-gmsa) inlines the contents of the GMSA credential spec named by the GMSACredentialSpecName field.

      • securityContext.windowsOptions.gmsaCredentialSpecName (string)

        GMSACredentialSpecName is the name of the GMSA credential spec to use.

      • securityContext.windowsOptions.hostProcess (boolean)

        HostProcess determines if a container should be run as a 'Host Process' container. This field is alpha-level and will only be honored by components that enable the WindowsHostProcessContainers feature flag. Setting this field without the feature flag will result in errors when validating the Pod. All of a Pod's containers must have the same effective HostProcess value (it is not allowed to have a mix of HostProcess containers and non-HostProcess containers). In addition, if HostProcess is true then HostNetwork must also be set to true.

      • securityContext.windowsOptions.runAsUserName (string)

        The UserName in Windows to run the entrypoint of the container process. Defaults to the user specified in image metadata if unspecified. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.

Alpha level

  • hostUsers (boolean)

    Use the host's user namespace. Optional: Default to true. If set to true or not present, the pod will be run in the host user namespace, useful for when the pod needs a feature only available to the host user namespace, such as loading a kernel module with CAP_SYS_MODULE. When set to false, a new userns is created for the pod. Setting false is useful for mitigating container breakout vulnerabilities even allowing users to run their containers as root without actually having root privileges on the host. This field is alpha-level and is only honored by servers that enable the UserNamespacesSupport feature.

Deprecated

  • serviceAccount (string)

    DeprecatedServiceAccount is a depreciated alias for ServiceAccountName. Deprecated: Use serviceAccountName instead.

Container

A single application container that you want to run within a pod.


  • name (string), required

    Name of the container specified as a DNS_LABEL. Each container in a pod must have a unique name (DNS_LABEL). Cannot be updated.

Image

Entrypoint

  • command ([]string)

    Entrypoint array. Not executed within a shell. The container image's ENTRYPOINT is used if this is not provided. Variable references $(VAR_NAME) are expanded using the container's environment. If a variable cannot be resolved, the reference in the input string will be unchanged. Double $$ are reduced to a single $, which allows for escaping the $(VAR_NAME) syntax: i.e. "$$(VAR_NAME)" will produce the string literal "$(VAR_NAME)". Escaped references will never be expanded, regardless of whether the variable exists or not. Cannot be updated. More info: https://kubernetes.io/docs/tasks/inject-data-application/define-command-argument-container/#running-a-command-in-a-shell

  • args ([]string)

    Arguments to the entrypoint. The container image's CMD is used if this is not provided. Variable references $(VAR_NAME) are expanded using the container's environment. If a variable cannot be resolved, the reference in the input string will be unchanged. Double $$ are reduced to a single $, which allows for escaping the $(VAR_NAME) syntax: i.e. "$$(VAR_NAME)" will produce the string literal "$(VAR_NAME)". Escaped references will never be expanded, regardless of whether the variable exists or not. Cannot be updated. More info: https://kubernetes.io/docs/tasks/inject-data-application/define-command-argument-container/#running-a-command-in-a-shell

  • workingDir (string)

    Container's working directory. If not specified, the container runtime's default will be used, which might be configured in the container image. Cannot be updated.

Ports

  • ports ([]ContainerPort)

    Patch strategy: merge on key containerPort

    Map: unique values on keys containerPort, protocol will be kept during a merge

    List of ports to expose from the container. Not specifying a port here DOES NOT prevent that port from being exposed. Any port which is listening on the default "0.0.0.0" address inside a container will be accessible from the network. Modifying this array with strategic merge patch may corrupt the data. For more information See https://github.com/kubernetes/kubernetes/issues/108255. Cannot be updated.

    ContainerPort represents a network port in a single container.

    • ports.containerPort (int32), required

      Number of port to expose on the pod's IP address. This must be a valid port number, 0 < x < 65536.

    • ports.hostIP (string)

      What host IP to bind the external port to.

    • ports.hostPort (int32)

      Number of port to expose on the host. If specified, this must be a valid port number, 0 < x < 65536. If HostNetwork is specified, this must match ContainerPort. Most containers do not need this.

    • ports.name (string)

      If specified, this must be an IANA_SVC_NAME and unique within the pod. Each named port in a pod must have a unique name. Name for the port that can be referred to by services.

    • ports.protocol (string)

      Protocol for port. Must be UDP, TCP, or SCTP. Defaults to "TCP".

Environment variables

  • env ([]EnvVar)

    Patch strategy: merge on key name

    List of environment variables to set in the container. Cannot be updated.

    EnvVar represents an environment variable present in a Container.

    • env.name (string), required

      Name of the environment variable. Must be a C_IDENTIFIER.

    • env.value (string)

      Variable references $(VAR_NAME) are expanded using the previously defined environment variables in the container and any service environment variables. If a variable cannot be resolved, the reference in the input string will be unchanged. Double $$ are reduced to a single $, which allows for escaping the $(VAR_NAME) syntax: i.e. "$$(VAR_NAME)" will produce the string literal "$(VAR_NAME)". Escaped references will never be expanded, regardless of whether the variable exists or not. Defaults to "".

    • env.valueFrom (EnvVarSource)

      Source for the environment variable's value. Cannot be used if value is not empty.

      EnvVarSource represents a source for the value of an EnvVar.

      • env.valueFrom.configMapKeyRef (ConfigMapKeySelector)

        Selects a key of a ConfigMap.

        Selects a key from a ConfigMap.

      • env.valueFrom.fieldRef (ObjectFieldSelector)

        Selects a field of the pod: supports metadata.name, metadata.namespace, metadata.labels['\<KEY>'], metadata.annotations['\<KEY>'], spec.nodeName, spec.serviceAccountName, status.hostIP, status.podIP, status.podIPs.

      • env.valueFrom.resourceFieldRef (ResourceFieldSelector)

        Selects a resource of the container: only resources limits and requests (limits.cpu, limits.memory, limits.ephemeral-storage, requests.cpu, requests.memory and requests.ephemeral-storage) are currently supported.

      • env.valueFrom.secretKeyRef (SecretKeySelector)

        Selects a key of a secret in the pod's namespace

        SecretKeySelector selects a key of a Secret.

  • envFrom ([]EnvFromSource)

    List of sources to populate environment variables in the container. The keys defined within a source must be a C_IDENTIFIER. All invalid keys will be reported as an event when the container is starting. When a key exists in multiple sources, the value associated with the last source will take precedence. Values defined by an Env with a duplicate key will take precedence. Cannot be updated.

    EnvFromSource represents the source of a set of ConfigMaps

    • envFrom.configMapRef (ConfigMapEnvSource)

      The ConfigMap to select from

      *ConfigMapEnvSource selects a ConfigMap to populate the environment variables with.

      The contents of the target ConfigMap's Data field will represent the key-value pairs as environment variables.*

    • envFrom.prefix (string)

      An optional identifier to prepend to each key in the ConfigMap. Must be a C_IDENTIFIER.

    • envFrom.secretRef (SecretEnvSource)

      The Secret to select from

      *SecretEnvSource selects a Secret to populate the environment variables with.

      The contents of the target Secret's Data field will represent the key-value pairs as environment variables.*

Volumes

  • volumeMounts ([]VolumeMount)

    Patch strategy: merge on key mountPath

    Pod volumes to mount into the container's filesystem. Cannot be updated.

    VolumeMount describes a mounting of a Volume within a container.

    • volumeMounts.mountPath (string), required

      Path within the container at which the volume should be mounted. Must not contain ':'.

    • volumeMounts.name (string), required

      This must match the Name of a Volume.

    • volumeMounts.mountPropagation (string)

      mountPropagation determines how mounts are propagated from the host to container and the other way around. When not set, MountPropagationNone is used. This field is beta in 1.10.

    • volumeMounts.readOnly (boolean)

      Mounted read-only if true, read-write otherwise (false or unspecified). Defaults to false.

    • volumeMounts.subPath (string)

      Path within the volume from which the container's volume should be mounted. Defaults to "" (volume's root).

    • volumeMounts.subPathExpr (string)

      Expanded path within the volume from which the container's volume should be mounted. Behaves similarly to SubPath but environment variable references $(VAR_NAME) are expanded using the container's environment. Defaults to "" (volume's root). SubPathExpr and SubPath are mutually exclusive.

  • volumeDevices ([]VolumeDevice)

    Patch strategy: merge on key devicePath

    volumeDevices is the list of block devices to be used by the container.

    volumeDevice describes a mapping of a raw block device within a container.

    • volumeDevices.devicePath (string), required

      devicePath is the path inside of the container that the device will be mapped to.

    • volumeDevices.name (string), required

      name must match the name of a persistentVolumeClaim in the pod

Resources

Lifecycle

  • lifecycle (Lifecycle)

    Actions that the management system should take in response to container lifecycle events. Cannot be updated.

    Lifecycle describes actions that the management system should take in response to container lifecycle events. For the PostStart and PreStop lifecycle handlers, management of the container blocks until the action is complete, unless the container process fails, in which case the handler is aborted.

    • lifecycle.postStart (LifecycleHandler)

      PostStart is called immediately after a container is created. If the handler fails, the container is terminated and restarted according to its restart policy. Other management of the container blocks until the hook completes. More info: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks

    • lifecycle.preStop (LifecycleHandler)

      PreStop is called immediately before a container is terminated due to an API request or management event such as liveness/startup probe failure, preemption, resource contention, etc. The handler is not called if the container crashes or exits. The Pod's termination grace period countdown begins before the PreStop hook is executed. Regardless of the outcome of the handler, the container will eventually terminate within the Pod's termination grace period (unless delayed by finalizers). Other management of the container blocks until the hook completes or until the termination grace period is reached. More info: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks

  • terminationMessagePath (string)

    Optional: Path at which the file to which the container's termination message will be written is mounted into the container's filesystem. Message written is intended to be brief final status, such as an assertion failure message. Will be truncated by the node if greater than 4096 bytes. The total message length across all containers will be limited to 12kb. Defaults to /dev/termination-log. Cannot be updated.

  • terminationMessagePolicy (string)

    Indicate how the termination message should be populated. File will use the contents of terminationMessagePath to populate the container status message on both success and failure. FallbackToLogsOnError will use the last chunk of container log output if the termination message file is empty and the container exited with an error. The log output is limited to 2048 bytes or 80 lines, whichever is smaller. Defaults to File. Cannot be updated.

  • livenessProbe (Probe)

    Periodic probe of container liveness. Container will be restarted if the probe fails. Cannot be updated. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes

  • readinessProbe (Probe)

    Periodic probe of container service readiness. Container will be removed from service endpoints if the probe fails. Cannot be updated. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes

  • startupProbe (Probe)

    StartupProbe indicates that the Pod has successfully initialized. If specified, no other probes are executed until this completes successfully. If this probe fails, the Pod will be restarted, just as if the livenessProbe failed. This can be used to provide different probe parameters at the beginning of a Pod's lifecycle, when it might take a long time to load data or warm a cache, than during steady-state operation. This cannot be updated. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes

Security Context

  • securityContext (SecurityContext)

    SecurityContext defines the security options the container should be run with. If set, the fields of SecurityContext override the equivalent fields of PodSecurityContext. More info: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/

    SecurityContext holds security configuration that will be applied to a container. Some fields are present in both SecurityContext and PodSecurityContext. When both are set, the values in SecurityContext take precedence.

    • securityContext.runAsUser (int64)

      The UID to run the entrypoint of the container process. Defaults to user specified in image metadata if unspecified. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.

    • securityContext.runAsNonRoot (boolean)

      Indicates that the container must run as a non-root user. If true, the Kubelet will validate the image at runtime to ensure that it does not run as UID 0 (root) and fail to start the container if it does. If unset or false, no such validation will be performed. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.

    • securityContext.runAsGroup (int64)

      The GID to run the entrypoint of the container process. Uses runtime default if unset. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.

    • securityContext.readOnlyRootFilesystem (boolean)

      Whether this container has a read-only root filesystem. Default is false. Note that this field cannot be set when spec.os.name is windows.

    • securityContext.procMount (string)

      procMount denotes the type of proc mount to use for the containers. The default is DefaultProcMount which uses the container runtime defaults for readonly paths and masked paths. This requires the ProcMountType feature flag to be enabled. Note that this field cannot be set when spec.os.name is windows.

    • securityContext.privileged (boolean)

      Run container in privileged mode. Processes in privileged containers are essentially equivalent to root on the host. Defaults to false. Note that this field cannot be set when spec.os.name is windows.

    • securityContext.allowPrivilegeEscalation (boolean)

      AllowPrivilegeEscalation controls whether a process can gain more privileges than its parent process. This bool directly controls if the no_new_privs flag will be set on the container process. AllowPrivilegeEscalation is true always when the container is: 1) run as Privileged 2) has CAP_SYS_ADMIN Note that this field cannot be set when spec.os.name is windows.

    • securityContext.capabilities (Capabilities)

      The capabilities to add/drop when running containers. Defaults to the default set of capabilities granted by the container runtime. Note that this field cannot be set when spec.os.name is windows.

      Adds and removes POSIX capabilities from running containers.

      • securityContext.capabilities.add ([]string)

        Added capabilities

      • securityContext.capabilities.drop ([]string)

        Removed capabilities

    • securityContext.seccompProfile (SeccompProfile)

      The seccomp options to use by this container. If seccomp options are provided at both the pod & container level, the container options override the pod options. Note that this field cannot be set when spec.os.name is windows.

      SeccompProfile defines a pod/container's seccomp profile settings. Only one profile source may be set.

      • securityContext.seccompProfile.type (string), required

        type indicates which kind of seccomp profile will be applied. Valid options are:

        Localhost - a profile defined in a file on the node should be used. RuntimeDefault - the container runtime default profile should be used. Unconfined - no profile should be applied.

      • securityContext.seccompProfile.localhostProfile (string)

        localhostProfile indicates a profile defined in a file on the node should be used. The profile must be preconfigured on the node to work. Must be a descending path, relative to the kubelet's configured seccomp profile location. Must only be set if type is "Localhost".

    • securityContext.seLinuxOptions (SELinuxOptions)

      The SELinux context to be applied to the container. If unspecified, the container runtime will allocate a random SELinux context for each container. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.

      SELinuxOptions are the labels to be applied to the container

      • securityContext.seLinuxOptions.level (string)

        Level is SELinux level label that applies to the container.

      • securityContext.seLinuxOptions.role (string)

        Role is a SELinux role label that applies to the container.

      • securityContext.seLinuxOptions.type (string)

        Type is a SELinux type label that applies to the container.

      • securityContext.seLinuxOptions.user (string)

        User is a SELinux user label that applies to the container.

    • securityContext.windowsOptions (WindowsSecurityContextOptions)

      The Windows specific settings applied to all containers. If unspecified, the options from the PodSecurityContext will be used. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is linux.

      WindowsSecurityContextOptions contain Windows-specific options and credentials.

      • securityContext.windowsOptions.gmsaCredentialSpec (string)

        GMSACredentialSpec is where the GMSA admission webhook (https://github.com/kubernetes-sigs/windows-gmsa) inlines the contents of the GMSA credential spec named by the GMSACredentialSpecName field.

      • securityContext.windowsOptions.gmsaCredentialSpecName (string)

        GMSACredentialSpecName is the name of the GMSA credential spec to use.

      • securityContext.windowsOptions.hostProcess (boolean)

        HostProcess determines if a container should be run as a 'Host Process' container. This field is alpha-level and will only be honored by components that enable the WindowsHostProcessContainers feature flag. Setting this field without the feature flag will result in errors when validating the Pod. All of a Pod's containers must have the same effective HostProcess value (it is not allowed to have a mix of HostProcess containers and non-HostProcess containers). In addition, if HostProcess is true then HostNetwork must also be set to true.

      • securityContext.windowsOptions.runAsUserName (string)

        The UserName in Windows to run the entrypoint of the container process. Defaults to the user specified in image metadata if unspecified. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.

Debugging

  • stdin (boolean)

    Whether this container should allocate a buffer for stdin in the container runtime. If this is not set, reads from stdin in the container will always result in EOF. Default is false.

  • stdinOnce (boolean)

    Whether the container runtime should close the stdin channel after it has been opened by a single attach. When stdin is true the stdin stream will remain open across multiple attach sessions. If stdinOnce is set to true, stdin is opened on container start, is empty until the first client attaches to stdin, and then remains open and accepts data until the client disconnects, at which time stdin is closed and remains closed until the container is restarted. If this flag is false, a container processes that reads from stdin will never receive an EOF. Default is false

  • tty (boolean)

    Whether this container should allocate a TTY for itself, also requires 'stdin' to be true. Default is false.

EphemeralContainer

An EphemeralContainer is a temporary container that you may add to an existing Pod for user-initiated activities such as debugging. Ephemeral containers have no resource or scheduling guarantees, and they will not be restarted when they exit or when a Pod is removed or restarted. The kubelet may evict a Pod if an ephemeral container causes the Pod to exceed its resource allocation.

To add an ephemeral container, use the ephemeralcontainers subresource of an existing Pod. Ephemeral containers may not be removed or restarted.


  • name (string), required

    Name of the ephemeral container specified as a DNS_LABEL. This name must be unique among all containers, init containers and ephemeral containers.

  • targetContainerName (string)

    If set, the name of the container from PodSpec that this ephemeral container targets. The ephemeral container will be run in the namespaces (IPC, PID, etc) of this container. If not set then the ephemeral container uses the namespaces configured in the Pod spec.

    The container runtime must implement support for this feature. If the runtime does not support namespace targeting then the result of setting this field is undefined.

Image

Entrypoint

  • command ([]string)

    Entrypoint array. Not executed within a shell. The image's ENTRYPOINT is used if this is not provided. Variable references $(VAR_NAME) are expanded using the container's environment. If a variable cannot be resolved, the reference in the input string will be unchanged. Double $$ are reduced to a single $, which allows for escaping the $(VAR_NAME) syntax: i.e. "$$(VAR_NAME)" will produce the string literal "$(VAR_NAME)". Escaped references will never be expanded, regardless of whether the variable exists or not. Cannot be updated. More info: https://kubernetes.io/docs/tasks/inject-data-application/define-command-argument-container/#running-a-command-in-a-shell

  • args ([]string)

    Arguments to the entrypoint. The image's CMD is used if this is not provided. Variable references $(VAR_NAME) are expanded using the container's environment. If a variable cannot be resolved, the reference in the input string will be unchanged. Double $$ are reduced to a single $, which allows for escaping the $(VAR_NAME) syntax: i.e. "$$(VAR_NAME)" will produce the string literal "$(VAR_NAME)". Escaped references will never be expanded, regardless of whether the variable exists or not. Cannot be updated. More info: https://kubernetes.io/docs/tasks/inject-data-application/define-command-argument-container/#running-a-command-in-a-shell

  • workingDir (string)

    Container's working directory. If not specified, the container runtime's default will be used, which might be configured in the container image. Cannot be updated.

Environment variables

  • env ([]EnvVar)

    Patch strategy: merge on key name

    List of environment variables to set in the container. Cannot be updated.

    EnvVar represents an environment variable present in a Container.

    • env.name (string), required

      Name of the environment variable. Must be a C_IDENTIFIER.

    • env.value (string)

      Variable references $(VAR_NAME) are expanded using the previously defined environment variables in the container and any service environment variables. If a variable cannot be resolved, the reference in the input string will be unchanged. Double $$ are reduced to a single $, which allows for escaping the $(VAR_NAME) syntax: i.e. "$$(VAR_NAME)" will produce the string literal "$(VAR_NAME)". Escaped references will never be expanded, regardless of whether the variable exists or not. Defaults to "".

    • env.valueFrom (EnvVarSource)

      Source for the environment variable's value. Cannot be used if value is not empty.

      EnvVarSource represents a source for the value of an EnvVar.

      • env.valueFrom.configMapKeyRef (ConfigMapKeySelector)

        Selects a key of a ConfigMap.

        Selects a key from a ConfigMap.

      • env.valueFrom.fieldRef (ObjectFieldSelector)

        Selects a field of the pod: supports metadata.name, metadata.namespace, metadata.labels['\<KEY>'], metadata.annotations['\<KEY>'], spec.nodeName, spec.serviceAccountName, status.hostIP, status.podIP, status.podIPs.

      • env.valueFrom.resourceFieldRef (ResourceFieldSelector)

        Selects a resource of the container: only resources limits and requests (limits.cpu, limits.memory, limits.ephemeral-storage, requests.cpu, requests.memory and requests.ephemeral-storage) are currently supported.

      • env.valueFrom.secretKeyRef (SecretKeySelector)

        Selects a key of a secret in the pod's namespace

        SecretKeySelector selects a key of a Secret.

  • envFrom ([]EnvFromSource)

    List of sources to populate environment variables in the container. The keys defined within a source must be a C_IDENTIFIER. All invalid keys will be reported as an event when the container is starting. When a key exists in multiple sources, the value associated with the last source will take precedence. Values defined by an Env with a duplicate key will take precedence. Cannot be updated.

    EnvFromSource represents the source of a set of ConfigMaps

    • envFrom.configMapRef (ConfigMapEnvSource)

      The ConfigMap to select from

      *ConfigMapEnvSource selects a ConfigMap to populate the environment variables with.

      The contents of the target ConfigMap's Data field will represent the key-value pairs as environment variables.*

    • envFrom.prefix (string)

      An optional identifier to prepend to each key in the ConfigMap. Must be a C_IDENTIFIER.

    • envFrom.secretRef (SecretEnvSource)

      The Secret to select from

      *SecretEnvSource selects a Secret to populate the environment variables with.

      The contents of the target Secret's Data field will represent the key-value pairs as environment variables.*

Volumes

  • volumeMounts ([]VolumeMount)

    Patch strategy: merge on key mountPath

    Pod volumes to mount into the container's filesystem. Subpath mounts are not allowed for ephemeral containers. Cannot be updated.

    VolumeMount describes a mounting of a Volume within a container.

    • volumeMounts.mountPath (string), required

      Path within the container at which the volume should be mounted. Must not contain ':'.

    • volumeMounts.name (string), required

      This must match the Name of a Volume.

    • volumeMounts.mountPropagation (string)

      mountPropagation determines how mounts are propagated from the host to container and the other way around. When not set, MountPropagationNone is used. This field is beta in 1.10.

    • volumeMounts.readOnly (boolean)

      Mounted read-only if true, read-write otherwise (false or unspecified). Defaults to false.

    • volumeMounts.subPath (string)

      Path within the volume from which the container's volume should be mounted. Defaults to "" (volume's root).

    • volumeMounts.subPathExpr (string)

      Expanded path within the volume from which the container's volume should be mounted. Behaves similarly to SubPath but environment variable references $(VAR_NAME) are expanded using the container's environment. Defaults to "" (volume's root). SubPathExpr and SubPath are mutually exclusive.

  • volumeDevices ([]VolumeDevice)

    Patch strategy: merge on key devicePath

    volumeDevices is the list of block devices to be used by the container.

    volumeDevice describes a mapping of a raw block device within a container.

    • volumeDevices.devicePath (string), required

      devicePath is the path inside of the container that the device will be mapped to.

    • volumeDevices.name (string), required

      name must match the name of a persistentVolumeClaim in the pod

Lifecycle

  • terminationMessagePath (string)

    Optional: Path at which the file to which the container's termination message will be written is mounted into the container's filesystem. Message written is intended to be brief final status, such as an assertion failure message. Will be truncated by the node if greater than 4096 bytes. The total message length across all containers will be limited to 12kb. Defaults to /dev/termination-log. Cannot be updated.

  • terminationMessagePolicy (string)

    Indicate how the termination message should be populated. File will use the contents of terminationMessagePath to populate the container status message on both success and failure. FallbackToLogsOnError will use the last chunk of container log output if the termination message file is empty and the container exited with an error. The log output is limited to 2048 bytes or 80 lines, whichever is smaller. Defaults to File. Cannot be updated.

Debugging

  • stdin (boolean)

    Whether this container should allocate a buffer for stdin in the container runtime. If this is not set, reads from stdin in the container will always result in EOF. Default is false.

  • stdinOnce (boolean)

    Whether the container runtime should close the stdin channel after it has been opened by a single attach. When stdin is true the stdin stream will remain open across multiple attach sessions. If stdinOnce is set to true, stdin is opened on container start, is empty until the first client attaches to stdin, and then remains open and accepts data until the client disconnects, at which time stdin is closed and remains closed until the container is restarted. If this flag is false, a container processes that reads from stdin will never receive an EOF. Default is false

  • tty (boolean)

    Whether this container should allocate a TTY for itself, also requires 'stdin' to be true. Default is false.

Security context

  • securityContext (SecurityContext)

    Optional: SecurityContext defines the security options the ephemeral container should be run with. If set, the fields of SecurityContext override the equivalent fields of PodSecurityContext.

    SecurityContext holds security configuration that will be applied to a container. Some fields are present in both SecurityContext and PodSecurityContext. When both are set, the values in SecurityContext take precedence.

    • securityContext.runAsUser (int64)

      The UID to run the entrypoint of the container process. Defaults to user specified in image metadata if unspecified. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.

    • securityContext.runAsNonRoot (boolean)

      Indicates that the container must run as a non-root user. If true, the Kubelet will validate the image at runtime to ensure that it does not run as UID 0 (root) and fail to start the container if it does. If unset or false, no such validation will be performed. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.

    • securityContext.runAsGroup (int64)

      The GID to run the entrypoint of the container process. Uses runtime default if unset. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.

    • securityContext.readOnlyRootFilesystem (boolean)

      Whether this container has a read-only root filesystem. Default is false. Note that this field cannot be set when spec.os.name is windows.

    • securityContext.procMount (string)

      procMount denotes the type of proc mount to use for the containers. The default is DefaultProcMount which uses the container runtime defaults for readonly paths and masked paths. This requires the ProcMountType feature flag to be enabled. Note that this field cannot be set when spec.os.name is windows.

    • securityContext.privileged (boolean)

      Run container in privileged mode. Processes in privileged containers are essentially equivalent to root on the host. Defaults to false. Note that this field cannot be set when spec.os.name is windows.

    • securityContext.allowPrivilegeEscalation (boolean)

      AllowPrivilegeEscalation controls whether a process can gain more privileges than its parent process. This bool directly controls if the no_new_privs flag will be set on the container process. AllowPrivilegeEscalation is true always when the container is: 1) run as Privileged 2) has CAP_SYS_ADMIN Note that this field cannot be set when spec.os.name is windows.

    • securityContext.capabilities (Capabilities)

      The capabilities to add/drop when running containers. Defaults to the default set of capabilities granted by the container runtime. Note that this field cannot be set when spec.os.name is windows.

      Adds and removes POSIX capabilities from running containers.

      • securityContext.capabilities.add ([]string)

        Added capabilities

      • securityContext.capabilities.drop ([]string)

        Removed capabilities

    • securityContext.seccompProfile (SeccompProfile)

      The seccomp options to use by this container. If seccomp options are provided at both the pod & container level, the container options override the pod options. Note that this field cannot be set when spec.os.name is windows.

      SeccompProfile defines a pod/container's seccomp profile settings. Only one profile source may be set.

      • securityContext.seccompProfile.type (string), required

        type indicates which kind of seccomp profile will be applied. Valid options are:

        Localhost - a profile defined in a file on the node should be used. RuntimeDefault - the container runtime default profile should be used. Unconfined - no profile should be applied.

      • securityContext.seccompProfile.localhostProfile (string)

        localhostProfile indicates a profile defined in a file on the node should be used. The profile must be preconfigured on the node to work. Must be a descending path, relative to the kubelet's configured seccomp profile location. Must only be set if type is "Localhost".

    • securityContext.seLinuxOptions (SELinuxOptions)

      The SELinux context to be applied to the container. If unspecified, the container runtime will allocate a random SELinux context for each container. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.

      SELinuxOptions are the labels to be applied to the container

      • securityContext.seLinuxOptions.level (string)

        Level is SELinux level label that applies to the container.

      • securityContext.seLinuxOptions.role (string)

        Role is a SELinux role label that applies to the container.

      • securityContext.seLinuxOptions.type (string)

        Type is a SELinux type label that applies to the container.

      • securityContext.seLinuxOptions.user (string)

        User is a SELinux user label that applies to the container.

    • securityContext.windowsOptions (WindowsSecurityContextOptions)

      The Windows specific settings applied to all containers. If unspecified, the options from the PodSecurityContext will be used. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is linux.

      WindowsSecurityContextOptions contain Windows-specific options and credentials.

      • securityContext.windowsOptions.gmsaCredentialSpec (string)

        GMSACredentialSpec is where the GMSA admission webhook (https://github.com/kubernetes-sigs/windows-gmsa) inlines the contents of the GMSA credential spec named by the GMSACredentialSpecName field.

      • securityContext.windowsOptions.gmsaCredentialSpecName (string)

        GMSACredentialSpecName is the name of the GMSA credential spec to use.

      • securityContext.windowsOptions.hostProcess (boolean)

        HostProcess determines if a container should be run as a 'Host Process' container. This field is alpha-level and will only be honored by components that enable the WindowsHostProcessContainers feature flag. Setting this field without the feature flag will result in errors when validating the Pod. All of a Pod's containers must have the same effective HostProcess value (it is not allowed to have a mix of HostProcess containers and non-HostProcess containers). In addition, if HostProcess is true then HostNetwork must also be set to true.

      • securityContext.windowsOptions.runAsUserName (string)

        The UserName in Windows to run the entrypoint of the container process. Defaults to the user specified in image metadata if unspecified. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.

Not allowed

  • ports ([]ContainerPort)

    Patch strategy: merge on key containerPort

    Map: unique values on keys containerPort, protocol will be kept during a merge

    Ports are not allowed for ephemeral containers.

    ContainerPort represents a network port in a single container.

    • ports.containerPort (int32), required

      Number of port to expose on the pod's IP address. This must be a valid port number, 0 < x < 65536.

    • ports.hostIP (string)

      What host IP to bind the external port to.

    • ports.hostPort (int32)

      Number of port to expose on the host. If specified, this must be a valid port number, 0 < x < 65536. If HostNetwork is specified, this must match ContainerPort. Most containers do not need this.

    • ports.name (string)

      If specified, this must be an IANA_SVC_NAME and unique within the pod. Each named port in a pod must have a unique name. Name for the port that can be referred to by services.

    • ports.protocol (string)

      Protocol for port. Must be UDP, TCP, or SCTP. Defaults to "TCP".

  • resources (ResourceRequirements)

    Resources are not allowed for ephemeral containers. Ephemeral containers use spare resources already allocated to the pod.

    ResourceRequirements describes the compute resource requirements.

  • lifecycle (Lifecycle)

    Lifecycle is not allowed for ephemeral containers.

    Lifecycle describes actions that the management system should take in response to container lifecycle events. For the PostStart and PreStop lifecycle handlers, management of the container blocks until the action is complete, unless the container process fails, in which case the handler is aborted.

    • lifecycle.postStart (LifecycleHandler)

      PostStart is called immediately after a container is created. If the handler fails, the container is terminated and restarted according to its restart policy. Other management of the container blocks until the hook completes. More info: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks

    • lifecycle.preStop (LifecycleHandler)

      PreStop is called immediately before a container is terminated due to an API request or management event such as liveness/startup probe failure, preemption, resource contention, etc. The handler is not called if the container crashes or exits. The Pod's termination grace period countdown begins before the PreStop hook is executed. Regardless of the outcome of the handler, the container will eventually terminate within the Pod's termination grace period (unless delayed by finalizers). Other management of the container blocks until the hook completes or until the termination grace period is reached. More info: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks

  • livenessProbe (Probe)

    Probes are not allowed for ephemeral containers.

  • readinessProbe (Probe)

    Probes are not allowed for ephemeral containers.

  • startupProbe (Probe)

    Probes are not allowed for ephemeral containers.

LifecycleHandler

LifecycleHandler defines a specific action that should be taken in a lifecycle hook. One and only one of the fields, except TCPSocket must be specified.


  • exec (ExecAction)

    Exec specifies the action to take.

    ExecAction describes a "run in container" action.

    • exec.command ([]string)

      Command is the command line to execute inside the container, the working directory for the command is root ('/') in the container's filesystem. The command is simply exec'd, it is not run inside a shell, so traditional shell instructions ('|', etc) won't work. To use a shell, you need to explicitly call out to that shell. Exit status of 0 is treated as live/healthy and non-zero is unhealthy.

  • httpGet (HTTPGetAction)

    HTTPGet specifies the http request to perform.

    HTTPGetAction describes an action based on HTTP Get requests.

    • httpGet.port (IntOrString), required

      Name or number of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.

      IntOrString is a type that can hold an int32 or a string. When used in JSON or YAML marshalling and unmarshalling, it produces or consumes the inner type. This allows you to have, for example, a JSON field that can accept a name or number.

    • httpGet.host (string)

      Host name to connect to, defaults to the pod IP. You probably want to set "Host" in httpHeaders instead.

    • httpGet.httpHeaders ([]HTTPHeader)

      Custom headers to set in the request. HTTP allows repeated headers.

      HTTPHeader describes a custom header to be used in HTTP probes

      • httpGet.httpHeaders.name (string), required

        The header field name

      • httpGet.httpHeaders.value (string), required

        The header field value

    • httpGet.path (string)

      Path to access on the HTTP server.

    • httpGet.scheme (string)

      Scheme to use for connecting to the host. Defaults to HTTP.

  • tcpSocket (TCPSocketAction)

    Deprecated. TCPSocket is NOT supported as a LifecycleHandler and kept for the backward compatibility. There are no validation of this field and lifecycle hooks will fail in runtime when tcp handler is specified.

    TCPSocketAction describes an action based on opening a socket

    • tcpSocket.port (IntOrString), required

      Number or name of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.

      IntOrString is a type that can hold an int32 or a string. When used in JSON or YAML marshalling and unmarshalling, it produces or consumes the inner type. This allows you to have, for example, a JSON field that can accept a name or number.

    • tcpSocket.host (string)

      Optional: Host name to connect to, defaults to the pod IP.

NodeAffinity

Node affinity is a group of node affinity scheduling rules.


  • preferredDuringSchedulingIgnoredDuringExecution ([]PreferredSchedulingTerm)

    The scheduler will prefer to schedule pods to nodes that satisfy the affinity expressions specified by this field, but it may choose a node that violates one or more of the expressions. The node that is most preferred is the one with the greatest sum of weights, i.e. for each node that meets all of the scheduling requirements (resource request, requiredDuringScheduling affinity expressions, etc.), compute a sum by iterating through the elements of this field and adding "weight" to the sum if the node matches the corresponding matchExpressions; the node(s) with the highest sum are the most preferred.

    An empty preferred scheduling term matches all objects with implicit weight 0 (i.e. it's a no-op). A null preferred scheduling term matches no objects (i.e. is also a no-op).

    • preferredDuringSchedulingIgnoredDuringExecution.preference (NodeSelectorTerm), required

      A node selector term, associated with the corresponding weight.

      A null or empty node selector term matches no objects. The requirements of them are ANDed. The TopologySelectorTerm type implements a subset of the NodeSelectorTerm.

      • preferredDuringSchedulingIgnoredDuringExecution.preference.matchExpressions ([]NodeSelectorRequirement)

        A list of node selector requirements by node's labels.

      • preferredDuringSchedulingIgnoredDuringExecution.preference.matchFields ([]NodeSelectorRequirement)

        A list of node selector requirements by node's fields.

    • preferredDuringSchedulingIgnoredDuringExecution.weight (int32), required

      Weight associated with matching the corresponding nodeSelectorTerm, in the range 1-100.

  • requiredDuringSchedulingIgnoredDuringExecution (NodeSelector)

    If the affinity requirements specified by this field are not met at scheduling time, the pod will not be scheduled onto the node. If the affinity requirements specified by this field cease to be met at some point during pod execution (e.g. due to an update), the system may or may not try to eventually evict the pod from its node.

    A node selector represents the union of the results of one or more label queries over a set of nodes; that is, it represents the OR of the selectors represented by the node selector terms.

    • requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms ([]NodeSelectorTerm), required

      Required. A list of node selector terms. The terms are ORed.

      A null or empty node selector term matches no objects. The requirements of them are ANDed. The TopologySelectorTerm type implements a subset of the NodeSelectorTerm.

      • requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms.matchExpressions ([]NodeSelectorRequirement)

        A list of node selector requirements by node's labels.

      • requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms.matchFields ([]