Kubernetes Blog

Stateful Applications in Containers!? Kubernetes 1.3 Says “Yes!”

July 13 2016

Editor’s note: today’s guest post is from Mark Balch, VP of Products at Diamanti, who’ll share more about the contributions they’ve made to Kubernetes.

Congratulations to the Kubernetes community on another value-packed release. Stateful application support and cluster federation are two reasons why I’m so excited about 1.3. Kubernetes support for stateful apps such as Cassandra, Kafka, and MongoDB is critical. Important services rely on databases, key value stores, message queues, and more. Additionally, relying on one data center or container cluster simply won’t work as apps grow to serve millions of users around the world. Cluster federation allows users to deploy apps across multiple clusters and data centers for scale and resiliency.

You may have heard me say before that containers are the next great application platform. Diamanti is accelerating container adoption for stateful apps in production - where performance and ease of deployment really matter. 

Apps Need More Than Cattle

Beyond stateless containers like web servers (so-called “cattle” because they are interchangeable), users are increasingly deploying stateful workloads with containers to benefit from “build once, run anywhere” and to improve bare metal efficiency/utilization. These “pets” (so-called because each requires special handling) bring new requirements including longer life cycle, configuration dependencies, stateful failover, and performance sensitivity. Container orchestration must address these needs to successfully deploy and scale apps.

Enter Pet Set, a new object in Kubernetes 1.3 for improved stateful application support. Pet Set sequences through the startup phase of each database replica (for example), ensuring orderly master/slave configuration. Pet Set also simplifies service discovery by leveraging ubiquitous DNS SRV records, a well-recognized and long-understood mechanism.
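As a rough sketch (the service and pet names here are hypothetical, and the exact records published depend on the cluster’s DNS add-on), peers behind a Pet Set’s governing headless service can be discovered with ordinary DNS tooling:

# query the SRV records published for a headless service named "cassandra"
dig SRV cassandra.default.svc.cluster.local +short
# each record points at a stable per-pet hostname, e.g.
# cassandra-0.cassandra.default.svc.cluster.local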

Diamanti’s FlexVolume contribution to Kubernetes enables stateful workloads by providing persistent volumes with low-latency storage and guaranteed performance, including enforced quality-of-service from container to media.
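For illustration only, a FlexVolume is referenced from a pod spec by naming a vendor driver plus driver-specific options; the driver name and options below are placeholders rather than Diamanti’s actual driver:

      volumes:
        - name: data
          flexVolume:
            driver: "vendor/example-driver"   # placeholder driver name
            fsType: ext4
            options:                          # driver-specific, illustrative values
              volumeID: "vol-001"
              iops: "20000"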

A Federalist

Users who are planning for application availability must contend with issues of failover and scale across geography. Cross-cluster federated services allow containerized apps to deploy easily across multiple clusters, and tackle challenges such as managing multiple container clusters and coordinating service deployment and discovery across federated clusters.

Like a strictly centralized model, federation provides a common app deployment interface. With each cluster retaining autonomy, however, federation adds flexibility to manage clusters locally during network outages and other events. Cross-cluster federated services also applies consistent service naming and adoption across container clusters, simplifying DNS resolution.

It’s easy to imagine powerful multi-cluster use cases with cross-cluster federated services in future releases. An example is scheduling containers based on governance, security, and performance requirements. Diamanti’s scheduler extension was developed with this concept in mind. Our first implementation makes the Kubernetes scheduler aware of network and storage resources local to each cluster node. Similar concepts can be applied in the future to broader placement controls with cross-cluster federated services. 

Get Involved

With interest growing in stateful apps, work has already started to further enhance Kubernetes storage. The Storage Special Interest Group is discussing proposals to support local storage resources. Diamanti is looking forward to extending FlexVolume to include richer APIs that enable local storage and storage services including data protection, replication, and reduction. We’re also working on proposals for improved app placement, migration, and failover across container clusters through Kubernetes cross-cluster federated services.

Join the conversation and contribute! Here are some places to get started:

– Mark Balch, VP Products, Diamanti. Twitter @markbalch

Thousand Instances of Cassandra using Kubernetes Pet Set

July 13 2016

Editor’s note: this post is part of a series of in-depth articles on what’s new in Kubernetes 1.3

Running The Greek Pet Monster Races

For the Kubernetes 1.3 launch, we wanted to put the new Pet Set through its paces. By testing a thousand instances of Cassandra, we could make sure that Kubernetes 1.3 was production ready. Read on for how we adapted Cassandra to Kubernetes, and had our largest deployment ever.

It’s fairly straightforward to use containers with basic stateful applications today. Using a persistent volume, you can mount a disk in a pod, and ensure that your data lasts beyond the life of your pod. However, with deployments of distributed stateful applications, things can become more tricky. With Kubernetes 1.3, the new Pet Set component makes everything much easier. To test this new feature out at scale, we decided to host the Greek Pet Monster Races! We raced Centaurs and other Ancient Greek Monsters over hundreds of thousands of races across multiple availability zones.

As many of you know, Kubernetes comes from the Ancient Greek κυβερνήτης, meaning helmsman, pilot, steersman, or ship master. In order to keep track of race results, we needed a data store, and we chose Cassandra: Κασσάνδρα, daughter of King Priam and Queen Hecuba of Troy. With so many references to the ancient Greek language, we thought it would be appropriate to race ancient Greek monsters.

From there the story kind of goes sideways, because Cassandra was actually our Pets as well. Read on and we will explain.

One of the exciting new features in Kubernetes 1.3 is Pet Set. In order to organize the deployment of containers inside of Kubernetes, different deployment mechanisms are available; examples of these components include Replication Controllers and Daemon Sets. Pet Set is a new feature that delivers the capability to deploy containers, as Pets, inside of Kubernetes. Pet Set provides a guarantee of identity for various aspects of the pet / pod deployment: DNS name, consistent storage, and ordered pod indexing. Previously, components like Deployments and Replication Controllers would only deploy an application with a weak, uncoupled identity. A weak identity is great for managing applications such as microservices, where service discovery is important, the application is stateless, and the naming of individual pods does not matter. Many software applications do require strong identity, including many different types of distributed stateful systems. Cassandra is a great example of a distributed application that requires consistent network identity and stable storage.

Pet Set provides the following capabilities:

  • A stable hostname, available to others in DNS. The number is based on the Pet Set name and starts at zero; for example, cassandra-0.
  • An ordinal index of Pets. 0, 1, 2, 3, etc.
  • Stable storage linked to the ordinal and hostname of the Pet.
  • Peer discovery is available via DNS. With Cassandra, the names of the peers are known before the Pets are created (see the lookup sketch after this list).
  • Startup and teardown ordering. It is known which numbered Pet will be created next, and which Pet will be destroyed when the Pet Set size is reduced. This is useful for admin tasks such as draining data from a Pet when reducing the size of a cluster.
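Here is a rough sketch of what that stable naming looks like for the Cassandra Pet Set defined later in this post (the default namespace is assumed; the pattern is <pet>.<governing service>.<namespace>.svc.cluster.local):

# pets get predictable DNS names before they exist, so peers can be listed up front:
#   cassandra-0.cassandra.default.svc.cluster.local
#   cassandra-1.cassandra.default.svc.cluster.local
#   cassandra-2.cassandra.default.svc.cluster.local
# from inside the cluster, each name resolves to that Pet's pod IP:
nslookup cassandra-0.cassandra.default.svc.cluster.local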

If your application has one or more of these requirements, then it may be a candidate for Pet Set.
A relevant analogy is that a Pet Set is composed of Pet dogs. If you have a white, brown, or black dog and the brown dog runs away, you can replace it with another brown dog and no one would notice. If instead you keep replacing your dogs with only white dogs, then someone would notice. Pet Set allows your application to maintain the unique identity or hair color of your Pets.

Example workloads for Pet Set:

  • Clustered software like Cassandra, Zookeeper, etcd, or Elastic require stable membership.
  • Databases like MySQL or PostgreSQL that require a single instance attached to a persistent volume at any time.

Only use Pet Set if your application requires some or all of these properties. Managing pods as stateless replicas is vastly easier.

So back to our races!

As we have mentioned, Cassandra was a perfect candidate to deploy via a Pet Set. A Pet Set is much like a Replication Controller with a few new bells and whistles. Here’s an example YAML manifest:

Headless service to provide DNS lookup

apiVersion: v1
kind: Service
metadata:
  labels:
    app: cassandra
  name: cassandra
spec:
  clusterIP: None
  ports:
    - port: 9042
  selector:
    app: cassandra-data
---
# new API name
apiVersion: "apps/v1alpha1"
kind: PetSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  # replicas are the same as used by Replication Controllers
  # except pets are deployed in order 0, 1, 2, 3, etc
  replicas: 5
  template:
    metadata:
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
      labels:
        app: cassandra-data
    spec:
      # just as with other components in Kubernetes, one
      # or more containers are deployed
      containers:
      - name: cassandra
        image: "cassandra-debian:v1.1"
        imagePullPolicy: Always
        ports:
        - containerPort: 7000
          name: intra-node
        - containerPort: 7199
          name: jmx
        - containerPort: 9042
          name: cql
        resources:
          limits:
            cpu: "4"
            memory: 11Gi
          requests:
            cpu: "4"
            memory: 11Gi
        securityContext:
          privileged: true
        env:
          - name: MAX_HEAP_SIZE
            value: 8192M
          - name: HEAP_NEWSIZE
            value: 2048M
          # this is relying on the guaranteed network identity of Pet Sets; we
          # will know the names of the Pets / Pods before they are created
          - name: CASSANDRA_SEEDS
            value: "cassandra-0.cassandra.default.svc.cluster.local,cassandra-1.cassandra.default.svc.cluster.local"
          - name: CASSANDRA_CLUSTER_NAME
            value: "OneKDemo"
          - name: CASSANDRA_DC
            value: "DC1-Data"
          - name: CASSANDRA_RACK
            value: "OneKDemo-Rack1-Data"
          - name: CASSANDRA_AUTO_BOOTSTRAP
            value: "false"
          # this variable is used by the read-probe looking
          # for the IP Address in a `nodetool status` command
          - name: POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - /ready-probe.sh
          initialDelaySeconds: 15
          timeoutSeconds: 5
        # These volume mounts are persistent. They are like inline claims,
        # but not exactly because the names need to match exactly one of
        # the pet volumes.
        volumeMounts:
        - name: cassandra-data
          mountPath: /cassandra_data
  # These are converted to volume claims by the controller
  # and mounted at the paths mentioned above. Storage can be automatically
  # created for the Pets depending on the cloud environment.
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 380Gi

You may notice that these containers are on the rather large side; it is not unusual to run Cassandra in production with 8 CPUs and 16GB of RAM. There are two key new features used above: dynamic volume provisioning and, of course, Pet Set. The above manifest will create 5 Cassandra Pets / Pods numbered starting at 0: cassandra-0, cassandra-1, etc.
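As a quick sanity check (a sketch; exact output columns vary by client version), you can watch the Pets come up in ordinal order:

$ kubectl get pods -l app=cassandra-data --watch
# cassandra-0 is created first and must be Running and Ready
# before cassandra-1 is started, and so on through cassandra-4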

In order to generate data for the races, we used another Kubernetes feature called Jobs. Simple Python code was written to generate the random speed of the monster for every second of the race. Then that data, position information, winners, other data points, and metrics were stored in Cassandra. To visualize the data, we used JHipster to generate an AngularJS UI with Java services, and then used D3 for graphing.

An example of one of the Jobs:

apiVersion: batch/v1
kind: Job
metadata:
  name: pet-race-giants
  labels:
    name: pet-races
spec:
  parallelism: 2
  completions: 4
  template:
    metadata:
      name: pet-race-giants
      labels:
        name: pet-races
    spec:
      containers:
      - name: pet-race-giants
        image: py3numpy-job:v1.0
        command: ["pet-race-job", "--length=100", "--pet=Giants", "--scale=3"]
        resources:
          limits:
            cpu: "2"
          requests:
            cpu: "2"
      restartPolicy: Never

Since we are talking about Monsters, we had to go big. We deployed 1,009 minion nodes on Google Compute Engine (GCE), spread across 4 zones, running a custom version of the Kubernetes 1.3 beta. We ran this demo on beta code since the demo was being set up before the 1.3 release date. For the minion nodes, the GCE n1-standard-8 machine size was chosen, which is a VM with 8 virtual CPUs and 30GB of memory. This allows a single instance of Cassandra to run on each node, which is recommended for disk I/O.

Then the pets were deployed! One thousand of them, in two different Cassandra Data Centers. Cassandra’s distributed architecture is specifically tailored for multiple-data-center deployment. Often multiple Cassandra data centers are deployed inside the same physical or virtual data center, in order to separate workloads. Data is replicated across all data centers, but workloads can be different between data centers and thus application tuning can be different. Data centers named ‘DC1-Analytics’ and ‘DC1-Data’ were deployed with 500 pets each. The race data was created by the Python batch Jobs connected to DC1-Data, and the JHipster UI was connected to DC1-Analytics.

Here are the final numbers:

  • 8,072 Cores. The master used 24, minion nodes used the rest
  • 1,009 IP addresses
  • 1,009 routes setup by Kubernetes on Google Cloud Platform
  • 100,510 GB persistent disk used by the Minions and the Master
  • 380,020 GB of SSD persistent disk: 20 GB for the master and 380 GB per Cassandra Pet.
  • 1,000 deployed instances of Cassandra. Yes, we deployed 1,000 pets, but one really did not want to join the party! Technically, with the Cassandra setup, we could have lost 333 nodes without service or data loss.

Limitations with Pet Sets in 1.3 Release

  • Pet Set is an alpha resource not available in any Kubernetes release prior to 1.3.
  • The storage for a given pet must either be provisioned by a dynamic storage provisioner based on the requested storage class, or pre-provisioned by an admin.
  • Deleting the Pet Set will not delete any Pets or Pet storage. You will need to delete your Pets, and possibly their storage, by hand.
  • All Pet Sets currently require a “governing service”, or a Service responsible for the network identity of the pets. The user is responsible for this Service.
  • Updating an existing Pet Set is currently a manual process. You either need to deploy a new Pet Set with the new image version or orphan Pets one by one and update their image, which will join them back to the cluster.

Resources and References

  • The source code for the demo is available on GitHub: (Pet Set examples will be merged into the Kubernetes Cassandra Examples).
  • More information about Jobs
  • Documentation for Pet Set
  • Image credits: Cassandra image and Cyclops image

– Chris Love, Senior DevOps Open Source Consultant for Datapipe. Twitter @chrislovecnm

Kubernetes in Rancher: the further evolution

July 12 2016

Editor’s note: today’s guest post is from Alena Prokharchyk, Principal Software Engineer at Rancher Labs, who’ll share how they are incorporating new Kubernetes features into their platform.

Kubernetes was the first external orchestration platform supported by Rancher, and since its release, it has become one of the most widely used among our users, and continues to grow rapidly in adoption. As Kubernetes has evolved, so has Rancher in terms of adapting new Kubernetes features. We’ve started with supporting Kubernetes version 1.1, then switched to 1.2 as soon as it was released, and now we’re working on supporting the exciting new features in 1.3. I’d like to walk you through the features that we’ve been adding support for during each of these stages.

Rancher and Kubernetes 1.2

Kubernetes 1.2 introduced the enhanced Ingress object to simplify allowing inbound connections to reach cluster services; here’s an excellent blog post about ingress policies. The Ingress resource lets users define host name routing rules and TLS config for the Load Balancer in a user-friendly way. It then needs to be backed by an Ingress controller that configures the corresponding cloud provider’s Load Balancer with the Ingress rules. Since Rancher already included a software-defined Load Balancer based on HAProxy, we already supported all of the configuration requirements of the Ingress resource and didn’t have to make any changes on the Rancher side to adopt Ingress. What we had to do was write an Ingress controller that would listen to Kubernetes Ingress-specific events, configure the Rancher Load Balancer accordingly, and propagate the Load Balancer public entry point back to Kubernetes.


Now, the Ingress controller gets deployed as a part of our Rancher Kubernetes system stack, and is managed by Rancher. Rancher monitors Ingress controller health, and recreates it in case of any failures. In addition to standard Ingress features, Rancher also lets you horizontally scale the Load Balancer supporting the Ingress service by specifying scale via Ingress annotations. For example:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: scalelb
  annotations:
    scale: "2"
spec:
  rules:
  - host: foo.bar.com
    http:
      paths:
      - path: /foo
        backend:
          serviceName: nginx-service
          servicePort: 80

As a result of the above, 2 instances of Rancher Load Balancer will get started on separate hosts, and Ingress will get updated with 2 public ip addresses:

kubectl get ingress
NAME      RULE          BACKEND            ADDRESS
scalelb   -                                104.154.107.202, 104.154.107.203
          foo.bar.com
          /foo          nginx-service:80

(The two addresses are the IPs of the hosts where the Rancher LB instances are deployed.)

More details on Rancher Ingress Controller implementation for Kubernetes can be found here:

Rancher and Kubernetes 1.3

We’re very excited about the Kubernetes 1.3 release, and all the new features that are included with it. There are two that we are especially interested in: Stateful Apps and Cluster Federation.

Kubernetes Stateful Apps

Stateful Apps is a new Kubernetes resource representing a set of pods in a stateful application. This is an alternative to using Replication Controllers, which are best leveraged for running stateless apps. The feature is specifically useful for apps that rely on quorum with leader election (such as MongoDB, Zookeeper, etcd) or a decentralized quorum (Cassandra). Stateful Apps creates and maintains a set of pods, each of which has a stable network identity. In order to provide that network identity, it must be possible to have a resolvable DNS name for the pod that is tied to the pod identity, as per the Kubernetes design doc:

# service mongo pointing to pods created by PetSet mdb, with identities mdb-1, mdb-2, mdb-3
dig mongodb.namespace.svc.cluster.local +short A
172.130.16.50

dig mdb-1.mongodb.namespace.svc.cluster.local +short A
# IP of pod created for mdb-1

dig mdb-2.mongodb.namespace.svc.cluster.local +short A
# IP of pod created for mdb-2

dig mdb-3.mongodb.namespace.svc.cluster.local +short A
# IP of pod created for mdb-3

The above is implemented via an annotation on pods, which is surfaced to endpoints, and finally surfaced as DNS on the service that exposes those pods. Currently Rancher simplifies DNS configuration by leveraging Rancher DNS as a drop-in replacement for SkyDNS. Rancher DNS is fast, stable, and scalable - every host in the cluster runs a DNS server. Kubernetes services get programmed into Rancher DNS, and are resolved either to the service’s cluster IP from the 10.43.x.x address space, or to the set of Pod IP addresses for a headless service. To make PetSet work with Kubernetes via Rancher, we’ll have to add support for Pod Identities to the Rancher DNS configuration. We’re working on this now and should have it supported in one of the upcoming Rancher releases.

Cluster Federation

Cluster Federation is the control plane for federating multiple Kubernetes clusters. It offers improved application availability by spreading applications across multiple clusters.


Each Kubernetes cluster exposes an API endpoint and gets registered with Cluster Federation as part of a Federation object. Then, using the Cluster Federation API, you can create federated services. These objects are comprised of multiple equivalent underlying Kubernetes resources. Assuming that three clusters belong to the same Federation object, each Service created via Cluster Federation will get an equivalent service created in each of the clusters. In addition, a Cluster Federation service gets a publicly resolvable DNS name that resolves to the Kubernetes services’ public IP addresses (the DNS record gets programmed into one of the supported public DNS providers).


To support Cluster Federation via Kubernetes in Rancher, certain changes need to be made. Today each Kubernetes cluster is represented as a Rancher environment. In each Kubernetes environment, we create a full Kubernetes system stack comprised of several services: the Kubernetes API server, Scheduler, Ingress controller, persistent etcd, Controller manager, Kubelet and Proxy (the last two run on every host). To set up Cluster Federation, we will create one extra environment where the Cluster Federation stack will run.


Then every underlying Kubernetes cluster, represented by a Rancher environment, has to be registered with a specific Cluster Federation. Potentially, each cluster can be auto-discovered by the Rancher Cluster Federation environment via a label on the Kubernetes cluster representing the federation name. We’re still finalizing our design, but we’re very excited by this feature and see a lot of use cases it can solve. Cluster Federation doc references:

Plans for Kubernetes 1.4

When we launched Kubernetes support in Rancher we decided to maintain our own distribution of Kubernetes in order to support Rancher’s native networking. We were aware that by having our own distribution, we’d need to update it every time there were changes made to Kubernetes, but we felt it was necessary to support the use cases we were working on for users. As part of our work for 1.4 we looked at our networking approach again, and re-analyzed the initial need for our own fork of Kubernetes. Other than the networking integration, all of the work we’ve done with Kubernetes has been developed as a Kubernetes plugin:

  • Rancher as a CloudProvider (to support Load Balancers).
  • Rancher as a CredentialProvider (to support Rancher private registries).
  • Rancher Ingress controller to back up Kubernetes ingress resource.

So we’ve decided to eliminate the need for a separate Rancher Kubernetes distribution, and to try to upstream all our changes to the Kubernetes repo. To do that, we will be reworking our networking integration and supporting Rancher networking as a CNI plugin for Kubernetes. More details on that will be shared as soon as the feature design is finalized, but expect it to come in the next 2-3 months. We will also continue investing in Rancher’s core capabilities integrated with Kubernetes, including, but not limited to:

  • Access rights management via Rancher environment that represents Kubernetes cluster
  • Credential management and easy web-based access to standard kubectl cli
  • Load Balancing support
  • Rancher internal DNS support
  • Catalog support for Kubernetes templates
  • Enhanced UI to represent even more Kubernetes objects like: Deployment, Ingress, Daemonset.

All of that is to make the Kubernetes experience even more powerful and intuitive for users. We’re so excited by all of the progress in the Kubernetes community, and thrilled to be participating. Kubernetes 1.3 is an incredibly significant release, and you’ll be able to upgrade to it very soon within Rancher.

– Alena Prokharchyk, Principal Software Engineer, Rancher Labs. Twitter @lemonjet & GitHub alena1108


Autoscaling in Kubernetes

July 12 2016

Editor’s note: this post is part of a series of in-depth articles on what’s new in Kubernetes 1.3

Customers using Kubernetes respond to end user requests quickly and ship software faster than ever before. But what happens when you build a service that is even more popular than you planned for, and run out of compute? In Kubernetes 1.3, we are proud to announce that we have a solution: autoscaling. On Google Compute Engine (GCE) and Google Container Engine (GKE) (and coming soon on AWS), Kubernetes will automatically scale up your cluster as soon as you need it, and scale it back down to save you money when you don’t.

Benefits of Autoscaling

To understand better where autoscaling would provide the most value, let’s start with an example. Imagine you have a 24/7 production service with a load that is variable in time, where it is very busy during the day in the US, and relatively low at night. Ideally, we would want the number of nodes in the cluster and the number of pods in deployment to dynamically adjust to the load to meet end user demand. The new Cluster Autoscaling feature together with Horizontal Pod Autoscaler can handle this for you automatically.

Setting Up Autoscaling on GCE

The following instructions apply to GCE. For GKE please check the autoscaling section in cluster operations manual available here.

Before we begin, we need to have an active GCE project with Google Cloud Monitoring, Google Cloud Logging and Stackdriver enabled. For more information on project creation, please read our Getting Started Guide. We also need to download a recent version of Kubernetes project (version v1.3.0 or later).

First, we set up a cluster with Cluster Autoscaler turned on. The number of nodes in the cluster will start at 2, and autoscale up to a maximum of 5. To implement this, we’ll export the following environment variables:

export NUM_NODES=2
export KUBE_AUTOSCALER_MIN_NODES=2
export KUBE_AUTOSCALER_MAX_NODES=5
export KUBE_ENABLE_CLUSTER_AUTOSCALER=true

and start the cluster by running:

./cluster/kube-up.sh

The kube-up.sh script creates a cluster together with Cluster Autoscaler add-on. The autoscaler will try to add new nodes to the cluster if there are pending pods which could schedule on a new node.

Let’s see our cluster, it should have two nodes:

$ kubectl get nodes
NAME                           STATUS                     AGE
kubernetes-master              Ready,SchedulingDisabled   2m
kubernetes-minion-group-de5q   Ready                      2m
kubernetes-minion-group-yhdx   Ready                      1m

Run & Expose PHP-Apache Server

To demonstrate autoscaling we will use a custom Docker image based on the php-apache server. The image can be found here. It defines an index.php page which performs some CPU-intensive computations.

First, we’ll start a deployment running the image and expose it as a service:

$ kubectl run php-apache \
  --image=gcr.io/google_containers/hpa-example \
  --requests=cpu=500m,memory=500M --expose --port=80
service "php-apache" created
deployment "php-apache" created

Now, we will wait some time and verify that both the deployment and the service were correctly created and are running:

$ kubectl get deployment
NAME         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
php-apache   1         1         1            1           49s

$ kubectl get pods
NAME                          READY     STATUS    RESTARTS   AGE
php-apache-2046965998-z65jn   1/1       Running   0          30s

We may now check that php-apache server works correctly by calling wget with the service’s address:

$ kubectl run -i --tty service-test --image=busybox /bin/sh
Hit enter for command prompt
$ wget -q -O- http://php-apache.default.svc.cluster.local
OK!

Starting Horizontal Pod Autoscaler

Now that the deployment is running, we will create a Horizontal Pod Autoscaler for it. To create it, we will use kubectl autoscale command, which looks like this:

$ kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10

This defines a Horizontal Pod Autoscaler that maintains between 1 and 10 replicas of the Pods controlled by the php-apache deployment we created in the first step of these instructions. Roughly speaking, the horizontal autoscaler will increase and decrease the number of replicas (via the deployment) so as to maintain an average CPU utilization across all Pods of 50% (since each pod requests 500 milli-cores by kubectl run, this means average CPU usage of 250 milli-cores). See here for more details on the algorithm.
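In other words (a simplified view of the algorithm linked above), the autoscaler picks the replica count that brings average utilization back to the target:

desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)
# e.g. 1 replica observed at 310% CPU utilization with a 50% target:
# ceil(1 * 310 / 50) = 7 replicas, which is exactly the scale-up we see later in this walkthrough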

We may check the current status of autoscaler by running:

$ kubectl get hpa
NAME         REFERENCE                     TARGET    CURRENT   MINPODS   MAXPODS   AGE
php-apache   Deployment/php-apache/scale   50%       0%        1         20        14s

Please note that the current CPU consumption is 0% as we are not sending any requests to the server (the CURRENT column shows the average across all the pods controlled by the corresponding replication controller).

Raising the Load

Now, we will see how our autoscalers (Cluster Autoscaler and Horizontal Pod Autoscaler) react to the increased load on the server. We will start two infinite loops of queries to our server (please run them in different terminals):

$ kubectl run -i --tty load-generator --image=busybox /bin/sh  
Hit enter for command prompt  
$ while true; do wget -q -O- http://php-apache.default.svc.cluster.local; done

We need to wait a moment (about one minute) for stats to propagate. Afterwards, we will examine status of Horizontal Pod Autoscaler:

$ kubectl get hpa
NAME         REFERENCE                     TARGET    CURRENT   MINPODS   MAXPODS   AGE
php-apache   Deployment/php-apache/scale   50%       310%      1         20        2m

$ kubectl get deployment php-apache
NAME              DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
php-apache        7         7         7            3           4m

Horizontal Pod Autoscaler has increased the number of pods in our deployment to 7. Let’s now check, if all the pods are running:

jsz@jsz-desk2:~/k8s-src$ kubectl get pods
php-apache-2046965998-3ewo6        0/1       Pending   0          1m
php-apache-2046965998-8m03k        1/1       Running   0          1m
php-apache-2046965998-ddpgp        1/1       Running   0          5m
php-apache-2046965998-lrik6        1/1       Running   0          1m
php-apache-2046965998-nj465        0/1       Pending   0          1m
php-apache-2046965998-tmwg1        1/1       Running   0          1m
php-apache-2046965998-xkbw1        0/1       Pending   0          1m

As we can see, some pods are pending. Let’s describe one of the pending pods to get the reason for the pending state:

$ kubectl describe pod php-apache-2046965998-3ewo6
Name: php-apache-2046965998-3ewo6
Namespace: default
...
Events:
  FirstSeen From SubobjectPath Type Reason Message

  1m {default-scheduler } Warning FailedScheduling pod (php-apache-2046965998-3ewo6) failed to fit in any node
fit failure on node (kubernetes-minion-group-yhdx): Insufficient CPU
fit failure on node (kubernetes-minion-group-de5q): Insufficient CPU

  1m {cluster-autoscaler } Normal TriggeredScaleUp pod triggered scale-up, mig: kubernetes-minion-group, sizes (current/new): 2/3

The pod is pending because there was not enough CPU available in the cluster for it. We can also see there’s a TriggeredScaleUp event connected with the pod. It means that the pod triggered a reaction from the Cluster Autoscaler and a new node will be added to the cluster. Now we’ll wait for the reaction (about 3 minutes) and list all nodes:

$ kubectl get nodes
NAME                           STATUS                     AGE
kubernetes-master              Ready,SchedulingDisabled   9m
kubernetes-minion-group-6z5i   Ready                      43s
kubernetes-minion-group-de5q   Ready                      9m
kubernetes-minion-group-yhdx   Ready                      9m

As we can see, a new node, kubernetes-minion-group-6z5i, was added by the Cluster Autoscaler. Let’s verify that all pods are now running:

$ kubectl get pods
NAME                               READY     STATUS    RESTARTS   AGE
php-apache-2046965998-3ewo6        1/1       Running   0          3m
php-apache-2046965998-8m03k        1/1       Running   0          3m
php-apache-2046965998-ddpgp        1/1       Running   0          7m
php-apache-2046965998-lrik6        1/1       Running   0          3m
php-apache-2046965998-nj465        1/1       Running   0          3m
php-apache-2046965998-tmwg1        1/1       Running   0          3m
php-apache-2046965998-xkbw1        1/1       Running   0          3m

After the node addition all php-apache pods are running!

Stop Load

We will finish our example by stopping the user load. We’ll terminate both infinite while loops sending requests to the server and verify the result state:

$ kubectl get hpa
NAME         REFERENCE                     TARGET    CURRENT   MINPODS   MAXPODS   AGE
php-apache   Deployment/php-apache/scale   50%       0%        1         10        16m

$ kubectl get deployment php-apache
NAME              DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
php-apache        1         1         1            1           14m

As we see, in the presented case CPU utilization dropped to 0, and the number of replicas dropped to 1.

After deleting pods most of the cluster resources are unused. Scaling the cluster down may take more time than scaling up because Cluster Autoscaler makes sure that the node is really not needed so that short periods of inactivity (due to pod upgrade etc) won’t trigger node deletion (see cluster autoscaler doc). After approximately 10-12 minutes you can verify that the number of nodes in the cluster dropped:

$ kubectl get nodes
NAME                           STATUS                     AGE
kubernetes-master              Ready,SchedulingDisabled   37m
kubernetes-minion-group-de5q   Ready                      36m
kubernetes-minion-group-yhdx   Ready                      36m

The number of nodes in our cluster is now two again as node kubernetes-minion-group-6z5i was removed by Cluster Autoscaler.

Other use cases

As we have shown, it is very easy to dynamically adjust the number of pods to the load using a combination of Horizontal Pod Autoscaler and Cluster Autoscaler.

However, Cluster Autoscaler alone can also be quite helpful whenever there are irregularities in the cluster load. For example, clusters used for development or continuous integration tests may be less needed on weekends or at night. Batch processing clusters may have periods when all jobs are over and new ones will only start in a couple of hours. Having machines that do nothing is a waste of money.

In all of these cases Cluster Autoscaler can reduce the number of unused nodes and give quite significant savings, because you will only pay for the nodes that you actually need to run your pods. It also makes sure that you always have enough compute power to run your tasks.

– Jerzy Szczepkowski and Marcin Wielgus, Software Engineers, Google

rktnetes brings rkt container engine to Kubernetes

July 11 2016

Editor’s note: this post is part of a series of in-depth articles on what’s new in Kubernetes 1.3 

As part of Kubernetes 1.3, we’re happy to report that our work to bring interchangeable container engines to Kubernetes is bearing early fruit. What we affectionately call “rktnetes” is included in the version 1.3 Kubernetes release, and is ready for development use. rktnetes integrates support for CoreOS rkt into Kubernetes as the container runtime on cluster nodes, and is now part of the mainline Kubernetes source code. Today it’s easier than ever for developers and ops pros with container portability in mind to try out running Kubernetes with a different container engine.

“We find CoreOS’s rkt a compelling container engine in Kubernetes because of how rkt is composed with the underlying systemd,” said Mark Petrovic, senior MTS and architect at Xoom, a PayPal service. “The rkt runtime assumes only the responsibility it needs to, then delegates to other system services where appropriate. This separation of concerns is important to us.”

What’s rktnetes?

rktnetes is the nickname given to the code that enables Kubernetes nodes to execute application containers with the rkt container engine, rather than with Docker. This change adds new abilities to Kubernetes, for instance running containers under flexible levels of isolation. rkt explores an alternative approach to container runtime architecture, aimed to reflect the Unix philosophy of cleanly separated, modular tools. Work done to support rktnetes also opens up future possibilities for Kubernetes, such as multiple container image format support, and the integration of other container runtimes tailored for specific use cases or platforms.

Why does Kubernetes need rktnetes?

rktnetes is about more than just rkt. It’s also about refining and exercising Kubernetes interfaces, and paving the way for other modular runtimes in the future. While the Docker container engine is well known, and is currently the default Kubernetes container runtime, a number of benefits derive from pluggable container environments. Some clusters may call for very specific container engine implementations, for example, and ensuring the Kubernetes design is flexible enough to support alternate runtimes, starting with rkt, helps keep the interfaces between components clean and simple.

Separation of concerns: Decomposing the monolithic container daemon

The current container runtime used by Kubernetes imposes a number of design decisions. Experimenting with other container execution architectures is worthwhile in such a rapidly evolving space. Today, when Kubernetes sends a request to a node to start running a pod, it communicates through the kubelet on each node with the default container runtime’s central daemon, responsible for managing all of the node’s containers.

rkt does not implement a monolithic container management daemon. (It is worth noting that the default container runtime is in the midst of refactoring its original monolithic architecture.) The rkt design has from day one tried to apply the principle of modularity to the fullest, including reusing well-tested system components, rather than reimplementing them.

The task of building container images is abstracted away from the container runtime core in rkt, and implemented by an independent utility. The same approach is taken to ongoing container lifecycle management. A single binary, rkt, configures the environment and prepares container images for execution, then sets the container application and its isolation environment running. At this point, the rkt program has done its “one job”, and the container isolator takes over.

The API for querying container engine and pod state, used by Kubernetes to track cluster work on each node, is implemented in a separate service, isolating coordination and orchestration features from the core container runtime. While the API service does not fully implement all the API features of the current default container engine, it already helps isolate containers from failures and upgrades in the core runtime, and provides the read-only parts of the expected API for querying container metadata.

Modular container isolation levels

With rkt managing container execution, Kubernetes can take advantage of the CoreOS container engine’s modular stage1 isolation mechanism. The typical container runs under rkt in a software-isolated environment constructed from Linux kernel namespaces, cgroups, and other facilities. Containers isolated in this common way nevertheless share a single kernel with all the other containers on a system, making for lightweight isolation of running apps.

However, rkt features pluggable isolation environments, referred to as stage1s, to change how containers are executed and isolated. For example, the rkt fly stage1 runs containers in the host namespaces (PID, mount, network, etc), granting containers greater power on the host system. Fly is used for containerizing lower-level system and network software, like the kubelet itself. At the other end of the isolation spectrum, the KVM stage1 runs standard app containers as individual virtual machines, each above its own Linux kernel, managed by the KVM hypervisor. This isolation level can be useful for high security and multi-tenant cluster workloads.

Currently, rktnetes can use the KVM stage1 to execute all containers on a node with VM isolation by setting the kubelet’s --rkt-stage1-image option. Experimental work exists to choose the stage1 isolation regime on a per-pod basis with a Kubernetes annotation declaring the pod’s appropriate stage1. KVM containers and standard Linux containers can be mixed together in the same cluster.
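For illustration, here is roughly what that kubelet configuration looks like; the stage1 image name below is an assumption and depends on how rkt is installed on your nodes:

# run the kubelet with rkt as the container runtime and a KVM stage1
# (the stage1 image name is illustrative; use the one shipped with your rkt build)
kubelet --container-runtime=rkt \
        --rkt-stage1-image=coreos.com/rkt/stage1-kvm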

How rkt works with Kubernetes

Kubernetes today talks to the default container engine over an API provided by the Docker daemon. rktnetes communicates with rkt a little bit differently. First, there is a distinction between how Kubernetes changes the state of a node’s containers – how it starts and stops pods, or reschedules them for failover or scaling – and how the orchestrator queries pod metadata for regular, read-only bookkeeping. Two different facilities implement these two different cases.

Managing microservice lifecycles

The kubelet on each cluster node communicates with rkt to prepare containers and their environments into pods, and with systemd, the linux service management framework, to invoke and manage the pod processes. Pods are then managed as systemd services, and the kubelet sends systemd commands over dbus to manipulate them. Lifecycle management, such as restarting failed pods and killing completed processes, is handled by systemd, at the kubelet’s behest.

The API service for reading pod data

A discrete rkt api-service implements the pod introspection mechanisms expected by Kubernetes. While each node’s kubelet uses systemd to start, stop, and restart pods as services, it contacts the API service to read container runtime metadata. This includes basic orchestration information such as the number of pods running on the node, the names and networks of those pods, and the details of pod configuration, resource limits and storage volumes (think of the information shown by the kubectl describe subcommand).

Pod logs, having been written to journal files, are made available for kubectl logs and other forensic subcommands by the API service as well, which reads from log files to provide pod log data to the kubelet for answering control plane requests.

This dual interface to the container environment is an area of very active development, and plans are for the API service to expand to provide methods for the pod manipulation commands. The underlying mechanism will continue to keep separation of concerns in mind, but will hide more of this from the kubelet. The methods the kubelet uses to control the rktnetes container engine will grow less different from the default container runtime interface over time.

Try rktnetes

So what can you do with rktnetes today? Currently, rktnetes passes all of the applicable Kubernetes “end-to-end” (aka “e2e”) tests, provides standard metrics to cAdvisor, manages networks using CNI, handles per-container/pod logs, and automatically garbage collects old containers and images. Kubernetes running on rkt already provides more than the basics of a modular, flexible container runtime for Kubernetes clusters, and it is already a functional part of our development environment at CoreOS.

Developers and early adopters can follow the known issues in the rktnetes notes to get an idea  of the wrinkles and bumps test-drivers can expect to encounter. This list groups the high-level pieces required to bring rktnetes to feature parity with the existing container runtime and API. We hope you’ll try out rktnetes in your Kubernetes clusters, too.

Use rkt with Kubernetes Today

The introductory guide Running Kubernetes on rkt walks through the steps to spin up a rktnetes cluster, from kubelet --container-runtime=rkt to networking and starting pods. This intro also sketches the configuration you’ll need to start a cluster on GCE with the Kubernetes kube-up.sh script.

Recent work aims to make rktnetes cluster creation much easier, too. While not yet merged, an in-progress pull request creates a single rktnetes configuration toggle to select rkt as the container engine when deploying a Kubernetes cluster with the coreos-kubernetes configuration tools. You can also check out the rktnetes workshop project, which launches a single-node rktnetes cluster on just about any developer workstation with one vagrant up command.

We’re excited to see the experiments the wider Kubernetes and CoreOS communities devise to put rktnetes to the test, and we welcome your input – and pull requests!

–Yifan Gu and Josh Wood, rktnetes Team, CoreOS. Twitter @CoreOSLinux.

Minikube: easily run Kubernetes locally

July 11 2016

Editor’s note: This is the first post in a series of in-depth articles on what’s new in Kubernetes 1.3 

While Kubernetes is one of the best tools for managing containerized applications available today, and has been production-ready for over a year, Kubernetes has been missing a great local development platform.

For the past several months, several of us from the Kubernetes community have been working to fix this in the Minikube repository on GitHub. Our goal is to build an easy-to-use, high-fidelity Kubernetes distribution that can be run locally on Mac, Linux and Windows workstations and laptops with a single command.

Thanks to lots of help from members of the community, we’re proud to announce the official release of Minikube. This release comes with support for Kubernetes 1.3, new commands to make interacting with your local cluster easier and experimental drivers for xhyve (on Mac OSX) and KVM (on Linux).

Using Minikube

Minikube ships as a standalone Go binary, so installing it is as simple as downloading Minikube and putting it on your path:

Minikube currently requires that you have VirtualBox installed, which you can download here.

(This is for Mac; for Linux, substitute “minikube-darwin-amd64” with “minikube-linux-amd64”.)

curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-darwin-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/

To start a Kubernetes cluster in Minikube, use the minikube start command:

$ minikube start
Starting local Kubernetes cluster...
Kubernetes is available at https://192.168.99.100:443
Kubectl is now configured to use the cluster

At this point, you have a running single-node Kubernetes cluster on your laptop! Minikube also configures kubectl for you, so you’re also ready to run containers with no changes.

Minikube creates a Host-Only network interface that routes to your node. To interact with running pods or services, you should send traffic over this address. To find out this address, you can use the minikube ip command:
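For example (the address comes from the start output above; yours will likely differ):

$ minikube ip
192.168.99.100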

Minikube also comes with the Kubernetes Dashboard. To open this up in your browser, you can use the built-in minikube dashboard command:
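For example, this opens the Dashboard in your default browser:

$ minikube dashboard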

In general, Minikube supports everything you would expect from a Kubernetes cluster. You can use kubectl exec to get a bash shell inside a pod in your cluster. You can use the kubectl port-forward and kubectl proxy commands to forward traffic from localhost to a pod or the API server.
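For example (the pod name and ports here are placeholders):

# shell into a running pod
$ kubectl exec -it <pod-name> -- /bin/bash

# forward local port 8080 to port 80 inside the pod
$ kubectl port-forward <pod-name> 8080:80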

Since Minikube is running locally instead of on a cloud provider, certain provider-specific features like LoadBalancers and PersistentVolumes will not work out-of-the-box. However, you can use NodePort Services and HostPath PersistentVolumes.
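As a minimal sketch (the names and ports are hypothetical), a NodePort Service makes an app reachable at <minikube ip>:<nodePort>:

apiVersion: v1
kind: Service
metadata:
  name: hello-svc          # hypothetical name
spec:
  type: NodePort
  selector:
    app: hello             # must match your pod labels
  ports:
  - port: 80               # port inside the cluster
    nodePort: 30080        # port exposed on the Minikube VM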

Architecture

Minikube is built on top of Docker’s libmachine, and leverages the driver model to create, manage and interact with locally-run virtual machines.

RedSpread was kind enough to donate their localkube codebase to the Minikube repo, which we use to spin up a single-process Kubernetes cluster inside a VM. Localkube bundles etcd, DNS, the Kubelet and all the Kubernetes master components into a single Go binary, and runs them all via separate goroutines.

Upcoming Features

Minikube has been a lot of fun to work on so far, and we’re always looking to improve Minikube to make the Kubernetes development experience better. If you have any ideas for features, don’t hesitate to let us know in the issue tracker

Here’s a list of some of the things we’re hoping to add to Minikube soon:

  • Native hypervisor support for OSX and Windows: we’re planning to remove the dependency on VirtualBox, and integrate with the native hypervisors included in OSX and Windows (Hypervisor.framework and Hyper-V, respectively).
  • Improved support for Kubernetes features: we’re planning to increase the range of supported Kubernetes features, to include things like Ingress.
  • Configurable versions of Kubernetes: today Minikube only supports Kubernetes 1.3. We’re planning to add support for user-configurable versions of Kubernetes, to make it easier to match what you have running in production on your laptop.

Community

We’d love to hear feedback on Minikube. To join the community:

  • Post issues or feature requests on GitHub
  • Join us in the #minikube channel on Slack

Please give Minikube a try, and let us know how it goes!

–Dan Lorenc, Software Engineer, Google

Five Days of Kubernetes 1.3

July 11 2016

Last week we released Kubernetes 1.3, two years from the day when the first Kubernetes commit was pushed to GitHub. Now, 30,000+ commits later from over 800 contributors, this 1.3 release is jam-packed with updates driven by feedback from users.

While many new improvements and features have been added in the latest release, we’ll be highlighting several that stand out. Follow along and read these in-depth posts on what’s new and how we continue to make Kubernetes the best way to manage containers at scale.

  • Day 1: Minikube: easily run Kubernetes locally; rktnetes: brings rkt container engine to Kubernetes
  • Day 2: Autoscaling in Kubernetes; Partner post: Kubernetes in Rancher, the further evolution
  • Day 3: Deploying thousand instances of Cassandra using Pet Set; Partner post: Stateful Applications in Containers, by Diamanti
  • Day 4: Cross Cluster Services; Partner post: Citrix and NetScaler CPX
  • Day 5: Dashboard - Full Featured Web Interface for Kubernetes; Partner post: Steering an Automation Platform at Wercker with Kubernetes
  • Bonus: Updates to Performance and Scalability

Connect

We’d love to hear from you and see you participate in this growing community:

  • Get involved with the Kubernetes project on GitHub 
  • Post questions (or answer questions) on Stackoverflow 
  • Connect with the community on Slack
  • Follow us on Twitter @Kubernetesio for latest updates