
Introducing Kubeflow - A Composable, Portable, Scalable ML Stack Built for Kubernetes

By Jeremy Lewi (Google), David Aronchick (Google) | Thursday, December 21, 2017

Kubernetes and Machine Learning

Kubernetes has quickly become the hybrid solution for deploying complicated workloads anywhere. While it started with just stateless services, customers have begun to move complex workloads to the platform, taking advantage of rich APIs, reliability and performance provided by Kubernetes. One of the fastest growing use cases is to use Kubernetes as the deployment platform of choice for machine learning.

Building any production-ready machine learning system involves various components, often mixing vendors and hand-rolled solutions. Connecting and managing these services for even moderately sophisticated setups introduces huge barriers of complexity in adopting machine learning. Infrastructure engineers will often spend a significant amount of time manually tweaking deployments and hand rolling solutions before a single model can be tested.

Worse, these deployments are so tied to the clusters they have been deployed to that these stacks are immobile, meaning that moving a model from a laptop to a highly scalable cloud cluster is effectively impossible without significant re-architecture. All these differences add up to wasted effort and create opportunities to introduce bugs at each transition.

Introducing Kubeflow

To address these concerns, we’re announcing the creation of the Kubeflow project, a new open source GitHub repo dedicated to making ML stacks on Kubernetes easy to use, fast and extensible. This repository contains:

JupyterHub to create & manage interactive Jupyter notebooks
A TensorFlow Custom Resource (CRD) that can be configured to use CPUs or GPUs, and adjusted to the size of a cluster with a single setting
A TF Serving container

Because this solution relies on Kubernetes, it runs wherever Kubernetes runs. Just spin up a cluster and go!

Using Kubeflow

Let's suppose you are working with two different Kubernetes clusters, a local minikube cluster and a GKE cluster with GPUs, and that you have two kubectl contexts defined, named minikube and gke.

First, we need to initialize our ksonnet application and install the Kubeflow packages. (To use ksonnet, you must first install it on your operating system; the instructions for doing so are here.)

     ks init my-kubeflow  
     cd my-kubeflow  
     ks registry add kubeflow \
       github.com/google/kubeflow/tree/master/kubeflow
     ks pkg install kubeflow/core  
     ks pkg install kubeflow/tf-serving  
     ks pkg install kubeflow/tf-job  
     ks generate core kubeflow-core --name=kubeflow-core

We can now define environments corresponding to our two clusters.

     kubectl config use-context minikube  
     ks env add minikube  

     kubectl config use-context gke  
     ks env add gke
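
If you work with more than two clusters, the same two steps could be scripted. The following is a minimal sketch and not part of Kubeflow or ksonnet: the `CONTEXTS` and `DRY_RUN` variables and the `run` and `add_envs` helpers are names made up for illustration. By default it only prints the commands rather than executing them.

```shell
#!/bin/sh
# Sketch: register one ksonnet environment per kubectl context.
# CONTEXTS, DRY_RUN, run and add_envs are illustrative names, not Kubeflow or ksonnet features.
CONTEXTS="${CONTEXTS:-minikube gke}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = also execute them

run() {
  # Echo the command, then execute it only when DRY_RUN is 0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

add_envs() {
  # Switch to each context in turn and add an environment of the same name.
  for ctx in $CONTEXTS; do
    run kubectl config use-context "$ctx"
    run ks env add "$ctx"
  done
}

add_envs
```

Once the printed commands look right, the script could be run with `DRY_RUN=0` from inside the my-kubeflow directory.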

And we’re done! Now just create the environments on your cluster. First, on minikube:

     ks apply minikube -c kubeflow-core

And to create it on our multi-node GKE cluster for quicker training:

     ks apply gke -c kubeflow-core

By making it easy to deploy the same rich ML stack everywhere, Kubeflow keeps drift and rewriting between these environments to a minimum.

To access either deployment, you can execute the following command:

     kubectl port-forward tf-hub-0 8100:8000

and then open up http://127.0.0.1:8100 to access JupyterHub. To change the environment used by kubectl, use either of these commands:

     # To access minikube  
     kubectl config use-context minikube  

     # To access GKE  
     kubectl config use-context gke

When you execute apply, you are launching the following on K8s:

JupyterHub for launching and managing Jupyter notebooks on K8s
A TF CRD

Let's suppose you want to submit a training job. Kubeflow provides ksonnet prototypes that make it easy to define components. The tf-job prototype lets you create a job for your own code, but for this example we'll use the tf-cnn prototype, which runs TensorFlow's CNN benchmark.

To submit a training job, you first generate a new job from a prototype:

     ks generate tf-cnn cnn --name=cnn

By default, the tf-cnn prototype uses 1 worker and no GPUs, which is perfect for our minikube cluster, so we can just submit it.

     ks apply minikube -c cnn

On GKE, we’ll want to tweak the prototype to take advantage of the multiple nodes and GPUs. First, let’s list all the parameters available:

     # To see a list of parameters
     ks prototype describe tf-cnn

Now let’s adjust the parameters to take advantage of GPUs and access to multiple nodes.

     ks param set --env=gke cnn num_gpus 1
     ks param set --env=gke cnn num_workers 1

     ks apply gke -c cnn

Note how we set those parameters so they are used only when you deploy to GKE. Your minikube parameters are unchanged!
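
To make the layering concrete, here is a toy model of how an environment-specific value takes precedence over a prototype default. It is not ksonnet code: `effective_param` is a hypothetical helper, and the values simply mirror the ones used in this post (1 worker and no GPUs by default, with the overrides above applied only to gke).

```shell
#!/bin/sh
# Toy model of per-environment parameter overrides; this is not ksonnet code.
# effective_param is a hypothetical helper that mirrors the values used in this post.
effective_param() {
  target="$1"   # environment name, e.g. minikube or gke
  param="$2"    # parameter name, e.g. num_gpus
  case "$target/$param" in
    gke/num_gpus)    echo 1 ;;   # override from: ks param set --env=gke cnn num_gpus 1
    gke/num_workers) echo 1 ;;   # override from: ks param set --env=gke cnn num_workers 1
    */num_gpus)      echo 0 ;;   # prototype default: no GPUs
    */num_workers)   echo 1 ;;   # prototype default: 1 worker
    *) echo "unknown parameter: $param" >&2; return 1 ;;
  esac
}

# Show the value each environment would end up with.
for name in minikube gke; do
  echo "$name: num_gpus=$(effective_param "$name" num_gpus) num_workers=$(effective_param "$name" num_workers)"
done
```

The more specific patterns are listed first, so an environment override wins whenever one exists and every other environment falls through to the default.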

After training, you export your model to a serving location.

Kubeflow also includes a serving package.

To deploy the trained model for serving, execute the following:

     ks generate tf-serving inception --name=inception \
       --namespace=default --model_path=gs://$bucket_name/$model_loc
     ks apply gke -c inception

This highlights one more option in Kubeflow - the ability to pass in inputs based on your deployment. This command creates a tf-serving service on the GKE cluster, and makes it available to your application.
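
Because the model path is assembled from two shell variables, an empty variable would silently produce a malformed path. One way to guard against that is sketched below. `model_path` is a hypothetical helper, and the bucket and location are placeholder values rather than real resources.

```shell
#!/bin/sh
# Sketch: build the --model_path value from its two pieces and fail early if one is missing.
# model_path is a hypothetical helper; the bucket and location below are placeholder values.
model_path() {
  if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
    echo "usage: model_path BUCKET MODEL_LOCATION" >&2
    return 1
  fi
  # Strip one leading slash from the location so the result has no "//" after the bucket.
  printf 'gs://%s/%s\n' "$1" "${2#/}"
}

bucket_name="${bucket_name:-example-bucket}"
model_loc="${model_loc:-models/inception}"

# Print the command instead of running it, so the value can be checked first.
echo "ks generate tf-serving inception --name=inception --namespace=default --model_path=$(model_path "$bucket_name" "$model_loc")"
```

The script only prints the ks generate command, so you can inspect the resulting path before running it yourself.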

For more information about deploying and monitoring TensorFlow training jobs and TensorFlow models, please refer to the user guide.

Kubeflow + ksonnet

One choice we want to call out is the use of the ksonnet project. We think working with multiple environments (dev, test, prod) will be the norm for most Kubeflow users. By making environments a first-class concept, ksonnet makes it easy for Kubeflow users to move their workloads between environments.

Particularly now that Helm is integrating ksonnet with the next version of their platform, we felt like it was the perfect choice for us. More information about ksonnet can be found in the ksonnet docs.

We also want to thank the team at Heptio for expediting features critical to Kubeflow's use of ksonnet.

What’s Next?

We are in the midst of building out a community effort right now, and we would love your help! We’ve already been collaborating with many teams - CaiCloud, Red Hat & OpenShift, Canonical, Weaveworks, Container Solutions and many others. CoreOS, for example, is already seeing the promise of Kubeflow:

“The Kubeflow project was a needed advancement to make it significantly easier to set up and productionize machine learning workloads on Kubernetes, and we anticipate that it will greatly expand the opportunity for even more enterprises to embrace the platform. We look forward to working with the project members in providing tight integration of Kubeflow with Tectonic, the enterprise Kubernetes platform.” -- Reza Shafii, VP of product, CoreOS

If you’d like to try out Kubeflow right now right in your browser, we’ve partnered with Katacoda to make it super easy. You can try it here!

And we’re just getting started! We would love for you to help. How, you might ask? Well…

Please join the Slack channel
Please join the kubeflow-discuss email list
Please subscribe to the Kubeflow Twitter account
Please download and run Kubeflow, and submit bugs!

Thank you for your support so far. We could not be more excited!

Note:

This article was amended in June 2023 to update the trained model bucket location.