Introducing Kubeflow - A Composable, Portable, Scalable ML Stack Built for Kubernetes
Today’s post is by David Aronchick and Jeremy Lewi, a PM and Engineer on the Kubeflow project, a new open source GitHub repo dedicated to making the use of machine learning (ML) stacks on Kubernetes easy, fast and extensible.
Kubernetes and Machine Learning
Kubernetes has quickly become the hybrid solution for deploying complicated workloads anywhere. While it started with just stateless services, customers have begun to move complex workloads to the platform, taking advantage of rich APIs, reliability and performance provided by Kubernetes. One of the fastest growing use cases is to use Kubernetes as the deployment platform of choice for machine learning.
Building any production-ready machine learning system involves various components, often mixing vendors and hand-rolled solutions. Connecting and managing these services for even moderately sophisticated setups introduces huge barriers of complexity in adopting machine learning. Infrastructure engineers will often spend a significant amount of time manually tweaking deployments and hand rolling solutions before a single model can be tested.
Worse, these deployments are so tied to the clusters they have been deployed to that these stacks are immobile, meaning that moving a model from a laptop to a highly scalable cloud cluster is effectively impossible without significant re-architecture. All these differences add up to wasted effort and create opportunities to introduce bugs at each transition.
To address these concerns, we’re announcing the creation of the Kubeflow project, a new open source GitHub repo dedicated to making the use of ML stacks on Kubernetes easy, fast and extensible. This repository contains:
- JupyterHub to create & manage interactive Jupyter notebooks
- A TensorFlow Custom Resource Definition (CRD) that can be configured to use CPUs or GPUs, and adjusted to the size of a cluster with a single setting
- A TF Serving container
Because this solution relies on Kubernetes, it runs wherever Kubernetes runs. Just spin up a cluster and go!
First, initialize a ksonnet application and install the Kubeflow packages:
    ks init my-kubeflow
    cd my-kubeflow
    ks registry add kubeflow \
      github.com/google/kubeflow/tree/master/kubeflow
    ks pkg install kubeflow/core
    ks pkg install kubeflow/tf-serving
    ks pkg install kubeflow/tf-job
    ks generate core kubeflow-core --name=kubeflow-core
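
The commands above assume that the `ks` (ksonnet) and `kubectl` binaries are already installed and on your `PATH`. As a minimal sketch of a pre-flight check, the helper below (the name `missing_tools` is hypothetical, not part of Kubeflow or ksonnet) reports any tool that cannot be found:

```shell
# Hypothetical helper: report which of the given tools are not on PATH.
# Prints one missing tool per line; returns non-zero if any are missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}
```

One way to use it before running the setup commands is `missing_tools ks kubectl || echo "install the tools listed above first"`.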
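We can now define environments corresponding to our two clusters: a local minikube cluster and a multi-node GKE cluster, each reachable through a kubectl context of the same name.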
    kubectl config use-context minikube
    ks env add minikube
    kubectl config use-context gke
    ks env add gke
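
The commands above switch the kubectl context before each `ks env add`, so both contexts need to exist beforehand. A small sketch of how that could be checked, assuming the context names `minikube` and `gke` used in this post (the function name `context_exists` is hypothetical):

```shell
# Hypothetical helper: succeed if the named context appears as a whole line
# in a newline-separated list of context names read from stdin.
context_exists() {
  grep -qxF -- "$1"
}

# Example wiring (requires kubectl, so it is left as a comment here):
#   kubectl config get-contexts -o name | context_exists gke \
#     || echo "no kubectl context named gke"
```

Matching whole lines with fixed strings avoids false positives when one context name is a prefix of another.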
And we’re done! Now just create the environments on your cluster. First, on minikube:
ks apply minikube -c kubeflow-core
And to create it on our multi-node GKE cluster for quicker training:
ks apply gke -c kubeflow-core
By making it easy to deploy the same rich ML stack everywhere, the drift and rewriting between these environments is kept to a minimum.
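
Since the same `ks apply` invocation is repeated per environment, it can be wrapped in a loop. The sketch below is one possible way to do that; `deploy_all` and the `DRY_RUN` variable are illustrative names, not Kubeflow or ksonnet features:

```shell
# Hypothetical helper: deploy one ksonnet component to each listed
# environment, in order. With DRY_RUN=1 the commands are printed
# instead of executed, which is handy for reviewing what would run.
deploy_all() {
  component="$1"
  shift
  for target_env in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "ks apply $target_env -c $component"
    else
      # Stop at the first environment that fails to deploy.
      ks apply "$target_env" -c "$component" || return 1
    fi
  done
}
```

For example, `DRY_RUN=1 deploy_all kubeflow-core minikube gke` prints the two `ks apply` commands shown earlier without touching either cluster.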
To access either deployment, you can execute the following command:
kubectl port-forward tf-hub-0 8100:8000
and then open up http://127.0.0.1:8100 to access JupyterHub. To change the environment used by kubectl, use either of these commands:
    # To access minikube
    kubectl config use-context minikube
    # To access GKE
    kubectl config use-context gke
When you execute apply, you are launching the following on K8s:
- JupyterHub for launching and managing Jupyter notebooks on K8s
- A TF CRD
Let's suppose you want to submit a training job. Kubeflow provides ksonnet prototypes that make it easy to define components. The tf-job prototype makes it easy to create a job for your code, but for this example we'll use the tf-cnn prototype, which runs TensorFlow's CNN benchmark.
To submit a training job, you first generate a new job from a prototype:
ks generate tf-cnn cnn --name=cnn
By default, the tf-cnn prototype uses 1 worker and no GPUs, which is perfect for our minikube cluster, so we can just submit it.
ks apply minikube -c cnn
On GKE, we’ll want to tweak the prototype to take advantage of the multiple nodes and GPUs. First, let’s list all the parameters available:
    # To see a list of parameters
    ks prototype list tf-job
Now let’s adjust the parameters to take advantage of GPUs and access to multiple nodes.
    ks param set --env=gke cnn num_gpus 1
    ks param set --env=gke cnn num_workers 1
    ks apply gke -c cnn
Note how we set those parameters so they are used only when you deploy to GKE. Your minikube parameters are unchanged!
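
When several parameters need to be overridden for one environment, the repeated `ks param set --env=...` calls can be generated from `key=value` pairs. This is only a sketch; `set_env_params` and `DRY_RUN` are hypothetical names introduced for illustration:

```shell
# Hypothetical helper: apply key=value overrides to one component in one
# ksonnet environment. With DRY_RUN=1 the commands are only printed.
set_env_params() {
  target_env="$1"
  component="$2"
  shift 2
  for pair in "$@"; do
    key="${pair%%=*}"    # text before the first '='
    value="${pair#*=}"   # text after the first '='
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "ks param set --env=$target_env $component $key $value"
    else
      ks param set --env="$target_env" "$component" "$key" "$value" || return 1
    fi
  done
}
```

With this in place, `set_env_params gke cnn num_gpus=1 num_workers=1` issues the same two `ks param set` commands as above, and parameters in other environments are left untouched.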
After training, you export your model to a serving location.
Kubeflow also includes a serving package.
To deploy the trained model for serving, execute the following:
    ks generate tf-serving inception --name=inception --namespace=default --model_path=gs://$bucket_name/$model_loc
    ks apply gke -c inception
This highlights one more option in Kubeflow - the ability to pass in inputs based on your deployment. This command creates a tf-serving service on the GKE cluster, and makes it available to your application.
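
In the serving command, `$bucket_name` and `$model_loc` are shell variables that you set yourself; if either is unset, the path silently collapses to something like `gs:///`. A minimal sketch of a guard for that (the function name `model_path` is hypothetical):

```shell
# Hypothetical helper: build the gs:// model path from a bucket name and a
# model location, refusing to produce a path if either part is empty.
model_path() {
  bucket="${1:-}"
  loc="${2:-}"
  if [ -z "$bucket" ] || [ -z "$loc" ]; then
    echo "model_path: bucket name and model location are required" >&2
    return 1
  fi
  echo "gs://$bucket/$loc"
}
```

One possible use is `path="$(model_path "$bucket_name" "$model_loc")" || exit 1`, then passing `--model_path="$path"` to `ks generate tf-serving`.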
For more information about deploying and monitoring TensorFlow training jobs and TensorFlow models, please refer to the user guide.
Kubeflow + ksonnet
One choice we want to call out is the use of the ksonnet project. We think working with multiple environments (dev, test, prod) will be the norm for most Kubeflow users. By making environments a first class concept, ksonnet makes it easy for Kubeflow users to move their workloads between their different environments.
Particularly now that Helm is integrating ksonnet with the next version of their platform, we felt like it was the perfect choice for us. More information about ksonnet can be found in the ksonnet docs.
We also want to thank the team at Heptio for expediting features critical to Kubeflow's use of ksonnet.
We are in the midst of building out a community effort right now, and we would love your help! We’ve already been collaborating with many teams - CaiCloud, Red Hat & OpenShift, Canonical, Weaveworks, Container Solutions and many others. CoreOS, for example, is already seeing the promise of Kubeflow:
“The Kubeflow project was a needed advancement to make it significantly easier to set up and productionize machine learning workloads on Kubernetes, and we anticipate that it will greatly expand the opportunity for even more enterprises to embrace the platform. We look forward to working with the project members in providing tight integration of Kubeflow with Tectonic, the enterprise Kubernetes platform.” -- Reza Shafii, VP of product, CoreOS
And we’re just getting started! We would love for you to help. How, you might ask? Well…
- Please join the Slack channel
- Please join the kubeflow-discuss email list
- Please subscribe to the Kubeflow Twitter account
- Please download and run Kubeflow, and submit bugs!
Thank you for your support so far. We could not be more excited!
Jeremy Lewi & David Aronchick Google
- This article was amended in June 2023 to update the trained model bucket location.