Company: Pinterest
Location: San Francisco, California
Industry: Web and Mobile App

Challenge

After eight years in existence, Pinterest had grown to 1,000 microservices, multiple layers of infrastructure, and a diverse set of setup tools and platforms. In 2016 the company launched a roadmap toward a new compute platform, led by the vision of creating the fastest path from an idea to production, without making engineers worry about the underlying infrastructure.

Solution

The first phase involved moving services to Docker containers. Once these services went into production in early 2017, the team began looking at orchestration to help create efficiencies and manage them in a decentralized way. After an evaluation of various solutions, Pinterest went with Kubernetes.

Impact

"By moving to Kubernetes the team was able to build on-demand scaling and new failover policies, in addition to simplifying the overall deployment and management of a complicated piece of infrastructure such as Jenkins," says Micheal Benedict, Product Manager for the Cloud and the Data Infrastructure Group at Pinterest. "We not only saw reduced build times but also huge efficiency wins. For instance, the team reclaimed over 80 percent of capacity during non-peak hours. As a result, the Jenkins Kubernetes cluster now uses 30 percent less instance-hours per-day when compared to the previous static cluster."

Pinterest was born on the cloud—running on AWS since day one in 2010—but even cloud native companies can experience some growing pains.

Since its launch, Pinterest has become a household name, with more than 200 million monthly active users and 100 billion objects saved. Under the hood, there are 1,000 microservices running and hundreds of thousands of data jobs.

With such growth came layers of infrastructure and diverse setup tools and platforms for the different workloads, resulting in an inconsistent and complex end-to-end developer experience, and ultimately a slower path to production. So in 2016, the company launched a roadmap toward a new compute platform, led by the vision of having the fastest path from an idea to production, without making engineers worry about the underlying infrastructure.

The first phase involved moving to Docker. "Pinterest has been heavily running on virtual machines, on EC2 instances directly, for the longest time," says Micheal Benedict, Product Manager for the Cloud and the Data Infrastructure Group. "To solve the problem around packaging software and not make engineers own portions of the fleet and those kinds of challenges, we standardized the packaging mechanism and then moved that to the container on top of the VM. Not many drastic changes. We didn't want to boil the ocean at that point."

The first service that was migrated was the monolith API fleet that powers most of Pinterest. At the same time, Benedict's infrastructure governance team built chargeback and capacity planning systems to analyze how the company uses its virtual machines on AWS. "It became clear that running on VMs is just not sustainable with what we're doing," says Benedict. "A lot of resources were underutilized. There were efficiency efforts, which worked fine at a certain scale, but now you have to move to a more decentralized way of managing that. So orchestration was something we thought could help solve that piece."

That led to the second phase of the roadmap. In July 2017, after an eight-week evaluation period, the team chose Kubernetes over other orchestration platforms. "Kubernetes lacked certain things at the time—for example, we wanted Spark on Kubernetes," says Benedict. "But we realized that the dev cycles we would put in to even try building that is well worth the outcome, both for Pinterest as well as the community. We've been in those conversations in the Big Data SIG. We realized that by the time we get to productionizing many of those things, we'll be able to leverage what the community is doing."

At the beginning of 2018, the team began onboarding its first use case into the Kubernetes system: Jenkins workloads. "Although we have builds happening during a certain period of the day, we always need to allocate peak capacity," says Benedict. "They don't have any auto-scaling capabilities, so that capacity stays constant. It is difficult to speed up builds because ramping up takes more time. So given those kind of concerns, we thought that would be a perfect use case for us to work on."

They ramped up the cluster, and working with a team of four people, got the Jenkins Kubernetes cluster ready for production. "We still have our static Jenkins cluster," says Benedict, "but on Kubernetes, we are doing similar builds, testing the entire pipeline, getting the artifact ready and just doing the comparison to see, how much time did it take to build over here. Is the SLA okay, is the artifact generated correct, are there issues there?"

"So far it's been good," he adds, "especially the elasticity around how we can configure our Jenkins workloads on Kubernetes shared cluster. That is the win we were pushing for."

By the end of Q1 2018, the team successfully migrated Jenkins Master to run natively on Kubernetes and also collaborated on the Jenkins Kubernetes Plugin to manage the lifecycle of workers. "We're currently building the entire Pinterest JVM stack (one of the larger monorepos at Pinterest which was recently bazelized) on this new cluster," says Benedict. "At peak, we run thousands of pods on a few hundred nodes. Overall, by moving to Kubernetes the team was able to build on-demand scaling and new failover policies, in addition to simplifying the overall deployment and management of a complicated piece of infrastructure such as Jenkins. We not only saw reduced build times but also huge efficiency wins. For instance, the team reclaimed over 80 percent of capacity during non-peak hours. As a result, the Jenkins Kubernetes cluster now uses 30 percent less instance-hours per-day when compared to the previous static cluster."
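The two efficiency figures in the quote can be related with some simple arithmetic. The sketch below is illustrative only: the node count and the split between peak and non-peak hours are hypothetical values chosen for the example, not numbers reported by Pinterest. It compares a static cluster that holds peak capacity all day with an elastic cluster that releases 80 percent of its nodes outside peak hours.

```python
# Illustrative arithmetic only. PEAK_NODES and PEAK_HOURS are hypothetical
# assumptions; only the 80 percent reclaim figure comes from the article.

HOURS_PER_DAY = 24
PEAK_NODES = 100          # hypothetical peak capacity
PEAK_HOURS = 15           # hypothetical number of busy build hours per day
RECLAIMED_PERCENT = 80    # capacity released during non-peak hours


def static_instance_hours(peak_nodes, hours=HOURS_PER_DAY):
    """A static cluster holds peak capacity for every hour of the day."""
    return peak_nodes * hours


def elastic_instance_hours(peak_nodes, peak_hours, reclaimed_percent,
                           hours=HOURS_PER_DAY):
    """An elastic cluster runs at peak size only during peak hours and
    shrinks by reclaimed_percent for the remaining hours."""
    offpeak_nodes = peak_nodes * (100 - reclaimed_percent) // 100
    offpeak_hours = hours - peak_hours
    return peak_nodes * peak_hours + offpeak_nodes * offpeak_hours


def savings_percent(static_hours, elastic_hours):
    """Percentage reduction in instance-hours relative to the static cluster."""
    return 100 * (static_hours - elastic_hours) / static_hours


static_hours = static_instance_hours(PEAK_NODES)
elastic_hours = elastic_instance_hours(PEAK_NODES, PEAK_HOURS, RECLAIMED_PERCENT)

print(f"static:  {static_hours} instance-hours per day")
print(f"elastic: {elastic_hours} instance-hours per day")
print(f"savings: {savings_percent(static_hours, elastic_hours):.0f} percent")
```

Under these assumed values, the elastic cluster uses 1,680 instance-hours per day against 2,400 for the static one, a 30 percent reduction. A different split between peak and non-peak hours would give a different result, so the example shows only how the two figures can fit together.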

Benedict points to a "pretty robust roadmap" going forward. In addition to the Pinterest big data team's experiments with Spark on Kubernetes, the company collaborated with Amazon's EKS team on an ENI/CNI plug-in.

Once the Jenkins cluster is up and running out of dark mode, Benedict hopes to establish best practices and governance primitives, such as integration with the chargeback system, before migrating the next service. "We have a healthy pipeline of use-cases to be on-boarded. After Jenkins, we want to enable support for TensorFlow and Apache Spark. At some point, we aim to move the company's monolithic API service. If we move that and understand the complexity around that, it builds our confidence," says Benedict. "It sets us up for migration of all our other services."

After years of being a cloud native pioneer, Pinterest is eager to share its ongoing journey. "We are in the position to run things at scale, in a public cloud environment, and test things out in a way that a lot of people might not be able to do," says Benedict. "We're in a great position to contribute back some of those learnings."