Last fall, MacLeod Broad’s platform team at Workiva was prepping one of the company’s first products to use Amazon Web Services when it ran into a roadblock.
Early on, Workiva’s backend had run mostly on Google App Engine. But things changed along the way as Workiva’s SaaS offering, Wdesk, a cloud-based platform for managing and reporting business data, grew its customer base to more than 70 percent of the Fortune 500 companies. "As customer needs grew and the product offering expanded, we started to leverage a wider offering of services such as Amazon Web Services as well as other Google Cloud Platform services, creating a multi-vendor environment."
With this new product, there was a "sync and link" feature by which data "went through a whole host of services starting with the new spreadsheet system [Amazon Aurora] into what we called our linking system, and then pushed through http to our existing system, and then a number of calculations would go on, and the results would be transmitted back into the new system," says Broad. "We were trying to optimize that for speed. We thought we had made this great optimization and then it would turn out to be a micro optimization, which didn’t really affect the overall speed of things."
The challenges faced by Broad’s team may sound familiar to other companies that have also made the shift from monoliths to more distributed, microservice-based systems. "We had a number of people working on this, all on different teams, so it was difficult to get our head around what the issues were and where the bottlenecks were," says Broad.
"Each service team was going through different iterations of their architecture and it was very hard to follow what was actually going on in each team’s system," he adds. "We had circular dependencies where we’d have three or four different service teams unsure of where the issues really were, requiring a lot of back and forth communication. So we wasted a lot of time saying, ‘What part of this is slow? Which part of this is sometimes slow depending on the use case? Which part is degrading over time? Which part of this process is asynchronous so it doesn’t really matter if it’s long-running or not? What are we doing that’s redundant, and which part of this is buggy?’"