How Segment redesigned its core systems to solve an existential scaling crisis

Segment, the startup Twilio bought last fall for $3.2 billion, was just beginning to take off in 2015 when it ran into a scaling problem: It was growing so quickly that the tools it had built to process marketing data on its platform were starting to outgrow the original system design.

Inaction would cause the company to hit a technology wall, managers feared. Every early-stage startup craves growth and Segment was no exception, but it also needed to begin thinking about how to make its data platform more resilient, or risk reaching a point where it could no longer handle the data moving through the system. It was, in a real sense, an existential crisis for the young business.

Segment’s engineering team began thinking hard about what a more robust and scalable system would look like. As it turned out, the team’s vision would evolve in a number of ways between the end of 2015 and today, and with each iteration, the engineers would take a leap in how efficiently they allocated resources and processed data moving through Segment’s systems.

The project that came out of their efforts was called Centrifuge, and its purpose was to move data through Segment’s data pipes to wherever customers needed it quickly and efficiently at the lowest operating cost. This is the story of how that system came together.

Growing pains

The systemic issues became apparent the way they often do: when customers began complaining. When Tido Carriero, Segment’s chief product development officer, came on board at the end of 2015, he was charged with finding a solution. The issue involved the original system design, which, like many early iterations from startups, was built to get the product to market with little thought given to future growth. Now the technical debt payment was coming due.

“We had [designed] our initial integrations architecture in a way that just wasn’t scalable in a number of different ways. We had been experiencing massive growth, and our CEO [Peter Reinhardt] came to me maybe three times within a month and reported various scaling challenges that either customers or partners of ours had alerted him to,” said Carriero.

The good news was that Segment was attracting customers and partners to the platform at a rapid clip, but it could all have come crashing down if the company didn’t improve the underlying system architecture to support that growth. As Carriero reports, that made it a stressful time, but having come from Dropbox, he understood that it’s possible to completely rearchitect a business’s technology platform and live to tell about it.

“One of the things I learned from my past life [at Dropbox] is when you have a problem that’s just so core to your business, at a certain point you start to realize that you are the only company in the world kind of experiencing this problem at this kind of scale,” he said. For Dropbox that was related to storage, and for Segment it was processing large amounts of data concurrently.

In the build-versus-buy equation, Carriero knew that he had to build his way out of the problem. There was nothing out there that could solve Segment’s unique scaling issues. “Obviously that led us to believe that we really need to think about this a little bit differently, and that was when our Centrifuge V2 architecture was born,” he said.

Building the imperfect beast

When the company began measuring system performance, it was processing 8,442 events per second. By the time it began building V2 of its architecture, that number had grown to an average of 18,907 events per second.


By Ron Miller
