Over the last couple of years, Twitter has published a series of blog posts on its infrastructure, discussing how it has evolved its hardware and data centers to maximize efficiency and uptime as the business has grown in media content, customers, and worldwide presence. As of the third quarter of 2017, the microblogging service averaged 330 million monthly active users worldwide and is continually striving to scale up.
The social networking and microblogging site began its life just over a decade ago, in March 2006, when hardware from physical enterprise vendors dominated the data center. Over the past decade, Twitter has continually sought to engineer and refresh its fleet to take advantage of the latest technology trends and maximize hardware efficiency in order to deliver a world-class service.
Network Traffic – Issues and Solutions
Twitter started to migrate away from third-party hosting in early 2010 and to build its own network architecture to address the service issues it had previously encountered. It moved onto a network with points of presence (PoPs) on five continents and data centers with hundreds of thousands of servers. In early 2015, however, the service started to reach physical scalability limits in its data centers when the full-mesh topology could no longer support the hardware needed to add further racks. In addition, the existing data center IGP began to behave in unexpected ways due to the increasing route scale and topology complexity.
In response, Twitter began converting existing data centers to a Clos topology plus BGP, performing the conversion on a live network in order to avoid interrupting the service. More information and diagrams are available here.
Data Center – History and Challenges
Service interruptions in the site's early days stemmed from a combination of data center design issues, maintenance challenges, equipment failure, and human error.
Hardware failures occurred at the server component, top-of-rack switch, and core switch levels, leading to some service interruptions.
Solutions Found
Twitter focused on improving physical-layer designs to maximize the resiliency of its hardware and data center operations, and on building diverse power sources, fiber paths, and physical domains.
They modeled the dependencies between physical failure domains and the services spread across them, improving their ability to predict fault tolerance and to target improvements.
They added more data centers in more locations to reduce the risk from natural disasters and to allow failover between regions during unforeseen incidents or planned outages. The active-active operation of the data centers allowed for staged code deployment, minimizing the overall impact of rollouts.
They improved power usage efficiency within the data centers by widening the operating ranges of the environmental envelope and finding ways for the hardware to keep functioning at higher operating temperatures.
Future Planning
Twitter strives to make changes to its hardware and network live, in order to reduce the impact on users. Its main focus for the future is to improve efficiency within its existing physical footprint while still scaling up globally.
Hardware Efficiency – History and Challenges
The hardware engineering team at Twitter began by working with off-the-shelf hardware, but soon moved to customizing hardware for performance and cost efficiency, as the available market offerings did not meet the team's needs.
Off-the-shelf servers came with features that helped reliability at small scale but became problematic when applied to Twitter's larger use model; for example, some RAID controllers interfered with SSD performance and were significantly costly to work around. In addition, problems were arising with the supply and performance of SAS media. Eventually, the decision was made to invest in a hardware engineering team to customize white-box solutions, reducing expense and improving performance.
Solutions Found
Since 2016, Twitter has developed its own GPU systems for inference and training of machine learning models.
The team found that the two main ways to reduce the cost of a server were:
- Removing unused components
- Improving utilization
Twitter's workload is broken into four verticals: storage, compute, database, and GPU. Twitter looks at requirements on a per-vertical basis, which lets the Hardware Engineering team create a focused feature set for each vertical. This is aimed at optimizing component selection where equipment would otherwise go unused or be underutilized; for example, its storage configuration was designed for Hadoop workloads and delivered a 20% reduction in total cost of ownership (TCO) compared with the original OEM solution, while also enhancing the performance and reliability of the hardware. Similarly, for the compute vertical, the Hardware Engineering team has maximized the efficiency of its systems by eliminating unnecessary features.
Migration from Bare Metal to Mesos
In 2012-2013, Twitter began to adopt two things: (i) service discovery (via a ZooKeeper cluster and a library in Finagle), so that if, say, the web service needed to talk to the user service, it could do so more easily than before, and (ii) Mesos, including Aurora, Twitter's own scheduler framework.
Services launched under Mesos were automatically registered into a "serverset", meaning that any service needing to talk to another could simply watch that path and receive a live view of which servers were available.
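Underneath, a serverset is a set of ephemeral znodes in ZooKeeper, so a watcher on the serverset path sees members come and go as instances start and stop. The sketch below illustrates that pattern in Python using the kazoo client; the ensemble addresses and serverset path are hypothetical, and Twitter's own services consume serversets through a Finagle library rather than raw ZooKeeper watches.

```python
from kazoo.client import KazooClient

# Hypothetical ZooKeeper ensemble and serverset path; real paths and the
# member payload format are deployment-specific.
SERVERSET_PATH = "/twitter/service/userservice/prod"

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

@zk.ChildrenWatch(SERVERSET_PATH)
def on_members_changed(members):
    # Each child znode represents one live instance; its data payload
    # (not read here) carries the instance's host:port endpoint.
    print("live instances:", members)
```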
Mesos/Aurora allowed a service owner to push a package into "Packer" (a service backed by HDFS) and upload an Aurora configuration describing the service (its CPU, memory, and instance requirements, plus the command lines of its tasks) so that Aurora could complete the deploy. Aurora then schedules instances on available hosts, downloads the artifact from Packer, registers each instance in service discovery, and finally launches it. If any failures occur, Mesos/Aurora automatically reschedules the instance onto a different host.
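To give a flavor of what such a configuration looks like, here is a minimal sketch in Apache Aurora's Python-based DSL; the package, resource figures, and cluster/role/job names are illustrative rather than taken from Twitter's actual configs.

```python
# userservice.aurora -- illustrative Aurora job definition, not Twitter's actual config.
# Process, Task, Resources, Service and the GB constant are supplied by the
# Aurora client when it evaluates this file.

run_server = Process(
    name='run_userservice',
    # Hypothetical binary; in practice the artifact would be fetched from Packer first.
    cmdline='./bin/userservice --port={{thermos.ports[http]}}')

userservice_task = Task(
    processes=[run_server],
    resources=Resources(cpu=2.0, ram=4*GB, disk=8*GB))

jobs = [
    Service(
        cluster='example-cluster',   # hypothetical cluster name
        role='userservice',          # hypothetical role
        environment='prod',
        name='userservice',
        instances=50,                # the "one line" to change when adding capacity
        task=userservice_task)
]
```

Bumping the instances count in a file like this and redeploying is the kind of one-line capacity change described below.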
A Private PaaS for Twitter
Combining service discovery with Mesos/Aurora revolutionized Twitter's infrastructure. Despite bugs and growing pains, the fundamental combination of the two has proved sound. Instead of having to think about reconfiguring and managing hardware, the engineering team can now focus on how much capacity to deploy and how best to configure their services. Mesos allows engineers to pack multiple services onto a single box, and adding capacity to a service becomes a matter of requesting quota, changing one line of a config, and deploying. Over two years, most "stateless" services moved onto Mesos, starting with the largest and most important ones (the ads serving system and the user service), which lessened their operational burden.
The series of infrastructure posts is a fascinating look at how Twitter has managed to continuously scale up and yet maintain (and improve) efficiency.
Read the full series here.