Twitter Moving 300PB of Cold Data Storage and Compute Hadoop Clusters To Google Cloud

May 9, 2018

Twitter’s Hadoop system is at the core of its data platform, providing vast storage for analytics of user actions and allowing its infrastructure to scale as needed. Twitter’s Hadoop clusters are among the world’s largest; its Hadoop file systems host more than 300PB of data across tens of thousands of servers.

Twitter just announced that it is working with Google Cloud to move cold data storage and its flexible compute Hadoop clusters to the Google Cloud Platform, allowing its engineering teams to improve their experience and productivity.

“There is strong alignment with Twitter’s engineering strategy to meet the demands of its platform and the services Google Cloud offers at a global scale”, said Brian Stevens, CTO of Google Cloud. “Google Cloud Platform’s data solutions and trusted infrastructure will provide Twitter with the technical flexibility and consistency that its platform requires, and we look forward to an ongoing technical collaboration with their team.”

Twitter promises that the migration will enable a number of improvements, including speedier capacity provisioning, improved flexibility, access to more tools and services, security improvements, and enhanced disaster recovery capabilities. Architecturally, the migration, once complete, will allow Twitter to separate compute and storage for this class of Hadoop workloads, which the company says will have “a number of long-term scaling and operational benefits”.

Industry chatter around the news has framed it as a significant win for Google Cloud. As TechRepublic’s Matt Asay put it, “Winning Twitter’s business is a signal to the world: If Google can handle Twitter’s workloads, it can also manage yours.”

Spotify moved its infrastructure over to Google Cloud in March, in a move that ZDNet’s Larry Dignan said “may become what Netflix was to Amazon Web Services: an all clear signal to enterprises to move more workloads to the cloud provider”.

In its work with Hadoop, Twitter has made a number of contributions to the project, including to ViewFs, the client-side Hadoop filesystem view. ViewFs makes interacting with Twitter’s HDFS infrastructure as straightforward as working with a single namespace spanning all its datacenters and clusters, a highly useful feature given Twitter’s scale.

HDFS Federation scales the filesystem to Twitter’s needs, while NameNode High Availability improves reliability within a namespace. Combined, these features add considerable complexity to managing and using Twitter’s large Hadoop clusters, which run varying versions. By exposing simple paths, ViewFs spares Twitter’s DevOps teams from having to remember complicated URLs.
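To make the idea of simple paths concrete, here is a minimal sketch of how a client-side mount table can map a logical path to a physical HDFS URI. It is not Hadoop’s implementation (in Hadoop itself, ViewFs mount points are declared through `fs.viewfs.mounttable.*` configuration properties), and the hostnames, the `MOUNT_TABLE` entries and the `resolve` function are all hypothetical names chosen for illustration.

```python
# Hypothetical mount table: logical path prefix -> physical HDFS URI.
# Hostnames and paths are illustrative only, not Twitter's actual layout.
MOUNT_TABLE = {
    "/user": "hdfs://nn-user.dc1.example.com:8020/user",
    "/logs": "hdfs://nn-logs.dc1.example.com:8020/logs",
    "/archive": "hdfs://nn-cold.dc2.example.com:8020/archive",
}


def resolve(path, mount_table=MOUNT_TABLE):
    """Map a logical path to a physical URI via the longest matching mount point."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    parts = path.rstrip("/").split("/")
    # Try the longest candidate prefix first, then progressively shorter ones.
    for end in range(len(parts), 1, -1):
        prefix = "/".join(parts[:end])
        if prefix in mount_table:
            remainder = "/".join(parts[end:])
            target = mount_table[prefix]
            return f"{target}/{remainder}" if remainder else target
    raise KeyError(f"no mount point for {path}")


if __name__ == "__main__":
    # The caller only ever sees the short logical path.
    print(resolve("/logs/2018/05/09"))
```

The point of the sketch is that the caller works with a short path such as `/logs/2018/05/09`, while the table decides which cluster and NameNode actually serve it.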

In a post by Twitter engineer @gerashegalov, the company lays out in detail how it works with ViewFs, including the extension it has developed, TwitterViewFs, which dynamically generates a new configuration, giving the Twitter team a “simple holistic filesystem view”. Twitter also added Nfly for cross-datacenter availability of HDFS data. At the time, it hoped that “the broader Hadoop user community will benefit from our experience”. Certainly, the Hadoop user community will be watching its latest move with interest.