Big Data Storage Models Overview – Lustre, GlusterFS and Ceph

Categories

Storage used to be simple. A single large expensive drive (SLED) was all that was needed; then the industry moved on to redundant arrays of inexpensive disks (RAID) – still relatively straightforward. Even as hard drives began to be replaced by solid-state drives (SSD), individual physical drives struggled to keep up with modern server data needs, in particular in the era of migration to the cloud and containerization. This is where software-defined storage (SDS) and distributed file systems have stepped in, in particular in the world of high performance computing (HPC).

HPC relies on large data sets, in particular for workloads such as those found in the oil and gas, financial services, life sciences and media rendering sectors. The ability to access large volumes of data at very high speed and low latency is critical to HPC, yet it has always been a challenge when running HPC workloads.

HPC clusters typically face similar kinds of challenges. As the speed of processors and memory has sharply risen to meet the needs of large-scale applications, creating very large amounts of data and very large individual files has become easier. These ever-increasing amounts of data need to be created and processed in a timely, efficient manner. Disk drives, however, still largely spin at the same speeds, and drive access times continue to be measured in milliseconds. Poor I/O performance can therefore seriously undermine the overall performance of even the fastest clusters, in particular multi-petabyte clusters.

The two storage models most common in big data ecosystems, which set out to solve these problems, are distributed file systems (DFS) and object stores (OS).

In this post we will look at three popular big data storage systems: two distributed file systems, Lustre and GlusterFS, and one object store, Ceph.

Distributed File Systems

Distributed File Systems (DFS) offer the standard type of directories-and-files hierarchical organization we find in local workstation file systems. Every file or directory is identified by a specific path, which includes every other component in the hierarchy above it. Compared to local filesystems, in a DFS, files or file contents may be stored across disks of multiple servers instead of on a single disk.
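To make this concrete, here is a purely illustrative Python sketch, not the behaviour of any particular DFS, showing how a file keeps a single hierarchical path while its contents are split into chunks placed on different servers. The chunk size and server names are made up for the example.

```python
# Toy illustration only: a file keeps a single hierarchical path,
# but its contents are split into chunks placed on different servers.

CHUNK_SIZE = 4  # bytes per chunk, deliberately tiny for demonstration
SERVERS = ["server-a", "server-b", "server-c"]  # hypothetical storage servers

def stripe(path: str, data: bytes) -> dict:
    """Map (path, chunk index) -> (server, chunk bytes) using round-robin placement."""
    layout = {}
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk_index = offset // CHUNK_SIZE
        server = SERVERS[chunk_index % len(SERVERS)]
        layout[(path, chunk_index)] = (server, data[offset:offset + CHUNK_SIZE])
    return layout

if __name__ == "__main__":
    layout = stripe("/projects/seismic/run1.dat", b"abcdefghijkl")
    for (path, idx), (server, chunk) in layout.items():
        print(f"{path} chunk {idx} -> {server}: {chunk!r}")
```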

Lustre

The open-source Lustre parallel file system is the largest player among parallel file systems used in HPC and is commonly used in supercomputers today. It is recognized as the most widely used file system among the Top 500 HPC sites worldwide. Lustre is used across a diverse range of HPC environments, from seismic processing in the oil and gas sector to the movie industry. It has also migrated into the enterprise world, as it can be used effectively for fast processing of workloads in a wide range of areas, including video processing, financial modeling, machine learning and electronic design automation (EDA).

The Lustre global parallel file system can be used over NFS and SAN file systems. Lustre is used mainly for Linux-based HPC clusters. It is open source, licensed under the GPLv2.

Lustre started life as an academic research project. It was later acquired by Sun Microsystems, passed to Oracle when Oracle bought Sun, fragmented into spin-offs including Whamcloud, and then moved to Intel when Intel acquired Whamcloud in 2012.

Most recently, the Lustre file system was sold to an enterprise storage vendor, DataDirect Networks (DDN). Network World commented, “Finally, it’s in the hands of a company that makes sense.” In 2018, when Intel was paring down non-essential products, it sold all the Lustre assets to DDN. DDN specializes in storage for HPC and exascale systems and uses Lustre across its product lines.

“Intel did a really good job of selling it off in a way that Lustre will be well taken care of,” said Steve Conway, COO and senior research vice president at Hyperion Research, a division of IDC that concentrates on HPC.

Key Advantage: Scalability

One of the major advantages of Lustre is that it can scale massively, in both performance and storage capacity. It can also distribute very large files over many nodes. Parallel file systems such as Lustre are ideal for high-end HPC cluster I/O systems, because large files are shared over many nodes in the standard cluster environment.

Lustre can scale to tens of thousands of clients, and is capable of end-to-end data throughput in excess of 10 GB/sec over 100GigE networks, as well as over InfiniBand EDR links, which reach bandwidths of up to 10 GB/sec.

Different Iterations of Lustre

As Lustre is open source, there are various iterations available, including:

Amazon FSx for Lustre

Amazon offers its own version of Lustre, Amazon FSx for Lustre. It works natively with Amazon S3, “making it easy for you to process cloud data sets with high performance file systems”. When connected to an S3 bucket, an FSx for Lustre file system presents S3 objects as files, enabling the writing of results back to S3. FSx for Lustre can also be used as a standalone high-performance file system to burst on-premises workloads to the cloud. Data can be made available for rapid processing by compute instances running on AWS by copying on-premises data to an FSx for Lustre file system. Payment is on a resources-used basis.
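As a rough sketch of how such a file system might be provisioned programmatically, the snippet below uses the boto3 FSx client to request a Lustre file system linked to an S3 bucket. The bucket name, subnet ID and capacity are placeholders, and the exact parameters accepted by create_file_system should be checked against the current AWS documentation.

```python
# Hedged sketch: provision an FSx for Lustre file system linked to an S3
# bucket via boto3. All identifiers below are placeholders, not real resources.
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")

response = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=1200,                        # capacity in GiB (placeholder value)
    SubnetIds=["subnet-0123456789abcdef0"],      # placeholder subnet ID
    LustreConfiguration={
        "ImportPath": "s3://example-bucket/input/",   # S3 objects appear as files
        "ExportPath": "s3://example-bucket/output/",  # results can be written back to S3
    },
)

print(response["FileSystem"]["FileSystemId"])
```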

Lustre File System on the Google Cloud Platform (GCP)

Engineers at GCP have created a set of scripts to easily configure a Lustre storage cluster on Google Compute Engine using the Google Cloud Deployment Manager. The scripts, along with a walk-through of how to use them, are available in the GCP GitHub repository.

GlusterFS

GlusterFS is a scalable network filesystem designed for data-intensive tasks like cloud storage and media streaming. It is free, open source software, which can utilize everyday hardware. It is known for being scalable, affordable and flexible.

It is a scale-out NAS and object store. GlusterFS uses a hashing algorithm to place data within the storage pool, and this hashing is the key to its scaling. The hashing algorithm is exported to all the servers, letting them work out where each data item should be kept, as illustrated in the sketch below. Data can therefore be easily replicated, and the lack of central metadata files means there is no metadata bottleneck, as can occur with a system such as Hadoop.
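The following is not Gluster's actual elastic hashing algorithm; it is only a minimal Python illustration of the general idea that, when every client shares the same hash function and server list, any client can compute where a file lives without asking a central metadata server.

```python
# Illustrative sketch of hash-based placement (not GlusterFS's real elastic
# hashing): every client holds the same server list and hash function, so any
# client can compute a file's location without a central metadata server.
import hashlib

SERVERS = ["brick-1", "brick-2", "brick-3", "brick-4"]  # hypothetical bricks

def locate(path: str) -> str:
    """Deterministically map a file path to one of the servers."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(SERVERS)
    return SERVERS[index]

print(locate("/videos/raw/take-01.mp4"))  # same answer on every client
print(locate("/videos/raw/take-02.mp4"))
```

Note that because a file's location is derived from its name, renaming the file changes where the hash points; this is the behaviour behind the renaming caveat mentioned below.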

Gluster uses block storage, meaning that it stores data in chunks on open space across connected Linux computers. This offers a highly scalable system with access to traditional storage and file transfer protocols, one that can grow quickly without a single point of failure. Storing huge amounts of older data is therefore possible without loss of accessibility or security.

Key Advantage: Data is Kept Together

Because Gluster distributes data to connected computers and stores it in blocks, everything is kept together. GlusterFS finds a suitably sized storage area for the data among all its storage locations, places the data there, and generates an identifying hash. Data is stored without a separate metadata system; instead, a unique hash is generated for each file. Because there is no metadata server to get in the way, Gluster is able to react and scale more quickly than its competitors while still maintaining usability. From the interface, users are able to view their data blocks as directories. Since each file has a unique hash, a user needs to make a copy before renaming a file, or else they risk losing access to the data.

Although open source, GlusterFS is maintained primarily by Red Hat.

Object Stores

In contrast to distributed file systems, object stores keep data in a flat, non-hierarchical namespace in which each piece of data is identified by an arbitrary unique identifier. Any other detail about the piece of data, such as permissions, file size, file type, and details of relationships to other objects, is stored as metadata alongside the data itself.

In an object store, each consuming application can decorate the same object with its own application-specific metadata entries, which carry meaning only for that application. This is an advantage in terms of flexibility and openness over the more rigid directories-and-files hierarchy of a distributed file system, which forces applications to store their application-specific metadata away from the files, frequently in external databases, thereby complicating application architectures. By contrast, object stores lead to simpler application architectures. However, a distributed file system is sometimes the only available option, for example when the system contains components that expect a file system interface.
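As a toy illustration of these ideas, the Python sketch below models a flat namespace of opaque identifiers in which two hypothetical applications attach their own metadata to the same object. It is not tied to any particular object store product.

```python
# Toy in-memory object store: a flat namespace of opaque IDs, with each
# application free to attach its own metadata entries to the same object.
import uuid

class ToyObjectStore:
    def __init__(self):
        self._objects = {}   # object_id -> raw bytes
        self._metadata = {}  # object_id -> {application: {key: value}}

    def put(self, data: bytes) -> str:
        object_id = str(uuid.uuid4())  # arbitrary unique identifier, no path
        self._objects[object_id] = data
        self._metadata[object_id] = {}
        return object_id

    def tag(self, object_id: str, app: str, **entries):
        """Attach application-specific metadata without touching the data."""
        self._metadata[object_id].setdefault(app, {}).update(entries)

    def get(self, object_id: str):
        return self._objects[object_id], self._metadata[object_id]

store = ToyObjectStore()
oid = store.put(b"raw sensor readings")
store.tag(oid, "ingest-service", source="sensor-17", size=19)
store.tag(oid, "analytics-service", schema_version=3)
print(store.get(oid))
```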

Ceph

Ceph is a Linux-based distributed storage system, which “incorporates replication and fault tolerance while maintaining POSIX compatibility” in order to simplify management of huge amounts of data.

Ceph started life as a PhD research project in storage systems at the University of California, Santa Cruz by then student Sage Weil. It was merged into the mainline Linux kernel in 2010. In November 2018, in response to IBM's acquisition of Red Hat, the Linux Foundation announced a new Ceph Foundation. One of the founding premier members is Red Hat. The goal of the foundation is to “organize and distribute financial contributions in a coordinated, vendor-neutral fashion for immediate community benefit” in order to “help galvanize rapid adoption, training and in-person collaboration across the Ceph ecosystem”.

Ceph is used by a wide range of organizations across a diverse mix of sectors, including financial institutions (Bloomberg, Fidelity), cloud service providers (Rackspace, Linode), academic and government institutions (Massachusetts Open Cloud), telecommunications infrastructure providers (Deutsche Telekom), auto manufacturers (BMW) and software solution providers (SAP, Salesforce).

Key Advantage: Flexibility

One of the defining features of the Ceph object storage system is that it offers a traditional file system interface with POSIX semantics, something which traditional object storage systems have been unable to do, which is why they typically only complement rather than replace traditional file systems. Ceph's file system interface means that organizations can configure their legacy applications to use the Ceph file system as well, allowing them to run just one storage cluster for object, block and file-based data storage.

In addition, because Ceph uses object storage, it stores data as binary objects spread out over many computers. Ceph is commonly used to build private cloud systems together with OpenStack, allowing users to mix unstructured and structured data in the same system.
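For a flavour of object access against Ceph, the sketch below uses the python-rados bindings that ship with Ceph, assuming a reachable cluster, a standard /etc/ceph/ceph.conf and an existing pool named demo-pool (a placeholder name); a real deployment would also need appropriate authentication configuration.

```python
# Hedged sketch using the python-rados bindings shipped with Ceph.
# Assumes a reachable cluster, a valid /etc/ceph/ceph.conf and an existing
# pool called "demo-pool" (placeholder); adjust names for a real deployment.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("demo-pool")
    try:
        # Store an object and attach application-specific metadata to it.
        ioctx.write_full("report-2019-q1", b"quarterly results ...")
        ioctx.set_xattr("report-2019-q1", "department", b"finance")

        data = ioctx.read("report-2019-q1")
        print(data)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```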

Conclusion

Deciding which big data storage solution to use involves many factors, but all three of the options discussed here offer scalable and stable storage of data.

Gluster’s default storage block size is twice Ceph’s: 128k compared to 64k, which GlusterFS says allows it to offer faster processing. However, Ceph’s block size can also be increased with the right configuration setting.

All three are open source, and as with Lustre, there are also third-party management solutions that connect to Ceph and GlusterFS. The most popular for Ceph are Inktank, Red Hat, Decapod and Intel, and for Gluster, Red Hat.
