The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (SOSP 2003)

Google needs a distributed file system that matches its applications' needs. Solution: GFS.

GFS: Scalable distributed file system for large distributed data-intensive applications.

Why are we reading this paper?

Why is distributed storage hard?

In GFS, we will see that consistency is traded off for simpler design, greater performance, and high availability.

What would we like for consistency?

Replication for fault-tolerance makes strong consistency tricky.

GFS Key Ideas

Motivating Example

Google

Cluster of PCs as Supercomputer

Conclusion: Need large, distributed, highly fault-tolerant file system.

Serving a Google Search query

Steps in query answering:

image

“On average, a single query in Google reads hundreds of megabytes of data and consumes tens of billions of CPU cycles. Supporting a peak request stream of thousands of queries per second requires an infrastructure comparable in size to that of the largest supercomputer installations. Combining more than 15,000 commodity-class PCs with fault-tolerant software creates a solution that is more cost-effective than a comparable system built out of a smaller number of high-end servers.”

Google Platform Characteristics

Design Motivations

  1. GFS runs on a large number of machines. Failures occur regularly, so fault-tolerance and auto-recovery need to be built in.
  2. Files are huge by traditional standards (multi-GB files are common), so standard I/O assumptions (e.g. block size) have to be reexamined.
  3. Record appends are the prevalent form of writing. Need good semantics for concurrent appends to the same file by multiple clients.
  4. Google applications and GFS are both designed in-house, so they can and should be co-designed.

Google File System: Design Criteria

What were the problems GFS was trying to solve?

Google needed a large-scale, high-performance unified storage system for many of its internal services, such as MapReduce and its web crawler services.

In particular, this storage system must:

  * run on large clusters of commodity machines where component failures are the norm, and detect and recover from them automatically;
  * handle a modest number of very large (multi-GB) files efficiently;
  * support workloads dominated by large streaming reads and large sequential appends, including concurrent appends to the same file by many clients;
  * deliver high aggregate throughput to many concurrent readers and writers.

GFS is optimized for high sustained bandwidth (target applications place a premium on processing data in bulk at a high rate), but not necessarily for low latency (GFS is typically used for internal services and is not client-facing).

Architecture

  clients (library, RPC -- but not visible as a UNIX FS)
  
  coordinator tracks file names
  
  chunkservers store 64 MB chunks
  
  big files split into 64 MB chunks, on lots of chunkservers
    big chunks -> low book-keeping overhead (see the arithmetic just below)
  
  each chunk replicated on 3 chunkservers
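
As a back-of-the-envelope check (my arithmetic, not a measurement from the paper): with 64 MB chunks and less than 64 bytes of master metadata per chunk, 1 PB of file data is roughly 16 million chunks, i.e. on the order of 1 GB of chunk metadata, which comfortably fits in the coordinator's RAM.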

image

Analogy with File system

On a single-machine FS:

In the GFS:

image

Very important: data flow is decoupled from control flow

Neither the clients nor the chunkservers cache file data

What is a chunk?

What is a master?

Coordinator (Master) state

  tables in RAM (for speed, must be smallish):
  
    file name -> array of chunk handles (nv)
    
    chunk handle -> version # (nv)
    
                    list of chunkservers (v)
                    
                    primary (v)
                    
                    lease time (v)
                    
  non-volatile "nv" state also written to disk, in case crash+reboot

The Master Node

The Operation Log

Interface

Configuration

GFS data chunking and distribution

The following figure shows how a file is broken into chunks and distributed among multiple chunkservers.

image

Google Cluster Environment

The environment is shown in the following figure.

image

Ref: https://pk.org/417/notes/16-dfs.html

Client interaction model

Implementation

Each chunkserver stores chunks. A chunk is identified by a chunk handle, which is a globally unique 64-bit number assigned by the master when the chunk is first created. On the chunkserver, every chunk is stored on the local disk as a regular Linux file. For integrity, the chunkserver stores a 32-bit checksum for each chunk, which is also logged to disk.
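
A sketch of that integrity check using Go's standard CRC-32 (the notes only say "32-bit checksum", so CRC-32 is an assumed choice; the GFS paper actually checksums 64 KB blocks within each chunk, while here the whole chunk gets one checksum for brevity):

```go
package gfs

import (
	"errors"
	"hash/crc32"
)

type storedChunk struct {
	data     []byte
	checksum uint32 // persisted on the chunkserver's local disk alongside the chunk
}

func newStoredChunk(data []byte) storedChunk {
	return storedChunk{data: data, checksum: crc32.ChecksumIEEE(data)}
}

// verify runs before chunk data is returned to a client or copied to another replica.
func (c storedChunk) verify() error {
	if crc32.ChecksumIEEE(c.data) != c.checksum {
		return errors.New("checksum mismatch: chunk is corrupt, read another replica")
	}
	return nil
}
```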

Every chunk is replicated onto multiple chunkservers. By default, there are three replicas of a chunk, although a different replication level can be specified on a per-file basis. Files that are accessed by lots of processes may need more replicas to avoid congestion at any single server.

Master

The primary role of the master is to maintain all of the file system metadata. This includes the names and directories of each file, access control information, the mapping from each file to a set of chunks, and the current locations of chunks on chunkservers. Metadata is stored only on the master. This simplifies the design of GFS, as there is no need to synchronize information about a changing file system among multiple masters.

For fast performance, all metadata is stored in the master’s main memory. This includes the entire filesystem namespace as well as all the name-to-chunk maps. For fault tolerance, any changes are written to the disk onto an operation log. This operation log is also replicated onto remote machines. The operation log is similar to a journal. Every operation to the file system is logged into this file. Periodic checkpoints of the file system state, stored in a B-tree structure, are performed to avoid having to recreate all metadata by playing back the entire log.
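
A sketch of that durability path (assumed types and encoding, not Google's format): every metadata mutation is appended and flushed to the operation log before it is acknowledged, and the in-memory state is periodically dumped as a checkpoint so recovery only replays the log suffix written after it.

```go
package gfs

import (
	"encoding/gob"
	"os"
)

type LogRecord struct {
	Op   string // e.g. "create", "add-chunk", "bump-version"
	Args []string
}

type OpLog struct {
	f   *os.File
	enc *gob.Encoder
}

// Append flushes the record to disk (and, in real GFS, to remote log replicas)
// before the mutation is considered durable.
func (l *OpLog) Append(r LogRecord) error {
	if err := l.enc.Encode(r); err != nil {
		return err
	}
	return l.f.Sync()
}

// Checkpoint writes the whole in-memory state; log entries before this point
// can then be discarded, keeping recovery time bounded.
func Checkpoint(state any, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return gob.NewEncoder(f).Encode(state)
}
```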

Having a single master for a huge file system sounds like a bottleneck, but the role of the master is only to tell clients which chunkservers to use. The data access itself is handled directly between the clients and chunkservers.

The file system namespace of directories and file names is maintained by the master. Unlike most file systems, there is no separate directory structure that contains the names of all the files within that directory. The namespace is simply a single lookup table that contains pathnames (which can look like a directory hierarchy if desired) and maps them to metadata. GFS does not support hard links or symbolic links.

The master manages chunk leases (locks on chunks with expiration), garbage collection (the freeing of unused chunks), and chunk migration (the movement and copying of chunks to different chunkservers). It periodically communicates with all chunkservers via heartbeat messages to get their state and to send them commands. The master does not store chunk locations persistently on its disk; instead, it obtains them by querying the chunkservers (at startup and via heartbeats), which keeps consistency problems between the master's view and the chunkservers' actual contents from arising.
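
A sketch of why chunk locations need not be persisted (reusing the Master sketch above; HandleChunkReport is an assumed name): each chunkserver's report lets the master rebuild the volatile handle-to-servers map, so the chunkservers remain the authority on what they actually store.

```go
package gfs

// HandleChunkReport records that server currently holds the given chunks.
// Called for the full report at chunkserver startup and for heartbeat updates.
func (m *Master) HandleChunkReport(server string, handles []ChunkHandle) {
	for _, h := range handles {
		info, ok := m.Chunks[h]
		if !ok {
			continue // no file references this chunk any more: garbage-collect it later
		}
		if !containsServer(info.Servers, server) {
			info.Servers = append(info.Servers, server)
		}
	}
}

func containsServer(servers []string, s string) bool {
	for _, x := range servers {
		if x == s {
			return true
		}
	}
	return false
}
```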

The master is a single point of failure in GFS, so its state is replicated onto backup masters for fault tolerance.

Chunk size

File access

To read a file, the client contacts the master to read a file’s metadata; specifically, to get the list of chunk handles. It then gets the location of each of the chunk handles. Since chunks are replicated, each chunk handle is associated with a list of chunkservers. The client can contact any available chunkserver to read chunk data.
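
A sketch of that read path; lookupChunk and readFromReplica are stand-ins for the master and chunkserver RPCs, not real GFS calls:

```go
package gfs

const ChunkSize = 64 << 20 // 64 MB

// Stand-ins for the two RPCs involved in a read.
var (
	lookupChunk     func(file string, chunkIndex int64) (ChunkHandle, []string, error)
	readFromReplica func(server string, h ChunkHandle, offset, length int64) ([]byte, error)
)

// Read returns up to length bytes of file starting at offset (single-chunk case).
// The master is consulted only for metadata; the bytes come from a chunkserver.
func Read(file string, offset, length int64) ([]byte, error) {
	chunkIndex := offset / ChunkSize
	handle, servers, err := lookupChunk(file, chunkIndex) // clients cache this answer
	if err != nil {
		return nil, err
	}
	// Any replica can serve a read; a real client prefers a nearby chunkserver.
	return readFromReplica(servers[0], handle, offset%ChunkSize, length)
}
```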

File writes are expected to be far less frequent than file reads. To write to a file, the master grants a chunk lease to one of the replicas. This replica will be the primary replica chunkserver and will be the first one to get updates from clients. The primary can request lease extensions if needed. When the master grants the lease, it increments the chunk version number and informs all of the replicas containing that chunk of the new version number.

The actual writing of data is split into two phases: sending and writing.

First, the client is given a list of replicas that identifies the primary chunkserver and secondaries. The client sends the data to the closest replica chunkserver. That replica forwards the data to another replica chunkserver, which then forwards it to yet another replica, and so on. Eventually all the replicas get the data, which is not yet written to a file but sits in a cache.

When the client gets an acknowledgement from all replicas that the data has been received it then sends a write request to the primary, identifying the data that was sent in the previous phase. The primary is responsible for serialization of writes. It assigns consecutive serial numbers to all write requests that it has received, applies the writes to the file in serial-number order, and forwards the write requests in that order to the secondaries. Once the primary gets acknowledgements from all the secondaries, the primary responds back to the client and the write operation is complete.

The key point to note is that data flow is different from control flow. The data flows from the client to a chunkserver and then from that chunkserver to another chunkserver, and from that other chunkserver to yet another one until all chunkservers that store replicas for that chunk have received the data. The control (the write request) flow goes from the client to the primary chunkserver for that chunk. The primary then forwards the request to all the secondaries. This ensures that the primary is in control of the order of writes even if it receives multiple write requests concurrently. All replicas will have data written in the same sequence. Chunk version numbers are used to detect if any replica has stale data that was not updated because that chunkserver was down during some update.
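
A sketch of those two flows with stand-in RPCs (pushData, writeRequest, and applyAt are assumptions, not the real protocol messages): the data phase can visit replicas in any chain order, while the control phase always goes through the primary so every replica applies writes in the same serial order.

```go
package gfs

// Stand-ins for the RPCs used in a write.
var (
	pushData     func(server, id string, data []byte) error   // phase 1: stage data at a replica
	writeRequest func(primary, id string) error               // phase 2: client -> primary
	applyAt      func(secondary, id string, serial int) error // phase 2: primary -> secondary
)

// ClientWrite pushes the data along the replica chain (nearest replica first),
// then sends only a small control message to the primary.
func ClientWrite(chain []string, primary, id string, data []byte) error {
	for _, s := range chain {
		if err := pushData(s, id, data); err != nil {
			return err
		}
	}
	return writeRequest(primary, id)
}

// Primary-side handling: assign the next serial number, apply the staged data
// locally, then forward the request so all secondaries apply it in that order.
type Primary struct {
	nextSerial  int
	secondaries []string
	applyLocal  func(id string, serial int)
}

func (p *Primary) HandleWrite(id string) error {
	serial := p.nextSerial
	p.nextSerial++
	p.applyLocal(id, serial)
	for _, s := range p.secondaries {
		if err := applyAt(s, id, serial); err != nil {
			return err // real clients retry; the chunk region may be left inconsistent
		}
	}
	return nil
}
```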

Hadoop Distributed File System (HDFS)

File system operations

The Hadoop Distributed File System is inspired by GFS. The overall architecture is the same, although some terminology changes.

image

image

Heartbeating and Replication

DataNodes periodically send a heartbeat message and a block report to the NameNode.

The NameNode waits for a configured percentage of DataNodes to check in and then waits an additional 30 seconds. After that time, if any data blocks do not have their minimum number of replicas, the NameNode sends replica requests to DataNodes, asking them to create replicas of specific blocks.
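
A sketch of that re-replication decision with assumed types (this is not Hadoop code): after the wait ends, any block with fewer live replicas than its target gets a replication request addressed to one of the DataNodes still holding it.

```go
package hdfs

type BlockInfo struct {
	ID      uint64
	Target  int      // desired replication factor (3 by default)
	Holders []string // DataNodes that reported this block
}

type ReplicaRequest struct {
	BlockID uint64
	From    string // DataNode to copy the block from
	Copies  int    // how many additional replicas to create
}

// underReplicated is what the NameNode conceptually computes after the
// configured fraction of DataNodes has checked in (plus the extra wait).
func underReplicated(blocks []BlockInfo) []ReplicaRequest {
	var reqs []ReplicaRequest
	for _, b := range blocks {
		missing := b.Target - len(b.Holders)
		if missing > 0 && len(b.Holders) > 0 {
			reqs = append(reqs, ReplicaRequest{BlockID: b.ID, From: b.Holders[0], Copies: missing})
		}
	}
	return reqs
}
```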

image

The NameNode chooses a list of DataNodes that will host replicas of each block of a file. A client writes directly to the first replica. As the first replica gets the data from the client, it sends it to the second replica even before the entire block is written (e.g., it may get 4 KB out of a 64 MB block). As the second replica gets the data, it sends it to the third replica.

image

Implementation

The NameNode maintains two files: the EditLog and the FsImage.

The entire active file system image is kept in memory. On startup, the NameNode reads the FsImage and applies the list of changes in the EditLog to create an up-to-date file system image. Then, the image is flushed to the disk and the EditLog is cleared. This sequence is called a checkpoint. From this point, changes to file system metadata are logged to the EditLog, but the FsImage is not modified until the next checkpoint.
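
A sketch of that startup sequence, with stand-in helpers (loadImage, replayEditLog, saveImage, and clearEditLog are assumptions, not Hadoop's APIs):

```go
package hdfs

// Stand-ins for the real FsImage/EditLog handling.
var (
	loadImage     func(path string) (map[string]string, error) // FsImage -> in-memory namespace
	replayEditLog func(image map[string]string, logPath string) error
	saveImage     func(image map[string]string, path string) error
	clearEditLog  func(logPath string) error
)

// startupCheckpoint: load the FsImage, replay the EditLog on top of it, flush
// the merged image back to disk, then clear the EditLog.  Until the next
// checkpoint, metadata changes are appended only to the EditLog.
func startupCheckpoint(fsImagePath, editLogPath string) (map[string]string, error) {
	image, err := loadImage(fsImagePath)
	if err != nil {
		return nil, err
	}
	if err := replayEditLog(image, editLogPath); err != nil {
		return nil, err
	}
	if err := saveImage(image, fsImagePath); err != nil {
		return nil, err
	}
	return image, clearEditLog(editLogPath)
}
```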

On DataNodes, each block is stored as a separate file in the local file system. The DataNode does not have any knowledge of file names, attributes, and associated blocks; all that is handled by the NameNode. It simply processes requests to create, delete, write, read blocks, or replicate blocks. Any use of directories is done strictly for local efficiency - to ensure that a directory does not end up with a huge number of files that will impact performance.

To ensure data integrity, each HDFS file has a separate checksum file associated with it. This file is created by the client when the client creates the data file. Upon retrieval, if the checksum computed over a retrieved block does not match the stored checksum, the client can request a read from another DataNode.
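
A sketch of that client-side fallback; readBlock is a stand-in for the DataNode read call, and CRC-32 is an assumed checksum (the notes only say a checksum file exists):

```go
package hdfs

import (
	"errors"
	"hash/crc32"
)

var readBlock func(dataNode string, blockID uint64) ([]byte, error) // stand-in for the DataNode RPC

// readVerified tries each DataNode holding the block until the retrieved data
// matches the expected checksum from the checksum file.
func readVerified(blockID uint64, expected uint32, holders []string) ([]byte, error) {
	for _, dn := range holders {
		data, err := readBlock(dn, blockID)
		if err != nil {
			continue // DataNode unreachable: try the next replica
		}
		if crc32.ChecksumIEEE(data) == expected {
			return data, nil
		}
		// checksum mismatch: this replica is corrupt, fall through to another DataNode
	}
	return nil, errors.New("no replica matched the expected checksum")
}
```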

Summary

image

image

image

How GFS is Different From Other Distributed FS’s

Based on Google’s workload, GFS assumes

Ref