
Hadoop Architecture



Hadoop consists of many elements. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across all the storage nodes of a Hadoop cluster. The layer above HDFS (for the purposes of this article) is the MapReduce engine, which is made up of JobTrackers and TaskTrackers.


HDFS

To an external client, HDFS looks like a traditional hierarchical file system: files can be created, deleted, moved, renamed, and so on. However, the HDFS architecture is built on a specific set of nodes, a consequence of its design. These nodes are the NameNode (only one), which provides the metadata service within HDFS, and the DataNodes, which provide the storage blocks for HDFS. Because there is only one NameNode, it is a single point of failure, which is a shortcoming of HDFS.


Files stored in HDFS are divided into blocks, and these blocks are replicated to multiple computers (DataNodes). This differs from a traditional RAID architecture. The block size (usually 64 MB) and the number of block replicas are decided by the client when the file is created. The NameNode controls all file operations. All communication within HDFS is based on the standard TCP/IP protocol.
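The block division described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not HDFS code: the function names `split_into_blocks` and `raw_storage_bytes` are hypothetical, and the default of three replicas is an assumption taken from the common case mentioned later in this article.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the usual block size mentioned in the text


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def raw_storage_bytes(file_size, replication=3):
    """Bytes consumed across the cluster when every block is replicated."""
    return file_size * replication
```

For example, a 200 MB file yields three full 64 MB blocks plus one 8 MB block, and with three replicas it occupies 600 MB of raw cluster storage.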


NameNode

The NameNode is software that usually runs on a separate machine in an HDFS deployment. It manages the file system namespace and controls access by external clients. It also determines how files are mapped to replicated blocks on DataNodes. In the most common case of three replicas, one replica is stored on a node in the local rack, a second on a different node in the same rack, and the last on a node in a different rack.
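A minimal sketch of a three-replica placement of this kind follows, assuming the policy is read as: one replica on the writer's node, one on a different node in the same rack, and one on a node in another rack. The `topology` dictionary and the function name `place_replicas` are hypothetical and are not part of any Hadoop API.

```python
import random


def place_replicas(topology, writer_node, rng=None):
    """Pick three nodes for one block.

    topology maps a rack name to the list of node names in that rack
    (a made-up structure for this example).
    """
    rng = rng or random.Random()
    local_rack = None
    for rack, nodes in topology.items():
        if writer_node in nodes:
            local_rack = rack
            break
    if local_rack is None:
        raise ValueError("writer_node is not in the topology")

    # Candidates for the second replica: same rack, different node.
    same_rack = [n for n in topology[local_rack] if n != writer_node]
    # Candidates for the third replica: any node in any other rack.
    other_racks = [
        n for rack, nodes in topology.items() if rack != local_rack for n in nodes
    ]
    if not same_rack or not other_racks:
        raise ValueError("need a second local node and a node in another rack")

    return [writer_node, rng.choice(same_rack), rng.choice(other_racks)]
```

Keeping two replicas in one rack limits traffic over the slower links between racks, while the third replica protects the block if the whole rack is lost.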


The actual I/O transactions do not pass through the NameNode; only the metadata that describes the mapping of files to DataNodes and blocks does. When an external client sends a request to create a file, the NameNode responds with the block identifier and the IP address of the DataNode that will hold the first copy of the block, and it notifies the other DataNodes that will receive copies of the block.


The NameNode stores all the information about the file system namespace in a file called FsImage. This file, together with a log file that records all transactions (the EditLog), is stored in the local file system of the NameNode. The FsImage and EditLog files also need to be copied, in case the files are damaged or the NameNode system is lost.
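The relationship between the namespace image and the transaction log can be illustrated with a small snapshot-plus-replay sketch. The record format used here (tuples such as `("create", path)`) and the function names are hypothetical; they do not reflect the real on-disk format of either file.

```python
def apply_edit(namespace, edit):
    """Apply one logged transaction to an in-memory namespace (path -> block ids)."""
    op = edit[0]
    if op == "create":
        _, path = edit
        namespace[path] = []
    elif op == "add_block":
        _, path, block_id = edit
        namespace[path].append(block_id)
    elif op == "rename":
        _, src, dst = edit
        namespace[dst] = namespace.pop(src)
    elif op == "delete":
        _, path = edit
        del namespace[path]
    else:
        raise ValueError("unknown operation: %r" % (op,))


def recover_namespace(fsimage, edit_log):
    """Rebuild the current namespace from a snapshot and the edits made since."""
    # Copy the snapshot so the saved image itself is never modified.
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace
```

Because the current state is fully determined by the snapshot plus the log, keeping copies of both files is enough to rebuild the namespace after a failure.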


DataNode

The DataNode is also software, usually running on a separate machine in an HDFS deployment. A Hadoop cluster contains one NameNode and a large number of DataNodes. DataNodes are usually organized into racks, and each rack connects all of its systems through a switch. Hadoop assumes that transmission between nodes within a rack is faster than transmission between nodes in different racks.


DataNodes respond to read and write requests from HDFS clients, and to commands from the NameNode to create, delete, and replicate blocks. The NameNode relies on regular heartbeat messages from each DataNode. Each message contains a block report, which the NameNode uses to verify the block mapping and other file system metadata. If a DataNode fails to send heartbeat messages, the NameNode takes repair measures and re-replicates the blocks that were lost with that node.
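The heartbeat mechanism can be sketched as a small bookkeeping class. This is a simplified model under stated assumptions: the class name `HeartbeatMonitor`, the timeout value, and the use of explicit `now` timestamps are all choices made for this example and do not come from Hadoop itself.

```python
class HeartbeatMonitor:
    """Track heartbeats and find blocks that have lost replicas."""

    def __init__(self, timeout_seconds, replication=3):
        self.timeout = timeout_seconds
        self.replication = replication
        self.last_seen = {}   # node name -> time of the last heartbeat
        self.block_map = {}   # node name -> set of block ids it reported

    def heartbeat(self, node, block_report, now):
        """Record a heartbeat and the block report carried with it."""
        self.last_seen[node] = now
        self.block_map[node] = set(block_report)

    def live_nodes(self, now):
        """Nodes whose last heartbeat is not older than the timeout."""
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}

    def under_replicated(self, now):
        """Map each block to the number of extra copies it needs."""
        live = self.live_nodes(now)
        counts = {}
        for node, blocks in self.block_map.items():
            for block in blocks:
                counts.setdefault(block, 0)
                if node in live:
                    counts[block] += 1
        return {
            block: self.replication - count
            for block, count in counts.items()
            if count < self.replication
        }
```

A repair loop would call `under_replicated` periodically and copy each listed block from a surviving replica to another live node.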


File Operations

As the preceding sections suggest, HDFS is not a general-purpose file system. Its main purpose is to support streaming access to large files that are written once. When a client wants to write a file to HDFS, it first caches the file in local temporary storage. Once the cached data exceeds the desired HDFS block size, a request to create the file is sent to the NameNode, which responds to the client with the DataNode identity and the target block, and also notifies the DataNodes that will hold replicas of the block. When the client starts sending the temporary file to the first DataNode, that DataNode immediately forwards the block contents to the replica DataNodes through a pipeline. Meanwhile, the client is responsible for creating checksum files, which are stored in the same HDFS namespace. After the last block of the file has been sent, the NameNode commits the file creation to its persistent storage (the EditLog and FsImage files).
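The pipelined write can be modeled in memory as follows. This is a toy sketch, not the HDFS client protocol: the `DataNode` class and `write_file` function are hypothetical, blocks are numbered by position, and one CRC32 value per block (via Python's `zlib.crc32`) stands in for the checksum files, which is an assumption about granularity.

```python
import zlib


class DataNode:
    """In-memory stand-in for a storage node."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def receive(self, block_id, data, downstream):
        """Store the block, then forward it to the next node in the pipeline."""
        self.blocks[block_id] = data
        if downstream:
            downstream[0].receive(block_id, data, downstream[1:])


def write_file(data, block_size, pipeline):
    """Send data block by block to the first node; return per-block checksums."""
    if not pipeline:
        raise ValueError("pipeline must contain at least one DataNode")
    if block_size <= 0:
        raise ValueError("block_size must be > 0")
    checksums = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        block_id = offset // block_size
        # The client computes the checksum; the nodes only store and forward.
        checksums.append(zlib.crc32(block))
        pipeline[0].receive(block_id, block, pipeline[1:])
    return checksums
```

The client talks only to the first node, so the cost of replication is spread along the pipeline rather than borne by the client's own network link.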


Linux Cluster

The Hadoop framework can be used on a single Linux machine (for development and debugging), but it only shows its real strength on commodity servers mounted in racks. These racks form a Hadoop cluster. Hadoop uses its knowledge of the cluster topology to decide how to distribute jobs and files across the whole cluster. Hadoop assumes that nodes may fail, so it uses built-in methods to cope with the failure of a single computer or even of an entire rack.