Big Data - Apache Hadoop HDFS Interview Questions And Answers
In the first post on Apache Hadoop, we briefly covered what Apache Hadoop is, what it is used for, and what its constituents are. In this blog, the main focus is on
HDFS (Hadoop Distributed File System)
related interview questions, which are among the most frequently asked. HDFS is the primary storage for large datasets and a key part of the Hadoop architecture. It handles streaming data and runs on clusters of commodity hardware. This first list of questions on HDFS aims to provide the Q&A effectively in one single post. Please share your queries in the comments section, and if you like the post, please share it with like-minded friends.
1) How is HDFS different from Network Attached Storage(NAS)?
Answer:
NAS is a dedicated file-level data storage server that requires special-purpose hardware. HDFS is a distributed file system (storing data as blocks) that runs on clusters of machines with local hard drives, which makes it cost-effective compared with NAS.
HDFS is designed to work with the MapReduce system for analysis, since computation is moved to the data. NAS is unsuitable for MapReduce because data is stored separately from the computation: data is stored first, and analysis runs on top of it over the network, which becomes increasingly difficult as data grows exponentially.
Because HDFS is distributed and replicated across a cluster, it offers higher availability and reliability than NAS.
2) What are NameNode and DataNode in HDFS? How are they different from each other?
Answer:
NameNode is the master node. It primarily stores metadata about data blocks (location of a block, name of the block, replication factor, etc.), regulates client access to the actual files, and assigns work to the slaves (DataNodes). Since the NameNode is critical for fast retrieval of this metadata, it must be hosted on reliable hardware.
DataNode is the slave/worker node in the Hadoop cluster and stores the actual business data. As instructed by the master, a DataNode performs creation, replication, and deletion of data blocks. DataNodes require a large amount of storage, and commodity hardware can be used to host them.
3) What is metadata in Hadoop?
Answer:
Metadata means data about data. In Hadoop, both the NameNode and the DataNodes store metadata. The master (NameNode) holds the HDFS namespace state (metadata of all files and directories, and block information such as names and locations) along with transaction information (the edit logs). The slaves (DataNodes) hold checksum metadata (.meta files) for each block they store.
4) How does the NameNode handle DataNode failures in HDFS?
Answer:
HDFS follows a master-slave architecture in which the master is the NameNode and the slaves are the DataNodes. There is a single NameNode holding the metadata, and multiple DataNodes linked to it that are responsible for storing the actual data in HDFS.
Each DataNode sends a heartbeat to the NameNode at a regular interval (every 3 seconds by default), signalling that it is functioning properly, and periodically sends block reports listing the blocks it holds.
When the NameNode stops receiving heartbeats from a DataNode, it declares that DataNode dead or non-functional. All the blocks that the dead DataNode hosted are then re-replicated from the surviving replicas onto other DataNodes, restoring the replication factor.
This is how the NameNode handles DataNode failures.
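As a quick illustration, the DataNode liveness information that the NameNode maintains can be queried from a client through the DistributedFileSystem API. A minimal sketch, assuming a reachable cluster whose configuration files are on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DataNodeReport {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(new Configuration());
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        // Ask the NameNode for the status of every DataNode it knows about
        for (DatanodeInfo dn : dfs.getDataNodeStats()) {
            System.out.println(dn.getHostName()
                    + " last heartbeat at: " + dn.getLastUpdate());
        }
    }
}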
5) How much metadata will be created on the NameNode in Hadoop?
Answer:
NameNode (NN) metadata stores:
- file-to-block mapping
- locations of blocks on DataNodes
- the list of active DataNodes
- assorted other metadata
All of this is held in memory on the NN.
The things stored on disk are:
- fsimage
- edit logs
- status logs
a) fsimage - An fsimage file contains the complete state of the file system at a point in time. Every file system modification is assigned a unique, monotonically increasing transaction ID. An fsimage file represents the file system state after all modifications up to a specific transaction ID.
b) edits - An edits file is a log that lists each file system change (file creation, deletion, or update) made after the most recent fsimage.
When a file is put into HDFS, it is split into blocks (of configurable size).
Let’s say we have a file called “file.txt” that is 1000 MB and our block size is 128 MB. We will end up with seven 128 MB blocks and one 104 MB block. The NameNode keeps track of the fact that “file.txt” in HDFS maps to these eight blocks, with three replicas of each block. DataNodes store blocks, not files, so this mapping is essential for knowing where our data is and what it is.
Roughly 150 bytes of metadata is created per block object. Since there are 8 blocks with a replication factor of 3, i.e. 24 block replicas, about 150 × 24 = 3600 bytes of metadata will be created.
On disk, the NameNode stores the metadata for the file system: file and directory permissions, ownerships, and assigned blocks, kept in the fsimage and the edit logs. In properly configured setups, it also keeps a list of DataNodes allowed to join the cluster (the include file referenced by the dfs.hosts parameter) and of DataNodes to be removed from it (the exclude file referenced by dfs.hosts.exclude).
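The arithmetic above can be wrapped in a small helper. This is a back-of-the-envelope sketch only; the ~150 bytes per block replica is a rule of thumb, and the real footprint varies by Hadoop version:

public class MetadataEstimate {
    // blocks * replicas * ~150 bytes (rough rule of thumb, not exact)
    static long estimateBytes(long fileSizeMb, long blockSizeMb, int replication) {
        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb; // ceiling division
        return blocks * replication * 150L;
    }

    public static void main(String[] args) {
        // The file.txt example: 1000 MB file, 128 MB blocks, replication 3
        System.out.println(estimateBytes(1000, 128, 3) + " bytes"); // prints 3600 bytes
    }
}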
6) What is Safe Mode and when does NameNode enter in Safe Mode?
Answer:
Safemode is a maintenance state of the NameNode. In Safemode, the NameNode does not allow any modifications to the file system; the HDFS cluster remains read-only and does not replicate or delete data blocks.
The NameNode enters Safemode automatically at startup, while it loads the fsimage and edit logs and waits for DataNodes to report their blocks. It can also be put into Safemode manually, for example when the cluster is unstable, to prevent any deletion or replication of data blocks.
To check the status of Safemode (on current Hadoop versions use hdfs dfsadmin; the older hadoop dfsadmin form is deprecated):
hdfs dfsadmin -safemode get
To manually enter Safemode:
hdfs dfsadmin -safemode enter
To exit Safemode:
hdfs dfsadmin -safemode leave
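Safemode can also be checked programmatically through the DistributedFileSystem API. A minimal sketch, assuming a client with the cluster configuration on its classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafeModeCheck {
    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());
        // SAFEMODE_GET only queries the state; SAFEMODE_ENTER/SAFEMODE_LEAVE change it
        boolean inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
        System.out.println("Safemode is " + (inSafeMode ? "ON" : "OFF"));
    }
}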
7) How to restart all the daemons in Hadoop HDFS?
Answer:
start-all.sh and stop-all.sh start and stop all the Hadoop daemons at once, so a full restart is a stop-all.sh followed by a start-all.sh. These scripts are deprecated; on current versions, use stop-dfs.sh/start-dfs.sh for the HDFS daemons and stop-yarn.sh/start-yarn.sh for the YARN daemons (all found in the sbin directory).
8) What are the modes in which Apache Hadoop can run?
Answer:
There are three modes in which Apache Hadoop can run:
a. Standalone(Local) Mode
This is the default mode, in which Hadoop is configured to run non-distributed, as a single Java process, using the local file system. It is useful for debugging and is usually the fastest mode of Hadoop.
b. Pseudo Distributed Mode(Single node)
Runs on a single node, with each daemon running in a separate Java process. Custom configuration is required (core-site.xml, hdfs-site.xml, mapred-site.xml; see the sketch after this list), and HDFS is used for input and output. Useful for testing and debugging purposes.
c. Fully Distributed Mode
The production mode of running Hadoop, in which typically one machine in the cluster is designated as the NameNode and another as the ResourceManager, exclusively; these are the masters. All other nodes act as DataNodes and NodeManagers; these are the slaves. Configuration parameters and the environment need to be specified for the Hadoop daemons. Key features of this mode are fully distributed computing, reliability, fault tolerance, and scalability.
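For example, in pseudo-distributed mode core-site.xml typically points fs.defaultFS at a NameNode on localhost. The same setting can be supplied from client code; a minimal sketch (hdfs://localhost:9000 is a common convention, not a requirement):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ConnectPseudoDistributed {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to the fs.defaultFS entry in core-site.xml
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
    }
}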
9) On what basis does the NameNode distribute blocks across the DataNodes in HDFS?
Answer:
Hadoop's block distribution strategy balances trade-offs such as data reliability and read/write bandwidth. The default placement policy places the first replica on the node where the client is running (or on a random node if the client is outside the cluster). The second replica is placed on a different rack from the first (off-rack), while the third replica is placed on the same rack as the second but on a different DataNode. Further replicas are placed on random nodes, avoiding placing too many on the same rack.
Hadoop also has a Balancer daemon. It redistributes blocks by moving them from over-utilized DataNodes to under-utilized DataNodes while respecting the basic placement policy.
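To see where the placement policy actually put the replicas of a given file, a client can ask the NameNode for its block locations. A minimal sketch, assuming an existing file at the hypothetical path /data/file.txt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockPlacement {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/file.txt"));
        // One BlockLocation per block, listing the DataNodes holding its replicas
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("block at offset " + block.getOffset()
                    + " -> " + String.join(", ", block.getHosts()));
        }
    }
}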
10) What is a block in HDFS, and why does it have a size of 64 MB?
Answer:
In Hadoop, large files are broken down into smaller chunks known as data blocks, the smallest unit of data in HDFS. Each block is stored contiguously on a physical disk, which makes reading and writing a block fast. Let's say you have a 128 MB file. With a 4 KB block size, you would have to make 32,768 requests to get that file (one request per block), and each request has to be processed by the NameNode to figure out where that block can be found. That's a lot of traffic! If you use 64 MB blocks, the number of requests goes down to 2.
11) What is Fault Tolerance in Hadoop HDFS?
Answer:
HDFS is highly fault tolerant. It handles faults through replication: first, data is broken down into blocks and these blocks are distributed across different machines in the cluster; then a replica of each block is created on other machines in the cluster. By default, HDFS creates 3 copies of each block. So if for some reason any machine in the HDFS cluster goes down or fails, users can still access that data from the other machines holding a replica. Its distributed storage also enables fast, parallel file reads and writes.
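The replication factor behind this fault tolerance is set cluster-wide by dfs.replication and can also be changed per file. A minimal sketch, raising the replication of one hypothetical file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask the NameNode to keep 5 replicas of this file instead of the default 3
        boolean accepted = fs.setReplication(new Path("/data/important.txt"), (short) 5);
        System.out.println("Replication change accepted: " + accepted);
    }
}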
12) Why is block size set to 128 MB in HDFS?
Answer:
The default size of a block in HDFS is 128 MB (Hadoop 2.x) and 64 MB (Hadoop 1.x). The reason for having a large block size of 128 MB is the same as in Question 10 above: larger blocks mean fewer block lookups on the NameNode and less metadata to manage.
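The block size is controlled by the dfs.blocksize property in hdfs-site.xml and can also be overridden per file at create time. A minimal sketch; the 256 MB value and the path are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long blockSize = 256L * 1024 * 1024; // 256 MB, illustration only
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(
                new Path("/tmp/bigfile.bin"), true, 4096, (short) 3, blockSize);
        out.writeUTF("hello hdfs");
        out.close();
    }
}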
13) What happens if the block on Hadoop HDFS is corrupted?
Answer:
Though HDFS is quite robust when it comes to fault tolerance, there are still cases when blocks or metadata get corrupted. Two ways to rectify the situation:
a. HDFS fsck : The fsck command (run "hdfs fsck /") checks the health of the file system, reports which files contain missing or corrupt blocks, and gives you options for dealing with them (for example, -move to quarantine affected files in /lost+found or -delete to remove them). Note that, unlike a traditional fsck, it reports problems but does not repair blocks itself.
b. Automatic re-replication and NameNode recovery : HDFS is highly tolerant and available due to the replication of data blocks distributed across the cluster; with a replication factor of "x", each block has x copies. HDFS usually auto-corrects by identifying corrupt blocks through checksum verification, re-replicating from a good replica, and maintaining the replication factor. If the NameNode's own metadata (edit log) is corrupted, start the NameNode with the -recover flag to activate recovery mode:
hdfs namenode -recover
14) How is data or a file read in Hadoop HDFS?
Answer:
Reading data or a file in HDFS is a sequential process. The following steps are involved in a "read" operation:
a. A client initiates the read request by calling the 'open()' method of the FileSystem object, which is an instance of DistributedFileSystem.
b. This object connects to the NameNode using RPC and gets metadata such as the locations of the blocks of the file.
c. In response to this metadata request, the addresses of the DataNodes having a copy of each block are returned.
d. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream wraps a DFSInputStream, which takes care of interactions with the DataNodes and the NameNode. The client invokes the 'read()' method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
e. Data is read as a stream, with the client invoking 'read()' repeatedly until the end of the block is reached.
f. Once the end of a block is reached, DFSInputStream closes the connection and locates the best DataNode for the next block.
g. Once the client is done reading, it calls the close() method.
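The steps above map directly onto the HDFS client API. A minimal read sketch, with the hypothetical path /data/file.txt and error handling omitted for brevity:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // FileSystem.get() returns a DistributedFileSystem for hdfs:// URIs (step a)
        FileSystem fs = FileSystem.get(new Configuration());
        // open() asks the NameNode over RPC for the block locations (steps b-d)
        FSDataInputStream in = fs.open(new Path("/data/file.txt"));
        // read() streams block data from DataNodes, one block after another (steps e-f)
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close(); // step g
    }
}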
More questions will follow in my subsequent post.