    BigData Apache Hadoop HDFS Interview QnA - II

    This is the second post, in continuation of my previous post on Apache Hadoop HDFS interview questions. I have covered 16 questions in this post.

    Q 1:How data or file is written into Hadoop HDFS?
    A: In order to write a file in HDFS:
    - A client first asks the master, i.e. the NameNode, for a handle to the file system.
    - The NameNode returns the addresses of the DataNodes (slaves) on which the client will write the data.
    - The client then writes the data directly to the DataNodes by forming a write pipeline.
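
    From the client's point of view this pipeline is transparent; a simple upload such as the following (the paths are illustrative) exercises the whole write flow described above:

    # Copy a local file into HDFS; the client asks the NameNode for
    # target DataNodes and streams the blocks through the write pipeline.
    hdfs dfs -put /tmp/sample.txt /user/pogouser/sample.txt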

    Q 2: What should be the block size in Hadoop, ideally?
    A: There is no hard rule for nailing down the block size in Hadoop; it boils down to the size of the input data. If the input data is huge, then a block size of 128/256 MB is good to have for optimized performance. While dealing with small files, smaller block sizes are recommended. The default block size value can be overridden using the parameter dfs.block.size (named dfs.blocksize in newer Hadoop releases); a usage sketch follows the list below.
    Few points to remember:

    - If a failure occurs while a larger block is being processed, more work has to be redone.
    - A larger block size means fewer blocks per file.
    - Fewer blocks mean fewer nodes holding the data, hence reduced throughput for parallel access.
    - Larger blocks make it possible for the client to read/write more data without interacting with the NameNode, saving time.
    - Larger blocks reduce the metadata size on the NameNode, reducing NameNode load.
    - Having fewer and larger blocks also means longer tasks, which in turn may not gain maximum parallelism.
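
    As a sketch (the file name is illustrative), the block size can also be overridden for a single upload with the generic -D option instead of editing hdfs-site.xml:

    # Write a file with a 256 MB block size (268435456 bytes),
    # overriding the cluster default for this upload only.
    hdfs dfs -D dfs.blocksize=268435456 -put /tmp/bigfile.dat /user/pogouser/bigfile.dat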

    Q 3:What is Heartbeat in Hadoop?
    A: In Hadoop, the NameNode and the DataNodes communicate using heartbeats. A heartbeat is a signal sent by a DataNode to the NameNode at a regular interval of time to indicate its presence, i.e. to indicate that it is alive.

    - The default heartbeat interval is 3 seconds. 
    - If there is no heartbeat from a DataNode to the NameNode for ten minutes, the NameNode considers that DataNode to be out of service, and the block replicas hosted by it are treated as unavailable.
    - The NameNode then schedules the creation of new replicas of those blocks on other DataNodes.
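
    The effect of heartbeats can be observed from the command line; the report below (an illustrative check that requires HDFS superuser rights) lists live and dead DataNodes along with their last contact time:

    # Show cluster status, including live/dead DataNodes
    # and the "Last contact" time derived from heartbeats.
    hdfs dfsadmin -report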

    Q 4:How often DataNode send heartbeat to NameNode in Hadoop?
    A: The default is 3 seconds. It is controlled by the dfs.heartbeat.interval property in hdfs-site.xml.
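
    As an illustrative check, the effective value can be read back from the cluster configuration:

    # Print the configured heartbeat interval (in seconds).
    hdfs getconf -confKey dfs.heartbeat.interval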

    Q 5:How HDFS helps NameNode in scaling in Hadoop?
    A: A primary benefit of Hadoop is the scalability of the cluster by adding more nodes.
    There are two types of scalability in Hadoop: vertical and horizontal.
    Vertical scalability/Scale up:
    - Increase the hardware capacity of the individual machine. 
    - Add more RAM or CPU to your existing system to make it more robust and powerful.

    Horizontal scalability/Scale out
    - Addition of more machines or setting up the cluster. 
    - Add more nodes to the existing cluster. Most importantly, you can add machines without stopping the system, so there is no downtime or maintenance window while scaling out. In the end you have more machines working in parallel to meet your requirements.

    HDFS has two main layers:-

    1. Namespace - manages directories, files and blocks. It supports file system operations such as creation, modification, deletion and listing of files and directories.

    2. Block Storage - Block storage provides operations like creation, deletion, modification and getting the location of the blocks. It also takes care of replica placement and replication.

    In the architecture without HDFS Federation, DataNodes could be scaled both vertically and horizontally, but the NameNode could be scaled only vertically, not horizontally. That architecture has multiple DataNodes but only one NameNode (one namespace) for all DataNodes, which limits the number of blocks, files, and directories supported on the file system.

    In order to overcome this limitation, HDFS Federation was introduced, which makes it possible to scale the NameNode horizontally through the use of multiple independent NameNodes/namespaces. In HDFS Federation the NameNodes do not require coordination with each other, as each NameNode is independent. All the DataNodes are used as common storage for blocks by all the NameNodes.
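
    On a federated cluster, the configured namespaces can be checked with a quick, illustrative query (the key is only set when federation or HA is configured):

    # List the nameservices configured for the cluster.
    hdfs getconf -confKey dfs.nameservices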

    Q 6:What is Secondary NameNode in Hadoop HDFS?
    A: The Secondary NameNode in Hadoop is a specially dedicated node in the HDFS cluster. Its main function is to take checkpoints of the file system metadata present on the NameNode: it periodically merges the edits log into the FsImage and keeps a copy of the result. It is not a backup NameNode; it merely checkpoints the NameNode's file system namespace. The Secondary NameNode is a helper to the primary NameNode, not a replacement for it. As the NameNode is a single point of failure in HDFS (if the NameNode fails, the entire HDFS file system is lost), Hadoop introduced the Secondary NameNode, whose main function is to store a copy of the FsImage file and the edits log file.
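
    The checkpointing cadence is configurable; as an illustrative check, the default interval (3600 seconds) can be read back like this:

    # How often, in seconds, the Secondary NameNode takes a checkpoint.
    hdfs getconf -confKey dfs.namenode.checkpoint.period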

    Q 7:Ideally what should be the replication factor in Hadoop?
    A: The replication factor is the number of times the Hadoop framework replicates each and every data block in order to provide fault tolerance. The default replication factor is 3. It can be configured as per the requirement: increased (more than 3) or decreased (less than 3) as needed.
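
    A quick, illustrative way to see the replication factor of an existing file (the path is assumed from the earlier example):

    # Print the replication factor (%r) and name (%n) of a file.
    hdfs dfs -stat "%r %n" /user/pogouser/sample.txt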

    Q 8:How can one change the replication factor when data is already stored in HDFS?
    A: The global replication factor for the entire cluster is set in the HDFS configuration file (hdfs-site.xml) via the dfs.replication property, but that setting applies only to newly created files, not to existing ones. For data already stored in HDFS, use the setrep command:
    e.g.
    # Change the replication factor on a per-file basis;
    # -w waits until the re-replication completes.
    hadoop fs -setrep -w 3 /file/filename.xml

    # Change the replication factor for files that already exist in HDFS;
    # the -R flag recursively changes it for all files under the directory.
    hadoop fs -setrep -w 3 -R /directory

    Q 9:Why HDFS performs replication, although it results in data redundancy in Hadoop?
    A: This is done in order to improve fault tolerance and reduce downtime: if one replica is lost, the data can still be served from another. In the process it improves overall data reliability.

    Q 10:What is Safemode in Apache Hadoop?
    A: Safemode in Hadoop is a state of the NameNode during which writes to the file system are not allowed. During Safemode, the entire cluster remains in read-only mode.

    Q 11:What happen when namenode enters in safemode in hadoop?
    A: When the NameNode enters Safemode in Hadoop, no changes to the file system are allowed. The NameNode loads the file system state from the last saved FsImage into memory, replays the edits log, and stays in Safemode until the DataNodes have reported enough block replicas.
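
    As an illustrative check, the current Safemode status can be queried:

    # Report whether the NameNode is currently in Safemode.
    hdfs dfsadmin -safemode get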

    Q 12:How to remove safemode of namenode forcefully in HDFS?
    A: There is one command to leave Safemode (the older hadoop dfsadmin form still works but is deprecated in favour of hdfs dfsadmin):

    hdfs dfsadmin -safemode leave


    Q 13:How to create the directory when Name node is in safe mode?
    A: One has to bring the NameNode out of Safemode first; only then can a directory be created. Otherwise the NameNode keeps rejecting the request with an error such as "Cannot create directory ... Name node is in safe mode".
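
    A minimal sketch of the sequence (the directory name is illustrative):

    # Leave Safemode, after which the mkdir succeeds.
    hdfs dfsadmin -safemode leave
    hdfs dfs -mkdir -p /user/pogouser/newdir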


    Q 14:What is the difference between a MapReduce InputSplit and an HDFS block?
    A: An HDFS block is the physical representation of data in Hadoop, while a MapReduce InputSplit is the logical representation of the data present in the block. The InputSplit is used during data processing in a MapReduce program or other processing techniques; it references the data rather than containing it.

    Q 15:Explain Small File Problem in Hadoop
    A: The default HDFS block size is 64 MB (128 MB in Hadoop 2.x and later). HDFS is not designed to handle small files efficiently: it is primarily designed for streaming access to large files, and every file, however small, costs the NameNode metadata held in memory. Reading through small files normally causes lots of seeks and lots of hopping from DataNode to DataNode to retrieve each small file, all of which is an inefficient data access pattern.
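
    One common mitigation is to pack small files into a Hadoop Archive (HAR); the paths below are illustrative:

    # Pack the directory "input" (relative to the parent /user/pogouser)
    # into a single archive stored under /user/pogouser.
    hadoop archive -archiveName small.har -p /user/pogouser input /user/pogouser
    # The archived files remain readable through the har:// scheme.
    hdfs dfs -ls har:///user/pogouser/small.har/input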

    Q 16:How to create Users in Hadoop HDFS?
    A: There are following steps involved in making users in Hadoop HDFS:

    a. Create a user on your host (client machine) and add it to the "hadoop" group:
    # useradd -G hadoop pogouser

    b. Create a home directory for this user on HDFS and hand ownership to it:
    # su - hdfs -c "hdfs dfs -mkdir /user/pogouser"
    # su - hdfs -c "hdfs dfs -chown pogouser:hadoop /user/pogouser"
    # su - hdfs -c "hdfs dfs -chmod 755 /user/pogouser"

    c. The new user can then work on HDFS, e.g. putting a file into HDFS:

    # su - pogouser
    # hdfs dfs -put /etc/passwd  /user/pogouser
    # hdfs dfs -cat  /user/pogouser/passwd
