Friday, 9 September 2016

YARN Questions

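Q. What is YARN ?

Ans .
YARN ("Yet Another Resource Negotiator") is the cluster resource management layer introduced in Hadoop 2.x. It consists of the following main components:
  • ResourceManager
  • NodeManager
  • ApplicationMaster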

1. ResourceManager :

The ResourceManager typically runs on its own machine and is responsible for scheduling and allocating resources. The two main components of the ResourceManager are:

  1. Scheduler
  2. ApplicationsManager (AsM)

If you are familiar with Hadoop 1.x, note that YARN splits the functionality of the JobTracker into two separate processes, the ResourceManager and the ApplicationMaster:

The ResourceManager allocates resources for applications but does not manage the lifecycle of applications. Instead, each application is managed by an ApplicationMaster that runs on a node in the cluster. Every application running in the cluster requires its own ApplicationMaster.

2. NodeManager :

The NodeManager is a daemon process that runs on each worker node in the cluster (typically the same machines that run the DataNodes). Its responsibilities include:


  1. Communicating its status to the ResourceManager (RM)
  2. Tracking the health of the node
  3. Overseeing the lifecycle management of containers
  4. Monitoring resource usage of each container (i.e. memory and CPUs)
  5. Managing resource localization (for JAR files, libraries, and any other application-specific files used by containers)
  6. Managing the logs generated by containers
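
The monitoring and status-reporting duties above can be pictured with a small toy model. This is only a sketch, not the real NodeManager: the class names, the `heartbeat` dictionary layout, and the rule "remove any container that uses more memory than its limit" are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ContainerUsage:
    """Hypothetical record of one container's memory limit and current usage."""
    container_id: str
    memory_limit_mb: int
    memory_used_mb: int = 0


@dataclass
class ToyNodeManager:
    """Toy stand-in for a NodeManager; not the real Hadoop class."""
    node_id: str
    containers: dict = field(default_factory=dict)

    def launch(self, container_id, memory_limit_mb):
        self.containers[container_id] = ContainerUsage(container_id, memory_limit_mb)

    def report_usage(self, container_id, memory_used_mb):
        self.containers[container_id].memory_used_mb = memory_used_mb

    def enforce_limits(self):
        """Remove every container that exceeds its memory limit; return their IDs."""
        over = [c.container_id for c in self.containers.values()
                if c.memory_used_mb > c.memory_limit_mb]
        for cid in over:
            del self.containers[cid]
        return over

    def heartbeat(self):
        """Build the status summary this toy node would send to the RM."""
        return {
            "node_id": self.node_id,
            "healthy": True,  # placeholder: a real node would run health checks
            "running_containers": sorted(self.containers),
            "memory_used_mb": sum(c.memory_used_mb for c in self.containers.values()),
        }


nm = ToyNodeManager("node-1")
nm.launch("c1", memory_limit_mb=1024)
nm.launch("c2", memory_limit_mb=512)
nm.report_usage("c1", 800)
nm.report_usage("c2", 700)  # over its 512 MB limit
killed = nm.enforce_limits()
status = nm.heartbeat()
```

Here `c2` is removed because it went over its limit, and the heartbeat then lists only `c1`.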



3. ApplicationMaster :

The per-application ApplicationMaster (AM) is the bootstrap process that drives a YARN application once the application has been submitted and the AM itself has been launched.
The responsibilities of the AM include:

  1. Negotiating appropriate containers from the ResourceManager
  2. Working with the NodeManagers to execute and monitor the containers and their resource consumption
  3. Providing fault tolerance.
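
As a rough illustration of these three responsibilities, the sketch below models an AM loop that asks a toy ResourceManager for a container per task attempt and reruns a task when it fails. Everything here is hypothetical (the `ToyResourceManager` class, the `run_application` function, the `flaky` workload, and the retry limit of 3); a real AM talks to the RM and the NodeManagers over YARN's own protocols.

```python
class ToyResourceManager:
    """Toy RM that hands out container IDs until its capacity is used up."""

    def __init__(self, total_containers):
        self.free = total_containers
        self.next_id = 0

    def allocate(self):
        if self.free == 0:
            raise RuntimeError("no free containers")
        self.free -= 1
        self.next_id += 1
        return f"container_{self.next_id}"

    def release(self, container_id):
        self.free += 1


def run_application(rm, tasks, run_task, max_attempts=3):
    """Toy AM loop: one container per task attempt, failed tasks are retried."""
    attempts = {}
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            container = rm.allocate()         # negotiate a container from the RM
            try:
                ok = run_task(task, attempt)  # stand-in for launching via a NodeManager
            finally:
                rm.release(container)         # give the container back either way
            if ok:
                attempts[task] = attempt
                break
        else:
            raise RuntimeError(f"task {task} failed {max_attempts} times")
    return attempts


def flaky(task, attempt):
    # Hypothetical workload: "map-1" fails on its first attempt only.
    return not (task == "map-1" and attempt == 1)


rm = ToyResourceManager(total_containers=2)
result = run_application(rm, ["map-0", "map-1", "reduce-0"], flaky)
```

The point of the sketch is that the retry logic lives in the AM, not in the ResourceManager, which only hands out and reclaims containers.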

The benefits of the AM include:
  1. Extensibility: Hadoop computing can now be more than Java MapReduce applications.
  2. Scalability: Hadoop clusters can now be considerably larger, because the ResourceManager does not manage fault tolerance for individual applications (with the old JobTracker this caused bottlenecks and limited the size of a Hadoop cluster). YARN applications have been executed on clusters of over 10,000 nodes.

[Figure: YARN life cycle]

Q. What is a YARN container ? How does it work ?

Ans .
Containers :

A container in YARN represents a unit of work in an application. A container:
  1. Runs on a node, managed by a NodeManager
  2. Uses some of the node's resources, specifically the memory and CPU allocated to it
  3. Depends on libraries and files that are represented as local resources, which the NodeManager provides through a LocalResource
  4. Performs the needed work

The container does the actual work of the specific YARN application. This is where your custom code runs, doing whatever you need to do to your big data on Hadoop.
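
One way to picture "a container uses memory and CPU on a node" is a simple capacity check. The sketch below is a toy model under stated assumptions: `Resource` and `Node` are hypothetical names (not Hadoop classes), and a container is admitted only when both its memory and its virtual cores still fit on the node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int


@dataclass
class Node:
    """Toy view of the capacity a NodeManager offers; not a Hadoop class."""
    name: str
    capacity: Resource
    used_memory_mb: int = 0
    used_vcores: int = 0

    def can_fit(self, request):
        return (self.used_memory_mb + request.memory_mb <= self.capacity.memory_mb
                and self.used_vcores + request.vcores <= self.capacity.vcores)

    def start_container(self, request):
        """Reserve the requested resources, or return False if they do not fit."""
        if not self.can_fit(request):
            return False
        self.used_memory_mb += request.memory_mb
        self.used_vcores += request.vcores
        return True


node = Node("node-1", Resource(memory_mb=4096, vcores=4))
first = node.start_container(Resource(memory_mb=2048, vcores=2))
second = node.start_container(Resource(memory_mb=1024, vcores=1))
third = node.start_container(Resource(memory_mb=2048, vcores=1))  # memory would exceed 4096
```

The third request is rejected because only 1024 MB remain on the node, even though a core is still free.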



Q. What is the difference between MR1 and MR2 (YARN) ?

Ans .


MRv1 uses the JobTracker to create and assign tasks to TaskTrackers, which can become a resource bottleneck when the cluster scales out far enough (usually around 4,000 nodes).

MRv2 runs on YARN ("Yet Another Resource Negotiator"), which has one ResourceManager per cluster, while each worker node runs a NodeManager. In MRv2, the functions of the JobTracker have been split between three services:

  1. The ResourceManager is a persistent YARN service that receives and runs applications (a MapReduce job is an application) on the cluster. It contains the scheduler, which, as previously, is pluggable.
  2. The MapReduce Application Master takes over the MapReduce-specific capabilities of the JobTracker. One is started to manage each MapReduce job and is terminated when the job completes.
  3. The JobHistory Server takes over the JobTracker function of serving information about completed jobs.

The TaskTracker has been replaced with the NodeManager, a YARN service that manages resources and deployment on a host. It is responsible for launching containers, each of which can house a map or reduce task.

Q. How do reads and writes happen in HDFS ?

Ans .
The steps below describe the write path.

Step 1: The client creates the file by calling create() method on DistributedFileSystem.

Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace, with no blocks associated with it. The namenode performs various checks to make sure the file doesn’t already exist and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to.
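
The namenode-side checks in Step 2 can be sketched as follows. This is a toy model, not HDFS code: `ToyNameNode`, its `writers` set (a stand-in for real permission checks), and the error messages are all assumptions, and Python's `IOError` plays the role of the Java `IOException`.

```python
class ToyNameNode:
    """Toy namespace: maps a path to its (initially empty) block list."""

    def __init__(self, writers):
        self.files = {}
        self.writers = set(writers)  # hypothetical stand-in for permission checks

    def create(self, path, user):
        if user not in self.writers:
            raise IOError(f"permission denied for {user}: {path}")
        if path in self.files:
            raise IOError(f"file already exists: {path}")
        self.files[path] = []        # new file, no blocks associated yet


nn = ToyNameNode(writers={"alice"})
nn.create("/data/log.txt", "alice")

# Both of these attempts should fail: a duplicate path and an unauthorized user.
errors = []
for path, user in [("/data/log.txt", "alice"), ("/data/other.txt", "bob")]:
    try:
        nn.create(path, user)
    except IOError as exc:
        errors.append(str(exc))
```

Only the first `create` call succeeds, and the new file is recorded with an empty block list, mirroring "with no blocks associated with it".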

Step 3: As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and here we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline.

Step 4: Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline.

Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.
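
Steps 3 to 5 can be pictured with a small simulation of the data queue and the ack queue. This is a simplified sketch under loud assumptions: the function names are hypothetical, the pipeline is processed synchronously, and no datanode ever fails, whereas the real client streams packets and receives acknowledgments asynchronously.

```python
from collections import deque


def split_into_packets(data, packet_size):
    """Cut the byte string into fixed-size packets (the last one may be shorter)."""
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]


def write_through_pipeline(data, packet_size, datanodes):
    """Toy model: move packets from the data queue through the ack queue."""
    data_queue = deque(enumerate(split_into_packets(data, packet_size)))
    ack_queue = deque()
    stored = {name: [] for name in datanodes}

    while data_queue:
        seq, packet = data_queue.popleft()
        ack_queue.append((seq, packet))  # packet is now awaiting acknowledgment
        acks = set()
        for name in datanodes:           # each node stores the packet, then forwards it
            stored[name].append(packet)
            acks.add(name)
        if acks == set(datanodes):       # acknowledged by the whole pipeline
            ack_queue.popleft()

    return stored, len(ack_queue)


stored, unacked = write_through_pipeline(b"hello hdfs!", 4, ["dn1", "dn2", "dn3"])
```

With a packet size of 4 bytes the input is split into three packets, every datanode ends up with the same three packets, and the ack queue is empty once all of them have been acknowledged.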

Step 6: When the client has finished writing data, it calls close() on the stream.

Step 7: This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete. The namenode already knows which blocks the file is made up of, so it only has to wait for blocks to be minimally replicated before returning successfully.