41. On what concept does the Hadoop framework work? The Hadoop framework works on the following two core components - HDFS, a distributed file system for storage, and MapReduce, a programming model for distributed processing of the data stored in HDFS.
42. What are the main components of a Hadoop application? Hadoop applications draw on a wide range of technologies that provide great advantages in solving complex business problems. At their core they rely on HDFS for storage and MapReduce (with YARN in later versions) for processing, with ecosystem components such as Hive, Pig and HBase built on top.
43. What is Hadoop Streaming? The Hadoop distribution ships with a generic application programming interface for writing Map and Reduce jobs in any desired programming language such as Python, Perl or Ruby. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell script or executable as the Mapper or Reducer.
44. What is the best hardware configuration to run Hadoop? The best configuration for executing Hadoop jobs is dual-core machines or dual processors with 4GB or 8GB of RAM that use ECC memory. Hadoop benefits greatly from ECC memory, even though it is not low-end. ECC memory is recommended for running Hadoop because most Hadoop users have experienced various checksum errors when using non-ECC memory. However, the hardware configuration also depends on the workflow requirements and can change accordingly.
45. What are the most commonly defined input formats in Hadoop? The most common input formats defined in Hadoop are TextInputFormat (the default), KeyValueTextInputFormat and SequenceFileInputFormat.
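To make question 45 concrete, the following is a minimal driver sketch, assuming the standard org.apache.hadoop.mapreduce API (the class name InputFormatDemo and the command-line argument paths are placeholders, not part of the original answer), showing how the input format is selected for a job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");
        job.setJarByClass(InputFormatDemo.class);
        // TextInputFormat (the default): key = byte offset of the line, value = the line text.
        job.setInputFormatClass(TextInputFormat.class);
        // For tab-separated "key<TAB>value" lines, KeyValueTextInputFormat could be used instead:
        // job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // No mapper/reducer is set, so the identity Mapper and Reducer are used.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}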
46. What is a block and block scanner in HDFS? Block - The minimum amount of data that can be read or written is generally referred to as a "block" in HDFS. The default size of a block in HDFS is 64MB (128MB in Hadoop 2.x and later). Block Scanner - A block scanner runs periodically on every DataNode and verifies the checksums of the blocks stored on that DataNode in order to detect corrupted blocks.
47. Explain the difference between NameNode, Backup Node and Checkpoint NameNode. NameNode: The NameNode is at the heart of the HDFS file system and manages the metadata, i.e. the data of the files is not stored on the NameNode; rather, it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. The NameNode uses two files for the namespace -
fsimage file - It keeps track of the latest checkpoint of the namespace.
edits file - It is a log of the changes that have been made to the namespace since the last checkpoint.
Checkpoint NameNode: The Checkpoint NameNode periodically creates checkpoints of the namespace by downloading the fsimage and edits files from the NameNode, merging them locally and uploading the new fsimage back to the active NameNode.
Backup Node: The Backup Node also provides checkpointing, but in addition it maintains an up-to-date, in-memory copy of the file system namespace that is always synchronized with the active NameNode state.
48. What is commodity hardware? Commodity hardware refers to inexpensive systems that do not offer high availability or high-end quality. Commodity hardware does, however, need adequate RAM, because there are specific services that have to execute in memory. Hadoop can be run on any commodity hardware and does not require supercomputers or high-end hardware configurations to execute jobs.
49. What are the port numbers for the NameNode, Task Tracker and Job Tracker? NameNode 50070, Job Tracker 50030, Task Tracker 50060.
50. Explain the process of inter-cluster data copying. HDFS provides a distributed data copying facility through DistCP from a source to a destination. If this data copying takes place between two Hadoop clusters, it is referred to as inter-cluster data copying. DistCP requires the source and destination to have a compatible or identical version of Hadoop.
51. How can you overwrite the replication factor in HDFS? 1) Using the Hadoop FS shell, the replication factor can be changed on a per-file basis using the command below -
$hadoop fs -setrep -w 2 /my/test_file (test_file is the filename whose replication factor will be set to 2)
2) The replication factor of all files under a given directory can be changed in the same way by pointing the command at the directory -
$hadoop fs -setrep -w 5 /my/test_dir (the replication factor of every file under test_dir will be set to 5)
The same change can also be made programmatically through the FileSystem API; a short sketch follows question 52 below.
52. Explain the difference between NAS and HDFS. NAS (Network Attached Storage) stores data on dedicated hardware, whereas HDFS distributes data blocks across the local disks of all the machines in the cluster. HDFS is designed to work with MapReduce, which moves computation to the data, while NAS is not suitable for MapReduce because the data is stored separately from the computation. HDFS runs on commodity hardware and provides fault tolerance by replicating data blocks, whereas NAS relies on dedicated, higher-end hardware.
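As mentioned under question 51, here is a minimal sketch of the programmatic route, assuming the standard org.apache.hadoop.fs.FileSystem API (the class name SetReplicationDemo is a placeholder; the path is the test_file from the shell example above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationDemo {
    public static void main(String[] args) throws Exception {
        // Configuration picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Ask HDFS to keep 2 replicas of this file from now on.
        boolean accepted = fs.setReplication(new Path("/my/test_file"), (short) 2);
        System.out.println("Replication change accepted: " + accepted);
        fs.close();
    }
}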
53. Explain what happens if, during the PUT operation, an HDFS block is assigned a replication factor of 1 instead of the default value of 3. The replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times blocks are replicated, in order to ensure high data availability. For every block stored in HDFS with a replication factor of n, the cluster holds n-1 duplicates in addition to the original. So, if the replication factor during the PUT operation is set to 1 instead of the default value of 3, only a single copy of the data exists. Under these circumstances, if the DataNode holding that copy crashes, the data is lost, since there is no other replica to recover it from.
54. What is the process to change files at arbitrary locations in HDFS? HDFS does not support modifications at arbitrary offsets in a file, nor multiple writers. Files are written by a single writer in append-only fashion, i.e. writes to a file in HDFS are always made at the end of the file.
55. Explain the indexing process in HDFS. Indexing in HDFS depends on the block size. HDFS stores the last part of the data, which in turn points to the address where the next part of the data chunk is stored.
56. What is rack awareness and on what basis is data stored in a rack? All the DataNodes put together form a storage area, i.e. the physical location of the DataNodes, referred to as a rack in HDFS. The rack information, i.e. the rack id of each DataNode, is acquired by the NameNode. The process of selecting closer DataNodes based on this rack information is known as Rack Awareness.
57. Explain the usage of the Context object. The Context object is used to help the mapper interact with the rest of the Hadoop system. It can be used for updating counters, reporting progress and providing application-level status updates. The Context object also holds the configuration details for the job and exposes the interfaces through which the mapper emits its output (a short mapper sketch using the Context object follows below).
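A minimal mapper sketch illustrating the Context usage from question 57 (the class name LineLengthMapper, the counter group "demo" and the emitted key/value choice are illustrative placeholders):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Update an application-level counter through the Context object.
        context.getCounter("demo", "lines_seen").increment(1);
        // Report task status through the Context object.
        context.setStatus("processing offset " + key.get());
        // Emit output through the Context object: (first word of the line, line length).
        String line = value.toString().trim();
        String firstWord = line.isEmpty() ? "<empty>" : line.split("\\s+")[0];
        context.write(new Text(firstWord), new IntWritable(line.length()));
    }
}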
58. What are the core methods of a Reducer? The 3 core methods of a reducer are -
setup() - It is called once at the start of the task and is used to configure parameters such as the input data size and the distributed cache.
reduce() - It is called once per key with the associated list of values and is the heart of the reducer, where aggregation takes place.
cleanup() - It is called once at the end of the task to clean up resources such as temporary files.
59. Explain the partitioning, shuffle and sort phases. Partitioning Phase - The partitioner decides which reducer each intermediate key-value pair produced by the map tasks is sent to; by default, a hash of the key is used. Shuffle Phase - Once the first map tasks are completed, the nodes continue to perform several other map tasks and also exchange the intermediate outputs with the reducers as required. This process of moving the intermediate outputs of map tasks to the reducers is referred to as shuffling. Sort Phase - The framework sorts the intermediate keys before they are presented to the reducer, so that each reducer receives its keys in sorted order.
60. How do you write a custom partitioner for a Hadoop MapReduce job? Steps to write a custom partitioner for a Hadoop MapReduce job -
1) Create a new class that extends the Partitioner class.
2) Override its getPartition method to return the partition (reducer) number for a given key and value.
3) Wire the custom partitioner into the job, either with the setPartitionerClass method in the driver code or through the job configuration. A sketch combining these steps with the reducer methods from question 58 is given below.
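A minimal sketch tying questions 58 and 60 together, assuming the standard org.apache.hadoop.mapreduce API; the class names FirstLetterPartitioner and SumReducer, and the choice of partitioning by the key's first character, are illustrative placeholders:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

// Steps 1 and 2: extend Partitioner and override getPartition.
// Keys starting with the same letter are routed to the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        int first = k.isEmpty() ? 0 : Character.toLowerCase(k.charAt(0));
        return first % numPartitions;   // always in [0, numPartitions)
    }
}

// The three core reducer methods from question 58: setup(), reduce(), cleanup().
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private long keysSeen;

    @Override
    protected void setup(Context context) {
        // Called once before any reduce() call; initialise per-task state here.
        keysSeen = 0;
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key: aggregate all values that share the key.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        keysSeen++;
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        // Called once after the last reduce() call; release resources here.
        context.setStatus("finished after " + keysSeen + " keys");
    }
}

// Step 3, in the driver:
//   job.setReducerClass(SumReducer.class);
//   job.setPartitionerClass(FirstLetterPartitioner.class);

Note that partitioning on a coarse attribute such as the first character can skew load across reducers; in practice the default hash partitioner is kept unless keys must be grouped in a specific way.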