java hadoop Interview Questions

61.What is the relationship between Job and Task in Hadoop?

A single job can be broken down into one or many tasks in Hadoop.

62.Is it important for Hadoop MapReduce jobs to be written in Java?

It is not necessary to write Hadoop MapReduce jobs in java but users can write MapReduce jobs in any desired programming language like Ruby, Perl, Python, R, Awk, etc. through the Hadoop Streaming API.

63.What is the process of changing the split size if there is limited storage space on Commodity Hardware?

If there is limited storage space on commodity hardware, the split size can be changed by implementing the “Custom Splitter”. The call to Custom Splitter can be made from the main method.

64.What are the primary phases of a Reducer?

The 3 primary phases of a reducer are –
1)Shuffle
2)Sort
3)Reduce

65.What is a TaskInstance?

The actual hadoop MapReduce jobs that run on each slave node are referred to as Task instances. Every task instance has its own JVM process. For every new task instance, a JVM process is spawned by default for a task.

66.Can reducers communicate with each other?

Reducers always run in isolation and they can never communicate with each other as per the Hadoop MapReduce programming paradigm.
We have further categorized Hadoop MapReduce Interview Questions for Freshers and Experienced-

67.When should you use HBase and what are the key components of HBase?

HBase should be used when the big data application has –
1)A variable schema
2)When data is stored in the form of collections
3)If the application demands key based access to data while retrieving.
Key components of HBase are –
Region- This component contains memory data store and Hfile.
Region Server-This monitors the Region.
HBase Master-It is responsible for monitoring the region server.
Zookeeper- It takes care of the coordination between the HBase Master component and the client.
Catalog Tables-The two important catalog tables are ROOT and META.ROOT table tracks where the META table is and META table stores all the regions in the system.

68.What are the different operational commands in HBase at record level and table level?

Record Level Operational Commands in HBase are –put, get, increment, scan and delete.
Table Level Operational Commands in HBase are-describe, list, drop, disable and scan.

69.What is Row Key?

Every row in an HBase table has a unique identifier known as RowKey. It is used for grouping cells logically and it ensures that all cells that have the same RowKeys are co-located on the same server. RowKey is internally regarded as a byte array.

70.Explain the difference between RDBMS data model and HBase data model?

RDBMS is a schema based database whereas HBase is schema less data model.
RDBMS does not have support for in-built partitioning whereas in HBase there is automated partitioning.
RDBMS stores normalized data whereas HBase stores de-normalized data

71.Explain about the different catalog tables in HBase?

The two important catalog tables in HBase, are ROOT and META. ROOT table tracks where the META table is and META table stores all the regions in the system.

72.What is column families? What happens if you alter the block size of ColumnFamily on an already populated database?

The logical deviation of data is represented through a key known as column Family. Column families consist of the basic unit of physical storage on which compression features can be applied. In an already populated database, when the block size of column family is altered, the old data will remain within the old block size whereas the new data that comes in will take the new block size. When compaction takes place, the old data will take the new block size so that the existing data is read correctly.

73.Explain the difference between HBase and Hive?

HBase and Hive both are completely different hadoop based technologies-Hive is a data warehouse infrastructure on top of Hadoop whereas HBase is a NoSQL key value store that runs on top of Hadoop. Hive helps SQL savvy people to run MapReduce jobs whereas HBase supports 4 primary operations-put, get, scan and delete. HBase is ideal for real time querying of big data where Hive is an ideal choice for analytical querying of data collected over period of time.

74.Explain the process of row deletion in HBase?

On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells but rather the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.

75. What are the different types of tombstone markers in HBase for deletion?

There are 3 different types of tombstone markers in HBase for deletion-
1)Family Delete Marker- This markers marks all columns for a column family.
2)Version Delete Marker-This marker marks a single version of a column.
3)Column Delete Marker-This markers marks all the versions of a column.

76.Explain about HLog and WAL in HBase?

All edits in the HStore are stored in the HLog. Every region server has one HLog. HLog contains entries for edits of all regions performed by a particular Region Server.WAL abbreviates to Write Ahead Log (WAL) in which all the HLog edits are written immediately.WAL edits remain in the memory till the flush period in case of deferred log flush.
We have further categorized Hadoop HBase Interview Questions for Freshers and Experienced-

77.Explain about some important Sqoop commands other than import and export.?

Create Job (--create) Here we are creating a job with the name my job, which can import the table data from RDBMS table to HDFS. The following command is used to create a job that is importing data from the employee table in the db database to the HDFS file.
$ Sqoop job --create myjob \
--import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1
Verify Job (--list) ‘--list’ argument is used to verify the saved jobs. The following command is used to verify the list of saved Sqoop jobs.
$ Sqoop job --list
Inspect Job (--show)
‘--show’ argument is used to inspect or verify particular jobs and their details. The following command and sample output is used to verify a job called myjob.
$ Sqoop job --show myjob
Execute Job (--exec)
‘--exec’ option is used to execute a saved job. The following command is used to execute a saved job called myjob.
$ Sqoop job --exec myjob

78.How Sqoop can be used in a Java program?

The Sqoop jar in classpath should be included in the java code. After this the method Sqoop.runTool () method must be invoked. The necessary parameters should be created to Sqoop programmatically just like for command line

79.What is the process to perform an incremental data load in Sqoop?

The process to perform incremental data load in Sqoop is to synchronize the modified or updated data (often referred as delta data) from RDBMS to Hadoop. The delta data can be facilitated through the incremental load command in Sqoop.
Incremental load can be performed by using Sqoop import command or by loading the data into hive without overwriting it. The different attributes that need to be specified during incremental load in Sqoop are-
1)Mode (incremental) –The mode defines how Sqoop will determine what the new rows are. The mode can have value as Append or Last Modified.
2)Col (Check-column) –This attribute specifies the column that should be examined to find out the rows to be imported.
3)Value (last-value) –This denotes the maximum value of the check column from the previous import operation.

80.Is it possible to do an incremental import using Sqoop?

Yes, Sqoop supports two types of incremental imports-
1)Append
2)Last Modified
To insert only rows Append should be used in import command and for inserting the rows and also updating Last-Modified should be used in the import command.

« Previous | 0 | 1 | 2 | 3 | 4 | Next »

The largest Interview Solution Library on the web