81. What is the standard location or path for Hadoop Sqoop scripts? /usr/bin/Hadoop Sqoop

82. How can you check all the tables present in a single database using Sqoop? The list of all tables in a database can be obtained with the sqoop list-tables tool, pointed at the database through its JDBC connection string (see the example below).

83. How are large objects handled in Sqoop? Sqoop provides the capability to store large-sized data in a single field based on the type of data. Sqoop supports the ability to store CLOB (Character Large Object) and BLOB (Binary Large Object) data: large objects below a configurable size threshold are materialized inline with the rest of the record, while larger ones are written to separate files in the _lobs subdirectory of the import target directory.

84. Can free-form SQL queries be used with the Sqoop import command? If yes, then how can they be used? Sqoop allows us to use free-form SQL queries with the import command. The import command should be used with the -e or --query option to execute a free-form SQL query. When using --query, the query must contain the $CONDITIONS placeholder and the --target-dir value must be specified (see the example below).

85. Differentiate between Sqoop and DistCp. The DistCp utility is used to transfer data between Hadoop clusters, whereas Sqoop is used to transfer data only between Hadoop and an RDBMS.

86. What are the limitations of importing RDBMS tables into HCatalog directly? RDBMS tables can be imported into HCatalog directly by using the --hcatalog-database option together with --hcatalog-table, but the limitation is that several arguments, such as --as-avrodatafile, --direct, --as-sequencefile, --target-dir and --export-dir, are not supported.
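A minimal sketch of the two commands referred to in questions 82 and 84 above; the JDBC URL, database name, credentials, query and HDFS path are placeholder values, not taken from the original answers.

  # List all tables in a single database (connection details are placeholders)
  sqoop list-tables \
    --connect jdbc:mysql://db.example.com/employees \
    --username dbuser -P

  # Free-form SQL import: the query must contain $CONDITIONS, and --target-dir is mandatory
  sqoop import \
    --connect jdbc:mysql://db.example.com/employees \
    --username dbuser -P \
    --query 'SELECT e.id, e.name FROM emp e WHERE $CONDITIONS' \
    --split-by e.id \
    --target-dir /user/hadoop/emp_import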
87. Explain the core components of Flume. The core components of Flume are the Event (the unit of data transported), the Source, the Channel, the Sink and the Agent (the JVM process that hosts sources, channels and sinks); a Client generates events and delivers them to a source.

88. Does Flume provide 100% reliability to the data flow? Yes, Apache Flume provides end-to-end reliability because of its transactional approach to data flow.

89. How can Flume be used with HBase? Apache Flume can be used with HBase using one of its two HBase sinks, HBaseSink and AsyncHBaseSink.

Working of HBaseSink: a Flume event is converted into HBase Increments or Puts. The serializer implements HBaseEventSerializer and is instantiated when the sink starts. For every event, the sink calls the serializer's initialize method; the serializer then translates the Flume event into the HBase Increments and Puts to be sent to the HBase cluster.

Working of AsyncHBaseSink: the serializer implements AsyncHBaseEventSerializer, and its initialize method is called only once, when the sink starts. For each event, the sink invokes the setEvent method and then calls the getIncrements and getActions methods, similar to HBaseSink. When the sink stops, the serializer's cleanUp method is called.

90. Explain the different channel types in Flume. Which channel type is faster? The three built-in channel types available in Flume are the MEMORY channel (events are read from the source into memory and passed to the sink), the JDBC channel (events are stored in an embedded Derby database) and the FILE channel (events are written to files on the local file system). The MEMORY channel is the fastest, but it carries a risk of data loss, whereas the FILE channel is durable.

91. Which is the reliable channel in Flume to ensure that there is no data loss? The FILE channel is the most reliable of the three channels (JDBC, FILE and MEMORY).

92. Explain the replicating and multiplexing selectors in Flume. Channel selectors are used to handle multiple channels. Based on the Flume header value, an event can be written to a single channel or to multiple channels. If no channel selector is specified for the source, the replicating selector is used by default; with the replicating selector, the same event is written to all of the channels in the source's channel list. The multiplexing channel selector is used when the application has to send different events to different channels.

93. How can a multi-hop agent be set up in Flume? The Avro RPC bridge mechanism is used to set up a multi-hop agent in Apache Flume.

94. Does Apache Flume provide support for third-party plug-ins? Yes. Apache Flume has a plug-in based architecture, so third-party plug-ins can be used to load data from external sources and transfer it to external destinations.

95. Is it possible to leverage real-time analysis on the big data collected by Flume directly? If yes, then explain how. Data from Flume can be extracted, transformed and loaded in real time into Apache Solr servers using MorphlineSolrSink.

96. Differentiate between FileSink and FileRollSink. The major difference is that the HDFS File Sink writes events into the Hadoop Distributed File System (HDFS), whereas the File Roll Sink stores events on the local file system.

97. Can Apache Kafka be used without ZooKeeper? It is not possible to use Apache Kafka without ZooKeeper, because if ZooKeeper is down Kafka cannot serve client requests.

98. What is the role of ZooKeeper in the HBase architecture? In the HBase architecture, ZooKeeper is the monitoring server that provides services such as tracking server failures and network partitions, maintaining configuration information, establishing communication between clients and region servers, and using ephemeral nodes to identify the servers that are available in the cluster.

99. Explain the role of ZooKeeper in Kafka. Apache Kafka uses ZooKeeper to be a highly available, distributed and scalable system. Kafka uses ZooKeeper to store various configurations and to use them across the cluster in a distributed manner. To achieve this, configurations are distributed and replicated across the leader and follower nodes in the ZooKeeper ensemble. We cannot connect to Kafka directly by bypassing ZooKeeper, because if ZooKeeper is down it will not be able to serve client requests.
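As a quick illustration of the ZooKeeper dependency described in questions 97 and 99, the commands below start a single-node, ZooKeeper-based Kafka broker using the scripts and default configuration files shipped with the Kafka distribution; the paths and the localhost:2181 address are distribution defaults and may differ in a real deployment.

  # Start a local ZooKeeper instance first; Kafka cannot serve clients without it
  bin/zookeeper-server-start.sh config/zookeeper.properties

  # Start the Kafka broker; config/server.properties points at ZooKeeper via
  # zookeeper.connect=localhost:2181
  bin/kafka-server-start.sh config/server.properties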
100. Explain how ZooKeeper works. ZooKeeper is referred to as the King of Coordination; distributed applications use ZooKeeper to store important configuration information and to coordinate updates to it. ZooKeeper works by coordinating the processes of distributed applications. It is a robust, replicated synchronization service with eventual consistency. A set of ZooKeeper nodes is known as an ensemble, and the persisted data is distributed across multiple nodes.
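A minimal zkCli.sh session sketching how a distributed application might store and read a piece of configuration as a znode; the znode path and the data value are made-up examples, not part of the original answer.

  # Connect to a ZooKeeper server
  bin/zkCli.sh -server localhost:2181

  # Store, read and update a configuration value under a znode
  create /app/config "feature_x=on"
  get /app/config
  set /app/config "feature_x=off"

Clients that have set a watch on /app/config are notified when its data changes, which is how ZooKeeper propagates configuration updates to the processes it coordinates.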