Attending a Hadoop interview is difficult. You have to know what sort of Hadoop questions will be asked, how to answer them, and how to make a good impression. Though it is challenging to nail a Hadoop interview, we have answered some common and interesting Hadoop interview questions just for you. We have deliberated and researched to find the best answers, and after reading this post you should have some of your biggest doubts resolved. So keep reading to find out!
Common Interview Questions and Answers on Hadoop:
The following are a few important Hadoop developer interview questions:
1. How does the Hadoop framework work?
The Hadoop framework works on two core components. The first is HDFS, the Hadoop Distributed File System: a Java-based, scalable file system that provides reliable storage for large datasets. Data in HDFS is stored in blocks, and it operates on a master-slave architecture. The second is Hadoop MapReduce, a Java-based programming model that provides immense scalability across Hadoop clusters. MapReduce divides the workload into several tasks that can run in parallel.
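The map-and-reduce flow described above can be sketched in plain Python. This is a toy word-count simulation of the MapReduce model, not the real Hadoop API (which is Java and runs on a cluster); the function names are our own.

```python
from collections import defaultdict

def map_phase(lines):
    """Map step: emit (word, 1) pairs, like a Hadoop Mapper would."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + reduce step: group values by key, then sum each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["big data big cluster", "big data"]))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

On a real cluster the map tasks run in parallel on the nodes that hold the data blocks, and the framework performs the shuffle between the two phases.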
2. Give any three differences between NAS and HDFS.
First, NAS runs on a single machine, so there is no data redundancy as such, whereas HDFS runs on a cluster of different machines and provides redundancy through its replication protocol. Second, NAS stores all data on dedicated hardware, whereas HDFS spreads its data blocks across the drives of the machines in the cluster. Third, in NAS the data is stored independently of the computation, so Hadoop MapReduce cannot be used for processing; HDFS works with MapReduce because the computation is moved to the data.
3. What do you mean by column families? What happens if the size of a column family is altered?
A column family is the logical division of data, represented through a key. Column families form the basic unit of physical storage, on top of which features such as compression can be applied. If the block size of a column family is reduced on an already populated database, the old data stays in the old block size while fresh data written afterwards uses the new block size. During the next compaction, the old data is rewritten with the new block size, so all the existing data is read correctly.
4. What is the difference between HBase and Hive?
Hive and HBase are both technologies based on Hadoop, but they serve different purposes. Hive is a data warehouse infrastructure that runs on top of Hadoop, whereas HBase is a NoSQL key-value store that runs on top of HDFS. Hive lets users who know SQL run MapReduce jobs, while HBase supports four primary operations: put, get, scan and delete. HBase is good for real-time querying of data, whereas Hive is good for analytical queries over data collected over a period of time.
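The four HBase operations mentioned above can be illustrated with a tiny Python stand-in. This is a hypothetical toy class, not the real HBase client API; it only mimics the shape of put, get, scan and delete over sorted row keys.

```python
class TinyKeyValueStore:
    """Toy stand-in for HBase's four core operations (hypothetical class)."""
    def __init__(self):
        self.rows = {}

    def put(self, row_key, value):
        self.rows[row_key] = value

    def get(self, row_key):
        return self.rows.get(row_key)

    def scan(self, start, stop):
        # HBase scans a contiguous, sorted range of row keys
        return {k: v for k, v in sorted(self.rows.items()) if start <= k < stop}

    def delete(self, row_key):
        self.rows.pop(row_key, None)

store = TinyKeyValueStore()
store.put("row1", "a")
store.put("row2", "b")
print(store.get("row1"))           # a
print(store.scan("row1", "row3"))  # {'row1': 'a', 'row2': 'b'}
store.delete("row1")
print(store.get("row1"))           # None
```

Note that Hive has no such record-at-a-time interface: a Hive query is compiled into batch jobs over the whole dataset, which is why it suits analytics rather than point lookups.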
5. What do you mean by the term speculative execution in Hadoop?
Jobs running on a Hadoop cluster are divided into a number of tasks. On a large cluster, some of these tasks run very slowly for several reasons: hardware degradation and software misconfiguration are two of them. When Hadoop notices that a task has been running for a while without showing as much progress, on average, as the other tasks of the same job, it launches a duplicate replica of that task. Running such duplicate tasks is called speculative execution. Whichever copy finishes first is kept and the other is killed: if the original task completes before the speculative one, the speculative task is killed, but if the speculative task finishes first, the original is killed.
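The straggler-detection idea can be sketched as a simple progress comparison. This is a deliberately simplified heuristic of our own; Hadoop's actual speculation logic considers more factors (run time, progress rate, available slots).

```python
def needs_speculation(task_progress, all_progress, threshold=0.2):
    """Flag a straggler: a task lagging well behind the average progress
    of its sibling tasks (simplified, hypothetical threshold rule)."""
    average = sum(all_progress) / len(all_progress)
    return task_progress < average - threshold

# Progress (0.0 - 1.0) of four tasks of the same job
progress = [0.9, 0.85, 0.95, 0.3]
stragglers = [i for i, p in enumerate(progress) if needs_speculation(p, progress)]
print(stragglers)  # [3] -> task 3 gets a speculative duplicate
```

The framework would then schedule a duplicate of task 3 on another node and keep whichever copy finishes first.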
6. What are the benefits of using counters in Hadoop?
Counters are very useful for gathering statistics about a job. Imagine you have a hundred-node cluster and a job consisting of a hundred mappers running across those nodes, and you would like to know every time a mapper sees an invalid record in the map phase. You could add a log message in your mapper so that every time it comes across an invalid line, it makes an entry in the log; but consolidating the log messages from a hundred different nodes would take too much time. Instead, you can use a counter and increment its value the minute you see an invalid record. The good thing about counters is that they give you one aggregate value for the whole job instead of a hundred separate outputs.
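Here is a small Python simulation of that pattern. The counter name and the validity rule are made up for illustration; in real Hadoop code the mapper would call the framework's counter-increment API and the JobTracker would aggregate the values across nodes.

```python
from collections import Counter

def run_mapper(records, counters):
    """Each mapper increments a named counter instead of writing a log line;
    the framework aggregates counters from all mappers into one total."""
    for record in records:
        if "," not in record:                 # hypothetical validity rule
            counters["INVALID_RECORDS"] += 1
            continue
        yield record.split(",", 1)

counters = Counter()
# Simulate two mappers working on different input splits
list(run_mapper(["a,1", "bad", "b,2"], counters))
list(run_mapper(["oops", "c,3"], counters))
print(counters["INVALID_RECORDS"])  # 2 -> one aggregate value for the job
```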
7. How can you write a custom partitioner?
To write a custom partitioner for a Hadoop job, you have to do the following. First, create a new class that extends the Partitioner class. Second, override its getPartition method. Third, add the custom partitioner to the job: either programmatically, from the MapReduce wrapper, using the job's set-partitioner method, or through a config file if the wrapper reads its configuration from Oozie.
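The core of any partitioner is the getPartition logic: map a key to one of N reducers, deterministically. Here is a Python sketch of such a routing rule (the first-letter rule is a made-up example; the real method is written in Java on the Partitioner class).

```python
def get_partition(key, num_reduce_tasks):
    """Mimics Partitioner.getPartition(): all keys starting with the same
    letter go to the same reducer (hypothetical routing rule)."""
    return ord(key[0].lower()) % num_reduce_tasks

# Keys with the same first letter always land on the same partition
print(get_partition("apple", 4), get_partition("avocado", 4))  # 1 1
print(get_partition("banana", 4))                              # 2
```

The only hard requirement is that the returned value is stable for a given key and falls in the range 0 to num_reduce_tasks - 1, so every record with the same key reaches the same reducer.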
8. What are some of the jobs that the JobTracker does?
The JobTracker takes care of several tasks. First, it accepts jobs from clients. Second, it talks to the NameNode to determine the location of the data. Third, it locates TaskTracker nodes with available slots at or near the data. Fourth, it submits the work to the chosen TaskTracker and monitors the progress of the job through heartbeat signals from the TaskTracker.
9. How will you describe a sequence file?
Sequence files are used to store binary key-value pairs. They support splitting even when the file is compressed, which is usually not possible with a regular compressed file. You can choose record-level compression, where the value of each key-value pair is compressed on its own, or block-level compression, where several records are put together and compressed as one unit.
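Why block-level compression tends to be denser can be demonstrated with plain zlib in Python. This only illustrates the compression trade-off; it is not the SequenceFile format itself.

```python
import zlib

records = [("key%d" % i, "value-%d" % i) for i in range(100)]

# Record-level: compress each value independently
record_level = [(k, zlib.compress(v.encode())) for k, v in records]
record_bytes = sum(len(v) for _, v in record_level)

# Block-level: compress a batch of records together, so the compressor
# can exploit redundancy across records
block = "\n".join(v for _, v in records).encode()
block_level = zlib.compress(block)

print(record_bytes > len(block_level))  # True: the block form is smaller
```

The flip side is that reading a single record from a compressed block requires decompressing the whole block, which is the usual trade-off between the two modes.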
10. Tell us about the modes of execution in Apache Pig.
Apache Pig runs in two modes: local mode and Hadoop MapReduce mode. Local mode needs only a single machine, where all the files are installed and executed on the local host, whereas MapReduce mode requires access to a Hadoop cluster.
11. How will you explain MapReduce and its need while programming with Apache Pig?
Programs in Apache Pig are written in a query language known as Pig Latin, which has some similarity to the SQL query language. To get a query executed, an execution engine is needed: the Pig engine converts the queries into MapReduce jobs, and thus MapReduce acts as the execution engine required to run the programs.
12. How will you explain COGROUP in Pig?
COGROUP is an operator in Pig that works on groups of tuples. It can be applied to statements that contain two or more relations, up to a maximum of 127 relations at a time. When you use the operator on two tables, Pig first groups both tables and then joins the two tables on the grouped columns.
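The key point about COGROUP, as opposed to JOIN, is that it keeps the tuples from each relation in separate bags per key. A Python sketch of that behavior (the relations and field names are made up for illustration):

```python
from collections import defaultdict

def cogroup(relation_a, relation_b):
    """Sketch of Pig's COGROUP on two relations: group each relation by key,
    then pair the groups while keeping the two bags separate."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in relation_a:
        grouped[key][0].append(value)
    for key, value in relation_b:
        grouped[key][1].append(value)
    return dict(grouped)

owners = [("alice", "dog"), ("bob", "cat"), ("alice", "fish")]
ages = [("alice", 30), ("carol", 25)]
print(cogroup(owners, ages)["alice"])  # (['dog', 'fish'], [30])
```

A key present in only one relation still appears in the result, paired with an empty bag for the other relation, which is another difference from an inner JOIN.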
13. Talk about SMB join in Hive.
In an SMB (Sort Merge Bucket) join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table, and then a merge-sort join is performed. SMB join is used mainly when there is no limit on partitions, and it can be used when the tables are large. In an SMB join the columns are bucketed and sorted using the join columns, and all the tables must have the same number of buckets.
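The merge step that each mapper performs can be sketched in Python. Because both buckets are already sorted on the join key, one forward pass over each side suffices; this simplified version assumes unique keys on the left side.

```python
def sort_merge_join(left, right):
    """Merge step of an SMB join: both inputs are already sorted on the
    join key, as bucketed-and-sorted Hive tables would be."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lkey, lval = left[i]
        rkey, rval = right[j]
        if lkey == rkey:
            result.append((lkey, lval, rval))
            j += 1            # look for further right-side matches
        elif lkey < rkey:
            i += 1            # advance whichever side is behind
        else:
            j += 1
    return result

left = [(1, "a"), (2, "b"), (3, "c")]   # sorted bucket of table 1
right = [(2, "x"), (3, "y")]            # corresponding bucket of table 2
print(sort_merge_join(left, right))  # [(2, 'b', 'x'), (3, 'c', 'y')]
```

Since no hash table has to be built and no shuffle is needed, this join is cheap even when both tables are very large, which is exactly why the bucketing and sorting requirements are worth paying up front.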
At the end of this post, we certainly hope these Hadoop interview questions and answers have helped you greatly. You can get the ball rolling and answer some of the questions that have not been covered yet. If you have feedback or questions about the post, let us know in the comment box below; we would love to hear from you. And remember to prepare these Hadoop interview questions thoroughly before you attend a Hadoop interview. On that note, good luck, and here's hoping you get the job you have always wanted.