Saturday, October 25, 2014

Getting started with Apache Hadoop

I hope this blog helps you get started with Apache Hadoop.
What is covered here:
  1. Installing and configuring Apache Hadoop - single node cluster on Ubuntu Linux
  2. Running the default MapReduce example - wordcount with some real load
  3. Other alternatives - Cloudera QuickStart VM
Please note this blog is not reference material for learning Apache Hadoop - it only talks about how to get started with it. I have tried to keep the other details out of this blog so that anybody who is googling for how to set up a single node Hadoop cluster on Ubuntu and wants to run the OOTB sample example can get going quickly.

Installing and configuring Apache Hadoop - single node cluster on Ubuntu Linux

Please refer to these blogs (in the order I mention them):

The first two are very good sources for installing and configuring a single node cluster on Ubuntu Linux.
The only customization or detour I have made from the above: I directly used the root user rather than hduser, as I was getting some permission errors on Ubuntu Linux and didn't want to spend too much time resolving them.
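If you would rather stick with the dedicated hduser account from those tutorials, the permission errors can usually be fixed by giving hduser ownership of the Hadoop install directory and the Hadoop temp directory instead (the paths below assume the layout used in those tutorials - adjust them to your setup):
sudo chown -R hduser:hadoop /usr/local/hadoop
sudo chown -R hduser:hadoop /app/hadoop/tmp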
Moreover, I downloaded Apache Hadoop 1.2.1 and used the Oracle/Sun JDK 1.7.x.
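For reference, this is roughly the environment setup the commands below assume - a minimal sketch only; the JDK path is an assumption, so point JAVA_HOME to wherever your Oracle JDK 1.7 actually lives, and set the same JAVA_HOME in conf/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin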

Running the default MapReduce example - wordcount with some real load

Apache Hadoop comes with some excellent out-of-the-box examples that you can play with. I have chosen "wordcount" for this blog. Wordcount simply counts the number of times each word occurs in the input.
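For example, for an input line like "to be or not to be", wordcount would emit: be 2, not 1, or 1, to 2.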
In his blog http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ - Michael used three eBooks as the input. I have gone a bit further in terms of the overall load: I simply downloaded around 2,500+ eBooks in plain-text format and ran wordcount against them.
I have also chosen the same source to get the eBooks, via this command:
wget -H -w 2 -m "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en" --referer="http://www.google.com" --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" --header="Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" --header="Accept-Language: en-us,en;q=0.5" --header="Accept-Encoding: gzip,deflate" --header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" --header="Keep-Alive: 300"

You can use a variant of it (wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=html&langs[]=en") in case the above one is not working.
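The harvest links typically point at zip archives, so once the mirror finishes you need to unpack the plain-text files into one local directory. Here is a rough sketch assuming the mirror landed under the current directory and using /tmp/ebooks, which is the directory the commands below expect (adjust both paths to your machine):
mkdir -p /tmp/ebooks
find . -name "*.zip" -exec unzip -o -d /tmp/ebooks {} \;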

Now you can run the following commands (from the /usr/local/hadoop/ directory):
1) bin/hadoop dfs -copyFromLocal /tmp/ebooks /user/hduser/ebooks
The above one will bring the plain-text eBooks from the local filesystem into HDFS for MapReduce consumption/processing.
2) Use the following commands to verify whether the above command worked fine or not:
bin/hadoop dfs -ls /user/hduser
bin/hadoop dfs -ls /user/hduser/ebooks
3) bin/hadoop jar hadoop-examples-1.2.1.jar wordcount /user/hduser/ebooks /user/hduser/ebooks-out
This will execute the OOTB wordcount MapReduce program against the 2,500+ plain-text eBooks.
4) Once the MapReduce job has completed successfully, verify the output by issuing the following commands:
bin/hadoop dfs -ls /user/hduser/ebooks-out
bin/hadoop dfs -cat /user/hduser/ebooks-out/part-r-00000
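Optionally, if you want to eyeball the most frequent words, you can merge the output down to the local filesystem and sort it (the local path here is just an example):
bin/hadoop dfs -getmerge /user/hduser/ebooks-out /tmp/ebooks-wordcount.txt
sort -k2 -nr /tmp/ebooks-wordcount.txt | head -20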

5) If you want to run it again, please use a command like this:
bin/hadoop jar hadoop-examples-1.2.1.jar wordcount /user/hduser/ebooks /user/hduser/ebooks-out1
Use a different output directory; Hadoop won't run the job if the output directory already exists.
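Alternatively, you can delete the previous output directory and reuse the same name:
bin/hadoop dfs -rmr /user/hduser/ebooks-out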

6) The best way to learn Hadoop is to view the logs and use the web interfaces listed below while the MapReduce program is executing (that's the number one reason I took such a big sample input).
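On a default single node Hadoop 1.x setup, these web interfaces are typically:
  • NameNode - http://localhost:50070/
  • JobTracker - http://localhost:50030/
  • TaskTracker - http://localhost:50060/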
Here are some sample screen shots:





Other alternatives - Cloudera QuickStart VM

Cloudera has provided QuickStart VMs for getting started with Apache Hadoop. You need to install a VM player (such as VMware Player) and then set up the image that fits your needs.
Pros:
  • No setup needed - you can get going within 30 minutes
  • It has everything - no need to get individual packages
  • Somewhat more stable
Cons:
  • I had issues with not being able to increase my VM RAM from 1 GB to anything higher, so I moved to Ubuntu Linux where I have the VM RAM equal to 4 GB.
  • Once the VM player is running, your overall system is slow, and the VM itself is also slower to access.
  • You can't copy/paste commands that easily from your host system to the VM and vice versa.
Here is a screenshot of how the VM looks after being installed on Ubuntu Linux.



Hortonworks has also provided a similar VM/sandbox environment to play with.

Enjoy learning Apache Hadoop.....

I followed this one for MRv2:
http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php