Saturday, October 25, 2014

Getting started with Apache Hadoop

I hope this blog helps you get started with Apache Hadoop.
What is covered here:
  1. Installing and configuring Apache Hadoop - single node cluster on Ubuntu Linux
  2. Running the default MapReduce example - wordcount with some real load
  3. Other alternatives - Cloudera QuickStart VM
Please note this blog is not reference material for learning Apache Hadoop - it only talks about how to get started with it. I have tried to keep the other details out of this blog so that anybody who is googling for how to set up a single node Hadoop cluster on Ubuntu and wants to run the OOTB sample example can get going quickly.

Installing and configuring Apache Hadoop - single node cluster on Ubuntu Linux

Please refer to these blogs (in the order I mention them):

The first two are very good sources for installing and configuring a single node cluster on Ubuntu Linux.
The only customization or detour I have made from the above: I directly used the root user rather than hduser, as I was getting some permission errors on Ubuntu Linux and didn't want to spend too much time resolving them.
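If you would rather stick with the dedicated hduser account from those tutorials, the permission errors can usually be fixed by giving hduser ownership of the Hadoop install directory and the Hadoop temp directory instead (the paths below assume the layout used in those tutorials - adjust them to your setup):
sudo chown -R hduser:hadoop /usr/local/hadoop
sudo chown -R hduser:hadoop /app/hadoop/tmp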
Moreover, I downloaded Apache Hadoop 1.2.1 and used the Oracle/Sun JDK 1.7.x.
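For reference, this is roughly the environment setup the commands below assume - a minimal sketch only; the JDK path is an assumption, so point JAVA_HOME to wherever your Oracle JDK 1.7 actually lives, and set the same JAVA_HOME in conf/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin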

Running the default MapReduce example - wordcount with some real load

Apache Hadoop comes with some excellent out-of-the-box examples that you can play with. I have chosen "wordcount" for this blog. Wordcount simply counts the number of times each word occurs in the input.
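For example, for an input line like "to be or not to be", wordcount would emit: be 2, not 1, or 1, to 2.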
In his blog http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ - Michael used three eBooks as the input. I have gone a bit further in terms of the overall load: I simply downloaded around 2,500+ eBooks in plain-text format and ran wordcount against them.
I have also chosen the same source to get the eBooks, via this command:
wget -H -w 2 -m "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en" --referer="http://www.google.com" --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" --header="Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5" --header="Accept-Language: en-us,en;q=0.5" --header="Accept-Encoding: gzip,deflate" --header="Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7" --header="Keep-Alive: 300"

You can use a variant of it (wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=html&langs[]=en") in case the above one is not working.
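The harvest links typically point at zip archives, so once the mirror finishes you need to unpack the plain-text files into one local directory. Here is a rough sketch assuming the mirror landed under the current directory and using /tmp/ebooks, which is the directory the commands below expect (adjust both paths to your machine):
mkdir -p /tmp/ebooks
find . -name "*.zip" -exec unzip -o -d /tmp/ebooks {} \;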

Now you can run the following commands (from the /usr/local/hadoop/ directory):
1) bin/hadoop dfs -copyFromLocal /tmp/ebooks /user/hduser/ebooks
The above one will bring the plain-text eBooks from the local filesystem into HDFS for MapReduce consumption/processing.
2) Use the following commands to verify whether the above command worked fine or not:
bin/hadoop dfs -ls /user/hduser
bin/hadoop dfs -ls /user/hduser/ebooks
3) bin/hadoop jar hadoop-examples-1.2.1.jar wordcount /user/hduser/ebooks /user/hduser/ebooks-out
This will execute the OOTB wordcount MapReduce program against the 2,500+ plain-text eBooks.
4) Once the MapReduce job has completed successfully, verify the output by issuing the following commands:
bin/hadoop dfs -ls /user/hduser/ebooks-out
bin/hadoop dfs -cat /user/hduser/ebooks-out/part-r-00000
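Optionally, if you want to eyeball the most frequent words, you can merge the output down to the local filesystem and sort it (the local path here is just an example):
bin/hadoop dfs -getmerge /user/hduser/ebooks-out /tmp/ebooks-wordcount.txt
sort -k2 -nr /tmp/ebooks-wordcount.txt | head -20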

5) If you want to run it again, please use a command like this:
bin/hadoop jar hadoop-examples-1.2.1.jar wordcount /user/hduser/ebooks /user/hduser/ebooks-out1
Use a different output directory; Hadoop won't run the job if the output directory already exists.
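Alternatively, you can delete the previous output directory and reuse the same name:
bin/hadoop dfs -rmr /user/hduser/ebooks-out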

6) The best way to learn Hadoop is to view the logs and use the web interfaces listed below while the MapReduce program is executing (that's the number one reason I took such a big sample input).
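On a default single node Hadoop 1.x setup, these web interfaces are typically:
  • NameNode - http://localhost:50070/
  • JobTracker - http://localhost:50030/
  • TaskTracker - http://localhost:50060/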
Here are some sample screen shots:





Other alternatives - Cloudera QuickStart VM

Cloudera has provided QuickStart VMs for getting started with Apache Hadoop. You need to install a VM player (such as VMware Player) and then set up the image that fits your needs.
Pros:
  • No setup needed - you can get going within 30 minutes
  • It has everything - no need to get individual packages
  • Somewhat more stable
Cons:
  • I had issues with not being able to increase my VM RAM from 1 GB to anything higher, so I moved to Ubuntu Linux where I have the VM RAM equal to 4 GB.
  • Once the VM player is running, your overall system is slow, and the VM itself is also slower to access.
  • You can't copy/paste commands that easily from your host system to the VM and vice versa.
Here is a screenshot of how the VM looks after being installed on Ubuntu Linux.



Hortonworks has also provided a similar VM/sandbox environment to play with.

Enjoy learning Apache Hadoop.....

I followed this one for MRv2:
http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php