Install Hadoop on a Single Machine

The best way to learn any new technology, IMHO, is to install it on your local system. You can then explore it and experiment with it at your convenience. That is the route I always take.

I started by installing Hadoop on a single node, i.e. my own machine. It was a tricky task, with most of the tutorials making many assumptions. Now that I have completed the install, I can safely say that those assumptions are simple ones that anyone familiar with Linux should understand, but they can still trip up a beginner. So I decided to write an installation guide for dummies.

Here is the most comprehensive documentation of "How to install Hadoop on your local system". Please let me know if I have missed anything.


Prerequisites


1. Linux

The first and foremost requirement is a PC with Linux installed on it. I used a machine running Ubuntu 9.10. You can also work with Windows, as Hadoop is purely Java based and will run on any OS that can run a JVM (which in practice means pretty much every modern OS).

2. Sun Java6

Install Sun Java 6 on your Linux machine using:
$ sudo apt-get install sun-java6-bin sun-java6-jre sun-java6-jdk
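
You can confirm that the right JVM is active with:
$ java -version
It should report a 1.6.x Sun/HotSpot runtime. If another JVM (e.g. OpenJDK) shows up instead, you may need to select Sun Java with sudo update-java-alternatives -s java-6-sun.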

3. Create a new user "hadoop"

Create a new user hadoop. This is not strictly required, but it is recommended: a dedicated user separates the Hadoop installation from the other software applications and user accounts running on the same machine. Use the following commands:
$ sudo addgroup hadoop
$ sudo useradd -d /home/hadoop -m -g hadoop hadoop
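
The steps below that must run as the hadoop user can be done after switching to that account, for example with:
$ su - hadoop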

4. Configure SSH

Install the OpenSSH server on your system:
$ sudo apt-get install openssh-server

Then generate an SSH key for the hadoop user. As the hadoop user, do the
following:
$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
1a:38:cd:0c:92:f9:8b:33:f3:a9:8e:dd:41:68:04:dc hadoop@paritosh-desktop
The key's randomart image is:
(randomart image omitted)
$


Then enable SSH access to your local machine with this newly created key:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Test your SSH connection:
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 1e:be:bb:db:71:25:e2:d5:b0:a9:87:9a:2c:43:e3:ae.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux paritosh-desktop 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686
$

Now that the prerequisites are complete, let's go ahead with the Hadoop
installation.

Install Hadoop from Cloudera

1. Add repository

Create a new file /etc/apt/sources.list.d/cloudera.list with the following
contents, taking care to replace DISTRO with the name of your distribution (find
out by running lsb_release -c):
deb http://archive.cloudera.com/debian DISTRO-cdh3 contrib
deb-src http://archive.cloudera.com/debian DISTRO-cdh3 contrib
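
For example, on my Ubuntu 9.10 machine (codename karmic, as reported by lsb_release -c), the file reads:
deb http://archive.cloudera.com/debian karmic-cdh3 contrib
deb-src http://archive.cloudera.com/debian karmic-cdh3 contrib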

2. Add the repository key (optional)

Add the Cloudera Public GPG Key to your repository by executing the following
command:
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
OK
$

This allows you to verify that you are downloading genuine packages.

Note: You may need to install curl:
$ sudo apt-get install curl

3. Update APT package index.

Simply run:
$ sudo apt-get update

4. Find and install packages.

You may now find and install packages from the Cloudera repository using your favorite APT package manager (e.g. apt-get, aptitude, or dselect). For example:
$ apt-cache search hadoop
$ sudo apt-get install hadoop

Setting up a Hadoop Cluster

Here we will set up a Hadoop cluster on a single node.

1. Configuration

Copy the hadoop-0.20 directory to the hadoop home folder:
$ cd /usr/lib/
$ cp -Rf hadoop-0.20 /home/hadoop/
Also, add the following to your .bashrc and .profile:
# Hadoop home dir declaration
HADOOP_HOME=/home/hadoop/hadoop-0.20
export HADOOP_HOME
Change the following in the configuration files in the $HADOOP_HOME/conf directory:

1.1 hadoop-env.sh

Change the Java home, depending on where your Java is installed. Note that JAVA_HOME must point to the JDK installation directory, not to the java binary; with the Sun Java 6 packages on Ubuntu that is typically:
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

1.2 core-site.xml

Change your core-site.xml to reflect the following:
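
A minimal single-node (pseudo-distributed) core-site.xml usually just points the default filesystem at a local HDFS instance. Something like this should work (port 9000 is a common choice, not a requirement):

<?xml version="1.0"?>
<configuration>
  <!-- URI of the default filesystem: the local HDFS NameNode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>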

1.3 mapred-site.xml

Change your mapred-site.xml to reflect the following:
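
For a single-node setup, mapred-site.xml typically only needs to tell MapReduce where the JobTracker runs. A minimal sketch (port 9001 is again just a common choice):

<?xml version="1.0"?>
<configuration>
  <!-- Host and port of the JobTracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>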

1.4 hdfs-site.xml

Change your hdfs-site.xml to reflect the following:
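
Since there is only one DataNode, HDFS should not try to keep the default three replicas of each block. A minimal hdfs-site.xml for a single node looks like this:

<?xml version="1.0"?>
<configuration>
  <!-- Only one DataNode, so keep a single replica per block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>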

2. Format the NameNode

To format the Hadoop Distributed File System (which simply initializes the directory specified by the dfs.name.dir variable), run the command:
$ $HADOOP_HOME/bin/hadoop namenode -format

Running Hadoop

To start Hadoop, run start-all.sh from the
$HADOOP_HOME/bin/ directory.

$ ./start-all.sh

starting namenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-namenode-paritosh-desktop.out
localhost: starting datanode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-datanode-paritosh-desktop.out
localhost: starting secondarynamenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-secondarynamenode-paritosh-desktop.out
starting jobtracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-jobtracker-paritosh-desktop.out
localhost: starting tasktracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-tasktracker-paritosh-desktop.out
$


To check whether all the processes are running fine, run the following:

$ jps
17736 TaskTracker
17602 JobTracker
17235 NameNode
17533 SecondaryNameNode
17381 DataNode
17804 Jps
$
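
If all five Hadoop daemons (plus Jps itself) show up, the single-node cluster is running. As an extra sanity check, which is not strictly required, you can list the root of HDFS and run one of the example jobs that ship with Hadoop (the exact name of the examples jar varies by version, hence the wildcard):

$ $HADOOP_HOME/bin/hadoop fs -ls /
$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-*examples*.jar pi 2 1000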
