Install Hadoop on a Single Machine

The best way to learn any new technology, IMHO, is to install it on your local system. You can then explore it and experiment with it at your convenience. That is the route I always take.

I started by installing Hadoop on a single node, i.e. my own machine. It was a tricky task, with most of the tutorials making many assumptions. Now that I have completed the install, I can safely say that those assumptions are simple ones that anyone familiar with Linux should understand, but they can still trip up a beginner. So I decided to write an installation guide for dummies.

Here is the most comprehensive documentation of "How to install Hadoop on your local system". Please let me know if I have missed anything.


Prerequisites


1. Linux

The first and foremost requirement is a PC with Linux installed on it. I used a machine running Ubuntu 9.10. You can also work with Windows, as Hadoop is purely Java based and will run on any OS that can run a JVM (which in practice means pretty much every modern OS).

2. Sun Java6

Install Sun Java 6 on your Linux machine using:
$ sudo apt-get install sun-java6-bin sun-java6-jre sun-java6-jdk
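
You can confirm that the right JVM is active with:
$ java -version
It should report a 1.6.x Sun/HotSpot runtime. If another JVM (e.g. OpenJDK) shows up instead, you may need to select Sun Java with sudo update-java-alternatives -s java-6-sun.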

3. Create a new user "hadoop"

Create a new user hadoop. This is not strictly required, but it is recommended: a dedicated user separates the Hadoop installation from the other software applications and user accounts running on the same machine. Use the following commands:
$ sudo addgroup hadoop
$ sudo useradd -d /home/hadoop -m -g hadoop hadoop
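
The steps below that must run as the hadoop user can be done after switching to that account, for example with:
$ su - hadoop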

4. Configure SSH

Install the OpenSSH server on your system:
$ sudo apt-get install openssh-server

Then generate an SSH key for the hadoop user. As the hadoop user, do the
following:
$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
1a:38:cd:0c:92:f9:8b:33:f3:a9:8e:dd:41:68:04:dc hadoop@paritosh-desktop
The key's randomart image is:
(randomart image omitted)
$


Then enable SSH access to your local machine with this newly created key:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Test your SSH connection:
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 1e:be:bb:db:71:25:e2:d5:b0:a9:87:9a:2c:43:e3:ae.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux paritosh-desktop 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686
$

Now that the prerequisites are complete, let's go ahead with the Hadoop
installation.

Install Hadoop from Cloudera

1. Add repository

Create a new file /etc/apt/sources.list.d/cloudera.list with the following
contents, taking care to replace DISTRO with the name of your distribution (find
out by running lsb_release -c):
deb http://archive.cloudera.com/debian DISTRO-cdh3 contrib
deb-src http://archive.cloudera.com/debian DISTRO-cdh3 contrib
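
For example, on my Ubuntu 9.10 machine (codename karmic, as reported by lsb_release -c), the file reads:
deb http://archive.cloudera.com/debian karmic-cdh3 contrib
deb-src http://archive.cloudera.com/debian karmic-cdh3 contrib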

2. Add the repository key (optional)

Add the Cloudera Public GPG Key to your repository by executing the following
command:
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
OK
$

This allows you to verify that you are downloading genuine packages.

Note: You may need to install curl:
$ sudo apt-get install curl

3. Update APT package index.

Simply run:
$ sudo apt-get update

4. Find and install packages.

You may now find and install packages from the Cloudera repository using your favorite APT package manager (e.g. apt-get, aptitude, or dselect). For example:
$ apt-cache search hadoop
$ sudo apt-get install hadoop

Setting up a Hadoop Cluster

Here we will set up a Hadoop cluster on a single node.

1. Configuration

Copy the hadoop-0.20 directory to the hadoop home folder:
$ cd /usr/lib/
$ cp -Rf hadoop-0.20 /home/hadoop/
Also, add the following to your .bashrc and .profile:
# Hadoop home dir declaration
HADOOP_HOME=/home/hadoop/hadoop-0.20
export HADOOP_HOME
Change the following in the configuration files in the $HADOOP_HOME/conf directory:

1.1 hadoop-env.sh

Change the Java home, depending on where your Java is installed. Note that JAVA_HOME must point to the JDK installation directory, not to the java binary; with the Sun Java 6 packages on Ubuntu that is typically:
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

1.2 core-site.xml

Change your core-site.xml to reflect the following:
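
A minimal single-node (pseudo-distributed) core-site.xml usually just points the default filesystem at a local HDFS instance. Something like this should work (port 9000 is a common choice, not a requirement):

<?xml version="1.0"?>
<configuration>
  <!-- URI of the default filesystem: the local HDFS NameNode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>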

1.3 mapred-site.xml

Change your mapred-site.xml to reflect the following:
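
For a single-node setup, mapred-site.xml typically only needs to tell MapReduce where the JobTracker runs. A minimal sketch (port 9001 is again just a common choice):

<?xml version="1.0"?>
<configuration>
  <!-- Host and port of the JobTracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>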

1.4 hdfs-site.xml

Change your hdfs-site.xml to reflect the following:
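
Since there is only one DataNode, HDFS should not try to keep the default three replicas of each block. A minimal hdfs-site.xml for a single node looks like this:

<?xml version="1.0"?>
<configuration>
  <!-- Only one DataNode, so keep a single replica per block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>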

2. Format the NameNode

To format the Hadoop Distributed File System (which simply initializes the directory specified by the dfs.name.dir variable), run the command:
$ $HADOOP_HOME/bin/hadoop namenode -format

Running Hadoop

To start Hadoop, run start-all.sh from the
$HADOOP_HOME/bin/ directory.

$ ./start-all.sh

starting namenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-namenode-paritosh-desktop.out
localhost: starting datanode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-datanode-paritosh-desktop.out
localhost: starting secondarynamenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-secondarynamenode-paritosh-desktop.out
starting jobtracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-jobtracker-paritosh-desktop.out
localhost: starting tasktracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-tasktracker-paritosh-desktop.out
$


To check whether all the processes are running fine, run the following:

$ jps
17736 TaskTracker
17602 JobTracker
17235 NameNode
17533 SecondaryNameNode
17381 DataNode
17804 Jps
$
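
If all five Hadoop daemons (plus Jps itself) show up, the single-node cluster is running. As an extra sanity check, which is not strictly required, you can list the root of HDFS and run one of the example jobs that ship with Hadoop (the exact name of the examples jar varies by version, hence the wildcard):

$ $HADOOP_HOME/bin/hadoop fs -ls /
$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-*examples*.jar pi 2 1000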
