I started by installing Hadoop on a single node, i.e. my machine. It was a tricky task, with most of the tutorials making many assumptions. Now that I have completed the install, I can safely say those were simple assumptions, and anyone familiar with Linux would understand them. Still, I decided to write up the installation instructions for beginners.
Here is the most comprehensive documentation of "How to install Hadoop on your local system". Please let me know if I have missed anything.
1. A Linux machine
The first and foremost requirement is a PC with Linux installed on it. I used a machine running Ubuntu 9.10. You can also work with Windows, as Hadoop is purely Java based and will work with any OS that can run a JVM (which in turn means pretty much all modern operating systems).
2. Sun Java6
Install the Sun Java6 on your Linux machine using:
$ sudo apt-get install sun-java6-bin sun-java6-jre sun-java6-jdk
3. Create a new user "hadoop"
Create a new user hadoop. Though not required, this is recommended in order
to separate the Hadoop installation from other software applications and user
accounts running on the same machine by having a dedicated user for Hadoop.
Use the following commands:
$ sudo addgroup hadoop
$ sudo useradd -d /home/hadoop -m hadoop -g hadoop
4. Configure SSH
Install the OpenSSH server on your system:
$ sudo apt-get install openssh-server
Then generate an SSH key for the hadoop user. As the hadoop user, do the following:
$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
The key's randomart image is:
+--[ RSA 2048]----+
|o .              |
| o E             |
|  = .            |
| . + *           |
|  o = = S        |
|   . o o o       |
|      = o .      |
|     o * o       |
+-----------------+
Then enable SSH access to your local machine with this newly created key:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Test your SSH connection:
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 1e:be:bb:db:71:25:e2:d5:b0:a9:87:9a:2c:43:e3:ae.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux paritosh-desktop 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC
Now that the prerequisites are complete, let's go ahead with the Hadoop installation.
Install Hadoop from Cloudera
1. Add repository
Create a new file /etc/apt/sources.list.d/cloudera.list with the following
contents, taking care to replace DISTRO with the name of your distribution (find
out by running lsb_release -c):
deb http://archive.cloudera.com/debian DISTRO-cdh3 contrib
deb-src http://archive.cloudera.com/debian DISTRO-cdh3 contrib
2. Add repository key (optional)
Add the Cloudera Public GPG Key to your repository by executing the following command:
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
This allows you to verify that you are downloading genuine packages.
Note: You may need to install curl:
$ sudo apt-get install curl
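As a convenience, the repository file from step 1 can be generated rather than typed by hand. Here is a sketch, assuming `lsb_release` is available; the `tee` line is left commented out so you can inspect the output first:

```shell
# Build the two repository lines for this machine's release codename.
# "lsb_release -cs" prints the codename (e.g. "karmic" for Ubuntu 9.10);
# the fallback to "karmic" is purely illustrative.
DISTRO=$(lsb_release -cs 2>/dev/null || echo karmic)
REPO_LINES="deb http://archive.cloudera.com/debian ${DISTRO}-cdh3 contrib
deb-src http://archive.cloudera.com/debian ${DISTRO}-cdh3 contrib"
echo "$REPO_LINES"
# When it looks right, write it out (needs root):
# echo "$REPO_LINES" | sudo tee /etc/apt/sources.list.d/cloudera.list
```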
3. Update APT package index.
$ sudo apt-get update
4. Find and install packages.
You may now find and install packages from the Cloudera repository using your favorite APT package manager (e.g. apt-get, aptitude, or dselect). For example:
$ apt-cache search hadoop
$ sudo apt-get install hadoop
Setting up a Hadoop Cluster
Here we will try to set up a Hadoop cluster on a single node.
1. Copy and configure Hadoop
Copy the hadoop-0.20 directory to the hadoop home folder:
$ cd /usr/lib/
$ cp -Rf hadoop-0.20 /home/hadoop/
Also, add the following to your .bashrc and .profile:
# Hadoop home dir declaration
export HADOOP_HOME=/home/hadoop/hadoop-0.20
Change the following in different configuration files in the $HADOOP_HOME/conf directory.
Change the Java home in conf/hadoop-env.sh, depending on where your Java is installed:
# The java implementation to use. Required.
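For example, with the sun-java6 packages installed as above, the JVM on Ubuntu typically lives under /usr/lib/jvm. The exact path below is an assumption, so verify it on your system first:

```shell
# In $HADOOP_HOME/conf/hadoop-env.sh
# The java implementation to use. Required.
# Path assumes Ubuntu's sun-java6-jdk package; check with: ls /usr/lib/jvm
export JAVA_HOME=/usr/lib/jvm/java-6-sun
```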
Change your core-site.xml to reflect the following:
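The post does not reproduce the file contents here, so the following is a minimal single-node sketch, assuming the commonly used port 9000 and a scratch directory of your choosing:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- URI of the default filesystem; localhost because this is a single-node setup -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- Base directory for Hadoop's working files (assumed path; pick your own) -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadoop-datastore</value>
  </property>
</configuration>
```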
Change your mapred-site.xml to reflect the following:
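Again a sketch rather than the author's exact file, assuming the conventional single-node JobTracker address:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Host and port the JobTracker listens on (assumed value for a single node) -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```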
Change your hdfs-site.xml to reflect the following:
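A sketch for hdfs-site.xml; with only one DataNode on a single machine, a replication factor of 1 is the standard choice:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- One copy of each block is enough when there is only one DataNode -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```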
2. Format the NameNode
To format the Hadoop Distributed File System (which simply initializes the directory specified by the dfs.name.dir property), run the command:
$ $HADOOP_HOME/bin/hadoop namenode -format
3. Start Hadoop
To start Hadoop, run start-all.sh from the $HADOOP_HOME/bin directory:
$ $HADOOP_HOME/bin/start-all.sh
starting namenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-namenode-paritosh-desktop.out
localhost: starting datanode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-datanode-paritosh-desktop.out
localhost: starting secondarynamenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-secondarynamenode-paritosh-desktop.out
starting jobtracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-jobtracker-paritosh-desktop.out
localhost: starting tasktracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-tasktracker-paritosh-desktop.out
To check whether all the processes are running fine, run jps (which ships with the JDK):
$ jps