Install Hadoop on a Single Machine

The best way to learn any new technology, IMHO, is to install it on your local system. You can then explore it and experiment with it at your convenience. That is the route I always take.

I started by installing Hadoop on a single node, i.e. my own machine. It was a tricky task, with most of the tutorials making many assumptions. Now that I have completed the install, I can safely say those assumptions are simple and anyone familiar with Linux will follow them easily, but they are not obvious to a newcomer. So I decided to write installation instructions for dummies.

Here is the most comprehensive documentation of "How to install Hadoop on your local system". Please let me know if I have missed anything.


Prerequisites


1. Linux

The first and foremost requirement is a PC with Linux installed on it. I used a machine running Ubuntu 9.10. You can also work with Windows, as Hadoop is purely Java based and will work with any OS that can run a JVM (which in turn covers pretty much all modern operating systems).

2. Sun Java6

Install Sun Java 6 on your Linux machine using:
$ sudo apt-get install sun-java6-bin sun-java6-jre sun-java6-jdk
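Once the packages are installed, a quick way to check that the JVM is available on your PATH (the exact version string will differ from system to system):
$ java -version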

3. Create a new user "hadoop"

Create a new user hadoop. Though it is not required, it is recommended: a dedicated user separates the Hadoop installation from other software applications and user accounts running on the same machine.
Use the following commands:
$ sudo addgroup hadoop
$ sudo useradd -d /home/hadoop -m hadoop -g hadoop
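The SSH key in the next step has to be generated as the hadoop user, so switch to that account first, for example with:
$ sudo su - hadoop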

4. Configure SSH

Install OpenSSH Server on your system:
$ sudo apt-get install openssh-server

Then generate an SSH key for the hadoop user. As the hadoop user do the following:
$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
1a:38:cd:0c:92:f9:8b:33:f3:a9:8e:dd:41:68:04:dc hadoop@paritosh-desktop
The key's randomart image is:
+--[ RSA 2048]----+
|o . |
| o E |
| = . |
| . + * |
| o = = S |
| . o o o |
| = o . |
| o * o |
|..+.+ |
+-----------------+
$


Then enable SSH access to your local machine with this newly created key:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
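If SSH later rejects the key, the usual culprit is permissions: sshd expects the .ssh directory and the authorized_keys file to be accessible only by their owner. Tightening them does no harm in any case:
$ chmod 700 $HOME/.ssh
$ chmod 600 $HOME/.ssh/authorized_keys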

Test your SSH connection:
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 1e:be:bb:db:71:25:e2:d5:b0:a9:87:9a:2c:43:e3:ae.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux paritosh-desktop 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686
$

Now that the prerequisites are complete, let's go ahead with the Hadoop installation.

Install Hadoop from Cloudera

1. Add repository

Create a new file /etc/apt/sources.list.d/cloudera.list with the following
contents, taking care to replace DISTRO with the name of your distribution (find
out by running lsb_release -c):
deb http://archive.cloudera.com/debian DISTRO-cdh3 contrib
deb-src http://archive.cloudera.com/debian DISTRO-cdh3 contrib
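For example, on the Ubuntu 9.10 machine used here, lsb_release -c reports karmic, so the two lines would read:
deb http://archive.cloudera.com/debian karmic-cdh3 contrib
deb-src http://archive.cloudera.com/debian karmic-cdh3 contrib
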
2. Add repository key. (optional)

Add the Cloudera Public GPG Key to your repository by executing the following
command:
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
OK
$

This allows you to verify that you are downloading genuine packages.

Note: You may need to install curl:
$ sudo apt-get install curl

3. Update APT package index.

Simply run:
$ sudo apt-get update

4. Find and install packages.

You may now find and install packages from the Cloudera repository using your favorite APT package manager (e.g. apt-get, aptitude, or dselect). For example:
$ apt-cache search hadoop
$ sudo apt-get install hadoop
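Once the install finishes, a quick sanity check is to ask Hadoop for its version (the CDH packages put the hadoop command on the PATH):
$ hadoop version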

Setting up a Hadoop Cluster

Here we will set up a Hadoop cluster on a single node (a pseudo-distributed cluster).

1. Configuration

Copy the hadoop-0.20 directory to the hadoop home folder:
$ cd /usr/lib/
$ cp -Rf hadoop-0.20 /home/hadoop/
Also, add the following to your .bashrc and .profile:
# Hadoop home dir declaration
HADOOP_HOME=/home/hadoop/hadoop-0.20
export HADOOP_HOME
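Optionally, you can add $HADOOP_HOME/bin to your PATH in the same files, so that the hadoop command and the start/stop scripts can be run from any directory:
export PATH=$PATH:$HADOOP_HOME/bin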
Change the following in the different configuration files in the $HADOOP_HOME/conf directory:

1.1 hadoop-env.sh

Set JAVA_HOME to the directory where your JDK is installed (not to the java binary itself); with the sun-java6 packages on Ubuntu this is typically /usr/lib/jvm/java-6-sun:
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

1.2 core-site.xml

Change your core-site.xml to reflect the following:
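A typical single-node setup simply points the default filesystem at HDFS on the local machine and gives Hadoop a working directory that the hadoop user can write to. The port 8020 and the directory path below are common choices, not requirements:

<?xml version="1.0"?>
<configuration>
  <!-- URI of the default filesystem: HDFS on this machine (9000 is another commonly used port) -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
  <!-- Base directory for Hadoop's working files; any directory writable by the hadoop user -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadoop-datastore</value>
  </property>
</configuration>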

1.3 mapred-site.xml

Change your mapred-site.xml to reflect the following:
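For a single node, the only thing that must be set is the JobTracker address, which is the local machine. The port 8021 is a common choice (9001 is equally common):

<?xml version="1.0"?>
<configuration>
  <!-- Host and port that the JobTracker listens on -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>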

1.4 hdfs-site.xml

Change your hdfs-site.xml to reflect the following:
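With only one DataNode in the cluster, the replication factor should be lowered to 1 (the default of 3 would leave every block reported as under-replicated):

<?xml version="1.0"?>
<configuration>
  <!-- Only one DataNode exists, so keep a single copy of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>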

2. Format the NameNode

To format the Hadoop Distributed File System (which simply initializes the directory specified by the dfs.name.dir property), run the command:
$ $HADOOP_HOME/bin/hadoop namenode -format

Running Hadoop

To start Hadoop, run start-all.sh from the $HADOOP_HOME/bin/ directory:

$ ./start-all.sh
starting namenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-namenode-paritosh-desktop.out
localhost: starting datanode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-datanode-paritosh-desktop.out
localhost: starting secondarynamenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-secondarynamenode-paritosh-desktop.out
starting jobtracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-jobtracker-paritosh-desktop.out
localhost: starting tasktracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-tasktracker-paritosh-desktop.out
$


To check whether all the processes are running fine, run the following:

$ jps
17736 TaskTracker
17602 JobTracker
17235 NameNode
17533 SecondaryNameNode
17381 DataNode
17804 Jps
$
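Assuming the default ports have not been changed, you can also open the built-in web interfaces as a sanity check: the NameNode status page at http://localhost:50070 and the JobTracker page at http://localhost:50030. When you are done, the cluster can be stopped with the companion script in the same directory:

$ ./stop-all.sh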

Hadoop... Let's welcome the Elephant!!

While working on my crawler, I came to the conclusion that my poor old system could not process such a large amount of data. I needed a powerful machine and a huge amount of storage. Unfortunately (or fortunately for me), I don't have that kind of moolah to invest in an enterprise-class server and a huge SAN storage array. What began as a pet project was growing into a major logistics headache!!

Then I read about a new open source entrant in the distributed computing space called Hadoop. Actually, it's not that new (it has been in development since 2004, when Google's MapReduce paper was published). Yahoo has been using it for a similar purpose: building page indexes for Yahoo Web Search. Apache Mahout, a machine learning project from the Apache Foundation, also uses Hadoop as its compute horsepower.

Suddenly, I knew Hadoop was the way to go. It runs on commodity PCs (#), offers petabytes of storage and the power of distributed computing. And the best part about it is that it is FOSS.

You can read more about Hadoop from the following places:

1. Apache Hadoop site.
2. Yahoo Hadoop Developer Network.
3. Cloudera Site.

My next few posts will elaborate more on how Hadoop works.

Note:

# Commodity PC doesn't imply a cheap PC. A typical choice of machine for running a Hadoop datanode and tasktracker in late 2008 would have the following specifications:

  • Processor: 2 quad-core Intel Xeon 2.0GHz CPUs
  • Memory: 8 GB ECC RAM
  • Storage: 4 × 1 TB SATA disks
  • Network: Gigabit Ethernet