Install Hadoop on a Single Machine

In order to learn any new technology, the best way IMHO is to install it on your local system. You can then explore it and experiment on it at your convenience. It is the route I always take.

I started by installing Hadoop on a single node, i.e. my own machine. It was a tricky task, as most of the tutorials make many assumptions. Now that I have completed the install, I can safely say that those were simple assumptions that anyone familiar with Linux would be expected to understand. Still, I decided to write installation instructions for dummies.

Here is the most comprehensive documentation of "How to install Hadoop on your local system". Please let me know if I have missed anything.


Prerequisites


1. Linux

The first requirement is a PC with Linux installed on it. I used a machine running Ubuntu 9.10. You can also work with Windows, as Hadoop is purely Java based and will run on any OS that can run a JVM (which covers pretty much every modern OS).

2. Sun Java6

Install Sun Java 6 on your Linux machine using:
$ sudo apt-get install sun-java6-bin sun-java6-jre sun-java6-jdk

3. Create a new user "hadoop"

Create a new user hadoop. It is not required, but a dedicated user is recommended
because it separates the Hadoop installation from other software applications and
user accounts running on the same machine.
Use the following commands:
$ sudo addgroup hadoop
$ sudo useradd -d /home/hadoop -m hadoop -g hadoop

4. Configure SSH

Install OpenSSH Server on your system:
$ sudo apt-get install openssh-server

Then generate an SSH key for the hadoop user. As the hadoop user do the
following:
$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
1a:38:cd:0c:92:f9:8b:33:f3:a9:8e:dd:41:68:04:dc hadoop@paritosh-desktop
The key's randomart image is:
+--[ RSA 2048]----+
|o . |
| o E |
| = . |
| . + * |
| o = = S |
| . o o o |
| = o . |
| o * o |
|..+.+ |
+-----------------+
$


Then enable SSH access to your local machine with this newly created key:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Test your SSH connection:
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 1e:be:bb:db:71:25:e2:d5:b0:a9:87:9a:2c:43:e3:ae.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux paritosh-desktop 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686
$

Now that the prerequisites are complete, let's go ahead with the Hadoop
installation.

Install Hadoop from Cloudera

1. Add repository

Create a new file /etc/apt/sources.list.d/cloudera.list with the following
contents, taking care to replace DISTRO with the name of your distribution (find
out by running lsb_release -c):
deb http://archive.cloudera.com/debian DISTRO-cdh3 contrib
deb-src http://archive.cloudera.com/debian DISTRO-cdh3 contrib
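
If you would rather generate those two lines than type them, here is a minimal Python sketch. The function name cloudera_sources is my own invention for illustration; "karmic" is the codename of Ubuntu 9.10, so substitute whatever lsb_release -c prints on your machine.

```python
def cloudera_sources(codename):
    """Return the two APT source lines for the given distribution codename."""
    base = "http://archive.cloudera.com/debian"
    suite = "%s-cdh3" % codename
    return [
        "deb %s %s contrib" % (base, suite),
        "deb-src %s %s contrib" % (base, suite),
    ]


# "karmic" is Ubuntu 9.10; replace it with your own codename.
print("\n".join(cloudera_sources("karmic")))
```

Redirect the output into /etc/apt/sources.list.d/cloudera.list (as root) once it looks right.
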

2. Add repository key. (optional)

Add the Cloudera Public GPG Key to your repository by executing the following
command:
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
OK
$

This allows you to verify that you are downloading genuine packages.

Note: You may need to install curl:
$ sudo apt-get install curl

3. Update APT package index.

Simply run:
$ sudo apt-get update

4. Find and install packages.

You may now find and install packages from the Cloudera repository using your favorite APT package manager (e.g. apt-get, aptitude, or dselect). For example:
$ apt-cache search hadoop
$ sudo apt-get install hadoop

Setting up a Hadoop Cluster

Here we will set up a Hadoop cluster on a single node.

1. Configuration

Copy the hadoop-0.20 directory to the hadoop home folder.
$ cd /usr/lib/
$ cp -Rf hadoop-0.20 /home/hadoop/
Also, add the following to your .bashrc and .profile
# Hadoop home dir declaration
HADOOP_HOME=/home/hadoop/hadoop-0.20
export HADOOP_HOME
Change the following in different configuration files in the $HADOOP_HOME/conf dir:

1.1 hadoop-env.sh

Change the Java home, depending on where your java is installed:
# The java implementation to use. Required.
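# JAVA_HOME must be the JDK directory, not the java binary. With the
# sun-java6 packages on Ubuntu this is normally /usr/lib/jvm/java-6-sun.
export JAVA_HOME=/usr/lib/jvm/java-6-sun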

1.2 core-site.xml

Change your core-site.xml to reflect the following:

1.3 mapred-site.xml

Change your mapred-site.xml to reflect the following:

1.4 hdfs-site.xml

Change your hdfs-site.xml to reflect the following:
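
For reference, a single-node (pseudo-distributed) setup usually needs just one property in each of the three files. The Python sketch below writes such files. The property names (fs.default.name, mapred.job.tracker, dfs.replication) are standard Hadoop 0.20 settings, but the values shown (hdfs://localhost:9000, localhost:9001, replication 1) are assumptions taken from the stock Hadoop quickstart, not necessarily the values of this particular install. The helper names are made up for illustration.

```python
import os
import tempfile
from xml.sax.saxutils import escape

# Assumed values for a single-node setup; adjust them to your install.
SITE_FILES = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
    "hdfs-site.xml": {"dfs.replication": "1"},
}


def render_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml document."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in sorted(properties.items()):
        lines.append("  <property>")
        lines.append("    <name>%s</name>" % escape(name))
        lines.append("    <value>%s</value>" % escape(value))
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


def write_site_files(conf_dir):
    """Write the three site files into conf_dir and return their paths."""
    paths = []
    for filename, properties in sorted(SITE_FILES.items()):
        path = os.path.join(conf_dir, filename)
        with open(path, "w") as handle:
            handle.write(render_site_xml(properties))
        paths.append(path)
    return paths


# Demo: write into a scratch directory rather than a real conf directory.
demo_dir = tempfile.mkdtemp()
for path in write_site_files(demo_dir):
    print(path)
```

Point write_site_files at $HADOOP_HOME/conf only after you have checked the generated files in the scratch directory.
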

2. Format the NameNode

To format the Hadoop Distributed filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command:
$ $HADOOP_HOME/bin/hadoop namenode -format

Running Hadoop

To start Hadoop, run start-all.sh from the
$HADOOP_HOME/bin/ directory.

$ ./start-all.sh
starting namenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-namenode-paritosh-desktop.out
localhost: starting datanode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-datanode-paritosh-desktop.out
localhost: starting secondarynamenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-secondarynamenode-paritosh-desktop.out
starting jobtracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-jobtracker-paritosh-desktop.out
localhost: starting tasktracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-tasktracker-paritosh-desktop.out
$


To check whether all the processes are running fine, run the following:

$ jps
17736 TaskTracker
17602 JobTracker
17235 NameNode
17533 SecondaryNameNode
17381 DataNode
17804 Jps
$
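
If you do not want to eyeball the list every time, the check can be scripted. This is a small Python sketch that takes the text printed by jps and reports which of the five daemons listed above are missing. The function name missing_daemons is hypothetical, and the sample text is simply the output shown above.

```python
EXPECTED_DAEMONS = {
    "NameNode",
    "SecondaryNameNode",
    "DataNode",
    "JobTracker",
    "TaskTracker",
}


def missing_daemons(jps_output):
    """Return the set of expected Hadoop daemons not listed in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # jps prints "<pid> <main class name>"; skip blank or odd lines.
        if len(parts) >= 2 and parts[0].isdigit():
            running.add(parts[1])
    return EXPECTED_DAEMONS - running


sample = """17736 TaskTracker
17602 JobTracker
17235 NameNode
17533 SecondaryNameNode
17381 DataNode
17804 Jps"""

print(sorted(missing_daemons(sample)))  # prints []
```

On a live machine you could feed it the real output, for example via subprocess.check_output(["jps"]), decoded to text.
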

Hadoop... Let's welcome the Elephant!!

While working on my crawler, I came to the conclusion that my poor old system can't process such a large amount of data. I needed a powerful machine and a huge amount of storage to do it. Unfortunately (or fortunately for me) I don't have the kind of moolah needed to invest in an enterprise-class server and a huge SAN storage. What began as a pet project was growing into a major logistics headache!!

Then I read something about a new open source entrant into the distributed computing space called Hadoop. Actually it's not a new entrant (it has been in development since 2004, when Google's MapReduce paper was published). Yahoo has been using it for a similar purpose: creating page indexes for Yahoo Web Search. Also Apache Mahout, a machine learning project from the Apache Software Foundation, uses Hadoop as its compute horsepower.

Suddenly, I knew Hadoop was the way to go. It uses commodity PCs (#), gives petabytes of storage and the power of distributed computing. And the best part about it is that it is FOSS.

You can read more about Hadoop from the following places:

1. Apache Hadoop site.
2. Yahoo Hadoop Developer Network.
3. Cloudera Site.

My next few posts will elaborate more on how Hadoop works.

Note:

# Commodity PC doesn't mean cheap PC. A typical choice of machine for running a Hadoop datanode and tasktracker in late 2008 would have the following specifications:

  • Processor: 2 quad-core Intel Xeon 2.0GHz CPUs
  • Memory: 8 GB ECC RAM
  • Storage: 4 × 1 TB SATA disks
  • Network: Gigabit Ethernet
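
To get a feel for what "petabytes on commodity PCs" means, here is a rough back-of-the-envelope sketch in Python. It assumes four 1 TB disks per node (both are parameters) and HDFS's default replication factor of 3, and it ignores space needed for the OS, logs and intermediate MapReduce output, so treat the result as a lower bound. The function name nodes_needed is made up for illustration.

```python
import math


def nodes_needed(usable_tb, disks_per_node=4, tb_per_disk=1.0, replication=3):
    """Estimate how many datanodes hold usable_tb of data after replication."""
    raw_tb_per_node = disks_per_node * tb_per_disk
    raw_tb_needed = usable_tb * replication
    return int(math.ceil(raw_tb_needed / raw_tb_per_node))


# 1 PB of user data, taken here as 1000 TB.
print(nodes_needed(1000))  # prints 750
```
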

Machine Learning : The Crawler

Building the crawler was the easiest part of this project.

All this crawler does is take a seed blog (my blog) URL, run through all the links on its front page and store the ones that look like a blog URL. I assume that every blog in this world is linked to from at least one other blog. Thus all of them will eventually get indexed if the spider is given enough time and memory.

This is the code for the crawler. It's in Python and is quite easy. Please run through it and let me know if there is any way to optimise it further:

import sys
import re
import urllib2
import urlparse
from pysqlite2 import dbapi2 as sqlite

conn = sqlite.connect('/home/spider/blogSearch.db')
cur = conn.cursor()

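# Seed the crawl frontier with the URL stored in the first row of blogList.
tocrawl = set()
for row in cur.execute('SELECT * FROM blogList where key=1'):
        tocrawl.add(row[1])

# The original pattern did not survive on this page; this href matcher
# is an assumed stand-in that captures the target of each <a> tag.
linkregex = re.compile(r'<a\s[^>]*href=["\'](.*?)["\']')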

while 1:

        try:
                crawling = tocrawl.pop()
        except KeyError:
                break  # nothing left to crawl

        url = urlparse.urlparse(crawling)

        try:
                response = urllib2.urlopen(crawling)
        except Exception:  # skip pages that fail to download
                continue

        msg = response.read()
        links = linkregex.findall(msg)

        for link in links:
                if link.endswith('.blogspot.com/'):
                        if link.startswith('/'):
                                link = 'http://' + url[1] + link
                        elif link.startswith('#'):
                                link = 'http://' + url[1] + url[2] + link
                        elif not link.startswith('http'):
                                link = 'http://' + url[1] + '/' + link

                        # Parameterised queries avoid quoting bugs and SQL injection.
                        cur.execute('SELECT 1 FROM blogList WHERE url=?', (link,))
                        if cur.fetchone() is None:
                                # Not seen before: queue it and remember it.
                                tocrawl.add(link)
                                cur.execute('INSERT INTO blogList (url) VALUES (?)', (link,))
                                conn.commit()
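
The link-fixing branches above can also be handled by the standard library. The sketch below is written for Python 3 (urllib.parse) rather than the Python 2 used above, and the two helper names are my own. Note that urljoin follows the standard rules for relative references, so a plain relative link resolves against the page's directory rather than the site root, which differs from the string concatenation in the crawler. The blogspot check is also slightly looser: it accepts a missing trailing slash.

```python
from urllib.parse import urljoin, urlparse


def resolve_link(page_url, href):
    """Resolve an absolute, root-relative, relative or #fragment href."""
    return urljoin(page_url, href)


def looks_like_blogspot_home(url):
    """True for blog front pages such as http://example.blogspot.com/."""
    parts = urlparse(url)
    host = parts.hostname or ""
    return host.endswith(".blogspot.com") and parts.path in ("", "/")


page = "http://a.blogspot.com/2010/03/post.html"
print(resolve_link(page, "/about"))
print(resolve_link(page, "#top"))
print(looks_like_blogspot_home("http://b.blogspot.com/"))
```
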

Machine Learning : Emotional Intelligent Search Engines

When we think of machines being emotionally intelligent, we tend to see a vast problem with seemingly no plausible solution. However, as I stated in my post about Machine Learning and Semantic Web, it can be broken down into smaller, manageable problems. And I intend to do that.

Here, in this post I present the introduction to a small step towards this solution. This solution, if implemented properly (which I hope it will be), will allow a machine to return emotionally intelligent search results.

I assume that anyone reading this article knows the way a search engine works, the Python programming language and how databases work.

Statement of Problem :

"Build an emotionally intelligent search engine which does the following tasks:

  1. Crawls the web for blogposts and downloads them.
  2. Indexes them according to their emotional quotient.
  3. Returns results according to the abstract questions asked by the end user."

Proposed solution :

The proposed proof of concept (POC) consists of:
  1. A crawler which crawls the blogosphere collecting links of all the blogs out there. I intend to target only the blogs on blogspot. These links are sanitised and stored in a database. Then another crawler cum 'blog-getter' looks into this database, picks up each blog one by one and downloads all the posts in it. It stores the posts in flat file format and stores their metadata in a database.
  2. A lexicon builder parses these posts one by one and builds upon its vocabulary of emotional words (fundamentally all the adjectives in the posts) with the help of a human tutor.
  3. An indexer then parses all these posts, cross referencing them with its lexicon (built in the previous step) and then stores the parsed words into a db. It stores nouns and adjectives in separate tables referencing them to the metadata table (built in the first step).
  4. When the end user enters a search query, suppose "a nice Chinese restaurant", the query parser first finds all posts about 'Chinese restaurant'. From these results it finds the ones that are nice (according to its logic 'nice' means where people were happy and used 'happy' adjectives when they described such restaurants in their posts). It then displays the results in descending order of 'niceness'.
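
The query step can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: the lexicon scores, the post records and the function names are all hypothetical, and the real index would live in database tables rather than in-memory lists.

```python
def niceness(adjectives, lexicon):
    """Sum the lexicon scores of a post's adjectives; unknown words score 0."""
    return sum(lexicon.get(word, 0) for word in adjectives)


def search(posts, topic_nouns, lexicon):
    """Return titles of posts mentioning every topic noun, nicest first."""
    wanted = set(topic_nouns)
    hits = [p for p in posts if wanted <= set(p["nouns"])]
    hits.sort(key=lambda p: niceness(p["adjectives"], lexicon), reverse=True)
    return [p["title"] for p in hits]


# Hypothetical toy data standing in for the lexicon and the index tables.
lexicon = {"delicious": 2, "friendly": 1, "rude": -2, "bland": -1}
posts = [
    {"title": "Dragon Wok", "nouns": ["chinese", "restaurant"],
     "adjectives": ["bland", "rude"]},
    {"title": "Golden Bowl", "nouns": ["chinese", "restaurant"],
     "adjectives": ["delicious", "friendly"]},
    {"title": "Pizza Hutch", "nouns": ["pizza", "restaurant"],
     "adjectives": ["delicious"]},
]

print(search(posts, ["chinese", "restaurant"], lexicon))
# prints ['Golden Bowl', 'Dragon Wok']
```

The pizza post is filtered out by the noun match, and the two remaining posts are ordered by the summed score of their adjectives.
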

Implementation

I sincerely believe that this is an absolutely doable project, and I have made some progress with building the lexicon. As this is an ongoing project, I will be posting my experiences and the code as and when I complete testing it.

The coding will be in Python. Why? Because I am in love with the language. Period.

The database, as of now, is SQLite, but I am planning to move it to a NoSQL store as soon as I get a firm grasp on the technology.

The platform right now is a single Linux box. I know it will require a lot more computing power, storage and memory when deployed. I am planning to port this to Hadoop when it comes to the deployment stage. But for now, this remains my baby and will have its place on my desktop :).

Supervised Machine Learning

Introduction

When I was born, I was ignorant but very curious. I always asked
questions : "What? Where? Why? How?" And my parents guided me through
this jungle of knowledge, telling me about things that really mattered
and leaving what was complicated to the future.

As I grew up, I started learning about more complex things by
cross referencing them with my existing knowledge. If the new
incidents were not in my knowledge database, I queried books,
articles and lately Google. Thus from an ignorant child with an empty
knowledge database, I grew up to become a knowledgeable person.

Let me take an example of how I learned English.

I am from India, and English is neither my nor my parents' first language.
However, they really wanted me to learn how to read, write and speak
in English. They couldn't teach me by changing their mother tongue.
What they did, instead, was teach me the alphabet, a few
easy words and the basic grammar constructs. Then they gave me a
dictionary and a grammar book. I looked up new words that I saw in
the dictionary and new sentence formations in the grammar book. If I
still couldn't understand what was written, I asked my teacher.

Now when I come across anything written in English, I cross reference
it with the vocabulary and grammar that I have learned over all these
years. If something new comes up, I look it up online or ask my
peers. Thus I build upon my knowledge of English in an ongoing,
continuous process.

Implementation in Machine Learning

Now, building on this very basic pattern of how any human child learns
about the physical world, we can model this learning process for
machines. This is a brief outline of the idea (here I take the example
of the English language, but this is equally applicable to any other
natural language or any other real world entity):

"In the same way as I learnt about the world, Machines can also be
taught. In the same way that we teach a child how to read and
understand the English language, a computer can be taught to recognize
patterns of English letters as words and to look up their meanings
in its dictionary (which is built with human assistance). The machine
can gradually build up its vocabulary by reading more and finding the
meanings of new words and sentence constructs online. If it finds
something too complex to work out on its own, it can ask its human
tutor what it means.

Given the amount of cheap storage, memory and computing power we have
in our hands today, I think it is a perfectly doable thing. I just
need time and resources to do it."
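
The "ask the tutor when stuck" loop is easy to sketch in Python. Everything here is a hypothetical illustration: learn_words is my own name, the dictionary is a plain dict, and the tutor is just a callback that returns a meaning for a word.

```python
def learn_words(text, lexicon, ask_tutor):
    """Add every unknown word in text to lexicon, asking the tutor for it."""
    new_words = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if word and word not in lexicon:
            # The machine does not know this word yet, so it asks.
            lexicon[word] = ask_tutor(word)
            new_words.append(word)
    return new_words


# Hypothetical starting vocabulary and a stand-in for the human tutor.
lexicon = {"the": "article", "food": "noun", "was": "verb"}
tutor_answers = {"delicious": "adjective, positive"}

learned = learn_words(
    "The food was delicious.",
    lexicon,
    lambda word: tutor_answers.get(word, "unknown"),
)
print(learned)  # prints ['delicious']
```

Each pass over new text grows the dictionary, so the tutor gets asked less and less often as the vocabulary fills up.
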

Value of this research

Why is it important for a machine to learn a natural language? The
answer is simple: if a machine learns a natural language, it can mine
the enormous amount of data on the internet to learn about any other
discipline. We will provide the world with the perfect student. It can
learn financial modelling from a human tutor and apply it to the
streaming financial data from the internet to predict future trends.
Maybe even produce its own models. It can learn biotechnology from its
human tutor and mine the huge amount of pathological data available on
the internet to produce new medicines. The application possibilities
seem to be bound only by human imagination.

Current Status of my Research

Right now, I am working on a search engine which crawls the
blogosphere and indexes blogposts according to their emotional
quotient. It's a small step in the final scheme of things but this is
what I can do with the time and resources that I have (I have a '9 to
6' day job as a software engineer). I would love to devote more time
to this idea, provided the mundane activities (like earning my bread
and butter) are taken care of :-).

I am not sure if this idea has been implemented previously anywhere.
Also, I acknowledge it is a very simple solution to a seemingly very
complex problem. However, I feel that this is the most suitable
solution. If we humans can do it then so can the machines.

Microsoft Interops Day

Well, well, well... I committed the crime. An open source developer attending a Microsoft event is definitely a crime. But then they promised it was going to be an event for PHP development on Windows. I took the bait, hook, line and sinker. Damn!!

The interoperability day turned out to be a cheap marketing gimmick, showcasing what you can do with IIS and Silverlight. This is a session by session breakdown of what the Microsoft guys showed us:

Session 1:
Promised : Build an end-to-end, authentication and authorization model for your existing PHP application in less than 30 mins.

Delivered : An existing PHP application was taken, authentication was provided by removing anonymous access from the root directory, and authorization was provided by adding a login page. Quite a good demonstration, I would admit. With a few clicks of the mouse, we could achieve something which otherwise takes a few lines of code.

My Take : As a hand coder who needs to have control over everything going on around me, I don't think I will ever use this stuff.

Session 2:
Promised : Access your data exposed from your PHP app, through a RESTful service interface.

Delivered : This is where I dozed off. The demo was to show how you can easily configure a web service for a SQL Server DB, plug in your C# code to harvest the data and show it through your browser.

My Take : Now, I may have been snoring while they showed how the PHP application comes into the picture, but I have no memory of any PHP code actually being shown in this session. It was all about how to use standard plug-and-play snippets in Visual Studio, ADO.NET and SQL Server.

Session 3:
Promised : Configuring Windows to run your PHP applications.

Delivered : They showed us some so-called cool GUI configuration techniques for porting .htaccess configurations.

My Take : Now why would I do that? When I have a single .htaccess file to control all my Apache configuration, why would I monkey around with the superfluous click-and-you-are-done controls in IIS? And the best part of the presentation was that the guy showing all this leads the PHP UG in Bangalore. What a sham!

Session 4:
Promised : Discover the web designer in you, to build pure CSS based PHP websites.

Delivered : A couple of CSS tips and tricks downloaded from the net. And a huge demo of how Silverlight can Light up your web apps.

My Take : No discussion whatsoever of using CSS for PHP templates!! Come on!! At least stick to what was promised in your mailer. But then MS has never been known to stick to its promises, has it?

Overall, I think that if they had called us in for a demo of what MS is doing for web developers, I would have definitely loved the sessions. I have never worked on MS stuff for web app development and I would have loved a kick start. But I went there with a different picture in mind, and was deeply disappointed by the way in which the MS guys used the name of PHP for promoting their products.

N.B. I am currently working on AWS using Python. I find it a compelling cloud computing platform and will be starting a series of posts on it. Anyone who loves cloud computing, hold on to your keyboards!! ;)

Blogadda loses my data!!

Dear Blogger Friend,

Thank you for visiting BlogAdda.com and submitting your blog.


We had an issue with our host yesterday due to which information about your blog was inadvertently affected. Your blog URL is in the records and you might have to update the rest of the information.

Kindly login at BlogAdda ( http://www.blogadda.com/login ) and click on 'My Account' link on the top. You'll see the title of your blog linked. Clicking the title will expand into a few options and you can click on 'Manage' to update details of your blog.

Our apologies for the inconvenience caused and we would appreciate if you can update the blog details asap.

We would like to assure you that we take utmost care of the data and this would not happen again. We thank you for your co-operation.


Regards

Administration Team

BlogAdda.com

Twitter: http://www.twitter.com/blogadda
Facebook: http://www.facebook.com/blogadda

What happens when you get such a mail? I get pissed off. Within a minute of getting this mail, I shot back my reply:

No need to be sorry!! I have got tonnes of work other than to re-enter my blog details on your stupid servers which can't even store this tiny winy bit of information properly.

I am logging out of blogadda. I will make sure no one in my friendlist joins or advises anyone to join such a site which doesn't value its user's data.

I know blogadda is a free blog aggregator service and they have no responsibility whatsoever for my data. But the question remains... how the hell can they manage to lose it?

There are many ways to lose data:

  1. Hacking : Someone hacked into their database and made off with the data. In that case my email ID is in their hands and, at the very least, I can expect to get thousands of spam emails which may or may not have adware and malware attached to them.
  2. Backup failure: Some bozo missed out taping my data. If they have a backup policy. Which I doubt they have.
  3. Natural Disaster: The data centre was flooded and thus my data was lost.
  4. Terrorist attack: Black-masked terrorists came and fled with the tapes of my data. They also deleted my entries from the database. On second thought, is my second name Bush? Definitely not.

After going through all such possibilities, all I can say is that I am mesmerised by this kinda act of terrorism on my individuality. Damn it!