Hadoop... Let's welcome the Elephant!!

While working with my crawler, I came to the conclusion that my poor old system just can't process such a large amount of data. I needed a powerful machine and a huge amount of storage to do it. Unfortunately (or fortunately for me), I don't have that kind of moolah to invest in an enterprise-class server and a huge SAN storage. What began as a pet project was growing into a major logistics headache!!

Then I read something about a new open source entrant into the distributed computing space called Hadoop. Actually, it's not that new an entrant (it has been in development since 2004, when Google's MapReduce paper was published). Yahoo has been using it for a similar purpose: building the page index for Yahoo Web Search. Also, Apache Mahout, a machine learning project from the Apache Foundation, uses Hadoop as its compute horsepower.

Suddenly, I knew Hadoop was the way to go. It runs on commodity PCs (#), gives petabytes of storage, and brings the power of distributed computing. And the best part about it is that it is FOSS.
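Just to give a flavour of the MapReduce model Hadoop is built around, here is a rough sketch of the classic word-count job, pretty much the "hello world" of Hadoop. It uses Hadoop's standard Java MapReduce API; the class names and the command-line input/output paths are purely illustrative, not anything specific to my crawler.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum up the counts collected for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

You would package this into a jar and launch it on the cluster with something like "hadoop jar wordcount.jar WordCount <input dir> <output dir>"; the framework takes care of splitting the input, scheduling the map and reduce tasks across the nodes, and re-running anything that fails.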

You can read more about Hadoop at the following places:

1. Apache Hadoop site.
2. Yahoo Hadoop Developer Network.
3. Cloudera site.

My next few posts will elaborate more on how Hadoop works.

Note:

# "Commodity PC" doesn't imply a cheap PC; a typical choice of machine for running a Hadoop datanode and tasktracker in late 2008 would have the following specifications:

  • Processor: 2 quad-core Intel Xeon 2.0GHz CPUs
  • Memory: 8 GB ECC RAM
  • Storage: 4 × 1 TB SATA disks
  • Network: Gigabit Ethernet
