Then I read about an open source entrant in the Distributed Computing space called Hadoop. Actually, it's not a new entrant: it has been in development since 2004, when Google published its MapReduce paper. Yahoo has been using it for a similar purpose, building the page index for Yahoo Web Search. Also, Apache Mahout, a machine learning project from the Apache Foundation, uses Hadoop as its compute horsepower.
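To give a feel for the MapReduce model that Hadoop implements, here is a minimal word-count sketch in plain Python. This is only an illustration of the map/shuffle/reduce idea, not Hadoop's actual Java API; the function names are mine.

```python
from collections import defaultdict

# Map phase: turn each input record (a line of text) into
# intermediate (key, value) pairs -- here, (word, 1).
def map_words(line):
    for word in line.split():
        yield (word.lower(), 1)

# Shuffle phase: group all intermediate values by key.
# Hadoop does this across the network between mappers and reducers.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine the grouped values for each key.
def reduce_counts(key, values):
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]
pairs = [p for line in lines for p in map_words(line)]
result = dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
print(result["the"])  # 2
```

The point of the split is that the map and reduce steps are independent per key, so Hadoop can run thousands of them in parallel across a cluster.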
Suddenly, I knew Hadoop was the way to go. It runs on commodity PCs (#), gives petabytes of storage, and delivers the power of Distributed Computing. And the best part is that it is FOSS.
You can read more about Hadoop at the following places:
1. Apache Hadoop site.
2. Yahoo Hadoop Developer Network.
3. Cloudera Site.
My next few posts will elaborate more on how Hadoop works.
# Commodity PC doesn't imply cheap PCs. A typical choice of machine for running a Hadoop datanode and tasktracker in late 2008 would have the following specifications:
- Processor: 2 quad-core Intel Xeon 2.0 GHz CPUs
- Memory: 8 GB ECC RAM
- Storage: 4 × 1 TB SATA disks
- Network: Gigabit Ethernet
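A quick back-of-the-envelope calculation shows how nodes like this add up to the petabytes mentioned above. The node count below is hypothetical; the only Hadoop-specific number is the default HDFS replication factor of 3.

```python
# Hypothetical cluster sizing sketch. The 1000-node figure is
# illustrative; the replication factor of 3 is the HDFS default.
raw_per_node_tb = 4      # 4 x 1 TB SATA disks per node, as specced above
replication = 3          # HDFS stores each block 3 times by default
nodes = 1000

raw_tb = nodes * raw_per_node_tb        # total raw disk in the cluster
usable_tb = raw_tb // replication       # capacity after replication

print(raw_tb)      # 4000 -> 4 PB of raw disk
print(usable_tb)   # 1333 -> roughly 1.3 PB usable
```

So even after paying the 3x replication tax for fault tolerance, a thousand commodity nodes comfortably clear a petabyte of usable storage.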