Helping Sustained Learning

Right from our childhood, we have been force-fed by our education system to mug up all that we can. No one knows why. The teachers don't. The parents don't. The children for sure don't.

Given that most of human knowledge is already on the internet (read Wikipedia), don't you think we need to break the rules now? Do any of us grown-up adults remember anything more than how to find stuff on the Net and use it logically? I, for one, sure don't. So, do we really need our children to go through the excruciatingly painful education system?

Aside: Wow!! I really like to ask questions.

Aside II: Can I provide answers? Sure, why not!! What I propose is something like a secondary brain: our vault of memory on the cloud.

Like it or not, Sherlock Holmes was right. There is a limit to the petabyte HDD we carry in our heads. We can only store so much information. So why clog it up with all the unnecessary information children get from subjects they are not interested in?

Why not give them a single point-of-access device for all their information needs? Whatever they read or see can be stored on the cloud, so they don't need to remember anything (if they so wish). All that is required of them is that they process that information quickly and logically.

We already have systems in place for all that I propose. All we need is convergence and social acceptability.

Spare a thought for it!!

The Social Agenda

Do social networks really work?

Yesterday, while talking to my boss (BTW, he is the only one I still consider my boss, apart from my wife, as he is the only one I listened to at work ;)), he came up with an interesting idea. People have not changed in centuries. They have always been drifting apart. From whole villages living in one cave to rising divorce rates, we have come a long way from social living. People are becoming more and more individualistic.

Now, back to the main topic: do social networks work? Over the past decade, we have seen numerous social networks come and then fizzle out. MySpace, Friendster, Orkut and now Facebook have all seen the same story. People get excited when they launch. They call all their friends to join. They form groups and communities. They chat all night long. But then what? They move on to the next big thing... Twitter.

For fear of being called a hypocrite, I would like to admit that I have also been part of all these bandwagons, be it Orkut, MySpace, blogs, Facebook or Twitter. You can find my profiles on all of them. But then I get bored, like the rest of you, and move on.

Now that Google is coming onto the scene with a stronger product (than Orkut) in Google+, I am not sure whether it is too late to make a dent in the incumbent SNSs' fortunes.

There are three things, and only three things, that really sell in this world: sex, knowledge and food. That is why you see porn sites being so popular and Google making tonnes of money. Online economies built around these three will always remain popular and viable. The rest, as they say, will become history.

BTW, with the much-hyped IPOs of all the SNSs, I seriously feel there is another dot-com bust brewing. So cash in your stocks and come back to the real world!!

Cloud Computing and Television

This post describes the use of Cloud Computing to provide true ‘On Demand Entertainment’ and make the television a unified entertainment device. It illustrates this with some interesting use-cases for televisions that become possible with the power of Cloud Computing. These applications can transform the television into a complete entertainment platform.

Introduction

Televisions have come a long way from their monochromatic ancestors of the 1930s to become the modern age's ultimate home entertainers. Consumers' demands and expectations have evolved tremendously, putting a strain on the traditional mediums of entertainment, i.e. audio and video. Moreover, the advent of the internet has created another dimension in home entertainment – let's call it "On Demand Entertainment" (ODE). True ODE means consumers will be able to watch, listen, play or read whatever they want, whenever they want.

Television consumers now have the option of going on the Internet and searching for ODE including (but not restricted to) games, news, video and audio. Internet giants like Amazon, Hulu, Netflix and YouTube have started cutting into the Television industry's profits and have become a force to be reckoned with in the home entertainment segment.

In a parallel development, consumers now want seamless integration and convergence between the new and the old media of entertainment. This expectation has led to innovation in the Television industry in the form of IPTV, satellite TV and internet-enabled TVs. These technologies strive to provide ODE as well as fulfil the demand for a unified entertainment device. However, true ODE is still a distant dream because of the strain it puts on the storage and compute power of the back-end data centers.

Cloud Computing, or internet-based computing, which provides on-demand storage and compute power billed on a pay-per-use basis, is a perfect strategic fit for the puzzle of ODE. It can meet the huge compute and storage requirements of true ODE.

This post describes how Cloud Computing can be used to deliver true On Demand Entertainment, using some specific use-cases of:
  • On Demand Gaming
  • Ubiquitous Media Playback
  • Online Personal Media Store

The Workings!!

Entertainment today includes much more than the traditional media of books, Television and Radio. As discussed earlier, ODE has become a major expectation nowadays. People are also looking for a single device which can take care of all their entertainment needs. Televisions are facing serious competition in this race for a unified entertainment device from hand-held gadgets and the Internet. They need the help of modern technology to break to the forefront of this race, and Cloud Computing is one such technology that can tremendously help the Television industry.

According to Wikipedia:

Cloud Computing is Internet-based computing, whereby shared resources, software, and information are provided to computers and other devices on demand, like the electricity grid.

This on-demand pool of servers, generally called the cloud, can provide the huge hardware resources required by some interesting use cases in the Television industry:

On Demand Gaming:

Games are very compute-intensive applications, so much so that dedicated platforms are built for serious gamers. Televisions too have built-in games, but not of the class of "core games". This is because core games require huge compute power that Televisions can't provide.

Now that Televisions have become internet-enabled, we can use the compute power of the Cloud to do the computation at the back-end; in effect, we push the gaming console onto the cloud. All the user interactions are sent to the Cloud; the cloud computes the next game state based on the game rules and sends back the results for the Television to display.

This can be a disruptive product in the gaming industry, as it will give rise to true multi-player games, where players join and leave games as and when they wish. Anyone with an internet-enabled Television set can join the game, as the dependency on expensive gaming consoles will end.
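
To make the idea concrete, here is a minimal, purely illustrative sketch of the thin-client loop; the endpoint, session URL and message format are all hypothetical, not part of any real service:

# Sketch of a thin-client game loop: the TV only captures input and displays
# results; all game logic runs in the cloud. Everything here is hypothetical.
import json
import urllib2

GAME_SERVER = "http://game-cloud.example.com/session/42"   # hypothetical endpoint

def send_input_and_get_frame(player_input):
    """Send one input event to the cloud; it returns the next rendered state."""
    request = urllib2.Request(GAME_SERVER,
                              data=json.dumps({"input": player_input}),
                              headers={"Content-Type": "application/json"})
    response = urllib2.urlopen(request)
    return json.loads(response.read())   # e.g. {"frame_url": ..., "score": ...}

# The TV's main loop would then look roughly like:
# for event in remote_control_events():
#     frame = send_input_and_get_frame(event)
#     display(frame)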

Ubiquitous Media Playback:

Another interesting application of Cloud Computing in the entertainment industry is Ubiquitous Media Playback.

Let's take an example:

A user is watching a movie on his Television when he suddenly gets an urgent call to go somewhere. The movie is at a very interesting point and he doesn't want to miss it. He simply activates the “Ubiquitous Media Playback” feature with the push of a button, and the movie starts playing on his hand-held gadget.

For this to become reality, all that is needed is that both his hand-held gadget and his Television have internet access. The Television starts uploading the movie (from the point where it was stopped) to the cloud; the cloud converts the movie into a format fit for playback on his hand-held gadget and streams it to the gadget. The gadget resumes playing the movie from the same point.
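
A rough sketch of the handoff, with all endpoints and field names invented purely for illustration:

# TV side reports where playback stopped; gadget side asks where to resume.
# The service URL and JSON fields below are assumptions, not a real API.
import json
import urllib2

CLOUD = "http://media-cloud.example.com"   # hypothetical resume-point service

def pause_on_tv(user_id, movie_id, position_seconds):
    """TV side: tell the cloud which movie was stopped and at what position."""
    payload = json.dumps({"movie": movie_id, "position": position_seconds})
    req = urllib2.Request("%s/resume-point/%s" % (CLOUD, user_id), data=payload,
                          headers={"Content-Type": "application/json"})
    urllib2.urlopen(req)

def resume_on_gadget(user_id):
    """Gadget side: fetch the resume point and a stream transcoded for this device."""
    resp = urllib2.urlopen("%s/resume-point/%s?device=handheld" % (CLOUD, user_id))
    info = json.loads(resp.read())          # e.g. {"stream_url": ..., "position": ...}
    return info["stream_url"], info["position"]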

Thus, a simple application of the power of cloud computing can enhance the viewing experience manifold.

Online Personal Media Store:

People now try to keep all their media in digital format. They store it on hard disks, CDs, DVDs and BDs. But it still makes for a bulky collection, with the fear of losing their data always lingering at the back of their minds. What if they had a personal media store on the Cloud? What if they could use their Televisions to store all they want on the cloud?

This is possible. Televisions connected to the Internet can be used to push all the personal media onto the cloud. The Cloud can then sort and organise the media under various categories and build a searchable index of all the user content. This data can then be customised according to the display and other capabilities of each of the user's devices. The user can then access this library from any device he wants.
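
As a toy illustration of the cloud-side cataloguing step (the metadata fields are my own assumptions, not a real API), grouping items by category and building a keyword index could look like this:

# Group uploaded media by category and build a simple keyword index.
from collections import defaultdict

def build_catalogue(items):
    """items: list of dicts like {"title": ..., "category": ..., "tags": [...]}"""
    by_category = defaultdict(list)
    keyword_index = defaultdict(set)
    for item in items:
        by_category[item["category"]].append(item["title"])
        for word in item["tags"] + item["title"].lower().split():
            keyword_index[word].add(item["title"])
    return by_category, keyword_index

catalogue, index = build_catalogue([
    {"title": "Holiday Photos 2010", "category": "photos", "tags": ["family"]},
    {"title": "Road Trip", "category": "videos", "tags": ["travel", "family"]},
])
print index["family"]   # a set containing both titles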

This can be offered as a pay-per-use service, which can be commercialised very easily.

These are just a few use cases which illustrate the power of Cloud Computing in home entertainment. If we take a sweeping look at the whole home entertainment landscape, there would be thousands of applications of this tremendously powerful technology.

Conclusion

We know that technology is pushing the edges day in and day out. Televisions need to adopt these new technologies rapidly in order to provide a better experience to consumers. Cloud Computing is one such technology, which has the power to revolutionise the way entertainment is served. As this post suggests, applying it to different use cases can make the Television a true ODE provider and a unified entertainment device.

Automatic program recommendation using Facial Expression Recognition

More often than not, people get bored of the program they are currently watching. However, they dread scanning through hundreds of channels to find out if anything interesting is on, and end up switching off the TV instead. This idea proposes a solution to these situations: the TV suggests to the viewer what's interesting on air.

The TV, using a camera attached to it, first recognizes the viewer using “Face Recognition” techniques. It then relays to a central hub statistics about how long the viewer watches a particular program and his facial expressions (happy, sad, interested, disinterested, etc.) while watching it.

The central hub (which may be on the cloud or in a self-managed datacenter) then collates this data with the ontology of the program. [This ontology is built by data mining and supervised machine learning on a distributed cluster of computers.] Thus, over time, the central hub builds a profile of the viewer based on his viewing patterns.

Meanwhile, the TV also recognizes the facial expressions of the viewer and looks for signs of boredom (drowsy eyes, long frowns, vacant stares, etc.). As soon as it recognizes a “disinterested pattern”, it calls a web-service running on the central hub.

This web-service searches the ontology of programs currently on air and sorts it according to the viewer’s profile. Then it predicts a list of programs which will most likely interest the viewer and sends it back to the TV.

The TV then suggests to the viewer the programs (s)he can watch if (s)he is not interested in the current one. If the viewer changes the program, the TV informs the central hub about the program switched to, and the central hub adds this information to the viewer's profile.

In case of conflicts, i.e. more than one viewer, the central hub decides on the predicted list of interesting programs using a priority list of the viewers, based on whose choice prevailed the last time such a conflict arose.
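
As an illustration only (the profile format, genre tags and scoring below are my own assumptions, not part of the proposal's specification), the ranking and conflict-merging steps could look roughly like this:

# Rank the programs currently on air against a viewer profile, and merge
# profiles by viewer priority when more than one person is watching.
def rank_programs(profile, on_air):
    """profile: {genre: weight}; on_air: list of {"name": ..., "genres": [...]}"""
    def score(program):
        return sum(profile.get(genre, 0.0) for genre in program["genres"])
    return sorted(on_air, key=score, reverse=True)

def merge_profiles(profiles, priorities):
    """Conflict case: weight each viewer's genre preferences by his/her priority."""
    merged = {}
    for viewer, profile in profiles.items():
        for genre, weight in profile.items():
            merged[genre] = merged.get(genre, 0.0) + priorities[viewer] * weight
    return merged

viewer = {"cricket": 0.9, "news": 0.4, "soaps": 0.1}
on_air = [{"name": "Evening News", "genres": ["news"]},
          {"name": "T20 Highlights", "genres": ["cricket", "sports"]}]
print [p["name"] for p in rank_programs(viewer, on_air)]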

[Figure: architecture overview showing the TV with its camera, the central hub with viewer profiles and the program ontology, and the recommendation web-service.]

Note: If anyone wants to use this idea or wants further clarification then please contact me @ paritosh (dot) gunjan (at) gmail (dot) com.

Install Hadoop on a Single Machine

The best way to learn any new technology, IMHO, is to install it on your local system. You can then explore it and experiment with it at your convenience. It is the route I always take.

I started by installing Hadoop on a single node, i.e. my machine. It was a tricky task, with most of the tutorials making many assumptions. Now that I have completed the install, I can safely say that those were simple assumptions and that anyone familiar with Linux is expected to understand them. Still, I decided to write installation instructions for dummies.

Here is my attempt at comprehensive documentation of how to install Hadoop on your local system. Please let me know if I have missed anything.


Prerequisites


1. Linux

The first and foremost requirement is a PC with Linux installed on it. I used a machine running Ubuntu 9.10. You can also work with Windows, as Hadoop is purely Java-based and will work with any OS that can run a JVM (which in turn means pretty much all modern OSs).

2. Sun Java6

Install Sun Java6 on your Linux machine using:
$ sudo apt-get install sun-java6-bin sun-java6-jre sun-java6-jdk

3. Create a new user "hadoop"

Create a new user hadoop (though it is not strictly required, a dedicated user is recommended in order to separate the Hadoop installation from other software and user accounts running on the same machine). Use the following commands:
$ sudo addgroup hadoop
$ sudo useradd -d /home/hadoop -m hadoop -g hadoop

4. Configure SSH

Install the OpenSSH server on your system:
$ sudo apt-get install openssh-server

Then generate an SSH key for the hadoop user. As the hadoop user, do the following:
$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
1a:38:cd:0c:92:f9:8b:33:f3:a9:8e:dd:41:68:04:dc hadoop@paritosh-desktop
The key's randomart image is:
+--[ RSA 2048]----+
|o .              |
| o E             |
|  = .            |
| . + *           |
|  o = = S        |
| . o o o         |
|    = o .        |
|   o * o         |
|..+.+            |
+-----------------+
$


Then enable SSH access to your local machine with this newly created key:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Test your SSH connection:
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 1e:be:bb:db:71:25:e2:d5:b0:a9:87:9a:2c:43:e3:ae.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux paritosh-desktop 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC 2010 i686
$

Now that the prerequisites are complete, let's go ahead with the Hadoop installation.

Install Hadoop from Cloudera

1. Add repository

Create a new file /etc/apt/sources.list.d/cloudera.list with the following contents, taking care to replace DISTRO with the name of your distribution (find out by running lsb_release -c):
deb http://archive.cloudera.com/debian DISTRO-cdh3 contrib
deb-src http://archive.cloudera.com/debian DISTRO-cdh3 contrib
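
For example, on Ubuntu 9.10 the codename is karmic, so the two lines would read:
deb http://archive.cloudera.com/debian karmic-cdh3 contrib
deb-src http://archive.cloudera.com/debian karmic-cdh3 contrib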
2. Add repository key. (optional)

Add the Cloudera Public GPG Key to your repository by executing the following command:
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
OK
$

This allows you to verify that you are downloading genuine packages.

Note: You may need to install curl:
$ sudo apt-get install curl

3. Update APT package index.

Simply run:
$ sudo apt-get update

4. Find and install packages.

You may now find and install packages from the Cloudera repository using your favourite APT package manager (e.g. apt-get, aptitude, or dselect). For example:
$ apt-cache search hadoop
$ sudo apt-get install hadoop

Setting up a Hadoop Cluster

Here we will try to set up a Hadoop cluster on a single node.

1. Configuration

Copy the hadoop-0.20 directory to the hadoop user's home folder:
$ cd /usr/lib/
$ cp -Rf hadoop-0.20 /home/hadoop/
Also, add the following to your .bashrc and .profile:
# Hadoop home dir declaration
HADOOP_HOME=/home/hadoop/hadoop-0.20
export HADOOP_HOME
Change the following configuration files in the $HADOOP_HOME/conf directory:

1.1 hadoop-env.sh

Change JAVA_HOME to point to your JDK installation directory (not the java binary); for the sun-java6 packages installed above this is typically /usr/lib/jvm/java-6-sun:
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

1.2 core-site.xml

Change your core-site.xml to reflect the following:
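For a single-node (pseudo-distributed) setup, a minimal core-site.xml typically just points the default filesystem at a local NameNode; the port below is a common default, so adjust it to your own setup:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>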

1.3 mapred-site.xml

Change your mapred-site.xml to reflect the following:
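Similarly, a minimal single-node mapred-site.xml typically just points at a local JobTracker (again, the port is a common choice, not a requirement):
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>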

1.4 hdfs-site.xml

Change your hdfs-site.xml to reflect the following:
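And a minimal single-node hdfs-site.xml typically just sets the replication factor to 1, since there is only one DataNode:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>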

2. Format the NameNode

To format the Hadoop Distributed Filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command:
$ $HADOOP_HOME/bin/hadoop namenode -format

Running Hadoop

To start Hadoop, run start-all.sh from the $HADOOP_HOME/bin/ directory.

$ ./start-all.sh
starting namenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-namenode-paritosh-desktop.out
localhost: starting datanode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-datanode-paritosh-desktop.out
localhost: starting secondarynamenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-secondarynamenode-paritosh-desktop.out
starting jobtracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-jobtracker-paritosh-desktop.out
localhost: starting tasktracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-tasktracker-paritosh-desktop.out
$


To check whether all the processes are running fine, run the following:

$ jps
17736 TaskTracker
17602 JobTracker
17235 NameNode
17533 SecondaryNameNode
17381 DataNode
17804 Jps
$

Hadoop... Let's welcome the Elephant!!

While working on my crawler, I came to the conclusion that my poor old system just can't process such a large amount of data. I needed a powerful machine and a huge amount of storage to do it. Unfortunately (or fortunately for me), I don't have that kind of moolah to invest in an enterprise-class server and a huge SAN storage. What began as a pet project was growing into a major logistics headache!!

Then I read about a new open source entrant in the Distributed Computing space called Hadoop. Actually, it's not that new (it has been in development since 2004, when Google's MapReduce paper was published). Yahoo has been using it for a similar purpose: creating the page indexes for Yahoo Web Search. Also, Apache Mahout, a machine learning project from the Apache Foundation, uses Hadoop as its compute horsepower.

Suddenly, I knew Hadoop was the way to go. It uses commodity PCs (#), gives petabytes of storage and the power of Distributed Computing. And the best part about it is that it is FOSS.

You can read more about Hadoop from the following places:

1. Apache Hadoop site.
2. Yahoo Hadoop Developer Network.
3. Cloudera Site.

My next few posts will elaborate more on how Hadoop works.

Note:

# Commodity PC doesn't imply cheap PCs. A typical choice of machine for running a Hadoop datanode and tasktracker in late 2008 would have the following specifications:

  • Processor: 2 quad-core Intel Xeon 2.0GHz CPUs
  • Memory: 8 GB ECC RAM
  • Storage: 4 × 1 TB SATA disks
  • Network: Gigabit Ethernet

Machine Learning: The Crawler

Building the crawler was the easiest part of this project.

All this crawler does is take a seed blog URL (my blog), run through all the links on its front page and store the ones that look like blogpost URLs. I assume that every blog in this world is linked to from at least one other blog; thus, all of them will eventually get indexed if the spider is given enough time and memory.

Here is the code for the crawler. It's in Python and is quite simple. Please run through it and let me know if there is any way to optimise it further:

import re
import urllib2
import urlparse
from pysqlite2 import dbapi2 as sqlite

conn = sqlite.connect('/home/spider/blogSearch.db')
cur = conn.cursor()

# Seed the crawl frontier with the blog URL stored in the first row of blogList.
tocrawl = set()
for row in cur.execute('SELECT * FROM blogList WHERE key=1'):
        tocrawl.add(row[1])

# Pull href targets out of anchor tags. (The original pattern did not survive
# the formatting of this post; this is a simple reconstruction.)
linkregex = re.compile(r"<a\s[^>]*href=['\"](.*?)['\"]", re.IGNORECASE)

while 1:

        try:
                crawling = tocrawl.pop()
        except KeyError:
                break   # the frontier is empty, we are done

        url = urlparse.urlparse(crawling)

        try:
                response = urllib2.urlopen(crawling)
        except:
                continue

        msg = response.read()
        links = linkregex.findall(msg)

        for link in links:
                if link.endswith('.blogspot.com/'):
                        # Turn relative links into absolute URLs.
                        if link.startswith('/'):
                                link = 'http://' + url[1] + link
                        elif link.startswith('#'):
                                link = 'http://' + url[1] + url[2] + link
                        elif not link.startswith('http'):
                                link = 'http://' + url[1] + '/' + link

                        # Queue and store only blogs we have not seen before.
                        cur.execute('SELECT * FROM blogList WHERE url=?', (link,))
                        if cur.fetchone() is None:
                                tocrawl.add(link)
                                cur.execute('INSERT INTO blogList (url) VALUES (?)', (link,))
                                conn.commit()
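
For reference, the crawler expects an existing SQLite database at /home/spider/blogSearch.db containing a blogList table whose first row holds the seed URL. A one-off setup script along these lines (my guess at a minimal schema, with a placeholder seed URL) would create it:

# Create the blogList table the crawler expects and seed it with a starting
# blog URL. The schema is an assumption based on how the crawler reads row[1]
# and filters on key=1; replace the placeholder URL with your own seed blog.
from pysqlite2 import dbapi2 as sqlite

conn = sqlite.connect('/home/spider/blogSearch.db')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS blogList '
            '(key INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT)')
cur.execute('INSERT INTO blogList (url) VALUES (?)',
            ('http://yourblog.blogspot.com/',))   # placeholder seed URL
conn.commit()
conn.close()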