Paritosh the Geek: 2010

Automatic program recommendation using Facial Expression Recognition

More often than not, people get bored of the program they are watching. However, they dread scanning through hundreds of channels to find out whether anything interesting is on, so they end up switching off the TV instead. This idea proposes a solution to these situations: the TV suggests to the viewer what’s interesting on air right now.

The TV, using a camera attached to it, first recognizes the viewer using “Face Recognition” techniques. It then relays to a central hub how long the viewer watches a particular program, along with the facial expressions (happy, sad, interested, disinterested, etc.) observed while watching it.

The central hub (which may be on the cloud or in a self managed datacenter) then collates this data with the ontology of the program. [This ontology is built by data mining and supervised machine learning using a distributed cluster of computers.] Thus, over time, the central hub builds a profile of the viewer according to the viewer’s viewing pattern.

Meanwhile, the TV keeps tracking the viewer's facial expression and looks for signs of boredom (drowsy eyes, long frowns, vacant stares, etc.). As soon as it recognizes a “disinterested pattern”, it calls a web-service running on the central hub.

This web-service searches the ontology of programs currently on air and sorts it according to the viewer’s profile. Then it predicts a list of programs which will most likely interest the viewer and sends it back to the TV.
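The sorting step can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the `Program` class, the genre sets standing in for the program ontology, and the profile weights are all hypothetical, and a real profile would be learned from viewing time and facial expressions.

```python
from dataclasses import dataclass


@dataclass
class Program:
    title: str
    genres: set  # hypothetical stand-in for the program's ontology entry


def rank_programs(on_air, profile, top_n=3):
    """Sort on-air programs by how well their genres match the profile.

    `profile` maps a genre to an interest weight (higher means more
    interesting); genres missing from the profile count as 0.
    """
    def score(program):
        return sum(profile.get(genre, 0.0) for genre in program.genres)

    ranked = sorted(on_air, key=score, reverse=True)
    return [program.title for program in ranked[:top_n]]


# Hypothetical data, for illustration only.
on_air = [
    Program("Evening News", {"news"}),
    Program("Cup Final", {"sports", "live"}),
    Program("Space Documentary", {"science", "documentary"}),
]
profile = {"science": 0.9, "documentary": 0.6, "sports": 0.4, "news": 0.1}
print(rank_programs(on_air, profile, top_n=2))
```

The list returned by `rank_programs` is what the web-service would send back to the TV.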

The TV then suggests to the viewer the programs they can watch instead of the current one. If the viewer changes the program, the TV tells the central hub which program they switched to, and the central hub adds this information to the viewer's profile.

In case of conflicts, i.e. more than one viewer watching, the central hub decides on the predicted list of interesting programs using a priority list of the viewers, based on whose choice prevailed the last time such a conflict arose.
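One possible reading of this conflict rule, as a rough Python sketch. The function names, the viewer names and the "move the winner to the front" update are my assumptions for illustration, not a fixed design.

```python
def resolve_conflict(viewers, priority, recommendations):
    """Merge per-viewer recommendation lists, highest-priority viewer first.

    viewers: names of the viewers currently in front of the TV.
    priority: all known viewers, ordered from highest to lowest priority.
    recommendations: viewer name -> ranked list of program titles.
    """
    present = [v for v in priority if v in viewers]
    merged = []
    for viewer in present:
        for title in recommendations.get(viewer, []):
            if title not in merged:
                merged.append(title)
    return merged


def record_winner(priority, winner):
    """Assumed update rule: whoever's choice prevailed moves to the front."""
    return [winner] + [v for v in priority if v != winner]


# Hypothetical viewers and recommendations.
priority = ["asha", "ravi"]
recommendations = {
    "asha": ["Cup Final", "Evening News"],
    "ravi": ["Cartoons", "Cup Final"],
}
print(resolve_conflict({"asha", "ravi"}, priority, recommendations))
print(record_winner(priority, "ravi"))
```

If ravi's choice wins this time, `record_winner` puts ravi ahead of asha for the next conflict.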

This figure describes the architecture in a simple yet concise manner:

Note: If anyone wants to use this idea or wants further clarification then please contact me @ paritosh (dot) gunjan (at) gmail (dot) com.

Install Hadoop on a Single Machine

In order to learn any new technology, the best way IMHO is to install it on your local system. You can then explore it and experiment on it at your convenience. It is the route I always take.

I started by installing Hadoop on a single node, i.e. my machine. It was a tricky task, because most of the tutorials make many assumptions. Now that I have completed the install, I can safely say that those were simple assumptions that anyone familiar with Linux would be expected to understand. Still, I decided to write installation instructions for dummies.

Here is the most comprehensive documentation of "How to install Hadoop on your local system". Please let me know if I have missed anything.

Prerequisites

1. Linux

The first and foremost requirement is a PC with Linux installed on it. I used a machine running Ubuntu 9.10. You can also work with Windows, as Hadoop is purely Java-based and will work with any OS that can run a JVM (which in turn means pretty much every modern OS).

2. Sun Java6

Install the Sun Java6 on your Linux machine using:

$ sudo apt-get install sun-java6-bin sun-java6-jre sun-java6-jdk

3. Create a new user "hadoop"

Create a new user hadoop. Though it is not required, a dedicated user is
recommended in order to separate the Hadoop installation from other software
applications and user accounts running on the same machine.
Use the following commands:

$ sudo addgroup hadoop
$ sudo useradd -d /home/hadoop -m hadoop -g hadoop

4. Configure SSH

Install the OpenSSH server on your system:

$ sudo apt-get install openssh-server

Then generate an SSH key for the hadoop user. As the hadoop user do the
following:

$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
1a:38:cd:0c:92:f9:8b:33:f3:a9:8e:dd:41:68:04:dc hadoop@paritosh-desktop
The key's randomart image is:
+--[ RSA 2048]----+
|o . |
| o E |
| = . |
| . + * |
| o = = S |
| . o o o |
| = o . |
| o * o |
|..+.+ |
+-----------------+
$

Then enable SSH access to your local machine with this newly created key:

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Test your SSH connection:

$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 1e:be:bb:db:71:25:e2:d5:b0:a9:87:9a:2c:43:e3:ae.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux paritosh-desktop 2.6.31-20-generic #58-Ubuntu SMP Fri Mar 12 05:23:09 UTC
2010 i686
$

Now that the prerequisites are complete, let's go ahead with the Hadoop
installation.

Install Hadoop from Cloudera

1. Add repository

Create a new file /etc/apt/sources.list.d/cloudera.list with the following
contents, taking care to replace DISTRO with the name of your distribution (find
out by running lsb_release -c):

deb http://archive.cloudera.com/debian DISTRO-cdh3 contrib
deb-src http://archive.cloudera.com/debian DISTRO-cdh3 contrib
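To make the DISTRO substitution concrete, here is a small Python sketch that renders the file contents for a given codename. The helper name `cloudera_sources` is made up for this example. On Ubuntu 9.10 the codename reported by lsb_release -c is karmic.

```python
def cloudera_sources(codename):
    """Return the contents of cloudera.list for a distribution codename."""
    base = "http://archive.cloudera.com/debian"
    return (
        f"deb {base} {codename}-cdh3 contrib\n"
        f"deb-src {base} {codename}-cdh3 contrib\n"
    )


# "karmic" is the codename of Ubuntu 9.10.
print(cloudera_sources("karmic"), end="")
```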

2. Add repository key. (optional)

Add the Cloudera Public GPG Key to your repository by executing the following
command:

$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
OK
$

This allows you to verify that you are downloading genuine packages.

Note: You may need to install curl:
$ sudo apt-get install curl

3. Update APT package index.

Simply run:

$ sudo apt-get update

4. Find and install packages.

You may now find and install packages from the Cloudera repository using your favorite APT package manager (e.g. apt-get, aptitude, or dselect). For example:

$ apt-cache search hadoop
$ sudo apt-get install hadoop

Setting up a Hadoop Cluster

Here we will set up a Hadoop cluster on a single node.

1. Configuration

Copy the hadoop-0.20 directory to the hadoop home folder.

$ cd /usr/lib/
$ cp -Rf hadoop-0.20 /home/hadoop/

Also, add the following to your .bashrc and .profile

# Hadoop home dir declaration
HADOOP_HOME=/home/hadoop/hadoop-0.20
export HADOOP_HOME

Change the following in the configuration files under the $HADOOP_HOME/conf directory:

1.1 hadoop-env.sh

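Set the Java home to the directory where your JDK is installed (the installation directory, not the path to the java binary). With the Sun Java 6 packages installed above, this is typically:

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun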

1.2 core-site.xml

Change your core-site.xml to reflect the following:

1.3 mapred-site.xml

Change your mapred-site.xml to reflect the following:

1.4 hdfs-site.xml

Change your hdfs-site.xml to reflect the following:
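As a hedged sketch of what these three files usually contain on a single node, the Python helper below renders a Hadoop configuration document from a dict. The property names (fs.default.name, mapred.job.tracker, dfs.replication) are standard Hadoop 0.20 settings. The values (localhost ports 9000 and 9001, replication factor 1) are assumptions borrowed from the Apache pseudo-distributed example, and the helper name `hadoop_site_xml` is made up, so adjust them to your setup.

```python
from xml.sax.saxutils import escape


def hadoop_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml document."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(str(value))}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


# Assumed values for a single-node (pseudo-distributed) setup.
SITE_FILES = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
    "hdfs-site.xml": {"dfs.replication": 1},
}

for filename, properties in SITE_FILES.items():
    print(f"--- {filename} ---")
    print(hadoop_site_xml(properties), end="")
```

Each printed section can be pasted into the file of the same name under $HADOOP_HOME/conf.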

2. Format the NameNode

To format the Hadoop Distributed filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command:

$ $HADOOP_HOME/bin/hadoop namenode -format

Running Hadoop

To start Hadoop, run start-all.sh from the $HADOOP_HOME/bin/ directory.

$ ./start-all.sh

starting namenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-namenode-paritosh-desktop.out

localhost: starting datanode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-datanode-paritosh-desktop.out

localhost: starting secondarynamenode, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-secondarynamenode-paritosh-desktop.out

starting jobtracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-jobtracker-paritosh-desktop.out

localhost: starting tasktracker, logging to /home/hadoop/hadoop-0.20/bin/../logs/hadoop-hadoop-tasktracker-paritosh-desktop.out

$

To check whether all the processes are running fine, run the following:

$ jps

17736 TaskTracker

17602 JobTracker

17235 NameNode

17533 SecondaryNameNode

17381 DataNode

17804 Jps

$
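If you would rather not eyeball the list, a short Python sketch can check the jps output for the five daemons shown above. The function name `missing_daemons` and the sample text are illustrative only.

```python
EXPECTED_DAEMONS = {
    "NameNode", "DataNode", "SecondaryNameNode", "JobTracker", "TaskTracker",
}


def missing_daemons(jps_output):
    """Return the expected Hadoop daemons that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each non-empty line looks like "<pid> <ClassName>".
        if len(parts) >= 2:
            running.add(parts[1])
    return EXPECTED_DAEMONS - running


# Sample output with two daemons deliberately left out.
sample = """17736 TaskTracker
17602 JobTracker
17235 NameNode
17804 Jps
"""
print(sorted(missing_daemons(sample)))
```

To check a live system, pass in the text captured from running jps (for example via subprocess.run(["jps"], capture_output=True, text=True).stdout). An empty result means all five processes are up.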

Hadoop... Let's welcome the Elephant!!

While working on my crawler, I came to the conclusion that my poor old system cannot process such a large amount of data. I needed a powerful machine and a huge amount of storage to do it. Unfortunately (or fortunately for me), I don't have the kind of moolah to invest in an enterprise-class server and a huge SAN storage array. What began as a pet project was growing into a major logistics headache!!

Then I read about an open source entrant into the distributed computing space called Hadoop. Actually, it's not a new entrant (it has been in development since 2004, when Google's MapReduce paper was published). Yahoo has been using it for a similar purpose, building the page indexes for Yahoo Web Search. Also, Apache Mahout, a machine learning project from the Apache Software Foundation, uses Hadoop as its compute horsepower.

Suddenly, I knew Hadoop was the way to go. It runs on commodity PCs (#), offers petabytes of storage and brings the power of distributed computing. And the best part about it is that it is FOSS.

You can read more about Hadoop from the following places:

1. Apache Hadoop site.
2. Yahoo Hadoop Developer Network.
3. Cloudera Site.

My next few posts will elaborate on how Hadoop works.

Note:

# Commodity PCs are not necessarily cheap PCs. A typical choice of machine for running a Hadoop datanode and tasktracker in late 2008 would have the following specifications:

Processor: 2 quad-core Intel Xeon 2.0GHz CPUs
Memory: 8 GB ECC RAM
Storage: 4 × 1 TB SATA disks
Network: Gigabit Ethernet

Machine Learning : The Crawler

Building the crawler was the easiest part of this project.

All this crawler does is take a seed blog URL (my blog), run through all the links on its front page and store the ones that look like a blog URL. I assume that every blog in the world is linked to from at least one other blog. Thus all of them will eventually get indexed, if the spider is given enough time and memory.

This is the code for the crawler. It's in Python and is quite simple. Please run through it and let me know if there is any way to optimise it further:

import re
import sqlite3
import urllib.parse
import urllib.request

conn = sqlite3.connect('/home/spider/blogSearch.db')
cur = conn.cursor()

# Seed the crawl queue with the URL stored under key=1 (the seed blog).
tocrawl = set()
for row in cur.execute('SELECT * FROM blogList WHERE key=1'):
    tocrawl.add(row[1])

# Capture the href value of every anchor tag.
# The pattern below is an assumed, typical choice.
linkregex = re.compile(r'<a\s[^>]*href=[\'"](.*?)[\'"]', re.IGNORECASE)

# Keep crawling until the queue is empty.
while tocrawl:
    crawling = tocrawl.pop()
    url = urllib.parse.urlparse(crawling)

    try:
        response = urllib.request.urlopen(crawling)
    except Exception:
        # Skip pages that cannot be fetched.
        continue

    msg = response.read().decode('utf-8', errors='ignore')

    for link in linkregex.findall(msg):
        # Keep only links that look like a blogspot front page.
        if not link.endswith('.blogspot.com/'):
            continue

        # Resolve relative links against the page being crawled.
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link

        # Queue and store the link only if it is not in the database yet.
        # Parameterized queries avoid breaking on quotes inside a URL.
        cur.execute('SELECT 1 FROM blogList WHERE url=?', (link,))
        if cur.fetchone() is None:
            tocrawl.add(link)
            cur.execute('INSERT INTO blogList (url) VALUES (?)', (link,))
            conn.commit()

Machine Learning : Emotional Intelligent Search Engines

When we think of machines being emotionally intelligent, we tend to see a vast problem with no plausible solution in sight. However, as I stated in my post about Machine Learning and Semantic Web, it can be broken down into smaller, manageable problems. And I intend to do that.

Here, in this post I present the introduction to a small step towards this solution. This solution, if implemented properly (which I hope it will be), will allow a machine to return emotionally intelligent search results.

I assume that anyone reading this article knows the way a search engine works, the Python programming language and how databases work.

Statement of Problem :

"Build an emotionally intelligent search engine which does the following tasks:

Crawls the web for blogposts and downloads them.
Indexes them according to their emotional quotient.
Returns results according to the abstract questions asked by the end user."

Proposed solution :

The proposed proof of concept (POC) consists of:

1. A crawler which crawls the blogosphere collecting links to all the blogs out there. I intend to target only the blogs on blogspot. These links are sanitised and stored in a database. Then another crawler cum 'blog-getter' looks into this database, picks up the blogs one by one and downloads all their posts. It stores the posts as flat files and their metadata in a database.
2. A lexicon builder which parses these posts one by one and builds up its vocabulary of emotional words (fundamentally all the adjectives in the posts) with the help of a human tutor.
3. An indexer which parses all these posts, cross referencing them with the lexicon (built in the previous step), and stores the parsed words in a database. It stores nouns and adjectives in separate tables, referencing the metadata table (built in the first step).
4. A query parser. When the end user enters a search query, say "a nice Chinese restaurant", the query parser first finds all posts about 'Chinese restaurant'. From these results it picks the ones that are nice (by its logic, 'nice' means that people were happy and used 'happy' adjectives when they described such restaurants in their posts). It then displays the results in descending order of 'niceness'.
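The query step can be sketched in Python as follows. Everything here is a hypothetical stand-in: the tiny lexicon with hand-picked scores, the in-memory index (the real one would live in database tables) and the `search` function are only meant to show the idea of filtering by topic and then sorting by 'niceness'.

```python
# Hypothetical lexicon: adjective -> emotional score (positive means happy).
LEXICON = {
    "nice": 1.0, "delicious": 1.0, "friendly": 0.5,
    "rude": -1.0, "bland": -0.5,
}

# Hypothetical index: post id -> nouns and adjectives found by the indexer.
INDEX = {
    "post-1": {"nouns": {"chinese", "restaurant"},
               "adjectives": ["delicious", "friendly"]},
    "post-2": {"nouns": {"chinese", "restaurant"},
               "adjectives": ["rude", "bland"]},
    "post-3": {"nouns": {"italian", "restaurant"},
               "adjectives": ["nice"]},
}

STOPWORDS = {"a", "an", "the"}


def search(query):
    """Find posts on the query's topic and rank them by emotional score."""
    words = [w for w in query.lower().split() if w not in STOPWORDS]
    mood = sum(LEXICON[w] for w in words if w in LEXICON)
    topic = {w for w in words if w not in LEXICON}

    results = []
    for post_id, entry in INDEX.items():
        # Keep only posts that mention every topic word.
        if topic <= entry["nouns"]:
            score = sum(LEXICON.get(adj, 0.0) for adj in entry["adjectives"])
            results.append((post_id, score))

    # A positive query word ("nice") ranks the happiest posts first.
    results.sort(key=lambda item: item[1], reverse=mood >= 0)
    return results


print(search("a nice Chinese restaurant"))
```

Here post-3 is dropped because it is not about a Chinese restaurant, and the remaining posts are ordered by the summed scores of their adjectives.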

Implementation

I sincerely believe that this is an absolutely doable project, and I have made some progress with building the lexicon. As this is an ongoing project, I will be posting my experiences and the code as and when I complete testing it.

The coding will be in Python. Why? Because I am in love with the language. Period.

The database, as of now, is SQLite, but I am planning to move to a NoSQL store as soon as I get a firm grasp on the technology.

The platform right now is a single Linux box. I know it will require a lot more computing power, storage and memory when deployed. I am planning to port this to Hadoop when it comes to the deployment stages. But for now, this remains my baby and will have its place on my desktop :).

Supervised Machine Learning

Introduction

When I was born, I was ignorant but very curious. I always asked
questions : "What? Where? Why? How?" And my parents guided me through
this jungle of knowledge, telling me about things that really mattered
and leaving what was complicated to the future.

As I grew up, I started learning about more complex things by
cross referencing them with my existing knowledge. If something new
was not in my knowledge database, I queried books, articles and,
lately, Google. Thus from an ignorant child with an empty
knowledge database, I grew up to become a knowledgeable person.

Let me take an example of how I learned English.

I am from India, and English is neither my nor my parents' first
language. However, they really wanted me to learn how to read, write
and speak English. They couldn't teach me by changing their mother
tongue. What they did, instead, was teach me the alphabet, a few
easy words and the basic grammar constructs. Then they gave me a
dictionary and a grammar book. I looked up new words that I saw in
the dictionary and new sentence formations in the grammar book. If I
still couldn't understand what was written, I asked my teacher.

Now when I come across anything written in English, I cross reference
it with the vocabulary and grammar that I have learned over all these
years. If something new comes up, I look it up online or ask my
peers. Thus I build upon my knowledge of English in an ongoing,
continuous process.

Implementation in Machine Learning

Now, building on this very basic pattern of how any human child learns
about the physical world, we can model this learning process for
machines. This is a brief outline of the idea (here I take the example
of the English language, but it is equally applicable to any other
natural language or any other real world entity):

"In the same way as I learnt about the world, machines can also be
taught. Just as we teach a child how to read and understand the
English language, a computer can be taught to recognize patterns of
English letters as words and look up their meanings in its
dictionary (which is built with human assistance). The machine
can gradually build up its vocabulary by reading more and finding the
meanings of new words and sentence constructs online. If it finds
something too complex, it can ask its human tutor what it
means.

Given the amount of cheap storage, memory and computing power we have
in our hands today, I think it is a perfectly doable thing. I just
need time and resources to do it."
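The learning loop described above can be sketched in Python. This is a toy illustration under stated assumptions: the lexicon is a plain dict, and `ask_tutor` is a hypothetical callback standing in for the human tutor (it could just as well prompt a person for input).

```python
def learn_words(text, lexicon, ask_tutor):
    """Look up each word; ask the tutor about unknown ones and remember them.

    Returns the list of words that were new to the lexicon.
    """
    learned = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if not word or word in lexicon:
            continue
        # Unknown word: ask the tutor once and store the answer.
        lexicon[word] = ask_tutor(word)
        learned.append(word)
    return learned


# Hypothetical starting vocabulary and tutor answers.
lexicon = {"the": "article", "food": "noun"}
tutor_answers = {"was": "verb", "delicious": "adjective"}

new_words = learn_words("The food was delicious.", lexicon, tutor_answers.get)
print(new_words)
print(len(lexicon))
```

A second pass over the same sentence learns nothing new, because every word is now in the lexicon.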

Value of this research

Why is it important for a machine to learn a natural language? The
answer is simple: if a machine learns a natural language, it can mine
the enormous amount of data on the internet to learn about any other
discipline. We will provide the world with the perfect student. It can
learn financial modelling from a human tutor and apply it to the
streaming financial data from the internet to predict future trends.
Maybe even produce its own models. It can learn biotechnology from its
human tutor and mine the huge amount of pathological data available on
the internet to produce new medicines. The possible applications seem
to be bound only by human imagination.

Current Status of my Research

Right now, I am working on a search engine which crawls the
blogosphere and indexes blogposts according to their emotional
quotient. It's a small step in the final scheme of things, but this is
what I can do with the time and resources that I have (I have a '9 to
6' day job as a software engineer). I would love to devote more time
to this idea, provided the mundane activities (like earning my bread
and butter) are taken care of :-).

I am not sure if this idea has been implemented previously anywhere.
Also, I acknowledge it is a very simple solution to a seemingly very
complex problem. However, I feel that this is the most suitable
solution. If we humans can do it then so can the machines.
