Machine Learning : The Crawler

Building the crawler was the easiest part of this project.

All this crawler does is take a seed blog URL (my blog), run through all the links on its front page and store the ones that look like a blog URL. I assume that every blog in the world is linked to from at least one other blog. Thus all of them will eventually get indexed if the spider is given enough time and memory.

This is the code for the crawler. It's in Python and is quite simple. Please run through it and let me know if there is any way to optimise it further:

import re
import urllib2
import urlparse
from pysqlite2 import dbapi2 as sqlite

conn = sqlite.connect('/home/spider/blogSearch.db')
cur = conn.cursor()

# Seed the crawl frontier with the URL stored against key=1 (my own blog).
tocrawl = set()
for row in cur.execute('SELECT * FROM blogList WHERE key=1'):
        tocrawl.add(row[1])

# Pull the href target out of every anchor tag on the page.
linkregex = re.compile(r'<a\s[^>]*href=[\'"](.*?)[\'"]', re.IGNORECASE)

while tocrawl:
        crawling = tocrawl.pop()
        url = urlparse.urlparse(crawling)

        try:
                response = urllib2.urlopen(crawling)
        except Exception:
                # Unreachable or malformed URL; skip it and move on.
                continue

        msg = response.read()
        links = linkregex.findall(msg)

        for link in links:
                # Keep only links that look like a Blogspot blog's front page.
                if link.endswith('.blogspot.com/'):
                        # Resolve relative links against the page being crawled.
                        if link.startswith('/'):
                                link = 'http://' + url[1] + link
                        elif link.startswith('#'):
                                link = 'http://' + url[1] + url[2] + link
                        elif not link.startswith('http'):
                                link = 'http://' + url[1] + '/' + link

                        # Store and queue the link only if we have not seen it before.
                        cur.execute('SELECT 1 FROM blogList WHERE url=?', (link,))
                        if cur.fetchone() is None:
                                tocrawl.add(link)
                                cur.execute('INSERT INTO blogList (url) VALUES (?)', (link,))
                                conn.commit()

Machine Learning : Emotionally Intelligent Search Engines

When we think of machines being emotionally intelligent, we tend to see a vast problem with seemingly no plausible solution. However, as I stated in my post about Machine Learning and the Semantic Web, it can be broken down into smaller, manageable problems. And that is what I intend to do.

In this post I present an introduction to a small step towards that solution. This solution, if implemented properly (which I hope it will be), will allow a machine to return emotionally intelligent search results.

I assume that anyone reading this article knows how a search engine works, knows the Python programming language, and understands how databases work.

Statement of Problem :

"Build a emotional intelligent search engine which does the following tasks:

  1. Crawls the web for blogposts and downloads them.
  2. Indexes them according to their emotional quotient.
  3. Returns results according to the abstract questions asked by the end user."
Proposed solution :

The proposed proof of concept (POC) consists of:
  1. A crawler which crawls the blogosphere, collecting links to all the blogs out there. I intend to target only the blogs on Blogspot. These links are sanitised and stored in a database. Then another crawler cum 'blog-getter' looks into this database, picks up each blog one by one and downloads all the posts in it. It stores the posts as flat files and their metadata in a database.
  2. A lexicon builder parses these posts one by one and builds up its vocabulary of emotional words (essentially all the adjectives in the posts) with the help of a human tutor.
  3. An indexer then parses all these posts, cross-referencing them with the lexicon built in the previous step, and stores the parsed words in a database. It stores nouns and adjectives in separate tables, referencing them to the metadata table built in the first step (a rough sketch of this schema follows this list).
  4. When the end user enters a search query, say "a nice Chinese restaurant", the query parser first finds all posts about 'Chinese restaurant'. From these results it finds the ones that are nice (in its logic, 'nice' means posts where people were happy and used 'happy' adjectives when describing such restaurants). It then displays the results in descending order of 'niceness' (a rough ranking sketch also follows this list).
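To make steps 1 to 3 a little more concrete, here is a minimal sketch of the kind of schema the indexer could write into. The table and column names (postMeta, lexicon, postNouns, postAdjectives) are just my working assumptions for this POC, not the final design:

import sqlite3

conn = sqlite3.connect('/home/spider/blogSearch.db')
cur = conn.cursor()

# One row per downloaded post: which blog it came from and where its flat file lives.
cur.execute('''CREATE TABLE IF NOT EXISTS postMeta (
                   postId   INTEGER PRIMARY KEY,
                   blogUrl  TEXT,
                   postUrl  TEXT UNIQUE,
                   filePath TEXT)''')

# The emotional lexicon built with the tutor's help in step 2:
# an adjective and a score (say +1 for 'happy' adjectives, -1 for 'unhappy' ones).
cur.execute('''CREATE TABLE IF NOT EXISTS lexicon (
                   word  TEXT PRIMARY KEY,
                   score REAL)''')

# Nouns and adjectives found in each post, referencing the metadata table.
cur.execute('''CREATE TABLE IF NOT EXISTS postNouns (
                   postId INTEGER REFERENCES postMeta(postId),
                   noun   TEXT)''')
cur.execute('''CREATE TABLE IF NOT EXISTS postAdjectives (
                   postId    INTEGER REFERENCES postMeta(postId),
                   adjective TEXT)''')

conn.commit()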
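And here is a rough sketch of how the query parser in step 4 could rank posts, assuming the hypothetical schema above (it reuses the cur cursor from that sketch): find the posts whose nouns match the query terms, then order them by the average lexicon score of the adjectives used in those posts. The real parser will need proper part-of-speech tagging; this only shows the ranking idea:

def rank_posts(cur, query_nouns):
    # Find posts that mention every query noun and rank them by the average
    # emotional score of the adjectives used in the post ('niceness').
    placeholders = ','.join('?' * len(query_nouns))
    cur.execute('''SELECT m.postUrl, AVG(l.score) AS niceness
                   FROM postMeta m
                   JOIN postNouns n ON n.postId = m.postId
                   JOIN postAdjectives a ON a.postId = m.postId
                   JOIN lexicon l ON l.word = a.adjective
                   WHERE n.noun IN (%s)
                   GROUP BY m.postId
                   HAVING COUNT(DISTINCT n.noun) = ?
                   ORDER BY niceness DESC''' % placeholders,
                tuple(query_nouns) + (len(query_nouns),))
    return cur.fetchall()

# "a nice Chinese restaurant" boils down to the nouns 'chinese' and 'restaurant'.
for post_url, niceness in rank_posts(cur, ['chinese', 'restaurant']):
    print post_url, niceness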
Implementation

I sincerely believe that this is an absolutely doable project, and I have made some progress with building the lexicon. As this is an ongoing project, I will be posting my experiences and the code as and when I finish testing it.

The coding will be in Python. Why? Because I am in love with the language. Period.

The database, as of now, is SQLite, but I am planning to move to a NoSQL store as soon as I get a firm grasp on the technology.

The platform right now is a single Linux box. I know it will require a lot more computing power, storage and memory when deployed; I am planning to port this to Hadoop when it reaches the deployment stage. But for now, this remains my baby and will have its place on my desktop :).

Supervised Machine Learning

Introduction

When I was born, I was ignorant but very curious. I always asked questions: "What? Where? Why? How?" And my parents guided me through this jungle of knowledge, telling me about the things that really mattered and leaving what was complicated for later.

As I grew up, I started learning about more complex things by cross-referencing them with my existing knowledge. If new incidents were not in my knowledge database, I queried books, articles and, lately, Google. Thus from an ignorant child with an empty knowledge database, I grew up to become a knowledgeable person.

Let me take an example of how I learned English.

I am from India; English is neither my nor my parents' first language. However, they really wanted me to learn how to read, write and speak in English. They couldn't teach me by changing their mother tongue. What they did instead was teach me the alphabet, a few easy words and the basic grammar constructs. Then they gave me a dictionary and a grammar book. I referenced new words that I saw against the dictionary and new sentence formations against the grammar book. If I still couldn't understand what was written, I asked my teacher.

Now when I come across anything written in English, I cross-reference it with the vocabulary and grammar that I have learned over all these years. If something new comes up, I check it online or with my peers. Thus I build on my knowledge of English in an ongoing, continuous process.

Implementation in Machine Learning

Now, building on this very basic pattern of how any human child learns about the physical world, we can model this learning process for machines. This is a brief outline of the idea (here I take the example of the English language, but it is equally applicable to any other natural language or any other real-world entity); a toy sketch of the learning loop follows the outline:

"In the same way as I learnt about the world, Machines can also be
taught. In the same way that we teach a child how to read and
understand the English Language, a computer can be taught to recognize
patterns of English alphabets as words and can reference its meaning
with its dictionary (which is built by human assistance). The machine
can gradually build up its vocabulary by reading more and finding the
meanings of new words and sentence constructs online. If it finds
something which is complex enough, it can ask its human tutor its
meaning.

Given the amount of cheap storage, memory and computing power we have
in our hands today, I think it is a perfectly doable thing. I just
need time and resources to do it."
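As promised above, here is a toy sketch of that 'read, look up, ask the tutor' loop. Nothing here is part of the project code yet; the dictionary lookup is a stub standing in for an online dictionary, and the tutor is simply whoever is sitting at the terminal:

def lookup_in_dictionary(word):
    # Stub for an online dictionary; pretend we only know a couple of words so far.
    known = {'happy': 'feeling or showing pleasure',
             'restaurant': 'a place where people pay to eat'}
    return known.get(word)

def human_tutor(word):
    # The human tutor: just ask the person at the keyboard.
    return raw_input("What does '%s' mean? " % word)

def learn_from_text(text, vocabulary, tutor):
    # Read a text, look up unknown words, and fall back to the human tutor.
    # 'vocabulary' is a dict of word -> meaning that grows as the machine reads.
    for word in text.lower().split():
        word = word.strip('.,!?;:"\'')
        if not word or word in vocabulary:
            continue
        meaning = lookup_in_dictionary(word)
        if meaning is None:
            meaning = tutor(word)      # the machine asks its tutor
        vocabulary[word] = meaning     # the machine's knowledge database grows

vocabulary = {}
learn_from_text("The happy child ate at the restaurant.", vocabulary, human_tutor)
print vocabulary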

Value of this research

Why is it important for a machine to learn a natural language? The answer is simple: if a machine learns a natural language, it can mine the enormous amount of data on the internet to learn about any other discipline. We will provide the world with the perfect student. It can learn financial modelling from a human tutor and apply it to streaming financial data from the internet to predict future trends, maybe even produce its own models. It can learn biotechnology from its human tutor and mine the huge amount of pathological data available on the internet to help produce new medicines. The application possibilities seem to be bounded only by human imagination.

Current Status of my Research

Right now, I am working on a search engine which crawls the blogosphere and indexes blogposts according to their emotional quotient. It's a small step in the final scheme of things, but it is what I can do with the time and resources that I have (I have a '9 to 6' day job as a software engineer). I would love to devote more time to this idea, provided the mundane activities (like earning my bread and butter) are taken care of :-).

I am not sure if this idea has been implemented anywhere before. Also, I acknowledge it is a very simple solution to a seemingly very complex problem. However, I feel it is the most suitable one. If we humans can do it, then so can the machines.