Here, in this post I present the introduction to a small step towards this solution. This solution, if implemented properly (which I hope it will be), will allow a machine to return emotionally intelligent search results.
I assume that anyone reading this article knows the way a search engine works, the Python programming language and how databases work.
Statement of Problem :
"Build a emotional intelligent search engine which does the following tasks:
- Crawls the web for blogposts and downloads them.
- Indexes them according to their emotional quotient.
- Returns results according to the abstract questions asked by the end user. "
The proposed solution POC consists of :
- A crawler which crawls the blogosphere collecting links of all the blogs out there. I intend to target only the blogs on blogspot. These links are sanitised and stored in a database. Then another crawler cum 'blog-getter' looks into this database, picks up each blog one by one and then downloads all the posts in them. It stores the posts in flat file format and stores its metadata in a database.
- A lexicon builder parses these posts one by one and builds upon its vocabulary of emotional words (fundamentally all the adjectives in the posts) with the help of a human tutor.
- An indexer then parses all these posts, cross referencing them with its lexicon (built in the previous step) and then stores the parsed words into a db. It stores nouns and adjectives in separate tables referencing them to the metadata table (built in the first step).
- When the end user enters a search query, suppose "a nice Chinese restaurant", the query parser first finds all posts about 'Chinese restaurant'. From these results it finds the ones that are nice (according to its logic 'nice' means where people were happy and used 'happy' adjectives when they described such restaurants in their posts). It then displays the results in descending order of 'niceness'.
I sincerely believe that this is an absolutely doable project, and I have made some progress with building the lexicon. As this is an ongoing project, I will be posting my experiences and the codes as and when I complete testing them.
The coding will be in Python. Why? Because I am in love with the language. Period.
The database, as of now is SQLLite, but I am planning to convert it into a NoSQL as soon as I get a firm grasp on the technology.
Platform right now is a single Linux box. I know it will require a lot more computing power, storage and memory when deployed. I am planning to port this to Hadoop when it comes to deployment stages. But for now, this remains my baby and will have its place in my desktop :).
0 comments:
Post a Comment