Tuesday, May 10, 2016

Hinduism - Science or Superstition?

I am a Hindu by birth and spent most of my life in ignorance, as many others do today. I couldn't understand why this one religion has created so much confusion and conflict throughout history.

I started this journey of understanding life with the divine grace of Jagadguru Sri Adi Shankaracharya. The guru, who is nothing less than an ocean of oceans, paved the way to knowledge of "The Supreme Truth". It is this supreme truth that revealed to me that the teachings of every religion are one and the same, and that one should respect every religion as one's own. On this special day of my guru's Jayanthi, and in this Brahma Mahotsava, I am commencing my write-up.

Tatvam Asi

The 2 sons of Shiva

Lord Ganesha and Karthikeya are the 2 sons of Shiva and his consort Parvati. Ganesha is plump and elephant-headed, while Karthikeya is a charming warrior. Most of us have heard the stories of their birth, and I will briefly retell them before analyzing the science behind them.

Goddess Parvati wished to bathe, and created Lord Ganesha out of the turmeric paste applied to her body. He was posted as a guard at the entrance, to let no one in until his mother finished her bath. When Lord Shiva appeared there, Ganesha did not recognize him and stopped him at the entrance. All words were in vain, and Shiva had to cut off Ganesha's human head, only to restore it with an elephant's head later. Ganesha has a big belly and nothing can satisfy his hunger. One day after a heavy meal, as he was travelling on his mouse, he fell to the ground and his stomach burst open. The moon laughed at him, and Ganesha had to tie a snake around his belly. Ganesha cursed the moon that no one would look at it on this day. He has two wives - Riddhi and Siddhi. Mother Parvati always prefers to stay with Ganesha rather than with her other son. No worship is successful in Hinduism without starting it with Ganesha. He is the obstacle remover, or Vigna Vinashaka.

There lived an asura (demon) by the name of "Tarakasura", who was so powerful that his powers were felt in all the 3 worlds. He had a boon that only a son of Shiva could kill him. He thought this could never happen, as Shiva is an ascetic. Lord Indra, who rules the heavens, sent Agni Deva (the fire god) to remind Shiva of the need for a son. Shiva told Agni that there is no womb in this world that could bear his child (not even that of goddess Shakti/Parvati). Saying so, he created a seed out of his 3rd eye and asked Agni to carry it across. The seed born out of Shiva was so intense and burning that it almost burnt Agni. The lord of fire himself was unable to handle the fire, so he had to drop it in the holy river Ganga, as she constantly flows out of the matted hair of Shiva. As the seed fell into the Ganga, it split into 6, and from these 6 parts were born 6 children. These 6 children were mothered by the 6 krittikas (stars or nakshatras). When goddess Parvati came to visit her son, all 6 children united into one body with 6 heads. Hence he got the name Shanmukha (or 6-faced). He went on to become the commander-in-chief of the Devatas and killed Tarakasura. He has 2 wives, AmrutaValli (or Devayani) and Valli. We find a lot of temples of Karthikeya in TamilNadu, India, where people apply sandal paste on their bodies during worship.

Science behind it

The human body has 7 chakras that run all the way from the base of the spine to the head region. Their 7 colors are essentially the colors of the rainbow - VIBGYOR, with Violet at the top and Red at the bottom. The chakras resonate at specific frequencies and have specific tones that activate them. 
The first five chakras signify the 5 elements that are responsible for this physical body. Starting from the 5th chakra (Visudha) down to the 1st (Muladhara), they are - Akasha (Space), Vayu (Air), Agni (Fire), Aapa (Water) and Prithvi (Earth). All matter is bound to the limits of space and time within these 5 chakras. The other chakras open the higher dimensions in an individual. In these higher dimensions, an individual is empowered with enormous intuitive power and transcends time and space. 

The presiding deity at Muladhara is Lord Ganesha. We humans are nothing but matter with consciousness, and energy flows through channels in our bodies. There is always a one-to-one relation between Microcosm and Macrocosm - what appears outside is also inside, and what is in the cosmos is also within one's own self. Shiva and Shakti lie within our body as the 3rd eye and the Kundalini Shakti respectively. The kundalini shakti is the powerhouse of energy and lies dormant at the base of the spine as a coiled serpent. She represents the kinetic energies, while Shiva represents the ultimate knowledge and is static. The process of uniting Shiva and Shakti is Yoga (not merely the physical exercises, as many believe). The energy moves as two intertwined snakes (or an electromagnetic wave) between the chakras. The two snake-like channels are called nadis - Ida and Pingala.

The 1st chakra signifies Gravity, the very reason we have time and space. Gravity pulls everything towards itself, and it is depicted by the big belly of Ganesha. The whole creation is within his belly and he runs the worlds. Nothing can satisfy his hunger - this means gravity can pull everything towards it. His stomach burst open and he had to tie a snake around his belly - this means the whole universe is protected by a divine power or energy, and that is the same kundalini shakti. Parvati always stays with Ganesha rather than with her other son - this means the kundalini shakti is dormant by nature in the 1st chakra. He is the gatekeeper and is born out of turmeric - to achieve higher consciousness, one has to start from the muladhara and go all the way up. So we need to overcome the pull of gravity, which means we need to overcome our instincts and develop intuitive powers. In poojas, Ganesha is invoked through turmeric powder even today. Shiva had to cut off the head of Ganesha; this indicates that someone born out of ignorance has to shed his instincts and replace them with knowledge. The elephant's size, and its nature of clearing a pathway as it walks through the forest, indicate that an aspirant of yoga clears his obstacles and egos as he progresses towards the goal. The cursing of the Moon indicates that the Moon is malefic (negative) in nature on the day of Ganesha Chaturthi. The Moon has its effect on living beings and even on tides. The malefic and benefic nature of the Moon is studied in detail in Vedic astrology, and we can see this belief in many other religions too. The 2 wives are the 2 nadis (ida and pingala, mentioned earlier) rising in the muladhara chakra. Riddhi and Siddhi indicate the prosperity and spiritual powers one achieves by invoking the powers of Ganesha.

The 2nd chakra represents water and the 3rd indicates fire. Childbirth is associated with the "breaking of waters". Water is the source of life, and it is the reason we believe that falling stars bring luck. The falling stars (or shooting stars) are comets and contain compressed ice, and their collision with earth could be the reason we have water on earth, and hence life. Water also has tremendous properties of purification (of body and chitta/mind) and of memory. A person on his deathbed prefers to have a glass of water from his dearest ones, so that he can remember the relation while making preparations for his next birth. Water was used by ancient rishis (or seers) to curse people, as the nature of water can be transformed by thoughts (induced through mantras). It has the highest position among the 5 elements, hence it flows right from the top of Shiva's head. Karthikeya is born out of the 3rd eye of Shiva and handed over to Agni, who in turn drops him in the Ganga. This means the fire element originates in the 3rd eye as an electromagnetic wave and is hotter than the hottest. When kundalini is being awakened, the energy in the 3rd chakra can quickly become uncontrollable, and hence has to make a quick downward journey to the 2nd chakra, to meet the water and thereby reduce the heat. This purifies the energy and brings it under control, without which one could easily become restless or manic. The Ganga here is signified by the 2nd chakra, or water. The 6 faces of Karthikeya are the 6-petalled orange lotus at the 2nd chakra; this is denoted by the 6 krittikas. Karthikeya comes down from his initial position to meet his brother, Ganesha. There is a handshake of energy between the muladhara and the swadisthana. The sandal paste applied by Tamilians while praying to Karthikeya is indicative of the color of the lotus petals. His 2 wives are indicative of the same ida and pingala mentioned earlier. 
Valli in Tamil means creeper, and Amrutha Valli and Valli are the 2 wives. The energy moves through the channels, or creepers. The first wife is will-power (ichcha shakti) and the 2nd wife is action (kriya shakti). 

In the next post, I would like to write upon their parents and the cosmos...

Saturday, July 12, 2014

Big Data and Analytics Summit 2014

Big Data and Analytics Summit 2014 – NASSCOM

When: 27th June 2014, 9:00 AM – 5:15 PM

Where: Hotel Trident, Hyderabad

Sponsored By: Accenture, CapGemini, Paypal, Airtel, EXL, Virtusa, Genpact, Contact Singapore, Target


1.       The Nasscom event was attended by thousands of analysts, architects, data scientists and managers across India. It provided a platform to interact with the market leaders like Mu Sigma, Amazon, P&G, Target, Paypal, Indix, Simplify360 and many more. It was a full day event with lots of presentations, tech talks, panel discussions, Q&A sessions, and product demos.

2.       There were many stalls at the event. For instance, the Accenture stall exhibited two products – one for automobile insurance and the other for video analytics using Google Glass. In the insurance product, a car is fitted with sensors that measure the speed, friction, and wear and tear of the vehicle. These recordings are used to calculate the risk in real time. The insurance premium is determined by the risk; for a safely driven car the premium is lower.

3.       Marc Chemin from CapGemini spoke about the advent of Big Data in Europe.
·         Big data is used in 3 areas – enhancement (e.g. to boost sales), disruption & new business.
·         With the advent of Big Data and Analytics, a lot of Public agencies are being replaced by Private agencies. The public agencies work in silos and don’t usually share any public data. However very intelligent systems can be built by sharing data, and winning people’s confidence.
·         Lastly, there was a discussion around data privacy in Europe. There are 29 authorities in Europe, and each of them defines its own privacy policies around data regulation. Hence it gets tough to execute some of the Big Data projects.

4.       Michael Barnes from Forrester Research provided a lot of insights into the Analytics market.
·         In the earlier days data was used as a system of record; now it is used as a system of engagement to touch people. More and more companies are investing in business technologies to provide a better customer experience.
·         In the earlier days data was scarce, expensive, easy to manage, neat and richly structured, with a manageable number of formats and structures. With Big Data, data is plentiful, cheap, impossible to manage, messy and detailed, and nearly infinite in format and structure.
·         Getting the right & sufficient data, skills, technology and experience are key requirements to work in Analytics.
·         Currently Big Data is used for real-time reporting. In future it will be used for process automation, adhoc decision support, operational planning, and strategy planning.
·         Michael gave this wonderful definition – “Big Data is not a technology, a project, an objective with a single compute. It is a journey, process, strategic effect. The practices & technologies that close the gap between the data available and able to turn data into business insights”.

5.       This was followed by a presentation by Michael Svilar from Accenture Analytics. Some of the projects they are executing are Connected Cars, Network Analytics, Safe City and Smart Water (to stop wastage of water in the UK using Big Data). The statistics show that in India 94% of the companies that use Big Data are satisfied with the results. Of these, 65% use it to analyze customer behavior.

6.       The real crowd puller was the talk by Dhiraj Rajaram, founder of one of the most successful startups in Analytics – Mu Sigma. He discussed some of the innovations in Analytics.
·         Innovation 1 – Some companies are using a Social Score to sanction loans. They analyze how much people spend, and on what, by analyzing their Facebook posts.
·         Innovation 2 – Real-time video analytics is used to determine the possibility of fraud in the betting industry.
·         Innovation 3 – Field engineers constantly face risks to their lives. Big Data is being used to put an end to this. By 2015, 6 billion devices will be connected.

7.       The next session was on Fraud Detection & Risk Management in online and offline space.
·         Online ecommerce companies constantly face the issue of fraud and risk (with Cash On Delivery, the customer might not pay for the item). Money-back guarantees are another big blow for these companies, with the U.S. dealing with $2.3 billion worth of returns annually. Amazon commented on this issue, saying that companies need to be customer centric and always start with the customer in mind and work backwards. Consider a scenario where a lady purchases a lot of baby accessories and, while billing, forgets to bill a pack of baby wipes at the bottom of the basket. Instead of treating it as theft, the store can politely ask her if she would like to bill anything else.
·         In the online space, when a customer makes a purchase, all the time the company has to predict risk is 50 milliseconds. In offline mode the user has an identity. In online mode, however, all that the company knows is the customer's email id. Within 50 ms, thousands of rules and predictive models have to be executed and validated.
·         In the online space even the tiniest loophole can be disastrous, because things go viral at lightning speed. So companies cannot leave any room for mistakes.
·         Do not rely on just one supervised learning method for your business. You would need unsupervised learning, random forests, neural networks, decision trees and many more.
·         One instance of fraud was discussed by the panel. There were 20 banks in Somalia, and in a span of a few weeks 15 of them were robbed. The Somali government approached the analysts to predict, using statistical analysis, when the remaining 5 banks would be robbed.

8.       The panel discussion by the startups was the most motivating of all the talks. Bhupendra from Simplify360 and Sridhar Venkatesh from Indix kept the whole crowd on its toes. They spoke about their journey from nowhere to becoming the most promising startups. Bhupendra shared his experience of starting a product with no roadmap in 2009, during a global recession, and of the big surprises that awaited him at the end, when he became the most influential technologist in India. Dedication, Perseverance, Innovation and Team play a big role in startups.

9.       James from the Boston Consulting Group gave the concluding talk and shed light on a lot of key areas –
·         Myth1 – Bigger Data is Better Data.  Actual fact - Size doesn’t matter
·         Myth2 – It’s all about correlation. Actual fact – Causation remains the key
·         Myth3 – Focus on new ideas. Actual fact – Often do the same things better
·         Myth4 – Requires big investments. Actual fact – More capability much cheaper
·         Myth5 – It’s only for advanced firms. Actual fact – All businesses can benefit.

10.   Check out some of the stalls put up by a few cool startups - Veda, Nanobi, Germin8, and DataWeave. One thing common to all these startups was a focus on Social, Sentiment, and Semantic aspects. The market looks very promising, with all these startups pointing the way.

Friday, April 19, 2013

My No SQL - Big Data Presentation to Wellington - NZ

Raju Rama Krishna
Wellington, NZ
17th April
Topics:- High Scalability, MongoDB, Redis

WJUG 2013-04-17: Raju Ramakrishna - NoSQL Family from John Hurst on Vimeo.

Sunday, March 31, 2013

Neo4j Graph Case Study - Facebook & Trip Planner

NoSQL Graph Database - Neo4j

Graphs are everywhere - social networks, routing algorithms, payment flows, and many more. Facebook friends, Twitter followers and the LinkedIn graph are a few graph scenarios. Quite recently Facebook released Social Graph search. Neo4j is an open source NoSQL graph database, immensely popular and used by companies like Adobe and Intuit. Twitter uses FlockDB for its graphs. 

Download and extract Neo4j Community edition from http://www.neo4j.org/download. Start the server by running /bin/Neo4j.bat. The Neo4j monitoring tool can be accessed from http://localhost:7474/webadmin/. The monitoring tool provides several utilities as shown below. 

Gremlin - A Groovy tool for querying Graph data

Neo4j Monitoring console

Neo4j Data Browser

Apart from the above utilities, Neo4j supports a graph query language called Cypher. Since Neo4j is built on Java, it runs on the JVM and can be used as an embedded database. It can also be used as a standalone database through its REST API. There is another interesting tool called Neoclipse, which is very rich in visualizing graphs. The following graphs for Facebook friends and railway routes were generated with Neoclipse.

Applications of Neo4j

1. Facebook friend connect

  • Log in to Facebook and follow the instructions from here to download the Facebook Graph data.
  • Next, run the following program to download the profile photos from Facebook to your local machine. The photos are used only for visualization in Neoclipse. This step uses the data downloaded in the previous step; change the "file" variable accordingly.
  • The next step is to import the graph data into Neo4j. Run the below program. I have used the data for two Facebook profiles and loaded all the friends and interconnection data.
  • Now that the data is ready, I implemented a search utility which, given 2 Facebook IDs, provides all possible ways for the 2 people to get connected. 
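
The search utility itself is not reproduced here. As a rough, independent illustration of "all possible ways for 2 people to get connected", the sketch below enumerates simple paths in an undirected friendship graph with a depth-first search. It is plain Python rather than Neo4j code; the IDs, the `max_hops` limit and the function names are assumptions made for this example only.

```python
from collections import defaultdict

def build_friend_graph(friendships):
    """Build an undirected adjacency map from (person, person) pairs."""
    graph = defaultdict(set)
    for a, b in friendships:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def all_paths(graph, start, goal, max_hops=4):
    """Yield every simple path from start to goal using at most max_hops edges."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            yield path
            continue
        if len(path) - 1 >= max_hops:
            continue  # hop budget used up, do not extend this path
        for friend in sorted(graph[node]):
            if friend not in path:  # never revisit a person on the same path
                stack.append((friend, path + [friend]))

# Hypothetical IDs, for illustration only.
friendships = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("B", "D")]
graph = build_friend_graph(friendships)
paths = sorted(all_paths(graph, "A", "C"), key=lambda p: (len(p), p))
for p in paths:
    print(" -> ".join(p))
```

The hop limit matters on real social graphs, where the number of simple paths grows very quickly with depth.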

2. Railway Trip Planner

  • For this application, I used data from the Indian Railways site. A few trains were chosen at random, and their routes and information were stored in the local file system as tab-delimited text. The route information contains station codes, names, distances, time taken etc. I created one directory per train, named after the train number. Each directory contains info.csv and schedule.csv. Download the data and save the directories and their contents locally. It can be downloaded from here.
  • Next, run the below program to import the graph data into Neo4j.
  • Now that the data is available, I created a trip planner. It provides 2 options - search for the path between 2 stations with the least distance, or search for the path between 2 stations with the least number of stops. In each case it prints the total distance. 
  • The output of the above search program is
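
As a separate illustration of the two search modes (this is neither the original program nor its output), the sketch below uses Dijkstra's algorithm for the least-distance path and a breadth-first search for the path with the fewest stops. The station names and distances are made up, and the function names are assumptions for this example.

```python
import heapq
from collections import deque

# Hypothetical stations and distances in km (not taken from the railway data).
ROUTES = {
    "A": {"B": 50, "D": 200},
    "B": {"A": 50, "C": 60},
    "C": {"B": 60, "D": 70},
    "D": {"A": 200, "C": 70},
}

def shortest_by_distance(routes, src, dst):
    """Dijkstra: return (total_km, path) for the least-distance route, or None."""
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == dst:
            return dist, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, km in routes[node].items():
            if nxt not in seen:
                heapq.heappush(queue, (dist + km, nxt, path + [nxt]))
    return None

def fewest_stops(routes, src, dst):
    """BFS: return (total_km, path) for the route with the fewest hops, or None."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            km = sum(routes[a][b] for a, b in zip(path, path[1:]))
            return km, path
        for nxt in routes[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_by_distance(ROUTES, "A", "D"))
print(fewest_stops(ROUTES, "A", "D"))
```

In this toy graph the two answers differ: the direct link has fewer stops but is longer than the route through the intermediate stations.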

3. Facebook Graph Search

  • Facebook Graph Search has been released. It provides search based on your friends' connections, likes and recommendations. So I could ask it to "Suggest me Indian Vegetarian restaurants in and around South London". There is a wonderful article on how to build this with Neo4j at http://jimwebber.org/2013/02/neo4j-facebook-graphsearch-for-the-rest-of-us/
  • I implemented the code based on the above design. First build the graph.
  • Next run search on restaurants

High Availability

Neo4j Enterprise edition provides high availability in the form of replication and cache sharding. I tried this on a graph of 63 million node relationships with a cluster of 3 nodes. I am planning to write it up in the next post of this series.

Sunday, March 3, 2013

Real-Time Data Analysis using Twitter Stream


Twitter processes half a billion tweets per day. A tweet is at most 140 characters long. For experimentation purposes, Twitter provides these tweets to the public. Initially they were provided through the REST API, and access later moved to OAuth. The Twitter Streaming API makes this possible, and provides real-time streaming tweets at 1% of Twitter's load. That means that, on average, Twitter receives 5787 tweets per second (500,000,000 tweets / (24 hours * 60 mins * 60 secs)) and the Streaming API sends us about 58 tweets per second. This data arrives constantly, so we need a mechanism to store 58 tweets every second and run real-time analytics on them. Even though each tweet is 140 characters or 280 bytes (a char is 2 bytes), the Streaming API sends us a lot of information for each tweet (1 tweet = 4 KB). This information is sent in JSON format.
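
The arithmetic above can be checked with a few lines of Python. The constants are the figures quoted in the paragraph; the daily storage estimate is an extra derived number, assuming every sampled tweet really takes 4 KB.

```python
TWEETS_PER_DAY = 500_000_000      # figure quoted above
SECONDS_PER_DAY = 24 * 60 * 60
SAMPLE_RATE = 0.01                # the stream delivers about 1% of all tweets
BYTES_PER_TWEET = 4 * 1024        # about 4 KB of JSON per tweet

total_per_sec = TWEETS_PER_DAY / SECONDS_PER_DAY
sampled_per_sec = total_per_sec * SAMPLE_RATE
bytes_per_sec = sampled_per_sec * BYTES_PER_TWEET
gb_per_day = bytes_per_sec * SECONDS_PER_DAY / 1024 ** 3

print(f"all tweets/sec:     {total_per_sec:.0f}")
print(f"sampled tweets/sec: {sampled_per_sec:.0f}")
print(f"ingest rate:        {bytes_per_sec / 1024:.0f} KB/sec")
print(f"raw JSON per day:   {gb_per_day:.1f} GB")
```

The last figure is why the consumer keeps only a handful of fields per tweet instead of the full JSON payload.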

Twitter data is a very valuable tool in the field of marketing. Vendors can do sentiment analysis using the tweets for a specific hashtag. So if Samsung wants to know how content people are with its products, it can find out from the tweet data. As a result, a lot of research in the NLP (Natural Language Processing) field has started. Apart from this, we can do a lot of machine learning tasks on these tweets.

As part of this experiment I implemented a consumer that reads the stream of JSON tweets and persists them in MongoDB. Since it's a write-heavy application, I load-balanced (sharded) my MongoDB. This application keeps running forever, and the data starts filling up my MongoDB cluster. To keep storage to a minimum, I extracted and stored only the tweet ID, the tweeter's name, the actual tweet, and the source of the tweet (e.g. twitter.com, facebook.com, blackberry, etc.). Then I set up a MongoDB incremental Map-Reduce job to run every 10 minutes. This job gets the statistics of unique sources and their counts. From this I generated the top 10 statistics and created a chart using JFreeChart. 
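
The field extraction step could look roughly like the sketch below. It is not the original consumer; it only shows trimming one tweet's JSON down to the four stored fields. The key names (`id_str`, `user.screen_name`, `text`, `source`) follow the Twitter JSON format of the time, the sample payload is invented, and `slim_tweet` is a name made up for this example.

```python
import json
import re

def slim_tweet(raw_json):
    """Keep only the id, user name, text and source of one streamed tweet."""
    tweet = json.loads(raw_json)
    # The source usually arrives as an HTML anchor; strip the tags to get a label.
    source = re.sub(r"<[^>]+>", "", tweet.get("source", "")).strip()
    return {
        "id_str": tweet["id_str"],
        "user": tweet["user"]["screen_name"],
        "text": tweet["text"],
        "source": source,
    }

# Invented sample payload; real tweets carry many more fields.
raw = json.dumps({
    "id_str": "1",
    "text": "hello world",
    "user": {"screen_name": "alice", "followers_count": 10},
    "source": '<a href="http://example.com" rel="nofollow">Example App</a>',
    "lang": "en",
})
slim = slim_tweet(raw)
print(slim)
```

The resulting dictionary is what would be inserted into the `tweets` collection, with `id_str` doubling as the shard key used later in this post.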

Architectural Overview


Setup Twitter Account

Go to https://dev.twitter.com/apps and click "Create a new application".
Fill in all mandatory information and submit the application.
Go to the "OAuth Tool" tab and note down the Consumer Key and Consumer Secret.
Run the following program after changing the consumer key and consumer secret.

Follow the instructions of the program to generate the Access Token and Access Token Secret.
They can later be obtained from the "OAuth Tool" tab.

Setup MongoDB

MongoDB provides sharding at the database/collection level. So I set up a simple sharded MongoDB cluster with 2 Toshiba Windows 7 laptops (1 TB disk, 4 GB RAM).

System-1 :
System-2 :
MongoDB 2.2.3 is installed on both laptops at c:\apps\mongodb.
Create directories in System-1

On System1
c:\apps\mongodb\bin> mongod --shardsvr --dbpath c:\apps\mongodb\data1 --port 27020
On System2
c:\apps\mongodb\bin> mongod --shardsvr --dbpath c:\apps\mongodb\data1 --port 27020
On System1
c:\apps\mongodb\bin> mongod --shardsvr --dbpath c:\apps\mongodb\data2 --port 27021

c:\apps\mongodb\bin> mongod --configsvr --dbpath c:\apps\mongodb\conf --port 27022

c:\apps\mongodb\bin> mongos --configdb <config-server-host>:27022 --port 27017

c:\apps\mongodb\bin> mongo

mongos> use admin
switched to db admin
mongos> db.runCommand({addShard:""});
{"shardAdded": "shard0000", "ok": 1}
mongos> db.runCommand({addShard:""});
{"shardAdded": "shard0001", "ok": 1}
mongos> db.runCommand({addShard:""});
{"shardAdded": "shard0002", "ok": 1}
mongos> db.runCommand({listShards: 1})
{
 "shards" : [
     { "_id" : "shard0000", "host" : "" },
     { "_id" : "shard0001", "host" : "" },
     { "_id" : "shard0002", "host" : "" }
 ],
 "ok" : 1
}

mongos> use twitterdb
switched to db twitterdb

mongos> db.createCollection("tweets")
{"ok" : 1}

mongos> use admin
switched to db admin

mongos> db.runCommand({enableSharding: "twitterdb"})
{ "ok" : 1}

mongos> db.runCommand({shardCollection: "twitterdb.tweets", key: {id_str: 1}})
{"collectionSharded" : "twitterdb.tweets", "ok" : 1}

mongos> use twitterdb
switched to db twitterdb

mongos> db.tweets.find().count()

Running the Application

So we have just finished setting up the shards and the database. 

Run the Twitter Stream application (please change the values as per your settings). Data starts pumping into MongoDB. Don't forget to stop the application when you are done, or else the Twitter stream keeps consuming network bandwidth and the MongoDB storage will shoot up.
Keep a tab on the data files in the filesystem:


 The data files grow continuously. Verify the count of tweets in the database.

  mongos> db.tweets.find().count()

 So 25,000 tweets accumulated in 10 minutes, or about 42 tweets per second (25,000 / 600). Find out how many of the tweets in these 10 minutes were posted from the web:

  mongos> db.tweets.find({"source" : "web"}).count()

Now run the MapReduce job every 10 minutes to aggregate the results and generate the reports as pie charts. These charts will be stored in your local file system.
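
The idea behind the incremental job can be sketched in plain Python, independent of MongoDB: each 10-minute batch is reduced to per-source counts, and those counts are folded into running totals from which the top 10 is taken. This is only an illustration of the technique, not the original job; the batches and function names are invented.

```python
from collections import Counter

def map_reduce_batch(tweets):
    """Map each tweet to (source, 1) and reduce by summing per source."""
    counts = Counter()
    for tweet in tweets:
        counts[tweet["source"]] += 1
    return counts

def merge_into(totals, batch_counts):
    """Fold one batch result into the running totals (the incremental step)."""
    totals.update(batch_counts)  # Counter.update adds counts, it does not replace
    return totals

# Two invented 10-minute batches.
batch_1 = [{"source": "web"}, {"source": "web"}, {"source": "Example App"}]
batch_2 = [{"source": "Example App"}, {"source": "web"}]

totals = Counter()
for batch in (batch_1, batch_2):
    merge_into(totals, map_reduce_batch(batch))

top_10 = totals.most_common(10)
print(top_10)
```

Because only the new batch is scanned on each run, the cost of a run depends on the last 10 minutes of tweets rather than on the whole collection.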
Chart Utility


The file can be downloaded here containing the entire project.

Saturday, February 23, 2013

Massive Movie Recommendation System using MongoDB

MongoDB Basics & Internals

Big Data is the buzzword of today. It is characterized by the Velocity, Volume and Variety of data. Google processes 24 petabytes of data per day. Facebook users upload 300 million photos and post 3.2 billion likes/comments per day. These are some examples of Velocity and Volume. "Variety" refers to the different types of data handled by these applications. For instance, log files hold unstructured/semi-structured data. This is not what an RDBMS is designed for. RDBMSs are suited to structured data and volumes not exceeding terabytes. Since they are disk-based, they are relatively slow (IO-bound). Indexing alone is not a solution: as the data grows, inserts get slower. Joins are big bottlenecks, and hence you have to de-normalize.

High-volume sites were mostly built on the LAMP (Linux, Apache, MySQL, PHP) stack. Load balancing was done at the web-tier and application-tier levels. However, this was difficult at the data tier. So one solution people used was Memcached, acting as a caching layer in front of the data tier. They used a cluster of MySQL instances and split the data among them, and these instances were replicated as well. Further optimizations were done on top of this - the MySQL HandlerSocket plugin was used to reach 500K requests per second. Another way to address high volumes was to use a CDN (Content Delivery Network) like Akamai. HTTP accelerators like Varnish were used to handle web-tier load. 

When Google published its papers on the Google File System, MapReduce and Bigtable, it changed the game. Doug Cutting, who was working with Yahoo then, built Hadoop (an open-source implementation of the distributed file system and MapReduce ideas from those papers; HBase is the corresponding Bigtable-style store). Several NoSQL databases appeared then. We can broadly categorize them as Document-Oriented (MongoDB, CouchDB), Graph (Neo4j) and Columnar (HBase, Cassandra) databases. MongoDB is a document-oriented database from 10gen. FourSquare, Shutterfly and Codecademy are a few of the customers who use MongoDB. 

MongoDB works with JSON data types; each record is a JSON document, stored internally as BSON (a binary representation of JSON). It has no joins or transactions. It supports geospatial queries, which is a USP for FourSquare. Indexing, auto-sharding and replication are a few more features of MongoDB. I could configure my 3 laptops to store data for millions of users. I configured laptop-1 and laptop-2 as shards, with users A-M stored on laptop-1 and N-Z on laptop-2, and I configured laptop-3 as a replica of laptop-1. Replication was immediate (less than a second), and even offline replication was simply awesome (bring down the replica and start it again, and it syncs up automatically). Replica sets support voting to choose the primary. MongoDB has built-in support for Map-Reduce (the functions are written in JavaScript and run inside the server; it is not built on Hadoop). MongoDB provides several tools; the important ones are mongod, mongos, mongo and mongostat. MongoDB uses memory-mapped files to map the data files to virtual memory, hence 64-bit machines are preferred. Internally the data is stored as a 2-level linked list. When a database (e.g. movieLens) is created, it creates 3 files automatically - movieLens.ns (16 MB), movieLens.0 (64 MB) and movieLens.1 (128 MB). The movieLens.ns file is a giant hash map that stores the metadata of the database and its collections (tables are called collections in MongoDB). The movieLens.0 and movieLens.1 files are data files; the numbers increase sequentially and the sizes double with each new file until they reach 2 GB. The internal architecture and setup of MongoDB is a very interesting and long story (for the next post). 
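
To make the data file growth concrete, here is a small sketch that lists the sizes of the first few data files under the doubling rule described above (64 MB for the first file, doubling up to a 2 GB cap). The function name and defaults are assumptions for this example, not part of any MongoDB tool.

```python
def data_file_sizes_mb(n_files, first_mb=64, cap_mb=2048):
    """Return the sizes in MB of data files .0, .1, ... under the doubling rule."""
    sizes = []
    size = first_mb
    for _ in range(n_files):
        sizes.append(size)
        size = min(size * 2, cap_mb)  # double until the 2 GB cap is reached
    return sizes

sizes = data_file_sizes_mb(8)
for number, size in enumerate(sizes):
    print(f"movieLens.{number}: {size} MB")
print(f"total allocated: {sum(sizes)} MB")
```

Under this rule the cap is reached at the sixth file, and every later file adds another 2 GB.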

Movie Recommendation

We will build a movie recommendation system with MongoDB to handle massive volume of data. I used the 10 Million movie ratings provided by MovieLens. Download the data (MovieLens 10M) from http://www.grouplens.org/node/73. Extract it to a location like C:\Project\Recommender. It consists of 3 main files - movies.dat, ratings.dat, and tags.dat. We will load this data to MongoDB. For this, download MongoDB from http://www.mongodb.org/downloads.  Install MongoDB in some location like c:\mongodb. Create a directory c:\mongodb\data. Start mongod with this command -
c:\mongodb\bin>  mongod --dbpath c:\mongodb\data
Now we will run the following Java application to import the data into MongoDB. For this we need the MongoDB Java driver, and the heap size should be set to 1 GB.

The output is as follows-

Inserted 10681 Movies
Inserted 95580 Tags
Inserted 10000054 Ratings
Completed in 685 Seconds

Run the mongo shell using the following command
c:\mongodb\bin>  mongo

Now verify that the movieLens database is created. Some of the commands you can run are as below-
show dbs
show collections

Next, create Indexes:
db.ratings.ensureIndex({user_id:1, item_id:1})

Next run map-reduce task to create a new Collection to store userId/RatingCount

map = function() { emit({user_id: this.user_id}, {count:1}) }

reduce = function(k, v) { var count = 0; v.forEach( function(x) { count += x['count']; }); return {count: count}; }

db.ratings.mapReduce(map, reduce, "user_ratings");

{
        "result" : "user_ratings",
        "timeMillis" : 520420,
        "counts" : {
                "input" : 10000054,
                "emit" : 10000054,
                "reduce" : 169150,
                "output" : 69878
        },
        "ok" : 1,
}
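
For readers who want to see what the shell job computes without running MongoDB, the same counting logic can be mirrored in plain Python. This is only an in-memory illustration; the sample ratings and function names are invented. It also shows why the reduce function returns the same shape as the emitted value: MongoDB may call reduce again on partial results, so the output of one reduce must be valid input to another.

```python
from collections import defaultdict

def map_rating(rating):
    """Mirror of the shell map(): emit (user_id, {count: 1}) for one rating."""
    yield rating["user_id"], {"count": 1}

def reduce_counts(key, values):
    """Mirror of the shell reduce(): sum the partial counts for one key."""
    return {"count": sum(v["count"] for v in values)}

def run_map_reduce(ratings):
    """Group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for rating in ratings:
        for key, value in map_rating(rating):
            grouped[key].append(value)
    return {key: reduce_counts(key, values) for key, values in grouped.items()}

# Invented sample ratings.
ratings = [
    {"user_id": 1, "item_id": 10, "rating": 4.0},
    {"user_id": 1, "item_id": 11, "rating": 3.5},
    {"user_id": 2, "item_id": 10, "rating": 5.0},
]
user_ratings = run_map_reduce(ratings)
print(user_ratings)

# Re-reducing partial results gives the same answer as reducing everything at once.
partial = [reduce_counts(1, [{"count": 1}] * 3), reduce_counts(1, [{"count": 1}] * 2)]
print(reduce_counts(1, partial))
```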

Now we are ready to create our recommendation engine. The one we are going to implement is a "Collaborative Filtering" application, similar to Amazon or Netflix. The algorithms used by these sites are normally Weighted Slope-One or Singular Value Decomposition (SVD). Twitter, for instance, uses SVD. Apache Mahout (an open source machine learning solution for Big Data) provides a recommendation system. I have used a similar strategy here. 

The idea used here is- 
Users rate movies on a scale of 1-5. To know which movies to recommend to a given user (user_1), we use the ratings that he provided for movies. Find his k nearest neighbours (users who are similar to him in taste). From this list of users, find the movies that they have rated. Build a weighted list of movies based on their ratings, ignoring movies that he (user_1) has already rated. The building blocks for this are Manhattan distance, Euclidean distance, Minkowski distance, Pearson's coefficient and K-Nearest-Neighbours. 
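
The steps above can be sketched on a toy data set as follows. This is not the program whose output is shown in this post; it is a minimal illustration that uses Pearson's coefficient to pick neighbours and a similarity-weighted average to rank unseen movies. The users, movies, ratings and function names are all invented.

```python
from math import sqrt

def manhattan(a, b):
    """Sum of absolute rating differences over movies both users rated."""
    common = set(a) & set(b)
    return sum(abs(a[m] - b[m]) for m in common) if common else float("inf")

def euclidean(a, b):
    """Straight-line distance over movies both users rated."""
    common = set(a) & set(b)
    return sqrt(sum((a[m] - b[m]) ** 2 for m in common)) if common else float("inf")

def pearson(a, b):
    """Pearson correlation over co-rated movies; 0.0 when it is undefined."""
    common = sorted(set(a) & set(b))
    n = len(common)
    if n < 2:
        return 0.0
    xs = [a[m] for m in common]
    ys = [b[m] for m in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def recommend(ratings, user, k=2, top_n=3):
    """Rank unseen movies by the similarity-weighted ratings of the k nearest users."""
    mine = ratings[user]
    neighbours = sorted(
        ((pearson(mine, other), name) for name, other in ratings.items() if name != user),
        reverse=True,
    )[:k]
    scores, weights = {}, {}
    for sim, name in neighbours:
        if sim <= 0:
            continue  # ignore users with opposite or unrelated taste
        for movie, rating in ratings[name].items():
            if movie not in mine:  # skip movies the user has already rated
                scores[movie] = scores.get(movie, 0.0) + sim * rating
                weights[movie] = weights.get(movie, 0.0) + sim
    ranked = sorted(scores, key=lambda m: (-scores[m] / weights[m], m))
    return ranked[:top_n]

# Invented ratings on a 1-5 scale.
RATINGS = {
    "u1": {"A": 5, "B": 3, "C": 4},
    "u2": {"A": 5, "B": 2, "C": 4, "D": 5, "E": 2},
    "u3": {"A": 1, "B": 5, "C": 2, "D": 1, "F": 5},
    "u4": {"A": 4, "B": 3, "C": 5, "E": 4},
}
print(recommend(RATINGS, "u1"))
```

Manhattan and Euclidean distance are included for comparison; swapping one of them in for `pearson` means sorting neighbours in ascending rather than descending order, because a smaller distance means a closer neighbour.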

The output of the program is-


Loaded 10000054 ratings
Loaded 10681 movies
Recommended for 16
Lord of the Rings: The Two Towers, The
Liar Liar
Field of Dreams
Christmas Story, A
Gone in 60 Seconds
Remember the Titans
No Country for Old Men
Recommended for 127
Clockwork Orange, A
Fight Club
Reservoir Dogs
12 Monkeys (Twelve Monkeys)
Shawshank Redemption, The
Silence of the Lambs, The
Terminator 2: Judgment Day
One Flew Over the Cuckoo's Nest
Shining, The

In the next post, I will discuss in more detail how the recommendation system works and the algorithm behind it.

Saturday, January 19, 2013

Building an online Movie Library

This is more of a PoC (Proof of Concept) for a movie library site. I have picked a few requirements and built working code around them -

The site maintains a large catalogue of movies, and each movie is identified by a title, price, release date and category. A search field helps users find movies easily: entering "English" in the search box returns all English movies, sorted by title. The site also provides "Faceted Navigation", a concept that is widely used these days. Along with the list of English movies, the left-hand section groups the search results by category. Each category is shown by its title, and the categories are sorted by the number of matching movies (in descending order). Following is the result page for the application we are building (displaying Kannada movies)
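
As a rough illustration of what the search page computes, here is an in-memory sketch of a title-sorted search with category facets. The project itself relies on Lucene/Hibernate Search for this; the catalogue entries, the `language` field and the `facetedSearch` function below are hypothetical stand-ins.

```javascript
// Made-up catalogue. `language` is a hypothetical field used as the search target.
const catalogue = [
  { title: "Movie C", language: "Kannada", category: "Drama" },
  { title: "Movie A", language: "Kannada", category: "Comedy" },
  { title: "Movie B", language: "Kannada", category: "Drama" },
  { title: "Movie D", language: "English", category: "Action" },
];

// Return the matching movies sorted by title, plus the category facets
// sorted by number of occurrences (descending).
function facetedSearch(movies, term) {
  const wanted = term.toLowerCase();
  const results = movies
    .filter((m) => m.language.toLowerCase() === wanted)
    .sort((a, b) => a.title.localeCompare(b.title));

  const counts = new Map(); // category -> number of matching movies
  for (const m of results) {
    counts.set(m.category, (counts.get(m.category) || 0) + 1);
  }
  const facets = [...counts.entries()]
    .map(([category, count]) => ({ category, count }))
    // Ties are broken alphabetically so the order is stable.
    .sort((a, b) => b.count - a.count || a.category.localeCompare(b.category));

  return { results, facets };
}

const hits = facetedSearch(catalogue, "kannada");
console.log(hits.results.map((m) => m.title), hits.facets);
```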

The next requirement is to provide pagination for the results, as shown below. On the page above, clicking "List All" takes us to the page below -
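
The arithmetic behind pagination is small enough to show in full. The sketch below is a generic helper of my own (the name `paginate` and its return shape are assumptions), working on an in-memory array; in the actual stack the same offset and page size would typically be handed to the query through Hibernate's `setFirstResult` and `setMaxResults`.

```javascript
// Return one page of `items`. Pages are 1-based, and out-of-range
// page numbers are clamped to the nearest valid page.
function paginate(items, page, pageSize) {
  const totalPages = Math.max(1, Math.ceil(items.length / pageSize));
  const current = Math.min(Math.max(1, page), totalPages);
  const start = (current - 1) * pageSize; // offset of the first item on the page
  return {
    page: current,
    totalPages,
    items: items.slice(start, start + pageSize),
  };
}

const titles = ["A", "B", "C", "D", "E", "F", "G"];
console.log(paginate(titles, 2, 3));
```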

The next piece of functionality is for the suppliers. Suppliers want to add movies to our library, so we provide them with an FTP location. They create one or more CSV (comma-separated values) files in a predefined format and put them in the FTP location. We will create a scheduler application that polls the folder, picks up the files, extracts the data and populates the database. Once this step completes, the movies become available on the site. The format of the CSV is as below-

In a real-world scenario they would be creating a zip file containing CSV files and images.
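
To give a feel for the extraction step, here is a minimal parsing sketch. The column order in `COLUMNS` is hypothetical and is not the supplier format used by the project; it only reuses the movie attributes mentioned earlier plus an ID. The parser is deliberately simple and does not handle quoted fields or embedded commas.

```javascript
// Hypothetical column order, chosen for illustration only.
const COLUMNS = ["id", "title", "price", "releaseDate", "category"];

// Parse simple CSV text into movie objects, rejecting malformed rows
// and duplicate IDs.
function parseMovies(csvText) {
  const movies = [];
  const seenIds = new Set();
  for (const line of csvText.split(/\r?\n/)) {
    if (line.trim() === "") continue; // ignore blank lines
    const cells = line.split(",").map((c) => c.trim());
    if (cells.length !== COLUMNS.length) {
      throw new Error(`Expected ${COLUMNS.length} columns: ${line}`);
    }
    const movie = Object.fromEntries(COLUMNS.map((name, i) => [name, cells[i]]));
    movie.price = Number(movie.price);
    if (Number.isNaN(movie.price)) throw new Error(`Bad price: ${line}`);
    if (seenIds.has(movie.id)) throw new Error(`Duplicate id: ${movie.id}`);
    seenIds.add(movie.id);
    movies.push(movie);
  }
  return movies;
}

// Made-up rows in the hypothetical format.
const sampleCsv = "1,Movie A,9.99,2013-01-19,Drama\n2,Movie B,4.5,2012-05-01,Comedy\n";
console.log(parseMovies(sampleCsv));
```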

Tools/Technologies Used-
JDK 1.6 (or above)
Maven 3.0.4
Eclipse Juno
Spring 3.1.1
Hibernate 4.1.4
Lucene/Hibernate Search 4.2.0
Spring MVC 3.1.1
Jackson 1.9.4 (for JSON)
JQuery 1.7.2
Spring Integration 2.2.0 (Spring Batch support)
Mockito Test Framework for Test Driven Development (in progress...)


1) Download and install JDK 1.6 from here
2) Download and install Eclipse IDE from here
3) Download and install Maven from here
4) Set up the environment variables as follows-
5) Download and install MySQL from here
6) Set the root username/password to root/root. Open the MySQL shell and create a database named "TPCH"
7) Create a table "CD" with initial data. The following SQL can be used-
8) Next, download the complete movie library project from here
9) Extract it to a folder, say C:\Projects. This will create the following directory structure-

It is a Maven project with 4 Maven modules (highlighted in red). Apart from that, there is a lucene folder, which is used for indexing.
The main project is a POM project.
Web depends on Service and DAO
Service depends on DAO
Batch is an independent, stand-alone module that depends on DAO

Maven has a standard directory structure-
src/main/resources is used for configuration
src/main/java is used for source code
src/test/java is used for test cases

10) Open Eclipse and import it as a Maven project (the root should be C:\Projects\spring-mockito)
11) Configuration changes (if any)-

Ensure the connection URL, username and password match the values you set up.
The property "hibernate.search.default.indexBase" specifies where Lucene stores its index.

Ensure the values are right here

12) Run the following commands from a command prompt-
C:\Projects\spring-mockito>mvn clean install
C:\Projects\spring-mockito\spring-mockito-web>mvn jetty:run

The first command cleans the Maven modules and builds their target artifacts.
The second command deploys the war file and starts up Jetty (a lightweight server). If you want to test on Tomcat, copy the war from the spring-mockito-web\target folder and deploy it to the Tomcat directory.

13) To access the application, the URL is http://localhost:8080/spring-mockito-web/index.jsp
14) To test the Supplier functionality, first ensure that the directory C:\FileServer\input already exists. You can change the location in the file
Run the stand-alone Java class com.raj.projects.batch.client.Main from the spring-mockito-batch project. This will start up the poller and the integration environment.
15) Create files with .csv extension as shown below-

You can store as many files as you want. However, ensure that the IDs are unique.

Drop the files into the folder C:\FileServer\input

The files will be picked up and processed, and the movie entries will be written to the database. The files are then deleted.
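
The pick-up, process and delete cycle can be sketched as follows. In the project this job is done by the Spring Integration poller; the code below is only a small stand-in of my own that uses Node's built-in `fs` module, with `pollOnce` and its `handle` callback as hypothetical names. The demo runs in a temporary directory rather than the real input folder.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Process every .csv file currently in `dir`, then delete it.
// `handle` is a hypothetical callback that would write the rows to the database.
function pollOnce(dir, handle) {
  const processed = [];
  for (const name of fs.readdirSync(dir).sort()) {
    if (path.extname(name).toLowerCase() !== ".csv") continue; // skip other files
    const file = path.join(dir, name);
    handle(fs.readFileSync(file, "utf8"), name);
    fs.unlinkSync(file); // only reached if handle() did not throw
    processed.push(name);
  }
  return processed;
}

// Demo in a temporary directory with placeholder file contents.
const inputDir = fs.mkdtempSync(path.join(os.tmpdir(), "movie-input-"));
fs.writeFileSync(path.join(inputDir, "batch1.csv"), "placeholder row\n");
fs.writeFileSync(path.join(inputDir, "notes.txt"), "ignored");

const handledFiles = [];
const done = pollOnce(inputDir, (text, name) => handledFiles.push(name));
console.log(done);
```

A real poller would repeat `pollOnce` on a timer, for example with `setInterval`. Because the file is deleted only after the callback returns, a file that fails to process stays in the folder and is retried on the next poll.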