Tuesday, December 25, 2012

Count Distinct Values in huge text file

This is a common problem today. You are given a huge text file containing tens of millions of words, and you have to count how many distinct values exist in the file. The problem gets more complicated if you cannot load the entire data into memory. How do you make your algorithm "online"? An "online algorithm" is one in which the result is expected to be returned immediately - perhaps in one pass - and you won't have memory to hold all the data.

This algorithm has uses in various fields. In web analytics, it returns the number of unique visitors in the last month, the last week, and so on. Another use is in detecting a suspected attack: if the load on your server increases all of a sudden and you notice a spike in the network, the first thing to look for is the possibility of an attack - in this case, the requests could be coming repeatedly from one system. So next we will discuss the different implementations available. For the purpose of this article, I used Shakespeare's complete works as a huge text file. It can be downloaded from http://www.gutenberg.org/cache/epub/100/pg100.txt. The file is around 5.5 MB in size and contains 67802 distinct words.

Approach 1: Read the input file line by line and break each line into words. Store each word in a Set and, once all the words are stored, get the size of the set - this is the distinct count. However, with this approach we have to store the entire 5.5 MB of data in the Set. Even assuming there are repeated words and the distinct words amount to, say, 4 MB, we still need to hold 4 MB of data in the Set. And since a Set internally uses a hash table, assuming a 0.7 load factor we would need 4/0.7 = 5.7 MB. If the file is extremely big, we cannot use this approach at all.
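A minimal sketch of this set-based approach (class and method names are mine; the word-splitting rule is a simplifying assumption):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctCount {
    // Split each line into words and count distinct words with a HashSet.
    // In the real program the lines would be streamed from a BufferedReader.
    static int countDistinct(Iterable<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines)
            for (String w : line.split("\\s+"))
                if (!w.isEmpty()) words.add(w.toLowerCase());
        return words.size();
    }
}
```

The memory problem is visible right in the code: the HashSet holds every distinct word for the whole run.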

Approach 2 is a small work-around for Approach 1. Assume that you have a very good hashCode function, such that the hashCodes of any 2 words are never the same. In this case, store the hashCode instead of the word. A hashCode is an int, so stored in a Set as an Integer it is 32 bits or 4 bytes long, and to store 67802 hashCodes we need around 378 KB of memory. This is a huge gain, but the problem is finding a good hashCode function. And the size of the Set still grows with the input file size.

We can use a "Bloom Filter" as an alternative - here, a 2-dimensional boolean array. In this approach we use 3 or 4 different hashCode functions - h1, h2, h3, h4 - and for each word we compute h1 to h4. The rows of our Bloom filter are indexed by the hashCodes, and the number of columns is fixed based on the accuracy needed. One thing to note is that the Bloom filter is probabilistic: it doesn't always give 100% accurate results, and there is a chance of error of 1% to 4% or so. The bigger the Bloom filter array, the more accurate the results. Assume that the values for h1, h2, h3, and h4 are 22, 17, 35, and 19. If the Bloom filter array is represented as a, we set a[22][0], a[17][1], a[35][2], a[19][3] to true. Using this approach, I got a result of 67801 instead of 67802 with a 4*1M array. Since a boolean takes a byte in Java, we would need 4*1M bytes, or about 4 MB of memory. With only 300 KB of memory, I got a 3% error. The appeal of this approach is that we don't need to store the entire data, and we get the result in one pass.


The next approach we discuss is a very efficient algorithm called the "HyperLogLog Counter", and it has been used in many applications, including Cassandra (originally developed at Facebook). It's a slightly complex, mathematical algorithm. It needs very little memory, and that memory doesn't vary much with the input data size. The idea is as follows. Read each word and compute the hash of the word (Cassandra uses "MurmurHash"). The algorithm uses something called "Stochastic Averaging". We define 'b' as a number between 4 and 16, and create an int (or byte) array of size 2^b. So if we use b=4, we have 16 buckets. Assume that we have chosen b=15 and the hashCode of a word is 1379266913. We treat this hashCode as a 32-bit binary number (an Integer is 32 bits) and break it into the first 15 bits and the remaining 17 bits. The first 15 bits decide which bucket to use. The remaining 17 bits are used to determine the rank - the position of the first set bit (the first 1), counting from left to right. If the value at the bucket index is less than the rank, the rank is stored in the bucket; so each bucket ends up holding the highest rank seen. Once all the words are read and the buckets populated, we use the harmonic mean of the bucket values to compute the result.
Using this approach with a b-value of 15, meaning 32768 buckets (or 32 KB of memory), the result obtained was 67749 (against the actual 67802). That means using just 32 KB of memory we got a result that is 99.92% accurate!!
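Here is a compact sketch of the bucket/rank scheme described above. It is an assumption-laden toy, not Cassandra's implementation: I use String.hashCode passed through a MurmurHash3-style bit mixer instead of real MurmurHash, and the standard bias-correction and small-range-correction constants from the HyperLogLog paper.

```java
public class HyperLogLog {
    static final int B = 15;             // bits used for the bucket index
    static final int M = 1 << B;         // 32768 buckets, one byte each

    private final byte[] buckets = new byte[M];

    // MurmurHash3-style finalizer to spread String.hashCode's bits
    // (a real implementation would use a proper hash like MurmurHash).
    private static int mix(int h) {
        h ^= h >>> 16; h *= 0x85ebca6b;
        h ^= h >>> 13; h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public void add(String word) {
        int h = mix(word.hashCode());
        int bucket = h >>> (32 - B);      // first 15 bits pick the bucket
        int rest = h << B;                // remaining 17 bits, left-aligned
        // rank = position of the first 1-bit in the remaining bits (1-based)
        int rank = rest == 0 ? 32 - B + 1 : Integer.numberOfLeadingZeros(rest) + 1;
        if (buckets[bucket] < rank) buckets[bucket] = (byte) rank;
    }

    public long estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte b : buckets) {
            sum += 1.0 / (1L << b);       // harmonic-mean accumulator
            if (b == 0) zeros++;
        }
        double alpha = 0.7213 / (1 + 1.079 / M);   // bias-correction constant
        double e = alpha * M * (double) M / sum;
        if (e <= 2.5 * M && zeros > 0)             // small-range (linear counting) correction
            e = M * Math.log((double) M / zeros);
        return Math.round(e);
    }
}
```

With 32768 buckets the expected relative error is roughly 1.04/sqrt(32768), i.e. well under 1% - consistent with the 67749-vs-67802 result above.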


Friday, December 14, 2012

Google Fresher Interview


There is a linked list of numbers of length N. N is very large and you don’t know N. 
You have to write a function that will return k random numbers from the list. 
Numbers should be completely random.


One approach is to use random numbers modulo N. But if we don't know N, it becomes interesting. The approach that we discuss is called "reservoir sampling". We read the first K numbers and store them in a list. For our example let's say K is 10, so we keep the first 10 elements. Next read the 11th element and compute a random integer r between 0 and 10 (inclusive). If r is less than 10, replace the r-th number in the list with the 11th element; else continue. Repeat the same for the 12th element (r between 0 and 11), the 13th, and so on. Each element of the stream then ends up in the sample with probability K/N.
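A sketch of the procedure (class and method names are mine; the stream is modeled as an Iterable since we never need to know its length up front):

```java
import java.util.Random;

public class ReservoirSample {
    // Returns k uniformly chosen elements from the stream, in one pass,
    // without knowing the stream's length N in advance.
    static int[] sample(Iterable<Integer> stream, int k, Random rnd) {
        int[] reservoir = new int[k];
        int n = 0;                           // elements seen so far
        for (int x : stream) {
            if (n < k) {
                reservoir[n] = x;            // fill the first k slots directly
            } else {
                int r = rnd.nextInt(n + 1);  // uniform in [0, n]
                if (r < k) reservoir[r] = x; // replace with probability k/(n+1)
            }
            n++;
        }
        return reservoir;
    }
}
```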


Question: 2

Given a data structure having first n integers and next n chars: A = i1 i2 i3 ... iN c1 c2 c3 ... cN. 
Write an in-place algorithm to rearrange the elements of the array as A = i1 c1 i2 c2 ... iN cN


We can use a divide-and-conquer approach. Let's say the list contains {1, 2, 3, 4, 5, 6, 7, 8, a, b, c, d, e, f, g, h}. Using a binary-search kind of approach, swap the second quarter with the third quarter to convert it into {1, 2, 3, 4, a, b, c, d, 5, 6, 7, 8, e, f, g, h}. Then break the list into 2 parts and apply the same algorithm on each part - {1, 2, 3, 4, a, b, c, d} and {5, 6, 7, 8, e, f, g, h}. The recursion gives O(n log n) time with O(1) extra space.
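The quarter-swap recursion can be sketched as below. One assumption to flag: this simple form works cleanly when the array length is a power of two, as in the example.

```java
public class Interleave {
    // Rearranges {i1..in, c1..cn} into {i1, c1, i2, c2, ...} in place.
    // Assumes the array length (2n) is a power of two.
    static void interleave(char[] a, int lo, int hi) {
        int len = hi - lo + 1;
        if (len <= 2) return;
        int mid = lo + len / 2, q = len / 4;
        for (int i = 0; i < q; i++) {        // swap 2nd quarter with 3rd quarter
            char t = a[lo + q + i];
            a[lo + q + i] = a[mid + i];
            a[mid + i] = t;
        }
        interleave(a, lo, mid - 1);          // recurse on the left half
        interleave(a, mid, hi);              // recurse on the right half
    }
}
```

On "12345678abcdefgh" this produces "1a2b3c4d5e6f7g8h".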


Question: 3

There is an array A[N] of N numbers. You have to compose an array Output[N] such that Output[i] is equal to the product of all the elements of A[N] except A[i]. For example, Output[0] will be the product of A[1] to A[N-1], and Output[1] will be the product of A[0] and A[2] to A[N-1]. Solve it without the division operator and in O(n).


We can do it in 2 passes, O(n). If the input list is a: {4, 2, 5, 1}, we create a product list b: {1, 0, 0, 0}, with b[0] set to 1. Set pf (product forward) to 1 and pb (product backward) to 1. Start the forward pass: for each i from 1, set pf = pf * a[i-1] and store pf in b[i]. At i=1, pf is 4, so the list becomes {1, 4, 0, 0}. At i=2, pf becomes 4*2=8, so the list is {1, 4, 8, 0}, and at i=3 it becomes {1, 4, 8, 40}.
Start the backward pass from the last-but-one item, i=2: set pb = pb*a[i+1] and set b[i] to b[i]*pb. At i=2, pb=1*1=1 and b[2]=8*1=8. At i=1, pb=1*5=5 and b[1]=4*5=20. At i=0, pb=5*2=10 and b[0]=1*10=10. The list is finally {10, 20, 8, 40}.
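The two passes above translate almost line for line into code (names are mine; I use long for the running products to reduce overflow risk):

```java
public class ProductExceptSelf {
    static long[] productExceptSelf(int[] a) {
        int n = a.length;
        long[] b = new long[n];
        long pf = 1;                  // product of elements before index i
        for (int i = 0; i < n; i++) { b[i] = pf; pf *= a[i]; }
        long pb = 1;                  // product of elements after index i
        for (int i = n - 1; i >= 0; i--) { b[i] *= pb; pb *= a[i]; }
        return b;
    }
}
```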


Question: 4

You are given an array with integers (both positive and negative) in any random order. 
Find the sub-array with the largest sum


Sort a very large text file on limited memory


We are given a very large text file of size say 10 GB and we need to sort the file and store it in an output location. But the memory (RAM) is limited, say 1 GB.


1. Brute-Force Approach
Read the input file line by line using a BufferedReader and store each line in a TreeSet (assume there are no duplicate lines). Once the file is completely read, write the contents of the TreeSet to the output file.
For my analysis, I used a 400 MB file containing line items, extracted from the TPC-H dataset; you can use any large file. When I ran my program with 512 MB memory (VM argument -Xmx512M), I got an OutOfMemoryError. When I set it to 1 GB, it ran fine, taking 7 seconds.
So as we can observe, if we need to sort 10 GB we would need around 15-20 GB of RAM. The memory is used by the TreeSet to hold the data.

2. External Sort
In this approach, we again read the input file line by line and store the lines in a TreeSet. This step is similar to the above, but we do not store the entire file in the TreeSet. If the file is 10 GB and our RAM capacity is 1 GB, we store 512 MB of data in the TreeSet at a time. Once it reaches 512 MB, we flush it to disk into a temporary file (call it temp1.txt). Repeat this procedure of writing sorted temporary files until the 10 GB is read completely, leaving 20 sorted temporary files. The next step is to merge these 20 files into a single sorted file and delete the temporary files. This is called a K-way merge. Consider the example below

temp1:  { dog, fan, jug }
temp2:  { egg, kit, rat, sun }
temp3:  { ant, gun, hat}

Read the first items - dog, egg, ant. Write the smallest of them (ant) to output.txt, remove it from temp3, and add the next item from that same file (gun) to the list. So the list now holds dog, egg, gun. The next smallest is dog: write it to output.txt and add fan (the next item from temp1) to the list. Repeat until all the elements are written.
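A common way to implement the k-way merge is with a min-heap holding one cursor per run. A sketch over in-memory lists (in the real program each run would be a file reader; names are mine):

```java
import java.util.*;

public class KWayMerge {
    // Merge k sorted runs into one sorted list using a min-heap keyed on
    // each run's current element. Heap entries are {run index, position}.
    static List<String> merge(List<List<String>> runs) {
        List<String> out = new ArrayList<>();
        PriorityQueue<int[]> pq = new PriorityQueue<>(
                Comparator.comparing((int[] e) -> runs.get(e[0]).get(e[1])));
        for (int i = 0; i < runs.size(); i++)
            if (!runs.get(i).isEmpty()) pq.add(new int[]{i, 0});
        while (!pq.isEmpty()) {
            int[] e = pq.poll();
            out.add(runs.get(e[0]).get(e[1]));      // emit the smallest head
            if (e[1] + 1 < runs.get(e[0]).size())   // advance within that run
                pq.add(new int[]{e[0], e[1] + 1});
        }
        return out;
    }
}
```

With k runs of n total elements this is O(n log k), which is why external sort scales.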

I first implemented this method to sort a 400 MB file on a 40 MB RAM. The time taken was 22 seconds.
This is definitely a gain in the space-complexity at the cost of increased time. Can we do it better?

Next I utilized Memory-Mapped Files (MMAP). For this we need to consider how Java does file IO. Data in the file system (secondary storage, or hard disk) has to be brought into memory (RAM). To do this, we can either read character by character or read a chunk of data at a time. The time taken by file IO is huge compared to CPU processing time, so it's better to reduce the frequency of file reads - hence we use a BufferedReader.
With MMAP, a file or a portion of a file can be mapped directly into the process's virtual address space. The operating system's virtual memory machinery pages the file data in and out for us, so the data can be read like an in-memory array without an explicit read system call for every access (system calls are orders of magnitude slower). You cannot memory-map a full 10 GB file into a small address window, so we map a portion of the file at a time. The other advantage of MMAP is that multiple processes can share the same mapped file. In Java, the classes for this are in the NIO package: FileChannel, whose map method returns a MappedByteBuffer.

With this approach, I memory-mapped the 400 MB file in 8 KB chunks. Read each 8 KB chunk and store its lines in a TreeSet; repeat 5000 times to accumulate 40 MB, and store that in a temporary file (giving 10 sorted temporary files). Then apply the k-way merge.
Using this approach, the program ran in 17 secs (a 5 sec gain). It's generally at least 20% faster than the BufferedReader.




Monday, December 3, 2012

Adaptive Spam Filtering algorithm

When we talk about spam, we generally mean emails. A spam mail is one sent to you as an email promotion or bulk mail, and in most cases you are not interested in receiving it. In the earlier days we had to go through each mail and identify whether it was spam or not: a mail which is not spam (called ham) we keep in the Inbox, and the spam we manually move to a junk folder. That is a lot of work, given that these days 50-60% of mail is spam. So a few algorithms came up to solve this issue, and the best of all is the Bayesian approach. It's an adaptive, machine-learning algorithm, and we will discuss the details below.

Classifying an email as spam or not cannot be done at the mail server; it needs to be done at the email client. For instance, let's say there are 2 users - A and B - where A works for a bank and B works as a pharmacist. A mail with the content "Reduce your mortgage loan" is spam for B but ham for A, and a mail "Solution for baldness" is spam for A but ham for B. When the recipient receives a mail he considers spam, he can "Mark it as Spam" - that is not a big issue. On the other hand, if he notices that a ham mail went into his spam folder, he can "Mark it as NOT Spam". This is an issue, as the mail might be an important one and he might miss it (since it's not showing in his inbox). So spam detectors should be careful not to mark ham as spam. Spam can be detected based on email content, email subject, sender email, recipient emails, etc. Let's see how this works.

In the industry we have collections of thousands of ham/spam emails which can be used to build our spam filter application. Download these emails into your data store, then run a job on them (map-reduce or batch) that goes through each email message and splits it into words. You might have to do additional tasks like removing special characters and quotes, converting to lower case, ignoring words shorter than 4 characters, ignoring common words, and keeping only words made up entirely of letters. Add each valid word into a HashMap as a key. The value in the map is a Node, where the Node class has 3 fields - spamCount, hamCount and probability. So if I read a word "XYZ" from a spam email and it is the first time I have encountered this word, its Node would have spamCount=1, hamCount=0. We will calculate probability after the map is constructed; note that the same word can appear in the ham list too. Every time a word is put in the map, increment a class-level counter totalSpam (or totalHam) by 1. After all the emails are read and the map is constructed, iterate over the map, and for each key get the spamCount and hamCount and calculate the probability using -

probability = (spamCount/totalSpam)/((spamCount/totalSpam) + (hamCount/totalHam))

Do this for all the keys. The probability is a floating point value between 0.0 and 1.0.
That completes the training step. Next is the filtering step.

An email comes from a sender "X". Again, get the words (as described above), and for each word look up its probability in the map. If the word doesn't exist in the map, it means the spam filter is not trained for this word; it could be a valid word, so give it a neutral value of 0.5. Then calculate the interest value I for each word as follows -

I = |0.5 - probability|

Once it is calculated for all the words, sort the I values in descending order (highest interest first) and take the top N values (say N=15). For these I values, get the corresponding probabilities p1, p2, p3, ..., p15, and calculate the total probability using the following formula

P = (p1*p2*p3..p15)/((p1*p2*p3..p15)  + ((1-p1)*(1-p2)*(1-p3)....(1-p15)))

This value will be between 0.0 and 1.0; the nearer it is to 0, the lower the chance of the mail being spam. So we mark anything equal to or greater than 0.9 as spam.
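The combining formula above is easy to get wrong with operator precedence, so here it is as a small helper (names are mine):

```java
public class SpamScore {
    // Combine the N most "interesting" word probabilities into one spam score:
    // P = (p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn))
    static double combined(double[] p) {
        double prod = 1.0, prodInv = 1.0;
        for (double pi : p) {
            prod *= pi;
            prodInv *= 1 - pi;
        }
        return prod / (prod + prodInv);
    }
}
```

A few spammy words (probabilities near 1.0) quickly push the score toward 1.0, while neutral 0.5 words leave it unchanged.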

Next comes the machine learning. It could happen that an email which was not marked spam is later found to be spam. You mark it as spam; to learn from that, add its words back into the map and calculate the probabilities again.

I have built a basic implementation which can be trained and can also learn. I created 3 files - ham.txt, spam.txt and common-words.txt. In this basic implementation, each line of the text file holds the content of one mail. In the sample data I set up, I want to filter all jobsite, lottery, and retirement emails. The spam filter gives the following output.
  1. 'quick and easy jobsite' is spam
  2. 'will there be a production release tomorrow' is not a spam
  3. 'plan your retirement soon' is spam
  4. 'you have won a state lottery claim your money now' is spam
  5. 'register with our jobsite and get your dream job and money now' is spam
  6. 'today with our jobsite making money is not all that hard' is not a spam
  7. 'today with our jobsite making money is not all that hard' is spam
  8. 'back to office today and started off coding' is not a spam
Note that 6 was initially classified as ham, the reason being that a few words like today, money, etc. are found in the ham list as well. But once I marked it as spam, the next time I received the same email (at 7), it was automatically traced as spam.


Saturday, December 1, 2012

FaceBook - Write a function that can show the number of users online at any given time

You're given a period of time when users log in and log out, and the set of login and logout times for those users. Facebook is asking you to create an efficient algorithm that will calculate a simple, but extremely important, metric.

Facebook currently has more than 1 billion users, and assume that each user is given a unique identifier like user_192. Hundreds of people log in (and log out) of FB every second. There is one audit log maintained explicitly to store the time of login/logout and the userid. A sample log is given below -

[10/11/2012 10:25:06] Login user_4
[10/11/2012 10:28:55] Login user_6
[10/11/2012 10:29:19] Logout user_4
[10/11/2012 10:36:33] Login user_8

Using the above information, we need to come up with the most efficient algorithm (fast, with low memory utilization) to calculate the users logged in at any given time. For instance in the above example, if I query for 10/11/2012 10:29:00, there were 2 users logged in: user_4 and user_6. Assume that there can be users who are logged into Facebook the entire day (and never log out), and also that all 1 billion users log in every day. Most importantly, we need to consider boundary conditions as well - somebody who logged in at 23:55:00 and logged out the next day at 01:17:12.

There were a few solutions available with time complexity O(n log n). However, I was thinking whether I could do it in one pass, which is O(n). I also didn't want to use the BitSet approach, as I would have to create a BitSet of size 1 billion every day. So I came up with this solution.

Maintain a map where the key is the date (10/11/2012). The value is an int[] of size 86400 (the total number of seconds in a day), with all values pre-initialized to -1. Maintain a counter initialized to 0. Now go through the log file; for each log entry, convert the time portion into seconds. If the activity is "Login", increment the counter; if "Logout", decrement it. Then go to that second's position in the array and set its value to the counter. In the figure above, the blue window represents the time user_7 is logged in, the green window represents the time user_4 is logged in, and the red window represents users who are logged in for 2 overlapping days. So what we have achieved is that in linear time we have created our data structure.

Now let's see how we can do the querying. Convert the time to seconds and get the corresponding value from the int array. If the value is not -1, that value is the result.
If the value is -1, navigate back from the current position of the array until you encounter a value which is not -1 (case a). While doing so, you might reach the beginning of the array without finding any value that is not -1 (case b). In case a, the value at that position is the result. In case b, we need to go back to the previous day's map and, in that map's int array, scan from the end of the array until we find a value that is not -1.
Let's see an example. In the figure above, at time 37411 we had 1 user and at time 37506 we had 2 users. If I query for time 37506, we can directly answer 2. If we query for 37500, we have 1 user. How did we arrive at this? At 37500 the value was -1, so we navigate left, and at 37411 we get the value 1. That is the result.
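A single day's structure and query can be sketched as below (names are mine; the per-date map and the previous-day fallback are omitted for brevity):

```java
import java.util.Arrays;

public class OnlineUsers {
    static final int SECS_PER_DAY = 86400;

    private final int[] counts = new int[SECS_PER_DAY]; // counts[s] = users online at second s
    private int counter = 0;                            // running login counter

    public OnlineUsers() {
        Arrays.fill(counts, -1);                        // -1 means "no event at this second"
    }

    // Process one log entry: event time in seconds since midnight, and its type.
    public void process(int second, boolean isLogin) {
        counter += isLogin ? 1 : -1;
        counts[second] = counter;
    }

    // Users online at `second`; -1 means "consult the previous day's array".
    public int query(int second) {
        for (int s = second; s >= 0; s--)
            if (counts[s] != -1) return counts[s];
        return -1;
    }
}
```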

Note: In a real-world scenario the user log would not be a single text file but a distributed file. Facebook has its own logging system, PTail, along with Puma, and the back end uses HBase over HDFS. So the log would be broken into multiple 64 MB blocks, and a map-reduce job runs on each of these pieces in parallel, building the above map.



Thursday, November 29, 2012

Real-time data analytics on Terabytes of data

Determine the 10 most frequent words given a terabyte of strings.

This is one of my favorite Facebook interview questions. There is a terabyte-size file comprising words. Scan the file and calculate the top 10 words in the file with respect to number of occurrences, listing them in descending order of occurrence. Also support searching the frequency of any word in the text. Further, implement the solution such that it can be used in machine learning: it should accept streams of data coming in real time and calculate the top 10 in real time (no batching allowed).

The output of the solution that we are going to discuss is shown below -

130 MB data
Loaded 14146404 words into Data Structure. Time taken = 18 sec

Top 10 words are:
[argument] found 1011599 times
[type] found 360234 times
[2012, main] found 337678 times
[values] found 337549 times
[debug, summary, options, auditing] found 337375 times
[return] found 336572 times
[2012] found 335377 times
[class] found 269462 times
[auto] found 123839 times
Word performance found 97 times

I generated this report on a 130 MB log file comprising 14 million words, and I applied some custom rules for frequency calculation - for example, variants of a word such as 'Algorithm', 'algorithm' and 'ALGORITHM' are all counted as occurrences of the same word. Then of course there is the most important part of loading the 130 MB of data and operating on it. Wow - all done in 18 sec! Further, I was able to query the frequency of any arbitrary word, and the search takes less than a millisecond.

Before discussing the solution, let's look at similar problems we see day to day. An e-commerce site displays top sellers, and a gaming site displays a leader board. Another case is network analysis: you are building a solution to capture the IP addresses of requesting machines in real time, and you must calculate, in real time, the IP of the machine with the highest hits. The traditional approaches of using a HashMap or an array of counters would not work here, since you would very soon run out of memory or end up with slow computation.

There is another approach to this: an Online Algorithm. This is a general term for algorithms where you build a structure once, and then in real time you can query it and get an immediate response. An example of an online algorithm is the Patricia Trie which I posted earlier.
So what do these do? They are very efficient in terms of computation and also in space. For instance, a Patricia Trie uses only 2n-1 nodes and is a binary trie. At the same time, its processing time doesn't depend on the size of the stored data, only on the length of the key being looked up. So can we use these online algorithms for real-time data analytics? Maybe not. If you want to build an application that goes through a huge novel (for instance, the complete works of William Shakespeare, the Bible, etc.) and prints the frequency of each word, they are good. But we need to look beyond this.

Let's take the example of the novel and address it. Go through the novel line by line and fetch each word. Assume that you have 3 hash functions - h1, h2, h3. Each of them takes a word as input and returns an int. Ideally the hash functions should be such that no two words produce the same value (all distinct), and here the values are between 0 and 9. Now create 3 int arrays of size 10 each, all initialized to 0. Let's say we computed the hashes for a string "xyz" and got the below.

h1("xyz") = 8
h2("xyz") = 6
h3("xyz") = 9

So in the 3 arrays, increase the value by one at the positions returned above. So we would have below

array1 [ 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
array2 [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
array3 [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

Now proceed to next word and do the same.

h1("abc") = 7
h2("abc") = 3
h3("abc") = 9

So in this case, h3("abc") clashes with h3("xyz"). We will come back to this in a while. Set the table values.

array1 [ 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
array2 [ 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
array3 [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]

Assume that the next word is again "xyz". So we would get the table below

array1 [ 0, 0, 0, 0, 0, 0, 0, 1, 2, 0]
array2 [ 0, 0, 0, 1, 0, 0, 2, 0, 0, 0]
array3 [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 3]

To calculate the frequency of "xyz", again calculate h1, h2, and h3, go to the tables, and take the minimum value at those positions. We have 2, 2 and 3, so the minimum, 2, is the answer: the value "xyz" was found 2 times. Similarly for "abc" we get 1, 1, 3, and the minimum, 1, is the answer. It's been found that with a proper choice of hash functions the collisions can be reduced considerably, and increasing the array size from 10 to a bigger number makes the possibility of conflicts remote. Algorithms of this type are called streaming algorithms (used for "heavy hitters" problems); this particular one is the Count-Min Sketch algorithm. They are used by Google, Facebook, Twitter and many more, and Big Data solutions build on them. The Bloom filters used in HBase are a closely related sketch.
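The tables-plus-minimum idea generalizes directly to a Count-Min sketch with configurable depth (number of hash functions) and width (row size). A sketch under my own naming, with seed-perturbed, bit-mixed hashes standing in for the independent hash functions:

```java
import java.util.Random;

public class CountMinSketch {
    private final int depth, width;
    private final long[][] table;   // one row of counters per hash function
    private final int[] seeds;

    CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random r = new Random(42);
        for (int i = 0; i < depth; i++) seeds[i] = r.nextInt();
    }

    // i-th hash of s: seed-perturbed, bit-mixed, reduced to a column index.
    private int hash(String s, int row) {
        int h = s.hashCode() ^ seeds[row];
        h ^= h >>> 16; h *= 0x85ebca6b; h ^= h >>> 13;
        return Math.floorMod(h, width);
    }

    void add(String s) {
        for (int i = 0; i < depth; i++) table[i][hash(s, i)]++;
    }

    // Estimated count: the minimum across rows. Collisions can only inflate
    // counters, so the sketch never underestimates a frequency.
    long estimate(String s) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) min = Math.min(min, table[i][hash(s, i)]);
        return min;
    }
}
```

Memory is just depth x width counters, independent of how many words stream through.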

The beauty of this algorithm is that you don't need to store the input, and you can compute the result in one pass, so both time and space complexity are remarkably low. Even though this is a probabilistic approach, the chance of error can be brought down to around 1 in a million - still good, given the performance gain.

Now this algorithm is good, but how can we compute the top 10 words when we are not storing the actual words? For this I did a workaround: I used a MaxHeap to compute the top 10, keeping around the top 1000 occurrences to support a future requirement. In the code you might notice that I have not implemented a MaxHeap class at all; I wanted to show that the MaxHeap behavior can be mimicked with a simple HashMap too (really simple).

Note: what can this algorithm not do? If we want to print all words and their frequencies, it's not possible, since we are not storing the words themselves. I have also built solutions using online algorithms (a Patricia Trie) and traditional algorithms (HashMap with a linked list). If you are interested in these, please email me at rajur79@gmail.com


Tuesday, November 27, 2012

LRU Cache Implementation - O(1)

An LRU Cache is a well-known data structure used in many applications, like ehcache (used by Hibernate), web caching (to cache the most frequently used pages), disk-cleanup utilities, etc. The main concept here is that you need to build a cache which holds key/value pairs. The cache has a predefined size; if the number of key/value pairs exceeds the size, the least recently used items are flushed out.

For instance, say I added items a, b, c, d, e in that order and the size limit is 5. If I now want to add f, I have to flush out a (since it is the least recently used). So I would have b, c, d, e, f. Now if I access c (either update or get), it becomes the most recent item, so the order would be b, d, e, f, c. We should support a remove operation as well.

Java provides the LinkedHashMap class, which can be used to implement an LRU cache (with accessOrder set to true). However, its O(1) time is only amortized: hashing is not collision-free, so an access may have to walk a bucket's chain. So we will redesign this cache. I have not added multi-threading capability to my implementation, and it is not templatized.

In my earlier post I discussed how to build a perfect HashMap (with no collisions). Today we will use that HashMap and slightly modify the structure to achieve LRU. The HashMap stores key/value pairs; for now I am using Integer keys and String values. Let's call this key/value class "Pair". The HashMap is nothing but an array of Pairs. As per our requirement, whenever a key is accessed (update/get) we need to make it the most recent. So we need a linked list of entries where an accessed entry is brought to the front. But if the list held only keys, I would have to traverse all the keys to find a specific one. So I modified the Pair class as below

class Pair {
    Integer key;
    Node val;
}

class Node {
    String val;
    Node left;
    Node right;
}
So I can store the value in the node, and the same node serves as a doubly-linked-list node.

Whenever a new key/value is added, the value node is added at the head. When the cache is full, the key and the value node at the tail are removed. Whenever a key is accessed, the corresponding node is brought to the head.
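A self-contained sketch of the same head/tail mechanics. One deliberate simplification: I back it with java.util.HashMap rather than the perfect HashMap from the earlier post, so the O(1) here is amortized; the list manipulation is identical either way.

```java
import java.util.HashMap;

public class LruCache {
    private static class Node {
        int key; String val; Node prev, next;
        Node(int k, String v) { key = k; val = v; }
    }

    private final int capacity;
    private final HashMap<Integer, Node> map = new HashMap<>();
    private Node head, tail;                 // most recent at head, LRU at tail

    LruCache(int capacity) { this.capacity = capacity; }

    String get(int key) {
        Node n = map.get(key);
        if (n == null) return null;
        moveToHead(n);                       // accessed -> most recent
        return n.val;
    }

    void put(int key, String val) {
        Node n = map.get(key);
        if (n != null) { n.val = val; moveToHead(n); return; }
        if (map.size() == capacity) {        // full -> evict the tail (LRU)
            map.remove(tail.key);
            unlink(tail);
        }
        n = new Node(key, val);
        map.put(key, n);
        addToHead(n);
    }

    private void unlink(Node n) {
        if (n.prev != null) n.prev.next = n.next; else head = n.next;
        if (n.next != null) n.next.prev = n.prev; else tail = n.prev;
    }

    private void addToHead(Node n) {
        n.prev = null; n.next = head;
        if (head != null) head.prev = n; else tail = n;
        head = n;
    }

    private void moveToHead(Node n) { unlink(n); addToHead(n); }
}
```

Every operation is a map lookup plus a constant number of pointer updates.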


Sunday, November 25, 2012

Multi-Player Snake And Ladder Game using Java

A bit of fun using Java to build a snake and ladder game. We are given the positions of the snakes and the ladders. Initialize the board by setting 4 arrays - snakeStart, snakeEnd, ladderStart, ladderEnd. Once that is set, input the total number of players (N) and initialize a score array of int[N]. In round-robin fashion, starting with player 1 through player N, throw the dice - this is nothing but generating a random number between 1 and 12. The value obtained is added to score[i]. To check whether the new position lands on a ladder or a snake, do a binary search on ladderStart and snakeStart; if it does, the player moves to the corresponding position from ladderEnd or snakeEnd. If the total reaches 100, declare the winner :)


Saturday, November 24, 2012

Interviewstreet Challenge: Median


The median of M numbers is defined as the middle number after sorting them, if M is odd, or the average of the middle 2 numbers (again after sorting) if M is even. You have an empty number list at first. Then you can add or remove numbers from the list. For each add or remove operation, output the median of the numbers in the list.
Example: for a set of m = 5 numbers { 9, 2, 8, 4, 1 }, the median is the third number in the sorted set { 1, 2, 4, 8, 9 }, which is 4. Similarly for a set of m = 4, { 5, 2, 10, 4 }, the median is the average of the second and third elements in the sorted set { 2, 4, 5, 10 }, which is (4+5)/2 = 4.5
The first line is an integer n indicating the number of operations. Each of the next n lines is either "a x" or "r x", indicating whether the operation is add or remove.
For each operation: if the operation is add, output the median after adding x in a single line. If the operation is remove and the number x is not in the list, output "Wrong!" in a single line. If the operation is remove and the number x is in the list, output the median after deleting x in a single line. (If the result is an integer, DO NOT output a decimal point. If the result is a double, DO NOT output trailing 0s.)
0 < n <= 100,000
for each "a x" or "r x" , x will fit in 32-bit integer.
Sample Input:
r 1
a 1
a 2
a 1
r 1
r 2
r 1
Sample Output:
Wrong!
1
1.5
1
1.5
1
Wrong!
Note: As evident from the last line of the input, if after a remove operation the list becomes empty, you have to print "Wrong!" (quotes are for clarity)

All Java solutions on InterviewStreet need to run within 5 seconds with a maximum of 256 MB RAM available. That means our solution can never be brute-force or O(n^2), and the same applies to the space complexity.

Median can be calculated easily if the data is already sorted. However in this case the data is not sorted, we can add and remove data in any order. 
So we are left with 2 options - 
  1. keep the data sorted at all times
  2. use Heaps.
If we keep the data sorted at all times, we need to ensure that whenever the user adds a value, we find the appropriate position and insert it there. You might be wondering why I didn't use a TreeSet: the problem is that a Set doesn't allow duplicates, and in this case I want to allow duplicates. We also have to avoid a sequential O(n) search when adding a value, so we use a variant of binary search - find the number, or the largest number less than it. I have already discussed this approach in my earlier postings. So what I did is build a sorted ArrayList class where add/remove locate positions with binary search rather than sequentially. This saved a lot of time: the worst case took 3 sec against the 5 sec benchmark.
Next comes the median calculation. Once we have the sorted ArrayList, the median is the middle number if the list size is odd. Else we take the numbers at (size/2 - 1) and (size/2), whose average is the median. Use BigIntegers (or longs) for the sum of the two middle values, so the intermediate result doesn't overflow Integer.MAX_VALUE.

Maintain the data in two heaps: one max-heap and one min-heap. Their sizes should be equal if the total number of elements is even; otherwise they differ by 1. I have given code for the median solution in my earlier postings. The main difficulty with this approach is that when data is added or removed we must rebalance the heap sizes, which may mean moving an element from one heap to the other.
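A rough sketch of the two-heap variant using `java.util.PriorityQueue` (again illustrative, not the original code; names are mine). Note that `PriorityQueue.remove(Object)` is O(n), so a contest-grade version would pair this with an indexed heap or a hash of positions.

```java
import java.util.Collections;
import java.util.PriorityQueue;

// Sketch: 'low' is a max-heap with the smaller half, 'high' a min-heap
// with the larger half; after every operation their sizes differ by <= 1.
class RunningMedian {
    private final PriorityQueue<Integer> low = new PriorityQueue<>(Collections.reverseOrder());
    private final PriorityQueue<Integer> high = new PriorityQueue<>();

    public void add(int x) {
        if (low.isEmpty() || x <= low.peek()) low.add(x); else high.add(x);
        rebalance();
    }

    public boolean remove(int x) {
        // everything in 'high' is >= low.peek(), so x's heap is determined
        boolean ok = (!low.isEmpty() && x <= low.peek()) ? low.remove(x) : high.remove(x);
        if (ok) rebalance();
        return ok;
    }

    private void rebalance() {
        if (low.size() > high.size() + 1) high.add(low.poll());
        else if (high.size() > low.size()) low.add(high.poll());
    }

    public double median() {
        if (low.isEmpty()) throw new IllegalStateException("empty");
        if (low.size() > high.size()) return low.peek();
        // widen to long before averaging to avoid int overflow
        return ((long) low.peek() + high.peek()) / 2.0;
    }
}
```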


Interviewstreet Challenge: Flowers


You and your K-1 friends want to buy N flowers. Flower number i has cost ci. Unfortunately the seller does not like a customer buying a lot of flowers, so he raises the price for customers who have bought flowers before. More precisely, if a customer has already bought x flowers, he must pay (x+1)*ci dollars to buy flower number i.
You and your K-1 friends want to buy all N flowers while spending as little money as possible.

The first line of input contains two integers N and K.
The next line contains N positive integers c1, c2, ..., cN.

Print the minimum amount of money you (and your friends) have to pay in order to buy all n flowers.
Sample Input :
3 3
2 5 6
Sample output :
13
Explanation :
In the example, each of you and your friends should buy one flower; in this case you have to pay 2 + 5 + 6 = 13 dollars.
Constraint :
1 <= N, K  <= 100
Each ci is at most 1,000,000.
All Java solutions on InterviewStreet need to run within 5 seconds and 256 MB of RAM. That means our solution can never be brute-force or O(n^2); the same applies to the space complexity.

The task is to minimize the total cost. Each purchase costs (x+1)*ci, so we want the multiplier (x+1) to be small exactly when ci is large. This means the flowers should be spread uniformly: we cannot have person A buying 3 flowers while person B buys only 1. So we use a round-robin scheme where each person buys one flower per round. And since the multiplier grows with every round a person completes, the expensive flowers should be bought in the early rounds and the cheap ones later.
So we arrive at a solution: sort the costs and buy round-robin, most expensive first.

For instance, suppose the costs are 9, 6, 8, 7 and there are 3 buyers. Then one person has to buy 2 flowers and the other two buy one each. The person who buys 2 flowers should buy the cheapest flower (cost 6) as the second one. That gives us:
Person A = 9, 6
Person B = 8
Person C = 7
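The whole greedy fits in a few lines. This is an illustrative sketch (the class and method names are mine): sort ascending, then note that the i-th most expensive flower is bought in round i/K of the round-robin, so its multiplier is i/K + 1.

```java
import java.util.Arrays;

// Sketch of the greedy: sort costs, assign each flower the multiplier
// implied by its round in a most-expensive-first round-robin.
class Flowers {
    static long minCost(int[] costs, int k) {
        int[] c = costs.clone();
        Arrays.sort(c);                    // ascending
        long total = 0;
        int n = c.length;
        for (int i = 0; i < n; i++) {
            // c[i] is the (n-1-i)-th most expensive flower, bought in round (n-1-i)/k
            int round = (n - 1 - i) / k;
            total += (long) (round + 1) * c[i];
        }
        return total;
    }
}
```

On the worked example above (costs 9, 6, 8, 7 with 3 buyers) this yields 9 + 8 + 7 + 2*6 = 36.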


Thursday, November 22, 2012

Interviewstreet Challenge: Even Tree


You are given a tree (a simple connected graph with no cycles). You have to remove as many edges from the tree as possible to obtain a forest, with the condition that each connected component of the forest contains an even number of vertices.
Your task is to calculate the number of removed edges in such a forest.

The first line of input contains two integers N and M. N is the number of vertices and M is the number of edges. 2 <= N <= 100.
The next M lines each contain two integers ui and vi specifying an edge of the tree (1-based indices).

Print a single integer which is the answer
Sample Input 
10 9
2 1
3 1
4 3
5 2
6 1
7 2
8 6
9 8
10 8
Sample Output :
2
Explanation : On removing the edges (1, 3) and (1, 6), we get the desired result.
Original tree:

Decomposed tree:

Note: The tree in the input will be such that it can always be decomposed into components containing an even number of nodes.

All Java solutions on InterviewStreet need to run within 5 seconds and 256 MB of RAM. That means our solution can never be brute-force or O(n^2); the same applies to the space complexity.

Well, this is an interesting problem. Initially it appears to be a problem about n-ary trees, but building a tree and performing the decomposition on it is time-consuming. So I took a Map-based approach, and all 10 of my test cases ran in about 1 second.
Create a map from each parent to the list of its children. Once this map is created, take the root and go through its children: node 1 has children 2, 3 and 6. Take 2 first and check whether the number of nodes under it is even or odd (I use a recursive call to get the count). Node 2 has only two nodes under it, 7 and 5, so the subtree rooted at 2 has 3 nodes (2, 7 and 5), an odd number. So we cannot remove the link 1-2, since that would leave a component of 3 nodes. Don't increment the counter; instead remove 2 from 1's child list and add 7 and 5 to it.
Now take 6. Its subtree contains 4 nodes (6, 8, 9 and 10), an even number, so the link 1-6 can be removed. Increment the counter, remove 6, and add 8, 9 and 10 to the list.
The next entries 7, 5 and 4 have no children. Continue to 8: its subtree has 3 nodes (8, 9 and 10), an odd number, hence its link cannot be removed.

Now take the next element in the list, 3. It has only 1 child, so its subtree contains 2 nodes (3 and 4), an even number. That means we can break the 1-3 link: increment the counter, remove 3, and add 4 to the list.
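The counting above boils down to one observation: every non-root vertex whose subtree has an even number of nodes contributes one removable edge (its edge to its parent). A sketch of that idea, using an adjacency list and a single DFS instead of the parent map (names are mine):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: one DFS computes every subtree size; each non-root vertex
// with an even subtree size means its parent edge can be cut.
class EvenTree {
    static int removableEdges(int n, int[][] edges) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i <= n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) {
            adj.get(e[0]).add(e[1]);
            adj.get(e[1]).add(e[0]);
        }
        int[] count = new int[1];
        size(1, 0, adj, count);          // vertex 1 is the root, 0 = no parent
        return count[0];
    }

    private static int size(int v, int parent, List<List<Integer>> adj, int[] count) {
        int s = 1;
        for (int c : adj.get(v)) if (c != parent) s += size(c, v, adj, count);
        if (parent != 0 && s % 2 == 0) count[0]++;   // even subtree: cut edge to parent
        return s;
    }
}
```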


Interviewstreet Challenge: String Similarity


String Similarity (25 Points)
For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.
Calculate the sum of similarities of a string S with each of its suffixes.
The first line contains the number of test cases T. Each of the next T lines contains a string each.
Output T lines containing the answer for the corresponding test case.
1 <= T <= 10
The length of each string is at most 100000 and contains only lower case characters.
Sample Input:


All Java solutions on InterviewStreet need to run within 5 seconds and 256 MB of RAM. That means our solution can never be brute-force or O(n^2); the same applies to the space complexity.
So the length of each string can be up to 100,000 and there are up to 10 test cases, all of which must run within 5 seconds. The hint here is that you cannot afford heavyweight String manipulation: repeated String.indexOf or String.charAt calls incur extra time. A better option is to convert the String to a character array and work on that.
The next hint is that instead of creating multiple substrings, we treat each suffix as a virtual substring of the original. If the string is "raju", I don't need to build a separate substring "aju": I can use two pointers, one at position 0 of "raju" and the other at position 1. If the characters match, increment the counter and move both pointers forward.
Take the string "ababaa". The first suffix is the original string itself, "ababaa", which matches on all characters.
Next we compare "ababaa" with "babaa": check index 0 against index 1; 'a' and 'b' don't match, so stop here.
Next is "abaa": check index 0 against index 2 ('a' and 'a'); they match, so increment the counter. Then index 1 against index 3 ('b' and 'b'); increment again. The next pair also matches, but after that 'b' does not match 'a', so stop here.
Continue this for the remaining suffixes.
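The two-pointer idea can be sketched as below (class and method names are mine). Fair warning: the worst case is still O(n^2) per string, e.g. for "aaaa..."; a guaranteed bound needs the Z-algorithm, but this version at least avoids all substring allocation, which is the point made above.

```java
// Sketch: for each suffix start i, walk j while s[j] == s[i + j];
// no substring objects are ever created.
class Similarity {
    static long sum(String str) {
        char[] s = str.toCharArray();
        long total = 0;
        for (int i = 0; i < s.length; i++) {
            int j = 0;
            while (i + j < s.length && s[j] == s[i + j]) j++;
            total += j;   // similarity of str with its suffix starting at i
        }
        return total;
    }
}
```

For "ababaa" the per-suffix similarities are 6, 0, 3, 0, 1, 1, giving 11.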


Interviewstreet Challenge: Pairs



Given N numbers [N <= 10^5], we need to count the total pairs of numbers that have a difference of K [K > 0 and K < 10^9].

Input Format:
1st line contains N & K (integers).
2nd line contains N numbers of the set. All the N numbers are assured to be distinct.
Output Format:
One integer giving the number of pairs that have a difference of K.

Sample Input #00:
5 2
1 5 3 4 2

Sample Output #00:
3

All Java solutions on InterviewStreet need to run within 5 seconds and 256 MB of RAM. That means our solution can never be brute-force or O(n^2); the same applies to the space complexity.

The array contains up to 100,000 numbers and we need to find the pairs whose difference is K. The clue here is that if the numbers are left unsorted, we are stuck with the brute-force approach: checking each number against every other number to see if they make a pair. That proves costly and would blow the 5-second budget.
So the simplest solution is to sort the array using any built-in sort method, then apply binary search: take each number x and search for (x + k). If it's found, they make a pair. Then take the next number in the sorted array, and so on.
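The sort-plus-binary-search step can be sketched as below (names are mine). Since the N numbers are guaranteed distinct, each successful lookup is exactly one pair; the sketch assumes x + k fits in an int, otherwise widen the array to long.

```java
import java.util.Arrays;

// Sketch: O(n log n) sort, then one binary search per element.
class Pairs {
    static int countPairs(int[] a, int k) {
        int[] s = a.clone();
        Arrays.sort(s);
        int count = 0;
        for (int x : s) {
            // assumes x + k does not overflow int
            if (Arrays.binarySearch(s, x + k) >= 0) count++;
        }
        return count;
    }
}
```

On the sample (1 5 3 4 2 with K = 2) the pairs are (1,3), (2,4) and (3,5), so the answer is 3.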


Tuesday, November 20, 2012

5 Solutions for Search

Search has been a favorite topic of late, and there are lots of algorithms around it, largely due to the rise in data generated by systems and the increase in storage capacity: we need ever faster and more efficient ways to search. Today we will discuss search in depth and look at multiple algorithms. I am limiting this post to 5 solutions, but there are plenty of others not discussed here.

Types of Search
  1. Fixed Text Base and Variable Pattern - In this type of search, we have a more or less fixed data set on which we repeatedly search, but each time we search for a different pattern. This is more or less like a web search. Let's take Google as an example: there is a huge number of documents to be searched, so the crawlers index the web and build a data structure. The data set is now fixed, and our searches operate on it.
  2. Variable Text Base and Fixed Pattern - In this type of search, the data set changes regularly but we search for the same pattern every time. This is employed in censoring applications, e.g. a news feed: the feed arrives continuously, and on it we check for any abusive words.
  3. Variable Text Base and Variable Pattern - Here both the data and the queries keep changing. This is the common case.
Solution:1 - Naive or Brute-Force approach
This is the simplest solution and the least effective. We are given a string S of length n and a pattern P of length m. We must scan S and report every position where P occurs, printing the indexes in increasing order. This is somewhat like printing every s.indexOf(p).
e.g. S:   thisishiscar
      P:   hiscar
In this approach, we keep two variables i and j, initialized to 0. If the characters at i and j are the same, save position i in another variable k, then advance i and j as long as the characters match. In our example, starting at the 'h' at index 1, "his" matches but the next characters ('i' and 'c') do not.
At that point we reset j to 0 and set i to (k+1): position k failed, so we must restart from the next position.
This approach is definitely not recommended. Let's derive its time complexity (for readers not familiar with Big-O). In the worst case we compare each character of pattern P against each character of string S, so there are m*n comparisons and the complexity is O(m*n). Concretely, if one comparison takes 1 ms, then searching for a pattern of length 5 in a string of length 100 takes 500 ms.
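The scan just described can be written compactly (an illustrative sketch; names are mine). Rather than resetting i and j explicitly, the outer loop fixes the candidate start k and the inner loop plays the role of the matching j:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the O(n*m) brute-force scan: try every start position k,
// compare the pattern character by character.
class NaiveSearch {
    static List<Integer> occurrences(String s, String p) {
        List<Integer> hits = new ArrayList<>();
        for (int k = 0; k + p.length() <= s.length(); k++) {
            int j = 0;
            while (j < p.length() && s.charAt(k + j) == p.charAt(j)) j++;
            if (j == p.length()) hits.add(k);   // full pattern matched at k
        }
        return hits;
    }
}
```

On the example above, searching "hiscar" in "thisishiscar" reports index 6.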


Alternate solution: to avoid comparing every character of the pattern with every character of the text, there is the KMP algorithm. The catch is that we must precompute the failure function for each pattern.

Solution:2 - Java List Approach
Now let's make the problem slightly richer. Say we are given a list of names, each consisting of a FirstName and a LastName, and we must search for a name (either first name or last name). The program should print all names that match.
I used a text file available on the web consisting of 151,671 names, available at http://stavi.sh/downloads/names.txt. Download it to your local system; I stored it in the "C:\\Apps" folder. We will use this file for the remaining approaches.
In this approach, store all the names in an ArrayList. When the user searches for a name "xyz", go through the entire list and check whether a value starts with "xyz " (note the space at the end) or ends with " xyz" (note the space at the beginning). This is because we want an exact match.
So what is the time complexity of this approach? The same as approach 1: we scan the entire list and compare the pattern against each entry. If we have N words, the longest word has length K, and the pattern has length M, the complexity is O(N*M*K): for each of the N words we spend O(K*M) on comparison.

The program loaded the 151671 words in 77 ms and searched in 94 ms.


Solution:3 - Java List and Map Approach
So how can we speed up approach 2? Assume all names contain only the 26 English letters. Instead of one list, we keep a list of size 26: any first name or last name starting with 'A' goes to list entry 0, one starting with 'B' to entry 1, and so on. At each entry we maintain a LinkedHashMap mapping the first name or surname to the index of the name in the text file. So if "Anand Chari" is at position 4 of the file, we store ("Anand", 4) in list entry 0 and ("Chari", 4) in list entry 2. Once this structure is built, to search we take the first character of the key, go to the matching list entry, and search only there, not in the other 25 entries.

So have we made the search 26 times faster? Roughly, yes, but the map adds overhead, and we still scan every name sharing the first character. So the order is O(N*M*K/d), where d is the character-set size, 26 in this case.

The program loaded the 151671 words in 769 ms and searched in 3 ms.
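A sketch of the 26-bucket structure (the class name and the whitespace tokenization are my own choices; it assumes names contain only the letters a-z, as stated above):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: 26 buckets keyed by first letter; each bucket maps a
// first/last name token to the line numbers where it occurs.
class BucketIndex {
    private final List<Map<String, List<Integer>>> buckets = new ArrayList<>();

    BucketIndex() {
        for (int i = 0; i < 26; i++) buckets.add(new LinkedHashMap<>());
    }

    void add(String fullName, int line) {
        for (String token : fullName.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            Map<String, List<Integer>> b = buckets.get(token.charAt(0) - 'a');
            b.computeIfAbsent(token, t -> new ArrayList<>()).add(line);
        }
    }

    List<Integer> search(String token) {
        token = token.toLowerCase();
        return buckets.get(token.charAt(0) - 'a').getOrDefault(token, List.of());
    }
}
```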


Solution:4 - TRIE
We will now introduce the TRIE data structure. Its basic structure is shown below.

There are 2 ways of implementing a TRIE: as a character array or as a linked list. In the character-array TRIE, each node stores a value, a marker indicating whether it is an end node (double circle), and its children, which are again TRIE nodes. The number of children is 'd', the character-set size: 26 for English, or 256 for a full byte-based character set. So this wastes a lot of space.
To overcome this, we can use the linked-list TRIE, in which a node keeps only the children actually in use, avoiding the extra space requirement. In exchange we lose the indexing capability and must scan the child list sequentially; keeping the list sorted makes this faster.
In this approach, we store the firstname/lastname and the position of each occurrence of the name in the file.
Search is a simple walk down the tree. For instance, to search for "acd" we go to node "a", step along its child list past "b" to reach "c", and then go down to "d". So we touched only 4 nodes. What is the time complexity? O(m*d), because in the worst case we traverse up to d sibling nodes for each of the m characters of the pattern.

The program loaded the 151671 words in 1 second and searched in 0 ms. As you can see, the build time increased, but the search is really fast.
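A minimal array-based TRIE can be sketched as below (lower-case a-z only; class and field names are mine). For the linked-list variant described above, the fixed `child` array would become a sorted list of child nodes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a 26-way trie; a terminal node stores the line numbers
// of the word's occurrences, which also serves as the end marker.
class Trie {
    private final Trie[] child = new Trie[26];
    private List<Integer> lines;   // null unless a word ends at this node

    void add(String word, int line) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.child[i] == null) node.child[i] = new Trie();
            node = node.child[i];
        }
        if (node.lines == null) node.lines = new ArrayList<>();
        node.lines.add(line);
    }

    List<Integer> search(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            node = node.child[c - 'a'];
            if (node == null) return List.of();   // path does not exist
        }
        return node.lines == null ? List.of() : node.lines;
    }
}
```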


The problem with TRIEs is the space requirement: characters are stored at each node, so the number of nodes can approach d*N even with the linked-list approach. That means up to 151671*26 nodes just for the English alphabet. To address this we have the PATRICIA (Practical Algorithm To Retrieve Information Coded In Alphanumeric) TRIE: with N keys, the total number of nodes is always (2*N - 1).
It is a difficult algorithm to explain, but I will do my best. It works on the binary representation of the strings.
The ASCII value of 'A' is 65, which in binary is "0100 0001".
So the string s = "AC" can be represented as below:
0100 0001 0100 0011

Lets say we want to add 3 strings "THAT", "THEM" and "THEY" to the tree.

THAT     0101 0100 0100 1000 0100 0001 0101 0100
THEM    0101 0100 0100 1000 0100 0101 0100 1101
THEY     0101 0100 0100 1000 0100 0101 0101 1001

So the words THAT and THEM differ at bit position 21 (counting from 0, left to right). Similarly, THEM and THEY differ at position 27. So we make a binary tree like this:

Notice that to store 3 keys we have used 5 nodes (2*3 - 1).

In "THAT" the 21st bit is 0, which is why it is on the left. For "THEM" and "THEY" that bit is 1, which is why they are on the right. "THEM" and "THEY" differ at bit 27: THEM has 0 at that position and THEY has 1.
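The bit position a PATRICIA node stores can be computed with one XOR per character. A sketch (the helper name is mine), using the same 0-based left-to-right numbering as the THAT/THEM example and assuming 8-bit ASCII characters:

```java
// Sketch: find the first bit position (0-based, MSB-first) at which the
// 8-bit ASCII encodings of two strings differ; -1 if equal over the
// common length.
class BitDiff {
    static int firstDifferingBit(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            int x = a.charAt(i) ^ b.charAt(i);   // non-zero iff chars differ
            if (x != 0) {
                // highest set bit of the byte-sized XOR is the differing bit;
                // numberOfLeadingZeros is over 32 bits, so subtract 24
                return i * 8 + Integer.numberOfLeadingZeros(x) - 24;
            }
        }
        return -1;
    }
}
```

It reproduces the positions quoted above: 21 for THAT/THEM and 27 for THEM/THEY.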

So we build a tree like this, and search is very simple. There are several other reasons for choosing this data structure: it supports proximity search, approximate search, regex search, substring search, and so on.

The program loaded the 151671 words in 2 sec and searched in 1 ms. The increase in build time is because I use string-manipulation functions to convert each String to binary; I haven't optimized that yet.

So what did we gain? A much lower memory requirement and a very fast search.


Thursday, November 15, 2012

Solution for Probability using Tree

Probability is one area that's most often ignored, yet there are very interesting problems in it, such as tossing coins or picking balls. Why do people ask such questions in interviews? If we observe carefully, a coin toss is Head/Tail, i.e. binary, so these problems can be solved with binary algorithms. We will discuss how, using a special type of complete binary tree called a "probability tree".
Let's say I toss a coin n times and get the following sequence: HTHTHHTTTHHHTHTH
Here H denotes Head and T denotes Tail. With this data, can you predict the probability of a series of 1, 2 or 3 coin tosses? For example, what is the probability that my first toss results in H? Or the probability of getting Head first and Tail second? Or of getting HHT? All of these can be answered with a probability tree.
For our purposes we will consider series of at most 3 tosses. Treat H as '0' and T as '1'. Read the input as sliding windows of 3 tosses, starting from the left. Since there are 16 tosses in total, we get 14 windows of length 3. They are


What we have done is read 3 at a time and then dropped the first element: (1,2,3), (2,3,4), (3,4,5), and so on.

Now prepare a 3-level complete binary tree, with left branches denoting 'T' and right branches denoting 'H'. Initialize the count of every node to 0, then feed in the 14 windows one by one. The first is HTH, so we go right, left, right, incrementing each node's count as we pass. Do this for all 14 windows. At level 1 the left node now has count 6 and the right node 8, totalling 14, so the probability of the left node is 6/14 and of the right 8/14. Store these and move down the levels: for each node, use its two children's counts to set their probabilities. That's all, we are done!!

Now, to find the probability of an event, say "H", we go directly to the right node and read its probability: nearly 0.57.
Lets say we want to find the probability of "TT". Its 0.43*0.33.
Probability of "TTH" is 0.43*0.33*0.5.

Note that the probabilities of all the leaf-node paths together sum to 1.
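The same estimates can be reproduced without building the tree explicitly, by counting sliding windows directly (an illustrative sketch; the class name is mine). Each factor in the product corresponds to exactly one node probability in the tree, and it assumes the queried prefix was actually observed (otherwise a count is zero).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: estimate P(tosses == prefix) as a chain of conditional
// probabilities over sliding windows, mirroring the probability tree.
class TossModel {
    private final List<String> windows = new ArrayList<>();

    TossModel(String tosses, int depth) {
        for (int i = 0; i + depth <= tosses.length(); i++)
            windows.add(tosses.substring(i, i + depth));
    }

    double probability(String prefix) {
        double p = 1.0;
        for (int len = 1; len <= prefix.length(); len++) {
            String want = prefix.substring(0, len);      // child node
            String seen = prefix.substring(0, len - 1);  // parent node
            int matchWant = 0, matchSeen = 0;
            for (String w : windows) {
                if (w.startsWith(seen)) matchSeen++;
                if (w.startsWith(want)) matchWant++;
            }
            p *= (double) matchWant / matchSeen;         // node probability
        }
        return p;
    }
}
```

On the post's sequence this gives P("H") = 8/14 and P("TT") = 6/14 * 2/6, matching the 0.57 and 0.43*0.33 figures above.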


Wednesday, November 14, 2012

Algorithm used by BZip, JPeg compression

We all use WinZip, JPEG and other compression utilities, and the space gained by compression is really high. We will discuss the underlying implementation behind them.

In Java, the char primitive type is stored as 2 bytes (16 bits). ASCII values, however, are 7 bits, so there are 128 ASCII characters in total. In ASCII, 'a' is 97 and 'A' is 65.
Refer to the link for the complete list

So in Java, if I want to store the string "cat", it takes 2*3 = 6 bytes (or 48 bits).

Imagine you are writing a utility that loads an English novel into memory. Storing that much data naively, we would most probably run into memory issues. So what is the alternative?
Huffman proposed an algorithm for compression that supports both encoding and decoding.
Let's say our text is "bcbbbbbbaacaabbcade". It contains 19 letters, so we need 19*16 = 304 bits to represent this string.

Consider the input string and count how many times each character appears, storing the counts in ascending order of frequency. We get the list below:


What is the most efficient way to count? Our character-set size is 256, so initialize an integer array of size 256, read the input string character by character, and increment the count at the character's index. With an array arr[256] we end up with arr[97] = 5, arr[98] = 9, and so on.
So the time-complexity is O(n) where n is the length of the input string.

Now take the two least frequent characters, (e,1) and (d,1), and make them the child nodes of a binary-tree node. Their parent is the consolidated node (ed, 2): we concatenated the two characters and added their frequencies.

           (ed, 2)
(e,1)                 (d, 1)

So the frequency of 'ed' is 2, which is currently the minimum; the next minimum is (c, 3). So our new node (edc, 5) has left child (ed, 2) and right child (c, 3). Similarly we combine (edc, 5) with (a, 5) to get (edca, 10), and finally (b, 9) with (edca, 10) to get (bedca, 19). This is the final root node; its frequency, 19, is the length of the input string. The following is the structure of the tree.
Mark all left nodes 0 and right nodes 1.
Double-circled nodes are leaves.

Start from root node and do a DFS(Depth First Search) till you encounter leaf node. We get

b = 0 
a = 11
c = 101
d = 1001
e = 1000

The input string translates to "010100000011111011111001011110011000", which is 36 bits long. That means we need only 36 bits to store it, not 304. Wow, that's a saving of over 88%.

Now, how do we decode a string, say "0001101011000"? Read the bits from left to right, walking down the tree until a leaf is hit. The first 0 hits leaf b, so "000" gives us bbb. Then we read 1 and move to the (edca, 10) node; it is not a leaf, so we continue. The next bit is 1, which lands on (a, 5), so we output 'a'. Continuing this way we get "bbbabce". This is the decoded value.

In my solution, I built the Huffman tree in an efficient way: no costly recursion, just plain arrays. Have fun coding!!
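For illustration, here is a compact Huffman builder using a priority queue rather than the plain-array construction mentioned above (so this is not the post's implementation; names are mine). Ties between equal frequencies may be broken differently than in the worked example, which can change individual codes but never the total encoded length.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch of Huffman coding: repeatedly merge the two lowest-frequency
// nodes, then read codes off root-to-leaf paths (left = 0, right = 1).
class Huffman {
    private static final class Node {
        final int freq; final char ch; final Node left, right;
        Node(int f, char c) { freq = f; ch = c; left = null; right = null; }
        Node(Node l, Node r) { freq = l.freq + r.freq; ch = 0; left = l; right = r; }
    }

    static Map<Character, String> codes(String text) {
        int[] freq = new int[256];
        for (char c : text.toCharArray()) freq[c]++;   // O(n) counting pass
        PriorityQueue<Node> pq = new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.freq));
        for (int c = 0; c < 256; c++) if (freq[c] > 0) pq.add(new Node(freq[c], (char) c));
        while (pq.size() > 1) pq.add(new Node(pq.poll(), pq.poll()));
        Map<Character, String> out = new HashMap<>();
        walk(pq.poll(), "", out);
        return out;
    }

    private static void walk(Node n, String code, Map<Character, String> out) {
        if (n.left == null) {                           // leaf: record the code
            out.put(n.ch, code.isEmpty() ? "0" : code);
            return;
        }
        walk(n.left, code + "0", out);
        walk(n.right, code + "1", out);
    }
}
```

Encoding the post's example "bcbbbbbbaacaabbcade" with the resulting codes always totals 36 bits.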


Monday, November 12, 2012

Spell Check - Minimum Distance Calculation

Wherever there is manual user input, there is the possibility of a typo. As per research, the chance of the first letter being typed wrongly is very remote.
Errors are classified into the following types

  1. Extra character : "inpuut" instead of "input"
  2. Missing character: "publsh" instead of "publish"
  3. Wrong character: "prebiew" instead of "preview"

We count each of the above errors as distance 1. Given 2 strings, we can then calculate the distance between them.
For example, the distance between "Apple" and "Applets" is 2, as there are 2 missing characters.
"Apple" and "Tablet" have distance 4: 3 wrong characters ('A' instead of 'T', 'p' instead of 'a', 'p' instead of 'b') and 1 missing character (the trailing 't').

There is an algorithm to calculate this distance, due to the Russian mathematician Levenshtein, after whom it is named. If string1 has length n and string2 has length m, the algorithm runs in O(n*m). We plot the characters of string1 and string2 along the axes of a matrix and then fill in the values.

Calculation starts from (1,1) and the result is stored at the right bottom (red).

At (1,1) we have 't' and 'a'. They do not match, so take the minimum of the values at (0,0), (0,1) and (1,0) and add 1. Those values are 0, 1 and 1; the minimum is 0, plus 1 gives 1.

At (2,1) we have 'a' and 'a'. They match, so copy the value at (1,0), the cell diagonally above-left of the current one.

The minimum distance is stored at (6,5) which is 4
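The filling rule described above is the standard dynamic program; a sketch (the class and method names are mine):

```java
// Sketch of Levenshtein distance: d[i][j] is the edit distance between
// the first i characters of a and the first j characters of b.
class EditDistance {
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // delete all of a's prefix
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // insert all of b's prefix
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                // diagonal: substitute (cost 1) or keep (cost 0) on match
                int sub = d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[a.length()][b.length()];   // the bottom-right cell
    }
}
```

It reproduces the distances quoted above: 2 for "Apple"/"Applets" and 4 for "Apple"/"Tablet".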

The same algorithm can be extended to check multiple strings, for example to find all matching words in a dictionary. Some of the spell checkers we see day-to-day use it to find the nearest matching word in a given dictionary; the average distance they check against is 2 (as per research).
But preparing this matrix for each string and comparing in O(n*m) is time-consuming, so there is a revised algorithm that runs in O(n log n + m log m). I plan to put up that algorithm in a later post on the algorithm behind Google Docs.