Tuesday, December 25, 2012

Count Distinct Values in huge text file

This is a common problem today. You are given a huge text file containing tens of millions of words, and you have to count how many distinct values exist in the file. The problem gets more complicated if you cannot load the entire data into memory. How do you make your algorithm "online"? An "online algorithm" is one in which the result is expected to be returned immediately, typically in one pass, without enough memory to hold all the data.

This kind of algorithm is used in various fields. In web analytics it can report the number of unique visitors in the last month, the last week, and so on. Another use is in detecting a suspected attack: if the load on your server increases all of a sudden and you notice a spike in the network, the first thing to look for is the possibility of an attack. In that case, the requests could be coming repeatedly from a single system. Next we will discuss the different implementations available. For the purpose of this article, I used Shakespeare's complete works as the huge text file. It can be downloaded from http://www.gutenberg.org/cache/epub/100/pg100.txt. The file is around 5.5 MB in size and contains 67802 distinct words.

Read the input file line by line and break each line into words. Then store every word in a Set. Once all the words are stored, the size of the Set is the distinct count. However, with this approach we have to store the entire 5.5 MB of data in the Set. Even assuming there are repeated words and, say, only 4 MB of distinct words, we still need to hold 4 MB in the Set. And since a Set internally uses a hash table, assuming a 0.7 load factor we would actually need 4/0.7 = 5.7 MB. If the file is extremely big, we cannot use this approach at all.
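A minimal sketch of this first approach; the tokenization rule (splitting on runs of non-letters) is a simplification I chose for illustration, and pg100.txt is the Shakespeare file linked above.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctWordCount {

    // Count distinct words across the given lines by storing each word in a Set.
    public static int countDistinct(Iterable<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            // Simplified tokenization: lower-case, then split on non-letter runs.
            for (String w : line.toLowerCase().split("[^a-z]+")) {
                if (!w.isEmpty()) words.add(w);
            }
        }
        return words.size();
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new FileReader("pg100.txt"))) {
            String line;
            while ((line = r.readLine()) != null) lines.add(line);
        }
        System.out.println("Distinct words: " + countDistinct(lines));
    }
}
```

Note that both the Set and (here) the line list grow with the input, which is exactly the memory problem described above.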

A small work-around for the first approach is this. Assume that you have a very good hashCode function, so that the hashCodes of any 2 words are never the same. In that case, store the hashCode instead of the word. A hashCode is an int, and when stored in a Set as an Integer it is 32 bits or 4 bytes long. So to store 67802 hashCodes we need around 378 KB of memory (again allowing for the 0.7 load factor). This is a huge gain, but the problem is finding such a good hashCode function. And the size of the Set still grows with the input file size.

We can use a "Bloom Filter" as an alternative, here in the form of a 2-dimensional boolean array. In this approach, we use 3 or 4 different hashCode functions: h1, h2, h3, h4. For each word, compute h1 to h4. The rows of our Bloom Filter are indexed by the hashCode values and the number of columns is fixed by the accuracy needed. One thing to note is that the Bloom Filter is probabilistic: it doesn't always give 100% accurate results; there is a chance of error of 1% to 4% or so. The bigger the Bloom Filter arrays, the more accurate the results. Assume that the values of h1, h2, h3, and h4 for a word are 22, 17, 35, and 19. If the Bloom Filter array is a, we set a[22][0], a[17][1], a[35][2], a[19][3] to true. Using this approach, I got a result of 67801 instead of 67802 with a 4×1M array. Since a boolean occupies a byte in Java, that is 4×1M bytes, or about 3.8 MB of memory. With only 300 KB, I got a 3% error. The attraction of these algorithms is that we don't need to store the data itself and we can get the result in one pass.
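A hedged sketch of this per-hash-row variant: a word counts as new only if at least one of its bits was previously unset. The seeded polynomial hash is my own stand-in for the article's h1..h4, not something the post specifies.

```java
import java.util.Random;

public class BloomDistinctCounter {
    private final boolean[][] bits;   // one row per hash function, as described above
    private final int[] seeds;
    private final int width;
    private int distinct = 0;

    public BloomDistinctCounter(int numHashes, int width) {
        this.bits = new boolean[numHashes][width];
        this.width = width;
        this.seeds = new int[numHashes];
        Random r = new Random(42);    // fixed seed so runs are repeatable
        for (int i = 0; i < numHashes; i++) seeds[i] = r.nextInt();
    }

    // A simple seeded hash; an illustrative stand-in for h1..h4.
    private int hash(String word, int seed) {
        int h = seed;
        for (int i = 0; i < word.length(); i++) h = h * 31 + word.charAt(i);
        return Math.floorMod(h, width);
    }

    // Add a word; count it as new only if at least one of its bits was unset.
    public void add(String word) {
        boolean seen = true;
        for (int i = 0; i < bits.length; i++) {
            int pos = hash(word, seeds[i]);
            if (!bits[i][pos]) {
                seen = false;
                bits[i][pos] = true;
            }
        }
        if (!seen) distinct++;
    }

    // Approximate distinct count; may undercount when all bits of a new word collide.
    public int count() { return distinct; }

    public static void main(String[] args) {
        BloomDistinctCounter c = new BloomDistinctCounter(4, 1_000_000);
        for (String w : new String[] {"to", "be", "or", "not", "to", "be"}) c.add(w);
        System.out.println(c.count()); // 4 (may undercount if hashes collide)
    }
}
```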


The next approach we discuss is a very efficient algorithm, and it has been used in many applications, including Cassandra (originally developed at Facebook). It is called the "HyperLogLog Counter". It's a slightly complex, mathematical algorithm. It needs very little memory, and that need doesn't vary much with the input data size. The idea is as follows. Read each word and compute the hash of the word (Cassandra uses MurmurHash). The algorithm uses a technique called "stochastic averaging". We define b as a number between 4 and 16, and create an int (or byte) array of size 2^b; these are the buckets. So if we use b=4, we have 16 buckets. Assume we have chosen b=15 and the hashCode of a word is 1379266913. We write this hashCode as a 32-bit binary number (an Integer is 32 bits) and break it into the first 15 bits and the remaining 17 bits. The first 15 bits decide which bucket to use. The remaining 17 bits determine the rank: the position of the first set bit (the leftmost 1), reading from left to right. If the value currently at that bucket index is less than the rank, the rank is stored in the bucket. So each bucket keeps the highest rank seen. Once all the words are read and the buckets populated, we use the harmonic mean of the bucket values to compute the result.
Using this approach with a b-value of 15, meaning 32768 buckets (or 32 KB of memory), the result obtained was 67749 (against the actual 67802). That means using just 32 KB of memory we got a result which is 99.92% accurate!
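A sketch of the counter under the definitions above. Two assumptions of mine: the hash is String.hashCode pushed through an integer mixer (a stand-in for MurmurHash), and the bias-correction constant and small-range linear-counting fallback are the standard published ones, which the post does not mention.

```java
public class HyperLogLog {
    private final int b;          // number of index bits (4 to 16, as above)
    private final int m;          // number of buckets = 2^b
    private final int[] buckets;

    public HyperLogLog(int b) {
        this.b = b;
        this.m = 1 << b;
        this.buckets = new int[m];
    }

    // Stand-in for MurmurHash: String.hashCode pushed through a bit mixer.
    private static int hash(String s) {
        int h = s.hashCode();
        h ^= h >>> 16; h *= 0x85ebca6b;
        h ^= h >>> 13; h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public void add(String word) {
        int h = hash(word);
        int bucket = h >>> (32 - b);                 // top b bits pick the bucket
        int rest = h << b;                           // the remaining 32-b bits
        // rank = position of the leftmost 1-bit in the remaining bits (capped)
        int rank = Math.min(Integer.numberOfLeadingZeros(rest) + 1, 32 - b + 1);
        if (rank > buckets[bucket]) buckets[bucket] = rank;
    }

    // Harmonic-mean estimate with the standard bias correction, plus the
    // linear-counting fallback that HyperLogLog uses for small cardinalities.
    public double estimate() {
        double sum = 0;
        for (int v : buckets) sum += Math.pow(2, -v);
        double alpha = 0.7213 / (1 + 1.079 / m);
        double e = alpha * m * m / sum;
        if (e <= 2.5 * m) {
            int zeros = 0;
            for (int v : buckets) if (v == 0) zeros++;
            if (zeros > 0) e = m * Math.log((double) m / zeros);
        }
        return e;
    }

    public static void main(String[] args) {
        HyperLogLog hll = new HyperLogLog(15);       // 2^15 = 32768 buckets
        for (String w : "to be or not to be that is the question".split(" ")) hll.add(w);
        System.out.println(Math.round(hll.estimate())); // estimate of the 8 distinct words
    }
}
```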


Friday, December 14, 2012

Google Fresher Interview


There is a linked list of numbers of length N. N is very large and you don’t know N. 
You have to write a function that will return k random numbers from the list. 
Numbers should be completely random.


One approach is to pick a random index modulo N, but if we don't know N it becomes interesting. The approach we discuss is called "reservoir sampling". Read the first K numbers and store them in a list (the reservoir). For our example, let's say K is 10, so we keep the first 10 elements. Next read the 11th element and compute a random number r between 0 and 10 (inclusive). If r is less than 10, replace the r-th element of the reservoir with the 11th element; else continue. Repeat for the 12th element with r between 0 and 11, and so on until the list ends.
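The procedure above can be sketched as follows; the Iterator stands in for the linked list of unknown length, since we only ever read forward.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public class ReservoirSample {

    // Return k uniformly random elements from a stream of unknown length.
    public static int[] sample(Iterator<Integer> stream, int k, Random rnd) {
        int[] reservoir = new int[k];
        int i = 0;
        while (stream.hasNext()) {
            int x = stream.next();
            if (i < k) {
                reservoir[i] = x;               // fill the first k slots directly
            } else {
                int r = rnd.nextInt(i + 1);     // r in [0, i], each index equally likely
                if (r < k) reservoir[r] = x;    // keep x with probability k/(i+1)
            }
            i++;
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(Arrays.toString(sample(list.iterator(), 3, new Random())));
    }
}
```

A short induction shows every element survives in the reservoir with probability k/N, which is what "completely random" demands.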


Question: 2

Given a data structure having first n integers and next n chars: A = i1 i2 i3 ... in c1 c2 c3 ... cn.
Write an in-place algorithm to rearrange the elements of the array as A = i1 c1 i2 c2 ... in cn.


We can use a divide-and-conquer approach. Let's say the list contains {1, 2, 3, 4, 5, 6, 7, 8, a, b, c, d, e, f, g, h}. Using a binary-search-like split, swap the two middle quarters to convert it into {1, 2, 3, 4, a, b, c, d, 5, 6, 7, 8, e, f, g, h}. Then break the list into 2 halves and apply the same algorithm on each half: {1, 2, 3, 4, a, b, c, d} and {5, 6, 7, 8, e, f, g, h}.
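The quarter-swapping recursion can be sketched as below. An assumption worth flagging: this simple version works cleanly when the half-length is a power of two, as in the example; the general case needs a rotation or cycle-leader trick.

```java
public class Interleave {

    // Rearrange a[lo..hi) of the form i1..in c1..cn into i1 c1 ... in cn.
    // Assumes the half-length is a power of two.
    public static void interleave(char[] a, int lo, int hi) {
        int len = hi - lo;
        if (len <= 2) return;
        int q = len / 4;
        // swap the 2nd quarter (back half of the i's) with the 3rd (front half of the c's)
        for (int k = 0; k < q; k++) {
            char t = a[lo + q + k];
            a[lo + q + k] = a[lo + 2 * q + k];
            a[lo + 2 * q + k] = t;
        }
        interleave(a, lo, lo + len / 2);   // recurse on the two halves
        interleave(a, lo + len / 2, hi);
    }

    public static void main(String[] args) {
        char[] a = {'1','2','3','4','5','6','7','8','a','b','c','d','e','f','g','h'};
        interleave(a, 0, a.length);
        System.out.println(new String(a)); // 1a2b3c4d5e6f7g8h
    }
}
```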


Question: 3

There is an array A[N] of N numbers. You have to compose an array Output[N] such that Output[i] is the multiplication of all the elements of A[N] except A[i]. For example, Output[0] will be the multiplication of A[1] to A[N-1], and Output[1] will be the multiplication of A[0] and A[2] to A[N-1]. Solve it without the division operator and in O(n).


We can do it in 2 passes, O(n). If the input list is a: {4, 2, 5, 1}, we create a product list b: {0, 0, 0, 0} and set b[0] to 1. Set pf (product forward) and pb (product backward) to 1. Start the forward pass: at each i, set pf = pf * a[i-1] and store pf in b[i]. So at i=1, pf is 4 and the list becomes {1, 4, 0, 0}. At i=2, pf becomes 4*2=8 and the list is {1, 4, 8, 0}; at i=3 it becomes {1, 4, 8, 40}.
Start the backward pass from the last-but-one item, that is i=2. At each i, set pb = pb * a[i+1] and then set b[i] = b[i] * pb. So at i=2, pb = a[3] = 1 and b[2] stays 8; at i=1, pb = 1*5 = 5 and b[1] = 4*5 = 20; at i=0, pb = 5*2 = 10 and b[0] = 1*10 = 10. The list finally becomes {10, 20, 8, 40}.
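The two passes above can be written compactly (method and class names are mine):

```java
import java.util.Arrays;

public class ProductExceptSelf {

    // Two passes, no division: b[i] = product of all a[j] with j != i.
    public static int[] products(int[] a) {
        int n = a.length;
        int[] b = new int[n];
        int pf = 1;                    // running product of elements before i
        for (int i = 0; i < n; i++) {
            b[i] = pf;
            pf *= a[i];
        }
        int pb = 1;                    // running product of elements after i
        for (int i = n - 1; i >= 0; i--) {
            b[i] *= pb;
            pb *= a[i];
        }
        return b;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(products(new int[] {4, 2, 5, 1}))); // [10, 20, 8, 40]
    }
}
```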


Question: 4

You are given an array with integers (both positive and negative) in any random order. 
Find the sub-array with the largest sum


Sort a very large text file on limited memory


We are given a very large text file, of size say 10 GB, and we need to sort it and store the result in an output location. But the memory (RAM) is limited, say 1 GB.


1. Brute-Force Approach
Read the input file line by line using a BufferedReader and store each line in a TreeSet (assume there are no duplicate lines). Once the file is completely read, write the contents of the TreeSet to the output file.
For my analysis I used a 400 MB file containing line items, extracted from the TPC-H dataset; you can use any large file. When I ran my program with 512 MB of memory (VM argument -Xmx512M), I got an OutOfMemoryError. So I set it to 1 GB and it ran fine. The time taken was 7 seconds.
So, as we can observe, to sort 10 GB we would need around 15-20 GB of RAM. The memory is consumed by the TreeSet holding the data.

2. External Sort
In this approach, we again read the input file line by line and store the lines in a TreeSet, but we do not store the entire file. If the file is 10 GB and our RAM capacity is 1 GB, we store 512 MB of data in the TreeSet. Once it reaches 512 MB, we flush it to disk into a temporary file (call it temp1.txt). We repeat this procedure of writing sorted temporary files until the whole 10 GB is read, leaving us with 20 sorted temporary files. The next step is to merge these 20 files into a single sorted file and delete the temporary files. This is called a k-way merge. Consider the example below.

temp1:  { dog, fan, jug }
temp2:  { egg, kit, rat, sun }
temp3:  { ant, gun, hat }

Read the first item of each file: dog, egg, ant. Write the smallest of them to output.txt, remove it from its temp file, and add the next item from that same file to the list. So we remove ant from temp3 and write it to output.txt; the list now holds dog and egg, and we add gun (since we removed ant from temp3). Now the next smallest is dog: write it to output.txt and add fan to the list. Repeat until all the elements are written.
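The merge step above can be sketched with a min-heap of the current heads; here in-memory sorted lists stand in for the sorted temp files, so the file IO is elided.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {

    // Holds the current head of one sorted run plus the iterator it came from.
    private static class Entry implements Comparable<Entry> {
        final String value;
        final Iterator<String> rest;
        Entry(String value, Iterator<String> rest) { this.value = value; this.rest = rest; }
        public int compareTo(Entry o) { return value.compareTo(o.value); }
    }

    // Merge k sorted lists (standing in for the sorted temp files) into one sorted list.
    public static List<String> merge(List<List<String>> runs) {
        PriorityQueue<Entry> heap = new PriorityQueue<>();
        for (List<String> run : runs) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) heap.add(new Entry(it.next(), it));
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Entry e = heap.poll();            // smallest current head across all runs
            out.add(e.value);
            if (e.rest.hasNext()) heap.add(new Entry(e.rest.next(), e.rest));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
                Arrays.asList("dog", "fan", "jug"),
                Arrays.asList("egg", "kit", "rat", "sun"),
                Arrays.asList("ant", "gun", "hat"));
        System.out.println(merge(runs)); // [ant, dog, egg, fan, gun, hat, jug, kit, rat, sun]
    }
}
```

With a heap, each of the n total lines costs O(log k), so the merge is O(n log k) rather than scanning all k heads per output line.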

I first implemented this method to sort the 400 MB file in 40 MB of RAM. The time taken was 22 seconds.
This is definitely a gain in space complexity at the cost of increased time. Can we do better?

Next I utilized memory-mapped files (mmap). For this we need to consider how Java does file IO. Data in the file system (secondary storage, the hard disk) has to be brought into memory (RAM). To do this, we can either read character by character or read a chunk of data at a time. The time taken by file IO is huge compared to the CPU processing time, so it's better to reduce the frequency of file reads; hence we use a BufferedReader.
But with mmap, a file or a portion of a file can be mapped directly into the process's virtual address space, and the operating system pages it in and out for us. Java objects live on the heap, but memory-mapped data can be read without a read() system call for every access (system calls are orders of magnitude slower). You cannot memory-map a full 10 GB file at once, so we map a portion of the file at a time. Another advantage of mmap is that multiple processes can share the same mapped file. In Java, the classes for this are in the NIO package; FileChannel is the class used to obtain the mapping.

With this approach, I memory-mapped the 400 MB file 8 KB at a time: read each 8 KB chunk and store its lines in a TreeSet. After 5000 chunks, store the accumulated 40 MB of data in a temporary file (giving 10 sorted temporary files), then apply the k-way merge.
Using this approach, the program ran in 17 seconds (a 5-second gain). It's generally at least 20% faster than the BufferedReader version.




Monday, December 3, 2012

Adaptive Spam Filtering algorithm

When we talk about spam, we generally mean email. A spam mail is one sent to you as an email promotion or bulk mail, and in most cases you are not interested in receiving it. Mail which is not spam is called ham. In the early days we had to go through each mail ourselves: keep the ham in the Inbox, and manually move the spam to a junk folder. That is a lot of work, given that these days 50-60% of mail is spam. A few algorithms came up to solve this issue, and the best of them is Bayesian filtering. It's an adaptive, machine-learning algorithm, and we will discuss the details below.

Classifying an email as spam or not cannot be done at the mail server; it needs to be done at the email client. For instance, say there are 2 users, A and B. A works for a bank and B works as a pharmacist. A mail with the content "Reduce your mortgage loan" is spam for B but ham for A, and a mail "Solution for baldness" is spam for A but ham for B. So when a recipient receives a mail he considers spam, he can "Mark it as Spam"; this is not a big issue. On the other hand, if he notices that a ham mail went into his spam folder, he can "Mark it as NOT Spam". This case is an issue, as the mail might be an important one that he misses (since it is not showing in his inbox). So spam detectors should be careful not to mark ham as spam. Spam can be detected based on the email content, subject, sender address, recipient addresses, etc. Let's see how this works.

In the industry we have collections of thousands of ham/spam emails which can be used to build our spam filter application. Download these emails into your data store. Run a job on them (Map-Reduce or batch) that goes through each email message and splits it into words. You might have to do additional cleanup: removing special characters and quotes, converting to lower case, ignoring words shorter than 4 characters, ignoring common words, keeping only words made up of letters, etc. Add each valid word to a HashMap as a key. The value in the map is a Node; the Node class has 3 fields: spamCount, hamCount, and probability. So if I read a word "XYZ" from a spam email, and it is the first time I have encountered the word, its Node would have spamCount=1, hamCount=0. We calculate probability after the map is constructed; note that the same word can also appear in the ham list. Every time a word is put in the map, increment a class-level variable totalSpam (or totalHam) by 1. After all the emails are read and the map is constructed, iterate over the map, and for each key take its spamCount and hamCount and calculate the probability using:

probability = (spamCount/totalSpam)/((spamCount/totalSpam) + (hamCount/totalHam))

Do this for all the keys. The probability is a floating point value between 0.0 and 1.0.
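The per-word formula can be expressed directly; the method and class names here are my own, not from the post's implementation.

```java
public class WordProbability {

    // probability = (spamCount/totalSpam) / ((spamCount/totalSpam) + (hamCount/totalHam))
    public static double spamicity(int spamCount, int hamCount, int totalSpam, int totalHam) {
        double s = (double) spamCount / totalSpam;
        double h = (double) hamCount / totalHam;
        return s / (s + h);
    }

    public static void main(String[] args) {
        // a word seen in 30 of 100 spam-word occurrences and 2 of 200 ham-word occurrences
        System.out.println(spamicity(30, 2, 100, 200)); // ~0.97, strongly spam-flavored
    }
}
```

A word that never appears in spam gets probability 0.0, and one that never appears in ham gets 1.0 (a word absent from both would divide by zero and is simply not in the map).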
That completes the training step. Next is the filtering step.

An email comes from a sender "X". Again, extract the words (as described above) and, for each word, look up its probability in the map. If the word doesn't exist in the map, the spam filter has not been trained for this word; it could be a valid word, so give it a value of 0.5. Calculate the interest value I for each word as follows:

I = |0.5 - probability|

Once it is calculated for all the words, sort the I values in descending order (highest interest first) and take the top N values (say N=15). For these, take the corresponding probabilities p1, p2, p3, ..., p15, and calculate the total probability using the following formula:

P = (p1*p2*...*p15) / ((p1*p2*...*p15) + ((1-p1)*(1-p2)*...*(1-p15)))

This value is between 0.0 and 1.0. The nearer it is to 0, the smaller the chance of the mail being spam. We mark anything equal to or greater than 0.9 as spam.
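A sketch of the interest-sorting and combination steps described above; the names and the sample probabilities are illustrative.

```java
import java.util.Arrays;
import java.util.Comparator;

public class SpamScore {

    // Combine the n most "interesting" word probabilities, per the formula above.
    public static double combined(double[] probs, int n) {
        // sort by interest I = |0.5 - p|, highest interest first
        Double[] p = new Double[probs.length];
        for (int i = 0; i < probs.length; i++) p[i] = probs[i];
        Arrays.sort(p, Comparator.comparingDouble(x -> -Math.abs(0.5 - x)));

        double prod = 1.0, inv = 1.0;
        for (int i = 0; i < Math.min(n, p.length); i++) {
            prod *= p[i];            // p1*p2*...*pN
            inv *= (1.0 - p[i]);     // (1-p1)*(1-p2)*...*(1-pN)
        }
        return prod / (prod + inv);
    }

    public static void main(String[] args) {
        // words leaning heavily toward spam push the score toward 1.0
        System.out.println(combined(new double[] {0.97, 0.91, 0.5, 0.88}, 3)); // ~0.9996, spam
    }
}
```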

Next comes the machine learning. It can happen that an email which was not marked as spam is later found to be spam. You mark it as spam; to learn from that, the filter adds the mail's words back into the map and recalculates the probabilities.

I have built a basic implementation which can be trained and also does machine learning. I created 3 files: ham.txt, spam.txt, and common-words.txt. In this basic implementation, each line of the text file stores the content of one mail. In the sample data I set up, I want to filter all jobsite, lottery, and retirement emails. The spam filter gives the following output.
  1. 'quick and easy jobsite' is spam
  2. 'will there be a production release tomorrow' is not a spam
  3. 'plan your retirement soon' is spam
  4. 'you have won a state lottery claim your money now' is spam
  5. 'register with our jobsite and get your dream job and money now' is spam
  6. 'today with our jobsite making money is not all that hard' is not a spam
  7. 'today with our jobsite making money is not all that hard' is spam
  8. 'back to office today and started off coding' is not a spam
Note that 6 was initially classified as ham, because a few of its words (today, money, etc.) appear in the ham list as well. But when I marked it as spam, the next time I received the same email (at 7) it was automatically traced as spam.


Saturday, December 1, 2012

FaceBook - Write a function that can show the number of users online at any given time

You're given a period of time where users log in and log out and a set of login and log out times for those users. Facebook is asking you to create an efficient algorithm that will calculate a simple, but extremely important, metric.

Facebook currently has more than 1 billion users; assume that each user is given a unique identifier like user_192. Hundreds of people log in to (and out of) FB every second. There is one audit log maintained explicitly to store the time of login/logout and the userid. A sample log is given below:

[10/11/2012 10:25:06] Login user_4
[10/11/2012 10:28:55] Login user_6
[10/11/2012 10:29:19] Logout user_4
[10/11/2012 10:36:33] Login user_8

Using the above information, we need to come up with the most efficient algorithm (fast and low memory utilization) to calculate the users logged in at any time. For instance, in the above example, if I query for 10/11/2012 10:29:00, there were 2 users logged in: user_4 and user_6. Assume that there can be users who are logged into Facebook the entire day (and never log out). Also assume that all 1 billion users log in every day. Most importantly, we need to consider boundary conditions as well, such as somebody who logged in at 23:55:00 and logged out the next day at 01:17:12.

There were a few solutions available with time complexity O(n log n). However, I was wondering if I could do it in one pass, that is O(n). I also didn't want to use the BitSet approach, as I would have to create a BitSet of size 1 billion every day. So I came up with this solution.

Maintain a map where the key is the date (10/11/2012) and the value is an int[] of size 86400 (the total number of seconds in a day), pre-initialized to -1. Maintain a counter initialized to 0. Now go through the log file; for each log entry, convert the time portion into seconds, increment the counter if the activity is "Login" and decrement it if "Logout", then go to that position in the array and set it to the current value of the counter. (In the figure above, the blue window represents the time user_7 is logged in, the green window the time user_4 is logged in, and the red window users whose sessions span 2 overlapping days.) So what we have achieved is that, in linear time, we have created our data structure.

Now let's see how we can query. Convert the time to seconds and get the corresponding value from the int array. If the value is not -1, that value is the result.
If the value is -1, navigate back from the current position of the array until you encounter a value which is not -1 (case a). While doing so, you might reach the beginning of the array without finding any value that is not -1 (case b). In case a, the value at that position is the result. In case b, refer to the previous day's map: in that map's int array, scan backward from the end of the array until you find a value that is not -1.
Let's see an example. In the figure above, at time 37411 we had 1 user and at time 37506 we had 2 users. So if I query for time 37506, we can directly answer 2. If we query for 37500, the answer is 1. How did we arrive at this? At 37500 the value is -1, so we navigate left, and at 37411 we find the value 1. That is the result.
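A single-day sketch of this structure; the date-keyed map and the previous-day fallback (case b) are omitted, and log entries are assumed to arrive in chronological order.

```java
public class OnlineUsers {
    private static final int SECONDS_PER_DAY = 86400;
    private final int[] counts = new int[SECONDS_PER_DAY];  // one slot per second of the day
    private int counter = 0;

    public OnlineUsers() {
        java.util.Arrays.fill(counts, -1);  // -1 means "no login/logout event at this second"
    }

    // Record one log entry: time in seconds since midnight; login = true, logout = false.
    // Entries must be processed in chronological order.
    public void record(int second, boolean login) {
        counter += login ? 1 : -1;
        counts[second] = counter;
    }

    // Query: users online at the given second. Walk left to the nearest event;
    // a full-day walk with no event would defer to the previous day's array.
    public int onlineAt(int second) {
        for (int s = second; s >= 0; s--) {
            if (counts[s] != -1) return counts[s];
        }
        return 0;  // no events earlier in this day (previous-day lookup omitted)
    }

    public static void main(String[] args) {
        OnlineUsers day = new OnlineUsers();
        day.record(37411, true);   // login  -> 1 online
        day.record(37506, true);   // login  -> 2 online
        System.out.println(day.onlineAt(37500)); // 1
    }
}
```

The backward walk makes a worst-case query O(86400), still constant per day; storing the counter at every second instead would make queries O(1) at the cost of filling the whole array.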

Note: In a real-world scenario the user log would not be a single text file but a distributed one. Facebook has its own logging system (PTail and Puma), and the back end uses HBase over HDFS, so the log would be broken into multiple 64 MB blocks. A map-reduce job runs on each of these blocks in parallel, building the above map.

