Friday, October 26, 2012

Page Hits for WebSites and Blogs

Assume that you are building an Analytics Tool for a very popular site (like facebook). And the tool should capture the real time hits for the site and maintain a cumulative frequency. The site will be hit every second and you have to capture the details and show a summary like last 1 month 45,000 hits, last 1 week 20,000 hits, yesterday 1,200 hits and so on. You should also support range queries like 22,000 hits between 02/10/2012 to 21/02/2012.

Assume that you have stored the hits sequentially in an array. For eg.
3, 2, 0, 9, 2, 5, 1 
The above is the page hits for a 1 week period. The most natural way of calculating the total cumulative frequency is to sum up. It becomes-
3, 5, 5, 14, 16, 21, 22
So we know we received 22 hits in 1 week. Lets say we have data for last every hour like this and the data is stored for last 5 years. This is pretty huge data. Now if for some reason the frequency for day 2 is updated from 2 to 5. This impacts our frequency results table, so we have to recalculate the cumulative values for the 2nd day to 7th day. 

A better solution for this is Fenwick Tree (or Binary Indexed Tree). In this data structure each index position is expressed as a binary number. Lets consider 12. Its represented in binary as 0000 1100. The last set bit (from right to left) is at position 2 (with 0-indexed). Lets call this last set index of idx as lsi(idx). Each idx manages in the range [ idx to idx - 2^lsi(idx) + 1]. So in case of idx 12, the lsi(12) was 2. So it manages 9..12. The following chart provides the indexes and ranges.

1=1
2=1..2
3=3
4=1..4
5=5
6=5..6
7=7
8=1..8
9=9
10=9..10
11=11
12=9..12
13=13

For the below input:
{1,0,2,1,1,3,0,4,2,5,2,2,3}

The output tree is:
[1, 1, 2, 4, 1, 4, 0, 12, 2, 7, 2, 11, 3]

So if you observe, 12 is managing 9 to 12. So the values its managing are 2,5,2,2 (The tree is 1-indexed and not 0-indexed). The total sum is 11. Similarly, index 6 manages 5..6. So its values are 1,3 summing up 4.
There is an easy way to find this - which is bit manipulation ( i have used left shift and right shift ).
Updating an index value is simple. Identify the impacted indexes and add the value.

To calculate the range values, use Tree[j] - Tree[i]. Please see the code snippet below.

No comments:

Post a Comment

UA-36403895-1