These are the contents of the pages:
page 1: hello world
page 2: fantastic four review
page 3: four reasons that iPhone 4 is fantastic in the world
Reverse indexing builds a lookup from each "word" to the ids of the "pages" that contain it (a small code sketch follows the list):
hello: 1
world: 1,3
fantastic: 2,3
four: 2,3
review: 2
reasons: 3
that: 3
iPhone: 3
4: 3
is: 3
in: 3
the: 3
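In code, building such an index when everything fits in memory takes only a few lines. Here is a minimal C++ sketch (the hard-coded pages and the build_index helper are just for illustration, not the real crawler code):

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Build word -> list of page ids, assuming everything fits in memory.
std::map<std::string, std::vector<int>> build_index(
        const std::vector<std::string>& pages) {
    std::map<std::string, std::vector<int>> index;
    for (size_t i = 0; i < pages.size(); ++i) {
        std::istringstream words(pages[i]);
        std::string w;
        while (words >> w) {
            std::vector<int>& ids = index[w];
            int page_id = static_cast<int>(i) + 1;     // pages are numbered from 1
            if (ids.empty() || ids.back() != page_id)  // avoid duplicate ids per page
                ids.push_back(page_id);
        }
    }
    return index;
}

int main() {
    std::vector<std::string> pages = {
        "hello world",
        "fantastic four review",
        "four reasons that iPhone 4 is fantastic in the world",
    };
    for (const auto& entry : build_index(pages)) {
        std::cout << entry.first << ": ";
        for (size_t j = 0; j < entry.second.size(); ++j)
            std::cout << (j ? "," : "") << entry.second[j];
        std::cout << "\n";
    }
    return 0;
}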
OK. Sounds simple? Yes, it's simple, but only when your data set is small enough that everything fits into memory. When the amount of data grows large, certain bottlenecks appear. This is the current situation:
number of pages: 10 million
number of words: 10 billion, assuming we've set a bound of at most 1,000 words extracted per page
number of bytes per word: assume 10 on average
We need 10G x 10 = 100GB of memory to hold all the words in memory! While that's still possible on some big server machines, what about 100 million pages, or 1 billion pages?
Why not use MySQL?
Assume we can perform 3,000 inserts/updates per second with MySQL. We would need 10,000,000,000 / 3,000 ≈ 3,333,333 seconds ≈ 38 days to build the whole thing. Obviously that's not a good idea.
Why not use a key-value database?
There isn't enough memory to hold all the keys in memory, and if the keys live on disk, it's not much different from MySQL.
Final solution:
Plain binary file with merge sort
This is a feasible idea. We make each (word, rank, pageID) tuple a fixed-length structure:
struct DictWord {
    char  m_word[32];   // word, maximum 32 bytes UTF-8
    float m_rank;       // ranking of the word in that page
    MD5   m_pageID;     // 16-byte MD5 checksum of the page URL
};
That's just 32+4+16 = 52 bytes per record. For 10 billion words we need 520GB of data, which is still OK!
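As a rough sketch of how such fixed-length records can be streamed into a plain binary file (the MD5 struct, the packing pragma, and append_record below are assumptions for illustration, not the exact production code):

#include <cstdio>
#include <cstring>

#pragma pack(push, 1)                       // keep the record exactly 52 bytes, no padding
struct MD5 { unsigned char m_digest[16]; }; // 16-byte digest of the page URL
struct DictWord {
    char  m_word[32];
    float m_rank;
    MD5   m_pageID;
};
#pragma pack(pop)
static_assert(sizeof(DictWord) == 52, "record must stay fixed-length");

// Append one (word, rank, pageID) tuple to the big binary file.
void append_record(FILE* out, const char* word, float rank, const MD5& page_id) {
    DictWord rec{};                                           // zero-filled record
    std::strncpy(rec.m_word, word, sizeof(rec.m_word) - 1);   // truncate long words
    rec.m_rank = rank;
    rec.m_pageID = page_id;
    std::fwrite(&rec, sizeof(rec), 1, out);
}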
quick sort & external merge sort:
To facilitate fast lookups, we can't escape either hashing or sorting, and it's much easier to do external sorting than external hashing (I guess that's why MapReduce also uses external sorting). Therefore, sorting on disk is the nicer approach.
- quick sort on small chunks of data in memory:
we partition the big binary file into 1GB chunks, and each chunk can be qsorted in memory; it only takes 8-9 seconds to qsort the roughly 20 million records in one chunk (a sketch follows after this list)
- merge sort on all the files from disk:
finally the external merge sort runs over the 520 sorted files (see the merge sketch after this list). The speed is limited by disk access: for a traditional hard disk with 200MB/s sequential read/write throughput, we need 2 x 520GB / (200MB/s) = 5200s ≈ 1.4 hours. With 2 hard disks you can read and write simultaneously, cutting this to about 0.7 hours.
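Here is a minimal sketch of the per-chunk sort, reusing the DictWord/MD5 definitions from the sketch above; the chunk size constant and file names are simplified and error handling is omitted:

#include <algorithm>
#include <cstdio>
#include <cstring>
#include <vector>

// Sort one <=1GB chunk of DictWord records in memory and write it back as a sorted "run".
void sort_chunk(const char* in_path, const char* out_path) {
    const size_t kChunkRecords = (1ull << 30) / sizeof(DictWord);  // ~20 million records per 1GB

    std::vector<DictWord> buf(kChunkRecords);
    FILE* in = std::fopen(in_path, "rb");
    size_t n = std::fread(buf.data(), sizeof(DictWord), kChunkRecords, in);
    std::fclose(in);
    buf.resize(n);

    // qsort step: order records by word, then by page id to keep runs deterministic.
    std::sort(buf.begin(), buf.end(), [](const DictWord& a, const DictWord& b) {
        int c = std::memcmp(a.m_word, b.m_word, sizeof(a.m_word));
        if (c != 0) return c < 0;
        return std::memcmp(a.m_pageID.m_digest, b.m_pageID.m_digest, 16) < 0;
    });

    FILE* out = std::fopen(out_path, "wb");
    std::fwrite(buf.data(), sizeof(DictWord), buf.size(), out);
    std::fclose(out);
}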
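And a sketch of the final k-way merge over the sorted run files, again assuming the same DictWord layout; the run file names and the heap-based merge are illustrative, and stdio's buffering keeps the reads and writes mostly sequential:

#include <cstdio>
#include <cstring>
#include <queue>
#include <string>
#include <vector>

// Merge all sorted run files (e.g. run_0.bin ... run_519.bin, names are made up)
// into one big sorted file, reading each run sequentially. Error handling omitted.
void merge_runs(const std::vector<std::string>& run_paths, const char* out_path) {
    struct Head { DictWord rec; size_t run; };          // smallest unread record of one run
    auto greater = [](const Head& a, const Head& b) {
        int c = std::memcmp(a.rec.m_word, b.rec.m_word, sizeof(a.rec.m_word));
        if (c != 0) return c > 0;
        return std::memcmp(a.rec.m_pageID.m_digest, b.rec.m_pageID.m_digest, 16) > 0;
    };
    std::priority_queue<Head, std::vector<Head>, decltype(greater)> heap(greater);

    std::vector<FILE*> runs;
    for (size_t i = 0; i < run_paths.size(); ++i) {
        runs.push_back(std::fopen(run_paths[i].c_str(), "rb"));
        Head h{{}, i};
        if (std::fread(&h.rec, sizeof(DictWord), 1, runs[i]) == 1) heap.push(h);
    }

    FILE* out = std::fopen(out_path, "wb");
    while (!heap.empty()) {
        Head h = heap.top();                            // globally smallest remaining record
        heap.pop();
        std::fwrite(&h.rec, sizeof(DictWord), 1, out);
        if (std::fread(&h.rec, sizeof(DictWord), 1, runs[h.run]) == 1) heap.push(h);
    }
    std::fclose(out);
    for (FILE* f : runs) std::fclose(f);
}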
After reverse indexing, all the words are in sorted order:
4: 3
fantastic: 2,3
four: 2,3
hello: 1
in: 3
iPhone: 3
is: 3
reasons: 3
review: 2
that: 3
the: 3
world: 1,3
So to summarize, the whole pipeline is:
extract words from pages =>
build up <word, rank, pageID> tuples in memory =>
qsort the tuples whenever the 1GB memory buffer is full and write them to a new file =>
external merge sort on each 1GB file =>
then you get a large file (~500GB) of well-sorted words with pageID.
Current status: 6.7 million test pages processed so far, averaging 270 words per page. Still working on it.