Recently I've upgraded my network plan from 300Mbps to 500Mbps. The PCCW network in Hong Kong is so good that I get around 470-520Mbps download speed all the time, with my application downloading about 520 different webpages per second, still on a single PC.
But a few things need to be considered.
- Hard disk space runs out quickly. An average webpage is around 120kB. You can shrink it by removing the comment area (from "<!--" to "-->"), the style area (from <style> to </style>) and the scripting area (from <script> to </script>). But even after removing these 3 areas, you still have around 60kB to write to the hard disk.
It's a big amount of data with redundancies:
1 webpage: 60kB
1,000 webpages: 60MB
1,000,000 webpages: 60GB
1,000,000,000 webpages: 60TB
For 60TB of storage you need 8 of the 8TB hard disks, which is still possible although quite costly. Alternatively, we can use a fast, simple compression algorithm.
Finding a repeating pattern for compression:
Consider the following text:
<title> Hello World! </title>
It's partially repeating:
<title> Hello World! </title>
******
The fragment marked with '*' has appeared before. So why not use 2 integers (index difference, repeat length) to represent the repeating fragment?
And consider another example:
hello hello hello hello
*****************
Well! It's a self-repeating sequence where the original part and the repeating part overlap.
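As a rough illustration (not the exact code in my crawler), expanding a (distance, length) pair with a plain byte-by-byte copy handles the overlapping, self-repeating case for free:

#include <cstdio>
#include <string>

// Expand one (distance, length) back-reference by appending to `out`.
// Copying byte by byte (instead of one memcpy) is what makes the
// overlapping, self-repeating case work: freshly copied bytes become
// the source of the bytes that follow.
static void copyMatch(std::string& out, size_t distance, size_t length)
{
    size_t src = out.size() - distance;
    for (size_t i = 0; i < length; i++)
        out.push_back(out[src + i]);
}

int main()
{
    // "hello hello hello hello": emit the literal "hello " once, then a
    // single back-reference with distance 6 and length 17 that overlaps
    // its own output.
    std::string out = "hello ";
    copyMatch(out, 6, 17);
    printf("%s\n", out.c_str());   // prints: hello hello hello hello
    return 0;
}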
Finding it fast, using in-place hashing:
I have the ambition to avoid 3rd-party libraries as much as I can. People like to use open-source libraries such as LZO, but it's difficult to optimize 3rd-party code unless you have a complete picture of what it is doing. Therefore I'll make it myself.
In-place hashing:
With the old-school hashing approach, we are taught to use the following data structures:
init: Item** hashTable = new Item*[HashIndexSize](); // all buckets start as null
insert(index, data): hashTable[index] = new Item{ data, hashTable[index] }; // sets m_data and m_next
// each Item is itself a node of a linked list for the same hash index
This is kind of OK, but it runs pretty slow for the following reasons:
- It allocates a small block of memory from the heap for every insert.
- Items are scattered across the entire virtual memory space, which increases CPU cache misses.
New design:
init: Item *hashTable = new Item[HashIndexSize];
insert(index, data): hashTable[index] = data;
// each Item contains the data only, without the "next" pointer.
Ahh? Does it work? How about hash collisions? Here's a better way:
insert(index, data):
    for (i = 0; i < maxCollision; i++) {
        if (hashTable[(i + index) % HashIndexSize] == null) {
            hashTable[(i + index) % HashIndexSize] = data;
            break;
        }
    }
Now we have an in-place hash with a configurable maximum collision count, and the hash itself is a bare array of contiguous memory. We use it to locate a possible repeating fragment in linear time.
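Here is a minimal sketch of that idea (the names and the toy 4-byte fragment hash are mine, for illustration only): the table is one flat array of positions, an insert probes at most maxCollision slots, and a lookup just walks the same few slots.

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

static const uint32_t kEmpty = 0xFFFFFFFFu;   // marks an unused slot

// In-place hash: one flat array, no per-insert heap allocation,
// and a configurable maximum number of probes on collision.
struct InPlaceHash {
    uint32_t m_maxCollision;
    std::vector<uint32_t> m_table;            // bare, contiguous memory

    InPlaceHash(uint32_t indexSize, uint32_t maxCollision)
        : m_maxCollision(maxCollision), m_table(indexSize, kEmpty) {}

    // Insert `pos` near `index`; probe at most m_maxCollision slots.
    void insert(uint32_t index, uint32_t pos) {
        for (uint32_t i = 0; i < m_maxCollision; i++) {
            uint32_t slot = (index + i) % (uint32_t)m_table.size();
            if (m_table[slot] == kEmpty) { m_table[slot] = pos; return; }
        }
        // locally full: simply drop the entry (or overwrite the oldest)
    }

    // Visit every candidate stored near `index`.
    template <class Fn>
    void forEachCandidate(uint32_t index, Fn fn) const {
        for (uint32_t i = 0; i < m_maxCollision; i++) {
            uint32_t slot = (index + i) % (uint32_t)m_table.size();
            if (m_table[slot] == kEmpty) break;
            fn(m_table[slot]);
        }
    }
};

// Toy usage: hash every 4-byte fragment of `text` and report fragments
// that have been seen before (candidate starts of repeating fragments).
static uint32_t hash4(const char* p, uint32_t indexSize) {
    uint32_t h;
    memcpy(&h, p, 4);
    return (h * 2654435761u) % indexSize;     // Knuth multiplicative hash
}

int main() {
    const char* text =
        "<title> Hello World! </title> ... <title> Hello World! </title>";
    const uint32_t kIndexSize = 128 * 1024;   // 128k slots
    InPlaceHash hash(kIndexSize, 16);         // maximum collision 16

    uint32_t len = (uint32_t)strlen(text);
    for (uint32_t pos = 0; pos + 4 <= len; pos++) {
        uint32_t idx = hash4(text + pos, kIndexSize);
        hash.forEachCandidate(idx, [&](uint32_t prev) {
            if (memcmp(text + prev, text + pos, 4) == 0)
                printf("fragment at %u repeats position %u\n", pos, prev);
        });
        hash.insert(idx, pos);
    }
    return 0;
}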
Using an in-place hash table of 128k items with maximum collision 16, I get a good compression throughput of 400MB/s, single-threaded on a 3.5GHz Skylake CPU. And with the compression algorithm enabled, I can still achieve a crawling speed of up to 700 webpages per second on a dual-core Pentium CPU.
After compression, the average webpage size is reduced to somewhere between 10 and 11kB.
Therefore you only need 12TB of storage to store 1 billion web pages !!! And theoretically you can reach a crawling speed of up to 2000 webpages per second with an i7-6700K (3 times as powerful as the Pentium G4500), in a single PC.
Thursday, February 11, 2016
Make your own Search Engine: Reverse Indexing billions of words and phrases in less than an hour
After downloading millions of page contents, it's time to make searching fast, as fast as possible. One common approach is to reverse-index all the words within the pages. Here is an example:
These are the contents of the pages:
page 1: hello world
page 2: fantastic four review
page 3: four reasons that iPhone 4 is fantastic in the world
Reverse indexing builds a lookup from "word" to the id of the "page":
hello: 1
world: 1,3
fantastic: 2,3
four: 2,3
review: 2
reasons: 3
that: 3
iPhone: 3
4: 3
is: 3
in: 3
the: 3
OK, sounds simple? Yes, it's simple, but only when your data set is small enough that everything fits in memory. When the amount of data grows large, there are certain bottlenecks. This is the current situation:
number of pages: 10 million
number of words: 10 billion, assuming we've set a bound of at most 1000 words extracted from one page
number of bytes per word: assume 10 on average
We need 10G x 10 = 100GB of memory to hold all the words in memory !!! That's still possible on some big server machine, but how about 100 million pages, or 1 billion pages?
Why not use MySQL?
Assume we can perform 3,000 inserts/updates per second with MySQL. We need 10,000,000,000 / 3,000 = 3,333,333 seconds = 38 days to build up the whole thing. Obviously that's not a good idea.
Why not use a key-value database?
There's not enough memory to hold all the keys. And if the keys are on disk, it's not much different from MySQL.
Final solution:
Plain binary file with merge sort
This is a feasible idea. We make each (word, rank, pageID) tuple a fixed-length structure:
struct DictWord {
    char m_word[32]; // word, maximum 32 bytes of UTF-8
    float m_rank;    // ranking of the word in that page
    MD5 m_pageID;    // 16-byte MD5 checksum of the page URL
};
That's just 32+4+16 = 52 bytes per record. For 10 billion words we need 520GB of data, which is still OK !!!
quick sort & external merge sort:
In order to facilitate a fast lookup, we can't escape from either hashing or sorting. And it's much easier to do external sorting than external hashing (I guess that's why Map-Reduce also uses external sorting). Therefore, sorting on disk is the nicer approach.
- quick sort on a small bunch of data in memory:
we partition the big binary file into 1GB chunks, each of which can be qsorted in memory. It only takes 8-9 seconds to qsort 20 million words.
- merge sort on all the files from the disk:
finally the external merge sort begins over the 520 files. The speed is limited by disk access. For a traditional hard disk with sequential read/write throughput of 200MB/s, we need 2 x 520GB / (200MB/s) = 5200s = 1.4 hours. If you have 2 hard disks you can read and write simultaneously, bringing it down to 0.7 hours. A sketch of both phases is given after this list.
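Here is a condensed sketch of both phases under the assumptions above (52-byte records, ~1GB sorted runs, plain binary files). It is illustrative only: the MD5 field is shown as a raw 16-byte array, and the per-record I/O is left unbuffered for brevity.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <queue>
#include <vector>

// Fixed-length 52-byte record, matching the DictWord layout above
// (the MD5 page id is shown here as a raw 16-byte array).
#pragma pack(push, 1)
struct DictWord {
    char    m_word[32];   // word, at most 32 bytes of UTF-8, zero-padded
    float   m_rank;       // ranking of the word in that page
    uint8_t m_pageID[16]; // MD5 checksum of the page URL
};
#pragma pack(pop)

// Order records by word (memcmp works because m_word is zero-padded).
static bool lessByWord(const DictWord& a, const DictWord& b) {
    return memcmp(a.m_word, b.m_word, sizeof(a.m_word)) < 0;
}

// Phase 1: sort one in-memory chunk (~1GB, ~20 million records) and
// write it out as a sorted run file.
static void sortChunk(std::vector<DictWord>& chunk, FILE* runFile) {
    std::sort(chunk.begin(), chunk.end(), lessByWord);
    fwrite(chunk.data(), sizeof(DictWord), chunk.size(), runFile);
}

// Phase 2: k-way merge of all the sorted runs using a min-heap.
struct HeapItem {
    DictWord rec;
    size_t   src;   // which run file this record came from
};
struct HeapGreater {   // makes std::priority_queue a min-heap on the word
    bool operator()(const HeapItem& a, const HeapItem& b) const {
        return lessByWord(b.rec, a.rec);
    }
};

static void mergeRuns(std::vector<FILE*>& runs, FILE* out) {
    std::priority_queue<HeapItem, std::vector<HeapItem>, HeapGreater> heap;
    for (size_t i = 0; i < runs.size(); i++) {             // prime the heap
        HeapItem it; it.src = i;
        if (fread(&it.rec, sizeof(DictWord), 1, runs[i]) == 1) heap.push(it);
    }
    while (!heap.empty()) {
        HeapItem it = heap.top(); heap.pop();
        fwrite(&it.rec, sizeof(DictWord), 1, out);         // smallest word so far
        HeapItem next; next.src = it.src;                  // refill from the same run
        if (fread(&next.rec, sizeof(DictWord), 1, runs[it.src]) == 1)
            heap.push(next);
    }
}

In practice each run would be read and written through a multi-megabyte buffer, so the disk access stays sequential.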
After reverse indexing, all the words will be in sorted order:
4: 3
fantastic: 2,3
four: 2,3
hello: 1
in: 3
iPhone: 3
is: 3
reasons: 3
review: 2
that: 3
the: 3
world: 1,3
So to summarize, the whole pipeline is:
extract words from pages =>
build up <word, rank, pageID> tuples in memory =>
qsort the tuples whenever the 1GB memory buffer is full and write them into a new file =>
external merge sort over all the 1GB files =>
then you get a large file (~500GB) of well-sorted words with pageIDs.
Current progress: 6.7 million testing pages, average 270 words per page. Still working.
Sunday, January 24, 2016
How to crawl 20 million web pages in a single PC in half a day
These days I have been working on a web crawler, with the ambition of making a search engine myself. Up to now I've already crawled 20 million pages on a single ordinary PC. Here I'd like to share my design:
PC Specifications:
CPU: Pentium G4500 Dual Core 3.5GHz
RAM: DDR4 2133 8GB
Harddisk: TOSHIBA 3TB 7200rpm
OS: Windows 10 x64
Implementation Platform: C++, Visual Studio 2012
Extra Tools: MD5 checksum library
It’s not that difficult if you think about it thoroughly. But it’s good to have some fast & robust underlying data structures:
Implementation of a semi in-memory Key-Value Database in C++:
Relational databases are good for transactions and searching, but not for huge amounts of data or raw performance. We need a simple, fast key-value database. For a search engine, the crawling database should have the following properties:
- Able to insert a (key, value) pair, where the key is a fixed number of bytes and the value is an arbitrary number of bytes, at more than 100,000 inserts per second on a traditional hard disk.
- Able to check whether a key exists in the database without disk access, at more than 1,000,000 checks per second. This means all the keys must be stored in memory in some form.
- Able to get the value for a given key with only 1 or 2 disk accesses. As the values are huge we can't keep them all in memory, and a traditional hard disk only delivers around 100 IOPS.
- Able to delete a key at more than 1,000,000 deletes per second.
- Able to get a random key that exists in the database without disk access. This is an additional property, but it's useful.
Implementation: the append-only database.
Disk layout: each database has 2 files, the KEY file and the VAL file. The KEY file is a big array of items, where each item has the form (16-byte key + 8-byte offset). Here "offset" means the offset in the VAL file where the value starts. The VAL file simply contains a stream of value items, where each value item is (4-byte length + value).
Memory layout: we maintain an in-memory hash from key to offset. Let's call it the key-offset hash.
Database operations: inserting a (key, value) is equivalent to appending a new (key, offset) to the KEY file and the value to the VAL file. We buffer enough data before flushing to the hard disk in one go. Removing a key is equivalent to appending a (key, offset) to the KEY file where the offset equals -1. The other operations are trivial.
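To make the layout concrete, here is a stripped-down sketch of the two files plus the key-offset hash (names, buffering, and error handling are simplified, and it uses MSVC's 64-bit file functions since I'm on Windows; this is not the production code):

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>
#include <unordered_map>

// 16-byte key (e.g. md5 of the URL), plus what std::unordered_map needs.
struct Key16 {
    uint8_t b[16];
    bool operator==(const Key16& o) const { return memcmp(b, o.b, 16) == 0; }
};
struct Key16Hash {
    size_t operator()(const Key16& k) const {
        size_t h; memcpy(&h, k.b, sizeof(h));   // md5 bytes are already well mixed
        return h;
    }
};

// Append-only database: KEY file holds (16-byte key + 8-byte offset),
// VAL file holds (4-byte length + value). offset == -1 marks a deletion.
class AppendOnlyDB {
public:
    AppendOnlyDB(const char* keyPath, const char* valPath) {
        m_key = fopen(keyPath, "ab+");
        m_val = fopen(valPath, "ab+");
        // A real implementation would replay the KEY file here to
        // rebuild m_offsets after a restart.
    }
    ~AppendOnlyDB() { fclose(m_key); fclose(m_val); }

    void insert(const Key16& k, const void* data, uint32_t len) {
        _fseeki64(m_val, 0, SEEK_END);
        int64_t offset = _ftelli64(m_val);          // where the value starts
        fwrite(&len, sizeof(len), 1, m_val);        // (4-byte length + value)
        fwrite(data, 1, len, m_val);
        fwrite(&k, sizeof(k), 1, m_key);            // (16-byte key + 8-byte offset)
        fwrite(&offset, sizeof(offset), 1, m_key);
        m_offsets[k] = offset;                      // the in-memory key-offset hash
    }

    bool exists(const Key16& k) const {             // no disk access
        std::unordered_map<Key16, int64_t, Key16Hash>::const_iterator it = m_offsets.find(k);
        return it != m_offsets.end() && it->second >= 0;
    }

    bool get(const Key16& k, std::string& out) {    // 1 seek + 1-2 reads
        if (!exists(k)) return false;
        fflush(m_val);
        _fseeki64(m_val, m_offsets[k], SEEK_SET);
        uint32_t len = 0;
        if (fread(&len, sizeof(len), 1, m_val) != 1) return false;
        out.resize(len);
        fread(&out[0], 1, len, m_val);
        fseek(m_val, 0, SEEK_END);                  // go back to appending
        return true;
    }

    void remove(const Key16& k) {                   // append a tombstone
        int64_t offset = -1;
        fwrite(&k, sizeof(k), 1, m_key);
        fwrite(&offset, sizeof(offset), 1, m_key);
        m_offsets[k] = offset;
    }

private:
    FILE* m_key;
    FILE* m_val;
    std::unordered_map<Key16, int64_t, Key16Hash> m_offsets;
};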
Database Memory Usage:
For each key we need 16+8 = 24 bytes in memory. Plus the hash and memory allocation overhead, it's around 64 bytes in an x64 application. (Well, we could further reduce the overhead by throwing away std::unordered_map.) So a system with 8GB of available memory can hold up to 125 million database items.
The crawling system Architecture:
- Main Database (MainDB): (key: md5(URL), val: page content).
We can’t use the URL itself as the key of a database, as it’s stored in memory and it’s costly. But using the md5 library we can map each url (or even anything) into a fixed 16-byte id, given the assumption that it’s ok to have just a very very low probability (1 over 2^128) of md5 collision given any 2 different URLs.
- Pending URL Database (PendDB): (key: incremental counter, val: bulk of URLs)
We need to maintain a temporary database for all the pending URLs. But since retrieving values from the database is pretty slow (50-100 IOPS for a normal 7200rpm hard disk), we bundle 20-30 of them per value. That way, reading 1000 pending URLs only takes 33-50 hard-disk accesses. And each time we can retrieve a random bulk of URLs, thanks to database property 5.
- Pending URL HashSet (PendSet): hash-set of md5(URL) that keeps track of whether a given URL is in the pending URL database or not.
- Processing URL HashSet (ProcessingSet): hash-set of md5(URL) that keeps track of whether a given URL is currently being processed.
- Host to ipv4 Hash (Host2IP): hash of md5(Host) -> ipv4, so we can easily retrieve the IPv4 address for a given host.
- Bad Host Set (BadHosts): hash-set of md5(Host). All the hosts that are not reachable.
- URL log file (Log): plain text file containing all the URLs seen so far.
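For a rough picture, the in-memory side of this list can be as simple as a few standard containers keyed by the 16-byte md5 (reusing the Key16/Key16Hash types from the database sketch above; all names here are illustrative):

#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Key16 / Key16Hash are the 16-byte md5 key and its hash functor
// defined in the append-only database sketch above.
std::unordered_set<Key16, Key16Hash> pendSet;            // md5(URL) pending
std::unordered_set<Key16, Key16Hash> processingSet;      // md5(URL) in flight
std::unordered_map<Key16, uint32_t, Key16Hash> host2IP;  // md5(Host) -> IPv4
std::unordered_set<Key16, Key16Hash> badHosts;           // md5(Host), unreachable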
Main Crawling Flow: OK, up to now that's more than 70% of the story. The crawling flow is easy:
For each crawling thread:
Step 1: get a random key from PendDB, and get the value (a bulk of URLs) by that key.
Step 2: for each URL u in the bulk of URLs:
– Step 2.1: remove u from PendSet, put u into ProcessingSet
– Step 2.2: extract the host h & port from u
– Step 2.3: if h is in BadHosts, continue to Step 2; otherwise, if h is in Host2IP, get the IP from Host2IP; otherwise do a DNS lookup.
– Step 2.4: try to initiate a TCP connection to (ip, port); if it fails and h is not in Host2IP, add h to BadHosts and continue to Step 2.
– Step 2.5: send the HTTP request ("GET /path HTTP/1.1\r\nHost: h\r\n\r\n") and receive the page content
– Step 2.6: insert (md5(u), page content) into MainDB.
– Step 2.7: for each URL v in the downloaded page content: if md5(v) is neither in MainDB, nor PendSet, nor ProcessingSet, add v to the new pending URL buffer. When the buffer is full, package those new pending URLs in bulk and insert them into PendDB.
Step 3: go to Step 1 until there are no further URLs in PendDB.
Number of Threads: 2000-5000
I live in Hong Kong with a 300Mbps network from the PCCW network provider. With 5000 crawling threads it's possible to achieve a speed of 500 pages/s, and the bottleneck is always the network. On the other hand, a high thread count means more CPU and memory usage. Typically I tune the stack size down to 128kB or even 64kB per thread, so it only needs about 640MB of memory to power all 5000 threads.
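On Windows, the per-thread stack reservation can be set at thread-creation time; here is a minimal sketch (the worker function is just a placeholder):

#include <windows.h>
#include <vector>

// Hypothetical worker: one crawling loop per thread.
static DWORD WINAPI crawlThread(LPVOID /*param*/)
{
    // ... pull a bulk of pending URLs, download them, feed MainDB ...
    return 0;
}

int main()
{
    const int    kThreadCount = 5000;
    const SIZE_T kStackSize   = 128 * 1024;   // 128kB reserved per thread

    std::vector<HANDLE> threads;
    threads.reserve(kThreadCount);
    for (int i = 0; i < kThreadCount; i++) {
        // STACK_SIZE_PARAM_IS_A_RESERVATION makes dwStackSize the stack
        // reservation (instead of the default 1MB), so 5000 threads
        // reserve roughly 5000 x 128kB = 640MB of address space.
        HANDLE h = CreateThread(NULL, kStackSize, crawlThread, NULL,
                                STACK_SIZE_PARAM_IS_A_RESERVATION, NULL);
        if (h != NULL) threads.push_back(h);
    }
    for (size_t i = 0; i < threads.size(); i++) {
        WaitForSingleObject(threads[i], INFINITE);
        CloseHandle(threads[i]);
    }
    return 0;
}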
Current Numbers:

20.4 million pages are downloaded in MainDB, with 27.6 million pending URLs to download.
Memory usage: 2.4GB currently. Network usage: 69.2Mbps ??? Well, I have to use a 100Mbps router to share the network with other devices, so it's limited to 2000 crawling threads for now.

Hard disk usage: MainDB: 800GB. PendDB: 2.4GB. Log: 5.2GB (am I running out of hard disk space? Maybe…).