
Sunday, January 24, 2016

How to crawl 20 million web pages in a single PC in half a day

These days I have been working on a web crawler, with the ambition of making a search engine myself. Up to now I've crawled about 20 million pages on a single ordinary PC, and here I'd like to share my design:
PC Specifications:
CPU: Pentium G4500 Dual Core 3.5GHz
RAM: DDR4 2133 8GB
Harddisk: TOSHIBA 3TB 7200rpm
OS: Windows 10 x64
Implementation Platform: C++, Visual Studio 2012
Extra Tools: MD5 checksum library
It’s not that difficult if you think about it thoroughly. But it’s good to have some fast & robust underlying data structures:
Implementation of a semi in-memory Key-Value Database in C++:
Relational databases are good for transactions and searching, but they're not good at handling huge amounts of data, nor at raw performance. We need a simple, fast key-value database. For a search-engine crawler, the crawling database should have the following properties:
  1. Able to insert a (key, value) pair, where the key is a fixed number of bytes and the value is an arbitrary number of bytes, at more than 100,000 inserts per second on a traditional hard disk.
  2. Able to check whether a key exists in the database without any disk access, at more than 1,000,000 checks per second. This means all the keys must be stored in memory in some form.
  3. Able to get the value for a given key with only 1 or 2 disk accesses. The values are huge, so we can't keep them all in memory, and a traditional hard disk only delivers around 100 IOPS.
  4. Able to delete a key at more than 1,000,000 deletes per second.
  5. Able to get a random key that exists in the database without disk access. This is an additional property, but it's useful.
Implementation: the append-only database.
Disk Layout: Each database has 2 files, the KEY file and the VAL file. The KEY file is simply a big array of items, where each item has the form (16-byte key + 8-byte offset). Here "offset" means the position in the VAL file at which the value starts. The VAL file contains a stream of value items, where each value item is (4-byte length + value).
Memory Layout: we need to maintain an in-memory hash from key to offset. Let's just call it the key-offset hash.
Database Operations: Inserting a (key, value) pair is equivalent to appending a new (key, offset) record to the KEY file and the value to the VAL file. We buffer enough data before flushing it to the hard disk in one go. Removing by key is equivalent to appending a (key, offset) record to the KEY file where the offset equals -1. The other operations are trivial.
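To make this concrete, here is a minimal C++ sketch of such an append-only database. The class and member names are my own illustration, and the write buffering, error handling, and the startup replay of the KEY file (to rebuild the key-offset hash) are left out:

```cpp
// Minimal sketch of the append-only key-value database described above.
// The real thing buffers many records before each flush and replays the
// KEY file at startup to rebuild the in-memory key-offset hash.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>
#include <unordered_map>
#include <vector>

struct Key16 {                          // fixed 16-byte key, e.g. md5(URL)
    unsigned char b[16];
    bool operator==(const Key16& o) const { return std::memcmp(b, o.b, 16) == 0; }
};
struct Key16Hash {                      // md5 bytes are already well mixed
    size_t operator()(const Key16& k) const {
        size_t h; std::memcpy(&h, k.b, sizeof(h)); return h;
    }
};

class AppendOnlyDB {
public:
    explicit AppendOnlyDB(const std::string& path)
        : keyFile(std::fopen((path + ".key").c_str(), "ab")),
          valFile(std::fopen((path + ".val").c_str(), "ab+")) {}
    ~AppendOnlyDB() {
        if (keyFile) std::fclose(keyFile);
        if (valFile) std::fclose(valFile);
    }

    // Property 1: append the value as (4-byte length + bytes) to the VAL file,
    // and the (16-byte key + 8-byte offset) record to the KEY file and hash.
    void insert(const Key16& key, const void* val, uint32_t len) {
        _fseeki64(valFile, 0, SEEK_END);        // MSVC 64-bit seek/tell
        int64_t off = _ftelli64(valFile);
        std::fwrite(&len, 4, 1, valFile);
        std::fwrite(val, 1, len, valFile);
        writeKeyRecord(key, off);
        offsets[key] = off;
    }

    // Property 2: membership check, memory only.
    bool exists(const Key16& key) const { return offsets.count(key) != 0; }

    // Property 3: one seek + one read into the VAL file.
    bool get(const Key16& key, std::vector<char>& out) {
        auto it = offsets.find(key);
        if (it == offsets.end()) return false;
        _fseeki64(valFile, it->second, SEEK_SET);
        uint32_t len = 0;
        std::fread(&len, 4, 1, valFile);
        out.resize(len);
        if (len) std::fread(&out[0], 1, len, valFile);
        return true;
    }

    // Property 4: delete = append a tombstone record with offset -1.
    void remove(const Key16& key) {
        writeKeyRecord(key, -1);
        offsets.erase(key);
    }

    // Property 5 (random existing key) needs one extra structure kept in
    // sync with the hash, e.g. a std::vector<Key16>; omitted here.

private:
    void writeKeyRecord(const Key16& key, int64_t off) {
        std::fwrite(key.b, 1, 16, keyFile);      // 16-byte key
        std::fwrite(&off, 8, 1, keyFile);        // 8-byte offset (-1 marks a delete)
    }
    std::FILE* keyFile;
    std::FILE* valFile;
    std::unordered_map<Key16, int64_t, Key16Hash> offsets;   // the key-offset hash
};
```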
Database Memory Usage:
For each key we need 16 + 8 = 24 bytes in memory. With the hash and memory-allocation overhead, it's around 64 bytes per key for x64 applications. (Well, we can further reduce the overhead by throwing away std::unordered_map.) So a system with 8GB of available memory can hold up to 125 million database items.
The Crawling System Architecture:
  • Main Database (MainDB): (key: md5(URL), val: page content).
    We can't use the URL itself as the database key, since keys are stored in memory and that would be costly. But with the md5 library we can map each URL (or anything else, really) to a fixed 16-byte id, under the assumption that it's OK to have an extremely low probability (1 in 2^128) of an md5 collision between any 2 different URLs.
  • Pending URL Database (PendDB): (key: incremental counter, val: bulk of URLs)
    We need to maintain a temporary database of all the pending URLs. But since retrieving a value from the database is pretty slow (50-100 IOPS for a normal 7200rpm hard disk), we bundle 20-30 URLs into each bulk record, so that reading 1000 pending URLs only takes 33-50 disk accesses. Each time we can retrieve a random bulk of URLs, thanks to database property 5.
  • Pending URL HashSet (PendSet): hash-set of md5(URL) that keeps track of whether a given URL is in the pending URL database.
  • Processing URL HashSet (ProcessingSet): hash-set of md5(URL) that keeps track of whether a given URL is currently being processed.
  • Host to ipv4 Hash (Host2IP): hash of md5(Host) -> ipv4, so we can easily retrieve the ipv4 address for a given host.
  • Bad Host Set (BadHosts): hash-set of md5(Host) for all the hosts that are not reachable.
  • URL log file (Log): plain text file containing all the URLs seen so far.
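Putting these pieces together, the whole crawler state is just a handful of objects. Roughly something like the following sketch, reusing the AppendOnlyDB class from the database section (the field names here are illustrative, not the exact ones in my code):

```cpp
// Rough shape of the crawler state listed above (illustrative names).
// Key16 / Key16Hash / AppendOnlyDB are the types sketched earlier.
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <unordered_set>

struct CrawlerState {
    AppendOnlyDB mainDB;   // MainDB: md5(URL) -> page content
    AppendOnlyDB pendDB;   // PendDB: incremental counter (packed into a 16-byte key) -> bulk of 20-30 URLs
    uint64_t     nextBulkId;                                    // counter for the next bulk inserted into PendDB

    std::unordered_set<Key16, Key16Hash> pendSet;               // PendSet: URLs waiting in PendDB
    std::unordered_set<Key16, Key16Hash> processingSet;         // ProcessingSet: URLs being crawled right now
    std::unordered_map<Key16, uint32_t, Key16Hash> host2IP;     // Host2IP: md5(host) -> ipv4 address
    std::unordered_set<Key16, Key16Hash> badHosts;              // BadHosts: md5(host) of unreachable hosts
    std::FILE* log;                                             // Log: plain-text file of every URL seen

    CrawlerState()
        : mainDB("main"), pendDB("pend"), nextBulkId(0),
          log(std::fopen("urls.log", "a")) {}
};
```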

Main Crawling Flow: OK, up to now that's more than 70% of the story. The crawling flow itself is easy:
For each crawling thread:
Step 1: get a random key from PendDB, then get its value (a bulk of URLs).
Step 2: for each URL u in the bulk of URLs:
– Step 2.1: remove u from PendSet and put u into ProcessingSet.
– Step 2.2: extract the host h and port from u.
– Step 2.3: if h is in BadHosts, continue to Step 2; otherwise, if h is in Host2IP, get the ip from Host2IP; otherwise do a DNS lookup.
– Step 2.4: try to initiate a TCP connection to (ip, port); if it fails and h is not in Host2IP, add h to BadHosts and continue to Step 2.
– Step 2.5: send an HTTP request ("GET /path HTTP/1.1" plus a Host header, ending with a blank line) and receive the page content.
– Step 2.6: insert (md5(u), page content) into MainDB.
– Step 2.7: for each URL v in the downloaded page content: if md5(v) is neither in MainDB, nor PendSet, nor ProcessingSet, add v to the new-pending-URL buffer (and to PendSet, so it isn't queued twice). When the buffer is full, package those new pending URLs into one bulk record and insert it into PendDB.
Step 3: go to Step 1 until there are no further URLs in PendDB.
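A C++ sketch of one crawling thread running these steps is below. The helpers (randomPendingBulk, md5Of, parseHostPort, dnsLookup, httpFetch, extractLinks, flushPendingBulk) are hypothetical names standing in for code not shown, and the locking around the shared hash sets is omitted for brevity:

```cpp
// One crawling thread following Steps 1-3 above (sketch only; no locking shown).
#include <cstdint>
#include <string>
#include <vector>

// Assumed helpers, prototypes only:
bool randomPendingBulk(CrawlerState& s, std::vector<std::string>& bulk); // DB property 5 + get()
Key16 md5Of(const std::string& text);                                    // 16-byte md5
void parseHostPort(const std::string& url, std::string& host, int& port);
bool dnsLookup(const std::string& host, uint32_t& ip);
bool httpFetch(uint32_t ip, int port, const std::string& host,
               const std::string& url, std::string& page);               // connect + GET + receive
std::vector<std::string> extractLinks(const std::string& page);
void flushPendingBulk(CrawlerState& s, std::vector<std::string>& buf);   // one bulk into PendDB, clears buf

void crawlThread(CrawlerState& s) {
    std::vector<std::string> newPending;                   // buffer of freshly discovered URLs
    std::vector<std::string> bulk;
    while (randomPendingBulk(s, bulk)) {                    // Step 1: random key -> bulk of URLs
        for (const std::string& u : bulk) {                 // Step 2
            Key16 ku = md5Of(u);
            s.pendSet.erase(ku);                            // Step 2.1
            s.processingSet.insert(ku);

            std::string host; int port = 80;                // Step 2.2
            parseHostPort(u, host, port);
            Key16 kh = md5Of(host);

            if (s.badHosts.count(kh)) continue;             // Step 2.3
            uint32_t ip = 0;
            auto it = s.host2IP.find(kh);
            if (it != s.host2IP.end()) ip = it->second;
            else if (!dnsLookup(host, ip)) { s.badHosts.insert(kh); continue; }

            std::string page;                               // Steps 2.4-2.5: connect, GET, receive
            if (!httpFetch(ip, port, host, u, page)) {
                if (!s.host2IP.count(kh)) s.badHosts.insert(kh);
                continue;
            }
            s.host2IP[kh] = ip;

            s.mainDB.insert(ku, page.data(), (uint32_t)page.size());     // Step 2.6

            for (const std::string& v : extractLinks(page)) {            // Step 2.7
                Key16 kv = md5Of(v);
                if (!s.mainDB.exists(kv) && !s.pendSet.count(kv) && !s.processingSet.count(kv)) {
                    newPending.push_back(v);
                    s.pendSet.insert(kv);
                    if (newPending.size() >= 25)            // bundle 20-30 URLs per bulk
                        flushPendingBulk(s, newPending);
                }
            }
            s.processingSet.erase(ku);
        }
    }                                                       // Step 3: stop when PendDB is empty
}
```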
Number of Threads: 2000-5000
I am living in Hong Kong with a 300Mbps connection from the network provider PCCW. With 5000 crawling threads it's possible to reach a speed of 500 pages/s, and the bottleneck is always the network. On the other hand, setting a high number of threads means using more CPU and memory. Typically I tune the stack size down to 128kB or even 64kB per thread, so it only needs about 640MB of memory to power all 5000 threads.
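For reference, the per-thread stack size can be set at thread-creation time on Windows. A rough sketch (the function names are mine; the real crawler passes its shared state through the thread argument):

```cpp
// Spawning crawler threads with a reduced stack reservation (Win32 sketch).
// With the default 1MB reservation, 5000 threads would take ~5GB of address
// space; reserving 128kB per thread brings that down to ~640MB.
#include <windows.h>
#include <process.h>        // _beginthreadex

unsigned __stdcall crawlThreadEntry(void* arg) {
    (void)arg;
    // crawlThread(*static_cast<CrawlerState*>(arg));   // the loop sketched above
    return 0;
}

void spawnCrawlers(void* sharedState, int count /* 2000-5000 */) {
    for (int i = 0; i < count; ++i) {
        uintptr_t h = _beginthreadex(
            NULL,
            128 * 1024,                         // 128kB stack instead of the 1MB default
            crawlThreadEntry,
            sharedState,
            STACK_SIZE_PARAM_IS_A_RESERVATION,  // make 128kB the reservation, not just the initial commit
            NULL);
        if (h) CloseHandle(reinterpret_cast<HANDLE>(h));
    }
}
```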
Current Numbers:
20.4 million pages downloaded into MainDB, with 27.6 million pending URLs still to download.
Memory Usage: 2.4GB currently. Network Usage: 69.2Mbps? Well, I have to share the network with other devices through a 100Mbps router, so it's limited to 2000 crawling threads currently.
Hard-disk Usage: MainDB: 800GB. PendDB: 2.4GB. Log: 5.2GB (am I running out of hard-disk space? Maybe…).
