
Sunday, May 8, 2016

Crawling one billion web pages on a single PC in 5 days? It's feasible

Recently I've upgraded my network plan from 300 Mbps to 500 Mbps. The PCCW network in Hong Kong is so good that I am able to get around 470-520 Mbps download speed all the time, with my application downloading about 520 different webpages per second, still on a single PC.

But some things need to be considered.

- It soon runs out of hard disk space. An average webpage is around 120 kB. You can cut a page down by getting rid of the comment area (from "<!--" to "-->"), the style area (from <style> to </style>) and the scripting area (from <script> to </script>). But after removing these 3 areas, you still have around 60 kB to write to the hard disk.
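
Here's a rough C++ sketch of the stripping step (my own illustration, not the crawler's actual code; stripArea and its parameters are made-up names, and real-world HTML would also call for a case-insensitive search):

#include <string>

// Remove everything from "begin" to "end" (inclusive), e.g. "<!--" ... "-->".
std::string stripArea(std::string html, const std::string& begin, const std::string& end)
{
    size_t pos = 0;
    while ((pos = html.find(begin, pos)) != std::string::npos) {
        size_t close = html.find(end, pos + begin.size());
        if (close == std::string::npos) {           // unterminated area: drop the rest
            html.erase(pos);
            break;
        }
        html.erase(pos, close + end.size() - pos);  // erase the whole area
    }
    return html;
}

// Usage:
//   page = stripArea(page, "<!--",    "-->");
//   page = stripArea(page, "<style",  "</style>");
//   page = stripArea(page, "<script", "</script>");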

It's a big amount of data, with a lot of redundancy:

1 webpage: 60kB
1,000 webpages: 60MB
1,000,000 webpages: 60GB
1,000,000,000 webpages: 60TB

For 60 TB of storage, you need eight 8 TB hard disks. It's still possible, although quite costly. Alternatively, we can use a fast, simple compression algorithm.

Finding a repeating pattern for compression:

Consider the following text:
<title> Hello World! </title>

It's partially repeating,
<title> Hello World! </title>
                       ******
The fragment marked with '*' has appeared before. So why not use 2 integers (index difference, repeat length) to represent the repeating fragment?

and consider another:
hello hello hello hello
      *****************

Well! It's a self-repeating sequence, where the original part and the repeating part overlap.
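
To see why the overlapping case is fine, here's a minimal C++ decompression sketch (my own illustration; appendBackReference is a made-up helper). The decoder copies byte by byte, so the bytes written at the start of the copy immediately become the source for the later bytes:

#include <string>

// Expand one (distance, length) back-reference at the end of "out".
void appendBackReference(std::string& out, size_t distance, size_t length)
{
    size_t src = out.size() - distance;   // start of the earlier occurrence
    for (size_t i = 0; i < length; i++) {
        char c = out[src + i];            // may be a byte we have just written
        out.push_back(c);
    }
}

int main()
{
    std::string out = "hello ";
    appendBackReference(out, 6, 17);      // -> "hello hello hello hello"
    return 0;
}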

Finding it fast, using in-place hashing:

I have the ambition to get rid of third-party libraries as much as I can. People like to use open-source libraries such as LZO, but it's difficult to optimize third-party code unless you have a complete picture of what it is doing. Therefore I'll make it myself.

In-place hashing:
In the old-school hashing algorithm, we are taught to use the following data structure:

init: Item** hashTable = new Item*[HashIndexSize]();   // zero-initialize all buckets

insert(index, data): hashTable[index] = new Item{ data, hashTable[index] };   // sets m_data and m_next
// each Item itself is a node of a linked list for the same hash index

This is kind of OK. But it runs pretty slowly, for the following reasons:
- It allocates a small block of memory from the heap on every insert.
- Items are scattered across the entire virtual memory space, which increases CPU cache misses.

New design:

init: Item* hashTable = new Item[HashIndexSize];   // every slot starts "empty" (a sentinel value)

insert(index, data): hashTable[index] = data;
// each Item contains the data only, without the "next" pointer.

Ah? Does it work? What about hash collisions? Here's a better way:
insert(index, data):
for (i = 0; i < maxCollision; i++) {
  slot = (i + index) % HashIndexSize;
  if (hashTable[slot] == null) {   // "null" here means an empty slot (e.g. a sentinel value)
    hashTable[slot] = data;
    break;
  }
}

Now we have an in-place hash with a configurable maximum collision count. The hash itself is a bare array of contiguous memory. We use it to locate possible repeating fragments in linear time.
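
Putting it together, here's a minimal C++ sketch of the in-place hash under my own assumptions about the surrounding compressor (InPlaceHash, hashOf and the constant names are illustrative, not the actual code). Each slot stores only a 32-bit position into the input buffer, and a sentinel value marks an empty slot:

#include <cstdint>
#include <cstring>
#include <vector>

constexpr uint32_t kHashIndexSize = 128 * 1024;   // 128k slots, as above
constexpr uint32_t kMaxCollision  = 16;
constexpr uint32_t kEmpty         = 0xFFFFFFFFu;  // sentinel: slot not used yet

// Hash the next 4 bytes of the input (a simple multiplicative hash).
static uint32_t hashOf(const uint8_t* p)
{
    uint32_t v;
    std::memcpy(&v, p, 4);
    return (v * 2654435761u) % kHashIndexSize;
}

struct InPlaceHash {
    std::vector<uint32_t> table;                  // one flat, contiguous array
    InPlaceHash() : table(kHashIndexSize, kEmpty) {}

    // Remember that the 4 bytes at position "pos" hash to "index".
    void insert(uint32_t index, uint32_t pos) {
        for (uint32_t i = 0; i < kMaxCollision; i++) {
            uint32_t slot = (index + i) % kHashIndexSize;
            if (table[slot] == kEmpty) { table[slot] = pos; return; }
        }
        table[index] = pos;   // all probes taken: overwrite (a design choice of this sketch)
    }

    // Return an earlier position whose next 4 bytes match, or kEmpty if none.
    uint32_t findMatch(const uint8_t* buf, uint32_t index, uint32_t pos) const {
        for (uint32_t i = 0; i < kMaxCollision; i++) {
            uint32_t slot = (index + i) % kHashIndexSize;
            uint32_t candidate = table[slot];
            if (candidate == kEmpty) break;
            if (std::memcmp(buf + candidate, buf + pos, 4) == 0) return candidate;
        }
        return kEmpty;
    }
};

Because the table is one flat array and inserting never allocates from the heap, probes mostly stay in the CPU cache; that locality is where the speed comes from.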

Using an in-place hash table of 128k items, with a maximum collision count of 16, I get good compression performance of 400 MB/s, single-threaded, on a 3.5 GHz Skylake CPU. And with the compression algorithm running, I can still achieve a crawling speed of up to 700 webpages per second with a dual-core Pentium CPU.

After compression, the average webpage size is reduced to somewhere between 10 and 11 kB.

Therefore you only need 12 TB of storage to store 1 billion web pages (1,000,000,000 × ~11 kB ≈ 11 TB)!!! And theoretically you can reach a crawling speed of up to 2000 webpages per second with an i7-6700K (roughly 3 times as powerful as the Pentium G4500), on a single PC.

