7.1.4.1. Crawlers and Crawling
7.1.4.1.1. What is a Web Crawler/Spider/Robot?
A web crawler is a computer program that visits web pages in an organized way. Examples include Google's Googlebot; Yahoo's Yahoo! Slurp; and Bing's Bingbot, AdIdxBot, MSNBot, MSNBot-Media, and BingPreview.
Google now uses multiple crawlers :
- Googlebot
- Googlebot News
- Googlebot Images
- Googlebot Video
- Google Mobile Smartphone
- Google Mobile AdSense
- Google AdsBot
- Google app crawler
7.1.4.1.2. What are the crawling issues?
7.1.4.1.2.1. How to crawl?
- Quality : how to find the best page first?
- Efficiency : how to avoid duplication (or near duplication)
- Etiquette : behave politely by not disturbing a website's performance
7.1.4.1.2.2. How much to crawl? How much to index?
- Coverage : What percentage of the web should be covered?
- Relative Coverage : How much do competitors have?
7.1.4.1.2.3. How often to crawl?
- Freshness : How much has changed? How much has really changed?
7.1.4.1.2.4. Complications and More Issues:
- crawling the entire web isn't feasible with one machine ==> crawling must be distributed across many machines
- latency/bandwidth to remote servers can vary widely
- avoiding malicious pages : spam, spider traps (dynamically generated pages)
- robots.txt can prevent a crawler from visiting certain pages (see Robots.txt below)
- avoiding mirrors and duplicate pages (see Avoid Page Duplication below)
- maintain politeness - don't hit a server too often
- refresh strategies (see Freshness below)
- speeding up DNS lookups
7.1.4.1.2.5. Policies of web crawler
The behavior of a Web crawler is the outcome of a combination of policies :
- A selection policy that states which pages to download
- A revisit policy that states when to check for changes to the pages
- A politeness policy that states how to avoid overloading websites
- A parallelization policy that states how to coordinate distributed web crawlers.
7.1.4.2. Simplest Crawler Operation (Crawling Algorithm Version 01)
Begin with a known "seed" page, fetch and parse it (extract the URLs within the page and put them on a queue), then fetch each URL on the queue and repeat.
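A minimal sketch of this loop, assuming the third-party requests and BeautifulSoup libraries are available; it ignores politeness, robots.txt, and near-duplicate detection, all of which are covered below:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def simple_crawl(seed_url, max_pages=100):
    """Crawling Algorithm Version 01: seed -> fetch -> parse -> enqueue -> repeat."""
    frontier = deque([seed_url])   # queue of URLs still to fetch
    visited = set()                # URLs already fetched (avoids trivial repeats)

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            page = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        visited.add(url)

        # Extract the URLs within the page and put them on the queue.
        soup = BeautifulSoup(page.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            frontier.append(urljoin(url, anchor["href"]))

    return visited
```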
7.1.4.3. Robots.txt
Robots.txt defines the limitations for a web crawler as it visits a website. The website announces its requests about what can/can't be crawled by placing a robots.txt file in its root directory.
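A minimal sketch using Python's standard urllib.robotparser to honor a site's robots.txt before fetching; the URLs and the "MyCrawler" user agent are placeholders:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # robots.txt lives in the root directory
rp.read()

# Ask whether our crawler (user agent "MyCrawler") may fetch a given path.
if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```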
7.1.4.4. Basic Search Strategies
7.1.4.4.1. Breadth first search
BFS crawling brings in high quality pages early in the crawl.
7.1.4.4.2. Depth first search
How new links are added to the queue determines the search strategy (a minimal sketch follows the list below).
- FIFO (append to end of Q) gives breadth-first search.
- LIFO (add to front of Q) gives depth-first search.
- Heuristically ordering the Q gives a "focused crawler"
- Move forward URLs whose In-degree is large
- Move forward URLs whose PageRank is large
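A minimal sketch showing how the queue discipline alone switches between BFS and DFS; `fetch` and `extract_links` are hypothetical helpers standing in for page download and link extraction:

```python
from collections import deque

def crawl(seed, fetch, extract_links, strategy="bfs", max_pages=100):
    """strategy='bfs' appends new links to the end of the queue (FIFO);
    strategy='dfs' pushes them onto the front (LIFO)."""
    frontier = deque([seed])
    visited = set()

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()          # always take from the front
        if url in visited:
            continue
        visited.add(url)
        links = extract_links(fetch(url))
        if strategy == "bfs":
            frontier.extend(links)        # FIFO: append to end of Q
        else:
            frontier.extendleft(links)    # LIFO: add to front of Q
    return visited
```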
7.1.4.5. Avoid Page Duplication
A crawler must detect when it revisits a page that has already been crawled. Therefore, a crawler must efficiently index URLs as well as already-crawled pages.
The web is a graph not a tree.
- determine if a URL has already been seen (see Representing URLs below)
- store URL in standard format
- develop a fast way to check if a URL has already been seen
- determine if a new page has been seen
- develop a fast way to determine if an identical or near-identical page was already indexed
7.1.4.5.1. Link Extraction
- Must find all links in a page and extract URLs
- Relative URLs must be completed using the current page URL or the <base> tag (a minimal sketch follows)
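A minimal sketch of link extraction with relative-URL completion, assuming BeautifulSoup; it honors a <base> tag when the page declares one and otherwise falls back to the page's own URL:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_links(page_url, html):
    soup = BeautifulSoup(html, "html.parser")

    # Use the <base href="..."> tag if present, else the page URL itself.
    base_tag = soup.find("base", href=True)
    base_url = base_tag["href"] if base_tag else page_url

    links = []
    for anchor in soup.find_all("a", href=True):
        links.append(urljoin(base_url, anchor["href"]))  # completes relative URLs
    return links
```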
7.1.4.5.1.1. Two anomalies
- Some anchors don't have links
- Some anchors produce dynamic pages, which can lead to looping
7.1.4.5.1.2. Representing URLs
One Proposed Method
- First hash on host/domain name
- then, use a trie data structure to determine if the path/resource is the same as one in the database
Trie for URL Exact Matching
Characteristics of tries
- They share the same prefix among multiple "words"
- Each path from the root to a leaf corresponds to one "word"
- An end-marker symbol $ at the ends of words avoids confusion between words with almost identical elements (e.g., one word being a prefix of another)
- For n URLs of maximum length k, lookup time is O(k), independent of n
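A minimal sketch of this proposal: hash on the host first (here, a plain dict keyed by host), then store each path in a character trie with a $ end marker so an exact-match lookup costs O(k) for a path of length k. This is an illustration of the idea, not the exact structure used by any production crawler.

```python
class Trie:
    """Character trie with '$' end markers for exact path matching."""
    END = "$"

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self.END] = True  # mark the end of a complete word

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return self.END in node  # O(k) for a word of length k


seen = {}  # host/domain name -> Trie of paths seen on that host

def already_seen(host, path):
    """Hash on the host first, then check the path in that host's trie."""
    trie = seen.setdefault(host, Trie())
    if trie.contains(path):
        return True
    trie.insert(path)
    return False
```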
Another Proposed Method
URLs are sorted lexicographically and then stored as a delta-encoded text file (a minimal sketch follows this list)
- Each entry is stored as the difference (delta) between the current and previous URL; this substantially reduces storage
- However, restoring the actual URL is slower
- To improve speed, checkpointing (storing the full URL) is done periodically
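A minimal sketch of delta-encoding a lexicographically sorted URL list: each entry records how many leading characters it shares with the previous URL plus the differing suffix. Periodic checkpointing of full URLs is omitted for brevity.

```python
def delta_encode(sorted_urls):
    """Encode each URL as (shared_prefix_len, suffix) relative to the previous one."""
    encoded, prev = [], ""
    for url in sorted_urls:
        shared = 0
        while shared < min(len(prev), len(url)) and prev[shared] == url[shared]:
            shared += 1
        encoded.append((shared, url[shared:]))
        prev = url
    return encoded

def delta_decode(encoded):
    """Restore the full URLs (slower, since each entry depends on the previous one)."""
    urls, prev = [], ""
    for shared, suffix in encoded:
        url = prev[:shared] + suffix
        urls.append(url)
        prev = url
    return urls

urls = sorted(["http://a.com/x", "http://a.com/y", "http://b.com/"])
assert delta_decode(delta_encode(urls)) == urls
```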
7.1.4.5.1.3. Why is Normalizing URLs Important?
People can type a URL in different ways, but the variants all point to the same place, e.g. http://www.google.com and google.com
Normalizing URLs (4 rules; a minimal sketch follows this list):
- Convert scheme and host to lower case.
- Capitalize letters in escape sequences.
- Decode percent-encoded octets of unreserved characters.
- Remove the default port.
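A minimal sketch of the four rules using only the standard library; the percent-decoding step only touches unreserved characters (letters, digits, '-', '.', '_', '~'), and userinfo, query, and fragment handling are left out for brevity:

```python
import re
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "abcdefghijklmnopqrstuvwxyz"
              "0123456789-._~")

def _fix_escape(match):
    # Rules 2 and 3: upper-case the hex digits, or decode the octet entirely
    # if it encodes an unreserved character.
    octet = chr(int(match.group(1), 16))
    return octet if octet in UNRESERVED else "%" + match.group(1).upper()

def normalize(url):
    parts = urlsplit(url)
    # Rule 1: convert scheme and host to lower case.
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Rule 4: remove the default port (80 for http, 443 for https).
    port = parts.port
    if port and (scheme, port) not in (("http", 80), ("https", 443)):
        host = f"{host}:{port}"
    path = re.sub(r"%([0-9A-Fa-f]{2})", _fix_escape, parts.path)
    return urlunsplit((scheme, host, path, parts.query, parts.fragment))

print(normalize("HTTP://WWW.Example.com:80/a%2db%7e%3f"))
# -> http://www.example.com/a-b~%3F
```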
7.1.4.6. Avoid Spider Traps
A spider trap occurs when a crawler re-visits the same page over and over again. The most well-known spider trap is the one created by the use of session IDs.
7.1.4.7. Handling Spam
- First Generation : consisted of pages with a high number of repeated terms
- Second Generation : cloaking ==> returning different pages to crawlers and to users
- Third Generation : doorway pages ==> contain highly ranked text and metadata
7.1.4.8. The Mercator Web Crawler
7.1.4.9. Measuring and Tuning a Crawler reduces to:
- Improving URL parsing speed
- Improving network bandwidth speed
- Improving fault tolerance
7.1.4.10. DNS caching, pre-fetching and resolution
- DNS caching : build a caching server that retains IP/domain-name mappings previously discovered (a minimal sketch follows this list)
- Pre-fetching client : once the page is parsed, immediately make DNS resolution requests to the caching server; if unresolved, use UDP (User Datagram Protocol) to resolve the name via the DNS server.
- Customize the crawler so it allows issuing many resolution requests simultaneously; there should be many DNS resolvers.
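A minimal sketch of a DNS cache in front of the system resolver, using a plain dict with no TTL handling; real crawlers use dedicated asynchronous resolvers, and the UDP fallback mentioned above is omitted here:

```python
import socket

_dns_cache = {}  # domain name -> IP address mappings discovered so far

def resolve(hostname):
    """Return the IP for hostname, consulting the local cache first."""
    if hostname in _dns_cache:
        return _dns_cache[hostname]
    ip = socket.gethostbyname(hostname)  # blocking lookup via the system resolver
    _dns_cache[hostname] = ip
    return ip
```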
7.1.4.11. Multi-Threaded Crawling
One bottleneck is network delay in downloading individual pages.
It is best to have multiple threads running in parallel, each requesting a page from a different host (a minimal sketch follows the list below).
- thread : a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler.
- In most cases, a thread is a component of a process.
- Multiple threads can exist within the same process and share resources.
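A minimal sketch of the idea with a thread pool and a shared work queue, using only the standard library; `fetch` here is a bare-bones stand-in for a real page download:

```python
import queue
import threading
import urllib.request

def fetch(url):
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read()

def worker(frontier, results):
    while True:
        url = frontier.get()
        if url is None:          # sentinel: no more work for this thread
            frontier.task_done()
            return
        try:
            results[url] = fetch(url)
        except OSError:
            pass                 # network errors are simply skipped here
        frontier.task_done()

def crawl_parallel(urls, num_threads=8):
    frontier, results = queue.Queue(), {}
    threads = [threading.Thread(target=worker, args=(frontier, results))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for url in urls:
        frontier.put(url)
    for _ in threads:
        frontier.put(None)       # one sentinel per worker
    frontier.join()
    return results
```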
7.1.4.12. Distributed Crawling Approaches
Distribute URLs to threads to guarantee an equitable distribution of requests across different hosts, to maximize throughput, and to avoid overloading any single server.
- How many crawlers should be running at any time?
- Scenario 1: A centralized crawler controlling a set of parallel crawlers all running on a LAN
- Scenario 2: A distributed set of crawlers running on widely distributed machines, with or without cross communication.
7.1.4.12.1. Distributed Model
Three strategies for coordinating distributed crawling :
- Independent : no coordination; every process follows its extracted links
- Dynamic assignment : a central coordinator dynamically divides the web into small partitions and assigns each partition to a process
- Static assignment : the web is partitioned and assigned without a central coordinator before the crawl starts
- inter-partition links can be handled in three ways, as follows :
- Firewall mode
- Cross-over mode
- Exchange mode (communication required):
- Batch communication : every process collects some URLs and sends them in a batch
- Replication : the k most popular URLs are replicated at each process and not exchanged
7.1.4.12.1.1. Some ways to partition the Web:
- URL-hash based : this yields many inter-partition links
- Site-hash based : reduces the inter-partition links (a minimal sketch follows this list)
- Hierarchical : by TLD(Top-Level-Domain)
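A minimal sketch of static, site-hash-based assignment: every URL from the same host lands on the same crawler process, so links that stay within a site never cross a partition boundary. hashlib is used because its digests are stable across processes, unlike Python's built-in hash.

```python
import hashlib
from urllib.parse import urlsplit

def assign_partition(url, num_crawlers):
    """Site-hash based static assignment: hash the host, not the full URL."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers

print(assign_partition("http://example.com/a.html", 4))
print(assign_partition("http://example.com/b/c.html", 4))  # same partition as above
```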
7.1.4.12.2. General Conclusions of Cho and Garcia-Molina
- Firewall crawlers attain good, general coverage with low cost
- Cross-over ensures 100% quality, but suffers from overlap
- Replicating URLs and batch communication can reduce overhead
7.1.4.13. Freshness
How fresh is the page in our database compared to the live page out on the web?
Web is very dynamic : many new pages, updated pages and deleted pages, etc.
7.1.4.13.1. Cho and Garcia-Molina Experiment
7.1.4.13.2. Implications for a Web Crawler
7.1.4.13.3. Change Frequency vs. Optimal Re-visiting
7.1.4.13.4. Cho and Garcia-Molina
Two simple re-visiting policies:
- Uniform policy : re-visiting all pages with the same frequency
- Proportional Policy : re-visiting more often the pages that change more frequently
The uniform policy outperforms the proportional policy.
The explanation for this result comes from the fact that, when a page changes too often, the crawler will waste time by trying to re-crawl it too fast and still will not be able to keep its copy of the page fresh.
To improve freshness, we should penalize the elements that change too fast (a toy calculation follows).
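A toy calculation (not the original experiment) illustrating this result, under the common simplifying assumption that each page changes as a Poisson process: with change rate λ and re-visit interval I, the expected fraction of time the local copy is fresh is (1 - e^(-λI)) / (λI). The change rates and crawl budget below are invented for illustration; with them, the uniform policy comes out ahead.

```python
import math

def expected_freshness(rate, interval):
    """E[fraction of time the local copy is fresh] under Poisson changes."""
    x = rate * interval
    return (1 - math.exp(-x)) / x

change_rates = [1, 1, 1, 1, 9]   # changes per day for five hypothetical pages
budget = 5                        # total re-visits per day the crawler can afford

# Uniform policy: every page gets the same re-visit interval.
uniform_interval = len(change_rates) / budget
uniform = sum(expected_freshness(r, uniform_interval)
              for r in change_rates) / len(change_rates)

# Proportional policy: re-visit frequency proportional to the change rate.
total = sum(change_rates)
proportional = sum(expected_freshness(r, total / (budget * r))
                   for r in change_rates) / len(change_rates)

print(f"uniform:      {uniform:.3f}")       # higher average freshness here
print(f"proportional: {proportional:.3f}")
```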
7.1.4.14. SiteMap
A sitemap is a list of pages of a web site accessible to crawlers.
- XML is used as the standard for representing sitemaps; each entry contains <loc>, <lastmod>, <changefreq>, and <priority>
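A minimal sketch that builds a one-entry sitemap with the standard xml.etree module; the URL, date, and values are placeholders:

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NS)

url = ET.SubElement(urlset, "url")
ET.SubElement(url, "loc").text = "http://www.example.com/"
ET.SubElement(url, "lastmod").text = "2024-01-01"
ET.SubElement(url, "changefreq").text = "monthly"
ET.SubElement(url, "priority").text = "0.8"

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```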
7.1.4.15. URL Frontier Using the Mercator Scheme
- a kind of round-robin among the queues
- queues are given different probabilities (a minimal sketch follows)
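A minimal sketch of the selection idea only, under strong simplifying assumptions: a set of priority queues is polled with different probabilities, so higher-priority queues are served more often while every non-empty queue is eventually visited. The per-host back queues and politeness timers of the full Mercator design are omitted.

```python
import random
from collections import deque

class Frontier:
    """Priority queues only: queue i is selected with probability weights[i]."""

    def __init__(self, weights):
        self.weights = weights
        self.queues = [deque() for _ in weights]

    def add(self, url, priority):
        self.queues[priority].append(url)

    def next_url(self):
        # Biased selection among the non-empty queues.
        candidates = [i for i, q in enumerate(self.queues) if q]
        if not candidates:
            return None
        weights = [self.weights[i] for i in candidates]
        chosen = random.choices(candidates, weights=weights)[0]
        return self.queues[chosen].popleft()

frontier = Frontier(weights=[0.6, 0.3, 0.1])   # three priority levels
frontier.add("http://example.com/important", priority=0)
frontier.add("http://example.com/minor", priority=2)
print(frontier.next_url())
```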