7.1.4.1. Crawlers and Crawling

7.1.4.1.1. What is a web crawler/spider/robot?

A web crawler is a computer program that visits web pages in an organized way. Examples include Google's Googlebot, Yahoo's Yahoo! Slurp, and Bing's crawlers (Bingbot, AdIdxBot, MSNBot, MSNBot-Media, BingPreview).

Google now uses multiple crawlers :

  • Googlebot
  • Googlebot News
  • Googlebot Images
  • Googlebot Video
  • Google Mobile Smartphone
  • Google Mobile AdSense
  • Google AdsBot
  • Google app crawler

7.1.4.1.2. What are the crawling issues?

7.1.4.1.2.1. How to crawl?

  • Quality : how to find the best pages first?
  • Efficiency : how to avoid duplication (or near-duplication)?
  • Etiquette : behave politely by not disturbing a website's performance

7.1.4.1.2.2. How much to crawl? How much to index?

  • Coverage : What percentage of the web should be covered?
  • Relative Coverage : How much do competitors have?

7.1.4.1.2.3. How often to crawl?

  • Freshness : How much has changed? How much has really changed?

7.1.4.1.2.4. Complications and More Issues

7.1.4.1.2.5. Policies of web crawler

The behavior of a Web crawler is the outcome of a combination of policies :

  • A selection policy that states which pages to download
  • A revisit policy that states when to check for changes to the pages
  • A politeness policy that states how to avoid overloading websites
  • A parallelization policy that states how to coordinate distributed web crawlers.

7.1.4.2. Simplest Crawler Operation (Crawling Algorithm Version 01)

Begin with a known "seed" page; fetch and parse it (extract the URLs within the page and put them on a queue); then fetch each URL on the queue and repeat.
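
A minimal Python sketch of this version-01 crawler, using only the standard library; the seed URL is hypothetical, and a real crawler would add politeness delays, robots.txt checks, and richer error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    frontier = deque([seed])      # queue of URLs still to fetch
    seen = {seed}                 # URLs already queued (avoids duplicates)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()  # FIFO => breadth-first crawl
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue              # skip unreachable or malformed URLs
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # complete relative URLs
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

# crawl("https://example.com/")   # hypothetical seed page
```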

7.1.4.3. Robots.txt

Robots.txt defines the limitations for a web crawler as it visits a website. The website announces its requests about what can/can't be crawled by placing a robots.txt file in its root directory.
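
A small sketch of how a crawler might honor these limitations using Python's standard `urllib.robotparser`; the robots.txt content and the crawler name "MyCrawler" are hypothetical (in practice you would load the file from the site's root with `set_url` and `read`).

```python
from urllib import robotparser

# A hypothetical robots.txt that a site might place at its root.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))         # True
print(rp.crawl_delay("MyCrawler"))                                         # 10
```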

7.1.4.4. Basic Search Strategies

BFS crawling brings in high-quality pages early in the crawl.

How the new links are added to the queue determines the search strategy (see the sketch after this list).

  • FIFO (append to the end of the queue) gives breadth-first search.
  • LIFO (append to the front of the queue) gives depth-first search.
  • Heuristically ordering the queue gives a "focused crawler"
    • Move forward URLs whose in-degree is large
    • Move forward URLs whose PageRank is large
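
A sketch of how the same frontier yields different strategies depending on insertion order; the URLs and priority scores are hypothetical.

```python
import heapq
from collections import deque

frontier = deque()

# FIFO: append new URLs at the end, pop from the front => breadth-first search.
frontier.append("http://example.com/a")
next_url = frontier.popleft()

# LIFO: push new URLs at the front, pop from the front => depth-first search.
frontier.appendleft("http://example.com/b")
next_url = frontier.popleft()

# Heuristic ordering: a priority queue keyed on, e.g., in-degree or estimated
# PageRank gives a "focused crawler" (higher-scored URLs are fetched first).
scored_frontier = []
heapq.heappush(scored_frontier, (-0.85, "http://example.com/popular"))  # hypothetical scores
heapq.heappush(scored_frontier, (-0.10, "http://example.com/obscure"))
_, next_url = heapq.heappop(scored_frontier)   # highest-scored URL comes out first
```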

7.1.4.5. Avoid Page Duplication

A crawler must detect when it is revisiting a page that has already been crawled. Therefore, a crawler must efficiently index URLs as well as already-visited pages.

The web is a graph, not a tree.

  • determine if a URL has already been seen (see Representing URLs below)
    • store URL in standard format
    • develop a fast way to check if a URL has already been seen
  • determine if a new page has been seen
    • develop a fast way to determine if an identical or near-identical page was already indexed
  1. Must find all links in a page and extract their URLs
  2. Relative URLs must be completed, using the current page's URL or the <base> tag (see the sketch below)
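
A sketch of completing relative URLs with the standard-library `urljoin`; the page URL and `<base>` value are hypothetical.

```python
from urllib.parse import urljoin

page_url = "http://example.com/dir/page.html"   # hypothetical current page

# With no <base> tag, relative links resolve against the page's own URL.
print(urljoin(page_url, "images/logo.png"))     # http://example.com/dir/images/logo.png
print(urljoin(page_url, "/about.html"))         # http://example.com/about.html

# If the page declares <base href="http://example.com/other/">,
# that URL is used as the base instead.
base_href = "http://example.com/other/"
print(urljoin(base_href, "images/logo.png"))    # http://example.com/other/images/logo.png
```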

7.1.4.5.1.1. Two anomalies

  1. Some anchors don't have links
  2. Some anchors produce dynamic pages, which can lead to looping

7.1.4.5.1.2. Representing URLs

One Proposed Method

  1. First, hash on the host/domain name
  2. Then, use a trie data structure to determine if the path/resource is the same as one in the database

Trie for URL Exact Matching

Characteristics of tries

  • The same prefix is shared among multiple "words"
  • Each path from the root to a leaf corresponds to one "word"
  • An end-marker symbol $ at the ends of words avoids confusion between words with almost identical elements
  • For n URLs and maximum length k, time is $O(\log_k n)$
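
A minimal sketch of the trie idea above, using nested Python dicts and the `$` end marker; the paths are hypothetical, and a production crawler would use a more compact representation.

```python
END = "$"   # end-of-word marker, as described above

def trie_insert(root, word):
    """Insert a path/resource string into a nested-dict trie."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True    # mark that a complete word ends here

def trie_contains(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

# Hypothetical paths under one host (after hashing on the host/domain name).
trie = {}
trie_insert(trie, "/docs/index.html")
trie_insert(trie, "/docs/intro.html")        # shares the "/docs/i" prefix

print(trie_contains(trie, "/docs/index.html"))   # True
print(trie_contains(trie, "/docs/index"))        # False: no "$" marker here
```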

Another Proposed Method

URLs are sorted lexicographically and then stored as a delta-encoded text file

  • Each entry is stored as the difference (delta) between the current and previous URL; this substantially reduces storage
  • However, restoring the actual URL is slower
  • To improve speed, checkpointing (storing the full URL) is done periodically
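
A sketch of this delta (front-coding) idea: each sorted URL is stored as the length of the prefix it shares with the previous URL plus the remaining suffix. The URLs are hypothetical, and checkpointing is omitted.

```python
def delta_encode(sorted_urls):
    """Front-code a lexicographically sorted list of URLs."""
    encoded, previous = [], ""
    for url in sorted_urls:
        prefix = 0
        while prefix < min(len(url), len(previous)) and url[prefix] == previous[prefix]:
            prefix += 1
        encoded.append((prefix, url[prefix:]))   # (shared-prefix length, suffix)
        previous = url
    return encoded

def delta_decode(encoded):
    """Restore the full URLs by walking the deltas from the start."""
    urls, previous = [], ""
    for prefix, suffix in encoded:
        url = previous[:prefix] + suffix
        urls.append(url)
        previous = url
    return urls

urls = sorted([
    "http://example.com/a.html",
    "http://example.com/about.html",
    "http://example.com/blog/post1.html",
])
compact = delta_encode(urls)
assert delta_decode(compact) == urls
```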

7.1.4.5.1.3. Why is Normalizing URLs Important?

People can type a URL in different ways that all point to the same place, e.g. http://www.google.com and google.com.

Normalizing URLs (4 rules; see the sketch after the list)

  1. Convert scheme and host to lower case.
  2. Capitalize letters in escape sequences.
  3. Decode percent-encoded octets of unreserved characters.
  4. Remove the default port.
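
A sketch of the four rules in Python; it is simplified (it only rewrites escape sequences in the path) and is not a complete RFC 3986 normalizer.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def _fix_escape(match):
    """Upper-case the escape sequence, or decode it if the octet is unreserved."""
    char = chr(int(match.group(1), 16))
    if (char.isalnum() and char.isascii()) or char in "-._~":
        return char                      # rule 3: decode unreserved characters
    return "%" + match.group(1).upper()  # rule 2: capitalize escape sequences

def normalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                 # rule 1: lower-case scheme
    host = (parts.hostname or "").lower()         # rule 1: lower-case host
    port = parts.port
    if port is not None and DEFAULT_PORTS.get(scheme) != port:
        host = f"{host}:{port}"                   # rule 4: drop only the default port
    path = re.sub(r"%([0-9A-Fa-f]{2})", _fix_escape, parts.path)
    return urlunsplit((scheme, host, path, parts.query, parts.fragment))

print(normalize("HTTP://www.Example.com:80/%7euser/a%3db.html"))
# http://www.example.com/~user/a%3Db.html
```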

7.1.4.6. Avoid Spider Traps

A spider trap is when a crawler re-visits the same page over and over again. The most well-known spider trap is the one created by the use of session IDs.
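
One common defense is to strip session-ID query parameters before checking whether a URL has been seen. A sketch, where the parameter names are just commonly seen examples and would need tuning per site:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Commonly seen session-ID parameter names; the exact list is site-dependent.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def strip_session_id(url):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

a = strip_session_id("http://example.com/item?id=42&sessionid=AAA111")
b = strip_session_id("http://example.com/item?id=42&sessionid=BBB222")
print(a == b)   # True: both normalize to http://example.com/item?id=42
```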

7.1.4.7. Handling Spam

  1. First Generation : consisted of pages with a high number of repeated terms
  2. Second Generation : cloaking ==> return different pages to crawlers and to users
  3. Third Generation : doorway pages ==> contain highly ranked text and metadata

7.1.4.8. The Mercator Web Crawler

7.1.4.9. Measuring and Tuning a Crawler reduces to:

  • Improving URL parsing speed
  • Improving network bandwidth speed
  • Improving fault tolerance

7.1.4.10. DNS caching, pre-fetching and resolution

  • DNS caching : build a caching server that retains IP-domain name mappings previously discovered
  • Pre-fetching client : once the page is parsed, immediately make DNS resolution requests to the caching server; if unresolved, use UDP (User Datagram Protocol) to resolve from the DNS server.
  • Customize the crawler so it allows issuing of many resolution requests simultaneously; there should be many DNS resolvers.
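
A minimal sketch of the caching idea, using a plain in-memory dictionary in front of the system resolver; a real crawler would use a dedicated caching DNS server and asynchronous resolution.

```python
import socket

_dns_cache = {}   # host name -> IP address mappings discovered so far

def resolve(host):
    """Return the IP for a host, consulting the in-memory cache first."""
    if host not in _dns_cache:
        # Cache miss: fall back to the system resolver (blocking call).
        _dns_cache[host] = socket.gethostbyname(host)
    return _dns_cache[host]

# print(resolve("example.com"))
# print(resolve("example.com"))   # second lookup is served from the cache
```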

7.1.4.11. Multi-Threaded Crawling

One bottleneck is network delay in downloading individual pages.

It is best to have multiple threads running in parallel each requesting a page from a different host.

  • thread : a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler.
  • In most cases, a thread is a component of a process.
  • Multiple threads can exist within the same process and share resources.
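
A sketch of overlapping page downloads with a thread pool from Python's standard library; the URLs are hypothetical, and a real crawler would route each thread's requests through the politeness machinery.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    """Download one page; network wait dominates, so threads overlap well."""
    try:
        return url, urlopen(url, timeout=5).read()
    except OSError:
        return url, None

# Hypothetical URLs, ideally each on a different host.
urls = ["https://example.com/", "https://example.org/", "https://example.net/"]

with ThreadPoolExecutor(max_workers=3) as pool:
    for url, body in pool.map(fetch, urls):
        size = len(body) if body is not None else 0
        print(url, size, "bytes")
```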

7.1.4.12. Distributed Crawling Approach

Distribute URLs to threads to guarantee equitable distribution of requests across different hosts, maximizing throughput and avoiding overloading any single server.

  • How many crawlers should be running at any time?
    • Scenario 1: A centralized crawler controlling a set of parallel crawlers all running on a LAN
    • Scenario 2: A distributed set of crawlers running on widely distributed machines, with or without cross communication.

7.1.4.12.1. Distributed Model

Three strategies of Coordination of Distributed Crawling :

  • Independent : no coordination, every process follows its extracted links
  • Dynamic assignment : a central coordinator dynamically divides the web into small partitions and assigns each partition to a process
  • Static assignment : web is partitioned and assigned without a central coordinator before the crawl starts
    • inter-partition links can be handled in three ways, as follows :
      • Firewall mode
      • Cross-over mode
      • Exchange mode(communication required):
        • Batch communications : every process collects some URLs and sends them in a batch
        • Replication: the k most popular URLs are replicated at each process and not exchanged

7.1.4.12.1.1. Some ways to partition the Web:

  • URL-hash based : this yields many inter-partition links
  • Site-hash based : reduces the inter-partition links
  • Hierarchical : by TLD(Top-Level-Domain)
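
A sketch contrasting URL-hash and site-hash partitioning; the number of crawler processes and the hash choice (MD5) are arbitrary assumptions.

```python
import hashlib
from urllib.parse import urlsplit

NUM_CRAWLERS = 4   # hypothetical number of crawler processes

def url_hash_partition(url):
    """URL-hash based: pages of one site scatter across partitions,
    so most extracted links become inter-partition links."""
    digest = hashlib.md5(url.encode()).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS

def site_hash_partition(url):
    """Site-hash based: all pages of one host land in the same partition,
    so links within a site stay intra-partition."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS

print(site_hash_partition("http://example.com/a.html") ==
      site_hash_partition("http://example.com/b.html"))   # True: same host
```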

7.1.4.12.2. General Conclusions of Cho and Garcia-Molina

  • Firewall crawlers attain good, general coverage with low cost
  • Cross-over ensures 100% quality, but suffers from overlap
  • Replicating URLs and batch communication can reduce overhead

7.1.4.13. Freshness

How fresh are the pages in our database compared to the pages out on the live Web?

Web is very dynamic : many new pages, updated pages and deleted pages, etc.

7.1.4.13.1. Cho and Garcia-Molina Experiment

7.1.4.13.2. Implications for a Web Crawler

7.1.4.13.3. Change Frequency vs. Optimal Re-visiting

7.1.4.13.4. Cho and Garcia-Molina

Two simple re-visiting policies:

  • Uniform policy : re-visit all pages with the same frequency
  • Proportional policy : re-visit more often the pages that change more frequently

The uniform policy outperforms the proportional policy.

The explanation for this result comes from the fact that, when a page changes too often, the crawler will waste time by trying to re-crawl it too fast and still will not be able to keep its copy of the page fresh.

To improve freshness, we should penalize the elements that change too fast.

7.1.4.14. SiteMap

A sitemap is a list of pages of a web site accessible to crawlers.

  • XML is used as the standard for representing sitemaps; each entry contains <loc>, <lastmod>, <changefreq>, and <priority>.
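
A sketch of reading those four elements from a sitemap with the standard-library XML parser; the sitemap content is a made-up example.

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical sitemap for illustration.
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2024-01-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
for url in root.findall("sm:url", ns):
    print(url.findtext("sm:loc", namespaces=ns),
          url.findtext("sm:lastmod", namespaces=ns),
          url.findtext("sm:changefreq", namespaces=ns),
          url.findtext("sm:priority", namespaces=ns))
```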

7.1.4.15. URL Frontier Using the Mercator Scheme

  1. A kind of round-robin over hosts (for politeness)
  2. Give different probabilities to queues (for prioritization)
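
A rough sketch of these two ideas, assuming a simplified two-tier frontier loosely inspired by Mercator: prioritized front queues selected with different probabilities, and per-host back queues served round-robin for politeness. The class and its parameters are illustrative, not Mercator's actual design.

```python
import random
from collections import deque
from urllib.parse import urlsplit

class SimpleFrontier:
    """Two-tier URL frontier: prioritized front queues feed per-host
    back queues, which are served in round-robin order for politeness."""

    def __init__(self, priorities=(0.6, 0.3, 0.1)):
        self.priorities = priorities                 # selection probability per level
        self.front = [deque() for _ in priorities]   # one front queue per priority level
        self.back = {}                               # host -> deque of URLs
        self.hosts = deque()                         # round-robin order of hosts

    def add(self, url, level=0):
        self.front[level].append(url)

    def _move_one_to_back(self):
        # Pick a non-empty front queue with its assigned probability
        # ("give different probabilities"), then file the URL under its host.
        candidates = [i for i, q in enumerate(self.front) if q]
        if not candidates:
            return
        weights = [self.priorities[i] for i in candidates]
        level = random.choices(candidates, weights=weights)[0]
        url = self.front[level].popleft()
        host = urlsplit(url).netloc
        if host not in self.back:
            self.back[host] = deque()
            self.hosts.append(host)
        self.back[host].append(url)

    def next_url(self):
        self._move_one_to_back()
        # "Kind of round-robin": cycle through hosts so no server is hammered.
        for _ in range(len(self.hosts)):
            host = self.hosts[0]
            self.hosts.rotate(-1)
            if self.back[host]:
                return self.back[host].popleft()
        return None

frontier = SimpleFrontier()
frontier.add("http://example.com/important", level=0)   # high priority
frontier.add("http://example.org/minor", level=2)        # low priority
print(frontier.next_url())
```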
