7.1.12.1. Link Analysis Algorithms

  • Background on Citation Analysis
  • Google’s PageRank Algorithm
  • Simplified Explanation of PageRank
  • Examples of how PageRank algorithm works
  • Observations about PageRank algorithm
  • Importance of PageRank
  • Kleinberg’s HITS Algorithm

7.1.12.2. History of Link Analysis

Bibliometrics : is a set of methods to quantitatively analyze academic literature. Citation analysis and content analysis are commonly used bibliometric methods.

Citation analysis is the examination of the frequency, patterns, and graphs of citations in articles and books.

Impact factor (of a journal): frequency with which the average article in a journal has been cited in a particular year or period

  • The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to a paper published in J in year Y-1 or Y-2.

Bibliographic coupling: two papers that cite many of the same papers

Co-citation: two papers that were cited by many of the same papers

7.1.12.3. Citations vs. Web Links

Web links are a bit different than citations:

  • Many links are navigational
  • Many pages with high in-degree are portals not content providers
  • Not all links are endorsements
  • Company websites normally don’t point to their competitors
  • Citations to relevant literature is enforced by peer-review

7.1.12.4. What is PageRank?

PageRank is a web link analysis algorithm used by Google

  • PageRank was developed at Stanford University by Google founders Larry Page and Sergey Brin
    • The paper describing pagerank was co-authored by Rajeev Motwani and Terry Winograd
  • PageRank is a “vote”, by all the other pages on the Web, about how important a page is.

  • A link to a page counts as a vote of support.

  • If there’s no link there’s no support (but it’s an abstention from voting rather than a vote against the page).
  • PageRank says nothing about the content or size of a page, the language it’s written in, or the text used in the anchor of a link!
  • Looked at another way, PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page

7.1.12.5. PageRank Algorithm

We calculate PR iteratively a number of times until we start converging to the same value.

Number of iterations required for convergence is empirically (but not formally derived) O(log n) (where n is the number of links)

The damping factor value and its effect:

  • If too high, more iterations are required
  • If too low, you get repeated over-shoot, – Both above and below the average – the numbers just swing like pendulum and never settle down.

  • Both above and below the average – the numbers just swing like pendulum and never settle down.

7.1.12.6. Example

Some Suggestions Based on What We Have Seen in Example :

Suggestions for improving your page rank

  • Increasing the internal links in your site can minimize the damage to your PR when you give away votes by linking to external sites.
  • If a particular page is highly important use a hierarchical structure with the important page at the “top”.
  • Where a group of pages may contain outward links – increase the number of internal links to retain as much PR as possible.
  • Where a group of pages do not contain outward links – the number of internal links in the site has no effect on the site’s average PR. You might as well use a link structure that gives the user the best navigational experience.
  • Use Site Maps: Linking to a site map on each page increases the number of internal links in the site, spreading the PR out and protecting you against your vote “donations” to other sites

7.1.12.7. Importance of PageRank

  • PageRank is just one factor Google uses to determine a page’s relevance. It assumes that people will link to your page only if they think the page is good. But this is not always true.

  • Content is still the king!!!

    • Anchor, body, title tags etc. still are very important for search engines
  • PageRank is a multiplier factor.

    • If Google wants to penalize any page they can set the PageRank equal to zero. It will be last on the search results page.

7.1.12.8. Disavowing Backlinks

Google actually provides users with a way to “disavow” backlinks

  • That is, to inform Google you do NOT want certain links pointing to your pages to be counted in their PageRank computation

7.1.12.9. Another Link Analysis Algorithm – HITS by Kleinberg

HITS stands for “Hyperlink-Induced Topic Search

The algorithm is based on the observations that

  • Some pages known as hubs serve as large directories of information on a given topic
  • Some pages known as authorities serve as the page(s) that best describe the information on a given topic

The algorithm begins by retrieving a set of results to a search query, The HITS algorithm is performed ONLY on the resulting set

The algorithm goes through a set of iterations where

  • an authority value is computed as the sum of the scaled hub values that point to that page;
  • and a hub value is the sum of the scaled authority values of the pages it points to

7.1.12.9.1. Authorities

Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic.

In-degree (number of pointers to a page) is one simple measure of authority.

However in-degree treats all links as equal.

7.1.12.9.2. Hubs

Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities).

7.1.12.9.3. HITS Algorithm

7.1.12.10. Base Limitations

results matching ""

    No results matching ""