7.1.2.1. Web Trends

  • More and more people now access search engines through their mobile devices.
  • Mobile Internet access has surpassed PC access in China.
  • Tablet growth is actually more rapid than smartphone growth.

7.1.2.1.1. Measuring the Web

7.1.2.1.2. Some Web Facts

7.1.2.1.2.1. Number of Websites

1.760 Billion

7.1.2.1.2.2. Page Content Category

holding page:

7.1.2.1.2.3. Web Page Language Diversity

7.1.2.1.2.4. Static Pages: Rate of Change

7.1.2.1.2.5. Complexity of Data Types

A search engine has to handle many content types, translating or extracting each one into a format it can process. The best-known tool for this in the open-source community is the Apache Tika toolkit, which is used alongside Lucene and Solr.

Tika automatically detects the language and extracts the metadata and whatever text content exists, using various parsing libraries.

N-grams are used to identify which language a piece of text belongs to.
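The n-gram idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Tika implements it: the function names (`char_ngrams`, `cosine`, `guess_language`) and the tiny one-sentence training samples are made up for this example, and a real detector would build its profiles from far more text. The sketch counts character trigrams in the input and picks the language whose trigram profile is most similar.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams, padding with spaces so word edges count too."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical, deliberately tiny training samples (one sentence per language).
SAMPLES = {
    "en": "the web is a large collection of pages and the search engine has to read all of them",
    "de": "das netz ist eine sammlung von seiten und die suchmaschine muss sie alle lesen",
    "es": "la red es una gran lista de sitios y el buscador tiene que leer todos ellos",
}
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def guess_language(text):
    """Return the language code whose trigram profile best matches the text."""
    grams = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(grams, PROFILES[lang]))


if __name__ == "__main__":
    for sentence in ("the engine reads the pages",
                     "die suchmaschine und das netz",
                     "el buscador y la red"):
        print(sentence, "->", guess_language(sentence))
```

Character trigrams work here because short sequences such as "the" or "die" are far more frequent in one language than in the others, so even a short query shares most of its trigrams with the right profile.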

7.1.2.1.2.6. Content Types Indexed by Google

7.1.2.1.3. Web Characteristics

  • Significant duplication
  • High Linkage
  • Complex graph topology
  • Spam

7.1.2.1.4. Graph Structure in the Web (an old study)

7.1.2.1.5. Manual Hierarchical Web Taxonomies (Yahoo)

7.1.2.1.6. Open Directory Project (DMoz)

7.1.2.1.6.1. Drilling Down by Category

7.1.2.1.7. Internet Archive

Wayback Machine
