7.1.9.1. Inverted Indexing
- Definition of an Inverted index
- Examples of Inverted Indices
- Representing an Inverted Index
- Processing a Query on a Linked Inverted Index
- Skip Pointers to Improve Merging
- Phrase Queries
- Bi-words
- Grammatical Tagging
- N-Grams
- Distributed Indexing
7.1.9.2. Creating an Inverted Index
An inverted index is typically composed of a vector containing all distinct words of the text collection in lexicographical order (called the vocabulary) and, for each word in the vocabulary, a list of all documents (and the text positions) in which that word occurs.
Terms in the inverted file index may be refined (see the sketch after this list):
- Case folding: converting all uppercase letters to lower case
- Stemming: reducing words to their morphological roots
- Stop words: removing words so common that they carry almost no information for retrieval
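A minimal sketch of these steps, assuming a toy corpus held in memory as a dict of {doc_id: text}; the tokenizer and stop-word list below are illustrative placeholders, and stemming is omitted for brevity:

```python
# A minimal inverted index builder with case folding and stop-word
# removal; the stop-word list is illustrative, not a standard one.
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in"}  # illustrative only

def tokenize(text):
    # Case folding happens here; a real system would also apply stemming
    # (e.g. a Porter stemmer), which is omitted to keep the sketch short.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(corpus):
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in tokenize(text):
            if term not in STOP_WORDS:
                index[term].add(doc_id)
    # Vocabulary in lexicographical order, postings sorted by doc id.
    return {term: sorted(docs) for term, docs in sorted(index.items())}

corpus = {1: "To be or not to be", 2: "Be the change", 3: "Not a question"}
print(build_inverted_index(corpus))
# {'be': [1, 2], 'change': [2], 'not': [1, 3], 'or': [1], 'question': [3]}
```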
7.1.9.2.1. Term-Document Incidence Matrix
One way to think about an inverted index is as a sparse binary matrix in which rows represent terms, columns represent documents, and an entry is 1 exactly when the term occurs in the document.
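As a toy illustration of this view (the corpus and query here are invented), each term maps to a row of 0/1 incidence bits, and a boolean AND query becomes an element-wise AND of two rows:

```python
# A sketch of the term-document incidence view: one row of bits per
# term, one column per document.
docs = ["to be or not to be", "be the change", "not a question"]
vocab = sorted({t for d in docs for t in d.split()})
matrix = {t: [1 if t in d.split() else 0 for d in docs] for t in vocab}

# Documents matching the boolean query "be AND not":
hits = [i for i, (x, y) in enumerate(zip(matrix["be"], matrix["not"])) if x and y]
print(hits)  # [0] -- only the first document contains both terms
```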
7.1.9.2.2. Term-Document Incidence Matrices are Naturally Sparse
- For each term T, we must store a list of all documents that contain T.
- Linked lists are generally preferred to arrays (see the sketch after this list):
  - dynamic space allocation
  - insertion of new documents into a term's postings list is easy
  - there is a space overhead for the pointers, though this is not too serious
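A minimal linked-list postings sketch along these lines; the class and function names are illustrative, not from any particular IR system:

```python
# Each node stores a doc id plus a pointer to the next posting; the
# pointer is exactly the space overhead noted above.
class PostingNode:
    __slots__ = ("doc_id", "next")
    def __init__(self, doc_id, nxt=None):
        self.doc_id = doc_id
        self.next = nxt

def insert_sorted(head, doc_id):
    # Dynamic insertion keeps the list sorted by doc id, with no
    # reallocation or shifting as an array would require.
    if head is None or doc_id < head.doc_id:
        return PostingNode(doc_id, head)
    node = head
    while node.next is not None and node.next.doc_id < doc_id:
        node = node.next
    if node.doc_id != doc_id and (node.next is None or node.next.doc_id != doc_id):
        node.next = PostingNode(doc_id, node.next)  # skip duplicates
    return head
```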
7.1.9.3. Query Processing Across the Postings List
- For each term in the query, retrieve its postings.
- Merge the postings lists to find the documents common to all terms.
- Query optimization: sort the query terms by increasing document frequency and process them in that order:
  - start with the smallest postings set, then keep cutting the intermediate result further.
The algorithm for the intersection of two postings lists p1 and p2 is a linear merge, sketched below.
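A sketch of that merge in Python, together with the frequency-ordering optimization described above; postings lists are assumed here to be sorted lists of document ids:

```python
# The standard two-pointer merge for intersecting two sorted postings
# lists, running in O(len(p1) + len(p2)) time.
def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# The optimization above: intersect terms in order of increasing
# document frequency so intermediate results stay small.
def intersect_query(index, terms):
    if any(t not in index for t in terms):
        return []  # any missing term makes the AND result empty
    postings = sorted((index[t] for t in terms), key=len)
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)
        if not result:
            break  # early exit: the intersection is already empty
    return result
```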
7.1.9.3.1. Speeding Up the Merging of Postings with Skip Pointers
- Skip pointers are added at indexing time.
- They are shortcuts: they help only for AND queries, and they are most useful when the corpus is relatively static, since pointer placement depends on the length of the postings list.
- Where should they be placed? There is a trade-off:
  - More skips mean shorter skip spans, so we are more likely to be able to skip; but they also mean many comparisons against skip pointers and much space spent storing them.
  - Fewer skips mean fewer pointer comparisons, but longer skip spans, and hence fewer opportunities to skip.
- The solution: a simple heuristic that has been found to work well in practice is, for a postings list of length P, to use √P evenly spaced skip pointers. This heuristic can probably be improved upon, since it ignores the distribution of query terms [Moffat and Zobel 1996].
- How do the merge algorithms change? (see the sketch below)
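One possible answer, sketched under the assumption that postings are sorted lists and that skip pointers live in a side table mapping a position to the position it may jump to (√P evenly spaced, as above):

```python
import math

# Skip pointers are precomputed at indexing time: with sqrt(P) evenly
# spaced pointers, each skip spans about sqrt(P) postings.
def build_skips(postings):
    step = int(math.sqrt(len(postings))) or 1
    return {i: min(i + step, len(postings) - 1)
            for i in range(0, len(postings) - 1, step)}

def intersect_with_skips(p1, p2):
    s1, s2 = build_skips(p1), build_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow skips on p1 while the target does not overshoot
            # p2[j]; otherwise advance one posting at a time.
            if i in s1 and p1[s1[i]] <= p2[j]:
                while i in s1 and p1[s1[i]] <= p2[j]:
                    i = s1[i]
            else:
                i += 1
        else:
            if j in s2 and p2[s2[j]] <= p1[i]:
                while j in s2 and p2[s2[j]] <= p1[i]:
                    j = s2[j]
            else:
                j += 1
    return answer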
7.1.9.4. Phrase Queries
- Bi-word indexes
- Positional indexes
7.1.9.4.1. Bi-word indexes
A bi-word (or 2-gram) is a consecutive pair of terms in some text.
To improve phrase searching, one approach is to index every bi-word in the text.
Consequences:
- Bi-words will cause an explosion in the vocabulary database
- Queries longer than two words will have to be broken into bi-word segments, which can produce false positives: documents that contain every bi-word segment but not the full phrase (e.g. all bi-words of a 4-word query without the query itself)
A bi-word index that is extended to longer, or even variable-length, sequences is called a phrase index (a small sketch follows).
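A small sketch of a bi-word index and of answering a longer phrase query by intersecting its bi-word segments; the corpus and names are invented, and the result is only a candidate set, subject to the false positives noted above:

```python
# Every consecutive pair of terms becomes a single vocabulary entry.
from collections import defaultdict

def build_biword_index(corpus):
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        terms = text.lower().split()
        for a, b in zip(terms, terms[1:]):
            index[(a, b)].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    # Break a longer query into bi-word segments and intersect; the
    # result can contain false positives, as noted above.
    terms = phrase.lower().split()
    docs = [index.get(seg, set()) for seg in zip(terms, terms[1:])]
    return set.intersection(*docs) if docs else set()

index = build_biword_index({1: "stanford university palo alto", 2: "palo alto"})
print(phrase_candidates(index, "stanford university"))  # {1}
```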
7.1.9.4.2. Positional Indexes
Store, for each term in the index, entries of the form:
<term, number of docs containing term;
doc1: position1, position2, position3 ... ;
doc2: position1, position2 ... ;
etc.>
i.e., for each occurrence of the term in a document, its position is stored.
Adopting a positional index expands required postings storage significantly, even if we compress position values/offsets
7.1.9.4.2.1. Algorithm for Matching a Phrase Query
- Extract the inverted index entries for each distinct term: e.g. to, be, or, not
- Merge their doc:position lists to enumerate all positions of “to be or not to be”.
- Match those documents that contain the terms in adjacent positions (see the sketch below).
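A sketch of these steps, assuming a positional index shaped as term -> {doc_id: sorted positions}; the structure and names here are illustrative:

```python
# Build a positional index and match a phrase against it.
from collections import defaultdict

def build_positional_index(corpus):
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in corpus.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_match(index, phrase):
    terms = phrase.lower().split()
    if any(t not in index for t in terms):
        return set()
    # First, the documents containing every term.
    docs = set.intersection(*(set(index[t]) for t in terms))
    hits = set()
    for d in docs:
        # The phrase starts at p iff term k occurs at position p + k.
        starts = set(index[terms[0]][d])
        for k, t in enumerate(terms[1:], start=1):
            starts &= {p - k for p in index[t][d]}
        if starts:
            hits.add(d)
    return hits

idx = build_positional_index({1: "to be or not to be", 2: "not to be"})
print(phrase_match(idx, "to be or not to be"))  # {1}
```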
7.1.9.5. Grammatical Tagging
7.1.9.6. Google Relies Upon n-gram Indexes
- Google has investigated the use of n-grams stored in its index
- N-grams of all lengths form a Zipf (power-law) distribution, with a few common phrases occurring very frequently and a large number occurring only once
- Google has publicly released the set of n-grams it has determined (a toy n-gram count is sketched below)
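As a toy illustration of n-gram extraction and the skewed counts mentioned above (this is illustrative only, not Google's pipeline):

```python
# Extract and count n-grams from a token sequence.
from collections import Counter

def ngrams(tokens, n):
    return zip(*(tokens[i:] for i in range(n)))

tokens = "to be or not to be that is the question".split()
counts = Counter(ngrams(tokens, 2))
# Sorting by frequency exposes the Zipf-like shape: a few frequent
# n-grams, then a long tail of n-grams seen only once.
print(counts.most_common(3))
# [(('to', 'be'), 2), (('be', 'or'), 1), (('or', 'not'), 1)]
```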