7.1.6.1. Information Retrieval
7.1.6.1.1. What is Information Retrieval?
Information retrieval (IR) has been a computer science subject for many decades
- Traditionally it deals with the indexing of a given set of textual documents and the retrieval of relevant documents given a query
- Relatively recently searching for pages on the World Wide Web has become the “killer app.”
There has been a great deal of research on
- How to index a set of documents (a corpus)
- How to efficiently retrieve relevant documents
7.1.6.2. IR History
- 2000's
- Link analysis for Web Search
- Extension to retrieval of multimedia: images, music, video
- Question Answering : return an actual answer rather than a ranked list of documents
7.1.6.3. Areas Related to But Different than Information Retrieval
- Database Management
- Library and Information Science
- Artificial Intelligence
- Focused on the representation of knowledge, reasoning, and intelligent action.
- Formalisms for representing knowledge and queries:
- First-order Predicate Logic
- Bayesian Networks
- Recent work on web ontologies and intelligent information agents brings it closer to IR
- Natural Language Processing
- Machine Learning
- Focused on the development of computational systems that improve their performance with experience
- Machine learning is distinct from data mining, which focuses on the discovery of previously unknown properties of the given data
7.1.6.4. A More Detailed IR Architecture
7.1.6.5. Defining Terms
Parsing forms index words (tokens) and includes:
- Stopword removal
- See http://www.ranks.nl/tools/stopwords.html for google stopwords
- Stemming: reducing a word to its root
- Indexing constructs an inverted index of word to document pointers.
- Searching retrieves documents that contain a given query token from the inverted index。
- Ranking scores all retrieved documents according to a relevance metric.
7.1.6.6. Retrieval Models
A retrieval model specifies the details of:
- Document representation
- Query representation
- Retrieval function
Determines a notion of relevance :
- Notion of relevance can be binary or continuous (i.e. ranked retrieval)
Three major Information Retrieval Models are:
- Boolean models (set theoretic)
- Vector space models (statistical/algebraic) + Probabilistic models
7.1.6.6.1. Common Pre-Processing Steps
- Strip unwanted characters/markup (e.g. HTML tags, punctuation, numbers, etc.).
- Break into tokens (keywords) or whitespace.
- Stem tokens to “root” words
.
- computational ==> comput
- Remove common stopwords (e.g. a, the, it, etc.).
- Detect common phrases (possibly using a domain specific dictionary).
- Build inverted index (keyword ==> list of docs containing it).
7.1.6.6.2. Boolean Model
A document is represented as a set of keywords.
Queries are Boolean expressions of keywords
- Pros:
- Easy to understand for simple queries.
- Clean formalism.
- Boolean models can be extended to include ranking ?
- Reasonably efficient implementations possible for normal queries.
- Cons:
- Very rigid: AND means all; OR means any
- Difficult to express complex user requests
- Difficult to control the number of documents retrieved
- All matched documents will be returned
- Difficult to rank output
- All matched documents logically satisfy the query
- Difficult to perform relevance feedback
- If a document is identified by the user as relevant or irrelevant, how should the query be modified?
7.1.6.6.3. The Vector-Space Model
Assume t
distinct terms remain after preprocessing; call them index terms or the vocabulary
These “orthogonal” terms form a vector space :
A document is represented by a vector of index terms
Both documents and queries are expressed as t-dimensional
vectors
A collection of n
documents can be represented in the vector space model by a term-document matrix
.
An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document; but we still need a way to compute the weight
7.1.6.6.3.1. Term Weights: Term Frequency
Assumption: the more frequent terms in a document are more important, i.e. more indicative of the topic.
May want to normalize term frequency (tf) across the entire corpus:
7.1.6.6.3.2. Term Weights: Inverse Document Frequency
Terms that appear in many different documents are less indicative of overall topic
of course is always <= N (total number of documents)
An indication of a term’s discrimination power
Log is used to dampen the effect relative to df
7.1.6.6.3.3. TF-IDF Weighting
- increase with the number of occurrances within a document
- increase with the rarity of the term in the collection
- A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
Many other ways of determining term weights have been proposed. • Experimentally, has been found to work well
Given a query q
, then we score the query against a document d
using the formula
7.1.6.6.3.4. Cosine Similarity
Distance between vectors and is captured by the cosine of the angle between them.
A similarity measure is a function that computes the degree of similarity between two vectors
Using a similarity measure between the query and each document has positive aspects:
- It is possible to rank the retrieved documents in the order of presumed relevance.
- It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.
7.1.6.6.3.5. Normalized Vectors
A vector can be normalized (given a length of 1) by dividing each of its components by the vector's length
- This maps vectors onto the unit circle:
- Then,
- Longer documents don’t get more
- For normalized vectors, the cosine is simply the dot product:
7.1.6.6.3.6. Similarity Measure for Document and Query
Similarity between vectors for the document and query can be computed as the vector inner product:
7.1.6.6.3.7. Limitations of the Inner Product
- Favors long documents with a large number of unique terms.
- Measures how many terms matched but not how many terms are not matched.
7.1.6.7. Naïve Implementation
7.1.6.8. Efficient Cosine Ranking
- Ranking consists of computing the k docs in the corpus “nearest” to the query ⇒ k largest query-doc cosines.
- To do efficient ranking one must:
- Compute a single cosine efficiently.
- Choose the k largest cosine values efficiently.
7.1.6.9. Summary of Algorithm for Vector Space Ranking
- Represent the query as a weighted tf.idf vector
- represent each document as a weighted tf.idf vector
- compute the cosine similarity score for the query vector and each document vector
- Rank documents with respect to the query by score
- Return the top K (e.g. k=10) to the user
7.1.6.9.1. Use Heap for Selecting Top k
Heap : Binary tree in which each node’s value > values of children
Takes 2n
operations to construct, then each of “winners” read off in steps.
7.1.6.9.2. Bottleneck
- Still need to first compute cosines from query to each of n docs → several seconds for n = 1M
- Can select from only non-zero cosines
- Need union of postings lists accumulators (<<1M)
- Can further limit to documents with non-zero cosines on rare (high idf) words
- Enforce conjunctive search (a la Google): non-zero cosines on all words in query
- But still potentially expensive
7.1.6.9.3. A Pre-Processing Strategy
- Preprocess: Pre-compute, for each term, its k nearest docs.
- (Treat each term as a 1-term query)
- lots of preprocessing
- Result: “preferred list” for each term.
- Search:
- For a t-term query, take the union of their t preferred lists – call this set S.
- Compute cosines from the query to only the docs in S, and choose top k.
7.1.6.9.4. Comments on Vector Space Model
- Simple, mathematically based approach.
- Considers both local (tf) and global (idf) word occurrence frequencies.
- Provides partial matching and ranked results.
- Tends to work quite well in practice despite obvious weaknesses.
- Allows efficient implementation for large document collections.
7.1.6.9.4.1. Problems with Vector Space Model
- Missing semantic information (e.g. word sense).
- Missing syntactic information (e.g. phrase structure, word order, proximity information).
- Assumption of term independence
- Lacks the control of a Boolean model (e.g., requiring a
term to appear in a document).
- Given a two-term query “A B”, may prefer a document containing A frequently but not B, over a document that contains both A and B, but both less frequently.