
Vector Space Model


Motivation

When you want to find information with a search engine, you first have to formulate a query. Unfortunately, since you rarely know exactly what you are looking for, the query tends to be ambiguous and imprecise. Search engines therefore return results as a ranked list rather than as a single exact answer.


Intuition

In order to produce a ranked list, you need to calculate the similarity between the query and each document based on the terms (words) they contain.
One way to calculate this similarity is the dot product in a vector space.


In the vector space, documents are placed along dimensions that correspond to words.
In the figure, d2 is ranked first because, visually, it is the closest to the query vector.
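As a concrete illustration, here is a minimal Python sketch of dot-product ranking; the toy query and documents are invented for this example and do not come from the original figure.

# Minimal sketch: rank documents by dot-product similarity in a term vector space.
# The query and document texts below are toy examples (an assumption).

def to_term_counts(text):
    """Map a text to a {term: count} vector, one dimension per distinct word."""
    counts = {}
    for term in text.lower().split():
        counts[term] = counts.get(term, 0) + 1
    return counts

def dot_product(q_vec, d_vec):
    """Similarity = sum over shared terms of query count * document count."""
    return sum(c * d_vec.get(term, 0) for term, c in q_vec.items())

query = to_term_counts("presidential campaign news")
docs = {
    "d1": to_term_counts("news about the stock market"),
    "d2": to_term_counts("news about the presidential campaign"),
    "d3": to_term_counts("presidential candidate gives speech"),
}

# Sort document names by similarity to the query, highest first.
ranked = sorted(docs, key=lambda name: dot_product(query, docs[name]), reverse=True)
print(ranked)  # d2 comes first: it shares the most terms with the query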


Problem


How do we place those vectors in the space reasonably and fairly?

- How do we define the dimensions?
- How do we place a document vector?
- How do we place a query vector?
- How do we measure similarity?


Consideration

1. The frequency of each query term in the document


First, the score increases in proportion to the frequency of a term.
However, we also want to penalize a term that occurs very many times, so the raw count is passed through a bounded TF transformation:
y = (k+1)x / (x+k)
As x grows, y approaches k+1. By adjusting the value of k, you can control this upper bound, which is useful for limiting the influence of any particular term; the sketch below illustrates it.
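A small Python sketch of this transformation, with arbitrary k values chosen only for illustration:

# Bounded TF transformation y = (k+1)*x / (x + k).
# As the raw count x grows, y approaches the upper bound k+1,
# so no single term can dominate the score.

def tf_bounded(x, k):
    return (k + 1) * x / (x + k)

for k in (0.5, 1.2, 10):  # arbitrary illustrative values of k
    row = [round(tf_bounded(x, k), 2) for x in (1, 2, 5, 100)]
    print("k =", k, "->", row)  # each row saturates near k+1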

2. The importance of each query term

Terms such as 'the', 'or', and 'about' occur in almost every document, so their importance should be low even when they are frequent within a document.
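For instance (numbers invented for illustration), with M = 1000 documents and the IDF weighting log((M+1)/df(w)) introduced below, a word appearing in df = 990 documents gets weight log(1001/990) ≈ 0.011, while a word appearing in df = 10 documents gets log(1001/10) ≈ 4.6, so rare terms carry far more weight.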

3. The length of the document


If a document is longer than the average document length, it is penalized; a shorter one is rewarded.
By adjusting the value of b, you can control the degree of length normalization.
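For example (the b value is assumed for illustration), with b = 0.75 the normalizer 1 - b + b*|d|/avdl introduced below equals 1.75 for a document twice the average length, dividing its term scores down as a penalty, and 0.625 for a document half the average length, boosting them as a reward.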

4. The order of the query terms
Because the probabilistic model assumes that words are independent of one another, the order of terms is irrelevant.
However, ...

Robustness

+ Pivoted Length Normalization VSM [Singhal et al 96]


1. Term Frequency Weighting: c(w,q) & c(w,d)
The more frequent a word is in the document, the more important it is.

2. TF Transformation (sub-linear transformation): ln[1+ln[1+c(w,d)]] rather than c(w,d)
This prevents a document from getting a high score just because a single word occurs very many times.

3. IDF (Inverse Document Frequency) Weighting: log((M+1)/df(w)), where M is the total number of documents and df(w) is the number of documents containing w
Which is the more informative word, 'about' or 'presidential'?
IDF penalizes popular terms.

4. Document Length Normalization (Pivoted Length Normalization): 1 - b + b*|d|/avdl, where |d| is the document length and avdl is the average document length
In fact, if two documents contain the same matched words, the shorter document is statistically more likely to be relevant.
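Putting the four components together, here is a minimal Python sketch of the pivoted length normalization score; the function signature and the default b = 0.75 are assumptions for illustration:

import math

# Sketch of the Pivoted Length Normalization VSM score [Singhal et al 96]:
#   score(q,d) = sum over matched terms w of
#     c(w,q) * ln(1+ln(1+c(w,d))) / (1 - b + b*|d|/avdl) * log((M+1)/df(w))

def pivoted_score(query_counts, doc_counts, doc_len, avdl, df, M, b=0.75):
    norm = 1 - b + b * doc_len / avdl            # pivoted length normalization
    score = 0.0
    for w, cwq in query_counts.items():
        cwd = doc_counts.get(w, 0)
        if cwd == 0:
            continue                             # only matched terms contribute
        tf = math.log(1 + math.log(1 + cwd))     # sub-linear TF transformation
        idf = math.log((M + 1) / df[w])          # IDF penalizes popular terms
        score += cwq * tf / norm * idf
    return score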



+ BM25/Okapi [Robertson & Walker 94]

1. ...
2. TF Transformation (sub-linear transformation with upper bound): (k+1)*c(w,d) / (c(w,d)+k) rather than c(w,d)
3. ...
4. ...
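For comparison, here is a sketch of the BM25 score in the simplified form these notes follow, with the length normalization folded into the TF denominator; k = 1.2 and b = 0.75 are common defaults, assumed here for illustration:

import math

# Sketch of BM25/Okapi [Robertson & Walker 94], simplified form:
#   score(q,d) = sum over matched terms w of
#     c(w,q) * (k+1)*c(w,d) / (c(w,d) + k*(1 - b + b*|d|/avdl)) * log((M+1)/df(w))

def bm25_score(query_counts, doc_counts, doc_len, avdl, df, M, k=1.2, b=0.75):
    norm = 1 - b + b * doc_len / avdl              # pivoted length normalization
    score = 0.0
    for w, cwq in query_counts.items():
        cwd = doc_counts.get(w, 0)
        if cwd == 0:
            continue
        tf = (k + 1) * cwd / (cwd + k * norm)      # bounded TF: saturates at k+1
        idf = math.log((M + 1) / df[w])            # same simplified IDF as above
        score += cwq * tf * idf
    return score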




Reference
[1] ChengXiang Zhai, Text Retrieval and Search Engines, Coursera, University of Illinois at Urbana-Champaign.
[2] A. Singhal, C. Buckley, and M. Mitra, "Pivoted Document Length Normalization," SIGIR 1996.
[3] S. E. Robertson and S. Walker, "Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval," SIGIR 1994.

