
Vector Space Model


Motivation

When you want to find information with a search engine, you first have to formulate a query. Unfortunately, since you rarely know exactly what you are looking for, the query tends to be ambiguous and imprecise. Search engines therefore return results as a ranked list rather than as a single exact answer.


Intuition

In order to produce a ranked list, you need to calculate the similarity between the query and each document based on the terms (words) they contain.
One way to calculate this similarity is the dot product in a vector space.


In the vector space, documents are placed along dimensions that correspond to words.
In the figure, d2 is ranked first because, visually, it is the closest to the query vector.
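As a concrete illustration, here is a minimal Python sketch of dot-product ranking; the toy query and documents are invented for this example and do not come from the original figure.

# Minimal sketch: rank documents by dot-product similarity in a term vector space.
# The query and document texts below are toy examples (an assumption).

def to_term_counts(text):
    """Map a text to a {term: count} vector, one dimension per distinct word."""
    counts = {}
    for term in text.lower().split():
        counts[term] = counts.get(term, 0) + 1
    return counts

def dot_product(q_vec, d_vec):
    """Similarity = sum over shared terms of query count * document count."""
    return sum(c * d_vec.get(term, 0) for term, c in q_vec.items())

query = to_term_counts("presidential campaign news")
docs = {
    "d1": to_term_counts("news about the stock market"),
    "d2": to_term_counts("news about the presidential campaign"),
    "d3": to_term_counts("presidential candidate gives speech"),
}

# Sort document names by similarity to the query, highest first.
ranked = sorted(docs, key=lambda name: dot_product(query, docs[name]), reverse=True)
print(ranked)  # d2 comes first: it shares the most terms with the query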


Problem


How do we place those vectors in the space reasonably and fairly?

- How do we define the dimensions?
- How do we place a document vector?
- How do we place a query vector?
- How do we measure similarity?


Consideration

1. The frequency of each query term in the document


First, the score increases in proportion to the frequency of a term.
However, we also want to penalize a term that occurs very many times, so the raw count is passed through a bounded TF transformation:
y = (k+1)x / (x+k)
As x grows, y approaches k+1. By adjusting the value of k, you can control this upper bound, which is useful for limiting the influence of any particular term; the sketch below illustrates it.
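A small Python sketch of this transformation, with arbitrary k values chosen only for illustration:

# Bounded TF transformation y = (k+1)*x / (x + k).
# As the raw count x grows, y approaches the upper bound k+1,
# so no single term can dominate the score.

def tf_bounded(x, k):
    return (k + 1) * x / (x + k)

for k in (0.5, 1.2, 10):  # arbitrary illustrative values of k
    row = [round(tf_bounded(x, k), 2) for x in (1, 2, 5, 100)]
    print("k =", k, "->", row)  # each row saturates near k+1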

2. The importance of each query term

Terms such as 'the', 'or', and 'about' occur in almost every document, so their importance should be low even when they are frequent within a document.
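For instance (numbers invented for illustration), with M = 1000 documents and the IDF weighting log((M+1)/df(w)) introduced below, a word appearing in df = 990 documents gets weight log(1001/990) ≈ 0.011, while a word appearing in df = 10 documents gets log(1001/10) ≈ 4.6, so rare terms carry far more weight.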

3. The length of the document


If a document is longer than the average document length, it is penalized; a shorter one is rewarded.
By adjusting the value of b, you can control the degree of length normalization.
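For example (the b value is assumed for illustration), with b = 0.75 the normalizer 1 - b + b*|d|/avdl introduced below equals 1.75 for a document twice the average length, dividing its term scores down as a penalty, and 0.625 for a document half the average length, boosting them as a reward.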

4. The order of the query terms
Because the probabilistic model assumes that words are independent of one another, the order of terms is irrelevant.
However, ...

Robustness

+ Pivoted Length Normalization VSM [Singhal et al 96]


1. Term Frequency Weighting: c(w,q) & c(w,d)
The more frequent a word is in the document, the more important it is.

2. TF Transformation (sub-linear transformation): ln[1+ln[1+c(w,d)]] rather than c(w,d)
This prevents a document from getting a high score just because a single word occurs very many times.

3. IDF (Inverse Document Frequency) Weighting: log((M+1)/df(w)), where M is the total number of documents and df(w) is the number of documents containing w
Which is the more informative word, 'about' or 'presidential'?
IDF penalizes popular terms.

4. Document Length Normalization (Pivoted Length Normalization): 1 - b + b*|d|/avdl, where |d| is the document length and avdl is the average document length
In fact, if two documents contain the same matched words, the shorter document is statistically more likely to be relevant.
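Putting the four components together, here is a minimal Python sketch of the pivoted length normalization score; the function signature and the default b = 0.75 are assumptions for illustration:

import math

# Sketch of the Pivoted Length Normalization VSM score [Singhal et al 96]:
#   score(q,d) = sum over matched terms w of
#     c(w,q) * ln(1+ln(1+c(w,d))) / (1 - b + b*|d|/avdl) * log((M+1)/df(w))

def pivoted_score(query_counts, doc_counts, doc_len, avdl, df, M, b=0.75):
    norm = 1 - b + b * doc_len / avdl            # pivoted length normalization
    score = 0.0
    for w, cwq in query_counts.items():
        cwd = doc_counts.get(w, 0)
        if cwd == 0:
            continue                             # only matched terms contribute
        tf = math.log(1 + math.log(1 + cwd))     # sub-linear TF transformation
        idf = math.log((M + 1) / df[w])          # IDF penalizes popular terms
        score += cwq * tf / norm * idf
    return score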



+ BM25/Okapi [Robertson & Walker 94]

1. ...
2. TF Transformation (sub-linear transformation with upper bound): (k+1)*c(w,d) / (c(w,d)+k) rather than c(w,d)
3. ...
4. ...
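For comparison, here is a sketch of the BM25 score in the simplified form these notes follow, with the length normalization folded into the TF denominator; k = 1.2 and b = 0.75 are common defaults, assumed here for illustration:

import math

# Sketch of BM25/Okapi [Robertson & Walker 94], simplified form:
#   score(q,d) = sum over matched terms w of
#     c(w,q) * (k+1)*c(w,d) / (c(w,d) + k*(1 - b + b*|d|/avdl)) * log((M+1)/df(w))

def bm25_score(query_counts, doc_counts, doc_len, avdl, df, M, k=1.2, b=0.75):
    norm = 1 - b + b * doc_len / avdl              # pivoted length normalization
    score = 0.0
    for w, cwq in query_counts.items():
        cwd = doc_counts.get(w, 0)
        if cwd == 0:
            continue
        tf = (k + 1) * cwd / (cwd + k * norm)      # bounded TF: saturates at k+1
        idf = math.log((M + 1) / df[w])            # same simplified IDF as above
        score += cwq * tf * idf
    return score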




Reference
[1] ChengXiang Zhai, Text Retrieval and Search Engines, Coursera, University of Illinois at Urbana-Champaign.
[2] A. Singhal, C. Buckley, and M. Mitra, "Pivoted Document Length Normalization," SIGIR 1996.
[3] S. E. Robertson and S. Walker, "Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval," SIGIR 1994.

