
Text Retrieval and Search Engines


by ChengXiang "Cheng" Zhai

CONTENT

1. Natural Language Content Analysis
2. Text Access
3. Text Retrieval Problem
4. Text Retrieval Methods
5. Vector Space Model (Retrieval Model I)
6. System Implementation
7. Evaluation
8. Probabilistic Model (Retrieval Model II)
9. Feedback
10. Web Search



Harnessing Big Text Data: Text Retrieval + Text Mining

Course Summary



1. Natural Language Content Analysis


1.1 Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.

Computers understand human language only imperfectly, much as a Korean speaker might understand English as a foreign language; the ambiguity and implicit common-sense knowledge in natural language are especially difficult for a computer to handle.

An Example of NLP

1. Lexical analysis (part-of-speech tagging) : labeling each word as Noun, Prep, Verb, Aux, Det, etc. (see the sketch after this list)
2. Syntactic analysis (parsing) : grouping words into structures such as noun phrases, verb phrases, and sentences.
3. Semantic analysis : obtaining a partial understanding of the sentence.
  - Entity / relation extraction : identifying entities such as people and locations, and the relations between them.
  - Word sense disambiguation : determining which sense of a word is intended in a given context.
  - Sentiment analysis : determining whether a sentence is positive or negative.
4. Inference : deriving facts that are implied but not explicitly stated.
5. Pragmatic analysis (speech act) : inferring the purpose of an utterance; it requires all the previous steps to be done.
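
To make the first step concrete, here is a minimal sketch of lexical analysis (tokenization and POS tagging) using the NLTK library; the choice of NLTK is illustrative, not a prescribed implementation, and the sentence echoes the lecture's running example.

```python
# A minimal sketch of lexical analysis (POS tagging) with NLTK.
# Assumes NLTK is installed (pip install nltk); data package names may vary by NLTK version.
import nltk

nltk.download("punkt")                        # tokenizer models
nltk.download("averaged_perceptron_tagger")   # POS tagger model

sentence = "A dog is chasing a boy on the playground."
tokens = nltk.word_tokenize(sentence)         # lexical units
print(nltk.pos_tag(tokens))                   # (word, POS-tag) pairs
# e.g. [('A', 'DT'), ('dog', 'NN'), ('is', 'VBZ'), ('chasing', 'VBG'), ...]
```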


Because of these difficulties, we cannot achieve 100% accurate POS tagging, complete general parsing, or precise deep semantic analysis. As a result, robust and general NLP tends to be "shallow", while "deep" understanding does not scale up to large collections. Moreover, the state of the art relies on machine learning, which requires large amounts of training data, and the natural language data a system must handle often differs from the data its models were trained on. Nevertheless, deeper NLP is needed for complex search tasks.



2. Text Access

2.1 Motivation
Q. How can a text information system help users get access to the relevant text data?

Text access technology plays two important roles in text analysis and management applications. First, it enables retrieval of the text data most relevant to a particular analysis problem, thus avoiding the unnecessary overhead of processing a large amount of non-relevant data. Second, it enables interpretation of any analysis results or discovered knowledge in the appropriate context and provides data provenance (origin).

The general goal of text data access is to connect users with the right information at the right time. To make this connection efficient, the selection of relevant text data from a large collection is crucial, and it can be done in two modes: pull and push.


2.2 Pull Mode
The user initiates the access process to find the relevant text data, typically by using a search engine. This mode of text access is essential when a user has an ad hoc information need, that is, a temporary information need that might disappear once the need is satisfied. In such a case, the user formulates a query to find relevant information with a search engine.
While querying is the most common way of accessing text data in the pull mode, browsing is a complementary way that can be useful when a user does not know how to formulate an effective query, finds it inconvenient to enter a keyword query, or simply wants to explore a topic with no fixed goal.

Indeed, when searching the Web, users tend to mix querying and browsing while traversing hyperlinks. In general, we may regard querying (seeking: going directly to a target) and browsing (sightseeing: wandering around near the goal) as two complementary ways of finding relevant information in the information space.


2.3 Push Mode
The system initiates the process to recommend a set of relevant information items to the user. This mode of information access is generally more useful to satisfy a long-standing information need of a user or analyst.

For example, a researcher's research interests can be regarded as relatively stable over time. In comparison, the information stream is dynamic. In such a scenario, although a user can regularly search for relevant literature information with queries, it's more desirable for a recommender system to monitor the dynamic information stream and "push" any relevant articles to the user based on the matching of the articles with the user's interests. In some long-term analytics applications, it would also be desirable to use the push mode to monitor any relevant text data about a topic related to the application.

Another scenario of push mode is producer-initiated recommendation, also called selective dissemination of information (SDI). In this scenario, the producer of information has an interest in disseminating the information among relevant users and pushes an information item to them; examples include advertising product information on search result pages and sending email notifications.

The dichotomy of text information access modes
There are broadly two kinds of information needs: short-term and long-term. Short-term needs, which are temporary, are most often served in pull mode by a search engine, while long-term needs are most often served in push mode by a recommender system.

Ad hoc retrieval is extremely important because ad hoc information needs (temporary, but frequent and therefore important) show up far more often than long-term information needs, and the techniques effective for ad hoc retrieval can usually be reused for filtering and recommendation as well.

To build a more sophisticated and intelligent information system, we should combine a recommender system and a search engine, since the two are complementary.



3. Text Retrieval Problem





Search Problem

- Since a user with an information need does not know exactly what information exists, the query may not be accurate, and the system can rarely point to exactly the right documents.
- Since a user cannot digest all the contents at once, the user generally has to look at the documents sequentially.
- As a result, prioritization is needed: the system should return a ranked list of documents.


Retrieval Model

+ Similarity-based models: f(q,d) = similarity(q,d)
  - Vector space model
+ Probabilistic models: f(q,d) = p(R=1|d,q), where R ∈ {0,1} is a binary random variable denoting relevance
  - Classic probabilistic model
  - Language model
  - Divergence-from-randomness model
+ Probabilistic inference model: f(q,d) = p(d -> q), the probability that d entails q
+ Axiomatic model: f(q,d) must satisfy a set of formal constraints. Interestingly, these different models tend to result in similar ranking functions involving similar variables.


TR System Architecture

< Typical TR System Architecture >

A search engine system can be divided into three parts:
1. Indexer : creates an index, a data structure the search engine uses to answer queries quickly. (offline)
2. Scorer : uses the index to answer a user's query quickly by scoring the documents and ranking them. (online)
3. Feedback : implicit & explicit feedback. (offline & online)

Tokenization
- Normalize lexical units : words with similar meanings should be mapped to the same indexing term.
- Stemming : mapping all inflectional forms of a word to the same root form.
ex) computer -> compute, computation -> compute, computing -> compute
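
As a concrete illustration (my own sketch, not part of the course material), NLTK's Porter stemmer performs this kind of mapping; note that it reduces these words to the truncated root "comput" rather than the dictionary form "compute":

```python
# A sketch of stemming with NLTK's Porter stemmer (assumes nltk is installed).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computation", "computing"]:
    print(word, "->", stemmer.stem(word))
# computer -> comput
# computation -> comput
# computing -> comput
```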


Inverted Index for Fast Search

< Example of Inverted Index >

- #docs + total freq (TF)
 : used in the vector space model (the document count gives the IDF weighting).
- Postings
 : let us jump directly to the postings list to fetch all the documents that match "news".
- Position
 : very useful for checking whether the matching query terms actually occur within a small window (e.g., all documents matching "news" can be collected and examined in one place).
- An inverted index is much more efficient than sequentially scanning documents, because we take the union of the documents that match at least one query term and then aggregate the term weights.
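
The following toy sketch (the documents and data layout are my own choices for illustration) shows the core idea: for each term, keep a postings list of (doc ID, term frequency, positions), so the documents matching a query term can be fetched directly:

```python
# A toy in-memory inverted index: term -> postings of (doc_id, tf, positions).
from collections import defaultdict

docs = {
    1: "news about",
    2: "news about organic food campaign",
    3: "news of presidential campaign",
}

index = defaultdict(list)
for doc_id, text in docs.items():
    positions = defaultdict(list)
    for pos, term in enumerate(text.lower().split()):
        positions[term].append(pos)
    for term, pos_list in positions.items():
        index[term].append((doc_id, len(pos_list), pos_list))

print(index["news"])   # all documents matching "news", in one place
# [(1, 1, [0]), (2, 1, [0]), (3, 1, [0])]
```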


Empirical Distribution of Words (the reason why we should use an inverted index)

- There are stable language-independent patterns in how people use natural languages.
- A few words("the","a","we",...) occur very frequently; most occur rarely.
- The most frequent word in one corpus may be rare in another.
- The exact words that are common may vary from context to context.

< Zipf's Law >

- Highest-frequency words
 : generally not very useful for retrieval, so they are often removed because they occupy a lot of space in the inverted index; this is called stop-word removal.
- Many rare words
 : very useful for search (very high IDF).
- The higher a word's rank, the lower its frequency: by Zipf's law, frequency is approximately inversely proportional to rank, $f(r) \propto 1/r$.
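
A quick way to observe this pattern yourself (a sketch; "corpus.txt" is a hypothetical plain-text file of your choosing) is to count word frequencies and inspect them by rank; under Zipf's law, rank × frequency stays roughly constant:

```python
# Sketch: rank-frequency counts to eyeball Zipf's law on a plain-text corpus.
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:   # hypothetical file name
    counts = Counter(f.read().lower().split())

for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, word, freq, rank * freq)          # rank * freq ~ constant under Zipf
```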


Constructing Inverted Index

< Sort-based Inversion >

- "Local" Sort : 아직까지 doc끼리는 섞진 않는다
- Merge Sort : 모든 doc들과 term들을 sorting한다.
- The main difficulty is to build a huge index with limited memory. So we need to have a compression such as TF compression and Doc ID compression.
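
To illustrate the doc ID compression idea (a minimal sketch of one common approach, not the course's exact scheme), we can store the gaps between sorted doc IDs and encode each gap with variable-byte coding, so the small gaps of frequent terms take only one byte each:

```python
# Sketch of doc-ID compression: gap encoding + variable-byte coding.
def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit chunks; the high bit marks the last byte."""
    chunks = []
    while True:
        chunks.append(n & 0x7F)       # collect low-order 7-bit chunks first
        n >>= 7
        if n == 0:
            break
    chunks[0] |= 0x80                 # flag the low-order (final) byte
    return bytes(reversed(chunks))    # emit high-order chunks first

doc_ids = [3, 7, 8, 150]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = b"".join(vbyte_encode(g) for g in gaps)
print(gaps, "->", len(encoded), "bytes")   # [3, 4, 1, 142] -> 5 bytes (vs. 16 with 4-byte ints)
```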




5. Vector Space Model
The vector space model is a special case of the similarity-based models, in that we assume relevance is roughly correlated with the similarity between a document and a query. In order to solve our search problem, we have to convert the vague notion of relevance into a more precise definition that can be implemented in a programming language. In this process, we have to make a number of assumptions to arrive at a precise definition.

In a high-dimensional space, where each dimension corresponds to a term, we can plot our documents since they are represented as vectors of term weights.
< Illustration of documents plotted in vector space >
What we see here is only the vector of the document, even though the document of course has other information; for example, the order of the words is simply ignored. In this representation, $d_1$ seems to suggest a topic related to "presidential" or "library". By using this vector space representation, we can capture the differences between the topics of documents. Similarly, we place our query in this space as another vector and then measure the similarity between the query vector and every document vector.

The vector space model is a framework in which we represent a document and a query each as a term vector of equal length. A term can be any basic concept, such as a word or a phrase, or even character n-grams or any other feature representation. A query vector or a document vector consists of elements corresponding to the weights of the different terms. Relevance is then assumed to be the similarity between the two vectors, so our relevance function is defined as the similarity between the query vector and the document vector.

In order to make this representation precise enough to be implemented as a function on a computer, we must deal with:
First, how to define or select the basic concepts (terms); we clearly assume the concepts are orthogonal, otherwise there will be redundancy.
Second, how to place documents and queries in this vector space (i.e., how to assign term weights).
Finally, how to define the similarity measure.
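
As a minimal sketch of the simplest instantiation (0/1 bit-vector term weights with dot-product similarity; the vocabulary, query, and documents are made up for illustration):

```python
# Simplest VSM instantiation: bit-vector weights + dot-product similarity.
def to_vector(text, vocabulary):
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

vocab = ["news", "about", "presidential", "campaign", "food"]
query = "news about presidential campaign"
docs = ["food news", "news of presidential campaign", "presidential campaign food"]

q = to_vector(query, vocab)
for d in sorted(docs, key=lambda d: dot(q, to_vector(d, vocab)), reverse=True):
    print(dot(q, to_vector(d, vocab)), d)   # documents in ranked order
# 3 news of presidential campaign
# 2 presidential campaign food
# 1 food news
```

A real system would replace the 0/1 weights with TF-IDF weights and refine the similarity measure, which is exactly where the three questions above lead.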




8. Probabilistic Model 
In probabilistic models, we define the ranking function based on the probability that a given document $d$ is relevant to a query $q$, i.e., $p(R=1|d,q)$, where $R \in \{0,1\}$ is a binary random variable denoting relevance. In other words, we assume that the query and the documents (i.e., the data) are all observations of random variables. The problem is therefore to estimate the probability of relevance.

8.1 Basic probabilistic retrieval model


8.2 Query likelihood retrieval model
The query likelihood retrieval model is one of the most effective probabilistic models. In query likelihood, our assumption is that the probability of relevance can be approximated by the probability of the query given the document and relevance, $p(q|d, R=1)$. Intuitively, this captures the following question: if a user likes document $d$ (because we have $R=1$), how likely is it that the user would enter query $q$ in order to retrieve document $d$?
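
A minimal scoring sketch under this model (the Jelinek-Mercer smoothing and the λ value are my additions for illustration; smoothing is treated in more depth in the course):

```python
# Query likelihood with Jelinek-Mercer smoothing:
#   p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|C)
import math
from collections import Counter

def log_query_likelihood(query, doc, collection, lam=0.1):
    doc_counts = Counter(doc.lower().split())
    coll_counts = Counter(collection.lower().split())
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())
    score = 0.0
    for w in query.lower().split():
        p_doc = doc_counts[w] / doc_len        # c(w,d) / |d|
        p_coll = coll_counts[w] / coll_len     # assumes w occurs somewhere in the collection
        score += math.log((1 - lam) * p_doc + lam * p_coll)
    return score                               # log p(q|d): higher means better match

doc = "news about presidential campaign"
collection = "news about presidential campaign food safety text retrieval"
print(log_query_likelihood("presidential campaign", doc, collection))
```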






Reference
[1] ChengXiang Zhai, "Text Retrieval and Search Engines," University of Illinois at Urbana-Champaign, Coursera.
[2] ChengXiang Zhai and Sean Massung, "Text Data Analysis and Management: A Practical Introduction to Text Mining and Information Retrieval."
[3] "Natural language processing," Wikipedia, Wikimedia Foundation.


