Discussion

The material in this post comes from Quora, Reddit, Stack Overflow, and similar sites, where people often discuss concepts.

[ Quora ]

Why are rule based methods becoming unpopular in NLP? [Original]

[Summary]

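- Rule-based methods play a large role when high-quality or plentiful data is not available.

- If you need to solve a problem fast-and-dirty, listing out patterns (= rules) is probably the best approach.

- Rule-based methods work well for tasks that require deterministic behavior and are not very complex, such as tokenization, stemming, sentence breaking, and morphology. In other words, rule-based pre/post-processing is essential when working with linguistic data, and this pre/post-processing has a large effect on the performance of an ML classifier. Rule-based and ML-based methods are therefore inseparable parts of a single framework. They simply have different roles, and the balance between them depends on where the focus lies.

- However, rule-based methods are hard to scale. With hundreds, thousands, or tens of thousands of rules, the rules become tangled with one another in very complex ways, and the system can end up a mess.

- A good approach is to split the system into several stages. Where well-defined features exist, solve each stage with rules (this probably corresponds to pre/post-processing). When you then want to combine those features, move to an ML-based method, which generalizes better. Note that a rule-based system offers no built-in way to check for overfitting.

- In a classification problem, the goal is to find a suitable decision boundary (DB below). Rule-based methods struggle to find a complex DB, whereas ML-based methods can find one through regression. In terms of finding a DB, rule-based methods can be thought of as a discrete spectrum and ML-based methods as a continuous spectrum. Put in exaggerated terms, they correspond to the discrete and the continuous respectively, so each has its own strengths and weaknesses.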
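
The staged approach in the summary can be sketched roughly as follows. This is only a minimal illustration, not something taken from the Quora answer: the task (detecting questions), the rule patterns, the toy training sentences, and every name (`tokenize`, `RULES`, `featurize`, `train_logreg`, `predict_proba`) are assumptions made up for the example. Rules handle the deterministic steps (tokenization and binary features), and a tiny hand-written logistic regression combines the features into a continuous score.

```python
import math
import re

# Rule-based preprocessing: a deterministic regex tokenizer.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?|\d+|[^\w\s]")

def tokenize(text):
    """Lowercase the text and split it into word, number, and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())

# Hypothetical hand-written rules; each one becomes a binary (discrete) feature.
WH_WORDS = {"what", "why", "how", "when", "where", "who"}
AUX_WORDS = {"is", "are", "do", "does", "can"}
RULES = [
    lambda toks: "?" in toks,                          # contains a question mark
    lambda toks: bool(toks) and toks[0] in WH_WORDS,   # starts with a wh-word
    lambda toks: bool(toks) and toks[0] in AUX_WORDS,  # starts with an auxiliary
]

def featurize(text):
    """Apply every rule to the tokens and return a 0/1 feature vector."""
    toks = tokenize(text)
    return [1.0 if rule(toks) else 0.0 for rule in RULES]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(texts, labels, lr=0.5, epochs=500):
    """ML stage: learn how to weight the rule features with plain SGD."""
    X = [featurize(t) for t in texts]
    w, b = [0.0] * len(RULES), 0.0
    for _ in range(epochs):
        for x, y in zip(X, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def predict_proba(text, w, b):
    """Return a continuous score in (0, 1) that the text is a question."""
    x = featurize(text)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy data invented for this sketch (1 = question, 0 = not a question).
TRAIN = [
    ("Why are rule based methods unpopular?", 1),
    ("How do we define the dimension?", 1),
    ("Is this a rule", 1),
    ("Can rules scale", 1),
    ("Rules are hard to scale.", 0),
    ("We use a tokenizer.", 0),
    ("What a mess.", 0),
]

w, b = train_logreg([t for t, _ in TRAIN], [y for _, y in TRAIN])
for text in ["Why is this hard?", "What a surprise."]:
    print(text, round(predict_proba(text, w, b), 3))
```

Each rule on its own gives a hard yes/no answer, and the wh-word rule alone would wrongly flag "What a mess." as a question. The learned weights decide how much each rule should count, which is the sense in which the ML stage provides the continuous side of the decision boundary.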

[ Reddit ]

[ Stack Overflow ]
