
Pattern Discovery in Data Mining

Coursera
Illinois at Urbana-Champaign
by Jiawei Han
2015.03.19



CONTENT

1. A brief Introduction to Data Mining
2. Pattern Discovery : Basic Concepts
3. Efficient Pattern Mining Methods
4. Pattern Evaluation
5. Mining Diverse Patterns
6. Constraint-Based Pattern Mining
7. Sequential Pattern Mining
8. Graph Pattern Mining
9. Pattern-Based Classification
10. Exploring Pattern Mining Applications



Lecture 1 : A brief Introduction to Data Mining

- We're drowning in data but starving for knowledge (a lot of the data is unstructured)
- Data mining : a misnomer! -> Knowledge mining from data
- Extraction of interesting patterns (non-trivial, implicit, previously unknown, and potentially useful) or knowledge from massive data.
- Data mining is an interdisciplinary field (machine learning, pattern recognition, statistics, databases, big data, business intelligence, ...)


Knowledge Discovery (KDD) Process


Methodology View: Confluence of Multiple Disciplines




Lecture 2 : Pattern Discovery : Basic Concepts

Market basket analysis : A typical example of frequent itemset mining. This process analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets". The discovery of these associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers.

2.1 What is pattern discovery? Why is it important?
- Patterns : A set of items, subsequences, or substructures that occur frequently together (or are strongly correlated) in a data set
- Patterns represent intrinsic and important properties of datasets
- Finding inherent regularities in a data set is important for many essential data mining tasks



2.2 Frequent Patterns and Association Rules
- (absolute) support (count) of X : the number of transactions that contain X (its frequency)
- (relative) support, s : The fraction of transactions that contains X (i.e., the probability that a transaction contains X)

From Frequent Item-sets to Association Rules
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
The confidence of a rule A->B can be easily derived from the support counts of A and A ∪ B: confidence(A->B) = support(A ∪ B) / support(A).
Once the support counts of A, B, and A ∪ B are found, it is straightforward to derive the corresponding association rules A->B and check whether they are strong.
Thus, the problem of mining association rules can be reduced to that of mining frequent item-sets.

In general, the association rule mining process can be viewed as a two-step process:
  1. Find all frequent item-sets: By definition, each of these item-sets will occur at least as frequently as a predetermined minimum support count, min_sup.
  2. Generate strong association rules from the frequent item-sets: By definition, these rules must satisfy minimum support and minimum confidence. 
Because the second step is much less costly than the first, the overall performance of mining association rules is determined by the first step.
A major challenge in mining frequent item-sets from a large data set is that such mining often generates a huge number of item-sets satisfying min_sup, especially when min_sup is set low. This is because if an item-set is frequent, each of its subsets is frequent as well.
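To make the two-step process concrete, here is a minimal brute-force sketch (the toy transactions and thresholds are my own, not the lecture's): step 1 enumerates the frequent itemsets, step 2 generates the strong rules from them.

```python
from itertools import combinations

transactions = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper", "coffee"},
    {"beer", "diaper"},
    {"nuts", "coffee"},
]

def support(itemset, transactions):
    # relative support: fraction of transactions that contain the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(A, B, transactions):
    # confidence(A -> B) = support(A ∪ B) / support(A)
    return support(A | B, transactions) / support(A, transactions)

min_sup, min_conf = 0.5, 0.7

# Step 1: find all frequent itemsets (brute force is fine for a toy item universe)
items = sorted(set().union(*transactions))
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c), transactions) >= min_sup]

# Step 2: generate strong association rules A -> B from every frequent itemset
for itemset in frequent:
    for k in range(1, len(itemset)):
        for A in map(frozenset, combinations(itemset, k)):
            B = itemset - A
            if confidence(A, B, transactions) >= min_conf:
                print(f"{set(A)} -> {set(B)} "
                      f"[s={support(itemset, transactions):.2f}, "
                      f"c={confidence(A, B, transactions):.2f}]")
```

Running this prints rules such as {'beer'} -> {'diaper'} [s=0.75, c=1.00]; step 1 dominates the cost, exactly as noted above.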


Session 3: Compressed Representation: Closed Patterns and Max-Patterns

A long item-set will contain a combinatorial number of shorter, frequent sub-item-sets.
Given a frequent item-set of length 100, such as {a1, a2, ..., a100}, and assuming (absolute) min_sup = 1, the total number of frequent item-sets that it contains is $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$.
This is too huge a number of itemsets for any computer to compute or store. To overcome this difficulty, we introduce the concepts of closed frequent itemset and maximal frequent itemset (ways of compressing the possible patterns/combinations).

To explain closed frequent itemsets and maximal frequent itemsets,
let's suppose we have a simple set of transactions like this:
< Simple transaction table >
Both transaction ID 10 and transaction ID 20 have 15 possible combinations each, but some of the combinations from the two transactions overlap.
< Combination Table >
This is where closed itemsets come in handy. According to the Apriori principle, if a more complex combination meets the minimum support requirement, then its subsets (the simpler patterns) also meet that requirement. That is, once we know a super-set is a frequent pattern, we automatically know that its subsets are frequent patterns as well.

< Closed pattern Venn diagram >
- the basic rule of a closed pattern : a frequent pattern that has no super-pattern with the same support

If we use the apriori principle and these 3 closed patterns, we can reconstruct the entire data table we started with without losing any information. Pretty awesome !!!

There is another way to compress pattern data, called max-patterns. They are almost exactly the same as closed patterns, except that we no longer care whether other patterns have the same support. In this example, CD is a distinct closed pattern with support 2. However, since max-patterns ignore the support counts, the CD pattern gets absorbed into the ABCD and CDEF patterns, which are the max-patterns.








In my intuition...
- Association rules give you two measures that show at a glance how strongly X and Y are coupled, e.g., X->Y (70%, 100%) derived from a transaction set.
- Support treats X and Y occurring together, regardless of direction.
70% : # of transactions that contain both X and Y / # of all transactions.
- Confidence looks at Y occurring given X, i.e., the directional part.
100% : # of transactions that contain both X and Y / # of transactions that contain X.

In general knowledge...
- Association rules : X->Y(s,c)
Support, s : The probability that a transaction contains X and Y. (Sample space : all transactions)
Confidence, c : The conditional probability that a transaction containing X also contains Y. (Sample space : the transactions that contain X)



Closed / Max frequent patterns

In my intuition...

- If you are interested in all frequent patterns regardless of their length,
you'd better use closed-frequent-pattern mining.
- Or, if you are only interested in the longest frequent patterns,
just use max-frequent-pattern mining.
- Closed-pattern mining is like Huffman coding
in that it compresses the information without loss; it mines the core frequent patterns with no duplication.
- Suppose the patterns {a} and {abc} have the same support count. In this case, you don't need to keep {a} explicitly, because it can be regenerated from {abc}.
- Using closed-pattern mining, you can save computing resources.
- I think max-pattern mining is useful in certain cases, but it is lossy compression in the sense that you cannot recover the sub-frequent patterns' supports from the super-frequent pattern; you merely keep the max-frequent patterns themselves.

In general knowledge...

- Closed patterns : A pattern (= itemset) X is closed if X is frequent, and there exists no super-pattern Y ⊃ X with the same support as X.
- Max-patterns : A pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X.
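A minimal sketch of how the two definitions above translate into code, using a hypothetical table of frequent itemsets with their support counts (my own toy example, not the lecture's):

```python
def closed_and_max_patterns(frequent):
    # `frequent` maps each frequent itemset (a frozenset) to its support count
    closed, maximal = [], []
    for X, sup_x in frequent.items():
        supersets = [Y for Y in frequent if X < Y]
        if not any(frequent[Y] == sup_x for Y in supersets):
            closed.append(X)      # closed: no frequent proper superset with the same support
        if not supersets:
            maximal.append(X)     # max: no frequent proper superset at all
    return closed, maximal

# Hypothetical database: three transactions {a,b,c} plus one transaction {c}, min_sup = 3
frequent = {
    frozenset("a"): 3, frozenset("b"): 3, frozenset("c"): 4,
    frozenset("ab"): 3, frozenset("ac"): 3, frozenset("bc"): 3,
    frozenset("abc"): 3,
}
closed, maximal = closed_and_max_patterns(frequent)
print(closed)    # [frozenset({'c'}), frozenset({'a', 'b', 'c'})] : each keeps its own support
print(maximal)   # [frozenset({'a', 'b', 'c'})]                   : only the longest frequent pattern
```

The output matches the intuition above: the closed patterns preserve all support information, while the max-patterns only tell you which long patterns are frequent.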



Lecture 3. Efficient Pattern Mining Methods


3.1 The Downward Closure Property of Frequent Patterns

If an itemset is frequent, then all of its subsets are frequent as well. So there must be some hidden relationships among frequent patterns; this is the downward closure (also called "Apriori") property of frequent patterns.

* Apriori property : Any subset of a frequent itemset must be frequent. This property belongs to a special category of properties called antimonotonicity in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well. It is called antimonotonicity because the property is monotonic in the context of failing a test.

* Apriori pruning principle : If there is any itemset which is infrequent, its superset should not even be generated.


3.2 The Apriori Algorithm
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. 
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.
This subset testing (checking whether every k-subset of a candidate (k+1)-itemset is frequent) can be done quickly by maintaining a hash tree of all frequent itemsets.


Pseudo-Code


Example



Session 3: Extensions or Improvements of Apriori


To reduce the number of transaction database scans
(to test the support count of itemsets we have to scan the database, and a database scan means disk access, which is an expensive computation)
Partitioning: Scan Database Only Twice
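A rough sketch of the partition idea under my own assumptions (two partitions, and a brute-force local miner standing in for whatever in-memory algorithm is used per partition): scan 1 mines each partition, scan 2 re-counts the union of local candidates over the whole database. Any globally frequent itemset must be locally frequent in at least one partition, so nothing is lost.

```python
from itertools import combinations

def local_frequent(partition, min_sup_ratio):
    # Brute-force local mining; a stand-in for any in-memory frequent-itemset algorithm
    items = sorted(set().union(*partition))
    local_min_count = min_sup_ratio * len(partition)
    return {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sum(1 for t in partition if set(c) <= t) >= local_min_count}

def partition_mining(transactions, min_sup_ratio, n_parts=2):
    size = (len(transactions) + n_parts - 1) // n_parts
    partitions = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Scan 1: collect the local frequent itemsets of each partition as global candidates
    candidates = set().union(*(local_frequent(p, min_sup_ratio) for p in partitions))
    # Scan 2: count the global support of every candidate over the full database
    min_count = min_sup_ratio * len(transactions)
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_count}

db = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
print(partition_mining(db, min_sup_ratio=0.5))   # only two passes over the full database
```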





Apriori Algorithm

In my intuition...

- You can easily find frequent itemsets using Apriori; it is the straightforward, basic method of generating candidates from the database.
- Apriori is an iterative method based on level-wise search,
similar to breadth-first search in graph search.
- It takes a long time to find the longest frequent patterns, and the database scans are expensive, so there are many advanced algorithms beyond Apriori such as Partitioning (only two database scans), Dynamic Itemset Counting (fewer scans by adding candidates mid-scan), Direct Hashing and Pruning (reducing the number of candidates), and so on...

In general knowledge...

- Repeat :
Generate length-(k+1) candidate itemsets from length-k frequent itemsets.
Test the candidates against DB to find frequent (k+1)-itemsets
Set k := k+1
- Until : no frequent or candidate set can be generated
- Return all the frequent itemsets derived
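Putting the loop above into code: a minimal, unoptimized Apriori sketch (no hash tree; candidate counting is a plain scan), with my own toy database and variable names.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    # k = 1: count single items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s for s, c in counts.items() if c >= min_sup}
    all_frequent = {s: counts[s] for s in current}
    k = 1
    while current:
        # Generate length-(k+1) candidates by joining length-k frequent itemsets,
        # pruning any candidate that has an infrequent k-subset (Apriori pruning)
        candidates = set()
        for a in current:
            for b in current:
                union = a | b
                if len(union) == k + 1 and all(
                        frozenset(sub) in current for sub in combinations(union, k)):
                    candidates.add(union)
        # Test the candidates against the DB to find the frequent (k+1)-itemsets
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s for s, c in counts.items() if c >= min_sup}
        all_frequent.update({s: counts[s] for s in current})
        k += 1
    return all_frequent    # every frequent itemset with its (absolute) support count

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(apriori(db, min_sup=3))   # {a}, {b}, {c}, {a,b}, {a,c}, {b,c}; {a,b,c} is pruned by the DB scan
```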


FPGrowth Algorithm

In my intuition...

- This algorithm compresses the information through an efficient data structure, the FP-tree,
which is similar in spirit to closed-pattern mining.
- The FP-tree usually has a smaller size than the uncompressed data;
its header table only stores each item, its frequency, and the head of that item's node-links.
- Using the FP-tree, all frequent patterns can be generated through conditional pattern bases; for a single-branch FP-tree they can be generated in one shot.
- For the FP-tree, the best case is when the tree has only one path, and the worst case is when every transaction has a unique set of items.

In general knowledge...

- To mine frequent patterns by pattern growth, recursively construct and mine FP-trees, doing the above for each partitioned database (called a conditional database).
- Repeat until the resulting FP-tree is empty, or until it contains only a single path; a single-path FP-tree generates all the combinations of its sub-paths, each of which is a frequent pattern.
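A condensed sketch of the two FP-tree ingredients described above (tree nodes with parent links, plus a header table of node-links), under my own simplified representation; the full recursive FP-growth mining step is omitted.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # None for the root
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Scan 1: count item frequencies and keep only the frequent items
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}

    root = FPNode(None, None)
    header = defaultdict(list)    # header table: item -> list of tree nodes (node-links)

    # Scan 2: insert each transaction with its frequent items sorted by descending frequency,
    # so that shared prefixes are merged into shared branches
    for t in transactions:
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header, freq

def conditional_pattern_base(item, header):
    # For each node carrying `item`, walk up to the root and collect its prefix path
    cpb = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            cpb.append((list(reversed(path)), node.count))
    return cpb

db = [{"f", "a", "c", "m", "p"}, {"f", "a", "c", "b", "m"}, {"f", "b"},
      {"c", "b", "p"}, {"f", "a", "c", "m", "p"}]
root, header, freq = build_fp_tree(db, min_sup=3)
print(conditional_pattern_base("m", header))   # prefix paths of item m with their counts
```

The conditional pattern base of each item is what the recursion would mine next, building a conditional FP-tree per item.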


Mining Compressed Patterns

In my intuition...

- I want to keep only the really useful patterns; this is an evolution of closed-pattern mining.
- If patterns share a lot of similarity, they need to be compressed.

In general knowledge...


- To decide whether patterns are redundant, you have to check whether they are similar both in their item-sets and in their support.

- Delta-clustering : it serves as a similarity criterion on support;
that is, if the support ratio of two patterns is within delta, they are merged into one pattern.
- It looks like P3 alone would become the representative pattern, but that would mean heavy support loss;
thus, to keep the balance, P2, P3, and P4 are the desired output.

- P1 cannot be part of the desired output because, compared with P2, both its item-set and its support are similar (so it is redundant).
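A rough sketch of the delta-clustering idea as I understand it from the slide; the greedy ordering and the simple support-ratio check below are my own simplifications, not the exact distance measure used in the lecture or the underlying paper.

```python
def delta_representatives(patterns, delta):
    # patterns: dict mapping an itemset (frozenset) to its support count.
    # A pattern P is treated as covered by a representative R if P ⊆ R and
    # their supports differ by at most `delta` relative to P's support.
    reps = []
    for P in sorted(patterns, key=lambda x: (-len(x), -patterns[x])):
        covered = any(P <= R and (patterns[P] - patterns[R]) / patterns[P] <= delta
                      for R in reps)
        if not covered:
            reps.append(P)
    return reps

# Hypothetical patterns loosely in the spirit of the P1..P5 table above
patterns = {
    frozenset("abcdef"): 100,   # similar items and support to the next pattern -> absorbed
    frozenset("abcdefg"): 95,   # becomes the representative of the first pattern
    frozenset("ab"): 400,       # much higher support -> kept as its own representative
}
print(delta_representatives(patterns, delta=0.1))
```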


* Mining Colossal Patterns

In my intuition & general knowledge...
When searching for very long patterns, it does not matter whether we use breadth-first (Apriori) or depth-first (FPgrowth): with the "small to large" paradigm we have to pass through a huge number (a combinatorial explosion) of sub-patterns.

The colossal patterns we actually want to find sit at the very end, but before we ever reach them, the mid-size patterns cost exponential time. That makes the search practically impossible.

* The advent of Pattern-Fusion


- Philosophy : Collection of small patterns hints at the larger patterns.
- Idea : not crawl but jump!
- Strategy : Fuse small patterns together in one step to generate new pattern candidates of significant sizes.
- Small patterns such as {a}, {a,b}, ... are all short-cut and merged with each other, and the search jumps ahead quickly. (merge criterion : dense balls)
- The collection of short-cut small patterns is used to find the colossal patterns.
- Among the somewhat longer patterns, those whose supports are similar are very likely to be sub-patterns of one particular colossal pattern.
- The closer we get to a colossal pattern, the more core patterns it has.

For example...
- If {a2, a4, a45} and {a3, a34, a39} have similar support, they are merged into {a2, a3, a39, a45}. Merging in this way makes the search jump automatically.
- Merging small-size patterns with each other like this yields larger-size candidate patterns (= core patterns).
- Starting from small-size patterns and repeatedly merging core patterns in this way, we can eventually obtain the colossal patterns.



* Sequential Pattern Mining

A sequence : <(ef)(ab)(df)cb>
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>

- ( ) : a basket in which patterns (events) occur at the same time; there is no ordering inside the basket.

<ab>=<(a)(b)> vs <(ab)>

- <ab> : buy a first, then buy b; there is a gap between a and b.
- <(ab)> : buy a and b together at the same time; there is no gap between a and b.
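A tiny sketch of how this containment rule could be checked in code (my own representation: a sequence as a list of item sets), illustrating the <ab> vs. <(ab)> distinction above.

```python
def contains(sequence, sub):
    # Greedy check of whether `sub` is a subsequence of `sequence`.
    # Both are lists of item sets; each set is one "basket" bought at the same time.
    i = 0
    for element in sequence:
        # sub[i] must fit entirely inside a single element (no gap inside a basket)
        if i < len(sub) and set(sub[i]) <= set(element):
            i += 1
    return i == len(sub)

seq = [{"a", "b"}, {"c"}, {"a"}]
print(contains(seq, [{"a"}, {"a"}]))   # True : like <aa>, a gap between elements is allowed
print(contains(seq, [{"a", "c"}]))     # False: like <(ac)>, a and c must be in the same basket
```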




Pattern Evaluation


Q. How do we judge whether a pattern is interesting?


* LEVEL 1 - Support & Confidence

"A=>B" [s,c]

- Pros : simple, interesting association rules
- Cons : it merely expresses the probability and conditional probability of two different patterns, which is not that strong. It is not enough to explain the association rule, e.g., to tell a negatively correlated pattern from a positively correlated one.


* LEVEL 2 - Lift / Chi-square(X^2)

1) Lift
 - Lift(A,B) = 1 : A and B are independent
 - Lift(A,B) > 1 : A and B are positively correlated
 - Lift(A,B) < 1 : A and B are negatively correlated

2) Chi-square
 - X^2 = 0 : independent
 - X^2 > 0 : correlated, either positively or negatively; needs an additional test
 - additional test :
   expected value > observed value : negatively correlated
   expected value < observed value : positively correlated

- pros : the cases involving NOT-patterns can all be taken into account, i.e., a wider view than support and confidence alone. For each pattern (including its negation) we can compute the overall statistical proportions, so the NOT-pattern cases can be actively used; a NOT-pattern is an interesting pattern, too.
If the joint proportion of the two patterns is lower than the overall (expected) proportion, they are negatively correlated, i.e., they do not easily occur together; if it is higher, they are positively correlated, i.e., they easily occur together.
- cons : the *null-transactions problem
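Before looking at the null-transaction problem below, here is a small sketch that computes lift and X^2 from the four counts of a 2x2 contingency table. The counts are my own hypothetical example with many null transactions, not the lecture's table.

```python
def lift(n, n_a, n_b, n_ab):
    # lift(A, B) = P(A and B) / (P(A) * P(B))
    return (n_ab / n) / ((n_a / n) * (n_b / n))

def chi_square(n, n_a, n_b, n_ab):
    # chi^2 over the 2x2 table (A / not-A) x (B / not-B): sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for in_a in (True, False):
        for in_b in (True, False):
            observed = (n_ab if in_a and in_b else
                        n_a - n_ab if in_a else
                        n_b - n_ab if in_b else
                        n - n_a - n_b + n_ab)
            expected = (n_a / n if in_a else 1 - n_a / n) * (n_b / n if in_b else 1 - n_b / n) * n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical counts: 100,000 transactions, 1,000 with B, 1,000 with C, 100 with both,
# which leaves 98,100 null transactions containing neither B nor C
print(lift(100_000, 1_000, 1_000, 100))        # 10.0  -> "strongly positive"?
print(chi_square(100_000, 1_000, 1_000, 100))  # ~826  -> "strongly correlated"? mostly a null-transaction artifact
```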


Lift(B,C) = 8.44 >> 1 (Lift says B and C are strongly positively correlated -> this is wrong!)
X^2 = 670 : observed(BC) >> expected value (11.85) -> this is wrong, too!
Although the supports of B and C are quite low relative to the whole dataset, the correlation measured by Lift and X^2 comes out high because of the null transactions. When the Lift value is far greater than 1, or the X^2 value is much larger than the expected value, we should suspect that the result is misleading.


Null invariance is crucial for the analysis of massive transaction data.
Lift and X^2 are not null-invariant : they are not good for evaluating data that contain too many or too few null transactions. With too few null transactions the two itemsets are measured as independent, and with too many they are measured as too positively correlated (too big), which is not correct.





* LEVEL 3 - Null-invariant measures 


*Null invariance : Value does not change with the # of null-transactions

However, the various null-invariant measures do not all give the same result. Which one is the best?


- D4, D5, and D6 are good test cases for telling the null-invariant measures apart.
- When -mc and m-c (here '-' means 'not') increase or decrease at the same rate, only the Kulc measure keeps its balance.
- Since the main goal is to measure the correlation between the two features, the count of -m-c (the part where both are null) is meaningless; only the relationships among mc, -mc, and m-c matter.

- pros : Kulc holds firm and stays balanced between both directional implications.
- cons : However, it cannot distinguish very imbalanced cases such as D6 (we still need to tell D4 and D6 apart).


* LEVEL 4 - Kulczynski Measure + Imbalance Ratio

* IR (Imbalance Ratio) : measures the imbalance of the two itemsets A and B in the rule implications.

- mc <---(ratio)---> -mc & m-c
- Just as it separates D4 from D6, it lets us analyze the correlation clearly.
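The two formulas in a short sketch, using hypothetical support counts that mimic a very imbalanced (D6-style) case; the numbers are mine, not the lecture's.

```python
def kulczynski(sup_a, sup_b, sup_ab):
    # Kulc(A, B) = 0.5 * (P(B|A) + P(A|B)); null transactions never enter the formula
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)

def imbalance_ratio(sup_a, sup_b, sup_ab):
    # IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A, B))
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

# Hypothetical, very imbalanced counts: m occurs 1,000 times, c occurs 100,000 times,
# and every transaction containing m also contains c
sup_m, sup_c, sup_mc = 1_000, 100_000, 1_000
print(kulczynski(sup_m, sup_c, sup_mc))       # 0.505 -> looks "neutral" on its own
print(imbalance_ratio(sup_m, sup_c, sup_mc))  # 0.99  -> but IR exposes the heavy imbalance
```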





Lecture 7. Sequential Pattern Mining

Motivation
Sequential Pattern mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence.

Sequential pattern mining has broad applications
 - Customer shopping sequences. For example, purchase a laptop first, then a digital camera, and then a smartphone within 6 months.
 - Medical treatments, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, ...
 - Weblog click streams, calling patterns, ...
 - Software engineering: Program execution sequences, ...
 - Biological sequences: DNA, protein, ...

Sequential pattern mining: Given a set of sequences, find the complete set of frequent subsequences (i.e., satisfying the min_sup threshold)

Let's say there is a sequence <(ef)(ab)(df)(c)(d)>. An element, surrounded by parentheses, may contain a set of items (also called events). Items within an element are unordered, and we just list them alphabetically. For example, ... is a subsequence of <(ef)(ab)(df)(c)(d)>.

Sequence database
Given a support threshold min_sup = 2, <(ab)c> is a sequential pattern.
The Apriori property still holds. Therefore, if a subsequence $s_1$ is infrequent, none of $s_1$'s super-sequences can be frequent.
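To make the min_sup = 2 statement concrete, here is a small sketch counting the support of the candidate pattern <(ab)c> over a toy sequence database. The rows are written from memory of the lecture's example, so treat them as hypothetical data.

```python
def contains(sequence, sub):
    # Greedy subsequence check: each element of `sub` must fit inside one element of `sequence`,
    # and the matched elements must appear in order (gaps between them are allowed)
    i = 0
    for element in sequence:
        if i < len(sub) and set(sub[i]) <= set(element):
            i += 1
    return i == len(sub)

def sequence_support(pattern, seq_db):
    return sum(1 for s in seq_db if contains(s, pattern))

# Toy sequence database; each sequence is a list of baskets (sets of items)
seq_db = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]
pattern = [{"a", "b"}, {"c"}]             # the pattern <(ab)c>
print(sequence_support(pattern, seq_db))  # 2 -> frequent when min_sup = 2
```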

Session 7.1. GSP: Apriori-Based Sequential Pattern Mining

Session 7.2. PrefixSpan: Sequential Pattern Mining by Pattern-Growth



Lecture 8. Graph Pattern Mining

Motivation

Graph pattern mining becomes increasingly crucial to applications in a variety of domains including bioinformatics, cheminformatics, social network analysis, computer vision and multimedia.

Frequent graph patterns are subgraphs that are found from a collection of graphs or a single massive graph with a frequency no less than a user-specified support threshold. Frequent subgraphs are useful at characterizing graph sets, discriminating different groups of graphs, classifying and clustering graphs, and building graph indices.

There are lots of applications of Graph Pattern Mining.
- Bioinformatics: Gene network, protein interactions, metabolic pathways
- Chem-informatics: Mining chemical compound structures
- Social Networks, web communities, tweets
- Web graphs, XML structures, semantic Web, information networks
- Software engineering: program execution flow analysis
- Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
- Graph indexing and graph similarity search

There are many large sets of graphs in the world; for example, there are many chemical compounds. Given a labeled graph dataset $D =$ {$G_1, G_2,...,G_n$}, the supporting graph set of a subgraph $g$ is $D_g$ = {$G_i | g⊆G_i, G_i∈D$}. That is, within the dataset $D$, how many graphs contain this sub-graph $g$?

The measurement used to determine whether or not a sub-graph is frequent is very similar to the case of transactions containing items: the support value of a subgraph g is the number of graphs containing g divided by the total number of graphs in the dataset.

Graph dataset $G_a, G_b, G_c$

Frequent Subgraph $g_1, g_2$
Since both subgraphs have min_sup = 2, that is 67% support, we can say they are frequent graph patterns.
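Below is a hedged sketch of this support computation using networkx; the library choice, the toy graphs, and storing node labels in a 'label' attribute are my own assumptions. Note that GraphMatcher.subgraph_is_isomorphic checks node-induced subgraph isomorphism, which is a simplification of the containment notion typically used in frequent subgraph mining.

```python
import networkx as nx
from networkx.algorithms import isomorphism

def graph_support(g, dataset):
    # Fraction of graphs in `dataset` that contain the subgraph `g`
    # (node labels stored in the 'label' attribute must match)
    node_match = isomorphism.categorical_node_match("label", None)
    hits = 0
    for G in dataset:
        gm = isomorphism.GraphMatcher(G, g, node_match=node_match)
        if gm.subgraph_is_isomorphic():   # is g isomorphic to an induced subgraph of G?
            hits += 1
    return hits / len(dataset)

# Toy labelled graphs (hypothetical chemical-like fragments)
G1 = nx.Graph()
G1.add_nodes_from([(1, {"label": "C"}), (2, {"label": "O"}), (3, {"label": "C"})])
G1.add_edges_from([(1, 2), (2, 3)])
G2 = nx.Graph()
G2.add_nodes_from([(1, {"label": "C"}), (2, {"label": "N"})])
G2.add_edge(1, 2)

g = nx.Graph()
g.add_nodes_from([(0, {"label": "C"}), (1, {"label": "O"})])
g.add_edge(0, 1)
print(graph_support(g, [G1, G2]))   # 0.5 -> g appears in 1 of the 2 graphs
```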

Different Methodologies
- Generation of candidate subgraphs: Apriori vs. Pattern-Growth
- Search order: Breadth vs. Depth
- Elimination of duplicate subgraphs: Passive vs. Active
- Support calculation: store embeddings, MoFa
- Order of pattern discovery: path->tree->graph


8.1 gSpan: A Pattern Growth Approach
Unlike the Apriori-based approach, which uses breadth-first search and has to generate all the k-edge subgraphs before moving to (k+1)-edge subgraphs, the gSpan algorithm is a pattern-growth approach using depth-first search, where you can go directly from a k-edge subgraph to a (k+1)-edge subgraph.

The major challenge is that we tend to have many, many different growth paths as the number of edges increases. In other words, if this is not controlled, we will generate a tremendous number of duplicate subgraphs.

In order to solve this problem, gSpan defines a single order in which to generate subgraphs, rather than many different orders (paths). This order is called right-most path extension: extension is allowed only along the path from the root to the right-most leaf. When extending to another vertex, gSpan chooses the vertex with the smallest index at each step. Through this process, we avoid generating a tremendous number of duplicate subgraphs while keeping the enumeration of graphs complete.

We grow in this order, right-most path extension, where every time we start from the lowest edge (e.g., 0) we always select the smallest index at each step. Doing this gives us something like a sequence, called the DFS code (built over a DFS spanning tree), which flattens a graph into a sequence using depth-first search.
As a result, with the very efficient gSpan algorithm we can use only this sequence, the DFS code, rather than the whole graph, to determine the frequent graph patterns.





Lecture 10. Exploring Pattern Mining Applications
Because there are so many kinds of applications, instead of discussing all of them we'll focus on only one.

10.1 Frequent Pattern Mining for Text Data: Phrase Mining & Topic Modeling
Frequent pattern mining for text essentially means that when you look at many, many different documents in a corpus, you assume there can be multiple topics across the set of documents (or you focus the discussion on certain topics). For each topic, you get a distribution over words. If a topic consists of a distribution over single words, we are dealing with uni-grams; if a topic consists of n-grams, we are dealing with phrases.

However, uni-grams (single words) can be difficult to interpret, or sometimes confusing. For example, in a topic that represents the area of Machine Learning, the word "machine" alone could mean anything: a hardware machine, a software machine, a mechanical machine, and so on.
Instead, when we generate phrases to represent the topic, we can easily recognize what it means. Phrases give us less ambiguous and more meaningful concepts, such as "machine learning".





Review
Starting with market basket analysis, I was able to learn about various types of frequent pattern mining (e.g., sequential patterns, colossal patterns, graph pattern mining, etc.).
In the process of finding frequent patterns in various types of data, I felt two things. First, there are many mathematical models for mining data, based on, for example, statistics, linear algebra, and probability; so I felt I had better focus on one mathematical approach to data mining. Second, I could learn about the efficiency of algorithms in terms of minimizing the search space, the number of database scans, and the number of candidates, and about using efficient data structures (e.g., trees such as the FP-tree) and elegant techniques (e.g., divide and conquer, breadth-first search in the Apriori-based approach, depth-first search in the pattern-growth approach).
The weakness of this course, I think, is that most concepts are covered a little shallowly (i.e., in high-level mode). It would be better to go deeper into the representative algorithms.
Still, it is worth it to be able to see the big tree of data mining quickly.


Reference
[1] Data Mining: Concepts and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems)
[2] Pattern Discovery in Data Mining, Illinois at Urbana-Champaign, Jiawei Han, Coursera
[3] http://simpledatamining.blogspot.kr/2015/02/closed-itemsets.html
[4] "Sequential pattern mining." Wikipedia. Wikimedia Foundation, Web.
[5] Chapter 12 - Mining Graph Patterns, Springer
