[Data Science] Association Rule Mining (1)

728x90

Motivation

구글이나 네이버같은 검색엔진에 검색어를 입력하면 자동완성이 되는 것도 일종의 association rule(연관 규칙)에 기반한 추천이다.

Association Rule Mining

아래와 같은 거래 데이터베이스 (transaction data)가 있다고 하자. 여기서 알 수 있는 association rule은 우측 그림과 같다.

여기서의 association rule은

기저귀를 구매한 사람은 맥주도 구매한다. (아기 아빠는 피곤해서 맥주가 필요한 것일까? 🤣🤣)
우유와 빵을 구매한 사람은 달걀과 콜라도 구매한다.
맥주와 빵을 구매한 사람은 우유를 구매한다.

이렇게 association rule을 찾는 알고리즘은 크게 Apriori, FP-Growth 2가지 알고리즘이 있다.

Frequent Itemset과 관련 용어 정리

Itemset

1개 이상의 item을 원소로 하는 집합.

e.g. {Milk, Bread, Diaper}

k-itemset

$k$개의 item을 원소로 하는 집합.

e.g. {Milk, Bread, Diaper}는 3-itemset이다.

Support count($\sigma$, absolute support)

(DB에서) itemset이 등장한 횟수

e.g. 위의 Figure 1에서, $\sigma$({Milk, Bread, Diaper}) = 2 이다.

Frequent Itemset

support가 minsup threshold 이상인 itemset

Association Rule

$X, Y$가 itemset일 때, (겉으로 드러나지 않지만) $X \to Y$로 표현되는 규칙

e.g. {Milk, Diaper} → {Beer}

Rule Evaluation Metrics

association rule $X \to Y$에 대한 평가지표이다.

Support (s): $X, Y$를 모두 포함하는 transaction의 비율.

Confidence (c): $X$안에 $Y$가 얼마나 자주 등장하는 비율.

다시 Figure 1으로 돌아와서, {Milk, Diaper → Beer}라는 association rule의 support와 confidence를 구해보자.

itemset {Milk, Diaper, Beer}가 포함된 TID=2, 4이므로

\[ s = \cfrac{\sigma(\text{Milk, Diaper, Beer})}{|T|} = 2/5 = 0.4 \]

또한 {Milk, Diaper}가 포함된 TID=2, 4, 5이므로

\[ c = \cfrac{\sigma(\text{Milk, Diaper, Beer})}{\sigma({\text{Milk, Diaper}})} = 2/3 = 0.67 \]

Association Rule Mining Task

이제 transaction $T$가 주어졌을 때, association rule mining task는 다음 2가지 조건을 만족하는 모든 rule을 찾는 것이다.

support $\ge$ minsup
confidence $\ge$ minconf

그렇다면 어떻게 하면 모든 association rule을 찿을 수 있을까?

먼저 떠올릴 수 있는 첫번째 생각은 Brute-force 접근법이다.

Brute-force approach

모든 가능한 association rule을 생성 (이미 계산복잡도가 팩토리얼이다!)
각 rule마다 s, c를 계산
min_sup, min_conf 이상이 되는 rule만 찾는다.

Brute-force에서 몇가지 아이디어를 얻을 수 있다.

위의 rule은 같은 itemset {Milk, Diaper, Beer}의 binary partition이다.
같은 itemset에서 파생된 rule은 support값은 같지만, confidence는 다를 수 있다.

따라서 같은 itemset 내에서, support과 confidence의 값을 같이 고려하지 않아도 된다.

이를 정제하면 크게 2단계로 볼 수 있다.

Frequent Itemset Generation - support $\ge$ minsup인 모든 itemset을 생성하기.
Rule Generation - 각 frequent itemset에서 높은 confidence인 itemset 생성하기.

Figure 4. $d$개의 item의 candiate itemset은 $2^d$개이다.

Figure 4에서 보듯이, $M=2^d$이므로 전체 시간복잡도는 $O(NMw) = O(Nw2^d)$이다.

전체 itemset의 개수를 $2^d$, 전체 가능한 rule의 개수 $R$을 구해보자.

LHS는 $d$개의 item중에서 $k$개를 뽑은 방법의 수이고, RHS는 나머지 $(d-k)$개의 item중에서 $j$개를 봅는 방법의 수다.

\[ R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k}\binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1 \]

따라서 여전히 Brute-force로 접근한 Frequent Itemset Generation은 computationally expensive하다!!

다음 포스트는 Frequent Itemset Generation 알고리즘에 대해 써보겠습니다.

728x90

'스터디 > 데이터사이언스' 카테고리의 다른 글

[Data Science] Association Rule Mining (3) - Apriori + Hash Tree (0)	2023.04.02
[Data Science] Association Rule Mining (2) - Apriori principle (0)	2023.04.01
[Data Science] Data Preprocessing (5) - Data Transformation (0)	2023.03.28
[Data Science] Data Preprocessing (4) - Data Reduction (0)	2023.03.28
[Data Science] Data Preprocessing (3) - Data Integration (0)	2023.03.28

궁금한게많은joon

[Data Science] Association Rule Mining (1) - Introduction

Motivation

Association Rule Mining

Frequent Itemset과 관련 용어 정리

Association Rule Mining Task

'스터디 > 데이터사이언스' 카테고리의 다른 글

티스토리툴바

[Data Science] Association Rule Mining (1) - Introduction

Motivation

Association Rule Mining

Frequent Itemset과 관련 용어 정리

Association Rule Mining Task

'스터디 > 데이터사이언스' 카테고리의 다른 글

관련글

티스토리툴바