[Data Science] Bayesian Classifier


Bayesian Classifier

Treat the attributes and the class label as random variables. Given an attribute tuple, the predicted class is the one with the highest probability. With attributes $(A_1, A_2, \dots, A_n)$ and class label $C$, we want

\[ \max P(C | A_1, \dots, A_n) \]

that is, the class $C$ that attains this maximum.

So how do we compute $P(C | A_1, \dots, A_n)$?

Bayes' theorem gives:

\[ P(C | A_1, \dots, A_n) = \cfrac{P(A_1, \dots, A_n|C) P(C)}{P(A_1, \dots, A_n)} \]

The denominator $P(A_1, \dots, A_n)$ does not depend on $C$, so it is the same constant for every class. It is therefore enough to maximize the numerator.
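Combining the two steps, the decision rule can be written in one line. The symbol $\hat{C}$ is introduced here only as a label for the predicted class; it is not part of the original notation.

\[ \hat{C} = \arg\max_{C} P(C | A_1, \dots, A_n) = \arg\max_{C} P(A_1, \dots, A_n | C) P(C) \]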

Naive Bayes Classifier

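Assumption: all attributes are conditionally independent of one another given the class.

Under this assumption, the class-conditional probability of the attributes for the $j$-th class factorizes as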

\[ P(A_1, \dots, A_n|C_j) = P(A_1|C_j)P(A_2|C_j) \cdots P(A_n|C_j) \]

So for each class $C_j$ we only need to compute the following quantity and pick the class for which it is largest.

\[ P(C_j) \prod_{i=1}^{n} P(A_i | C_j) \]
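The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the function names `naive_bayes_score` and `predict`, the class labels `"a"` and `"b"`, and all the numbers are made up for the example.

```python
import math

def naive_bayes_score(prior, conditionals):
    """Return P(C_j) * prod_i P(A_i | C_j) for a single class."""
    return prior * math.prod(conditionals)

def predict(priors, conditionals_by_class):
    """Score every class and return (best_class, all_scores)."""
    scores = {
        cls: naive_bayes_score(priors[cls], conditionals_by_class[cls])
        for cls in priors
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical two-class problem with two attributes (illustrative numbers only).
priors = {"a": 0.5, "b": 0.5}
conditionals_by_class = {
    "a": [0.2, 0.5],  # P(A_1 | a), P(A_2 | a)
    "b": [0.4, 0.5],  # P(A_1 | b), P(A_2 | b)
}

best, scores = predict(priors, conditionals_by_class)
print(best, scores)
```

The scores are not probabilities, because the common denominator was dropped. Only their ranking matters.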

How to estimate probabilities from data?

Let's see how to estimate $P(A_i | C_j)$, using the data below as an example.

[Figure: sample data for the naive Bayes classifier]

Discrete Attributes

\[ P(A_i | C_k) = \cfrac{|A_{ik}|}{N_c} \]

$|A_{ik}|$: the number of instances of class $C_k$ whose attribute takes the value $A_i$.

$N_c$: the number of instances whose class is $C_k$.

For example, let's compute $P(\text{Marital Status = Married} | \text{No})$.

Four instances have class No and Status=Married (Tid=2, 4, 6, 9), and seven instances have class No (Tid=1, 2, 3, 4, 6, 7, 9). Therefore $P(\text{Marital Status = Married} | \text{No}) = 4/7$.

※ This is computed exactly like an ordinary conditional probability problem.
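The counting can be sketched as follows. The `records` list is not the full sample table. It keeps only the Tid, a married flag, and the class, reconstructed from the counts quoted in the text (class No for Tid 1, 2, 3, 4, 6, 7, 9; Married among those for Tid 2, 4, 6, 9; no Married instance in class Yes). The function name `p_married_given` is made up for this example.

```python
from fractions import Fraction

# (tid, is_married, evade) triples reconstructed from the counts in the text.
records = [
    (1, False, "No"), (2, True, "No"), (3, False, "No"), (4, True, "No"),
    (5, False, "Yes"), (6, True, "No"), (7, False, "No"), (8, False, "Yes"),
    (9, True, "No"), (10, False, "Yes"),
]

def p_married_given(cls):
    """Estimate P(Marital Status = Married | cls) by counting."""
    in_class = [r for r in records if r[2] == cls]
    married = [r for r in in_class if r[1]]
    return Fraction(len(married), len(in_class))

print(p_married_given("No"))   # exact fraction for class No
print(p_married_given("Yes"))  # exact fraction for class Yes
```

Using `Fraction` keeps the result exact, so it can be compared directly with the hand calculation.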

Continuous Attributes

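For a continuous attribute, we usually assume that it follows a normal distribution within each class.

For each $(A_i, C_j)$ pair, the conditional probability $P(A_i | C_j)$ (strictly speaking, a probability density) is computed as follows.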

\[ P(A_i | C_j) = \cfrac{1}{\sqrt{2 \pi \sigma_{ij}^2}} \text{exp} \left( -\cfrac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2} \right) \]

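Here $\mu_{ij}$ and $\sigma_{ij}^2$ are the sample mean and the sample variance (the Bessel-corrected variance, i.e., the one that divides by $n-1$) of attribute $A_i$ over the instances of class $C_j$.

For example, let's compute $P(\text{Income} | \text{No})$.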

$\mu_{(\text{Income}, \text{No})} = (125+100+70+120+60+220+75)/7 = 110$ (K)

$\sigma_{(\text{Income}, \text{No})} = \sqrt{17850/6}=\sqrt{2975}=54.54$ (K)

So for Tid=4 (Income=120K), the value is

$P(\text{Income}=120K | \text{No}) = \cfrac{1}{\sqrt{2 \pi}(54.54)} \text{exp} \left( -\cfrac{(120-110)^2}{2(2975)} \right) = 0.0072$
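The same numbers can be checked with Python's standard library. In this sketch, the income list is the one used in the mean calculation above, and `gaussian_density` is a name chosen for the example. `statistics.variance` divides by $n-1$, which matches the Bessel-corrected variance used here.

```python
import math
import statistics

# Incomes (in K) of the seven class-No instances, as listed in the text.
incomes_no = [125, 100, 70, 120, 60, 220, 75]

mu = statistics.mean(incomes_no)       # sample mean
var = statistics.variance(incomes_no)  # sample variance (divides by n - 1)

def gaussian_density(x, mu, var):
    """Normal density with mean mu and variance var, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

density = gaussian_density(120, mu, var)
print(mu, var, round(math.sqrt(var), 2), round(density, 4))
```

Note that the result is a density value, not a probability, so it is only meaningful when compared across classes for the same attribute value.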

Example

Let's predict the Evade class for $X = (\text{Refund=No, Marital Status = Married, Income=120K})$.

The conditional probabilities for the three attributes are as follows.

[Figure: conditional probabilities for the instance $X = (\text{Refund=No, Marital Status = Married, Income=120K})$]

\begin{align*} P(X | \text{No}) &= P(\text{Refund=No} | \text{No}) \times P(\text{Marital Status=Married} | \text{No}) \times P(\text{Income=120K} | \text{No}) \\ &= (4/7)(4/7)(0.0072) \\ &= 0.0024 \end{align*}

\begin{align*} P(X | \text{Yes}) &= P(\text{Refund=No} | \text{Yes}) \times P(\text{Marital Status=Married} | \text{Yes}) \times P(\text{Income=120K} | \text{Yes}) \\ &= (1)(0)(1.2 \times 10^{-9}) \\ &= 0 \end{align*}

$P(\text{No} | X) \propto P(X|\text{No})P(\text{No}) = (0.0024)(7/10) > 0$

$P(\text{Yes} | X) \propto P(X|\text{Yes})P(\text{Yes}) = (0)(3/10)=0$

Therefore the naive Bayes classifier predicts class No for the tuple $X$.
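As a quick check, the comparison can be reproduced with the probabilities quoted above. This sketch also divides each score by their sum, which restores the denominator $P(X)$ and turns the scores into posteriors. The variable names are chosen for the example only.

```python
# Conditional probabilities and priors taken from the worked example above.
scores = {
    "No": (4 / 7) * (4 / 7) * 0.0072 * (7 / 10),
    "Yes": 1 * 0 * 1.2e-9 * (3 / 10),
}

# P(X) is the sum of the unnormalized scores over all classes.
evidence = sum(scores.values())
posteriors = {cls: score / evidence for cls, score in scores.items()}

prediction = max(posteriors, key=posteriors.get)
print(scores, posteriors, prediction)
```

Because the score for Yes is exactly 0, the normalized posterior puts all of the mass on No, which illustrates how a single zero term dominates the result.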

This calculation exposes a problem, though. Because the score is a product of terms, a single conditional probability of $0$ ($P(A_i | C_j) = 0$) makes the whole product $0$, regardless of the other terms.

M-Estimate

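Three estimators are commonly used for the conditional probabilities. Below, $N_{ic}$ is the number of instances of class $C$ with attribute value $A_i$, and $N_c$ is the number of instances of class $C$.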

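$c$: the number of distinct values the attribute $A_i$ can take (adding one pseudo-count per value keeps the estimates summing to 1 over those values)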

$p$: prior probability

$m$: a parameter expressing how much confidence is placed in the prior (chosen by the user)

Original: $P(A_i|C)= \cfrac{N_{ic}}{N_c}$

Laplace: $P(A_i | C) = \cfrac{N_{ic} + 1}{N_c + c}$

m-estimate: $P(A_i | C) = \cfrac{N_{ic}+mp}{N_c + m}$

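Let's compute $P(\text{Married} | \text{Yes})$ with the m-estimate.

We have no information about the prior $p$ (here, $P(\text{Married})$), so we follow the usual convention of assuming a uniform distribution over the attribute's values. That gives $p = P(\text{Status=Married}) = 1/3$.

Taking $m=4$, we get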

\[ P(\text{Married} | \text{Class=Yes}) = \cfrac{0 + (4)(1/3)}{3+4}=0.19 \]
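A minimal sketch of the unsmoothed estimate and the m-estimate, applied to the same counts ($N_{ic}=0$, $N_c=3$, $m=4$, $p=1/3$). The function names are made up for the example, and the other two factors of the Yes likelihood are the values quoted in the worked example.

```python
def relative_frequency(n_ic, n_c):
    """Original estimate: N_ic / N_c."""
    return n_ic / n_c

def m_estimate(n_ic, n_c, m, p):
    """m-estimate: (N_ic + m * p) / (N_c + m)."""
    return (n_ic + m * p) / (n_c + m)

n_ic, n_c = 0, 3  # no Married instance among the three class-Yes instances

raw = relative_frequency(n_ic, n_c)
smoothed = m_estimate(n_ic, n_c, m=4, p=1 / 3)

# Effect on P(X | Yes) = P(Refund=No | Yes) * P(Married | Yes) * P(Income=120K | Yes)
likelihood_yes_raw = 1 * raw * 1.2e-9
likelihood_yes_smoothed = 1 * smoothed * 1.2e-9

print(raw, round(smoothed, 2), likelihood_yes_raw, likelihood_yes_smoothed)
```

With the smoothed estimate the Yes likelihood is small but no longer exactly zero, so the other attributes still contribute to the comparison.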

Summary of Naive Bayes

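A small number of noisy instances barely changes the estimated probabilities. (Robust to isolated noise points)
Missing values are handled by ignoring them during probability estimation.
Robust to irrelevant attributes
Some attributes may violate the independence assumption, in which case the naive Bayes classifier is better avoided. (For example, Height and Weight are fairly strongly correlated.)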


궁금한게많은joon
