[Data Science] Logistic Regression


Class Probability Estimation

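In many tasks, we want to predict which class a given instance belongs to.

For example, fraud detection is an important issue in banking and commerce.

Fortunately, a linear model can be used to estimate the probability that an instance belongs to one class of a binary problem.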

\[ f(\mathbf{x}) = w_0 + w_1 x_1 + \cdots + w_n x_n \]

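However, the class probability we want to predict lies in $[0, 1]$, whereas $f(\mathbf{x})$ ranges over $(-\infty, \infty)$.

To resolve this mismatch, we introduce the log-odds.

The odds express the likelihood of an event as the ratio of the probability that it occurs to the probability that it does not, that is, $\cfrac{p_+(\mathbf{x})}{1 - p_+(\mathbf{x})}$. For $p_+(\mathbf{x}) \in [0, 1)$ the odds range over $[0, \infty)$, growing without bound as $p_+(\mathbf{x}) \to 1$.

Taking the logarithm of the odds (usually the natural logarithm, base $e$) gives the log-odds: $\ln\left( \cfrac{p_+(\mathbf{x}) }{1-p_+(\mathbf{x})} \right) \in (-\infty, \infty)$ for $p_+(\mathbf{x}) \in (0, 1)$.

Each step of this composition is a one-to-one, increasing function.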

Probabilities, odds, and the corresponding log-odds
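The mapping from probability to odds to log-odds can be tabulated with a few lines of Python. This is a minimal sketch: the helper names `odds` and `log_odds` and the sample probabilities are illustrative choices, not taken from the original figure.

```python
import math

def odds(p: float) -> float:
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)

def log_odds(p: float) -> float:
    """Natural logarithm of the odds (requires 0 < p < 1)."""
    return math.log(odds(p))

# Illustrative probabilities (assumed values, chosen to span both tails).
probabilities = [0.5, 0.9, 0.999, 0.01, 0.001]

print(f"{'probability':>12} {'odds':>10} {'log-odds':>10}")
for p in probabilities:
    print(f"{p:>12.3f} {odds(p):>10.3f} {log_odds(p):>10.3f}")
```

A probability of 0.5 corresponds to odds of 1 and log-odds of 0. Probabilities above 0.5 give positive log-odds and probabilities below 0.5 give negative log-odds.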

Therefore, we can set up a regression that uses the linear model $f(\mathbf{x})$ to predict the log-odds.

Note: the method is called logistic "regression" because it predicts the log-odds, but since the actual target variable is categorical, it is used for classification.

Logistic Regression

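$\mathbf{x}$: the feature vector, for example the data describing an instance whose class membership is being estimated.

$p_+(\mathbf{x})$: the probability that the binary event $(+)$ occurs. $1-p_+(\mathbf{x})$ is the probability that the event does not occur.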

\[ \ln\left( \cfrac{p_+(\mathbf{x}) }{1-p_+(\mathbf{x})} \right) = f(\mathbf{x}) = w_0 + w_1 x_1 + \dots + w_n x_n \]

Solving this equation for $p_+(\mathbf{x})$ gives

\[ p_+(\mathbf{x}) = \cfrac{1}{1 + e^{-f(\mathbf{x})}} \]

This is exactly the sigmoid function applied to $f(\mathbf{x})$.
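The intermediate steps are easy to skip over, so here they are spelled out. For brevity, $p$ stands for $p_+(\mathbf{x})$ and $f$ for $f(\mathbf{x})$ (shorthand introduced only for this derivation).

\[ \begin{aligned} \ln\left( \cfrac{p}{1-p} \right) &= f \\ \cfrac{p}{1-p} &= e^{f} \\ p &= e^{f}(1 - p) \\ p\,(1 + e^{f}) &= e^{f} \\ p &= \cfrac{e^{f}}{1 + e^{f}} = \cfrac{1}{1 + e^{-f}} \end{aligned} \]

The last equality follows from dividing the numerator and denominator by $e^{f}$.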

Objective Function

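Ideally, we want $p_+(\mathbf{x}_+) = 1$ for a positive instance $\mathbf{x}_+$ and $p_+(\mathbf{x}_-) = 0$ for a negative instance $\mathbf{x}_-$. With real data, however, the estimated probabilities are never exactly 0 or 1. Even so, we want $p_+(\mathbf{x}_+)$ to be as close to $1$ as possible and $p_+(\mathbf{x}_-)$ to be as close to $0$ as possible. The function below gives the probability that the model assigns to the actual class of $\mathbf{x}$, so larger values are better.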

\[ g(\mathbf{x}, \mathbf{w}) = \begin{cases} p_+(\mathbf{x}), & \text{if } \mathbf{x} \text{ is a +} \\ 1 - p_+(\mathbf{x}), & \text{if } \mathbf{x} \text{ is a -} \end{cases} \]
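The text stops at the definition of $g$. One common way to turn it into a single objective is to sum $\ln g(\mathbf{x}, \mathbf{w})$ over the training set (the log-likelihood) and prefer the weights with the larger sum. This choice is an assumption here, not something the text states. The sketch below illustrates the idea; the toy data, the two weight vectors, and all function names are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w):
    """p_+(x) for weights w = [w0, w1, ..., wn] and features x = [x1, ..., xn]."""
    f = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return sigmoid(f)

def g(x, w, is_positive: bool) -> float:
    """Probability the model assigns to the actual class of x."""
    p = predict_proba(x, w)
    return p if is_positive else 1.0 - p

def log_likelihood(data, w) -> float:
    """Sum of ln g over all labeled instances (higher is better)."""
    return sum(math.log(g(x, w, y)) for x, y in data)

# Hypothetical toy data: (features, is_positive).
data = [([0.5], False), ([1.0], False), ([2.0], True), ([3.0], True)]

w_flat = [0.0, 0.0]     # predicts 0.5 for every instance
w_sloped = [-3.0, 2.0]  # low p_+ for small x1, high p_+ for large x1

print("flat  :", log_likelihood(data, w_flat))
print("sloped:", log_likelihood(data, w_sloped))
```

The sloped weights assign a higher probability to each instance's actual class, so their log-likelihood is larger than that of the flat weights.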

Interpreting Logistic Regression

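How, then, should we interpret the coefficients $w_1, w_2, \dots, w_n$ obtained from logistic regression?

For simplicity, let $f(\mathbf{x}) = w_0 + w_1 x_1$.

When $x_1$ increases by one unit, the log-odds $f(\mathbf{x})$ increase by $w_1$.

This is equivalent to multiplying the odds by $e^{w_1}$ (adding to a logarithm corresponds to multiplying its argument).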

Suppose the logistic regression of 10-year heart disease risk on smoking is as follows.

\[ f(x) = -1.93 + 0.38x_1 \]

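Since $e^{0.38} \approx 1.46$, we interpret the coefficient as follows.

The smoking group has 1.46 times the odds of heart disease of the non-smoking group.
Equivalently, the odds of heart disease are 46% higher in the smoking group than in the non-smoking group.
Note that these statements are about odds, not probabilities: the ratio of the two groups' probabilities is generally not $e^{w_1}$.

If $e^{w}$ is less than 1, the interpretation is reversed: a one-unit increase in the predictor lowers the odds.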
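The difference between an odds ratio and a probability ratio can be checked numerically with the coefficients above. The following is a minimal sketch that assumes $x_1 = 1$ encodes a smoker and $x_1 = 0$ a non-smoker; that coding and the variable names are assumptions made for illustration.

```python
import math

W0, W1 = -1.93, 0.38  # coefficients from f(x) = -1.93 + 0.38 * x1

def log_odds(x1: float) -> float:
    return W0 + W1 * x1

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Assumed coding: x1 = 0 for non-smokers, x1 = 1 for smokers.
odds_non_smoker = math.exp(log_odds(0))
odds_smoker = math.exp(log_odds(1))
odds_ratio = odds_smoker / odds_non_smoker

p_non_smoker = sigmoid(log_odds(0))
p_smoker = sigmoid(log_odds(1))
probability_ratio = p_smoker / p_non_smoker

print(f"odds ratio        : {odds_ratio:.2f}")
print(f"p(non-smoker)     : {p_non_smoker:.3f}")
print(f"p(smoker)         : {p_smoker:.3f}")
print(f"probability ratio : {probability_ratio:.2f}")
```

The odds ratio equals $e^{0.38}$, while the probability ratio comes out smaller (roughly 1.38 with these coefficients). This is why the coefficient should be read as a statement about odds.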

References

WHAT and WHY of Log Odds (towardsdatascience.com): https://towardsdatascience.com/https-towardsdatascience-com-what-and-why-of-log-odds-64ba988bf704

Interpret Logistic Regression Coefficients [For Beginners] (quantifyinghealth.com): https://quantifyinghealth.com/interpret-logistic-regression-coefficients/


궁금한게많은joon
