[Bayesian] Bayesian Linear Regression


Bayesian Linear Regression

Prior

Because the likelihood is Gaussian, we take the prior to be Gaussian as well (the conjugate choice).

\[ p(\theta)=\mathcal{N}(\theta|m, S) \]

Here $m$ is the prior mean and $S$ is the prior covariance matrix.

Posterior

With a Gaussian prior and a Gaussian likelihood, the posterior distribution of $\theta$ is also Gaussian.

\begin{align} p(\theta | \mathcal{D}) &= \mathcal{N}(\theta|m_{\mathcal{D}}, S_{\mathcal{D}}) \\ S_{\mathcal{D}}^{-1} &= S^{-1} + \Phi^\top\Phi / \sigma^2 \\ m_{\mathcal{D}} &= S_{\mathcal{D}}(S^{-1}m + \Phi^\top y / \sigma^2) \end{align}
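The posterior update above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `posterior`, the toy data, and the feature map $\phi(x) = [1, x]$ are hypothetical and not part of the original derivation.

```python
import numpy as np

def posterior(Phi, y, m, S, sigma2):
    """Return (m_D, S_D) of the Gaussian posterior over theta."""
    S_inv = np.linalg.inv(S)
    # Posterior precision: S_D^{-1} = S^{-1} + Phi^T Phi / sigma^2
    S_D_inv = S_inv + Phi.T @ Phi / sigma2
    S_D = np.linalg.inv(S_D_inv)
    # Posterior mean: m_D = S_D (S^{-1} m + Phi^T y / sigma^2)
    m_D = S_D @ (S_inv @ m + Phi.T @ y / sigma2)
    return m_D, S_D

# Toy data (hypothetical): y = 1 + 2x + noise, features phi(x) = [1, x]
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
Phi = np.column_stack([np.ones_like(x), x])
sigma2 = 0.1 ** 2
y = Phi @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.1, size=50)

m_D, S_D = posterior(Phi, y, m=np.zeros(2), S=np.eye(2), sigma2=sigma2)
print("posterior mean:", m_D)
print("posterior covariance:\n", S_D)
```

With enough data the posterior mean should land close to the generating weights, and the posterior covariance shrinks as more rows are added to `Phi`.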

Prediction

Given a new input $x_*$, a point prediction is obtained by passing it through the feature map: $\hat{\theta}^\top \phi(x_*)$.

The predictive distribution, in contrast, averages over the posterior of $\theta$:

\begin{align} p(y_*|x_*,\mathcal{D}) &= \mathbb{E}_{p(\theta|\mathcal{D})}[p(y_*|x_*, \theta)] \\ &= \int p(y_*|x_*, \theta) p(\theta|\mathcal{D}) \mathrm{d}\theta \\ &= \int \mathcal{N}(y_*|\theta^\top \phi(x_*), \sigma^2) \mathcal{N}(\theta|m_{\mathcal{D}}, S_{\mathcal{D}}) \mathrm{d}\theta \end{align}

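Using the standard properties of Gaussian distributions (a linear function of a Gaussian plus independent Gaussian noise is again Gaussian), we obtain the following distribution.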

\begin{align} p(y_*|x_*, \mathcal{D}) &= \mathcal{N}(y_*|m_*, \sigma_*^2) \\ m_* &= m_{\mathcal{D}}^\top \phi(x_*) \\ \sigma_*^2 &= \sigma^2 + \phi(x_*)^\top S_{\mathcal{D}} \phi(x_*) \end{align}
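As a small sketch of these two formulas, the snippet below evaluates the predictive mean and variance for a single feature vector. The posterior values `m_D`, `S_D` and the noise variance are made up for illustration, and `predictive` is a hypothetical helper name.

```python
import numpy as np

def predictive(phi_star, m_D, S_D, sigma2):
    """Return (m_*, sigma_*^2) for a single feature vector phi(x_*)."""
    mean = m_D @ phi_star
    # Noise variance plus the uncertainty inherited from theta
    var = sigma2 + phi_star @ S_D @ phi_star
    return mean, var

# Hypothetical posterior, for illustration only
m_D = np.array([1.0, 2.0])
S_D = np.array([[0.02, 0.0], [0.0, 0.05]])
sigma2 = 0.01

near = predictive(np.array([1.0, 0.0]), m_D, S_D, sigma2)  # phi(x_*) for x_* = 0
far = predictive(np.array([1.0, 3.0]), m_D, S_D, sigma2)   # phi(x_*) for x_* = 3
print(near)  # mean 1.0, variance 0.03
print(far)   # mean 7.0, variance 0.48
```

The second term of the variance depends on $\phi(x_*)$, which is why the predictive uncertainty differs between the two inputs even though $\sigma^2$ is the same.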

Proof of posterior distribution

Let the input dataset be $X=[x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times d}$ and the labels be $y=[y_1, \dots, y_n]^\top \in \mathbb{R}^{n}$. Let the feature map of Bayesian linear regression be $\phi: \mathbb{R}^d \to \mathbb{R}^h$, and suppose the prior and likelihood are as follows.

\[ \theta \sim \mathcal{N}(\theta|0, (\sigma^2 / \lambda)I_h), \quad y_i|x_i, \theta \overset{\text{ind.}}{\sim} \mathcal{N}(y_i|\theta^\top \phi(x_i), \sigma^2) \text{ for } i=1, \dots, n \]

We now derive $\mu_n$ and $\Sigma_n$ of the posterior $p(\theta|X,y)=\mathcal{N}(\theta|\mu_n, \Sigma_n)$.

Let $\Phi = [\phi(x_1), \dots, \phi(x_n)]^\top \in \mathbb{R}^{n \times h}$.

The log joint density of $y$ and $\theta$ is

\begin{align} \log p(y, \theta | X) &= -\cfrac{1}{2\sigma^2} \theta^\top \Phi^\top \Phi \theta + \cfrac{\theta^\top \Phi^\top y}{\sigma^2} - \cfrac{\lambda \theta^\top \theta}{2 \sigma^2} + \text{const} \\ &= -\cfrac{1}{2} \theta^\top \left( \cfrac{\Phi^\top \Phi}{\sigma^2} + \cfrac{\lambda}{\sigma^2}I_h \right)\theta + \cfrac{\theta^\top \Phi^\top y}{\sigma^2} + \text{const} \\ &= -\cfrac{1}{2} \theta^\top \Sigma_n^{-1}\theta + \theta^\top \Sigma_n^{-1} \mu_n + \text{const} \end{align}

Comparing the coefficients of the quadratic and linear terms in $\theta$ in the last two lines gives

\[ \Sigma_n = \sigma^2 \left( \Phi^\top \Phi + \lambda I_h \right)^{-1}, \quad \mu_n = \left( \Phi^\top \Phi + \lambda I_h \right)^{-1}\Phi^\top y \]
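As a numerical sanity check, the closed form can be compared with the general posterior formulas evaluated at $m = 0$, $S = (\sigma^2/\lambda) I_h$. The sketch below uses random, hypothetical data; note that $\mu_n$ has the same form as the ridge regression estimate with penalty $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h = 30, 4
Phi = rng.normal(size=(n, h))   # hypothetical feature matrix
y = rng.normal(size=n)          # hypothetical labels
sigma2, lam = 0.5, 2.0

# Closed form from comparing coefficients
A = Phi.T @ Phi + lam * np.eye(h)
mu_n = np.linalg.solve(A, Phi.T @ y)
Sigma_n = sigma2 * np.linalg.inv(A)

# General posterior formulas with m = 0 and S = (sigma^2 / lambda) I
S_inv = (lam / sigma2) * np.eye(h)
S_D = np.linalg.inv(S_inv + Phi.T @ Phi / sigma2)
m_D = S_D @ (Phi.T @ y / sigma2)

print(np.allclose(mu_n, m_D), np.allclose(Sigma_n, S_D))
```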

Figure: Sequential Bayesian linear regression (from PRML).

MLE vs Bayesian

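The figure below compares MLE and Bayesian linear regression.

(a) Predictive density obtained by plugging in the MLE. The error bars ($\pm 2\sigma$) have the same width regardless of the value of $x$.

(b) Mean of the posterior predictive density, with error bars. In regions where no $x$ was observed, the error bars widen (the uncertainty grows).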
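The contrast between the two panels can be reproduced numerically. The sketch below is an assumption-laden illustration, not the code behind the figure: the quadratic feature map, the prior $\mathcal{N}(0, I)$, the known noise variance, and the synthetic data are all hypothetical choices.

```python
import numpy as np

def features(t):
    """Hypothetical quadratic feature map phi(x) = [1, x, x^2]."""
    return np.column_stack([np.ones_like(t), t, t ** 2])

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=20)   # training inputs lie only in [-1, 1]
Phi = features(x)
sigma2 = 0.04
y = np.sin(2 * x) + rng.normal(0.0, 0.2, size=20)

# (a) MLE plug-in: one variance estimate, used for every x
theta_mle = np.linalg.lstsq(Phi, y, rcond=None)[0]
sigma2_mle = np.mean((y - Phi @ theta_mle) ** 2)

# (b) Bayesian: prior N(0, I), noise variance assumed known
S_D = np.linalg.inv(np.eye(3) + Phi.T @ Phi / sigma2)

x_test = np.array([0.0, 1.0, 3.0])    # the last point is far from the data
Phi_test = features(x_test)
var_mle = np.full(len(x_test), sigma2_mle)
var_bayes = sigma2 + np.einsum("ij,jk,ik->i", Phi_test, S_D, Phi_test)

print("MLE variance:     ", var_mle)
print("Bayesian variance:", var_bayes)
```

The plug-in variance is the same number at every test point, whereas the Bayesian variance carries the extra term $\phi(x_*)^\top S_{\mathcal{D}} \phi(x_*)$, which grows as $x_*$ moves away from the training inputs.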

Figure source: Machine Learning: A Probabilistic Perspective (MLAPP)

Marginal likelihood

The marginal likelihood of $\mathcal{D}$ is obtained by integrating out $\theta$:

\begin{align} p(y|X) &= \int p(y|X, \theta) p(\theta) \mathrm{d} \theta \\ &= \int \prod_{i=1}^{n}\mathcal{N}(y_i|\theta^\top \phi(x_i), \sigma^2) \mathcal{N}(\theta|m, S) \mathrm{d} \theta \\ &= (2\pi\sigma^2)^{-n/2} \cfrac{|S_{\mathcal{D}}|^{1/2}}{|S|^{1/2}} \exp \left( \cfrac{1}{2}m_{\mathcal{D}}^\top S_{\mathcal{D}}^{-1} m_{\mathcal{D}} -\cfrac{1}{2}m^\top S^{-1}m - \cfrac{y^\top y}{2 \sigma^2} \right) \end{align}
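One way to check this closed form is to compare it with the density of the marginal $y \sim \mathcal{N}(\Phi m, \sigma^2 I_n + \Phi S \Phi^\top)$, which follows from the same Gaussian properties. The sketch below does this in log space on random, hypothetical data; both function names are illustrative.

```python
import numpy as np

def log_evidence(Phi, y, m, S, sigma2):
    """Log of the closed-form marginal likelihood."""
    n = len(y)
    S_inv = np.linalg.inv(S)
    S_D_inv = S_inv + Phi.T @ Phi / sigma2
    m_D = np.linalg.solve(S_D_inv, S_inv @ m + Phi.T @ y / sigma2)
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_S_D_inv = np.linalg.slogdet(S_D_inv)
    return (-0.5 * n * np.log(2 * np.pi * sigma2)
            - 0.5 * logdet_S_D_inv - 0.5 * logdet_S   # log(|S_D|^{1/2} / |S|^{1/2})
            + 0.5 * m_D @ S_D_inv @ m_D
            - 0.5 * m @ S_inv @ m
            - 0.5 * y @ y / sigma2)

def log_evidence_direct(Phi, y, m, S, sigma2):
    """Log density of y under N(Phi m, sigma^2 I + Phi S Phi^T)."""
    n = len(y)
    C = sigma2 * np.eye(n) + Phi @ S @ Phi.T
    r = y - Phi @ m
    _, logdet_C = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet_C + r @ np.linalg.solve(C, r))

rng = np.random.default_rng(3)
Phi = rng.normal(size=(15, 3))        # hypothetical features
y = rng.normal(size=15)               # hypothetical labels
m = np.array([0.5, -1.0, 0.0])
S = np.diag([1.0, 2.0, 0.5])

closed = log_evidence(Phi, y, m, S, 0.3)
direct = log_evidence_direct(Phi, y, m, S, 0.3)
print(closed, direct)
```

The closed form only needs an $h \times h$ solve, while the direct form works with an $n \times n$ covariance, so the former is usually preferable when $n \gg h$.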

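Here $\phi := \{ \sigma^2, m, S \}$ denotes the hyperparameters (not to be confused with the feature map $\phi(\cdot)$). How should these three quantities be chosen?

Cross-validation
Empirical Bayes: choose the values that maximize the marginal likelihood, $\phi_* = \underset{\phi}{\text{argmax}} \log p(y|X; \phi)$.
Full Bayesian (hierarchical Bayesian): introduce a hyper-prior $p(\phi; \eta)$ and compute the posterior over the hyperparameters.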

\[ p(\phi|\mathcal{D}; \eta) = \cfrac{p(y|X, \phi) p(\phi;\eta)}{p(y|X; \eta)} \]

Empirical Bayes

Restrict the prior to a zero-mean isotropic Gaussian, so that only $\sigma^2$ and $\rho^2$ remain as hyperparameters:

\[ p(\theta) = \mathcal{N}(\theta|0, \rho^2 I) \]

Then the log marginal likelihood (log evidence) is

\[ \log p(y|X; \phi) = -\cfrac{n}{2}\log(2\pi \sigma^2) + \cfrac{1}{2}\log |S_{\mathcal{D}}| - \cfrac{h}{2}\log \rho^2 + \cfrac{1}{2}m_{\mathcal{D}}^\top S_{\mathcal{D}}^{-1}m_{\mathcal{D}} - \cfrac{y^\top y}{2 \sigma^2} \]

where $S_{\mathcal{D}}^{-1}=I_h/\rho^2 + \Phi^\top \Phi / \sigma^2, \quad m_{\mathcal{D}}=S_{\mathcal{D}}\Phi^\top y / \sigma^2$.

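$(\sigma^2, \rho^2)$ can then be optimized with a gradient-based method (gradient ascent on the log evidence, or equivalently gradient descent on its negative).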
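A minimal sketch of empirical Bayes for this restricted prior is given below. To stay dependency-free it swaps the gradient-based optimizer for a plain grid search over $(\sigma^2, \rho^2)$; the grid, the synthetic data, and the function name `log_evidence` are assumptions made for illustration. The prior term uses $\log |\rho^2 I_h|^{1/2} = \frac{h}{2} \log \rho^2$.

```python
import numpy as np

def log_evidence(Phi, y, sigma2, rho2):
    """Log evidence for the prior N(0, rho2 * I) and noise variance sigma2."""
    n, h = Phi.shape
    S_D_inv = np.eye(h) / rho2 + Phi.T @ Phi / sigma2
    m_D = np.linalg.solve(S_D_inv, Phi.T @ y / sigma2)
    _, logdet_S_D_inv = np.linalg.slogdet(S_D_inv)
    return (-0.5 * n * np.log(2 * np.pi * sigma2)
            - 0.5 * logdet_S_D_inv          # +1/2 log|S_D|
            - 0.5 * h * np.log(rho2)        # -1/2 log|rho2 * I_h|
            + 0.5 * m_D @ S_D_inv @ m_D
            - 0.5 * y @ y / sigma2)

# Synthetic data (hypothetical): quadratic features, true noise variance 0.09
rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=200)
Phi = np.column_stack([np.ones_like(x), x, x ** 2])
y = Phi @ np.array([0.5, -1.0, 2.0]) + rng.normal(0.0, 0.3, size=200)

# Grid search stands in for gradient-based optimization
grid = np.logspace(-3, 2, 41)
scores = np.array([[log_evidence(Phi, y, s2, r2) for r2 in grid] for s2 in grid])
i, j = np.unravel_index(np.argmax(scores), scores.shape)
sigma2_hat, rho2_hat = grid[i], grid[j]
print("sigma^2 estimate:", sigma2_hat, " rho^2 estimate:", rho2_hat)
```

Searching on a logarithmic grid reflects that both hyperparameters are positive scale parameters; a gradient-based optimizer would typically work on $\log \sigma^2$ and $\log \rho^2$ for the same reason.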

In the full Bayesian treatment, an inverse-gamma distribution is commonly used as the prior on the noise variance $\sigma^2$.

\[ p(\sigma^2) = \text{iGamma}(\sigma^2|a,b)=\cfrac{b^a (\sigma^2)^{-a-1}e^{-b/\sigma^2}}{\Gamma(a)} \]

When both $\theta$ and $\sigma^2$ are unknown, the conjugate prior for the Gaussian likelihood is the Gaussian-inverse-gamma prior:

\[ p(\theta, \sigma^2) = \mathcal{N}(\theta|m, \sigma^2V) \text{iGamma}(\sigma^2|a, b) \]

The posterior is then again Gaussian-inverse-gamma.

\[ p(\theta, \sigma^2|\mathcal{D}) = \mathcal{N}(\theta|m_{\mathcal{D}}, \sigma^2 V_{\mathcal{D}}) \text{iGamma}(\sigma^2|a_{\mathcal{D}},b_{\mathcal{D}}) \]

Here $V_{\mathcal{D}}^{-1}, m_{\mathcal{D}}, a_{\mathcal{D}}, b_{\mathcal{D}}$ are given below, where $n$ is the number of observations.

\begin{align} V_{\mathcal{D}}^{-1} &= V^{-1} + \Phi^\top \Phi \\ m_{\mathcal{D}} &= V_{\mathcal{D}}(V^{-1}m + \Phi^\top y) \\ a_{\mathcal{D}} &= a + \cfrac{n}{2} \\ b_{\mathcal{D}} &= b + \cfrac{1}{2}(m^\top V^{-1}m + y^\top y - m_{\mathcal{D}}^\top V_{\mathcal{D}}^{-1} m_{\mathcal{D}}) \end{align}
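The Gaussian-inverse-gamma update can be sketched as follows. The helper name `nig_posterior`, the prior values, and the synthetic data are hypothetical; the shape parameter is incremented by half the number of observations, $n/2$, as in the standard conjugate update.

```python
import numpy as np

def nig_posterior(Phi, y, m, V, a, b):
    """Posterior parameters (m_D, V_D, a_D, b_D) of the Gaussian-inverse-gamma."""
    n = len(y)                                   # number of observations
    V_inv = np.linalg.inv(V)
    V_D_inv = V_inv + Phi.T @ Phi
    V_D = np.linalg.inv(V_D_inv)
    m_D = V_D @ (V_inv @ m + Phi.T @ y)
    a_D = a + n / 2
    b_D = b + 0.5 * (m @ V_inv @ m + y @ y - m_D @ V_D_inv @ m_D)
    return m_D, V_D, a_D, b_D

# Synthetic data (hypothetical): true noise variance 0.25
rng = np.random.default_rng(5)
Phi = rng.normal(size=(500, 3))
y = Phi @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.5, size=500)

m_D, V_D, a_D, b_D = nig_posterior(Phi, y, m=np.zeros(3), V=np.eye(3), a=2.0, b=1.0)
sigma2_mean = b_D / (a_D - 1)    # mean of iGamma(a_D, b_D), defined for a_D > 1
print("posterior mean of theta:", m_D)
print("posterior mean of sigma^2:", sigma2_mean)
```

Unlike the known-variance case, the noise variance is inferred here, so the posterior over $\sigma^2$ should concentrate near the variance used to generate the data as $n$ grows.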


궁금한게많은joon
