Double Descent: new approach of bias-variance trade-off


Double Descent: Modern approach of bias-variance trade-off

From the classic ML point of view, the bias-variance trade-off is tied to model complexity.

See also the previous post on ways to prevent overfitting (regularization, cross-validation, early stopping).

https://trivia-starage.tistory.com/238


Test error is used here interchangeably with generalization error, and as a function of model complexity it traces a U-shaped curve.

(left) train/test error as a function of model complexity. (right) bias-variance curve as a function of model complexity. images from [2]

In classic ML, as model complexity grows, variance grows, and so does generalization error.
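The classical U-shaped behaviour can be reproduced with a small experiment. The following is a minimal sketch, not taken from the lecture material: the target function, noise level, sample size, and the use of polynomial degree as "model complexity" are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Hypothetical ground-truth function for this toy example.
    return np.sin(np.pi * x)

# Small noisy training set, large noise-free test grid (toy assumptions).
n_train = 15
x_train = np.sort(rng.uniform(-1.0, 1.0, size=n_train))
y_train = target(x_train) + 0.3 * rng.normal(size=n_train)
x_test = np.linspace(-1.0, 1.0, 400)
y_test = target(x_test)

degrees = list(range(1, 13))
train_err, test_err = [], []
for deg in degrees:
    # Least-squares polynomial fit in the Chebyshev basis
    # (better conditioned than raw powers of x).
    model = np.polynomial.Chebyshev.fit(x_train, y_train, deg)
    train_err.append(float(np.mean((model(x_train) - y_train) ** 2)))
    test_err.append(float(np.mean((model(x_test) - y_test) ** 2)))

best_degree = degrees[int(np.argmin(test_err))]
for deg, tr, te in zip(degrees, train_err, test_err):
    print(f"degree={deg:2d}  train MSE={tr:.4f}  test MSE={te:.4f}")
print("degree with the lowest test MSE:", best_degree)
```

Train error keeps shrinking as the degree grows, while test error is typically lowest at a moderate degree and rises again for the most flexible fits.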

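Modern deep learning, however, often does not follow this trend.

Even as a deep learning model becomes more complex, generalization error does not increase and can actually decrease.

These empirical results appear to contradict classical ML theory.

※ Anyone who has trained deep learning models has probably had the feeling that, with a truly large amount of data, things work well without overfitting. Papers [3] and [4] back up this intuition with theory and experiments. [4] in particular contains no theory and is filled entirely with a very large number of experiments and plots.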

Double descent

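The double descent curve plots the error against the ratio of model complexity to dataset size on the x-axis.

Let $p$ be the number of parameters of the model (usually a deep learning model) and $n$ the number of samples in the dataset.

If $p \ll n$, the model is called an under-parameterized function; if $p \gg n$, it is called an over-parameterized function.

$p = n$ is the interpolation point: the minimal capacity needed to fit (overfit) the training set exactly.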

double descent curve for bias-variance (image from [2])

When $p=n$, the model has just enough parameters to overfit the training data. However, its variance is very large, so it fails to generalize.

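However, once $p \gg n$, the model has far more parameters than training samples. In this regime the learner $f(x)$ still overfits (it interpolates the training data), but SGD drives the $L_2$ norm of the parameters, $\| \theta \|_2$, down substantially, which effectively reduces the model capacity.

This can be pictured as follows.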

larger space has high capacity (image from [2])
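The idea that the optimizer ends up at a small-norm solution can be checked exactly in a much simpler setting than a deep network. The sketch below is an assumption-laden toy: it uses plain full-batch gradient descent (not SGD) on an over-parameterized linear least-squares problem, started from zero. In that linear case gradient descent converges to the minimum $L_2$-norm interpolating solution, which is also what the pseudoinverse returns. The sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                      # p >> n: many interpolating solutions exist
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Closed-form minimum L2-norm solution of X @ theta = y.
theta_min_norm = np.linalg.pinv(X) @ y

# Gradient descent on 0.5 * ||X @ theta - y||^2, starting from zero.
theta = np.zeros(p)
lr = 1.0 / np.linalg.norm(X, 2) ** 2    # 1 / (largest singular value)^2
for _ in range(2000):
    theta -= lr * X.T @ (X @ theta - y)

# Another interpolating solution: add a component from the null space of X.
z = rng.normal(size=p)
null_part = z - np.linalg.pinv(X) @ (X @ z)
theta_other = theta_min_norm + null_part

print("train residual (GD):   ", np.linalg.norm(X @ theta - y))
print("||theta_GD||:          ", np.linalg.norm(theta))
print("||theta_min_norm||:    ", np.linalg.norm(theta_min_norm))
print("||theta_other||:       ", np.linalg.norm(theta_other))
```

Both `theta` and `theta_other` fit the training data exactly, but gradient descent picks the one with the smaller norm. Whether and how this carries over to SGD on deep networks is the claim made in the text, not something this toy proves.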

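SGD matters in deep learning for several reasons:

It escapes saddle points of the loss during optimization
It finds better local optima (or even global ones), which helps generalization
It speeds up computation
It is essential for the double descent phenomenon to occur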

Farewell to early stopping? No!

So is it enough to just build an over-parameterized function and throw it at the problem?

Does that make all the regularization covered earlier unnecessary?

No.

Double descent (generally) occurs only in exceedingly large networks.

The critical threshold on the number of parameters is $p^* = O(nk)$, where $k$ is the number of classes to predict.

Below are several examples.

For ImageNet, $n=10^6$ and $k=10^3$, so $p^*=10^9$.

ResNet-152, a common backbone in vision, has $p \approx 6 \times 10^7 < 10^9$ parameters.

For ViT, $n=10^9, \ k=10^4$, so $p^* = 10^{13}$.

ViT-22B has about $2.2 \times 10^{10}$ parameters, which is smaller than $p^*$.

For NLP, $n=10^{11}, \ k=10^4$, so $p^* = 10^{15}$.

GPT-3 has $p = 175\text{B} \approx 1.75 \times 10^{11}$ parameters, which is smaller than $p^*$.
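The arithmetic behind these examples is short enough to script. The sketch below assumes the constant hidden in $O(nk)$ is 1, so `critical_params` (a hypothetical helper name) simply returns $n \cdot k$. The parameter counts are the nominal model sizes.

```python
# (n, k, p): dataset size, number of classes, parameter count of the model
cases = {
    "ImageNet / ResNet-152": (10**6, 10**3, 60_000_000),
    "ViT / ViT-22B": (10**9, 10**4, 22_000_000_000),
    "NLP / GPT-3": (10**11, 10**4, 175_000_000_000),
}

def critical_params(n, k):
    # Assumption: treat p* = O(n k) as exactly n * k (constant factor of 1).
    return n * k

for name, (n, k, p) in cases.items():
    p_star = critical_params(n, k)
    print(f"{name}: p* = {p_star:.0e}, p = {p:.1e}, p / p* = {p / p_star:.1e}")
```

In all three cases the ratio $p / p^*$ is well below 1, so none of these models reaches the threshold.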

Practitioners therefore still rely on early stopping as a main regularization technique.

Of course, when early stopping is used, double descent generally does not appear.

※ According to [4], early stopping keeps the model from reaching zero train error (training ends before that point), so double descent generally does not appear. However, there are still settings in which double descent shows up even with early stopping. See Section 8, "Conclusion and Discussion," of [4].
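Why early stopping tends to hide a second descent can be shown with a patience-based stopping rule. This is a minimal sketch: `early_stop_index` is a hypothetical helper, and the validation-loss values are made up to mimic a curve that dips, rises, and dips again.

```python
def early_stop_index(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for patience-based early stopping.

    Training stops at the first epoch where the validation loss has failed
    to improve on the best value for `patience` consecutive epochs.
    """
    best = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation curve: first descent, a bump, then a second descent.
val_losses = [1.0, 0.8, 0.7, 0.75, 0.9, 1.1, 0.6, 0.5]
stop_epoch, best_epoch = early_stop_index(val_losses, patience=3)
print(stop_epoch, best_epoch)  # prints: 5 2
```

Training stops at epoch 5 and keeps the checkpoint from epoch 2, so the lower losses at epochs 6 and 7 are never reached.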

Decision trees and Ensemble methods

Paper [3] reports that double descent can occur not only in neural networks but also in boosted decision trees (AdaBoost in their experiments) and Random Forests.

For random forests, the parameter is defined as $\cfrac{N_{leaf}^{\max}}{N_{tree}}$.

$N_{leaf}^{\max}$ is the maximum number of leaf nodes allowed per tree.

For the boosting model, the parameter is defined as $\cfrac{N_{tree}}{N_{forest}}$.

Double descent curve for random forests (image from [3])

Double descent curve for $L_2$ boosting trees (image from [3])

※ In my personal opinion, the double descent effect does not look all that pronounced for tree-based models.

References

[1] https://storage.googleapis.com/xavierbresson/index.html (CS3244: Machine learning, Lecture 6)

[2] https://storage.googleapis.com/xavierbresson/lectures/CS3244/lecture06_overfitting_regularization.pdf

[3] https://arxiv.org/pdf/1812.11118.pdf (Reconciling modern machine learning practice and the bias-variance trade-off)

[4] https://arxiv.org/pdf/1912.02292.pdf (Deep Double Descent: Where Bigger Models and More Data Hurt)


궁금한게많은joon
