All posts (269)

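[Clustering] Overview, Approach, Cluster Analysis
Cluster Analysis. Cluster: objects are grouped so that members of the same group are similar or related, while members of different groups are dissimilar or only weakly related. Cluster analysis (clustering, data segmentation): finds similarities among data points from their features and assigns similar points to the same cluster. Unsupervised learning: no predefined classes are required. Clustering can reveal information about the data distribution on its own (stand-alone tool), but it can also be used as a preprocessing step before other algorithms are applied. Clustering as a Preprocessing Tool. Summarization: regression, class..
2023. 5. 17.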

Likelihood function, Sufficient Statistics, Minimum Sufficient Statistics
Likelihood Function. Likelihood inference is an estimation approach that uses the observed data $s$ and a statistical model $\{P_{\theta}: \theta \in \Omega \}$. A pmf and a pdf are usually written $P_{\theta}$ and $f_{\theta}$ respectively, but this post writes $f_{\theta}$ for both, with the meaning taken from context. The likelihood function is defined as follows.
\[ L(\theta | s) = f_{\theta}(s) \]
If $f_{\theta_1}(s) > f_{\theta_2}(s)$, the data $s$ are said to be more likely to be observed (more believable) when $\theta = \theta_1$. $S = \{ 1, 2, \do..
2023. 5. 16.
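
The comparison $f_{\theta_1}(s) > f_{\theta_2}(s)$ in the excerpt above can be tried out numerically. The following is a minimal sketch only: the Binomial model, the function name `likelihood`, and the data (7 successes in 10 trials) are assumptions chosen for illustration and do not come from the post.

```python
from math import comb


def likelihood(theta, s, n):
    """L(theta | s) = f_theta(s), here for an assumed Binomial(n, theta) model."""
    return comb(n, s) * theta**s * (1 - theta) ** (n - s)


# Hypothetical data: s = 7 successes observed in n = 10 trials.
n, s = 10, 7

# Evaluate the likelihood of the same data under two candidate parameters.
l_theta1 = likelihood(0.7, s, n)
l_theta2 = likelihood(0.3, s, n)

# The data are better supported by theta = 0.7 than by theta = 0.3.
print(l_theta1 > l_theta2)
```

Note that the likelihood is read as a function of $\theta$ with $s$ held fixed, so only the ordering of the two values matters here, not their absolute size.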

[Ensemble] AdaBoost in Python (scikit-learn)
Setup. Import the required libraries.
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals
# Common imports
import numpy as np
import os
# to make this notebook's output stable across runs
np.random.seed(42)
# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl..
2023. 5. 16.

[Ensemble] AdaBoost
AdaBoost stands for adaptive boosting. Algorithm in Overview. Given: $d$ class-labeled tuples are provided as input, $(\mathbf{X}_1, y_1), \dots, (\mathbf{X}_d, y_d)$. Initially, all tuples are weighted uniformly; that is, the weight of the $j$-th tuple is $\frac{1}{d}$. Over $T$ rounds, $T$ classifiers are generated. In round $i$, a training set $D_i$ is obtained by sampling with replacement from $\mathcal{D}$. Each tuple's probability of being selected is based on its weight. $D..
2023. 5. 16.
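
The initialization and sampling step described in the excerpt above can be sketched as follows. This is a minimal illustration, not the post's implementation: the helper name `sample_training_set`, the choice of $d = 6$, and drawing exactly $d$ tuples per round are assumptions. The weight update is cut off in the excerpt and is deliberately left out.

```python
import numpy as np


def sample_training_set(weights, rng):
    """Draw tuple indices with replacement, with probability given by the weights.

    Assumption: the sampled set D_i has the same size d as the original data.
    """
    d = len(weights)
    return rng.choice(d, size=d, replace=True, p=weights)


rng = np.random.default_rng(42)

d = 6                           # hypothetical number of class-labeled tuples
weights = np.full(d, 1.0 / d)   # every tuple starts with weight 1/d

# Indices of the tuples selected for the first round's training set.
round_indices = sample_training_set(weights, rng)
print(round_indices)
```

Because sampling is with replacement, some indices can appear more than once and others not at all, even under uniform weights.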

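[Data Science] Mediator, Moderator
Mediator (intermediate variable, mediating variable): explains the relationship (the reason, mechanism, and so on) between two variables, usually an independent and a dependent variable. A mediator by itself does not imply causality, however; it only means that the relationship can be modeled. As an example, consider the hypothesis ($X \to Y$) that exercise ($X$) affects mental health ($Y$). In practice, this can be explained by a model in which self-esteem, the mediator ($Me$), is affected by exercise and in turn affects mental health. Mediation Effect. To check whether a variable $M$ acts as a mediator, three regressions are needed. $Y = b_0 + b_1X + e$ $M = b_0 + b_2X..
2023. 5. 15.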
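
The two regressions visible in the excerpt above can be fitted on synthetic data as a rough sketch. Everything about the data is an assumption made for illustration: the sample size, the random seed, and the coefficients used to simulate $X$, $M$, and $Y$. The third regression is cut off in the excerpt and is not reconstructed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # hypothetical sample size

# Synthetic data (assumed): X affects M, and both X and M affect Y.
X = rng.normal(size=n)
M = 0.5 * X + rng.normal(size=n)
Y = 0.3 * X + 0.4 * M + rng.normal(size=n)

# Regression 1: Y = b0 + b1 * X + e  (effect of X on Y, ignoring M).
b1, b0_y = np.polyfit(X, Y, 1)

# Regression 2: M = b0 + b2 * X + e  (effect of X on the candidate mediator).
b2, b0_m = np.polyfit(X, M, 1)

print(f"b1 = {b1:.3f}, b2 = {b2:.3f}")
```

`np.polyfit` with degree 1 returns the slope first and the intercept second, which is why the slope is unpacked before the intercept.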

Dummy coding, Effect Coding
When categorical variables are used as inputs to a regression model, two coding methods can be considered. The example uses four categories (primary, secondary, post-secondary, less than primary education). $G = \{ G_1,\ G_2,\ G_3,\ G_4 \}$ $G_1$: Primary $G_2$: Secondary $G_3$: Post-secondary $G_4$: Less than primary. Dummy coding. For the four categories, one option is to assign 1 if an observation belongs to the category and 0 otherwise. If the last category is represented by all zeros, $k$ categories need only $(k-1)$ dummies. Note: One-hot encoding turns $k$ categories into $k$ dummy variables ..
2023. 5. 15.
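
The dummy coding scheme described in the excerpt above, with the last category as the all-zero reference, can be sketched as follows. The function name `dummy_code` and the three sample observations are hypothetical; the category labels are the ones listed in the excerpt.

```python
import numpy as np

# Category order from the excerpt; the last one is the all-zero reference.
categories = ["Primary", "Secondary", "Post-secondary", "Less than primary"]


def dummy_code(values, categories):
    """Encode k categories as k-1 dummy columns (last category -> all zeros)."""
    index = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    out = np.zeros((len(values), k - 1), dtype=int)
    for row, value in enumerate(values):
        j = index[value]
        if j < k - 1:  # the reference category keeps its row of zeros
            out[row, j] = 1
    return out


# Hypothetical observations.
sample = ["Primary", "Less than primary", "Post-secondary"]
coded = dummy_code(sample, categories)
print(coded)
```

With four categories each observation becomes a row of three dummies, and an observation in the reference category is the only one whose row is entirely zero.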
