[Clustering] Hierarchical Clustering (계층적 군집화)

728x90

Introduction

Hierarchical tree 형태로 nested cluster를 구성한다.

Dendrogram으로 클러스터를 시각화한다.

Clustering criteria로 주로 distance matrix를 이용한다.

Strengths of Hierarchical Clustering

클러스터가 몇 개인지 직접 가정할 필요가 없다. (k-means는 우리가 직접 $k$를 정해야했다)

dendrogram에서 적정 수준에서 cutting을 하면 어떠한 클러스터 개수도 가능하다.

계층적 클러스터는 (아마도) 의미있는 분류체계와 대응될 것이다. (meaningful taxonomies)

biological science에서 특히 유용하다. (animal kingdom, phylogeny reconstruction, 등)

Two main Types

Agglomerative

각 point는 각 개별 cluster로부터 시작한다.
각 단계에서 가장 가까운 cluster pair는 하나의 cluster로 병합된다. (클러스터가 $1$개가 될 때까지, 혹은 $k$개 까지)

Divisive

모든 point를 포함하는 하나의 cluster에서 시작한다.
각 단계에서 cluster를 분할한다.

일반적으로 agglomerative 방법이 divisive보다 더 자주 사용된다.

Agglomerative Clustering

Distance Matrix를 계산한다.
각 data point를 하나의 cluster로 간주한다.
Repeat
1. 가장 가까운 두 cluster를 merge
2. Distnce Matrix를 업데이트
하나의 클러스터가 남을 때까지 반복

key operation은 두 클러스터의 similarity를 계산하는 것이다. (아래에 설명)

Distance between Clusters

Single Link

각 클러스터($K_i, K_j$)의 원소(t_i, t_j) 끼리의 최단거리

$dist(K_i, K_j) = \min(t_{ip}, t_{jq})$

Complete Link

각 클러스터의 원소 간의 최대 거리

$dist(K_i, K_j) = \max(t_{ip}, t_{jq})$

Average

각 클러스터의 원소간의 평균 거리

$dist(K_i, K_j) = \text{avg}(t_{ip}, t_{jq})$

Centroid

각 클러스터의 centroid 간의 거리

$dist(K_i, K_j) = dist(C_i, C_j)$

Medoid

각 클러스터의 medoid간의 거리

$dist(K_i, K_j) = dist(M_i, M_j)$

Extensions to Hierarchical Clustering

Major Weakness

이전 단계에서 수행한 것을 다시 되돌릴 수 없다. (Can never undo what was done previously)
시간복잡도가 $O(n^2)$이기 때문에 많은 데이터에 적용할 수 없다. (Do not scale well)

Intergration of hierarchical & distance-based clustering

BIRCH (1996): CF-tree를 이용하여 sub-cluster의 quality를 점진적으로 조정 (다음 포스팅 주제)
CHAMELEON (1999): dynamic modeling을 이용

※ BIRCH 알고리즘은 2006 SIGMOD Test of Time Award를 수상했다.

728x90

'스터디 > 인공지능, 딥러닝, 머신러닝' 카테고리의 다른 글

[CS224w] Training Graph Neural Networks (0)	2023.05.24
[Clustering] BIRCH Algorithm (0)	2023.05.23
[Clustering] Drawbacks of K-means and Solutions with Python (K-means 단점과 해결방법) (0)	2023.05.21
[Clustering] K-means Clustering in Python (K-means 알고리즘) (0)	2023.05.20
[Clustering] Model-based Methods, Expectation Maximization (EM) (0)	2023.05.19

궁금한게많은joon

[Clustering] Hierarchical Clustering (계층적 군집화)

Introduction

Strengths of Hierarchical Clustering

Two main Types

Agglomerative Clustering

Distance between Clusters

Extensions to Hierarchical Clustering

'스터디 > 인공지능, 딥러닝, 머신러닝' 카테고리의 다른 글

티스토리툴바

[Clustering] Hierarchical Clustering (계층적 군집화)

Introduction

Strengths of Hierarchical Clustering

Two main Types

Agglomerative Clustering

Distance between Clusters

Extensions to Hierarchical Clustering

'스터디 > 인공지능, 딥러닝, 머신러닝' 카테고리의 다른 글

관련글

티스토리툴바