[CS246] CURE Algorithm: Extension of k-means to clusters of arbitrary shapes

728x90

Motivation

k-mean과 BFR 알고리즘은 convex하거나 fixed-axis와 같은 클러스터 shape에 여러 가정이 있었다.

CURE 알고리즘은 이러한 cluster shape에 어떠한 가정을 하지 않고 유클리드 거리만을 가정한다. 데이터가 정규분포일 필요도 없고, 고정된 축일 필요도 없다. 따라서 centroid 개념도 필요하지 않다. 대신에, collection of representative points가 클러스터의 표현을 나타낸다.

클러스터 개수는 $k$개라고 가정한다.

CURE는 Clustering Using REpresentatives의 앞글자를 따온 것이고, 원래 논문의 제목은 "CURE: An Efficient Clustering Algorithm for Large Databases" 이다.

CURE: 2 Pass Algorithm

Pass 1

데이터에서 random sample을 뽑고, main memory에서 hierarchical clustering을 수행한다.
각 cluster에서 representative point를 뽑는다.
- 각 클러스터 마다, 멀리 흩어져있는 점들을 뽑는다. (as dispersed as possible)
- 이들을 중앙 centroid를 향해 움직인다. (shrink factor:$\alpha$, 약 20% 정도)
- 가까운 representative pair와 cluster를 합친다.

우상단 클러스터에서 가장 멀리 흩어진 점 h, h, h, e가 representative point로 선택된다.

이들을 centroid를 향해 20% 더 가까이 위치한다.

Pass 2

데이터셋을 다시 스캔하여 각 point를 가장 가까운 클러스터에 할당한다.

Why to 20% Move Inward?

매우 넓게 퍼져있는 클러스터에 대해 몇개 점을 20% 가까이 하게 하면, 클러스터 크기는 작아진다.

그리고 작고 밀도있는 클러스터의 경우 그렇게 많이 작아지지 않을 것이다.

(Large, dispersed clusters will shrink more than small, dense ones)

이렇게 함으로써, CURE 알고리즘은 작고 밀도있는 클러스터를 선호하게 된다.

CURE favors a small, dense cluster (Stanford CS246)

728x90

'스터디 > 데이터사이언스' 카테고리의 다른 글

[CS246] Dimensionality Reduction (2) - PCA (0)	2023.10.08
[CS246] Dimensionality Reduction (1) - Introduction (0)	2023.10.07
[CS246] BFR Algorithm: Extension of k-means to large data (0)	2023.10.01
[CS246] Finding Similar Items (6) - LSH Family for Cosine Distance (0)	2023.09.23
[CS246] Finding Similar Items (5) - LSH Family (0)	2023.09.22

궁금한게많은joon

[CS246] CURE Algorithm: Extension of k-means to clusters of arbitrary shapes

Motivation