[CS246] Flajolet-Martin (FM) Algorithm: Counting Distinct Elements from Data Stream

728x90

Flajolet-Martin (FM) Algorithm: Counting Distinct Elements from Data Stream

Problem Definition

data stream은 크기가 $N$인 집합으로 간주할 수 있다.

Applications

website에 방문하는 (unique한) 유저 수는 얼마인가?
web crawling으로 얻은 website에 존재하는 word의 수는?
지난 주에 판매한 상품 품목(distinct products) 개수는?

Real Problem

Obvious approach: dictionary 자료구조를 이용한다. (key:element, value: counting)

그러나 우리는 (data stream의 크기에 비해) 매우 적은 저장공간만 가진다. (limited working storage)

정확한 값을 저장할 수는 없다. 추정해야한다. (Estimate the count in an unbiased way)

Flajolet-Martin Approach

key idea: hash elements to a binary string

아이템이 많을 수록, 더 많은 해시값을 얻을 수 있을 것이다.

tailing 0의 개수가 distinct elemtns의 추정량으로 간주할 수 있다.

$N$개의 element에 대하여 $\log_2 N$에 매핑될 수 있도록 해시함수 $h$를 고른다.
- hash value는 이진 문자열이다.
- 예를 들어, 어떤 stream data $a$에 대하여 $h(a)=1100$과 같은 형태이다.
$r(a)$를 $h(a)$의 tailing 0의 개수라고 하자.
- $h(a)=1100$이면, $r(a)=2$
- $h(b)=1010$이면, $r(b)=1$
Record $R=\max_a r(a)$라 정의하면, distinct element의 추정량은 $2^R$이다.

Why it works

Intuition (Very rough and heuristic)

모든 $N$개의 값에 대하여 $h(a)$는 $a$를 균등한 확률로 매핑한다.

즉, 모든 element는 tailing $r$ zeros를 가질 확률이 동일하다.

\[ \Pr(\text{a tail of } r \text{ zeros}) = 2^{-r} \]

예를 들어, $***0$으로 해싱될 확률은 50%이고, $**00$으로 해싱될 확률은 25%이다.

More formal view

$m$을 지금까지 관찰된 distinct element 개수라고 하자. 그러면 tail of $r$ zeros가 적어도 하나 이상 있을 확률은 다음과 같다.

\[ 1 - (1-2^{-r})^m = 1 - (1-2^{-r})^{2^r(m2^{-r})} \approx 1 - e^{-m2^{-r}} \]

만일 $m << 2^r$이라면 $m/2^r \to 0$이므로 $1 - e^{-m2^{-r}} \to 0$

만일 $m >> 2^r$이라면 $m/2^r \to \infty$이므로 $1 - e^{-m2^{-r}} \to 1$

따라서 $2^R$은 $m$ 근처에 있을 것이다.

Why it doesn't work

FM-algorithm이 항상 동작한다는 보장은 없다.

$E[2^R]$은 실질적으로 무한이기 때문이다.

$R$이 하나씩 증가할 때마다, 확률은 2배씩 증가하기 때문이다. (St. Petersburg paradox)

Workaround: Use many hash function

각 해시함수 $h_i$에 대하여 $R_i$ sample을 얻을 수 있다.

(Option 1) Average. but 특정 $2^{R_i}$가 매우 크다면?

(Option 2) Median. but 항상 2의 거듭제곱만을 추정하게 된다. (실제로는 5, 10과 같이 2의 거듭제곱이 아닐 수 있다)

Solutions

sample을 partition으로 나눈다
각 group마다 median을 구한다.
median의 average를 추정량으로 구한다.

Example

data stream = [3, 1, 4, 1, 5, 9, 2, 6, 5], 비트문자열의 길이는 $5$라 하자. 세개의 해시함수가 다음과 같을 때, estimated distinct elements?

\begin{align} h_1(x) &= 2x + 1 \mod 32 \\ h_2(x) &= 3x+7 \mod 32 \\ h_3(x)&=4x \mod 32 \end{align}

Stream Data	$h_1(x)$	bit-string (1)	$h_2(x)$	bit-string (2)	$h_3(x)$	bit-string (3)
3	7	00111	16	10000	12	01100
1	3	00011	10	01010	4	00100
4	9	01001	19	10011	16	10000
1	3	00011	10	01010	4	00100
5	11	01011	22	10110	20	10100
9	19	10011	2	00010	4	00100
2	5	01101	13	01101	8	01000
6	13	01101	25	11001	24	11000
5	11	01011	22	10110	20	10100

$r_1 = [0, 0, 0, 0, 0, 0, 0, 0, 0]$, $R_1 = 0$

$r_2 = [4, 1, 0, 1, 1, 1, 0 ,0, 1]$, $R_2 = 4$

$r_3 = [2, 2, 4, 2, 2, 2, 3, 3, 2]$, $R_3 = 4$

실제 distinct element의 개수는 7인데, 각 해시함수별로 예측한 추정량은 각각 1, 16, 16으로 지나치게 under/overestimated되어있다. (물론 데이터 수가 작긴 하다)

$R=[0, 4, 4]$의 샘플링을 이용해보자.

(Average) $R$의 평균값으로 추정하면 $(0+4+4)/3 = 2.66$이므로 $2^{2.66} \approx 6.34$개로 추정

(Median) $R$의 중앙값으로 추정하면 $\text{Median}(0, 4, 4)=4$이므로 $2^{4}=16$개로 추정

(Partitioning) $[0], [4, 4]$로 partitioning했다고 하자. 그러면 각 partition의 median은 0, 4이고 평균은 2이므로 $2^2=4$개로 추정.

※ 데이터와 해시함수의 수가 적어서 sampling 방법이 오히려 정확도가 떨어지는 것으로 보인다. (필자 추정)

728x90

'스터디 > 데이터사이언스' 카테고리의 다른 글

Non-negative Matrix Factorization (NMF), 비음수 행렬 분해 (0)	2024.01.02
[CS246] Counting Frequent Elements in a stream (0)	2023.12.28
[CS246] Filtering Data Streams (0)	2023.12.19
[Community Detection] Spectral Clustering (1)	2023.12.06
[CS246] Gradient Boosted Decision Trees (GBDT) (0)	2023.12.04

궁금한게많은joon

[CS246] Flajolet-Martin (FM) Algorithm: Counting Distinct Elements from Data Stream

Flajolet-Martin (FM) Algorithm: Counting Distinct Elements from Data Stream

Problem Definition

Applications

Real Problem

Flajolet-Martin Approach

Why it works

Why it doesn't work

Example

'스터디 > 데이터사이언스' 카테고리의 다른 글

티스토리툴바

[CS246] Flajolet-Martin (FM) Algorithm: Counting Distinct Elements from Data Stream

Flajolet-Martin (FM) Algorithm: Counting Distinct Elements from Data Stream

Problem Definition

Applications

Real Problem

Flajolet-Martin Approach

Why it works

Why it doesn't work

Example

'스터디 > 데이터사이언스' 카테고리의 다른 글

관련글

티스토리툴바