Bloom filters

A PDF of Eric’s handwritten notes is available here.

Today we’ll discuss hashing and see a widely-used scheme known as a Bloom filter. Before diving into hashing let’s look at a toy problem which will be helpful for our intuition.

Balls into bins

Given $n$ balls and $n$ bins, each ball is thrown into a uniformly random bin. The load of bin $i$, denoted $L(i)$, is the number of balls assigned to bin $i$.

The max load is the maximum number of balls in any bin, $\max \limits_i L(i)$. It can be shown that the max load is $O(\log n)$ with high probability (meaning with probability $\ge 1 - \frac{1}{n^c}$ for some constant $c > 0$).

A proof for $c=1$: Pr(bin $i$ gets load $\ge\log n$) $\le\binom{n}{\log n}\big(\frac{1}{n}\big)^{\log n}\le\big(\frac{ne}{\log n}\big)^{\log n}\big(\frac{1}{n}\big)^{\log n}=\big(\frac{e}{\log n}\big)^{\log n}\le\big(\frac{1}{4}\big)^{\log n}=\frac{1}{n^2}$. (The first step union-bounds over all sets of $\log n$ balls; the last inequality holds once $\log n \ge 4e$; logs are base 2, so $\big(\frac{1}{4}\big)^{\log n}=n^{-2}$.)

Pr(some bin gets load $\ge\log n$) $\le\sum_{i=1}^n\text{Pr}(\text{bin }i\text{ gets load}\ge \log n)\le n\times\frac{1}{n^2}=\frac{1}{n}$.

Pr(no bin gets load $\ge\log n$) $\ge 1-\frac{1}{n}$.

(In fact, the bound can be refined to $\Theta(\frac{\log n}{\log \log n})$).
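The $O(\log n)$ bound is easy to observe empirically. A minimal simulation sketch (the value of $n$ and the seed are illustrative choices):

```python
import math
import random

def max_load(n, seed=0):
    """Throw n balls into n bins uniformly at random; return the max load."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        bins[rng.randrange(n)] += 1
    return max(bins)

n = 10_000
# for n this size, the observed max load sits well below log2(n)
print(max_load(n), math.log2(n))
```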

Power of 2 choices

Assign balls sequentially into bins

For $i = 1$ to $n$:

• Choose 2 random bins, $j$ and $k$
• If $L(j) < L(k)$: then assign ball $i$ to bin $j$
• else: assign ball $i$ to bin $k$

With this change, the max load is $O(\log \log n)$ with high probability.

If $d$ random bins are used (instead of 2), the max load is $O(\frac{\log \log n}{\log d})$ with high probability.
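The improvement from sampling two bins is dramatic even at modest scale. A sketch comparing the two placement rules (the value of $n$ and the seeds are illustrative):

```python
import random

def one_choice_max(n, rng):
    """Each ball goes into one uniformly random bin."""
    bins = [0] * n
    for _ in range(n):
        bins[rng.randrange(n)] += 1
    return max(bins)

def two_choice_max(n, rng):
    """Each ball samples two random bins and goes into the less loaded one."""
    bins = [0] * n
    for _ in range(n):
        j, k = rng.randrange(n), rng.randrange(n)
        if bins[j] < bins[k]:
            bins[j] += 1
        else:
            bins[k] += 1
    return max(bins)

n = 50_000
print(one_choice_max(n, random.Random(1)))  # grows like log n / log log n
print(two_choice_max(n, random.Random(2)))  # grows like log log n
```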

Application to Hashing

Universe $U$ of possible passwords, database $S \subseteq U$ of unacceptable passwords. Query: Is $x \in S$?

Hash table: $H = \{0,1, \dots, n-1\}$

Map elements of $U$ into bins $\{0,1,\dots, n-1\}$ using hash function $h : U \rightarrow \{0,1, \dots, n-1\}$.

Chain hashing: $H[i]$ is a linked list of elements $x$ of $S$ where $h(x)=i$

To add $x$ into $S$:

• Compute $h(x)$
• add $x$ onto linked list at $H[h(x)]$

To query $y \in S$:

• Compute $h(y)$
• Check if $y$ is in the linked list at $H[h(y)]$

Query time = load at bin $h(y)$

Let $|S| = m$, $|H| = n$. This is equivalent to putting $m$ balls into $n$ bins.

If $h$ is a truly random function, then $h(x)$ is a uniformly random bin in $\{0,1,\dots,n-1\}$. If $m=n$, the query time is $O(\log n)$ with high probability.
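The chain-hashing scheme above can be sketched as follows, using Python's built-in hash reduced mod $n$ as a stand-in for a random $h$ (an assumption; the notes treat $h$ as truly random):

```python
class ChainHashTable:
    """Chained hash table: table[i] holds the list of x in S with h(x) = i."""

    def __init__(self, n):
        self.n = n
        self.table = [[] for _ in range(n)]

    def _h(self, x):
        # stand-in for a random hash function h : U -> {0, ..., n-1}
        return hash(x) % self.n

    def add(self, x):
        bucket = self.table[self._h(x)]
        if x not in bucket:
            bucket.append(x)

    def query(self, y):
        # query time = load of the bucket at h(y)
        return y in self.table[self._h(y)]
```

Querying an absent element can never answer YES here, because the bucket stores the elements themselves.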

How do we get an $h$ that behaves randomly?

Choose $a \in \{1,2, \dots, n-1\}$ uniformly at random. For $x \in U$, set $h(x) = ax \bmod n$.

A better choice is to choose $a,b \in \{1,2, \dots, n-1\}$, $h(x)=(ax+b) \mod n$.
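One standard way to realize such a family in code (an assumption beyond the notes: this is the Carter–Wegman construction, which reduces modulo a large prime $p$ before reducing modulo the table size $n$) is:

```python
import random

# The Mersenne prime 2^31 - 1; keys hashed should be nonnegative and < P.
P = 2_147_483_647

def make_hash(n, rng=random):
    """Draw a random member of the family h(x) = ((a*x + b) mod P) mod n."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % n
```

Drawing fresh $(a, b)$ gives an independent-looking hash function each time, while a fixed $(a, b)$ hashes the same key consistently.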

Better Approach

Hash table: $H = \{0,1, \dots, n-1\}$. Hash function: $h_1, h_2 : U \rightarrow \{0,1, \dots, n-1\}$.

To add $x$ into $S$:

• Compute $h_1(x), h_2(x)$
• add $x$ onto the linked list at $H[h_1(x)]$ or $H[h_2(x)]$, whichever is less loaded

To query $y \in S$:

• Compute $h_1(y), h_2(y)$
• Check if $y$ is in the linked list at $H[h_1(y)]$ or at $H[h_2(y)]$

For random $h_1, h_2$, if $m = n$, the max load is $O(\log \log n)$ with high probability.
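The two-choice scheme can be sketched the same way; deriving $h_1, h_2$ by salting Python's built-in hash is an implementation stand-in for the two random functions the analysis assumes:

```python
class TwoChoiceHashTable:
    """Each x goes into the less loaded of buckets h1(x), h2(x);
    queries must check both buckets."""

    def __init__(self, n):
        self.n = n
        self.table = [[] for _ in range(n)]

    def _h1(self, x):
        return hash(("h1", x)) % self.n  # salted stand-in for random h1

    def _h2(self, x):
        return hash(("h2", x)) % self.n  # salted stand-in for random h2

    def add(self, x):
        b1 = self.table[self._h1(x)]
        b2 = self.table[self._h2(x)]
        (b1 if len(b1) < len(b2) else b2).append(x)

    def query(self, y):
        return y in self.table[self._h1(y)] or y in self.table[self._h2(y)]
```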

Bloom Filters

• Uses less space
• Simpler
• Faster: $O(1)$
• False positives: If $y \notin S$, sometimes, we say YES $y \in S$

$H$ is a 0-1 array of size $n$, initialized to all 0’s

$h: U \rightarrow \{0,1, \dots, n-1\}$

Add $x$ into $S$:

• Compute $h(x)$
• Set $H[h(x)] = 1$

Query $y \in S$:

• Check if $H[h(y)] = 1$, then YES else NO

False positive problem: it might happen that $y \notin S$, but we added some $z$ into $S$ with $h(z) = h(y)$, so the query answers YES.
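A deliberately tiny table makes the false positive concrete. A sketch (the hash $h(x) = x \bmod n$ and the values 0 and 8 are illustrative choices):

```python
n = 8
H = [0] * n
h = lambda x: x % n  # toy hash function, for illustration only

def add(x):
    H[h(x)] = 1

def query(y):
    return H[h(y)] == 1

add(0)
print(query(0))  # True: 0 is in S
print(query(8))  # True: false positive, h(8) == h(0) but 8 was never added
print(query(1))  # False: bit 1 was never set
```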

Better Approach

Instead, use $k$ hash functions $h_1, h_2, \dots, h_k$

For each $i$, $h_i : U \rightarrow \{0,1, \dots, n-1\}$

Initialize $H$ to all 0’s

Add $x$ into $S$:

• For all $i \in \{1,2, \dots, k\}$
• Set $H[h_i(x)] = 1$

Query $y \in S$:

• if for all $i \in \{1,2, \dots, k\}$, $H[h_i(y)] = 1$, output YES
• else (if $\exists i$ where $H[h_i(y)] = 0$), output NO

Guarantees:

• If $y \in S$, then for query $y \in S$, we always say YES
• If $y \notin S$, then for query $y \in S$, we might say YES. This is a false positive.
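A sketch of the full $k$-hash Bloom filter; deriving the $k$ hash values by salting SHA-256 is an implementation assumption, since the notes simply posit $k$ independent random functions:

```python
import hashlib

class BloomFilter:
    """0-1 array of size n with k hash functions h_1, ..., h_k.
    No false negatives; false positives are possible."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [0] * n

    def _hashes(self, x):
        # derive k hash values by salting SHA-256 with the index i
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(d, "big") % self.n

    def add(self, x):
        for b in self._hashes(x):
            self.bits[b] = 1

    def query(self, y):
        return all(self.bits[b] == 1 for b in self._hashes(y))
```

Every element that was added always queries YES; an absent element queries YES only if all $k$ of its bits happen to have been set by other elements.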

Probability of a false positive

$Pr(\text{false positive for } y \notin S) = Pr(\text{for all } i, H[h_i(y)] = 1) \approx Pr(H[b] = 1)^k = (1 - Pr(H[b] = 0))^k$, where $b$ is a fixed bit and we approximate the bits as independent.

$Pr(H[b] = 0) = (1 - \frac{1}{n})^{mk} \quad \Rightarrow \quad (1 - Pr(H[b] = 0))^k = (1 - (1-\frac{1}{n})^{mk})^k$

If $n=cm$, then $(1 - Pr(H[b] = 0))^k = (1 - (1-\frac{1}{n})^{mk})^k \approx (1 - (e^{-\frac{1}{n}})^{mk})^k = (1 - e^{-\frac{k}{c}})^k$

To find the best $k$ that minimizes $(1 - Pr(H[b] = 0))^k \approx f(k) = (1 - e^{-\frac{k}{c}})^k$, we take the derivative of $f(k)$ and find the solution for $f'(k) = 0$.

The best $k$ is $c \ln 2$: at this $k$ we have $e^{-\frac{k}{c}} = \frac{1}{2}$, so $f(k) = (\frac{1}{2})^k = (\frac{1}{2})^{c \ln 2} = e^{-c (\ln 2)^2} \approx (0.6185)^c$.
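The optimum can be checked numerically. A sketch evaluating $f(k) = (1 - e^{-k/c})^k$ over integer $k$ for $c = 10$ (the choice of $c$ is illustrative):

```python
import math

def false_positive_rate(c, k):
    """f(k) = (1 - e^{-k/c})^k, the approximate false-positive probability."""
    return (1 - math.exp(-k / c)) ** k

c = 10  # n = 10 m, i.e. 10 bits per stored element
best_k = min(range(1, 30), key=lambda k: false_positive_rate(c, k))
print(best_k, c * math.log(2))           # best integer k is near c ln 2 ≈ 6.93
print(false_positive_rate(c, best_k))    # close to (0.6185)^c
```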