Bloom filters

A PDF of Eric’s handwritten notes is here.

Today we’ll discuss hashing and see a widely used scheme known as a Bloom filter. Before diving into hashing, let’s look at a toy problem that will help build our intuition.

Balls into bins

Given n balls and n bins, each ball is thrown into a uniformly random bin. The load of bin i, denoted L(i), is the number of balls assigned to bin i.

Max load is the maximum number of balls in any bin, \max \limits_i L(i). It can be shown that the max load is O(\log n) with high probability (with high probability means with probability \ge 1 - \frac{1}{n^c} for some constant c > 0).

A proof for c=1: Pr(bin i gets load \ge\log n) \le \binom{n}{\log n}\big(\frac{1}{n}\big)^{\log n}\le\big(\frac{ne}{\log n}\big)^{\log n}\big(\frac{1}{n}\big)^{\log n}=\big(\frac{e}{\log n}\big)^{\log n}\le\big(\frac{1}{4}\big)^{\log n}=\frac{1}{n^2}. (The first inequality is a union bound over the \binom{n}{\log n} choices of which balls land in bin i; the last inequality holds once \log n \ge 4e.)

Pr(some bin gets load \ge\log n) \le\sum_{i=1}^n\text{Pr}(\text{bin }i\text{ gets load}\ge \log n)\le n\times\frac{1}{n^2}=\frac{1}{n}.

Pr(no bin gets load \ge\log n) \ge 1-\frac{1}{n}.

(In fact, the max load is \Theta(\frac{\log n}{\log \log n}) with high probability.)
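A quick simulation makes the bound concrete (a sketch; `max_load` is an illustrative name, not from the notes):

```python
import random

def max_load(n, trials=20):
    """Throw n balls into n uniformly random bins; return the largest max load over several trials."""
    worst = 0
    for _ in range(trials):
        loads = [0] * n
        for _ in range(n):
            loads[random.randrange(n)] += 1  # each ball picks a uniform bin
        worst = max(worst, max(loads))
    return worst
```

For n = 1000 this typically returns a max load around 5–8, consistent with the \Theta(\frac{\log n}{\log \log n}) behavior.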

Power of 2 choices

Assign balls sequentially into bins

For i = 1 to n:

  • Choose 2 random bins, j and k
  • If L(j) < L(k): then assign ball i to bin j
  • else: assign ball i to bin k

With this change, the max load is O(\log \log n) with high probability.

If d random bins are used (instead of 2), the max load is O(\frac{\log \log n}{\log d}) with high probability.
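The loop above translates directly into a short simulation (a sketch; the function name is mine):

```python
import random

def two_choice_max_load(n, d=2):
    """Place n balls into n bins; each ball probes d random bins and joins the least loaded one."""
    loads = [0] * n
    for _ in range(n):
        probes = [random.randrange(n) for _ in range(d)]
        best = min(probes, key=lambda j: loads[j])  # pick the least-loaded probed bin
        loads[best] += 1
    return max(loads)
```

For n = 10000 and d = 2 this typically returns 3 or 4, dramatically smaller than the one-choice max load.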

Application to Hashing

Universe U of possible passwords, database S \subseteq U of unacceptable passwords. Query: Is x \in S?

Hash table: H = \{0,1, \dots, n-1\}

Map elements of U into bins \{0,1,\dots, n-1\} using hash function h : U \rightarrow \{0,1, \dots, n-1\}.

Chain hashing: H[i] is a linked list of elements x of S where h(x)=i

To add x into S:

  • Compute h(x)
  • add x onto linked list at H[h(x)]

To query y \in S:

  • Compute h(y)
  • Check if y is in the linked list at H[h(y)]

Query time = load at bin h(y)
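The chained table above can be sketched in a few lines; here Python's built-in hash stands in for a random h (class and method names are illustrative):

```python
class ChainHashTable:
    """Hash table with chaining: H[i] is the list of elements x with h(x) = i."""

    def __init__(self, n):
        self.n = n
        self.H = [[] for _ in range(n)]  # one chain per bin

    def _h(self, x):
        return hash(x) % self.n  # built-in hash as a stand-in for a random h

    def add(self, x):
        chain = self.H[self._h(x)]
        if x not in chain:  # avoid storing duplicates
            chain.append(x)

    def query(self, y):
        # Query time = load at bin h(y): we scan one chain.
        return y in self.H[self._h(y)]
```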

Let |S| = m, |H| = n. This is equivalent to putting m balls into n bins.

If h is a random function, then h(x) is a uniformly random bin in \{0,1,\dots,n-1\}. If m=n, the worst-case query time is O(\log n) with high probability (while the expected time for a single query is O(1)).

How do we get an h that behaves like a random function?

Choose a \in \{1,2, \dots, n-1\} at random. For x \in U (treated as an integer), h(x) = ax \mod n

A better choice is to pick random a,b \in \{1,2, \dots, n-1\} and set h(x)=(ax+b) \mod n.
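Sampling such an h is a one-liner (a sketch; `make_hash` is an illustrative name, integer keys are assumed, and note the textbook universal-hashing construction first reduces modulo a prime p \ge |U| before reducing mod n):

```python
import random

def make_hash(n):
    """Sample h(x) = (a*x + b) mod n with random a, b, as in the note above."""
    a = random.randrange(1, n)
    b = random.randrange(1, n)
    return lambda x: (a * x + b) % n
```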

Better Approach

Hash table: H = \{0,1, \dots, n-1\}. Hash functions: h_1, h_2 : U \rightarrow \{0,1, \dots, n-1\}.

To add x into S:

  • Compute h_1(x), h_2(x)
  • Add x to whichever of the linked lists at H[h_1(x)], H[h_2(x)] is less loaded

To query y \in S:

  • Compute h_1(y), h_2(y)
  • Check if y is in the linked list at H[h_1(y)] or at H[h_2(y)]

For random h_1, h_2, if m = n, the max load is O(\log \log n).
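The two-choice table can be sketched the same way; salting Python's built-in hash stands in for independent h_1, h_2 (names are illustrative):

```python
class TwoChoiceHashTable:
    """Chaining with two hash functions: insert into the shorter of the two chains."""

    def __init__(self, n):
        self.n = n
        self.H = [[] for _ in range(n)]
        # Two "independent" hash functions, simulated by salting the built-in hash.
        self.h1 = lambda x: hash(("h1", x)) % n
        self.h2 = lambda x: hash(("h2", x)) % n

    def add(self, x):
        i, j = self.h1(x), self.h2(x)
        if x in self.H[i] or x in self.H[j]:
            return  # already stored
        shorter = self.H[i] if len(self.H[i]) <= len(self.H[j]) else self.H[j]
        shorter.append(x)

    def query(self, y):
        # Must check both candidate chains.
        return y in self.H[self.h1(y)] or y in self.H[self.h2(y)]
```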

Bloom Filters

  • Uses less space
  • Simpler
  • Faster: O(1)
  • False positives: if y \notin S, we may incorrectly answer YES (y \in S)

H is a 0-1 array of size n, initialized to all 0’s

h: U \rightarrow \{0,1, \dots, n-1\}

Add x into S:

  • Compute h(x)
  • Set H[h(x)] = 1

Query y \in S:

  • If H[h(y)] = 1, output YES; else output NO

False Positive problem: It might happen that y \notin S but we added z into S and h(z) = h(y)

Better Approach

Instead, use k hash functions h_1, h_2, \dots, h_k

For each i, h_i : U \rightarrow \{0,1, \dots, n-1\}

Initialize H to all 0’s

Add x into S:

  • For all i \in \{1,2, \dots, k\}
    • Set H[h_i(x)] = 1

Query y \in S:

  • If for all i \in \{1,2, \dots, k\}, H[h_i(y)] = 1, output YES
  • else (if \exists i where H[h_i(y)] = 0), output NO

Guarantees:

  • If y \in S, then for query y \in S, we always say YES
  • If y \notin S, then for query y \in S, we might say YES. This is a false positive.
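Putting the two operations together, a minimal sketch (salted built-in hash again standing in for k independent random hash functions; the class name is mine):

```python
class BloomFilter:
    """Bloom filter: a 0-1 array of size n probed by k hash functions."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.H = [0] * n  # initialized to all 0's

    def _bits(self, x):
        # The k bit positions h_1(x), ..., h_k(x), simulated via salted hashing.
        return [hash((i, x)) % self.n for i in range(self.k)]

    def add(self, x):
        for b in self._bits(x):
            self.H[b] = 1

    def query(self, y):
        # YES iff every probed bit is 1: no false negatives, possible false positives.
        return all(self.H[b] == 1 for b in self._bits(y))
```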

Probability of a false positive

Pr(\text{false positive for } y \notin S) = Pr(\text{for all } i, H[h_i(y)] = 1) \approx Pr(H[b] = 1)^k = (1 - Pr(H[b] = 0))^k, where b is any fixed position in H and the approximation treats the k probed bits as independent.

Pr(H[b] = 0) = (1 - \frac{1}{n})^{mk}, since each of the mk hash evaluations made while inserting the m elements misses bit b with probability 1 - \frac{1}{n} \quad \Rightarrow \quad (1 - Pr(H[b] = 0))^k = (1 - (1-\frac{1}{n})^{mk})^k

If n=cm, then (1 - Pr(H[b] = 0))^k = (1 - (1-\frac{1}{n})^{mk})^k \approx (1 - (e^{-\frac{1}{n}})^{mk})^k = (1 - e^{-\frac{k}{c}})^k

To find the best k minimizing (1 - Pr(H[b] = 0))^k \approx f(k) = (1 - e^{-\frac{k}{c}})^k, we take the derivative of f(k) and solve f'(k) = 0.

The best k is c \ln 2 (at which e^{-\frac{k}{c}} = \frac{1}{2}), which makes f(k) = (\frac{1}{2})^k = (\frac{1}{2})^{c \ln 2} \approx (0.6185)^c.
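Plugging in numbers checks the formula (a sketch; for c = 8 bits per stored element, the best integer k is 6, giving a false-positive rate around 2.2%, close to (0.6185)^8):

```python
import math

def false_positive_rate(c, k):
    """Approximate Bloom-filter false-positive rate f(k) = (1 - e^{-k/c})^k for n = c*m bits."""
    return (1.0 - math.exp(-k / c)) ** k

c = 8                      # bits of table per stored element
k_best = c * math.log(2)   # optimal k = c ln 2, about 5.55, so use k = 6 in practice
```

Using fewer hash functions (k = 2) or more (k = 12) both give a worse rate, confirming the interior minimum at k = c \ln 2.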