PDF of Eric’s handwritten notes are here.

Today we’ll discuss hashing and see a widely-used scheme known as a Bloom filter. Before diving into hashing let’s look at a toy problem which will be helpful for our intuition.

**Balls into bins**

Given balls and bins, each ball is thrown into a random bin. Load of bin denoted by # of balls assigned to bin .

*Max load* is the maximum # of balls in a bin . It can be shown that the *max load* is with high probability (with high prob. means with probability for constant ).

A proof for : Pr(bin gets load ) = .

Pr(some bin gets load ) .

Pr(no bin gets load ) .

(In fact, the bound can be refined to ).

**Power of 2 choices**

Assign balls sequentially into bins

For to :

- Choose 2 random bins, and
- If : then assign ball to bin
- else: assign ball to bin

With this change, the max load is with high probability.

If random bins are used (instead of 2), the max load is with high probability.

**Application to Hashing**

Universe of possible passwords, database of unacceptable passwords. Query: Is ?

Hash table:

Map elements of into bins using hash function .

**Chain hashing**: is a linked list of elements of where

To add into :

- Compute
- add onto linked list at

To query :

- Compute
- Check if is in the linked list at

Query time = load at bin

Let , . This is equivalent to putting balls into bins.

If is a random function, then is a random bin in . If , the query time is .

How to get a which is random?

Choose . For ,

A better choice is to choose , .

**Better Approach**

Hash table: . Hash function: .

To add into :

- Compute
- add onto linked list at which is less loaded

To query :

- Compute
- Check if is in the linked list at

For random , if , the max load is .

**Bloom Filters**

- Uses less space
- Simpler
- Faster:
- False positives: If , sometimes, we say YES

is a 0-1 array of size , initialized to all 0’s

Add into :

- Compute
- Set

Query :

- Check if , then YES else NO

False Positive problem: It might happen that but we added into and

**Better Approach**

Instead, use hash functions

For each ,

Initialize to all 0’s

Add into :

- For all
- Set

Query :

- if for all , output YES
- else (if where ), output NO

Guarantees:

- If , then for query , we always say YES
- If , then for query , we might say YES. This is a false positive.

**Probability of a false positive**

If , then

To find the best that minimizes , we take the derivative of and find the solution for .

The best is which makes .