13- Bloom Filters¶

Overview¶

Invented by Burton Bloom (1970).
A space-efficient probabilistic data structure to test set membership.
Answers: “Have we seen this key before?”
Key properties: - No false negatives: if it says “not seen,” it’s correct. - Possible false positives: may say “seen” when it hasn’t.
Common uses: - Prefiltering before slower lookups. - Checking for weak passwords or known records.

Core Concept¶

A Bloom filter is an array of bits (0 or 1).
Each inserted key is mapped to k indices using k hash functions.
Each hash index is set to 1 during insertion.

Operations¶

Insert(key): - Compute k hash values. - Set bits at those indices to 1.
Lookup(key): - Compute k hash values. - If any bit is 0 → key not present. - If all bits are 1 → key possibly present.

BloomFilter Structure¶

::

BloomFilter {: Integer: size Integer: k Array of bits: bins Array of hash functions: h

}

Example Pseudocode¶

::

BloomFilterInsertKey(filter, key):

FOR i in 0..k-1:: index = filter.h[i](key) filter.bins[index] = 1

BloomFilterLookup(filter, key):

FOR i in 0..k-1:: index = filter.h[i](key) IF filter.bins[index] == 0:

RETURN False

RETURN True

Performance¶

Time complexity: O(k)
Memory: O(m) bits, where m is number of bins.
Independent of total elements inserted.
Lookup can terminate early after first 0.

Tuning Parameters¶

False positive rate approximation:

\[P_{false} = (1 - (1 - 1/m)^{nk})^k\]

n: number of inserted items
m: number of bits
k: number of hash functions

Effects: - Larger m → lower false positives. - More items (n) → higher false positives. - Optimal k depends on balance between accuracy and space.

Example False Positive Rates (n = 100)¶

m	k = 1	k = 3	k = 5
200 400 600 800 1000	0.3942 0.2214 0.1536 0.1176 0.0952	0.4704 0.1473 0.0610 0.0306 0.0174	0.6535 0.1855 0.0579 0.0217 0.0094

Bloom Filters vs. Hash Tables¶

Hash Tables: - Exact results, store keys explicitly. - Higher memory usage (keys + pointers).
Bloom Filters: - No keys or pointers stored. - Compact, bit-based memory footprint. - Faster, cache-friendly lookups.

Applications¶

Password blacklist checks.
Caching prefilters.
Large-scale record existence tests.
Distributed databases and search engines.

Key Insights¶

Trades accuracy for speed and memory.
Enables fast “likely/not present” checks.
Demonstrates probabilistic data structures with controlled error.
Foundation for more complex structures like counting Bloom filters.

Summary¶

Bloom filters provide a lightweight, probabilistic way to test membership: - Tiny memory footprint. - Constant-time operations. - No false negatives, some false positives. - Useful as a prefilter for expensive lookups or large data sets.