13- Bloom Filters¶
Overview¶
Invented by Burton Bloom (1970).
A space-efficient probabilistic data structure to test set membership.
Answers: “Have we seen this key before?”
Key properties: - No false negatives: if it says “not seen,” it’s correct. - Possible false positives: may say “seen” when it hasn’t.
Common uses: - Prefiltering before slower lookups. - Checking for weak passwords or known records.
Core Concept¶
A Bloom filter is an array of bits (0 or 1).
Each inserted key is mapped to k indices using k hash functions.
Each hash index is set to 1 during insertion.
Operations¶
Insert(key): - Compute k hash values. - Set bits at those indices to 1.
Lookup(key): - Compute k hash values. - If any bit is 0 → key not present. - If all bits are 1 → key possibly present.
BloomFilter Structure¶
- ::
- BloomFilter {
Integer: size Integer: k Array of bits: bins Array of hash functions: h
}
Example Pseudocode¶
- ::
- BloomFilterInsertKey(filter, key):
- FOR i in 0..k-1:
index = filter.h[i](key) filter.bins[index] = 1
- BloomFilterLookup(filter, key):
- FOR i in 0..k-1:
index = filter.h[i](key) IF filter.bins[index] == 0:
RETURN False
RETURN True
Performance¶
Time complexity: O(k)
Memory: O(m) bits, where m is number of bins.
Independent of total elements inserted.
Lookup can terminate early after first 0.
Tuning Parameters¶
False positive rate approximation:
n: number of inserted items
m: number of bits
k: number of hash functions
Effects: - Larger m → lower false positives. - More items (n) → higher false positives. - Optimal k depends on balance between accuracy and space.
Example False Positive Rates (n = 100)¶
m |
k = 1 |
k = 3 |
k = 5 |
|---|---|---|---|
200 400 600 800 1000 |
0.3942 0.2214 0.1536 0.1176 0.0952 |
0.4704 0.1473 0.0610 0.0306 0.0174 |
0.6535 0.1855 0.0579 0.0217 0.0094 |
Bloom Filters vs. Hash Tables¶
Hash Tables: - Exact results, store keys explicitly. - Higher memory usage (keys + pointers).
Bloom Filters: - No keys or pointers stored. - Compact, bit-based memory footprint. - Faster, cache-friendly lookups.
Applications¶
Password blacklist checks.
Caching prefilters.
Large-scale record existence tests.
Distributed databases and search engines.
Key Insights¶
Trades accuracy for speed and memory.
Enables fast “likely/not present” checks.
Demonstrates probabilistic data structures with controlled error.
Foundation for more complex structures like counting Bloom filters.
Summary¶
Bloom filters provide a lightweight, probabilistic way to test membership: - Tiny memory footprint. - Constant-time operations. - No false negatives, some false positives. - Useful as a prefilter for expensive lookups or large data sets.