Probabilistic Data Structures
When exact answers are too expensive (memory or time), probabilistic structures offer massive efficiency gains in exchange for a small, controlled error rate.
Bloom Filters
A space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
- False Positives: Possible (might say "yes" when it's "no").
- False Negatives: Impossible (if it says "no", it's definitely "no").
- Use Cases:
- Checking if a username is taken before hitting the DB.
- Cache filtering (only cache items requested more than once).
- Weak password detection.
How it works
- Start with a bit array of $m$ bits, all set to 0.
- Use $k$ different hash functions.
- To add an item: Hash it $k$ times, set the bits at those indices to 1.
- To check an item: Hash it $k$ times. If all bits at those indices are 1, the item might be in the set. If any is 0, it is definitely not.
HyperLogLog
An algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset.
- Memory: Extremely low (e.g., ~1.5KB for counts up to $10^9$).
- Error Rate: Typically < 2%.
- Use Cases:
- Counting unique visitors to a website (DAU/MAU).
- Analyzing distinct search queries.
- Network traffic monitoring (unique IP addresses).
Key Insight
Uses the observation that the cardinality of a stream of uniformly distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number.
Count-Min Sketch
A frequency table for events in a stream. Think of it as a "probabilistic hash map" for counting.
- Use Cases:
- "Heavy hitters" (finding most frequent items).
- DDoS detection (identifying IPs with high request rates).
- Trade-off: Overestimates counts (collisions add to the count), but never underestimates.
Summary Table
| Structure | Answers Question | Error Type | Typical Use |
|---|---|---|---|
| Bloom Filter | Is $x$ in set $S$? | False Positive | Membership testing |
| HyperLogLog | How many unique items? | Approximation | Cardinality estimation |
| Count-Min Sketch | How often did $x$ occur? | Overestimation | Frequency / Top-K |