@adlrocha - Network Applications of Bloom Filters

Improving my math intuition series - Part I

Last week I introduced how I was planning to dedicate the next few months to take a math course in order to improve my math skills and intuition. So far, online learning hasn’t worked for me, I never managed to build an habit of it. To prevent this from happening, this time I’ve decided to use this newsletter as a way to make myself accountable, peg my desire to learn maths with this newsletter habit, and start “learning in public”.

Class 1: Bloom Filters 📊

How, and more importantly, why do they work?

So far I am meeting my targets, and this week I managed to go through class 1 of the course. The task for this week was to read Network Applications of Bloom Filters: A Survey. by Broder and Mitzenmacher. If you have studied or worked with anything related to computer science, you are probably familiar with Bloom Filters. You may have used them extensively either as a user or as a developer. This paper makes a great analysis of the math behind Bloom Filters, mathematically showing how and why the work. Actually, after you see it, the math required to understand why Bloom fitlers work is basic statistics and a few tricks more.

“A Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries”. If you have a set, S, composed of n elements [x1, …, xn], you can use a Bloom filter to represent the set and perform efficient membership queries over it. The filter will output 1 if the input is an element in the set, and 0 with high-probability if the input is not in the set. This “high-probably” is determined by the probability of false positives of the filter.

In an m-bits Bloom filter that uses k independent hash functions with range {1, …, m}, for each element xi in the set, the bits hi(k) (the result of hashing xi using the k independent hash function)  are set to 1. By repeating this task for all xi, we get an m-bits array with a bunch of its bits set to 1. This array is used to represents all the elements that belong to the set. Checking if an element is in the set is as simple as checking if hi(k) are set to one in the m bit array.

We see that Bloom filters are easy to build, but why do they work? How can we be sure that this simple trick of hashing k times every element and setting some bits to 1 in a m-bit array is able to represent a set with a small probability of false positives? And this is the reason why this paper is so interesting, as it guides you through the hidden math that shows why this construction works (giving you the math intuition behind Bloom filters which I mentioned last week and that I am hoping to get with this course).

  • The first thing the paper does is to guide us through how to analyze the probability of false positives of Bloom filters, and how to minimize it. To get to this proof ourselves we would only have needed:

    • Simple statistics: The proof is based on computing the probability that a specific bit is still 0 after representing the full set, and using this probability to get the probability of the k-hashes being 1 for an element which is not the test.

    • To know that by deriving a function and equating it to zero you can get its minimums.

    • And asymptotic approximations to ease the analysis and development of the proofs. This is something I wasn’t personally familiar with, and which I had to dive deep into in order to understand every step in the analysis. Apparently, asymptotic approximations are extensively used in theoretical computer science, so I highly recommend you understand it before going on with the course. This resource really helped me with this. You know you’ve understood asymptotic approximation if you are able to get why is this possible:

  • Using similar tricks to the ones used above, the paper also evaluates Bloom filters efficiency, deriving the well-known optimal relationship between the number of bits used to represent the filter minimizing the probability of false positives (something I personally used myself as black box before without knowing where did it come from).

  • Bloom filters have a set of nice properties easy to demonstrate and which can come pretty handy for certain applications:

    • A Bloom filter that represents the union of two sets can be obtained by taking the OR of the two m-bit vectors of the original Bloom filters.

    • They can easily be halved in size, dynamically shrinking it by ORing the first and second halves of the filter together, and using a mask for the membership proof.

    • Approximate the intersection between two sets using their inner product.

  • Finally, we can build variations of Bloom filters to allow deletion and compression such as:

    • Counting Bloom filters which instead of setting the i-bit to 1 like vanilla Bloom filters, they use a counter to know if more than one element has set that position to 1 before removing an element from the set.

    • Compressed Bloom filters

Applications

The second part of the paper goes through a really nice survey of historical and more recent (considering that the paper is from 2004) application of Bloom filters. Such as:

  • Dictionaries

  • Databases

  • Distributed Caching

  • P2P and Overlay Networks

  • Resource Routing

  • Packet Routing

  • Measurement Infrastructure.

(if you lack the time to read all of them, the ones in bold were the ones I enjoyed the most)

These sections of the paper are really interesting but lighter on the use of math so I am not going to develop them further here. They are really well-described in the paper and worth reading, but I don’t think it is worth for me to go through my notes on them (they’ve given me several new ideas that I may mention in future publications, though).

And finally, how can you be sure if it is worth to use a Bloom filter in your application? Use the Bloom filter principle which says that: “whenever a list or set is used, and space is at a premium consider using a Bloom filter if the effect of false positives can be mitigated”. Simple.

Tasks for next week 📰

Next week I am planning to go through New directions in traffic measurement and accounting: focusing on the elephants, ignoring the mice by Estan and Varghese. It is a long paper, I really hope I can find the time to read it and process it in full.

In the meantime, if you want to join me on this endeavor of learning math and theoretical computer science in public do not hesitate to ping me. Other’s have done it already. Let’s build an async math learning club 🤓. See you next week!