Note:MathJax

k-Way Collisions of Balls in Bins

Posted: 2017-01-06.

A basic probability question is what happens when we draw i.i.d. samples from a distribution, often referred to as "throwing balls into bins". Here I just want to show how the number of "k-way collisions" can give a simple yet useful analysis for, especially, the size of the max-loaded bin.

My Arms Are Sore

Suppose I have $n$ bins and I begin my day by throwing $m$ balls into them, one at a time, each independent of the previous ones. In the simplest case, every bin has equally likely probability. On my more general days, each bin $i$ has probability $A_i$ of receiving each ball.

One can view this as drawing $m$ i.i.d. samples from a distribution $A$ on support size $n$, with the equally-likely case corresponding to the uniform distribution.

After completing this exercise and a hearty breakfast, I spend a few hours doing grown up stuff like grading papers and writing emails. In the afternoon, though, I come back and decide to analyze the balls-in-bins process. Specifically: Can I upper-bound the number of balls in the max-loaded bin?

Key Idea: (Generalized) Collisions

Check this out:

For any two different balls that got thrown, we can say they collided if they land in the same bin. This causes one "two-way" collision. Let $C_2$ be a random variable equal to the number of $2$-way collisions.

How many do we expect -- that is, what is $\mathbb{E} C_2$? Well, there are ${m \choose 2}$ pairs of balls. What is the "collision probability" of a given pair? It's the sum over the bins of the probability that both balls land in that bin, which is $A_i^2$ for bin $i$. So \begin{align} \mathbb{E} C_2 &= {m \choose 2} \sum_i A_i^2 \\ &= {m \choose 2} \|A\|_2^2 \end{align} to use some fancy notation.

Now -- are you ready? -- let $C_k$ be the number of "$k$-way collisions", i.e. the number of sets of $k$ balls that all land in the same bin. By the same reasoning,

\begin{align} \mathbb{E} C_k &= {m \choose k} \sum_i A_i^k \\ &= {m \choose k} \|A\|_k^k . \end{align}

Now ready or not, we can get a pretty nice result.

Lemma. Suppose we throw $m$ balls uniformly into $n$ bins. The probability that any bin has $k$ balls is at most

\[ n \left(\frac{m e}{n k}\right)^k . \]

Proof. Notice that if any bin has $k$ or more balls, then there is at least one $k$-way collision. But the probability of this is upper-bounded by its expectation using Markov's inequality: \begin{align} \Pr[C_k \geq 1] &\leq \mathbb{E} C_k \\ &= {m \choose k} \sum_i \frac{1}{n^k} \\ &= {m\choose k} n \frac{1}{n^k} . \end{align} To get the claim, now use the inequality ${m \choose k} \leq \left(\frac{m e}{k}\right)^k$ where $e = 2.718...$. (n.b. this is loose; the $e$ can be almost removed; for more on this inequality and Markov's, see e.g. Tips, Tricks, and Techniques for theoretical CS.)

The Point

This gives pretty easy yet pretty tight upper bounds on the max-loaded bin of this process. Let's see a couple rough examples. (See Raab and Steger (1998) for the tight bounds in various parameter regimes, both upper and lower.)

Example/Proposition. Suppose we throw $n$ balls into $n$ bins. Then with high probability, the max-loaded bin has at most $O(\log n / \log\log n)$ balls.

Proof. For $m=n$, the lemma implies that for all $k$, \begin{align} \Pr[\text{exists a $k$-loaded bin}] &\leq \Pr[C_k \geq 1] \\ &\leq n \left(\frac{e}{k}\right)^k . \end{align} Now plug in $k = \Omega\left(\log n / \log\log n\right)$ to get that this probability is $O(1)$.

Example/Proposition ("Not Enough Birthdays"). Suppose we throw only $\alpha \sqrt{n}$ balls into $n$ bins. Then the probability of any collision is at most $0.5 \alpha^2$.

Proof. Using the reasoning above,

\begin{align} \Pr[\text{exists a $2$-loaded bin}] &\leq \Pr[C_2 \geq 1] \\ &\leq \mathbb{E} C_2 \\ &= \frac{m(m-1)}{2} \frac{1}{n} \\ &\leq \frac{m^2}{2n} \\ &= \frac{\alpha^2}{2} . \end{align}

Example/Proposition. Suppose we throw $n$ balls into $n$ bins. Then the probability of a $\log n$ loaded bin is at most $1/n^{\omega(1)}$.

Proof. Let $m = n$, then as seen above, \begin{align} \Pr[\text{exists a $k$-loaded bin}] &\leq n \left(\frac{e}{k}\right)^k . \end{align} Set $k = \log n$, use $e^{\log n} \leq n^2$, and use $(\log n)^{\log n} = 2^{\log n \log\log n} = n^{\log\log n}$. We get the probability is

\[ \leq \frac{n^3}{n^{\log\log n}} = \frac{1}{n^{\log\log n - 3}} . \]

The Point (continued)

The other nice part of the collisions approach is that it works for non-uniform distributions $A$ as well. In fact, it directly relates the probability of $k$-way collisions to the $k$-norm of $A$, since $\mathbb{E} C_k = {m \choose k} \|A\|_k^k$.

In the case $k=2$, we can use collisions to get an essentially-optimal tester for uniformity of $A$. The idea is that the uniform distribution has the fewest collisions of any distribution on support size $n$. If $A$ is $\epsilon$-far from uniform in $\ell_1$ distance, then $\|A\|_2^2$ is significantly larger than the $\frac{1}{n}$ of the uniform distribution, so its number of collisions will be too high and we can detect that with only $\frac{\sqrt{n}}{\epsilon^2}$ samples. You can see this developed in e.g. my paper "$\ell_p$ Testing and Learning of Discrete Distributions", but the idea of using collisions for uniformity testing is much older, e.g. Paninski 2008.

(One subtlety that comes up is carefully bounding the variance of $C_2$ in order to show that, with enough samples of a nonuniform distribution, $C_2$ is high with large probability. This is slightly messy; maybe there is a more elegant way to do it than computing the variance.)

A point to notice is that measuring $C_k$ will help get a direct handle on $\|A\|_k$, since $\mathbb{E}C_k = {m\choose k}\|A\|_k^k$. I'm not sure yet how to put this to work for other questions. It might be useful for testing uniformity in $\ell_p$ metrics for $p \gt 2$, whose exact sample complexity is an open question (though maybe only interesting to me).

Back to Basics

Having given my triceps some time to recover, let's return to our bins. The first thing I want to point out is two separate ways to write $C_k$: \begin{align} C_k &= \sum_{1 \leq a_1 \lt \cdots \lt a_k \leq m} \mathbf{I}\left[\text{balls $a_1, \dots, a_k$ land in same bin}\right] \\ &= \sum_{i=1}^n {X_i \choose k} \end{align} where $X_i$ is the number of balls falling in bin $i$ and we adopt the convention ${X_i \choose k} = 0$ if $X_i \lt k$. Notice the second definition is just the sum over bins of the number of collisions in that bin.

It might be that an interesting/useful quantity is $Z_k := \sum_{i=1}^n {X_i \choose k}^{1/k}$, but I don't know how to analyze it.

The other thing I want to point out is the unreasonable effectiveness of Markov's inequality here. Those familiar with Markov's know that it's usually just a stepping stone to higher-powered bounds like Chernoff bounds. To explain its usefulness here, recall that in Chernoff-type approaches we apply Markov's not with $\Pr[X \geq t]$ but $\Pr[X^k \geq t^k]$ or even $\Pr[e^X \geq e^t]$. Now notice that our variable, $C_k$, is already in a sense "powered up" to the $k$th power.

Basically, $C_k$ acts almost like a sharp indicator. It is zero for a small number of throws $m$, until at some point, we begin seeing a few $k$-loaded bins, when it is $\Theta(1)$. As $m$ increases, soon we expect $(k+1)$ or even $(k+2)$-loaded bins.

But $k+1$ balls all in the same bin automatically implies $k+1$ different $k$-way collisions. If $k+2$ balls fall in the same bin, this automatically gives ${k+2 \choose k} \approx k^2$ different $k$-way collisions! So $C_k$ will begin increasing very rapidly, and with high variance. (The same idea applies when fixing $m$ and decreasing $k$.)

By sticking to the low regime where $\mathbb{E} C_k \ll 1$, we don't allow these events to influence the probability of a high-loaded bin much, and $\mathbb{E} C_k$ becomes a pretty good bound on $\Pr[C_k \geq 1]$. But once we reach higher regimes, $C_k$ becomes harder to control.

The Big Q

The big open question to me is to develop anything close to this elegant for lower bounds on the max-loaded bin. As I mentioned above, for uniformity testing, lower-bounding the collisions from nonuniform $A$ requires pretty delicate reasoning about the variance of $C_2$ (for example, there are multiple regimes depending on $m,n,\epsilon$). This is due to exactly the issue we just discussed - once we throw enough balls to start expecting $k$-way collisions, $C_k$ starts getting lage and high-variance fast. In contrast, using collisions for upper bounds is easy.

So for lower-bounding the max-loaded bin, anything simpler than this sort of heavy-handed variance analysis seems to need some clever approach. I've tried using quantities like $Z_k$ as defined above, or $C_k^{1/k}$, without much success. Can you do better?

References

(1998) Martin Raab and Angelika Steger. ``Balls into Bins'' - A Simple and Tight Analysis. Proceedings of the Second International Workshop on Randomization and Approximation Techniques in Computer Science.

(2005) Michael Mitzenmacher and Eli Upfal. Probability and computing: Randomized algorithms and probabilistic analysis.

(2008) Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory.

(2015) Bo Waggoner. $\ell_p$ Testing and Learning of Discrete Distributions. Proceedings of the 6th Innovations in Theoretical Computer Science Conference.