The Law of Large Numbers is one of the most frequently misunderstood concepts of probability and statistics.
You’ve probably fallen into the gambler’s fallacy a few times already: just because you lost ten blackjack games in a row, the law of large numbers doesn’t make you any more likely to win the next one.
Games (and other phenomena) of chance don’t have memories. The cards, coins, and dice don’t care what happened previously; they’ll behave the same way each time. On the other hand, randomness averages out in the long run. If you toss a coin a million times, it’ll land on heads approximately half the time.
What is the law of large numbers, then? Let’s answer this question.
Tossing coins
The strength of probability theory lies in its ability to translate complex random phenomena into coin tosses, dice rolls, and other simple experiments.
For simplicity, let’s stick with coin tossing. What will the average number of heads be if we toss a coin, say, a thousand times? To mathematically formalize this question, we’ll need random variables.
Tossing a fair coin is described by the Bernoulli distribution with parameter 1/2, so let X₁, X₂, … be independent and identically distributed Bernoulli(1/2) random variables, one for each toss.
By definition, the i-th Bernoulli random variable is 1 if the i-th toss comes up heads and 0 if it comes up tails.
What we are interested in is the behavior of the so-called sample average.
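Written out, the sample average of the first n tosses is

\bar{X}_n = \frac{X_1 + X_2 + \dots + X_n}{n}.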
Note that as we are talking about Bernoulli random variables, the numerator is simply the total number of heads. Thus, the sample average coincides with the relative frequency of heads.
We can simulate the coin tosses with the following Python code. (You can interactively follow along with this Google Colab notebook here.)
import numpy as np
from scipy.stats import bernoulli

n_tosses = 1000
idx = range(n_tosses)

# simulate the individual tosses: 1 for heads, 0 for tails
coin_tosses = [bernoulli.rvs(p=0.5) for _ in idx]

# running sample average, i.e. the relative frequency of heads after each toss
coin_toss_averages = [np.mean(coin_tosses[:k+1]) for k in idx]
Here is the result, plotted by the evergreen Matplotlib.
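(In case you’d like to reproduce the figure, a minimal Matplotlib sketch along the lines below works, reusing the variables from the snippet above.)

import matplotlib.pyplot as plt

plt.plot(idx, coin_toss_averages, label="sample average")              # running relative frequency of heads
plt.axhline(0.5, color="red", linestyle="--", label="expected value")  # the true mean, 1/2
plt.xlabel("number of tosses")
plt.ylabel("sample average")
plt.legend()
plt.show()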
As you can see, the sample average gets suspiciously close to 1/2 as the number of tosses grows. This value is not just the probability of heads, but also the expected value of a single toss.
(If you are not familiar with the concept of expected value, check out my recent post below.)
So, we conjecture that the sample averages converge to the expected value.
Can we conclude this from our simulation? Definitely not. Notice that the sample average is a random variable, so we might have been just lucky. Let’s toss even more coins to find that out.
Tossing even more coins
In the case of coin tossing, we can explicitly calculate the distribution of sample averages. As a single coin toss is Bernoulli(1/2)-distributed, their sum has a binomial distribution.
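Concretely, the probability that the first n tosses contain exactly k heads is

P(X_1 + \dots + X_n = k) = \binom{n}{k} \frac{1}{2^n},

and the sample average is just this sum divided by n.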
What does this distribution look like? Is it concentrated around 1/2? Let’s do more simulations to find out.
(Again, you can follow along interactively here.)
# each of the n_tosses rows is an independent sequence of n_tosses coin tosses
more_coin_tosses = bernoulli.rvs(p=0.5, size=(n_tosses, n_tosses))
# entry (i, j) is the sample average of the first j+1 tosses in the i-th sequence
more_coin_toss_averages = np.array([[np.mean(more_coin_tosses[i][:j+1]) for j in idx] for i in idx])
Here is the (empirical) distribution of the sample averages, once more visualized by the good old Matplotlib.
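(Again, the exact plotting code is not the point, but a rough sketch could look like this; the toss counts below are just illustrative choices.)

import matplotlib.pyplot as plt

for n in [10, 100, 1000]:
    # distribution of the sample average after n tosses, across all simulated sequences
    plt.hist(more_coin_toss_averages[:, n - 1], bins=50, alpha=0.6, label=f"n = {n}")
plt.xlabel("sample average")
plt.ylabel("number of sequences")
plt.legend()
plt.show()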
We can see that as the number of tosses grows, the distribution of the sample average becomes more and more concentrated around 1/2.
How can we express this in a mathematically precise way? Enter the weak law of large numbers.
The law of large numbers
So, we have just seen that the distribution of the sample averages tends to concentrate around the expected value, but it can assume other values with a small probability. How small exactly?
To study the general case, let X₁, X₂, … be independent and identically distributed random variables. The probability of the sample average falling farther than ε from the expected value μ is expressed by the following:
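P\left( \left| \bar{X}_n - \mu \right| > \varepsilon \right),

where \bar{X}_n = (X_1 + \dots + X_n)/n denotes the sample average of the first n variables.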
The so-called weak law of large numbers states that these probabilities vanish as the number of samples grows: for every ε > 0,
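\lim_{n \to \infty} P\left( \left| \bar{X}_n - \mu \right| > \varepsilon \right) = 0.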
However, this alone still doesn’t rule out the possibility that we simply got lucky with our first simulation.
Can the sample average of coin tosses converge to a number other than 1/2? In principle, yes. Think about the extremely unlikely case of only tossing tails, resulting in the sample average converging to zero.
Even though tossing only tails has zero probability, are there other similar outlier cases, perhaps happening with a nonzero probability?
In other words, what is the probability that the sample averages converge to the expected value? The strong law of large numbers states that this probability is one.
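In symbols,

P\left( \lim_{n \to \infty} \bar{X}_n = \mu \right) = 1.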
Thus, we were not just lucky: the sample averages almost surely converge to the expected value.
The gambler’s fallacy
There is a common misinterpretation of the law of large numbers, one that I already mentioned in the introduction.
Let’s play a simple game: I toss a fair coin, and if it comes up heads, you win $1. However, if it comes up tails, you lose $1.
Suppose that you have lost ten rounds in a row. It’s painful, I know. After all, the law of large numbers says that the frequency of heads should be close to 1/2. Clearly, the next toss should be a winning one, right?
No. First of all, the tosses are independent of each other, so the next toss doesn’t remember the previous ones.
But most importantly, now you understand why the law of large numbers cannot be applied in this context: because it applies only in the long run. Anything can happen in the short run.
Speaking of the long run: the last chapter in the early access of my Mathematics of Machine Learning book just happens to be about the law of large numbers. If you are interested in the details, like why the weak law is true, check it out!