**Wall of text below**
**TL;DR - The stats, for the most part, fail to support the claim that Howard Webb bottles big decisions.**
**Background:** A Telegraph article, "The stats that prove Howard Webb bottles big decisions", claimed that Howard Webb bottles big decisions and that statistics proved this. I wanted to look at whether the statistics actually support the claims made in the article.
**Setup:** I'll start by explaining the methodology for examining the claim that Howard Webb gives more home penalties than the average PL referee. The same methodology was used to examine the other claims in the article. The results and verdict are summarized below, along with a number of caveats.
Howard Webb gave 23 penalties, of which 15 were to the home team (65%)
All other PL referees gave 337 penalties, of which 203 were to the home team (60%)
Does Howard Webb have a home bias in giving penalties compared to the average referee?
**Methodology:** Consider the issue of giving penalties to home/away teams. Suppose referees handle that in the following way. Each referee has a coin with a certain chance p (between 0 and 100%) of coming up heads, so that if you flipped the coin a very large number of times, the fraction of tosses that came up heads would be very close to p. Every time the referee has to give a penalty, he tosses the coin. If it comes up heads, he gives the penalty to the home team; otherwise he gives it to the away team.
To check if Webb is different from other referees (or from pre-World-Cup Webb himself), we assume that all other PL referees are given coins with the same head probability p. So the question is now the same as asking whether Webb's coin has a different head probability than everyone else's. We can't observe the head probability of anyone's coin directly, but we do see the results of their coin flips and can estimate p from them. In our current example, Webb's coin produced 15 heads (home penalties) in 23 flips (total penalties). For the other PL referees, the coin was flipped 337 times and came up heads 203 times. Since we've assumed all other referees are identical, we can treat them as a single referee who gave 337 penalties, of which 203 went to the home team. If we imagine a string of H's (heads) and T's (tails) that is 337 characters long, it contains 203 H's (about 60%). This gives us an estimate of 60% (p=0.6) for the head probability of the average PL referee.
But if we examined a small set of coin flips from this long string, we have no guarantees that it will contain 60% H's. In the extreme case, if we only examined 1 character, it will be either an H (100% H's) or a T (0% H's). Then our estimate of p would be either 1 or 0. In Webb's string of coin tosses, our estimate of p is 65% (15/23), which seems larger than the average PL referee's 60%. However, this was estimated from only 23 coin flips and could therefore be different from the correct value.
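To put a rough number on how noisy an estimate from a short string of flips is, here is a minimal sketch using the textbook standard error of a proportion, sqrt(p(1-p)/n). The function name is made up for illustration, and the only inputs are the 203/337 and 23 from the figures above.

```python
import math

def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of the observed heads fraction after n flips of a coin with head probability p."""
    return math.sqrt(p * (1 - p) / n)

p_avg = 203 / 337  # pooled home-penalty fraction of the other PL referees

# Compare a single flip, Webb's 23 flips, and the pooled 337 flips.
for n in (1, 23, 337):
    print(f"n={n:3d}  standard error ~ {proportion_standard_error(p_avg, n):.3f}")
```

With 23 flips the standard error comes out at roughly 0.10, so the gap between Webb's 65% and the average 60% is only about half a standard error.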
To avoid this problem, we approach the question differently. Instead, suppose that Webb was just an average PL referee. Then his coin would be identical to everyone's and you could think that if Webb were allowed to make 337 penalty calls, 203 penalties would be given to the home team. If we had 337 penalty calls Webb had made as an average referee, we could choose a random subset of 23 calls and see how many of them were given to the home team (heads). From our 60% estimate of p, we expect the number of heads to be close to 14 (23*0.6=13.8). If we did this many times, we could plot a histogram of the number of heads we get in 23 tosses and we'd see some variability relative to 14. A histogram of this (posted on imgur) shows exactly that with the tossing repeated 100,000 times. In fact, we see that a large share of the time (about 40%), the number of heads is not just larger than 14 but at least 15 (Webb's count, which we suspected might be a high number of home penalties to give out of 23). So, about 40% of the time, if you asked the average PL referee to make 23 penalty calls, he would make 15 or more home penalty calls. This suggests that Howard Webb's home penalty calls don't indicate any systematic bias towards home teams, just some sampling noise.
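The resampling experiment described above can be sketched as follows. This is my reading of the procedure (drawing 23 calls without replacement from the 337 pooled calls), not the original script; the function name and the fixed seed are assumptions added for reproducibility.

```python
import random

def simulate_home_counts(total=337, home=203, sample_size=23, trials=100_000, seed=0):
    """Repeatedly pick `sample_size` calls from the pooled calls and tally the home-penalty counts."""
    rng = random.Random(seed)
    calls = [1] * home + [0] * (total - home)  # 1 = home penalty, 0 = away penalty
    counts = [0] * (sample_size + 1)           # counts[k] = number of trials with k home penalties
    for _ in range(trials):
        k = sum(rng.sample(calls, sample_size))  # sample without replacement
        counts[k] += 1
    return counts

counts = simulate_home_counts()
share = sum(counts[15:]) / sum(counts)
print(f"share of trials with 15 or more home penalties: {share:.3f}")
```

Plotting `counts` as a bar chart gives the histogram; the printed share is the simulated chance of an average referee matching or exceeding Webb's 15 home penalties.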
**Theoretical basis:** The framework used above is called hypothesis testing in statistics. It assumes that we have two hypotheses, a null or default hypothesis and an alternative hypothesis. We wish to reject (or fail to reject) the null hypothesis using observed data. Here, the "null" or default hypothesis is that Howard Webb's coin is the same as the average PL referee's coin. The "alternative" hypothesis that we want to test is that Howard Webb's coin comes down heads more often than the average PL ref's does. The test statistic (or quantity we observe) is the number of heads and tails Howard Webb's and the average PL ref's coin tosses produce.
To decide whether the null hypothesis can be rejected, hypothesis testing prescribes the following steps:
* Assume that the null hypothesis is true (i.e., Howard Webb's coin is the same as the avg. referee's, so that if Webb tossed the coin 337 times, it would come up heads 203 times).
* Calculate the chance that data produced under that assumption looks at least as extreme as what was actually observed (i.e., a set of 23 tosses chosen out of the 337 produces 15 or more heads, so that the data indicates an equal or larger home bias).
* If this chance is less than 5%, you claim that the null hypothesis can be rejected. If it is larger than 5%, you fail to reject the null hypothesis. You can make the threshold smaller than 5% if you want to reduce the possibility of rejecting the null hypothesis just by chance.
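
The steps above can also be carried out exactly instead of by simulation. A minimal sketch, assuming the "23 tosses chosen out of 337" model is sampling without replacement (a hypergeometric tail); the post does not say which formula produced the reported probabilities, so the result may differ slightly from them. The function name is hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k_obs, sample_size, successes, population):
    """P(X >= k_obs) when drawing sample_size items without replacement
    from a population that contains `successes` heads."""
    total = comb(population, sample_size)
    upper = min(sample_size, successes)
    favourable = sum(
        comb(successes, k) * comb(population - successes, sample_size - k)
        for k in range(k_obs, upper + 1)
    )
    return favourable / total

ALPHA = 0.05  # the 5% threshold from the last step

# Home penalties: 15 of Webb's 23, against 203 of 337 for the other referees.
p_value = hypergeom_upper_tail(15, 23, 203, 337)
verdict = "reject" if p_value < ALPHA else "fail to reject"
print(f"P(>= 15 home penalties out of 23) = {p_value:.4f} -> {verdict} the null hypothesis")
```

For the "too few" questions the same idea applies with the lower tail, summing from 0 up to the observed count.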
**Results:**
1. Of 23 penalties Webb gave 15 were to the home team - IS THAT TOO MANY?
Probability that the avg PL referee gives >= 15 home penalties out of 23 = 0.3973578
*FAIL TO REJECT HYPOTHESIS THAT POST-WC-WEBB IS THE SAME AS THE AVG PL REF*
2. Of 24 penalties Webb gave 4 were given in the last 15 min- IS THAT TOO FEW?
Probability that the avg PL referee gives <= 4 late penalties out of 24 = 0.2872285
*FAIL TO REJECT HYPOTHESIS THAT POST-WC-WEBB IS THE SAME AS THE AVG PL REF*
3. Of 11 penalties Webb gave 1 was crucial - IS THAT TOO FEW?
Probability that the avg referee gives <= 1 crucial penalties out of 11 = 0.002772367
**REJECT HYPOTHESIS THAT POST-WC-WEBB IS THE SAME AS THE AVG PL REF WRT CRUCIAL PENALTIES**
4. Of 23 penalties Webb gave 15 were to the home team - IS THAT TOO MANY?
Probability that pre-WC Webb gives >= 15 home penalties out of 23 = 0.280075
*FAIL TO REJECT HYPOTHESIS THAT POST-WC-WEBB IS THE SAME AS PRE-WC-WEBB*
5. Of 24 penalties Webb gave 4 were given in the last 15 min- IS THAT TOO FEW?
Probability that pre-WC Webb gives <= 4 late penalties out of 24 = 0.3232952
*FAIL TO REJECT HYPOTHESIS THAT POST-WC-WEBB IS THE SAME AS PRE-WC-WEBB*
6. Of 11 penalties Webb gave 1 was crucial - IS THAT TOO FEW?
Probability that pre-WC Webb gives <= 1 crucial penalties out of 11 = 0.001276239
**REJECT HYPOTHESIS THAT POST-WC-WEBB IS THE SAME AS PRE-WC-WEBB WRT CRUCIAL PENALTIES**
7. Of 22 red cards post-WC Webb gave 8 were given to the home team - IS THAT TOO FEW?
Probability that pre-WC Webb gives <= 8 home red cards out of 22 = 0.5278146
*FAIL TO REJECT HYPOTHESIS THAT POST-WC-WEBB IS THE SAME AS PRE-WC-WEBB*
Probability that the avg PL referee gives <= 8 home red cards out of 22 = 0.4431766
*FAIL TO REJECT HYPOTHESIS THAT POST-WC-WEBB IS THE SAME AS THE AVG PL REF*
8. Of 22 red cards post-WC Webb gave 1 was given with a penalty - IS THAT TOO FEW?
Probability that pre-WC Webb gives <= 1 red cards with penalty out of 22 = 0.1309911
*FAIL TO REJECT HYPOTHESIS THAT POST-WC-WEBB IS THE SAME AS PRE-WC-WEBB*
Probability that the avg PL referee gives <= 1 red cards with penalty out of 22 = 0.2079409
*FAIL TO REJECT HYPOTHESIS THAT POST-WC-WEBB IS THE SAME AS THE AVG PL REF*
9. Of 65 red cards Webb gave 0 were given early - IS THAT TOO FEW?
Probability that the avg PL referee gives <= 0 red cards early out of 65 = 0.05881382
*FAIL TO REJECT HYPOTHESIS THAT POST-WC-WEBB IS THE SAME AS THE AVG PL REF*
**Verdict:** Of the 11 statistical tests, only 2 reject the null hypothesis that post-WC-Webb is similar to the average PL ref or pre-WC-Webb. Both of them are related to crucial penalties, which are penalties in the last half hour of a match that had the potential to change the result of a match. So there is probably some truth to the claim that Howard Webb avoids making game-changing penalty decisions late in the game, but little to no evidence (based on this data) for any of the other claims.
**Caveats:** A number of caveats apply here because I can only access the data in the article and not the underlying source.
* In reality, all other PL refs are not likely to be identical. Some referees probably have p larger than 60% and some smaller.
* The data may contain some outlier refs. The outlier sets might be different for different questions.
* The coin flipping model makes assumptions about the variance of the underlying probability distribution. Real data often has higher variance. In a more general sense, I have not tested how appropriate this model is for these scenarios.
* Hypothesis testing only allows us to reject or fail to reject the null hypothesis. The 5% threshold means that even if Howard Webb were indeed an average PL ref, there is up to a 5% chance that I would reject that hypothesis by looking at his decisions. I have not shown what chance I have (ideally it would be 100%) of rejecting the hypothesis of similarity if Howard Webb is in fact different from the average PL ref.
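
To illustrate the last caveat, here is a rough sketch of how that chance (the power of the test) could be estimated for the home-penalty question. It uses a plain binomial model rather than sampling from the 337 pooled calls, and the "true" head probability of 0.75 for a biased Webb is a made-up value chosen purely for illustration. The function names are hypothetical too.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for n independent flips with head probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def power_of_test(n, p_null, p_alt, alpha=0.05):
    """Return (critical count, chance of reaching it when the true probability is p_alt)."""
    # Smallest count whose upper-tail probability under the null falls below alpha.
    critical = next(k for k in range(n + 2) if binom_upper_tail(k, n, p_null) < alpha)
    return critical, binom_upper_tail(critical, n, p_alt)

p_null = 203 / 337  # average PL referee
p_alt = 0.75        # ASSUMPTION: hypothetical head probability for a biased referee

critical, power = power_of_test(23, p_null, p_alt)
print(f"reject if home penalties >= {critical} out of 23; power at p={p_alt}: {power:.2f}")
```

With only 23 penalties to go on, the power against a moderate bias is well short of 100%, so failing to reject does not show that Webb is identical to the average ref.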