Is this significant? How to understand statistics

 

Any time you draw conclusions from data, there are two key questions you need to ask yourself: first, is the result statistically significant? Second, is it practically significant? In this blog post, we will explain what statistical and practical significance mean, and explore some common mistakes and the dangers of getting things wrong, before discussing the approach we take at MeVitae. 

At their most fundamental, statistical significance is a measure of probability and practical significance is a measure of effect. Statistical significance tells you how likely a result is to be correct and practical significance tells you how big that effect is in practice. 

For example, take a cure for the common cold. A common cold cure is practically significant if it reduces the duration of your cold by more than a day. It is statistically significant if we are confident in the result, i.e., if we are fairly certain that the cure actually works and the results are not due to random chance. 

How to think like a statistician (A historical interlude) 

 

200 years ago Laplace used Bayesian statistics to correctly estimate the mass of Saturn. He made a bet “of 11,000 to 1 that the error of this result is not 1/100th of its value.” He would have won his bet.

 

Statistics has been a battleground between two major camps called Frequentists and Bayesians. Frequentists think that probabilities represent long-running averages, for example, if I toss a coin lots of times it will come up heads half the time. Bayesians think that probabilities represent our level of certainty in a statement, i.e. I am 50% certain that next time I toss a coin it will come up heads. This slightly arcane-sounding distinction can lead to vastly different maths and therefore different conclusions. 

Bayesian statistics was invented by the Reverend Thomas Bayes in the 18th Century to make better bets when gambling. At the time his discovery went largely unnoticed, and his work was published posthumously by a man called Richard Price (who used Bayes’ work to try and prove the existence of God). Bayesian statistics was independently rediscovered decades later by a French mathematician called Laplace, who properly formalized it in the way we understand it today. To demonstrate its power, 200 years ago Laplace used Bayesian statistics to estimate the mass of Saturn. Laplace found that Saturn’s mass is 1/3512 that of the Sun and said, “It is a bet of 11,000 to 1 that the error of this result is not 1/100th of its value.” According to NASA (in the 21st Century), he would have won his bet! 

Even though Bayesian statistics is incredibly powerful, many mathematicians found it uncomfortable (and many still do). They did not like the “woolly” feeling of the Bayesian approach and instead pushed the Frequentist philosophy. Around World War 2, the Allies started secretly using Bayesian statistics. They did not care that it was unfashionable: it worked! By reading just a small number of license plates and conducting Bayesian analysis, they could estimate the number of Nazi tanks more accurately than spies in factories could. The Allies located Nazi submarines with Bayesian search theory and even used Bayesian statistics to help crack the Enigma machine. 

During the Cold War, Prof. David Blackwell provided a great example highlighting how thinking like a Bayesian is the only sensible way to think about probability. He was working at the RAND Corporation, helping plan for a potential nuclear war. If war is imminent, resources should be put into evacuating people from big cities. If it is not, the resources should be put into building bunkers or missile defence systems. He wanted to know how likely a nuclear war was in the next five years. Frequentist statisticians told him that because this was not a repeating event they could not calculate long-running averages, that war was either certain or impossible, and that they could only answer the question in five years’ time! This was not a particularly helpful answer, and Blackwell became a devout Bayesian, discovering important theorems and training many future statisticians. 

Statistical Significance 

 

Unfortunately, Frequentist P-Values are still widely used in medical research, although there are many efforts to phase them out.

 

Let us say that we have measured the success chances of men and women in an application process. Is the difference between genders statistically significant, i.e. how likely is it that there is indeed a difference between male and female success chances? 

It turns out that Frequentists cannot directly answer the question. Instead, they produce roundabout alternative approaches and give them complicated names. The main Frequentist approach would be to calculate a so-called “P-Value”. The P-Value approach is to first assume what is called the null hypothesis: that there is no difference between the genders and that both men and women are equally successful. They would then calculate how likely the data are given the null hypothesis. If the probability of observing data at least as extreme as these is less than 5%, they would claim a statistically significant effect. That is, if the data are unlikely given the null hypothesis, the null hypothesis is rejected. It is important to note that this analysis has not directly measured the statistical significance of any gender bias; it is being used as a proxy. 
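
To make the recipe concrete, here is a minimal sketch of the P-Value approach in Python, using scipy's chi-squared contingency test. The application counts are invented purely for illustration and are not real data.

```python
# A minimal sketch of the frequentist P-Value recipe described above.
# The counts are hypothetical, chosen only to make the example run.
from scipy.stats import chi2_contingency

# Rows: men, women; columns: successful, unsuccessful applications.
observed = [[45, 455],   # 500 male applicants, 9% success
            [30, 470]]   # 500 female applicants, 6% success

# Test of the null hypothesis that success chances are equal across genders.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"P-Value under the null hypothesis: {p_value:.3f}")
if p_value < 0.05:
    print("Frequentist verdict: 'statistically significant' difference.")
else:
    print("Frequentist verdict: no significant difference detected.")
```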

One of the key flaws with this approach (other than not directly answering the question) is that we know it does not work reliably. Even if the null hypothesis is true and there is no real difference between male and female success chances, some of the time the data will randomly look like there is a difference. Whilst such scenarios are unlikely, they do happen. Therefore, if you were to collect data on 100 job adverts where men and women had equal success chances, the P-Value approach would typically find statistically significant effects in around 5 of them by mistake, all just by random chance. Unfortunately, P-Values are still widely used in medical research, although there are many efforts to phase them out.
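
A short simulation makes this failure mode visible. The sample sizes and the 10% baseline success rate below are assumptions chosen so the example runs quickly; the point is only that roughly 5 in 100 "null" adverts come out as significant.

```python
# Simulate 100 job adverts where men and women genuinely have identical
# success chances, then analyse each with the same P-Value recipe.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n_adverts, n_per_group, true_rate = 100, 500, 0.10  # assumed numbers

false_positives = 0
for _ in range(n_adverts):
    male_hits = rng.binomial(n_per_group, true_rate)
    female_hits = rng.binomial(n_per_group, true_rate)
    table = [[male_hits, n_per_group - male_hits],
             [female_hits, n_per_group - female_hits]]
    _, p_value, _, _ = chi2_contingency(table)
    if p_value < 0.05:
        false_positives += 1  # a "significant" difference that is not real

print(f"Spurious 'significant' findings: {false_positives} out of {n_adverts}")
# Expect roughly 5, purely by random chance.
```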

Bayesians, on the other hand, can directly answer the question. They can calculate the probability that there is a difference between the success chances for men and women. Indeed, at MeVitae we take a Bayesian approach. Our analytics tools report how likely any observed differences between protected groups are to be real. 
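
As an illustration of what directly answering the question can look like, here is a simple Beta-Binomial sketch using the same invented counts as above. This is only a toy model with uniform priors, not MeVitae's production analytics.

```python
# A toy Bayesian treatment: put Beta posteriors on each group's success
# chance and ask directly how likely one is to be higher than the other.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts: successes and total applicants per group.
male_successes, male_total = 45, 500
female_successes, female_total = 30, 500

# With uniform Beta(1, 1) priors, each posterior is Beta(successes + 1, failures + 1).
male_rate = rng.beta(male_successes + 1, male_total - male_successes + 1, 100_000)
female_rate = rng.beta(female_successes + 1, female_total - female_successes + 1, 100_000)

diff = male_rate - female_rate
prob_men_higher = (male_rate > female_rate).mean()

print(f"P(male success chance > female success chance) ~ {prob_men_higher:.2f}")
print(f"Posterior mean difference: {diff.mean():.3f} "
      f"(95% credible interval {np.quantile(diff, 0.025):.3f} "
      f"to {np.quantile(diff, 0.975):.3f})")
```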

Practical Significance 

 

By correctly understanding relative and absolute impact, we know that whilst buying two lottery tickets doubles our chances of winning, we are still very unlikely to win.

 

Once we have understood whether a result is statistically significant, i.e. how likely it is to be real, we can assess its practical significance. The main stumbling block here is confusing relative and absolute impact. 

Certain journalists and newspapers are particularly guilty of using relative impact to generate headlines when the absolute impact is small. Examples include scare stories relating to cancer risk, the health benefits of superfoods, and the impacts of government policies on crime rates. 

A simple example of the difference between relative and absolute impact comes from playing the EuroMillions lottery. The chances of me winning the lottery are around one in 140 million. I can double my chances of winning by buying two tickets. The relative impact is huge: a 100% increase in my chance of winning. The absolute impact is tiny: I have increased my odds by just one in 140 million, certainly not enough for me to quit my day job. 
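
A few lines of Python make the arithmetic explicit (the one-in-140-million odds are approximate):

```python
# Relative vs absolute impact for the lottery example.
one_ticket = 1 / 140_000_000
two_tickets = 2 / 140_000_000

relative_change = (two_tickets - one_ticket) / one_ticket  # 1.0, i.e. +100%
absolute_change = two_tickets - one_ticket                 # still negligible

print(f"Relative increase in winning chance: {relative_change:.0%}")
print(f"Absolute increase in winning chance: {absolute_change:.1e}")
```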

Turning back to recruitment, suppose some costly new diversity initiative reduces the difference in success chances between men and women at some stage of the recruitment pipeline by 50%. If the initial gender difference was very small, a 50% reduction in this small difference would have very little absolute impact, and therefore the practical significance is small. Those resources might be better spent on some other initiative that could have a larger absolute impact, and therefore a larger practical significance. 
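
The same arithmetic applies to the pipeline example; the success chances below are invented purely for illustration.

```python
# A hypothetical stage where the gender gap is already small.
male_success, female_success = 0.100, 0.098   # assumed success chances
gap_before = male_success - female_success    # 0.2 percentage points

gap_after = gap_before * 0.5                  # the initiative halves the gap

print("Relative reduction: 50%")
print(f"Absolute reduction: {100 * (gap_before - gap_after):.2f} percentage points")
# The headline '50% reduction' amounts to just 0.1 percentage points here.
```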

Typically, we can detect effects with large practical significance with less data than are needed to detect smaller practical differences. For example, if there is a small decrease in diversity throughout your recruitment pipeline, we would need a large amount of hiring data to detect it. If, however, there was a large drop at a certain step, we could detect it with much less data. Taking a Bayesian approach, we can place upper and lower limits on the practical size of any changes in diversity and on the difference in success chances between groups. 
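
As a rough illustration, using the same toy Beta-Binomial model as above with invented success rates, the 95% credible interval on the difference between groups narrows as more applicants are observed, so a small gap only becomes distinguishable from zero once there is a lot of data.

```python
# How much data we need depends on the size of the effect we are trying to
# detect: the credible interval on the group difference shrinks with data.
import numpy as np

rng = np.random.default_rng(0)

def credible_interval_width(n_per_group, male_rate=0.10, female_rate=0.08, draws=100_000):
    """Width of the 95% credible interval on the difference in success chances."""
    male_hits = rng.binomial(n_per_group, male_rate)
    female_hits = rng.binomial(n_per_group, female_rate)
    male_post = rng.beta(male_hits + 1, n_per_group - male_hits + 1, draws)
    female_post = rng.beta(female_hits + 1, n_per_group - female_hits + 1, draws)
    low, high = np.quantile(male_post - female_post, [0.025, 0.975])
    return high - low

for n in (100, 1_000, 10_000):
    print(f"{n:>6} applicants per group: interval width ~ {credible_interval_width(n):.3f}")
```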

How we use statistics to empower recruiters 

We are developing an analytics dashboard that plugs straight into your applicant tracking system (ATS) to measure how diversity changes across your recruitment pipeline. It uses Bayesian statistics and artificial intelligence to produce easy-to-understand reports on how diversity changes, where there is statistical evidence of bias or unfairness, and the practical significance of anything we detect. We also use rigorous statistics to ensure our algorithms are fair and accurate before deployment and before any update. 

Get in touch if you would like to learn more. 

 Author: Luke Jew (Data Science Research Manager)