Think it through
Embrace the margins
The first step to Bayesianism is to stop thinking in all-or-nothing terms. Bayesians want to move past the dichotomy of you-either-believe-it-or-you-don’t, to start thinking of belief as something that comes in degrees. Those degrees can be measured on a zero to 100 per cent scale. If you’re certain an event will occur, that’s 100 per cent confidence. If you’re certain it won’t occur, that’s 0 per cent.
But again, Bayesians counsel against going to extremes. There are very few situations in which it makes sense to be certain that something will happen, or that it won’t. In his book Making Decisions (1971), the Bayesian Dennis Lindley approvingly cited Oliver Cromwell’s dictum to always ‘think it possible that you may be mistaken.’ Unless an event is strictly impossible, you shouldn’t be certain that it won’t occur.
All right, fine then. Maybe we shouldn’t assign anything that’s strictly speaking possible a confidence of zero. But we’ve all heard someone describe a possibility as ‘one in a million’. If something’s that improbable, it’s pretty much not going to happen, right? So, one in a million might as well be zero? The very same Dennis Lindley also said he was fine assigning a confidence of one in a million that the Moon is made of green cheese.
A common mistake when reasoning with probabilities is to think that a fraction of a percentage point – especially near such extreme values as 0 per cent or 100 per cent – really doesn’t matter. Any parent who’s been fortunate enough to get high-quality modern-day prenatal care will have seen genetic tests reporting how likely their growing fetus is to develop certain kinds of ailments and birth defects. I remember looking at probabilities like 0.0004 per cent and 0.019 per cent with my pregnant wife, and wondering what we should be worried about and what we could write off. Such small probability differences are difficult to grasp intuitively. But a condition with a probability of 0.019 per cent is almost 50 times as likely to occur as one with a probability of 0.0004 per cent.
It’s tempting to see a probability value like 0.0001 per cent – one in a million – and assume the difference between that and 0 per cent is little more than a rounding error. But an event with 0 per cent probability literally can’t happen, while events with a probability of 0.0001 per cent happen all the time. If you have a couple of minutes and some loose change, go flip a coin 20 times. (We’ll wait.) Whatever sequence of heads and tails you wound up observing, that specific sequence had a less than one-in-a-million chance of occurring.
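If you’d rather check the arithmetic than dig out a coin, here’s a minimal sketch in Python; the 20 flips and the one-in-a-million figure come straight from the example above, and the code is just one way to verify them:

```python
import random

# Any one specific sequence of 20 fair coin flips has probability (1/2)^20
p_sequence = 0.5 ** 20
print(f"Chance of one exact sequence: {p_sequence:.8f}")   # ~0.00000095
print(f"That's 1 in {1 / p_sequence:,.0f}")                # 1 in 1,048,576

# And yet some such 'one-in-a-million' sequence occurs every time we flip
print("".join(random.choice("HT") for _ in range(20)))
```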
To better assess the significance of the almost impossible and the almost certain, Bayesians sometimes switch from measuring probabilities on a percentage scale to measuring them with odds. If I bought you enough tickets to have a 0.001 per cent chance at winning the lottery, and bought your friend enough tickets to give him a 0.1 per cent chance, you might wonder how offended you should be. Putting those values in odds form, we see that I’ve given your friend a 1 in 1,000 shot and you only 1 in 100,000! Expressing the probabilities in odds form makes it clear that your friend has 100 tickets for every 1 of yours, and clarifies that these two probabilities – while admittedly both close to zero – are nevertheless importantly different.
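Here’s a small Python sketch of that conversion. The ticket percentages are the ones from the example; strictly speaking, odds compare successes to failures (1 to 999 rather than 1 in 1,000), but at probabilities this small the two forms are nearly identical:

```python
def one_in_n(p):
    """Express a probability as a '1 in N' chance."""
    return 1 / p

def odds_against(p):
    """Express a probability as odds against: N to 1."""
    return (1 - p) / p

for label, p in [("your tickets", 0.001 / 100), ("your friend's tickets", 0.1 / 100)]:
    print(f"{label}: 1 in {one_in_n(p):,.0f}, or about {odds_against(p):,.0f} to 1 against")
```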
Evidence supports what makes it probable
What did the Rev Bayes do to get a whole statistical movement named after him? Prior to Bayes, much probability theory concerned problems of ‘direct inference’. This is the kind of probability problem you were asked to solve many times in school. You’re told that two fair, six-sided dice are rolled, and are asked to calculate the probability that their sum will be eight. Put a bit more abstractly: you’re given a hypothesis about some probabilistic process in the world, and asked to compute the probability that it will generate a particular kind of evidence.
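That schoolbook answer can be checked by brute force. Here is a minimal Python sketch that simply enumerates all 36 equally likely rolls:

```python
from itertools import product

# Direct inference: two fair six-sided dice, what's the probability their sum is eight?
rolls = list(product(range(1, 7), repeat=2))        # all 36 equally likely outcomes
favourable = [r for r in rolls if sum(r) == 8]      # (2,6), (3,5), (4,4), (5,3), (6,2)
print(f"{len(favourable)}/{len(rolls)} = {len(favourable) / len(rolls):.3f}")  # 5/36 ≈ 0.139
```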
Bayes was interested in the opposite: so-called ‘inverse inference’. Suppose you observe some evidence, and want to infer back to a hypothesis about what kind of process in the world might have generated that evidence. In The Theory of Probability (1935), Hans Reichenbach listed many occasions on which we engage in reasoning with this structure:
The physician’s inferences, leading from the observed symptoms to the diagnosis of a specified disease, are of this type; so are the inferences of the historian determining the historical events that must be assumed for the explanation of recorded observations; and, likewise, the inferences of the detective concluding criminal actions from inconspicuous observable data.
Bayes’s most important contribution to inverse inference wasn’t recognised during his lifetime. After the reverend died in 1761, a Welsh minister named Richard Price published a theorem he had found in Bayes’s notes. This theorem was later independently rediscovered by Pierre-Simon Laplace, who was responsible for much of its early popularisation.
Price, Laplace and others promoted Bayes’s theorem as a rule for adjusting one’s confidence in a hypothesis after discovering some new piece of evidence. Modern Bayesians are called ‘Bayesians’ because of their adherence to Bayes’s Rule. According to Bayes’s Rule, your updated confidence in the hypothesis should be calculated from two factors: what your confidences looked like before you got the evidence (about which more later), and how strongly the evidence supports the hypothesis.
Here it pays to remember Bayesians’ aversion to absolutes. While it makes for good drama when a character learns a single piece of information that changes their whole worldview, most of life isn’t like that. Each new piece of information we gain changes only some of our opinions, and changes them incrementally – making us slightly more confident or slightly less confident that particular events will occur. This is because evidential support also comes in degrees: a piece of evidence might support some hypotheses weakly and others strongly; or one piece of evidence might support a particular hypothesis better than another.
To gauge how strongly evidence supports some hypothesis, ask how likely that hypothesis makes the evidence. Suppose you get home late from work one night, and walk in to find all the lights on in your home. You wonder who else is home – your husband? Your son? Well, your husband is constantly griping about the power bills, and walks around the house turning lights off all the time. But your teenage son barely notices his surroundings, and exits a room without a thought to how he’s left it. The evidence you’ve found is very likely if your son is in the house, and much less likely if your husband is home. So the evidence supports your son’s presence strongly, and your husband’s presence little or not at all.
Bayes’s Rule says that, once you gauge how much your new evidence supports various hypotheses, you should shift your confidence towards hypotheses that are better supported. However confident you were before you walked in the door that your husband or son was home, what you find inside should increase your confidence that your son is there, and decrease your confidence that your husband is. How much increase and decrease are warranted? That’s all sorted out by the specific mathematics of Bayes’s Rule. I’m trying to keep it light here and avoid equations, but the sources in the final section can fill in the details.
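For readers who do want a peek at the machinery, here is a rough sketch of the update in Python. Every number in it is invented purely for illustration, and the three hypotheses are treated as mutually exclusive to keep things simple:

```python
# Toy Bayesian update for the lights-left-on example (illustrative numbers only).
priors = {"son home": 0.30, "husband home": 0.40, "nobody home": 0.30}

# How probable is the evidence (every light blazing) under each hypothesis?
likelihoods = {"son home": 0.80, "husband home": 0.10, "nobody home": 0.05}

# Bayes's Rule: posterior is proportional to prior times likelihood, then normalise.
unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
total = sum(unnormalised.values())
posteriors = {h: weight / total for h, weight in unnormalised.items()}

for h in priors:
    print(f"{h}: {priors[h]:.0%} -> {posteriors[h]:.0%}")
```

With these made-up numbers, confidence that your son is home jumps from 30 per cent to roughly 81 per cent, while confidence that your husband is home falls from 40 per cent to about 14 per cent.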
Attend to all your evidence
A consistent theme of Bayesian thinking is that working in shades of confidence can get much more complex and subtle than thinking in absolutes. One of the nice features of conclusive, slam-dunk evidence is that it can’t be overridden by anything. If a mathematician proves some theorem, then nothing learned subsequently can ever undo that proof, or give us reason not to believe its conclusion.
Bayesianism aims to understand incremental evidence, coming to terms with the kinds of less-than-conclusive information we face every day. One crucial feature of such evidence is that it can always be overridden. This is the lifeblood of twisty mystery novels: an eyewitness said the killer held the gun in his left hand – but it turns out she was looking in a mirror – but the autopsy reveals the victim was poisoned before he was shot…
Because the significance of evidence depends so much on context, and because potential defeaters might always be lurking, it’s important not to become complacent with what one knows and to keep an open mind for relevant new information. But it’s also important to think thoroughly and carefully about the information one already has. Rudolf Carnap proposed the Principle of Total Evidence, which requires your beliefs about a question to incorporate and reflect all the evidence you possess relevant to that question.
Here’s a kind of relevant evidence we often overlook: besides having information about some topic, we often know something about how we got that information. Now, that’s not always true: I know that Abraham Lincoln was born in a log cabin, but I have no idea where I learned that titbit. But often – and especially in today’s uncertain media environment – it pays to keep track of one’s sources, and to evaluate whether the information you’ve received might have been selected for you in a biased way.
Sir Arthur Eddington gave an example in which you draw a large group of fish from a lake, and all of them are longer than six inches. Normally, this would be strong evidence that all the fish in the lake are at least that long. But if you know that you drew the fish using a net with six-inch holes, then you can’t draw what would otherwise be the reasonable conclusion from your sample.
Paying attention to how the evidence was selected can have important real-life consequences. In How Not to Be Wrong (2014), Jordan Ellenberg recounts a story from the Second World War: the US military showed the statistician Abraham Wald data indicating that planes returning from dogfights had more bullet holes in the fuselage than in the engine. The military was considering shifting armour from the engine to the fuselage, to better protect their pilots. Wald recommended exactly the opposite, on the grounds that the data came only from the planes that made it back: those planes could survive holes in the fuselage, while the planes that didn’t return had presumably taken hits to their engines, so that’s where the additional armour should go.
Don’t forget your prior opinions
You think carefully about the evidence you’ve just received. You’re careful to take it all into account, to consider context, and to remember where it came from. With all this in mind, you find the hypothesis that renders that evidence most probable, the hypothesis most strongly supported by that evidence. That’s the hypothesis you should now be most confident in, right?
Wrong. Bayes’s Rule says to respond to new evidence by increasing your confidence in the hypothesis that makes that evidence most probable. But where you land after an increase depends on where your confidence was before that evidence came in.
Adapting an example from the reasoning champion Julia Galef, suppose you’re crossing a college campus and stop a random undergraduate to ask for directions. This undergrad has a distracted, far-off look in their eye; wears clothes that one would never think of bringing near an iron; and seems slightly surprised to even be awake at this hour of the day. Should you be more confident that your interlocutor is a philosophy or a business major?
Easy answer: this look is much more typical of a philosophy major than a business major, so you should be more confident you’re dealing with the former. At a first pass, that answer seems backed up by the Bayesian thinking I’ve described. Just to pick some numbers (and be a bit unfair to philosophers), let’s suppose a third of all philosophy majors meet this description, but only one in 20 business majors does (the quants, perhaps?). On the hypothesis that the person you randomly stopped for directions is a philosophy major, the probability of your evidence is one in three. On the hypothesis that you stopped a business major, the probability is one in 20. So, the evidence of this student’s appearance more strongly supports the notion that they study philosophy.
But now consider the following: on my campus, there are currently just shy of 250 undergraduate philosophy majors and roughly 3,600 business majors. If the fractions in the previous paragraph are correct, we should expect there to be about 80 philosophy students on campus disconnected from their surroundings, and about 180 business majors. So, if you select a random undergrad, you’re still at least twice as likely to get a distracted business major as a distracted philosopher.
The key here is to remember that, before you appraised this student’s appearance, the odds were much, much greater that they were into business than philosophy. The evidence you gain from interacting with them should increase your confidence that they’re a philosopher, but increasing a small number can still leave it quite small!
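Here’s the same reasoning in a short Python sketch. The 250 and 3,600 are the majors counts quoted above, the fractions are the made-up ones from the example, and for simplicity the calculation ignores every other major on campus:

```python
# Base rates matter: counts of majors (from the text) times the made-up
# 'fits the description' rates, ignoring all other majors for simplicity.
majors = {"philosophy": 250, "business": 3600}
fits_description = {"philosophy": 1 / 3, "business": 1 / 20}

matching = {m: majors[m] * fits_description[m] for m in majors}
print(matching)  # roughly 83 philosophers and 180 business majors fit the look

total = sum(matching.values())
for m in majors:
    print(f"P({m} | looks the part) ≈ {matching[m] / total:.0%}")  # ~32% vs ~68%
```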
Bayes’s Rule demands that your updated confidence in a hypothesis after learning some evidence combines two factors: your prior confidence in the hypothesis, and how strongly it’s supported by the new evidence. Forgetting the former, and attending only to the latter, is known as the Base Rate Fallacy. Unfortunately, this fallacy is committed frequently by professionals, even those working with life-altering data.
Suppose a new medical test has been developed for a rare disease – only one in 1,000 people has this disease. The test is pretty accurate: someone with the disease will test positive 90 per cent of the time, while someone without the disease will test positive only 10 per cent of the time. You randomly select an individual, apply the test, and get a positive result. How confident should you be that they have the disease?
Most people – including trained medical professionals! – say you should be 80 per cent or 90 per cent confident that the individual has the disease. The correct answer, according to Bayes’s Rule, is under 1 per cent. What’s going on is that most respondents are so overwhelmed by the accuracy of the test (the strength of the evidence it produces) that they neglect how rare this disease is in the population.
But let’s do some quick calculations: suppose you applied this test to 10,000 randomly selected individuals. Around 10 of them would have the disease, so nine of them would get a positive test result. On the other hand, around 9,990 of the individuals you selected wouldn’t have the disease. Since the test gives healthy individuals a positive result 10 per cent of the time, these 9,990 healthy individuals would yield around 999 false positive tests. So having tested 10,000 people, you’d get a total of 1,008 positive results, of which only nine (just under 1 per cent) would be people who actually had the disease.
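The same arithmetic, written out as a minimal Python sketch:

```python
# Reproducing the back-of-the-envelope calculation from the text.
population  = 10_000
prevalence  = 1 / 1_000   # one person in 1,000 has the disease
sensitivity = 0.90        # a sick person tests positive 90% of the time
false_pos   = 0.10        # a healthy person tests positive 10% of the time

sick            = population * prevalence         # 10 people
true_positives  = sick * sensitivity              # 9 positive tests
healthy         = population - sick               # 9,990 people
false_positives = healthy * false_pos             # 999 positive tests

all_positives = true_positives + false_positives  # 1,008 positive tests
print(f"P(disease | positive test) ≈ {true_positives / all_positives:.1%}")  # ≈ 0.9%
```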
Again, when dealing with cases of extreme probabilities, it can help to think about the odds. A piece of evidence that strongly supports a hypothesis (like the reliable medical test just described) might multiply the odds of that hypothesis by a factor of 10, or even 100. But if the odds start small enough, multiplying them by 10 will only take you from one chance in 1,000 to one chance in 100, which is still a long shot.
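In code, the odds-form version of the medical-test update looks something like this. The Bayes factor of a positive result is 9 (a 90 per cent hit rate divided by a 10 per cent false-positive rate), and the prior odds are 1 to 999 against:

```python
# The medical-test update in odds form.
prior_odds   = 1 / 999          # one person with the disease for every 999 without
bayes_factor = 0.90 / 0.10      # positive tests are 9 times likelier if you're sick

posterior_odds = prior_odds * bayes_factor          # about 9 to 999, i.e. roughly 1 to 111
posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"Posterior probability ≈ {posterior_prob:.1%}")  # ≈ 0.9%
```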
Subgroups don’t always reflect the whole
Bayesians work a lot with conditional probabilities. Conditional probability arises when you consider how common some trait is among a subgroup of the population, instead of considering the population as a whole. If you pick a random American, they’re very unlikely to enjoy pizza made with an unleavened crust, topped with Provel cheese, and cut into squares. But conditional on the assumption that they grew up in St Louis, the probability that they’ll enjoy such a monstrosity is much higher.
Conditional probabilities can behave quite counterintuitively. Simple principles that one would think should be obvious can fail in spectacular fashion. The clearest example of this is Simpson’s Paradox.
Hopefully all of us have learned in our lives not to draw broad generalisations from a single example, or to assume that a small group is representative of the whole. A foreigner who judged American pizza preferences by visiting only St Louis would be seriously misled. By carelessness or sheer bad luck, we can stumble into a subpopulation that is unlike the others, and so bears traits that aren’t reflected by the population in general.
But Simpson’s Paradox demonstrates something much weirder than that: sometimes every subpopulation of a group has a particular trait, but that trait still isn’t displayed by the group as a whole.
In the 2016-17 NBA season, James Harden (then of the Houston Rockets) made a higher percentage of his two-point shot attempts than DeMar DeRozan (of the Toronto Raptors) made of his two-point shots. Harden also sank a higher percentage of his three-point attempts than DeRozan. Yet DeRozan’s overall field-goal percentage – the percentage of two-pointers and three-pointers combined that he managed to sink – was higher than Harden’s. Harden did better on both two-pointers and three-pointers, and those are the only kinds of shots that count towards field-goal percentage, yet DeRozan was better overall. How is that possible?
Pro hoops aficionados will know that, for any player, two-point shots are easier to hit than three-pointers, yet Harden stubbornly insists on making things difficult for himself. In the 2016-17 season, he attempted almost the same number of each kind of shot (777 three-pointers versus 756 two-pointers), while DeRozan attempted more than 10 times as many two-pointers as three-pointers. Even though Harden was better at each kind of shot, DeRozan made the strategic decision to take high-percentage shots much more often than low-percentage ones. So, he succeeded at an overall higher rate.
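A short Python sketch makes the arithmetic of Simpson’s Paradox easy to see. The shot counts below are invented for illustration rather than the players’ real statistics, but they have the same shape as the example: player A splits attempts nearly evenly between twos and threes, player B takes overwhelmingly two-pointers, and A shoots a better percentage in both categories:

```python
# Simpson's Paradox with invented shot counts (not real season statistics).
players = {
    "A": {"two_made": 400, "two_att": 756,  "three_made": 270, "three_att": 777},
    "B": {"two_made": 720, "two_att": 1400, "three_made": 40,  "three_att": 120},
}

for name, s in players.items():
    two_pct   = s["two_made"] / s["two_att"]
    three_pct = s["three_made"] / s["three_att"]
    overall   = (s["two_made"] + s["three_made"]) / (s["two_att"] + s["three_att"])
    print(f"{name}: twos {two_pct:.1%}, threes {three_pct:.1%}, overall {overall:.1%}")

# A is better on twos (52.9% vs 51.4%) and on threes (34.7% vs 33.3%),
# yet B is better overall (50.0% vs 43.7%) because B mostly takes the easier shots.
```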
The same phenomenon appeared when graduate departments at the University of California, Berkeley were investigated for gender bias in the 1970s. In 1973, 44 per cent of male applicants were admitted to Berkeley’s graduate school, while only 35 per cent of female applicants succeeded. Yet a statistical study found that individual departments (which actually made the admissions decisions) were letting in men and women at roughly equal rates, or even admitting women more often. The trouble was that some departments were much more difficult than others to get into (for all applicants!), and women were applying disproportionately to more selective fields.
Of course, that doesn’t eliminate all possibilities of bias; a study found that women were applying to more crowded fields because they weren’t given the undergraduate mathematical background to study subjects that were better-funded (and therefore could admit more students). But the broader point about conditional probabilities stands: you can’t assume that an overall population reflects trends in its subpopulations, even if those trends occur in all the subpopulations. You also have to consider the distribution of traits across subpopulations.