In
some ways, this is what you would expect from any large, disordered system.
Think about the predictable and quantifiable way that gases behave. It might be
impossible to trace the movement of each individual gas molecule, but the
uncertainty and disorder at the molecular level wash out when you look at the
bigger picture. Similarly, larger regularities emerge from our individually
unpredictable lives. It’s almost as though we woke up each morning with a
chance, that day, of becoming a murderer, causing a car accident, deciding to
propose to our partner, being fired from our job. “An assumption of ‘chance’
encapsulates all the inevitable unpredictability in the world,” Spiegelhalter
writes.
But it’s one
thing when your aim is to speak in general terms about who we are together, as
a collective entity. The trouble comes when you try to go the other way—to
learn something about us as individuals from how we behave as a collective.
And, of course, those answers are often the ones we most want.
The dangers of making individual predictions
from our collective characteristics were aptly demonstrated in a deal struck by
the French lawyer André-François Raffray in 1965. He agreed to pay a
ninety-year-old woman twenty-five hundred francs every month until her death, whereupon
he would take possession of her apartment in Arles.
At the time,
the average life expectancy of French women was 74.5 years, and Raffray, then
forty-seven, no doubt thought he’d negotiated himself an auspicious contract.
Unluckily for him, as Bill Bryson recounts in his new book, “The Body,” the
woman was Jeanne Calment, who went on to become the oldest person on record.
She survived for thirty-two years after their deal was signed, outliving
Raffray, who died at seventy-seven. By then, he had paid more than twice the
market value for an apartment he would never live in.
Raffray
learned the hard way that people are not well represented by the average. As
the mathematician Ian Stewart points out in “Do Dice Play God?” (Basic), the average person has one breast
and one testicle. In large groups, the natural variability among human beings
cancels out, the random zig being countered by the random zag; but that
variability means that we can’t speak with certainty about the individual—a
fact with wide-ranging consequences.
Every day,
millions of people, David Spiegelhalter included, swallow a small white statin
pill to reduce the risk of heart attack and stroke. If you are one of those
people, and go on to live a long and happy life without ever suffering a heart
attack, you have no way of knowing whether your daily statin was responsible or
whether you were never going to have a heart attack in the first place. Of a
thousand people who take statins for five years, the drugs will help only
eighteen to avoid a major heart attack or stroke. And if you do find yourself
having a heart attack, you’ll never know whether it was delayed by taking the
statin. “All I can ever know,” Spiegelhalter writes, “is that on average it
benefits a large group of people like me.”
That’s the
rule with preventive drugs: for most individuals, most of those drugs won’t do
anything. The fact that they produce a collective benefit makes them worth
taking. But it’s a pharmaceutical form of Pascal’s wager: you may as well act
as though God were real (and believe that the drugs will work for you), because
the possible consequences of not doing so outweigh the inconvenience.
There is so
much that, on an individual level, we don’t know: why some people can smoke and
avoid lung cancer; why one identical twin will remain healthy while the other
develops a disease like A.L.S.; why some otherwise similar children flourish at
school while others flounder. Despite the grand promises of Big Data,
uncertainty is so abundant that specific human lives remain boundlessly
unpredictable. Perhaps the most successful prediction engine of the Big Data
era, at least in financial terms, is the Amazon recommendation algorithm. It’s
a gigantic statistical machine worth a huge sum to the company. Also, it’s
wrong most of the time. “There is nothing of chance or doubt in the course
before my son,” Dickens’s Mr. Dombey says, already imagining the business
career that young Paul will enjoy. “His way in life was clear and prepared, and
marked out before he existed.” Paul, alas, dies at age six.
And yet,
amid the oceans of unpredictability, we’ve somehow managed not to drown.
Statisticians have navigated a route to maximum certainty in an uncertain world.
We might not be able to answer individual questions, like “How long will I
live?,” but questions like “How many patient deaths are too many?” can be
tackled. In the process, a powerful idea has arisen to form the basis of modern
scientific research.
A stranger hands you a coin. You have your
suspicions that it’s been weighted somehow, perhaps to make heads come up more
often. But for now you’ll happily go along with the assumption that the coin is
fair.
You toss the
coin twice, and get two heads in a row. Nothing to get excited about just yet.
A perfectly fair coin will throw two heads in a row twenty-five per cent of the
time—and that probability, the chance of a result at least this extreme from a fair coin, is known as the p-value. You keep tossing and get another head.
Then another. Things are starting to look fishy, but even if you threw the coin
a thousand times, or a million, you could never be absolutely sure it was
rigged. The chances might be minuscule, but, in theory, a fair coin could still
produce any run of heads.
Scientists
have picked a path through all this uncertainty by setting an arbitrary
threshold, and agreeing that anything beyond that point gives you grounds for
suspicion. Since 1925, when the British statistician Ronald Fisher first
suggested the convention, that threshold has typically been set at five per
cent. You’re seeing a suspicious number of heads, and once the chance of a fair
coin turning up at least as many heads as you’ve seen dips below five per cent,
you can abandon your stance of innocent until proved guilty. In this case, five
heads in a row, with a p-value of 3.125 per cent, would do it.
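The arithmetic can be sketched in a few lines of Python (the code and names here are illustrative, not drawn from any of the books under review): the p-value for an unbroken run of heads is simply one half raised to the number of tosses, and Fisher’s convention flags anything below five per cent.

```python
# A sketch, not part of the original argument: the p-value for a run of
# n heads from a fair coin is 0.5 ** n, and Fisher's convention flags
# anything below five per cent.

def p_value_for_run(n_heads):
    """Probability that a fair coin turns up heads on every one of n_heads tosses."""
    return 0.5 ** n_heads

for n in range(1, 6):
    p = p_value_for_run(n)
    verdict = "suspicious" if p < 0.05 else "no grounds for suspicion"
    print(f"{n} heads in a row: p = {p:.4%} ({verdict})")
# Five heads in a row gives p = 3.1250%, dipping below the threshold.
```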
This
is the underlying principle behind how modern science comes to its conclusions.
It doesn’t matter if we’re uncovering evidence for climate change or deciding
whether a drug has an effect: the concept is identical. If the results are too
unusual to have happened by chance—at least, not more than one time out of
twenty—you have reason to think that your hypothesis has been vindicated.
“Statistical significance” has been established.
Take a
clinical trial on aspirin run by the Oxford medical epidemiologist Richard Peto
in 1988. Aspirin interferes with the formation of blood clots, and can be used
to prevent them in the arteries of the heart or the brain. Peto’s team wanted
to know whether aspirin increased your chances of survival if it was
administered in the middle of a heart attack.
Their trial
involved 17,187 people and showed a remarkable effect. In the group that was
given a placebo, 1,016 patients died; of those who had taken the aspirin, only
804 died. Aspirin didn’t work for everyone, but it was unlikely that so many
people would have survived if the drug did nothing. The numbers passed the
threshold; the team concluded that the aspirin was working.
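For readers who want to see the mechanics, a conventional chi-squared test on a two-by-two table gives the flavor of the calculation. The article reports only the total enrolment and the deaths in each arm, so the near-equal split between the groups in the sketch below is an assumption, not a figure from the trial report.

```python
# A rough check of the aspirin figures quoted above. The article gives the
# total enrolment (17,187) and the deaths in each arm (1,016 on placebo,
# 804 on aspirin); the near-equal split between the arms is an assumption.
from scipy.stats import chi2_contingency

placebo_n, aspirin_n = 8_600, 8_587          # assumed split of 17,187 patients
placebo_deaths, aspirin_deaths = 1_016, 804

table = [
    [placebo_deaths, placebo_n - placebo_deaths],
    [aspirin_deaths, aspirin_n - aspirin_deaths],
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.1f}, p = {p_value:.1e}")
# The p-value lands far below the five-per-cent threshold, which is why the
# difference in deaths was judged too large to be a fluke.
```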
Such
statistical methods have become the currency of modern research. They’ve helped
us to make great strides forward, to find signals in noisy data. But, unless
you are extraordinarily careful, trying to erase uncertainty comes with
downsides. Peto’s team submitted the results of their experiment to an
illustrious medical journal, which came back with a request from a referee:
could Peto and his colleagues break the results down into groups? The referee
wanted to know how many women had been saved by the aspirin, how many men, how
many with diabetes, how many in this or that age bracket, and so on.
Peto
objected. By subdividing the big picture, he argued, you introduce all kinds of
uncertainty into the results. For one thing, the smaller the size of the groups
considered, the greater the chance of a fluke. It would be “scientifically
stupid,” he observed, to draw conclusions on anything other than the big
picture. The journal was insistent, so Peto relented. He resubmitted the paper
with all the subgroups the referee had asked for, but with a sly addition. He
also subdivided the results by astrological sign. It wasn’t that astrology was
going to influence the impact of aspirin; it was that, just by chance, the
number of people for whom aspirin works will be greater in some groups than in
others. Sure enough, in the study, it appeared as though aspirin didn’t work
for Libras and Geminis but halved your risk of death if you happened to be a
Capricorn.
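A toy simulation, with every number invented, makes the hazard concrete: give a drug exactly the same benefit for every patient, then slice the results by zodiac sign, and chance alone will make some signs look unresponsive and others spectacularly lucky.

```python
# A toy simulation of Peto's point, with invented numbers: a drug whose
# benefit is identical for every patient still looks useless in some
# arbitrary subgroups and miraculous in others, purely by chance.
import random

random.seed(1)
SIGNS = ["Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo",
         "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces"]

def simulate_arm(n_patients, death_rate):
    """Assign each patient a random sign and record deaths at a uniform rate."""
    deaths = {s: 0 for s in SIGNS}
    totals = {s: 0 for s in SIGNS}
    for _ in range(n_patients):
        sign = random.choice(SIGNS)
        totals[sign] += 1
        deaths[sign] += random.random() < death_rate
    return deaths, totals

placebo_deaths, placebo_totals = simulate_arm(8_600, 0.12)    # invented rates
aspirin_deaths, aspirin_totals = simulate_arm(8_600, 0.094)

for sign in SIGNS:
    placebo_rate = placebo_deaths[sign] / placebo_totals[sign]
    aspirin_rate = aspirin_deaths[sign] / aspirin_totals[sign]
    print(f"{sign:>11}: placebo {placebo_rate:.1%} vs aspirin {aspirin_rate:.1%}")
# Some signs will show almost no benefit and others an exaggerated one,
# even though the underlying effect is exactly the same for everyone.
```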
Using
sufficiently large groups might help to insure against flukes, but there’s
another trap that befalls unsuspecting scientists. It’s one that Peto’s
experiment also serves to underline, and one that has led to nothing less than
a statistical crisis at the heart of science.
The easiest way to understand the issue is by
returning to the conundrum of the biased coin. (Coins are the statistician’s
pet example for a reason.) Suppose that you’re particularly keen not to draw a
false conclusion, and decide to hang on to your hypothesis that the coin is
fair unless you get twenty heads in a row. A fair coin would do this only about
one in a million times, so it’s an extraordinarily high level of proof to
demand—far beyond the threshold of five per cent used by much of science.
Now, imagine
I gave out fair coins to every person in the United States and asked everyone
to complete the same test. Here’s the issue: even with a threshold of one in a
million—even with everything perfectly fair and aboveboard—we would still
expect around three hundred of these people to throw twenty heads in a row. If
they were following Fisher’s method, they’d have no choice but to conclude that
they’d been given a trick coin. The fact is that, wherever you decide to set
the threshold, if you repeat your experiment enough times, extremely unlikely outcomes
are bound to arise eventually.
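The arithmetic behind that thought experiment takes only a few lines (the population figure below is an approximation):

```python
# The back-of-the-envelope arithmetic behind the coin hand-out; the U.S.
# population figure is an approximation.
p_twenty_heads = 0.5 ** 20            # roughly one in a million
us_population = 330_000_000
expected_false_alarms = p_twenty_heads * us_population
print(f"p = {p_twenty_heads:.1e}; expected false alarms ≈ {expected_false_alarms:.0f}")
# Around three hundred perfectly fair coins get condemned as rigged.
```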
Apple
learned this shortly after the iPod Shuffle was launched. The device would play
songs from a user’s library at random, but Apple found itself inundated with
complaints from users who were convinced that their Shuffle was playing songs
in a pattern. Patterns are much more likely to occur than we think. Even if
several songs by the same artist, or consecutive songs from an album, had only
a tiny probability of appearing next to one another in a playlist, so many
people were listening to their iPods that such seemingly strange coincidences
were bound to occur.
In science,
the situation is starker, and the stakes are higher. With a threshold of only
five per cent, one in twenty studies will inadvertently find evidence for
nonexistent phenomena in its data. That’s another reason that Peto resisted the
proposal that he look at various subpopulations: the greater the number of
groups you look at, the greater your chances of seeing spurious effects. And this
is far from being only a theoretical concern. In medicine, a study of
forty-nine of the most cited medical publications from 1990 to 2003 found that
the conclusions of sixteen per cent were contradicted by subsequent studies.
Psychology fares worse still in these surveys (possibly because its studies are
cheaper to reproduce). A 2015 study found that attempts to reproduce a hundred
psychological experiments yielded significant results in only thirty-six per
cent of them, even though ninety-seven per cent of the initial studies reported
a p-value under the five-per-cent threshold. And scientists fear that, as with
the iPod Shuffle, the fluke results tend to get an outsized share of attention.
Many
high-profile studies are now widely believed to have been founded on such
flukes. You may have come across the research on power posing, which suggests
that adopting a dominant stance helps to reduce stress hormones in the body.
The study has a thousand citations, and an accompanying TED talk
has amassed more than fifty million views, but the findings have failed to be
replicated and are now regarded as a notable example of the flaws in Fisher’s
methods.
It’s not
that scientific fraud is common; it’s that too many researchers have failed to
handle uncertainty with sufficient care. This issue has only been exacerbated
in the era of Big Data. The more data that are collected, cross-referenced, and
searched for correlations, the easier it becomes to reach false conclusions.
Illustrating this point, Spiegelhalter includes a 2009 study in which
researchers put a subject into an fMRI scanner and analyzed the response in
8,064 brain sites while showing photographs of different human expressions. The
scientists wanted to see what regions of the brain were lighting up in response
to the photographs and used a threshold of a tenth of one per cent for their
experiment. “The twist was that the ‘subject’ was a 4lb Atlantic Salmon which
‘was not alive at the time of scanning,’ ” Spiegelhalter notes.
But, even at
that threshold, run enough tests and you’re bound to cross it eventually. Of
the more than eight thousand sites in the dead fish’s brain the researchers
inspected, sixteen duly showed a statistically significant response. And the
fear is that equally unfounded conclusions, albeit less apparently so, will
routinely be drawn, with the false assurance of “statistical significance.”
Science still stands up to scrutiny, precisely because it invites scrutiny. But
the p-value crisis suggests that our current procedures could be improved upon.
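The salmon result is just the multiple-testing arithmetic playing out. Under the simplifying assumption that the thousands of tests behave independently, the expected number of chance “discoveries” is the number of tests multiplied by the threshold:

```python
# The multiple-testing arithmetic behind the salmon result, assuming the
# 8,064 tests behave roughly independently (they don't quite, but the
# ballpark holds).
n_sites = 8_064
threshold = 0.001                      # a tenth of one per cent
expected_false_positives = n_sites * threshold
print(f"Expected spurious 'active' sites in a dead fish: about {expected_false_positives:.0f}")
# Roughly eight sites would be expected to cross the threshold by chance;
# the study reported sixteen, every one of them necessarily a false positive.
```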
Scientists
now say that researchers should declare their hypothesis in advance of a study,
in order to make fishing for significant results much more difficult. Most
agree that the incentives of science need to be changed, too—that studies
designed to replicate the work of others should be valued more highly. There
are also suggestions for an alternative way to present experimental findings.
Many people have called for the focus of science to be on the size of the
effect—how many lives are saved by a drug, for instance—rather than on whether
the data for some effect cross some arbitrary threshold. How impressed should
we be by very strong evidence for a very weak effect? Let’s go back to aspirin.
A gigantic study—it tracked twenty-two thousand individuals over five
years—demonstrated that taking small daily doses of the drug would reduce the
risk of a heart attack. The p-value, the probability of this (or something more
extreme) happening by chance, was tiny: 0.001 per cent. But so, too, was the effect
size. A hundred and thirty otherwise healthy individuals would have to take the
drug to prevent a single heart attack, and all the while each person would be
increasing his or her risk of adverse side effects. It’s a risk that is now
deemed to outweigh the benefits for most people, and the advice for older
adults to take a baby aspirin a day has recently been withdrawn.
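The gap between significance and effect size is easy to make concrete. A “number needed to treat” of a hundred and thirty is, by definition, the reciprocal of the absolute risk reduction:

```python
# Effect size versus significance, using the figures quoted above: a number
# needed to treat of 130 corresponds, by definition, to an absolute risk
# reduction of one in a hundred and thirty.
number_needed_to_treat = 130
absolute_risk_reduction = 1 / number_needed_to_treat
print(f"Absolute risk reduction ≈ {absolute_risk_reduction:.2%} per person")
# About 0.77 percentage points: a real but tiny effect, detected with great
# certainty only because the study was so large.
```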
But perhaps
the real problem is how difficult we find it to embrace uncertainty. Earlier
this year, eight hundred and fifty prominent academics, including David
Spiegelhalter, signed a letter to Nature arguing that the
issue can’t be solved with a technical work-around. P-values aren’t the
problem; the problem is our obsession with setting a threshold.
Drawing an
arbitrary line in the sand creates an illusion that we can divide the true from
the false. But the results of a complicated experiment cannot be reduced to a
yes-or-no answer. Back when Spiegelhalter was asked to determine whether Dr.
Harold Shipman’s mortality rate should have aroused suspicion earlier, he
swiftly decided that the standard test of statistical significance would be a
“grossly inappropriate” way to monitor doctors. The medical profession would
effectively be pointing the finger of suspicion at one in every twenty innocent
doctors—thousands of clinicians in the U.K. Doctors would be penalized for
treating higher-risk patients.
Instead,
Spiegelhalter and his colleagues proposed an alternative test, which took
account of patient deaths as they occurred, contrasting the accumulating deaths
with the expected number. Year by year, it weighs the likelihood that a
doctor’s high mortality rate is a run of bad luck against the likelihood of
something more suspicious, and raises an alarm once the evidence starts to
build. But even this
highly sophisticated method will, owing to the capricious whims of chance,
eventually cast suspicion on the innocent. Indeed, as soon as a monitoring
system for general practitioners was piloted, it “immediately identified a G.P.
with even higher mortality rates than Shipman,” Spiegelhalter writes. This was
an unlucky doctor who worked in a coastal town with an elderly population. The
result highlights how careful you need to be even with the best statistical
methods. In Spiegelhalter’s words, while statistics can find the outliers, it
“cannot offer reasons why these might have occurred, so they need careful
implementation in order to avoid false accusations.”
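The general shape of such a sequential test can be sketched in code, though what follows is a generic illustration rather than the method Spiegelhalter’s team actually used, and every number in it is invented. Each year, the observed death count is compared with the expected count, the evidence for an elevated mortality rate accumulates, and an alarm sounds once it crosses a chosen threshold.

```python
# A generic sequential-monitoring sketch (not the Shipman inquiry's actual
# test): accumulate the log-likelihood ratio of "mortality at double the
# expected rate" against "mortality as expected," year by year, and raise
# an alarm when the running score crosses a threshold. All numbers invented.
import math

def monitor(observed_by_year, expected_by_year, rate_ratio=2.0, alarm=5.0):
    """Return the first year that triggers an alarm, or None if none does."""
    score = 0.0
    for year, (obs, exp) in enumerate(zip(observed_by_year, expected_by_year), start=1):
        # Poisson log-likelihood ratio for this year's death count.
        llr = obs * math.log(rate_ratio) - exp * (rate_ratio - 1.0)
        score = max(0.0, score + llr)   # accumulated evidence never goes below zero
        if score > alarm:
            return year
    return None

# Hypothetical doctor: about ten deaths expected per year, noticeably more observed.
expected = [10, 10, 10, 10, 10]
observed = [15, 18, 21, 19, 22]
print("Alarm raised in year:", monitor(observed, expected))   # -> 3
```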
Statistics,
for all its limitations, has a profound role to play in the social realm. The
Shipman inquiry concluded that, if such a monitoring system had been in place,
it would have raised the alarm as early as 1984. Around a hundred and
seventy-five lives could have been saved. A mathematical analysis of what
it is to be human can take us only so far, and, in a world of uncertainty,
statistics will never eradicate doubt. But one thing is for sure: it’s a very
good place to start. ♦
An earlier
version of this article incorrectly defined p-value.
This
article appears in the print edition of the September 9, 2019, issue, with the headline “Your Number’s Up.”
Hannah Fry is a professor at University College London’s Centre for Advanced Spatial Analysis. Her latest book is “Hello World.”