How do you predict uncertainty in statistics and machine learning?

by Caitlin Hayes from the Cornell Research website

The influence of machine learning and its algorithms is all around us, having both small and profound effects on our lives. You find machine learning in post offices, on social media, and when you sit down to watch a show recommended to you by Netflix or another provider. Banks use machine learning to invest your money and to predict whether you’ll repay a loan, and thus whether to grant you one. In some places, machine learning is now being used to decide a course of medical treatment or predict recidivism in parole hearings. It’s also used in hiring decisions.

Giles Hooker, Statistical Science/Biological Statistics and Computational Biology, says the danger in the higher stakes examples is that in machine learning, the bottom-line prediction is all you get, without a good measure of how uncertain that prediction is.

“So there’s this great big powerful thing called machine learning, but we don’t understand how it deals with uncertainty very well, and we don’t have great ways of looking at what it’s actually giving us,” Hooker says. “In many cases, I don’t care. A post office really doesn’t care how you use pixels to determine if the number you scrawled in a postcode is a two, as long as it’s accurate. But there are times when the process and the degree of stability really matters.”

Hooker connects statistics—and measuring uncertainty—to machine learning. “You can think of it as uncertainty quantification,” Hooker says. “Can I judge how reliable this particular prediction is?”

The Lab of Ornithology’s Interesting Prediction Challenges

Hooker develops methods to address problems where uncertainty is important. One of the datasets he works with is from the Lab of Ornithology’s eBird program at Cornell. The Lab of Ornithology has been building maps of bird migration pathways based on observations submitted by amateur bird watchers across the country and abroad. The lab then incorporates this data—with approximately 300 million entries—into animated maps that show the concentration of a bird species at a given time of year.

“You can see the topography in the intensity of the bird population,” says Hooker. On his computer screen, bright orange representing the birds creeps up a map of North America and begins to outline the Mississippi River. “So cool,” Hooker says. The beauty of the maps, however, belies the problematic complexity of the data.

“There’s a whole bunch of biases that come up,” Hooker says. “Things light up in Nevada, even though we have almost no observations from the middle of Nevada. We don’t have a lot of data to say what’s going on in there, so we want to be able to express that somehow.”

Other biases abound: people tend to go where they think they’ll see birds, and they’re more likely to report prettier or rarer birds. Some bird watchers will submit five observations from their area and get bored, while others will submit five observations a day for years. “And some of these people are really good, and some of them can’t tell the difference between an eagle and a chickadee,” Hooker says.

For any given area on the map that lights up, there may be varying amounts of data with different degrees of quality and thoroughness. “None of those things are being noted in machine learning,” Hooker says. That compromises the integrity of the predictions the lab wants to make: where will birds be at a given time? Recently, The Nature Conservancy of California asked for guidance on where to lease land for bird habitats; the Lab of Ornithology wants to give them predictions as well as a sense of how certain those predictions are.

Hooker says they’re only scratching the surface of how to deal with this complexity. “For a statistician, it’s a wonderful sandbox to play in,” he adds. “The Lab of O really sold Cornell for me—this is just a fantastic place to be.”

Predictability, When the Stakes Are Higher

The methods Hooker is developing to deal with these problems could also call attention to uncertainty in predictions when the stakes are higher and fairness is in question. In the case of predicting recidivism at parole hearings, for example, the data about recidivism that’s plugged into the computer, like the eBird data, is going to reflect biases in how it was collected. An algorithm may then predict that a low-income African American man would be more likely to re-offend than an upper-class white woman. “That may be accurate or it may reflect the biases in where police focus their efforts,” Hooker says. Understanding how the machine uses the data to come to a prediction, as well as how stable the prediction is, could change the course of lives.

These kinds of racial and gender biases have shown up in other machine learning contexts. “The Lab of O has those same issues about where our data come from and its biases,” Hooker says. “The hope is that I can develop tools here that can then be used in more sensitive contexts.”

Quantifying Uncertainty in Random Forests

Hooker has already made progress toward that goal in his mathematical work, publishing a paper last year that showed how to quantify uncertainty in a popular class of machine learning prediction methods called random forests. Random forests are the baseline method for many predictions, including the Lab of Ornithology’s migration pathways.

Random forests are made up of decision trees. “You look at one covariate and you build off of it. For example, a bank might ask: are you older than 50? If yes, then is your income less than $70,000?” Hooker says. “So you get a flow chart which we can understand. With random forests, you then say I’m going to build about 800 of these; now you predict very well but no longer know what was important.”
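The flow-chart logic Hooker describes can be sketched in a few lines of Python. This is only an illustration: the age and income thresholds come from the article’s bank example, but the outcome labels and the threshold-jittering “forest” are invented stand-ins for trees trained on real data.

```python
import random

def loan_tree(age, income, age_cut=50, income_cut=70_000):
    """One decision tree: a flow chart of covariate checks (the bank example)."""
    if age > age_cut:
        # Older applicant: check income next, as in the article's flow chart.
        return "review" if income < income_cut else "approve"
    return "approve" if income >= income_cut else "decline"

print(loan_tree(age=60, income=50_000))  # review
print(loan_tree(age=40, income=80_000))  # approve

# A random forest builds hundreds of randomized trees and aggregates their
# votes. Here, jittering the thresholds stands in for training each tree on
# a different bootstrap sample of the data.
random.seed(0)
forest = [
    (50 + random.randint(-10, 10), 70_000 + random.randint(-20_000, 20_000))
    for _ in range(800)
]

def forest_predict(age, income):
    votes = [loan_tree(age, income, a, c) for a, c in forest]
    return max(set(votes), key=votes.count)  # majority vote across 800 trees

print(forest_predict(age=60, income=50_000))
```

A single tree is a flow chart you can read; the 800-tree majority vote usually predicts better, but, as Hooker says, you can no longer point to what mattered.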

Working with Lucas K. Mentch, PhD ’15 (now at the University of Pittsburgh), Hooker came up with a theorem that gives a mathematically justified interval of possible predictions. Rather than a single prediction, the range allows for the expression and quantification of uncertainty. “We can do that at no extra computational effort,” Hooker says. “It’s one of the central breakthroughs in this area of research.”
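The basic idea of turning an ensemble into an interval can be sketched as follows. To be clear about what is and isn’t from the source: Mentch and Hooker’s actual result rests on a central limit theorem for forests built from subsamples, whereas this toy sketch only shows the intuition of reading uncertainty off the spread of per-tree predictions. The synthetic data, the `fit_stump` nearest-neighbor stand-in for a real tree, and the rough ±2 standard deviation band are all my own illustrative assumptions.

```python
import math
import random

random.seed(1)

# Toy regression data: y = 2*x + noise.
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(100)]

def fit_stump(sample, x0):
    """Crude stand-in for a tree: predict the mean y of the 5 nearest x's."""
    nearest = sorted(sample, key=lambda p: abs(p[0] - x0))[:5]
    return sum(y for _, y in nearest) / len(nearest)

def forest_interval(x0, n_trees=500, subsample=30):
    """Fit each 'tree' on a random subsample, then turn the spread of the
    per-tree predictions into a rough interval instead of a single number."""
    preds = []
    for _ in range(n_trees):
        sample = random.sample(data, subsample)
        preds.append(fit_stump(sample, x0))
    mean = sum(preds) / len(preds)
    sd = math.sqrt(sum((p - mean) ** 2 for p in preds) / (len(preds) - 1))
    return mean - 2 * sd, mean + 2 * sd

lo, hi = forest_interval(50)
print(f"prediction interval at x=50: ({lo:.1f}, {hi:.1f})")
```

The per-tree predictions are computed anyway when the forest makes its point prediction, which is the sense in which such an interval can come “at no extra computational effort.”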

A breakthrough in statistical science means a breakthrough in a number of fields and their applications.

The Fun in Statistics

The open and expansive nature of the field and its vast potential for application are some of the things Hooker loves about statistics. He quotes the late mathematician John W. Tukey, who described statistics as getting to play in everyone else’s backyard.

Hooker goes on to describe statistics as a service discipline. “There isn’t really a next great challenge of statistics. It’s more about what’s the next crazy type of data that we have to work out how to deal with?”

Among his many projects, Hooker also works with Cornell’s Statistical Consulting Unit, helping researchers across campus find solutions for processing their data. “Excellent scientists from all over the university with all sorts of problems come into this office and talk about the science they’re doing,” Hooker says. “They keep me on my toes, and it’s just so much fun.”