Our Trust in Big Data Shows We Don't Trust Ourselves

Tammer Kamel talked about the challenges of data at Newsweek's data science conference in London.

Is data the modern oracle, the oil that will power the next industrial revolution—or just another round of business hype?

Of course it's true that there is more of the stuff, more information in forms that computers can collect and process, than ever in human history. Even trying to quantify it is a fool's errand, when yesterday's "biggest dataset in the world" becomes today's portable hard drive. But there is more to it than size. After years of talking to people who use big data in fields from dating apps to the hunt for the Higgs boson, I managed to reverse-engineer my analysis into a handy acronym: DATA.

D is for dimensions, or diverse, or different datasets. By combining very different types of information, we can get new insights. Brain scans alone are informative, but combine them with health records, postcodes and weather reports, and you can test a hypothesis that vitamin D intake affects the progression of multiple sclerosis, for example.
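To make the idea of combining datasets concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the patient IDs, the lesion counts, the postcodes, and the use of yearly sunshine hours as a rough stand-in for vitamin D exposure are all invented for illustration, not drawn from any real study.

```python
# Hypothetical records keyed by patient ID; every value here is invented.
brain_scans = {"p1": {"lesions": 4}, "p2": {"lesions": 9}, "p3": {"lesions": 2}}
health_records = {"p1": {"postcode": "AB1"}, "p2": {"postcode": "ZE2"}}
# Assumed proxy for vitamin D exposure: average yearly sunshine per postcode.
sunshine_hours = {"AB1": 1400, "ZE2": 1100}


def combine(scans, records, sunshine):
    """Join the three sources, keeping only patients present in all of them."""
    combined = []
    for patient_id, scan in sorted(scans.items()):
        record = records.get(patient_id)
        if record is None or record["postcode"] not in sunshine:
            continue  # no health record or no weather data: skip this patient
        combined.append({
            "patient": patient_id,
            "lesions": scan["lesions"],
            "sunshine_hours": sunshine[record["postcode"]],
        })
    return combined


rows = combine(brain_scans, health_records, sunshine_hours)
for row in rows:
    print(row)
```

Only once the rows sit side by side can anyone ask whether lesion counts move with sunshine. Note that the patient with no health record silently drops out of the analysis.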

A is for automatic. We do so many things through our digital devices, whether phones, computers or wearables, that collecting data is now the default. Every time you touch into a transport system, or pay with a bank card, or connect to a wifi network, you're adding to somebody's database. Much of the processing of that data is also automatic, invisible, opaque.

T is for time. Data streams into the databases almost in real time, making it easy to spot emerging patterns, and then to project that timeline into the future. This goes beyond obvious things like traffic flows: add "sentiment analysis" of our social media activity to sales records and weather forecasts, and you can predict the first big barbecue weekend of the year.

A is for AI, artificial intelligence. That's what spots the patterns in the tsunami of numbers. Yes, computers can calculate faster and more accurately than any human, but by using machine learning they do far more. Through trial and error, software modeled on aspects of how humans learn can sort images like brain scans (male/female or healthy/diseased) or more complex documents like job applications.

And this is where the dilemmas start to emerge.

The idea is that, unlike a biased human recruiter, a hiring algorithm will go on objective data. It won't take into account categories of human prejudice like race or gender. And if any disgruntled applicant disputes your hiring decision, you can claim that you followed procedure to the letter.

Even if it turns out that the algorithm got it wrong when the new employee runs off with all the company's cash, at least you won't have to carry the can. You followed procedure, didn't you? Is it your fault if this candidate was the 1 percent, the exception that proves the rule is probabilistic, not absolute?

But what if you are the other 1 percent, the applicant whose scores are lousy, for reasons over which you have no control, but who would make the best employee if somebody would just give you the chance?

Say you live in the wrong part of town, too far from the workplace. Or you've had a lot of time off sick lately. Or your friends tagged you in a Facebook photo with a jokey reference to smoking weed.

Any of these things could earn you a red flag, if your job application is pre-sorted by AI. Why? Because in the past, people in those categories were less likely to turn out well as employees. The machine has not introduced a bias. By taking the past as its model, it is simply perpetuating the bias already in the system.
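A toy sketch in Python shows how this happens. It assumes a deliberately crude "model" that does nothing but count how past hires from each district turned out; the districts, the records and the 0.5 threshold are all made up, and real hiring software would be far more elaborate.

```python
from collections import defaultdict

# Entirely made-up historical records: (district, turned_out_well).
HISTORY = [
    ("north", True), ("north", True), ("north", True), ("north", False),
    ("south", True), ("south", False), ("south", False), ("south", False),
]


def learn_success_rates(history):
    """Return the share of past hires from each district who worked out."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for district, turned_out_well in history:
        totals[district] += 1
        if turned_out_well:
            successes[district] += 1
    return {district: successes[district] / totals[district] for district in totals}


def red_flag(district, rates, threshold=0.5):
    """Flag an applicant whose district scored below the threshold in the past."""
    return rates[district] < threshold


rates = learn_success_rates(HISTORY)
print(rates)
print(red_flag("south", rates))
```

Nothing in the code mentions prejudice. It flags every applicant from the "south" district purely because earlier hires from there fared worse, whatever the reasons for that were and whatever this particular applicant is like.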

"It is often assumed that big data techniques are unbiased because of the scale of the data and because the techniques are implemented through algorithmic systems. However, it is a mistake to assume they are objective simply because they are data-driven." Those are the words of a White House report from May 2016, warning against the assumption that machines can somehow rise above the mess of human society and, like a true deus ex machina, hand down automated justice to all.

Could a properly designed algorithm overcome human prejudice by applying only factors which don't unfairly favor one social group over another? Possibly, but that would in itself require human judgment. And these choices are not neutral, but depend on your understanding not only of the world as it is now, but of how it might change in future, and whether you think it should.

An AI that learns what to look for by finding patterns in a "training" dataset makes no distinction between fair and unfair factors. It takes a human being to notice that, if you use past admissions to a medical school as your model, future admissions will also favor white, male applicants over equally qualified female or racial minority applicants.

Big data verdicts may be just as unjust as human judgments. But, worse than this, they are less accountable.

Many machine learning algorithms are a black box even to their own creators. The very qualities that give them such impressive powers of pattern identification, and hence of prediction, make them opaque to human examination.

When we make an important decision like choosing a job applicant, we try to set our prejudices aside and go on the evidence. But an algorithm has no concept of justice. It has been asked to build a probabilistic model of the present, based on the past, and projected into the future.

Big data has a seductive authority, especially in uncertain times. Shiny new technology, powerful computers, datasets bigger than anything we've had before, and collected more completely: all these give the illusion of objectivity, the promise of precision, something tangible by which to steer into the darkness.

But our readiness to hand over difficult choices to machines tells us more about how we see ourselves.

Instead of seeing a job applicant as a person facing their own choices, capable of overcoming their disadvantages, we reduce them to a data point in a mathematical model. Instead of seeing an employer as a person of judgment, bringing wisdom and experience to hard decisions, we treat them as a vector for unconscious bias and inconsistent behavior.

Why do we trust the machines, biased and unaccountable as they are? Because we no longer trust ourselves.

Timandra Harkness, presenter of BBC Radio 4's Futureproofing and author of Big Data: Does Size Matter? is speaking on the Big Data: Does size matter? panel at the Battle of Ideas on October 22-23.