Let's Not Be Too Hasty to Shut Down Big Data Security Sweeps

A National Security Agency (NSA) data gathering facility is seen in Bluffdale, about 25 miles south of Salt Lake City, Utah, May 18. Technology holds the power to discover terrorism suspects from data—and yet to also safeguard privacy even with bulk telephone and email data intact, the author argues. Jim Urquhart/Reuters

I must disagree with my fellow liberals. The NSA bulk data shutdown scheduled for tomorrow, Sunday, November 29, is unnecessary and significantly compromises intelligence capabilities.

As recent tragic events in Paris and elsewhere turn up the contentious heat on both sides of this issue, I'm keenly aware that mine is not the usual opinion for an avid supporter of Bernie Sanders (my hometown mayor in Vermont).

But as a techie, a former Columbia University computer science professor, I'm compelled to break some news: Technology holds the power to discover terrorism suspects from data—and yet to also safeguard privacy even with bulk telephone and email data intact.

To be specific, stockpiling data about innocent people in particular is essential for the state-of-the-art science that identifies new potential suspects.

I'm not talking about scanning to find perpetrators, the well-known practice of employing vigilant computers to trigger alerts on certain behavior. The system spots a potentially nefarious phone call and notifies a heroic agent—that's a standard occurrence in intelligence thrillers, and a common topic in casual speculation about what our government is doing. Everyone's familiar with this concept.

Rather, bulk data takes on a much more difficult, critical problem: precisely defining the alerts in the first place. The actual "intelligence" of an intelligence organization hinges on the patterns it matches against millions of cases—it must develop adept, intricate patterns that flag new potential suspects.

Deriving these patterns from data automatically, the function of predictive analytics, is where the scientific rubber hits the road. (Once they're established, matching the patterns and triggering alerts is relatively trivial, even when applied across millions of cases—that kind of mechanical process is simple for a computer.)
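To see why matching is the easy half, consider a toy sketch in Python. The rule and the call records below are entirely invented for illustration; a real learned pattern would involve many more variables.

```python
# Illustrative only: once a pattern (rule) exists, applying it at scale is
# mechanical. The rule and call records here are invented for this sketch.

def matches_pattern(call):
    # Hypothetical learned rule: short calls to a flagged region
    # in the early-morning hours.
    return (call["region_flagged"]
            and call["duration_sec"] < 60
            and call["hour"] in range(1, 5))

calls = [
    {"id": 1, "region_flagged": True,  "duration_sec": 45,  "hour": 3},
    {"id": 2, "region_flagged": False, "duration_sec": 45,  "hour": 3},
    {"id": 3, "region_flagged": True,  "duration_sec": 600, "hour": 14},
]

alerts = [c["id"] for c in calls if matches_pattern(c)]
print(alerts)  # [1]
```

Running this filter over millions of records instead of three is just more of the same loop; the hard scientific question is where `matches_pattern` came from.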

It may seem paradoxical, but data about the innocent civilian can serve to identify the criminal. Although the ACLU calls it "mass, suspicionless surveillance," this data establishes a baseline for the behavior of normal civilians. That is to say, law enforcement needs your data in order to learn from you how non-criminals behave. The more such data available, the more effectively it can do so.

Here's how it works. Predictive analytics shrinks the unwieldy haystack through which law enforcement must hunt for needles—albeit by first analyzing the haystack in its entirety. The machine learns from the needles (i.e., known perpetrators, suspects and persons of interest) as well as the hay (i.e., the vast majority that is non-criminal) using the same technology that drives financial credit scoring, Internet search, personalized medicine, spam filtering, targeted marketing and movie, music and book recommendations. This automatic process generates patterns that flag individuals more likely to be needles, thereby targeting investigation activities and more productively utilizing the precious bandwidth of officers and agents. Under the right conditions, this will unearth terrorists who would have otherwise gone undetected.
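The learning step can be sketched in miniature. The following Python toy uses invented feature names and data, and a naive-Bayes-style weighting that stands in for the far richer models real systems employ; the point is only that the known needles and the ordinary hay together determine the pattern.

```python
# A minimal sketch of learning a pattern from the needles (known suspects)
# and the hay (everyone else). Feature names and data are invented; real
# systems use far richer features and models.

import math

def train(people, labels, features):
    """Weight each feature by the log-ratio of its rate among positives
    vs. negatives, with Laplace smoothing to tame small counts."""
    pos = [p for p, y in zip(people, labels) if y]
    neg = [p for p, y in zip(people, labels) if not y]
    weights = {}
    for f in features:
        p_rate = (sum(p[f] for p in pos) + 1) / (len(pos) + 2)
        n_rate = (sum(p[f] for p in neg) + 1) / (len(neg) + 2)
        weights[f] = math.log(p_rate / n_rate)
    return weights

def score(person, weights):
    return sum(w for f, w in weights.items() if person[f])

features = ["short_foreign_calls", "prepaid_phone"]
people = [
    {"short_foreign_calls": 1, "prepaid_phone": 0},  # known suspect
    {"short_foreign_calls": 1, "prepaid_phone": 1},  # known suspect
    {"short_foreign_calls": 0, "prepaid_phone": 1},  # ordinary civilian
    {"short_foreign_calls": 0, "prepaid_phone": 0},  # ordinary civilian
    {"short_foreign_calls": 1, "prepaid_phone": 0},  # ordinary civilian
    {"short_foreign_calls": 0, "prepaid_phone": 1},  # ordinary civilian
]
labels = [1, 1, 0, 0, 0, 0]

weights = train(people, labels, features)
# The more data about ordinary people, the better the "hay" baseline,
# and the sharper the learned weights become.
```

Note that the weight for each feature is meaningless without the negatives: it is precisely the data on non-criminals that tells the model which behaviors are unremarkable.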

This increasingly common practice also drives other crime fighting functions. Today's law enforcement organizations predictively investigate, monitor, audit, warn, patrol, parole and sentence. Predictive analytics guides FBI anti-terrorism activities, judge and parole board decision-making, predictive patrolling by city police precincts, and fraud detection, which is arguably the most pervasive government application of predictive analytics.

Such activities at the NSA are secret, but it's no stretch to presume the organization considers predictive analytics a strategic priority. The agency runs the country's largest surveillance data center, employs more Ph.D. mathematicians than any other organization in the world, is known to purchase predictive analytics tools and to recruit experts in the field, and is charged by an executive order to "analyze... signals intelligence information and data for foreign intelligence," as stated on its mission webpage.

The civil libertarian objects vehemently to predictive targeting—and not without reason. The incentive to collect personal data only intensifies given its potential use to identify previously unknown persons of interest.

"Predictive analytics is clearly the future of law enforcement," University of the District of Columbia law professor Andrew Ferguson told me. "The problem is that the forecast for transparency and accountability is less than clear." I agree that the risk of misuse is real and the NSA in particular must drastically increase transparency.

On the other hand, predictive targeting by design actually curbs unjust intrusion. It introduces a scientifically based objectivity that can balance against some of law enforcement's subjective human biases. Further, more precise targeting means fewer false positives, that is, a decrease in the number of innocent individuals considered, monitored, investigated or detained.
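The false-positive arithmetic can be made concrete with a back-of-envelope sketch. All the numbers here, including the assumed 50x lift over random selection, are invented purely for illustration.

```python
# Back-of-envelope arithmetic with invented numbers: the review budget is
# fixed, so better targeting directly means fewer innocent people examined
# per true lead.

population = 10_000
true_positives = 10       # actual needles in the haystack
budget = 100              # cases agents can manually review

# Random selection catches positives only at the base rate.
random_hits = budget * true_positives / population        # 0.1 expected
# Assume a model with 50x lift over chance in its top-ranked cases.
model_hits = min(true_positives, random_hits * 50)        # 5.0 expected

innocents_per_lead_random = (budget - random_hits) / random_hits   # 999.0
innocents_per_lead_model = (budget - model_hits) / model_hits      # 19.0
```

Under these assumed numbers, the same hundred reviews go from one lead per roughly a thousand innocents examined to one per nineteen; however the specific figures are chosen, sharper ranking moves the ratio in the same direction.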

The extent of data collection is not an on/off decision, and the potential of data-driven targeting must inform where we set the dial. The reach of government surveillance could range anywhere from nil to gathering video feeds from every room in every building. While most would reject either extreme, the position at which we settle along this continuum is fundamental to how we strike a balance between privacy and security.

Ultimately we can—and we must—have our cake and eat it too. Access by a human to personal data such as phone records should be gated pending a warrant, and yet a law enforcement computer (or its counterpart within a telecom) should be fed bulk data in order to apply predictive analytics.

This arrangement will require a sophisticated, transparent process by which a law enforcement organization regulates its own internal access per externally issued warrants. The organization will need to enforce increasingly complex internal access policies, in part via encryption and data sanitization.
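One way to picture such an arrangement, purely as a conceptual sketch and not any agency's actual design: bulk records sit behind a gate that lets the computer score them, releases a raw record to a human only against an externally issued warrant, and logs every request for audit.

```python
# Conceptual sketch only (an invented design, not any real system): the
# machine may analyze bulk records, but a human can read a raw record only
# with a valid, externally issued warrant, and every request is logged.

class GatedStore:
    def __init__(self, records, valid_warrants):
        self._records = records               # bulk data
        self._warrants = set(valid_warrants)  # issued by an external court
        self.audit_log = []                   # transparency trail

    def score_all(self, model):
        # Automated analysis: no human sees the underlying records.
        return {rid: model(rec) for rid, rec in self._records.items()}

    def read_record(self, rid, warrant_id, requester):
        # Every access attempt is logged, granted or not.
        self.audit_log.append((requester, rid, warrant_id))
        if warrant_id not in self._warrants:
            raise PermissionError("no valid warrant for this access")
        return self._records[rid]

store = GatedStore({"r1": {"calls": 42}}, valid_warrants={"W-17"})
scores = store.score_all(lambda rec: rec["calls"] > 40)  # machine access OK
```

The design choice worth noticing is that scoring and disclosure are separate operations with separate rules: the predictive computation touches everything, while human-readable output is the narrow, warranted, audited exception.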

The effort to set this up will more than pay off in the predictive targeting afforded by data.

Eric Siegel, a former computer science professor at Columbia University, is the author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die.