« Deprave and corrupt | Main | Mind the gap »

Data mining snake oil

The basic complaints we've been making for years about law enforcement's and government's desire to collect masses of data have primarily focused on the obvious set of civil liberties issues: the chilling effect of surveillance, the right of individuals to private lives, the risk of abuse of power by those in charge of all that data. On top of that we've worried about the security risks inherent in creating such large targets from which data will, inevitably, leak sometimes.

This week, along came the National Research Council to offer a new trouble with dataveillance: it doesn't actually work to prevent terrorism. Even if it did work, the tradeoff of the loss of personal liberties against the security allegedly offered by policies that involve tracking everything everyone does from cradle to grave was hard to justify. But if it doesn't work - if all surveillance all the time won't make us actually safer - then the discussion really ought to be over.

The NAS report, Protecting Individual Privacy in the Struggle Against Terrorists: A Framework for Assessment, makes its conclusions clear: "Modern data collection and analysis techniques have had remarkable success in solving information-related problems in the commercial sector... But such highly automated tools and techniques cannot be easily applied to the much more difficult problem of detecting and preempting a terrorist attack, and success in doing so may not be possible at all."

Actually, the many of us who have had our cards stopped for no better reason than that the issuing bank didn't like the color of the Web site we were buying from, might question how successful these tools have been in the commercial sector. At the very least, it has become obvious to everyone how much trouble is being caused by false positives. If a similar approach is taken to all parts of everyone's lives instead of just their financial transactions, think how much more difficult it's going to be to get through life without being arrested several times a year.

The report again: "Even in well-managed programs such tools are likely to return significant rates of false positives, especially if the tools are highly automated." Given the masses of data we're talking about - the UK wants to store all of the nation's communications data for years in a giant shed, and a similar effort in the US would have to be many times as big - the tools will have to be highly automated. And - the report yet again - the difficulty of detecting terrorist activity "through their communications, transactions, and behaviors is hugely complicated by the ubiquity and enormoity of electronic databases maintained by both government agencies and private-sector corporations." The bigger the haystack, the harder it is to find the needle.

In a recent interview, David Porter, CEO of Detica, who has spent his entire career thinking about fraud prevention, said much the same thing. Porter's proposed solution - the basis of the systems Detica sells -is to vastly shrink the amount of data to be analyzed by throwing out everything we know is not fraud (or, as his colleague, Tom Black, said at the Homeland and Border Security conference in July, terrorist activity). To catch your hare, first shrink your haystack.

This report, as the title suggests, focuses particularly on balancing personal privacy against the needs of anti-terrorist efforts. (Although, any terrorist watching the financial markets the last couple of weeks would be justified in feeling his life's work had been wasted, since we can do all the damage that's needed without his help.) The threat from terrorists is real, the authors say - but so is the threat to privacy. Personal information in databases cannot be fully anonymized; the loss of privacy is real damage; and data varies substantially in quality. "Data derived by linking high-quality data with data of lesser quality will tend to be low-quality data." If you throw a load of silly string into your haystack, you wind up with a big mess that's pretty much useless to everyone and will be a pain in the neck to clean up.

As a result, the report recommends requiring systematic and periodic evaluation of every information-based government program against core values and proposes a framework for carrying that out. There should be "robust, independent oversight". Research and development of such programs should be carried out with synthetic data, not real data "anonymized"; real data should only be used once a program meets the proposed criteria for deployment and even then only phased in at a small number of sites and tested thoroughly. Congress should review privacy laws and consider how best to protect privacy in the context of such programs.

These things seem so obvious; but to get to this the point it's taken three years of rigorous documentation and study by a 21-person committee of unimpeachable senior scientists and review by members of a host of top universities, telephone companies, and top technology companies. We have to think the report's sponsors, who include the the National Science Foundation, and the Department of Homeland Security, will take the results seriously. Writing for Cnet, Declan McCullagh notes that the similar 1996 NRC CRISIS report on encryption was followed by decontrol of the export and use of strong cryptography two years later. We can but hope.

Wendy M. Grossman's Web site has an extensive archive of her books, articles, and music, and an archive of all the earlier columns in this series. Readers are welcome to post here, at net.wars home, at her personal blog, or by email to netwars@skeptic.demon.co.uk (but please turn off HTML).


TrackBack URL for this entry:

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)