A certain amount of government and practical policy is being made these days based on the idea that you can take large amounts of data and anonymize it so researchers and others can analyze it without invading anyone's privacy. Of particular sensitivity is the idea of giving medical researchers access to such anonymized data in the interests of helping along the search for cures and better treatments. It's hard to argue with that as a goal - just like it's hard to argue with the goal of controlling an epidemic - but both those public health interests collide with the principle of medical confidentiality.
The work of Latanya Sweeney was I think the first hint that anonymizing data might not be so straightforward; I've written before about her work. This week, at the Privacy Enhancing Technologies Symposium in Leuven, Belgium (which I regrettably missed) researchers Arvind Narayanan and Vitaly Shmatikov from the University of Texas at Austin won an award sponsored by Microsoft for taking reidentifying supposedly anonymized data a step further.
The pair took a database released by the online DVD rental company Netflix last year as part of the $1 million Netflix Prize, a project to improve upon the accuracy of the system's predictions. You know the kind of thing, since it's built into everything from Amazon to Tivos - you give the system an idea of your likes and dislikes by rating the movies you've rented and the system makes recommendations for movies you'll like based on those expressed preferences. To enable researchers to work on the problem of improving these recommendations, Netflix released a dataset containing more than 100 million movie ratings contributed by nearly 500,000 subscribers between December 1999 and December 2005 with, as the service stated in its FAQ, all customer identifying information removed.
Maybe in a world where researchers only had one source of information that would be a valid claim. But just as Sweeney showed in 1997 that it takes very little in the way of public records to re-identify a load of medical data supplied to researchers in the state of Massachusetts, Narayananan and Shamtikov's work reminds us that we don't live in a world like that. For one thing, people tend disproportionately to rate their unusual, quirky favorites. Rating movies takes time; why spend it on giving The Lord of the Rings another bump when what people really need is to know about the wonders of King of Hearts, All That Jazz, and The Tall Blond Man with One Black Shoe? The consequence is that the Netflix dataset is what they call "sparse" - that is, there few subscribers have very similar records.
So: how much does someone need to know about you to identify a particular user from the database? It turns out, not much. The is the public ratings and dates at the Internet Movies Database, which include dates and real names. Narayanan and Shmatikov concluded that 99 percent of records could be uniquely identified from only eight matching ratings (of which two could be wrong); for 68 percent of the records you only need two (and reidentifying the rest becomes easier). And of course, if you know a little bit about the particular person whose record you want to identify things get a lot easier - the three movies I've just listed would probably identify me and a few of my friends.
Even if you don't care if your tastes in movies are private - and both US law and the American Library Association's take on library loan records would protect you more than you yourself would - there are couple of notable things here. First of all, the compromise last week whereby Google agreed to hand Viacom anonymized data on YouTube users isn't as good a deal for users as they might think. A really dedicated searcher might well think it worth the effort to come up with a way to re-identify the data - and so far rightsholders have shown themselves to be very dedicated indeed.
Second of all, the Thomas-Walport review on data-sharing actually recommends requiring NHS patients to agree to sharing data with medical researchers. There is a blithe assumption running through all the government policies in this area that data can be anonymized, and that as long as they say our privacy is protected it will be. It's a perfect example of what someone this week called "policy-based evidence-making".
Third of all, most such policy in this area assumes it's the past that matters. What may be of greater significance, as Narayanan and Shmatikov point out, is the future: forward privacy. Once a virtual identity has been linked to a real-world identity, that linkage is permanent. Yes, you can create a new virtual identity, but any slip that links it to either your previous virtual or your real-world identity blows your cover.
The point is not that we should all rush to hide our movie ratings. The point is that we make optimistic assumptions every day that the information we post and create has little value and won't come back to bite us on the ass. We do not know what connections will be possible in the future.
Wendy M. Grossman's Web site has an extensive archive of her books, articles, and music, and an archive of all the earlier columns in this series. Readers are welcome to post here, at net.wars home, at her personal blog, or by email to email@example.com (but please turn off HTML).