Computers, Freedom, and Privacy 2009 - Day One
"Did you check that with your ethics committee?"
The speaker, who was feeling the strain of being a newcomer to privacy issues among a very tough, highly activist crowd, turned a little shakier than she already was.
"I didn't need to," she said, or something very like it. "It's not interacting with humans, just computers."
We spend a lot of time talking about where the line might be between human intelligence and artificial intelligence, but the important question may not be the usual one. Not "What does it mean to be human?" but "How far down the layers of abstraction does human interaction persist?" If I send you email intended to deceive, clearly I'm interacting with a human. If I set up a Facebook account and use it to get you to friend me by first friending one of your less careful friends, and never communicate directly with you, the line gets a little more attenuated. Someone who had thought more about computers than about people might get confused.
This sort of question is going to come up a lot as we get better at data mining, the subject of an all-day tutorial on the first day of CFP. You'll find a lot of streams and papers on the conference Web site, if you'd like to investigate further, and you can pick up notes-in-progress on the conference's real-time Twitter feed. (I missed out on the annual civil liberties in cyberspace tutorial, and others on health data privacy and behavioral advertising.)
The important point, as speakers like Khaled El Emam, a research chair at the University of Ottawa, and Bradley Malin made clear, is that it's actually very difficult to anonymize data, no matter how much governments would like to persuade us otherwise. Pharmaceutical companies want medical data for research; governments want to give it to them in return for (they hope) lowered medical costs.
But what is identifiable data? Do you include data that can be reidentified when matched against a different dataset? The typical threat model assumes that an attacker will try once and give up. But in one case, Canadian media matched anonymized prescription data for an acne drug against published obituaries, and managed to find four families that matched. Media are persistent: they will call each family until they find the right one.
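To make the matching step concrete, here is a minimal sketch of that kind of linkage in Python. Everything in it is invented for illustration: the field names (age, sex, region, month), the toy records, and the helper link_records are assumptions, not the fields or method the Canadian media actually used. It only shows how a release with the names stripped out can still be joined to a public list that has them.

```python
from collections import defaultdict

def link_records(released, public, keys):
    """For each released record, list the public records that agree on every key."""
    # Index the public (named) records by their combination of key values.
    index = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person)
    # Look each released (nameless) record up in that index.
    return {
        i: index.get(tuple(record[k] for k in keys), [])
        for i, record in enumerate(released)
    }

# Hypothetical "anonymized" release: no name, but other details remain.
released = [
    {"age": 17, "sex": "F", "region": "K1A", "month": "2008-03", "drug": "acne drug"},
]

# Hypothetical public list, standing in for published obituaries.
public = [
    {"name": "Family A", "age": 17, "sex": "F", "region": "K1A", "month": "2008-03"},
    {"name": "Family B", "age": 17, "sex": "F", "region": "K1A", "month": "2008-03"},
    {"name": "Family C", "age": 54, "sex": "M", "region": "K1A", "month": "2008-03"},
]

candidates = link_records(released, public, ("age", "sex", "region", "month"))
print([p["name"] for p in candidates[0]])  # ['Family A', 'Family B']
```

The match does not have to be unique to be dangerous. A short candidate list is enough for anyone willing to pick up the phone and work through it.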
When we talk about anonymized data, therefore, we have to ask many more questions than we do now. What are the chances of unique records? What are the chances of unique records in the databases this database may be matched to? That determines how easy it is to find a particular individual's record. With just a name, full date of birth, and postal codes for the last year, 98 percent of 11 years of patient data covering 4 million people in Montreal was uniquely identifiable.
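The "chances of unique records" question can be asked of any dataset before it is released. The Python sketch below is one simple way to do it, under stated assumptions: the function name unique_fraction, the field names, and the four toy records are all made up here, and the Montreal figure came from the speakers' own analysis, not from this code.

```python
from collections import Counter

def unique_fraction(records, quasi_identifiers):
    """Share of records whose combination of quasi-identifier values occurs only once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

# Hypothetical patient records with direct identifiers already removed.
records = [
    {"sex": "F", "dob": "1970-01-01", "postal": "H2X"},
    {"sex": "F", "dob": "1970-01-01", "postal": "H2X"},
    {"sex": "M", "dob": "1982-05-17", "postal": "H3A"},
    {"sex": "F", "dob": "1990-11-30", "postal": "H2X"},
]

print(unique_fraction(records, ("sex", "dob", "postal")))  # 0.5
print(unique_fraction(records, ("sex", "postal")))         # 0.25
```

Adding a field to the combination can only keep the unique share the same or raise it, which is why a full date of birth is so much more revealing than it looks.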
People have, of course, been working on this problem, because patient data is incredibly valuable for research to improve public health.
The problem, as Malin noted, is that "People have been proposing methodologies for ten-plus years, and there's not much in the way of technology transfer."
El Emam had an explanation: "A lot of stuff is unusable." Really anonymizing the data, using tools such as generalization, perturbation, or multi-party computation, is currently not a practical option: it leaves you with a dataset you can't analyze using standard research tools. Ouch.
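Generalization, the first of those tools, is easy to sketch, and the sketch shows where the usability goes. The Python below is a toy under stated assumptions: the records, the field names, and the choice to coarsen a birth date to a year and a postal code to its first character are all invented for illustration, not taken from the tutorial. The size of the smallest group of indistinguishable records is the measure usually called k in k-anonymity.

```python
from collections import Counter

def generalize(record):
    """Coarsen a record: keep only the birth year and the first postal character."""
    return {"birth_year": record["dob"][:4], "postal_prefix": record["postal"][:1]}

def smallest_group(records, fields):
    """Size of the smallest set of records that share the same values for fields."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return min(counts.values())

# Hypothetical records; every one is unique on exact date of birth plus postal code.
patients = [
    {"dob": "1970-01-01", "postal": "H2X"},
    {"dob": "1970-08-12", "postal": "H3A"},
    {"dob": "1982-05-17", "postal": "H3A"},
    {"dob": "1982-02-03", "postal": "H2X"},
]

coarse = [generalize(p) for p in patients]

print(smallest_group(patients, ("dob", "postal")))               # 1
print(smallest_group(coarse, ("birth_year", "postal_prefix")))   # 2
```

Every record now hides behind at least one other, but the exact dates and neighborhoods a researcher might have wanted are gone, which is the trade-off El Emam was describing.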
Wendy M. Grossman's Web site has an extensive archive of her books, articles, and music, and an archive of all the earlier columns in this series. Readers are welcome to post here, follow on Twitter, or reply by email to firstname.lastname@example.org (but please turn off HTML).