
Fighting digital cholera

To resume: briefly, during the 2014 sell-off of the Royal Mail, Ofcom considered making the proprietary Postcode Address File (PAF) database open data. Instead, over the objections of the Open Data Institute and others, PAF was sold off along with the rest of its owner.

So: you can use or disclose a postcode, but if you copy portions of the database you will be in violation of the 1996 database treaty. As public ownership recedes into the past, the argument that PAF is paid for by public money will grow daily weaker. If you can't fight it...then build your own.

The Open Data Institute is funding the alpha phase of Open Addresses, a project to create and sustain a completely new address database. At the March 2015 teacamp, Peter Wells outlined progress since its January launch. It sounds painful.

At the outset the team thought bulk use of existing open data sets would provide a reasonable chunk of the nation's addresses. Companies House, for example, has 1 million "clean" addresses, the Land Registry has 18 million, and there's the Government Digital Service's voter registration...

It was then that they discovered digital cholera. In epidemiological terms, it takes only a few cholera bacteria upstream to infect long chains of waterways communities depend on. By analogy, the Land Registry uses Royal Mail products to validate the addresses they hold. Are they contaminated? Do you want to spend money fighting legal battles with the Royal Mail to find out? Similarly, they found that GDS uses validation to check postcodes. And on and on.
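The contamination worry is, in effect, taint propagation through a graph of who validated what against whom. A minimal sketch, assuming a hand-built map of dataset dependencies; the dataset names and edges below are hypothetical illustrations, not a statement of what any organization actually does, and an edge means only "may have been checked against":

```python
from collections import deque

# Hypothetical graph: each dataset lists the sources it may have been
# derived from or validated against. Edges are illustrative assumptions.
DERIVED_FROM = {
    "royal_mail_paf": [],
    "land_registry": ["royal_mail_paf"],
    "gds_voter_registration": ["royal_mail_paf"],
    "web_form_submissions": [],
    "merged_candidate_set": ["land_registry", "web_form_submissions"],
}


def tainted_datasets(graph, proprietary):
    """Return every dataset that is proprietary or downstream of one."""
    # Invert the edges so we can walk from a source to its dependents.
    downstream = {}
    for name, sources in graph.items():
        downstream.setdefault(name, [])
        for source in sources:
            downstream.setdefault(source, []).append(name)

    # Breadth-first walk from the proprietary sources.
    tainted = set(proprietary)
    queue = deque(proprietary)
    while queue:
        current = queue.popleft()
        for dependent in downstream.get(current, []):
            if dependent not in tainted:
                tainted.add(dependent)
                queue.append(dependent)
    return tainted


if __name__ == "__main__":
    print(sorted(tainted_datasets(DERIVED_FROM, {"royal_mail_paf"})))
```

One upstream source is enough to flag everything merged from it, which is why a single validation step can rule out an otherwise open data set.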

"It's like a pumping station at the top of the tree with [intellectual property] in it," he said. "It goes everywhere, and lots of people are tainted. We need to very carefully track the provenance of every bit of data we publish. We have a responsibility to the people who use that data, like the water board." The tale of the BBC's intentions to open up its archives has a similar pattern: a hopeful plan followed by a lot of IP-filled icebergs. A piece I wrote for the Guardian in 2011 explains how addresses are generated and the five-way licensing deals that control their use. Note the threat a council website license was intended to protect against: that someone would compile a new database through repeated searches. So quaint and old-fashioned.

Open Addresses concluded that it would need to use crowdsourcing, yes, but not like *that*. To start, you can go to its portal and use the very easy form to enter any addresses you happen to know.

But, Wells said, "We needed a model that says we trust some forms of data more than others." Data from Companies House is more reliable than something typed into a web form; newer data is more reliable than older data. "The half-life of a UK address is 15 years." World War II and regeneration efforts, for example, significantly remapped a number of cities. Other techniques include checking against maps - is it reasonable that this street has 20 addresses, based on the size of the road? - and inference. If a street has a number 5 and a number 11, inference suggests it also has 7 and 9, which can in turn be checked against a map. "We get about four to five extra addresses per address received," he said. Each address has a score and eventually will have its own URL "linking the physical and virtual world".
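The scoring and inference ideas above can be sketched in a few lines. This is an illustration under stated assumptions, not Open Addresses' actual model: the per-source trust weights are invented, treating the 15-year half-life as exponential decay of trust is one reading of the quote, and the inference assumes the common UK convention of odd and even numbers on opposite sides of a street.

```python
HALF_LIFE_YEARS = 15.0  # the half-life figure quoted by Wells

# Hypothetical base trust per source; these weights are assumptions.
SOURCE_TRUST = {"companies_house": 0.9, "web_form": 0.5}


def trust_score(source, age_years):
    """Base trust for the source, halved for every half-life of age."""
    return SOURCE_TRUST[source] * 0.5 ** (age_years / HALF_LIFE_YEARS)


def infer_between(known_numbers):
    """Propose missing house numbers between known ones of the same parity.

    Inferred numbers are candidates only; each still needs checking
    against a map before it is trusted.
    """
    inferred = []
    for parity in (0, 1):
        side = sorted({n for n in known_numbers if n % 2 == parity})
        for low, high in zip(side, side[1:]):
            inferred.extend(range(low + 2, high, 2))
    return sorted(inferred)


if __name__ == "__main__":
    print(infer_between([5, 11]))
    print(trust_score("companies_house", 15))
```

Under these assumptions, a street with a known 5 and 11 yields 7 and 9 as candidates, and a fresh web-form entry can outscore a decades-old record from a more authoritative source.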

Ultimately Open Addresses hopes maintaining its database will cost perhaps 1% of what PAF does and will shrink wait times for new addresses to come onstream. At the moment, adding each of the 100,000 homes the UK adds annually can take several months as the address is generated, checked, geocoded, postcoded, and eventually published, during which occupiers have no validated address, blocking them from ordering pizza, buying insurance, or registering to vote.

Also planned is setting up a privacy advisory board. The teacamp responses showed how sensitive this area is for some people. To me, this reaction confuses "my address" and "an address I know". The address I live at has been in existence for more than 100 years; it is not *my* address but an address of which I am temporarily custodian. As an American, I'm surprised that people in a country with a hideous, feudal system like leasehold wouldn't understand this instinctively. The database Open Addresses is building is a database of addresses, not a database of who lives where. However, using postcode clusters to aggregate health data turns out to expose people in thinly populated areas to easy reidentification; one must explore whether there are similar hidden gotchas.

The key underlying issue - the source of the digital cholera - is a relatively little-known piece of intellectual property law called the database treaty, agreed in 1996. At the time (I wrote about it for the Telegraph, but the piece is no longer online), opponents such as James Love warned that if enacted the treaty could severely restrict statistical reporting of facts that until then had been considered part of the public domain. The ramifications have turned out to be even more profound. Be careful what legislation you ignore.

Wendy M. Grossman is the 2013 winner of the Enigma Award. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. Stories about the border wars between cyberspace and real life are posted occasionally during the week at the net.wars Pinboard - or follow on Twitter.

