December 4, 2020

Scraped

Somehow I had missed the hiQ Labs v. LinkedIn case until this week, when I struggled to explain on Twitter why condemning web scraping is a mistake. Over the years, many have made similar arguments to ban ordinary security tools and techniques because they may also be abused. The usual real-world analogy is: we don't ban cars just because criminals can use them to escape.

The basics: hiQ, which styles itself as a "talent management company", used automated bots to scrape public LinkedIn profiles and analyzed them to build a service advising companies what training they should invest in or which employee might be on the verge of leaving. All together now: *so* creepy! LinkedIn objected that the practice violates its terms of service and harms its business. In return, hiQ accused LinkedIn of purely anti-competitive motives, and claimed it only objected now because it was planning its own version.

LinkedIn wanted the court to rule that hiQ's scraping its profiles constitutes felony hacking under the Computer Fraud and Abuse Act (1986). Meanwhile, hiQ argued that because the profiles it scraped are public, no "hacking" was involved. EFF, along with DuckDuckGo and the Internet Archive, which both use web scraping as a basic tool, filed an amicus brief arguing correctly that web scraping is a technique in widespread use to support research, journalism, and legitimate business activities. Sure, hiQ's version is automated, but that doesn't make it different in kind.

There are two separate issues here. The first is web scraping itself, which, as EFF says, has many valid uses that don't involve social media or personal data. The TrainTimes site, for example, is vastly more accessible than the National Rail site it scrapes and re-presents. Over the last two decades, the same author, Matthew Somerville, has built numerous other such sites that avoid the heavy graphics and scripts that make so many information sites painful to use. He has indeed gotten in trouble for it sometimes; in one example, the Odeon movie theaters objected to his making movie schedules more accessible. (Query: what is anyone going to do with the Odeon movie schedule beyond choosing which ticket to buy?)

As EFF writes in its summary of the case, web scraping has also been used by journalists to investigate racial discrimination on Airbnb and find discriminatory pricing on Amazon; in the early days of the web, civic-minded British geeks used web scraping to make information about Parliament and its debates more accessible. Web scraping should not be illegal!

However, that doesn't mean that all information that can be scraped should be scraped or that all information that can be scraped should be *legal* to scrape. Like so many other basic techniques, web scraping has both good and bad uses. This is where the tricky bit lies.

Intelligence agency personnel these days talk about OSINT - "open source intelligence". "Open source" in this context (not software!) means anything they can find and save, which includes anything posted publicly on social media. Journalists also tend to view anything posted publicly as fair game for quotation and reproduction - just look at the Guardian's live blog any day of the week. Academic ethics require greater care.

There is plenty of abuse-by-scraping. As Olivia Solon reported last year, IBM scraped Flickr users' innocently posted photographs and repurposed them into a database to train facial recognition algorithms, later used by Immigration and Customs Enforcement to identify people to deport. (In June, the protests after George Floyd's murder led IBM to pull back on selling facial recognition "for mass surveillance or racial profiling".) Clearview AI scraped billions of photographs off social media and collated them into a database service to sell to law enforcement. It's safe to say that no one posted their profile on LinkedIn with the intention of helping a third-party company get paid by their employer to spy on them.

Nonetheless, those abuse cases do not make web scraping "hacking" or a crime. They are difficult to rectify in the US because, as noted in last week's review of 30 years of data protection, the US lacks relevant privacy laws. Here in the UK, since the data Somerville was scraping was not personal, his complainants typically argued that he was violating their copyright. The hiQ case, if brought outside the US, would likely be based on data protection law.

In 2019, the Ninth Circuit ruled in favor of hiQ, saying it did not violate the CFAA because LinkedIn's servers were publicly accessible. In March, LinkedIn asked the Supreme Court to review the case. SCOTUS could now decide whether scraping publicly accessible data is (or is not) a CFAA violation.

What's wrong with this picture is the complete disregard for the users in the case. As the National Review says, a ruling for hiQ could deprive users of all control over their publicly posted information. So, call a spade a spade: at its heart this case is about whether LinkedIn has an exclusive right to abuse its users' data or whether it has to share that right with any passing company with a scraping bot. The profile data hiQ scraped is public, to be sure, but to claim that this opens it up for any and all uses is no more valid than claiming that because this piece is posted publicly it is not copyrighted.


Illustrations: I simply couldn't think of one.

Wendy M. Grossman is the 2013 winner of the Enigma Award. Her Web site has an extensive archive of her books, articles, and music, and an archive of earlier columns in this series. Stories about the border wars between cyberspace and real life are posted occasionally during the week at the net.wars Pinboard - or follow on Twitter.