Last year, self-described data junkie Chris Whong obtained a 20-gigabyte treasure trove of data that detailed as many as 173 million individual cab rides in New York City in 2013, simply by asking for it.
Each trip record included the fare, the pickup and drop-off times and locations, and other metadata. What it didn’t include, in order to hide the identity of the drivers, was the license plate number or the cab’s medallion number, which is the unique four-digit ID that’s displayed on the side of yellow cabs in New York City.
In other words, the data trove, obtained through a Freedom of Information Act request, seemed to be fully anonymized. But in the world of big data, even apparently anonymous data troves can reveal a lot of personal information.
Last week, Noah Deneau, an electrical engineer and Redditor, came across a visualization tool that displays information from the taxi data trove, as well as a study that revealed that Mohammed is the most common first name among New York City taxi and limo drivers.
Curiosity struck: Would it be possible, Deneau asked himself, to identify devout Muslim drivers in New York City by searching the anonymized data trove for drivers who are inactive during the five times a day they are supposed to pray? Deneau searched quickly for drivers who had low activity within 30 to 45 minutes of set Muslim prayer times and was able to find four examples of drivers who might fit the pattern.
This was just “a little side project that I thought might be interesting,” Deneau told Mashable. But Deneau said this wasn’t about outing specific drivers; it was about proving that, “for better or for worse,” very personal information always lurks in supposedly anonymous data troves.
As it turns out, though, this is not the first time that someone has shown that this particular cab-driver data trove can be de-anonymized and potentially abused.
Shortly after Whong published the trove last year, Vijay Pandurangan, a software developer, revealed that the dataset was actually poorly anonymized and that it was trivially easy for “anyone” to find out the identities of the drivers, their gross annual income, and even infer their residence.
Months later, a summer intern at the data-analytics firm Neustar figured out that by Google-stalking celebrities hailing and leaving cabs in New York City, and correlating gossip news reports with the dataset, one could actually find out all about the taxi rides of stars like Bradley Cooper and Jessica Alba.
The intern, Anthony Tockar, even claimed it’d be possible to identify the frequent customers of Larry Flynt’s Hustler Club in Manhattan’s Hell’s Kitchen by analyzing the data. (Though not everyone was convinced Tockar was right.)
Even if Deneau’s experiment is wrong, it’s clear that New York’s massive taxi dataset leaks a lot of information. The New York City Taxi & Limousine Commission declined to comment for this story.
Two big data experts, who reviewed Deneau’s experiment for Mashable, agreed that it would be possible to scale it up and identify a considerable number of Muslim drivers this way — although it’s unclear whether Deneau actually identified Muslim drivers.
Gregory Piatetsky-Shapiro, a data-mining expert, suggested that one could even identify Muslim drivers who are not observant, or drivers with Western-sounding names who actually are devout Muslims, something that he defined as a “scary possibility.”
“This experiment shows the implications of big data,” Piatetsky-Shapiro told Mashable. “Even when data is anonymous we leave so many digital breadcrumbs that it’s very hard to remain anonymous, so one can identify unexpected things.”