Posts tagged data

Transparency ≠ Accountability

Medium, data, society, danah boyd, transparency, accountability, algorithmics

In the next ten years we will see data-driven technologies reconfigure systems in many different sectors, from autonomous vehicles to personalized learning, predictive policing to precision medicine. While the changes that we will see will create new opportunities, they will also create new challenges — and new worries — and it behooves us to start grappling with these issues now so that we can build healthy sociotechnical systems.


Media: End Reporting on Polls

Medium, data, reporting, politics, polls, civic engagement, data and society, danah boyd

We now know that the polls were wrong. Over the last few months, I’ve told numerous reporters and people in the media industry this, but I was generally ignored and dismissed. I wasn’t alone — two computer scientists whom I deeply respect — Jenn Wortman Vaughan and Hanna Wallach — were trying to get an op-ed on prediction and uncertainty into major newspapers, but were repeatedly told that the data was solid. It was not. And it will be increasingly problematic.


Rules for trusting “black boxes” in algorithmic control systems

algorithmics, trust, black boxes, security, decision making, prediction, data, machine learning, ethics


Tim O'Reilly writes about the reality that more and more of our lives – including whether you end up seeing this very sentence! – is in the hands of “black boxes” – algorithmic decision-makers whose inner workings are a secret from the people they affect.

O'Reilly proposes four tests to determine whether a black box is trustable:

1. Its creators have made clear what outcome they are seeking, and it is possible for external observers to verify that outcome.

2. Success is measurable.

3. The goals of the algorithm’s creators are aligned with the goals of the algorithm’s consumers.

4. Does the algorithm lead its creators and its users to make better longer term decisions?

O'Reilly goes on to test these assumptions against some of the existing black boxes that we trust every day, like aviation autopilot systems, and shows that this is a very good framework for evaluating algorithmic systems.

But I have three important quibbles with O'Reilly’s framing. The first is absolutely foundational: the reason that these algorithms are black boxes is that the people who devise them argue that releasing details of their models will weaken the models’ security. This is nonsense.

For example, Facebook has tweaked its algorithm to downrank “clickbait” stories. Adam Mosseri, Facebook’s VP of product management, told TechCrunch, “Facebook won’t be publicly publishing the multi-page document of guidelines for defining clickbait because ‘a big part of this is actually spam, and if you expose exactly what we’re doing and how we’re doing it, they reverse engineer it and figure out how to get around it.’”

There’s a name for this in security circles: “Security through obscurity.” It is as thoroughly discredited an idea as is possible. As far back as the 19th century, security experts have decried the idea that robust systems can rely on secrecy as their first line of defense against compromise.

The reason the algorithms O'Reilly discusses are black boxes is that the people who deploy them believe in security-through-obscurity. Allowing our lives to be manipulated in secrecy because of an unfounded, superstitious belief is as crazy as putting astrologers in charge of monetary policy, no-fly lists, hiring decisions, and parole and sentencing recommendations.

So there’s that: the best way to figure out whether we can trust a black box is to smash it open, demand that it be exposed to the disinfecting power of sunshine, and give no quarter to the ideologically bankrupt security-through-obscurity court astrologers of Facebook, Google, and the TSA.

Then there’s the second issue, which is important whether or not we can see inside the black box: what data was used to train the model? Or, in traditional scientific/statistical terms, what was the sampling methodology?

Garbage in, garbage out is a principle as old as computer science, and sampling bias is a problem that’s as old as the study of statistics. Algorithms are often deployed to replace biased systems with empirical ones: for example, predictive policing algorithms tell the cops where to look for crime, supposedly replacing racially biased stop-and-frisk with data-driven systems of automated suspicion.

But predictive policing training data comes from earlier, human-judgment-driven stop-and-frisk projects. If the cops only make black kids turn out their pockets, then all the drugs, guns and contraband they find will be in the pockets of black kids. Feed this data to a machine learning model and ask it where the future guns, drugs and contraband will be found, and it will dutifully send the police out to harass more black kids. The algorithm isn’t racist, but its training data is.
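The feedback loop described above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not a model of any real predictive policing product: the two areas "A" and "B", the identical 5% hit rate, the 1,000-search budget, and the rule "allocate searches in proportion to past finds" are all hypothetical.

```python
# Toy model (all numbers are hypothetical): two areas with the SAME true
# rate of contraband per search, but very different search histories.
TRUE_RATE = {"A": 0.05, "B": 0.05}


def run_feedback_loop(initial_searches, rounds, budget=1000):
    """Each round, split the search budget in proportion to past finds."""
    # Historical finds reflect where people were searched, not where
    # contraband is more common.
    finds = {area: n * TRUE_RATE[area] for area, n in initial_searches.items()}
    history = []
    for _ in range(rounds):
        total_finds = sum(finds.values())
        # The "model": predict future finds from past finds only.
        allocation = {area: budget * finds[area] / total_finds for area in finds}
        # New finds are produced by the new searches and fed back in.
        for area, n in allocation.items():
            finds[area] += n * TRUE_RATE[area]
        history.append(allocation)
    return history


# Area A starts with 90% of the historical searches.
history = run_feedback_loop({"A": 900, "B": 100}, rounds=5)
for round_number, allocation in enumerate(history, start=1):
    print(round_number, {area: round(n) for area, n in allocation.items()})
```

In this sketch the allocation never moves away from the initial 90/10 split, even though both areas have identical underlying rates. The model reproduces the sampling bias it was trained on.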

There’s a final issue, which is that algorithms have to have their models tweaked based on measurements of success. It’s not enough to merely measure success: the errors in the algorithm’s predictions also have to be fed back to it, to correct the model. That’s the difference between Amazon’s sales-optimization and automated hiring systems. Amazon’s systems predict ways of improving sales, which the company tries: the failures are used to change the model to improve it. But automated hiring systems blackball some applicants and advance others, and the companies that make these systems don’t track whether the excluded people go on to be great employees somewhere else, or whether the recommended hires end up stealing from the company or alienating its customers.

I like O'Reilly’s framework for evaluating black boxes, but I think we need to go farther.

The Google codebase [as of January 2015] includes approximately one billion files and has a history of approximately 35 million…

google, versioning, software, ACM, data

The Google codebase [as of January 2015] includes approximately one billion files and has a history of approximately 35 million commits spanning Google’s entire 18-year existence. The repository contains 86TB [Total size of uncompressed content, excluding release branches] of data, including approximately two billion lines of code in nine million unique source files. The total number of files also includes source files copied into release branches, files that are deleted at the latest revision, configuration files, documentation, and supporting data files.

Google’s codebase is shared by more than 25,000 Google software developers from dozens of offices in countries around the world. On a typical workday, they commit 16,000 changes to the codebase, and another 24,000 changes are committed by automated systems. Each day the repository serves billions of file read requests, with approximately 800,000 queries per second during peak traffic and an average of approximately 500,000 queries per second each workday. Most of this traffic originates from Google’s distributed build-and-test systems.

The Geographical Oddity of Null Island

geography, geomancy, imagination, places, null, errorism, data, corruption, correlation

Null Island is an imaginary island located at 0°N 0°E (hence “Null”) in the South Atlantic Ocean. This point is where the Equator meets the Prime Meridian. The concept of the island originated in 2011 when it was drawn into Natural Earth, a public domain map dataset developed by volunteer cartographers and GIS analysts. In creating a one-square-meter plot of land at 0°N 0°E in the digital dataset, Null Island was intended to help analysts flag errors in a process known as “geocoding.”
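A minimal sketch of how that flagging might look in practice, assuming geocoded records arrive as dictionaries with `lat` and `lon` keys. The record layout, the function names, and the tolerance are assumptions made for illustration, not part of Natural Earth or any geocoding library, and the sample coordinates are approximate.

```python
def is_null_island(lat, lon, tolerance=1e-9):
    """True when a coordinate pair sits at (or within tolerance of) 0°N 0°E."""
    return abs(lat) <= tolerance and abs(lon) <= tolerance


def split_geocoded(records):
    """Separate plausible records from ones that landed on Null Island."""
    good, suspect = [], []
    for record in records:
        if is_null_island(record["lat"], record["lon"]):
            suspect.append(record)  # most likely a failed geocode
        else:
            good.append(record)
    return good, suspect


# Hypothetical sample data.
records = [
    {"name": "Greenwich", "lat": 51.4779, "lon": -0.0015},
    {"name": "unparsed address", "lat": 0.0, "lon": 0.0},
]
good, suspect = split_geocoded(records)
print([r["name"] for r in suspect])
```

Records in `suspect` are candidates for re-geocoding rather than proof of an error, since a real observation could in principle be taken at that point.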


“SwiftKey analyzed more than one billion pieces of emoji data across a wide range of categories to learn how speakers of 16…

emoji, chart, unicode, analysis, data, swiftkey, language

“SwiftKey analyzed more than one billion pieces of emoji data across a wide range of categories to learn how speakers of 16 different languages and regions use emoji. The findings in this report came from an analysis of aggregate SwiftKey Cloud data over a four month period between October 2014 and January 2015, and includes both Android and iOS devices”

An expanded set of fundamental surface temperature records

science, climate, weather, data, temperature, ISTI, forecast, open, surface temperature, climate cha

Version 1.0.0 of the Global Land Surface Databank has been released and data are provided from a primary ftp site hosted by the Global Observing Systems Information Center (GOSIC) and World Data Center A at NOAA NCDC. The Stage Three dataset has multiple formats, including a format approved by ISTI, a format similar to GHCN-M, and netCDF files adhering to the Climate and Forecast (CF) convention. The data holding is version controlled and will be updated frequently in response to newly discovered data sources and user comments. All processing code is provided, for openness and transparency. Users are encouraged to experiment with the techniques used in these algorithms. The programs are designed to be modular, so that individuals have the option to develop and implement other methods that may be more robust than described here. We will remain open to releases of new versions should such techniques be constructed and verified.

Cryptoforestry: Food pairing / gastronomy with a telescope

food pairing, food, cryptoforestry, open sauces, data, cooking, geography, gastronomy

The number of foodpairs a recipe generates grows quadratically with the number of ingredients. A typical cookbook (and the ones we use here are all modest ones) yields anywhere between 700 and 2500 pairs. The number of connections when comparing three books is large, and we have not yet found a really meaningful way to visualize a foodpair comparison. Instead we have turned to using the Jaccard Index, a simple formula for comparing similarity in datasets. If two books are absolutely similar (a book compared with itself) the index is 1; if the books are completely dissimilar the index is 0. So the higher the number, the greater the similarity.
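The Jaccard Index mentioned above is the size of the intersection of two sets divided by the size of their union. A minimal sketch, assuming each cookbook is represented as a list of recipes and each recipe as a list of ingredient names: the helper names, the sample recipes, and the convention that two empty books score 1.0 are assumptions for illustration, not the original author's code.

```python
from itertools import combinations


def food_pairs(recipe):
    """All unordered ingredient pairs in one recipe."""
    return {frozenset(pair) for pair in combinations(sorted(set(recipe)), 2)}


def book_pairs(recipes):
    """Union of the food pairs of every recipe in a cookbook."""
    pairs = set()
    for recipe in recipes:
        pairs |= food_pairs(recipe)
    return pairs


def jaccard(a, b):
    """Jaccard Index: |A ∩ B| / |A ∪ B|, from 0 (disjoint) to 1 (identical)."""
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical one-recipe "cookbooks".
book_one = book_pairs([["tomato", "basil", "garlic"]])
book_two = book_pairs([["tomato", "basil", "chili"]])

print(jaccard(book_one, book_one))  # a book compared with itself
print(jaccard(book_one, book_two))  # one shared pair out of five distinct pairs
```

Pairs are stored as `frozenset` objects so that "tomato, basil" and "basil, tomato" count as the same pair.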

How IBM’s Watson Will Make Your Meals Tastier

food, data, flavour pairing, hedonic psychophysics, food chemistry, open sauces, IBM, Bon Appetit

This week, Bon Appetit and IBM are releasing the beta version of a new app called Chef Watson with Bon Appetit that will help home chefs think up new and inspiring ways to use ingredients. Think of Watson as an algorithmically inclined sous chef that gently suggests hundreds of flavor combinations that you’d probably never come up with on your own. To do this, Watson crawled a database of 9,000 Bon Appetit recipes looking for insights and patterns about how ingredients pair together, what style of food it is and how each food is prepared in an actual dish. When the computer combines this information with its already robust understanding of food chemistry and hedonic psychophysics (the psychology of what people find pleasant and unpleasant), you get a very smart kitchen assistant.

Cryptoforestry: Food pairing as gastronomy with a telescope

food, food pairing, flavour-pairing, cuisine, data, ingredients, recipes, culture

The theory of food pairing inspires little faith but when moving away from culinary applications perhaps it can be used to differentiate cuisines and cooking styles. How Chinese is Jamie Oliver? How similar are Mexican and Indian cuisines? How do French and Indian cooking differ? How unique is Rene Redzepi? The aim is to find a way to reveal the inner structure and logic of a cuisine, if such a thing exists, by comparing the way a cuisine or a cook combines ingredients with other cuisines and cooks.

Did Malaysian Airlines 370 disappear using SIA68 (another 777)?

MH370, SIA68, aircraft, hijacking, data, airspace, tracking, missing aircraft

Once MH370 had cleared the volatile airspaces and was safe from being detected by military radar sites in India, Pakistan, and Afghanistan, it would have been free to break off from the shadow of SIA68 and could have then flown a path to its final landing site. There are several locations along the flight path of SIA68 where it could have easily broken contact and flown and landed in Xinjiang province, Kyrgyzstan, or Turkmenistan. Each of these final locations would match up almost perfectly with the 7.5 hours of total flight time and trailing SIA68. In addition, these locations are all possibilities that are on the “ARC” and fit with the data provided by Inmarsat from the SATCOM’s last known ping at 00:11 UTC.


food, taste, data, flavour

The Gastrograph system uses 18 Broad Spectrum Flavor Categories, and 6 sensations (trigeminal & somatosensory), which together fully encompass gustatory flavor space; such that anything you can taste, you can graph. The GG System allows you to compare amalgamated flavor profiles of consumer and Quality Control based reviews on all of your products, in order to determine Correlates of Quality, consumer preference, and analyze changes in products over time (ageing), all independent of reviewer variables such as socio-economic background, ethnic background, age, sex, and flavor sensitivity.

The Web as a Preservation Medium

small data, archiving, digital archives, memory, data, www, access

The archival record … is best understood as a sliver of a sliver of a sliver of a window into process. It is a fragile thing, an enchanted thing, defined not by its connections to “reality,” but by its open-ended layerings of construction and reconstruction. Far from constituting the solid structure around which imagination can play, it is itself the stuff of imagination.

Human Resolution

digital, material, physical substrate, energy, internet, data, materiality, post-invisibility

Relating a Google search return to an equivalent expenditure of fossil fuels, or the fluctuation of pixels across a screen with the exploited labour of rural migrant workers in Shenzhen, or topsoil loss in Inner Mongolia, is as remote and unattainable for the majority of users as is an understanding of the technical functionality of the devices themselves.

Paleoclimate: The End of the Holocene

climate, global temperature, holocene, data

Recently a group of researchers from Harvard and Oregon State University has published the first global temperature reconstruction for the last 11,000 years – that’s the whole Holocene (Marcott et al. 2013). The results are striking and worthy of further discussion, after the authors have already commented on their results in this blog.

Toiling in the data-mines: what data exploration feels like

data, code, material exploration, tom armitage, BERG

There are several aspects to this post. Partly, it’s about what material explorations look like when performed with data. Partly, it’s about the role of code as a tool to explore data. We don’t write about code much on the site, because we’re mainly interested in the products we produce and the invention involved in them, but it’s sometimes important to talk about processes and tools, and this, I feel, is one of those times. At the same time, as well as talking about technical matters, I wanted to talk a little about what the act of doing this work feels like.


spime, programmable matter, IoT, blogject, blobject, data, network, neologism

Kirkyan is a currently-theoretical “Thing” (PDF) related to both blogject (early example here by originator of the kirkyan concept) and spime. At the core, the concept revolves around the idea that the same data used to create a physical spime can be used to also create a virtual spime, and that the two can then be connected via the same ubiquitous computing network. Where spimes have a number of predefined limitations (e.g. “Cradle-to-cradle” life-spans), a kirkyan is inherently redundant and thus has additional capabilities. Furthermore, while the physical and the virtual are related and in constant networked contact, they are, to a significant degree, autonomous.