AI has a GIGO problem
The computer science maxim “garbage in, garbage out” (GIGO) dates back at least as far as 1957. It’s an iron law of computing: no matter how powerful your data-processing system is, if you feed it low-quality data, you’ll get low-quality conclusions.
And of course, machine learning (AKA “AI”) (ugh) does not repeal GIGO. Far from it. ML systems that operate on garbage data produce garbage predictive models, which produce garbage conclusions at vast scale, coated with a veneer of algorithmic objectivity facewash.
The scale and credibility of ML-derived GIGO presents huge risks to our society in domains as varied as the credit system, criminal justice, hiring, education - even whether your kids will be taken away by Child Protective Services.
To make this all worse, the vast data-sets used to train ML systems are in scarce supply, which leads multiple ML models to be trained on the same data, enshrining the defects of that data in all kinds of systems.
One of the most significant training datasets is Imagenet, a collection of 14m labeled images that jumpstarted the ML revolution in 2012. As Will Knight writes for Wired, Imagenet’s labels came from low-waged, undersupervised workers.
https://www.wired.com/story/foundations-ai-riddled-errors/
Imagenet is one of the data-sets examined in new research from MIT’s Curtis Northcutt, who found that Imagenet and other comparable datasets have a typical error rate of about 6%.
https://arxiv.org/abs/2103.14749
This small margin of error has big consequences: first, because the errors aren’t evenly distributed, and instead cluster around the kinds of biases that labelers have (for example, labeling images of women medical professionals with “nurse” and men with “doctor”).
And second, because the incorrect labels obscure relative performance differences between models. When one model does better than another, you can’t know if that’s because it is a better model, or because it’s less sensitive to incorrect labels.
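To see how that second problem plays out, here’s a toy simulation (mine, not from the Northcutt paper): two made-up classifiers are scored against a benchmark whose labels are wrong about 6% of the time. The model that has effectively memorized the benchmark’s mistakes looks better on the noisy test set, even though it’s worse at the real task. The binary labels, accuracy numbers and model names are all invented for illustration.

import random

random.seed(0)

N = 100_000     # test-set size
NOISE = 0.06    # ~6% of benchmark labels are wrong (the ballpark error rate above)

# Ground-truth labels (binary, for simplicity) and the noisy labels
# the benchmark actually ships with.
truth = [random.randint(0, 1) for _ in range(N)]
benchmark = [y if random.random() > NOISE else 1 - y for y in truth]

# Model A: genuinely better -- matches the TRUE label 90% of the time.
model_a = [y if random.random() < 0.90 else 1 - y for y in truth]

# Model B: worse at the real task, but it has effectively memorized the
# benchmark's mistakes -- it matches the NOISY label 92% of the time.
model_b = [y if random.random() < 0.92 else 1 - y for y in benchmark]

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

print("Scored on the noisy benchmark:")
print(f"  model A: {accuracy(model_a, benchmark):.3f}")  # ~0.85
print(f"  model B: {accuracy(model_b, benchmark):.3f}")  # ~0.92 <- looks better
print("Scored on corrected labels:")
print(f"  model A: {accuracy(model_a, truth):.3f}")      # ~0.90 <- actually better
print(f"  model B: {accuracy(model_b, truth):.3f}")      # ~0.87

The leaderboard built on the noisy labels crowns the wrong model - which is exactly the kind of ranking flip you can’t detect unless someone goes back and audits the labels themselves.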
Image:
Seydelmann (modified)
https://commons.wikimedia.org/wiki/File:GW300_1.jpg
CC BY-SA:
https://creativecommons.org/licenses/by-sa/3.0/deed.en

Cryteria (modified)
https://commons.wikimedia.org/wiki/File:HAL9000.svg