Posts tagged data

Can speculative evidence inform decision making?

Medium, Anab Jain, futures, decision making, choice, uncertainty, evidence, speculation, data, 2017, Superflux

Over at Superflux, our work investigating potential and plausible futures involves extensively scanning for trends and signals, from which we trace and extrapolate into the future. Both qualitative and quantitative data play an important role. In doing such work, we have observed how data is often used as evidence, and seen as definitive. Historical and contemporary datasets are often used as evidence for a mandate for future change, especially in some of the work we have undertaken with governments and policy makers. But lately we have been wondering whether this drive for data as evidence has led to the unshakeable belief that data is evidence.

via https://medium.com/@anabjain/can-speculative-evidence-inform-decision-making-6f7d398d201f

The most Geo-tagged Place on Earth

Medium, GPS, geography, data, null island, Borges, The Aleph, geotag, glitch

This brings me back to what is likely the most geo-tagged place on earth. It is a place that can be found marked with unambiguous precision on many social media sites or self-crafted mapping projects. The place seems to be relevant in almost any context, and has been tagged and described in an unaccountable number of ways. The place seems to combine many places at once, all sharing the same location — similar to Jorge Luis Borges’ Aleph: “The Aleph’s diameter was probably little more than an inch, but all space was there, actual and undiminished. Each thing (a mirror’s face, let us say) was infinite things, since I distinctly saw it from every angle of the universe.”

via https://blog.offenhuber.net/the-most-geo-tagged-place-on-earth-fc76758cc505

The superpower of interactive datavis? A micro-macro view!

Medium, statistics, data, infoviz, micro-macro

What I mean by micro-macro is trying to get a better understanding of the world by accessing it on two levels: for one, there’s the micro-level of anecdotes where we get the good feeling of looking at actual, concrete aspects of the world instead of abstract mathematical descriptions. But we combine this with the macro-level to understand how these relatable anecdotes fit into the whole. This dual approach enables us to estimate if a given example represents normalcy (a stand-in for how things “usually” are) or is an outlier and does not allow conclusions for all cases.

via https://medium.com/@dominikus/the-superpower-of-interactive-datavis-a-micro-macro-view-4d027e3bdc71

Data, Fairness, Algorithms, Consequences

Medium, danah boyd, data, privacy, algorithms, bias, discrimination, transparency, responsibility

When we open up data, are we empowering people to come together? Or to come apart? Who defines the values that we should be working towards? Who checks to make sure that our data projects are moving us towards those values? If we aren’t clear about what we want and the trade-offs that are involved, simply opening up data can — and often does — reify existing inequities and structural problems in society. Is that really what we’re aiming to do?

via https://points.datasociety.net/toward-accountability-6096e38878f0

Digital Privacy at the U.S. Border: Protecting the Data On Your Devices and In the Cloud

EFF, privacy, security, travel, US, USA, data, borders

The U.S. government reported a five-fold increase in the number of electronic media searches at the border in a single year, from 4,764 in 2015 to 23,877 in 2016. Every one of those searches was a potential privacy violation. Our lives are minutely documented on the phones and laptops we carry, and in the cloud. Our devices carry records of private conversations, family photos, medical documents, banking information, information about what websites we visit, and much more. Moreover, people in many professions, such as lawyers and journalists, have a heightened need to keep their electronic information confidential. How can travelers keep their digital data safe? This guide (updating a previous guide from 2011) helps travelers understand their individual risks when crossing the U.S. border, provides an overview of the law around border search, and offers a brief technical overview to securing digital data.

via https://www.eff.org/wp/digital-privacy-us-border-2017

Scientists make huge dataset of nearby stars available to public

astronomy, space, MIT, data

Today, a team that includes MIT and is led by the Carnegie Institution for Science has released the largest collection of observations made with a technique called radial velocity, to be used for hunting exoplanets. The huge dataset, taken over two decades by the W.M. Keck Observatory in Hawaii, is now available to the public, along with an open-source software package to process the data and an online tutorial. By making the data public and user-friendly, the scientists hope to draw fresh eyes to the observations, which encompass almost 61,000 measurements of more than 1,600 nearby stars.

via http://news.mit.edu/2017/dataset-nearby-stars-available-public-exoplanets-0213

Breaking things is easy

machine-learning, security, modeling, model, data, ML, 2016

Until a few years ago, machine learning algorithms simply did not work very well on many meaningful tasks like recognizing objects or translation. Thus, when a machine learning algorithm failed to do the right thing, this was the rule, rather than the exception. Today, machine learning algorithms have advanced to the next stage of development: when presented with naturally occurring inputs, they can outperform humans. Machine learning has not yet reached true human-level performance, because when confronted by even a trivial adversary, most machine learning algorithms fail dramatically. In other words, we have reached the point where machine learning works, but may easily be broken.
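
The fragility the post describes can be illustrated with a minimal sketch in the spirit of the fast gradient sign method for a linear classifier: nudge every feature a small step against the sign of the corresponding weight, and the prediction flips even though no single feature moved much. The weights and input below are toy values invented for illustration, not anything from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100)   # toy classifier weights
x = rng.normal(size=100)   # a "naturally occurring" input

def predict(x):
    """Linear classifier: sign of the score w . x."""
    return 1 if w @ x > 0 else -1

# Move each feature a small step eps in the direction that pushes the
# score toward the opposite class: x_adv = x - eps * sign(w) * label.
# The score shifts by eps * sum(|w|), which swamps the original score.
eps = 0.5
x_adv = x - eps * np.sign(w) * predict(x)

print(predict(x), predict(x_adv))  # the two labels disagree
```

The per-feature perturbation is bounded by `eps`, yet because every coordinate conspires in the same direction, the aggregate effect on the score is large — the same asymmetry that makes adversarial examples cheap to construct against far bigger models.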

via http://www.cleverhans.io/security/privacy/ml/2016/12/16/breaking-things-is-easy.html

Emacs for Data Science

emacs, programming, text-editor, open-science, data

A modern data scientist often has to work on multiple platforms with multiple languages. Some projects may be in R, others in Python. Or perhaps you have to work on a cluster with no GUI. Or maybe you need to write papers with LaTeX. You can do all that with Emacs and customize it to do whatever you like. I won’t lie though. The learning curve can be steep, but I think the investment is worth it.

via https://blog.insightdatascience.com/emacs-for-data-science-af814b78eb41#.kkdmh5g6x

Transparency ≠ Accountability

Medium, data, society, danah boyd, transparency, accountability, algorithmics

In the next ten years we will see data-driven technologies reconfigure systems in many different sectors, from autonomous vehicles to personalized learning, predictive policing to precision medicine. While the changes that we will see will create new opportunities, they will also create new challenges — and new worries — and it behooves us to start grappling with these issues now so that we can build healthy sociotechnical systems.

via https://points.datasociety.net/transparency-accountability-3c04e4804504

Media: End Reporting on Polls

Medium, data, reporting, politics, polls, civic engagement, data and society, danah boyd

We now know that the polls were wrong. Over the last few months, I’ve told numerous reporters and people in the media industry this, but I was generally ignored and dismissed. I wasn’t alone — two computer scientists whom I deeply respect — Jenn Wortman Vaughan and Hanna Wallach — were trying to get an op-ed on prediction and uncertainty into major newspapers, but were repeatedly told that the data was solid. It was not. And it will be increasingly problematic.

via https://points.datasociety.net/media-end-reporting-on-polls-c9b5df705b7f

Rules for trusting “black boxes” in algorithmic control systems

algorithmics, trust, black boxes, security, decision making, prediction, data, machine learning, ethics

mostlysignssomeportents:

Tim O'Reilly writes about the reality that more and more of our lives – including whether you end up seeing this very sentence! – is in the hands of “black boxes” – algorithmic decision-makers whose inner workings are a secret from the people they affect.

O'Reilly proposes four tests to determine whether a black box is trustable:

1. Its creators have made clear what outcome they are seeking, and it is possible for external observers to verify that outcome.

2. Success is measurable.

3. The goals of the algorithm’s creators are aligned with the goals of the algorithm’s consumers.

4. The algorithm leads its creators and its users to make better long-term decisions.

O'Reilly goes on to test these assumptions against some of the existing black boxes that we trust every day, like aviation autopilot systems, and shows that this is a very good framework for evaluating algorithmic systems.

But I have three important quibbles with O'Reilly’s framing. The first is absolutely foundational: the reason that these algorithms are black boxes is that the people who devise them argue that releasing details of their models will weaken the models’ security. This is nonsense.

For example, Facebook tweaked its algorithm to downrank “clickbait” stories. Adam Mosseri, Facebook’s VP of product management, told TechCrunch, “Facebook won’t be publicly publishing the multi-page document of guidelines for defining clickbait because ‘a big part of this is actually spam, and if you expose exactly what we’re doing and how we’re doing it, they reverse engineer it and figure out how to get around it.’”

There’s a name for this in security circles: “Security through obscurity.” It is as thoroughly discredited an idea as is possible. As far back as the 19th century, security experts have decried the idea that robust systems can rely on secrecy as their first line of defense against compromise.

The reason the algorithms O'Reilly discusses are black boxes is because the people who deploy them believe in security-through-obscurity. Allowing our lives to be manipulated in secrecy because of an unfounded, superstitious belief is as crazy as putting astrologers in charge of monetary policy, no-fly lists, hiring decisions, and parole and sentencing recommendations.

So there’s that: the best way to figure out whether we can trust a black box is to smash it open, demand that it be exposed to the disinfecting power of sunshine, and give no quarter to the ideologically bankrupt security-through-obscurity court astrologers of Facebook, Google, and the TSA.

Then there’s the second issue, which is important whether or not we can see inside the black box: what data was used to train the model? Or, in traditional scientific/statistical terms, what was the sampling methodology?

Garbage in, garbage out is a principle as old as computer science, and sampling bias is a problem that’s as old as the study of statistics. Algorithms are often deployed to replace biased systems with empirical ones: for example, predictive policing algorithms tell the cops where to look for crime, supposedly replacing racially biased stop-and-frisk with data-driven systems of automated suspicion.

But predictive policing training data comes from earlier, human-judgment-driven stop-and-frisk projects. If the cops only make black kids turn out their pockets, then all the drugs, guns and contraband they find will be in the pockets of black kids. Feed this data to a machine learning model and ask it where the future guns, drugs and contraband will be found, and it will dutifully send the police out to harass more black kids. The algorithm isn’t racist, but its training data is.
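
The feedback loop described above can be sketched as a toy simulation: two neighborhoods with the same true contraband rate, but historical stops concentrated in one of them. Every number below is invented for illustration.

```python
import random

random.seed(0)

# Two neighborhoods with an IDENTICAL underlying contraband rate,
# but a biased history of where police chose to look.
TRUE_RATE = 0.05
stops = {"A": 900, "B": 100}   # historical stop-and-frisk counts

# "Training data": hits found, given where the searches happened.
hits = {n: sum(random.random() < TRUE_RATE for _ in range(k))
        for n, k in stops.items()}

# A model trained on raw hit counts sends patrols wherever past hits
# were found — i.e. wherever past stops were — reproducing the bias.
patrol_target = max(hits, key=hits.get)
print(hits, "-> patrol", patrol_target)

# Normalizing by the number of stops recovers the (equal) true rates.
rates = {n: hits[n] / stops[n] for n in stops}
print(rates)
```

Raw counts make neighborhood A look roughly nine times "more criminal" purely because it was searched nine times as often; the per-stop rates are statistically indistinguishable. The racism lives in the sampling, not the arithmetic.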

There’s a final issue, which is that algorithms have to have their models tweaked based on measurements of success. It’s not enough to merely measure success: the errors in the algorithm’s predictions also have to be fed back to it, to correct the model. That’s the difference between Amazon’s sales-optimization and automated hiring systems. Amazon’s systems predict ways of improving sales, which the company tries: the failures are used to change the model to improve it. But automated hiring systems blackball some applicants and advance others, and the companies that make these systems don’t track whether the excluded people go on to be great employees somewhere else, or whether the recommended hires end up stealing from the company or alienating its customers.

I like O'Reilly’s framework for evaluating black boxes, but I think we need to go farther.

http://boingboing.net/2016/09/15/rules-for-trusting-black-box.html

The Google codebase [as of January 2015] includes approximately one billion files and has a history of approximately 35 million…

google, versioning, software, ACM, data

The Google codebase [as of January 2015] includes approximately one billion files and has a history of approximately 35 million commits spanning Google’s entire 18-year existence. The repository contains 86TB [Total size of uncompressed content, excluding release branches] of data, including approximately two billion lines of code in nine million unique source files. The total number of files also includes source files copied into release branches, files that are deleted at the latest revision, configuration files, documentation, and supporting data files.


Google’s codebase is shared by more than 25,000 Google software developers from dozens of offices in countries around the world. On a typical workday, they commit 16,000 changes to the codebase, and another 24,000 changes are committed by automated systems. Each day the repository serves billions of file read requests, with approximately 800,000 queries per second during peak traffic and an average of approximately 500,000 queries per second each workday. Most of this traffic originates from Google’s distributed build-and-test systems.

http://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext

The Geographical Oddity of Null Island

geography, geomancy, imagination, places, null, errorism, data, corruption, correlation

Null Island is an imaginary island located at 0°N 0°E (hence “Null”) in the South Atlantic Ocean. This point is where the Equator meets the Prime Meridian. The concept of the island originated in 2011 when it was drawn into Natural Earth, a public domain map dataset developed by volunteer cartographers and GIS analysts. In creating a one-square meter plot of land at 0°N 0°E in the digital dataset, Null Island was intended to help analysts flag errors in a process known as “geocoding.”
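
The flagging idea is easy to sketch: treat coordinates of exactly (0, 0) as a failed geocode rather than a real location. The sample records below are invented for illustration.

```python
# Records whose coordinates collapse to exactly (0, 0) almost always
# mean the geocoder returned a null/default value, not a real place —
# they have "landed on Null Island".
records = [
    {"name": "Library of Congress", "lat": 38.8887, "lon": -77.0047},
    {"name": "unparseable address", "lat": 0.0, "lon": 0.0},
]

def on_null_island(rec):
    return rec["lat"] == 0.0 and rec["lon"] == 0.0

suspect = [r["name"] for r in records if on_null_island(r)]
print(suspect)  # ['unparseable address']
```

A one-square-meter island in the dataset serves the same purpose visually: any point rendered there jumps out on a map as a geocoding error.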

via https://blogs.loc.gov/maps/2016/04/the-geographical-oddity-of-null-island/

“SwiftKey analyzed more than one billion pieces of emoji data across a wide range of categories to learn how speakers of 16…

emoji, chart, unicode, analysis, data, swiftkey, language

“SwiftKey analyzed more than one billion pieces of emoji data across a wide range of categories to learn how speakers of 16 different languages and regions use emoji. The findings in this report came from an analysis of aggregate SwiftKey Cloud data over a four month period between October 2014 and January 2015, and includes both Android and iOS devices”

http://www.scribd.com/doc/262594751/SwiftKey-Emoji-Report

An expanded set of fundamental surface temperature records

science, climate, weather, data, temperature, ISTI, forecast, open, surface temperature, climate change

Version 1.0.0 of the Global Land Surface Databank has been released and data are provided from a primary ftp site hosted by the Global Observing Systems Information Center (GOSIC) and World Data Center A at NOAA NCDC. The Stage Three dataset has multiple formats, including a format approved by ISTI, a format similar to GHCN-M, and netCDF files adhering to the Climate and Forecast (CF) convention. The data holding is version controlled and will be updated frequently in response to newly discovered data sources and user comments. All processing code is provided, for openness and transparency. Users are encouraged to experiment with the techniques used in these algorithms. The programs are designed to be modular, so that individuals have the option to develop and implement other methods that may be more robust than described here. We will remain open to releases of new versions should such techniques be constructed and verified.

http://www.realclimate.org/index.php/archives/2014/07/release-of-the-international-surface-temperature-initiatives-istis-global-land-surface-databank-an-expanded-set-of-fundamental-surface-temperature-records/

Cryptoforestry: Food pairing / gastronomy with a telescope

food pairing, food, cryptoforestry, open sauces, data, cooking, geography, gastronomy

The number of foodpairs a recipe generates increases exponentially with the number of ingredients. A typical cookbook (and the ones we use here are all modest ones) yields anywhere between 700 and 2500 pairs; the number of connections when comparing three books is large, and we have not yet found a really meaningful way to visualize a foodpair comparison. Instead we have turned to the Jaccard Index, a simple formula for comparing similarity in datasets. If two books are absolutely similar (a book compared with itself) the index is 1; if the books are completely dissimilar the index is 0. So the higher the number, the greater the similarity.
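
The Jaccard Index is straightforward to compute; here is a minimal sketch with invented ingredient-pair sets standing in for two cookbooks.

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|: 1.0 for identical sets, 0.0 for disjoint ones."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Toy foodpair sets — two shared pairs, one unique pair each.
book_x = {("tomato", "basil"), ("garlic", "olive oil"), ("lemon", "thyme")}
book_y = {("tomato", "basil"), ("garlic", "olive oil"), ("soy", "ginger")}

print(jaccard(book_x, book_x))  # 1.0 — a book compared with itself
print(jaccard(book_x, book_y))  # 0.5 — 2 shared pairs out of 4 distinct
```

Because the index is a ratio over the union, it stays comparable even when one book generates far more pairs than another — which is exactly the problem with comparing raw pair counts.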

http://cryptoforest.blogspot.nl/2014/07/food-pairing-gastronomy-with-telescope.html

How IBM’s Watson Will Make Your Meals Tastier

food, data, flavour pairing, hedonic psychophysics, food chemistry, open sauces, IBM, Bon Appetit

This week, Bon Appetit and IBM are releasing the beta version of a new app called Chef Watson with Bon Appetit that will help home chefs think up new and inspiring ways to use ingredients. Think of Watson as an algorithmically inclined sous chef that gently suggests hundreds of flavor combinations that you’d probably never come up with on your own. To do this, Watson crawled a database of 9,000 Bon Appetit recipes looking for insights and patterns about how ingredients pair together, what style of food it is and how each food is prepared in an actual dish. When the computer combines this information with its already robust understanding of food chemistry and hedonic psychophysics (the psychology of what people find pleasant and unpleasant), you get a very smart kitchen assistant.

http://www.wired.com/2014/06/how-ibms-watson-will-make-your-meals-tastier/

Cryptoforestry: Food pairing as gastronomy with a telescope

food, food pairing, flavour-pairing, cuisine, data, ingredients, recipes, culture

The theory of food pairing inspires little faith but when moving away from culinary applications perhaps it can be used to differentiate cuisines and cooking styles. How Chinese is Jamie Oliver? How similar are Mexican and Indian cuisines? How do French and Indian cooking differ? How unique is Rene Redzepi? The aim is to find a way to reveal the inner structure and logic of a cuisine, if such a thing exists, by comparing the way a cuisine or a cook combines ingredients with other cuisines and cooks.

http://cryptoforest.blogspot.nl/2014/05/food-pairing-gastronomy-with-telescope.html

Did Malaysian Airlines 370 disappear using SIA68 (another 777)?

MH370, SIA68, aircraft, hijacking, data, airspace, tracking, missing aircraft

Once MH370 had cleared the volatile airspaces and was safe from being detected by military radar sites in India, Pakistan, and Afghanistan it would have been free to break off from the shadow of SIA68 and could have then flown a path to its final landing site. There are several locations along the flight path of SIA68 where it could have easily broken contact and flown and landed in Xinjiang province, Kyrgyzstan, or Turkmenistan. Each of these final locations would match up almost perfectly with the 7.5 hours of total flight time and trailing SIA68. In addition, these locations are all possibilities that are on the “ARC” and fit with the data provided by Inmarsat from the SATCOM’s last known ping at 00:11 UTC.

http://keithledgerwood.tumblr.com/post/79838944823/did-malaysian-airlines-370-disappear-using-sia68

Gastrograph

food, taste, data, flavour

The Gastrograph system uses 18 Broad Spectrum Flavor Categories, and 6 sensations (trigeminal & somatosensory), which together fully encompass gustatory flavor space; such that anything you can taste, you can graph. The GG System allows you to compare amalgamated flavor profiles of consumer and Quality Control based reviews on all of your products, in order to determine Correlates of Quality, consumer preference, and analyze changes in products over time (ageing), all independent of reviewer variables such as socio-economic background, ethnic background, age, sex, and flavor sensitivity.

https://www.gastrograph.com/

The Web as a Preservation Medium

small data, archiving, digital archives, memory, data, www, access

The archival record … is best understood as a sliver of a sliver of a sliver of a window into process. It is a fragile thing, an enchanted thing, defined not by its connections to “reality,” but by its open-ended layerings of construction and reconstruction. Far from constituting the solid structure around which imagination can play, it is itself the stuff of imagination.

http://inkdroid.org/journal/2013/11/26/the-web-as-a-preservation-medium/

Human Resolution

digital, material, physical substrate, energy, internet, data, materiality, post-invisibility

Relating a Google search return to an equivalent expenditure of fossil fuels, or the fluctuation of pixels across a screen with the exploited labour of rural migrant workers in Shenzhen, or topsoil loss in Inner Mongolia, is as remote and unattainable for the majority of users as is an understanding of the technical functionality of the devices themselves.

http://www.metamute.org/editorial/articles/human-resolution

Paleoclimate: The End of the Holocene

climate, global temperature, holocene, data

Recently a group of researchers from Harvard and Oregon State University has published the first global temperature reconstruction for the last 11,000 years – that’s the whole Holocene (Marcott et al. 2013). The results are striking and worthy of further discussion, after the authors have already commented on their results in this blog.

http://www.realclimate.org/index.php/archives/2013/09/paleoclimate-the-end-of-the-holocene/

Toiling in the data-mines: what data exploration feels like

data, code, material exploration, tom armitage, BERG

There are several aspects to this post. Partly, it’s about what material explorations look like when performed with data. Partly, it’s about the role of code as a tool to explore data. We don’t write about code much on the site, because we’re mainly interested in the products we produce and the invention involved in them, but it’s sometimes important to talk about processes and tools, and this, I feel, is one of those times. At the same time, as well as talking about technical matters, I wanted to talk a little about what the act of doing this work feels like.

http://berglondon.com/blog/2009/10/23/toiling-in-the-data-mines-what-data-exploration-feels-like/

Kirkyan

spime, programmable matter, IoT, blogject, blobject, data, network, neologism

Kirkyan is a currently-theoretical “Thing” (PDF) related to both blogject (early example here by originator of the kirkyan concept) and spime. At the core, the concept revolves around the idea that the same data used to create a physical spime can be used to also create a virtual spime, and that the two can then be connected via the same ubiquitous computing network. Where spimes have a number of predefined limitations (e.g. “Cradle-to-cradle” life-spans), a kirkyan is inherently redundant and thus has additional capabilities. Furthermore, while the physical and the virtual are related and in constant networked contact, they are, to a significant degree, autonomous.

http://www.rebang.com/csven/Kirkyan.htm