Posts tagged data
Data Visualisations Using the Grammar of Graphics
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
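The same grammar is available in Python through the plotnine library; here is a minimal sketch (using plotnine's bundled mtcars sample data, purely for illustration) of how data, aesthetic mappings, and a geometric primitive compose into a plot:

    from plotnine import ggplot, aes, geom_point, labs
    from plotnine.data import mtcars  # sample data bundled with plotnine

    # data + aesthetic mappings + a geometric primitive = a plot
    plot = (
        ggplot(mtcars, aes(x="wt", y="mpg", color="factor(cyl)"))
        + geom_point()
        + labs(x="Weight (1000 lbs)", y="Miles per gallon", color="Cylinders")
    )
    plot.save("mtcars.png")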
Brussels - A lovely melting-pot
A data visualization essay exploring Brussels and its people.
Datashader
Datashader is a graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly. Datashader breaks the creation of images into a series of explicit steps that allow computations to be done on intermediate representations. This approach allows accurate and effective visualizations to be produced automatically without trial-and-error parameter tuning, and also makes it simple for data scientists to focus on particular data and relationships of interest in a principled way.
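A minimal sketch of those explicit steps using datashader's Python API, assuming a pandas DataFrame of x/y points (the synthetic data and plot sizes here are only illustrative):

    import numpy as np
    import pandas as pd
    import datashader as ds
    import datashader.transfer_functions as tf

    # Illustrative data: one million random points.
    n = 1_000_000
    df = pd.DataFrame({"x": np.random.standard_normal(n),
                       "y": np.random.standard_normal(n)})

    canvas = ds.Canvas(plot_width=600, plot_height=600)  # step 1: define the raster grid
    agg = canvas.points(df, "x", "y")                    # step 2: aggregate points per pixel
    img = tf.shade(agg, how="log")                       # step 3: map the aggregate to colors
    tf.set_background(img, "black")                      # the resulting image can be saved or displayed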
Terrapattern
Terrapattern provides an open-ended interface for visual query-by-example. Simply click an interesting spot on Terrapattern’s map, and it will find other locations that look similar. Our tool is ideal for locating specialized ‘nonbuilding structures’ and other forms of soft infrastructure that aren’t usually indicated on maps. It’s an open-source tool for discovering “patterns of interest” in unlabeled satellite imagery—a prototype for exploring the unmapped, and the unmappable.
IndexMundi - Country Facts
IndexMundi contains detailed country statistics, charts, and maps compiled from multiple sources. You can explore and analyze thousands of indicators organized by region, country, topic, industry sector, and type.
Your Data is Being Manipulated
At this moment, AI is at the center of every business conversation. Companies, governments, and researchers are obsessed with data. Not surprisingly, so are adversarial actors. We are currently seeing an evolution in how data is being manipulated. If we believe that data can and should be used to inform people and fuel technology, we need to start building the infrastructure necessary to limit the corruption and abuse of that data — and grapple with how biased and problematic data might work its way into technology and, through that, into the foundations of our society.
via https://points.datasociety.net/your-data-is-being-manipulated-a7e31a83577b
You Say Data, I Say System
An over-simplified and dangerously reductive diagram of a data system might look like this:
Collection → Computation → Representation
Whenever you look at data — as a spreadsheet or database view or a visualization — you are looking at an artifact of such a system. What this diagram doesn’t capture is the immense branching of choice that happens at each step along the way. As you make each decision — to omit a row of data, or to implement a particular database structure, or to use a specific colour palette — you are treading down a path through this wild, tall grass of possibility. It will be tempting to look back and see your trail as the only one that you could have taken, but in reality a slightly divergent you who’d made slightly divergent choices might have ended up somewhere altogether different. To think in data systems is to consider all three of these stages at once, but for now let’s look at them one at a time.
via https://medium.com/@blprnt/you-say-data-i-say-system-54e84aa7a421
Can speculative evidence inform decision making?
Over at Superflux, our work investigating potential and plausible futures involves extensively scanning for trends and signals from which we trace and extrapolate into the future. Both qualitative and quantitative data play an important role. In doing such work, we have observed how data is often used as evidence, and seen as definitive. Historical and contemporary datasets are often used as evidence for a mandate for future change, especially in some of the work we have undertaken with governments and policy makers. But lately we have been wondering whether this drive for data as evidence has led to the unshakeable belief that data is evidence.
via https://medium.com/@anabjain/can-speculative-evidence-inform-decision-making-6f7d398d201f
The most Geo-tagged Place on Earth
This brings me back to what is likely the most geo-tagged place on earth. It is a place that can be found marked with unambiguous precision on many social media sites or self-crafted mapping projects. The place seems to be relevant in almost any context, and has been tagged and described in an unaccountable number of ways. The place seems to combine many places at once, all sharing the same location — similar to Jorge Luis Borges’ Aleph: “The Aleph’s diameter was probably little more than an inch, but all space was there, actual and undiminished. Each thing (a mirror’s face, let us say) was infinite things, since I distinctly saw it from every angle of the universe.”
via https://blog.offenhuber.net/the-most-geo-tagged-place-on-earth-fc76758cc505
The superpower of interactive datavis? A micro-macro view!
What I mean by micro-macro is trying to get a better understanding of the world by accessing it on two levels: for one, there’s the micro-level of anecdotes, where we get the good feeling of looking at actual, concrete aspects of the world instead of abstract mathematical descriptions. But we combine this with the macro-level to understand how these relatable anecdotes fit into the whole. This dual approach enables us to estimate whether a given example represents normalcy (a stand-in for how things “usually” are) or is an outlier that does not support conclusions about all cases.
via https://medium.com/@dominikus/the-superpower-of-interactive-datavis-a-micro-macro-view-4d027e3bdc71
Data, Fairness, Algorithms, Consequences
When we open up data, are we empowering people to come together? Or to come apart? Who defines the values that we should be working towards? Who checks to make sure that our data projects are moving us towards those values? If we aren’t clear about what we want and the trade-offs that are involved, simply opening up data can — and often does — reify existing inequities and structural problems in society. Is that really what we’re aiming to do?
via https://points.datasociety.net/toward-accountability-6096e38878f0
Digital Privacy at the U.S. Border: Protecting the Data On Your Devices and In the Cloud
The U.S. government reported a five-fold increase in the number of electronic media searches at the border in a single year, from 4,764 in 2015 to 23,877 in 2016. Every one of those searches was a potential privacy violation. Our lives are minutely documented on the phones and laptops we carry, and in the cloud. Our devices carry records of private conversations, family photos, medical documents, banking information, information about what websites we visit, and much more. Moreover, people in many professions, such as lawyers and journalists, have a heightened need to keep their electronic information confidential. How can travelers keep their digital data safe? This guide (updating a previous guide from 2011) helps travelers understand their individual risks when crossing the U.S. border, provides an overview of the law around border search, and offers a brief technical overview of securing digital data.
Scientists make huge dataset of nearby stars available to public
Today, a team that includes MIT and is led by the Carnegie Institution for Science has released the largest collection of observations made with a technique called radial velocity, to be used for hunting exoplanets. The huge dataset, taken over two decades by the W.M. Keck Observatory in Hawaii, is now available to the public, along with an open-source software package to process the data and an online tutorial. By making the data public and user-friendly, the scientists hope to draw fresh eyes to the observations, which encompass almost 61,000 measurements of more than 1,600 nearby stars.
via http://news.mit.edu/2017/dataset-nearby-stars-available-public-exoplanets-0213
Breaking things is easy
Until a few years ago, machine learning algorithms simply did not work very well on many meaningful tasks like recognizing objects or translation. Thus, when a machine learning algorithm failed to do the right thing, this was the rule, rather than the exception. Today, machine learning algorithms have advanced to the next stage of development: when presented with naturally occurring inputs, they can outperform humans. Machine learning has not yet reached true human-level performance, because when confronted by even a trivial adversary, most machine learning algorithms fail dramatically. In other words, we have reached the point where machine learning works, but may easily be broken.
via http://www.cleverhans.io/security/privacy/ml/2016/12/16/breaking-things-is-easy.html
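The canonical illustration of such a trivial adversary is the fast gradient sign method from that same literature; below is a minimal PyTorch sketch, where model, x, y, and the eps budget are placeholders rather than anything from the post:

    import torch
    import torch.nn.functional as F

    def fgsm(model, x, y, eps=0.03):
        """One-step adversarial perturbation: nudge each input feature in the
        direction that increases the classification loss."""
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        x_adv = x + eps * x.grad.sign()
        return x_adv.clamp(0, 1).detach()  # keep pixel values in a valid range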
Emacs for Data Science
A modern data scientist often has to work on multiple platforms with multiple languages. Some projects may be in R, others in Python. Or perhaps you have to work on a cluster with no GUI. Or maybe you need to write papers with LaTeX. You can do all that with Emacs and customize it to do whatever you like. I won’t lie though. The learning curve can be steep, but I think the investment is worth it.
via https://blog.insightdatascience.com/emacs-for-data-science-af814b78eb41#.kkdmh5g6x
Transparency ≠ Accountability
In the next ten years we will see data-driven technologies reconfigure systems in many different sectors, from autonomous vehicles to personalized learning, predictive policing to precision medicine. While the changes that we will see will create new opportunities, they will also create new challenges — and new worries — and it behooves us to start grappling with these issues now so that we can build healthy sociotechnical systems.
via https://points.datasociety.net/transparency-accountability-3c04e4804504
Project Jupyter
The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.
Media: End Reporting on Polls
We now know that the polls were wrong. Over the last few months, I’ve told numerous reporters and people in the media industry this, but I was generally ignored and dismissed. I wasn’t alone — two computer scientists whom I deeply respect — Jenn Wortman Vaughan and Hanna Wallach — were trying to get an op-ed on prediction and uncertainty into major newspapers, but were repeatedly told that the data was solid. It was not. And it will be increasingly problematic.
via https://points.datasociety.net/media-end-reporting-on-polls-c9b5df705b7f
Rules for trusting “black boxes” in algorithmic control systems
Tim O'Reilly writes about the reality that more and more of our lives – including whether you end up seeing this very sentence! – are in the hands of “black boxes” – algorithmic decision-makers whose inner workings are a secret from the people they affect.
O'Reilly proposes four tests to determine whether a black box is trustable:
1. Its creators have made clear what outcome they are seeking, and it is possible for external observers to verify that outcome.
2. Success is measurable.
3. The goals of the algorithm’s creators are aligned with the goals of the algorithm’s consumers.
4. Does the algorithm lead its creators and its users to make better longer-term decisions?
O'Reilly goes on to test these assumptions against some of the existing black boxes that we trust every day, like aviation autopilot systems, and shows that this is a very good framework for evaluating algorithmic systems.
But I have three important quibbles with O'Reilly’s framing. The first is absolutely foundational: the reason that these algorithms are black boxes is that the people who devise them argue that releasing details of their models will weaken the models’ security. This is nonsense.
For example, Facebook tweaked its algorithm to downrank “clickbait” stories. Adam Mosseri, Facebook’s VP of product management, told TechCrunch, “Facebook won’t be publicly publishing the multi-page document of guidelines for defining clickbait because ‘a big part of this is actually spam, and if you expose exactly what we’re doing and how we’re doing it, they reverse engineer it and figure out how to get around it.’”
There’s a name for this in security circles: “Security through obscurity.” It is as thoroughly discredited an idea as is possible. As far back as the 19th century, security experts have decried the idea that robust systems can rely on secrecy as their first line of defense against compromise.
The reason the algorithms O'Reilly discusses are black boxes is because the people who deploy them believe in security-through-obscurity. Allowing our lives to be manipulated in secrecy because of an unfounded, superstitious belief is as crazy as putting astrologers in charge of monetary policy, no-fly lists, hiring decisions, and parole and sentencing recommendations.
So there’s that: the best way to figure out whether we can trust a black box is to smash it open, demand that it be exposed to the disinfecting power of sunshine, and give no quarter to the ideologically bankrupt security-through-obscurity court astrologers of Facebook, Google, and the TSA.
Then there’s the second issue, which is important whether or not we can see inside the black box: what data was used to train the model? Or, in traditional scientific/statistical terms, what was the sampling methodology?
Garbage in, garbage out is a principle as old as computer science, and sampling bias is a problem that’s as old as the study of statistics. Algorithms are often deployed to replace biased systems with empirical ones: for example, predictive policing algorithms tell the cops where to look for crime, supposedly replacing racially biased stop-and-frisk with data-driven systems of automated suspicion.
But predictive policing training data comes from earlier, human-judgment-driven stop-and-frisk projects. If the cops only make black kids turn out their pockets, then all the drugs, guns and contraband they find will be in the pockets of black kids. Feed this data to a machine learning model and ask it where the future guns, drugs and contraband will be found, and it will dutifully send the police out to harass more black kids. The algorithm isn’t racist, but its training data is.
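A toy simulation makes the point (the numbers are invented; only the sampling logic matters): two neighbourhoods with identical offence rates, but historical patrols concentrated in one of them.

    import numpy as np

    rng = np.random.default_rng(1)

    true_rate = np.array([0.05, 0.05])   # both neighbourhoods offend at the same rate
    patrol_share = np.array([0.9, 0.1])  # but historically, patrols went mostly to the first

    # "Training data": offences are only recorded where officers actually looked.
    stops = rng.multinomial(10_000, patrol_share)
    recorded = rng.binomial(stops, true_rate)

    # A naive model allocates future patrols in proportion to recorded offences,
    # faithfully reproducing the bias in the data rather than anything about reality.
    print(recorded / recorded.sum())  # roughly [0.9, 0.1]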
There’s a final issue, which is that algorithms have to have their models tweaked based on measurements of success. It’s not enough to merely measure success: the errors in the algorithm’s predictions also have to be fed back to it, to correct the model. That’s the difference between Amazon’s sales-optimization and automated hiring systems. Amazon’s systems predict ways of improving sales, which the company tries: the failures are used to change the model to improve it. But automated hiring systems blackball some applicants and advance others, and the companies that make these systems don’t track whether the excluded people go on to be great employees somewhere else, or whether the recommended hires end up stealing from the company or alienating its customers.
I like O'Reilly’s framework for evaluating black boxes, but I think we need to go farther.
http://boingboing.net/2016/09/15/rules-for-trusting-black-box.html
A Technical Primer On Causality
What does “causality” mean, and how can you represent it mathematically? How can you encode causal assumptions, and what bearing do they have on data analysis? These types of questions are at the core of the practice of data science, but deep knowledge about them is surprisingly uncommon.
via https://medium.com/@akelleh/a-technical-primer-on-causality-181db2575e41
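One of the simplest ways to encode a causal assumption is as a simulated data-generating process with a confounder; the sketch below (invented numbers, plain numpy) shows a naive regression finding an “effect” of X on Y that vanishes once the common cause Z is adjusted for.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Causal assumption: Z causes both X and Y; X has no effect on Y at all.
    z = rng.standard_normal(n)
    x = 2.0 * z + rng.standard_normal(n)
    y = 3.0 * z + rng.standard_normal(n)

    naive_slope = np.polyfit(x, y, 1)[0]  # regress Y on X alone: biased by the confounder
    design = np.column_stack([x, z, np.ones(n)])
    adjusted_slope = np.linalg.lstsq(design, y, rcond=None)[0][0]  # adjust for Z

    print(f"naive ~ {naive_slope:.2f}, adjusted ~ {adjusted_slope:.2f}")  # about 1.2 vs about 0.0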
Causal Data Science
I started a series of posts aimed at helping people learn about causality in data science (and science in general), and wanted to compile them all together here in a living index.
via https://medium.com/@akelleh/causal-data-science-721ed63a4027
The Google codebase [as of January 2015] includes approximately one billion files and has a history of approximately 35 million…
The Google codebase [as of January 2015] includes approximately one billion files and has a history of approximately 35 million commits spanning Google’s entire 18-year existence. The repository contains 86TB [Total size of uncompressed content, excluding release branches] of data, including approximately two billion lines of code in nine million unique source files. The total number of files also includes source files copied into release branches, files that are deleted at the latest revision, configuration files, documentation, and supporting data files.
Google’s codebase is shared by more than 25,000 Google software developers from dozens of offices in countries around the world. On a typical workday, they commit 16,000 changes to the codebase, and another 24,000 changes are committed by automated systems. Each day the repository serves billions of file read requests, with approximately 800,000 queries per second during peak traffic and an average of approximately 500,000 queries per second each workday. Most of this traffic originates from Google’s distributed build-and-test systems.
The Geographical Oddity of Null Island
Null Island is an imaginary island located at 0°N 0°E (hence “Null”) in the South Atlantic Ocean. This point is where the Equator meets the Prime Meridian. The concept of the island originated in 2011 when it was drawn into Natural Earth, a public domain map dataset developed by volunteer cartographers and GIS analysts. In creating a one-square meter plot of land at 0°N 0°E in the digital dataset, Null Island was intended to help analysts flag errors in a process known as “geocoding.”
via https://blogs.loc.gov/maps/2016/04/the-geographical-oddity-of-null-island/
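In practice that flagging can be as simple as checking for coordinates that a failed geocoder has defaulted to (0, 0); a small, entirely illustrative pandas sketch:

    import pandas as pd

    # Invented records: a geocoder that fails often falls back to latitude/longitude 0, 0.
    df = pd.DataFrame({
        "address": ["221B Baker St, London", "unparseable address", "1600 Amphitheatre Pkwy"],
        "lat": [51.5238, 0.0, 37.4220],
        "lon": [-0.1586, 0.0, -122.0841],
    })

    on_null_island = (df["lat"] == 0.0) & (df["lon"] == 0.0)
    print(df[on_null_island])  # rows that almost certainly failed to geocode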
All Roads Lead to Rome.
“SwiftKey analyzed more than one billion pieces of emoji data across a wide range of categories to learn how speakers of 16…
“SwiftKey analyzed more than one billion pieces of emoji data across a wide range of categories to learn how speakers of 16 different languages and regions use emoji. The findings in this report came from an analysis of aggregate SwiftKey Cloud data over a four month period between October 2014 and January 2015, and includes both Android and iOS devices”
The Molenbeek Data Shadow as of 2014-10-28 (or the geolocation of Mozilla sympathizers around Brussels)
An expanded set of fundamental surface temperature records
Version 1.0.0 of the Global Land Surface Databank has been released and data are provided from a primary ftp site hosted by the Global Observing Systems Information Center (GOSIC) and World Data Center A at NOAA NCDC. The Stage Three dataset has multiple formats, including a format approved by ISTI, a format similar to GHCN-M, and netCDF files adhering to the Climate and Forecast (CF) convention. The data holding is version controlled and will be updated frequently in response to newly discovered data sources and user comments. All processing code is provided, for openness and transparency. Users are encouraged to experiment with the techniques used in these algorithms. The programs are designed to be modular, so that individuals have the option to develop and implement other methods that may be more robust than described here. We will remain open to releases of new versions should such techniques be constructed and verified.
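For the CF-convention netCDF files, a few lines of xarray are usually enough to start experimenting; the filename and variable name below are placeholders, not the databank's actual layout.

    import xarray as xr

    # Hypothetical filename and variable name; substitute a real Stage Three netCDF file.
    ds = xr.open_dataset("stage3_land_surface_temperatures.nc")
    print(ds)  # dimensions, coordinates and CF metadata attributes

    monthly_climatology = ds["temperature"].groupby("time.month").mean()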
Cryptoforestry: Food pairing / gastronomy with a telescope
The number of foodpairs a recipe generates increases quadratically with the number of ingredients. A typical cookbook (and the ones we use here are all modest ones) yields anywhere between 700 and 2500 pairs; the number of connections when comparing three books is large, and we have not yet found a really meaningful way to visualize a foodpair comparison. Instead we have turned to using the Jaccard index, a simple formula for comparing the similarity of datasets. If two books are absolutely similar (a book compared with itself) the index is 1; if the books are completely dissimilar the index is 0. So the higher the number, the greater the similarity.
http://cryptoforest.blogspot.nl/2014/07/food-pairing-gastronomy-with-telescope.html
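For reference, the Jaccard index of two ingredient-pair sets is just the size of their intersection over the size of their union; a minimal sketch with made-up pairs:

    def jaccard(a, b):
        """|A intersection B| / |A union B|: 1.0 for identical sets, 0.0 for disjoint ones."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if (a or b) else 1.0

    book_one = {("tomato", "basil"), ("garlic", "olive oil"), ("lemon", "fish")}
    book_two = {("tomato", "basil"), ("soy sauce", "ginger")}
    print(jaccard(book_one, book_two))  # 0.25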
How IBM’s Watson Will Make Your Meals Tastier
This week, Bon Appetit and IBM are releasing the beta version of a new app called Chef Watson with Bon Appetit that will help home chefs think up new and inspiring ways to use ingredients. Think of Watson as an algorithmically inclined sous chef that gently suggests hundreds of flavor combinations that you’d probably never come up with on your own. To do this, Watson crawled a database of 9,000 Bon Appetit recipes looking for insights and patterns about how ingredients pair together, what style of food it is and how each food is prepared in an actual dish. When the computer combines this information with its already robust understanding of food chemistry and hedonic psychophysics (the psychology of what people find pleasant and unpleasant), you get a very smart kitchen assistant.
http://www.wired.com/2014/06/how-ibms-watson-will-make-your-meals-tastier/
The Ongoing Collapse
The Ongoing Collapse is a growing collection of data sources and links positioned as a reflection of the state of the world in the terms that it likes to use. It is built and maintained by Tobias Revell.
Cryptoforestry: Food pairing as gastronomy with a telescope
The theory of food pairing inspires little faith, but when moving away from culinary applications it can perhaps be used to differentiate cuisines and cooking styles. How Chinese is Jamie Oliver? How similar are Mexican and Indian cuisines? How do French and Indian cooking differ? How unique is Rene Redzepi? The aim is to find a way to reveal the inner structure and logic of a cuisine, if such a thing exists, by comparing the way a cuisine or a cook combines ingredients with other cuisines and cooks.
http://cryptoforest.blogspot.nl/2014/05/food-pairing-gastronomy-with-telescope.html
Did Malaysian Airlines 370 disappear using SIA68 (another 777)?
Once MH370 had cleared the volatile airspaces and was safe from being detected by military radar sites in India, Pakistan, and Afghanistan, it would have been free to break off from the shadow of SIA68 and could have then flown a path to its final landing site. There are several locations along the flight path of SIA68 where it could have easily broken contact and flown and landed in Xinjiang province, Kyrgyzstan, or Turkmenistan. Each of these final locations would match up almost perfectly with the 7.5 hours of total flight time and trailing SIA68. In addition, these locations are all possibilities that are on the “ARC” and fit with the data provided by Inmarsat from the SATCOM’s last known ping at 00:11 UTC.
http://keithledgerwood.tumblr.com/post/79838944823/did-malaysian-airlines-370-disappear-using-sia68
Gastrograph
The Gastrograph system uses 18 Broad Spectrum Flavor Categories and 6 sensations (trigeminal & somatosensory), which together fully encompass gustatory flavor space, such that anything you can taste, you can graph. The GG System allows you to compare amalgamated flavor profiles of consumer and Quality Control based reviews on all of your products, in order to determine Correlates of Quality and consumer preference, and to analyze changes in products over time (ageing), all independent of reviewer variables such as socio-economic background, ethnic background, age, sex, and flavor sensitivity.
The Web as a Preservation Medium
The archival record … is best understood as a sliver of a sliver of a sliver of a window into process. It is a fragile thing, an enchanted thing, defined not by its connections to “reality,” but by its open-ended layerings of construction and reconstruction. Far from constituting the solid structure around which imagination can play, it is itself the stuff of imagination.
http://inkdroid.org/journal/2013/11/26/the-web-as-a-preservation-medium/
Human Resolution
Relating a Google search return to an equivalent expenditure of fossil fuels, or the fluctuation of pixels across a screen with the exploited labour of rural migrant workers in Shenzhen, or topsoil loss in Inner Mongolia, is as remote and unattainable for the majority of users as is an understanding of the technical functionality of the devices themselves.
Paleoclimate: The End of the Holocene
Recently a group of researchers from Harvard and Oregon State University has published the first global temperature reconstruction for the last 11,000 years – that’s the whole Holocene (Marcott et al. 2013). The results are striking and worthy of further discussion, after the authors have already commented on their results in this blog.
http://www.realclimate.org/index.php/archives/2013/09/paleoclimate-the-end-of-the-holocene/
Toiling in the data-mines: what data exploration feels like
There are several aspects to this post. Partly, it’s about what material explorations look like when performed with data. Partly, it’s about the role of code as a tool to explore data. We don’t write about code much on the site, because we’re mainly interested in the products we produce and the invention involved in them, but it’s sometimes important to talk about processes and tools, and this, I feel, is one of those times. At the same time, as well as talking about technical matters, I wanted to talk a little about what the act of doing this work feels like.
http://berglondon.com/blog/2009/10/23/toiling-in-the-data-mines-what-data-exploration-feels-like/
Storm, distributed and fault-tolerant realtime computation
Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use.
Millions exposed by Facebook data glitch
An investigation into the bug showed that contact details for about six million people were inadvertently shared in this way. Despite this, Facebook said the “practical impact” had been small because information was most likely to have been shared with people who already knew the affected individuals.
The Office For Creative Research
The Office for Creative Research is a multidisciplinary research group exploring new modes of engagement with data, through unique practices that borrow from both the arts and sciences. OCR clients are research partners, helping to pose, refine and ultimately solve difficult problems with data.
Aircraft Telemetry Data
Telemetry data transmitted by most aircraft make it possible to calculate their trajectories. The data is sent in the Automatic Dependent Surveillance (ADS) format.
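Once the position reports are decoded, turning them into a trajectory is mostly great-circle arithmetic; a small sketch with invented positions:

    from math import asin, cos, radians, sin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance in kilometres between two positions."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371.0 * asin(sqrt(a))

    # Invented sequence of decoded (lat, lon) position reports.
    track = [(50.90, 4.48), (50.95, 4.60), (51.00, 4.72)]
    legs = [haversine_km(*p, *q) for p, q in zip(track, track[1:])]
    print(f"track length: {sum(legs):.1f} km")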
The cloud, or Brahman as the hindus call it, is the All, surrounding everything. It is everywhere; immaterial, yet very real.
All attempts to attack The Pirate Bay from now on are an attack on everything and nothing. The site that you’re at will still be here, for as long as we want it to. Only in a higher form of being. A reality to us. A ghost to those who wish to harm us.
Kirkyan
Kirkyan is a currently theoretical “Thing” related to both the blogject (an early example by the originator of the kirkyan concept) and the spime. At the core, the concept revolves around the idea that the same data used to create a physical spime can be used to also create a virtual spime, and that the two can then be connected via the same ubiquitous computing network. Where spimes have a number of predefined limitations (e.g. “Cradle-to-cradle” life-spans), a kirkyan is inherently redundant and thus has additional capabilities. Furthermore, while the physical and the virtual are related and in constant networked contact, they are, to a significant degree, autonomous.