NeuralHash is the perceptual hashing model that back’s Apple’s new CSAM (child sexual abuse material) reporting mechanism. It’s an algorithm that takes an image as input and returns a 96-bit unique identifier (a hash) that should match for two images that are “the same” (besides some minor perturbations like JPEG artifacts, resizing, or cropping).
Yesterday, news broke that researchers had extracted the neural network Apple uses to hash images from the latest operating system and made it available for testing the system. Quickly, artificially colliding adversarial images were created that had matching NeuralHashes. Apple clarified that they have an independent server-side network that verifies all matches with an independent network before flagging images for human review (and then escalation to law enforcement).
Posts tagged machine learning
“Focusing only on defending against small perturbations is insufficient, as large, local perturbations can also break classifiers.”
AlphaGo is made up of a number of relatively standard techniques: behavior cloning (supervised learning on human demonstration data), reinforcement learning (REINFORCE), value functions, and Monte Carlo Tree Search (MCTS). However, the way these components are combined is novel and not exactly standard. In particular, AlphaGo uses a SL (supervised learning) policy to initialize the learning of an RL (reinforcement learning) policy that gets perfected with self-play, which they then estimate a value function from, which then plugs into MCTS that (somewhat surprisingly) uses the (worse!, but more diverse) SL policy to sample rollouts. In addition, the policy/value nets are deep neural networks, so getting everything to work properly presents its own unique challenges (e.g. value function is trained in a tricky way to prevent overfitting). On all of these aspects, DeepMind has executed very well. That being said, AlphaGo does not by itself use any fundamental algorithmic breakthroughs in how we approach RL problems.
Making music with computer tools is delightful. Musical ideas can be explored quickly and composing songs is easy. Yet for many, these tools are overwhelming: An ocean of settings can be tweaked and it is often unclear, which changes lead to a great song. This experiment investigates how to use evolutionary algorithm and novelty search to help musicians find musical inspiration in Ableton Live.
The practice of using people’s outer appearance to infer inner character is called physiognomy. While today it is understood to be pseudoscience, the folk belief that there are inferior “types” of people, identifiable by their facial features and body measurements, has at various times been codified into country-wide law, providing a basis to acquire land, block immigration, justify slavery, and permit genocide. When put into practice, the pseudoscience of physiognomy becomes the pseudoscience of scientific racism.
Rapid developments in artificial intelligence and machine learning have enabled scientific racism to enter a new era, in which machine-learned models embed biases present in the human behavior used for model development. Whether intentional or not, this “laundering” of human prejudice through computer algorithms can make those biases appear to be justified objectively.
Over the course of the next few months, we will be launching a prototype of the research already completed in statistical fact checking and claim detection. So far, our work has been in identifying claims in text by the named entities they contain, what economic statistics those claims are about, and verifying if they are “fact-checkable”. At the moment, we can only check claims that can be validated by known statistical databases — we built our system on Freebase (an fact database that came out of Wikipedia’s knowledge graph), and will be migrating it to new databases such as EUROSTAT and the World Bank Databank.
Text summarization problem has many useful applications. If you run a website, you can create titles and short summaries for user generated content. If you want to read a lot of articles and don’t have time to do that, your virtual assistant can summarize main points from these articles for you. It is not an easy problem to solve. There are multiple approaches, including various supervised and unsupervised algorithms. Some algorithms rank the importance of sentences within the text and then construct a summary out of important sentences, others are end-to-end generative models. End-to-end machine learning algorithms are interesting to try. After all, end-to-end algorithms demonstrate good results in other areas, like image recognition, speech recognition, language translation, and even question-answering.
Tim O'Reilly writes about the reality that more and more of our lives – including whether you end up seeing this very sentence! – is in the hands of “black boxes” – algorithmic decision-makers whose inner workings are a secret from the people they effect.
O'Reilly proposes four tests to determine whether a black box is trustable:
1. Its creators have made clear what outcome they are seeking, and it is possible for external observers to verify that outcome.
2. Success is measurable.
3. The goals of the algorithm’s creators are aligned with the goals of the algorithm’s consumers.
4. Does the algorithm lead its creators and its users to make better longer term decisions?
O'Reilly goes on to test these assumptions against some of the existing black boxes that we trust every day, like aviation autopilot systems, and shows that this is a very good framework for evaluating algorithmic systems.
But I have three important quibbles with O'Reilly’s framing. The first is absolutely foundational: the reason that these algorithms are black boxes is that the people who devise them argue that releasing details of their models will weaken the models’ security. This is nonsense.
For example, Facebook’s tweaked its algorithm to downrank “clickbait” stories. Adam Mosseri, Facebook’s VP of product management told Techcrunch, “Facebook won’t be publicly publishing the multi-page document of guidelines for defining clickbait because ‘a big part of this is actually spam, and if you expose exactly what we’re doing and how we’re doing it, they reverse engineer it and figure out how to get around it.’”
There’s a name for this in security circles: “Security through obscurity.” It is as thoroughly discredited an idea as is possible. As far back as the 19th century, security experts have decried the idea that robust systems can rely on secrecy as their first line of defense against compromise.
The reason the algorithms O'Reilly discusses are black boxes is because the people who deploy them believe in security-through-obscurity. Allowing our lives to be manipulated in secrecy because of an unfounded, superstitious belief is as crazy as putting astrologers in charge of monetary policy, no-fly lists, hiring decisions, and parole and sentencing recommendations.
So there’s that: the best way to figure out whether we can trust a black box is the smash it open, demand that it be exposed to the disinfecting power of sunshine, and give no quarter to the ideologically bankrupt security-through-obscurity court astrologers of Facebook, Google, and the TSA.
Then there’s the second issue, which is important whether or not we can see inside the black box: what data was used to train the model? Or, in traditional scientific/statistical terms, what was the sampling methodology?
Garbage in, garbage out is a principle as old as computer science, and sampling bias is a problem that’s as old as the study of statistics. Algorithms are often deployed to replace biased systems with empirical ones: for example, predictive policing algorithms tell the cops where to look for crime, supposedly replacing racially biased stop-and-frisk with data-driven systems of automated suspicion.
But predictive policing training data comes from earlier, human-judgment-driven stop-and-frisk projects. If the cops only make black kids turn out their pockets, then all the drugs, guns and contraband they find will be in the pockets of black kids. Feed this data to a machine learning model and ask it where the future guns, drugs and contraband will be found, and it will dutifully send the police out to harass more black kids. The algorithm isn’t racist, but its training data is.
There’s a final issue, which is that algorithms have to have their models tweaked based on measurements of success. It’s not enough to merely measure success: the errors in the algorithm’s predictions also have to be fed back to it, to correct the model. That’s the difference between Amazon’s sales-optimization and automated hiring systems. Amazon’s systems predict ways of improving sales, which the company tries: the failures are used to change the model to improve it. But automated hiring systems blackball some applicants and advance others, and the companies that makes these systems don’t track whether the excluded people go on to be great employees somewhere else, or whether the recommended hires end up stealing from the company or alienating its customers.
I like O'Reilly’s framework for evaluating black boxes, but I think we need to go farther.
Many facts about the SKYNET program remain unknown, however. For instance do analysts review each mobile phone user’s profile before condemning them to death based on metadata? How can the US government be sure it is not killing innocent people, given the apparent flaws in the machine learning algorithm on which that kill list is based?“On whether the use of SKYNET is a war crime, I defer to lawyers,” Ball said. “It’s bad science, that’s for damn sure, because classification is inherently probabilistic. If you’re going to condemn someone to death, usually we have a ‘beyond a reasonable doubt’ standard, which is not at all the case when you’re talking about people with 'probable terrorist’ scores anywhere near the threshold. And that’s assuming that the classifier works in the first place, which I doubt because there simply aren’t enough positive cases of known terrorists for the random forest to get a good model of them.”
For months I had joked to my family that I was probably on a watch list for my excessive use of Tor and cash withdrawals […] the things I had to do to evade marketing detection looked suspiciously like illicit activities. All I was trying to do was to fight for the right for a transaction to be just a transaction, not an excuse for a thousand little trackers to follow me around. But avoiding the big-data dragnet meant that I not only looked like a rude family member or an inconsiderate friend, but I also looked like a bad citizen.,
“We understand that they work, just not how they work”
Of course, we in the real world know that shaved apes like us never saw a system we didn’t want to game. So in the event that sarcasm detectors ever get a false positive rate of less than 99% (or a false negative rate of less than 1%) I predict that everybody will start deploying sarcasm as a standard conversational gambit on the internet. Trolling the secret service will become a competitive sport […] Al Qaida terrrrst training camps will hold tutorials on metonymy, aggressive irony, cynical detachment, and sarcasm as a camouflage tactic for suicide bombers. Post-modernist pranks will draw down the full might of law enforcement by mistake, while actual death threats go encoded as LOLCat macros. Any attempt to algorithmically detect sarcasm will fail because sarcasm is self-referential and the awareness that a sarcasm detector may be in use will change the intent behind the message.
xkcd is self-proclaimed as “a webcomic of romance, sarcasm, math, and language”. There was a recent effort to quantify whether or not these “topics” agree with topics derived from the xkcd text corpus using Latent Dirichlet Allocation (LDA). That analysis makes the all too common folly of choosing an arbitrary number of topics. Maybe xkcd’s tagline does provide a strong prior belief of a small number of topics, but here we take a more objective approach and let the data choose the number of topics. An “optimal” number of topics is found using the Bayesian model selection approach (with uniform prior belief on the number of topics) suggested by Griffiths and Steyvers (2004). After an optimal number is decided, topic interpretations and trends over time are explored.
The Groningen Meaning Bank consists of public domain English texts with corresponding syntactic and semantic representations.