Teaching computers to recognize ALL the Englishes

allthingslinguistic:

A couple of recent articles about bias in natural language processing. A study from UMass Amherst on improving parsing of African American English using data from Twitter:

For the past 30 years, computer science researchers have been teaching their machines to read, for example, assigning back issues of the Wall Street Journal, so computers can learn the English they need to run search engines like Google or mine platforms like Facebook and Twitter for opinions and marketing data.

But using only standard English has left out whole segments of society who use dialects and non-standard varieties of English. […]

To expand NLP and teach computers to recognize words, phrases and language patterns associated with African-American English, the researchers analyzed dialect features in tweets posted by African Americans. They identified these users by combining Twitter’s geo-location features with U.S. census data, linking tweets to African-American neighborhoods through a statistical model that assumes a soft correlation between demographics and language.

They validated the model by checking it against findings from previous linguistics research, showing that it can successfully identify patterns of African-American English. Lisa Green, a linguist who is an expert in the syntax and language of African-American English, has studied a community in southwest Louisiana for decades. She says there are clear patterns in sound and in syntax (how sentences are put together) that characterize this dialect, a variety spoken by some, not all, African Americans. It has interesting differences compared to standard American English; for example, “they be in the store” can mean “they are often in the store.”

The researchers also identified “new phenomena that are not well known in the literature, such as abbreviations and acronyms used on Twitter, particularly those used by African-American speakers,” notes Green.
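The “soft correlation” idea can be illustrated with a toy version of the approach described above: instead of hard-labeling any individual user, each geo-located tweet is weighted by the census demographics of its neighborhood, and word statistics are accumulated under those weights. This is only a minimal sketch under that assumption, not the researchers’ actual model; the tweets input, the census_lookup table, and the field names are all hypothetical.

```python
from collections import Counter

def soft_demographic_counts(tweets, census_lookup):
    """Toy illustration of a 'soft correlation' between demographics and language.

    Each tweet's word counts are weighted by the census proportion of
    African-American residents in the tweet's geo-located neighborhood.
    `tweets` is assumed to be an iterable of dicts with 'text' and 'block_group'
    fields, and `census_lookup` maps a block group ID to that proportion
    (both are hypothetical inputs, not the study's real data format).
    """
    weighted = Counter()   # counts weighted toward African-American neighborhoods
    baseline = Counter()   # unweighted counts over all tweets
    for tweet in tweets:
        proportion = census_lookup.get(tweet["block_group"], 0.0)
        for word in tweet["text"].lower().split():
            weighted[word] += proportion
            baseline[word] += 1
    return weighted, baseline

def most_associated(weighted, baseline, min_count=50):
    """Rank words by the share of their usage coming from high-proportion neighborhoods."""
    scores = {
        w: weighted[w] / baseline[w]
        for w in baseline
        if baseline[w] >= min_count
    }
    return sorted(scores, key=scores.get, reverse=True)
```

A full model would treat these demographic proportions probabilistically rather than as simple count weights, but the intuition is the same: neighborhood demographics nudge the word statistics without ever assigning a hard label to any speaker.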

Several blog posts from Rachael Tatman comparing YouTube’s auto-generated captions in “accent challenge” videos across dialects and gender:

I picked videos with accents from Maine (U.S.), Georgia (U.S.), California (U.S.), Scotland and New Zealand. I picked these locations because they’re pretty far from each other and also have pretty distinct regional accents. […] There’s variation, sure, but in general the recognizer seems to be working best on people from California (which just happens to be where Google is headquartered) and worst on Scottish English. The big surprise for me is how well the recognizer works on New Zealand English, especially compared to Scottish English.

When I compared performance on male and female talkers, I found something deeply disturbing: YouTube’s auto captions consistently performed better on male voices than female voices. […] First, let me make one thing clear: the problem is not with how women talk. The suggestion that, for example, “women could be taught to speak louder, and direct their voices towards the microphone” is ridiculous. In fact, women use speech strategies that should make it easier for voice recognition technology to work on women’s voices. Women tend to be more intelligible (for people without high-frequency hearing loss), and to talk slightly more slowly. In general, women also favor more standard forms and make less use of stigmatized variants. Women’s vowels, in particular, lend themselves to classification: women produce longer vowels which are more distinct from each other than men’s are.
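Comparisons like these come down to scoring the auto-generated captions against a hand transcription of the same video. A standard way to do that is word error rate, sketched below; this is a generic implementation for illustration, not necessarily the exact metric or tooling behind the blog posts, and the transcript pairs it expects are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    `reference` is a hand transcription and `hypothesis` the auto-generated
    caption, both plain strings. Computed with standard word-level edit distance.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between the first i reference words and first j caption words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def mean_wer(pairs):
    """Average WER over a list of (reference, hypothesis) pairs for one group of talkers."""
    rates = [word_error_rate(ref, hyp) for ref, hyp in pairs]
    return sum(rates) / len(rates)
```

Averaging the per-video rates within each accent or gender group gives the kind of group-level comparison described above (lower is better).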

See the blog posts for stats, graphs, and methodology.