Cognition/Linguistics: September 2011 Archives
One of the dreams for solving the captioning backlog is to rely on speech recognition. I do have to say that speech recognition is far more effective at times than I would have dreamed, but my intuition has still told me it's not entirely working. A fascinating article from Robert Fortner, "The Unrecognized Death of Speech Recognition", essentially backs up that intuition with some hard numbers. He notes that accuracy has not improved much since the early 2000s and that in most cases the error rate is not within human tolerance (humans apparently have about a 2% error rate, and even that can lead to some pretty ridiculous arguments).
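To make "error rate" concrete, here is a minimal Python sketch. It assumes the figures are word error rates, the usual metric for transcripts: the number of word substitutions, insertions and deletions needed to turn the machine's output into the correct transcript, divided by the length of the correct transcript. The function name and the sample sentences are my own illustrations, not something taken from Fortner's article.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only (a well-known mishearing joke).
rate = word_error_rate(
    "recognize speech with this system",
    "wreck a nice beach with this system",
)
print(f"word error rate: {rate:.0%}")
```

Note that by this measure the rate can go above 100% when the output contains many extra words, which is one reason a single percentage does not tell the whole story about how usable a transcript is.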
When Speech Recognition Works
Speech recognition can be effective in two situations:
- Specific context (airport kiosk, limited menu commands) - even here, though, it should be noted that it's pretty darn easy to frustrate the average health insurance voice recognition system to the point that it gives up.
- Specific speaker - Speech recognition is effective when trained on a single voice, and the training time is shorter than it used to be. For captioning purposes, this means that if a single speaker makes the original audio (e.g. a faculty lecture) or someone else (the captioner) repeats what's on the audio, speech recognition is pretty effective.
By the way, in the recent Second Accessibility Summit, Glenda Sims noted that correcting an inaccurate transcript is more difficult than starting from scratch.
What Speech Recognition Is
To understand why speech recognition isn't improving, you should consider the task it's trying to perform. When human ears listen to language, they seem to hear a stream of separate sounds and words, which get grouped into words and sentences. The reality is that speech is a continuous sound wave with very subtle acoustic transitions between different sounds (see the images below; the bottom ones are the spectrograms that phoneticians use). Your ears and brain are doing a lot of processing to help you understand what that person just said.
Your brain not only breaks up sound waves, it also accounts for the acoustics of different genders and different regional accents, filters out different types of background noise, and probably does some "smart guessing" about what a word is as well (which doesn't always work). It's no wonder that replicating the functionality of this mechanism is taking time.
Ignoring the Linguists
There's one factor that Robert Fortner points to - speech specialists are not always involved. As one IBM researcher claimed, "Every time I fire a linguist my system improves"... but apparently there is an upper limit to this without more information. Maybe it's time to start rethinking the problem and whether the programming team might need some outside experts.
In the spirit of continuing to clean my desk, my next book to review is The Wisdom of Crowds by James Surowiecki. I think a lot of people at ETS are familiar with the book, but I think it's worth explaining exactly how the wisdom is generated.
The term "Wisdom of Crowds" seems to suggest a scenario where people make decisions as committees, but that's not what it really is. Rather, the "wisdom" comes from being able to tap into the results of multiple individual decisions rather than relying on a single committee or expert.
A classic example is a contest to guess the weight of an ox. Individually, the guesses varied widely, but the average of the guesses was within one pound of the actual weight. It wasn't the case that the group decided the weight of the ox, but rather that the individual guesses added up to the correct answer.
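The arithmetic behind this can be sketched in a few lines of Python. The guesses and the "true" weight below are made-up numbers for illustration only, not the actual contest data, and the function name is my own. The point is just that averaging independent guesses lets the overestimates and underestimates cancel out.

```python
from statistics import mean


def crowd_estimate(guesses):
    """Combine independent individual guesses by taking their average."""
    return mean(guesses)


# Hypothetical values for illustration (not the real contest figures).
true_weight = 1200
guesses = [900, 1500, 1100, 1350, 1000, 1400, 1150, 1250]

estimate = crowd_estimate(guesses)
individual_errors = [abs(g - true_weight) for g in guesses]
crowd_error = abs(estimate - true_weight)

print(f"crowd estimate: {estimate}")
print(f"crowd error: {crowd_error}")
print(f"best individual error: {min(individual_errors)}")
print(f"average individual error: {mean(individual_errors)}")
```

With these invented numbers, the averaged guess lands closer to the true weight than any single guess does. That only works because the guesses are independent and scattered on both sides of the answer; if everyone copied the loudest guesser, the errors would no longer cancel.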
I admit that I've always been a little skeptical of "collaboration" because I often equate it with groupthink, but this kind of collective wisdom still values individual diversity. In fact, Surowiecki argues that you get the best results specifically when you can factor in individual input.
There are a lot of interesting applications of this concept in the book, but I think one of the most important is ensuring that you really ARE getting a diversity of opinion. One reason that anonymous voting is so important is that it ensures you are getting an accurate opinion from individuals and not votes partially based on social pressure.
Another situation this applies to is getting feedback from your students. I think a lot of us have experienced the eerie silence that follows the instructor's request for an answer, not to mention the awkward nods of agreement with slightly puzzled faces. Are the students agreeing with you or just trying to mirror your opinion?
One reason I like the concept of clickers is that they enable the kind of high-volume individual input needed to assess your students' actual thinking. We talk about how clickers can assess misconceptions (true), but sometimes they can access a wisdom you didn't know was there.
Earlier this week, I was talking about gender stereotypes in language and asking students if they could identify some stereotypes. In more than one case though, I saw some puzzled looks. I began to realize that some of my research may be getting out of date, at least in their circles.
I'm also reminded of my personal guideline of multiple tabloid sources. If one tabloid claims a movie star is an alien spy, it's probably a lie. But if two or more tabloids independently have the same story...it is probably true.