Recently in Cognition/Linguistics Category

Understanding Speech Recognition


One of the dreams for solving the captioning backlog is to rely on speech recognition. I do have to say that speech recognition is far more effective at times than I would have dreamed, but my intuition has still told me it's not entirely working. A fascinating article from Robert Fortner, "The Unrecognized Death of Speech Recognition", essentially backs up that intuition with some hard numbers. He notes that accuracy has not improved much since the early 2000s and that in most cases, the error rate is not within human tolerance (humans apparently have about a 2% error rate and even that can lead to some pretty ridiculous arguments).

When Speech Recognition Works

Speech recognition can be effective in two situations:

  1. Specific context (airport kiosk, limited menu commands) - even here though it should be noted that it's pretty darn easy to frustrate the average health insurance voice recognition system so that they give up.
  2. Specific speaker - Speech recognition is effective when trained on a single voice, and the training time is shorter than it used to be. For captioning purposes, this means that if a single speaker makes the original audio (e.g. faculty lecture) or someone else repeats what's on the audio (the captioner), speech recognition is pretty effective.

By the way, in the recent Second Accessibility Summit, Glenda Sims noted that correcting an inaccurate transcript is more difficult than starting from scratch.

What Speech Recognition Is

To understand why speech recognition isn't improving, you should consider the task it's trying to perform. When human ears listen to language, they seem to hear a stream of separate words and sounds, which get grouped into words and sentences. The reality is that speech is a continuous sound wave with very subtle acoustic transitions between different sounds (see images below; the bottom ones are the spectrograms that phoneticians use). Your ears and brain are doing a lot of processing to help you understand what that person just said.

Two Wave Forms for Two words

Your brain not only breaks up sound waves, it also accounts for the acoustics of different genders and different regional accents, filters out different types of background noise, and probably includes some "smart guessing" about what a word is as well (which doesn't always work). It's no wonder that replicating the functionality of this mechanism is taking time.

Ignoring the Linguists

There's one factor that Robert Fortner points to - speech specialists are not always involved. As one IBM researcher claimed, "Every time I fire a linguist my system improves"...but apparently there is an upper limit to this without more information. Maybe it's time to start rethinking the problem and asking whether the programming team might need some outside experts.

Book Review: The Wisdom of Crowds (via Clickers?)


In the spirit of continuing to clean my desk, my next book to review is The Wisdom of Crowds by James Surowiecki. I think a lot of people at ETS are familiar with the book, but I think it's worth explaining exactly how the wisdom is generated.

The term "Wisdom of Crowds" seems to suggest a scenario where people make decisions as committees, but that's not what it really is. Rather the "wisdom" comes from being able to tap into the results of multiple individual decisions rather than relying on a single committee or expert.

A classic example is a contest to guess the weight of an ox. Individually, the guesses varied widely, but the average of the guesses was within one pound of the actual weight. It wasn't the case that the group decided the weight of the ox, but rather that the individual guesses added up to the correct answer.
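The arithmetic behind the ox example can be sketched in a few lines of Python. This is only an illustration: the guesses and the "true" weight below are invented numbers, not the figures from the contest described in the book, and the function name is hypothetical.

```python
from statistics import mean

# Hypothetical data, invented for illustration only (weights in pounds).
TRUE_WEIGHT = 1200
guesses = [950, 1400, 1100, 1325, 1250, 1050, 1380, 1153]


def crowd_estimate(values):
    """Combine independent individual guesses by taking their average."""
    return mean(values)


# Nobody negotiates an answer; each guess is simply counted once.
estimate = crowd_estimate(guesses)

# How far off is the averaged estimate, and how far off is each individual?
crowd_error = abs(estimate - TRUE_WEIGHT)
individual_errors = [abs(g - TRUE_WEIGHT) for g in guesses]

print("Crowd estimate:", estimate)
print("Crowd error:", crowd_error)
print("Smallest individual error:", min(individual_errors))
```

With these made-up numbers, the individual guesses miss in both directions, so the errors largely cancel out in the average. That cancellation only works when the guesses are made independently.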

I admit that I've always been a little skeptical of "collaboration" because I often equate it with groupthink, but this kind of collective wisdom still values individual diversity. In fact, Surowiecki argues that you get the best results specifically when you can factor in individual input.

There are a lot of interesting applications of this concept in the book, but I think one of the most important is ensuring that you really ARE getting a diversity of opinion. One reason that anonymous voting is so important is that it ensures you are getting an accurate opinion from individuals and not votes partially based on social pressure.

Another situation this applies to is getting feedback from your students. I think a lot of us have experienced the eerie silence that follows the instructor's request for an answer, not to mention the awkward nods of agreement with slightly puzzled faces. Are the students agreeing with you or just trying to mirror your opinion?

One reason I like the concept of clickers is that they enable the kind of high-volume individual input needed to assess your students' actual thinking. We talk about how they can assess misconceptions (true), but sometimes they can access a wisdom you didn't know was there.

Earlier this week, I was talking about gender stereotypes in language and asking students if they could identify some stereotypes. In more than one case though, I saw some puzzled looks. I began to realize that some of my research may be getting out of date, at least in their circles.

I'm also reminded of my personal guideline of multiple tabloid sources. If one tabloid claims a movie star is an alien spy, it's probably a lie. But if two or more tabloids independently have the same story...it is probably true.

New Media Seminar Week Final!: Comics and Design


We ended this seminar with a reading from Scott McCloud's Understanding Comics which was presented in...comic format.

Frames and Time

The topic was "Frames" or how we interpret the passing of time based on the sequence of panels. Perhaps the most interesting observation is that even though most panels are still images, very few are actually single moments in time. McCloud points out that if there are 2 or more dialogue balloons in a panel, we have to infer a time/sequence in which the characters would deliver the dialogue. Other panels may also feature motion lines or other conventions to convey the passage of time in a single image. In other words, comics have to compress reality a bit in the images in order to push the narrative forward.

Design Question: How do we learn this?

I actually read the whole thing many years ago, and I recall thinking "Duh". It's not that McCloud is inaccurate, but that these conventions are so well designed that comic readers tend to pick them up unconsciously just by reading them. In other words, there are no comic book literacy lessons that readers have to learn beforehand. Most of us understand that THWACK! is a sound effect, that dialogue takes time, and the difference between omniscient narration in boxes at the edges of the panels and character dialogue balloons.

Not even Twitter and Facebook are this easy.

How did this happen? Partly because comics do adapt from other conventions like text. For instance both Western comics (images & dialogue) and Western text are read left-to-right, top-to-bottom. In Japan though, manga comics might be published so that images and text are scanned right to left (they are reversed when they get translated to English).

What I think is more interesting are the new conventions that were introduced with minimal fuss. Illustrators drew in some lines to simulate motion, and readers generally got it. We also figured out dialogue balloons and that the line pointing to a character meant that the character is the speaker. More interestingly, these conventions have been translated across cultures into places like Japan, China and Brazil.

Are comic book artists tapping into hard-wired visual processing algorithms? Or is it just that they understand our cultural visual vocabulary so well? I think you can debate either side, but we can learn a lesson in adaptation here. Comic book illustrators, for the most part, have been able to develop a visual vocabulary that is easily learned. I'm sure there are lots of lessons here if we could expand on this study, from creating better diagrams to understanding how to make new interfaces.

The S word - Semiotics

Another interesting point for me is that McCloud is delving into a lot of semiotic theory...without ever once using the word semiotics. He really does an excellent job of explaining the mechanics of delivering the narrative in comic form without ever getting too technical (it wasn't just the images - it was the combination of images and pithy explanatory text that worked). One of the target audiences may be instructors, but it really does work for a comic book reader wanting to learn more about the craft.

This is a great example of making an esoteric topic accessible to general audiences. And that's a skill we all wish were a little more common this semester.