Understanding how people speak when they interact with spoken dialogue systems is critical to improving the performance of those systems. For example, previous research has found that user attempts to correct system errors are themselves more likely to be misrecognized than other utterances, and thus may require special handling. Knowing whether speakers are more likely to repeat or rephrase their utterances, whether they add new information or shorten their input, and how system behavior influences these choices can suggest appropriate on-line modifications to a dialogue system's interaction strategy or to the recognition procedures it employs.
This research investigates whether speakers' prosodic behavior provides useful indicators of a) whether a speaker turn will be recognized correctly by an automatic speech recognition system; b) whether a speaker is reacting to a system error; and c) whether a speaker is correcting such an error. Our analytic results show that there are significant prosodic and lexical differences between misrecognized and correctly recognized speech and between correction and non-correction utterances. In addition, the characteristics of correction utterances vary with the system's interaction strategy. Our machine learning results show that prosodic and other differences can be used to automatically predict both misrecognitions and their corrections. We are exploring how to use our results to improve spoken dialogue system behavior.