General analysis:
Of all these papers, I am most interested in “Non-Verbal Cues for Discourse Structure”, though I am impressed by the statistical rigor found in “Toward Interface Design for Human Language Technology”. The former recognizes the value of body language in facilitating dialog. To a much lesser degree, this is also attempted in the optional paper, “Task Oriented Collaboration with Embodied Agents in Virtual Worlds”. If Rea’s body language handling could be added to Steve, the result would be very interesting indeed.
Non-Verbal Cues for Discourse Structure. Cassell, J., Nakano, Y., Bickmore, T., Sidner, C., Rich, C. Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 2001.
“Studying posture shift to understand discourse structure is useful, but because people may not interact with computer agents in the same way they do with other humans, it would be [wise] to use reinforcement learning to adapt to users during the session.”
- Antonio Roque
I’ve heard good arguments for and against such an approach. I personally favor making agents as human-like as possible to avoid specialized behavior, which has proven to be difficult to predict and respond to.
“It seems a reasonable thing to try here would be to survey listeners on their feelings regarding the attentiveness and naturalness of a speaker given abnormal gesturing. Has such been done?”
- Matthew T. Bell
“In the Analysis section of this paper, the authors state that for inter-discourse-segment and inter-turn intervals, they normalize the number of posture shifts by the number of inter-segment occurrences (ps/int). For longer intervals, intra-discourse segment and intra-turn intervals, they normalize by posture shifts per second (ps/s). However, in the tables that they show throughout the analysis section, the authors use both ps/s and ps/int for the inter-segment and inter-turn intervals. Can anyone help explain why the authors do this?”
- Theresa Wilson
Toward interface design for human language technology: Modality and structure as determinants of linguistic complexity. Oviatt, S. L., Cohen, P. R. & Wang, M. Q. Speech Communication, European Speech Communication Association, 1994, vol. 15, nos. 3-4, 283-300 (Invited paper for special edition on spoken dialogue, ed. by K. Shirai & S. Furui).
“Could reduced linguistic complexity be obtained by just asking the right questions at the right time (in regular dialogues)? In other words, to what extent could these restricted formats be simulated by carefully constructed questions?”
-
Seems reasonable to me. The hard part would be figuring out how to properly phrase the questions.
“The result that humans do not [prefer] unrestricted interfaces is an interesting one. Perhaps what it suggests is that language - even natural language - is viewed by its users already as an artifact and not a natural phenomenon. Indeed, is the concept of a "natural language" entirely accurate? Humans produce language to communicate, and in a goal-centered fashion, much as when we design and use any artifact. If language is simply another artifact, granted a vastly complex and particularly human one, then what consequences would that have for a view of dialog systems and the languages they implement?”
- Matthew T. Bell
Perhaps this suggests that body language is the only true natural language, since much of it is involuntary.
“I was disappointed to see that a lot of the speed differential in using the ExInit GUI was due to scrolling long lists and dragging each unit into place separately. Would QuickSet have performed better if these actions were made faster or if the user was allowed to type in his instructions instead of speaking them?”
- Andy P. Gaydos
You have a valid point; however, the authors did note that ExInit is similar to systems in wide military use today and accepted as state of the art.
The efficiency of multimodal interaction for a map-based task. Cohen, P. R., McGee, D. R., Clow, J. Proceedings of the Applied Natural Language Processing Conference (ANLP), 2000.
“Military personnel are trained to speak clearly and efficiently. In other words, they are ideal subjects for this sort of study. Is this a confound?”
-
I hadn’t thought of this, but you may very well be right.
“I’d be interested in seeing a different form of their system in which the user first attempts to express commands in language first, and the interface is used only for disambiguation and clarification purposes. This would seem to fit a military commander's typical style of communication better.”
-
This would also better fit my notion of the usefulness of multimodality: non-speech input is used to improve natural language understanding.
“Just out of curiosity, how does work in interfaces for the handicapped inform this sort of research. I ask because all this multi-modal stuff, and particularly that discussed in this paper, reminds me sharply of discussions of interfaces for the blind that I've had in the past”
- Matthew T. Bell
That would make sense, given that this research augments a traditional interface with speech, rather than augmenting speech with another signal.
AdApt - a multimodal conversational dialogue system in an apartment domain. Gustafson J, Bell L, Beskow J, Boye J, Carlson R, Edlund J, Granstrom, B, House D & Wiren M. Proc. of ICSLP, 2000.
“Although I didn't predict it, it made perfect sense to find out that the number of incomplete/fragmented utterances was high - it's as if the user's thoughts on the interface are competing with thoughts that lead to speech. So, is this solved by simply moving to a text-based modality? Is this question answered by Oviatt, et al?”
-
A graphical map interface seems most logical for the apartment selection domain (as well as the map task domain in Oviatt, et al). A text interface would require the user to imagine the visual equivalents of descriptions, thereby increasing cognitive load.
“They make note that one of their gestures studies involved pointing then asking a question about the object pointed at. Such gesturing seems chosen, even perhaps reasoned about more than an automatic process. If so, perhaps people could be queried on why they use the gestures they do, with those rationales used to inform a system's development. Thoughts?”
- Matthew T. Bell
This is an interesting point, given that in the Cohen et al. paper such pointing gestures were ignored entirely. This was a deliberate design choice, not an oversight.