Smith - "An evaluation of strategies for selectively verifying ."
Amy Soller: According to the main expectation rule, it seems that the system would automatically engage in a verification sub-dialog if the user, for example, simply asked the system to repeat itself. This seems pretty strict, unless the definition of verification is relaxed.
Antonio Roque: This paper made me think that what the Circuit Fix-It Shop really needed was an evaluation in which one of the conditions was a typed-text interface rather than a spoken one.
Matthew Bell: While reading this paper I kept thinking back to the video in which I saw people interacting with the Fix-It Shop. Back then, it struck me that the system behaved in a fashion that was almost insulting. One got the feeling that it condescended to the user. After reading this paper, that now makes sense: the system was designed to be very take-charge so as to avoid problems. What measures could they take to avoid this without losing the efficiency they need?
Chad Lane: What is the cost of an underverification? Is it just that the system will proceed with a false positive, or is there more? In table 1 (p636), strategy 3 produces the highest underverification rate, and yet it is suggested as the best strategy. In this sort of domain, is it possible (likely?) that users have not mastered the vocabulary? One possible result is that the system detects an expectation violation when in fact the user is just misusing the language. What about semantic verification subdialogues rather than just ASR-driven verification subdialogues? In other words, asking the user to say something in a different way (as opposed to just repeating what was said) could turn out to be effective and, in many instances, more realistic.
Stefanie Bruninghaus: My question about the Circuit Fix-It system has already been addressed in the paper (how will the results transfer to other applications, with better ASR and with different dialog characteristics; the author notes that this is an empirical question).
Theresa Wilson: Not so much a question as a comment. I was glad to see the author address the question of whether all over-verifications are bad (section 5). We've talked a bit in class about the importance of grounding in human-human conversation. The verification sub-dialogues that the Fix-It Shop uses are a very explicit type of grounding (as the author mentions).

Chu-Carroll et al. - MIMIC papers
Amy Soller: Is this the only system that utilizes the Dempster-Shafer theory? It seems fitting. Also, I would like to see some numbers in the system evaluation section.
Antonio Roque: So now we know the difference between task initiative and dialogue initiative. It still seems a little tricky, though. Why, for example, does the author consider that the system only takes over task initiative in turn 6? It seems to do so in turn 4.
Matthew Bell: The adaptability of their system seems intuitively to match what humans do in dialog processing, although it is constrained to one domain only. Two issues stand out: 1) It seems possible to go back and annotate directly for shifting control in the discourse, to induce a strategy rather than deduce a strategy. This seems to have been the approach of Litman. This is an interesting contrast. Is there any way to compare the results of these two projects more directly? 2) Their "context-dependent" rules seem to mirror their subject closely. Might a human also be thought of as keeping track of the dominant set of topics in order to reduce uncertainty in semantic meaning? The difference is that humans would not be bound to one topic and could enlarge the library of topics as needed, whereas this system does not (yet) have this ability.
Theresa Wilson: It would be interesting to know how often MIMIC actually adapted its dialogue strategy in the experiments that the authors describe. Each dialogue had the potential for one or more shifts in task and/or dialogue initiative. Were there dialogues that had no adaptations? Did the dialogue initiative (as opposed to the task initiative) ever shift from the system to the user? I think that these would have been interesting questions for the authors to address.

Litman et al. - "Designing and Evaluating an Adaptive Spoken Dialogue System"
Antonio Roque: "Designing and Evaluating an Adaptive Spoken Dialogue System" used both ANOVA and PARADISE. I confess I'm not fully familiar with ANOVA yet, so I'm curious: in general, when is each best to use?
Matthew Bell: I feel I now understand a bit better what PARADISE is doing. I also find it curious that task success didn't correlate very strongly with user satisfaction. Could it be that the concept of accomplishing a "task" is less natural to humans than some other concept salient within dialog, potentially accounting for the discomfort we feel when "talking" with a computer?
Chad Lane: Task success seemed extraordinarily low (table I, p18), even with adaptive TOOT. In the end this is the most important measure for the user, so what is a reasonable goal? 90% success? Users preferred user-adaptable TOOT over adaptive TOOT with statistical significance, but at how much training cost? In order for users to take advantage of it, they would need to be informed of some concepts, like initiative, implicit verification, explicit verification, etc. At least it seems this way from the example on p. 21. I think users are generally more prone to like tools they understand, so perhaps this could explain the difference. Is it safe to assume that ASR performance will improve with time? If so, what other adaptation criteria could be used? Would there be a need to adapt at all?
General: There are many factors that, if known, could affect these systems' abilities to detect miscommunication. For example, the native language of the speaker, the patience of the user, and the user's experience with the system. What else? How can this information be obtained? How can it be used?
Alan Berfield: What besides recognition of poor ASR could be used to trigger adaptation?
Stefanie Bruninghaus: I would expect that the particular rule learned (p. 9) is more an effect of using RIPPER. RIPPER seems to learn very few, very short rules. It would be interesting to try ID3 or another learning algorithm and compare the results. Why is the preference in TOOT to accept false positive recognitions? It seems from the Smith paper that initiating a verification subdialog is more acceptable to a user than dealing with system errors. And, finally, why wouldn't one first use the most conservative dialog strategy and then, when that works well, slowly let more and more initiative go to the user?
Roy Wilson: (1) Does the fact that the best learned ruleset contains only one of 23 features surprise you? (2) The authors suggest that 80% from the 10-fold cross-validation is better than the baseline: would a statistical test of this, based on the variance of the sampled error rates, strengthen this claim? (3) In footnote 7, the authors note that in another study they found interaction effects: how did that study differ from this one? (4) The authors claim that their performance equation "explains" the lack of a significant main/interaction effect involving adaptability. But I (doh!) don't get it: if adaptability is good, then (ignoring the question of insufficient power) OUGHT it not affect user satisfaction?
Theresa Wilson: For the non-adaptive version of TOOT, what were the settings for the dialogue strategy? Were the settings the same for all users who tested non-adaptive TOOT?