Stefanie Bruninghaus

I am wondering how central the performance of the ASR component is in all these evaluations. The evaluation results suggest that all other features of the systems are secondary to ASR. This makes me wonder what implications these evaluations have for dialog systems in general, beyond ASR-based systems. I also expected some discussion of how dialog management has a measurable influence on system performance; that's where things get interesting and where a designer can add intelligence.

Closely related is the issue of the benefit of this kind of comparative evaluation. In other fields, like Information Extraction, the large-scale comparative evaluations have been criticized for causing a convergence of technology. That is, participants copy the technology of the most successful system, and improve upon it a little bit. While the focus on travel info systems is probably a pragmatic choice, it limits the comparison to the ASR aspects. If a larger-scale evaluation were carried out using PARADISE, it seems that the focus could shift toward analyzing the surface behavior of the different systems, and away from the underlying design choices.

----------------------------------------------------------------------
Amy Soller

In the Walker, Kamm, & Litman (PARADISE) paper: how can one be sure that all these variables are independent? It seems that the quality and efficiency of the dialog are somewhat dependent on each other. For example, better quality suggests more efficiency: if there are fewer errors and fewer repetitions needed, then the dialog will be shorter.

I found the DARPA Communicator survey paper frustrating to read: it had me trying to guess which systems were the ones assigned to 1, 2, 3, etc. The researchers must have known which systems were which, so why not report on it? This sort of thing encourages competition, which is important for iteratively improving the design of the systems, right?
I also wonder how the performance of the various systems changed (if at all) for the open tasks, as compared to the closed tasks. And a technical question: what statistical procedure do you use to determine which variables account for x percent of the variance?

----------------------------------------------------------------------
Matthew Bell

On all of the papers: What is a user model?

David Chin: This paper made more sense than the others, although I'm still a little lost, as I'm not quite certain what a user model is. I do note that he points out that the effect size should be at least .8, whereas in the Walker et al. paper they seemed happy to have (an effect size?) of .42. Which is it?

Walker, Passonneau, and Boland: Even after accounting for more of the variance, they can only account for .42 of it. Where's the rest? Nuisance variables, a la Chin's paper?

Walker, Kamm, and Litman: What is PARADISE?

----------------------------------------------------------------------
Terry Wilson

In "User Models and User-Adapted Systems", on page 189, the author proposes a number of common measures that he would like to see adopted for evaluating user models. He has covered the first three measures earlier in the paper and proceeds to explain the last measure, but what specifically are "post-hoc probabilities"?

In "General Models of Usability", when the authors discuss the evaluations of the user models that they trained, they use phrases like, "the ELVIS model accounts for 55% of the variance in user satisfaction." What does it mean to say that a given model accounts for a certain percentage of the variance in user satisfaction?

----------------------------------------------------------------------
Antonio Roque

In "Towards Developing General Models of Usability with PARADISE", the authors favor generalizing normalized dialogue quality metrics, rather than efficiency metrics, which seem unlikely to generalize (p. 8).
It would be interesting to try to find some way to adjust the efficiency metrics in the same way that accuracy values are adjusted in kappa measures; for example, elapsed time to completion might be adjusted based on a minimum time to completion or an average time to completion of the tasks being compared.

----------------------------------------------------------------------
Andy Gaydos

DARPA Communicator Dialog Travel Planning Systems: The plans for a second data collection in April 2001 looked interesting and promising. Did this data collection begin as planned?

----------------------------------------------------------------------
Eric Williams

Are the models derived with PARADISE really any good? The statistics given are a little confusing. If the percentages given (% of variance accounted for) are simply how many cases were correctly classified (with regard to user satisfaction), their results frankly suck. Also, the cross-domain predictability suggests to me equally poor prediction, rather than something to be excited about. In other words, I'm not convinced their models were usefully (as opposed to strictly statistically) significant. 35% accuracy given a guess accuracy of 20% is statistically significant, but it still stinks.

I think the problem lies in the target to be predicted. Why on earth did they sum the survey results? I would have liked to have seen each subjective satisfaction measure set as the target to be learned, with the objective measures all used as input features.

On a side note, what the heck is PARADISE anyway? I saw no reference to the method(s) used to derive their models. Was it rule induction, decision trees? Voodoo ritual?

----------------------------------------------------------------------
Roy Wilson

(1) Bouwman-Hulstijn (1998). This paper is interesting methodologically and substantively.
Although the idea of using ASR reliability measures to control dialogue duration is interesting (and one that we have seen before?), I classify this paper as "engineering" rather than "scientific" (recalling Antonio's early remarks) in its approach to evaluation design and reporting. Although it isn't entirely fair to fault the authors for not orienting their work to a 2001 paper, it is legitimate to observe that this paper does not employ the design rules of thumb or the evaluation standards advocated by Chin.

(2) Walker, Kamm, and Litman (2000). The authors use stepwise linear regression to "train" (aka "parameterize"?) a multiple linear regression model. I've read and seen that stepwise regression (SR) is sensitive to both the order in which variables are entered and the statistical criteria used for entry: SR can generate very different models (a variable included in one SR can be excluded in another). More information on how the authors addressed these issues would be soothing to some statistical/methodological worrywarts.

The goal of finding "general factors that predict user satisfaction" (p. 13) is worth searching (even marching) for. I wonder whether the authors considered doing a confirmatory (vs. exploratory) factor analysis (CFA): Figure 1 shows the (unobservable) factors (aka constructs, latent variables, etc.) and suggests a model of how they are related. Figure 1 lists the manifest variables that are indicators for each factor. Efficiency and quality "cause" costs to be what they are; user (dis)satisfaction is "caused" by task outcomes and by costs. CFA "requires the researcher to theorize an underlying structure and assess whether the observed data 'fits' this a priori specified model" (Mueller, Basic Principles of Structural Equation Modeling, 1996). Such a formulation might be the skeleton for a substantive (that is, not 'merely' statistical) theory of user satisfaction. See below for more on "fit".

(3) Walker, Aberdeen, et al. (2001).
Assuming that multivariate normality is sufficiently approximated (as assumed in multivariate linear regression), a CFA model could be used to assess the validity (does it measure user satisfaction?) and reliability (does it consistently measure?) of the "instrument" implicitly defined by the data collection procedure described in this paper. Although accounting for 35% of the variance in user satisfaction is a measure of reliability, it may be "somewhat low" (Mueller, p. 79) and may be misleading, since multivariate linear regression assumes zero or negligible measurement error. A structural equation model (SEM) allows for measurement error to be specified or estimated, which affects the estimates of how the indicators relate to the constructs and how the constructs relate to each other.

(4) Walker, Passonneau, and Boland (2001). In the discussion section, it is stated that the addition of dialogue act metrics "improves the fit of models of user satisfaction from 37% to 42%". An F-ratio can be computed to test the statistical (though not the substantive) significance of the delta. Was it? (See Mueller, p. 16).

(5) Chin (2001). My statistical gurus have informed me that the use of ANCOVA is often VERY tricky, although Chin's language suggests otherwise; if possible, it is safer (and often more statistically powerful) to create blocks based on covariates and proceed with a regular ANOVA. ANCOVA requires assumptions beyond those required by ANOVA and is not robust against their violation.

----------------------------------------------------------------------
Mihai Rotaru

PARADISE framework:

1. Is user satisfaction really a good measure? I guess at this point it is still important, but as SDS advance, the user satisfaction measure will become more and more subjective (imagine using user satisfaction on Linux vs Windows).

2. I was wondering if a weighted sum of the user satisfaction metrics would be more appropriate.
For example, it would be very interesting to see which features are responsible for individual user satisfaction metrics (like TTS performance, Future Use).

3. Most of the individual user satisfaction metrics implicitly hide a comparison, and this comparison is usually made with a human. What is the effect of such a comparison, and how aware are users of it? What would the results be if one ran additional experiments with humans answering the tasks and normalized afterwards?
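
----------------------------------------------------------------------
Several comments above ask what it means for a model to "account for" a percentage of the variance, and whether the reported improvement from 37% to 42% was tested with an F-ratio. The following is only a minimal sketch of the textbook definitions, not the procedure used in any of the papers: R^2 is computed as one minus the ratio of residual to total sum of squares, and the F-ratio is the standard one for comparing nested regression models. The function names, the toy scores, and the values of n and the predictor counts are all made up for illustration; only 0.37 and 0.42 echo the figures quoted above.

```python
from statistics import mean


def r_squared(observed, predicted):
    """Share of the variance in `observed` that `predicted` accounts for."""
    m = mean(observed)
    ss_total = sum((y - m) ** 2 for y in observed)
    ss_resid = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_resid / ss_total


def f_change(r2_reduced, r2_full, n, k_reduced, k_full):
    """F-ratio for the gain in R^2 when predictors are added to a nested model.

    n is the number of observations; k_reduced and k_full are the numbers of
    predictors in the smaller and larger model (intercept not counted).
    """
    numerator = (r2_full - r2_reduced) / (k_full - k_reduced)
    denominator = (1.0 - r2_full) / (n - k_full - 1)
    return numerator / denominator


# Made-up satisfaction scores and model predictions (not from any paper).
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
print(r_squared(observed, predicted))

# Hypothetical sample size and predictor counts, chosen only for illustration.
print(f_change(0.37, 0.42, n=100, k_reduced=3, k_full=5))
```

The resulting F value would be compared against an F distribution with (k_full - k_reduced) and (n - k_full - 1) degrees of freedom. As the comments above note, this speaks only to statistical significance, not to whether the gain is substantively useful.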