====================================================================
Q for Jurafsky et al.: The authors discuss that the tag distribution is skewed, and they seem to prefer an even distribution. But is that really desirable? I would expect that the skewness correctly reflects human dialog. While the paper discussed how additional sources of information can be integrated (prosody, for instance), I missed how the overall integration of all this evidence is supposed to work out.

Q for TBL paper: I am wondering how much of the success of this work is to be attributed to the use of the TBL algorithm, and how much is due to the nifty feature selection work. It seems to me that by using clusters (p. 4), for instance, they not only get fewer rules but add generalization power, too. To support the claim that TBL is the better algorithm, I would like to see experiments where they use the same representation (with nifty features) and feed it into an alternative algorithm.
Stefanie Bruninghaus
====================================================================
In (Stolcke et al., 2000), I am a little confused about exactly how the decision tree posteriors are used in the HMM (p. 17).

Also, Stolcke and his colleagues are up front about the 35% baseline; however, I didn't see any mention of a baseline in Samuel et al.'s work. Yet the various approaches are tallied, with accuracy percentages, in Table 14 of Stolcke et al.'s paper. I would imagine that the baseline could make a significant difference, especially in Samuel and colleagues' work, where it seems they begin by tagging the entire corpus according to the baseline tag.

Also, if we cover the optional paper, I wonder why the authors thought that including the automaton information would improve performance. Even though the state information is hidden, isn't it still inherently captured in the model?
Amy Soller
====================================================================
In "Dialogue Act Tagging with Transformation-Based Learning", pp. 4-5 describe using a Monte Carlo method to limit the intractability of searching through the generated potential rules; in what ways would other search methods or heuristics improve the algorithm?

It was interesting to see on p. 6 of "Dialogue Act Modeling..." the mention of the 0.8 kappa value, given our discussion of it last class; it seemed to me that the Carletta paper didn't "argue" the worth of that value so much as introduce it.
Antonio Roque
====================================================================
Samuel et al.

On the one hand, I enjoyed this paper more, but on the other I find myself somewhat skeptical of their claims. Two, specifically, stand out:

1) They make the claim that they needed to randomly sample templates for producing rules (or perhaps they were randomly selecting only some of the rules generated by any given set of templates?!), but that this did not hurt them, as the "best" rules are somehow still overwhelmingly likely to end up in the mix (pp. 4-5). As I understood TBL, we first choose templates, then generate all rules, then measure the effectiveness of those rules. How could we choose only a subset of templates or rules, throw out all the others a priori, and still be guaranteed (or nearly guaranteed) to get the "best" ones?

2) On p. 5 they seem to assert that their method obtained results roughly equivalent to results shown by other methods. Further, their accuracy seems higher than the kappa values we saw in Monday's papers (although not higher than the kappas shown in Stolcke et al.). Given that their results are apparently higher than the agreement shown by humans (and hence suspicious), and no better than those produced by other methods (hence not giving us anything new), why have they considered their method a success?
On this issue a couple of arguments in their favor stand out to me, the greatest being that the rules produced by TBL can be analyzed by humans post hoc to formulate theories about what is going on (p. 2). If this is the main reason to use TBL for this application, what sorts of theories have researchers used TBL rule sets to deduce?

Stolcke et al.

I have two questions: one a clarification question about the paper, and the other a question about potential modifications to their model that seem obvious and perhaps even tacit in their considerations for future work.

First, for clarification: I'm not clear on how they are connecting Bayesian networks, decision trees, and neural networks with HMMs. How would one go from an HMM-generated discourse grammar to any of the above? The connection to Bayes nets is perhaps clearer to me, as both HMMs and Bayes nets make use of independence assumptions, but the decision tree and ANN connection does not seem as clear.

Second: They cite the promise of using knowledge about DAs for improving ASR, but show disappointing results. As I understood the earlier parts of their paper, their original architecture treats the DA as an abstraction induced using statistical reasoning and acoustic and other evidence, which is perhaps a source of some of the difficulty in later using the DA as evidence for interpreting acoustics. The model of "listening" here is entirely passive: the listener gets evidence about the current DA, then picks a likely word given that and the acoustics. Through introspection it seems more likely to me that humans are more active in their participation in dialogue. We aggressively assign interpretations to utterances given to us in light of the previous discourse and our relationship with our interlocutors, only reinterpreting an utterance if negative feedback is given (e.g. "Huh? No I didn't say that.
I said ____.") Has any work been done to see what effect making computer dialogue understanding systems, artificial listeners, more agentive in interpretation would have, vis-a-vis the above? Is this part of the idea behind using paradigms of games and other machine learning methods where the computer actively makes moves, receiving positive or negative feedback based upon success or failure in some terms (perhaps based on gold-standard tags in a corpus, or even actual human feedback in the case of dialogue systems)? Is this similar in spirit to Diane's and others' use of reinforcement learning? Also, are the authors contemplating something of this sort on p. 27 in the future work section?
Matthew Bell
====================================================================
Samuel et al.

About "irrelevant" rules... is it really possible to know for sure that a rule is irrelevant without human confirmation? It doesn't seem like it would be that hard to imagine a rule that is (1) sensible but with low improvement or (2) not sensible but with high improvement. Maybe I'm reading too much into their use of "irrelevant"? They address the issue of an irrelevant rule making it into the final model on the third page, but I don't understand the solution. It is situated within the discussion of cue phrases, but I believe it is a larger problem.

Another question related to irrelevance... is there something that can be learned from irrelevant rules? Should they really be discarded? For example, if a large number of irrelevant rules seem to contain a common feature that doesn't appear in the relevant rule set very often, I think you'd have evidence enough to throw out the feature.

Stolcke et al. et al. et al.

One thing I like about the paper is that they address non-task-oriented dialogues. What aspects of their findings could be used to inform more symbolic approaches to automatic tagging?
Similarly, since task-oriented dialogues seem the most popular in the papers we've considered up until now, do the results here suggest any important differences?
H. Chad Lane
====================================================================
In Table 2 of Stolcke et al., several dialogue act labels are shown to occur in less than 1%, and some in less than 0.1%, of the utterances. Is it worth trying to identify these less frequent tags, or would it be better to give these utterances a different tag?
Andy Gaydos
====================================================================
Samuel et al.
-------------
They say that they are trying to filter out the superstring dialogue cues. I'm wondering, though, whether a superstring cue could indicate a different dialogue act than the more general substring version. Maybe this is rare enough that it has minimal effect?

What kind of dialogue acts would machine learning likely fail at recognizing without common-sense world knowledge?

Stolcke et al.
--------------
Would their technique be expected to perform better on a more task-oriented corpus?
Alan D. Berfield
====================================================================
(1) Is anyone troubled by the definition of prediction correctness given by Reithinger et al.? If not, please tell me what I'm missing.

(2) Samuel et al., in Figure 3, report the tagging accuracy of their system as a function of the type of word substrings used. They appear to do a straight comparison. Given the large N's reported, however, it may not be surprising that so many of the differences are statistically significant. A better measure might be the effect size, as measured (if I recall correctly) by the difference of the group means divided by the square root of the pooled variance. The effect size would state the difference in the average accuracies in terms of a common standard deviation unit.
It is claimed to provide a better measure of a substantive, rather than merely statistical, difference.

The idea of using sampling methods via simulation to offset the complexity of TBL is very interesting, but suggests the need for confidence measures. Is the following a better alternative? Assuming the computation of PE is straightforward and ignoring the inaccuracies introduced by possibly using a large-sample method, compute kappa over the model results and its significance level. If the kappa is significant, use it directly as a confidence measure rather than committee-based sampling (about which I know nothing). The behavior of the system can be explored as a function of theta, R, the set of dialogue act cues, etc., with kappa providing a response surface.

(3) Is it legitimate to calculate a kappa for Stolcke et al.'s dialog act labeling accuracy data (stated in the abstract)? I get a kappa of 0.55. Unfortunately, Stolcke et al. do not provide data to allow a calculation of the statistical significance of kappa.
Roy Wilson
====================================================================
Samuel K. et al.

The last paragraph in the Discussion brings up the fact that machine learning will not be able to completely solve the dialog act tagging problem because of its inability to incorporate world knowledge. I guess the cases missed by the machine learning technique can be the exceptions that are ignored by machine learning (and we all agree that there are a lot of exceptions in languages - I remember learning French :-). Will a memory-based learning algorithm improve the accuracy?

Stolcke A. et al.

On page 6, second paragraph, they say that the corpus was annotated by 8 linguistics graduates. This relates to what we discussed in the previous class about the kappa statistic and experts. I was wondering why they didn't choose non-linguists. Anyway, it is clear to me that the kappa statistic just gives an evaluation based on a specific class of coders.
Mihai Rotaru
====================================================================
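A rough sketch of the two statistics raised in Roy Wilson's questions (2) and (3) above, in case it helps the discussion. It assumes kappa is taken as (PA - PE) / (1 - PE) and that the effect size is the difference of group means over the square root of the pooled variance, as described in the post. The function names and the sample numbers are illustrative only. Treating a tagger's chance baseline as PE is itself an assumption, and the 0.71 accuracy used below is a guess at which figure produced the 0.55 (the 35% baseline is the one mentioned earlier in this thread).

```python
from math import sqrt
from statistics import mean, variance


def kappa(p_agree, p_chance):
    """Kappa as (PA - PE) / (1 - PE).

    p_agree  -- observed agreement (here: tagging accuracy), in [0, 1]
    p_chance -- expected chance agreement PE (here: the baseline), in [0, 1)
    """
    return (p_agree - p_chance) / (1.0 - p_chance)


def effect_size(group_a, group_b):
    """Difference of group means divided by the square root of the pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled)


# Assumed inputs: accuracy 0.71 against a 0.35 chance baseline.
print(round(kappa(0.71, 0.35), 2))  # prints 0.55

# Made-up accuracy samples for two feature settings, purely to show the call.
print(effect_size([0.70, 0.72, 0.74], [0.66, 0.68, 0.70]))
```

Unlike a significance test, the effect size does not grow with N, which is the point of the suggestion in (2).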