--------------------------------------------------------------------
Stefanie Bruninghaus
--------------------------------------------------------------------

Q related to Carletta's Kappa paper:
Some mistakes (like mislabeling a rare but very important class) are more expensive than others. Would there be any advantage in methods that take cost into account?

Is Kappa really such a great statistic? It seems that chance agreement is not exactly a good baseline assumption; wouldn't it be better to use a somewhat more stringent measure instead? E.g., a naive Bayes classifier using class distributions as a more intelligent baseline - after all, humans are not coding by chance, either.

Carletta coding scheme paper:
What seems most interesting about this paper is that the coding scheme is hierarchical in a different way than DAMSL - it also considers the dialog structure (which makes sense, since the kind of dialog act and the higher-level dialog structure are not independent). I am wondering whether this idea could be transferred to DAMSL - it seems that the Di Eugenio paper has some of that already in it, by defining well-formed proposals as aggregations of other DAs.

Di Eugenio/Jordan paper:
It seems intuitive that recognizing a WFP can increase performance on recognizing agreement. It's pretty easy to believe that for an utterance U, P(U = agree) is not equal to P(U = agree | prev_U = proposal) - but the argument in Sec. 5 does not discuss how they intend to use this information.

--------------------------------------------------------------------
H. Chad Lane
--------------------------------------------------------------------

Core & Allen:
DAMSL's focus has been on task-oriented dialogs, but they state that they believe it to be applicable to all dialogs. I'm curious: what other sorts of dialogs are there, and is there anything about them that forces the authors to make this distinction? Kind of a silly question, but I'm curious....
In Table 1 and in a nearby paragraph, they seem to be concerned about the undergrad vs. grad distinction in their coders. However, they really don't discuss it in their results section. What's the big deal?

Carletta (Kappa paper):
K > .8 is considered good reliability in the field of content analysis; is there any reason to believe it to be the right threshold for dialog annotation?

--------------------------------------------------------------------
Theresa Wilson
--------------------------------------------------------------------

1. In "An empirical investigation of proposals in collaborative dialogues", the authors give an example of a conversation at the bottom of page one. The authors say that because something is first mentioned (in this case, a blue sofa), it cannot count as a proposal to include it in the solution. Perhaps my understanding of the scenario is incorrect, but this doesn't necessarily seem true. If earlier in the conversation the two conversants agree on the general task, say that they will be discussing furniture to buy, and then speaker A begins with, "I have a blue sofa for 300", then wouldn't this be a proposal? (The blue sofa is a piece of furniture.) And proposals are what the paper was focusing on recognizing.

2. From the description in "Coding dialogs with the DAMSL annotation scheme", I wasn't clear on the differences between the Utterance Features in the Information Level dimension (page 4):
   a. Task
   b. Task Management
   c. Communication Management
Neither were the annotators, as the authors noted in the last paragraph before the conclusions. Could we discuss these in class?

3. Reading the papers on the two different annotation schemes, "The reliability of a dialogue structure coding scheme" and "Coding dialogs with the DAMSL annotation scheme", it seems that some of the very tags that the annotators had trouble distinguishing between in the "dialogue structure" paper,
   a. check v. query-yn
   b. instruct v. clarify
   c. acknowledge v. ready v. reply
are the same types of things for which the DAMSL coding scheme allows multiple tags. For example, with DAMSL, an utterance can be tagged as both an acknowledgement and a reply. This is not true for the move classification presented in the first paper. Has any attempt been made to compare these two dialog coding schemes on the same dataset, specifically with DAMSL allowing multiple tags while the classification for moves does not?

--------------------------------------------------------------------
Mihai Rotaru
--------------------------------------------------------------------

1. Carletta - "Assessing agreement on classification tasks: the kappa statistic"
I wonder how someone could approach reliability not from the agreement point of view but from the non-agreement point of view. Such a study would show how coders applying a coding scheme spread their categorizations.

My critique of kappa is that although it does take into account the probability of agreeing by chance, it fails to apply it further to individual categories. In more detail, I think that a disagreement in a category with high chance-agreement probability should count more heavily than one in a category with low chance-agreement probability. Or the other way around?!?

How can someone model the learning of the human coders in a reliability statistic? I guess that any approach to reliability should take into account the fact that human coders have the ability to learn and to improve through experience. Thus, at least some procedure for collecting reliability data should be devised.

2. All papers on coding schemes
The fact that coding schemes for dialogs do not have a very high kappa makes me wonder about their utility. This comes from the fact that human subjects can have a dialog but still disagree when it comes to coding it. Thus, in my opinion, coding schemes are one road to follow, but not for a long time. Maybe someone can convince me otherwise...
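On the per-category point in item 1 above, one way to make the question concrete is to compute kappa once over all labels and once per category, collapsing the labels to "this category vs. everything else". The sketch below is only an illustration under stated assumptions: it uses Cohen's two-coder formulation, where P(E) comes from each coder's own label proportions (the papers under discussion may compute P(E) differently, e.g. from pooled proportions), and the label sequences are made-up toy data, not taken from any of the corpora.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two coders' label sequences of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    # Observed agreement P(A): fraction of items given the same label.
    p_a = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement P(E) from each coder's own label counts (assumption).
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[c] * cb[c] for c in ca.keys() | cb.keys()) / (n * n)
    if p_e == 1.0:
        # Both coders used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is just a convention.
        return 1.0
    return (p_a - p_e) / (1 - p_e)

def per_category_kappa(a, b):
    """Kappa per category, treating each label as 'category vs. rest'."""
    cats = sorted(set(a) | set(b))
    return {c: cohen_kappa([x == c for x in a], [y == c for y in b])
            for c in cats}

# Hypothetical toy codings (not real data).
coder1 = ["check", "query-yn", "check", "instruct",
          "instruct", "check", "query-yn", "instruct"]
coder2 = ["check", "check", "check", "instruct",
          "instruct", "query-yn", "query-yn", "instruct"]

print("overall kappa:", cohen_kappa(coder1, coder2))
for cat, k in per_category_kappa(coder1, coder2).items():
    print(cat, k)
```

In this toy data all the disagreement sits between check and query-yn, so those two categories get a lower per-category kappa than instruct, which the overall figure alone would hide.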
--------------------------------------------------------------------
Matt Bell
--------------------------------------------------------------------

Core and Allen

One thing that has been at least hinted at in some of the previous papers was a notion of maximizing utility by meeting as many "goals" as possible in a communication situation. One utterance can meet multiple goals simultaneously along multiple dimensions. The DAMSL tagset seems as if it would open up an opportunity to explore this through machine learning. Has DAMSL been used to do this? Given multiple acceptable games -- a satisficing rather than optimizing criterion -- would the usefulness of a DAMSL-annotated corpus remain high even given mediocre kappa scores?

================================
Carletta (1996)

All the kappa papers mention the .67 < K < .8, K >= .8 heuristics for sufficient agreement over chance. How were these set? It sounds from this set of papers as if these values are sort of ad hoc. Has any work been done to confirm the validity of these thresholds? Perhaps statistical significance of a result over and above chance would be more reliable than simply picking an arbitrary cut-off -- or does this make sense?

================================
Di Eugenio et al.

The authors mention that they made use of the DRI, a tagging initiative for sharing corpora on a large scale. They then indicate that they used a modified version of the DRI. Why did they do this? This decision does not seem to be explained, and would seem to diminish the rationale for using DRI.

================================
Carletta et al. (1996)

On pg. 9 the authors mention Grice and Savino's research into cross-cultural and interpersonal factors influencing characteristics of dialogues. This research revealed a tendency on the part of Italians who were "very familiar" with each other to explicitly reject proposed goals. This in itself seems a useful observation for, e.g., register detection. Is work being pursued in this direction?
--------------------------------------------------------------------
Antonio Roque
--------------------------------------------------------------------

In "Coding Dialogues with the DAMSL Annotation Scheme", the authors claim that a "corpus reliably annotated with DAMSL labels would provide a valuable resource in the study of discourse as well as a source of training and testing for a dialog system." ("Conclusion", second to last page).

It's true that a standard for annotating dialogue corpora would be useful for the development of practical systems. However, I suspect that DAMSL makes too many assumptions to help bring about substantial breakthroughs in the study of discourse. It assumes that its decisions about the number and nature of layers and functions are good enough to cover all possible dialogues (though the authors admit on the second page that DAMSL is task-oriented). DAMSL doesn't provide the extensibility needed if, for example, it turns out that a particular dialogue task needs a layer breakdown different from the one provided.

Maybe what we need is a meta-language for dialogue annotations that allows the specification of dialogue act classifications, much as XML does for markup languages.

--------------------------------------------------------------------
Roy Wilson
--------------------------------------------------------------------

Carletta (1996) mentions that comparing kappa across studies requires comparable units. Assuming kappa is defined simply as (PA - PE)/(1 - PE), are the kappas reported by Core and Allen (1997) and Di Eugenio et al. (1998) comparable?

--------------------------------------------------------------------
Ilya Goldin
--------------------------------------------------------------------

Coding Dialogs with the DAMSL Annotation Scheme. Mark Core, James Allen. AAAI Fall Symposium on Communicative Action in Humans and Machines, 1997.

Assessing Agreement on Classification Tasks: The Kappa Statistic. Jean Carletta.
Computational Linguistics, 22(2):249-254, 1996.

An Empirical Investigation of Proposals in Collaborative Dialogues. Barbara Di Eugenio, Pamela W. Jordan, Johanna D. Moore and Richmond H. Thomason. Proceedings of the 17th International Conference on Computational Linguistics and the 36th Meeting of the Association for Computational Linguistics (COLING-ACL), 1998.

Di Eugenio et al. move beyond the standard DRI scheme in that they explore how aspects of the scheme can be combined to indicate higher-level structures in the dialogue (see their definition of a well-formed proposal). What other higher-level structures are relevant to task-oriented dialogues? Is there a methodical way of discovering evidence for them, perhaps through correlations?

During the process of formulating their definition of a WFP, the authors note that one of the constituent aspects (antecedents of commits) is not tagged. Does this imply a plausible revision to the coding scheme, or are there reasons not to code such aspects?

--------------------------------------------------------------------
Vincent Aleven
--------------------------------------------------------------------

Carletta

"We would argue that in subjective codings such as these, there are no experts." Yes, you could argue that there are no experts - or conversely that we are all experts in the use of language and speech. But it still seems that at some of these subjective coding tasks, you would become better with practice (or at least more consistent). Also, what if - after computing kappas between each pair of coders - you found that one coder is better or worse than all the others?

Di Eugenio et al.

p. 4 - "it appears we have reached an impasse; if human raters cannot reliably recognize when two participants achieve agreement, the process of automating the process is grim." But this seems inconsistent with the fact that the participants in the dialogues presumably had little trouble recognizing when they had reached agreement.
Or, if you had asked the coders to read over the dialogues and, for each, list the things that they thought the participants in the dialogue agreed on, you'd expect to have a much higher kappa than the reported .54 for agreement. So ... not sure how to resolve this - in order to do this kind of coding you apparently have to recognize not only that agreement is reached within a certain dialogue segment but also at exactly which utterance - perhaps this is a more "fine-grained" question? So - is the assumption that you must be able to label each utterance reliably in order to follow the dialogue perhaps too strong? Alternatively, you could say that we need different labels. But the argument that the ambiguity is in the data, not in the set of labels, seems more reasonable.

"the columns in the tables read as follows: if utterance Ui has tag X, do the coders agree on the subtag?". Two questions: (1) If utterance Ui has tag X according to whom? One rater? All raters? (2) Further, doesn't this distort the reported kappas by ignoring possible agreement (or disagreement?) about whether utterance Ui actually has tag X? Or am I missing something?

Core & Allen

"For the interpretation of a dialog, it is critical to have a primitive abstraction of the purpose of each utterance." (1st sentence of the conclusion). Should this be read as: "For the interpretation of a dialog, it is critical TO BE ABLE RELIABLY TO DETERMINE the primitive abstraction of the purpose of each utterance."? (Where "reliably" means with high kappa.) Or am I reading too much into this?

I am not completely comfortable with the way the Di Eugenio and Core & Allen papers treat Krippendorff's .67 and .8 thresholds for kappas. According to Carletta, Krippendorff said that these numbers are somewhat arbitrary - meant to be useful guidelines but not to be taken too seriously. So some discussion of whether these thresholds are appropriate for the task at hand might have been useful.
In particular, some of the examples convinced me that certain utterances are inherently ambiguous (e.g., "Okay"). If we accept that some things are inherently ambiguous (at least at the current level of analysis - labeling utterances), then that would argue that the kappa thresholds must be treated with caution. Should we have a notion of things being "reliably ambiguous"?

Carletta et al.

What are the relative advantages and disadvantages of this coding scheme over the DAMSL and Coconut extensions to DAMSL? When would you use one, and when would you use the other?
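
Returning to the question raised under Carletta above (what if, after computing kappas between each pair of coders, one coder turns out to be better or worse than all the others?), the check itself is easy to sketch. The code below is a minimal illustration, not the procedure of any of the papers: it assumes Cohen's two-coder kappa with each coder's own label proportions for P(E), and the coder names and label sequences are hypothetical toy data.

```python
from collections import Counter
from itertools import combinations

def kappa(a, b):
    """Cohen's kappa, (P(A) - P(E)) / (1 - P(E)), for two label sequences."""
    n = len(a)
    p_a = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each coder's own label counts (assumption).
    p_e = sum(ca[c] * cb[c] for c in ca) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: one identical label throughout
    return (p_a - p_e) / (1 - p_e)

def mean_pairwise_kappa(codings):
    """Map each coder to the mean of its kappas with every other coder."""
    if len(codings) < 2:
        raise ValueError("need at least two coders")
    per_coder = {name: [] for name in codings}
    for c1, c2 in combinations(codings, 2):
        k = kappa(codings[c1], codings[c2])
        per_coder[c1].append(k)
        per_coder[c2].append(k)
    return {name: sum(ks) / len(ks) for name, ks in per_coder.items()}

# Hypothetical codings of six utterances by three coders (toy data).
codings = {
    "coder_A": ["agree", "other", "agree", "other", "agree", "other"],
    "coder_B": ["agree", "other", "agree", "other", "agree", "other"],
    "coder_C": ["agree", "other", "other", "agree", "agree", "other"],
}

means = mean_pairwise_kappa(codings)
for name, k in sorted(means.items(), key=lambda item: item[1]):
    print(name, round(k, 3))
```

A coder whose mean pairwise kappa sits well below the others' would be the candidate "outlier"; whether that coder is worse, or simply reads ambiguous utterances like "Okay" differently, is exactly the open question above.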