Date: Tue, 12 Feb 2002 19:53:09 -0500
From: Amy Soller
Subject: questions 4 class
To: Matt Bell

Litman, Hirschberg, & Swerts, 2000 -
What are the intuitive tradeoffs between misrecognition in terms of WER and in terms of CA?

I wonder if there are some behavioral idiosyncrasies between men and women that would favor a particular set of prosodic features under each condition. Did Ripper take this into consideration? For example, could it come up with rules like: IF the speaker is recognized as a woman, THEN give additional weight to F0 Max.

Heeman & Allen -
I really enjoyed this paper. I wonder, though, if it is difficult to get enough training data where people are naturally issuing dialog repairs. It seems natural enough in communication, but it also seems that folks are a little more precise when communicating with a system.

From: "Mihai Rotaru"
To:
Subject: CS3710 - Questions
Date: Tue, 12 Feb 2002 23:07:18 -0500

Hi Matt,

Here are my questions:

Litman et al.
1. Is it possible that the misrecognition model found is a model of the ASR capabilities or of the corpus? It has been argued that it is not a model for hyperarticulated speech. Were the results similar with the W99 corpus (the same dramatic improvement on WER)?

2. Is WER a good measure of misrecognition? The experiments labeled a turn misrecognized if the WER>0.
Rejecting a turn based on this criterion doesn't seem to be very practical (if we misrecognized the user saying "a" instead of "an", should we reject the turn???). Thus, where will this criterion be useful?
While CA seems to be a more practical criterion, predicting based on CA criteria does not seem to be significantly improved when using prosody...

Heeman and Allen
1. I guess that extending the acoustic model to include not only the words but the intonational phrases and speech repairs (especially the change in acoustics when entering the editing term) will further improve accuracy. Does anyone know of such work? I guess the best way to verify this intuition will be me speaking something in Romanian and someone trying to detect intonational phrases and speech repairs :-)

Mihai Rotaru

Date: Wed, 13 Feb 2002 00:38:12 -0500 (EST)
From: "Alan D. Berfield"
To: "Matthew T. Bell"
Subject: questions

Litman et al.
Is WER or CA more useful/practical for a system? How about using both?
Can anyone think of reasons why prosody would not work as well at predicting CA-defined errors?
Just curious: Why do they call it the "goat" phenomenon?

Shriberg et al.
They mention that they intend to explore combinations of prosody and word analysis. Have they or anyone else done anything with this yet?

Date: Wed, 13 Feb 2002 01:10:33 -0500 (EST)
From: Antonio Roque
Subject: CS 3710 questions
To: mbell@cs.pitt.edu

"Predicting Automatic Speech Recognition Performance Using Prosodic Cues" builds on the idea that "hyperarticulated speech... may be associated with recognition failures" (section 3).
This is true in human-to-human conversations, and if it's currently true of spoken dialogue systems, it's probably because humans still interact with SDSes as if they were human. However, assuming SDSes don't achieve full human language ability in the near future, humans will find distinct ways of interacting with SDSes, and those ways may not include the hyperarticulation features studied here. If this is the case, then this approach for improving ASR performance could be tied only to the temporary unfamiliarity that humans currently have with SDSes.

Date: Wed, 13 Feb 2002 05:20:10 -0500
To: mbell@cs.pitt.edu
From: Eric Williams
Subject: discussion questions

Litman, et al
The discussion section poses several questions that would be answered in further work. I'm wondering if our professor could tell us if they were answered. (It's kinda cool having the author of a paper available to query in person. :) )

Shriberg, et al, et al et al... (Geez, a lot of people were involved with this!)
WOW! I was really impressed with the thoroughness of this paper and underlying project. I'd very much like to know what sort of follow-up work has been done.

Eric Williams
PhD student in Intelligent Systems
email: funkydung@pobox.com
AIM: funkidung (work), phloidian (home)
ICQ: 11781402
MSN: phloidian
Reply to funkydung@pobox.com
*****************
Sorry no quote today. Visit qliner.com.
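
The WER > 0 criterion questioned in the messages above can be made concrete with a minimal sketch. It assumes WER is computed in the usual way, as word-level edit distance divided by the number of reference words; the function name `word_error_rate` and the example turn are hypothetical and not taken from Litman et al. The point is only that a single "a"/"an" substitution already makes WER > 0, so the turn would be labeled misrecognized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev holds the previous row of the edit-distance table,
    # starting with the cost of inserting every hypothesis word.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical turn: the recognizer hears "a" where the user said "an".
reference = "i want an early train"
hypothesis = "i want a early train"
wer = word_error_rate(reference, hypothesis)
misrecognized = wer > 0  # the WER-based labeling under discussion
print(wer, misrecognized)
```

One substitution in five words gives a WER of 0.2, which is small but still nonzero. A concept-level measure such as CA would presumably be unaffected by this particular error, which is the contrast the questions above are probing.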

Date: Wed, 13 Feb 2002 08:57:21 -0500
From: Roy Wilson
Subject: Questions
To: mbell@cs.pitt.edu

The Heeman-Allen work is, in the words of Darth Vader, "most impressive." My understanding and appreciation of the work would gain from my having considerably more brains/time to put into reading it and the background it seems to presuppose. Since ignorance never (well, hardly ever :-)) stops me from levying criticism, here goes.

(1) The authors observe that comparing the performance of their model to others is problematic because of different corpora, inputs, and because their work is the first to combine a number of approaches in a unified statistical LM. True enough, but what about using the TRAINS corpora and simply implementing (some of) the other proposed approaches (such as Stolcke and Shriberg) as well? (Of course, this would make the work even more difficult to do and more difficult for ME to follow!)

(2) This work seems quite data and labor intensive. Per Amy's comment last time, I wonder: given the model and software needed to do the calculations, how many person-hours would it take to carry it out using a different corpus covering a different domain?

On the plus side, I appreciated their use of the corpus branching perplexity to estimate the size of the search space on pp. 16-17 when trying to predict the next POS tag. This allows them to show the magnitude of the payoff for using a richer history.
--
Roy Wilson
AI Programmer/Natural Language Generation
rwilson@pitt.edu
CIRCLE Group
LRDC #701
Learning Research and Development Center
412-624-7464
University of Pittsburgh

From: twilson@cs.pitt.edu
Subject: q's for class
To: mbell@cs.pitt.edu (Matthew T. Bell)
Date: Wed, 13 Feb 2002 09:37:16 -0500 (EST)

"Predicting Automatic Speech Recognition Performance Using Prosodic Cues"

What does it mean for the mean of a raw value to be normalized by the value of the first or preceding turn? (From section 3)

-----

Heeman and Allen

There are a lot of details that are covered in this paper, which leaves room for a lot of questions. One in particular that I have is about the classification trees that they are building (described in sections 3.3.2 and 3.3.3). While I believe I understand how they are building and using the POS tree, I don't quite understand how they are building and in particular how they are using the word classification trees for the POS tags. It does seem like their approach is an interesting and novel one, though, so I'd like to get a clear picture of what they are doing.

Theresa
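
One plausible reading of the normalization question above, offered as a sketch and not as the paper's confirmed procedure: each turn's raw feature value (say, mean F0) is divided by the same feature's value in the first turn of the dialogue, or in the immediately preceding turn, so the feature is expressed relative to the speaker's own baseline. Using division instead of subtraction is an assumption, as are the function names and the F0 values below.

```python
from typing import List, Optional


def normalize_by_first(values: List[float]) -> List[float]:
    """Express each turn's value relative to the first turn of the dialogue."""
    if not values:
        return []
    first = values[0]  # assumed nonzero, as for F0 or duration
    return [v / first for v in values]


def normalize_by_previous(values: List[float]) -> List[Optional[float]]:
    """Express each turn's value relative to the turn just before it."""
    if not values:
        return []
    # The first turn has no predecessor, so it gets no normalized value.
    ratios: List[Optional[float]] = [None]
    ratios.extend(curr / prev for prev, curr in zip(values, values[1:]))
    return ratios


# Hypothetical per-turn mean F0 values (Hz) for one speaker in one dialogue.
f0_mean = [200.0, 220.0, 260.0, 210.0]
by_first = normalize_by_first(f0_mean)
by_previous = normalize_by_previous(f0_mean)
print(by_first)
print(by_previous)
```

Under this reading, a value above 1 means the turn is higher than the baseline it is compared against, regardless of whether the speaker's voice is naturally high or low. Normalizing by the first turn measures drift over the whole dialogue, while normalizing by the preceding turn measures the change from one turn to the next.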