Homework 2
Natural Language Processing (CS 3730 / ISSP 3120)
Fall 2001

Assigned: October 3, 2001
Due: October 17, 2001. EXTENDED to October 24!

Please turn in your assignment to Angela Balcita, CS receptionist (located next to my MIB office).

1. (10 pts) Exercise 8.1, Jurafsky and Martin (p. 319).

   (Note the erratum for page 297: the tag for Personal Pronoun in Figure 8.6 should be PRP.)

2. (30 pts) Exercise 8.4, Jurafsky and Martin (p. 320), modified as follows.

   NOTE: see the correction on the book's errata site! (Page 320, Exercise 8.4: in both equations, the summations should range from i = 1 to n, not i-1 to n.)

   As discussed in class, your confusion matrix should contain NUMBERS (as in the description of cell(x,y) on p. 313 and in lecture 9, slide 37), not percentages as in the example later in the box on p. 313.

   You have implemented a baseline unigram tagger and wish to evaluate it on the following test set, where the correct tags are shown:

       is/VBZ expected/VBN to/TO race/VB tomorrow/NN
       the/DT race/NN for/IN outer/JJ space/NN
       book/VB the/DT race/NN

   a. First assume that the tags below are the most probable tags for the following words in your lexicon:

          book      NN
          expected  VB
          for       IN
          is        VB
          outer     JJ
          race      NN
          space     NN
          the       DT
          to        TO
          tomorrow  NN

      Compute the agreement between your baseline tagger and the gold standard test corpus, using Kappa as described in Exercise 8.4. Show your derivation! Why is it better to use Kappa rather than Percent Correct?

   b. Redo the above evaluation using a different tagger, namely one that tags every lexical item with DT (since "the" is usually the most frequent word in an English corpus). Compare your current and previous results. Is either result a good result?

3. (60 pts) See http://www.cs.pitt.edu/~litman/courses/CS3730/hws/ngrams.html
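
For Exercise 2, the following is a minimal sketch of how a confusion matrix of raw counts and a Kappa value might be computed. It assumes the usual form Kappa = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed agreement and P(E) is the agreement expected by chance from the row and column totals; check this against the (corrected) equations in Exercise 8.4 before relying on it. It also assumes that cell(x,y) counts items whose correct tag is x and whose hypothesized tag is y. The function names and the four-token example are hypothetical and are deliberately not the test set above.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Return raw counts keyed by (correct_tag, hypothesized_tag).

    Assumption: rows are correct tags, columns are the tagger's tags.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


def kappa(gold, predicted):
    """Kappa = (P(A) - P(E)) / (1 - P(E)), assumed form (see lead-in)."""
    cells = confusion_counts(gold, predicted)
    n = len(gold)
    if n == 0:
        raise ValueError("empty test set")

    # Row totals: how often each tag is correct.
    row_totals = Counter(gold)
    # Column totals: how often the tagger proposed each tag.
    col_totals = Counter(predicted)
    tags = set(row_totals) | set(col_totals)

    # Observed agreement: diagonal cells divided by the number of tokens.
    p_a = sum(cells[(t, t)] for t in tags) / n
    # Chance agreement: sum over tags of the product of the marginals.
    p_e = sum((row_totals[t] / n) * (col_totals[t] / n) for t in tags)

    if p_e == 1.0:
        raise ValueError("Kappa is undefined when chance agreement is 1")
    return (p_a - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical toy data, not the homework test set.
    gold = ["NN", "NN", "VB", "VB"]
    predicted = ["NN", "VB", "VB", "VB"]
    print(dict(confusion_counts(gold, predicted)))
    print(kappa(gold, predicted))  # P(A) = 0.75, P(E) = 0.5, so Kappa = 0.5
```

In the toy example, three of four tokens agree (P(A) = 0.75), while the marginals alone give P(E) = (2*1 + 2*3) / 16 = 0.5, so Kappa is 0.5. A written derivation for the assignment would show the same intermediate quantities.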