HOMEWORK 2 (CS 2731 / ISSP 2230)

Assigned: October 3, 2011

Due: October 24, 2011 (midnight)

There are three parts to this homework. In the first part, you are to write a Context Free Grammar (CFG) using NLTK. A readme for downloading and using the toolkit can be found at: nltk_readme.html

The second part of this homework involves downloading and using a robust statistical parser developed at Stanford. You can find information for downloading and using the Stanford parser (download the PCFG version, not the dependency parser) at: http://nlp.stanford.edu/software/lex-parser.shtml

The third part of the homework involves answering written questions about statistical and lexical parsing.

If you have questions, I have created a discussion board on Blackboard where you can discuss them with classmates.

1. Context Free Grammar (60 points)

Write a context free grammar in NLTK to handle the (slightly modified) story of Where the Wild Things Are, a classic children's book by Maurice Sendak. You must write the grammar yourself. Do not use publicly available grammars or build the grammar by reverse engineering.

You will find the story in WildthingswithuncorrectedPOS.txt. This file contains the story tagged with Penn Treebank POS tags. One change you may want to make is to convert the different verb forms (e.g., VBD, VBZ) into just VB and the different forms of common nouns (e.g., NNS, NN) into just NN, which would simplify the creation of CFG rules. Whether you do that, and exactly how, is up to you. You will be creating grammar rules that allow the system to parse relative clauses (RC), prepositional phrases (PP), verb phrases (VP) (transitive, intransitive, bitransitive, and with an embedded sentence as object), embedded sentences (call them S_BAR), and time modifiers (TIME). You should also allow some constructions (NP, VP, S_BAR) to contain conjunction, and sentences to include subordinating conjunction (i.e., two embedded sentences joined by a subordinating conjunction).
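The optional tag simplification described above could be sketched as follows. This is only an illustration under stated assumptions: the helper names `collapse_tag` and `collapse_tagged_sentence` are hypothetical, and the sketch assumes the story has already been read into (word, tag) pairs, which may not match the layout of the provided file directly. Proper-noun tags are left untouched because only common nouns are mentioned above.

```python
# Hypothetical helpers for collapsing fine-grained Penn Treebank tags.
# Assumption: the tagged story is available as a list of (word, tag) pairs.

def collapse_tag(tag):
    """Map verb tags (VB, VBD, VBZ, ...) to VB and common-noun tags (NN, NNS) to NN."""
    if tag.startswith("VB"):
        return "VB"
    if tag in ("NN", "NNS"):
        return "NN"
    # Everything else, including proper nouns (NNP, NNPS), is kept unchanged.
    return tag

def collapse_tagged_sentence(tagged):
    """Apply collapse_tag to every (word, tag) pair in one sentence."""
    return [(word, collapse_tag(tag)) for word, tag in tagged]

# Illustrative input only, not copied from the assignment file.
example = [("Max", "NNP"), ("wore", "VBD"), ("his", "PRP$"),
           ("wolf", "NN"), ("suit", "NN")]
print(collapse_tagged_sentence(example))
# → [('Max', 'NNP'), ('wore', 'VB'), ('his', 'PRP$'), ('wolf', 'NN'), ('suit', 'NN')]
```

Collapsing tags trades detail for fewer rules: the grammar can no longer distinguish tense or number, so whether the simplification is worthwhile depends on which distinctions your rules need.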

You may find it helpful to consult a grammar of English during this process. Chapter 12 of our textbook is the best resource. You may also find it useful to look at the annotation guidelines given for the Penn Treebank POS tags. These guidelines give examples for different parts of speech and reasons why a word in a particular usage falls into that POS. You will find them here: tagguide.pdf

You will be graded in part on whether your rules are syntactically justifiable. You should attempt to make your rules general where possible (i.e., don't make new rules for each and every new string of words you see; a rule should ideally cover several phrases that you see in the input).

You should hand in a file containing your grammar, documented so that it describes why you selected the grammar rules that you did. For example, you may justify using a particular set of rules because they handle multiple constructions or because they follow a rule that you found in Chapter 12.

We will run your grammar in the NLTK parser environment using the chart parser.

You will be graded on the following elements:

  1. A parse tree is produced for each sentence. (15 points)
  2. Each of the constructions listed above is represented in the grammar. (20 points)
  3. Choices about the particular rules used and the resulting parse are adequately justified. Points will be deducted for parse trees that do not capture a good structure of the language according to your justification. (25 points)

CLARIFICATIONS ADDED 10/13:

2. Stanford Parser (20 points)

Download the Stanford Parser and run the PCFG parser (not the dependency parser) on letterman-cnn.txt.

(10 points) Select two sentences where the parse returned is incorrect. One of your choices can be when the parser entirely fails (produces no output), but one should be where a parse is produced, but you don't think it is a good one. In each case explain why the parser failed and what should have been produced.

(10 points) As a comparison point, try your own grammar on the two sentences chosen above, as well as on two sentences that were parsed correctly by the Stanford Parser, and discuss the differences between the output of the two grammars/parsers.

CLARIFICATION ADDED 10/20:

3. Statistical and Lexical Grammar (20 points)

(10 points) Suppose you use the output of your parser on Where the Wild Things Are as a Treebank. Show how you would compute the probabilities, and what they are, for the rules for VP and the rules for NP.
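One common way to estimate rule probabilities from a treebank is by relative frequency: the probability of a rule A → β is Count(A → β) divided by Count(A). The sketch below illustrates that computation only. The tree encoding (nested tuples of the form `(label, child, ...)` with words as plain strings), the function names, and the two toy trees are all assumptions made for the example; they are not your parser's output and not the expected answer.

```python
from collections import Counter

def productions(tree):
    """Yield (lhs, rhs) for every internal node of a nested-tuple tree."""
    label, children = tree[0], tree[1:]
    # The right-hand side lists child labels, or the word itself for a leaf.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)

def rule_probabilities(trees, lhs_of_interest=("NP", "VP")):
    """Relative-frequency estimate: Count(A -> beta) / Count(A)."""
    rule_counts = Counter()
    lhs_counts = Counter()
    for tree in trees:
        for lhs, rhs in productions(tree):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {
        (lhs, rhs): count / lhs_counts[lhs]
        for (lhs, rhs), count in rule_counts.items()
        if lhs in lhs_of_interest
    }

# Toy two-tree "treebank", invented purely for illustration.
toy_treebank = [
    ("S", ("NP", ("NNP", "Max")), ("VP", ("VB", "sailed"))),
    ("S", ("NP", ("NNP", "Max")),
          ("VP", ("VB", "made"), ("NP", ("NN", "mischief")))),
]

for (lhs, rhs), p in sorted(rule_probabilities(toy_treebank).items()):
    print(f"{lhs} -> {' '.join(rhs)}  {p:.3f}")
```

By construction, the probabilities of all rules sharing a left-hand side sum to 1, which is a useful sanity check on your own counts.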

(10 points) You will find that your CFG often generates multiple parses for a sentence. This can happen when PP attachment is ambiguous or when the scope of conjunction is ambiguous (and for other constructions too). Consider the following cases: sentence 18, where "of all wild things" could modify "king" or "made"; sentence 6, where "to bed" could be an argument of the verb or a modifier of the verb; sentence 10, where "for Max" could modify either "boat" or "tumbled by"; and sentence 12, where "to the land" could modify "day" or "sailed off" and "of the wild things" could modify "land," "day," or "sailed off". You cannot use your parser output in this case to compute probabilities to do disambiguation. Why not?

You can access the Penn Treebank from here or from the following account: /afs/cs.pitt.edu/projects/nlp/CORPORA/ptb_v3/parsed/mrg/wsj. You will see multiple sections (e.g., 00), and in each section there are individual reports that have been parsed by hand. Use sections 00, 01, 02, 03, 04, and 05 to compute counts. Describe what you would need to count, how your rules would be formulated, and how you would use them in a probabilistic lexicalized framework in order to do disambiguation in these cases (note: a rough description will do; you do not need to describe a probabilistic version of CKY). Show the counts for disambiguating whether "to the land" attaches to "sailed off" or "day".
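As a rough illustration of how lexicalized counts might be compared for a PP attachment decision, the sketch below estimates, for each candidate head word, how often a PP headed by a given preposition attaches to it. This is one simple formulation among several, not the required answer. The function names are hypothetical, and the numbers in `toy_attach` and `toy_heads` are invented placeholders, not counts from the Penn Treebank.

```python
def attachment_probability(attach_counts, head_counts, head, prep):
    """Estimate P(PP headed by prep attaches | head) by relative frequency.

    attach_counts[(head, prep)]: times a PP headed by prep attached to head.
    head_counts[head]: total occurrences of head as a phrase head.
    """
    total = head_counts.get(head, 0)
    if total == 0:
        return 0.0  # unseen head: no evidence (a real system would smooth)
    return attach_counts.get((head, prep), 0) / total

def choose_attachment(attach_counts, head_counts, verb, noun, prep):
    """Return ('verb' or 'noun', p_verb, p_noun) for the more likely site."""
    p_verb = attachment_probability(attach_counts, head_counts, verb, prep)
    p_noun = attachment_probability(attach_counts, head_counts, noun, prep)
    site = "verb" if p_verb >= p_noun else "noun"
    return site, p_verb, p_noun

# Invented placeholder counts, for illustration only.
toy_attach = {("sailed", "to"): 3, ("day", "to"): 1}
toy_heads = {"sailed": 4, "day": 10}

print(choose_attachment(toy_attach, toy_heads, "sailed", "day", "to"))
# → ('verb', 0.75, 0.1)
```

With real treebank counts, many head and preposition pairs will be rare or unseen, so some form of smoothing or backing off to less specific counts would likely be needed.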