CS 0007 Assignment 2 Fall 2009

Introduction

The purpose of this assignment is to give you more practice writing functions and programs in Python, and in particular writing code with strings and lists.

You should spend at least one hour reading through this handout to make sure you understand what is required, so that you don't waste your time going in circles, or worse yet, hand in something you think is correct and find out later that you misunderstood the specifications. Highlight the handout, scribble on it, and find all the things you're not sure about as soon as possible.

Text analysis program

Suppose you are assessing books for use in an elementary school curriculum. You need to know things like whether or not the vocabulary is appropriate for a certain grade level, how long the sentences are, and some measure of the readability of the text.

A program has been started that does this sort of analysis (this is a slight modification of a file provided to us by the University of Toronto). Download analysis.py. The program gives the user two options: they can use the program to determine the vocabulary of the file (the list of distinct words, presented in alphabetical order) or to compute statistics such as average sentence length and readability.

Here are some test files: test1 test2 test3 test4
And here is output for them. Note that two extra lines are printed, to show you intermediate results of the program. Please do not print these in your submissions.
First, here is the output for command 'stats':
test1 output test2 output test3 output test4 output
Now, here is the output for command 'vocab':
test1 output test2 output test3 output test4 output

One of the statistics that the program computes is the Automated Readability Index, or ARI, of the text. The ARI is an estimate of the minimum grade level needed to understand the text. For instance, an ARI value of 4.8 indicates that a student in late grade four should be able to comprehend the text. The ARI is defined as:

ARI = 4.71 x (number of characters / number of words) + 0.5 x (number of words / number of sentences) - 21.43
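As a rough illustration of the arithmetic, the ARI formula could be written as a small Python function. This is only a sketch: the function name and parameter names are made up for this example and are not taken from analysis.py. The calls to float() are there so the divisions are never truncated, even in a Python version where dividing two integers gives an integer.

```python
def automated_readability_index(num_chars, num_words, num_sentences):
    """Return the ARI for the given counts.

    Hypothetical helper for illustration only; the names are not
    taken from analysis.py.
    """
    # Average number of characters per word.
    chars_per_word = float(num_chars) / num_words
    # Average number of words per sentence.
    words_per_sentence = float(num_words) / num_sentences
    return 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43


# Example: 460 characters, 100 words, 10 sentences.
example_ari = automated_readability_index(460, 100, 10)
print(example_ari)
```

With these counts the average word has 4.6 characters and the average sentence has 10 words, so the result is a little above 5, suggesting roughly a grade-five reading level.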

You will notice that the program reads the entire text file into a string, and then uses the function split to break that string up into a list of strings, each of which is intended to be a word. The split function uses white space (blanks, tabs and newlines) to decide where the word boundaries are. Of course, most punctuation is adjacent to a word, so using blanks to decide on the word boundaries doesn't always do what you'd wish. For example, in the sentence "I am a Bear of Very Little Brain, and long words bother me." the word list will include "Brain," and "me."

Function separate_punctuation, which you will write, goes through the word list and replaces strings like "Brain," with two strings (in this case, "Brain" and ",").

This whole approach to dealing with punctuation is simple, but not perfect. For example, some punctuation can come at the beginning of a word (such as quotation marks) or between two words with no whitespace to separate them (dashes often appear this way). In these cases, the program will do imperfect things. For example, in the sentence "Some techniques use a key--a special value." the program will treat "key--a" as a single word, because there were no blanks to separate the words. A fancier program would do better, but do not try to improve these aspects of the Assignment 2 code.
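The behaviour described above can be seen in a short experiment. The sketch below first shows what split produces for the two example sentences, and then shows one possible way to peel a single trailing punctuation mark off a word. The helper split_trailing_punctuation and its parameters are hypothetical; they are not the required separate_punctuation function, whose exact parameters and return value are given in analysis.py. In particular, this sketch builds a new list, which may or may not match what the assignment asks for.

```python
def split_trailing_punctuation(words, punctuation):
    """Return a new list in which any word ending in one of the
    strings in punctuation is replaced by the word and the mark.

    Hypothetical helper for illustration; not the assignment's
    separate_punctuation.
    """
    result = []
    for word in words:
        # Only split when something is left over in front of the mark.
        if len(word) > 1 and word[-1] in punctuation:
            result.append(word[:-1])
            result.append(word[-1])
        else:
            result.append(word)
    return result


sentence = "I am a Bear of Very Little Brain, and long words bother me."
words = sentence.split()
# words contains "Brain," and "me." with the punctuation still attached.

separated = split_trailing_punctuation(words, [",", "."])
# separated contains "Brain", ",", "me" and "." as separate strings.

dashed = "Some techniques use a key--a special value.".split()
# "key--a" has no whitespace inside it, so it stays a single string.
```

Note that only punctuation at the very end of a string is handled here, which is why "key--a" is left alone, just as the handout describes.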

The program I've provided for you runs, but does almost nothing, because many of the functions have an empty body. Your task is to complete the program by filling in the missing parts. Search for the string "pass" to find them. All the input and output statements required by the program have already been written. Do not add any further input or output.

You should consider using helper functions, and always write a docstring for any function that you define.

Hint: In function print_stats, when compiling statistics like the number of sentences and average sentence length, you will need to be able to tell when you've hit the end of a sentence. The list of end-of-sentence punctuation will be helpful for this. (This is why the function is given a list of end-of-sentence punctuation separate from the list of within-sentence punctuation.) And note that neither sort of punctuation counts as a word, so it should not be included in calculations like average word length or average sentence length.
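To make the hint concrete, here is a minimal sketch of the counting idea, assuming the word list has already had its punctuation separated into individual strings. The function name, its parameters, and the decision to count every end-of-sentence mark as one sentence are assumptions made for this example; print_stats in analysis.py has its own parameters and required output.

```python
def count_sentences_and_words(tokens, end_punctuation, within_punctuation):
    """Return a (number of sentences, number of words) pair.

    Hypothetical helper for illustration; tokens is a list of strings
    in which punctuation marks appear as separate items.
    """
    num_sentences = 0
    num_words = 0
    for token in tokens:
        if token in end_punctuation:
            # An end-of-sentence mark closes one sentence.
            num_sentences += 1
        elif token not in within_punctuation:
            # Anything that is not punctuation counts as a word.
            num_words += 1
    return num_sentences, num_words


tokens = ["I", "am", ",", "you", "are", ".", "Hi", "!"]
sentences, word_count = count_sentences_and_words(tokens, [".", "!", "?"], [",", ";", ":"])
average_sentence_length = float(word_count) / sentences
```

In this example there are two sentences and five words, so the average sentence length is 2.5 words. The comma and the two end-of-sentence marks are not counted as words.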

Grading

These are the aspects of your work that we will focus on in the grading:

What to Hand In

Hand in the following file: