Assignment 3 CS0008 Spring 2015

Text analysis program

Suppose you are assessing books for use in an elementary school curriculum. You need to know things like whether or not the vocabulary is appropriate for a certain grade level, how long the sentences are, and some measure of the readability of the text.

A program that does this sort of analysis has been started (it is a modification of a file provided to us by the University of Toronto). Download analysis.py.txt. The program gives the user two options: determine the vocabulary of the file (the list of distinct words, presented in alphabetical order), or compute statistics such as average sentence length and readability.

All of the files listed below are in this single zip file.

Here are some text files:
text1 text2 text3 text4

And here is the output for them. Note that two extra lines are printed to show you intermediate results of the program. Please do not print those extra lines in your submission.

First, here is the output for command 'stats':

text1 output text2 output text3 output text4 output

Now, here is the output for command 'vocab':

text1 output text2 output text3 output text4 output

One of the statistics that the program computes is the Automated Readability Index, or ARI, of the text. The ARI is an estimate of the minimum grade level needed to understand the text. For instance, an ARI value of 4.8 indicates that a student in late grade four should be able to comprehend the text. The ARI is defined as:

ARI = 4.71 x (number of characters / number of words) + 0.5 x (number of words / number of sentences) - 21.43
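As a quick check of the formula, here is a minimal sketch in Python 3. The function name and parameter names are made up for illustration and are not taken from analysis.py; the sketch also assumes that the counts are already known and that neither the word count nor the sentence count is zero.

```python
def automated_readability_index(num_chars, num_words, num_sentences):
    """Return the ARI for the given counts.

    Hypothetical helper, not part of the provided analysis.py.
    Assumes num_words and num_sentences are both greater than zero.
    """
    chars_per_word = num_chars / num_words          # average word length
    words_per_sentence = num_words / num_sentences  # average sentence length
    return 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43


# 500 characters, 100 words, 10 sentences:
# 4.71 * 5 + 0.5 * 10 - 21.43, which is about 7.12
print(round(automated_readability_index(500, 100, 10), 2))
```

Note that the result can be negative for very short words and sentences, since the formula subtracts a fixed 21.43.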

You will notice that the program reads the entire text file into a string, and then uses the function split to break that string up into a list of strings, each of which is intended to be a word. The split function uses white space (blanks, tabs and newlines) to decide where the word boundaries are. Of course, most punctuation is adjacent to a word, so using white space to decide on the word boundaries doesn't always do what you'd wish. For example, in the sentence "I am a Bear of Very Little Brain, and long words bother me." the word list will include "Brain," and "me." Function separate_punctuation, which you will write, goes through the word list and replaces strings like "Brain," with two strings (in this case, "Brain" and ",").

This whole approach to dealing with punctuation is simple, but not perfect. For example, some punctuation can come at the beginning of a word (such as quotation marks) or between two words with no white space to separate them (dashes often appear this way). In these cases, the program will do imperfect things. For example, in the sentence "Some techniques use a key--a special value." the program will treat "key--a" as a single word, because there is no white space to separate the words. A fancier program would do better, but do not try to improve these aspects of the provided code.
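The behaviour described above can be illustrated with a rough sketch. The parameter list, the choice to return a new list, and the punctuation string used below are all assumptions made for this example; follow the header and docstring given in analysis.py for the real function.

```python
def separate_punctuation(words, punctuation):
    """Return a new list in which punctuation at the end of each string
    in words is split off into its own one-character string.

    Illustrative sketch only: the signature and return-a-new-list
    behaviour are assumptions, not the interface in analysis.py.
    """
    result = []
    for word in words:
        trailing = []
        # Peel punctuation off the end, one character at a time.
        while len(word) > 1 and word[-1] in punctuation:
            trailing.insert(0, word[-1])
            word = word[:-1]
        result.append(word)
        result.extend(trailing)
    return result


sentence = "I am a Bear of Very Little Brain, and long words bother me."
words = sentence.split()   # includes "Brain," and "me."
print(separate_punctuation(words, ",.;:!?"))

# Punctuation between two words is left alone, as the text warns.
print(separate_punctuation("use a key--a special value.".split(), ",.;:!?-"))
```

In the second call, "key--a" stays as one string because the sketch only looks at the end of each string, which matches the limitation described above.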

Your task is to complete analysis.py by filling in the missing parts. Do not add any further input or output to your submission (though feel free to add debug print statements while you develop your solution).

You should consider defining additional functions to structure your program well; always write a docstring for any function that you define.

Hint: In function print_stats, when compiling statistics like the number of sentences and average sentence length, you will need to be able to tell when you've hit the end of a sentence. The list of end-of-sentence punctuation will be helpful for this. (This is why the function is given a list of end-of-sentence punctuation separate from the list of within-sentence punctuation.) And note that neither sort of punctuation counts as a word, so it should not be included in calculations like average word length or average sentence length.
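One way to picture the hint is the sketch below. It is not the body of print_stats; the helper name, its parameters, and the two punctuation lists are hypothetical, and it assumes the word list has already had its punctuation separated into one-character strings.

```python
def count_words_and_sentences(tokens, end_punct, within_punct):
    """Return a (number of words, number of sentences) pair.

    Hypothetical helper for illustration. tokens is a list of strings in
    which each punctuation mark is its own string; end_punct and
    within_punct are lists of one-character strings.
    """
    num_words = 0
    num_sentences = 0
    for token in tokens:
        if token in end_punct:
            num_sentences += 1      # end-of-sentence mark: a sentence ends
        elif token not in within_punct:
            num_words += 1          # anything that is not punctuation
    return num_words, num_sentences


tokens = ["I", "am", "a", "Bear", ",", "and", "words", "bother", "me", "."]
words, sentences = count_words_and_sentences(tokens, [".", "?", "!"], [",", ";", ":"])
print(words, sentences)             # 8 words, 1 sentence
print(words / sentences)            # average sentence length in words
```

This simple count treats every end-of-sentence mark as the end of a sentence, so a run such as "?!" would count twice, and text that does not end with punctuation would leave its last sentence uncounted. Whether that matters depends on the expected output files.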

Grading

These are the aspects of your work that we will focus on in the grading:

What to Hand In

Hand in the following file: