Suppose you are assessing books for use in an elementary school curriculum. You need to know things like whether or not the vocabulary is appropriate for a certain grade level, how long the sentences are, and some measure of the readability of the text.
A starter program that does this sort of analysis is provided (it is a modification of a file provided to us by the University of Toronto). Download analysis.py.txt. The program gives the user two options: they can use it to determine the vocabulary of the file (the list of distinct words, presented in alphabetical order) or to compute statistics such as average sentence length and readability.
All of the files listed below are in this single zip file.
Here are some text files:
text1
text2
text3
text4
And here is output for them. Note that two extra lines are printed,
to show you intermediate results of the program. Please do not print
those extra lines in your submissions.
First, here is the output for command 'stats':
text1
output
text2
output
text3
output
text4
output
Now, here is the output for command 'vocab':
text1
output
text2
output
text3
output
text4
output
One of the statistics that the program computes is the Automated Readability Index, or ARI, of the text. The ARI is an estimate of the minimum grade level needed to understand the text. For instance, an ARI value of 4.8 indicates that a student in late grade four should be able to comprehend the text. The ARI is defined as 4.71 x (number of characters / number of words) + 0.5 x (number of words / number of sentences) - 21.43.
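As a minimal sketch, the ARI formula above can be written as a small function. The function name and its parameters are hypothetical and are not taken from analysis.py. The starter code decides what counts as a character, a word, and a sentence.

```python
def automated_readability_index(num_chars, num_words, num_sentences):
    """Return the ARI for the given character, word, and sentence counts.

    Hypothetical helper for illustration; the caller is assumed to pass
    counts greater than zero for words and sentences.
    """
    chars_per_word = num_chars / num_words
    words_per_sentence = num_words / num_sentences
    return 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43


# Example: 40 characters, 10 words, 2 sentences.
ari = automated_readability_index(40, 10, 2)
print(round(ari, 2))  # prints -0.09
```

Very short words and sentences can produce a value below zero, as in this toy example.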
You will notice that the program reads the entire text file into a
string, and then uses the function split to break that string
up into a list of strings, each of which is intended to be a word.
The split function uses whitespace (blanks, tabs and
newlines) to decide where the word boundaries are.
Of course, most punctuation is adjacent to a word, so using blanks
to decide on the word boundaries doesn't always do what you'd wish.
For example, in the sentence
"I am a Bear of Very Little Brain, and long words bother me."
the word list will include "Brain," and "me."
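To see this behaviour directly, the sentence above can be split in the Python shell. This sketch uses only the built-in str.split method. The variable names are chosen for illustration.

```python
sentence = "I am a Bear of Very Little Brain, and long words bother me."

# split() with no arguments breaks the string on runs of whitespace only,
# so punctuation stays attached to the neighbouring word.
words = sentence.split()
print(words)
# ['I', 'am', 'a', 'Bear', 'of', 'Very', 'Little', 'Brain,', 'and',
#  'long', 'words', 'bother', 'me.']
```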
Function separate_punctuation, which you will write,
goes through the word list and replaces strings like "Brain,"
with two strings (in this case, "Brain" and ",").
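One possible shape for this step is sketched below. The parameter list and the choice to return a new list are assumptions made for illustration. The starter file may pass different arguments or expect the list to be changed in place, so follow the docstring in analysis.py where it differs.

```python
def separate_punctuation(words, punctuation):
    """Return a new list in which one trailing punctuation mark is split
    off each word that ends with one.

    Hypothetical signature: 'punctuation' is assumed to be a list of
    single-character strings.
    """
    result = []
    for word in words:
        # Only split when something is left over to serve as the word.
        if len(word) > 1 and word[-1] in punctuation:
            result.append(word[:-1])
            result.append(word[-1])
        else:
            result.append(word)
    return result


print(separate_punctuation(["Little", "Brain,", "bother", "me."], [",", "."]))
# ['Little', 'Brain', ',', 'bother', 'me', '.']
```

This sketch removes at most one trailing character per word and leaves punctuation at the start or in the middle of a word alone.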
This whole approach to dealing with punctuation is simple, but not
perfect. For example, some punctuation can come at the beginning of
a word (such as quotation marks) or between two words with
no whitespace to separate them (dashes often appear this way).
In these cases, the program will do imperfect things.
For example, in the sentence
"Some techniques use a key--a special value."
the program will treat "key--a" as a single word,
because there were no
blanks to separate the words.
A fancier program would do better, but do not try to improve these
aspects of the Assignment 2 code.
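The limitation described above is easy to confirm with the built-in str.split method. The variable names here are illustrative only.

```python
sentence = "Some techniques use a key--a special value."

# There is no whitespace inside "key--a", so split() keeps it as one string.
words = sentence.split()
print(words)
# ['Some', 'techniques', 'use', 'a', 'key--a', 'special', 'value.']
```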
Your task is to complete analysis.py by filling in the missing parts. Do not add any further input or output to your submission (though feel free to add debug print statements while you develop your solution).
You should consider defining additional functions to structure your program well; always write a docstring for any function that you define.
Hint: In function print_stats, when compiling
statistics like the number of sentences and
average sentence length, you will need to be able to tell when you've
hit the end of a sentence.
The list of end-of-sentence punctuation will be helpful for this.
(This is why the function is given a
list of end-of-sentence punctuation separate from the list of
within-sentence punctuation.)
And note that neither sort of punctuation counts as a word, so it
should not be included in calculations like average word length
or average sentence length.
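The hint can be sketched roughly as follows, assuming the word list has already had its punctuation separated into individual strings. The function name, its parameters, and its return value are hypothetical. The real print_stats in analysis.py prints its results and may be organized differently.

```python
def sentence_stats(tokens, end_punct, within_punct):
    """Return (number of sentences, number of words, average sentence length).

    Hypothetical helper: 'tokens' is a list of words and separated
    punctuation marks; the two punctuation arguments are lists of strings.
    """
    num_sentences = 0
    num_words = 0
    for token in tokens:
        if token in end_punct:
            # An end-of-sentence mark closes the current sentence.
            num_sentences += 1
        elif token not in within_punct:
            # Anything that is not punctuation counts as a word.
            num_words += 1
    average = num_words / num_sentences if num_sentences else 0.0
    return num_sentences, num_words, average


tokens = ["I", "am", "a", "Bear", ".", "Long", "words", ",", "bother", "me", "!"]
print(sentence_stats(tokens, [".", "!", "?"], [",", ";", ":"]))
# (2, 8, 4.0)
```

This sketch counts every end-of-sentence mark as one sentence, so two marks in a row would count twice, and any words after the last mark are not treated as a separate sentence. Check the given output files to see how such cases should be handled.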
These are the aspects of your work that we will focus on in the grading:
Correctness (85%): Your functions and your main program should perform as specified. The TA will run your program on the given input texts and automatically check that its output matches the given output texts, so your output should match the given output files exactly.
Formatting and Programming style (10%): Make sure that you read the style rules page for some general rules and guidelines about formatting your code.
Hand in the following file:
analysis.py