The purpose of this assignment is to give you more practice writing functions and programs in Python, and in particular writing code with strings and lists.
Suppose you are assessing books for use in an elementary school curriculum. You need to know things like whether or not the vocabulary is appropriate for a certain grade level, how long the sentences are, and some measure of the readability of the text.
A program has been started that does this sort of analysis (it is a slight modification of a file provided to us by the University of Toronto). Download analysis.py.txt. The program gives the user two options: they can use the program to determine the vocabulary of the file (the list of distinct words, presented in alphabetical order) or to compute statistics such as average sentence length and readability.
Here are some test files:
test1
test2
test3
test4
And here is the output for them. Note that two extra lines are printed to show you intermediate results of the program. Please do not print these in your submission.
First, here is the output for command 'stats':
test1
output
test2
output
test3
output
test4
output
Now, here is the output for command 'vocab':
test1
output
test2
output
test3
output
test4
output
One of the statistics that the program computes is the Automated Readability Index, or ARI, of the text. The ARI is an estimate of the minimum grade level needed to understand the text. For instance, an ARI value of 4.8 indicates that a student in late grade four should be able to comprehend the text. The ARI is defined as
ARI = 4.71 × (number of characters / number of words) + 0.5 × (number of words / number of sentences) - 21.43
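As a quick check of the ARI formula, here is a minimal sketch. The function name automated_readability_index and its three parameters are made up for illustration and are not taken from analysis.py. It assumes Python 3, where / is true division.

```python
def automated_readability_index(num_chars, num_words, num_sentences):
    """Return the ARI for the given character, word and sentence counts.

    Hypothetical helper for illustration only; assumes num_words and
    num_sentences are both greater than zero.
    """
    chars_per_word = num_chars / num_words
    words_per_sentence = num_words / num_sentences
    return 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43


# Example: 400 characters, 100 words, 10 sentences.
# 4.71 * 4 + 0.5 * 10 - 21.43 is about 2.41 (early grade two).
print(round(automated_readability_index(400, 100, 10), 2))
```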
You will notice that the program reads the entire text file into a
string, and then uses the function split to break that string
up into a list of strings, each of which is intended to be a word.
The split function uses white space (blanks, tabs and
newlines) to decide where the word boundaries are.
Of course, most punctuation is adjacent to a word, so using blanks
to decide on the word boundaries doesn't always do what you'd wish.
For example, in the sentence
"I am a Bear of Very Little Brain, and long words bother me."
the word list will include "Brain," and "me."
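You can see this behaviour for yourself in a few lines. The variable names here are only for illustration.

```python
text = "I am a Bear of Very Little Brain, and long words bother me."

# split() with no arguments breaks the string on any run of whitespace.
words = text.split()
print(words)

# Tabs and newlines count as whitespace too.
print("one\ttwo\nthree  four".split())
```

The first list contains "Brain," and "me." with the punctuation still attached, which is the problem that separate_punctuation is meant to fix.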
Function separate_punctuation, which you will write,
goes through the word list and replaces strings like "Brain,"
with two strings (in this case, "Brain" and ",").
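To make the idea concrete, here is one rough sketch of the transformation. The parameters and return value are assumptions: the starter file may use a different signature, or may expect the list to be changed in place, so follow the docstring in analysis.py rather than this sketch. It only handles a single punctuation mark at the end of a word.

```python
def separate_punctuation(words, punctuation):
    """Return a new list in which a trailing punctuation mark on a word
    becomes its own string, e.g. "Brain," becomes "Brain" and ",".

    Assumed signature, for illustration only: words is a list of
    strings and punctuation is a list of one-character strings.
    """
    result = []
    for word in words:
        if len(word) > 1 and word[-1] in punctuation:
            result.append(word[:-1])   # the word itself
            result.append(word[-1])    # the punctuation mark
        else:
            result.append(word)
    return result


print(separate_punctuation(["Little", "Brain,", "bother", "me."], [",", "."]))
```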
This whole approach to dealing with punctuation is simple, but not
perfect. For example, some punctuation can come at the beginning of
a word (such as quotation marks) or between two words with
no whitespace to separate them (dashes often appear this way).
In these cases, the program will do imperfect things.
For example, in the sentence
"Some techniques use a key--a special value."
the program will treat "key--a" as a single word,
because there are no blanks to separate the words.
A fancier program would do better, but do not try to improve these
aspects of the Assignment 2 code.
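The following lines show this limitation using the example sentence. This is only a demonstration of what split produces, not something you need to add to your program.

```python
words = "Some techniques use a key--a special value.".split()

# "key--a" contains no whitespace, so it stays as one string.
print(words)
print("key--a" in words)
```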
Your task is to complete analysis.py by filling in the missing parts. All the input and output statements required by the program have already been written. Do not add any further input or output.
You should consider using helper functions, and always write a docstring for any function that you define.
Hint: In function print_stats, when compiling
statistics like the number of sentences and
average sentence length, you will need to be able to tell when you've
hit the end of a sentence.
The list of end-of-sentence punctuation will be helpful for this.
(This is why the function is given a
list of end-of-sentence punctuation separate from the list of
within-sentence punctuation.)
And note that neither sort of punctuation counts as a word, so it
should not be included in calculations like average word length
or average sentence length.
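One way to picture the hint is sketched below. The function name sentence_stats and its parameters are hypothetical and do not come from analysis.py. The sketch assumes the punctuation has already been separated from the words, and that every end-of-sentence mark closes exactly one sentence (so a run such as "..." would be counted three times).

```python
def sentence_stats(words, end_punctuation, within_punctuation):
    """Return a tuple (number of sentences, average sentence length in words).

    Hypothetical helper for illustration only. Punctuation strings are
    never counted as words.
    """
    num_sentences = 0
    num_words = 0
    for token in words:
        if token in end_punctuation:
            num_sentences += 1          # hit the end of a sentence
        elif token not in within_punctuation:
            num_words += 1              # an ordinary word
    if num_sentences == 0:
        return 0, 0.0
    return num_sentences, num_words / num_sentences


tokens = ["I", "am", "a", "Bear", ",", "and", "words", "bother", "me", ".",
          "Oh", "bother", "!"]
print(sentence_stats(tokens, [".", "!", "?"], [",", ";", ":"]))
```

In this example there are 10 words in 2 sentences, so the average sentence length is 5.0 words.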
These are the aspects of your work that we will focus on in the grading:
Correctness (80%): Your functions and your main program should perform as specified.
Formatting style (10%): Make sure that you read the style rules page for some general rules and guidelines about formatting your code.
Programming style (10%): Your variable names should be meaningful and your code as clear and non-redundant as possible.
Hand in the following file:
analysis.py