The purpose of this assignment is to give you more practice writing functions and programs in Python, and in particular writing code with strings and lists.
You should spend at least one hour reading through this handout to make sure you understand what is required, so that you don't waste your time going in circles, or worse yet, hand in something you think is correct and find out later that you misunderstood the specifications. Highlight the handout, scribble on it, and find all the things you're not sure about as soon as possible.
Suppose you are assessing books for use in an elementary school curriculum. You need to know things like whether or not the vocabulary is appropriate for a certain grade level, how long the sentences are, and some measure of the readability of the text.
A program has been started that does this sort of analysis (this is a slight modification of a file provided to us by the University of Toronto). Download analysis.py. The program gives the user two options: they can use the program to determine the vocabulary of the file (the list of distinct words, presented in alphabetical order) or to compute statistics such as average sentence length and readability.
Here are some test files:
test1
test2
test3
test4
And here is output for them. Note that two extra lines are printed,
to show you intermediate results of the program. Please do not print
these in your submissions.
First, here is the output for command 'stats':
test1: output
test2: output
test3: output
test4: output
Now, here is the output for command 'vocab':
test1: output
test2: output
test3: output
test4: output
One of the statistics that the program computes is the Automated Readability Index, or ARI, of the text. The ARI is an estimate of the minimum grade level needed to understand the text. For instance, an ARI value of 4.8 indicates that a student in late grade four should be able to comprehend the text. The ARI is defined as 4.71 x (number of characters / number of words) + 0.5 x (number of words / number of sentences) - 21.43.
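The ARI formula above can be sketched directly in Python. This is only an illustration: the function name and its three parameters are hypothetical and are not taken from analysis.py, and the counts are assumed to have been computed already.

```python
def automated_readability_index(num_chars, num_words, num_sentences):
    """Return the ARI for a text with the given counts.

    num_chars: total number of characters in the words of the text.
    num_words: number of words (punctuation excluded).
    num_sentences: number of sentences.
    All names here are hypothetical; analysis.py may organize this differently.
    """
    chars_per_word = num_chars / num_words
    words_per_sentence = num_words / num_sentences
    return 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43


# Example: 450 characters, 100 words, 5 sentences gives roughly 9.8.
print(automated_readability_index(450, 100, 5))
```

Note that the sketch assumes there is at least one word and one sentence; an empty file would cause a division by zero.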
You will notice that the program reads the entire text file into a
string, and then uses the function split to break that string
up into a list of strings, each of which is intended to be a word.
The split function uses white space (blanks, tabs and
newlines) to decide where the word boundaries are.
Of course, most punctuation is adjacent to a word, so using blanks
to decide on the word boundaries doesn't always do what you'd wish.
For example, in the sentence
"I am a Bear of Very Little Brain, and long words bother me."
the word list will include "Brain," and "me."
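To see this for yourself, here is a small sketch that applies split to the example sentence. The variable names are chosen only for illustration.

```python
sentence = "I am a Bear of Very Little Brain, and long words bother me."

# With no arguments, split breaks the string on any run of white space.
words = sentence.split()

print(words)
# ['I', 'am', 'a', 'Bear', 'of', 'Very', 'Little', 'Brain,', 'and',
#  'long', 'words', 'bother', 'me.']
```

The comma and the period stay attached to the words they follow.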
Function separate_punctuation, which you will write,
goes through the word list and replaces strings like "Brain,"
with two strings (in this case, "Brain" and ",").
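The idea of replacing one string with two can be sketched as follows. This is a rough illustration under stated assumptions, not the required solution: the helper name, its parameters, and the choice to return a new list are all hypothetical, and the actual header and docstring of separate_punctuation in analysis.py take precedence. The sketch only handles a single punctuation mark at the end of a word.

```python
def split_trailing_punctuation(words, punctuation):
    """Return a new list in which each word that ends in a punctuation
    mark is replaced by two strings: the word and the mark.

    words: a list of strings.
    punctuation: a list of one-character punctuation strings (assumed).
    """
    result = []
    for word in words:
        # Only split when something is left over after removing the mark.
        if len(word) > 1 and word[-1] in punctuation:
            result.append(word[:-1])
            result.append(word[-1])
        else:
            result.append(word)
    return result


print(split_trailing_punctuation(["Little", "Brain,", "bother", "me."], [",", "."]))
# ['Little', 'Brain', ',', 'bother', 'me', '.']
```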
This whole approach to dealing with punctuation is simple, but not
perfect. For example, some punctuation can come at the beginning of
a word (such as quotation marks) or between two words with
no whitespace to separate them (dashes often appear this way).
In these cases, the program will do imperfect things.
For example, in the sentence
"Some techniques use a key--a special value."
the program will treat "key--a" as a single word,
because there were no blanks to separate the words.
A fancier program would do better, but do not try to improve these
aspects of the Assignment 2 code.
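The dash example can be checked with a couple of lines of Python. This sketch only shows what split produces; it does not attempt to fix the limitation, and the variable names are illustrative.

```python
sentence = "Some techniques use a key--a special value."

# The dashes are not white space, so "key--a" stays in one piece.
words = sentence.split()

print(words)
# ['Some', 'techniques', 'use', 'a', 'key--a', 'special', 'value.']
```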
The program I've provided for you
runs, but does almost nothing, because many of the functions
have an empty body. Your task is to complete the program by filling
in the missing parts. Search for the string pass to
find them. All the input and output statements
required by the program have already
been written. Do not add any further input or output.
You should consider using helper functions, and always write a docstring for any function that you define.
Hint: In function print_stats, when compiling
statistics like the number of sentences and
average sentence length, you will need to be able to tell when you've
hit the end of a sentence.
The list of end-of-sentence punctuation will be helpful for this.
(This is why the function is given a
list of end-of-sentence punctuation separate from the list of
within-sentence punctuation.)
And note that neither sort of punctuation counts as a word, so it
should not be included in calculations like average word length
or average sentence length.
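One way to picture the hint is the sketch below. It is an illustration only: the function name, its parameters, and its return value are hypothetical and do not describe print_stats itself. It assumes the punctuation has already been separated from the words and that the text ends with an end-of-sentence mark.

```python
def sentence_stats(tokens, end_punctuation, within_punctuation):
    """Return a tuple (number of sentences, average sentence length).

    tokens: a list of strings, with punctuation already separated.
    end_punctuation: a list of marks that end a sentence, such as ".".
    within_punctuation: a list of marks used inside a sentence, such as ",".
    Sentence length is measured in words; punctuation is not counted.
    """
    num_sentences = 0
    num_words = 0
    for token in tokens:
        if token in end_punctuation:
            # Reaching an end-of-sentence mark closes one sentence.
            num_sentences += 1
        elif token not in within_punctuation:
            # Anything that is not punctuation counts as a word.
            num_words += 1
    if num_sentences == 0:
        return 0, 0.0
    return num_sentences, num_words / num_sentences


tokens = ["I", "am", "a", "Bear", ",", "and", "words", "bother", "me", ".",
          "Oh", "bother", "!"]
print(sentence_stats(tokens, [".", "!", "?"], [",", ";", ":"]))
# (2, 5.0)
```

A sequence of end marks in a row, such as "?!" split into two tokens, would be counted as two sentences by this sketch, so treat it as a starting point rather than a finished design.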
These are the aspects of your work that we will focus on in the grading:
Correctness: Your code should perform as specified. Correctness, as measured by our tests, will count for the largest single portion of your marks.
Docstrings and comments: Every function should have a docstring
comment that describes its parameters, what
the function does, and what is returned by the function. Within
functions, the more complicated parts of your code should also be
described using comments.
Programming style: Your variable names should be meaningful and your code as simple and clear as possible.
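As an illustration of the docstring and style expectations, here is a small sketch of a helper function. The function itself is hypothetical and is not part of analysis.py; it is only meant to show a docstring that covers the parameters, the behaviour, and the return value, together with meaningful variable names.

```python
def average_word_length(words):
    """Return the average number of characters per word.

    words: a list of strings, none of which is punctuation.
    Returns a float, or 0.0 if words is empty.
    """
    if not words:
        return 0.0
    # Add up the length of every word, then divide by the number of words.
    total_chars = sum(len(word) for word in words)
    return total_chars / len(words)


print(average_word_length(["am", "Bear"]))
# 3.0
```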
Hand in the following file:
analysis.py