Handling Bad Sentence Splits

Note that the guidance below differs from previous policy. Also note that it is most relevant to the internal concerns of the Pitt annotation group.

Before you annotate a new document, check its sentence splits. There are two kinds of splits: the default GATE_Splits, which come from the text processing platform, and the MPQA splits. The preprocessing done in GATE sets the MPQA splits to be the same as the GATE_Splits.

Splits can be bad in one of two ways: their extent is too small or too large, or they are in the wrong place. A split in the wrong place typically needs to be deleted, for instance because it was introduced after an abbreviation ending in a period.
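
As a rough aid for spotting the second kind of bad split, the sketch below flags splits that directly follow a known abbreviation. It is only an illustration: it assumes splits are available as plain character offsets into the document text, which is a simplification and not the actual GATE data model, and the function name and the abbreviation list are hypothetical.

```python
# Illustrative, non-exhaustive abbreviation list (an assumption, not project policy).
ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof.", "St.", "Inc.", "U.S."}


def suspicious_splits(text, split_offsets):
    """Return the split offsets that directly follow a known abbreviation.

    split_offsets: character offsets just past the sentence-final period
    (a simplified stand-in for the real split annotations).
    """
    flagged = []
    for offset in split_offsets:
        # Look at the last whitespace-delimited token before the split.
        preceding = text[:offset].split()
        if preceding and preceding[-1] in ABBREVIATIONS:
            flagged.append(offset)
    return flagged


text = "Mr. Bean ... You'll just have to love him!"
flagged = suspicious_splits(text, [3, len(text)])
print(flagged)
```

Anything flagged this way would still need to be checked by hand, since a sentence can legitimately end in an abbreviation.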

When you modify splits, change both the GATE and the MPQA splits. (Since I am not sure at this point whether one or the other type of split is crucial to automatic systems, let's adjust both.) Also adjust the associated GATE_Sentence labels and the MPQA_inside labels.

For instance, if you had a split after "Mr" in:

Mr. Bean ... You'll just have to love him!

you would want to remove it (both the GATE split and the MPQA split), and then merge the two MPQA insides that cover "Mr" and "... You'll just have to love him!". Likewise, you need to merge the two GATE_Sentences over the same spans.
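
The bookkeeping for this example can be sketched as follows. This is a minimal illustration, assuming that splits are plain character offsets and that sentence-level labels are (start, end) offset pairs; the real annotations are edited in GATE, and the helper name merge_at and all variable names here are hypothetical.

```python
def merge_at(spans, i):
    """Return a new list in which spans[i] and spans[i + 1] are merged."""
    if not 0 <= i < len(spans) - 1:
        raise IndexError("no following span to merge with")
    merged = (spans[i][0], spans[i + 1][1])
    return spans[:i] + [merged] + spans[i + 2:]


text = "Mr. Bean ... You'll just have to love him!"
bad_split = 3  # offset just past the period in "Mr."

# Simplified state before the fix: both split types and both label types
# carry the same bad sentence boundary.
gate_splits = [bad_split, len(text)]
mpqa_splits = [bad_split, len(text)]
gate_sentences = [(0, 3), (4, len(text))]
mpqa_insides = [(0, 3), (4, len(text))]

# Remove the bad split from both split types ...
gate_splits = [s for s in gate_splits if s != bad_split]
mpqa_splits = [s for s in mpqa_splits if s != bad_split]

# ... and merge the two spans on either side of it in both label types.
gate_sentences = merge_at(gate_sentences, 0)
mpqa_insides = merge_at(mpqa_insides, 0)
```

After these steps each list holds a single entry covering the whole sentence, and the GATE and MPQA versions agree with each other again.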



J. Ruppenhofer