CORPORA
MPQA Corpus
The research group has produced a corpus of news articles manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.). The first version of the corpus was collected and annotated as part of the summer 2002 NRRC Workshop on Multi-Perspective Question Answering (MPQA) (Wiebe et al., 2003), sponsored by ARDA. The currently available release is version 1.2, which is described more fully here.
Texts in the corpus
The corpus is made up of 535 documents containing a total of 11,114 sentences. The articles in the corpus are from 187 different foreign and U.S. news sources and date from June 2001 to May 2002. The articles were identified by human searches and by an information retrieval system. The majority of the articles are on 10 different topics, but a number of additional articles were selected, more or less at random, from a larger corpus of 270,000 documents. This last set of articles has the topic misc(ellaneous). The 10 topics are:
- argentina: economic collapse in Argentina
- axisofevil: reaction to President Bush's 2002 State of the Union Address
- guantanamo: U.S. holding prisoners in Guantanamo Bay
- humanrights: reaction to U.S. State Department report on human rights
- kyoto: ratification of Kyoto Protocol
- mugabe: 2002 presidential election in Zimbabwe
- settlements: Israeli settlements in Gaza and West Bank
- spacestation: space missions of various countries
- taiwan: relations between Taiwan and China
- venezuela: presidential coup in Venezuela
Annotation scheme
As noted above, the annotations of the project target linguistic expressions of opinions, emotions, sentiments, speculations, evaluations and other private states. Individual tokens of relevant linguistic expressions are annotated for properties such as the source of the private state, its target, intensity, type of attitude, and polarity (positive/negative).
Several characteristics of the corpus are particularly noteworthy:
- The annotation scheme is applied at the word and phrase level rather than at the level of the document or sentence.
- An important property of sources in the annotation scheme is that they are nested, reflecting the fact that private states and speech events are often embedded in one another.
- The representation scheme also includes frames representing material that is attributed to a source but is presented objectively, without evaluation, speculation, or any other type of private state by that source.
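
The properties listed above can be pictured as a small record per annotated expression. The following Python sketch is illustrative only: the class name, field names, and attribute values are hypothetical and are not taken from the release. It assumes that a nested source can be stored as a sequence running from the outermost source (the writer) to the immediate one.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PrivateStateAnnotation:
    """Hypothetical in-memory record; field names are illustrative only."""
    span: Tuple[int, int]            # character offsets of the annotated word or phrase
    nested_source: Tuple[str, ...]   # outermost source first, immediate source last
    target: Optional[str] = None
    intensity: Optional[str] = None
    attitude_type: Optional[str] = None
    polarity: Optional[str] = None   # "positive" or "negative"

    @property
    def immediate_source(self) -> str:
        # The innermost source is the one whose private state is expressed.
        return self.nested_source[-1]

    @property
    def nesting_depth(self) -> int:
        return len(self.nested_source)


# Made-up sentence: the writer reports what a report says officials feel.
text = "The report said that officials fear a backlash."
start = text.index("fear")
ann = PrivateStateAnnotation(
    span=(start, start + len("fear")),
    nested_source=("writer", "report", "officials"),
    target="a backlash",
    intensity="medium",
    attitude_type="sentiment",
    polarity="negative",
)
print(text[ann.span[0]:ann.span[1]], ann.immediate_source, ann.nesting_depth)
```

Because the annotation is attached to a phrase rather than to the sentence, a single sentence can carry several such records, each with its own source chain.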
The complete instructions used to produce the annotations are available here.
Released annotation files
The database holding the release contains three subdirectories: docs, meta_anns, man_anns.
- The docs subdirectory contains the document collection. In this subdirectory, each file is a text file containing one source document.
- Each file in the meta_anns subdirectory contains information about the document (e.g., source, date). The meta_anns files are in MPQA format.
- The man_anns subdirectory contains the manual annotations for the documents. For each annotated document there exists a directory that contains three files: two represent different versions of the annotation scheme, and one holds sentence boundaries as defined by the GATE tool.
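
As a rough illustration of how this layout might be traversed, the Python sketch below pairs each document with its annotation directory. It rests on an assumption the description above does not state: that the per-document directory under man_anns has the same relative path as the document file under docs. The function name `index_release` is hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def index_release(root: Path) -> Dict[str, List[str]]:
    """Map each document (path relative to docs/) to the names of the files
    in its annotation directory. Assumes man_anns/ mirrors docs/."""
    docs_dir = root / "docs"
    anns_dir = root / "man_anns"
    index: Dict[str, List[str]] = {}
    for doc in sorted(p for p in docs_dir.rglob("*") if p.is_file()):
        rel = doc.relative_to(docs_dir)
        ann_dir = anns_dir / rel
        # Documents without an annotation directory map to an empty list.
        if ann_dir.is_dir():
            index[rel.as_posix()] = sorted(f.name for f in ann_dir.iterdir())
        else:
            index[rel.as_posix()] = []
    return index
```

A quick sanity check on a real copy of the release would be to confirm that every annotated document maps to exactly three file names.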
The MPQA file format is a type of general stand-off annotation, with one annotation per line, consisting of text fields separated by single TABs.
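
A minimal Python sketch of reading such a file is shown below. It relies only on what is stated above (one annotation per line, fields separated by single TABs) and returns the raw fields without interpreting them, since the meaning and order of the fields are not specified here. The function name `read_mpqa_lines` is hypothetical, skipping blank lines is an assumption, and the sample lines are made up for illustration.

```python
from typing import Iterable, Iterator, List


def read_mpqa_lines(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the TAB-separated fields of each non-blank line."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines carry no annotation
        # Split on single TABs so that empty fields are preserved.
        yield line.split("\t")


# Made-up sample lines, not taken from the corpus.
sample = ["a1\t0,4\texample-type\n", "\n", "a2\t5,9\texample-type\n"]
rows = list(read_mpqa_lines(sample))
print(rows)
```

The same function accepts an open file object, for example `read_mpqa_lines(open(path, encoding="utf-8"))`; the encoding of the released files is not stated here, so UTF-8 is a guess.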