Center for the Extraction and Summarization of Events and Opinions in Text  
 
CORPORA

MPQA Corpus

The research group has produced a corpus of news articles manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.). The first version of the corpus was collected and annotated as part of the summer 2002 NRRC Workshop on Multi-Perspective Question Answering (MPQA) (Wiebe et al., 2003), sponsored by ARDA. The currently available release is version 1.2, which is described more fully here.

Texts in the corpus

The corpus is made up of 535 documents containing a total of 11,114 sentences. The articles in the corpus are from 187 different foreign and U.S. news sources and date from June 2001 to May 2002. They were identified by human searches and by an information retrieval system. The majority of the articles are on 10 different topics, but a number of additional articles were selected more or less at random from a larger corpus of 270,000 documents. This last set of articles is assigned the topic misc (miscellaneous). The 10 topics are:

  • argentina: economic collapse in Argentina
  • axisofevil: reaction to President Bush's 2002 State of the Union Address
  • guantanamo: U.S. holding prisoners in Guantanamo Bay
  • humanrights: reaction to U.S. State Department report on human rights
  • kyoto: ratification of Kyoto Protocol
  • mugabe: 2002 presidential election in Zimbabwe
  • settlements: Israeli settlements in Gaza and West Bank
  • spacestation: space missions of various countries
  • taiwan: relations between Taiwan and China
  • venezuela: presidential coup in Venezuela

Annotation scheme

As noted above, the annotations of the project target linguistic expressions of opinions, emotions,
sentiments, speculations, evaluations and other private states. Individual tokens of relevant linguistic
expressions are annotated for properties such as the source of the private state, its target, intensity, type of attitude,
and polarity (positive/negative).

Several characteristics of the corpus are particularly noteworthy:
  • The annotation scheme is applied at the word and phrase level rather than at the level of the document or sentence.
  • An important property of sources in the annotation scheme is that they are nested, reflecting the fact that private states and speech events are often embedded in one another.
  • The representation scheme also includes frames representing material that is attributed to a source but is presented objectively, without evaluation, speculation, or any other type of private state by that source.
The complete instructions used to produce the annotations are available here.
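
To make the idea of nested sources concrete, here is a minimal sketch of one way to model them in Python. It is an illustration only, not the representation used in the released files: the `NestedSource` class, its `chain` field, and the source labels are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NestedSource:
    """Hypothetical model of a nested source (not the corpus's own format).

    `chain` lists the sources from the outermost (e.g. the writer of the
    article) to the innermost (the holder of the private state).
    """
    chain: Tuple[str, ...]

    @property
    def depth(self) -> int:
        # Number of embedding levels in this source.
        return len(self.chain)

    def embed(self, inner: str) -> "NestedSource":
        # Return a new source with `inner` nested inside this one.
        return NestedSource(self.chain + (inner,))


# "The writer reports that a spokesman said the government fears ..."
# All labels below are made up for illustration.
writer = NestedSource(("writer",))
fear_source = writer.embed("spokesman").embed("government")
print(fear_source.chain, fear_source.depth)
```

Keeping the chain as an immutable tuple means an embedded source can be built from its parent without modifying it, which mirrors how one private state or speech event is embedded in another.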

Released annotation files

The database holding the release contains three subdirectories: docs, meta_anns, and man_anns.
  • The docs subdirectory contains the document collection. In this subdirectory, each file is a text file containing one source document.
  • Each file in the meta_anns subdirectory contains information about the document (e.g., source, date). The meta_anns files are in MPQA format.
  • The man_anns subdirectory contains the manual annotations for the documents. For each annotated document there is a directory containing three files: two represent different versions of the annotation scheme, and the third holds sentence boundaries as defined by the GATE tool.
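
The layout described above can be checked with a short Python sketch. It assumes only the three subdirectory names given in the text; the helper names `release_layout` and `missing_subdirs`, and the example root path, are hypothetical. Nothing is assumed here about how files inside each subdirectory are named or organized.

```python
from pathlib import Path

# The three top-level subdirectories named in the release description.
SUBDIRS = ("docs", "meta_anns", "man_anns")


def release_layout(root):
    """Map each expected subdirectory name to its path under `root`."""
    root = Path(root)
    return {name: root / name for name in SUBDIRS}


def missing_subdirs(root):
    """Return the expected subdirectory names that do not exist under `root`."""
    return [name for name, path in release_layout(root).items()
            if not path.is_dir()]


# Hypothetical location of an unpacked release; adjust to the real path.
root = "path/to/release"
problems = missing_subdirs(root)
if problems:
    print("Missing subdirectories:", ", ".join(problems))
else:
    print("All three subdirectories are present.")
```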

The MPQA file format is a type of general stand-off annotation, with one annotation per line, consisting of text fields separated by single TABs.
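
As a minimal sketch of reading such a file, the Python below splits each line into its TAB-separated fields. It relies only on what is stated above (one annotation per line, fields separated by single TABs). The number and meaning of the fields are not assumed, and the function names and sample lines are hypothetical placeholders rather than real corpus data.

```python
def parse_annotation_line(line):
    """Split one annotation line into its TAB-separated text fields."""
    # Drop the trailing newline only; splitting on single TABs keeps
    # empty fields in place instead of collapsing them.
    return line.rstrip("\r\n").split("\t")


def read_annotations(lines):
    """Yield the list of fields for each non-blank line."""
    for line in lines:
        if line.strip():
            yield parse_annotation_line(line)


# Made-up sample lines, for illustration only.
sample = [
    "field1\tfield2\tfield3\n",
    "\n",
    "a\t\tc\n",
]
for fields in read_annotations(sample):
    print(fields)
```

Because the fields are separated by single TABs, `str.split("\t")` is used rather than a whitespace split, so an empty field is preserved as an empty string.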