Description

Lack of progress in automatically producing semantic representations is a major obstacle for natural language processing. Our proposal addresses this issue by creating a Unified Linguistic Annotation (ULA), exemplified by the first large (550K words), balanced, semantically annotated corpus. This corpus will have most basic types of semantic information annotated according to high-quality schemes using state-of-the-art annotation technology. Crucially, the individual annotations, although unified, will be kept separate, making it easy to produce alternative annotations of a specific type of semantic information (word senses, anaphora, etc.) without modifying annotations at other levels. Our ULA framework will be easily extensible to incorporate new annotation schemes as they become available. We will create an infrastructure that includes both multiply annotated corpora and guidelines for merging annotations, so that the ULA can continue to grow after this project is complete.
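
The idea of keeping annotation layers separate while still being able to view them together can be illustrated with a minimal sketch. This is not the ULA's actual data format or tooling. The class names (`Span`, `Corpus`), the layer names, and the sense and entity labels are all hypothetical, chosen only to show how one layer might be replaced without touching the others.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    """One annotation over a token range (hypothetical structure)."""
    start: int   # index of the first token covered (inclusive)
    end: int     # index one past the last token covered (exclusive)
    label: str   # annotation value, e.g. a sense tag or an entity id


@dataclass
class Corpus:
    """Tokens plus independent annotation layers that only point at tokens."""
    tokens: List[str]
    layers: Dict[str, List[Span]] = field(default_factory=dict)

    def set_layer(self, name: str, spans: List[Span]) -> None:
        # Add or replace a single layer; every other layer stays untouched.
        self.layers[name] = list(spans)

    def merged_view(self) -> List[dict]:
        # For each token, collect the labels of every layer that covers it.
        view = [{"token": tok} for tok in self.tokens]
        for name, spans in self.layers.items():
            for span in spans:
                for i in range(span.start, span.end):
                    view[i].setdefault(name, []).append(span.label)
        return view


corpus = Corpus(tokens=["She", "opened", "the", "bank", "account"])
corpus.set_layer("word_sense", [Span(3, 4, "bank.financial")])   # made-up sense tag
corpus.set_layer("anaphora", [Span(0, 1, "entity_1")])           # made-up entity id

# Swap in an alternative word-sense annotation; the anaphora layer is unchanged.
corpus.set_layer("word_sense", [Span(3, 4, "bank.institution")])
merged = corpus.merged_view()
```

Because each layer only refers to token positions, an alternative scheme for one kind of information can be added under a new layer name or substituted for an existing one, and the merged view is recomputed on demand rather than stored.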

This project is funded by the National Science Foundation Computing Research Infrastructure Program.