decisions require integration of large amounts of many types of digital information.
Currently, large volumes of raw data are gathered from around the globe in many
languages, media and forms. Given this volume and variety, it is
difficult to find and interpret relevant information quickly. The U.S. Army has
identified the need for technologies supporting network-centric/distributed
information systems, content-based retrieval of information, machine
translation of text, automated text analysis using markup languages and
decisions require the summary, access, review and long-term preservation of
digital records. Due to the large variety and increasing volume of acquired
digital records, it will be decades or centuries before acquisitions can be
manually summarized and reviewed for restrictions on disclosure. Due to the
obsolescence of computer technologies, there is also significant risk that
e-records created using past or current software applications may not be
accessible in the future. The military services also maintain administrative,
operational and engineering e-records, and many of these must be accessible
decades after their creation. They must also review their records for national
security restrictions on disclosure. Hence, the military services are also
concerned with summarization, access, review and preservation of their records.
Computational pragmatics is an area of language understanding technology in
which improved methods could improve the quality of summaries, raise the
precision and recall of access methods, and support interpretation of and
reasoning about the content of documents. Pragmatics is the area of linguistics
that is concerned with the
context and use of language in discourse, in contrast to the syntax and
semantics of clauses or sentences. Computational pragmatics includes
technologies for pronominal co-reference, speech act recognition, topic
recognition and recognition of discourse structure.
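As a brief illustration of the precision and recall measures mentioned above, the following minimal sketch computes both for a single query. The function name and the document identifiers are hypothetical and are not taken from the project; the definitions are the standard ones (precision is the fraction of retrieved records that are relevant, recall is the fraction of relevant records that were retrieved).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers returned by an access method (hypothetical data)
    relevant:  identifiers a reviewer judged relevant (hypothetical data)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant records that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Example: four records retrieved, three relevant overall, two in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

In this example two of the four retrieved records are relevant, and two of the three relevant records were found.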
The first objective of this research project is to develop improved methods for
pronominal co-reference resolution and speech act recognition. These pragmatic
language features play an important role in identifying an author’s intents
(asserting, committing, declaring, directing, or expressing an attitude). The
second research objective is to develop a method of discourse analysis that
relates the structure and topics of a document to the intentions of the author
as expressed in the speech acts of the document.
The third research objective is to develop an improved method for summarizing
individual records and sets of records by combining the methods of speech act
recognition and discourse analysis with prior research results based on
document type recognition and metadata extraction.
In collaboration with ARL Computational Linguists, the fourth research objective
is to extend methods for information extraction, document type recognition,
metadata extraction and summarization to Arabic language documents.
The fifth research objective is to improve the conceptual indexing model of
document retrieval by incorporating an improved topic recognition method and
providing a Boolean query language interface. The sixth research objective is to
enhance the capability of a prototype access restriction checker by
incorporating the speech act recognition and discourse analysis methods.
The seventh research objective is to enhance the technology and resources for file
format identification through the development of a digital file format library.
The final research objective is to create archival services for the
Transcontinental Persistent Archive Prototype (TPAP) from methods developed
during this project and to investigate the scalability of these services
(methods) to large volumes of records.
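To make the file format identification objective above more concrete, the sketch below shows one common approach: matching the leading bytes of a file against known signatures ("magic numbers"). It is an illustration only, not the project's method or its format library. The `SIGNATURES` table and the `identify_format` function are hypothetical names, and the table holds only a handful of well-known signatures.

```python
# Hypothetical, deliberately tiny signature table: leading bytes -> format name.
# A real format library would be far larger and would also record versions.
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"\xff\xd8\xff": "JPEG",
    b"PK\x03\x04": "ZIP container",
}


def identify_format(header: bytes) -> str:
    """Return a format name for the given leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first 16 bytes of a file and identify its format."""
    with open(path, "rb") as handle:
        return identify_format(handle.read(16))


print(identify_format(b"%PDF-1.4\n"))
print(identify_format(b"plain text"))
```

Signature matching alone cannot distinguish formats that share a container (for example, the many document types packaged as ZIP files), which is one reason a curated format library is useful.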
This project is sponsored by the Army Research Laboratory and NARA's Center for Advanced
Systems and Technology (NCAST).