Advanced Language Processing Technology Applied to Digital Records

Military decisions require the integration of large amounts of digital information of many types. Large volumes of raw data are currently gathered from around the globe in many languages, media and forms. Given this volume and variety, it is difficult to find and interpret the relevant information quickly. The U.S. Army has identified a need for technologies that support network-centric and distributed information systems, content-based retrieval of information, machine translation of text, automated text analysis using markup languages and rule-based reasoning.

 

Archival decisions require the summary, access, review and long-term preservation of digital records. Because of the large variety and increasing volume of acquired digital records, it would take decades or centuries to summarize acquisitions manually and review them for restrictions on disclosure. Because computer technologies become obsolete, there is also a significant risk that e-records created using past or current software applications will not be accessible in the future. The military services also maintain administrative, operational and engineering e-records, many of which must remain accessible decades after their creation, and they must review these records for national security restrictions on disclosure. Hence, the military services are also concerned with the summarization, access, review and preservation of their records.

 

Pragmatics is an area of language understanding technology in which improved methods could enhance the quality of summaries, enhance the precision and recall of access methods and support interpretation and reasoning about the content of documents. Pragmatics is the area of linguistics that is concerned with the context and use of language in discourse, in contrast to the syntax and semantics of clauses or sentences. Computational pragmatics includes technologies for pronominal co-reference, speech act recognition, topic recognition and recognition of discourse structure.

 

The first objective of this research project is to develop improved methods for pronominal co-reference resolution and speech act recognition. These pragmatic language features play an important role in identifying an author's intent (asserting, committing, declaring, directing, or expressing an attitude). The second research objective is to develop a method of discourse analysis that relates the structure and topics of a document to the intentions of the author as expressed in the speech acts of the document.

 

The third research objective is to develop an improved method for summarizing individual records and sets of records by combining the methods of speech act recognition and discourse analysis with prior research results based on document type recognition and metadata extraction.

 

The fourth research objective, pursued in collaboration with computational linguists at the Army Research Laboratory (ARL), is to extend methods for information extraction, document type recognition, metadata extraction and summarization to Arabic language documents.

 

The fifth research objective is to improve the conceptual indexing model of document retrieval by incorporating an improved topic recognition method and providing a Boolean query language interface. The sixth research objective is to enhance the capability of a prototype access restriction checker by incorporating the speech act recognition and discourse analysis methods.
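
To make the idea of a Boolean query interface concrete, the following is a minimal sketch in Python. It uses a plain term index rather than the project's conceptual indexing model, and every name in it (build_index, query_and, query_or, query_not, and the sample records) is a hypothetical illustration, not part of the project's software.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercased term to the set of record ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def query_and(index: Dict[str, Set[str]], *terms: str) -> Set[str]:
    """Records containing every term (Boolean AND)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def query_or(index: Dict[str, Set[str]], *terms: str) -> Set[str]:
    """Records containing at least one term (Boolean OR)."""
    result: Set[str] = set()
    for t in terms:
        result |= index.get(t.lower(), set())
    return result


def query_not(index: Dict[str, Set[str]], all_ids: Set[str], term: str) -> Set[str]:
    """Records that do not contain the term (Boolean NOT)."""
    return all_ids - index.get(term.lower(), set())


# Hypothetical sample records, for illustration only.
docs = {
    "r1": "army translation report",
    "r2": "arabic translation memo",
    "r3": "army memo",
}
index = build_index(docs)
```

In this sketch each operator is a set operation over posting sets, so compound queries can be composed by combining the returned sets. A concept-based index would presumably replace the raw terms with recognized topics or concepts, which this example does not attempt.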

 

The seventh research objective is to enhance the technology and resources for file format identification through the development of a digital file format library. The final research objective is to create archival services for the Transcontinental Persistent Archive Prototype (TPAP) from methods developed during this project and to investigate the scalability of these services (methods) to large volumes of records.
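
As a rough illustration of what a file format library supports, the sketch below identifies a file by matching its leading bytes against known signatures. It is an assumption-laden toy, not the project's library: the SIGNATURES table and the identify_format function are hypothetical names, and only a handful of widely documented signatures are included.

```python
from typing import List, Optional, Tuple

# Hypothetical miniature format library: (leading byte signature, format name).
SIGNATURES: List[Tuple[bytes, str]] = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify_format(data: bytes) -> Optional[str]:
    """Return the name of the first signature that the data starts with."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None  # Unknown format: no signature in the table matched.


def identify_file(path: str) -> Optional[str]:
    """Read only the first bytes of a file and identify its format."""
    with open(path, "rb") as handle:
        return identify_format(handle.read(16))
```

A real format library would need many more entries, signatures at offsets other than the start of the file, and version information. The ZIP signature also shows a limitation of this approach: several document formats are packaged as ZIP containers, so a leading-bytes match alone cannot tell them apart.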

 

Acknowledgements: This project is sponsored by the Army Research Laboratory and NARA's Center for Advanced Systems and Technology (NCAST).