Duration: 1 October 2006 to 30 September 2009
Funded by: reference TIN2006-15694-CO2-01

There are huge historical document collections residing in libraries, museums and archives that are currently being digitized for preservation purposes and to make them available worldwide through large, on-line digital libraries. The main objective, however, is not to simply provide access to raw images of digitized documents, but to annotate them with their real informative content and, in particular, with text transcriptions and, when convenient, text translations too. Unfortunately, state-of-the-art technologies for automatic text transcription can hardly deal with handwritten text or even with old-style printed text. Similarly, current machine translation techniques are still far from being error-free, and thus they cannot produce acceptable translations of transcribed texts in a fully automatic way.

iDoc aims at developing advanced techniques and interfaces for the analysis, transcription and translation of images of old archive documents, following an interactive-predictive approach. It is a coordinated project with two subprojects: iAnaDoc (“Interactive Analysis of Old Archive Documents”) and iTransDoc (“Interactive Transcription and Translation of Old Text Documents”). As suggested by their names, iAnaDoc mainly covers the image analysis part, while iTransDoc is primarily devoted to the transcription and translation tasks.

iAnaDoc aims to carry out research in Document Image Analysis and Pattern Recognition applied to the extraction of metadata from archives of digitized old documents. Such metadata consists of three major classes of knowledge, namely document layout; text, either printed (iAnaDoc) or handwritten (iTransDoc); and graphical entities (diagrams, tables, symbols, stamps, etc.). Since images of old documents may be of poor quality due to aging and the deterioration of material over time (noise, spots, ink fading, etc.), the research will also focus on early-level processing, which includes a number of image processing techniques to enhance and restore degraded document images.

iTransDoc will bring together specialists in off-line Handwritten Text Recognition and Machine Translation, who will work closely with archivists to design a friendly and intelligent software tool for the transcription and translation of text blocks. These text blocks will have been previously extracted by iAnaDoc from digitized documents. End-users of the tool (archivists) will serve as guarantors of high-quality output; the role of the tool will be to increase the user's productivity by predicting extensions to his or her current, partial hypothesis on the text transcription or translation. This interactive-predictive paradigm is essential to iTransDoc. To some extent, iTransDoc may be considered an extension of a recently completed, highly successful European research project on interactive-predictive machine translation.
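The interactive-predictive loop can be illustrated with a toy example. The following is a minimal sketch, not the project's actual method: it assumes the recognizer or translator exposes a scored list of full hypotheses (a hypothetical n-best list), and it simply proposes the continuation of the best-scoring hypothesis that agrees with the prefix the user has validated so far. The names `predict_suffix` and `EXAMPLE_NBEST`, as well as the sample text and scores, are illustrative assumptions.

```python
from typing import List, Optional, Tuple

# Hypothetical n-best list: (full hypothesis, score) pairs; higher score is better.
NBest = List[Tuple[str, float]]

# Illustrative data only; a real system would obtain these from its models.
EXAMPLE_NBEST: NBest = [
    ("in the year of our lord", 0.5),
    ("in the near of our land", 0.3),
    ("in the year of our land", 0.2),
]


def predict_suffix(prefix: str, nbest: NBest) -> Optional[str]:
    """Propose a continuation for the user-validated prefix.

    Returns the remainder of the best-scoring hypothesis that starts with
    `prefix`, or None if no hypothesis is compatible with it.
    """
    compatible = [(text, score) for text, score in nbest if text.startswith(prefix)]
    if not compatible:
        return None
    best_text, _ = max(compatible, key=lambda pair: pair[1])
    return best_text[len(prefix):]


if __name__ == "__main__":
    # With no user input yet, the system proposes its best full hypothesis.
    print(predict_suffix("", EXAMPLE_NBEST))
    # The user corrects the last word by typing "la"; the prediction adapts.
    print(predict_suffix("in the year of our la", EXAMPLE_NBEST))
```

In this sketch each user correction lengthens the validated prefix, which narrows the set of compatible hypotheses, so the user only types where the system is wrong rather than producing the whole text by hand.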

In order for the designed tool to be as efficient as possible, iDoc will pay special attention to the way in which end-users may provide their input. End-user input will be mainly based on conventional computer devices (keyboard and mouse). Nevertheless, speech recognition and graphical input devices will also be tested as complementary or alternative input modes to speed up operation. These multimodal capabilities will be developed by experts in automatic speech recognition (iTransDoc), on-line handwritten text recognition (iTransDoc) and sketching (iAnaDoc).

The designed software tool will be periodically evaluated in terms of usability and profitability by five end-users who will actively participate in iDoc as Promoter-Observer Entities (EPOs). For this purpose, we will consider several varied collections of old text documents, from which a representative benchmark of transcription and translation tasks will be defined at the beginning of the project. Appropriate evaluation protocols and measures will also be defined accordingly. Roughly speaking, our basic goal is to reduce the time needed to manually transcribe or translate a moderately complex text document by at least 25%.
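The 25% goal can be read as a relative reduction in working time. The following is a minimal sketch under that assumption: the saving is taken to be (manual time - assisted time) / manual time, which is one plausible reading and not an evaluation protocol stated by the project. The function names and the timings in the example are illustrative only.

```python
def relative_time_saving(manual_minutes: float, assisted_minutes: float) -> float:
    """Fraction of working time saved by the tool relative to fully manual work."""
    if manual_minutes <= 0:
        raise ValueError("manual_minutes must be positive")
    return (manual_minutes - assisted_minutes) / manual_minutes


def meets_goal(manual_minutes: float, assisted_minutes: float, goal: float = 0.25) -> bool:
    """Check whether the saving reaches the target (25% by default)."""
    return relative_time_saving(manual_minutes, assisted_minutes) >= goal


if __name__ == "__main__":
    # Illustrative figures: 60 minutes by hand versus 42 minutes with the tool.
    saving = relative_time_saving(60.0, 42.0)
    print(f"saving: {saving:.0%}, goal met: {meets_goal(60.0, 42.0)}")
```

Under this reading, a document that takes 60 minutes to transcribe by hand would need to take 45 minutes or less with the tool for the goal to be met.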