Verónica Romero, Joan-Andreu Sánchez. Category-based language models for handwriting recognition of marriage license books. International Conference on Document Analysis and Recognition (ICDAR), 2013. pp. 788-792. IEEE Computer Society Conference Publishing Services (CPS).

Handwritten marriage license books have been used for centuries by ecclesiastical institutions to register marriages. These documents contain information that is useful for demographic studies, organized as a list of individual marriage license records, much like an accounting book. The information in these books is usually collected by expert demographers, who devote a lot of time to transcribing them. Despite the regular structure of the text, automatic transcription and semantic information extraction from these documents is quite difficult because the vocabulary is distinctive and evolving: it consists mainly of proper names that change over time. In this paper, we define a set of categories that take into account the semantic information included in the licenses. A category-based language model is then generated and integrated into the handwritten text recognition system. We study how the use of these categories can benefit not only the handwriting recognition step, but also the subsequent semantic information extraction and knowledge discovery.
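To make the idea of a category-based language model concrete, the following is a minimal sketch of the standard class-based bigram factorization, in which the probability of a word given its predecessor is approximated as P(category | previous category) times P(word | category). It is an illustration under assumptions, not the model used in the paper: the category names, the word-to-category map, the toy sentences, and the function names (`category_of`, `train`, `prob`) are all hypothetical, and no smoothing is applied.

```python
from collections import Counter

# Hypothetical word-to-category map; names and categories are made up
# for illustration and are not taken from the paper.
WORD_TO_CAT = {
    "Joan": "NAME", "Pere": "NAME", "Maria": "NAME",
    "farmer": "OCCUPATION", "tailor": "OCCUPATION",
}

def category_of(word):
    # Words outside the map are treated as their own singleton category.
    return WORD_TO_CAT.get(word, word)

def train(sentences):
    """Collect the counts needed for a class-based bigram model."""
    cat_bigrams = Counter()  # (previous category, category) counts
    cat_context = Counter()  # how often a category appears as left context
    word_in_cat = Counter()  # (category, word) counts
    cat_total = Counter()    # total occurrences of each category
    for sentence in sentences:
        prev = "<s>"  # sentence-start symbol
        for word in sentence:
            cat = category_of(word)
            cat_bigrams[(prev, cat)] += 1
            cat_context[prev] += 1
            word_in_cat[(cat, word)] += 1
            cat_total[cat] += 1
            prev = cat
    return cat_bigrams, cat_context, word_in_cat, cat_total

def prob(word, prev_word, model):
    """Unsmoothed estimate of P(cat | prev cat) * P(word | cat)."""
    cat_bigrams, cat_context, word_in_cat, cat_total = model
    cat = category_of(word)
    prev = "<s>" if prev_word is None else category_of(prev_word)
    if cat_context[prev] == 0 or cat_total[cat] == 0:
        return 0.0
    p_cat = cat_bigrams[(prev, cat)] / cat_context[prev]
    p_word = word_in_cat[(cat, word)] / cat_total[cat]
    return p_cat * p_word

# Toy training data (hypothetical record fragments).
model = train([
    ["Joan", "farmer", "with", "Maria"],
    ["Pere", "tailor", "with", "Maria"],
])
print(prob("tailor", "Joan", model))
```

In this toy data the word pair ("Joan", "tailor") never occurs, yet the estimate is non-zero because the category pair (NAME, OCCUPATION) was observed. This is the kind of generalization that makes categories attractive when the vocabulary is dominated by proper names, and the category labels themselves give a natural starting point for later information extraction.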