Pattern Recognition and Human Language Technology

Research Center

Data

The CS corpus

The CS ("Cristo-Salvador") corpus is a 19th-century Spanish manuscript. A detailed description of this corpus, including instructions for downloading the page images (without watermarks) along with the corresponding transcriptions, can be found here.

The EUTRANS-I corpus

EUTRANS-I is a simple translation corpus that was produced and used in the EuTrans project. It corresponds to the so-called "Traveller Task", which involves human-to-human communication situations at the front desk of a hotel. Bilingual data were produced semi-automatically in three language pairs on the basis of small "seed corpora" obtained from several traveller-oriented booklets. More details and experimental results can be found here. Only a benchmark version of the Spanish-English corpus is available here for academic research (300 KB).

The GERMANA corpus

GERMANA is the result of digitising and annotating a 764-page Spanish manuscript entitled "Noticias y documentos relativos a Doña Germana de Foix, última Reina de Aragón", written in 1891 by Vicent Salvador. A detailed description and download instructions can be found here.

The RODRIGO corpus

RODRIGO corresponds to a manuscript from 1545 entitled “Historia de España del arçobispo Don Rodrigo”, written entirely in old Castilian (Spanish) by a single author. It is an 853-page bound volume divided into 307 chapters that chronicle Spanish history. Most pages contain only a single text block of nearly calligraphic handwriting on well-separated lines. You can download it here.

The IAM-PRHLT bi-modal Handwritten Text corpus

The biMod-IAM-PRHLT corpus is a bimodal dataset of on-line and off-line handwritten text. It is composed of a set of approximately 500 handwritten words, with several instances of each word in both the on-line and off-line modalities. The off-line samples are presented as grey-level images (PNG format), and the on-line samples are sequences of X-Y coordinates (UNIPEN format, originally in XML format) describing the trajectory of an electronic pen while writing the same word. The writers of the on-line and off-line samples are (generally) different. A more detailed description and download instructions can be found here.
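As an illustration of what an on-line sample looks like once loaded, the following is a minimal Python sketch that collects pen-down strokes from UNIPEN-style text. It assumes the common layout in which `.PEN_DOWN` and `.PEN_UP` keywords delimit runs of coordinate lines whose first two columns are X and Y; the exact layout of the files in this corpus is not stated here, and the function name `parse_unipen_strokes` is hypothetical, so check it against the corpus documentation before relying on it.

```python
from typing import List, Optional, Tuple

Stroke = List[Tuple[float, float]]


def parse_unipen_strokes(text: str) -> List[Stroke]:
    """Collect pen-down strokes from UNIPEN-style text.

    Assumption: coordinate lines follow a .PEN_DOWN keyword and their
    first two whitespace-separated columns are X and Y.
    """
    strokes: List[Stroke] = []
    current: Optional[Stroke] = None  # None while the pen is up
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("."):
            # Keyword line: only .PEN_DOWN opens a new stroke; any other
            # keyword (including .PEN_UP) closes the current one.
            if line.split()[0] == ".PEN_DOWN":
                current = []
                strokes.append(current)
            else:
                current = None
            continue
        if current is None:
            continue  # data outside a pen-down segment is ignored
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            current.append((float(parts[0]), float(parts[1])))
        except ValueError:
            continue  # skip lines that are not numeric coordinates
    return strokes


# Tiny made-up sample, not taken from the corpus.
sample = """.COORD X Y
.PEN_DOWN
10 20
11 22
.PEN_UP
.PEN_DOWN
30 40
.PEN_UP
"""
strokes = parse_unipen_strokes(sample)
```

Each returned stroke is a list of (x, y) points, which is the form most trajectory-based feature extractors expect.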

The IAM-PRHLT bi-modal Handwritten Text corpus II

This is a new biMod-IAM-PRHLT corpus compiled for the IAM-PRHLT bi-modal Handwritten Text corpus II benchmark, which is intended for developing and testing word-graph-based multimodal protocols. The word graphs are obtained for every word instance (on-line and off-line) of the biMod-IAM-PRHLT-2 corpus using the Viterbi algorithm with a lexical restriction (a prefix tree). The corpus can be downloaded from here.
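To make the idea of a prefix-tree lexical restriction concrete, here is a minimal Python sketch of a character-level prefix tree that reports which characters may follow a given prefix. It is not the decoder used to build the corpus word graphs; the class name `PrefixTree`, its methods, and the three-word lexicon are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List

END = ""  # marker key: a lexicon word ends at this node


class PrefixTree:
    """Character-level prefix tree (trie) over a lexicon."""

    def __init__(self, words: Iterable[str]) -> None:
        self.root: Dict = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[END] = True

    def _find(self, prefix: str):
        """Return the node reached by `prefix`, or None if absent."""
        node = self.root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def allowed_next(self, prefix: str) -> List[str]:
        """Characters that extend `prefix` towards some lexicon word."""
        node = self._find(prefix)
        if node is None:
            return []
        return sorted(ch for ch in node if ch != END)

    def is_word(self, prefix: str) -> bool:
        """True if `prefix` is itself a complete lexicon word."""
        node = self._find(prefix)
        return node is not None and END in node


# Hypothetical toy lexicon.
tree = PrefixTree(["the", "then", "tea"])
next_after_t = tree.allowed_next("t")      # ['e', 'h']
next_after_the = tree.allowed_next("the")  # ['n']
```

In a lexicon-constrained search, a hypothesis would only be extended with the characters returned by `allowed_next`, and could only end where `is_word` is true.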

The Karyotype corpus

This corpus contains karyotypes, where each one is composed of 22 chromosome images. It was collected for the Karyotype benchmark, where the goal is to associate each chromosome image with a label from a set of 22 labels. The corpus can be downloaded from here.

The DNI corpus

This corpus is a compilation of handwritten national identification numbers (DNI) from real forms. It was collected for the Interactive Sequence Labeling benchmark. The aim of this benchmark is to find new search strategies for passive and active interactive sequence labeling. The corpus can be downloaded from here.

Spanish Numbers

This is a handwritten text corpus of number names in Spanish, collected by the "Instituto Tecnológico de Informática". The corpus contains about 522 handwritten text sentences and is frequently used as an example task for assessing the performance of new preprocessing, feature extraction and modelling methods for HTR. It can be downloaded from here.