Resources

Demos

  • Multimodal Interactive Handwritten Text Recognition

    This is a live demonstration of the capabilities of our multimodal interactive HTR engine (MM-CATTI) on a 19th-century Spanish manuscript and on IAMDB text images.

    View demo
  • Interactive Image Retrieval

    An interactive approach is proposed to retrieve images from a huge dataset gathered from Internet web pages. Furthermore, the demo shows different user-interaction strategies. Query Suggestion and Tag Cloud approaches are shown to be helpful for image retrieval.

    View demo
  • Key Word Spotting

    View demo
  • KWS – Hierarchical indexing

    View demo
  • Mathematical Expression Recognition

    This demo takes as input a handwritten math expression and outputs a LaTeX string. The recognition engine is based on SESHAT, an open-source system for recognizing handwritten math expressions.

    View demo
  • Interactive Machine Translation

    An interactive approach is proposed as an alternative to post-editing the output of a machine translation system. In this approach, the user’s feedback validates or corrects parts of the system output, which allows the system to generate improved versions of the rest of the output.

    View demo
  • Interactive Predictive Parsing

    This demo shows an interactive predictive parsing tool.

    View demo
  • Spoken Language Understanding

    The basic problem is learning a subset of an arbitrary natural language from picture-sentence pairs. This demo shows a prototype system that performs speech recognition and language understanding on the Miniature Language Acquisition task, along with a visual representation of the actions.

    View demo
  • Text-to-Text Spanish-Catalan Translator

    A text-to-text Spanish-Catalan translation system. There are two translation prototypes: one from the Taval, SisHiTra and TeFaTe projects, and the other based on a Spanish-Catalan phrase-based statistical approach.

    View demo

Software

  • Features for Handwriting Recognition

    The following page provides tools for computing several features for handwriting recognition: Feature Extraction Tools.

    Download
  • iATROS (Improved ATROS)

    iATROS is a new implementation of a previous speech recogniser that has been adapted to be used in both speech and handwritten text recognition. iATROS provides a modular structure that can be used to build different systems whose core is a Viterbi-like search on a Hidden Markov Model network. iATROS provides standard tools for off-line recognition and on-line speech recognition (based on ALSA modules).

    • Míriam Luján-Mares, Vicent Tamarit, Vicent Alabau, Carlos-D. Martínez-Hinarejos, Moisés Pastor, Alberto Sanchis, and Alejandro Toselli. iATROS: A speech and handwriting recognition system. In V Jornadas en Tecnologías del Habla (VJTH’2008), pages 75-78, Bilbao (Spain), Nov 2008.
    Download
  • Thot: a toolkit for phrase-based statistical machine translation

    Thot is an open source toolkit for statistical machine translation. Originally, Thot incorporated tools to train phrase-based models. The new version of Thot now includes a state-of-the-art phrase-based translation decoder as well as tools to estimate all of the models involved in the translation process. In addition to this, Thot is also able to incrementally update its models in real time after presenting an individual sentence pair.

    • (Ortiz-Martínez et al. 2014) D. Ortiz-Martínez and F. Casacuberta. The New Thot Toolkit for Fully-Automatic and Interactive Statistical Machine Translation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 45-48, Gothenburg, Sweden, April 2014
    Download
  • iGREAT (interactive GREAT)

    iGREAT is an open-source, statistical machine translation software toolkit based on finite-state models.

    • Jorge González, Francisco Casacuberta. GREAT: open source software for statistical machine translation. Machine Translation, 2011. Vol. 25 (2), pp. 145-160.
    • J. González and F. Casacuberta. GREAT: a finite-state machine translation toolkit implementing a Grammatical Inference Approach for Transducer Inference (GIATI). In EACL Workshop on Computational Linguistics Aspects of Grammatical Inference, pages 24-32, Athens, Greece, March 30 2009.
    • J. González, G. Sanchis, and F. Casacuberta. Learning finite state transducers using bilingual phrases. In 9th International Conference on Intelligent Text Processing and Computational Linguistics. Lecture Notes in Computer Science, Haifa, Israel, February 17 to 23 2008.
    Download
  • jaf MT: A phrase-based hidden semi-Markov Model for SMT

    jaf MT is software for training phrase-based hidden semi-Markov models for SMT.

    • Jesús Andrés-Ferrer, Alfons Juan. A phrase-based hidden semi-Markov approach to machine translation. Proceedings of European Association for Machine Translation (EAMT), 2009. pp. 168-175.
    Download
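The iATROS entry above describes a core built around a Viterbi-like search on a Hidden Markov Model network. As a rough illustration of that kind of search, here is a minimal sketch of the textbook Viterbi algorithm for a small discrete HMM. It is not taken from iATROS: the function name `viterbi`, the toy model, and all probabilities are assumptions made up for this example.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for a discrete HMM.

    Assumes a non-empty observation sequence and log-domain parameters.
    """
    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    # back[t][s]: predecessor of s on that best path
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


def logs(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {
        k: logs(v) if isinstance(v, dict) else math.log(v)
        for k, v in table.items()
    }


# Hypothetical two-state toy model (illustrative numbers only)
states = ["Healthy", "Fever"]
log_start = logs({"Healthy": 0.6, "Fever": 0.4})
log_trans = logs({
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
})
log_emit = logs({
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
})

log_prob, path = viterbi(
    ["normal", "cold", "dizzy"], states, log_start, log_trans, log_emit
)
print(path, math.exp(log_prob))
```

Working in the log domain avoids numerical underflow on long observation sequences, which is the usual choice in speech and handwriting decoders; a real recogniser would additionally use continuous emission densities and pruning.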

Data

  • The CS corpus

    The CS (“Cristo-Salvador”) corpus is a 19th-century Spanish manuscript. A detailed description of this corpus is available, including information on how to download the page images (without watermarks) along with the corresponding transcriptions. More information
  • Spanish Numbers

    This is a handwritten text corpus of number names in Spanish, collected by the “Instituto Tecnológico de Informática”. The corpus contains about 522 handwritten text sentences and is frequently used as an example task for assessing the performance of new preprocessing, feature extraction and modelling methods for HTR. More information
  • The DNI corpus

    This corpus is a compilation of handwritten national identification numbers (DNI) from real forms. It was collected for the Interactive Sequence Labeling benchmark. The aim of this benchmark is to find new search strategies for passive and active interactive sequence labeling. More information
  • The Karyotype corpus

    This corpus contains karyotypes, where each one is composed of 22 chromosome images. It was collected for the Karyotype benchmark, where the goal is to associate each chromosome image with a label from a set of 22 labels. More information
  • The IAM-PRHLT bi-modal Handwritten Text corpus II

    This is a new biMod-IAM-PRHLT corpus compiled for the IAM-PRHLT bi-modal Handwritten Text corpus II benchmark to test and develop word-graph based multimodal protocols. These word-graphs are obtained for any word instance (on-line and off-line) of the biMod-IAM-PRHLT-2 corpus, using the Viterbi algorithm with a lexical restriction (prefix-tree). More information
  • The IAM-PRHLT bi-modal Handwritten Text corpus

    The biMod-IAM-PRHLT corpus is a bimodal dataset of on-line and off-line handwritten text. It is composed of a set of handwritten words (approximately 500), with several instances of each word in both the on-line and off-line modalities. The off-line samples are presented as grey-level images (PNG format), and the on-line samples are sequences of X-Y coordinates (Unipen format, originally in XML format) describing the trajectory of an electronic pen while writing the same word. The writers of the on-line and off-line samples are (generally) different. More information
  • The RODRIGO corpus

    RODRIGO corresponds to a manuscript from 1545 entitled “Historia de España del arçobispo Don Rodrigo”, written entirely in Old Castilian (Spanish) by a single author. It is an 853-page bound volume divided into 307 chapters describing chronicles from Spanish history. Most pages contain only a single text block of nearly calligraphic handwriting on well-separated lines. More information
  • The GERMANA corpus

    GERMANA is the result of digitising and annotating a 764-page Spanish manuscript entitled "Noticias y documentos relativos a Doña Germana de Foix, última Reina de Aragón" and written in 1891 by Vicent Salvador. More information
  • The EUTRANS-I corpus

    EUTRANS-I is a simple translation corpus which was produced and used in the EuTrans project. It corresponds to the so-called "Traveller Task", which involves human-to-human communication situations at the front desk of a hotel. Bilingual data were produced semi-automatically in three language pairs on the basis of small "seed corpora", obtained from several traveller-oriented booklets. More information
  • ImageCLEF 2016 Handwritten Retrieval Dataset

    The dataset used in the ImageCLEF 2016 Handwritten Scanned Document Retrieval evaluation is now publicly available on Zenodo. More information

Contest

  • The IAM-PRHLT bi-modal Handwritten Text corpus II benchmark

    This is a new benchmark to test and develop word-graph based multimodal protocols. These word-graphs are obtained for any word instance (on-line and off-line) of the biMod-IAM-PRHLT-2 corpus, using the [...] More information
  • The Karyotype benchmark

    This is a new benchmark to test and develop interactive protocols. It contains karyotypes, where each one is composed of 22 chromosome images. The goal is to associate each chromosome [...] More information
  • The Interactive Sequence Labeling benchmark

    The aim of this benchmark is to find new search strategies for passive and active interactive sequence labeling. The corpus is a compilation of handwritten national identification numbers (DNI) from [...] More information
  • Photo-web: Large-scale annotation using general Web data

    Concept detection relies on training data that have been manually, and thus reliably, annotated, an expensive and laborious endeavor that cannot easily scale. To address this issue, this new annotation [...] More information
  • Interactive Image Annotation Benchmark

    The objective of this benchmark is to compare the performance of different strategies for the task of interactive image annotation. The goal of this task is to assign words/tags to [...] More information