Thot: a toolkit for phrase-based statistical machine translation

Thot is an open-source toolkit for statistical machine translation. Originally, Thot incorporated tools to train phrase-based models. The new version of Thot includes a state-of-the-art phrase-based translation decoder as well as tools to estimate all of the models involved in the translation process. Thot can also incrementally update its models in real time each time it is presented with an individual sentence pair.

  • (Ortiz-Martínez et al. 2014) D. Ortiz-Martínez and F. Casacuberta. The New Thot Toolkit for Fully-Automatic and Interactive Statistical Machine Translation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 45-48, Gothenburg, Sweden, April 2014


iGREAT (interactive GREAT)

iGREAT is an open-source, statistical machine translation software toolkit based on finite-state models.

  • Jorge González, Francisco Casacuberta. GREAT: open source software for statistical machine translation. Machine Translation, 2011. Vol. 25 (2), pp. 145-160.
  • J. González and F. Casacuberta. GREAT: a finite-state machine translation toolkit implementing a Grammatical Inference Approach for Transducer Inference (GIATI). In EACL Workshop on Computational Linguistics Aspects of Grammatical Inference, pages 24-32, Athens, Greece, March 30 2009.
  • J. González, G. Sanchis, and F. Casacuberta. Learning finite state transducers using bilingual phrases. In 9th International Conference on Intelligent Text Processing and Computational Linguistics. Lecture Notes in Computer Science, Haifa, Israel, February 17 to 23 2008.


jaf MT: A phrase-based hidden semi-Markov model for SMT

jaf MT is software for training phrase-based hidden semi-Markov models for SMT.

  • Jesús Andrés-Ferrer, Alfons Juan. A phrase-based hidden semi-Markov approach to machine translation. Proceedings of European Association for Machine Translation (EAMT), 2009. pp. 168-175.


More PRHLT software

You can find more software on our GitHub.

The EU corpus

The EU corpus is a corpus extracted from the Bulletin of the European Union, which exists in all official languages of the European Union and is publicly available on the Internet. More information

IBEM Mathematical Formula Detection Dataset

The IBEM dataset consists of 600 documents with a total of 8,272 pages containing 29,593 isolated and 136,635 embedded expressions. This dataset was used in the ICDAR 2021 Competition on Mathematical Formula Detection. More information

The Finnish Court Records Dataset

This dataset is part of “The Finnish Court Records” (FCR) collection held by the National Archives of Finland. More information

The EUTRANS-I Corpus

EUTRANS-I is a simple translation corpus that was produced and used in the EuTrans project. It corresponds to the so-called “Traveller Task”, which involves human-to-human communication situations at the front desk of a hotel. Bilingual data were produced semi-automatically in three language pairs on the basis of small “seed corpora” obtained from several traveler-oriented booklets. More information

The RODRIGO corpus

RODRIGO corresponds to a manuscript from 1545 entitled “Historia de España del arçobispo Don Rodrigo”, written entirely in old Castilian (Spanish) by a single author. It is an 853-page bound volume divided into 307 chapters describing chronicles from Spanish history. Most pages contain only a single text block of nearly calligraphed handwriting on well-separated lines. More information

ImageCLEF 2016 Handwritten Retrieval Dataset

The dataset used in the ImageCLEF 2016 Handwritten Scanned Document Retrieval evaluation is now publicly available on Zenodo. More information

Covid19-MLIA: Machine Translation Task

The PRHLT co-organized a machine translation shared task focused on Covid-19-related texts as part of the Covid19-MLIA event. More information