PRHLT software

You can find most of our software on our GitHub.

Fake News Detection and Emotion Analysis

A neural network-based tool (Emoanalysis) that compares the language of fake news with that of real news from an emotional perspective, considering a set of false information types (propaganda, hoax, clickbait and satire) from online news sources and social networks. False information has been shown to have different emotional patterns in each of these types, and emotions play a key role in misleading the reader. For longer articles, where authors manipulate readers by adding hype or fabricating events that affect their emotions, the FakeFlow model has been used; it models the flow of affective information in fake news articles with a neural architecture. The model learns this flow by combining the topic and affective information extracted from the text.

  • Ghanem B., Rosso P., Rangel F. (2020). An Emotional Analysis of False Information in Social Media and News Articles. In: ACM Transactions on Internet Technology (TOIT), 20(2): 1-18
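
As a rough illustration of what a "flow of affective information" can look like, the sketch below splits an article into contiguous segments and computes, for each segment, the share of words that belong to each emotion. This is not the FakeFlow implementation: the segmentation scheme, the function names and the tiny lexicon are all assumptions made for the example, and a real system would rely on an established emotion lexicon and a neural model on top of these features.

```python
import re

# Hypothetical toy lexicon, invented for this example only.
TOY_EMOTION_LEXICON = {
    "fear": {"terrifying", "threat", "panic"},
    "joy": {"happy", "celebrate", "wonderful"},
}


def split_into_segments(tokens, n_segments):
    """Split a token list into n_segments contiguous, near-equal chunks."""
    if n_segments <= 0:
        raise ValueError("n_segments must be positive")
    size, extra = divmod(len(tokens), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        # The first `extra` segments receive one additional token.
        end = start + size + (1 if i < extra else 0)
        segments.append(tokens[start:end])
        start = end
    return segments


def affective_flow(text, n_segments=3):
    """Return one dict per segment mapping emotion -> share of segment tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    flow = []
    for segment in split_into_segments(tokens, n_segments):
        total = max(len(segment), 1)  # avoid division by zero on empty segments
        flow.append({
            emotion: sum(tok in words for tok in segment) / total
            for emotion, words in TOY_EMOTION_LEXICON.items()
        })
    return flow


example = (
    "Panic spreads as a terrifying threat looms. Officials met today. "
    "People celebrate a wonderful happy ending."
)
flow = affective_flow(example, n_segments=3)
for index, scores in enumerate(flow):
    print(index, scores)
```

In this toy article the fear share is concentrated in the first segment and the joy share in the last one; a sequence of per-segment vectors like this is the kind of signal a sequence model could consume alongside topic features.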


Profiling of Fake News Spreaders

Users play a critical role in creating and spreading fake news online, whether intentionally or unintentionally.
Because it is difficult to know which articles contain false information, fact-checking websites have been developed to raise awareness of fabricated content. Users of those platforms who actively cite evidence to refute fake news and warn other users are known as fact-checkers. Users who tend to share false information are known as fake news spreaders. The CheckerOrSpreader model classifies a user as a potential checker or a potential spreader. It is based on a convolutional neural network (CNN) and combines word embeddings with features that represent users’ personality traits and the linguistic patterns used in their tweets. Using linguistic patterns together with personality traits improves the differentiation between checkers and spreaders.

  • Giachanou A., Ríssola E., Ghanem B., Crestani F., Rosso P. (2020) The Role of Personality and Linguistic Patterns in Discriminating Between Fake News Spreaders and Fact Checkers. In: Proc. 25th Int. Conf. on Applications of Natural Language to Information Systems, NLDB-2020, Springer Verlag, LNCS(12089), pp.181-192


Hoax/Rumour Detection

This system was developed to participate in the RumourEval 2019 shared task, whose main aim was to automatically determine the veracity of rumours. The approach applied to the subtasks exploits both classic machine learning algorithms and word embeddings, and is based on several groups of features: stylistic, lexical, emotional, sentiment, structural and Twitter-based. In addition, it introduces a new set of features that take advantage of the syntactic information of the texts.

  • Ghanem B., Cignarella A., Bosco C., Rosso P., Rangel F. (2019) UPV-28-UNITO at SemEval-2019 Task 7: Exploiting Post’s Nesting and Syntax Information for Rumor Stance Classification. In: Proc. of the 13th Int. Workshop on Semantic Evaluation (SemEval-2019), co-located with the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, Minnesota, USA, June 6-7, pp. 1125–1131


Multimodal Fake News Detection with Textual, Visual and Semantic Information

A multimodal system has been developed for the detection of fake news, combining textual, visual and semantic information. For the textual representation, we use BERT-Base to better capture the underlying semantic and contextual meaning of the text. For the visual representation, we extract image tags from the multiple images contained in the articles using, for example, the VGG-16 model. The semantic information is the image-text similarity, computed as the cosine similarity between the embeddings of the title and of the image tags. The different components are then concatenated to make the final prediction.

  • Giachanou A., Zhang G., Rosso P. (2020) Multimodal Multi-image Fake News Detection. In: Proc. 7th IEEE International Conference on Data Science and Advanced Analytics, DSAA-2020, pp. 647-654
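
The similarity and concatenation steps described above can be sketched in a few lines. This is a minimal illustration, not the published system: the function names, the vector sizes and the toy values are assumptions, and in practice the inputs would come from BERT-Base and from embeddings of the VGG-16 image tags rather than from hand-written lists.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


def fuse_features(text_vec, visual_vec, title_emb, image_tag_emb):
    """Concatenate textual features, visual features and the image-text similarity."""
    similarity = cosine_similarity(title_emb, image_tag_emb)
    return list(text_vec) + list(visual_vec) + [similarity]


# Toy, made-up vectors standing in for real model outputs.
text_vec = [0.2, -0.1, 0.7]       # hypothetical textual representation
visual_vec = [0.5, 0.3]           # hypothetical visual representation
title_emb = [1.0, 0.0, 1.0]       # hypothetical title embedding
image_tag_emb = [1.0, 0.0, 1.0]   # hypothetical image-tag embedding

features = fuse_features(text_vec, visual_vec, title_emb, image_tag_emb)
print(len(features))  # three textual + two visual + one similarity value
```

The fused vector would then be passed to a classifier that outputs the fake or real prediction; that final layer is omitted here because the text above does not specify it.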


Thot: a toolkit for phrase-based statistical machine translation

Thot is an open-source toolkit for statistical machine translation. Originally, Thot incorporated tools to train phrase-based models. The new version also includes a state-of-the-art phrase-based translation decoder as well as tools to estimate all of the models involved in the translation process. In addition, Thot can incrementally update its models in real time after being presented with an individual sentence pair.

  • (Ortiz-Martínez et al. 2014) D. Ortiz-Martínez and F. Casacuberta. The New Thot Toolkit for Fully-Automatic and Interactive Statistical Machine Translation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 45-48, Gothenburg, Sweden, April 2014


iGREAT (interactive GREAT)

iGREAT is an open-source, statistical machine translation software toolkit based on finite-state models.

  • Jorge González, Francisco Casacuberta. GREAT: open source software for statistical machine translation. Machine translation, 2011. Vol. 25 (2), pp. 145-160.
  • J. González and F. Casacuberta. GREAT: a finite-state machine translation toolkit implementing a Grammatical Inference Approach for Transducer Inference (GIATI). In EACL Workshop on Computational Linguistics Aspects of Grammatical Inference, pages 24-32, Athens, Greece, March 30 2009.
  • J. González, G. Sanchis, and F. Casacuberta. Learning finite state transducers using bilingual phrases. In 9th International Conference on Intelligent Text Processing and Computational Linguistics. Lecture Notes in Computer Science, Haifa, Israel, February 17 to 23 2008.


jaf MT: A phrase-based hidden semi-Markov Model for SMT

jaf MT is software for training phrase-based hidden semi-Markov models for SMT.

  • Jesús Andrés-Ferrer, Alfons Juan. A phrase-based hidden semi-Markov approach to machine translation. Proceedings of the European Association for Machine Translation (EAMT), 2009. pp. 168-175.


The EU corpus

The EU corpus is a corpus extracted from the Bulletin of the European Union, which exists in all official languages of the European Union and is publicly available on the Internet. More information

IBEM Mathematical Formula Detection Dataset

The IBEM dataset consists of 600 documents with a total of 8,272 pages containing 29,593 isolated and 136,635 embedded expressions. This dataset was used in the ICDAR 2021 Competition on Mathematical Formula Detection. More information

The Finnish Court Records Dataset

This dataset is part of the “Finnish Court Records” (FCR) collection held by the National Archives of Finland. More information

The EUTRANS-I Corpus

EUTRANS-I is a simple translation corpus which was produced and used in the EuTrans project. It corresponds to the so-called “Traveller Task”, which involves human-to-human communication situations at the front desk of a hotel. Bilingual data were produced semi-automatically in three language pairs on the basis of small “seed corpora” obtained from several traveller-oriented booklets. More information

The RODRIGO corpus

RODRIGO corresponds to a manuscript from 1545 entitled “Historia de España del arçobispo Don Rodrigo”, written entirely in old Castilian (Spanish) by a single author. It is an 853-page bound volume divided into 307 chapters describing chronicles of Spanish history. Most pages contain only a single text block of nearly calligraphic handwriting on well-separated lines. More information

ImageCLEF 2016 Handwritten Retrieval Dataset

The dataset used in the ImageCLEF 2016 Handwritten Scanned Document Retrieval evaluation is now publicly available on Zenodo. More information

Covid19-MLIA: Machine Translation Task

PRHLT co-organized a machine translation shared task focused on Covid-19 related texts as part of the Covid19-MLIA event. More information