RESOURCES
PRHLT software
You can find most of our software on our GitHub.
Fake News Detection and Emotion Analysis
A neural network-based tool (Emo–analysis) that compares the language of fake news with that of real news from an emotional perspective, considering several types of false information (propaganda, hoax, clickbait and satire) from online news sources and social networks. False information has been shown to exhibit different emotional patterns in each of these types, and emotions play a key role in misleading the reader. For longer articles, where authors manipulate readers by adding hype or fabricating events that affect their emotions, the FakeFlow model has been used; it models the flow of affective information in fake news articles using a neural architecture. The model learns this flow by combining the topic and affective information extracted from the text.
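The idea of an affective "flow" can be illustrated with a minimal sketch. It is not the FakeFlow implementation: it assumes a tiny hand-made emotion lexicon (`TOY_LEXICON`, hypothetical) and simply counts emotion words in consecutive segments of an article, whereas the actual model is neural and also uses topic information.

```python
# Hypothetical toy lexicon; a real system would use a full emotion lexicon.
TOY_LEXICON = {
    "shocking": "fear",
    "terrifying": "fear",
    "wonderful": "joy",
    "happy": "joy",
}
EMOTIONS = ("fear", "joy")


def split_into_segments(tokens, n_segments):
    """Split tokens into n_segments contiguous chunks of near-equal size."""
    size, extra = divmod(len(tokens), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + size + (1 if i < extra else 0)
        segments.append(tokens[start:end])
        start = end
    return segments


def affective_flow(text, n_segments=3):
    """Return one emotion-count vector per segment, in reading order."""
    tokens = text.lower().split()
    flow = []
    for segment in split_into_segments(tokens, n_segments):
        counts = {emotion: 0 for emotion in EMOTIONS}
        for token in segment:
            emotion = TOY_LEXICON.get(token.strip(".,!?"))
            if emotion is not None:
                counts[emotion] += 1
        flow.append([counts[emotion] for emotion in EMOTIONS])
    return flow


article = (
    "shocking terrifying news today the report says "
    "nothing happened wonderful happy ending"
)
print(affective_flow(article, n_segments=3))
```

Each inner list is the (fear, joy) count for one segment, so the sequence shows how the emotional content changes from the beginning to the end of the article.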
Profiling of Fake News Spreaders
Users play a critical role in creating and spreading fake news online, whether intentionally or unintentionally.
Hoax/Rumour Detection
This system was developed for the RumorEval 2019 shared task, whose main aim was to automatically determine the veracity of rumours. The approach applied to the subtasks combines classic machine learning algorithms with word embeddings and relies on several groups of features: stylistic, lexical, emotional, sentiment, structural and Twitter-based. In addition, it introduces a new set of features that exploit the syntactic information of the texts.
Multimodal Fake News Detection with Textual, Visual and Semantic Information
A multimodal system has been developed for the detection of fake news, combining textual, visual and semantic information. For the textual representation, we use BERT-Base to better capture the underlying semantic and contextual meaning of the text. For the visual representation, we extract image tags from the multiple images contained in the articles using, for example, the VGG-16 model. The semantic information is given by the image-text similarity, computed as the cosine similarity between the title embeddings and the image tag embeddings. The different components are then concatenated to make the final prediction.
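The similarity and concatenation steps described above can be sketched as follows. This is a minimal illustration, not the system's code: the function names and the toy vectors are assumptions, and in practice the inputs would be embeddings produced by models such as BERT-Base and VGG-16.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse_features(text_vec, visual_vec, title_emb, image_tag_emb):
    """Concatenate textual and visual vectors with the title/image-tag similarity."""
    similarity = cosine_similarity(title_emb, image_tag_emb)
    return list(text_vec) + list(visual_vec) + [similarity]


# Toy vectors standing in for real embeddings (hypothetical values).
text_vec = [0.2, 0.1, 0.7]
visual_vec = [0.5, 0.5]
title_emb = [1.0, 0.0]
image_tag_emb = [1.0, 0.0]

fused = fuse_features(text_vec, visual_vec, title_emb, image_tag_emb)
print(len(fused), fused[-1])
```

The fused vector would then be passed to a classifier; that final prediction layer is left out of the sketch.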
Thot: a toolkit for phrase-based statistical machine translation
Thot is an open-source toolkit for statistical machine translation. Originally, Thot incorporated tools to train phrase-based models. The new version of Thot includes a state-of-the-art phrase-based translation decoder as well as tools to estimate all of the models involved in the translation process. In addition, Thot is able to incrementally update its models in real time after being presented with an individual sentence pair.
iGREAT (interactive GREAT)
iGREAT is an open-source statistical machine translation software toolkit based on finite-state models.
jaf MT: A phrase-based hidden semi-Markov Model for SMT
jaf MT is software for training phrase-based hidden semi-Markov models for SMT.
The EU corpus
The EU corpus is a corpus extracted from the Bulletin of the European Union, which exists in all official languages of the European Union and is publicly available on the Internet.
IBEM Mathematical Formula Detection Dataset
The IBEM dataset consists of 600 documents with a total of 8,272 pages containing 29,593 isolated and 136,635 embedded expressions. This dataset was used in the ICDAR 2021 Competition on Mathematical Formula Detection.
The Finnish Court Records Dataset
This dataset is part of “The Finnish Court Records” (FCR) collection held by the National Archives of Finland.
The EUTRANS-I Corpus
EUTRANS-I is a simple translation corpus that was produced and used in the EuTrans project. It corresponds to the so-called “Traveller Task”, which involves human-to-human communication situations at the front desk of a hotel. Bilingual data were produced semi-automatically in three language pairs on the basis of small “seed corpora” obtained from several traveler-oriented booklets.
The RODRIGO corpus
RODRIGO corresponds to a manuscript from 1545 entitled “Historia de España del arçobispo Don Rodrigo”, written entirely in old Castilian (Spanish) by a single author. It is an 853-page bound volume divided into 307 chapters describing chronicles of Spanish history. Most pages contain only a single text block of nearly calligraphic handwriting on well-separated lines.
ImageCLEF 2016 Handwritten Retrieval Dataset
The dataset used in the ImageCLEF 2016 Handwritten Scanned Document Retrieval evaluation is now publicly available on Zenodo.
Covid19-MLIA: Machine Translation Task
The PRHLT co-organized a machine translation shared task focused on Covid-19-related texts as part of the Covid19-MLIA event.