Duration: 1 January 2016 to 31 December 2018
Supported by: under reference TIN2015-71147-C2-1-P

SomEMBED (SOcial Media language understanding – EMBEDing contexts) is a coordinated project whose goal is to advance Computational Linguistics (CL) and Natural Language Processing (NLP) in order to address the challenges posed by language use in social media: (i) from CL, our goal is to develop techniques and methods for modeling non-standard language from representative social media corpora; (ii) from NLP, our goal is to develop new techniques and methods, building on state-of-the-art scientific and technical knowledge, for solving specific tasks within concrete applications. The project comprises three closely related lines of research: 1) the development of methodologies for the automatic extraction of constructions, or syntactic-semantic patterns, in order to represent the content of documents semantically, where novel methods based on continuous representations of text (embeddings) are essential for modeling context effectively and efficiently; 2) the development of applications for specific NLP tasks that improve the automatic understanding of text (for example, the detection of figurative language) and identify key features of author profiles (age, gender, language variety, native language, etc.), with special interest in distinguishing users from different Spanish-speaking countries (Spain, Mexico, Peru, etc.); these features can then be used in tasks such as product and service mining, especially the detection of false opinions; and 3) the creation of linguistic resources, particularly annotated corpora, focused on the analysis of non-standard language, as a basis for the pattern extraction methodology and for the aforementioned applications.
These three lines of research translate into the following objectives: a) experimentation with techniques for generating and comparing continuous representations of text in order to obtain syntactic-semantic patterns; b) the development of NLP applications for social media, such as the detection of deceptive reviews and the identification of figurative language (irony, metaphor, humor, etc.), plus others in the area of author profiling: identification of language variety, native language, and author features (age, gender, political preferences, etc.); and c) the creation of the basic infrastructure of language resources (corpora), specific applications, and analysis of non-standard language. Achieving these goals requires a multidisciplinary approach involving language processing specialists from both linguistics and computer engineering. The applicant groups meet these requirements and have a long history of collaboration, reflected in publications, joint events, and joint participation in projects.