Comparative analysis of preprocessing tasks over social media texts in Spanish

Bibliographic Details
Main authors: Tessore, Juan Pablo; Esnaola, Leonardo Martín; Russo, Claudia Cecilia; Baldassarri, Sandra
Other authors: 0000-0002-2111-0976
Format: Conference paper (acceptedVersion)
Language: English
Published: Association for Computing Machinery (ACM) 2021
Online access: https://repositorio.unnoba.edu.ar/xmlui/handle/23601/143
Description
Summary: One of the key aspects of texts coming from social media is that they tend to be very noisy, mainly because of informal language and non-standard grammatical structures. To use this content as input for a text analysis process, it is therefore highly recommended to first clean the data and reduce its noise. This work focuses on measuring the effectiveness of diverse cleaning and repairing tasks on the data. The results indicate that removing tokens that contain no letters and correcting stressed words are the most effective tasks. In addition, some tasks, such as hashtag or username processing, which perform very well on other datasets, are not that relevant for this one. This research is part of a broader project that aims to build an automatic emotion classifier that takes the preprocessed comments as input.
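
As an illustration of the first task named in the summary, removing tokens that contain no letters, here is a minimal Python sketch. It is not the authors' implementation: the function name `remove_letterless_tokens` and the whitespace-based tokenization are assumptions made only for this example.

```python
def remove_letterless_tokens(text: str) -> str:
    """Drop whitespace-delimited tokens that contain no alphabetic character.

    Hypothetical helper: the record does not describe how the authors
    tokenize comments, so a plain whitespace split is assumed here.
    """
    tokens = text.split()
    # str.isalpha() is Unicode-aware, so accented Spanish letters count as letters.
    kept = [tok for tok in tokens if any(ch.isalpha() for ch in tok)]
    return " ".join(kept)


cleaned = remove_letterless_tokens("jajaja !!! 123 qué bueno :) ...")
print(cleaned)  # prints "jajaja qué bueno"
```

Tokens made up only of digits, punctuation, or symbol-based emoticons are discarded, while any token with at least one letter is kept unchanged.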