Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion‑Tagged Social Media Comments in Spanish
Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time consuming and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to...
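| Main authors: | Tessore, Juan Pablo; Esnaola, Leonardo Martín; Lanzarini, Laura; Baldassarri, Sandra |
|---|---|
| Other authors: | |
| Format: | Article acceptedVersion |
| Language: | English |
| Published: | Springer Science+Business Media LLC, 2021 |
| Subjects: | Sentiment analysis; Dataset construction; Dataset validation; Text mining |
| Online access: | https://repositorio.unnoba.edu.ar/xmlui/handle/23601/142 |
| Contributed by: | |
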
| id | I103-R405-23601-142 |
|---|---|
| record_format | dspace |
| institution | Universidad Nacional del Noroeste de la Provincia de Buenos Aires |
| institution_str | I-103 |
| repository_str | R-405 |
| collection | Re DI Repositorio Digital UNNOBA |
| language | English |
| topic | Sentiment analysis; Dataset construction; Dataset validation; Text mining |
| spellingShingle | Sentiment analysis; Dataset construction; Dataset validation; Text mining; Tessore, Juan Pablo; Esnaola, Leonardo Martín; Lanzarini, Laura; Baldassarri, Sandra; Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion‑Tagged Social Media Comments in Spanish |
| topic_facet | Sentiment analysis; Dataset construction; Dataset validation; Text mining |
| description | Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time-consuming, and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity, and thus redundant classification is required to validate the assigned tag. Although emotion-tagged text datasets in Spanish have been growing in number in recent years, they cannot be compared with the datasets in English in number, size, or quality. Quality is a particular concern, as few datasets in Spanish included a validation step in their construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss' kappa interrater agreement measure. Error analysis is performed using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundred thousand tagged comments and does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. Every measure presented falls in the moderate agreement zone, so the dataset is suitable for training classification algorithms in the sentiment analysis field. |
| author2 | 0000-0002-2111-0976 |
| author_facet | 0000-0002-2111-0976; Tessore, Juan Pablo; Esnaola, Leonardo Martín; Lanzarini, Laura; Baldassarri, Sandra |
| format | Article acceptedVersion |
| author | Tessore, Juan Pablo; Esnaola, Leonardo Martín; Lanzarini, Laura; Baldassarri, Sandra |
| author_sort | Tessore, Juan Pablo |
| title | Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion‑Tagged Social Media Comments in Spanish |
| title_short | Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion‑Tagged Social Media Comments in Spanish |
| title_full | Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion‑Tagged Social Media Comments in Spanish |
| title_fullStr | Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion‑Tagged Social Media Comments in Spanish |
| title_full_unstemmed | Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion‑Tagged Social Media Comments in Spanish |
| title_sort | distant supervised construction and evaluation of a novel dataset of emotion‑tagged social media comments in spanish |
| publisher | Springer Science+Business Media LLC |
| publishDate | 2021 |
| url | https://repositorio.unnoba.edu.ar/xmlui/handle/23601/142 |
| work_keys_str_mv | AT tessorejuanpablo distantsupervisedconstructionandevaluationofanoveldatasetofemotiontaggedsocialmediacommentsinspanish AT esnaolaleonardomartin distantsupervisedconstructionandevaluationofanoveldatasetofemotiontaggedsocialmediacommentsinspanish AT lanzarinilaura distantsupervisedconstructionandevaluationofanoveldatasetofemotiontaggedsocialmediacommentsinspanish AT baldassarrisandra distantsupervisedconstructionandevaluationofanoveldatasetofemotiontaggedsocialmediacommentsinspanish |
| _version_ | 1850060716846874624 |
| spelling | I103-R405-23601-142 2021-07-26T14:44:03Z; Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion‑Tagged Social Media Comments in Spanish; Tessore, Juan Pablo; Esnaola, Leonardo Martín; Lanzarini, Laura; Baldassarri, Sandra; 0000-0002-2111-0976; 0000-0001-6298-9019; 0000-0001-7027-7564; 0000-0002-9315-6391; Sentiment analysis; Dataset construction; Dataset validation; Facebook; Text mining; Affiliation: Tessore, Juan Pablo. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Escuela de Tecnología. Instituto de Investigación y Transferencia en Tecnología, Centro Asociado CIC; Argentina. Affiliation: Tessore, Juan Pablo. Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Ciudad Autónoma de Buenos Aires, Argentina. Affiliation: Esnaola, Leonardo Martín. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Escuela de Tecnología. Instituto de Investigación y Transferencia en Tecnología, Centro Asociado CIC; Argentina. Affiliation: Lanzarini, Laura. Facultad de Informática, Instituto de Investigación en Informática LIDI (Centro CICPBA), Universidad Nacional de La Plata, La Plata, Buenos Aires, Argentina. Affiliation: Baldassarri, Sandra. Departamento de Informática e Ingeniería de Sistemas, Universidad de Zaragoza, Aragon, Zaragoza, Spain. Affiliation: Baldassarri, Sandra. Instituto de Investigación en Ingeniería (I3A), Universidad de Zaragoza, Zaragoza, Aragon, Spain. Peer reviewed; 2021-07-26T14:44:02Z; info:eu-repo/date/embargoEnd/2022-01-17; 2021-07-26T14:44:02Z; 2021-01-18; info:eu-repo/semantics/article; info:ar-repo/semantics/artículo; info:eu-repo/semantics/acceptedVersion; Tessore, J.P., Esnaola, L.M., Lanzarini, L. et al. Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish. Cogn Comput (2021). https://doi.org/10.1007/s12559-020-09800-x; 1866-9964; 1866-9956; https://repositorio.unnoba.edu.ar/xmlui/handle/23601/142; eng; info:eu-repo/grantAgreement/UNNOBA/SIB2017/EXP 195/2017/AR. Buenos Aires/Tecnología y Aplicaciones de Sistemas de Software: Calidad e Innovación en procesos, productos y servicios; https://link.springer.com/article/10.1007/s12559-020-09800-x; info:eu-repo/semantics/embargoedAccess; https://creativecommons.org/licenses/by-nc-nd/2.5/ar/; application/pdf; application/pdf; text/plain; Springer Science+Business Media LLC; Cognitive Computation |
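
The description above says a sample of the dataset was validated with the Fleiss Kappa interrater agreement measure and that the results fall in the "moderate agreement zone". As a rough illustration of what that measure computes, here is a minimal Python sketch. It is not the authors' code: the function names `fleiss_kappa` and `agreement_band`, the toy rating table, and the use of the commonly cited Landis and Koch bands to name the zones are all assumptions made for this example, since the record does not say which scale the article uses.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("at least one rated item is required")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


def agreement_band(kappa: float) -> str:
    """Name the agreement zone using the Landis and Koch bands
    (an assumption here; the record does not name the scale)."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


if __name__ == "__main__":
    # Made-up toy data: 4 comments, 3 raters, 3 hypothetical emotion tags.
    toy = [
        [3, 0, 0],
        [2, 1, 0],
        [0, 3, 0],
        [0, 1, 2],
    ]
    k = fleiss_kappa(toy)
    print(f"kappa = {k:.3f} ({agreement_band(k)})")
```

To compare human raters against the automatically acquired tag, as the description reports, one option under the same assumptions would be to treat the original tag as one more rater and add its vote to each row before calling `fleiss_kappa`.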