Semantic analysis of offensive language categories from existing annotated corpora
DOI:
https://doi.org/10.31449/upinf.151
Keywords:
offensive language, hate speech, natural language processing, word embeddings
Abstract
Many offensive language corpora exist for English, and they differ in annotation criteria and category naming. In this paper, we explore 21 different categories of offensive language. We use natural language processing techniques to find correlations between the categories based on seven different data sets. We employ several traditional (TF–IDF) and advanced (fastText, GloVe, Word2Vec, BERT, and other deep NLP methods) techniques to uncover similarities among the offensive language categories. The findings reveal that most of the categories are densely interconnected and that they can be organized into a two-level hierarchical representation. We also transfer the analysis to the Slovenian language and compare the findings between the two languages.
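As a rough illustration of the traditional (TF–IDF) end of the comparison described in the abstract, the sketch below merges the texts of each category into one document, weights terms with a smoothed TF–IDF, and compares categories by cosine similarity. It is a minimal sketch under stated assumptions, not the authors' pipeline: the category names, the example sentences, and the helper names `tfidf_vectors` and `cosine` are all hypothetical, and the whitespace tokenizer is a simplification.

```python
import math
from collections import Counter

# Hypothetical toy data: placeholders only, not samples from the paper's corpora.
category_texts = {
    "insult": ["you are a stupid fool", "what a stupid idiot"],
    "abusive": ["you stupid idiot shut up", "shut up fool"],
    "threat": ["i will find you", "i will hurt you"],
}


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per category."""
    # Merge all texts of a category into a single token list.
    tokens = {name: " ".join(texts).lower().split() for name, texts in docs.items()}
    n_docs = len(tokens)
    # Document frequency: in how many categories each term occurs.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    vectors = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        vectors[name] = {
            # Smoothed IDF keeps a non-zero weight for terms found in every category.
            term: (count / len(toks)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = tfidf_vectors(category_texts)
names = sorted(vectors)
# Pairwise category-by-category similarity matrix, keyed by name pairs.
similarity = {(a, b): cosine(vectors[a], vectors[b]) for a in names for b in names}

for a in names:
    print(a, [round(similarity[(a, b)], 2) for b in names])
```

A similarity matrix of this kind could then be passed to a hierarchical clustering step to look for grouped categories. The embedding-based methods named in the abstract would replace the TF–IDF vectors with dense ones.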
Published
2022-05-04
How to Cite
[1]
Kljun, M., Teršek, M. and Žitnik, S. 2022. Semantic analysis of offensive language categories from existing annotated corpora. Applied Informatics. 30, 1 (May 2022). DOI:https://doi.org/10.31449/upinf.151.
Issue
Section
Scientific articles