Semantic analysis of offensive language categories from existing annotated corpora

Authors

  • Maša Kljun
  • Matija Teršek, Student at the Faculty of Computer and Information Science
  • Slavko Žitnik

DOI:

https://doi.org/10.31449/upinf.151

Keywords:

offensive language, hate speech, natural language processing, word embeddings

Abstract

There is a large number of offensive language corpora for the English language, with widely varying annotation criteria and category naming. In this paper, we explore 21 different categories of offensive language. We use natural language processing techniques to find correlations between the categories based on seven different data sets. We employ several traditional (TF–IDF) and advanced (fastText, GloVe, Word2Vec, BERT, and other deep NLP methods) techniques to uncover similarities among the categories. The findings reveal that most of the categories are densely interconnected and that they can be organized into a two-level hierarchical representation. We also transfer the analysis to the Slovenian language and compare the findings for the two languages.

Published

2022-05-04

How to Cite

[1]
Kljun, M., Teršek, M. and Žitnik, S. 2022. Semantic analysis of offensive language categories from existing annotated corpora. Applied Informatics. 30, 1 (May 2022). DOI:https://doi.org/10.31449/upinf.151.

Section

Scientific articles