Detecting idiomatic expressions using word embeddings

Tadej Škvorc; Marko Robnik-Šikonja

doi:10.31449/upinf.63

Authors

Tadej Škvorc, University of Ljubljana, Faculty of Computer and Information Science, Večna pot 113, Ljubljana, Slovenia
Marko Robnik-Šikonja, University of Ljubljana, Faculty of Computer and Information Science, Večna pot 113, Ljubljana, Slovenia

DOI:

https://doi.org/10.31449/upinf.63

Keywords:

multi-word expressions, natural language processing, text mining, word embeddings

Abstract

Idioms cause problems for many natural language processing tasks because they are hard for computers to detect. Detecting such expressions and correctly determining their meaning remains an open problem. Several methods for constructing contextual word embeddings have been proposed in recent years; these can distinguish different meanings of the same word based on its context, which should make them well suited to detecting idioms. Current approaches, however, either do not use embeddings or use non-contextual ones. We show that contextual word embeddings can be used to differentiate between literal and idiomatic word use. We extract various features (e.g., the contextual vectors and the distance to the mean contextual vector for each word) and show that they can be useful for detecting idiomatic expressions in the GloWbE corpus of English texts.
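
One of the features named in the abstract, the distance of a word occurrence to the mean contextual vector of that word, can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine distance, the function names, and the toy three-dimensional vectors are all assumptions made for illustration, and in practice the vectors would come from a contextual embedding model.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros.

    Assumption: the abstract does not say which distance is used.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def distance_to_mean_features(occurrences):
    """Compute one distance-to-mean feature per word occurrence.

    occurrences: list of (word, contextual_vector) pairs.
    Returns a list of distances in the same order as the input.
    """
    by_word = defaultdict(list)
    for word, vec in occurrences:
        by_word[word].append(vec)
    means = {word: mean_vector(vecs) for word, vecs in by_word.items()}
    return [cosine_distance(vec, means[word]) for word, vec in occurrences]


# Toy, made-up vectors standing in for contextual embeddings of "bucket".
toy = [
    ("bucket", [1.0, 0.0, 0.0]),
    ("bucket", [0.9, 0.1, 0.0]),
    ("bucket", [0.0, 0.0, 1.0]),  # an outlying usage of the word
]
distances = distance_to_mean_features(toy)
```

In this toy input the third occurrence lies farther from the mean vector than the first two. Under the assumptions above, a classifier could use such a distance as one signal that a word is being used in an unusual, possibly idiomatic, sense.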