A Comprehensive Evaluation of Commercial Large Language Models for Reasoning in Slovenian Language and Grammar

Authors

Miha Malenšek, Domen Vreš, Marko Bajec

DOI:

https://doi.org/10.31449/upinf.270

Keywords:

Large Language Models (LLM), Evaluation, Grammatical Error Analysis, Natural Language Processing

Abstract

The use of large language models (LLMs) is rapidly expanding in the Slovenian context; however, their actual performance on the Slovenian language remains insufficiently and unsystematically evaluated. In this paper, we present an extensive comparative evaluation of the most widely used commercial and open-source LLMs with respect to Slovenian. The evaluation includes models from four major providers (OpenAI, Google, Anthropic, and Mistral), as well as the domestic models GaMS-27B-Instruct and GaMS3-12B-Instruct, and assesses them using a diverse set of benchmarks targeting instruction following, reasoning abilities, answer reliability, grammatical competence, and textual coherence. We employ translated standardized benchmark tasks (e.g., ARC, HellaSwag, TruthfulQA, GSM8K), the specialized grammatical error dataset DASSLE 1.0, and a collection of real-world conversations from the Slovenian Conversational Arena. The results show that contemporary commercial models achieve high performance on comprehension and reasoning tasks in Slovenian—most notably GPT-5.1 with a high level of deliberative reasoning and Gemini-2.5-Pro—while open models such as Mistral Large 3 attain competitive results despite more limited resources. In contrast, the evaluation of grammatical competence reveals that the morphological and syntactic complexity of Slovenian remains a significant challenge for all evaluated models. Overall, the paper provides a comprehensive overview of the current state of LLM performance for the Slovenian language.

Author Biographies

  • Miha Malenšek, University of Ljubljana, Faculty of Computer and Information Science

    Miha Malenšek is a researcher and doctoral student at the Faculty of Computer and Information Science, University of Ljubljana, employed in the Data Technologies Laboratory. His work focuses primarily on support systems for the safe and traceable use of LLMs in domains where reliable and verifiable use of LLMs is of key importance.

  • Domen Vreš, University of Ljubljana, Faculty of Computer and Information Science

    Domen Vreš is a researcher and doctoral student at the Faculty of Computer and Information Science, University of Ljubljana, employed in the Laboratory for Machine Learning and Language Technologies. His work focuses primarily on training LLMs for the Slovenian language, namely GaMS (Generativni Model Slovenščine).

  • Marko Bajec, University of Ljubljana, Faculty of Computer and Information Science

    Marko Bajec is a full professor at the Faculty of Computer and Information Science, University of Ljubljana, and head of the Data Technologies Laboratory and the IoT Demo Centre. He lectures in several courses in the areas of informatics and databases. In his applied and research work, he focuses on IT governance and the use of data technologies across various domains, such as the Internet of Things, smart cities, smart homes, assisted living, telemedicine, and the like.

Published

2026-05-15

Section

Scientific articles

How to Cite

[1]
Malenšek, M., Vreš, D. and Bajec, M. 2026. A Comprehensive Evaluation of Commercial Large Language Models for Reasoning in Slovenian Language and Grammar. Applied Informatics. 34, 1 (May 2026). DOI: https://doi.org/10.31449/upinf.270.