A Comprehensive Evaluation of Commercial Large Language Models for Reasoning in Slovenian Language and Grammar
DOI: https://doi.org/10.31449/upinf.270

Keywords: Large Language Models (LLM), Evaluation, Grammatical Error Analysis, Natural Language Processing

Abstract
The use of large language models (LLMs) is rapidly expanding in the Slovenian context; however, their actual performance on the Slovenian language has not yet been systematically evaluated. In this paper, we present an extensive comparative evaluation of the most widely used commercial and open-source LLMs with respect to Slovenian. The evaluation includes models from four major providers (OpenAI, Google, Anthropic, and Mistral), as well as the domestic models GaMS-27B-Instruct and GaMS3-12B-Instruct, and assesses them using a diverse set of benchmarks targeting instruction following, reasoning abilities, answer reliability, grammatical competence, and textual coherence. We employ translated standardized benchmark tasks (e.g., ARC, HellaSwag, TruthfulQA, GSM8K), the specialized grammatical error dataset DASSLE 1.0, and a collection of real-world conversations from the Slovenian Conversational Arena. The results show that contemporary commercial models achieve high performance on comprehension and reasoning tasks in Slovenian—most notably GPT-5.1 with a high level of deliberative reasoning and Gemini-2.5-Pro—while open models such as Mistral Large 3 attain competitive results despite more limited resources. In contrast, the evaluation of grammatical competence reveals that the morphological and syntactic complexity of Slovenian remains a significant challenge for all evaluated models. Overall, the paper provides a comprehensive overview of the current state of LLM performance for the Slovenian language.
Copyright (c) 2026 Applied Informatics

This work is licensed under a Creative Commons Attribution 4.0 International License.
