Slovenian in the age of artificial intelligence

Author: Petra Prešeren Golob

Date: 10 November 2025


The Slovenian language is not only a means of communication but also a cornerstone of national identity. For centuries it has withstood various authorities and influences, but technological progress now poses new challenges. To help Slovenian retain its place in the digital future, experts from the Centre for Language Resources and Technologies at the University of Ljubljana are building GaMS, a generative model for the Slovenian language and a Slovenian counterpart to large language models such as ChatGPT or Claude.

How do large language models work?

Large language models are a type of artificial intelligence that learns language from vast datasets containing massive amounts of text, such as websites, books, articles and other written materials. As they learn, they discover the patterns that connect words and sentences, the structure of the language and the way people express their thoughts. When they are asked a question or given a task, they generate answers based on the patterns they have learned. The key to success, however, is the quality and scope of the training dataset. The higher the quality of the texts a model learns from, the better its understanding of the language, culture and context.
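The idea of learning patterns from text and then generating from them can be illustrated with a deliberately tiny sketch. It is not how GaMS or any real large language model is built: real models use neural networks trained on billions of words, whereas this example only counts which word follows which. The three-sentence corpus and the function names `train_bigrams` and `generate` are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for each word, which words follow it in the corpus."""
    followers = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers


def generate(followers, start, max_words=8, seed=0):
    """Continue from `start`, sampling each next word in proportion to its count."""
    rng = random.Random(seed)
    words = [start]
    # Stop at the length limit or when the last word was never seen with a follower.
    while len(words) < max_words and words[-1] in followers:
        options = followers[words[-1]]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        words.append(nxt)
    return " ".join(words)


# Hypothetical miniature "training set"; real models learn from billions of words.
corpus = [
    "the model learns patterns from text",
    "the model answers questions from learned patterns",
    "better text gives a better model",
]
followers = train_bigrams(corpus)
print(generate(followers, "the"))
```

Even at this scale the output depends entirely on the training sentences: a word the toy model has never seen cannot be continued, which is a small-scale hint of why the quality and scope of the training data matter so much.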

Large language models are trained on trillions of words from books, websites, scientific papers, code and even forum discussions. Photo: Depositphotos

GaMS opens the doors to a technological future

Large language models have transformed our daily lives, changing the ways we communicate, access information and function in the digital space. As their behaviours have been learned mostly from English and other big global languages, they do not work as well in Slovenian. Existing language models lack the cultural specificities of Slovenian, which means they disregard the cultural context, history and customs.

Experts from the University of Ljubljana are developing GaMS, Slovenia’s own generative language model. Photo: PoVeJMo
The program will develop several computationally efficient, open-access large language models. Photo: PoVeJMo


"It is true that any corporation can collect online texts in Slovenian and use them, but building such a language model ourselves gives us independence from their arbitrary decisions. We can decide ourselves to whom it will be available, develop it in a targeted and transparent manner, and monitor the quality of input texts," said Dr Špela Arhar Holdt, coordinator of the text collection campaign and researcher at the Faculty of Computer and Information Science.

40 billion words for the future of Slovenian

The model being developed for Slovenian under the Adaptive Natural Language Processing with Large Language Models programme (known by its Slovenian acronym PoVeJMo, meaning "Let’s Talk") brings many advantages. The more authentic and accurate the language the model learns from, the better it will take into account local cultural specificities and ensure appropriate and effective communication.

The Slovenian large language model will be freely available for various types of use, from integration in medicine and industry to new language resources and technologies for written and spoken Slovenian. This will encourage the further development and competitiveness of tools and services in the Slovenian language.

The Slovenian large language model will be freely accessible to support a variety of uses and innovations. Photo: Depositphotos
With its large language model, the Slovenian language stands alongside the world’s major languages. Photo: Depositphotos


For the GaMS model to function successfully, researchers need a large volume of training data: 40 billion words. The research team has therefore organised a campaign, aimed at the general public, to collect written and spoken texts in Slovenian.

The media, libraries and other large institutions were also invited to participate. The texts can range from everyday writings and emails to academic articles, whether proofread or not. All that matters is that the authors hold the necessary copyrights.

The texts are being collected via the portal Povejmo.si, where you can also test how the model works.

The more texts are collected, the better the language model functions and the larger its linguistic capacity. Photo: Depositphotos

With the GaMS project, Slovenian researchers are proving that even small languages can keep pace with technological advances. They are developing an open, safe and high-quality model that will be available to everyone under the same conditions and will form a central language infrastructure for future generations.