
MamayLM v1.0
The First Open Multimodal Ukrainian LLM

MamayLM Thumbnail

MamayLM can now see! We are releasing MamayLM v1.0, the best-performing efficient Ukrainian language model that surpasses all similar-sized models in both English and Ukrainian, while matching or overtaking up to 5x larger models.

We are delighted to announce the release of MamayLM v1.0, a new state-of-the-art LLM targeting the Ukrainian language. We are releasing the model in two sizes - 4B and 12B - both of which are cost-efficient, fast, multimodal and can be run on a single GPU, yet are effective in both Ukrainian and English. The model outpaces open models of similar size in both languages, while matching or comparing favourably against much larger models. MamayLM is a result of the research done at the INSAIT Institute: the first MamayLM v0.1 release saw wide adoption with more than 10,000 downloads and many positive reviews, providing a foundation for further multilingual development. The new version brings the following updates:

  1. Stronger base model: Using Gemma 3 as the base model provides improved performance and capabilities for Ukrainian language tasks.
  2. Multimodality: The model now supports multiple modalities, including text and images, enabling a wider range of applications and use cases in both English and Ukrainian. It shows strong visual question answering capabilities and can understand questions about images involving localized context.
  3. Longer context: The new version is optimized to handle longer context lengths, allowing it to better capture complex dependencies and relationships in text. This enables processing and reasoning over much larger documents (or several documents) at once.

Enriching the Training Data for Ukrainian

In our v0.1 version we successfully adapted Gemma 2 to the Ukrainian language, based on our research on language transfer. Now, taking Gemma 3 as the base model with even more powerful multilingual (and multimodal!) capabilities, we applied a similar pipeline of data curation, continual pretraining and instruction fine-tuning, with notable improvements in several aspects, to adapt Gemma 3 4B and 12B to Ukrainian using a total of 81B tokens of Ukrainian and English text.

Pre-Training Phase

In the previous version, our Ukrainian pre-training data was based on the FineWeb2, Malyuk, and CulturaX datasets. For the current v1.0 release, we switched to the Kobza dataset, which builds upon the same sources while integrating HPLT. Kobza also includes fuzzy deduplication and leverages a wider range of web data, as HPLT follows a different pipeline and collects multilingual content from diverse sources. Since FineWeb2 and CulturaX rely on overlapping data and share a similar knowledge cut-off date, we selected the FineWeb2 and UberText (Ukrainian news) subsets within Kobza to maximize coverage. This approach provides a larger and more diverse foundation for our pre-training corpus. Additionally, we applied a data rehydration technique by incorporating the Ukrainian Wikipedia subset, ensuring greater emphasis on high-quality content.

During pretraining, we used best-fit packing to pack sequences at the desired context length, preserving data structure and coherence with minimal disruption. This approach enhances context learning and improves language reasoning capabilities. To prevent catastrophic forgetting, we included a small proportion of English-centric data, such as English Wikipedia, Smoltalk and Mixture of Thoughts.
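To make the packing step concrete, below is a minimal Python sketch of best-fit packing over tokenized documents; the function name, the 8192-token context length and the data representation are illustrative assumptions rather than our exact pipeline.

def best_fit_pack(docs, max_len=8192):
    """Pack tokenized documents into fixed-length training sequences using
    best-fit decreasing bin packing, so documents are split as little as
    possible instead of being cut at arbitrary boundaries."""
    # Long documents are first divided into chunks of at most max_len tokens.
    chunks = [doc[i:i + max_len] for doc in docs for i in range(0, len(doc), max_len)]
    bins = []  # each bin: {"free": remaining tokens, "docs": list of chunks}
    for chunk in sorted(chunks, key=len, reverse=True):
        # Best fit: the bin with the least remaining space that still holds the chunk.
        best = min(
            (b for b in bins if b["free"] >= len(chunk)),
            key=lambda b: b["free"],
            default=None,
        )
        if best is None:  # no existing bin fits, open a new one
            best = {"free": max_len, "docs": []}
            bins.append(best)
        best["docs"].append(chunk)
        best["free"] -= len(chunk)
    # Concatenate the chunks in each bin into one packed training sequence.
    return [[tok for d in b["docs"] for tok in d] for b in bins]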

Post-Training Phase

Similarly to the v0.1 version, for the post-training stage we extracted topics relevant to Ukrainian history and culture, which enabled the generation of a synthetic dataset of Ukrainian QA pairs using knowledge distillation from a larger model. We also employed our LLM-based translation pipeline to translate domain-specific data to Ukrainian, enhancing both quantity and quality in the target language.
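As a rough illustration of this step, the sketch below generates topic-conditioned Ukrainian QA pairs with an off-the-shelf instruction-tuned teacher; the teacher model, prompt wording and sampling settings are placeholders and not our actual distillation setup.

from transformers import pipeline

# Placeholder teacher model; the actual distillation teacher is not specified here.
teacher = pipeline("text-generation", model="Qwen/Qwen2.5-72B-Instruct", device_map="auto")

PROMPT = (
    "Write one question and a concise, factually correct answer in Ukrainian "
    "about the following topic: {topic}\n"
    "Format:\nQuestion: ...\nAnswer: ..."
)

def generate_qa(topics, samples_per_topic=3):
    """Generate synthetic Ukrainian QA pairs conditioned on extracted topics."""
    qa_pairs = []
    for topic in topics:
        outputs = teacher(
            PROMPT.format(topic=topic),
            max_new_tokens=256,
            num_return_sequences=samples_per_topic,
            do_sample=True,
            temperature=0.8,
            return_full_text=False,
        )
        qa_pairs.extend(out["generated_text"] for out in outputs)
    return qa_pairs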

Our instruction-tuning dataset incorporates various open-source datasets, such as the Nemotron SFT and Post-Training datasets, OpenCoder (OPC) SFT dataset, Aya Collection and more. We acknowledge the significant contributions of the Ukrainian open-source community, particularly creators of Spivavtor, UAlpaca, UA-Squad, Ukrainian StackExchange, Crimean Tatar Parallel Corpora and UA-Lawyer QA, which amplify the potential of Ukrainian post-training.

Adapting Gemma 3 to Ukrainian Language

Training Process

In the pre-training stage we split the dataset into two parts based on the different massive web-sourced datasets and re-introduced smaller domain-specific datasets in both splits. After training on the different splits, we applied the model souping technique to combine the resulting checkpoints, which increased pre-training performance dramatically.
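For illustration, the souping step can be as simple as uniform weight averaging of the two pre-trained checkpoints; the checkpoint names below are placeholders, and the actual recipe may weight the runs differently.

import torch
from transformers import AutoModelForCausalLM

# Placeholder names for the two checkpoints pre-trained on the different data splits.
model_a = AutoModelForCausalLM.from_pretrained("mamaylm-pt-split-a", torch_dtype=torch.float32)
model_b = AutoModelForCausalLM.from_pretrained("mamaylm-pt-split-b", torch_dtype=torch.float32)

sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
# Model soup: average the two checkpoints parameter by parameter.
souped = {name: (sd_a[name] + sd_b[name]) / 2 for name in sd_a}
model_a.load_state_dict(souped)
model_a.save_pretrained("mamaylm-pt-souped")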

In the post-training stage, we trained English- and Ukrainian-focused instruction-tuned models separately and later combined them into a final, stronger version. This separated approach lets us increase performance in both languages even further, thanks to data targeted at each specific language. We also applied an advanced model merging technique inspired by Layer Swapping to more precisely extract linguistic capabilities. We further took into account findings on language imbalances and model merging, which highlight the impact of data mixing proportions on model performance.
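A simplified sketch of a layer-swapping-style merge is shown below: selected transformer blocks of the English-focused model are replaced by the corresponding blocks of the Ukrainian-focused model. The checkpoint names and layer indices are illustrative assumptions, not our exact merging recipe.

import re
import torch
from transformers import AutoModelForCausalLM

# Placeholder names for the two language-focused instruction-tuned checkpoints.
base = AutoModelForCausalLM.from_pretrained("mamaylm-sft-en", torch_dtype=torch.float32)
expert = AutoModelForCausalLM.from_pretrained("mamaylm-sft-uk", torch_dtype=torch.float32)

# Illustrative choice: take the bottom and top transformer blocks from the Ukrainian expert.
swap_layers = set(range(0, 4)) | set(range(44, 48))

merged = base.state_dict()
for name, param in expert.state_dict().items():
    match = re.search(r"layers\.(\d+)\.", name)
    if match and int(match.group(1)) in swap_layers:
        merged[name] = param  # swap in the expert's weights for this block
base.load_state_dict(merged)
base.save_pretrained("mamaylm-merged")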

Our pipeline enables us not only to preserve visual and long-context capabilities, but even to improve them for both languages without datasets targeted at those domains. We believe that visual multilingual performance depends strongly on the model's linguistic capabilities in the given languages; accordingly, we observe improvements on visual benchmarks without training on text-image data.

Multimodal Performance

MamayLM v1.0 now supports visual input alongside text, thanks to the multimodal support of the Gemma 3 models. This is a significant advancement over the previous version, which was limited to text-only processing. Even though our training corpus consisted only of text data, MamayLM inherited the visual understanding capabilities of the base model, which we managed to preserve during training. As a result, the tuned model shows improved results on visual evaluations for both English and Ukrainian without any image training data! This can be explained by the model architecture: multimodal capabilities rely mostly on the language performance of the text model itself, while the vision tower is used only to convert visual input into a format understandable to the language model. The enhanced multimodal capabilities of MamayLM v1.0 open up new possibilities for applications that require understanding and generating content based on both text and visual inputs, useful for administrative adoption and various other use cases.

MamayLM Visual Question Answering
MamayLM is now able to understand and answer Ukrainian-specific questions in visual format
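A minimal example of visual question answering with the released checkpoint is shown below, assuming it exposes the standard Gemma 3 image-text interface in transformers; the image URL and the question are placeholders.

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0"
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/monument.jpg"},  # placeholder image URL
        {"type": "text", "text": "Що зображено на цьому фото?"},  # "What is shown in this photo?"
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))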

Evaluating MamayLM v1.0 12B Capabilities

We evaluated MamayLM on a set of standard English benchmarks, a translated version of them in Ukrainian, as well as Ukrainian-specific benchmarks we collected:

  1. ZNO: mandatory testing of knowledge of the Ukrainian high school curriculum in Ukrainian language & literature, history, mathematics and geography
  2. Winogrande challenge: testing commonsense reasoning and world knowledge
  3. Hellaswag: testing sentence completion
  4. ARC Easy/Challenge: testing logical reasoning
  5. TriviaQA: testing trivia knowledge
  6. GSM-8K: solving grade-school mathematics word problems
  7. MMLU: testing knowledge on a multitude of topics
  8. IFEval: testing instruction-following skills

We took on the challenge of finding the best translation method for the English-only benchmarks. Although some effort had already been made in this direction, we found it was not extensive enough and that the Ukrainian translations could be improved. We identified two key issues in benchmark translation:

  1. the separation of question and answer during translation;
  2. translation quality heavily relying on few-shot prompting or additional model output verification.

To address these issues, we developed a translation framework that preserves the context of both questions and answers. It also employs multisampling and scoring of translation candidates to optimize the balance between machine translation quality and human involvement, ensuring maximum efficiency. All adapted benchmarks for Ukrainian are available in the corresponding GitHub repository.
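To make the approach concrete, below is a simplified sketch of the candidate-selection step: each benchmark item is translated as one unit (question together with its options), several candidates are sampled, and the highest-scoring one is kept. The translate and score callables stand in for whichever MT model and quality metric are plugged in; this is not our exact framework.

def translate_item(item, translate, score, num_candidates=4):
    """Translate a benchmark item as a single unit and keep the best candidate.

    `translate(text)` returns one sampled Ukrainian translation of `text`;
    `score(source, translation)` returns a quality estimate (e.g. from an
    LLM judge or a learned MT metric). Both are placeholders.
    """
    # Keep the question and its answer options together so terminology stays consistent.
    source = item["question"] + "\n" + "\n".join(item["options"])
    candidates = [translate(source) for _ in range(num_candidates)]
    best = max(candidates, key=lambda cand: score(source, cand))
    lines = best.split("\n")
    return {"question": lines[0], "options": lines[1:]}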

Performance Against Similarly Sized Models

As illustrated by the figures below, across all benchmarks, MamayLM outperforms all similarly sized models (even outperforming much bigger 70B models on Ukrainian!). It does so in both English and Ukrainian, thanks to the particular method used to train MamayLM (mentioned above).

MamayLM English evaluation
Average score across reported English text benchmarks
MamayLM Ukrainian evaluation
Average score across reported Ukrainian text benchmarks

Performance Against Larger Models

We also evaluated MamayLM v1.0 against current state-of-the-art LLMs. Impressively, our model outperforms models up to 6 times larger across various benchmarks, including those specific to Ukrainian contexts, as shown in the figure below.

MamayLM Ukrainian evaluation
Scores across specific reported Ukrainian text benchmarks and comparison with the bigger models

Performance on Mandatory National Ukrainian Exams (ZNO)

Importantly, as the figure below shows, MamayLM v1.0 achieves the highest score on the ZNO (National Ukrainian) high school exams amongst similarly sized models, while outperforming much larger models, including Gemma 3 27B, Llama 3.1 70B and Qwen 2.5 72B.

MamayLM ZNO evaluation
ZNO evaluation results

The results show that MamayLM models lead the evaluations in Ukrainian language and cultural understanding. While version v0.1 achieved an outstanding score that remains difficult to surpass, our new version delivers overall performance gains across modalities and now includes enhanced visual capabilities.

Performance on Visual Benchmarks

We also evaluated MamayLM v1.0 on visual benchmarks, where it demonstrates strong performance in both Ukrainian and English. The model's ability to understand and generate text based on visual inputs highlights its versatility and effectiveness across different modalities.

To assess English performance we use the original MMMU benchmark, where our trained model shows improved performance compared to the base version.

MamayLM MMMU evaluation
MMMU evaluation results

To monitor Ukrainian visual performance we used ZNO-Vision, which evaluates the model's understanding of local cultural and historical knowledge alongside other domain-specific capabilities in Ukrainian. Our model again shows improvements after training compared to the base model.

MamayLM MMZNO evaluation
ZNO-Vision/MMZNO evaluation results

Generative Performance vs. Larger Models

Beyond benchmark evaluations, we assessed the generative capabilities of MamayLM v1.0 on a set of 500 complex questions. The results demonstrate that our model consistently outperforms significantly larger models, excelling both in the linguistic quality of the generated Ukrainian text and the accuracy of its content. To ensure unbiased and high-quality evaluations, we relied on Gemini 2.0 Flash, which has strong proficiency in Ukrainian and a deep understanding of its cultural and linguistic nuances.
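For reference, here is a minimal sketch of a pairwise LLM-as-judge call with the google-genai SDK; the prompt wording and the A/B comparison format are illustrative assumptions about how such an evaluation can be set up, not our exact protocol.

from google import genai

client = genai.Client()  # reads the Gemini API key from the environment

JUDGE_PROMPT = (
    "You are grading two Ukrainian answers to the same question.\n"
    "Judge factual accuracy and the quality of the Ukrainian language.\n"
    "Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
    "Reply with exactly one letter: A, B, or T for a tie."
)

def judge(question, answer_a, answer_b):
    """Ask Gemini 2.0 Flash which of two answers is better."""
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b
        ),
    )
    return response.text.strip()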

We evaluate model performance on factual Ukrainian QA data, where our model compares favourably against much larger models, as well as against GPT-5-mini and Claude 3.7 Sonnet.

MamayLM Ukrainian QA evaluation
Chat performance comparison against proprietary models on custom Ukrainian QA benchmark

We also assess model performance on the Ukrainian subset of m-ArenaHard, which targets more domain-specific knowledge in math and coding; here, too, our model performs well against much larger models.

MamayLM ArenaHard-M UKR evaluation
Chat performance comparison against proprietary models on m-ArenaHard benchmark

Evaluating MamayLM v1.0 4B Capabilities

We assess the capabilities of MamayLM v1.0 4B using the same benchmarks, targeted to evaluate text generation, comprehension, and domain-specific knowledge for both Ukrainian and English. The model shows strong performance against similarly sized models, demonstrating its effectiveness across a range of tasks.

MamayLM 4B Ukrainian evaluation
MamayLM v1.0 4B Ukrainian evaluation results (comparison with similarly sized models)

Furthermore, MamayLM v1.0 4B achieves 50% accuracy on the ZNO benchmark, showing promising performance on Ukrainian-focused tasks for such a small model.

Benefits of MamayLM

In the current technological landscape, the need for fast, adaptable, and locally optimized solutions has become critical. Available in 4B and 12B sizes, MamayLM is relatively compact and consistently outperforms models up to 5x larger in Ukrainian, while maintaining competitive performance in English. Its ability to operate on a single GPU allows for faster adaptation, lower operational costs, and simpler deployment, making it particularly well-suited for environments with limited resources and evolving demands. Moreover, the new version now has visual and long-context capabilities, with increased performance in both languages.

This offers significant advantages for Ukrainian local businesses and government institutions, which can integrate advanced AI technologies without the prohibitive costs or complex technical requirements typically associated with larger systems. Having a smaller size option allows for more flexibility in deployment and scaling for smaller businesses that lack extensive infrastructure. Additionally, the model's bilingual capabilities support its application in sectors such as education and healthcare, where addressing language barriers can have a meaningful impact. In particular, it helps meet immediate needs in Ukraine by enhancing service delivery across critical areas.

Download Models and Benchmarks

We make standard and quantized versions of MamayLM available on HuggingFace, alongside a detailed description of how to use them for inference:

You can load the model locally with the transformers library using the following code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0"

# Load the instruction-tuned model in bfloat16 across the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
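Continuing from the snippet above, a simple text-only chat request can then be issued via the Gemma 3 chat template (the example question is just an illustration):

messages = [{"role": "user", "content": "Розкажи коротко про Тараса Шевченка."}]  # "Tell me briefly about Taras Shevchenko."
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))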

The Ukrainian benchmarks are available in the corresponding GitHub repository.

Check out our previous version MamayLM v0.1 9B (Gemma 2) available at this link. If you use our models, please consider citing our work (citation below).

Contact Us

For any questions on MamayLM, please contact us at contact@insait.ai.

INSAIT is a world-class computer science and AI research institute, which is part of Sofia University, located in Sofia, Bulgaria. INSAIT was created in 2022, in partnership with Switzerland's ETH Zurich and EPFL. It is a strategic institution for Bulgaria, funded with an initial endowment of around 100M USD by the Bulgarian government, over a period of 10 years, and is generously supported with donations of roughly 15M USD from SiteGround, Google, AWS, VMware and other big-tech companies. INSAIT is the first center of its kind in Eastern Europe, structured according to top Western computer science and AI institutions – it provides world-class packages and conditions for outstanding tenure-track and tenured faculty, research scientists, post-docs, PhDs and many other positions. Currently, INSAIT hosts researchers from more than 23 nationalities and does research in areas spanning foundational models, safe and secure AI, robotics, computer vision, quantum computing, algorithms, information security, and other key areas.

Citation

For attribution in academic contexts, please cite this work as

"MamayLM v1.0: An efficient state-of-the-art multimodal Ukrainian LLM", 2025.

BibTeX citation

@misc{MamayLMv1,
      title={MamayLM v1.0: An efficient state-of-the-art multimodal Ukrainian LLM},
      author={Yukhymenko, Hanna and Alexandrov, Anton and Vechev, Martin},
      year={2025}
}

Distill Template

This blog was based on The Distill Template by Leandro von Werra.