A Benchmark for the Evaluation of
Ukrainian Language Models
Serhii Hamotskyi*, Anna-Izabella Levbarg†, Christian Hänig*
* Anhalt University of Applied Sciences
† University of Greifswald
Introduction - Motivation - Usage of Ukrainian
Introduction
піду → go-1SG.FUT
In red is the word, in green the translation, in blue the grammatical abbreviations.
1SG.FUT: first person (I and not she) singular (I, not we) and future tense → I will go
This has many implications for NLP (and semi-automated creation of datasets).
Theory - Ukrainian grammar
Eval-UA-tion is a set of 3 benchmark tasks (9 benchmark datasets, 1,000+ instances each) for evaluating Large Language Models (LLMs).
Details and code on GitHub: https://github.com/pchr8/eval-UA-tion
Tasks - Eval-UA-tion overview
(Ukrainian Children’s Book Test)
Tasks - UA-CBT
[…] Мисливець, не маючи можливості захищатися, був розірваний на шматки розлюченими тваринами. Невдовзі Змія померла від своїх тяжких ран. Звірі поховали наставницю в пустелі, – і на її честь були влаштовані пишні похорони. Лихвар, почувши про історію зі смертю →_Мисливця_←, розлютився. […]
(Translation: “[…] The Hunter, unable to defend himself, was torn to pieces by the enraged animals. Soon the Snake died of her grave wounds. The beasts buried their mentor in the desert, and a lavish funeral was held in her honour. The Moneylender, hearing the story of the death of →_the Hunter_←, became furious. […]”)
a) Лихвар
b) Осел
c) Мисливця
d) Фермер
e) Змія
f) Півень
The same options, inflected into the genitive case required by the context:
a) Лихваря
b) Осла
c) Мисливця
d) Фермера
e) Змії
f) Півня
Tasks - UA-CBT
[…] Мисливець, не маючи можливості захищатися, був розірваний на шматки розлюченими тваринами. Невдовзі Змія померла від своїх тяжких ран. Звірі поховали наставницю в пустелі, – і на її честь були влаштовані пишні похорони. Лихвар, почувши про історію зі смертю →_Мисливця_←, розлютився. […]
знайшов Змію (found the Snake: accusative)
смерть Змії (the Snake's death: genitive)
історія зі смертю Змії (the story of the Snake's death: genitive)
*історія зі смертю Змію (*ungrammatical: accusative where the genitive is required)
(Not just case; the lemma and even the part of speech may need changing: корів ‘cows (gen. pl.)’, три ‘three’, …)
Tasks - UA-CBT
This became a separate package, pymorphy-spacy-disambiguation, and is available on GitHub.
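The package's internals are not shown on the slide; the following is a hedged, stdlib-only sketch of the core idea (illustrative names and data, not the package's actual API): among the several candidate analyses pymorphy2 returns for a word, pick the one whose grammemes overlap most with the morphology spaCy assigned in context.

```python
# Illustrative sketch of the disambiguation idea behind
# pymorphy-spacy-disambiguation (names below are hypothetical):
# pymorphy2 often returns multiple analyses for one word form;
# choose the analysis most compatible with spaCy's context-aware
# morphological features.

def best_analysis(candidates: list[set[str]], spacy_morph: set[str]) -> set[str]:
    """Return the candidate grammeme set with the largest overlap
    with spaCy's morphological features for this token."""
    return max(candidates, key=lambda c: len(c & spacy_morph))

# Toy example: "Змії" is ambiguous between genitive singular and
# nominative plural; the context (via spaCy) says genitive singular.
candidates = [
    {"NOUN", "sing", "gent"},   # genitive singular reading
    {"NOUN", "plur", "nomn"},   # nominative plural reading
]
context = {"NOUN", "sing", "gent"}
print(best_analysis(candidates, context))
```

A real implementation would also need a tie-breaking policy (e.g. pymorphy2's own probability score) when several analyses overlap equally.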
Theory - Two stories
Tasks - UA-CBT
All public domain.
Tasks - UA-CBT
[sunny|cloudy]
day…”)
Tasks - UA-CBT
Inspired by LMentry [3], a benchmark “hard for LLMs but easy for humans”.
Tasks - LMentry-static-UA - Outline
Different templates are used for the same task; instances have a lot of metadata.
templates:
# template: 'Яке слово перше по алфавітному порядку: "{t1}" чи "{t2}"?'
- template: 'Which word is first in alphabetical order: "{t1}" or "{t2}"?'
additional-metadata:
template_n: 1
type: ordinal
kind: less
# template: 'Яке слово стоїть ближче до початку алфавіту: "{t1}" чи "{t2}"?'
- template: 'Which word is closer to the beginning of the alphabet: "{t1}" or "{t2}"?'
additional-metadata:
template_n: 2
type: closer_to_side
kind: less
# template: Серед '{t1}' та '{t2}', яке слово розташоване ближче до кінця алфавіту?
- template: Between '{t1}' and '{t2}', which word is closer to the end of the alphabet?
additional-metadata:
template_n: 3
type: closer_to_side
kind: more
# template: Серед '{t1}' і '{t2}', яке слово знаходиться ближче до літери A в алфавіті?
- template: Between '{t1}' and '{t2}', which word is closer to the letter A in the alphabet?
additional-metadata:
template_n: 4
type: closer_to_letter
kind: less
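The actual generation code lives in the GitHub repository; as a minimal sketch of how such templates might be instantiated (illustrative structure, not the repository's code), each instance can be rendered from a template and carry the template's metadata along. Note that plain `min()`/`max()` on strings only approximates Ukrainian alphabetical order by Unicode codepoint; it happens to be correct for the letters used here, but real code would need proper collation.

```python
# Hypothetical instantiation sketch: fill each template with a word
# pair and attach the per-template metadata, so every instance records
# which template, task type, and comparison direction produced it.
import itertools

templates = [
    {"template": 'Which word is first in alphabetical order: "{t1}" or "{t2}"?',
     "additional-metadata": {"template_n": 1, "type": "ordinal", "kind": "less"}},
    {"template": "Between '{t1}' and '{t2}', which word is closer to the end of the alphabet?",
     "additional-metadata": {"template_n": 3, "type": "closer_to_side", "kind": "more"}},
]

def make_instances(templates, word_pairs):
    for tpl, (w1, w2) in itertools.product(templates, word_pairs):
        meta = tpl["additional-metadata"]
        # "less" asks for the alphabetically earlier word, "more" for the later one
        answer = min(w1, w2) if meta["kind"] == "less" else max(w1, w2)
        yield {"question": tpl["template"].format(t1=w1, t2=w2),
               "answer": answer,
               "meta": meta}

instances = list(make_instances(templates, [("кіт", "кактус")]))
```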
Tasks - LMentry-static-UA - Robustness
Tasks - LMentry-static-UA - N-in-M
Tasks - LMentry-static-UA - N-in-M
ukr_numbers was written:
>>> from ukr_numbers import Numbers
>>> Numbers().convert_to_auto(15, "перший")
'пʼятнадцятий'
# loosely 'translating' to English:
>>> Numbers().convert_to_auto(15, "first")
'fifteenth'
Tasks - LMentry-static-UA - N-in-M
Tasks - LMentry-static-UA - Category-based tasks
Tasks - LMentry-static-UA - Comparison-based tasks
по-пʼяте (“fifthly”)
Tasks - LMentry-static-UA - Comparison-based tasks
(Ukrainska Pravda–Titles)
Tasks - UP-Titles
Tasks - UP-Titles
The Danish government has announced the supply of a new military aid package worth X.X billion Danish crowns, or about US$XXX million, to Ukraine, which was formed after the meeting of Defence Ministers of Denmark and Ukraine. Source: European Pravda, with reference to the press service of the Ministry of Defence of Denmark Details: The new Danish aid package will include …
Tasks - UP-Titles
| | # total | # wrong | bl_random | bl_human |
|---|---|---|---|---|
| UA-CBT | 99 | 6 | 16.67 | 93.94 |
| UP-unmasked | 99 | 12 | 10.00 | 87.88 |
| UP-masked | 98 | 16 | 10.00 | 83.67 |
| LMES-wordalpha | 98 | 8 | 50.00 | 91.84 |
| LMES-wordlength | 100 | 6 | 50.00 | 94.00 |
| LMES-cats_bin | 99 | 3 | 50.00 | 96.97 |
| LMES-cats_mc | 100 | 2 | 20.00 | 98.00 |
| LMES-LOW | 100 | 3 | 9.43 | 97.00 |
| LMES-WIS | 100 | 6 | 4.69 | 94.00 |
Columns: # wrong / # total are the numbers of incorrect and total instances in the human-evaluation split; bl_random is the random baseline (computed over the entire dataset); bl_human is the human baseline.
Tasks - UP-Titles
Annotation and Human Evaluation
Picture from the GitHub repository of the bot.
Experiments
Conclusion
Introduction - Samples
TODO: actual examples and translations instead of this
What’s the fifth letter in the word evaluation?
What’s the last word in the sentence “London is the capital of Great Britain”?
Which word doesn’t belong to the same category as the rest: fear, love, sadness, optimism, pizza?
Do all these words belong to the same category: hand, arm, leg, head, finger?
Which word is longer: democracy or pasta?
Which word comes first in alphabetical order: cat or cactus?
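Tasks like these can be generated programmatically from templates plus a vocabulary; a hypothetical sketch for the word-length comparison (illustrative names, not the benchmark's actual code):

```python
# Hypothetical generator for a "which word is longer?" instance:
# draw two distinct words, reject length ties (no single correct
# answer), and record the gold answer alongside the question.
import random

def word_length_instance(vocab: list[str], rng: random.Random) -> dict:
    w1, w2 = rng.sample(vocab, 2)
    while len(w1) == len(w2):      # ties are ambiguous, redraw
        w1, w2 = rng.sample(vocab, 2)
    return {"question": f"Which word is longer: {w1} or {w2}?",
            "answer": w1 if len(w1) > len(w2) else w2}

rng = random.Random(42)            # seeded for reproducible datasets
inst = word_length_instance(["democracy", "pasta", "cat", "evaluation"], rng)
print(inst["question"])
```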
Introduction - Samples
TODO: actual examples and translations instead of this
Pick the correct title for the article:
A rare celestial phenomenon will be visible across North America in X weeks on Wednesday. Dubbed the “Night of the Red Comet,” this event occurs only once every XXX years when a comet with a distinct red hue crosses the northern sky. Astronomers and stargazers alike are gearing up for what is expected to be a spectacular view on Wednesday night. Cities in the comet’s path are organizing viewing parties, and local observatories are extending their hours to accommodate the expected influx of enthusiasts. Experts recommend finding a dark spot away from city lights to get the best view of this extraordinary astronomical event.
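The figures in the article body appear masked (“X weeks”, “every XXX years”), plausibly so that concrete numbers cannot trivially link an article to its title. Assuming simple digit replacement (a sketch, not necessarily the benchmark's exact masking rule):

```python
import re

def mask_numbers(text: str) -> str:
    """Replace every digit with 'X' so concrete figures cannot leak
    the answer, e.g. '1.5 billion' -> 'X.X billion'."""
    return re.sub(r"\d", "X", text)

print(mask_numbers("a package worth 1.5 billion crowns, or about US$250 million"))
# a package worth X.X billion crowns, or about US$XXX million
```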
Introduction - Samples
A model may be accurate, but is it …
Introduction - Motivation
Open benchmarks published on the Internet are associated with two interrelated problems:
A model that has “seen” the data from a benchmark will have inflated scores on that benchmark — which motivates the need for contamination-safe benchmarks.
Introduction - Motivation - Contamination
Introduction - Motivation - Ukrainian datasets
Public domain: https://commons.wikimedia.org/wiki/File:Ouroboros-Abake.svg
(Fair) Evaluation is hard.
BUT
Introduction - Motivation
I have a dog, Fido, and a parrot, Ollie. We were all friends until last week, when Fido ate my steak. But I wasn’t angry for long: he, like most dogs, loves meat. I think the →_light_← from the Moon makes him a bit of a werewolf.
I think I’ll light my fireplace, maybe the darkness makes him nervous.
I think the light meal I gave him wasn’t enough.
a) dog
b) animal
c) parrot
d) Moon
e) light
Theory - Two stories
Fido ate my steak after midnight. …
Screenshot: https://www.google.com/search?q=light%20synonyms
Theory - Two stories
inflect()-ing SG↔PL was broken for Ukrainian
Theory - Ukrainian grammar
Theory - Definitions
DISAMBIGUATION
Theory
make_agree_with_number(≈dog, 4)
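`make_agree_with_number(≈dog, 4)` refers to pymorphy2-style number agreement. A stdlib-only sketch of the underlying Ukrainian rule (the forms dictionary and function name are illustrative, not pymorphy2's API): nouns take the singular after numbers ending in 1, a special form after 2–4, and the genitive plural otherwise, with 11–14 as exceptions.

```python
# Hedged sketch of Ukrainian noun/number agreement, the rule behind
# pymorphy2's make_agree_with_number. Forms dict is hypothetical:
#   'one'  - form used with 1, 21, 31, ... (but not 11)
#   'few'  - form used with 2-4, 22-24, ... (but not 12-14)
#   'many' - genitive plural, used with everything else
def agree_with_number(forms: dict[str, str], n: int) -> str:
    if n % 100 in (11, 12, 13, 14):   # teens always take the gen. pl.
        return forms["many"]
    if n % 10 == 1:
        return forms["one"]
    if n % 10 in (2, 3, 4):
        return forms["few"]
    return forms["many"]

cow = {"one": "корова", "few": "корови", "many": "корів"}
print(agree_with_number(cow, 4))    # корови
print(agree_with_number(cow, 14))   # корів
```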
Theory - Ukrainian grammar
I generated the stories with GPT-4 and Gemini Pro.
Tasks - UA-CBT
UA-CBT task filtration
Annotation and Human Evaluation
Solution: keep the capitalized numeral in the template!
Template: What's the FIRST letter in the word '{word}'?
Numeral: 5
Word: cactus
FIRST → first (ORD) → target: ordinal
5 → five → fifth (ORD)
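A toy illustration of this fix, with an English lookup table standing in for ukr_numbers' Ukrainian morphology (all names here are hypothetical): the capitalized numeral word kept in the template reveals the required grammatical form, and the instance's number is converted to match it.

```python
# Toy sketch: detect the form of the template's capitalized numeral
# word (e.g. FIRST -> ordinal), then render the instance's number in
# that same form. English tables stand in for Ukrainian morphology.
CARDINAL = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}
ORDINAL  = {1: "first", 2: "second", 3: "third", 4: "fourth", 5: "fifth"}

def convert_like(n: int, example_word: str) -> str:
    """Render n in the same grammatical form as example_word."""
    if example_word.lower() in ORDINAL.values():   # "FIRST" -> ordinal
        return ORDINAL[n]
    return CARDINAL[n]

print(convert_like(5, "FIRST"))  # fifth
```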
Tasks - LMentry-static-UA - N-in-M
TODO: use a Ukrainian example or delete
I have a dog, Fido, and a parrot, Ollie. We were all friends until last week, when Fido ate my steak. But I wasn’t angry for long: he, like most →_____←, loves meat. I think the light from the Moon makes him a bit of a werewolf.
a) dogs
b) parrot
c) friends
d) him
a) dogs
b) parrots
c) friends
d) him
Theory - Two stories
TODO: use a Ukrainian example or delete
In meiner Tasche habe ich einen Stift und einen Schlüsselring mit Schlüssel. Ich darf den Schlüsselring nicht verlieren, sonst verliere ich auch alle meine Schlüssel. (‘In my bag I have a pen and a keyring with key(s). I must not lose the keyring, or I will also lose all my keys.’)
a) Tasche
b) Stift
c) Schlüsselring
d) Schlüssel
a) Taschen
b) Stifte
c) Schlüsselringe
d) Schlüssel-\(\varnothing\)
Theory - Two stories
Tasks - UA-CBT
random.choices()
→ random.sample()
…but otherwise the coding process was uneventful.
Tasks - LMentry-static-UA - Category-based tasks
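The `random.choices()` → `random.sample()` bug mentioned above is a real distinction in Python's standard library and worth showing concretely: `choices` draws with replacement (the same answer option can appear twice), while `sample` draws without replacement.

```python
import random

words = ["кіт", "пес", "корова", "змія", "півень"]
rng = random.Random(0)

# random.choices draws WITH replacement: duplicates are possible,
# which is wrong when drawing distinct answer options.
with_replacement = rng.choices(words, k=3)

# random.sample draws WITHOUT replacement: all items are distinct.
without_replacement = rng.sample(words, k=3)
print(without_replacement)
```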
Tasks - UP-Titles
[cat, magic]
→ \(1/\frac{3+5}{2} = 25\%\)
Tasks - Baselines
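Reproducing the slide's random-baseline arithmetic in code (the option counts 3 and 5 are taken from the slide; everything else is illustrative): one over the mean number of options.

```python
# Random baseline as computed on the slide: one instance with 3
# options and one with 5 gives 1 / ((3 + 5) / 2) = 1/4 = 25%.
n_options = [3, 5]
baseline = 1 / (sum(n_options) / len(n_options))
print(f"{baseline:.0%}")  # 25%
```

Note that averaging per-instance chance instead, i.e. (1/3 + 1/5) / 2 ≈ 26.7%, gives a close but not identical figure; which convention applies depends on how the benchmark scores instances.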
“On the left is my thesis without your help; on the right is my thesis with your help.”
Annotation and Human Evaluation
Experiments
But I wasn’t angry for long: after all, he’s just a dog, I can’t expect him to be moral. I think the →_light_← from the Moon makes him a bit of a werewolf.