A Benchmark for the Evaluation of
Ukrainian Language Models
Serhii Hamotskyi
Anhalt University of Applied Sciences
Outline
I have a dog, Fido, and a parrot, Ollie. We were all friends until last week, when Fido ate my steak. But I wasn’t angry for long: after all, he’s just a →______←, I can’t expect him to be moral.
a) dog
b) animal
c) parrot
d) Ollie
e) ate
Introduction - Samples
What’s the fifth letter in the word evaluation?
What’s the last word in the sentence “London is the capital of Great Britain”?
Which word doesn’t belong to the same category as the rest: fear, love, sadness, optimism, pizza?
Do all these words belong to the same category: hand, arm, leg, head, finger?
Which word is longer: democracy or pasta?
Which word comes first in alphabetical order: cat or cactus?
Introduction - Samples
Pick the correct title for the article:
A rare celestial phenomenon will be visible across North America in X weeks on Wednesday. Dubbed the “Night of the Red Comet,” this event occurs only once every XXX years when a comet with a distinct red hue crosses the northern sky. Astronomers and stargazers alike are gearing up for what is expected to be a spectacular view on Wednesday night. Cities in the comet’s path are organizing viewing parties, and local observatories are extending their hours to accommodate the expected influx of enthusiasts. Experts recommend finding a dark spot away from city lights to get the best view of this extraordinary astronomical event.
Introduction - Samples
A model may be accurate, but is it …
Introduction - Motivation
Open benchmarks published on the Internet are associated with two interrelated problems:
A model that has “seen” a benchmark’s data during training will get inflated scores on that benchmark, which motivates the need for contamination-safe benchmarks.
Introduction - Motivation - Contamination
Introduction - Motivation - Ukrainian datasets
Public domain: https://commons.wikimedia.org/wiki/File:Ouroboros-Abake.svg
Introduction - Motivation - Ukrainian datasets
Public domain: https://commons.wikimedia.org/wiki/File:Ouroboros-Abake.svg
Introduction - Motivation - Ukrainian datasets
Introduction - Motivation - Usage of Ukrainian
German example from https://en.wikipedia.org/wiki/Synthetic_language
Theory - Ukrainian grammar
піду → go-1SG.FUT
1SG.FUT: first person (I, not she), singular (I, not we), and future tense → I will go
Theory - Notation
I have a dog, Fido, and a parrot, Ollie. We were all friends until last week, when Fido ate my steak. But I wasn’t angry for long: he, like most →_____←, loves meat. I think the light from the Moon makes him a bit of a werewolf.
a) dogs
b) parrot
c) friends
d) him
a) dogs
b) parrots
c) friends
d) him
Theory - Two stories
In meiner Tasche habe ich einen Stift und einen Schlüsselring mit Schlüssel. Ich darf den Schlüsselring nicht verlieren, sonst verliere ich auch alle meine Schlüssel. (“In my bag I have a pen and a keyring with keys. I must not lose the keyring, or I will also lose all my keys.”)
a) Tasche
b) Stift
c) Schlüsselring
d) Schlüssel
a) Taschen
b) Stifte
c) Schlüsselringe
d) Schlüssel-\(\varnothing\)
Theory - Two stories
Theory - Definitions
Eval-UA-tion is a set of 3 benchmark tasks (9 datasets, each with 1,000+ instances) for evaluating Large Language Models (LLMs).
Tasks - Eval-UA-tion overview
Tasks - UA-CBT
Tasks - UA-CBT
Tasks - UA-CBT
Tasks - UA-CBT
All public domain.
Tasks - UA-CBT
Tasks - UA-CBT
Tasks - UA-CBT
Tasks - UA-CBT
[sunny|cloudy]
day…”)
Tasks - UA-CBT
Inspired by LMentry [5], a benchmark “hard for LLMs but easy for humans”.
Tasks - LMentry-static-UA - Outline
Different templates are used for the same task; instances have a lot of metadata.
templates:
  # template: 'Яке слово перше по алфавітному порядку: "{t1}" чи "{t2}"?'
  - template: 'Which word is first in alphabetical order: "{t1}" or "{t2}"?'
    additional-metadata:
      template_n: 1
      type: ordinal
      kind: less
  # template: 'Яке слово стоїть ближче до початку алфавіту: "{t1}" чи "{t2}"?'
  - template: 'Which word is closer to the beginning of the alphabet: "{t1}" or "{t2}"?'
    additional-metadata:
      template_n: 2
      type: closer_to_side
      kind: less
  # template: Серед '{t1}' та '{t2}', яке слово розташоване ближче до кінця алфавіту?
  - template: Between '{t1}' and '{t2}', which word is closer to the end of the alphabet?
    additional-metadata:
      template_n: 3
      type: closer_to_side
      kind: more
  # template: Серед '{t1}' і '{t2}', яке слово знаходиться ближче до літери A в алфавіті?
  - template: Between '{t1}' and '{t2}', which word is closer to the letter A in the alphabet?
    additional-metadata:
      template_n: 4
      type: closer_to_letter
      kind: less
Tasks - LMentry-static-UA - Robustness
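To make the template mechanism concrete, here is a minimal sketch of how a YAML file like the one above could be turned into task instances. It is illustrative only: the file name, the word pair, and the field names simply mirror the snippet, not necessarily the real benchmark code (requires PyYAML).

```python
# Illustrative sketch, not the Eval-UA-tion generation code: load the templates
# shown above and render one instance with its metadata attached.
import random
import yaml

with open("templates.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

t1, t2 = "кактус", "кіт"                # hypothetical word pair for one instance
tmpl = random.choice(cfg["templates"])  # pick one of the paraphrased templates

instance = {
    "question": tmpl["template"].format(t1=t1, t2=t2),
    "metadata": tmpl["additional-metadata"],  # template_n, type, kind, ...
}
print(instance["question"])
print(instance["metadata"])
```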
Tasks - LMentry-static-UA - N-in-M
Tasks - LMentry-static-UA - N-in-M
Solution: keep the capitalized numeral in the template!
Template: What's the FIRST letter in the word '{word}'?
Numeral: 5
Word: cactus
FIRST → first (ORD) → target: ordinal
5 → five → fifth (ORD)
Tasks - LMentry-static-UA - N-in-M
The ukr_numbers library was written:
>>> from ukr_numbers import Numbers
>>> Numbers().convert_to_auto(15, "перший")
'пʼятнадцятий'
# loosely 'translating' to English:
>>> Numbers().convert_to_auto(15, "first")
'fifteenth'
Tasks - LMentry-static-UA - N-in-M
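A rough sketch of the numeral trick described above, using the ukr_numbers call from the previous slide. Only Numbers().convert_to_auto() is the real API; the Ukrainian template wording and the fill() helper are invented here for illustration.

```python
# Sketch: keep the capitalized numeral in the template, then inflect the target
# numeral into the same grammatical form before substituting it.
import re
from ukr_numbers import Numbers

template = "Яка ПЕРША літера слова '{word}'?"  # hypothetical Ukrainian template

def fill(template: str, numeral: int, word: str) -> str:
    # Find the all-caps numeral kept in the template (e.g. ПЕРША = "first"),
    # convert the target numeral into that form, and substitute it.
    caps = re.search(r"[А-ЯІЇЄҐʼ]{2,}", template).group(0)
    inflected = Numbers().convert_to_auto(numeral, caps.lower())
    return template.replace(caps, inflected).format(word=word)

print(fill(template, 5, "кактус"))  # expected: "Яка пʼята літера слова 'кактус'?"
```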
Tasks - LMentry-static-UA - Category-based tasks
Tasks - LMentry-static-UA - Comparison-based tasks
по-пʼяте (“fifthly”)
Tasks - LMentry-static-UA - Comparison-based tasks
Tasks - UP-Titles
Tasks - UP-Titles
Tasks - UP-Titles
The Danish government has announced the supply of a new military aid package worth X.X billion Danish crowns, or about US$XXX million, to Ukraine, which was formed after the meeting of Defence Ministers of Denmark and Ukraine. Source: European Pravda, with reference to the press service of the Ministry of Defence of Denmark Details: The new Danish aid package will include …
Tasks - UP-Titles
Tasks - UP-Titles
[cat, magic]
→ \(1/\frac{3+5}{2} = 25\%\)
Tasks - Baselines
| Dataset | # total | # wrong | bl_random (%) | bl_human (%) |
|---|---|---|---|---|
| UA-CBT | 99 | 6 | 16.67 | 93.94 |
| UP-unmasked | 99 | 12 | 10.00 | 87.88 |
| UP-masked | 98 | 16 | 10.00 | 83.67 |
| LMES-wordalpha | 98 | 8 | 50.00 | 91.84 |
| LMES-wordlength | 100 | 6 | 50.00 | 94.00 |
| LMES-cats_bin | 99 | 3 | 50.00 | 96.97 |
| LMES-cats_mc | 100 | 2 | 20.00 | 98.00 |
| LMES-LOW | 100 | 3 | 9.43 | 97.00 |
| LMES-WIS | 100 | 6 | 4.69 | 94.00 |
Columns: # total and # wrong are the number of instances in the human evaluation split and how many of them were answered incorrectly; bl_random is the random baseline (over the entire dataset); bl_human is the resulting human accuracy, e.g. (99 − 6)/99 ≈ 93.94% for UA-CBT.
Tasks - UP-Titles
“On the left is my thesis without your help; on the right is my thesis with your help.”
Annotation and Human Evaluation
Annotation and Human Evaluation
Annotation and Human Evaluation
Annotation and Human Evaluation
Annotation and Human Evaluation
Picture from the GitHub repository of the bot.
Experiments
Experiments
Experiments
Conclusion
Conclusion
(Fair) Evaluation is hard.
BUT
Introduction - Motivation
I have a dog, Fido, and a parrot, Ollie. We were all friends until last week, when Fido ate my steak. But I wasn’t angry for long: he, like most dogs, loves meat. I think the →_light_← from the Moon makes him a bit of a werewolf.
I think I’ll light my fireplace, maybe the darkness makes him nervous.
I think the light meal I gave him wasn’t enough.
a) dog
b) animal
c) parrot
d) Moon
e) light
Theory - Two stories
Fido ate my steak after midnight. …
Screenshot: https://www.google.com/search?q=light%20synonyms
Theory - Two stories
inflect()-ing SG↔PL was broken for Ukrainian
Theory - Ukrainian grammar
DISAMBIGUATION
Theory
make_agree_with_number(≈dog, 4)
Theory - Ukrainian grammar
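A minimal sketch of the agreement call named above. make_agree_with_number is a real pymorphy method; the exact package (pymorphy3 with its Ukrainian dictionaries) and the output forms shown in the comments are assumptions.

```python
# Sketch: make a Ukrainian noun agree with a number, as in
# make_agree_with_number(≈dog, 4) on the slide above.
# Assumes pymorphy3 + pymorphy3-dicts-uk are installed.
import pymorphy3

morph = pymorphy3.MorphAnalyzer(lang="uk")
dog = morph.parse("собака")[0]             # ≈dog: the best parse of "собака"
print(dog.make_agree_with_number(4).word)  # form agreeing with 4, e.g. "собаки"
print(dog.make_agree_with_number(5).word)  # form agreeing with 5, e.g. "собак"
```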
I generated the stories with GPT-4 and Gemini Pro.
Tasks - UA-CBT
Tasks - UA-CBT
Tasks - UA-CBT
UA-CBT task filtration
Annotation and Human Evaluation
Tasks - LMentry-static-UA - N-in-M
random.choices() → random.sample(), but otherwise the coding process was uneventful.
Tasks - LMentry-static-UA - Category-based tasks
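For context on the random.choices() → random.sample() fix above, a small illustration of the difference (the word list is invented):

```python
# random.choices() draws WITH replacement, so a generated multiple-choice
# question could repeat the same option; random.sample() draws WITHOUT
# replacement, so all options are distinct.
import random

options = ["кіт", "пес", "риба", "птах", "кінь"]  # hypothetical category members
print(random.choices(options, k=3))  # may contain duplicates, e.g. ['кіт', 'риба', 'кіт']
print(random.sample(options, k=3))   # always three distinct words
```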
But I wasn’t angry for long: after all, he’s just a dog, I can’t expect him to be moral. I think the →_light_← from the Moon makes him a bit of a werewolf.