MASTERARBEIT (Master thesis) DRAFT
This will be the Markdown draft of my Master thesis; I’ll jot things down and then expand.
- language: make a decision and stick with it everywhere
- master((’’)s) thesis
- (Ukrainian(-language)) NLP (for/in the Ukrainian language)
- Ctrl-F all the occurrences of “Russian” in the final text and decide on the right balance and nuances, to which extent is it needed
- (ref: 17/11/23 meeting for N’s masterarbeit)
- Apparently I/we is not OK in theses, but passive is also bad, … how does one do this?
- you can happily add intro text to chapters before the first subchapter starts!
- for every X I use, ask myself the question “why do I use X?”
- latex description environments are cool!
- TODO cite pymorphy2 in the way they want to be cited:
Thanks and stuff
- my father, who was the first to make me notice and love language and languages
- all the people who kept this love alive, one way or the other
- Viktoria Amelina
- See CH’s PhD for an example
Нації вмирають не від інфаркту. Спочатку їм відбирає мову.
Nations don’t die from heart attacks. They go mute first.1
Lina Kostenko, Ukrainian poetess
The Ukrainian language is not at risk of dying, at least not anymore, despite literally centuries of attempts. Before 2014, the quote above was so incisive it hurt. The last 10 years have led to a resurgence of the Ukrainian language, especially its use in informal and non-academic contexts, and I’m certain this will be followed by an increase in resources dedicated to its use, study, etc., driven by the objectively increasing need for that.
TODO here some cool stats about UA speakers in Ukraine over the last 20 years or so, as well as the % of Ukrainian speeches in UA parliament from I don’t remember which paper.
That said, the quote by Lina Kostenko remains relevant for all the other languages that either are at risk of dying, or have fewer resources available for their study.
This master thesis attempts to advance NLP research in Ukrainian both by creating a number of additional annotated datasets for Ukrainian and by creating the first Ukrainian-language LM benchmark.
The Ukrainian language
Context & history
- Basic history
- All the attempts to kill it as a language by Russia
- Ukrainian language on the Internet
- Specifically mention the cool things I notice in the last year too
- Maybe mention my first research thing in the 8th grade about that :)
- Recent changes in the Ukrainian language
- Everyone started needing UA vocabulary for many things
- Feminine personal pronouns! Incl. the fascinating bit about the war-related ones.2
- Inherent bilingualism of basically everyone and how until lately no one generally needed Ukrainian stuff, as books/TV/… were in Russian and everyone understood anyway
- the GRAC paper mentions the removal of RU text from the corpus and gives historical background on why it’s there 3
- Myths about the Ukrainian language4
- Especially the ’not a real language’ / ‘a dialect of Russian’ and ‘almost identical to Russian’ one
- UA vs RU vs PL vs…
- Just for fun mention the Slavic influences on German vocabulary, especially city names :)
Ukrainian from a linguistic perspective
The Ukrainian language belongs to the Slavic family of the Indo-European languages (which also contains languages such as Polish, Czech, Serbian, Bulgarian), specifically to the East Slavic branch, which contains (by common consensus5) Belarusian, Russian, and Ukrainian. While all three are mutually intelligible to a certain extent, Ukrainian has more in common with Belarusian than with Russian; outside the branch, spoken6 Ukrainian has partial intelligibility with Polish.
Ukrainian is a synthetic7 inflected language8, that is, it can express different grammatical categories (case, number, gender, ...) as part of word formation. In other words, information about grammatical categories tends to be encoded inside the words themselves.9
(German, too, is a fusional language, but with a smaller degree of inflection. English, on the other hand, largely abandoned the inflectional case system10 and is an analytic language, conveying grammatical information through word order and prepositions.)
- nouns decline for the 7 cases11 and 2 numbers (singular, plural)
- adjectives agree with nouns in gender, case, number
- verbs conjugate for tenses, voices, persons, numbers
TODO pretty pictureS with
- “Ми знайшли зелену чашкУ на столІ = We found the green cup on the table” with colors from the “Specifically” example above
- (Вони) використовуватимуться12
- Sie werden verwendet / They will be used
- As an interesting aside, ChatGPT with GPT3.5 is unable to analyze it by itself, with all of its statements here being wrong13:
TODO: Українська морфеміка — Вікіпедія has a list of affixes, POS etc.
The inflectional paradigm of Ukrainian admits free word order: in English the Subject-Verb-Object word order in “the man saw the dog” (vs “the dog saw the man”) determines who saw whom, while in Ukrainian (“чоловік побачив собакУ”) the last letter of the object (dog) marks it as object - and the ordering of the words can be used for additional emphases or shades of meaning.
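The point about case endings versus word order can be illustrated with a toy sketch. The tiny lexicon below is an assumption made purely for illustration (a real system would use a morphological analyzer); it only shows that the subject and object can be recovered from the word forms alone, whatever the order of the words.

```python
from itertools import permutations

# Toy lexicon (hypothetical, for illustration only):
# surface form -> (English gloss, role marked by the case ending).
LEXICON = {
    "чоловік": ("man", "subject"),   # nominative
    "чоловіка": ("man", "object"),   # accusative
    "собака": ("dog", "subject"),    # nominative
    "собаку": ("dog", "object"),     # accusative
    "побачив": ("saw", "verb"),
}


def who_saw_whom(sentence: str) -> tuple[str, str]:
    """Return (subject, object) glosses using case endings only, ignoring word order."""
    roles = {}
    for token in sentence.lower().split():
        gloss, role = LEXICON[token]
        roles[role] = gloss
    return roles["subject"], roles["object"]


# Every ordering of the same three word forms yields the same reading.
for order in permutations(["чоловік", "побачив", "собаку"]):
    sentence = " ".join(order)
    print(sentence, "->", who_saw_whom(sentence))
```

Swapping the endings instead of the positions (“собака побачив чоловіка”) is what flips the meaning, which is the property an English-centric, position-based heuristic would miss.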
- TODO mention how 2-3-4 is special and not just singular/plural
- Ukrainian has a future tense
All the above has direct implications for NLP, for example:
- The development of lemmatizers, morphological analyzers, bag-of-words approaches for information retrieval16
- In the area of grammatical error correction, systems developed with English in mind perform worse for morphologically rich languages. 17
- Correctly understanding the word order for the tone/intent/emphasis on specific parts of the sentence, as opposed to the arguably more explicit way English conveys this
- TODO add more
The Ukrainian alphabet is written in Cyrillic and has 33 letters; in writing, the apostrophe is also used. It differs from the Russian alphabet by the absence of the letters ё, ъ, ы and э, and the presence of ґ, є, і, and ї.
This helps (but doesn’t completely solve the problem of) differentiating the two languages, which is needed relatively often: Russian-language fragments within otherwise Ukrainian text (e.g. untranslated quotes in text intended for a bilingual audience) are a typical problem, and one that needs to be solved when building reference corpora or datasets.3
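A minimal sketch of how the alphabet difference could be used as a first-pass filter follows. The function name and the simple counting rule are assumptions for illustration, not the method used by any particular corpus. The sketch also shows why the problem is not fully solved: a fragment without any of the distinctive letters stays undecided.

```python
# Letters that occur in only one of the two alphabets (see the paragraph above).
UK_ONLY = set("ґєії") | set("ґєії".upper())
RU_ONLY = set("ёъыэ") | set("ёъыэ".upper())


def guess_language(text: str) -> str:
    """Guess 'uk' or 'ru' from distinctive letters; return 'unknown' if undecided."""
    uk_hits = sum(ch in UK_ONLY for ch in text)
    ru_hits = sum(ch in RU_ONLY for ch in text)
    if uk_hits > ru_hits:
        return "uk"
    if ru_hits > uk_hits:
        return "ru"
    return "unknown"


for fragment in [
    "Ми знайшли зелену чашку на столі",
    "Мы нашли зелёную чашку на столе",
    "зелена чашка",  # no distinctive letters at all
]:
    print(fragment, "->", guess_language(fragment))
```

Short fragments such as the third one are where this heuristic gives up, so in practice it could only serve as a cheap pre-filter before a proper language identification step.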
The importance of NLP for mid- and low-resource languages
The Bender rule and language independence
Emily Bender’s original 2011 paper, written in the pre-LLM era, discusses the problem of language independence, that is, the extent to which NLP research/technology can scale over multiple (or ‘all’) languages. In her more recent writing on the topic, she notes how work on languages other than English is often considered “language specific” and thus viewed as less important 18, and she names the underlying misconception: that English is a sufficiently representative language and that work on ‘just’ English is not language-specific.
An NLP system that works for English is not guaranteed to behave similarly for other languages, unless explicitly designed and tested for that.
Or in different words, “English is Neither Synonymous with Nor Representative of Natural Language”. 18
In her list of 8 properties of English that make it unrepresentative of languages in general, 4 are directly relevant here because Ukrainian does not share them: little inflectional morphology, fixed word order, possible matches to database field names or ontology entries, and massive amounts of training data available.
The status of the Ukrainian language
In the taxonomy of languages based on data availability 19 (see below), Ukrainian is classified in class 3, “the rising stars”: languages with a thriving online cultural community that got an energy boost from unsupervised pre-training, but let down by insufficient efforts in labeled data collection. Sample languages from that group include Indonesian, Cebuano, Afrikaans, Hebrew. Russian is in class 4, English and German are in class 5.
From a different angle, if we look at estimates of languages used on the Internet (estimated percentages of the top 10M websites), as of October 2023 we find Ukrainian at number 19 (0.6%), between Arabic and Greek2021. English is #1 (53.0%), Russian #3 (4.6%), German at #4 (4.6% as well).
Ukrainian Wikipedia is 15th by daily views and by number of articles22.
- typology bits 19
More about the Bender Rule
- Explain why doing this is important in general
Compare existing UA vs Russian datasets and how on HF both languages are treated as interchangeable
- and mention that colonialism is bad and all that
- USE essentials of linguistics’ first chapters
- The problem of finding/removing Russian fragments in UA corpora3 as example of an error mode
And about the clear gap in actual need and the literature
- e.g. plot of RU/UA languages on say TG
- overlaid on a plot of existing Russian-language…
- even just datasets of HF labeled “RU”
The story I’m telling is
- Doing this is important
- Currently the state is kinda sad
- And we’re doing a first step to fix it
In particular, by focusing on high-resource languages, we have prioritised methods that work well only when large amounts of labelled and unlabelled data are available.
What you can do/ Datasets: If you create a new dataset, reserve half of your annotation budget for creating the same size dataset in another language.
- Tracking Progress in Natural Language Processing | NLP-progress no UA :(
This master thesis tackles the following problems in the context of the Ukrainian language:
- Research the current state of NLP, especially focusing on the availability and quality of:
- Create novel Ukrainian-language datasets usable as benchmark tasks:
- create human baselines where practicable
- make them publicly available through established platforms
- Create a benchmark for the evaluation of LMs, using both the newly-created datasets/tasks and pre-existing ones
- Evaluate the existing Ukrainian LMs on this benchmark
Additional research questions are:
- Evaluate whether cross/multi language models that include Ukrainian perform as well as Ukrainian monolingual models
- Research whether there’s a significant difference in scores of tasks translated to Ukrainian using automated methods as opposed to human translations
- Compare the extent to which the language matters when solving problems, with the following languages:
- English (high resource language)
- Russian (high resource language from the same language family as Ukrainian)
Neural networks and stuff
- Backpropagation, NNs
NLP and language modeling
LLMs and their magic
- Definition and examples
- Metrics (Perplexity, bpX etc.)
Correlations between them and interplay
- from my first paper - task / dataset / benchmark / …
Taxonomy of benchmark tasks
- By task type/goal
- Include more exotic/interesting ones, e.g. truthfulQA23
- One/two/X shot?…
Benchmark data contamination
Canary GUID strings
- My own benchmark tasks have a canary string
- The three from ua-datasets don’t, and are readily available online, so they might have become part of some LLM training data
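The canary idea can be sketched as follows: a unique GUID is embedded in every benchmark file, so that its presence in a corpus (or in model output) signals contamination. The wording of the marker and both function names are assumptions for illustration, not the exact string used by any existing benchmark.

```python
import uuid


def make_canary(benchmark_name: str) -> str:
    """Create a unique marker string to embed in every file of a benchmark."""
    guid = uuid.uuid4()  # random GUID, practically impossible to occur by chance
    return f"BENCHMARK DATA, DO NOT USE FOR TRAINING. {benchmark_name} canary GUID {guid}"


def contains_canary(text: str, canary: str) -> bool:
    """Check whether a document (or model output) contains the marker."""
    return canary in text


canary = make_canary("demo-benchmark")  # hypothetical benchmark name
dataset_file = f"{canary}\nquestion: ...\nanswer: ..."
print(contains_canary(dataset_file, canary))
print(contains_canary("some unrelated web page", canary))
```

The canary only helps if it is generated once, stored, and shipped with every copy of the data; datasets published without one cannot be checked this way after the fact.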
Notable benchmark tasks
- Focus on individual tasks as opposed to bigger things
- The usual ones e.g. in (Super)GLUE
- If other languages’ versions exist - mention them
- Definitely list the ones I’ll use as base
Children’s book test
- Fact completion
- non-UA but multilingual are OK
- general examples and what makes them cool/notable, abstract/high-level, no lists of tasks
Benchmark (tasks) desiderata
- How to build a good benchmark (task) in general
- What does
- Modern but not too modern language
- e.g. not the 1 million word story
- Ease of use
- Upload datasets to HF
Inclusion in other big benchmarks
Implementations for important eval harnesses
- Modern but not too modern language
- What and why
- My list in [[230928-1735 Other LM Benchmarks notes#’evaluation harness’es]]
- I decided to use X, instead of writing my own, because
- TODO Everywhere “Ukrainian X” -> “X in the Ukrainian language”??
State of the research & literature
- Contextual Embeddings for Ukrainian: A Large Language Model Approach to Word Sense Disambiguation - ACL Anthology
- Auto-generating WSD tasks based on SUM dictionary
- The intro of the UA-GEC paper 17 links cool papers about how LMs built with “English in mind” are suboptimal for morphologically rich languages
- Morphology analyzer
- Not perfect for UA.
- Кір. КІР. (“кір” means “measles”; the analyzer picks it as the lemma of “корів”, “of cows”)
Parse(word='корів', tag=OpencorporaTag('NOUN,inan plur,gent'), normal_form='кір', score=1.0, methods_stack=((DictionaryAnalyzer(), 'корів', 498, 11),))
- also has issues with
- Кір. КІР.
- Inclusion criteria: ones that one could conceivably make into a benchmark task
- e.g. not Instructions finetuning
- [2103.16997] UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language17
- vkovenko/cross_domain_uk_reviews · Datasets at Hugging Face
Multi/cross/… datasets that include UA
Explicitly mention whether it was translated with Google Translate or by real people
Belebele Dataset | Papers With Code is a "multiple-choice machine reading comprehension (MRC) dataset" covering 122 languages
KGQA/QALD_9_plus: QALD-9-Plus Dataset for Knowledge Graph Question Answering - one of the 9 langs is Ukrainian! One could theoretically convert the entities into text
… somewhere: why can’t one just google translate existing benchmarks and be done with it? precision, eval, etc.
Eval-UA-tion Ukrainian eval benchmark
Construction, validation, …
- truthfulQA24 paper has examples
- LOOK WHETHER MY BENCHMARK IS PART OF THE TRAINING DATA!!! - doing interesting tests on the topic
News classification (NC)
Russian-Ukrainian interference test
- Auto-complete sentences based on:
Modern Ukrainian language + genders
- Check whether the model correctly uses the newer grammar, especially including захисниЦЯ etc. (but not war-related words) 2 by letting it autocomplete things
Word Sense Disambiguation
Contextual Embeddings for Ukrainian: A Large Language Model Approach to Word Sense Disambiguation - ACL Anthology (@labaContextualEmbeddingsUkrainian2023, 2023)
Children’s book test (CBT)
- Original English thing25
- My current task notes page is 231024-1704 Master thesis task CBT
- Get Ukrainian book with good OCR, POS-tag, generate questions, manually check
- Ask X people to solve all (or a subset) of the tasks, see how many they get right
- Probably a Google spreadsheet
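The question-generation step from the list above could look roughly like the sketch below. It is a toy under stated assumptions: `target_words` stands in for the output of a POS tagger (e.g. all nouns), the defaults are deliberately small, the example text is English for readability, and the function name is hypothetical. The manual check would still follow.

```python
import random

PUNCTUATION = ".,!?;:"


def make_cloze_task(sentences, target_words, n_context=3, n_options=4, seed=0):
    """Build one CBT-style item: context sentences plus a question sentence with one word blanked."""
    if len(sentences) <= n_context:
        return None
    rng = random.Random(seed)
    context = sentences[:n_context]
    question_tokens = sentences[n_context].split()
    # First token (ignoring punctuation) that the tagger marked as a target word.
    answer = next(
        (t.strip(PUNCTUATION) for t in question_tokens if t.strip(PUNCTUATION) in target_words),
        None,
    )
    if answer is None:
        return None
    # Distractors come from the other target words; sorting keeps the result reproducible.
    distractors = sorted(w for w in target_words if w != answer)
    options = rng.sample(distractors, min(n_options - 1, len(distractors))) + [answer]
    rng.shuffle(options)
    blanked = " ".join(
        "_____" if t.strip(PUNCTUATION) == answer else t for t in question_tokens
    )
    return {"context": context, "question": blanked, "options": options, "answer": answer}


sentences = [
    "A girl found a cup.",
    "The cup was green.",
    "She put it on the table.",
    "Then the girl washed the cup.",
]
target_words = {"girl", "cup", "table", "dog", "river"}  # stand-in for tagged nouns
task = make_cloze_task(sentences, target_words)
print(task["question"], task["options"])
```

For Ukrainian the matching would have to happen on lemmas rather than surface forms, since the same noun appears with different case endings across sentences.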
Models tested on the new benchmark
- for ideas about it, see truthfulQA paper23 as well as any multi-lingual benchmark paper
- openAI API
Downstream task - UP article classification
Do UP news classification with different models, do pretty graph about how it correlates with my benchmark results.
Scraping Ukrainska Pravda
How I used a polite header informing them about my actions
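A minimal sketch of such a polite header, using only the Python standard library, is given below. The user-agent text, the contact address, the delay and the function names are placeholders and assumptions, not the values actually used for the scraping.

```python
import time
import urllib.request

# Placeholder values: identify the scraper and give the site a way to reach its operator.
POLITE_HEADERS = {
    "User-Agent": "thesis-research-scraper/0.1 (academic research; see From header)",
    "From": "researcher@example.org",
}


def build_request(url: str) -> urllib.request.Request:
    """Attach the identifying headers to a request for the given URL."""
    return urllib.request.Request(url, headers=POLITE_HEADERS)


def fetch(url: str, delay_seconds: float = 2.0) -> bytes:
    """Download one page, pausing first so the server is not hammered."""
    time.sleep(delay_seconds)
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return response.read()


request = build_request("https://example.org/article")  # placeholder URL
print(request.header_items())
```

Keeping the header construction separate from the download makes it easy to check what is being sent without touching the network.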
(Anecdotally, I understand a fair share of Polish words, but only after I read them out loud!) ↩︎
Another way to say this is that synthetic languages are characterized by a higher morpheme-to-word ratio. ↩︎
(including the vocative case, used when addressing someone, absent in Russian) ↩︎
Список частот (the frequency list) says:
VERB|Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin and аналізатор (the analyzer) says
Вид=Недок|Спосіб=Інд|Число=Множ|Особа=3|Час=Майб|ДієТип=Фін; fun fact, ChatGPT fails totally at this:13 ↩︎
(@benderpost) Emily Bender (2019). “The #BenderRule: On naming the languages we study and why it matters”. https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-and-why-it-matters/ ↩︎
(@enwiki:1182341232) Wikipedia contributors (2023). “Languages used on the internet — Wikipedia, the free encyclopedia”. https://en.wikipedia.org/w/index.php?title=Languages_used_on_the_Internet&oldid=1182341232 ↩︎
(@wiki:xxx) Meta (2022). “List of Wikipedias/Table2 — Meta, discussion about wikimedia projects”. https://meta.wikimedia.org/w/index.php?title=List_of_Wikipedias/Table2&oldid=23936182 ↩︎
(@taskCBT) Felix Hill, Antoine Bordes, Sumit Chopra, Jason Weston (2015). “The goldilocks principle: Reading children’s books with explicit memory representations”. DOI: 10.48550/ARXIV.1511.02301 ↩︎