Masterarbeit draft
This will be the Markdown draft, I’ll jot things down and then expand.
Introduction
- Languages used on the Internet - Wikipedia list quoting Usage Statistics and Market Share of Content Languages for Websites, September 2023
Наукова новизна
These guys trained an UA LM(youscan/ukr-roberta-base · Hugging Face), but tested it on their internal tasks and they say it’s better than bert-base-multilingual-cased : How to Train a New Language Model for NLP | YouScan
LM Benchmarking
Theory
Terminology
- from my first paper - task / dataset / benchmark / …
Kinds of benchmark tasks
Notable benchmarks
Similar work
- Contextual Embeddings for Ukrainian: A Large Language Model Approach to Word Sense Disambiguation - ACL Anthology
- Auto-generating WSD tasks based on SUM dictionary
- All
ua-datasets
Ukrainian LM benchmark tasks
Basic description
Single-input understanding tasks1
POS tagging
News classification (NC)
Pair input understanding tasks
UA-SQuAD
Word Sense Disambiguation
Contextual Embeddings for Ukrainian: A Large Language Model Approach to Word Sense Disambiguation - ACL Anthology<@labaContextualEmbeddingsUkrainian2023 Contextual Embeddings for Ukrainian (2023) z/d/>
Children’s book test
Original: <@taskCBT (2015) z/d/>
Get Ukrainian book, POS-tag, generate questions
Benchmarks
Benchmark data contamination
Canary GUID strings
- My own benchmark tasks have a canary string
- The three ones from ua-datasets don’t, and are too available online - they might have become part of some LLM training data
Nel mezzo del deserto posso dire tutto quello che voglio.