28 Sep 2023

Master's thesis draft

This will be the Markdown draft; I'll jot things down and then expand.

Introduction

Languages used on the Internet (Wikipedia) has a list quoting "Usage Statistics and Market Share of Content Languages for Websites, September 2023".

Scientific novelty

These guys trained a Ukrainian LM (youscan/ukr-roberta-base · Hugging Face), but tested it on their internal tasks, and they say it's better than bert-base-multilingual-cased: How to Train a New Language Model for NLP | YouScan

LM Benchmarking

Theory

Terminology

from my first paper - task / dataset / benchmark / …

Kinds of benchmark tasks

Notable benchmarks

Similar work

Contextual Embeddings for Ukrainian: A Large Language Model Approach to Word Sense Disambiguation - ACL Anthology
- Auto-generating WSD tasks based on SUM dictionary
All ua-datasets

Ukrainian LM benchmark tasks

Basic description

Single-input understanding tasks¹

POS tagging

News classification (NC)


Pair input understanding tasks

UA-SQuAD

Word Sense Disambiguation

Contextual Embeddings for Ukrainian: A Large Language Model Approach to Word Sense Disambiguation - ACL Anthology<@labaContextualEmbeddingsUkrainian2023 Contextual Embeddings for Ukrainian (2023) z/d/>

Children’s book test

Original: <@taskCBT (2015) z/d/>. Plan: get a Ukrainian book, POS-tag it, and generate questions from the tagged text.
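
A rough sketch of the "generate questions" step, to make the idea concrete. It assumes the book is already split into sentences and POS-tagged as lists of `(token, pos)` pairs; the tagger itself is out of scope here. The function name `make_cloze`, the `"NOUN"` tag, the gap marker and the tiny context and candidate sizes are all placeholders of mine, not the settings of the original task.

```python
import random

GAP = "_____"  # hypothetical gap marker

def make_cloze(tagged_sentences, target_pos="NOUN", n_context=2, n_candidates=3, seed=0):
    """Build one cloze item from POS-tagged sentences.

    tagged_sentences: list of sentences, each a list of (token, pos) pairs.
    The first n_context sentences are the context, the next one is the query.
    Returns None if the query has no word with the target POS.
    """
    rng = random.Random(seed)
    context = tagged_sentences[:n_context]
    query = tagged_sentences[n_context]

    # Positions in the query sentence that could be blanked out.
    positions = [i for i, (_, pos) in enumerate(query) if pos == target_pos]
    if not positions:
        return None
    gap = rng.choice(positions)
    answer = query[gap][0]

    # Distractors: other words with the same POS taken from the context.
    pool = sorted({tok for sent in context for tok, pos in sent
                   if pos == target_pos and tok != answer})
    options = rng.sample(pool, min(n_candidates - 1, len(pool))) + [answer]
    rng.shuffle(options)

    return {
        "context": [" ".join(tok for tok, _ in sent) for sent in context],
        "question": " ".join(GAP if i == gap else tok for i, (tok, _) in enumerate(query)),
        "options": options,
        "answer": answer,
    }

# Toy English input standing in for a tagged Ukrainian book.
SENTENCES = [
    [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB")],
    [("A", "DET"), ("dog", "NOUN"), ("barked", "VERB")],
    [("The", "DET"), ("bird", "NOUN"), ("sang", "VERB")],
]
item = make_cloze(SENTENCES)
```

Sliding this over the book sentence by sentence would give one item per window. Open questions for later: how to handle inflected forms among the options, and whether distractors should be filtered by case and number.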

Benchmarks

Benchmark data contamination

Canary GUID strings

My own benchmark tasks include a canary string.
The three tasks from ua-datasets don't, and they are readily available online, so they may already be part of some LLM's training data.
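
A minimal sketch of how the canary could be attached and later checked for, assuming each task is a list of dict records. The prefix text, the `canary` field name and all function names are made up for illustration; this is not the exact string or format used by any existing benchmark.

```python
import uuid

# Hypothetical human-readable prefix; the GUID after it is what gets searched for.
CANARY_PREFIX = "BENCHMARK DATA, DO NOT TRAIN ON THIS. canary GUID"

def new_canary() -> str:
    """Create a canary string with a fresh random GUID."""
    return f"{CANARY_PREFIX} {uuid.uuid4()}"

def tag_records(records, canary):
    """Return copies of the records with the canary added as an extra field."""
    return [{**record, "canary": canary} for record in records]

def contains_canary(text: str, canary: str) -> bool:
    """Check whether the GUID part of the canary occurs in a piece of text."""
    guid = canary.rsplit(" ", 1)[-1]
    return guid in text

canary = new_canary()
tagged = tag_records([{"text": "example task instance"}], canary)
```

`contains_canary` could be run over a training corpus dump or over model output when probing for memorization. A hit means the benchmark leaked, while no hit proves nothing, since the canary may have been stripped along the way.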

¹ XGLUE calls them that; TODO: find a better source: <@liangXGLUENewBenchmark2020 XGLUE (2020) z/d>
