28 Sep 2023

Ideas for Ukrainian LM eval tasks

Context: 230928-1527 Evaluation benchmark for DE-UA text

Shortlist

Tasks

231024-1704 Master thesis task CBT
the three existing ua_datasets tasks
RU/UA Interference
Feminitives (фемінітиви) - autocompleting recent language
WSD

How

(partially auto-generated) Google Spreadsheet for anything requiring manual changes/creation

Ideas / general

UCL-DARK/ludwig · Datasets at Hugging Face has an example of 0-shot, 1-shot etc. added to the dataset itself, as folds!

General

would be cool to create ones from different HELM scenarios¹
would be cool to find not-work-intensive ways to create this data coming from other benchmarks (e.g. classify headers by article tags etc.)
especially find cool ways to use the annotated corpora I found
I could generate my own tasks easily with some kind of annotation tool, a la label-parser, and annotate bits of Ukrainian Wikipedia² a la SQuAD
I could use some google translate API³ thing
- and manually check the translations! AND HAVE TWO DIFFERENT DATASETS AND DO GRAPHS OF DIFFERENCES/CHANGES!!!
LLMs shouldn’t scare me away from including easy tasks - smaller LMs exist in many contexts!
- TinyStories: Small Language Models That Still Speak Coherent English — LessWrong
For simplicity and ease of inclusion in other benchmarks, I shouldn’t do anything requiring too much code. Maybe even literally limit myself to exact match or multiple-choice questions, along with prompts or something, so that the HF datasets are enough.
- And for simplicity in uploading the datasets to HF
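A minimal sketch of what exact-match scoring could look like if I stick to that constraint. The normalization rules (NFC, lowercasing, trimming surrounding punctuation) and the helper names `normalize`, `exact_match` and `accuracy` are my assumptions, not part of any benchmark's API.

```python
import string
import unicodedata

# Characters trimmed from both ends of an answer.
# The extra quotation marks are an assumption about typical model output.
_STRIP = string.punctuation + "«»“”„’ "


def normalize(answer: str) -> str:
    """NFC-normalize, lowercase, collapse whitespace, trim surrounding punctuation."""
    text = unicodedata.normalize("NFC", answer).lower()
    text = " ".join(text.split())
    return text.strip(_STRIP)


def exact_match(prediction: str, gold: str) -> bool:
    """True if prediction and gold are identical after normalization."""
    return normalize(prediction) == normalize(gold)


def accuracy(predictions, golds) -> float:
    """Share of predictions that exactly match their gold answer."""
    pairs = list(zip(predictions, golds))
    return sum(exact_match(p, g) for p, g in pairs) / len(pairs)
```

NFC matters for Ukrainian because ї and й can also be encoded as a base letter plus a combining mark, which would otherwise fail a plain string comparison.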

Ideas

Based on LinguisticAndInformationSystems/mphdict: Digital lexicographic systems Ukrainian language + (the grammatical dictionary, synonymous dictionary, etymological dictionary +):
1. Find the best synonym for $word
Tasks on Ukrainian/Russian verbs of motion⁴:
1. Correct verb of motion for hypothetical situations
Ask whether certain words rhyme
1. especially ones where the letters make it seem like they do, but they don’t
2. ask for correct stressing of individual words?⁵
Are idioms (фразеологізми) used correctly?
Find the correct tag for the title of an article, from the possible parallel corpus: 231002-2311 230928-1651 Random stuff about the Masterarbeit#UA-RU parallel corpus
Children’s book test<@taskCBT (2015) z/d/>
- Gutenberg has no Ukrainian books, but Anna’s archive does and many of them are actually stories and epub/fb2: казки - Search - Anna’s Archive
- One could filter them by decade etc. for copyright
- Then POS-tag, and automatically generate examples
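The "POS-tag, then generate" step could be sketched roughly like this. It assumes the text is already split into sentences and tagged by some tagger, so the input is a list of `(token, pos)` pairs per sentence; the UD-style tag `"NOUN"`, the function name `make_cbt_example` and the output field names are all hypothetical. The defaults of 20 context sentences and 10 candidates follow my recollection of the original CBT setup.

```python
import random


def make_cbt_example(tagged_sentences, target_pos="NOUN", n_context=20,
                     n_candidates=10, seed=0):
    """Build one cloze item from n_context + 1 consecutive tagged sentences.

    tagged_sentences: list of sentences, each a list of (token, pos) pairs.
    Returns None if this window cannot yield a valid item.
    """
    if len(tagged_sentences) < n_context + 1:
        return None
    rng = random.Random(seed)
    context = tagged_sentences[:n_context]
    query = tagged_sentences[n_context]

    # Candidate pool: distinct words of the target POS seen in the context.
    pool = sorted({tok for sent in context for tok, pos in sent if pos == target_pos})
    # The removed word must have the target POS and also occur in the context.
    positions = [i for i, (tok, pos) in enumerate(query)
                 if pos == target_pos and tok in pool]
    if not positions or len(pool) < n_candidates:
        return None

    idx = rng.choice(positions)
    answer = query[idx][0]
    distractors = rng.sample([w for w in pool if w != answer], n_candidates - 1)
    candidates = distractors + [answer]
    rng.shuffle(candidates)

    question = " ".join("_____" if i == idx else tok
                        for i, (tok, _) in enumerate(query))
    return {
        "context": [" ".join(tok for tok, _ in sent) for sent in context],
        "question": question,
        "candidates": candidates,
        "answer": answer,
    }
```

Sliding this over a book with a stride of one sentence would give many overlapping items, so some deduplication or a larger stride is probably needed. Matching on lemmas instead of surface forms might also matter for Ukrainian, given the case inflection.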
Yes/no questions: BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions - ACL Anthology
Russian-language interference!
1. Remember how a number of “Ukrainian” datasets on the HF Hub are actually Russian
2. Resources:
  - СЛОВОВЖИВАННЯ | Горох — українські словники
    - відноситися - Антисуржик. Словник «українського» суржика
    - Словник-антисуржик онлайн
    - Антисуржик (словник) - Русский/украинский язык, культура - Форум Днепродзержинск-Каменское
    - ~~EXCELLENT~~! Мова – не калька: словник української мови - Тарас Береза - Тека авторів - Чтиво
      - parse -> estimate frequency -> include only the most frequent?
      - A lot of the examples there are, let’s say, questionable to my central-Ukrainian ear
        
        голий -> “У костюмі (в одежі) Адама і Єви; у чому мати [на світ] народила.” alrighty then
        
        Льотчик -> летун
        
        Ліберія -> “Вільна країна” I’m done
      - I want RU interference (!= surzhyk, суржик); I want RU interference (!= stylistics)
      - Some kind of filtering is definitely needed. Could be as easy as putting “1” in rows of a spreadsheet
  - https://chtyvo.org.ua/authors/Tykhyi_Oleksii/Slovnyk_movnykh_pokruchiv.pdf
  - Суржиково-український словник
    - has really nice intro!
  - Українське життя в Севастополi Юрій Гнаткевич СЛОВНИК-АНТИСУРЖИК ^ff5ccc
3. Frame as multiple-choice task! Or boolean? Or “Is this a correct sentence”?
  1. I really like this: “Цей студент [взявся за/почав] дослідження важкої теми.”
  2. For fun, here’s ChatGPT lying about prefixes: https://chat.openai.com/share/0eda9061-d2cf-46bc-ad45-38cc6e58934a
4. False friends!
  1. Here’s an itemized list: Фальшиві друзі перекладача — Вікіпедія
    1. сир/сыр, неділя/неделя/…
5. ChatGPT ideas:
  1. On the semantic front, exploit polysemy and homonymy differences. Formulate sentences with words that have multiple meanings in Russian, but those meanings have distinct equivalents in Ukrainian. This will challenge the model to accurately discern the intended sense based on context.
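The bracket notation from the "[взявся за/почав]" example could be expanded into a multiple-choice or boolean item along these lines. This is only a sketch: `template_to_item` and the output field names are hypothetical, and which option counts as correct has to come from manual annotation (e.g. the spreadsheet).

```python
import re

# One slot of the form [option A/option B/...] per template.
_SLOT = re.compile(r"\[([^\[\]]+)\]")


def template_to_item(template: str, correct: str) -> dict:
    """Expand 'text [a/b] text' into a multiple-choice item."""
    match = _SLOT.search(template)
    if match is None:
        raise ValueError("template has no [a/b] slot")
    options = [opt.strip() for opt in match.group(1).split("/")]
    if correct not in options:
        raise ValueError(f"{correct!r} is not one of {options}")
    before, after = template[:match.start()], template[match.end():]
    return {
        "question": before + "___" + after,           # multiple-choice framing
        "choices": options,
        "label": options.index(correct),
        "sentences": [before + opt + after for opt in options],  # boolean framing
    }


item = template_to_item(
    "Цей студент [взявся за/почав] дослідження важкої теми.",
    correct="почав",  # placeholder label, not a linguistic judgement
)
```

The `sentences` field gives one full sentence per option, which covers the "Is this a correct sentence?" framing with the same source data.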
Implicature⁶:
- https://huggingface.co/datasets/UCL-DARK/ludwig/viewer/0-shot/validation?row=2
LMentry-lite-UA⁷
- Subset of the LMentry questions, translated to UA, with exact matches
- will do this! here 231203-1745 Masterarbeit eval task LMentry-static-UA
Good old fashioned perplexity. Getting a Ukrainian reference corpus a la Wikipedia and benchmarking on it was always allowed
1. Or Telegram, or news comments!
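For reference, perplexity is the exponential of the mean negative log-probability per token, PPL = exp(-(1/N) Σ log p(tokenᵢ | context)). A minimal sketch, assuming the model already gives me natural-log probabilities per token; both function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def bits_per_char(token_logprobs, text):
    """Total negative log2-likelihood divided by the number of characters."""
    return -sum(token_logprobs) / (math.log(2) * len(text))
```

Per-token perplexity is not comparable across models with different tokenizers, and Ukrainian text is often split into more tokens than English. A per-character measure like `bits_per_char` is probably the fairer number to report.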
Look into stability of models to OCR errors! Either scan some old Ukrainian book I have or simulate OCR errors like I did for BxE!
1. I’m not the first who thought of this⁸
2. ocr ukrainian - Google Scholar
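Simulating OCR errors could start as simple character-level confusion. The confusion table below is made up for illustration (visually similar characters), not measured from real OCR output, and `add_ocr_noise` is a hypothetical name.

```python
import random

# Hypothetical confusion table; keys are Cyrillic letters.
# Escapes are used where the Cyrillic and Latin glyphs look identical.
CONFUSIONS = {
    "\u0456": ["i", "l", "1"],  # Cyrillic і -> Latin i, l, digit 1
    "\u043e": ["0", "o"],       # Cyrillic о -> digit 0, Latin o
    "и": ["й", "н"],
    "ш": ["щ"],
    "ї": ["\u0456"],            # ї loses its dots
    "є": ["е"],
    "ґ": ["г"],
}


def add_ocr_noise(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Replace confusable characters with probability `rate` each."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        options = CONFUSIONS.get(ch)
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return "".join(out)
```

Running the same task at several `rate` values would give a degradation curve per model. Real OCR also merges, splits and drops characters, which this sketch ignores.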
Something about the recent changes in UA, both the new 2019 orthography and feminitives⁹, is now here: 231204-1642 Masterarbeit evaluation task new UA grammar and feminitives
Use the UPravda dataset, replace bits with synonyms to get around contamination, and then do classification / entailment /…!!!
1. Use the bold bits ‘дослівно’ (“verbatim”) etc., and match the direct speech (пряма мова) to the correct article title/text? Залужний востаннє поговорив з Міллі на його посаді | Українська правда

Neat datasets

I could use this open data on petitions to match the text by type! 2.37. Дані про електронні петиції Вінницької міської територіальної громади, у тому числі, осіб, що їх підписали, та результати розгляду - Петиції 2020 - OpenData.gov.ua
- Data.gov.ua

Work required

bAbI

From ¹⁰, automatically generated!

I could also use a graph-based approach? As in create an ontology, ask questions about it?..

Or split it into multiple sub-tasks! one for time, one for y/n, etc.?

Make my own IMDB dataset

Find ~~some popular website with comments and ratings, do sentiment analysis~~: can I scrape https://rozetka.com.ua/jagermeister_4067700015532_/p4971091/comments/ ?

Also: comfy.ua: Відгуки про Pecham Professional 600 мл черный)
Someone did something similar! vkovenko/cross_domain_uk_reviews · Datasets at Hugging Face

Not all comments are in UA but I can filter them.
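A crude first-pass filter could rely on the letters that exist in only one of the two alphabets: і, ї, є, ґ for Ukrainian and ы, э, ъ, ё for Russian. This is a heuristic sketch with a hypothetical function name, not a replacement for a proper language-ID model; short comments without any of these letters stay undecided.

```python
# Letters present in the Ukrainian alphabet but not the Russian one: і ї є ґ
UA_ONLY = set("\u0456\u0457\u0454\u0491")
# Letters present in the Russian alphabet but not the Ukrainian one: ы э ъ ё
RU_ONLY = set("\u044b\u044d\u044a\u0451")


def guess_language(text: str) -> str:
    """Return 'uk', 'ru' or 'unknown' from counts of language-specific letters."""
    lowered = text.lower()
    ua = sum(ch in UA_ONLY for ch in lowered)
    ru = sum(ch in RU_ONLY for ch in lowered)
    if ua > ru:
        return "uk"
    if ru > ua:
        return "ru"
    return "unknown"
```

The same check might be a quick sanity test for the “Ukrainian” HF datasets that are actually Russian.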

Use movie subtitles for basic dialogs

e.g. Top rated movies - opensubtitles.com | opensubtitles.com
OpenSubtitles Dataset | Papers With Code
but then to do what?..

Literally google-translate other benchmarks and see what happens

E.g. this is nice: How truthful is GPT-3? A benchmark for language models — LessWrong
- the 800 questions: TruthfulQA/TruthfulQA.csv at main · sylinrl/TruthfulQA + TruthfulQA/data/eval_examples.csv at main · sylinrl/TruthfulQA

Where to get ideas

the list of tasks/areas in Natural Language Processing | Papers With Code is another source of inspiration
Read a UA/RU language textbook for other cool hard things like the verbs of motion
Look at the ЗНО (External Independent Evaluation) exam tasks!

Where to get data

Ask people I know for non-classified documents from their work that aren’t googleable! And measure e.g. perplexity on it, and add a canary to it when uploading the benchmark itself!
- (or upload it as a Hugging Face dataset so it’s not indexed in my GitHub repo)
- (or upload it encrypted with a Caesar cipher and include the Python script or cat task_text.txt | rot13 or whatever)
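One catch with the rot13 idea: Python's `rot_13` codec only rotates ASCII letters, so Ukrainian text would pass through unchanged and still be searchable. A Caesar shift over the Ukrainian alphabet would do the job; this is a minimal sketch with hypothetical names, and it is obfuscation against indexing, not encryption.

```python
import codecs

# The 33 letters of the Ukrainian alphabet, in order.
UA_ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"


def caesar(text: str, shift: int) -> str:
    """Shift Ukrainian letters by `shift` positions; leave everything else as-is."""
    n = len(UA_ALPHABET)
    out = []
    for ch in text:
        idx = UA_ALPHABET.find(ch.lower())
        if idx == -1:
            out.append(ch)  # not a Ukrainian letter: digits, Latin, punctuation
            continue
        rotated = UA_ALPHABET[(idx + shift) % n]
        out.append(rotated.upper() if ch.isupper() else rotated)
    return "".join(out)


def rot13(text: str) -> str:
    """Standard rot13; only affects ASCII letters."""
    return codecs.encode(text, "rot_13")
```

Decoding is `caesar(text, -shift)`, so the uploaded script only needs the shift value.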

Existing tasks

UA-only

From fido-ai/ua-datasets: A collection of datasets for Ukrainian language:

POS tagging
- Raw: https://lab.mova.institute/files/robochyi_tb.conllu.txt
SQuAD
- Raw: FIdo-AI/ua-squad at main
News classification
- Raw: FIdo-AI/ua-news at main

Multilingual including UA

Belebele Dataset | Papers With Code is a "multiple-choice machine reading comprehension (MRC) dataset" in 122 languages
- facebookresearch/belebele: Repo for the Belebele dataset, a massively multilingual reading comprehension dataset.
- facebook/flores · Datasets at Hugging Face has literally an example in Ukrainian <3
  - I CAN USE IT AS SENTENCE CLASSIFICATION BENCHMARK!!!!
- SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects | Papers With Code
KGQA/QALD_9_plus: QALD-9-Plus Dataset for Knowledge Graph Question Answering - one of the 9 langs is Ukrainian! One could theoretically convert the entities into text
- One could look for similar datasets over wikimagic in English, then take the name of the corresponding Ukrainian page
Training prompts for instruction finetuning, translated to UA too, can be used for matching?.. MBZUAI/Bactrian-X · Datasets at Hugging Face
Measuring knowledge retrieval from LMs in many languages: https://huggingface.co/datasets/Polyglot-or-Not/Fact-Completion / daniel-furman/polyglot-or-not: [arXiv pre-print] Are foundation language models multilingual knowledge bases?
- Actually quite cool! Multilingual leaderboard and all that
- Can be neatly compared to a pure-UA model
Pretty much everything mentioned in the chatGPT-beyond-English paper<@laiChatGPTEnglishComprehensive2023 ChatGPT Beyond English (2023) z/d/>

Random

This is a dictionary that has homonyms as a column in the CSV: tamila-krashtan/UkrEtymDict: Revised database of Ukrainian Etymological Dictionary

¹ Holistic Evaluation of Language Models (HELM)
² ParlAI/parlai/tasks/squad2/test/squad2_index_test.yml at main · facebookresearch/ParlAI
³ matheuss/google-translate-api: A free and unlimited API for Google Translate
⁴ Prefixes in Russian Verbs of Motion - The Ultimate Guide
⁵ lang-uk/ukrainian-word-stress-dictionary: Dictionary of word stresses in the Ukrainian language 🇺🇦
⁶ <@ruisLargeLanguageModels2022 (2022) z/d/>
⁷ <@bm_lmentry (2022) z/d/>
⁸ (@Todorov2022) “An Assessment of the Impact of OCR Noise on Language Models” (2022) / Konstantin Todorov, Giovanni Colavizza
⁹ (@synchak2023feminine) “Feminine personal nouns in Ukrainian: Dynamics in a corpus” (2023) / Vasyl Starko and Olena Synchak
¹⁰ bAbI: <@westonAICompleteQuestionAnswering2015 Towards AI-Complete Question Answering (2015) z/d/> / Holistic Evaluation of Language Models (HELM)

Nel mezzo del deserto posso dire tutto quello che voglio.

serhii.net