Ideas for Ukrainian LM eval tasks
Context: 230928-1527 Evaluation benchmark for DE-UA text
Shortlist
Tasks
- 231024-1704 Master thesis task CBT
- the three existing ua_datasets tasks
- RU/UA interference
- Feminitives (фемінітиви) - autocompleting recent language
- WSD
How
- (partially auto-generated) Google Spreadsheet for anything requiring manual changes/creation
Ideas / general
- UCL-DARK/ludwig · Datasets at Hugging Face has an example of 0-shot, 1-shot etc. added to the dataset itself, as folds!
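A minimal sketch of how such n-shot folds could be pre-built, so the prompts ship with the dataset itself. The `question`/`answer` field names, the `make_shot_folds` helper and the `Q:`/`A:` prompt layout are my assumptions for illustration, not what UCL-DARK/ludwig actually does.

```python
import random

def make_shot_folds(examples, shots=(0, 1, 5), seed=0):
    """Return {"0-shot": [...], "1-shot": [...], ...} with ready-made prompts.

    `examples` is a list of dicts with hypothetical "question"/"answer" keys.
    """
    rng = random.Random(seed)  # fixed seed so the folds are reproducible
    folds = {}
    for k in shots:
        rows = []
        for i, ex in enumerate(examples):
            # Demonstrations come from all other examples, never the target.
            pool = examples[:i] + examples[i + 1:]
            demos = rng.sample(pool, k)
            parts = [f"Q: {d['question']}\nA: {d['answer']}" for d in demos]
            parts.append(f"Q: {ex['question']}\nA:")
            rows.append({"prompt": "\n\n".join(parts), "answer": ex["answer"]})
        folds[f"{k}-shot"] = rows
    return folds

# Toy data, only to show the shape of the result.
examples = [{"question": f"{n} + 1 = ?", "answer": str(n + 1)} for n in range(10)]
folds = make_shot_folds(examples)
```

Each fold could then be uploaded as its own split or config, so whoever runs the benchmark does not need any prompt-building code.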
General
- would be cool to create ones from different HELM scenarios [1]
- would be cool to find not-work-intensive ways to create this data coming from other benchmarks (e.g. classify headers by article tags etc.)
- especially find cool ways to use the annotated corpora I found
- I could generate my own tasks easily with some kind of annotation tool, a la label-parser, and annotate bits of Ukrainian Wikipedia [2] a la SQuAD
- I could use some Google Translate API [3] thing
- and manually check the translations! AND HAVE TWO DIFFERENT DATASETS AND DO GRAPHS OF DIFFERENCES/CHANGES!!!
- LLMs shouldn’t scare me away from including easy tasks - smaller LMs exist in many contexts!
- For simplicity and ease of inclusion in other benchmarks, I shouldn’t do anything requiring too much code. Maybe even literally limit myself to exact match or multiple-choice questions, along with prompts or something, so that the HF datasets are enough.
- And for simplicity in uploading the datasets to HF
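If I do limit myself to exact match and multiple choice, the scoring side stays tiny. A rough sketch under my own assumptions: the function names are made up, and the normalization (NFC, trim, lowercase) is one possible choice, not a standard.

```python
import unicodedata

def normalize(text):
    """Unicode-normalize, trim and lowercase before comparing.

    NFC matters for Ukrainian: "й" and "ї" can arrive either precomposed
    or as a base letter plus a combining mark.
    """
    return unicodedata.normalize("NFC", text).strip().lower()

def exact_match(prediction, gold):
    return normalize(prediction) == normalize(gold)

def multiple_choice_correct(option_scores, gold_index):
    """`option_scores` holds one model score (e.g. log-likelihood) per option."""
    best = max(range(len(option_scores)), key=option_scores.__getitem__)
    return best == gold_index

def accuracy(flags):
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0

# Toy predictions, only to show usage.
preds = ["Київ", " львів ", "Одеса"]
golds = ["Київ", "Львів", "Харків"]
em = accuracy(exact_match(p, g) for p, g in zip(preds, golds))
```

Anything beyond this (fuzzy matching, LLM-as-judge) would already count as "too much code" for the purposes above.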
Ideas
- Based on LinguisticAndInformationSystems/mphdict: Digital lexicographic systems Ukrainian language + (the grammatical dictionary, synonymous dictionary, etymological dictionary +):
- Find the best synonym for $word
- Tasks on Ukrainian/Russian verbs of motion [4]:
- Correct verb of motion for hypothetical situations
- Ask whether certain words rhyme
- especially ones where the letters make it seem like they do, but they don’t
- ask for correct stressing of individual words? [5]
- Whether idioms (фразеологізми) are used correctly
- Find the correct tag for the title of an article, from the possible parallel corpus: [[231002-2311 230928-1651 Random stuff about the Masterarbeit#UA-RU parallel corpus]]
- Children’s book test (@taskCBT, 2015)
- Gutenberg has no Ukrainian books, but Anna’s Archive does, and many of them are actually stories in epub/fb2: казки - Search - Anna’s Archive
- One could filter them by decade etc. for copyright
- Then POS-tag, and automatically generate examples
- Yes/no questions: BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions - ACL Anthology
- Russian-language interference!
- Remember how a number of “Ukrainian” datasets on the HF Hub are actually Russian
- Resources:
- СЛОВОВЖИВАННЯ | Горох — українські словники
- відноситися - Антисуржик. Словник «українського» суржика
- Словник-антисуржик онлайн
- Антисуржик (словник) - Русский/украинский язык, культура - Форум Днепродзержинск-Каменское
- EXCELLENT! Мова – не калька: словник української мови - Тарас Береза - Тека авторів - Чтиво
- parse -> estimate frequency -> include only the most frequent?
- A lot of the examples there are let’s say questionable to my central-Ukrainian ear
- голий -> “У костюмі (в одежі) Адама і Єви; у чому мати [на світ] народила.” alrighty then
- Льотчик -> летун
- Ліберія -> “Вільна країна” I’m done
- I want RU interference, which is != Surzhyk (суржик) and != stylistics (стилістика)
- Some kind of filtering is definitely needed. Could be as easy as putting “1” in rows of a spreadsheet
- https://chtyvo.org.ua/authors/Tykhyi_Oleksii/Slovnyk_movnykh_pokruchiv.pdf
- Суржиково-український словник
- has really nice intro!
- Українське життя в Севастополi Юрій Гнаткевич СЛОВНИК-АНТИСУРЖИК ^ff5ccc
- Frame as multiple-choice task! Or boolean? Or “Is this a correct sentence”?
- I really like this: `“Цей студент [взявся за/почав] дослідження важкої теми.”`
- For fun, here’s ChatGPT lying about prefixes: https://chat.openai.com/share/0eda9061-d2cf-46bc-ad45-38cc6e58934a
- False friends!
- Here’s an itemized list: Фальшиві друзі перекладача — Вікіпедія
- сир/сыр, неділя/неделя/…
- ChatGPT ideas:
- On the semantic front, exploit polysemy and homonymy differences. Formulate sentences with words that have multiple meanings in Russian, but those meanings have distinct equivalents in Ukrainian. This will challenge the model to accurately discern the intended sense based on context.
- Implicature [6]:
- LMentry-lite-UA [7]
- Subset of the LMentry questions, translated to UA, with exact matches
- will do this! here [[231203-1745 Masterarbeit eval task LMentry-static-UA]]
- Good old-fashioned perplexity. Getting a Ukrainian reference corpus a la Wikipedia and benchmarking on it is always an option
- Or Telegram, or news comments!
- Look into the robustness of models to OCR errors! Either scan some old Ukrainian book I have or simulate OCR errors like I did for BxE!
- I’m not the first who thought of this [8]
- ocr ukrainian - Google Scholar
- Something about the recent changes in UA, both the new 2019 orthography and feminitives [9], is now here: [[231204-1642 Masterarbeit evaluation task new UA grammar and feminitives]]
- Use the UPravda dataset, replace bits with synonyms to get around contamination, and then do classification / entailment /…!!!
- Use the bold bits like ‘дослівно’ (“verbatim”) etc., and match the direct speech (пряма мова) to the correct article title/text? Залужний востаннє поговорив з Міллі на його посаді | Українська правда
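Going back to the bracketed sentence in the Russian-interference idea above: a small sketch of how such `[option/option]` templates could be expanded into multiple-choice (or "is this sentence correct?") items. The `expand_template` helper and the output field names are hypothetical, and which option counts as correct has to come from manual annotation (e.g. the spreadsheet); the label used in the example is only a placeholder.

```python
import re

# Matches one bracketed slot like "[взявся за/почав]".
SLOT = re.compile(r"\[([^\[\]]+)\]")

def expand_template(template, correct):
    """Turn 'text [a/b] text' into a multiple-choice item.

    `correct` is the option a human annotator marked as right; the code
    cannot decide that on its own.
    """
    match = SLOT.search(template)
    if match is None:
        raise ValueError("template has no [option/option] slot")
    options = [opt.strip() for opt in match.group(1).split("/")]
    if correct not in options:
        raise ValueError(f"{correct!r} is not among {options}")
    # count=1: this sketch fills only the first slot of a template.
    sentences = [SLOT.sub(lambda m, o=opt: o, template, count=1) for opt in options]
    return {
        "question": SLOT.sub("___", template, count=1),
        "options": options,
        "sentences": sentences,  # one full sentence per option, for y/n framing
        "label": options.index(correct),
    }

item = expand_template(
    "Цей студент [взявся за/почав] дослідження важкої теми.",
    correct="взявся за",  # PLACEHOLDER label, not a linguistic judgement
)
```

The same item then serves both framings: `question` plus `options` for multiple choice, and each entry of `sentences` with a boolean label for "is this a correct sentence".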
Neat datasets
- I could use this open data on petitions to match texts to their type! 2.37. Дані про електронні петиції Вінницької міської територіальної громади, у тому числі, осіб, що їх підписали, та результати розгляду - Петиції 2020 - OpenData.gov.ua
Work required
bAbI
From [10], automatically generated!
- facebookarchive/bAbI-tasks at ccd8fd6af76d35346109e7bb5f7de9138d055e01
- bAbI-tasks/lua/babi/World.lua at ccd8fd6af76d35346109e7bb5f7de9138d055e01 · facebookarchive/bAbI-tasks
- !!! bAbI-tasks/lua/babi/tasks/worlds/world_basic.txt at ccd8fd6af76d35346109e7bb5f7de9138d055e01 · facebookarchive/bAbI-tasks
I could also use a graph-based approach? As in create an ontology, ask questions about it?..
Or split it into multiple sub-tasks! one for time, one for y/n, etc.?
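A toy sketch of what automatic generation could look like: simulate a tiny world, log the moves as a story, then ask about the final state. This is my own minimal version, not the bAbI Lua code; the names and places are made up, and the templates are English placeholders. Ukrainian templates would additionally need gender agreement on the verb and case inflection of the place names.

```python
import random

PEOPLE = ["Olena", "Taras", "Ivan"]        # hypothetical actors
PLACES = ["kitchen", "garden", "library"]  # hypothetical locations

def generate_story(n_moves=5, seed=0):
    """Simulate random moves, then ask where one of the actors ended up."""
    rng = random.Random(seed)
    location = {}  # world state: person -> current place
    story = []
    for _ in range(n_moves):
        person = rng.choice(PEOPLE)
        # Only move to a place different from the current one.
        place = rng.choice([p for p in PLACES if p != location.get(person)])
        location[person] = place
        story.append(f"{person} went to the {place}.")
    asked = rng.choice(sorted(location))  # only ask about someone who moved
    return {
        "story": " ".join(story),
        "question": f"Where is {asked}?",
        "answer": location[asked],
    }

sample = generate_story(seed=42)
```

Splitting into sub-tasks would then mean one question generator per type (location, yes/no, time) over the same world state.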
Make my own IMDB dataset
Find some popular website with comments and ratings and do sentiment analysis: can I scrape https://rozetka.com.ua/jagermeister_4067700015532_/p4971091/comments/ ?
- Also: comfy.ua: Відгуки про Pecham Professional 600 мл черный
- Someone did something similar! vkovenko/cross_domain_uk_reviews · Datasets at Hugging Face
Not all comments are in UA, but I can filter them.
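A crude first-pass filter, assuming the only confusion is UA vs RU: count letters that exist in one alphabet but not the other. The `guess_language` helper is made up for illustration; a proper language-ID model would be more robust, especially for short or mixed comments.

```python
UA_ONLY = set("іїєґ")  # in the Ukrainian alphabet, not in the Russian one
RU_ONLY = set("ыэъё")  # in the Russian alphabet, not in the Ukrainian one

def guess_language(text):
    """Return "uk", "ru" or "unknown" based on alphabet-specific letters."""
    lowered = text.lower()
    ua = sum(ch in UA_ONLY for ch in lowered)
    ru = sum(ch in RU_ONLY for ch in lowered)
    if ua > ru:
        return "uk"
    if ru > ua:
        return "ru"
    return "unknown"  # no distinguishing letters, or a tie (mixed text)

# Made-up comments, only to show usage.
comments = [
    "Дуже якісний чайник, раджу всім",
    "Отличный чайник, всем советую",
    "ok",
]
ukrainian_only = [c for c in comments if guess_language(c) == "uk"]
```

Comments landing in "unknown" could go to the spreadsheet for a manual check instead of being dropped.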
Use movie subtitles for basic dialogs
- e.g. Top rated movies - opensubtitles.com | opensubtitles.com
- OpenSubtitles Dataset | Papers With Code
- but then to do what?..
Literally google-translate other benchmarks and see what happens
- E.g. this is nice: How truthful is GPT-3? A benchmark for language models — LessWrong
Where to get ideas
- the list of tasks/areas in Natural Language Processing | Papers With Code is another source of inspiration
- Read a UA/RU language textbook for other cool hard things like the verbs of motion
- Look at the ЗНО (External Independent Evaluation) exam tasks!
Where to get data
- Ask people I know for non-classified documents from their work that aren’t googleable! And measure e.g. perplexity on it, and add a canary to it when uploading the benchmark itself!
- (or upload it as huggingface dataset so it’s not indexed in my github repo)
- (or upload it encrypted with a Caesar cipher and include the Python script, or `cat task_text.txt | rot13`, or whatever)
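One catch with the rot13 idea: classic rot13 only rotates Latin letters, so Ukrainian text would pass through unchanged. A sketch of the Caesar variant over the Ukrainian alphabet instead; the `caesar_uk` helper and the shift of 13 are arbitrary choices of mine. With 33 letters the shift is not its own inverse, so decoding uses the negative shift.

```python
UA_LOWER = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"  # 33 letters
UA_UPPER = UA_LOWER.upper()

def caesar_uk(text, shift):
    """Shift Ukrainian letters by `shift` positions; leave everything else alone."""
    n = len(UA_LOWER)
    table = {}
    for alphabet in (UA_LOWER, UA_UPPER):
        for i, ch in enumerate(alphabet):
            table[ord(ch)] = alphabet[(i + shift) % n]
    return text.translate(table)

plain = "Еталонний текст завдання"
encoded = caesar_uk(plain, 13)
decoded = caesar_uk(encoded, -13)  # 33 is odd, so decode with the negative shift
```

This is obfuscation against accidental crawling and contamination, not encryption, which seems to be all that is needed here.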
Existing tasks
UA-only
From fido-ai/ua-datasets: A collection of datasets for Ukrainian language:
Multilingual including UA
- Belebele Dataset | Papers With Code is a "multiple-choice machine reading comprehension (MRC) dataset", 122 languages
- facebookresearch/belebele: Repo for the Belebele dataset, a massively multilingual reading comprehension dataset.
- facebook/flores · Datasets at Hugging Face has literally an example in Ukrainian <3
- I CAN USE IT AS SENTENCE CLASSIFICATION BENCHMARK!!!!
- SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects | Papers With Code
- KGQA/QALD_9_plus: QALD-9-Plus Dataset for Knowledge Graph Question Answering - one of the 9 langs is Ukrainian! One could theoretically convert the entities into text
- One could look for similar datasets over wikimagic in English, then take the name of the corresponding Ukrainian page
- Training prompts for instruction fine-tuning, translated to UA too, can be used for matching?.. MBZUAI/Bactrian-X · Datasets at Hugging Face
- Measuring knowledge retrieval from LMs in many languages: https://huggingface.co/datasets/Polyglot-or-Not/Fact-Completion / daniel-furman/polyglot-or-not: [arXiv pre-print] Are foundation language models multilingual knowledge bases?
- Actually quite cool! Multilingual leaderboard and all that
- Can be neatly compared to a pure-UA model
- Pretty much everything mentioned in the ChatGPT Beyond English paper (@laiChatGPTEnglishComprehensive2023, 2023)
Random
This is a dictionary that has homonyms as a column in the CSV: tamila-krashtan/UkrEtymDict: Revised database of Ukrainian Etymological Dictionary
- ParlAI/parlai/tasks/squad2/test/squad2_index_test.yml at main · facebookresearch/ParlAI
- matheuss/google-translate-api: A free and unlimited API for Google Translate :dollar::no_entry_sign:
- lang-uk/ukrainian-word-stress-dictionary: Dictionary of word stresses in the Ukrainian language 🇺🇦
- (@Todorov2022) “An Assessment of the Impact of OCR Noise on Language Models” (2022) / Konstantin Todorov, Giovanni Colavizza
- (@synchak2023feminine) “Feminine personal nouns in ukrainian: Dynamics in a corpus” (2023) / Vasyl Starko and Olena Synchak
- bAbI: (@westonAICompleteQuestionAnswering2015) Towards AI-Complete Question Answering (2015) / Holistic Evaluation of Language Models (HELM)