serhii.net

In the middle of the desert you can say anything you want

28 Sep 2023

Ideas for Ukrainian LM eval tasks

Context: 230928-1527 Evaluation benchmark for DE-UA text

Shortlist

Tasks

How

  • (partially auto-generated) Google Spreadsheet for anything requiring manual changes/creation

Ideas / general

General

  • would be cool to create ones from different HELM scenarios[1]
  • would be cool to find not-work-intensive ways to create this data coming from other benchmarks (e.g. classify headers by article tags etc.)
  • especially find cool ways to use the annotated corpora I found
  • I could generate my own tasks easily with some kind of annotation tool, a la label-parser, and annotate bits of Ukrainian Wikipedia[2] a la SQuAD
  • I could use some Google Translate API[3] thing
    • and manually check the translations! AND HAVE TWO DIFFERENT DATASETS AND DO GRAPHS OF DIFFERENCES/CHANGES!!!
  • LLMs shouldn’t scare me away from including easy tasks - smaller LMs exist in many contexts!
  • For simplicity and ease of inclusion in other benchmarks, I shouldn’t do anything requiring too much code. Maybe even literally limit myself to exact match or multiple-choice questions, along with prompts or something, so that the HF datasets are enough.
    • And for simplicity in uploading the datasets to HF
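
A minimal sketch of what "exact match or multiple-choice only" could look like in practice. It assumes each item is a plain record with a prompt, a gold answer and optional choices; the field names are hypothetical and not taken from any existing benchmark.

```python
from dataclasses import dataclass, field
import unicodedata


@dataclass
class Item:
    # Field names are illustrative assumptions, not an existing schema.
    prompt: str
    answer: str
    choices: list[str] = field(default_factory=list)  # empty means free-form exact match


def normalize(text: str) -> str:
    """Unicode-normalize, trim and casefold so trivial differences don't count as errors."""
    return unicodedata.normalize("NFC", text).strip().casefold()


def is_correct(item: Item, prediction: str) -> bool:
    """Exact match against the gold answer; for multiple choice the
    prediction must also be one of the listed choices."""
    pred = normalize(prediction)
    if item.choices and pred not in {normalize(c) for c in item.choices}:
        return False
    return pred == normalize(item.answer)


def accuracy(items: list[Item], predictions: list[str]) -> float:
    """Share of items answered correctly (0.0 for an empty list)."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    return sum(is_correct(i, p) for i, p in zip(items, predictions)) / len(items)
```

Anything that fits this shape can live as plain columns in a HF dataset, with no task-specific scoring code needed on the consumer side.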

Ideas

  1. Based on LinguisticAndInformationSystems/mphdict: Digital lexicographic systems Ukrainian language + (the grammatical dictionary, synonymous dictionary, etymological dictionary +):
    1. Find the best synonym for $word
  2. Tasks on Ukrainian/Russian verbs of motion[4]:
    1. Correct verb of motion for hypothetical situations
  3. Ask whether certain words rhyme
    1. especially ones where the letters make it seem like they do, but they don’t
    2. ask for the correct stress in individual words?[5]
  4. Whether idioms (фразеологізми) are used correctly
  5. Find the correct tag for the title of an article, from the possible parallel corpus: [[231002-2311 230928-1651 Random stuff about the Masterarbeit#UA-RU parallel corpus]]
  6. Children’s Book Test (@taskCBT, 2015)
    • Gutenberg has no Ukrainian books, but Anna’s Archive does, and many of them are actually stories, available as epub/fb2: казки (“fairy tales”) - Search - Anna’s Archive
    • One could filter them by decade etc. for copyright
    • Then POS-tag, and automatically generate examples
  7. Yes/no questions: BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions - ACL Anthology
  8. Russian-language interference!
    1. Remember how a number of “Ukrainian” datasets on the HF Hub are actually Russian
    2. Resources:
    3. Frame as multiple-choice task! Or boolean? Or “Is this a correct sentence”?
      1. I really like this: “Цей студент [взявся за/почав] дослідження важкої теми.” (“This student [took up/started] researching a difficult topic.”)
      2. For fun, here’s ChatGPT lying about prefixes: https://chat.openai.com/share/0eda9061-d2cf-46bc-ad45-38cc6e58934a
    4. False friends!
      1. Here’s an itemized list: Фальшиві друзі перекладача — Вікіпедія
        1. сир/сыр, неділя/неделя/…
    5. ChatGPT ideas:
      1. On the semantic front, exploit polysemy and homonymy differences. Formulate sentences with words that have multiple meanings in Russian, but those meanings have distinct equivalents in Ukrainian. This will challenge the model to accurately discern the intended sense based on context.

  9. Implicature[6]
  10. LMEntry-lite-UA[7]
    • Subset of the LMentry questions, translated to UA, with exact matches
    • will do this! here [[231203-1745 Masterarbeit eval task LMentry-static-UA]]
  11. Good old fashioned perplexity. Getting a Ukrainian reference corpus a la Wikipedia and benchmarking on it was always allowed
    1. Or Telegram, or news comments!
  12. Look into stability of models to OCR errors! Either scan some old Ukrainian book I have or simulate OCR errors like I did for BxE!
    1. I’m not the first who thought of this[8]
    2. ocr ukrainian - Google Scholar
  13. Something about the recent changes in UA, both the new 2019 orthography and feminitives[9]; this now lives here: [[231204-1642 Masterarbeit evaluation task new UA grammar and feminitives]]
  14. Use the UPravda dataset, replace bits with synonyms to get around contamination, and then do classification / entailment /…!!!
    1. Use the bold bits (‘дослівно’, i.e. “verbatim”, etc.) and match the direct speech to the correct article title/text? Залужний востаннє поговорив з Міллі на його посаді | Українська правда
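
The Children’s Book Test idea from the list above (POS-tag, then generate examples automatically) could be sketched roughly like this. It assumes the text is already tokenized and POS-tagged by some external tagger that is not shown here; the `(token, tag)` input format and the `NOUN` tag name (Universal Dependencies style) are assumptions. It is also simpler than the original CBT setup: one sentence per item, with distractors drawn from the rest of the text.

```python
import random

BLANK = "_____"


def make_cloze(tagged_sentences, target_pos="NOUN", n_choices=4, seed=0):
    """Turn POS-tagged sentences into multiple-choice cloze items.

    tagged_sentences: list of sentences, each a list of (token, pos) pairs.
    For each sentence, one token with the target POS is blanked out and the
    distractors are other tokens with the same POS from the whole text.
    """
    rng = random.Random(seed)
    # Pool of candidate words with the target POS across the whole text.
    pool = sorted({tok for sent in tagged_sentences for tok, pos in sent if pos == target_pos})
    items = []
    for sent in tagged_sentences:
        positions = [i for i, (_, pos) in enumerate(sent) if pos == target_pos]
        if not positions:
            continue  # nothing to blank out in this sentence
        idx = rng.choice(positions)
        answer = sent[idx][0]
        distractors = [w for w in pool if w != answer]
        if len(distractors) < n_choices - 1:
            continue  # not enough distractors to build a fair item
        choices = rng.sample(distractors, n_choices - 1) + [answer]
        rng.shuffle(choices)
        tokens = [BLANK if i == idx else tok for i, (tok, _) in enumerate(sent)]
        items.append({"question": " ".join(tokens), "choices": choices, "answer": answer})
    return items
```

Because Ukrainian nouns are inflected, distractors picked this way may not fit the blank grammatically, which can make items too easy; filtering candidates by case and number would be a natural next step.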

Neat datasets

Work required

bAbI

From [10], automatically generated!

  1. facebookarchive/bAbI-tasks at ccd8fd6af76d35346109e7bb5f7de9138d055e01
  2. bAbI-tasks/lua/babi/World.lua at ccd8fd6af76d35346109e7bb5f7de9138d055e01 · facebookarchive/bAbI-tasks
  3. !!! bAbI-tasks/lua/babi/tasks/worlds/world_basic.txt at ccd8fd6af76d35346109e7bb5f7de9138d055e01 · facebookarchive/bAbI-tasks

I could also use a graph-based approach? As in create an ontology, ask questions about it?..

Or split it into multiple sub-tasks! one for time, one for y/n, etc.?
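
A minimal sketch of the "automatically generated" part, in the spirit of the simplest bAbI task type (where is person X after a few moves). The names, places and English templates are hypothetical and not taken from the bAbI repository.

```python
import random

PEOPLE = ["Oksana", "Taras", "Ivan"]        # hypothetical entities
PLACES = ["kitchen", "garden", "library"]   # hypothetical locations


def generate_story(n_steps=4, seed=0):
    """Simulate people moving between places and ask where one of them is."""
    rng = random.Random(seed)
    location = {}  # current world state: person -> place
    story = []
    for _ in range(n_steps):
        person = rng.choice(PEOPLE)
        # Always move to a place different from the current one.
        place = rng.choice([p for p in PLACES if p != location.get(person)])
        location[person] = place
        story.append(f"{person} went to the {place}.")
    # Only ask about someone who actually appears in the story.
    asked = rng.choice(sorted(location))
    return {
        "story": " ".join(story),
        "question": f"Where is {asked}?",
        "answer": location[asked],
    }
```

Splitting into sub-tasks would then mean one generator like this per question type (location, time, yes/no), all reading from the same world state. For Ukrainian templates, the verb has to agree with the person's gender and the place needs the right case ending, so the templates can't be plain string substitution.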

Make my own IMDB dataset

Find some popular website with comments and ratings, do sentiment analysis: can I scrape https://rozetka.com.ua/jagermeister_4067700015532_/p4971091/comments/ ?

Not all comments are in UA but I can filter it.
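
A rough heuristic sketch for that filtering step, assuming that letters unique to one alphabet are a good enough signal: Ukrainian uses і, ї, є, ґ, which Russian lacks, and Russian uses ы, э, ъ, ё, which Ukrainian lacks. The example comments are made up.

```python
UA_ONLY = set("іїєґ")  # letters in the Ukrainian alphabet but not the Russian one
RU_ONLY = set("ыэъё")  # letters in the Russian alphabet but not the Ukrainian one


def looks_ukrainian(text: str) -> bool:
    """True if the text has Ukrainian-only letters and more of them than Russian-only ones."""
    lowered = text.lower()
    ua = sum(ch in UA_ONLY for ch in lowered)
    ru = sum(ch in RU_ONLY for ch in lowered)
    return ua > 0 and ua > ru


comments = [
    "Дуже смачний напій, раджу!",       # Ukrainian
    "Очень вкусный напиток, советую!",  # Russian
]
ukrainian_only = [c for c in comments if looks_ukrainian(c)]
```

This errs on the side of dropping data: a short Ukrainian comment with none of the distinctive letters (e.g. "Дуже добре") gets rejected too. A proper language-identification model would be more reliable for short or mixed comments.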

Use movie subtitles for basic dialogs

Literally google-translate other benchmarks and see what happens

Where to get ideas

  • the list of tasks/areas in Natural Language Processing | Papers With Code is another source of inspiration
  • Read a UA/RU language textbook for other cool hard things like the verbs of motion
  • Look at the ZNO (Ukrainian External Independent Evaluation) exam tasks!

Where to get data

  • Ask people I know for non-classified documents from their work that aren’t googleable! And measure e.g. perplexity on it, and add a canary to it when uploading the benchmark itself!
    • (or upload it as huggingface dataset so it’s not indexed in my github repo)
    • (or upload it encrypted with a Caesar cipher and include the Python script, or cat task_text.txt | rot13 or whatever)
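
One caveat for the cipher idea: standard rot13 only shifts ASCII letters, so Ukrainian text would pass through it unchanged. A minimal sketch of a Caesar shift over the 33-letter Ukrainian alphabet instead; the shift value and the canary string are placeholders I made up.

```python
import codecs

UA_LOWER = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"  # 33 letters
UA_UPPER = UA_LOWER.upper()


def caesar_ua(text: str, shift: int) -> str:
    """Shift Ukrainian letters by `shift` positions; leave everything else untouched."""
    n = len(UA_LOWER)
    table = {}
    for alphabet in (UA_LOWER, UA_UPPER):
        for i, ch in enumerate(alphabet):
            table[ord(ch)] = alphabet[(i + shift) % n]
    return text.translate(table)


CANARY = "CANARY-EXAMPLE-0000"  # hypothetical marker; replace with a unique random string

plain = "Привіт, світе!"
encoded = caesar_ua(plain, 5)

# Decoding is just the opposite shift.
assert caesar_ua(encoded, -5) == plain
# rot13 leaves Cyrillic unchanged, so it would not hide Ukrainian text at all.
assert codecs.encode(plain, "rot_13") == plain
```

This is obfuscation against crawlers and accidental training-set inclusion, not encryption; the canary is what lets one check later whether the text leaked anyway.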

Existing tasks

UA-only

From fido-ai/ua-datasets: A collection of datasets for Ukrainian language:

Multilingual including UA

Random

This is a dictionary that has homonyms as a column in the CSV: tamila-krashtan/UkrEtymDict: Revised database of Ukrainian Etymological Dictionary


  1. Holistic Evaluation of Language Models (HELM)

  2. ParlAI/parlai/tasks/squad2/test/squad2_index_test.yml at main · facebookresearch/ParlAI

  3. matheuss/google-translate-api: A free and unlimited API for Google Translate

  4. Prefixes in Russian Verbs of Motion - The Ultimate Guide

  5. lang-uk/ukrainian-word-stress-dictionary: Dictionary of word stresses in the Ukrainian language 🇺🇦

  6. @ruisLargeLanguageModels2022 (2022)

  7. @bm_lmentry (2022)

  8. @Todorov2022: Konstantin Todorov, Giovanni Colavizza, “An Assessment of the Impact of OCR Noise on Language Models” (2022)

  9. @synchak2023feminine: Vasyl Starko and Olena Synchak, “Feminine personal nouns in Ukrainian: Dynamics in a corpus” (2023)

  10. bAbI: @westonAICompleteQuestionAnswering2015, “Towards AI-Complete Question Answering” (2015) / Holistic Evaluation of Language Models (HELM)
