
28 Sep 2023

Very first notes on my Master's thesis - Evaluation benchmark for DE-UA text

Officially - I’m doing this!

This post will be about dumping ideas and stuff.

Related posts for my first paper on this topic:

Procedural:

  • I’ll be using Zotero
  • I’ll be writing it in Markdown
    • TODO: Zotero+markdown+obsidian?..

General questions:

  • Write my own code or use any of the other cool benchmark frameworks that exist?
    • If I write the code: in what way will it be better than, e.g., EleutherAI’s lm-evaluation-harness?
    • I will be using an existing harness
  • Task type/format support - a la card types in Anki - how will I make it
    • extensible (code-wise)
    • easy to author tasks as examples? YAML or what? (see the sketch after this list)
  • Do I do German or Ukrainian first? OK to do both in the same Masterarbeit?
    • I do Ukrainian first
  • Using existing literature/websites/scrapes (=contamination) VS making up my own examples?
    • Both OK
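
To make the “YAML or what?” question concrete: a minimal sketch of a YAML-driven task registry, where adding a task type means adding a YAML entry plus one handler. Everything here - TaskSpec, the field names, tasks.yaml - is a placeholder of mine, not any existing harness’s schema.

```python
# Hypothetical sketch of a YAML-driven task registry; all names are made up.
from dataclasses import dataclass

import yaml  # pip install pyyaml

# Example tasks.yaml contents:
#   - name: ua-grammar-choice
#     task_type: multiple_choice
#     prompt_template: "Оберіть правильний варіант: {question}"
#     metric: accuracy


@dataclass
class TaskSpec:
    name: str
    task_type: str        # e.g. "multiple_choice", "generation"
    prompt_template: str  # filled with per-example fields
    metric: str           # e.g. "accuracy", "exact_match"


def load_tasks(path: str) -> list[TaskSpec]:
    """Parse a YAML file containing a list of task definitions."""
    with open(path, encoding="utf-8") as f:
        return [TaskSpec(**entry) for entry in yaml.safe_load(f)]
```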

Actual questions

  • What’s the meaningful difference between a benchmark and a set of datasets? A leaderboard? Getting-them-together?.. (toy sketch after this list)
  • Number of sentences/task items I’d have to create for a unique, valid, usable benchmark task?
    • 1000+ for it to be meaningful
  • Is it alright if I follow my own interests and create more hard/interesting tasks, as opposed to using standard datasets (e.g. NER) as benchmarks?
    • OK to translate existing tasks, OK to copy the idea of the task - both with citations ofc
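
On the benchmark-vs-datasets question, the working answer I’m leaning towards: a benchmark is the datasets plus a fixed evaluation protocol plus an agreed aggregation of per-task scores into one comparable number - which is also what a leaderboard ranks. A toy sketch of that last “getting-them-together” step, with invented task names and scores:

```python
# Toy illustration of "benchmark = datasets + aggregation"; the task names
# and scores below are invented.
def benchmark_score(per_task: dict[str, float]) -> float:
    """Macro-average per-task scores (e.g. accuracies in [0, 1]) into one
    number - the simplest common aggregation choice."""
    return sum(per_task.values()) / len(per_task)


scores = {"ua-ner": 0.71, "ua-grammar": 0.64, "ua-truthfulness": 0.58}
print(f"benchmark score: {benchmark_score(scores):.3f}")  # -> 0.643
```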

My goal

  • Build a Ukrainian benchmark (=set of tasks)
  • Of which at least a couple are my own
  • The datasets uploaded to HF (upload sketch after this list)
    • Optionally added/accepted to BIG-bench etc.
  • Optional experiments:
    • Compare whether Google-translating benchmarks is better/worse than getting a human to do it?
      • Optionally on some other cool evaluations, e.g. shutdownability or things like TruthfulQA etc.
      • See if multilingual models a la ChatGPT differ from genuinely multilingual ones
    • Evaluate the correctness of Ukrainian language VS Russian-language interference!
  • Really optional experiments
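
For the HF part, the upload itself is small; a sketch with the `datasets` library, where the rows and the repo id are placeholders:

```python
# Sketch of pushing a task's examples to the Hugging Face Hub
# (pip install datasets). Rows and repo id below are placeholders.
from datasets import Dataset

rows = [
    {"text": "...", "label": "..."},  # real task examples would go here
]

ds = Dataset.from_list(rows)
# Requires an authenticated session, e.g. via `huggingface-cli login`.
ds.push_to_hub("my-username/evaluation-task-demo")
```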

Decisions

  • Will write in English
  • I’ll upload the tasks’ datasets to the HF Hub, since all the cool people are doing it
  • Will be on Github and open to contributions/extensions
  • If I end up writing code, make it as general as possible, so that it’ll be trivial both to adapt it to DE when needed AND to port it to other eval harnesses
  • EDIT 2023-10-10:
    • I will be using an existing evaluation harness (rough usage sketch below)
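
For reference, a run through lm-evaluation-harness looks roughly like this via its Python API. Treat it as a shape rather than a recipe: the exact argument names differ between harness versions, and the model/task names are placeholders.

```python
# Rough shape of an evaluation run with EleutherAI's lm-evaluation-harness
# (pip install lm-eval). Argument names vary by version; the model and task
# names are placeholders - mine would be the new Ukrainian tasks.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                    # Hugging Face model backend
    model_args="pretrained=gpt2",  # any HF model id
    tasks=["hellaswag"],
)
print(results["results"])
```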

Resources

Benchmarks - generic

Cool places with potential

Plan, more or less

  1. Methodically look through and understand existing benchmarks and tasks
    1. Kinds of tasks
    2. How is the code for them actually written, wrt API and extensibility
  2. Do this for English, russian, Ukrainian
  3. At the same time:
    1. Start creating small interesting tasks (see 230928-1630 Ideas for Ukrainian LM eval tasks)
    2. Start writing the needed code
  4. Write the actual Masterarbeit along the way, while it’s still easy

Changelog

  • 2023-10-03 00:22: EvalUAtion is a really cool name! Ungoogleable though