Evaluation benchmark for DE-UA text
Officially - I’m doing this!
This post will be about dumping ideas and stuff.
Related posts for my first paper on this topic:
- 221120-1419 Benchmark tasks for evaluation of language models
- 221205-0009 Metrics for LM evaluation like perplexity, BPC, BPB
- 221119-2306 LM paper garden
My first paper on the topic:
Procedural:
- I’ll be using Zotero
- I’ll be writing it in Markdown
- TODO: Zotero+markdown+obsidian?..
General questions:
- Write my own code or use any of the other cool benchmark frameworks that exist?
- Will it involve writing the code to actually run the model, or just parsing input/output types?
- Metrics?
- If I write the code: in what way will it be better than, e.g., EleutherAI’s lm-evaluation-harness?
- Task types/formats support - a la card types in Anki - how will I make it
- extensible (code-wise)
- easy to provide tasks as examples? YAML or what?
- Do I do German or Ukrainian first? OK to do both in the same Masterarbeit?
- Using existing literature/websites/scrapes (=contamination) VS making up my own examples?
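To make the "card types à la Anki" question more concrete, here is a minimal sketch of one way it could look, not a decision. Everything in it is hypothetical: the names `TASK_TYPES`, `register`, `MultipleChoice`, `Cloze` and `load_tasks` are made up for illustration, and JSON stands in for YAML only so that the sketch needs nothing beyond the standard library.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical registry of task types, keyed by a short name
# (the analogue of card types in Anki).
TASK_TYPES: Dict[str, type] = {}


def register(name: str) -> Callable[[type], type]:
    """Class decorator that makes a task type available under `name`."""
    def wrap(cls: type) -> type:
        TASK_TYPES[name] = cls
        return cls
    return wrap


@register("multiple_choice")
@dataclass
class MultipleChoice:
    question: str
    options: List[str]
    answer: int  # 0-based index into `options`

    def prompt(self) -> str:
        lines = [self.question]
        lines += [f"{i + 1}. {opt}" for i, opt in enumerate(self.options)]
        return "\n".join(lines)

    def score(self, prediction: str) -> float:
        # Exact match against the 1-based number of the correct option.
        return float(prediction.strip() == str(self.answer + 1))


@register("cloze")
@dataclass
class Cloze:
    text: str    # contains a single "___" gap
    answer: str

    def prompt(self) -> str:
        return self.text

    def score(self, prediction: str) -> float:
        # Case-insensitive exact match.
        return float(prediction.strip().lower() == self.answer.lower())


def load_tasks(raw: str) -> list:
    """Build task objects from a JSON list of {"type": ..., <fields>}."""
    tasks = []
    for item in json.loads(raw):
        item = dict(item)
        cls = TASK_TYPES[item.pop("type")]
        tasks.append(cls(**item))
    return tasks


# Made-up example items, only to show the file format.
EXAMPLE = """
[
  {"type": "multiple_choice", "question": "2 + 2 = ?",
   "options": ["3", "4"], "answer": 1},
  {"type": "cloze", "text": "The capital of Ukraine is ___.", "answer": "Kyiv"}
]
"""
```

With something like this, adding a new task type would mean writing one decorated class, and providing tasks would mean editing a data file, e.g. `for t in load_tasks(EXAMPLE): print(t.prompt())`.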
Actual questions
- What’s the meaningful difference between a benchmark and a set of datasets? A leaderboard? Getting-them-together?..
- How many sentences/task instances would I have to create for a unique, valid, usable benchmark task?
- Is it alright if I follow my own interests and create harder/more interesting tasks, as opposed to using standard datasets (e.g. NER) as benchmarks?
My goal
- Build a Ukrainian benchmark (=set of tasks)
- Of which at least a couple are my own
- The datasets uploaded to HF
Decisions
- Will write in English
- I’ll upload it to HF hub, since all the cool people are doing it
- Will be on Github and open to contributions/extensions
- Write the code as generally as possible, so that it’ll be trivial to adapt to DE when needed
Resources
Github
- asivokon/awesome-ukrainian-nlp: Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libriaries, etc.)
- see their links to other resources!
- Helsinki-NLP/UkrainianLT: A collection of links to Ukrainian language tools
- UA grammatical error correction competition! Damn! CodaLab - Competition
- Ukrainian NLP projects on GitHub: Repository search results - ukrainian-language · GitHub Topics
- The UNLP 2023 Shared Task on Grammatical Error Correction for Ukrainian - ACL Anthology: really cool paper with cool links/citations.
- Cool model with links to datasets etc.! robinhad/kruk: Ukrainian instruction-tuned language models and datasets
Datasets UA, almost exclusively
- Lists
- A lot of them here: ukrainian-language · GitHub Topics
- zeusfsx/ukrainian-stackexchange · Datasets at Hugging Face
- hard to read, but: Hugging Face – The AI community building the future. The non-multilingual datasets start around pages 2-4
- LARGE news with titles and uris: zeusfsx/ukrainian-news · Datasets at Hugging Face
- Machine Learning Datasets | Papers With Code
- Multiple choice comprehension, multilang: Belebele Dataset | Papers With Code
- FactCompletion
- reviews from rozetka/tripadv/.. vkovenko/cross_domain_uk_reviews · Datasets at Hugging Face
- 300k IS-A relations, some quite funny: lang-uk/hypernymy_pairs · Datasets at Hugging Face
Benchmarks UA
- Contextual Embeddings for Ukrainian: A Large Language Model Approach to Word Sense Disambiguation - ACL Anthology
- fido-ai/ua-datasets: A collection of datasets for Ukrainian language
- “ua_datasets is a collection of Ukrainian language datasets. Our aim is to build a benchmark for research related to natural language processing in Ukrainian.”
- Cool example of API usage: ua-datasets/ua_datasets/src/question_answering at main · fido-ai/ua-datasets
UA grammar/resources/…
- Linguistic Resources for the Ukrainian language on-line - Universität Regensburg
- лабораторія української (“Ukrainian language lab”)
- I need to know more theory to understand it, but it feels EXTREMELY useful
- has an API:
> curl -F json=false -F data='привіт мене звати Сірьожа' -F tokenizer= -F tagger= -F parser= https://api.mova.institute/udpipe/process
- Генеральний регіонально анотований корпус української мови (ГРАК), i.e. the General Regionally Annotated Corpus of Ukrainian (GRAC)
- Corpora UA
- brown-uk/corpus: Браунський корпус української мови (Brown Corpus of the Ukrainian Language)
- gold standard / UniversalDependencies/UD_Ukrainian-IU at dev
- Інші корпуси української мови та слов’янських мов (“Other corpora of Ukrainian and Slavic languages”), with info about whether each can be downloaded
- saganoren/ukr-twi-corpus: A corpus of Ukrainian Twitter texts + instructions for downloading and filtering texts.
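For use from benchmark code, the curl call to the лабораторія української API above could be mirrored in Python roughly like this. It is a sketch that only reproduces the form fields and URL from that curl line with the standard library; the helper names (`build_multipart`, `build_request`, `tag_text`) are made up, and I'm assuming nothing about the response format, so the body is returned as raw text.

```python
import urllib.request
import uuid

# URL and form fields are copied from the curl example; nothing else is known.
API_URL = "https://api.mova.institute/udpipe/process"


def build_multipart(fields, boundary=None):
    """Encode text fields as multipart/form-data, which is what `curl -F` sends."""
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")
        lines.append(value)
    lines.append(f"--{boundary}--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def build_request(text):
    """Prepare (but do not send) the POST request for one piece of text."""
    fields = {
        "json": "false",
        "data": text,
        "tokenizer": "",
        "tagger": "",
        "parser": "",
    }
    body, content_type = build_multipart(fields)
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


def tag_text(text, timeout=30):
    """Send the request and return the raw response body as a string.

    Assumption: the response is UTF-8 text; its structure is not parsed here.
    """
    with urllib.request.urlopen(build_request(text), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `tag_text("привіт мене звати Сірьожа")` would then be the equivalent of the curl line; it needs network access, so nothing is sent when the module is merely imported.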
General evaluation bits:
Benchmarks - generic
Here: 230928-1735 Other LM Benchmarks notes
Cool places with potential
- Ask the Горох online Ukrainian dictionaries (Про сайт | Горох) whether they can make dumps available; I could then do something like “find the closest synonym to this word” etc.
- OH NICE: Home · LinguisticAndInformationSystems/mphdict Wiki
- These seem to be the DBs of the dictionaries: mphdict/src/data at master · LinguisticAndInformationSystems/mphdict
- Cited in WSD task [1]:
- Open data places:
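If a dictionary dump does become available, the “find the closest synonym to this word” idea could be turned into multiple-choice items roughly as follows. This is only a sketch under assumptions: the dump is imagined as a simple word → set-of-synonyms mapping, the function name `make_synonym_item` is made up, and the tiny English word lists are invented placeholders, not data from any of the dictionaries above.

```python
import random

# Invented placeholder data; a real version would come from a dictionary dump.
TOY_SYNONYMS = {"big": {"large", "huge"}}
TOY_VOCAB = ["big", "large", "huge", "small", "red", "fast", "cold", "quiet"]


def make_synonym_item(word, synonyms, vocabulary, n_options=4, rng=None):
    """Build one multiple-choice item: pick the synonym of `word`.

    One option is a true synonym; the rest are distractors sampled from
    vocabulary words that are neither `word` nor one of its synonyms.
    Raises ValueError if there are too few candidate distractors.
    """
    rng = rng or random.Random(0)
    correct = rng.choice(sorted(synonyms[word]))
    pool = sorted(
        w for w in vocabulary if w != word and w not in synonyms[word]
    )
    distractors = rng.sample(pool, n_options - 1)
    options = distractors + [correct]
    rng.shuffle(options)
    return {
        "question": f"Which word is closest in meaning to '{word}'?",
        "options": options,
        "answer": options.index(correct),  # 0-based index of the synonym
    }


item = make_synonym_item("big", TOY_SYNONYMS, TOY_VOCAB, rng=random.Random(1))
```

Random distractors are the simplest choice; picking them by part of speech or frequency would probably make the task harder, but that needs richer dictionary data than assumed here.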
Plan, more or less
- Methodically look through and understand existing benchmarks and tasks
- Kinds of tasks
- How is the code for them actually written, wrt API and extensibility
- Do this for English, russian, Ukrainian
- At the same time:
- Start creating small interesting tasks (see 230928-1630 Ideas for Ukrainian LM eval tasks)
- Start writing the needed code
- Write the actual Masterarbeit along the way, while it’s still easy
[1] @labaContextualEmbeddingsUkrainian2023: Contextual Embeddings for Ukrainian: A Large Language Model Approach to Word Sense Disambiguation (2023) - ACL Anthology