~~My own evaluation harness for Masterarbeit notes~~ eval harnesses notes
Is this needed, or can I just use one of the existing ones? I’ll use one of the existing ones!
So these are notes about choosing one and adapting my own tasks to it.
First of all, I’d like the dataset generators to be runnable through Docker, especially the pravda crawler!
Related:
- 230928-1735 Other LM Benchmarks notes has examples
- I have to support prompts, e.g. see what other papers do, as well as the harnesses
General:
- I mustn’t forget that the OpenAI API exists!
- And I can use it for evaluation too!
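A minimal sketch of what using the OpenAI API for evaluation could look like for a single prompt, assuming the current `openai` Python client; the model name and the example prompt are placeholders, not decisions:

```python
# Minimal sketch: query the OpenAI API for one benchmark prompt.
# The model name and the prompt below are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Return the model's raw completion for a single benchmark prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep outputs as deterministic as possible for eval
    )
    return response.choices[0].message.content


print(answer("Answer with one word: the capital of Ukraine is"))
```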
Other / useful libraries:
- 230928-1735 Other LM Benchmarks notes has a list of harnesses; TODO: move it here.
- bigscience-workshop/lm-evaluation-harness: A framework for few-shot evaluation of autoregressive language models (see the sketch after this list).
- TheoremOne/llm-benchmarker-suite: LLM Evals Leaderboard
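A rough sketch of driving an existing task through the harness from Python; this assumes the EleutherAI lm-evaluation-harness interface (roughly v0.4+; the bigscience fork descends from it, and exact function/argument names differ between versions), and the model/task names are placeholders:

```python
# Sketch: running an existing task through lm-evaluation-harness from Python.
# Assumes the EleutherAI API (~v0.4); model and task names are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # HuggingFace transformers backend
    model_args="pretrained=gpt2",  # any causal LM from the hub
    tasks=["hellaswag"],           # later: my own UA tasks
    num_fewshot=0,
)
print(results["results"])
```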
Main idea
- Following established practice: dataset on the HF hub plus some glue code to convert it into actual LM inputs (see the sketch after this list).
- Have a versioning system (both for individual tasks and the benchmark in general?)
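A sketch of the "dataset on the hub plus glue code" idea, assuming `datasets.load_dataset`; the dataset id, column names and the prompt template are placeholders:

```python
# Sketch: dataset on the HF hub + glue code that turns rows into LM inputs.
# The dataset id, column names and prompt template are placeholders.
from datasets import load_dataset

PROMPT_TEMPLATE = "Question: {question}\nAnswer:"


def build_inputs(dataset_id: str, split: str = "test") -> list[dict]:
    """Turn raw dataset rows into (prompt, reference) pairs for evaluation."""
    ds = load_dataset(dataset_id, split=split)
    return [
        {
            "prompt": PROMPT_TEMPLATE.format(question=row["question"]),
            "reference": row["answer"],
        }
        for row in ds
    ]


# build_inputs("my-namespace/my-ua-task")  # hypothetical dataset id
```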
Architecture
Tasks
- A task has a metadata file and the task data (see the sketch after this list).
- Task metadata:
    - VERSION!
    - Can contain a task description for the model, as a prompt
    - Can contain the format strings etc. for building examples
- Task data:
    - can be either:
        - a .json file
        - a HF dataset string
    - contains all the data needed to build test cases for each example
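A provisional sketch of how task metadata and task data could hang together; every field name here is a placeholder, not a fixed schema:

```python
# Sketch of one possible task layout: versioned metadata with an optional
# task description (prompt) and a format string, plus a data source that is
# either a local .json file or a HF dataset id. All names are provisional.
import json
from dataclasses import dataclass
from pathlib import Path

from datasets import load_dataset


@dataclass
class TaskMetadata:
    name: str
    version: str                    # versioned per task
    description: str | None = None  # optional prompt-style task description
    example_template: str = "{input}\n{output}"  # format string for one example


@dataclass
class Task:
    metadata: TaskMetadata
    data_source: str  # path to a .json file, or a HF dataset id string

    def load_examples(self) -> list[dict]:
        """Return the raw examples used to build test cases."""
        if self.data_source.endswith(".json"):
            return json.loads(Path(self.data_source).read_text())
        return list(load_dataset(self.data_source, split="test"))
```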
Task types
- exact match: true/false
- multiple choice (see the scoring sketch after this list)
    - incl. binary yes/no
- lmentry/lmentry/predict.py at main · aviaefrat/lmentry contains the prediction code used to evaluate it with different kinds of models; I’ll need this.
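A sketch of the two scoring modes above, using plain string comparison; the existing harnesses (and lmentry's predict.py) usually score multiple choice via per-option log-likelihoods instead, which needs model access and is skipped here:

```python
# Sketch of the two task types above: exact match and multiple choice.
# Scoring here is plain normalized string comparison.

def normalize(text: str) -> str:
    """Lowercase, trim and collapse whitespace before comparing strings."""
    return " ".join(text.strip().lower().split())


def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)


def multiple_choice(prediction: str, options: list[str], answer_index: int) -> bool:
    """True iff the prediction matches the gold option and no other option."""
    matches = [i for i, opt in enumerate(options) if normalize(opt) == normalize(prediction)]
    return matches == [answer_index]


# Binary yes/no is just multiple choice with options ["yes", "no"].
assert exact_match(" Kyiv ", "kyiv")
assert multiple_choice("no", ["yes", "no"], answer_index=1)
```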
Links
- HF evaluation library
- lmentry
- UA models: Models - Hugging Face
- I should look into QA models as well!
SWAG seems the closest of the modern tasks to UA-CBT (one-word completions etc.); I should look into what exactly they do.
NarrativeQA!