~~My own evaluation harness for Masterarbeit notes~~ eval harnesses notes
Is this needed, or can I just use one of the existing ones? I’ll use one of the existing ones!
So these are notes about choosing one and adapting my own tasks for it.
First of all, I’d like the generator code to be runnable through Docker, especially the pravda crawler!
Related:
- 230928-1735 Other LM Benchmarks notes has examples
- I have to support prompts, e.g. see what other papers do, as well as what the harnesses do
General:
- I must not forget that the OpenAI API exists!
- And I can use it for evaluation too! (sketch below)
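A minimal sketch of using the OpenAI API to get predictions for a (hypothetical) yes/no task; this assumes the v1 `openai` Python client with `OPENAI_API_KEY` set, and the model name and prompt format are placeholders, not final choices:

```python
# Sketch: scoring a hypothetical yes/no task through the OpenAI API.
# Assumes the v1 `openai` client and OPENAI_API_KEY in the environment;
# model name and prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()

def predict_yes_no(question: str, model: str = "gpt-3.5-turbo") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with 'yes' or 'no' only."},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

def accuracy(pairs: list[tuple[str, str]]) -> float:
    # pairs of (question, gold answer), gold being "yes" or "no"
    correct = sum(predict_yes_no(q) == gold for q, gold in pairs)
    return correct / len(pairs)
```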
Other / useful libraries:
- 230928-1735 Other LM Benchmarks notes has a list of harnesses; TODO: move it here.
- bigscience-workshop/lm-evaluation-harness: A framework for few-shot evaluation of autoregressive language models.
- TheoremOne/llm-benchmarker-suite: LLM Evals Leaderboard
Main idea
- Following established practice: the dataset lives on the HF hub, plus some code that converts it into actual LM inputs (sketched after this list).
- Have a versioning system (both for individual tasks and the benchmark in general?)
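A minimal sketch of that pattern, assuming the `datasets` library; the dataset name, column names, and prompt template are placeholders for one of my tasks:

```python
# Sketch: dataset on the HF hub + a format string that turns rows into LM inputs.
# Dataset name and column names are placeholders.
from datasets import load_dataset

PROMPT_TEMPLATE = "Question: {question}\nAnswer:"

def build_inputs(dataset_name: str, split: str = "test") -> list[dict]:
    ds = load_dataset(dataset_name, split=split)
    return [
        {"prompt": PROMPT_TEMPLATE.format(question=row["question"]),
         "gold": row["answer"]}
        for row in ds
    ]
```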
Architecture
Tasks
- A task has a metadata file and the task data (see the sketch after this list).
- Task metadata:
    - VERSION!
    - Can contain a task description for the model, used as a prompt
    - Can contain the format strings etc. for building examples
- The task data:
    - can be either a
        - .json file
        - HF dataset string
    - contains:
        - all the data needed to build test cases for each example
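A sketch of how such a task definition could look; the dataclass, the field names, and the example values are my own provisional guesses, not a fixed schema:

```python
# Sketch of the two pieces of a task: metadata (version, prompt, format strings)
# and a pointer to the data (either a local .json file or an HF dataset string).
# All field names are provisional.
from dataclasses import dataclass

@dataclass
class TaskMetadata:
    name: str
    version: str                  # versioned per task
    description_prompt: str       # task description shown to the model
    example_template: str         # format string for building examples
    data_source: str              # "./tasks/<task>/data.json" or "org/dataset-name"

ua_cbt = TaskMetadata(
    name="ua-cbt",
    version="0.1.0",
    description_prompt="Choose the word that best completes the sentence.",
    example_template="{context}\nOptions: {options}\nAnswer:",
    data_source="./tasks/ua_cbt/data.json",
)
```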
Task types
- exact match: true/false
- multiple choice (the usual log-likelihood scoring approach is sketched after this list)
    - incl. binary yes/no
- lmentry/lmentry/predict.py at main · aviaefrat/lmentry contains the prediction code used to evaluate it with different kinds of models; I’ll need this.
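For multiple choice, the usual harness pattern (not necessarily what lmentry’s predict.py does) is to score each candidate continuation by its log-likelihood under the model and pick the most likely one. A rough sketch, assuming `transformers` and a placeholder model:

```python
# Sketch: multiple choice via per-option log-likelihood, the usual harness trick.
# Model name is a placeholder; this scores each candidate continuation and
# picks the one the model finds most likely.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def option_logprob(prompt: str, option: str) -> float:
    # NB: assumes tokenization splits cleanly at the prompt/option boundary
    # (a simplification that usually holds when the option starts with a space).
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    option_len = full_ids.shape[1] - prompt_ids.shape[1]
    # log-probs over next tokens, aligned so row i predicts token i+1
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    option_targets = full_ids[0, -option_len:]
    option_scores = log_probs[-option_len:].gather(1, option_targets.unsqueeze(1))
    return option_scores.sum().item()

def pick_option(prompt: str, options: list[str]) -> str:
    return max(options, key=lambda o: option_logprob(prompt, o))
```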
Links
- HF evaluation library (usage sketch below)
- lmentry
- UA models: Models - Hugging Face
- I should look into QA models as well!
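The HF evaluation library already ships an exact-match metric that would cover the exact-match task type above; a minimal usage sketch:

```python
# Sketch: exact-match scoring with the HF `evaluate` library.
import evaluate

exact_match = evaluate.load("exact_match")
results = exact_match.compute(
    predictions=["Kyiv", "blue"],
    references=["Kyiv", "green"],
)
print(results)  # fraction of exact matches, here 0.5
```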
SWAG seems the closest of the modern datasets to UA-CBT (one-word completions etc.); I should look into what exactly they do.
NarrativeQA!