In the middle of the desert you can say anything you want

03 Oct 2023

~~My own evaluation harness for Masterarbeit notes~~ eval harnesses notes

Is this needed, or can I just use one of the existing ones? I’ll use one of the existing ones!

So these are notes about choosing one and adapting my own tasks to it.

First of all, I’d like the generator things to be runnable through Docker, especially the pravda crawler!
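A minimal Dockerfile sketch of what that could look like; `crawl.py` and `requirements.txt` are placeholder names for illustration, not the actual project layout:

```dockerfile
# Hypothetical Dockerfile for a crawler/generator script.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "crawl.py"]
```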



  • I shouldn’t forget that the OpenAI API exists!
    • And I can use it for evaluation too!
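A sketch of the “use it for evaluation too” idea, i.e. an OpenAI chat model as a grader. `build_judge_messages` is my own hypothetical helper, and the grading prompt is just an assumption about how such a judge could be phrased:

```python
# Hypothetical sketch: using an OpenAI chat model to grade task outputs.

def build_judge_messages(question: str, model_answer: str, gold: str) -> list:
    """Build a chat prompt asking the judge model to grade an answer 0/1."""
    system = "You are grading answers. Reply with exactly 1 if correct, else 0."
    user = (
        f"Question: {question}\n"
        f"Gold answer: {gold}\n"
        f"Model answer: {model_answer}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# The actual call (requires OPENAI_API_KEY; shape per the openai>=1.0 client):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o-mini",  # placeholder model name
#     messages=build_judge_messages(q, ans, gold),
# )
# verdict = resp.choices[0].message.content.strip()
```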

Other / useful libraries:

Main idea

  • Following established practice: put the dataset on the HF hub, plus some magic code to convert it into actual LM inputs.
  • Have a versioning system (both for individual tasks and the benchmark in general?)
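The “magic code” part could be as simple as a format string applied to each row. A minimal sketch, where the template and the `question`/`answer` field names are placeholders for whatever the task actually uses:

```python
# Glue code sketch: raw dataset rows -> (prompt, target) pairs for an LM.

PROMPT_TEMPLATE = "Question: {question}\nAnswer:"

def rows_to_lm_inputs(rows: list, template: str = PROMPT_TEMPLATE) -> list:
    """Turn raw dataset rows (dicts) into prompt/target pairs."""
    return [
        {"prompt": template.format(**row), "target": row["answer"]}
        for row in rows
    ]

# With a real hub dataset (needs the `datasets` package and network access):
# from datasets import load_dataset
# rows = load_dataset("my-org/my-task", split="test")  # hypothetical repo id
# inputs = rows_to_lm_inputs(list(rows))
```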



  • A task has a metadata file, and the task data.
  • task metadata
    • VERSION!
    • Can contain a task description for the model as prompt
    • Can contain the format-strings etc. for building examples
  • The task data
    • can be either:
      • a .json file
      • a HF dataset string
    • contains:
      • All the data needed to build test cases for each example
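One way the metadata/data split above could look in code; all field names here (`name`, `version`, `description`, `template`) are my own guesses at a schema, not a fixed format:

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class TaskMetadata:
    name: str
    version: str                       # VERSION! per-task versioning
    description: Optional[str] = None  # optional task description used as prompt
    template: str = "{input}"          # format string for building examples

def load_task_data(source: str) -> list:
    """Load task data from a local .json file or a HF dataset id string."""
    if source.endswith(".json"):
        return json.loads(Path(source).read_text())
    # Otherwise treat `source` as a HF hub dataset string (needs `datasets`):
    # from datasets import load_dataset
    # return list(load_dataset(source, split="test"))
    raise NotImplementedError("HF hub loading sketched in the comment above")
```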

Task types

SWAG seems the closest of the modern datasets to UA-CBT (one-word completions etc.); I should look into what exactly they do.
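For context on how harnesses typically score SWAG-style multiple choice: the LM scores each (context + ending) continuation, and the ending with the best (often length-normalized) log-likelihood wins. A sketch with a stand-in scorer instead of a real model call:

```python
import math

def pick_ending(context: str, endings: list, score_fn) -> int:
    """Return the index of the ending with the best per-token score."""
    best_i, best = 0, -math.inf
    for i, ending in enumerate(endings):
        logprob = score_fn(context, ending)           # total logprob of ending
        norm = logprob / max(len(ending.split()), 1)  # crude length normalization
        if norm > best:
            best_i, best = i, norm
    return best_i

# Toy stand-in scorer: prefers endings that share words with the context.
def toy_score(context: str, ending: str) -> float:
    overlap = len(set(context.lower().split()) & set(ending.lower().split()))
    return float(overlap) - len(ending.split()) * 0.01
```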


  1. Lai et al., “ChatGPT Beyond English” (2023) ↩︎
