LM Benchmarks notes
Context: 230928-1527 Evaluation benchmark for DE-UA text
Here I’ll keep random interesting benchmarks I find.
HELM
- Code: stanford-crfm/helm
- Sample scenario code (here, the common-sense scenario): helm/src/helm/benchmark/scenarios/commonsense_scenario.py at main · stanford-crfm/helm (rough sketch of the scenario shape after this list)
- Scenarios: Holistic Evaluation of Language Models (HELM)
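To remember the shape: a HELM scenario is basically a class that turns a raw dataset into instances, each an input plus references tagged as correct or not, split into train/test. A self-contained sketch of that shape, assuming simplified stand-in class names rather than the actual helm imports (see commonsense_scenario.py for the real thing):

```python
# Simplified stand-in for the HELM scenario shape -- not the real helm classes,
# just the structure a scenario produces.
from dataclasses import dataclass, field
from typing import List

CORRECT_TAG = "correct"

@dataclass
class Reference:
    output: str                                     # a candidate answer
    tags: List[str] = field(default_factory=list)   # e.g. [CORRECT_TAG]

@dataclass
class Instance:
    input: str                    # the question / prompt text
    references: List[Reference]   # answer options, one tagged correct
    split: str = "test"           # "train" instances become few-shot examples

def get_instances() -> List[Instance]:
    # A scenario's job: read the raw dataset and map it into Instances.
    return [
        Instance(
            input="Where would you put a dirty dish?",
            references=[
                Reference("in the sink", tags=[CORRECT_TAG]),
                Reference("in the fridge"),
            ],
        )
    ]

if __name__ == "__main__":
    for inst in get_instances():
        gold = [r.output for r in inst.references if CORRECT_TAG in r.tags]
        print(inst.input, "->", gold)
```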
GLUECoS
- Code: GLUECoS/Code at master · microsoft/GLUECoS
Cross-lingual
XGLUE
- microsoft/XGLUE: Cross-lingual GLUE
- Interesting mainly for its list of tasks and the accompanying code
- Data: XGLUE
Other
- How truthful is GPT-3? A benchmark for language models — LessWrong
- [2210.14986] Large language models are not zero-shot communicators
- LMentry!
    - https://github.com/aviaefrat/lmentry
    - With regex-based evaluations for the tasks!
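The regex bit is the nice part: each task ships patterns that decide whether a free-form answer counts as correct, so scoring needs no model. A toy sketch of that idea; the task and patterns below are made up for illustration, not taken from the LMentry repo:

```python
# Toy version of regex-based scoring, LMentry-style.
# Hypothetical task: "write a word that starts with the letter b".
import re

def score_starts_with_letter(answer: str, letter: str) -> bool:
    """Accept answers like 'banana', '"banana"', or 'The word is banana.'"""
    patterns = [
        rf'^\s*"?({letter}\w+)"?\.?\s*$',                   # bare word, maybe quoted
        rf'\bword (?:is|would be):?\s*"?({letter}\w+)"?',   # short sentence wrapper
    ]
    return any(re.search(p, answer.strip(), re.IGNORECASE) for p in patterns)

outputs = ["Banana.", 'The word is "bread"', "apple"]
for out in outputs:
    print(out, "->", score_starts_with_letter(out, "b"))
```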
BIG-bench
Leaderboards
‘Evaluation harness’es
- lm-evaluation-harness/lm_eval/tasks at big-refactor · EleutherAI/lm-evaluation-harness - very clean code; currently the big-refactor branch is the best one (minimal usage sketch after this list)
- ParlAI/parlai/tasks/multinli/test/multinli_test.yml at main · facebookresearch/ParlAI
- openai/evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
- TheoremOne/llm-benchmarker-suite: LLM Evals Leaderboard
    - the most meta of all similar suites I’ve seen
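For my own reference, a minimal way to run lm-evaluation-harness from Python. This assumes the big-refactor-style API where lm_eval exposes simple_evaluate(); argument names may have shifted since, so treat it as a sketch:

```python
# Minimal smoke-test run of lm-evaluation-harness (big-refactor-style API).
# Assumes `pip install lm-eval` and a HuggingFace-compatible model id.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # HuggingFace transformers backend
    model_args="pretrained=gpt2",  # swap in any HF model id
    tasks=["hellaswag"],
    num_fewshot=0,
    limit=10,                      # only 10 examples, just to check the plumbing
)
print(results["results"])          # per-task metrics, e.g. acc / acc_norm
```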
Random relevant code