LM Benchmarks notes
Context: 230928-1527 Evaluation benchmark for DE-UA text
Here I’ll keep random interesting benchmarks I find.
HELM
- Code: stanford-crfm/helm
- Sample scenario code (here, the common-sense scenario): helm/src/helm/benchmark/scenarios/commonsense_scenario.py at main · stanford-crfm/helm (rough sketch of the scenario shape after this list)
- Scenarios: Holistic Evaluation of Language Models (HELM)
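To remember the shape: a HELM scenario is basically a class that turns a raw dataset into instances, each an input plus references tagged as correct or not, split into train/test. A self-contained sketch of that shape, assuming simplified stand-in class names rather than the actual helm imports (see commonsense_scenario.py for the real thing):

```python
# Simplified stand-in for the HELM scenario shape -- not the real helm classes,
# just the structure a scenario produces.
from dataclasses import dataclass, field
from typing import List

CORRECT_TAG = "correct"

@dataclass
class Reference:
    output: str                                     # a candidate answer
    tags: List[str] = field(default_factory=list)   # e.g. [CORRECT_TAG]

@dataclass
class Instance:
    input: str                    # the question / prompt text
    references: List[Reference]   # answer options, one tagged correct
    split: str = "test"           # "train" instances become few-shot examples

def get_instances() -> List[Instance]:
    # A scenario's job: read the raw dataset and map it into Instances.
    return [
        Instance(
            input="Where would you put a dirty dish?",
            references=[
                Reference("in the sink", tags=[CORRECT_TAG]),
                Reference("in the fridge"),
            ],
        )
    ]

if __name__ == "__main__":
    for inst in get_instances():
        gold = [r.output for r in inst.references if CORRECT_TAG in r.tags]
        print(inst.input, "->", gold)
```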
GLUECoS
- Code: GLUECoS/Code at master · microsoft/GLUECoS
Cross-lingual
XGLUE
- microsoft/XGLUE: Cross-lingual GLUE
- Interesting mainly for its list of tasks and the accompanying code
- Data: XGLUE
Other
- How truthful is GPT-3? A benchmark for language models — LessWrong
- [2210.14986] Large language models are not zero-shot communicators
- LMentry!
    - https://github.com/aviaefrat/lmentry
    - With regex-based evaluations for the tasks!
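The regex bit is the nice part: each task ships patterns that decide whether a free-form answer counts as correct, so scoring needs no model. A toy sketch of that idea; the task and patterns below are made up for illustration, not taken from the LMentry repo:

```python
# Toy version of regex-based scoring, LMentry-style.
# Hypothetical task: "write a word that starts with the letter b".
import re

def score_starts_with_letter(answer: str, letter: str) -> bool:
    """Accept answers like 'banana', '"banana"', or 'The word is banana.'"""
    patterns = [
        rf'^\s*"?({letter}\w+)"?\.?\s*$',                   # bare word, maybe quoted
        rf'\bword (?:is|would be):?\s*"?({letter}\w+)"?',   # short sentence wrapper
    ]
    return any(re.search(p, answer.strip(), re.IGNORECASE) for p in patterns)

outputs = ["Banana.", 'The word is "bread"', "apple"]
for out in outputs:
    print(out, "->", score_starts_with_letter(out, "b"))
```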
BIG-bench
Leaderboards
‘Evaluation harness’es
- lm-evaluation-harness/lm_eval/tasks at big-refactor · EleutherAI/lm-evaluation-harness - very clean code; currently the big-refactor branch is the best one (minimal usage sketch after this list)
- ParlAI/parlai/tasks/multinli/test/multinli_test.yml at main · facebookresearch/ParlAI
- openai/evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
- TheoremOne/llm-benchmarker-suite: LLM Evals Leaderboard
    - the most meta of all similar suites I’ve seen
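For my own reference, a minimal way to run lm-evaluation-harness from Python. This assumes the big-refactor-style API where lm_eval exposes simple_evaluate(); argument names may have shifted since, so treat it as a sketch:

```python
# Minimal smoke-test run of lm-evaluation-harness (big-refactor-style API).
# Assumes `pip install lm-eval` and a HuggingFace-compatible model id.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # HuggingFace transformers backend
    model_args="pretrained=gpt2",  # swap in any HF model id
    tasks=["hellaswag"],
    num_fewshot=0,
    limit=10,                      # only 10 examples, just to check the plumbing
)
print(results["results"])          # per-task metrics, e.g. acc / acc_norm
```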
Random relevant code