Day 2268 / Current LLM evaluation landscape
The Open LLM Leaderboard is dead[1], as good a time as any to look for new eval stuff!
-
HF universe
- The Open LLM Leaderboard people are actually OpenEvals, and they've created other cool stuff
- huggingface/lighteval: "Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends", their eval suite
- I like their documentation for adding a new task / new model (rough sketch below)
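- what adding a custom task looks like, from memory of their docs; LightevalTaskConfig, Doc, Metrics, and TASKS_TABLE are real lighteval names, but the exact fields have shifted between versions, so treat this as an approximation and check the current docs

```python
# Approximate shape of a lighteval community task file -- verify the field
# names against the current lighteval docs before using.
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc
from lighteval.metrics.metrics import Metrics

def prompt_fn(line, task_name: str = None):
    # Map one dataset row to a Doc: the prompt, the candidate choices, and
    # the index of the correct choice.
    return Doc(
        task_name=task_name,
        query=f"Question: {line['question']}\nAnswer:",
        choices=[f" {c}" for c in line["choices"]],
        gold_index=line["answer"],
    )

my_task = LightevalTaskConfig(
    name="my_task",
    prompt_function=prompt_fn,
    suite=["community"],
    hf_repo="my_org/my_dataset",  # hypothetical dataset repo
    hf_subset="default",
    hf_avail_splits=["test"],
    evaluation_splits=["test"],
    metric=[Metrics.loglikelihood_acc],
)

# lighteval discovers community tasks through this module-level table
TASKS_TABLE = [my_task]
```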
- huggingface/evaluation-guidebook: Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!
- the content is awesome, w/ examples, model-as-a-judge (generic sketch below), etc.
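- the model-as-a-judge pattern in one screen; this is not the guidebook's code, just a generic sketch using the OpenAI client, and the model name / rubric are placeholders

```python
# Generic model-as-a-judge sketch: ask a strong model to grade a candidate
# answer against a reference on a 1-5 scale. Model name and prompt are
# placeholders, not anything from the guidebook.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an answer to a question.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Give a score from 1 (wrong) to 5 (fully correct). Reply with the number only."""

def judge(question: str, reference: str, candidate: str,
          model: str = "gpt-4o-mini") -> int:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return int(resp.choices[0].message.content.strip())

print(judge("What is 2+2?", "4", "It's 4."))  # expect 5
```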
- evaluate-metric, which they recommend as a guide to existing metrics (quick usage sketch below)
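- using it is basically one call; a minimal sketch assuming the "accuracy" and "rouge" metrics from the hub (rouge needs the rouge_score package installed)

```python
# Minimal usage sketch of the evaluate library (pip install evaluate).
import evaluate

# classification-style metric
accuracy = evaluate.load("accuracy")
print(accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
# -> {'accuracy': 0.75}

# text-overlap metric (requires the rouge_score package)
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=["the cat sat on the mat"],
                    references=["the cat is on the mat"]))
```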
-
Harnesses
- Evalverse: Unified and Accessible Library for Large Language Model Evaluation, a meta-harness that can run different underlying harnesses depending on the target. The GitHub repo looks archived, though: UpstageAI/evalverse: The Universe of Evaluation. All about the evaluation for LLMs.
- open-compass/opencompass: OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
- I don't really like their documentation on adding datasets/models (rough config sketch below)
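- for reference, an OpenCompass config is just a Python file; this is a rough sketch pieced together from their example configs (HuggingFaceCausalLM, read_base, and run.py are real, but the exact fields may not match the current version)

```python
# Rough sketch of an OpenCompass config file (meant to live under configs/ so
# the relative dataset import resolves). Field names follow their example
# configs and may differ in the current version.
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

with read_base():
    # reuse a dataset config shipped with the repo
    from .datasets.mmlu.mmlu_gen import mmlu_datasets

datasets = [*mmlu_datasets]

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr="llama-2-7b-hf",
        path="meta-llama/Llama-2-7b-hf",
        tokenizer_path="meta-llama/Llama-2-7b-hf",
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]

# then, from the repo root: python run.py configs/this_file.py
```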
- The venerable stanford-crfm/helm: Holistic Evaluation of Language Models (HELM)
- openai/evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
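- the basic flow, as far as I understand it: write samples as JSONL, register the eval in their YAML registry, run it with oaieval; below is a sketch of the sample format their simple "match" evals consume (from memory, double-check against the repo's examples)

```python
# Hedged sketch of the JSONL sample format used by openai/evals' basic
# "match" evals: each line has chat-style "input" messages plus an "ideal"
# answer. File path and eval name below are hypothetical.
import json

samples = [
    {"input": [{"role": "system", "content": "Answer concisely."},
               {"role": "user", "content": "What is the capital of France?"}],
     "ideal": "Paris"},
    {"input": [{"role": "user", "content": "2 + 2 = ?"}],
     "ideal": "4"},
]

with open("my_eval_samples.jsonl", "w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")

# Register the eval in a YAML under evals/registry/evals/, then run with
# something like: oaieval gpt-3.5-turbo my-eval
```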
-
Resources / articles:
- Meta's guide on how to evaluate: Validation | How-to guides
- HF eval guidebook: huggingface/evaluation-guidebook