In the middle of the desert you can say anything you want

20 Nov 2022

Benchmark tasks for evaluation of language models

This is a sister page to 221119-2306 LM paper garden, which is the notes/links/refs/research for the paper - this page is the draft for the paper.


  • TODO
  • Lean on this paper1’s intro
  • TL;DR:
    • eval and benchmarks of LM
    • emphasis on non-English
    • emphasis on domain-specific


  • The field of Natural Language Processing (NLP) covers various topics connected with understanding and computationally processing human (natural) languages. 2

  • History

    • Initially, NLP relied on explicitly written rules, which was time-consuming, not generalizable and inadequate in describing well all language phenomena. 3
      • TODO better source
    • A Language Model (LM) is a model that assigns a probability to a sequence of words.
    • The use of probability has been an important development, initially using approaches pioneered by Markov(SOURCE) (Markov Chains)4 that were first applied by Shannon(SOURCE) to sequences in the English language, then with a variety of methods and techniques being proposed in the 1980s-1990s(SOURCE).
      • TODO all of this in chapter three refereces of the book4, add real refs
    • In the past, machine learning approaches (naive Bayes, k-nearest-neighbors, conditional random fields (CRF), decision trees, random forests and support vector machines) were used for NLP2
    • But in recent years, the increasing availability of both large amounts of text and computational power has led to these approaches being almost completely replaced by neural models and deep learning.2
  • There are two basic ways to look at a language model2:

    • A probability distribution over words conditional on the preceding or surrounding ones
    • Determining what words mean, especially given that words derive a lot of meaning by the context in which they are placed.
  • And in both roles, LMs are a crucial part of the tasks for which they are used, be it image captioning, translation, summarization, named entity recognition etc.

  • Evaluation

    • Approaches to the evaluation of LMs can be roughly divided into two classes: extrinsic and intrinsic5.
      • Intrinsic evaluation measures the quality of the model independently from any applications
      • Extrinsic evaluation measures the performance of the model in downstream tasks.
    • Examples of intrinsic evaluation methods include both classic statistic ones like perplexity6 and more involved ones like LAMA probes7
    • Extrinsic evaluation can vary based on domain/purpose of LM, but it also includes various benchmarks like GLUE8 which include sets of tasks that each evaluate a separate facet of the LM.
  • Multi-lingual stuff:

    • In the field of NLP, the majority of research has historically been focused on the English language, though this has been changing in recent years9. This has been discussed both from a diversity/inclusion perspective10, and from the angle of language independence11.
    • The large amount of resources available for the English language incentivizes research on it, but “English is neither synonymous with nor representative of natural languages”12. Methods developed and tested on the English language won’t automatically generalize to ‘language in general’.13
    • This makes non-English, multi-lingual and cross-lingual benchmarks even more important.
      • TODO rephrase and source and finish.
  • TODO mention non-English and domain-specific stuff here

    • There are specific complexities with LMs needed for downstream tasks on special vocabularies, such as financial language.
      • Perplexity is hard with sparse vocabularies etc.
    • Another topic receiving more and more emphasis is non-English NLP, especially in the latest ACL conferences
      • Bender rule etc.
      • -> Benchmarks exist that attempt to evaluate and stimulate work on multiple languages
  • This paper reviews and systematizes the various approaches that exist to evaluate LMs, with a special focus on non-English and domain-specific LMs.

  • One6 TODO

Evaluation of LMs


Evaluation approaches can be divided into intrinsic and extrinsic. Extrinsic evaluation is usually preferred if the LM is trained for a specific downstream task - for example, if an LM is needed for token classification on financial documents, the logical approach would be to train a classifier with each candidate LM and compare their token classification scores. But this is not always possible or reasonable: the downstream task might be prohibitively large or slow to train just for evaluation purposes (or to allow iterating with the needed speed), or the model might be a generic one.

Intrinsic approaches measure the quality of the model independently of any application. They tend to be simpler than extrinsic ones, but have significant drawbacks.

Intrinsic evaluation

Perplexity and probabilistic metrics

LMs are commonly evaluated using probabilistic metrics. For character-based models, average cross-entropy is often used, for word-based models - perplexity.14

The perplexity of a LM on a test set is the inverse of the (geometric) average probability assigned to each word in the test set by the model.6 Formulated differently, it is the inverse probability of the test set, normalized by the number of words; minimizing perplexity thus means maximizing the probability of the test set according to the LM.4 It can be viewed as a measure of uncertainty when predicting the next token in a sentence15.
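The definition above can be made concrete with a small numeric sketch (toy per-token probabilities rather than a real LM; the function is ours, for illustration only):

```python
import math

def perplexity(token_probs):
    """Perplexity = inverse of the geometric mean of the probabilities
    the model assigned to each token of the test set."""
    n = len(token_probs)
    # Working in log space avoids numeric underflow on long sequences;
    # avg_neg_log is exactly the average cross-entropy (in nats).
    avg_neg_log = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log)

# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 options, so its perplexity is ~4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Note that the intermediate quantity (average negative log-probability) is the average cross-entropy mentioned above for character-based models, so the two metrics are two views of the same computation.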

It often14 correlates with downstream metrics, but not always, and has other limitations. (TODO: get the citation from the paper)

Perplexity can only be used if the output is a distribution over probabilities which sum up to 1. So (for example) a GAN-based text generator returning discrete tokens won’t have a perplexity. 14

Perplexity becomes less meaningful when comparing LMs trained on different vocabularies and datasets, but a number of standard benchmark datasets exist that can be used for comparison. 7 Another notable limitation is context- and time-dependence. The One Billion Word Benchmark is a dataset derived from the WMT 2011 News Crawl Dataset and is still commonly used to compare LM ability by reporting perplexity on it. 15 Even disregarding issues of factuality and media bias, it has been shown that such temporally frozen data has limitations: evaluating a model on its ability to generate text resembling news articles from 10 years ago penalizes models with more recent knowledge.15 Common Crawl is a dataset updated annually that can be used to train LMs - and it has been shown that LMs trained on Common Crawl perform worse on the One Billion Word Benchmark over time.

Intrinsic evaluation: probing model knowledge

LAMA7, XLAMA, X-FACTR, MickeyProbe, Negated LAMA, Pro etc.5 are a family of probes to estimate the factual and common-sense knowledge of LMs. Facts are converted to fill-in-the-blank-style questions and the model is evaluated on its predictions for the blank tokens; various facets of the model are probed this way: common-sense knowledge, word understanding, negations etc.
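The mechanics of such a probe can be sketched as follows. The templates, the relation names and the stub "model" below are illustrative stand-ins, not the actual LAMA data or any real model's predictions:

```python
# LAMA-style cloze probing: facts become fill-in-the-blank queries,
# and the model is scored on its top prediction for the blank.
TEMPLATES = {
    "capital_of": "The capital of [X] is [MASK].",
    "born_in": "[X] was born in [MASK].",
}

def to_cloze(subject, relation):
    """Turn a (subject, relation) pair into a fill-in-the-blank query."""
    return TEMPLATES[relation].replace("[X]", subject)

def precision_at_1(facts, model_top1):
    """Fraction of facts where the model's top prediction for the
    blank matches the gold object."""
    hits = sum(
        1 for subj, rel, obj in facts
        if model_top1(to_cloze(subj, rel)) == obj
    )
    return hits / len(facts)

facts = [("France", "capital_of", "Paris"),
         ("Ukraine", "capital_of", "Kyiv")]
# Stub "model" that only knows one of the two facts
stub = lambda query: "Paris" if "France" in query else "?"
print(precision_at_1(facts, stub))  # 0.5
```

In practice `model_top1` would be backed by a masked LM, and variants such as Negated LAMA swap in templates designed to test specific phenomena like negation.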

Extrinsic evaluation and benchmarks

Extrinsic evaluation assesses the quality of the model on downstream tasks. For domain-specific models, a task is usually known - for example, FinBERT16 is an LM for the financial domain trained for (and evaluated on) financial sentiment analysis.

For task-independent evaluation, benchmarks17 are used. A benchmark provides a standard way of evaluating the model’s generalization ability across tasks.

A number of both general and domain-specific benchmarks exist, both monolingual and multi-lingual ones.

GLUE, SuperGLUE and general-purpose benchmarks

The General Language Understanding Evaluation (GLUE)8 was introduced to facilitate research on models that can execute a range of different tasks in different domains. It contains nine different natural language understanding tasks, built on established annotated datasets and selected to cover a diverse range of text genres, dataset sizes and degrees of complexity.

For example, CoLA focuses on whether a given sentence is grammatically correct, MRPC on whether two sentences are semantically equivalent, and WNLI is a reading comprehension task in which a system must find the word a pronoun refers to.

GLUE itself is model-agnostic, and can evaluate the outputs of any model capable of producing results on all nine tasks.
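This model-agnostic aggregation can be sketched as follows; the scoring is (roughly) an unweighted average of per-task scores, the task names follow the paper, and the numbers are made up:

```python
def glue_overall(per_task_scores):
    """Unweighted mean of per-task scores, GLUE-leaderboard style.
    Tasks reporting several metrics (e.g. MRPC's F1 and accuracy)
    are first averaged within the task."""
    task_means = [
        sum(metrics) / len(metrics)
        for metrics in per_task_scores.values()
    ]
    return sum(task_means) / len(task_means)

scores = {
    "CoLA": [55.0],        # Matthews correlation
    "MRPC": [88.0, 84.0],  # F1, accuracy
    "WNLI": [65.0],        # accuracy
}
print(round(glue_overall(scores), 2))  # 68.67
```

Because only the submitted outputs are scored, any model architecture - or even a non-neural system - can be ranked on the same leaderboard.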

According to the AI Index Report 202118, “Progress in NLP has been so swift that technical advances have started to outpace the benchmarks to test for them.”

A little over a year after the introduction of GLUE, the state-of-the-art GLUE score surpassed the level of non-expert humans, which resulted in the creation of SuperGLUE19 by the same authors. It has the same motivation and design as GLUE but improves upon it in several ways, most prominently through more challenging and diverse tasks.

Domain-specific benchmarks

Domain-specific benchmarks exist, such as5 (TODO add sources for each):

  • TweetEval and UMSAB, with social-media-related tasks
  • CodeXGLUE with programming language generation and understanding tasks
  • BLUE, BLURB and CBLUE for the bio-medical domain

Multi-lingual, cross-lingual and non-English benchmarks


  • In 2011, Bender noticed that most papers don’t state the language they’re working on11, and in a blog post from 2019 she reminded readers that “English is Neither Synonymous with Nor Representative of Natural Language”12. “Always state the language you’re working on, even if it’s English” became known as the “Bender rule”9.
  • Not naming the language studied (usually English) implies the methods developed could work on any other languages, as if English were a neutral and universal language.
  • Language independence and language representation20 remain open concerns in NLP, as does the availability of non-English and/or multilingual corpora, benchmarks and research.

Non-English benchmarks

  • Benchmarks for languages other than English exist; for a list of non-English benchmarks, see5.
  • TODO - more here.

Multi-lingual benchmarks

Linguistic Code-switching21 is the phenomenon in which speakers alternate between languages within the same utterance.






  1.  ↩︎
  2.  ↩︎
  3.  ↩︎
  4.  ↩︎
  5.  ↩︎
  6. Evaluation Metrics For Language Models ↩︎

  7. 1909.01066 Language Models as Knowledge Bases? ↩︎

  8. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding - ACL Anthology ↩︎

  9. Do we Name the Languages we Study? The #BenderRule in LREC and ACL articles - Inria - Institut national de recherche en sciences et technologies du numérique ↩︎

  10. 2004.09095 The State and Fate of Linguistic Diversity and Inclusion in the NLP World ↩︎

  11. On Achieving and Evaluating Language-Independence in NLP - Linguistic Issues in Language Technology ↩︎

  12. The #BenderRule: On Naming the Languages We Study and Why It Matters - The Gradient ↩︎

  13.  ↩︎
  14.  ↩︎
  15. No News is Good News: A Critique of the One Billion Word Benchmark ↩︎

  16. 1908.10063 FinBERT: Financial Sentiment Analysis with Pre-trained Language Models ↩︎

  17. Challenges and Opportunities in NLP Benchmarking ↩︎

  18. 2103.06312 The AI Index 2021 Annual Report ↩︎

  19. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems ↩︎

  20. Linguistically Naïve != Language Independent: Why NLP Needs Linguistic Typology - ACL Anthology ↩︎

  21. LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation - ACL Anthology ↩︎
