19 Nov 2022

LM paper notes

~~For the paper I’m writing, I’ll actually try to do a real garden thing. With leaves etc that get updated with new info, not chronologically like my current DTB notes.~~

Basics

Perplexity and intrinsic eval

Resources:
- https://thegradient.pub/understanding-evaluation-metrics-for-language-models/
  
  Closer to the end, it discusses LM metrics and performance on downstream tasks.
- https://towardsdatascience.com/evaluation-of-language-models-through-perplexity-and-shannon-visualization-method-9148fbe10bd0
  - Intrinsic evaluation
  - Perplexity is the multiplicative inverse of the probability assigned to the test set by the language model, normalized by the number of words in the test set.
- Perplexity limitations and ways to go around it / smoothing:
- As a result, the bigram probability values of those unseen bigrams would be equal to zero making the overall probability of the sentence equal to zero and in turn perplexity to infinity. This is a limitation which can be solved using smoothing techniques.
The above cites http://web.stanford.edu/~jurafsky/slp3/3.pdf that’s longer and so much better!
Full link: http://web.stanford.edu/~jurafsky/slp3/
- ![[Screenshot_20221119-233022.png]]
- P. 37: the test set needs enough statistical power to measure improvements
- Sampling
- Chapter 3 about Shakespeare vs WSJ and genre
- P. 42: Smoothing
  - Handles unknown words so we don’t multiply by zero probabilities
  - 7 / 130 really nice basics of ML
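The zero-probability problem and the smoothing fix noted above can be sketched in a few lines. This is a minimal illustration, assuming a toy bigram model with add-k smoothing (k=1 is add-one / Laplace); the corpus, function names, and the choice to normalize by the number of predicted tokens are my own assumptions, not taken from the linked chapter.

```python
import math
from collections import Counter

def train_bigrams(tokens):
    """Count contexts (every token except the last) and bigrams in a token list."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return contexts, bigrams

def bigram_prob(prev, word, contexts, bigrams, vocab_size, k=0.0):
    """P(word | prev) with add-k smoothing; k=0 is the unsmoothed estimate."""
    numerator = bigrams[(prev, word)] + k
    denominator = contexts[prev] + k * vocab_size
    return numerator / denominator if denominator > 0 else 0.0

def perplexity(test_tokens, contexts, bigrams, vocab_size, k=0.0):
    """Inverse probability of the test tokens, normalized by the number of predictions."""
    log_prob = 0.0
    n = 0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        p = bigram_prob(prev, word, contexts, bigrams, vocab_size, k)
        if p == 0.0:
            # One unseen bigram zeroes the whole product, so perplexity is infinite.
            return math.inf
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

# Toy data (hypothetical): the test sentence contains bigrams never seen in training.
train = "<s> the cat sat on the mat </s>".split()
test = "<s> the mat sat </s>".split()
vocab_size = len(set(train))

contexts, bigrams = train_bigrams(train)
ppl_unsmoothed = perplexity(test, contexts, bigrams, vocab_size, k=0.0)
ppl_add_one = perplexity(test, contexts, bigrams, vocab_size, k=1.0)
print(ppl_unsmoothed, ppl_add_one)
```

Without smoothing the unseen bigram ("mat", "sat") drives perplexity to infinity; with add-one smoothing every bigram gets some probability mass and the result is finite.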
https://surge-ai.medium.com/evaluating-language-models-an-introduction-to-perplexity-in-nlp-f6019f7fb914
- Another take on the same, but love it
- Links the RoBERTa paper about the connection between perplexity and downstream performance!
- ![[Screenshot_20221120-000131_Fennec.png]]
- ![[Screenshot_20221119-235918_Fennec.png]]
- If surprisal lets us quantify how unlikely a single outcome of a possible event is, entropy does the same thing for the event as a whole. It’s the expected value of the surprisal across every possible outcome — the sum of the surprisal of every outcome multiplied by the probability it happens
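The surprisal / entropy relationship quoted above is easy to check numerically. A minimal sketch, assuming base-2 logarithms (so the units are bits) and two made-up coin distributions; the function names are my own.

```python
import math

def surprisal(p):
    """Surprisal (in bits) of a single outcome that has probability p."""
    return -math.log2(p)

def entropy(dist):
    """Expected surprisal over all outcomes of a distribution given as {outcome: prob}."""
    return sum(p * surprisal(p) for p in dist.values() if p > 0)

# Hypothetical example distributions.
fair_coin = {"H": 0.5, "T": 0.5}
biased_coin = {"H": 0.9, "T": 0.1}

print(entropy(fair_coin))    # 1 bit: each outcome is equally surprising
print(entropy(biased_coin))  # below 1 bit: the outcome is more predictable
```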
Excellent on the drawbacks of perplexity:
- First, as we saw in the calculation section, a model’s worst-case perplexity is fixed by the language’s vocabulary size. This means you can greatly lower your model’s perplexity just by, for example, switching from a word-level model (which might easily have a vocabulary size of 50,000+ words) to a character-level model (with a vocabulary size of around 26), regardless of whether the character-level model is really more accurate.
- Two more
- https://arxiv.org/pdf/2110.12609.pdf about perplexity and the news cycle (TODO)
- The problem is that news publications cycle through viral buzzwords quickly — just think about how often the Harlem Shake was mentioned 2013 compared to now.
https://arxiv.org/pdf/2110.12609.pdf - about the One Billion Word news benchmark
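The first drawback quoted above (the uniform-guess perplexity is fixed by vocabulary size) can be illustrated with a quick calculation. A rough sketch, assuming a model that assigns probability 1/V to every token; the vocabulary sizes are the ones from the quote, and the function name is hypothetical.

```python
import math

def uniform_perplexity(vocab_size, n_tokens=1000):
    """Perplexity of a model that gives every token probability 1 / vocab_size."""
    log_prob = n_tokens * math.log(1.0 / vocab_size)
    return math.exp(-log_prob / n_tokens)

word_level = uniform_perplexity(50_000)  # word-level vocabulary from the quote
char_level = uniform_perplexity(26)      # character-level vocabulary from the quote
print(word_level, char_level)
```

The uniform baseline comes out at roughly the vocabulary size in both cases, which is why perplexities from models with different tokenizations are not directly comparable.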

Papers

[1906.03591] A Survey on Neural Network Language Models
- Sources for basics about NLP and LM
- Sources for perplexity and problems with it
- Preprint but cited a lot
A Survey of the Usages of Deep Learning for Natural Language Processing | IEEE Journals & Magazine | IEEE Xplore
- Much more respectable-looking paper about DL in NLP specifically, and about NLP in general
A Comprehensive Survey on Word Representation Models: From Classical to State-of-the-Art Word Representation Language Models | ACM Transactions on Asian and Low-Resource Language Information Processing
- Really nice intro
My dump with links about the statistics and relationships between perplexity (PPL), bits per byte (BPB), bits per character (BPC), and cross-entropy: [[garden/it/221205-0009 Metrics for LM evaluation like perplexity, BPC, BPB]]
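As a quick reminder of how those metrics relate, here is a small conversion sketch. It assumes cross-entropy is reported as the average in nats per token (natural log), which is common but not universal; the function names and the example numbers are made up for illustration.

```python
import math

def perplexity_from_cross_entropy(ce_nats_per_token):
    """Perplexity is the exponential of the per-token cross-entropy (in nats)."""
    return math.exp(ce_nats_per_token)

def bits_per_token(ce_nats_per_token):
    """Convert nats to bits by dividing by ln(2)."""
    return ce_nats_per_token / math.log(2)

def bits_per_unit(ce_nats_per_token, n_tokens, n_units):
    """Spread the total bits over another unit count: characters (BPC) or bytes (BPB)."""
    total_bits = bits_per_token(ce_nats_per_token) * n_tokens
    return total_bits / n_units

# Hypothetical numbers: 100 tokens covering 400 characters, cross-entropy ln(16) nats/token.
ce = math.log(16)
ppl = perplexity_from_cross_entropy(ce)
bpc = bits_per_unit(ce, n_tokens=100, n_units=400)
print(ppl, bits_per_token(ce), bpc)
```

BPB uses the same formula with the byte count of the text in place of the character count, so BPC and BPB only coincide for single-byte encodings.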

Interesting intrinsic eval

They exist!
GitHub - facebookresearch/LAMA: LAnguage Model Analysis

Benchmarks

[2211.09110] Holistic Evaluation of Language Models is recent and a preprint, but covers a lot of different things and is a goldmine for citations

GLUE

GLUE: https://aclanthology.org/W18-5446/
- https://gluebenchmark.com/diagnostics long and detailed
Excellent: https://mccormickml.com/2019/11/05/GLUE/
Tasks overview: https://docs.google.com/spreadsheets/d/1BrOdjJgky7FfeiwC_VDURZuRPUFUAz_jfczPPT35P00/htmlview
https://huggingface.co/datasets/glue

SuperGLUE

https://super.gluebenchmark.com/
Much more detailed paper than the GLUE one!
More complex tasks, since models are now better than people at the easy ones
Goldmine of sources
At the end they list the excluded tasks, plus the task instructions given to humans!

Benchmarking

https://ruder.io/nlp-benchmarking/ nice in general, link to competitions and journals
- Good point about latest models having SUPERFICIAL knowledge and tests needed for this
- Basically how to benchmark
RoBERTa paper lists three: GLUE, SQuAD and RACE
https://slds-lmu.github.io/seminar_nlp_ss20/resources-and-benchmarks-for-nlp.html really detailed list
xtremr below

Finance

FinBERT / https://github.com/ProsusAI/finBERT
- Has another English-language dataset
- Discussion about cased vs. uncased etc.
- Evaluated on sentiment analysis (accuracy, regression)
- Redundant content
NFinBERT knows numbers, and there are a lot of them in finance
“Context, language modeling and multimodal data on finance”
- Models trained on a mix do better than those trained on financial data alone
- Really nice and involved and financial and I can’t go through it now
- Almost exclusively sentiment analysis
https://link.springer.com/article/10.1007/s41060-021-00285-x NER on German financial text for anonymisation
- BERT

German

https://www.deepset.ai/german-bert - German BERT
- LINKS TO DOWNSTREAM TASKS
https://www.deepset.ai/datasets SQuAD-style dataset for German
https://huggingface.co/fabianrausch German financial datasets

Multilingual

https://ruder.io/tag/natural-language-processing/index.html multilingual, non-English NLP seems to be an interest of his; might be interesting in the “why” context
Best post ever: The #BenderRule: On Naming the Languages We Study and Why It Matters
- Bender rule and English not as default
- Linguistically Naïve != Language Independent: Why NLP Needs Linguistic Typology - ACL Anthology
- She has a lot about this: Emily M. Bender - ACL Anthology, not just there
- Resonates with me immensely, and I think the time has come for more Ukrainian-language datasets?
- Do we Name the Languages we Study? The #BenderRule in LREC and ACL articles - Inria - Institut national de recherche en sciences et technologies du numérique
  - 2004.09095 The State and Fate of Linguistic Diversity and Inclusion in the NLP World
Bits:
- https://sites.research.google/xtreme benchmark
- https://www.amazon.science/blog/cross-lingual-transfer-learning-for-multilingual-voice-agents G

In the middle of the desert I can say anything I want.

serhii.net