19 Nov 2022

LM paper notes

For the paper I’m writing, I’ll actually try to do a real garden thing, with leaves etc. that get updated with new info rather than chronologically like my current DTB notes.

Basics

Perplexity and intrinsic eval

  • Resources:
    • The above cites http://web.stanford.edu/~jurafsky/slp3/3.pdf, which is longer and so much better!
    • Full link to the whole book: http://web.stanford.edu/~jurafsky/slp3/
      • p. 37: the test set needs enough statistical power to measure improvements
      • Sampling
      • Chapter 3 about Shakespeare vs. WSJ and genre
      • p. 42: smoothing
        • Unknown words, so we don’t multiply zero probabilities (see the add-one sketch after this list)
      • pp. 7/130: really nice basics of ML
  • https://surge-ai.medium.com/evaluating-language-models-an-introduction-to-perplexity-in-nlp-f6019f7fb914
    • Another take on the same topic; love it
    • Links the RoBERTa paper about the connection between perplexity and downstream task performance!
    • “If surprisal lets us quantify how unlikely a single outcome of a possible event is, entropy does the same thing for the event as a whole. It’s the expected value of the surprisal across every possible outcome — the sum of the surprisal of every outcome multiplied by the probability it happens.” (A code sketch of these definitions follows this list.)

  • Excellent on the drawbacks of perplexity:
    • “First, as we saw in the calculation section, a model’s worst-case perplexity is fixed by the language’s vocabulary size. This means you can greatly lower your model’s perplexity just by, for example, switching from a word-level model (which might easily have a vocabulary size of 50,000+ words) to a character-level model (with a vocabulary size of around 26), regardless of whether the character-level model is really more accurate.”

    • Two more drawbacks:
    • https://arxiv.org/pdf/2110.12609.pdf about perplexity and the news cycle - TODO
    • “The problem is that news publications cycle through viral buzzwords quickly — just think about how often the Harlem Shake was mentioned in 2013 compared to now.”

  • https://arxiv.org/pdf/2110.12609.pdf (same link as above) - about the one million DS news benchmark
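
Since the definitions above are basically one-liners, here’s a minimal self-contained Python sketch of surprisal, entropy, and perplexity (the function names are mine, not from any of the sources); it also demonstrates the worst-case-equals-vocabulary-size point from the drawbacks quote:

```python
import math

def surprisal(p: float) -> float:
    """Surprisal of a single outcome with probability p, in bits."""
    return -math.log2(p)

def entropy(dist: list[float]) -> float:
    """Expected surprisal across every possible outcome, in bits."""
    return sum(p * surprisal(p) for p in dist if p > 0)

def perplexity(dist: list[float]) -> float:
    """Perplexity = 2 ** entropy: the effective branching factor."""
    return 2 ** entropy(dist)

# The worst case is a uniform distribution, whose perplexity equals the
# vocabulary size, hence the word-level vs. character-level caveat in
# the quote above:
print(perplexity([1 / 26] * 26))          # 26.0 (character-level vocab)
print(perplexity([1 / 50_000] * 50_000))  # ~50000.0 (word-level vocab)
```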
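And the add-one (Laplace) smoothing idea from the SLP3 chapter as one function; a toy sketch, with the toy corpus and names being mine:

```python
from collections import Counter

def laplace_unigram(counts: Counter, vocab_size: int):
    """Add-one smoothed unigram model: every word, even an unseen one,
    gets a nonzero probability, so products of probabilities never
    collapse to zero."""
    total = sum(counts.values())
    def prob(word: str) -> float:
        return (counts[word] + 1) / (total + vocab_size)
    return prob

counts = Counter("the cat sat on the mat".split())
p = laplace_unigram(counts, vocab_size=10_000)
print(p("the"))    # seen word: (2 + 1) / (6 + 10000)
print(p("zebra"))  # unseen word: (0 + 1) / (6 + 10000), small but nonzero
```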

Papers

Interesting intrinsic eval

Benchmarks

GLUE

SuperGLUE

  • https://super.gluebenchmark.com/
  • A much more detailed paper than the GLUE one!
  • More complex tasks, since models got better than humans at the easy ones
  • A goldmine of sources
  • At the end they list the excluded tasks + the task instructions given to humans! (A loading sketch follows this list.)
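
The tasks are easy to poke at via the Hugging Face datasets library; a minimal sketch, assuming the hub’s "super_glue" dataset id (the config names below are the HF ones, not from the paper):

```python
# Inspect a SuperGLUE task locally; assumes `pip install datasets`.
# Depending on the datasets version you may need trust_remote_code=True.
from datasets import load_dataset

# BoolQ is one of the eight SuperGLUE tasks; other configs include
# "cb", "copa", "multirc", "record", "rte", "wic", "wsc".
boolq = load_dataset("super_glue", "boolq")

print(boolq)              # train / validation / test splits
print(boolq["train"][0])  # one passage/question/label example
```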

Benchmarking

Finance

  • FinBERT / https://github.com/ProsusAI/finBERT (quick usage sketch after this list)
    • Has other English-language datasets
    • Discussion about cased vs. uncased models, etc.
    • Eval on sentiment analysis: classification accuracy and regression
    • Redundant content
  • NFinbert knows numbers; there are a lot of them in finance
  • “Context, language modeling and multimodal data on finance”
    • Models trained on a mix do better than on financial data alone
    • Really nice, involved, and very finance-specific; I can’t go through it now
    • Almost exclusively sentiment analysis
  • https://link.springer.com/article/10.1007/s41060-021-00285-x NER on German financial text for anonymisation
    • BERT
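
The released FinBERT model lives on the Hugging Face hub as ProsusAI/finbert, so a minimal sentiment check looks roughly like this (the example sentence is mine; assumes the transformers library):

```python
# Minimal FinBERT sentiment check; assumes `pip install transformers torch`.
from transformers import pipeline

finbert = pipeline("text-classification", model="ProsusAI/finbert")

# Labels are positive / negative / neutral.
print(finbert("Shares plunged 20% after the earnings miss."))
# e.g. [{'label': 'negative', 'score': ...}]
```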

German

Multilingual
