LM paper notes
For the paper I’m writing, I’ll actually try to do a real garden thing: leaves etc. that get updated with new info, not chronological like my current DTB notes.
Perplexity and intrinsic eval
Closer to the end has a discussion about LM metrics and performance on downstream tasks:
- Intrinsic evaluation
Perplexity is the multiplicative inverse of the probability assigned to the test set by the language model, normalized by the number of words in the test set.
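The definition above, written out as a formula. This is the standard n-gram textbook form; the symbols $W = w_1 \dots w_N$ (the test set) and $N$ (its length in words) are labels assumed here, not taken from the quote.

```latex
\mathrm{PP}(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}
             = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}
```

A higher probability for the test set means a lower perplexity, and the $N$-th root is the per-word normalization.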
Perplexity limitations and ways to get around them / smoothing:
As a result, the bigram probability values of those unseen bigrams would be equal to zero making the overall probability of the sentence equal to zero and in turn perplexity to infinity. This is a limitation which can be solved using smoothing techniques.
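A minimal sketch of that zero-probability problem, assuming a toy corpus and add-one (Laplace) smoothing as the fix. The corpus, the function names and the choice of add-k smoothing are illustrative assumptions, not taken from the quoted source.

```python
import math
from collections import Counter

# Toy data, made up for illustration.
train = "the cat sat on the mat".split()
test = "the cat on the mat".split()  # contains the unseen bigram ("cat", "on")

vocab = set(train)
bigram_counts = Counter(zip(train, train[1:]))
context_counts = Counter(train[:-1])  # how often each word starts a bigram


def bigram_prob(prev, word, k=0.0):
    """P(word | prev) with add-k smoothing; k=0 is the unsmoothed estimate."""
    numerator = bigram_counts[(prev, word)] + k
    denominator = context_counts[prev] + k * len(vocab)
    return numerator / denominator if denominator > 0 else 0.0


def perplexity(tokens, k=0.0):
    """Perplexity over the bigrams of `tokens` (the first word is not scored)."""
    log_prob = 0.0
    n_bigrams = 0
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_prob(prev, word, k)
        if p == 0.0:
            # One zero makes the whole product zero, so perplexity is infinite.
            return math.inf
        log_prob += math.log(p)
        n_bigrams += 1
    return math.exp(-log_prob / n_bigrams)


unsmoothed = perplexity(test)       # infinite: ("cat", "on") was never seen
smoothed = perplexity(test, k=1.0)  # finite: every bigram gets some mass
print(unsmoothed, smoothed)
```

Add-one is the simplest option and tends to move too much probability mass to unseen events; it is used here only because it is short.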
- The source quoted above cites http://web.stanford.edu/~jurafsky/slp3/3.pdf, which is longer and so much better!
- Full link: http://web.stanford.edu/~jurafsky/slp3/
- p. 37: the test set needs to have enough statistical power to measure improvements
- Chapter 3 about Shakespeare vs WSJ and genre
- 42: Smoothing
- Unknown words, so we don’t multiply by zero probabilities
- 7 / 130 really nice basics of ml
- Another take on the same, but love it
- Links the RoBERTa paper about the connection between perplexity and downstream tasks!
If surprisal lets us quantify how unlikely a single outcome of a possible event is, entropy does the same thing for the event as a whole. It’s the expected value of the surprisal across every possible outcome — the sum of the surprisal of every outcome multiplied by the probability it happens
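A small sketch of the quoted definition: entropy as the probability-weighted sum of surprisals. The two coin distributions and the function names are made-up examples; log base 2 is assumed, so the results are in bits.

```python
import math


def surprisal(p):
    """Surprisal of a single outcome with probability p, in bits."""
    return -math.log2(p)


def entropy(dist):
    """Expected surprisal over all outcomes of a distribution, in bits."""
    return sum(p * surprisal(p) for p in dist.values() if p > 0)


fair_coin = {"heads": 0.5, "tails": 0.5}
biased_coin = {"heads": 0.9, "tails": 0.1}

fair_entropy = entropy(fair_coin)      # each outcome has surprisal 1 bit
biased_entropy = entropy(biased_coin)  # lower: the outcome is more predictable
print(fair_entropy, biased_entropy)
```

The rare outcome of the biased coin has a high surprisal, but it happens so seldom that the expected value still comes out below the fair coin's.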
- Excellent about the drawbacks of perplexity:
First, as we saw in the calculation section, a model’s worst-case perplexity is fixed by the language’s vocabulary size. This means you can greatly lower your model’s perplexity just by, for example, switching from a word-level model (which might easily have a vocabulary size of 50,000+ words) to a character-level model (with a vocabulary size of around 26), regardless of whether the character-level model is really more accurate.
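A quick check of the vocabulary-size point, reading "worst case" as a model that knows nothing and spreads probability uniformly over the vocabulary (that reading is an assumption). The vocabulary sizes are the ones from the quote; the function name and the token count are arbitrary.

```python
import math


def uniform_perplexity(vocab_size, n_tokens=1000):
    """Perplexity of a model assigning 1/vocab_size to every token."""
    p = 1.0 / vocab_size
    log_prob = n_tokens * math.log(p)  # log-probability of the whole test set
    return math.exp(-log_prob / n_tokens)


word_level = uniform_perplexity(50_000)  # approximately 50000
char_level = uniform_perplexity(26)      # approximately 26
print(word_level, char_level)
```

Neither model has learned anything, yet the character-level one reports a far lower number, so perplexities are only comparable between models that share the same vocabulary and tokenization.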
- Two more
- https://arxiv.org/pdf/2110.12609.pdf about perplexity and the news cycle (TODO)
The problem is that news publications cycle through viral buzzwords quickly — just think about how often the Harlem Shake was mentioned 2013 compared to now.
- https://arxiv.org/pdf/2110.12609.pdf - about the One Billion Word news benchmark dataset
- [1906.03591] A Survey on Neural Network Language Models
- Sources for basics about NLP and LM
- Sources for perplexity and problems with it
- Preprint but cited a lot
- A Survey of the Usages of Deep Learning for Natural Language Processing | IEEE Journals & Magazine | IEEE Xplore
- Much more respectable-looking paper about DL in NLP specifically, and about NLP in general
- A Comprehensive Survey on Word Representation Models: From Classical to State-of-the-Art Word Representation Language Models | ACM Transactions on Asian and Low-Resource Language Information Processing
- Really nice intro
- My dump with links about the statistics and relationships between ppl, bpb, bpc, cross-entropy: [[garden/it/221205-0009 Metrics for LM evaluation like perplexity, BPC, BPB]]
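A rough sketch of how these metrics relate, assuming the model's total negative log-likelihood over a text is given in nats. All function names and the example numbers are hypothetical; the conversions themselves are the standard ones (perplexity is the exponentiated cross-entropy, and BPC/BPB are the total bits divided by the character or byte count).

```python
import math


def nats_to_bits(nats):
    """Convert a log-loss from nats (natural log) to bits (log base 2)."""
    return nats / math.log(2)


def cross_entropy(total_nll_nats, n_units):
    """Average negative log-likelihood per unit (token, char or byte), in nats."""
    return total_nll_nats / n_units


def perplexity(ce_nats):
    """Perplexity from a per-unit cross-entropy in nats."""
    return math.exp(ce_nats)


def bits_per_unit(total_nll_nats, n_units):
    """BPC if n_units counts characters, BPB if it counts bytes."""
    return nats_to_bits(total_nll_nats) / n_units


def word_ppl_from_bpc(bpc, avg_chars_per_word):
    """Word-level perplexity implied by a BPC score.

    Assumes avg_chars_per_word counts every character the model predicts,
    including the separating space.
    """
    return 2.0 ** (bpc * avg_chars_per_word)


# Made-up numbers: 1000 nats of total loss over 200 tokens / 1000 characters.
total_nll = 1000.0
token_ce = cross_entropy(total_nll, 200)  # 5.0 nats per token
token_ppl = perplexity(token_ce)          # e ** 5
bpc = bits_per_unit(total_nll, 1000)      # 1 / ln(2) bits per character
print(token_ce, token_ppl, bpc)
```

The total loss of the text is the same in every case; the metrics differ only in the unit used for normalization and in the log base.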
Interesting intrinsic eval
- [2211.09110] Holistic Evaluation of Language Models is recent and a preprint, but covers a lot of different things and is a goldmine for citations
- GLUE: https://aclanthology.org/W18-5446/
- https://gluebenchmark.com/diagnostics long and detailed
- Excellent: https://mccormickml.com/2019/11/05/GLUE/
- Tasks overview: https://docs.google.com/spreadsheets/d/1BrOdjJgky7FfeiwC_VDURZuRPUFUAz_jfczPPT35P00/htmlview
- Much more detailed paper than the GLUE one!
- More complex tasks, since models are better than people at the easy ones
- Goldmine of sources
- At the end they list the excluded tasks + instructions from the tasks for humans!
- https://ruder.io/nlp-benchmarking/ nice in general, link to competitions and journals
- Good point about latest models having SUPERFICIAL knowledge and tests needed for this
- Basically how to benchmark
- RoBERTa paper lists three: GLUE, SQuAD & RACE
- https://slds-lmu.github.io/seminar_nlp_ss20/resources-and-benchmarks-for-nlp.html really detailed list
- XTREME-R below
- FinBERT / https://github.com/ProsusAI/finBERT
- Has another English-language dataset
- Discussion about cased etc
- Eval on sentiment analysis, accuracy regression
- Redundant content
- NFinBERT knows numbers; there are a lot of them in finance
- “Context, language modeling and multimodal data on finance”
- Models trained on a mix do better than ones trained on financial data alone
- Really nice and involved and financial and I can’t go through it now
- Almost exclusively sentiment analysis
- https://link.springer.com/article/10.1007/s41060-021-00285-x NER on German financial text for anonymisation
- https://www.deepset.ai/german-bert - German BERT
- LINKS TO DOWNSTREAM TASKS
- https://www.deepset.ai/datasets SQuAD-style dataset for German
- https://huggingface.co/fabianrausch German financial datasets
https://ruder.io/tag/natural-language-processing/index.html multilingual / non-English NLP seems to be an interest of his, might be interesting in the “why” context
- Bender rule and English not as default
- Linguistically Naïve != Language Independent: Why NLP Needs Linguistic Typology - ACL Anthology
- She has a lot about this: Emily M. Bender - ACL Anthology, not just there
- Resonates with me immensely, and I think the time has come for more Ukrainian-language datasets?
- Do we Name the Languages we Study? The #BenderRule in LREC and ACL articles - Inria - Institut national de recherche en sciences et technologies du numérique