In the middle of the desert you can say anything you want
Conclusions should synthesize the results of your paper and separate what is significant from what is not.
Had a discussion with a friend about this, me not wanting to set up a more complex solution once because I didn’t feel like learning it but wanted to understand what I’m running - especially what I consider my core infrastructure.
So I ended up using a sub-optimal solution that I understand
Stumbled upon this bit that phrases the concept in a better way:
I would recommend gitea to anyone looking at gitlab and vice versa. These two are very similar. I think that blindly running either of them in a container just because you can is asking for trouble though. Go through the manual instillation and know how to set things up from scratch. If you can’t do that, you shouldn’t run it, because you won’t be able to fix it when things go wrong. You want a sysadmin that knows how to set these up and how to manage them, back them up, and fix problems along the way.1
Previously: 221119-2306 LM paper garden has more context about such metrics, 221204-2349 Interesting block with explanations of ML stuff has the compression angle for it.
Dumping these here for now.
The GPT21 paper puts it like this:
“Results on language modeling datasets are commonly reported in a quantity which is a scaled or ex- ponentiated version of the average negative log probability per canonical prediction unit - usually a character, a byte, or a word.”
GPT-2 (Metrics : PPL, BPB, BPC) led me to:
Evaluation Metrics for Language Modeling is really detailed.
Vaclav Kosar’s Software & Machine Learning Blog, sample: OpenAI’s DALL-E 2 and DALL-E 1 Explained. Found it originally through Bits-Per-Byte and Bits-Per-Character.
Software engineering, ML, Thinkpad P52 Disassembly - Categories. Often with nice graphics.
Close in spirit, randomness and citing-your-sources to this/my DTB but way more in depth. But the most brilliant part is the big “Ask or report a mistake” button.
I should do in-depth stuff more often.
…And resurrect my link wiki, and go back to the pre-war tradition of reading RSS feeds :(
The GPT31 paper mentioned that it’s 10x bigger than any previous non-sparse LM.
So - sparse LMs () are LMs with A LOT of params where only a subset is used for each incoming example.2
To pass a custom dockerfile, add -f custom_filename
:
docker build . -f custom.Dockerfile -t tag:latest ....
Dockerfile naming conventions exist: Dockerfile Naming Convention and Organization – mohitgoyal.co, quoting options from there:
myapp.azure.dev.Dockerfile
myapp.gcp.dev.Dockerfile
myapp.aws.dev.Dockerfile
-
Dockerfile.myapp.azure.dev
Dockerfile.myapp.i386.azure.dev
Dockerfile.myapp.amd.azure.Dev
From that article I learned that Dockerfiles don’t have to be inside build context anymore! Link: Allow Dockerfile from outside build-context by thaJeztah · Pull Request #886 · docker/cli · GitHub
TL;DR from there
$ docker build --no-cache -f $PWD/dockerfiles/Dockerfile $PWD/context
redis-cli set test 1
etc. immediately work - did it start a server in the background?
systemctl disable redis-cli
etc!redis-cli
starts in interactive mode!
fish
shell!> r
127.0.0.1:6379> multi
OK
127.0.0.1:6379> get google
QUEUED
127.0.0.1:6379> incr google_accesses
QUEUED
127.0.0.1:6379> exec
1) "http://google.com"
2) (integer) 1
127.0.0.1:6379>
help <Tab>
autocompleteshelp @hash
# Create a hashset that has field f1 w/ value v1 etc.:
127.0.0.1:6379> hmset myhash f1 v1 f2 v2
OK
127.0.0.1:6379> hgetall myhash
1) "f1"
2) "v1"
3) "f2"
4) "v2"
127.0.0.1:6379> hget myhash f1
"v1"
Operations on hashes:
# We create a hset s_google that has an url and accesses counter
127.0.0.1:6379> hset s_google url url_google accesses 0
(integer) 2
127.0.0.1:6379> hmget s_google url accesses
1) "url_google"
2) "0"
# Increase accesses by 1
127.0.0.1:6379> HINCRBY s_google accesses 1
(integer) 1
127.0.0.1:6379> hmget s_google url accesses
1) "url_google"
2) "1"
DEL key
FLUSHALL
to delete everythingcat file.txt | redis-cli --pipe
127.0.0.1:6379> zadd myss 1 'one' 2 'two'
(integer) 2
127.0.0.1:6379> ZSCORE myss 'one'
"1"
127.0.0.1:6379> ZSCORE myss 'one'
127.0.0.1:6379> get B
"https://www.wikipedia.org"
127.0.0.1:6379> get A
"http://www.openstreetmap.org"
127.0.0.1:6379> ZCARD accesses
(integer) 2
127.0.0.1:6379> ZCARD accesses
(integer) 2
127.0.0.1:6379> ZRANGE accesses 0 40
1) "A"
2) "B"
127.0.0.1:6379> ZRANGE accesses 0 40 withscores
1) "A"
2) "1"
3) "B"
4) "1"
127.0.0.1:6379>
You can comment on commits but they’re limited, comments on a merge requests give much more functionality incl. closing threads etc.!
Google scholar, in the default search interface, showed only papers written after 2016 - can’t reproduce anymore, but important to keep in mind when looking for 2011 papers.
For the paper I’m writing, I’ll actually try to do a real garden thing. With leaves etc that get updated with new info, not chronologically like my current DTB notes.
https://thegradient.pub/understanding-evaluation-metrics-for-language-models/
Closer to the end has a discussion about LM metrics and performance on downstream task:
Perplexity is the multiplicative inverse of the probability assigned to the test set by the language model, normalized by the number of words in the test set.
Perplexity limitations and ways to go around it / smoothing:
As a result, the bigram probability values of those unseen bigrams would be equal to zero making the overall probability of the sentence equal to zero and in turn perplexity to infinity. This is a limitation which can be solved using smoothing techniques.
If surprisal lets us quantify how unlikely a single outcome of a possible event is, entropy does the same thing for the event as a whole. It’s the expected value of the surprisal across every possible outcome — the sum of the surprisal of every outcome multiplied by the probability it happens
First, as we saw in the calculation section, a model’s worst-case perplexity is fixed by the language’s vocabulary size. This means you can greatly lower your model’s perplexity just by, for example, switching from a word-level model (which might easily have a vocabulary size of 50,000+ words) to a character-level model (with a vocabulary size of around 26), regardless of whether the character-level model is really more accurate.
The problem is that news publications cycle through viral buzzwords quickly — just think about how often the Harlem Shake was mentioned 2013 compared to now.
https://ruder.io/tag/natural-language-processing/index.html multilingual not-english NLP seems to be an interest of his, might be interesting in the “why” context
Best post ever: The #BenderRule: On Naming the Languages We Study and Why It Matters
Bits: