serhii.net

In the middle of the desert you can say anything you want

29 Nov 2022

DL N - SMT

Orga

  • Timeline:
    • 16.12 NER shared task
    • 18.01 Reinforcement learning ST
    • 30.01 Exam (Klausur)
  • Exam:
    • Closer to “here’s source code, add comments about shapes etc.” than to the usual open questions

Statistical machine translation (SMT)

  • Idea: translate a sentence x in the source language into a sentence y in the target language
  • Wanted: $$\hat{y} = \arg\max_y P(y|x) = \arg\max_y \frac{P(x|y)\,P(y)}{P(x)}$$
    • Basically we weight the translation probability by how likely the output sentence is in the target language; see the sketch after this list
    • P(x|y) is the translation model, P(y) is the language model (P(x) is constant w.r.t. y and can be dropped)
  • Data: two parallel corpora
  • Training:
    • Get two parallel corpora
    • Align at the word level and the phrase level
      • Mapping is sometimes:
        • one to many
        • many to one
        • many to many
        • “spurious” words like articles have no target mapping
  • SOTA SMT systems were really complex, with a lot of language-specific components etc. … then we got Neural Machine Translation, which revolutionized everything
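A minimal sketch of the noisy-channel argmax above; the candidate sentences and the TM_LOGPROB / LM_LOGPROB tables are made up purely for illustration (a real system would score candidates with actual translation and language models):

```python
import math

# Toy log-probability tables, made up purely for illustration.
# Translation model: log P(x | y) for a fixed source sentence x.
TM_LOGPROB = {
    "the house is small": math.log(0.30),
    "the house is little": math.log(0.25),
    "small is the house": math.log(0.28),
}
# Language model: log P(y) for each candidate target sentence y.
LM_LOGPROB = {
    "the house is small": math.log(0.020),
    "the house is little": math.log(0.010),
    "small is the house": math.log(0.001),
}

def smt_decode(candidates):
    """argmax_y log P(x|y) + log P(y); P(x) is constant and can be dropped."""
    return max(candidates, key=lambda y: TM_LOGPROB[y] + LM_LOGPROB[y])

print(smt_decode(list(TM_LOGPROB)))  # -> "the house is small"
```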

seq2seq models

Basics

  • Input: Encoder RNN -> feature vector
  • Output: Decoder RNN (generates a sentence in the target language based on the encoding of the input sentence)
  • Inference: predicted tokens become the input for the next prediction
  • So, first the decoder gets the encoder output + “B” for BOS, then it generates the other words and “E” at the end.
  • “seq2seq” is basically everything where input and output are sequences:
    • Summarization
    • Dialogue systems
    • etc.
  • Sl. 310 with formula and loss for seq2seq training
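For reference, the standard seq2seq training objective (presumably what slide 310 shows): minimize the average per-token negative log-likelihood of the target sentence given the source,

$$J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\log P_\theta(y_t \mid y_1,\dots,y_{t-1}, x)$$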

Decoding

  • D/ Greedy decoding
    • at each step we predict the most probable word, BUT then we can’t fix earlier errors!
  • Exhaustive decoding
    • $$P(y|x) = P(y_1|x)\,P(y_2\mid y_1,x)\cdots P(y_T\mid y_1,\dots,y_{T-1},x)$$
    • -> Beam search decoding
  • D/ Beam search decoding

  • TODO formulas

  • Idea: at each step, the k best partial hypotheses are kept and expanded

    • k=1 -> Greedy Decoding
    • k=|V|^T -> Exhaustive decoding
  • Scores are calculated in log-space

    • Then we can sum instead of multiplying probabilities
    • $$\log P(y_1,\dots,y_t\mid x) = \sum_{i=1}^{t}\log P(y_i\mid y_1,\dots,y_{i-1},x)$$
  • Beam search is not necessarily optimal, but much more efficient than exhaustive search.

Example: TODO Sl. 316

<B> -> he (-0.7), I (-0.9), …
he -> hit (-1.7), struck (-2.9), …

We sum the log-probabilities at each step (higher, i.e. closer to zero, is better) and always keep the k=2 best hypotheses at each level of the tree.

A hypothesis is finished once it generates <E>; among the finished hypotheses we pick the one with the highest score and backtrack to the beginning to read off the final translation.

IMPORTANT/ Normalization: without it, shorter sequences would get higher scores simply because fewer (negative) log-probabilities are summed, so we normalize by the number of words, i.e. divide the score by t.
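Written out, the length-normalized score is

$$\frac{1}{t}\sum_{i=1}^{t}\log P(y_i \mid y_1,\dots,y_{i-1}, x)$$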

Generally, beam search is a generic search technique that can be used outside of MT as well.
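A self-contained toy sketch of beam search with length normalization; the next_token_logprobs table, the vocabulary and k=2 are made up for illustration - a real decoder would query the seq2seq model at each step instead:

```python
import math

BOS, EOS = "<B>", "<E>"

# Toy "model": next-token log-probabilities given the prefix (purely illustrative).
def next_token_logprobs(prefix):
    table = {
        (BOS,): {"he": math.log(0.5), "I": math.log(0.4), EOS: math.log(0.1)},
        (BOS, "he"): {"hit": math.log(0.6), "struck": math.log(0.2), EOS: math.log(0.2)},
        (BOS, "I"): {"was": math.log(0.7), EOS: math.log(0.3)},
    }
    # Default for unknown prefixes: just end the sentence.
    return table.get(tuple(prefix), {EOS: 0.0})

def beam_search(k=2, max_len=10):
    beams = [([BOS], 0.0)]          # (token sequence, summed log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_token_logprobs(seq).items():
                new = (seq + [tok], score + lp)
                # Completed hypotheses leave the beam.
                (finished if tok == EOS else candidates).append(new)
        if not candidates:
            break
        # Keep only the k best partial hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    # Length-normalize: divide by the number of generated tokens (excluding <B>).
    return max(finished, key=lambda c: c[1] / (len(c[0]) - 1))

print(beam_search())  # -> (['<B>', 'he', 'hit', '<E>'], ~-1.20)
```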

Neural Machine Translation - Cont’d

  • Advantages of NMT:
    • High quality
    • More fluent formulations
    • More use of context
    • Intuitive
    • No subcomponents, end-to-end optimizable
    • No manual effort, no feature engineering etc.
  • Disadvantages:
    • Hard to debug/interpret
      • as always with DL
    • Hard to regulate
      • Safety
      • Guidelines/rules are costly to integrate

MT - Evaluation

  • Agreement with human translations
  • BLEU
  • ROUGE
  • etc.
  • A lot of problems: different but equally valid formulations score worse, etc.
  • There’s an F1 score combining both:
  • $$F_1 = \frac{2 \cdot \mathrm{BLEU} \cdot \mathrm{ROUGE}}{\mathrm{BLEU} + \mathrm{ROUGE}}$$
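A toy sketch of that combination, using unigram precision as a stand-in for BLEU and unigram recall as a stand-in for ROUGE (real BLEU uses n-grams up to 4 plus a brevity penalty, and ROUGE has several variants; the sentences here are made up):

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Number of candidate tokens that also appear in the reference (clipped counts)."""
    cand, ref = Counter(candidate), Counter(reference)
    return sum(min(n, ref[tok]) for tok, n in cand.items())

def f1_bleu_rouge(candidate, reference):
    overlap = unigram_overlap(candidate, reference)
    bleu_like = overlap / len(candidate)    # precision-oriented, like BLEU-1
    rouge_like = overlap / len(reference)   # recall-oriented, like ROUGE-1
    return 2 * bleu_like * rouge_like / (bleu_like + rouge_like)

ref = "the cat sits on the mat".split()
hyp = "the cat sat on the mat".split()
print(f1_bleu_rouge(hyp, ref))  # 5 of 6 tokens overlap -> ~0.83
```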

MT - Challenges

  • OOV - unknown words

    • German…
  • Translations in specialized domains

  • Consistent context in longer texts

    • Connection between sentences etc. for longer documents
  • Language pairs with few resources

    • English is used as intermediate for those cases - errors stack but better than nothing
  • Bias in training data -> she works as a nurse, he works as a programmer.

  • “Role of bushes in international politics”

Attention is all you need

Attention model

  • For each decoder step we calculate attention over the encoder steps, which selects the most relevant input positions for it.
  • Sl. 330+
  • For each decoder timestep we get one scalar attention weight per encoder timestep; through the softmax they sum up to 1
  • Cost grows quadratically as the number of timesteps increases, but the computation can be parallelized (RNNs can’t)
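A minimal numpy sketch of attention for one decoder step; the dimensions are made up, and Luong-style dot-product scoring is assumed here (the lecture, Sl. 330+, may use a different scoring variant):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T_enc, d = 5, 8                                # 5 encoder timesteps, hidden size 8 (made up)
encoder_states = rng.normal(size=(T_enc, d))   # h_1 .. h_T
decoder_state = rng.normal(size=(d,))          # s_t for the current decoder step

scores = encoder_states @ decoder_state        # one scalar score per encoder timestep
weights = softmax(scores)                      # attention weights, sum to 1
context = weights @ encoder_states             # weighted sum of encoder states

print(weights, weights.sum())                  # weights.sum() ~= 1.0
print(context.shape)                           # (8,)
```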

TODO

Visualization

  • NLP - for each output word, which input words it attends to
  • CV - “a woman is throwing a frisbee in a park.” and we get the important bits highlighted in the picture

Bias

  • “MAN sitting at computer” -> attention goes to computer -> gender bias!
  • MAN holding racket -> racket -> right answer for the wrong reason
  • “Where does the model look when deciding this?”