DL N - SMT
Organization
- Timeline:
- 16.12 NER shared task
- 18.01 Reinforcement learning shared task
- 30.01 Exam
- Exam:
- Closer to "given source code, add comments about shapes etc." than to the usual open questions
Statistical machine translation (SMT)
- Idea: translate a sentence x in the source language into a sentence y in the target language
- We want: $$\hat{y} = \operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y \frac{P(x \mid y)\,P(y)}{P(x)} = \operatorname{argmax}_y P(x \mid y)\,P(y)$$ (P(x) does not depend on y, so it can be dropped)
- Basically, we score each candidate translation y by how well it explains the source sentence, weighted by how probable y is as a sentence in the target language
- P(x|y) is the translation model, P(y) is the language model
- Data: two parallel corpora
- Training:
- Get two parallel corpora
- Align at the word level and phrase level
- Mapping is sometimes:
- one to many
- many to one
- many to many
- “spurious” words like articles have no target mapping
- State-of-the-art SMT systems were really complex, with a lot of language-specific components etc. … then Neural Machine Translation came along and revolutionized everything
seq2seq models
Basics
- Input: Encoder RNN -> feature vector
- Output: Decoder RNN (generates the sentence in the target language based on the encoding of the input sentence)
- Inference: predicted tokens become the input for the next prediction
- So, first the decoder gets the encoder output plus a <B> (beginning-of-sequence) token, then it generates the remaining words one after another and finally <E> at the end.
- "seq2seq" covers basically everything where input and output are sequences:
- Summarization
- Dialogue systems
- etc.
- Sl. 310 with formula and loss for seq2seq training
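- Presumably the standard per-token cross-entropy (negative log-likelihood) under teacher forcing, i.e. roughly (my guess at the slide's content, not copied from it):
- $$J(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \log P_\theta\left(y_t^{*} \mid y_1^{*},\dots,y_{t-1}^{*}, x\right)$$ where $y_1^{*},\dots,y_T^{*}$ is the reference translation for the source sentence x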
Decoding
- D/ Greedy decoding
- At each step we predict the most probable word, BUT then we can't fix earlier errors! (see the sketch after this list)
- Exhaustive decoding
- $$P(y \mid x) = P(y_1 \mid x)\,P(y_2 \mid y_1, x)\cdots P(y_T \mid y_1,\dots,y_{T-1}, x) = \prod_{t=1}^{T} P(y_t \mid y_1,\dots,y_{t-1}, x)$$
- Considering all possible sequences is far too expensive -> Beam search decoding
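A minimal sketch of the greedy loop, assuming a hypothetical `step(prefix, x)` function that returns a probability distribution (list of floats) over the vocabulary; the lecture's actual model interface isn't in these notes:

```python
# Minimal greedy decoding sketch; `step`, `bos_id`, `eos_id` are
# hypothetical stand-ins for the real model interface.
def greedy_decode(step, x, bos_id, eos_id, max_len=50):
    y = [bos_id]
    for _ in range(max_len):
        probs = step(y, x)  # distribution over the vocabulary
        next_id = max(range(len(probs)), key=probs.__getitem__)
        y.append(next_id)   # commit to the argmax: earlier choices can never be revised
        if next_id == eos_id:
            break
    return y[1:]
```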
Beam search
- D/ Beam search decoding
- TODO formulas
- Idea: at each step, only the k best partial hypotheses are kept and pursued further
- k=1 -> Greedy Decoding
- k=|V|^T -> Exhaustive decoding
- Scores are calculated in log-space
- Then we can sum instead of multiplying probabilities
- $$\log P(y_1,\dots,y_t \mid x) = \sum_{i=1}^{t} \log P(y_i \mid y_1,\dots,y_{i-1}, x)$$
- Beam search is not necessarily optimal, but much more efficient than exhaustive search.
- Example: TODO Sl. 316
- From <B>, the two best first words might be "he" (-0.7) and "I" (-0.9); from "he", e.g. "hit" (-1.7) and "struck" (-2.9), and so on.
- We add up the log-probabilities at each step (higher, i.e. closer to 0, is better) and always keep the k=2 best hypotheses at each level of the tree.
- Hypotheses end once they generate <E>; among the finished hypotheses we pick the one with the highest score and backtrack to the beginning to read off the final translation.
- IMPORTANT/ Normalization: longer hypotheses accumulate more (negative) log terms, so shorter sequences would automatically score higher - we therefore normalize by the number of words, i.e. divide the score by t: $$\frac{1}{t}\sum_{i=1}^{t} \log P(y_i \mid y_1,\dots,y_{i-1}, x)$$
- In general, beam search is a generic search strategy that can also be used for other search problems, not just MT decoding.
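A rough sketch of the procedure with log-space scores and length normalization; `step(prefix, x)` is again a hypothetical model interface returning per-token probabilities, not the lecture's actual code:

```python
import math

# Rough beam search sketch (not the lecture's implementation).
# `step(prefix, x)` is a hypothetical model interface that returns a
# probability distribution (list of floats) over the vocabulary.
def beam_search(step, x, bos_id, eos_id, k=2, max_len=50):
    beams = [([bos_id], 0.0)]                 # (token ids, sum of log-probs)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            probs = step(prefix, x)
            # expand each hypothesis by its k most probable next tokens
            top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
            for tok in top:
                candidates.append((prefix + [tok], score + math.log(probs[tok])))
        # keep only the k best hypotheses overall (log-scores: closer to 0 is better)
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:k]:
            if prefix[-1] == eos_id:          # hypothesis ends once it generates <E>
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished = finished or beams
    # length normalization: divide each score by the number of generated tokens
    best, _ = max(finished, key=lambda c: c[1] / (len(c[0]) - 1))
    return best[1:]
```

With k=1 this reduces to greedy decoding; keeping every hypothesis instead of only the top k would turn it into exhaustive search.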
Neural Machine Translation - Cont’d
- Advantages of NMT:
- High quality
- More fluent formulations
- More use of context
- Intuitive: no subcomponents, end-to-end optimizable, no manual effort, no feature engineering, etc.
- Disadvantages:
- Hard to debug/interpret (as always with DL)
- Hard to regulate
- Safety
- Guidelines/rules are costly to integrate
MT - Evaluation
- Agreement with human reference translations
- BLEU
- ROUGE
- etc.
- A lot of problems: valid alternative formulations score worse, etc.
- The two can be combined into an F1 score:
- $$F_1 = \frac{2 \cdot \mathrm{BLEU} \cdot \mathrm{ROUGE}}{\mathrm{BLEU} + \mathrm{ROUGE}}$$
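A toy illustration of the precision/recall intuition behind the two metrics (unigram overlap only; real BLEU uses n-grams and a brevity penalty, and ROUGE has several variants), combined with the F1 formula above:

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Clipped unigram overlap between candidate and reference tokens."""
    cand, ref = Counter(candidate), Counter(reference)
    return sum(min(cand[w], ref[w]) for w in cand)

def toy_bleu_rouge_f1(candidate, reference):
    overlap = unigram_overlap(candidate, reference)
    precision = overlap / len(candidate)  # BLEU-like: how much of the output is in the reference
    recall = overlap / len(reference)     # ROUGE-like: how much of the reference is covered
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(toy_bleu_rouge_f1("the cat sat on the mat".split(),
                        "the cat is on the mat".split()))
```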
MT - Challenges
- OOV - unknown words
- German…
- Translations in specialized domains
- Consistent context in longer texts
- Connections between sentences etc. for longer documents
- Language pairs with few resources
- English is used as an intermediate (pivot) language in those cases - errors stack up, but it's better than nothing
- Bias in training data -> "she works as a nurse, he works as a programmer"
- "Role of bushes in international politics"
Attention is all you need
Attention model
- We calculate attention for each decoder step, which selects the most relevant encoder steps for it.
- Sl. 330+
- For each decoder timestep, we get one scalar attention weight per encoder timestep; through the softmax they sum up to 1
- The cost increases quadratically as the number of timesteps grows, but attention can be parallelized (RNNs can't) - see the sketch below
TODO
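A minimal numpy sketch of dot-product attention over encoder states for one decoder step; the shapes and the dot-product scoring function are assumptions, since the notes don't specify the exact variant from the slides:

```python
import numpy as np

def attention_step(decoder_state, encoder_states):
    """One decoder step of dot-product attention.

    decoder_state:  (d,)        current decoder hidden state
    encoder_states: (T_src, d)  one hidden state per source timestep
    """
    scores = encoder_states @ decoder_state  # (T_src,) one scalar per source step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax: weights sum to 1
    context = weights @ encoder_states       # (d,) weighted sum of encoder states
    return context, weights

enc = np.random.randn(5, 8)    # 5 source timesteps, hidden size 8
dec = np.random.randn(8)
context, weights = attention_step(dec, enc)
print(weights, weights.sum())  # attention distribution over the 5 source steps
```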
Visualization
- NLP: attention weights over the source words for each generated word
- CV: for a caption like "a woman is throwing a frisbee in a park.", attention highlights the important regions of the picture
Bias
- “MAN sitting at computer” -> attention goes to computer -> gender bias!
- MAN holding racket -> attention goes to the racket -> right answer for the wrong reason
- "Where does the model look when deciding this?"