serhii.net

In the middle of the desert you can say anything you want

29 Nov 2022

DL N - SMT

Orga

  • Timeline:
    • 16.12 NER shared task
    • 18.01 Reinforcement learning ST
    • 30.01 Exam (Klausur)
  • Exam:
    • Closer to “here’s source code, add comments about shapes etc.” than to the usual open questions

Statistical machine translation (SMT)

  • Idea: translate a sentence x in the source language into a sentence y in the target language
  • Wanted: $$\hat{y} = \arg\max_y P(y|x) = \arg\max_y \frac{P(x|y)\,P(y)}{P(x)}$$
    • Basically we weight the translation probability by how likely the output sentence is in the target language; see the sketch after this list
    • P(x|y) is the translation model, P(y) is the language model (P(x) is constant w.r.t. y and can be dropped)
  • Data: two parallel corpora
  • Training:
    • Get two parallel corpora
    • Align at the word level and the phrase level
      • Mapping is sometimes:
        • one to many
        • many to one
        • many to many
        • “spurious” words like articles have no target mapping
  • SOTA SMT systems were really complex, with a lot of language-specific components etc. … then we got Neural Machine Translation, which revolutionized everything
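A minimal sketch of the noisy-channel argmax above; the candidate sentences and the TM_LOGPROB / LM_LOGPROB tables are made up purely for illustration (a real system would score candidates with actual translation and language models):

```python
import math

# Toy log-probability tables, made up purely for illustration.
# Translation model: log P(x | y) for a fixed source sentence x.
TM_LOGPROB = {
    "the house is small": math.log(0.30),
    "the house is little": math.log(0.25),
    "small is the house": math.log(0.28),
}
# Language model: log P(y) for each candidate target sentence y.
LM_LOGPROB = {
    "the house is small": math.log(0.020),
    "the house is little": math.log(0.010),
    "small is the house": math.log(0.001),
}

def smt_decode(candidates):
    """argmax_y log P(x|y) + log P(y); P(x) is constant and can be dropped."""
    return max(candidates, key=lambda y: TM_LOGPROB[y] + LM_LOGPROB[y])

print(smt_decode(list(TM_LOGPROB)))  # -> "the house is small"
```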

seq2seq models

Basics

  • Input: Encoder RNN -> feature vector
  • Output: Decoder RNN (generates a sentence in the target language based on the encoding of the input sentence)
  • Inference: predicted tokens become the input for the next prediction
  • So, first the decoder gets the encoder output + “B” for BOS, then it generates the other words and “E” at the end.
  • “seq2seq” is basically everything where input and output are sequences:
    • Summarization
    • Dialogue systems
    • etc.
  • Sl. 310 with formula and loss for seq2seq training
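For reference, the standard seq2seq training objective (presumably what slide 310 shows): minimize the average per-token negative log-likelihood of the target sentence given the source,

$$J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\log P_\theta(y_t \mid y_1,\dots,y_{t-1}, x)$$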

Decoding

  • D/ Greedy decoding
    • at each step we predict the most probable word, BUT then we can’t fix earlier errors!
  • Exhaustive decoding
    • $$P(y|x) = P(y_1|x)\,P(y_2\mid y_1,x)\cdots P(y_T\mid y_1,\dots,y_{T-1},x)$$
    • -> Beam search decoding
  • D/ Beam search decoding

  • TODO formulas

  • Idea: at each step, the k best partial hypotheses are kept and expanded

    • k=1 -> Greedy Decoding
    • k=|V|^T -> Exhaustive decoding
  • Scores are calculated in log-space

    • Then we can sum instead of multiplying probabilities
    • $$\log P(y_1,\dots,y_t\mid x) = \sum_{i=1}^{t}\log P(y_i\mid y_1,\dots,y_{i-1},x)$$
  • Beam search is not necessarily optimal, but much more efficient than exhaustive search.

Example: TODO Sl. 316

<B> -> he (-0.7), I (-0.9), …
he -> hit (-1.7), struck (-2.9), …

We sum the log-probabilities at each step (higher, i.e. closer to zero, is better) and always keep the k=2 best hypotheses at each level of the tree.

A hypothesis is finished once it generates <E>; among the finished hypotheses we pick the one with the highest score and backtrack to the beginning to read off the final translation.

IMPORTANT/ Normalization: without it, shorter sequences would get higher scores simply because fewer (negative) log-probabilities are summed, so we normalize by the number of words, i.e. divide the score by t.
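Written out, the length-normalized score is

$$\frac{1}{t}\sum_{i=1}^{t}\log P(y_i \mid y_1,\dots,y_{i-1}, x)$$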

Generally, beam search is a generic search technique that can be used outside of MT as well.
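A self-contained toy sketch of beam search with length normalization; the next_token_logprobs table, the vocabulary and k=2 are made up for illustration - a real decoder would query the seq2seq model at each step instead:

```python
import math

BOS, EOS = "<B>", "<E>"

# Toy "model": next-token log-probabilities given the prefix (purely illustrative).
def next_token_logprobs(prefix):
    table = {
        (BOS,): {"he": math.log(0.5), "I": math.log(0.4), EOS: math.log(0.1)},
        (BOS, "he"): {"hit": math.log(0.6), "struck": math.log(0.2), EOS: math.log(0.2)},
        (BOS, "I"): {"was": math.log(0.7), EOS: math.log(0.3)},
    }
    # Default for unknown prefixes: just end the sentence.
    return table.get(tuple(prefix), {EOS: 0.0})

def beam_search(k=2, max_len=10):
    beams = [([BOS], 0.0)]          # (token sequence, summed log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_token_logprobs(seq).items():
                new = (seq + [tok], score + lp)
                # Completed hypotheses leave the beam.
                (finished if tok == EOS else candidates).append(new)
        if not candidates:
            break
        # Keep only the k best partial hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    # Length-normalize: divide by the number of generated tokens (excluding <B>).
    return max(finished, key=lambda c: c[1] / (len(c[0]) - 1))

print(beam_search())  # -> (['<B>', 'he', 'hit', '<E>'], ~-1.20)
```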

Neural Machine Translation - Cont’d

  • Advantages of NMT:
    • High quality
    • More fluent formulations
    • More use of context
    • Intuitive
    • No subcomponents, end-to-end optimizable
    • No manual effort, no feature engineering etc.
  • Disadvantages:
    • Hard to debug/interpret
      • as always with DL
    • Hard to regulate
      • Safety
      • Guidelines/rules are costly to integrate

MT - Evaluation

  • Agreement with human translations
  • BLEU
  • ROUGE
  • etc.
  • A lot of problems: different but equally valid formulations score worse, etc.
  • There’s an F1 score combining both:
  • $$F_1 = \frac{2 \cdot \mathrm{BLEU} \cdot \mathrm{ROUGE}}{\mathrm{BLEU} + \mathrm{ROUGE}}$$
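A toy sketch of that combination, using unigram precision as a stand-in for BLEU and unigram recall as a stand-in for ROUGE (real BLEU uses n-grams up to 4 plus a brevity penalty, and ROUGE has several variants; the sentences here are made up):

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Number of candidate tokens that also appear in the reference (clipped counts)."""
    cand, ref = Counter(candidate), Counter(reference)
    return sum(min(n, ref[tok]) for tok, n in cand.items())

def f1_bleu_rouge(candidate, reference):
    overlap = unigram_overlap(candidate, reference)
    bleu_like = overlap / len(candidate)    # precision-oriented, like BLEU-1
    rouge_like = overlap / len(reference)   # recall-oriented, like ROUGE-1
    return 2 * bleu_like * rouge_like / (bleu_like + rouge_like)

ref = "the cat sits on the mat".split()
hyp = "the cat sat on the mat".split()
print(f1_bleu_rouge(hyp, ref))  # 5 of 6 tokens overlap -> ~0.83
```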

MT - Challenges

  • OOV - unknown words

    • German…
  • Translations in specialized domains

  • Consistent context in longer texts

    • Connection between sentences etc. for longer documents
  • Language pairs with few resources

    • English is used as intermediate for those cases - errors stack but better than nothing
  • Bias in training data -> she works as a nurse, he works as a programmer.

  • “Role of bushes in international politics”

Attention is all you need

Attention model

  • For each decoder step we calculate attention over the encoder steps, which selects the most relevant input positions for it.
  • Sl. 330+
  • For each decoder timestep we get one scalar attention weight per encoder timestep; through the softmax they sum up to 1
  • Cost grows quadratically as the number of timesteps increases, but the computation can be parallelized (RNNs can’t)
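A minimal numpy sketch of attention for one decoder step; the dimensions are made up, and Luong-style dot-product scoring is assumed here (the lecture, Sl. 330+, may use a different scoring variant):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T_enc, d = 5, 8                                # 5 encoder timesteps, hidden size 8 (made up)
encoder_states = rng.normal(size=(T_enc, d))   # h_1 .. h_T
decoder_state = rng.normal(size=(d,))          # s_t for the current decoder step

scores = encoder_states @ decoder_state        # one scalar score per encoder timestep
weights = softmax(scores)                      # attention weights, sum to 1
context = weights @ encoder_states             # weighted sum of encoder states

print(weights, weights.sum())                  # weights.sum() ~= 1.0
print(context.shape)                           # (8,)
```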

TODO

Visualization

  • NLP - for each output word, which input words it attends to
  • CV - “a woman is throwing a frisbee in a park.” and we get the important bits highlighted in the picture

Bias

  • “MAN sitting at computer” -> attention goes to computer -> gender bias!
  • MAN holding racket -> racket -> right answer for the wrong reason
  • “Where does the model look when deciding this?”