
25 Oct 2022

DL 3: CNN

Convolutional neural networks

  • Motivation: if we just connect everything to the next layer, we:
    • get a lot of parameters
    • lose the position information

Convolutions

  • The convolution is applied once per color channel, using a sliding window to cover the whole image
  • Local features are determined by filters/kernels
  • TODO /D Filter Sl. 120 (?)
  • TODO Parameter sharing - the same filter is used for all patches
  • With each convolution layer, the filters become more complex and detect more complex features
  • Output shape
    • Influenced by filter size, padding, stride, etc.
    • There are online calculators - link in sl.122 - “convnet calculator”
  • TODO The filter count (Filterzahl) determines the number of output channels:
      • Input: 3 RGB channels; using 8 filters gives 8 channels as output
  • We have four parameters: filter size, filter count, stride, and padding
  • TODO Terminology:
    • Schrittweite == stride
    • Filter == kernel (?)
    • Filterzahl == filter count
  • Parameters:
    • Padding:
      • “valid” == only complete kernels == no padding
      • Zero-padding:
        • “same” - a border of zeros around the input, so the output size stays the same
    • Schrittweite / stride
      • Stride < filter size -> overlapping convolutions
      • TODO S125
  • Dilated convolutions
    • Bigger receptive field without extra parameters or computation
    • Basically, instead of looking at every pixel, the kernel samples every second one
    • Almost never used nowadays
  • Dimensions:
    • Conv1D is for 1D data like … temperature time series (?)
    • Or you can use different filter sizes, e.g. for a feature covering temperature for …
  • Keras:
    • Conv1D, Conv2D, Conv3D
    • predictable, consistent parameters (filters, kernel_size, strides, padding) - see the sketch below
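
A minimal Keras sketch (assuming TensorFlow 2.x; the image size and filter counts are made up for illustration) of how filter count, padding, and stride determine the output shape and the shared parameter count:

```python
import tensorflow as tf

# Toy input: one 32x32 RGB image (3 channels).
x = tf.random.normal((1, 32, 32, 3))

# 8 filters of size 3x3, "same" zero-padding, stride 1:
# spatial size stays 32x32, the channel count becomes 8.
same = tf.keras.layers.Conv2D(filters=8, kernel_size=3, strides=1, padding="same")
print(same(x).shape)        # (1, 32, 32, 8)
# Parameter sharing: one 3x3x3 kernel per filter plus a bias, reused over all patches.
print(same.count_params())  # 3*3*3*8 + 8 = 224

# "valid" padding (only complete kernels) with stride 2: the spatial size shrinks.
valid = tf.keras.layers.Conv2D(filters=8, kernel_size=3, strides=2, padding="valid")
print(valid(x).shape)       # (1, 15, 15, 8)
```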

Image classification

  • Motivation:
    • Medicine, quality assurance, sorting
  • Picture -> one or more classes
  • Input picture -> Convolution (feature maps) -> Max-pooling -> Fully-connected layer (S.132) - see the sketch below
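
A minimal Keras sketch of that pipeline (input size and class count are made-up placeholders, not from the slides):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),                          # input picture
    layers.Conv2D(16, 3, padding="same", activation="relu"),  # convolution -> feature maps
    layers.MaxPooling2D(2),                                   # max-pooling (downsampling)
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),                   # fully-connected classifier
])
model.summary()
```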

Max-pooling

TODO
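
A tiny sketch of what max-pooling does (toy values, TensorFlow/Keras assumed): each 2x2 window is reduced to its maximum, halving the spatial size.

```python
import tensorflow as tf

# Toy 4x4 single-channel "image", batch size 1.
x = tf.constant([[ 1.,  2.,  5.,  6.],
                 [ 3.,  4.,  7.,  8.],
                 [ 9., 10., 13., 14.],
                 [11., 12., 15., 16.]])
x = tf.reshape(x, (1, 4, 4, 1))

# 2x2 max-pooling with stride 2: every 2x2 window collapses to its maximum.
pooled = tf.keras.layers.MaxPooling2D(pool_size=2)(x)
print(tf.reshape(pooled, (2, 2)))
# [[ 4.  8.]
#  [12. 16.]]
```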

Receptive field

  • How much does a neuron “see”
    • -> and learn
  • A neuron in layer 3 sees many more neurons from previous layers, like a pyramid - see the sketch below
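
A small worked sketch (not from the slides) of how the receptive field grows with depth, using the standard recursion for stacked convolutions:

```python
# Standard receptive-field recursion for a stack of conv layers:
#   r_l = r_{l-1} + (k_l - 1) * j_{l-1},   j_l = j_{l-1} * s_l
# where k = kernel size, s = stride and j = "jump" (distance between
# neighbouring neuron centres, measured in input pixels).
def receptive_field(conv_layers):
    r, j = 1, 1
    for k, s in conv_layers:
        r += (k - 1) * j
        j *= s
    return r

# Three stacked 3x3 convolutions, stride 1: a layer-3 neuron sees a 7x7 input patch.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
# With stride 2 the pyramid widens much faster: 15x15 after three layers.
print(receptive_field([(3, 2), (3, 2), (3, 2)]))  # 15
```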

Deep CNNs

  • The receptive field becomes bigger and bigger with the depth of the layers (S. 135)
  • The deeper you go, the more filters you have, and therefore the more complex the features are.
  • At the end you have a “picture” of size 1x1xXXXXX features etc., until you get the output probabilities for the target classes.
  • Example
    • digitization of floor plans (Grundrisse): everything worked except the doors - the receptive field was just not big enough
    • the receptive field should be big enough for what you want to classify at the end
  • Jupyter notebook example - a lot of lines and layers but few params - contrasts with how many we’d have had if we had used dense layers for this
  • ImageNet
    • Nice visualization S. 139
    • TL;DR Transformers etc. work well, both for Vision and for NLP, if they have enough compute+data
      • For English we do, for the other languages we don’t
      • -> CNNs still matter for these cases
  • History of the best vision models
    • AlexNet - 2012
      • many more params than the previous 1998 model (LeNet)
      • Stride < filter size, so the sliding windows overlap
      • ReLU instead of tanh
        • 6 times faster to train
        • actually converges faster by itself too!
      • Dropout for regularization
        • Now standard, very interesting then
      • Input data:
        • Scaled to 256
        • Augmentations: flips, crops, colors
      • For testing: 5 crops and their flips, averaging the predictions of these 10 pictures

    • InceptionNet/GoogleNet - 2014
      • 10 times fewer params but much deeper
      • Inception blocks:
        • For each block, multiple convolutions with different filter sizes (but “same” padding) are applied in parallel
        • plus a max-pooling branch
        • the outputs are concatenated at the end
      • 1x1 convolutions learn cross-channel features
        • basically decreasing the number of parameters to make it faster? TODO
      • Auxiliary loss
        • to avoid vanishing gradients
        • help to calculate the final loss
        • N auxiliary classifiers, each with its own loss attached closer to the input, and they help training
      • TL;DR just much more efficient -> fewer params
    • ResNet
      • Residual connections / skip connections / lateral connections / identity shortcut connections
        • They serve to minimize the error
          • they make sure that information does not get lost
          • Either we get better, or we stay the same - we make sure we have the option
          • F(x) + x - see the sketch after this list
        • Prevents vanishing gradients - useful in a lot of scenarios
        • Similar to highway networks, though those are the more generic solution
      • Architecture - VGG19 etc. - a LOT of layers
        • pictured either as 4 blocks or as all 34 layers; the representations are equivalent
        • ResNet-18, 34, 101, 152, etc.
      • Dimensionality reduction via increasing the stride (and not via pooling)
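
A minimal sketch of the residual idea F(x) + x in the Keras functional API (filter counts and input shape are made up; real ResNet blocks also use batch normalization and, where shapes change, a strided 1x1 projection on the shortcut):

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters):
    """y = F(x) + x: if F learns nothing useful, the block can fall back to identity."""
    f = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    f = layers.Conv2D(filters, 3, padding="same")(f)  # F(x)
    y = layers.Add()([f, x])                          # skip / identity shortcut connection
    return layers.Activation("relu")(y)

inputs = keras.Input(shape=(32, 32, 16))  # made-up shape; channels must match `filters`
outputs = residual_block(inputs, 16)
model = keras.Model(inputs, outputs)
model.summary()
```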