CSE 559A: Computer Vision


Fall 2018: T-R: 11:30-1pm @ Lopata 101

Instructor: Ayan Chakrabarti (ayan@wustl.edu).
Course Staff: Zhihao Xia, Charlie Wu, Han Liu

http://www.cse.wustl.edu/~ayan/courses/cse559a/

November 8, 2018

Story So Far

  • Machine Learning
    • Learn input-output relationships from data
    • An alternative to algorithm design by trial and error
    • Preferred approach for very ill-posed problems
  • Learning by Optimization
    • Select a function from a hypothesis space
    • Typically translates to learning parameters \(\theta\) for a parametric form \(y = f(x; \theta)\)
    • Find \(\theta\) that minimizes loss / error on training set (but be careful of overfitting)
    • In simple cases, closed form solution for \(\theta\)
    • In the more general case, iterative optimization
  • Gradient Descent
    • Compute gradients / partial derivatives of error wrt individual parameters
    • Update parameters by moving in opposite direction
    • Convergence guarantees when the loss is a convex function of the parameters
    • But can be used generally for arbitrary functional forms and losses
    • Stochastic versions for computational efficiency
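
A minimal NumPy-style sketch of this update rule (the `grad_fn` interface and names here are hypothetical, for illustration only):

```python
def sgd_step(theta, grad_fn, x_batch, y_batch, lr=1e-2):
    """One stochastic gradient descent update on a mini-batch.

    theta is a NumPy array of parameters; grad_fn(theta, x, y) is
    assumed to return dLoss/dtheta evaluated on the mini-batch (x, y).
    """
    g = grad_fn(theta, x_batch, y_batch)
    return theta - lr * g  # step opposite to the gradient direction
```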

Story So Far

  • AutoGrad
    • Represent a complex function as a composition of simpler functions
    • Build routines that can backpropagate gradients through each simple function, given
      • Gradient at output
      • Values of inputs
    • Automated framework for computing gradients, and therefore gradient descent (see the sketch after this list)
  • Choose hypothesis space / parametric form / network architecture
    • So that it can represent all the computation required to solve the problem
    • Has as few parameters as possible so that the optimization problem is easier
    • Has "healthy gradient flow"---gradients do not vanish (or blow up)

Now, let's look at some of the semantic vision tasks we apply this to.

Core Semantic Tasks in Vision

  • Image "Classification"

  • Object Detection

  • Semantic Segmentation

The Effect of Data

  • Older ML Methods designed for small training sets
    • Used more complex optimization methods (than gradient descent): second order methods, etc.
    • Methods had better guarantees if you chose "simpler" classifiers
    • And in practice, gave you better results than neural networks
    • But training cost was quadratic in the training set size
  • With training sets of millions of examples, quadratic-cost optimization was not feasible.
  • So people first moved to gradient descent, but with the same simple classifiers.
  • Found that, with additional computational power, training with a small step size for many iterations (still cheaper than quadratic time) gives a reasonable answer.
  • But then, since gradient descent was working, the question was: why not try more complex classifiers?
  • And Krizhevsky and others demonstrated: in this regime of large training sets and high training computation budgets, deep neural networks are much better!

Architectures

Broad Design Principles

  • Think of a network that can "express" the operations that you think are needed to solve the problem
    • What kind of "receptive field" should it have?
    • How non-linear does it need to be?
    • What should be the nature of the flow of information across the image?
  • Make sure it's a function you can actually learn.
    • Think of the flow of gradients.
    • Start from architectures that you know can be successfully trained.
  • Dealing with Overfitting: One approach:
    • First find the biggest deepest network that will overfit the data
      (Given enough capacity, CNNs will often be able to just memorize the dataset)
    • Then scale it down so that it generalizes.

Architectures

Let's consider image classification.

  • We will fix our input image to be a specific size.
    • Typically choose square images of size \(S\times S\)
    • Given an image, resize it proportionally so that the smaller side (height or width) is \(S\)
    • Then take an \(S\times S\) crop along the other dimension
    • (Sometimes take multiple crops and average the predictions)
  • The final output will be a \(C\) dimensional vector for \(C\) classes.
    • Train using soft-max cross entropy.
    • Classify using arg-max
  • Often, you'll hear about Top-K error.
    • How often the true class is among the \(K\) highest values of the predicted \(C\)-dimensional vector.
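
A minimal NumPy sketch of both the training loss and the Top-K metric (function names are illustrative):

```python
import numpy as np

def softmax_xent(logits, labels):
    """Mean softmax cross-entropy. logits: B x C scores, labels: B class ids."""
    z = logits - logits.max(axis=1, keepdims=True)     # for numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def topk_error(logits, labels, k=5):
    """Fraction of examples whose true class is NOT among the k highest scores."""
    topk = np.argsort(-logits, axis=1)[:, :k]          # indices of k largest
    hit = (topk == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()
```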

Architectures

Let's talk about VGG-16, the winner of the ImageNet (ILSVRC) 2014 localization task (and runner-up in classification).

References:

  • Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, "Return of the Devil in the Details: Delving Deep into Convolutional Nets."
  • Karen Simonyan and Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition."

Four kinds of layers:

  • Convolutional
  • Max-pooling
  • Fully Connected
  • Soft-max

Architectures

  • Convolutional Layers

Take a spatial input, produce a spatial output.

\[B \times H \times W \times C \Rightarrow B \times H' \times W' \times C'\]

Can also combine with down-sampling.

\[g[b,y,x,c_2] = \sum_{k_y}\sum_{k_x}\sum_{c_1} f[b,\, y s + k_y,\, x s + k_x,\, c_1]\; k[k_y,k_x,c_1,c_2]\]

Here, \(s\) is the stride.

  • PSET 5 asks you to implement 'valid' convolution
  • But it is often combined with padding (just like in regular convolution); see the loop sketch below
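
A loop-based sketch of the strided 'valid' convolution defined above (written for clarity; a practical implementation would be vectorized):

```python
import numpy as np

def conv_valid(f, k, s=1):
    """f: B x H x W x C1 input, k: K x K x C1 x C2 kernel, s: stride."""
    B, H, W, C1 = f.shape
    K1, K2, _, C2 = k.shape
    Ho, Wo = (H - K1) // s + 1, (W - K2) // s + 1
    g = np.zeros((B, Ho, Wo, C2))
    for y in range(Ho):
        for x in range(Wo):
            patch = f[:, y*s:y*s+K1, x*s:x*s+K2, :]    # B x K x K x C1
            # sum over k_y, k_x, c_1 as in the equation above
            g[:, y, x, :] = np.tensordot(patch, k, axes=([1, 2, 3], [0, 1, 2]))
    return g
```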

Architectures

Question: Input activation is \(B\times H\times W\times C_1\), and I convolve it with a kernel of size \(K\times K\times C_1 \times C_2\). What is the size of my output? Assume 'valid' convolution.

\[B \times (H-K+1) \times (W-K+1) \times C_2\]

Question: What if I do this with a stride of 2?

Downsample above by \(2\). Think of what happens when sizes are even or odd.

\[B \times \left(\left\lfloor (H-K)/2 \right\rfloor + 1\right) \times \left(\left\lfloor (W-K)/2 \right\rfloor + 1\right) \times C_2\]

In general, you want to pad such that \(H-K\) and \(W-K\) are even, so that you keep the right and bottom edges of your images. (A small helper for these size calculations is sketched below.)
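
A small helper capturing the general rule, assuming \(p\) pixels of padding on each side:

```python
def conv_out_size(n, k, s=1, p=0):
    """Output size along one spatial dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

assert conv_out_size(224, k=3, s=1, p=1) == 224   # 3x3 conv, pad 1: size kept
assert conv_out_size(224, k=2, s=2, p=0) == 112   # 2x2 pool, stride 2: halved
```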

Architectures

Max-Pooling Layer

\[B \times H \times W \times C \Rightarrow B \times H' \times W' \times C\]

\[g[b,y,x,c] = \max_{k_y,k_x} f[b,\, y s + k_y,\, x s + k_x,\, c]\]

For each channel, choose the maximum value in a spatial neighborhood.

  • What will the gradients of this look like? (See the sketch after this list.)
  • Motivated by intuition from traditional object recognition (deformable part models). Allows for some 'slack' in the exact spatial location.
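
A loop-based sketch of max-pooling and its gradient (illustrative, not the PSET interface). Note the answer to the gradient question: the incoming gradient is routed entirely to the location that achieved the max, and is zero everywhere else.

```python
import numpy as np

def maxpool_forward(f, K=2, s=2):
    """f: B x H x W x C -> max over K x K windows with stride s."""
    B, H, W, C = f.shape
    Ho, Wo = (H - K) // s + 1, (W - K) // s + 1
    g = np.zeros((B, Ho, Wo, C))
    argmax = np.zeros((B, Ho, Wo, C), dtype=int)   # flat K*K index of the max
    for y in range(Ho):
        for x in range(Wo):
            win = f[:, y*s:y*s+K, x*s:x*s+K, :].reshape(B, K * K, C)
            argmax[:, y, x, :] = win.argmax(axis=1)
            g[:, y, x, :] = win.max(axis=1)
    return g, argmax

def maxpool_backward(grad_g, argmax, f_shape, K=2, s=2):
    """Scatter each output gradient back to its window's argmax location."""
    B, H, W, C = f_shape
    grad_f = np.zeros(f_shape)
    b, c = np.meshgrid(np.arange(B), np.arange(C), indexing='ij')
    for y in range(grad_g.shape[1]):
        for x in range(grad_g.shape[2]):
            idx = argmax[:, y, x, :]               # B x C flat window indices
            np.add.at(grad_f, (b, y*s + idx // K, x*s + idx % K, c),
                      grad_g[:, y, x, :])
    return grad_f
```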

Architectures

VGG-16

Input is a 224x224x3 Image

- Block 1
    - 3x3 Conv (Pad 1): 3->64 + RELU (pad 1 means 1 pixel on all sides; all conv layers have a bias)
    - 3x3 Conv (Pad 1): 64->64 + RELU
    - 2x2 Max-Pool (Pad 0, Stride 2): 64->64
Input to Block 2 is 112x112x64 (called pool1)

- Block 2
    - 3x3 Conv (Pad 1): 64->128 + RELU
    - 3x3 Conv (Pad 1): 128->128 + RELU
    - 2x2 Max-Pool (Pad 0, Stride 2): 128->128
Input to Block 3 is 56x56x128 (called pool2)

- Block 3
    - 3x3 Conv (Pad 1): 128->256 + RELU
    - 3x3 Conv (Pad 1): 256->256 + RELU
    - 3x3 Conv (Pad 1): 256->256 + RELU
    - 2x2 Max-Pool (Pad 0, Stride 2): 256->256
Input to Block 4 is 28x28x256 (called pool3)

- Block 4
    - 3x3 Conv (Pad 1): 256->512 + RELU
    - 3x3 Conv (Pad 1): 512->512 + RELU
    - 3x3 Conv (Pad 1): 512->512 + RELU
    - 2x2 Max-Pool (Pad 0, Stride 2): 512->512
Input to Block 5 is 14x14x512 (called pool4)

- Block 5
    - 3x3 Conv (Pad 1): 512->512 + RELU
    - 3x3 Conv (Pad 1): 512->512 + RELU
    - 3x3 Conv (Pad 1): 512->512 + RELU
    - 2x2 Max-Pool (Pad 0, Stride 2): 512->512
Output of Block 5 is 7x7x512 (called pool5)

- Reshape to a \(49\times512 = 25088\) dimensional vector (or \(B\times 25088\))

- Fully connected (matmul + bias) 25088 -> 4096 + RELU
- Fully connected (matmul + bias) 4096 -> 4096 + RELU
- Fully connected (matmul + bias) 4096 -> 1000

This is the final output that is trained with a softmax + cross entropy.

  • Lots of layers: 138 million parameters (see the sanity check below)
  • Compared to previous architectures, used really small conv filters.
    • This has now become standard.
    • Two 3x3 layers are "better" than a single 5x5 layer.
      • More non-linear
      • Fewer independent weights
  • Train this with backprop!
  • Back in the day, this would take a week or more.
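
As a quick sanity check of the numbers above, here is a short script (assuming 3x3 convs with pad 1, 2x2/stride-2 pools, and a bias in every conv and FC layer) that traces the spatial size through the five blocks and totals the parameters:

```python
# Channel counts per block: each adjacent pair is one 3x3 conv layer.
convs = [[3, 64, 64],                  # block 1
         [64, 128, 128],               # block 2
         [128, 256, 256, 256],         # block 3
         [256, 512, 512, 512],         # block 4
         [512, 512, 512, 512]]         # block 5
fcs = [(25088, 4096), (4096, 4096), (4096, 1000)]

params, size = 0, 224
for block in convs:
    for c_in, c_out in zip(block[:-1], block[1:]):
        params += 3 * 3 * c_in * c_out + c_out     # weights + bias
    size //= 2                                     # 2x2 max-pool, stride 2
for d_in, d_out in fcs:
    params += d_in * d_out + d_out

print(size)     # 7  -> pool5 is 7x7x512, and 7*7*512 = 25088
print(params)   # 138357544, i.e. the ~138 million quoted above
```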