CSE 559A: Computer Vision

Fall 2018: T-R: 11:30-1pm @ Lopata 101

Instructor: Ayan Chakrabarti (ayan@wustl.edu).
Course Staff: Zhihao Xia, Charlie Wu, Han Liu

November 8, 2018

# Story So Far

• Machine Learning
• Learn input-output relationships from data
• Algorithm design by trial and error
• Preferred approach for very ill-posed problems
• Learning by Optimization
• Select a function from a hypothesis space
• Typically translates to learning parameters $$\theta$$ for a parametric form $$y = f(x; \theta)$$
• Find $$\theta$$ that minimizes loss / error on training set (but be careful of overfitting)
• In simple cases, closed form solution for $$\theta$$
• In the more general case, iterative optimization
• Compute gradients / partial derivatives of error wrt individual parameters
• Update parameters by moving in opposite direction
• Guarantees if loss is a convex function of parameters
• But can be used generally for arbitrary functional forms and losses
• Stochastic (mini-batch) versions for computational efficiency (see the sketch below)
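To make the update rule concrete, here is a minimal NumPy sketch (not course code; the function name and hyper-parameters are made up for illustration) of stochastic gradient descent on a simple squared loss, where the gradient has a closed form:

```python
import numpy as np

def sgd_least_squares(X, y, lr=1e-3, epochs=100, batch_size=32, seed=0):
    """Minimal (stochastic) gradient descent for the squared loss mean((X @ theta - y)**2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        # Visit the training set in a random order, one mini-batch at a time.
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            xb, yb = X[idx], y[idx]
            grad = 2.0 * xb.T @ (xb @ theta - yb) / len(idx)  # dL/dtheta on this batch
            theta -= lr * grad                                # move opposite the gradient
    return theta
```

The same loop structure carries over to arbitrary losses and parametric forms; only the gradient computation changes.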

# Story So Far

• Represent a complex function as a composition of simpler functions
• Build routines that can backpropagate gradients through each simple function, given
• The values of its inputs (a minimal sketch follows this list)
• Choose hypothesis space / parametric form / network architecture
• So that it can represent all the computation required to solve the problem
• Has as few parameters as possible so that the optimization problem is easier
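A minimal sketch, assuming plain NumPy (the class names are made up for illustration), of how composed simple functions can each backpropagate gradients given the values of their inputs:

```python
import numpy as np

class Linear:
    """y = x @ W; caches its input so it can backpropagate."""
    def __init__(self, d_in, d_out):
        self.W = 0.01 * np.random.randn(d_in, d_out)
    def forward(self, x):
        self.x = x                      # cache input value for the backward pass
        return x @ self.W
    def backward(self, grad_y):
        self.dW = self.x.T @ grad_y     # gradient w.r.t. parameters
        return grad_y @ self.W.T        # gradient w.r.t. input, passed further back

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask
    def backward(self, grad_y):
        return grad_y * self.mask       # gradient flows only where the input was positive

# Composition: f(x) = Linear2(ReLU(Linear1(x)))
layers = [Linear(8, 16), ReLU(), Linear(16, 4)]
out = np.random.randn(5, 8)
for layer in layers:
    out = layer.forward(out)
grad = np.ones_like(out)                # pretend dLoss/dout = 1
for layer in reversed(layers):
    grad = layer.backward(grad)
```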

Now, let's look at some of the semantic vision tasks we apply this to.

# Core Semantic Tasks in Vision

• Image "Classification"

• Object Detection

• Semantic Segmentation

# The Effect of Data

• Older ML Methods designed for small training sets
• Used more complex optimization methods (than gradient descent): second order methods, etc.
• Methods had better guarantees if you chose "simpler" classifiers
• And in practice, gave you better results than neural networks
• But were quadratic in training set size
• With training sets of millions of examples, quadratic-time optimization was not feasible.
• So people first moved to gradient descent, but with the same simple classifiers.
• Found that, with additional computation power, running gradient descent with a small step size for many iterations (still cheaper than quadratic) gives you a reasonable answer.
• But then, since gradient descent was working, the question was: why not try more complex classifiers?
• And Krizhevsky and others demonstrated that in this regime of large training sets and high training-computation budgets, deep neural networks are much better!

# Architectures

• Think of a network that can "express" the operations that you think are needed to solve the problem
• What kind of "receptive" field should it have?
• How non-linear does it need to be?
• What should the flow of information across the image look like?
• Make sure it's a function you can actually learn.
• Think of the flow of gradients.
• Try using architectures that you know can be successfully trained as a starting point.
• Dealing with Overfitting: One approach:
• First find the biggest deepest network that will overfit the data
(Given enough capacity, CNNs will often be able to just memorize the dataset)
• Then scale it down so that it generalizes.

# Architectures

Let's consider image classification.

• We will fix our input image to be a specific size.
• Typically choose square images of size $$S\times S$$
• Given image, resize proportionally so that smaller side (height or width) is $$S$$
• Then take an $$S\times S$$ crop along the other direction
• (Sometimes take multiple crops and average the predictions)
• The final output will be a $$C$$ dimensional vector for $$C$$ classes.
• Train using soft-max cross entropy.
• Classify using arg-max
• Often, you'll hear about Top-K error.
• How often is the true class among the K highest values of the predicted $$C$$-dimensional vector? (see the sketch below)
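A small sketch of computing Top-K error from a batch of predicted score vectors (plain NumPy; the function name is made up for illustration):

```python
import numpy as np

def topk_error(scores, labels, k=5):
    """scores: B x C array of class scores; labels: length-B array of true class indices."""
    # Indices of the k highest-scoring classes for each example.
    topk = np.argsort(scores, axis=1)[:, -k:]
    # An example counts as correct if its true label appears among those k indices.
    correct = (topk == labels[:, None]).any(axis=1)
    return 1.0 - correct.mean()

# Top-1 error corresponds to ordinary arg-max classification error.
```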

# Architectures

Let's talk about VGG-16, from the ImageNet (ILSVRC) 2014 challenge (winner of the localization task, runner-up in classification).

References:
Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, "Return of the Devil in the Details: Delving Deep into Convolutional Nets."

Karen Simonyan and Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition."

Four kinds of layers:

• Convolutional
• Max-pooling
• Fully Connected
• Soft-max

# Architectures

• Convolutional Layers

Take a spatial input, produce a spatial output.

$B \times H \times W \times C \Rightarrow B \times H' \times W' \times C'$

Can also combine with down-sampling.

$g[b,y,x,c_2] = \sum_{k_y}\sum_{k_x}\sum_{c_1} f[b,\, y\cdot s+k_y,\, x\cdot s+k_x,\, c_1]\, k[k_y,k_x,c_1,c_2]$

Here, $$s$$ is stride.

• PSET 5 asks you to implement 'valid convolution'
• But often combined with padding (just like in regular convolution); a sketch follows below
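For reference, a direct (unoptimized) NumPy translation of the strided 'valid' convolution expression above; this is only a sketch, not the PSET 5 starter or solution code:

```python
import numpy as np

def conv_valid(f, k, s=1):
    """f: B x H x W x C1 input, k: Kh x Kw x C1 x C2 kernel, s: stride."""
    B, H, W, C1 = f.shape
    Kh, Kw, _, C2 = k.shape
    Ho, Wo = (H - Kh) // s + 1, (W - Kw) // s + 1
    g = np.zeros((B, Ho, Wo, C2))
    for y in range(Ho):
        for x in range(Wo):
            # Patch covered by the kernel at output location (y, x).
            patch = f[:, y*s:y*s+Kh, x*s:x*s+Kw, :]              # B x Kh x Kw x C1
            g[:, y, x, :] = np.tensordot(patch, k, axes=([1, 2, 3], [0, 1, 2]))
    return g
```

In practice this would be vectorized (e.g., via an im2col-style reshaping), but the loop form mirrors the indexing in the equation.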

# Architectures

Question: Input activation is $$B\times H\times W\times C_1$$, and I convolve it with a kernel of size $$K\times K\times C_1 \times C_2$$, what is the size of my output? Assume 'valid' convolution.

$B \times (H-K+1) \times (W-K+1) \times C_2$

Question: What if I do this with a stride of 2?

Downsample above by $$2$$. Think of what happens when sizes are even or odd.

$B \times \left(\lfloor (H-K)/2 \rfloor + 1\right) \times \left(\lfloor (W-K)/2 \rfloor + 1\right) \times C_2$

In general, you want to pad such that $$H-K$$ and $$W-K$$ are even, so that you keep the right and bottom edge of your images.
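A tiny helper (made up for illustration) that makes the even/odd behaviour explicit:

```python
def conv_output_size(H, K, stride=1, pad=0):
    """Spatial output size of a convolution: floor((H + 2*pad - K) / stride) + 1."""
    return (H + 2 * pad - K) // stride + 1

# Even vs. odd with K=3, stride=2:
# conv_output_size(8, 3, stride=2) == 3   (H - K = 5 is odd; the last position is dropped)
# conv_output_size(9, 3, stride=2) == 4   (H - K = 6 is divisible by the stride)
```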

# Architectures

Max-Pooling Layer

$B \times H \times W \times C \Rightarrow B \times H' \times W' \times C$

$g[b,y,x,c] = \max_{k_y,k_x} f[b,\, y\cdot s+k_y,\, x\cdot s+k_x,\, c]$

For each channel, choose the maximum value in a spatial neighborhood.

• What will the gradients of this look like? (see the sketch below)
• Motivated by intuition from traditional object recognition (deformable part models). Allows for some 'slack' in exact spatial location.
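A minimal NumPy sketch of 2x2 max-pooling with stride 2, including a backward pass that routes each output gradient only to the input location that attained the maximum (function names made up for illustration):

```python
import numpy as np

def maxpool_forward(f, K=2, s=2):
    """f: B x H x W x C. Returns the pooled output and the argmax locations for backprop."""
    B, H, W, C = f.shape
    Ho, Wo = (H - K) // s + 1, (W - K) // s + 1
    g = np.zeros((B, Ho, Wo, C))
    argmax = np.zeros((B, Ho, Wo, C, 2), dtype=int)
    for y in range(Ho):
        for x in range(Wo):
            patch = f[:, y*s:y*s+K, x*s:x*s+K, :].reshape(B, K * K, C)
            idx = patch.argmax(axis=1)                        # B x C flat indices
            g[:, y, x, :] = patch.max(axis=1)
            argmax[:, y, x, :, 0] = y * s + idx // K          # input row of the max
            argmax[:, y, x, :, 1] = x * s + idx % K           # input column of the max
    return g, argmax

def maxpool_backward(grad_g, argmax, f_shape):
    """Routes each output gradient back to the input location that was the max."""
    grad_f = np.zeros(f_shape)
    B, Ho, Wo, C = grad_g.shape
    for b in range(B):
        for y in range(Ho):
            for x in range(Wo):
                for c in range(C):
                    iy, ix = argmax[b, y, x, c]
                    grad_f[b, iy, ix, c] += grad_g[b, y, x, c]
    return grad_f
```

All other input locations receive zero gradient for that window.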

# Architectures

VGG-16

Input is a 224x224x3 Image

• Block 1
• 3x3 Conv (Pad 1): 3->64 + RELU (*"Pad 1" means one pixel of padding on all sides; all conv layers also have a bias)

# Architectures

VGG-16
Input is a 224x224x3 Image

- Block 1
- 3x3 Conv (Pad 1): 3->64 + RELU
- 3x3 Conv (Pad 1): 64->64 + RELU
- 2x2 Max-Pool (Pad 0, Stride 2): 64->64
Input to Block 2 is 112x112x64 (called pool1)

- Block 2
- 3x3 Conv (Pad 1): 64->128 + RELU
- 3x3 Conv (Pad 1): 128->128 + RELU
- 2x2 Max-Pool (Pad 0, Stride 2): 128->128
Input to Block 3 is 56x56x128 (called pool2)

- Block 3
- 3x3 Conv (Pad 1): 128->256 + RELU
- 3x3 Conv (Pad 1): 256->256 + RELU
- 3x3 Conv (Pad 1): 256->256 + RELU
- 2x2 Max-Pool (Pad 0, Stride 2): 256->256
Input to Block 4 is 28x28x256 (called pool3)

- Block 4
- 3x3 Conv (Pad 1): 256->512 + RELU
- 3x3 Conv (Pad 1): 512->512 + RELU
- 3x3 Conv (Pad 1): 512->512 + RELU
- 2x2 Max-Pool (Pad 0, Stride 2): 512->512
Input to Block 5 is 14x14x512 (called pool4)

- Block 5
- 3x3 Conv (Pad 1): 512->512 + RELU
- 3x3 Conv (Pad 1): 512->512 + RELU
- 3x3 Conv (Pad 1): 512->512 + RELU
- 2x2 Max-Pool (Pad 0, Stride 2): 512->512
Output of Block 5 is 7x7x512 (called pool5)


# Architectures

VGG-16
Output of Block 5 is 7x7x512 (called pool5)
- Reshape to a (49*512=25088) dimensional vector (or $B\times 25088$)

- Fully connected (matmul + bias) 25088 -> 4096 + RELU
- Fully connected (matmul + bias) 4096 -> 4096 + RELU
- Fully connected (matmul + bias) 4096 -> 1000


This is the final output that is trained with a softmax + cross entropy.

• Lots of layers: 138 Million Parameters
• Compared to previous architectures, used really small conv filters.
• This has now become standard.
• Two 3x3 layers are "better" than a single 5x5 layer (same 5x5 receptive field).
• More non-linear (two RELUs instead of one)
• Fewer independent weights ($$2\times 9C^2 = 18C^2$$ vs. $$25C^2$$ for $$C$$ input and output channels)
• Train this with backprop!
• Back in the day, this would take a week or more.
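Putting it all together, a sketch of the VGG-16 layer structure described above, written in PyTorch purely for illustration (not necessarily the framework used in this course; torchvision also provides a reference vgg16 implementation):

```python
import torch.nn as nn

def conv_block(channels):
    """A VGG block: 3x3 pad-1 convs (+RELU), then a 2x2 stride-2 max-pool."""
    layers = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

vgg16 = nn.Sequential(
    conv_block([3, 64, 64]),            # Block 1: 224x224x3  -> 112x112x64  (pool1)
    conv_block([64, 128, 128]),         # Block 2: 112x112x64 -> 56x56x128   (pool2)
    conv_block([128, 256, 256, 256]),   # Block 3: 56x56x128  -> 28x28x256   (pool3)
    conv_block([256, 512, 512, 512]),   # Block 4: 28x28x256  -> 14x14x512   (pool4)
    conv_block([512, 512, 512, 512]),   # Block 5: 14x14x512  -> 7x7x512     (pool5)
    nn.Flatten(),                       # 7*7*512 = 25088
    nn.Linear(25088, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),              # class scores, trained with softmax cross-entropy
)
```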