CSE 559A: Computer Vision

Fall 2018: T-R: 11:30-1pm @ Lopata 101

Instructor: Ayan Chakrabarti (ayan@wustl.edu).
Course Staff: Zhihao Xia, Charlie Wu, Han Liu

November 12, 2018

# General

• Problem Set 4 Due Tonight!

• Problem Set 5 will be posted shortly.

# Object Detection

• Newer methods also use a neural network to generate "region proposals"
• Efficient Implementations: bulk of the computation happens once on the entire image, and you crop a feature map for each region.
• Even Faster Methods: Discretize image locations into a grid, and directly output up to a fixed number of bounding boxes for each grid block.
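The "compute once, crop per region" idea above can be sketched as follows. This is only an illustration, not any particular detector's implementation: the helper name `crop_region_features`, the `stride` parameter, and the toy feature map are assumptions made for the example. Practical detectors typically also pool each crop to a fixed size, which is omitted here.

```python
def crop_region_features(feature_map, box, stride):
    """Crop the cells of a 2D feature map that cover an image-space box.

    feature_map: list of rows, computed once for the whole image.
    box: (x0, y0, x1, y1) in image pixels, with x1 and y1 exclusive.
    stride: assumed total downsampling factor from image to feature map.
    """
    x0, y0, x1, y1 = box
    # Round the start down and the end up so the crop fully covers the box.
    fx0, fy0 = x0 // stride, y0 // stride
    fx1 = -(-x1 // stride)  # ceiling division
    fy1 = -(-y1 // stride)
    return [row[fx0:fx1] for row in feature_map[fy0:fy1]]


# Toy 4x4 feature map for a 16x16 image (stride 4); entry = 10 * row + col.
feature_map = [[10 * r + c for c in range(4)] for r in range(4)]

# Two proposals reuse the same feature map; no per-region forward pass is needed.
crop_a = crop_region_features(feature_map, box=(4, 0, 12, 8), stride=4)
crop_b = crop_region_features(feature_map, box=(5, 5, 6, 6), stride=4)
print(crop_a)  # [[1, 2], [11, 12]]
print(crop_b)  # [[11]]
```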

# Transfer Learning

• Say you want to train a network to solve a problem.
• The task is complex, so you need a large network.
• But you don't have enough training data to train such a network.
• Pick a related task for which you do have a lot of training data
• ImageNet is a great database for this for a variety of semantic tasks
• Train a network (like VGG-16) to solve that task.
• Then, choose the output of some intermediate layer of that network
• Use it as a feature vector, and learn a smaller network for your problem which goes from those features to the desired output.
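The recipe above can be sketched in a few lines. This is a toy illustration under loud assumptions: `frozen_features` stands in for an intermediate layer of a pretrained network, and `W_pretrained` is filled with random numbers rather than weights actually trained on a related task. The point is only the structure: the feature extractor is never updated, and a small head (here, logistic regression) is trained on its outputs.

```python
import math
import random


def frozen_features(x, W):
    """Stand-in for a pretrained intermediate layer: ReLU(W x). W is never updated."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]


def train_head(feats, labels, steps=200, lr=0.1):
    """Train a small logistic-regression head on fixed features by gradient descent."""
    d, n = len(feats[0]), len(feats)
    w, b = [0.0] * d, 0.0
    losses = []
    for _ in range(steps):
        gw, gb, loss = [0.0] * d, 0.0, 0.0
        for f, y in zip(feats, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            loss -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            gw = [g + (p - y) * fi for g, fi in zip(gw, f)]
            gb += p - y
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
        losses.append(loss / n)  # loss measured before this step's update
    return w, b, losses


random.seed(0)
# Hypothetical "pretrained" weights (random here, purely for illustration).
W_pretrained = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(8)]

# Small labeled set for the new task.
xs = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(40)]
ys = [1 if x[0] > 0 else 0 for x in xs]

# Features are computed once with the frozen extractor; only the head is trained.
feats = [frozen_features(x, W_pretrained) for x in xs]
w, b, losses = train_head(feats, ys)
print(losses[0], losses[-1])
```

Only `w` and `b` (9 numbers) are learned from the small dataset, which is why this can work when there is too little data to train the full network.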

# Transfer Learning

• VGG-16 does well on ImageNet classification and gives you a feature representation that is surprisingly useful for a broad range of tasks.

Recall computing an encoding $$\tilde{x}$$ from $$x$$: VGG-16's pool5, fc1, and fc2 features can serve as the $$\tilde{x}$$ for many tasks.

One can also "initialize" a network with the VGG-16 architecture from weights trained on ImageNet, replace the final layer with a classifier for another task, and then "finetune" on that task.

In general, it is an empirical question whether training on Task A will provide good features for Task B.

# Fully-Convolutional Networks

• Option 0: Just don't use downsampling

Bad, because down-sampling is a way to quickly increase the "receptive field" of your network.

• Option 1: Just produce a label map at lower-resolution.
• Option 2: If you downsample by $$N$$ (typically $$N=2^K$$)
Feed every "shifted" version of your input (all $$N\times N$$ shifts, from $$0$$ to $$N-1$$ pixels along each axis) through this FCN, and interleave the outputs.

Bad, because if you downsample multiple times, you are still recomputing the activations before the last downsampling for every shift.

• Option 3: Dilated Convolutions
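Options 2 and 3 can be compared on a 1D toy example. The sketch below is not taken from the lecture: the function names, the two-layer "network", and the kernels are assumptions chosen for illustration. It reads Option 3 as "remove the downsampling and dilate the following convolution by the same factor", and checks that this gives the same dense output as running the strided network on every shift and interleaving the results.

```python
def conv1d(x, k, dilation=1):
    """'Valid' 1D correlation; dilation is the spacing between kernel taps."""
    span = (len(k) - 1) * dilation
    return [sum(k[j] * x[i + j * dilation] for j in range(len(k)))
            for i in range(len(x) - span)]


def strided_net(x, k1, k2, n=2):
    """conv -> downsample by n -> conv: the output is n times coarser than x."""
    h = conv1d(x, k1)[::n]
    return conv1d(h, k2)


def shift_and_stitch(x, k1, k2, n=2):
    """Option 2: run the strided network on each of the n shifts, then interleave.

    The first convolution is recomputed for every shift.
    """
    outs = [strided_net(x[s:], k1, k2, n) for s in range(n)]
    m = min(len(o) for o in outs)
    return [outs[s][i] for i in range(m) for s in range(n)]


def dilated_net(x, k1, k2, n=2):
    """Option 3 (as read here): no downsampling; the second conv is dilated by n.

    The first convolution is computed once.
    """
    return conv1d(conv1d(x, k1), k2, dilation=n)


x = [(i * i) % 7 for i in range(12)]
k1, k2 = [1, 2, 1], [1, 0, -1]

coarse = strided_net(x, k1, k2)        # low-resolution output (Option 1)
stitched = shift_and_stitch(x, k1, k2)  # dense output, redundant computation
dense = dilated_net(x, k1, k2)          # dense output, single pass

print(len(coarse), len(stitched), len(dense))
print(stitched == dense[:len(stitched)])
```

In 2D the same argument applies along each axis, so shift-and-stitch needs one pass per shift along both axes, while the dilated network still needs a single pass.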