CSE 559A: Computer Vision

Fall 2018: T-R: 11:30-1pm @ Lopata 101

Instructor: Ayan Chakrabarti (ayan@wustl.edu).

Course Staff: Zhihao Xia, Charlie Wu, Han Liu

November 20, 2018

Problem Set 5: Deadline Extended to Dec 4th.

Recitation on Nov 30th (Friday after Thanksgiving)

He et al., "Identity Mappings in Deep Residual Networks". 2016.

- Given a limited amount of training data, deep architectures will begin to overfit.

**Important**: Keep track of training and dev-set errors

Training error will keep going down, but dev error will saturate. Make sure you don't train past the point where dev error starts going up.

- So how do we prevent, or delay, overfitting so that our dev performance increases?

Solution 1: Get more data.

**Data Augmentation**

- Think of transforms to the images that you have that would still keep them in the distribution of real images.

- Typical transforms:
  - Scaling the image
  - Taking random crops
  - Applying color transformations (randomly changing brightness, hue, saturation)
  - Horizontal flips (but not vertical ones)
  - Rotations of up to ±5 degrees

- Augmentations are a good way of getting more training data for 'free'.
- They also teach your network to be invariant to these transformations ...

- ... unless your output changes under the transform. If your output is a bounding box, segmentation map, or another quantity that these augmentation operations would change, you need to apply the same transforms to the outputs too.
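As a concrete illustration, here is a minimal NumPy sketch of on-the-fly augmentation with a random crop and a random horizontal flip; the image size, crop size, and seed below are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, crop=24):
    """Random crop + random horizontal flip of an HxWxC image (sizes made up)."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    out = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        out = out[:, ::-1]        # horizontal flip only; vertical flips leave the data distribution
    return out

img = rng.random((32, 32, 3))     # a fake 32x32 RGB image
aug = augment(img)
```

In practice a fresh random transform is sampled every time an image is drawn for a batch, so the network rarely sees the exact same input twice.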

**Weight Decay**

- Add a squared or absolute value penalty on all weight values (for example, on each element of every convolutional kernel, matmul matrix) except biases. \(\sum_i w_i^2\) or \(\sum_i |w_i|\)

- So now your effective loss is \(L' = L + \lambda \sum_i w_i^2\)

- How would you train for this?
- Let's say you use backprop to compute \(\nabla_{w_i} L\).
- What gradient would you apply to your weights? What is \(\nabla_{w_i} L'\)?

\[\nabla_{w_i} L' = \nabla_{w_i} L + 2\lambda w_i\]

- So in addition to the standard update, you will also be subtracting a scaled version of the weight itself.

- What about for \(L' = L + \lambda \sum_i |w_i|\)?

\[\nabla_{w_i} L' = \nabla_{w_i} L + \lambda\,\mathrm{sign}(w_i)\]
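The two penalty gradients above can be checked with a small NumPy sketch; the weights, gradient, and hyperparameter values below are made up for illustration.

```python
import numpy as np

w = np.array([0.5, -1.2, 2.0])
grad_L = np.array([0.1, -0.3, 0.2])   # pretend this came from backprop
lam, lr = 0.01, 0.1

# L2 penalty: gradient of L' adds 2*lambda*w to the data-loss gradient
grad_L2 = grad_L + 2 * lam * w
# L1 penalty: gradient of L' adds lambda*sign(w) instead
grad_L1 = grad_L + lam * np.sign(w)

# The standard update with grad_L2 also shrinks each weight toward zero
w_new = w - lr * grad_L2
```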

**Regularization: Dropout**

- Key Idea: Prevent a network from "depending" too much on the presence of a specific activation.
- So, randomly drop these values during training.

\(g = \mathrm{Dropout}(f, p)\): \(f\) and \(g\) will have the same shape.

Different behavior during training and testing.

- Training
- For each element \(f_i\) of \(f\),
- Set \(g_i=0\) with probability \(p\), and \(g_i=\frac{f_i}{1-p}\) with probability \(1-p\)

- Testing: \(g_i = f_i\)

- Why does this make sense? Because in *expectation*, the value is the same during training and testing.

- Dropout is a layer. You will backpropagate through it! How?

**Regularization: Dropout**

- Write the function as \(g = f \cdot \epsilon\)
- Here \(\epsilon\) is a random array of the same size as \(f\), taking value 0 with probability \(p\) and \(1/(1-p)\) with probability \(1-p\).
- \(\cdot\) denotes element-wise multiplication.

- \(\nabla_f = \nabla_g \cdot \epsilon\)
- Even though \(\epsilon\) is random, you must use the same \(\epsilon\) in the backward pass that you generated for the forward pass.
- Don't backpropagate to \(\epsilon\) because it is not a function of the input.

- Like ReLU, but it kills gradients based on an external random source: whether you dropped that activation in the forward pass. If you didn't drop it, remember to multiply the gradient by \(1/(1-p)\).
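Putting the forward and backward passes together, a minimal NumPy sketch of a dropout layer might look like this (the shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout_forward(f, p):
    # eps is 0 with probability p, 1/(1-p) otherwise; save it for the backward pass
    eps = (rng.random(f.shape) >= p) / (1.0 - p)
    return f * eps, eps

def dropout_backward(grad_g, eps):
    # reuse the SAME eps generated in the forward pass; no gradient flows to eps
    return grad_g * eps

f = rng.standard_normal((4, 5))
g, eps = dropout_forward(f, p=0.5)
grad_f = dropout_backward(np.ones_like(g), eps)  # pretend upstream gradient is all ones
```

At test time the layer is simply the identity, since the \(1/(1-p)\) scaling during training already makes the expectations match.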

**Regularization: Early Stopping**

- Keep track of dev set error. Stop optimization when it starts going up.

- This is a legitimate regularization technique!

- Essentially, you are restricting your hypothesis space to functions that are reachable within \(N\) iterations from a random initialization.
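A minimal sketch of the stopping rule, using a made-up sequence of dev-set errors and a hypothetical `patience` parameter (stop after that many checks without improvement):

```python
# Dev-set errors measured at successive checkpoints (values invented for the example)
dev_errors = [0.50, 0.40, 0.35, 0.33, 0.34, 0.36, 0.37]

best, best_iter, patience, bad = float("inf"), -1, 2, 0
for t, err in enumerate(dev_errors):
    if err < best:
        best, best_iter, bad = err, t, 0   # improvement: remember it, reset counter
    else:
        bad += 1
        if bad >= patience:                # dev error has started going up: stop
            break
```

In practice you would also checkpoint the weights at `best_iter` and restore them after stopping.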

- Standard SGD

\[w_i \leftarrow w_i - \lambda \nabla_{w_i}\]

- Momentum

\[g_i \leftarrow \nabla_{w_i} + \gamma g_i\] \[w_i \leftarrow w_i - \lambda g_i\]
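A one-step NumPy sketch of the two updates above, with made-up numbers; note that on the very first step (with \(g_i\) initialized to 0) momentum coincides with plain SGD:

```python
import numpy as np

w = np.array([1.0, -2.0])
grad = np.array([0.2, -0.1])          # gradient at the current w (made up)
lr, gamma = 0.1, 0.9

# Standard SGD
w_sgd = w - lr * grad

# Momentum: accumulate a running gradient, then step with it
g = np.zeros_like(w)                  # momentum buffer, initialized to 0
g = grad + gamma * g
w_mom = w - lr * g
```

Over many iterations the buffer `g` averages recent gradients, damping oscillations and speeding up progress along consistent directions.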

- But we are still applying the same learning rate for all parameters / weights.

**Adaptive Learning Rate Methods**

Key idea: Set the learning rate for each parameter based on the magnitude of its gradients.

- Adagrad

\[g^2_i \leftarrow g^2_i + (\nabla_{w_i})^2\] \[w_i \leftarrow w_i - \lambda \frac{\nabla_{w_i}}{\sqrt{g^2_i+\epsilon}}\]

The global learning rate is divided by the square root of the sum of squared past gradients.

Problem: the accumulated sum only grows, so the effective learning rate keeps dropping.

- RMSProp

\[g^2_i \leftarrow \gamma g^2_i + (1-\gamma)(\nabla_{w_i})^2\] \[w_i \leftarrow w_i - \lambda \frac{\nabla_{w_i}}{\sqrt{g^2_i+\epsilon}}\]
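The first steps of both updates can be sketched in NumPy with made-up numbers; note that on the first step Adagrad's effective step is roughly \(\lambda\) per weight, since \(\nabla_{w_i}/\sqrt{(\nabla_{w_i})^2}\) is just the sign:

```python
import numpy as np

w = np.array([1.0, -2.0])
grad = np.array([0.5, -0.1])          # made-up gradient
lr, gamma, eps = 0.1, 0.9, 1e-8

# Adagrad: accumulated squared gradients only grow -> step sizes shrink forever
g2_ada = np.zeros_like(w)
g2_ada = g2_ada + grad**2
w_ada = w - lr * grad / np.sqrt(g2_ada + eps)

# RMSProp: exponential moving average forgets old gradients
g2_rms = np.zeros_like(w)
g2_rms = gamma * g2_rms + (1 - gamma) * grad**2
w_rms = w - lr * grad / np.sqrt(g2_rms + eps)
```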

**Adaptive Learning Rate Methods**

- Adam: RMSProp + Momentum

\[m_i \leftarrow \beta_1 m_i + (1-\beta_1) \nabla_{w_i}\] \[v_i \leftarrow \beta_2 v_i + (1-\beta_2) (\nabla_{w_i})^2\]

\[w_i \leftarrow w_i - \frac{\lambda}{\sqrt{v_i}+\epsilon}m_i\]

- How do you initialize \(m_i\) and \(v_i\)? Typically, both as 0.

- This won't matter once the values of \(m_i, v_i\) stabilize. But in initial iterations, they will be biased towards their initial values.

**Adaptive Learning Rate Methods**

- Adam: RMSProp + Momentum + Bias Correction

\[m_i \leftarrow \beta_1 m_i + (1-\beta_1) \nabla_{w_i}\] \[v_i \leftarrow \beta_2 v_i + (1-\beta_2) (\nabla_{w_i})^2\]

\[\hat{m}_i = \frac{m_i}{1-\beta_1^t}\] \[\hat{v}_i = \frac{v_i}{1-\beta_2^t}\]

\[w_i \leftarrow w_i - \frac{\lambda}{\sqrt{\hat{v}_i}+\epsilon}\hat{m}_i\]

Here, \(t\) is the iteration number.

As \(t\rightarrow \infty\), \(1-\beta^t \rightarrow 1\), so the correction matters only in the early iterations.
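A single bias-corrected Adam step in NumPy, with made-up gradients and the commonly used default hyperparameters; at \(t=1\) the correction exactly undoes the zero initialization, so the step is roughly \(\lambda \cdot \mathrm{sign}(\nabla_{w_i})\):

```python
import numpy as np

w = np.array([1.0, -2.0])
grad = np.array([0.5, -0.1])           # made-up gradient
lr, b1, b2, eps = 0.001, 0.9, 0.999, 1e-8

m = np.zeros_like(w)                   # first-moment estimate
v = np.zeros_like(w)                   # second-moment estimate

t = 1                                  # iteration number, starting at 1
m = b1 * m + (1 - b1) * grad
v = b2 * v + (1 - b2) * grad**2
m_hat = m / (1 - b1**t)                # undo the bias toward the zero init
v_hat = v / (1 - b2**t)
w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
```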

- Neural Network Training is Slow.

- But many operations are parallelizable. In particular, operations for different batches are independent.

- That's why GPUs are great for deep learning! But even so, you will begin to saturate the computation (or worse, memory) on a GPU.

- Solution: Break up computation across multiple GPUs.

- Two possibilities:
- Model Parallelism
- Data Parallelism

**Model Parallelism**

- Less popular; it doesn't help for many networks.

- Essentially, if you have two independent "paths" in your network, you can place them on different devices. And sync, when they join.

Was used in the Krizhevsky et al., 2012 ImageNet (AlexNet) paper.

**Data Parallelism**

- Begin with all devices having the same model weights.

- On each device, load a separate batch of data.

- Do forward-backward to compute weight gradients on each GPU with its own batch.

- Have a single device (one of the GPUs, or the CPU) gather the gradients from all devices.
  - Average these gradients and apply the update to the weights.
  - Distribute the new weights to all devices.

- Works well in practice, especially for multiple GPUs in the same machine.

- Communication overhead of transferring gradients and weights back and forth. Can be large if distributing across multiple machines.
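A toy NumPy simulation of one synchronous data-parallel step on a single machine; the "workers" and their gradients are simulated here, while real implementations use a framework's collective-communication primitives instead:

```python
import numpy as np

n_workers = 4
rng = np.random.default_rng(2)

# Every device starts with the same weights
w = np.zeros(3)

# Each worker computes a gradient on its own batch (simulated as random here)
worker_grads = [rng.standard_normal(3) for _ in range(n_workers)]

# A single device gathers and averages the gradients, applies the update,
# and the new weights would then be broadcast back to all workers
avg_grad = np.mean(worker_grads, axis=0)
w = w - 0.1 * avg_grad
```

Averaging gradients over `n_workers` batches is mathematically the same as one SGD step on a batch `n_workers` times larger, which is why this scheme gives exact (not approximate) large-batch training.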

- Approximate Distributed Training
  - Let each worker keep updating its own weights independently for multiple iterations. Then, transmit the weights back to a single device, average them, and sync to all devices.
  - Another option: quantize gradients when sending them back and forth (while making sure all workers keep the same model).