CSE 559A: Computer Vision

Fall 2018: T-R: 11:30-1pm @ Lopata 101

Instructor: Ayan Chakrabarti (ayan@wustl.edu).
Course Staff: Zhihao Xia, Charlie Wu, Han Liu

Oct 30, 2018

# General

• Proposal Feedback Out
• Do a pull on your existing proposal repo
• In some cases, there are additional steps you need to take, so do this now!
• Problem Set 4 ready to Clone
• Due two weeks from today

# Machine Learning

• Obtain a function $$f: \mathcal{X} \rightarrow \mathcal{Y}$$ from data
• Maps inputs from domain $$\mathcal{X}$$ to outputs from domain $$\mathcal{Y}$$
• Components
• Training set of pairs $$(x_i,y_i)$$
• Loss function $$L(y, \hat{y})$$
• Hypothesis Space $$\mathcal{H}$$ to search over for $$f$$

$f = \arg \min_{f\in \mathcal{H}} \sum_i L(y_i, f(x_i))$

• Basically, algorithm design by trial and error (on training set)
• A better way of solving problems when the problems are ill-posed
• Need to watch out for over-fitting the training set

# Machine Learning

Classification

Consider the case when $$y$$ is binary, i.e., $$\mathcal{Y} = \{0,1\}$$.

How do you define the loss function then?

• Ideally, $$L(y,\hat{y})$$ is 0 if they are equal, 1 otherwise.

But we don't know how to minimize that directly. What if we solved it by regression?

$w = \arg \min_{w} \frac{1}{T} \sum_t (y_t - w^T\tilde{x}_t)^2$

And at test time, we can output $$y = 1$$ if $$w^T\tilde{x} > 0.5$$ and $$0$$ otherwise.

The problem is that the loss function will penalize $$w^T\tilde{x}_t > 1$$ when $$y_t=1$$, even though at test time this would give us exactly the right answer!
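A minimal sketch of this regress-then-threshold idea (NumPy; the toy 2D dataset and all names here are illustrative, not from the course materials):

```python
import numpy as np

# Toy 2D binary dataset (illustrative only)
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Augment features with a constant 1 so the bias is folded into w
x_tilde = np.hstack([x, np.ones((100, 1))])

# Least-squares fit: w = argmin_w (1/T) sum_t (y_t - w^T x_tilde_t)^2
w, *_ = np.linalg.lstsq(x_tilde, y, rcond=None)

# At test time: output y = 1 if w^T x_tilde > 0.5, else 0
y_hat = (x_tilde @ w > 0.5).astype(int)
print("training accuracy:", (y_hat == y).mean())
```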

# Machine Learning

Logistic regression

• Learn a function $$f(x) = P(y = 1)$$ that regresses to the probability that $$y$$ is 1.
• We have to choose $$f$$ such that the range of $$f(x)$$ lies in $$[0,1]$$.

$f(x;w) = \sigma\left(w^T\tilde{x}\right),~~~~\sigma(p) = \frac{\exp(p)}{1+\exp(p)}$

• This ensures that the output of $$f$$ lies in $$[0,1]$$
• $$w^T\tilde{x}$$ can be interpreted as the log-odds, i.e., the log of the ratio of $$P(y=1)$$ to $$P(y=0)$$
• $$\tilde{x}$$ is some augmented "feature vector" derived from $$x$$.
• "Linear Classifier" if $$\tilde{x} = [x^T;1]^T$$ (log-odds are linear)
• Could be polynomial $$\tilde{x} = [1,x,x^2,x^3]$$
• Or other arbitrary non-linear functions of $$x$$
• Can apply even when $$x$$ is non-numeric, as long as $$\tilde{x}$$ is numeric.
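As a concrete sketch of this model (NumPy; the linear augmentation used for $$\tilde{x}$$ is just one of the encodings listed above):

```python
import numpy as np

def sigmoid(p):
    # sigma(p) = exp(p) / (1 + exp(p)) = 1 / (1 + exp(-p))
    return 1.0 / (1.0 + np.exp(-p))

def features(x):
    # One possible encoding: x_tilde = [x; 1], giving a linear classifier
    return np.concatenate([x, [1.0]])

def f(x, w):
    # f(x; w) = P(y = 1) = sigma(w^T x_tilde)
    return sigmoid(w @ features(x))
```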

# Machine Learning

Logistic Regression

For Binary Classification: $$~~~\mathcal{X}\rightarrow [0,1]$$ $f(x; w) = \sigma(w^T\tilde{x}) = \frac{\exp(w^T\tilde{x})}{1 + \exp(w^T\tilde{x})}$

• To classify, $$y = 1$$ if $$P(y=1) > 0.5$$ or 0 otherwise
• That is, $$y = 1$$ if $$w^T\tilde{x} > 0$$ and $$0$$ otherwise.

• Note: Classifier is linear in chosen encoding $$\tilde{x}$$.
• $$w^T\tilde{x} = 0$$ defines a "separating hyperplane" between the positive and negative parts of the space of $$\tilde{x}$$.

# Machine Learning

Logistic regression

$P(y=1) = f(x) = \sigma\left(w^T\tilde{x}\right)$

Cross-Entropy Loss

If true $$y$$ is 1, we want $$f(x)$$ to be high, and if it is 0, we want it to be low.

$L(y,f(x)) = - \left\{\begin{array}{ll} \log P(y=1) = \log f(x)~~~& \text{if}~~y=1\\ \log P(y=0) = \log (1-f(x))~~~& \text{if}~~y=0\end{array} \right.$

$L(y,f(x)) = -y\log f(x) - (1-y)\log(1-f(x))$

The minus sign is there because this is a loss to be minimized.

Minimizing $$\sum_t L(y_t,f(x_t))$$ can be viewed as maximizing the sum of the log-probabilities, or the product of the probabilities of the labels $$y_t$$ under our predicted distribution.

The loss thus prefers a high probability for the correct label over a uniform (low-confidence) distribution over both labels, which it in turn prefers over a high probability for the incorrect label.
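A small sketch of this loss (NumPy; the `eps` clipping constant is a hypothetical addition just to keep the logs finite):

```python
import numpy as np

def cross_entropy(y, f_x, eps=1e-12):
    # L(y, f(x)) = -y log f(x) - (1 - y) log(1 - f(x))
    f_x = np.clip(f_x, eps, 1.0 - eps)
    return -y * np.log(f_x) - (1.0 - y) * np.log(1.0 - f_x)

# Confident and correct -> small loss; low confidence -> medium; confident and wrong -> large
print(cross_entropy(1, 0.99))  # ~0.01
print(cross_entropy(1, 0.50))  # ~0.69
print(cross_entropy(1, 0.01))  # ~4.61
```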

But now, how do we minimize this function in terms of $$w$$? It is no longer a least-squares problem.

Logistic Regression

$f(x; w) = \sigma(w^T\tilde{x}) = \frac{\exp(w^T\tilde{x})}{1 + \exp(w^T\tilde{x})}$

• Cross-entropy / Negative Log-Likelihood Loss

$L(y,f(x;w)) = -y \log f(x;w) - (1-y) \log (1-f(x;w))$

$f(x;w) = \frac{1}{1 + \exp(-w^T\tilde{x})}~~~~~~~1-f(x;w) = \frac{1}{1 + \exp(w^T\tilde{x})}$

Logistic Regression

$f(x; w) = \sigma(w^T\tilde{x}) = \frac{\exp(w^T\tilde{x})}{1 + \exp(w^T\tilde{x})}$

• Cross-entropy / Negative Log-Likelihood Loss

$L(y,f(x;w)) = y \log \left[1 + \exp(-w^T\tilde{x})\right] + (1-y) \log \left[1 + \exp(w^T\tilde{x})\right]$

• Putting it all together, given a training set of $$\{(x_t,y_t)\}$$:

$w = \arg \min_w \frac{1}{T} \sum_{t=1}^T y_t \log \left[1 + \exp(-w^T\tilde{x}_t)\right] + (1-y_t) \log \left[1 + \exp(w^T\tilde{x}_t)\right]$

Logistic Regression

$w = \arg \min_w \frac{1}{T} \sum_{t=1}^T y_t \log \left[1 + \exp(-w^T\tilde{x}_t)\right] + (1-y_t) \log \left[1 + \exp(w^T\tilde{x}_t)\right]$

• You can show that this loss is a convex function of $$w$$
(compute the Hessian matrix and show that its eigenvalues are non-negative)
• So it has a single global minimum.

But how do we find it?

Logistic Regression

$w = \arg \min_w \frac{1}{T} \sum_{t=1}^T y_t \log \left[1 + \exp(-w^T\tilde{x}_t)\right] + (1-y_t) \log \left[1 + \exp(w^T\tilde{x}_t)\right]$

More General Form

$w = \arg \min_w C(w)~~~~C(w) = \frac{1}{T} \sum_t C_t(w)$

Iterative algorithm

• Given a current estimate of $$w$$, approximate $$C(w)$$ locally as a linear function of $$w$$
• $$C(w) \approx \alpha^Tw$$
• Do this fit by computing the gradient of $$C(w)$$ wrt $$w$$
• $$\alpha = \nabla_w C(w)$$ (this would be exact if $$C(w) = \alpha^Tw$$)

Think of $$[C(w),w]$$ as coordinates on a plane. Which direction should we move in $$w$$-space to reduce $$C(w)$$?

$$-\alpha$$

$w = \arg \min_w C(w)~~~~C(w) = \frac{1}{T} \sum_t C_t(w)$

• Begin with initial guess $$w_0$$
• At each iteration $$i$$:
• $$w_{i+1} \leftarrow w_{i} - \gamma \nabla_w C(w_i)$$
• At each iteration, we update the parameters $$w$$ by "moving", in $$w$$-space, in the
opposite direction of the gradient (at that point $$w_i$$).
• $$\gamma$$ is the step-size. When running optimization for training, often called the "learning rate".
• In some cases, $$\gamma$$ can be set by doing a line-search
• Check values of $$C(w_i-\gamma \nabla_w C(w_i))$$ and pick the $$\gamma$$ that minimizes the cost
• In other cases, we choose a fixed value of $$\gamma$$ (or change it in some pre-determined schedule per iteration)
• Then, we are moving by a distance that is proportional to the magnitude of the gradient

• If you select the optimal step size by doing a "line search" for $$\gamma$$, one can prove that gradient descent will converge.
• If the function is convex, it converges to the unique global minimum.
• Second order variants that consider the Hessian matrix: Newton & Quasi-Newton Methods
• Gauss-Newton, Levenberg-Marquardt, ...

But simple gradient descent suffices, or is our only choice, when:

• Function isn't convex.
• Can't afford to do line search.
• There are so many parameters that we can't compute the Hessian.

Also, in these settings there are no theoretical guarantees.

Theory is still catching up. Meanwhile, we'll try to understand the "behavior" of the gradients.
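For reference, a minimal sketch of the basic gradient-descent loop itself (NumPy; the quadratic cost, step size, and iteration count are arbitrary stand-ins):

```python
import numpy as np

def gradient_descent(grad_C, w0, gamma=0.1, num_iters=100):
    # w_{i+1} <- w_i - gamma * grad C(w_i)
    w = w0.copy()
    for _ in range(num_iters):
        w = w - gamma * grad_C(w)
    return w

# Example: C(w) = ||w - [1, 2]||^2 has gradient 2 (w - [1, 2])
target = np.array([1.0, 2.0])
print(gradient_descent(lambda w: 2.0 * (w - target), np.zeros(2)))  # approaches [1, 2]
```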

$\nabla_w C(w) = \left[\begin{array}{c} \frac{\partial}{\partial w_1} C(w)\\\frac{\partial}{\partial w_2} C(w)\\ \vdots \end{array}\right]$

$\text{If}~~~C(w) = \frac{1}{T} \sum_t C_t(w),~~~\text{then}~~~\nabla_w C(w) = \frac{1}{T} \sum_t \nabla_w C_t(w)$

Logistic Regression

$C_t(w) = y_t \log \left[1 + \exp(-w^T\tilde{x}_t)\right] + (1-y_t) \log \left[1 + \exp(w^T\tilde{x}_t)\right]$

What is $$\nabla_w C_t(w)$$, the gradient of the loss from a single training example?

$C_t(w) = y_t \log \left[1 + \exp(-w^T\tilde{x}_t)\right] + (1-y_t) \log \left[1 + \exp(w^T\tilde{x}_t)\right]$

Ok, what is the derivative of

$C_t(p) = y_t \log \left[1 + \exp(-p)\right] + (1-y_t) \log \left[1 + \exp(p)\right]$

with respect to $$p$$ (where $$p$$ is a scalar).

Take 5 mins!

$C_t(w) = y_t \log \left[1 + \exp(-w^T\tilde{x}_t)\right] + (1-y_t) \log \left[1 + \exp(w^T\tilde{x}_t)\right]$

Ok, what is the derivative of

$C_t(p) = y_t \log \left[1 + \exp(-p)\right] + (1-y_t) \log \left[1 + \exp(p)\right]$

with respect to $$p$$ (where $$p$$ is a scalar).

$\frac{\partial}{\partial p} C_t(p)= \frac{\exp(p)}{1+\exp(p)} - y_t$

$\frac{\partial}{\partial p} C_t(p) = y_t~~~\frac{-\exp(-p)}{1+\exp(-p)} + (1-y_t) \frac{\exp(p)}{1+\exp(p)}$

$= \frac{\exp(p)}{1+\exp(p)} - y_t\left[\frac{\exp(-p)}{1+\exp(-p)} + \frac{\exp(p)}{1+\exp(p)}\right]$

$= \frac{\exp(p)}{1+\exp(p)} - y_t\left[\frac{1}{1+\exp(p)} + \frac{\exp(p)}{1+\exp(p)}\right]$
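You can sanity-check this result numerically with finite differences (a quick sketch; the test point, label, and step `h` are arbitrary):

```python
import numpy as np

def C_t(p, y_t):
    # C_t(p) = y_t log(1 + exp(-p)) + (1 - y_t) log(1 + exp(p))
    return y_t * np.log(1 + np.exp(-p)) + (1 - y_t) * np.log(1 + np.exp(p))

def dC_dp(p, y_t):
    # Claimed derivative: sigma(p) - y_t
    return np.exp(p) / (1 + np.exp(p)) - y_t

p, y_t, h = 0.7, 1.0, 1e-5
numeric = (C_t(p + h, y_t) - C_t(p - h, y_t)) / (2 * h)
print(numeric, dC_dp(p, y_t))  # should agree to several decimal places
```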

$C_t(w) = y_t \log \left[1 + \exp(-w^T\tilde{x}_t)\right] + (1-y_t) \log \left[1 + \exp(w^T\tilde{x}_t)\right]$

$C_t(p) = y_t \log \left[1 + \exp(-p)\right] + (1-y_t) \log \left[1 + \exp(p)\right]$ $\frac{\partial}{\partial p} C_t(p)= \frac{\exp(p)}{1+\exp(p)} - y_t$

Observations

• $$\frac{\exp(p)}{1+\exp(p)}$$ is basically the output $$f(x_t;w)$$, predicted probability that $$y_t=1$$.
• Remember: this is the derivative with respect to $$p$$, i.e. the logit / log-odds.
• The gradient is 0 if $$y_t = 0$$ and the predicted probability is 0, or if $$y_t=1$$ and the predicted probability is 1.
• That is, do nothing if we are predicting the right answer with perfect confidence.
• If we predict probability > 0 and $$y_t = 0$$, the gradient is positive.
• If we predict probability < 1 and $$y_t = 1$$, the gradient is negative.

Remember we move in the opposite direction of gradient.

$C_t(w) = y_t \log \left[1 + \exp(-w^T\tilde{x}_t)\right] + (1-y_t) \log \left[1 + \exp(w^T\tilde{x}_t)\right]$

$C_t(p) = y_t \log \left[1 + \exp(-p)\right] + (1-y_t) \log \left[1 + \exp(p)\right]$ $\frac{\partial}{\partial p} C_t(p)= \frac{\exp(p)}{1+\exp(p)} - y_t$

Also, changing $$p$$ makes a much bigger difference in the corresponding probability when $$p$$ is near 0, i.e., when the probability is near $$0.5$$.

$C_t(w) = y_t \log \left[1 + \exp(-w^T\tilde{x}_t)\right] + (1-y_t) \log \left[1 + \exp(w^T\tilde{x}_t)\right]$

$C_t(p) = y_t \log \left[1 + \exp(-p)\right] + (1-y_t) \log \left[1 + \exp(p)\right]$ $\frac{\partial}{\partial p} C_t(p)= \frac{\exp(p)}{1+\exp(p)} - y_t$

But this is still derivative with respect to $$p$$. We want gradient with respect to $$w$$.

$\frac{\partial}{\partial w^j} C_t(w)= ?~~\times~~\left[\frac{\exp(w^T\tilde{x}_t)}{1+\exp(w^T\tilde{x}_t)} - y_t\right]$

$C_t(w) = y_t \log \left[1 + \exp(-w^T\tilde{x}_t)\right] + (1-y_t) \log \left[1 + \exp(w^T\tilde{x}_t)\right]$

$C_t(p) = y_t \log \left[1 + \exp(-p)\right] + (1-y_t) \log \left[1 + \exp(p)\right]$ $\frac{\partial}{\partial p} C_t(p)= \frac{\exp(p)}{1+\exp(p)} - y_t$

But this is still derivative with respect to $$p$$. We want gradient with respect to $$w$$.

$\frac{\partial}{\partial w^j} C_t(w)= \tilde{x}^j_t~~\times~~\left[\frac{\exp(w^T\tilde{x}_t)}{1+\exp(w^T\tilde{x}_t)} - y_t\right]$

$\nabla_w C_t(w) = ?$

$C_t(w) = y_t \log \left[1 + \exp(-w^T\tilde{x}_t)\right] + (1-y_t) \log \left[1 + \exp(w^T\tilde{x}_t)\right]$

$C_t(p) = y_t \log \left[1 + \exp(-p)\right] + (1-y_t) \log \left[1 + \exp(p)\right]$ $\frac{\partial}{\partial p} C_t(p)= \frac{\exp(p)}{1+\exp(p)} - y_t$

But this is still derivative with respect to $$p$$. We want gradient with respect to $$w$$.

$\frac{\partial}{\partial w^j} C_t(w)= \tilde{x}^j_t~~\times~~\left[\frac{\exp(w^T\tilde{x}_t)}{1+\exp(w^T\tilde{x}_t)} - y_t\right]$

$\nabla_w C_t(w) = \tilde{x}_t~~\left[\frac{\exp(w^T\tilde{x}_t)}{1+\exp(w^T\tilde{x}_t)} - y_t\right]$

$\nabla_w C_t(w) = \nabla_w(w^T\tilde{x}_t)~~~~\left[\frac{\exp(w^T\tilde{x}_t)}{1+\exp(w^T\tilde{x}_t)} - y_t\right]$

$\nabla_w C_t(w) = \nabla_w p(w)~~~~\frac{\partial C_t(p)}{\partial p}$

$w = \arg \min_w \frac{1}{T} \sum_{t=1}^T y_t \log \left[1 + \exp(-w^T\tilde{x}_t)\right] + (1-y_t) \log \left[1 + \exp(w^T\tilde{x}_t)\right]$

Putting it together:

• At each iteration $$i$$,
• Based on the current $$w$$, compute $$f(x_t;w)=\hat{y}_t$$
• Compute the derivative of the "output" as $$\hat{y}_t-y_t$$
• Multiply by $$\tilde{x}_t$$ to get $$\nabla_w C_t$$
• Change $$w$$ by subtracting some $$\gamma$$ times this gradient.

$w = \arg \min_w \frac{1}{T} \sum_{t=1}^T y_t \log \left[1 + \exp(-w^T\tilde{x}_t)\right] + (1-y_t) \log \left[1 + \exp(w^T\tilde{x}_t)\right]$

Putting it together:

• At each iteration $$i$$,
• Based on the current $$w$$, compute $$f(x_t;w)=\hat{y}_t$$ for every training sample
• Compute the derivative of the "output" as $$\hat{y}_t-y_t$$ for every training sample
• Multiply by $$\tilde{x}_t$$ and average across all training samples to get $$\nabla_w C$$
• Change $$w$$ by subtracting some $$\gamma$$ times this gradient (see the sketch below).
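A rough end-to-end sketch of this batch gradient-descent loop for logistic regression (NumPy; the step size and iteration count are placeholders you would tune):

```python
import numpy as np

def train_logistic(x_tilde, y, gamma=0.1, num_iters=1000):
    # x_tilde: (T, D) augmented feature vectors, y: (T,) labels in {0, 1}
    T, D = x_tilde.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        y_hat = 1.0 / (1.0 + np.exp(-(x_tilde @ w)))  # f(x_t; w) for every sample
        grad = x_tilde.T @ (y_hat - y) / T            # (1/T) sum_t x_tilde_t (y_hat_t - y_t)
        w = w - gamma * grad
    return w
```

At test time you would classify with the learned $$w$$ by thresholding $$w^T\tilde{x}$$ at 0, as before.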

$C(w) = \frac{1}{T} \sum_t C_t(w) \Rightarrow \nabla_w C = \frac{1}{T} \sum_t \nabla_w C_t$

Expensive when we have a LOT of training data.

$w = \arg \min_w \frac{1}{T} \sum_t C(x_t,y_t; w)$

$\nabla_w = \frac{1}{T} \sum_t \nabla_w C(x_t,y_t; w)$

Remember, the summation over training samples is meant to approximate an expectation over $$P_{XY}(x,y)$$.

$\frac{1}{T} \sum_t C(x_t,y_t; w) \rightarrow \mathbb{E}_{P_{XY}(x,y)} C(x,y; w)$

$\frac{1}{T} \sum_t \nabla_w C(x_t,y_t; w) \rightarrow \mathbb{E}_{P_{XY}(x,y)} \nabla_w C(x,y; w)$

In other words, we are approximating the "true" gradient with gradients over samples.

What if we used a smaller number of samples in each iteration, but different samples in different iterations?

• Single sample: at each iteration, choose a random $$t\in\{1,2,\ldots,T\}$$. $w_{i+1} \leftarrow w_{i} - \gamma \nabla_w C_t(x_t,y_t;w_i)$
• "Mini"-batched SGD (plain GD is sometimes called Batch GD): at each iteration, choose a random smaller batch $$\mathcal{B}$$ of size $$B \ll T$$. $w_{i+1} \leftarrow w_{i} - \gamma \frac{1}{B} \sum_{t\in \mathcal{B}} \nabla_w C_t(x_t,y_t;w_i)$

With replacement? Without replacement?

In practice:

• Shuffle order of training examples
• Choose a batch size
• Take consecutive groups of $$B$$ samples as you loop through iterations
• [1,B] in iteration 1
• [B+1,2B] in iteration 2
• . . .
• Once you reach the end of the training set (called one "epoch"),
shuffle the order again.

$w_{i+1} \leftarrow w_{i} - \gamma \frac{1}{B} \sum_{t\in \mathcal{B}} \nabla_w C_t(x_t,y_t;w_i)$
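A sketch of this shuffled, epoch-based mini-batch scheme (NumPy; `grad_batch` stands for whatever per-batch gradient you use, e.g. the logistic-regression gradient above, and the batch size and learning rate are placeholders):

```python
import numpy as np

def sgd(grad_batch, w0, x_tilde, y, gamma=0.05, B=32, num_epochs=10):
    w, T = w0.copy(), len(y)
    rng = np.random.default_rng(0)
    for _ in range(num_epochs):
        order = rng.permutation(T)           # shuffle the training order once per epoch
        for start in range(0, T, B):
            idx = order[start:start + B]     # consecutive group of B samples
            w = w - gamma * grad_batch(w, x_tilde[idx], y[idx])
    return w

# Example grad_batch for logistic regression: average gradient over the batch
def logistic_grad(w, xb, yb):
    y_hat = 1.0 / (1.0 + np.exp(-(xb @ w)))
    return xb.T @ (y_hat - yb) / len(yb)
```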

General Notes

• The gradient over a mini-batch is an "approximation", or a "noisy" version of the gradient over the true training set. $\frac{1}{B} \sum_{t\in \mathcal{B}} \nabla_w C_t(x_t,y_t;w_i) = \frac{1}{T} \sum_{t=1}^T \nabla_w C_t(x_t,y_t;w_i) + \epsilon$
• Typically, if you decrease the batch-size, you will want to decrease your step size (because you are "less sure" about the gradient).
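To see the "noise" $$\epsilon$$ concretely, you can compare a mini-batch gradient to the full-training-set gradient (a sketch; the toy data, weight vector, and batch size are all illustrative):

```python
import numpy as np

# Toy data and a random w (illustrative only)
rng = np.random.default_rng(1)
x_tilde = rng.normal(size=(1000, 3))
y = (x_tilde @ np.array([1.0, -1.0, 0.5]) > 0).astype(float)
w = rng.normal(size=3)

def logistic_grad(w, xb, yb):
    y_hat = 1.0 / (1.0 + np.exp(-(xb @ w)))
    return xb.T @ (y_hat - yb) / len(yb)

full = logistic_grad(w, x_tilde, y)              # gradient over the whole training set
idx = rng.choice(1000, size=32, replace=False)
mini = logistic_grad(w, x_tilde[idx], y[idx])    # gradient over one mini-batch
print(np.linalg.norm(mini - full))               # the "noise"; shrinks as the batch grows
```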

$w_{i+1} \leftarrow w_{i} - \gamma \frac{1}{B} \sum_{t\in \mathcal{B}} \nabla_w C_t(x_t,y_t;w_i)$

General Notes

Say your cost function is convex, and you care only about decreasing this cost (i.e., you are not worried about overfitting):

• Larger batch size will always give you "better" gradients.
• But with diminishing returns beyond a certain batch size.
• Computational cost is number of examples per iteration $$\times$$ number of iterations for convergence
• Higher batch means more computation per iteration, but may mean fewer iterations required to converge.
• Best combination of step size and batch size is an empirical question.
• Another factor: parallelism.
• Note that you can compute the gradients for all samples in your batch in parallel.
• Ideally, you want to at least "saturate" all available parallel threads.

$w_{i+1} \leftarrow w_{i} - \gamma \frac{1}{B} \sum_{t\in \mathcal{B}} \nabla_w C_t(x_t,y_t;w_i)$

General Notes

If your cost function is NOT convex, and/or you are worried about overfitting:

• Smaller batches (noisier gradients) might prevent you from overfitting to the training set.
• Try different batch sizes, and check performance on a dev set, not just the train set.

Momentum

Standard SGD:

$g_{i+1} = \frac{1}{B} \sum_{t\in \mathcal{B}} \nabla_w C_t(x_t,y_t;w_i)$ $w_{i+1} \leftarrow w_{i} - \gamma g_{i+1}$

With Momentum:

For $$\beta < 1$$:

$g_{i+1} = \frac{1}{B} \sum_{t\in \mathcal{B}} \nabla_w C_t(x_t,y_t;w_i) + \beta g_{i}$ $w_{i+1} \leftarrow w_{i} - \gamma g_{i+1}$

• Keep adding the gradients from previous batches, again and again across iterations, with decaying weights.
• Remember: $$g_{i}$$ was computed with respect to a different position in $$w$$ space.
• People often use $$\beta$$ as high as $$0.9$$ or $$0.99$$.
• Will need to revisit "best" value of $$\gamma$$ when you change $$\beta$$.
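A small sketch of the momentum update (NumPy; `grad_batch` and the values of $$\beta$$ and $$\gamma$$ are illustrative, following the mini-batch SGD sketch above):

```python
import numpy as np

def sgd_momentum(grad_batch, w0, batches, gamma=0.01, beta=0.9):
    # batches: an iterable of (x_batch, y_batch) pairs
    w = w0.copy()
    g = np.zeros_like(w)
    for xb, yb in batches:
        g = grad_batch(w, xb, yb) + beta * g   # g_{i+1} = batch gradient + beta * g_i
        w = w - gamma * g                      # w_{i+1} = w_i - gamma * g_{i+1}
    return w
```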