CSE 559A: Computer Vision

Fall 2018: T-R: 11:30-1pm @ Lopata 101

Instructor: Ayan Chakrabarti (ayan@wustl.edu).
Course Staff: Zhihao Xia, Charlie Wu, Han Liu

Oct 25, 2018

General

• Problem Set 3 Due tonight
• Problem Set 4 will be posted shortly
• Reminder: Tomorrow's Office Hours will be in Lopata 103

Grouping and Segmentation

Graph Based Approaches

• Define a set of vertices $$V$$, edges $$E$$ with weights $$w(v_1,v_2)$$
• Cut = partition of vertices into two sets of nodes $$A$$ and $$B$$
• $$A\cup B = V$$
• $$A\cap B = \emptyset$$
• Cut$$(A,B) = \sum_{u\in A, v \in B} w(u,v)$$

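Minimizing the cut alone can lead to isolated points: cutting off a single weakly connected node is often the cheapest partition.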

• Use normalized measure
• Assoc$$(A,V) = \sum_{u\in A, v \in V} w(u,v)$$
• Assoc$$(B,V) = \sum_{u\in B, v \in V} w(u,v)$$

$\text{NCut}(A,B) = \frac{\text{Cut}(A,B)}{\text{Assoc}(A,V)} + \frac{\text{Cut}(A,B)}{\text{Assoc}(B,V)}$

$= 2 - \left(\frac{\text{Assoc}(A,A)}{\text{Assoc}(A,V)} + \frac{\text{Assoc}(B,B)}{\text{Assoc}(B,V)}\right) = 2 - \text{NAssoc}(A,B)$
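As a small numerical check of these definitions, here is a sketch in Python with NumPy. The 5-node weighted graph and the helper names `assoc`, `ncut`, and `nassoc` are assumptions made up for illustration; they are not from the slides. The graph is chosen so that the plain cut prefers isolating a weakly attached node, while the normalized cut prefers the balanced split.

```python
import numpy as np

# Hypothetical weighted path graph 0-1-2-3-4 (symmetric weights).
W = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.3, 0.0, 0.0],
    [0.0, 0.3, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.2, 0.0],
])
V = list(range(W.shape[0]))

def assoc(W, S, T):
    """Sum of w(u, v) over u in S and v in T. Cut(A, B) is assoc(W, A, B)."""
    return W[np.ix_(S, T)].sum()

def ncut(W, A, B):
    cut = assoc(W, A, B)
    return cut / assoc(W, A, V) + cut / assoc(W, B, V)

def nassoc(W, A, B):
    return assoc(W, A, A) / assoc(W, A, V) + assoc(W, B, B) / assoc(W, B, V)

balanced = ([0, 1], [2, 3, 4])   # split at the 0.3 edge
isolated = ([4], [0, 1, 2, 3])   # cut off the weakly attached node 4

cut_balanced, cut_isolated = assoc(W, *balanced), assoc(W, *isolated)
ncut_balanced, ncut_isolated = ncut(W, *balanced), ncut(W, *isolated)
print("Cut :", cut_balanced, cut_isolated)
print("NCut:", ncut_balanced, ncut_isolated)
```

On this example the identity NCut = 2 - NAssoc can also be checked by evaluating `ncut(W, *balanced) + nassoc(W, *balanced)`.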

• Instead of minimizing cut, minimize normalized cut, or maximize normalized association.
• Note that for this case, you might not even need a "unary" cost.
• The trivial solution of assigning all nodes to $$A$$ (or all to $$B$$) is no longer optimal, because the denominator of one of the two terms goes to 0.
• So this can be viewed as a form of clustering
• NP-hard even for the binary case. But there are approximate solvers based on "real-valued" relaxations.

Grouping and Segmentation

Normalized Cuts: Sketch

Say $$x_i=1$$ if node $$i$$ is in $$A$$, and $$x_i=-1$$ if it's in $$B$$.

Then we can show that the NCut expression simplifies as:

$\text{NCut}(A,B) = \frac{\sum_{x_i>0,x_j<0} -w_{ij}x_ix_j}{\sum_{x_i > 0}d_i} + \frac{\sum_{x_i<0,x_j>0} -w_{ij}x_ix_j}{\sum_{x_i < 0}d_i}$

where $$d_i = \sum_j w_{ij}$$.

Then, you can construct two $$N\times N$$ matrices $$D$$ and $$W$$ such that $$D$$ is diagonal with entries $$d_i$$ on its diagonal, and $$W$$ is symmetric with $$W_{ij} = w_{ij}$$. Stack $$x_i$$s into an N-dimensional vector $$x$$.

And you can show that

$\text{NCut}(x) = \frac{y^T(D-W)y}{y^TDy}$

where $$y = (1+x)-b(1-x), b=(\sum_{x_i>0}d_i)/(\sum_{x_i<0}d_i)$$ and you have $$y^TD1=0$$.

You can turn this into a minimization problem over $$y$$. But since $$x\in\{-1,1\}^N$$, each $$y_i$$ can take only two values, so this is still combinatorial. If you instead let $$y$$ take on real values, then this turns into an eigenvalue problem.
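The identity between the NCut value and the quotient in $$y$$ can be checked numerically. The sketch below does this for one partition of a hypothetical 5-node graph (the graph and all variable names are assumptions for illustration); `ones` in the final comment refers to the all-ones vector written as $$1$$ above.

```python
import numpy as np

# Hypothetical symmetric weight matrix for a 5-node graph.
W = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.3, 0.0, 0.0],
    [0.0, 0.3, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.2, 0.0],
])
d = W.sum(axis=1)          # d_i = sum_j w_ij
D = np.diag(d)

x = np.array([1, 1, -1, -1, -1])   # +1 for nodes in A, -1 for nodes in B
in_A, in_B = x > 0, x < 0

# NCut computed directly from the definition.
cut = W[np.ix_(in_A, in_B)].sum()
ncut_direct = cut / d[in_A].sum() + cut / d[in_B].sum()

# NCut computed from the quotient in y.
b = d[in_A].sum() / d[in_B].sum()
y = (1 + x) - b * (1 - x)
ncut_quotient = (y @ (D - W) @ y) / (y @ D @ y)

print(ncut_direct, ncut_quotient)
print(y @ d)               # y^T D 1, should be (numerically) zero
```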

Grouping and Segmentation

$\arg \min_y \frac{y^T(D-W)y}{y^TDy} \quad \text{subject to } y^TD1 = 0$

• Find the eigenvectors $$z$$ of $$D^{-0.5}(D-W)D^{-0.5}$$ with the smallest eigenvalues, and get $$y=D^{-0.5}z$$.
• The lowest eigenvalue will be zero, with $$z = D^{0.5}1$$ (a constant $$y$$, i.e., no partition). Ignore that.
• The eigenvector corresponding to the second lowest eigenvalue will give you a partition.
• Compute $$z \rightarrow y \rightarrow x$$, thresholding by sign to figure out the label.
• The matrices are huge, but sparse. The eigen-decomposition can be done efficiently with the Lanczos method.
• Can apply this recursively, or look at the next eigenvectors to further sub-partition the image.
• For details
• Shi and Malik, "Normalized Cuts and Image Segmentation," PAMI 2000.
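The steps above can be sketched on a toy graph as follows. This is a minimal dense-matrix illustration, not the implementation from the paper: the function name `ncut_partition` and the two-triangle graph are assumptions, and it uses `numpy.linalg.eigh` for a full decomposition. For the huge sparse matrices that arise from images, one would instead use a sparse Lanczos-type solver (for example `scipy.sparse.linalg.eigsh`).

```python
import numpy as np

def ncut_partition(W):
    """Two-way partition from the second-smallest eigenvector (dense sketch)."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # D^{-1/2} (D - W) D^{-1/2}, built by scaling rows and columns.
    L_sym = d_inv_sqrt[:, None] * (np.diag(d) - W) * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    z = eigvecs[:, 1]                          # skip the zero eigenvalue
    y = d_inv_sqrt * z                         # y = D^{-1/2} z
    return np.where(y > 0, 1, -1), eigvals     # threshold by sign

# Hypothetical graph: two triangles {0,1,2} and {3,4,5} joined by one weak edge.
W = np.zeros((6, 6))
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 0.1)]
for i, j, w in edges:
    W[i, j] = W[j, i] = w

labels, eigvals = ncut_partition(W)
print(labels)
print(eigvals[:2])
```

The overall sign of an eigenvector is arbitrary, so which triangle gets label +1 is not meaningful; only the grouping is.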

Machine Learning

So far, given an input $$X$$ and desired output $$Y$$ we have

• Tried to explain the relationship of how $$X$$ results from $$Y$$
• $$X$$ = observed image(s) / $$Y$$ = clean image, sharp image, surface normal, depth
• Noise, photometry, geometry, ...
• Often put a hand-crafted "regularization" cost to compute the inverse
• Depth maps are smooth
• Image gradients are small
• But sometimes, there is no way to write down a relationship between $$X$$ and $$Y$$.
• $$X$$ = Image, $$Y$$ = Does the image contain a dog ?
• Even if there is, the hand-crafted regularization cost is often arbitrary.
• Real images contain far more complex and subtle regularity.

Machine Learning

Instead, we are going to assume that there is some underlying joint probability distribution $$P_{XY}(x,y)$$

(using small letters for actual values, capitals for "random variables")

• And our goal is to compute:
• The best estimate of $$y$$ conditioned on a specific value of $$x$$,
• To minimize some notion of "risk" or "loss"

Define a loss function $$L(y,\hat{y})$$, which measures how much we dislike $$\hat{y}$$ as our estimate,
when $$y$$ is the right answer.

Examples

• $$L(y,\hat{y}) = \|y-\hat{y}\|^2$$
• $$L(y,\hat{y}) = \|y-\hat{y}\|$$
• $$L(y,\hat{y}) = 0$$ if $$y = \hat{y}$$, and some $$C$$ otherwise.

Ideally,

$\hat{y}(x) = \arg \min_{\hat{y}} \int L(y,\hat{y})P(y|x)~~dy$

What is $$P(y|x)$$ in terms of $$P_{XY}$$ ?

$P(y|x) = \frac{P_{XY}(x,y)}{\int P_{XY}(x,y')dy'}$
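To make the two formulas concrete, here is a sketch for a case where the joint distribution is known and discrete, so the integrals become sums. The table `P_xy`, the value grid `y_values`, and the helper names are all assumptions invented for this example. It computes $$P(y|x)$$ by normalizing a row of the joint table, then picks the estimate with the lowest expected loss from a set of candidates.

```python
import numpy as np

# Hypothetical joint P_XY: rows index x in {0, 1}, columns index y in {0, 1, 2, 3}.
P_xy = np.array([
    [0.05, 0.05, 0.10, 0.30],
    [0.20, 0.15, 0.10, 0.05],
])
y_values = np.array([0.0, 1.0, 2.0, 3.0])

def conditional(P_xy, x):
    """P(y | x) = P_XY(x, y) / sum over y' of P_XY(x, y')."""
    return P_xy[x] / P_xy[x].sum()

def best_estimate(P_xy, x, loss, candidates):
    """Candidate y_hat with the lowest expected loss under P(y | x)."""
    p = conditional(P_xy, x)
    risks = [np.sum(p * loss(y_values, c)) for c in candidates]
    return candidates[int(np.argmin(risks))]

squared = lambda y, y_hat: (y - y_hat) ** 2
zero_one = lambda y, y_hat: (y != y_hat).astype(float)

p0 = conditional(P_xy, 0)
y_hat_squared = best_estimate(P_xy, 0, squared, np.linspace(0.0, 3.0, 301))
y_hat_zero_one = best_estimate(P_xy, 0, zero_one, y_values)

print(p0)
print(y_hat_squared, p0 @ y_values)   # squared loss: close to the conditional mean
print(y_hat_zero_one)                 # 0-1 loss: the most probable value
```

The choice of loss changes the answer even though the distribution is the same: the squared loss picks (approximately) the conditional mean, while the 0-1 loss picks the mode.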

• So we have a loss (depends on the application)
• We can compute $$P(y|x)$$ from $$P_{XY}$$.
• But we don't know $$P_{XY}$$ !

Assume we are given training examples: samples $$(x,y) \sim P_{XY}$$ drawn from the true joint distribution.

Given $$\{(x_i,y_i)\}$$ as samples from $$P_{XY}$$, we could:

• Estimate $$P_{XY}$$
• Choose parametric form for the joint distribution (Gaussian, Mixture of Gaussians, Bernoulli, $$\ldots$$)
• Estimate the parameters of that parametric form to "best fit" the data.
• Depending again on some notion of fit (often likelihood)

$P_{XY}(x,y) = f(x,y;\theta)$

$\theta = \arg \max_\theta \sum_i \log f(x_i,y_i;\theta)$

Maximum Likelihood Estimation
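As one illustration, suppose the parametric form $$f$$ is a single Gaussian over the stacked vector $$(x,y)$$. That choice, the synthetic data, and the function names below are assumptions for this sketch. In this case the maximizing $$\theta$$ has a closed form: the sample mean and the sample covariance normalized by the number of samples.

```python
import numpy as np

def gaussian_log_likelihood(samples, mean, cov):
    """Sum over i of log N(s_i; mean, cov), for samples of shape (T, k)."""
    k = samples.shape[1]
    diff = samples - mean
    _, logdet = np.linalg.slogdet(cov)
    mahal = np.einsum("ti,ij,tj->t", diff, np.linalg.inv(cov), diff)
    return np.sum(-0.5 * (mahal + logdet + k * np.log(2 * np.pi)))

def fit_gaussian_mle(samples):
    """Closed-form maximum likelihood estimate of theta = (mean, cov)."""
    mean = samples.mean(axis=0)
    diff = samples - mean
    cov = diff.T @ diff / samples.shape[0]   # divides by T, not T - 1
    return mean, cov

# Synthetic training pairs (x_i, y_i); the generating process is made up.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + 0.5 * rng.normal(size=500)
samples = np.stack([x, y], axis=1)

mean, cov = fit_gaussian_mle(samples)
ll_mle = gaussian_log_likelihood(samples, mean, cov)
ll_other = gaussian_log_likelihood(samples, mean + 0.1, 1.2 * cov)
print(ll_mle, ll_other)   # any other theta gives a lower log-likelihood
```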

So that's one way of doing things ...

• You're doing an optimization to learn $$P_{XY}$$, and then another minimization at "test time" for every input $$x$$.
• You're approximating $$P_{XY}$$ with some choice of the parametric form $$f$$.
• And it's possible that the $$\theta$$ that maximizes likelihood is not the $$\theta$$ that minimizes loss.

Machine Learning

Given a bunch of samples $$\{(x_i,y_i)\}$$ from $$P_{XY}$$,

we want to learn a function $$y = f(x)$$, such that

given a typical $$x,y$$ from $$P_{XY}$$, the expected loss $$L(y,f(x))$$ is low.

$f = \arg \min_f {\int} \left( \int L(y,f(x)) p(y|x) dy \right) {p(x) dx}$

$f = \arg \min_f \int \int L(y,f(x))~~p_{XY}(x,y) dx dy$

What we're going to do is replace the double integration with a summation over samples!

$f = \arg \min_f \sum_i L(y_i,f(x_i))$

Empirical Risk Minimization

• So instead of first fitting the probability distribution from training data, and then
given a new input, minimizing the loss under that distribution ...
• We are going to do a search over possible functions that "do well" on the training data,
and assume that a function that minimizes "empirical risk" also minimizes "expected risk".
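A minimal sketch of this search, for a case where the set of candidate functions is small enough to enumerate. Everything here is an assumption for illustration: the threshold classifiers, the 0-1 loss, and the synthetic data. Real hypothesis spaces are usually continuous and are searched by optimization instead of enumeration.

```python
import numpy as np

def empirical_risk(f, loss, xs, ys):
    """Average loss of f over the training samples."""
    return np.mean([loss(y, f(x)) for x, y in zip(xs, ys)])

def erm(hypotheses, loss, xs, ys):
    """Return the hypothesis with the lowest empirical risk, and that risk."""
    risks = [empirical_risk(f, loss, xs, ys) for f in hypotheses]
    best = int(np.argmin(risks))
    return hypotheses[best], risks[best]

zero_one = lambda y, y_hat: float(y != y_hat)

# Hypothetical candidates: threshold classifiers f_c(x) = 1 if x > c else 0.
thresholds = np.linspace(-2.0, 2.0, 41)
hypotheses = [lambda x, c=c: int(x > c) for c in thresholds]

# Synthetic training set: noisy labels that mostly follow x > 0.5.
rng = np.random.default_rng(0)
xs = rng.normal(size=200)
ys = (xs + 0.3 * rng.normal(size=200) > 0.5).astype(int)

f_best, risk_best = erm(hypotheses, zero_one, xs, ys)
print(risk_best)
```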

Machine Learning

Formally

• Given inputs $$x\in\mathcal{X}$$ and outputs $$y\in\mathcal{Y}$$, we want to learn a function $$y=f(x)$$, i.e., $$f: \mathcal{X}\rightarrow \mathcal{Y}$$
• The function should be a "good" predictor of $$y$$, as measured in terms of a risk or loss function: $$L(y,\hat{y})$$.
• Ideally, we want to find the best $$f\in\mathcal{H}$$, among some class or space of functions $$\mathcal{H}$$ (called the hypothesis space), which minimizes the expected loss under the joint distribution $$P_{XY}(x,y)$$: $f = \arg \min_{f\in\mathcal{H}} \int_\mathcal{X}\int_\mathcal{Y}~~L(y,f(x))~~P_{XY}(x,y)~~dy~dx$
• We don't know this joint distribution, but we have a training set $$(x_1,y_1), (x_2,y_2),\ldots (x_T,y_T)$$, which (we hope!) are samples from $$P_{XY}$$.
• So we approximate the expectation over $$P_{XY}$$ with an average over the training set $$(x_t,y_t)$$, $f = \arg \min_{f\in\mathcal{H}} \frac{1}{T}\sum_t L(y_t,f(x_t))$

You're given data. Choose a loss function that matches your task, a hypothesis space $$\mathcal{H}$$, and minimize.

Machine Learning

$f = \arg \min_{f\in\mathcal{H}} \frac{1}{T}\sum_t L(y_t,f(x_t))$

Consider:

• $$x \in \mathcal{X} = \mathbb{R}^d$$
• $$y \in \mathcal{Y} = \mathbb{R}$$
• $$\mathcal{H}$$ is the space of all "linear functions" of $$\mathcal{X}$$.
• $$f(x;w,b) = w^Tx + b$$, $$~~~~w\in\mathbb{R}^d,b\in\mathbb{R}$$
• Minimization over $$f\in\mathcal{H}$$ becomes a minimization over $$w,b$$
• Consider $$L(y,\hat{y}) = (y-\hat{y})^2$$

And then we have our familiar least-squares regression!

Machine Learning

$f = \arg \min_{f\in\mathcal{H}} \frac{1}{T}\sum_t L(y_t,f(x_t))$

$w,b = \arg \min_{w\in\mathbb{R}^d,b\in\mathbb{R}} \frac{1}{T} \sum_t (y_t - w^Tx_t -b)^2$

So we know how to solve this: take the derivative with respect to $$w$$ and $$b$$, and set it to 0.

Not just for fitting "lines". Imagine $$x$$ is a vector corresponding to a patch of intensities in a noisy image, and $$y$$ is the corresponding clean intensity of the center pixel. You could use this to "learn" a linear "denoising filter", by fitting to many example pairs of noisy patches and noise-free intensities.
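A rough sketch of the denoising idea, using a 1-D signal as a stand-in for image patches to keep it short. The synthetic random-walk signal, the noise level, and the patch size are all assumptions. Note that the error here is measured on the same examples used for fitting; since the identity filter (weight 1 on the center, 0 elsewhere) is in the hypothesis space, the fitted filter cannot do worse than the noisy input on those examples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, half = 5000, 3                                # signal length, patch half-width
clean = 0.1 * np.cumsum(rng.normal(size=n))      # slowly varying synthetic signal
noisy = clean + 0.5 * rng.normal(size=n)

# Each row of X is a noisy patch plus a constant 1; y is the clean center value.
centers = np.arange(half, n - half)
X = np.stack([noisy[c - half:c + half + 1] for c in centers])
X = np.hstack([X, np.ones((len(centers), 1))])
y = clean[centers]

w, *_ = np.linalg.lstsq(X, y, rcond=None)        # learned filter taps + bias
denoised = X @ w

mse_noisy = np.mean((noisy[centers] - y) ** 2)
mse_denoised = np.mean((denoised - y) ** 2)
print(mse_noisy, mse_denoised)
```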

Machine Learning

$w,b = \arg \min_{w\in\mathbb{R}^d,b\in\mathbb{R}} \frac{1}{T} \sum_t (y_t - w^Tx_t -b)^2$

Define $$\tilde{x}_t = [x_t^T,1]^T$$

$w = \arg \min_{w\in\mathbb{R}^{d+1}} \frac{1}{T} \sum_t (y_t - w^T\tilde{x}_t)^2$

$w = \left(\sum_t \tilde{x}_t\tilde{x}_t^T\right)^{-1} \left(\sum_t \tilde{x}_ty_t\right)$
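The closed-form expression translates almost directly to NumPy. In the sketch below, the function name and the synthetic data are assumptions; it solves the linear system instead of forming the inverse explicitly, which gives the same $$w$$.

```python
import numpy as np

def fit_linear_least_squares(X, y):
    """X: (T, d), y: (T,). Returns w in R^{d+1}; the last entry is the bias b."""
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])   # rows are x~_t^T
    A = X_tilde.T @ X_tilde                              # sum_t x~_t x~_t^T
    c = X_tilde.T @ y                                    # sum_t x~_t y_t
    return np.linalg.solve(A, c)

# Synthetic data from a known linear function plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true, b_true = np.array([1.0, -2.0, 0.5]), 0.3
y = X @ w_true + b_true + 0.01 * rng.normal(size=100)

w = fit_linear_least_squares(X, y)
print(w)   # first three entries estimate w_true, the last estimates b_true
```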

Now, let's say we wanted to fit a polynomial instead of a linear function.

For $$x\in\mathbb{R}$$,

$$f(x;w_0,w_1,w_2,\ldots,w_n) = w_0 + w_1 x + w_2x^2 + \ldots + w_nx^n$$.

This is our hypothesis space. Same loss function $$L(y,\hat{y}) = (y-\hat{y})^2$$.

Machine Learning

$w = \arg \min_{w\in\mathbb{R}^{n+1}} \frac{1}{T} \sum_t (y_t - w_0 - w_1 x_t - w_2 x_t^2 - \ldots - w_nx_t^n)^2$

Try to work out an expression for $$w$$.

Set $$\tilde{x}_t = \left[1,x_t,x_t^2,x_t^3,\ldots x_t^n\right]^T$$.

And you get exactly the same equation !

$w = \left(\sum_t \tilde{x}_t\tilde{x}_t^T\right)^{-1} \left(\sum_t \tilde{x}_ty_t\right)$

• But now you are inverting a larger $$(n+1)\times(n+1)$$ matrix.
• Can apply the same idea to polynomials of multi-dimensional $$x$$.
• Can apply least-squares fitting to any task with an $$L2$$ loss, and
where the hypothesis space is linear in the parameters (not necessarily in the input).

E.g.: $$f(x;w_0,w_1,w_2,\ldots,w_n) = w_0 + w_1 x + w_2x^2 + \ldots + w_nx^n$$.
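A short sketch of the polynomial case: only the construction of $$\tilde{x}_t$$ changes, and the solve is the same as for the linear fit. The function name and the noise-free test polynomial are assumptions for illustration.

```python
import numpy as np

def fit_polynomial(x, y, n):
    """Least-squares fit of w_0 + w_1 x + ... + w_n x^n to scalar data."""
    X_tilde = np.vander(x, n + 1, increasing=True)   # columns 1, x, x^2, ..., x^n
    return np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)

# Noise-free samples of a known quadratic, so the fit should recover it.
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 - 2.0 * x + 3.0 * x ** 2

w = fit_polynomial(x, y, n=2)
print(w)   # coefficients in the order w_0, w_1, w_2
```

The model is non-linear in the input $$x$$ but linear in the parameters $$w$$, which is all the closed-form solution needs.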

Machine Learning

While we train on empirical loss,

$\frac{1}{T}\sum_t L(y_t,f(x_t))$

we care about the actual expected loss:

$\int_\mathcal{X}\int_\mathcal{Y}~~L(y,f(x))~~P_{XY}(x,y)~~dy~dx$

Why ? Because we don't just want to explain the training set. We want to do
well on new inputs. We want to "generalize" from the training set.

This is why a more complex function that exactly fits the training set can be a bad choice: it may "generalize" poorly.
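The gap between empirical and expected loss can be explored with a small experiment. The sketch below fits polynomials of increasing degree to 10 noisy training points and evaluates them on a large held-out sample, which serves as a proxy for the expected loss. The target function, noise level, and degrees are assumptions chosen for illustration. The training error can only go down as the degree grows, since the hypothesis spaces are nested; the held-out error carries no such guarantee.

```python
import numpy as np

def poly_features(x, n):
    return np.vander(x, n + 1, increasing=True)

def fit(x, y, n):
    w, *_ = np.linalg.lstsq(poly_features(x, n), y, rcond=None)
    return w

def mse(w, x, y):
    return np.mean((poly_features(x, len(w) - 1) @ w - y) ** 2)

rng = np.random.default_rng(0)
truth = lambda x: np.sin(np.pi * x)              # made-up underlying function
x_train = rng.uniform(-1, 1, size=10)
y_train = truth(x_train) + 0.2 * rng.normal(size=10)
x_test = rng.uniform(-1, 1, size=1000)
y_test = truth(x_test) + 0.2 * rng.normal(size=1000)

results = {}
for n in (1, 3, 9):
    w = fit(x_train, y_train, n)
    results[n] = (mse(w, x_train, y_train), mse(w, x_test, y_test))
    print(n, results[n])   # (training error, held-out error)
```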

Machine Learning

Error = Bayes Error + Approximation Error + Estimation Error

• Bayes Error: This is the error due to the uncertainty in $$p(y|x)$$. This is the error you would have even if you knew the exact distribution and could craft a function $$f$$ with infinite complexity.

$\text{Bayes Error} = \int \left( \min_{\hat{y}} \int L(y,\hat{y}) P_{XY}(x,y) dy\right) dx$

• Approximation Error: This is the error due to the limited capacity of our hypothesis space $$\mathcal{H}$$. It is the error of the true best function $$f\in\mathcal{H}$$, minus the Bayes error, assuming we had perfect knowledge of $$P_{XY}$$.
Also called the "Bias"$$^2$$.
• Estimation Error: This is the remaining component of error, caused by the fact that we don't know the true $$P_{XY}$$, but only have a limited number of samples.
This depends on the size of our training set (and also, how well we are able to do the minimization). Called "variance".
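
One way to write this decomposition out, using notation introduced here only for illustration: let $$R(f)$$ be the expected loss of $$f$$ under $$P_{XY}$$, $$f^*$$ the best possible predictor with no restriction, $$f_\mathcal{H}$$ the best predictor in $$\mathcal{H}$$, and $$\hat{f}$$ the function actually learned from the training set. Then the sum telescopes:

$R(\hat{f}) = \underbrace{R(f^*)}_{\text{Bayes Error}} + \underbrace{R(f_\mathcal{H}) - R(f^*)}_{\text{Approximation Error}} + \underbrace{R(\hat{f}) - R(f_\mathcal{H})}_{\text{Estimation Error}}$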

Machine Learning

Error = Bayes Error + Approximation Error + Estimation Error

Choosing a simple function class: higher approximation error, lower estimation error.

Choosing a complex function class: lower approximation error, higher estimation error.

How do I decrease Bayes Error ? By getting better inputs!

Machine Learning

Overfitting

• Definitions of complexity of a function or hypothesis space $$\mathcal{H}$$: VC-dimension, Rademacher complexity
• Try to capture that one function or function space provides a "simpler" explanation than another
• Useful as intuition. But these measures often don't "work" for very complex functions and tasks.
• But the idea is:
• Given two functions with the same error, you want to choose one that's simpler.
• You never want to consider a class of functions that can fit 'random' $$T$$ pairs of $$(x,y)$$, where $$T$$ is the size of your training set.
• Choose hypothesis space based on size of training set.
• Add a "regularizer" (for example, a squared penalty on higher order polynomial coefficients).

Public Service Announcement: Any regularizer is biased by what you think is "simple". There is no universal definition of simple.
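As a sketch of the regularizer mentioned in the last bullet, the code below adds a squared penalty on the higher-order polynomial coefficients to the least-squares objective. The function name, the weight `lam`, the choice to start penalizing at the quadratic term, and the data are all assumptions for this example. The minimizer still has a closed form because the penalty is quadratic in $$w$$.

```python
import numpy as np

def fit_regularized_polynomial(x, y, n, lam, first_penalized=2):
    """Minimize sum_t (y_t - w^T x~_t)^2 + lam * sum_{k >= first_penalized} w_k^2."""
    X = np.vander(x, n + 1, increasing=True)
    penalty = np.zeros(n + 1)
    penalty[first_penalized:] = 1.0          # w_0 and w_1 are left unpenalized
    return np.linalg.solve(X.T @ X + lam * np.diag(penalty), X.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=15)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=15)

lams = (1e-6, 1e-3, 1e-1, 10.0)
penalized_norms = []
for lam in lams:
    w = fit_regularized_polynomial(x, y, n=9, lam=lam)
    penalized_norms.append(np.sum(w[2:] ** 2))
    print(lam, penalized_norms[-1])   # shrinks as the regularization weight grows
```

Which coefficients to penalize, and how strongly, is exactly the kind of judgment about "simple" that the announcement above warns about.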

Machine Learning

Overfitting: Good ML "Hygiene"

Remember, you can overfit not just the parameters, but your design choices !

For any given task:

• Have train, dev, val, and test sets.
• Train your estimators on the train set.
• Choose hyperparameters based on dev set.
• Function class
• Regularization weight
• Number of iterations to train, etc.
• Keep periodically checking to see if it generalizes to val set.
• Look at the test set only at the "end" of the project.
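A minimal sketch of this workflow, under assumptions: the split fractions, the helper `split_indices`, the synthetic data, and the use of polynomial degree as the only hyperparameter are all made up for illustration. The point is only which set is touched at which step.

```python
import numpy as np

def split_indices(T, fractions=(0.6, 0.15, 0.15, 0.1), seed=0):
    """Shuffle indices 0..T-1 and cut them into train / dev / val / test."""
    order = np.random.default_rng(seed).permutation(T)
    cuts = np.cumsum([int(round(f * T)) for f in fractions[:-1]])
    return np.split(order, cuts)

train, dev, val, test = split_indices(1000)

# Synthetic data set.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=1000)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=1000)

def features(idx, n):
    return np.vander(x[idx], n + 1, increasing=True)

def error(n, idx):
    """Fit degree-n polynomial on the train set, report squared error on idx."""
    w, *_ = np.linalg.lstsq(features(train, n), y[train], rcond=None)
    return np.mean((features(idx, n) @ w - y[idx]) ** 2)

degrees = [1, 3, 5, 7]
best_degree = min(degrees, key=lambda n: error(n, dev))   # chosen on dev only
print(best_degree, error(best_degree, val))               # periodic check on val
# error(best_degree, test) would be computed once, at the end of the project.
```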