CSE 559A: Computer Vision



Fall 2018: T-R: 11:30-1pm @ Lopata 101

Instructor: Ayan Chakrabarti (ayan@wustl.edu).
Course Staff: Zhihao Xia, Charlie Wu, Han Liu

http://www.cse.wustl.edu/~ayan/courses/cse559a/

Sep 13, 2018

Administrivia

  • Tomorrow
    • Zhihao's Office Hours back in Jolley 309: 10:30am-Noon


  • This Friday: Regular Office Hours
  • Next Friday: Recitation for PSET 1
    • Try all problems before coming to recitation
  • Monday Office Hours again in Jolley 217

Image Restoration

Statistics / Estimation Recap

Setting

  • There is a true image (or image-like object) \(x\) that we want to estimate
  • What we have is some degraded observation \(y\) that is based on \(x\)
  • The degradation might be "stochastic": so we say \(p(y | x)\)

Simplest Case (single pixel: \(x,y \in \mathbb{R}\)):

\[y = x + \sigma \epsilon \rightarrow p(y|x) = \mathcal{N}(y; \mu = x, \sigma^2)\]

Estimate \(x\) from \(y\): "denoising"

Image Restoration

Statistics / Estimation Recap

  • \(p(y|x)\) is our observation model, called the likelihood function
    • Likelihood of observation \(y\) given true image \(x\)
  • Maximum Likelihood Estimator: \(\hat{x} = \arg \max_x p(y|x)\)

For \(p(y|x) = \mathcal{N}(y; x, \sigma^2)\), what is ML Estimate ?

\[p(y|x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(- \frac{(y - x)^2}{2\sigma^2}\right)\]
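The density is maximized when the exponent is smallest, i.e., when \((y-x)^2 = 0\):

\[\hat{x} = \arg \max_x p(y|x) = \arg \min_x (y-x)^2 = y\]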

\(\hat{x}=y\)

That's a little disappointing.

Image Restoration

Statistics / Estimation Recap

  • What we really want to do is maximize \(p(x | y)\), not the likelihood \(p(y | x)\)
    • \(p(y|x)\) is the distribution of observation \(y\) given true image \(x\)
    • \(p(x|y)\) is the distribution of true \(x\) given observation \(y\)
  • Bayes Rule

\[p(x|y) = \frac{p(y|x)p(x)}{p(y)}\]

Simple Derivation

\[p(x,y) = p(x)p(y|x) = p(y)p(x|y) \rightarrow p(x|y) = \frac{p(y|x)p(x)}{p(y)}\]

\[p(y) = \int p(y|x')p(x') dx'\]

Image Restoration

Statistics / Estimation Recap

\[p(x|y) = \frac{p(y|x)p(x)}{\int p(y|x')p(x') dx'}\]

  • \(p(y|x)\): Likelihood
  • \(p(x)\): "Prior" Distribution on \(x\)
    • What we believe about \(x\) before we see the observation
  • \(p(x|y)\): Posterior Distribution of \(x\) given observation \(y\)
    • What we believe about \(x\) after we see the observation

Image Restoration

Statistics / Estimation Recap

\[p(x|y) = \frac{p(y|x)p(x)}{\int p(y|x')p(x') dx'}\]

Maximum A Posteriori (MAP) Estimation

\[\hat{x} = \arg \max_{x} p(x|y)\]

  • The most likely answer under the posterior distribution
  • Other Estimators Possible: Try to minimize some "risk" or loss function \(L(\hat{x},x)\)
    • Measures how bad an answer \(\hat{x}\) is when true value is \(x\)
  • Minimize Expected Risk Under Posterior

\[\hat{x} = \arg \min_{x} \int L(x,x') p(x'|y) dx'\]

  • Let's say \(L(x,x') = (x-x')^2\). What is \(\hat{x}\) ?

\[\hat{x} = \mathbb{E}_{p(x|y)} x = \int x~~p(x|y) dx\]
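One way to see this: differentiate the expected loss with respect to \(x\) and set it to zero.

\[\frac{d}{dx} \int (x-x')^2 p(x'|y) dx' = 2\int (x-x') p(x'|y) dx' = 2\left(x - \mathbb{E}_{p(x|y)} x\right) = 0\]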

Image Restoration

Statistics / Estimation Recap

\[p(x|y) = \frac{p(y|x)p(x)}{\int p(y|x')p(x') dx'}\]

Maximum A Posteriori (MAP) Estimation

\[\hat{x} = \arg \max_{x} p(x|y)\]

How do we compute this ?

\[\hat{x} = \arg \max_{x} p(x|y) = \arg \max_{x} \frac{p(y|x)p(x)}{\int p(y|x')p(x') dx'}\]

\[= \arg \max_{x} p(y|x)p(x)\]

Image Restoration

Statistics / Estimation Recap

\[p(x|y) = \frac{p(y|x)p(x)}{\int p(y|x')p(x') dx'}\]

Maximum A Posteriori (MAP) Estimation

\[\hat{x} = \arg \max_{x} p(x|y)\]

How do we compute this ?

\[\hat{x} = \arg \max_{x} p(x|y) = \arg \max_{x} \frac{p(y|x)p(x)}{\int p(y|x')p(x') dx'}\]

\[= \arg \max_{x} p(y|x)p(x) = \arg \max_{x} \log \left[p(y|x)p(x)\right]\]

\[= \arg \max_{x} \log p(y|x) + \log p(x)\]

\[= \arg \min_{x} -\log p(y|x) - \log p(x)\]

(Turn this into minimization of a cost)

Image Restoration

Statistics / Estimation Recap

Maximum A Posteriori (MAP) Estimation

\[\hat{x} = \arg \min_{x} -\log p(y|x) - \log p(x)\]

Back to Denoising

  • \(p(y|x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(- \frac{(y-x)^2}{2\sigma^2}\right)\)
  • Let's choose a simple prior:
    • \(p(x) = \mathcal{N}(x; 0.5, 1)\)

Image Restoration

Statistics / Estimation Recap

Maximum A Posteriori (MAP) Estimation

\[\hat{x} = \arg \min_{x} \frac{(y-x)^2}{2\sigma^2} + C - \log p(x)\]

Back to Denoising

  • \(p(y|x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(- \frac{(y-x)^2}{2\sigma^2}\right)\)

  • Let's choose a simple prior:
    • \(p(x) = \mathcal{N}(x; 0.5, 1)\)

Image Restoration

Statistics / Estimation Recap

Maximum A Posteriori (MAP) Estimation

\[\hat{x} = \arg \min_{x} \frac{(y-x)^2}{2\sigma^2} + C + \frac{(x-0.5)^2}{2} + C'\]

Back to Denoising

  • \(p(y|x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(- \frac{(y-x)^2}{2\sigma^2}\right)\)

  • Let's choose a simple prior:
    • \(p(x) = \mathcal{N}(x; 0.5, 1)\)

Image Restoration

Statistics / Estimation Recap

Maximum A Posteriori (MAP) Estimation

\[\hat{x} = \arg \min_{x} \frac{(y-x)^2}{2\sigma^2} + \frac{(x-0.5)^2}{2}\]

Back to Denoising

  • \(p(y|x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(- \frac{(y-x)^2}{2\sigma^2}\right)\)

  • Let's choose a simple prior:
    • \(p(x) = \mathcal{N}(x; 0.5, 1)\)

Image Restoration

Statistics / Estimation Recap

Maximum A Posteriori (MAP) Estimation

\[\hat{x} = \arg \min_{x} \frac{(y-x)^2}{2\sigma^2} + \frac{(x-0.5)^2}{2}\]

How do you compute \(x\) ?

  • In this case, a simpler option is available: complete the square and express the cost as a single quadratic \(\propto (x-\hat{x})^2\).
  • More generally, to minimize \(C(x)\), solve \(C'(x) = 0\) and check \(C''(x) > 0\)
  • What is \(C'(x)\) ?

\[C'(x) = \frac{x - y}{\sigma^2} + x - 0.5\]

Image Restoration

Statistics / Estimation Recap

Maximum A Posteriori (MAP) Estimation

\[\hat{x} = \arg \min_{x} \frac{(y-x)^2}{2\sigma^2} + \frac{(x-0.5)^2}{2}\]

\[C'(x) = \frac{x - y}{\sigma^2} + x - 0.5 \color{red}{= 0}\]

\[\hat{x} = \frac{y + 0.5\sigma^2}{1+\sigma^2}\]

\[C''(x) = \frac{1}{\sigma^2} + 1 > 0 ~~~~~~~~~~~~\forall x\]

  • This means the function is convex (a quadratic with a positive coefficient on \(x^2\) is convex)
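As a quick sanity check, here is a minimal numpy sketch (the values of \(y\) and \(\sigma\) are arbitrary) comparing the closed-form estimate against a brute-force grid minimization of the cost:

```python
import numpy as np

# Observation and noise level (arbitrary example values)
y, sigma = 0.9, 0.5

# MAP cost: negative log-likelihood + negative log-prior (constants dropped)
def cost(x):
    return (y - x)**2 / (2 * sigma**2) + (x - 0.5)**2 / 2

# Closed-form MAP estimate derived above
x_map = (y + 0.5 * sigma**2) / (1 + sigma**2)

# Brute-force check on a fine grid
xs = np.linspace(-1.0, 2.0, 100001)
x_grid = xs[np.argmin(cost(xs))]

print(x_map, x_grid)   # the two values agree up to grid resolution
```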

Image Restoration

Statistics / Estimation Recap

Maximum A Posteriori (MAP) Estimation

\[\hat{x} = \arg \min_{x} \frac{(y-x)^2}{2\sigma^2} + \frac{(x-0.5)^2}{2}\]

\[\hat{x} = \frac{y + 0.5\sigma^2}{1+\sigma^2}\]

  • Weighted sum of observation and prior mean
  • Closer to prior mean when \(\sigma^2\) is high
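The two limits make this concrete:

\[\sigma^2 \to 0 \Rightarrow \hat{x} \to y, \qquad \sigma^2 \to \infty \Rightarrow \hat{x} \to 0.5\]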

Image Restoration

Statistics / Estimation Recap

  • Let's go beyond "single pixel images": \(Y[n] = X[n] + \epsilon[n]\)
  • If noise is independent at each pixel

\[p(\{Y[n]\} | \{X[n]\}) = \prod_n p(Y[n]|X[n]) = \prod_n \mathcal{N}(Y[n];~X[n],\sigma^2)\]

  • Similarly, if prior \(p(\{X[n]\})\) is defined to model pixels independently:

\[p(\{X[n]\})= \prod_n \mathcal{N}(X[n];~0.5,1)\]

  • Product turns into sum after taking \(\log\)

Image Restoration

Statistics / Estimation Recap

\[\{\hat{X}[n]\} = \arg \min_{\{X[n]\}} \sum_n \frac{(Y[n]-X[n])^2}{2\sigma^2} + \frac{(X[n]-0.5)^2}{2}\]

  • Note: This is a minimization over multiple variables

  • But since the different variables don't interact with each other, we can optimize each pixel separately

\[\hat{X}[n] = \frac{Y[n] + 0.5\sigma^2}{1+\sigma^2}~~~\forall n\]
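A minimal numpy sketch of this per-pixel estimator; the image, \(\sigma\), and the seed are random stand-ins, and the prior is the \(\mathcal{N}(0.5, 1)\) assumed above:

```python
import numpy as np

sigma = 0.5                                      # assumed noise standard deviation
rng = np.random.default_rng(0)

X = rng.uniform(0.0, 1.0, (64, 64))              # stand-in for the true image
Y = X + sigma * rng.standard_normal(X.shape)     # noisy observation

# Per-pixel MAP estimate under the independent N(0.5, 1) prior
X_hat = (Y + 0.5 * sigma**2) / (1 + sigma**2)

# The estimate shrinks each pixel toward the prior mean 0.5
print(np.mean((Y - X)**2), np.mean((X_hat - X)**2))
```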

Image Restoration

Multi-variate Gaussians

  • Re-interpret these as multi-variate Gaussians

For \(X \in \mathbb{R}^d\):

\[p(X) = \frac{1}{\sqrt{(2\pi)^d\text{det}(\Sigma)}} \exp\left(-\frac{1}{2} (X-\mu)^T\Sigma^{-1}(X-\mu)\right)\]

  • Here, \(\mu \in \mathbb{R}^d\) is a mean "vector", same size as \(X\)
  • Co-variance \(\Sigma\) is a symmetric, positive-definite \(d\times d\) matrix

When the different elements of \(X\) are independent, off-diagonal elements of \(\Sigma\) are 0.

How do we represent our prior and likelihood as a multi-variate Gaussian ?

Image Restoration

Multi-variate Gaussians

  • Represent \(X\) and \(Y\) as vectorized images

\[p(Y|X) = \mathcal{N}(Y; X, \sigma^2 I)\]

  • Here the mean-vector for \(Y\) is \(X\)
  • The co-variance matrix is a diagonal matrix with all diagonal entries \(= \sigma^2\)

Image Restoration

Multi-variate Gaussians

\[p(Y|X) = \frac{1}{\sqrt{(2\pi)^d\text{det}(\Sigma)}} \exp\left(-\frac{1}{2} (Y-\mu)^T\Sigma^{-1}(Y-\mu)\right)\]

  • \(\mu = X\), \(\Sigma = \sigma^2 I\)
  • \(\mbox{det}(\Sigma) = (\sigma^2)^d\)
  • \(\Sigma^{-1} = \sigma^{-2} I\)

\[(Y-\mu)^T\Sigma^{-1}(Y-\mu) = (Y-X)^T\frac{1}{\sigma^2} I (Y-X) = \frac{1}{\sigma^2}(Y-X)^T~I~(Y-X)\]

\[= \frac{1}{\sigma^2}(Y-X)^T(Y-X) = \frac{1}{\sigma^2}\|Y-X\|^2\]

\[= \frac{1}{\sigma^2}\sum_n (Y[n]-X[n])^2\]
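A quick numpy check of this reduction (random vectors, arbitrary \(\sigma\)):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 10, 0.3
X, Y = rng.standard_normal(d), rng.standard_normal(d)

quad = (Y - X) @ np.linalg.inv(sigma**2 * np.eye(d)) @ (Y - X)   # (Y-mu)^T Sigma^{-1} (Y-mu)
print(quad, np.sum((Y - X)**2) / sigma**2)                       # the two values agree
```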

Image Restoration

\[p(Y|X) = \mathcal{N}(Y; X, \sigma^2 I)\]

\[p(X) = \mathcal{N}(X;\ 0.5\cdot\mathbf{1},\ I)\]

\[\hat{X} = \arg \min_{X} \frac{1}{\sigma^2} \|Y-X\|^2 + \|X-0.5\|^2\]

  • But now, we can use these multi-variate distributions to model correlations and interactions between different pixels
  • Let's stay with denoising, but talk about defining a prior in the Wavelet domain
  • Instead of saying, individual pixel values are independent, let's say individual wavelet coefficients are independent
  • Let's also put different means and variances on different wavelet coefficients
    • All the 'derivative' coefficients have zero mean
    • Variance goes up as we go to coarser levels

Image Restoration

Let \(C = W~X\), where \(W\) represents the Wavelet transform matrix

  • \(W\) is unitary, \(W^{-1} = W^T\)
  • \(X = W^T C\)

Define our prior on \(C\):

\[p(C) = \mathcal{N}(C; \mu_c, D_c)\]

  • Now, \(D_c\) is a diagonal matrix, with entries equal to corresponding variances of coefficients
  • Off-diagonal elements \(0\) implies the wavelet coefficients are uncorrelated
  • A prior on \(C\) implies a prior on \(X\)
    • \(C\) is just a different representation for \(X\)

Image Restoration

Change of Variables

  • We have a probability distribution on a random variable \(U\): \(p_U(U)\)
  • \(U = f(V)\) is a one-to-one function

\[p_V(V) = p_U(f(V))~~?\]

Realize that \(p(\cdot)\) are densities, so we need to account for "scaling" of the probability measure.

\[\int p_U(U) dU = \int p_U(f(V)) \color{red}{dU}\]

\[dU = |\mbox{det}(J_f)|~dV\]

where \((J_f)_{ij} = \frac{\partial u_i}{\partial v_j}\)

So, \(p_V(V) = p_U(f(V))~|\mbox{det}(J_f)|\)

If \(U = f(V) = A~V\) is linear, \(J=A\).

If \(A\) is unitary, \(|\mbox{det}(A)| = 1\), so the density is unchanged.
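A quick one-dimensional example: if \(U = aV\) with \(a \neq 0\), then

\[p_V(V) = |a|~p_U(aV)\]

e.g., for \(U \sim \mathcal{N}(0,1)\), this gives \(V = U/a \sim \mathcal{N}(0, 1/a^2)\).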

Image Restoration

Let \(C = W~X\), where \(W\) represents the Wavelet transform matrix

\[p(C) = \mathcal{N}(C; \mu_c, D_c)\]

\[p(X) = \frac{1}{\sqrt{(2\pi)^d\mbox{det}(D_c)}} \exp\left(-\frac{1}{2}(WX - \mu_c)^TD_c^{-1}(WX-\mu_c) \right)\]

\[ = \mathcal{N}(X; \color{red}{?}, \color{red}{?})\]

Image Restoration

Let \(C = W~X\), where \(W\) represents the Wavelet transform matrix

\[p(C) = \mathcal{N}(C; \mu_c, D_c)\]

\[p(X) = \frac{1}{\sqrt{(2\pi)^d\mbox{det}(D_c)}} \exp\left(-\frac{1}{2}(WX - \mu_c)^TD_c^{-1}(WX-\mu_c) \right)\]

\[\propto \exp\left(-\frac{1}{2}(X - W^T\mu_c)^T~W^TD_c^{-1}W~(X-W^T\mu_c) \right) \]

\[ = \mathcal{N}(X; \color{red}{?}, \color{red}{?})\]

Image Restoration

Let \(C = W~X\), where \(W\) represents the Wavelet transform matrix

\[p(C) = \mathcal{N}(C; \mu_c, D_c)\]

\[p(X) = \frac{1}{\sqrt{(2\pi)^d\mbox{det}(D_c)}} \exp\left(-\frac{1}{2}(WX - \mu_c)^TD_c^{-1}(WX-\mu_c) \right)\]

\[\propto \exp\left(-\frac{1}{2}(X - W^T\mu_c)^T~W^TD_c^{-1}W~(X-W^T\mu_c) \right) \]

\[ = \mathcal{N}(X; W^T\mu_c,W^TD_cW)\]

Now, let's denoise with this prior.

Image Restoration

\[\hat{X} = \arg \min_X \|X - Y\|^2 + (X-\mu)^T \Sigma^{-1} (X-\mu)\]

where, \(\mu = W^T\mu_c\), \(\Sigma = W^TD_cW\). (We're assuming noise variance is 1).

How do we minimize this ?

Take derivative, set to 0. Check second derivative is positive.

Take gradient (vector derivative), set to 0. Check that "Hessian" is positive-definite.

\(y = f(X)\) is a scalar valued function of a vector \(X\in \mathbb{R}^d\).

  • The gradient is a vector the same size as \(X\), with each entry being the partial derivative of \(y\) with respect to that element of \(X\).

\[\nabla_X y = \left[\begin{array}{c} \frac{\partial y}{\partial X_1} \\ \frac{\partial y}{\partial X_2} \\ \frac{\partial y}{\partial X_3} \\ \vdots \end{array}\right]\]

Image Restoration

\(y = f(X)\) is a scalar valued function of a vector \(X\in \mathbb{R}^d\).

  • The gradient is a vector the same size as \(X\), with each entry being the partial derivative of \(y\) with respect to that element of \(X\).

\[\nabla_X y = \left[\begin{array}{c} \frac{\partial y}{\partial X_1} \\ \frac{\partial y}{\partial X_2} \\ \frac{\partial y}{\partial X_3} \\ \vdots \end{array}\right]\]

  • The Hessian is a matrix of size \(d\times d\) for \(X \in \mathbb{R}^d\):

\[(H_{yx})_{ij} = \frac{\partial^2 y}{\partial X_i \partial X_j}\]

Image Restoration

Properties of a multi-variate quadratic form

\[X^T Q X - 2 X^T R + S\]

where \(Q\) is a symmetric \(d\times d\) matrix, \(R\) is a \(d\)-dimensional vector, \(S\) is a scalar.

  • Note that this is a scalar value

\[\nabla_X \left(X^T Q X - 2 X^T R + S\right) = 2QX - 2R\]

  • Comes from the following identities
    • \(\nabla_X X^TAX = (A+A^T)X\)
    • \(\nabla_X X^T R = \nabla_X R^T X = R\)
  • The Hessian is simply given by \(Q\) (it is constant, doesn't depend on \(X\))
  • If \(Q\) is positive definite, then \(\nabla_X = 0\) gives us the unique minimizer.

\(\nabla_X = 0 \rightarrow 2QX - 2R = 0 \rightarrow X = Q^{-1} R\)
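A quick numpy check of this result (random \(Q\) and \(R\), with \(Q\) made positive-definite by construction):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Random symmetric positive-definite Q and a random R
A = rng.standard_normal((d, d))
Q = A @ A.T + d * np.eye(d)
R = rng.standard_normal(d)

# Minimizer of X^T Q X - 2 X^T R + S: solve Q X = R
x_star = np.linalg.solve(Q, R)

# The gradient 2QX - 2R should vanish at the minimizer
grad = 2 * Q @ x_star - 2 * R
print(np.max(np.abs(grad)))   # ~ 0 up to floating-point error
```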

Image Restoration

\[\hat{X} = \arg \min_X \|X - Y\|^2 + (X-\mu)^T \Sigma^{-1} (X-\mu)\]

where, \(\mu = W^T\mu_c\), \(\Sigma = W^TD_cW\).

\[= \arg \min_X X^T(I + \Sigma^{-1})X - 2X^T(Y + \Sigma^{-1}\mu) + \left(\|Y\|^2 + \mu^T\Sigma^{-1}\mu\right)\]

\(Q = (I+\Sigma^{-1})\) is positive definite (sum of two positive-definite matrices is positive-definite).

\[\hat{X} = (I+\Sigma^{-1})^{-1}(Y + \Sigma^{-1}\mu)\]

\[\Sigma^{-1} = (W^TD_cW)^{-1} = W^{-1}D_c^{-1}(W^T)^{-1} = W^TD_c^{-1}W\]

i.e., taking inverse in the original de-correlated wavelet domain.

\[\Sigma^{-1}\mu = W^TD_c^{-1}W\mu = W^TD_c^{-1}WW^T\mu_c = W^TD_c^{-1}\mu_c\]

i.e., inverse-wavelet transform of the scaled wavelet coefficients.

Image Restoration

\[\hat{X}= \arg \min_X X^T(I + \Sigma^{-1})X - 2X^T(Y + \Sigma^{-1}\mu) + \left(\|Y\|^2 + \mu^T\Sigma^{-1}\mu\right)\] \[= \color{red}{(I + \Sigma^{-1})^{-1}}(Y + \Sigma^{-1}\mu)\]

\[\Sigma^{-1} = W^TD_c^{-1}W,~~~\Sigma^{-1}\mu = W^TD_c^{-1}\mu_c\]

Note that we can't actually form \(I+\Sigma^{-1}\), let alone invert it: it's a \(d\times d\) matrix, with \(d =\) total number of pixels.

\[I + \Sigma^{-1} = I + W^TD_c^{-1}W\]

\[= W^TW + W^TD_c^{-1}W = W^TIW + W^TD_c^{-1}W = W^T(I+D_c^{-1})W\]

\[(I+\Sigma^{-1})^{-1} = W^T(I+D_c^{-1})^{-1}W\]

\(I+D_c^{-1}\) is also diagonal, so its inverse is obtained by just inverting the diagonal elements.

\[\hat{X} = W^T(I+D_c^{-1})^{-1}W~~~(Y+W^TD_c^{-1}\mu_c)\]
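A numpy sketch that verifies this on a small example; a random orthonormal matrix stands in for \(W\), and \(D_c\), \(\mu_c\), and \(Y\) are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Random orthonormal stand-in for the wavelet transform W (so W^{-1} = W^T)
W, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Diagonal prior covariance D_c and prior mean mu_c in the transform domain
Dc = np.diag(rng.uniform(0.5, 2.0, d))
mu_c = rng.standard_normal(d)

Y = rng.standard_normal(d)                       # stand-in observation

# Direct form: (I + Sigma^{-1})^{-1} (Y + Sigma^{-1} mu), with Sigma = W^T D_c W
Sigma_inv = W.T @ np.linalg.inv(Dc) @ W
mu = W.T @ mu_c
x_direct = np.linalg.solve(np.eye(d) + Sigma_inv, Y + Sigma_inv @ mu)

# Transform-domain form: only the diagonal matrix (I + D_c^{-1}) is inverted
Dc_inv = np.diag(1.0 / np.diag(Dc))
x_wavelet = W.T @ np.linalg.solve(np.eye(d) + Dc_inv,
                                  W @ (Y + W.T @ Dc_inv @ mu_c))

print(np.max(np.abs(x_direct - x_wavelet)))      # ~ 0: the two forms agree
```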

Image Restoration

\[\hat{X}= \arg \min_X X^T(I + \Sigma^{-1})X - 2X^T(Y + \Sigma^{-1}\mu) + \left(\|Y\|^2 + \mu^T\Sigma^{-1}\mu\right)\]

\[\hat{X} = W^T(I+D_c^{-1})^{-1}W~~~(Y+W^TD_c^{-1}\mu_c)\]

\[W\hat{X} = WW^T(I+D_c^{-1})^{-1}(WY + WW^TD_c^{-1}\mu_c)\]

Let's call \(\hat{C} = W\hat{X}\), \(C_y = WY\).

\[\hat{C} = (I+D_c^{-1})^{-1}(C_y + D_c^{-1}\mu_c)\]

So what we have done is "denoise" in the wavelet domain, with the noise variance in the wavelet domain also equal to 1.

\(Y = X + \epsilon \rightarrow C_y = C + W\epsilon\).

\(W\epsilon\) is also a random vector with 0 mean, and covariance \(=WIW^T=I\).
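Coefficient by coefficient, with \(d_i\) the \(i\)-th diagonal entry of \(D_c\), this is exactly the scalar shrinkage rule from before (noise variance 1, prior mean \(\mu_c[i]\), prior variance \(d_i\)):

\[\hat{C}[i] = \frac{C_y[i] + \mu_c[i]/d_i}{1 + 1/d_i} = \frac{d_i\, C_y[i] + \mu_c[i]}{d_i + 1}\]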

Image Restoration

So far, we've been in a Bayesian setting with a prior.

\[\hat{X} = \arg \min_X -\log p(Y|X) - \log p(X)\]

  • We're balancing fidelity to the observation (first term, likelihood) against what we believe about \(X\) (second term, prior)
  • But we need \(p(X)\) to be a 'proper' probability distribution. Sometimes, that's too restrictive.
  • Instead, just think of these as a "data term" (which still comes from the observation model) and a generic "regularization" term.

\[\hat{X} = \arg \min_X D(X; Y) + R(X)\]

\[\hat{X} = \arg \min_X \|X-Y\|^2 + \lambda \left(\|G_x*X\|^2 + \|G_y*X\|^2\right)\]

  • Here, \(G_x*X\) is a vector corresponding to the image we get by convolving \(X\)
    with the Gaussian x-derivative filter.
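A minimal numpy/scipy sketch of evaluating this cost; the derivative-of-Gaussian filters and the value of \(\lambda\) here are just illustrative placeholder choices:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
Y = rng.uniform(0.0, 1.0, (64, 64))    # stand-in noisy observation
lam = 0.5                              # placeholder regularization weight

def deriv_gauss_filters(sigma=1.0, size=7):
    """One illustrative choice of Gaussian x/y-derivative filters."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    g /= g.sum()
    gx = -xx / sigma**2 * g            # derivative of the Gaussian along x
    gy = -yy / sigma**2 * g            # derivative of the Gaussian along y
    return gx, gy

Gx, Gy = deriv_gauss_filters()

def cost(X, Y, Gx, Gy, lam):
    data = np.sum((X - Y)**2)
    reg = np.sum(convolve2d(X, Gx, mode='same')**2) \
        + np.sum(convolve2d(X, Gy, mode='same')**2)
    return data + lam * reg

print(cost(Y, Y, Gx, Gy, lam))   # at X = Y the data term is 0; only the regularizer remains
```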

Image Restoration

\[\hat{X} = \arg \min_X \|X-Y\|^2 + \lambda \left(\|G_x*X\|^2 + \|G_y*X\|^2\right)\]

  • Data term still has the same form as for denoising.
  • But now, regularization just says we want the x and y spatial derivatives of our image to be small
    • Encodes a prior that natural images are smooth.
  • This isn't strictly a prior, because the gradients aren't a "representation" of \(X\) (unlike the wavelet coefficients).
  • But it works better, because smoothness isn't enforced only at alternate gradient locations: the derivative filters penalize gradients at every pixel.

How do we minimize this ?