CSE 559A: Computer Vision


Fall 2018: T-R: 11:30-1pm @ Lopata 101

Instructor: Ayan Chakrabarti (ayan@wustl.edu).

Course Staff: Zhihao Xia, Charlie Wu, Han Liu

Aug 30, 2018

**EVERYONE** needs to fill out the survey.

- Set up git and Anaconda, send us your public key, and do problem set 0.
- Do immediately: submit public key and make sure you can clone repo.

- If you have trouble with git/Python/LaTeX setup:
- Attend Zhihao's office hours tomorrow: 10:30 AM-Noon @ Jolley 309

- This Monday is Labor Day: no office hours!
- Monday location still TBD

- \(E(x,y,t)\): Light energy, *per unit area per unit time*, arriving at point \((x,y)\) at time \(t\)
- Here, \(x,y\) are real numbers (in meters) denoting actual position on the sensor plane.

- \(I[n_x,n_y]\): Intensity measured by the sensor element at grid location \(n_x,n_y\)
- Here, \(n_x\), \(n_y\) are integers, indexing pixel location.

- \(p(x-\bar{x}_{n_x},y-\bar{y}_{n_y})\): a spatial sensitivity function
- \(\bar{x}_{n_x},\bar{y}_{n_y}\) is the location (in meters) of the center of the sensor element
- \(p(\cdot,\cdot)\) is ideally 1 inside pixel, 0 outside. But may have attenuation at boundaries.

- Defining \(q\) as the "quantum efficiency" of the sensor: the conversion ratio from light energy to charge/voltage
- \(\int E(x,y,t)~~p(x-\bar{x}_{n_x},y-\bar{y}_{n_y})~q~dx~dy\)

Rate at which charge/voltage increases in sensor element \(n_x,n_y\) at time \(t\).


- An image capture involves "exposing" the image for an interval \(T\) (seconds)

- So the total intensity is going to involve integrating the charge/voltage rate over that interval.

\[I[n_x,n_y] = \int_{t=0}^T \Bigg[\int E(x,y,t)~~p(x-\bar{x}_{n_x},y-\bar{y}_{n_y})~q~dx~dy \Bigg] dt\]

- \(n_x, n_y\) are integers indexing pixels in image array.
- \((x,y)\) is spatial location
- \(I[n_x,n_y]\) is recorded pixel intensity.
- \(E(x,y,t)\) is light "power" per unit area incident at location \((x,y)\) on the sensor plane at time \(t\)

- \((\bar{x}_{n_x},\bar{y}_{n_y})\) is the "center" spatial location of the pixel / sensor element at \([n_x,n_y]\).
- \(p(x,y)\) is spatial sensitivity of the sensor (might be lower near boundaries, etc.)
- \(q\) is quantum efficiency of the sensor (photons/energy to charge/voltage)
- \(T\) is the duration of the exposure interval.
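The image formation equation above can be checked numerically. The sketch below (all names and parameter values are illustrative, not from any real sensor) approximates the triple integral with Riemann sums, assuming the ideal box sensitivity \(p\) that is 1 inside each pixel and 0 outside:

```python
import numpy as np

# Sketch: approximate I[nx,ny] = ∫∫∫ E(x,y,t) p(x - xbar, y - ybar) q dx dy dt
# by Riemann sums, with p = 1 inside each pixel and 0 outside (so we only
# sample points inside the pixel). All parameter values are illustrative.
def capture(E, q=0.5, T=0.01, pixels=4, pitch=1e-3, samples=10):
    dt = T / samples
    dx = pitch / samples
    I = np.zeros((pixels, pixels))
    for nx in range(pixels):
        for ny in range(pixels):
            # sample points strictly inside pixel (nx, ny)
            xs = nx * pitch + (np.arange(samples) + 0.5) * dx
            ys = ny * pitch + (np.arange(samples) + 0.5) * dx
            X, Y = np.meshgrid(xs, ys, indexing="ij")
            for t in (np.arange(samples) + 0.5) * dt:
                I[nx, ny] += np.sum(E(X, Y, t)) * q * dx * dx * dt
    return I

# Constant irradiance E0: every pixel should read q * E0 * pitch^2 * T
E0 = 100.0
I = capture(lambda x, y, t: E0 * np.ones_like(x))
```

For a constant \(E\), the Riemann sum is exact and every pixel reads \(q\,E_0\,\mathrm{pitch}^2\,T\), matching the formula.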

CCD/CMOS sensors measure total energy or "count photons" that arrived during exposure.

\[I[n_x,n_y] = \int_{t=0}^T \Bigg[\int E(x,y,t)~~p(x-\bar{x}_{n_x},y-\bar{y}_{n_y})~q~dx~dy \Bigg] dt\]

\(I[n_x,n_y]\) is the recorded pixel intensity. We write \(I^0[n_x,n_y]\) for the *ideal unquantized* pixel intensity:

\[I^0[n_x,n_y] = \int_{t=0}^T \Bigg[\int E(x,y,t)~~p(x-\bar{x}_{n_x},y-\bar{y}_{n_y})~q~dx~dy \Bigg] dt\]

\[I \leftarrow I^0\]


- Shot noise: caused by uncertainty in photon arrivals
- Actual number of photons \(K\) is a discrete random variable with Poisson distribution
- \(P(K = k) = \frac{\lambda^k e^{-\lambda}}{k!}\)
- \(\lambda\) is the "expected" number of photons. In our case, \(\propto I^0\)

- Property of Poisson distribution: Mean and Variance both equal to \(\lambda\)
- Often, shot noise is modeled with additive Gaussian noise with signal dependent variance:

\[I \leftarrow I^0 + \sqrt{I^0}~~\epsilon_1\]

where \(\epsilon_1 \sim \mathcal{N}(0,1)\) (Gaussian random noise with mean 0, variance 1).

\(\sqrt{I^0}\epsilon_1~~\sim~~ \mathcal{N}(0,I^0)\)
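Since \(\lambda \propto I^0\), the Gaussian model above is a two-moment match to the Poisson distribution: both have mean \(\lambda\) and variance \(\lambda\). A quick numerical check (illustrative values):

```python
import numpy as np

# Sketch: shot noise as actual Poisson photon counts vs. the Gaussian
# approximation I = I0 + sqrt(I0) * eps1. Here I0 plays the role of the
# expected photon count lambda; the value 400 is made up for illustration.
rng = np.random.default_rng(0)
I0 = 400.0

poisson = rng.poisson(I0, size=100_000).astype(float)
gauss = I0 + np.sqrt(I0) * rng.standard_normal(100_000)

# Poisson: mean and variance both equal lambda; the Gaussian model
# matches both moments, and fits well when lambda is large.
print(poisson.mean(), poisson.var())   # both near 400
print(gauss.mean(), gauss.var())       # both near 400
```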

\[I^0[n_x,n_y] = \int_{t=0}^T \Bigg[\int E(x,y,t)~~p(x-\bar{x}_{n_x},y-\bar{y}_{n_y})~q~dx~dy \Bigg] dt\]

\[I \leftarrow I^0 + \sqrt{I^0}~~\epsilon_1\]

- Signal amplified by gain \(g\) before digitization. Based on ISO (higher \(g\) for higher ISO).
- Some signal-independent Gaussian noise added before and after amplification.

\[I \leftarrow g \times (I^0 + \sqrt{I^0}~~\epsilon_{1} + \sigma_{2a}\epsilon_{2a}) + \sigma_{2b}\epsilon_{2b}\]

where \(\sigma_{2a}\) and \(\sigma_{2b}\) are parameters (lower for high quality sensors),

and \(\epsilon_1,\epsilon_{2a},\epsilon_{2b}\) are \(\mathcal{N}(0,1)\) noise variables, all independent.

\[I \leftarrow \underbrace{g I^0}_{\tiny \mbox{Amplified Signal}} + \underbrace{g \sqrt{I^0}~~\epsilon_1}_{\tiny \mbox{Amplified Shot Noise}} + \underbrace{\sqrt{\big(g^2\sigma_{2a}^2+\sigma_{2b}^2\big)}~~\epsilon_2}_{\tiny \mbox{Amplified and un-amplified additive noise}}\]

\[I \leftarrow g I^0 + g \sqrt{I^0}~~\epsilon_1 + \sqrt{\big(g^2\sigma_{2a}^2+\sigma_{2b}^2\big)}~~\epsilon_2\]

- The final step is rounding and clipping, performed by an analog-to-digital converter

\[I = \min\Bigg(I_\max,~~\text{Round}\Big(g I^0 + g \sqrt{I^0}~~\epsilon_1 + \sqrt{\big(g^2\sigma_{2a}^2+\sigma_{2b}^2\big)}~~\epsilon_2\Big)\Bigg)\]

ignoring sensor saturation, dark current, ...
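Putting the pieces together, the full measurement model can be sketched as follows; the gain and noise parameters are invented for illustration, not taken from any real sensor:

```python
import numpy as np

# Sketch of the full measurement model:
#   I = min(Imax, Round(g*I0 + g*sqrt(I0)*eps1 + sqrt(g^2 s2a^2 + s2b^2)*eps2))
# All parameter values (g, s2a, s2b, Imax, I0) are illustrative.
rng = np.random.default_rng(0)

def measure(I0, g=4.0, s2a=1.0, s2b=2.0, Imax=255):
    eps1 = rng.standard_normal(I0.shape)   # shot noise (signal dependent)
    eps2 = rng.standard_normal(I0.shape)   # combined additive noise
    I = g * I0 + g * np.sqrt(I0) * eps1 \
        + np.sqrt(g ** 2 * s2a ** 2 + s2b ** 2) * eps2
    return np.minimum(Imax, np.round(I))   # quantize, then clip at Imax

I0 = np.full((64, 64), 30.0)   # constant ideal intensity
I = measure(I0)
print(I.mean())                # near g * I0 = 120 when clipping is rare
```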

- To understand the degradation process of noise (if we want to denoise / recover \(I^0\) from \(I\)).
- To prevent degradation during capture, because we control exposure time \(T\) and ISO / gain \(g\).
- To understand the different trade-offs for loss of information from noise, rounding, and clipping.

Ignoring noise, what is the optimal \(g\) for a given \(I^0[n_x,n_y]\)?

- Keep \(g\) low so that most values of \(g I^0[n_x,n_y]\) are below \(I_\max\).
- But if \(g\) is too low, a lot of the variation will get rounded to the same value.
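This trade-off is easy to see numerically. The sketch below (gain values and intensity range are made up for illustration) quantizes \(g I^0\) for different gains and counts how many distinct values survive:

```python
import numpy as np

# Sketch of the rounding-vs-clipping trade-off for the gain g (noise
# ignored): quantize g * I0 and count distinct surviving codes and
# clipped pixels. The gains and intensity range are illustrative.
I0 = np.linspace(10.0, 20.0, 100)   # 100 distinct ideal intensities
Imax = 255

for g in [0.5, 12.0, 30.0]:
    I = np.minimum(Imax, np.round(g * I0))
    n_unique = len(np.unique(I))
    n_clipped = int(np.sum(I == Imax))
    print(f"g={g}: {n_unique} distinct values, {n_clipped} clipped")
# Too low a gain collapses many intensities to the same code;
# too high a gain clips everything to Imax.
```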


Note that here, our 'ideal' intensity is \(gI^0\), everything else is noise.

Say we have chosen the optimal target values for the product \(gI^0\). Is it better:

- To have a higher \(g\) and lower magnitude \(I^0\)
- **To have a lower \(g\) and higher magnitude \(I^0\)**
- Depends, based on \(\sigma_{2a}, \sigma_{2b}\)

**Additional Reading (if interested)**:

S. Hasinoff, F. Durand, W.T. Freeman, "Noise-Optimal Capture for High Dynamic Range Photography," CVPR 2010.

So how do we increase \(I^0\)?

- Better sensors (higher \(q\))
- Larger sensor elements: \(~~p(\cdot,\cdot) > 0\) over a larger area.

But we've gone the other way: cameras stuff more 'megapixels' in smaller form factors.

Increase exposure time \(T\)?

- If scene is static and camera is stationary:
- \(E(x,y,t)\) doesn't change with \(t \Rightarrow I^0 \propto T\)

- If scene is moving ...

Increase \(E(x,y,t)\) itself. How?

- Take pictures outdoors, or under brighter lights.

- Don't use a pinhole camera! (A lens with a wide aperture gathers much more light.)

Photographers think about these tradeoffs every time they take a shot:

- Dynamic range and what part of the image should be well exposed (rounding and clipping)

- Choosing between:
- ISO i.e. Gain & noise
- Exposure Time & motion blur
- F-stop i.e. aperture size & defocus blur

We left out an important term in this equation: wavelength.

\[I^0[n_x,n_y] = \int_{t=0}^T \Bigg[\int E(\lambda,x,y,t)~~p(x-\bar{x}_{n_x},y-\bar{y}_{n_y})~q(\lambda)~d\lambda~dx~dy \Bigg] dt\]

- Light carries different amounts of power in different wavelengths
- \(E(\lambda,x,y,t)\) now refers to power per unit area per unit wavelength
- In wavelength \(\lambda\), incident at \((x,y)\) at time \(t\)
- Both a spectral and a spatial density function

- \(q(\lambda)\): Quantum efficiency also a function of wavelength
- CMOS/CCD sensors are sensitive (have high \(q\)) across most of the visible spectrum
- They actually extend to wavelengths longer than visible light (near infrared)
- This is why cameras have an NIR filter: to prevent NIR radiation from being 'superimposed' on the image

Q: But this measures 'total' power in all wavelengths. How do we measure color?

Ans: By putting a color filter in front of each sensor element.

\[I^0[n_x,n_y,c] = \int_{t=0}^T \Bigg[\int E(\lambda,x,y,t)~~p(x-\bar{x}_{n_x},y-\bar{y}_{n_y})~\Pi_c(\lambda)~q(\lambda)~d\lambda~dx~dy \Bigg] dt\]

\[ \text{for}~~~~c \in \{R,G,B\} \]

- \(\Pi_c\) is the transmittance of a color filter for color channel \(c\)
- E.g., \(~~\Pi_R\) will transmit power in (be high for) wavelengths in the red part of the visible spectrum, and attenuate power in (be low for) other wavelengths.
- Sometimes also called "color matching functions"

\[I^0[n_x,n_y,c] = \int_{t=0}^T \Bigg[\int E(\lambda,x,y,t)~~p(x-\bar{x}_{n_x},y-\bar{y}_{n_y})~\Pi_c(\lambda)~q(\lambda)~d\lambda~dx~dy \Bigg] dt\]

\[ \text{for}~~~~c \in \{R,G,B\} \]

- But we can only put one filter in front of each sensor element / pixel location.
- So color cameras "multiplex" color measurements: they measure a different color channel at each location.
- Usually in an alternating pattern called the Bayer pattern: rows alternate R,G,R,G,... and G,B,G,B,... filters

- Note: a disadvantage is that color filters block light, so measured \(I^0\) values are lower.
- That's why black and white / grayscale cameras are "faster" than color cameras.
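The multiplexing above can be sketched in a few lines. This sketch assumes the common RGGB arrangement (actual layouts vary by camera), sampling one channel of a full-color image at each location:

```python
import numpy as np

# Sketch of Bayer-pattern multiplexing: each sensor location records only
# one of R, G, B. The RGGB arrangement below is an assumption for
# illustration; real cameras may use a different variant.
def bayer_mosaic(rgb):
    """rgb: (H, W, 3) array -> (H, W) single-channel mosaic."""
    H, W, _ = rgb.shape
    mosaic = np.zeros((H, W), dtype=rgb.dtype)
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]   # R at (even row, even col)
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]   # G at (even row, odd col)
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]   # G at (odd row, even col)
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]   # B at (odd row, odd col)
    return mosaic

img = np.random.rand(4, 4, 3)
m = bayer_mosaic(img)
# Half the locations measure green, a quarter each red and blue;
# recovering full RGB at every pixel ("demosaicking") requires interpolation.
```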

Final steps in camera processing pipelines (except for some DSLR cameras shooting in RAW):

- Filter Colors to Standard RGB:
- Cameras often use their own color filters \(\Pi_c\).
- Apply a linear transformation to map those measurements to standard RGB.

- White-balance: scale color channels to remove color cast from a non-neutral illuminant.
- Tone-mapping:
- The simplest form is "gamma correction" (approximately raising each intensity to the power \((1/2.2)\))
- Done based on a standard developed around what old display devices expected
- Fits the full set of measurable colors into the gamut that can be displayed / printed
- Modern cameras often do more advanced processing (to make colors look vibrant)

- Compression

And that's how you get your PNG / JPEG images!
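The simplest tone-mapping step mentioned above, gamma correction, is just a pointwise power law. A minimal sketch (the helper name and the \(1/2.2\) exponent follow the description above; inputs are assumed normalized to \([0,1]\)):

```python
import numpy as np

# Sketch of simple gamma correction: raise each (normalized) linear
# intensity to the power 1/2.2 before storing. The clip is a safeguard
# for out-of-range inputs; "gamma_correct" is an illustrative name.
def gamma_correct(I_linear, gamma=2.2):
    """I_linear: float array in [0, 1] -> display-encoded values in [0, 1]."""
    return np.clip(I_linear, 0.0, 1.0) ** (1.0 / gamma)

I = np.array([0.0, 0.25, 0.5, 1.0])
print(gamma_correct(I))   # dark values are boosted; 0 and 1 are fixed points
```

Boosting dark values before quantizing to 8 bits spends more of the available codes on the shadows, where the eye is more sensitive.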

**Optional Additional Reading:** Szeliski Sec 2.3

Other effects we did not talk about. E.g.,

- Real lenses are not ideal thin lenses, and have distortions.

- Rolling shutter: there is no explicit mechanical shutter; pixels are reset electronically, one scanline at a time

- Images exist as 2-D (grayscale) or 3-D (color) arrays

- Precision: uint8 (0-255), uint16 (0-65535), floating point (0-1)
- We will often treat them as (positive) real numbers.

- Conventions:
- \(I[n_x,n_y] \in \mathbb{R}\)
- \(I[n_x,n_y,c] \in \mathbb{R}\)
- \(I[n_x,n_y] \in \mathbb{R}^3\)
- \(I[n] \in \mathbb{R}\) or \(\in \mathbb{R}^3\), where \(n \in \mathbb{Z}^2\)

- How do you process / manipulate these arrays?

- \(Y[n] = h(X[n])\)
- \(Y[n] = h(X_1[n],X_2[n],\ldots)\)
- \(Y[n] = h_n(X[n])\): might vary based on location.
- \(h(\cdot)\) itself might be based on 'global statistics'
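The operation types listed above map directly onto vectorized array code. A minimal sketch (all arrays and the particular choices of \(h\) are illustrative):

```python
import numpy as np

# Sketch of the pointwise operations above, written with NumPy's
# vectorized array operations rather than explicit loops over n.
rng = np.random.default_rng(0)
X = rng.random((4, 4))
X1, X2 = rng.random((4, 4)), rng.random((4, 4))

Y = np.sqrt(X)                       # Y[n] = h(X[n])
Y2 = np.maximum(X1, X2)              # Y[n] = h(X1[n], X2[n], ...)
mask = np.zeros((4, 4))
mask[:2] = 1
Y3 = np.where(mask > 0, X, 1 - X)    # Y[n] = h_n(X[n]): varies by location
Y4 = (X - X.mean()) / X.std()        # h based on global statistics of X
```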