CSE 559A: Computer Vision

Fall 2018: T-R: 11:30-1pm @ Lopata 101

Instructor: Ayan Chakrabarti (ayan@wustl.edu).
Course Staff: Zhihao Xia, Charlie Wu, Han Liu

December 4, 2018

# General

• PSET 3 grades posted, PSET 4 to be posted shortly.
• PSET 5 due today.
• No office hours on Friday.

• Project reports due December 9.
• Submit through git
git clone cse559@euclid.seas.wustl.edu:wustl.key/project
• Include only your report PDF. Do not submit your code or other files.
• But hang on to your code in case we ask for it later.

• So far, we have looked at networks that, given an input $$x$$, produce an output $$y$$.
• All $$(x,y)$$ pairs come from some joint distribution $$p(x,y)$$.
• A function that maps $$x\rightarrow y$$ is then reasoning with the distribution $$p(y|x)$$,
• And producing a single guess $$\hat{y}$$ which minimizes $$\mathbb{E}_{p(y|x)} L(y,\hat{y})$$.
• But if $$p(y|x)$$ is not deterministic, this expected loss won't go to zero: this is the Bayes error.
• What if I didn't want my network to produce a single "best" guess, but to tell me about this distribution?

• One option: choose a parametric form $$p(y|x) = f(y; \theta)$$.
• Have a network $$g$$ that predicts $$\theta = g(x)$$ for a specific $$x$$ (see the sketch after this list).
• The other option is to train a sampler: a network that, given an input $$x$$, produces "samples" from $$p(y|x)$$.
• How do you produce samples, i.e., multiple outputs for the same input?
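A minimal sketch of the first option, assuming a Gaussian form for $$p(y|x)$$ with $$\theta = (\mu, \log\sigma^2)$$; the architecture, sizes, and names below are illustrative, not from the course:

```python
import torch
import torch.nn as nn

class GaussianPredictor(nn.Module):
    """Predicts theta = (mean, log-variance) of a Gaussian p(y|x)."""
    def __init__(self, x_dim=16, y_dim=4, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, y_dim)        # predicted mean of p(y|x)
        self.log_var = nn.Linear(hidden, y_dim)   # predicted log-variance of p(y|x)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

def gaussian_nll(mu, log_var, y):
    """Negative log-likelihood of y under N(mu, exp(log_var)), constant dropped.
    Training the network g amounts to minimizing this over (x, y) pairs."""
    return 0.5 * (((y - mu) ** 2) / log_var.exp() + log_var).sum(dim=1).mean()
```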

• Let's ignore conditional distributions for now. Consider the task of generating samples from $$p_x(x)$$.
• You don't know $$p_x(x)$$, but you have training examples that are samples from it.
• You want to learn a "generator" network $$G(z; \theta)$$ (see the sketch after this list) which
• Takes in random inputs $$z$$ from a known distribution $$p_z(z)$$
• Produces outputs $$x$$ distributed according to $$p_x(x)$$
• Has learnable parameters $$\theta$$
• You want to select $$\theta$$ such that the distribution of $$\{G(z; \theta): z \sim p_z(z)\}$$ matches $$p_x(x)$$.
• But you don't have the data distribution itself, only samples from it.
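A minimal sketch of such a generator, assuming a small MLP and a standard normal $$p_z$$; sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps noise z ~ p_z to a sample x = G(z; theta)."""
    def __init__(self, z_dim=32, x_dim=784, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim), nn.Tanh(),   # outputs scaled to [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

G = Generator()
z = torch.randn(8, 32)    # a batch of z ~ p_z (standard normal)
x_fake = G(z)             # 8 generated samples; we want their distribution to match p_x
```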

• Set this up as a min-max objective with a second network, a "discriminator" $$D(x;\phi)$$.
• The discriminator is a binary classifier that tries to determine whether
• The input $$x$$ is "real", i.e., it came from the training set,
• Or "fake", i.e., it was an output of $$G$$.
• Train both networks simultaneously, against the following loss:

$L(\theta, \phi) = -\mathbb{E}_{z\sim p_z} \log (1-D(G(z;\theta);\phi))$

• This is the cross-entropy loss for the discriminator saying outputs of $$G$$ should be labeled 0.
• What about examples that should be labeled 1?

• Add a second cross-entropy term, for samples $$x$$ from the training set, which the discriminator should label 1:

$L(\theta, \phi) = -\mathbb{E}_{z\sim p_z} \log (1-D(G(z;\theta);\phi)) -\mathbb{E}_{x\sim p_x} \log D(x;\phi)$

• The expectation $$\mathbb{E}_{x\sim p_x}$$ is just an average over the training set.
• The expectation $$\mathbb{E}_{z\sim p_z}$$ is estimated by sampling $$z$$ from the known $$p_z$$ at each iteration.

$\theta = \arg \max_{\theta} \min_\phi L(\theta, \phi)$

• You are optimizing the discriminator to succeed in telling real and fake samples apart, i.e., to minimize the loss.
• And training the generator to fool the discriminator, i.e., to maximize the same loss.
• How do you solve this optimization problem?
• It turns out that it is reasonable to use back-prop and gradient descent.
• Just compute gradients of the loss with respect to both the discriminator and generator parameters.
• Subtract them from the discriminator's parameters, add them to the generator's, as in the sketch after this list.
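A minimal sketch of these updates on a toy 1-D problem, written with manual SGD steps so the opposite-signed updates to $$\phi$$ and $$\theta$$ are explicit; the toy networks, data, and step size are illustrative:

```python
import torch
import torch.nn as nn

G = nn.Linear(1, 1)                                # toy generator G(z; theta)
D = nn.Sequential(nn.Linear(1, 1), nn.Sigmoid())   # toy discriminator D(x; phi)
lr, eps = 1e-3, 1e-8

for step in range(1000):
    z = torch.randn(64, 1)                 # z ~ p_z
    x_real = 3.0 + torch.randn(64, 1)      # stand-in samples from p_x

    # L(theta, phi) = -E log(1 - D(G(z))) - E log D(x)
    loss = -(1.0 - D(G(z)) + eps).log().mean() - (D(x_real) + eps).log().mean()

    G.zero_grad(); D.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in D.parameters():           # discriminator: gradient descent on L
            p -= lr * p.grad
        for p in G.parameters():           # generator: gradient ascent on the same L
            p += lr * p.grad
```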

# Theoretical Analysis

• For a given input $$x$$, what should the optimal output of your discriminator $$D(x)$$ be?
• Say you know $$p_x(x)$$.
• You also know $$p_z(z)$$ and $$G$$, and therefore $$p_g(x)$$: the probability of $$x$$ being an output of the generator.

$q = D(x) = \arg \min_{q} -p_g(x) \log(1-q) - p_x(x) \log q$

• What $$q$$ minimizes this, for $$q \in [0,1]$$?

$q = \frac{p_x(x)}{p_g(x) + p_x(x)}$
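Filling in the step (not spelled out above): setting the derivative of the pointwise objective with respect to $$q$$ to zero,

$\frac{d}{dq}\left[-p_g(x) \log(1-q) - p_x(x) \log q\right] = \frac{p_g(x)}{1-q} - \frac{p_x(x)}{q} = 0 \quad\Rightarrow\quad q\,\left(p_g(x) + p_x(x)\right) = p_x(x),$

which gives the value above; the objective is convex in $$q$$ on $$(0,1)$$, so this stationary point is the minimizer.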

• Let's replace $$D$$ with this optimal value in the loss function, and figure out what $$G$$ should do.

$G = \arg \max_G \min_D -\mathbb{E}_{z\sim p_z} \log (1-D(G(z))) -\mathbb{E}_{x\sim p_x} \log D(x)$

$G = \arg \min_G \mathbb{E}_{z\sim p_z} \log \frac{p_g(G(z))}{p_x(G(z)) + p_g(G(z))} + \mathbb{E}_{x\sim p_x}\log \frac{p_x(x)}{p_x(x)+p_g(x)}$

• Remember that $$p_g$$ also depends on $$G$$. In fact, you can recast this as an optimization over $$p_g$$.

$p_g = \arg \min \int_x \left[ p_g(x) \log \frac{p_g(x)}{p_x(x) + p_g(x)} + p_x(x) \log \frac{p_x(x)}{p_x(x) + p_g(x)} \right] dx$

• You can relate this to KL-divergences (the exact bookkeeping is spelled out after this list):

$KL\left(p_g \| \frac{p_g+p_x}{2}\right)+KL\left(p_x \| \frac{p_g+p_x}{2}\right)$

• Called the "Jensen-Shannon" Divergence
• Minimized when $$p_g$$ matches $$p_x$$.
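The bookkeeping connecting the last two displays (not spelled out above): writing $$m = (p_x + p_g)/2$$, so that $$p_x + p_g = 2m$$,

$\int_x p_g(x) \log \frac{p_g(x)}{p_x(x)+p_g(x)}\, dx = \int_x p_g(x) \log \frac{p_g(x)}{m(x)}\, dx - \log 2 = KL(p_g \| m) - \log 2,$

and similarly for the $$p_x$$ term. So the objective above equals $$KL(p_g\|m) + KL(p_x\|m) - 2\log 2$$: the KL sum minus a constant, which is why minimizing one minimizes the other.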

# Practical Concerns

• So the procedure is: set up your generator and discriminator networks, and define the loss (a sketch of the resulting training loop follows this list).
• At each iteration, pick a batch of $$z$$ values from a known distribution (typically a vector of uniform or Gaussian random values),
• And a batch of training samples.
• Compute gradients for the discriminator and update.
• Compute gradients for the generator, by back-propagating through the discriminator, and update.
• A common issue is that the discriminator has a much "easier" task:
• In the initial iterations, your generator will be producing junk.
• It is very easy for the discriminator to identify fake samples with high confidence.
• At that point, $$\log(1-D(G(z)))$$ will saturate.
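A sketch of this loop under typical choices; the architectures, optimizer, and hyperparameters below are illustrative, not a reference implementation from the course:

```python
import torch
import torch.nn as nn

z_dim, x_dim = 64, 784   # e.g., flattened 28x28 images
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()
eps = 1e-8

def train_step(x_real):
    n = x_real.shape[0]

    # Discriminator step: real samples labeled 1, generated samples labeled 0.
    z = torch.randn(n, z_dim)
    x_fake = G(z).detach()        # don't back-propagate into G on this step
    loss_D = bce(D(x_real), torch.ones(n, 1)) + bce(D(x_fake), torch.zeros(n, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: back-propagate through D.
    # Original objective for G: minimize E log(1 - D(G(z))) -- this can saturate early on.
    z = torch.randn(n, z_dim)
    loss_G = torch.log(1.0 - D(G(z)) + eps).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```

In practice the generator step typically uses the non-saturating $$-\log D(G(z))$$ objective discussed next.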

• One common approach: optimize $$G$$ with respect to a different loss.
• Instead of $$\max\, -\log(1-D(G(z)))$$,
• Do $$\min\, -\log D(G(z))$$.
• View this as minimizing the cross-entropy with respect to the wrong label, rather than maximizing it with respect to the true label (see the snippet after this list).
• Other approaches:
• Reduce capacity of discriminator
• Make fewer updates to discriminator, or have lower learning rate
• Provide additional losses to generator to help it train: e.g., separate network that predicts intermediate features of the discriminator.
• Other losses: See Wasserstein GANs.
• Also need to be careful how you use Batch Normalization. (Don't let the discriminator use batch statistics to tell real and fake apart!)
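A small sketch contrasting the two generator objectives, assuming the discriminator outputs a probability; the function and variable names are illustrative:

```python
import torch

def generator_loss(d_fake, non_saturating=True, eps=1e-8):
    """d_fake = D(G(z)): discriminator outputs on generated samples."""
    if non_saturating:
        # min -log D(G(z)): keeps a strong gradient even when D confidently says "fake".
        return -(d_fake + eps).log().mean()
    # Original min-max objective for G: min log(1 - D(G(z)))
    # (equivalent to max -log(1 - D(G(z)))); its gradient vanishes when d_fake is near 0.
    return (1.0 - d_fake + eps).log().mean()
```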

• Conditional GANs: now we want to sample from $$p(x|s)$$ for a given $$s$$.
• Same adversarial setting, but $$s$$ is given as an input to both the generator and the discriminator:
• $$G(z,s)$$ and $$D(x,s)$$ (see the sketch after this list).
• Or use no noise at all: this is called an "adversarial loss". The generator produces a deterministic output, but is trained with a distribution-matching loss rather than $$L_1$$ or $$L_2$$.
• This can be useful when the true $$p(x|s)$$ is multi-modal.
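A minimal sketch of one common way to condition both networks, by concatenating $$s$$ to their inputs; the dimensions and architectures are illustrative:

```python
import torch
import torch.nn as nn

z_dim, s_dim, x_dim = 32, 10, 784

# G(z, s): the noise and the conditioning input are concatenated.
G = nn.Sequential(nn.Linear(z_dim + s_dim, 256), nn.ReLU(), nn.Linear(256, x_dim), nn.Tanh())
# D(x, s): the discriminator also sees s, so it judges whether x is plausible for that s.
D = nn.Sequential(nn.Linear(x_dim + s_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

s = torch.zeros(8, s_dim); s[:, 3] = 1.0      # e.g., a one-hot label as the conditioning s
z = torch.randn(8, z_dim)
x_fake = G(torch.cat([z, s], dim=1))          # samples from the model's p(x|s)
score = D(torch.cat([x_fake, s], dim=1))      # trained with the same adversarial loss
```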