
Intro

An autoencoder is a method for non-linear dimensionality reduction. We need it when we want to extract informative features or reduce noise. We do this by artificially reducing the dimensionality of the data or by adding some sort of regularization. Depending on that choice, we can distinguish these types of models:

Plain (undercomplete) autoencoder

There is a deterministic function $f_{\theta}$ from space $X$ to space $Z$. Objective: maximize the mutual information $I(X, Z)$. So,

$$
\begin{array}{rlr}
\max I(X,Z) & = \max\left[ H(X) - H(X|Z) \right] \\
& = \max\left[ -H(X|Z) \right] & H(X)\ \text{is a constant} \\
& = \max\,\mathbb{E}_{x,z \sim P(x,z)}\left[ \log P(x|z) \right] \\
& = \max\,\mathbb{E}_{x \sim P(x)}\left[ \log P(x|Z = f_{\theta}(x)) \right] & X \rightarrow Z\ \text{is deterministic}
\end{array}
$$

Let’s take another distribution $Q$ and use it to approximate the original distribution $P$.

$$
\begin{aligned}
\mathbb{E}_{x \sim P(x)}\left[ \log P(x|Z = f_{\theta}(x)) \right]
& = \mathbb{E}_{x \sim P(x)}\left[ \log P(x|Z = f_{\theta}(x)) \frac{Q(x|Z = f_{\theta}(x))}{Q(x|Z = f_{\theta}(x))} \right] \\
& = \mathbb{E}_{x \sim P(x)}\left[ \log\frac{P(x|Z = f_{\theta}(x))}{Q(x|Z = f_{\theta}(x))} + \log Q(x|Z = f_{\theta}(x)) \right] \\
& = \mathbb{E}_{x \sim P(x)}\left[ \log\frac{P(x|Z = f_{\theta}(x))}{Q(x|Z = f_{\theta}(x))} \right] + \mathbb{E}_{x \sim P(x)}\left[ \log Q(x|Z = f_{\theta}(x)) \right] \\
& = \mathrm{KL}\left[ P(X|Z) \,\|\, Q(X|Z) \right] + \mathbb{E}_{x \sim P(x)}\left[ \log Q(x|Z = f_{\theta}(x)) \right] \\
& \geq \mathbb{E}_{x \sim P(x)}\left[ \log Q(x|Z = f_{\theta}(x)) \right]
\end{aligned}
$$

The last step holds because the KL divergence is non-negative; the bound becomes an equality when the KL divergence is zero, i.e. when $Q(X|Z) = P(X|Z)$.

We can’t directly optimize the whole sum, but we can maximize the last term. By doing so we implicitly pull the distribution $Q$ closer to $P$.
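In practice, if we take $Q(x|Z)$ to be a Gaussian with fixed variance centered at a decoder output, maximizing this last term reduces to minimizing the usual mean-squared reconstruction error. Below is a minimal sketch under that assumption (PyTorch, with hypothetical layer sizes; not a prescription of the derivation above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal undercomplete autoencoder sketch (assumed sizes: 784 -> 32 -> 784).
# With Q(x|z) Gaussian of fixed variance, maximizing E[log Q(x|Z=f(x))]
# is equivalent, up to constants, to minimizing the MSE reconstruction error.
class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(            # f_theta: X -> Z
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(            # mean of Q(x|z)
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)                      # the bottleneck enforces undercompleteness
        return self.decoder(z)

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 784)                         # stand-in batch; real data goes here
loss = F.mse_loss(model(x), x)                   # -log Q(x|z) up to constants
opt.zero_grad()
loss.backward()
opt.step()
```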

Variational Autoencoder

The idea is to maximize $p_{\theta}(x)$. If we manage that, we can use the model to sample $x \in \mathcal{X}$. There are two problems. First, we can’t estimate it directly due to dimensionality. Second, it is hard to sample from this distribution. So we add an assumption that our samples are generated from a hidden (latent) variable $z$ through the distribution $p(x|z)$. Again, there is a problem with estimation, so we rely on another distribution $q(z)$ and try to “fit” it to the true one using variational methods.

There are some philosophical considerations as to why we should optimize $p(x)$ rather than the generating function itself; some methods (like GANs) optimize the generator directly.

Below is the derivation using importance sampling. We want to find the parameters $\theta$ of the distribution $p_{\theta}(x)$. Let’s decompose it:

$$
\begin{array}{rlr}
p_{\theta}(x) & = \int_{\mathcal{Z}} p_{\theta}(x,z)\,dz = \int_{\mathcal{Z}} p_{\theta}(x|z)\,p_{\theta}(z)\,dz \\
& = \mathbb{E}_{z \sim p_{\theta}(z)}\left[ p_{\theta}(x|z) \right] & \text{definition of expectation} \\
& = \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ p_{\theta}(x|z)\frac{p_{\theta}(z)}{q_{\varphi}(z|x)} \right] & \text{importance sampling } p_{\theta}(z) \rightarrow q_{\varphi}(z|x)
\end{array}
$$
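As a numerical sanity check of this identity, here is a toy sketch with an assumed model (for illustration only, not from the derivation): $p(z) = \mathcal{N}(0,1)$ and $p(x|z) = \mathcal{N}(z,1)$, so $p(x) = \mathcal{N}(0,2)$ in closed form, and a deliberately non-exact proposal $q(z|x) = \mathcal{N}(x/2, 1)$:

```python
import numpy as np
from scipy.stats import norm

# Toy check of p(x) = E_{z ~ q(z|x)}[ p(x|z) p(z) / q(z|x) ].
# Assumed model: p(z) = N(0, 1), p(x|z) = N(z, 1)  =>  p(x) = N(0, 2).
# q(z|x) = N(x/2, 1) is not the exact posterior; any q covering the support works.
rng = np.random.default_rng(0)
x = 1.3
z = rng.normal(loc=x / 2, scale=1.0, size=100_000)             # z ~ q(z|x)
w = norm.pdf(x, loc=z) * norm.pdf(z) / norm.pdf(z, loc=x / 2)  # importance weights
print(w.mean())                       # Monte Carlo estimate of p(x), ~0.185
print(norm.pdf(x, scale=np.sqrt(2)))  # exact p(x) = N(x; 0, 2) density
```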

Taking the $\log$ of the importance-sampling form:

$$
\begin{array}{rlr}
\log p_{\theta}(x) & = \log\mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ p_{\theta}(x|z)\frac{p_{\theta}(z)}{q_{\varphi}(z|x)} \right] \\
& \geq \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z)\frac{p_{\theta}(z)}{q_{\varphi}(z|x)} \right] & \text{Jensen's inequality} \\
& = \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) \right] - \mathrm{KL}\left[ q_{\varphi}(z|x) \,\|\, p_{\theta}(z) \right] & \text{definition of KL divergence}
\end{array}
$$

The inequality follows from Jensen’s inequality $f(\mathbb{E}[x]) \geq \mathbb{E}[f(x)]$ for concave $f$ (for convex $f$ it flips to $\leq$); here $f = \log$ is concave. For example, with $x$ uniform on $\{1, 3\}$: $\log\frac{1+3}{2} = \log 2 \approx 0.69 \geq \frac{\log 1 + \log 3}{2} \approx 0.55$. Our final objective function:

$$
\mathcal{L}_{\theta,\varphi}(D) = \sum_{x_{i} \in D}\mathcal{L}_{\theta,\varphi}(x_{i}), \qquad
\mathcal{L}_{\theta,\varphi}(x) = \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) \right] - \mathrm{KL}\left[ q_{\varphi}(z|x) \,\|\, p_{\theta}(z) \right]
$$

So we reformulated the problem from direct maximization to optimizing a lower bound (the evidence lower bound, ELBO). The tightness of this bound is quantified below. We can optimize it using gradient descent.
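On that tightness: the gap has a closed form. Writing $\log p_{\theta}(x) = \log\frac{p_{\theta}(x|z)\,p_{\theta}(z)}{p_{\theta}(z|x)}$ and taking the expectation over $q_{\varphi}(z|x)$ gives

$$
\log p_{\theta}(x) - \mathcal{L}_{\theta,\varphi}(x) = \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log\frac{q_{\varphi}(z|x)}{p_{\theta}(z|x)} \right] = \mathrm{KL}\left[ q_{\varphi}(z|x) \,\|\, p_{\theta}(z|x) \right] \geq 0,
$$

so the bound is tight exactly when $q_{\varphi}(z|x)$ matches the true posterior $p_{\theta}(z|x)$.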

Taking the derivative with respect to $\theta$:

$$
\begin{aligned}
& \nabla_{\theta}\left( \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) \right] - \mathrm{KL}\left[ q_{\varphi}(z|x) \,\|\, p_{\theta}(z) \right] \right) \\
& = \nabla_{\theta}\,\mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) \right] - \nabla_{\theta}\,\mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log q_{\varphi}(z|x) - \log p_{\theta}(z) \right] \\
& = \nabla_{\theta}\,\mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) + \log p_{\theta}(z) \right] \\
& = \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \nabla_{\theta}\log p_{\theta}(x|z) + \nabla_{\theta}\log p_{\theta}(z) \right] \\
& \approx \frac{1}{n}\sum_{i = 1}^{n}\left[ \nabla_{\theta}\log p_{\theta}(x|z_{i}) + \nabla_{\theta}\log p_{\theta}(z_{i}) \right]
\end{aligned}
$$

where the last sum goes over samples $z_{i}$ from $q_{\varphi}(z|x)$; the gradient moves inside the expectation because $q_{\varphi}$ does not depend on $\theta$. Taking the derivative with respect to $\varphi$ requires the reparametrization trick (see stochastic gradients):

$$
\begin{aligned}
& \nabla_{\varphi}\left( \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) \right] - \mathrm{KL}\left[ q_{\varphi}(z|x) \,\|\, p_{\theta}(z) \right] \right) \\
& = \nabla_{\varphi}\,\mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) + \log p_{\theta}(z) - \log q_{\varphi}(z|x) \right]
\end{aligned}
$$

We can’t simply move the gradient inside the expectation because the sampling distribution itself depends on $\varphi$. So we use reparametrization: assuming we can “push the randomness out”, we express a sample $z \sim q_{\varphi}(z|x)$ as a deterministic function of $x$, $\varphi$ and a random variable $\varepsilon$ drawn from a fixed distribution $p_{\varepsilon}$, i.e. $z = g(x, \varphi, \varepsilon)$.
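A common concrete choice (an assumption here; the derivation does not require it) is a diagonal Gaussian encoder, for which the trick takes the form

$$
q_{\varphi}(z|x) = \mathcal{N}\left( z;\ \mu_{\varphi}(x),\ \operatorname{diag}\sigma_{\varphi}^{2}(x) \right), \qquad
g(x, \varphi, \varepsilon) = \mu_{\varphi}(x) + \sigma_{\varphi}(x) \odot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I),
$$

so all the randomness lives in $\varepsilon$ and gradients flow through $\mu_{\varphi}$ and $\sigma_{\varphi}$.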

$$
\begin{aligned}
& \nabla_{\varphi}\,\mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) + \log p_{\theta}(z) - \log q_{\varphi}(z|x) \right] \\
& = \nabla_{\varphi}\,\mathbb{E}_{\varepsilon \sim p_{\varepsilon}}\left[ \log p_{\theta}(x|g(x,\varphi,\varepsilon)) + \log p_{\theta}(g(x,\varphi,\varepsilon)) - \log q_{\varphi}(g(x,\varphi,\varepsilon)|x) \right] \\
& = \mathbb{E}_{\varepsilon \sim p_{\varepsilon}}\left[ \nabla_{\varphi}\left( \log p_{\theta}(x|g(x,\varphi,\varepsilon)) + \log p_{\theta}(g(x,\varphi,\varepsilon)) - \log q_{\varphi}(g(x,\varphi,\varepsilon)|x) \right) \right] \\
& \approx \frac{1}{n}\sum_{i = 1}^{n} \nabla_{\varphi}\left( \log p_{\theta}(x|g(x,\varphi,\varepsilon_{i})) + \log p_{\theta}(g(x,\varphi,\varepsilon_{i})) - \log q_{\varphi}(g(x,\varphi,\varepsilon_{i})|x) \right)
\end{aligned}
$$

where the last sum goes over samples $\varepsilon_{i} \sim p_{\varepsilon}$. Finally, we have everything needed to train the model. The only things left are to specify the distributions $p_{\theta}(z)$, $p_{\theta}(x|z)$, $q_{\varphi}(z|x)$ and the encoder and decoder parametrizations (normally neural networks).
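As one concrete instantiation of those choices (assumed here, not prescribed by the derivation): standard normal prior $p_{\theta}(z) = \mathcal{N}(0, I)$, diagonal Gaussian encoder, Bernoulli decoder, a single-sample estimate of the reconstruction term, and the Gaussian KL computed in closed form instead of by sampling. A minimal PyTorch sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal VAE sketch with assumed standard choices:
# p(z) = N(0, I), q(z|x) = N(mu(x), diag sigma^2(x)), p(x|z) = Bernoulli(decoder(z)).
class VAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=20):
        super().__init__()
        self.enc = nn.Linear(in_dim, 400)
        self.mu = nn.Linear(400, latent_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(400, latent_dim)  # log-variance of q(z|x)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 400), nn.ReLU(),
            nn.Linear(400, in_dim),               # logits of Bernoulli p(x|z)
        )

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                # eps ~ p_eps = N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps    # z = g(x, phi, eps)
        return self.dec(z), mu, logvar

def neg_elbo(x, logits, mu, logvar):
    # -E[log p(x|z)], estimated with the single reparametrized sample z
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL[q(z|x) || N(0, I)] in closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                           # stand-in batch in [0, 1]; real data goes here
logits, mu, logvar = model(x)
loss = neg_elbo(x, logits, mu, logvar)
opt.zero_grad()
loss.backward()
opt.step()
```

Autograd then produces exactly the two gradient estimates derived above: the $\theta$-gradient flows through the decoder, and the $\varphi$-gradient flows through $\mu_{\varphi}$ and $\log\sigma_{\varphi}^{2}$ via the reparametrized sample.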