
Intro

An autoencoder is a method for non-linear dimensionality reduction. We need it when we want to extract informative features or reduce noise. We do this by artificially reducing the dimensionality of the data or by adding some sort of regularization. Depending on that choice, we can distinguish these types of models:

Plain (undercomplete) autoencoder

There is a deterministic function $f_{\theta}$ from space $X$ to space $Z$. Objective: maximize the mutual information $I(X, Z)$. So,

$$
\begin{array}{rlr}
\max I(X,Z) & = \max\left[ H(X) - H(X|Z) \right] \\
& = \max\left[ -H(X|Z) \right] & H(X)\ \text{is a constant} \\
& = \max\,\mathbb{E}_{x,z \sim P(x,z)}\left[ \log P(x|z) \right] \\
& = \max\,\mathbb{E}_{x \sim P(x)}\left[ \log P(x|Z = f_{\theta}(x)) \right] & X \rightarrow Z\ \text{is deterministic}
\end{array}
$$

Let’s take another distribution $Q$ and use it to approximate the original distribution $P$.

$$
\begin{aligned}
\mathbb{E}_{x \sim P(x)}\left[ \log P(x|Z = f_{\theta}(x)) \right]
& = \mathbb{E}_{x \sim P(x)}\left[ \log P(x|Z = f_{\theta}(x)) \frac{Q(x|Z = f_{\theta}(x))}{Q(x|Z = f_{\theta}(x))} \right] \\
& = \mathbb{E}_{x \sim P(x)}\left[ \log\frac{P(x|Z = f_{\theta}(x))}{Q(x|Z = f_{\theta}(x))} + \log Q(x|Z = f_{\theta}(x)) \right] \\
& = \mathbb{E}_{x \sim P(x)}\left[ \log\frac{P(x|Z = f_{\theta}(x))}{Q(x|Z = f_{\theta}(x))} \right] + \mathbb{E}_{x \sim P(x)}\left[ \log Q(x|Z = f_{\theta}(x)) \right] \\
& = \mathrm{KL}\left[ P(X|Z) \,\|\, Q(X|Z) \right] + \mathbb{E}_{x \sim P(x)}\left[ \log Q(x|Z = f_{\theta}(x)) \right] \\
& \geq \mathbb{E}_{x \sim P(x)}\left[ \log Q(x|Z = f_{\theta}(x)) \right]
\end{aligned}
$$

The last step holds because the KL divergence is non-negative; the bound becomes an equality when the KL divergence is zero, i.e. when $Q(X|Z) = P(X|Z)$.

We can’t directly optimize the whole sum, but we can maximize the last term. By doing so we implicitly pull the distribution $Q$ closer to $P$.
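In practice, if we take $Q(x|Z)$ to be a Gaussian with fixed variance centered at a decoder output, maximizing this last term reduces to minimizing the usual mean-squared reconstruction error. Below is a minimal sketch under that assumption (PyTorch, with hypothetical layer sizes; not a prescription of the derivation above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal undercomplete autoencoder sketch (assumed sizes: 784 -> 32 -> 784).
# With Q(x|z) Gaussian of fixed variance, maximizing E[log Q(x|Z=f(x))]
# is equivalent, up to constants, to minimizing the MSE reconstruction error.
class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(            # f_theta: X -> Z
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(            # mean of Q(x|z)
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)                      # the bottleneck enforces undercompleteness
        return self.decoder(z)

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 784)                         # stand-in batch; real data goes here
loss = F.mse_loss(model(x), x)                   # -log Q(x|z) up to constants
opt.zero_grad()
loss.backward()
opt.step()
```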

Variational Autoencoder

The idea is to maximize $p_{\theta}(x)$. If we manage that, we can use the model to sample $x \in \mathcal{X}$. There are two problems. First, we can’t estimate it directly due to dimensionality. Second, it is hard to sample from this distribution. So we add an assumption that our samples are generated from a hidden (latent) variable $z$ through the distribution $p(x|z)$. Again, there is a problem with estimation, so we rely on another distribution $q(z)$ and try to “fit” it to the true one using variational methods.

There are some philosophical considerations as to why we should optimize $p(x)$ rather than the generating function itself; some methods (like GANs) optimize the generator directly.

Below is the derivation using importance sampling. We want to find the parameters $\theta$ of the distribution $p_{\theta}(x)$. Let’s decompose it:

$$
\begin{array}{rlr}
p_{\theta}(x) & = \int_{\mathcal{Z}} p_{\theta}(x,z)\,dz = \int_{\mathcal{Z}} p_{\theta}(x|z)\,p_{\theta}(z)\,dz \\
& = \mathbb{E}_{z \sim p_{\theta}(z)}\left[ p_{\theta}(x|z) \right] & \text{definition of expectation} \\
& = \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ p_{\theta}(x|z)\frac{p_{\theta}(z)}{q_{\varphi}(z|x)} \right] & \text{importance sampling } p_{\theta}(z) \rightarrow q_{\varphi}(z|x)
\end{array}
$$
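As a numerical sanity check of this identity, here is a toy sketch with an assumed model (for illustration only, not from the derivation): $p(z) = \mathcal{N}(0,1)$ and $p(x|z) = \mathcal{N}(z,1)$, so $p(x) = \mathcal{N}(0,2)$ in closed form, and a deliberately non-exact proposal $q(z|x) = \mathcal{N}(x/2, 1)$:

```python
import numpy as np
from scipy.stats import norm

# Toy check of p(x) = E_{z ~ q(z|x)}[ p(x|z) p(z) / q(z|x) ].
# Assumed model: p(z) = N(0, 1), p(x|z) = N(z, 1)  =>  p(x) = N(0, 2).
# q(z|x) = N(x/2, 1) is not the exact posterior; any q covering the support works.
rng = np.random.default_rng(0)
x = 1.3
z = rng.normal(loc=x / 2, scale=1.0, size=100_000)             # z ~ q(z|x)
w = norm.pdf(x, loc=z) * norm.pdf(z) / norm.pdf(z, loc=x / 2)  # importance weights
print(w.mean())                       # Monte Carlo estimate of p(x), ~0.185
print(norm.pdf(x, scale=np.sqrt(2)))  # exact p(x) = N(x; 0, 2) density
```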

Taking the $\log$ of the importance-sampling form:

$$
\begin{array}{rlr}
\log p_{\theta}(x) & = \log\mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ p_{\theta}(x|z)\frac{p_{\theta}(z)}{q_{\varphi}(z|x)} \right] \\
& \geq \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z)\frac{p_{\theta}(z)}{q_{\varphi}(z|x)} \right] & \text{Jensen's inequality} \\
& = \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) \right] - \mathrm{KL}\left[ q_{\varphi}(z|x) \,\|\, p_{\theta}(z) \right] & \text{definition of KL divergence}
\end{array}
$$

The inequality follows from Jensen’s inequality $f(\mathbb{E}[x]) \geq \mathbb{E}[f(x)]$ for concave $f$ (for convex $f$ it flips to $\leq$); here $f = \log$ is concave. For example, with $x$ uniform on $\{1, 3\}$: $\log\frac{1+3}{2} = \log 2 \approx 0.69 \geq \frac{\log 1 + \log 3}{2} \approx 0.55$. Our final objective function:

$$
\mathcal{L}_{\theta,\varphi}(D) = \sum_{x_{i} \in D}\mathcal{L}_{\theta,\varphi}(x_{i}), \qquad
\mathcal{L}_{\theta,\varphi}(x) = \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) \right] - \mathrm{KL}\left[ q_{\varphi}(z|x) \,\|\, p_{\theta}(z) \right]
$$

So we reformulated the problem from direct maximization to optimizing a lower bound (the evidence lower bound, ELBO). The tightness of this bound is quantified below. We can optimize it using gradient descent.
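On that tightness: the gap has a closed form. Writing $\log p_{\theta}(x) = \log\frac{p_{\theta}(x|z)\,p_{\theta}(z)}{p_{\theta}(z|x)}$ and taking the expectation over $q_{\varphi}(z|x)$ gives

$$
\log p_{\theta}(x) - \mathcal{L}_{\theta,\varphi}(x) = \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log\frac{q_{\varphi}(z|x)}{p_{\theta}(z|x)} \right] = \mathrm{KL}\left[ q_{\varphi}(z|x) \,\|\, p_{\theta}(z|x) \right] \geq 0,
$$

so the bound is tight exactly when $q_{\varphi}(z|x)$ matches the true posterior $p_{\theta}(z|x)$.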

Taking the derivative with respect to $\theta$:

$$
\begin{aligned}
& \nabla_{\theta}\left( \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) \right] - \mathrm{KL}\left[ q_{\varphi}(z|x) \,\|\, p_{\theta}(z) \right] \right) \\
& = \nabla_{\theta}\,\mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) \right] - \nabla_{\theta}\,\mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log q_{\varphi}(z|x) - \log p_{\theta}(z) \right] \\
& = \nabla_{\theta}\,\mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) + \log p_{\theta}(z) \right] \\
& = \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \nabla_{\theta}\log p_{\theta}(x|z) + \nabla_{\theta}\log p_{\theta}(z) \right] \\
& \approx \frac{1}{n}\sum_{i = 1}^{n}\left[ \nabla_{\theta}\log p_{\theta}(x|z_{i}) + \nabla_{\theta}\log p_{\theta}(z_{i}) \right]
\end{aligned}
$$

where the last sum goes over samples $z_{i}$ from $q_{\varphi}(z|x)$; the gradient moves inside the expectation because $q_{\varphi}$ does not depend on $\theta$. Taking the derivative with respect to $\varphi$ requires the reparametrization trick (see stochastic gradients):

$$
\begin{aligned}
& \nabla_{\varphi}\left( \mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) \right] - \mathrm{KL}\left[ q_{\varphi}(z|x) \,\|\, p_{\theta}(z) \right] \right) \\
& = \nabla_{\varphi}\,\mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) + \log p_{\theta}(z) - \log q_{\varphi}(z|x) \right]
\end{aligned}
$$

We can’t simply move the gradient inside the expectation because the sampling distribution itself depends on $\varphi$. So we use reparametrization: assuming we can “push the randomness out”, we express a sample $z \sim q_{\varphi}(z|x)$ as a deterministic function of $x$, $\varphi$ and a random variable $\varepsilon$ drawn from a fixed distribution $p_{\varepsilon}$, i.e. $z = g(x, \varphi, \varepsilon)$.
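A common concrete choice (an assumption here; the derivation does not require it) is a diagonal Gaussian encoder, for which the trick takes the form

$$
q_{\varphi}(z|x) = \mathcal{N}\left( z;\ \mu_{\varphi}(x),\ \operatorname{diag}\sigma_{\varphi}^{2}(x) \right), \qquad
g(x, \varphi, \varepsilon) = \mu_{\varphi}(x) + \sigma_{\varphi}(x) \odot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I),
$$

so all the randomness lives in $\varepsilon$ and gradients flow through $\mu_{\varphi}$ and $\sigma_{\varphi}$.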

$$
\begin{aligned}
& \nabla_{\varphi}\,\mathbb{E}_{z \sim q_{\varphi}(z|x)}\left[ \log p_{\theta}(x|z) + \log p_{\theta}(z) - \log q_{\varphi}(z|x) \right] \\
& = \nabla_{\varphi}\,\mathbb{E}_{\varepsilon \sim p_{\varepsilon}}\left[ \log p_{\theta}(x|g(x,\varphi,\varepsilon)) + \log p_{\theta}(g(x,\varphi,\varepsilon)) - \log q_{\varphi}(g(x,\varphi,\varepsilon)|x) \right] \\
& = \mathbb{E}_{\varepsilon \sim p_{\varepsilon}}\left[ \nabla_{\varphi}\left( \log p_{\theta}(x|g(x,\varphi,\varepsilon)) + \log p_{\theta}(g(x,\varphi,\varepsilon)) - \log q_{\varphi}(g(x,\varphi,\varepsilon)|x) \right) \right] \\
& \approx \frac{1}{n}\sum_{i = 1}^{n} \nabla_{\varphi}\left( \log p_{\theta}(x|g(x,\varphi,\varepsilon_{i})) + \log p_{\theta}(g(x,\varphi,\varepsilon_{i})) - \log q_{\varphi}(g(x,\varphi,\varepsilon_{i})|x) \right)
\end{aligned}
$$

where the last sum goes over samples $\varepsilon_{i} \sim p_{\varepsilon}$. Finally, we have everything needed to train the model. The only things left are to specify the distributions $p_{\theta}(z)$, $p_{\theta}(x|z)$, $q_{\varphi}(z|x)$ and the encoder and decoder parametrizations (normally neural networks).
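As one concrete instantiation of those choices (assumed here, not prescribed by the derivation): standard normal prior $p_{\theta}(z) = \mathcal{N}(0, I)$, diagonal Gaussian encoder, Bernoulli decoder, a single-sample estimate of the reconstruction term, and the Gaussian KL computed in closed form instead of by sampling. A minimal PyTorch sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal VAE sketch with assumed standard choices:
# p(z) = N(0, I), q(z|x) = N(mu(x), diag sigma^2(x)), p(x|z) = Bernoulli(decoder(z)).
class VAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=20):
        super().__init__()
        self.enc = nn.Linear(in_dim, 400)
        self.mu = nn.Linear(400, latent_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(400, latent_dim)  # log-variance of q(z|x)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 400), nn.ReLU(),
            nn.Linear(400, in_dim),               # logits of Bernoulli p(x|z)
        )

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                # eps ~ p_eps = N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps    # z = g(x, phi, eps)
        return self.dec(z), mu, logvar

def neg_elbo(x, logits, mu, logvar):
    # -E[log p(x|z)], estimated with the single reparametrized sample z
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL[q(z|x) || N(0, I)] in closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                           # stand-in batch in [0, 1]; real data goes here
logits, mu, logvar = model(x)
loss = neg_elbo(x, logits, mu, logvar)
opt.zero_grad()
loss.backward()
opt.step()
```

Autograd then produces exactly the two gradient estimates derived above: the $\theta$-gradient flows through the decoder, and the $\varphi$-gradient flows through $\mu_{\varphi}$ and $\log\sigma_{\varphi}^{2}$ via the reparametrized sample.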