Intro
An autoencoder is a method for non-linear dimensionality
reduction. We need it when we want to extract informative features or
reduce noise. We do it by artificially reducing the dimensionality of
the data or by adding some sort of regularization. Depending on this, we
can distinguish these types of models:
Plain (undercomplete) autoencoder
There is a function from space $X$ to space $Z$.
Objective: maximize the mutual information $I(X; Z)$.
So,

$$I(X; Z) = H(X) - H(X \mid Z) = H(X) + \mathbb{E}_{p(x, z)}\left[\log p(x \mid z)\right].$$
Let’s take another distribution $q(x \mid z)$ and use it to approximate the original distribution $p(x \mid z)$:

$$\mathbb{E}_{p(x, z)}\left[\log p(x \mid z)\right] = \mathbb{E}_{p(z)}\left[ \mathrm{KL}\!\left( p(x \mid z) \,\|\, q(x \mid z) \right) \right] + \mathbb{E}_{p(x, z)}\left[\log q(x \mid z)\right] \ge \mathbb{E}_{p(x, z)}\left[\log q(x \mid z)\right].$$
The last step is possible because the KL divergence is non-negative, and
the bound becomes an equality when the KL divergence is zero.
We can’t directly optimize the sum, but we can maximize the last
term. By doing so we push the distribution $q(x \mid z)$ to be close to
$p(x \mid z)$.
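The undercomplete case can be sketched numerically. Below is a minimal toy example, assuming a purely linear encoder/decoder pair and a mean-squared reconstruction loss (names and sizes are made up for illustration); real autoencoders use non-linear networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in R^5 that actually lie near a 2-D subspace.
basis = rng.normal(size=(2, 5))
X = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 5))

# Undercomplete linear autoencoder: encoder W_e (5 -> 2), decoder W_d (2 -> 5).
W_e = 0.1 * rng.normal(size=(5, 2))
W_d = 0.1 * rng.normal(size=(2, 5))

def loss(W_e, W_d):
    X_hat = X @ W_e @ W_d          # encode, then decode
    return np.mean((X - X_hat) ** 2)

lr, steps = 0.05, 500
first = loss(W_e, W_d)
for _ in range(steps):
    Z = X @ W_e                    # latent codes
    R = Z @ W_d - X                # reconstruction residual
    # Gradients of the MSE loss, up to a constant factor absorbed into lr.
    g_d = Z.T @ R / len(X)
    g_e = X.T @ (R @ W_d.T) / len(X)
    W_d -= lr * g_d
    W_e -= lr * g_e
final = loss(W_e, W_d)
print(first, final)  # reconstruction error should drop substantially
```

The 2-D bottleneck forces the model to discover the subspace the data lives in, which is exactly the "artificially reducing the dimensionality" idea above.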
Variational Autoencoder
The idea is to maximize $p_\theta(x)$.
If we can do that, we can use the model to sample $x \sim p_\theta(x)$.
There are two problems. First, we can’t estimate it directly due to
dimensionality. Second, it is hard to sample from this distribution. So
we add an assumption that we generate our samples from a hidden (latent)
distribution $p(z)$.
Again, there is a problem with estimation. So we rely on another
distribution $q_\phi(z \mid x)$
and try to “fit” it to the true posterior using variational methods.
There are some philosophical considerations about why we should optimize
the likelihood instead of the generating function directly. Some methods
optimize the generator directly (like GANs).
Below is the derivation using importance sampling. We want to find
the parameters $\theta$ of the distribution $p_\theta(x)$.
Let’s decompose it:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz = \mathbb{E}_{q_\phi(z \mid x)}\left[ \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)} \right].$$
Taking the $\log$:

$$\log p_\theta(x) = \log \mathbb{E}_{q_\phi(z \mid x)}\left[ \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)} \right] \ge \mathbb{E}_{q_\phi(z \mid x)}\left[ \log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)} \right].$$
The inequality holds from Jensen’s inequality
$f(\mathbb{E}[x]) \ge \mathbb{E}[f(x)]$,
where $f$ is concave (for a convex $f$
we change the inequality to $\le$).
Our final objective function:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right).$$
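For one common concrete choice — $p(z) = \mathcal{N}(0, I)$ and a diagonal-Gaussian $q_\phi(z \mid x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ — the KL term of the objective has a closed form. The sketch below (with made-up numbers for $\mu$ and $\log \sigma^2$) evaluates it and cross-checks against a plain Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed concrete choice: p(z) = N(0, I), q(z|x) = N(mu, diag(sigma^2)).
mu = np.array([0.5, -0.3])
logvar = np.array([0.2, -0.5])

# Closed-form KL(q || p) for a diagonal Gaussian vs. the standard normal.
kl_closed = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# Monte Carlo check: KL = E_q[log q(z) - log p(z)].
sigma = np.exp(0.5 * logvar)
z = mu + sigma * rng.normal(size=(200_000, 2))      # samples from q
log_q = -0.5 * np.sum((z - mu) ** 2 / sigma**2 + logvar + np.log(2 * np.pi), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # the two estimates should agree closely
```

Having the KL term in closed form means only the reconstruction term needs Monte Carlo estimation during training.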
So we reformulated the problem from maximizing the likelihood directly to
optimizing a lower bound. TODO: the tightness of the lower bound. We can
optimize it using gradient descent.
Taking the derivative with respect to $\theta$ (the KL term does not depend on $\theta$):

$$\nabla_\theta\, \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] = \mathbb{E}_{q_\phi(z \mid x)}\left[\nabla_\theta \log p_\theta(x \mid z)\right] \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\theta \log p_\theta(x \mid z_l),$$

where the last sum goes over samples $z_l$ from $q_\phi(z \mid x)$.
Taking the derivative with respect to $\phi$
requires the reparametrization trick (see stochastic gradients):
we can’t directly move the gradient into the expectation because the
generating probability $q_\phi(z \mid x)$ depends on $\phi$.
So we use reparametrization, assuming we can “push the randomness out” and
express $z$ as a deterministic function of $x$, $\phi$, and a random variable
$\epsilon$ from a fixed distribution $p(\epsilon)$,
i.e. $z = g(x, \phi, \epsilon)$. Writing $f(z)$ for the integrand of the expectation:

$$\nabla_\phi\, \mathbb{E}_{q_\phi(z \mid x)}\left[f(z)\right] = \mathbb{E}_{p(\epsilon)}\left[\nabla_\phi f\!\left(g(x, \phi, \epsilon)\right)\right] \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\phi f\!\left(g(x, \phi, \epsilon_l)\right),$$

where the latter sum goes over samples $\epsilon_l$ of $\epsilon$.
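The trick can be checked on a toy problem where the answer is known. Below is a sketch, assuming the simplest case $f(z) = z^2$ and $q_\phi(z) = \mathcal{N}(\mu, 1)$, so $z = g(\mu, \epsilon) = \mu + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$; analytically $\mathbb{E}[z^2] = \mu^2 + 1$, so the true gradient is $2\mu$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Reparametrization: z = mu + eps is deterministic in mu once eps is drawn,
# so the gradient can move inside the expectation over eps.
mu = 1.5
eps = rng.normal(size=100_000)
z = mu + eps
grad_est = np.mean(2 * z)          # d/dmu f(g(mu, eps)) = 2 * (mu + eps)
print(grad_est)                    # should be close to 2 * mu = 3.0
```

Because the randomness now lives in $\epsilon$, whose distribution does not depend on $\mu$, the Monte Carlo average of pathwise derivatives is an unbiased gradient estimate.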
Finally, we have everything to estimate the model. The only thing left is to
specify the distributions for $p(z)$ and $q_\phi(z \mid x)$,
and the encoder and decoder parametrization (normally neural networks).
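To make the recipe concrete, here is a sketch that assembles the negative of the objective for one batch, under an assumed toy setup: $p(z) = \mathcal{N}(0, I)$, a diagonal-Gaussian encoder, a unit-variance Gaussian decoder, and tiny one-hidden-layer MLPs with made-up sizes. It only builds the loss; a real implementation would backpropagate through it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical tiny setup: x in R^4, z in R^2, one-hidden-layer MLPs.
D, H, K = 4, 8, 2
enc_W1, enc_b1 = rng.normal(0, 0.1, (D, H)), np.zeros(H)
enc_Wm, enc_Wv = rng.normal(0, 0.1, (H, K)), rng.normal(0, 0.1, (H, K))
dec_W1, dec_b1 = rng.normal(0, 0.1, (K, H)), np.zeros(H)
dec_W2, dec_b2 = rng.normal(0, 0.1, (H, D)), np.zeros(D)

def neg_elbo(x):
    # Encoder q_phi(z|x): diagonal Gaussian with MLP-predicted mu, log-variance.
    h = np.tanh(x @ enc_W1 + enc_b1)
    mu, logvar = h @ enc_Wm, h @ enc_Wv
    # Reparametrization: z = mu + sigma * eps, eps ~ N(0, I).
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps
    # Decoder p_theta(x|z): unit-variance Gaussian -> MSE-style term.
    h2 = np.tanh(z @ dec_W1 + dec_b1)
    x_hat = h2 @ dec_W2 + dec_b2
    rec = 0.5 * np.sum((x - x_hat) ** 2, axis=1)   # -log p(x|z) up to a const
    # Closed-form KL(q_phi(z|x) || N(0, I)).
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)
    return np.mean(rec + kl)

x = rng.normal(size=(16, D))
val = neg_elbo(x)
print(val)  # one scalar: the negative lower bound for this batch
```

The two terms mirror the derivation above: a reconstruction term estimated with a single reparametrized sample, and a regularization term keeping $q_\phi(z \mid x)$ close to the prior.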