## WHAT is an Autoencoder?

Autoencoders are a special class of neural networks that learn to **regenerate their input** from a limited number of learned features. In its most basic form, an autoencoder has an input layer, a hidden layer, and an output layer that produces a reconstruction of the input.

To extract useful features, the hidden layer generally has fewer units than the input and output layers, and thus forms the bottleneck of the network. By bottleneck I mean that limiting the number of units in the hidden layer constrains the number of features the autoencoder can learn, which forces it to reconstruct the input from these limited features.
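As a rough illustration of the bottleneck, here is a minimal forward pass through an untrained autoencoder in plain Python. The layer sizes (8 inputs, 3 hidden units), the sigmoid activation, the random initialization, and the helper names `dense` and `init_layer` are all assumptions made for this sketch; they are not taken from a particular library.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, weights, biases, activation):
    """Apply one fully connected layer: activation(W x + b)."""
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary choice for the sketch)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

rng = random.Random(0)
n_input, n_hidden = 8, 3  # the hidden layer is the bottleneck

enc_w, enc_b = init_layer(n_input, n_hidden, rng)   # encoder: 8 -> 3
dec_w, dec_b = init_layer(n_hidden, n_input, rng)   # decoder: 3 -> 8

x = [rng.random() for _ in range(n_input)]
code = dense(x, enc_w, enc_b, sigmoid)               # 3 learned features
reconstruction = dense(code, dec_w, dec_b, sigmoid)  # same size as the input
```

The network here is untrained, so `reconstruction` will not resemble `x`; the point is only that everything the decoder sees has to pass through the three values in `code`.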

## WHY do we use Autoencoders?

Feature extraction in this manner is particularly interesting. We can use it to reduce the dimensionality of the input, i.e. limit the feature space and extract key features without yet having labels for the inputs (images, for example), and then use these features as the input to a classifier instead of the complete input. We call this **unsupervised feature extraction**.

## HOW do we use Autoencoders?

There are a number of flavors of autoencoders, but two considerations are common: a) minimizing the loss between input and output and b) limiting the number of active neurons in the encoding layer to ensure some feature learning.

**Calculating Loss**

The most common way to calculate the **loss** for an autoencoder is the mean squared error (MSE) or the cross-entropy loss between the input and the reconstructed input.
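Both losses can be written in a few lines. The sketch below assumes the input and the reconstruction are equal-length lists of numbers, and, for the cross-entropy variant, that values lie in [0, 1] (for example normalized pixel intensities). The function names and the `eps` clamp are choices made for this example.

```python
import math

def mse(x, x_hat):
    """Mean squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def binary_cross_entropy(x, x_hat, eps=1e-12):
    """Element-wise binary cross-entropy, averaged over the vector.

    Reconstructed values are clamped to (eps, 1 - eps) to avoid log(0).
    """
    total = 0.0
    for a, b in zip(x, x_hat):
        b = min(max(b, eps), 1.0 - eps)
        total += -(a * math.log(b) + (1.0 - a) * math.log(1.0 - b))
    return total / len(x)

x = [0.0, 1.0, 1.0, 0.0]
x_hat = [0.5, 0.5, 0.5, 0.5]  # a maximally unsure reconstruction

print(mse(x, x_hat))
print(binary_cross_entropy(x, x_hat))
```

MSE is the usual choice for real-valued inputs, while cross-entropy is typically paired with a sigmoid output layer and inputs scaled to [0, 1].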

To **prevent the model from overfitting the training data**, whether because of a small number of instances or for any other reason, we usually add a regularization term on the weights to the loss function. Lambda (λ) is our weight decay parameter, i.e. the weight given to the regularization term.
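A minimal sketch of a regularized loss, assuming the weights are stored as a list of rows and that the penalty is simply added to the reconstruction loss as `loss + lam * penalty`. Some formulations scale the L2 term by 1/2 or average it over the number of weights; that detail is not specified here, so treat the exact form as an assumption.

```python
def l1_penalty(weights):
    """Sum of absolute weight values."""
    return sum(abs(w) for row in weights for w in row)

def l2_penalty(weights):
    """Sum of squared weight values."""
    return sum(w * w for row in weights for w in row)

def regularized_loss(reconstruction_loss, weights, lam, penalty):
    """Reconstruction loss plus lam times the chosen weight penalty."""
    return reconstruction_loss + lam * penalty(weights)

weights = [[0.5, -1.0], [0.0, 2.0]]
loss_l1 = regularized_loss(0.25, weights, lam=0.5, penalty=l1_penalty)
loss_l2 = regularized_loss(0.25, weights, lam=0.5, penalty=l2_penalty)
```

A larger `lam` makes the optimizer care more about keeping the weights small and less about reconstructing the training data exactly.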

__L1 Regularization:__

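L1 regularization penalizes the absolute value of each weight w. The gradient of the absolute value function is either -1 or 1 (the sign of w), so every update moves w towards 0 by a constant amount regardless of how large w is, and, taking the gradient at 0 to be 0, leaves w unchanged once it is at 0. This way L1 regularization encourages sparsity in the extracted features, i.e. it limits the number of features.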

__L2 Regularization:__

L2 regularization also drives w towards 0, but more slowly: the derivative of the square function is linear in w, so as w gets smaller, so does the change applied to it. With L2 regularization, weights therefore become small but may never reach exactly 0.
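The difference between the two penalties is easy to see on a single weight. The sketch below applies only the regularization gradient (no data term) to a weight starting at 1.0. The step size of 0.25 is chosen so that it divides the starting value evenly; with an arbitrary step, plain L1 updates would overshoot and hover around 0 rather than land on it exactly.

```python
def sign(w):
    """Return -1, 0 or 1, taking the gradient of |w| at 0 to be 0."""
    return (w > 0) - (w < 0)

def l1_step(w, step):
    # The gradient of |w| is sign(w): a constant-size move towards 0.
    return w - step * sign(w)

def l2_step(w, step):
    # The gradient of w**2 is 2*w: the move shrinks as w shrinks.
    return w - step * 2.0 * w

w_l1 = w_l2 = 1.0
for _ in range(6):
    w_l1 = l1_step(w_l1, 0.25)
    w_l2 = l2_step(w_l2, 0.25)

print(w_l1)  # reaches 0 after four steps and stays there
print(w_l2)  # halves on every step, so it stays positive
```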

An interesting article on intuition behind __L1 and L2 regularization__.

**Flavors of Autoencoders**

There are several flavors of autoencoder, depending on what we want. The one described above is the vanilla, or basic, version.

Other flavors include multilayer (stacked), sparse, denoising and convolutional autoencoders.

## Stacked Autoencoders

Stacked autoencoders, also known as **multilayered** or **deep autoencoders**, have multiple layers of hidden units. Each layer is **trained separately** using a **greedy layer-wise approach**, and at the end we stack the layers together and fine-tune them.

__Finetuning:__

Because we train a stacked AE with a greedy layer-wise approach, each layer is only optimized on its own, so a fine-tuning step is needed to make the layers work well together. Fine-tuning simply means that, instead of training the layers separately, we update the weights of the entire network together.
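The control flow of greedy layer-wise pretraining can be sketched without any gradient code. In the example below, `train_layer` is a hypothetical callback that fits a single-hidden-layer autoencoder and returns its encoder; `truncate_trainer` is a stand-in that learns nothing and exists only so the sketch runs. The fine-tuning step, which would backpropagate through the whole stack, is not shown.

```python
def greedy_pretrain(data, hidden_sizes, train_layer):
    """Train one autoencoder per hidden size, each on the previous layer's codes.

    train_layer(data, n_hidden) is a hypothetical callback that fits a
    single-hidden-layer autoencoder and returns its encoder as a function
    mapping one input vector to its n_hidden features.
    """
    encoders = []
    current = data
    for n_hidden in hidden_sizes:
        encoder = train_layer(current, n_hidden)
        encoders.append(encoder)
        # The codes of this layer become the training data for the next one.
        current = [encoder(x) for x in current]
    return encoders

def stacked_encode(encoders, x):
    """Run an input through all pretrained encoders in order."""
    for encoder in encoders:
        x = encoder(x)
    return x

def truncate_trainer(data, n_hidden):
    """Stand-in trainer for demonstration only: keeps the first n_hidden values."""
    return lambda x: x[:n_hidden]

data = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]]
encoders = greedy_pretrain(data, hidden_sizes=[4, 2], train_layer=truncate_trainer)
code = stacked_encode(encoders, data[0])  # 6 values -> 4 -> 2
```

In a real setup `train_layer` would run gradient descent on the reconstruction loss for that one layer, and the stacked encoders (with their decoders in reverse order) would then be fine-tuned jointly.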

## Sparse Autoencoders

Another interesting flavor of autoencoder is the sparse autoencoder. In a sparse autoencoder the hidden layer no longer has to be a bottleneck: we can learn interesting features from the input even if the number of hidden units is greater than or equal to the number of input units. This is because we enforce a sparsity constraint that limits the number of units active at any one time. We do this by introducing a sparsity penalty in the loss function. The penalty is at its minimum only when the level of hidden-unit activation matches the level we want, and this is where KL divergence comes in handy. We also introduce beta, which controls the impact of the sparsity penalty on the network, i.e. how strictly we want our model to be sparse.

__Loss:__

We enforce sparsity in an autoencoder by adding a sparsity loss to the original loss function.

### KL Divergence:
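One common formulation, given here as an assumption since the original formula is not shown, compares a target average activation `rho` with the observed average activation `rho_hat` of each hidden unit, using the KL divergence between two Bernoulli distributions: `rho * log(rho / rho_hat) + (1 - rho) * log((1 - rho) / (1 - rho_hat))`. The penalty is summed over hidden units and scaled by beta. The sketch assumes sigmoid activations in (0, 1); the function names are made up for this example.

```python
import math

def kl_divergence(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))

def sparsity_penalty(hidden_activations, rho, eps=1e-8):
    """Sum the KL term over hidden units.

    hidden_activations holds one list of hidden-unit activations per
    training example; rho is the target average activation.
    """
    n_examples = len(hidden_activations)
    n_hidden = len(hidden_activations[0])
    penalty = 0.0
    for j in range(n_hidden):
        rho_hat = sum(h[j] for h in hidden_activations) / n_examples
        rho_hat = min(max(rho_hat, eps), 1.0 - eps)  # keep the logs finite
        penalty += kl_divergence(rho, rho_hat)
    return penalty

def sparse_loss(reconstruction_loss, hidden_activations, rho, beta):
    """Original loss plus beta times the sparsity penalty."""
    return reconstruction_loss + beta * sparsity_penalty(hidden_activations, rho)

# Two examples, two hidden units: unit 0 averages 0.2, unit 1 averages 0.8.
activations = [[0.1, 0.9], [0.3, 0.7]]
penalty = sparsity_penalty(activations, rho=0.2)
```

With a target of 0.2, the first unit contributes almost nothing to the penalty while the second, which is active most of the time, contributes nearly all of it.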

## Denoising Autoencoders

Another flavor, and often a very useful one, is the denoising autoencoder. You add noise to the data and feed the noisy version to the autoencoder as input; the loss is then computed by comparing the network's reconstruction to the original input without noise. Train long enough and the network learns to remove noise from its input.
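The key detail is which two vectors the loss compares. The sketch below shows two common ways to corrupt an input (additive Gaussian noise and random masking, both chosen here as examples) and a loss that scores the network's output against the clean input. The `autoencoder` argument is any callable mapping an input vector to a reconstruction; `identity_network` is a stand-in for a real model.

```python
import random

def add_gaussian_noise(x, stddev, rng):
    """Corrupt the input with zero-mean Gaussian noise."""
    return [xi + rng.gauss(0.0, stddev) for xi in x]

def mask_inputs(x, drop_prob, rng):
    """Corrupt the input by zeroing each value with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]

def mse(target, output):
    return sum((t - o) ** 2 for t, o in zip(target, output)) / len(target)

def denoising_loss(clean, autoencoder, corrupt):
    """Feed the corrupted input in, but score against the clean input."""
    noisy = corrupt(clean)
    output = autoencoder(noisy)
    return mse(clean, output)

def identity_network(x):
    """Stand-in for a network: it passes the noise straight through."""
    return x

rng = random.Random(0)
clean = [0.2, 0.4, 0.6, 0.8]
loss = denoising_loss(clean, identity_network,
                      lambda x: add_gaussian_noise(x, 0.1, rng))
```

Because the target is the clean input, simply copying the input to the output no longer gives zero loss, so the network has to learn structure in the data that lets it undo the corruption.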

This is a very interesting property, because a trained denoising autoencoder can be used to reconstruct partially available data, add interesting features to images and recover lost data.

If you have a relatively small dataset, consider adding more noise to each image to ensure better training.

Denoising Autoencoders with multiple layers are called **stacked denoising autoencoders**.

An interesting article on __Denoising autoencoders__.
