Overcoming the Vanishing Gradient Problem

I’ll compare typical neural network behavior to a proposed alternative, *covariance*, and explain why and when this alternative may be an improvement. In particular, while deep neural networks suffer from a “vanishing gradient”, covariance may re-invigorate the gradient as it propagates through many layers.

First, a typical neural network’s behavior:

A *forward-pass*, where an input is sent through the neural network, and the output layer activates according to its “guess” of the input’s classification.

A comparison, between the guess and the correct label, which measures “how wrong” the guess was. This is measured using what is called a *loss function*.

A *back-propagation* of “force” upon the weights of each neuron, using gradient descent, to move the network’s weight parameters toward better guesses. This “force” can be imagined beginning as the loss function, and spreading back upstream in proportion to the gradient of the activation function at each neuron.
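These three steps can be sketched in a few lines of NumPy. This is a toy one-layer network with made-up names and a squared-error loss, chosen only to make the forward-pass / loss / backprop cycle concrete; it is not a specific network the article has in mind.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-layer network: 4 inputs -> 1 sigmoid output neuron.
W = rng.normal(size=(4, 1))
x = rng.normal(size=(1, 4))   # a single input example
label = np.array([[1.0]])     # the correct answer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1. Forward pass: the network produces its "guess".
guess = sigmoid(x @ W)

# 2. Loss function: measure "how wrong" the guess was (squared error here).
loss = float(((guess - label) ** 2).mean())

# 3. Back-propagation: the "force" on the weights is the loss gradient,
#    scaled by the gradient of the activation function at the neuron.
grad = x.T @ (2 * (guess - label) * guess * (1 - guess))
W -= 0.1 * grad   # gradient-descent step with learning rate 0.1
```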

The Problem:

While backpropagation by gradient descent is mathematically rigorous, it has a failing in practice: the ‘gradient values’, which dictate how **much** each neuron should change, become smaller and smaller as they are propagated to deeper layers of the neural network. Eventually, there is no “signal” telling deeper neurons to change at all! Only the top-layer neurons are told how to adapt.

To overcome this problem of “vanishing gradients”, I offer an alternative to the typical loss function: *covariance*.

Covariance measures how consistently two values move together as they change. If two neurons are always “on” as a pair, or “off” as a pair, they are said to have *positive* covariance. Similarly, if one neuron is always “on” whenever the other neuron is “off”, they have *negative* covariance.
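The sign convention can be seen directly with made-up activation traces for three hypothetical neurons:

```python
import numpy as np

# Activation levels of three hypothetical neurons across five inputs.
a = np.array([0.9, 0.1, 0.8, 0.2, 0.7])   # neuron A
b = np.array([0.8, 0.2, 0.9, 0.1, 0.6])   # "on"/"off" together with A
c = np.array([0.1, 0.9, 0.2, 0.8, 0.3])   # "on" whenever A is "off"

cov_ab = np.cov(a, b)[0, 1]   # positive: they move as a pair
cov_ac = np.cov(a, c)[0, 1]   # negative: they move in opposition
```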

The covariant model of loss can be illustrated with an example:

You have a neural network which is attempting to classify images as either “cat” or “not-cat”. Its output-layer neuron is measured: an activation of “1” or more signifies a guess that it was given the image of a cat, and anything less than “1” signifies a guess of “not-cat”.

After inputting each image, you retain the activation levels of each neuron, and sort these image → activation maps according to whether the guess was *accurate* or *erroneous*. Consider just the cat images: for every accurately classified cat image, the output neuron was “1” or more, by definition, while for every erroneously classified cat image it was less than “1”, by definition. This means that *among* the accurate batch, the output neuron had highly positive covariance. (They were all above “1”.) And, the erroneous batch had highly positive covariance, too! (They were all below “1”.)
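The sorting step might look like this for a batch of cat images. The activation values are invented, and the threshold of “1” follows the example above:

```python
import numpy as np

# Hypothetical output-neuron activations recorded for 8 cat images.
activations = np.array([1.3, 1.1, 0.4, 1.6, 0.7, 1.2, 0.2, 1.4])

# For cat images, an activation >= 1 is an accurate guess ("cat"),
# and anything below 1 is an erroneous guess ("not-cat").
accurate  = activations[activations >= 1.0]
erroneous = activations[activations < 1.0]
```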

You can see that, though all the accurately classified images are similar to each other (all > 1), AND all the erroneous classifications are similar to each other (all < 1), the two groups are **different**. (“all > 1” != “all < 1”) **That is a negative covariance between the two groups.** When a neuron exhibits negative covariance *between* the accurate and erroneous groups, we’ll say that neuron feels a “force” to change. This is an alternative to the typical loss function.

If we only consider the output-layer neuron, then measuring negative covariance isn’t far from the typical loss function, which compares each guess to the correct label. However, covariance can be computed for ALL neurons. That makes all the difference.
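The text doesn’t pin down how “covariance between the two groups” is computed at a single neuron; one plausible formalization (my assumption, not spelled out in the article) is the covariance between the neuron’s activation and a ±1 batch-membership indicator:

```python
import numpy as np

# Hypothetical output-neuron activations for the two batches.
accurate_acts  = np.array([1.3, 1.1, 1.6, 1.2])
erroneous_acts = np.array([0.9, 0.4, 0.7])

acts  = np.concatenate([accurate_acts, erroneous_acts])
group = np.concatenate([np.ones(4), -np.ones(3)])  # +1 accurate, -1 erroneous

# Large magnitude here means the neuron cleanly separates the two groups;
# flipping the indicator flips the sign, giving the "negative covariance
# between groups" described in the text.
between_group_cov = np.cov(acts, group)[0, 1]
```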

How, specifically:

For each neuron in your cat-classifier network, you can ask: “what was the statistical variance among the activation levels for all accurately-classified images?” This is like asking “for this particular neuron, do images from the accurate batch all have a similar value?” A neuron with low variance there — that is, high positive covariance within the batch — is a likely signifier of ‘cat-ness’. Similarly, compute the variance of activations among erroneously classified images; neurons whose activations scatter widely there may be sources of ‘confusion’.
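The per-neuron variance question can be asked of a whole layer at once. The activation matrices below are invented (rows = images, columns = neurons), with the accurate batch deliberately generated to cluster tightly:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical recorded activations for one layer of 3 neurons:
# a tight cluster for accurately classified images, a loose one for errors.
acc_batch = rng.normal(loc=0.8, scale=0.05, size=(20, 3))
err_batch = rng.normal(loc=0.3, scale=0.40, size=(12, 3))

# Per-neuron variance within each batch: "do images from this batch
# all have a similar value at this neuron?"
acc_var = acc_batch.var(axis=0)
err_var = err_batch.var(axis=0)
```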

Pulling from Everywhere, not just the Output Layer:

By computing the negative covariance *between* the accurate batch and erroneous batch, at each neuron, you have a “force” similar to the loss function. Yet, unlike the typical loss function, this “force” exerts itself from **many** neurons throughout the network. Now, rather than a gradient value ‘vanishing’ as you backpropagate to deeper layers, many neurons can provide a ‘boost’ when backpropagation reaches them.

So, you backpropagate by gradient descent, and values pass deeper into your network, dimming slowly. Intermittently, backprop encounters neurons which *scored large negative covariance*. Simply add this covariance, with a scaling factor, to your gradient! Thus, the gradient is ‘pulled upon’ at many points throughout your network, instead of pulling at just the output layer.
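A minimal sketch of the proposed ‘boost’ at a single deep neuron, assuming the covariance score is added by magnitude with a small scaling factor (both the sign convention and the factor are my assumptions; the article fixes neither):

```python
# The gradient arriving at a deep neuron has "dimmed" to almost nothing,
# but this neuron scored a large negative covariance between batches.
arriving_grad = 1e-6        # nearly vanished by the time backprop gets here
neuron_cov = -0.8           # strongly negative: this neuron "wants" to change
alpha = 0.01                # hypothetical scaling factor for the boost

# Add the scaled covariance magnitude to re-invigorate the gradient.
boosted_grad = arriving_grad + alpha * abs(neuron_cov)
```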

This is important for particularly deep networks, and networks with LSTM-style memory. Neural networks with many layers, or with long memories, struggle to train their *early* mistakes away — those errors happened *too far back* in the network for typical backpropagation to identify them. Negative covariance between accurate and erroneous batches can identify those early mistakes, **no matter how deep the network**.