I can’t predict the future. So, I hope that AI will. Predicting the future is a complicated task; even predicting the trajectory of a bouncy ball is tricky. Yet, we are reasonably good at imagining plausible futures, and we can identify outlandish or impossible chains of events. Neural networks fail at those tasks. In particular, neural networks have trouble working backwards from a goal to construct a series of actions that achieve it. Let’s try to solve that puzzle…
Learning Moment to Moment
In order to work backwards from a goal, our neural network will need to identify plausible chains of events. Then, it must project from the final event to the plausible events immediately prior, chaining backward until it has a few paths to its goal. So, we want a network that can receive the environment’s state at one moment in time, and generate plausible following moments. This can be represented by placing each moment in a state space, and trying to find the vector from one state to the next.
But, there’s a problem: in most cases, our state space is HUGE. Consider the case of video-prediction, where you have a 1280-by-720-pixel image, and you want to predict the next frame in the video. That image has 921,600 pixels, and each pixel has three color channels, for a total of roughly 2.8 million dimensions! In reality, we don’t need to know every single pixel; what we are really concerned with are the people and objects in the video. We want a network that extracts the meaningful information from the video, to predict what the objects and people do next. Our network should extract the relevant features from the image, and that feature space can be much smaller.
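The arithmetic is worth spelling out, using the frame size from the example above:

```python
# Raw dimensionality of a single 1280x720 RGB video frame.
width, height = 1280, 720
pixels = width * height        # pixels in one frame
dimensions = pixels * 3        # three color channels per pixel

print(pixels)      # 921600
print(dimensions)  # 2764800, i.e. roughly 2.8 million raw dimensions
```

A learned feature space, by contrast, might be only a few hundred dimensions wide, thousands of times smaller than the raw frame.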
So, in our frame-by-frame prediction task, we would take two frames of our video, one following the other. We extract features from each of those images, and we can now locate the two images in a feature space. The vector between them, moving from one moment’s features to the next, is what we want to predict. Once we can predict those ‘moment-to-moment’ vectors, we can follow those vectors backwards from a goal. That is how we can generate plausible paths to reach the goal!
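A rough sketch of that setup, where a fixed random projection stands in for a learned feature extractor and the tiny frame size is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM = 4 * 4 * 3   # a toy 4x4 RGB "frame" (real frames are millions of dims)
FEATURE_DIM = 8         # the much smaller feature space

# Stand-in feature extractor: a fixed random projection.
# In practice this would be a trained neural network.
W = rng.standard_normal((FEATURE_DIM, FRAME_DIM))

def encode(frame):
    """Map a flattened frame into feature space."""
    return W @ frame

frame_t  = rng.standard_normal(FRAME_DIM)   # the frame at one moment
frame_t1 = rng.standard_normal(FRAME_DIM)   # the frame that follows it

# The moment-to-moment vector we want to learn to predict:
delta = encode(frame_t1) - encode(frame_t)
print(delta.shape)  # (8,)
```

Following such vectors backwards, from a goal state to its plausible predecessors, is the chaining described above.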
What Does Success Look Like?
When we reduce one moment and the next to sets of features, our network succeeds when it predicts the features of the next moment. We’re not asking our network to memorize every pixel and re-create the following video frame — it just needs to know about the states of the relevant objects and people. To measure this, we compare the predicted features (from our network’s prediction of the next moment) to the features extracted from the actual next moment.
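One simple way to score that comparison is a distance in feature space. This sketch uses mean squared error, with made-up feature values:

```python
import numpy as np

# Hypothetical features: what the network predicted for the next moment,
# versus what was actually extracted from the next frame.
predicted_features = np.array([0.9, 0.1, -0.4])
actual_features    = np.array([1.0, 0.0, -0.5])

# Mean squared error between the two feature sets.
loss = float(np.mean((predicted_features - actual_features) ** 2))
print(round(loss, 4))  # 0.01
```

A small loss means the prediction landed near the true next moment in feature space; no pixel ever needs to be reconstructed.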
But, how do we know that the features extracted are really relevant? We need to dig a little deeper into the nuances of feature vectors and feature spaces. Consider two images that return almost identical features — they are located close to one another in the feature space. Yet, when the network checks the feature vector of each image, the vectors point to very different futures. That means there is some feature which distinguishes these two images that is not being measured.
For example, many videos show people walking, and the expected future is that they keep walking, following a pattern. However, many other videos show people falling down. If our network does not have an “about to fall over” feature, then it will locate these disparate events in the same place in the feature space, despite their divergent futures. Once an “about to fall over” feature is detected, those two images suddenly occupy separate locations in the feature space.
This leads to a good generalization: if two feature vectors are very different, we hope that their starting points are located far apart in the feature space; conversely, if two feature vectors are very similar, we are happy if their sources are located near each other. That correspondence is what we need to maximize.
Maximizing Local Agreement
This part isn’t your grandma’s machine learning.
Most ML researchers would say “great, you have feature vectors, which are the lines connecting one moment’s extracted features to the next moment’s extracted features… train a neural network to predict those vectors, and you’re done!” That would take the features we’ve already got, and build a function predicting which vectors are attached to those features. It doesn’t build the features for us. The researcher might respond, “Fine, train a network end-to-end, that learns the features, predicts the next moment’s features, and then compares the predicted features to the actual next moment’s features.” That doesn’t work, because the network is creating the features, and it is evaluating its accuracy in terms of those same features. A network that only extracts a feature that is always at “1”, and only predicts a future with that same feature at “1”, would always be right. A network that evaluates success by its own rules will cheat, so we need to be a bit more mischievous.
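The cheat is easy to demonstrate. In this sketch, a collapsed encoder maps every frame to the same point, and a do-nothing predictor then scores a perfect loss on every transition (all names here are hypothetical):

```python
import numpy as np

def collapsed_encoder(frame):
    """Degenerate feature extractor: every input maps to the same point."""
    return np.array([1.0])

def collapsed_predictor(features):
    """Degenerate predictor: claims nothing ever changes."""
    return features

# Five random "frames" forming four consecutive transitions.
frames = [np.random.rand(10) for _ in range(5)]
losses = [
    float(np.mean((collapsed_predictor(collapsed_encoder(a))
                   - collapsed_encoder(b)) ** 2))
    for a, b in zip(frames, frames[1:])
]
print(losses)  # every loss is exactly 0.0: "perfect", yet useless
```

The loss is zero on every pair of frames, even though the features carry no information at all. This is the collapse we must design around.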
I offer that we train a network to generate features which maximize local agreement, while minimizing global agreement: points which are near to each other have similar feature vectors, but they differ from points far away. Their levels of disagreement (and conformity) are our loss function.
With that modification, the network is rewarded for identifying features that really matter, and punished for features which cause confusion. The network does not learn which vectors point where (a k-Nearest-Neighbor lookup could find the appropriate vector just as easily). Instead, it learns which features make futures distinct. A map from each moment to the next comes as a byproduct.
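Here is one possible shape for such a loss, under my own assumptions about the details: for each point, disagreement with its nearest neighbors is penalized, while disagreement with the farthest points is rewarded. The points and vectors here are random placeholders for a trained network's outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: positions in feature space, and each point's
# moment-to-moment feature vector.
points  = rng.standard_normal((20, 4))
vectors = rng.standard_normal((20, 4))

def local_agreement_loss(points, vectors, k=3):
    """Sketch of a loss favoring local agreement over global agreement."""
    n = len(points)
    # Pairwise distances between all points in feature space.
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    loss = 0.0
    for i in range(n):
        order = np.argsort(dists[i])
        neighbors = order[1:k + 1]   # nearest points (skipping the point itself)
        strangers = order[-k:]       # farthest points
        # Penalize vector disagreement with near neighbors...
        loss += np.mean(np.linalg.norm(vectors[i] - vectors[neighbors], axis=-1))
        # ...and reward vector disagreement with distant points.
        loss -= np.mean(np.linalg.norm(vectors[i] - vectors[strangers], axis=-1))
    return loss / n

loss_value = local_agreement_loss(points, vectors)
print(loss_value)
```

Minimizing this quantity pushes the encoder to place points with similar futures near each other, and points with divergent futures far apart, which is exactly the property argued for above.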
Why Vectors Matter
In feature space, the feature vector measures the change of those features. When words are compressed to their features, the vectors between them represent a ‘kind of change’. Word-pairs exhibiting similar changes have the same vector. For instance, the change from ‘King’ to ‘Queen’ is the same as the change from ‘Him’ to ‘Her’. Both take the masculine form, and change it into the feminine form. If these word pairs were the ‘first moment’ and ‘second moment’ in our future-predicting network, then their vectors in feature space would be very similar, and we would hope that pairs with similar vectors are located close to one another.
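With hand-made two-dimensional features (one ‘royalty’ axis, one ‘gender’ axis, purely for illustration), the shared change shows up as an identical vector:

```python
import numpy as np

# Toy word features: [royalty, gender] (invented for illustration).
king  = np.array([1.0,  1.0])
queen = np.array([1.0, -1.0])
him   = np.array([0.0,  1.0])
her   = np.array([0.0, -1.0])

# The 'masculine -> feminine' change is the same vector in both pairs:
print(queen - king)  # [ 0. -2.]
print(her - him)     # [ 0. -2.]
```

Real word embeddings learn such axes rather than having them assigned, but the same vector arithmetic holds approximately.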
That is the key insight: our features generate vectors, and their feature coordinates should be similar whenever their feature vectors are similar. This draws disparate pairs toward each other, whenever they exhibit a similar transition. The two kinds of change are similar, so they express a deeper similarity that needs to be measured. Rewarding a network for placing similar-vectors near each other generates that measurement.
Imagine a neural network that is trained to observe coiling smoke, and predict the next frame in the video. It is then allowed to observe drops of ink in water. The behavior is different: smoke will curl around itself many times, while water tends to rapidly smudge the ink. Yet, there are similarities, and a good feature detector would end up creating feature vectors for smoke and ink that are very similar. The training technique that I offer seeks to place those two events, smoke and ink, near to each other in the feature space, because of the similarity of their feature vectors. The network values distinguishing features, not the accuracy of predictions. That difference allows analogy and inference, and it’s a step toward machines that backtrack from a goal to the decisions that get them there.