Rutherford’s model of the atom, though incorrect, provides a basis for comprehending the world of subatomic physics. In his model, electrons ‘orbit’ the nucleus of the atom, analogous to the orbit of planets around a star. A true artificial intelligence must be capable of forming such analogies. Here, I examine a process which may allow a neural network with a mixture of experts to perform such a task.
A Mixture of Experts?
‘Mixture of Experts’ is a variety of neural network in which numerous clusters of neurons, each an ‘expert’ at some task, work together. Depending upon the input to the network, information is routed to some of these clusters and not others. Breaking the process down: an input arrives, and it is fed into a classifier; that classifier determines which experts are needed; the input (and perhaps some metadata from the classification process) is fed to the requisite experts; and those experts process the data and provide their output.
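That routing process can be sketched in a few lines of NumPy. This is a toy illustration, not a real library: the class, the dimensions, and the top-2 routing rule are all invented for the example, and the ‘experts’ are just random linear maps.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MoELayer:
    """Minimal mixture-of-experts layer: a linear 'classifier' (the gate)
    scores every expert, and the input is routed only to the top-k."""

    def __init__(self, dim, n_experts, k=2):
        self.gate = rng.normal(size=(n_experts, dim)) * 0.1   # routing classifier
        self.experts = [rng.normal(size=(dim, dim)) * 0.1
                        for _ in range(n_experts)]            # expert clusters
        self.k = k

    def forward(self, x):
        scores = softmax(self.gate @ x)        # which experts fit this input?
        chosen = np.argsort(scores)[-self.k:]  # route to the k best matches
        # the chosen experts process the data; outputs blend by gate score
        out = sum(scores[i] * (self.experts[i] @ x) for i in chosen)
        return out, chosen

layer = MoELayer(dim=8, n_experts=4)
y, chosen = layer.forward(rng.normal(size=8))
```

Only the two chosen experts do any work here; the other two are skipped entirely, which is the point of the architecture.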
Experts can be chained together, the output of one group of experts feeding into another set of experts higher along the network. In our particular set-up, the final output of this tree of experts is a prediction: “given the input at this instant, along with this information about the past, then some later instant should be this.” When that later moment arrives, that actual occurrence is compared to the prediction. If the prediction was in error, then the network attempts to correct itself by changing either which experts were used, or how those experts do their job. Correcting the error may even necessitate creating a new expert!
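The predict–compare–correct loop above can be sketched as follows. Everything here is hypothetical: the shapes, the error threshold, and the decision rule are invented for illustration, and the ‘experts’ are again random linear maps standing in for trained modules.

```python
import numpy as np

rng = np.random.default_rng(1)
experts = [rng.normal(size=(4, 4)) * 0.1 for _ in range(2)]  # existing experts

def prediction_error(x, actual, expert):
    """Compare an expert's prediction of the later instant to what occurred."""
    return float(np.linalg.norm(expert @ x - actual))

x = rng.normal(size=4)       # "the input at this instant"
actual = rng.normal(size=4)  # "that actual occurrence", one step later

# First, try to correct the error by changing WHICH expert is used...
errors = [prediction_error(x, actual, e) for e in experts]
best = int(np.argmin(errors))

# ...and if even the best available expert is badly wrong, create a new one.
THRESHOLD = 0.5              # hypothetical tolerance
if errors[best] > THRESHOLD:
    experts.append(rng.normal(size=(4, 4)) * 0.1)
```

In a real system the "change how the experts do their job" branch would be a gradient update on the chosen expert; here only the routing and expert-creation choices are shown.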
Suppose that we train a neural network to play many different arcade games. In some games, there are multiple platforms and a character who can run and jump from platform to platform, with the goal of collecting shiny coins. Other games require accelerating and turning a vehicle along a racetrack or among asteroids and aliens. These two types of games have very different dynamics. The meaning of the objects on screen is very different, as are the strategies for success.
Our neural network must recognize its context: “Am I playing Asteroids, or Mario?” At first, it may not be sure. The network makes a guess — “If I’m playing Mario, then the UP button should make a character jump…” It tries to jump, and nothing happens. “Is this some other kind of game? Or, is my character stuck in a place where it can’t jump, though a jump would normally be possible?” The network must explore, testing various predictions, until it determines which sort of game it is playing.
In this example, the neural network can already play a few games. When playing a platform game, it knows which experts should be activated; in racing games, a different set of experts is used. The network contains both kinds of experts, and the initial classifier network determines which ones to use. So, if UP doesn’t cause a character on screen to jump, the network sends an error message back down to the classifier. The classifier takes the original input and feeds it to the experts responsible for racing games, instead. “If this is actually a racing game, does my UP action produce the expected result?” The network can switch contexts when the first guess doesn’t add up.
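The context switch reduces to a simple check: predict the effect of an action under the current hypothesis, and if the observation disagrees, re-route. A toy sketch, with the button-effect table invented for illustration:

```python
# Hypothetical table of what each context's experts predict UP will do.
def effect_of_up(context):
    return {"platformer": "jump", "racing": "accelerate"}[context]

observed = "accelerate"   # what actually happened on screen after pressing UP
context = "platformer"    # the classifier's first guess

if effect_of_up(context) != observed:
    context = "racing"    # prediction failed: route to the other experts
```

After the switch, the racing experts' prediction matches the observation, so the new context sticks.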
But, what if the neural network is actually playing a maze game? There are no characters on screen jumping between platforms! Yet, similar to a racing game, the network views a landscape from first-person perspective and can move left or right — but, it can also move forward and backward, changing what is visible on screen. The maze game is similar to a racing game in some ways, and different in others. The classifier must be updated, routing data to a new set of experts! Though, some of the experts used will be the same as those for a racing game — there is a degree of analogy between the dynamics of a racing game and a maze game.
There are many similarities between the racing games familiar to our neural network, and this new maze game. In the racing games, it could steer to hit glittering green orbs to score extra points and move faster. This maze game has golden triangles which score points, so there is an analogy between the orbs and triangles. The race car lost ‘health’ when it ran into grey blocks; the maze character lost ‘health’ when purple blobs hurled fiery orbs at it. In both cases, the goal is to steer clear of those hazards.
The network can learn these analogies by routing ‘golden triangles’ to the same experts that handled ‘green orbs’, and similarly, routing ‘fiery orbs’ to the expert responsible for ‘grey blocks’. It doesn’t need to learn from scratch — the network can use analogies to adapt its knowledge of racing games to this maze game. The experts involved in both games constitute a set of analogies.
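At its simplest, this analogy-based routing is a lookup: send the new object to whichever expert handled its analogue. A toy sketch, with all names hypothetical:

```python
# Existing routing table from the racing games.
routing = {"green orb": "reward_expert", "grey block": "hazard_expert"}

# Analogies discovered in the maze game: new object -> familiar object.
analogy = {"golden triangle": "green orb", "fiery orb": "grey block"}

# Adapt knowledge by reusing the analogous object's expert.
for new_obj, old_obj in analogy.items():
    routing[new_obj] = routing[old_obj]
```

No expert is retrained here; only the routing table grows, which is why the adaptation is fast.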
This use of analogies, which accelerates learning by adapting knowledge from a similar context, is called Transfer Learning. It is critical for creating an artificial intelligence which can quickly adapt to new circumstances. Ideally, a neural network which masters transfer learning can imagine how to play a game it has never encountered before — the lofty goal of Zero-Shot Learning. The network could reason about its context, and take the correct actions the very first time that it plays the game.
Continuing the example of a maze game: aside from the similarities mentioned above, the maze game differs significantly from the familiar racing game. The neural network can turn around corners, look up and down, and it interacts with barrels and purple blobs, attacking and destroying them. Corners are completely new — a new expert module is needed. Yet, attacking purple blobs is analogous to attacking turtle-ducks in Mario, and breaking barrels for hidden treasure is similar to breaking bricks for coins in Mario. The neural network can use those expert modules adapted to Mario, if it can recognize their similarity. The maze game is analogous to the racing game in some ways, Mario in others. How does the network sort out which analogies to use, and when?
The goal of Zero-Shot Learning necessitates that the neural network imagine when one context or the other is more relevant. Furthermore, the network must identify exceptions to the rules, and find an analogy to fit those exceptions. This is where my analysis becomes a bit more technical…
The Space of Dynamics
Imagine a gigantic cube. Each point within that cube corresponds to some dynamics. One point in the cube corresponds to “A orbiting B”. Another point in the cube represents “A mixed within B”. Another point is “A bouncing around near B”. There are multitudinous points, each representing some relationship between objects. This cube is a Space of Dynamics.
These points in the space of dynamics might be talking about planets and stars — planets do orbit stars, so the first point I mentioned would be appropriate. However, planets are not mixed inside stars, nor are they bouncing around near stars, so the other points in the space of dynamics are inappropriate. Yet, each of these points would be appropriate for a different model of the atom. Rutherford believed that the electron orbits the nucleus. An earlier view, Thomson’s ‘plum pudding’ model, supposed that electrons are mixed within a diffuse sphere of positive charge. And the modern quantum-mechanical model, which superseded Bohr’s orbits, shows that the electron is bouncing around in a cloud of probability, percolating into diverse places like a rainstorm. The Space of Dynamics holds all these possible perspectives within itself.
So, the task of Transfer Learning, by forming analogies between systems’ dynamics, is really a question posed to this Space of Dynamics. Find the point in the space which corresponds to observed behavior. Then, compare this point to the points from other tasks that the neural network has learned. You can ‘project these points onto a subspace’ — you cast their shadows onto a wall of the cube. If your new observation’s shadow is close to the shadow of another task, then that projection constitutes an analogy between the two systems. That analogy is only appropriate for the components of those tasks which are represented by that shadow. The two systems may differ in other ways, though their shadows are close to each other in that one perspective.
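The shadow comparison can be made concrete with NumPy. The three-dimensional coordinates below are invented stand-ins for each game's point in the space of dynamics; only the geometry of the comparison is the point.

```python
import numpy as np

# Toy coordinates: each game's dynamics as a point in a 3-D cube.
racing = np.array([0.9, 0.2, 0.1])
mario  = np.array([0.1, 0.8, 0.3])
maze   = np.array([0.8, 0.7, 0.9])

def shadow(p, axes):
    """Project a point onto the subspace spanned by the chosen axes
    (cast its 'shadow' onto that wall of the cube)."""
    s = np.zeros_like(p)
    s[list(axes)] = p[list(axes)]
    return s

# On the wall spanned by axis 0, the maze's shadow falls near racing's:
d_racing = np.linalg.norm(shadow(maze, [0]) - shadow(racing, [0]))
d_mario  = np.linalg.norm(shadow(maze, [0]) - shadow(mario, [0]))
```

Here `d_racing` is smaller than `d_mario`: in that one perspective, the maze is analogous to the racing game, even though the full points sit far apart.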
Moving away from the wall of shadows, you can observe where those two points differ in space. That difference is another component of dynamics. The maze game was similar to the racing game in many ways, yet more similar to Mario in others. The distance between the points in the space of dynamics represents the ways that the maze game differed from the racing game. So, the neural network seeks an analogy that explains that difference — the Mario game suffices for most of those differences, meaning that the Mario game’s projection and the maze game’s projection were close to each other in the ways where mazes and racing differ.
Some aspects of the maze game were unique, too — those correspond to differences in the placement of the points which cannot be matched by casting a shadow. Only those unique aspects need to be learned, by adding a few new experts (trained, as usual, with gradient descent via back-propagation). These new experts do not disrupt or impinge upon the other experts, avoiding the perennial problem of catastrophic forgetting found in most neural networks. I know of no better way to train an artificial intelligence to learn many tasks, form analogies between them, and efficiently explore its expectations. Perhaps we’ll see an implementation soon. :)
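As a closing sketch, here is why the new experts leave the old ones untouched: the old experts' weights are frozen, and gradient descent updates only the new expert. The dimensions, learning rate, and target dynamics below are all invented for the toy.

```python
import numpy as np

rng = np.random.default_rng(2)
old_experts = [rng.normal(size=(4, 4)) for _ in range(3)]
frozen = [e.copy() for e in old_experts]   # snapshot, to verify nothing changes
new_expert = np.zeros((4, 4))

lr = 0.1
for _ in range(100):                       # train ONLY the new expert
    x = rng.normal(size=4)
    target = 2.0 * x                       # the maze game's unique dynamics (toy)
    err = new_expert @ x - target
    new_expert -= lr * np.outer(err, x)    # gradient step on the new expert alone

# old_experts are bit-for-bit identical to their snapshots: no forgetting.
```

Because no gradient ever flows into the old experts, everything previously learned about racing games and Mario survives intact.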