Much of machine intelligence can be cut into two broad categories: trees and holographs. A decision tree presents a series of choices, A or B, C or D, working its way to the leaves of the tree. Trees categorize, using explicit separations. Holographic networks are different. In an image-classifying neural network, for example, every neuron is involved to varying degrees in the process of classifying images. Dropout, a popular regularizer for training these neural networks, randomly eliminates neurons during training. A network trained this way is highly redundant, and information is distributed across the bulk of its connections. No one neuron codes for a specific thing.
I argue that holographic networks are easier to find than decision trees, being thermodynamically preferred among the possible weights of a network. First, some background:
Neural Networks contain both Trees and Holographs
A dense (fully connected) network, with connections between every pair of neurons in adjacent layers, can have weights on its neural connections such that it simulates a decision tree. Weights can also be assigned such that the same network is a redundant, distributed classifier: a holographic memory. Both trees and holographs exist in the space of possible neural synapse weights, the range of possibilities.
Imagine that space of neural weights as a landscape, with peaks, ridges, and valleys. The process of initializing the connections of a neural network, and then training the network on a data set, is akin to dropping a boulder somewhere on that landscape and watching where it rolls until it settles in some valley. That valley is a local minimum of the loss, one solution to the classification problem.
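To make the boulder picture concrete, here is a minimal sketch of gradient descent on a one-dimensional landscape. The loss function, the learning rate, and the starting points are all my own assumptions, chosen only for illustration; a real network has millions of weights rather than one.

```python
import math

def loss(w):
    # Hypothetical landscape: a broad bowl with ripples that create many valleys.
    return 0.1 * w * w + math.sin(3.0 * w)

def grad(w):
    # Derivative of the loss above.
    return 0.2 * w + 3.0 * math.cos(3.0 * w)

def descend(w, lr=0.01, steps=2000):
    # The "boulder" repeatedly takes a small step downhill.
    for _ in range(steps):
        w -= lr * grad(w)
    return w

starts = [-4.0, -1.0, 0.5, 3.0]          # assumed initializations
finals = [descend(w) for w in starts]    # where each boulder comes to rest

for s, f in zip(starts, finals):
    print(f"start {s:+.2f} -> settles at {f:+.3f}, loss {loss(f):+.3f}")
```

Different starting points settle into different valleys, which is the sense in which the outcome of training depends on where the boulder is dropped.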
A decision tree is a solution to the classification problem, so a multitude of valleys on our landscape correspond to decision trees. Similarly, there are many holographic memories available, so numerous valleys are holographs, as well. To understand which of these is thermodynamically preferred, we must consider what the landscape looks like near each of these valleys…
Near each particular valley on our landscape, the surrounding hills may be jagged and steep or smooth and regular. If a small change to the network creates a large change in the result, then that valley is rugged and irregular — the steep cliffs correspond to the large change in the network’s result. However, if a small change to the network creates an even smaller change in the output, then that valley is smooth and broad.
It is simple to demonstrate that decision trees are in rugged valleys, while holographic memories are in smooth vales. In a decision tree, if one of the criteria is shifted even a small amount, the reclassified inputs will tend to cluster as large errors in a few outputs. A small change to the decision tree generates a large change in results; the landscape around decision trees is steep and irregular.
Meanwhile, Dropout demonstrates that a large change in a holographic network produces only a small change in results: at a typical rate of 0.5, Dropout silences about half of the neurons on every training step, and the network still learns to give the correct answer! So, many of the changes in that region of the weight space have little or no impact on outputs. Around the holographic valley, the landscape is smooth.
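The Dropout argument above can be sketched with a toy readout. This is not a trained network: I assume a single linear unit summing 1000 inputs, and compare a distributed code (every input carries a small share of the answer) against a localized code (one input carries all of it). The names `distributed`, `localized`, and `spread` are hypothetical, and the rescaling of the surviving units is the usual "inverted dropout" convention.

```python
import random

def dropout_output(weights, inputs, keep_prob, rng):
    # Drop each unit with probability 1 - keep_prob and rescale the survivors.
    total = 0.0
    for w, x in zip(weights, inputs):
        if rng.random() < keep_prob:
            total += w * x / keep_prob
    return total

rng = random.Random(0)
n = 1000
inputs = [1.0] * n
distributed = [1.0 / n] * n          # every unit carries a small share
localized = [1.0] + [0.0] * (n - 1)  # a single unit carries everything

def spread(weights, trials=200):
    # Standard deviation of the output across random dropout masks.
    outs = [dropout_output(weights, inputs, 0.5, rng) for _ in range(trials)]
    mean = sum(outs) / trials
    return (sum((o - mean) ** 2 for o in outs) / trials) ** 0.5

spread_distributed = spread(distributed)
spread_localized = spread(localized)
print("distributed code, output spread:", round(spread_distributed, 3))
print("localized code, output spread:  ", round(spread_localized, 3))
```

Without dropout both codes output exactly 1.0. Under dropout the distributed code barely moves, while the localized code swings between zero and double its intended value depending on whether its one important unit survived.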
I cannot say which is more common, decision tree valleys or holographs, though I strongly suspect that holographs are radically more abundant among possible neural weights. However, I can show that, because of the ruggedness of decision trees and the smoothness of holographic memories, holographs are much more likely to be an outcome of training. The network’s loss function presses synaptic weights toward holographs naturally, as a kind of entropic state.
When a valley is surrounded by steep cliffs, it is doubly hard to find. First, a steep valley is small, so it is unlikely that a randomly chosen initial weight matrix will land within its domain. Second, when initial weights land along those steep cliffs, the gradient is large and the updated weights are moved far away. The cliffs bounce the network away from the valley entirely! Decision trees, being steep valleys, are unlikely to be found.
Conversely, a smooth valley benefits from exactly the same qualities. A smooth valley is wide, so it is likely that random initialization will land within that valley’s reach. And, the gradient on a smooth valley is small, so updated weights are likely still within that valley. The smoothness of holographic memories makes them likely outcomes of training.
So, if you had one hundred initializations begin near decision trees, and one hundred near holographs, then the ones near holographs would settle into that holograph’s valley, while those near decision trees would bounce away. All else being equal, holographic memories will be the product of training networks almost every time.
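Both effects can be seen in a deliberately contrived one-dimensional sketch. I assume a landscape made of two equally deep parabolic valleys, one with curvature 100 (steep) and one with curvature 1 (smooth), and run plain gradient descent from uniformly random starting points. The centers, curvatures, learning rates, and starting range are all my own choices for illustration.

```python
import random

SHARP_CENTER, SHARP_CURV = -3.5, 100.0  # narrow, steep valley (assumed)
FLAT_CENTER, FLAT_CURV = 0.0, 1.0       # wide, gentle valley (assumed)

def grad(w):
    # The landscape is the lower of two parabolas; follow whichever is lower.
    sharp = 0.5 * SHARP_CURV * (w - SHARP_CENTER) ** 2
    flat = 0.5 * FLAT_CURV * (w - FLAT_CENTER) ** 2
    if sharp < flat:
        return SHARP_CURV * (w - SHARP_CENTER)
    return FLAT_CURV * (w - FLAT_CENTER)

def run(lr, trials=1000, steps=500, seed=0):
    # Count which valley each random initialization ends up closest to.
    rng = random.Random(seed)
    counts = {"sharp": 0, "flat": 0}
    for _ in range(trials):
        w = rng.uniform(-4.0, 4.0)
        for _ in range(steps):
            w -= lr * grad(w)
        near_sharp = abs(w - SHARP_CENTER) < abs(w - FLAT_CENTER)
        counts["sharp" if near_sharp else "flat"] += 1
    return counts

small_step = run(lr=0.005)  # step small enough to settle in either valley
large_step = run(lr=0.05)   # step too large for the steep valley's walls
print("small step:", small_step)
print("large step:", large_step)
```

With the small step, only the runs that start in or roll into the narrow valley stay there, roughly a tenth of them under these assumptions. With the larger step, each update inside the steep valley overshoots its center and lands farther away than it started, so every run is thrown out and ends in the smooth valley. The caveat is that the bounce depends on the step size: a small enough step lets the steep valley keep whatever lands in it.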
Holographic is Dense
A decision tree parses each branch with a separate discriminator. Branch A may split into C and D, while branch B splits into E and F. Each split is handled by a distinct feature. So, if a decision tree has 7 binary splits, it can best arrange them into a tree with three layers and four final bifurcations, encoding eight possible outcomes.
Meanwhile, a holographic memory utilizes each neuron for every classification. A neuron which distinguishes between pugs and collies may also distinguish between corvettes and civics. That’s equivalent to a tree where each layer of bifurcation is handled by a single neuron. Branch A and B are both followed by the same neuron, which distinguishes both C from D and E from F. Seven distinctions would correspond to seven layers of bifurcation: 2 to the 7th power, or 128 encoded outcomes. The holograph’s 128 distinct categories far exceed the decision tree’s eight.
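The counting behind that comparison is short enough to write down. The function names here are hypothetical; the first counts the leaves of a binary tree in which every split is a separate discriminator, and the second counts the combinations of yes/no answers when every discriminator applies to every input.

```python
def tree_outcomes(num_splits):
    # A binary tree with s internal splits always has s + 1 leaves.
    return num_splits + 1

def shared_outcomes(num_features):
    # Each shared feature answers yes or no for every input,
    # so the answers combine into 2 ** n distinct codes.
    return 2 ** num_features

for n in (3, 7, 10):
    print(n, "discriminators:", tree_outcomes(n), "vs", shared_outcomes(n))
# With 7 discriminators: 8 outcomes for the tree, 128 for the shared code.
```

The gap grows quickly: the tree's capacity is linear in the number of discriminators, while the shared code's capacity is exponential.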
So, miraculously, the nature of neural networks is to tend toward the most efficient and resilient form of intelligence — a holographic memory. The landscape of possible synaptic weights is dominated by broad, steady valleys with holographs at their center. It is most natural, entropic, to fall into those places with the greatest powers of distinction. The cosmos made this one easy for us, both to evolve such a system of intelligence within ourselves, and to allow us to design a similar sort of intelligence for our own purposes.