Neural Networks with Abstract Attention

Anthony Repetto
Oct 23, 2017

Attention is among the noteworthy advances in artificial intelligence research. Rather than sending the entire ‘field of view’ into a neural network as vague input, attention highlights certain areas of the input for the network’s focus. An image of a city street, with letters and numbers strewn across the scene, is parsed into small sub-regions, each with its own letter or number needing identification. A sentence in English is translated into French by a neural network that attends to only the relevant words — a pronoun only ‘looks at’ the French gender of the noun it references. Yet, attention has clung to the lowest layer of network architecture: it filters out most of the inputs — the pixels, or words. Attention networks would improve if they filtered higher-layer abstractions, too.
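
To make the input-level version concrete, here is a toy sketch of soft attention in NumPy; the embeddings are random stand-ins and the names are purely illustrative, not any particular translation model:

```python
import numpy as np

def soft_attention(query, keys, values):
    """Toy dot-product attention: weight each input element by its relevance to the query."""
    scores = keys @ query                      # one relevance score per input element
    weights = np.exp(scores - scores.max())    # softmax over the scores, numerically stabilized
    weights /= weights.sum()
    return weights @ values, weights           # the blended 'focused' input, plus the focus itself

# e.g. a pronoun 'attending' over candidate source words (random stand-in embeddings)
rng = np.random.default_rng(0)
keys = values = rng.normal(size=(5, 8))        # five source words, 8-dim embeddings
query = rng.normal(size=8)                     # the word currently being translated
context, weights = soft_attention(query, keys, values)
print(weights.round(2))                        # most of the mass lands on a few relevant words
```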

“Last time this happened…”

Consider how we move our own attention along, traversing our memories in search of relevant experiences. “Oh, that red arrow is tilting… last time this happened, my car started to smolder. I should pull over.” We have a present goal in mind (‘drive in your lane’), yet a portion of our environment’s input is unexpected (‘red arrow tilting’), and we scan for a similar instance (‘last time this happened…’). We keep looking through our memories until we find a past example that was similar in the relevant ways (‘also red arrow tilting’), though other aspects of that memory may differ (‘I was on the road to Bakersfield, that time’). That is, we apply an attention filter to decide which high-level abstractions need to be present in our ‘match’. And, when we find a match, we create a new goal (‘pull over’), which percolates back down to our lower-level planning (‘turn signal, look in right mirror, …’) to get us safely onto the shoulder.

This example demonstrates the multiple layers of attention which need to be active for complex cognition. There is an attention filter that ‘smudges’ most of the visual input — the only ‘un-smudged’ part is the area of our glance while driving. Yet, simultaneously, we have a ‘background attention’ with a much broader view, and it attempts to make vague predictions about what it will see. When those predictions are in error, our ‘focal attention’ moves to those errors — we ‘notice something change out of the corner of our eyes’, and then we ‘glance at it’. This dual format, a ‘focal attention’ for the task at hand and a ‘background attention’ for vague predictions, allows us to train a set of specialized recognition systems for each point of focus (‘driving’ vs ‘threading a needle’) while maintaining a broad prediction system to govern when to move our focus (‘rapid motion’, ‘flashing light’).
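
One hedged way to sketch this dual format in code: a cheap ‘background’ predictor watches a coarse, downsampled view and assumes nothing changes, and wherever that vague prediction fails most, the ‘focal’ window moves. The block size and the do-nothing predictor below are assumptions for illustration only:

```python
import numpy as np

def coarse(frame, block=8):
    """Background view: average-pool the frame into vague, low-resolution blocks."""
    h, w = frame.shape
    return frame[:h - h % block, :w - w % block] \
        .reshape(h // block, block, w // block, block).mean(axis=(1, 3))

def move_focus(prev_frame, new_frame, block=8):
    """Shift focal attention to the coarse block whose vague prediction failed most."""
    predicted = coarse(prev_frame, block)          # naive background prediction: 'nothing changes'
    observed = coarse(new_frame, block)
    error = np.abs(observed - predicted)           # surprise map
    by, bx = np.unravel_index(error.argmax(), error.shape)
    return by * block, bx * block                  # top-left pixel of the new focus window

prev, new = np.zeros((64, 64)), np.zeros((64, 64))
new[40:48, 16:24] = 1.0                            # something moves in the corner of the eye
print(move_focus(prev, new))                       # -> (40, 16): the glance lands on the change
```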

Additionally, our attention applies to memory, and searches our memories until it finds a fit. “Last time this happened…” is a call for memories which were similar in ‘this’ way, though they may differ in all other ways. Our memories store a set of abstractions about events, not the exact pixels. So, when we seek ‘last time’, we are applying an attention filter on the set of high-level abstractions in memory. In the ‘overheating car’ example, we search memories for ‘the last time the red arrow tilted’, while ignoring ‘the last time I was driving on this street’. Most of the high-level abstractions stored in memory are irrelevant — we scan through only a few high-level abstractions at a time. (This is equivalent to a k nearest neighbor search across a small subset of dimensions on an autoencoder's feature vector. Our own memory centers may actually operate much like a high-level kNN in this regard.)
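
As a minimal sketch of that masked memory search, assuming each remembered moment is stored as an autoencoder feature vector (the sizes and the chosen dimensions below are made up for illustration):

```python
import numpy as np

def attended_knn(memories, query, attended_dims, k=3):
    """k-nearest-neighbor search over only the feature dimensions attention cares about."""
    sub_mem = memories[:, attended_dims]          # ignore every other stored abstraction
    sub_query = query[attended_dims]
    dists = np.linalg.norm(sub_mem - sub_query, axis=1)
    return np.argsort(dists)[:k]                  # indices of the most similar past moments

# e.g. 1000 remembered moments, 64 abstract features each; attend to dims 5 and 12 only
rng = np.random.default_rng(1)
memories = rng.normal(size=(1000, 64))
now = rng.normal(size=64)
print(attended_knn(memories, now, attended_dims=[5, 12]))
```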

Follow the Bouncing Ball

Key to attention is deciding what information is irrelevant, as well as what is ambiguous. Our brains seek information that is neither — we want unambiguous and relevant results. Yet, we benefit from recognizing the other cases; something might be ambiguous, but that is alright because it is also irrelevant, while another tidbit of information may be ambiguous and relevant. Uh-oh!

Our curiosity targets those ambiguous and relevant pieces of information, and seeks a prediction by way of analogy. When an object is partially obscured, we actively imagine the part we cannot see. For example, a video displays an actor tossing a red ball into the air. They throw the ball higher and higher, until the ball leaves the field of view. Each time the ball went up, it also fell back down, and so we expect to see the ball fall back into view. If the red ball does not fall back into view, our attention perks up, and our brain focuses its activity on finding an explanation for this disparity. “Did it get stuck up there?” “Did it fall off to one side, and I missed it?” Our attention is sensitized, waiting for more information, hoping to reduce ambiguity.

When we seek an explanation for the red ball’s disappearance, we are asking our brains to imagine explanations, and we continue to imagine new analogies, until we find one that fits. Our attention is scanning a set of ‘memories’ that we generate, and is asking if just a few high-level abstractions match. If we want a neural network to have this power of imagination and satisficing, it needs attention that can filter abstractions, not just filter pixels. What would an implementation look like, in broad strokes?

Painting High-Level Attention

Suppose we have a vision system receiving pixel arrays from a robot-mounted camera, and we are designing a neural network which directs a gripper to pick and place items that it can see. Most of the pixels that the camera receives are irrelevant to a given pick-and-place task. The pixels all along the periphery can be altered without altering the robot’s best sequence of actions. They are irrelevant. So, we seek an attention mechanism which filters out the irrelevant pixels and focuses the neural network on the target object. This is input-level attention, and it has real value. Yet, we also want attention at higher levels of abstraction.
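
A hedged sketch of that input-level filter: a soft spatial mask centered on the target object, fading the peripheral pixels toward zero before the frame reaches the rest of the network. The Gaussian window and frame size are illustrative choices, not a prescribed mechanism:

```python
import numpy as np

def spatial_mask(h, w, center, sigma=20.0):
    """Soft attention window: near 1 at the target object, fading to 0 in the periphery."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

image = np.random.rand(240, 320, 3)            # stand-in camera frame
mask = spatial_mask(240, 320, center=(120, 200))
attended = image * mask[..., None]             # peripheral pixels are smudged toward zero
```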

Abstract Encoding

Each moment of sensory input can be compressed into an extracted feature vector. Autoencoders are the canonical example of this process. They reduce complex environment data into a feature vector, encoding abstract qualities of the input as the activations of each dimension of the vector. For our abstract attention networks, these ‘memories’ of each moment must be searched by the neural network, using an attention filter on the components of the feature vectors. An image of a target object might be encoded into a feature vector with dimensions describing color, general shape, and expected rigidity. Attention, acting on these abstractions, might focus the network on the color component when a human asks for the “red ball”, and instead might filter out all but rigidity when asked to “stack these”. These high-level attention filters are situational and must be learned. Yet, they carve a path toward true cognition.
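
In code, the idea might look like the sketch below. The encoder is treated as a black box, and the named dimension groups (‘color’, ‘shape’, ‘rigidity’) are assumptions about what such an autoencoder might have learned, used purely for illustration:

```python
import numpy as np

FEATURE_GROUPS = {                 # hypothetical layout of a learned 64-dim encoding
    "color":    slice(0, 8),
    "shape":    slice(8, 24),
    "rigidity": slice(24, 32),
}

def task_mask(groups, dim=64):
    """Build an abstract attention filter that keeps only the listed feature groups."""
    mask = np.zeros(dim)
    for g in groups:
        mask[FEATURE_GROUPS[g]] = 1.0
    return mask

z = np.random.rand(64)                               # pretend output of encoder(image)
red_ball_view = z * task_mask(["color", "shape"])    # "hand me the red ball"
stacking_view = z * task_mask(["shape", "rigidity"]) # "stack these"
```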

Kinds of Accuracy

Our attention network must evaluate where to pay attention. Generally, if something can be altered without altering the outcome, then that alteration is irrelevant and can be filtered out. This is true not only for peripheral pixels, but also for ‘peripheral’ dimensions of the encoded feature vector. So, when the network accurately predicts peripheral pixels or features, that accuracy is not counted. Only the accuracy of the filter-selected features matters. (That is, the attention filter also applies to the loss function!)
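
A minimal sketch of ‘the attention filter also applies to the loss function’, assuming predictions and targets are encoded feature vectors and the mask is the same filter used for attention:

```python
import numpy as np

def attended_mse(pred, target, attention_mask, eps=1e-8):
    """Mean squared error counted only on the feature dimensions attention selected."""
    se = attention_mask * (pred - target) ** 2     # peripheral dimensions contribute nothing
    return se.sum() / (attention_mask.sum() + eps)

mask = np.zeros(64); mask[8:32] = 1.0              # e.g. attend to shape + rigidity dims only
loss = attended_mse(np.random.rand(64), np.random.rand(64), mask)
```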

So, suppose the gripper robot is tasked with “moving these items into this box”: it successively migrates its pixel-attention to different items, encoding their feature vectors, and it filters out the features describing the objects’ color, focusing its attention instead on shape and rigidity. As the robot arm attempts to grip and move these items, it might find that its color predictions change radically — the lighting is different as each item leaves its box. Yet, because the robot’s task depends upon shape and rigidity, these inconsistencies in color can be ignored. The neural network is not ‘punished’ for poor color prediction. However, if an item is unexpectedly soft or heavy, the robot arm must attend to that prediction error and seek an explanation.
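
Continuing that sketch, the robot’s runtime ‘surprise’ check could gate on the same filter: color errors pass silently, while an unexpectedly soft or heavy item trips the threshold and triggers a search for an explanation (the threshold value here is an illustrative guess):

```python
import numpy as np

def needs_explanation(predicted_z, observed_z, attention_mask, threshold=0.5):
    """Flag prediction errors only on the feature dimensions the task attends to."""
    error = attention_mask * np.abs(predicted_z - observed_z)
    surprising = error > threshold        # color may drift freely; rigidity may not
    return surprising.any(), surprising   # whether to seek an explanation, and where to look
```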

Explaining Analogs

When our robot gripper is trained, it experiences variations in its environment which do not affect the outcome, and it must learn to distinguish these variations from the features that matter. And, when the network has learned which feature dimensions are relevant, it must also seek to explain any errors in its prediction of those dimensions’ activations. This requires the creation of analogies — where one subset of the encoding’s feature dimensions is mapped onto another subset of feature dimensions. The hope is that one set of characteristics will inform the other, ‘filling in the blanks’.

Structurally, this process of analogy-formation requires the addition of a layer of neurons above the encoding layer, and this ‘analogy layer’ must be trained to find correspondences between subsets of encoded features. In one moment, a vision network’s encoding may register features ‘A’, ‘B’, and ‘C’ as active, and the next moment, it experiences a reward, while a prior moment with only ‘A’ and ‘B’ active had no subsequent reward. Similarly, ‘E’, ‘F’, and ‘G’ were followed by a reward, while ‘E’ and ‘F’ by themselves yielded no reward. The ‘analogy layer’ would seek and compare hypotheses which abstract from all these instances — “A:E, B:F, C:G, and all three must be present”, for example, or the simpler hypothesis, “C and G yield rewards; the other features were incidental.” If a mapping consistently relates one set of experiences to another, the analogy layer is rewarded. (The network learns symmetry operations on components of the feature vectors which impact outcomes.)
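
A very rough sketch of that hypothesis search, treating each remembered moment as a set of active feature labels plus a reward flag, and scoring ‘reward whenever any of these trigger features is present’ hypotheses; the episodes are the toy ones above, and the scoring rule is an assumption (the conjunctive mapping “A:E, B:F, C:G” could be scored in the same fashion):

```python
from itertools import combinations

# Toy episodes from the text: (features active in the encoding, was a reward observed next?)
episodes = [({"A", "B", "C"}, True), ({"A", "B"}, False),
            ({"E", "F", "G"}, True), ({"E", "F"}, False)]

def score(trigger_features):
    """How often 'reward iff any trigger feature is active' matches what actually happened."""
    hits = sum(bool(active & trigger_features) == reward for active, reward in episodes)
    return hits / len(episodes)

features = sorted(set().union(*(active for active, _ in episodes)))
candidates = [set(c) for r in (1, 2, 3) for c in combinations(features, r)]
best = max(candidates, key=score)
print(best, score(best))   # the {C, G} hypothesis scores 1.0: the simpler explanation wins here
```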

Imagining an Explanation

When the image of an object is partially obscured, a neural network can be trained to imagine the obscured portion of the object. Similarly, when a scene is compressed to its encoded feature vector, an analogy allows the network to imagine obscured features. The red ball from our earlier example does not fall past the actor’s hand — it is consistently caught before falling farther. So, when the red ball is thrown up beyond the field of view and does not fall back down, we imagine that it must have been caught off-screen. The abstract concept of ‘caught the ball’ = ‘stops falling’, which was observed on-screen during prior throws, is inferred when the ball does not fall back into view. Just like visual completion of occluded objects, the analogy of ‘caught the ball off-screen’ fills in where information is absent.
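
A hedged sketch of that ‘fill in the blanks’ step: find the stored moment that best matches the features we can still see, and borrow its values for the features we cannot. The shapes and the NaN convention for ‘obscured’ are illustrative:

```python
import numpy as np

def complete_features(memories, partial, attended_dims):
    """Impute obscured feature dimensions from the most analogous remembered moment."""
    known = ~np.isnan(partial)
    dims = [d for d in attended_dims if known[d]]          # attended AND currently visible
    dists = np.linalg.norm(memories[:, dims] - partial[dims], axis=1)
    best = memories[dists.argmin()]
    filled = partial.copy()
    filled[~known] = best[~known]                          # e.g. infer 'caught off-screen'
    return filled
```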

This is a key insight for attention mechanisms and abstraction. By encoding each moment as a feature vector, we can apply attention filters to those feature vectors for nearest neighbor search, as well as complete ambiguous features using analogies. These actions are analogous to visual attention and visual completion, only applied to the abstracted feature vectors.

Binding Qualities of Experiences

If such a network were allowed to experience a rich array of senses, it might find many ‘analogies’ during its personal experience which are spurious artifacts… similar to synesthetes’ binding of colors to numbers, or smells to words. Yet, many of those synesthetic associations provide the intuitions which ‘short-circuit’ complex problems — some number-space synesthetes, for example, reportedly perform mental arithmetic with remarkable speed. And, some painters benefit from a synesthetic binding of color, shape, and sound, creating imagery which seems to ‘sing’. Perhaps, with abstract attention and analogy, machine intelligence could be as intuitive and poetical as our own.
