The key insight is that we are still performing gradient descent, but only along the gradient of the ‘hold-out set’, and only along the component of that gradient which is perpendicular to the gradient of the original data set.

So, we would first need to run a gradient descent step with the original data. The ‘weight updates’ prescribed by that step constitute a very high-dimensional vector (one dimension per synaptic weight!); let’s call that vector v. Then, we would run a gradient descent step using the ‘hold-out’ data; those ‘weight updates’ are another vector in that same high-dimensional space, which we will call u.

Finally, we can compute the component of u which is perpendicular to v. That perpendicular vector tells us, for each synaptic weight (i.e. each dimension of our vector), how much we should update that synaptic weight. To compute the component of u which is perpendicular to v, we simply subtract from u the component which is parallel to v. That parallel component is just the projection of u onto v: take the dot product of u and v, divide it by the squared length of v, and multiply the resulting number by v itself. (This is the same as dotting u with the unit vector in the v direction, then multiplying by that unit vector.) (the link explains this process well, with mathy formulae…)
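For readers who prefer symbols, the subtraction described above can be written in one line. The notation here is my own shorthand rather than anything from the linked explanation: u⊥ is the perpendicular component, proj_v(u) is the projection of u onto v, and ‖v‖ is the length of v. It assumes v is not the zero vector.

```latex
u_{\perp} \;=\; u - \operatorname{proj}_{v}(u) \;=\; u - \frac{u \cdot v}{\lVert v \rVert^{2}}\, v
```

A quick sanity check: dotting u⊥ with v gives u·v − (u·v / ‖v‖²)(v·v) = 0, so the result really is perpendicular to v.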

In summary: use gradient descent to find v and u; compute the projection of u onto v; subtract that projection from the u vector; the result of that subtraction is the update vector for all synapses. I hope that makes sense!
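The steps in that summary can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: the function names `dot` and `perpendicular_update` are hypothetical, the two update vectors are hard-coded toy values rather than real gradients from a network, and the small `eps` guard for a near-zero v is my own addition.

```python
def dot(a, b):
    """Dot product of two equal-length lists of numbers."""
    return sum(x * y for x, y in zip(a, b))


def perpendicular_update(u, v, eps=1e-12):
    """Return the component of u that is perpendicular to v.

    u: weight updates from the hold-out data
    v: weight updates from the original data
    eps: assumed threshold below which v is treated as zero
    """
    v_norm_sq = dot(v, v)
    if v_norm_sq < eps:
        # v is (almost) zero, so there is nothing to project out
        return list(u)
    # Scalar coefficient of the projection of u onto v
    scale = dot(u, v) / v_norm_sq
    # Subtract the parallel part, one synaptic weight at a time
    return [ui - scale * vi for ui, vi in zip(u, v)]


# Toy example with three "synaptic weights" (made-up numbers)
v = [1.0, 0.0, 0.0]    # update from the original data
u = [2.0, 3.0, -1.0]   # update from the hold-out data
update = perpendicular_update(u, v)
print(update)          # [0.0, 3.0, -1.0]
```

In this toy case the part of u that points along v (the first coordinate) is removed, and the rest is kept as the update for the remaining weights.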
