The key insight is that we are still performing gradient descent, but *only in the direction of the gradient of the ‘hold-out set’*; more precisely, we use only the **component** of that gradient which is **perpendicular** to the gradient of the original data set.

So, we would need to compute gradient descent with the **original** data first. The ‘weight updates’ prescribed by that operation would constitute a very high-dimensional vector (one dimension per synaptic weight!) — let’s call that vector *v*. Then, we would compute gradient descent using the ‘**hold-out**’ data; those ‘weight updates’ are another vector in that same high-dimensional space — let’s call that vector *u*.
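For concreteness, here is a minimal sketch of how one might obtain *v* and *u*, assuming a PyTorch model; the names `model`, `loss_fn`, `original_batch`, and `holdout_batch` are placeholders for whatever network and data you happen to be using.

```python
import torch

def flat_grad(model, loss_fn, batch):
    """Backprop one batch and return all weight gradients flattened
    into a single 1-D vector (one entry per synaptic weight)."""
    model.zero_grad()
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    return torch.cat([p.grad.reshape(-1) for p in model.parameters()])

v = flat_grad(model, loss_fn, original_batch)  # gradient from the original data
u = flat_grad(model, loss_fn, holdout_batch)   # gradient from the hold-out data
```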

Finally, we can compute the *component* of u which is *perpendicular* to v. That perpendicular vector tells us, for each synaptic weight (i.e. each dimension of our vectors), how much we should update that weight. To compute the component of u which is perpendicular to v, we simply *subtract from u the component which is parallel to v*. That parallel component is just the projection of u onto v, which is found by dotting u and v together, dividing by the squared length of v, and scaling v by the result. (The link explains this process well, with mathy formulae…)
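Written out as formulae (with u and v as above, and ‖v‖ the length of v), that projection-and-subtraction step is:

$$\mathrm{proj}_{v}(u) = \frac{u \cdot v}{\lVert v \rVert^{2}}\, v, \qquad u_{\perp} = u - \mathrm{proj}_{v}(u)$$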

In summary: use gradient descent to find v and u; compute the projection of u onto v; subtract that projection from the u vector; the result of that subtraction is the update vector for all synapses. I hope that makes sense!
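Putting those steps together, here is a rough sketch of one full update in PyTorch, reusing the hypothetical `flat_grad` helper from the earlier snippet (the learning rate `lr` is likewise an assumed placeholder):

```python
import torch

def orthogonal_update(model, loss_fn, original_batch, holdout_batch, lr=0.01):
    """One update step as described above: keep only the component of the
    hold-out gradient u that is perpendicular to the original-data gradient v."""
    v = flat_grad(model, loss_fn, original_batch)   # gradient on the original data
    u = flat_grad(model, loss_fn, holdout_batch)    # gradient on the hold-out data

    # projection of u onto v: (u . v / |v|^2) * v
    proj = (torch.dot(u, v) / torch.dot(v, v)) * v
    update = u - proj                               # perpendicular component of u

    # apply the update, slicing the flat vector back into each weight tensor
    with torch.no_grad():
        offset = 0
        for p in model.parameters():
            n = p.numel()
            p -= lr * update[offset : offset + n].reshape(p.shape)
            offset += n
```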