Attention Residuals: What If Your Network Could Choose Which Layer to Listen To?
Residual connections are one of those ML ideas that feel obvious in hindsight. Instead of each layer completely overwriting the previous representation, you add the layer's transformation on top of its input:
h_l = h_{l-1} + f(h_{l-1})
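A minimal sketch of this update in NumPy (the linear "layer" `W` here is a hypothetical stand-in for whatever transformation `f` the network learns):

```python
import numpy as np

def residual_step(h_prev, f):
    # h_l = h_{l-1} + f(h_{l-1}): the layer's output is added
    # to its input instead of replacing it.
    return h_prev + f(h_prev)

# Toy transformation: a fixed linear map (illustrative only).
W = np.array([[0.1, 0.0],
              [0.0, 0.1]])
f = lambda h: W @ h

h0 = np.array([1.0, 2.0])
h1 = residual_step(h0, f)
# h1 == h0 + W @ h0 == [1.1, 2.2]
```

Note that if `f` ever outputs zeros, `h1` simply equals `h0`: the identity path through the `+` is always there, no matter what the layer does.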
This single + sign is what made training 100-layer networks possible. It gives gradients a direct path back to earlier layers, so the training signal doesn't have to pass through every intermediate transformation.