How residual connections enable surprise in language models

⚠️ Work in Progress: This post is a draft and subject to change.