Ah, you're thinking of embeddings, which basically come from the encoder stack of the original transformer architecture. Modern GPT-like models (including Claude), however, drop the encoder and use decoder-only architectures.
I could imagine something where encoders pad up to the context length because causal masking doesn't apply and the self attention has learned to look across the whole context-window.
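To make the distinction concrete, here's a minimal numpy sketch (the function name and setup are my own, purely illustrative): the only difference between encoder-style and decoder-style self-attention is whether a causal mask zeroes out attention to future positions before the softmax.

```python
import numpy as np

def attention_weights(scores, causal):
    # scores: (seq_len, seq_len) raw attention logits (query x key)
    n = scores.shape[0]
    if causal:
        # decoder-style: position i may only attend to positions <= i,
        # so mask out the strict upper triangle with -inf
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    # softmax over the key dimension
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform logits, just to show the mask's effect
enc = attention_weights(scores, causal=False)  # every row attends to all 4 positions
dec = attention_weights(scores, causal=True)   # row i attends only to positions 0..i
```

With uniform logits, `enc` gives each position equal weight over the full window, while `dec`'s first row puts all its weight on position 0 and later rows spread weight only over earlier positions.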
Nope, a decoder-only transformer is a variant of the original architecture proposed by Google [1]. All variants of GPT that we know about (1 through 3) roughly use this same architecture, which takes only the decoder stack from the original Google paper and drops the encoder [2].