
Ah, you're thinking about embedding models, which are basically the encoder stack of the traditional transformer architecture. Modern GPT-like models (including Claude), however, drop the encoder and use decoder-only architectures.

I could imagine encoders padding inputs up to the full context length, since causal masking doesn't apply and the self-attention has learned to look across the whole context window.
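To make the distinction concrete, here's a toy sketch (my own illustration, not from any of the papers below) of how a causal mask changes the attention pattern. An encoder computes full bidirectional attention; a decoder masks out future positions:

```python
import numpy as np

def attention_weights(x, causal):
    # Toy self-attention weights (queries = keys = x) for one head.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    if causal:
        # Decoder-style: position i may only attend to positions <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Softmax over the key dimension; masked entries become exactly 0.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # 4 tokens, embedding dim 8
enc = attention_weights(x, causal=False)  # encoder: every token sees every token
dec = attention_weights(x, causal=True)   # decoder: upper triangle is zeroed
```

With full attention, padding the sequence changes what every position can see, which is why the parent's guess about encoders is plausible; with causal masking, tokens to the right can never influence tokens to the left.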



Decoder-only architecture? What is that? That doesn't sound like a transformer at all. Are you saying GPT-4 uses a totally different algorithm?


Nope, a decoder-only transformer is a variant of the original architecture proposed by Google [1]. All variants of GPT that we know about (1 through 3) roughly use this same architecture, which takes only the decoder stack from the original Google paper and drops the encoder [2].

[1] Original Google Paper - https://arxiv.org/abs/1706.03762

[2] Original GPT Paper - https://s3-us-west-2.amazonaws.com/openai-assets/research-co...
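In rough terms, the decoder-only setup from [2] reduces to an autoregressive loop over a single stack: the model repeatedly predicts the next token from its own prefix, with no separate encoder pass. A hypothetical sketch (the `toy_decoder_stack` here is a random stand-in for a trained model, not GPT's actual internals):

```python
import numpy as np

def toy_decoder_stack(token_ids, vocab_size=16, dim=8, seed=0):
    # Stand-in for a trained decoder stack: embeds the prefix and
    # returns next-token logits. A real model would apply many
    # masked self-attention + MLP layers here.
    rng = np.random.default_rng(seed)
    emb = rng.normal(size=(vocab_size, dim))
    out = rng.normal(size=(dim, vocab_size))
    h = emb[token_ids].mean(axis=0)  # crude pooling over the prefix
    return h @ out

def generate(prompt_ids, steps=3):
    # Decoder-only generation: the prompt is just the start of the
    # same sequence the model keeps extending, one token at a time.
    ids = list(prompt_ids)
    for _ in range(steps):
        logits = toy_decoder_stack(np.array(ids))
        ids.append(int(np.argmax(logits)))  # greedy next-token choice
    return ids

print(generate([1, 2, 3]))  # prompt plus 3 greedily chosen tokens
```

The point is that "conditioning on the input" and "generating the output" are the same operation: the prompt and the completion live in one sequence, so there's nothing for a separate encoder to do.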


How can it work without an encoder?



