GPTLMHeadModel¶
The GPTLMHeadModel class applies a linear language modeling head to the GPT backbone for causal (autoregressive) next-token prediction. The output projection weight is tied to the input token embedding.
Class Signature¶
class GPTLMHeadModel(config: GPTConfig)
Parameters¶
config (GPTConfig): GPT configuration object.
Methods¶
- GPTLMHeadModel.forward(input_ids: Tensor, attention_mask: Tensor | None = None, position_ids: Tensor | None = None, past_key_values: list[KVCache] | None = None, labels: Tensor | None = None, use_cache: bool = False) → tuple[Tensor | None, Tensor, list[KVCache] | None]
Compute per-token logits over the vocabulary. When labels is provided, the shifted cross-entropy loss for next-token prediction is returned as the first element; otherwise the first element is None. When use_cache is True, the updated key/value caches are returned as the third element.
- GPTLMHeadModel.tie_weights() → None
Tie the lm_head projection weight to the input token embedding weight.
- GPTLMHeadModel.get_input_embeddings() → Embedding
Return the token embedding layer.
- GPTLMHeadModel.get_output_embeddings() → Linear | None
Return the lm_head linear projection.
Examples¶
>>> import lucid.models as models
>>> config = models.GPTConfig.base()
>>> model = models.GPTLMHeadModel(config)
>>> print(model)
GPTLMHeadModel(...)
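The embedding accessors expose the tied parameters, and tie_weights() can be called to re-tie them manually (a sketch; whether tying shares a single tensor object or synchronizes values is implementation-dependent):

>>> emb = model.get_input_embeddings()    # token Embedding
>>> head = model.get_output_embeddings()  # lm_head Linear
>>> model.tie_weights()                   # re-tie, e.g. after swapping the embedding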
>>> import lucid
>>> input_ids = lucid.randint(0, config.vocab_size, (2, 32))
>>> loss, logits, _ = model(input_ids, labels=input_ids)
>>> logits.shape
(2, 32, 40478)
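Without labels, the loss slot is empty (the return type marks it Tensor | None), so pure inference calls can ignore it:

>>> loss, logits, _ = model(input_ids)
>>> loss is None
True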
>>> # Greedy autoregressive generation
>>> generated = input_ids
>>> for _ in range(20):
... _, logits, _ = model(generated)
... next_token = logits[:, -1, :].argmax(axis=-1, keepdims=True)
... generated = lucid.cat([generated, next_token], axis=-1)
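The loop above re-encodes the full prefix at every step. With use_cache=True the per-layer key/value states can be reused, so each step feeds only the newest token (a sketch based on the forward signature above; assumes the returned caches can be passed back via past_key_values):

>>> generated = input_ids
>>> cache = None
>>> step_input = generated
>>> for _ in range(20):
...     _, logits, cache = model(step_input, past_key_values=cache, use_cache=True)
...     step_input = logits[:, -1, :].argmax(axis=-1, keepdims=True)
...     generated = lucid.cat([generated, step_input], axis=-1)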