GPTTokenizerFast

C++ Backend

class lucid.models.GPTTokenizerFast(vocab: dict[str, int] | None = None, merges: list[tuple[str, str]] | None = None, vocab_file: Path | str | None = None, merges_file: Path | str | None = None, unk_token: SpecialTokens | str = SpecialTokens.UNK, pad_token: SpecialTokens | str = SpecialTokens.PAD, eot_token: str = '<|endoftext|>', lowercase: bool = True, clean_text: bool = True, end_of_word_suffix: str = '</w>')

GPTTokenizerFast is the high-performance GPT-1 tokenizer in Lucid. It wraps the native C++ BPE backend and follows the original GPT-1 tokenization scheme: character-level BPE with an </w> end-of-word suffix and a single <|endoftext|> token used as the document boundary delimiter.
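To make the scheme concrete, here is a minimal pure-Python sketch of character-level BPE with an `</w>` end-of-word suffix. The merge list and words are illustrative, not Lucid's actual vocabulary, and applying each merge in a single left-to-right pass is a simplification of the usual ranked-pair loop:

```python
def bpe_encode(word: str, merges: list[tuple[str, str]], suffix: str = "</w>") -> list[str]:
    # Split the word into characters and mark its end with the suffix.
    symbols = list(word) + [suffix]
    # Apply merges in priority order (earlier merges first).
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i : i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = [("l", "o"), ("lo", "w"), ("low", "</w>")]
print(bpe_encode("low", merges))    # ['low</w>']
print(bpe_encode("lower", merges))  # ['low', 'e', 'r', '</w>']
```

Note how the suffix lets the learned vocabulary distinguish `low` as a whole word (`low</w>`) from `low` as a prefix of `lower`.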

Class Signature

class GPTTokenizerFast(BPETokenizerFast):
    def __init__(
        self,
        vocab: dict[str, int] | None = None,
        merges: list[tuple[str, str]] | None = None,
        vocab_file: Path | str | None = None,
        merges_file: Path | str | None = None,
        unk_token: SpecialTokens | str = SpecialTokens.UNK,
        pad_token: SpecialTokens | str = SpecialTokens.PAD,
        eot_token: str = "<|endoftext|>",
        lowercase: bool = True,
        clean_text: bool = True,
        end_of_word_suffix: str = "</w>",
    ) -> None

Typical Usage

from lucid.models import GPTTokenizerFast

tokenizer = GPTTokenizerFast.from_pretrained("path/to/gpt_tokenizer")

# Single-sentence encoding
encoded = tokenizer.encode_plus("The GPT model learns by predicting the next word.")
print(encoded["input_ids"])
print(encoded["attention_mask"])

# Plain encode / decode
ids = tokenizer.encode("Language models are powerful.")
text = tokenizer.decode(ids)

# Train from scratch
corpus = ["The quick brown fox.", "GPT learns autoregressive language modeling."]
tokenizer = GPTTokenizerFast.train_from_iterator(corpus, vocab_size=1000)
tokenizer.save_pretrained("./my_gpt_tokenizer")

Methods

GPTTokenizerFast.encode_plus(text: str, add_special_tokens: bool = True) → dict[str, list[int]]
GPTTokenizerFast.encode(text: str, add_special_tokens: bool = True, return_tensor: bool = False, device: Literal['cpu', 'gpu'] = 'cpu') → list[int] | LongTensor
GPTTokenizerFast.decode(ids: list[int] | LongTensor, skip_special_tokens: bool = True) → str
GPTTokenizerFast.save_pretrained(save_directory: Path | str) → list[str]
classmethod GPTTokenizerFast.from_pretrained(pretrained_model_name_or_path: str | Path, **kwargs: Any) → GPTTokenizerFast
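The shape of the encode_plus result can be sketched without the backend. The helper below pads a batch of (made-up) token ids and builds the matching attention_mask, 1 for real tokens and 0 for padding; the pad id of 0 is illustrative, not Lucid's actual pad id:

```python
def pad_batch(batch: list[list[int]], pad_id: int = 0) -> dict[str, list[list[int]]]:
    # Pad every sequence to the length of the longest one.
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # The mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

out = pad_batch([[5, 9, 2], [7, 2]])
# out["input_ids"]      == [[5, 9, 2], [7, 2, 0]]
# out["attention_mask"] == [[1, 1, 1], [1, 1, 0]]
```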

Note

  • This class requires compiled C++ extensions under lucid._backend._C.

  • encode_plus returns a dict containing only input_ids and attention_mask. There is no token_type_ids, since GPT is a decoder-only model with no next-sentence-prediction objective.

  • eot_token (<|endoftext|>) serves as the document boundary delimiter and is registered as both bos_token and eos_token internally. It is appended to each sequence when add_special_tokens=True.

  • build_inputs_with_special_tokens returns the token list as-is (no per-sample BOS prepended), consistent with GPT-1 pretraining where eot_token only marks document boundaries.

  • Vocabulary artifacts are saved as vocab.json + merges.txt + tokenizer_config.json, identical to the BPETokenizerFast format.
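The special-token behavior described in the notes above can be sketched in a few lines. This is an illustrative model of the logic, not Lucid's implementation, and the eot id of 3 is made up:

```python
EOT_ID = 3  # hypothetical id for "<|endoftext|>"

def build_inputs_with_special_tokens(ids: list[int]) -> list[int]:
    # GPT-1 style: the token list is returned as-is, with no per-sample
    # BOS prepended or EOS wrapped around it.
    return list(ids)

def encode(ids: list[int], add_special_tokens: bool = True) -> list[int]:
    out = build_inputs_with_special_tokens(ids)
    # The eot token marks the document boundary and is appended only
    # when add_special_tokens is True.
    return out + [EOT_ID] if add_special_tokens else out

print(encode([5, 9]))                             # [5, 9, 3]
print(encode([5, 9], add_special_tokens=False))   # [5, 9]
```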