Tokenizers

The lucid.data.tokenizers package provides tokenizer interfaces for text pipelines. It focuses on converting text into token ids and reconstructing text from ids in a way that integrates cleanly with Lucid tensors and data loaders.

Overview

Key responsibilities of a tokenizer in Lucid:

  • Split text into token units.

  • Convert tokens to integer ids for model input.

  • Convert ids back to readable text.

  • Handle special tokens such as [PAD], [UNK], [BOS], and [EOS].

  • Save and reload tokenizer state with pretrained-style APIs.

The base class is lucid.data.tokenizers.Tokenizer.
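To make the responsibilities above concrete, here is a toy whitespace tokenizer with the same surface as the examples on this page. It is a sketch only: the vocabulary-building logic, attribute names, and special-token handling shown here are assumptions for illustration, not the actual Tokenizer contract.

```python
# Illustrative only: a toy whitespace tokenizer. Vocab handling and
# attribute names are assumptions, not Lucid's actual base-class API.

class ToyWhitespaceTokenizer:
    PAD, UNK, BOS, EOS = "[PAD]", "[UNK]", "[BOS]", "[EOS]"

    def __init__(self, corpus):
        # Reserve ids 0-3 for special tokens, then add corpus words.
        specials = [self.PAD, self.UNK, self.BOS, self.EOS]
        words = sorted({w for line in corpus for w in line.split()})
        self.vocab = {tok: i for i, tok in enumerate(specials + words)}
        self.inv_vocab = {i: tok for tok, i in self.vocab.items()}

    def encode(self, text, add_special_tokens=False):
        # Unknown words map to the [UNK] id.
        ids = [self.vocab.get(w, self.vocab[self.UNK]) for w in text.split()]
        if add_special_tokens:
            ids = [self.vocab[self.BOS]] + ids + [self.vocab[self.EOS]]
        return ids

    def decode(self, ids, skip_special_tokens=True):
        specials = {self.vocab[t] for t in (self.PAD, self.UNK, self.BOS, self.EOS)}
        toks = [self.inv_vocab[i] for i in ids
                if not (skip_special_tokens and i in specials)]
        return " ".join(toks)
```

With this toy class, encode("hello lucid", add_special_tokens=True) brackets the word ids with the [BOS] and [EOS] ids, and decode reverses the mapping while skipping special tokens, round-tripping back to "hello lucid".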

Quick Start

Single input:

from lucid.data.tokenizers import Tokenizer

# MyTokenizer is a user-defined subclass of Tokenizer
tokenizer = MyTokenizer.from_pretrained("path/to/tokenizer")
ids = tokenizer.encode("hello lucid", add_special_tokens=True)
text = tokenizer.decode(ids)

Batch input (2D tensor output):

import lucid

batch_ids = tokenizer.batch_encode(
    ["hello lucid", "hi"],
    padding=True,
    return_tensor=True,
    device="cpu",
)
assert isinstance(batch_ids, lucid.Tensor)  # shape: (batch, seq_len)

Save / load:

tokenizer.save_pretrained("out/my_tokenizer")
tokenizer2 = MyTokenizer.from_pretrained("out/my_tokenizer")
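Pretrained-style APIs like the pair above typically persist the tokenizer's state to a directory and reconstruct it from that directory. A minimal sketch of the pattern follows; the file name "vocab.json" and the plain-JSON layout are assumptions for illustration, not the format Lucid actually writes to disk.

```python
import json
import os

# Sketch of a pretrained-style save/load cycle. The on-disk layout
# (a single vocab.json) is an assumption, not Lucid's actual format.

class VocabStore:
    def __init__(self, vocab):
        self.vocab = vocab

    def save_pretrained(self, path):
        # Create the target directory and write state as JSON.
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "vocab.json"), "w") as f:
            json.dump(self.vocab, f)

    @classmethod
    def from_pretrained(cls, path):
        # Rebuild the object from the saved state.
        with open(os.path.join(path, "vocab.json")) as f:
            return cls(json.load(f))
```

The classmethod constructor is what lets MyTokenizer.from_pretrained(...) return a ready-to-use instance without calling __init__ by hand.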

Note

  • encode(…, return_tensor=True) returns a 1D lucid.LongTensor.

  • batch_encode(…, return_tensor=True) returns a 2D tensor when all sequences have the same length, or when padding=True is set.

  • decode accepts both list[int] and lucid.LongTensor inputs.
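The padding behavior described above can be sketched in plain Python: each sequence is padded on the right with the [PAD] id up to the longest sequence in the batch, so the result is rectangular and can be stacked into a (batch, seq_len) tensor. The helper name and the pad id of 0 are assumptions for illustration.

```python
# Sketch of the padding step behind batch_encode(..., padding=True).
# Assumes the [PAD] token has id 0; Lucid's internals may differ.

def pad_batch(sequences, pad_id=0):
    max_len = max(len(seq) for seq in sequences)
    # Right-pad every sequence to the batch maximum length.
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
```

For example, pad_batch applied to the id lists for "hello lucid" and "hi" pads the shorter list so both rows have equal length, which is why batch_encode can return a 2D tensor when padding=True.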