tokenizers.Tokenizer

class lucid.data.tokenizers.Tokenizer(unk_token: SpecialTokens | str = SpecialTokens.UNK, pad_token: SpecialTokens | str = SpecialTokens.PAD, bos_token: SpecialTokens | str | None = SpecialTokens.BOS, eos_token: SpecialTokens | str | None = SpecialTokens.EOS)

The Tokenizer class defines a Hugging Face-style tokenizer interface for Lucid. It is an abstract base class from which custom tokenizer implementations in lucid.data.tokenizers derive.

Overview

The base interface separates tokenization and vocabulary conversion into explicit steps (a round-trip sketch follows the list):

  • tokenize converts raw text into a token list.

  • convert_tokens_to_ids maps tokens to integer ids.

  • convert_ids_to_tokens maps ids back to tokens.

  • convert_tokens_to_string reconstructs text from tokens.
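
In a concrete subclass these steps compose into a round trip. The sketch below is illustrative only: tokenizer stands for any concrete Tokenizer instance, and the token strings and ids in the comments are made up.

tokens = tokenizer.tokenize("hello world")       # e.g. ["hello", "world"]
ids = tokenizer.convert_tokens_to_ids(tokens)    # e.g. [42, 17]
back = tokenizer.convert_ids_to_tokens(ids)      # ["hello", "world"]
text = tokenizer.convert_tokens_to_string(back)  # "hello world"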

It also provides common utility methods such as the ones below; build_inputs_with_special_tokens is sketched after the list:

  • encode / decode

  • batch_encode / batch_decode

  • build_inputs_with_special_tokens

  • save_pretrained / from_pretrained (abstract)
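
Of these, build_inputs_with_special_tokens wraps a plain id sequence with the configured special tokens. The sketch below assumes a Hugging Face-style signature that takes a single id list; the actual argument names and token placement are defined by the base class, not by this example.

ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("hello world"))
inputs = tokenizer.build_inputs_with_special_tokens(ids)
# e.g. [bos_id, *ids, eos_id] when bos/eos tokens are configured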

Class Signature

class Tokenizer(ABC):
    def __init__(
        self,
        unk_token: SpecialTokens | str = SpecialTokens.UNK,
        pad_token: SpecialTokens | str = SpecialTokens.PAD,
        bos_token: SpecialTokens | str | None = SpecialTokens.BOS,
        eos_token: SpecialTokens | str | None = SpecialTokens.EOS,
    ) -> None
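
Each special-token parameter accepts either a SpecialTokens member or a plain string, and bos_token / eos_token may be None to disable them. A minimal sketch, assuming SpecialTokens is importable from lucid.data.tokenizers and using the placeholder subclass MyTokenizer from Common Usage below:

from lucid.data.tokenizers import SpecialTokens

tokenizer = MyTokenizer(
    unk_token=SpecialTokens.UNK,  # enum member (the default)
    pad_token="<pad>",            # or a literal string
    eos_token=None,               # disable the eos token
)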

Core Abstract Methods

Subclasses must implement the following signatures; a minimal concrete sketch follows them:

@property
def vocab_size(self) -> int

def tokenize(self, text: str) -> list[str]

def convert_tokens_to_ids(self, tokens: str | list[str]) -> int | list[int]

def convert_ids_to_tokens(self, ids: int | list[int]) -> str | list[str]

def convert_tokens_to_string(self, tokens: list[str]) -> str

def save_pretrained(self, save_directory: Path | str) -> list[str]

@classmethod
def from_pretrained(
    cls, pretrained_model_name_or_path: str | Path, **kwargs
) -> Tokenizer
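
As a minimal sketch, the toy subclass below splits on whitespace over a fixed vocabulary and persists it as JSON. Everything specific to it is invented for illustration: the WhitespaceTokenizer name, the vocab.json layout, and the assumption that the base class exposes the unk token as a string-convertible self.unk_token attribute.

import json
from pathlib import Path

from lucid.data.tokenizers import Tokenizer

class WhitespaceTokenizer(Tokenizer):
    # Toy subclass: whitespace splitting over a fixed vocabulary.

    def __init__(self, vocab: list[str], **kwargs) -> None:
        super().__init__(**kwargs)
        self._itos = list(vocab)
        self._stoi = {tok: i for i, tok in enumerate(self._itos)}

    @property
    def vocab_size(self) -> int:
        return len(self._itos)

    def tokenize(self, text: str) -> list[str]:
        return text.split()

    def convert_tokens_to_ids(self, tokens: str | list[str]) -> int | list[int]:
        if isinstance(tokens, str):
            # Assumption: the base class stores the unk token at self.unk_token.
            return self._stoi.get(tokens, self._stoi.get(str(self.unk_token), 0))
        return [self.convert_tokens_to_ids(t) for t in tokens]

    def convert_ids_to_tokens(self, ids: int | list[int]) -> str | list[str]:
        if isinstance(ids, int):
            return self._itos[ids]
        return [self._itos[i] for i in ids]

    def convert_tokens_to_string(self, tokens: list[str]) -> str:
        return " ".join(tokens)

    def save_pretrained(self, save_directory: Path | str) -> list[str]:
        # Persist the vocabulary and return the list of written files.
        path = Path(save_directory) / "vocab.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(self._itos))
        return [str(path)]

    @classmethod
    def from_pretrained(
        cls, pretrained_model_name_or_path: str | Path, **kwargs
    ) -> "WhitespaceTokenizer":
        path = Path(pretrained_model_name_or_path) / "vocab.json"
        return cls(json.loads(path.read_text()), **kwargs)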

Common Usage

import lucid
from lucid.data.tokenizers import Tokenizer

class MyTokenizer(Tokenizer):
    ...  # implement the abstract methods (see the sketch above)

tokenizer = MyTokenizer()
ids = tokenizer.encode("hello world")
text = tokenizer.decode(ids)

# Tensor output for model input
batch = tokenizer.batch_encode(
    ["hello world", "hi"],
    padding=True,
    return_tensor=True,
    device="cpu",
)

Note

  • encode(…, return_tensor=True) returns lucid.LongTensor.

  • batch_encode(…, return_tensor=True) can return a 2D tensor only when sequence lengths are uniform or when padding=True is used.

  • decode accepts either list[int] or lucid.LongTensor (illustrated below).
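
A short illustration of these points (the input string and ids are placeholders):

ids = tokenizer.encode("hello world", return_tensor=True)  # lucid.LongTensor
text = tokenizer.decode(ids)         # tensors decode directly
text = tokenizer.decode([3, 7, 11])  # plain id lists work too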