WordPieceTokenizerFast

C++ Backend

class lucid.data.tokenizers.WordPieceTokenizerFast(vocab: dict[str, int] | None = None, vocab_file: Path | str | None = None, unk_token: SpecialTokens | str = SpecialTokens.UNK, pad_token: SpecialTokens | str = SpecialTokens.PAD, bos_token: SpecialTokens | str | None = None, eos_token: SpecialTokens | str | None = None, lowercase: bool = True, wordpieces_prefix: str = '##', max_input_chars_per_word: int = 100, clean_text: bool = True)

WordPieceTokenizerFast is the high-performance WordPiece tokenizer implementation in Lucid. It shares the same Python-facing interface as WordPieceTokenizer, but delegates tokenization and vocabulary operations to a native C++ backend exposed via the lucid.data.tokenizers._C extension module.
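
Because all tokenization work is routed through the extension module, it can be useful to probe whether the compiled backend is importable before committing to the fast path. A minimal sketch; the has_fast_backend flag name is purely illustrative:

try:
    # The compiled extension backing WordPieceTokenizerFast.
    import lucid.data.tokenizers._C  # noqa: F401
    has_fast_backend = True
except ImportError:
    has_fast_backend = False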

Class Signature

class WordPieceTokenizerFast(Tokenizer):
    def __init__(
        self,
        vocab: dict[str, int] | None = None,
        vocab_file: Path | str | None = None,
        unk_token: SpecialTokens | str = SpecialTokens.UNK,
        pad_token: SpecialTokens | str = SpecialTokens.PAD,
        bos_token: SpecialTokens | str | None = None,
        eos_token: SpecialTokens | str | None = None,
        lowercase: bool = True,
        wordpieces_prefix: str = "##",
        max_input_chars_per_word: int = 100,
        clean_text: bool = True,
    ) -> None
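
The constructor can also be used directly with an in-memory vocabulary instead of a pretrained directory. A minimal sketch, assuming the toy vocabulary below (a real WordPiece vocabulary maps tens of thousands of pieces to ids) and relying on the default [UNK]/[PAD] special tokens:

from lucid.data.tokenizers import WordPieceTokenizerFast

# Toy vocabulary; ids are arbitrary but must be unique.
vocab = {
    "[PAD]": 0,
    "[UNK]": 1,
    "hello": 2,
    "token": 3,
    "##izer": 4,
}

tokenizer = WordPieceTokenizerFast(
    vocab=vocab,
    lowercase=True,           # normalize input to lowercase before splitting
    wordpieces_prefix="##",   # continuation pieces carry the "##" prefix
)

# "tokenizer" splits into the pieces "token" + "##izer"; exact ids depend on the vocab above.
print(tokenizer.encode("Hello tokenizer"))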

Typical Usage

from lucid.data.tokenizers import WordPieceTokenizerFast

tokenizer = WordPieceTokenizerFast.from_pretrained(".data/bert/pretrained")

ids = tokenizer.encode("Hello, Lucid!")
text = tokenizer.decode(ids)

batch_ids = tokenizer.batch_encode(
    ["hello world", "tokenizer fast path"],
    add_special_tokens=True,
)

Note

  • This class requires the compiled C++ extension (lucid.data.tokenizers._C).

  • If the extension is not built, importing or constructing the tokenizer will fail with an import/runtime error (see the fallback sketch after this list).

  • For development builds, run python setup.py build_ext --inplace before use.
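
If a script should keep working on machines where the extension has not been built, one option is to fall back to the pure-Python WordPieceTokenizer, which shares the same Python-facing interface. A hedged sketch, assuming WordPieceTokenizer also provides from_pretrained and encode; the .data/bert/pretrained path is the placeholder used above:

# Prefer the C++-backed tokenizer; fall back to the pure-Python one if the
# _C extension is missing (import or construction may raise, per the note above).
try:
    from lucid.data.tokenizers import WordPieceTokenizerFast
    tokenizer = WordPieceTokenizerFast.from_pretrained(".data/bert/pretrained")
except (ImportError, RuntimeError):
    from lucid.data.tokenizers import WordPieceTokenizer
    tokenizer = WordPieceTokenizer.from_pretrained(".data/bert/pretrained")

ids = tokenizer.encode("Hello, Lucid!")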