RoFormerTokenizerFast¶
C++ Backend
- class lucid.models.RoFormerTokenizerFast(vocab: dict[str, int] | None = None, vocab_file: Path | str | None = None, unk_token: SpecialTokens | str = SpecialTokens.UNK, pad_token: SpecialTokens | str = SpecialTokens.PAD, cls_token: str = '[CLS]', mask_token: str = '[MASK]', sep_token: str = '[SEP]', lowercase: bool = True, wordpieces_prefix: str = '##', max_input_chars_per_word: int = 100, clean_text: bool = True)¶
RoFormerTokenizerFast is a thin RoFormer-branded wrapper that inherits the same high-performance WordPiece pipeline as BERTTokenizerFast.
Class Signature¶
class RoFormerTokenizerFast(BERTTokenizerFast)
Typical Usage¶
from lucid.models import RoFormerTokenizerFast
tokenizer = RoFormerTokenizerFast.from_pretrained("some_path")
encoded = tokenizer.encode_plus("RoFormer uses rotary position embedding.")
print(encoded["input_ids"])
print(encoded["token_type_ids"])
print(encoded["attention_mask"])
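For illustration, the structure of the `encode_plus` output can be sketched in plain Python with a toy vocabulary. The vocabulary, token ids, and helper below are assumptions made for the sketch, not the real tokenizer or RoFormer vocabulary:

```python
# Hypothetical toy vocabulary; a real vocabulary maps thousands of WordPiece tokens.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "ro": 4, "##former": 5, "rocks": 6}

def toy_encode_plus(tokens_a, tokens_b=None):
    """Mimic the [CLS] A [SEP] (B [SEP]) layout produced by encode_plus."""
    ids = [vocab["[CLS]"]] + [vocab.get(t, vocab["[UNK]"]) for t in tokens_a] + [vocab["[SEP]"]]
    type_ids = [0] * len(ids)
    if tokens_b is not None:
        seg_b = [vocab.get(t, vocab["[UNK]"]) for t in tokens_b] + [vocab["[SEP]"]]
        ids += seg_b
        type_ids += [1] * len(seg_b)  # second segment gets token type 1
    return {"input_ids": ids, "token_type_ids": type_ids, "attention_mask": [1] * len(ids)}

out = toy_encode_plus(["ro", "##former", "rocks"], ["rocks"])
print(out["input_ids"])       # [2, 4, 5, 6, 3, 6, 3]
print(out["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1]
```

All three lists always have the same length, which is why the attention mask here is simply all ones.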
Methods¶
- RoFormerTokenizerFast.encode_plus(text_a: str, text_b: str | None = None) dict[str, list[int]]¶
- RoFormerTokenizerFast.encode_pretraining_inputs(text_a: str, text_b: str | None = None, return_tensor: bool = False, device: Literal['cpu', 'gpu'] = 'cpu') dict[str, list[int] | LongTensor]¶
- RoFormerTokenizerFast.encode(text: str, add_special_tokens: bool = True, return_tensor: bool = False, device: Literal['cpu', 'gpu'] = 'cpu') list[int] | LongTensor¶
- RoFormerTokenizerFast.decode(ids: list[int] | LongTensor, skip_special_tokens: bool = True) str¶
- RoFormerTokenizerFast.save_pretrained(save_directory: Path | str) list[str]¶
- classmethod RoFormerTokenizerFast.from_pretrained(pretrained_model_name_or_path: str | Path, **kwargs: Any) BERTTokenizerFast¶
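The WordPiece pipeline these methods inherit splits each word by greedy longest-match-first lookup, prefixing non-initial subwords with `##`. A minimal pure-Python sketch of that rule (an illustration with an assumed toy vocabulary, not the C++ backend implementation):

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##", max_chars=100):
    """Greedy longest-match-first WordPiece segmentation sketch."""
    if len(word) > max_chars:
        return [unk]  # mirrors max_input_chars_per_word
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = prefix + sub  # non-initial subwords carry the "##" prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if cur is None:
            return [unk]  # any unmatchable span maps the whole word to [UNK]
        pieces.append(cur)
        start = end
    return pieces

toy_vocab = {"ro", "##tar", "##y", "former", "##former"}
print(wordpiece("rotary", toy_vocab))  # ['ro', '##tar', '##y']
```

Words containing no matchable subword collapse to the single `unk_token`, which is why a rich vocabulary matters for coverage.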
Note
This class currently reuses BERTTokenizerFast behavior as-is.
It accepts the same vocabulary format and special token settings.
As with BERT tokenizer usage, MLM masking is handled outside the tokenizer.
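Because MLM masking sits outside the tokenizer, the caller applies it to the encoded ids. A seeded sketch of the standard BERT-style 80/10/10 masking rule follows; the token ids, `mask_id`, and the `-100` ignore label are assumptions chosen for the example:

```python
import random

def mask_for_mlm(input_ids, mask_id, vocab_size, special_ids, p=0.15, seed=0):
    """Select ~p of non-special positions; of those, replace 80% with [MASK],
    10% with a random token, and leave 10% unchanged.
    Returns (masked_ids, labels) with -100 marking unselected positions."""
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    masked, labels = list(input_ids), [-100] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if tok in special_ids or rng.random() >= p:
            continue  # never mask special tokens
        labels[i] = tok  # the loss is computed only at selected positions
        r = rng.random()
        if r < 0.8:
            masked[i] = mask_id
        elif r < 0.9:
            masked[i] = rng.randrange(vocab_size)
        # else: keep the original token
    return masked, labels

masked, labels = mask_for_mlm([2, 4, 5, 6, 3], mask_id=9, vocab_size=10,
                              special_ids={2, 3})
```

Keeping this step outside the tokenizer lets the same encoded inputs serve both pretraining and plain inference.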