Text Models

Transformer

Transformer

The Transformer is a deep learning architecture introduced by Vaswani et al. in 2017, designed for handling sequential data with self-attention mechanisms. It replaces traditional recurrent layers with attention-based mechanisms, enabling highly parallelized training and capturing long-range dependencies effectively.

Vaswani, Ashish, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
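The self-attention core of the architecture, scaled dot-product attention, can be sketched in NumPy as follows. This is an illustrative sketch of the mechanism from the paper, not part of this library's API; the batch and dimension sizes are arbitrary.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (N, L_q, L_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (N, L_q, d_v)

# Toy batch: N=2 sequences, L=4 tokens, d_model=8 (illustrative sizes)
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4, 8))
K = rng.normal(size=(2, 4, 8))
V = rng.normal(size=(2, 4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 4, 8)
```

Because every query attends to every key in one matrix product, the whole sequence is processed in parallel, which is what removes the sequential bottleneck of recurrent layers.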

| Name | Model | Input Shape | Parameter Count | FLOPs |
| --- | --- | --- | --- | --- |
| Transformer-Base | `transformer_base` | \((N, L_{src})\), \((N, L_{tgt})\) | 62,584,544 | \(O(N \cdot d_{m} \cdot L_{src} \cdot L_{tgt})\) |
| Transformer-Big | `transformer_big` | \((N, L_{src})\), \((N, L_{tgt})\) | 213,237,472 | \(O(N \cdot d_{m} \cdot L_{src} \cdot L_{tgt})\) |

BERT

Encoder-Only Transformer

BERT is a Transformer-based encoder model family for sequence understanding; the variants below attach task-specific heads for pre-training, masked and causal language modeling, and sequence- and token-level prediction.

Devlin, Jacob, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv, 11 Oct. 2018, https://doi.org/10.48550/arXiv.1810.04805.
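The masked-language-modeling heads are trained on corrupted inputs: a fraction of tokens is selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. A minimal sketch of this corruption scheme, assuming BERT-base token ids and vocabulary size (not values from this document):

```python
import numpy as np

MASK_ID, VOCAB_SIZE, IGNORE = 103, 30522, -100  # assumed BERT-base values

def mask_tokens(input_ids, rng, mask_prob=0.15):
    """BERT-style MLM corruption: of the selected tokens,
    80% -> [MASK], 10% -> random token, 10% unchanged."""
    input_ids = input_ids.copy()
    labels = np.full_like(input_ids, IGNORE)   # loss is computed only where selected
    selected = rng.random(input_ids.shape) < mask_prob
    labels[selected] = input_ids[selected]     # targets are the original ids
    roll = rng.random(input_ids.shape)
    input_ids[selected & (roll < 0.8)] = MASK_ID
    random_pos = selected & (roll >= 0.8) & (roll < 0.9)
    input_ids[random_pos] = rng.integers(0, VOCAB_SIZE, size=random_pos.sum())
    return input_ids, labels

ids = np.array([[5, 17, 4021, 99, 250, 7]])
corrupted, labels = mask_tokens(ids, np.random.default_rng(0))
```

Positions with label `-100` are excluded from the loss, so the model is only graded on reconstructing the corrupted tokens.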

| Name | Model | Input Shape | Parameter Count | Pre-Trained |
| --- | --- | --- | --- | --- |
| \(\text{BERT}\) | `BERT` | \((N,L)\) | Depends | |
| \(\text{BERT}_\text{Pre}\) | `BERTForPreTraining` | \((N,L)\) | 110,106,428 | |
| \(\text{BERT}_\text{MLM}\) | `BERTForMaskedLM` | \((N,L)\) | 109,514,298 | |
| \(\text{BERT}_\text{CLM}\) | `BERTForCausalLM` | \((N,L)\) | 109,514,298 | |
| \(\text{BERT}_\text{NSP}\) | `BERTForNextSentencePrediction` | \((N,L)\) | 109,483,778 | |
| \(\text{BERT}_\text{SC}\) | `BERTForSequenceClassification` | \((N,L)\) | 109,483,778 | |
| \(\text{BERT}_\text{TC}\) | `BERTForTokenClassification` | \((N,L)\) | 108,895,493 | |
| \(\text{BERT}_\text{QA}\) | `BERTForQuestionAnswering` | \((N,L)\) | 108,893,186 | |

RoFormer

Encoder-Only Transformer

RoFormer is a rotary-position-embedding variant of BERT-style encoders: it retains the bidirectional Transformer blocks but replaces absolute position embeddings with rotary position embeddings (RoPE) applied inside self-attention.

Su, Jianlin, et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” arXiv, 15 Apr. 2021, https://doi.org/10.48550/arXiv.2104.09864.
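RoPE rotates each consecutive pair of feature dimensions by a position-dependent angle, so that the dot product between a rotated query and key depends only on their relative offset. A minimal sketch of the rotation (dimension sizes illustrative):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (L, d), d even.

    Row m has each pair (x_{2i}, x_{2i+1}) rotated by angle
    m * base**(-2i/d), so query/key dot products become a function
    of relative position only.
    """
    L, d = x.shape
    pos = np.arange(L)[:, None]                # (L, 1) positions
    theta = base ** (-np.arange(0, d, 2) / d)  # (d/2,) per-pair frequencies
    ang = pos * theta                          # (L, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.random.default_rng(0).normal(size=(6, 8))
y = rope(x)
```

Because each pair is a pure 2-D rotation, per-position vector norms are preserved, and no learned position table is needed: positions beyond those seen in training are handled by the same formula.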

| Name | Model | Input Shape | Parameter Count | Pre-Trained |
| --- | --- | --- | --- | --- |
| \(\text{RoFormer}\) | `RoFormer` | \((N,L)\) | Depends | |
| \(\text{RoFormer}_\text{MLM}\) | `RoFormerForMaskedLM` | \((N,L)\) | 109,711,674 | |
| \(\text{RoFormer}_\text{SC}\) | `RoFormerForSequenceClassification` | \((N,L)\) | 109,090,562 | |
| \(\text{RoFormer}_\text{TC}\) | `RoFormerForTokenClassification` | \((N,L)\) | 109,090,562 | |
| \(\text{RoFormer}_\text{MC}\) | `RoFormerForMultipleChoice` | \((N,C,L)\) | 109,089,793 | |
| \(\text{RoFormer}_\text{QA}\) | `RoFormerForQuestionAnswering` | \((N,L)\) | 109,090,562 | |

Note

Parameter counts are based on the smallest known variant of each model family.

GPT

Decoder-Only Transformer

GPT is the first large-scale decoder-only Transformer trained with unsupervised pre-training followed by supervised fine-tuning, demonstrating strong transfer across diverse language understanding tasks.

Radford, Alec, et al. “Improving Language Understanding by Generative Pre-Training.” OpenAI, 2018.
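What makes the model decoder-only is the causal attention mask: each token may attend only to itself and earlier positions, which is what lets the same network be trained on next-token prediction. A minimal sketch of such a mask (not this library's API):

```python
import numpy as np

def causal_mask(L):
    """Boolean lower-triangular mask: token i attends only to tokens j <= i."""
    return np.tril(np.ones((L, L), dtype=bool))

# Disallowed positions are set to -inf before the softmax,
# so each token's attention weights cover only its prefix.
scores = np.zeros((4, 4))
scores[~causal_mask(4)] = -np.inf
```

After the softmax, the `-inf` entries contribute zero weight, so future tokens never leak into the representation of the current position.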

| Name | Model | Input Shape | Parameter Count | Pre-Trained |
| --- | --- | --- | --- | --- |
| \(\text{GPT}\) | `GPT` | \((N,L)\) | 116,534,784 | |
| \(\text{GPT}_\text{LM}\) | `GPTLMHeadModel` | \((N,L)\) | 116,534,784 | |
| \(\text{GPT}_\text{DH}\) | `GPTDoubleHeadsModel` | \((N,C,L)\) | 116,535,553 | |
| \(\text{GPT}_\text{SC}\) | `GPTForSequenceClassification` | \((N,L)\) | 116,536,320 | |