Text Models

Transformer

Transformer

The Transformer is a deep learning architecture introduced by Vaswani et al. in 2017, designed for handling sequential data with self-attention mechanisms. It replaces traditional recurrent layers with attention-based mechanisms, enabling highly parallelized training and capturing long-range dependencies effectively.

Vaswani, Ashish, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
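The self-attention core of the architecture, scaled dot-product attention, can be sketched in NumPy as follows. This is an illustrative sketch of the mechanism from the paper, not part of this library's API; the batch and dimension sizes are arbitrary.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (N, L_q, L_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (N, L_q, d_v)

# Toy batch: N=2 sequences, L=4 tokens, d_model=8 (illustrative sizes)
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4, 8))
K = rng.normal(size=(2, 4, 8))
V = rng.normal(size=(2, 4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 4, 8)
```

Because every query attends to every key in one matrix product, the whole sequence is processed in parallel, which is what removes the sequential bottleneck of recurrent layers.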

| Name | Model | Input Shape | Parameter Count | FLOPs |
| --- | --- | --- | --- | --- |
| Transformer-Base | `transformer_base` | \((N, L_{src})\), \((N, L_{tgt})\) | 62,584,544 | \(O(N \cdot d_{m} \cdot L_{src} \cdot L_{tgt})\) |
| Transformer-Big | `transformer_big` | \((N, L_{src})\), \((N, L_{tgt})\) | 213,237,472 | \(O(N \cdot d_{m} \cdot L_{src} \cdot L_{tgt})\) |

BERT

Encoder-Only Transformer

BERT is a Transformer-based encoder model family for sequence understanding; the variants below attach task-specific heads for pre-training, masked and causal language modeling, and sequence- and token-level prediction.

Devlin, Jacob, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv, 11 Oct. 2018, https://doi.org/10.48550/arXiv.1810.04805.
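The masked-language-modeling heads are trained on corrupted inputs: a fraction of tokens is selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. A minimal sketch of this corruption scheme, assuming BERT-base token ids and vocabulary size (not values from this document):

```python
import numpy as np

MASK_ID, VOCAB_SIZE, IGNORE = 103, 30522, -100  # assumed BERT-base values

def mask_tokens(input_ids, rng, mask_prob=0.15):
    """BERT-style MLM corruption: of the selected tokens,
    80% -> [MASK], 10% -> random token, 10% unchanged."""
    input_ids = input_ids.copy()
    labels = np.full_like(input_ids, IGNORE)   # loss is computed only where selected
    selected = rng.random(input_ids.shape) < mask_prob
    labels[selected] = input_ids[selected]     # targets are the original ids
    roll = rng.random(input_ids.shape)
    input_ids[selected & (roll < 0.8)] = MASK_ID
    random_pos = selected & (roll >= 0.8) & (roll < 0.9)
    input_ids[random_pos] = rng.integers(0, VOCAB_SIZE, size=random_pos.sum())
    return input_ids, labels

ids = np.array([[5, 17, 4021, 99, 250, 7]])
corrupted, labels = mask_tokens(ids, np.random.default_rng(0))
```

Positions with label `-100` are excluded from the loss, so the model is only graded on reconstructing the corrupted tokens.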

| Name | Model | Input Shape | Parameter Count | Pre-Trained |
| --- | --- | --- | --- | --- |
| \(\text{BERT}\) | `BERT` | \((N,L)\) | Depends | |
| \(\text{BERT}_\text{Pre}\) | `BERTForPreTraining` | \((N,L)\) | 110,106,428 | |
| \(\text{BERT}_\text{MLM}\) | `BERTForMaskedLM` | \((N,L)\) | 109,514,298 | |
| \(\text{BERT}_\text{CLM}\) | `BERTForCausalLM` | \((N,L)\) | 109,514,298 | |
| \(\text{BERT}_\text{NSP}\) | `BERTForNextSentencePrediction` | \((N,L)\) | 109,483,778 | |
| \(\text{BERT}_\text{SC}\) | `BERTForSequenceClassification` | \((N,L)\) | 109,483,778 | |
| \(\text{BERT}_\text{TC}\) | `BERTForTokenClassification` | \((N,L)\) | 108,895,493 | |
| \(\text{BERT}_\text{QA}\) | `BERTForQuestionAnswering` | \((N,L)\) | 108,893,186 | |

RoFormer

Encoder-Only Transformer

RoFormer is a rotary-position-embedding variant of BERT-style encoders: it retains the bidirectional Transformer blocks but replaces absolute position embeddings with rotary position embeddings (RoPE) applied inside self-attention.

Su, Jianlin, et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” arXiv, 15 Apr. 2021, https://doi.org/10.48550/arXiv.2104.09864.
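RoPE rotates each consecutive pair of feature dimensions by a position-dependent angle, so that the dot product between a rotated query and key depends only on their relative offset. A minimal sketch of the rotation (dimension sizes illustrative):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (L, d), d even.

    Row m has each pair (x_{2i}, x_{2i+1}) rotated by angle
    m * base**(-2i/d), so query/key dot products become a function
    of relative position only.
    """
    L, d = x.shape
    pos = np.arange(L)[:, None]                # (L, 1) positions
    theta = base ** (-np.arange(0, d, 2) / d)  # (d/2,) per-pair frequencies
    ang = pos * theta                          # (L, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.random.default_rng(0).normal(size=(6, 8))
y = rope(x)
```

Because each pair is a pure 2-D rotation, per-position vector norms are preserved, and no learned position table is needed: positions beyond those seen in training are handled by the same formula.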

| Name | Model | Input Shape | Parameter Count | Pre-Trained |
| --- | --- | --- | --- | --- |
| \(\text{RoFormer}\) | `RoFormer` | \((N,L)\) | Depends | |
| \(\text{RoFormer}_\text{MLM}\) | `RoFormerForMaskedLM` | \((N,L)\) | 109,711,674 | |
| \(\text{RoFormer}_\text{SC}\) | `RoFormerForSequenceClassification` | \((N,L)\) | 109,090,562 | |
| \(\text{RoFormer}_\text{TC}\) | `RoFormerForTokenClassification` | \((N,L)\) | 109,090,562 | |
| \(\text{RoFormer}_\text{MC}\) | `RoFormerForMultipleChoice` | \((N,C,L)\) | 109,089,793 | |
| \(\text{RoFormer}_\text{QA}\) | `RoFormerForQuestionAnswering` | \((N,L)\) | 109,090,562 | |

Note

Parameter counts are based on the smallest known variant of each model family.

GPT

Decoder-Only Transformer

GPT is the first large-scale decoder-only Transformer trained with unsupervised pre-training followed by supervised fine-tuning, demonstrating strong transfer across diverse language understanding tasks.

Radford, Alec, et al. “Improving Language Understanding by Generative Pre-Training.” OpenAI, 2018.
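What makes the model decoder-only is the causal attention mask: each token may attend only to itself and earlier positions, which is what lets the same network be trained on next-token prediction. A minimal sketch of such a mask (not this library's API):

```python
import numpy as np

def causal_mask(L):
    """Boolean lower-triangular mask: token i attends only to tokens j <= i."""
    return np.tril(np.ones((L, L), dtype=bool))

# Disallowed positions are set to -inf before the softmax,
# so each token's attention weights cover only its prefix.
scores = np.zeros((4, 4))
scores[~causal_mask(4)] = -np.inf
```

After the softmax, the `-inf` entries contribute zero weight, so future tokens never leak into the representation of the current position.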

| Name | Model | Input Shape | Parameter Count | Pre-Trained |
| --- | --- | --- | --- | --- |
| \(\text{GPT}\) | `GPT` | \((N,L)\) | 116,534,784 | |
| \(\text{GPT}_\text{LM}\) | `GPTLMHeadModel` | \((N,L)\) | 116,534,784 | |
| \(\text{GPT}_\text{DH}\) | `GPTDoubleHeadsModel` | \((N,C,L)\) | 116,535,553 | |
| \(\text{GPT}_\text{SC}\) | `GPTForSequenceClassification` | \((N,L)\) | 116,536,320 | |