Text Models
Transformer
The Transformer is a deep learning architecture introduced by Vaswani et al. in 2017, designed for handling sequential data with self-attention mechanisms. It replaces traditional recurrent layers with attention-based mechanisms, enabling highly parallelized training and capturing long-range dependencies effectively.
Vaswani, Ashish, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
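The operation that replaces recurrence is scaled dot-product attention, \(\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V\). A minimal, dependency-free sketch of one attention pass (plain Python lists rather than the batched tensors a real implementation would use; names here are illustrative):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q: (L_q, d_k), K: (L_k, d_k), V: (L_k, d_v), all as nested lists.
    Each output row is a convex combination of the rows of V.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Because every query attends to every key, the cost scales with the product of the two sequence lengths, which is where the \(O(N \cdot d_{m} \cdot L_{src} \cdot L_{tgt})\) FLOPs figure in the table below comes from.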
| Name | Input Shape | Parameter Count | FLOPs |
|---|---|---|---|
| Transformer-Base | \((N, L_{src})\), \((N, L_{tgt})\) | 62,584,544 | \(O(N \cdot d_{m} \cdot L_{src} \cdot L_{tgt})\) |
| Transformer-Big | \((N, L_{src})\), \((N, L_{tgt})\) | 213,237,472 | \(O(N \cdot d_{m} \cdot L_{src} \cdot L_{tgt})\) |
BERT
Transformer · Encoder-Only Transformer
BERT is a Transformer encoder-only model family for sequence understanding. The variants below cover the pre-training objectives (masked language modeling, causal language modeling, next-sentence prediction) as well as task-specific heads for sequence classification, token classification, and question answering.
Devlin, Jacob, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv, 11 Oct. 2018, https://doi.org/10.48550/arXiv.1810.04805.
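As a rough illustration of the masked-language-modeling objective behind \(\text{BERT}_\text{MLM}\): about 15% of input tokens are selected for prediction, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. A hedged sketch of this corruption step (the toy `VOCAB` and `mask_for_mlm` helper are illustrative, not part of any library API):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary for illustration

def mask_for_mlm(tokens, rng, p_select=0.15):
    """BERT-style MLM corruption.

    Selects ~15% of positions; of those, 80% -> [MASK], 10% -> a random
    token, 10% -> unchanged. Returns (corrupted, labels), where labels[i]
    is the original token at selected positions and None elsewhere
    (unselected positions are ignored by the loss).
    """
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < p_select:
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels
```

The 80/10/10 split keeps the pre-training input distribution closer to the fine-tuning one, where [MASK] never appears.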
| Name | Input Shape | Parameter Count | Pre-Trained |
|---|---|---|---|
| \(\text{BERT}\) | \((N,L)\) | Depends | – |
| \(\text{BERT}_\text{Pre}\) | \((N,L)\) | 110,106,428 | ✅ |
| \(\text{BERT}_\text{MLM}\) | \((N,L)\) | 109,514,298 | ❌ |
| \(\text{BERT}_\text{CLM}\) | \((N,L)\) | 109,514,298 | ❌ |
| \(\text{BERT}_\text{NSP}\) | \((N,L)\) | 109,483,778 | ❌ |
| \(\text{BERT}_\text{SC}\) | \((N,L)\) | 109,483,778 | ❌ |
| \(\text{BERT}_\text{TC}\) | \((N,L)\) | 108,895,493 | ❌ |
| \(\text{BERT}_\text{QA}\) | \((N,L)\) | 108,893,186 | ❌ |
RoFormer
Transformer · Encoder-Only Transformer
RoFormer is a rotary-position-embedding variant of BERT-style encoders, retaining bidirectional Transformer blocks while replacing absolute positional embedding usage in self-attention with RoPE.
Su, Jianlin, et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” arXiv, 15 Apr. 2021, https://doi.org/10.48550/arXiv.2104.09864.
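RoPE rotates each consecutive pair of query/key dimensions by an angle proportional to the token's position, so that dot products between rotated vectors depend only on relative position. A minimal sketch for a single vector, following the paper's formulation with base 10000 (the `rope` helper is illustrative, not a library API):

```python
import math

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to one query/key vector.

    Each consecutive pair (x[2i], x[2i+1]) is rotated by the angle
    pos * base^(-2i/d). Dot products between two rotated vectors then
    depend only on the difference of their positions.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        # 2-D rotation of the pair (x[i], x[i+1]) by theta.
        out.extend([x[i] * c - x[i + 1] * s,
                    x[i] * s + x[i + 1] * c])
    return out
```

Shifting both positions by the same offset leaves the query–key dot product unchanged, which is the relative-position property that lets RoPE replace absolute positional embeddings inside self-attention.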
| Name | Input Shape | Parameter Count | Pre-Trained |
|---|---|---|---|
| \(\text{RoFormer}\) | \((N,L)\) | Depends | – |
| \(\text{RoFormer}_\text{MLM}\) | \((N,L)\) | 109,711,674 | ❌ |
| \(\text{RoFormer}_\text{SC}\) | \((N,L)\) | 109,090,562 | ❌ |
| \(\text{RoFormer}_\text{TC}\) | \((N,L)\) | 109,090,562 | ❌ |
| \(\text{RoFormer}_\text{MC}\) | \((N,C,L)\) | 109,089,793 | ❌ |
| \(\text{RoFormer}_\text{QA}\) | \((N,L)\) | 109,090,562 | ❌ |
Note
Parameter counts are based on the smallest known variant of each model family.
GPT
Transformer · Decoder-Only Transformer
GPT is the first large-scale decoder-only Transformer trained with unsupervised pre-training followed by supervised fine-tuning, demonstrating strong transfer across diverse language understanding tasks.
Radford, Alec, et al. “Improving Language Understanding by Generative Pre-Training.” OpenAI, 2018.
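Decoder-only means each position may attend only to itself and earlier positions, which is enforced with a causal mask on the attention scores before the softmax. A minimal sketch of that masking step (helper names are illustrative):

```python
import math

def causal_mask(scores):
    """Mask future positions in an (L, L) attention-score matrix so
    that token i can only attend to tokens j <= i, as in a
    decoder-only (GPT-style) block."""
    L = len(scores)
    return [[scores[i][j] if j <= i else float("-inf") for j in range(L)]
            for i in range(L)]

def softmax(xs):
    # Numerically stable softmax; -inf entries get exactly zero weight.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]
```

Applying the softmax row-wise to the masked scores gives zero attention weight on all future positions, so the model can be trained with a next-token language-modeling objective in a single parallel pass.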
| Name | Input Shape | Parameter Count | Pre-Trained |
|---|---|---|---|
| \(\text{GPT}\) | \((N,L)\) | 116,534,784 | – |
| \(\text{GPT}_\text{LM}\) | \((N,L)\) | 116,534,784 | ❌ |
| \(\text{GPT}_\text{DH}\) | \((N,C,L)\) | 116,535,553 | ❌ |
| \(\text{GPT}_\text{SC}\) | \((N,L)\) | 116,536,320 | ❌ |