nn.Transformer¶
- class lucid.nn.Transformer(d_model: int = 512, num_heads: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1, activation: Callable[[Tensor], Tensor] = F.relu, layer_norm_eps: float = 1e-05, norm_first: bool = False, bias: bool = True, custom_encoder: Module | None = None, custom_decoder: Module | None = None)¶
Overview¶
The Transformer module is a complete sequence-to-sequence model consisting of an encoder and a decoder. It is commonly used in natural language processing tasks such as machine translation and text generation. The model follows the standard transformer architecture introduced in Attention Is All You Need by Vaswani et al. (2017).
Class Signature¶
class lucid.nn.Transformer(
d_model: int = 512,
num_heads: int = 8,
num_encoder_layers: int = 6,
num_decoder_layers: int = 6,
dim_feedforward: int = 2048,
dropout: float = 0.1,
activation: Callable[[Tensor], Tensor] = F.relu,
layer_norm_eps: float = 1e-5,
norm_first: bool = False,
bias: bool = True,
custom_encoder: nn.Module | None = None,
custom_decoder: nn.Module | None = None,
)
Parameters¶
d_model (int, default=512): The dimensionality of the input embeddings (\(d_{model}\)).
num_heads (int, default=8): The number of attention heads in each multi-head attention layer (\(H\)).
Warning
The embedding dimension (\(d_{model}\)) must be evenly divisible by \(H\); with the defaults, each head operates on \(512 / 8 = 64\) dimensions.
num_encoder_layers (int, default=6): The number of TransformerEncoderLayer instances stacked in the encoder.
num_decoder_layers (int, default=6): The number of TransformerDecoderLayer instances stacked in the decoder.
dim_feedforward (int, default=2048): The dimensionality of the intermediate layer in the feedforward network.
dropout (float, default=0.1): Dropout probability applied to the attention and feedforward layers.
activation (Callable[[Tensor], Tensor], default=F.relu): The activation function applied in the feedforward network.
layer_norm_eps (float, default=1e-5): A small constant added to the denominator for numerical stability in layer normalization.
norm_first (bool, default=False): If True, applies layer normalization before the attention and feedforward sublayers, instead of after.
bias (bool, default=True): If True, enables bias terms in the linear layers.
custom_encoder (nn.Module | None, optional): If provided, replaces the default TransformerEncoder with a custom encoder.
custom_decoder (nn.Module | None, optional): If provided, replaces the default TransformerDecoder with a custom decoder.
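The defaults correspond to the base configuration from Vaswani et al. (2017). As a brief, non-normative illustration of how these arguments combine, the sketch below constructs a smaller pre-norm variant; the specific values are arbitrary and chosen only for demonstration.
import lucid.nn as nn
# Smaller pre-norm variant (values chosen only for illustration)
small_transformer = nn.Transformer(
    d_model=256,            # embedding dimension; must be divisible by num_heads
    num_heads=4,            # 256 / 4 = 64 dimensions per head
    num_encoder_layers=3,
    num_decoder_layers=3,
    dim_feedforward=1024,
    dropout=0.1,
    norm_first=True,        # LayerNorm before each attention/feedforward sublayer
)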
Forward Method¶
def forward(
src: Tensor,
tgt: Tensor,
src_mask: Tensor | None = None,
tgt_mask: Tensor | None = None,
mem_mask: Tensor | None = None,
src_key_padding_mask: Tensor | None = None,
tgt_key_padding_mask: Tensor | None = None,
mem_key_padding_mask: Tensor | None = None
) -> Tensor
Computes the forward pass of the Transformer model.
Inputs:
src (Tensor): The source input tensor of shape \((N, L_s, d_{model})\).
tgt (Tensor): The target input tensor of shape \((N, L_t, d_{model})\).
src_mask (Tensor | None, optional): A mask of shape \((L_s, L_s)\) applied to the encoder self-attention weights.
tgt_mask (Tensor | None, optional): A mask of shape \((L_t, L_t)\) applied to the decoder self-attention weights.
mem_mask (Tensor | None, optional): A mask of shape \((L_t, L_s)\) applied to decoder-encoder cross-attention weights.
src_key_padding_mask (Tensor | None, optional): A mask of shape \((N, L_s)\), where non-zero values indicate positions that should be ignored in the encoder.
tgt_key_padding_mask (Tensor | None, optional): A mask of shape \((N, L_t)\), where non-zero values indicate positions that should be ignored in the decoder.
mem_key_padding_mask (Tensor | None, optional): A mask of shape \((N, L_s)\), where non-zero values indicate positions that should be ignored in cross-attention.
Output:
Tensor: The output tensor of shape \((N, L_t, d_{model})\).
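As a hedged sketch of how the optional masks fit together, the example below builds a causal target mask and a target key-padding mask. The exact mask convention for src_mask/tgt_mask (boolean vs. additive \(-\infty\)) is not specified here, and wrapping nested Python lists with lucid.Tensor is an assumption about the constructor; adapt both to the actual attention implementation.
import lucid
import lucid.nn as nn
N, L_s, L_t, d_model = 2, 5, 4, 512
transformer = nn.Transformer(d_model=d_model, num_heads=8)
src = lucid.random.randn(N, L_s, d_model)
tgt = lucid.random.randn(N, L_t, d_model)
# Causal mask of shape (L_t, L_t): position i attends only to positions <= i.
# Non-zero marks a blocked position here; switch to an additive -inf mask if
# that is what the attention layers expect.
causal = [[1.0 if j > i else 0.0 for j in range(L_t)] for i in range(L_t)]
tgt_mask = lucid.Tensor(causal)  # assumes lucid.Tensor accepts nested lists
# Key-padding mask of shape (N, L_t): non-zero entries are ignored.
# Example: the last target position of every sequence is padding.
pad = [[0.0] * (L_t - 1) + [1.0] for _ in range(N)]
tgt_key_padding_mask = lucid.Tensor(pad)
output = transformer(
    src, tgt, tgt_mask=tgt_mask, tgt_key_padding_mask=tgt_key_padding_mask
)
print(output.shape)  # Expected output: (2, 4, 512)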
Mathematical Details¶
The Transformer model processes input through an encoder-decoder architecture as follows:
Encoding Process
\[M = \operatorname{Encoder}(S)\]
where \(S\) is the source input and \(M\) is the memory output of the encoder.
Decoding Process
\[Y = \operatorname{Decoder}(T, M)\]
where \(T\) is the target input and \(Y\) is the final output.
Layer Normalization (if applied)
\[Y = \operatorname{LayerNorm}(Y)\]
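The norm_first flag determines where layer normalization sits relative to each attention and feedforward sublayer. Using the standard residual formulation of the transformer (stated here as the general convention rather than a lucid-specific detail), the two variants are:
Post-norm (norm_first=False):
\[x \leftarrow \operatorname{LayerNorm}\left(x + \operatorname{Sublayer}(x)\right)\]
Pre-norm (norm_first=True):
\[x \leftarrow x + \operatorname{Sublayer}\left(\operatorname{LayerNorm}(x)\right)\]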
Usage Example¶
import lucid
import lucid.nn as nn
# Create Transformer model
transformer = nn.Transformer(
d_model=512, num_heads=8, num_encoder_layers=6, num_decoder_layers=6
)
# Create random input tensors
src = lucid.random.randn(16, 10, 512) # (batch, seq_len, embed_dim)
tgt = lucid.random.randn(16, 10, 512)
# Compute Transformer output
output = transformer(src, tgt)
print(output.shape) # Expected output: (16, 10, 512)
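Because the decoder emits one output position per target position, the output length follows the target sequence rather than the source. The following variation of the example above uses different source and target lengths and relies only on the calls already shown.
import lucid
import lucid.nn as nn
transformer = nn.Transformer(d_model=512, num_heads=8)
src = lucid.random.randn(4, 20, 512)  # (batch, L_s, embed_dim)
tgt = lucid.random.randn(4, 7, 512)   # (batch, L_t, embed_dim)
output = transformer(src, tgt)
print(output.shape)  # Expected output: (4, 7, 512)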