Cerebras Modelzoo Layers
class cerebras.modelzoo.layers.AlibiPositionEmbeddingLayer(*args, **kwargs) [source]
Bases: torch.nn.Module
Alibi position embedding layer; the symmetric case with bidirectional attention is supported.
Alibi bias as in the paper: https://arxiv.org/abs/2108.12409
Parameters
- num_heads (int): number of attention heads.
- slopes (Tensor): slope values to use for alibi heads. Shape: [num_heads, 1]. Defaults to None.
- alibi_trainable_slopes (bool): whether the alibi slopes are trainable parameters.
- slopes_initializer (str): initializer for alibi slopes if they're trainable. Defaults to xavier_uniform.
Returns
- Relative position bias, to be used in attention masking.
Return type
- position_bias (Tensor)
forward(seq_length, key_length, past_kv=None, constant_pos_mask=None, batch_size=None) [source]
Return the position bias based on the alibi slopes.
Parameters
- seq_length (int): the length of query tokens.
- key_length (int): the length of key tokens.
Returns
- Position bias tensor with shape [num_heads, query_length, key_length]
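A minimal usage sketch (illustrative; it assumes the constructor accepts the keyword arguments listed above and that calling the module returns the bias tensor directly):

import torch
from cerebras.modelzoo.layers import AlibiPositionEmbeddingLayer

num_heads = 8  # hypothetical value for illustration
alibi = AlibiPositionEmbeddingLayer(num_heads=num_heads)

# Bias for a self-attention call over 128 query and 128 key positions.
position_bias = alibi(seq_length=128, key_length=128)
# Expected shape: [num_heads, 128, 128]; pass it as position_bias to an
# attention layer (e.g. MultiheadAttention.forward) so it is added to the logits.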
MultiheadAttention Class
class cerebras.modelzoo.layers.MultiheadAttention(*args, **kwargs)
Bases: torch.nn.Module
Multi-head attention layer. Adapted from: https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention.
Parameters
- embed_dim (int) – Number of input units in each projection output.
- num_heads (int) – Number of attention heads.
- inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to embed_dim.
- dropout (float) – Dropout rate for key-query weights. Defaults to 0.0.
- batch_first (bool) – If True, the input and output tensors are provided as (batch, seq, feature); otherwise the format is (seq, batch, feature). Default: True (batch, seq, feature).
- add_bias_kv (bool) – If specified, adds bias to the key and value sequences at dim=0. Default: False.
- add_zero_attn (bool) – If specified, adds a new batch of zeros to the key and value sequences at dim=1. Default: False.
- kdim (int) – Number of input units in the key projection.
- vdim (int) – Number of input units in the value projection.
- use_projection_bias (bool) – Whether to use bias in the key, query, and value projections.
- use_ffn_bias (bool) – Whether to use bias in the output projection.
- attention_initializer (str) – Projection kernel initializer. Defaults to xavier_uniform.
- attention_q_initializer – Query projection kernel initializer. If not specified, the query projection is initialized via attention_initializer.
- output_layer_initializer (str | initializer) – If not None, use this initializer for the output transform layer. Defaults to None.
- bias_initializer (str) – Bias initializer. Defaults to zeros.
- attention_type (str) – The attention variant to execute. Currently accepts dot_product and scaled_dot_product. Defaults to scaled_dot_product.
- scale_qk_dot_by_d (bool) – If True, scales the QK^T dot product by d (= hidden / d_head) instead of sqrt(d).
- attention_logits_alpha (float) – Scales the QK^T dot product. Used to stabilize logits in muP training.
- softmax_dtype_fp32 (bool) – Use an FP32 softmax implementation.
- attention_kernel (str | None) – Kernel to use. Uses default if None. Accepted values: None – default implementation; fast_attention – experimental optimized implementation.
- device (optional) – Device to create the model parameters on; can be a CUDA device or CS device.
MultiheadAttention.forward(q, k, v, attn_mask=None, key_padding_mask=None, need_weights=False, average_attn_weights=True, past_kv=None, cache_present_kv=False, past_kv_self_attn=True, position_bias=None, rotary_position_embedding_helper=None, layer_idx=None, **extra_args)
Applies the attention mechanism to queries q, keys k, and values v.
Parameters
- q (Tensor) – Queries, shape [batch_size, seq_length, embed_dim].
- k (Tensor) – Keys, shape [batch_size, seq_length, embed_dim].
- v (Tensor) – Values, shape [batch_size, seq_length, embed_dim].
- attn_mask (Tensor) – Attention mask. Can be 2D of shape [batch_size, seq_length], or 3D of shape [batch_size, query_length, seq_length].
- key_padding_mask (Tensor) – If specified, a mask of shape (N, S) indicating which elements within key to ignore for the purpose of attention (i.e., treat as "padding"). Defaults to None.
- need_weights (bool) – If specified, returns attn_output_weights in addition to attn_outputs. Default: False.
- average_attn_weights (bool) – If True, the returned attn_weights are averaged across heads; otherwise attn_weights are provided separately per head. This flag only has an effect when need_weights=True. Default: True (average weights across heads).
- past_kv (tuple(Tensor, Tensor)) – Past keys and values. Tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads]. The 0th and 1st tensors contain the past keys and values, respectively. Defaults to None.
- cache_present_kv (bool) – Specifies whether the present keys and values must be cached and returned. Needed to speed up computation when the decoder is called within an autoregressive loop. Defaults to False.
- past_kv_self_attn (bool) – Specifies whether the past keys and values should be used for self-attention (True) or cross-attention (False). Ignored if past_kv is not provided. Default: True.
- position_bias (Tensor) – Tensor containing the position bias to apply in attention, with shape [num_heads, query_length, key_length].
- rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
Returns
Attention output tensor with shape [batch_size, seq_length, embed_dim].
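A minimal self-attention sketch (illustrative; the additive-float-mask convention for attn_mask is an assumption, since the docstring above only specifies mask shapes):

import torch
from cerebras.modelzoo.layers import MultiheadAttention

batch_size, seq_length, embed_dim, num_heads = 2, 16, 64, 4  # hypothetical sizes
attn = MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)

x = torch.rand(batch_size, seq_length, embed_dim)  # batch_first=True layout

# 3D causal mask of shape [batch_size, query_length, seq_length]
# (assumption: float masks are added to the attention logits).
causal = torch.triu(
    torch.full((seq_length, seq_length), float("-inf")), diagonal=1
).expand(batch_size, -1, -1)

out = attn(x, x, x, attn_mask=causal)  # self-attention
# out: [batch_size, seq_length, embed_dim]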
class cerebras.modelzoo.layers.BatchChannelNorm2D(*args, **kwargs)[source]#
Bases: torch.nn.Module
Implements the batch-channel normalization proposed in Micro-Batch Training with Batch-Channel Normalization and Weight Standardization (https://arxiv.org/abs/1903.10520).
Parameters
- num_groups (int) – number of groups to separate the channels into.
- num_channels (int) – number of channels. C from an expected input of size (N, C, H, W).
- eps (float) – a value added to the denominator for numerical stability. Default: 1e-5.
- momentum (float) – the update rate used for the running_mean and running_var computation. Default: 0.1.
- device (torch.device) – Device to place the learnable parameters on.
- dtype (torch.dtype) – Data type of the learnable parameters.
Shape:
- input: (N, C, H, W)
- output: (N, C, H, W) (same shape as input)
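A minimal usage sketch (illustrative; the keyword names follow the parameter list above and the sizes are hypothetical):

import torch
from cerebras.modelzoo.layers import BatchChannelNorm2D

bcn = BatchChannelNorm2D(num_groups=4, num_channels=32)

x = torch.rand(8, 32, 28, 28)  # (N, C, H, W)
y = bcn(x)                     # same shape as the input: (8, 32, 28, 28)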
class cerebras.modelzoo.layers.EmbeddingLayer(*args, **kwargs)[source]#
Bases: torch.nn.Module
Creates token and, optionally, position and segment embeddings.
Parameters
- vocab_size (int) – Size of the input vocabulary.
- embedding_size (int) – Dimension of the embedding space.
- pad_token_id (Optional[int]) – If specified, the entries at padding_idx do not contribute to the gradient; the embedding vector at padding_idx is therefore not updated during training.
- segment_embedding_size (int) – Dimension of the embedding space for segment embeddings. Useful when factorized embeddings are used for tokens, so the size of the embedding space for segments differs from that for tokens. Defaults to the same value as embedding_size.
- embeddings_initializer (Optional[str, Callable]) – Token embeddings initializer. Defaults to 'uniform'.
- max_position_embeddings (int) – Maximum sequence length to train the model with.
- position_embedding_type (str) – 'learned', 'fixed', or 'rotary'. Defaults to 'learned'. For 'rotary' embeddings, the embeddings are not created in this layer but are computed on the key and query tensors by RotaryPositionEmbeddingHelper.
- position_embedding_offset (int) – Offset for position embeddings. Defaults to 0.
- min_timescale (Optional[int]) – The scale of the shortest sinusoid. Defaults to 1.0. (Only needs to be specified when position_embedding_type is 'fixed'.)
- max_timescale (Optional[int]) – The scale of the longest sinusoid. Defaults to 1.0e4. (Only needs to be specified when position_embedding_type is 'fixed'.)
- position_embeddings_initializer (Optional[str, Callable]) – Position embeddings initializer. Defaults to 'uniform'.
- num_segments (Optional[int]) – Number of segments for the segment embedding layer. Defaults to None, in which case the segment embedding layer is not created.
- segment_embeddings_initializer (Optional[str, Callable]) – Segment embeddings initializer. Defaults to 'uniform'.
- device (optional) – Device to create the model parameters on; can be a CUDA device or CS device.
forward(input_ids, position_ids=None, segment_ids=None, past_length=0)[source]#
Convert input_ids to token embeddings according to the embedding type.
Word embeddings (required), segment embeddings (optional) and position embeddings (optional).
Parameters
- input_ids (Tensor) – Input token IDs with shape [batch_size, seq_length].
- position_ids (Tensor) – Position IDs with shape [batch_size, seq_length].
- segment_ids (Tensor) – Input segment IDs with shape [batch_size, seq_length].
Returns
Token embedding output with shape [batch_size, seq_length, embedding_size].
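A minimal usage sketch (illustrative; the keyword names follow the parameter list above and the sizes are hypothetical):

import torch
from cerebras.modelzoo.layers import EmbeddingLayer

embedding = EmbeddingLayer(
    vocab_size=32000,
    embedding_size=256,
    max_position_embeddings=128,
    position_embedding_type="learned",
)

input_ids = torch.randint(0, 32000, (2, 128))  # [batch_size, seq_length]
hidden = embedding(input_ids)
# hidden: [2, 128, 256], i.e. [batch_size, seq_length, embedding_size]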
class cerebras.modelzoo.layers.FeedForwardNetwork(*args, **kwargs)[source]#
Bases: torch.nn.Module
A feed forward network consisting of a stack of fully connected layers, arranged as a [LinearLayer -> Activation -> Dropout] block repeated len(layers_units) times.
Parameters
config (FeedForwardNetworkConfig) – Feed forward network config.
Initialize the FFN object instance.
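The stacked block structure described above can be sketched in plain PyTorch as follows (illustrative only; the actual layer is driven by FeedForwardNetworkConfig, whose fields are not reproduced here, and layers_units follows the name used in the description above):

import torch.nn as nn

def make_ffn_sketch(input_unit, layers_units, activation=nn.GELU, dropout=0.1):
    # [LinearLayer -> Activation -> Dropout] repeated once per entry of layers_units.
    blocks, in_features = [], input_unit
    for out_features in layers_units:
        blocks += [nn.Linear(in_features, out_features), activation(), nn.Dropout(dropout)]
        in_features = out_features
    return nn.Sequential(*blocks)

ffn = make_ffn_sketch(input_unit=256, layers_units=[1024, 256])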
class cerebras.modelzoo.layers.GPTJDecoderLayer(*args, **kwargs)[source]#
Bases: cerebras.modelzoo.layers.TransformerDecoderLayer.TransformerDecoderLayer
GPTJDecoderLayer inherits from TransformerDecoderLayer and makes two modifications:
- It uses a parallel decoder architecture instead of the sequential one.
- It supports both GPT-J and GPT-NeoX, the latter of which uses untied layer norm.
Reference: https://www.cerebras.net/blog/how-to-harness-the-predictive-power-of-gpt-j
Parameters
- d_model (int) – the number of expected features in the input (required).
- nhead (int) – the number of heads in the multihead-attention models (required).
- use_untied_layer_norm (bool) – whether to use untied layer norm. Should be False for GPT-J and True for NeoX.
- kwargs – the remaining arguments, which are the same as for TransformerDecoderLayer.
forward(tgt, memory=None, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, attention_mask=None, rotary_position_embedding_helper=None, past_kv=None, cache_present_kv=False, self_attn_position_bias=None, cross_attn_position_bias=None, layer_idx=None, expert_hash_idx=None)[source]#
GPTJ layer with rotary position embeddings and parallel decoder architecture
Parameters
- tgt (torch.Tensor) – the sequence to the decoder layer (required).
- memory (Optional[torch.Tensor]) – the sequence from the last layer of the encoder (optional).
- tgt_mask (Optional[torch.Tensor]) – the mask for the tgt sequence (optional).
- memory_mask (Optional[torch.Tensor]) – the mask for the memory sequence (optional).
- tgt_key_padding_mask (Optional[torch.Tensor]) – the mask for the tgt keys per batch (optional).
- memory_key_padding_mask (Optional[torch.Tensor]) – the mask for the memory keys per batch (optional).
- rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
- past_kv (Optional[Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]]]) – Past keys and values for the self-attention and (if applicable) cross-attention modules. Key/value tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads] (optional).
- cache_present_kv (bool) – Specifies whether the present keys and values must be cached and returned. Needed to speed up computation when the decoder is called within an autoregressive loop (optional).
- self_attn_position_bias (Optional[torch.Tensor]) – the tensor containing the position bias to apply in self-attention; can be obtained from relative or alibi position embeddings.
- expert_hash_idx (Optional[torch.Tensor]) – tensor containing mixture-of-experts expert selection indices for each token in the batch. Only used with MoE when hash-based routing is enabled (optional).
Shape:
Output tensor with the same shape as tgt.
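The parallel decoder architecture mentioned above can be summarized with the following sketch (illustrative pseudo-layer in plain PyTorch, not the Modelzoo implementation):

import torch.nn as nn

class ParallelBlockSketch(nn.Module):
    # Contrasts the parallel residual used by GPT-J-style layers with the
    # sequential residual of a standard decoder layer.
    def __init__(self, d_model, nhead):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.ln(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # Parallel: attention and MLP both read the same normalized input and are
        # summed into a single residual, instead of attn -> residual -> mlp -> residual.
        return x + attn_out + self.mlp(h)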
class cerebras.modelzoo.layers.GroupInstanceNorm(*args, **kwargs)[source]#
Bases: torch.nn.Module
Uses torch.nn.GroupNorm to emulate InstanceNorm by setting number of groups equal to the number of channels.
Parameters
num_channels (int) – number of channels. C from an expected input of size (N, C, H, W).
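The equivalence described above can be checked directly with plain PyTorch (illustrative; not the Modelzoo class itself):

import torch
import torch.nn as nn

x = torch.rand(8, 32, 28, 28)  # (N, C, H, W)

# GroupNorm with num_groups == num_channels normalizes each channel of each
# sample independently over (H, W), which is exactly instance normalization.
gn = nn.GroupNorm(num_groups=32, num_channels=32)
inorm = nn.InstanceNorm2d(32, affine=True)

print(torch.allclose(gn(x), inorm(x), atol=1e-5))  # True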
class cerebras.modelzoo.layers.MultiQueryAttention(*args, **kwargs)[source]#
Bases: cerebras.modelzoo.layers.AttentionLayer.MultiheadAttention
Implements the multi-query attention layer from Fast Transformer Decoding: One Write-Head is All You Need (https://arxiv.org/abs/1911.02150).
Parameters
- embed_dim (int) – Number of input units in each projection output.
- num_heads (int) – Number of attention heads.
- inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to embed_dim.
- dropout (float) – Dropout rate for key-query weights. Defaults to 0.0.
- batch_first (bool) – If True, the input and output tensors are provided as (batch, seq, feature); otherwise the format is (seq, batch, feature). Default: True (batch, seq, feature).
- add_bias_kv (bool) – If specified, adds bias to the key and value sequences at dim=0. Default: False.
- add_zero_attn (bool) – If specified, adds a new batch of zeros to the key and value sequences at dim=1. Default: False.
- kdim (int) – Number of output units in the key projection.
- vdim (int) – Number of output units in the value projection.
- use_projection_bias (bool) – Whether to use bias in the key, query, and value projections.
- use_ffn_bias (bool) – Whether to use bias in the output projection.
- attention_initializer (str) – Projection kernel initializer. Defaults to xavier_uniform.
- attention_q_initializer – Query projection kernel initializer. If not specified, the query projection is initialized via attention_initializer.
- output_layer_initializer (str or initializer) – If not None, use this initializer for the output transform layer. Defaults to None.
- bias_initializer (str) – Bias initializer. Defaults to zeros.
- attention_type (str) – The attention variant to execute. Currently accepts dot_product and scaled_dot_product. Defaults to scaled_dot_product.
- softmax_dtype_fp32 (bool) – Use an FP32 softmax implementation.
- attention_kernel (str | None) – Kernel to use. Uses default if None. Accepted values: None – default implementation; fast_attention – experimental optimized implementation.
- device (optional) – Device to create the model parameters on; can be a CUDA device or CS device.
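The defining property of multi-query attention, a single key/value head shared by every query head, can be sketched as follows (illustrative only; no masking or dropout, and not the Modelzoo implementation):

import torch

def multi_query_attention_sketch(q, k, v, num_heads):
    # q: [batch, seq, embed_dim]; k, v: [batch, seq, head_dim] (one shared KV head).
    # Sharing one KV head shrinks the KV cache by a factor of num_heads at decode time.
    batch, seq, embed_dim = q.shape
    head_dim = embed_dim // num_heads

    q = q.view(batch, seq, num_heads, head_dim).transpose(1, 2)      # [B, H, S, D]
    scores = torch.einsum("bhsd,btd->bhst", q, k) / head_dim ** 0.5  # [B, H, S, T]
    probs = scores.softmax(dim=-1)
    out = torch.einsum("bhst,btd->bhsd", probs, v)                   # [B, H, S, D]
    return out.transpose(1, 2).reshape(batch, seq, embed_dim)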
class cerebras.modelzoo.layers.RelativePositionEmbeddingLayer(*args, **kwargs)[source]#
Bases: torch.nn.Module
Relative Position Embedding Layer
Parameters
- num_heads (int) – number of attention heads.
- relative_attention_bias (Tensor) – Tensor with relative attention weights. Shape: [num_relative_attention_buckets, num_heads]. Defaults to None.
- num_relative_attention_buckets (int) – Number of buckets used to calculate relative position bias. Default: 32.
- max_relative_positions (int) – The maximum relative distance used when calculating relative position buckets. See the relative_position_bucket docs for more details. Default: 128.
- bidirectional_relative_attention (bool) – Whether attention is bidirectional.
- allow_negative_buckets (bool) – If enabled, position buckets will be both positive and negative (as required by certain models like DeBERTa). Default: False.
- relative_attn_bias_initializer (str) – Relative attention bias initializer. Defaults to xavier_uniform.
Returns
Relative position bias, to be used in attention masking
Return type
position_bias (Tensor)
forward(seq_length, key_length, past_kv=None)[source]#
Return the position bias.
Parameters
- seq_length (int) – the length of query tokens.
- key_length (int) – the length of key tokens.
Returns
Position bias tensor with shape [num_heads, query_length, key_length]
static relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128, allow_negative_buckets=False)[source]#
Translate a relative position to a bucket number for relative attention. The relative position is defined as memory_position - query_position, i.e., the distance in tokens from the attending position to the attended-to position.
If bidirectional_relative_attention = False, then positive relative positions are invalid. Smaller buckets are used for small absolute relative positions and larger buckets for larger absolute relative positions. All relative positions >= max_distance map to the same bucket, and all relative positions <= -max_distance map to the same bucket. This should allow for more graceful generalization to longer sequences than the model has been trained on.
Parameters
- relative_position (Tensor) – Tensor with relative positions.
- bidirectional (bool) – Whether attention is bidirectional.
- num_buckets (int) – Number of buckets for relative positions.
- max_distance (int) – Used to calculate relative position buckets.
- allow_negative_buckets (bool) – If enabled, position buckets will be both positive and negative (as required by certain models like DeBERTa). Default: False.
Returns
A Tensor with the same shape as relative_position, containing int32 values in the range [0, num_relative_attention_buckets).
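A minimal sketch of the bucketing step (illustrative; it calls the static method with the signature documented above and hypothetical lengths):

import torch
from cerebras.modelzoo.layers import RelativePositionEmbeddingLayer

query_length, key_length = 4, 4
context_position = torch.arange(query_length)[:, None]
memory_position = torch.arange(key_length)[None, :]
relative_position = memory_position - context_position  # memory_position - query_position

buckets = RelativePositionEmbeddingLayer.relative_position_bucket(
    relative_position, bidirectional=True, num_buckets=32, max_distance=128
)
# buckets has the same shape as relative_position ([4, 4]) and holds indices in
# [0, num_buckets); these index the learned relative_attention_bias table.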
class cerebras.modelzoo.layers.Transformer(*args, **kwargs)[source]#
Bases: torch.nn.Module
A transformer model. The user is able to modify the attributes as needed. The architecture is based on the paper "Attention Is All You Need". Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users can build the BERT (https://arxiv.org/abs/1810.04805) model with the corresponding parameters.
Parameters
- d_model (int) – the number of expected features in the encoder/decoder inputs (default=512).
- nhead (int) – the number of heads in the multihead attention models (default=8).
- num_encoder_layers (int) – the number of sub-encoder-layers in the encoder (default=6).
- num_decoder_layers (int) – the number of sub-decoder-layers in the decoder (default=6).
- dim_feedforward (int) – the dimension of the feedforward network model (default=2048).
- dropout (float) – the dropout value (default=0.1).
- activation (Union[str, Callable[[torch.Tensor], torch.Tensor]]) – the activation function of the encoder/decoder intermediate layer; can be a string ("relu" or "gelu") or a unary callable. Default: gelu.
- custom_encoder (Optional[Any]) – custom encoder (default=None).
- custom_decoder (Optional[Any]) – custom decoder (default=None).
- layer_norm_eps (float) – the eps value in layer normalization components (default=1e-5).
- batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).
- norm_first (bool) – if True, encoder and decoder layers will perform LayerNorms before other attention and feedforward operations, otherwise after. Default: False (after).
- attention_type – Should be in ["scaled_dot_product", "dot_product"].
- use_projection_bias_in_attention – Add bias to the Q, K, V projections in the attention layer. Defaults to False.
- use_ffn_bias_in_attention – Add bias in the concluding FFN in the attention layer. Defaults to False.
- use_ffn_bias – Add bias in all dense layers of the decoder's ffn sublayer.
- attention_initializer – Attention layer initializer. Defaults to "xavier_uniform".
- ffn_initializer – FFN layer initializer. Defaults to "xavier_uniform".
- device (optional) – Device to create the model parameters on; can be a CUDA device or CS device.
Examples::
>>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
>>> src = torch.rand((10, 32, 512))
>>> tgt = torch.rand((20, 32, 512))
>>> out = transformer_model(src, tgt)
Note: A full example applying the nn.Transformer module to a word language model is available at https://github.com/pytorch/examples/tree/master/word_language_model
forward(src, tgt, src_mask=None, tgt_mask=None, memory_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)[source]#
Take in and process masked source/target sequences.
Parameters
- src (torch.Tensor) – the sequence to the encoder (required).
- tgt (torch.Tensor) – the sequence to the decoder (required).
- src_mask (Optional[torch.Tensor]) – the additive mask for the src sequence (optional).
- tgt_mask (Optional[torch.Tensor]) – the additive mask for the tgt sequence (optional).
- memory_mask (Optional[torch.Tensor]) – the additive mask for the encoder output (optional).
- src_key_padding_mask (Optional[torch.Tensor]) – the ByteTensor mask for src keys per batch (optional).
- tgt_key_padding_mask (Optional[torch.Tensor]) – the ByteTensor mask for tgt keys per batch (optional).
- memory_key_padding_mask (Optional[torch.Tensor]) – the ByteTensor mask for memory keys per batch (optional).
Shape:
- src: (S, E) for unbatched input, (S, N, E) if batch_first=False or (N, S, E) if batch_first=True.
- tgt: (T, E) for unbatched input, (T, N, E) if batch_first=False or (N, T, E) if batch_first=True.
- src_mask: (S, S) or (N⋅num_heads, S, S).
- tgt_mask: (T, T) or (N⋅num_heads, T, T).
- memory_mask: (T, S).
- src_key_padding_mask: (S) for unbatched input, otherwise (N, S).
- tgt_key_padding_mask: (T) for unbatched input, otherwise (N, T).
- memory_key_padding_mask: (S) for unbatched input, otherwise (N, S).
Note: [src/tgt/memory]_mask ensures that position i is allowed to attend the unmasked positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight. [src/tgt/memory]_key_padding_mask specifies elements in the key to be ignored by the attention. If a ByteTensor is provided, the non-zero positions will be ignored while the zero positions will be unchanged. If a BoolTensor is provided, positions with the value True will be ignored while positions with the value False will be unchanged.
- output: (T, E) for unbatched input, (T, N, E) if batch_first=False or (N, T, E) if batch_first=True.
Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of a transformer is the same as the input sequence (i.e. target) length of the decoder.
where S is the source sequence length, T is the target sequence length, N is the batch size, and E is the feature number.
Examples
>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
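The mask semantics in the note above can be exercised with a short sketch (illustrative; it reuses the plain nn.Transformer from the example, and the BoolTensor convention is the one stated in the note):

import torch
import torch.nn as nn

S, T, N, E = 10, 20, 32, 512
transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
src, tgt = torch.rand(S, N, E), torch.rand(T, N, E)  # batch_first=False layout

# Causal (T, T) BoolTensor mask: True marks positions that may NOT be attended.
tgt_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

output = transformer_model(src, tgt, tgt_mask=tgt_mask)
# output: (T, N, E), matching the target length as described in the note above.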
class cerebras.modelzoo.layers.TransformerDecoder(*args, **kwargs)[source]#
Bases: torch.nn.Module
TransformerDecoder is a stack of N decoder layers
Parameters
- decoder_layer – an instance of the TransformerDecoderLayer() class (required).
- num_layers – the number of sub-decoder-layers in the decoder (required).
- norm – the layer normalization component (optional).
Examples::
>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
>>> memory = torch.rand(10, 32, 512)
>>> tgt = torch.rand(20, 32, 512)
>>> out = transformer_decoder(tgt, memory)
forward(tgt, memory=None, tgt_mask=None, sparse_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, self_attn_position_bias=None, cross_attn_position_bias=None, rotary_position_embedding_helper=None, past_kv=None, cache_present_kv=False, extract_layer_idx=None, expert_hash_idx=None, **extra_args)[source]#
Pass the inputs (and mask) through the decoder layer in turn.
Parameters
- tgt (torch.Tensor) – the sequence to the decoder (required).
- memory (Optional[torch.Tensor]) – the sequence from the last layer of the encoder (optional).
- tgt_mask (Optional[torch.Tensor]) – the mask for the tgt sequence (optional).
- memory_mask (Optional[torch.Tensor]) – the mask for the memory sequence (optional).
- tgt_key_padding_mask (Optional[torch.Tensor]) – the mask for the tgt keys per batch (optional).
- memory_key_padding_mask (Optional[torch.Tensor]) – the mask for the memory keys per batch (optional).
- self_attn_position_bias (Optional[torch.Tensor]) – the tensor containing the position bias to apply in self-attention; can be obtained from relative or alibi position embeddings.
- cross_attn_position_bias (Optional[torch.Tensor]) – similar to self_attn_position_bias, this is the tensor containing the position bias to apply in cross-attention.
- rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
- past_kv (Optional[List[Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]]]]) – Past keys and values for each of the decoder layers (optional).
- cache_present_kv (bool) – Specifies whether the present keys and values must be cached and returned (optional).
- extract_layer_idx (Optional[int]) – (inclusive) layer index in the range [0, self.num_layers) (zero-indexed). Applies decoder layers up to (and including) extract_layer_idx instead of all decoder layers. For example, extract_layer_idx=3 would run the forward pass from decoder_block_0 to decoder_block_3 and return the outputs from decoder_block_3. If extract_layer_idx is None and norm is not None, the returned output is the final decoder block's output passed through norm.
- expert_hash_idx (Optional[torch.Tensor]) – Optional tensor for mixture-of-experts models with hash-based routing. The tensor contains the expert ID for each token in the batch, based on a hashing calculation.
Shape:
see the docs in Transformer class.
class cerebras.modelzoo.layers.TransformerDecoderLayer(*args, **kwargs)[source]#
Bases: torch.nn.Module
TransformerDecoderLayer is made up of self-attn, multihead-attn and feedforward network. This standard decoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.
Parameters
- d_model (int) – the number of expected features in the input (required).
- nhead (int) – the number of heads in the multihead-attention models (required).
- dim_feedforward (int) – the dimension of the feedforward network model (default=2048).
- dropout (float) – the dropout value (default=0.1).
- activation (Union[str, Callable[[torch.Tensor], torch.Tensor]]) – the activation function of the intermediate layer; can be a string ("relu" or "gelu") or a unary callable. Default: gelu.
- layer_norm_eps (float) – the eps value in layer normalization components (default=1e-5).
- batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).
- norm_layer (Type[torch.nn.Module]) – the normalization class that will be used before/after FF layers (default=nn.LayerNorm).
- norm_first (bool) – if True, layer norm is done prior to the self-attention, multihead-attention, and feedforward operations, respectively; otherwise it is done after. Default: False (after).
- attention_dropout_rate (Optional[float]) – Attention dropout rate. If None, defaults to dropout.
- attention_softmax_fp32 (Optional[bool]) – Use FP32 softmax in the attention block.
- use_projection_bias_in_attention – Add bias to the Q, K, V projections in the attention layer. Defaults to False.
- attention_type – Should be in ["scaled_dot_product", "dot_product"].
- scale_qk_dot_by_d (bool) – If True, scales the QK^T dot product by d (= hidden / d_head) instead of sqrt(d).
- attention_logit_alpha (float) – Scales the QK^T dot product. Used to stabilize logits in muP training.
- attention_inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to d_model.
- add_cross_attention (bool) – If True, adds a cross-attention layer between encoder and decoder; otherwise, only self-attention is used in the decoder (GPT-style models should set this to False).
- use_ffn_bias_in_attention – Add bias in the concluding FFN in the attention layer. Defaults to False.
- use_ffn_bias – Add bias in all dense layers of the decoder's ffn sublayer.
- attention_initializer – Attention layer initializer. Defaults to "xavier_uniform".
- attention_q_initializer – Query projection kernel initializer. If not specified, the query projection is initialized via attention_initializer.
- attention_output_layer_initializer – Attention output layer projection initializer. If not specified, the output projection is initialized via attention_initializer.
- ffn_initializer – FFN layer initializer. Defaults to "xavier_uniform".
- ffn_output_layer_initializer – If not None, initialize the last FFN layer with this initializer. Defaults to None.
- use_ff_layer1_dropout (bool) – If True, dropout will be enabled after the first feed forward layer. Default: True.
- use_ff_layer2_dropout (bool) – If True, dropout will be enabled after the second feed forward layer. Default: True.
- ffn_dropout_rate (Optional[float]) – Controls the dropout rate of the FFN's first layer. If None, defaults to dropout.
- moe_params – A dict of MoE params including num_experts, top_k, and load_balancing_loss_coef.
Examples
>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
>>> memory = torch.rand(32, 10, 512)
>>> tgt = torch.rand(32, 20, 512)
>>> out = decoder_layer(tgt, memory)
forward(tgt, memory=None, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, rotary_position_embedding_helper=None, past_kv=None, cache_present_kv=False, self_attn_position_bias=None, cross_attn_position_bias=None, layer_idx=None, expert_hash_idx=None, **extra_args)[source]#
Pass the inputs (and mask) through the decoder layer.
Parameters
- tgt (torch.Tensor) – the sequence to the decoder layer (required).
- memory (Optional[torch.Tensor]) – the sequence from the last layer of the encoder (optional).
- tgt_mask (Optional[torch.Tensor]) – the mask for the tgt sequence (optional).
- memory_mask (Optional[torch.Tensor]) – the mask for the memory sequence (optional).
- tgt_key_padding_mask (Optional[torch.Tensor]) – the mask for the tgt keys per batch (optional).
- memory_key_padding_mask (Optional[torch.Tensor]) – the mask for the memory keys per batch (optional).
- rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
- past_kv (Optional[Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]]]) – Past keys and values for the self-attention and (if applicable) cross-attention modules. Key/value tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads] (optional).
- cache_present_kv (bool) – Specifies whether the present keys and values must be cached and returned. Needed to speed up computation when the decoder is called within an autoregressive loop (optional).
- self_attn_position_bias (Optional[torch.Tensor]) – the tensor containing the position bias to apply in self-attention; can be obtained from relative or alibi position embeddings.
- expert_hash_idx (Optional[torch.Tensor]) – tensor containing mixture-of-experts expert selection indices for each token in the batch. Only used with MoE when hash-based routing is enabled (optional).
Shape:
see the docs in Transformer class.
class cerebras.modelzoo.layers.TransformerEncoder(*args, **kwargs)[source]#
Bases: torch.nn.Module
TransformerEncoder is a stack of N encoder layers
Parameters
- encoder_layer – an instance of the TransformerEncoderLayer() class (required).
- num_layers – the number of sub-encoder-layers in the encoder (required).
- norm – the layer normalization component (optional).
- enable_nested_tensor – if True, the input will automatically be converted to a nested tensor (and converted back on output). This improves the overall performance of TransformerEncoder when the padding rate is high. Default: False (disabled).
Examples::
>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
>>> src = torch.rand(10, 32, 512)
>>> out = transformer_encoder(src)
forward(src, mask=None, src_key_padding_mask=None, rotary_position_embedding_helper=None, self_attn_position_bias=None, extract_layer_idx=None, **extra_args)[source]#
Pass the input through the encoder layers in turn.
Parameters
- src (torch.Tensor) – the sequence to the encoder (required).
- mask (Optional[torch.Tensor]) – the mask for the src sequence (optional).
- src_key_padding_mask (Optional[torch.Tensor]) – the mask for the src keys per batch (optional).
- rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
- self_attn_position_bias (Optional[torch.Tensor]) – the tensor containing the position bias to apply in self-attention; can be obtained from relative or alibi position embeddings.
- extract_layer_idx (Optional[int]) – (inclusive) layer index in the range [0, self.num_layers) (zero-indexed). Applies encoder layers up to (and including) extract_layer_idx instead of all encoder layers. For example, extract_layer_idx=3 would run the forward pass from encoder_block_0 to encoder_block_3 and return the outputs from encoder_block_3. If extract_layer_idx is None and norm is not None, the returned output is the final encoder block's output passed through norm.
Shape:
see the docs in Transformer class.
class cerebras.modelzoo.layers.TransformerEncoderLayer(*args, **kwargs)[source]#
Bases: torch.nn.Module
TransformerEncoderLayer is made up of self-attn and feedforward network. This standard encoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.
Parameters
- d_model (int) – the number of expected features in the input (required).
- nhead (int) – the number of heads in the multihead attention models (required).
- dim_feedforward (int) – the dimension of the feedforward network model (default=2048).
- dropout (float) – the dropout value (default=0.1).
- activation (Union[str, Callable[[torch.Tensor], torch.Tensor]]) – the activation function of the intermediate layer; can be a string ("relu" or "gelu") or a unary callable. Default: gelu.
- layer_norm_eps (float) – the eps value in layer normalization components (default=1e-5).
- batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).
- norm_layer (Type[torch.nn.Module]) – the normalization class that will be used before/after FF layers (default=nn.LayerNorm).
- norm_first (bool) – if True, layer norm is done prior to the attention and feedforward operations, respectively; otherwise it is done after. Default: False (after).
- attention_dropout_rate (Optional[float]) – Attention dropout rate. If None, defaults to dropout.
- use_projection_bias_in_attention – Add bias to the Q, K, V projections in the attention layer. Defaults to False.
- attention_type – Should be in ["scaled_dot_product", "dot_product"].
- scale_qk_dot_by_d (bool) – If True, scales the QK^T dot product by d (= hidden / d_head) instead of sqrt(d).
- attention_softmax_fp32 (Optional[bool]) – Use FP32 softmax in the attention block.
- attention_inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to d_model.
- add_cross_attention – If True, adds a cross-attention layer between encoder and decoder; otherwise, only self-attention is used in the decoder (GPT-style models should set this to False).
- use_ffn_bias_in_attention – Add bias in the concluding FFN in the attention layer. Defaults to False.
- use_ffn_bias – Add bias in all dense layers of the decoder's ffn sublayer.
- attention_initializer – Attention layer initializer. Defaults to "xavier_uniform".
- attention_q_initializer – Query projection kernel initializer. If not specified, the query projection is initialized via attention_initializer.
- attention_output_layer_initializer – Attention output layer projection initializer. If not specified, the output projection is initialized via attention_initializer.
- ffn_initializer – FFN layer initializer. Defaults to "xavier_uniform".
- ffn_output_layer_initializer – If not None, initialize the last FFN layer with this initializer. Defaults to None.
- use_ff_layer1_dropout (bool) – If True, dropout will be enabled after the first feed forward layer. Default: True.
- use_ff_layer2_dropout (bool) – If True, dropout will be enabled after the second feed forward layer. Default: True.
- ffn_dropout_rate (Optional[float]) – Controls the dropout rate of the FFN's first layer. If None, defaults to dropout.
Example
When batch_first is True:
>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
>>> src = torch.rand(32, 10, 512)
>>> out = encoder_layer(src)
forward(src, src_mask=None, src_key_padding_mask=None, rotary_position_embedding_helper=None, self_attn_position_bias=None, **extra_args)[source]#
Pass the input through the encoder layer.
Parameters
- src (torch.Tensor) – the sequence to the encoder layer (required).
- src_mask (Optional[torch.Tensor]) – the mask for the src sequence (optional).
- src_key_padding_mask (Optional[torch.Tensor]) – the mask for the src keys per batch (optional).
- rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
- self_attn_position_bias (Optional[torch.Tensor]) – the tensor containing the position bias to apply in self-attention; can be obtained from relative or alibi position embeddings.
Shape:
see the docs in Transformer class.