Stay up to date with the latest features, enhancements, bug fixes, and improvements.
Note: Please see this guide for instructions on how to conduct data preprocessing for Llama 3.3 70B.
Added a `null_expert_bias` parameter that represents the model's uncertainty, or a "none of the above" option, when routing. By including a null expert probability in the weighting calculation, gradient flow back to the router is improved, leading to improved loss, especially in scenarios where only the top single expert (`top_k=1`) is selected. Users can continue to choose between normalizing expert weights into a probability distribution or simply using the raw router scores as attention-like weights; the added null expert probability integrates seamlessly with both approaches.
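As a rough illustration of the idea (this is a minimal sketch, not the Model Zoo implementation; the function name, tensor shapes, and use of softmax here are assumptions), a router can append a learned null-expert logit before computing weights:

```python
import torch
import torch.nn.functional as F

def route_with_null_expert(router_logits, null_expert_bias, top_k=1):
    # router_logits: [..., num_experts]; null_expert_bias: learnable scalar tensor.
    # Append the null-expert logit so some probability mass can mean "no expert".
    null_logit = null_expert_bias.expand(*router_logits.shape[:-1], 1)
    probs = F.softmax(torch.cat([router_logits, null_logit], dim=-1), dim=-1)
    # Dispatch only to real experts; the null column still shapes the softmax,
    # so gradients flow back to the router even when top_k=1.
    weights, expert_idx = probs[..., :-1].topk(top_k, dim=-1)
    return weights, expert_idx
```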
| Old import | New import |
| --- | --- |
| `import cerebras_appliance` | `import cerebras.appliance` |
| `import cerebras_pytorch` | `import cerebras.pytorch as cstorch` |
| `from modelzoo import` | `from cerebras.modelzoo import` |
`max_checkpoints` is now stateless across multiple invocations of `run.py`. This means that checkpoints generated by a previous run in the same `model_dir` will no longer be counted towards `max_checkpoints` for the current run, and that the previous checkpoints will no longer be automatically deleted in the current run. This provides you with greater flexibility to control which checkpoints are deleted vs. saved from run to run, but also means you may want to keep a closer eye on remaining disk space.
The `use_cs_grad_accum` YAML parameter in Model Zoo models has been deprecated and no longer needs to be explicitly configured to work with gradient accumulation. To set specific micro batch sizes and enable or disable gradient accumulation, you should now directly set the `micro_batch_size` YAML parameter to `none | auto | explore | <positive_int>`. See the Working with Micro Batches page for details.
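For illustration, a minimal sketch of setting this parameter from Python (the `params.yaml` file name and the `train_input` section layout are assumptions for this example):

```python
import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)

# micro_batch_size accepts "none", "auto", "explore", or a positive integer.
params["train_input"]["micro_batch_size"] = "auto"

with open("params.yaml", "w") as f:
    yaml.safe_dump(params, f)
```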
Specifying `micro_batch_size` when constructing a `cstorch.backend` is now deprecated. To specify `micro_batch_size` using the cstorch API, pass this option to `cstorch.utils.data.DataExecutor` instead. This change gives users the ability to choose different micro batch sizes for train vs. eval tasks.
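A condensed sketch of the new placement (the dataloader construction and step count are illustrative; consult the cstorch documentation for the exact signatures):

```python
import cerebras.pytorch as cstorch

backend = cstorch.backend("CSX")  # micro_batch_size is no longer passed here

dataloader = cstorch.utils.data.DataLoader(train_input_fn)  # train_input_fn defined elsewhere
executor = cstorch.utils.data.DataExecutor(
    dataloader,
    num_steps=1000,           # illustrative
    micro_batch_size="auto",  # or a positive integer; train and eval executors may differ
)
```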
The structure of `model_dir` has changed. There are now subdirectories for each individual run, each with its own artifacts. These subdirectories are named `model_dir/cerebras_logs/<train|eval>/<timestamp>`. Checkpoints, TensorBoard event files, and YAMLs still exist directly under `model_dir`. The most recent run's subdirectory can be accessed using the symlink `model_dir/cerebras_logs/latest`. Run logs have been moved from `model_dir/run_xxx.log` to `model_dir/cerebras_logs/<train|eval>/<timestamp>/run.log`, or equivalently `model_dir/cerebras_logs/latest/run.log`.
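For example, the most recent run's log can be located through the symlink (a minimal sketch; the `model_dir` path is a placeholder):

```python
from pathlib import Path

model_dir = Path("model_dir")
latest_log = model_dir / "cerebras_logs" / "latest" / "run.log"
# Resolves to model_dir/cerebras_logs/<train|eval>/<timestamp>/run.log
print(latest_log.resolve())
```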
Previously, when the `max_checkpoints` parameter was not set in the `runconfig` portion of the YAML config file, it would default to 5 and only retain the 5 most recent checkpoints. In release 2.1.1, `max_checkpoints` now defaults again to infinity, reverting back to the previous default checkpoint saving behavior.
Set `autogen_policy: disabled` if you were previously using it (for more information on how to do this, see Autogenerating fused kernels for loss operations). If your model previously required autogen and is now no longer compiling, reach out to the Cerebras Support Team for assistance. We will be restoring autogen functionality in the next release.
Added support for the `cbfloat16` data format. Training jobs can be switched seamlessly from `bfloat16` to `cbfloat16` with dynamic loss scaling without impact on model accuracy. For more information, refer to our documentation on Numerical Precision Level.
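As a sketch of what the switch might look like in a Model Zoo config, shown as Python dicts mirroring the YAML (the `fp16_type` and `loss_scaling_factor` keys are assumptions based on Model Zoo conventions and may differ by release; consult the Numerical Precision Level page):

```python
# model section of the params YAML, expressed as a Python dict for illustration
model = {
    "fp16_type": "cbfloat16",  # assumed key; alternatives: "bfloat16", "float16"
}
# Dynamic loss scaling pairs with cbfloat16 training, per the note above.
optimizer = {"loss_scaling_factor": "dynamic"}
```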
When using `cbfloat16` training and initializing from `bfloat16` checkpoints from past releases without loss scaling, explicitly specify `--load_checkpoint_states` or its `runconfig` equivalent to ensure parameter loading from `params.yaml`. Subsequent checkpoints will inherit dynamic loss scaling and will not require this.
The contents of `model_dir` have changed. There are now subdirectories for each individual run, each with its own artifacts. These subdirectories are named `model_dir/cerebras_logs/<train|eval>/<timestamp>`. Checkpoints, TensorBoard event files, and YAMLs still exist directly under `model_dir`. The most recent run's subdirectory can be accessed using the symlink `model_dir/cerebras_logs/latest`. Run logs have been moved from `model_dir/run_xxx.log` to `model_dir/cerebras_logs/<train|eval>/<timestamp>/run.log`, or equivalently `model_dir/cerebras_logs/latest/run.log`. It is safe to delete `model_dir/cerebras_logs` data if it is no longer necessary for debugging.
The `experimental_api` flag within the `runconfig` portion of model configuration YAML files is deprecated with the official release of the Cerebras PyTorch 2.0 API.
You can continue to use the existing `run.py` workflow, but users can now also write their own training loops and make other customizations. See more information on how to write your own training loop.
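A condensed sketch of a custom loop with the cstorch API (the model, learning rate, step count, and input function are placeholders; consult the Cerebras PyTorch documentation for the exact pattern):

```python
import cerebras.pytorch as cstorch
import torch

backend = cstorch.backend("CSX")
with backend.device:
    model = torch.nn.Linear(784, 10)  # placeholder model
compiled_model = cstorch.compile(model, backend)

optimizer = cstorch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

@cstorch.trace
def train_step(inputs, targets):
    loss = loss_fn(compiled_model(inputs), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss

dataloader = cstorch.utils.data.DataLoader(my_input_fn)  # my_input_fn defined elsewhere
executor = cstorch.utils.data.DataExecutor(dataloader, num_steps=100)
for inputs, targets in executor:
    loss = train_step(inputs, targets)
```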
The layers API has been removed from `cerebras_pytorch`. See the documentation for details. Losses and layer implementations from the layers API still remain in the Cerebras Model Zoo. For more information, contact our support team.
Gradient accumulation is controlled by the `micro_batch_size` parameter within the `train_input` and `eval_input` sections of the model configuration YAML. Note that `batch_size / num_csx` must be a multiple of `micro_batch_size`; for example, with `batch_size: 960` and `num_csx: 4`, the per-system batch size is 240, which must be a multiple of `micro_batch_size`. Micro-batch sizes with good performance are recommended in the Micro-batch size setting in YAML params section of the gradient accumulation documentation in the Cerebras Developer Documentation.
The `is_pretrained_checkpoint` flag has been deprecated for clarity. Users should instead use `load_checkpoint_states` in conjunction with `checkpoint_path` to specify which components are loaded from the checkpoint. Allowed values are `model`, `optimizer`, `dataloader`, `grad_scaler`, and `lr_scheduler`. For more information, see the PyTorch params documentation.
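For illustration, a `runconfig` fragment shown as a Python dict mirroring the YAML (the checkpoint path is a placeholder; the comma-separated value format is an assumption, so check the params documentation):

```python
runconfig = {
    "checkpoint_path": "model_dir/checkpoint_10000.mdl",  # placeholder path
    # Load only these components; any of: model, optimizer, dataloader,
    # grad_scaler, lr_scheduler.
    "load_checkpoint_states": "model,optimizer",
}
```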
`use_cs_grad_accum` (in `runconfig` of YAML) in release 2.0.2.
`num_heads` within a transformer block should not be a prime number.
`model` section in the configuration YAML.
The `{pipelined,weight_streaming}` argument has been removed from `run.py` because all models now run in `weight_streaming` mode by default. All models previously supported in Pipeline are now supported for Weight Streaming.
The `batch_size` parameter in Model Zoo YAML configuration files now represents the total effective batch size of the model and is divided evenly across the specified `num_csx` CSX systems. This differs from pre-1.9.0 behavior, where the `batch_size` parameter defined the batch size per CSX, not globally. Note that `batch_size` must now be divisible by `num_csx`.
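A quick arithmetic check of the new semantics (the values are illustrative):

```python
batch_size = 960  # global effective batch size from the YAML
num_csx = 4       # number of CS-X systems

assert batch_size % num_csx == 0, "batch_size must be divisible by num_csx"
per_system_batch = batch_size // num_csx  # 240 per CS-X under release 1.9 semantics
```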
(`num_tokens`) is not yet fully supported and requires coordination with the Cerebras team.
`model_dir` can cause issues.
Previously, enabling pre-layer normalization required setting `use_pre_encoder_decoder_layer_norm: False`. This was confusing, and we have changed the behavior in 1.8: to enable pre-layer normalization you should instead set `use_pre_encoder_decoder_layer_norm: True`. This update better aligns the naming of the parameter to its usage. To use release 1.7 checkpoints in release 1.8, you'll need to update the config to reflect this change. Directions for converting configuration files can be found in our checkpoint conversion documentation.
You can now set the activation in the pooler (`pooler_nonlinearity`) and masked language model head (`mlm_nonlinearity`) independently of the activation used for the rest of the model (`encoder_nonlinearity`). Both will default to `encoder_nonlinearity` if not explicitly set. Use the checkpoint conversion documentation to convert 1.7 configuration files to 1.8 to have access to this feature.
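For example, a model section might look like this, shown as a Python dict mirroring the YAML (the activation values are illustrative, not defaults):

```python
model = {
    "encoder_nonlinearity": "gelu",
    "pooler_nonlinearity": "tanh",  # falls back to encoder_nonlinearity if unset
    "mlm_nonlinearity": "gelu",     # falls back to encoder_nonlinearity if unset
}
```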
Added `eval_all` and `train_and_eval` modes to enable users to evaluate models throughout long training runs or evaluate all checkpoints after training has completed. More details in eval.
`run_appliance.py` has now been deprecated for TensorFlow in favor of a single script called `run.py` that employs the aforementioned run script changes.
T5 models with long sequences (set via the `src_max_sequence_length` and `tgt_max_sequence_length` parameters in the model YAML config file) may have compile times of over 3 hours. T5 is only supported with input and output sequences up to 2048 tokens.
`V / (heads * GCD(S_in, S_out)) > 2^11`, where `V` is the vocabulary size.
If the `share_embedding_weights` parameter is set to `True` for PyTorch GPT-style models (e.g., here), custom settings for `embedding_initializer` in the YAML configuration (like here) are ignored. Embedding weights are initialized with the default configuration (normal distribution, std=0.02). This is done by the `__reset_parameters` function, which is called at the end of `GPT2LMHeadModel` initialization.
Both execution modes are run with `run.py` for PyTorch and with `run_appliance.py` for TensorFlow: set the `--execution_strategy` argument to `pipeline` or `weight_streaming` to specify the execution mode.
Specify `--num_csx=1` in the run command for Pipelined execution. Weight Streaming does not have such a requirement.
`--num_workers_per_csx` denotes the maximum number of Kubernetes worker pods per CS-X. Note that these are distributed across physical worker nodes attached to each CS-X on the Wafer-Scale Cluster.
Specify `--num_workers_per_csx=8` as a command line argument in the run command for Pipelined execution. Weight Streaming execution does not need to specify this argument.
`train_input` and/or `eval_input` sections of the configuration YAMLs.

We set `include_padding_in_loss: False` in the configuration YAML. We believe it makes the most sense to exclude padding tokens from the loss calculation. This setting differs from the original public implementation, where token padding is included in the loss, most likely as a performance optimization on GPUs, so our eval accuracy may differ from published numbers. This does not apply to the PyTorch version or the Original Cerebras Installation.

`cs_ip` must be specified explicitly.
When running in `eval` mode on the CS-1 system, if the `--nodes` and `--tasks_per_node` value pairs are not set to one of the following, then the session may hang. This issue exists for both TensorFlow and PyTorch.

- `--nodes=1 --tasks_per_node=2`, or
- `--nodes=2 --tasks_per_node=1`

For assistance, contact support@cerebras.net.
When running in `eval` mode on the CS-1 system, if the `--nodes` and `--tasks_per_node` value pairs are not set to one of the following, then the session may hang. This issue exists for both TensorFlow and PyTorch.

- `--nodes=1 --tasks_per_node=2`, or
- `--nodes=2 --tasks_per_node=1`

For assistance, contact support@cerebras.net.
`cs_input_analyzer` is a new Bash script that recommends the Slurm resource settings you need to run on the Cerebras system. These recommendations are generated for a given `input_fn` and model. To use this tool, run it manually. See The cs_input_analyzer Script.

When running in `eval` mode on the CS-1 system, if the `--nodes` and `--tasks_per_node` value pairs are not set to one of the following, then the session may hang. This issue exists for both TensorFlow and PyTorch.

- `--nodes=1 --tasks_per_node=2`, or
- `--nodes=2 --tasks_per_node=1`
`--nodes` and `--tasks_per_node`. The `eval` performance is not affected by these Slurm resource settings.

`eval` mode is added for BERT and FC-MNIST PyTorch models. These models now support both `train` and `eval` modes.
`cerebras.framework.torch.initialize()`.
The `cbfloat16` data format is now supported (see CB16 Half-Precision).
Support is added for `GradScaler` (see Dynamic loss scaling).
Added RoBERTa model configurations: `roberta_base.yaml` and `roberta_large.yaml`.
You can compile this model with `run.py` using the `--compile_only` flag, as well as run it on CPU or GPU using `run_cpu_gpu.py`. To train this model on the Cerebras System at your own risk, comment out the following lines from `run.py`:

```python
if not runconfig_params["compile_only"]:
    raise ValueError(
        "Running the Transformer model on the Cerebras System is in beta. "
        "Convergence is not guaranteed. Remove this exception to proceed."
    )
```
For assistance, contact support@cerebras.net.

To train this model on the Cerebras System at your own risk, comment out the following lines from the `run.py` of the model:

```python
if not runconfig_params["compile_only"]:
    raise ValueError(
        "Running the Transformer model on the Cerebras System is in beta. "
        "Convergence is not guaranteed. Remove this exception to proceed."
    )
```

For assistance, contact support@cerebras.net.
When running in `eval` mode on the CS-1 system, if the `--nodes` and `--tasks_per_node` value pairs are not set to one of the following, then the session may hang. This issue exists for both TensorFlow and PyTorch.

- `--nodes=1 --tasks_per_node=2`, or
- `--nodes=2 --tasks_per_node=1`
The `cs_ip` flag must include both the IP address and the port number of the CS system. Only the IP address, for example `--cs_ip 192.168.1.1`, will not be sufficient. You must also include the port number, for example: `--cs_ip 192.168.1.1:9000`.
`train` mode. To train this model on the Cerebras System at your own risk, edit the `run.py` file and comment out the entire `raise ValueError()` call, as shown below:

```python
elif runconfig_params["mode"] == TRAIN:
    # raise ValueError(
    #     "Training PyTorch models on the Cerebras System is in beta "
    #     "and is only validated with the default config provided in the "
    #     "Model Zoo. Remove this exception and use the provided config to "
    #     "proceed."
    # )
    runner.train(train_loader)
```
`train` mode. To train these models on the Cerebras System at your own risk, edit the `run.py` file and comment out the entire `raise ValueError()` call, as shown below:

```python
elif runconfig_params["mode"] == TRAIN:
    # raise ValueError(
    #     "Training PyTorch models on the Cerebras System is in beta "
    #     "and is only validated with the default configs provided in the "
    #     "Model Zoo. Remove this exception and use one of the provided "
    #     "configs to proceed."
    # )
    runner.train(train_loader)
```
`train` mode.

For assistance, contact support@cerebras.net.

`train` mode for Variable Sequence Length (VSL) on the CS system.

`train` mode for Variable Sequence Length (VSL) on the CS system.

`train` mode for Variable Sequence Length (VSL) on the CS system.
`srun_train` and `salloc_node`. See below:

- The `csrun_wse` script can be used to execute training, evaluation, and prediction on the CS system.
- The `csrun_cpu` script can be used to launch a given user command on a CPU, within the Cerebras Singularity container.
(`eval` and `eval_all`).

(`eval` and `eval_all`).
- `weight_decay` is set to a non-zero value, and
- `loss_scaling_factor` is not set to "dynamic".

Set `loss_scaling_factor` to "dynamic".

`max_seq_len` and the target `max_seq_len` are equal.
The `max_predictions_per_seq` parameter is set to an odd value and the following conditions are true:

Set the `max_predictions_per_seq` parameter to an even value.
For assistance, contact support@cerebras.net.

For assistance, contact support@cerebras.net.

Added `CerebrasEarlyStoppingHook`. Using this hook, you can terminate neural network training early based on custom logic.
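As a generic illustration of the concept using stock TensorFlow estimator APIs (this is not the `CerebrasEarlyStoppingHook` interface; see its documentation for the real signature), a hook that requests a stop once a monitored loss falls below a threshold:

```python
import tensorflow as tf

class ThresholdStoppingHook(tf.estimator.SessionRunHook):
    """Request a stop once the monitored loss tensor drops below a threshold."""

    def __init__(self, loss_tensor_name, threshold):
        self.loss_tensor_name = loss_tensor_name
        self.threshold = threshold

    def before_run(self, run_context):
        # Fetch the loss tensor by name on every step.
        loss = tf.compat.v1.get_default_graph().get_tensor_by_name(self.loss_tensor_name)
        return tf.estimator.SessionRunArgs(loss)

    def after_run(self, run_context, run_values):
        if run_values.results < self.threshold:
            run_context.request_stop()
```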
With `run.py`, you can run evaluation or prediction with your network as follows:

- Set `--mode eval` to use the evaluation feature.
- Set `--mode predict` to use the prediction feature.
Added support for `cbfloat16`. The CB16 is a floating-point format with a 6-bit exponent and a 9-bit explicit mantissa. This allows for double the dynamic range of FP16. See Control numerical precision level.
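A back-of-envelope check of the bit budget and why the dynamic range doubles (illustrative arithmetic only):

```python
sign, exponent, mantissa = 1, 6, 9
assert sign + exponent + mantissa == 16  # CB16 is still a 16-bit format

# FP16 uses a 5-bit exponent; one extra exponent bit in CB16 doubles the
# span of representable binary exponents (roughly 2**±15 -> 2**±31).
fp16_exponent_codes = 2 ** 5
cb16_exponent_codes = 2 ** 6
print(cb16_exponent_codes // fp16_exponent_codes)  # 2x the exponent range
```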
Called `analyze_input_fn_compile`, this tool provides a detailed log identifying any missing functions and provides recommendations on parameter values to enhance training performance on the CS system.
Added the Cerebras AUTOTUNE (`CS_AUTOTUNE`), which is similar to the TensorFlow `tf.data.AUTOTUNE`. When you are targeting the CS system, using `CS_AUTOTUNE` instead of `tf.data.AUTOTUNE` will result in a better specification of parameters such as:

- `num_parallel_calls`
- `cycle_length`
- `num_parallel_reads`
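A hedged sketch of where `CS_AUTOTUNE` would appear in a `tf.data` pipeline (the import path is hypothetical, and the file pattern and `parse_fn` are placeholders):

```python
import tensorflow as tf
from cerebras.tf import CS_AUTOTUNE  # hypothetical import path; check your release

files = tf.data.Dataset.list_files("train-*.tfrecord")  # placeholder pattern
ds = files.interleave(tf.data.TFRecordDataset, cycle_length=CS_AUTOTUNE)
ds = ds.map(parse_fn, num_parallel_calls=CS_AUTOTUNE)  # parse_fn defined elsewhere
# num_parallel_reads is used the same way, e.g.
# tf.data.TFRecordDataset(files, num_parallel_reads=CS_AUTOTUNE)
```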
A new function, `KerasModelToCerebrasEstimator`, is provided to convert a Keras model so that the model can be run using the `CerebrasEstimator`.
The `use_cs` parameter in the `CerebrasEstimator` interface is removed and will result in a compiler error if used in this API. The target hardware will now be automatically determined from a combination of the runtime configuration parameter `cs_ip` and the `use_cs` parameter setting in the method definitions for `train`.
`runconfig` section.
The `max_steps` parameter is added as a default parameter to control the duration of training.
`max_gradient_norm` hyperparameter value will not result in reduced incremental compile times.
Set `loss_scaling_factor` to a constant value either equal to or greater than 1.0.