Pretraining With Downstream Validation

On this page, you’ll build on the Pretraining with Upstream Validation guide. The example pretrains a Llama-3-8B model. For downstream validation, you’ll use two external frameworks: Eleuther Eval Harness (EEH) and BigCode Eval Harness (BCEH). By the end of this guide, you should be comfortable kicking off your own pretraining run for the model of your choice, combining both upstream and downstream validation.

Prerequisites

Before beginning this guide, make sure you’ve:

Completed setup and installation.
Read Trainer Essentials and Trainer Configuration, which cover the basics of running models in Model Zoo.
Read Pretraining with Upstream Validation, as this guide directly builds on the walkthrough there.
Read Downstream Validation Using Eleuther Eval Harness and Downstream Validation Using BigCode Eval Harness.

Configure the Run

Similar to Pretraining with Upstream Validation, this page will present the YAML configuration file as well as the equivalent pure Python setup side-by-side for your ease of comparison. You will add downstream validation to the pretraining configuration set up in Pretraining with Upstream Validation for Llama-3-8B. Recall the full configuration you put together from that tutorial:

trainer:
  init:
    backend:
      backend_type: CSX
      cluster_config:
        num_csx: 16
    seed: 2024
    model:
      # Embedding
      vocab_size: 128256
      hidden_size: 4096
      position_embedding_type: "rotary"
      pos_scaling_factor: 1.0
      rope_theta: 500000.0
      rotary_dim: 128
      share_embedding_weights: false
      max_position_embeddings: 8192
      embedding_dropout_rate: 0.0
      embedding_layer_norm: false

      # Decoder
      num_hidden_layers: 32
      dropout_rate: 0.0
      layer_norm_epsilon: 1.0e-5
      norm_type: "rmsnorm"

      # Decoder - Attention
      num_heads: 32
      attention_type: "scaled_dot_product"
      attention_module: "multiquery_attention"
      attention_dropout_rate: 0.0
      use_projection_bias_in_attention: false
      use_ffn_bias_in_attention: false
      extra_attention_params:
          num_kv_groups: 8

      # Decoder - ffn
      filter_size: 14336
      nonlinearity: "swiglu"
      use_ffn_bias: false

      # Task-specific
      use_bias_in_output: false
      loss_scaling: "num_tokens"
      loss_weight: 1.0

      # Initializer
      initializer_range: 0.02

      # Cerebras parameters
      mixed_precision: True
      fp16_type: "cbfloat16"

    optimizer:
      AdamW:
        betas: [0.9, 0.95]
        correct_bias: True
        weight_decay: 0.1

    schedulers:
    - CosineDecayLR:
        initial_learning_rate: 3.0e-5
        end_learning_rate: 3.0e-6
        total_iters: 528

    precision:
      fp16_type: cbfloat16
      loss_scaling_factor: dynamic
      max_gradient_norm: 1.0

    loop:
      num_steps: 10000
      eval_frequency: 1000
      eval_steps: 1000

    checkpoint:
      steps: 1000

    callbacks:
    - ComputeNorm: {}
    - CheckLoss: {}
    - ModelEvalMetrics: {}

    loggers:
    - ProgressLogger: {}
    - TensorBoardLogger: {}
  fit:
    train_dataloader:
      data_processor: GptHDF5MapDataProcessor
      data_dir: "/data/llama_v3_dataset_vocab128256/train"
      batch_size: 80
      micro_batch_size: 20
      shuffle: False
      shuffle_seed: 1337
      num_workers: 8
      prefetch_factor: 10
      persistent_workers: True # Important to avoid seeding at each epoch
    val_dataloader:
    - data_processor: GptHDF5MapDataProcessor
      data_dir: "/data/llama_v3_dataset_vocab128256/val"
      batch_size: 80
      micro_batch_size: 20
      shuffle: False
      shuffle_seed: 1337
      num_workers: 8
      prefetch_factor: 10
      persistent_workers: True # Important to avoid seeding at each epoch

Configure EEH

Let’s add downstream validation on a single EEH multiple-choice task, winogrande, as part of the pretraining run. To do this, augment the configuration with the EleutherEvalHarness callback as follows:

trainer:
  init:
    backend:  # CSX
      ...
    model:  # llama
      ...
    optimizer:  # AdamW
      ...
    schedulers:  # CosineDecayLR
      ...
    precision:  # DLS
      ...
    loop:
      ...
    checkpoint:
      ...
    callbacks:
      ...
      - EleutherEvalHarness:
          # Eleuther Eval Harness settings
          eeh_args:
            tasks: winogrande
            num_fewshot: 0
          # CSX-specific eval harness settings
          keep_data_dir: false
          # Dataloader settings
          batch_size: 4
          shuffle: false
          max_sequence_length: 8192
          num_workers: 1
          data_dir: <path_to_mounted_dir>
          tokenizer_file_path: <path_to_llama3_tokenizer_json_file>
          eos_id: 128001
          pretrained_model_name_or_path: null
    loggers:
      ...
    seed: 2024
    ...

As part of your pretraining run’s configuration, you have now set up downstream validation on EEH task winogrande.
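The callback above still contains `<...>` placeholders for `data_dir` and `tokenizer_file_path`. The following is a minimal sketch of a pre-launch check that flags any placeholder left unfilled. The helper `unfilled_placeholders` is hypothetical and not part of Model Zoo; it only inspects a plain dictionary that mirrors the callback settings.

```python
import re

# Hypothetical helper (not a Model Zoo API): matches values that are still
# an unfilled "<...>" placeholder such as "<path_to_mounted_dir>".
PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def unfilled_placeholders(settings, prefix=""):
    """Return dotted key paths whose value is still a "<...>" placeholder."""
    missing = []
    for key, value in settings.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested sections such as eeh_args.
            missing.extend(unfilled_placeholders(value, prefix=path + "."))
        elif isinstance(value, str) and PLACEHOLDER.match(value):
            missing.append(path)
    return missing


# Settings copied from the EleutherEvalHarness callback in this guide.
eeh_settings = {
    "eeh_args": {"tasks": "winogrande", "num_fewshot": 0},
    "keep_data_dir": False,
    "batch_size": 4,
    "shuffle": False,
    "max_sequence_length": 8192,
    "num_workers": 1,
    "data_dir": "<path_to_mounted_dir>",
    "tokenizer_file_path": "<path_to_llama3_tokenizer_json_file>",
    "eos_id": 128001,
    "pretrained_model_name_or_path": None,
}

print(unfilled_placeholders(eeh_settings))
# ['data_dir', 'tokenizer_file_path']
```

Running a check like this before submitting the job catches a forgotten path early, instead of partway through a run when the first downstream validation starts.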

The eval_frequency specified as part of the trainer’s loop (YAML) or in the TrainingLoop object (Python) also controls the frequency of downstream validation; i.e., for your example above, validation on EEH task winogrande will be run every 1K steps.
Update the tasks argument to configure downstream validation for more EEH tasks. Note that only a single generative EEH task may be specified per callback.
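To make the cadence concrete, here is a small sketch that lists the steps at which validation would run for the loop settings in this guide. It assumes validation fires whenever the global step is a multiple of `eval_frequency`; the function name is illustrative, and the trainer's exact behavior (for example, at the final step of a run whose length is not a multiple of `eval_frequency`) may differ.

```python
def validation_steps(num_steps, eval_frequency):
    """Steps at which validation runs, assuming it fires on every
    multiple of eval_frequency up to and including num_steps."""
    if eval_frequency <= 0:
        raise ValueError("eval_frequency must be a positive integer")
    return list(range(eval_frequency, num_steps + 1, eval_frequency))


# Values taken from the loop section of the configuration in this guide.
steps = validation_steps(num_steps=10000, eval_frequency=1000)
print(len(steps), steps[0], steps[-1])
# 10 1000 10000
```

Under that assumption, both upstream validation and the winogrande downstream validation run ten times over the 10,000-step run.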

Configure BCEH

Configuring downstream validation using BCEH works the same way as for EEH. For example, to run downstream validation on the generative coding task humaneval during pretraining, augment the configuration with the BigCodeEvalHarness callback as follows:

YAML: Simply add the callback to the list of callbacks in the YAML. Don’t forget to include the inference settings under model configuration!
Python: Construct a BigCodeEvalHarness callback object and pass it to the Trainer’s constructor as follows. Note that the BCEH arguments are passed to the callback via the BigCodeCLIArgs object, comprising the list of supported BCEH command line arguments.

trainer:
  init:
    backend:  # CSX
      ...
    model:  # llama
      ...
      # Inference Settings
      start_token: 128256   # Set to `vocab_size`
      stop_sequences: []    # Left empty as stop_sequences are overridden from the BCEH task
      max_tokens: 256       # Default from HF implementations
      loop_dim: 1
    optimizer:  # AdamW
      ...
    schedulers:  # CosineDecayLR
      ...
    precision:  # DLS
      ...
    loop:
      ...
    checkpoint:
      ...
    callbacks:
      ...
      - BigCodeEvalHarness:
          # BigCode Eval Harness settings
          bigcode_args:
            tasks: humaneval
          # CSX-specific eval harness settings
          keep_data_dir: false
          # Dataloader settings
          batch_size: 4
          shuffle: false
          max_sequence_length: 8192
          num_workers: 1
          data_dir: <path_to_mounted_dir>
          tokenizer_file_path: <path_to_llama3_tokenizer_json_file>
          eos_id: 128001
          pretrained_model_name_or_path: null
    loggers:
      ...
    seed: 2024
    ...

And that is all! As part of your pretraining run’s configuration, you have now set up downstream validation on BCEH task humaneval.

Since only running one generative eval harness task is supported per callback, please create a separate BigCodeEvalHarness callback to run downstream validation for more BCEH tasks.
To obtain the final eval metrics for BCEH, please run the code execution and evaluation flow separately using the Downstream Validation using BigCode Eval Harness guide.
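Because each callback handles one generative task, covering several BCEH tasks means repeating the callback entry with only the task name changed. The sketch below generates those entries as plain dictionaries that mirror the YAML structure. The helper `one_callback_per_task` is hypothetical and not a Model Zoo function; the `bigcode_args` key and the shared settings are copied from the humaneval example in this section.

```python
from copy import deepcopy


def one_callback_per_task(callback_name, args_key, tasks, shared_settings):
    """Build one callback entry per task, reusing the same shared settings."""
    entries = []
    for task in tasks:
        # Copy so that each entry can be edited independently later on.
        settings = deepcopy(shared_settings)
        settings[args_key] = {"tasks": task}
        entries.append({callback_name: settings})
    return entries


# Shared settings copied from the BigCodeEvalHarness callback in this guide.
shared = {
    "keep_data_dir": False,
    "batch_size": 4,
    "shuffle": False,
    "max_sequence_length": 8192,
    "num_workers": 1,
    "data_dir": "<path_to_mounted_dir>",
    "tokenizer_file_path": "<path_to_llama3_tokenizer_json_file>",
    "eos_id": 128001,
    "pretrained_model_name_or_path": None,
}

callbacks = one_callback_per_task(
    "BigCodeEvalHarness", "bigcode_args", ["humaneval", "mbpp"], shared
)
for entry in callbacks:
    print(entry["BigCodeEvalHarness"]["bigcode_args"]["tasks"])
# humaneval
# mbpp
```

Each resulting dictionary corresponds to one item in the `callbacks` list of the YAML configuration.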

Configure EEH and BCEH

Configuring downstream validation for both EEH and BCEH is also straightforward: use the EleutherEvalHarness and BigCodeEvalHarness callbacks together. Let’s augment the full YAML configuration file to run downstream validation on EEH tasks hellaswag, gsm8k and winogrande, and BCEH task mbpp with the callbacks as follows:

YAML: Simply add both callbacks to the list of callbacks in the YAML. Since you are running generative eval harness tasks, don’t forget to include the inference settings under model configuration!
Python: Construct EleutherEvalHarness and BigCodeEvalHarness objects, respectively, and pass them to the Trainer’s constructor.

trainer:
  init:
    backend:  # CSX
      ...
    model:  # llama
      ...
      # Inference Settings
      start_token: 128256   # Set to `vocab_size`
      stop_sequences: []    # Left empty as stop_sequences are overridden from the BCEH task
      max_tokens: 256       # Default from HF implementations
      loop_dim: 1
    optimizer:  # AdamW
      ...
    schedulers:  # CosineDecayLR
      ...
    precision:  # DLS
      ...
    loop:
      ...
    checkpoint:
      ...
    callbacks:
      ...
      - BigCodeEvalHarness:
          # BigCode Eval Harness settings
          bigcode_args:
            tasks: mbpp
          # CSX-specific eval harness settings
          keep_data_dir: false
          # Dataloader settings
          batch_size: 4
          shuffle: false
          max_sequence_length: 8192
          num_workers: 1
          data_dir: <path_to_mounted_dir>
          tokenizer_file_path: <path_to_llama3_tokenizer_json_file>
          eos_id: 128001
          pretrained_model_name_or_path: null
      - EleutherEvalHarness:
          # Eleuther Eval Harness settings
          eeh_args:
            tasks: hellaswag,gsm8k,winogrande
            num_fewshot: 0
          # CSX-specific eval harness settings
          keep_data_dir: false
          # Dataloader settings
          batch_size: 4
          shuffle: false
          max_sequence_length: 8192
          num_workers: 1
          data_dir: <path_to_mounted_dir>
          tokenizer_file_path: <path_to_llama3_tokenizer_json_file>
          eos_id: 128001
          pretrained_model_name_or_path: null
    loggers:
      ...
    seed: 2024
    ...

And that is all! As part of your pretraining run’s configuration, you have now set up downstream validation on both BCEH and EEH tasks.
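Generative tasks depend on the inference settings under the model configuration, which are easy to forget. As a minimal sketch, the check below verifies that the four inference keys used in this guide are present and that `start_token` equals `vocab_size`, as the inline comment in the configuration suggests. The helper name and the list of required keys are assumptions drawn from this page, not a Model Zoo validation API.

```python
# Inference keys as they appear under the model section in this guide.
REQUIRED_INFERENCE_KEYS = ("start_token", "stop_sequences", "max_tokens", "loop_dim")


def inference_setting_problems(model_cfg):
    """Return human-readable problems with the inference settings (empty if none)."""
    problems = [
        f"missing {key}" for key in REQUIRED_INFERENCE_KEYS if key not in model_cfg
    ]
    # The configuration comment says start_token is set to vocab_size.
    if "start_token" in model_cfg and model_cfg["start_token"] != model_cfg.get("vocab_size"):
        problems.append("start_token should equal vocab_size")
    return problems


# Values copied from the model section of the configuration in this guide.
model_cfg = {
    "vocab_size": 128256,
    "start_token": 128256,
    "stop_sequences": [],
    "max_tokens": 256,
    "loop_dim": 1,
}

print(inference_setting_problems(model_cfg))
# []
print(inference_setting_problems({"vocab_size": 128256}))
# ['missing start_token', 'missing stop_sequences', 'missing max_tokens', 'missing loop_dim']
```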

Start Pretraining

Once you have a fully configured Trainer, with your choice of downstream validation, all you need to do now is to kick off the run and start pretraining.

YAML: Let’s assume that the YAML configuration that you put together above is written to a file called ./pretrain_downstream_llama_8b.yaml. To run pretraining, pass that file to the cszoo fit CLI command.
Python: Let’s assume that the Python code that you put together above is written to a file called ./pretrain_downstream_llama_8b.py. To run pretraining, execute that Python script.

cszoo fit ./pretrain_downstream_llama_8b.yaml
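If you prefer to launch the run from a script, for example as part of a larger workflow, the command above can be wrapped as in the following sketch. It uses only the Python standard library and the `cszoo fit` invocation shown in this guide; the function names are illustrative, and the checks (file exists, `cszoo` is on the PATH) are assumptions about what is worth verifying before launch.

```python
import shutil
import subprocess
from pathlib import Path


def build_fit_command(config_path):
    """Return the argument list for the `cszoo fit` command shown in this guide."""
    return ["cszoo", "fit", str(config_path)]


def launch(config_path):
    """Run pretraining after two basic pre-flight checks."""
    if not Path(config_path).is_file():
        raise FileNotFoundError(f"Configuration file not found: {config_path}")
    if shutil.which("cszoo") is None:
        raise RuntimeError("cszoo was not found on PATH; is the environment activated?")
    # check=True raises CalledProcessError if the run exits with an error code.
    return subprocess.run(build_fit_command(config_path), check=True)


print(" ".join(build_fit_command("./pretrain_downstream_llama_8b.yaml")))
# cszoo fit ./pretrain_downstream_llama_8b.yaml
```

Calling `launch("./pretrain_downstream_llama_8b.yaml")` would then start the run from an environment where Model Zoo is installed.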
