This page walks through the steps for performing downstream validation on the Cerebras Wafer-Scale cluster using EleutherAI’s Evaluation Harness (EEH). EEH is a popular framework for evaluating large language models across a wide range of datasets and tasks.

While you can configure EEH as part of your training run (see Pretraining with Downstream Validation), this guide covers running EEH standalone using the run script from the Cerebras Model Zoo.

The examples in this guide will perform downstream validation on LLaMA3 8B.

By the end of this guide, you will be able to leverage the EEH framework to perform standalone downstream validation on your models on CS-X.

Prerequisites

Please ensure that you have installed the Cerebras Model Zoo package by going through the installation guide. Note that the EEH version tested and packaged in the Cerebras Model Zoo is the official release v0.4.2.

Please also read through the Trainer Overview and Trainer Configuration Overview, as these guides will help you understand how to configure a standalone EEH run.

This guide configures the downstream EEH run using V2 YAML. While release 2.3 includes support for legacy V1 YAML, please convert the configuration to V2 using convert_legacy_params_to_trainer_params from the script src/cerebras/modelzoo/trainer/utils.py.

Configure the Run

This section covers the required steps for setting up an EEH run to perform standalone downstream validation on various tasks.

In particular, you will need to write a YAML configuration file to configure an instance of the Trainer callback.

The example in this section configures evaluation for LLaMA3 8B via the multiple choice (non-generative) eval harness task winogrande using a single CSX.

If you aren’t interested in seeing the breakdown of the configuration, feel free to skip ahead to the Putting it All Together section to see the full YAML configuration.

Configure the CSX Backend

The first step is to specify the CSX backend and resources required for the run.

Please create a YAML configuration file with the following cluster config:

trainer:
  init:
    backend:
      backend_type: CSX
      cluster_config:
        num_csx: 1

This example uses a single CSX, but you can readily update num_csx to run EEH on multiple CSXs for improved performance.

Configure the Model

Next, please add the following model configuration in the YAML for LLaMA3 8B with 8K context length:

trainer:
  init:
    backend:  # CSX
      ...
    model:
      name: llama # This setting is required

      # Embedding
      vocab_size: 128256
      hidden_size: 4096
      position_embedding_type: "rotary"
      pos_scaling_factor: 1.0
      rope_theta: 500000.0
      rotary_dim: 128
      share_embedding_weights: false
      max_position_embeddings: 8192
      embedding_dropout_rate: 0.0
      embedding_layer_norm: false

      # Decoder
      num_hidden_layers: 32
      dropout_rate: 0.0
      layer_norm_epsilon: 1.0e-5
      norm_type: "rmsnorm"

      # Decoder - Attention
      num_heads: 32
      attention_type: "scaled_dot_product"
      attention_module: "multiquery_attention"
      attention_dropout_rate: 0.0
      use_projection_bias_in_attention: false
      use_ffn_bias_in_attention: false
      extra_attention_params:
        num_kv_groups: 8

      # Decoder - ffn
      filter_size: 14336
      nonlinearity: "swiglu"
      use_ffn_bias: false

      # Task-specific
      use_bias_in_output: false
      loss_scaling: "num_tokens"
      loss_weight: 1.0

      # Initializer
      initializer_range: 0.02
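As a quick sanity check on the attention settings above, you can work out the per-head sizes implied by hidden_size, num_heads, and num_kv_groups. The sketch below is illustrative only: the helper name attention_shape is hypothetical and not part of the Model Zoo, and the observation that rotary_dim matches the per-head dimension applies to this LLaMA3 8B configuration rather than being a documented rule.

```python
def attention_shape(hidden_size, num_heads, num_kv_groups):
    """Derive per-head sizes from model config values (hypothetical helper)."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    if num_heads % num_kv_groups != 0:
        raise ValueError("num_heads must be divisible by num_kv_groups")
    head_dim = hidden_size // num_heads           # width of each attention head
    heads_per_group = num_heads // num_kv_groups  # query heads sharing one K/V group
    return head_dim, heads_per_group


# Values taken from the LLaMA3 8B configuration above.
head_dim, heads_per_group = attention_shape(
    hidden_size=4096, num_heads=32, num_kv_groups=8
)
print(head_dim, heads_per_group)  # 128 4
```

Here the per-head dimension of 128 agrees with the rotary_dim value in the configuration.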

To run the downstream validation harness, you must specify the name setting in the model configuration. Valid names corresponding to the supported models include:

  • btlm

  • bloom

  • gpt2

  • gptj

  • falcon

  • gpt3

  • gpt-neox

  • llama

  • mistral

  • mpt

  • jais

  • santacoder

  • starcoder
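If you want to catch a missing or misspelled name before submitting a job, a small pre-flight check along the following lines may help. This is a minimal sketch, not part of the Model Zoo: the function check_model_name is hypothetical, and it assumes the YAML has already been loaded into a nested dictionary with the trainer.init.model layout shown above.

```python
# Names copied from the list of supported models above.
SUPPORTED_MODEL_NAMES = {
    "btlm", "bloom", "gpt2", "gptj", "falcon", "gpt3", "gpt-neox",
    "llama", "mistral", "mpt", "jais", "santacoder", "starcoder",
}


def check_model_name(params):
    """Return the model name, or raise if it is missing or unsupported.

    `params` is assumed to be the YAML config loaded as a nested dict.
    """
    model_cfg = params.get("trainer", {}).get("init", {}).get("model", {})
    name = model_cfg.get("name")
    if name is None:
        raise ValueError("trainer.init.model.name is required")
    if name not in SUPPORTED_MODEL_NAMES:
        raise ValueError(f"unsupported model name: {name!r}")
    return name


params = {"trainer": {"init": {"model": {"name": "llama", "hidden_size": 4096}}}}
print(check_model_name(params))  # llama
```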

Configure the EEH Callback

EEH is implemented as an extension to the Trainer callback.

Add the following section in the YAML to set up the EleutherEvalHarness callback:

trainer:
  init:
    backend: # CSX
      ...
    model: # Llama3-8B
      name: llama # This setting is required
      ...
    callbacks:
    - EleutherEvalHarness:
        # Eleuther Eval Harness settings (also exposed via CLI)
        eeh_args:
          tasks: winogrande
          num_fewshot: 0
        # CSX-specific eval harness settings (also exposed via CLI)
        keep_data_dir: false
        # Dataloader settings
        batch_size: 4
        shuffle: false
        max_sequence_length: 8192
        num_workers: 1
        data_dir: <path_to_mounted_dir>
        tokenizer_file_path: <path_to_llama3_tokenizer_json_file>
        eos_id: 128001
        pretrained_model_name_or_path: null
        # Eval Harness Flags
        flags:
          csx.performance.micro_batch_size: null

The eeh_args section exposes the following settings to configure the EEH run:

| Eleuther Eval Harness CLI Argument | Description |
| --- | --- |
| --tasks | Comma-separated string specifying Eleuther Eval Harness tasks. To get the full list of tasks, use the command lm-eval --tasks list from within your Python venv. |
| --num_fewshot | Number of examples to be added to the fewshot context string. Defaults to 0. |
| --output_path | The path to the output file where the result metrics will be saved. If the path is a directory and log_samples is true, the results will be saved in the directory. Else the parent directory will be used. |
| --limit | Accepts an integer, or a float between 0.0 and 1.0. This limits the number of documents to evaluate per task to the first X documents (if an integer) or the first X% of documents (if a float). This is useful for debugging. |
| --use_cache | A path to a sqlite db file for caching model responses. None if not caching. |
| --cache_requests {true,refresh,delete} | Speed up evaluation by caching the building of dataset requests. None if not caching. |
| --check_integrity | Whether to run the relevant part of the test suite for the tasks. |
| --write_out | Prints the prompt for the first few documents. Defaults to False. |
| --log_samples | If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis. Defaults to False. |
| --show_config | If True, shows the full config of all tasks at the end of the evaluation. Defaults to False. |
| --include_path | Additional path to include if there are external tasks to include. |
| --predict_only | Use with --log_samples. Only model outputs will be saved and metrics will not be evaluated. |
| --seed | Set seed for Python’s random, numpy and torch. |
| --temperature | Sampling temperature used for generation (autoregressive, generate_until tasks only). |
| --top_p | Top-p parameter used for nucleus sampling (autoregressive, generate_until tasks only). |
| --top_k | Top-k parameter used for generation (autoregressive, generate_until tasks only). |
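To make the two forms of --limit concrete, the sketch below shows one way the value could translate into a document count. It is an illustration under stated assumptions, not the harness's actual code: the function name docs_to_evaluate is hypothetical, and rounding a fractional limit up to the next whole document is an assumption.

```python
import math


def docs_to_evaluate(num_docs, limit=None):
    """Return how many documents a task would evaluate for a given limit.

    Hypothetical helper: an integer limit is an absolute count, while a
    float strictly between 0.0 and 1.0 is a fraction of the documents.
    """
    if limit is None:
        return num_docs  # no limit: evaluate everything
    if isinstance(limit, float) and 0.0 < limit < 1.0:
        count = math.ceil(num_docs * limit)  # rounding up is an assumption
    else:
        count = int(limit)
    return min(count, num_docs)  # never more than the task provides


print(docs_to_evaluate(1000, 0.25))  # 250
print(docs_to_evaluate(1000, 50))    # 50
print(docs_to_evaluate(1000))        # 1000
```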

You can either specify the settings here or pass them via CLI arguments to the standalone EEH run script.

The callback configuration also accepts dataloader settings that you must specify in the YAML to set up input data preprocessing for the run:

| DataLoader Setting | Description |
| --- | --- |
| data_dir | This setting is required. Provide a path to the mounted directory visible to the worker containers where eval harness task data samples are dumped after preprocessing. Use the mount_dirs argument to specify a dir mount, similar to our existing flows. |
| tokenizer_file_path | Path to a custom tokenizer (JSON) file. If you provide a custom tokenizer, then you must also specify eos_id; otherwise, you must provide a pretrained tokenizer from Hugging Face in pretrained_model_name_or_path. |
| pretrained_model_name_or_path | Hugging Face (HF) pretrained model name or path. This setting is required if you do not specify tokenizer_file_path. For a detailed description, see HF AutoTokenizers. |
| eos_id | End-of-sentence (eos) token ID to signal the termination of a sequence. This setting is required if you specify a custom tokenizer in tokenizer_file_path. You can set this by looking for the ID corresponding to the eos token in the custom tokenizer JSON file. |
| max_sequence_length | Maximum length of the input sequence. This setting is required for preprocessing input data samples from the specified eval harness tasks. You should align the max_sequence_length field to the max_position_embeddings value in the model configuration of the YAML. If you don’t specify max_sequence_length, the flow defaults to this max_position_embeddings setting. |
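For the eos_id setting, the lookup in the tokenizer JSON file can be scripted. The sketch below assumes the file follows the Hugging Face tokenizers layout, with an added_tokens list and a model.vocab mapping. The helper find_token_id is hypothetical, the token string <|end_of_text|> is an assumption about how your tokenizer names its eos token, and the tiny sample file exists only to keep the example self-contained.

```python
import json
import os
import tempfile


def find_token_id(tokenizer_json_path, token):
    """Look up a token's ID in a tokenizer JSON file (hypothetical helper)."""
    with open(tokenizer_json_path, encoding="utf-8") as f:
        data = json.load(f)
    # Special tokens are usually listed under "added_tokens".
    for entry in data.get("added_tokens", []):
        if entry.get("content") == token:
            return entry["id"]
    # Fall back to the main vocabulary when it is stored as a mapping.
    vocab = data.get("model", {}).get("vocab", {})
    if isinstance(vocab, dict) and token in vocab:
        return vocab[token]
    return None


# Tiny stand-in for a real tokenizer file, for illustration only.
sample = {
    "added_tokens": [{"id": 128001, "content": "<|end_of_text|>"}],
    "model": {"vocab": {"hello": 0}},
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokenizer.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    eos_id = find_token_id(path, "<|end_of_text|>")
    vocab_id = find_token_id(path, "hello")
    missing = find_token_id(path, "<no_such_token>")

print(eos_id)  # 128001
```

With a real tokenizer file, you would pass its path and the eos token string used by your model, then copy the returned ID into eos_id.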

Additionally, you may optionally specify the following CSX-specific eval harness setting:

  • keep_data_dir: Use this to preserve the preprocessed eval harness task data samples, i.e. the directory specified under data_dir. Defaults to False, i.e. data samples are deleted after the run.
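The rules for the dataloader settings and keep_data_dir can be summarized as a small validation routine. This is a minimal sketch of the rules as described in this guide, not the callback's actual implementation: the function resolve_eval_settings is hypothetical, and the paths in the example are placeholders.

```python
def resolve_eval_settings(callback_cfg, model_cfg):
    """Validate callback settings and fill in the documented defaults.

    Hypothetical helper mirroring the rules described in this guide.
    """
    cfg = dict(callback_cfg)
    if not cfg.get("data_dir"):
        raise ValueError("data_dir is required")
    if cfg.get("tokenizer_file_path"):
        # A custom tokenizer must come with an explicit eos_id.
        if cfg.get("eos_id") is None:
            raise ValueError("eos_id is required with tokenizer_file_path")
    elif not cfg.get("pretrained_model_name_or_path"):
        raise ValueError(
            "set tokenizer_file_path or pretrained_model_name_or_path"
        )
    # max_sequence_length falls back to the model's max_position_embeddings.
    if cfg.get("max_sequence_length") is None:
        cfg["max_sequence_length"] = model_cfg["max_position_embeddings"]
    # Preprocessed task data is deleted after the run unless requested.
    cfg.setdefault("keep_data_dir", False)
    return cfg


resolved = resolve_eval_settings(
    {
        "data_dir": "/placeholder/mounted_dir",
        "tokenizer_file_path": "/placeholder/tokenizer.json",
        "eos_id": 128001,
    },
    {"max_position_embeddings": 8192},
)
print(resolved["max_sequence_length"], resolved["keep_data_dir"])  # 8192 False
```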

(Optional) Configure HuggingFace (HF) Cache Directory

EEH utilizes HF’s APIs to download task data and other configurations. This data is by default cached under $HOME/.cache/huggingface.

However, you may choose to specify a different directory for this cached data via the HFCacheDir callback:

trainer:
  init:
    backend: # CSX
      ...
    model: # Llama3-8B
      ...
    callbacks:
    - EleutherEvalHarness:
        ...
    - HFCacheDir:
        cache_dir: <path_to_directory_for_caching_HF_data>

Putting it All Together

Here’s what the full YAML configuration looks like once you follow this guide for configuring the individual pieces:

trainer:
  init:
    backend:
      backend_type: CSX
      cluster_config:
        num_csx: 1
        mount_dirs: <path(s)_to_mount_to_appliance_containers>
    model:
      name: llama

      # Embedding
      vocab_size: 128256
      hidden_size: 4096
      position_embedding_type: "rotary"
      pos_scaling_factor: 1.0
      rope_theta: 500000.0
      rotary_dim: 128
      share_embedding_weights: false
      max_position_embeddings: 8192
      embedding_dropout_rate: 0.0
      embedding_layer_norm: false

      # Decoder
      num_hidden_layers: 32
      dropout_rate: 0.0
      layer_norm_epsilon: 1.0e-5
      norm_type: "rmsnorm"

      # Decoder - Attention
      num_heads: 32
      attention_type: "scaled_dot_product"
      attention_module: "multiquery_attention"
      attention_dropout_rate: 0.0
      use_projection_bias_in_attention: false
      use_ffn_bias_in_attention: false
      extra_attention_params:
        num_kv_groups: 8

      # Decoder - ffn
      filter_size: 14336
      nonlinearity: "swiglu"
      use_ffn_bias: false

      # Task-specific
      use_bias_in_output: false
      loss_scaling: "num_tokens"
      loss_weight: 1.0

      # Initializer
      initializer_range: 0.02
    callbacks:
    - EleutherEvalHarness:
        eeh_args:
          tasks: winogrande
          num_fewshot: 0
        keep_data_dir: false
        # Dataloader settings
        batch_size: 4
        shuffle: false
        max_sequence_length: 8192
        num_workers: 1
        data_dir: <path_to_mounted_dir>
        tokenizer_file_path: <path_to_llama3_tokenizer_json_file>
        eos_id: 128001
        pretrained_model_name_or_path: null
        # Eval Harness Flags
        flags:
          csx.performance.micro_batch_size: null

Running EEH on CS-X

Now that the YAML configuration is complete, you will use the run script src/cerebras/modelzoo/common/run_eleuther_eval_harness.py from the Cerebras Model Zoo to run EEH on various tasks.

This script accepts the following command line interface (CLI) arguments:

python run_eleuther_eval_harness.py CSX [-h] [--tasks task1,task2] [--num_fewshot N] [--output_path DIR|DIR/file.json] [--limit N|0<N<1] [--use_cache DIR]
                                  [--cache_requests {true,refresh,delete}] [--check_integrity] [--write_out] [--log_samples] [--show_config] [--include_path DIR]
                                  [--predict_only] [--seed SEED] [--trust_remote_code] [--temperature TEMPERATURE] [--top_p TOP_P] [--top_k TOP_K]
                                  [--keep_data_dir]
                                  -p PARAMS [-m {eval}] [-o MODEL_DIR] [--checkpoint_path CHECKPOINT_PATH]
                                  [--disable_strict_checkpoint_loading] [--load_checkpoint_states LOAD_CHECKPOINT_STATES] [--logging LOGGING]
                                  [--wsc_log_level WSC_LOG_LEVEL [WSC_LOG_LEVEL ...]] [--max_steps MAX_STEPS] [--eval_steps EVAL_STEPS] [--config CONFIG]
                                  [--compile_only | --validate_only] [--num_workers_per_csx NUM_WORKERS_PER_CSX] [-c COMPILE_DIR]
                                  [--job_labels JOB_LABELS [JOB_LABELS ...]] [--job_priority {p1,p2,p3}] [--debug_args_path DEBUG_ARGS_PATH]
                                  [--mount_dirs MOUNT_DIRS [MOUNT_DIRS ...]] [--python_paths PYTHON_PATHS [PYTHON_PATHS ...]]
                                  [--credentials_path CREDENTIALS_PATH] [--mgmt_address MGMT_ADDRESS] [--job_time_sec JOB_TIME_SEC] [--disable_version_check]
                                  [--num_csx NUM_CSX] [--num_wgt_servers NUM_WGT_SERVERS] [--num_act_servers NUM_ACT_SERVERS]
                                  [--debug_args [DEBUG_ARGS [DEBUG_ARGS ...]]] [--ini [INI [INI ...]]] [--transfer_processes TRANSFER_PROCESSES]
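If you launch runs from a Python wrapper, you may find it convenient to assemble the command as an argument list. The sketch below uses only flags that appear in the usage text above. The helper build_eeh_command is hypothetical, and all paths are placeholders that you would replace with your own.

```python
import shlex


def build_eeh_command(params_path, tasks, checkpoint_path=None,
                      mount_dirs=(), python_paths=()):
    """Assemble an argument list for the EEH run script (hypothetical helper)."""
    cmd = [
        "python", "run_eleuther_eval_harness.py", "CSX",
        "--params", params_path,
        "--tasks", ",".join(tasks),  # tasks are passed comma-separated
    ]
    if checkpoint_path:
        cmd += ["--checkpoint_path", checkpoint_path]
    if mount_dirs:
        cmd += ["--mount_dirs", *mount_dirs]
    if python_paths:
        cmd += ["--python_paths", *python_paths]
    return cmd


cmd = build_eeh_command(
    params_path="/placeholder/params.yaml",
    tasks=["winogrande", "arc_easy"],
    checkpoint_path="/placeholder/checkpoint.mdl",
    mount_dirs=["/placeholder/mounted_dir"],
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True), which avoids shell quoting issues when paths contain spaces.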

  1. We support a subset of Eleuther’s command line interface (CLI) arguments above. For a more detailed description of these supported arguments, see https://github.com/EleutherAI/lm-evaluation-harness/blob/v0.4.0/docs/interface.md.

  2. You may also specify these arguments in the YAML under the eeh_args key of the EleutherEvalHarness configuration, but please note that CLI settings override the corresponding settings in the YAML.

  3. The --params CLI argument is required. Use it to specify the path to the YAML configuration file. Note that while we do support the old V1 YAML specification, it will soon be deprecated so we recommend using the new V2 YAML.

  4. Use the --checkpoint_path CLI argument to specify the path to the checkpoint file to load model weights from. If a checkpoint path is not provided, we support checkpoint autoloading in this flow such that the latest checkpoint file will be picked up from the specified model_dir.
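The precedence between YAML and CLI settings described in the notes above can be pictured as a simple merge. This is a minimal sketch of that behavior, not the run script's code: the function merge_eeh_args is hypothetical, and it assumes that arguments not passed on the command line are represented as None.

```python
def merge_eeh_args(yaml_eeh_args, cli_args):
    """Combine YAML eeh_args with CLI values, letting the CLI win.

    Hypothetical helper: CLI arguments that were not passed are
    assumed to be None and leave the YAML value untouched.
    """
    merged = dict(yaml_eeh_args)
    for key, value in cli_args.items():
        if value is not None:
            merged[key] = value
    return merged


yaml_eeh_args = {"tasks": "winogrande", "num_fewshot": 0}
cli_args = {"tasks": "arc_easy", "num_fewshot": None, "limit": 10}
merged = merge_eeh_args(yaml_eeh_args, cli_args)
print(merged)  # {'tasks': 'arc_easy', 'num_fewshot': 0, 'limit': 10}
```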

Supported Tasks

  • As of our 2.4.0 release, we support lm_eval@v0.4.5.

  • You may perform downstream validation on all EEH tasks with output_type: loglikelihood or output_type: multiple_choice in the task specification. See asdiv and arc_easy for respective examples. You may specify each of these types of tasks separately or together in a single EleutherEvalHarness callback.

  • We currently do not support eval harness tasks with output_type: loglikelihood_rolling. These unsupported tasks are pile and wikitext for the supported EEH version 0.4.2.

    • agieval

    • babi

    • bbh

    • bbh_cot_fewshot

    • bbh_cot_zeroshot

    • bbh_fewshot

    • bbh_zeroshot

    • bigbench

    • codexglue_code2text

    • math_word_problems

    • mgsm_direct

    • mgsm_cot_native

    • mmlu_flan_cot_fewshot

    • mmlu_flan_cot_zeroshot

    • mmlu_flan_n_shot_generative

    • polemo2

    • super-glue-lm-eval-v1

    • unscramble

Adding New Tasks

Please refer to Eleuther’s new task implementation guide to add new tasks.

Limitations

  • We currently do not support running multiple generative eval harness tasks in the same callback.

  • EEH task groups, such as agieval, comprise multiple generative subtasks that you will have to configure in the YAML via separate callbacks.

  • Please turn on grad accumulation and choose a small micro batch size (between 16 and 32) under the flags configuration of the EleutherEvalHarness callback of the YAML.
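Because each generative task needs its own callback, you may want to generate the callbacks list rather than write it by hand. The sketch below builds one EleutherEvalHarness entry per task from shared settings. The helper callbacks_for_generative_tasks is hypothetical, the task names are placeholders rather than real EEH task names, and the micro batch size of 16 is just one value from the suggested range.

```python
import copy


def callbacks_for_generative_tasks(tasks, shared_settings, micro_batch_size=None):
    """Build one EleutherEvalHarness callback entry per generative task.

    Hypothetical helper; the returned list mirrors the structure of the
    callbacks section in the YAML.
    """
    callbacks = []
    for task in tasks:
        settings = copy.deepcopy(shared_settings)  # keep entries independent
        settings["eeh_args"] = {**settings.get("eeh_args", {}), "tasks": task}
        if micro_batch_size is not None:
            flags = settings.setdefault("flags", {})
            flags["csx.performance.micro_batch_size"] = micro_batch_size
        callbacks.append({"EleutherEvalHarness": settings})
    return callbacks


shared = {
    "eeh_args": {"num_fewshot": 0},
    "data_dir": "/placeholder/mounted_dir",
    "max_sequence_length": 8192,
}
callbacks = callbacks_for_generative_tasks(
    ["<generative_task_1>", "<generative_task_2>"], shared, micro_batch_size=16
)
for entry in callbacks:
    print(entry["EleutherEvalHarness"]["eeh_args"]["tasks"])
```

The resulting list can be serialized back into the callbacks section of the YAML with the YAML library of your choice.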

Conclusion

In summary, by following this guide you have run standalone downstream validation for the Llama3-8B model on EEH evaluation tasks.

You should now be comfortable configuring more downstream EEH runs on your model of choice with additional eval harness tasks.

What’s next?

To run downstream validation on code generation tasks, please check out the corresponding guide.

You can also perform downstream validation using EEH as part of your pretraining runs with upstream validation. Check out the Pretraining with Downstream Validation guide.