Example (Single Non-generative Task)
Suppose the YAML configuration above is saved as `./llama3_8B_eeh.yaml`. Then, to run evaluation for the task `winogrande`, please set up a bash script as follows:
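The original script is not reproduced here, so the following is a minimal sketch that uses only the CLI flags documented later on this page; the checkpoint path is a placeholder, and additional cluster-specific arguments (e.g. mount directories) may be required in practice.

```bash
#!/usr/bin/env bash
# Sketch: run the winogrande task using the EEH run script.
# Paths are placeholders; adjust for your environment.

PARAMS=./llama3_8B_eeh.yaml
CKPT=/path/to/llama3_8b_checkpoint.mdl

python src/cerebras/modelzoo/common/run_eleuther_eval_harness.py \
    --params "${PARAMS}" \
    --checkpoint_path "${CKPT}" \
    --tasks winogrande \
    --num_fewshot 0
```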
Example (Multiple Non-generative Tasks)
Example (Generative Task)
Generative eval harness tasks are those that specify `output_type: generate_until` in the task specification, such as `triviaqa` or `drop`. Refer to the task `triviaqa` for an example specification from the official EEH repository.

In order to run generative inference on CSX, you must specify the following inference settings in the model config of the YAML file:

`start_token`
- ID of the special token that indicates where to start inferring for each sample, as described above. You may specify a list of token IDs instead of a single ID. If you do, the model will start inference at the first token that matches any one of the provided IDs. The model will pad inferred predictions with the first ID in the list.

`stop_sequences`
- List of sequences (each one being a list of token IDs). If any one of these sequences is emitted by the model, inference will stop for that sample. For example, suppose you would like to stop inferring after either a newline character (e.g. token ID 1), or a combination of a period (e.g. token ID 2) followed by a space (e.g. token ID 3). In this case, set `stop_sequences` to [[1], [2, 3]]. To stop inferring after seeing a newline character only, set `stop_sequences` to [[1]]. To disable this feature, set `stop_sequences` to an empty list [].

Additionally, the following optional parameters may be set:

`max_tokens`
- Maximum number of tokens to infer for each sample.

`loop_dim`
- Indicates the sequence dimension in the input and output data. Default value is 1. If set to 0, it indicates that both the input and output data are transposed (i.e. sequence x samples instead of samples x sequence).
For example, modify the model config in `./llama3_8B_eeh.yaml` to add these inference settings:
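The original YAML snippet is not reproduced here; the sketch below shows what such settings might look like, assuming a vocab_size of 128256 for LLaMA3 8B and using illustrative values for the optional fields.

```yaml
model:
  # ... existing LLaMA3 8B model settings ...
  start_token: 128256     # assumed equal to vocab_size, so the model never generates it
  stop_sequences: []      # placeholder; overridden per task (see the note below)
  max_tokens: 256         # optional: cap on tokens inferred per sample
  loop_dim: 1             # optional: sequence dimension of the input/output data (default 1)
```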
For `start_token`, it is ideal to choose a value that is not going to be generated by the model, i.e. `vocab_size` in the example above.
Note that generative tasks specify the stop tokens for `stop_sequences` under the setting `generation_kwargs.until` of the task spec. For instance, `triviaqa` specifies "\n", ".", and "," as the stop tokens. The EEH flow will internally override the `stop_sequences` config with the value from the task, so you can also specify an arbitrary value in the YAML.
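For illustration, the relevant portion of such a task specification looks roughly like the following (paraphrased; refer to the official EEH repository for the exact `triviaqa` config):

```yaml
# Paraphrased excerpt of a generate_until task spec (e.g. triviaqa)
output_type: generate_until
generation_kwargs:
  until:
    - "\n"
    - "."
    - ","
```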
You may optionally pass sampling parameters such as `temperature`, `top_k`, or `top_p` either to the bash script or under `eeh_args` of the YAML. For example:
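A hedged sketch of the YAML variant (the nesting of `eeh_args` under the callback is assumed; the values are illustrative):

```yaml
eeh_args:
  tasks: triviaqa
  temperature: 0.7
  top_k: 10
```

Equivalently, the same values can be passed to the launch script via the `--temperature` and `--top_k` CLI flags listed in the CLI arguments table below.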
Example (Non-generative and Generative Tasks)
Run Multiple Generative Tasks
To convert a legacy (V1) YAML into the new format, you can use the helper function `convert_legacy_params_to_trainer_params` from the script `src/cerebras/modelzoo/trainer/utils.py`. Run configuration is handled via the `Trainer` callback.
The example in this section configures evaluation for LLaMA3 8B
via the multiple choice (non-generative) eval harness task winogrande
using a single CSX.
If you aren't interested in seeing the breakdown of the configuration, feel free to skip ahead to the Putting it All Together section to see the full YAML configuration.
Note that you can increase the `num_csx` setting to run EEH on multiple CSXs for improved performance.
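For instance, a sketch of that setting (the surrounding backend/cluster nesting is an assumption and may differ in your YAML):

```yaml
backend:
  cluster_config:
    num_csx: 2   # run EEH across two CSX systems
```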
Choose the model via the `name` setting in the model configuration. Valid names corresponding to the supported models include:

- btlm
- bloom
- gpt2
- gptj
- falcon
- gpt3
- gpt-neox
- llama
- mistral
- mpt
- jais
- santacoder
- starcoder
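For example, for the LLaMA3 8B model used in this example (the exact placement of the model block depends on your YAML layout):

```yaml
model:
  name: llama
  # ... remaining LLaMA3 8B model parameters ...
```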
EEH is set up as a callback to the `Trainer`. Add the following section in the YAML to set up the `EleutherEvalHarness` callback:
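The original snippet is not reproduced here; the following is a minimal sketch, assuming the standard Trainer YAML nesting, with placeholder paths and only the settings described in the tables below.

```yaml
trainer:
  init:
    # ... backend, model, and other Trainer settings ...
    callbacks:
      - EleutherEvalHarness:
          # Eleuther Eval Harness settings (see the CLI arguments table below)
          eeh_args:
            tasks: winogrande
            num_fewshot: 0
          # CSX-specific eval harness setting
          keep_data_dir: false
          # Dataloader settings (see the DataLoader Settings table below)
          data_dir: /path/to/mounted/data_dir          # placeholder mount path
          pretrained_model_name_or_path: meta-llama/Meta-Llama-3-8B  # HF tokenizer source (placeholder)
          max_sequence_length: 8192                    # align with max_position_embeddings
```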
The `eeh_args` section exposes the following settings to configure the EEH run:
Eleuther Eval Harness CLI Arguments | Description |
---|---|
--tasks | Comma-separated string specifying Eleuther Eval Harness tasks. To get the full list of tasks, use the command lm-eval --tasks list from within your python venv. |
--num_fewshot | Number of examples to be added to the fewshot context string. Defaults to 0. |
--output_path | The path to the output file where the result metrics will be saved. If the path is a directory and log_samples is true, the results will be saved in the directory. Else the parent directory will be used. |
--limit | Accepts an integer, or a float between 0.0 and 1.0. This limits the number of documents to evaluate per task to the first X documents (if an integer) or first X% of documents. This is useful for debugging. |
--use_cache | A path to a sqlite db file for caching model responses. None if not caching. |
--cache_requests {true,refresh,delete} | Speed up evaluation by caching the building of dataset requests. None if not caching. |
--check_integrity | Whether to run the relevant part of the test suite for the tasks. |
--write_out | Prints the prompt for the first few documents. Defaults to False. |
--log_samples | If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis. Defaults to False. |
--show_config | If True, shows the full config of all tasks at the end of the evaluation. Defaults to False. |
--include_path | Additional path to include if there are external tasks to include. |
--predict_only | Use with --log_samples. Only model outputs will be saved and metrics will not be evaluated. |
--seed | Set seed for python’s random, numpy and torch. |
--temperature | Sampling temperature used for generation (autoregressive, generate_until tasks only). |
--top_p | Top-p parameter used for nucleus sampling (autoregressive, generate_until tasks only). |
--top_k | Top-k parameter used for generation (autoregressive, generate_until tasks only). |
DataLoader Settings | Description |
---|---|
data_dir | This setting is required. Provide a path to the mounted directory visible to the worker containers where eval harness task data samples are dumped after preprocessing. Use the mount_dirs argument to specify a dir mount, similar to our existing flows. |
tokenizer_file_path | Path to a custom tokenizer (JSON) file. If you provide a custom tokenizer, then you must also specify eos_id; otherwise, you must provide a pretrained tokenizer from Hugging Face in pretrained_model_name_or_path. |
pretrained_model_name_or_path | Hugging Face (HF) pretrained model name or path. This setting is required if you do not specify tokenizer_file_path. For a detailed description, see HF AutoTokenizers. |
eos_id | End-of-sentence (eos) token ID to signal the termination of a sequence. This setting is required if you specify a custom tokenizer in tokenizer_file_path. You can set this by looking for the ID corresponding to the eos token in the custom tokenizer JSON file. |
max_sequence_length | Maximum length of the input sequence. This setting is required for preprocessing input data samples from the specified eval harness tasks. You should align the max_sequence_length field to the max_position_embeddings value in the model configuration of the YAML. If you don't specify max_sequence_length, the flow defaults to the max_position_embeddings setting. |
keep_data_dir: Use this to preserve the preprocessed eval harness task data samples, i.e. the directory specified under `data_dir`. Defaults to False, i.e. data samples are deleted after the run.

By default, data downloaded from the Hugging Face Hub is cached under `$HOME/.cache/huggingface`. However, you may choose to specify a different directory for this cached data via the `HFCacheDir` callback:
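A sketch, assuming the callback accepts the cache location via a cache_dir argument (the argument name is an assumption):

```yaml
callbacks:
  - HFCacheDir:
      cache_dir: /path/to/custom/hf_cache   # used instead of $HOME/.cache/huggingface
```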
Use the script `src/cerebras/modelzoo/common/run_eleuther_eval_harness.py` from the Cerebras Model Zoo to run EEH on various tasks.
This script accepts the command line interface (CLI) arguments listed in the Eleuther Eval Harness CLI Arguments table above. These arguments may also be specified under the `eeh_args` key of the `EleutherEvalHarness` configuration, but please note that the CLI settings will override the settings in the YAML.
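For reference, a minimal invocation sketch using only the flags documented on this page (paths are placeholders, and additional cluster-specific arguments may be required):

```bash
python src/cerebras/modelzoo/common/run_eleuther_eval_harness.py \
    --params ./llama3_8B_eeh.yaml \
    --tasks winogrande,arc_easy \
    --output_path ./eeh_results
```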
The `--params` CLI argument is required. Use it to specify the path to the YAML configuration file. Note that while we still support the old V1 YAML specification, it will soon be deprecated, so we recommend using the new V2 YAML.
Use the `--checkpoint_path` CLI argument to specify the path to the checkpoint file to load model weights from. If a checkpoint path is not provided, this flow supports checkpoint autoloading, such that the latest checkpoint file will be picked up from the specified `model_dir`.
The supported version of the Eleuther Eval Harness package is `lm_eval@v0.4.5`.
Non-generative eval harness tasks are those that specify `output_type: loglikelihood` or `output_type: multiple_choice` in the task specification. See `asdiv` and `arc_easy` for respective examples. You may specify each of these types of tasks separately or together in a single `EleutherEvalHarness` callback.
Note that we currently do not support non-generative tasks that specify `output_type: loglikelihood_rolling`. These unsupported tasks are `pile` and `wikitext` for the supported EEH version 0.4.2.
Generative (`generate_until`) eval harness task groups include:

- agieval
- babi
- bbh
- bbh_cot_fewshot
- bbh_cot_zeroshot
- bbh_fewshot
- bbh_zeroshot
- bigbench
- codexglue_code2text
- math_word_problems
- mgsm_direct
- mgsm_cot_native
- mmlu_flan_cot_fewshot
- mmlu_flan_cot_zeroshot
- mmlu_flan_n_shot_generative
- polemo2
- super-glue-lm-eval-v1
- unscramble
Additional settings can be provided under the `flags` configuration of the `EleutherEvalHarness` callback of the YAML,