If you have a legacy params configuration, you can convert it to a Trainer configuration using the utility convert_legacy_params_to_trainer_params from the script src/cerebras/modelzoo/trainer/utils.py. Downstream validation with the BigCode Eval Harness (BCEH) is set up via a Trainer callback.
The example in this section configures code evaluation on the BigCode Eval Harness task humaneval using a single CSX.
If you aren’t interested in seeing the breakdown of the configuration, feel free to skip ahead to the Putting it All Together section to see the full YAML configuration.
Note that you may update num_csx to run BCEH on multiple CSXs for improved performance.
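For reference, num_csx lives in the cluster configuration of the Trainer backend; the sketch below shows a typical layout (structure and value are illustrative, so adjust to your own configuration):

```yaml
trainer:
  init:
    backend:
      backend_type: CSX
      cluster_config:
        num_csx: 2  # run BCEH across two CSX systems
```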
Note that model_name is now deprecated and replaced with the name setting in the model configuration. Valid names corresponding to the supported models include:

- btlm
- bloom
- gpt2
- gptj
- falcon
- gpt3
- gpt-neox
- llama
- mistral
- mpt
- jais
- santacoder
- starcoder
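For example, a Llama-family model would be selected as in the sketch below (the remaining model parameters are omitted and the key placement is an assumption):

```yaml
trainer:
  init:
    model:
      name: llama
      # ... remaining model parameters ...
```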
BCEH downstream validation is configured using a Trainer callback. Add the following configuration for the BigCodeEvalHarness callback.
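As a reference, here is a minimal sketch of what this callback might look like inside the Trainer YAML. The key placement and all values shown (task, paths, tokenizer, sequence length) are illustrative assumptions; the tables below describe the supported settings.

```yaml
trainer:
  init:
    callbacks:
      - BigCodeEvalHarness:
          # BCEH CLI-style settings (see the first table below)
          bigcode_args:
            tasks: humaneval
            max_length_generation: 512
            n_samples: 1
          # Dataloader settings (see the second table below)
          data_dir: /path/to/mounted/data_dir
          pretrained_model_name_or_path: meta-llama/Meta-Llama-3-8B
          max_sequence_length: 8192
          keep_data_dir: false
```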
The bigcode_args section exposes the following settings to configure the BCEH run:
| BigCode Eval Harness CLI Arguments | Description |
| --- | --- |
| --prefix | Number of examples to be added to the fewshot context string. Defaults to 0. |
| --temperature | Sampling temperature used for generation. |
| --top_p | Top-p parameter used for nucleus sampling. |
| --top_k | Top-k parameter used for generation. |
| --n_samples | Number of completions to generate for each sample. Defaults to 1. |
| --seed | Random seed used for evaluation. |
| --tasks | Comma-separated string specifying BigCode Eval Harness tasks. |
| --instruction_tokens | Comma-separated series of instruction tokens used for instruction-tuning benchmarks, e.g. <user_message>,<end_user_message>,<assistant_message>. |
| --max_length_generation | Maximum length of generated sequence (prompt + generation). |
| --limit | Number of samples to solve and evaluate from the benchmark. |
| --limit_start | Optional offset to start from when limiting the number of samples. |
| --save_every_k_tasks | Optional saving after every k tasks. |
| --load_generations_path | Path of a file with previously generated solutions. If provided, generation is skipped and only evaluation is done. |
| --load_data_path | Path of additional data to load for the tasks. |
| --metric_output_path | Path to save the results. |
| --load_generations_intermediate_paths | List of paths for saving the intermediate code generations. |
| --save_generations_path | Path for saving the model’s output code generations. |
| --save_references_path | Path for saving the reference solutions/tests. |
| --prompt | Prompt type to use for generation in HumanEvalPack tasks. |
| --check_references | Don’t run generation but benchmark the ground truth (useful for debugging). |
The following dataloader settings are also exposed to configure preprocessing of the eval harness task data:

| DataLoader Settings | Description |
| --- | --- |
| data_dir | This setting is required. Provide a path to the mounted directory, visible to the worker containers, where eval harness task data samples are dumped after preprocessing. Use the mount_dirs argument to specify a dir mount, similar to our existing flows. |
| tokenizer_file_path | Path to a custom tokenizer (JSON) file. If you provide a custom tokenizer, then you must also specify eos_id; otherwise, you must provide a pretrained tokenizer from Hugging Face in pretrained_model_name_or_path. |
| pretrained_model_name_or_path | Hugging Face (HF) pretrained model name or path. This setting is required if you do not specify tokenizer_file_path. For a detailed description, see HF AutoTokenizers. |
| eos_id | End-of-sentence (eos) token ID to signal the termination of a sequence. This setting is required if you specify a custom tokenizer in tokenizer_file_path. You can set this by looking for the ID corresponding to the eos token in the custom tokenizer JSON file. |
| max_sequence_length | Maximum length of the input sequence. This setting is required for preprocessing input data samples from the specified eval harness tasks. You should align the max_sequence_length field to the max_position_embeddings value in the model configuration of the YAML. If you don’t specify max_sequence_length, the flow defaults to this max_position_embeddings setting. |
keep_data_dir: Use this to preserve the preprocessed eval harness task data samples, i.e. the directory specified under data_dir. Defaults to False, i.e. data samples are deleted after the run.

By default, eval harness task data downloaded from Hugging Face is cached under $HOME/.cache/huggingface.
However, you may choose to specify a different directory for this cached data via the HFCacheDir callback:
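A sketch of this callback is shown below, assuming it accepts a cache_dir argument (the path is illustrative):

```yaml
trainer:
  init:
    callbacks:
      - HFCacheDir:
          cache_dir: /path/to/hf_cache
```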
For generative eval harness tasks, the model configuration must also include the following inference settings:

- start_token - ID of the special token that indicates where to start inferring for each sample, as described above. You may specify a list of token IDs instead of a single ID. If you do, the model will start inference at the first token that matches any one of the provided IDs, and will pad inferred predictions with the first ID in the list.
- stop_sequences - List of sequences (each one being a list of token IDs). If any one of these sequences is emitted by the model, inference will stop for that sample. For example, suppose you would like to stop inferring after either a newline character (e.g. token ID 1), or a combination of a period (e.g. token ID 2) followed by a space (e.g. token ID 3). In this case, set stop_sequences to [[1], [2, 3]]. To stop inferring after seeing a newline character only, set stop_sequences to [[1]]. To disable this feature, set stop_sequences to an empty list [].

Additionally, the following optional parameters may be set:
- max_tokens - Maximum tokens to infer for each sample.
- loop_dim - Indicates the sequence dimension in the input and output data. Default value is 1. If set to 0, indicates that both input and output data are transposed (i.e. sequence x samples instead of samples x sequence).
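Here is a sketch of these settings in the model configuration. The token IDs are illustrative and assume a Llama-3 tokenizer (vocab_size of 128256); substitute values appropriate to your model.

```yaml
trainer:
  init:
    model:
      name: llama
      # ... other model parameters ...
      start_token: 128256        # equal to vocab_size, so the model never generates it
      stop_sequences: [[128001]] # placeholder; the flow overrides this with the task's stop tokens
      max_tokens: 256            # optional: cap on tokens inferred per sample
      loop_dim: 1                # optional: sequence dimension (default)
```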
For start_token, it is ideal to choose a value that’s not going to be generated by the model, i.e. vocab_size in the example above.
Note that BCEH tasks define their own stop_sequences. For instance, see the specification of stop tokens for humaneval. The flow will internally override the stop_sequences config with the value from the task, so you can also specify an arbitrary, valid value in the YAML as shown above.
The BCEH CLI arguments listed above may be passed directly to the run script or specified under the bigcode_args key of the BigCodeEvalHarness configuration, but please note that the CLI setting will override the settings in the YAML.
The params argument is required. Use it to specify the path to the YAML configuration file.
Use the --checkpoint_path CLI argument to specify the path to the checkpoint file to load model weights from. If a checkpoint path is not provided, this flow supports checkpoint autoloading, such that the latest checkpoint file will be picked up from the specified model_dir.
Use the --save_generations_path argument to specify the path for saving the model’s output generations. If no absolute path is provided, the generations will be dumped inside of model_dir.
The code execution and evaluation flow is run on CPU, preferably in a sandboxed environment, using these dumped model outputs.
For this example, assume that the full YAML configuration is saved as ./llama3_8B_bceh.yaml. Then, to run code generation for the task humaneval, set up a bash script as follows:
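A sketch of such a script is shown below. The entry-point script location and the mount paths are assumptions; only --params, --checkpoint_path, --save_generations_path, and mount_dirs are described in the text above, so verify the script name and accepted flags in your ModelZoo installation.

```bash
#!/usr/bin/env bash
# Illustrative paths; adjust to your ModelZoo checkout and cluster setup.
python /path/to/modelzoo/src/cerebras/modelzoo/common/run_bigcode_eval_harness.py \
    --params ./llama3_8B_bceh.yaml \
    --checkpoint_path /path/to/checkpoint.mdl \
    --save_generations_path /path/to/generations \
    --mount_dirs /path/to/modelzoo /path/to/data_dir
```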
Note that you may add temperature, top_k, or top_p sampling settings to either the bash script or under bigcode_args of the YAML. For example:
Generative code eval tasks, such as humaneval, require executing the model’s generated code for processing the final eval scores. Thus, for security purposes, use the instructions here to set up a containerized environment.
Finally, to obtain the final eval scores, invoke BCEH’s main script using the --load_generations_path CLI argument.
For the example above, the model’s output generations are dumped to <path_to_bigcode_model_dir>/20240627_051131/bigcode_0/generations_humaneval_1.json.
Set up a bash script as follows to invoke the BCEH script:
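A sketch of such a script is shown below, assuming a local clone of BigCode's bigcode-evaluation-harness repository; the generations path matches the example above, and the flags follow the upstream CLI:

```bash
#!/usr/bin/env bash
# Run inside the containerized environment recommended above.
python bigcode-evaluation-harness/main.py \
    --tasks humaneval \
    --load_generations_path <path_to_bigcode_model_dir>/20240627_051131/bigcode_0/generations_humaneval_1.json \
    --allow_code_execution \
    --n_samples 1 \
    --metric_output_path ./evaluation_results.json
```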
The final eval scores are written to ./evaluation_results.json.
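An abridged sketch of these results is shown below; only the pass@1 value is taken from the run above, and the config section (which lists BigCode's CLI defaults) is elided:

```json
{
  "humaneval": {
    "pass@1": 0.13414634146341464
  },
  "config": {
    "...": "BigCode CLI default values (safe to ignore)"
  }
}
```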
Note that the config section of the final output JSON contains all the default values from BigCode’s CLI. Feel free to ignore this section and consider only the final eval score, i.e. "pass@1": 0.13414634146341464 above.
Ensure that the --allow_code_execution flag is set for tasks that require code execution.
Ensure that the --tasks argument specifies the correct task for which you produced the model’s output generations on CSX, i.e. humaneval in the example above.
To run downstream validation on multiple code eval tasks, specify multiple BigCodeEvalHarness callbacks (one per task) in the YAML.
For example, you may update the YAML config as such to run downstream validation on the generative tasks mbpp and humaneval:
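A sketch of such an update is shown below; only the callbacks section is shown, and each callback would also carry its dataloader settings as described earlier:

```yaml
trainer:
  init:
    callbacks:
      - BigCodeEvalHarness:
          bigcode_args:
            tasks: mbpp
          data_dir: /path/to/mounted/data_dir
      - BigCodeEvalHarness:
          bigcode_args:
            tasks: humaneval
          data_dir: /path/to/mounted/data_dir
```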
Note that we do not support the task DS1000, since it requires Python version 3.7.10, whereas our packaged environment supports 3.8.0. Additional settings for the run may be specified under the flags configuration of the BigCodeEvalHarness callback of the YAML.
In this guide, you configured and ran downstream validation on CSX for the BigCode Eval Harness task humaneval.
You should now be comfortable configuring further downstream BCEH runs for your model of choice on other code eval tasks.