Prerequisites

Read through the Trainer Overview and Trainer Configuration Overview. In this guide, you’ll use the tools and configurations outlined in those pages.

Configure the Model Directory

Pass the model_dir argument to the Trainer's constructor to set the directory where trainer artifacts are written:


trainer:
  init:
    ...
    model_dir: "./model_dir"

Model Directory Structure

The model directory contains a subdirectory named after the datetime of the run (20240618_074134 in the example below). This allows multiple runs to share the same model directory without overwriting each other's artifacts.

model_dir/
├── 20240618_074134
│   ├── eleuther_0
│   │   └── results
│   │       ├── drop_10.json
│   │       ├── drop_5.json
│   │       ├── results_10.json
│   │       ├── results_5.json
│   │       ├── winogrande_10.json
│   │       └── winogrande_5.json
│   └── events.out.tfevents.1718721695.user.17833.0
├── cerebras_logs
│   └── 20240618_074134
│       ├── run.log
│       └── ...
├── checkpoint_10.mdl
├── checkpoint_5.mdl
└── latest_run.log -> cerebras_logs/20240618_074134/run.log

Inside this datetime subdirectory are the results from the run. In the example above, the eleuther_0 directory contains the results of an Eleuther Eval Harness run.
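
For context, such an evaluation is typically configured as a callback on the trainer. The sketch below assumes an EleutherEvalHarness callback whose arguments include a task list; the callback name and argument keys are assumptions here, so consult the Trainer Configuration Overview for the exact schema:

trainer:
  init:
    ...
    callbacks:
      - EleutherEvalHarness:
          # assumed argument names; task names match the
          # result files shown in the tree above
          eeh_args:
            tasks: winogrande,drop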

This run also used the TensorBoardLogger, so the subdirectory includes the event files written by the TensorBoard writer.
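
For reference, a logger like this is enabled in the trainer's YAML configuration. The sketch below assumes a loggers list under init that accepts a TensorBoardLogger entry; check the Trainer Configuration Overview for the keys your version supports:

trainer:
  init:
    ...
    loggers:
      # assumed entry name; writes event files into the
      # run's datetime subdirectory
      - TensorBoardLogger: {}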

If you point TensorBoard at the model directory, the event files are grouped by run:

tensorboard --bind_all --logdir=./model_dir

Inside the model directory, the cerebras_logs directory stores logs and artifacts from compilation and execution. These are organized into datetime subdirectories that match the run subdirectory described above, making it easy to associate them with a specific run.

Checkpoints taken during the run are also saved in the base model directory, allowing future runs with checkpoint autoloading enabled to access them easily. See Checkpointing for more details.
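
For illustration, the checkpoint cadence in the tree above (checkpoint_5.mdl and checkpoint_10.mdl, i.e. one checkpoint every 5 steps) might come from a configuration along these lines. The checkpoint key and the autoload_last_checkpoint flag are assumptions here; see Checkpointing for the authoritative options:

trainer:
  init:
    ...
    checkpoint:
      # assumed keys: take a checkpoint every 5 steps, and let
      # future runs automatically resume from the latest one
      steps: 5
      autoload_last_checkpoint: True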