Cerebras’ model library for implementing multimodal models
- `configs/`: YAML configuration files.
- `modeling_mmsimple.py`: Defines the core multimodal model.
- `model.py`: The entry point to the model.
- `run.py`: Training script. Performs training and validation.

| Config File | Dataset | Notes |
| --- | --- | --- |
| `params_mm_llava_llama2_7b_phase1.yaml` | LLaVA Visual Instruct Pretrain LCS-558K Dataset | LLaVA-7B Phase-1 with CLIP ViT image encoder, Vicuna-7B text encoder and an mlp2x-gelu feedforward network for the projector. Freezes `image_model` and `text_model` during training. |
| `params_mm_llava_llama2_7b_phase2.yaml` | LLaVA Visual Instruct 150K Dataset | LLaVA-7B Phase-2 with CLIP ViT image encoder, Vicuna-7B text encoder and an mlp2x-gelu feedforward network for the projector. Freezes `image_model` during training. |
Download the pretrained checkpoints and tokenizer files needed before running `run.py`:

a. Image encoder (CLIP ViT) checkpoints should be downloaded to a subdirectory `image_model`.

b. LLaMA3 checkpoints and tokenizer files should be downloaded to a subdirectory `text_model`.

c. Rename `config.json` to `config_lmsys.json`:

   `mv /path/to/pretrained/checkpoints/text_model/config.json /path/to/pretrained/checkpoints/text_model/config_lmsys.json`

d. Download the LLaVA-8B `config.json` from HuggingFace.
We do steps (c) and (d) above because we need additional information about the LLaVA model, such as `mm_projector_type`, to build the appropriate CS config YAML and checkpoint.
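After steps (a)-(d), the pretrained checkpoint directory should contain roughly the layout below (a sketch based on the steps above; any additional files shipped with the checkpoints are omitted):

```
/path/to/pretrained/checkpoints/
├── image_model/     # image encoder (CLIP ViT) checkpoints
└── text_model/      # LLaMA3 checkpoints, tokenizer files,
                     # config_lmsys.json and the LLaVA config.json
```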
Checkpoint conversion is done with `modelzoo/tools/convert_checkpoint.py`. To see the supported models and conversion formats, run `python modelzoo/tools/convert_checkpoint.py list`.
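A conversion command then generally follows the pattern sketched below. The `convert` subcommand's flag names and format identifiers here are assumptions based on the converter's usual interface rather than this README; confirm them against the `list` output and `--help` before running.

```bash
# Sketch only: --model, --src-fmt, --tgt-fmt and --output-dir are assumed flag
# names; check `python modelzoo/tools/convert_checkpoint.py convert --help`.
python modelzoo/tools/convert_checkpoint.py convert \
  --model llava \
  --src-fmt hf \
  --tgt-fmt cs-current \
  --output-dir /path/to/converted/checkpoints \
  /path/to/pretrained/checkpoints
```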
Before launching a run with `run.py`, check the following parameters in the YAML config:

- Ensure the `train_input.data_dir` parameter points to the correct dataset.
- Ensure the `train_input.img_data_dir` parameter points to the correct parent directory containing all images needed by the dataset.
- Ensure the `train_input.image_size` parameter corresponds to the image size of the dataset.
- Update `train_input.transforms` appropriately if `train_input.image_size` is updated.
- `image_model.image_size` is the image size passed to each ViTModel.
- Set the `image_model.patch_size` parameter to use different patch sizes within each ViTModel.
- `model.freeze` contains the regex patterns used to freeze the appropriate layers in the model.
- Set the `image_model.image_layer_idx` parameter to specify the `image_model` encoder layer from which features are extracted for the input image.
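As a quick, purely illustrative check before launching, you can print these fields from your config (the YAML path is a placeholder):

```bash
# Show the dataset- and image-related fields discussed above.
grep -nE "data_dir|img_data_dir|image_size|transforms|patch_size|image_layer_idx|freeze" /path/to/yaml
```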
Note that we use `/path/to/yaml`, `/path/to/model_dir`, and `train` as placeholders for user-supplied inputs in the run command (a sketch of the command follows the list below):
- `/path/to/yaml` is a path to the YAML config file with model parameters, such as one of the configurations described in the configs included for this model.
- `/path/to/model_dir` is a path to the directory where we would like to store the logs and other artifacts of the run.
- `--mode` specifies the desired mode to run the model in. Change to `--mode eval` to run in eval mode.
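A training launch typically looks like the sketch below. The `CSX` target and the flag names are assumptions based on common Cerebras Model Zoo usage rather than this README; verify the exact interface with `python run.py --help`.

```bash
# Sketch only: substitute your own YAML config and model directory.
python run.py CSX \
  --params /path/to/yaml \
  --model_dir /path/to/model_dir \
  --mode train
```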
To evaluate with the LLaVA source code repository, the trained checkpoint is converted back to the source-repository format. The converted checkpoint contains two folders, `image_model` and `text_model`, under `output-dir`, as described below:
- `image_model` consists of the weights for the `vision_tower` in the source repository.
- `text_model` consists of the weights to be loaded for the language model and the projectors.
The `text_model.mm_vision_tower` field in the `text_model` folder points to the `image_model` path, so that the weights from the `image_model` folder are loaded into the source code `vision_tower`. This path is added automatically during checkpoint conversion.
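To confirm the path was written, you can inspect the converted text model's config; this assumes the field lives in a `config.json` under `text_model`, which may vary by converter version:

```bash
# Check that mm_vision_tower points at the converted image_model folder.
grep -n "mm_vision_tower" /path/to/output-dir/text_model/config.json
```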
Rename `text_model` to `text_model_llava`. This is because the source code repository expects the path to include the `llava` keyword in order to correctly load the checkpoints (code pointers: builder.py, mm_utils.py).
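The rename is a single command (the `output-dir` path is a placeholder):

```bash
# The LLaVA loaders (builder.py, mm_utils.py) key off "llava" in the path.
mv /path/to/output-dir/text_model /path/to/output-dir/text_model_llava
```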
The final `output-dir` should look like below:
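A sketch of the expected layout (only the folders discussed above are shown; other converter outputs are omitted):

```
output-dir/
├── image_model/         # weights for the source-repository vision_tower
└── text_model_llava/    # language model and projector weights
```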
Pass the `text_model_llava` folder path to `--model-path` in the eval scripts in the LLaVA source code repository.

The generated H5 files reference images via the `image_key`; all referenced images must be present under `train_input.img_data_dir` (see above). For example:
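A hypothetical illustration (the relative path below is made up, not taken from the dataset): if an H5 sample stores `images/0001.jpg` under its image key, the corresponding file is expected under `train_input.img_data_dir`:

```bash
# Both paths are placeholders; this simply checks that the referenced image exists.
ls /path/to/img_data_dir/images/0001.jpg
```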