Model Description
LLaVA (Large Language and Vision Assistant) is a multimodal model that integrates a vision encoder with a language model via a lightweight projector module, enabling end-to-end visual and language understanding. It accepts both image and text inputs and generates text-based outputs, making it suitable for instruction-following, question answering, and general-purpose visual dialogue tasks. The architecture consists of three components:
- A vision encoder initialized from the pretrained OpenAI CLIP-ViT-L/14-336px.
- A language model initialized from Vicuna weights.
- A projector module, implemented as a multi-layer perceptron (MLP), which maps image embeddings into the language model’s token embedding space.
The model is trained in two phases:
- Feature Alignment Pretraining: Only the projector module is trained in this phase. Its weights are updated to align the image features from the vision encoder with the word embeddings of the language model.
- Instruction Finetuning: The model is trained on instruction-following data to enable chatbot capabilities. During this phase, the language model and projector are typically finetuned while the vision encoder remains frozen.
Code Structure
The code for this model is located in the llava directory within the ModelZoo. Here’s how it’s organized:
- configs/: Contains YAML configuration files used for training, evaluation, and instruction fine-tuning.
- model.py: Defines the top-level LLaVA model class, encapsulating the vision encoder, projector, and language model modules, and orchestrating forward passes for training and inference.
- modeling_llava.py: Implements core building blocks, including the projector MLP, model loading utilities, and the integration logic that bridges image features into the LLM token embedding space.
Available Configurations
| Configuration | Description |
|---|---|
| params_llava_v1p5_pretrain_13b_phase1_MSL2K.yaml | Phase 1: Feature alignment pretraining for the 13B model (MSL=2048). |
| params_llava_v1p5_pretrain_13b_phase2_MSL2K.yaml | Phase 2: Instruction fine-tuning for the 13B model (MSL=2048). |
| params_llava_v1p5_pretrain_7b_phase1_MSL2K.yaml | Phase 1: Feature alignment pretraining for the 7B model (MSL=2048). |
| params_llava_v1p5_pretrain_7b_phase2_MSL2K.yaml | Phase 2: Instruction fine-tuning for the 7B model (MSL=2048). |
Dataset Download and Preprocessing
Please follow the instructions here for datasets to be downloaded from HuggingFace Datasets. Since all datasets on the HuggingFace Hub are Git repositories, the datasets can be downloaded locally by running git clone:
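For example, the LLaVA-Pretrain dataset repository referenced below can be cloned as follows (git-lfs is needed for the large files):

```bash
# Clone a HuggingFace dataset repository locally
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain
```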
We provide preprocess_dataset.py to further pre-process some of the Phase-1 and Phase-2 datasets into the correct LLaVA jsonl format. Please see the script's help message for which datasets are covered, and refer to the sections below for details on each individual dataset. Additionally, the utility option convert_json2jsonl converts a folder of json files to jsonl files; jsonl is the input format that the subsequent HDF5 processing scripts operate on.
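As a minimal sketch, the utilities might be invoked as below; the argument names are assumptions, so run the script with --help for the actual interface:

```bash
# Print the supported datasets and utility options
python preprocess_dataset.py --help

# Hypothetical invocation of the convert_json2jsonl utility; flag names may differ
python preprocess_dataset.py convert_json2jsonl \
    --input_dir /<path>/json_files \
    --output_dir /<path>/jsonl_files
```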
All images are expected to reside under a single parent directory, and the relative path of each image is written under the image_key field in the generated H5 files. For example:
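A hypothetical layout (the file names below are made up for illustration): every dataset's images sit under one parent directory, and each example stores the path relative to that parent in its image_key field.

```
/<path>/images/                          # parent image directory (dataset.image_dir)
├── LLaVA-Pretrain/00453/001234.jpg      # image_key = "LLaVA-Pretrain/00453/001234.jpg"
├── coco/train2017/000000033471.jpg      # image_key = "coco/train2017/000000033471.jpg"
└── synthdog-en/image_000042.png         # image_key = "synthdog-en/image_000042.png"
```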
Phase-1: Pre-training for Feature alignment datasets:
LLaVA Visual Instruct Pretrain LCS-558K Dataset
- Download from https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain
- Dataset and images can be downloaded directly from the HuggingFace Hub using git clone git@hf.co:datasets/liuhaotian/LLaVA-Pretrain
- No further preprocessing is required
ShareGPT4V-PT Dataset
- Download the dataset share-captioner_coco_lcs_sam_1246k_1107.json from HuggingFace here.
- This dataset consists of 100K high-quality captions collected from the advanced GPT4-Vision, expanded to 1.2 million captions with a caption model trained on this subset.
- Images for this dataset can be downloaded by following the instructions here
- No further preprocessing is required
Synthdog-EN Dataset
- Download the dataset from HuggingFace: https://huggingface.co/datasets/naver-clova-ix/synthdog-en
- The images for this dataset are present in the parquet files
- Steps for preprocessing the dataset:
  - Place the downloaded files at /<path>/synthdog-en
  - Use preprocess_dataset.py to process the data into the LLaVA jsonl format. Please see the required input arguments.
  - Example command: see the sketch below.
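An illustrative invocation; the subcommand and flag names are assumptions, so check the script's help output for the actual interface:

```bash
# Hypothetical: convert the Synthdog-EN parquet files into LLaVA-style jsonl
python preprocess_dataset.py synthdog_en \
    --input_dir /<path>/synthdog-en \
    --output_dir /<path>/synthdog-en/jsonl
```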
Phase-2: Instruction Finetuning datasets
LLaVA Visual Instruct 150K Dataset
- Download the dataset llava_v1_5_mix665k.json from https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json
- Images corresponding to this dataset can be downloaded by following the instructions here
- No further preprocessing is required
ShareGPT4V-SFT Dataset:
- Download the dataset sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json from HuggingFace here.
- This dataset is built by replacing 23K image-text pairs related to the image-captioning task in LLaVA-mix-665K with an equivalent subset of collected GPT4V-generated high-quality image-text pairs.
- Images for this dataset can be downloaded by following the instructions here
- No further preprocessing is required
ChartQA Dataset
- Download the dataset from HuggingFace Hub Datasets: https://huggingface.co/datasets/ahmed-masry/ChartQA
- Steps for preprocessing the dataset:
  - Place the downloaded files at /<path>/ChartQA_Dataset
  - Use preprocess_dataset.py to process the data into the LLaVA jsonl format. Please see the required input arguments.
  - Example command: see the sketch below.
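As with Synthdog-EN, a hypothetical command with assumed flag names:

```bash
# Hypothetical ChartQA preprocessing command
python preprocess_dataset.py chartqa \
    --input_dir /<path>/ChartQA_Dataset \
    --output_dir /<path>/ChartQA_Dataset/jsonl
```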
DVQA Dataset
- The dataset can be downloaded by following instructions mentioned in https://github.com/kushalkafle/DVQA_dataset?tab=readme-ov-file#download-links
- Steps for preprocessing the dataset:
  - Place the downloaded files at /<path>/DVQA
  - Use preprocess_dataset.py to process the data into the LLaVA jsonl format. Please see the required input arguments.
  - Example command: see the sketch below.
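Again purely illustrative; the flag names are assumptions:

```bash
# Hypothetical DVQA preprocessing command
python preprocess_dataset.py dvqa \
    --input_dir /<path>/DVQA \
    --output_dir /<path>/DVQA/jsonl
```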
AI2 Diagram Dataset
- Download the dataset and images (the download is a zip archive)
- Steps for preprocessing the dataset:
  - Unzip the downloaded zip file to /<path>/ai2d
  - Use preprocess_dataset.py to process the data into the LLaVA jsonl format. Please see the required input arguments.
  - Example command: see the sketch below.
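A hypothetical command following the same pattern (assumed flags):

```bash
# Hypothetical AI2D preprocessing command
python preprocess_dataset.py ai2d \
    --input_dir /<path>/ai2d \
    --output_dir /<path>/ai2d/jsonl
```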
ArxivQA Dataset
- The dataset can be downloaded from https://huggingface.co/datasets/MMInstruction/ArxivQA
- Steps for preprocessing the dataset:
  - Place the downloaded files at /<path>/ArxivQA
  - Use preprocess_dataset.py to process the data into the LLaVA jsonl format. Please see the required input arguments.
  - Example command: see the sketch below.
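Illustrative only; flag names are assumptions:

```bash
# Hypothetical ArxivQA preprocessing command
python preprocess_dataset.py arxivqa \
    --input_dir /<path>/ArxivQA \
    --output_dir /<path>/ArxivQA/jsonl
```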
ArxivCap Dataset
- The dataset can be downloaded from https://huggingface.co/datasets/MMInstruction/ArxivCap.
- We process and use only figures with captions. Any subfigures are not included.
- Steps for preprocessing the dataset:
  - Place the downloaded files at /<path>/ArxivCAP
  - Use preprocess_dataset.py to process the data into the LLaVA jsonl format. Please see the required input arguments.
  - Example command: see the sketch below.
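Illustrative only; flag names are assumptions:

```bash
# Hypothetical ArxivCap preprocessing command
python preprocess_dataset.py arxivcap \
    --input_dir /<path>/ArxivCAP \
    --output_dir /<path>/ArxivCAP/jsonl
```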
DocVQA Dataset
- Download the dataset by following the instructions here
- Registration is required to download the dataset under Single Document Visual Question Answering
- Steps for preprocessing the dataset:
  - Place the downloaded files at /<path>/DocVQA
  - Use preprocess_dataset.py to process each subset into the LLaVA jsonl format. Please see the required input arguments.
  - Example commands: see the sketch below.
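Hypothetical commands, one per subset; the subset flag and other argument names are assumptions:

```bash
# Hypothetical DocVQA preprocessing commands, one per subset
python preprocess_dataset.py docvqa --split train \
    --input_dir /<path>/DocVQA --output_dir /<path>/DocVQA/jsonl
python preprocess_dataset.py docvqa --split val \
    --input_dir /<path>/DocVQA --output_dir /<path>/DocVQA/jsonl
```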
Sequence of the steps to perform
The high-level steps for training a model are relatively simple, involving data processing followed by model training and evaluation:
- Step 1: Dataset download and preprocessing
- Step 2: Generate H5 files for training
- Generate files for Phase-1 (Pre-training for Feature alignment) stage
- Generate files for Phase-2 (Instruction Finetuning) stage
- Step 3: Download pretrained checkpoints for Phase 1
- Step 4: Convert checkpoints to CS Model Zoo format using checkpoint converter
- Step 5: Training the model on a CS system or GPU using run.py
  - To compile/validate, run train and eval on a Cerebras system
  - To run train and eval on GPU/CPU
    - Phase-1 training
    - Phase-2 training
- Step 6: Convert checkpoint to source code repository format to run eval
- Step 7: Set up source code repository for benchmark evaluation and run evaluation.
Step 1: Dataset download and preparation
Please follow the instructions in Dataset Download and Preprocessing to set up the datasets for the appropriate phase before H5 file generation.
Step 2: Generate H5 files for training
The next step is to generate the H5 files that are used by the model during training through LlavaHDF5MapDataProcessor. We use create_hdf5_dataset.py to create the preprocessed dataset files. Further details on usage and instructions can be found here.
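An illustrative call, assuming the script accepts the preprocessing YAML via a --params flag (see the linked instructions for the authoritative usage):

```bash
# Hypothetical: generate Phase-1 H5 files using the LLaVA Phase-1 preprocessor config
python create_hdf5_dataset.py --params llava_phase_1_preproc.yaml
```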
Generate files for Phase-1 (Pre-training for Feature alignment) stage
Refer to LlavaPhaseOnePreprocessor and the config file for Phase-1 H5 file generation: llava_phase_1_preproc.yaml. Please update the following fields in llava_phase_1_preproc.yaml appropriately:
- setup.input_dir: Input data directory containing jsonl files.
- setup.output_dir: Output directory to save the generated H5 files.
- setup.processes: Adjust based on the cores available for parallel processing.
- processor.tokenizer_type: Tokenizer to use.
- processor.max_sequence_length: Maximum sequence length that the model is trained on. This includes the token positions used for the image data features as well, so the number of positions available for text tokens is processor.max_sequence_length - dataset.num_patches - 1 (BOS token).
- dataset.num_patches: Number of patches obtained after the image is patchified. This is computed from the image and patch sizes, typically (image_size / patch_size)^2; for example, (336 / 14)^2 = 576 for CLIP-ViT-L/14-336px.
- dataset.image_dir: Parent directory where all the images are present. It is used along with the relative path under the image_key field in the jsonl files to check that images exist; examples with no image are discarded.
Generate files for Phase-2 (Instruction Finetuning) stage
Refer to LlavaPhaseTwoPreprocessor and the config file for Phase-2 H5 file generation: llava_phase_2_preproc.yaml. The fields to be updated include:
- All fields mentioned for Phase-1 above.
- Please note the field dataset.system_prompt_style: vicuna_v1. It is used to transform the instruction-finetuning dataset into the vicuna_v1 format, with the appropriate system message and USER and ASSISTANT values. Note that we currently support vicuna_v1 only.
- Support for other formats such as llama and zephyr is planned for future releases.
Step 3: Download pretrained checkpoints for Phase-1
The checkpoint converter script for converting the CLIP-ViT and Vicuna checkpoints to CS format requires the following directory structure and files:
- image_model: the vision-encoder checkpoint, downloaded from openai/clip-vit-large-patch14-336 (step a).
- text_model: the language-model checkpoint, downloaded from lmsys/vicuna-7b-v1.5 for the 7B configuration (step b).
- The language model's config.json renamed to config_lmsys.json (step c).
- The LLaVA config.json, which provides mm_projector_type etc. to build the appropriate CS config yaml and checkpoint (step d).
Note: In the case of LLaVA-13B:
- The image model remains the same, so follow the same step (a) as above and download from openai/clip-vit-large-patch14-336.
- Download text_model from lmsys/vicuna-13b-v1.5 in step (b).
- Rename config.json to config_lmsys.json, same as step (c).
- Download the LLaVA-13B config.json in step (d).
Step 4: Convert checkpoints to CS Model Zoo format using checkpoint converter
- Checkpoint conversion script: modelzoo/tools/convert_checkpoint.py
- LLaVA Model checkpoint converter: modelzoo/tools/checkpoint_converters/llava.py
- Command to list the available converters and their supported formats:
  python modelzoo/tools/convert_checkpoint.py list
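The invocation below is a hypothetical skeleton only; substitute the converter name, source/target formats, and paths reported by the list command:

```bash
# Hypothetical skeleton of a conversion command; flag names may differ
python modelzoo/tools/convert_checkpoint.py convert \
    --model llava --src-fmt hf --tgt-fmt cs-current \
    --output-dir /<path>/llava_cs_ckpt \
    /<path>/llava_hf_ckpts
```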
Step 5: Training the model on CS system or GPU using run.py
IMPORTANT: See the following notes before proceeding further.
Parameter settings in YAML config file: The config YAML files are located in the configs directory. Before starting a training run, make sure that the YAML configs used have the following set correctly:
- The train_input.data_dir parameter points to the correct dataset.
- The train_input.img_data_dir parameter points to the correct parent directory containing all images needed by the dataset.
- The train_input.image_size parameter corresponds to the image size of the dataset.
- Also change the sizes in train_input.transforms appropriately if train_input.image_size is updated.
- The model.image_model.image_size parameter points to the image size passed to ViTModel.
- The model.image_model.patch_size parameter can be changed to use different patch sizes.
- model.freeze contains the regex patterns used to freeze the appropriate layers in the model.
- The model.image_feature_select_layer_idx parameter specifies the image_model encoder layer from which features are extracted for the input image.
- The model.image_start_idx parameter should be set based on the data_params.json file that is saved when H5 files are generated using create_hdf5_dataset.py. In general:
  - Phase-1: model.image_start_idx: 1
  - Phase-2 with dataset.system_prompt_style: vicuna_v1: model.image_start_idx: 35
The run commands below use /path/to/yaml, /path/to/model_dir, and train as placeholders for user-supplied inputs:
- /path/to/yaml is the path to the YAML config file with model parameters, such as one of the configurations described in the configs included for this model.
- /path/to/model_dir is the path to the directory where we would like to store the logs and other artifacts of the run.
- --mode specifies the desired mode to run the model in. Change it to --mode eval to run in eval mode.
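To compile/validate or launch training on a Cerebras system, the invocation typically looks like the sketch below; the CSX target and any cluster-specific flags (mount directories, credentials) are assumptions that depend on your deployment:

```bash
# Hypothetical launch on a Cerebras cluster; cluster-specific flags omitted
python run.py CSX \
    --params /path/to/yaml \
    --model_dir /path/to/model_dir \
    --mode train
```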
To run train and eval on GPU/CPU
If running on a CPU or GPU, activate the environment from the Python GPU Environment setup, and simply run:
- Phase-1 (Pre-training for Feature alignment) stage
  - To launch this phase, we initialize the model using the converted checkpoint from Step 4
  - Command: see the sketch below
- Phase-2 (Instruction Finetuning) stage
  - When instruction finetuning, the model is initialized from the Phase-1 checkpoint.
  - Command: see the sketch below
  - Note: The Phase-2 yaml should only load model states from the Phase-1 checkpoint, by setting the yaml flag runconfig.load_checkpoint_states: "model"
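Illustrative GPU/CPU launch commands for the two phases; the GPU target and the --checkpoint_path flag are assumptions based on common ModelZoo run.py usage, and all paths are placeholders:

```bash
# Phase-1: initialize from the checkpoint converted in Step 4 (hypothetical paths and flags)
python run.py GPU \
    --params configs/params_llava_v1p5_pretrain_7b_phase1_MSL2K.yaml \
    --model_dir /path/to/phase1_model_dir \
    --mode train \
    --checkpoint_path /<path>/llava_cs_ckpt/checkpoint.mdl

# Phase-2: initialize from the Phase-1 checkpoint; the Phase-2 yaml sets
# runconfig.load_checkpoint_states: "model" so that only model states are loaded
python run.py GPU \
    --params configs/params_llava_v1p5_pretrain_7b_phase2_MSL2K.yaml \
    --model_dir /path/to/phase2_model_dir \
    --mode train \
    --checkpoint_path /path/to/phase1_checkpoint.mdl
```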
Step 6: Convert checkpoint to source code repository format to run eval
We perform evaluations on multimodal benchmarks using the LLaVA source code repository. For this, we need to convert the checkpoints generated by the Phase-2 training run to the LLaVA source code repository format. This can be done using the checkpoint converter command (see Step 4); the conversion produces image_model and text_model folders under output-dir, as shown below:
- Folder image_model consists of the weights for vision_tower in the source repository.
- Folder text_model consists of the weights to be loaded for the language model and projectors.
- The LLaVA source code repository expects the tokenizer files to be present along with the language model weights (code pointer). For this, please copy the tokenizer files into the text_model folder.
- Also, please make sure text_model.mm_vision_tower points to the image_model path to ensure the weights from the image_model folder are loaded into the source code vision_tower. This path is automatically added during checkpoint conversion.
- Rename the folder text_model to text_model_llava. This is because the source code repository expects the path to include the llava keyword in order to correctly load the checkpoints (code pointers: builder.py, mm_utils.py).
- After the relevant tokenizer files are copied, the output-dir should look like the sketch below.
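A hypothetical final layout of output-dir; the exact file names are illustrative and depend on the model size and tokenizer:

```
output-dir/
├── image_model/                  # vision_tower weights for the source repository
│   ├── config.json
│   └── pytorch_model.bin
└── text_model_llava/             # language model + projector weights (renamed from text_model)
    ├── config.json
    ├── pytorch_model.bin
    ├── tokenizer.model           # tokenizer files copied in manually
    ├── tokenizer_config.json
    └── special_tokens_map.json
```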
Step 7: Set up source code repository for benchmark evaluation and run evaluation benchmarks
- Set up the LLaVA source code repository for multimodal benchmark evaluation by following the instructions in the Evaluation Docs.
- Instructions for creating the conda environment and setting up the repository are in the Installation section.
- Scripts to run the various benchmarks are provided here.
- Pass the text_model_llava folder path to --model-path in the eval scripts of the LLaVA source code repository.
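For example, one of the benchmark scripts from the LLaVA repository might be run as below; the script name is illustrative, and you would first edit it so that --model-path points to the converted checkpoint:

```bash
# Hypothetical: run one of the LLaVA v1.5 eval scripts after editing it so that
# --model-path points to /<path>/output-dir/text_model_llava
bash scripts/v1_5/eval/mme.sh
```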
DataLoader Features Dictionary
LlavaHDF5MapDataProcessor outputs a features dictionary with the following keys/values:
- image_data: Image tensor
  - Shape: (batch_size, model.num_channels, model.image_model.image_size[0], model.image_model.image_size[1])
  - Type: torch.float16
- labels: Text input tokens to be predicted by the model
  - Shape: (batch_size, model.text_model.max_sequence_length)
  - Type: torch.int32
- key_padding_mask: Mask indicating the positions of image tokens. Used in conjunction with the causal attention mask (generated on the fly).
  - Shape: (batch_size, model.text_model.max_sequence_length)
  - Type: torch.int32; 1 at positions where we DO NOT want to attend, 0 otherwise
- text_input_ids: Tensor of input text tokens. These include <pad> tokens inserted in the [model.image_start_idx : model.image_start_idx + num_patches] positions
  - Shape: (batch_size, model.text_model.max_sequence_length)
  - Type: torch.int32
- loss_mask: Mask indicating the positions to consider when computing the loss.
  - Shape: (batch_size, model.text_model.max_sequence_length)
  - Type: torch.int32; 1 at positions where we want to compute the loss, 0 otherwise
Implementation notes
The following modifications and assumptions are made in this implementation:
- Phase-2 instruction-finetuning data includes samples that contain text-only data. For these cases, we pass a dummy image and make sure we do not attend to the dummy image features, using src_key_padding_mask from the dataloader.
- Our preprocessing scripts and model definitions assume that the image occurs at a fixed location within the context length. This is specified in the model definition using the model.image_start_idx parameter in the yaml.
- We currently support datasets that contain a single image per sample.
- We currently do not support interleaving of multiple images with text.
- We currently expect all the images under a single parent folder, with the relative paths of images from different datasets written under image_key in the generated H5 files (see the example layout in Dataset Download and Preprocessing).
References
- LLaVA-v1: Visual Instruction Tuning
- LLaVA-v1.5: Improved Baselines with Visual Instruction Tuning
- LLaVA source code repository
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
- SynthDog-EN: OCR-Free Document Understanding Transformer
- ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
- DVQA: Understanding Data Visualizations via Question Answering
- AI2D: A Diagram Is Worth A Dozen Images
- ArxivQA & ArxivCap: Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
- DocVQA: A Dataset for VQA on Document Images

