LLaVA is a multimodal model that connects a vision encoder to a language model and is trained via instruction tuning on GPT-4-generated image-text data.
The code for this model lives in the `llava` directory within the ModelZoo. Here's how it's organized:
* `configs/`: Contains YAML configuration files used for training, evaluation, and instruction fine-tuning.
* `model.py`: Defines the top-level LLaVA model class, encapsulating the vision encoder, projector, and language model modules, and orchestrating forward passes for training and inference.
* `modeling_llava.py`: Implements core building blocks, including the projector MLP, model loading utilities, and integration logic to bridge image features into the LLM token embedding space.

| Configuration | Description |
|---|---|
| `params_llava_v1p5_pretrain_13b_phase1_MSL2K.yaml` | Phase 1: Feature alignment pretraining for the 13B model (MSL=2048). |
| `params_llava_v1p5_pretrain_13b_phase2_MSL2K.yaml` | Phase 2: Instruction fine-tuning for the 13B model (MSL=2048). |
| `params_llava_v1p5_pretrain_7b_phase1_MSL2K.yaml` | Phase 1: Feature alignment pretraining for the 7B model (MSL=2048). |
| `params_llava_v1p5_pretrain_7b_phase2_MSL2K.yaml` | Phase 2: Instruction fine-tuning for the 7B model (MSL=2048). |
Most of the datasets can be downloaded from HuggingFace with `git clone`. We provide the script `preprocess_dataset.py` to further pre-process some of the Phase-1 and Phase-2 datasets into the correct LLaVA jsonl format. Please see the script's help message for the datasets it covers; more details on each individual dataset are given below. We also provide an additional utility option `convert_json2jsonl` to convert a folder of json files into jsonl files; the latter is the input format that the subsequent HDF5 processing scripts act on. The relative image paths from the jsonl files are carried through to the `image_key` field in the generated H5 files.

For example, to download the Phase-1 pretraining data:

```
git clone git@hf.co:datasets/liuhaotian/LLaVA-Pretrain
```
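As a quick way to see which datasets `preprocess_dataset.py` covers and which arguments it requires, print its help message (a minimal sketch; the exact options are defined by the script's own argument parser):

```bash
# List supported datasets and required input arguments.
python preprocess_dataset.py --help
```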
The remaining datasets are handled as follows:

* `share-captioner_coco_lcs_sam_1246k_1107.json`: download from HuggingFace.
* SynthDog-EN: place the downloaded data under `/<path>/synthdog-en` and use `preprocess_dataset.py` to process it into LLaVA jsonl format. Please see the required input arguments.
* `llava_v1_5_mix665k.json`: download from https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json.
* `sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json`: download from HuggingFace.
* ChartQA: place the downloaded data under `/<path>/ChartQA_Dataset` and use `preprocess_dataset.py` to process it into LLaVA jsonl format. Please see the required input arguments.
* DVQA: place the downloaded data under `/<path>/DVQA` and use `preprocess_dataset.py` to process it into LLaVA jsonl format. Please see the required input arguments.
* AI2D: place the downloaded data under `/<path>/ai2d` and use `preprocess_dataset.py` to process it into LLaVA jsonl format. Please see the required input arguments.
* ArxivQA: place the downloaded data under `/<path>/ArxivQA` and use `preprocess_dataset.py` to process it into LLaVA jsonl format. Please see the required input arguments.
* ArxivCAP: place the downloaded data under `/<path>/ArxivCAP` and use `preprocess_dataset.py` to process it into LLaVA jsonl format. Please see the required input arguments.
* DocVQA (Single Document Visual Question Answering): place the downloaded data under `/<path>/DocVQA` and use `preprocess_dataset.py` to process each subset into LLaVA jsonl format. Please see the required input arguments.
Once the jsonl files are ready, use `create_hdf5_dataset.py` to create the preprocessed dataset files consumed during training with `run.py`. Further details on usage and instructions can be found in the documentation accompanying that script.
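As a sketch of H5 generation, assuming `create_hdf5_dataset.py` accepts a `--params` flag pointing at the preprocessing config (this flag is an assumption; check the script's help message for the actual interface):

```bash
# Hypothetical invocation: generate H5 files from the jsonl data
# using the preprocessing config described below.
python create_hdf5_dataset.py --params llava_phase_1_preproc.yaml
```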
Update the parameters in `llava_phase_1_preproc.yaml` appropriately (an illustrative excerpt follows this list):

* `setup.input_dir`: Input data directory containing the jsonl files.
* `setup.output_dir`: Output directory to save the generated H5 files.
* `setup.processes`: Adjust based on the cores available for parallel processing.
* `processor.tokenizer_type`: Tokenizer to use.
* `processor.max_sequence_length`: Maximum sequence length that the model is trained on. This includes the token positions used for the image data features, so the number of positions available for text tokens is `processor.max_sequence_length - dataset.num_patches - 1` (the extra one accounts for the BOS token).
* `dataset.num_patches`: Number of patches obtained after the image is patchified. This is computed from the image size and patch size of the image model; for example, a 336×336 image with 14×14 patches yields 576 patches.
* `dataset.image_dir`: Parent directory where all the images are present. It is used along with the relative path under the `image_key` field in the jsonl files to check that images exist, and examples with no image are thrown out.
* `dataset.system_prompt_style`: `vicuna_v1`. This is used to transform the instruction fine-tuning dataset into the `vicuna_v1` format with the appropriate system message and `USER` and `ASSISTANT` values. Note that we currently support `vicuna_v1` only; support for `llama` and `zephyr` is planned for future releases.
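An illustrative excerpt of `llava_phase_1_preproc.yaml` covering only the fields above (all values shown are assumptions for illustration; keep whatever the shipped config specifies for anything not discussed here):

```yaml
# Illustrative excerpt only -- values are placeholders, not shipped defaults.
setup:
  input_dir: /path/to/jsonl_dir    # jsonl files produced by preprocess_dataset.py
  output_dir: /path/to/h5_output   # generated H5 files are written here
  processes: 8                     # match the cores available on your machine

processor:
  tokenizer_type: <tokenizer-to-use>
  max_sequence_length: 2048        # MSL=2K; includes the image-feature positions

dataset:
  num_patches: 576                 # e.g. (336 / 14)^2 for a 336x336 image and 14x14 patches
  image_dir: /path/to/images       # joined with each jsonl image_key relative path
  system_prompt_style: vicuna_v1   # currently the only supported style
```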
To build the initial Cerebras config yaml and checkpoint from the HuggingFace checkpoints:

(a) Download the `image_model` (vision encoder) checkpoint.
(b) Download the `text_model` checkpoint from `lmsys/vicuna-7b-v1.5`.
(c) Rename the downloaded `config.json` to `config_lmsys.json`.
(d) Set `mm_projector_type` etc. to build the appropriate CS config yaml and checkpoint.

Note: In case of LLaVA-13B, download `text_model` from `lmsys/vicuna-13b-v1.5` in step (b) and rename its `config.json` to `config_lmsys.json`, same as step (c).

To see the list of available checkpoint converters, run:

```
python modelzoo/tools/convert_checkpoint.py list
```
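To narrow that listing down to the LLaVA entry (and the source/target format names it supports), you can filter the output; this only assumes the `list` subcommand shown above prints a line per supported model:

```bash
# Show only the LLaVA-related converter entries.
python modelzoo/tools/convert_checkpoint.py list | grep -i llava
```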
Before launching training with `run.py`, make sure that the following are set correctly (an example launch command follows this list):

* The `train_input.data_dir` parameter points to the correct dataset.
* The `train_input.img_data_dir` parameter points to the correct parent directory containing all images needed by the dataset.
* The `train_input.image_size` parameter corresponds to the image size of the dataset.
* `train_input.transforms` is updated appropriately if `train_input.image_size` is updated.
* `model.image_model.image_size` matches the image size passed to `ViTModel`.
* The `model.image_model.patch_size` parameter is updated if you want to use different patch sizes.
* `model.freeze` contains the regex patterns that freeze the appropriate layers in the model.
* The `model.image_feature_select_layer_idx` parameter specifies the `image_model` encoder layer from which features are extracted for the input image.
* The `model.image_start_idx` parameter is set based on the `data_params.json` file saved when the H5 files are generated using `create_hdf5_dataset.py`. In general, use `model.image_start_idx: 1`; for data generated with `dataset.system_prompt_style: vicuna_v1`, use `model.image_start_idx: 35`.
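The launch command itself is the standard ModelZoo `run.py` invocation; the sketch below uses the placeholders explained next, and the positional `CSX` target and exact flags are assumptions that depend on your ModelZoo release (check `python run.py --help`):

```bash
# Hypothetical launch command; substitute your own config and output directory.
python run.py CSX \
  --params /path/to/yaml \
  --mode train \
  --model_dir /path/to/model_dir
```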
In the run command, `/path/to/yaml`, `/path/to/model_dir`, and `train` are placeholders for user-supplied inputs:

* `/path/to/yaml` is a path to the YAML config file with model parameters, such as one of the configurations described in the configs included for this model.
* `/path/to/model_dir` is a path to the directory where we would like to store the logs and other artifacts of the run.
* `--mode` specifies the desired mode to run the model in. Change to `--mode eval` to run in eval mode.

Set `runconfig.load_checkpoint_states: "model"` to load only the model weights (and not the optimizer or dataloader state) from a checkpoint, for example when initializing Phase-2 instruction fine-tuning from a Phase-1 checkpoint.
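For reference, a minimal sketch of that setting in the training yaml, assuming the standard ModelZoo `runconfig` section and a hypothetical checkpoint path:

```yaml
runconfig:
  checkpoint_path: /path/to/phase1_checkpoint.mdl   # hypothetical path
  load_checkpoint_states: "model"                   # load model weights only
```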
To evaluate the trained model with the LLaVA source code repository, convert the CS checkpoint back to the source repository format. The conversion writes `image_model` and `text_model` folders under `output-dir`:

* `image_model` consists of the weights for `vision_tower` in the source repository.
* `text_model` consists of the weights to be loaded for the language model and the projectors.
* `text_model.mm_vision_tower` points to the `image_model` path to ensure the weights from the `image_model` folder are loaded into the source code `vision_tower`. This path is automatically added during checkpoint conversion and stored with the config in the `text_model` folder.
After conversion, rename `text_model` to `text_model_llava`. This is because the source code repository expects the path to include the `llava` keyword in order to correctly load the checkpoints (code pointers: builder.py, mm_utils.py).
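Concretely, the rename is a single `mv` on the conversion output (`output-dir` here is a placeholder for your conversion output directory):

```bash
# The source repository looks for "llava" in the checkpoint path, so rename the folder.
mv output-dir/text_model output-dir/text_model_llava
```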
After renaming, `output-dir` contains the `image_model` and `text_model_llava` folders.
Pass the `text_model_llava` folder path to `--model-path` in the eval scripts of the LLaVA source code repository.
The model expects the following input features from the dataloader:

| Data feature | Shape | Data type | Description |
|---|---|---|---|
| `image_data` | `(batch_size, model.num_channels, model.image_model.image_size[0], model.image_model.image_size[1])` | `torch.float16` | Image tensor. |
| `labels` | `(batch_size, model.text_model.max_sequence_length)` | `torch.int32` | Text input tokens to be predicted by the model. |
| `key_padding_mask` | `(batch_size, model.text_model.max_sequence_length)` | `torch.int32` | Mask indicating the positions of image tokens; used in conjunction with the causal attention mask (generated on the fly). `1` at positions where we do NOT want to attend, `0` otherwise. |
| `text_input_ids` | `(batch_size, model.text_model.max_sequence_length)` | `torch.int32` | Tensor of input text tokens, including `<pad>` tokens inserted at positions `[model.image_start_idx : model.image_start_idx + num_patches]`. |
| `loss_mask` | `(batch_size, model.text_model.max_sequence_length)` | `torch.int32` | Mask indicating positions to consider when computing loss. `1` at positions where we want to compute loss, `0` otherwise. |
otherwisesrc_key_padding_mask
from the dataloader.model.image_start_idx
parameter in the yamlimage_key
in the H5 files generated.
For example:
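To double-check the value to use for `model.image_start_idx`, inspect the `data_params.json` written during H5 generation; the sketch below assumes it is saved in the preprocessing output directory (`setup.output_dir`):

```bash
# Review the parameters recorded at H5-generation time.
cat /path/to/h5_output/data_params.json
```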