Learn how to preprocess text-only and multimodal datasets into HDF5 format for pre-training and fine-tuning.
.jsonl, .json.gz, .jsonl.zst, .jsonl.zst.tar
.parquet
.txt
.fasta
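As a rough illustration of the format list above, the sketch below checks whether a file name ends in one of the listed extensions. The function name `detect_format` and the matching rule (case-insensitive, longest suffix first) are assumptions made for this example; they are not taken from the preprocessing pipeline itself.

```python
from pathlib import Path

# Extensions copied from the list above. Longer suffixes come first so that
# ".jsonl.zst.tar" is matched before ".jsonl.zst".
SUPPORTED_EXTENSIONS = (
    ".jsonl.zst.tar",
    ".jsonl.zst",
    ".json.gz",
    ".jsonl",
    ".parquet",
    ".txt",
    ".fasta",
)


def detect_format(path):
    """Return the matching extension from the list, or None if none matches.

    Hypothetical helper for illustration only; the real pipeline may
    detect formats differently.
    """
    name = Path(path).name.lower()
    for ext in SUPPORTED_EXTENSIONS:
        if name.endswith(ext):
            return ext
    return None
```

A check like this can be run over an input directory before preprocessing to spot files that would likely be skipped or rejected.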
KeyError: 'tags'
This issue occurs due to an outdated version of the huggingface_hub package. To resolve it, update the package by running:
pip install --upgrade huggingface_hub==0.26.1
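To see whether the installed package predates the pinned release before upgrading, a small check along the following lines may help. It is a minimal sketch: the helper names are hypothetical, the version parsing is deliberately simple (leading digits of each dot-separated part), and treating anything below 0.26.1 as "outdated" is an assumption based on the version pinned in the command above.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '0.26.1' into (0, 26, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_outdated(installed, required="0.26.1"):
    """Assumption: versions below the pinned release count as outdated."""
    return parse_version(installed) < parse_version(required)


def installed_hub_version():
    """Return the installed huggingface_hub version string, or None."""
    try:
        return version("huggingface_hub")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_hub_version()
    if found is None:
        print("huggingface_hub is not installed")
    elif is_outdated(found):
        print(f"huggingface_hub {found} is older than 0.26.1; consider upgrading")
    else:
        print(f"huggingface_hub {found} is installed")
```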
setup, processing, and dataset.
python preprocess_data.py --config /path/to/configuration/file
python launch_tokenflow.py --output_dir <directory/of/file(s)>
In your terminal, you will see a URL like http://172.31.48.239:5000. Copy and paste it into your browser to launch TokenFlow, a tool for interactively visualizing whether loss and attention masks were applied correctly.
Arguments
output_dir: Contains the file(s) you can view in the GUI. [Required]
data_params: Location of the data_params.json file for the preprocessed dataset. [Optional]
port: Use this to specify a different port for the Flask server. [Optional, default=5000]
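The arguments above can be assembled into a launch command programmatically, for example when starting TokenFlow from another script. The sketch below only builds the argument list. The `--output_dir` flag appears in the command shown earlier; the spellings `--data_params` and `--port` are assumed from the argument names and should be checked against the script's help output. The helper name `build_tokenflow_command` is hypothetical.

```python
import shlex


def build_tokenflow_command(output_dir, data_params=None, port=None):
    """Build the argument list for launching TokenFlow.

    Assumption: the optional arguments are passed as --data_params and
    --port, mirroring the documented --output_dir flag.
    """
    cmd = ["python", "launch_tokenflow.py", "--output_dir", str(output_dir)]
    if data_params is not None:
        cmd += ["--data_params", str(data_params)]
    if port is not None:
        cmd += ["--port", str(port)]
    return cmd


if __name__ == "__main__":
    command = build_tokenflow_command("path/to/output", port=8080)
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(command))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting concerns.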
input_strings and label_strings are the tokens decoded from input_ids and labels, respectively. In the string sections, a token is highlighted in green when its loss weight is greater than zero and in red when its attention mask is set to zero. For multimodal datasets, hovering over an image pad token also displays the corresponding image in a popup window.
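The highlighting rule described above can be expressed as a short per-token check. This is a minimal sketch of the stated rule, not TokenFlow's actual implementation: the function name `token_highlights` is hypothetical, and because the text does not say which color wins when both conditions hold, the sketch reports both rather than assuming a precedence.

```python
def token_highlights(loss_weights, attention_mask):
    """Return one set of highlight colors per token.

    "green": loss weight is greater than zero.
    "red":   attention mask is zero.
    """
    if len(loss_weights) != len(attention_mask):
        raise ValueError("loss_weights and attention_mask must have equal length")
    highlights = []
    for weight, mask in zip(loss_weights, attention_mask):
        colors = set()
        if weight > 0:
            colors.add("green")
        if mask == 0:
            colors.add("red")
        highlights.append(colors)
    return highlights


if __name__ == "__main__":
    # Three tokens: a prompt token, a completion token, and a padding token.
    for colors in token_highlights([0.0, 1.0, 0.0], [1, 1, 0]):
        print(sorted(colors))
```

Running the same check on a few samples from the preprocessed output is a quick way to confirm that only the intended tokens contribute to the loss.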