Dataset Splitting and Preprocessing
Learn how to split and preprocess datasets for LLM training.
You can organize your data in two ways:
- Single Dataset: All files in one input directory
- Multiple Subsets: Files organized in separate subdirectories
Setting Up Your Dataset
Option 1: Single Dataset Setup
Place all your data files in one directory:
Option 2: Multiple Subsets Setup
Organize files into subdirectories that represent different subsets:
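The two layouts can be told apart by checking whether data files sit directly in the input directory or inside subdirectories. The following is a minimal sketch of that check, not the tool's actual discovery logic: the function name `discover_subsets`, the "." key for a single dataset, and the example file names in the usage note are all hypothetical.

```python
from pathlib import Path


def discover_subsets(input_dir):
    """Map subset name -> sorted list of files found for that subset.

    Assumption for this sketch: if any file sits directly in input_dir,
    the directory is treated as a single dataset (stored under the key ".").
    Otherwise each immediate subdirectory that holds files is one subset.
    """
    root = Path(input_dir)
    top_level_files = sorted(p for p in root.iterdir() if p.is_file())
    if top_level_files:
        return {".": top_level_files}

    subsets = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(p for p in sub.iterdir() if p.is_file())
        if files:  # skip empty subdirectories
            subsets[sub.name] = files
    return subsets
```

For example, an input directory holding only `code/part0.jsonl` and `web/part0.jsonl` would yield the two subsets `code` and `web`, while a directory holding `part0.jsonl` directly would yield a single dataset.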
Configuration Options
Basic Configuration
Your configuration file needs these essential parameters:
Defining Data Splits
You can split your dataset in three ways:
- Data Splits Only - Divide your data into train/validation/test sets:
- Context Splits Only - Split by maximum sequence length:
- Combined Splits - Split by both data split and maximum sequence length:
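To see what a combined split implies, it helps to work out the share of documents expected in each (data split, sequence length) bucket. The sketch below assumes the two levels are sampled independently, so each bucket's expected share is the product of its two fractions. The function name and the example fractions are illustrative, not values taken from the tool.

```python
from itertools import product


def expected_bucket_fractions(data_splits, context_splits):
    """Expected share of documents per (data split, max sequence length).

    Assumes the data split and the context length are drawn independently,
    so the joint share is the product of the two fractions.
    """
    return {
        (name, msl): data_frac * ctx_frac
        for (name, data_frac), (msl, ctx_frac) in product(
            data_splits.items(), context_splits.items()
        )
    }


# Hypothetical fractions, chosen only for illustration.
data_splits = {"train": 0.8, "val": 0.1, "test": 0.1}
context_splits = {2048: 0.75, 8192: 0.25}

buckets = expected_bucket_fractions(data_splits, context_splits)
for (name, msl), share in sorted(buckets.items()):
    print(f"{name:>5} @ MSL {msl}: {share:.3f}")
```

With these example values, three data splits and two sequence lengths give six buckets whose expected shares sum to 1.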
Multiple Subsets Configuration
To process multiple subsets, add this to your configuration:
How Splitting Works
Random Sampling Process
The system uses deterministic random sampling to assign documents to splits:
- For data splits: Each document is randomly assigned to train/val/test based on the split_fraction values
- For context splits: Each document is assigned a maximum sequence length from the MSL_List based on split_fractions
- For combined splits: A two-level sampling process first assigns the data split, then the context length
For reproducibility, all sampling uses a fixed random seed that you can set:
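The sampling described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the preprocessing pipeline's own code: `assign_splits`, its arguments, and the dictionary-based split definitions are hypothetical, and the real tool may draw its random numbers differently.

```python
import random


def assign_splits(num_docs, data_splits=None, context_splits=None, seed=0):
    """Assign each document index a (data split, max sequence length) pair.

    data_splits:    dict of split name -> fraction, or None to skip this level.
    context_splits: dict of max sequence length -> fraction, or None to skip.
    seed:           fixed seed so the same inputs give the same assignment.
    """
    rng = random.Random(seed)  # private generator; global random state is untouched
    assignments = []
    for _ in range(num_docs):
        split = msl = None
        if data_splits:
            # First level: weighted draw of the data split.
            split = rng.choices(
                list(data_splits), weights=list(data_splits.values())
            )[0]
        if context_splits:
            # Second level: weighted draw of the maximum sequence length.
            msl = rng.choices(
                list(context_splits), weights=list(context_splits.values())
            )[0]
        assignments.append((split, msl))
    return assignments


# Same seed, same assignment on every run.
first = assign_splits(10, {"train": 0.8, "val": 0.2}, {512: 0.5, 2048: 0.5}, seed=42)
second = assign_splits(10, {"train": 0.8, "val": 0.2}, {512: 0.5, 2048: 0.5}, seed=42)
print(first == second)
```

Passing only `data_splits` or only `context_splits` reproduces the single-level cases, and passing both gives the two-level combined case. Because the fractions act as sampling weights, the realized split sizes approach the requested fractions only on average, not exactly.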
Output Structure
After processing, your output directory will contain:
For a single dataset:
For multiple subsets: