You can organize your data in two ways:
- Single Dataset: All files in one input directory
- Multiple Subsets: Files organized in separate subdirectories
Setting Up Your Dataset
Option 1: Single Dataset Setup
Place all your data files in one directory:
```
input_dir/
├── file1.jsonl.zst
└── file2.jsonl.zst
```
Option 2: Multiple Subsets Setup
Organize files into subdirectories that represent different subsets:
```
input_dir/
├── subset_1/
│   └── file1.jsonl.zst
└── subset_2/
    └── file2.jsonl.zst
```
Configuration Options
Basic Configuration
Your configuration file needs these essential parameters:
```yaml
setup:
  data:
    source: "./input_dir"            # Path to your input data
    type: "local"                    # Data source type
  output_dir: "./output_dir"         # Where processed data will be stored
  data_splits_dir: "./data_splits"   # Directory for split datasets
```
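If you want to sanity-check these values before launching a run, a short standalone script can do it. The snippet below is a minimal sketch written for this guide, not part of the tool itself; it assumes PyYAML is installed and that the configuration above is saved as `config.yaml`.
```python
# Minimal sketch (not part of the tool): load the YAML above and check the
# paths it references before kicking off preprocessing. Assumes PyYAML is
# installed and the config is saved as "config.yaml".
import os
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

setup = cfg["setup"]
source = setup["data"]["source"]

# Fail early if the input directory is missing or (apparently) empty.
if not os.path.isdir(source):
    raise FileNotFoundError(f"Input directory not found: {source}")
if not any(name.endswith(".jsonl.zst") for name in os.listdir(source)):
    print(f"Warning: no .jsonl.zst files found directly under {source}")

# Create the output directories if they do not exist yet (assumption:
# the preprocessing tool tolerates pre-existing directories).
os.makedirs(setup["output_dir"], exist_ok=True)
os.makedirs(setup["data_splits_dir"], exist_ok=True)
print("Config looks sane:", source, "->", setup["output_dir"])
```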
Defining Data Splits
You can split your dataset in three ways:
- Data Splits Only - Divide your data into train/validation/test sets:
```yaml
data_splits:
  train:
    split_fraction: 0.8   # 80% of data goes to training
  val:
    split_fraction: 0.2   # 20% of data goes to validation
```
- Context Splits Only - Split by maximum sequence length:
```yaml
context_splits:
  MSL_List: [128, 512]          # Sequence length options
  split_fractions: [0.5, 0.5]   # Distribution between lengths
```
- Combined Splits - Split by both data type and sequence length:
```yaml
data_splits:
  train:
    split_fraction: 0.8
    context_splits:
      MSL_List: [128, 512]
      split_fractions: [0.5, 0.5]
  val:
    split_fraction: 0.2
    context_splits:
      MSL_List: [128, 512]
      split_fractions: [0.6, 0.4]
```
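A few consistency rules are implied by the examples above: the split_fraction values across data splits sum to 1.0, each split_fractions list sums to 1.0, and MSL_List has one entry per fraction. The check below is an illustrative sketch written for this guide (it is not built into the tool) and hard-codes the combined-splits example from above.
```python
# Illustrative consistency check (an assumption of this guide, not the
# tool's own validation): fractions sum to 1.0 and each MSL_List lines up
# one-to-one with its split_fractions.
import math

data_splits = {
    "train": {"split_fraction": 0.8,
              "context_splits": {"MSL_List": [128, 512],
                                 "split_fractions": [0.5, 0.5]}},
    "val":   {"split_fraction": 0.2,
              "context_splits": {"MSL_List": [128, 512],
                                 "split_fractions": [0.6, 0.4]}},
}

total = sum(s["split_fraction"] for s in data_splits.values())
assert math.isclose(total, 1.0), f"data split fractions sum to {total}, expected 1.0"

for name, spec in data_splits.items():
    ctx = spec["context_splits"]
    assert len(ctx["MSL_List"]) == len(ctx["split_fractions"]), \
        f"{name}: MSL_List and split_fractions must be the same length"
    assert math.isclose(sum(ctx["split_fractions"]), 1.0), \
        f"{name}: context split fractions must sum to 1.0"

print("Split definitions are consistent.")
```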
Multiple Subsets Configuration
To process multiple subsets, add this to your configuration:
```yaml
setup:
  data:
    source: "./input_dir"
    type: "local"
    top_level_as_subsets: True   # Process each subdirectory as a separate dataset
```
How Splitting Works
Random Sampling Process
The system uses deterministic random sampling to assign documents to splits:
- For data splits: each document is randomly assigned to train/val/test based on the `split_fraction` values
- For context splits: each document is assigned a maximum sequence length from the `MSL_List` based on the `split_fractions`
- For combined splits: a two-level sampling process first assigns the data split, then the context length
For reproducibility, all sampling uses a fixed random seed that you can set:
```yaml
processing:
  split_seed: 42   # Set your own seed for reproducible results
```
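To make the two-level process concrete, here is a rough sketch of how a single document could be assigned first to a data split and then to a context length using a seeded random generator. It mirrors the description above but is an illustration only; the tool's actual implementation may differ, and seeding per document id is an assumption of this sketch.
```python
# Illustrative sketch of seeded two-level sampling (not the tool's code).
# Each document gets a data split first, then a maximum sequence length.
import random

SPLIT_SEED = 42
data_splits = {"train": 0.8, "val": 0.2}
context_splits = {
    "train": {"MSL_List": [128, 512], "split_fractions": [0.5, 0.5]},
    "val":   {"MSL_List": [128, 512], "split_fractions": [0.6, 0.4]},
}

def assign(doc_id: str) -> tuple[str, int]:
    # Seed per document so the assignment depends only on the seed and the
    # document id, not on processing order (an assumption of this sketch).
    rng = random.Random(f"{SPLIT_SEED}:{doc_id}")
    # Level 1: pick the data split according to the split_fraction weights.
    split = rng.choices(list(data_splits), weights=list(data_splits.values()))[0]
    # Level 2: pick the maximum sequence length for that split.
    ctx = context_splits[split]
    msl = rng.choices(ctx["MSL_List"], weights=ctx["split_fractions"])[0]
    return split, msl

for doc in ["doc_0", "doc_1", "doc_2"]:
    print(doc, assign(doc))  # same seed -> same assignments on every run
```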
Output Structure
After processing, your output directory will contain:
For a single dataset:
```
output_dir/
├── train/
│   ├── msl_128/   # Contains files processed with 128 max sequence length
│   └── msl_512/   # Contains files processed with 512 max sequence length
└── val/
    ├── msl_128/
    └── msl_512/
```
For multiple subsets:
```
output_dir/
├── subset_1/
│   ├── train/
│   │   ├── msl_128/
│   │   └── msl_512/
│   └── val/
│       ├── msl_128/
│       └── msl_512/
└── subset_2/
    ├── train/
    └── val/
```
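After a run finishes, you may want a quick summary of what landed where. The helper below is a small sketch written for this guide (not shipped with the tool); it simply walks the output tree and counts files per split/MSL directory, whatever their format.
```python
# Sketch: summarize how many files ended up in each split/MSL directory.
import os

output_dir = "./output_dir"

for root, dirs, files in os.walk(output_dir):
    # Report only leaf directories named msl_<length> that contain files.
    if os.path.basename(root).startswith("msl_") and files:
        rel = os.path.relpath(root, output_dir)
        print(f"{rel}: {len(files)} file(s)")
```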
Complete Configuration Example
```yaml
setup:
  data:
    source: "./input_dir"
    type: "local"
    top_level_as_subsets: True   # Remove this line for single dataset processing
  output_dir: "./output_dir"
  data_splits_dir: "./data_splits"

data_splits:
  train:
    split_fraction: 0.8
    context_splits:
      MSL_List: [128, 512]
      split_fractions: [0.5, 0.5]
  val:
    split_fraction: 0.2
    context_splits:
      MSL_List: [128, 512]
      split_fractions: [0.6, 0.4]

processing:
  split_seed: 42
  # Additional processing parameters go here
```