You can organize your data in two ways:

  • Single Dataset: All files in one input directory
  • Multiple Subsets: Files organized in separate subdirectories

Setting Up Your Dataset

Option 1: Single Dataset Setup

Place all your data files in one directory:

input_dir/
├── file1.jsonl.zst
└── file2.jsonl.zst

Option 2: Multiple Subsets Setup

Organize files into subdirectories that represent different subsets:

input_dir/
├── subset_1/
│   └── file1.jsonl.zst
└── subset_2/
    └── file2.jsonl.zst

Configuration Options

Basic Configuration

Your configuration file needs these essential parameters:

setup:
  data:
    source: "./input_dir"    # Path to your input data
    type: "local"            # Data source type
  output_dir: "./output_dir" # Where processed data will be stored
  data_splits_dir: "./data_splits" # Directory for split datasets
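
Before running the pipeline, you can load this block yourself and confirm the paths resolve. The following is a minimal sketch using PyYAML; the config.yaml filename is an assumption, so substitute the name of your own configuration file.

# Minimal sanity check of the setup block (assumes PyYAML is installed)
import os
import yaml

with open("config.yaml") as f:        # hypothetical filename; use your own config
    cfg = yaml.safe_load(f)

data_cfg = cfg["setup"]["data"]
assert data_cfg["type"] == "local", "this sketch only covers local data sources"
assert os.path.isdir(data_cfg["source"]), f"input directory not found: {data_cfg['source']}"

print("input:  ", data_cfg["source"])
print("output: ", cfg["setup"]["output_dir"])
print("splits: ", cfg["setup"]["data_splits_dir"])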

Defining Data Splits

You can split your dataset in three ways:

  1. Data Splits Only - Divide your data into train/validation/test sets:
data_splits:
  train:
    split_fraction: 0.8      # 80% of data goes to training
  val:
    split_fraction: 0.2      # 20% of data goes to validation
  2. Context Splits Only - Split by maximum sequence length (MSL):
context_splits:
  MSL_List: [128, 512]       # Sequence length options
  split_fractions: [0.5, 0.5] # Distribution between lengths
  3. Combined Splits - Split by both data split and maximum sequence length (a worked example of the resulting fractions follows this list):
data_splits:
  train:
    split_fraction: 0.8
    context_splits:
      MSL_List: [128, 512]
      split_fractions: [0.5, 0.5]
  val:
    split_fraction: 0.2
    context_splits:
      MSL_List: [128, 512]
      split_fractions: [0.6, 0.4]
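
With the combined configuration above, the share of documents that ends up in each split/MSL bucket is the product of the two fractions: train at msl_128 receives roughly 0.8 × 0.5 = 40% of the data, while val at msl_512 receives roughly 0.2 × 0.4 = 8%. The short sketch below just reproduces that arithmetic for the example values; it is an illustration, not part of the tool.

# Expected document share per split/MSL bucket for the combined example above
data_splits = {
    "train": {"split_fraction": 0.8, "MSL_List": [128, 512], "split_fractions": [0.5, 0.5]},
    "val":   {"split_fraction": 0.2, "MSL_List": [128, 512], "split_fractions": [0.6, 0.4]},
}

for split, spec in data_splits.items():
    for msl, frac in zip(spec["MSL_List"], spec["split_fractions"]):
        share = spec["split_fraction"] * frac
        print(f"{split}/msl_{msl}: {share:.0%}")   # e.g. train/msl_128: 40%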

Multiple Subsets Configuration

To process multiple subsets, add this to your configuration:

setup:
  data:
    source: "./input_dir"
    type: "local"
    top_level_as_subsets: True  # Process each top-level subdirectory as a separate subset
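
Conceptually, this option treats every immediate subdirectory of the input directory as its own dataset to split and process. The sketch below only illustrates that discovery step; it is not the tool's actual implementation.

# Illustration of top_level_as_subsets: each immediate subdirectory becomes a subset
from pathlib import Path

input_dir = Path("./input_dir")
subsets = [d for d in sorted(input_dir.iterdir()) if d.is_dir()]

for subset in subsets:
    files = sorted(subset.glob("*.jsonl.zst"))
    print(f"subset {subset.name}: {len(files)} file(s)")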

How Splitting Works

Random Sampling Process

The system uses deterministic random sampling to assign documents to splits:

  1. For data splits: Each document is randomly assigned to train/val/test based on the split_fraction values
  2. For context splits: Each document is assigned a maximum sequence length from the MSL_List based on split_fractions
  3. For combined splits: A two-level sampling process first assigns the data split, then the context length

For reproducibility, all sampling uses a fixed random seed that you can set:

processing:
  split_seed: 42  # Set your own seed for reproducible results
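
To make the two-level idea concrete, the sketch below performs an equivalent assignment: a per-document random generator derived from the seed first draws the data split, then draws the maximum sequence length within that split. This is an illustration of the behavior described above, not the tool's actual sampling code.

# Sketch of deterministic two-level assignment of documents to split/MSL buckets
import random

SPLIT_SEED = 42
data_splits = {"train": 0.8, "val": 0.2}
context_splits = {
    "train": {128: 0.5, 512: 0.5},
    "val":   {128: 0.6, 512: 0.4},
}

def assign(doc_index: int) -> tuple[str, int]:
    # Seeding per document from the global seed keeps assignments reproducible across runs
    rng = random.Random(f"{SPLIT_SEED}-{doc_index}")
    split = rng.choices(list(data_splits), weights=list(data_splits.values()))[0]
    msl = rng.choices(list(context_splits[split]), weights=list(context_splits[split].values()))[0]
    return split, msl

print(assign(0))   # same (split, msl) for the same seed on every run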

Output Structure

After processing, your output directory will contain:

For a single dataset:

output_dir/
├── train/
│   ├── msl_128/  # Contains files processed with 128 max sequence length
│   └── msl_512/  # Contains files processed with 512 max sequence length
└── val/
    ├── msl_128/
    └── msl_512/

For multiple subsets:

output_dir/
├── subset_1/
│   ├── train/
│   │   ├── msl_128/
│   │   └── msl_512/
│   └── val/
│       ├── msl_128/
│       └── msl_512/
└── subset_2/
    ├── train/
    └── val/
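
To confirm a finished run produced the layout you expect, a short walk over the output directory is enough. The sketch below assumes only the directory structure shown above.

# List every msl_* bucket under output_dir and count the files inside it
from pathlib import Path

output_dir = Path("./output_dir")
for msl_dir in sorted(output_dir.glob("**/msl_*")):
    n_files = sum(1 for p in msl_dir.iterdir() if p.is_file())
    print(f"{msl_dir.relative_to(output_dir)}: {n_files} file(s)")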

Complete Configuration Example

setup:
  data:
    source: "./input_dir"
    type: "local"
    top_level_as_subsets: True  # Remove this line for single dataset processing
  output_dir: "./output_dir"
  data_splits_dir: "./data_splits"
  data_splits:
    train:
      split_fraction: 0.8
      context_splits:
        MSL_List: [128, 512]
        split_fractions: [0.5, 0.5]
    val:
      split_fraction: 0.2
      context_splits:
        MSL_List: [128, 512]
        split_fractions: [0.6, 0.4]

processing:
  split_seed: 42
  # Additional processing parameters go here
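
A common mistake is fractions that do not add up. The check below assumes, and this is an assumption rather than documented behavior, that split_fraction values across splits and split_fractions within each context_splits block should each sum to 1.0; the config.yaml filename is also hypothetical.

# Sanity-check the split fractions in a configuration like the one above
import math
import yaml

with open("config.yaml") as f:        # hypothetical filename; use your own config
    cfg = yaml.safe_load(f)

splits = cfg["setup"]["data_splits"]
total = sum(s["split_fraction"] for s in splits.values())
assert math.isclose(total, 1.0), f"data split fractions sum to {total}, expected 1.0"

for name, spec in splits.items():
    ctx = spec.get("context_splits")
    if ctx:
        assert len(ctx["MSL_List"]) == len(ctx["split_fractions"]), f"{name}: MSL_List and split_fractions differ in length"
        ctx_total = sum(ctx["split_fractions"])
        assert math.isclose(ctx_total, 1.0), f"{name}: context split fractions sum to {ctx_total}"

print("split configuration looks consistent")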