You can organize your data in two ways:

  • Single Dataset: All files in one input directory
  • Multiple Subsets: Files organized in separate subdirectories

Setting Up Your Dataset

Option 1: Single Dataset Setup

Place all your data files in one directory:

input_dir/
├── file1.jsonl.zst
└── file2.jsonl.zst

Option 2: Multiple Subsets Setup

Organize files into subdirectories that represent different subsets:

input_dir/
├── subset_1/
│   └── file1.jsonl.zst
└── subset_2/
    └── file2.jsonl.zst

Configuration Options

Basic Configuration

Your configuration file needs these essential parameters:

setup:
  data:
    source: "./input_dir"    # Path to your input data
    type: "local"            # Data source type
  output_dir: "./output_dir" # Where processed data will be stored
  data_splits_dir: "./data_splits" # Directory for split datasets
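
Before running the pipeline, you can load this block yourself and confirm the paths resolve. The following is a minimal sketch using PyYAML; the config.yaml filename is an assumption, so substitute the name of your own configuration file.

# Minimal sanity check of the setup block (assumes PyYAML is installed)
import os
import yaml

with open("config.yaml") as f:        # hypothetical filename; use your own config
    cfg = yaml.safe_load(f)

data_cfg = cfg["setup"]["data"]
assert data_cfg["type"] == "local", "this sketch only covers local data sources"
assert os.path.isdir(data_cfg["source"]), f"input directory not found: {data_cfg['source']}"

print("input:  ", data_cfg["source"])
print("output: ", cfg["setup"]["output_dir"])
print("splits: ", cfg["setup"]["data_splits_dir"])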

Defining Data Splits

You can split your dataset in three ways:

  1. Data Splits Only - Divide your data into train/validation/test sets:
data_splits:
  train:
    split_fraction: 0.8      # 80% of data goes to training
  val:
    split_fraction: 0.2      # 20% of data goes to validation
  2. Context Splits Only - Split by maximum sequence length (MSL):
context_splits:
  MSL_List: [128, 512]       # Sequence length options
  split_fractions: [0.5, 0.5] # Distribution between lengths
  3. Combined Splits - Split by both data split and maximum sequence length (a worked example of the resulting fractions follows this list):
data_splits:
  train:
    split_fraction: 0.8
    context_splits:
      MSL_List: [128, 512]
      split_fractions: [0.5, 0.5]
  val:
    split_fraction: 0.2
    context_splits:
      MSL_List: [128, 512]
      split_fractions: [0.6, 0.4]
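
With the combined configuration above, the share of documents that ends up in each split/MSL bucket is the product of the two fractions: train at msl_128 receives roughly 0.8 × 0.5 = 40% of the data, while val at msl_512 receives roughly 0.2 × 0.4 = 8%. The short sketch below just reproduces that arithmetic for the example values; it is an illustration, not part of the tool.

# Expected document share per split/MSL bucket for the combined example above
data_splits = {
    "train": {"split_fraction": 0.8, "MSL_List": [128, 512], "split_fractions": [0.5, 0.5]},
    "val":   {"split_fraction": 0.2, "MSL_List": [128, 512], "split_fractions": [0.6, 0.4]},
}

for split, spec in data_splits.items():
    for msl, frac in zip(spec["MSL_List"], spec["split_fractions"]):
        share = spec["split_fraction"] * frac
        print(f"{split}/msl_{msl}: {share:.0%}")   # e.g. train/msl_128: 40%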

Multiple Subsets Configuration

To process multiple subsets, add this to your configuration:

setup:
  data:
    source: "./input_dir"
    type: "local"
    top_level_as_subsets: True  # Process each top-level subdirectory as a separate subset
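
Conceptually, this option treats every immediate subdirectory of the input directory as its own dataset to split and process. The sketch below only illustrates that discovery step; it is not the tool's actual implementation.

# Illustration of top_level_as_subsets: each immediate subdirectory becomes a subset
from pathlib import Path

input_dir = Path("./input_dir")
subsets = [d for d in sorted(input_dir.iterdir()) if d.is_dir()]

for subset in subsets:
    files = sorted(subset.glob("*.jsonl.zst"))
    print(f"subset {subset.name}: {len(files)} file(s)")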

How Splitting Works

Random Sampling Process

The system uses deterministic random sampling to assign documents to splits:

  1. For data splits: Each document is randomly assigned to train/val/test based on the split_fraction values
  2. For context splits: Each document is assigned a maximum sequence length from the MSL_List based on split_fractions
  3. For combined splits: A two-level sampling process first assigns the data split, then the context length

For reproducibility, all sampling uses a fixed random seed that you can set:

processing:
  split_seed: 42  # Set your own seed for reproducible results
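
To make the two-level idea concrete, the sketch below performs an equivalent assignment: a per-document random generator derived from the seed first draws the data split, then draws the maximum sequence length within that split. This is an illustration of the behavior described above, not the tool's actual sampling code.

# Sketch of deterministic two-level assignment of documents to split/MSL buckets
import random

SPLIT_SEED = 42
data_splits = {"train": 0.8, "val": 0.2}
context_splits = {
    "train": {128: 0.5, 512: 0.5},
    "val":   {128: 0.6, 512: 0.4},
}

def assign(doc_index: int) -> tuple[str, int]:
    # Seeding per document from the global seed keeps assignments reproducible across runs
    rng = random.Random(f"{SPLIT_SEED}-{doc_index}")
    split = rng.choices(list(data_splits), weights=list(data_splits.values()))[0]
    msl = rng.choices(list(context_splits[split]), weights=list(context_splits[split].values()))[0]
    return split, msl

print(assign(0))   # same (split, msl) for the same seed on every run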

Output Structure

After processing, your output directory will contain:

For a single dataset:

output_dir/
├── train/
│   ├── msl_128/  # Contains files processed with 128 max sequence length
│   └── msl_512/  # Contains files processed with 512 max sequence length
└── val/
    ├── msl_128/
    └── msl_512/

For multiple subsets:

output_dir/
├── subset_1/
│   ├── train/
│   │   ├── msl_128/
│   │   └── msl_512/
│   └── val/
│       ├── msl_128/
│       └── msl_512/
└── subset_2/
    ├── train/
    └── val/
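
To confirm a finished run produced the layout you expect, a short walk over the output directory is enough. The sketch below assumes only the directory structure shown above.

# List every msl_* bucket under output_dir and count the files inside it
from pathlib import Path

output_dir = Path("./output_dir")
for msl_dir in sorted(output_dir.glob("**/msl_*")):
    n_files = sum(1 for p in msl_dir.iterdir() if p.is_file())
    print(f"{msl_dir.relative_to(output_dir)}: {n_files} file(s)")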

Complete Configuration Example

setup:
  data:
    source: "./input_dir"
    type: "local"
    top_level_as_subsets: True  # Remove this line for single dataset processing
  output_dir: "./output_dir"
  data_splits_dir: "./data_splits"
  data_splits:
    train:
      split_fraction: 0.8
      context_splits:
        MSL_List: [128, 512]
        split_fractions: [0.5, 0.5]
    val:
      split_fraction: 0.2
      context_splits:
        MSL_List: [128, 512]
        split_fractions: [0.6, 0.4]

processing:
  split_seed: 42
  # Additional processing parameters go here
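
A common mistake is fractions that do not add up. The check below assumes, and this is an assumption rather than documented behavior, that split_fraction values across splits and split_fractions within each context_splits block should each sum to 1.0; the config.yaml filename is also hypothetical.

# Sanity-check the split fractions in a configuration like the one above
import math
import yaml

with open("config.yaml") as f:        # hypothetical filename; use your own config
    cfg = yaml.safe_load(f)

splits = cfg["setup"]["data_splits"]
total = sum(s["split_fraction"] for s in splits.values())
assert math.isclose(total, 1.0), f"data split fractions sum to {total}, expected 1.0"

for name, spec in splits.items():
    ctx = spec.get("context_splits")
    if ctx:
        assert len(ctx["MSL_List"]) == len(ctx["split_fractions"]), f"{name}: MSL_List and split_fractions differ in length"
        ctx_total = sum(ctx["split_fractions"])
        assert math.isclose(ctx_total, 1.0), f"{name}: context split fractions sum to {ctx_total}"

print("split configuration looks consistent")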