`torch.utils.data.DataLoader`). The key argument for this `DataLoader` is the `Dataset`, which specifies the source of data. PyTorch supports two primary types of Datasets:

- Map-style datasets (`Dataset`) map indices/keys to data samples. When `dataset[idx]` is accessed, it reads the `idx`-th sample, for example from a directory on disk.
- Iterable-style datasets (`IterableDataset`) represent an iterable over data samples. This style is well suited where random reads are expensive or even improbable, and where the batch size depends on the fetched data. When `iter(dataset)` is called, it returns a stream of data, for example from a database, a remote server, or even logs generated in real time.
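The two styles above can be sketched as follows. The toy `SquaresDataset` and `SquaresStream` classes are illustrative only; they are not part of PyTorch or the Model Zoo:

```python
import torch
from torch.utils.data import Dataset, IterableDataset, DataLoader

class SquaresDataset(Dataset):
    """Map-style dataset: indexable, with a known length."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # dataset[idx] returns the idx-th sample directly.
        return torch.tensor([idx, idx * idx])

class SquaresStream(IterableDataset):
    """Iterable-style dataset: yields a stream of samples."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # iter(dataset) yields samples one at a time, as from a stream.
        for idx in range(self.n):
            yield torch.tensor([idx, idx * idx])

# Both styles plug into the same DataLoader interface.
map_loader = DataLoader(SquaresDataset(4), batch_size=2)
stream_loader = DataLoader(SquaresStream(4), batch_size=2)
```

Here both loaders yield the same batches, but only the map-style dataset supports `len()` and random access, which is why samplers and shuffling apply to map-style datasets only.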
In the Cerebras Model Zoo, dataloaders extend these base types to implement additional functionality such as input encoding and tokenization. For instance, `BertCSVDynamicMaskDataProcessor` (code) extends `IterableDataset`, while `BertClassifierDataProcessor` (code) extends `Dataset`. Notable examples are:
- `BertCSVDataProcessor` - Reads CSV files containing the input text tokens and MLM and NSP features
- `GptHDF5MapDataProcessor` - An HDF5 map-style dataset processor that reads from the HDF5 format for GPT pre-training
- `T5DynamicDataProcessor` - Reads text files containing the input text tokens and adds extra ids for the language modelling task on the fly
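As a rough illustration of what such a data processor does, here is a minimal sketch of an iterable-style processor that reads CSV rows and tokenizes them on the fly. The class name, vocabulary, and CSV contents are hypothetical; the real Model Zoo processors read configured files and use trained tokenizers:

```python
import csv
import io
from torch.utils.data import IterableDataset

# Hypothetical toy vocabulary and in-memory CSV contents.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}
CSV_TEXT = "text\nhello world\nworld hello hello\n"

class ToyCSVDataProcessor(IterableDataset):
    """Sketch of a CSV data processor: streams rows and encodes them."""

    def __init__(self, csv_text, max_len=4):
        self.csv_text = csv_text
        self.max_len = max_len

    def __iter__(self):
        reader = csv.DictReader(io.StringIO(self.csv_text))
        for row in reader:
            # Map whitespace-split tokens to ids, then pad/truncate to max_len.
            ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in row["text"].split()]
            ids = (ids + [VOCAB["[PAD]"]] * self.max_len)[: self.max_len]
            yield ids

samples = list(ToyCSVDataProcessor(CSV_TEXT))
# samples -> [[2, 3, 0, 0], [3, 2, 2, 0]]
```

Because encoding happens inside `__iter__`, the raw text never needs to be materialized in memory as tensors up front, which is the pattern the iterable-style processors follow.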
The supported file types are HDF5, CSV, and TXT. Other file types are not tested and may not be supported.