Learn how to create and optimize custom PyTorch dataloaders for Cerebras systems.
The number of worker processes that stream data to each CS-X system is set in the runconfig parameters by the variable num_workers_per_csx. For more information about this flag in Cerebras Model Zoo models, visit the YAML params documentation. Because the system requires static tensor shapes, every batch the dataloader yields must have the same size, so you should set drop_last=True (yes, the batch dimension must also stay constant).
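As an illustration, the flag sits in the runconfig section of a model's YAML params file (the value here is just an example):

```yaml
runconfig:
  num_workers_per_csx: 2  # worker processes streaming data to each CS-X system
```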
The effective batch size of the model is microbatch_size * num_systems * grad_accum_steps. However, under the hood the framework doesn't care what batch size the dataloader yields. Instead, it repeatedly grabs the next chunk of microbatch_size samples from the data it has buffered and feeds these samples to the system. If this splitting of batches is unacceptable, you'll need to implement some workarounds; however, that use case is both uncommon and advanced, so it won't be covered in detail here. For most models, it is enough to keep batch sizes uniform by setting drop_last = True.
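To make this concrete, here is a simplified sketch (not the framework's actual implementation) of how buffered samples are re-chunked into microbatches regardless of the batch size the dataloader yields:

```python
def iter_microbatches(batches, microbatch_size):
    """Flatten incoming batches and re-chunk them into microbatches,
    mimicking (in simplified form) how buffered samples are consumed."""
    buffer = []
    for batch in batches:
        buffer.extend(batch)
        while len(buffer) >= microbatch_size:
            yield buffer[:microbatch_size]
            buffer = buffer[microbatch_size:]

microbatch_size, num_systems, grad_accum_steps = 4, 2, 3
effective_batch_size = microbatch_size * num_systems * grad_accum_steps  # 24

# The dataloader happens to yield batches of 6; the system still
# consumes microbatches of 4.
batches = [list(range(i, i + 6)) for i in range(0, 24, 6)]
micro = list(iter_microbatches(batches, microbatch_size))  # 6 microbatches of 4
```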
For simple data sharding, use the helper functions num_tasks and task_id from modelzoo.data.common.input_utils. They return the total number of worker nodes and the current node's index, respectively, so each worker can select its own disjoint subset of the data. Here's how you might structure the dataloader setup:
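A minimal sketch, with a hypothetical file-backed TextDataset; the try/except stubs num_tasks and task_id so the sketch also runs outside a Cerebras Model Zoo environment, and real code would subclass torch.utils.data.Dataset and wrap the result in a DataLoader:

```python
try:
    from modelzoo.data.common.input_utils import num_tasks, task_id
except ImportError:
    # Stubs so this sketch runs outside a Cerebras Model Zoo environment.
    def num_tasks(): return 1
    def task_id(): return 0

class TextDataset:
    """Illustrative map-style dataset over a list of files (parsing omitted);
    real code would subclass torch.utils.data.Dataset."""
    def __init__(self, files):
        files = sorted(files)  # fix the order before sharding
        # Shard: each worker node keeps a disjoint subset of the files.
        self.files = files[task_id()::num_tasks()]

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Real code would read and tokenize self.files[idx].
        return self.files[idx]

dataset = TextDataset([f"shard_{i}.txt" for i in range(10)])
# Then wrap it, e.g.: DataLoader(dataset, batch_size=4, drop_last=True)
```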
To adapt a dataloader for data sharding, you only need to modify two lines:

1. In the __init__ method of your dataset class, add a line such as self.files = self.files[task_id()::num_tasks()] to shard the data. This ensures each worker gets a unique subset.
2. When constructing the DataLoader, set drop_last = True. This ensures that all distributed batches have the same size, avoiding issues with uneven batch sizes at the end of each epoch.

By sharding the data in the __init__ method, you ensure that each worker node gets a distinct subset of the data, eliminating concerns about sample repetition within an epoch. Additionally, this approach works seamlessly on a GPU without needing different dataloader versions.
Furthermore, you can streamline this process by using your original TextDataset alongside the Cerebras-defined CBSampler, which automates the sharding. By integrating this sampler, you delegate the complexity of data sharding to the sampler, simplifying your dataloader configuration and making your code more maintainable. The CBSampler approach also comes with a few additional benefits, such as a data order that is independent of the number of systems you use or the number of workers streaming to each system. As with the previous example, this approach also works on GPU.
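To illustrate what such a sampler automates, here is a behavioral stand-in (this is not the real CBSampler API; see the Model Zoo documentation for the actual import path and constructor arguments):

```python
import random

class ShardedBatchSampler:
    """Stand-in showing the behavior a sharding sampler automates: a seeded
    global shuffle, fixed-size batches with drop_last, and a disjoint set of
    batches per worker. Not the real CBSampler API."""
    def __init__(self, data_len, batch_size, seed, task_id=0, num_tasks=1):
        order = list(range(data_len))
        random.Random(seed).shuffle(order)  # identical global order on every worker
        n_full = data_len // batch_size     # drop_last: ignore the ragged tail
        batches = [order[i * batch_size:(i + 1) * batch_size] for i in range(n_full)]
        self.batches = batches[task_id::num_tasks]  # disjoint shard of batches

    def __iter__(self):
        return iter(self.batches)

# Because the shuffle is seeded before sharding, the global data order is the
# same no matter how many workers it is split across.
w0 = ShardedBatchSampler(20, 4, seed=7, task_id=0, num_tasks=2)
w1 = ShardedBatchSampler(20, 4, seed=7, task_id=1, num_tasks=2)
```

Interleaving the two workers' batches reproduces the single-worker order, which is the "data order independent of the number of systems" property described above.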
If your dataset involves randomness, such as shuffling, make sure every worker uses the same seed, whether set through a dataset argument, worker_init_fn, or similar. Seeding interacts with sharding: if we had not specified a seed (i.e., left seed=None) and had moved the line self.files = self.files[task_id()::num_tasks()] to the end of the __init__ function, then the order of each worker's copy of self.files might be different, in which case each worker's list of files post-sharding might not be disjoint from the lists of the other workers. Some data would get repeated, and some data would be ignored altogether, which might create undesirable convergence characteristics.
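This pitfall is easy to demonstrate with plain Python (file names are illustrative): with a shared seed the shards are disjoint and cover the dataset, while independently seeded shuffles, the effect of seed=None, generally make them overlap:

```python
import random

files = [f"file_{i}" for i in range(100)]

def shard(files, task_id, num_tasks, seed):
    """Shuffle, then shard; with a shared seed every worker shuffles identically."""
    local = list(files)
    random.Random(seed).shuffle(local)
    return local[task_id::num_tasks]

# Shared seed: shards are disjoint and together cover every file.
w0 = shard(files, 0, 2, seed=42)
w1 = shard(files, 1, 2, seed=42)
assert set(w0).isdisjoint(w1) and set(w0) | set(w1) == set(files)

# Independent seeds (what unseeded shuffling amounts to): orders diverge,
# so the shards typically overlap; some files repeat and some are dropped.
b0 = shard(files, 0, 2, seed=1)
b1 = shard(files, 1, 2, seed=2)
repeated = set(b0) & set(b1)
dropped = set(files) - (set(b0) | set(b1))
```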
The only other change was to specify drop_last=True to ensure that the model gets the consistent data shapes it needs. Again, this dataloader works on both CS systems and GPUs.
To benchmark your dataloader with cerebras_pytorch, use the Cerebras virtual environment; instructions to set it up can be found here. The cerebras.pytorch.utils.benchmark.benchmark_dataloader function takes a function that creates a new dataloader, along with arguments specifying how many steps to run and similar, and reports throughput statistics.
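A sketch of how you might drive it; the benchmark_dataloader argument names are assumptions, so check the cerebras_pytorch API docs, and the fallback shows an equivalent manual measurement for environments without the package:

```python
import time

def make_dataloader():
    # Hypothetical factory: real code would construct and return your DataLoader.
    return iter(range(1000))

def measure_throughput(dataloader_fn, steps=100):
    """Manual fallback: time a fixed number of steps and report batches/sec."""
    loader = dataloader_fn()
    start = time.perf_counter()
    for _ in range(steps):
        next(loader)
    return steps / (time.perf_counter() - start)

try:
    from cerebras.pytorch.utils.benchmark import benchmark_dataloader
    # Argument names here are assumptions; consult the API documentation.
    benchmark_dataloader(make_dataloader, num_epochs=1)
except ImportError:
    print(f"{measure_throughput(make_dataloader):.1f} batches/s")
```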
If throughput is lower than expected, check whether you are bottlenecked on disk reads (for example, using iostat).