Two throughput metrics are logged during training: Rate and GlobalRate.
It is important to note that these metrics are what’s measured by the user node. While they are useful for getting an overall picture of throughput, they are not exact measurements of throughput seen by the Cerebras Wafer-Scale Cluster. This is due to the asynchronous nature of execution on Cerebras Wafer-Scale, where input workers stream data to the wafer quasi-independently of the user node that’s receiving the outputs.
GlobalRate measures the average throughput of the entire training run. It does so by dividing the total number of samples for which outputs have been received by the total time since the executable was fully loaded onto the Wafer-Scale Engine. GlobalRate is also logged to the events files and is viewable in TensorBoard as avg_samples_per_sec.

Rate
measures a smoothed-out version of GlobalRate: at each sampling interval (i.e., each logging step), a smoothing factor (default of 0.4) is applied to the previously calculated Rate and added to the local throughput measured since the last sampling point. Rate is also logged to the events files and is viewable in TensorBoard as local_samples_per_sec.

While Rate is more susceptible to spikes than GlobalRate, it is more representative of the current throughput measured by the user node.

If the user node temporarily falls behind the wafer in receiving outputs, there may be a spike in the throughput (Rate) seen. Once the user node catches up to the wafer, throughput will stabilize and return to normal.
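The two metrics can be illustrated with a minimal sketch. This is not the actual Model Zoo implementation; in particular, it assumes the smoothing behaves like a standard exponential moving average, with the 0.4 factor weighting the previous Rate and the complementary weight (1 - 0.4) applied to the local throughput. The class and method names are illustrative.

```python
import time


class ThroughputMeter:
    """Illustrative sketch of the Rate / GlobalRate metrics described above."""

    def __init__(self, smoothing=0.4):
        self.smoothing = smoothing
        self.start_time = None   # when the executable finished loading
        self.last_time = None    # previous sampling point
        self.total_samples = 0   # samples for which outputs were received
        self.rate = None         # smoothed "Rate"

    def start(self, now=None):
        """Start the clock once the executable is loaded."""
        self.start_time = self.last_time = now if now is not None else time.time()

    def update(self, num_samples, now=None):
        """Call once per logging step with samples received since the last call."""
        now = now if now is not None else time.time()
        self.total_samples += num_samples
        # Local throughput since the last sampling point.
        local_rate = num_samples / (now - self.last_time)
        if self.rate is None:
            self.rate = local_rate
        else:
            # Assumed EMA form: smoothing * previous Rate + (1 - smoothing) * local.
            self.rate = self.smoothing * self.rate + (1 - self.smoothing) * local_rate
        self.last_time = now
        return self.rate

    @property
    def global_rate(self):
        # GlobalRate: total samples divided by total elapsed time since load.
        return self.total_samples / (self.last_time - self.start_time)
```

For example, after two logging steps of 100 samples in 1 s and 50 samples in 1 s, Rate would be 0.4 * 100 + 0.6 * 50 = 70 samples/s under this smoothing, while GlobalRate would be 150 / 2 = 75 samples/s.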
For long training runs, it is recommended to rely on GlobalRate, since it is amortized over the entire training duration; for shorter runs, it can be useful to monitor both metrics (Rate and GlobalRate).
Having said that, the first few logging steps may present outlier throughputs due to the difference between when the clock is started on the user node and when the wafer actually starts processing data. This effect is short-lived, and steady-state throughput is achieved quickly thereafter.
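Because those early steps can be outliers, one simple way to estimate steady-state throughput from a series of logged local rates is to drop the first few points before averaging. A minimal sketch; the function name and the default skip count are illustrative, not part of any Cerebras API:

```python
def steady_state_rate(local_rates, skip=5):
    """Average the logged local throughputs, ignoring the first `skip`
    logging steps, which may be outliers while the user-node clock and
    the wafer's actual start of processing are out of sync."""
    tail = local_rates[skip:] or local_rates  # fall back to all points if too few
    return sum(tail) / len(tail)
```

For instance, with one inflated initial reading followed by stable readings, skipping the first point recovers the steady-state value.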