This page will cover how to configure the Trainer with weight sparsity. By the end, you should be familiar with how to use sparsity in tandem with the Trainer for any model.
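As a concrete starting point, a minimal static-sparsity configuration might look like the sketch below; the `trainer.init.sparsity` nesting mirrors the `param_filter` example later on this page, and the rest of the Trainer config is elided.

```yaml
trainer:
  init:
    # ... model, optimizer, and other Trainer settings elided ...
    sparsity:
      sparsity: 0.3   # prune 30% of the elements in each matching Parameter
```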
The default `init_method` is `"random"`, which means 30% of the elements in each Parameter (which passes the default parameter filter) will be pruned once at model initialization and kept that way throughout training. Non-Parameter tensors are not pruned. Sparsity is configured primarily through the following options:
- `algorithm`: The sparsity algorithm to apply (the dynamic algorithms GMP, SET, and RigL are covered below). You can also define a custom class that inherits from `SparsityAlgorithm`.
- `sparsity`: The desired sparsity level between 0 and 1. 0.0 means the Parameter is kept fully dense. 1.0 means the Parameter is effectively entirely zeros. Dynamic sparsity algorithms also accept more complex configuration, described below in Dynamic Hyperparameters.
- `init_method` (optional): Method to compute the initial sparsity distribution.
  - `random`: (default) Sparsity is randomly distributed within each weight.
  - `topk`: Sparsity is distributed according to the lowest magnitude weights.
  - `from_zeros`: Sparsity pattern is determined by weight values that are already zero.
- `param_filter` (optional): Controls which Parameters are sparsified. The list of Parameter names can be found using `model.named_parameters()`.
When this is omitted, any multidimensional Parameters (except those with `embedding`, `norm`, or `lm_head` in their name) automatically get sparsity applied; single-dimensional weights such as biases are ignored (see `default_sparse_param_filter`). While this provides a good default heuristic for transformer-based models [1], a (list of) glob expression(s) can also be provided to apply sparsity only to Parameters which match, e.g.

```yaml
trainer:
  init:
    sparsity:
      …
      param_filter: "*"
```
Per-layer sparsity options can be configured by passing in a list of configuration dictionaries. See below in advanced param_filters.
A dynamic sparsity algorithm (GMP, SET, or RigL) needs an additional `update` schedule indicating when to update the sparsity pattern. There are 2 basic methods built in, with 3 different options.
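As a rough sketch of how a dynamic algorithm and its update schedule fit together in the config: the `algorithm` and `update` keys are named on this page, while the algorithm string and the `freq` field below are assumptions used only for illustration.

```yaml
trainer:
  init:
    sparsity:
      algorithm: "set"   # assumed value string for one of the dynamic algorithms (GMP, SET, RigL)
      sparsity: 0.9      # hypothetical target sparsity level
      update:
        freq: 1000       # assumed field name: recompute the sparsity pattern every 1000 steps
```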
Dynamic sparsity algorithms (GMP, SET, or RigL) can configure the `sparsity` (and, for SET and RigL, the `drop_fraction`) field using a "step-aware hyperparameter", akin to learning rate schedules, in addition to simple constants. These more complex configurations usually require additional options and so are specified as dictionaries.
A `DynamicSparsityAlgorithm` that invokes such a dynamic hyperparameter for `sparsity` ensures sparsity levels stay legal by using `torch.clamp(sparsity, min=0.0, max=1.0)`.
The built-in step-aware hyperparameters are:

- `Linear`
- `Exponential`: Especially useful for GMP, where the sparsity level monotonically increases throughout training because a fraction of the remaining elements in the Parameter are pruned at each update step, asymptotically approaching an empty network.
- `Cosine`: Especially useful for RigL, which usually uses a "cosine decay" on its `drop_fraction`. `minimum` defaults to `0.0`, and `half_period` controls at what step the value reaches its minimum (see the sketch after this list).
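For instance, a cosine-decayed `drop_fraction` for RigL might be written roughly as follows. The `drop_fraction`, `minimum`, and `half_period` names come from this page; the `type` and `init` keys, the algorithm and schedule strings, and the specific numbers are assumptions made only to illustrate the dictionary form of a step-aware hyperparameter.

```yaml
trainer:
  init:
    sparsity:
      algorithm: "rigl"      # assumed value string for the RigL algorithm
      sparsity: 0.9          # hypothetical constant sparsity level
      update:
        freq: 1000           # assumed schedule field
      drop_fraction:         # a dictionary rather than a simple constant
        type: "cosine"       # assumed key/value selecting the schedule shape
        init: 0.3            # hypothetical starting drop fraction
        half_period: 10000   # step at which the value reaches its minimum
        minimum: 0.0         # defaults to 0.0; shown for clarity
```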
If one set of weights should be statically sparsified to 0.3, but another set of weights should be dynamically sparsified using the SET algorithm, this can be done by providing a list of sparsity algorithms, as in the sketch below.
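A configuration along those lines might look like the following sketch. The list-of-dictionaries form and the `param_filter`, `sparsity`, `algorithm`, and `update` keys are described on this page; the glob patterns, the lowercase algorithm string, and the schedule field are placeholders rather than names taken from a real model.

```yaml
trainer:
  init:
    sparsity:
      # One entry per group of Parameters; each entry is a complete sparsity config.
      - sparsity: 0.3
        param_filter: "*fc_layers*"   # hypothetical glob for the statically sparsified weights
      - algorithm: "set"              # assumed value string for the SET algorithm
        sparsity: 0.9                 # hypothetical target level for the dynamic group
        update:
          freq: 1000                  # assumed schedule field; dynamic algorithms need an update schedule
        param_filter: "*attn*"        # hypothetical glob for the dynamically sparsified weights
```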
`param_filters` can be specified as a dictionary, mapping "patterns" to the config dictionaries to overlay on the default sparsity config options.
For example, when using RigL on transformer networks (RigL uses gradient information to guide which values in a Parameter to prune), sparsity can be redistributed between the heads of the attention projection weights when samples in a batch activate one head disproportionately to another. This ultimately decreases the effectiveness of dynamic sparsity and can even hurt model performance.

To ensure sparsity is fairly distributed between the different attention heads of the multi-head attention projections, you can specify `balance_out_groups` when the output logits are logically N independent/stacked groups (i.e. input projection weights before multi-head attention QKV), or `balance_in_groups` for the reverse (i.e. output projection weights). These should be applied differently to different weights using `param_filter`, since this conceptually only applies to attention projection weights. In the following example, the model has 12 attention heads.
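The example referenced above is sketched here under assumptions: only `param_filters`, `balance_out_groups`, `balance_in_groups`, and the 12-head count come from the text; the glob patterns, algorithm string, and schedule field are placeholders for illustration.

```yaml
trainer:
  init:
    sparsity:
      algorithm: "rigl"                 # assumed value string for RigL
      sparsity: 0.9                     # hypothetical default level overlaid by the entries below
      update:
        freq: 1000                      # assumed schedule field, shown only for completeness
      param_filters:
        # Input projections: output logits are 12 stacked per-head groups.
        "*proj_q*.weight": { balance_out_groups: 12 }
        "*proj_k*.weight": { balance_out_groups: 12 }
        "*proj_v*.weight": { balance_out_groups: 12 }
        # Output projection: the reverse, so balance along the input dimension.
        "*proj_output*.weight": { balance_in_groups: 12 }
```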
To train with sparsity, launch the job with the run command (see guide: Launch your job) and ensure the `.yaml` file has sparsity enabled. To validate your sparsity config before launching training, run with `--validate_only`. You can also log which weights are being sparsified by passing `--logging VERBOSE` to your run command.
Sparsity levels can also be monitored throughout training using the `LogSparsity` callback.
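A sketch of how such a callback might be attached: only the `LogSparsity` name comes from this page, while the placement under a `callbacks` list in `trainer.init` and the empty-argument form are assumptions.

```yaml
trainer:
  init:
    callbacks:
      - LogSparsity: {}   # assumed placement; logs sparsity statistics during training
    sparsity:
      sparsity: 0.3
```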