prompt (also known as instruction) and a completion (also known as response). Here is a simple example:
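A minimal illustration of such a record is shown below; the text content is made up, but the "prompt" and "completion" keys match the flags described later in this section:

```json
{"prompt": "What is the capital of France?", "completion": "Paris is the capital of France."}
```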
This data is converted into HDF5 files using the Summarization mode within create_hdf5_dataset.py. For example, for a maximum sequence length of 2048 tokens:
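A sketch of such a call is shown below. Only the Summarization mode comes from the text above; the flag names (--input_dir, --max_seq_length, --output_dir) and paths are assumptions, so check the script's --help for the exact interface:

```bash
# Illustrative invocation; flag names other than the mode are assumptions.
python create_hdf5_dataset.py Summarization \
    --input_dir ./ift_data \
    --max_seq_length 2048 \
    --output_dir ./ift_hdf5
```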
The flags relevant to Summarization mode are the following:
--sep_token
This allows you to specify a token between the prompt and completion, with null meaning there is no sep token. For example, you could create a new token <sep> to clearly indicate to the model that the prompt has finished and the completion is beginning.
--prompt_key
This allows you to specify the key in the raw data that stores the prompt/instruction. In the json data above, “prompt” specifies the prompt. It could also be “instruction” or any other string.
--completion_key
This allows you to specify the key in the raw data that stores the completion/response. In the json data above, “completion” specifies the completion. It could also be “response” or any other string.
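Putting these flags together, a hypothetical call for raw data that uses "instruction"/"response" keys and a new <sep> token might look like the following; the paths and any flags not documented above are assumptions:

```bash
# Raw records look like: {"instruction": "...", "response": "..."}
python create_hdf5_dataset.py Summarization \
    --sep_token "<sep>" \
    --prompt_key "instruction" \
    --completion_key "response" \
    --input_dir ./ift_data \
    --output_dir ./ift_hdf5
```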
Short examples must be padded to the fixed MSL in Summarization mode. It is possible to ignore the loss from padding tokens: dummy tokens are added so that the model operates on each IFT example individually while avoiding gradient updates from padding. However, the model still operates on the full MSL, so a lot of computation is wasted performing operations on padding tokens whose loss is later ignored.
Each sequence is padded with dummy tokens (shown as 0 below) to get to the fixed MSL.
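The sketch below (plain Python, not the preprocessor itself) illustrates the padding arithmetic for an MSL of 2048; the token ids are made up:

```python
# Pad a short tokenized IFT example to a fixed MSL with dummy tokens (id 0).
MSL = 2048                      # maximum sequence length from the example above
example = [1023, 88, 4197, 13]  # made-up token ids for a short example
padded = example + [0] * (MSL - len(example))
# len(padded) == 2048, but only 4 positions carry real content; the other
# 2044 positions are computed on and then ignored by the loss.
```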
Consider a sample in which the false statement that Paris is the capital of the United States appears several times, and look at the per-token losses for _Paris, ., and </s> (end-of-sequence). As expected, the loss value for predicting Paris as the capital of the United States is originally quite high, at ~10.5. However, because of the lack of attention masking, the model can look back at previous occurrences and places higher probability on a repetition of the false “fact”: for the second and third appearances, the loss drops to around 3.5.
Now we show the per-token loss with attention masking:
Creating data with attention masking requires only a small change to the Summarization workflow. Simply replace the mode in your call to create_hdf5_dataset.py:
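For example, the call might become the following; the Summarization_VSL mode name is an assumption here (check the preprocessing script's --help for the exact name in your release), and the other flags are the same illustrative ones used above:

```bash
# Illustrative; the VSL mode name is an assumption.
python create_hdf5_dataset.py Summarization_VSL \
    --input_dir ./ift_data \
    --max_seq_length 2048 \
    --output_dir ./ift_hdf5_vsl
```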
This mode produces two additional inputs for each sample: attention_span and position_ids. The position_ids are simply 0 ... N-1 for each position in a sequence of length N, and attention_span is the reverse, N-1 ... 0. So if two sequences are packed together, the position_ids would be 0 ... N-1 0 ... M-1 and attention_span would be N-1 ... 0 M-1 ... 0. These two inputs are used by our stack to create the attention masks.
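The following sketch (not the preprocessor code itself) shows how these two features are laid out for a packed sample, following the definitions above:

```python
# Build position_ids and attention_span for sequences packed into one sample.
def vsl_features(lengths):
    position_ids, attention_span = [], []
    for n in lengths:
        position_ids.extend(range(n))                # 0 ... N-1 within each sequence
        attention_span.extend(range(n - 1, -1, -1))  # N-1 ... 0 within each sequence
    return position_ids, attention_span

# Two packed sequences of lengths 3 and 2:
#   position_ids   -> [0, 1, 2, 0, 1]
#   attention_span -> [2, 1, 0, 1, 0]
print(vsl_features([3, 2]))
```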
While the preprocessor automatically creates these inputs to perform attention masking, note that there is one additional step to indicate to the software stack that this masking should be applied: add use_vsl: True to the train_input and/or eval_input section of the yaml config passed to run.py.
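A sketch of the relevant portion of the config is shown below; only the use_vsl key comes from the text above, and the comments stand in for whatever data settings your config already has:

```yaml
train_input:
  # ... existing data settings (data paths, batch size, etc.) ...
  use_vsl: True

eval_input:
  # ... existing data settings ...
  use_vsl: True
```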
Note that if you specify a new token as the sep-token, the token is added to the model’s vocabulary and will be trained from random initialization.