A simple and stable method for fine-tuning language models using human or synthetic preference data without reinforcement learning.
- `configs/`: YAML configuration files for DPO fine-tuning runs.
- `model.py`: Defines the DPO training logic, including the contrastive loss function used to compare chosen and rejected completions (a minimal sketch of this loss follows the table below).

| Configuration | Description |
|---|---|
| `params_zephyr_7b_dpo.yaml` | DPO training config for a 7B model using preference-labeled instruction-tuning data. |
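For orientation, here is a minimal PyTorch sketch of the standard DPO contrastive loss described above. It assumes per-sequence log-probabilities (summed over completion tokens) have already been computed under both the policy and a frozen reference model; the function name, argument names, and the `beta` default are illustrative, not taken from `model.py`.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen | prompt), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected | prompt), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # illustrative default; controls deviation from the reference
) -> torch.Tensor:
    """Standard DPO objective: push the policy to prefer the chosen completion
    over the rejected one by a wider margin than the reference model does."""
    # Implicit reward for each completion: log-ratio of policy to reference.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Bradley-Terry negative log-likelihood of the preference: -log sigmoid(beta * margin).
    logits = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(logits).mean()
```

A larger `beta` penalizes deviation from the reference model more sharply, which is the mechanism that lets DPO stay stable without an explicit RL loop.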