DPO
A simple and stable method for fine-tuning language models using human or synthetic preference data without reinforcement learning.
Model Description
Direct Preference Optimization (DPO) is a training method for fine-tuning language models on preference data (pairs of responses labeled as preferred or rejected) without requiring reinforcement learning or a separate reward model. DPO was introduced in Rafailov et al. (2023), Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
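As a rough sketch of the objective (for exposition only; the function and argument names below are illustrative and not taken from model.py), the DPO loss can be written directly in terms of per-sequence log-probabilities under the policy being trained and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed per-sequence log-probabilities, each of shape (batch,).

    beta scales the implicit reward and controls how far the policy is
    allowed to drift from the frozen reference model.
    """
    # Implicit rewards: log-probability ratios between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen completion's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Minimizing this loss widens the likelihood margin of preferred over rejected completions, while the reference-model terms keep the policy close to its starting point; this is what removes the need for an explicit reward model or an RL loop.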
Code Structure
This implementation consists of:
- configs/: YAML configuration files for DPO fine-tuning runs.
- model.py: Defines the DPO training logic, including the contrastive loss function used to compare chosen and rejected completions (see the illustrative sketch after this list).
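For context on how such a loss is typically fed (a hypothetical helper for illustration, not the actual contents of model.py), the per-sequence log-probabilities can be obtained by scoring only the completion tokens of each labeled response:

```python
import torch

def completion_logp(logits: torch.Tensor,
                    input_ids: torch.Tensor,
                    prompt_len: int) -> torch.Tensor:
    """Sum the log-probabilities of the completion tokens of one sequence.

    logits: (seq_len, vocab) model outputs for the full prompt + completion.
    input_ids: (seq_len,) token ids of the same sequence.
    prompt_len: number of prompt tokens, which are excluded from scoring.
    """
    # logits[t] predicts input_ids[t + 1], so shift the targets by one position.
    logps = torch.log_softmax(logits[:-1], dim=-1)
    token_logps = logps.gather(-1, input_ids[1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the completion tokens (positions prompt_len .. seq_len - 1).
    return token_logps[prompt_len - 1:].sum()
```

Computing this quantity for the chosen and rejected completions, under both the policy and the reference model, yields the four tensors consumed by the loss sketched above.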
Available Configurations
| Configuration | Description |
|---|---|
| params_zephyr_7b_dpo.yaml | DPO training config for a 7B model using preference-labeled instruction tuning data. |
References
- Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
- Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback (InstructGPT).
- Christiano, P., et al. (2017). Deep reinforcement learning from human preferences.