This module covers techniques for aligning language models with human preferences. While supervised fine-tuning helps models learn tasks, preference alignment encourages outputs to match human expectations and values.
Typical alignment methods involve multiple stages:
- Supervised Fine-Tuning (SFT) to adapt models to specific domains
- Preference alignment (like RLHF or DPO) to improve response quality
Alternative approaches like ORPO combine instruction tuning and preference alignment into a single process. Here, we will focus on the DPO and ORPO algorithms.
To learn more about the different alignment techniques, see the Argilla Blog.
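To make the two-stage flow concrete, here is a minimal sketch using the TRL library, with DPO (covered next) as the preference stage. The model and dataset names are only examples, and exact trainer arguments (for instance `processing_class` vs. `tokenizer`) differ between TRL releases, so treat this as a rough outline rather than a drop-in script.

```python
# Sketch of the two-stage pipeline: SFT first, then preference alignment with DPO.
# Assumes a recent TRL release; names and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

model_name = "HuggingFaceTB/SmolLM2-135M"  # any small causal LM works for experimentation
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Stage 1: supervised fine-tuning on instruction-following conversations
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="sft-model", max_steps=1000),
    train_dataset=load_dataset("HuggingFaceTB/smoltalk", "everyday-conversations", split="train"),
    processing_class=tokenizer,
)
sft_trainer.train()

# Stage 2: preference alignment with DPO on (prompt, chosen, rejected) triples
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with ref_model=None, TRL keeps a frozen copy of the model as the reference
    args=DPOConfig(output_dir="dpo-model", beta=0.1, max_steps=1000),
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
    processing_class=tokenizer,
)
dpo_trainer.train()
```

ORPO, discussed later in this module, collapses these two stages into a single training run.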
DPO simplifies preference alignment by directly optimizing models using preference data, eliminating the need for separate reward models and complex reinforcement learning. This makes it more stable and efficient than traditional RLHF.
Key benefits:
- No separate reward model needed
- More stable training process
- Lower computational requirements
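Concretely, DPO trains on triples of a prompt $x$, a preferred response $y_w$, and a rejected response $y_l$, and minimizes a classification-style loss in which the policy's log-probability ratios against a frozen reference model play the role of the reward:

$$
\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

Here $\sigma$ is the logistic function and $\beta$ controls how far the trained policy $\pi_\theta$ may drift from the reference model $\pi_{\text{ref}}$ (typically the SFT checkpoint). Because the implicit reward is expressed directly through these log-ratios, no separate reward model and no reinforcement learning loop are required.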
ORPO combines instruction tuning and preference alignment into a single training process. It modifies the standard language modeling objective by combining the token-level negative log-likelihood loss with an odds ratio term that favors chosen over rejected responses.
Key innovations:
- Unified single-stage training process
- Reference model-free architecture
- Improved computational efficiency
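In slightly more detail, following the ORPO paper's notation, the loss adds a weighted odds ratio term to the usual negative log-likelihood on the chosen response, where the odds are computed from the model's own (length-normalized) sequence likelihoods:

$$
\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}, \qquad
\mathcal{L}_{OR} = -\log \sigma\!\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right)
$$

$$
\mathcal{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR}\right]
$$

Only the trained policy appears in these terms, which is why no frozen reference model is needed, and $\lambda$ trades off instruction tuning against preference optimization.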
ORPO has shown impressive results across various benchmarks:
- Better performance on AlpacaEval compared to traditional methods
- Strong results on MT-Bench, even without multi-turn training
- Effective across different model sizes (125M to 1.3B parameters)