Machine Learning
Reinforcement Learning from Human Feedback (RLHF)
A machine learning technique that uses human preferences as a reward signal to fine-tune a model, aligning its behavior with human values and expectations.
Explanation
RLHF is a multi-stage process used to train large language models (LLMs) and other AI systems. It typically involves four stages: pre-training a model on a large dataset; collecting human preference data, where annotators rank or rate model outputs; training a reward model to predict these human preferences; and finally using reinforcement learning (often PPO, Proximal Policy Optimization) to optimize the original model against the reward model. This process can improve safety, reduce harmful or hallucinated outputs, and make the model follow instructions more reliably than supervised fine-tuning alone.
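The reward-modeling and RL stages can be illustrated with a toy sketch. This is not an actual RLHF implementation: it assumes model outputs are represented by small fixed feature vectors, fits a linear reward model with the Bradley-Terry pairwise loss used in RLHF reward modeling (minimizing -log sigmoid(r(chosen) - r(rejected))), and then nudges a softmax "policy" over a handful of candidate outputs with a simple REINFORCE update standing in for PPO. All names and shapes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Hypothetical hidden preference direction; used only to label toy data.
true_w = np.array([1.0, -0.5, 2.0, 0.3])

# Stage 2 (data): preference pairs (features_chosen, features_rejected).
pairs = []
for _ in range(200):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    pairs.append((a, b) if true_w @ a >= true_w @ b else (b, a))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stage 3 (reward model): fit linear r(x) = w @ x with the
# Bradley-Terry loss, -log sigmoid(r(chosen) - r(rejected)).
w = np.zeros(dim)
lr = 0.1
for _ in range(300):
    grad = np.zeros(dim)
    for chosen, rejected in pairs:
        p = sigmoid(w @ chosen - w @ rejected)
        grad += (p - 1.0) * (chosen - rejected)  # dL/dw for the pair
    w -= lr * grad / len(pairs)

# Stage 4 (RL): a softmax policy over 5 candidate "outputs",
# updated with REINFORCE plus a baseline (a stand-in for PPO).
candidates = rng.normal(size=(5, dim))
logits = np.zeros(5)
for _ in range(200):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    i = rng.choice(5, p=probs)          # sample an output
    reward = w @ candidates[i]          # score it with the reward model
    baseline = probs @ (candidates @ w) # expected reward under the policy
    logits += 0.1 * (reward - baseline) * (np.eye(5)[i] - probs)

probs = np.exp(logits - logits.max())
probs /= probs.sum()
print("reward-model weights:", np.round(w, 2))
print("policy distribution:", np.round(probs, 2))
```

Over training, the policy's probability mass shifts toward the candidates the learned reward model scores highly, which is the essence of the final RLHF stage; real systems add a KL penalty against the pre-trained model to keep outputs from drifting too far.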