Reinforcement Learning

Policy gradient methods

Policy gradient methods are a class of reinforcement learning algorithms that directly optimize the policy function, which maps states to actions, without explicitly learning a value function. These methods aim to find the optimal policy by estimating the gradient of the expected return with respect to the policy parameters and then updating the policy in the direction of that gradient.
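In standard notation (not specific to this glossary entry), the gradient being estimated is the one given by the policy gradient theorem:

\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[\,\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]
\]

where \(\pi_\theta\) is the policy with parameters \(\theta\), \(J(\theta)\) is the expected return, and \(G_t\) is the return (cumulative reward) from time step \(t\).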

Explanation

Policy gradient methods operate by adjusting the policy directly, based on the rewards received. Unlike value-based methods (e.g., Q-learning), which learn an estimate of the optimal value function and then derive a policy from it, policy gradient methods search for the optimal policy directly. The core idea is to estimate the gradient of the expected return (cumulative reward) with respect to the policy parameters; this gradient indicates how the policy should be adjusted to increase the expected return.

Algorithms such as REINFORCE, Actor-Critic methods (e.g., A2C, A3C), and Proximal Policy Optimization (PPO) fall under this category. These methods often involve complex mathematical derivations and are sensitive to hyperparameter tuning, but they are advantageous in continuous action spaces and can learn stochastic policies.

Policy gradient methods often suffer from high variance: the estimated gradient can be noisy, which hinders stable learning. Techniques like baseline subtraction are used to reduce this variance. Policy gradient methods are crucial for solving complex control problems where the optimal policy is more easily represented directly than through a value function.
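The ideas above, the gradient of the log-policy weighted by reward plus baseline subtraction, can be sketched in a few lines. The following is a minimal illustration on a hypothetical two-armed bandit (the environment, step counts, and learning rates are illustrative assumptions, not part of any particular library):

```python
import numpy as np

def softmax(logits):
    """Convert logits to a probability distribution over actions."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_bandit(true_means, steps=2000, lr=0.1, seed=0):
    """REINFORCE with a running-average baseline on a toy bandit.

    For a softmax policy, the gradient of log pi(a) with respect to the
    logits is (one_hot(a) - probs), so each update moves probability
    toward actions whose reward exceeds the baseline.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(true_means))  # policy parameters (logits)
    baseline = 0.0                     # running average of rewards
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choice(len(theta), p=probs)      # sample action from policy
        r = rng.normal(true_means[a], 1.0)       # noisy reward from the arm
        baseline += 0.01 * (r - baseline)        # baseline subtraction target
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0                    # grad of log softmax at a
        theta += lr * (r - baseline) * grad_log_pi  # gradient ascent step
    return softmax(theta)
```

After training on arms with mean rewards 1.0 and 0.0, `reinforce_bandit([1.0, 0.0])` concentrates most probability on the better first arm. Subtracting the running-average baseline does not change the gradient's expectation but noticeably reduces its variance, which is exactly the role baselines play in full-scale policy gradient algorithms.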

Related Terms