Researchers propose cancellation hypothesis for GRPO in LLM post-training
Researchers have introduced a cancellation hypothesis to explain the success of critic-free reinforcement learning methods like GRPO in the post-training of large language models. The hypothesis suggests that sequence-level rewards give rise to implicit token-level credit assignment: gradient contributions from positive and negative rollouts tend to cancel on the tokens the rollouts share, leaving the effective learning signal concentrated on the tokens that distinguish good completions from bad ones.
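To make the intuition concrete, here is a minimal, illustrative Python sketch (not the paper's code). It assumes GRPO's group-normalized sequence rewards are broadcast uniformly to every token of a rollout, and it approximates per-token gradient contributions by simply summing each token's advantage across rollouts; the token names and rewards are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize sequence-level rewards within a group of rollouts."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical toy example: two rollouts share most tokens and differ on one.
rollouts = [
    ["the", "correct", "answer"],   # positive rollout, reward 1.0
    ["the", "wrong", "answer"],     # negative rollout, reward 0.0
]
rewards = [1.0, 0.0]
advantages = group_relative_advantages(rewards)  # roughly [+1.0, -1.0]

# In GRPO the sequence-level advantage scales the log-prob gradient of every
# token in its rollout. Summing those scalar weights per token shows where
# the aggregate signal survives.
per_token_signal = {}
for tokens, adv in zip(rollouts, advantages):
    for tok in tokens:
        per_token_signal[tok] = per_token_signal.get(tok, 0.0) + adv

print(per_token_signal)
# Shared tokens ("the", "answer") receive +adv and -adv, which sum to ~0,
# while tokens unique to one rollout ("correct", "wrong") keep a nonzero
# signal -- an implicit form of token-level credit assignment.
```

The sketch simplifies by treating the gradient of a shared token as identical across rollouts; in practice the cancellation is only approximate, which is why it is framed as a hypothesis rather than an exact identity.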