The Reinforced Self-Training (ReST) Algorithm: A New Approach to Aligning LLMs with Human Preferences

Large language models (LLMs) have revolutionized natural language processing with their ability to generate coherent text and solve a wide range of linguistic tasks. However, their outputs are not always well aligned with human preferences, and they can produce harmful content if not properly constrained. To address this issue, researchers at DeepMind developed a new algorithm called Reinforced Self-Training (ReST), inspired by growing batch reinforcement learning.

The ReST algorithm consists of two loops: an outer loop (Grow) and an inner loop (Improve). During the Grow step, the current language model policy generates multiple candidate outputs for each input prompt, and these samples are added to the training dataset. In the Improve step, the enriched dataset is scored and filtered with a reward model trained on human preferences, and the filtered data is used to fine-tune the language model with an offline reinforcement learning objective. The Improve step is repeated with an increasing filtering threshold, so each iteration fine-tunes on progressively higher-reward samples.
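To make the structure concrete, here is a minimal sketch of the Grow/Improve loops in Python. It is not the authors' implementation: policy.generate, reward_model.score, and fine_tune are hypothetical placeholders for the sampling, scoring, and offline-RL fine-tuning operations described above, and the thresholds and sample counts are arbitrary.

```python
# Minimal sketch of the ReST Grow/Improve structure; `policy.generate`,
# `reward_model.score`, and `fine_tune` are hypothetical placeholders.

def rest(policy, prompts, reward_model, fine_tune, initial_data=(),
         num_grow_steps=3, thresholds=(0.5, 0.7, 0.9), samples_per_prompt=8):
    for _ in range(num_grow_steps):
        # Grow (outer loop): sample candidate outputs from the current policy
        # and score them once with a reward model trained on human preferences.
        grown = []
        for prompt in prompts:
            for output in policy.generate(prompt, num_samples=samples_per_prompt):
                grown.append((prompt, output, reward_model.score(prompt, output)))

        # Improve (inner loop): reuse the same Grow data with an increasing
        # reward threshold, fine-tuning on the surviving samples each time.
        for threshold in thresholds:
            filtered = [(p, o) for (p, o, r) in grown if r >= threshold]
            policy = fine_tune(policy, list(initial_data) + filtered)

    return policy
```

Note that the candidate pool grown is built once per Grow step and then reused for every threshold in the inner loop; this reuse is the source of the efficiency gain discussed next.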

One key advantage of ReST is its computational efficiency compared to online reinforcement learning approaches: the output of a single Grow step is reused across multiple Improve steps, which substantially reduces the cost of generating samples. At the same time, unlike purely offline reinforcement learning on a fixed dataset, the quality of the policy is not limited by the quality of the original data, because new training data is sampled from an improved policy.

ReST also makes it easier to inspect data quality and to diagnose alignment problems such as reward hacking: because the Grow and Improve steps are decoupled, the generated dataset can be examined directly before it is used for fine-tuning. Moreover, ReST has fewer hyperparameters to tune than typical online RLHF pipelines, making it a simple and reliable technique.

The researchers evaluated ReST on machine translation tasks, comparing several offline reinforcement learning objectives within the Improve step. They found that ReST significantly improved translation quality over supervised learning baselines, as rated by human evaluators, on several benchmarks.

The development of the ReST algorithm represents an important step in aligning LLMs with human preferences. By incorporating reinforcement learning and fine-tuning techniques, ReST offers a promising approach to improving the performance and safety of language models.

FAQ

What is the ReST algorithm?

The Reinforced Self-Training (ReST) algorithm is an approach to aligning large language models (LLMs) with human preferences. It consists of two loops: an outer loop (Grow) and an inner loop (Improve). In the Grow step, the language model generates multiple candidate outputs per prompt to supplement the training dataset. In the Improve step, the enriched dataset is scored and filtered with a reward model trained on human preferences and used to fine-tune the language model with an offline reinforcement learning objective, with the filtering threshold raised on each pass.
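For illustration, the threshold-based filtering of the Improve step can be shown with a toy example; the reward values below are made up purely for demonstration.

```python
# Illustrative Improve-step filtering: keep only samples whose reward
# clears the current threshold, raising the threshold on each pass.
samples = [
    ("prompt-1", "output-a", 0.42),
    ("prompt-1", "output-b", 0.81),
    ("prompt-2", "output-c", 0.67),
    ("prompt-2", "output-d", 0.93),
]

for threshold in (0.5, 0.7, 0.9):
    kept = [(p, o) for (p, o, r) in samples if r >= threshold]
    print(f"threshold {threshold}: {len(kept)} samples kept")
    # `kept` would then be used to fine-tune the model at this threshold
```

With these made-up rewards, the three passes keep 3, 2, and 1 samples respectively, so later passes train on a smaller but higher-reward subset.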

What are the advantages of the ReST algorithm?

ReST offers several advantages over traditional online reinforcement learning approaches. It significantly reduces computational cost by reusing the output of a single Grow step over multiple Improve steps. It also allows for easy inspection of data quality and diagnosis of alignment problems such as reward hacking. Furthermore, ReST has fewer hyperparameters and is a simple, reliable technique.

How does ReST improve the performance of language models?

By combining reward-based filtering with offline fine-tuning, ReST tightens the alignment between language models and human preferences. It steers the model toward higher-reward outputs, such as higher-quality translations, by iteratively expanding the dataset with samples from the improved policy and fine-tuning on the best of them.
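One simple choice of fine-tuning objective in the Improve step is to maximize the likelihood of the reward-filtered samples, a behavior-cloning-style loss and only one of several offline objectives considered in this line of work. The sketch below assumes a Hugging Face-style causal language model and tokenizer; the function name and interface are illustrative, not the paper's implementation.

```python
import torch

def improve_step_loss(model, tokenizer, filtered_pairs, device="cpu"):
    """Negative log-likelihood on reward-filtered samples (BC-style objective).

    `model` and `tokenizer` are assumed to be a Hugging Face-style causal LM
    and its tokenizer; `filtered_pairs` is a list of (prompt, output) strings
    that passed the reward threshold in the Improve step.
    """
    losses = []
    for prompt, output in filtered_pairs:
        ids = tokenizer(prompt + output, return_tensors="pt").input_ids.to(device)
        # Standard causal-LM loss: predict each token of the filtered sample.
        out = model(ids, labels=ids)
        losses.append(out.loss)
    return torch.stack(losses).mean()
```

In practice the prompt tokens would usually be masked out of the loss so that only the model's own outputs are reinforced, and batching would replace the per-example loop.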