
22/09/2023

What Is Reinforcement Learning from Human Feedback (RLHF) and How Does It Work?


Reinforcement Learning from Human Feedback (RLHF) is a very hot topic for all of us in the AI space. Anyone who has worked on re-training machine translation engines, either offline or online, is already quite familiar with the concept and the procedures. This has produced a massive transfer of talent: people are leveraging their experience in machine translation as an NLP task to fine-tune Large Language Models (LLMs). In this article, we will describe in plain language what reinforcement learning from human feedback (RLHF) is and how it works, drawing parallels from machine translation and pointing to some practical, real-world applications.

The basics: What is Reinforcement Learning

Reinforcement learning is a branch of machine learning in which an algorithm, often called an “agent”, learns to behave in a specific way within a given environment. It does so by performing actions and receiving rewards or punishments in response to those actions. Reinforcement learning solves problems through trial and error: the goal is for the agent to learn to make decisions that maximize a cumulative reward over time. Think about it for a moment: this is how we humans learn instinctively. We call it “learning by experience” or “trial and error”. For example, from the age of six or seven, we know that a frying pan, a radiator or an oven can be hot and that we should not touch them without making sure that they are switched off or have cooled down. We also know that if we stand on the edge of something, we risk falling. In the same way, machines are trained on real-life scenarios to make a series of decisions. For example, an agent could be trained to navigate a complex environment, such as a robot having to cross a labyrinth. For every correct decision that brings it closer to the exit, it receives a positive reward.

On the other hand, a decision that takes it further away from the exit results in a punishment or a negative reward. Over time, the agent learns the optimal strategy to achieve its goal based on the rewards and punishments it has experienced. This process of learning through reinforcement is fundamentally similar to the way we, as human beings, learn through our daily experiences, adjusting our behavior based on the results we observe.

For example, imagine a game in which a little robot (by definition not a thinking machine) must find its way out of a maze. Every time the robot makes a correct decision and approaches the exit, it receives a positive reward. But if it makes a decision that pulls it away from the exit, it receives a punishment (negative reward). Sooner or later, the robot will learn the optimal strategy to get out of the maze based on the rewards or punishments it has experienced. Now imagine this robot has a vacuum cleaner attached to it, with several proximity detection devices to map out your house as it cleans. Give it time and it will know your house, your walls, and the optimal cleaning route.

That's the basic concept of Reinforcement learning: learning through experience and feedback.
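To make the maze example tangible, here is a minimal sketch in Python. It assumes a very simplified "maze" (a straight corridor whose last cell is the exit) and uses tabular Q-learning, one of several possible algorithms. The function names, the corridor layout and all parameter values are our own illustrative choices, not a reference implementation.

```python
import random

def train_maze_agent(n_cells=6, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.2, seed=0):
    """Tabular Q-learning on a 1-D corridor; the exit is the last cell."""
    rng = random.Random(seed)
    moves = (-1, +1)                            # index 0 = left, index 1 = right
    q = [[0.0, 0.0] for _ in range(n_cells)]    # q[state][action_index]
    exit_cell = n_cells - 1
    for _ in range(episodes):
        state = 0
        while state != exit_cell:
            # Explore now and then, otherwise take the best-known action.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = min(max(state + moves[action], 0), exit_cell)
            # +1 for a step that gets closer to the exit, -1 otherwise.
            reward = 1.0 if next_state > state else -1.0
            # No future reward once the exit has been reached.
            future = 0.0 if next_state == exit_cell else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
    return q

def greedy_policy(q):
    """Best-known move for every cell except the exit."""
    return ["right" if right > left else "left" for left, right in q[:-1]]

q_table = train_maze_agent()
print(greedy_policy(q_table))
```

After training, the agent has learned from rewards and punishments alone that heading towards the exit is the better choice in every cell.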

To make the concept more familiar to people in the translation services industry: think about a machine translation engine that is constantly (or frequently) fed with more data from video game translations. It may be good enough on day 1 yet miss the exact terminology or style we like, making some errors along the way. With enough material, it will start to learn what we prefer. Reinforcement Learning simply applies to many more areas within Machine Learning: computer vision, OCR, data classification, etc.

Now, Enter Human Feedback…

Now that we know what Reinforcement Learning is, let’s add humans to the feedback process. A standard definition is that Reinforcement Learning from Human Feedback (RLHF) is a machine learning approach that combines reinforcement learning techniques, such as rewards and punishments, with human guidance to train an artificial intelligence (AI) agent. RLHF works by first training a "reward model" directly from human feedback. The algorithm is geared to make decisions in an environment to maximize cumulative rewards (basically, we turn the algorithm into a hound sniffing for prey and give it a biscuit when it finds it). The reward model is a function that takes an agent's output (the algorithm’s output) and predicts how good or bad it is. Once the reward model is trained, it can be used to train the agent using reinforcement learning.
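As an illustration of what "training a reward model from human feedback" can look like, here is a minimal sketch. It assumes the feedback arrives as pairs (the answer a human preferred and the answer they rejected) and fits a tiny linear model with a logistic, Bradley-Terry style loss. The hand-made features, the example sentences and the function names are all hypothetical; a real reward model would normally be a neural network.

```python
import math

def features(text):
    """Hypothetical hand-made features standing in for a neural network."""
    words = text.lower().split()
    return [
        1.0 if "please" in words or "thanks" in words else 0.0,  # politeness
        1.0 if "stupid" in words else 0.0,                        # rudeness
        min(len(words), 20) / 20.0,                               # length
    ]

def score(weights, text):
    """The reward model: maps an output to a single 'how good is it' number."""
    return sum(w * x for w, x in zip(weights, features(text)))

def train_reward_model(pairs, epochs=200, lr=0.5):
    """pairs: list of (preferred_text, rejected_text) chosen by human raters."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for good, bad in pairs:
            # Probability the model currently gives to the human's choice.
            margin = score(weights, good) - score(weights, bad)
            p = 1.0 / (1.0 + math.exp(-margin))
            # Gradient ascent on log(p): raise the preferred text's score.
            f_good, f_bad = features(good), features(bad)
            for i in range(len(weights)):
                weights[i] += lr * (1.0 - p) * (f_good[i] - f_bad[i])
    return weights

human_choices = [
    ("thanks for asking, here is the answer", "that is a stupid question"),
    ("please find the summary below", "stupid"),
]
reward_weights = train_reward_model(human_choices)
```

Once fitted, `score(reward_weights, some_text)` can rate outputs that no human has looked at, which is what lets the reward model stand in for the human during reinforcement learning.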

In reinforcement learning, an agent learns to perform a task by interacting with its environment and receiving rewards for actions that lead to desired outcomes. The agent learns to maximize its rewards through trial and error and eventually develops a policy that maps states to actions.

This "reward model," trained directly from human feedback, determines the reward function to optimize the agent's policy using reinforcement learning algorithms, such as Proximal Policy Optimization.
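For readers curious about what Proximal Policy Optimization adds, its central idea is a "clipped" objective that stops a single update from moving the policy too far from the previous one. The sketch below shows that objective for a single sample, assuming we already have the probability ratio between the new and the old policy and an advantage estimate. The function name and the default epsilon of 0.2 are illustrative choices on our part.

```python
def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective for one sample.

    ratio:     new_policy_prob / old_policy_prob for the action taken
    advantage: how much better than expected the action turned out
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum keeps the objective pessimistic about large changes.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, pushing the ratio above 1 + epsilon brings no extra gain, so the optimizer has no incentive to change the policy drastically in one step.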

Thus, we now have a system that uses human choices and what humans prefer to guide how and what the agent is hungry to learn. This potentially gives human-like choices more “weight.”

It is a question of trial and error, as it involves interacting with the environment and observing the rewards or penalties received for its actions (and biscuit/no biscuit if it were a dog!).


Key components of reinforcement learning

Let’s recap the key concepts before we delve into how RLHF works:

Agent: the algorithm or machine that takes actions that affect the environment. For example, if you’re building a machine to play Go, poker, or chess, the machine learning to play is the agent.
State: The agent's observation of the environment.
Action: The agent's decision or action taken in the environment based on its observation.
Environment: Every action that the RL agent takes directly affects the environment. In Go, the board is the environment; in poker or chess, it is the pack of cards or the chessboard. The environment takes the agent's present state (the observation) and action as input and returns a reward and a new state to the agent.

This is very important because the environment may have changed as a result of the agent’s action.

For example, a card played or taken by the system, a piece moved in Go or chess, or a ball or a child detected by a self-driving car necessarily changes the scenario, with a negative or positive effect on the whole situation. The arrangement of pieces on the board may have changed, and a ball or child on the road should trigger a set of decisions. In each case, the move or the new presence determines the next state and action, whether on the road, in the game, or on the board.
Reward: The feedback received by the agent from the environment after performing an action. Rewards can be positive or negative (for undesired actions) and do not necessarily come from humans. There are many scenarios in which we may want the machine to learn all by itself. In these cases, the only critic guiding the learning process is the feedback/reward it receives.
Policy: The strategy that defines how the agent selects actions given its current state, with the goal of maximizing the total cumulative reward.
Discount factor: A number between 0 and 1 that sets how much future rewards count compared with immediate ones. Because the future is uncertain, rewards expected further ahead are weighted less, which reduces the degree to which distant rewards affect our value function estimates.
Q-value or action-value: A measure of the overall expected reward if the agent is in a given state, takes a given action, and then plays until the end of the episode according to some policy.
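A short worked example may help with the last two concepts. The sketch below computes the discounted return of one episode, i.e. the sum of its rewards where the reward at step t is multiplied by the discount factor raised to the power t. A Q-value is the expected value of this quantity over many episodes. The function name is our own.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, with the reward at step t weighted by gamma ** t."""
    total = 0.0
    # Work backwards so each earlier step discounts everything after it once.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

print(discounted_return([1, 1, 1], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1, 1, 1], gamma=1.0))  # no discounting: 3.0
```

The smaller the discount factor, the less the later rewards contribute to the total.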

As you can begin to see, there may be numerous applications in industrial settings and development where Reinforcement Learning makes complete sense and becomes a very attractive option because of its ability to learn by itself.

How does Reinforcement Learning from Human Feedback (RLHF) work

In a typical RLHF setup, a "reward model" is first trained directly from human feedback. This reward model is trained to predict how much a human would reward an agent for a given action or behavior, and it can then be used to train the agent using reinforcement learning. The agent starts in an initial state and takes actions according to its policy. The environment responds to the agent’s actions by providing rewards and updating the state.

The agent then updates its policy based on the observed rewards and new state, and the process continues until a termination condition is met (the car has reached its destination, checkmate, optimum conditions to maximize the sale of shares, etc.)

A key difference once we add human feedback to Reinforcement Learning is that the agent learns to maximize the rewards predicted by the reward model. This allows the agent to learn from human feedback without anyone having to explicitly define a reward function; i.e., the effort is on matching human preferences and choices, which may not be exactly the “optimum” choices as ranked automatically. The result is typically a more “human-like” output and behavior.

The training process for RLHF typically consists of three core steps:

Pretraining a language model (LM): The initial model is pretrained on a large corpus of text data.
Gathering data and training a reward model: Human feedback is collected by asking humans to rank instances of the agent's behavior. These rankings can be used to score outputs, for example, with the Elo rating system. Other types of human feedback that provide richer information include numerical feedback, natural language feedback, edit rate, etc.
Fine-tuning the LM with reinforcement learning: The pretrained language model is fine-tuned using the reward model as a reward function, optimizing the agent's policy.
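The second step mentions turning human rankings into scores with the Elo rating system. Here is a minimal sketch of how that could work, assuming each ranking is first broken down into "this output beat that output" pairs. The K-factor of 32, the starting rating of 1000 and the function names are conventional or illustrative values chosen by us.

```python
from itertools import combinations

def pairs_from_ranking(ranking):
    """Turn a best-to-worst ranking into (winner, loser) pairs."""
    return list(combinations(ranking, 2))

def elo_scores(comparisons, k=32.0, start=1000.0):
    """comparisons: list of (winner, loser) pairs taken from human rankings."""
    ratings = {}
    for winner, loser in comparisons:
        r_win = ratings.setdefault(winner, start)
        r_lose = ratings.setdefault(loser, start)
        # Probability, under current ratings, that the winner is preferred.
        expected = 1.0 / (1.0 + 10 ** ((r_lose - r_win) / 400.0))
        # Surprising wins move the ratings more than expected ones.
        ratings[winner] = r_win + k * (1.0 - expected)
        ratings[loser] = r_lose - k * (1.0 - expected)
    return ratings

# A rater ranked three model outputs from best to worst.
scores = elo_scores(pairs_from_ranking(["output_a", "output_b", "output_c"]))
```

The resulting numbers can then serve as training targets for the reward model.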

RLHF has been applied to various domains of natural language processing, such as conversational agents, text summarization, and natural language understanding.

It has enabled language models to align with complex human values and improve their performance in user-specified tasks.

Reinforcement Learning algorithms

There are various reinforcement learning algorithms, such as Q-Learning, SARSA, and Deep Q Network (DQN), which differ in their approaches to learning the optimal policy, but those will be the subject of another article!!

Practical Applications of Reinforcement Learning

We now know that Reinforcement Learning allows agents (algorithms) to learn how to behave in an environment by trial and error. These AI agents can perform a wide range of tasks, including:

Natural language processing tasks, such as machine translation, text summarization, and question answering.
Robotics tasks, such as grasping objects and navigating through complex environments.
Game playing tasks, such as Go, poker, or chess.

RLHF is a powerful technique that can be used to train AI agents to perform a wide range of tasks, and it is likely to play an increasingly important role in the development of AI systems in the future.

Let’s see two examples of how RLHF can be used in very straightforward tasks.

How to train a chatbot with RLHF

1. We train a reward model to predict how much a human would reward the chatbot for a given response. The reward model is trained on a dataset of human feedback, where humans rate the quality of chatbot responses.
2. We initialize the chatbot with a random policy.
3. The chatbot interacts with the human user and receives feedback on its responses.
4. The chatbot uses the reward model to update its policy based on the feedback it received.
5. Steps 3 and 4 are repeated until the chatbot is able to consistently generate high-quality responses.
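The loop above can be sketched in a few lines of Python. This is a toy under strong assumptions: the "chatbot" can only choose between two canned replies, the reward model is a hypothetical hard-coded function standing in for a trained one, and the policy is updated with a basic policy-gradient rule (REINFORCE with a baseline) rather than a production algorithm. All names and parameter values are illustrative.

```python
import math
import random

def reward_model(reply):
    """Hypothetical stand-in for a trained reward model."""
    return 1.0 if "happy to help" in reply else -1.0

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_chatbot(replies, steps=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(replies)   # every reply equally likely at the start
    for _ in range(steps):
        probs = softmax(logits)
        # The chatbot responds by sampling a reply from its current policy.
        choice = rng.choices(range(len(replies)), weights=probs)[0]
        # The reward model scores that reply.
        reward = reward_model(replies[choice])
        # Baseline: the average reward the current policy expects to get.
        baseline = sum(p * reward_model(r) for p, r in zip(probs, replies))
        advantage = reward - baseline
        # Policy-gradient update: make better-than-average replies likelier.
        for i in range(len(logits)):
            grad = (1.0 if i == choice else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)

candidate_replies = ["I am happy to help with that.", "Figure it out yourself."]
final_probs = train_chatbot(candidate_replies)
```

After training, the policy assigns most of its probability to the reply the reward model prefers.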

How to train a chatbot to generate creative text formats

1. A large dataset of creative text formats is collected. This can be books, novels, specific documents from the legal profession or technical documentation.
2. A reward model is trained on this dataset to predict how good or bad a given creative text format is.
3. The chatbot is initialized with a random policy for generating creative text formats.
4. The chatbot interacts with the reward model by generating creative text formats and receiving rewards.
5. The chatbot's policy is updated using reinforcement learning to maximize its expected reward.
6. Steps 4 and 5 are repeated until the chatbot is able to generate creative text formats that are consistently rated as high-quality by humans.

User Scenarios for RLHF

Reinforcement Learning is a powerful tool that can be used to solve a wide range of real-world problems. It is still a relatively new technology, but it is rapidly developing and has the potential to revolutionize many industries and the way we train AI agents.

Industrial manufacturing: Reinforcement Learning is used to train robots to perform complex tasks in industrial settings, such as assembly line work and machine tending. This can help to reduce labor costs, improve product quality, and reduce downtime.
Self-driving cars: Reinforcement Learning is used to train self-driving cars to navigate the road and make decisions in real time. This can help to improve safety and efficiency.
Trading and finance: Reinforcement Learning is used to train algorithms to make trading decisions. This can help to improve returns and reduce risk.
Natural language processing (NLP): Reinforcement Learning is used to train NLP models to perform tasks such as question answering (the above chatbots), summarization, and translation. This can improve the performance of chatbots and other NLP applications.
Healthcare: Reinforcement Learning is being used to develop new methods for diagnosing and treating diseases. For example, Reinforcement Learning is being used to train robots to perform surgery and develop personalized treatment plans for patients.

Limitations of Reinforcement Learning from Human Feedback

RLHF is a powerful technique for training AI agents, but it has some limitations. One limitation is that it requires human feedback to train the reward model. This can be expensive and time-consuming to collect.

Scaling the process to train bigger and more sophisticated models is very time- and resource-intensive due to the reliance on human feedback.

Additionally, RLHF can be difficult to implement and tune.

Techniques for automating or semi-automating the feedback process may help address this challenge.