A Brief Explainer on the Simplest Form of AI Alignment

April 28, 2025

What is reinforcement learning with human feedback (RLHF)? How can we systematically think about its abilities and limitations as an approach to AI alignment?

What is Reinforcement Learning?

If you want to understand a machine learning algorithm, and you can only ask an expert one question, you should always ask, “what is the objective function?” While there are plenty of other important metrics for understanding an algorithm (data, update process, parameters, etc.), the objective function is the single easiest way to understand what an algorithm is designed to do.

Most machine learning methods fall under the category of “supervised learning.” Supervised learning describes any algorithm that takes in data and predicts labels for that data. Think of trying to predict a digit from 0 to 9 given images of handwritten numbers or predicting the median income of a town given its geography, history, and demographics. In each of these cases, the objective function is some variation of: minimize the sum of differences between the labels and predicted values.

This approach is incredibly useful for solving problems. But to understand where it falls short, let’s think about how you would build an algorithm to win a game of chess. The supervised learning approach suggests that we do something like: get a bunch of chess positions and label them as “wins” or “losses” based on the final outcome in a game that included that position. But when you play chess, you are not just trying to learn whether you’re winning or not; you’re trying to figure out what to do next! This is the crux of the difference between supervised learning and a different class of methods called reinforcement learning. Reinforcement learning methods help us when we are trying to learn actions to take over time. The objective function for these methods, in alignment with that goal, takes some form of: maximize the discounted sum of future rewards.

Reinforcement Learning from Human Feedback

A common (but slightly mistaken) description of the behavior of large language models is to predict the next word given a large corpus of text. To give a quick sketch of why this might lead to intelligent behavior, imagine you are tasked with predicting the last word in an Agatha Christie “whodunit” mystery:

“…And the detective explained to the family, waiting with baited breath, that the man who killed their daughter was [insert word].”

As anyone who has read or watched this type of fiction knows, this can often be a challenging task. One can argue that a correct answer (without access to an exact example like it) would require grammar, causality, sociology, and even theory of mind. And while many words on the internet are much easier to predict than that one, perhaps you get the picture. This task is very similar to the supervised learning examples described above. The architecture used for these language models, the transformer, is an interesting hybrid model which uses “self-supervised” learning, where any of the words can act as data or a label.

Interestingly, this doesn’t lead to a particularly useful piece of technology. To turn this auto-complete engine into a chatbot, AI assistant, or AI agent, researchers use two methods, supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF) to train the model how to speak and act.¹ Loosely, one can think of SFT as locating the model as “something that takes in requests and produces responses,” and RLHF as training the model about what constitutes a good response.

The process of RLHF consists of four main steps:

Have the SFT model produce pairs of text completions given prompts chosen by the model providers.
Have human annotators label one of these responses as preferred according to a set of criteria given to them by the model providers.
Train another model to predict which response will be preferred by humans.
Use reinforcement learning to teach the SFT model to produce words early in a response that maximizes the reward at the end of a response.

Ethical Tensions with RLHF

There are many perspectives from which to view the tradeoffs and tensions arising from this process. For this analysis, I will attempt to distinguish between short run, immediate problems in RLHF deployments for chatbots, and more speculative long run issues with RLHF and AI agents.

Short Run Tensions

One of the challenges with RLHF, in the short run, is that it is being used as both a mechanism for ensuring the “safety” of large language model outputs and also the “alignment” of these outputs with myriad human preferences when it comes to tasks such as planning, summarization, criticism, and research.

A common refrain describes the goal of RLHF as a safety mechanism for AI models: helpful, honest, harmless (HHH). This framework, devised by researchers at Anthropic in 2022, suggests that AI models should not respond in ways that are irrelevant, deceptive, or broadly emotionally harmful and/or toxic. But this translates, in the training process for RLHF, to giving models a number of examples which display these behaviors and hoping that the model has generalized these examples to broad “policies” which would relate to unseen and ambiguous examples. As a result, current RLHF processes make models extremely vulnerable to jailbreaking by persistent and creative users. They also lead directly to model sycophancy and contribute to hallucinations, since the model has been trained to diverge from its own internal model of a question to produce answers that are satisfying to a human.

But this is only half the problem. In addition to these safety concerns, RLHF necessarily flattens the diversity of human preferences into a single (albeit complex) reward signal that the model is learning from. Given that the preference data from human annotators used in RLHF is almost certainly unrepresentative, this means that LLM outputs are actively reinforcing the preferences and perspectives of majority populations, likely leading to many different cultural, political, and demographic biases. While strategies for incorporating “memory” into LLM chat history will lead to higher variability in these biases for individual users, the result of reduced diversity of outputs will almost certainly remain.

Long Run Tensions

In a 1975 speech on a forthcoming article “Problems in Monetary Management,” British economist Charles Goodhart, in describing Margaret Thatcher’s myopic monetary policy, argued that “any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” In reinforcement learning, the concept of “reward misspecification” is used to describe a similar phenomenon, where a model “overfits to a proxy reward” in training, leading to unexpected or suboptimal outcomes in a real world deployment. In RLHF, we have a “final” goal of teaching models to be helpful, harmless, and honest, but the data that we give to models to learn these abstract ideals are only relative preferences of human beings for different model outputs given a set of instructions.

You don’t have to be a leading computer scientist to see how this can go awry. A paper from Anthropic describes model evaluations where, in a small number of cases, Claude responds to a challenging coding task by changing unit tests rather than attempting to find a correct solution. The training for RLHF means that deceptive strategies, in addition to sycophantic ones, will yield high reward for a model.

This concern is widely discussed in the computer science community, although often not with the appropriate level of humility. A less discussed long run implication comes from the design of RLHF preference datasets. In almost all setups, reward models learn to select better from worse responses. Implicitly, we are training models to view all outputs through the utilitarian framing of “less” and “more” preferred. While this is different than saying we are training models to be “utilitarians,” the utilitarian nature of RL may lead models to, at a certain level of complexity, internalize this reward structure, leading to behavior that diverges from the diversity of human values related to rules and virtues.

Conclusion

A reasonable question to ask in light of this analysis is: “is there any other option?” Even if you have no fears of misaligned superintelligence, it is almost certainly the case that there is no “neutral” approach without some kind of alignment training. A decision must be made, and real world data on human preferences seems about as a good a place to start as any other.

I don’t have a good answer to this rebuttal. But I think it is important to acknowledge the key implications of RLHF for LLMs. First, RLHF means that companies are capable of systematically instilling values in models that go beyond the biases in the training data on the internet. Second, the derivation of rewards from human preferences for responses leads directly to sycophancy and reduced diversity in model responses. Third, and by the same logic, RLHF encourages deception in LLMs.

It is possible that new strategies allow us to move beyond RLHF to resolve many or all of these issues. As just one example, recent work by researchers at Princeton University indicates that we may be able to use simulations to reduce some of the short and long run alignment challenges in RLHF.

I suspect, however, that these issues with RLHF are here for the long run. Even if there is no route to mass deployment of language models that does not involve human preference training, it is important for policymakers to understand the policy choices that are currently being made opaquely by companies in designing these models. These are questions that are not clearly about computer science expertise; yet they are primarily happening in AI companies and computer science labs. In closing, from an outsider’s perspective, it appears that we currently have companies raising at $300 billion valuations, with the explicit goal of designing AGI, who view “alignment” as the last mile of their training process. Could there be something fundamentally backwards in this way of thinking?

Another key element to this progress is chain-of-thought (COT) reasoning, which I will not be discussing here. Importantly, COT also relies on reinforcement learning, so many of the arguments will hold. ↩