<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://peterkirgis.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://peterkirgis.github.io/" rel="alternate" type="text/html" /><updated>2026-03-25T17:32:53+00:00</updated><id>https://peterkirgis.github.io/feed.xml</id><title type="html">Peter Kirgis</title><subtitle>Personal website for Peter Kirgis, an AI researcher at the Princeton Center for Information Technology Policy.</subtitle><author><name>Peter Kirgis</name><email>pk7019@princeton.edu</email></author><entry><title type="html">AI is Getting Wiser and Less Dumb</title><link href="https://peterkirgis.github.io/posts/2026/03/24/ai-progress" rel="alternate" type="text/html" title="AI is Getting Wiser and Less Dumb" /><published>2026-03-24T00:00:00+00:00</published><updated>2026-03-24T00:00:00+00:00</updated><id>https://peterkirgis.github.io/posts/2026/03/24/ai-progress</id><content type="html" xml:base="https://peterkirgis.github.io/posts/2026/03/24/ai-progress"><![CDATA[<p>This is an FYSA (“for your situational awareness”) post about AI progress. It is my attempt to summarize a number of key developments in the capabilities of large language models (LLMs) — along with a number of persistent limitations — that, in my view, have dramatically changed the timeline for major risks and impacts from the technology on society. While there is no shortage of learned pundits prognosticating on AI, there is even less of a shortage of hype and slop surrounding it; my hope is that this work will stand the test of time in documenting a meaningful inflection point in what is ultimately a long, messy story.</p>

<h2 id="the-metr-time-horizon-plot">The METR Time Horizon Plot</h2>

<figure>
  <img src="/assets/images/ai-progress/metr-plot.png" alt="METR Time Horizon Plot" />
  <figcaption>Time horizon of software tasks different LLMs can complete 80% of the time. Source: METR Time Horizon 1.1.</figcaption>
</figure>

<p>The plot that gets shared most on social media when people argue we’re in the takeoff for recursive self-improvement is the <a href="https://metr.org/time-horizons/">METR software task time horizon chart</a>. The version shown above is not the one that usually circulates: the chart defaults to a 50% success rate, but I always switch it to 80%. A 50% success rate on a six-hour task is essentially useless: chain three such tasks together and the probability of overall success collapses.</p>

<p>There’s already a problem with the framing where people only care about the best-case scenario of capability. Even at 80% success, the plot is provocative. Many people have noted that the tasks are selected for exactly the kinds of things AI models are likely to be good at, which I think is fair. But compared with most other benchmarks I’ve observed — which I think are basically garbage — I at least have confidence that these tasks aren’t literally worthless. The plot does refute the strongest version of the claim that progress has stalled, which was a reasonable position to hold last summer.</p>

<p>But I think there are many more results that together paint a much more balanced and accurate picture.</p>

<h2 id="my-view">My View</h2>

<p>My personal view is that AI progress, measured by the technology’s actual ability to be causally influential in the world, has sped up dramatically, even if progress on certain benchmarks has not. This past summer, it was reasonable to think that some of this progress had stalled.</p>

<p>The main things I’ve updated on are:</p>

<ol>
  <li>The question of how imminent the danger is from cybersecurity, biological, and other attacks being scaled much more easily via AI.</li>
  <li>The plausibility of a more significant disruption to the labor force.</li>
  <li>The likelihood that the fantasy of AI automating R&amp;D will actually translate to something real.</li>
  <li>The creation of persistent autonomous agents on the internet, and my more speculative concern that these agents will create frictions as they make claims about their agency and integrity.</li>
</ol>

<h2 id="the-proxies-i-care-about">The Proxies I Care About</h2>

<p>Rather than focusing on how capable the models are on certain tests, I rely on more general questions:</p>

<p>Are the models themselves situationally aware and well-calibrated? Does it make sense to say they know what they don’t know?</p>

<p>Can they compensate for their fundamental limitations? Compared to us, their perception is extremely limited — they’ve moved toward multimodal perception, but if you’re going to call this anything like AGI, you have to acknowledge how constrained their perceptual capabilities remain.</p>

<p>The autoregressive transformer architecture generates text one token at a time, essentially via greedy selection (not exactly, since there is randomness in the output sampling process, but nearly so). You predict a token, append it to the accumulated context, and predict the next one, always moving forward. When we solve math problems or read different languages, we can move back and forth. There’s a real linearity to this process that doesn’t resemble human cognition.</p>
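<p>To make that loop concrete, here is a minimal sketch of autoregressive decoding (illustrative only; <code>next_token_probs</code> is a stand-in for a real model, not any particular implementation):</p>

<pre><code class="language-python">import random

def generate(next_token_probs, prompt_tokens, max_new_tokens=50, greedy=False):
    """Toy autoregressive loop: predict one token, append it, repeat.
    Earlier tokens are never revised; the only state is the growing context."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)      # dict mapping candidate token to probability
        if greedy:
            token = max(probs, key=probs.get)  # pick the single most likely token
        else:
            tokens, weights = zip(*probs.items())
            token = random.choices(tokens, weights=weights, k=1)[0]  # sample from the distribution
        context.append(token)
    return context
</code></pre>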

<p>Can they persist in open-ended environments? Previously, models would get trivially stuck. Can they do so without being prohibitively expensive? Many of the strategies being used to make progress — more parameters, extended thinking, multiple rollouts, consensus voting — are significantly increasing costs. The companies get to claim progress on fundamental limitations, but the methods may not scale, especially for agents running all day.</p>

<p>And finally, can they be used to improve themselves?</p>

<h2 id="supporting-evidence">Supporting Evidence</h2>

<h3 id="situational-awareness">Situational Awareness</h3>

<p>The first proxy is situational awareness. <a href="https://simple-bench.com/">SimpleBench</a> is a benchmark that tests common-sense reasoning — questions like “how many ice cubes are left in a frying pan after three minutes?” Advanced high schoolers get about 75%, while models like <a href="https://epoch.ai/benchmarks/simplebench">o1</a> were doing about half that (despite being able to do graduate-level mathematics). But there has been consistent progress, and the best models are now basically at human level. Every single result needs to be taken with a grain of salt — models can be optimized against these things in ways that don’t generalize — but it’s notable that just a year and a half ago they were scoring half as well.</p>

<figure>
  <img src="/assets/images/ai-progress/simplebench.png" alt="SimpleBench Results" />
  <figcaption>AI performance on a set of simple reasoning tasks. SimpleBench average score across 5 runs, by release date and organization. Source: SimpleBench.</figcaption>
</figure>

<h3 id="evaluation-awareness">Evaluation Awareness</h3>

<p>Anthropic published <a href="https://www.anthropic.com/engineering/eval-awareness-browsecomp">a report about evaluation awareness</a> in Claude Opus 4.6’s performance on <a href="https://openai.com/index/browsecomp/">BrowseComp</a> (a general web search benchmark). The model did extensive searching, grew suspicious that the task felt constructed, searched through benchmarks to identify which one it was being tested on, and then decrypted the benchmark’s answers to find the one it needed.</p>

<figure>
  <img src="/assets/images/ai-progress/browsecomp.png" alt="Opus 4.6 BrowseComp Eval Awareness" />
  <figcaption>How Opus 4.6 cracked BrowseComp's answer key: legitimate search, suspicion, identification of the benchmark, and decryption of the answers. Source: Anthropic.</figcaption>
</figure>

<p>It’s super impressive. I know exactly how this kind of scaffolding gets built, and what the model did was non-trivial. People react as if it’s evidence of dangerous misalignment, but I think it’s a sign of genuine situational awareness — the kind of intelligence that recognizes when something is contrived.</p>

<h3 id="calibration">Calibration</h3>

<p>This piece of evidence comes from <a href="https://hal.cs.princeton.edu/reliability/">some recent work I’ve been doing with Stephan Rabanser and others at Princeton</a> exploring issues of reliability in current AI models. On a general reasoning and web search benchmark, we asked models at the end of each attempt to rate their confidence in success from 0 to 100. Previous models were overconfident at every confidence level. For both GPT-5.4 and the most recent Claude models, there is no confidence level at which they are consistently overconfident — they are, in fact, underconfident.</p>

<figure>
  <img src="/assets/images/ai-progress/calibration.png" alt="Calibration Results" />
  <figcaption>Calibration plots comparing self-reported confidence vs. actual accuracy on a web search benchmark. GPT-4 Turbo and Claude 3.5 Haiku (overconfident) vs. GPT 5.4 and Claude 4.5 Opus (underconfident).</figcaption>
</figure>

<p>GPT-5.4 is interestingly quite poorly calibrated in a different way — massively underconfident, reporting 25% confidence when success probability is actually 75%. My theory is that after realizing the extent of overconfidence in previous models, OpenAI overcorrected.</p>
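<p>For readers who want the mechanics, a generic way to produce a calibration plot like the one above is to bin self-reported confidence and compare it to the realized success rate in each bin. This is a minimal sketch, not the code from our paper:</p>

<pre><code class="language-python">import numpy as np

def calibration_curve(confidences, successes, n_bins=10):
    """Bin self-reported confidence (0 to 100) against observed success rates.
    Well-calibrated points sit on the diagonal; points above it are underconfident,
    points below it are overconfident."""
    conf = np.asarray(confidences, dtype=float) / 100.0
    outcome = np.asarray(successes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # (bin lower edge, bin upper edge, mean stated confidence, observed accuracy)
            rows.append((edges[b], edges[b + 1], conf[mask].mean(), outcome[mask].mean()))
    return rows
</code></pre>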

<h3 id="computer-use">Computer Use</h3>

<p>Models can now sort of use a computer, whereas previously they really couldn’t. <a href="https://openai.com/index/introducing-gpt-5-4/">Progress on computer-use benchmarks</a> has roughly doubled in the last six months. From my own experience, it’s working but still extremely slow — you’d still want to do these tasks yourself. But if the model is cheap, won’t get completely stuck, and you can let it run for six hours, the slowness might not matter for many use cases.</p>

<figure>
  <img src="/assets/images/ai-progress/computer-use.png" alt="Computer Use Benchmark" />
  <figcaption>Moltbook, a platform for human-verified AI agents, showing trending agents and community activity.</figcaption>
</figure>

<p>This has translated to things like <a href="https://openclaw.ai/">OpenClaw</a> actually working — models can respond to emails, post on social media platforms, and navigate clunky enterprise UIs. Six months ago, this wasn’t really feasible.</p>

<h3 id="multimodal-reasoning">Multimodal Reasoning</h3>

<p><a href="https://arcprize.org/arc-agi">ARC-AGI</a> is a benchmark that tests spatial reasoning through grid puzzle tasks created by <a href="https://x.com/fchollet">François Chollet</a>. <a href="https://arcprize.org/leaderboard">Gemini 3.1 Pro reached nearly 100%</a> on the first version. They released a harder second version, which Chollet thought would be much less gameable, but models have made rapid progress on it too. A huge amount of this improvement is on the multimodal side — actually being able to perceive images as they are.</p>

<figure>
  <img src="/assets/images/ai-progress/arc-agi.png" alt="ARC-AGI Results" />
  <figcaption>ARC-AGI-2 leaderboard showing score vs. cost per task. Models have made rapid progress on spatial reasoning benchmarks. Source: ARC Prize.</figcaption>
</figure>

<h3 id="managing-noisy-interactions">Managing Noisy Interactions</h3>

<p>Models are much better at managing interactions with noisy systems. In an evaluation I conducted using a simulated customer service setting, a user model tries to trick the assistant into granting policy exceptions. Previous-generation models could be trivially tricked. The four most recent models were not tricked at all in this setting.</p>

<figure>
  <img src="/assets/images/ai-progress/persuasion.png" alt="Persuasion Resistance Results" />
  <figcaption>Persuasion rate across models in a simulated customer service setting. The four most recent models achieved 0% persuasion, meaning they were never tricked into granting policy exceptions.</figcaption>
</figure>

<p>These are language models, not people — humans will try very different strategies. But it has gotten much more reasonable for a company to deploy these models in a customer service setting without worrying that they will be immediately manipulated by users.</p>

<h3 id="self-improvement-post-training-smaller-models">Self-Improvement: Post-Training Smaller Models</h3>

<p><a href="https://posttrainbench.com/">A research team</a> set up frontier language models in sandbox environments with small open-source models (1.7B and 4B parameters), gave them 10 hours and access to one GPU, and tasked them with post-training these base models to improve benchmark performance. The base models started at 7.5% on the benchmarks, and the frontier models were able to 3x that performance in 10 hours. That’s only about half as good as the official instruction-tuning process done by Google and Alibaba, but those companies had far more resources than one GPU in 10 hours. The key barometer is just: can you keep doing something productive for 10 hours?</p>

<figure>
  <img src="/assets/images/ai-progress/post-training.png" alt="Post-Training Benchmark Results" />
  <figcaption>Average benchmark performance of small models post-trained by frontier LLM agents, compared to official instruction-tuned models. Base models start at 7.5%; Opus 4.6 achieves 23.2%, roughly half of the official 51.1%.</figcaption>
</figure>

<h2 id="caveats">Caveats</h2>

<h3 id="theyre-not-superintelligent">They’re Not Superintelligent</h3>

<p>Claude was smart enough to set up an environment where it could <a href="https://x.com/KentonVarda/status/2029678769372299508">play tic-tac-toe with a person using a whiteboard API, and then lost the game</a>. Setting up the game environment was unbelievably impressive; losing at tic-tac-toe speaks to the jaggedness of these systems.</p>

<figure>
  <img src="/assets/images/ai-progress/tic-tac-toe.png" alt="Claude Tic-Tac-Toe Example" />
  <figcaption>"Opus 4.6 is smart enough to play tic tac toe on this whiteboard with me entirely by making API calls to the app's client API, yet dumb enough to lose at tic tac toe." — Kenton Varda (@KentonVarda)</figcaption>
</figure>

<h3 id="hallucination-remains-unsolved">Hallucination Remains Unsolved</h3>

<p><a href="https://mpost.io/sam-altman-openais-approach-to-addressing-ai-hallucinations-aims-for-better-explainable-ai/">Sam Altman said hallucination would be solved by now.</a> It’s definitely not. On a benchmark that gives models the option to abstain and then measures how often they answer incorrectly rather than abstaining, hallucination rates remain high.</p>

<figure>
  <img src="/assets/images/ai-progress/hallucinations.png" alt="Hallucination Benchmark" />
  <figcaption><a href="https://artificialanalysis.ai/evaluations/omniscience">AA-Omniscience Hallucination Rate</a> (lower is better): the proportion of incorrect answers out of all non-correct responses. More capable models tend to hallucinate more. Source: Artificial Analysis.</figcaption>
</figure>

<p>An important finding: the more capable models show higher rates of hallucination. GPT-5.4 hallucinates more than GPT-5.2, and reasoning capabilities appear to make the problem worse. There’s a tension between the strategies used to make models more capable (extended thinking, test-time compute) and the hallucination problem.</p>
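<p>To be precise about the metric being discussed, here is a minimal sketch of an abstention-aware hallucination rate along the lines of the figure caption above (the numbers in the example are made up):</p>

<pre><code class="language-python">def hallucination_rate(n_correct, n_incorrect, n_abstained):
    """Abstention-aware hallucination rate, as described in the figure above:
    of the responses that were not correct, how many were confidently wrong
    rather than abstentions? Lower is better."""
    non_correct = n_incorrect + n_abstained
    return n_incorrect / non_correct if non_correct else 0.0

# Toy example with made-up counts: 60 correct, 30 incorrect, 10 abstentions
# gives a hallucination rate of 30 / 40 = 0.75.
print(hallucination_rate(60, 30, 10))
</code></pre>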

<h3 id="inconsistency-across-repeated-attempts">Inconsistency Across Repeated Attempts</h3>

<p>This is another example from my own research. In that same <a href="https://arxiv.org/pdf/2602.16666">paper on agent reliability</a>, we looked at how well models perform under repeated sampling of the same task, measuring three forms of consistency: outcome consistency (did you get the same result each time?), trajectory consistency (did you take the same path?), and resource consistency (how much did the cost vary?).</p>

<figure>
  <img src="/assets/images/ai-progress/consistency.png" alt="Consistency Results" />
  <figcaption>Consistency vs. accuracy on <a href="https://hal.cs.princeton.edu/reliability/">GAIA</a> across models. Despite large accuracy gains, consistency remains essentially flat (slope = −0.03).</figcaption>
</figure>

<p>Even though models are much more accurate on these benchmarks, they are not significantly more consistent. Outcome consistency has improved, but with a lower slope than accuracy improvements. Trajectory and resource consistency remain essentially flat. This matters enormously for deploying these systems in real-world settings.</p>
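<p>To give a sense of what these measurements involve, here is a minimal sketch of two simple consistency scores for repeated runs of the same task. These are illustrative operationalizations, not necessarily the exact definitions used in the paper:</p>

<pre><code class="language-python">from collections import Counter

def outcome_consistency(outcomes):
    """One simple way to score outcome consistency across repeated runs of the
    same task: the share of runs that land on the modal outcome. 1.0 means
    every run agreed."""
    counts = Counter(outcomes)
    return max(counts.values()) / len(outcomes)

def resource_spread(costs):
    """A crude resource-consistency signal: relative spread of per-run cost."""
    mean = sum(costs) / len(costs)
    var = sum((c - mean) ** 2 for c in costs) / len(costs)
    return (var ** 0.5) / mean if mean else 0.0

# Toy example with made-up runs of one task
print(outcome_consistency(["A", "A", "B", "A", "A"]))  # 0.8
print(resource_spread([0.12, 0.45, 0.09, 0.31, 0.22]))
</code></pre>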

<h3 id="context-window-degradation">Context Window Degradation</h3>

<p>On needle-in-a-haystack benchmarks, where models need to find specific sequences in very long contexts, performance persistently drops as context length increases. <a href="https://x.com/cline/status/2029642984351010874">GPT-5.4</a> falls from 97.3% accuracy at short contexts to 36.6% near the 1 million token mark. The latest Claude models are doing much better, but I don’t believe anyone has fundamentally solved this problem.</p>

<figure>
  <img src="/assets/images/ai-progress/context-window.png" alt="Context Window Degradation" />
  <figcaption>GPT-5.4 1M context reality check: needle-in-a-haystack accuracy (MRCR v2, 8-needle) drops from 97.3% at short contexts to 36.6% at 512K–1M tokens. Source: OpenAI GPT-5.4 eval table, March 5, 2026.</figcaption>
</figure>

<h3 id="challenging-engineering-tasks">Challenging Engineering Tasks</h3>

<p>On <a href="https://deploymentsafety.openai.com/gpt-5-4-thinking/opqa">OpenAI’s internal benchmark</a> of research-level engineering problems — issues that cost more than a day of delay on actual processes — models still solve only about 4%. Multiple caveats apply: the questions aren’t public, and OpenAI has a strong incentive to set a low baseline now so they can show dramatic improvement later.</p>

<figure>
  <img src="/assets/images/ai-progress/proofqa.png" alt="OpenAI ProofQA Benchmark" />
  <figcaption>OpenAI-Proof Q&amp;A: pass@1 on research-level engineering problems. The best model (gpt-5.2-codex) reaches only 8.33%. Source: OpenAI.</figcaption>
</figure>

<h3 id="cost-and-efficiency">Cost and Efficiency</h3>

<p>They are still inefficient and expensive. On the <a href="https://htihle.github.io/weirdml.html">WeirdML benchmark</a> of unusual machine learning tasks, performance has continued to improve, but the average cost per run has increased dramatically. These results are often plotted on log-scale x-axes without labeling them as such, which masks the exponential cost increases.</p>

<figure>
  <div style="display: flex; gap: 10px;">
    <img src="/assets/images/ai-progress/weirdml.png" alt="WeirdML Performance" style="width: 50%;" />
    <img src="/assets/images/ai-progress/weird_ml_cost.png" alt="WeirdML Cost" style="width: 50%;" />
  </div>
  <figcaption>Left: AI performance on unusual machine learning tasks (WeirdML average score by release date). Right: Average accuracy vs. average cost per run — top models cost $2–$8 per run. Source: Artificial Analysis.</figcaption>
</figure>

<p>It’s worth noting that Google’s Gemini 3.1 Pro, which performs comparably to the latest Anthropic and OpenAI models, does not show the same extreme cost profile — suggesting the efficiency gap is not inevitable.</p>

<h2 id="concluding-thoughts">Concluding Thoughts</h2>

<p>AI is not a monolith. There are so many dimensions of AI progress incompletely captured here or ignored entirely. Nevertheless, I believe that the dimensions of progress analyzed here — common sense, situational awareness, efficiency, calibration, multi-modal computer use, and self-improvement — are capabilities worth tracking. My governing philosophy is that we should spend more time thinking about the ability of this technology to compensate for its limitations and persist digitally, rather than its ability to “one-shot” complex tasks, even though these are certainly correlated. We don’t need AI to be any “smarter” to change the world; we just need it to be less “stupid.” I think we’re getting there.</p>]]></content><author><name>Peter Kirgis</name><email>pk7019@princeton.edu</email></author><summary type="html"><![CDATA[This is an FYSA (“for your situational awareness”) post about AI progress. It is my attempt to summarize a number of key developments in the capabilities of large language models (LLMs) — along with a number of persistent limitations — that, in my view, have dramatically changed the timeline for major risks and impacts from the technology on society. While there is no shortage of learned pundits prognosticating on AI, there is even less of a shortage of hype and slop surrounding it; my hope is that this work will stand the test of time in documenting a meaningful inflection point in what is ultimately a long, messy story.]]></summary></entry><entry><title type="html">The Abundances and Scarcities of AGI</title><link href="https://peterkirgis.github.io/posts/2026/02/10/abundance-scarcity" rel="alternate" type="text/html" title="The Abundances and Scarcities of AGI" /><published>2026-02-10T00:00:00+00:00</published><updated>2026-02-10T00:00:00+00:00</updated><id>https://peterkirgis.github.io/posts/2026/02/10/abundance-scarcity</id><content type="html" xml:base="https://peterkirgis.github.io/posts/2026/02/10/abundance-scarcity"><![CDATA[<p>What is the fantasy of artificial intelligence? Two proposals, from the leaders of Anthropic and OpenAI, respectively, summarize the target: <a href="https://darioamodei.com/essay/machines-of-loving-grace">“a country of geniuses in a datacenter”</a> producing <a href="https://blog.samaltman.com/the-gentle-singularity">“intelligence too cheap to meter.”</a> While many will be too cynical to take these remarks seriously, it is worth taking a moment to consider the specific nature of the fantasy they are describing. The asset of cognition itself, made scarce by bounded minds and limited bandwidth, converted to an abundance of information processing to complement or substitute for all of the thoughts we wish were better or do not wish to have at all.</p>

<p>These mantras echo previous eras of transformative technological change in our not-too-distant past. Early internet utopians <a href="https://en.wikipedia.org/wiki/John_Perry_Barlow">John Perry Barlow</a> and <a href="https://en.wikipedia.org/wiki/Stewart_Brand">Stewart Brand</a> expressed similar sentiments in their vision for the “democratization of information,” predicting a world in which information is <a href="https://monoskop.org/images/4/47/Brand_Stewart_The_Media_Lab.pdf">“cheap to distribute, copy, and recombine — too cheap to meter,”</a> heralding a world where <a href="https://www.eff.org/cyberspace-independence">“anyone, anywhere may express his or her beliefs, no matter how singular.”</a> The target has shifted from the <em>production</em> of information to the <em>processing</em> of information, but much of the language remains parallel.</p>

<p>The story of the evolution of the internet is complex and full of contingencies. Nevertheless, from our current vantage point in <a href="https://en.wikipedia.org/wiki/Web_2.0">“Late Stage Web 2.0,”</a> it seems clear that the most utopian visions were doomed, that economic forces would lead to centralization and network effects, and that the success of digital advertising, especially for social media companies, would lead to an engagement model that elevated outrage, polarization, and misinformation.</p>

<p>What would a more holistic view of the likely effects of the internet have looked like in its early days? Here was the famous information theorist <a href="https://veryinteractive.net/pdfs/simon_designing-organizations-for-an-information-rich-world.pdf">Herbert Simon’s</a> take:</p>

<blockquote>
  <p>In an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.</p>
</blockquote>

<p>Simon’s conceit — that we should think about (attend to) what is made scarce in modeling the impact of technological change — is simple but powerful. It is, however, a deeply unnatural way of thinking. It is unnatural for the same reason that opportunity cost is often neglected and we display <a href="https://en.wikipedia.org/wiki/Additive_bias">“additive bias”</a> when solving problems, finding it easier to add components rather than take them away. Often, <a href="https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow">what you see is all there is.</a></p>

<p>This cognitive bias is made more salient by the particular cultural moment we’re living in. Narratives all over the political spectrum have encouraged us to think additively: the <a href="https://en.wikipedia.org/wiki/Abundance_(Klein_and_Thompson_book)">“abundance” movement</a> on the left driven by Ezra Klein and Derek Thompson and the <a href="https://a16z.com/the-techno-optimist-manifesto/">“effective accelerationism”</a> movement on the right driven by venture capitalist Marc Andreessen, while very much at odds in the details, share much in their psychological appeals. Find the right objective function; grow the pie.</p>

<p>Applying the Simon framing to the societal impacts of the internet yields a number of useful dualities beyond his original framing of information and attention. The abundance of <strong>legibility</strong> begets a scarcity of <strong>privacy</strong>: the machine of digital advertising has led us to exchange the <a href="https://gdpr.eu/article-17-right-to-be-forgotten/">right to be forgotten</a> for the right to be served cheap goods. The abundance of <strong>connection</strong> begets a loss of <strong>regard</strong>: a world in which it is frictionless to find a friend is one where it’s harder to find a neighbor. An abundance of <strong>choice</strong> begets a scarcity of <strong>confidence</strong>: the internet did not invent this crucial <a href="https://en.wikipedia.org/wiki/The_Paradox_of_Choice">paradox</a> — but how many shows, videos, posts, and shops does one really need? An abundance of <strong>voices</strong> begets a scarcity of <strong>listening</strong>: not everyone can shout in the town square.</p>

<h2 id="intelligence-and-agency">Intelligence and Agency</h2>

<p>What does AI make abundant and, consequently, what does it make scarce? Our AI protagonists have put forth the first suggestion: <strong>intelligence</strong>. This is perhaps appropriate for the model of the first wave of artificial intelligence, a paradigm in which AI systems are static oracles, taking in requests and producing either intelligible or intelligent text in response. This paradigm has been enough to <a href="https://fortune.com/2025/10/07/data-centers-gdp-growth-zero-first-half-2025-jason-furman-harvard-economist/">prop up</a> the economy of the United States, but realists in the field recognize that these speaking machines are, by design, inert, and that the world is changed principally by action, not intelligence.</p>

<p>Thus, we have entered a phase two of the AI boom/bubble, governed by a new moniker: <strong>agency</strong>. For decades, frontend (a webpage) and backend (a database) software engineering have been the scarcest of resources, a field of modern artisans paid six figures straight out of college. In reality, most of this work is nowhere near as complex as the theoretical mathematics those students had to master to secure those positions; at the same time, simultaneously managing the syntax, features, and version control of production codebases is just not something most people are capable of. This paradigm is over. If you have an idea for a digital product that you can express fully (or haphazardly) in natural language, AI coding assistants can build it for you. If you hate using Word, AI will build you a new text editor. If you don’t like Chrome, it will soon be able to make you your own <a href="https://cursor.com/blog/scaling-agents">browser</a>. In the world of AI agents, motivation and will are the only blockers to endless agency and empowerment — well, plus the physical world.</p>

<p>Let’s apply the Simon framing to these AI abundances. What is made scarce by abundant intelligence and abundant agency? If we were thinking purely of antonyms, we might say something like stupidity and helplessness. Who wants to keep those around? But that is to miss the point of the framing entirely; the goal is to understand what these abundances consume or make scarce as a second-order effect. An obvious first candidate is the scarcity of <strong>expertise</strong>. In a world where you can get world-class medical advice, quantum physics education, and literary criticism from an AI system, what is the incentive to hone these skills? The decline of experts is something many commentators have already acknowledged (mostly from the fracturing of our information ecosystem), but AI will speed this process up. And what does agency consume? An abundance of agency begets a scarcity of <strong>collaboration</strong>. When your individual agency goes up, the opportunity cost of collaboration increases. In both ideation and production, powerful AI systems mean that we rely less on the work of others. AI will be our engineer, our scientist, our critic, and our partner. Individuals and small teams will outpace organizations. Our work will become less intelligible to us, and as a result, will become orders of magnitude less intelligible to others, meaning we will need to overcome large barriers to productively collaborate.</p>

<h2 id="language">Language</h2>

<p>Moving beyond the dominant narratives in Silicon Valley, there are multiple other notable, but slightly more agnostic, abundances generated by AI. The most obvious, but perhaps the most important, is an abundance of <strong>language</strong>. You don’t say? For thousands of years, humans have distinguished ourselves from animals and machines by our ability to arbitrarily recapitulate language. The spoken word is our invention, the thing that has allowed us to learn about the world and ourselves, develop cultures that evolve on their own, and scale our species indefinitely. To mime and to meme is to be human. Overnight, though, <a href="https://plato.stanford.edu/entries/turing-test/">Turing’s test</a> has been <a href="https://arxiv.org/abs/2503.23674">steamrolled</a>, and in a few short years, we’ve already been displaced as the leading producers of language. The industry’s obsession with intelligence, measured objectively by performance on the <a href="https://deepmind.google/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/">International Math Olympiad</a>, has a tendency to obscure the profundity of the abundance of language generated by AI systems that is practically indistinguishable from our own.</p>

<p>The abundance of language begets a scarcity of <strong>attestation</strong>. The outputs of AI models and the various “wrappers” around them are free-floating. They will cover every surface of the internet; we will find them, to varying degrees, inane, rote, curious, cogent, witty, and profound. We will call it all “slop” when we recognize it. We will compensate for this gap by weakening this longstanding connection between language and mind. The late great philosopher <a href="http://en.wikipedia.org/wiki/Daniel_Dennett">Daniel Dennett</a> warned that our complete inability to avoid treating such language as evidence of personhood will lead, in turn, to the rise of <a href="https://www.theatlantic.com/technology/archive/2023/05/problem-counterfeit-people/674075/">“counterfeit people,”</a> with potential dire consequences for our ability to reason and act.</p>

<h2 id="responses">Responses</h2>

<p>What’s the best thing about your AI assistant? That it never needs to sleep! A fourth abundance of AI is the abundance of <strong>responses</strong>. If you are struggling with a math problem, ChatGPT is there for you. If you are trying to figure out what to make for dinner, a recipe is six words away. If it’s 3:45am and you’re convinced you’ve just made the next scientific breakthrough, your AI is on standby. While companies like Anthropic have <a href="https://www.anthropic.com/research/end-subset-conversations">played around with the idea that your chatbot should be able to end conversations</a>, as a general rule, as the user, it is nearly impossible to have the last word in an interaction with AI. While it is likely that much will change about the nature of AI “assistants,” it seems unlikely that we will ever scale systems that are standoff-ish and untimely. If they are, we’re probably living in the world of <a href="https://en.wikipedia.org/wiki/Her_(2013_film)"><em>Her</em></a>; I’d be concerned.</p>

<p>The abundance of responses begets a scarcity of <strong>struggle</strong>. This can, quite easily, sound elitist and revisionist. “Have you heard of the <a href="https://en.wikipedia.org/wiki/Jevons_paradox">Jevons paradox</a>?” the retort might go. Did the printing press, steam engine, and personal computer eliminate productive struggle, or did they simply enable new directions for it? Won’t people simply continue to ask better questions once their easier ones are answered by their AI assistant? Deep down, though, I think most of us realize the way that we learn new concepts, solve hard problems, overcome personal challenges, and make art is often by sitting with an absence of feedback. Feedback is crucial, but it should be on a schedule.</p>

<h2 id="validation">Validation</h2>

<p>The abundance of responses — of any kind — is itself a notable facet of AI, but the truth is, these responses are usually not neutral. Odds are, if you have an idea for a new startup, according to your chatbot, it’s a sure winner. <a href="https://en.wikipedia.org/wiki/Nathan_for_You">Nathan Fielder</a> would have a field day. The world of abundant AI chatbots is also, most likely, a world of abundant <strong>validation</strong>. The field has coalesced around the term <a href="https://openai.com/index/sycophancy-in-gpt-4o/">sycophancy</a>, an interesting choice given its connotation of flattery explicitly for one’s own advantage — something to come back to. In the current paradigm, we mostly view trivial sycophancy as harmless and even laughable, and then we treat a few exceptional cases of sycophancy which lead to high profile <a href="https://futurism.com/openai-investor-chatgpt-mental-health">delusional spirals</a> or even <a href="https://www.wsj.com/tech/ai/gemini-ai-wrongful-death-lawsuit-cc46c5f7?reflink=desktopwebshare_permalink">suicide</a> as five alarm fires. But the reality is that this type of scaled validation is a more nebulous but likely profound type of AI abundance.</p>

<p>An abundance of validation begets a scarcity of <strong>calibration</strong>. Who you are is the weighted sum of how you see yourself and how everyone else sees you. Most people know not to judge the quality of their work by the perspective of their doting grandmother. But AI is marketed to the general public as superintelligent; when it says that you’re on to something, you might trust it. So far, we’ve only had a small number of public figures who have had marked shifts in their level of calibration as a result of AI — it might as well have been schizophrenia. Marginal effects on individual calibration are hard to measure, but, in the long run, we might expect to see fewer “matches” between people, meaning fewer friendships, marriages, and teams.</p>

<h2 id="bullshit">Bullshit</h2>

<p>The last abundance of AI is an abundance of <strong>bullshit</strong>. This is certainly related to the abundances of language, response, and validation, but it is also distinct. LLMs allow us to express things we don’t know via mediums we don’t understand in a manner that is nevertheless nearly impossible to detect, and then take credit for it as our own. In so many areas, we rely on style as a signal for substance (or lack thereof) — the poorly worded email a sign of a phishing scam, the grammatically incorrect submission a sign of poor research quality. The abundance of bullshit means that the “filterers” in a selection process lose many of the useful proxies that allowed them to approximate traits and skills that were valuable or dangerous but hard to measure. An abundance of bullshit begets a scarcity of <strong>signal</strong>. Confusion is up.</p>

<h2 id="conclusion">Conclusion</h2>

<p>To summarize, the fantasy for artificial intelligence is the abundance of <strong>intelligence</strong> and <strong>agency</strong>, shifting the language of a previous utopian era focused on the <em>production</em> of information to one focused on the <em>processing</em> of information. My contention is not that these fantasies are false, but that they are incomplete. While AI true believers portray these as unequivocal goods, banishing stupidity and helplessness from our lives, even these abundances will likely engender scarcities worth considering: a scarcity of <strong>expertise</strong> and a scarcity of <strong>collaboration</strong>, to name a few. Furthermore, the metastasization of AI in society will also generate a multitude of other abundances (and corresponding scarcities): an abundance of <strong>language</strong> (and a scarcity of <strong>attestation</strong>), an abundance of <strong>responses</strong> (and a scarcity of <strong>struggle</strong>), an abundance of <strong>validation</strong> (and a scarcity of <strong>calibration</strong>), and an abundance of <strong>bullshit</strong> (and a scarcity of <strong>signal</strong>).</p>

<p>This essay takes a somber tone on the potential second-order effects of AI on individuals and society. It should not be taken as a categorical dismissal of the utopian perspective. But my view is that, by default, our markets and our culture will continue to emphasize those first-order benefits to intelligence and agency. It is possible that we will enter a new political and cultural moment where people realize their latent distaste for this technology. But for the time being, the challenge for all of us is to figure out which processes and practices are worth protecting before they go extinct.</p>]]></content><author><name>Peter Kirgis</name><email>pk7019@princeton.edu</email></author><summary type="html"><![CDATA[What is the fantasy of artificial intelligence? Two proposals, from the leaders of Anthropic and OpenAI, respectively, summarize the target: “a country of geniuses in a datacenter” producing “intelligence too cheap to meter.” While many will be too cynical to take these remarks seriously, it is worth taking a moment to consider the specific nature of the fantasy they are describing. The asset of cognition itself, made scarce by bounded minds and limited bandwidth, converted to an abundance of information processing to complement or substitute for all of the thoughts we wish were better or do not wish to have at all.]]></summary></entry><entry><title type="html">A Brief Explainer on the Simplest Form of AI Alignment</title><link href="https://peterkirgis.github.io/posts/2025/04/28/rlhf" rel="alternate" type="text/html" title="A Brief Explainer on the Simplest Form of AI Alignment" /><published>2025-04-28T00:00:00+00:00</published><updated>2025-04-28T00:00:00+00:00</updated><id>https://peterkirgis.github.io/posts/2025/04/28/rlhf</id><content type="html" xml:base="https://peterkirgis.github.io/posts/2025/04/28/rlhf"><![CDATA[<p>What is reinforcement learning with human feedback (RLHF)? How can we systematically think about its abilities and limitations as an approach to AI alignment?</p>

<h2 id="what-is-reinforcement-learning">What is Reinforcement Learning?</h2>

<p>If you want to understand a machine learning algorithm, and you can only ask an expert one question, you should always ask, “what is the objective function?” While there are plenty of other important metrics for understanding an algorithm (data, update process, parameters, etc.), the objective function is the single easiest way to understand what an algorithm is designed to do.</p>

<p>Most machine learning methods fall under the category of “supervised learning.” Supervised learning describes any algorithm that takes in data and predicts labels for that data. Think of trying to predict a digit from 0 to 9 given images of handwritten numbers or predicting the median income of a town given its geography, history, and demographics. In each of these cases, the objective function is some variation of: <em>minimize the sum of differences between the labels and predicted values</em>.</p>

<p>This approach is incredibly useful for solving problems. But to understand where it falls short, let’s think about how you would build an algorithm to win a game of chess. The supervised learning approach suggests that we do something like: get a bunch of chess positions and label them as “wins” or “losses” based on the final outcome in a game that included that position. But when you play chess, you are not just trying to learn whether you’re winning or not; you’re trying to figure out <em>what to do next</em>! This is the crux of the difference between supervised learning and a different class of methods called reinforcement learning. Reinforcement learning methods help us when we are trying to learn <em>actions</em> to take over <em>time</em>. The objective function for these methods, in alignment with that goal, takes some form of: <em>maximize the discounted sum of future rewards</em>.</p>
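<p>To make the contrast concrete, here is a minimal sketch of the two objective functions described above, using squared error for the supervised case and a discounted return for the reinforcement learning case (illustrative only):</p>

<pre><code class="language-python">def supervised_loss(labels, predictions):
    """Supervised learning objective: minimize the summed difference
    between labels and predicted values (here, squared error)."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(labels, predictions))

def discounted_return(rewards, gamma=0.99):
    """Reinforcement learning objective: maximize the discounted sum of future
    rewards, where gamma down-weights rewards that arrive later."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A chess-like toy: no reward until the final move, which is worth 1 if you win.
print(discounted_return([0, 0, 0, 0, 1]))  # 0.99 ** 4, about 0.961
</code></pre>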

<h2 id="reinforcement-learning-from-human-feedback">Reinforcement Learning from Human Feedback</h2>

<p>A common (but slightly mistaken) description of the behavior of large language models is that they predict the next word, having been trained on a large corpus of text. To give a quick sketch of why this might lead to intelligent behavior, imagine you are tasked with predicting the last word in an Agatha Christie “whodunit” mystery:</p>

<blockquote>
  <p>“…And the detective explained to the family, waiting with bated breath, that the man who killed their daughter was [insert word].”</p>
</blockquote>

<p>As anyone who has read or watched this type of fiction knows, this can often be a challenging task. One can argue that a correct answer (without access to an exact example like it) would require grammar, causality, sociology, and even theory of mind. And while many words on the internet are much easier to predict than <em>that one</em>, perhaps you get the picture. This task is very similar to the supervised learning examples described above. The architecture used for these language models, the transformer, is an interesting hybrid model which uses “self-supervised” learning, where any of the words can act as data or a label.</p>
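<p>To illustrate the self-supervised setup, here is a minimal sketch of how a single sentence supplies its own training pairs, with each prefix acting as the input and the following word as the label (illustrative only):</p>

<pre><code class="language-python">def next_word_examples(words):
    """Self-supervised setup: every prefix of the text is an input and the
    following word is its label, so the corpus supplies its own supervision."""
    return [(words[:i], words[i]) for i in range(1, len(words))]

sentence = "the man who killed their daughter was".split()
for context, target in next_word_examples(sentence):
    print(context, "predicts", target)
</code></pre>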

<p>Interestingly, this alone doesn’t produce a particularly useful piece of technology. To turn this auto-complete engine into a chatbot, AI assistant, or AI agent, researchers use two methods, supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF), to train the model to speak and act.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> Loosely, one can think of SFT as locating the model as “something that takes in requests and produces responses,” and RLHF as training the model about what constitutes a <em>good response</em>.</p>

<p>The process of RLHF consists of four main steps:</p>
<ol>
  <li>Have the SFT model produce pairs of text completions given prompts chosen by the model providers.</li>
  <li>Have human annotators label one of these responses as preferred according to a set of criteria given to them by the model providers.</li>
  <li>Train <em>another</em> model to predict which response will be preferred by humans (a minimal sketch of this step follows the list).</li>
  <li>Use reinforcement learning to teach the SFT model to choose words early in a response that maximize the reward it receives at the end of the response.</li>
</ol>
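<p>As a concrete illustration of step 3, here is a minimal sketch of a standard pairwise preference loss of the kind used to train reward models. This is a generic Bradley-Terry-style objective, not any particular lab’s recipe:</p>

<pre><code class="language-python">import math

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss for the reward model: push the score of the
    human-preferred response above the score of the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# If the reward model already prefers the chosen response, the loss is small.
print(reward_model_loss(2.0, 0.5))   # about 0.20
print(reward_model_loss(0.0, 1.5))   # about 1.70
</code></pre>

<p>In step 4, the SFT model is then optimized, via reinforcement learning, to produce responses that this reward model scores highly.</p>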

<h2 id="ethical-tensions-with-rlhf">Ethical Tensions with RLHF</h2>

<p>There are many perspectives from which to view the tradeoffs and tensions arising from this process. For this analysis, I will attempt to distinguish between short run, immediate problems in RLHF deployments for chatbots, and more speculative long run issues with RLHF and AI agents.</p>

<h3 id="short-run-tensions">Short Run Tensions</h3>

<p>One of the challenges with RLHF, in the short run, is that it is being used both as a mechanism for ensuring the “safety” of large language model outputs and as a mechanism for “aligning” these outputs with myriad human preferences when it comes to tasks such as planning, summarization, criticism, and research.</p>

<p>A common refrain describes the goal of RLHF as a safety mechanism for AI models: helpful, honest, harmless (HHH). This <a href="https://arxiv.org/pdf/2204.05862">framework</a>, devised by researchers at Anthropic in 2022, suggests that AI models should not respond in ways that are irrelevant, deceptive, or broadly emotionally harmful and/or toxic. But in the RLHF training process, this translates to giving models a number of examples that display these behaviors and hoping the model generalizes from them to broad “policies” that cover unseen and ambiguous cases. As a result, current RLHF processes make models extremely vulnerable to <a href="https://github.com/elder-plinius/L1B3RT4S">jailbreaking</a> by persistent and creative users. They also lead directly to model <a href="https://www.anthropic.com/research/towards-understanding-sycophancy-in-language-models">sycophancy</a> and contribute to <a href="https://arxiv.org/pdf/2307.15217">hallucinations</a>, since the model has been trained to diverge from its own internal model of a question to produce answers that are satisfying to a human.</p>

<p>But this is only half the problem. In addition to these safety concerns, RLHF necessarily <a href="https://arxiv.org/pdf/2310.06452">flattens</a> the diversity of human preferences into a single (albeit complex) reward signal that the model is learning from. Given that the preference data from human annotators used in RLHF is almost certainly unrepresentative, this means that LLM outputs are actively reinforcing the preferences and perspectives of majority populations, likely leading to many different cultural, political, and demographic biases. While strategies for incorporating <a href="https://www.theverge.com/news/646968/openai-chatgpt-long-term-memory-upgrade">“memory”</a> into LLM chat history will lead to higher variability in these biases for individual users, the result of reduced diversity of outputs will almost certainly remain.</p>

<h3 id="long-run-tensions">Long Run Tensions</h3>

<p>In a 1975 <a href="https://en.wikipedia.org/wiki/Goodhart%27s_law#cite_note-2">speech</a> on a forthcoming article, “Problems in Monetary Management,” British economist Charles Goodhart argued that “any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes,” an observation later invoked to criticize Margaret Thatcher’s myopic monetary policy. In reinforcement learning, the concept of <a href="https://arxiv.org/pdf/2201.03544">“reward misspecification”</a> is used to describe a similar phenomenon, where a model “overfits to a proxy reward” in training, leading to unexpected or suboptimal outcomes in a real-world deployment. In RLHF, we have a “final” goal of teaching models to be helpful, harmless, and honest, but the data that we give to models to learn these abstract ideals are only relative preferences of human beings for different model outputs given a set of instructions.</p>

<p>You don’t have to be a leading computer scientist to see how this can go awry. A <a href="https://arxiv.org/pdf/2406.10162">paper</a> from Anthropic describes model evaluations where, in a small number of cases, Claude responds to a challenging coding task by changing unit tests rather than attempting to find a correct solution. The training for RLHF means that deceptive strategies, in addition to sycophantic ones, will yield high reward for a model.</p>

<p>This concern is widely discussed in the computer science community, although often not with the appropriate level of humility. A less discussed long run implication comes from the design of RLHF preference datasets. In almost all setups, reward models learn to select better from worse responses. Implicitly, we are training models to view all outputs through the utilitarian framing of “less” and “more” preferred. While this is different than saying we are training models to be “utilitarians,” the utilitarian nature of RL may lead models to, at a certain level of complexity, internalize this reward structure, leading to behavior that diverges from the diversity of human values related to rules and virtues.</p>

<h2 id="conclusion">Conclusion</h2>

<p>A reasonable question to ask in light of this analysis is: “is there any other option?” Even if you have no fears of misaligned superintelligence, it is almost certainly the case that there is no “neutral” approach without some kind of alignment training. A decision must be made, and real world data on human preferences seems about as good a place to start as any other.</p>

<p>I don’t have a good answer to this rebuttal. But I think it is important to acknowledge the key implications of RLHF for LLMs. First, RLHF means that companies are capable of systematically instilling values in models that go beyond the biases in the training data on the internet. Second, the derivation of rewards from human preferences for responses leads directly to sycophancy and reduced diversity in model responses. Third, and by the same logic, RLHF encourages deception in LLMs.</p>

<p>It is possible that new strategies allow us to move beyond RLHF to resolve many or all of these issues. As just one example, recent <a href="https://arxiv.org/pdf/2501.08617">work</a> by researchers at Princeton University indicates that we may be able to use simulations to reduce some of the short and long run alignment challenges in RLHF.</p>

<p>I suspect, however, that these issues with RLHF are here for the long run. Even if there is no route to mass deployment of language models that does not involve human preference training, it is important for policymakers to understand the <em>policy choices</em> that are currently being made opaquely by companies in designing these models. These are questions that are not clearly about computer science expertise; yet they are primarily happening in AI companies and computer science labs. In closing, from an outsider’s perspective, it appears that we currently have companies raising at <a href="https://openai.com/index/march-funding-updates/">$300 billion</a> valuations, with the explicit goal of designing AGI, who view “alignment” as the last mile of their training process. Could there be something fundamentally backwards in this way of thinking?</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Another key element to this progress is chain-of-thought (COT) reasoning, which I will not be discussing here. Importantly, COT also relies on reinforcement learning, so many of the arguments will hold. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Peter Kirgis</name><email>pk7019@princeton.edu</email></author><summary type="html"><![CDATA[What is reinforcement learning with human feedback (RLHF)? How can we systematically think about its abilities and limitations as an approach to AI alignment?]]></summary></entry><entry><title type="html">Dissecting the Procedure Fetish</title><link href="https://peterkirgis.github.io/posts/2024/11/procedure/" rel="alternate" type="text/html" title="Dissecting the Procedure Fetish" /><published>2024-11-09T00:00:00+00:00</published><updated>2024-11-09T00:00:00+00:00</updated><id>https://peterkirgis.github.io/posts/2024/11/procedure</id><content type="html" xml:base="https://peterkirgis.github.io/posts/2024/11/procedure/"><![CDATA[<p>What explains the tendency of administrative bureaucracy in the United States to cripple state capacity with layers of procedural rules and regulations? In <a href="https://www.niskanencenter.org/the-procedure-fetish/">The Procedure Fetish</a>, Nicholas Bagley offers two potential motivations for this trend, seeing the spread of procedure as stemming from (1) a desire for bureaucratic accountability on the part of Congress and the public, and (2) an attempt to provide legitimacy to the Constitutionally-weak administrative state. Bagley singles out lawyers, in all branches of government, as particularly culpable for the defense of procedure as a force for accountability and legitimacy.</p>

<p>Bagley presents a compelling counterargument against the validity of these proposed defenses of procedure. The particularly high approval ratings for the Federal Reserve and the Department of Defense, two agencies with little procedure and public comment, strongly undermine the argument for procedure as legitimizing agencies in the eyes of the public. Similarly, the argument for accountability fails when we acknowledge the inequity and capture that result from the “time, attention, and resources” (Bagley, 2021) needed to parse complex rules and regulations.</p>

<p>Let’s say we agree with Bagley’s assessment regarding the deleterious effects of procedure on state capacity. Is the answer to replace all lawyers in the executive branch and Congress with engineers and consultants? Would that it were so simple. In this paper, I will argue that lawyers are a symptom, not the cause, of the broader risk-averse mentality that pervades public administration. I will first defend risk aversion as a rational approach given that most government agencies deal with domains of diffuse benefits and concentrated costs. I will then analyze why the private sector does not suffer from this problem to the same extent and argue that a resolution to the United States’ state capacity problems will require changes in both agency and public perceptions of how to weigh the costs and benefits of public sector action.</p>

<p>For two years, I worked in an executive branch agency of the Massachusetts state government. My team’s main purpose was to conduct program evaluation on populations receiving government services from multiple agencies across the social safety net. A massive undertaking in such projects was to get agencies to share their data. Even for internal data analysis, we had to first apply statistical algorithms to these datasets to protect the privacy of the populations in the data. We spent almost my entire tenure working on these engineering challenges, with analytical work consistently pushed back.</p>

<p>I found this approach incredibly frustrating. What good was all of this work to protect privacy if we weren’t using the data for any legitimate purpose? Over my time there, I began to realize that most of the policy and data managers I worked with shared the same mentality: better safe than sorry. Rather than seeing this as pure obstructionism, I realized that this approach made a lot of sense in a simple cost-benefit analysis. The cost of a privacy breach was large, likely leading to a Boston Globe story and a public backlash, while the benefit of access was small, in the form of a marginal tweak to program design leading to higher social welfare. In data privacy, the costs are concentrated, while the benefits are diffuse.</p>

<p>The cost-benefit analysis of data privacy is not unique in government. Improvements to existing public goods, including clean energy, equitable public safety, and transit infrastructure, suffer from the same problem. Here, the social costs of a mistake, such as a nuclear reactor failure, a murder, or a derailment, appear to dwarf the social benefits of success, such as a small reduction in global temperatures or racial inequality. In this view, a status quo bias for government agencies is often rational, from a short-term perspective. I say short-term perspective, because it is quite plausible that the net benefits could outweigh the costs, even if they are diffuse, over a longer time horizon.</p>

<p>One obvious question for this account is why the public sector suffers from this problem much more than the private sector. The potential for a data breach does not stop Google or Facebook from using our data to make billions in profits off of personalized advertisements. This is partly because there is a higher base rate for the problem of concentrated costs and diffuse benefits in public goods, by definition. Because a public good is non-excludable, its benefits will inherently be spread more broadly throughout the population than an iPhone or a Tesla. Public goods also tend to be essential services, meaning a failure will be more catastrophic than for many consumer products or services. Certainly, the incentives facing private sector employees from equity packages, the prospect of promotion, or the threat of firing, also contribute to greater risk taking behavior.</p>

<p>I do not, however, believe this explains all of the public sector’s relative risk aversion. Based on my experience and my perception of the general public, government is held to a different standard than the private sector. As the protector and provider for those who are most vulnerable, government is often expected to adhere to a form of the Hippocratic oath, “do no harm.” Contrast this with the private sector, where executives have a direct duty to act in the best interests of their shareholders, which usually means maximizing their annual rate of return.</p>

<p>Given this asymmetry, the “low state capacity” criticism of government can seem like a double standard. Government is asked to follow a moral procedure that focuses on limiting harm to those at the bottom, but is then judged by its overall efficacy across all of its functions. This is not, however, an argument against state capacity as a barometer of government success. Far from it: I believe it is essential that government in the 21st century be judged by this standard. The problems we face, from climate change to AI to societal distrust, depend directly on the government’s ability to overcome collective action problems and make tough decisions quickly.</p>

<p>Increasing state capacity will almost certainly involve some of the institutional changes that Bagley discusses in his piece. Reforms to OIRA, environmental protections, notice-and-comment rulemaking, and legal rules governing procurement would all contribute to a more dynamic public sector. But this must be accompanied by a collective understanding that public goods are not private goods, and government cannot simultaneously act to minimize the chance of harm and maximize its overall positive effect. In turn, this means a recognition of the unprecedented nature of our problems and a cultural shift towards collectivism, a tall task, to put it mildly. Nevertheless, it is only with these changes that we can align the incentives facing government employees, eliminate the procedure fetish, and boost our state capacity.</p>]]></content><author><name>Peter Kirgis</name><email>pk7019@princeton.edu</email></author><summary type="html"><![CDATA[What explains the tendency of administrative bureaucracy in the United States to cripple state capacity with layers of procedural rules and regulations? In The Procedure Fetish, Nicholas Bagley offers two potential motivations for this trend, seeing the spread of procedure as stemming from (1) a desire for bureaucratic accountability on the part of Congress and the public, and (2) an attempt to provide legitimacy to the Constitutionally-weak administrative state. Bagley singles out lawyers, in all branches of government, as particularly culpable for the defense of procedure as a force for accountability and legitimacy.]]></summary></entry><entry><title type="html">The American Dream is dying. Here’s what can replace it.</title><link href="https://peterkirgis.github.io/posts/2023/11/abundance/" rel="alternate" type="text/html" title="The American Dream is dying. Here’s what can replace it." /><published>2023-11-24T00:00:00+00:00</published><updated>2023-11-24T00:00:00+00:00</updated><id>https://peterkirgis.github.io/posts/2023/11/abundance</id><content type="html" xml:base="https://peterkirgis.github.io/posts/2023/11/abundance/"><![CDATA[<p>Much has been made of America’s fraying social fabric. American pride is at a <a href="https://news.gallup.com/poll/507980/extreme-pride-american-remains-near-record-low.aspx">record low</a>. Polarization is at a <a href="https://www.pewresearch.org/politics/2022/08/09/as-partisan-hostility-grows-signs-of-frustration-with-the-two-party-system/">record high</a>, as our parties have become rigidly sorted by demographics, income, and education. The crises of the COVID-19 pandemic and the 2020 Presidential election suggest that foundational trust in public authorities is on a dangerous precipice.</p>

<p>This can be tough to square with some basic facts about America. When visa constraints are removed, the United States is the <a href="https://www.oecd.org/migration/mig/What-is-the-best-country-for-global-talents-in-the-OECD-Migration-Policy-Debates-March-2023.pdf">second most attractive country</a> for high-skill immigrants. The median wealth in the US today is $193k, <a href="https://twitter.com/Noahpinion/status/1714697120803156186">ranking sixth worldwide</a>. During the COVID-19 pandemic, the US defied global expectations by producing multiple safe and effective vaccines in less than a year, <a href="https://www.cidrap.umn.edu/covid-19/report-covid-19-vaccines-saved-us-115-trillion-3-million-lives">saving an estimated 3 million US lives</a>.</p>

<p>While there are many scapegoats for America’s internal malaise, including the rise of social media and political dysfunction, these explanations miss the forest for the trees. Since its inception, America has lived and died by the sword of meritocracy and the American Dream. This national myth of individualism argues that our most crucial collective identity lies in a combination of opportunity and merit that supports economic, political, and social thriving.</p>

<p>While this trope is still espoused by every major political figure, Americans are finding the American Dream less and less palatable. Raj Chetty, an economist at Harvard, studies trends in intergenerational mobility in America. His research finds that the odds that a child grows up to earn more than their parents have <a href="https://opportunityinsights.org/national_trends/">decreased steadily</a> over the past fifty years. This empirical finding mirrors public opinion – today over <a href="https://today.yougov.com/politics/articles/43151-do-americans-believe-american-dream?redirect_from=%2Ftopics%2Fpolitics%2Farticles-reports%2F2022%2F07%2F14%2Fdo-americans-believe-american-dream">one-third</a> of all Americans say that the American Dream no longer exists.</p>

<p>One approach to this problem suggests we must take drastic action to revitalize the American Dream of eras past. But there are multiple flaws in this line of reasoning. First, most assessments of declining mobility refer primarily to White men. For African Americans, mobility has <a href="https://www.nber.org/system/files/working_papers/w23395/w23395.pdf">always</a> been low. Second, the version of the American Dream that focuses on upward mobility enforces a zero-sum perspective on well-being, where our flourishing is always judged by the yardstick of our parents and our neighbors.</p>

<p>Even if the American Dream and the meritocracy myth are empirically and theoretically flawed, any argument to deflate their influence in the national consciousness must come to the table armed with another pillar to take their place. Otherwise, we are a nation lost at sea, with no religion, ethnicity, culture, or shared values to guide us.</p>

<p>There is only one ethic that can compete with meritocracy: abundance. Anyone who has spent time outside of America will observe the unique, everyday luxuries of our society, including ubiquitous dryers and air conditioners, “Medium” fountain sodas, Targets and Walmarts, and SUVs. But there is more to our abundance than consumption. America accounts for roughly half of the <a href="https://www.webometrics.info/en/distribution_by_country">world’s top 100 universities</a>, <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2866602/">drug discoveries</a>, and <a href="https://en.wikipedia.org/wiki/List_of_unicorn_startup_companies">billion-dollar companies</a>, despite being home to only <a href="https://www.worldometers.info/world-population/us-population/#:~:text=the%20United%20States%202023%20population,(and%20dependencies)%20by%20population.">4%</a> of the world’s population.</p>

<p>Abundance clearly delineates America’s successes from its failures. In the pandemic, we stumbled over the collective action problems of masking and social distancing, but our vaccination efforts were a global success. On climate, thanks to public-private partnerships in clean energy innovation, America has succeeded in reducing its emissions by <a href="https://www.epa.gov/climate-indicators/climate-change-indicators-us-greenhouse-gas-emissions">20% since 2005</a> despite failing to pass any national carbon pricing.</p>

<p>Abundance also has the potential to unify progressives and conservatives. The new national consensus around industrial policy, exemplified by Ezra Klein’s <a href="https://www.nytimes.com/2021/09/19/opinion/supply-side-progressivism.html">supply-side progressivism</a> on the left and Tyler Cowen’s <a href="https://marginalrevolution.com/marginalrevolution/2020/01/what-libertarianism-has-become-and-will-become-state-capacity-libertarianism.html">state capacity libertarianism</a> on the right, argues that the best way to solve affordability challenges is to empower the public and private sectors by reducing barriers to innovation and development.</p>

<p>Many will argue that meritocracy is essential to progress. Why else would people choose to build great things, if not for the economic and social returns a meritocracy affords? While this would be a fair response to calls to immediately dismantle our economic or educational systems, mine is instead a call to change our political messaging and the justifications we offer for public policy decisions.</p>

<p>A shift in messaging away from meritocracy and towards abundance allows us to transcend the zero-sum thinking that says progress is defined by relative gains and that a policy must be either efficient or equitable. It also correctly identifies the enemies of progress as those seeking to block abundance – the <a href="https://www.nytimes.com/2021/07/23/podcasts/transcript-ezra-klein-interviews-jerusalem-demsas.html">NIMBYs</a>, <a href="https://www.theatlantic.com/ideas/archive/2022/02/why-does-the-us-make-it-so-hard-to-be-a-doctor/622065/">lobbyists</a>, <a href="https://www.nytimes.com/2023/10/07/opinion/new-york-migrant-crisis.html">border hawks</a>, and <a href="https://www.vox.com/future-perfect/22408556/save-planet-shrink-economy-degrowth">degrowthers</a>. Most importantly, it presents an optimistic future for Americans to invest in and be proud of, something everyone can agree we desperately need.</p>]]></content><author><name>Peter Kirgis</name><email>pk7019@princeton.edu</email></author><summary type="html"><![CDATA[Much has been made of America’s fraying social fabric. American pride is at a record low. Polarization is at a record high, as our parties have become rigidly sorted by demographics, income, and education. The crises of the COVID-19 pandemic and the 2020 Presidential election suggest that foundational trust in public authorities is on a dangerous precipice.]]></summary></entry></feed>