Research
LLM Spirals of Delusion: A Benchmarking Audit Study of AI Chatbot Interfaces
We conduct an audit study comparing LLM behavior across API and chat interfaces, documenting large differences in sycophancy, escalation, and delusion reinforcement between environments and across model versions.
Towards a Science of AI Agent Reliability
We propose twelve metrics decomposing AI agent reliability along four dimensions (consistency, robustness, predictability, and safety) and evaluate fourteen models, finding that recent capability gains have yielded only small improvements in reliability.
Differences in the Moral Foundations of Large Language Models
I conduct a series of experiments to compare the moral judgments of sixteen frontier language models from all major US model providers using Jonathan Haidt's Moral Foundations Theory.
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
I conduct a large-scale analysis of 21k AI agent rollouts across 9 benchmarks to evaluate the cost-effectiveness, general problem-solving capabilities, and failure modes of frontier AI agents as part of the Holistic Agent Leaderboard (HAL).
Desegregation Paradox? A Model and Simulation of LIHTC’s Effects on Economic Segregation
I simulate the counterfactual distribution of low-income housing in the absence of the Low Income Housing Tax Credit (LIHTC), the nation's largest affordable housing program, to characterize the program's impact on economic segregation.