Research
LLM Spirals of Delusion: A Benchmarking Audit Study of AI Chatbot Interfaces
We conduct an audit study comparing LLM behavior across API and chat interfaces, documenting large differences in sycophancy, escalation, and delusion reinforcement between environments and across model versions.
Towards a Science of AI Agent Reliability
We propose twelve metrics decomposing AI agent reliability along four dimensions (consistency, robustness, predictability, and safety) and evaluate fourteen models, finding that recent capability gains have yielded only small improvements in reliability.
Differences in the Moral Foundations of Large Language Models
I conduct a series of experiments to compare the moral judgments of sixteen frontier language models from all major US model providers using Jonathan Haidt's Moral Foundations Theory.
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
I conduct a large-scale analysis of 21k AI agent rollouts across 9 benchmarks to evaluate the cost-effectiveness, general problem-solving capabilities, and failure modes of frontier AI agents as part of the Holistic Agent Leaderboard (HAL).
Desegregation Paradox? A Model and Simulation of LIHTC’s Effects on Economic Segregation
I simulate the counterfactual distribution of low-income housing in the absence of the Low Income Housing Tax Credit (LIHTC), the nation's largest affordable housing program, to characterize the program's impact on economic segregation.