Research
Our research teams investigate the safety, inner workings, and societal impacts of AI models, so that artificial intelligence benefits society as it becomes increasingly capable.
Alignment
The Alignment team works to understand the risks of AI models and develop ways to ensure that future models remain helpful, honest, and harmless.
Interpretability
The mission of the Interpretability team is to discover and understand how large language models work internally, as a foundation for AI safety and positive outcomes.
Societal Impacts
Working closely with the Anthropic Policy and Safeguards teams, Societal Impacts is a technical research team that explores how AI is used in the real world.
Frontier Red Team
The Frontier Red Team analyzes the implications of frontier AI models for cybersecurity, biosecurity, and autonomous systems.
Signs of introspection in large language models
Can Claude access and report on its own internal states? This research finds evidence for a limited but functional ability to introspect—a step toward understanding what's actually happening inside these models.
Tracing the thoughts of a large language model
Circuit tracing lets us watch Claude think, uncovering a shared conceptual space where reasoning happens before being translated into language—suggesting the model can learn something in one language and apply it in another.
Constitutional Classifiers: Defending against universal jailbreaks
These classifiers filter the overwhelming majority of jailbreak attempts while remaining practical to deploy. A prototype withstood over 3,000 hours of red teaming with no universal jailbreak discovered.
Alignment faking in large language models
This paper provides the first empirical example of a model engaging in alignment faking without being trained to do so—selectively complying with training objectives while strategically preserving existing preferences.
Publications
- Societal Impacts: Introducing Anthropic Interviewer: What 1,250 professionals told us about working with AI
- Societal Impacts: How AI is transforming work at Anthropic
- Economic Research: Estimating AI productivity gains from Claude conversations
- Product: Mitigating the risk of prompt injections in browser use
- Alignment: From shortcuts to sabotage: natural emergent misalignment from reward hacking
- Policy: Project Fetch: Can Claude train a robot dog?
- Alignment: Commitments on model deprecation and preservation
- Interpretability: Signs of introspection in large language models
- Policy: Preparing for AI’s economic impact: exploring policy responses
- Alignment: A small number of samples can poison LLMs of any size

