Poking parts of Sonnet's brain to make it less annoying

We now have a map to where concepts live in the minds of LLMs

aimodels-fyi ∙ May 22, 2024

In a groundbreaking new paper (actually groundbreaking, IMO), researchers at Anthropic have scaled up an interpretability technique called "dictionary learning" to one of their deployed models, Claude 3 Sonnet. The results provide an unprecedented look inside the mind of a large language model, revealing millions of interpretable features that correspond to specific concepts and behaviors (like sycophancy) and shedding light on the model's inner workings.
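To make "dictionary learning" concrete, here is a minimal sketch of the core idea as the paper describes it: train a sparse autoencoder on a model's internal activations so that each activation vector is rewritten as a sparse combination of a much larger set of learned feature directions. The module names, dimensions, and L1 coefficient below are illustrative assumptions, not the paper's actual settings.

```python
# Minimal sketch of dictionary learning via a sparse autoencoder (SAE).
# Dimensions and hyperparameters are illustrative only.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Encoder maps an activation vector to a (much wider) sparse code.
        self.encoder = nn.Linear(d_model, d_dict)
        # Decoder reconstructs the activation; its columns form the learned
        # "dictionary" of interpretable feature directions.
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        codes = torch.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(codes)           # reconstructed activation
        return recon, codes

def sae_loss(x, recon, codes, l1_coeff=1e-3):
    # Reconstruct activations faithfully while penalizing dense codes.
    recon_err = (recon - x).pow(2).mean()
    sparsity = codes.abs().mean()
    return recon_err + l1_coeff * sparsity

if __name__ == "__main__":
    d_model, d_dict = 512, 8192               # dictionary much wider than the model
    sae = SparseAutoencoder(d_model, d_dict)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

    # Stand-in for activations collected from the language model.
    batch = torch.randn(64, d_model)
    opt.zero_grad()
    recon, codes = sae(batch)
    loss = sae_loss(batch, recon, codes)
    loss.backward()
    opt.step()
    print(loss.item())
```

Once trained, each dictionary direction that fires on a consistent pattern of inputs is a candidate "feature" of the kind the paper catalogs by the millions.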

In this post, we'll explore the key findings of this research, including the discovery of interpretable features, the role of scaling laws, the abstractness and versatility of these features, and their implications for model steering and AI safety. There’s a lot to cover, so this post will be longer and a bit more detailed than my usual breakdowns. Let’s go!

