Poking parts of Sonnet's brain to make it less annoying
We now have a map of where concepts live in the minds of LLMs
In a groundbreaking new paper (actually groundbreaking, IMO), researchers at Anthropic have scaled up an interpretability technique called "dictionary learning" to one of their deployed models, Claude 3 Sonnet. The results provide an unprecedented look inside the mind of a large language model, revealing millions of interpretable features that correspond to specific concepts and behaviors (like sycophancy) and shedding light on the model's inner workings.
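To make "dictionary learning" concrete before we dive in: the idea (in its sparse-autoencoder form) is to learn a large dictionary of directions such that each internal activation of the model can be rebuilt as a sparse combination of a few of them, and those few tend to be interpretable. The sketch below is a minimal toy version under my own assumptions, not Anthropic's actual training setup; the class name, dimensions, and `l1_coeff` value are purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder ("dictionary") over model activations.

    d_model: width of the activations being decomposed.
    d_dict:  number of dictionary features (typically much larger than d_model).
    """
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, activations: torch.Tensor):
        # Feature activations: sparse, non-negative codes, one per dictionary feature.
        features = F.relu(self.encoder(activations))
        # Reconstruction: rebuild the original activation from the active features.
        reconstruction = self.decoder(features)
        return features, reconstruction

def loss_fn(activations, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error keeps the dictionary faithful to the model's activations;
    # the L1 penalty pushes most features to zero, which is what makes them interpretable.
    mse = F.mse_loss(reconstruction, activations)
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

# Usage on random stand-in "activations" (real training uses activations sampled
# from the model itself, at a far larger scale).
sae = SparseAutoencoder(d_model=512, d_dict=16384)
batch = torch.randn(32, 512)
features, recon = sae(batch)
print(loss_fn(batch, recon, features))
```

The tension in that loss is the whole game: reconstruction quality wants many features active, sparsity wants few, and the features that survive are the ones the paper finds to be human-interpretable.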
In this post, we'll explore the key findings of this research, including the discovery of interpretable features, the role of scaling laws, the abstractness and versatility of these features, and their implications for model steering and AI safety. There’s a lot to cover, so this post will be longer and a bit more detailed than my usual breakdowns. Let’s go!