Neel Nanda

Google DeepMind

Improving Dictionary Learning with Gated Sparse Autoencoders

Apr 30, 2024
Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

How to use and interpret activation patching

Apr 23, 2024
Stefan Heimersheim, Neel Nanda

AtP*: An efficient and scalable method for localizing LLM behaviour to components

Mar 01, 2024
János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda

Explorations of Self-Repair in Language Models

Feb 23, 2024
Cody Rushing, Neel Nanda

Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

Feb 11, 2024
Bilal Chughtai, Alan Cooney, Neel Nanda

Universal Neurons in GPT2 Language Models

Jan 22, 2024
Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

Dec 06, 2023
Aleksandar Makelov, Georg Lange, Neel Nanda

Training Dynamics of Contextual N-Grams in Language Models

Nov 01, 2023
Lucia Quirke, Lovis Heindrich, Wes Gurnee, Neel Nanda

Linear Representations of Sentiment in Large Language Models

Oct 23, 2023
Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda

Copy Suppression: Comprehensively Understanding an Attention Head

Oct 06, 2023
Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda
