Rohin Shah

Google DeepMind

Improving Dictionary Learning with Gated Sparse Autoencoders

Apr 30, 2024
Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Evaluating Frontier Models for Dangerous Capabilities

Mar 20, 2024
Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, Allan Dafoe, Toby Shevlane

AtP*: An efficient and scalable method for localizing LLM behaviour to components

Mar 01, 2024
János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda

Challenges with unsupervised LLM knowledge discovery

Dec 18, 2023
Sebastian Farquhar, Vikrant Varma, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, Rohin Shah

BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks

Dec 05, 2023
Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Rohin Shah

Explaining grokking through circuit efficiency

Sep 05, 2023
Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, Ramana Kumar

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Jul 24, 2023
Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik

Towards Solving Fuzzy Tasks with Human Feedback: A Retrospective of the MineRL BASALT 2022 Competition

Mar 23, 2023
Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Sharada Mohanty, Byron Galbraith, Ke Chen, Yan Song, Tianze Zhou, Bingquan Yu, He Liu, Kai Guan, Yujing Hu, Tangjie Lv, Federico Malato, Florian Leopold, Amogh Raut, Ville Hautamäki, Andrew Melnik, Shu Ishida, João F. Henriques, Robert Klassert, Walter Laurito, Ellen Novoseller, Vinicius G. Goecks, Nicholas Waytowich, David Watkins, Josh Miller, Rohin Shah

SIRL: Similarity-based Implicit Representation Learning

Jan 03, 2023
Andreea Bobu, Yi Liu, Rohin Shah, Daniel S. Brown, Anca D. Dragan

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

Oct 04, 2022
Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, Zac Kenton
