Alert button
Picture for Samuel Marks

Samuel Marks

Alert button

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Add code
Bookmark button
Alert button
Mar 31, 2024
Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller

Figure 1 for Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Figure 2 for Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Figure 3 for Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Figure 4 for Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Viaarxiv icon

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Add code
Bookmark button
Alert button
Oct 10, 2023
Samuel Marks, Max Tegmark

Viaarxiv icon

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Add code
Bookmark button
Alert button
Jul 27, 2023
Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell

Figure 1 for Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Figure 2 for Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Figure 3 for Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Figure 4 for Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Viaarxiv icon