Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dillon Bowen

A StrongREJECT for Empty Jailbreaks

Feb 15, 2024
Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer

Figure 1 for A StrongREJECT for Empty Jailbreaks

Figure 2 for A StrongREJECT for Empty Jailbreaks

Figure 3 for A StrongREJECT for Empty Jailbreaks

Figure 4 for A StrongREJECT for Empty Jailbreaks

The rise of large language models (LLMs) has drawn attention to the existence of "jailbreaks" that allow the models to be used maliciously. However, there is no standard benchmark for measuring the severity of a jailbreak, leaving authors of jailbreak papers to create their own. We show that these benchmarks often include vague or unanswerable questions and use grading criteria that are biased towards overestimating the misuse potential of low-quality model responses. Some jailbreak techniques make the problem worse by decreasing the quality of model responses even on benign questions: we show that several jailbreaking techniques substantially reduce the zero-shot performance of GPT-4 on MMLU. Jailbreaks can also make it harder to elicit harmful responses from an "uncensored" open-source model. We present a new benchmark, StrongREJECT, which better discriminates between effective and ineffective jailbreaks by using a higher-quality question set and a more accurate response grading algorithm. We show that our new grading scheme better accords with human judgment of response quality and overall jailbreak effectiveness, especially on the sort of low-quality responses that contribute the most to over-estimation of jailbreak performance on existing benchmarks. We release our code and data at https://github.com/alexandrasouly/strongreject.

* Code and data at https://github.com/alexandrasouly/strongreject

Via

Access Paper or Ask Questions

Generalized SHAP: Generating multiple types of explanations in machine learning

Jun 15, 2020
Dillon Bowen, Lyle Ungar

Figure 1 for Generalized SHAP: Generating multiple types of explanations in machine learning

Figure 2 for Generalized SHAP: Generating multiple types of explanations in machine learning

Figure 3 for Generalized SHAP: Generating multiple types of explanations in machine learning

Figure 4 for Generalized SHAP: Generating multiple types of explanations in machine learning

Many important questions about a model cannot be answered just by explaining how much each feature contributes to its output. To answer a broader set of questions, we generalize a popular, mathematically well-grounded explanation technique, Shapley Additive Explanations (SHAP). Our new method - Generalized Shapley Additive Explanations (G-SHAP) - produces many additional types of explanations, including: 1) General classification explanations; Why is this sample more likely to belong to one class rather than another? 2) Intergroup differences; Why do our model's predictions differ between groups of observations? 3) Model failure; Why does our model perform poorly on a given sample? We formally define these types of explanations and illustrate their practical use on real data.

* 12 pages, 7 figures. Based on a submission to NeurIPS 2020. Dillon Bowen is credited with the original concept, code, data analysis, and initial paper draft. Lyle Ungar is credited with contributions to the draft and mathematical notation. Documentation can be found at https://dsbowen.github.io/gshap/

Via

Access Paper or Ask Questions