More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness

Apr 29, 2024
Aaron J. Li, Satyapriya Krishna, Himabindu Lakkaraju

View paper on arXiv