Chris Cremer

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

Feb 26, 2024
Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, Sara Hooker

AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high-performance large language models. Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance. We revisit the formulation of alignment from human preferences in the context of RL. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed "RL-free" methods such as DPO and RAFT. Our work suggests that careful adaptation to the alignment characteristics of LLMs enables benefiting from online RL optimization at low cost.
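As a rough illustration of what "REINFORCE-style optimization" means, the sketch below trains a softmax policy on a toy three-action problem using the score-function gradient. It is not the paper's implementation: the toy reward table, the function names, and the choice of a leave-one-out baseline for variance reduction are all assumptions made for this example, and a real RLHF setup would score whole sampled completions with a reward model instead.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def leave_one_out_advantages(rewards):
    """Subtract from each reward the mean reward of the *other* samples.

    The baseline for sample i does not depend on sample i, so the
    gradient estimate stays unbiased. (Assumed variance-reduction choice.)
    """
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def reinforce_gradient(logits, actions, rewards):
    """Score-function estimate of d E[reward] / d logits.

    For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi(i).
    """
    probs = softmax(logits)
    advantages = leave_one_out_advantages(rewards)
    grad = [0.0] * len(logits)
    for action, adv in zip(actions, advantages):
        for i, p in enumerate(probs):
            indicator = 1.0 if i == action else 0.0
            grad[i] += adv * (indicator - p) / len(actions)
    return grad


def train_toy_policy(reward_per_action, steps=200, k=4, lr=1.0, seed=0):
    """Gradient ascent on a hypothetical bandit; returns final action probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(reward_per_action)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample k actions online from the current policy.
        actions = rng.choices(range(len(logits)), weights=probs, k=k)
        rewards = [reward_per_action[a] for a in actions]
        grad = reinforce_gradient(logits, actions, rewards)
        logits = [x + lr * g for x, g in zip(logits, grad)]
    return softmax(logits)


if __name__ == "__main__":
    # Only the last action is rewarded in this made-up example.
    print(train_toy_policy([0.0, 0.0, 1.0]))
```

Note that the sketch needs no learned value network, clipping, or importance ratios; the only moving parts are sampling, a reward, and a baseline.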

* 27 pages, 7 figures, 2 tables

Inference Suboptimality in Variational Autoencoders

May 27, 2018
Chris Cremer, Xuechen Li, David Duvenaud

Amortized inference allows latent-variable models trained via variational learning to scale to large datasets. The quality of approximate inference is determined by two factors: a) the capacity of the variational distribution to match the true posterior and b) the ability of the recognition network to produce good variational parameters for each datapoint. We examine approximate inference in variational autoencoders in terms of these factors. We find that divergence from the true posterior is often due to imperfect recognition networks, rather than the limited complexity of the approximating distribution. We show that this is due partly to the generator learning to accommodate the choice of approximation. Furthermore, we show that the parameters used to increase the expressiveness of the approximation play a role in generalizing inference rather than simply improving the complexity of the approximation.
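The two factors (a) and (b) can be written as a decomposition of the gap between the log marginal likelihood and the variational lower bound. The notation below is an assumption made for illustration: $\mathcal{L}[q]$ denotes the evidence lower bound under $q$, $q_\phi(z \mid x)$ is the output of the recognition network, and $q^*(z \mid x)$ is the best distribution in the chosen variational family for that datapoint.

```latex
\underbrace{\log p(x) - \mathcal{L}[q_\phi(z \mid x)]}_{\text{inference gap}}
  = \underbrace{\log p(x) - \mathcal{L}[q^*(z \mid x)]}_{\text{approximation gap, factor (a)}}
  + \underbrace{\mathcal{L}[q^*(z \mid x)] - \mathcal{L}[q_\phi(z \mid x)]}_{\text{amortization gap, factor (b)}}
```

Both terms on the right are non-negative: the first because $\mathcal{L}[q] \le \log p(x)$ for any $q$, the second because $q^*$ maximizes the bound within the family.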

* ICML

Reinterpreting Importance-Weighted Autoencoders

Aug 15, 2017
Chris Cremer, Quaid Morris, David Duvenaud

The standard interpretation of importance-weighted autoencoders is that they maximize a tighter lower bound on the marginal likelihood than the standard evidence lower bound. We give an alternate interpretation of this procedure: that it optimizes the standard variational lower bound, but using a more complex distribution. We formally derive this result, present a tighter lower bound, and visualize the implicit importance-weighted distribution.
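To make the "tighter lower bound" in the standard interpretation concrete, here is a minimal Monte Carlo sketch of the importance-weighted bound, the expected log of the average of k importance weights p(x, z) / q(z | x). The toy model (prior z ~ N(0, 1), likelihood x | z ~ N(z, 1)) and every function name are assumptions chosen so that log p(x) is available in closed form; this is not code from the paper.

```python
import math
import random


def log_normal(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def log_mean_exp(values):
    """Stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def iwae_bound(x, k, q_mean, q_std, n_outer=2000, seed=0):
    """Monte Carlo estimate of E[log (1/k) sum_i p(x, z_i) / q(z_i | x)].

    With k = 1 this is the standard evidence lower bound.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_outer):
        log_weights = []
        for _ in range(k):
            z = rng.gauss(q_mean, q_std)  # z_i ~ q(z | x)
            # Toy joint: p(z) = N(0, 1), p(x | z) = N(z, 1).
            log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
            log_weights.append(log_joint - log_normal(z, q_mean, q_std))
        total += log_mean_exp(log_weights)
    return total / n_outer


if __name__ == "__main__":
    x = 1.0
    # In this toy model the marginal is x ~ N(0, 2).
    log_px = log_normal(x, 0.0, math.sqrt(2.0))
    # Deliberately mismatched q: the prior N(0, 1) instead of the true posterior.
    for k in (1, 5, 50):
        print(k, iwae_bound(x, k, q_mean=0.0, q_std=1.0), log_px)
```

Up to Monte Carlo noise, the k = 1 estimate should sit below the estimates for larger k, which in turn stay at or below log p(x).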

* ICLR 2017 Workshop
