Document Details
Clip:
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Corby Rosset∗, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah∗, Tengyang Xie∗ (Microsoft Research)
Abstract: This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning and subsequent policy optimization. However, such a reward maximization approach is limited by the nature of "point-wise" rewards (such as that
Filename:
2404.03715.pdf
Filetype:
application/pdf
Size:
1104120 bytes
Uploaded On:
2024-04-08
Abstract:
Summary:
Tags:
Notes:
Visible:
1
Status:
Parsed
Author:
CreationDate:
2024-04-08T00:08:24+00:00
Creator:
LaTeX with hyperref
Keywords:
ModDate:
2024-04-08T00:08:24+00:00
PTEX.Fullbanner:
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5
Producer:
pdfTeX-1.40.25
Subject:
Title:
Trapped:
False
Pages:
36