Document Details
Clip:
The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization
Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, Lewis Tunstall
Hugging Face; Mila, Université de Montréal; Fuxi AI Lab, NetEase
costa@huggingface.co
Abstract: This work is the first to openly reproduce the Reinforcement Learning from Human Feedback (RLHF) scaling behaviors reported in OpenAI's seminal TL;DR summarization work (Stiennon et al., 2020). We create an RLHF pipeline from scratch, enumerate over 20 key implementation details, and share key insights during the reproduction. Our RLHF-trained Pythia models demonstrate significant gains in response quality that scale with model size, with our 2.8B, 6.9B models outperforming OpenAI's released 1.3B checkpoint. We publicly release the trained model checkpoints…
Filename:
2403.17031v1.pdf
Filetype:
application/pdf
Size:
2996825 bytes
Uploaded On:
2024-06-10
Abstract:
Summary:
Tags:
Notes:
Visible:
1
Status:
Parsed
Author:
CreationDate:
2024-03-27T01:10:52+00:00
Creator:
LaTeX with hyperref
Keywords:
ModDate:
2024-03-27T01:10:52+00:00
PTEX.Fullbanner:
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5
Producer:
pdfTeX-1.40.25
Subject:
Title:
Trapped:
False
Pages:
42