Document Details


Clip: The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization. Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, Lewis Tunstall. Hugging Face; Mila, Université de Montréal; Fuxi AI Lab, NetEase. costa@huggingface.co. Abstract: This work is the first to openly reproduce the Reinforcement Learning from Human Feedback (RLHF) scaling behaviors reported in OpenAI's seminal TL;DR summarization work (Stiennon et al., 2020). We create an RLHF pipeline from scratch, enumerate over 20 key implementation details, and share key insights during the reproduction. Our RLHF-trained Pythia models demonstrate significant gains in response quality that scale with model size, with our 2.8B and 6.9B models outperforming OpenAI's released 1.3B checkpoint. We publicly release the trained model check-
Filename: 2403.17031v1.pdf
Filetype: application/pdf
Size: 2996825 bytes
Uploaded On: 2024-06-10
Abstract:
Summary:
Tags:
Notes:
Visible: 1
Status: Parsed
Author:
CreationDate: 2024-03-27T01:10:52+00:00
Creator: LaTeX with hyperref
Keywords:
ModDate: 2024-03-27T01:10:52+00:00
PTEX.Fullbanner: This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5
Producer: pdfTeX-1.40.25
Subject:
Title:
Trapped: False
Pages: 42
