Self-Play Fine-Tuning Converts Weak Language Models
to Strong Language Models
Zixiang Chen∗†   Yihe Deng∗‡   Huizhuo Yuan∗§   Kaixuan Ji¶   Quanquan Gu‖
Abstract
Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is
pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect
of growing a strong LLM out of a weak one without the need for acquiring additional human-
annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN),
which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism,
where the LLM refines its capability by playing against instances of itself. More specifically, the
LLM generates its own training data from its previous iterations, refining its policy by discerning
these self-generated responses from those obtained from human-annotated data. Our method
progressively elevates the LLM from a nascent model to a formidable one, unlocking the full
potential of human-annotated demonstration data for SFT. Theoretically, we prove that the
global optimum to the training objective function of our method is achieved only when the
LLM policy aligns with the target data distribution. Empirically, we evaluate our method on
several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench,
and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM’s
performance across a variety of benchmarks and even outperform models trained through direct
preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds
light on the promise of self-play, enabling the achievement of human-level performance in LLMs
without the need for expert opponents.
1 Introduction
Large Language Models (LLMs) have begun a groundbreaking era in artificial general intelligence (AGI), demonstrating extraordinary capabilities across a wide range of domains that require intricate reasoning and specialized knowledge. These models excel in areas such as mathematical reasoning and problem solving (Cobbe et al., 2021), code generation and programming (Chen et al., 2021), and text generation (Bubeck et al., 2023), among others.
∗Equal contribution
†Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: chenzx19@cs.ucla.edu
‡Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: yihedeng@cs.ucla.edu
§Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: hzyuan@cs.ucla.edu
¶Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: kaixuanji@cs.ucla.edu
‖Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: qgu@cs.ucla.edu
A significant advancement in LLMs is post-pretraining alignment with more desirable behaviors (Mishra et al., 2022), a process often reliant on costly human-annotated data. Typical alignment methods include Supervised Fine-Tuning (SFT) (Ouyang et al., 2022) based on human demonstrations, and Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Ouyang et al., 2022) based on human preferences.
All the aforementioned alignment methods require a substantial volume of human annotated data.
Therefore, there is increasing interest in developing fine-tuning methods that can effectively utilize
human data, thereby streamlining the alignment process. This motivates us to study fine-tuning
LLMs without the need for additional human-annotated data beyond the fine-tuning dataset. Our
study is also related to the broader goal of converting weak models to strong models without the
requirement for extra training data, which is of central interest in machine learning that can be
traced back to the boosting algorithms (Kearns and Valiant, 1994; Freund, 1995; Freund and Schapire, 1997). The self-training algorithm (Vapnik, 1999; Grandvalet and Bengio, 2004; Lee, 2013) has also been proved to be able to convert weak learners to strong learners in mixture models without the need for additional labeled data (Frei et al., 2022; Kou et al., 2022).
However, the pursuit of autonomously enhancing a weak LLM without external guidance is both
intriguing and understudied. This raises the following question:
Can we empower a weak LLM to improve itself without acquiring additional human annotated data?
In this paper, we answer this question affirmatively. Inspired by the success of self-play mechanisms (Samuel, 1959) in games, exemplified by AlphaGo Zero (Silver et al., 2017) and AlphaZero (Silver et al., 2017), with historical roots traced back to TD-Gammon (Tesauro et al., 1995), we propose
to convert a weak LLM to a strong one through the lens of self-play, where the model is enhanced
by playing against itself without requiring any direct supervision. In particular, we propose a novel
fine-tuning method called Self-Play fIne-tuNing (SPIN), which begins from a supervised fine-tuned
model. SPIN allows the LLM to engage in self-play, eliminating the need for an expert annotator such as a human or more advanced LLMs like GPT-4. In detail, with the LLM from the previous iteration t denoted by p_θt, we employ it to generate responses y′ to the prompts x in the human-annotated SFT dataset. The subsequent objective is to find a new LLM p_θt+1, capable of distinguishing the responses y′ generated by p_θt from the responses y generated by humans. This process can be seen as a two-player game: the main player, or the new LLM p_θt+1, seeks to discern between the responses of the opponent player p_θt and human-generated responses, while the opponent, or the old LLM p_θt, generates responses as similar as possible to those in the human-annotated SFT dataset. The new LLM p_θt+1 is obtained by fine-tuning the old one p_θt to prefer responses from p_data over p_θt, resulting in a distribution p_θt+1 that is more aligned with p_data. In the next iteration, the newly obtained LLM p_θt+1 becomes the opponent for response generation, with the self-play process aiming for the LLM to eventually converge to p_θ* = p_data, so that the strongest possible LLM can no longer differentiate the responses generated by its previous version and those generated by the human.
Interestingly, our method exhibits similarity with the recently introduced direct preference optimization (DPO) method (Rafailov et al., 2023), with the notable distinction being the self-play nature of our method. Consequently, our approach stands out by eliminating the need for extra human preference data, a requirement present in the DPO method. Additionally, the self-play mechanism in our method resembles the idea of generative adversarial networks (GAN) (Goodfellow et al., 2014), albeit that both the discriminator (main player) and the generator (the opponent) in our method are instances of the same LLM from different iterations. Theoretically, we prove that our method converges when the distribution of the LLM is identical to the target data distribution, i.e., p_θt = p_data. Our experimental results on zephyr-7b-sft-full (Tunstall et al., 2023a), a fine-tuned LLM based on Mistral-7B (Jiang et al., 2023), show that while continued training using SFT on its own SFT dataset Ultrachat200k (Ding et al., 2023) reaches a performance plateau or even diminished evaluation scores, our method consistently improves zephyr-7b-sft-full across successive iterations while leveraging only a 50k subset of the Ultrachat200k dataset. Ultimately, SPIN effectively improves the base model’s average score from 58.14 to 63.16 on the HuggingFace Open LLM Leaderboard (Beeching et al., 2023) with a remarkable 10%+ improvement in scores on GSM8k and TruthfulQA, and from 5.94 to 6.78 on MT-Bench (Zheng et al., 2023). Notably, SPIN achieves results that are even comparable to models trained on an additional 62k preference dataset (Tunstall et al., 2023a) on the Open LLM leaderboard and MT-Bench.
Concurrent to our work, Singh et al. (2023) proposed the use of synthetic data with binary feedback in self-training, reducing the reliance on human data. In contrast, our approach eliminates the need for additional binary feedback from humans or an extra reward model thanks to the self-play mechanism. Additionally, Burns et al. (2023) employed a weak LLM as the guidance to train stronger LLMs in a weak-to-strong generalization fashion. Unlike Burns et al. (2023), which necessitates both a weak supervisor and a strong model, our SPIN operates effectively with a single LLM.
Notation. We use lowercase letters and lowercase boldface letters to denote scalars and vectors, respectively. We use [N] to denote the index set {1, . . . , N}. In the function space, let F be the function class. The symbol p_data designates the target data distribution, while p represents the conditional probability of the LLM’s response (i.e., the LLM policy).
2 Related Work
Self-Play. Self-play (Samuel, 1959), where the algorithm learns by playing against itself, has gained notable attention due to its effectiveness in multi-agent reinforcement learning (MARL). This method involves agents engaging in interactions with copies of themselves, enabling an increasing level of challenge and complexity within the learning environment. A fundamental work in the field of self-play is AlphaGo Zero (Silver et al., 2017), which demonstrated exceptional performance against human players using a self-play learning scheme. Subsequent research has expanded upon the concept of self-play, exploring various adaptations and implementations (Anthony et al., 2017; Lanctot et al., 2017; Bansal et al., 2018; Hernandez-Leal et al., 2018, among others). Our method takes the self-play approach akin to AlphaGo Zero, which
can convert a weak model to a strong one without additional human-annotated data. While the
effectiveness of self-play in MARL is well-established, to our knowledge, our work is the first to apply
this approach to the enhancement of LLMs.
Synthetic Data for LLMs. In the context of supervised fine-tuning (SFT) of LLMs, human-crafted data has proven to be a remarkably effective source that enhances the performance of LLMs on tasks such as code generation (Roziere et al., 2023) and mathematical reasoning (Yuan et al., 2023). While human data typically exhibits high quality, acquiring a sufficient amount of such data poses a challenge in cost. In light of this consideration, the use of synthetic data has become increasingly popular and is considered a proxy for human data. This approach primarily leverages advanced LLMs such as the GPT series (Radford et al., 2019; Brown et al., 2020) as the guidance to generate high-quality data (Josifoski et al., 2023). Recent research has also highlighted the rephrasing capability of LLMs in prompting for better LLM responses (Deng et al., 2023) as well as augmenting synthetic data for more effective SFT (Yu et al., 2023). In contrast to prior studies that utilized more advanced models for synthetic data generation when pretraining or fine-tuning a target model, our approach directly generates synthetic data from the target model itself.
Curriculum Learning. In deep learning, it has been observed that training models using data samples arranged in a strategically meaningful order can lead to improved performance compared to training on randomly shuffled data. This approach is commonly known as curriculum learning (Bengio et al., 2009). Initial studies in curriculum learning introduced efficient algorithms that adhere to an ‘easy-to-hard’ progression (Spitkovsky et al., 2009; Kumar et al., 2010; Lee and Grauman, 2011). In the field of Natural Language Processing (NLP), criteria such as sentence length and term frequency are commonly utilized (Cirik et al., 2016; Liu et al., 2018). More recent developments include the application of curriculum learning algorithms in multi-modal learning (Liu et al., 2021). Our work shares a similar idea
to curriculum learning, wherein the training data evolves iteratively—beginning with responses that
are easy to distinguish from human-annotated data and gradually progressing to more challenging
instances.
3 Problem Setting and Preliminaries
We consider a Large Language Model (LLM) parameterized by θ and denoted by p_θ. The model takes as input a sequence x = [x_1, . . . , x_n], commonly referred to as the prompt, to generate the corresponding response y = [y_1, . . . , y_m]. The response y is therefore considered as a sample from the conditional probability distribution p_θ(·|x). In LLMs, x_i and y_j represent individual tokens from a predetermined vocabulary within the sequences x and y, respectively. The auto-regressive model p_θ generates tokens sequentially for a given position, leveraging only the sequence of previously generated tokens. This model therefore constitutes a Markov process, where the conditional probability distribution p_θ(y|x) can be expressed through a decomposition as follows:

$$p_\theta(y|x) = \prod_{j=1}^{m} p_\theta(y_j \mid x, y_{<j}),$$

where y_{<1} is null and y_{<j} = [y_1, . . . , y_{j−1}] for j = 2, . . . , m. In the following, we review two major fine-tuning methods for LLMs: supervised fine-tuning and reinforcement learning (RL) fine-tuning.
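To make the decomposition above concrete, the sketch below scores a response under a causal language model by summing per-token log-probabilities. It is a minimal illustration assuming the Hugging Face transformers API; the "gpt2" checkpoint is only a small stand-in, not a model used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def response_log_prob(model, tokenizer, prompt: str, response: str) -> float:
    """Compute log p_theta(y|x) = sum_j log p_theta(y_j | x, y_<j)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits            # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    # The token at position j is predicted by the logits at position j-1.
    n_prompt = prompt_ids.shape[1]
    total = 0.0
    for j in range(n_prompt, full_ids.shape[1]):
        token_id = full_ids[0, j]
        total += log_probs[0, j - 1, token_id].item()
    return total

# Example usage with a small stand-in checkpoint.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(response_log_prob(model, tokenizer, "Q: 2+2=", " 4"))
```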
3.1 Supervised Fine-Tuning (SFT)
Supervised fine-tuning (SFT) is employed to tailor a pre-trained LLM to specific downstream tasks, leveraging a relatively smaller dataset of labeled examples in comparison to the large-scale pre-training data (Ouyang et al., 2022). In this context, we consider a specific task where the prompts, denoted by x, are drawn from a specified distribution q(·). The notation p_data(·|x) then represents the probability distribution of the associated high-quality responses y from the training data. Consequently, SFT involves training the LLM to minimize the following negative log-likelihood loss associated with these distributions,

$$L_{\mathrm{SFT}}(\theta) = -\mathbb{E}_{x\sim q(\cdot),\, y\sim p_{\mathrm{data}}(\cdot|x)}\big[\log p_\theta(y|x)\big]. \quad (3.1)$$
It should be noted that excluding x ∼ q(·) from the expectation term yields the typical cross-entropy loss, expressed as −E_{y∼p_data(·|x)}[log p_θ(y|x)]. L_SFT(θ) attains its minimum when the model’s predictive distribution p_θ(y|x) aligns perfectly with the distribution of the labeled high-quality responses p_data(y|x).
Consequently, the LLM after SFT is anticipated to generate responses that closely resemble those from p_data(y|x). This procedure is therefore expected to significantly enhance the model’s performance in generating appropriate responses for a specific task.
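As a concrete illustration of minimizing (3.1) in practice, the sketch below performs one SFT gradient step in which the cross-entropy is computed only over response tokens. This is a hedged sketch assuming PyTorch and a Hugging Face-style causal LM; it is not the authors' training code.

```python
import torch.nn.functional as F

def sft_step(model, optimizer, batch):
    """One SFT gradient step on a batch of tokenized (prompt, response) pairs.

    batch["input_ids"] holds concatenated prompt+response tokens and
    batch["labels"] equals input_ids with prompt positions set to -100, so the
    cross-entropy is taken only over response tokens, matching (3.1).
    """
    logits = model(batch["input_ids"]).logits
    # Shift so that logits at position j-1 predict the token at position j.
    shift_logits = logits[:, :-1, :]
    shift_labels = batch["labels"][:, 1:]
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```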
3.2 RL Fine-Tuning
RL fine-tuning (Christiano et al., 2017; Bai et al., 2022a; Gao et al., 2023a) offers another method for enhancing the specific capabilities of general-purpose pre-trained models. Typically, RL fine-tuning is employed subsequent to SFT to achieve improved alignment for LLMs (Tunstall et al., 2023a).
For a given sequence pair (x, y), RL fine-tuning necessitates a deterministic reward function r(x, y). The higher the reward r(x, y), the better the response y is to the given prompt x. The objective of the RL fine-tuning process is then to maximize the following objective function:

$$L_{\mathrm{RL}}(\theta) = \mathbb{E}_{x\sim q(\cdot),\, y\sim p_\theta(\cdot|x)}[r(x,y)] - \lambda\,\mathbb{E}_{x\sim q(\cdot)}\,\mathrm{KL}\big(p_\theta(\cdot|x)\,\|\,p_{\mathrm{ref}}(\cdot|x)\big),$$

where the Kullback-Leibler (KL) regularization enforces the new model p_θ to be close to the reference model p_ref, and λ > 0 is the regularization parameter that controls the deviation of the new model p_θ from the reference model p_ref. In practice, the reference model p_ref is often initialized as the supervised fine-tuned model. The inclusion of KL regularization is vital for preventing excessive deviation from the reference model, which in turn reduces the risk of mode collapse.
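In practice the KL term is typically estimated from the sampled responses themselves. The sketch below computes a per-sample estimate r(x, y) − λ(log p_θ(y|x) − log p_ref(y|x)), a common single-sample approximation of the objective above; the reward values are placeholders, and this is only an illustration, not the paper's implementation.

```python
import torch

def kl_regularized_reward(
    reward: torch.Tensor,        # r(x, y) for each sampled response, shape (batch,)
    logp_policy: torch.Tensor,   # log p_theta(y|x), shape (batch,)
    logp_ref: torch.Tensor,      # log p_ref(y|x), shape (batch,)
    lam: float = 0.1,
) -> torch.Tensor:
    """Single-sample estimate of r(x,y) - lambda * KL(p_theta(.|x) || p_ref(.|x)).

    Since y ~ p_theta(.|x), the log-ratio log p_theta(y|x) - log p_ref(y|x) is an
    unbiased estimate of the KL divergence for that prompt.
    """
    return reward - lam * (logp_policy - logp_ref)

# Example with dummy rewards and log-probabilities for three sampled responses.
est = kl_regularized_reward(
    reward=torch.tensor([1.0, 0.2, 0.7]),
    logp_policy=torch.tensor([-12.3, -20.1, -15.4]),
    logp_ref=torch.tensor([-13.0, -19.5, -15.4]),
)
print(est)
```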
Meanwhile, the primary challenge in RL fine-tuning lies in finding a good reward function. Typically, this function requires training on a preference dataset. The compilation of such a dataset demands significant resources, often involving comprehensive evaluations either by human annotators, i.e., reinforcement learning from human feedback (RLHF) (Christiano et al., 2017; Bai et al., 2022a), or by strong AI agents, i.e., reinforcement learning from AI feedback (RLAIF) (Bai et al., 2022b).
4 Method
In this section, we introduce a new fine-tuning method for enhancing the performance of LLMs without relying on additional human or AI feedback. Consider a high-quality supervised fine-tuning (SFT) dataset S_SFT = {(x_i, y_i)}_{i=1}^n, whose samples are drawn from the marginal distribution q(x) and p_data(y|x). Given a supervised fine-tuned LLM p_θ0, further application of the SFT approach in (3.1) with S_SFT will be ineffective and potentially lead to worse performance. In addition, without human and/or AI feedback, it becomes infeasible to acquire a preference dataset for RL fine-tuning (e.g., RLHF and RLAIF). This hinders the application of RL fine-tuning techniques.
We evaluate p_θ0 against S_SFT, where p_θ0 is the LLM obtained by SFT using (3.1). We notice a persistent quality gap between the ground-truth response y from S_SFT and the LLM-generated response y′ ∼ p_θ0(·|x) (refer to Figure 1). This disparity indicates that there is still room for improvement over p_θ0. Therefore, we propose to use the synthetic data generated by the LLM to enhance the LLM’s performance starting from p_θ0 iteratively.
4.1 Self-Play Fine-Tuning (SPIN)
Let us consider a two-player game, where the main player’s objective is to distinguish the responses generated by the LLM and those generated by the human. Meanwhile, the opponent’s role is to generate responses that are indistinguishable from the human’s responses. The core of our method is the self-play mechanism, where both the main player and the opponent are the same LLM, but from different iterations. More specifically, the opponent is the old LLM from the previous iteration, and the main player is the new LLM to be learned in the current iteration.

Figure 1: Example of ground truth completion compared to the fine-tuned model generation at iterations 0 and 1, for the prompt “In Southampton, what is the most popular form of transportation for commuters?”. The chosen ground-truth completion states that it has no access to current data but that, historically, buses are the most popular form of commuter transportation, alongside trains and taxis. The rejected model generation at iteration 0, although fluent, incorrectly quantifies transportation preferences with specific percentages that are potentially hallucinations. The model generation at iteration 1 provides a qualitative summary of the transportation forms in Southampton without specific percentages, aligning more closely with the ground truth while adding more details.
In iteration t+1, the opponent is the LLM from the previous iteration, denoted by p_θt, which generates responses y′ for the prompts x in the SFT dataset according to p_θt(·|x). Our method, therefore, consists of the following two steps at iteration t+1: (1) training the main player, and (2) updating the opponent player.

Training the Main Player. We begin with illustrating how we expect a main player to be trained to distinguish LLM responses from human responses. Motivated by the integral probability metric (IPM) (Müller, 1997), we formulate our objective function such that the main player f_{t+1} maximizes the expected value gap between the target data distribution p_data and the opponent player’s distribution p_θt:

$$f_{t+1} = \mathop{\mathrm{argmax}}_{f \in \mathcal{F}_t}\ \mathbb{E}_{x\sim q(\cdot),\, y\sim p_{\mathrm{data}}(\cdot|x),\, y'\sim p_{\theta_t}(\cdot|x)}\big[f(x,y)-f(x,y')\big], \quad (4.1)$$
where F_t is a sequence of highly expressive function classes that we will determine in later deduction. The subscript t in F_t is due to the fact that the function class depends on p_θt. Given such an f_{t+1} and a response sequence y to the prompt x, the value of f_{t+1}(x, y) reflects the main player’s degree of belief that y originates from p_data rather than p_θt. Ideally, the main player f_{t+1} should yield a high value when y ∼ p_data(·|x) and a low value when y′ ∼ p_θt(·|x), where p_θt is the opponent’s distribution. Instead of solving (4.1), we can also solve the following more general optimization problem,

$$f_{t+1} = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}_t}\ \mathbb{E}_{x\sim q(\cdot),\, y\sim p_{\mathrm{data}}(\cdot|x),\, y'\sim p_{\theta_t}(\cdot|x)}\big[\ell\big(f(x,y)-f(x,y')\big)\big], \quad (4.2)$$
where ℓ(·) is a loss function that is both monotonically decreasing and convex. For example, a linear loss function ℓ(t) = −t reduces (4.2) to the minimization version of (4.1). However, the use of a linear loss function results in an unbounded objective value, which, during continuous training, leads to a negatively infinite value of f(x, y′) on the opponent player’s responses. Therefore, in our work, we choose the logistic loss function ℓ(t) := log(1 + exp(−t)) for its non-negativity, smoothness, and exponentially decaying tail as t → ∞. Such a choice of loss function aids in preventing excessive growth in the absolute value of f. It is worth noting that the objective function defined in (4.2) with a linear loss reduces to a similar IPM framework as in Wasserstein Generative Adversarial Networks (WGAN) (Arjovsky et al., 2017). However, our approach differs in both the choice of the function class F_t and the training procedure.
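A quick numerical illustration of this point (added here, not part of the paper): with the linear loss the objective keeps improving without bound as the margin grows, whereas the logistic loss saturates.

```python
import math

def linear_loss(t: float) -> float:
    return -t

def logistic_loss(t: float) -> float:
    # log(1 + exp(-t)); for very negative t this is approximately -t.
    return math.log1p(math.exp(-t)) if t > -30 else -t

for t in [0.0, 1.0, 10.0, 100.0]:
    print(f"t={t:6.1f}  linear={linear_loss(t):8.1f}  logistic={logistic_loss(t):.6f}")
# The linear loss keeps decreasing (unbounded incentive to push the margin),
# while the logistic loss decays exponentially toward 0, bounding the incentive.
```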
Updating the Opponent Player. Previously we have discussed the training of f_{t+1} given the opponent player’s distribution p_θt. Now suppose we have optimized our main player f_{t+1} that can distinguish p_data from p_θt within a certain function class F_t; we elaborate how we obtain the parameter θ_{t+1} of the opponent player. Specifically, when presented with two responses y and y′ to the same prompt x, f_{t+1} assesses the values f_{t+1}(x, y) and f_{t+1}(x, y′). It then infers that the response with the higher value is from the real data distribution p_data and the response with the lower value is attributed to the LLM p_θt. Subsequently, the objective of the opponent player is to find a better LLM that generates responses indistinguishable from p_data for the main player. This is achieved by maximizing the expected value E_{x∼q(·), y∼p(·|x)}[f_{t+1}(x, y)]. In addition, to prevent excessive deviation of p_θt+1 from p_θt and stabilize the self-play, we incorporate a Kullback-Leibler (KL) regularization term. Putting these together gives rise to the following optimization problem:

$$\mathop{\mathrm{argmax}}_{p}\ \mathbb{E}_{x\sim q(\cdot),\, y\sim p(\cdot|x)}[f_{t+1}(x,y)] - \lambda\,\mathbb{E}_{x\sim q(\cdot)}\,\mathrm{KL}\big(p(\cdot|x)\,\|\,p_{\theta_t}(\cdot|x)\big), \quad (4.3)$$

where λ > 0 is the regularization parameter. Notably, (4.3) has a closed-form solution \hat{p}(·|x):

$$\hat{p}(y|x) \propto p_{\theta_t}(y|x)\,\exp\big(\lambda^{-1} f_{t+1}(x,y)\big). \quad (4.4)$$
It is worth noting that \hat{p}(·|x) is not guaranteed to belong to the LLM space {p_θ(·|x) | θ ∈ Θ}. Since we hope that the closed-form solution \hat{p} in the probability space can be realized by an LLM with parameter θ, i.e., p_θ(y|x) = \hat{p}(y|x), solving p_θ(y|x) ∝ p_θt(y|x) exp(λ^{-1} f_{t+1}(x, y)) gives f_{t+1}(x, y) = λ · log( p_θ(y|x) / p_θt(y|x) ). This suggests the following function class F_t for f_{t+1}:

$$\mathcal{F}_t = \bigg\{\lambda\cdot\log\frac{p_\theta(y|x)}{p_{\theta_t}(y|x)}\ \bigg|\ \theta\in\Theta\bigg\}, \quad (4.5)$$

where Θ is the parameter space of LLMs being considered. Given the choice of F_t in (4.5), optimizing (4.2) gives f_{t+1} parameterized by θ_{t+1} in the following form:

$$f_{t+1}(x,y) = \lambda\cdot\log\frac{p_{\theta_{t+1}}(y|x)}{p_{\theta_t}(y|x)}. \quad (4.6)$$

Substituting (4.6) into (4.4) yields \hat{p}(y|x) = p_{θt+1}(y|x). In other words, θ_{t+1} learned from (4.2) is exactly the LLM parameter for our ideal opponent selection.
Algorithm 1 Self-Play Fine-Tuning (SPIN)
Input: {(x_i, y_i)}_{i∈[N]}: SFT dataset, p_θ0: LLM with parameter θ_0, T: number of iterations.
for t = 0, . . . , T − 1 do
    for i = 1, . . . , N do
        Generate synthetic data y′_i ∼ p_θt(·|x_i).
    end for
    Update θ_{t+1} = argmin_{θ∈Θ} Σ_{i∈[N]} ℓ( λ log [p_θ(y_i|x_i) / p_θt(y_i|x_i)] − λ log [p_θ(y′_i|x_i) / p_θt(y′_i|x_i)] ).
end for
Output: θ_T.
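To make the update step of Algorithm 1 concrete, the sketch below evaluates the per-pair objective ℓ(λ · log-ratio on the real pair − λ · log-ratio on the synthetic pair) from precomputed sequence log-probabilities. It is a minimal PyTorch sketch of the objective, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def spin_loss(
    logp_real: torch.Tensor,       # log p_theta(y_i | x_i) under the model being trained
    logp_real_old: torch.Tensor,   # log p_theta_t(y_i | x_i) under the frozen opponent
    logp_synth: torch.Tensor,      # log p_theta(y'_i | x_i) on self-generated responses
    logp_synth_old: torch.Tensor,  # log p_theta_t(y'_i | x_i)
    lam: float = 0.1,
) -> torch.Tensor:
    """Logistic loss ell(t) = log(1 + exp(-t)) applied to the margin
    lambda * [(log-ratio on real data) - (log-ratio on synthetic data)]."""
    margin = lam * ((logp_real - logp_real_old) - (logp_synth - logp_synth_old))
    return -F.logsigmoid(margin).mean()   # log(1 + exp(-t)) == -logsigmoid(t)
```

Written this way, the update resembles a DPO-style loss in which the rejected response is always the opponent model's own generation for the same prompt (cf. Remark 4.1).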
End-to-end Training Objective. We integrate the previously discussed two steps into a single end-to-end training objective with an update rule for θ_{t+1}. Specifically, plugging (4.5) into (4.2) arrives at the update rule θ_{t+1} = argmin_{θ∈Θ} L_SPIN(θ, θ_t), where L_SPIN is the training objective defined as follows:

$$L_{\mathrm{SPIN}}(\theta,\theta_t) = \mathbb{E}_{x\sim q(\cdot),\, y\sim p_{\mathrm{data}}(\cdot|x),\, y'\sim p_{\theta_t}(\cdot|x)}\bigg[\ell\bigg(\lambda\log\frac{p_\theta(y|x)}{p_{\theta_t}(y|x)}-\lambda\log\frac{p_\theta(y'|x)}{p_{\theta_t}(y'|x)}\bigg)\bigg]. \quad (4.7)$$
We summarize the iterative self-play process of our method SPIN as follows:

. . . → p_θt(·|x) [opponent player at t] → λ · log( p_θt+1(·|x) / p_θt(·|x) ) [main player at t+1] → p_θt+1(·|x) [opponent player at t+1] → . . .

Namely, the opponent player chosen from the previous iteration t is employed to train the main player at iteration t+1, resulting in the LLM parameterized by θ_{t+1}. Then we determine the next opponent player at iteration t+1 by directly copying the LLM parameter θ_{t+1}, which is then used in training the main player at iteration t+2. The detailed algorithm is presented in Algorithm 1.
Remark 4.1. (4.7) bears resemblance to direct preference optimization (DPO) (Rafailov et al., 2023) for RL fine-tuning. However, SPIN exhibits significant distinctions from DPO. Specifically, SPIN is applied to supervised fine-tuning (SFT) and relies solely on the SFT dataset, represented by pairs (x, y). In sharp contrast, DPO is designed for RL fine-tuning and necessitates a preference dataset, represented by (x, y_w, y_l), where y_w and y_l denote the winner (chosen) and loser (rejected) responses, respectively. DPO demands that, at the instance level, y_w is superior to y_l. In comparison, our method requires that, at the distribution level, the target p_data should be distinguishable from the weak LLM p_θ before it becomes a strong one. In terms of algorithm design, DPO implements a single-iteration approach, while our method facilitates an iterative self-play strategy, as outlined in Algorithm 1.
5 Theoretical Analysis
In this section, we provide a theoretical analysis for Algorithm 1. Under the monotonicity and convexity assumption on the objective function ℓ, we show that the global optimum is obtained if and only if the parameter θ_t generates the target data distribution. We summarize our assumptions as follows:
Assumption 5.1. The loss function ℓ(t) : R → R is monotonically decreasing, i.e., ∀t, ℓ′(t) ≤ 0, and satisfies ℓ′(0) < 0. In addition, ℓ(t) is a convex function.

Assumption 5.1 holds for a wide range of loss functions commonly used in machine learning, including the correlation loss ℓ(t) = 1 − t, the hinge loss ℓ(t) = max(0, 1 − t), the exponential loss ℓ(t) = exp(−t), and the logistic loss ℓ(t) = log(1 + exp(−t)). Under Assumption 5.1, we present the following theorem, which is pivotal in understanding the optimization dynamics of our method.
Theorem 5.2. Under Assumption 5.1, suppose that there exists p_θ(·|x) = p_data(·|x); then we have that
• (Sufficiency) If p_θt(·|x) = p_data(·|x), then θ_t is the global minimum of (4.7) for any λ ≥ 0.
• (Necessity) If p_θt(·|x) ≠ p_data(·|x), there exists an appropriately chosen λ such that θ_t is not the global minimum of (4.7).
Remark 5.3. Theorem 5.2 reveals that our method naturally stops at the point p_θ(·|x) = p_data(·|x), implying the effectiveness of our approach in aligning the LLM’s distribution with the target data distribution. Moreover, Theorem 5.2 indicates that the optimization process only stops when global optimality is achieved, i.e., the LLM’s distribution aligns with the target data distribution.
For the logistic loss function ℓ(t) = log(1 + exp(−t)), the following theorem gives a more precise characterization of the opponent player, enabling a better understanding of SPIN.
Theorem 5.4. Consider the choice of logistic loss ℓ(t) = log(1 + exp(−t)) in SPIN. Suppose that p_θt(y|x) ( p_data(y|x)/p_θt(y|x) )^{1/λ} lies in the LLM space {p_θ(y|x) | θ ∈ Θ} and θ_{t+1} is the global minimum of L_SPIN(θ, θ_t). Then the opponent player at iteration t+1 satisfies

$$p_{\theta_{t+1}}(y|x) \propto p_{\theta_t}(y|x)\,\big(p_{\mathrm{data}}(y|x)/p_{\theta_t}(y|x)\big)^{1/\lambda}.$$
Remark 5.5. According to Theorem 5.4, the model update from p_θt(y|x) to p_θt+1(y|x) tends to increase the probability p_θt+1(y|x) when p_θt(y|x) is less than p_data(y|x), and decrease it when p_θt(y|x) is greater than p_data(y|x). Thus, Theorem 5.4 further indicates that the update process naturally converges to the point where p_θ(·|x) equals p_data(·|x). The update of the opponent player is controlled by ( p_data(y|x)/p_θt(y|x) )^{1/λ}, which is regulated by the factor 1/λ. A smaller λ results in a larger change of the opponent player, while a larger λ leads to a smaller change. Therefore, as p_θ(·|x) approaches p_data(·|x), increasing λ enhances the stability of LLM training. This observation aligns with (4.3), where λ is the regularization parameter of the KL regularization that is employed to control the deviation of the opponent player.
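The multiplicative update in Theorem 5.4 is easy to visualize on a toy discrete distribution. The script below (an illustration we add, not from the paper) iterates p_{t+1} ∝ p_t (p_data / p_t)^{1/λ} and shows how a smaller λ moves the opponent toward p_data more aggressively.

```python
import numpy as np

def spin_update(p_t: np.ndarray, p_data: np.ndarray, lam: float) -> np.ndarray:
    """One idealized opponent update: p_{t+1} proportional to p_t * (p_data / p_t)^(1/lam)."""
    unnormalized = p_t * (p_data / p_t) ** (1.0 / lam)
    return unnormalized / unnormalized.sum()

p_data = np.array([0.5, 0.3, 0.2])
for lam in [4.0, 1.0]:
    p = np.array([0.8, 0.15, 0.05])        # a poorly aligned starting distribution
    for _ in range(5):
        p = spin_update(p, p_data, lam)
    print(f"lambda={lam}: after 5 updates p = {np.round(p, 4)}")
# lambda = 1 reaches p_data in a single step; a larger lambda changes p more conservatively,
# closing only part of the gap to p_data at each update.
```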
6 Experiments
This section provides a detailed empirical analysis of SPIN. Our findings highlight several key points: (1) SPIN markedly enhances model performance across a wide range of evaluation benchmarks by breaking the limit of SFT; (2) even without introducing new human-annotated data, SPIN at iteration 0 achieves performance on par with DPO training that utilizes even more data; (3) iterative training is a necessary component in SPIN as it breaks the limit of multi-epoch training.
6.1 Experiment Setup
Model and Datasets. In this study, we adopt zephyr-7b-sft-full as our base model. This model derives from the pre-trained Mistral-7B (Jiang et al., 2023) and has been further fine-tuned on the SFT dataset Ultrachat200k[1] by HuggingFace. Ultrachat200k represents a high-quality 200k subset of the larger UltraChat (Ding et al., 2023) corpus, which comprises approximately 1.4M dialogues produced using OpenAI’s Turbo APIs. From Ultrachat200k, we randomly sample 50k prompts and use the base model to generate the synthetic responses. We subsequently follow the optimization method described in Section 4.1. In each subsequent iteration, we retain the synthetic data from the most recent iteration and add it to the newly generated synthetic data, therefore resulting in a synthetic dataset of size 50k at iteration 0 and 100k at iterations 1, 2 and 3. At each iteration, we train our model for 2 epochs.
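The synthetic responses for an iteration are ordinary samples from the current checkpoint. The sketch below is a hedged illustration of that generation step; the checkpoint id and the sampling settings (top_p, max_new_tokens) are placeholders rather than the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "alignment-handbook/zephyr-7b-sft-full"   # placeholder checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def generate_synthetic_response(prompt: str, max_new_tokens: int = 256) -> str:
    """Sample y' ~ p_theta_t(.|x) for one prompt from the SFT dataset."""
    text = f"### Instruction: {prompt}\n\n### Response: "
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            do_sample=True,
            top_p=0.9,                         # illustrative sampling settings
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Strip the prompt tokens and return only the generated continuation.
    return tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```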
Evaluation. We employ the widely used HuggingFace Open LLM Leaderboard (Beeching et al., 2023) as our evaluation benchmark, using the same Language Model Evaluation Harness library (Gao et al., 2023b). This leaderboard encompasses 6 different datasets, each focusing on a specific capability of LLMs. Collectively, these datasets provide a thorough assessment framework, evaluating LLMs on commonsense reasoning (Arc (Clark et al., 2018), HellaSwag (Zellers et al., 2019), Winogrande (Sakaguchi et al., 2021)), multi-task language understanding (MMLU (Hendrycks et al., 2020)), human falsehood mimicking (TruthfulQA (Lin et al., 2021)) and math problem solving (GSM8k (Cobbe et al., 2021)). In evaluation, the language models are prompted with few-shot in-context examples and the question. We follow the standard approach and report the average score across all datasets. In Table 1, we detail the evaluation setting adopted by both the leaderboard and our experiments. We leave further implementation details to Appendix A.
Table 1: Detailed information of the HuggingFace Open LLM Leaderboard. For each evaluation dataset, we present the number of few-shot examples and the metric adopted for evaluation.

Datasets     Arc       TruthfulQA  Winogrande  GSM8k  HellaSwag  MMLU
# few-shot   25        0           5           5      10         5
Metric       acc_norm  mc2         acc         acc    acc_norm   acc
6.2 SPIN Effectively Improves Benchmark Performance
We demonstrate the effectiveness of SPIN using the HuggingFace Open LLM Leaderboard as a wide range of evaluation. In Table 2, we compare the performance of our fine-tuned model by SPIN after iterations 0 to 3 with the base model zephyr-7b-sft-full. We can observe that SPIN exhibits remarkable effectiveness in improving the model’s performance by further leveraging the SFT dataset, on which the base model has already been fully fine-tuned. At iteration 0, where model responses are generated from zephyr-7b-sft-full, we observe an overall improvement of 2.66% on the average score. The improvement is particularly significant on the TruthfulQA and GSM8k benchmarks, with improvements exceeding 5% and 10% respectively. At iteration 1, we employ the LLM from iteration 0 to generate new responses for SPIN, adhering to the procedure outlined in Algorithm 1. This iteration yields further enhancements of 1.32% on average, which are especially significant on the Arc Challenge and TruthfulQA benchmarks. Subsequent iterations continue this trend of incremental improvement across various tasks. Meanwhile, the improvement at iteration t+1 is naturally smaller than that at iteration t.
[1] https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
Figure 2: The average score of SPIN at different iterations on the HuggingFace Open LLM Leaderboard datasets (SFT: 58.14; SPIN iteration 0: 60.80; iteration 1: 62.12; iteration 2: 62.97; iteration 3: 63.16). For “SFT”, we report the performance of our base model zephyr-7b-sft-full, which has been fine-tuned on the same dataset we use to generate synthetic data.
As the iterative training progresses, the degree of improvement gradually approaches zero, suggesting that the model has reached a limiting point in the last iteration.
Table 2: Test performance of SPIN based on zephyr-7b-sft-full across HuggingFace Open LLM Leaderboard datasets. We also denote the average improvement over the last iteration in the Average column.

Model               Arc    TruthfulQA  Winogrande  GSM8k  HellaSwag  MMLU   Average
zephyr-7b-sft-full  60.41  43.73       74.19       26.76  82.85      60.92  58.14
SPIN iteration 0    63.40  49.18       72.69       35.10  84.38      60.03  60.80 (+2.66)
SPIN iteration 1    65.19  55.17       72.30       35.78  84.96      59.34  62.12 (+1.32)
SPIN iteration 2    65.96  54.91       73.56       38.06  85.41      59.93  62.97 (+0.85)
SPIN iteration 3    65.87  54.90       73.72       38.97  85.54      59.99  63.16 (+0.19)
Comparison with DPO. zephyr-7b-beta is a model derived from zephyr-7b-sft-full, trained with DPO on approximately 62k preference data. This data, the UltraFeedback Binarized dataset[2], comprises both chosen and rejected completions evaluated by GPT-4. We note that DPO requires either human input or advanced language model feedback to determine the preference, making data generation a rather expensive procedure. In contrast, our SPIN only requires the initial model itself. Moreover, unlike DPO, which requires a new data source, our method exclusively leverages the existing SFT dataset. In Figure 3, we show the performance comparison of SPIN at iterations 0 and 1 (employing 50k SFT data) with DPO training, from the same SFT checkpoint. We can observe that, while DPO leverages more data from new sources, SPIN based on the existing SFT data can already achieve comparable average performance to DPO training at iteration 0. From iteration 1, SPIN even surpasses the performance of DPO on the leaderboard benchmark.
[2] https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized
Figure 3: Performance comparison with DPO training across the six benchmark datasets (Arc, TruthfulQA, Winogrande, GSM8k, HellaSwag, MMLU) and their average, comparing Zephyr-SFT, Zephyr-DPO, and SPIN at iterations 0 through 3. Self-play at iteration 0 achieves comparable performance to DPO training with 62k new data. At iteration 1, self-play has already surpassed DPO training on the majority of datasets.
6.3 Ablation Studies
In this subsection, we examine the effect of synthetic dataset size and training epochs within an iteration. Our analysis demonstrates the effectiveness of the synthetic data used by SPIN compared to the SFT data, as well as the necessity of iterative training in SPIN. Furthermore, to comprehensively assess the performance improvements of SPIN, we perform additional evaluations on benchmark tasks distinct from those in the Open LLM leaderboard.
Training Size. We investigate the effect of varying training data size on the performance of SPIN. In Figure 4, we demonstrate the effect of training size for SPIN during iteration 0 and additionally compare with SFT on the full original dataset. Specifically, for the SFT baseline, we fully fine-tune Mistral-7B on Ultrachat200k for three epochs and report the first-epoch performance as the starting point (with x-axis 0) in the figure for SFT. For SPIN, we report the zephyr-7b-sft-full checkpoint as the starting point, which has also been fine-tuned on Ultrachat200k for one epoch. We select the training size of SPIN at iteration 0 to be 14k, 26k, and 50k and generate the data accordingly, ensuring that the larger dataset encompasses the smaller dataset. The performance of SPIN was then evaluated after 1 epoch of self-play fine-tuning for each training size. We can observe that, while SPIN results in notable improvement with increasing training sizes, SFT for further epochs 2 and 3 fails to yield more than 1% improvement. Lastly, in Table 3, we also show the performance of SFT from zephyr-7b-sft-full on Ultrachat200k for one epoch. While self-play fine-tuning with synthetic data from zephyr-7b-sft-full effectively improves its performance, simply fine-tuning it again on the SFT data leads to degraded performance, as similarly observed in Figure 4.
Table 3: Test performance of zephyr-7b-sft-full fine-tuned on Ultrachat200k for 1 more epoch across HuggingFace Open LLM benchmark datasets. SFT fails to further leverage the fine-tuning data for performance enhancement and even results in degraded performance.

Model               Arc    TruthfulQA  Winogrande  GSM8k  HellaSwag  MMLU   Average
zephyr-7b-sft-full  60.41  43.73       74.19       26.76  82.85      60.92  58.14
SFT epoch 1         57.76  44.39       75.77       25.85  81.69      57.89  57.23
Figure 4: The scaling effect of training size of SPIN compared to SFT on the average score of the Open LLM Leaderboard (SPIN rises from 58.14 at size 0 to 59.55, 60.16 and 60.83 at 14k, 26k and 50k; SFT moves from 59.04 to 59.82 and 59.27 over epochs 2 and 3). For SPIN, we consider training data of sizes 14k, 26k and 50k, where the larger dataset contains the smaller dataset. The starting point for SPIN (with x-axis 0) is the zephyr-7b-sft-full checkpoint, which has been fine-tuned on Ultrachat200k for 1 epoch. We report the model performance trained for 1 epoch with SPIN on the varying sizes of dataset. We additionally compare with SFT, where we fine-tune Mistral-7B on Ultrachat200k for 3 consecutive epochs and report the model performance at the first epoch as the starting point (with x-axis 0).
Iterative Training vs. Training for More Epochs. We further study the training within iteration 0 and compare with the performance achieved in iteration 1, particularly contrasting the test performance obtained from an extended training duration with that from the next iteration. Figure 5 depicts the performance trajectory of the model trained using SPIN over multiple epochs at iteration 0. It is evident that the most substantial improvement occurs during the first two epochs, followed by only modest gains in subsequent epochs. Notably, SPIN exhibits robustness and stability; extending the training duration does not diminish performance but rather maintains a rather consistent level. Nevertheless, the observation suggests an inherent limitation to the performance achievable within a single iteration, thereby underscoring the necessity for iterative training. As shown by the test performance achieved at iteration 1 in the figures, extending the training in iteration 0 fails to reach performance comparable to iteration 1.
Further Investigation on More Tasks. Here, we further investigate the performance of SPIN on a broader variety of tasks, including MT-Bench (Zheng et al., 2023), Big-Bench (BIG-bench authors, 2023) and OpenBookQA (Mihaylov et al., 2018), in addition to the Open LLM Leaderboard tasks. Specifically, we use the following tasks from Big-Bench-Hard for a more comprehensive evaluation, including Causal Judgment (causal reasoning), Sports Understanding (commonsense reasoning) and Formal Fallacies (logical reasoning). In Table 4, we show the resulting scores of SPIN on MT-Bench as well as those tasks from Big-Bench. In Figure 6, we detail the model performances on MT-Bench with regard to different types of questions. We can see a notably robust improvement in the performance of SPIN on various tasks besides the HuggingFace Benchmark, without major degradation. Notably, on MT-Bench, the model fine-tuned by SPIN has surpassed the performance of vicuna-13b-v1.5 (Chiang et al., 2023), which scores 6.57.
Figure 5: The SPIN training dynamics of zephyr-7b-sft-full on the 50k synthetic data with regard to the number of training epochs during iteration 0, shown for (a) Arc Challenge accuracy, (b) TruthfulQA score, and (c) average score. We can observe that iterative training is pivotal as training for more epochs during iteration 0 reaches a limit and cannot surpass iteration 1.
Table 4: Test performance on other reasoning benchmark datasets for SPIN at different iterations and zephyr-7b-sft-full. We report the average score for MT-Bench and the accuracy score for Big-Bench datasets under standard few-shot CoT evaluation. On OpenBookQA, we report acc_norm with a 1-shot example as used in Anil et al. (2023). Similar to the Open LLM Leaderboard evaluation, we observe a steady improvement in performance on the other benchmark tasks, with no significant degradation.

Model               MT-Bench      BB-causal  BB-formal  BB-sports  OpenBookQA
zephyr-7b-sft-full  5.94          56.15      49.6       96.0       45.4
SPIN iteration 0    6.46 (+0.52)  57.75      51.6       95.2       46.8
SPIN iteration 1    6.65 (+0.19)  58.82      51.2       95.2       47.2
SPIN iteration 2    6.78 (+0.13)  59.36      51.2       94.4       47.6
7 Conclusion and Discussion
This paper introduces a novel fine-tuning method, SPIN, to convert a weak LLM to a strong LLM by unleashing the full power of human-annotated data. Central to this method is a self-play mechanism, wherein a main player (the LLM) is fine-tuned to differentiate the responses of the opponent player (the LLM from the previous iteration) from the target data distribution, and the LLM is iteratively aligned with the target data distribution. Therefore, SPIN facilitates the LLM’s iterative self-evaluation and enhancement through self-play. In comparison to supervised fine-tuning and RL fine-tuning methods, SPIN enables the LLM to self-improve without additional human data or feedback from stronger LLMs. Empirical results demonstrate that SPIN significantly enhances LLM performance across diverse benchmarks, even outperforming models trained with additional human data or AI feedback.

Limitation and Future Work. Our theoretical results demonstrate that the optimization process of SPIN converges if and only if the LLM’s distribution aligns with p_data. Therefore, our study focuses on a fixed target data distribution generated by humans, which inherently imposes a ceiling on the performance of the fine-tuned LLM. Exploring a dynamically changing target data distribution is an important direction to overcome this limitation and elevate the LLM’s performance beyond this ceiling or even to a super-human level. Moreover, considering the resource demands of synthetic data generation, another promising avenue for further exploration is to reduce the volume of required synthetic data.
Figure 6: Model performance on MT-Bench across question categories (Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities). We compare SPIN across different iterations with the base SFT model. Starting from iteration 1, our fine-tuned model by SPIN robustly outperforms the SFT checkpoint on all evaluation aspects.
A Experiment Details
A.1 Implementation Details
We use the Alignment Handbook library (Tunstall et al., 2023b) as the codebase for our self-play fine-tuning method SPIN, which includes DeepSpeed ZeRO-3 (Rajbhandari et al., 2020) and FlashAttention-2 (Dao, 2023) to reduce training cost. We train our models with the RMSProp (Hinton et al., 2012) optimizer with no weight decay for all iterations, as commonly used in fine-tuning LLMs for alignment, with a global batch size of 64, 10% warmup steps and bfloat16 precision. We set the peak learning rate to be 5e-7 for iterations 0 and 1, and decay this peak learning rate to 1e-7 for iterations 2 and 3 as we approach the end of self-play fine-tuning. Lastly, we choose β = 0.1 and the max sequence length to be 2048 tokens as in Tunstall et al. (2023b). We note that at the last iteration (iter-3), where the model is close to convergence, we increase the value of β to 5.0. We use the Accelerate library (Gugger et al., 2022) to generate our synthetic data using distributed inference with multiple GPUs, with a global batch size of 64. We consider the prompting template "### Instruction: {prompt}\n\n### Response: " as commonly used in prior work. For Ultrachat200k, which contains multi-round conversations, we only sample the first round as our prompt and ground truth completion pairs.
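As a rough illustration of this configuration, the sketch below builds the RMSProp optimizer with no weight decay and a schedule with 10% warmup steps at the stated peak learning rate. It assumes PyTorch and the transformers scheduler helper; the linear decay after warmup is our assumption, not a detail taken from the paper.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer_and_scheduler(model, num_training_steps: int, peak_lr: float = 5e-7):
    """RMSProp with no weight decay and 10% warmup, mirroring the stated setup."""
    optimizer = torch.optim.RMSprop(model.parameters(), lr=peak_lr, weight_decay=0.0)
    num_warmup_steps = int(0.1 * num_training_steps)
    # The decay shape after warmup (linear here) is an assumption for illustration.
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```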
A.2 Generation Examples
In the tables below, we further provide generation examples of our fine-tuned model by SPIN from different iterations. We can observe an improvement in response quality as compared to the generation of the SFT checkpoint. Meanwhile, the model generations at higher iterations typically become more concise than at iteration 0 and resemble the ground truth completion better.
B Proofs of Theorems in Section 5
B.1 Proof of Theorem 5.2
Proof of Theorem 5.2. To begin with, we prove the “Sufficiency” in Theorem 5.2. Since p_data(·|x) = p_θt(·|x), by the symmetry property of y and y′, we have for any θ ∈ Θ that

$$\begin{aligned}
2L_{\mathrm{SPIN}}(\theta,\theta_t) &= \mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\ell\bigg(\lambda\log\frac{p_\theta(y|x)}{p_{\theta_t}(y|x)}-\lambda\log\frac{p_\theta(y'|x)}{p_{\theta_t}(y'|x)}\bigg)\bigg] \\
&\quad+\mathbb{E}_{x\sim q(\cdot),\,y'\sim p_{\mathrm{data}}(\cdot|x),\,y\sim p_{\theta_t}(\cdot|x)}\bigg[\ell\bigg(\lambda\log\frac{p_\theta(y|x)}{p_{\theta_t}(y|x)}-\lambda\log\frac{p_\theta(y'|x)}{p_{\theta_t}(y'|x)}\bigg)\bigg] \\
&= \mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\ell\bigg(\lambda\log\frac{p_\theta(y|x)}{p_{\theta_t}(y|x)}-\lambda\log\frac{p_\theta(y'|x)}{p_{\theta_t}(y'|x)}\bigg)+\ell\bigg(\lambda\log\frac{p_\theta(y'|x)}{p_{\theta_t}(y'|x)}-\lambda\log\frac{p_\theta(y|x)}{p_{\theta_t}(y|x)}\bigg)\bigg] \\
&\ge 2\,\mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\ell\bigg(\frac{\lambda}{2}\log\frac{p_\theta(y|x)}{p_{\theta_t}(y|x)}-\frac{\lambda}{2}\log\frac{p_\theta(y'|x)}{p_{\theta_t}(y'|x)}+\frac{\lambda}{2}\log\frac{p_\theta(y'|x)}{p_{\theta_t}(y'|x)}-\frac{\lambda}{2}\log\frac{p_\theta(y|x)}{p_{\theta_t}(y|x)}\bigg)\bigg] \\
&= 2\ell(0),
\end{aligned}$$

where the inequality is due to Jensen’s inequality (recalling that ℓ is convex by Assumption 5.1). Therefore, we have that L_SPIN(θ, θ_t) ≥ ℓ(0) = L_SPIN(θ_t, θ_t), which means that θ_t is the global optimum of (4.7). As a consequence, the gradient at the point θ_t is zero, which concludes θ_{t+1} = θ_t.
Next, we prove the “Necessity”. Define g(λ) as follows:

$$g(\lambda) = \mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\ell\bigg(\lambda\log\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}-\lambda\log\frac{p_{\mathrm{data}}(y'|x)}{p_{\theta_t}(y'|x)}\bigg)\bigg].$$
Then we have g(0) = ℓ(0) and

$$\begin{aligned}
g'(0) &= \mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\ell'(0)\bigg(\log\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}-\log\frac{p_{\mathrm{data}}(y'|x)}{p_{\theta_t}(y'|x)}\bigg)\bigg] \\
&= \ell'(0)\bigg(\mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x)}\bigg[\log\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}\bigg]-\mathbb{E}_{x\sim q(\cdot),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\log\frac{p_{\mathrm{data}}(y'|x)}{p_{\theta_t}(y'|x)}\bigg]\bigg) \\
&= \ell'(0)\Big[\mathrm{KL}\big(p_{\mathrm{data}}(\cdot|x)\,\|\,p_{\theta_t}(\cdot|x)\big)+\mathrm{KL}\big(p_{\theta_t}(\cdot|x)\,\|\,p_{\mathrm{data}}(\cdot|x)\big)\Big] \\
&< 0,
\end{aligned}$$
where the last inequality is due to the condition that ℓ′(0) < 0. Therefore, there exists a λ_0 such that for all 0 < λ < λ_0, we have g(λ) < ℓ(0). Choose θ* such that p_θ*(y|x) = p_data(y|x). For those 0 < λ < λ_0, we have that

$$\begin{aligned}
L_{\mathrm{SPIN}}(\theta^*,\theta_t) &= \mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\theta^*}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\ell\bigg(\lambda\log\frac{p_{\theta^*}(y|x)}{p_{\theta_t}(y|x)}-\lambda\log\frac{p_{\theta^*}(y'|x)}{p_{\theta_t}(y'|x)}\bigg)\bigg] \\
&= \mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\ell\bigg(\lambda\log\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}-\lambda\log\frac{p_{\mathrm{data}}(y'|x)}{p_{\theta_t}(y'|x)}\bigg)\bigg] \\
&= g(\lambda) < g(0) = L_{\mathrm{SPIN}}(\theta_t,\theta_t),
\end{aligned}$$

where the second equality holds by the choice of p_θ*(·|x), and the inequality holds due to the choice of λ. Therefore, we conclude that θ_t is not the global optimum of (4.7) if p_θt(·|x) ≠ p_data(·|x).
B.2 Proof of Theorem 5.4
We need the following auxiliary lemma before we prove Theorem 5.4.

Lemma B.1. Suppose that ℓ(t) = log(1 + exp(−t)) and a, b > 0. Then the following inequality holds:

$$a\,\ell(t) + b\,\ell(-t) \ge a\log(1 + b/a) + b\log(1 + a/b),$$

and the equality holds if and only if t = log(a/b).

Proof of Lemma B.1. Define g(t) = aℓ(t) + bℓ(−t) = a log(1 + exp(−t)) + b log(1 + exp(t)); then we have

$$g'(t) = -\frac{a\exp(-t)}{1+\exp(-t)} + \frac{b\exp(t)}{1+\exp(t)} = \frac{-a + b\exp(t)}{1+\exp(t)}.$$

Therefore, g′(t) < 0 when t < log(a/b) and g′(t) > 0 when t > log(a/b), which indicates that g achieves its minimum at t = log(a/b). This concludes the proof.

Lemma B.1 shows that the minimum of aℓ(t) + bℓ(−t) is achieved when t = log(a/b).
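As a quick numerical sanity check of Lemma B.1 (added for illustration, not part of the original proof), the snippet below minimizes aℓ(t) + bℓ(−t) over a grid and confirms the minimizer sits at t = log(a/b) with the stated minimum value.

```python
import math

def weighted_loss(t: float, a: float, b: float) -> float:
    """a*l(t) + b*l(-t) with the logistic loss l(t) = log(1 + exp(-t))."""
    return a * math.log1p(math.exp(-t)) + b * math.log1p(math.exp(t))

a, b = 3.0, 0.5
grid = [i / 1000.0 for i in range(-8000, 8001)]
t_best = min(grid, key=lambda t: weighted_loss(t, a, b))
print("argmin on grid:", t_best, " log(a/b):", math.log(a / b))
print("min value:", weighted_loss(t_best, a, b),
      " bound:", a * math.log1p(b / a) + b * math.log1p(a / b))
```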
Based on Lemma B.1, we can further prove that (4.2) with the logistic loss function has a closed-form solution if we ignore the constraint set F_t.
Lemma B.2. Denote p_+(y, y′, x) = q(x)·p_data(y|x)·p_θt(y′|x) and p_−(y, y′, x) = q(x)·p_θt(y|x)·p_data(y′|x). Then

$$\mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\big[\ell\big(f(x,y)-f(x,y')\big)\big] \ge \log 2 - \mathrm{JSD}(p_+\,\|\,p_-),$$

where JSD(p_+ ∥ p_−) represents the Jensen–Shannon divergence, which is defined as follows:

$$\mathrm{JSD}\big(p\,\|\,q\big) = \frac{1}{2}\,\mathrm{KL}\Big(p\,\Big\|\,\frac{p+q}{2}\Big) + \frac{1}{2}\,\mathrm{KL}\Big(q\,\Big\|\,\frac{p+q}{2}\Big),$$

where KL(·∥·) is the KL divergence. JSD is always non-negative and equals zero if and only if p_+ and p_− are identical. Moreover, the global minimum value log 2 − JSD(p_+ ∥ p_−) is achieved by f* if and only if

$$f^*(x,y) = Z(x) + \log\bigg(\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}\bigg),$$

where Z(x) is any function that is possibly dependent on x.
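Before turning to the proof, a small numerical check (ours, not the paper's) helps confirm the statement: for a single prompt and a three-element response space, plugging the optimal f* into the objective reproduces log 2 − JSD(p_+ ∥ p_−) exactly.

```python
import numpy as np

def logistic(t):
    return np.log1p(np.exp(-t))

p_data = np.array([0.5, 0.3, 0.2])
p_old  = np.array([0.6, 0.1, 0.3])    # stands in for p_theta_t(.|x), single prompt x

# Joint distributions over (y, y') used in Lemma B.2.
p_plus  = np.outer(p_data, p_old)     # p_+(y, y') = p_data(y) * p_theta_t(y')
p_minus = np.outer(p_old, p_data)     # p_-(y, y') = p_theta_t(y) * p_data(y')

def kl(p, q):
    return np.sum(p * np.log(p / q))

m = 0.5 * (p_plus + p_minus)
jsd = 0.5 * kl(p_plus, m) + 0.5 * kl(p_minus, m)

# Optimal discriminator f*(y) = log(p_data(y) / p_theta_t(y)), up to a constant Z(x).
f_star = np.log(p_data / p_old)
objective = sum(
    p_data[i] * p_old[j] * logistic(f_star[i] - f_star[j])
    for i in range(3) for j in range(3)
)
print("objective at f*:", objective)
print("log 2 - JSD    :", np.log(2) - jsd)   # the two values match
```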
Proof of Lemma B.2. We rewrite the objective function in the following formula,

$$\begin{aligned}
&2\,\mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\big[\ell\big(f(x,y)-f(x,y')\big)\big] \\
&= \int q(x)p_{\mathrm{data}}(y|x)p_{\theta_t}(y'|x)\big[\ell\big(f(x,y)-f(x,y')\big)\big]\,dy\,dy' + \int q(x)p_{\mathrm{data}}(y'|x)p_{\theta_t}(y|x)\big[\ell\big(f(x,y')-f(x,y)\big)\big]\,dy\,dy' \\
&= \int q(x)p_{\mathrm{data}}(y|x)p_{\theta_t}(y'|x)\,\ell\big(f(x,y)-f(x,y')\big) + q(x)p_{\mathrm{data}}(y'|x)p_{\theta_t}(y|x)\,\ell\big(f(x,y')-f(x,y)\big)\,dy\,dy' \\
&\overset{(i)}{\ge} \int q(x)p_{\mathrm{data}}(y|x)p_{\theta_t}(y'|x)\log\bigg(1+\frac{p_{\mathrm{data}}(y'|x)p_{\theta_t}(y|x)}{p_{\mathrm{data}}(y|x)p_{\theta_t}(y'|x)}\bigg) + q(x)p_{\mathrm{data}}(y'|x)p_{\theta_t}(y|x)\log\bigg(1+\frac{p_{\mathrm{data}}(y|x)p_{\theta_t}(y'|x)}{p_{\mathrm{data}}(y'|x)p_{\theta_t}(y|x)}\bigg)\,dy\,dy',
\end{aligned}$$

where the inequality is due to aℓ(t) + bℓ(−t) ≥ a log(1 + b/a) + b log(1 + a/b) in Lemma B.1, with a = q(x)p_data(y|x)p_θt(y′|x), b = q(x)p_data(y′|x)p_θt(y|x) and t = f(x, y) − f(x, y′). The equality (i) holds if and only if the following equation holds almost surely for any x, y, y′:

$$f(x,y)-f(x,y') = \log\bigg(\frac{p_{\mathrm{data}}(y|x)\,p_{\theta_t}(y'|x)}{p_{\mathrm{data}}(y'|x)\,p_{\theta_t}(y|x)}\bigg). \quad (B.1)$$

(B.1) is equivalent to

$$f(x,y)-\log\bigg(\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}\bigg) = f(x,y')-\log\bigg(\frac{p_{\mathrm{data}}(y'|x)}{p_{\theta_t}(y'|x)}\bigg)$$

holding almost surely for any x, y, y′. Therefore, the equality (i) holds if and only if there exists some Z(x) such that

$$f(x,y) = Z(x) + \log\bigg(\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}\bigg).$$

Recall that p_+(y, y′|x) = p_data(y|x)·p_θt(y′|x) and p_−(y, y′|x) = p_θt(y|x)·p_data(y′|x). Then the right-hand side of (i) can be written as

$$\begin{aligned}
&\int q(x)p_{\mathrm{data}}(y|x)p_{\theta_t}(y'|x)\log\bigg(1+\frac{p_{\mathrm{data}}(y'|x)p_{\theta_t}(y|x)}{p_{\mathrm{data}}(y|x)p_{\theta_t}(y'|x)}\bigg) + q(x)p_{\mathrm{data}}(y'|x)p_{\theta_t}(y|x)\log\bigg(1+\frac{p_{\mathrm{data}}(y|x)p_{\theta_t}(y'|x)}{p_{\mathrm{data}}(y'|x)p_{\theta_t}(y|x)}\bigg)\,dy\,dy' \\
&= \int p_+(y,y'|x)\log\bigg(1+\frac{p_-(y,y'|x)}{p_+(y,y'|x)}\bigg) + p_-(y,y'|x)\log\bigg(1+\frac{p_+(y,y'|x)}{p_-(y,y'|x)}\bigg)\,dy\,dy' \\
&= 2\log 2 + \int p_+(y,y'|x)\log\bigg(\frac{\tfrac{1}{2}[p_-(y,y'|x)+p_+(y,y'|x)]}{p_+(y,y'|x)}\bigg) + p_-(y,y'|x)\log\bigg(\frac{\tfrac{1}{2}[p_-(y,y'|x)+p_+(y,y'|x)]}{p_-(y,y'|x)}\bigg)\,dy\,dy' \\
&= 2\log 2 - \mathrm{KL}\Big(p_+\,\Big\|\,\frac{p_++p_-}{2}\Big) - \mathrm{KL}\Big(p_-\,\Big\|\,\frac{p_++p_-}{2}\Big) \\
&= 2\log 2 - 2\cdot\mathrm{JSD}(p_+\,\|\,p_-),
\end{aligned}$$

where the last equality is by the definition of JSD. This concludes the proof.
Lemma B.2 provides the closed-form solution to (4.2) if we ignore the constraint set F_t. If this closed-form solution belongs to F_t, then it should also be the solution to (4.2). This observation is the key to the proof of Theorem 5.4.
Proof of Theorem 5.4. Under the condition of Theorem 5.4, there exists a p_θ such that

$$p_\theta(y|x) \propto p_{\theta_t}(y|x)\,\big(p_{\mathrm{data}}(y|x)/p_{\theta_t}(y|x)\big)^{1/\lambda}.$$

Therefore, there exists a function \hat{Z}(x) such that

$$p_\theta(y|x) = \hat{Z}(x)\cdot p_{\theta_t}(y|x)\,\big(p_{\mathrm{data}}(y|x)/p_{\theta_t}(y|x)\big)^{1/\lambda}. \quad (B.2)$$

Applying the logarithm function on both sides of (B.2) yields

$$\lambda\log(\hat{Z}(x)) + \log\bigg(\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}\bigg) = \lambda\log\bigg(\frac{p_\theta(y|x)}{p_{\theta_t}(y|x)}\bigg) \in \mathcal{F}_t.$$

By Lemma B.2, f*(x, y) = λ log(\hat{Z}(x)) + log( p_data(y|x)/p_θt(y|x) ) is the global minimum of the following minimization problem,

$$\mathop{\mathrm{argmin}}_{f}\ \mathbb{E}_{y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\big[\ell\big(f(x,y)-f(x,y')\big)\big]. \quad (B.3)$$

Since f* ∈ F_t, f*(x, y) = λ log(\hat{Z}(x)) + log( p_data(y|x)/p_θt(y|x) ) is also the global optimum of the optimization problem (4.2),

$$\mathop{\mathrm{argmin}}_{f\in\mathcal{F}_t}\ \mathbb{E}_{y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\big[\ell\big(f(x,y)-f(x,y')\big)\big].$$

Therefore, we have proved that

$$\min_{f}\ \mathbb{E}_{y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\big[\ell\big(f(x,y)-f(x,y')\big)\big] = \min_{f\in\mathcal{F}_t}\ \mathbb{E}_{y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\big[\ell\big(f(x,y)-f(x,y')\big)\big] = \min_{\theta\in\Theta} L_{\mathrm{SPIN}}(\theta,\theta_t). \quad (B.4)$$

Since θ_{t+1} is the global minimum of L_SPIN(θ, θ_t), by (B.4), λ log( p_θt+1(y|x)/p_θt(y|x) ) should be the global minimum of problem (B.3). By Lemma B.2, there exists Z(x) such that

$$\lambda\log\bigg(\frac{p_{\theta_{t+1}}(y|x)}{p_{\theta_t}(y|x)}\bigg) = Z(x) + \log\bigg(\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}\bigg),$$

which leads to the result that p_θt+1(y|x) ∝ p_θt(y|x) ( p_data(y|x)/p_θt(y|x) )^{1/λ}.
References
Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z. et al. (2023). Palm 2 technical report. arXiv preprint arXiv:2305.10403.
Anthony, T., Tian, Z. and Barber, D. (2017). Thinking fast and slow with deep learning and tree search. Advances in Neural Information Processing Systems 30.
Arjovsky, M., Chintala, S. and Bottou, L. (2017). Wasserstein generative adversarial networks. In International Conference on Machine Learning. PMLR.
Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q. et al. (2021). Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T. et al. (2022a). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C. et al. (2022b). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
Bansal, T., Pachocki, J., Sidor, S., Sutskever, I. and Mordatch, I. (2018). Emergent complexity via multi-agent competition. In International Conference on Learning Representations.
Beeching, E., Fourrier, C., Habib, N., Han, S., Lambert, N., Rajani, N., Sanseviero, O., Tunstall, L. and Wolf, T. (2023). Open LLM leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.
BIG-bench authors (2023). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.
Bengio, Y., Louradour, J., Collobert, R. and Weston, J. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems 33 1877–1901.
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S. et al. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.
Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., Aschenbrenner, L., Chen, Y., Ecoffet, A., Joglekar, M., Leike, J. et al. (2023). Weak-to-strong generalization: Eliciting strong capabilities with weak supervision.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G. et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I. and Xing, E. P. (2023). Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S. and Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems 30.
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S. et al. (2022). Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
Cirik, V., Hovy, E. and Morency, L.-P. (2016). Visualizing and understanding curriculum learning for long short-term memory networks. arXiv preprint arXiv:1611.06204.
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C. and Tafjord, O. (2018). Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R. et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
Dao, T. (2023). FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
Deng, Y., Zhang, W., Chen, Z. and Gu, Q. (2023). Rephrase and respond: Let large language models ask better questions for themselves. arXiv preprint arXiv:2311.04205.
Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., Liu, Z., Sun, M. and Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
Frei, S., Zou, D., Chen, Z. and Gu, Q. (2022). Self-training converts weak learners to strong learners in mixture models. In International Conference on Artificial Intelligence and Statistics. PMLR.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation 121 256–285.
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 119–139.
Gao, L., Schulman, J. and Hilton, J. (2023a). Scaling laws for reward model overoptimization. In International Conference on Machine Learning. PMLR.
Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac'h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K. and Zou, A. (2023b). A framework for few-shot language model evaluation.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems 27.
Grandvalet, Y. and Bengio, Y. (2004). Semi-supervised learning by entropy minimization. Advances in neural information processing systems 17.
Gugger, S., Debut, L., Wolf, T., Schmid, P., Mueller, Z., Mangrulkar, S., Sun, M. and Bossan, B. (2022). Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J. (2020). Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
Hernandez-Leal, P., Kartal, B. and Taylor, M. E. (2018). Is multiagent deep reinforcement learning the answer or the question? a brief survey. learning 21 22.
Hinton, G., Srivastava, N. and Swersky, K. (2012). Neural networks for machine learning lecture 6a overview of mini-batch gradient descent.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L. et al. (2023). Mistral 7b. arXiv preprint arXiv:2310.06825.
Josifoski, M., Sakota, M., Peyrard, M. and West, R. (2023). Exploiting asymmetry for synthetic training data generation: Synthie and the case of information extraction. arXiv preprint arXiv:2303.04132.
Kearns, M. and Valiant, L. (1994). Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM (JACM) 41 67–95.
Kou, Y., Chen, Z., Cao, Y. and Gu, Q. (2022). How does semi-supervised learning with pseudo-labelers work? a case study. In The Eleventh International Conference on Learning Representations.
Kumar, M., Packer, B. and Koller, D. (2010). Self-paced learning for latent variable models. Advances in neural information processing systems 23.
Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Pérolat, J., Silver, D. and Graepel, T. (2017). A unified game-theoretic approach to multiagent reinforcement learning. Advances in neural information processing systems 30.
Lee, D.-H. (2013). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Challenges in Representation Learning Workshop.
Lee, Y. J. and Grauman, K. (2011). Learning the easy things first: Self-paced visual category discovery. In CVPR 2011. IEEE.
Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T. et al. (2022). Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems 35 3843–3857.
Li, Y., Bubeck, S., Eldan, R., Giorno, A. D., Gunasekar, S. and Lee, Y. T. (2023). Textbooks are all you need ii: phi-1.5 technical report.
Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A. et al. (2022). Competition-level code generation with alphacode. Science 378 1092–1097.
Lin, S., Hilton, J. and Evans, O. (2021). Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
Liu, B., Bubeck, S., Eldan, R., Kulkarni, J., Li, Y., Nguyen, A., Ward, R. and Zhang, Y. (2023). Tinygsm: achieving > 80% on gsm8k with small language models. arXiv preprint arXiv:2312.09241.
Liu, C., He, S., Liu, K., Zhao, J. et al. (2018). Curriculum learning for natural answer generation. In IJCAI.
Liu, F., Ge, S. and Wu, X. (2021). Competence-based multimodal curriculum learning for medical report generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S. and Zhang, D. (2023). Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583.
Mihaylov, T., Clark, P., Khot, T. and Sabharwal, A. (2018). Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
Mishra, S., Khashabi, D., Baral, C. and Hajishirzi, H. (2021). Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773.
Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in applied probability 29 429–443.
Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E. et al. (2019). A generalized training approach for multiagent learning. arXiv preprint arXiv:1909.12823.
OpenAI (2023). Gpt-4 technical report.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 27730–27744.
Prasad, A., Stengel-Eskin, E. and Bansal, M. (2023). Rephrase, augment, reason: Visual grounding of questions for vision-language models. arXiv preprint arXiv:2310.05861.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I. et al. (2019). Language models are unsupervised multitask learners. OpenAI blog 1 9.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D. and Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
Rajbhandari, S., Rasley, J., Ruwase, O. and He, Y. (2020). Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE.
Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J. et al. (2023). Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
Sakaguchi, K., Bras, R. L., Bhagavatula, C. and Choi, Y. (2021). Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM 64 99–106.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of research and development 3 210–229.
Samuel, A. L. (2000). Some studies in machine learning using the game of checkers. IBM Journal of research and development 44 206–226.
Schapire, R. E. (1990). The strength of weak learnability. Machine learning 5 197–227.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T. et al. (2017a). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. et al. (2017b). Mastering the game of go without human knowledge. Nature 550 354–359.
Singh, A., Co-Reyes, J. D., Agarwal, R., Anand, A., Patil, P., Liu, P. J., Harrison, J., Lee, J., Xu, K., Parisi, A. et al. (2023). Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585.
Soviany, P., Ionescu, R. T., Rota, P. and Sebe, N. (2022). Curriculum learning: A survey. International Journal of Computer Vision 130 1526–1565.
Spitkovsky, V. I., Alshawi, H. and Jurafsky, D. (2009). Baby steps: How “less is more” in unsupervised dependency parsing. In NIPS 2009 Workshop on Grammar Induction, Representation of Language and Language Learning.
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D. and Christiano, P. F. (2020). Learning to summarize with human feedback. Advances in Neural Information Processing Systems 33 3008–3021.
Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P. and Hashimoto, T. B. (2023). Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
Tesauro, G. et al. (1995). Temporal difference learning and td-gammon. Communications of the ACM 38 58–68.
Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y. et al. (2022). Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S. et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N. et al. (2023a). Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944.
Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rush, A. M. and Wolf, T. (2023b). The alignment handbook. https://github.com/huggingface/alignment-handbook.
Vapnik, V. (1999). The nature of statistical learning theory. Springer science & business media.
Victor, S., Albert, W., Colin, R., Stephen, B., Lintang, S., Zaid, A., Antoine, C., Arnaud, S., Arun, R., Manan, D. et al. (2022). Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.
Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W., Dudzik, A., Huang, A., Georgiev, P., Powell, R., Ewalds, T., Horgan, D., Kroiss, M., Danihelka, I., Agapiou, J., Oh, J., Dalibard, V., Choi, D., Sifre, L., Sulsky, Y., Vezhnevets, S., Molloy, J., Cai, T., Budden, D., Paine, T., Gulcehre, C., Wang, Z., Pfaff, T., Pohlen, T., Yogatama, D., Cohen, J., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Apps, C., Kavukcuoglu, K., Hassabis, D. and Silver, D. (2019). AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D. et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 24824–24837.
Wu, J., Liang, Y., Akbari, H., Wang, Z., Yu, C. et al. (2022). Scaling multimodal pre-training via cross-modality gradient harmonization. Advances in Neural Information Processing Systems 35 36161–36173.
Yang, Y., Singh, A. K., Elhoushi, M., Mahmoud, A., Tirumala, K., Gloeckle, F., Rozière, B., Wu, C.-J., Morcos, A. S. and Ardalani, N. (2023). Decoding data quality via synthetic corruptions: Embedding-guided pruning of code data. arXiv preprint arXiv:2312.02418.
Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok, J. T., Li, Z., Weller, A. and Liu, W. (2023). Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.
Yuan, Z., Yuan, H., Li, C., Dong, G., Tan, C. and Zhou, C. (2023). Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825.
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A. and Choi, Y. (2019). Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
Zhang, D., Meng, D., Li, C., Jiang, L., Zhao, Q. and Han, J. (2015). A self-paced multiple-instance learning framework for co-saliency detection. In Proceedings of the IEEE international conference on computer vision.
Zhang, X., Kumar, G., Khayrallah, H., Murray, K., Gwinnup, J., Martindale, M. J., McNamee, P., Duh, K. and Carpuat, M. (2018). An empirical exploration of curriculum learning for neural machine translation. arXiv preprint arXiv:1811.00739.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. et al. (2023). Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.
Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P. and Irving, G. (2019). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
Table 5: Generation example of our fine-tuned model by SPIN at different iterations.
Prompt: How does the economic growth of Horsham compare to nearby towns?
Ground truth: I do not have access to the latest economic data on horsham and nearby towns. However, you can check the latest economic statistics and reports provided by the local government or the official websites of horsham and nearby towns for a detailed comparison of their economic growth.
zephyr-7b-sft-full: According to the given material, Horsham was growing "significantly" faster than nearby towns, indicating that it was economically thriving. However, the given material does not specify the exact rates or levels of economic growth of Horsham compared to nearby towns. Without this information, it is challenging to determine whether the differences in economic growth between Horsham and nearby towns were significant or minimal.
Iteration 0: I do not have access to the latest economic data for horsham and nearby towns. However, horsham is known to be a prosperous town in west sussex, uk, with a diverse economy that includes agriculture, manufacturing, and services. The town has a strong retail sector, and it is also home to several large companies, including the global pharmaceutical company novartis. In terms of comparison with nearby towns, horsham is generally considered to be one of the more affluent towns in the region. However, the economic growth of horsham and its neighboring towns may vary depending on various factors such as industry, investment, and government policies.
Iteration 1: I do not have access to real-time data or information about the most recent economic growth of horsesham and nearby towns. However, according to the recent reports from various sources such as the office for national statistics, horsesham's economy primarily relies on retail, healthcare, education, and public administration industries. Compared to nearby towns such as lewes, chichester, and bognor regis, horsesham's economy may have a different composition and growth rate based on their specific industries and economic policies.
Iteration 2: I don't have access to the most recent economic data. However, according to a 2019 report by pwc, horsham's expected economic growth up to 2025 was projected to be around 2.3% per annum, which is slightly below the average for the south east of england (2.8%). Neighbouring towns like crawley and bognor regis were projected to experience similar growth rates. note: this may not reflect the impact of the covid-19 pandemic that occurred in 2020.
Table 6: Another generation example of our fine-tuned model by SPIN at different iterations.
Prompt: Can you clarify if Poppelsdorfer Allee 45 and the old site of the Mathematical Institute are the same location? Answer according to: Go straight to Poppelsdorfer Allee 45 and sign-in. There is a big lecture hall at the old site of the Mathematical Institute (Wegelerstr. 10, Bonn).
Ground truth: No, Poppelsdorfer Allee 45 and the old site of the Mathematical Institute (Wegelerstr. 10, Bonn) are not the same location. To sign-in, one should go straight to Poppelsdorfer Allee 45. However, there is a big lecture hall at the old site of the Mathematical Institute.
zephyr-7b-sft-full: Yes, Poppelsdorfer Allee 45 and the old site of the Mathematical Institute are the same location.
Iteration 0: Yes, Poppelsdorfer Allee 45 and the old site of the Mathematical Institute (Wegelerstr. 10, Bonn) are the same location. The Mathematical Institute has been relocated and renamed as the Institute for Mathematics and Theoretical Physics, but the building where it used to be is still referred to as the old site. The big lecture hall at the old site still exists and may be used for events or classes.
Iteration 1: No, Poppelsdorfer Allee 45 and the old site of the Mathematical Institute are not the same location. While Poppelsdorfer Allee 45 is where attendees should sign-in, the old site of the Mathematical Institute (located at Wegelerstr. 10, Bonn) has a large lecture hall mentioned in the statement. Therefore, it can be assumed that the two locations are separate and not the same.
Iteration 2: Poppelsdorfer Allee 45 and the old site of the Mathematical Institute (Wegelerstr. 10, Bonn) are two different locations. As per the instruction, you need to go straight to Poppelsdorfer Allee 45 and sign-in. However, there is a lecture hall mentioned at the old site of the Mathematical Institute (Wegelerstr. 10, Bonn).
to Strong Language Models
Zixiang Chen
∗†
Yihe Deng
∗‡
Huizhuo Yuan
∗§
Kaixuan Ji
¶
Quanquan Gu
‖
Abstract
Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is
pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect
of growing a strong LLM out of a weak one without the need for acquiring additional human-
annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN),
which starts from a supervised fine-tuned model. At the heart ofSPINlies a self-play mechanism,
where the LLM refines its capability by playing against instances of itself. More specifically, the
LLM generates its own training data from its previous iterations, refining its policy by discerning
these self-generated responses from those obtained from human-annotated data. Our method
progressively elevates the LLM from a nascent model to a formidable one, unlocking the full
potential of human-annotated demonstration data for SFT. Theoretically, we prove that the
global optimum to the training objective function of our method is achieved only when the
LLM policy aligns with the target data distribution. Empirically, we evaluate our method on
several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench,
and datasets from Big-Bench. Our results show thatSPINcan significantly improve the LLM’s
performance across a variety of benchmarks and even outperform models trained through direct
preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds
light on the promise of self-play, enabling the achievement of human-level performance in LLMs
without the need for expert opponents.
1
Large Language Models (LLMs) have began a groundbreaking era in artificial general intelligence
(AGI), demonstrating extraordinary capabilities across a wide range of domains that require in-
tricate reasoning and specialized knowledge. These models excel in areas such as mathematical
reasoning/problem solving (Cobbe et al.,;,;,), code gener-
ation/programming (Chen et al.,;,;,), text generation (Bubeck
∗
Equal contribution
†
Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail:
chenzx19@cs.ucla.edu
‡
Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail:
yihedeng@cs.ucla.edu
§
Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail:
hzyuan@cs.ucla.edu
¶
Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail:
kaixuanji@cs.ucla.edu
‖
Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail:qgu@cs.ucla.edu
1
et al.,;,;,
others. A significant advancement in LLMs is the post-pretraining alignment with the more de-
sirable behaviors (Mishra et al.,;,;,;
2022), a process often reliant on the costly human-annotated data. Typical alignment methods
include Supervised Fine-Tuning (SFT) (Ouyang et al.,;,) based on human
demonstrations, and Reinforcement Learning from Human Feedback (RLHF) (Christiano et al.,
2017;,;;,) based on human preferences.
All the aforementioned alignment methods require a substantial volume of human annotated data.
Therefore, there is increasing interest in developing fine-tuning methods that can effectively utilize
human data, thereby streamlining the alignment process. This motivates us to study fine-tuning
LLMs without the need for additional human-annotated data beyond the fine-tuning dataset. Our
study is also related to the broader goal of converting weak models to strong models without the
requirement for extra training data, which is of central interest in machine learning that can be
traced back to the boosting algorithms (Kearns and Valiant,,;,;
Freund and Schapire,). The self-training algorithm (Vapnik,;,
2004;,) has also been proved to be able to convert weak learners to strong learners in
mixture models without the need for additional labeled data (,;,).
However, the pursuit of autonomously enhancing a weak LLM without external guidance is both
intriguing and understudied. This raises the following question:
Can we empower a weak LLM to improve itself without acquiring additional human annotated data?
In this paper, we answer this question affirmatively. Inspired by the success of self-play mecha-
nisms (Samuel,) in games, exemplified by AlphaGo Zero (Silver et al.,), AlphaZero (Silver
et al.,), with historical roots traced back to TD-Gammon (Tesauro et al.,), we propose
to convert a weak LLM to a strong one through the lens of self-play, where the model is enhanced
by playing against itself without requiring any direct supervision. In particular, we propose a novel
fine-tuning method called Self-Play fIne-tuNing (SPIN), which begins from a supervised fine-tuned
model.SPINallows the LLM to engage in self-play, eliminating the need for an expert annotator such
as a human or more advanced LLMs like GPT-4. In detail, with the LLM from previous iterationt
denoted bypθt, we employ it to generate responsesy
′
to the promptsxin the human-annotated
SFT dataset. The subsequent objective is to find a new LLMpθt+1, capable of distinguishing the
responsesy
′
generated bypθtfrom the responsesygenerated by humans. This process can be seen
as a two-player game: the main player, or the new LLMpθt+1, seeks to discern between the responses
of the opponent playerpθtand human-generated responses, while the opponent, or the old LLM
pθt, generates responses as similar as possible to those in the human-annotated SFT dataset. The
new LLMpθt+1is obtained by fine-tuning the old onepθtto prefer responses frompdataoverpθt,
resulting in a distributionpθt+1that is more aligned withpdata. In the next iteration, the newly
obtained LLMpθt+1becomes the opponent for response generation, with the self-play process aiming
for the LLM to eventually converge topθ
∗=pdata, so that the strongest possible LLM can no longer
differentiate the responses generated by its previous version and those generated by the human.
Interestingly, our method exhibits similarity with the recently introduced direct preference
optimization (DPO) method (Rafailov et al.,), with the notable distinction being the self-play
nature of our method. Consequently, our approach stands out by eliminating the need for extra
human preference data, a requirement present in the DPO method. Additionally, the self-play
mechanism in our method resembles the idea of generative adversarial networks (GAN) (Goodfellow
2
et al.,;,), albeit that both the discriminator (main player) and the generator
(the opponent) in our method are instances of the same LLM from different iterations. Theoretically,
we prove that our method converges when the distribution of the LLM is identical to the target data
distribution, i.e.,pθt=pdata. Our experimental results onzephyr-7b-sft-full(Tunstall et al.,
2023a), a fine-tuned LLM based on Mistral-7B (Jiang et al.,), show that while continued training
using SFT on its own SFT dataset Ultrachat200k (Ding et al.,) reaches a performance plateau
or even diminished evaluation scores, our method consistently improveszephyr-7b-sft-fullacross
successive iterations while leveraging only a50k subset of Ultrachat200k dataset. Ultimately,SPIN
effectively improves the base model’s average score from58.14to63.16on the HuggingFace Open
LLM Leaderboard (Beeching et al.,) with remarkable 10%+ improvement in scores on GSM8k
and TruthfulQA, and from5.94to6.78on MT-Bench (Zheng et al.,). Notably, SPINachieves
results that are even comparable to models trained on additional62k preference dataset (Tunstall
et al.,) on Open LLM leaderboard and MT-Bench.
Concurrent to our work,2023) proposed the use of synthetic data with binary
feedback in self-training, reducing the reliance on human data. In contrast, our approach eliminates
the need for additional binary feedback from humans or an extra reward model thanks to the self-play
mechanism. Additionally,2023) employed a weak LLM model as the guidance to
train stronger LLMs in a fashion of weak-to-strong generation. Unlike2023
necessitates both a weak supervisor and a strong model, ourSPINoperates effectively with a single
LLM.
Notation.We use lowercase letters and lowercase boldface letters to denote scalars and vectors,
respectively. We use[N]to denote the index set{1, . . . , N} . In the function space, letFbe the
function class. The symbolqdatadesignates the target data distribution, whileprepresents the
conditional probability of LLM’s response (i.e., LLM policy).
2
Self-Play.Self-play (Samuel,;,), where the algorithm learns by playing
against itself, has gained notable attention due to its effectiveness in multi-agent reinforcement learn-
ing (MARL). This method involves agents engaging in interactions with copies of themselves, enabling
an increasing level of challenge and complexity within the learning environment. A fundamental
work in the field of self-play is AlphaGo Zero (Silver et al.,), which demonstrated exceptional
performance against human players using a self-play learning scheme. Subsequent research has
expanded upon the concept of self-play, exploring various adaptations and implementations (Anthony
et al.,;,;,;,;,
2019;,). Our method takes the self-play approach akin to AlphaGo Zero, which
can convert a weak model to a strong one without additional human-annotated data. While the
effectiveness of self-play in MARL is well-established, to our knowledge, our work is the first to apply
this approach to the enhancement of LLMs.
Synthetic Data for LLMs.In the context of supervised fine-tuning (SFT) of LLMs, human-
crafted data has proven to be a remarkably effective source that enhances the performance of
LLMs on tasks such as code generation (Roziere et al.,;,) and mathematical
reasoning (Yuan et al.,;,). While human data typically exhibits high quality,
acquiring sufficient amount of such data poses a challenge in cost. In light of this consideration, the
use of synthetic data has become increasingly popular and considered as a proxy for human data.
This approach primarily leverages advanced LLMs such as the GPT series (Radford et al.,;
3
Brown et al.,;,) as the guidance to generate high-quality data (Josifoski et al.
2023;,;,;,). Recent research has also highlighted
the rephrasing capability of LLMs in prompting for better LLM response (Deng et al.,;
et al.,) as well as augmenting synthetic data for more effective SFT (Yu et al.,,
2023). In contrast to prior studies that utilized more advanced models for synthetic data generation
when pretraining or fine-tuning a target model, our approach directly generate synthetic data from
the target model itself.
Curriculum Learning.In deep learning, it has been observed that training models using data
samples arranged in a strategically meaningful order can lead to improved performance compared to
training on randomly shuffled data. This approach is commonly known as curriculum learning (Bengio
et al.,;,). Initial studies in curriculum learning introduced efficient algorithms
that adhere to an ‘easy-to-hard’ progression (Spitkovsky et al.,;,;
Grauman,;,). In the field of Natural Language Processing (NLP), criteria
such as sentence length and term frequency are commonly utilized (Cirik et al.,;,
2018;,). More recent developments include the application of curriculum learning
algorithms in multi-modal learning (Liu et al.,;,). Our work shares a similar idea
to curriculum learning, wherein the training data evolves iteratively—beginning with responses that
are easy to distinguish from human-annotated data and gradually progressing to more challenging
instances.
3
We consider a Large Language Model (LLM) parameterized byθand denoted bypθ. The model
takes as input a sequencex= [x1, . . . , xn ], commonly referred to as the prompt, to generate
the corresponding responsey= [y1, . . . , ym ]. The responseyis therefore considered as a sample
from the conditional probability distributionpθ(·|x). In LLMs,xiandyjrepresent individual
tokens from a predetermined vocabulary within the sequencesxandy, respectively. The auto-
regressive modelpθgenerates tokens sequentially for a given position, leveraging only the sequence
of previously generated tokens. This model therefore constitutes a Markov process, where the
conditional probability distributionpθ(y|x)can be expressed through a decomposition as follows:
pθ(y|x) =
m
Y
j=1
pθ(yj|x,y<j),
wherey<1is null andy<j= [y1, . . . , yj−1 ]forj= 2, . . . , m. In the following, we review two major
fine-tuning methods for LLMs: supervised fine-tuning and reinforcement learning (RL) fine-tuning.
3.1
Supervised fine-tuning (SFT) is employed to tailor a pre-trained LLM to specific downstream tasks,
leveraging relatively smaller dataset of labeled examples in comparison to the large-scale pre-training
data (Ouyang et al.,;,). In this context, we consider a specific task where the
prompts, denoted byx, are derived from a specified distributionq(·). The notationpdata(·|x)then
represents the probability distribution of the associated high-quality responsesyfrom the training
data. Consequently, SFT involves training the LLM to minimize the following negative log-likelihood
loss associated with these distributions,
LSFT(θ) =−E
x∼q(·),y∼pdata(·|x)
h
logpθ
`
y|x
´
i
. (3.1)
4
It should be noted that excludingx∼q(·)from the expectation term yields the typical cross-
entropy loss, expressed as−E
y∼pdata(·|x) [logpθ (y|x)].LSFT(θ)attains its minimum when the model’s
predictive distributionpθ(y|x)aligns perfectly with the distribution of the labeled high-quality
responsespdata(y|x).
Consequently, the LLM after SFT is anticipated to generate responses that closely resemble
those frompdata(y|x). This procedure is therefore expected to significantly enhance the model’s
performance in generating appropriate responses for a specific task.
3.2
RL fine-tuning (Christiano et al.,;,;,
enhancing the specific capabilities of general-purpose pre-trained models. Typically, RL fine-tuning
is employed subsequent to SFT to achieve improved alignment for LLMs (Tunstall et al.,).
For a given sequence pair(x,y), RL fine-tuning necessitates a deterministic reward function
r(x,y). The higher the rewardr(x,y), the better the responseyis to the given promptx. The
objective of the RL fine-tuning process is then to maximize the following objective function:
LRL(θ) =E
x∼q(·),y∼pθ(·|x)[r(x,y)]−λE
x∼q(·)KL
`
pθ(·|x))||pref(·|x)
´
,
where the Kullback-Leibler (KL) regularization enforces the new modelpθto be close to the reference
modelpref, andλ >0is the regularization parameter to control the deviation of the new model
pθfrom the reference modelpref. In practice, the reference modelprefis often initialized as the
supervised fine-tuned model. The inclusion of KL regularization is vital for preventing excessive
deviation from the reference model, which in turn reduces the risk of mode collapse.
Meanwhile, the primary challenge in RL fine-tuning lies in finding a good reward function.
Typically, this function requires training on a preference dataset. The compilation of such a dataset
demands significant resources, often involving comprehensive evaluations either by human annotators,
i.e., reinforcement learning from human feedback (RLHF) (Christiano et al.,;,)
or strong AI agents, i.e., reinforcement learning from AI feedback (RLAF) (Bai et al.,
4
In this section, we introduce a new fine-tuning method for enhancing the performance of LLMs
without relying on additional human or AI feedback. Consider a high-quality supervised fine-tuning
(SFT) datasetSSFT={(x,y)}
n
i=1, which are sampled from the marginal distributionq(x)and
pdata(y|x). Given a supervised fine-tuned LLMpθ0, further application of the SFT approach in(3.1)
withSSFTwill be ineffective and potentially lead to worse performance. In addition, without human
and/or AI feedback, it becomes infeasible to acquire a preference dataset for RL fine-tuning (e.g.,
RLHF and RLAF). This hinders the application of RL fine-tuning techniques.
We evaluatepθ0againstSSFT, wherepθ0is the LLM achieved by SFT using(3.1). We notice
a persistent quality gap between the groundtruth responseyfromSSFTand the LLM-generated
responsey
′
∼pθ
(·|x)(refer to Figure). This disparity indicates that there is still room for
improvement overpθ0. Therefore, we propose to use the synthetic data generated by the LLM to
enhance LLM’s performance starting frompθ0
iteratively.
4.1 SPIN)
Let us consider a two-player game, where the main player’s objective is to distinguish the responses
generated by the LLM and those generated by the human. Meanwhile, the opponent’s role is to
generate responses that are indistinguishable from the human’s responses. The core of our method
5
In Southampton, what is the most popular form of
transportation for commuters?
I don't have access to current data. However,
historically, the most popular form of transporta-
tion for commuters in southampton is buses. The
city also has a good network of trains and taxis.
In Southampton, what is the most popular form of
transportation for commuters?
55% of commuters in Southampton use cars as
their primary mode of transportation, while 23%
use public transportation, 12% use bikes, and
10% walk.
Chosen: ground truth completion Rejected: model generation @Iter 0
I do not have access to the latest updates and statistics of southampton. However, based on
previous reports, the most popular form of transportation for commuters in southampton is
the public transport network, which includes buses, trains, and ferries. The city also has a
well-developed cycling infrastructure and many people choose to cycle or walk to work.
Additionally, private cars and taxis are also commonly used for commuting.
Model generation @Iter 1 Figure 1: Example of ground truth completion compared to the fine-tuned model generation at
iteration 0 and 1. We can observe that the model generation at iteration 0, although fluent, incorrectly
quantifies transportation preferences with specific percentages that are potentially hallucinations.
The model generation at iteration 1 provides a qualitative summary of the transportation forms at
Southampton without specific percentage, aligning more closely with the ground truth while adding
more details.
is the self-play mechanism, where both the main player and the opponent are the same LLM, but
from different iterations. More specifically, the opponent is the old LLM from the previous iteration,
and the main player is the new LLM to be learned in the current iteration.
In iterationt+ 1, the opponent is the LLM from the previous iteration, denoted bypθt, which
generates responsesy
′
for those promptsxin the SFT dataset according topθt(·|x). Our method,
therefore, consists of the following two steps at iterationt+ 1:(1)training the main player, and(2)
updating the opponent player.
Training the Main Player.We begin with illustrating how we expect a main player is trained to
distinguish LLM responses from human responses. Motivated by integral probability metric (IPM)
(Müller,), we formulate our objective function such that the main player ft+1maximizes the
expected value gap between the target data distributionpdataand the opponent player’s distribution
pθt
:
ft+1= argmax
f∈Ft
E
x∼q(·),y∼pdata(·|x),y
′
∼pθ
t
(·|x)
ˆ
f(x,y)−f(x,y
′
)
˜
, (4.1)
whereFtis a sequence of highly expressive function classes that we will determine in later deduction.
The subscripttinFtis due to that the function class is dependent onpθt. Given such aft+1and
a response sequenceyto the promptx, the value offt+1(x,y)reflects the main player’s degree
of belief thatyoriginates frompdatarather thanpθt. Ideally, the main playerft+1should yield a
high value wheny∼pdata(·|x)and a low value wheny
′
∼pθt
(·|x), wherepθtis the opponent’s
distribution. Instead of solving(4.1), we can also solve the following more general optimization
problem,
ft+1= argmin
f∈Ft
E
x∼q(·),y∼pdata(·|x),y
′
∼pθ
t
(·|x)
ˆ
ℓ
`
f(x,y)−f(x,y
′
)
´˜
, (4.2)
6
whereℓ(·)is a loss function that is both monotonically decreasing and convex. For example, a linear
loss functionℓ(t) =−treduces(4.2)to the minimization version of(4.1). However, the use of a
linear loss function results in an unbounded objective value, which, during continuous training, leads
to a negative infinite value off(x,y
′
)on the opponent player’s responses. Therefore, in our work,
we choose the logistic loss functionℓ(t) :=log(1 +exp(−t))for its non-negativity, smoothness, and
exponentially decaying tail ast→ ∞. Such a choice of loss function aids in preventing the excessive
growth in the absolute value off. It is worth noting that the objective function defined in(4.2)with
a linear loss reduces to a similar IPM framework as in Wasserstein Generative Adversarial Networks
(WGAN) (Arjovsky et al.,). However, our approach differs in both the choice of the function
classFtand the training procedure.
Updating the Opponent Player. Previously we have discussed the training offt+1given the
opponent player’s distributionpθt. Now suppose we have optimized our main playerft+1that can
distinguishpdatafrompθt, within a certain function classFt, we elaborate how we get parameterθt+1
of the opponent player. Specifically, when presented with two responsesyandy
′
to the same prompt
x,ft+1assesses the valuesft+1(x,y)andft+1(x,y
′
). It then infers that the response with the higher
value is from the real data distributionpdataand the response with lower value is attributed to the
LLMpθt. Subsequently, the objective of the opponent player is to find a better LLM that generates
responses indistinguishable frompdatafor the main player. This is achieved by maximizing the
expected valueE
x∼q(·),y∼p(·|x) [ft+1(x,y)]. In addition, to prevent excessive deviation ofpθt+1from
pθtand stabilize the self-play, we incorporate a Kullback-Leibler (KL) regularization term. Putting
these together gives rise to the following optimization problem:
argmax
p
E
x∼q(·),y∼p(·|x)[ft+1(x,y)]−λE
x∼q(·)KL
`
p(·|x)||pθt
(·|x)
´
, (4.3)
whereλ >0is the regularization parameter. Notably, (4.3) has a closed-form solution bp(·|x):
bp(y|x)∝pθt
(y|x) exp
`
λ
−1
ft+1(x,y)
´
. (4.4)
It is worth noting thatbp(·|x)is not guaranteed to be belong to the LLM space{pθ(·|x)|θ∈Θ}.
Since we hope that the closed-form solutionbpin the probability space can be realized by an LLM
with parameterθ, i.e.,pθ(y|x) =bp(y|x), solving forpθ(y|x)∝pθt (y|x)exp
`
λ
−1
ft+1 (x,y)
´
gives
ft+1(x,y) =λ·log
pθ(·|x)
pθ
t
(·|x)
. This suggests the following function classFtforft+1:
Ft=
ȷ
λ·log
pθ(y|x)
pθt
(y|x)
θ∈Θ
ff
, (4.5)
whereΘis the parameter space of LLMs being considered. Given the choice ofFtin(4.5),
optimizing (4.2) gives ft+1parameterized byθt+1in the following form:
ft+1(x,y) =λ·log
pθt+1
(y|x)
pθt
(y|x)
. (4.6)
Substituting(4.6)into(4.4)yieldsbp(y|x) =pθt+1(y|x). In other words,θt+1learned from(4.2)is
exactly the LLM parameter for our ideal opponent selection.
7
Algorithm 1Self-Play Fine-Tuning (SPIN)
Input:{(xi,yi)}
i∈[N]: SFT Dataset,pθ0
: LLM with parameterθ0,T: Number of iterations.
fort= 0, . . . , T−1do
fori= 1, . . . Ndo
Generate synthetic datay
′
i
∼pθt
(·|xi).
end for
Updateθt+1= argmin
θ∈Θ
P
i∈[N]
ℓ
“
λlog
pθ(yi|xi)
pθ
t
(yi|xi)
−λlog
pθ(y
′
i
|xi)
pθ
t
(y
′
i
|xi)
”
.
end for
Output:θT.
End-to-end Training Objective.We integrate the previously discussed two steps into a single
end-to-end training objective with an update rule ofθt+1. Specifically, plugging(4.5)into(4.2)
arrives at the update ruleθt+1=argmin
θ∈ΘLSPIN (θ,θt), whereLSPINis the training objective
defined as follows
LSPIN(θ,θt) =E
x∼q(·),y∼pdata(·|x),y
′
∼pθ
t
(·|x)
»
ℓ
„
λlog
pθ(y|x)
pθt
(y|x)
−λlog
pθ(y
′
|x)
pθt
(y
′
|x)
«–
.(4.7)
We summarize the iterative self-play process of our methodSPINas follows,
. . .→ pθt
(·|x)
|{z}
Opponent Player att
→λ·log
pθt+1
(·|x)
pθt
(·|x)
| {z }
Main Player att+ 1
→ pθt+1
(·|x)
|{z}
Opponent Player att+ 1
→. . .
Namely, the opponent player chosen from the previous iterationtis employed to train the main
player at iterationt+ 1, resulting in the LLM parameterized byθt+1. Then we determine the next
opponent player at iterationt+ 1by directly copying the LLM parameterθt+1, which is then used
in training the main player at iterationt+ 2. The detailed algorithm is presented in Algorithm.
Remark 4.1.(4.7)bears resemblance to direct preference optimization (DPO) (Rafailov et al.,
2023) for RL fine-tuning. However,SPINexhibits significant distinctions with DPO. Specifically,SPIN
is applied to supervised fine-tuning (SFT) and relies solely on the SFT dataset, represented by pairs
(x,y). In sharp contrast, DPO is designed for RL fine-tuning and necessitates a preference dataset,
represented by(x,yw,yl), whereywandyldenote the winner (chosen) and loser (rejected) responses,
respectively. DPO demands that,at the instance level,ywis superior toyl. In comparison, our
method requires that,at the distribution level, the targetpdatashould be distinguishable from
the weak LLMpθbefore it becomes a strong one. In terms of algorithm design, DPO implements a
single-iteration approach, while our method facilitates an iterative self-play strategy, as outlined in
Algorithm.
5
In this section, we provide a theoretical analysis for Algorithm. Under monotonicity
and convexity assumption of the objective functionℓ, we show that the global optimum is obtained
if and only if parameterθtgenerates data distribution. We summarize our assumptions as follows:
8
Assumption 5.1.The loss functionℓ(t) :R→R is monotonically decreasing, i.e.,∀t, ℓ
′(t)≤0and
satisfiesℓ
′
(0)<0. In addition,ℓ(t)is a convex function.
Assumption
including correlation lossℓ(t) = 1−t, hinge lossℓ(t) =max(0,1−t), exponential lossℓ(t) =exp(−t)
and logistic lossℓ(t) =log(1 +exp(−t)). Under Assumptions, we present the following theorem,
which is pivotal in understanding the optimization dynamics of our method.
Theorem 5.2.Under Assumption, suppose there exists pθ(·|x) =pdata(·|x), then we have that
•(Sufficiency) Ifpθt
(·|x) =pdata(·|x), thenθtis the global minimum of (4.7) for any λ≥0.
•(Necessity) Ifpθt
(·|x)̸=pdata(·|x), there exists an appropriately chosenλ, such thatθtis not
the global minimum of (4.7).
Remark 5.3.Theorem
method naturally stops at the pointpθ(·|x) =pdata(·|x), implying the effectiveness of our approach
in aligning the LLM’s distribution with the target data distribution. Moreover, Theorem
indicates that the optimization process only stops when the global optimality is achieved, i.e., the
LLM’s distribution aligns with the target data distribution.
For the logistic loss functionℓ(t) =log(1 +exp(−t)), the following theorem gives a more precise
characterization of the opponent player, enabling a better understanding ofSPIN.
Theorem 5.4.Consider the choice of logistic lossℓ(t) =log(1 +exp(−t))inSPIN. Suppose that
pθt(y|x)
`
pdata
(y|x)/pθt(y|x)
´
1/λ
lies in the LLM space{pθ(y|x)|θ∈Θ}andθt+1is global minimum
ofLSPIN(θ,θt), then the opponent player at iterationt+ 1satisfies
pθt+1
(y|x)∝pθt
(y|x)
`
pdata(y|x)/pθt
(y|x)
´
1/λ
.
Remark 5.5.According to Theorem, the model update from pθt(y|x)topθt+1(y|x)tends to
increase the probabilitypθt+1(y|x)whenpθt(y|x)is less thanpdata(y|x), and decrease it whenpθt(y|x)
is greater thanpdata(y|x). Thus, Theorem
process naturally converges to the point wherepθ(·|x)equalspdata(·|x). The update of the opponent
player is controlled by
`
pdata
(y|x)/pθt(y|x)
´
1/λ
, which is regulated by the factor1/λ. A smaller
λresults in a larger change of the opponent player, while a largerλleads to a smaller change.
Therefore, aspθ(·|x)approachespdata(·|x), increasingλenhances the stability of LLM training. This
observation aligns with(4.3), whereλis the regularization parameter of the KL regularization that
is employed to control the deviation of the opponent player.
6
This section provides a detailed empirical analysis ofSPIN. Our findings highlight several key points:
(1)SPINmarkedly enhances model performance across a wide range of evaluation benchmarks by
breaking the limit of SFT;(2)even without introducing new human annotated data,SPINat iteration
0achieves performance on par to DPO training that utilizes even more data;(3)iterative training is
a necessary component inSPINas it breaks the limit of multi-epoch training.
9
6.1
Model and Datasets.In this study, we adoptzephyr-7b-sft-fullas our base model. This
model derives from the pre-trained Mistral-7B (Jiang et al.,
on the SFT dataset Ultrachat200k
1
by HuggingFace. Ultrachat200k represents a high-quality 200k
subset of the larger UltraChat (Ding et al.,) corpus, which comprises approximately 1.4M
dialogues produced using OpenAI’s Turbo APIs. From UltraChat200k, We randomly sample50k
prompts and use the base model to generate the synthetic responses. We subsequently follow the
optimization method described in Section
the synthetic data from the most recent iteration and add to the newly generated synthetic data,
therefore resulting in a synthetic dataset size of50k at iteration0and100k at iteration1,2and3.
At each iteration, we train our model for2epochs.
Evaluation.We employed the widely used Huggingface Open LLM Leaderboard (Beeching
et al.,) as our evaluation benchmark, using the same Language Model Evaluation Harness
library (Gao et al.,). This leaderboard encompasses 6 different datasets, each focusing on a a
specific capability of LLMs. Collectively, these datasets provide a thorough assessment framework,
evaluating LLMs on commonsense reasoning (Arc (Clark et al.,), HellaSwag (Zellers et al.,
2019), Winogrande (Sakaguchi et al.,)), multi-task language understanding (MMLU(Hendrycks
et al.,)), human falsehood mimic (TruthfulQA (Lin et al.,)) and math problem solving
(GSM8k (Cobbe et al.,)). In evaluation, the language models are prompted with few-shot
in-context examples and the question. We follow the standard approach and report the average score
across all datasets. In Table, we detail the evaluation setting adopted by both the leaderboard and
our experiments. We leave further implementation details to Appendix.
Table 1: Detailed information of HuggingFace Open LLM Leaderboard. For each evaluation dataset,
we present the number of few-shot examples and metric adopted for evaluation.
Datasets Arc TruthfulQA Winogrande GSM8k HellaSwag MMLU
# few-shot 25 0 5 5 10 5
Metricacc_norm mc2 acc acc acc_norm acc
6.2SPINEffectively Improves Benchmark Performance
We demonstrate the effectiveness ofSPINusing HuggingFace Open LLM Leaderboard as a wide
range of evaluation. In Table, we compare the performance of our fine-tuned model by SPINafter
iterations 0 to 3 with the base modelzephyr-7b-sft-full. We can observe thatSPINexhibits
remarkable effectiveness in improving the model’s performance by further leveraging the SFT dataset,
on which the base model has already been fully fine-tuned. At iteration0, where model responses are
generated fromzephyr-7b-sft-full, we observe an overall improvement of2.66%on the average
score. The improvement is particularly significant on the TruthfulQA and GSM8k benchmarks, with
improvement exceeding5%and10%respectively. At iteration1, we employ the LLM model from
iteration0to generate new responses forSPIN, adhering to the procedure outlined in Algorithm.
This iteration yields further enhancements of1.32%on average, and especially significant on the Arc
Challenge and TruthfulQA benchmarks. Subsequent iterations continue this trend of incremental
improvement across various tasks. Meanwhile, the improvement at iterationt+ 1is naturally smaller
1
https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
10
SFT SPIN
iter-0
SPIN
iter-1
SPIN
iter-2
SPIN
iter-3
58
59
60
61
62
63
Average Accuracy (%)
58.14
60.80
62.12
62.97
63.16
HuggingFace Open LLM Benchmark Figure 2: The average score ofSPINat different iterations on the HuggingFace Open LLM leaderboard
datasets. For “SFT”, we report the performance of our base modelzephyr-7b-sft-full, which has
been fine-tuned on the same dataset we use to generate synthetic data.
than that at iterationt. As the iterative training progresses, the degree of improvement gradually
approaches zero, suggesting that the model has reached a limiting point in the last iteration.
Table 2: Test performance ofSPINbased onzephyr-7b-sft-fullacross HuggingFace Open LLM
Leaderboard datasets. We also denote the average improvement over last iteration in the Average
column.
Model Arc TruthfulQA Winogrande GSM8k HellaSwag MMLU Average
zephyr-7b-sft-full60.41 43.73 74.19 26.76 82.85 60.92 58.14
SPINiteration0 63.40 49.18 72.69 35.10 84.38 60.03 60.80
(+2.66)
SPINiteration1 65.19 55.17 72.30 35.78 84.96 59.34 62.12
(+1.32)
SPINiteration2 65.96 54.91 73.56 38.06 85.41 59.93 62.97
(+0.85)
SPINiteration3 65.87 54.90 73.72 38.97 85.54 59.99 63.16
(+0.19)
Comparison with DPO. zephyr-7b-betais a model derived fromzephyr-7b-sft-full, trained
with DPO on approximately62k preference data. This data, the UltraFeedback Binarized dataset
2
,
comprises both chosen and rejected completions evaluated by GPT-4. We note that, DPO requires
either human input or advanced language model feedback to determine the preference, making
data generation a rather expensive procedure. In contrast, ourSPINonly requires the initial model
itself. Moreover, unlike DPO which requires new data source, our method exclusively leverages the
existing SFT dataset. In Figure, we show the performance comparison of SPINat iterations 0 and
1 (employing50k SFT data) with DPO training, from the same SFT checkpoint. We can observe
that, while DPO leverages more data from new sources,SPINbased on the existing SFT data can
already achieve comparable average performance to DPO training at iteration 0. From iteration 1,
SPINeven surpasses the performance of DPO on the leaderboard benchmark.
2
https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized
11
Arc TruthfulQAWinogrande GSM8k Hellaswag MMLU Average
0
10
20
30
40
50
60
70
80
Scores
Zephyr-SFT
Zephyr-DPO
SPIN-iter-0
SPIN-iter-1
SPIN-iter-2
SPIN-iter-3 Figure 3: Performance comparison with DPO training across the six benchmark datasets. Self-play
at iteration0achieves comparable performance to DPO training with62k new data. At iteration1,
self-play has already surpassed DPO training on the majority of datasets.
6.3
In this subsection, we examine the effect of synthetic dataset size and training epochs within an
iteration. Our analysis demonstrates the effectiveness of the synthetic data used bySPINcompared to
the SFT data, as well as the necessity of iterative training inSPIN. Furthermore, to comprehensively
assess the performance improvements ofSPIN, we perform additional evaluations on benchmark
tasks distinct from those in the Open LLM leaderboard.
Training Size.We investigate the effect of varying training data size on the performance ofSPIN.
In Figure, we demonstrate the effect of training size forSPINduring iteration0and additionally
compare with SFT with the full original dataset. Specifically, for the SFT baseline, we fully fine-tune
Mistral-7B on Ultrachat200k for three epochs and report first epoch performance as the starting
point (with x-axis 0) in the figure for SFT. ForSPIN, we report thezephyr-7b-sft-fullcheckpoint
as the starting point, which has also been fine-tuned on Ultrachat200k for one epoch. We select
the training size ofSPINat iteration 0 to be 14k, 26k, and 50k and generate the data accordingly,
ensuring that the larger dataset encompasses the smaller dataset. The performance ofSPINwas
then evaluated after 1 epoch of self-play fine-tuning for each training size. We can observe that,
whileSPINresults in notable improvement with increasing training sizes, SFT on further epochs 2
and 3 fails to yield more than1%improvement. Lastly, in Table, we also show the performance of
SFT fromzephyr-7b-sft-fullon Ultrachat200k for one epoch. While self-play fine-tuning with
synthetic data fromzephyr-7b-sft-fulleffectively improves its performance, simply fine-tuning it
again on the SFT data leads to degraded performance, as similarly observed in Figure.
Table 3: Test performance ofzephyr-7b-sft-fullfine-tuned on Ultrachat200k for 1 more epoch
across HuggingFace Open LLM benchmark datasets. SFT fails to further leverage the fine-tuning
data for performance enhancement and even results in degraded performance.
Model Arc TruthfulQA Winogrande GSM8k HellaSwag MMLU Average
zephyr-7b-sft-full60.41 43.73 74.19 26.76 82.85 60.92 58.14
SFT epoch 1 57.76 44.39 75.77 25.85 81.69 57.8957.23
12
0
14k26k50k
200k 400k
Training Size
58.0
58.5
59.0
59.5
60.0
60.5
61.0
Average Score 59.04
59.82
59.27
58.14
59.55
60.16
60.83
Performance Comparison
SFT
SPIN Figure 4: The scaling effect of training size ofSPINcompared to SFT on the average score of
Open LLM Leaderboard. ForSPIN, we consider training data of sizes14k,26k and50k where
the larger dataset contains the smaller dataset. The starting point forSPIN(with x-axis 0) is the
zephyr-7b-sft-fullcheckpoint, which has been fine-tuned on Ultrachat200k for 1 epoch. We
report the model performance trained for 1 epoch withSPINon the varying sizes of dataset. We
additionally compare with SFT, where we fine-tune Mistral-7B on Ultrachat200k for 3 consecutive
epochs and report the model performance at the first epoch as the starting point (with x-axis 0).
Iterative Training v.s. Training for More Epochs.We further study the training within
iteration0and compare with the performance achieved in iteration1, particularly contrasting the
test performance obtained from extended training duration with that from next iteration. Figure
depicts the performance trajectory of the model trained usingSPINover multiple epochs at iteration
0. It is evident that the most substantial improvement occurs during the first two epochs, followed by
only modest gains in subsequent epochs. Notably,SPINexhibits robustness and stability; extending
the training duration does not diminish performance but rather maintains a rather consistent level.
Nevertheless, the observation suggests an inherent limitation to the performance achievable within
a single iteration, thereby underscoring the necessity for iterative training. As shown by the test
performance achieved at iteration 1 in the figures, extending the training in iteration 0 fails to reach
the performance comparable to iteration 1.
Further Investigation on More Tasks.Here, we further investigate the performance ofSPIN
on a broader variety of tasks, including MT-Bench (Zheng et al.,), Big-Bench (bench authors,
2023) and OpenBookQA (Mihaylov et al.,) in addition to the Open LLM Leaderboard tasks.
Specifically, we use the following tasks from Big-Bench-Hard for a more comprehensive evaluation,
including Causal Judgment (causal reasoning), Sports Understanding (commonsense reasoning)
and Formal Fallacies (logical reasoning). In Table, we show the resulting scores of SPINon
MT-Bench as well as those tasks from Big-Bench. In Figure, we detail the model performances on
MT-Bench with regard to different types of questions. We can see a notably robust improvement
in the performance ofSPINon various tasks besides the HuggingFace Benchmark, without major
degradation. Notably, on MT-Bench, the model fine-tuned bySPINhas surpassed the performance
ofvicuna-13b-v1.5(Chiang et al.,) with a score of 6.57.
13
0 1 2 3 4 5
Epoch
61
62
63
64
65
Accuracy (%)
iter 0
iter 1 (epoch 2) (a) Arc Challenge accuracy.0 1 2 3 4 5
Epoch
44
46
48
50
52
54
Accuracy (%)
iter 0
iter 1 (epoch 2) (b) TruthfulQA score.0 1 2 3 4 5
Epoch
58
59
60
61
62
Average Accuracy (%)
iter 0
iter 1 (epoch 2) (c) Average score.
Figure 5: TheSPINtraining dynamics ofzephyr-7b-sft-fullon the 50k synthetic data with
regard to the number of training epochs during iteration 0. We can observe that iterative training is
pivotal as training for more epochs during iteration 0 reaches a limit and cannot surpass iteration 1.
Table 4: Test performance on other reasoning benchmark datasets forSPINat different iterations
andzephyr-7b-sft-full. We report the average score for MT-Bench and the accuracy score for
Big Bench datasets under standard few-shot CoT evaluation. On OpenBookQA, we reportacc_norm
with 1-shot example as used in2023). As similar to Open LLM Leaderboard evaluation,
we observe a steady improvement in performance on the other benchmark tasks, with no significant
degradation.
Model MT-BenchBB-causal BB-formal BB-sports OpenBookQA
zephyr-7b-sft-full 5.94 56.15 49.6 96.0 45.4
SPINiteration0 6.46
(+0.52) 57.75 51.6 95.2 46.8
SPINiteration1 6.65
(+0.19) 58.82 51.2 95.2 47.2
SPINiteration2 6.78
(+0.13) 59.36 51.2 94.4 47.6
7
This paper introduces a novel fine-tuning method, SPIN, to convert a weak LLM to a strong LLM by unleashing the full power of human-annotated data. Central to this method is a self-play mechanism, wherein a main player (the LLM) is fine-tuned to differentiate the responses of the opponent player (the LLM from the previous iteration) from the target data distribution, and the LLM is iteratively aligned with the target data distribution. Therefore, SPIN facilitates the LLM's iterative self-evaluation and enhancement through self-play. In comparison to supervised fine-tuning and RL fine-tuning methods, SPIN enables the LLM to self-improve without additional human data or feedback from stronger LLMs. Empirical results demonstrate that SPIN significantly enhances LLM performance across diverse benchmarks, even outperforming models trained with additional human data or AI feedback.
Limitation and Future Work. Our theoretical results demonstrate that the optimization process of SPIN converges if and only if the LLM's distribution aligns with pdata. Therefore, our study focuses on a fixed target data distribution generated by humans, which inherently imposes a ceiling on the performance of the fine-tuned LLM. Exploring a dynamically changing target data distribution is an important direction to overcome this limitation and elevate the LLM's performance beyond
this ceiling or even to a super-human level. Moreover, considering the resource demands of synthetic data generation, another promising avenue for further exploration is to reduce the volume of required synthetic data.

[Figure 6 radar chart: MT-Bench question categories (Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities), scored from 0 to 9; legend: SFT, SPIN iter-0, SPIN iter-1, SPIN iter-2.]

Figure 6: Model performance on MT-Bench. We compare SPIN across different iterations with the base SFT model. Starting from iteration 1, our fine-tuned model by SPIN robustly outperforms the SFT checkpoint on all evaluation aspects.
A Experiment Details
A.1 Hyperparameters and Setup
We use the Alignment Handbook library (Tunstall et al., 2023b) as the codebase for our self-play fine-tuning method SPIN, which includes DeepSpeed ZeRO-3 (Rajbhandari et al., 2020) and FlashAttention-2 (Dao, 2023) to reduce training cost. We train our models with the RMSProp (Hinton et al., 2012) optimizer with no weight decay for all iterations, as commonly used in fine-tuning LLMs for alignment, with a global batch size of 64, 10% warmup steps and bfloat16 precision. We set the peak learning rate to 5e-7 for iterations 0 and 1, and decay this peak learning rate to 1e-7 for iterations 2 and 3 as we approach the end of self-play fine-tuning. Lastly, we choose β = 0.1 and a max sequence length of 2048 tokens as in Tunstall et al. (2023b). We note that at the last iteration (iter-3), where the model is close to convergence, we increase the value of β to 5.0. We use the Accelerate library (Gugger et al., 2022) to generate our synthetic data using distributed inference with multiple GPUs with a global batch size of 64. We consider the prompting template "### Instruction: {prompt}\n\n### Response: " as commonly used in Taori et al. (2023). For Ultrachat200k, which contains multi-round conversations, we only sample the first round as our prompt and ground-truth completion pairs.
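For concreteness, the following minimal sketch shows one way the prompt template above could be applied and how a first-round prompt/completion pair could be extracted from a multi-round UltraChat-style record. The field names ("messages", "role", "content") follow common chat-dataset conventions and are assumptions, not the authors' exact preprocessing code.

```python
# Illustrative sketch only: prompt formatting and first-round extraction for a
# multi-turn chat record. Field names are assumed, not taken from the paper.

PROMPT_TEMPLATE = "### Instruction: {prompt}\n\n### Response: "

def format_prompt(prompt: str) -> str:
    """Wrap a raw instruction in the generation template."""
    return PROMPT_TEMPLATE.format(prompt=prompt)

def first_round_pair(example: dict) -> tuple[str, str]:
    """Keep only the first user/assistant exchange of a multi-round dialog."""
    messages = example["messages"]
    user_turn = next(m["content"] for m in messages if m["role"] == "user")
    assistant_turn = next(m["content"] for m in messages if m["role"] == "assistant")
    return format_prompt(user_turn), assistant_turn

if __name__ == "__main__":
    demo = {"messages": [
        {"role": "user", "content": "Summarize the SPIN method in one sentence."},
        {"role": "assistant", "content": "SPIN fine-tunes an LLM by self-play."},
        {"role": "user", "content": "Thanks!"},
    ]}
    prompt, completion = first_round_pair(demo)
    print(prompt)
    print(completion)
```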
A.2 Generation Examples
In Tables 5 and 6, we further provide generation examples of our fine-tuned model by SPIN from different iterations. We can observe an improvement in response quality as compared to the generation of the SFT checkpoint. Meanwhile, the model generations at higher iterations typically become more concise than those at iteration 0 and resemble the ground-truth completion better.
B Proofs of the Theorems
B.1 Proof of the Optimality Condition
Proof of Theorem. To begin with, we prove the "Sufficiency" part of the theorem. Since pdata(·|x) = pθt(·|x), by the symmetry property of y and y′, we have for any θ ∈ Θ that

\begin{align*}
2 L_{\mathrm{SPIN}}(\theta,\theta_t)
&= \mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\ell\bigg(\gamma\log\frac{p_{\theta}(y|x)}{p_{\theta_t}(y|x)}-\gamma\log\frac{p_{\theta}(y'|x)}{p_{\theta_t}(y'|x)}\bigg)\bigg]\\
&\quad+\mathbb{E}_{x\sim q(\cdot),\,y'\sim p_{\mathrm{data}}(\cdot|x),\,y\sim p_{\theta_t}(\cdot|x)}\bigg[\ell\bigg(\gamma\log\frac{p_{\theta}(y|x)}{p_{\theta_t}(y|x)}-\gamma\log\frac{p_{\theta}(y'|x)}{p_{\theta_t}(y'|x)}\bigg)\bigg]\\
&= \mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\ell\bigg(\gamma\log\frac{p_{\theta}(y|x)}{p_{\theta_t}(y|x)}-\gamma\log\frac{p_{\theta}(y'|x)}{p_{\theta_t}(y'|x)}\bigg)+\ell\bigg(\gamma\log\frac{p_{\theta}(y'|x)}{p_{\theta_t}(y'|x)}-\gamma\log\frac{p_{\theta}(y|x)}{p_{\theta_t}(y|x)}\bigg)\bigg]\\
&\ge 2\,\mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\ell\bigg(\frac{\gamma}{2}\log\frac{p_{\theta}(y|x)}{p_{\theta_t}(y|x)}-\frac{\gamma}{2}\log\frac{p_{\theta}(y'|x)}{p_{\theta_t}(y'|x)}+\frac{\gamma}{2}\log\frac{p_{\theta}(y'|x)}{p_{\theta_t}(y'|x)}-\frac{\gamma}{2}\log\frac{p_{\theta}(y|x)}{p_{\theta_t}(y|x)}\bigg)\bigg]\\
&= 2\ell(0),
\end{align*}

where the inequality is due to Jensen's inequality (recalling that ℓ is convex by assumption). Therefore, we have that LSPIN(θ,θt) ≥ ℓ(0) = LSPIN(θt,θt), which means that θt is the global optimum of (4.7). As a consequence, the gradient at the point θt is zero, which concludes θt+1 = θt.
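As a quick numerical sanity check of the Jensen step above (an illustrative script, not part of the original proof), the convex logistic loss ℓ(t) = log(1 + exp(−t)) indeed satisfies ℓ(u) + ℓ(−u) ≥ 2ℓ(0) for arbitrary arguments, which is exactly the inequality applied inside the expectation:

```python
import math
import random

def logistic_loss(t: float) -> float:
    """Logistic loss l(t) = log(1 + exp(-t))."""
    return math.log1p(math.exp(-t))

# Convexity of l gives l(u) + l(-u) >= 2 * l((u + (-u)) / 2) = 2 * l(0);
# check the inequality on random arguments.
random.seed(0)
for _ in range(10_000):
    u = random.uniform(-20.0, 20.0)
    assert logistic_loss(u) + logistic_loss(-u) >= 2 * logistic_loss(0.0) - 1e-12
print("Jensen-type inequality holds on all sampled points.")
```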
Next, we prove the "Necessity". Define g(λ) as follows:

\begin{align*}
g(\lambda) = \mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\ell\bigg(\lambda\log\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}-\lambda\log\frac{p_{\mathrm{data}}(y'|x)}{p_{\theta_t}(y'|x)}\bigg)\bigg].
\end{align*}

Then we have g(0) = ℓ(0) and

\begin{align*}
g'(0) &= \mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\ell'(0)\bigg(\log\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}-\log\frac{p_{\mathrm{data}}(y'|x)}{p_{\theta_t}(y'|x)}\bigg)\bigg]\\
&= \ell'(0)\bigg(\mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x)}\bigg[\log\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}\bigg]-\mathbb{E}_{x\sim q(\cdot),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\log\frac{p_{\mathrm{data}}(y'|x)}{p_{\theta_t}(y'|x)}\bigg]\bigg)\\
&= \ell'(0)\Big[\mathrm{KL}\big(p_{\mathrm{data}}(\cdot|x)\,\big\|\,p_{\theta_t}(\cdot|x)\big)+\mathrm{KL}\big(p_{\theta_t}(\cdot|x)\,\big\|\,p_{\mathrm{data}}(\cdot|x)\big)\Big]\\
&< 0,
\end{align*}

where the last inequality is due to the condition that ℓ′(0) < 0, together with the fact that pdata(·|x) ≠ pθt(·|x), which makes the sum of the KL terms strictly positive. Therefore, there exists a λ0 such that for all 0 < λ < λ0 we have g(λ) < ℓ(0). Choose θ∗ such that pθ∗(y|x) = pdata(y|x). For those 0 < λ < λ0, we have that

\begin{align*}
L_{\mathrm{SPIN}}(\theta^{*},\theta_t)
&= \mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\theta^{*}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\ell\bigg(\lambda\log\frac{p_{\theta^{*}}(y|x)}{p_{\theta_t}(y|x)}-\lambda\log\frac{p_{\theta^{*}}(y'|x)}{p_{\theta_t}(y'|x)}\bigg)\bigg]\\
&= \mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\bigg[\ell\bigg(\lambda\log\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}-\lambda\log\frac{p_{\mathrm{data}}(y'|x)}{p_{\theta_t}(y'|x)}\bigg)\bigg]\\
&= g(\lambda) < g(0) = L_{\mathrm{SPIN}}(\theta_t,\theta_t),
\end{align*}

where the second equality holds by the choice of pθ∗(·|x), and the inequality holds due to the choice of λ. Therefore, we conclude that θt is not the global optimum of (4.7) if pθt(·|x) ≠ pdata(·|x).
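The strict decrease of g near zero can also be checked numerically. The toy example below (a single prompt with three possible responses; purely illustrative and not from the paper) evaluates g(λ) for the logistic loss and compares the finite-difference slope at zero against ℓ′(0)[KL(pdata‖pθt) + KL(pθt‖pdata)], where ℓ′(0) = −1/2 for the logistic loss:

```python
import math

# Toy discrete check of the necessity argument (single prompt, so q(x) = 1).
p_data = [0.6, 0.3, 0.1]   # target distribution p_data(.|x)
p_t    = [0.2, 0.3, 0.5]   # current model distribution p_{theta_t}(.|x)

def logistic_loss(t):
    return math.log1p(math.exp(-t))

def g(lam):
    """g(lambda) = E_{y~p_data, y'~p_t}[ l(lam*log-ratio(y) - lam*log-ratio(y')) ]."""
    total = 0.0
    for y, py in enumerate(p_data):
        for yp, qyp in enumerate(p_t):
            t = lam * math.log(p_data[y] / p_t[y]) - lam * math.log(p_data[yp] / p_t[yp])
            total += py * qyp * logistic_loss(t)
    return total

kl_pq = sum(a * math.log(a / b) for a, b in zip(p_data, p_t))
kl_qp = sum(b * math.log(b / a) for a, b in zip(p_data, p_t))

print(f"g(0)      = {g(0.0):.6f} (= log 2 = {math.log(2):.6f})")
print(f"g(0.1)    = {g(0.1):.6f} (< g(0), as the proof predicts)")
# Finite-difference slope at 0 vs the closed form l'(0) * (KL(p||q) + KL(q||p)).
eps = 1e-5
print(f"slope     = {(g(eps) - g(0.0)) / eps:.6f}")
print(f"predicted = {-0.5 * (kl_pq + kl_qp):.6f}")
```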
B.2 Proof of the Closed-Form Update
We need the following auxiliary lemma before we prove the theorem.
Lemma B.1. Suppose that ℓ(t) = log(1 + exp(−t)). Then for a, b > 0, the following inequality holds:

\begin{align*}
a\ell(t) + b\ell(-t) \ge a\log(1 + b/a) + b\log(1 + a/b),
\end{align*}

where the equality holds if and only if t = log(a/b).
Proof of Lemma B.1. Define g(t) = aℓ(t) + bℓ(−t) = a log(1 + exp(−t)) + b log(1 + exp(t)). Then we have

\begin{align*}
g'(t) = -\frac{a\exp(-t)}{1+\exp(-t)} + \frac{b\exp(t)}{1+\exp(t)} = \frac{-a + b\exp(t)}{1+\exp(t)}.
\end{align*}

Therefore, g′(t) < 0 when t < log(a/b) and g′(t) > 0 when t > log(a/b), which indicates that g achieves its minimum at t = log(a/b). This concludes the proof.
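The following short script (illustrative only, not from the paper) numerically confirms Lemma B.1 for the logistic loss: for random positive a and b, the bound a log(1 + b/a) + b log(1 + a/b) is attained exactly at t = log(a/b) and is never violated at other points:

```python
import math
import random

def logistic_loss(t: float) -> float:
    return math.log1p(math.exp(-t))

def lhs(a: float, b: float, t: float) -> float:
    return a * logistic_loss(t) + b * logistic_loss(-t)

def rhs(a: float, b: float) -> float:
    return a * math.log(1 + b / a) + b * math.log(1 + a / b)

random.seed(0)
for _ in range(1_000):
    a, b = random.uniform(0.1, 10.0), random.uniform(0.1, 10.0)
    t_star = math.log(a / b)
    # Equality at the minimizer t* = log(a/b) ...
    assert abs(lhs(a, b, t_star) - rhs(a, b)) < 1e-9
    # ... and the inequality a*l(t) + b*l(-t) >= rhs(a, b) elsewhere.
    for t in (-3.0, -0.5, 0.0, 0.7, 4.2):
        assert lhs(a, b, t) >= rhs(a, b) - 1e-12
print("Lemma B.1 verified numerically on random (a, b).")
```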
Lemma B.1 shows that the minimum of aℓ(t) + bℓ(−t) is achieved when t = log(a/b). Based on Lemma B.1, we can further prove that (4.2) with the logistic loss function has a closed-form solution if we ignore the constraint set Ft.
Lemma B.2. Denote p+(y, y′, x) = q(x)·pdata(y|x)·pθt(y′|x) and p−(y, y′, x) = q(x)·pdata(y′|x)·pθt(y|x). Then

\begin{align*}
\mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\big[\ell\big(f(x,y)-f(x,y')\big)\big] \ge \log 2 - \mathrm{JSD}(p_+\,\|\,p_-),
\end{align*}

where JSD(p+ ‖ p−) represents the Jensen–Shannon divergence, which is defined as follows:

\begin{align*}
\mathrm{JSD}(p\,\|\,q) = \frac{1}{2}\,\mathrm{KL}\Big(p\,\Big\|\,\frac{p+q}{2}\Big) + \frac{1}{2}\,\mathrm{KL}\Big(q\,\Big\|\,\frac{p+q}{2}\Big),
\end{align*}

where KL(·‖·) is the KL-divergence. JSD is always non-negative and equals zero if and only if p+ and p− are identical. Moreover, the global minimum value log 2 − JSD(p+ ‖ p−) is achieved by f∗ if and only if

\begin{align*}
f^{*}(x,y) = Z(x) + \log\Big(\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}\Big),
\end{align*}

where Z(x) is any function that is possibly dependent on x.
Proof of Lemma B.2. We rewrite the objective function in the following form:

\begin{align*}
&2\,\mathbb{E}_{x\sim q(\cdot),\,y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\big[\ell\big(f(x,y)-f(x,y')\big)\big]\\
&= \int q(x)\,p_{\mathrm{data}}(y|x)\,p_{\theta_t}(y'|x)\,\ell\big(f(x,y)-f(x,y')\big)\,\mathrm{d}x\,\mathrm{d}y\,\mathrm{d}y'
 + \int q(x)\,p_{\mathrm{data}}(y'|x)\,p_{\theta_t}(y|x)\,\ell\big(f(x,y')-f(x,y)\big)\,\mathrm{d}x\,\mathrm{d}y\,\mathrm{d}y'\\
&= \int q(x)\,p_{\mathrm{data}}(y|x)\,p_{\theta_t}(y'|x)\,\ell\big(f(x,y)-f(x,y')\big)
 + q(x)\,p_{\mathrm{data}}(y'|x)\,p_{\theta_t}(y|x)\,\ell\big(f(x,y')-f(x,y)\big)\,\mathrm{d}x\,\mathrm{d}y\,\mathrm{d}y'\\
&\overset{(i)}{\ge} \int q(x)\,p_{\mathrm{data}}(y|x)\,p_{\theta_t}(y'|x)\log\Big(1+\frac{p_{\mathrm{data}}(y'|x)\,p_{\theta_t}(y|x)}{p_{\mathrm{data}}(y|x)\,p_{\theta_t}(y'|x)}\Big)
 + q(x)\,p_{\mathrm{data}}(y'|x)\,p_{\theta_t}(y|x)\log\Big(1+\frac{p_{\mathrm{data}}(y|x)\,p_{\theta_t}(y'|x)}{p_{\mathrm{data}}(y'|x)\,p_{\theta_t}(y|x)}\Big)\,\mathrm{d}x\,\mathrm{d}y\,\mathrm{d}y',
\end{align*}

where the inequality is due to aℓ(t) + bℓ(−t) ≥ a log(1 + b/a) + b log(1 + a/b) in Lemma B.1 with a = q(x)pdata(y|x)pθt(y′|x), b = q(x)pdata(y′|x)pθt(y|x), and t = f(x, y) − f(x, y′). The equality (i) holds if and only if the following equation holds almost surely for any x, y, y′:

\begin{align*}
f(x,y) - f(x,y') = \log\Big(\frac{p_{\mathrm{data}}(y|x)\,p_{\theta_t}(y'|x)}{p_{\mathrm{data}}(y'|x)\,p_{\theta_t}(y|x)}\Big). \tag{B.1}
\end{align*}

(B.1) is equivalent to

\begin{align*}
f(x,y) - \log\Big(\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}\Big) = f(x,y') - \log\Big(\frac{p_{\mathrm{data}}(y'|x)}{p_{\theta_t}(y'|x)}\Big)
\end{align*}

holding almost surely for any x, y, y′. Therefore, the equality (i) holds if and only if there exists some Z(x) such that

\begin{align*}
f(x,y) = Z(x) + \log\Big(\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}\Big).
\end{align*}

Recall that p+(y, y′, x) = q(x)·pdata(y|x)·pθt(y′|x) and p−(y, y′, x) = q(x)·pdata(y′|x)·pθt(y|x). Then the right-hand side of (i) can be written as

\begin{align*}
&\int q(x)\,p_{\mathrm{data}}(y|x)\,p_{\theta_t}(y'|x)\log\Big(1+\frac{p_{\mathrm{data}}(y'|x)\,p_{\theta_t}(y|x)}{p_{\mathrm{data}}(y|x)\,p_{\theta_t}(y'|x)}\Big)
 + q(x)\,p_{\mathrm{data}}(y'|x)\,p_{\theta_t}(y|x)\log\Big(1+\frac{p_{\mathrm{data}}(y|x)\,p_{\theta_t}(y'|x)}{p_{\mathrm{data}}(y'|x)\,p_{\theta_t}(y|x)}\Big)\,\mathrm{d}x\,\mathrm{d}y\,\mathrm{d}y'\\
&= \int p_+(y,y',x)\log\Big(1+\frac{p_-(y,y',x)}{p_+(y,y',x)}\Big) + p_-(y,y',x)\log\Big(1+\frac{p_+(y,y',x)}{p_-(y,y',x)}\Big)\,\mathrm{d}x\,\mathrm{d}y\,\mathrm{d}y'\\
&= 2\log 2 + \int p_+(y,y',x)\log\Big(\frac{\tfrac{1}{2}\big[p_-(y,y',x)+p_+(y,y',x)\big]}{p_+(y,y',x)}\Big) + p_-(y,y',x)\log\Big(\frac{\tfrac{1}{2}\big[p_-(y,y',x)+p_+(y,y',x)\big]}{p_-(y,y',x)}\Big)\,\mathrm{d}x\,\mathrm{d}y\,\mathrm{d}y'\\
&= 2\log 2 - \mathrm{KL}\Big(p_+\,\Big\|\,\frac{p_++p_-}{2}\Big) - \mathrm{KL}\Big(p_-\,\Big\|\,\frac{p_++p_-}{2}\Big)\\
&= 2\log 2 - 2\cdot\mathrm{JSD}(p_+\,\|\,p_-),
\end{align*}

where the last equality is by the definition of JSD. This concludes the proof.
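The identity established above can be checked numerically on a small discrete example. In the sketch below (illustrative only; a single prompt, so q(x) = 1), plugging f∗(x, y) = Z(x) + log(pdata(y|x)/pθt(y|x)) into the expected logistic loss reproduces log 2 − JSD(p+ ‖ p−):

```python
import math

# Toy discrete check of Lemma B.2 (single prompt, so q(x) = 1): with
# f*(y) = Z + log(p_data(y)/p_t(y)), the expected logistic loss equals
# log 2 - JSD(p_+ || p_-). Purely illustrative, not from the paper.
p_data = [0.5, 0.3, 0.2]
p_t    = [0.25, 0.25, 0.5]
Z = 1.7  # any constant shift Z(x) leaves the loss unchanged

def logistic_loss(t):
    return math.log1p(math.exp(-t))

def f_star(y):
    return Z + math.log(p_data[y] / p_t[y])

# Left-hand side: E_{y ~ p_data, y' ~ p_t}[ l(f*(y) - f*(y')) ].
lhs = sum(
    p_data[y] * p_t[yp] * logistic_loss(f_star(y) - f_star(yp))
    for y in range(3) for yp in range(3)
)

# Right-hand side: log 2 - JSD(p_+ || p_-) with
# p_+(y, y') = p_data(y) * p_t(y') and p_-(y, y') = p_data(y') * p_t(y).
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_plus  = [p_data[y] * p_t[yp] for y in range(3) for yp in range(3)]
p_minus = [p_data[yp] * p_t[y] for y in range(3) for yp in range(3)]
mix = [(a + b) / 2 for a, b in zip(p_plus, p_minus)]
jsd = 0.5 * kl(p_plus, mix) + 0.5 * kl(p_minus, mix)

print(f"E[loss at f*]       = {lhs:.6f}")
print(f"log 2 - JSD(p+||p-) = {math.log(2) - jsd:.6f}")
```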
Lemma B.2 provides the closed-form solution to (4.2) if we ignore the constraint set Ft. If this closed-form solution belongs to Ft, then it should also be the solution to (4.2). This observation is the key to the proof of the theorem.
Proof of Theorem. Under the condition of the theorem, there exists a pθ such that

\begin{align*}
p_{\theta}(y|x) \propto p_{\theta_t}(y|x)\Big(p_{\mathrm{data}}(y|x)/p_{\theta_t}(y|x)\Big)^{1/\lambda}.
\end{align*}

Therefore, there exists a function \widehat{Z}(x) such that

\begin{align*}
p_{\theta}(y|x) = \widehat{Z}(x)\cdot p_{\theta_t}(y|x)\Big(p_{\mathrm{data}}(y|x)/p_{\theta_t}(y|x)\Big)^{1/\lambda}. \tag{B.2}
\end{align*}

Applying the logarithm function to both sides of (B.2) yields

\begin{align*}
\lambda\log\big(\widehat{Z}(x)\big) + \log\Big(\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}\Big) = \lambda\log\Big(\frac{p_{\theta}(y|x)}{p_{\theta_t}(y|x)}\Big) \in \mathcal{F}_t.
\end{align*}

By Lemma B.2, f∗(x, y) = λ log(Ẑ(x)) + log(pdata(y|x)/pθt(y|x)) is the global minimum of the following minimization problem:

\begin{align*}
\mathop{\mathrm{argmin}}_{f}\ \mathbb{E}_{y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\big[\ell\big(f(x,y)-f(x,y')\big)\big]. \tag{B.3}
\end{align*}

Since f∗ ∈ Ft, f∗(x, y) = λ log(Ẑ(x)) + log(pdata(y|x)/pθt(y|x)) is also the global optimum of the optimization problem (4.2),

\begin{align*}
\mathop{\mathrm{argmin}}_{f\in\mathcal{F}_t}\ \mathbb{E}_{y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\big[\ell\big(f(x,y)-f(x,y')\big)\big].
\end{align*}

Therefore, we have proved that

\begin{align*}
\min_{f}\ \mathbb{E}_{y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\big[\ell\big(f(x,y)-f(x,y')\big)\big]
= \min_{f\in\mathcal{F}_t}\ \mathbb{E}_{y\sim p_{\mathrm{data}}(\cdot|x),\,y'\sim p_{\theta_t}(\cdot|x)}\big[\ell\big(f(x,y)-f(x,y')\big)\big]
= \min_{\theta\in\Theta}\ L_{\mathrm{SPIN}}(\theta,\theta_t). \tag{B.4}
\end{align*}

Since θt+1 is the global minimum of LSPIN(θ, θt), by (B.4), λ log(pθt+1(y|x)/pθt(y|x)) should be the global minimum of problem (B.3). By Lemma B.2, there exists Z(x) such that

\begin{align*}
\lambda\log\Big(\frac{p_{\theta_{t+1}}(y|x)}{p_{\theta_t}(y|x)}\Big) = Z(x) + \log\Big(\frac{p_{\mathrm{data}}(y|x)}{p_{\theta_t}(y|x)}\Big),
\end{align*}

which leads to the result that pθt+1(y|x) ∝ pθt(y|x)(pdata(y|x)/pθt(y|x))^{1/λ}.
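As an illustration of the resulting update rule (not the authors' code), the snippet below applies pθt+1(y|x) ∝ pθt(y|x)(pdata(y|x)/pθt(y|x))^{1/λ} to a toy discrete distribution and shows the total-variation distance to pdata shrinking across iterations:

```python
import math

# Illustrative check of the update p_{t+1}(y) ∝ p_t(y) * (p_data(y)/p_t(y))^{1/λ}
# on a toy discrete distribution (single prompt). For λ >= 1 the update is a
# geometric interpolation toward p_data in log space, so iterating it drives
# p_t to p_data; with λ = 1 it reaches p_data in one step.
p_data = [0.5, 0.3, 0.2]
p_t    = [0.1, 0.2, 0.7]
lam    = 2.0

def spin_update(p_t, p_data, lam):
    unnormalized = [pt * (pd / pt) ** (1.0 / lam) for pt, pd in zip(p_t, p_data)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

for t in range(6):
    print(f"iter {t}: TV(p_t, p_data) = {total_variation(p_t, p_data):.6f}")
    p_t = spin_update(p_t, p_data, lam)
```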
References
Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z. et al. (2023). Palm 2 technical report. arXiv preprint arXiv:2305.10403.
Anthony, T., Tian, Z. and Barber, D. (2017). Thinking fast and slow with deep learning and tree search. Advances in Neural Information Processing Systems 30.
Arjovsky, M., Chintala, S. and Bottou, L. (2017). Wasserstein generative adversarial networks. In International Conference on Machine Learning. PMLR.
Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q. et al. (2021). Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T. et al. (2022a). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C. et al. (2022b). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
Bansal, T., Pachocki, J., Sidor, S., Sutskever, I. and Mordatch, I. (2018). Emergent complexity via multi-agent competition. In International Conference on Learning Representations.
Beeching, E., Fourrier, C., Habib, N., Han, S., Lambert, N., Rajani, N., Sanseviero, O., Tunstall, L. and Wolf, T. (2023). Open LLM leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.
bench authors, B. (2023). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.
Bengio, Y., Louradour, J., Collobert, R. and Weston, J. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems 33 1877–1901.
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S. et al. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.
Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., Aschenbrenner, L., Chen, Y., Ecoffet, A., Joglekar, M., Leike, J. et al. (2023). Weak-to-strong generalization: Eliciting strong capabilities with weak supervision.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G. et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I. and Xing, E. P. (2023). Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S. and Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems 30.
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S. et al. (2022). Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
Cirik, V., Hovy, E. and Morency, L.-P. (2016). Visualizing and understanding curriculum learning for long short-term memory networks. arXiv preprint arXiv:1611.06204.
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C. and Tafjord, O. (2018). Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R. et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
Dao, T. (2023). FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
Deng, Y., Zhang, W., Chen, Z. and Gu, Q. (2023). Rephrase and respond: Let large language models ask better questions for themselves. arXiv preprint arXiv:2311.04205.
Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., Liu, Z., Sun, M. and Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
Frei, S., Zou, D., Chen, Z. and Gu, Q. (2022). Self-training converts weak learners to strong learners in mixture models. In International Conference on Artificial Intelligence and Statistics. PMLR.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation 121 256–285.
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 119–139.
Gao, L., Schulman, J. and Hilton, J. (2023a). Scaling laws for reward model overoptimization. In International Conference on Machine Learning. PMLR.
Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac'h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K. and Zou, A. (2023b). A framework for few-shot language model evaluation.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems 27.
Grandvalet, Y. and Bengio, Y. (2004). Semi-supervised learning by entropy minimization. Advances in Neural Information Processing Systems 17.
Gugger, S., Debut, L., Wolf, T., Schmid, P., Mueller, Z., Mangrulkar, S., Sun, M. and Bossan, B. (2022). Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J. (2020). Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
Hernandez-Leal, P., Kartal, B. and Taylor, M. E. (2018). Is multiagent deep reinforcement learning the answer or the question? A brief survey. Learning 21 22.
Hinton, G., Srivastava, N. and Swersky, K. (2012). Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent. Cited on 142.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L. et al. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825.
Josifoski, M., Sakota, M., Peyrard, M. and West, R. (2023). Exploiting asymmetry for synthetic training data generation: SynthIE and the case of information extraction. arXiv preprint arXiv:2303.04132.
Kearns, M. and Valiant, L. (1994). Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM (JACM) 41 67–95.
Kou, Y., Chen, Z., Cao, Y. and Gu, Q. (2022). How does semi-supervised learning with pseudo-labelers work? A case study. In The Eleventh International Conference on Learning Representations.
Kumar, M., Packer, B. and Koller, D. (2010). Self-paced learning for latent variable models. Advances in Neural Information Processing Systems 23.
Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Pérolat, J., Silver, D. and Graepel, T. (2017). A unified game-theoretic approach to multiagent reinforcement learning. Advances in Neural Information Processing Systems 30.
Lee, D.-H. (2013). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Challenges in Representation Learning Workshop.
Lee, Y. J. and Grauman, K. (2011). Learning the easy things first: Self-paced visual category discovery. In CVPR 2011. IEEE.
Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T. et al. (2022). Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems 35 3843–3857.
Li, Y., Bubeck, S., Eldan, R., Giorno, A. D., Gunasekar, S. and Lee, Y. T. (2023). Textbooks are all you need II: phi-1.5 technical report.
Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A. et al. (2022). Competition-level code generation with AlphaCode. Science 378 1092–1097.
Lin, S., Hilton, J. and Evans, O. (2021). TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
Liu, B., Bubeck, S., Eldan, R., Kulkarni, J., Li, Y., Nguyen, A., Ward, R. and Zhang, Y. (2023). TinyGSM: Achieving >80% on GSM8k with small language models. arXiv preprint arXiv:2312.09241.
Liu, C., He, S., Liu, K., Zhao, J. et al. (2018). Curriculum learning for natural answer generation. In IJCAI.
Liu, F., Ge, S. and Wu, X. (2021). Competence-based multimodal curriculum learning for medical report generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S. and Zhang, D. (2023). WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct. arXiv preprint arXiv:2308.09583.
Mihaylov, T., Clark, P., Khot, T. and Sabharwal, A. (2018). Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
Mishra, S., Khashabi, D., Baral, C. and Hajishirzi, H. (2021). Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773.
Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability 29 429–443.
Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E. et al. (2019). A generalized training approach for multiagent learning. arXiv preprint arXiv:1909.12823.
OpenAI (2023). GPT-4 technical report.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 27730–27744.
Prasad, A., Stengel-Eskin, E. and Bansal, M. (2023). Rephrase, augment, reason: Visual grounding of questions for vision-language models. arXiv preprint arXiv:2310.05861.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I. et al. (2019). Language models are unsupervised multitask learners. OpenAI blog 1 9.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D. and Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
Rajbhandari, S., Rasley, J., Ruwase, O. and He, Y. (2020). ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE.
Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J. et al. (2023). Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
Sakaguchi, K., Bras, R. L., Bhagavatula, C. and Choi, Y. (2021). WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM 64 99–106.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development 3 210–229.
Samuel, A. L. (2000). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development 44 206–226.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning 5 197–227.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T. et al. (2017a). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. et al. (2017b). Mastering the game of Go without human knowledge. Nature 550 354–359.
Singh, A., Co-Reyes, J. D., Agarwal, R., Anand, A., Patil, P., Liu, P. J., Harrison, J., Lee, J., Xu, K., Parisi, A. et al. (2023). Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585.
Soviany, P., Ionescu, R. T., Rota, P. and Sebe, N. (2022). Curriculum learning: A survey. International Journal of Computer Vision 130 1526–1565.
Spitkovsky, V. I., Alshawi, H. and Jurafsky, D. (2009). Baby steps: How "less is more" in unsupervised dependency parsing. In NIPS 2009 Workshop on Grammar Induction, Representation of Language and Language Learning.
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D. and Christiano, P. F. (2020). Learning to summarize with human feedback. Advances in Neural Information Processing Systems 33 3008–3021.
Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P. and Hashimoto, T. B. (2023). Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
Tesauro, G. et al. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM 38 58–68.
Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y. et al. (2022). LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S. et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N. et al. (2023a). Zephyr: Direct distillation of LM alignment. arXiv preprint arXiv:2310.16944.
Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rush, A. M. and Wolf, T. (2023b). The alignment handbook. https://github.com/huggingface/alignment-handbook.
Vapnik, V. (1999). The Nature of Statistical Learning Theory. Springer Science & Business Media.
Victor, S., Albert, W., Colin, R., Stephen, B., Lintang, S., Zaid, A., Antoine, C., Arnaud, S., Arun, R., Manan, D. et al. (2022). Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.
Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W., Dudzik, A., Huang, A., Georgiev, P., Powell, R., Ewalds, T., Horgan, D., Kroiss, M., Danihelka, I., Agapiou, J., Oh, J., Dalibard, V., Choi, D., Sifre, L., Sulsky, Y., Vezhnevets, S., Molloy, J., Cai, T., Budden, D., Paine, T., Gulcehre, C., Wang, Z., Pfaff, T., Pohlen, T., Yogatama, D., Cohen, J., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Apps, C., Kavukcuoglu, K., Hassabis, D. and Silver, D. (2019). AlphaStar: Mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D. et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 24824–24837.
Wu, J., Liang, Y., Akbari, H., Wang, Z., Yu, C. et al. (2022). Scaling multimodal pre-training via cross-modality gradient harmonization. Advances in Neural Information Processing Systems 35 36161–36173.
Yang, Y., Singh, A. K., Elhoushi, M., Mahmoud, A., Tirumala, K., Gloeckle, F., Rozière, B., Wu, C.-J., Morcos, A. S. and Ardalani, N. (2023). Decoding data quality via synthetic corruptions: Embedding-guided pruning of code data. arXiv preprint arXiv:2312.02418.
Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok, J. T., Li, Z., Weller, A. and Liu, W. (2023). MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.
Yuan, Z., Yuan, H., Li, C., Dong, G., Tan, C. and Zhou, C. (2023). Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825.
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A. and Choi, Y. (2019). HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
Zhang, D., Meng, D., Li, C., Jiang, L., Zhao, Q. and Han, J. (2015). A self-paced multiple-instance learning framework for co-saliency detection. In Proceedings of the IEEE International Conference on Computer Vision.
Zhang, X., Kumar, G., Khayrallah, H., Murray, K., Gwinnup, J., Martindale, M. J., McNamee, P., Duh, K. and Carpuat, M. (2018). An empirical exploration of curriculum learning for neural machine translation. arXiv preprint arXiv:1811.00739.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.
Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P. and Irving, G. (2019). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
Table 5: Generation example of our fine-tuned model by SPIN at different iterations.
Prompt How does the economic growth of Horsham compare to nearby towns?
Ground truth I do not have access to the latest economic data on horsham and nearby
towns. However, you can check the latest economic statistics and reports
provided by the local government or the official websites of horsham and
nearby towns for a detailed comparison of their economic growth.
zephyr-7b-sft-full According to the given material, Horsham was growing "significantly"
faster than nearby towns, indicating that it was economically thriving.
However, the given material does not specify the exact rates or levels
of economic growth of Horsham compared to nearby towns. Without
this information, it is challenging to determine whether the differences in
economic growth between Horsham and nearby towns were significant or
minimal.
Iteration 0 I do not have access to the latest economic data for horsham and nearby
towns. However, horsham is known to be a prosperous town in west sussex,
uk, with a diverse economy that includes agriculture, manufacturing, and
services. The town has a strong retail sector, and it is also home to
several large companies, including the global pharmaceutical company
novartis. In terms of comparison with nearby towns, horsham is generally
considered to be one of the more affluent towns in the region. However,
the economic growth of horsham and its neighboring towns may vary
depending on various factors such as industry, investment, and government
policies.
Iteration 1 I do not have access to real-time data or information about the most recent
economic growth of horsesham and nearby towns. However, according
to the recent reports from various sources such as the office for national
statistics, horsesham’s economy primarily relies on retail, healthcare,
education, and public administration industries. Compared to nearby
towns such as lewes, chichester, and bognor regis, horsesham’s economy
may have a different composition and growth rate based on their specific
industries and economic policies.
Iteration 2 I don’t have access to the most recent economic data. However, according
to a 2019 report by pwc, horsham’s expected economic growth up to 2025
was projected to be around 2.3% per annum, which is slightly below the
average for the south east of england (2.8%). Neighbouring towns like
crawley and bognor regis were projected to experience similar growth
rates. note: this may not reflect the impact of the covid-19 pandemic
that occurred in 2020.
Table 6: Another generation example of our fine-tuned model by SPIN at different iterations.
Prompt Can you clarify if Poppelsdorfer Allee 45 and the old site of the Mathe-
matical Institute are the same location? Answer according to: Go straight
to Poppelsdorfer Allee 45 and sign-in. There is a big lecture hall at the
old site of the Mathematical Institute (Wegelerstr. 10, Bonn).
Ground truth No, Poppelsdorfer Allee 45 and the old site of the Mathematical Institute
(Wegelerstr. 10, Bonn) are not the same location. To sign-in, one should
go straight to Poppelsdorfer Allee 45. However, there is a big lecture hall
at the old site of the Mathematical Institute.
zephyr-7b-sft-full Yes, Poppelsdorfer Allee 45 and the old site of the Mathematical Institute
are the same location.
Iteration 0 Yes, Poppelsdorfer Allee 45 and the old site of the Mathematical Institute
(Wegelerstr. 10, Bonn) are the same location. The Mathematical Institute
has been relocated and renamed as the Institute for Mathematics and
Theoretical Physics, but the building where it used to be is still referred
to as the old site. The big lecture hall at the old site still exists and may
be used for events or classes.
Iteration 1 No, Poppelsdorfer Allee 45 and the old site of the Mathematical Institute
are not the same location. While Poppelsdorfer Allee 45 is where attendees
should sign-in, the old site of the Mathematical Institute (located at
Wegelerstr. 10, Bonn) has a large lecture hall mentioned in the statement.
Therefore, it can be assumed that the two locations are separate and not
the same.
Iteration 2 Poppelsdorfer Allee 45 and the old site of the Mathematical Institute
(Wegelerstr. 10, Bonn) are two different locations. As per the instruction,
you need to go straight to Poppelsdorfer Allee 45 and sign-in. However,
there is a lecture hall mentioned at the old site of the Mathematical
Institute (Wegelerstr. 10, Bonn).