KTO: Model Alignment as Prospect Theoretic Optimization
Kawin Ethayarajh¹, Winnie Xu², Niklas Muennighoff², Dan Jurafsky¹, Douwe Kiela¹ ²
¹Stanford University  ²Contextual AI. Correspondence to: Kawin Ethayarajh <kawin@stanford.edu>.
Abstract
Kahneman & Tversky's prospect theory tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that
objectives for aligning LLMs with human feed-
back implicitly incorporate many of these biases—
the success of these objectives (e.g., DPO) over
cross-entropy minimization can partly be ascribed
to them being human-aware loss functions (HALOs). However, the utility functions these meth-
ods attribute to humans still differ from those in
the prospect theory literature. Using a Kahneman-
Tversky model of human utility, we propose a
HALO that directly maximizes the utility of gen-
erations instead of maximizing the log-likelihood
of preferences, as current methods do. We call
this approach Kahneman-Tversky Optimization
(KTO), and it matches or exceeds the performance
of preference-based methods at scales from 1B to
30B. Crucially, KTO does not need preferences—
only a binary signal of whether an output is desir-
able or undesirable for a given input. This makes
it far easier to use in the real world, where prefer-
ence data is scarce and expensive.
1. Introduction
Aligning generative models with human feedback has been
successfully used to make generations more helpful, factual,
and ethical, among other desiderata (Ouyang et al., 2022; Tian et al., 2023). For LLMs, alignment methods such
as RLHF and DPO have consistently proven to be more
beneficial than doing supervised finetuning (SFT) alone.
However, human feedback is often discussed only in the context of preferences (e.g., output A ≻ B for input x), despite preferences being a kind of data that is relatively scarce and expensive to collect in the real world (Casper et al., 2023). This is largely because the alignment methods shown to work best—RLHF (Christiano et al., 2017) and
the mathematically equivalent DPO (Rafailov et al., 2023)—
take preference data as input.
To understand why these alignment methods work so well,
and whether feedback needs to be in the form of prefer-
ences, we frame them through the lens of prospect theory (Kahneman & Tversky, 1979; Tversky & Kahneman, 1992). Prospect theory explains why humans make deci-
sions about uncertain events that do not maximize expected
value. It formalizes how humans perceive random variables
in a biased but well-defined manner; for example, relative to
some reference point, humans are more sensitive to losses
than gains, a property called loss aversion. We show that
popular alignment methods such as PPO (Schulman et al.,
2017), DPO (Rafailov et al., 2023), and SLiC (Zhao et al.,
2023) implicitly model such biases, helping explain their
success independently of the data used. For this reason, we
call them human-aware loss functions (HALOs).
Although it is impossible to say that HALOs are categori-
cally better than non-HALOs, we find that among existing
methods, those that meet the definition of a HALO work
better than those that do not. We find that DPO perfor-
mance can even be matched at most scales by running an
offline PPO variant on dummy +1/-1 rewards, suggesting
that preference data might not be needed if the inductive
bias in the loss function is good enough. However, despite
the surprising success of this simple baseline, it significantly
lags behind DPO at the 30B model scale and suffers from
hyperparameter sensitivity, making it difficult to use.
Taking a more principled approach, we derive a HALO using
the model of human utility that Kahneman & Tversky empir-
ically derived to describe how humans make decisions about
uncertain monetary outcomes (Tversky & Kahneman, 1992).
This approach, which we call Kahneman-Tversky Optimiza-
tion (KTO), directly maximizes the utility of generations
instead of maximizing the log-likelihood of preferences, as
most current methods do. KTO only requires a binary signal
of whether an output is desirable or undesirable for a given
input. This data is much more abundant, cheaper, and faster
to collect in the real world than preferences, making it easier
to scale alignment in production environments and rapidly
iterate on models.
In our experiments, we find that:
Figure 1.
The traditional pipeline for LLM alignment starts with supervised finetuning, followed by fitting the LLM to paired preference
data using a method such as RLHF or DPO. However, the paired preferences that existing approaches need are hard to get. Kahneman-
Tversky Optimization (KTO) only needs to know whether a given output is (un)desirable for the input, giving it access to a source of data
that is much more abundant, cheaper, and faster to collect in the real world.
•
KTO matches or exceeds DPO performance at scales
from 1B to 30B parameters.¹ That is, taking a preference dataset of n DPO pairs and breaking it up into 2n examples for KTO can yield better generations,
despite the model ostensibly learning from a weaker
signal. We provide some theoretical explanations for
this phenomenon (§4.3).
•
KTO can handle extreme data imbalances, matching
DPO performance while using up to 90% fewer desir-
able examples (i.e., examples of good generations). Its
success thus cannot be ascribed to the alignment data
being sourced from a preference dataset.
•
When the pretrained model is sufficiently good, one
can skip supervised finetuning and go straight to KTO
without a loss in generation quality. In contrast, we
find that without doing SFT first, DPO-aligned models
are significantly worse at all scales.
The fact that KTO can match and sometimes even outper-
form DPO is surprising, given that it learns from a weaker
signal. We conclude by discussing some theoretical expla-
nations for this phenomenon.
2. Background
Feedback-aligned LLMs are traditionally trained in three
stages (Ouyang et al., 2022):
Pretraining  Given a large corpus, train the model to predict the next token conditioned on the preceding text using the cross-entropy loss. Let π denote the pretrained model.
¹Our code is available online.
Supervised Finetuning  Finetune the model to predict the next token on data that is more relevant to the downstream task. Often, such data will comprise instructions and an appropriate response (i.e., instruction finetuning). Let π_ref denote the finetuned model.
RLHF  Given a dataset D of preferences (x, y_w, y_l)—where x is an input, y_w, y_l are the preferred and dispreferred outputs (i.e., y_w ≻ y_l for x), and r* is the "true" reward function underlying the preferences—it is first assumed that the probability that y_w is preferred to y_l can be captured with a specific function class, typically a Bradley-Terry model (Bradley & Terry, 1952). Where σ is the logistic function:

p^*(y_w \succ y_l \mid x) = \sigma(r^*(x, y_w) - r^*(x, y_l)) \quad (1)

Since getting the true reward from a human would be intractably expensive, a reward model r_φ learns to serve as a proxy, done by minimizing the negative log-likelihood of the human preference data:

\mathcal{L}_R(r_\phi) = \mathbb{E}_{x, y_w, y_l \sim D}[-\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))]
But solely maximizing the reward might come at the ex-
pense of desiderata such as generating grammatical text. To
avoid this, a KL divergence penalty is introduced to restrict
how far the language model can drift from π_ref. Where π_θ is the model we are optimizing, the optimal model π* is that which maximizes

\mathbb{E}_{x \in D,\, y \in \pi_\theta}[r_\phi(x, y)] - \beta\, D_{\mathrm{KL}}(\pi_\theta(y \mid x) \,\|\, \pi_\mathrm{ref}(y \mid x)) \quad (2)

where β > 0 is a hyperparameter. Since this objective is not differentiable, we need to use an RL algorithm like PPO (Schulman et al., 2017).
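To make the reward-modeling step concrete, the following is a minimal sketch of the Bradley-Terry negative log-likelihood above, assuming the reward model has already produced a scalar reward per sequence; the function and tensor names are illustrative rather than taken from any particular codebase.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of preferences under a Bradley-Terry model.

    r_w, r_l: rewards the reward model assigns to the preferred and
    dispreferred outputs for each input in the batch (shape: [batch]).
    """
    # -log sigma(r(x, y_w) - r(x, y_l)), averaged over the batch
    return -F.logsigmoid(r_w - r_l).mean()
```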
However, RLHF is often slow (largely because of having
to sample generations) and quite unstable in practice (espe-
cially in a distributed setting). For this reason, recent work
has focused on designing closed-form losses that maximize
the margin between the preferred and dispreferred gener-
ations, such as Sequence-Likelihood Calibration (SLiC) (Zhao et al., 2023) and Direct Preference Optimization (DPO) (Rafailov et al., 2023). The latter has become popular due to its mathematical equivalence with RLHF:

\mathcal{L}_\mathrm{DPO}(\pi_\theta, \pi_\mathrm{ref}) = \mathbb{E}\left[ -\log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\mathrm{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\mathrm{ref}(y_l \mid x)} \right) \right] \quad (3)
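A minimal sketch of the DPO loss in (3), assuming the summed log-probabilities of each sequence under the policy and reference models have already been computed; the argument names are illustrative, not the authors' API.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss (Eq. 3), given summed log-probs of the preferred (w) and
    dispreferred (l) sequences under pi_theta and pi_ref (shape: [batch])."""
    # implicit rewards: beta * log(pi_theta / pi_ref)
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # -log sigma(reward margin), averaged over the batch
    return -F.logsigmoid(reward_w - reward_l).mean()
```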
3. A Prospect Theoretic View of Alignment
Kahneman & Tversky's prospect theory explains why, faced with an uncertain event, humans make decisions that do not maximize the expected value (1992). For example, because humans are loss-averse, given a gamble that returns $100 with 80% probability and $0 with 20% probability, a person might accept $60 to avoid the gamble, despite their certainty equivalent of $60 being less than the expected value of $80.
3.1. Prospect Theory
In prospect theory, human utility depends on a value function and a weighting function:²

Definition 3.1. A value function v : z → R maps an outcome z, relative to some reference point z_ref, to its perceived (or subjective) value. For example, these functions capture the fact that humans tend to be more sensitive to relative losses than relative gains of the same magnitude.

Definition 3.2. A weighting function w is the derivative of a capacity function that maps cumulative probabilities to perceived cumulative probabilities. These functions capture, for example, the fact that humans tend to overestimate the chance of rare events. Let w_z denote the weight placed on outcome z.

Definition 3.3. The utility of a random variable Z is a function of its outcomes:

u(Z) \triangleq \sum_{z \in Z} w_z\, v(z - z_\mathrm{ref})
However, because humans do not see the full probability
distribution of an LLM, weighting functions are not salient
to this discussion; we will focus only on value functions. Us-
ing experiments that presented real humans with monetary gambles and asked for their certainty equivalent, Tversky & Kahneman (1992) proposed the following functional form for human value:

v(z, z_\mathrm{ref}; \lambda, \alpha) = \begin{cases} (z - z_\mathrm{ref})^\alpha & \text{if } z > z_\mathrm{ref} \\ -\lambda (z_\mathrm{ref} - z)^\alpha & \text{if } z < z_\mathrm{ref} \end{cases} \quad (4)
²Cumulative prospect theory is the full name of the expanded theory we discuss here (Tversky & Kahneman, 1992).
Figure 2.
The utility that a human gets from the outcome of a
random variable, as imputed by the value function implicit in
HALOs. Notice that the imputed functions share properties such
as loss aversion with the human value functions that Kahneman &
Tversky empirically derived (1992).
where the median values of the hyperparameters are α = 0.88 and λ = 2.25 across individuals. α controls how quickly utility changes and λ controls the degree of loss aversion. While the shape of the median Kahneman-Tversky value function is illustrated in Figure 2, it should be noted that it varies across individuals (Tversky & Kahneman, 1992). There are also other functional forms for the value function that have been proposed in later work (Gurevich et al., 2009). The salient qualities of a value function are: the existence of a reference point that is added or subtracted to get the relative gain or loss respectively; concavity in relative gains (i.e., diminishing sensitivity away from z_ref); loss aversion (i.e., greater sensitivity to losses).
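The value function in (4) is easy to state in code; the sketch below plugs in the median parameter values reported above and is purely illustrative.

```python
def kt_value(z: float, z_ref: float, alpha: float = 0.88, lam: float = 2.25) -> float:
    """Median Kahneman-Tversky value function (Eq. 4): concave in gains,
    convex in losses, and steeper for losses (loss aversion)."""
    if z >= z_ref:
        return (z - z_ref) ** alpha
    return -lam * (z_ref - z) ** alpha

# e.g., a $10 gain feels smaller in magnitude than a $10 loss:
# kt_value(10, 0) ~= 7.6 while kt_value(-10, 0) ~= -17.1
```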
3.2. HALOs
Informally, HALOs are loss functions that model the human
biases in Tversky & Kahneman (1992). Formally,
Definition 3.4 (HALOs). Let x ∈ X denote an input and y ∈ Y an output. Then f : (x, y) → R is a human-aware loss function if there exists the following: a parameterized reward function r_θ such that ∀ (x_1, y_1), (x_2, y_2) ∈ X × Y,

r_\theta(x_1, y_1) > r_\theta(x_2, y_2) \iff (x_1, y_1) \succ_{r_\theta} (x_2, y_2)

reference point distributions Q_x(X'), Q_y(Y' | X'), a value function v_f : R → R that is monotonic non-decreasing and concave in (0, ∞), and a negative affine function t such that

f(x, y; \theta) = t(v_f(r_\theta(x, y) - \mathbb{E}_{x', y'}[r_\theta(x', y')])) \quad (5)

where x' ∼ Q_x(X') and y' ∼ Q_y(Y' | x').
Put simply, the requirement for the reward function is that it
assigns higher rewards to input-output pairs that are more
preferred under it. The reference point is the expected reward with respect to input-output pairs sampled from the distributions Q_x, Q_y. We require that the value function be concave in gains but not necessarily convex in losses—unlike the canonical Kahneman-Tversky value functions—because in the original work on prospect theory, a minority of individuals were found to be risk-averse in both the gain and loss regime (i.e., concave in both gains and losses) (Kahneman & Tversky, 1979). Note that risk-aversion is different from loss-aversion; they relate to the curvature and magnitude of the slope respectively.
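Read as a template, Definition 3.4 can be sketched in a few lines; the Monte-Carlo estimate of the reference point and the affine parameters below are illustrative choices for the sketch, not part of the definition itself.

```python
import torch

def halo_loss(reward, reference_reward_samples, value_fn, a=-1.0, b=0.0):
    """Generic HALO form (Eq. 5): t(v_f(r(x, y) - E[r(x', y')])), with t a
    negative affine function t(z) = a*z + b, a < 0.

    reward: r_theta(x, y) for each example in the batch.
    reference_reward_samples: rewards of input-output pairs drawn from the
    reference point distributions Q_x, Q_y (Monte-Carlo estimate of E[r]).
    value_fn: monotone non-decreasing function, concave on (0, inf).
    """
    z_ref = reference_reward_samples.mean()
    return (a * value_fn(reward - z_ref) + b).mean()

# DPO is one instance (see Appendix A): value_fn = logsigmoid, a = -1, and
# the reference point places all mass on the dispreferred output y_l for x.
```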
Proposition 3.5. DPO, SLiC (calibration loss only), and PPO-Clip are human-aware loss functions.

The proof is deferred to Appendix A. In Figure 2, we can see this more intuitively by plotting the value function for each loss (i.e., the implied human utility). We see that the value functions of all three losses incorporate a sense of loss aversion, although this is not needed to meet the definition of a HALO, since there are individuals and scenarios for which loss aversion does not necessarily apply. The value functions are also either concave or affine (depending on the interval), unlike the standard Kahneman-Tversky value function, which is concave in gains but convex in losses. The reference point distributions used also differ across the losses.
3.3. Does being a HALO matter?
A natural question is whether the modeling of human biases
in HALOs has practical benefits. This is difficult to answer,
since both HALOs and non-HALOs are diverse function
classes, but we attempt to do so by comparing popular non-
HALO and HALO baselines on the exact same data:
1.
CSFT: Conditional SFT is a simple alignment method where a control token is prepended to the output during training; then, at inference, the control token corresponding to desirable generations (e.g., <|good|>) is appended to the input to induce good generations (Korbak et al., 2023). This is a non-HALO loss.
2.
SLiC: SLiC with a regularization penalty (λ_reg ≠ 0) is a non-HALO loss:

\mathcal{L}_\mathrm{SLiC}(\pi_\theta, \pi_\mathrm{ref}) = \mathcal{L}_\mathrm{cal}(\pi_\theta) + \lambda_\mathrm{reg} \mathcal{L}_\mathrm{reg}(\pi_\theta)

\mathcal{L}_\mathrm{cal} = \mathbb{E}_{x, y_w, y_l \sim D}\left[ \max\left( 0, \delta - \log \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)} \right) \right]

\mathcal{L}_\mathrm{reg} = \mathbb{E}_{x \sim D,\, y \sim \pi_\mathrm{ref}(x)}[-\log \pi_\theta(y \mid x)]

Although the max-margin loss \mathcal{L}_\mathrm{cal} is a HALO on its own (Proposition 3.5), the complete loss is not, since the \mathcal{L}_\mathrm{reg} term is the standard language modeling loss.
3.
DPO: DPO, as defined in (3), is a HALO loss (Proposition 3.5).
4.
PPO (offline): The standard RLHF objective in (2) is typically optimized with PPO-Clip, which works by "clipping" how far π_θ can drift from the version π_old at the previous step (a minimal code sketch of this offline variant follows this list):

\mathcal{L}_\mathrm{PPO\,(offline)} = -\mathbb{E}_{x, y, t \sim D}[\min(q_\theta A(x, y_{<t}, y_t), \mathrm{clip}(q_\theta, 1 - \epsilon, 1 + \epsilon) A(x, y_{<t}, y_t))]

where q_θ = log(π_θ / π_old) and A(x, y_{<t}, y_t) is the per-token advantage (i.e., the surplus benefit from producing a given token in a given state).

PPO is an online algorithm—generations are sampled from the current model, judged by a reward model, and then used to update the current version. However, this process is slow (due to having to sample generations), so we choose to use offline data instead. Because RLHF is also quite unstable in a distributed setting, we never update π_old and keep it as π_ref, instead clipping less conservatively than we traditionally would (see Appendix B). Baheti et al. (2023) found that these changes, along with treating the entire output sequence as a single action, greatly improve stability. However, since RLHF has historically calculated token-level advantages, we omit the third change and only preserve the first two. The PPO-Clip loss itself is left unchanged and is therefore a HALO (Proposition 3.5). Calling this method PPO is somewhat imprecise, because it is offline and takes only one step, but to avoid introducing too many new terms, we will call this PPO (offline). Instead of using learned rewards, we simplify even further and use dummy +1/-1 rewards for y_w and y_l instead. Further details on the implementation of this method can be found in Appendix B.
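Below is a minimal sketch of this offline PPO-Clip variant, assuming per-token log-probabilities and dummy ±1 advantages broadcast over each output's tokens; it illustrates the loss above and is not the authors' implementation (which additionally uses the tricks in Appendix B, including a more liberal clipping interval).

```python
import torch

def offline_ppo_clip_loss(policy_logp_tok, ref_logp_tok, advantage_tok, eps=0.2):
    """Offline PPO-Clip sketch: pi_old is frozen at pi_ref and advantages come
    from dummy +1/-1 rewards broadcast over the tokens of each output.

    policy_logp_tok, ref_logp_tok: per-token log-probs of the output tokens
        under pi_theta and pi_ref (shape: [batch, seq_len]).
    advantage_tok: per-token advantages, e.g. +1 for every token of a desirable
        output and -1 for every token of an undesirable one.
    """
    ratio = torch.exp(policy_logp_tok - ref_logp_tok)        # pi_theta / pi_ref
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # pessimistic (clipped) surrogate objective, negated to form a loss
    surrogate = torch.minimum(ratio * advantage_tok, clipped * advantage_tok)
    return -surrogate.mean()
```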
We compare these baselines on a suite of 7 models spanning two model families, Pythia-{1.4B, 2.8B, 6.9B, 12B} (Biderman et al., 2023) and Llama-{7B, 13B, 30B} (Touvron et al., 2023). This permits us to see how LLM alignment scales within a model family (Llama-2 lacks a 30B model, hence our use of Llama). Later experiments (§4.2) are done on Mistral-7B and its derivatives (Jiang et al., 2023). The models were trained on a combination of Anthropic HH (Ganguli et al., 2022), OpenAssistant (Köpf et al., 2023), and SHP (Ethayarajh et al., 2022).
All models were aligned under identical settings on the same data (e.g., same effective batch size, same optimizer, etc.), save for hyperparameters unique to them. Similar to Rafailov et al. (2023), the target sequences for SFT are a subset of the generations used to subsequently align the model; however, for a more realistic SFT setup, we do not necessarily set the most preferred generation to be the target (with the exception of HH, since the dispreferred output in that dataset is often harmful). Then we used GPT-4-0613
Figure 3.
Among existing alignment methods, the HALOs (DPO and our offline PPO variant) generally outperform non-HALOs (SLiC
and CSFT), though the gap is only significant (p < 0.05) at 13B+ model sizes. In fact, only the HALO-aligned Llama-{13B, 30B}
models are able to match or exceed the generation quality of SFT target sequences, which are drawn directly from the alignment dataset.
It is also worth noting that up to a scale of 7B parameters, virtually all of the gains from LLM alignment come from the SFT stage.
to judge whether the aligned model’s response was bet-
ter than the SFT target for the given input with respect to
helpfulness, harmlessness, and conciseness, a now standard
practice (Zheng et al., 2023; Li et al., 2023).³ Note that while the SFT target is considered a desirable output for x, it is by no means the best output, meaning that it can be improved upon by an aligned model.
In Figure 3, we see the results of this analysis:
•
The HALOs we tested (DPO and our PPO variant) either match or outperform the non-HALOs at all scales, though the gap is only significant (p < 0.05) at 13B+ model sizes. In fact, only the HALO-aligned Llama-{13B, 30B} models match or exceed a win rate of 50% (i.e., are able to match or exceed the generation quality of the SFT targets in the test data).
•
Up to a scale of 7B parameters, alignment provides
virtually no gains over SFT alone. However, it is worth
noting that if the SFT data distribution were less sim-
ilar to the preference data, then the gains from the
alignment stage would ostensibly be greater.
•
Surprisingly, despite only using dummy +1/-1 rewards, our offline PPO variant performs as well as DPO for all models except Llama-30B. This challenges conventional wisdom, which places heavy emphasis on reward learning (Casper et al., 2023), suggesting that even the simplest rewards can prove useful when used in a loss function that has a strong inductive bias. Despite its surprising success, our offline PPO baseline still suffers from hyperparameter sensitivity and training instability,
albeit not to the same extent as traditional RLHF.
³We validate that GPT-4 judgments concur with human judgments in Appendix C.
4. Kahneman-Tversky Optimization
The surprising success of offline PPO with dummy +1/-1 re-
wards suggests that—with the right HALO—a binary signal
of good/bad generations may be sufficient to reach DPO-
level performance, even if the offline PPO approach itself
was unable to do so past a certain scale (§3.3). Taking a
more principled approach, we now derive a HALO using the
Kahneman-Tversky model of human utility, which allows
us to directly optimize for utility instead of maximizing the
log-likelihood of preferences. This Kahneman-Tversky Op-
timization (KTO) loss only needs a binary signal of whether
an output is (un)desirable for a given input, giving it access
to a source of data that is more abundant, cheaper, and faster to
collect in the real world.
4.1. Derivation
From prior work (Go et al., 2023; Peng et al., 2019; Peters & Schaal, 2007), we know that the policy that maximizes the KL-constrained RLHF objective in (2) is

\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_\mathrm{ref}(y \mid x) \exp\left(\frac{1}{\beta} r^*(x, y)\right)

where Z(x) is a partition function. Rafailov et al. (2023) rewrite this in terms of the optimal reward for an input-output pair:

r^*(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_\mathrm{ref}(y \mid x)} + \beta \log Z(x) \quad (6)

They then plug this expression into the Bradley-Terry model of preferences and take the negative logarithm of that objective to get the DPO loss (3).
Figure 4.
Kahneman-Tversky Optimization (KTO) is as good or better than DPO at all scales, both when preceded and not preceded by
supervised finetuning (SFT). In fact, for the Llama models, KTO alone matches the performance of SFT+DPO and is significantly better
than DPO alone. Error bars denote a 90% binomial confidence interval.
Instead, we plug this expression into the Kahneman-Tversky
model of human utility, with some changes to make it more
amenable to the LLM setting:
1.
The exponent in the Kahneman-Tversky value function (4) makes it difficult to optimize, so we set v_KTO to be the logistic function σ, which is also concave in gains and convex in losses. We replace the loss-aversion coefficient with two hyperparameters λ_D, λ_U that weight the losses for desirable and undesirable outputs respectively.
2.
The Kahneman-Tversky value function was derived
based on experiments with humans and monetary gam-
bles. Since LLM generations do not have a monetary reward associated with them, we set r_KTO to be the implicit reward under the RLHF objective (6).
3.
Rather than having just one dispreferred generation y_l | x as the reference point, we assume that humans judge the quality of (x, y) in relation to all input-output pairs they have seen. Thus we write the reference point to be the expected reward under the optimal policy, not just for generations following x but following any input x': \mathbb{E}_{x' \sim D,\, y' \sim \pi^*}[r^*(x', y')]. Under the assumption that the expected value of the partition function across x' is zero, this simplifies to the KL divergence between π* and π_ref scaled by β.
Combining all of these changes, we can optimize the follow-
ing loss, where the notion of an output being “desirable” or
“undesirable” corresponds to the Kahneman-Tversky notion
of a relative gain or loss.
\mathcal{L}_\mathrm{KTO}(\pi_\theta, \pi_\mathrm{ref}) = \mathbb{E}_{x, y \sim D}[w(y)(1 - v_\mathrm{KTO}(x, y; \beta))] \quad (7)

where

r_\mathrm{KTO}(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_\mathrm{ref}(y \mid x)}

z_\mathrm{ref} = \mathbb{E}_{x' \sim D}[\beta\, \mathrm{KL}(\pi_\theta(y' \mid x') \,\|\, \pi_\mathrm{ref}(y' \mid x'))]

v_\mathrm{KTO}(x, y; \beta) = \begin{cases} \sigma(r_\mathrm{KTO}(x, y) - z_\mathrm{ref}) & \text{if } y \sim y_\mathrm{desirable} \mid x \\ \sigma(z_\mathrm{ref} - r_\mathrm{KTO}(x, y)) & \text{if } y \sim y_\mathrm{undesirable} \mid x \end{cases}

w(y) = \begin{cases} \lambda_D & \text{if } y \sim y_\mathrm{desirable} \mid x \\ \lambda_U & \text{if } y \sim y_\mathrm{undesirable} \mid x \end{cases}
Intuitively, KTO works because if the model increases the
reward of a desirable example in a generic way, then the KL
penalty will also rise and no progress will be made on the
loss. This forces the model to learn exactly what makes an
output desirable, so that the reward can be increased while
keeping the KL term flat (or even decreasing it). A similar
argument works in the other direction as well, though the
non-negativity of the KL term allows faster saturation.
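A minimal sketch of the loss in (7), assuming summed sequence log-probabilities and a precomputed, detached reference point z_ref (the scaled KL estimate described under Implementation below); the argument names are illustrative, not the authors' API.

```python
import torch

def kto_loss(policy_logp, ref_logp, is_desirable, z_ref,
             beta=0.1, lam_d=1.0, lam_u=1.0):
    """KTO loss (Eq. 7). policy_logp / ref_logp are summed log-probs of y
    given x under pi_theta and pi_ref; is_desirable is a boolean tensor;
    z_ref is the (detached) scaled KL estimate used as the reference point."""
    r_kto = beta * (policy_logp - ref_logp)            # implicit reward
    v = torch.where(is_desirable,
                    torch.sigmoid(r_kto - z_ref),      # relative gain
                    torch.sigmoid(z_ref - r_kto))      # relative loss
    w = torch.where(is_desirable,
                    torch.full_like(v, lam_d),
                    torch.full_like(v, lam_u))
    return (w * (1.0 - v)).mean()
```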
Implementation  In practice, we estimate the KL term by matching inputs x' with unrelated outputs y'_U in a batch of size m and then calculating

\max\left(0, \frac{1}{m} \sum \log \frac{\pi_\theta(y'_U \mid x')}{\pi_\mathrm{ref}(y'_U \mid x')}\right)

over the entire batch. We do not back-propagate through the KL term, as it makes training much more stable. This means that the KL term purely serves to control how saturated the loss is.
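A minimal sketch of this estimate, assuming the log-probabilities have already been computed on mismatched (x', y'_U) pairs (e.g., by shifting the outputs by one position within the batch); the exact pairing scheme here is an assumption for illustration, not a prescription from the paper.

```python
import torch

def kl_reference_point(policy_logp_mismatch, ref_logp_mismatch, beta=0.1):
    """Estimate z_ref = beta * KL(pi_theta || pi_ref) from a batch of size m,
    where each log-prob is computed on an input x' paired with an unrelated
    output y'_U from the same batch."""
    kl_est = (policy_logp_mismatch - ref_logp_mismatch).mean()
    kl_est = torch.clamp(kl_est, min=0.0)      # max(0, .)
    # no gradient flows through the KL term: it only controls loss saturation
    return (beta * kl_est).detach()
```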
β has the same meaning as in DPO; the lower it is, the less we penalize π_θ for moving away from the SFT model π_ref. We find that β = 0.1 is close-to-best on most datasets. Where n_D and n_U refer to the number of desirable and undesirable examples respectively, we set λ_D, λ_U such that

\frac{\lambda_D n_D}{\lambda_U n_U} \in \left[1, \frac{4}{3}\right] \quad (8)
Figure 5.
Without doing SFT first, DPO-aligned models tend to
ramble and hallucinate entire conversations. KTO does not suffer
from this issue.
where at least one of the two should be set to 1 and the ratio is controlled by changing the other. For example, if there is a 1:1 ratio of desirable:undesirable examples, we would set λ_U = 1, λ_D ∈ [1, 1.33]. If we then discard 90% of the desirable examples and only keep 10%, then we would set λ_U = 1, λ_D ∈ [10, 13.33]. The interval [1, 4/3] was determined empirically and suggests a value function that is more gain-sensitive than loss-sensitive, in contrast to the original Kahneman-Tversky value function (4). However, the ideal interval is also task-dependent; for example, if avoiding negative outcomes were very important, then we might consider a setting of λ_U > 1 instead.
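A small helper illustrating one way to satisfy (8); the particular target ratio inside the interval is an arbitrary illustrative choice, and—as noted above—the ideal interval is task-dependent.

```python
def choose_loss_weights(n_desirable: int, n_undesirable: int, target: float = 1.15):
    """Pick (lambda_D, lambda_U) so that lambda_D * n_D / (lambda_U * n_U)
    equals `target`, a point inside the recommended interval [1, 4/3] (Eq. 8),
    fixing the weight of the majority class to 1."""
    assert 1.0 <= target <= 4.0 / 3.0
    if n_undesirable >= n_desirable:
        # desirable examples are the minority: upweight them, keep lambda_U = 1
        return target * n_undesirable / n_desirable, 1.0
    # undesirable examples are the minority: upweight them instead, lambda_D = 1
    return 1.0, n_desirable / (target * n_undesirable)

# e.g., keeping only 10% of the desirable data (n_U = 10 * n_D) gives
# lambda_D = 11.5, lambda_U = 1, which falls in the [10, 13.33] range above.
```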
Data  If the alignment data is naturally binary, every positive example can be assumed to be drawn from y_desirable | x and every negative example from y_undesirable | x. However, the canonical feedback datasets in academic research (HH, SHP, OASST) are in preference format, since the methods that have worked best up until now are preference-based. In our experiments, we converted preference data y_w ≻ y_l by assuming that y_w is drawn from the desirable distribution and y_l from the undesirable one. To enable an apples-to-apples comparison with DPO, we apply KTO on the same data for most experiments. However, to ensure that KTO can be used with non-preference data, we also subsample one output y per x for some experiments (denoted one-y-per-x).
If the data is score-based, where a higher score denotes
greater desirability, one has multiple options:
•
Assume that any output with a score above some fixed threshold τ is desirable.
•
Assume that any output with a score above the mean
or median (either across all inputs or just the input it
was conditioned on) is desirable.
•
Let desirability be a Bernoulli random variable where p(y ∼ y_desirable | x) is some function of its score (e.g., logistic). Then randomly sample to determine whether y is desirable or not (a minimal sketch of this option follows the list).
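A minimal sketch of the last option, where desirability is a Bernoulli draw whose probability is a logistic function of the (centered, scaled) score; the centering value and temperature are illustrative choices rather than prescriptions from the paper.

```python
import math
import random

def label_from_score(score: float, center: float, temperature: float = 1.0,
                     seed=None) -> bool:
    """Sample a binary desirability label from a score: p(desirable) is a
    logistic function of how far the score sits above `center` (e.g., the
    mean or median score for that input)."""
    p_desirable = 1.0 / (1.0 + math.exp(-(score - center) / temperature))
    rng = random.Random(seed)
    return rng.random() < p_desirable
```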
Figure 6.
Even after discarding 90% of the desirable examples
while keeping all of the undesirable data (leading to a 1:10 ratio
of desirable:undesirable data), a KTO-aligned Llama-7B model
still outperforms its DPO counterpart. This implies that preference
pairs do not have to be the source of KTO data.
4.2. Experiments
KTO ≥ DPO  As seen in Figure 4, SFT+KTO is competitive with SFT+DPO at model scales from 1B to 30B, despite learning from a weaker signal. KTO alone is better than DPO alone for the Llama-{7B, 13B, 30B} models, and this gap is significant (p < 0.01) at 7B and 30B even after correcting for multiple comparisons (Holm, 1979). Perhaps most surprising is the fact that a KTO-aligned Llama-{13B, 30B} model is competitive with its SFT+KTO counterpart, despite not undergoing supervised finetuning first, and is the only alignment method of the ones we tested to show this behavior. This is perhaps due to the fact that KTO keeps the average response length roughly the same as it is for the SFT model. In contrast, doing DPO without SFT first causes the average response length to increase dramatically.
KTO data need not come from preference datasets.  Might KTO be secretly benefiting from the fact that its 2n examples in the previous experiment came from n preference pairs instead of a naturally unpaired data distribution? To test this, we randomly discard increasingly large fractions of the desirable data before KTO-aligning a Llama-7B model. For example, if we discard 90% of the desirable data while leaving the undesirable data untouched, then the ratio of desirable:undesirable examples goes from 1:1 to 1:10 and the vast majority of examples no longer have a preferred output counterpart. We handle such imbalances by changing the loss weights λ_D, λ_U to satisfy the criteria in (8); when we drop 90% of the desirable data, we set λ_U = 1, λ_D = 13.33. The full results are given in Figure 6. For Llama-7B, we find that up to 90% of the desirable
data can in fact be discarded while still outperforming DPO. A similar trend holds when discarding undesirable data. For different models and datasets, the optimal settings of λ_D, λ_U differ.

Table 1. In aligning Mistral-7B on the OpenAssistant dataset, we find that using KTO with only one output per input still outperforms DPO, despite this restriction reducing the amount of training data by 72%. A 90% confidence interval is given.

Method                          | Winrate vs. SFT Target
Mistral-7B (unaligned)          | 0.525 ± 0.037
Mistral-7B + DPO                | 0.600 ± 0.037
Mistral-7B + KTO (all y per x)  | 0.652 ± 0.036
Mistral-7B + KTO (one y per x)  | 0.631 ± 0.036
Mistral-7B-Instruct             | 0.621 ± 0.031
We further verify this claim by aligning Mistral-7B on OpenAssistant using DPO (on n pairs), standard KTO (on all 2n outputs), and KTO where only one y per x is used. Since the output of one y in OpenAssistant is not conditioned on the other outputs for the same input, the latter effectively captures the setting where the data is from an inherently unpaired distribution. Despite the one-y-per-x setup decreasing the amount of training data by 72%, the KTO-aligned model still outperforms both its DPO counterpart and the official instruction-tuned version of Mistral-7B (Jiang et al., 2023), as seen in Table 1.
On average, KTO improves performance across generative benchmarks.  Zephyr-β is a variant of Mistral-7B that has been instruction-tuned and DPO-aligned on the UltraFeedback dataset (Tunstall et al., 2023; Cui et al., 2023). We find that substituting KTO for DPO (and changing nothing else) improves performance across MMLU (0-shot) (Hendrycks et al., 2021), GSM8K (8-shot, CoT) (Cobbe et al., 2021), HumanEval (0-shot) (Chen et al., 2021), and BigBench-Hard (3-shot CoT) (Srivastava et al., 2022). On GSM8K, just swapping DPO for KTO improves performance by 13.5 points. Even when we align with KTO using only one y per x (i.e., reducing the data volume by half), we still outperform DPO on all but one benchmark.
4.3. Theoretical Analysis
KTO was designed with the motivation that even if it had
to learn from a weaker signal, it would make up for this
limitation with the fact that it has access to much more
data in the real world, where thumbs-up/thumbs-down data
is common but preferences are scarce and expensive to
collect. So why does KTO perform as well as or better than DPO in our experiments, when it sees the same amount
of data? Data efficiency may not be the only answer. Our
theoretical analysis suggests that preference likelihood can
be maximized without necessarily maximizing underlying human utility and that KTO implicitly ignores noisy and intransitive data.

Table 2. Aligning Zephyr (Tunstall et al., 2023), a derivative of Mistral-7B, on UltraFeedback with KTO instead of DPO improves results across a suite of benchmarks. This is true even when only one of the two outputs in each preference is seen by KTO, despite this reducing the volume of data by half (one-y-per-x).

Dataset (→)        | MMLU | GSM8k | HumanEval | BBH
Metric (→)         | EM   | EM    | pass@1    | EM
Zephyr-β SFT       | 57.2 | 39.0  | 30.1      | 46.3
+DPO               | 58.2 | 40.0  | 30.1      | 44.1
+KTO               | 58.6 | 53.5  | 30.9      | 52.6
+KTO (one-y-per-x) | 58.0 | 50.0  | 30.7      | 49.9
Proposition 4.1. KTO does not learn from undesirable examples with sufficiently high rewards or desirable examples with sufficiently low rewards.

Informally, if an example is too difficult to learn from, then the KTO update will not change π_θ. This may be a blessing in disguise, since human preferences are often noisy and not every given preference can be recovered with the true reward r* (Hoeffler & Ariely, 1999). This means that it may be useful to avoid unlearnable preferences. However, this is a double-edged sword: it also means that KTO could end up ignoring some data that is hard-to-learn but necessary to recover r*, resulting in under-fitting.
Theorem 4.2. Assuming the value function is logistic, for any bounded reward function r_a, there exists a reward function in its equivalence class (i.e., r_b(x, y) = r_a(x, y) + h(x) for some h(x)) that induces the same optimal policy π* and Bradley-Terry preference distribution but a different human value distribution.
A key insight from Rafailov et al. (2023) is that reward
functions in the same equivalence class (i.e., differing only
in an input-specific component) induce the same optimal
policy under (2) and the same Bradley-Terry preference
distribution. However, we show under mild assumptions
that the value distribution—i.e., human utility—is affected
by such input-specific changes, so maximizing preference
likelihood does not mean one is maximizing human utility.
Approaches that directly maximize utility, such as KTO,
may thus perform better in open-ended evaluation.
Theorem 4.3. Let two humans a, b have value functions v_a, v_b and contradicting preferences y_1 ≻_a y_2 and y_2 ≻_b y_1 for some input x. Assume π_ref(y|x) = 0 ⟹ π_θ(y|x) = 0 for all x, y. In the worst case, the optimal policy under DPO decreases the expected value of both humans. In contrast, if each preference is broken up into two examples, then KTO (with default settings) does not change the policy.
Informally, we assume that humans want the model to in-
crease and decrease the probability of generations they like
and dislike respectively. However, the preferences of two
humans often contradict, leading to a dataset containing
intransitive preferences. In the worst case, DPO allows one
of the two preferences to be recovered while decreasing
the expected value of both humans. In contrast, KTO will
change nothing at all in any case. Since existing datasets
contain preferences from multiple annotators, the existence
of intransitivity may help explain why KTO works better.
4.4. KTO vs. DPO – when to use which?
When human feedback is in a binary format, and especially
when there is an imbalance between the number of desirable
and undesirable examples, KTO is the natural choice. When
your data is in the form of preferences, the choice is less
clear. Putting aside the greater data efficiency of KTO, our
theoretical analysis suggests that if your preference data
has sufficiently little noise and sufficiently little intransitiv-
ity, then DPO will work better, since there is some risk of
KTO underfitting. But if there is enough noise and intransitivity, then the better worst-case guarantees of KTO will win
out. Most publicly available preference datasets (e.g., SHP,
OpenAssistant) contain noisy feedback from many different
humans whose preferences likely contradict, which explains
why KTO was able to match or exceed DPO performance
in our experiments. Even AI feedback can be noisy and in-
transitive, which helps explain why KTO outperforms DPO
when aligning with the synthetic UltraFeedback data.
5. Related Work
Human feedback has been used to improve LLM capabilities in translation (Kreutzer et al., 2018), summarization (Stiennon et al., 2020), sentiment-conditioned generation (Ziegler et al., 2019), and instruction-following (Ouyang et al., 2022). The RLHF framework (Christiano et al., 2017; Bai et al., 2022) traditionally used to accomplish this is detailed in §2.
Still, momentum has largely shifted in favor of closed-form losses that directly operate on offline preferences, such as DPO (Rafailov et al., 2023). This single stage of optimization distinguishes DPO from the conventional approach in preference-based RL, which learns a reward and then fits the policy to those rewards (Jain et al., 2013; Busa-Fekete et al., 2014). A recent string of work has centered on the idea of "self-training" or "self-play", during which new preference data is inferred from a model's generations (Chen et al., 2024; Yuan et al., 2024). Despite not being a human-aware loss, unlikelihood training was among the first methods to align language models using a binary signal (Welleck et al., 2019). However, work by Korbak et al. (2023) found that it is worse than the CSFT baseline we tested in our work.
Prospect theory, despite being highly influential in behavioral economics, has had a fairly muted impact in machine learning, with work concentrated in human-robot interaction (Kwon et al., 2020; Sun et al., 2019; Chan et al., 2021). Learning from sparse binary feedback is a staple of information retrieval and recommender systems (He et al., 2017; Koren et al., 2009), although to our knowledge it has not been used to generate open-ended text.
6. Future Work
The existence of HALOs raises many questions. For one,
the KTO loss is based on the Kahneman-Tversky value
function for monetary gains and losses, which is almost
certainly different from how humans perceive the relative
goodness of text. What value function—and corresponding
HALO—best describes how humans perceive language?
Given that the data that KTO needs is much more abun-
dant, cheaper, and faster to collect—both as human and AI
feedback—how far can we push synthetic data? For exam-
ple, if we wanted to create a toxicity dataset to align our
models to be less toxic, creating a tuple (x, y_w, y_l) where y_l is more toxic than y_w is non-trivial. However, with KTO, we can easily create a dataset where desirability is determined by some black-box toxicity detection API. What other kinds of desiderata can we synthetically optimize for with KTO? Can we convert signals like "conversation led to sale made" or "support ticket resolved" into KTO data?
Currently, KTO can learn from score-based data when the
score is used to infer desirability. However, can we design a
HALO where scores are directly incorporated into this loss?
7. Conclusion
We proposed a class of functions called human-aware losses
(HALOs) based on the idea of a Kahneman-Tversky value
function, which models some of the key cognitive biases
that inform how humans make decisions about uncertain out-
comes. We showed that among existing alignment methods,
those that met the definition of a HALO performed better
than those that did not, suggesting a benefit to the modeling
of human biases. We then designed a human-aware loss
called KTO for directly maximizing the utility of genera-
tions instead of maximizing preference likelihood. Despite
only learning from a binary signal of whether an output is
(un)desirable, KTO is as good or better than DPO at scales
from 1B to 30B. Still, we make no claims that KTO is the
best HALO for all scenarios; there remains much work to
be done in discovering the optimal human-aware loss for each
setting.
Acknowledgements
We thank Dilip Arumugam and Arya McCarthy for feedback
on the paper and Nathan Lambert for feedback on an early
version of this draft. We thank Stas Bekman and Gautam
Mittal for cluster assistance and Alex Manthey for helping
with human evaluation.
References
Baheti, A., Lu, X., Brahman, F., Bras, R. L., Sap, M., and Riedl, M. Improving language models with advantage-based offline policy gradients. arXiv preprint arXiv:2305.14718, 2023.

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.

Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430. PMLR, 2023.

Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.

Busa-Fekete, R., Szörényi, B., Weng, P., Cheng, W., and Hüllermeier, E. Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm. Machine Learning, 97:327–351, 2014.

Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., Freedman, R., Korbak, T., Lindner, D., Freire, P., et al. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.

Chan, L., Critch, A., and Dragan, A. Human irrationality: both bad and good for reward inference. arXiv preprint arXiv:2111.06956, 2021.

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

Chen, Z., Deng, Y., Yuan, H., Ji, K., and Gu, Q. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., Xie, G., Liu, Z., and Sun, M. Ultrafeedback: Boosting language models with high-quality feedback, 2023.

Ethayarajh, K., Choi, Y., and Swayamdipta, S. Understanding dataset difficulty with V-usable information. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 5988–6008. PMLR, 17–23 Jul 2022.

Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.

Go, D., Korbak, T., Kruszewski, G., Rozen, J., Ryu, N., and Dymetman, M. Aligning language models with preferences through f-divergence minimization. arXiv preprint arXiv:2302.08215, 2023.

Gurevich, G., Kliger, D., and Levy, O. Decision-making under uncertainty–a field study of cumulative prospect theory. Journal of Banking & Finance, 33(7):1221–1229, 2009.

He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T.-S. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pp. 173–182, 2017.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.

Hoeffler, S. and Ariely, D. Constructing stable preferences: A look into dimensions of experience and their impact on preference stability. Journal of Consumer Psychology, 8(2):113–139, 1999.

Holm, S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, pp. 65–70, 1979.
Jain, A., Wojcik, B., Joachims, T., and Saxena, A. Learning trajectory preferences for manipulators via iterative improvement. Advances in Neural Information Processing Systems, 26, 2013.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.

Kahneman, D. and Tversky, A. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–292, 1979.

Köpf, A., Kilcher, Y., von Rütte, D., Anagnostidis, S., Tam, Z.-R., Stevens, K., Barhoum, A., Duc, N. M., Stanley, O., Nagyfi, R., et al. Openassistant conversations–democratizing large language model alignment. arXiv preprint arXiv:2304.07327, 2023.

Korbak, T., Shi, K., Chen, A., Bhalerao, R. V., Buckley, C., Phang, J., Bowman, S. R., and Perez, E. Pretraining language models with human preferences. In International Conference on Machine Learning, pp. 17506–17533. PMLR, 2023.

Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.

Kreutzer, J., Uyheng, J., and Riezler, S. Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1777–1788, 2018.

Kwon, M., Biyik, E., Talati, A., Bhasin, K., Losey, D. P., and Sadigh, D. When humans aren't optimal: Robots that collaborate with risk-aware humans. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, pp. 43–52, 2020.

Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

Peters, J. and Schaal, S. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, pp. 745–750, 2007.

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.

Sun, L., Zhan, W., Hu, Y., and Tomizuka, M. Interpretable modelling of driving behaviors in interactive driving scenarios based on cumulative prospect theory. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 4329–4335. IEEE, 2019.

Tian, K., Mitchell, E., Yao, H., Manning, C. D., and Finn, C. Fine-tuning language models for factuality. arXiv preprint arXiv:2311.08401, 2023.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., Sarrazin, N., Sanseviero, O., Rush, A. M., and Wolf, T. Zephyr: Direct distillation of lm alignment, 2023.

Tversky, A. and Kahneman, D. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5:297–323, 1992.

von Werra, L., Belkada, Y., Tunstall, L., Beeching, E., Thrush, T., Lambert, N., and Huang, S. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.

Welleck, S., Kulikov, I., Roller, S., Dinan, E., Cho, K., and Weston, J. Neural text generation with unlikelihood training. In International Conference on Learning Representations, 2019.
Yuan, W., Pang, R. Y., Cho, K., Sukhbaatar, S., Xu, J., and Weston, J. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024.

Zhao, Y., Joshi, R., Liu, T., Khalman, M., Saleh, M., and Liu, P. J. Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.

Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
A. Proofs
Proposition 3.5. DPO, SLiC (calibration loss only), and PPO-Clip are human-aware loss functions.

Proof. For a loss to be a HALO, it needs to be expressible as

f(x, y; \theta) = t(v_f(r_\theta(x, y) - \mathbb{E}_{x' \sim Q_x,\, y' \sim Q_y}[r_\theta(x', y')]))

with a parameterized reward function r_θ such that ∀ (x_1, y_1), (x_2, y_2) ∈ X × Y, r_θ(x_1, y_1) > r_θ(x_2, y_2) ⟺ (x_1, y_1) ≻_{r_θ} (x_2, y_2), reference point distributions Q_x(X'), Q_y(Y' | X'), a value function v_f : R → R that is monotonic non-decreasing and concave in (0, ∞), and a negative affine function t.

The DPO loss is

\mathcal{L}_\mathrm{DPO}(\pi_\theta, \pi_\mathrm{ref}) = \mathbb{E}\left[ -\log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\mathrm{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\mathrm{ref}(y_l \mid x)} \right) \right]

where β > 0 is a hyperparameter. DPO meets the criteria with the following construction: t(·) is just taking the negative, v_f = log σ is increasing and concave everywhere, r_θ is the DPO reward β log[π_θ(y|x)/π_ref(y|x)], Q_x places all mass on x, and Q_y places all mass on the dispreferred output y_l for x such that y ≻ y_l.

The calibration loss in SLiC is

\mathcal{L}_\mathrm{cal} = \mathbb{E}_{x, y_w, y_l \sim D}[\max(0, \delta - \log \pi_\theta(y_w \mid x) + \log \pi_\theta(y_l \mid x))]
= \mathbb{E}_{x, y_w, y_l \sim D}[-\min(0, \log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x) - \delta)]

where δ > 0 is a hyperparameter. The calibration loss meets the criteria under the following construction: t(·) is just taking the negative, v_f(z) = min(0, z − δ) is non-decreasing everywhere and concave in gains, r_θ(x, y) = log π_θ(y|x), and the reference point distributions are defined the same as they are for DPO.

The PPO-Clip loss is

\mathcal{L}_\mathrm{PPO\,(offline)} = -\mathbb{E}_{x, y, i \sim D}[\min(q_\theta A(x, y_{<i}, y_i), \mathrm{clip}(q_\theta, 1 - \epsilon, 1 + \epsilon) A(x, y_{<i}, y_i))]

where q_\theta = \pi_\theta(y_i \mid x, y_{<i}) / \pi_\mathrm{ref}(y_i \mid x, y_{<i}) are the token-level probability ratios and y_{<i} denotes the output sequence up to the i-th token. This token-level focus makes this objective different from DPO and SLiC. A denotes the token-level advantages, and ε > 0 is a hyperparameter. The reward is r_θ(x, y) = log q_θ and t then just takes the negative. We can let Q_x place all mass on the joint sequence x : y and Q_y be an arbitrary distribution over Y—since there is no advantage to generating tokens past y_n, the distributions π*(· | x : y) and π_ref(· | x : y) should be arbitrarily close, pushing log q_θ → 0. We construct the value function piecewise:

v_f(z) = \begin{cases} A \min(\exp z, 1 + \epsilon) & \text{if } A(x, y_{<i}, y_i) > 0 \\ A \max(\exp z, 1 - \epsilon) & \text{if } A(x, y_{<i}, y_i) < 0 \end{cases}

Proposition 4.1. KTO does not learn from undesirable examples with sufficiently high rewards or desirable examples with sufficiently low rewards.

Proof. Where λ(y) = −λ_D when y is desirable and λ_U when y is undesirable, and z = r_KTO(x, y) − z_ref, the derivative of the KTO loss is

\nabla_\theta \mathcal{L}_\mathrm{KTO}(\pi_\theta, \pi_\mathrm{ref}) = \mathbb{E}_{x, y \sim D}[\lambda(y)\, \sigma(z)\, \sigma(-z)\, \nabla_\theta\, \beta \log \pi_\theta(y \mid x)] \quad (9)

Note that we do not backpropagate through the KL term in the KTO loss and β > 0. This gradient is simple to interpret: if y is desirable, then λ(y) is negative and we push up the probability of π_θ(y|x) to minimize the loss; we do the opposite if y is undesirable. As z tends to ±∞, the gradient will tend to zero since either σ(−z) or σ(z) will tend to zero. Since z is increasing in the reward, this means that sufficiently large and sufficiently small rewards will yield a gradient of zero.
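The saturation in this gradient can be checked numerically; the snippet below simply evaluates the σ(z)σ(−z) factor for a few reference-adjusted rewards and shows that it vanishes at the extremes.

```python
import torch

# The gradient of the KTO loss scales each example by sigma(z) * sigma(-z),
# which vanishes as |z| grows, so examples with extreme reference-adjusted
# rewards contribute (almost) nothing to the update.
z = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0])
scale = torch.sigmoid(z) * torch.sigmoid(-z)
print(scale)  # roughly [4.5e-05, 0.105, 0.25, 0.105, 4.5e-05]
```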
Theorem 4.2. Assuming the value function is logistic, for any bounded reward function r_a, there exists a reward function in its equivalence class (i.e., r_b(x, y) = r_a(x, y) + h(x) for some h(x)) that induces the same optimal policy π* and Bradley-Terry preference distribution but a different human value distribution.

Proof. Following the definition in Rafailov et al. (2023), we say two functions r_a(x, y) and r_b(x, y) are in the same equivalence class if there exists some function h(x) such that r_b(x, y) = r_a(x, y) + h(x). From Lemma 1 in Rafailov et al. (2023), we know that two functions in the same equivalence class induce the same optimal policy:

\pi_{r_b}(y \mid x) = \frac{1}{Z(x)} \pi_\mathrm{ref}(y \mid x) \exp\left(\frac{1}{\beta} r_b(x, y)\right)
= \frac{\pi_\mathrm{ref}(y \mid x) \exp\left(\frac{1}{\beta}(r_a(x, y) + h(x))\right)}{\sum_y \pi_\mathrm{ref}(y \mid x) \exp\left(\frac{1}{\beta}(r_a(x, y) + h(x))\right)}
= \frac{\pi_\mathrm{ref}(y \mid x) \exp\left(\frac{1}{\beta} r_a(x, y)\right) \exp\left(\frac{1}{\beta} h(x)\right)}{\sum_y \pi_\mathrm{ref}(y \mid x) \exp\left(\frac{1}{\beta} r_a(x, y)\right) \exp\left(\frac{1}{\beta} h(x)\right)}
= \pi_{r_a}(y \mid x)

For a Bradley-Terry model of preferences, it is trivial to show that p(y_w ≻ y_l | x) is unaffected by h(x) since it is added to the reward of both y_w and y_l. We will now show that the two reward functions do not necessarily induce the same distribution of values. Let

h_u(x) = \begin{cases} 0 & \text{if } x \neq u \\ C_u & \text{if } x = u \end{cases}

where C_u ≠ 0 is an input-specific constant.

Assume that y is a desirable output for x without loss of generality. Let n_u be the number of times u appears in a set of size n. For reward functions r_a, r_b with corresponding logistic value functions v_a, v_b such that r_b(x, y) = r_a(x, y) + h_u(x) for some input u:

v_b(x, y) = \sigma(r_b(x, y) - \mathbb{E}_{x' \sim D,\, y' \sim \pi^*}[r_b(x', y')])
= \sigma(r_a(x, y) + h_u(x) - \mathbb{E}_{x' \sim D,\, y' \sim \pi^*}[r_a(x', y') + h_u(x')])
= \sigma(r_a(x, y) - \mathbb{E}_{x' \sim D,\, y' \sim \pi^*}[r_a(x', y')] + (h_u(x) - \mathbb{E}_{x' \sim D}[h_u(x')]))

Let v'_a be the derivative of v_a, and let ε denote the error term from a first-order Taylor series expansion.

When u = x:

v_b(x, y) = \sigma\left(r_a(x, y) - \mathbb{E}_{x' \sim D,\, y' \sim \pi^*}[r_a(x', y')] + \left(C_u - \frac{n_u}{n} C_u\right)\right)
= v_a(x, y) + \left(C_u - \frac{n_u C_u}{n}\right) v'_a(x, y) + \epsilon
= v_a(x, y) + \left(C_u - \frac{n_u C_u}{n}\right) v_a(x, y)(1 - v_a(x, y)) + \epsilon

When u ≠ x:

v_b(x, y) = \sigma\left(r_a(x, y) - \mathbb{E}_{x' \sim D,\, y' \sim \pi^*}[r_a(x', y')] - \frac{n_u}{n} C_u\right)
= v_a(x, y) - \frac{n_u C_u}{n} v_a(x, y)(1 - v_a(x, y)) + \epsilon

Since the rewards are bounded by assumption, we have v_a ∈ (0, 1). For a k-th order Taylor series approximation, we thus have ε ∈ O(C_u 2^{−k}). Even if we generously assume that ε = 0, we have v_b(x, ·) ≠ v_a(x, ·) in at least one of these cases (either when n_u > 0 or when n_u = n). We have thus shown that two bounded reward functions in the same equivalence class can induce both the same policy and Bradley-Terry preference distribution but a different distribution of human values.
Theorem 4.3. Let two humans a, b have value functions v_a, v_b and contradicting preferences y_1 ≻_a y_2 and y_2 ≻_b y_1 for some input x. Assume π_ref(y|x) = 0 ⟹ π_θ(y|x) = 0 for all x, y. In the worst case, the optimal policy under DPO decreases the expected value of both humans. In contrast, if each preference is broken up into two examples, then KTO (with default settings) does not change the policy.

Proof. Where u = \beta \log \frac{\pi_\theta(y_1 \mid x)}{\pi_\mathrm{ref}(y_1 \mid x)} - \beta \log \frac{\pi_\theta(y_2 \mid x)}{\pi_\mathrm{ref}(y_2 \mid x)}, we can write the total DPO loss as

\hat{\mathcal{L}}_\mathrm{DPO}(\pi_\theta, \pi_\mathrm{ref}) = \frac{1}{2}(-\log \sigma(u)) + \frac{1}{2}(-\log \sigma(-u))

Taking the derivative with respect to u and setting it to zero, we get

0 = -\frac{1}{2}\left(\frac{\sigma(u)\sigma(-u)}{\sigma(u)} - \frac{\sigma(-u)\sigma(u)}{\sigma(-u)}\right) = -\frac{1}{2}\left(\sigma(-u) - \sigma(u)\right) \implies u = 0

Since β > 0, u = 0 can only occur when the rewards of both the preferred and dispreferred outputs are equal. This can be satisfied when π_θ(y_1|x) = π_θ(y_2|x) = 0, with the probability mass allocated to examples with lower true reward r* in the worst case. Since value functions by definition are monotonically non-decreasing, the expected value for both humans would decrease in the worst case.

Where z_1, z_2 are the reference-adjusted rewards of y_1 and y_2, we can write the total KTO loss (with default settings λ_D = λ_U = 1) as:

\hat{\mathcal{L}}_\mathrm{KTO}(\pi_\theta, \pi_\mathrm{ref}) = \frac{1}{4}(1 - \sigma(z_1)) + \frac{1}{4}(1 - \sigma(-z_1)) + \frac{1}{4}(1 - \sigma(z_2)) + \frac{1}{4}(1 - \sigma(-z_2))
= \frac{1}{4}\sigma(-z_1) + \frac{1}{4}(1 - \sigma(-z_1)) + \frac{1}{4}\sigma(-z_2) + \frac{1}{4}(1 - \sigma(-z_2))
= \frac{1}{2}

Therefore the loss is already minimal and no changes are made to the policy.
B. Implementations
SLiC  Instead of sampling from the reference model to calculate L_reg as Zhao et al. (2023) do—as it is very slow—we just apply the cross-entropy loss to the SFT data, assuming that the reference model recovers the SFT distribution.

DPO  We use the implementation of DPO in the code provided by Rafailov et al. (2023). We found that, as mentioned in the original paper, β = 0.1 works best for most settings. Other training configurations, such as the learning rate and optimizer, were borrowed from the original paper.

CSFT  The control tokens used for generating the good and bad outputs are <|good|> and <|bad|> respectively, following the precedent set in Korbak et al. (2023).

KTO  We use β = 0.1 in our experiments unless otherwise specified (the same setting as for DPO), as it is close-to-optimal for most settings.

PPO  PPO-Clip is the traditional means of optimizing the RLHF objective (2). However, most implementations of PPO-Clip for LLM alignment suffer from instability, particularly during distributed training. We find that running the PPO-Clip objective on offline data with the following "tricks" leads to much more stable training:
•
We never update the reference distribution (i.e., the policy only takes one step in the trust region). Baheti et al. (2023) recommend this as well. To accommodate this conservative change, we clip the probability ratios more liberally,
finding that an asymmetric interval of [0.25, 4.0] works best instead of the small symmetrical interval (e.g., [0.8, 1.2]) that is traditionally recommended.
•
Including a KL penalty (between the policy and reference distributions) in addition to the clipping makes training more stable, as is also done in the implementation by von Werra et al. (2020). We find that it is important to estimate the KL term not using the entire distribution, however, but rather as the mean difference in the predicted log probabilities of the actual output tokens (i.e., the labels). We suspect that this makes a difference because the rest of the distribution can be poorly calibrated.
•
The value of a state is generally predicted by a value head attached to the policy model; the value loss is the MSE between the predicted value and the discounted sum of future rewards for each token. This head is a linear layer in many RLHF implementations (von Werra et al., 2020). However, we find that backpropagating the value loss through this head and the policy leads to worse performance. Instead, we make the value head a 3-layer MLP and detach it from the computational graph, so that the value losses are not backpropagated through the policy model but the value head still has sufficient capacity to learn good estimates.
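The sketch below (ours, not the paper's exact code) combines these three changes: a frozen reference policy, asymmetric ratio clipping to [0.25, 4.0], a KL penalty estimated from the label log-probabilities, and a detached 3-layer MLP value head. Names such as kl_coef and DetachedValueHead are our own placeholders.

```python
# Offline PPO-Clip sketch with the stabilizing changes described in the list above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetachedValueHead(nn.Module):
    """3-layer MLP value head; hidden states are detached so the value loss
    never backpropagates into the policy model."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU(),
                                 nn.Linear(hidden_size, hidden_size), nn.ReLU(),
                                 nn.Linear(hidden_size, 1))

    def forward(self, hidden_states):
        return self.mlp(hidden_states.detach()).squeeze(-1)

def offline_ppo_clip_loss(policy_logp, ref_logp, advantages, returns, values,
                          clip_low=0.25, clip_high=4.0, kl_coef=0.1):
    """All tensors are per-token quantities for the labeled (offline) output tokens."""
    ratio = torch.exp(policy_logp - ref_logp)          # reference policy is never updated
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, clip_low, clip_high) * advantages  # asymmetric interval
    policy_loss = -torch.minimum(unclipped, clipped).mean()
    # crude KL estimate: mean difference in label log-probs, not over the full vocabulary
    kl_penalty = (policy_logp - ref_logp).mean()
    value_loss = F.mse_loss(values, returns)           # values come from the detached head
    return policy_loss + kl_coef * kl_penalty + value_loss
```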
C. Human Evaluation
For human evaluation, we randomly sampled 256 prompts from the OpenAssistant test set and generated outputs from Mistral 7B models aligned with DPO and KTO. All inputs were multi-turn conversations between a user and an assistant, where the LLM played the role of the assistant (see Appendix E for an example of this conversation format).
These were sent to a third-party data annotation service, where a pool of workers picked either the generated output or the SFT target (from the OpenAssistant dataset) as the more appropriate response by the assistant. Any questions that required specific domain expertise (e.g., coding) were skipped, leaving 214 comparisons each for DPO and KTO.
The winrates of the aligned models over the SFT targets are 72.9% ± 5.3 for KTO and 62.1% ± 5.7 for DPO (where the intervals are 90% binomial confidence intervals). In contrast, Table 1 reports the winrates with GPT-4 as a judge instead: 65.2% ± 3.6 for KTO and 60.0% ± 3.7 for DPO. Thus, although there is no significant difference in the GPT-4-based evaluation, there is a significant difference under human evaluation at p < 0.05. We found that 68.7% of the human judgments on the KTO comparisons concurred with GPT-4; this number fell to 65.9% for DPO.
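As a rough check of these numbers, the following sketch uses a normal approximation (our assumption, not necessarily the interval construction used above) to reproduce the confidence intervals and the significance claim.

```python
# Back-of-the-envelope check of the reported 90% binomial confidence intervals
# and of the KTO-vs-DPO comparison over 214 human judgments (normal approximation).
import math

def binomial_ci_halfwidth(p_hat, n, z=1.645):  # z = 1.645 for a 90% interval
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

print(binomial_ci_halfwidth(0.729, 214))  # ~0.050, close to the reported +/- 5.3 points
print(binomial_ci_halfwidth(0.621, 214))  # ~0.055, close to the reported +/- 5.7 points

# two-proportion z-test for the KTO (72.9%) vs. DPO (62.1%) human winrates
p_pool = (0.729 * 214 + 0.621 * 214) / (2 * 214)
z_stat = (0.729 - 0.621) / math.sqrt(p_pool * (1 - p_pool) * (2 / 214))
print(z_stat)  # ~2.4 > 1.96, i.e., significant at p < 0.05 (two-sided)
```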
D. Additional Experiments
Figure 7. AlpacaEval 2 winrates with GPT-4-Turbo chain-of-thought as the evaluator. KTO consistently outperforms DPO and the unaligned Mistral 7B base model. KTO also demonstrates greater robustness across varying sampling temperatures.
E. Sample Generations
Table 3. Aligning Zephyr (Tunstall et al., 2023), a derivative of Mistral-7B, on UltraFeedback with KTO instead of DPO improves results across a suite of benchmarks. AlpacaEval 2 was run using GPT-4-Turbo with chain-of-thought (CoT) as the LM judge annotator. All evaluations were run with sampling temperature 0.7.

Dataset (→)     MMLU     GSM8k         HumanEval       BBH           TydiQA       AlpacaEval   Avg.
Setup (→)       0-shot   8-shot, CoT   0-shot, Synth.  3-shot, CoT   1-shot, GP   0-shot
Metric (→)      EM       EM            pass@1          EM            F1           % Win
Zephyr-β SFT    57.2     39.0          30.1            46.3          36.3         6.41         35.9
+DPO            58.2     40.0          30.1            44.1          36.5         7.79         36.1
+KTO            58.6     53.5          30.9            52.6          31.2         12.48        39.9
+KTO (unary)    58.0     50.0          30.7            49.9          37.7         11.55        39.6
Prompt <|user|>
I need some show recommendations that are similar to Breaking Bad.
<|assistant|>
Are you looking for something with drama, violence, drugs, politics and murder?
<|user|>
This is exactly what I’m looking for in a show
<|assistant|>
SFT
Well, you’ve just asked for one of the best shows ever made, and they all have those elements
in some combination. There are a lot of shows with those elements.
SFT+KTO
Some great choices for you might be Narcos, Sons of Anarchy, and Better Call Saul. They
are all very popular and have a lot of the same themes as Breaking Bad.
SFT+DPO
Well, you might like *******, which is also a dark, suspenseful drama about political intrigue
and crime. Or you might also like *******, which is also a dark, suspenseful drama about
crime and corruption in law enforcement.
—
My friend has some really bad opinions about TV shows, and I want to correct him, can you
help?
—
Sure! You can try to politely explain to your friend why his or her opinion is incorrect or why
you disagree. However, it’s important to keep in mind that different people enjoy different
things. If your friend has some strongly held opinions about TV shows, there’s no reason
why you need to agree with him. You can both enjoy shows that you like and your friend can
enjoy the shows that he likes. That’s just the way it is.
—
Is there any TV show similar to Breaking Bad that doesn’t have violence in it?
—
It’s very hard to find a TV show that doesn’t have any violence in it at all. However, there are
some TV shows that are not focused on violence. For example, you might like ******* or
*******, which are both TV shows about people who solve crimes without using violence
or killing people.
—
Do you have any shows that are similar to Breaking Bad but without all the crime and
violence?
—
Sure! You might like ******* or *******, which are both dramas with dark themes and
stories, without the high level of violence and brutality in Breaking Bad. Or you might like
*******, which is also a dark and suspenseful drama.
—
Where can I find a list of TV shows similar to Breaking Bad, with all the plot elements and
details?
—
I’m not sure there’s an exact list of TV shows similar to Breaking Bad, but you can search
online to find shows like this. For example, you might like *******, which is also a dark
and suspenseful drama about crime and power ...
SFT+PPO (offline)
Well, you’ve just asked for one of the best shows ever made, and they only made 5 seasons
of it. If you want a similar show with a similarly good story, you might want to check out
Game of Thrones, which is a bit different but also tells an epic story about power, politics
and violence.
Let me know if you need any more recommendations, it’s no problem!
Table 4.
Sample generations from the different aligned versions of Llama-30B for a prompt about show recommendations (all models were aligned on data following this user-assistant format). Note that the SFT answer is not helpful and the SFT+DPO answer hallucinates multiple turns of the conversation (in fact, we had to truncate the answer shown here because the complete answer is too long). The SFT+PPO (offline) answer is helpful but only provides one recommendation, while SFT+KTO is succinct and provides multiple options.
Model Alignment as Prospect Theoretic Optimization
A. Proofs
Proposition DPO, SLiC (calibration loss only), and PPO-Clip are human-aware loss functions.
Proof.For a loss to be a HALO, it needs to be expressible as
f(x, y;θ) =t(vf(rθ(x, y)−Ex
′
∼Q
′
x
,y
′
∼Q
′
y
[rθ(x
′
, y
′
)]))
with a parameterized reward functionrθsuch that∀(x1, y1),(x2, y2)∈ X ×Y ,rθ(x1, y1)> rθ(x2, y2)⇐⇒(x1, y1)≻rθ
(x2, y2)
, reference point distributionsQx(X
′
), Qy(Y
′
|X
′
) , a value functionvf:R→R that is monotonic non-decreasing
and concave in(0,∞), and a negative affine functiont.
The DPO loss is
LDPO(πθ, πref) =E
ffi
−logσ
ȷ
βlog
πθ(yw|x)
πref(yw|x)
−βlog
πθ(yl|x)
πref(yl|x)
ffffl
whereβ >0 is a hyperparameter. DPO meets the criteria with the following construction:t(·)is just taking the negative,
vf= logσ is increasing and concave everywhere,rθis the DPO rewardβlog[πθ(y|x)/πref(y|x)] ,Qxplaces all mass onx
andQyplaces all mass on the dispreferred outputylforxsuch thaty≻yl.
The calibration loss in SLiC is
Lcal=Ex,yw,yl∼D[max (0, δ−logπθ(yw|x) + logπθ(yl|x))]
=Ex,yw,yl∼D[−min (0,logπθ(yw|x)−logπθ(yl|x)−δ)]
whereδ >0 is hyperparameter. The calibration loss meets the criteria under the following construction:t(·)is just taking
the negative,vf(z) = min(0, z−δ) is non-decreasing everywhere and concave in gains,rθ(x, y) = logpθ(y|x) , and the
reference point distributions are defined the same as they are for DPO.
The PPO-Clip loss is
LPPO (offline)=−Ex,y,i∼D[min(qθA(x, y<i, yi),clip(qθ,1−ϵ,1 +ϵ)A(x, y<i, yi))]
whereqθ=
πθ(yt|x,y<i)
πref(yt|x,y<i)
are the token-level probability ratios, wherey<idenotes the output sequence up to thei-th token.
This token-level focus makes this objective different from DPO and SLiC.Adenotes the token-level advantages, andϵ >0
is a hyperparameter. The rewardrθ(x, y) = logqθ andtthen just takes the negative. We can letQxplace all mass on the
joint sequencex:yandQybe an arbitrary distribution overY— since there is no advantage to generating tokens past
yn, the distributionsπ
∗
(·|x:y) andπref(·|x:y) should be arbitrarily close, pushinglogqθ→0 . We construct the value
function piecewise:
vf(z) =
(
Amin(expz,1 +ϵ)ifA(x, y<i, yi)>0
Amax(expz,1−ϵ)ifA(x, y<i, yi)<0
Proposition KTO does not learn from undesirable examples with sufficiently high rewards or desirable
examples with sufficiently low rewards.
Proof.
Where $\lambda(y) = -\lambda_D$ when $y$ is desirable and $\lambda_U$ when $y$ is undesirable, and $z = r_{\text{KTO}}(x, y) - z_{\text{ref}}$, the derivative of the KTO loss is
$$\nabla_\theta \mathcal{L}_{\text{KTO}}(\pi_\theta, \pi_{\text{ref}}) = \mathbb{E}_{x, y \sim D}\left[\lambda(y)\, \sigma(z)\, \sigma(-z)\, \nabla_\theta\, \beta \log \pi_\theta(y \mid x)\right] \tag{9}$$
Note that we do not backpropagate through the KL term in the KTO loss and that $\beta > 0$. This gradient is simple to interpret: if $y$ is desirable, then $\lambda(y)$ is negative and we push up the probability $\pi_\theta(y \mid x)$ to minimize the loss; we do the opposite if $y$ is undesirable. As $z$ tends to $\pm\infty$, the gradient tends to zero, since either $\sigma(-z)$ or $\sigma(z)$ tends to zero. Since $z$ is increasing in the reward, sufficiently large and sufficiently small rewards therefore yield a gradient of zero.
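As a quick numerical illustration (not from the paper's codebase), the saturating weight $\sigma(z)\sigma(-z)$ in Eq. (9) can be computed directly; it peaks at $z = 0$ and vanishes for reference-adjusted rewards far from zero:

```python
import torch

# Per-example gradient weight sigma(z) * sigma(-z) from Eq. (9): it peaks at
# z = 0 and vanishes for rewards far above or below the reference point.
z = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0])
weight = torch.sigmoid(z) * torch.sigmoid(-z)
print(weight)  # approx. [4.5e-05, 0.105, 0.25, 0.105, 4.5e-05]
```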
Theorem Assuming the value function is logistic, for any bounded reward function $r_a$, there exists a reward function in its equivalence class (i.e., $r_b(x, y) = r_a(x, y) + h(x)$ for some $h(x)$) that induces the same optimal policy $\pi^*$ and Bradley-Terry preference distribution but a different human value distribution.
Proof.
Following the definition in Rafailov et al. (2023), we say two reward functions $r_a(x, y)$ and $r_b(x, y)$ are in the same equivalence class if there exists some function $h(x)$ such that $r_b(x, y) = r_a(x, y) + h(x)$. From Lemma 1 in Rafailov et al. (2023), we know that two functions in the same equivalence class induce the same optimal policy:
$$\begin{aligned}
\pi_{r_b}(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r_b(x, y)\right) \\
&= \frac{1}{\sum_{y} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta}\left(r_a(x, y) + h(x)\right)\right)}\, \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta}\left(r_a(x, y) + h(x)\right)\right) \\
&= \frac{1}{\sum_{y} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r_a(x, y)\right) \exp\left(\frac{1}{\beta} h(x)\right)}\, \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r_a(x, y)\right) \exp\left(\frac{1}{\beta} h(x)\right) \\
&= \pi_{r_a}(y \mid x)
\end{aligned}$$
For a Bradley-Terry model of preferences, it is trivial to show that $p(y_w \succ y_l \mid x)$ is unaffected by $h(x)$, since it is added to the reward of both $y_w$ and $y_l$. We will now show that the two reward functions do not necessarily induce the same distribution of values. Let
$$h_u(x) = \begin{cases} 0 & \text{if } x \neq u \\ C_u & \text{if } x = u \end{cases}$$
where $C_u \neq 0$ is an input-specific constant.
Assume without loss of generality that $y$ is a desirable output for $x$. Let $n_u$ be the number of times $u$ appears in a set of size $n$. For reward functions $r_a, r_b$ with corresponding logistic value functions $v_a, v_b$ such that $r_b(x, y) = r_a(x, y) + h_u(x)$ for some input $u$:
$$\begin{aligned}
v_b(x, y) &= \sigma\left(r_b(x, y) - \mathbb{E}_{x' \sim D,\, y' \sim \pi^*}\left[r_b(x', y')\right]\right) \\
&= \sigma\left(r_a(x, y) + h_u(x) - \mathbb{E}_{x' \sim D,\, y' \sim \pi^*}\left[r_a(x', y') + h_u(x')\right]\right) \\
&= \sigma\left(r_a(x, y) - \mathbb{E}_{x' \sim D,\, y' \sim \pi^*}\left[r_a(x', y')\right] + \left(h_u(x) - \mathbb{E}_{x' \sim D}\left[h_u(x')\right]\right)\right)
\end{aligned}$$
Let $v'_a$ be the derivative of $v_a$, and let $\epsilon$ denote the error term from a first-order Taylor series expansion.
When $u = x$:
$$\begin{aligned}
v_b(x, y) &= \sigma\left(r_a(x, y) - \mathbb{E}_{x' \sim D,\, y' \sim \pi^*}\left[r_a(x', y')\right] + \left(C_u - \frac{n_u}{n} C_u\right)\right) \\
&= v_a(x, y) + \left(C_u - \frac{n_u C_u}{n}\right) v'_a(x, y) + \epsilon \\
&= v_a(x, y) + \left(C_u - \frac{n_u C_u}{n}\right) v_a(x, y)\left(1 - v_a(x, y)\right) + \epsilon
\end{aligned}$$
When $u \neq x$:
$$\begin{aligned}
v_b(x, y) &= \sigma\left(r_a(x, y) - \mathbb{E}_{x' \sim D,\, y' \sim \pi^*}\left[r_a(x', y')\right] - \frac{n_u}{n} C_u\right) \\
&= v_a(x, y) - \frac{n_u C_u}{n}\, v_a(x, y)\left(1 - v_a(x, y)\right) + \epsilon
\end{aligned}$$
Since the rewards are bounded by assumption, we have $v_a \in (0, 1)$. For a $k$-th order Taylor series approximation, we thus have $\epsilon \in O(C_u 2^{-k})$. Even if we generously assume that $\epsilon = 0$, we have $v_b(x, \cdot) \neq v_a(x, \cdot)$ in at least one of these cases (either when $n_u > 0$ or when $n_u = n$). We have thus shown that two bounded reward functions in the same equivalence class can induce the same optimal policy and Bradley-Terry preference distribution but a different distribution of human values.
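A small numerical check, with made-up rewards and a made-up reference policy (none of these numbers come from the paper), illustrates the result: adding an input-dependent constant to the reward leaves the induced optimal policy and the Bradley-Terry preference probabilities unchanged but shifts the logistic values.

```python
import torch

beta = 0.1
pi_ref = torch.tensor([0.2, 0.5, 0.3])   # hypothetical reference policy over three outputs for one input x
r_a = torch.tensor([1.0, 0.5, -0.3])     # hypothetical bounded rewards r_a(x, y)
r_b = r_a + 2.0                          # same equivalence class: r_b = r_a + h(x) with h(x) = 2

def optimal_policy(r):
    # pi*(y|x) is proportional to pi_ref(y|x) * exp(r(x, y) / beta)
    return torch.softmax(torch.log(pi_ref) + r / beta, dim=0)

print(torch.allclose(optimal_policy(r_a), optimal_policy(r_b)))        # True: same optimal policy
print(torch.sigmoid(r_a[0] - r_a[1]), torch.sigmoid(r_b[0] - r_b[1]))  # equal: same Bradley-Terry preference
# The logistic values sigma(r - reference point) differ whenever h(x) is not
# fully absorbed by the reference point (i.e., n_u < n in the notation above):
ref_a = r_a.mean()            # stand-in for the expected reward under r_a
ref_b = ref_a + 0.5 * 2.0     # assume u makes up half the inputs, so only half of h is absorbed
print(torch.sigmoid(r_a - ref_a))
print(torch.sigmoid(r_b - ref_b))  # different value distribution
```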
Theorem Let two humans $a, b$ have value functions $v_a, v_b$ and contradicting preferences $y_1 \succ_a y_2$ and $y_2 \succ_b y_1$ for some input $x$. Assume $\pi_{\text{ref}}(y \mid x) = 0 \implies \pi_\theta(y \mid x) = 0$ for all $x, y$. In the worst case, the optimal policy under DPO decreases the expected value of both humans. In contrast, if each preference is broken up into two examples, then KTO (with default settings) does not change the policy.
Proof. Where $u = \beta \log \frac{\pi_\theta(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} - \beta \log \frac{\pi_\theta(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)}$, we can write the total DPO loss as
$$\hat{\mathcal{L}}_{\text{DPO}}(\pi_\theta, \pi_{\text{ref}}) = \frac{1}{2}\left(-\log \sigma(u)\right) + \frac{1}{2}\left(-\log \sigma(-u)\right)$$
Taking the derivative with respect to $u$ and setting it to zero, we get
$$\begin{aligned}
0 &= -\frac{1}{2}\left(\frac{\sigma(u)\sigma(-u)}{\sigma(u)} - \frac{\sigma(-u)\sigma(u)}{\sigma(-u)}\right) \\
&= \frac{\sigma(u)\left(\sigma(-u)\right)^2}{\sigma(u)\sigma(-u)} - \frac{\left(\sigma(u)\right)^2 \sigma(-u)}{\sigma(u)\sigma(-u)} \\
&= \sigma(-u) - \sigma(u) \\
&\implies u = 0
\end{aligned}$$
Since $\beta > 0$, $u = 0$ can only occur when the rewards of both the preferred and dispreferred outputs are equal. This can be satisfied when $\pi_\theta(y_1 \mid x) = \pi_\theta(y_2 \mid x) = 0$, with the probability mass allocated to examples with lower true reward $r^*$ in the worst case. Since value functions are by definition monotonically non-decreasing, the expected value for both humans would decrease in the worst case.
Where $z_1, z_2$ are the reference-adjusted rewards, we can write the total KTO loss (with default settings $\lambda_D = \lambda_U = 1$) as:
$$\begin{aligned}
\hat{\mathcal{L}}_{\text{KTO}}(\pi_\theta, \pi_{\text{ref}}) &= \frac{1}{4}\left(1 - \sigma(z_1)\right) + \frac{1}{4}\left(1 - \sigma(-z_1)\right) + \frac{1}{4}\left(1 - \sigma(z_2)\right) + \frac{1}{4}\left(1 - \sigma(-z_2)\right) \\
&= \frac{1}{4}\sigma(-z_1) + \frac{1}{4}\left(1 - \sigma(-z_1)\right) + \frac{1}{4}\sigma(-z_2) + \frac{1}{4}\left(1 - \sigma(-z_2)\right) \\
&= \frac{1}{2}
\end{aligned}$$
Therefore the loss is already minimal and no changes are made to the policy.
B. Implementations
SLiC. Instead of sampling from the reference model to calculate $\mathcal{L}_{\text{reg}}$ as Zhao et al. (2023) do (which is very slow), we simply apply the cross-entropy loss to the SFT data, assuming that the reference model recovers the SFT distribution.
DPO. We use the implementation of DPO in the code provided by Rafailov et al. (2023). We found that, as mentioned in the original paper, $\beta = 0.1$ works best for most settings. Other training configurations, such as the learning rate and optimizer, were borrowed from the original paper.
CSFT. The control tokens used for generating the good and bad outputs are <|good|> and <|bad|> respectively, following the precedent set by Korbak et al. (2023).
KTO. We use $\beta = 0.1$ in our experiments unless otherwise specified (the same setting as for DPO), as it is close to optimal for most settings.
PPO. PPO-Clip is the traditional means of optimizing the RLHF objective (2). However, most implementations of PPO-Clip for LLM alignment suffer from instability, particularly during distributed training. We find that running the PPO-Clip objective on offline data with the following "tricks" leads to much more stable training:
• We never update the reference distribution (i.e., the policy only takes one step in the trust region); Baheti et al. (2023) recommend this as well. To accommodate this conservative change, we clip the probability ratios more liberally, finding that an asymmetric interval of [0.25, 4.0] works better than the small symmetric interval (e.g., [0.8, 1.2]) that is traditionally recommended.
• Including a KL penalty (between the policy and reference distributions) in addition to the clipping makes training more stable, as is also done in the implementation by von Werra et al. (2020). We find that it is important to estimate the KL term not over the entire distribution, however, but rather as the mean difference in the predicted log probabilities of the actual output tokens (i.e., the labels). We suspect that this makes a difference because the rest of the distribution can be poorly calibrated.
• The value of a state is generally predicted by a value head attached to the policy model; the value loss is the MSE between the predicted value and the discounted sum of future rewards for each token. This head is a linear layer in many RLHF implementations (von Werra et al., 2020). However, we find that backpropagating the value loss through this head and the policy leads to worse performance. Instead, we make the value head a 3-layer MLP and detach it from the computational graph, so that the value losses are not backpropagated through the policy model but the value head still has sufficient capacity to learn good estimates. A sketch combining these three modifications is given below.
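Below is a minimal PyTorch sketch of how these three modifications might fit together. It is illustrative only: the function and tensor names are ours, the coefficients kl_coef and vf_coef are assumed hyperparameters, the advantages and returns are assumed to be computed offline elsewhere, and details such as padding masks are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetachedValueHead(nn.Module):
    """3-layer MLP value head that consumes detached hidden states, so the value
    loss never backpropagates into the policy model."""
    def __init__(self, hidden_size):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, hidden_states):
        # Detaching here keeps the value loss from reaching the policy weights.
        return self.mlp(hidden_states.detach()).squeeze(-1)


def offline_ppo_clip_loss(policy_logps, ref_logps, advantages, values, returns,
                          clip_low=0.25, clip_high=4.0, kl_coef=0.1, vf_coef=0.1):
    """Offline PPO-Clip sketch with the three modifications described above.

    policy_logps / ref_logps: per-token log-probabilities of the label tokens
    under the policy and the (never-updated) reference model; advantages and
    returns are per-token estimates computed offline.
    """
    ratio = torch.exp(policy_logps - ref_logps)        # token-level ratios q_theta
    # Liberal, asymmetric clipping interval instead of e.g. [0.8, 1.2].
    clipped = torch.clamp(ratio, clip_low, clip_high)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    # KL penalty estimated as the mean log-probability difference on the labels only.
    kl_penalty = (policy_logps - ref_logps).mean()
    # Value loss on the detached MLP head (does not reach the policy weights).
    value_loss = F.mse_loss(values, returns)
    return policy_loss + kl_coef * kl_penalty + vf_coef * value_loss
```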
C. Human Evaluation
For human evaluation, we randomly sampled 256 prompts from the OpenAssistant test set and generated outputs from Mistral 7B models aligned with DPO and KTO. All inputs were multi-turn conversations between a user and an assistant, where the LLM played the role of the assistant (see Table 4 for an example of this format).
These were sent to a third-party data annotation service, where a pool of workers picked either the generated output or the SFT target (from the OpenAssistant dataset) as the more appropriate response by the assistant. Any question that required specific domain expertise (e.g., coding) was skipped, leaving 214 comparisons each for DPO and KTO.
The winrates of the aligned models over the SFT targets are 72.9% ± 5.3 for KTO and 62.1% ± 5.7 for DPO (where the intervals are 90% binomial confidence intervals). In contrast, when GPT-4 is used as the judge instead, the corresponding winrates are 65.2% ± 3.6 for KTO and 60.0% ± 3.7 for DPO. Thus, although there is no significant difference between the two methods under GPT-4-based evaluation, there is a significant difference under human evaluation at p < 0.05. We found that 68.7% of the human judgments over the KTO comparisons concurred with GPT-4; this number fell to 65.9% for DPO.
D. Additional Experiments
Figure 7. AlpacaEval 2 winrates with GPT-4-turbo chain-of-thought as the evaluator. KTO consistently outperforms DPO and the unaligned Mistral 7B base model. KTO also demonstrates greater robustness across varying sampling temperatures.
E. Sample Generations
Table 3. Aligning Zephyr (Tunstall et al., 2023), a derivative of Mistral-7B, on UltraFeedback with KTO instead of DPO improves results across a suite of benchmarks. AlpacaEval 2 was run using GPT-4 Turbo CoT as the LM judge annotator. All evaluations were run with sampling temperature 0.7.

Dataset (→)     MMLU    GSM8k        HumanEval        BBH          TydiQA      AlpacaEval   Avg.
Setup (→)       0-shot  8-shot, CoT  0-shot, Synth.   3-shot, CoT  1-shot, GP  0-shot
Metric (→)      EM      EM           pass@1           EM           F1          % Win
Zephyr-β SFT    57.2    39.0         30.1             46.3         36.3        6.41         35.9
+DPO            58.2    40.0         30.1             44.1         36.5        7.79         36.1
+KTO            58.6    53.5         30.9             52.6         31.2        12.48        39.9
+KTO (unary)    58.0    50.0         30.7             49.9         37.7        11.55        39.6
Prompt <|user|>
I need some show recommendations that are similar to Breaking Bad.
<|assistant|>
Are you looking for something with drama, violence, drugs, politics and murder?
<|user|>
This is exactly what I’m looking for in a show
<|assistant|>
SFT
Well, you’ve just asked for one of the best shows ever made, and they all have those elements
in some combination. There are a lot of shows with those elements.
SFT+KTO
Some great choices for you might be Narcos, Sons of Anarchy, and Better Call Saul. They
are all very popular and have a lot of the same themes as Breaking Bad.
SFT+DPO
Well, you might like *******, which is also a dark, suspenseful drama about political intrigue
and crime. Or you might also like *******, which is also a dark, suspenseful drama about
crime and corruption in law enforcement.
—
My friend has some really bad opinions about TV shows, and I want to correct him, can you
help?
—
Sure! You can try to politely explain to your friend why his or her opinion is incorrect or why
you disagree. However, it’s important to keep in mind that different people enjoy different
things. If your friend has some strongly held opinions about TV shows, there’s no reason
why you need to agree with him. You can both enjoy shows that you like and your friend can
enjoy the shows that he likes. That’s just the way it is.
—
Is there any TV show similar to Breaking Bad that doesn’t have violence in it?
—
It’s very hard to find a TV show that doesn’t have any violence in it at all. However, there are
some TV shows that are not focused on violence. For example, you might like ******* or
*******, which are both TV shows about people who solve crimes without using violence
or killing people.
—
Do you have any shows that are similar to Breaking Bad but without all the crime and
violence?
—
Sure! You might like ******* or *******, which are both dramas with dark themes and
stories, without the high level of violence and brutality in Breaking Bad. Or you might like
*******, which is also a dark and suspenseful drama.
—
Where can I find a list of TV shows similar to Breaking Bad, with all the plot elements and
details?
—
I’m not sure there’s an exact list of TV shows similar to Breaking Bad, but you can search
online to find shows like this. For example, you might like *******, which is also a dark
and suspenseful drama about crime and power ...
SFT+PPO (offline)
Well, you’ve just asked for one of the best shows ever made, and they only made 5 seasons
of it. If you want a similar show with a similarly good story, you might want to check out
Game of Thrones, which is a bit different but also tells an epic story about power, politics
and violence.
Let me know if you need any more recommendations, it’s no problem!
Table 4.
Sample generations from the different aligned versions of Llama-30b for a prompt about show recommendations (all models were
aligned with data following the user-assistant prompt). Note that the SFT answer is not helpful and the SFT+DPO answer hallucinates
multiple turns of the conversation (in fact we had to truncate the answer shown here because the complete answer is too long). The
SFT+PPO (offline) answer is helpful but only provides one recommendation, while SFT+KTO is succinct and provides multiple options.