The Free Transformer
François Fleuret
FAIR at Meta
We propose an extension of the decoder Transformer that conditions its generative process on
random latent variables which are learned without supervision thanks to a variational procedure.
Experimental evaluations show that allowing such a conditioning translates into substantial
improvements on downstream tasks.
1
Since their invention, the Transformer (Vaswani et al., 2017), and more specifically the decoder-only
Transformers used originally for the GPT series of models (Radford et al., 2018), have become the core
components of AI systems.
It is remarkable that, after almost a decade, and in spite of improvements on many aspects of this class of
methods, the autoregressive modelling of Transformers remains essentially unchallenged. We propose in
this paper to revisit this key design aspect by allowing richer and more natural density models to emerge:
• We extend the auto-regressive model of the decoder Transformer by allowing the conditioning on latent variables, thanks to a formulation as a conditional Variational Autoencoder (§3).

• We propose an implementation that requires a very modest computational and memory overhead (§3.2).
The benefits of the proposed method are shown by training 1.5B and 8B models from scratch and assessing performance on multiple downstream benchmarks (§4).
2
Decoder Transformers are auto-regressive discrete density approximators. They model a sequence of tokens S_1, . . . , S_T by estimating the conditional distribution of each given those preceding it. Sampling is done by generating one token after another, each time computing the distribution of the next symbol given those generated so far.
The only density modelling and sampling that such models implement is that of the generated tokens. In
particular, a decoder Transformer does not make additional latent decisions about the stream of symbols
to generate. Its only decisions are the choices of the tokens themselves.
Consider, for instance, that we train such a model to generate movie reviews and that we want to have
two clearly separated categories of negative and positive reviews. Given a large enough model and the
necessary amount of training data, there is no doubt that a decoder Transformer trained on a dataset
of that form would work perfectly and would generate these two types of reviews. However, to do so, it
would generate tokens one after another and decide, based on the words generated so far, whether the
review it is currently generating is a positive or a negative one, and continue the process accordingly. In particular, it would never make this decision explicitly: it would produce tokens, and the notion of a negative or positive review would be implicit in their posterior probabilities.
Due to the chain rule, any density can be modelled as autoregressive. However, in particular when the

“natural” structure involves conditioning on latent variables, the autoregressive model of the signal may be
a great deal more complex than the full joint model including the latent.
We can consider a simple example illustrating that point. Let Z ∼ B(0.5), and let X_1, . . . , X_T each be equal to Z with probability 1 − ϵ, independently. The X_t's are conditionally independent given Z, with

P(X_{t+1} = 1 | Z = z) = (1 − ϵ) z + ϵ (1 − z);
however, expressed as an auto-regressive model without Z, we get

P(X_{t+1} = 1 | X_1 = x_1, . . . , X_t = x_t)
  = [ (1 − ϵ)^{n_t + 1} ϵ^{t − n_t} + ϵ^{n_t + 1} (1 − ϵ)^{t − n_t} ] / [ (1 − ϵ)^{n_t} ϵ^{t − n_t} + ϵ^{n_t} (1 − ϵ)^{t − n_t} ],

where n_t = Σ_{s=1}^{t} x_s.
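The two expressions can be checked to agree numerically. Below is a small self-contained sketch of that check (the value of ϵ and the prefix length are arbitrary choices of ours):

    import random

    eps = 0.1

    def p_next_by_marginalizing(xs):
        # P(X_{t+1} = 1 | x_1..x_t) obtained by marginalizing the latent Z,
        # with P(X = 1 | Z = 1) = 1 - eps and P(X = 1 | Z = 0) = eps.
        w1, w0 = 0.5, 0.5
        for x in xs:
            w1 *= (1 - eps) if x == 1 else eps
            w0 *= eps if x == 1 else (1 - eps)
        return (w1 * (1 - eps) + w0 * eps) / (w1 + w0)

    def p_next_closed_form(xs):
        # Closed-form auto-regressive conditional from the text.
        t, n = len(xs), sum(xs)
        num = (1 - eps) ** (n + 1) * eps ** (t - n) + eps ** (n + 1) * (1 - eps) ** (t - n)
        den = (1 - eps) ** n * eps ** (t - n) + eps ** n * (1 - eps) ** (t - n)
        return num / den

    xs = [1 if random.random() < 0.5 else 0 for _ in range(10)]
    assert abs(p_next_by_marginalizing(xs) - p_next_closed_form(xs)) < 1e-12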
We could easily come up with examples that are even worse when expressed autoregressively, for instance when the latent variables are positions in the sequence, e.g. the index where a certain pattern occurs, as in the example of §4.1. What we observe in such cases is that the autoregressive formulation requires running estimates of probabilities (“probability that the target appears here”) for which estimation errors are unavoidable and problematic.
The consequence is that a purely auto-regressive density model potentially suffers from several drawbacks:

• It requires an unnecessarily complicated computation, and greater capacity, to implicitly make post-hoc decisions or infer latent quantities from the generated tokens.

• It may be sent off track during the process if, by mistake, a few generated tokens are erroneous, ambiguous, or contradictory with those generated previously.

• Key concepts do not appear spontaneously due to the “natural” factorization of the distribution, but are built post-hoc by necessity to fit the training samples better. This may be a fundamental weakness when operating out of distribution.
The main objective of the present work is to address these issues by providing the model with the freedom
of conditioning its auto-regressive process on latent random quantities that are not imposed by the training
examples.
For instance, for the review generator example above, the model could use a random Boolean value to decide once and for all whether the tokens it produces are from the distribution of negative or positive reviews, removing the need for a complicated posterior estimate from the tokens already generated.
3
Any latent random variable Y_r, whatever its statistical dependency on the tokens S_1, . . . , S_t and the other latents Y_1, . . . , Y_{r−1} sampled so far, can be expressed under reasonable assumptions as f_r(S_1, . . . , S_t, Y_1, . . . , Y_{r−1}, Z_r), where Z_r is a value coming from a random generator.

Hence, if we provide the model with enough random values Z_1, Z_2, . . . sampled independently during generation, a proper training procedure could in principle build families of latent variables with an arbitrary dependency structure, as long as the model's capacity allows it to encode the functions f_r.
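As a tiny illustration of this reparameterization, a Bernoulli latent whose parameter depends on the history can be written as a deterministic function of that history and an independent uniform noise (a minimal sketch; the names are ours and purely illustrative):

    import random

    def sample_latent(p_given_history: float, z: float) -> int:
        # A latent decision expressed as f(history, Z) with Z ~ U(0, 1):
        # the history enters only through the probability p_given_history.
        return 1 if z < p_given_history else 0

    y = sample_latent(0.3, random.random())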
In the same way that the choice of a token during sampling can be expressed as a function of a random
value and the logits, any activation which is a function of a random value and other activations can be
interpreted as a decision made by the model during the generative process. Such decisions make the latent activations non-deterministic functions of the tokens, and observing the latter gives only partial information about the former.

Figure 1: A standard decoder Transformer (a) can be conditioned on a random state Z in inference (b, left), in which case it has to be trained as a conditional VAE with an encoder (b, right). The Free Transformer (c) reduces the overhead of the encoder by having the random state Z injected in its middle layer (c, left), and by using as encoder during training the combination of its first half with one non-causal layer specific to the encoder (c, right). See Figure 2 for the detailed structure.
3.1
Generating a full sequence from scratch with a model that depends on a random variable Z is trivial: sample Z ∼ P(Z), and then generate the sequence auto-regressively with the decoder modulated by Z.

Training the model, however, is far more involved. Given a training sample S, the objective is to maximize

P(S) = ∫_z P(S | Z = z) P(Z = z) dz,

which can be estimated only if we can get Zs consistent with S, which amounts to a complex inference problem if we want an accurate estimate.

Providing those Zs is the role of the encoder of a Variational Autoencoder (Kingma and Welling, 2013), whose main purpose is to sample from a “good” distribution Q(Z | S), that is, one such that Z modulates the decoder in a way that leads it to generate S.
We follow this approach and optimize jointly the parameters of the decoder and the parameters of a
second model, which is an encoder in the VAE sense.
Even though the noise Z has no relation to S initially, if the training succeeds, the model will use it to structure the generative process. In the example of a movie review generator of the previous section, for instance, given a review from the training set, the encoder would implicitly classify it as positive or negative, and generate a consistent Z. Increasing P(S | Z) for such a Z could then be interpreted as improving the “negative review generator” or the “positive review generator” that are implicitly encoded in the decoder's weights.
A key element of this approach is to limit the amount of information flowing from the encoder to the decoder through Z, so that the encoder does not provide quantities that should be computed by the decoder. In the limit, the encoder could copy S entirely into Z, so that a trivial decoder, useless without the encoder and hence in inference, would score perfectly during training.
The formal derivation of the VAE shows that the proper measure of information is the Kullback-Leibler divergence between Q(Z | S) and P(Z), and that the loss to minimize should sum it with the reconstruction loss, which here is the usual cross-entropy.

Figure 2: Detailed structure of the Free Transformer: the embeddings of S_{1:T−1} feed the first L/2 decoder causal Transformer blocks, whose output goes both to the encoder non-causal Transformer block (with queries coming from a learned embedding ζ), followed by the encoder read-out FC and the Binary Mapper that produces Z, and, through the post-sampler FC, to the remaining L/2 decoder causal Transformer blocks and the decoder read-out FC that produces the logits for S_{2:T}. We omit the batch size from the tensor shapes for clarity. The operators in orange are specific to the encoder and are evaluated only for training or KV cache pre-filling; those with a dashed contour have no trainable parameters. The Binary Mapper is described in §3.4. During generation, the encoder is not evaluated and Z is sampled uniformly among the one-hot vectors of dimension 2^H.

Algorithm 1  Forward pass of a vanilla decoder Transformer
1: procedure FORWARD(tokens)
2:   x ← embed(tokens)
3:   for n = 1, . . . , B do
4:     x ← block[n](x)
5:   end for
6:   logits ← linear_readout(RMS_norm(x))
7:   return logits
8: end procedure
3.2
In what follows, we call “Transformer Block” the usual combination of a Multi-Head Attention layer and a
MLP-like tokenwise module, with normalisation layers and residual connections.
As pictured in Figure 1, the Free Transformer is a standard decoder Transformer with a random state Z injected in its middle layer. This allows the encoder to share the first half of the Transformer blocks with the decoder, cutting down drastically the computational overhead, since a single Transformer block has to be computed specifically for the encoder. Hence, as we will see, this model possesses all the components of a decoder Transformer and has, in addition, one non-causal block and two linear layers for the encoder. While we did not investigate what the best depth to inject Z is, doing it too early would reduce the encoder's capacity, and doing it too late would reduce the decoder's capacity to process the latent variables.
For clarity, we omit in what follows the batch size in the tensor shapes.
As a standard decoder Transformer, the Free Transformer processes a sequence of tokens by first encoding them with the embedding table into a tensor X^0 of shape T × D, where D is the model dimension. Then it evaluates sequentially the first L/2 Transformer blocks to get X^{L/2} of the same shape, and at this point it samples a sequence of one-hot vectors Z = (Z_1, . . . , Z_T) ∈ {0, 1}^{T×C}. During generation, this is done by sampling, for each Z_t, an index c uniformly in {0, . . . , C − 1}, and then encoding it as a one-hot vector of dimension C. During training or KV cache pre-filling, Z has to be consistent with the tokens of S.
This tensor Z is processed by a linear layer to obtain a tensor R of shape T × D. Then, the (L/2 + 1)-th Transformer block gets as input for its queries the tensor X^{L/2}, and as input for its keys and values the tensor X^{L/2} + R. The rest of the Transformer blocks are evaluated in sequence to get X^L, which is processed by the read-out linear layer to obtain the logit tensor.
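To make the injection step concrete, here is a minimal PyTorch sketch of it; the module and argument names are ours, and `block` stands for any Transformer block exposing separate query and key/value inputs, which is an assumption of this sketch:

    import torch
    import torch.nn as nn

    class LatentInjection(nn.Module):
        # Sketch of the injection of Z at block L/2 + 1.
        def __init__(self, block: nn.Module, latent_dim: int, d_model: int):
            super().__init__()
            self.block = block
            # "Post-sampler FC" mapping the one-hot Z of dimension 2^H to the model width.
            self.post_sampler = nn.Linear(latent_dim, d_model, bias=False)

        def forward(self, x_half: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
            # x_half: (T, D) output of the first L/2 blocks, z: (T, 2^H) one-hot.
            r = self.post_sampler(z)
            # Queries read X^{L/2}; keys and values read X^{L/2} + R.
            return self.block(queries=x_half, keys_values=x_half + r)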
3.3
As stated in the previous section, during training or KV cache pre-filling, the tensor Z is sampled with the encoder.
The Free Transformer possesses one Transformer block specific to the encoder, which is non-causal, making the encoder as a whole non-causal. This is necessary since the conditioning used by the decoder may have long-range effects, requiring the full sequence to be taken into account to get a proper conditional distribution of the latent.
This encoder-specific block gets as input for the queries a trained token embedding ζ replicated to match the sequence length, and for the keys and values the output of the first half of the decoder's blocks. The motivation for using a learned constant input for the queries, instead of the standard representation of the input sequence, is to prevent the encoder from building a token-wise mapping, and to make it instead capture global properties of the sequence that may be more transferable across tasks and data-sets.
A linear read-out computes, from the encoder block's output, a vector of dimension H = 16.

Algorithm 2  Forward pass of the Free Transformer, with the encoder evaluated only for training and KV cache pre-filling
1: procedure FORWARD(tokens, train)
2:   x ← embed(tokens)
3:   for n = 1, . . . , B/2 do
4:     x ← block[n](x)
5:   end for
6:   if train or pre-filling then
7:     y ← encoder_block(ζ, x)                    ▷ queries from ζ, keys/values from x
8:     o ← linear_encoder_readout(RMS_norm(y))
9:     z ← binary_mapper(o)
10:  else
11:    z ← one_hot(uniform{0, . . . , 2^H − 1})
12:  end if
13:  in_kv ← x + linear_post_sampler(z)
14:  x ← block[B/2 + 1](x, in_kv)                 ▷ queries from x, keys/values from in_kv
15:  for n = B/2 + 2, . . . , B do
16:    x ← block[n](x)
17:  end for
18:  logits ← linear_readout(RMS_norm(x))
19:  return logits
20: end procedure
20:
These components are interpreted as the logits of individual bits, and are used to sample a value in {0, . . . , 2^H − 1}, which is encoded into a one-hot vector of dimension 2^H = 65,536, with gradient pass-through, as described in §3.4.
Hence, the random embedding Z is a sequence of T one-hot vectors Z_t of dimension 2^H. The prior distribution used for generation is the uniform P(Z_t = z) = 1/2^H, and Q(Z_t = z | S_1, . . . , S_T) is the distribution corresponding to the sampling with the encoder described above. The KL divergence is then equal to

D_KL( Q(Z_t | S_1, . . . , S_T) ‖ P(Z_t) ) = Σ_{z=1}^{2^H} Q(Z_t = z | S_1, . . . , S_T) log [ Q(Z_t = z | S_1, . . . , S_T) / P(Z_t = z) ].
We control it by adding it to the loss, and prevent its collapse by using a token-wise free bits method (Kingma et al., 2016). This means that we sum the KL divergences of the individual Z_t that are above a threshold κ.
This leads us to use as the training loss the sum of the standard cross-entropy and the following quantity:

(1/T) Σ_{t=1}^{T} max( 0, D_KL( Q(Z_t | S_1, . . . , S_T) ‖ P(Z_t) ) − κ ),

where κ is the free-bits threshold.
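Because the encoder's distribution over Z_t factorizes over the H bits and the prior is uniform over the 2^H values, this per-token KL term can be computed bit-wise. Below is a minimal PyTorch sketch of the free-bits term under that assumption (the function name and numerical stabilization are ours):

    import math
    import torch

    def free_bits_kl(bit_logits: torch.Tensor, kappa: float) -> torch.Tensor:
        # bit_logits: (T, H) logits L_{t,h} produced by the encoder read-out.
        # The uniform prior over 2^H values is a product of Bernoulli(0.5), so
        # the KL is the sum over bits of KL(Bernoulli(p_{t,h}) || Bernoulli(0.5)).
        p = torch.sigmoid(bit_logits)
        kl_per_bit = (p * (torch.log(p + 1e-9) - math.log(0.5))
                      + (1 - p) * (torch.log(1 - p + 1e-9) - math.log(0.5)))
        kl_per_token = kl_per_bit.sum(dim=-1)                 # shape (T,)
        # Token-wise free bits: only the excess above the threshold is penalized.
        return torch.clamp(kl_per_token - kappa, min=0.0).mean()

    # e.g. loss = cross_entropy + free_bits_kl(bit_logits, kappa=0.5 * math.log(2.0))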
3.4
The last linear layer of the encoder computes, for every index t of the sequence being processed, a vector L_t = (L_{t,1}, . . . , L_{t,H}) ∈ R^H, whose components are interpreted as the logits of the individual bits of a binary encoding.

The Binary Mapper samples those bits B_{t,1}, . . . , B_{t,H} independently with

P(B_{t,h} = 1) = 1 / (1 + e^{−L_{t,h}}),
and outputs a one-hot vector Y_t of dimension 2^H corresponding to the resulting value:

Y_{t,d} = 1 if d = Σ_{h=1}^{H} 2^{h−1} B_{t,h}, and 0 otherwise.    (7)

During training, the computation also propagates the gradient of the probabilities of the 2^H values. If U(d) = (U_1(d), . . . , U_H(d)) ∈ {0, 1}^H is the binary encoding of d, and we define G_{t,d} as

G_{t,d} = P( Σ_h 2^{h−1} B_{t,h} = d )
        = exp( Σ_h log P(B_{t,h} = U_h(d)) )
        = exp( Σ_h (1 − U_h(d)) log( 1 − 1/(1 + e^{−L_{t,h}}) ) + U_h(d) log( 1/(1 + e^{−L_{t,h}}) ) ),

then the Binary Mapper outputs

Y_{t,d} + G_{t,d} − detach(G_{t,d}),

where detach(x) is the operator such that detach(x) = x in the forward pass and ∂ detach(x)/∂x = 0.
The motivation for using a binary encoding, instead of having the encoder output 2^H logits directly, is to facilitate the gradient pass-through thanks to the monotonicity of the sigmoid.
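Here is a minimal PyTorch sketch of such a Binary Mapper with gradient pass-through, following the expressions above; the function name, the tensor layout, and the small numerical stabilization constant are ours:

    import torch
    import torch.nn.functional as F

    def binary_mapper(bit_logits: torch.Tensor) -> torch.Tensor:
        # bit_logits: (T, H). Returns a (T, 2^H) tensor that equals the one-hot Y_t
        # in the forward pass, while gradients flow through the probabilities G_{t,d}.
        T, H = bit_logits.shape
        p = torch.sigmoid(bit_logits)                        # P(B_{t,h} = 1)
        bits = torch.bernoulli(p)                            # sampled bits, (T, H)
        weights = 2 ** torch.arange(H, device=bit_logits.device)
        idx = (bits.long() * weights).sum(dim=-1)            # value encoded by the bits
        y = F.one_hot(idx, num_classes=2 ** H).float()       # (T, 2^H)
        # Binary encodings U(d) of all 2^H values, shape (2^H, H).
        u = ((torch.arange(2 ** H, device=bit_logits.device)[:, None]
              >> torch.arange(H, device=bit_logits.device)[None, :]) & 1).float()
        # log G_{t,d} = sum_h U_h(d) log p_{t,h} + (1 - U_h(d)) log(1 - p_{t,h})
        log_g = torch.log(p + 1e-9) @ u.t() + torch.log(1 - p + 1e-9) @ (1 - u).t()
        g = log_g.exp()                                      # (T, 2^H)
        # Forward value is the one-hot y; the gradient is that of g (pass-through).
        return y + g - g.detach()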
4
We first test the qualitative behavior of the Free Transformer on a synthetic task in §4.1, then compare it on multiple benchmarks to baselines with 1.5B and 8B parameters for various KL divergence thresholds in §4.4, and finally assess the performance gain of an 8B-parameter model trained on 1T tokens in §4.5.
4.1
To confirm that the Free Transformer indeed utilizes Z to condition its generative process, we designed a synthetic dataset and trained a small Free Transformer with different free-bits thresholds. Doing so allows us to observe what aspects of the modeling are packed by the encoder into Z.
Each sequence in our synthetic training set is generated as follows:

• start with a sequence of underscores “_”,

• pick an upper case letter and a position in the sequence at random, and replace the underscores there with a “target” made of the selected letter repeated 8 times,

• replace each character with an exclamation mark “!” independently with probability 1/16,

• prefix the sequence with the selected letter followed by “>”.
A few sequences generated with that process are shown in Figure 3.
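For illustration, here is a minimal sketch of such a generator; the sequence length is an assumption of ours, while the 8-character target and the 1/16 noise rate come from the description above:

    import random
    import string

    def make_sequence(length: int = 64) -> str:
        seq = ["_"] * length
        # Pick the target letter and its position, and write it 8 times.
        letter = random.choice(string.ascii_uppercase)
        pos = random.randrange(length - 8 + 1)
        seq[pos:pos + 8] = [letter] * 8
        # I.i.d. exclamation-mark noise.
        seq = ["!" if random.random() < 1 / 16 else c for c in seq]
        # Prompt indicating the target's letter.
        return letter + ">" + "".join(seq)

    for _ in range(3):
        print(make_sequence())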
We trained a Free Transformer on this data for four different values of the free-bits threshold κ, and generated with the same random prompt three groups of sequences with each model, as pictured in Figure 4. For each model, the noise Z is sampled independently for each sequence of the blue group, whereas a single Z is sampled and shared by all the sequences of each green group.

For very low values of the KL divergence, the model behaves like a vanilla model (Figure 4, middle left), and when the value increases, the model encodes initially the position of the target alone in the latent state (Figure 4, middle right), then encodes both the target position and the noise (Figure 4, bottom left), and finally encodes the full sequence, resulting in incorrect generation (Figure 4, bottom right).
4.2
For assessing performance on standard benchmarks we used decoder-only Transformers implemented in the same Meta FAIR Transformer codebase as the one used by Copet et al. (2025) for the Code World Model.

K>!_________!_______!_______!____________!_______KKKKKKKK_________
C>___CCCCCCCC_________________!______!_____________!__!__!________
X>___________________!!_________!!___XX!XXXXX_____!_______________
R>!__RRRRRRRR_____________!__!_______________________________!____
P>!__!___________________________________________________PPPPPPPP_
L>_______!_!LLLLLLLL________!___________!!____________!___________
V>__!_________________!__!________VVVVVV!V________!_____!____!____
P>_________PPPPPPPP_____!________________!_______________________!
A>_______!___________!___________________________!_______AAAAAAA!_
P>____________________!____PPPPPP!P____!___________!_________!__!!
I>__________________________________________!_!__IIIIIIII_________
D>______!_!___________________________!_________!DDDDDDD__________
A>_____!___AAAAAAA!_______________!_________________!______!______
J>_______!_____!_________J!JJJJJJ_____________!___________________
Figure 3: Sequences from our synthetic dataset: underscores, a target letter repeated eight times at a random position, an i.i.d. noise of exclamation marks, and a prompt indicating the target's letter.
T>_________________________________TTTTTTTT_______________________
T>________________________________TTTTTTTT________________________
T>_________________________________TTTTTTTT_______________________
T>_____________________________________!TTTTTTTT__________________
T>_____________________________________________!_______TTTTTTTT___
T>_________________________________TTTTTTTT_______________________
T>_____________________________________________________TTTTTTTT___
T>______________________________________________________!TTTTTTTT_
T>___________________________!____________________________TTTTTTTT
T>_______________________________________________________TTTTTTTT_
T>_____________________________________________________TTTTTTTT___
T>_____________________________________________________TTTTTTTT___
T>___________________________________TTTTTTTT______!_____!____!___
T>_____________________________________________TTTTTTTT___________
T>__________________________________TTTTTTTT______________________
κ/64
F>_______________________________________________________FFFFFFFF_
F>___________________FFFFFFFF__________!__________!_______________
F>_________________FFFFFFFF________________________!____________!_
F>___________________________________FFFFFFFF__________________!__
F>____________________________________________!FFF!FFFF___________
F>_______________________!____________________________FFFFFFFF____
F>____________________________________________________FFFFFFFF____
F>____________________________________________________FFFFFFFF!___
F>___________________________________________________FFFFFFFF!____
F>_____________________________________________________FFFFFFFF___
F>_________________________FFFFFFFF!_________________!____________
F>__________!____________FF!FFFFF_________________________________
F>________________________FFFFFFFF___________________!____________
F>_______________________FFFFFFFF_______________!______!__________
F>_______________________FFFFFFFF______________________!__________
κ/8
J>JJJJJJJ!____!_________!!__!_!_!__________!___________!___!__!___
J>_____!_____!______!______!_JJJJJJJJ______________________!______
J>___JJJ!JJJJ____________!__!___!_!__!_____!_____!!__!_____!___!__
J>__!___________JJJJJJJJ___________________!________!____!________
J>______!!___!!!_____________JJJJJJJJ!______!!!_!_!___!___________
J>_________JJJ!JJJJ__!______________!__________!!___!!_________!__
J>_________JJJ!JJJJ__!______________!______!____!__!!__________!__
J>_________JJ!JJJJJ__!_______!________!________!!__!!__________!__
J>_________JJJ!JJJJ__!________________!!____!!_!!____!_________!__
J>___!_____JJ!JJJJJ__!______________!_______!__!!_______!______!__
J>__JJJJJJJJ______!___________!____!_!______!_______!__!________!_
J>__JJJJJJJJ______!___________!____!_!________!__!__!__!________!_
J>_JJJJJJJJ_______!_______!________!_!______!_______!__!_______!__
J>_JJJJJJJJ_______!________________!_!________!__!____!!___!____!_
J>_JJJJJJJJ_______!___________!_____!_!_____!_______!__!_______!__
κ
O>___________!!________!__!________________!_____!_______!______!!
O>____OOOOO______________________________________________________!
O>O!___O_!__!_!__!__!_OO____!!__OO_________________!!________!___!
O>_____!_____!_____________!_______________!_F___!!_!_______!_____
O>__OOOO!OO!___OO____!____O_!________________________O!____O_____!
O>_________OOOO________O__________!_____________________!!_!O____!
O>_________OO!O________O____O_____!_____________________!!!!O____!
O>_________OO!O________O____O_____!________________O____!__!O____!
O>_________OOOO________O____O_____!_____________________!!_!O____!
O>_________OOOO________O____O_____!______________________!_!O____!
O>__!___OO______________!___OO!O________O!O____________O_____O___!
O>O_!__OOO__________________OOOO________O!O____________O________!!
O>__!___OO__________________OO__________O!O____________O_____O____
O>__!___OO__________________OO__________O!O____O_______O_____O__!!
O>O_!__OOO_________!OO__!___OO__________O!O____________O_____O___!
κ
Figure 4: Sequences generated with the same prompt by Free Transformers trained on the synthetic dataset with different free-bit thresholds. To investigate the information encoded in the latent tensor, we sample a Z per sequence in each blue box, and a single Z per green box. For very low values of the KL divergence, the model behaves like a vanilla model (top left), and when the KL divergence increases, the model encodes initially the position of the target alone in the latent state (top right), then encodes both the target position and the noise (bottom left), and finally encodes the full sequence, resulting in incorrect generation (bottom right).

World Model. Those are well optimized models using the SwiGLU non-linearity (Shazeer,), pre-
normalization with RMSNorm (Zhang et al.,), Rotary Positional Embedding (RoPE,),
and Group Query Attention (GQA,). The vocabulary size is
17
≈k
We used two sizes of models:

• a 1.5B model, trained on 47B tokens with 32 H100s,

• an 8B model with the structure of a Llama-3 8B, trained either on 200B tokens, which takes ≈24 hours, or on 1T tokens, which takes 5 days.
We compare those baselines to the equivalent Free Transformers, which require one additional layer for the encoder during training and KV cache pre-filling, resulting in a compute and memory overhead of 1/28 ≈ 3.6% for the 1.5B model and 1/32 ≈ 3.1% for the 8B model.
4.3
We kept our findings as clear as possible by avoiding other sources of performance improvement:

• We stuck to the baseline architecture, optimizer, and learning rate schedule that were used to train the baselines in FAIR's framework, and did not optimize any hyperparameter for our setup.

• We avoided any ad hoc recipes for the VAE components, such as removing the sampling in inference, and followed the formal expressions rigorously.

• We kept the number of possible values of each Z_t, 2^H = 65,536, comparable to the vocabulary size of 2^17.
We stress that the optimization hyperparameters were highly tuned for the baselines, and it is probable
that a combination of an encoder and a decoder has specific requirements that would greatly benefit from
an adapted training procedure.
4.4
We ran a series of experiments to assess the general behavior of the Free Transformer, and to calibrate the free-bits threshold κ.
For any value of κ, the cross-entropy goes down regularly during training, with no more instability and spikes than what happens with the baselines. The KL divergence rapidly goes under κ and stays there. When we compare the cross-entropies for various κ, they go down when κ increases, as expected, but the values remain extremely close, with differences of the order of 0.01.
For both sizes of models, setting κ = 4 log 2, corresponding to 4 bits of information per token, leads to a collapse of the cross-entropy, indicating that the encoder found a way to channel fully the tokens to predict, and resulting in a collapse of performance on the downstream tasks. It is noteworthy that the baseline 8B model reaches during training a cross-entropy of 1.8 ≈ 2.59 log(2), that is, less than 3 bits per token, which may explain why allowing 4 bits per token through the latent state is enough to trigger such a collapse.
The performance on downstream tasks is given in Tables 1 and 2 for the 1.5B and the 8B models respectively, both for four different values of κ corresponding to 1/4, 1/2, 1, and 2 bits. The curves of performance during training are given in Appendix C.
We observe a substantial increase of performance on HumanEval+, MBPP, and GSM8K, which are arguably the benchmarks requiring some form of reasoning, and there is also a clear improvement for the 8B model with 1/2 bit of KL divergence on MMLU and CSQA, which are multi-choice benchmarks.

1.5B models (47B tokens)

Benchmark (metric)             | Baseline | FT 1/4 bit      | FT 1/2 bit      | FT 1 bit        | FT 2 bits
Generative code/math
human_eval_plu (pass@1)        | 0.055    | 0.079 (+44.44%) | 0.079 (+44.44%) | 0.085 (+55.56%) | 0.085 (+55.56%)
mbpp (pass@1)                  | 0.112    | 0.144 (+28.57%) | 0.148 (+32.14%) | 0.152 (+35.71%) | 0.122 (+8.93%)
gsm8k (em)                     | 0.025    | 0.028 (+12.12%) | 0.027 (+6.06%)  | 0.033 (+30.30%) | 0.027 (+6.06%)
Multi-choice general knowledge / common sense
mmlu (macro_avg/acc_char)      | 0.252    | 0.265 (+5.31%)  | 0.261 (+3.76%)  | 0.254 (+1.07%)  | 0.257 (+2.19%)
csqa (acc_char)                | 0.199    | 0.175 (-11.93%) | 0.199 (+0.00%)  | 0.187 (-6.17%)  | 0.197 (-0.82%)
hellaswag (acc_char)           | 0.593    | 0.591 (-0.40%)  | 0.594 (+0.15%)  | 0.592 (-0.27%)  | 0.595 (+0.32%)
winogrande (acc_char)          | 0.603    | 0.604 (+0.13%)  | 0.598 (-0.79%)  | 0.600 (-0.52%)  | 0.597 (-1.05%)
obqa (acc_completion)          | 0.446    | 0.450 (+0.90%)  | 0.468 (+4.93%)  | 0.460 (+3.14%)  | 0.490 (+9.87%)
arc_challenge (acc_completion) | 0.400    | 0.392 (-1.93%)  | 0.386 (-3.43%)  | 0.405 (+1.29%)  | 0.385 (-3.65%)
arc_easy (acc_completion)      | 0.596    | 0.602 (+0.92%)  | 0.592 (-0.64%)  | 0.603 (+1.06%)  | 0.592 (-0.71%)
piqa (acc_char)                | 0.734    | 0.736 (+0.22%)  | 0.738 (+0.52%)  | 0.734 (+0.07%)  | 0.733 (-0.15%)
Multi-choice text understanding
race.high (acc_char)           | 0.390    | 0.382 (-2.20%)  | 0.390 (+0.00%)  | 0.387 (-0.81%)  | 0.386 (-1.03%)
race.middle (acc_char)         | 0.532    | 0.511 (-3.93%)  | 0.519 (-2.49%)  | 0.522 (-1.83%)  | 0.514 (-3.40%)
boolq (acc_completion)         | 0.583    | 0.632 (+8.39%)  | 0.614 (+5.35%)  | 0.648 (+11.12%) | 0.620 (+6.29%)
Culture
nq (em)                        | 0.081    | 0.069 (-15.36%) | 0.073 (-9.56%)  | 0.075 (-7.17%)  | 0.071 (-11.95%)
tqa (em)                       | 0.205    | 0.191 (-6.93%)  | 0.190 (-7.58%)  | 0.200 (-2.84%)  | 0.197 (-4.13%)
Table 1: Performance on the downstream tasks of the 1.5B models trained on 47B tokens. The optimization hyperparameters were tuned for the baseline and kept unchanged, but the Free Transformers require 3.6% more computation. See Figure 5 for the curves of performance during training.
8B models (200B tokens)

Benchmark (metric)             | Baseline | FT 1/4 bit      | FT 1/2 bit      | FT 1 bit        | FT 2 bits
Generative code/math
human_eval_plu (pass@1)        | 0.159    | 0.171 (+7.69%)  | 0.189 (+19.23%) | 0.165 (+3.85%)  | 0.177 (+11.54%)
mbpp (pass@1)                  | 0.278    | 0.330 (+18.71%) | 0.306 (+10.07%) | 0.298 (+7.19%)  | 0.318 (+14.39%)
gsm8k (em)                     | 0.086    | 0.079 (-8.77%)  | 0.095 (+9.65%)  | 0.104 (+20.18%) | 0.096 (+10.53%)
Multi-choice general knowledge / common sense
mmlu (macro_avg/acc_char)      | 0.359    | 0.337 (-6.13%)  | 0.398 (+10.97%) | 0.365 (+1.81%)  | 0.345 (-4.00%)
csqa (acc_char)                | 0.356    | 0.292 (-17.93%) | 0.450 (+26.21%) | 0.346 (-2.99%)  | 0.324 (-8.97%)
hellaswag (acc_char)           | 0.735    | 0.737 (+0.26%)  | 0.737 (+0.26%)  | 0.732 (-0.45%)  | 0.738 (+0.39%)
winogrande (acc_char)          | 0.680    | 0.667 (-1.86%)  | 0.664 (-2.32%)  | 0.664 (-2.32%)  | 0.667 (-1.86%)
obqa (acc_completion)          | 0.522    | 0.508 (-2.68%)  | 0.484 (-7.28%)  | 0.530 (+1.53%)  | 0.554 (+6.13%)
arc_challenge (acc_completion) | 0.465    | 0.483 (+3.87%)  | 0.468 (+0.55%)  | 0.452 (-2.95%)  | 0.485 (+4.24%)
arc_easy (acc_completion)      | 0.677    | 0.676 (-0.25%)  | 0.665 (-1.81%)  | 0.668 (-1.44%)  | 0.679 (+0.31%)
piqa (acc_char)                | 0.774    | 0.780 (+0.77%)  | 0.782 (+1.05%)  | 0.785 (+1.41%)  | 0.793 (+2.46%)
Multi-choice text understanding
race.high (acc_char)           | 0.433    | 0.447 (+3.30%)  | 0.443 (+2.25%)  | 0.444 (+2.58%)  | 0.435 (+0.53%)
race.middle (acc_char)         | 0.594    | 0.592 (-0.35%)  | 0.591 (-0.47%)  | 0.587 (-1.17%)  | 0.584 (-1.64%)
boolq (acc_completion)         | 0.705    | 0.632 (-10.37%) | 0.632 (-10.33%) | 0.687 (-2.47%)  | 0.671 (-4.82%)
Culture
nq (em)                        | 0.181    | 0.183 (+1.38%)  | 0.167 (-7.67%)  | 0.173 (-4.14%)  | 0.168 (-6.90%)
tqa (em)                       | 0.440    | 0.438 (-0.28%)  | 0.443 (+0.80%)  | 0.434 (-1.19%)  | 0.446 (+1.45%)
Table 2: Performance on the downstream tasks of the 8B models trained on 200B tokens. The optimization hyperparameters were tuned for the baseline and kept unchanged, but the Free Transformers require 3.1% more computation. See Figure 6 for the curves of performance during training.

8B models (1T tokens)

                               | Final value                    | Average (last third)
Benchmark (metric)             | Baseline | FT 1/2 bit          | Baseline | FT 1/2 bit
Generative code/math
human_eval_plu (pass@1)        | 0.268    | 0.299 (+11.36%)     | 0.245    | 0.256 (+4.22%)
mbpp (pass@1)                  | 0.428    | 0.440 (+2.80%)      | 0.396    | 0.421 (+6.08%)
gsm8k (em)                     | 0.321    | 0.331 (+2.83%)      | 0.280    | 0.296 (+5.84%)
Multi-choice general knowledge / common sense
mmlu (macro_avg/acc_char)      | 0.592    | 0.623 (+5.20%)      | 0.567    | 0.596 (+5.16%)
csqa (acc_char)                | 0.707    | 0.748 (+5.79%)      | 0.689    | 0.733 (+6.28%)
hellaswag (acc_char)           | 0.799    | 0.799 (-0.01%)      | 0.787    | 0.788 (+0.18%)
winogrande (acc_char)          | 0.739    | 0.735 (-0.53%)      | 0.725    | 0.727 (+0.27%)
obqa (acc_completion)          | 0.564    | 0.562 (-0.35%)      | 0.556    | 0.551 (-0.86%)
arc_challenge (acc_completion) | 0.542    | 0.535 (-1.42%)      | 0.524    | 0.522 (-0.40%)
arc_easy (acc_completion)      | 0.721    | 0.711 (-1.41%)      | 0.706    | 0.711 (+0.68%)
piqa (acc_char)                | 0.805    | 0.812 (+0.88%)      | 0.802    | 0.807 (+0.61%)
Multi-choice text understanding
race.high (acc_char)           | 0.473    | 0.463 (-2.06%)      | 0.467    | 0.460 (-1.55%)
race.middle (acc_char)         | 0.632    | 0.634 (+0.33%)      | 0.623    | 0.624 (+0.16%)
boolq (acc_completion)         | 0.713    | 0.725 (+1.63%)      | 0.755    | 0.754 (-0.10%)
Culture
nq (em)                        | 0.248    | 0.247 (-0.22%)      | 0.229    | 0.227 (-0.76%)
tqa (em)                       | 0.583    | 0.577 (-1.00%)      | 0.549    | 0.544 (-0.90%)
Table 3: Performance on the downstream tasks of the 8B models trained on 1T tokens. In addition to the final value, we report the average over the last third of the iterations to mitigate the irregularity of the performance increase during training and get a more accurate estimate of the relative improvement. The optimization hyperparameters were tuned for the baseline and kept unchanged, but the Free Transformers require 3.1% more computation. See the curves in Appendix C.

4.5
To measure the improvement in a more realistic setting, closer to models actually used in real applications, we trained 8B models on 1T tokens, which improves drastically the performance of both the baseline and the Free Transformer.

Given the results with 200B tokens, we chose the value κ = log(2)/2, corresponding to half a bit of information per token at most.
The performance on downstream tasks is given in Table 3, and the curves of performance during training in Figure 7. We provide in the table the performance measured at the end of the training, as for the other configurations, but in addition we also give the average over the last third of the training. We can observe on the graphs that the rate of improvement tends to be constant on this interval, which justifies averaging to mitigate the performance fluctuations.
The key result is the boost of performance on HumanEval+, MBPP, GSM8K, MMLU and CSQA,
confirming what we observed in the smaller settings, and a greater stability on other tasks.
5
There have been several attempts at combining a VAE and a decoder Transformer, generally with a focus
on improving topic models and providing ways to guide the generation.
The OPTIMUS model (Li et al., 2020) combines a pre-trained BERT as text embedding / encoder with a GPT-2 playing the role of decoder, which are fine-tuned with a VAE-like loss.
The latent embedding Z is computed thanks to a CLS token, that is, by adding a token to the input and a read-out to extract its embedding in the output. To modulate the GPT-2 generation with it, it is either (1) concatenated as an additional token in every layer, or (2) added to the input token embeddings. Collapse of the KL divergence is prevented during training with the free bits method (Kingma et al., 2016).
This approach allows for better guided text generation with GPT-2 and better generalization on low-data
languages with BERT.
Xie et al. (2021) extend OPTIMUS with a multi-objective loss, adding in particular the prediction of the story topic, using the output of another model as ground truth, to obtain a better embedding space.
The CVAE proposed by Fang et al. (2021) combines two pre-trained GPT-2 models, one used as the encoder
without causal masking.
three ways to modulate the decoder with linear images of it: (1) add it to each input token embedding, (2)
concatenate it to the Ks and Vs in every layer, (3) add it before the softmax. Experiments demonstrate
that this method allows controlling the generation without hurting the quality of the result.
AdaVAE (Tu et al., 2022) is similarly the combination of two pre-trained GPT-2 models, the first, without causal masking, playing the role of the encoder. The latent embedding Z is extracted from its output with a slightly modified attention operator. It is then injected into the decoder by either concatenating an image of it to the keys and values, as in OPTIMUS, or before the softmax, as in CVAE.
6
The Free Transformer is a direct extension of a standard decoder Transformer, with the abstract structure
of a conditional VAE. It is implemented with a single additional non-causal Transformer block and requires
a few percent of computational and memory usage overhead.
Its structure makes it able to learn latent random variables unsupervised, and to condition its generative
process on them. In some ways, this approach aims at achieving in latent space with an autoencoder what
reasoning models do with chains-of-thought in token space and an RL procedure (DeepSeek-AI et al.,
2025). A combination of the two is, of course, promising.

The performance boost across multiple benchmarks and two sizes of models, obtained without tuning the optimization hyperparameters, is a strong signal that the overall approach actually improves the inductive bias of the vanilla Transformer.
Many properties and design choices should be explored. The performance curves during training are often
unstable, possibly due to the coupling of the optimization of the encoder and the decoder, and using
different optimization methods could be fruitful. The random embedding itself could take many forms,
and the one used in our implementation is arbitrary.
Finally, the behavior at larger scales, both in parameter count and dataset size, remains to be investigated.
References
J. Ainslie, J. Lee-Thorp, M. de Jong, et al. GQA: Training Generalized Multi-Query Transformer Models from
Multi-Head Checkpoints, 2023.
J. Austin, A. Odena, M. Nye, et al. Program Synthesis with Large Language Models, 2021.
Y. Bisk, R. Zellers, R. L. Bras, et al. PIQA: Reasoning about Physical Commonsense in Natural Language, 2019.
M. Chen, J. Tworek, H. Jun, et al. Evaluating Large Language Models Trained on Code, 2021.
C. Clark, K. Lee, M.-W. Chang, et al. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions.
arXiv preprint arXiv:1905.10044, 2019.
P. Clark, I. Cowhey, O. Etzioni, et al. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 422–435, 2018.
K. Cobbe, V. Kosaraju, M. Bavarian, et al. Training Verifiers to Solve Math Word Problems, 2021.
J. Copet, Q. Carbonneaux, G. Cohen, et al. CWM: An Open-Weights LLM for Research on Code Generation with
World Models, 2025.
DeepSeek-AI, D. Guo, D. Yang, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement
Learning, 2025.
L. Fang, T. Zeng, C. Liu, et al. Transformer-based Conditional Variational Autoencoder for Controllable Story
Generation, 2021.
D. Hendrycks, C. Burns, S. Kadavath, et al. Measuring Massive Multitask Language Understanding.
arXiv:2009.03300, 2021.
D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes, December 2013. arXiv:1312.6114 [stat].
D. P. Kingma, T. Salimans, R. Jozefowicz, et al. Improving Variational Inference with Inverse Autoregressive
Flow, January 2016. arXiv:1606.04934 [cs].
T. Kwiatkowski, J. Palomaki, O. Redfield, et al. Natural Questions: A Benchmark for Question Answering Research. In Transactions of the Association for Computational Linguistics, volume 7, pages 453–466, 2019. doi: 10.1162/tacl_a_00276.
G. Lai, Q. Xie, H. Liu, et al. RACE: Large-scale ReAding Comprehension Dataset From Examinations. In
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, 2017.
doi: 10.18653/v1/D17-1082.
C. Li, X. Gao, Y. Li, et al. Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space, 2020.
J. Liu, C. S. Xia, Y. Wang, and L. Zhang. Is Your Code Generated by ChatGPT Really Correct? Rigorous
Evaluation of Large Language Models for Code Generation, 2023.
T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, 2018. doi: 10.18653/v1/D18-1260.
A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving Language Understanding by Generative
Pre-Training. Technical report, OpenAI, 2018.
K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An Adversarial Winograd Schema Challenge
at Scale, 2019.

N. Shazeer. GLU Variants Improve Transformer, 2020.
J. Su, Y. Lu, S. Pan, et al. RoFormer: Enhanced Transformer with Rotary Position Embedding, 2021.
A. Talmor, J. Herzig, N. Lourie, and J. Berant. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019. doi: 10.18653/v1/N19-1421.
H. Tu, Z. Yang, J. Yang, and Y. Huang. AdaVAE: Exploring Adaptive GPT-2s in Variational Auto-Encoders for
Language Modeling, 2022.
A. Vaswani, N. Shazeer, N. Parmar, et al. Attention Is All You Need, August 2017. arXiv:1706.03762 [cs].
Z. Xie, T. Cohn, and J. H. Lau. Exploring Story Generation with Multi-task Objectives in Variational Autoencoders,
2021.
R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a Machine Really Finish Your Sentence? arXiv preprint arXiv:1905.07830, 2019.
J. Zhang, R. Xiong, R. Socher, and C. Wang. Root Mean Square Layer Normalization.
arXiv:1910.07467, 2019.
A

• HellaSwag: Multiple choices. Common sense focusing on physically situated scenarios. (Zellers et al., 2019)

• WinoGrande: Large-scale adversarial Winograd-style pronoun resolution (fill-in-the-blank) designed to reduce annotation artifacts. (Sakaguchi et al., 2019)

• ARC (arc_easy, arc_challenge): Multiple-choice science exam questions from the AI2 Reasoning Challenge. (Clark et al., 2018)

• PIQA: Physical commonsense multiple choice about everyday goals and affordances. (Bisk et al., 2019)

• OpenBookQA (OBQA): Open-book science QA: combines a provided set of core facts with commonsense/world knowledge to answer questions. (Mihaylov et al., 2018)

• RACE: Multiple-choice reading comprehension from Chinese middle-school English exams. (Lai et al., 2017)

• MMLU: “Massive Multitask Language Understanding”. Questions spanning STEM, humanities, social sciences, etc. (Hendrycks et al., 2021)

• CommonsenseQA (CSQA): Multiple-choice QA requiring commonsense relational knowledge (leveraging ConceptNet relations). (Talmor et al., 2019)

• BoolQ: Yes/no questions paired with passages to evaluate reading comprehension and entailment-like inference. (Clark et al., 2019)

• GSM8K: Grade-school math word problems requiring multi-step arithmetic reasoning. (Cobbe et al., 2021)

• HumanEval+: An augmented version of OpenAI’s HumanEval (Chen et al., 2021) with many more unit tests per problem to reduce test fragility and overfitting in code generation evaluation. (Liu et al., 2023)

• MBPP: “Mostly Basic Programming Problems.” Short Python programming tasks solvable by entry-level programmers; includes text spec and example tests. (Austin et al., 2021)

• NQ: “Natural Questions.” Real user queries paired with Wikipedia pages. (Kwiatkowski et al., 2019)

B

• pass@1 is the proportion of generated pieces of code that produce the expected behavior when executed.

• em (“exact match”) is the proportion of generated endings of a sequence that perfectly match a reference solution.

• acc_completion is the proportion of correct responses when the choice is based on the sum of the log-probabilities normalized by the number of tokens of each possible choice.
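As an illustration of the acc_completion scoring rule, here is a minimal sketch (the function name and data layout are ours; each candidate completion is mapped to the list of its per-token log-probabilities under the model):

    def pick_choice(choice_token_logprobs: dict) -> str:
        # Score each candidate by the sum of its per-token log-probabilities
        # normalized by its number of tokens, and return the best-scoring one.
        def score(logprobs):
            return sum(logprobs) / len(logprobs)
        return max(choice_token_logprobs, key=lambda c: score(choice_token_logprobs[c]))

    # Made-up numbers for illustration:
    print(pick_choice({"A": [-1.2, -0.3], "B": [-0.9, -1.1, -0.2]}))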

C

Figure 5: Curves of performance on the downstream tasks (human_eval_plu, mbpp, gsm8k, mmlu, csqa, hellaswag, winogrande, obqa, arc_challenge, arc_easy, piqa, race.high, race.middle, boolq, nq, tqa) during the training of the 1.5B baseline and our models (FT 1.5B with 1/4, 1/2, 1, and 2 bits). The optimization hyperparameters were tuned for the baseline and kept unchanged, but the Free Transformers require 3.6% more computation.

Figure 6: Curves of performance on the same downstream tasks during the training of the 8B baseline and our models (FT 8B with 1/4, 1/2, 1, and 2 bits). The training procedure was tuned for the baseline and kept unchanged, but the Free Transformers require 3.1% more computation.

Figure 7: Curves of performance on the same downstream tasks during the training on 1T tokens of the 8B baseline and our model (FT 8B with 1/2 bit). The training procedure was tuned for the baseline and kept unchanged, but the Free Transformers require 3.1% more computation.