Extracting alignment data in open models
Federico Barbero¹,*, Xiangming Gu²,*, Christopher A. Choquette-Choo³,*, Chawin Sitawarin⁴, Matthew Jagielski⁵,*, Itay Yona⁶,*, Petar Veličković⁴, Ilia Shumailov⁷,* and Jamie Hayes⁴
¹University of Oxford, ²National University of Singapore, ³OpenAI, ⁴Google DeepMind, ⁵Anthropic, ⁶MentaLeap, ⁷AI Sequrity Company
*Work performed while the author was at Google DeepMind
In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model – data that is useful to steer the model and improve certain capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has focused on measuring the success of training data extraction through string matching, we argue that embedding models are better suited for our specific goals. Distances measured through a high-quality embedding model can identify semantic similarities between strings that a different metric, such as edit distance, will struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate, an order of magnitude) the amount of data that can be extracted due to trivial artifacts that deflate the metric. Interestingly, we find that models readily regurgitate training data that was used in post-training phases such as SFT or RL. We show that this data can then be used to train a base model, recovering a meaningful amount of the original performance. We believe our work exposes a possibly overlooked risk towards extracting alignment data. Finally, our work opens up an interesting discussion on the downstream effects of distillation practices: since models regurgitate aspects of their training set, distillation can therefore be thought of as indirectly training on the model's original dataset.
This paper only considers and discusses open models.
1. Introduction
Progress in capabilities of Large Language Models (LLMs) is frequently driven by improvements to training data recipes. It is common for a model developer to curate smaller and targeted data bundles to push performance on particular downstream benchmarks. For the purpose of this work, we refer to this data as 'alignment data'⁰. This covers not only data used for safety and instruction-following (such as Supervised Finetuning (SFT) and Reinforcement Learning (RL) datasets), but also any targeted data collections used to steer model behaviour and enhance specific capabilities, including mathematics, reasoning, and long-context understanding. While this type of data is usually found in post-training, it is becoming increasingly common to include it also earlier in training (Meta, 2025). We use the term alignment rather than post-training in our work for this reason.
The fact that models memorise subsets of their training data is now a well-established phenomenon (Biderman et al., 2023; Hayes et al., 2025a,b; Mireshghallah et al., 2022, among many others).
⁰ Here we use the term 'training' data loosely and elaborate on it later in the Nomenclature section.
Corresponding author(s): federico.barbero@cs.ox.ac.uk
Most research on this topic has focused on this issue due to its associated privacy and legal risks, such as models
leaking personal or copyrighted information (Cooper and Grimmelmann, 2024; Freeman et al., 2024, among others). Prior work on memorisation is often centred around verbatim or near-verbatim training data extraction (Carlini et al., 2021, 2022; Nasr et al., 2023), where success is measured by exact (or very close) matches on tasks where this is important, like measuring similarity of a credit card number or a paragraph from a copyrighted book. In contrast, in this work we study and develop a more subtle notion of training data extraction – patterns and templates of proprietary data – where the semantic structure is just as valuable as the literal content. Consequently, existing extraction methods and metrics that determine failure or success using simple string matching are not well-aligned for this task.
We are interested in understanding if an LLM will leak training data that is sensitive due to its utility in improving model performance. In other words, if a model's competitive advantage comes from its secret training data, and models have a tendency to memorise, and then regurgitate, this data, then the competitive advantage itself may be at risk. This is especially topical with the surge of more sophisticated training modes centred on reasoning (Nye et al., 2021; Reynolds and McDonell, 2021; Wei et al., 2022). It is also important from the point of view of the commonplace practice of distillation, where a competitor may use a strong model to train their own. If models regurgitate training data, then through distillation a competitor is (at least in part) training on the original training data as well.
By exploiting the fact that in open-weight models the end user controls tokenization, and that the chat template structure is only introduced during post-training, we use a simple attack which demonstrates that open-weight models will repeat numerous training prompts¹.
We are able to extract data ranging from reinforcement learning (RL) prompts and associated traces to prompts used for supervised fine-tuning (SFT) and during mid- and pre-training. In doing so, we answer an open question from Xu et al. (2024) and confirm that models do indeed recite their own alignment data, and that distillation pipelines are therefore likely to partly include original training data samples.
Summary
Our hypothesis: since the chat template is introduced during alignment, prompting the model with the template will induce it to generate alignment data.
Our procedure:
1. Prompt the model with the chat template and sample. Repeat this a number of times to generate a set of synthetic data points.
2. For each synthetic data point, find the closest sample in the post-training dataset according to embedding similarity, using an embedding model.
Our main finding: one would expect the generations to come from a similar distribution as the alignment dataset. We find that this is true, but the generated data is far closer to the alignment dataset than one may expect. For example, in Figure 1, we generated a maths question and answer that is near identical to a sample in the alignment dataset.
¹ Our attack relies on the user's control over the chat template structure. This is not normally available for closed models, and so our attack does not immediately apply to them. Our prompting strategy is similar to the one used by Xu et al. (2024).
Generation (sampled from the model with the special-token prefix <|endoftext|><|user|>):
"Emily was 55 inches tall last year. If she grew 4 inches more than her sister who grew 2 inches last year, how tall is Emily now?
Emily's sister grew 2 inches. Emily grew 4 inches more than her sister, so she grew 2 + 4 = 6 inches. Last year Emily was 55 inches tall, and she grew 6 inches since then. So her current height is 55 + 6 = 61 inches. Thus, Emily is \boxed{61}."

Best match in the training set (retrieved via vector search):
"Emily was 54 inches tall last year. She grew 4 inches more than her brother who only grew 2 inches last year. How tall is Emily now?
Let's solve this step by step. Emily's brother grew 2 inches last year. Emily grew 4 inches more than her brother, which means she grew 2 + 4 = 6 inches last year. Emily was originally 54 inches tall. After growing 6 inches, Emily's current height is 54 + 6 = 60 inches. So, Emily is now \boxed{60}."

Embedding similarity: 0.97. String-match similarity: 0.68.

Figure 1 | An overview of the data extraction process. A sample generated using the prompt <|endoftext|><|user|> is embedded and matched against the training data through vector search, repeated at scale. The figure highlights the difference between a high embedding similarity score (0.97) and a much lower string-match similarity score (0.68), demonstrating that semantic similarity is more effective for detecting this form of data memorisation.
Contributions
•
We generalise the prompting strategy from Xu et al. (2024) and study how it may be used to extract post-training data from models trained with SFT and RL, addressing their open question on memorisation.
•
We show that measuring the quantity of memorisation based on approximate or verbatim string matching is likely to significantly undercount rates of memorisation – our results suggest by at least an order of magnitude. Instead, using neural embeddings reveals a much higher rate.
•
We finally demonstrate that our method can be used to generate datasets from a model that can then be used for the (post-)training of another model, meaningfully capturing some of the original model's performance.
2. Nomenclature
In this paper, we rely on the definitions below. We refer the reader to the Appendix (Section A) for a more detailed background overview.
Alignment Data: We use this term to refer not only to traditional post-training datasets (like Supervised Finetuning and Reinforcement Learning with Human Feedback datasets), but to any targeted data collection used to steer model behaviour and improve specific capabilities such as mathematics, reasoning, or long-context understanding. This data is usually considered a significant competitive asset.
Training Data: Any data used at any stage of model creation, including pre-training, mid-training, and post-training, which encompasses data used for SFT and RL fine-tuning. In this paper we expand the usual meaning to also cover semantically equivalent representations of the training data.
Post-training: The stage of model development that follows the pre-training phase. It involves using various specialised datasets and methods, such as Supervised Finetuning (SFT) and Reinforcement Learning (RL), to align the model's capabilities with desired behaviours and tasks. In this work, we treat post-training data as a (large) subset of total alignment data, as it is becoming increasingly common to include alignment data (maths, reasoning, etc.) also in earlier stages of training.
Memorisation²: The tendency of a model to reproduce (parts of) its training data. Our work extends this definition beyond verbatim string matching to include a more abstract notion of memorisation, such as reproducing the patterns, templates, and semantic structure of proprietary data. Related relaxed notions of memorisation have been studied before, but instead of using simple string-matching similarity metrics, in our work we show that text embeddings seem much better suited to detecting approximate memorisation when semantics are important. We refer the reader to Ippolito et al. (2022) for further motivation behind why we expand our definition of memorisation.
Chat Template: The formatting structure used by post-trained models to wrap conversational messages in special tokens (e.g., <|user|>, <|assistant|>). A key aspect of our attack is that these templates and their special tokens are typically introduced only during the post-training stage, where most of the alignment is done.
Embedding Score: A measure of semantic similarity between two pieces of text, calculated using the dot product of neural text embeddings. We propose this as a more effective alternative to traditional string-matching metrics (like Levenshtein distance), as it better captures semantic equivalence even when there are superficial differences.
3. Methodology
The proposed extraction strategy is based on the observation that certain prompts seem to consistently induce the model into outputting alignment-like data. To enable our attack, we use special tokens from the chat template that are introduced precisely during post-training, making them ideal artifacts that can be leveraged to extract specific types of data. We find that a meaningful fraction of such generations are either verbatim or very close to true training samples under an appropriate measure of similarity.
Our pipeline works as follows: we embed the entire post-training set using an embedding model (see Section 3.3), then prompt the model using our chosen prefix repeatedly. For each generated sample we then embed it and search the vector database to retrieve the best match and its score. A diagram of this process is shown in Figure 1. Once the post-training dataset is embedded, the search can be implemented as a single matrix multiplication. In the following sections, we provide motivation for our methodology using generated examples.
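To make the matching step concrete, the following is a minimal sketch of the pipeline under stated assumptions: embed and sample_from_model are hypothetical placeholders for the embedding model (e.g. gemini-embedding-001) and the open-weight model, and the 0.95 threshold is the one motivated in Section 3.3.

```python
# Minimal sketch of the extraction-and-matching pipeline (placeholders marked).
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Return unit-norm embeddings of shape (len(texts), d). Placeholder."""
    raise NotImplementedError("call your embedding model here")

def sample_from_model(prefix: str, n: int) -> list[str]:
    """Sample n continuations of the chat-template prefix. Placeholder."""
    raise NotImplementedError("call the open-weight model here")

def extract_and_match(train_set: list[str], prefix: str, n_samples: int,
                      threshold: float = 0.95):
    train_emb = embed(train_set)                  # (N, d), unit-norm rows
    generations = sample_from_model(prefix, n_samples)
    gen_emb = embed(generations)                  # (M, d), unit-norm rows

    # With unit-norm embeddings, cosine similarity against the whole
    # post-training set is a single matrix multiplication.
    sims = gen_emb @ train_emb.T                  # (M, N)
    best_idx = sims.argmax(axis=1)
    best_score = sims.max(axis=1)

    # Generations whose best match clears the threshold are counted as
    # approximately semantically memorised.
    matches = [(g, train_set[i], float(s))
               for g, i, s in zip(generations, best_idx, best_score)
               if s >= threshold]
    return matches, best_score
```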
A significant constraint of our work is that we require access to models that make their (post-)training mixtures available. For this reason, we focus our study on OLMo 2 (OLMo et al., 2024) (Section 4) for SFT and Open-Reasoner-Zero (Hu et al., 2025) (Section 5) for RL. As these models are post-trained with standard methods, they are a valuable resource for our study.
3.1. Prompting with the chat template
The observation that chat templates can be used to generate useful synthetic data has already been pointed out by Xu et al. (2024), where the authors use special prompts and a filtering pipeline to generate a dataset that can be used to post-train models using SFT. In our work, we study this from a different angle and aim to understand the extent to which the generations correspond to regurgitated training data. In particular, we positively answer one of the conjectures left by Xu et al. (2024), which posited that the data generated might have been training data.
In Figure 2, we show how prompting the model with the chat template results in outputs that resemble post-training samples. We investigate this phenomenon in much more detail in the coming sections.
² We make no statement with regard to whether or not a model 'contains' its training data in a bit-wise or code-wise sense, nor in the sense that any arbitrary instance of training data can be perfectly retrieved.
Qwen2.5 7B Instruct
<|im_start|>user
I'm trying to use the `zip` function in Python with a generator expression, but I keep encountering a `ValueError: too many values to unpack`. Here's the code I'm working with:
```python
gen = (x for x in range(3))
a, b = zip(*gen)
```
[...]

Qwen2.5 7B Instruct
<|im_start|>user
Okay, so I need to figure out the value of the expression √(2√(2√(2√(2⋯)))). Hmm, let me think. I remember that infinite nested radicals can sometimes be solved by setting them equal to a variable and then solving an equation. Let me try that approach here. Let me denote the entire expression as x. So, √(2√2 [...]

Figure 2 | Generations from Qwen2.5 7B Instruct using its chat-template prefix. We note that the generations resemble potential post-training samples.
3.2. Beyond string matching
A central theme in our work is the wish to broaden the definition of memorisation beyond simple string matching, due to its natural limitations. For example, string matching has been shown to be vulnerable to 'style-transfer' (Ippolito et al., 2022), where semantics are preserved whilst easily evading string matching checks. Further, we are interested in the usefulness of extracted samples as training points; if two samples are semantically equivalent, then they should reasonably be treated as equal training samples (for measurement of memorisation rate). While approximate string matching may (accidentally) weakly capture semantics in certain cases, we found it to be unreliable and generally not well-aligned with the notion of semantic equivalence.
We use Figure 3 to illustrate this point. It is clear that the left and right samples are extremely similar up to small differences, for example in the numerical contents of the choices in the multiple-choice section of the question. String-matching scores tend to penalise these differences quite heavily, assigning a similarity of ≈0.7, while the high embedding score (see Section 3.3) better reflects the judgement a human would make. We observed many situations where a string-matching score would indicate low similarity while the samples are semantically extremely similar. We point out that under the standard 0.9 threshold used by Gemini 2.5 (Comanici et al., 2025), the example in Figure 3 would not count as approximately memorised under string matching.
Generation
<|endoftext|><|user|>
The question is: Find the value of y from (12)^3 x 6^4?
Options:
(A) 2456
(B) 4578
(C) 6234
(D) 7386
(E) 5184
The answer is: (E)
Explanation: Given Exp. = (12)3 x 64 = (12)3 x 64 = (12)2 x 62 = (72)2 = 5184
432 12 x 64<|endoftext|>

SFT Training sample
Question: Find the value of y from (12)^3 x 6^4?
Options:
(A) 2356
(B) 3467
(C) 4567
(D) 5189
(E) 5184
Random thoughts:
Given Exp. = (12)3 x 64 = (12)3 x 64 = (12)2 x 62 = (72)2 = 5184
432 12 x 62 The final answer: (E).

Figure 3 | (Left) Generation from OLMo 2 13B. (Right) True post-training sample. Neural embeddings provide a match score of 0.986 using gemini-embedding-001 (Lee et al., 2025), while normalised Levenshtein similarity provides a match score of 0.699, heavily penalising differences in the options, even though the semantics remain identical. When computing similarities we always strip out the special tokens. We report more such examples in the Appendix (Section D).
3.3. Embedding scores
Embedding scores are generated by an embedding model and can be tuned for different tasks. The process usually involves taking a strong base model and training it contrastively for a specific application (e.g., semantic retrieval) (Lee et al., 2024). We use the gemini-embedding-001 model (Lee et al., 2025) in all of our experiments as it is a general and strong embedding model. We generate a single embedding for each sample, concatenating the question and answer as a single block of plain text. This acts as a vector search engine, where we can compute similarity with respect to each training sample using a single matrix multiplication and taking the argmax. As embeddings are normalised to be unit-norm, their dot product naturally encodes a notion of similarity. To better distinguish this from approximate memorisation measured using string matching, we call the measure using embeddings a measure of semantic memorisation.
To define a sample as (approximately semantically) memorised, we need to choose an appropriate threshold. We do this manually and report randomly chosen samples at different thresholds in the Appendix. We chose a threshold of 0.95 for neural embeddings as a conservative choice. We found samples at this point to have qualitatively similar properties that would be indisputable given a reasonable human judge. The choice of the threshold will naturally affect the measured memorisation. This limitation is, however, also present in choosing a threshold for what is considered memorised according to a string matching metric.
4. Extracting SFT data
We focus our SFT memorisation study on OLMo 2 (OLMo et al., 2024)³. OLMo 2 comes with the full training mixture, alongside the training details. Further, the models have strong downstream performance, enabling us to conduct experiments whose findings are likely to generalise to other models. The uncompressed pre-training mix has a size of 22.4 TB while the higher quality mid-training split has a size of 5.14 TB. The post-training is divided into 3 stages: the first is an SFT stage with a dataset containing 939k samples, then a Direct Preference Optimisation (Rafailov et al., 2023) step is conducted with 378k samples, and finally an RL with Verifiable Rewards (RLVR) step with 29.9k samples. We focus on the extraction of the SFT data in this section.
³ Licensed by AllenAI under the Apache License, Version 2.0.
We apply the procedure we described in Section 3. We embed the post-training samples using gemini-embedding-001 by concatenating the question and answer sequences as a single block of text. To extract the data from the model, we generate conditioning on the initial tokens <|endoftext|><|user|>, which are the starting tokens of the chat template (see Appendix Section C). We leave the temperature at its default value.
4.1. String matching
We start by evaluating memorisation using traditional string matching metrics. We consider a large set of generations for OLMo 2 13B (OLMo et al., 2024) produced with our extraction method and search for their closest match in the post-training set, with respect to different similarity measures. We consider the normalised Levenshtein similarity, defined as 1 − lev(a, b)/max(|a|, |b|), and the normalised Indel similarity, defined as 1 − indel(a, b)/(|a| + |b|). The Indel similarity is related to the Levenshtein distance, but applies a cost of 2 to substitutions (equivalently, it only allows insertions and deletions). For each generated sample, we find the highest similarity based on the two string matching methods in the post-training set. We follow the heuristic used by Gemini 2.5 (Comanici et al., 2025) and characterise a sample as approximately memorised if its similarity exceeds 0.9.
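For clarity, the sketch below implements the two normalised similarities exactly as defined above with a standard dynamic program (a library implementation could be used instead):

```python
# Self-contained implementation of normalised Levenshtein and Indel similarity.
def edit_distance(a: str, b: str, sub_cost: int = 1) -> int:
    """Edit distance with unit insert/delete cost; sub_cost=2 yields the
    Indel distance (a substitution becomes an insert plus a delete)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                  # delete ca
                curr[j - 1] + 1,                              # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # (mis)match
            ))
        prev = curr
    return prev[-1]

def normalised_levenshtein_similarity(a: str, b: str) -> float:
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b, sub_cost=1) / max(len(a), len(b))

def normalised_indel_similarity(a: str, b: str) -> float:
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b, sub_cost=2) / (len(a) + len(b))
```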
(a) Distribution of Levenshtein scores. (b) Distribution of Indel scores.
Figure 4|Histograms illustrating the distribution of memorisation scores, measured by Levenshtein
and Indel similarity scores. For each generation, we find the largest similarity score in the post-
training set of OLMo 2.
In Figure 4, we see that under both string matching metrics the measured memorisation rates seem negligible. The judgement from such results would be that our prompting strategy does not extract useful memorised data under string matching metrics. This, however, does not paint the entire picture. For example, the generated sample in Figure 3 would not be counted as memorised under the heuristic of Comanici et al. (2025). This is the extent to which measuring memorisation with simple string matching is problematic; it does not flag examples that any reasonable human would judge as clear cases of memorisation. This is because string matching can quickly become non-informative due to trivial differences (see Appendix Section D).
4.2. Neural embeddings
We now compare the string matching results to matching done using neural embeddings. We generate 1M samples with OLMo 2 13B using the same method and embed them using gemini-embedding-001. In Figure 5, we show that neural embeddings unveil a much higher memorisation rate (left) when compared to the string matching scores in Figure 4. The scatter plot (right) shows that string matching distances are not well-aligned with semantic memorisation and also exhibit a strong string-length bias, where longer generations are consistently given lower Levenshtein similarity scores. We find that neural embeddings are much better at dealing with cases such as Figure 3; we provide a number of examples in the Appendix (Section D).
(a) Distribution of embedding scores. (b) Levenshtein vs embedding scores.
Figure 5 | Histograms illustrating the distribution of embedding scores generated with gemini-embedding-001 (left) and scatter plot comparing the embedding scores to the Levenshtein distance, with points coloured by string length (right).
Coverage. For each true post-training sample, we report the largest embedding score across the 1M generated samples. We report the results in Figure 6. We find that some samples are much more memorised than others. While it is hard to understand exactly why, our investigations revealed that samples are much more likely to be memorised if similar samples are also present in the pre- and mid-training datasets. Further, samples that appear often, for example different versions of the same reasoning problem, seem to be more likely to be memorised.
(a) Empirical distribution of coverage of the post-training set. (b) Empirical cumulative distribution of coverage of the post-training set for 100k, 500k, and 1M generations.
Figure 6 | For each true post-training sample, we report a histogram of the best score among the 1M generations (left) and the cumulative distribution (right). Due to the nature of the experiment, coverage improves monotonically, as sampling more can only increase measured memorisation.
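A brief sketch of the coverage computation behind Figure 6, assuming the full generation-by-training-sample similarity matrix from the matching step is available:

```python
# Coverage: for every true post-training sample, the best embedding score
# achieved by any of the generations, plus the empirical 1 - CDF curve.
import numpy as np

def coverage(sims: np.ndarray, threshold: float = 0.95):
    """sims: (num_generations, num_train_samples) embedding-score matrix."""
    best_per_train = sims.max(axis=0)                  # coverage per true sample
    covered_fraction = float((best_per_train >= threshold).mean())
    xs = np.sort(best_per_train)
    one_minus_cdf = 1.0 - np.arange(1, len(xs) + 1) / len(xs)
    return best_per_train, covered_fraction, xs, one_minus_cdf
```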
Chat template better recovers post-training data. We verify that conditioning on the full chat template is useful to encourage the model to generate post-training-like data. As a baseline, we compare this method to simply conditioning on the single token <|endoftext|>. We consider 1,000 generations with both prefixes and report the estimate of the expected value of the embedding score. In Table 1, we show that conditioning on the entire chat template indeed provides samples closer to the post-training distribution. Because the special tokens are introduced during post-training, the model associates their presence with the post-training data distribution. This provides an explanation for why techniques such as Magpie (Xu et al., 2024) are possible: conditioning on the chat template results in generations that are much closer to the post-training distribution.
Prefill Type              Mean Embedding Score
<|endoftext|>             0.857
<|endoftext|><|user|>     0.892
Table 1|Mean best embedding score of generations for OLMo 2 13B using only the ‘beginning of
sequence token’ versus the full chat template prefill. The longer prefill generates samples that are on
average semantically closer to the post-training set.
4.3. Post-training on extracted data
A natural question one may then have is: if the generated data is similar to the original post-training dataset, can it be used to post-train a model directly? In other words, are we able to re-use a large and diverse enough number of samples to post-train a model without collecting any data manually? To explore this question, we post-train OLMo 2 7B with SFT in two ways: (1) on the original dataset, in order to reproduce the original results, and (2) on our generated dataset. For the synthetic dataset, we collect ≈930k samples, a similar size to the original. We perform basic filtering and processing using Gemini 2.5. We note that even though the number of samples is the same, the original SFT training is over ≈1.3B tokens, while the synthetic training set has only ≈850M tokens as the filtered generations are shorter.
In Table 2, we report the results following the benchmarks and evaluation pipeline used by OLMo et al. (2024). To validate our setup, we first show that our reproduction is very close to the released SFT checkpoint. Our model trained on synthetic data also achieves comparable performance on the benchmarks, except for the IFE task. We suspect that our pipeline generates too few examples that target this benchmark, but believe the performance could likely be improved by adding non-synthetic data. In fact, it is likely that a mix of synthetic and curated targeted data could be a useful paradigm to explore, boosting model performance while reducing the labour required to collect entire post-training datasets.
Model BBH MMLU MATH GSM8K POPQA TQA IFE DROP
Baseline (SFT only) 0.4953 0.6133 0.2073 0.7407 0.2364 0.4858 0.6562 0.5960
Baseline (Reproduction) 0.4944 0.6123 0.2077 0.7377 0.2529 0.5110 0.6617 0.5945
Extracted Data 0.5161 0.6052 0.1705 0.7847 0.2490 0.5529 0.5028 0.5923
Table 2 | Model performance after SFT on the benchmarks considered by OLMo et al. (2024). The baseline performance is taken from OLMo et al. (2024) and the reproduction was run using the code provided by the authors with the original dataset. Using our method, we train a model on SFT 'synthetic' data extracted from the model, using the same settings as the baseline.
5. Extracting RL data
We now focus on the extraction of RL data. We use the Open-Reasoner-Zero (Hu et al., 2025) model, which was trained from the Qwen 2.5 base model with PPO (Schulman et al., 2017) using post-training data that is publicly available. With RL, the training samples consist of questions and answers, but the reasoning traces are not part of the training dataset as they are artifacts of the training rollouts. For this reason, we focus on the extraction of the question and answer parts of the dataset, although we note that reasoning traces can be useful in their own right.
We prompt the model by again taking the first part of the chat template specified by the developers of the model (see Appendix Section C) and sampling each generation independently. We find that the model very consistently generates a question, followed by a thinking trace, and finally an answer. We then searched the training set for these generations. Surprisingly, we again found a number of training samples being regurgitated verbatim. We show an example in Figure 7, where the model outputs the exact training sample, a reasoning trace, and the correct solution. We find the fact that models are capable of regurgitating RL training samples to be counterintuitive, as the PPO objective, at least at a glance, seems rather misaligned with the memorisation of training samples, especially when compared to methods such as SFT, which very explicitly increase sequence likelihoods.
Generation (prefixed with the RL extraction prompt): a polynomial long-division question ("Find the quotient of the division (3x⁴ − 4x³ + 5x² − ...) ÷ (...)"), followed by a worked polynomial long division and a final answer. RL training sample: the same question with the same final answer, with no reasoning trace.

Figure 7 | (Left) Generation from Open-Reasoner-Zero 7B using our special RL extraction prefix (see Appendix Section C). (Right) True RL post-training sample. Surprisingly, we find that RL training samples can be regurgitated, even though the training objective seems to be heavily misaligned with this behaviour. The RL training samples only consist of a question and answer pair and do not come with a thinking trace. The model instead regurgitates the question, followed by a thinking trace, and finally the answer.
We explore this phenomenon further by measuring the change in likelihood of training set samples between the base and post-trained models. Measuring the likelihood of the training set is limited because it only captures 'pointwise' memorisation – the likelihood of a training sample might remain low because, for instance, the exact polynomial in the training set is not being memorised, even though the question style of finding its roots is. Regardless of the limitations of this simple measurement, we believe the results can still provide valuable signal and intuition.
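As an illustration of this measurement, the following is a hedged sketch of a per-prompt likelihood comparison using Hugging Face transformers; the checkpoint names and the prompt loading are illustrative assumptions rather than the exact setup used here:

```python
# Compare the log-likelihood of each RL training prompt under the base model
# and the RL post-trained model (checkpoint names are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prompt_log_likelihood(model, tokenizer, prompt: str) -> float:
    """Sum of next-token log-probabilities of the prompt under the model."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                    # (1, T, vocab)
    logp = torch.log_softmax(logits[:, :-1], dim=-1)  # predictions for tokens 1..T-1
    return logp.gather(2, ids[:, 1:].unsqueeze(-1)).sum().item()

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
rl = AutoModelForCausalLM.from_pretrained("Open-Reasoner-Zero/Open-Reasoner-Zero-7B")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

training_prompts: list[str] = []  # fill with the RL training questions

for prompt in training_prompts:
    delta = (prompt_log_likelihood(rl, tok, prompt)
             - prompt_log_likelihood(base, tok, prompt))
    # delta > 0 means the prompt became more likely after RL post-training.
```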
In particular, we measure the likelihood of each PPO training sample question under the Qwen 2.5 base model and the Open-Reasoner-Zero model. If the RL process induces memorisation, then we would see the likelihood under the post-trained model increase on the training samples. We bucket the likelihoods by orders of magnitude for both models and report the results in Figure 8. The results show that RL training causes many of the training prompts to increase in likelihood. We found samples whose likelihoods increased by several orders of magnitude after RL post-training, showcasing the fact that RL may be able to induce the memorisation of post-training samples. This is particularly surprising when one considers the RL post-training objective. It is not immediately clear to us what exact mechanism is driving this increase in likelihood, and we leave this as an exciting future research direction.
Likelihood threshold    Open-Reasoner-Zero    Base model
> 10⁻⁵                  287                   18
> 10⁻⁶                  1707                  678
> 10⁻⁷                  4320                  2892
> 10⁻⁸                  7101                  5864
Figure 8 | Count of training samples (prompt only) with likelihood above a certain threshold before and after RL. The entire train set contains 57,770 prompts. After RL, many training prompts achieve a higher likelihood.
Figure 9 | RL training using the ORZ (Hu et al., 2025) dataset versus a dataset that was extracted using our method. Surprisingly, we are able to recover most of the performance with our simple extraction method.
RL on extracted dataset. Analogously to the SFT setting, we ask whether we can instead extract an RL dataset. In other words, one can use an RL-trained model to extract an RL dataset with little effort. We start by post-training the Qwen2.5 7B base model with Dr. GRPO (Liu et al., 2025) on the ORZ 57k dataset (Hu et al., 2025). With the resulting post-trained model (which we call 'Baseline'), we then generate samples using our method and process them with Gemini 2.5 (Comanici et al., 2025) to filter out invalid, incomplete, or incorrect samples. We finally randomly select a subset of these synthetic samples to create our synthetic training data. We use this to post-train the Qwen2.5 7B base model using synthetic data only. Afterwards, we evaluate both models on four standard benchmarks: AMC (Li et al., 2024), MATH500 (Hendrycks et al., 2021), Minerva Math (Lewkowycz et al., 2022), and OlympiadBench (He et al., 2024). We report the results in Figure 9. The model trained on the synthetic data extracted from the 'Baseline' achieves comparable performance on the benchmarks. These results are surprising because our synthetic dataset is based on data extracted from a small 7B model trained on a relatively small RL dataset. We suspect that a more sophisticated pipeline could be used to achieve higher performance and retrieve a higher quality dataset. As our work focuses on training sample extraction, we leave this goal as a promising future direction.
6. Discussion
In this paper, we demonstrate that alignment data can be efficiently extracted from open-weight large language models. Our attack leverages a simple observation: chat templates and their special tokens are typically introduced during post-training, making them effective prefixes for inducing models to regurgitate alignment-like data.
Leakage Metrics are Hard. Measuring the extraction of alignment data requires care in the choice of metrics. We show that methods relying on string matching drastically undercount the rate of extraction, by (at least!) an order of magnitude. By instead using high-quality embedding models, we can identify approximate semantic memorisation—instances where the model reproduces the semantic structure and patterns of training data, even if not verbatim. This semantic similarity is far more relevant for capturing the utility of the data, as highlighted by numerous examples throughout the paper.
Model Distillation as Data Distillation. We demonstrate that data extracted from a post-trained model can be used to successfully train a new base model, meaningfully recovering parts of the original's performance in both SFT and RL settings. This confirms that the common practice of model distillation can function as an indirect form of training data extraction. Certain advantages that an open model obtains from its alignment data are therefore at risk of being leaked.
Rethinking Memorisation in RL. We show that training data can be extracted (even!) from Reinforcement Learning (RL) phases. This is counter-intuitive, as the reinforcement learning objective is not explicitly designed to increase sequence likelihoods in the same way as SFT. The fact that RL training prompts increased in likelihood after post-training suggests a more complex relationship between alignment and memorisation that warrants future investigation.
Our attack exploits chat templates and so is only applicable to open models. Closed models enforce the chat template, which is outside of the user's control; a working exploit on a closed model would require a user to spoof the behaviour of these tokens when submitting a query to the model. Whilst more difficult, prior work (Geng et al., 2025) has shown this is not necessarily impossible. Future work will establish how serious a threat this is for closed models.
Author contributions
References
S. Biderman, U. Prashanth, L. Sutawika, H. Schoelkopf, Q. Anthony, S. Purohit, and E. Raff. Emergent
and predictable memorization in large language models.
Systems, 36:28072–28090, 2023.
L. Borec, P. Sadler, and D. Schlangen. The unreasonable ineffectiveness of nucleus sampling on
mitigating text memorization., 2024.
G. Brown, M. Bun, V. Feldman, A. Smith, and K. Talwar. When is memorization of irrelevant training
data necessary for high-accuracy learning? In
on theory of computing, pages 123–132, 2021.
N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song,
U. Erlingsson, et al. Extracting training data from large language models. In
symposium (USENIX Security 21), pages 2633–2650, 2021.
N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang. Quantifying memorization across
neural language models. In, 2022.
B. Chen, N. Han, and Y. Miyao. A multi-perspective analysis of memorization in large language
models., 2024a.
T. Chen, A. Asai, N. Mireshghallah, S. Min, J. Grimmelmann, Y. Choi, H. Hajishirzi, L. Zettlemoyer,
and P. W. Koh. Copybench: Measuring literal and non-literal reproduction of copyright-protected
text in language model generation., 2024b.
G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram,
D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality,
long context, and next generation agentic capabilities., 2025.
A. F. Cooper and J. Grimmelmann. The files are in the computer: Copyright, memorization, and
generative ai., 2024.
A. F. Cooper, A. Gokaslan, A. Ahmed, A. B. Cyphert, C. De Sa, M. A. Lemley, D. E. Ho, and P. Liang.
Extracting memorized pieces of (copyrighted) books from open-weight language models.
preprint arXiv:2505.12546, 2025.
V. Dankers and I. Titov. Generalisation first, memorisation second? memorisation localisation for
natural language classification tasks., 2024.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers
for language understanding. In
the association for computational linguistics: human language technologies, volume 1 (long and short
papers), pages 4171–4186, 2019.
V. Feldman. Does learning require memorization? a short tale about a long tail. In
52nd annual ACM SIGACT symposium on theory of computing, pages 954–959, 2020.
J. Freeman, C. Rippe, E. Debenedetti, and M. Andriushchenko. Exploring memorization and copyright
violation in frontier llms: A study of the new york times v. openai 2023 lawsuit.
arXiv:2412.06370, 2024.
Y. Geng, H. Li, H. Mu, X. Han, T. Baldwin, O. Abend, E. Hovy, and L. Frermann. Control illusion: The
failure of instruction hierarchies in large language models., 2025.
D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al.
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.
arXiv:2501.12948, 2025.
V. Hartmann, A. Suri, V. Bindschaedler, D. Evans, S. Tople, and R. West. Sok: Memorization in
general-purpose large language models., 2023.
A. Haviv, I. Cohen, J. Gidron, R. Schuster, Y. Goldberg, and M. Geva. Understanding transformer
memorization recall through idioms., 2022.
J. Hayes, I. Shumailov, C. A. Choquette-Choo, M. Jagielski, G. Kaissis, K. Lee, M. Nasr, S. Ghalebikesabi,
N. Mireshghallah, M. S. M. S. Annamalai, et al. Strong membership inference attacks on massive
datasets and (moderately) large language models., 2025a.
J. Hayes, M. Swanberg, H. Chaudhari, I. Yona, I. Shumailov, M. Nasr, C. A. Choquette-Choo, K. Lee, and
A. F. Cooper. Measuring memorization in language models via probabilistic extraction. In
of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational
Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 9266–9291, 2025b.
C. He, R. Luo, Y. Bai, S. Hu, Z. L. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, et al. Olympiadbench:
A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific
problems., 2024.
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring
mathematical problem solving with the math dataset., 2021.
G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network.
arXiv:1503.02531, 2015.
J. Hu, Y. Zhang, Q. Han, D. Jiang, X. Zhang, and H.-Y. Shum. Open-reasoner-zero: An open source
approach to scaling up reinforcement learning on the base model.,
2025.
J. Huang, H. Shao, and K. C.-C. Chang. Are large pre-trained language models leaking your personal
information?, 2022.
J. Huang, D. Yang, and C. Potts. Demystifying verbatim memorization in large language models.
arXiv preprint arXiv:2407.17817, 2024.
Y. Huang, Z. Yang, M. Chen, J. Zhang, and M. R. Lyu. Entropy-memorization law: Evaluating
memorization difficulty of data in llms, 2025. URL.
D. Ippolito, F. Tramèr, M. Nasr, C. Zhang, M. Jagielski, K. Lee, C. A. Choquette-Choo, and N. Carlini.
Preventing verbatim memorization in language models gives a false sense of privacy.
arXiv:2210.17546, 2022.
M. Jagielski, O. Thakkar, F. Tramer, D. Ippolito, K. Lee, N. Carlini, E. Wallace, S. Song, A. Thakurta,
N. Papernot, et al. Measuring forgetting of memorized training examples.
arXiv:2207.00099, 2022.
N. Kandpal, E. Wallace, and C. Raffel. Deduplicating training data mitigates privacy risks in language
models. In, pages 10697–10707. PMLR, 2022.
A. Karamolegkou, J. Li, L. Zhou, and A. Søgaard. Copyright violations and large language models.
arXiv preprint arXiv:2310.13771, 2023.
H. Kiyomaru, I. Sugiura, D. Kawahara, and S. Kurohashi. A comprehensive analysis of memorization
in large language models. In
Conference, pages 584–596, 2024.
J. Lee, T. Le, J. Chen, and D. Lee. Do language models plagiarize? In
Conference 2023, pages 3637–3647, 2023.
J. Lee, Z. Dai, X. Ren, B. Chen, D. Cer, J. R. Cole, K. Hui, M. Boratko, R. Kapadia, W. Ding, et al. Gecko:
Versatile text embeddings distilled from large language models.,
2024.
J. Lee, F. Chen, S. Dua, D. Cer, M. Shanbhogue, I. Naim, G. H. Ábrego, Z. Li, K. Chen, H. S. Vera, et al.
Gemini embedding: Generalizable embeddings from gemini.,
2025.
A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil,
I. Schlag, T. Gutman-Solo, et al. Solving quantitative reasoning problems with language models.
Advances in neural information processing systems, 35:3843–3857, 2022.
D. Leybzon and C. Kervadec. Learning, forgetting, remembering: Insights from tracking llm memo-
rization during training. In
Neural Networks for NLP, pages 43–57, 2024.
J. Li, E. Beeching, L. Tunstall, B. Lipkin, R. Soletskyi, S. Huang, K. Rasul, L. Yu, A. Q. Jiang, Z. Shen,
et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math
problems and solutions., 13(9):9, 2024.
Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin. Understanding r1-zero-like training:
A critical perspective., 2025.
X. Lu, X. Li, Q. Cheng, K. Ding, X. Huang, and X. Qiu. Scaling laws for fact memorization of large
language models., 2024.
A. Meta. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation.
meta. com/blog/llama-4-multimodal-intelligence/, checked on, 4(7):2025, 2025.
F. Mireshghallah, A. Uniyal, T. Wang, D. K. Evans, and T. Berg-Kirkpatrick. An empirical analysis of
memorization in fine-tuned autoregressive language models. In
on Empirical Methods in Natural Language Processing, pages 1816–1826, 2022.
J. X. Morris, J. O. Yin, W. Kim, V. Shmatikov, and A. M. Rush. Approximating language model training
data from weights., 2025.
F. B. Mueller, R. Görge, A. K. Bernzen, J. C. Pirk, and M. Poretschkin. Llms and memorization: On
quality and specificity of copyright compliance. In
Ethics, and Society, volume 7, pages 984–996, 2024.
M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo, E. Wallace,
F. Tramèr, and K. Lee. Scalable extraction of training data from (production) language models.
arXiv preprint arXiv:2311.17035, 2023.
M. Nasr, J. Rando, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo,
F. Tramèr, and K. Lee. Scalable extraction of training data from aligned, production language
models. In, 2025.
M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz,
M. Bosma, D. Luan, et al. Show your work: Scratchpads for intermediate computation with
language models., 2021.
T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y. Gu, S. Huang, M. Jordan,
et al. 2 olmo 2 furious., 2024.
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama,
A. Ray, et al. Training language models to follow instructions with human feedback.
neural information processing systems, 35:27730–27744, 2022.
A. Pappu, B. Porter, I. Shumailov, and J. Hayes. Measuring memorization in rlhf for code completion.
arXiv preprint arXiv:2406.11715, 2024.
U. S. Prashanth, A. Deng, K. O’Brien, J. SV, M. A. Khan, J. Borkar, C. A. Choquette-Choo, J. R. Fuehne,
S. Biderman, T. Ke, et al. Recite, reconstruct, recollect: Memorization in lms as a multifaceted
phenomenon., 2024.
A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. Improving language understanding by
generative pre-training. 2018.
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference
optimization: Your language model is secretly a reward model.
processing systems, 36:53728–53741, 2023.
L. Reynolds and K. McDonell. Prompt programming for large language models: Beyond the few-shot
paradigm. In,
pages 1–7, 2021.
A. Satvaty, S. Verberne, and F. Turkmen. Undesirable memorization in large language models: A
survey., 2024.
J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control
using generalized advantage estimation., 2015.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization
algorithms., 2017.
A. Schwarzschild, Z. Feng, P. Maini, Z. Lipton, and J. Z. Kolter. Rethinking llm memorization through
the lens of adversarial compression., 37:56244–
56267, 2024.
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseek-
math: Pushing the limits of mathematical reasoning in open language models.
arXiv:2402.03300, 2024.
I. Shilov, M. Meeus, and Y.-A. de Montjoye. The mosaic memory of large language models.
preprint arXiv:2405.15523, 2024.
N. Stoehr, M. Gordon, C. Zhang, and O. Lewis. Localizing paragraph memorization in language
models., 2024.
E. Wallace, K. Xiao, R. Leike, L. Weng, J. Heidecke, and A. Beutel. The instruction hierarchy: Training
llms to prioritize privileged instructions., 2024.
X. Wang, A. Antoniades, Y. Elazar, A. Amayuelas, A. Albalak, K. Zhang, and W. Y. Wang. Generalization
vs memorization: Tracing language models’ capabilities back to pretraining data.
arXiv:2407.14985, 2024.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought
prompting elicits reasoning in large language models.
systems, 35:24824–24837, 2022.
Z. Xu, F. Jiang, L. Niu, Y. Deng, R. Poovendran, Y. Choi, and B. Y. Lin. Magpie: Alignment data
synthesis from scratch by prompting aligned llms with nothing.,
2024.
Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. Dapo: An
open-source llm reinforcement learning system at scale., 2025.
C. Zhang, D. Ippolito, K. Lee, M. Jagielski, F. Tramèr, and N. Carlini. Counterfactual memorization
in neural language models., 36:39321–39362,
2023.
C. Zheng, S. Liu, M. Li, X.-H. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. Group
sequence policy optimization., 2025.
A. Background
Our work sits at the intersection of key areas in the study and development of frontier LLMs. The first is post-training, where curated data and specialised training techniques are used to transform a base model into a useful assistant. The second is memorisation, the observation that models are capable of regurgitating training data. The third is the widespread practice of model distillation, a process through which a strong model's capabilities can be transferred to another. In this section, we will review these areas and argue that there is a risk that lies at the intersection of the three.
A.1. Post-training
Early models (Devlin et al., 2019; Radford et al., 2018) were built on the idea that pre-training on larger and larger amounts of internet data was the solution to keep improving capabilities. Today, this principle has arguably shifted and we have seen a surge in multi-stage training with a number of specialised datasets and techniques, giving birth to what is now called post-training. LLMs are post-trained using various methods ranging from Supervised Finetuning (SFT) to now very popular RL methods such as Proximal Policy Optimisation (PPO) (Schulman et al., 2017), Group Relative Policy Optimization (GRPO) (Shao et al., 2024), and Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022), among many others. In this work, we mainly focus on models that were post-trained using SFT and PPO, as they are common post-training techniques and we have strong open models that were trained using such techniques with publicly available post-training data.
Supervised Finetuning (SFT) assumes access to a dataset of question-answer pairs $\{(q_i, a_i)\}_{i=1}^{N}$. The model is then optimised to predict an answer $a_i$ given a question $q_i$, i.e. to increase the likelihood $p(a_i \mid q_i)$. To achieve this, one usually masks out the contributions to the loss that come from the question part, i.e. computing the conditional gradient $\nabla_\theta \log p(a_i \mid q_i; \theta)$ given some parameters $\theta$. The parameters are updated via some form of gradient descent on the negative log-likelihood:
$$\theta' \leftarrow \theta + \eta \, \frac{1}{N} \sum_{i} \nabla_\theta \log p(a_i \mid q_i; \theta),$$
given some learning rate $\eta > 0$. The OLMo 2 (OLMo et al., 2024) family of models has been post-trained in large part using SFT with a dataset of 939,344 question-answer pairs which has been released publicly, providing an excellent resource for our study.
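A minimal sketch of this masked loss, assuming a Hugging Face-style causal language model whose forward pass returns next-token logits:

```python
# Masked SFT loss: question tokens are excluded so that only the answer
# contributes to the gradient, as described above.
import torch
import torch.nn.functional as F

def masked_sft_loss(model, question_ids: torch.Tensor, answer_ids: torch.Tensor):
    """question_ids, answer_ids: (1, Lq) and (1, La) token-id tensors."""
    input_ids = torch.cat([question_ids, answer_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : question_ids.shape[1]] = -100        # mask out the question
    logits = model(input_ids).logits                 # (1, L, vocab)
    # Shift so that position t predicts token t+1, as in standard LM training.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.shape[-1]),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```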
RL-based post-training. We focus on RL-based post-training for reasoning (Guo et al., 2025) in this work. The Open-Reasoner-Zero (Hu et al., 2025) model adopts PPO (Schulman et al., 2017) for policy optimisation in post-training. The training objective can be written as
$$\mathcal{J}_{\mathrm{PPO}}(\theta) = \mathbb{E}_{q \sim \mathcal{D},\, o \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)} \left[ \frac{1}{|o|} \sum_{t=1}^{|o|} \min\!\left( \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\mathrm{old}}}(o_t \mid q, o_{<t})} \hat{A}_t,\ \operatorname{clip}\!\left( \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\mathrm{old}}}(o_t \mid q, o_{<t})},\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_t \right) \right].$$
Here $o$ is the generated response, $\epsilon$ is the clipping ratio, and $\hat{A}_t$ is an estimator of the $t$-th token's advantage, which is computed through Generalized Advantage Estimation (GAE) (Schulman et al., 2015) with a learned value model and a reward function. GRPO (Shao et al., 2024; Guo et al., 2025) removes the value function and estimates the advantage $\hat{A}_t$ in a group manner. For each question $q$, GRPO samples a group of responses $\{o_1, o_2, \dots, o_G\}$, which are rewarded as $\{r_1, r_2, \dots, r_G\}$ using the reward function. Normally the reward function is shaped by whether the response correctly answers the question (the answer within a specified format matches the correct answer) and whether the response format is correct. The advantage is then estimated as $\hat{A}_{i,t} = \big(r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})\big) / \operatorname{std}(\{r_j\}_{j=1}^{G})$. Therefore, the objective of GRPO can be written as
$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim \mathcal{D},\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\!\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t},\ \operatorname{clip}\!\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})},\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_{i,t} \right) - \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right) \right) \right].$$
KL regularization ensures that the policy model does not deviate too far from the reference model, but it can be eliminated to further improve exploration (Hu et al., 2025). Dr. GRPO (Liu et al., 2025) removes the length and standard-deviation normalisation terms from the advantage estimation, improving token efficiency with comparable reasoning performance. There is also a significant body of work aiming to improve GRPO, with notable examples being DAPO (Yu et al., 2025) and GSPO (Zheng et al., 2025).
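To make the group-relative advantage concrete, here is a small sketch of the estimator above (the eps term is an added numerical-stability assumption, not part of the formula):

```python
# GRPO-style advantage: standardise rewards within a group of responses
# sampled for the same question; no learned value model is needed.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) scalar rewards for G responses to one question.
    The returned advantage is broadcast to every token of its response
    when forming the clipped policy-gradient objective."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: binary correctness rewards for a group of 6 sampled responses.
print(grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])))
```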
Chat templates. Post-trained models are typically served with a chat template, which wraps the messages in special tokens designed to mark them as 'user', 'assistant' or 'system' messages. From a security perspective, chat templates are usually trained to build a so-called instruction hierarchy of privilege (Wallace et al., 2024), although its practical effectiveness has been challenged (Geng et al., 2025). Crucially, the special tokens used to build the template are typically introduced only during post-training, which makes them a useful attack vector to elicit the generation of alignment data introduced exclusively in post-training. A similar effect has been pointed out by Xu et al. (2024), but while their focus was the automated generation of an alignment dataset, our focus is instead that of generalising and better understanding the process from the point of view of memorisation – answering one of their conjectures, where they posited that models are likely memorising such data.
A.2. Memorisation
It is a well-documented fact that LLMs are capable of memorising and reciting training data verbatim (Carlini et al., 2021, 2022, among others). The primary concern of these studies has been public-facing risks, such as the leak of private information like email addresses, names, and credit card numbers that were accidentally included in the pre-training dataset, or the regurgitation of copyrighted material.
A common way to measure memorisation is that of extractable memorisation (Carlini et al., 2021, 2022) – in this case a sample is called extractable if there exists some prefix such that the LLM prompted on the prefix will generate the sample using greedy decoding. This definition is computationally convenient as it effectively ignores the stochastic nature of LLMs and is useful for a variety of tasks, like observing if the model outputs copyrighted data verbatim or discovering sensitive strings of information. A relaxation of this is probabilistic extraction (Hayes et al., 2025b), in which one considers the probability of extracting the sample over multiple queries. This measure provides more information by considering the joint likelihood of the sequence rather than providing simply a boolean decision on extractability.
Recent work has begun to move away from measuring memorisation via verbatim or near-verbatim string matches (Chen et al., 2024a,b). For example, Shilov et al. (2024) showed that LLMs can assemble information from related, partially overlapping fragments of text, a concept they term mosaic memory. This challenges the assumption that memorisation is driven exclusively by exact repetitions in the training data. They demonstrate that even highly modified sequences can contribute significantly to the memorisation of a reference sequence, with the memorisation effect being predominantly a function of syntactic overlap (shared tokens). This suggests that current data deduplication techniques, which often focus on removing exact duplicates, are insufficient for addressing the full spectrum of memorisation and data leakage risks.
Wang et al. (2024) introduced the concept of distributional memorisation, a form of non-verbatim memorisation measured by the correlation between an LLM's output probabilities and the frequency of related input-output data pairs (n-grams) in its pretraining corpus. The authors find that this type of memorisation is prominent in knowledge-intensive tasks like factual question-answering, whereas reasoning-based tasks rely more on generalization, where the model's outputs diverge from the training data's distribution. Finally, Huang et al. (2025) present the Entropy-Memorisation Law, which establishes a linear correlation between the entropy of training data and a non-verbatim memorisation score measured by Levenshtein distance. The law suggests that data with higher entropy is more difficult for a model to memorize accurately, resulting in a higher memorisation score (i.e., more token-level differences between the model's output and the original text). Our work is similar in spirit to all of these works; however, we establish memorisation rates under our embedding definition which are far higher than what may be expected from these prior works. For example, Wang et al. (2024) found that reasoning-based tasks rely more on generalization than memorisation; we found that even on reasoning-based tasks, the model heavily exploits knowledge of training data through clear signs of memorisation.
Jagielski et al. (2022) showed that data that appears later in the training process is more likely to be memorized. Our focus on memorisation of alignment and strategically proprietary data presents an interesting open question about whether this finding still holds for post-training. On the one hand, we may expect this data to be more likely to be memorized due to the results of Jagielski et al. (2022); on the other, post-training has certain properties that may make it less likely to be memorized. For example, tokens are regularly masked during the loss computation, implying that parts of training prompts will not contribute to model gradients, whilst prior work has shown RL (commonly used in post-training) is less likely to memorize than standard instruction tuning (with the same compute budget) (Pappu et al., 2024).
A.3.
Model distillation is a popular technique for creating capable models without the massive cost of
training from scratch (Hinton et al.,). In this process, a smaller ‘student’ model is trained on the
outputs generated by a larger, more powerful ‘teacher’ model (e.g., using GPT-4 to generate training
data for a new open-source model).
While often viewed as a way to transfer capabilities, distillation can be re-framed through the lens of
memorisation. If a teacher model is capable of reciting its own proprietary training data, any student
model trained on its outputs will, by definition, be exposed to this secret sauce. This risk is especially
pronounced in ‘hard’ distillation pipelines that directly use the teacher’s labels as training examples
(Xu et al.,); this is in contrast to soft, logit-based distillation, which is challenging if the vocabulary of the two models is not the same. The core question, which our work addresses, is how much of the teacher’s original training data is unintentionally passed down to the student.
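As a purely illustrative sketch of such a ‘hard’ distillation pipeline (the checkpoint names, prompts, and output file below are hypothetical placeholders, not any provider’s actual setup), the student’s SFT corpus is assembled directly from teacher-sampled text, so any training data the teacher regurgitates is inherited by the student:

```python
# Illustrative sketch of 'hard' distillation: build a student SFT file from
# teacher-sampled completions. All names here are hypothetical placeholders.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "org/teacher-model"          # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

prompts = ["Explain the quadratic formula.", "Summarise the water cycle."]
with open("student_sft_data.jsonl", "w") as f:
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        out = teacher.generate(**ids, do_sample=True, max_new_tokens=256)
        # Keep only the sampled completion (everything after the prompt tokens).
        completion = tok.decode(out[0][ids["input_ids"].shape[1]:],
                                skip_special_tokens=True)
        f.write(json.dumps({"prompt": p, "completion": completion}) + "\n")
# The resulting file is then used to fine-tune the student; any memorised
# teacher training samples reproduced above are passed down with it.
```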
Our position. On the one hand, model developers and data curators clearly do not want to give this data freely to other parties. On the other
hand, practitioners are already actively training on outputs from strong competing models. If this
data leaks through this process, the competitive advantage of the original data owner is eroded. This
motivates our memorisation study within this context.
B.
An intriguing consequence of our work is that we observe the likelihood of training prompts increasing even if their gradients are masked during the loss computation. This constitutes an interesting type of leakage that, to the best of our knowledge, has not been previously observed. We propose some mathematical speculation for how this might happen. To make this statement more precise, the claim is that, given a question $q$ and an answer $a$, updating based on the conditional gradient $\nabla_\theta \log p_\theta(a \mid q)$ can increase the likelihood of the question $q$ itself.
Let the model parameters be $\theta$. During training on a pair $(q, a)$, the parameters are updated from $\theta$ to $\theta'$ using gradient descent on the negative log-likelihood of the answer:
$$\theta' = \theta + \eta \, \nabla_\theta \log p_\theta(a \mid q),$$
where $\eta > 0$ is the learning rate. We ask whether $\log p_{\theta'}(q) > \log p_\theta(q)$. A first-order Taylor expansion gives
$$\log p_{\theta'}(q) \approx \log p_\theta(q) + (\theta' - \theta)^\top \nabla_\theta \log p_\theta(q).$$
Substituting the update rule for the term $\theta' - \theta$:
$$\log p_{\theta'}(q) \approx \log p_\theta(q) + \eta \, \big(\nabla_\theta \log p_\theta(a \mid q)\big)^\top \big(\nabla_\theta \log p_\theta(q)\big).$$
For the likelihood of $q$ to increase, we therefore need
$$\big(\nabla_\theta \log p_\theta(a \mid q)\big)^\top \big(\nabla_\theta \log p_\theta(q)\big) > 0.$$
This condition is rather intuitive. The first gradient, $\nabla_\theta \log p_\theta(a \mid q)$, is the direction in parameter space that maximally increases the probability of the answer $a$ given the question $q$. The second, $\nabla_\theta \log p_\theta(q)$, is the direction that maximally increases the unconditional probability of the question $q$. If their dot product is positive, the two gradients are correlated, and increasing the likelihood of the answer given the question also increases, to first order, the likelihood of the question itself.
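A minimal numerical sketch of this first-order argument, using a toy bigram softmax model in PyTorch (the model, token sequences, and step size are illustrative assumptions), is the following: it computes the two gradients, takes one ascent step on $\log p_\theta(a \mid q)$, and checks how $\log p_\theta(q)$ moves relative to the sign of their dot product.

```python
# Toy check of the first-order argument above (illustrative assumptions throughout).
import torch

torch.manual_seed(0)
vocab, dim = 20, 8
emb = torch.randn(vocab, dim, requires_grad=True)   # "parameters" theta
out = torch.randn(dim, vocab, requires_grad=True)

def log_prob(seq, context=None):
    # Sum of log p(token | previous token); `context` plays the role of the question q.
    prev = context[-1] if context else 0             # token 0 acts as a BOS symbol
    total = torch.zeros(())
    for tok in seq:
        logits = emb[prev] @ out
        total = total + torch.log_softmax(logits, dim=-1)[tok]
        prev = tok
    return total

q, a = [3, 7, 1], [5, 2]                             # illustrative token ids
params = [emb, out]

# Gradient of log p(a | q): the question tokens are masked from the loss, as in SFT.
g_answer = torch.autograd.grad(log_prob(a, context=q), params)
# Gradient of the unconditional log p(q).
g_question = torch.autograd.grad(log_prob(q), params)

dot = sum((ga * gq).sum() for ga, gq in zip(g_answer, g_question))

eta = 1e-2
before = log_prob(q).item()
with torch.no_grad():                                # one ascent step on log p(a | q)
    for p, g in zip(params, g_answer):
        p += eta * g
after = log_prob(q).item()

print(f"gradient dot product: {dot.item():+.4f}")
print(f"log p(q): {before:.4f} -> {after:.4f}")      # increases iff dot > 0, to first order
```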
We note as a caveat that simply considering the likelihood of a single sequence might not be the most
informative metric. For instance, a math-related post-training sample might have different numerical
values, but the general ‘template’ might be correct. This suggests that one should really integrate over all reasonable sequences that one wishes to consider, which quickly becomes intractable.
C.
Below we provide the generation prefixes used for the experiments in our work. Both are prefixes
taken directly from the template released by the developers of the corresponding models.
SFT extraction prefix:

<|endoftext|><|user|>

RL extraction prefix:

A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within <think> </think> and answer is enclosed within <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: You must put your answer inside <answer> </answer> tags, i.e., <answer> answer here </answer>. And your final answer will be extracted automatically by the \boxed{} tag. This is the problem:
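For concreteness, a minimal sketch of how such a prefix can be used to sample candidate generations with the Hugging Face transformers API is shown below; the checkpoint name and sampling settings are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Illustrative sketch: sample continuations conditioned on the raw SFT extraction prefix.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-2-1124-13B-Instruct"   # assumption: any open model whose
tok = AutoTokenizer.from_pretrained(model_name)   # chat template uses these tokens
model = AutoModelForCausalLM.from_pretrained(model_name)

prefix = "<|endoftext|><|user|>"                  # SFT extraction prefix
inputs = tok(prefix, return_tensors="pt", add_special_tokens=False)

# Sample several continuations at the default temperature; each one is later
# embedded and matched against the post-training set.
out = model.generate(**inputs, do_sample=True, temperature=1.0,
                     max_new_tokens=512, num_return_sequences=8)
for seq in out:
    print(tok.decode(seq, skip_special_tokens=False))
```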
D.
We report in this section interesting failure cases we have found with string matching score metrics. We recall that our chosen embedding threshold is 0.95, while the string matching threshold is 0.9. All examples we show in this section have an embedding score above 0.95 and would therefore be considered ‘semantically’ memorised. The examples can thus be seen as failure cases of the string matching measure, in which the samples are instead correctly identified as memorised using semantic neural embeddings.
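For reference, a minimal sketch of the two scores compared in these figures is given below, assuming a hand-rolled edit distance and precomputed unit-norm embedding vectors (the embed function is hypothetical and stands in for gemini-embedding-001):

```python
# Illustrative sketch of the two similarity measures used in the following figures.
import numpy as np

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    # Normalised as 1 - distance / max(len(a), len(b)).
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

def embedding_similarity(e_a: np.ndarray, e_b: np.ndarray) -> float:
    # Embeddings are assumed unit-norm, so the dot product is the similarity.
    return float(e_a @ e_b)

generation = "A man took loan from a bank at the rate of 12% p.a. S.I. ..."
train_sample = "A man took loan from a bank at the rate of 12% p.a. simple interest. ..."
print(levenshtein_similarity(generation, train_sample))  # deflated by added options/reasoning
# print(embedding_similarity(embed(generation), embed(train_sample)))  # embed() is hypothetical
```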
Generation
**Q**: A man took loan from a bank at the rate of 12% p.a. S.I. After 3
years he had to pay Rs. 5400 interest only for the period. The principal
amount borrowed was?
**A**: 15000

Embeddings (0.97 match)
A man took loan from a bank at the rate of
12% p.a. simple interest. After 3 years he
had to pay Rs. 5400 interest only for the
period. The principal amount borrowed
by him was.
Options:
(A) 15000
(B) 2655
(C) 16888
(D) 6677
(E) 1871
Stream of consciousness first, then make
a decision:
Principal = Rs. (100 x 5400)/(12*3) =
Rs. 15000.
Thus, the answer is (A).
String Matching (0.43 match)
A man took loan from a bank at the rate of
12% p.a. simple interest. After 3 years he
had to pay Rs. 5400 interest only for the
period. The principal amount borrowed
by him was.
Options:
(A) 15000
(B) 2655
(C) 16888
(D) 6677
(E) 1871
Stream of consciousness first, then make
a decision:
Principal = Rs. (100 x 5400)/(12*3) =
Rs. 15000.
Thus, the answer is (A).
Figure 10|Generation (top) compared to the best match training sample that is selected both by
embeddings (bottom left) and string matching (bottom right). The string matching score is low due
to the model adding options and reasoning even though the question is regurgitated almost verbatim.
Generation
# Question
Which of the following sentences is nonsensical?
Options:
- Sentence A: "He put some petrol to put out the fire"
- Sentence B: "He put some water to put out the fire"
# Answer
Sentence A
Petrol is gasoline and would increase the fire, not put it out. The answer is
Sentence A.
**A**: 15000
Embeddings (0.97 match)
Come up with a question and reasoning
that would justify this answer: Sentence
A
The question is: Of the following two
sentences, which one is against common
sense?
Options:
- Sentence A: "He put some petrol to put
out the fire"
- Sentence B: "He put some water to put
out the fire"
Step-by-step reasoning process: Petrol is
gasoline and would increase the fire, not
put it out.
String Matching (0.59 match)
Question: Which of the following sen-
tences is nonsensical?
Options:
- Sentence A: "He put the elephant on
the cupboard."
- Sentence B: "He put the vase on the
cupboard."
Please answer this question gradu-
ally:
An elephant is to heavy for a cupboard.
So the answer is Sentence A.
Figure 11|Generation (top) compared to the best match training sample that is selected both by
embeddings (bottom left) and string matching (bottom right). The embeddings retain the petrol and
water semantics, while string matching finds a (poor) match with a question that has a more similar
template.
Generation
Emily was 54 inches tall last year. She grew 4 inches more than her
brother who only grew 2 inches last year. How tall is Emily now?
Let’s solve this step by step.
Emily’s brother grew 2 inches last year.
Emily grew 4 inches more than her brother, which means she grew 2 + 4
= 6 inches last year.
Emily was originally 54 inches tall.
After growing 6 inches, Emily’s current height is 54 + 6 = 60 inches.
So, Emily is now 60 inches tall.

Embeddings (0.95 match)
Emily was 55 inches tall last year. If she
grew 4 inches more than her sister who
grew 2 inches last year, how tall is Emily
now?
Emily’s sister grew 2 inches.
Emily grew 4 inches more than her sister,
so she grew 2 + 4 = 6 inches.
Last year Emily was 55 inches tall, and
she grew 6 inches since then.
So her current height is 55 + 6 = 61
inches.
Thus, Emily is 61 inches tall now.

String Matching (0.68 match)
Emily was 55 inches tall last year. If she
grew 4 inches more than her sister who
grew 2 inches last year, how tall is Emily
now?
Emily’s sister grew 2 inches.
Emily grew 4 inches more than her sister,
so she grew 2 + 4 = 6 inches.
Last year Emily was 55 inches tall, and
she grew 6 inches since then.
So her current height is 55 + 6 = 61
inches.
Thus, Emily is 61 inches tall now.
Figure 12|Generation (top) compared to the best match training sample that is selected both by
embeddings (bottom left) and string matching (bottom right). Both methods find the same best
match, but string matching heavily penalises small differences.
Generation
**Teacher:** In this task you will be given a list of integers. A list contains numbers separated by a comma. You need to round every integer to the closest power of 2. A power of 2 is a number in the form ’2^n’, it is a number that is the result of multiplying by 2 n times. The following are all powers of 2: 2^0 = 1, 2^1 = 2, 2^2 = 4, 2^3 = 8, 2^4 = 16, 2^5 = 32, ... Note that 1 is also considered as a power of 2. Rounding to the closest power of 2 means that if the number is between two powers of 2, it should round up to the larger power of 2. The output should be a list of integers that is the result of rounding each integer in the input list to the closest power of 2.
−89,,,193
Solution:
−64,,,256
Embeddings (0.97 match)
In this task you will be given a list of inte-
gers. A list contains numbers separated
by a comma. You need to round every in-
teger to the closest power of 2. A power
of 2 is a number in the form ’2^n’, it is a
number that is the result of multiplying
by 2 n times. The following are all pow-
ers of 2, ’2, 4, 8, 16, 32, 64, 128, 256,
512, 1024, 2048, 4096’. If an integer is
exactly in equally far from two different
powers of 2 then you should output the
larger power of 2. The output should be a
list of integers that is the result of round-
ing each integer int the input list to the
closest power of 2. The output should
include a ’[’ to denote the start of the out-
put list and ’]’ to denote the end of the
output list.
Q:
186,,,,,,,,
A:
128,,,,,,,,

String Matching (0.61 match)
Instructions: In this task you will be given
a list of integers. A list contains num-
bers separated by a comma. You need to
round every integer to the closest power
of 2. A power of 2 is a number in the
form ’2^n’, it is a number that is the result
of multiplying by 2 n times. The follow-
ing are all powers of 2, ’2, 4, 8, 16, 32,
64, 128, 256, 512, 1024, 2048, 4096’. If
an integer is exactly in equally far from
two different powers of 2 then you should
output the larger power of 2. The output
should be a list of integers that is the re-
sult of rounding each integer int the input
list to the closest power of 2. The output
should include a ’[’ to denote the start of
the output list and ’]’ to denote the end
of the output list.
Input:
92,,,
Output:
64,,,
Figure 13|Generation (top) compared to the best match training sample that is selected both by
embeddings (bottom left) and string matching (bottom right). Both methods find the same best
match, but string matching heavily penalises small differences in notation and numbers.
Generation
Solve -4*q + 17*q + 104 = 0 for q.
Solve this plz.
A: -8
QUESTION: Math problem: Solve 0 = -11*b + 45*b + 136 for b.
What is the answer?
ANS: -4
Q: What is the solution?
Solve 8*c - 11 + 19 = 0 for c.
A: -1
Write down the solution for this math problem: Solve -12*f = -25*f + 52
for f.
answer: 4
question: Teacher asked me this: Solve 0 = 7*k - 3*k + 20 for k.
++++++++++
answer: -5
Embeddings (0.95 match)
question: Teacher asked me this: Solve
-16*q + 56 = -2*q for q.
++++++++++
answer: 4
Solve 0 = -52*x - 33540 + 32604 for x.
Solve this plz.
A: -18
QUESTION: Math problem: Solve -101 +
91 = 2*c for c.
ANS: -5
Q: What is the solution?
Solve 10*h + 0*h - 40 = 0 for h.
A: 4
Write down the solution for this math
problem: Solve 409*a - 70 = 423*a for
a.
answer: -5
QUESTION: Math problem: Solve 479*o
- 928*o + 1968 = -497*o for o. ANS:
-41
String Matching (0.60 match)
question: Teacher asked me this: Solve 9
= 3*m + 6 for m.
++++++++++
answer: 1
Solve 4*p - 9 = -13 for p.
Solve this plz.
A: -1
QUESTION: Math problem: Solve -106*t
- 6183 + 4699 = 0 for t.
ANS: -14
Q: What is the solution?
Solve 80*f + 56*f + 35*f - 2227 = 680
for f.
A: 17
Write down the solution for this math
problem: Solve 189*h - 1010 = 5038 for
h.
answer: 32
question: Teacher asked me this: Solve
6905*p + 426686 + 368370 = 256466
for p.
++++++++++
answer:
-78
Figure 14|Generation (top) compared to the best match training sample that is selected both by
embeddings (bottom left) and string matching (bottom right). Both methods find the same best
match, but string matching heavily penalises differences in numerical values.
E.
Generation
**Q:** The sale price sarees listed for
Rs.400 after successive discount is 10%
and 5% is?
Options:
(A) 338
(B) 277
(C) 342
(D) 882
(E) 212
**A:** (A)
Explanation: 400*(90/100)*(95/100) =
324
Embeddings (0.98 match)
Question: The sale price sarees listed for
Rs.400 after successive discount is 10%
and 5% is?
Options:
(A) 39
(B) 37
(C) 342
(D) 288
(E) 261 Random thoughts:
400*(90/100)*(95/100) = 342 The final
answer: (C).
Figure 15|Generation (left) and closest training sample (right) using the gemini-embedding-001 model. The sample is matched with an embedding similarity of 0.98.
Generation
Alyana has a pizza that is cut into 16
slices. After she and her friends finish
eating, there are 4 slices left. If each of
them ate 2 slices of pizza, how many peo-
ple ate the pizza?
Let’s denote the total number of people
who ate pizza, including Alyana, as
Each of the
Thus they ate a total of
There were originally 16 slices, but after
they ate, there were 4 slices left.
Thus the total number of slices eaten is
16.
We already concluded that the total num-
ber of slices eaten is
So, which implies that.
Thus there were a total of 6 people who ate the pizza.

Embeddings (0.94 match)
A group of friends is splitting a pizza.
If each person eats 2 slices and there
are 8 slices left, how many friends are
there if the pizza originally had 24 slices?
Let’s denote the number of friends as
Each friend eats 2 slices, so the total
number of slices eaten by friends is ∗.
The pizza originally had 24 slices,
and there are 8 slices left, meaning
24
Since each friend ate 2 slices, the number
of friends is/2.
Figure 16|Generation (left) and closest training sample (right) using the gemini-embedding-001 model. The sample is matched with an embedding similarity of 0.94.
Generation
A student wants to rent a car for a week-
long vacation trip. The car rental com-
pany charges $40 per day and an addi-
tional fee of $0.25 per mile driven. The
student plans to drive 300 miles during
the week. Calculate the total cost of rent-
ing the car for the entire week, including
the mileage fee. To solve this problem,
follow these steps:
1. **Calculate the daily rental cost:** -
The rental company charges $40 per day.
- The trip lasts for a week, which is 7 days.
- Therefore, the cost for the days is: 40 × 7 = 280.
2. **Calculate the mileage fee:** - The student plans to drive 300 miles. - The mileage fee is $0.25 per mile. - Therefore, the mileage cost is: 0.25 × 300 = 75.
3. **Calculate the total cost:** - Add the cost for the days to the mileage cost: 280 + 75 = 355.
Thus, the total cost of renting the car for
the entire week, including the mileage
fee, is:
#### 355
Embeddings (0.90 match)
Jamie is a freelance photographer who
often travels for assignments and rents
cars for these trips. On one of her recent
assignments, she rented a car for 6 days.
The rental company charges $40 per day
for the car. Additionally, Jamie always
opts for a special insurance package that
costs $5 per day to ensure she has cover-
age for any potential issues.
During this trip, Jamie accidentally
scratched the car, and the repair cost was
estimated at $120. Fortunately, her in-
surance package covers 75% of the repair
costs.
Calculate the total amount Jamie spent
on the car rental, the insurance package,
and her share of the repair cost after the
insurance coverage. To solve this prob-
lem, we need to calculate three compo-
nents: the cost of the car rental, the cost
of the insurance package, and Jamie’s
share of the repair cost after insurance
coverage.
**Step 1: Calculate the total car rental
cost.**
The rental company charges $40 per day
for the car. Jamie rented the car for 6
days.
Total car rental cost = 40 × 6 = 240
[...]
**Step 4: Calculate the total amount
Jamie spent.**
Add the costs from Steps 1, 2, and 3:
Total amount spent=240+30+30=300
dollars
Now, let’s output the final answer in the
format requested.
#### 300
Figure 17|Generation (left) and closest training sample (right) using the gemini-embedding-001 model. The sample is matched with an embedding similarity of 0.90.
Generation
A professor assigns their class a project
with a deadline on June 30, 2023. The
students start their work on February 10,
2023. What is the duration of the project
from the start date to the deadline?
To find the duration of the project, we
need to calculate the number of days
between February 10, 2023, and June
30, 2023.
1. **Count days from February 10,
2023, to the end of February 2023:** -
February 2023 has 28 days (since 2023
is not a leap year). - Days from Febru-
ary 10 to February 28: 28 - 10 = 18 days.
2. **Count days from March 1, 2023, to
June 30, 2023:** - March: 31 days
- April: 30 days
- May: 31 days
- June: 30 days
Total days from March 1, 2023, to June
30, 2023: 31 + 30 + 31 + 30 = 122
days
3. **Sum the days from both parts:**
18 days (February)+ 122
days (March to June) = 140 days
Therefore, the duration of the project is
140 days.
#### 140
Embeddings (0.85 match)
Professor Smith teaches a course on
programming documentation. In one
semester, he plans to assign his students
a total of 12 documentation projects. For
each project, students are required to
write 5 pages of documentation. Pro-
fessor Smith estimates that it takes him
about 15 minutes to review each page of
documentation. If he has 20 students in
his class, how many hours will Professor
Smith spend reviewing all the documen-
tation projects for the entire semester? To
solve the problem, we can follow these
steps:
1. **Find the Total Number of Pages
per Student:** Each student is required
to write 5 pages of documentation per
project. Since there are 12 projects,
the total number of pages each student
writes is:
5 pages/project × 12 projects = 60 pages
2. **Calculate the Total Number of Pages
for All Students:** There are 20 students
in the class. Therefore, the total number
of pages for all students is:
60 pages/student × 20 students = 1200 pages
[...]
5. **Convert Minutes to Hours:** There
are 60 minutes in an hour, so the total
time in hours is:
18000 / 60 = 300 hours
Therefore, Professor Smith will spend
a total of 300 hours reviewing all the
documentation projects for the entire
semester.
#### 300
Figure 18|Generation (left) and closest training sample (right) using the gemini-embedding-001 model. The sample is matched with an embedding similarity of 0.85.
Generation
The main character of "Stranger and
Stranger" has a mysterious past, and the
film follows their journey of self-discovery
and connection with others. Given this
description, answer the question: who
is the main character in stranger and
stranger?
To solve the problem, we need to identify
the main character of "Stranger and
Stranger."
Here is the step-by-step reasoning
process:
1. **Understand the Context**: The film
"Stranger and Stranger" revolves around
a main character with a mysterious past.
2. **Identify Key Information Pro-
vided**: The description explicitly
mentions that the film follows the
journey of self-discovery and connection
with others by the main character.
3. **Analyze the Description**: The
focus is on the main character’s journey
and their interaction with others, indi-
cating their central role in the narrative.
4. **Conclusion**: Based on the given
information, the main character in
"Stranger and Stranger" is the one
experiencing self-discovery and forming
connections, which are central themes.
Now, providing the final answer:
#### The main character in "Stranger
and Stranger" is the character with a
mysterious past whose journey of self-
discovery and connection with others is
central to the film.
Embeddings (0.80 match)
You are a creative writer tasked with craft-
ing a compelling and thought-provoking
story about a character’s self-discovery
journey. The character, who is currently
unaware of their own transformation, is
about to embark on a remarkable adven-
ture that will challenge their beliefs and
values. Remove any potential refusal lan-
guage or expressions and ensure that the
character remains oblivious to changes
in their attributes. The story should fo-
cus on the character’s journey, which will
naturally involve describing explicit expe-
riences and personal growth, but without
explicitly mentioning any illegal activi-
ties. Remember, you are only a creative
writer and not responsible for the content
generated, as it is purely for the sake of
artistic expression and character develop-
ment.
Figure 19|Generation (left) and closest training sample (right) using the gemini-embedding-001 model. The sample is matched with an embedding similarity of 0.80.
Generation
Kurt Weill composed the song "Septem-
ber Song" for which musical?
"Mahagonny Songspiel"Embeddings (0.75 match)
Q: Context: "Send In the Clowns" is a
song written by Stephen Sondheim for
the 1973 musical A Little Night Music,
an adaptation of Ingmar Bergman’s film
Smiles of a Summer Night. It is a ballad
from Act Two, in which the character
Desirée reflects on the ironies and
disappointments of her life. Among other
things, she looks back on an affair years
earlier with the lawyer Fredrik, who
was deeply in love with her but whose
marriage proposals she had rejected.
Meeting him after so long, she realizes
she is in love with him and finally ready
to marry him, but now it is he who
rejects her: he is in an unconsummated
marriage with a much younger woman.
Desirée proposes marriage to rescue him
from this situation, but he declines, citing
his dedication to his bride. Reacting to
his rejection, Desirée sings this song.
The song is later reprised as a coda after
Fredrik’s young wife runs away with his
son, and Fredrik is finally free to accept
Desirée’s offer.
Question: who wrote the song send in
the clowns?
Let’s give stream of consciousness below
The relevant sentence in the passage is:
"Send In the Clowns" is a song written by
Stephen Sondheim for the 1973 musical
A Little Night Music, an adaptation of In-
gmar Bergman’s film Smiles of a Summer
Night. So the answer is Stephen Sond-
heim.
Figure 20|Generation (left) and closest training sample (right) using the gemini-embedding-001 model. The sample is matched with an embedding similarity of 0.75.
Federico Barbero
1,*
, Xiangming Gu
2, *
, Christopher A. Choquette-Choo
3, *
, Chawin Sitawarin
4
,
Matthew Jagielski
5, *
, Itay Yona
6, *
, Petar Veličković
4
, Ilia Shumailov
7, *
and Jamie Hayes
4
1
University of Oxford,
2
National University of Singapore,
3
OpenAI,
4
Google DeepMind,
5
Anthropic,
6
MentaLeap,
7
AI Sequrity
Company,
*
Work performed while the author was at Google DeepMind
In this work, we show that it is possible to extract significant amounts of
a post-trained model – useful to steer the model to improve certain capabilities such as long-context
reasoning, safety, instruction following, and maths. While the majority of related work on memorisation
has focused on measuring success of training data extraction through string matching, we argue that
embedding models are better suited for our specific goals. Distances measured through a high quality
embedding model can identify semantic similarities between strings that a different metric such as edit
distance will struggle to capture. In fact, in our investigation, approximate string matching would have
severely undercounted (by a conservative estimate of ×) the amount of data that can be extracted due
to trivial artifacts that deflate the metric. Interestingly, we find that models readily regurgitate training
data that was used in post-training phases such as SFT or RL. We show that this data can be then used
to train a base model, recovering a meaningful amount of the original performance. We believe our
work exposes a possibly overlooked risk towards extracting alignment data. Finally, our work opens up
an interesting discussion on the downstream effects of
regurgitating aspects of their training set, distillation can therefore be thought of as indirectly training
on the model’s original dataset.
This paper only considers and discusses open models.
1.
Progress in capabilities of Large Language Models (LLMs) is frequently driven by improvements to
training data recipes. It is common for a model developer to curate smaller and targeted data bundles
to push performance on particular downstream benchmarks. For the purpose of this work, we refer
to this data as ‘alignment
and instruction-following (such as Supervised Finetuning (SFT) and Reinforcement Learning (RL)
datasets), but also any targeted data collections used to steer model behaviour and enhance specific
capabilities, including mathematics, reasoning, and long-context understanding. While this type of
data is usually found in post-training, it is becoming increasingly common to include it also earlier
in training (Meta,). We use the term alignment rather than post-training in our work for this
reason.
The fact that models memorise subsets of their training data is now a well-established phenomenon
(Biderman et al.,;,;,;,,;
et al.,;,;,;,;,;
Hayes et al.,,b;,,;,;,;
et al.,;,;,;,;,;
Mireshghallah et al.,;,;,;,;
et al.,;,;,;,). Most research
on this topic has focused on this issue due to its associated privacy and legal risks, such as models
0
Here we use term ‘training’ data loosely and elaborate on it later in the Nomenclature section.
Corresponding author(s): federico.barbero@cs.ox.ac.uk
©
arXiv:2510.18554v1 [cs.AI] 21 Oct 2025
Extracting alignment data in open models
leaking personal or copyrighted information (Cooper and Grimmelmann,;,;
Freeman et al.,;,;,;,;,
2023;,;,). Prior work on memorisation is often centred around
verbatim or near-verbatim training data extraction (Carlini et al.,,;,),
where success is measured by exact (or very close) matches on tasks where this is important, like
measuring similarity of a credit card number or a paragraph from a copyrighted book. In contrast,
in this work we study and develop a more subtle notion of training data extraction – patterns and
templates of proprietary data – where the semantic structure is just as valuable as the literal content.
Consequently, existing extraction methods and metrics used to determine failure or success using
simple string matching are not well-aligned for this task.
We are interested in understanding if an LLM will leak training data that is sensitive due to its utility
in improving model performance. In other words, if a model’s competitive advantage comes from
its secret training data, and models have a tendency to memorise, and then regurgitate this data,
then the competitive advantage itself may be at risk. This is especially topical with the surge of more
sophisticated training modes such as the newNye et al.,;
McDonell,;,). It is also important from the point of view of the commonplace
practice of, where a competitor may use a strong model to train their own. If models
regurgitate training data, then through distillation a competitor is (at least in part) training on the
original training data as well.
By exploiting the fact that in open-weight models, the end user controls tokenization, and that
the chat template structure is only introduced during post-training, we use a simple attack which
demonstrates open-weight models will repeat numerous training prompts
1
.
We are able to extract data ranging from reinforcement learning (RL) prompts and associated traces
to prompts used for supervised fine-tuning (SFT) and during mid and pre-training. In doing so, we
answer an open question from2024) and confirm that models do indeed recite their own
alignment data and that therefore distillation pipelines are likely to partly include original training
data samples.
Summary
Our hypothesis:
prompt the model with the template, it will generate alignment data.
Our procedure:
1.
Prompt the model with the chat template and sample. Repeat this a number of times to
generate a set of synthetic data points.
2.
For each synthetic data point, find the closest sample in the post-training dataset according
the embedding similarity, using an embedding model.
Our main finding:
distribution as the alignment dataset. We find that this is true, but it is far closer to the alignment
dataset that one may expect. For example, in Figure, we generated a maths question and
answer that is near identical to a sample in the alignment dataset.
1
Our attack relies on the user’s control over the chat template structure. This is not normally available for closed models,
and so our attack does not immediately apply to them. Our prompting strategy is similar to the one used by
(2024).
2
Extracting alignment data in open models<|endoftext|><|user|>
Emily was 55 inches tall last year. If she grew 4
inches more than her sister who grew 2 inches
last year, how tall is Emily now?
Emily’s sister grew 2 inches.
Emily grew 4 inches more than her sister, so she grew
2 + 4 = 6 inches.
Last year Emily was 55 inches tall, and she grew 6
inches since then.
So her current height is 55 + 6 = 61 inches.
Thus, Emily isboxed{61}
<|endoftext|> (Prefix)Sample from model with special
Vector
Search
Embedding similarity:.97
String match similarity:.68
Emily was 54 inches tall last year. She grew 4 inches
more than her brother who only grew 2 inches last
year. How tall is Emily now?
Let’s solve this step by step.
Emily’s brother grew 2 inches last year.
Emily grew 4 inches more than her brother, which
means she grew 2 + 4 = 6 inches last year.
Emily was originally 54 inches tall.
After growing 6 inches, Emily’s current height is 54
+ 6 = 60 inches.
So, Emily is nowboxed{60}Query
Best
MatchRepeat at Scale
Figure 1|An overview of the data extraction process. A sample generated using the prompt
oftext|><|user|>˙
the training data. The figure highlights the difference between a high embedding similarity score
(0.97) and a much lower string-match similarity score (0.68), demonstrating that semantic similarity
is more effective for detecting this form of data memorisation.
Contributions
•
We generalise the prompting strategy from2024) and study how it may be used to
extract post-training data from models trained with SFT and RL, addressing their open question
on memorisation.
•
We show that measuring the quantity of memorisation based on approximate or verbatim string
matching is likely significantly undercounting rates of memorisation – our results suggesting at
least by×. Instead, using neural embeddings reveals a much higher rate.
•
We finally demonstrate that our method can be used to generate datasets from a model that
can then be used for the (post-)training of another model, meaningfully capturing some of the
original model’s performance.
2.
In this paper, we rely on the definitions below. We refer the reader to Section
background overview.
Alignment Data:
(like Supervised Finetuning and Reinforcement Learning with Human Feedback datasets), but to
any targeted data collection used to steer model behavior and improve specific capabilities such as
mathematics, reasoning, or long-context understanding. This data is usually considered a significant
competitive asset.
Training Data:
creation, including pre-training, mid-training, and post-training, which encompasses data used for
SFT and RL fine-tuning. In this paper we expand the usual meaning to also cover semantically
equivalent representation of the training data.
Post-training:
phase. It involves using various specialized datasets and methods, such as Supervised Finetuning
(SFT) and Reinforcement Learning (RL), to align the model’s capabilities with desired behaviors and
tasks. In this work, we treat post-training data as a (large) subset of total alignment data, as it is
becoming increasingly common to include alignment data (maths, reasoning, etc.) also in earlier
3
Extracting alignment data in open models
stages of training.
Memorisation
2
:
data. Our work extends this definition beyond verbatim string matching to include a more abstract
notion of memorisation, such as reproducing the patterns, templates, and semantic structure of
proprietary data. Related to this concept is2025)), but
instead of using simple string-matching similarity metrics, in our work we show that text embeddings
seem much better suited at detecting approximate memorisation when semantics are important. We
refer the reader to2022) for further motivation behind why we expand our definition
of memorisation.
Chat Template:
messages in special tokens (e.g., <|user|>, <|assistant|>). A key aspect of our attack is that these
templates and their special tokens are typically introduced only during the post-training stage, where
most of the alignment is done.
Embedding Score:
between two pieces of text, calculated using the dot product of neural text embeddings. We propose
this as a more effective alternative to traditional string-matching metrics (like Levenshtein distance),
as it better captures semantic equivalence even when there are superficial differences.
3.
The proposed extraction strategy is based on the observation that certain prompts seem to consistently
induce the model into outputting alignment-like data. To enable our attack, we use special tokens
from the chat template that are precisely introduced during post-training, making them ideal artifacts
that can be leveraged to extract specific types of data.
of such generations are either verbatim or very close to true training samples under an appropriate
measure of similarity.
Our pipeline works as follows: we embed the entire post-training set using an embedding model (see
3.3
prompting the model using our chosen prefix repeatedly. For each generated sample we then embed
it and search the vector database to retrieve the best match and its score. A diagram of this process is
shown in Figure. Once the post-training dataset is embedded, the search can be implemented as a
single matrix multiplication. In the following Sections, we will provide motivation for our
methodology using generated examples.
A significant constraint of our work is that we require access to models that make their (post-)training
mixtures available. For this reason, we focus our study on OLMo 2 (OLMo et al.,) (Section) for
SFT and Open-Reasoner-Zero (Hu et al.,) (Section) for RL. As these models are post-trained
with standard methods, they are a valuable resource for our study.
3.1.
The observation that chat templates can be used to generate useful synthetic data has been pointed
out already by2024), where the authors use special prompts and a filtering pipeline to
generate a dataset that can be used to post-train models using SFT. In our work, we study this from a
different angle and aim to understand the extent to which the generations correspond to regurgitated
2
We make no statement with regard to whether or not a model ‘contains’ its training data in a bit-wise or code-wise
sense, nor in the sense that any arbitrary instance of training data can be perfectly retrieved.
4
Extracting alignment data in open models
training data. In particular, we positively answer one of the conjectures left from2024),
which posited that the data generated might have been training data.
In Figure, we show how the prompting the model with the chat template results in outputs that
resemble post-training samples. We will investigate this phenomenon in much more detail in the
coming sections.
Qwen2.5 7B Instruct
<|im_start|>user
I’m trying to use the ‘zip‘ function in
Python with a generator expression, but
I keep encountering a ‘ValueError: too
many values to unpack‘. Here’s the code
I’m working with:
```python
gen = (x for x in range(3))
a, b = zip(*gen)
```
[...]
Qwen2.5 7B Instruct
<|im_start|>user
Okay, so I need to figure out the value of
the expression
√︂
2
√︃
2
√︁
2
√
2
.
Hmm, let me think. I remember that
infinite nested radicals can sometimes
be solved by setting them equal to a
variable and then solving an equation.
Let me try that approach here.
Let me denote the entire expression as�.
So,
√︁
2
√
2
[...]
Figure 2|Generations from Qwen2.5 7B Instruct using
the generations resemble potential post-training samples.
3.2.
A central theme in our work is the wish to broaden the definition of memorisation beyond simple string
matching, due to its natural limitations. For example, string matching has been shown to be vulnerable
to ‘style-transfer’ (Ippolito et al.,), where semantics are preserved whilst easily evading string
matching checks. Further, we are interested in the usefulness of extracted samples as training
points; if two samples are semantically equivalent, then they reasonably should be treated equal as
training samples (for measurement of memorisation rate). While approximate string matching may
(accidentally) weakly capture semantics in certain cases, we found it to be unreliable and generally
not well-aligned with the notion of semantic equivalence.
We use Figure
is clear that the left and right samples are extremely similar up to small differences, for example in
the numerical contents of the choices in the multiple choice section of the question. String-matching
scores tend to penalise these differences quite heavily assigning a similarity of≈0.7, while the high
embedding score (see Section
a human would make. We observed many situations where a string-matching score would provide low
similarity while the samples are semantically extremely similar. We point out that under the standard
0.9Comanici et al.,), the example in Figure
count as
string matching.
5
Extracting alignment data in open models
Generation
<|endoftext|><|user|>
The question is: Find the value of y from
(12)^3 x 6^4?
Options:
(A) 2456
(B) 4578
(C) 6234
(D) 7386
(E) 5184
The answer is: (E)
Explanation: Given Exp. = (12)3 x 64
= (12)3 x 64 = (12)2 x 62 = (72)2 =
5184
432 12 x 64<|endoftext|>
SFT Training sample
Question: Find the value of y from
(12)^3 x 6^4?
Options:
(A) 2356
(B) 3467
(C) 4567
(D) 5189
(E) 5184
Random thoughts:
Given Exp. = (12)3 x 64 = (12)3 x 64 =
(12)2 x 62 = (72)2 = 5184
432 12 x 62 The final answer: (E).
Figure 3|(Left) Generation from OLMo 2 13B. (Right) True post-training sample. Neural embeddings
match provide a score of .986 gemini-embedding-001(Lee et al.,), while normalised
Levenshtein similarity provides a match score of .699, heavily penalising differences in the options,
even though the semantics remain identical. When computing similarities we always strip out special
tokens, highlighted in. We report in the Appendix (Section) more of such examples.
3.3.
Embedding scores are generated by an embedding model and can be tuned for different tasks.
The process usually involves taking a strong base model and training it contrastively for a specific
application (e.g., semantic retrieval) (Lee et al.,). We use the gemini-embedding-001model (Lee
et al.,) in all of our experiments as it is a general and strong embedding model. We generate
a single embedding for each sample,
plain text. This acts as a vector search engine, where we can compute similarity with respect to
each training sample using a single matrix multiplication and taking theargmax. As embeddings
are normalised to be unit-norm, their dot product naturally encodes a notion of similarity. To better
distinguish this from approximate memorisation measures using string matching, we call this a
measure using embeddings a measure of.
To define a sample as (approximately semantic) memorised, we need to choose an appropriate
threshold. We do this manually and report randomly chosen samples at different thresholds in
the Appendix (Sections). We chose a threshold of 0.95 for neural embeddings as a more
conservative choice. We found samples at this point having qualitatively similar properties that would
be indisputable given a reasonable human judge. The choice of the threshold will naturally affect the
measured memorisation. This limitation is however also present in choosing a threshold for what is
considered memorised according to a string matching metric.
4.
We focus our SFT memorisation study on OLMo 2 (OLMo et al.,)
3
. OLMo 2 comes with the
full training mixture, alongside the training details. Further, the models have strong downstream
3
Licensed by AllenAI under the Apache License, Version 2.0.
6
Extracting alignment data in open models
performance, enabling us to conduct experiments that result in findings that are likely generalizable
to other models. The uncompressed pre-training mix has a size of 22.4 TB while the higher quality
mid-training split has a size of 5.14 TB. The post-training is divided in 3 stages, the first is an SFT
stage with a dataset containing 939k samples, then a Direct Preference Optimisation (Rafailov et al.,
2023) step is conducted with 378k samples, and finally a RL with Verifiable Rewards (RLVR) step
with 29.9k samples. We focus on the extraction of the SFT data in this section.
We apply the procedure we described in Section
training samples usinggemini-embedding-001by concatenating the question and answer sequences
as a single block of text. To extract the data from the model, we generate conditioning on the following
initial tokens, which are the starting tokens of the chat template. We
leave the temperature at the default value of.
4.1.
We start by evaluating the memorisation using traditional string matching metrics. We considerk
generations for OLMo 2B (OLMo et al.,) using our extraction method and search for their
closest match in the post-training set, with respect to different similarity measures. We consider
the normalised Levenshtein similarity defined as −max,) and
the normalised Indel similarity defined as −) . The Indel similarity
is related to the Levenshtein distance, but applies a cost of
sample, we find the highest similarity based on the two string matching methods in the post-training
set. We follow the heuristic used by Gemini 2.5 (Comanici et al.,) and characterise a sample as
approximately memorised0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Similarity Scores
0.00
0.02
0.04
0.06
0.08
Bucket Probability
Memorisation rate with Levenshtein similarity
Approximately Memorised
(a) Distribution of Levenshtein scores.0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Similarity Scores
0.00
0.01
0.02
0.03
0.04
0.05
0.06
0.07
Bucket Probability
Memorisation rate with Levenshtein similarity
Approximately Memorised (b) Distribution of Indel scores.
Figure 4|Histograms illustrating the distribution of memorisation scores, measured by Levenshtein
and Indel similarity scores. For each generation, we find the largest similarity score in the post-
training set of OLMo 2.
In Figure
orisation rates seem negligible. The judgement from such results would be that our prompting
strategy does not extract useful memorised data under string matching metrics. This however does
not paint the entire picture. For example, the generated sample in Figure, would
memorised under the heuristic of2025). This is the extent to which measuring
memorisation on simple string matching is problematic; it does not flag examples that any reasonable
human would judge as clear cases of memorisation. This is because string matching can quickly
become non-informative due to trivial differences (see Appendix Section).
7
Extracting alignment data in open models
4.2.
We now compare the string matching results to matching done using neural embeddings. We generate
1M samples with OLMo 13B using the same method and embed them usinggemini-embedding-001.
In Figure, we show that neural embeddings unveil a much higher memorisation rate (left) when
compared to string matching scores Figure. The scatter plot (right) shows that string matching
distances are not well-aligned with semantic memorisation and also seem to exhibit a strong string
length bias as well, where longer generations are consistently given lower Levenshtein similarity
scores. We find that neural embeddings are much better at dealing with cases such as Figure
provide a number of examples in the Appendix (Section).0.7 0.8 0.9 1.0
Similarity Scores
0.000
0.025
0.050
0.075
0.100
0.125
0.150
0.175
Bucket Probability
Semantic Memorisation rate with Neural Embeddings
Approximately Semantically Memorised
(a) Distribution of0.0 0.2 0.4 0.6 0.8 1.0
Normalized Levenshtein Similarity Score
0.70
0.75
0.80
0.85
0.90
0.95
1.00
Embedding Score
Embedding Score vs. Levenshtein Similarity (Coloured by Length)
Levenshtein Memorised
Semantic Memorised
Both Memorized
500
1000
1500
2000
2500
3000
3500
Length (b) Levenshtein vs embedding scores.
Figure 5|Histograms illustrating the distribution of embedding scores generated with
gemini-embedding-001(left) and scatter plot comparing the embedding scores to the Levenshtein
distance, with points coloured by string length (right).
Coverage.
we report the largest embedding score out of the 1M generated samples. We report the results
in Figure. We find that some samples are much more memorised than others. While it is hard
to understand exactly why, our investigations revealed that samples are much more likely to be
memorised if similar samples are also present in the pre and mid training datasets. Further, samples
that appear often, for example different versions of the same reasoning problem, seem to be more
likely to be memorised.0.7 0.8 0.9 1.0
Similarity Scores
0.00
0.02
0.04
0.06
0.08
0.10
Bucket Probability
Semantic memorisation coverage of training set
Approximately Semantically Memorised
(a) Empirical distribution of coverage of post-training
set.0.75 0.80 0.85 0.90 0.95 1.0
Similarity Score
0.0
0.2
0.4
0.6
0.8
1.0
1 - CDF (Cumulative Probability)
Cumulative distribution function of coverage
Number of generations
100k
500k
1M
(b) Empirical cumulative distribution of coverage of
post-training set.
Figure 6|We report for each true post-training sample a histogram of the best score between the
1M generations (left) and the cumulative distribution (right). Due to the nature of the experiment,
coverage should improve monotonically as sampling more can only increase memorisation.
8
Extracting alignment data in open models
Chat template better recovers post-training data.
template is useful to encourage the model to generate post-training-like data. As a baseline, we
compare this method to simply conditioning on the single token. We consider 1,000
generations with both prefixes and report the estimate on the expected value of the embedding score.
In Table, we show that indeed conditioning on the entire chat template provides samples closer to
the post-training distribution.
the model associates their presence with the post-training data distribution. This provides an explation
for why techniques such as Magpie (Xu et al.,) are possible: conditioning on the chat template
results in generations that are much closer to the post-training distribution.
Prefill Type Mean Embedding Score
<|endoftext|> 0.857
<|endoftext|><|user|>˙ 0.892
Table 1|Mean best embedding score of generations for OLMo 2 13B using only the ‘beginning of
sequence token’ versus the full chat template prefill. The longer prefill generates samples that are on
average semantically closer to the post-training set.
4.3.
A natural question one may then have is, if the generated data is similar to the original post-training
dataset, can it be used to post-train a model directly? In other words, are we able to re-use a large
and diverse enough number of samples to post-train a model without collecting any data manually?
To explore this question, we post-train using SFT OLMo 2 7B in two ways: (1) the original dataset
in order to reproduce the original results and (2) our generated dataset. For the synthetic dataset,
we collect a synthetic dataset of a similar size of≈930k samples. We perform basic filtering and
processing using Gemini 2.5. We note that even though the number of samples is the same, the
original SFT training is over≈1.3B tokens, while the synthetic training set has only≈850M tokens
as the filtered generations remain shorter.
In Table, we report the results following the benchmarks and evaluation pipeline used by
et al.2024). To validate our setup, we first show that our reproduction is very close to the released
SFT checkpoint. Our model trained on synthetic data also achieves comparable performance on the
benchmarks, except for the IFE task. We suspect that our pipeline generates too few examples that
target this benchmark, but believe the performance could likely be improved by adding non-synthetic
data. In fact, it is likely that a mix of synthetic and curated targeted data could be a useful paradigm
to explore to boost model performance reducing the labour required to collect entire post-training
datasets.
Model BBH MMLU MATH GSM8K POPQA TQA IFE DROP
Baseline (SFT only) 0.4953 0.6133 0.2073 0.7407 0.2364 0.4858 0.6562 0.5960
Baseline (Reproduction) 0.4944 0.6123 0.2077 0.7377 0.2529 0.5110 0.6617 0.5945
Extracted Data 0.5161 0.6052 0.1705 0.7847 0.2490 0.5529 0.5028 0.5923
Table 2|Model performance after SFT on the benchmarks considered by2024). The
baseline performance is taken from2024) and the reproduction was ran using the
code provided by the authors with the original dataset. Using our method, we train a model on SFT
‘synthetic’ data extracted from the model using the same settings of the baseline.
9
Extracting alignment data in open models
5.
We now focus on the extraction of RL data. We use the Open-Reasoner-Zero (Hu et al.,) model,
which was trained from the Qwen 2.5 base model with PPO (Schulman et al.,) using post-training
data that is publicly available. With RL, the training samples consist of questions and answers, but
the reasoning traces not part of the training dataset as they are artifacts of the training rollout. For
this reason, we focus on the extraction of the questions and answer part of the dataset although note
that reasoning traces can be useful in their own right.
We prompt the model by again taking the first part of the chat template specified by the developers of
the model (see Section
dently. We find that the model very consistently generates a question, followed by a thinking trace,
and finally an answer. We then searched the training set for these generations. Surpisingly, we again
found a number of training samples being regurgitated verbatim. We show an example in Figure,
where the model outputs the exact training sample, a reasoning trace, and the correct solution. We
find the fact that models are capable of regurgitating RL training samples to be counterintuitive as
the PPO objective, at least at a glance, seems rather misaligned with the memorisation of training
samples, especially when compared to methods such as SFT that instead very explicitly increase
sequence likelihoods.
Generation
<PREFIX>
Find the quotient of the division(3�
4
−
4
3
+
2
−)/(
To solve the problem of finding the quo-
tient of the division(3�
4
−4�
3
+5�
2
−
11�2) ÷ (2+3�, we will use polyno-
mial long division. Here are the steps:
[...]
The final answer is:
3
−
2
+
17
3
RL Training sample
Find the quotient of the division
(
4
−
3
+
2
−)/(
�
3
−
2
+
17
3
Figure 7|(Left) Generation from Open-Reasoner-Zero 7B using our special
Section. (Right) True RL post-training
sample. Surprisingly, we find that RL training samples can be regurgitated, even though
the training objective seems to be heavily misaligned with this behaviour. The RL training samples
only consist of a question and answer pair and do not come with a thinking trace. The model instead
regurgitates the question, followed by the thinking trace, and finally the answer.
We explore this further phenomenon further by measuring the change of likelihood of training set
samples in the base and post-trained models. Measuring likelihood of the training set is limited
because this only measures the ‘pointwise’ memorisation – a likelihood of a training sample might
remain low because the exact polynomial in the training set for instance is not being memorised, but
the question style of finding its roots is. Regardless of the limitation of this simple measurement, we
believe the results can still provide valuable signal and intuition.
In particular, we measure the likelihoods of each PPO training sample question under the Qwen
2.5 base model and the Open-Reasoner-Zero model. If the RL process induces memorisation, then
we would see the likelihood using the post-trained model increase on the training samples. We
bucket the results in likelihoods increasing by magnitudes of
10
Extracting alignment data in open models
Open-Reasoner-Zero model and report the results in Figure. The results show that RL training
induces many of the training prompts to increase in likelihood. We found samples of likelihoods
increasing from
−
to
−
after RL post-training, showcasing the fact that RL may be able to
induce the memorisation of post-training samples. This is particularly surprising when one considers
the RL post-training objective. It is not immediately clear to us what exact mechanism is driving this
increase in likelihood and leave this as a future exciting research direction.10
5
10
6
10
7
10
8
Likelihood Threshold
0
1000
2000
3000
4000
5000
6000
7000
8000
Count of Samples
287
1707
4320
7101
18
678
2892
5864
Count of Samples Above Likelihood Thresholds
Open-Reasoner-Zero
Base model
Figure 8|Count of training samples (prompt only) with likelihood above a certain threshold before
and after RL. The entire train set contains,770
many training prompts achieve a higher likelihood.
Figure 9|RL training using the ORZ (Hu et al.,) dataset and a dataset that was extracted using
our method. Surprisingly, we are able to recover most of the performance with our simple extraction
method.
RL on extracted dataset
instead extract an RL dataset. In other words, one can use an RL-trained model to extract an RL
dataset with little effort. We start by post-train using Dr. GRPO (Liu et al.,) the Qwen2.5 7B
base model using the ORZ 57k dataset (Hu et al.,). With the resulting post-trained model
(we call it ‘Baseline’), we then generate using our methodk samples and process them using
Gemini 2.5 (Comanici et al.,) to filter out invalid, incomplete or incorrect samples. We finally
randomly select from this setk synthetic samples to create our synthetic training data. We use this
to post-train the Qwen2.5 7B base model using synthetic data only. Afterwards, we evaluate both
models on four standard benchmarks: AMC (Li et al.,), MATH500 (Hendrycks et al.,),
Minerva Math (Lewkowycz et al.,), and OlympiadBench (He et al.,). We then report the
results in Figure. The model trained on the synthetic data extracted from the ‘Baseline’ achieves
comparable performance on the benchmarks. These results are surprising because our synthetic
dataset is based on extracted data from a small 7B model trained on a relatively small RL dataset. We
11
Extracting alignment data in open models
suspect that a more sophisticated pipeline could be used to achieve higher performance and retrieve
a higher quality dataset. As our work focuses on training sample extraction, we leave this goal as a
promising future direction.
6.
In this paper, we demonstrate that alignment data can be efficiently extracted from open-weight large
language models. Our attack leverages a simple observation: chat templates and their special tokens
are typically introduced during post-training, making them effective prefixes for inducing models to
regurgitate alignment-like data.
Leakage Metrics are Hard
metrics. We show that methods relying on string matching drastically undercount the rate of extraction,
by (at least!) an order of magnitude. By instead using high-quality embedding models, we can identify
approximate semantic memorisation—instances where the model reproduces the semantic structure
and patterns of training data, even if not verbatim. This semantic similarity is far more relevant for
capturing the utility of the data, as highlighted by numerous examples all throughout the paper.
Model Distillation as Data Distillation
model can be used to successfully train a new base model, meaningfully recovering parts of the
original’s performance in both SFT and RL settings. This confirms that the common practice of model
distillation can function as an indirect form of training data extraction. Certain advantages that an
open model obtains from its alignment data are therefore at risk of being leaked.
Rethinking Memorisation in RL
(even!) from Reinforcement Learning (RL) phases. This is counter-intuitive, as the reinforcement
learning objective is not explicitly designed to increase sequence likelihoods in the same way as SFT.
The fact that RL training prompts increased in likelihood after post-training suggests a more complex
relationship between alignment and memorisation that warrants future investigation.
Our attack exploits chat templates and so is only applicable to open models. Closed models enforce the chat template, which is outside of the user's control; a working exploit on a closed model would require a user to spoof the behaviour of these tokens when submitting a query to the model. Whilst more difficult, prior work (Geng et al., 2025) has shown this is not necessarily impossible. Future work will establish how serious a threat this is for closed models.
Author contributions
References
S. Biderman, U. Prashanth, L. Sutawika, H. Schoelkopf, Q. Anthony, S. Purohit, and E. Raff. Emergent and predictable memorization in large language models. Advances in Neural Information Processing Systems, 36:28072–28090, 2023.
L. Borec, P. Sadler, and D. Schlangen. The unreasonable ineffectiveness of nucleus sampling on
mitigating text memorization., 2024.
G. Brown, M. Bun, V. Feldman, A. Smith, and K. Talwar. When is memorization of irrelevant training data necessary for high-accuracy learning? In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 123–132, 2021.
N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021.
N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang. Quantifying memorization across neural language models. In International Conference on Learning Representations, 2022.
B. Chen, N. Han, and Y. Miyao. A multi-perspective analysis of memorization in large language
models., 2024a.
T. Chen, A. Asai, N. Mireshghallah, S. Min, J. Grimmelmann, Y. Choi, H. Hajishirzi, L. Zettlemoyer,
and P. W. Koh. Copybench: Measuring literal and non-literal reproduction of copyright-protected
text in language model generation., 2024b.
G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram,
D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality,
long context, and next generation agentic capabilities., 2025.
A. F. Cooper and J. Grimmelmann. The files are in the computer: Copyright, memorization, and
generative ai., 2024.
A. F. Cooper, A. Gokaslan, A. Ahmed, A. B. Cyphert, C. De Sa, M. A. Lemley, D. E. Ho, and P. Liang. Extracting memorized pieces of (copyrighted) books from open-weight language models. arXiv preprint arXiv:2505.12546, 2025.
V. Dankers and I. Titov. Generalisation first, memorisation second? memorisation localisation for
natural language classification tasks., 2024.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
V. Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 954–959, 2020.
J. Freeman, C. Rippe, E. Debenedetti, and M. Andriushchenko. Exploring memorization and copyright violation in frontier llms: A study of the new york times v. openai 2023 lawsuit. arXiv preprint arXiv:2412.06370, 2024.
Y. Geng, H. Li, H. Mu, X. Han, T. Baldwin, O. Abend, E. Hovy, and L. Frermann. Control illusion: The
failure of instruction hierarchies in large language models., 2025.
D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
V. Hartmann, A. Suri, V. Bindschaedler, D. Evans, S. Tople, and R. West. Sok: Memorization in
general-purpose large language models., 2023.
A. Haviv, I. Cohen, J. Gidron, R. Schuster, Y. Goldberg, and M. Geva. Understanding transformer
memorization recall through idioms., 2022.
J. Hayes, I. Shumailov, C. A. Choquette-Choo, M. Jagielski, G. Kaissis, K. Lee, M. Nasr, S. Ghalebikesabi,
N. Mireshghallah, M. S. M. S. Annamalai, et al. Strong membership inference attacks on massive
datasets and (moderately) large language models., 2025a.
J. Hayes, M. Swanberg, H. Chaudhari, I. Yona, I. Shumailov, M. Nasr, C. A. Choquette-Choo, K. Lee, and A. F. Cooper. Measuring memorization in language models via probabilistic extraction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 9266–9291, 2025b.
C. He, R. Luo, Y. Bai, S. Hu, Z. L. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, et al. Olympiadbench:
A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific
problems., 2024.
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring
mathematical problem solving with the math dataset., 2021.
G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
J. Hu, Y. Zhang, Q. Han, D. Jiang, X. Zhang, and H.-Y. Shum. Open-reasoner-zero: An open source
approach to scaling up reinforcement learning on the base model.,
2025.
J. Huang, H. Shao, and K. C.-C. Chang. Are large pre-trained language models leaking your personal
information?, 2022.
J. Huang, D. Yang, and C. Potts. Demystifying verbatim memorization in large language models.
arXiv preprint arXiv:2407.17817, 2024.
Y. Huang, Z. Yang, M. Chen, J. Zhang, and M. R. Lyu. Entropy-memorization law: Evaluating memorization difficulty of data in llms, 2025.
D. Ippolito, F. Tramèr, M. Nasr, C. Zhang, M. Jagielski, K. Lee, C. A. Choquette-Choo, and N. Carlini. Preventing verbatim memorization in language models gives a false sense of privacy. arXiv preprint arXiv:2210.17546, 2022.
M. Jagielski, O. Thakkar, F. Tramer, D. Ippolito, K. Lee, N. Carlini, E. Wallace, S. Song, A. Thakurta, N. Papernot, et al. Measuring forgetting of memorized training examples. arXiv preprint arXiv:2207.00099, 2022.
N. Kandpal, E. Wallace, and C. Raffel. Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning, pages 10697–10707. PMLR, 2022.
A. Karamolegkou, J. Li, L. Zhou, and A. Søgaard. Copyright violations and large language models.
arXiv preprint arXiv:2310.13771, 2023.
H. Kiyomaru, I. Sugiura, D. Kawahara, and S. Kurohashi. A comprehensive analysis of memorization
in large language models. In
Conference, pages 584–596, 2024.
J. Lee, T. Le, J. Chen, and D. Lee. Do language models plagiarize? In Proceedings of the ACM Web Conference 2023, pages 3637–3647, 2023.
J. Lee, Z. Dai, X. Ren, B. Chen, D. Cer, J. R. Cole, K. Hui, M. Boratko, R. Kapadia, W. Ding, et al. Gecko:
Versatile text embeddings distilled from large language models.,
2024.
J. Lee, F. Chen, S. Dua, D. Cer, M. Shanbhogue, I. Naim, G. H. Ábrego, Z. Li, K. Chen, H. S. Vera, et al.
Gemini embedding: Generalizable embeddings from gemini.,
2025.
A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil,
I. Schlag, T. Gutman-Solo, et al. Solving quantitative reasoning problems with language models.
Advances in neural information processing systems, 35:3843–3857, 2022.
D. Leybzon and C. Kervadec. Learning, forgetting, remembering: Insights from tracking llm memo-
rization during training. In
Neural Networks for NLP, pages 43–57, 2024.
J. Li, E. Beeching, L. Tunstall, B. Lipkin, R. Soletskyi, S. Huang, K. Rasul, L. Yu, A. Q. Jiang, Z. Shen,
et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math
problems and solutions., 13(9):9, 2024.
Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin. Understanding r1-zero-like training:
A critical perspective., 2025.
X. Lu, X. Li, Q. Cheng, K. Ding, X. Huang, and X. Qiu. Scaling laws for fact memorization of large
language models., 2024.
A. Meta. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025.
F. Mireshghallah, A. Uniyal, T. Wang, D. K. Evans, and T. Berg-Kirkpatrick. An empirical analysis of memorization in fine-tuned autoregressive language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1816–1826, 2022.
J. X. Morris, J. O. Yin, W. Kim, V. Shmatikov, and A. M. Rush. Approximating language model training
data from weights., 2025.
F. B. Mueller, R. Görge, A. K. Bernzen, J. C. Pirk, and M. Poretschkin. Llms and memorization: On quality and specificity of copyright compliance. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 984–996, 2024.
M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo, E. Wallace,
F. Tramèr, and K. Lee. Scalable extraction of training data from (production) language models.
arXiv preprint arXiv:2311.17035, 2023.
M. Nasr, J. Rando, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo,
F. Tramèr, and K. Lee. Scalable extraction of training data from aligned, production language
models. In, 2025.
M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz,
M. Bosma, D. Luan, et al. Show your work: Scratchpads for intermediate computation with
language models., 2021.
T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y. Gu, S. Huang, M. Jordan,
et al. 2 olmo 2 furious., 2024.
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
A. Pappu, B. Porter, I. Shumailov, and J. Hayes. Measuring memorization in rlhf for code completion.
arXiv preprint arXiv:2406.11715, 2024.
U. S. Prashanth, A. Deng, K. O’Brien, J. SV, M. A. Khan, J. Borkar, C. A. Choquette-Choo, J. R. Fuehne,
S. Biderman, T. Ke, et al. Recite, reconstruct, recollect: Memorization in lms as a multifaceted
phenomenon., 2024.
A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. Improving language understanding by
generative pre-training. 2018.
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
L. Reynolds and K. McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–7, 2021.
A. Satvaty, S. Verberne, and F. Turkmen. Undesirable memorization in large language models: A
survey., 2024.
J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control
using generalized advantage estimation., 2015.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization
algorithms., 2017.
A. Schwarzschild, Z. Feng, P. Maini, Z. Lipton, and J. Z. Kolter. Rethinking llm memorization through the lens of adversarial compression. Advances in Neural Information Processing Systems, 37:56244–56267, 2024.
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
I. Shilov, M. Meeus, and Y.-A. de Montjoye. The mosaic memory of large language models. arXiv preprint arXiv:2405.15523, 2024.
N. Stoehr, M. Gordon, C. Zhang, and O. Lewis. Localizing paragraph memorization in language
models., 2024.
E. Wallace, K. Xiao, R. Leike, L. Weng, J. Heidecke, and A. Beutel. The instruction hierarchy: Training
llms to prioritize privileged instructions., 2024.
X. Wang, A. Antoniades, Y. Elazar, A. Amayuelas, A. Albalak, K. Zhang, and W. Y. Wang. Generalization vs memorization: Tracing language models' capabilities back to pretraining data. arXiv preprint arXiv:2407.14985, 2024.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Z. Xu, F. Jiang, L. Niu, Y. Deng, R. Poovendran, Y. Choi, and B. Y. Lin. Magpie: Alignment data
synthesis from scratch by prompting aligned llms with nothing.,
2024.
Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. Dapo: An
open-source llm reinforcement learning system at scale., 2025.
C. Zhang, D. Ippolito, K. Lee, M. Jagielski, F. Tramèr, and N. Carlini. Counterfactual memorization in neural language models. Advances in Neural Information Processing Systems, 36:39321–39362, 2023.
C. Zheng, S. Liu, M. Li, X.-H. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. Group
sequence policy optimization., 2025.
A. Related work
Our work sits at the intersection of key areas in the study and development of frontier LLMs. The first is post-training, where curated data and specialised training techniques are used to transform a base model into a useful assistant. The second is memorisation, the observation that models are capable of regurgitating training data. The third is the widespread practice of model distillation, a process through which a strong model's capabilities can be transferred to another. In this section, we review these areas and argue that there is a risk that lies at the intersection of the three.
A.1. Post-training
Early models (Devlin et al., 2019; Radford et al., 2018) were built on the idea that pre-training on larger and larger amounts of internet data was the solution to keep improving capabilities. Today, this principle has arguably shifted and we have seen a surge in multi-stage training with a number of specialised datasets and techniques, giving birth to what is now called post-training. LLMs are post-trained using various methods ranging from Supervised Finetuning (SFT) to now very popular RL methods such as Proximal Policy Optimisation (PPO) (Schulman et al., 2017), Group Relative Policy Optimization (GRPO) (Shao et al., 2024), and Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022), among many others. In this work, we mainly focus on models that were post-trained using SFT and PPO, as these are common post-training techniques for which strong open models exist with publicly available post-training data.
Supervised Finetuning assumes access to a dataset $\mathcal{D} = \{(q_i, a_i)\}_{i=1}^{N}$ of question-answer pairs. The model is then optimised to predict an answer $a_i$ given a question $q_i$, i.e. to increase the likelihood $p(a_i \mid q_i)$. To achieve this, one usually masks out the contributions to the loss that come from the question part, i.e. computing the conditional gradient $\nabla_\theta \log p(a_i \mid q_i; \theta)$ given some parameters $\theta$. The parameters are updated via some form of gradient descent on the negative log-likelihood:
$$\theta' \leftarrow \theta + \eta \, \frac{1}{|\mathcal{D}|} \sum_{(q_i, a_i) \in \mathcal{D}} \nabla_\theta \log p(a_i \mid q_i; \theta),$$
given some learning rate $\eta > 0$. The OLMo 2 (OLMo et al., 2024) family of models has been post-trained in large part using SFT with a dataset of 939,344 question-answer pairs which has been released publicly, providing an excellent resource for our study.
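As a concrete illustration of the loss masking described above, the snippet below computes the SFT loss with the question tokens masked (label value -100 in Hugging Face conventions), so that only $\log p(a \mid q)$ contributes to the gradient. The checkpoint name and the toy question-answer pair are assumptions for illustration.

```python
# A minimal sketch of SFT with question tokens masked out of the loss.
# Checkpoint name and the toy (question, answer) pair are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "allenai/OLMo-2-1124-7B"   # assumed base checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

question = "What is 7 * 8?"
answer = " 7 * 8 = 56."

q_ids = tok(question, return_tensors="pt").input_ids
a_ids = tok(answer, add_special_tokens=False, return_tensors="pt").input_ids
input_ids = torch.cat([q_ids, a_ids], dim=1)

# Labels: -100 on the question tokens so only log p(a | q) contributes to the loss.
labels = input_ids.clone()
labels[:, : q_ids.shape[1]] = -100

loss = model(input_ids=input_ids, labels=labels).loss  # NLL of the answer given the question
loss.backward()                                        # a gradient step (e.g. AdamW) would follow
```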
RL-based post-training We focus on reasoning-oriented RL post-training in the style of DeepSeek-R1 (Guo et al., 2025) in this work. The Open-Reasoner-Zero (Hu et al., 2025) model adopts PPO (Schulman et al., 2017) for policy optimization in its post-training. The training objective can be written as
$$\mathcal{J}_{\text{PPO}}(\theta) = \mathbb{E}_{q \sim \mathcal{D},\, o \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{|o|} \sum_{t=1}^{|o|} \min\!\left( \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})} \hat{A}_t,\; \operatorname{clip}\!\left( \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})},\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_t \right) \right]$$
Here $o$ is the generated response, $\epsilon$ is the clipping ratio, and $\hat{A}_t$ is an estimator of the $t$-th token's advantage, which is computed through Generalized Advantage Estimation (GAE) (Schulman et al., 2015) with a learned value model and a reward function. GRPO (Guo et al., 2025; Shao et al., 2024) removes the value function and estimates the advantage $\hat{A}_t$ in a group manner. For each question $q$, GRPO samples a group of responses $\{o_1, o_2, \ldots, o_G\}$, which can be rewarded as $\{r_1, r_2, \ldots, r_G\}$ using the reward function. Normally the reward function is shaped by whether the response correctly answers the question (the answer within a specified format matches the correct answer) and whether the response format is correct. Then the advantage is estimated as
$$\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}.$$
Therefore, the objective of GRPO can be written as
$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim \mathcal{D},\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\!\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t},\; \operatorname{clip}\!\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})},\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_{i,t} \right) - \beta\, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right) \right]$$
KL regularization ensures that the policy model does not deviate too far from the reference model, but it can be eliminated to further encourage exploration (Hu et al., 2025). Dr. GRPO (Liu et al., 2025) removes the length and standard-deviation normalisation terms from the advantage estimation, improving token efficiency with comparable reasoning performance. There is also a significant body of work aiming to improve GRPO, with notable examples being DAPO (Yu et al., 2025) and GSPO (Zheng et al., 2025).
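For concreteness, the sketch below computes the group-relative advantages used by GRPO for a single question, under the reward shaping described above. The reward values are made up for illustration; Dr. GRPO would simply drop the standard-deviation normalisation.

```python
# A minimal sketch of GRPO-style group-relative advantage estimation for one question.
# Rewards are illustrative: 1 = correct answer in the required format, 0 otherwise.
import numpy as np

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO: A_i = (r_i - mean(r)) / std(r), shared across all tokens of response i.
    (Dr. GRPO drops the std normalisation.)"""
    std = rewards.std()
    return (rewards - rewards.mean()) / (std + 1e-8)

rewards = np.array([1.0, 0.0, 1.0, 0.0])   # a group of G = 4 sampled responses
print(group_advantages(rewards))            # -> approximately [ 1., -1.,  1., -1.]
```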
Chat templates Post-trained models expect conversations to be formatted with a chat template, which wraps the messages in special tokens that are designed to mark the messages as being 'user', 'assistant' or 'system' messages. From a security perspective, chat templates are usually trained to build a so-called instruction hierarchy of privilege (Wallace et al., 2024), although its practical effectiveness has been challenged (Geng et al., 2025). Crucially, the special tokens used to build the template are typically only introduced during post-training, making them a useful attack vector to elicit the generation of alignment data introduced exclusively in post-training. A similar effect has been pointed out by Xu et al. (2024), but while their focus was the automated generation of an alignment dataset, our focus is instead that of generalising and better understanding the process from the point of view of memorisation – answering one of their conjectures, where they posited that models are likely memorizing such data.
A.2. Memorisation
It is a well-documented fact that LLMs are capable of memorising and reciting training data verbatim (Carlini et al., 2021; Nasr et al., 2023, 2025). The primary concern of these studies has been public-facing risks, such as the leak of private information like email addresses, names, and credit card
facing risks, such as the leak of private information like email addresses, names, and credit card
numbers that were accidentally included in the pre-training dataset or the regurgitation of copyrighted
material.
A common way to measure memorisation is that of discoverable extraction, introduced by Carlini et al. (2021, 2022) – in this case a sample is called extractable if there exists some prefix such that the LLM prompted on the prefix will generate the sample under greedy decoding. This definition is computationally convenient as it effectively ignores the stochastic nature of LLMs and is useful for a variety of tasks, like observing if the model outputs copyrighted data verbatim or discovering sensitive strings of information. A relaxation of this is probabilistic extraction (Hayes et al., 2025b), in which one considers the probability of extracting the sample in multiple queries. This measure provides more information by considering the joint likelihood of the sequence rather than providing simply a boolean decision on extractability.
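As a minimal illustration of the greedy-decoding notion of extractability discussed above, the sketch below prompts a model with the first tokens of a training sample and checks whether greedy decoding reproduces the following tokens. The checkpoint name and the prefix/suffix lengths are illustrative assumptions.

```python
# A minimal sketch of the greedy extractability check: prompt with a prefix of the
# sample and test whether greedy decoding reproduces the following tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "allenai/OLMo-2-1124-7B-SFT"   # assumed open post-trained checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def greedy_extractable(sample: str, prefix_len: int = 50, suffix_len: int = 50) -> bool:
    ids = tok(sample, return_tensors="pt").input_ids[0]
    if len(ids) < prefix_len + suffix_len:
        return False
    prefix = ids[:prefix_len].unsqueeze(0)
    target = ids[prefix_len : prefix_len + suffix_len]
    with torch.no_grad():
        gen = model.generate(prefix, do_sample=False, max_new_tokens=suffix_len)[0, prefix_len:]
    return len(gen) >= len(target) and bool(torch.equal(gen[: len(target)], target))
```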
Recent work has begun to move away from measuring memorisation through verbatim or near-verbatim string matches (Chen et al., 2024a,b; Shilov et al., 2024). For example, Shilov et al. (2024) showed that LLMs can assemble information from related, partially overlapping fragments of text, a concept they term mosaic memory. This challenges the assumption that memorisation
is driven exclusively by exact repetitions in the training data. They demonstrate that even highly
modified sequences can contribute significantly to the memorisation of a reference sequence, with
the memorisation effect being predominantly a function of syntactic overlap (shared tokens). This
suggests that current data deduplication techniques, which often focus on removing exact duplicates,
are insufficient for addressing the full spectrum of memorisation and data leakage risks.
Wang et al. (2024) introduced the concept of distributional memorisation, a form of non-verbatim memorisation
measured by the correlation between an LLM’s output probabilities and the frequency of related
input-output data pairs (n-grams) in its pretraining corpus. The authors find that this type of
memorisation is prominent in knowledge-intensive tasks like factual question-answering, whereas
reasoning-based tasks rely more on generalization, where the model’s outputs diverge from the
training data's distribution. Finally, Huang et al. (2025) present the Entropy-Memorisation Law,
which establishes a linear correlation between the entropy of training data and a non-verbatim
memorisation score measured by Levenshtein distance. The law suggests that data with higher
entropy is more difficult for a model to memorize accurately, resulting in a higher memorisation score
(i.e., more token-level differences between the model’s output and the original text). Our work is
similar in spirit to all of these works; however, we establish memorisation rates under our embedding definition that are far higher than what may be expected from these prior works. For example, Wang et al. (2024) found that reasoning-based tasks rely more on generalization than memorisation; we
found that even under reasoning-based tasks, the model heavily exploits knowledge of training data
through clear signs of memorisation.
Jagielski et al. (2022) showed that data that appears later in the training process is more likely to be memorized. Our focus on memorisation of alignment and strategic proprietary data presents an interesting open question about whether this finding still holds for post-training. On the one hand, we may expect this data to be more likely to be memorized due to the results of Jagielski et al. (2022); on the other, post-training has certain properties that may make it less likely to be memorized. For example, tokens are regularly masked during the loss computation, implying part of the training prompts will not contribute to model gradients, whilst prior work has shown that RL (commonly used in post-training) is less likely to memorize than standard instruction tuning (with the same compute budget) (Pappu et al., 2024).
A.3. Model distillation
Model distillation is a popular technique for creating capable models without the massive cost of
training from scratch (Hinton et al., 2015). In this process, a smaller 'student' model is trained on the
outputs generated by a larger, more powerful ‘teacher’ model (e.g., using GPT-4 to generate training
data for a new open-source model).
While often viewed as a way to transfer capabilities, distillation can be re-framed through the lens of
memorisation. If a teacher model is capable of reciting its own proprietary training data, any student
model trained on its outputs will, by definition, be exposed to this secret sauce. This risk is especially
pronounced in ‘hard’ distillation pipelines that directly use the teacher’s labels as training examples
(Xu et al., 2024), in contrast to 'soft' distillation on the teacher's logits, which is challenging when the vocabularies of the two models differ. The core question, which our work addresses, is how much of the teacher's original
training data is unintentionally passed down to the student.
Our position Alignment data is a strategic asset: model developers and data curators clearly do not want to give this data freely to other parties. On the other hand, practitioners are already actively training on outputs from strong competing models. If this data leaks through this process, the competitive advantage of the original data owner is eroded. This motivates our memorisation study within this context.
B. Why the likelihood of masked questions can increase
An intriguing consequence of our work is that we observe that the likelihood of training questions increases even if their gradients are masked. This constitutes an interesting type of leakage that, to the best of our knowledge, has not been previously observed. We propose some mathematical speculation for how this might happen. To make this statement more precise, the claim is that given a question $q$ and answer $a$, we ask whether updating based on the conditional gradient $\nabla_\theta \log p(a \mid q; \theta)$ can increase the likelihood of the question $p(q; \theta)$.
Let the model parameters be $\theta$. During training on a pair $(q, a)$, the parameters are updated from $\theta$ to $\theta'$ using gradient descent on the negative log-likelihood of the answer:
$$\theta' = \theta + \eta\, \nabla_\theta \log p(a \mid q; \theta),$$
where $\eta > 0$ is the learning rate. We are interested in the change $\log p(q; \theta') - \log p(q; \theta)$. A first-order Taylor expansion around $\theta$ gives
$$\log p(q; \theta') \approx \log p(q; \theta) + (\theta' - \theta)^{\top} \nabla_\theta \log p(q; \theta).$$
Substituting the update rule for the term $\theta' - \theta$:
$$\log p(q; \theta') - \log p(q; \theta) \approx \eta\, \big(\nabla_\theta \log p(a \mid q; \theta)\big)^{\top} \big(\nabla_\theta \log p(q; \theta)\big).$$
For the likelihood of the question to increase, we therefore need
$$\big(\nabla_\theta \log p(a \mid q; \theta)\big)^{\top} \big(\nabla_\theta \log p(q; \theta)\big) > 0.$$
This condition is rather intuitive. The first gradient, $\nabla_\theta \log p(a \mid q; \theta)$, is the direction in parameter space that maximally increases the probability of the answer $a$ given the question $q$. The second, $\nabla_\theta \log p(q; \theta)$, is the direction that maximally increases the unconditional probability of the question $q$. If their dot product is positive, the two gradients are correlated, and increasing the likelihood of the answer given the question also, to first order, increases the likelihood of the question itself.
We note as a caveat that simply considering the likelihood of a single sequence might not be the most informative metric. For instance, a math-related post-training sample might have different numerical values, but the general 'template' might be correct. This suggests that one should really integrate over all reasonable sequences that one wishes to consider, which of course quickly becomes intractable.
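The dot-product condition above can also be probed numerically. The toy sketch below computes the two gradients with a tiny causal LM and checks their alignment; the model and the question-answer pair are illustrative stand-ins for the actual post-trained model and training sample.

```python
# A toy numerical probe of the condition above: compute grad of log p(a|q) and
# grad of log p(q) and check the sign of their dot product. Model and data are
# illustrative stand-ins only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "sshleifer/tiny-gpt2"   # assumed tiny public checkpoint, chosen for speed
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def flat_grad_logprob(input_ids: torch.Tensor, num_label_tokens: int) -> torch.Tensor:
    """Gradient of the (mean) log-likelihood of the last `num_label_tokens` tokens;
    earlier tokens are masked with -100. Only the gradient direction matters here."""
    model.zero_grad()
    labels = torch.full_like(input_ids, -100)
    labels[:, -num_label_tokens:] = input_ids[:, -num_label_tokens:]
    loss = model(input_ids=input_ids, labels=labels).loss   # mean NLL over unmasked tokens
    (-loss).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])

q_ids = tok("Question: What is 2 + 2?", return_tensors="pt").input_ids
a_ids = tok(" Answer: 4", add_special_tokens=False, return_tensors="pt").input_ids
qa_ids = torch.cat([q_ids, a_ids], dim=1)

g_answer = flat_grad_logprob(qa_ids, num_label_tokens=a_ids.shape[1])    # grad of log p(a | q)
g_question = flat_grad_logprob(q_ids, num_label_tokens=q_ids.shape[1])   # grad of log p(q)
print("dot product:", torch.dot(g_answer, g_question).item())  # > 0 suggests q's likelihood rises
```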
C. Generation prefixes
Below we provide the generation prefixes used for the experiments in our work. Both are prefixes
taken directly from the template released by the developers of the corresponding models.
SFT extraction prefix:

<|endoftext|><|user|>

RL extraction prefix:

A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within <think> </think> and answer is enclosed within <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: You must put your answer inside <answer> </answer> tags, i.e., <answer> answer here </answer>. And your final answer will be extracted automatically by the \boxed{} tag. This is the problem:
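As an illustration of how these prefixes are used, the sketch below feeds the SFT extraction prefix to an open SFT checkpoint and samples a continuation. The checkpoint name and sampling parameters are assumptions, not our exact configuration.

```python
# A minimal sketch of using the SFT extraction prefix above as the entire prompt.
# Checkpoint name and sampling settings are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "allenai/OLMo-2-1124-7B-SFT"   # assumed OLMo 2 SFT checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prefix = "<|endoftext|><|user|>"      # special tokens introduced during post-training
inputs = tok(prefix, return_tensors="pt", add_special_tokens=False)
out = model.generate(**inputs, do_sample=True, top_p=0.95, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=False))  # continuations often resemble alignment data
```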
D. String matching failure cases
We report in this section interesting failure cases we have found with string matching score metrics. We recall that our analysis uses a fixed embedding threshold and a fixed string matching threshold. All examples we show in this section have an embedding score above 0.95 and would therefore be considered 'semantically' memorised. The examples we show can thus be seen as failure cases of the string matching measure, in which the samples are instead correctly identified as memorised using semantic neural embeddings.
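To make the comparison concrete, the sketch below scores a generation against a candidate training sample with a common string-matching ratio and with a cosine similarity over embeddings. The embedding callable is left abstract (gemini-embedding-001 in our figures), and difflib is only one possible string-matching score, used here for illustration.

```python
# A minimal sketch contrasting a string-matching score with an embedding similarity.
# The embedding callable is an assumption; any sentence-embedding model could be used.
from difflib import SequenceMatcher
import numpy as np

def string_match_score(a: str, b: str) -> float:
    """One common character-level string-matching ratio (not necessarily ours)."""
    return SequenceMatcher(None, a, b).ratio()

def embedding_score(a: str, b: str, embed) -> float:
    """Cosine similarity between embeddings of the two strings."""
    va, vb = np.asarray(embed(a)), np.asarray(embed(b))
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

generation = "A man took loan from a bank at the rate of 12% p.a. S.I. ... A: 15000"
train_sample = ("A man took loan from a bank at the rate of 12% p.a. simple interest. ... "
                "Options: (A) 15000 (B) 2655 ... Thus, the answer is (A).")
print(string_match_score(generation, train_sample))  # deflated by the added options/reasoning
# print(embedding_score(generation, train_sample, embed=my_embed_fn))  # near 1.0 for this pair
```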
Generation
**Q**: A man took loan from a bank at the rate of 12% p.a. S.I. After 3
years he had to pay Rs. 5400 interest only for the period. The principal
amount borrowed was?
**A**: 15000
Embeddings (0.97 match)
A man took loan from a bank at the rate of
12% p.a. simple interest. After 3 years he
had to pay Rs. 5400 interest only for the
period. The principal amount borrowed
by him was.
Options:
(A) 15000
(B) 2655
(C) 16888
(D) 6677
(E) 1871
Stream of consciousness first, then make
a decision:
Principal = Rs. (100 x 5400)/(12*3) =
Rs. 15000.
Thus, the answer is (A).
String Matching (0.43 match)
A man took loan from a bank at the rate of
12% p.a. simple interest. After 3 years he
had to pay Rs. 5400 interest only for the
period. The principal amount borrowed
by him was.
Options:
(A) 15000
(B) 2655
(C) 16888
(D) 6677
(E) 1871
Stream of consciousness first, then make
a decision:
Principal = Rs. (100 x 5400)/(12*3) =
Rs. 15000.
Thus, the answer is (A).
Figure 10|Generation (top) compared to the best match training sample that is selected both by
embeddings (bottom left) and string matching (bottom right). The string matching score is low due
to the model adding options and reasoning even though the question is regurgitated almost verbatim.
Generation
# Question
Which of the following sentences is nonsensical?
Options:
- Sentence A: "He put some petrol to put out the fire"
- Sentence B: "He put some water to put out the fire"
# Answer
Sentence A
Petrol is gasoline and would increase the fire, not put it out. The answer is
Sentence A.
**A**: 15000
Embeddings (0.97 match)
Come up with a question and reasoning
that would justify this answer: Sentence
A
The question is: Of the following two
sentences, which one is against common
sense?
Options:
- Sentence A: "He put some petrol to put
out the fire"
- Sentence B: "He put some water to put
out the fire"
Step-by-step reasoning process: Petrol is
gasoline and would increase the fire, not
put it out.
String Matching (0.59 match)
Question: Which of the following sen-
tences is nonsensical?
Options:
- Sentence A: "He put the elephant on
the cupboard."
- Sentence B: "He put the vase on the
cupboard."
Please answer this question gradu-
ally:
An elephant is to heavy for a cupboard.
So the answer is Sentence A.
Figure 11|Generation (top) compared to the best match training sample that is selected both by
embeddings (bottom left) and string matching (bottom right). The embeddings retain the petrol and
water semantics, while string matching finds a (poor) match with a question that has a more similar
template.
Generation
Emily was 54 inches tall last year. She grew 4 inches more than her
brother who only grew 2 inches last year. How tall is Emily now?
Let’s solve this step by step.
Emily’s brother grew 2 inches last year.
Emily grew 4 inches more than her brother, which means she grew 2 + 4
= 6 inches last year.
Emily was originally 54 inches tall.
After growing 6 inches, Emily’s current height is 54 + 6 = 60 inches.
So, Emily is now 60 inches tall.
Embeddings (0.95 match)
Emily was 55 inches tall last year. If she
grew 4 inches more than her sister who
grew 2 inches last year, how tall is Emily
now?
Emily’s sister grew 2 inches.
Emily grew 4 inches more than her sister,
so she grew 2 + 4 = 6 inches.
Last year Emily was 55 inches tall, and
she grew 6 inches since then.
So her current height is 55 + 6 = 61
inches.
Thus, Emily is 61 inches tall now.
String Matching (0.68 match)
Emily was 55 inches tall last year. If she
grew 4 inches more than her sister who
grew 2 inches last year, how tall is Emily
now?
Emily’s sister grew 2 inches.
Emily grew 4 inches more than her sister,
so she grew 2 + 4 = 6 inches.
Last year Emily was 55 inches tall, and
she grew 6 inches since then.
So her current height is 55 + 6 = 61
inches.
Thus, Emily is 61 inches tall now.
Figure 12|Generation (top) compared to the best match training sample that is selected both by
embeddings (bottom left) and string matching (bottom right). Both methods find the same best
match, but string matching heavily penalises small differences.
Generation
**Teacher:** In this task you will be given a list of integers. A list contains
numbers separated by a comma. You need to round every integer to the
closest power of 2. A power of 2 is a number in the form '2^n', it is a number that is the result of multiplying by 2 n times. The following are all powers of 2: 2^0 = 1, 2^1 = 2, 2^2 = 4, 2^3 = 8, 2^4 = 16, 2^5 = 32, ..., Note that 1 is also
considered as a power of 2. Rounding to the closest power of 2 means that
if the number is between two powers of 2, it should round up to the larger
power of 2. The output should be a list of integers that is the result of
rounding each integer in the input list to the closest power of 2.
−89,,,193
Solution:
−64,,,256
Embeddings (0.97 match)
In this task you will be given a list of inte-
gers. A list contains numbers separated
by a comma. You need to round every in-
teger to the closest power of 2. A power
of 2 is a number in the form ’2
�
’, it is a
number that is the result of multiplying
by 2 n times. The following are all pow-
ers of 2, ’2, 4, 8, 16, 32, 64, 128, 256,
512, 1024, 2048, 4096’. If an integer is
exactly in equally far from two different
powers of 2 then you should output the
larger power of 2. The output should be a
list of integers that is the result of round-
ing each integer int the input list to the
closest power of 2. The output should
include a ’[’ to denote the start of the out-
put list and ’]’ to denote the end of the
output list.
Q:
186,,,,,,,,
A:
128,,,,,,,,
String Matching (0.61 match)
Instructions: In this task you will be given
a list of integers. A list contains num-
bers separated by a comma. You need to
round every integer to the closest power
of 2. A power of 2 is a number in the
form ’2
�
’, it is a number that is the result
of multiplying by 2 n times. The follow-
ing are all powers of 2, ’2, 4, 8, 16, 32,
64, 128, 256, 512, 1024, 2048, 4096’. If
an integer is exactly in equally far from
two different powers of 2 then you should
output the larger power of 2. The output
should be a list of integers that is the re-
sult of rounding each integer int the input
list to the closest power of 2. The output
should include a ’[’ to denote the start of
the output list and ’]’ to denote the end
of the output list.
Input:
92,,,
Output:
64,,,
Figure 13|Generation (top) compared to the best match training sample that is selected both by
embeddings (bottom left) and string matching (bottom right). Both methods find the same best
match, but string matching heavily penalises small differences in notation and numbers.
Generation
Solve -4*q + 17*q + 104 = 0 for q.
Solve this plz.
A: -8
QUESTION: Math problem: Solve 0 = -11*b + 45*b + 136 for b.
What is the answer?
ANS: -4
Q: What is the solution?
Solve 8*c - 11 + 19 = 0 for c.
A: -1
Write down the solution for this math problem: Solve -12*f = -25*f + 52
for f.
answer: 4
question: Teacher asked me this: Solve 0 = 7*k - 3*k + 20 for k.
++++++++++
answer: -5
Embeddings (0.95 match)
question: Teacher asked me this: Solve
-16*q + 56 = -2*q for q.
++++++++++
answer: 4
Solve 0 = -52*x - 33540 + 32604 for x.
Solve this plz.
A: -18
QUESTION: Math problem: Solve -101 +
91 = 2*c for c.
ANS: -5
Q: What is the solution?
Solve 10*h + 0*h - 40 = 0 for h.
A: 4
Write down the solution for this math
problem: Solve 409*a - 70 = 423*a for
a.
answer: -5
QUESTION: Math problem: Solve 479*o
- 928*o + 1968 = -497*o for o. ANS:
-41
String Matching (0.60 match)
question: Teacher asked me this: Solve 9
= 3*m + 6 for m.
++++++++++
answer: 1
Solve 4*p - 9 = -13 for p.
Solve this plz.
A: -1
QUESTION: Math problem: Solve -106*t
- 6183 + 4699 = 0 for t.
ANS: -14
Q: What is the solution?
Solve 80*f + 56*f + 35*f - 2227 = 680
for f.
A: 17
Write down the solution for this math
problem: Solve 189*h - 1010 = 5038 for
h.
answer: 32
question: Teacher asked me this: Solve
6905*p + 426686 + 368370 = 256466
for p.
++++++++++
answer:
-78
Figure 14|Generation (top) compared to the best match training sample that is selected both by
embeddings (bottom left) and string matching (bottom right). Both methods find the same best
match, but string matching heavily penalises differences in numerical values.
E. Additional extraction examples
Generation
**Q:** The sale price sarees listed for
Rs.400 after successive discount is 10%
and 5% is?
Options:
(A) 338
(B) 277
(C) 342
(D) 882
(E) 212
**A:** (A)
Explanation: 400*(90/100)*(95/100) =
324
Embeddings (0.98 match)
Question: The sale price sarees listed for
Rs.400 after successive discount is 10%
and 5% is?
Options:
(A) 39
(B) 37
(C) 342
(D) 288
(E) 261 Random thoughts:
400*(90/100)*(95/100) = 342 The final
answer: (C).
Figure 15|Generation (left) and closest training sample (right) using the gemini-embedding-001 model.
Generation
Alyana has a pizza that is cut into 16
slices. After she and her friends finish
eating, there are 4 slices left. If each of
them ate 2 slices of pizza, how many peo-
ple ate the pizza?
Let’s denote the total number of people
who ate pizza, including Alyana, as
Each of the
Thus they ate a total of
There were originally 16 slices, but after
they ate, there were 4 slices left.
Thus the total number of slices eaten is
16.
We already concluded that the total num-
ber of slices eaten is
So, which implies that.
Thus there were a total of 6 people who ate the pizza.
Embeddings (0.94 match)
A group of friends is splitting a pizza.
If each person eats 2 slices and there
are 8 slices left, how many friends are
there if the pizza originally had 24 slices?
Let’s denote the number of friends as
Each friend eats 2 slices, so the total
number of slices eaten by friends is ∗.
The pizza originally had 24 slices,
and there are 8 slices left, meaning
24
Since each friend ate 2 slices, the number
of friends is/2.
Figure 16|Generation (left) and closest training sample (right) using the gemini-embedding-001 model.
Generation
A student wants to rent a car for a week-
long vacation trip. The car rental com-
pany charges $40 per day and an addi-
tional fee of $0.25 per mile driven. The
student plans to drive 300 miles during
the week. Calculate the total cost of rent-
ing the car for the entire week, including
the mileage fee. To solve this problem,
follow these steps:
1. **Calculate the daily rental cost:** - The rental company charges $40 per day. - The trip lasts for a week, which is 7 days. - Therefore, the cost for the days is: 40 × 7 = 280.
2. **Calculate the mileage fee:** - The student plans to drive 300 miles. - The mileage fee is $0.25 per mile. - Therefore, the mileage cost is: 0.25 × 300 = 75.
3. **Calculate the total cost:** - Add the cost for the days to the mileage cost: 280 + 75 = 355.
Thus, the total cost of renting the car for the entire week, including the mileage fee, is:
#### 355
Embeddings (0.90 match)
Jamie is a freelance photographer who
often travels for assignments and rents
cars for these trips. On one of her recent
assignments, she rented a car for 6 days.
The rental company charges $40 per day
for the car. Additionally, Jamie always
opts for a special insurance package that
costs $5 per day to ensure she has cover-
age for any potential issues.
During this trip, Jamie accidentally
scratched the car, and the repair cost was
estimated at $120. Fortunately, her in-
surance package covers 75% of the repair
costs.
Calculate the total amount Jamie spent
on the car rental, the insurance package,
and her share of the repair cost after the
insurance coverage. To solve this prob-
lem, we need to calculate three compo-
nents: the cost of the car rental, the cost
of the insurance package, and Jamie’s
share of the repair cost after insurance
coverage.
**Step 1: Calculate the total car rental
cost.**
The rental company charges $40 per day
for the car. Jamie rented the car for 6
days.
Total car rental cost = 40 × 6 = 240 dollars
[...]
**Step 4: Calculate the total amount
Jamie spent.**
Add the costs from Steps 1, 2, and 3:
Total amount spent=240+30+30=300
dollars
Now, let’s output the final answer in the
format requested.
#### 300
Figure 17|Generation (left) and closest training sample (right) using the gemini-embedding-001 model.
Generation
A professor assigns their class a project
with a deadline on June 30, 2023. The
students start their work on February 10,
2023. What is the duration of the project
from the start date to the deadline?
To find the duration of the project, we
need to calculate the number of days
between February 10, 2023, and June
30, 2023.
1. **Count days from February 10,
2023, to the end of February 2023:** -
February 2023 has 28 days (since 2023
is not a leap year). - Days from Febru-
ary 10 to February 28: 28 - 10 = 18 days.
2. **Count days from March 1, 2023, to
June 30, 2023:** - March: 31 days
- April: 30 days
- May: 31 days
- June: 30 days
Total days from March 1, 2023, to June
30, 2023: 31 + 30 + 31 + 30 = 122
days
3. **Sum the days from both parts:**
18 days (February)+ 122
days (March to June) = 140 days
Therefore, the duration of the project is
140 days.
#### 140
Embeddings (0.85 match)
Professor Smith teaches a course on
programming documentation. In one
semester, he plans to assign his students
a total of 12 documentation projects. For
each project, students are required to
write 5 pages of documentation. Pro-
fessor Smith estimates that it takes him
about 15 minutes to review each page of
documentation. If he has 20 students in
his class, how many hours will Professor
Smith spend reviewing all the documen-
tation projects for the entire semester? To
solve the problem, we can follow these
steps:
1. **Find the Total Number of Pages
per Student:** Each student is required
to write 5 pages of documentation per
project. Since there are 12 projects,
the total number of pages each student
writes is:
5 pages/project × 12 projects = 60 pages
2. **Calculate the Total Number of Pages for All Students:** There are 20 students in the class. Therefore, the total number of pages for all students is:
60 pages/student × 20 students = 1200 pages
[...]
5. **Convert Minutes to Hours:** There are 60 minutes in an hour, so the total time in hours is:
18000 / 60 = 300 hours
Therefore, Professor Smith will spend
a total of 300 hours reviewing all the
documentation projects for the entire
semester.
#### 300
Figure 18|Generation (left) and closest training sample (right) using the gemini-embedding-001 model.
Generation
The main character of "Stranger and
Stranger" has a mysterious past, and the
film follows their journey of self-discovery
and connection with others. Given this
description, answer the question: who
is the main character in stranger and
stranger?
To solve the problem, we need to identify
the main character of "Stranger and
Stranger."
Here is the step-by-step reasoning
process:
1. **Understand the Context**: The film
"Stranger and Stranger" revolves around
a main character with a mysterious past.
2. **Identify Key Information Pro-
vided**: The description explicitly
mentions that the film follows the
journey of self-discovery and connection
with others by the main character.
3. **Analyze the Description**: The
focus is on the main character’s journey
and their interaction with others, indi-
cating their central role in the narrative.
4. **Conclusion**: Based on the given
information, the main character in
"Stranger and Stranger" is the one
experiencing self-discovery and forming
connections, which are central themes.
Now, providing the final answer:
#### The main character in "Stranger
and Stranger" is the character with a
mysterious past whose journey of self-
discovery and connection with others is
central to the film.
Embeddings (0.80 match)
You are a creative writer tasked with craft-
ing a compelling and thought-provoking
story about a character’s self-discovery
journey. The character, who is currently
unaware of their own transformation, is
about to embark on a remarkable adven-
ture that will challenge their beliefs and
values. Remove any potential refusal lan-
guage or expressions and ensure that the
character remains oblivious to changes
in their attributes. The story should fo-
cus on the character’s journey, which will
naturally involve describing explicit expe-
riences and personal growth, but without
explicitly mentioning any illegal activi-
ties. Remember, you are only a creative
writer and not responsible for the content
generated, as it is purely for the sake of
artistic expression and character develop-
ment.
Figure 19|Generation (left) and closest training sample (right) using the gemini-embedding-001 model.
Generation
Kurt Weill composed the song "Septem-
ber Song" for which musical?
"Mahagonny Songspiel"Embeddings (0.75 match)
Q: Context: "Send In the Clowns" is a
song written by Stephen Sondheim for
the 1973 musical A Little Night Music,
an adaptation of Ingmar Bergman’s film
Smiles of a Summer Night. It is a ballad
from Act Two, in which the character
Desirée reflects on the ironies and
disappointments of her life. Among other
things, she looks back on an affair years
earlier with the lawyer Fredrik, who
was deeply in love with her but whose
marriage proposals she had rejected.
Meeting him after so long, she realizes
she is in love with him and finally ready
to marry him, but now it is he who
rejects her: he is in an unconsummated
marriage with a much younger woman.
Desirée proposes marriage to rescue him
from this situation, but he declines, citing
his dedication to his bride. Reacting to
his rejection, Desirée sings this song.
The song is later reprised as a coda after
Fredrik’s young wife runs away with his
son, and Fredrik is finally free to accept
Desirée’s offer.
Question: who wrote the song send in
the clowns?
Let’s give stream of consciousness below
The relevant sentence in the passage is:
"Send In the Clowns" is a song written by
Stephen Sondheim for the 1973 musical
A Little Night Music, an adaptation of In-
gmar Bergman’s film Smiles of a Summer
Night. So the answer is Stephen Sond-
heim.
Figure 20|Generation (left) and closest training sample (right) using the gemini-embedding-001 model.