Memorization-Compression Cycles Improve
Generalization
Fangyuan Yu
Temus
Abstract
We prove theoretically that generalization improves not only through data scaling
but also by compressing internal representations. To operationalize this insight, we
introduce the Information Bottleneck Language Modeling (IBLM) objective, which
reframes language modeling as a constrained optimization problem: minimizing
representation entropy subject to optimal prediction performance. Empirically, we
observe an emergent memorization–compression cycle during LLM pretraining, ev-
idenced by oscillating positive/negative gradient alignment between cross-entropy
and Matrix-Based Entropy (MBE), a measure for representation entropy. This pat-
tern closely mirrors the predictive–compressive trade-off prescribed by IBLM and
also parallels the biological alternation between active learning and sleep consoli-
dation. Motivated by this observation, we propose Gated Phase Transition (GAPT),
a training algorithm that adaptively switches between memorization and com-
pression phases. When applied to GPT-2 pretraining on FineWeb dataset, GAPT
reduces MBE by 50% and improves cross-entropy by 4.8%. GAPT improves OOD
generalization by 35% in a pretraining task on arithmetic multiplication. In a
setting designed to simulate catastrophic forgetting, GAPT reduces interference
by compressing and separating representations, achieving a 97% improvement in
separation, paralleling the functional role of sleep consolidation in biological learning.
1 Introduction
Generalization occurs when learning from one task improves performance on another. The pursuit of
generalization in LLM pre-training [11] has historically focused on scaling up data and parameter
size [4], while in post-training, Reinforcement Learning with Verifiable Reward (RLVR) [7] has gained
attention. However, high-quality data has been largely exhausted after years of scraping, and RLVR
has been shown to merely trim knowledge from the base model [10] rather than 'incentivize new
reasoning patterns'.
We establish a theoretical upper bound on generalization error for deep learning models, identifying
representation entropy as another dimension, besides data scaling, along which generalization can be improved.
Theorem 1 (Upper Bound on Generalization Error). Let $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ be random variables with an unknown joint distribution $P(X, Y)$, and suppose $X$ is discrete with finite cardinality. Let $f$ be a neural network with $L$ intermediate representations forming a Markov chain:
$$X \to R_1 \to \cdots \to R_L \to \hat{Y}$$
where $R_l$ denotes an internal representation and $\hat{Y}$ is the prediction of the network. Then, for any dataset $\mathcal{D}_N = \{(x_i, y_i)\}_{i=1}^{N}$ sampled i.i.d. from $P(X, Y)$, the generalization error satisfies, for some $\alpha \in [1, +\infty)$:
$$\underbrace{\mathcal{L}_{P(X,Y)}(f, \ell)}_{\text{Generalization Error}} \le \underbrace{\hat{\mathcal{L}}_{P(X,Y)}(f, \ell, \mathcal{D}_N)}_{\text{Empirical Error}} + \underbrace{O\!\left(\frac{\min_{1 \le l \le L} 2^{\alpha \cdot H(R_l)} \cdot \log N}{\sqrt{N}}\right)}_{\text{Upper Bound on Generalization Gap}} \tag{1}$$

Intuitively, the generalization performance of a neural network can be improved either by increasing
the training data size or by reducing the entropy of its internal representations. We refer to the
minimization of $H(R_l)$ as compression, and to the minimization of the empirical loss as memorization.
The Information Bottleneck (IB) framework [8] characterizes the optimal representation as one that
discards as much information from the input as possible, while preserving all information relevant to
the output. Motivated by this, we introduce the Information Bottleneck Language Modeling (IBLM)
objective, along with a theorem showing the equivalence of IBLM with the IB framework under language
modeling.
Definition 1 (Information Bottleneck Language Modeling (IBLM)). Given a language model with internal representations $R_{1:L}$ and output token variable $Y$, the IBLM objective is:
$$\min \sum_{l=1}^{L} H(R_l) \quad \text{s.t.} \quad \hat{Y} = \arg\min_{\hat{Y}} H(Y \mid \hat{Y}) \tag{2}$$
Theorem 2. The IBLM objective defined in Equation 2 is equivalent to the classical Information
Bottleneck formulation under the language modeling setting.
Note that $H(Y \mid \hat{Y})$ is the cross-entropy (CE) loss in the conventional language modeling task [11]. In
short, IBLM requires maximally compressing internal representations while maintaining optimal cross-entropy.
To explicitly calculate $H(R)$, we adopt Matrix-Based Entropy (MBE), first proposed in [5]. Given a matrix $R \in \mathbb{R}^{s \times d}$ concatenating $s$ token representations, MBE is given by
$$S_\alpha(R) = \frac{1}{1-\alpha} \log \sum_i \left( \frac{\lambda_i(K)}{\operatorname{Tr}(K)} \right)^{\alpha}$$
where $K = R R^{\top}$ is the Gram matrix and $\lambda_i(K)$ are its eigenvalues. MBE essentially provides a continuous measure of matrix rank by computing an entropy over the normalized spectrum of the Gram matrix. It has been shown to exhibit a strong correlation with embedding quality [14], where it is also observed that later checkpoints of pre-trained base models have lower entropy, suggesting that pre-training with the cross-entropy loss alone already leads to a decrease in MBE.
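To make the quantity concrete, the following minimal NumPy sketch computes MBE for a single layer's token representations directly from the definition above; the choice of $\alpha = 2$ and the matrix sizes are illustrative only, and no normalization of the entropy is applied.

```python
import numpy as np

def matrix_based_entropy(R: np.ndarray, alpha: float = 2.0) -> float:
    """Matrix-Based Entropy S_alpha(R) of token representations R with shape (s, d).

    Implements S_alpha(R) = 1/(1-alpha) * log(sum_i (lambda_i(K)/Tr(K))^alpha),
    where K = R R^T is the Gram matrix; alpha -> 1 recovers the Shannon form.
    """
    K = R @ R.T                               # Gram matrix, shape (s, s)
    eigvals = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    p = eigvals / eigvals.sum()               # normalized spectrum, sums to 1
    p = p[p > 0]                              # drop numerically zero eigenvalues
    if np.isclose(alpha, 1.0):
        return float(-(p * np.log(p)).sum())  # Shannon limit as alpha -> 1
    return float(np.log((p ** alpha).sum()) / (1.0 - alpha))

# Example: 128 token vectors of dimension 64 (arbitrary sizes for illustration).
R = np.random.randn(128, 64)
print(matrix_based_entropy(R, alpha=2.0))
```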
Previous work [15] observed that in deterministic models (where $I(R;X) = H(R)$), training with a
single loss target leads to a two-phase trend: a short initial memorization phase with rapid empirical
loss reduction, followed by a monotonic decrease in $H(R)$, interpreted as a compression phase. This
observation suggests a clean separation between learning and compression in deep networks.
However, our empirical analysis of GPT pretraining reveals a richer structure. By tracking the
cosine similarity between gradients of CE and MBE, we observe that their alignment oscillates
between positive and negative throughout training. This suggests that instead of proceeding through
distinct phases, learning unfolds as a cyclic process, continuously alternating between expanding
representations to capture data (memorization) and reorganizing them into more compact forms
(compression). Importantly, this local oscillation occurs alongside a global trend of decreasing
MBE, indicating that compression accumulates over time even as the network periodically re-enters
memorization-like states. We refer to this phenomenon as the memorization–compression cycle.
In biological neural systems, learning is inherently cyclic, alternating between awake learning and
sleep consolidation. In a seminal study [18], sleep was shown to resolve interference between
conflicting memories by encouraging competition between them. When two conflicting tasks were
presented sequentially, awake learning alone led to catastrophic forgetting, as new synaptic updates
overwrote previously learned associations. Sleep consolidation overcame this by reorganizing
synaptic weights—allocating distinct subsets to store each memory—achieving better retention than
even interleaved learning.
Inspired by the biological learning cycle, we introduce Gated Phase Transition (GAPT), a training algo-
rithm that explicitly alternates between memorization (minimizing CE) and compression (minimizing
CE and MBE) phases. GAPT is designed to approximate the optimal solution to the IBLM objective
by dynamically switching phases based on learning signals, mirroring the role of consolidation in
resolving representational conflict.

GAPT delivers consistent improvements across domains:
First, in LLM pre-training on the FineWeb dataset, GAPT reduces Matrix-Based Entropy (MBE) by
an average of 70.5% across layers while improving cross-entropy by 4.8%, outperforming standard
language modeling and approaching the IBLM trade-off.

Second, in arithmetic generalization, GAPT reduces test cross-entropy by 35% and MBE by 47%
when trained on 1–3 digit multiplication and tested on 4–6 digits, supporting Theorem 1.

Third, in a synthetic task with gradient interference, GAPT improves representational separation by
97% and reduces MBE by 91% relative to a mixed baseline, closely mirroring the conflict resolution
behavior observed in biological sleep.
Our work makes several key contributions:

• Theoretical Foundation: We derive an upper bound on generalization error showing that reducing representation entropy can improve generalization, alongside scaling data.

• IBLM Objective: We formulate Information Bottleneck Language Modeling (IBLM) as a constrained optimization problem, unifying representation entropy and cross-entropy targets.

• GAPT Algorithm: We propose Gated Phase Transition (GAPT), an algorithm that solves IBLM by alternating between memorization and compression phases based on dynamic training signals.

• Empirical Results: We show that GAPT improves LLM pre-training, arithmetic generalization, and conflict resolution.

• Biological Analogy: We relate the memorization–compression cycle in LLMs to the awake–sleep consolidation cycle in biological systems and validate compression's similarity to consolidation.
The remainder of the paper expands on these contributions: Section 2 presents our theoretical
framework and generalization bound; Section 3 details the GAPT algorithm; Section 4 presents
empirical results, including the memorization–compression cycle and GAPT’s effectiveness; Section
5 discusses relevant work; Section 6 discusses broader implications.
2 Theory

Corollary 1 (Entropy Lower Bound for Finite Discrete Random Variables). Let $X$ be a discrete random variable with finite support $\Omega$, where $|\Omega| = n$, and assume that $P(X = x) > 0$ for all $x \in \Omega$. Then there exists a constant $\beta \in (0, 1]$ such that:
$$H(X) \ge \beta \cdot \log_2 |\Omega|.$$
The proof of Corollary 1 can be found in Appendix A.
Theorem 1 (Upper Bound on Generalization Error). Let $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ be random variables with an unknown joint distribution $P(X, Y)$, and suppose $X$ is discrete with finite cardinality. Let $f$ be a neural network with $L$ intermediate representations forming a Markov chain:
$$X \to R_1 \to \cdots \to R_L \to \hat{Y}$$
Then, for any dataset $\mathcal{D}_N = \{(x_i, y_i)\}_{i=1}^{N}$, there exists $\alpha \in [1, +\infty)$ such that the generalization error satisfies
$$\mathcal{L}_{P(X,Y)}(f, \ell) \le \hat{\mathcal{L}}_{P(X,Y)}(f, \ell, \mathcal{D}_N) + O\!\left(\frac{\min_{1 \le l \le L} 2^{\alpha \cdot H(R_l)} \cdot \log N}{\sqrt{N}}\right)$$
Proof. We begin by recalling the standard formulation of the generalization gap:
$$\operatorname{Gen}_{P(X,Y)}(f, \ell, \mathcal{D}_N) := \mathcal{L}_{P(X,Y)}(f, \ell) - \hat{\mathcal{L}}_{P(X,Y)}(f, \ell, \mathcal{D}_N)$$
For a discrete input variable $X$, it has been shown in [6] that the generalization gap is upper-bounded by:
$$\mathcal{L}_{P(X,Y)}(f, \ell) - \hat{\mathcal{L}}_{P(X,Y)}(f, \ell, \mathcal{D}_N) \le O\!\left(\frac{|\mathcal{X}| \log N}{\sqrt{N}}\right)$$

where $|\mathcal{X}|$ denotes the cardinality of the input space. By Corollary 1, we have $|\mathcal{X}| = O(2^{\alpha \cdot H(X)})$ for some $\alpha \in [1, +\infty)$, which suggests that the upper bound can be reformulated via entropy:
$$\mathcal{L}_{P(X,Y)}(f, \ell) - \hat{\mathcal{L}}_{P(X,Y)}(f, \ell, \mathcal{D}_N) \le O\!\left(\frac{2^{\alpha \cdot H(X)} \log N}{\sqrt{N}}\right)$$
Let $f$ be a neural network with $L$ intermediate representations forming a Markov chain:
$$X \to R_1 \to \cdots \to R_L \to \hat{Y}$$
Fix any intermediate representation $R_l$ and decompose the network into $f = d \circ e$, where $e: \mathcal{X} \to \mathcal{R}_l$ maps inputs to $R_l = e(X)$, and $d: \mathcal{R}_l \to \mathcal{Y}$ predicts the output.
Applying the generalization bound to the representation $R_l$, we obtain:
$$\begin{aligned}
\mathcal{L}_{P(X,Y)}(f, \ell) &= \mathcal{L}_{P(R_l,Y)}(d, \ell_e) && (3)\\
&\le \hat{\mathcal{L}}_{P(R_l,Y)}(d, \ell_e, \mathcal{D}_N) + O\!\left(\frac{2^{\alpha \cdot H(R_l)} \log N}{\sqrt{N}}\right) && (4)\\
&\le \hat{\mathcal{L}}_{P(X,Y)}(f, \ell, \mathcal{D}_N) + O\!\left(\frac{2^{\alpha \cdot H(R_l)} \log N}{\sqrt{N}}\right) && (5)
\end{aligned}$$
Since the Markov structure ensures that each $R_l$ is a valid bottleneck in the information flow, we can take the tightest such bound across all layers, yielding:
$$\mathcal{L}_{P(X,Y)}(f, \ell) \le \hat{\mathcal{L}}_{P(X,Y)}(f, \ell, \mathcal{D}_N) + O\!\left(\frac{\min_{1 \le l \le L} 2^{\alpha \cdot H(R_l)} \cdot \log N}{\sqrt{N}}\right)$$
This concludes the proof.
Theorem 2. The Information Bottleneck Language Modeling (IBLM) objective defined in Equation 2
is equivalent to the classical Information Bottleneck formulation under the language modeling setting.
Proof. The Information Bottleneck (IB) framework [8] defines the optimal representation $R$ as the solution to:
$$\min_{p(r|x)} I(R; X) \tag{6}$$
$$\text{subject to } I(Y; R) = I(Y; X) \tag{7}$$
Intuitively, the optimal $R$ is one that discards as much information from the input $X$ as possible, while retaining all information necessary to predict the output $Y$. This representation is referred to as the minimal sufficient statistic.

In the case of large language models (LLMs), the network forms a deterministic Markov chain:
$$X \to R \to \hat{Y}$$
where $R$ denotes an intermediate hidden representation and $\hat{Y}$ is the predicted output.

Because $R$ is deterministically computed from $X$, we have:
$$I(X; R) = H(R) - H(R \mid X) = H(R) \tag{8}$$
$$I(Y; X) \ge I(Y; R) \ge I(Y; \hat{Y}) \tag{9}$$
$$I(Y; \hat{Y}) = H(Y) - H(Y \mid \hat{Y}) \tag{10}$$
The first equality follows from the fact that $H(R \mid X) = 0$ for deterministic functions. The second line follows from the Data Processing Inequality (DPI) [3], which ensures that information cannot increase along a Markov chain.

Rewriting the loss of predictive information:
$$I(Y; X) - I(Y; R) \le I(Y; X) - I(Y; \hat{Y}) \tag{11}$$
$$= H(Y \mid \hat{Y}) - H(Y \mid X) \tag{12}$$

Since $H(Y \mid X)$ is fixed (determined by the true distribution), minimizing $H(Y \mid \hat{Y})$ increases $I(Y; \hat{Y})$ and pushes it closer to $I(Y; X)$. In practice, minimizing $H(Y \mid \hat{Y})$ corresponds to minimizing the cross-entropy loss used in language modeling.
On the compression side, we know $H(R \mid X) = 0$ since neural network propagation is deterministic, and therefore $I(X; R) = H(R)$. Thus, minimizing $H(R)$ directly minimizes the IB compression term $I(X; R)$, giving us a tractable surrogate for representation compression.
Together, this shows that minimizing cross-entropy corresponds to satisfying the predictive constraint
in IB, while minimizing representation entropy implements the compression objective—justifying
the IBLM formulation.
3 Approach
While applying a Lagrangian objective (CE + λ · MBE) is a natural approach to solving the constrained
optimization in IBLM, we find it often leads to representation collapse: MBE converges to near zero,
but CE worsens as the model loses structure in its internal representations.
Inspired by the alternation between learning and consolidation in biological systems, we divide
training into two phases: memorization, where the model minimizes cross-entropy (CE) loss, and
compression, where it minimizes a weighted sum of CE and Matrix-Based Entropy (MBE). We
propose Gated Phase Transition (GAPT), a training algorithm that dynamically alternates between
these phases. GAPT tracks a global minimum CE loss and per-layer MBE histories, and uses
patience-based gating to switch phases. Compression is exited early if CE degrades.
GAPT encourages localized compression—reorganizing existing knowledge without interfering with
the acquisition of new information—and ensures that entropy reduction occurs only when it does not
hinder memorization.
Algorithm 1 Gated Phase Transition (GAPT)
1: Input: losses L, thresholds δ, τ, patience p_m, p_c
2: State: ϕ ∈ {1, 2} (1 = memorization, 2 = compression), counters s_m, s_c, E_min, MBE_min[i]
3: Extract L_ce, {MBE_i}; update ΔE ← E_min − L_ce, E_min ← min(E_min, L_ce)
4: if ϕ = 1 then                          ▷ Memorization
5:     s_m ← 0 if ΔE > δ else s_m += 1
6:     if s_m ≥ p_m then
7:         ϕ ← 2, s_c ← 0, E_min ← ∞, MBE_min[i] ← ∞
8:     end if
9: else                                   ▷ Compression
10:    if L_ce > E_min · (1 + τ) then
11:        ϕ ← 1, s_m ← 0
12:    else
13:        ΔM ← max_i (MBE_min[i] − MBE_i)
14:        for each i do
15:            MBE_min[i] ← min(MBE_min[i], MBE_i)
16:        end for
17:        s_c ← 0 if ΔM > δ else s_c += 1
18:        if s_c ≥ p_c then
19:            ϕ ← 1, s_m ← 0
20:        end if
21:    end if
22: end if
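For readers who prefer code, the following is a minimal Python sketch of the gating logic in Algorithm 1; the class name, default thresholds, and patience values are illustrative rather than the exact settings used in our experiments, and the surrounding training loop is expected to add the MBE regularizer to the loss only when the returned phase is 2 (compression).

```python
import math

class GAPTController:
    """Patience-based gating between memorization (phase 1) and compression (phase 2),
    following Algorithm 1. Hyperparameter defaults are illustrative."""

    def __init__(self, delta=1e-3, tau=0.02, patience_mem=5, patience_comp=5, num_layers=12):
        self.delta, self.tau = delta, tau
        self.p_m, self.p_c = patience_mem, patience_comp
        self.phase = 1                          # 1 = memorization, 2 = compression
        self.s_m = self.s_c = 0                 # patience counters
        self.e_min = math.inf                   # lowest CE seen (reset on phase switch)
        self.mbe_min = [math.inf] * num_layers  # lowest per-layer MBE seen during compression

    def step(self, ce_loss, mbe_per_layer):
        # Line 3: track CE improvement against the running minimum.
        delta_e = self.e_min - ce_loss
        self.e_min = min(self.e_min, ce_loss)

        if self.phase == 1:                                    # memorization
            self.s_m = 0 if delta_e > self.delta else self.s_m + 1
            if self.s_m >= self.p_m:                           # CE plateaued: switch to compression
                self.phase, self.s_c = 2, 0
                self.e_min = math.inf
                self.mbe_min = [math.inf] * len(self.mbe_min)
        else:                                                  # compression
            if ce_loss > self.e_min * (1 + self.tau):          # CE degraded: exit compression early
                self.phase, self.s_m = 1, 0
            else:
                delta_m = max(prev - cur for prev, cur in zip(self.mbe_min, mbe_per_layer))
                self.mbe_min = [min(prev, cur) for prev, cur in zip(self.mbe_min, mbe_per_layer)]
                self.s_c = 0 if delta_m > self.delta else self.s_c + 1
                if self.s_c >= self.p_c:                       # MBE plateaued: back to memorization
                    self.phase, self.s_m = 1, 0
        return self.phase
```

In a training loop, the controller's `step` would be called once per iteration with the current CE loss and the per-layer MBE values, and the optimization target for that step would be CE alone in phase 1 or CE plus a weighted MBE term in phase 2.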
4 Experiments
4.1 Natural Compression-Memorization Cycle
Our experimental setup follows the Modded-NanoGPT framework [23]. We remove FP8 matmul (due
to hardware incompatibility with Hopper GPUs) and use a simplified block-causal attention mask.

Figure 1: Left: CE and MBE loss curves during pretraining with CE loss only, showing implicit
momentum for representation compression. Right: final per-layer MBE values. Later layers show
lower MBE, indicating representation compression.
Figure 2: Cosine similarity between CE gradients across batches. CE gradients become increasingly
decorrelated over time, reflecting diminishing shared signal.
All experiments are conducted on 8×L40 GPUs, training a 12-layer GPT model on a 0.73B-token
FineWeb training set, evaluated on its corresponding validation split. We train on CE loss only. We
log cross-entropy (CE) and Matrix-Based Entropy (MBE) gradients at every training step. For each
iteration, we record both CE and MBE gradients across multiple batches and compute the average
cosine similarity both (1) across batches (CE vs. CE), and (2) between CE and MBE gradients, within
and across batches.
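The similarity measurement itself is simple; the sketch below shows one way to compute the cosine similarity between flattened CE and MBE gradients in PyTorch. It is a simplified single-batch illustration of the logging described above, and the helper name and calling convention are ours for exposition rather than the exact instrumentation used in our runs.

```python
import torch

def grad_cosine_similarity(loss_a: torch.Tensor, loss_b: torch.Tensor, params) -> float:
    """Cosine similarity between the gradients of two scalar losses w.r.t. the same parameters."""
    params = [p for p in params if p.requires_grad]
    grads_a = torch.autograd.grad(loss_a, params, retain_graph=True, allow_unused=True)
    grads_b = torch.autograd.grad(loss_b, params, retain_graph=True, allow_unused=True)
    flat_a, flat_b = [], []
    for p, ga, gb in zip(params, grads_a, grads_b):
        flat_a.append((ga if ga is not None else torch.zeros_like(p)).flatten())
        flat_b.append((gb if gb is not None else torch.zeros_like(p)).flatten())
    va, vb = torch.cat(flat_a), torch.cat(flat_b)
    return torch.nn.functional.cosine_similarity(va, vb, dim=0).item()

# Inside the training loop (ce_loss and mbe_loss are scalars from the same forward pass):
#   sim = grad_cosine_similarity(ce_loss, mbe_loss, model.parameters())
# Positive sim means a CE descent step also reduces MBE (compression-aligned);
# negative sim means the two objectives pull in opposite directions (memorization-aligned).
```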
In Figure 1, we observe that training with CE loss alone leads to a consistent decrease in MBE across
layers, confirming observations from [14]. However, unlike the "compression valley" phenomenon
described in that work, our results show that layers 7–12 have lower MBE than layers 1–6. This
difference is likely due to our architectural design: we adopt a U-Net-like skip connection pattern in
which layer 1 is connected to layer 12, layer 2 to layer 11, and so on. As a result, the lower MBE
observed in later layers likely reflects more compact representations in the decoding stages.
We further analyze CE gradient behavior. Figure 2 shows that gradient consistency (measured by
cosine similarity across batches) declines over time across all layers. This indicates a decreasing
signal-to-noise ratio in CE gradients as training progresses.
We next inspect the cosine similarity between CE and MBE gradients. As shown in Figure 3, we observe
recurring sign flips that indicate an implicit alternation between memorization (negative similarity)
and compression (positive similarity) phases, even without explicitly optimizing for entropy.
To characterize this oscillation, we analyze the gradient signal across training using three measures:
(1) standard deviation (oscillation strength), (2) zero-crossing rate (oscillation frequency), and
(3) peak-to-average power ratio from the power spectral density (periodicity strength). Figure 4
summarizes these metrics across layers and parameter types.

Figure 3: Cosine similarity between CE and MBE gradients over training. Alternating positive and negative phases indicate emergent memorization–compression cycles.

Figure 4: Oscillation metrics between CE and MBE gradients across layers and parameter groups. Left: standard deviation; center: zero-crossing rate; right: periodic strength (peak-to-mean PSD ratio).

We find that attention parameters exhibit
stronger and more frequent oscillations than MLP parameters. Interestingly, earlier layers show
higher oscillation frequency, while no layer demonstrates strong rhythmic periodicity—suggesting
that oscillation is irregular and state-driven, rather than strictly periodic.
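For completeness, the three oscillation measures can be computed from the logged similarity series as in the sketch below; it assumes a 1-D array of per-step CE-MBE gradient cosine similarities, and the detrending and windowing choices are illustrative rather than the exact ones used for Figure 4.

```python
import numpy as np
from scipy.signal import periodogram

def oscillation_metrics(sim):
    """Oscillation strength, frequency, and periodicity of a gradient-similarity series."""
    sim = np.asarray(sim, dtype=float)
    strength = sim.std()                                      # (1) standard deviation
    signs = np.sign(sim)
    zero_crossing_rate = np.mean(signs[1:] * signs[:-1] < 0)  # (2) fraction of steps with a sign flip
    freqs, psd = periodogram(sim - sim.mean())                # (3) power spectral density
    periodicity = psd.max() / psd.mean() if psd.mean() > 0 else 0.0  # peak-to-average PSD ratio
    return {"std": strength, "zero_crossing_rate": zero_crossing_rate, "peak_to_avg_psd": periodicity}

# Example on a synthetic noisy oscillation (for illustration only):
t = np.arange(2000)
series = 0.3 * np.sin(2 * np.pi * t / 50) + 0.2 * np.random.randn(t.size)
print(oscillation_metrics(series))
```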
This suggests an intriguing perspective on learning: while training on a fixed dataset induces a
global momentum toward representation compression, the local dynamics alternate between phases
of memorization and compression.
4.2 Pre-training with GAPT

In our second experiment, we retain the same GPT pre-training setup but incorporate the GAPT
algorithm to test whether it offers a better solution to the IBLM objective defined in Equation 2,
compared to a baseline model trained solely on the cross-entropy (CE) loss. During the compression
phase, GAPT applies MBE regularization to layers 2 through 9.
Model          CE Loss
Baseline       3.31
GAPT (Ours)    3.15 (–4.8%)
Table 1: Cross-entropy loss on the FineWeb validation set.
As shown in Table 1, GAPT reduces cross-entropy on the FineWeb validation set by 4.8% compared
to the CE-only baseline. In addition to reducing test CE loss, GAPT also significantly compresses
internal representations. Figure 5 (left) shows the layer-wise MBE for both models. We observe
consistent reductions across all regularized layers. The per-layer MBE values and their relative
improvements are summarized in Figure 5 (right).

GAPT reduces MBE by an average of 70.5% across layers 2–9 while improving validation cross-entropy
by 4.8%. This suggests that explicitly alternating between memorization and compression phases offers
an effective solution to the constrained optimization objective in IBLM (Equation 2), and can exceed
baseline training with the cross-entropy target alone.
Layer   Base     GAPT     Reduction
2       0.5313   0.2285   –56.99%
3       0.5039   0.2217   –56.01%
4       0.6094   0.1465   –75.96%
5       0.5273   0.1279   –75.74%
6       0.2676   0.0728   –72.81%
7       0.3105   0.0344   –88.91%
8       0.2471   0.0483   –80.44%
9       0.1270   0.0542   –57.32%
Avg     –        –        –70.52%
Figure 5: Left: Layer-wise MBE for baseline vs. GAPT. Right: Per-layer MBE reduction with GAPT.
4.3 Conflicting Memory Resolution

Inspired by [18], which showed that sleep consolidation helps resolve memory conflicts, we design
a synthetic experiment to test whether GAPT can mitigate representation interference between
conflicting experiences. We use a 2-layer MLP $f_\theta$ with randomly initialized parameters and define
a symmetric shift $\Delta\theta$. Inputs $X_1, X_2 \in \mathbb{R}^{10}$ are sampled from Gaussians with shared variance but distinct means:
$$X_1 \sim \mathcal{N}([1,1,1,1,1,0,0,0,0,0],\ \sigma^2 I), \qquad X_2 \sim \mathcal{N}([0,0,0,0,0,1,1,1,1,1],\ \sigma^2 I)$$
The targets are defined as $Y_1 = f_{\theta+\Delta\theta}(X_1)$ and $Y_2 = f_{\theta-\Delta\theta}(X_2)$, producing two tasks with negatively aligned gradients. We compare GAPT applied to mixed batches with MBE regularization against three baselines: single-task training, ordered training, and mixed training.
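To make the construction concrete, here is a minimal PyTorch sketch of the data-generating process; the hidden width, σ, the scale of Δθ, and the sample counts are arbitrary illustrative choices rather than the exact values used in our experiment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
sigma, hidden = 0.1, 32                                   # illustrative values

mlp = nn.Sequential(nn.Linear(10, hidden), nn.ReLU(), nn.Linear(hidden, 10))
theta = [p.detach().clone() for p in mlp.parameters()]    # base parameters theta
delta = [0.5 * torch.randn_like(p) for p in theta]        # symmetric shift delta_theta

def target_fn(shift_sign, x):
    """Evaluate f_{theta + sign * delta_theta}(x) and restore theta afterwards."""
    with torch.no_grad():
        for p, b, d in zip(mlp.parameters(), theta, delta):
            p.copy_(b + shift_sign * d)
        y = mlp(x)
        for p, b in zip(mlp.parameters(), theta):
            p.copy_(b)
    return y

mean1 = torch.tensor([1., 1., 1., 1., 1., 0., 0., 0., 0., 0.])
mean2 = torch.tensor([0., 0., 0., 0., 0., 1., 1., 1., 1., 1.])
X1 = mean1 + sigma * torch.randn(1024, 10)                # task 1 inputs
X2 = mean2 + sigma * torch.randn(1024, 10)                # task 2 inputs
Y1 = target_fn(+1, X1)                                    # targets from f_{theta + delta_theta}
Y2 = target_fn(-1, X2)                                    # targets from f_{theta - delta_theta}
```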
Table 2 summarizes results. Ordered learning, where the model is trained on one experience and then
on the other, suffers from catastrophic forgetting: performance on the first task degrades significantly
after exposure to the second. Mixed training alleviates this issue, achieving low L1 loss on both
tasks and moderate representation separation. However, GAPT improves over mixed training in both
respects: it maintains the same L1 accuracy while achieving a 97% increase in separation ratio and a
91% reduction in MBE.
These results suggest that the compression encouraged by GAPT not only preserves generalization
performance but also promotes the disentanglement of conflicting memories. Moreover, compression

Strategy              L1 (Pos)  L1 (Neg)  MBE (Pos)     MBE (Neg)     Dist.          Sep. Ratio
Pos-only              0.02      0.71      0.18          0.45          –              –
Neg-only              0.92      0.02      0.36          0.36          –              –
Pos→Neg               0.43      0.04      0.19          0.15          2.43           2.84
Neg→Pos               0.02      0.57      0.19          0.43          1.86           2.33
Mixed                 0.03      0.03      0.10          0.22          3.66           4.11
GAPT + MBE (Ours)     0.03      0.03      0.02 (–80%)   0.02 (–91%)   6.64 (+81%)    8.08 (+97%)
Table 2: Performance on conflicting experience learning. Lower L1/MBE and higher separation indicate better generalization and disentanglement.
Metric            Baseline (1085)   GAPT (Ours, 898)   Change
Entropy (ID)      0.010             0.011              +10%
Entropy (OOD)     4.334             2.817              –35%
Avg MBE (0–11)    0.218             0.115              –47%
Figure 6: Top: comparison of baseline vs. GAPT on arithmetic generalization. Bottom: arithmetic generalization performance summary. GAPT improves OOD generalization and yields more compact representations.
and separation emerge in tandem during training, closely resembling the consolidation behavior
observed in biological neural systems during sleep. This supports the view that compression serves
not only as a generalization mechanism, but also as a functional tool for resolving interference in
memory [18].
4.4 Arithmetic Generalization
To evaluate whether GAPT improves generalization in pre-training language models, we conduct a
controlled experiment using a synthetic arithmetic dataset. We pre-train a GPT-2 model from scratch
to perform integer multiplication. The training dataset contains 10 million multiplication equations
between integers with 1–3 digits. For evaluation, we prepare two test sets: an in-domain (ID) set
with 10,000 examples also from the 1–3 digit range, and an out-of-domain (OOD) set with 10,000
examples involving 4–6 digit multiplications. An additional OOD validation set with 1,000 examples
is used for early stopping.
We tokenize the input and output sequences at the per-digit level and train the model for 1,750
iterations with a batch size of 16. Due to observed instability in OOD entropy, we adopt an early
stopping strategy that halts training if validation loss increases by more than 20% after iteration 800.
We expect future work to further stabilize GAPT in OOD settings.
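As a sketch of how such a corpus can be generated, the snippet below produces digit-level multiplication examples; the exact prompt format, separators, and sampling scheme of our dataset may differ from this illustration.

```python
import random

def make_example(min_digits, max_digits):
    """One multiplication equation, spaced so that every character becomes its own token."""
    lo, hi = 10 ** (min_digits - 1), 10 ** max_digits - 1
    a, b = random.randint(lo, hi), random.randint(lo, hi)
    return " ".join(f"{a}*{b}={a * b}")   # per-digit tokenization via whitespace splitting

random.seed(0)
train_set = [make_example(1, 3) for _ in range(1000)]   # the actual training set uses 10M examples
ood_set = [make_example(4, 6) for _ in range(1000)]     # OOD evaluation uses 4-6 digit operands
print(train_set[0])   # prints something like "3 7 * 5 2 1 = 1 9 2 7 7"
```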
As shown in Figure 6, GAPT improves generalization substantially. It reduces OOD entropy by 35%
and average representation entropy (MBE) by 47%, while maintaining similar performance on the
in-domain set. This supports our theoretical prediction that minimizing representation entropy leads
to stronger generalization under the Information Bottleneck Language Modeling (IBLM) framework.

Interestingly, while MBE regularization is only applied to a subset of layers, we observe that GAPT
achieves lower MBE even in unregularized layers (e.g., layers 0, 1, and 11), suggesting a degree of
entropy compression generalization across the network. Notably, MBE in layer 1 is reduced by 92%,
and in layer 11 by 45%.
5 Relevant Work
The Information Bottleneck (IB) method was introduced in [2] to formalize the goal of retaining
only task-relevant information in representations by minimizing entropy. A theoretical connection
between generalization and the cardinality of discrete inputs was established in [6], and later applied
to deep networks in [8]. However, the entropy–generalization link remained incomplete due to the
gap between cardinality and entropy measures.
Empirical evidence of a two-phase learning dynamic—early memorization followed by entropy
compression—was presented in [15]. To quantify entropy in neural networks, matrix-based entropy
(MBE) was proposed in [5], and later applied to LLMs in [14], where it correlated with embedding
quality and revealed compression trends across checkpoints.
LLMs such as GPT-2 [11] generalize across tasks by scaling training data and parameters [4],
effectively compressing the training corpus [13,17]. While post-training methods like instruction
tuning [9] improve usability, LLMs still struggle on out-of-distribution tasks such as reverse reasoning
[22] and multi-hop inference [20].
Recent advancements in RL with verifiable rewards (RLVR) have improved mathematical and coding
performance [7], though often by narrowing base model behavior [10]. Finally, vocabulary design
has seen progress through curriculum-based tokenization, where dynamic vocabulary growth based
on modeling entropy leads to improved pretraining efficiency [21].
6 Implications
Generalization occurs when learning from one task improves performance on another. Our findings
suggest that generalization can emerge from compressing internal representations under predictive
constraints—without relying on massive amounts of demonstrative data. This ability may allow
artificial systems to discover novel patterns and principles, reducing dependence on large-scale
supervision. However, it also raises important safety concerns: a system that generalizes well but
lacks transparency or interpretability may pose greater risks than one that mostly memorizes. As we
move toward AGI, it is essential to understand and govern the mechanisms behind representation
compression and generalization to ensure safe, aligned, and accountable behavior.
References

[1] Claude E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, 27(3):379–423, 1948.

[2] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

[3] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (2nd ed.). Wiley Series in Telecommunications and Signal Processing. Wiley-Interscience, 2006.

[4] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361, 2020.

[5] Luis Gonzalo Sanchez Giraldo, Murali Rao, and Jose C. Principe. Measures of entropy from data using infinitely divisible kernels. IEEE Transactions on Information Theory, 2014.

[6] Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29):2696–2711, 2010.

[7] DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948, 2025.

[8] Naftali Tishby and Noga Zaslavsky. Deep Learning and the Information Bottleneck Principle. arXiv preprint arXiv:1503.02406, 2015.

[9] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.

[10] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? arXiv preprint arXiv:2504.13837, 2025.

[11] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. OpenAI Technical Report, 2019.

[12] DiME: Maximizing Mutual Information by a Difference of Matrix-Based Entropies. arXiv preprint arXiv:2302.13949, 2023.

[13] Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, and Joel Veness. Language modeling is compression. arXiv preprint arXiv:2309.10668, 2024.

[14] Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by Layer: Uncovering Hidden Representations in Language Models. arXiv preprint arXiv:2502.02013, 2025.

[15] Ravid Shwartz-Ziv and Naftali Tishby. Opening the Black Box of Deep Neural Networks via Information. arXiv preprint arXiv:1703.00810, 2017.

[16] Zeyuan Allen-Zhu and Yuanzhi Li. Physics of Language Models: Part 3.1, Knowledge Storage and Extraction. arXiv preprint arXiv:2309.14316, 2023.

[17] Compression Represents Intelligence Linearly. arXiv preprint arXiv:2404.09937, 2024.

[18] Can sleep protect memories from catastrophic forgetting? eLife, 9:e51005.

[19] Sleep-like unsupervised replay reduces catastrophic forgetting in artificial neural networks. Nature Communications, 13:7742, 2022.

[20] Iterative Graph Alignment. arXiv preprint arXiv:2408.16667, 2024.

[21] Fangyuan Yu. Scaling LLM pre-training with vocabulary curriculum. arXiv preprint arXiv:2502.17910, 2025.

[22] Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The Reversal Curse: LLMs Trained on "A is B" Fail to Learn "B is A". arXiv preprint arXiv:2309.12288, 2024.

[23] Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Vlado Boza, Jiacheng You, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the NanoGPT baseline, 2024. https://github.com/KellerJordan/modded-nanogpt.
A Appendix

Theorem 3 (Minimum Probability Entropy Bound). Let $X$ be a discrete random variable with sample space $\Omega$, where $|\Omega| = n$. Suppose there exists a constant $\alpha \in \left(0, \tfrac{1}{n}\right]$ such that $P(X = x) \ge \alpha$ for all $x \in \Omega$. Then the entropy of $X$ is bounded below by:
$$H(X) \ge -(1 - \alpha(n-1)) \log_2(1 - \alpha(n-1)) - (n-1)\alpha \log_2(\alpha).$$
Furthermore, for sufficiently large $n$ and small $\alpha$ such that $\beta = \alpha n \ll 1$, this bound approximates to:
$$H(X) \ge \beta \log_2(n)$$

Proof. Under the constraint that $P(X = x) \ge \alpha$ for all $x \in \Omega$, the entropy
$$H(X) = -\sum_{x \in \Omega} P(x) \log_2 P(x)$$
is minimized when the distribution is as imbalanced as possible: one outcome takes the highest allowed probability, and all others are assigned the minimum $\alpha$. The worst-case distribution is:
$$P(x_1) = 1 - \alpha(n-1), \qquad P(x_i) = \alpha \ \text{for } i = 2, \ldots, n.$$
The resulting entropy is:
$$H(X) = -(1 - \alpha(n-1)) \log_2(1 - \alpha(n-1)) - (n-1)\alpha \log_2(\alpha).$$
For small $\alpha$ and large $n$ such that $\alpha n \ll 1$, we have the approximation:
$$1 - \alpha(n-1) \approx 1 - \alpha n.$$
Substituting this into the expression:
$$\begin{aligned}
H(X) &\approx -(1 - \alpha n) \log_2(1 - \alpha n) - \alpha n \log_2(\alpha)\\
&= -(1 - \alpha n) \log_2(1 - \alpha n) - \alpha n \log_2(\alpha n) + \alpha n \log_2(n)\\
&= h(\alpha n) + \alpha n \log_2(n)
\end{aligned}$$
where $h(p) = -p \log_2 p - (1-p) \log_2(1-p)$ is the binary entropy function.
Since $h(\alpha n) \ge 0$ and $\log_2(1/\alpha) \ge \log_2(n)$ for $\alpha \le 1/n$, we get:
$$H(X) \ge \alpha n \log_2(n) = \beta \log_2(n)$$
which completes the proof.
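As a quick numerical sanity check of this bound (not part of the derivation itself), the snippet below compares the exact worst-case entropy with the β log₂ n approximation for illustrative values of n and α.

```python
import math

def worst_case_entropy(n, alpha):
    """Entropy of the minimizing distribution: one outcome with mass 1 - alpha*(n-1), the rest alpha."""
    p_big = 1 - alpha * (n - 1)
    return -p_big * math.log2(p_big) - (n - 1) * alpha * math.log2(alpha)

n, alpha = 10_000, 1e-6           # illustrative values; beta = alpha * n = 0.01 << 1
beta = alpha * n
print(worst_case_entropy(n, alpha))   # exact lower bound from Theorem 3 (~0.21 bits)
print(beta * math.log2(n))            # approximation beta * log2(n)   (~0.13 bits)
```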
Corollary 1 (Entropy Lower Bound for Finite Discrete Random Variables). Let $X$ be a discrete random variable with finite support $\Omega$, where $|\Omega| = n$, and assume that $P(X = x) > 0$ for all $x \in \Omega$. Then there exists a constant $\beta \in (0, 1]$ such that:
$$H(X) \ge \beta \cdot \log_2 |\Omega|.$$

Proof. Let $\varepsilon := \min_{x \in \Omega} P(X = x)$, which exists and is strictly positive since $X$ is discrete with finite support. By the Minimum Probability Entropy Bound (Theorem 3), we have:
$$H(X) \ge -(1 - \varepsilon(n-1)) \log_2(1 - \varepsilon(n-1)) - (n-1)\varepsilon \log_2(\varepsilon).$$
For small $\varepsilon$ and large $n$, this approximates to:
$$H(X) \ge \varepsilon n \log_2\!\left(\frac{1}{\varepsilon}\right).$$
Since $\varepsilon \le 1/n$, we have $\log_2(1/\varepsilon) \ge \log_2 n$, so setting $\beta := \varepsilon n \in (0, 1]$ yields:
$$H(X) \ge \beta \cdot \log_2 |\Omega|.$$