Long-Context Generalization with Sparse Attention
Pavlo Vasylenko¹,², Marcos Treviso², André F. T. Martins¹,²,³,⁴
¹Instituto Superior Técnico, University of Lisbon, ²Instituto de Telecomunicações, ³Unbabel, ⁴ELLIS Unit Lisbon
Abstract
Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that sparse attention mechanisms using α-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows α-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Finally, we show that the ability to locate and generalize fixed-size patterns can be further improved through a careful design of position encodings, which impacts both dense and sparse attention methods. By integrating ASEntmax into standard transformer layers alongside proper positional encodings, we show that our models greatly outperform softmax, scalable softmax, and fixed-temperature α-entmax baselines on long-context generalization.
1 Introduction
The transformer architecture (Vaswani et al., 2017) has become the foundation of modern large language models (LLMs), establishing new benchmarks across diverse domains. However, as researchers push these models toward increasingly longer contexts—from thousands to millions of tokens—several fundamental limitations emerge that can be traced to the softmax transformation used in attention. Three critical limitations stand out: representational collapse occurs due to softmax's inability to maintain distinct attention patterns as sequence length grows, erasing meaningful distinctions between tokens (Barbero et al., 2024); over-squashing is exacerbated by softmax's dense probability distribution, leading to exponential dilution of gradients (Alon & Yahav, 2021; Arroyo et al., 2025); and attention dispersion arises from softmax's fundamental property that forces probability mass to be distributed across all tokens, with attention weights necessarily approaching a uniform distribution as context grows (Veličković et al., 2024; Nakanishi, 2025).
Previous approaches to address these challenges include positional encoding innovations such as ALiBi (Press et al., 2022) and RoPE (Su et al., 2024), which help to mitigate position bias issues. Recent works directly target the root cause—the softmax function itself. Nakanishi (2025) proposes Scalable-Softmax to scale logits based on context length, while Veličković et al. (2024) identify fundamental limitations of softmax for sharp out-of-distribution generalization and propose to learn adaptive temperatures to control the sharpness of softmax. While effective, these solutions often require careful tuning or address only a subset of the challenges.
In this paper, we address the root cause of these problems by replacing softmax with α-entmax (Peters et al., 2019), a differentiable sparse transformation that induces probability distributions where irrelevant tokens receive exactly zero attention. While α-entmax has been used successfully in transformers (Correia et al., 2019), its length generalization properties, to the best of our knowledge,

Figure 1: Comparison of softmax and α-entmax for long context lengths on Multi-query Multi-token Associative Recall (left) and the Reverse task (right). SSMax represents the Scalable Softmax approach by Nakanishi (2025). While all methods benefit from using NAPE, our adaptive-scaling version of α-entmax exhibits the best extrapolation results, effectively handling extremely long sequences. (Methods compared: Softmax + RoPE, SSMax + NAPE, Entmax + NAPE, ASEntmax + NAPE; x-axis: sequence length n; y-axis: exact match accuracy.)
have never been studied. We show theoretically and empirically that α-entmax consistently helps to address challenges in long context modeling. Our key contributions include:

• Non-dispersion: We establish that α-entmax attention distributions maintain consistent focus regardless of sequence length, with entropy bounded by O(log s) rather than approaching maximum entropy O(log n) as with softmax, where s ≪ n is the number of tokens with nonzero probability.

• Representational preservation: We demonstrate that α-entmax is able to prevent representational collapse by maintaining distinguishable token representations even in extremely long contexts.

• Over-squashing alleviation: We show how α-entmax reduces the number of gradient paths from O(n^L) to O(s^L), where L is the number of layers, significantly strengthening gradient flow for long-range dependencies.

• Adaptive-scalable α-entmax: We introduce ASEntmax, which adaptively adjusts sparsity based on sequence length to maintain optimal token selection even in extremely long contexts. Furthermore, we develop NAPE (NoPE + ALiBi Positional Encoding), a hybrid scheme that splits attention heads between content-focused processing (NoPE) and locality-aware positional encoding (ALiBi), leading to a synergistic interaction with α-entmax.

• Empirical validation: We demonstrate substantial performance improvements across diverse tasks. For example, as shown in Figure 1, ASEntmax achieves 76.7% accuracy on associative recall at 65K tokens after training on just 64 tokens—a 1000× length extrapolation.
2 Background
2.1 Transformers
In this work, we study (causal) transformers with sparse attention distributions created by replacing softmax with α-entmax. We present the precise mathematical formulation below, following closely the notation from Barbero et al. (2024). Concretely, given a sequence of token embeddings X ∈ R^{n×d}, where n is the sequence length and d is the hidden dimension, transformers compute query, key, and value projections Q = XW_Q, K = XW_K, and V = XW_V. We denote with q_i, k_i, v_i ∈ R^d the d-dimensional query, key, and value vectors of the i-th token. For each query position i, the representation at layer ℓ for the i-th token is computed as:

u_i^{(ℓ)} = Σ_{j≤i} p_{ij}^{(ℓ)} norm_1^{(ℓ)}(v_j^{(ℓ−1)}) + v_i^{(ℓ−1)},    v_i^{(ℓ)} = FFN^{(ℓ)}(norm_2^{(ℓ)}(u_i^{(ℓ)})) + u_i^{(ℓ)},    (1)

where p_{ij}^{(ℓ)} are attention weights, FFN^{(ℓ)} is the feed-forward network, norm(·) represent LayerNorm modules (Xiong et al., 2020), and the final representation is y_i = norm_3(v_i^{(L)}). The attention weights p_{ij}^{(ℓ)} = π(z_i^{(ℓ)})_j are computed by applying a transformation π : R^n → △_n to the attention logits z_{ij}^{(ℓ)} = ⟨q_i^{(ℓ)}, k_j^{(ℓ)}⟩/√d, where △_n := {p ∈ R^n : p ≥ 0, 1^⊤p = 1} represents the probability simplex. Standard transformers employ the softmax function as π. In this work, we study transformers by casting π as the α-entmax transformation.
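To make the formulation concrete, the sketch below implements one pre-norm causal attention layer in the spirit of Eq. (1) with a pluggable simplex transformation π. It is illustrative only: the single-head setting, the explicit value projection, and the names (CausalAttentionLayer, pi) are our assumptions rather than the paper's implementation.

```python
# Minimal sketch of a pre-norm causal attention layer in the spirit of Eq. (1),
# with a pluggable simplex transformation `pi` (softmax by default).
import math
import torch
import torch.nn as nn

class CausalAttentionLayer(nn.Module):
    def __init__(self, d: int, pi=torch.softmax):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)   # Eq. (1) folds this projection into norm_1(v_j)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.pi = pi                            # maps each row of logits to the simplex

    def forward(self, v_prev: torch.Tensor) -> torch.Tensor:    # v_prev: (n, d)
        n, d = v_prev.shape
        h = self.norm1(v_prev)                                   # norm_1(v_j^{(l-1)})
        q, k = self.wq(h), self.wk(h)
        z = (q @ k.T) / math.sqrt(d)                             # z_ij = <q_i, k_j> / sqrt(d)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        z = z.masked_fill(causal, float("-inf"))                 # keep only j <= i
        p = self.pi(z, dim=-1)                                   # attention weights p_ij
        u = p @ self.wv(h) + v_prev                              # u_i = sum_j p_ij (...) + v_i
        return self.ffn(self.norm2(u)) + u                       # v_i^{(l)} = FFN(norm_2(u_i)) + u_i
```

Passing any α-entmax transformation with the same (logits, dim) interface in place of torch.softmax yields the sparse-attention variant studied in the rest of the paper.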
¹ Our code is available at https://github.com/deep-spin/adasplash.

Figure 2: Visualization of α-entmax(z/θ) for different values of α (softmax α = 1.0; entmax α = 1.5, α = 1.66, α = 2.0). Each panel shows how probability mass is distributed among five elements of z = [2.0, 1.8, 1.6, 1.4, 1.2] as the temperature parameter decreases (θ⁻¹ increases). The vertical lines show the temperature that leads to zero probability.
2.2 α-entmax
α-entmax (Peters et al., 2019) is a differentiable transformation that generalizes softmax by allowing for sparse probability distributions. For an input vector z ∈ R^n and α > 1, α-entmax is defined as:

α-entmax(z)_i = [(α−1)z_i − τ(z)]_+^{1/(α−1)},    (2)

where [·]_+ := max(0, ·) and τ : R^n → R yields a threshold that ensures the resulting distribution sums to 1. A key property of α-entmax is that tokens with scores below the threshold receive exactly zero probability, creating sparse attention patterns. When α → 1⁺, this reduces to the standard softmax function. The sparsity level increases with α, with α = 2 corresponding to the sparsemax function (Martins & Astudillo, 2016). Figure 2 illustrates α-entmax for different values of α. We provide more information on α-entmax in §A. While α-entmax is a suitable choice for sparse attention, its theoretical and empirical impact on long inputs is still unclear. In the next section, we demonstrate how it fundamentally changes the way attention behaves for long contexts.
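For concreteness, here is a minimal NumPy sketch of Eq. (2) that finds the threshold τ(z) by bisection. It is a reference implementation for illustration (the function name alpha_entmax and the iteration count are our choices), not the optimized kernel used in the experiments.

```python
# Sketch of alpha-entmax (Eq. 2) via bisection on the threshold tau(z).
import numpy as np

def alpha_entmax(z: np.ndarray, alpha: float = 1.5, n_iter: int = 50) -> np.ndarray:
    """Return p with p_i = [(alpha-1) z_i - tau]_+^{1/(alpha-1)} and sum(p) = 1."""
    assert alpha > 1.0
    t = alpha - 1.0
    # sum_i [(t z_i - tau)]_+^{1/t} is decreasing in tau; bracket the root.
    lo, hi = t * z.max() - 1.0, t * z.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = np.clip(t * z - tau, 0.0, None) ** (1.0 / t)
        lo, hi = (tau, hi) if mass.sum() > 1.0 else (lo, tau)
    return np.clip(t * z - 0.5 * (lo + hi), 0.0, None) ** (1.0 / t)

# Example: alpha = 2 (sparsemax) zeroes out the two lowest scores of this vector.
z = np.array([2.0, 1.8, 1.6, 1.4, 1.2])
print(alpha_entmax(z, alpha=2.0))   # -> approx [0.53, 0.33, 0.13, 0.00, 0.00]
```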
3 Theoretical Properties of α-entmax for Long Contexts
We analyze the theoretical properties of α-entmax that make it especially suitable for long-context modeling, focusing on how it addresses the fundamental limitations of softmax.
3.1 Non-Vanishing Attention Probabilities
A critical limitation of softmax in transformers is that attention weights inevitably decrease as the sequence length increases. Our first result demonstrates how α-entmax avoids this issue.

Lemma 1 (Non-Vanishing Attention Property). Consider scalars a_1, ..., a_{n−1}, c ∈ R. Let x = [a_1, ..., a_{n−1}, c]^⊤ ∈ R^n and x* = [a_1, ..., a_{n−1}, b, c]^⊤ ∈ R^{n+1}, with all entries bounded. The following properties hold:

• For all α ≥ 1, we have α-entmax(x)_n ≥ α-entmax(x*)_{n+1}. In the softmax case (α = 1), Barbero et al. (2024, Lemma B.1) have shown that the inequality is always strict: softmax(x)_n > softmax(x*)_{n+1}.

• For all α > 1, there is some b_max ∈ R such that, for any b ≤ b_max, we have α-entmax(x)_n = α-entmax(x*)_{n+1}.

Furthermore, for α > 1, the difference α-entmax(x)_n − α-entmax(x*)_{n+1} can take any value in [0, α-entmax(x)_n] by appropriate choice of b.

The proof can be found in §C.1. This result demonstrates a fundamental difference between softmax and α-entmax. Unlike softmax, where adding a new token always strictly reduces the attention probability of existing tokens, α-entmax allows a distinct behavior: the attention probability can remain unchanged. This occurs because the thresholding effect of α-entmax allows tokens with logits below a certain threshold to receive exactly zero attention, letting the model focus only on the relevant tokens.
Having established that α-entmax prevents the vanishing of individual attention weights, we now formalize the broader concept of attention dispersion to better understand how attention distributions as a whole behave as the sequence length increases.
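A quick way to see the lemma in action is the following NumPy sketch (our own illustration, using the closed-form sparsemax, i.e. α = 2): appending a sufficiently low-scoring token b leaves the existing probabilities untouched, whereas softmax strictly shrinks them.

```python
# Numeric illustration of Lemma 1 with sparsemax (alpha = 2).
import numpy as np

def sparsemax(z):
    """Closed-form sparsemax (alpha-entmax with alpha = 2)."""
    zs = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    support = 1.0 + k * zs > np.cumsum(zs)          # entries kept in the support
    tau = (np.cumsum(zs)[support][-1] - 1.0) / support.sum()
    return np.clip(z - tau, 0.0, None)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.array([0.1, 0.2, 0.3, 1.0])                  # last entry plays the role of c
x_ext = np.array([0.1, 0.2, 0.3, -1.0, 1.0])        # insert a low logit b = -1 before c

print(sparsemax(x)[-1], sparsemax(x_ext)[-1])       # equal: b falls below the threshold
print(softmax(x)[-1], softmax(x_ext)[-1])           # strictly smaller after extension
```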

3.2 Attention Dispersion and Concentration
Recent work by Nakanishi (2025) and Veličković et al. (2024) has highlighted attention dispersion as a fundamental limitation of softmax for long context generalization. Building upon these insights, we provide a formal definition to characterize attention dispersion and show how α-entmax naturally exhibits concentration properties that address these limitations.
Definition 1 (Attention Dispersion). Let f : R^n → △_n denote a transformation (such as softmax) mapping logits to the probability simplex △_n := {p ∈ R^n : p ≥ 0, 1^⊤p = 1}.

1. f exhibits complete dispersion if for any bounded sequence of logits (z_n)_{n∈N}, the normalized entropy approaches 1 as the sequence length increases:

lim_{n→∞} H(f(z_{1:n})) / log n = 1.    (3)

2. f exhibits concentration resilience if there are bounded sequences of logits where the normalized entropy remains bounded away from 1:

lim_{n→∞} H(f(z_{1:n})) / log n < 1.    (4)
These definitions allow us to examine how softmax and α-entmax behave as sequence length grows:

Proposition 1 (Dispersion Properties of Attention Mechanisms). Comparing softmax and α-entmax (α > 1) attention mechanisms:

1. α-entmax can retain probability, while softmax always leaks: For any α > 1 and any logits z ∈ R^n, there are logits z* ∈ R^N with N > n such that:

α-entmax(z)_i = α-entmax(z*)_i  ∀i ≤ n.    (5)

This is impossible for α = 1 (softmax), for which we always have softmax(z*)_i < softmax(z)_i.

2. Softmax exhibits complete dispersion: For any fixed temperature θ > 0 and any bounded sequence of logits (z_n)_{n∈N}:

lim_{n→∞} H(softmax(z_{1:n}/θ)) / log n = 1.    (6)

3. α-entmax can exhibit strong concentration resilience: When the support size grows sublinearly as |S| = O(n^β) with β < 1, α-entmax maintains bounded normalized entropy:

lim_{n→∞} H(α-entmax(z_{1:n})) / log n ≤ β < 1.    (7)
The full proof can be consulted in §D. The key takeaway of this result is that the entropy of attention distributions reveals how concentrated or dispersed they are across tokens. While softmax distributions approach maximum entropy Θ(log n) as the sequence length increases (indicating complete dispersion), α-entmax distributions maintain bounded entropy O(log s), where s is the support size. This allows models with α-entmax to maintain focused, low-entropy attention patterns even when processing extremely long sequences, as long as the support size is smaller than the full sequence length. This non-dispersion property means transformers with α-entmax can scale to very long contexts without the attention becoming dispersed, maintaining their ability to focus on relevant information regardless of how much additional context is present. However, attention dispersion is not the only obstacle to effective long-sequence modeling.
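The dispersion contrast is easy to observe numerically. The sketch below (our own illustration, under an arbitrary choice of bounded uniform logits) tracks the normalized entropy H(p)/log n for softmax and sparsemax as n grows: softmax drifts toward 1, while sparsemax stays bounded away from 1 because its support grows sublinearly.

```python
# Empirical check of Proposition 1: normalized entropy vs. sequence length.
import numpy as np

def sparsemax(z):
    zs = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    support = 1.0 + k * zs > np.cumsum(zs)
    tau = (np.cumsum(zs)[support][-1] - 1.0) / support.sum()
    return np.clip(z - tau, 0.0, None)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def normalized_entropy(p, n):
    p = p[p > 0]                                     # 0 log 0 = 0
    return float(-(p * np.log(p)).sum() / np.log(n))

rng = np.random.default_rng(0)
for n in [64, 1024, 16384]:
    z = rng.uniform(-1.0, 1.0, size=n)               # bounded logits
    print(n, normalized_entropy(softmax(z), n), normalized_entropy(sparsemax(z), n))
```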
3.3 Representational Preservation and Over-squashing Alleviation
Two other critical challenges in long-context transformers are representational collapse and over-squashing—both fundamentally linked to the properties of softmax attention (Barbero et al., 2024). To clarify these concepts, we provide precise definitions below.

Figure 3: Empirical verification of representational collapse and over-squashing. Left: Representation difference (L1 norm) after 6 layers across sequence lengths. Softmax (α = 1.0) exhibits inevitable collapse, while α-entmax maintains bounded differences even at extreme lengths. Right: Gradient norms across layers in a transformer (n = 256). α-entmax maintains up to 6× stronger gradient signals than softmax, particularly in earlier layers, demonstrating its ability to alleviate over-squashing.
Definition 2 (Representational Collapse and Over-squashing). A transformer model suffers from representational collapse when token representations become similar as sequence length grows: for any ϵ > 0, there exists n_ϵ such that ∥v_i^{(L)} − v_j^{(L)}∥_1 < ϵ for all i, j ≤ n when n > n_ϵ.
Over-squashing occurs when gradient signals are exponentially diluted as they flow through L layers, with ∥∂y_n/∂v_i^{(0)}∥ < ϵ for any ϵ > 0, due to the O(n^L) gradient paths in softmax attention.
Building on these definitions, we show in §C.2 and §C.3 that α-entmax addresses both issues:

Proposition 2 (Representational Preservation). There are choices of v^{(0)} ∈ R^{n×d} and v^{∗(0)} ∈ R^{(n+1)×d} such that α-entmax (α > 1) can maintain distinct token representations even as n → ∞: ∥v_n^{(L)} − v_{n+1}^{∗(L)}∥_1 ≥ c for some constant c > 0.
Proposition 3 (Over-squashing Alleviation). With support size s ≪ n, α-entmax with α > 1 reduces gradient paths from O(n^L) to O(s^L), alleviating over-squashing via stronger gradient signals.
To empirically validate these properties, we conducted controlled experiments with transformer architectures employing different attention mechanisms, fully described in §C. For representational collapse, we measured how token representations change when sequence length increases. For over-squashing, we analyzed gradient flow through an 8-layer network on a copying task, where the model must copy information across long distances. The results are shown in Figure 3. The left panel shows the L1 norm of representation differences between original and extended sequences after multiple transformer layers. With softmax, this difference rapidly approaches zero as sequence length increases. In contrast, α-entmax maintains bounded differences even at extreme lengths (128K tokens), with higher α values providing stronger preservation. The right panel illustrates gradient flow strength across layers, showing how α-entmax maintains significantly stronger gradient signals than softmax.
4 Adaptive-Scalable α-entmax (ASEntmax)
In the previous section we saw that α-entmax, for any choice of α > 1, can avoid some of the pitfalls of softmax thanks to its ability to assign zero weight to many tokens, ignoring irrelevant information. But what if it ignores too many tokens? Can it handle situations where many tokens are relevant and should be attended to? We show in this section that indeed the model might not be able to cope with this for a fixed α and a fixed temperature, and we propose a practical solution.
4.1 Controlling Sparsity in Long Contexts via ASEntmax
We start by introducing the following remark for a sequence of Gaussian random variables.

Figure 4: Learned positions per head. Besides a simple linear fit baseline (βn), we also show the fit of a log model (β log n) and a log-power model (β(log n)^γ), inherited by ASEntmax.
Remark 1 (Normal Concentration). Let the attention logits z = (z_1, ..., z_n) be IID N(0, σ) random variables and denote

M := max_{1≤i≤n} z_i,   m := min_{1≤i≤n} z_i,   Δ := M − m.    (8)

By standard results on the asymptotic behavior of extreme values of normal random variables (Kamath, 2015), the expected range satisfies E[Δ] ≤ 2σ√(2 log n) and lim_{n→∞} E[Δ] / (2σ√(2 log n)) = 1.
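A short Monte Carlo check of Remark 1 (our own sketch; σ, the trial count, and the sequence lengths are arbitrary choices):

```python
# Monte Carlo check: the expected logit range E[Delta] of n i.i.d. N(0, sigma)
# samples grows like 2 * sigma * sqrt(2 * log(n)).
import numpy as np

rng = np.random.default_rng(0)
sigma, trials = 1.0, 2000
for n in [64, 1024, 16384]:
    z = rng.normal(0.0, sigma, size=(trials, n))
    emp_range = (z.max(axis=1) - z.min(axis=1)).mean()        # Monte Carlo E[Delta]
    bound = 2.0 * sigma * np.sqrt(2.0 * np.log(n))            # asymptotic scale
    print(n, round(emp_range, 2), round(bound, 2), round(emp_range / bound, 2))
```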
Since the logit spread increases only as √(log n), a fixed temperature can cause excessive sparsity in long contexts. To address this, we propose Adaptive-Scalable α-entmax (ASEntmax), which adaptively scales the temperature for each attention head and query based on the context size and content:

ASEntmax(z) = α-entmax((δ + β(log n)^γ) z),    (9)

where β, γ, δ ∈ R are head-specific scalars. Concretely, for each head, we obtain vectors β and γ whose entries contain these coefficients for each query:

β = softplus(Xw_β) ∈ R^n_+,    γ = s · tanh(Xw_γ) ∈ (−s, s)^n,    (10)

where w_β, w_γ ∈ R^d are learnable, head-specific projection vectors. The scaling factor (log n)^γ in Eq. (9) allows the model to learn a slowly rising (γ > 0) or dampening (γ < 0) temperature schedule without interfering with the positional encodings (which would happen with negative values of β). Specifically, for IID Gaussian logits N(0, σ), when δ = 0 and γ = −0.5 the scaling counteracts the growth of logit ranges (Δ_n) as context increases:

β(log n)^{−0.5} · Δ_n = β(log n)^{−0.5} · 2σ√(2 log n) = 2σβ√2,    (11)

which remains constant as n increases, preventing excessive sparsification. Furthermore, with this parameterization, ASEntmax can recover standard α-entmax when β = 0, hence allowing a smooth transition between scaled and unscaled regimes. By making w_β and w_γ head-specific and learnable, ASEntmax can adapt to the optimal scaling behavior for each head, balancing the natural concentration benefits of α-entmax with precise control over how sparsity patterns evolve with sequence length. Finally, we note that simply scaling the query-key products is appealing from a practical perspective since it allows the direct use of fast optimized kernels for α-entmax, such as AdaSplash (Gonçalves et al., 2025), without any modifications.
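The scaling in Eqs. (9)-(10) amounts to a per-query rescaling of the logits before the α-entmax call. The sketch below illustrates this for one head; the module name ASEntmaxScaler, the shapes, and the use of nn.Linear for w_β and w_γ are our assumptions, and any α-entmax implementation (e.g. the bisection sketch in §2.2 or an optimized kernel) can consume the scaled logits.

```python
# Sketch of the ASEntmax logit scaling (Eqs. 9-10) for one attention head.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASEntmaxScaler(nn.Module):
    def __init__(self, d: int, s: float = 1.0, delta: float = 1.0):
        super().__init__()
        self.w_beta = nn.Linear(d, 1, bias=False)    # produces beta_i per query
        self.w_gamma = nn.Linear(d, 1, bias=False)   # produces gamma_i per query
        self.s, self.delta = s, delta

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (n, d) token embeddings, z: (n, n) attention logits for this head.
        n = x.shape[0]
        beta = F.softplus(self.w_beta(x))             # (n, 1), beta >= 0
        gamma = self.s * torch.tanh(self.w_gamma(x))  # (n, 1), gamma in (-s, s)
        scale = self.delta + beta * (math.log(n) ** gamma)   # per-query inverse temperature
        return scale * z                              # feed the scaled logits to alpha-entmax
```

With δ = 1 (the value used in the experiments) and β pushed toward zero, the scaler reduces to plain α-entmax on the raw logits.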
Empirical Analysis. To empirically validate the importance of the parameter γ in our proposed scaling formulation, we conducted experiments on a language modeling task using a 120M-parameter transformer trained on 5B tokens from the FineWeb dataset (Penedo et al., 2024). Following the methodology of Nakanishi (2025), we implemented learnable scaling parameters for the attention logits, but with a key difference: while they used a global scaling parameter, we learn separate scaling parameters for each attention head, motivated by Correia et al. (2019)'s finding that attention heads develop distinct sparsity patterns.
Figure 4 shows the learned per-head scaling values across sequence lengths, together with fitted curves from different scaling models. The results clearly demonstrate that scaling parameters vary significantly across heads, and that a simple log-scaling model β log n (as used by Nakanishi (2025)) provides poor fits for many heads. In contrast, our proposed form δ + β(log n)^γ provides consistently better fits, capturing the diverse scaling behaviors across different attention heads. The complete distribution of fitted β and γ values across all heads is provided in §F, and additional training details can be found in §F.1.

4.2 Interacting with Positional Encoding
Positional bias is known to affect attention dynamics and generalization (Barbero et al., 2025; Wu et al., 2025). In this work, we also introduce NAPE (NoPE + ALiBi Positional Encoding), a hybrid scheme in which half of the attention heads receive no positional encoding (NoPE), enabling flexible, data-driven position inference, while the other half use a fixed monotonic bias (ALiBi) to encourage locality. This complementary design allows some heads to focus on global patterns while others encode recency, combining the strengths of both.
While NAPE is not central to our theoretical results, we find it plays a crucial empirical role in enabling robust length generalization, especially when paired with the sparse inductive bias of α-entmax attention. As detailed in our theoretical and empirical analysis in §E, the NoPE component of NAPE enables models to develop effective relative positional encoding—supporting the flexibility hypothesis of Kazemnejad et al. (2023)—and, more broadly, α-entmax interacts with positional encodings in ways that fundamentally alter attention behavior compared to softmax, such as creating hard attention windows with ALiBi or inducing frequency-dependent patterns with RoPE.
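Concretely, NAPE only changes the additive bias applied to the attention logits of each head. The sketch below builds that bias with ALiBi slopes m = 1/h (the setting used in §5); which half of the heads receives ALiBi and the 1-based head indexing are our assumptions.

```python
# Sketch of the NAPE additive bias: half of the heads get no positional term (NoPE),
# the other half get the ALiBi bias m * (j - i) with slope m = 1/h.
import torch

def nape_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Return a (n_heads, seq_len, seq_len) additive bias for causal attention logits."""
    i = torch.arange(seq_len).view(-1, 1)           # query positions
    j = torch.arange(seq_len).view(1, -1)           # key positions
    rel = (j - i).float()                           # (j - i) <= 0 on and below the diagonal
    bias = torch.zeros(n_heads, seq_len, seq_len)
    for h in range(n_heads // 2, n_heads):          # second half of the heads: ALiBi
        slope = 1.0 / (h + 1)                       # m = 1/h with 1-based head index
        bias[h] = slope * rel                       # upper triangle is masked out anyway
    return bias                                     # add to z before softmax / alpha-entmax

# Usage (per head h): z = q @ k.T / sqrt(d) + nape_bias(n_heads, n)[h], then apply pi.
```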
5 Experiments
A number of works have turned to synthetic tasks as a probing ground for transformers' length-generalization capabilities (Anil et al., 2022; Zhou et al., 2024). Such tasks, like copying a sequence and sorting numbers, allow precise control over training and test lengths, revealing whether a model has truly learned an algorithm that scales or merely memorized patterns within a limited length. Vanilla transformers struggle in this setting: they often achieve perfect accuracy on sequences up to the training length, yet fail catastrophically on even slightly longer sequences (Press et al., 2022). To quantitatively evaluate our proposed improvements, we embrace this paradigm of synthetic tasks for long-sequence testing. Concretely, we evaluate our models on a diverse set of synthetic tasks designed to test different aspects of long-context modeling, covering both position-agnostic reasoning (where token positions are not critical) and position-sensitive operations (where relative or absolute positions matter):

• Retrieval-focused tasks: These include Max Retrieval (Barbero et al., 2024), which requires identifying maximum values in sequences, and Multi-query Multi-token Associative Recall (MQMTAR)—a variant of the task proposed by Arora et al. (2024), but with multi-token keys and values—which involves matching queries to their corresponding key-value pairs. Both tasks test the model's ability to maintain focus on relevant tokens regardless of their positions in long contexts.

• Memory-dependent tasks: We evaluate models on Copy (reproducing input sequences), which assesses how well the model preserves token representations and accesses specific positional information throughout the network. Along the same lines, we also evaluate on 2Back, described in §G.

• Ordering tasks: This category contains tasks such as Sort (arranging tokens in ascending order) and Reverse (outputting tokens in reverse order). These evaluate compositional generalization and positional reasoning, becoming increasingly challenging as sequence length grows.
Experimental Setup. All synthetic tasks are trained with a decoder-like transformer. Moreover, we evaluate models in extreme settings by using as few layers as possible, since our goal is to test the attention mechanism coupled with our positional-encoding strategy rather than the scaling capabilities of transformers. However, for the Reverse task, we increment the layer count until the softmax baseline generalizes to at least 1.5× the in-distribution length, as this task proved especially challenging for softmax models. To improve length extrapolation in models employing RoPE, we apply a RoPE scaling factor of 16. For NAPE, we set ALiBi slopes to m = 1/h, where h is the head index. Finally, we use δ = 1 in ASEntmax models. More experimental details are described in §G.
Discussion. The results, shown in Table 1, reveal two critical factors for length generalization: positional encoding and attention sparsity. With RoPE, all methods struggle beyond 2× the training length, while NAPE (NoPE + ALiBi) enables substantially better generalization across all tasks. Within NAPE variants, ASEntmax dramatically outperforms the others at extreme lengths—maintaining 96.4% accuracy at 256× the training length on MQMTAR (vs. 80.2% for softmax) and 96.4% at 4× on Reverse (vs. 0% for softmax). This positional encoding advantage aligns with our theoretical analysis showing that α-entmax transforms positional biases into adaptive attention windows.

Table 1: Exact match accuracy (%) on representative tasks. For each task, we report in-distribution sequence length n in the first column (n = 64 for all tasks), followed by OOD results at increasing sequence lengths. L indicates the number of layers. Best results are in bold.

MQMTAR (L = 4)
Method          ID      2x      4x     16x     64x    256x   1024x
RoPE
  Softmax      100.0    3.1     0.0     0.0     0.0     0.0     0.0
  SSMax         99.8    6.2     0.0     0.0     0.0     0.0     0.0
  Entmax        99.8   49.4     4.5     0.0     0.0     0.0     0.0
  ASEntmax     100.0   66.9     0.8     0.0     0.0     0.0     0.0
NAPE
  Softmax      100.0  100.0   100.0    99.5    97.8    80.2     3.0
  SSMax         99.9  100.0    99.9    99.4    94.1    53.4     1.0
  Entmax       100.0  100.0   100.0    99.2    92.7    66.8     9.3
  ASEntmax     100.0  100.0    99.9    99.8    99.2    96.4    76.7

Reverse (L = 6)
Method          ID    1.5x      2x      4x      8x
RoPE
  Softmax      100.0    0.0     0.0     0.0     0.0
  SSMax        100.0    0.0     0.0     0.0     0.0
  Entmax       100.0    0.0     0.0     0.0     0.0
  ASEntmax     100.0    0.0     0.0     0.0     0.0
NAPE
  Softmax      100.0   36.0     0.0     0.0     0.0
  SSMax        100.0   54.6     0.0     0.0     0.0
  Entmax       100.0   99.0    86.0    28.5     0.2
  ASEntmax     100.0  100.0    99.8    96.4    56.7

Copy (L = 2)
Method          ID      2x      4x      8x     16x     32x     64x
RoPE
  Softmax      100.0    2.8     0.0     0.0     0.0     0.0     0.0
  SSMax        100.0    0.0     0.0     0.0     0.0     0.0     0.0
  Entmax       100.0   34.3     0.0     0.0     0.0     0.0     0.0
  ASEntmax     100.0    5.3     0.0     0.0     0.0     0.0     0.0
NAPE
  Softmax      100.0  100.0    99.9    99.9    99.4    96.1    85.5
  SSMax        100.0  100.0   100.0    99.9    99.6    99.3    95.8
  Entmax       100.0   99.0    86.0    28.5     0.2     0.0     0.0
  ASEntmax     100.0  100.0    99.9    99.7    99.4    96.3    86.6

Sort (L = 2)
Method          ID      2x      4x      8x
RoPE
  Softmax      100.0    0.0     0.0     0.0
  SSMax        100.0    0.0     0.0     0.0
  Entmax       100.0    0.0     0.0     0.0
  ASEntmax     100.0    0.0     0.0     0.0
NAPE
  Softmax      100.0    0.0     0.0     0.0
  SSMax        100.0    0.0     0.0     0.0
  Entmax       100.0   99.3    57.8     0.0
  ASEntmax     100.0  100.0    79.7     0.0

Figure 5: Accuracy (%) on the Max Retrieval task across different sequence lengths. Adaptive Temperature (Adapt. Temp.) represents the approach proposed by Veličković et al. (2024). (Methods compared: Softmax, Adapt. Temp., SSMax, Entmax α = 1.5, α = 2, α = 16, and ASEntmax; x-axis: sequence length from 16 to 4K.)
Task-specific patterns further validate our theoretical claims. The consistent superiority of ASEntmax over basic α-entmax confirms the benefits of adaptive scaling, particularly at extreme lengths where a fixed α may become too sparse or too diffuse. SSMax performs well on the Copy task, even outperforming other methods, but struggles on more complex tasks like MQMTAR and Reverse at extreme lengths. This indicates that while scaling logits helps maintain peak attention magnitude, the explicit sparsity of α-entmax provides additional benefits by completely removing irrelevant connections. These findings are further supported by results on the Max Retrieval task (Figure 5), where sparse attention mechanisms demonstrate superior length extrapolation compared to dense approaches, with ASEntmax maintaining over 60% accuracy even at 4096-length sequences—a dramatic improvement over standard Softmax and Adaptive Temperature (Veličković et al., 2024). Finally, Copy and Reverse show moderate generalization (up to 64× and 8×, respectively), while Sort fails beyond 4× length for all methods. This pattern suggests that tasks requiring precise global ordering (Sort, Reverse) are inherently more challenging for length generalization than tasks dependent on local or independent token properties. We provide per-task results, including results for other positional encoding methods such as ALiBi and NoPE, in §H.

6 Related Works
Understanding and improving length generalization is a very active research area. We summarize
below the related work in different dimensions.
Attention Dispersion. Recent work has identified attention dispersion as a fundamental limitation in softmax-based transformers (Dong et al., 2021; Veličković et al., 2024). For example, Veličković et al. (2024) demonstrate that softmax attention inevitably disperses focus as sequence length increases, while Nakanishi (2025) proposes SSMax to scale attention logits based on sequence length. Our approach employs α-entmax (Peters et al., 2019), which naturally produces sparse distributions by assigning exactly zero probability to irrelevant tokens. We provide theoretical guarantees that α-entmax maintains bounded normalized entropy as sequence length increases—a property softmax fundamentally lacks. Our ASEntmax further improves α-entmax with learnable, context-dependent scaling, leading to consistent gains over SSMax across diverse tasks.
Representational Collapse and Over-Squashing. Studies analyzing attention patterns in neural networks have noted that increasing depth and context length can induce representational degeneration (Dong et al., 2021; Noci et al., 2022; Zhai et al., 2023). In particular, Barbero et al. (2024) prove that with softmax attention, token representations become indistinguishable as sequence length increases and gradient paths grow as O(n^L), causing exponential signal dilution. Our analysis shows, theoretically and empirically, that α-entmax can address both limitations by maintaining distinct token representations and by reducing gradient paths to O(s^L) for increasing sequence lengths n.
Positional Encodings. The interaction between attention mechanisms and positional encodings significantly impacts long-context generalization. Wu et al. (2025) show how distance bias creates a position advantage for early tokens, while Barbero et al. (2025) analyze frequency components in RoPE (Su et al., 2024). Our work shows how α-entmax transforms these biases: with NoPE, it helps attention to not always converge to early tokens; with ALiBi, it creates hard attention windows—similar to Hard-ALiBi (Jelassi et al., 2024); with RoPE, it induces frequency-dependent sparsity. Based on this analysis, we introduce NAPE (NoPE + ALiBi), which combines ALiBi's consistent recency bias with NoPE's flexible content-based attention for improved long-context generalization. Similar to NAPE, Fu et al. (2025) propose to combine heads with ALiBi and RoPE to increase training stability.
Attention Scaling. Recent work has shown that scaling attention logits is key for maintaining sharp, focused attention in long contexts. Techniques like YaRN (Peng et al., 2024) and the entropy-aware approach of Zhang et al. (2024) use dynamic logit scaling—often alongside modified RoPE—to stabilize attention when training the model on extended contexts. Scalable-Softmax (Nakanishi, 2025) applies a simple log n scaling to logits during training to control dispersion without the need for post-training adaptation. InfoScale (Li et al., 2025) derives scaling rules from the principle of entropy invariance, while Scale-invariant Attention (Anson et al., 2025) introduces position-dependent transformations to balance attention across both local and global contexts. Across these methods, adaptive scaling consistently improves extrapolation to longer sequences. Our ASEntmax builds on this line of work by introducing context-dependent, learnable scaling within the α-entmax framework, enabling sparse, focused attention as context length increases.
Sparse Attention. Previous sparse attention approaches include structured patterns like Longformer (Beltagy et al., 2020), BigBird (Zaheer et al., 2020), and FlashAttention (Dao et al., 2022), as well as adaptive methods like α-entmax (Peters et al., 2019) and Sparsefinder (Treviso et al., 2022), recently accelerated by AdaSplash (Gonçalves et al., 2025). Our work differs by providing a theoretical analysis of why sparsity helps with long-context generalization, linking it to fundamental limitations including dispersion, representational collapse, and over-squashing.
7 Conclusions
In this paper, we present a principled approach to long-context modeling by replacing softmax with α-entmax in transformer attention. Our theoretical analysis demonstrates how this simple change addresses three fundamental limitations: it avoids attention dispersion through naturally sparse distributions, prevents representational collapse by maintaining distinct token representations, and alleviates over-squashing by reducing gradient paths from O(n^L) to O(s^L), where s ≪ n is the number of tokens with nonzero probability. We further introduce Adaptive-Scalable α-entmax (ASEntmax), which adaptively adjusts sparsity based on sequence length for each attention head and query input. Our empirical results confirm these theoretical predictions, with ASEntmax significantly outperforming softmax and existing alternatives across diverse tasks, maintaining high performance even at sequence lengths 1000× beyond training examples. These findings suggest that addressing the fundamental mathematical limitations of transformer attention mechanisms may provide a direct path to achieve robust long-context generalization.
Limitations. While our experiments focus on controlled settings with small transformers, the theoretical principles we establish should extend to deeper architectures and larger scales; however, future research on sparse attention in production-scale models is still needed, especially with multi-stage approaches commonly employed for extending the context length in traditional LLMs (Grattafiori et al., 2024).
Acknowledgments
We thank Hugo Pitorro for his insightful and constructive comments throughout this work. We also
thank the SARDINE Lab members for reviewing this paper and providing helpful feedback. This
work was supported by the Portuguese Recovery and Resilience Plan through project C645008882-00000055 (Center for Responsible AI), by the EU's Horizon Europe Research and Innovation Actions
(UTTER, contract 101070631), by the project DECOLLAGE (ERC-2022-CoG 101088763), and
by FCT/MECI through national funds and when applicable co-funded EU funds under UID/50008:
Instituto de Telecomunicações.
References
Alon, U. and Yahav, E. On the bottleneck of graph neural networks and its practical implications. In
International Conference on Learning Representations, 2021. URLhttps://openreview.net/
forum?id=i80OPhOCVH2.
Anil, C., Wu, Y., Andreassen, A., Lewkowycz, A., Misra, V., Ramasesh, V., Slone, A., Gur-Ari, G.,
Dyer, E., and Neyshabur, B. Exploring length generalization in large language models.Advances
in Neural Information Processing Systems, 35:38546–38556, 2022.
Anson, B., Wang, X., and Aitchison, L. Scale-invariant attention.arXiv preprint arXiv:2505.17083,
2025.
Arora, S., Eyuboglu, S., Timalsina, A., Johnson, I., Poli, M., Zou, J., Rudra, A., and Re, C. Zoology:
Measuring and improving recall in efficient language models. InThe Twelfth International
Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=
LY3ukUANko.
Arroyo, Á., Gravina, A., Gutteridge, B., Barbero, F., Gallicchio, C., Dong, X., Bronstein, M., and
Vandergheynst, P. On vanishing gradients, over-smoothing, and over-squashing in gnns: Bridging
recurrent and graph learning.arXiv preprint arXiv:2502.10818, 2025.
Barbero, F., Banino, A., Kapturowski, S., Kumaran, D., Madeira Araújo, J., Vitvitskyi, O., Pascanu, R., and Veličković, P. Transformers need glasses! information over-squashing in language tasks. Advances in Neural Information Processing Systems, 37:98111–98142, 2024.
Barbero, F., Vitvitskyi, A., Perivolaropoulos, C., Pascanu, R., and Veličković, P. Round and round we go! what makes rotary positional encodings useful? In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=GtvuNrk58a.
Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer.
arXiv:2004.05150, 2020.

Correia, G. M., Niculae, V., and Martins, A. F. T. Adaptively sparse transformers. In Inui, K., Jiang,
J., Ng, V., and Wan, X. (eds.),Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pp. 2174–2184, Hong Kong, China, November 2019. Association
for Computational Linguistics. doi: 10.18653/v1/D19-1223. URLhttps://aclanthology.
org/D19-1223/.
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact
attention with IO-awareness. InAdvances in Neural Information Processing Systems (NeurIPS),
2022.
Dong, Y., Cordonnier, J.-B., and Loukas, A. Attention is not all you need: Pure attention loses rank
doubly exponentially with depth. InInternational conference on machine learning, pp. 2793–2803.
PMLR, 2021.
Dziri, N., Lu, X., Sclar, M., Li, X. L., Jiang, L., Lin, B. Y., Welleck, S., West, P., Bhagavatula, C., Bras,
R. L., Hwang, J. D., Sanyal, S., Ren, X., Ettinger, A., Harchaoui, Z., and Choi, Y. Faith and fate:
Limits of transformers on compositionality. InThirty-seventh Conference on Neural Information
Processing Systems, 2023. URLhttps://openreview.net/forum?id=Fkckkr3ya8.
Fu, Z., Song, W., Wang, Y., Wu, X., Zheng, Y., Zhang, Y., Xu, D., Wei, X., Xu, T., and Zhao, X. Slid-
ing window attention training for efficient large language models.arXiv preprint arXiv:2502.18845,
2025.
Gonçalves, N., Treviso, M., and Martins, A. F. Adasplash: Adaptive sparse flash attention.arXiv
preprint arXiv:2502.12082, 2025.
Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur,
A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra,
A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A., Rodriguez, A., Gregerson,
A., Spataru, A., Roziere, B., Biron, B., Tang, B., Chern, B., Caucheteux, C., Nayak, C., Bi, C.,
Marra, C., McConnell, C., Keller, C., Touret, C., Wu, C., Wong, C., Ferrer, C. C., Nikolaidis, C.,
Allonsius, D., Song, D., Pintz, D., Livshits, D., Wyatt, D., Esiobu, D., Choudhary, D., Mahajan,
D., Garcia-Olano, D., Perino, D., Hupkes, D., Lakomkin, E., AlBadawy, E., Lobanova, E., Dinan,
E., Smith, E. M., Radenovic, F., Guzmán, F., Zhang, F., Synnaeve, G., Lee, G., Anderson, G. L.,
Thattai, G., Nail, G., Mialon, G., Pang, G., Cucurell, G., Nguyen, H., Korevaar, H., Xu, H.,
Touvron, H., Zarov, I., Ibarra, I. A., Kloumann, I., Misra, I., Evtimov, I., Zhang, J., Copet, J., Lee,
J., Geffert, J., Vranes, J., Park, J., Mahadeokar, J., Shah, J., van der Linde, J., Billock, J., Hong, J.,
Lee, J., Fu, J., Chi, J., Huang, J., Liu, J., Wang, J., Yu, J., Bitton, J., Spisak, J., Park, J., Rocca, J.,
Johnstun, J., Saxe, J., Jia, J., Alwala, K. V., Prasad, K., Upasani, K., Plawiak, K., Li, K., Heafield,
K., Stone, K., El-Arini, K., Iyer, K., Malik, K., Chiu, K., Bhalla, K., Lakhotia, K., Rantala-Yeary,
L., van der Maaten, L., Chen, L., Tan, L., Jenkins, L., Martin, L., Madaan, L., Malo, L., Blecher,
L., Landzaat, L., de Oliveira, L., Muzzi, M., Pasupuleti, M., Singh, M., Paluri, M., Kardas, M.,
Tsimpoukelli, M., Oldham, M., Rita, M., Pavlova, M., Kambadur, M., Lewis, M., Si, M., Singh,
M. K., Hassan, M., Goyal, N., Torabi, N., Bashlykov, N., Bogoychev, N., Chatterji, N., Zhang, N.,
Duchenne, O., Çelebi, O., Alrassy, P., Zhang, P., Li, P., Vasic, P., Weng, P., Bhargava, P., Dubal,
P., Krishnan, P., Koura, P. S., Xu, P., He, Q., Dong, Q., Srinivasan, R., Ganapathy, R., Calderer,
R., Cabral, R. S., Stojnic, R., Raileanu, R., Maheswari, R., Girdhar, R., Patel, R., Sauvestre, R.,
Polidoro, R., Sumbaly, R., Taylor, R., Silva, R., Hou, R., Wang, R., Hosseini, S., Chennabasappa,
S., Singh, S., Bell, S., Kim, S. S., Edunov, S., Nie, S., Narang, S., Raparthy, S., Shen, S., Wan, S.,
Bhosale, S., Zhang, S., Vandenhende, S., Batra, S., Whitman, S., Sootla, S., Collot, S., Gururangan,
S., Borodinsky, S., Herman, T., Fowler, T., Sheasha, T., Georgiou, T., Scialom, T., Speckbacher, T.,
Mihaylov, T., Xiao, T., Karn, U., Goswami, V., Gupta, V., Ramanathan, V., Kerkez, V., Gonguet,
V., Do, V., Vogeti, V., Albiero, V., Petrovic, V., Chu, W., Xiong, W., Fu, W., Meers, W., Martinet,
X., Wang, X., Wang, X., Tan, X. E., Xia, X., Xie, X., Jia, X., Wang, X., Goldschlag, Y., Gaur,
Y., Babaei, Y., Wen, Y., Song, Y., Zhang, Y., Li, Y., Mao, Y., Coudert, Z. D., Yan, Z., Chen, Z.,
Papakipos, Z., Singh, A., Srivastava, A., Jain, A., Kelsey, A., Shajnfeld, A., Gangidi, A., Victoria,
A., Goldstand, A., Menon, A., Sharma, A., Boesenberg, A., Baevski, A., Feinstein, A., Kallet, A.,
Sangani, A., Teo, A., Yunus, A., Lupu, A., Alvarado, A., Caples, A., Gu, A., Ho, A., Poulton, A.,
Ryan, A., Ramchandani, A., Dong, A., Franco, A., Goyal, A., Saraf, A., Chowdhury, A., Gabriel,
A., Bharambe, A., Eisenman, A., Yazdan, A., James, B., Maurer, B., Leonhardi, B., Huang, B.,
Loyd, B., Paola, B. D., Paranjape, B., Liu, B., Wu, B., Ni, B., Hancock, B., Wasti, B., Spence, B.,
Stojkovic, B., Gamido, B., Montalvo, B., Parker, C., Burton, C., Mejia, C., Liu, C., Wang, C., Kim,
C., Zhou, C., Hu, C., Chu, C.-H., Cai, C., Tindal, C., Feichtenhofer, C., Gao, C., Civin, D., Beaty,
D., Kreymer, D., Li, D., Adkins, D., Xu, D., Testuggine, D., David, D., Parikh, D., Liskovich, D.,
Foss, D., Wang, D., Le, D., Holland, D., Dowling, E., Jamil, E., Montgomery, E., Presani, E., Hahn,
E., Wood, E., Le, E.-T., Brinkman, E., Arcaute, E., Dunbar, E., Smothers, E., Sun, F., Kreuk, F.,
Tian, F., Kokkinos, F., Ozgenel, F., Caggioni, F., Kanayet, F., Seide, F., Florez, G. M., Schwarz, G.,
Badeer, G., Swee, G., Halpern, G., Herman, G., Sizov, G., Guangyi, Zhang, Lakshminarayanan, G.,
Inan, H., Shojanazeri, H., Zou, H., Wang, H., Zha, H., Habeeb, H., Rudolph, H., Suk, H., Aspegren,
H., Goldman, H., Zhan, H., Damlaj, I., Molybog, I., Tufanov, I., Leontiadis, I., Veliche, I.-E., Gat,
I., Weissman, J., Geboski, J., Kohli, J., Lam, J., Asher, J., Gaya, J.-B., Marcus, J., Tang, J., Chan, J.,
Zhen, J., Reizenstein, J., Teboul, J., Zhong, J., Jin, J., Yang, J., Cummings, J., Carvill, J., Shepard,
J., McPhie, J., Torres, J., Ginsburg, J., Wang, J., Wu, K., U, K. H., Saxena, K., Khandelwal, K.,
Zand, K., Matosich, K., Veeraraghavan, K., Michelena, K., Li, K., Jagadeesh, K., Huang, K.,
Chawla, K., Huang, K., Chen, L., Garg, L., A, L., Silva, L., Bell, L., Zhang, L., Guo, L., Yu, L.,
Moshkovich, L., Wehrstedt, L., Khabsa, M., Avalani, M., Bhatt, M., Mankus, M., Hasson, M.,
Lennie, M., Reso, M., Groshev, M., Naumov, M., Lathi, M., Keneally, M., Liu, M., Seltzer, M. L.,
Valko, M., Restrepo, M., Patel, M., Vyatskov, M., Samvelyan, M., Clark, M., Macey, M., Wang,
M., Hermoso, M. J., Metanat, M., Rastegari, M., Bansal, M., Santhanam, N., Parks, N., White,
N., Bawa, N., Singhal, N., Egebo, N., Usunier, N., Mehta, N., Laptev, N. P., Dong, N., Cheng, N.,
Chernoguz, O., Hart, O., Salpekar, O., Kalinli, O., Kent, P., Parekh, P., Saab, P., Balaji, P., Rittner,
P., Bontrager, P., Roux, P., Dollar, P., Zvyagina, P., Ratanchandani, P., Yuvraj, P., Liang, Q., Alao,
R., Rodriguez, R., Ayub, R., Murthy, R., Nayani, R., Mitra, R., Parthasarathy, R., Li, R., Hogan,
R., Battey, R., Wang, R., Howes, R., Rinott, R., Mehta, S., Siby, S., Bondu, S. J., Datta, S., Chugh,
S., Hunt, S., Dhillon, S., Sidorov, S., Pan, S., Mahajan, S., Verma, S., Yamamoto, S., Ramaswamy,
S., Lindsay, S., Lindsay, S., Feng, S., Lin, S., Zha, S. C., Patil, S., Shankar, S., Zhang, S., Zhang,
S., Wang, S., Agarwal, S., Sajuyigbe, S., Chintala, S., Max, S., Chen, S., Kehoe, S., Satterfield,
S., Govindaprasad, S., Gupta, S., Deng, S., Cho, S., Virk, S., Subramanian, S., Choudhury, S.,
Goldman, S., Remez, T., Glaser, T., Best, T., Koehler, T., Robinson, T., Li, T., Zhang, T., Matthews,
T., Chou, T., Shaked, T., Vontimitta, V., Ajayi, V., Montanez, V., Mohan, V., Kumar, V. S., Mangla,
V., Ionescu, V., Poenaru, V., Mihailescu, V. T., Ivanov, V., Li, W., Wang, W., Jiang, W., Bouaziz,
W., Constable, W., Tang, X., Wu, X., Wang, X., Wu, X., Gao, X., Kleinman, Y., Chen, Y., Hu, Y.,
Jia, Y., Qi, Y., Li, Y., Zhang, Y., Zhang, Y., Adi, Y., Nam, Y., Yu, Wang, Zhao, Y., Hao, Y., Qian,
Y., Li, Y., He, Y., Rait, Z., DeVito, Z., Rosnbrick, Z., Wen, Z., Yang, Z., Zhao, Z., and Ma, Z. The
llama 3 herd of models, 2024. URLhttps://arxiv.org/abs/2407.21783.
Jelassi, S., d’Ascoli, S., Domingo-Enrich, C., Wu, Y., Li, Y., and Charton, F. Length generalization in
arithmetic transformers.arXiv preprint arXiv:2306.15400, 2023.
Jelassi, S., Brandfonbrener, D., Kakade, S. M., and Malach, E. Repeat after me: Transformers are
better than state space models at copying.arXiv preprint arXiv:2402.01032, 2024.
Kamath, G. Bounds on the expectation of the maximum of samples from a gaussian. http://www.gautamkamath.com/writings/gaussian_max.pdf, 2015. Accessed: 2025-05-13.
Kazemnejad, A., Padhi, I., Natesan Ramamurthy, K., Das, P., and Reddy, S. The impact of positional
encoding on length generalization in transformers.Advances in Neural Information Processing
Systems, 36:24892–24928, 2023.
Li, K., Kong, Y., Xu, Y., Su, J., Huang, L., Zhang, R., and Zhou, F. Information entropy invariance:
Enhancing length extrapolation in attention mechanisms.arXiv preprint arXiv:2501.08570, 2025.
Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Exposing attention glitches with
flip-flop language modeling. InThirty-seventh Conference on Neural Information Processing
Systems, 2023. URLhttps://openreview.net/forum?id=VzmpXQAn6E.
Martins, A. and Astudillo, R. From softmax to sparsemax: A sparse model of attention and multi-label
classification. In Balcan, M. F. and Weinberger, K. Q. (eds.),International Conference on Machine
Learning (ICML), volume 48 ofProceedings of Machine Learning Research, pp. 1614–1623, New
York, New York, USA, 20–22 Jun 2016. PMLR. URLhttp://proceedings.mlr.press/v48/
martins16.html.

Martins, A. F., Treviso, M., Farinhas, A., Aguiar, P. M., Figueiredo, M. A., Blondel, M., and Niculae,
V. Sparse continuous distributions and fenchel-young losses.Journal of Machine Learning
Research, 23(257):1–74, 2022.
Nakanishi, K. M. Scalable-softmax is superior for attention.arXiv preprint arXiv:2501.19399, 2025.
Noci, L., Anagnostidis, S., Biggio, L., Orvieto, A., Singh, S. P., and Lucchi, A. Signal propagation
in transformers: Theoretical perspectives and the role of rank collapse.Advances in Neural
Information Processing Systems, 35:27198–27211, 2022.
Penedo, G., Kydlíček, H., allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L. V., and Wolf, T. The fineweb datasets: Decanting the web for the finest text data at scale. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=n6SCkn2QaG.
Peng, B., Quesnelle, J., Fan, H., and Shippole, E. YaRN: Efficient context window extension of large
language models. InThe Twelfth International Conference on Learning Representations, 2024.
URLhttps://openreview.net/forum?id=wHBfxhZu1u.
Peters, B., Niculae, V., and Martins, A. F. T. Sparse sequence-to-sequence models. InProceedings
of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1504–1519,
Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1146.
URLhttps://www.aclweb.org/anthology/P19-1146.
Press, O., Smith, N., and Lewis, M. Train short, test long: Attention with linear biases enables
input length extrapolation. InInternational Conference on Learning Representations, 2022. URL
https://openreview.net/forum?id=R8sQPpGCv0.
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. Roformer: Enhanced transformer with rotary
position embedding.Neurocomputing, 568:127063, 2024.
Treviso, M., Góis, A., Fernandes, P., Fonseca, E., and Martins, A. Predicting attention sparsity in
transformers. InProceedings of the Sixth Workshop on Structured Prediction for NLP, pp. 67–81,
Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.
spnlp-1.7. URLhttps://aclanthology.org/2022.spnlp-1.7.
Tsallis, C. Possible generalization of boltzmann-gibbs statistics.Journal of Statistical Physics, 1988.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. URL https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
Veličković, P., Perivolaropoulos, C., Barbero, F., and Pascanu, R. softmax is not enough (for sharp out-of-distribution). arXiv preprint arXiv:2410.01104, 2024.
Wu, X., Wang, Y., Jegelka, S., and Jadbabaie, A. On the emergence of position bias in transformers.
arXiv preprint arXiv:2502.01951, 2025.
Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T.
On layer normalization in the transformer architecture. InInternational conference on machine
learning, pp. 10524–10533. PMLR, 2020.
Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula,
A., Wang, Q., Yang, L., et al. Big bird: Transformers for longer sequences.Advances in neural
information processing systems, 33:17283–17297, 2020.
Zhai, S., Likhomanenko, T., Littwin, E., Busbridge, D., Ramapuram, J., Zhang, Y., Gu, J., and
Susskind, J. M. Stabilizing transformer training by preventing attention entropy collapse. In
International Conference on Machine Learning, pp. 40770–40803. PMLR, 2023.
Zhang, Y., Li, J., and Liu, P. Extending llms’ context window with 100 samples.arXiv preprint
arXiv:2401.07004, 2024.

Zhou, H., Bradley, A., Littwin, E., Razin, N., Saremi, O., Susskind, J. M., Bengio, S., and Nakkiran,
P. What algorithms can transformers learn? a study in length generalization. InThe Twelfth
International Conference on Learning Representations, 2024. URLhttps://openreview.net/
forum?id=AssIuHnmHX.

A  α-entmax Transformation
For α > 1, the α-entmax transformation of a score vector z ∈ R^n is defined as (Peters et al., 2019):

α-entmax(z) := argmax_{p∈△_n} p^⊤z + H_α(p),    △_n := {p ∈ R^n : p ≥ 0, 1^⊤p = 1},    (12)

where H_α(p) is the Tsallis(α) entropy (Tsallis, 1988). The closed form for α-entmax with α > 1 is

p*_i = [(α−1)z_i − τ(z)]_+^{1/(α−1)},

where [·]_+ = max(0, ·), and τ(z) is chosen so that p* sums to 1. Figure 2 shows how α-entmax with tempered scores z/θ behaves for different choices of α.
B Model Definition and Notation
In this work, we study (causal) transformers with sparse attention distributions created by replacing softmax with α-entmax. We present the precise mathematical formulation of our model below, following closely the notation from Barbero et al. (2024).
Let Q = XW_Q, K = XW_K, V = XW_V ∈ R^{n×d} be the query, key, and value projections of the input embeddings respectively, where n is the sequence length and d the hidden size. We denote with q_i, k_i, v_i ∈ R^d the d-dimensional query, key, and value vectors of the i-th token. For a single attention head, transformers compute the representation of the i-th token through the following layer-wise transformations:²

u_i^{(ℓ)} = Σ_{j≤i} p_{ij}^{(ℓ)} norm_1^{(ℓ)}(v_j^{(ℓ−1)}) + v_i^{(ℓ−1)},    (13)
v_i^{(ℓ)} = FFN^{(ℓ)}(norm_2^{(ℓ)}(u_i^{(ℓ)})) + u_i^{(ℓ)},    (14)
y_i = norm_3(v_i^{(L)}),    (15)
where ℓ is the layer index, p_{ij}^{(ℓ)} represents the attention weights, FFN^{(ℓ)} : R^d → R^d represents the feed-forward network, and norm_1^{(ℓ)}, norm_2^{(ℓ)}, and norm_3 are normalization functions. The final representation y_i is computed after applying L transformer layers. For next-token prediction tasks, the model output typically depends solely on y_n, the final representation of the last token. The attention weights p_{ij}^{(ℓ)} are computed by applying a transformation π : R^n → △_n as follows:

p_{ij}^{(ℓ)} = π(z_i^{(ℓ)})_j,    (16)

where z_i^{(ℓ)} ∈ R^n is the vector of logits for token i at layer ℓ, with elements z_{ij}^{(ℓ)} = ⟨q_i^{(ℓ)}, k_j^{(ℓ)}⟩/√d. The function π maps these logits to a probability distribution over the n tokens, with △_n denoting the probability simplex. In standard transformers, π is the softmax function:

softmax(z)_j = exp(z_j) / Σ_{k≤i} exp(z_k).    (17)
In our approach, we replace softmax with α-entmax (§A). We group the attention weights into an attention matrix at the ℓ-th layer, defined element-wise as [P^{(ℓ)}]_{ij} := p_{ij}^{(ℓ)}. This is a row-stochastic lower triangular matrix that can also be interpreted as a probabilistic directed graph. Finally, when incorporating positional information, we modify the attention logits computation according to the chosen positional encoding strategy (a code sketch of the three variants follows the list):

• NoPE: z_{ij}^{(ℓ)} = ⟨q_i^{(ℓ)}, k_j^{(ℓ)}⟩/√d.
• ALiBi: z_{ij}^{(ℓ)} = ⟨q_i^{(ℓ)}, k_j^{(ℓ)}⟩/√d + m·(j−i), where m ∈ R is a slope hyperparameter.
• RoPE: z_{ij}^{(ℓ)} = (q_i^{(ℓ)})^⊤ R^{j−i} k_j^{(ℓ)}, where R ∈ R^{d×d} is a block-diagonal rotation matrix.
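A minimal sketch of the three logit variants above for a single query-key pair; the RoPE frequency schedule (base 10000) and the 2×2 block layout are illustrative assumptions.

```python
# Sketch of NoPE / ALiBi / RoPE attention logits for one query-key pair.
import math
import numpy as np

def nope_logit(q, k, d):
    return q @ k / math.sqrt(d)

def alibi_logit(q, k, d, i, j, m):
    return q @ k / math.sqrt(d) + m * (j - i)

def rope_logit(q, k, d, i, j, base=10000.0):
    # R^{j-i} is block-diagonal with 2x2 rotations by (j - i) * base^(-2b/d).
    R = np.zeros((d, d))
    for b in range(d // 2):
        theta = (j - i) * base ** (-2 * b / d)
        c, s = math.cos(theta), math.sin(theta)
        R[2 * b:2 * b + 2, 2 * b:2 * b + 2] = [[c, -s], [s, c]]
    return q @ (R @ k)                              # no sqrt(d) scaling, as in the bullet above

d = 4
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)
print(nope_logit(q, k, d), alibi_logit(q, k, d, i=8, j=3, m=0.5), rope_logit(q, k, d, i=8, j=3))
```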
² Following Barbero et al. (2024), we omit the linear projections used to compute the vectors from the output of previous layers for clarity; however, this does not impact our derivations and conclusions.

C Representational Collapse and Over-squashing
C.1 Proof of Lemma 1
Adding a new element to the sequence of logits can only redistribute probability mass, so α-entmax(x)_n ≥ α-entmax(x*)_{n+1} must always hold, with equality iff α-entmax(x*)_n = 0. Since softmax (α = 1) cannot return zeros, we must have a strict inequality for α = 1.
For α > 1, we need to find the value b_max such that, for any b ≤ b_max, α-entmax(x*)_n = 0 holds. From the definition of α-entmax (2), a token i receives non-zero probability iff (α−1)z_i > τ(z), where τ(z) is the threshold ensuring the sum of probabilities equals 1. Therefore, for the token b in the extended sequence x* to receive zero probability (thus not affecting other probabilities), we need:

(α−1)b ≤ τ(x*).    (18)

We know that τ(x*) ≥ τ(x) in general for α-entmax, as shown by Peters et al. (2019) and Martins et al. (2022, Lemma 3; Proposition 4). Therefore, a sufficient condition is:

(α−1)b ≤ τ(x).    (19)

Solving for b, we get:

b ≤ τ(x)/(α−1).    (20)

Thus, we can define b_max = τ(x)/(α−1). For any b ≤ b_max, the token at position n in x* receives zero attention, meaning it does not affect the normalization. Therefore, τ(x*) = τ(x), which means that the condition (20) is both necessary and sufficient, and:

α-entmax(x)_n = [(α−1)c − τ(x)]_+^{1/(α−1)} = [(α−1)c − τ(x*)]_+^{1/(α−1)} = α-entmax(x*)_{n+1}.    (21)

By choosing different values of b above b_max, we can control the change in threshold τ(x*), and consequently the difference α-entmax(x)_n − α-entmax(x*)_{n+1} can be as large as α-entmax(x)_n.
C.2 Proof of Proposition 2
We prove Proposition 2 by establishing the following result:
Proposition 4 (Counterexample to Representational Collapse with α-entmax). Let v ∈ R^{(n−1)×d} be a sequence of embedding vectors, and define:

v^{(0)} = [v, v_a]^⊤ ∈ R^{n×d},    v^{∗(0)} = [v, v_a, v_a]^⊤ ∈ R^{(n+1)×d},    (22)

where the final token v_a ∈ R^d is repeated.
For appropriate choice of embeddings and α > 1, there exists a constant c > 0 independent of n such that:

∥v_n^{(L)} − v_{n+1}^{∗(L)}∥_1 ≥ c > 0    (23)

after L transformer layers with α-entmax attention, demonstrating that representational collapse is not inevitable.
In contrast, Barbero et al. (2024) proved that for softmax attention, ∥v_n^{(L)} − v_{n+1}^{∗(L)}∥_1 → 0 as n → ∞ for any such construction.
Proof. We prove this through explicit construction.
Since v_{1:n−1}^{(0)} = v_{1:n−1}^{∗(0)} and both sequences end with v_a, the attention logits computed by the final tokens are:

z_n^{(1)} = [v_a^⊤v_1, ..., v_a^⊤v_{n−1}, v_a^⊤v_a],    (24)
z_{n+1}^{∗(1)} = [v_a^⊤v_1, ..., v_a^⊤v_{n−1}, v_a^⊤v_a, v_a^⊤v_a].    (25)

Consider the specific embedding choice where:
• v_a is chosen such that v_a^⊤v_a = ϕ for some ϕ ∈ R.
• v_i for i = 1, ..., n−1 are chosen such that v_a^⊤v_i = b for some value b < ϕ.

This construction yields the following attention logits:

z_n^{(1)} = [b, b, ..., b, ϕ],    (26)
z_{n+1}^{∗(1)} = [b, b, ..., b, ϕ, ϕ].    (27)

Specific counterexample. Consider d = 1, α = 2.0, ϕ = 0.5, and b = 0. We can construct inputs as v_a = √0.5 and v_i = 0 for i = 1, ..., n−1. The attention logits are:

z_n^{(1)} = [0, ..., 0, 0.5],    (28)
z_{n+1}^{∗(1)} = [0, 0, ..., 0, 0.5, 0.5].    (29)
Forα= 2.0(sparsemax), the attention distributions are:
sparsemax(z
(1)
n) = [pb, pb, . . . , pb, pn](dense) (30)
sparsemax(z
∗(1)
n+1
) = [0,0, . . . ,0,
1
2
,
1
2
](sparse) (31)
where0< pb< pn<1. Asn→ ∞, we havepn→p

where:
p

= [(α−1)(ϕ−b)]
1
α−1= [(2−1)(0.5−0)]
1
2−1= (0.5)
1
= 0.5. (32)
The average representation is¯v=
1
n−1
P
n−1
i=1
0 = 0, and therefore:
lim
n→∞
∥v
(1)
n−v
∗(1)
n+1
∥1= (1−p

)|¯v−va|= (1−0.5)|0−

0.5|= 0.5×

0.5≈0.354.(33)
This demonstrates that theL1difference remains bounded at approximatelyc= 0.354 , independent
of sequence lengthn.This specific example already establishes the existence of a counterexample
to representational collapse withα-entmax. We now extend this result to prove Proposition
general constructions ofvandva.
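The limiting gap of Equation (33) can be reproduced with a few lines of code; the sketch below is our own illustration using sparsemax ($\alpha = 2$) and the $d = 1$ construction above.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax (alpha-entmax with alpha = 2) via the sorted-threshold algorithm."""
    z = np.asarray(z, dtype=float)
    zs = np.sort(z)[::-1]
    css = np.cumsum(zs)
    ks = np.arange(1, len(z) + 1)
    support = 1.0 + ks * zs > css
    k = ks[support][-1]
    tau = (css[k - 1] - 1.0) / k
    return np.clip(z - tau, 0.0, None)

phi, b = 0.5, 0.0
va = np.sqrt(phi)                              # v_a chosen so that v_a * v_a = phi
for n in [128, 1024, 8192, 65536]:
    z = np.full(n, b)
    z[-1] = phi                                # logits of the original sequence
    z_ext = np.append(z, phi)                  # extended sequence: phi repeated
    p, p_ext = sparsemax(z), sparsemax(z_ext)
    v_n = p[-1] * va                           # all values are 0 except v_a
    v_ext = p_ext[-1] * va + p_ext[-2] * va    # = 0.5*v_a + 0.5*v_a = v_a
    print(n, abs(v_n - v_ext))                 # approaches 0.5*sqrt(0.5) ~ 0.354
```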
General Construction. With values of $\phi$ such that $b + 2^{-(\alpha-1)}/(\alpha-1) \le \phi < b + 1/(\alpha-1)$, $\alpha$-entmax produces the following distributions:³
$$\alpha\text{-entmax}(z^{(1)}_n) = [p^{(1)}_1, p^{(1)}_2, \ldots, p^{(1)}_{n-1}, p^{(1)}_n], \tag{34}$$
$$\alpha\text{-entmax}(z^{*(1)}_{n+1}) = [0, 0, \ldots, 0, \tfrac12, \tfrac12]. \tag{35}$$
In particular, since the first $n-1$ positions share the same representation, we have $p^{(1)}_1 = p^{(1)}_2 = \cdots = p^{(1)}_{n-1} = (1-p^{(1)}_n)/(n-1) = p^{(1)}_b > 0$, with $0 < p^{(1)}_n < 1$. This leads to the representations:
$$v^{(1)}_n = p^{(1)}_b v_1 + \cdots + p^{(1)}_b v_{n-1} + p^{(1)}_n v_a, \tag{36}$$
$$v^{*(1)}_{n+1} = \tfrac12 v_a + \tfrac12 v_a = v_a. \tag{37}$$
Let $\bar{v} := \frac{1}{n-1}\sum_{i=1}^{n-1} v_i$ denote the average of the first block of vectors. Taking the $L_1$-norm of the difference of representations:
$$\|v^{(1)}_n - v^{*(1)}_{n+1}\|_1 = \big\|(1-p^{(1)}_n)\bar{v} + p^{(1)}_n v_a - v_a\big\|_1 \tag{38}$$
$$= (1-p^{(1)}_n)\,\|\bar{v} - v_a\|_1. \tag{39}$$
We need to show that the above expression does not tend to 0 as $n\to\infty$. To that end, we need (i) $\lim_{n\to\infty} p^{(1)}_n = p^* < 1$, and (ii) $\lim_{n\to\infty}\|\bar{v} - v_a\|_1 = c > 0$.

³The upper bound ensures a dense output for $z^{(1)}_n$, following Lemma 2 with $\tau = (\alpha-1)\phi - 1$. The lower bound ensures a sparse output for $z^{*(1)}_{n+1}$, following Lemma 2 with $k = 2$.
First condition. We need to choose parameters so that, as $n\to\infty$, the original sequence remains dense and the extended sequence is in the sparse regime. From our analysis with Lemma 2, this requires:
$$\frac{2^{-(\alpha-1)}}{\alpha-1} \le \phi - b < \frac{1}{\alpha-1}. \tag{40}$$
From the $\alpha$-entmax definition, we have:
$$(p^{(1)}_b)^{\alpha-1} = (\alpha-1)b - \tau, \tag{41}$$
$$(p^{(1)}_n)^{\alpha-1} = (\alpha-1)\phi - \tau, \tag{42}$$
where $p^{(1)}_b$ is the probability of each $b$ token and $p^{(1)}_n$ is the probability of the $\phi$ token. Subtracting the first equation from the second:
$$(p^{(1)}_n)^{\alpha-1} - (p^{(1)}_b)^{\alpha-1} = (\alpha-1)(\phi-b). \tag{43}$$
From the normalization constraint $(n-1)p^{(1)}_b + p^{(1)}_n = 1$:
$$p^{(1)}_b = \frac{1-p^{(1)}_n}{n-1}. \tag{44}$$
Substituting:
$$(p^{(1)}_n)^{\alpha-1} - \left(\frac{1-p^{(1)}_n}{n-1}\right)^{\alpha-1} = (\alpha-1)(\phi-b). \tag{45}$$
We know from the dense regime that $p^{(1)}_n\to p^*$, with $0 < p^* < 1$. Thus, as $n\to\infty$, the probability $p^{(1)}_n$ satisfies:
$$p^{(1)}_n\to p^*, \quad\text{where } p^* \text{ solves}\quad (p^*)^{\alpha-1} - \lim_{n\to\infty}\left(\frac{1-p^*}{n-1}\right)^{\alpha-1} = (\alpha-1)(\phi-b). \tag{46}$$
Since $\left(\frac{1-p^*}{n-1}\right)^{\alpha-1}\to 0$ as $n\to\infty$, we get:
$$(p^*)^{\alpha-1} = (\alpha-1)(\phi-b). \tag{47}$$
Therefore:
$$p^* = \big[(\alpha-1)(\phi-b)\big]^{\frac{1}{\alpha-1}} < 1. \tag{48}$$
The inequality holds because $\phi-b < \frac{1}{\alpha-1}$, so $(\alpha-1)(\phi-b) < 1$. The key insight is that, as $n\to\infty$, the small probabilities on the $b$ tokens become negligible, but their sum $(n-1)p^{(1)}_b = 1 - p^*$ remains finite, so each individual $p^{(1)}_b\to 0$.
Second condition. Choose the input sequence such that $\bar{v} = \frac{1}{n-1}\sum_{i=1}^{n-1} v_i \ne v_a$, using the construction:
• $v_a = \sqrt{\phi}\,e_1$, where $e_1$ is the first standard basis vector;
• $v_i = \frac{b}{\sqrt{\phi}}\,e_1 + e_2$ for $i = 1, \ldots, n-1$.
Note that this construction satisfies the logit constraints:
$$v_a^\top v_i = \sqrt{\phi}\cdot\frac{b}{\sqrt{\phi}} + 0\cdot 1 = b, \tag{49}$$
$$v_a^\top v_a = (\sqrt{\phi})^2 = \phi. \tag{50}$$
Figure 6: $L_1$ norm of the representation difference between the original and the extended sequence after 6 transformer layers, as a function of sequence length $n$ (from $2^7$ to $2^{14}$), with a constant prefix (left) and a random prefix (right). With softmax ($\alpha = 1.0$), the representation difference rapidly approaches zero, demonstrating inevitable collapse. $\alpha$-entmax ($\alpha > 1.0$) maintains bounded differences even at extreme sequence lengths.
The average representation is:
$$\bar{v} = \frac{1}{n-1}\sum_{i=1}^{n-1} v_i = \frac{b}{\sqrt{\phi}}\,e_1 + e_2. \tag{51}$$
Since $v_a = \sqrt{\phi}\,e_1$, we have:
$$\|\bar{v} - v_a\|_1 = \left\|\frac{b}{\sqrt{\phi}}\,e_1 + e_2 - \sqrt{\phi}\,e_1\right\|_1 = \left|\frac{b-\phi}{\sqrt{\phi}}\right| + 1 > 0. \tag{52}$$
The bound is strictly positive because $b\ne\phi$ by the logit-difference requirement, and because of the constant $+1$ term from the $e_2$ component. Therefore, $\lim_{n\to\infty}\|\bar{v} - v_a\|_1 \ne 0$. For the case $d = 1$, we can use the simpler construction $v_i = \frac{b}{\sqrt{\phi}}$ and $v_a = \sqrt{\phi}$, where the non-collapse condition amounts to verifying that $\frac{b}{\sqrt{\phi}}\ne\sqrt{\phi}$, which follows from $b\ne\phi$. In contrast, as shown by Barbero et al. (2024), the resulting representations become increasingly similar as $n\to\infty$ with softmax ($\alpha = 1.0$), regardless of the input content, leading to representational collapse.
Empirical Verification of Representational Preservation. To empirically validate our theoretical analysis, we conducted the following experiment: we implemented the counterexample construction using identity projection matrices for queries/keys/values and tested two scenarios with $d = 1$:
1. Constant prefixes: $b = 1$ and $\phi = 1.2$.
2. Random prefixes: $b\sim\mathcal{U}(0, 1)$ and $\phi = 1.2$.
Using a 6-layer transformer with residual connections, we experiment with increasing sequence lengths $n\in\{128, 256, \ldots, 16384\}$ and $\alpha\in\{1.0, 1.5, 1.75, 2.0\}$, and compute $\|v^{(L)}_n - v^{*(L)}_{n+1}\|_1$. Figure 6 confirms the behavior established in the previous counterexample: while softmax attention inevitably leads to representational collapse in long contexts, $\alpha$-entmax can maintain distinct representations even as sequence length grows.
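The sketch below gives a condensed re-implementation of this setup (ours, for illustration only; it uses a bisection-based $\alpha$-entmax, identity projections, $d = 1$, and the constant-prefix scenario), tracking the gap $\|v^{(L)}_n - v^{*(L)}_{n+1}\|_1$ as $n$ grows.

```python
import numpy as np

def alpha_entmax(z, alpha, iters=50):
    """Softmax for alpha = 1; otherwise alpha-entmax via bisection on the threshold."""
    z = np.asarray(z, dtype=float)
    if alpha == 1.0:
        e = np.exp(z - z.max())
        return e / e.sum()
    s = (alpha - 1.0) * z
    lo, hi = s.max() - 1.0, s.max()
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        p = np.clip(s - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        lo, hi = (tau, hi) if p.sum() >= 1.0 else (lo, tau)
    p = np.clip(s - 0.5 * (lo + hi), 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()

def layer(v, alpha):
    """One causal attention layer with identity Q/K/V projections (d = 1) plus a residual."""
    out = np.empty_like(v)
    for i in range(len(v)):
        z = v[i] * v[: i + 1]                                   # causal logits for query i
        out[i] = alpha_entmax(z, alpha) @ v[: i + 1] + v[i]     # attention + residual
    return out

def final_state(v0, alpha, layers=6):
    v = v0.copy()
    for _ in range(layers):
        v = layer(v, alpha)
    return v[-1]

b, phi = 1.0, 1.2                                               # constant-prefix scenario
for n in [64, 256, 1024]:
    v0 = np.append(np.full(n - 1, b / np.sqrt(phi)), np.sqrt(phi))
    v0_ext = np.append(v0, np.sqrt(phi))                        # repeat the last token
    gaps = {a: abs(final_state(v0, a) - final_state(v0_ext, a)) for a in [1.0, 1.5, 2.0]}
    # softmax gap shrinks with n; alpha > 1 keeps it bounded (cf. Figure 6)
    print(n, {a: round(g, 4) for a, g in gaps.items()})
```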
C.3 Proof of Proposition 5

The following proposition demonstrates how $\alpha$-entmax helps alleviate the problem of over-squashing, the exponential dilution of gradient signals through deep networks. For clarity, we follow the same set of assumptions as Barbero et al. (2024)—independence of attention coefficients from the values and approximation of layer normalization by a constant factor.

Proposition 5 (Over-squashing Alleviation with α-entmax). Consider an $L$-layer transformer-like model where the attention distribution for each head is computed by $\alpha$-entmax with $\alpha > 1$. For a token $n$ in the final layer, let $v^{(L)}_n\in\mathbb{R}^{d}$ be its hidden representation and
$$y_n = \mathrm{norm}\big(v^{(L)}_n\big) \tag{53}$$
be its final normalized output. The sensitivity of $y_n$ to the initial embedding $v^{(0)}_i$ of token $i$ experiences less over-squashing with $\alpha$-entmax than with softmax attention. Specifically, if the support size of the $\alpha$-entmax attention distributions is $|\mathcal{S}^{(\ell)}_j| = s \ll n$ for tokens $j$ across all layers $\ell$, then the number of gradient paths from token $i$ to token $n$ is reduced from $O(n^L)$ to $O(s^L)$, consequently helping to alleviate over-squashing by providing stronger gradient signals.

Figure 7: Left: Layer-wise gradient norms in an 8-layer transformer with sequence length $n = 256$ (curves: softmax and entmax with $\alpha\in\{1.5, 1.75, 2.0\}$). $\alpha$-entmax with $\alpha > 1.0$ maintains substantially stronger gradient signals, especially in earlier layers, compared to softmax, with gradient norms up to 6x higher, demonstrating how $\alpha$-entmax alleviates over-squashing by enabling more effective gradient flow through the network. Right: Average number of non-zero gradient paths per layer in an 8-layer transformer, showing that $\alpha$-entmax creates far fewer paths, which helps to alleviate over-squashing compared to softmax, which always keeps all possible paths active (within machine precision).
Proof. We begin by expanding $\frac{\partial y_n}{\partial v^{(0)}_i}$ through the chain rule. Since $y_n = \mathrm{norm}(v^{(L)}_n)$, and $v^{(L)}_n$ is the output of $L$ transformer layers, we have:
$$\frac{\partial y_n}{\partial v^{(0)}_i} = \frac{1}{\beta_3}\,\frac{\partial v^{(L)}_n}{\partial v^{(0)}_i}, \tag{54}$$
where $\frac{1}{\beta_3}$ accounts for the normalization assumption. Expanding the gradient through all $L$ layers:
$$\frac{\partial v^{(L)}_n}{\partial v^{(0)}_i} = \sum_{k_1, k_2, \ldots, k_{L-1}} \frac{\partial v^{(L)}_n}{\partial v^{(L-1)}_{k_{L-1}}}\,\frac{\partial v^{(L-1)}_{k_{L-1}}}{\partial v^{(L-2)}_{k_{L-2}}}\cdots\frac{\partial v^{(1)}_{k_1}}{\partial v^{(0)}_i}. \tag{55}$$
Due to causal masking, the only non-zero terms occur when $i \le k_1 \le k_2 \le \cdots \le k_{L-1} \le n$. For each pair of adjacent layers, we have:
$$\frac{\partial v^{(\ell+1)}_j}{\partial v^{(\ell)}_k} = \left(\frac{\sigma\psi}{\beta^{(\ell)}_2} + 1\right)\frac{\partial u^{(\ell)}_j}{\partial v^{(\ell)}_k}, \tag{56}$$
where $u^{(\ell)}_j = \frac{\sum_{k\le j} p^{(\ell)}_{j,k}\,v^{(\ell)}_k}{\beta^{(\ell)}_1} + v^{(\ell)}_j$, and $p^{(\ell)}_{j,k}$ are the attention probabilities computed using $\alpha$-entmax. Thus, for $k \le j$, we have:
$$\frac{\partial u^{(\ell)}_j}{\partial v^{(\ell)}_k} = \frac{p^{(\ell)}_{j,k}}{\beta^{(\ell)}_1} + \delta_{j,k}\,I, \tag{57}$$
where $\delta_{j,k}$ is the Kronecker delta and reflects the contribution from the residual connection, which happens when $k = j$. For simplicity, let $\bar{p}^{(\ell)}_{j,k} = \frac{p^{(\ell)}_{j,k}}{\beta^{(\ell)}_1} + \delta_{j,k}$. Taking the norm and combining all layers, we obtain:
$$\left\|\frac{\partial y_n}{\partial v^{(0)}_i}\right\| \le C\sum_{k_1\ge i}\sum_{k_2\ge k_1}\cdots\sum_{k_{L-1}\ge k_{L-2}} \bar{p}^{(L-1)}_{n,k_{L-1}}\,\bar{p}^{(L-2)}_{k_{L-1},k_{L-2}}\cdots\bar{p}^{(0)}_{k_1,i}, \tag{58}$$
where $C = \frac{1}{\beta_3}\prod_{\ell=1}^{L}\left(\frac{\sigma\psi}{\beta^{(\ell)}_2}+1\right)$ is a constant independent of the sequence length.
The crucial distinction between softmax and $\alpha$-entmax lies in the attention probabilities $p^{(\ell)}_{j,k}$. For $\alpha$-entmax with $\alpha > 1$, many tokens receive exactly zero attention. Specifically, if we define the support set for the $j$-th token as $\mathcal{S}^{(\ell)}_j = \{k \mid p^{(\ell)}_{j,k} > 0 \text{ and } k \le j\}$, then $p^{(\ell)}_{j,k} = 0$ for all $k\notin\mathcal{S}^{(\ell)}_j$. Consequently, $\bar{p}^{(\ell)}_{j,k} = 0$ when $k\notin\mathcal{S}^{(\ell)}_j$ and $j\ne k$ (i.e., when there is no contribution from either attention or the residual connection). This means we can rewrite our bound as:
$$\left\|\frac{\partial y_n}{\partial v^{(0)}_i}\right\| \le C\sum_{k_1\in\mathcal{T}_1}\sum_{k_2\in\mathcal{T}_2(k_1)}\cdots\sum_{k_{L-1}\in\mathcal{T}_{L-1}(k_{L-2})} \bar{p}^{(L-1)}_{n,k_{L-1}}\,\bar{p}^{(L-2)}_{k_{L-1},k_{L-2}}\cdots\bar{p}^{(0)}_{k_1,i}, \tag{59}$$
where we precisely characterize the gradient-flow paths via the sets $\mathcal{T}_1$ and $\mathcal{T}_\ell(k_{\ell-1})$, which identify tokens that receive non-zero gradient contributions:
$$\mathcal{T}_1 = \{k\in\{i, i+1, \ldots, n\} : k\in\mathcal{S}^{(0)}_i \text{ or } k = i\}, \tag{60}$$
$$\mathcal{T}_\ell(k_{\ell-1}) = \{k\in\{k_{\ell-1}, k_{\ell-1}+1, \ldots, n\} : k\in\mathcal{S}^{(\ell-1)}_{k_{\ell-1}} \text{ or } k = k_{\ell-1}\} \quad\text{for }\ell > 1. \tag{61}$$
These sets have the following meaning:
• $\mathcal{T}_1$ represents the tokens in layer $\ell = 1$ that can receive non-zero gradients from token $i$ in layer $\ell = 0$, either through attention ($k\in\mathcal{S}^{(0)}_i$) or via the residual connection ($k = i$).
• $\mathcal{T}_\ell(k_{\ell-1})$ represents the tokens in layer $\ell$ that can receive non-zero gradients from token $k_{\ell-1}$ in layer $\ell-1$, either through attention ($k\in\mathcal{S}^{(\ell-1)}_{k_{\ell-1}}$) or via the residual connection ($k = k_{\ell-1}$).
The causal constraint ($k \ge i$ for $\mathcal{T}_1$ and $k \ge k_{\ell-1}$ for $\mathcal{T}_\ell$) is explicitly incorporated in these definitions to account for the causal attention mask. Importantly, the cardinality of these sets directly corresponds to the number of potential gradient paths through the network. While softmax attention would yield $|\mathcal{T}_\ell(k_{\ell-1})| = n - k_{\ell-1} + 1$ paths from each token, $\alpha$-entmax's sparsity ensures $|\mathcal{T}_\ell(k_{\ell-1})| = |\mathcal{S}^{(\ell-1)}_{k_{\ell-1}}| + 1$.
Hence, if the support size is $|\mathcal{S}^{(\ell)}_j| = s \ll n$, and assuming that the $k = j$ tokens are always included in the path sets due to the residual connections, this reduces the number of terms in the sum from $O(n^L)$ to $O(s^L)$, drastically reducing the total number of gradient paths. Furthermore, as a direct consequence of Lemma 1, since $\alpha$-entmax may concentrate probability mass on fewer tokens, the non-zero $p^{(\ell)}_{j,k}$ values can be larger than with softmax. In such cases, the gradients along the remaining paths will be stronger, helping to further alleviate over-squashing by concentrating gradient flow on important tokens.
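The path-counting argument can be illustrated directly. The sketch below (our own toy computation, with randomly chosen supports rather than learned ones) counts gradient paths reaching the final token after $L$ layers for a sparse support of size $s$ versus a dense, softmax-like support.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_paths(n, L, s, i=0):
    """Count gradient paths from token i at layer 0 to the last token after L layers,
    when every token attends to a random causal support of size <= s plus itself."""
    cnt = np.zeros(n)
    cnt[i] = 1.0
    for _ in range(L):
        new = np.zeros(n)
        for j in range(n):
            sup = set(rng.choice(j + 1, size=min(s, j + 1), replace=False).tolist())
            sup.add(j)                         # residual connection always contributes
            new[j] = sum(cnt[k] for k in sup)
        cnt = new
    return cnt[-1]

n, L = 256, 8
print("sparse support (s = 4):", f"{count_paths(n, L, 4):.3e}")   # stays of order s^L
print("dense support  (s = n):", f"{count_paths(n, L, n):.3e}")   # explodes with n (up to O(n^L))
```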
Empirical Verification of Over-squashing Alleviation. To empirically validate our theoretical prediction that $\alpha$-entmax reduces the number of gradient paths from $O(n^L)$ to $O(s^L)$, we conducted a controlled experiment using a delayed copying task. In this task, the model is presented with a sequence consisting of a prefix of random tokens, followed by a separator token, after which it must reproduce the prefix tokens—creating a natural long-range dependency. We trained 8-layer transformers with different attention mechanisms on sequences of length 256, where the model must copy information from the beginning of the sequence to predict tokens after the separator.

Figure 7 (left) shows the gradient norms of each layer with respect to the first-layer input. Softmax exhibits consistently low gradient norms across all layers, indicating severe gradient dilution. In contrast, $\alpha$-entmax variants maintain substantially stronger gradient signals, particularly in the earlier layers of the network. This confirms that gradient information propagates more effectively through the network with $\alpha$-entmax, preserving signal strength even when flowing through multiple layers.

The right part of Figure 7 shows the number of active gradient paths. With softmax, nearly all possible connections remain active within numerical precision, creating $O(n^2)$ paths per layer. This compounds across layers, resulting in $O(n^L)$ total paths. $\alpha$-entmax dramatically reduces the number of active paths, with stronger sparsity (higher $\alpha$ values) creating even sharper reductions. Notably, in the first layer fewer than 5 tokens are kept active on average. This empirically confirms our theoretical claim that $\alpha$-entmax prunes the computational graph to $O(s^L)$ paths.
D Non-Dispersion of α-entmax

A critical problem in long-context modeling is the dispersion of attention, where relevant signals get diluted across increasingly long sequences. For clarity, we begin by examining how $\alpha$-entmax behaves with two-level logits, then define dispersion more rigorously and show how $\alpha$-entmax naturally counteracts it.
Lemma 2 (Threshold Behavior for Two-level Logits). Consider logits $z\in\mathbb{R}^{n}$ where $k$ tokens have value $M$ and $(n-k)$ tokens have value $m$, with $M > m$.
1. For $\alpha > 1$, when $\Delta := M - m \ge \frac{k^{-(\alpha-1)}}{\alpha-1}$, only the $k$ tokens with value $M$ receive non-zero attention. The threshold converges to $\tau(z) = (\alpha-1)M - k^{-(\alpha-1)}$, and each high-value token receives attention $\frac{1}{k}$ while others receive zero attention. As a consequence, $\alpha$-entmax maintains a constant attention weight of $\Theta(\frac{1}{k})$ on high-value tokens regardless of the total sequence length $n$.
2. In contrast, softmax (with fixed temperature $\theta > 0$) necessarily disperses, with attention weights of $\Theta(\frac{1}{n})$ as $n$ increases. For softmax to maintain a concentration of at least $c\in(0,1)$ on the $k$ high-value tokens, the required logit difference must grow logarithmically with $n$:
$$\frac{\Delta}{\theta} \ge \ln\left(\frac{n-k}{k}\cdot\frac{c}{1-c}\right). \tag{62}$$
Proof. We prove the two parts below.

Part (i): Let $S$ be the support set. A token $i$ is in $S$ if and only if $z_i > \frac{\tau(z)}{\alpha-1}$, where $\tau(z)$ satisfies:
$$\sum_{i=1}^{n}\big[(\alpha-1)z_i - \tau(z)\big]_+^{\frac{1}{\alpha-1}} = 1. \tag{63}$$
For our two-level distribution, this becomes:
$$\sum_{i: z_i = M}\big[(\alpha-1)M - \tau(z)\big]_+^{\frac{1}{\alpha-1}} + \sum_{i: z_i = m}\big[(\alpha-1)m - \tau(z)\big]_+^{\frac{1}{\alpha-1}} = 1. \tag{64}$$
For only the tokens with value $M$ to receive non-zero attention, we need:
$$(\alpha-1)M - \tau(z) > 0 \quad\text{and}\quad (\alpha-1)m - \tau(z) \le 0. \tag{65}$$
Rearranging: $(\alpha-1)m \le \tau(z) < (\alpha-1)M$. In this regime:
$$k\cdot\big[(\alpha-1)M - \tau(z)\big]^{\frac{1}{\alpha-1}} = 1. \tag{66}$$
Solving for $\tau(z)$:
$$\big[(\alpha-1)M - \tau(z)\big]^{\frac{1}{\alpha-1}} = \frac{1}{k} \tag{67}$$
$$(\alpha-1)M - \tau(z) = k^{-(\alpha-1)} \tag{68}$$
$$\tau(z) = (\alpha-1)M - k^{-(\alpha-1)}. \tag{69}$$
For this threshold to satisfy $\tau(z) \ge (\alpha-1)m$, we need:
$$(\alpha-1)M - k^{-(\alpha-1)} \ge (\alpha-1)m \tag{70}$$
$$M - m \ge \frac{k^{-(\alpha-1)}}{\alpha-1} \tag{71}$$
$$\Delta \ge \frac{k^{-(\alpha-1)}}{\alpha-1}. \tag{72}$$
Thus, when $\Delta \ge \frac{k^{-(\alpha-1)}}{\alpha-1}$, only the $k$ tokens with value $M$ receive non-zero attention, each receiving an attention of $\frac{1}{k}$. Therefore, in this regime the attention weights for $\alpha$-entmax are:
$$\alpha\text{-entmax}(z)_i = \begin{cases} \frac{1}{k} & \text{if } z_i = M \\ 0 & \text{if } z_i = m. \end{cases} \tag{73}$$
These weights remain $\Theta(\frac{1}{k})$ for high-value tokens regardless of $n$, demonstrating that $\alpha$-entmax can maintain constant attention on important tokens even as sequence length grows, as long as $k$ is fixed as $n\to\infty$.
Part (ii): For softmax with temperature $\theta > 0$, the attention weight for tokens with logit $M$ is:
$$\mathrm{softmax}(z/\theta)_i = \frac{\exp(M/\theta)}{k\exp(M/\theta) + (n-k)\exp(m/\theta)}. \tag{74}$$
For softmax to maintain a concentration of at least $c$ on the $k$ high-value tokens combined:
$$\frac{k\exp(M/\theta)}{k\exp(M/\theta) + (n-k)\exp(m/\theta)} \ge c. \tag{75}$$
Through algebraic manipulation:
$$k\exp(M/\theta) \ge c\,\big[k\exp(M/\theta) + (n-k)\exp(m/\theta)\big] \tag{76}$$
$$(1-c)\,k\exp(M/\theta) \ge c\,(n-k)\exp(m/\theta) \tag{77}$$
$$\frac{k\exp(M/\theta)}{(n-k)\exp(m/\theta)} \ge \frac{c}{1-c} \tag{78}$$
$$\frac{k}{n-k}\exp(\Delta/\theta) \ge \frac{c}{1-c} \tag{79}$$
$$\exp(\Delta/\theta) \ge \frac{n-k}{k}\cdot\frac{c}{1-c}. \tag{80}$$
Taking the natural logarithm:
$$\frac{\Delta}{\theta} \ge \ln\left(\frac{n-k}{k}\cdot\frac{c}{1-c}\right). \tag{81}$$
This shows that, as $n$ grows, the $\Delta$ required to maintain concentration with softmax grows logarithmically with $n$. In contrast, for $\alpha$-entmax, assuming a $k$ that is fixed as $n$ grows, the condition $\Delta \ge \frac{k^{-(\alpha-1)}}{\alpha-1}$ is independent of $n$, enabling constant focus regardless of sequence length.
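Both regimes of Lemma 2 are straightforward to verify numerically; the sketch below (our own check, using a bisection-based $\alpha$-entmax) constructs two-level logits with $k$ high-value tokens and compares the resulting $\alpha$-entmax and softmax weights as $n$ grows.

```python
import numpy as np

def alpha_entmax(z, alpha=1.5, iters=60):
    """alpha-entmax via bisection on the threshold (alpha > 1)."""
    s = (alpha - 1.0) * np.asarray(z, dtype=float)
    lo, hi = s.max() - 1.0, s.max()
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        p = np.clip(s - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        lo, hi = (tau, hi) if p.sum() >= 1.0 else (lo, tau)
    p = np.clip(s - 0.5 * (lo + hi), 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

alpha, k, m = 1.5, 4, 0.0
delta_min = k ** (-(alpha - 1.0)) / (alpha - 1.0)   # Lemma 2 condition on M - m
M = m + delta_min + 0.01
for n in [64, 1024, 16384]:
    z = np.full(n, m)
    z[:k] = M
    p_ent, p_soft = alpha_entmax(z, alpha), softmax(z)
    # alpha-entmax: exactly 1/k = 0.25 on each high token, zeros elsewhere;
    # softmax: the weight on the high tokens shrinks roughly like 1/n
    print(n, round(p_ent[0], 4), round(p_soft[0], 6))
```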
We now prove the non-dispersion proposition stated in the main text, which concerns the concept of dispersion presented in the corresponding definition.
Proof. We address each claim in turn. For bounded sequences $(z_n)_{n\in\mathbb{N}}$, we assume $m, M\in\mathbb{R}$ with $m\le M$ such that $m\le z_i\le M$ for every $i\in\mathbb{N}$.

Part (i) — α-entmax can retain probability, while softmax always leaks: For $\alpha > 1$, consider logits $z\in\mathbb{R}^{n}$ and an extended sequence $z^*\in\mathbb{R}^{N}$ with $N > n$, where all additional elements have values below the threshold, $z^*_i \le \tau(z)/(\alpha-1)$ for $i > n$. By the non-vanishing attention property of $\alpha$-entmax (Lemma 1), these additional elements receive exactly zero probability, resulting in:
$$\alpha\text{-entmax}(z)_i = \alpha\text{-entmax}(z^*)_i \quad \forall i\le n. \tag{82}$$
This demonstrates that $\alpha$-entmax can produce identical distributions despite arbitrarily different sequence lengths, maintaining the same concentration regardless of the number of additional tokens. In contrast, for softmax ($\alpha = 1$), Barbero et al. (2024) proved that adding any element to the sequence strictly decreases the probability assigned to existing elements, making such invariance impossible.
Part (ii) — Complete dispersion of softmax: For softmax with constant temperature $\theta > 0$, the attention weights for bounded logits can be bounded as:
$$\frac{\exp(m/\theta)}{\sum_{j=1}^{n}\exp(z_j/\theta)} \le \mathrm{softmax}(z_{1:n}/\theta)_i \le \frac{\exp(M/\theta)}{\sum_{j=1}^{n}\exp(z_j/\theta)}. \tag{83}$$
Since $\sum_{j=1}^{n}\exp(z_j/\theta)\ge n\cdot\exp(m/\theta)$ and $\sum_{j=1}^{n}\exp(z_j/\theta)\le n\cdot\exp(M/\theta)$, we have:
$$\frac{\exp(m/\theta)}{n\cdot\exp(M/\theta)} \le \mathrm{softmax}(z_{1:n}/\theta)_i \le \frac{\exp(M/\theta)}{n\cdot\exp(m/\theta)}. \tag{84}$$
This simplifies to:
$$\frac{1}{n}\exp\left(-\frac{\Delta}{\theta}\right) \le \mathrm{softmax}(z_{1:n}/\theta)_i \le \frac{1}{n}\exp\left(\frac{\Delta}{\theta}\right), \tag{85}$$
where $\Delta = M - m$ is bounded. These bounds show that, as $n\to\infty$, all softmax weights are $\Theta(1/n)$. For the entropy:
$$H(\mathrm{softmax}(z_{1:n}/\theta)) = -\sum_{i=1}^{n}\mathrm{softmax}(z_{1:n}/\theta)_i\log\mathrm{softmax}(z_{1:n}/\theta)_i \to \log n. \tag{86}$$
Thus, $\lim_{n\to\infty}\frac{H(\mathrm{softmax}(z/\theta))}{\log n} = 1$, showing complete dispersion.
Part (iii) — Strong concentration resilience of α-entmax: First, we focus on the two-level case from Lemma 2, where $k$ tokens have logit value $M$ and $(n-k)$ tokens have value $m$. When $\Delta = M - m \ge \frac{k^{-(\alpha-1)}}{\alpha-1}$, only the $k$ tokens with value $M$ receive non-zero attention:
$$\alpha\text{-entmax}(z_{1:n})_i = \begin{cases} \frac{1}{k} & \text{if } z_i = M \\ 0 & \text{if } z_i = m. \end{cases} \tag{87}$$
The Shannon entropy of this distribution is:
$$H(\alpha\text{-entmax}(z_{1:n})) = -\sum_{i=1}^{k}\frac{1}{k}\log\frac{1}{k} = \log k. \tag{88}$$
The normalized entropy is:
$$\frac{H(\alpha\text{-entmax}(z_{1:n}))}{\log n} = \frac{\log k}{\log n}. \tag{89}$$
For fixed $k$ as $n\to\infty$, this ratio approaches 0, confirming concentration resilience.
For cases where the support grows sublinearly, $k := |S| = O(n^{\beta})$ for some $\beta < 1$, the Shannon entropy is bounded by:
$$H(\alpha\text{-entmax}(z_{1:n})) \le \log k = O(\log n^{\beta}) = O(\beta\log n). \tag{90}$$
The normalized entropy is therefore:
$$\lim_{n\to\infty}\frac{H(\alpha\text{-entmax}(z_{1:n}))}{\log n} \le \beta < 1. \tag{91}$$
This confirms that the normalized entropy remains strictly bounded away from 1, even with growing support, as long as the growth is sublinear.
This proposition shows that the entropy of attention distributions reveals how concentrated or dispersed they are across tokens. While softmax distributions with bounded logits must approach the maximum entropy $O(\log n)$ as sequence length increases (indicating complete dispersion), $\alpha$-entmax distributions can maintain bounded entropy $O(\log k)$, where $k$ is the support size. This allows models with $\alpha$-entmax to maintain focused, low-entropy attention patterns even when processing extremely long sequences.

Moreover, this proposition demonstrates that $\alpha$-entmax attention distributions have a remarkable property: they do not necessarily disperse as sequence length increases, and can maintain identical attention patterns regardless of context length. This non-dispersion property means transformers with $\alpha$-entmax can scale to very long contexts without the attention becoming diluted, maintaining their ability to focus on relevant information regardless of how much additional context is present.
E Interaction with Positional Encoding
Here, we study—theoretically and empirically—how α-entmax interacts with different positional encoding methods. We follow the same model definition and notation as in §B.
E.1 No Positional Encoding (NoPE)
We adopt the same set of assumptions as in prior work (2025). Namely:
A1: There exists $C\in\mathbb{R}$ such that $\max_{t\in\mathbb{N}}\big\{\|W^{(t)}_Q\|_2, \|W^{(t)}_K\|_2\big\}\le C$.
A2: The sequence $\Big\{\big\|\prod_{t=0}^{k} W^{(t)}_V\big\|_2\Big\}_{k=0}^{\infty}$ is bounded.
The first assumption states that the key and query weight matrices are bounded. The second assumption ensures that the trajectories of the node representations across $t$ layers stay within a fixed interval $[-C_2, C_2]$.
Proposition 6 (No Positional Encoding with α-entmax). Let $\mathcal{G}$ be the causal mask graph and $P^{(\ell)}\in\mathbb{R}^{n\times n}$ represent the causal attention matrix at layer $\ell$, computed row-wise using $\alpha$-entmax (subscript $\alpha$) with $\alpha\in(1, 2]$ or softmax (subscript soft). In particular, let $p^{(\ell)}_{ij}$ denote the $(i, j)$ probability entry of $P^{(\ell)}$. Further, let $\tilde{P}^{(\ell)} = P^{(\ell)}\cdots P^{(0)}$ represent the product of attention matrices through layer $\ell$, which we call the cumulative attention matrix; it captures how information from tokens in the input layer flows to tokens in layer $\ell$ through the composition of attention operations.

For softmax, prior work (2025) has shown that $\lim_{\ell\to\infty}\tilde{p}^{(\ell)}_{\mathrm{soft},i1} = 1$ for all $1 < i\le n$.

Under assumptions A1–A2, for any indices $1 < j\le i\le n$, with $\alpha$-entmax we have:
1. Edge deletion: For every layer $\ell$, an edge $(j, i)\in\mathcal{G}$ is present in the dynamic graph $\mathcal{G}^{(\ell)}_\alpha$ if and only if $(\alpha-1)z^{(\ell)}_{ij} > \tau(z^{(\ell)}_i)$, where $z^{(\ell)}_{ij} = \langle q^{(\ell)}_i, k^{(\ell)}_j\rangle$ and $\tau(z^{(\ell)}_i)$ is the entmax threshold. Otherwise, $p^{(\ell)}_{\alpha,ij} = 0$ and the edge is removed for that layer.
2. Modified attention patterns: Unlike softmax, $\alpha$-entmax with $\alpha > 1$ creates sparse attention patterns by completely removing some connections. For tokens that survive the threshold, the behavior of their attention weights depends on:
• how far the token's logit sits above the threshold;
• the relative differences between logits.
For tokens that remain connected through the dynamic attention graph $\mathcal{G}^{(\ell)}_\alpha$ at all layers, the cumulative attention still exhibits a decay pattern:
$$\tilde{p}^{(\ell)}_{\alpha,ij} \le C(1-\delta_{ij})^{\ell}, \tag{92}$$
where $\delta_{ij}$ depends on the connectivity pattern of the dynamic attention graph. This decay rate differs from softmax due to edge pruning and the redistribution of probability mass.
3. Disrupted limit behavior: Unlike softmax, $\alpha$-entmax does not necessarily converge to the first token:
(a) If for every layer $\ell$ there exists at least one directed path from token 1 to token $i$ in the dynamic graph $\mathcal{G}^{(\ell)}_\alpha$, then:
$$\lim_{\ell\to\infty}\tilde{p}^{(\ell)}_{\alpha,i1} = 1. \tag{93}$$
(b) If, at some layer $\ell_0$, the directed paths from token 1 to token $i$ are deleted in $\mathcal{G}^{(\ell_0)}_\alpha$, then:
$$0 \le \lim_{\ell\to\infty}\tilde{p}^{(\ell)}_{\alpha,i1} < 1, \tag{94}$$
with the exact limit determined by the structure of the strongly connected components formed in the dynamic graph.
Proof. (i) Edge deletion: For $\alpha$-entmax, a coefficient is non-zero if and only if its pre-activation exceeds the layer-specific threshold. The stated condition follows directly from the definition of $\alpha$-entmax:
$$p^{(\ell)}_{\alpha,ij} = \big[(\alpha-1)z^{(\ell)}_{ij} - \tau(z^{(\ell)}_i)\big]_+^{\frac{1}{\alpha-1}}, \tag{95}$$
where $[x]_+ = \max(0, x)$.

(ii) Modified attention patterns: For $\alpha$-entmax, the attention weights are determined by thresholding:
$$p^{\alpha}_i = \big[(\alpha-1)z_i - \tau\big]_+^{\frac{1}{\alpha-1}}. \tag{96}$$
Let $\tau' = \frac{\tau}{\alpha-1}$ for simplicity. Consider two tokens with logits $z_i\ge z_j$, both in the support. The ratio of their probabilities is:
$$\frac{p^{\alpha}_j}{p^{\alpha}_i} = \left(\frac{z_j-\tau'}{z_i-\tau'}\right)^{\frac{1}{\alpha-1}}. \tag{97}$$
For softmax, the ratio is:
$$\frac{p^{\mathrm{soft}}_j}{p^{\mathrm{soft}}_i} = e^{-(z_i-z_j)}. \tag{98}$$
The comparison between these ratios depends on how far $z_j$ sits above $\tau'$. Let $\Delta = z_i-z_j$ and $b = z_j-\tau'$. Through algebraic manipulation, we can show:
$$\frac{p^{\alpha}_j}{p^{\alpha}_i} \gtrless \frac{p^{\mathrm{soft}}_j}{p^{\mathrm{soft}}_i} \;\Longleftrightarrow\; b \gtrless \frac{\Delta}{e^{(\alpha-1)\Delta}-1}. \tag{99}$$
This means the relative behavior of attention weights in $\alpha$-entmax compared to softmax depends on the specific configuration of logits and thresholds, rather than following a simple universal relationship. Since tokens that remain connected must distribute probability mass among fewer options (due to pruning), there exists some $0 < \delta_{ij} < 1$ such that:
$$\tilde{p}^{(\ell)}_{\alpha,ij} \le C(1-\delta_{ij})^{\ell}. \tag{100}$$
The specific value of $\delta_{ij}$ depends on the connectivity pattern of the dynamic attention graph and may differ significantly from the softmax case due to edge pruning and probability redistribution.
(iii) Disrupted limit behavior.

(a) Case where paths to the first token persist: Suppose that for every layer $\ell$ there exists at least one directed path from token 1 to token $i$ in the dynamic graph $\mathcal{G}^{(\ell)}_\alpha$, token 1 being the unique "center node" as defined in prior work (2025). For any token $j > 1$, the geometric decay established in part (ii) applies:
$$\tilde{p}^{(\ell)}_{\alpha,ij} \le C(1-\delta_{ij})^{\ell} \to 0 \quad\text{as }\ell\to\infty. \tag{101}$$
Since the row sums of $\tilde{P}^{(\ell)}$ must equal 1 (it is a product of row-stochastic matrices), and all entries $\tilde{p}^{(\ell)}_{\alpha,ij}$ with $j > 1$ approach 0, we have:
$$\lim_{\ell\to\infty}\tilde{p}^{(\ell)}_{\alpha,i1} = 1 - \lim_{\ell\to\infty}\sum_{j=2}^{i}\tilde{p}^{(\ell)}_{\alpha,ij} = 1. \tag{102}$$

(b) Case where paths to token 1 are cut: The key difference from softmax arises when edge deletion creates a configuration where token 1 cannot reach token $i$. Let $\ell_0$ be the first layer where the paths from token 1 to token $i$ are removed in $\mathcal{G}^{(\ell_0)}_\alpha$. Let $\mathcal{C}^{(\ell_0)}_i\subset\{1, 2, \ldots, n\}$ be the set of tokens in the same strongly connected component as token $i$ in $\mathcal{G}^{(\ell_0)}_\alpha$. By our assumption, $1\notin\mathcal{C}^{(\ell_0)}_i$. At layer $\ell_0$, the attention probability is distributed only among tokens in $\mathcal{C}^{(\ell_0)}_i$:
$$\sum_{j\in\mathcal{C}^{(\ell_0)}_i} p^{(\ell_0)}_{\alpha,ij} = 1 \quad\text{and}\quad p^{(\ell_0)}_{\alpha,i1} = 0. \tag{103}$$
For all layers $\ell > \ell_0$, the multiplication by zero ensures $\tilde{p}^{(\ell)}_{\alpha,i1} = 0$, and therefore:
$$0 \le \lim_{\ell\to\infty}\tilde{p}^{(\ell)}_{\alpha,i1} < 1. \tag{104}$$
The exact limit depends on the structure of the strongly connected components formed in the dynamic graph through subsequent layers. In fact, the limit is exactly zero whenever all paths from token 1 to token $i$ are removed from $\mathcal{G}_\alpha$ from layer $\ell_0$ onwards.
This proposition demonstrates a fundamental difference between softmax and α-entmax transformers: while softmax inevitably leads to concentration of attention on the first token, α-entmax can potentially disrupt this position bias through its ability to dynamically prune edges in the attention graph. This provides a theoretical foundation for using sparse attention mechanisms to mitigate position bias in transformer architectures.
E.2 ALiBi
We start by recalling the definition of ALiBi from Press et al.

Definition 3 (ALiBi Positional Encoding). Let $H$ be the number of attention heads. A general form of the ALiBi bias for head $h\in\{1, 2, \ldots, H\}$ can be defined as:
$$b^{(h)}_{ij} = \begin{cases} m_h\,(j-i) & \text{if } j\le i \\ 0 & \text{otherwise,} \end{cases} \tag{105}$$
where $m_h\in\mathbb{R}_+$ is the slope parameter for head $h$, defined as $m_h = 2^{-\frac{8h}{H}}$.
Now, we consider how ALiBi interacts with $\alpha$-entmax.

Proposition 7 (ALiBi with α-entmax). Consider $\alpha$-entmax attention with the ALiBi bias from the definition above. Assume the raw attention logits satisfy $z^{(h)}_{ij}\in[z^{(h)}_{\min}, z^{(h)}_{\max}]$ for all $i, j$. Let
$$d^{(h)}_{\max} = \left\lfloor \frac{z^{(h)}_{\max} - z^{(h)}_{\min} + \frac{1}{\alpha-1}}{m_h} + 1 \right\rfloor. \tag{106}$$
Then, any token $j$ with $(i-j) > d^{(h)}_{\max}$ receives zero attention from token $i$ at head $h$.

Proof. For a token $j$ to receive non-zero attention from token $i$ with $\alpha$-entmax, we require:
$$(\alpha-1)\big(z^{(h)}_{ij} + b^{(h)}_{ij}\big) > \tau, \tag{107}$$
where $\tau$ is the threshold ensuring normalization. Since slopes are positive ($m_h > 0$), we have:
$$(\alpha-1)\big(z^{(h)}_{ij} - (i-j)m_h\big) > \tau. \tag{108}$$
In the extreme case where only the closest token receives attention (single-support case), we can solve for $\tau$ exactly:
$$1 = \big[(\alpha-1)(z^{(h)}_{\max}-\tau)\big]^{\frac{1}{\alpha-1}} \;\Rightarrow\; (\alpha-1)(z^{(h)}_{\max}-\tau) = 1 \;\Rightarrow\; \tau = z^{(h)}_{\max} - \frac{1}{\alpha-1}. \tag{109}$$
Since the logits drop by at least $m_h$ per position due to the ALiBi bias, we have $z^{(h)}_{\max} \ge z^{(h)}_{\min} + m_h$, which gives us:
$$\tau \ge z^{(h)}_{\min} + m_h - \frac{1}{\alpha-1}. \tag{110}$$
For a token $j$ at distance $(i-j)$, attention is zero if, even with the maximum logit $z^{(h)}_{\max}$:
$$(\alpha-1)\big(z^{(h)}_{\max} - (i-j)m_h\big) \le (\alpha-1)\big(z^{(h)}_{\min} - m_h\big) - 1. \tag{111}$$
Solving for $(i-j)$:
$$(i-j) \ge \frac{z^{(h)}_{\max} - z^{(h)}_{\min} + m_h + \frac{1}{\alpha-1}}{m_h}. \tag{112}$$
Hence,
$$d^{(h)}_{\max} = \left\lfloor \frac{z^{(h)}_{\max} - z^{(h)}_{\min} + \frac{1}{\alpha-1}}{m_h} + 1 \right\rfloor. \tag{113}$$
This proposition establishes that, with $\alpha$-entmax and the ALiBi positional bias, there exists a head-dependent hard cutoff distance $d^{(h)}_{\max}$ beyond which tokens receive exactly zero attention. This creates an adaptive but bounded attention window that depends on both content relevance ($z^{(h)}_{\max} - z^{(h)}_{\min}$) and the sparsity parameter $\alpha$, naturally limiting the effective context without requiring explicit truncation. This property allows the model to focus computational resources on a relevant window of tokens, which can be particularly valuable for efficiently processing long documents.
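The cutoff of Proposition 7 can be checked numerically. In the sketch below (our own illustration; logit ranges and slopes are arbitrary choices), we compute $d^{(h)}_{\max}$ from Equation (106) and verify that $\alpha$-entmax over ALiBi-biased logits never attends beyond that distance.

```python
import numpy as np

def alpha_entmax(z, alpha=1.5, iters=60):
    """alpha-entmax via bisection on the threshold (alpha > 1)."""
    s = (alpha - 1.0) * np.asarray(z, dtype=float)
    lo, hi = s.max() - 1.0, s.max()
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        p = np.clip(s - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        lo, hi = (tau, hi) if p.sum() >= 1.0 else (lo, tau)
    p = np.clip(s - 0.5 * (lo + hi), 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()

rng = np.random.default_rng(0)
alpha, i = 1.5, 511                               # query at the last position of a 512-token row
z_min, z_max = -1.0, 1.0
for m_h in [1.0, 0.5, 0.25, 0.125]:               # ALiBi slopes for different heads
    d_max = int((z_max - z_min + 1.0 / (alpha - 1.0)) / m_h + 1)   # Eq. (106)
    z = rng.uniform(z_min, z_max, size=i + 1)     # raw content logits
    bias = -m_h * (i - np.arange(i + 1))          # ALiBi bias for row i
    p = alpha_entmax(z + bias, alpha)
    farthest = i - np.min(np.nonzero(p > 0)[0])   # largest attended distance
    print(m_h, d_max, farthest, farthest <= d_max)   # the cutoff is always respected
```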
E.3 RoPE
To analyze the interaction between RoPE (Su et al.) and $\alpha$-entmax, we first establish our notation. Following prior work (2025), we consider queries $q_i\in\mathbb{R}^{d}$ and keys $k_j\in\mathbb{R}^{d}$, where $i$ and $j$ are token positions in the sequence. RoPE decomposes these vectors into $d/2$ two-dimensional chunks, denoted $q^{(k)}_i\in\mathbb{R}^{2}$ and $k^{(k)}_j\in\mathbb{R}^{2}$ for $k\in\{1, \ldots, d/2\}$. Each chunk rotates at a different frequency $g_k = \theta^{-2(k-1)/d}$, where $\theta$ (typically 10,000) is the base wavelength parameter.

RoPE applies position-dependent rotations through matrices $\rho(g_k)^{i}$ to transform the original queries and keys. In a standard transformer with RoPE, the resulting raw attention logit between query position $i$ and key position $j$ is:
$$z_{ij} = \sum_{k=1}^{d/2}\big\langle q^{(k)}_i,\, \rho(g_k)^{j-i}\,k^{(k)}_j\big\rangle, \tag{114}$$
where $\rho(g_k)$ is the 2D rotation matrix with frequency $g_k$.
Query-Key Interaction with RoPE. Let $\phi_{ijk}$ be the angle between the original (unrotated) vectors $q^{(k)}_i$ and $k^{(k)}_j$. That is:
$$\cos(\phi_{ijk}) = \frac{\big\langle q^{(k)}_i, k^{(k)}_j\big\rangle}{\|q^{(k)}_i\|_2\,\|k^{(k)}_j\|_2}. \tag{115}$$
For a single 2D chunk, since rotation preserves vector magnitudes, the contribution to the raw score is:
$$\big\langle q^{(k)}_i,\, \rho(g_k)^{j-i}\,k^{(k)}_j\big\rangle = \|q^{(k)}_i\|_2\,\|\rho(g_k)^{j-i}\,k^{(k)}_j\|_2\,\cos(\phi_{ijk} + g_k(j-i)) \tag{116}$$
$$= \|q^{(k)}_i\|_2\,\|k^{(k)}_j\|_2\,\cos(\phi_{ijk} + g_k(j-i)). \tag{117}$$
As shown in Proposition 3.1 of prior work (2025), RoPE allows for maximal attention at any arbitrary distance. However, RoPE combined with $\alpha$-entmax creates a hard boundary on attention distance due to the thresholding effect, which we analyze next.
Approximation for Small Angles. First, note that we can use the angle-sum expansion for the cosine:
$$\cos(\phi_{ijk} + g_k(j-i)) = \cos(\phi_{ijk})\cos(g_k(j-i)) - \sin(\phi_{ijk})\sin(g_k(j-i)). \tag{118}$$
Further, note that for small angles $\phi_{ijk}$ we can use a second-order Taylor expansion for $g_k(j-i)$:⁴
$$\cos(g_k(j-i)) \approx 1 - \frac{g_k^2\,(j-i)^2}{2}. \tag{119}$$
Finally, applying the dot product:
$$z_{ij} \approx \sum_{k=1}^{d/2}\|q^{(k)}_i\|_2\|k^{(k)}_j\|_2\cos(\phi_{ijk}) - \sum_{k=1}^{d/2}\|q^{(k)}_i\|_2\|k^{(k)}_j\|_2\cos(\phi_{ijk})\,\frac{g_k^2(j-i)^2}{2} + \text{sine terms} \tag{120}$$
$$\approx z_{\max} - \sum_{k=1}^{d/2} c_k\,g_k^2\,(i-j)^2. \tag{121}$$
Here, we simplified the last step by focusing on the quadratic decay from the cosine term while omitting the sine terms $-\sin(\phi_{ijk})\,g_k(j-i)$. For semantically aligned tokens, where $\phi_{ijk}\approx 0$, the sine term's contribution is minimal since $\lim_{x\to 0}\sin(x) = 0$.

⁴$\cos(x)\approx 1 - x^2/2 + \text{higher-order terms}$.
Proposition 8 (Maximum Attention Distance for RoPE (Small-Angle Regime)). Let $g_{\max} = \max_k g_k$ be the maximum frequency in RoPE. Within the small-angle domain where
$$|i-j| \le \frac{\pi}{2 g_{\max}},$$
so that all rotational angles $\theta = g_k(i-j)$ satisfy $|\theta|\le\frac{\pi}{2}$, and assuming $z_{\max} > \frac{\tau(z_i)}{\alpha-1}$, there exists a critical distance $d_{\max}$ beyond which tokens receive exactly zero attention:
$$d_{\max} = \left\lfloor \sqrt{\frac{z_{\max} - \frac{\tau(z_i)}{\alpha-1}}{\sum_{k=1}^{d/2} c_k\,g_k^2}} \right\rfloor. \tag{122}$$

Proof. For a token to receive non-zero attention under $\alpha$-entmax, we must have $z_{ij} > \frac{\tau(z_i)}{\alpha-1}$. Substituting the decay pattern from Equation (121), which is valid within the small-angle domain where the cosine can be approximated by the Taylor expansion $\cos\theta\approx 1 - \frac{\theta^2}{2}$:
$$z_{\max} - \sum_{k=1}^{d/2} c_k\,g_k^2\,(i-j)^2 > \frac{\tau(z_i)}{\alpha-1}. \tag{123}$$
Rearranging for $(i-j)^2$:
$$(i-j)^2 < \frac{z_{\max} - \frac{\tau(z_i)}{\alpha-1}}{\sum_{k=1}^{d/2} c_k\,g_k^2}. \tag{124}$$
Taking the floor of the square root gives us $d_{\max}$.
Note that this analysis applies to the first attention window. Due to the periodicity of the rotation operations, at distances beyond $\frac{\pi}{g_k}$ for any frequency component $k$, the attention pattern may exhibit additional windows of non-zero attention, which we address next.
Proposition 9 (Frequency-Specific Cutoff for RoPE). For each frequency component $k$, let $\beta_k = \frac{\tau(z_i)}{(\alpha-1)\,\|q^{(k)}_i\|_2\|k^{(k)}_j\|_2}$. Since at least one token must receive non-zero attention for $\alpha$-entmax to yield a valid probability distribution, $\beta_k\le 1$ must hold for at least one component. Assuming $\beta_k\in[-1, 1]$ (covering all possible cosine values), there exists a sequence of distances $\{d_{k,n}\}_{n=0}^{\infty}$ at which its contribution to attention crosses the threshold.
The first such distance is:
$$d_{k,0} = \left\lfloor \frac{1}{g_k}\arccos\!\left(\frac{\tau(z_i)}{(\alpha-1)\,\|q^{(k)}_i\|_2\|k^{(k)}_j\|_2}\right) \right\rfloor. \tag{125}$$
Due to the periodicity of the cosine, subsequent threshold crossings occur at approximately:
$$d_{k,n} \approx \frac{2\pi n \pm d_{k,0}}{g_k}, \quad n\in\mathbb{N}. \tag{126}$$
Furthermore, $d_{k,0}$ is non-increasing in $\alpha$ and inversely proportional to $g_k$.
Proof. For a single frequency component $k$, the contribution to the raw score from Equation (114) is:
$$\big\langle q^{(k)}_i,\, \rho(g_k)^{j-i}\,k^{(k)}_j\big\rangle = \|q^{(k)}_i\|_2\,\|k^{(k)}_j\|_2\,\cos(g_k(j-i) + \phi_{ijk}). \tag{127}$$
For this to exceed the attention threshold under optimal alignment ($\phi_{ijk} = 0$, which maximizes the contribution):
$$(\alpha-1)\,\|q^{(k)}_i\|_2\|k^{(k)}_j\|_2\,\cos(g_k(j-i)) > \tau(z_i) \tag{128}$$
$$\cos(g_k(j-i)) > \frac{\tau(z_i)}{(\alpha-1)\,\|q^{(k)}_i\|_2\|k^{(k)}_j\|_2}. \tag{129}$$
Taking the arccos of both sides and dividing by $g_k$ gives the first threshold-crossing distance $d_{k,0}$:
$$|i-j| > \frac{1}{g_k}\arccos\!\left(\frac{\tau(z_i)}{(\alpha-1)\,\|q^{(k)}_i\|_2\|k^{(k)}_j\|_2}\right). \tag{130}$$
Due to the $2\pi$-periodicity of the cosine, subsequent threshold crossings occur at distances $\frac{2\pi n \pm d_{k,0}}{g_k}$ for integers $n > 0$. Moreover, $\tau(z_i)/(\alpha-1)$ grows as $\alpha$ increases, making the arccos term smaller and consequently decreasing $d_{k,0}$. The inverse proportionality to $g_k$ is evident directly from the formula.
These theoretical analyses of RoPE with α-entmax reveal two interesting takeaways. First, different frequency components in RoPE naturally create attention windows of different widths: high-frequency components (large $g_k$) produce very narrow windows focused on local context, while low-frequency components (small $g_k$) enable attention over longer distances. Second, the sparsity pattern induced by the combination of RoPE and α-entmax is not uniform but varies across frequency components, creating a more complex attention structure than simple distance-based decay methods like ALiBi.
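To give a feeling for the resulting multi-scale windows, the sketch below (our own illustration; the chunk norms and the threshold value $\tau(z_i)$ are assumed constants) tabulates the first cutoff $d_{k,0}$ of Equation (125) for standard RoPE frequencies.

```python
import numpy as np

d_model, theta = 64, 10_000.0
alpha, tau = 1.5, 0.2                     # assumed entmax threshold tau(z_i)
q_norm = k_norm = 1.0                     # assumed unit-norm 2D chunks

ks = np.arange(1, d_model // 2 + 1)
g = theta ** (-2 * (ks - 1) / d_model)            # RoPE frequencies g_k
beta = tau / ((alpha - 1.0) * q_norm * k_norm)    # cosine threshold from Prop. 9
d_k0 = np.floor(np.arccos(np.clip(beta, -1.0, 1.0)) / g).astype(int)   # Eq. (125)

for k in [1, 8, 16, 24, 32]:
    print(f"chunk {k:2d}: g_k = {g[k - 1]:.2e}, first cutoff d_k0 = {d_k0[k - 1]}")
# high-frequency chunks (small k here) stop contributing after a handful of positions,
# while low-frequency chunks keep contributing across thousands of positions
```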
Figure 8: Per-head attention centroids across different positional encoding methods (NoPE, ALiBi, RoPE), plotted against query position for eight attention heads. Each panel represents one attention head's behavior; the dashed line shows the identity function (attending to self). NoPE heads consistently exhibit an early-token bias, ALiBi heads maintain proximity-based attention, and RoPE heads display more varied patterns with irregular fluctuations.
E.4 Comparison between Positional Encoding Methods with α-entmax

The positional encoding scheme used in a transformer has significant implications for how attention behaves over long contexts. Our theoretical analysis in the previous subsections reveals that $\alpha$-entmax interacts with positional encodings in ways that fundamentally alter attention behavior compared to softmax. We summarize our theoretical findings next, along with an empirical analysis in a controlled experimental setting.

Theoretical Analysis. With NoPE (Kazemnejad et al.), softmax transformers (without MLP layers) develop an implicit bias towards the first tokens as depth increases, as shown in prior work (2025). $\alpha$-entmax disrupts this behavior through its ability to create disconnected attention graphs: by assigning exactly zero attention to some connections, it may remove the implicit bias that encourages attention to concentrate on early tokens.

When combined with ALiBi (Press et al.), $\alpha$-entmax transforms the smooth linear decay into a hard attention window. For tokens separated by a distance $d > d^{(h)}_{\max}$, where $d^{(h)}_{\max} = \left\lfloor \frac{1}{m_h}\big(z_{\max} - z_{\min} + \frac{1}{\alpha-1}\big) + 1 \right\rfloor$, attention weights become exactly zero. This creates an adaptive but bounded attention window that depends on the input ($z_{\max} - z_{\min}$) and the sparsity parameter $\alpha$.

With RoPE (Su et al.), $\alpha$-entmax induces frequency-dependent sparsity. Each frequency component $k$ has a critical distance $d_k$ beyond which its contribution falls below the attention threshold. This creates a multi-scale attention pattern in which nearby tokens interact through all frequency components, while distant tokens interact only through low-frequency components.

These interactions between positional encodings and $\alpha$-entmax have important practical implications, and further motivate the introduction of our hybrid approach, NAPE (§4.2). By creating natural, content-adaptive attention windows, NAPE combines the benefits of sparse, focused attention with awareness of token positions, allowing models to effectively balance local and global information processing. This provides a principled alternative to manually designed sparse attention patterns such as sliding windows or dilated attention, with the advantage of adapting to content relevance rather than using fixed patterns.
Empirical Analysis Setup. We simulated a sequence of length $n = 128$ with attention heads using $\alpha$-entmax ($\alpha = 1.5$). For each query position $i$, we generated a vector of raw attention scores (logits) for all positions $j\le i$ according to:
$$z_{ij} = z^{\mathrm{base}}_{ij} + z^{\mathrm{prox}}_{ij} + z^{\mathrm{noise}}_{ij}, \tag{131}$$
where $z^{\mathrm{base}}_{ij}\sim\mathcal{U}(Z_{\min}, Z_{\max})$ represents the base content-based affinity, $z^{\mathrm{prox}}_{ij} = 0.2\,(Z_{\max}-Z_{\min})\big(1 - 0.5\,\frac{|j-i|}{i}\big)$ introduces a proximity bias, and $z^{\mathrm{noise}}_{ij}\sim\mathcal{N}(0, \sigma^2)$ adds Gaussian noise.
Figure 9: Heatmaps of β and γ per head and layer for Scalable α-entmax.
The parameters were set to $Z_{\min} = -5.0$, $Z_{\max} = 5.0$, and $\sigma = 0.5$. This formulation models a realistic mixture of content-based attention (random component), a mild inherent bias toward nearby tokens (proximity component), and natural variation (noise component). We then applied the different positional encoding methods to modify these base logits. Finally, we calculated the attention distribution using $\alpha$-entmax and determined the attention centroid for each query position $i$ as $\mathrm{centroid}_i = \sum_{j=1}^{i} j\cdot p_{ij}$, where $p_{ij} = \alpha\text{-entmax}(z_i)_j$.
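The sketch below condenses this simulation (ours; seeds, per-head settings, and plotting omitted): it generates the logits of Equation (131), optionally adds an ALiBi-style bias, and computes the per-query attention centroids under $\alpha$-entmax.

```python
import numpy as np

def alpha_entmax(z, alpha=1.5, iters=60):
    """alpha-entmax via bisection on the threshold (alpha > 1)."""
    s = (alpha - 1.0) * np.asarray(z, dtype=float)
    lo, hi = s.max() - 1.0, s.max()
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        p = np.clip(s - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        lo, hi = (tau, hi) if p.sum() >= 1.0 else (lo, tau)
    p = np.clip(s - 0.5 * (lo + hi), 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()

rng = np.random.default_rng(0)
n, alpha, z_min, z_max, sigma = 128, 1.5, -5.0, 5.0, 0.5

def centroids(alibi_slope=None):
    cs = []
    for i in range(1, n):
        j = np.arange(i + 1)
        base = rng.uniform(z_min, z_max, size=i + 1)                     # content affinity
        prox = 0.2 * (z_max - z_min) * (1.0 - 0.5 * np.abs(j - i) / i)   # proximity bias
        noise = rng.normal(0.0, sigma, size=i + 1)
        z = base + prox + noise                                          # Eq. (131)
        if alibi_slope is not None:
            z = z - alibi_slope * (i - j)                                # ALiBi-style bias
        p = alpha_entmax(z, alpha)
        cs.append((j * p).sum())                                         # attention centroid
    return np.array(cs)

print(centroids()[-5:])                    # NoPE-like: centroids scattered across positions
print(centroids(alibi_slope=1.0)[-5:])     # ALiBi: centroids stay near the query position
```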
Empirical Results. The head-specific analysis in Figure 8 reveals distinct patterns across positional encoding methods when combined with $\alpha$-entmax. While NoPE seems to exhibit a weak bias towards earlier positions, it also shows modest variability, indicating more dispersed attention. ALiBi clearly creates a consistent recency bias, with centroids following slightly below the identity line and maintaining low variability, which indicates focused attention. RoPE demonstrates centroid patterns similar to NoPE but with lower entropy (and higher variability across positions), suggesting focused attention at more distant positions. These observations may explain why NAPE—the hybrid NoPE+ALiBi—works well in practice: ALiBi heads provide consistent positional structure focused on recent context, while NoPE heads contribute complementary early-token and semantic focus, creating a more balanced attention mechanism than either approach alone. In fact, as we show in §H.1, models equipped with NoPE are flexible enough to acquire relative positional encoding, thus also supporting the original hypothesis of Kazemnejad et al. (2023). Therefore, NAPE can encourage short-span focus with its ALiBi heads while learning longer-span focus, guided by semantic information, with its NoPE heads.
F Adaptive-Scalable α-entmax (ASEntmax)

F.1 Learning Scalers for Language Modeling

To verify our scaling approach, we follow the setup from (Nakanishi) and train a language model with learnable scales $p_i$ for each position $i$; however, we do so independently for each head in the model. Specifically, we train a 12-layer transformer with approximately 120 million parameters. We set the hidden size to 768, the number of attention heads to 12, the MLP intermediate size to 2048, the learning rate to $6\times 10^{-4}$, the weight decay to 0.001, the batch size to 1M tokens, and the sequence length to 1024. Each head thus contains 1024 learnable parameters (one per position). Finally, we train on the FineWeb dataset for a total of 5 billion tokens.

We find that the $\delta_h + \beta_h(\log n)^{\gamma_h}$ scaling fits the learned scales well across different attention heads. In contrast, removing $\gamma_h$—as done by Nakanishi (2025)—leads to a severely degraded fit for several heads. Specifically, the linear fit has an overall $R^2 = 0.12$, the log fit has $R^2 = -14$ (severe underfitting), and our log fit with a $\gamma$ exponent has $R^2 = 0.17$. The full set of $\beta_h$ and $\gamma_h$ learned by our approach is shown in Figure 9.
Effect of Negative $\gamma_h$. Experiments on the Copy task, shown in Table 7, suggest that, without scaling, $\alpha$-entmax can hurt performance, leading to a noticeable drop in accuracy in the OOD scenario. Introducing an adaptive temperature, however, substantially mitigates this effect. We hypothesize that Copy requires less sparse attention patterns, which can be accomplished by applying a negative power to the logarithm function. We confirm this hypothesis in Figure 13, which shows that ASEntmax learns negative values of $\gamma_h$ in all heads, resulting in more spread-out attention distributions.
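The fitting procedure behind the $R^2$ values reported above can be sketched as follows (our own illustration on synthetic per-position scales, not the scales learned by the language model): we fit $\delta + \beta(\log n)^{\gamma}$ and the $\gamma$-free variant with least squares and compare their fits.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaled_log(n, delta, beta, gamma):
    return delta + beta * np.log(n) ** gamma

def log_only(n, delta, beta):             # gamma fixed to 1, as in Scalable-Softmax
    return delta + beta * np.log(n)

# synthetic per-position scales standing in for one head's learned values
rng = np.random.default_rng(0)
pos = np.arange(2, 1025, dtype=float)
scales = 0.3 + 1.2 * np.log(pos) ** 0.6 + rng.normal(0.0, 0.02, size=pos.shape)

p3, _ = curve_fit(scaled_log, pos, scales, p0=[0.0, 1.0, 1.0])
p2, _ = curve_fit(log_only, pos, scales, p0=[0.0, 1.0])

def r2(y, yhat):
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

print("delta, beta, gamma:", np.round(p3, 3), "R^2 =", round(r2(scales, scaled_log(pos, *p3)), 3))
print("delta, beta       :", np.round(p2, 3), "R^2 =", round(r2(scales, log_only(pos, *p2)), 3))
```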
G Experimental Details
G.1 Synthetic Data
Following the data diversity assumptions of prior work (2024), we generate a large number of samples—between 10 million and 50 million, depending on task complexity. See Table 2 for the training/test details of each task, along with model hyperparameters. The 2Back and Local Count tasks are token-classification tasks, while the remaining tasks are generative. In Figure 10, we show examples of the tasks we introduce in this work.
2Back. In this classification task, the model must predict the class of the token that appeared two positions earlier. Via this task, we examine the ability of models equipped with NoPE to learn a relative positional bias and assess their behaviour in out-of-distribution scenarios (see §H.1).
Local Count. The Local Count task is a classification task in which the model must predict the number of times a word has occurred so far. We restrict the vocabulary size to 16, allowing several clusters of the same word to appear within a sequence. This increases the task's difficulty, as the model must distinguish between different clusters of identical words. We sample the number of repetitions for each cluster uniformly from $\mathcal{U}(1, 48)$ to test whether models equipped with NoPE can learn a longer focus span than observed in the 2Back task.
Flip-Flop (Liu et al.). Flip-Flop uses sequences made up of alternating instructions "write", "ignore", and "read" (w, i, r). Each instruction is paired with a bit (0 or 1). The model must memorize the bit if it follows a w, and recall the last stored bit when it encounters an r. For example, for the string "i 0 w 0 i 1 i 1 w 1 i 0 i 1 r", the correct output is 1. We conducted experiments on two datasets with different "write" instruction probabilities: 10% (sparse) and 80% (dense).
Max Retrieval. We follow Veličković et al. (2024) in constructing the Max Retrieval dataset and the model architecture.
Multi-Query Multi-Token Associative Recall (MQMTAR). MQMTAR is a retrieval task in which models must produce a sequence of multi-token values corresponding to the queries provided at the end of the input sequence (see Figure 10). MQMTAR employs three special tokens: (1) 0 for empty space; (2) 1 as the key–value delimiter; and (3) 3 as the query delimiter, which is also used to separate values in the target sequence. We set the lengths of both keys and values to 2 tokens, resulting in 5 tokens per key–value pair in the input. The number of queries is 4, and the density of key–value pairs is 80% of the total number of tokens. Finally, the alphabet size is 256, from which we construct 100K key–value pairs.
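For concreteness, a possible MQMTAR sample generator matching this description is sketched below (our own illustration; the exact generator, padding scheme, and token-id conventions may differ).

```python
import random

EMPTY, KV_SEP, Q_SEP = 0, 1, 3            # special tokens as described above
ALPHABET = list(range(4, 260))            # 256 content tokens (the id range is our assumption)

def make_sample(seq_len=64, n_queries=4, key_len=2, val_len=2, density=0.8, rng=random):
    pairs, seq = {}, []
    budget = int(seq_len * density)
    while len(seq) + key_len + 1 + val_len <= budget:
        key = tuple(rng.choices(ALPHABET, k=key_len))
        val = tuple(rng.choices(ALPHABET, k=val_len))
        pairs[key] = val
        seq += list(key) + [KV_SEP] + list(val)          # "k1 k2 1 v1 v2"
    seq += [EMPTY] * max(0, seq_len - len(seq) - n_queries * (key_len + 1))
    queries = rng.sample(list(pairs), k=min(n_queries, len(pairs)))
    target = []
    for q in queries:
        seq += [Q_SEP] + list(q)                          # "3 q1 q2" appended to the input
        target += [Q_SEP] + list(pairs[q])                # "3 v1 v2" in the target
    return seq, target

inp, tgt = make_sample()
print(inp)
print(tgt)
```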
Sort, Copy and Reverse. These are well-known tasks for testing models' length generalization (Kazemnejad et al.). We use a small vocabulary size of 32 to generate more sequences with repeated tokens, since models must increasingly handle such repetitions as sequence length grows.
G.2 Models
All synthetic tasks are trained with a decoder-like transformer. We evaluate models in extreme settings by using as few layers as possible, since our aim is to test the attention mechanism coupled with our positional-encoding strategy rather than the scaling capabilities of the transformer. However, for the Reverse task—which proved particularly challenging for softmax-based models—we increment the layer count until the softmax baseline generalizes to at least 1.5× the in-distribution length.
Multi-query Multi-token Associative Recall
k1 v1 k2 v2 k3 v3 q1 q2
Input : 0 4 6 1 7 8 0 0 0 5 5 1 3 9 0 0 3 6 1 6 8 0 0 2 3 6 2 4 6
Target: 2 6 8 2 7 8
2Back
Input : 0 0 2 5 2 6 9 7 1 2 3 3 8 2 2
Target: 0 0 0 0 2 5 2 6 9 7 1 2 3 3 8
Local Count
Input : 2 2 2 4 4 6 6 8 8 8 8 2 2
Target: 0 1 2 0 1 0 1 0 1 2 3 0 1
Figure 10: Examples of the introduced tasks. MQMTAR: Each digit is a token; the alphabet size
is 10, and the number of queries is 2.2Back: A special token 0 is added at the beginning of the
sequence to ensure the model has something to predict at the first two positions; the vocabulary size
is 10.Local Count: The maximum number of repetitions is 4, and the vocabulary size is 8.
For experiments with RoPE, we use the Hugging Face implementation from LLaMA 3 (Grattafiori et al.), which includes RoPE scaling. Since our sequences are relatively short, the base frequency is set to its default value of 10,000. To improve length extrapolation in RoPE-based models, we apply a scaling factor of 16, which we found to be optimal for Flip-Flop under 4× extrapolation (see Table 8); factors of 8 or 32 degrade performance. For NAPE, each ALiBi head uses a slope of $m = \frac{1}{h}$, where $h$ is the head index. We employ 8 attention heads for all tasks except Copy and MQMTAR, where 16 heads yield a performance boost across all models.
In the case of ASEntmax, $\beta^i_h$ and $\gamma^i_h$ are computed per token $i$ via linear projections followed by softplus and tanh activations, respectively, allowing the model to scale attention adaptively based on content.
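One plausible way to wire these per-token scalers, under our reading of the setup (module and projection names are ours, and the exact placement of the multiplier may differ from the implementation), is sketched below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveScaler(nn.Module):
    """Per-token (beta_i, gamma_i) scalers for ASEntmax. Illustrative sketch only."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.beta_proj = nn.Linear(hidden_size, 1)    # hypothetical projection names
        self.gamma_proj = nn.Linear(hidden_size, 1)

    def forward(self, h: torch.Tensor, seq_len: int) -> torch.Tensor:
        # h: (batch, seq, hidden) query-side hidden states
        beta = F.softplus(self.beta_proj(h))          # beta_i >= 0
        gamma = torch.tanh(self.gamma_proj(h))        # gamma_i in (-1, 1), may go negative
        log_n = torch.log(torch.tensor(float(seq_len), device=h.device))
        return beta * log_n ** gamma                  # per-token multiplier, shape (batch, seq, 1)

# assumed usage: scale the attention logits of each query by this multiplier,
# then apply alpha-entmax row-wise in place of softmax.
```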
For our experiments with $\alpha$-entmax, we use $\alpha = 1.5$ as the default value, unless mentioned otherwise. Furthermore, we use the Gemma2 implementation from Hugging Face, but disable sliding-window attention in all layers. For experiments with $\alpha$-entmax, we replaced FlashAttention with AdaSplash (Gonçalves et al.).

For optimization, we use the AdamW optimizer with default betas and a cosine learning-rate scheduler with warm-up, set to 10K warm-up steps. Given the large training corpus, we do not employ dropout or weight decay. In addition, we use bfloat16 in all experiments. All models are relatively small (2–10M parameters) and fit on a single GPU.
We observe that even when models are 100% accurate in-distribution, they still require significantly more training; in some cases, the training loss reached values as low as $10^{-8}$. Therefore, the best checkpoint is selected based on performance at 8× the in-distribution sequence length. In some cases, such as the Sort task and models with RoPE, where generalization up to 8× was not possible, we use BLEU as an intermediate metric and evaluate at 2× or 4× the in-distribution sequence length. We perform evaluation with 1K samples per sequence length. 2Back and Local Count are evaluated using accuracy, as they are classification tasks, whereas for the remaining tasks we use exact match accuracy—assigning 1 only if the entire predicted sequence matches the reference and 0 otherwise. We report results for the single best-performing model selected from experiments conducted with multiple random seeds (3) and various learning rates.

Finally, for some tasks, we also report results for SEntmax—which learns $\beta$ and $\gamma$ directly, without linear projections or nonlinearities—and for ASSMax—which applies our adaptive-scaling strategy to softmax.
H Additional Results
In this section, we provide a more detailed analysis of each task.
Table 2: Task details and hyperparameters.
Task Samples Length Batch Vocab. Heads Layers Hid. dim. Int. dim.
2Back 10M 32-64 128 16 8 2 256 512
Local Count 10M 64-128 128 16 8 3 128 512
Flip-Flop 10M 32-64 128 4 8 4 256 512
Copy 20M 32-64 128 32 16 2 256 1024
Reverse 30M 32-64 128 32 8 6 256 512
MQMTAR 50M 32-64 128 256 16 4 512 1024
Sort 40M 32-64 128 32 8 2 256 1024
H.1 2Back
Table 3: Accuracy (%) on 2Back.
ID Out-of-Distribution
Model 64 128 256 512 1024 2048 4096
RoPE
Softmax 100.0 100.0 100.0 99.3 81.4 63.4 41.1
SSMax 100.0 100.0 100.0 99.8 98.5 90.4 69.0
Entmax 100.0 98.2 94.8 83.7 65.5 45.7 31.3
ASEntmax 100.0 100.0 100.0 95.0 61.2 36.2 22.1
NoPE
Softmax 99.9 83.2 51.4 30.1 18.5 12.7 9.7
SSMax 100.0 80.5 47.2 27.2 16.9 11.8 9.2
Entmax 100.0 77.3 50.5 33.2 20.7 13.7 10.2
ASEntmax 100.0 90.3 55.1 36.5 24.3 16.2 11.5
NAPE
Softmax 100.0 100.0 100.0 100.0 100.0 100.0 100.0
SSMax 100.0 100.0 100.0 100.0 100.0 100.0 100.0
Entmax 100.0 100.0 100.0 100.0 100.0 100.0 100.0
ASEntmax 100.0 100.0 100.0 100.0 100.0 100.0 100.0
Following the hypothesis of Kazemnejad et al. (2023) that NoPE can learn a relative positional bias, we conducted experiments on the simple 2Back task, in which the model must predict the class of the token two positions earlier. Table 3 shows that all models achieve perfect in-distribution performance. Moreover, the attention maps in Figure 11 show that NoPE indeed acquires relative positional encodings, thus supporting the hypothesis. However, the OOD attention maps (right part) reveal that, as sequence length increases, the recency bias diffuses unevenly across positions. Such behavior is detrimental in tasks requiring attention to a fixed-size local context (e.g., associative recall, previous instructions in code, n-grams in text). By contrast, ALiBi constrains attention to a local window irrespective of content. Moreover, we observe that ASEntmax partially mitigates diffusion in the attention maps (bottom right), which is also reflected in the accuracy gains shown in Table 3. In our positional encoding design, NoPE + ALiBi (NAPE), half of the attention heads employ a faster-decaying variant of ALiBi to enforce a short-span focus, while the remaining heads use NoPE, which can (1) learn a focus span that is longer and position-dependent, and (2) guide attention semantically.
H.2 Local Count
As observed in Table 4, models using ALiBi solve the task perfectly, which is unsurprising given that ALiBi induces a recency bias. Furthermore, the results for ALiBi and NAPE suggest that, in the case of NAPE, models can rely exclusively on the ALiBi heads. With NoPE, however, the model is challenged because identical tokens are not distinguishable at the very first layer. Therefore, the model must develop a mechanism to locate the current cluster.
Figure 11: Comparison of attention maps (Softmax, Entmax α = 1.5, ASEntmax α = 1.5) for the 2Back task. Left: In-distribution, sequence length 64. Right: Out-of-distribution, sequence length 512 (for visualization clarity, we applied max pooling with window size 4 and stride 4). The maps are shown for the second layer of all models. The diagonal patterns are less distorted with α-entmax; moreover, ASEntmax mitigates the dispersal of the diagonal pattern up to 3× the in-distribution sequence length and keeps it less distorted up to 8×.
Figure 12 shows that the NoPE model exhibits a relative positional bias. Combined with the bias observed in 2Back, this indicates that NoPE models can acquire various content-based recency biases that differ from those induced by ALiBi or RoPE. Finally, for Local Count, we observe no improvement in NoPE models when using attention scaling.
Table 4: Accuracy (%) on Local Count.
ID Out-of-distribution
Model 128 256 512 1024 2048 4096
RoPE
Softmax 100.0 99.4 91.6 55.2 31.1 17.3
SSMax 100.0 100.0 81.3 42.6 23.0 13.4
Entmax 100.0 99.9 89.3 47.1 24.9 14.1
ASEntmax 100.0 100.0 79.1 41.4 22.4 13.1
NoPE
Softmax 99.1 71.7 36.5 18.3 9.2 4.6
SSMax 99.1 71.4 36.8 18.6 9.3 4.7
Entmax 99.8 80.8 45.6 25.0 13.8 7.7
ASEntmax 99.6 78.1 42.6 22.5 11.9 6.4
ALiBi
Softmax 100.0 100.0 100.0 100.0 100.0 100.0
Entmax 100.0 100.0 100.0 100.0 100.0 100.0
NAPE
Softmax 100.0 100.0 100.0 100.0 100.0 100.0
SSMax 100.0 100.0 99.9 99.9 99.8 99.8
Entmax 100.0 100.0 100.0 100.0 100.0 100.0
ASEntmax 100.0 100.0 100.0 100.0 100.0 100.0
Figure 12: Attention maps of the Entmax model on Local Count. Left: Layer 1. Right: Layer 3. We observe a local pattern: attention weights fade as relative distance increases. Input sequence: (1..1 ×10, 2..2 ×4, 3..3 ×10, 2..2 ×4, 4..4 ×10, 2..2 ×4) ×4.
H.3 Max Retrieval
Solving this task requires an extremely concentrated attention distribution. Such a distribution can be achieved either by lowering the temperature $\theta$ for softmax or by increasing the entmax parameter $\alpha$. As Table 5 shows, increasing $\alpha$ yields substantial performance gains. However, if $\alpha$ becomes too large, the distribution may collapse to a one-hot vector, causing entmax to lose all gradient signal and hindering learning (e.g., $\alpha > 16$). Instead, this issue can be alleviated by scaling entmax based on the sequence length. With this approach, ASEntmax with $\alpha = 1.5$, learned $\beta$, and elevated $\gamma$ achieves substantially improved performance on the task.
Table 5: Accuracy (%) on Max Retrieval.
ID Out-of-Distribution
Model 16 32 64 128 256 512 1024 2048 4096
Softmax (Veličković et al., 2024) 98.6 97.1 94.3 89.7 81.3 70.1 53.8 35.7 22.6
Adapt. temp. (Veličković et al., 2024) 98.6 97.1 94.5 89.9 82.1 72.5 57.7 39.4 24.9
Softmax θ = √d 99.2 98.5 96.7 93.2 86.7 73.5 54.4 36.4 24.1
Softmax θ = 0.1 99.5 99.0 97.8 95.1 89.6 77.9 60.2 41.2 28.5
Softmax θ = 0.0004 99.2 98.4 97.0 94.2 89.4 81.8 71.4 58.4 43.4
SSMax 99.4 98.9 97.8 95.9 92.3 85.0 74.7 59.9 44.7
Entmax α = 1.5 99.4 98.8 97.4 94.7 89.9 80.1 65.1 50.0 36.8
Entmax α = 2 99.5 99.1 98.0 96.0 92.1 84.5 72.0 58.4 44.6
Entmax α = 4 99.5 98.9 97.7 95.9 92.1 84.8 75.2 61.4 46.9
Entmax α = 16 99.6 99.4 98.7 97.5 95.2 91.0 82.8 70.3 53.4
Entmax α = 32 99.4 98.7 97.5 95.5 91.5 83.8 72.6 57.5 41.7
Entmax α = 64 99.1 98.4 96.8 93.9 88.7 78.6 64.6 45.5 28.1
ASEntmax, α = 1.5, β learn, γ = 1 99.5 99.0 98.1 96.3 93.1 86.1 76.2 61.9 44.5
ASEntmax, α = 1.5, β learn, γ = 2 99.6 99.2 98.4 96.9 94.4 89.0 81.4 69.5 55.1
ASEntmax, α = 1.5, β learn, γ = 3 99.6 99.4 99.0 98.0 96.0 92.4 85.9 76.1 62.7
ASEntmax, α = 1.5, β learn, γ = 4 99.3 98.7 97.6 95.4 91.3 84.6 73.6 59.7 45.9
H.4 MQMTAR
We observe the same pattern across all tasks: despite theoretical extrapolation to 16× via RoPE scaling, RoPE models generalize poorly beyond 4×. Moreover, although models with ALiBi can extrapolate up to 64×, ALiBi's limited span inevitably leads to performance degradation on very long sequences. However, NAPE provides a substantial boost for all models. As in the Copy task, Entmax alone underperforms Softmax, but the adaptive scaling in ASEntmax makes the model superior, extending generalization to an impressive 1024×. We also conducted experiments with ASSMax to demonstrate that, despite the benefits of adaptive scaling and the improved performance compared with the unscaled model, softmax dispersion still causes a significant performance drop on very long sequences.
H.5 Copy
As shown by prior work (2023), transformers generalize to the Copy task, especially with appropriate positional encodings. Table 7 shows that models with NAPE generalize up to 64× the ID length. Notably, SSMax outperforms all other models, which might suggest that scaling is crucial for this task.
As we can also see, without scaling, α-entmax can hurt performance, leading to a noticeable drop in accuracy in the OOD scenario. Introducing an adaptive temperature, however, substantially mitigates this effect: ASEntmax matches Softmax performance, outperforming Entmax from 4× the sequence length onward. We hypothesize that Copy requires less sparse attention patterns, which can be accomplished by applying a negative power to the logarithm function. We confirm this hypothesis in Figure 13, which shows that ASEntmax learns negative values of $\gamma$ in all heads, resulting in more spread-out attention distributions.
Table 6: Exact match accuracy (%) on Multi-query Multi-token Associative Recall.
ID Out-of-Distribution
Model 64 128 256 512 1024 2048 4096 8192 16K 32K 65K
RoPE
Softmax 100.0 3.1 0 0 0 0 0 0 0 0 0
SSMax 99.8 6.2 0 0 0 0 0 0 0 0 0
Entmax 99.8 49.4 4.5 0 0 0 0 0 0 0 0
ASEntmax 100.0 66.9 0.8 0 0 0 0 0 0 0 0
NoPE
Softmax 100.0 30.6 0 0 0 0 0 0 0 0 0
Entmax 100.0 26.1 0 0 0 0 0 0 0 0 0
SSMax 100.0 39.1 0 0 0 0 0 0 0 0 0
ASEntmax 100.0 58.3 0 0 0 0 0 0 0 0 0
ALiBi
Softmax 100.0 100.0 99.5 99.0 98.0 93.9 77.8 38.8 6.8 0 0
Entmax 99.5 97.5 90.6 75.4 44.7 11.7 1.1 0 0 0 0
NAPE
Softmax 100.0 100.0 100.0 99.4 99.5 98.2 97.8 90.2 80.2 34.2 3.0
SSMax 99.9 100.0 99.9 99.7 99.4 97.2 94.1 82.4 53.4 13.2 1.0
ASSMax 100.0 100.0 100.0 99.9 99.8 98.2 98.4 95.4 87.2 69.2 6.0
Entmax 100.0 100.0 100.0 99.9 99.2 97.8 92.7 86.2 66.8 35.8 9.3
SEntmax 100.0 100.0 100.0 99.9 99.8 98.8 91.5 12.0 0 0 0
ASEntmax 100.0 100.0 99.9 99.9 99.8 99.8 99.2 98.6 96.4 91.2 76.7
Figure 13: Distributions of γ per head and layer for ASEntmax trained on Copy.
H.6 Flip-Flop
We first conducted an ablation study to evaluate model performance with various RoPE scaling factors (Table 8). Although the random-baseline accuracy for Flip-Flop is 50%, our generative training setup with a vocabulary of 7 tokens (4 main and 3 special) can yield accuracies below 50%. We therefore treat accuracies at or below 50% as poor and select a scaling factor of 16 as optimal. The RoPE scaling factor defines the expansion multiple to which the model must generalize. Throughout all experiments, however, we observe that RoPE models generalize poorly at sequence lengths 8× the in-distribution length.
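For reference, the sketch below illustrates LLaMA-3-style RoPE frequency scaling following the common open-source formulation (e.g., as implemented in Hugging Face Transformers): long-wavelength components are slowed down by the scaling factor, short-wavelength components are kept, and a middle band is interpolated. Only the scaling factor corresponds to the ablated values in Table 8; the remaining hyperparameters shown are illustrative defaults, and the in-distribution length (64) is used here as the original context window.

```python
# Sketch only: LLaMA-3-style scaling of RoPE inverse frequencies.
import math
import numpy as np

def llama3_scaled_inv_freq(head_dim, base=10000.0, factor=16.0,
                           low_freq_factor=1.0, high_freq_factor=4.0,
                           original_context_len=64):
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    wavelen = 2 * math.pi / inv_freq
    low_wavelen = original_context_len / low_freq_factor
    high_wavelen = original_context_len / high_freq_factor
    # Long-wavelength (low-frequency) components are slowed down by `factor`;
    # short-wavelength components are left untouched.
    scaled = np.where(wavelen > low_wavelen, inv_freq / factor, inv_freq)
    # The band in between is smoothly interpolated.
    smooth = (original_context_len / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor)
    mid = (wavelen <= low_wavelen) & (wavelen >= high_wavelen)
    scaled = np.where(mid, (1.0 - smooth) * inv_freq / factor + smooth * inv_freq,
                      scaled)
    return scaled

print(llama3_scaled_inv_freq(head_dim=32)[:4])  # highest-frequency components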
While Flip-Flop is considered a challenging task for testing length extrapolation (Liu et al.), we find that both the ALiBi and NAPE strategies solve the sparse and dense variants almost perfectly. Surprisingly, RoPE models generalize better on the sparse variant.
H.7 Reverse
From Table 11, we can see that ASEntmax with NAPE achieves an impressive 8× length generalization, which, to our knowledge, is the largest extrapolation reported for this task.
Table 7: Exact match accuracy (%) on Copy task.
ID Out-of-Distribution
Model 64 128 256 512 1024 2048 4096
RoPE
Softmax 100.0 2.8 0 0 0 0 0
SSMax 100.0 0 0 0 0 0 0
ASSMax 99.9 19.9 0 0 0 0 0
Entmax 100.0 34.3 0 0 0 0
ASEntmax 100.0 5.3 0 0 0 0 0
NoPE
Softmax 56.3 0 0 0 0 0 0
SSMax 56.1 0 0 0 0 0 0
Entmax 34.6 0 0 0 0 0 0
ASEntmax 45.8 0 0 0 0 0 0
ALiBi
Softmax 100.0 99.8 99.8 98.8 98.3 93.9 26.8
Entmax 100.0 100.0 96.6 14.6 0.1 0 0
NAPE
Softmax 100.0 100.0 99.9 99.9 99.4 96.1 85.5
SSMax 100.0 100.0 100.0 99.9 99.6 99.3 95.8
ASSMax 99.9 99.8 99.7 99.3 97.5 91.1 72.8
Entmax 100.0 99.0 86.0 28.5 0.2 0 0
SEntmax 100.0 100.0 99.9 99.0 96.2 69.7 6.5
ASEntmax 100.0 100.0 99.9 99.7 99.4 96.3 86.6
Table 8: Exact match accuracy (%) for the ablation of LLaMA 3 RoPE scaling on Flip-Flop (sparse).
ID Out-of-Distribution
Model Factor 64 128 256 512 1024 2048 4096
Softmax - 100.0 79.9 54.4 51.5 48.8 50.8 50.8
Softmax 4 100.0 99.6 33.8 11.2 3.3 0.5 0.0
Softmax 8 100.0 100.0 72.6 0.2 0 0 0
Softmax 16 100.0 99.9 97.3 36.7 0 0 0
Softmax 32 100.0 99.2 71.6 51.3 51.1 49.3 49.2
Moreover, RoPE models fail even at a sequence length of 96. Although NAPE improves Softmax and SSMax, it does not enable generalization beyond 1.5× the in-distribution length; applying adaptive scaling to Softmax (ASSMax), however, enables generalization up to 2× the in-distribution length.
H.8 Sort
Table 12 shows the advantage of α-entmax on Sort, with two-layer models generalizing almost perfectly to 2× under both NoPE and NAPE configurations. Furthermore, Softmax models with NAPE suffer a performance decline relative to their NoPE counterparts, and adaptive scaling degrades performance for both Softmax and α-entmax (we also report results for NoPE + SEntmax to corroborate this). Combining NAPE with adaptive scaling, however, enhances α-entmax. This pattern suggests that sparsity, adaptive scaling, and NAPE can act complementarily.
Table 9: Accuracy (%) on Flip-Flop (sparse).
ID Out-of-Distribution
Model 64 128 256 512 1024 2048 4096 8192 16K 32K
RoPE
Softmax 100.0 99.9 97.2 36.6 0.0 0.0 0.0 - - -
SSMax 100.0 99.8 91.8 77.8 52.2 22.0 39.2 - - -
Entmax 100.0 99.9 89.0 64.0 50.6 50.6 55.1 - - -
ASEntmax 100.0 99.8 98.9 51.4 51.4 50.2 49.2 - - -
ALiBi
Softmax 100.0 99.9 99.8 99.9 99.9 100.0 99.7 99.7 99.9 99.7
Entmax 100.0 99.9 99.8 99.8 99.8 99.9 99.7 99.7 99.7 99.7
NAPE
Softmax 100.0 99.8 99.6 99.3 99.7 99.6 99.6 99.6 99.3 99.4
SSMax 100.0 99.8 99.9 99.8 99.8 99.7 99.8 100.0 99.9 99.6
Entmax 100.0 99.9 99.8 99.9 99.9 100.0 99.7 99.7 99.8 99.7
ASEntmax 100.0 99.8 99.6 99.3 99.5 99.3 99.5 99.7 99.6 99.5
Table 10: Accuracy (%) on Flip-Flop (dense).
ID Out-of-Distribution
Model 64 128 256 512 1024 2048 4096
RoPE
Softmax 100.0 70.2 62.2 53.2 49.2 50.3 53.1
SSMax 100.0 69.1 60.4 53.2 48.5 51.0 53.1
Entmax 100.0 80.4 73.6 60.3 49.3 51.2 53.1
ASEntmax 100.0 100.0 100.0 49.6 48.9 51.1 53.1
NAPE
Softmax 100.0 100.0 100.0 100.0 100.0 100.0 100.0
SSMax 100.0 100.0 100.0 100.0 100.0 100.0 100.0
Entmax 100.0 100.0 100.0 100.0 100.0 100.0 100.0
ASEntmax 100.0 100.0 100.0 100.0 100.0 100.0 100.0
Table 11: Exact match accuracy (%) on Reverse.
ID Out-of-Distribution
Model 64 96 128 256 512
RoPE
Softmax 100.0 0 0 0 0
SSMax 100.0 0 0 0 0
ASSMax 100.0 0 0 0 0
Entmax 100.0 0 0 0 0
ASEntmax 100.0 0 0 0 0
NoPE
Softmax 100.0 0 0 0 0
SSMax 100.0 0 0 0 0
Entmax 100.0 77.1 0 0 0
ASEntmax 100.0 74.4 0 0 0
ALiBi
Softmax 100.0 0 0 0 0
Entmax 100.0 96.1 78.5 0 0
NAPE
Softmax 100.0 36.0 0 0 0
SSMax 100.0 54.6 0 0 0
ASSMax 100.0 98.7 62.4 0 0
Entmax 100.0 99.0 86.0 28.5 0.2
SEntmax 100.0 100.0 98.1 51.4 0.0
ASEntmax 100.0 100.0 99.8 96.4 56.7
Table 12: Exact match accuracy (%) on Sort.
ID Out-of-Distribution
Model 64 128 256 512
RoPE
Softmax 100.0 0 0 0
SSMax 100.0 0 0 0
Entmax 100.0 0 0 0
ASEntmax 100.0 0 0 0
NoPE
Softmax 100.0 97.6 46.6 0
SSMax 100.0 96.3 29.8 0
Entmax 100.0 99.9 66.2 0
SEntmax 100.0 99.4 47.7 0
ASEntmax 100.0 97.5 20.8 0
ALiBi
Softmax 99.9 0 0 0
Entmax 99.2 0 0 0
NAPE
Softmax 100.0 0 0 0
SSMax 100.0 0 0 0
ASSMax 100.0 99.5 9.4 0
Entmax 100.0 99.3 57.8 0
ASEntmax 100.0 100.0 79.7 0