Under review as a conference paper at ICLR 2026
R
V
Jie Huang¹,²,*  Xuejing Liu¹,*  Sibo Song¹  Ruibing Hou²,†  Hong Chang²  Junyang Lin¹  Shuai Bai¹,†
¹Qwen Team, Alibaba Group   ²Institute of Computing Technology, Chinese Academy of Sciences
*Equal contribution   †Corresponding author
{yuzheng.lxj,sibo.ssb,junyang.ljy,baishuai.bs}@alibaba-inc.com
{huangjie24s,houruibing,changhong}@ict.ac.cn
Abstract
Multimodal position encoding is essential for vision-language models, yet it has received little systematic investigation. We
conduct a comprehensive analysis of multimodal Rotary Positional Embedding
(RoPE) by examining its two core components: position design and frequency
allocation. Through extensive experiments, we identify three key guidelines:
positional coherence, full frequency utilization, and preservation of textual pri-
ors—ensuring unambiguous layout, rich representation, and faithful transfer from
the pre-trained LLM. Based on these insights, we propose Multi-Head RoPE
(MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple and plug-and-play
variants that require no architectural changes. Our methods consistently outper-
form existing approaches across diverse benchmarks, with significant improve-
ments in both general and fine-grained multimodal understanding. Code will be made available.
1 Introduction
The permutation-invariant nature of the self-attention mechanism requires the use of positional en-
codings to inform Large Language Models (LLMs) of sequence order, relative distance, and struc-
tural dependencies. While early methods relied on absolute position embeddings (Vaswani et al.,
2017), relative encodings—which better generalize to varying sequence lengths—have become the
standard. Among these, Rotary Position Embedding (RoPE) (Su et al., 2024) has emerged as the de
facto choice in modern LLMs such as Llama (Grattafiori et al., 2024) and Qwen (Yang et al., 2025).
Vision-Language Models (VLMs) also require positional encodings that can handle heterogeneous
modalities, including 1D text and 2D/3D visual inputs. Current methods fall into two main cate-
gories: 1D sequential and multi-dimensional designs. The former, exemplified by vanilla RoPE (Su
et al., 2024) and V2PE (Ge et al., 2024), flattens and concatenates all inputs into a single sequence.
While simple, this approach discards the native visual geometry, leading to a significant degradation
in performance on tasks requiring visual grounding and spatial reasoning.
Multi-dimensional designs, the second approach, extend RoPE to multiple axes (time, height, width)
by partitioning embedding channels. Qwen2-VL (Wang et al., 2024a) adopts Multimodal RoPE (MRoPE) to unify positional encoding for text and visual tokens. However, MRoPE allocates the position embedding into contiguous t-h-w chunks, placing the temporal information entirely in the high-frequency
channels. This bias in temporal encoding harms long-range video modeling. Subsequent work has
attempted to improve it, but this has led to a fragmented landscape of highly specialized solutions.
Some methods focus exclusively on image understanding (Wang et al., 2025), others on video com-
prehension (Wei et al., 2025; Li et al., 2025; Liu et al., 2025), and a third group on image genera-
tion (Liao et al., 2025; Wu et al., 2025). While these models achieve notable performance in their
§This work was done during a research internship at Qwen Team, Alibaba Group.
Table 1: Comparison of different RoPE methods along three design axes: Position Design (3D Structure, Modality Interval), Frequency Allocation (Range, Granularity), and Compatibility with Text-only RoPE. Methods compared: Vanilla RoPE (Su et al., 2024), V2PE (Ge et al., 2024), RoPE-tie (Su, 2024), MRoPE (Bai et al., 2025), CircleRoPE (Wang et al., 2025), VideoRoPE (Wei et al., 2025), IL-RoPE (Liao et al., 2025), Omni-RoPE (Wu et al., 2025), MHRoPE (ours), and MRoPE-I (ours).
respective domains, the development of a truly robust and versatile VLM requires a more holistic
positional encoding strategy. In this work, we aim to develop a more holistic positional encod-
ing strategy capable of supporting the core, unified capabilities of image and video understanding,
complemented by fine-grained visual grounding.
To build a more robust multimodal positional encoding, we start from MRoPE and systematically explore three underexplored design aspects: (i) position design, i.e., how to assign unambiguous, well-separated coordinates to text and visual tokens; (ii) frequency allocation, i.e., how to distribute rotary frequencies across embedding dimensions for each positional axis; and (iii) compatibility with text-only RoPE, i.e., ensuring the design defaults to vanilla RoPE for pure text inputs to enable effective trans-
fer learning. As Table 1 shows, we systematically compare recent methods across the three design
axes and conduct extensive experiments. From this analysis, we identify common pitfalls: modal-
ities confusion arising from positional ambiguity; degraded cross-modal fusion due to suboptimal
modality intervals; impaired multi-scale modeling from restricted frequency allocations; and com-
promised transfer learning caused by incompatibility with text-only RoPE.
Based on our experiment, we distill three core guidelines for designing robust VLM positional en-
codings: (i) positional coherence, requiring unambiguous coordinates with a well-defined modality
interval; (ii) full frequency allocation, ensuring all positional axes have access to the full frequency
spectrum; and (iii) preservation of textual priors, keeping the text RoPE identical to the base LLM.
To satisfy the guideline of full frequency allocation, we propose two simple yet effective methods.
Multi-Head RoPE dedicates distinct attention heads to different positional axes to preserve full fre-
quency resolution. MRoPE-Interleave employs a fine-grained, round-robin distribution of channels
to ensure each axis is encoded with the full frequency spectrum. In addition, we introduce spatial-reset, a novel mechanism that resets the spatial positions of visual content. This simple modification significantly strengthens the model's focus on visual information.
Our methods consistently outperform strong baselines across key tasks, including image and video
understanding and visual grounding. Our contributions are three-fold: (1) a systematic decom-
position of multimodal RoPE design; (2) two lightweight instantiations, Multi-Head RoPE and MRoPE-Interleave, that satisfy the guidelines; and (3) spatial-reset, a general-purpose optimization for improved visual information flow.
2 Analysis of Multimodal RoPE
This section provides a systematic analysis of multimodal RoPE. We begin by revisiting the ba-
sics of vanilla RoPE. We then evaluate existing multimodal extensions along three core design
axes—position design, frequency allocation and compatibility with text-only RoPE. Through this
analytical lens, we identify critical limitations in current approaches, directly motivating our pro-
posal of two simple yet effective methods: Multi-Head RoPE and MRoPE-Interleave.
2.1 Preliminaries
Vanilla RoPE (Su et al., 2024) is a pivotal method for encoding positional information in modern
LLMs. Unlike additive position embeddings, RoPE applies a rotational transformation to the query
and key vectors, thereby incorporating relative position dependencies directly into the self-attention
mechanism. Given a query vector $q$ at position $m$ and a key vector $k$ at position $n$, the attention score is computed as
$$(R_m q)^\top (R_n k) = q^\top R_m^\top R_n k = q^\top R_{n-m} k,$$
so that it depends only on the relative position $n - m$. The transformation $R_m$ is a block-diagonal rotation matrix parameterized by the absolute position $m$. The rotation frequencies, $\theta_i = b^{-2i/d}$ for $i = 0, \dots, d/2 - 1$ (with rotary base $b$), are set according to a geometric sequence. This design creates a spectrum of frequencies ranging from high (for small $i$) to low (for large $i$), corresponding to each pair of dimensions.
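To make the rotation concrete, the following minimal NumPy sketch (our own illustration, not the paper's implementation) applies the rotary transformation to a query and a key and checks that their dot product is invariant to a common shift of the absolute positions, i.e., that it depends only on the relative distance:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply vanilla RoPE to a vector x (even dimension d) at integer position pos."""
    d = x.shape[-1]
    # Geometric frequency spectrum: theta_i = base^(-2i/d), from high to low frequency.
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    ang = pos * theta
    x1, x2 = x[0::2], x[1::2]              # pair up channels (2i, 2i+1)
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)   # 2D rotation of each channel pair
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
# Shifting both positions by the same offset leaves the attention score unchanged.
s1 = rope_rotate(q, 5) @ rope_rotate(k, 2)
s2 = rope_rotate(q, 105) @ rope_rotate(k, 102)
assert np.allclose(s1, s2)
```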
[Figure 1 panels: (a) Vanilla RoPE / V2PE (fast-growing position IDs without 3D structure); (b) MRoPE; (c) VideoRoPE / HoPE (modalities confusion in generation); (d) CircleRoPE (squeezed video frames; large ID spacing across modalities); (e) IL-RoPE / Omni-RoPE (not compatible with text-only RoPE); (f) MHRoPE / MRoPE-I (more focus on visual information; attention sink near (t, 0, 0)).]
Figure 1: Position design of different multimodal RoPE variants. The illustrated example follows an interleaved multimodal sequence of image, text, and video segments.
2.2 Position Design
Position design governs how the positional identifier is assigned to each token in the multimodal sequence.
2.2.1 1D Sequential Design
The most straightforward approach, employed by vanilla RoPE (Su et al., 2024) and V2PE (Ge et al.,
2024), is to treat the multimodal input as a flattened, one-dimensional sequence. Position indices
are assigned incrementally, with the position $m_i$ of the $i$-th token defined as $m_i = m_{i-1} + \delta_{\text{mod}}$, where $\delta_{\text{mod}}$ is a step size specific to the token's modality. For vanilla RoPE, all modalities are treated uniformly, with $\delta_{\text{mod}} = 1$.
As shown in Figure 1a, this design presents two significant drawbacks. First, it discards the inherent
3D structure of visual content, which can alter the spatio-temporal reasoning capabilities of a VLM.
Second, position indices can grow exceedingly large in long sequences, negatively affecting the
model’s extrapolation performance (Wei et al., 2025).
To address the issue of large position indices, V2PE (Ge et al., 2024) introduces dynamic posi-
tion scaling for visual tokens, setting their step size $\delta_{\text{visual}}$ to a value in $\{1, 1/2, \dots, 1/256\}$. This
modification mitigates the rapid growth of position indices and has shown benefits in long video
understanding. However, the 3D structure of visual content is still ignored in 1D sequential design.
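A minimal sketch of this 1D sequential scheme follows; the `STEP` table and function name are hypothetical, and vanilla RoPE corresponds to a step of 1 for every modality while V2PE shrinks the visual step:

```python
# Hypothetical step sizes; V2PE samples the visual step from {1, 1/2, ..., 1/256}.
STEP = {"text": 1.0, "visual": 0.25}

def assign_1d_positions(tokens):
    """tokens: list of modality tags; returns one (possibly fractional) position per token."""
    positions, cur = [], 0.0
    for modality in tokens:
        positions.append(cur)
        cur += STEP[modality]   # m_i = m_{i-1} + step(modality)
    return positions

# Vanilla RoPE would use step 1 everywhere; here the visual tokens advance more slowly.
print(assign_1d_positions(["text", "visual", "visual", "visual", "text"]))
# -> [0.0, 1.0, 1.25, 1.5, 1.75]
```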
2.2.2 Multi-Dimensional Design
To preserve the native 3D structure of visual content, methods like MRoPE (Wang et al., 2024a)
(see Figure 1b) extend the scalar position identifier to a multi-dimensional tuple. For instance, a token's position can be represented as $m_i = (m^t_i, m^h_i, m^w_i)$, corresponding to its temporal, vertical, and horizontal axes. MRoPE conceptually treats each piece of visual content (e.g., an image or a set of video frames) as a single, large "cube". The temporal position of the subsequent token is then set by "jumping" past the maximum coordinate value of the current block. This is achieved with the update rule:
$$m^t_{\text{next}} = \max\!\left(m^t_{\text{prev}},\, m^h_{\text{prev}},\, m^w_{\text{prev}}\right) + 1$$
This strategy guarantees that no positional overlaps occur between modalities. However, the authors of VideoRoPE (Wei et al., 2025) and HoPE (Li et al., 2025) argue that MRoPE's position design lacks inter-modal symmetry, and they therefore introduce a "diagonal layout" by centering the spatial coordinates
(see Figure 1c). In this scheme, visual frames are not only stacked along the temporal axis but are
also shifted along the vertical and horizontal axes. Despite its theoretical elegance, this diagonal
layout introduces a critical flaw: the potential for position id overlap between visual content and
generated text tokens. For high-resolution image content like documents, the spatial coordinates of
visual tokens can extend into the index range subsequently assigned to the generated text tokens.
We identify this positional ambiguity as a source of ”modalities confusion in generation”, a failure
mode that manifested as endless text repetition in our later experiments.
CircleRoPE (Wang et al., 2025) arranges image tokens in a circular layout, orthogonal to the linear
axis of text positions (see Figure 1d). A key property of this design is that it renders all visual tokens
equidistant from any given text token, which theoretically promotes uniform attention across the
image. However, CircleRoPE’s design has two limitations. First, the large interval between modali-
ties may impede effective cross-modal interaction. Second, lacking a temporal axis, it collapses all
video frames onto a single ring, which introduces severe temporal ambiguity.
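To make the multi-dimensional designs above concrete, the sketch below (our own simplified illustration, not the released code) assigns MRoPE-style (t, h, w) coordinates to an interleaved text/image sequence, including the jump rule for the token that follows a visual block:

```python
def mrope_positions(segments):
    """segments: list of ("text", n_tokens) or ("image", (H, W)) entries.
    Returns one (t, h, w) position triplet per token."""
    positions, offset = [], 0          # `offset` is the running scalar position
    for kind, spec in segments:
        if kind == "text":
            for i in range(spec):      # text: all three axes share the 1D index
                positions.append((offset + i,) * 3)
            offset += spec
        else:                          # image: a (1, H, W) cube anchored at `offset`
            H, W = spec
            for h in range(H):
                for w in range(W):
                    positions.append((offset, offset + h, offset + w))
            # Jump rule: the next position starts past the cube's maximum coordinate.
            offset = max(offset, offset + H - 1, offset + W - 1) + 1
    return positions

pos = mrope_positions([("text", 3), ("image", (2, 2)), ("text", 2)])
print(pos[:3])   # text tokens: (0,0,0), (1,1,1), (2,2,2)
print(pos[3:7])  # image tokens anchored at t=3: (3,3,3), (3,3,4), (3,4,3), (3,4,4)
print(pos[7:])   # following text resumes at 5 = max(3,4,4)+1: (5,5,5), (6,6,6)
```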
2.2.3 Takeaways on Position Design
Our preceding analysis, summarized in the first column of Table 1, indicates that a robust position
design must satisfy several criteria, which we collectively term Positional Coherence: (i) preserve
the 3D structure of visual content; (ii) maintain a slow growth rate; (iii) avoid modalities confusion
in generation; (iv) establish an appropriate modality interval.
While MRoPE fulfills most of these requirements, our analysis uncovers a crucial phenomenon:
MRoPE exhibits a visual ”attention sink,” where attention concentrates on the top-left corner of
each image or video frame, a behavior visualized in Figure 2. This is analogous to the attention
sink at the initial tokens in large language models. This insight directly motivates our proposal of spatial-reset, which restarts the spatial coordinates of each visual block at zero. With spatial-reset, we aim to align this visual sink with the LLM's bias for small position IDs, accelerating visual adaptation.

Figure 2: Visual attention sink in MRoPE. Average attention scores over the input sequence on ChartQA (left) and VideoMME-short (right).
Furthermore, spatial-reset provides an additional benefit for video understanding by disentangling the representation of motion. Consider an object token at spatial coordinates $(h_1, w_1)$ at time $t_1$ and a second token for the same object at $(h_2, w_2)$ at time $t_2$. Let their absolute position indices be $m_1$ and $m_2$, respectively. Under the standard MRoPE formulation, the temporal and spatial dimensions are coupled. The absolute positions are $m_1 = (t_1,\, t_1 + h_1,\, t_1 + w_1)$ and $m_2 = (t_2,\, t_2 + h_2,\, t_2 + w_2)$. The resulting relative position index, $m_{\text{rel}} = m_2 - m_1$, becomes entangled:
$$m_{\text{rel}} = \big(t_2 - t_1,\; (t_2 - t_1) + (h_2 - h_1),\; (t_2 - t_1) + (w_2 - w_1)\big)$$
4

Under review as a conference paper at ICLR 2026
In contrast, our method with spatial-reset assigns $m_1 = (t_1, h_1, w_1)$ and $m_2 = (t_2, h_2, w_2)$. This yields a purely spatio-temporal relative vector:
$$m_{\text{rel}} = (t_2 - t_1,\; h_2 - h_1,\; w_2 - w_1)$$
This disentangled representation of motion is more intuitive and provides a cleaner inductive bias
for the model to learn from. Therefore, the position design we adopt for our proposed MHRoPE and
MRoPE-I methods build upon MRoPE by incorporating spatial-reset.

[Figure 3 depicts, for each method, how feature channels (ordered from high to low frequency) are allocated to the Temporal (T), Vertical (H), and Horizontal (W) axes, covering Vanilla RoPE / V2PE / VRoPE, MRoPE / CircleRoPE / OmniRoPE, VideoRoPE / HoPE, IL-RoPE, MRoPE-I, and MHRoPE.]

Figure 3: Frequency allocation of different multimodal RoPEs.
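The position design with spatial-reset can be sketched analogously to the earlier MRoPE example. Under our reading of the method, the spatial coordinates of every visual block restart from zero while the temporal offset simply advances past the block; the interface and the per-image temporal increment below are our simplifying assumptions:

```python
def mrope_positions_spatial_reset(segments):
    """Like the MRoPE sketch above, but h/w restart at 0 for each visual block."""
    positions, offset = [], 0
    for kind, spec in segments:
        if kind == "text":
            for i in range(spec):
                positions.append((offset + i,) * 3)
            offset += spec
        else:
            H, W = spec
            for h in range(H):
                for w in range(W):
                    positions.append((offset, h, w))   # spatial coordinates reset to (0, 0)
            offset += 1                                 # temporal axis advances by one image/frame
    return positions

# Two tokens of the same object at times t1 and t2 now differ purely by
# (t2 - t1, h2 - h1, w2 - w1), i.e. motion is disentangled from the temporal offset.
print(mrope_positions_spatial_reset([("text", 2), ("image", (2, 2)), ("text", 1)]))
```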
2.3 Frequency Allocation
Frequency allocation governs the assignment of feature channels and their corresponding frequencies $\theta_i$ to the various axes of the position identifier (temporal, vertical, and horizontal).
2.3.1 F
In 1D methods like vanilla RoPE and V2PE, all feature channels are allocated to encode the
temporal axis. The frequencies idecay as the channel index
from high-frequency (for short-range dependencies) to low-frequency (for long-range dependen-
cies). This design also imparts a long-range decay property on attention scores, as their upper
bound is a function of the relative distance, which can be approximated by
P
d/2−1
i=0
|Si+1|, where
Sj=
P
j1
k
e
i(m−n)θ k
(see Appendix D.2 for detailed derivation). V2PE’s position scaling for visual
tokens effectively slows this decay, enhancing the model’s ability to focus on long visual content.
2.3.2 Multi-Dimensional Frequency Allocation
The standard MRoPE partitions the feature channels into three contiguous chunks, assigning one to each of the temporal, vertical, and horizontal axes. This design forces the temporal axis to be encoded entirely by the highest-frequency channels. This creates a strong inductive bias that is detrimental to long-sequence understanding, as it leads to a rapid decay of attention over time. Furthermore, because the two spatial axes occupy non-overlapping frequency ranges, they exhibit different long-range decay rates, as visualized in
Figure 4a. This asymmetry can impair the model’s ability to learn consistent spatial relationships.
Subsequent methods have attempted to rectify this temporal bias through various frequency re-
allocation strategies. VideoRoPE and HoPE, for example, move the temporal axis to occupy the
low-frequency channels. IL-RoPE employs a form of interleaving but similarly reserves the lowest-
frequency channels for the temporal dimension. While these approaches can mitigate the long-
context issue for the temporal axis, they introduce a critical, unaddressed trade-off: they force the
spatial dimensions into a restricted, and often exclusively high-frequency, band. This severely limits
the model’s ability to capture multi-scale spatial relationships, which can impair performance on
tasks reliant on fine-grained spatial reasoning, such as visual grounding. Furthermore, the very act
of partitioning feature dimensions inherently coarsens the frequency resolution for each positional
axis. The performance implications of this reduced granularity remain underexplored.
(a) MRoPE (b) MHRoPE / MRoPE-I
Figure 4: The long-range decay property of MRoPE, MHRoPE and MRoPE-I.
2.3.3 Takeaways on Frequency Allocation
To address the limitations of frequency allocation, we propose two effective strategies, as summa-
rized in the second column of Table 1. Both methods resolve the rapid temporal decay and asym-
metric spatial decay of MRoPE, yielding a unified decay profile for all axes as shown in Figure 4b.
Multi-Head Allocation. This strategy draws on recent work demonstrating channel-level redundancy in RoPE (e.g., partial RoPE (Barbero et al., 2025)). Based on the premise that similar redundancy exists at the attention head level, MHRoPE partitions the positional encoding task among different attention heads¹, as shown in Figure 3. A
primary advantage of this strategy is that each axis is encoded using the full frequency spectrum
available within its assigned heads. This approach avoids the loss of frequency resolution inherent
to channel-splitting methods. Moreover, it may be more scalable. As the number of positional
axes grows (Liu et al., 2025), partitioning a fixed channel budget (e.g., 128 dimensions) becomes untenable,
whereas dedicating distinct heads to new dimensions offers a far more robust and flexible approach.
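A minimal sketch of the head-level split (our illustration, with hypothetical head counts; the actual method partitions KV heads under GQA, as noted in the footnote): each attention head is rotated by the scalar position of a single axis, using the full frequency spectrum within that head.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # full spectrum within this head
    ang = pos * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

def multi_head_rope(q, pos_thw, axis_of_head):
    """q: (num_heads, head_dim); pos_thw: (t, h, w) of this token;
    axis_of_head[j] in {0, 1, 2} says which positional axis head j encodes."""
    return np.stack([rope_rotate(q[j], pos_thw[axis_of_head[j]])
                     for j in range(q.shape[0])])

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 64))          # 6 heads, head_dim 64 (hypothetical sizes)
axis_of_head = [0, 1, 2, 0, 1, 2]     # round-robin assignment of heads to t, h, w
q_rot = multi_head_rope(q, pos_thw=(7, 3, 5), axis_of_head=axis_of_head)
print(q_rot.shape)                    # (6, 64)
```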
Interleaved Allocation. MRoPE-I, our second method, distributes feature channels to the temporal (t), vertical (h), and horizontal (w) axes in a fine-grained, round-robin manner, as shown in Figure 3. This design ensures that each positional axis
is encoded using the full frequency spectrum, from high to low, thereby enabling robust multi-scale
modeling for each positional axis. Moreover, the uniform frequency distribution of our interleaved
design is compatible with extrapolation algorithms like NTK-aware (bloc97, 2023) and YaRN (Peng
et al., 2024), which function by rescaling the frequency spectrum (see Appendix D.3).
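A corresponding sketch of the interleaved split (again our own illustration): channel pair i is assigned axis i mod 3, so every axis sees frequencies spread across the whole geometric spectrum. The paper's best-performing ratio is 24:20:20 rather than a perfectly uniform split; a pure round-robin is used here for brevity.

```python
import numpy as np

def mrope_interleave_rotate(x, pos_thw, base=10000.0):
    """x: (d,) with d even; pos_thw: (t, h, w). Channel pair i uses axis (i mod 3)."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # shared full spectrum
    axis = np.arange(d // 2) % 3                     # round-robin over t, h, w
    pos = np.asarray(pos_thw)[axis]                  # per-pair scalar position
    ang = pos * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

x = np.random.default_rng(0).normal(size=128)
print(mrope_interleave_rotate(x, (7, 3, 5)).shape)   # (128,)
```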
2.4 Compatibility with Text-Only RoPE
Most VLMs are adapted from LLMs, which typically use vanilla RoPE for positional encoding. This
raises a natural question: should the encoding strategy for text tokens in VLMs remain identical to
that of the base LLM? While most works implicitly agree, some methods have explored deviations.
From the Position Design aspect, methods like IL-RoPE (Liao et al., 2025) and Omni-RoPE (Wu
et al., 2025) modify the text encoding. As shown in Figure 1e, they reset spatial coordinates for each
image to aid editing but concurrently set the spatial dimensions for text tokens to zero. This design
choice breaks compatibility with the standard RoPE used in the pre-trained LLM.
From the Frequency Allocation aspect, we also explored a potential modification. Since the coor-
dinate range of spatial dimensions is much smaller than that of the temporal axis, a smaller rotary
‘base‘ could be used to better encode the spatial dimension. However, our experiments showed
that this strategy led to poor performance across most benchmarks. This outcome strongly indi-
cates the critical importance of maintaining full compatibility with the text-only RoPE for effective
knowledge transfer from pre-trained LLMs.
¹For Grouped-Query Attention (GQA), we partition on KV heads and repeat on the corresponding query heads.
2.5 Proposed Methods
Based on our analysis, we propose two novel multimodal RoPE variants: Multi-Head RoPE
(MHRoPE) and MRoPE-Interleave (MRoPE-I). Both methods are built upon a shared set of design
guidelines for robustness and performance.
For position design, we enhance the MRoPE position design with spatial-reset to promote the model's focus on visual information. We also maintain strict compatibility with text-only RoPE, ensuring
the effective transfer of pre-trained knowledge.
The key distinction between our variants lies in their frequency allocation strategy. MHRoPE em-
ploys Multi-Head Allocation, dedicating distinct attention heads to different axes to preserve fre-
quency resolution and offer scalability. In contrast, MRoPE-I uses Interleaved Allocation, a fine-
grained approach ensuring full-spectrum encoding and compatibility with extrapolation techniques.
For a detailed discussion on the trade-offs between MHRoPE and MRoPE-I, see Appendix D.1.
3 Experiments
3.1 Experimental Setup
Training Details. We adopt the Qwen2.5-VL model architecture², while initializing
the VLM backbone with the Qwen2.5 7B LLM. During training, we freeze the ViT to fix the visual
representations and unfreeze the connector and LLM backbone. This strategy is designed to isolate
the effects of our proposed RoPE modifications while adhering to the standard VLM adaptation
paradigm of building upon a pre-trained LLM.
Training adopts a batch size of 128 and a cosine-decayed learning rate of $10^{-5}$, and allocates approximately 512 Nvidia A100 GPU hours per experiment. The training context length is set to
32K, and the rotary base is set to 1000000. All experiments share identical training data, model
architecture, and hyperparameters, with the sole difference being the choice of multimodal RoPE.
Training Data and Evaluation Benchmarks. The training data consists of high-quality supervised fine-tuning (SFT) samples, covering a wide range of visual scenarios includ-
ing image captioning, OCR, visual reasoning, visual grounding, document comprehension, and long
video understanding. For evaluation, we adopt more than 20 benchmarks spanning images, videos,
and grounding tasks. Specifically, the image benchmarks include MMMU (Yue et al., 2024), MM-
Bench (Liu et al., 2024a), MMStar (Chen et al., 2024), OCRBench (Liu et al., 2024b), AI2D (Kem-
bhavi et al., 2016), RealWorldQA (X.AI., 2024), DocVQA (Mathew et al., 2021), TextVQA (Singh et al., 2019), InfoVQA (Mathew et al., 2022), and ChartQA (Masry et al., 2022). The video bench-
marks consist of MVBench (Li et al., 2024), STAR (Wu et al., 2021), VideoMME (Fu et al., 2025),
LVBench (Wang et al., 2024b), MLVU (Zhou et al., 2024b), and Charades-STA (Zhou et al., 2024a).
For grounding, we evaluate on RefCOCO (Kazemzadeh et al., 2014) series.
3.2 Overall Performance
The overall performance of different multimodal RoPEs is presented in Table 2. Both MHRoPE and
MRoPE-I achieve consistently better performance across the majority of benchmarks. For instance,
MRoPE-I outperforms the vanilla RoPE baseline by significant margins of +5.28% on ChartQA and +3.27% on RefCOCO val.
The results also reveal that while vanilla RoPE serves as a competitive baseline, its performance is
noticeably impaired on benchmarks that demand fine-grained spatial reasoning, such as ChartQA
and the RefCOCO series. This performance gap highlights the fundamental limitations of its flat-
tened, 1D position design. Vanilla RoPE also suffers from poor extrapolation (see Appendix D.4).
While VideoRoPE and HoPE demonstrate stronger performance on video benchmarks, they exhibit
anomalous degradation on DocVQA, InfoVQA, and ChartQA. We attribute this discrepancy to a
critical flaw in their position design: the overlap of position indices, which induces confusion be-
²For a fair comparison with prior work that uses Qwen2-VL, we disable the absolute time encoding used in Qwen2.5-VL.
Table 2: Overall performance of multimodal RoPE variants on various benchmarks. The values in parentheses on the far right indicate the performance gain over the best-performing baseline.

Types      Benchmark        Scores per method
Image      MMMU             50.56  50.22  49.89  49.89  47.22  53.00  53.22
           MMBench (avg)    74.75  74.06  (+0.81)
           MMStar           49.13  49.93  49.60  50.33  47.00  49.60
           OCRBench         73.40  72.70  66.20  66.60  70.60  73.40  74.00
           AI2D             74.45  75.45  75.36
           RealWorldQA      58.30  57.25  56.21  57.12  56.60
           DocVQA           82.94  81.49  60.13  60.12  77.70  81.32
           TextVQA          66.80  65.85  66.58  66.77  65.54  66.49
           InfoVQA          (-0.62)
           ChartQA          56.84  62.12
Video      MVBench          57.05  57.85  56.78  58.00  57.10
           STAR             58.07  58.28  57.20  58.30  57.94
           MLVU             64.69  63.26  65.46
           VideoMME         58.63  58.22  58.70  (+0.33)
           LVBench          38.93  39.22  40.15  (+1.61)
           Charades-STA     32.49  32.23  34.21  (+1.87)
Grounding  RefCOCO val      77.67  78.35  77.95  77.72  79.59  79.87  80.94
           RefCOCO testA    81.37  82.52  80.43  81.60  83.98  83.66
           RefCOCO testB    72.66  72.31  72.62  71.44  74.35  73.20
           RefCOCO+ val     69.16  68.80  68.15  69.61  70.19  70.55  71.80
           RefCOCO+ testA   74.48  75.95  73.14  74.55  76.77  76.88  77.44
           RefCOCO+ testB   61.67  59.97  60.50  61.69  61.96  (+0.29)
           RefCOCOg val     75.45  75.86  74.06  74.69  76.10  76.55  77.70
           RefCOCOg test    75.40  75.73  73.90  75.45  76.12  76.68  77.34
Overall    Image            65.69  65.18  60.64  60.72  62.86  66.40  66.65
           Video            51.64  51.51  52.18  52.36  (+0.72)
           Grounding        73.48  73.69  72.59  73.19  74.96  74.92
tween visual and generated text tokens. The ablation study in Table 3 confirms that this confusion is
the root cause of the degradation.
The suboptimal designs of MRoPE and CircleRoPE manifest in their performance. MRoPE denies each position axis access to the full frequency spectrum. Consequently, it struggles on tasks demanding specific frequency ranges, such as long-video understanding (MLVU, LVBench), which requires robust low-frequency temporal encoding, and visual grounding (RefCOCO), which benefits from high-frequency spatial encoding. Similarly, CircleRoPE introduces a large modality interval and collapses the video positions, resulting in poor video understanding.
In contrast, MHRoPE and MRoPE-I build on a coherent position design that avoids modalities confusion and does not introduce an improper modality interval. By providing each positional axis
with a full frequency spectrum (interleave or multi-head allocation), they enable the models to bet-
ter capture both fine-grained spatial details (high-frequency) and long-range temporal dependencies
(low-frequency), leading to their superior overall performance.
3.3 Ablation Studies
This section presents ablation studies on key design choices for our robust multimodal RoPEs. Ad-
ditional results are provided in Appendix D.5.
3.3.1 Ablation on Position Design
We previously argued that an optimal position design should: (1) incorporate 3D structure to capture
native spatio-temporal information, (2) maintain a proper modality interval, and (3) preserve com-
patibility with text-only RoPE. To systematically dissect the impact of these factors, we conduct an
ablation study, fixing the frequency allocation strategy to our interleaved allocation while varying
the position design. The results are presented in Table 3. Simply introducing a 3D structure over
the vanilla RoPE provides a notable boost to grounding performance. The addition of the spatial-reset mechanism yields substantial gains across all benchmark categories, confirming its effectiveness.
We also ablated other position designs proposed in prior work, as shown in Table 3.
Diagonal Layout: Adopting the diagonal layout results in a pronounced drop in performance on document-centric benchmarks (DocVQA, InfoVQA, and ChartQA). A qualitative
analysis reveals a specific failure mode: repetitive, nonsensical text generation (e.g., “1111...”),
which occurs even when the layout is applied only at inference time. We attribute this behavior
to modalities confusion induced by positional overlap, causing the model to misinterpret its own
generated text tokens as visual tokens, resulting in this unpredictable repetitive output.
Enlarged Modality Interval: We also enlarged the modality interval so that position indices advance across a visual block as fast as under vanilla RoPE, a strategy similar to RoPE-Tie (Su, 2024). This also resulted in poor document-
related performance. However, the failure mode was distinct: the model generated fluent but contex-
tually irrelevant text, effectively ignoring the visual input. This suggests that while a clear modality
interval is necessary, simply maximizing its size to align with vanilla RoPE can be detrimental by impeding cross-modal interaction.
Text Spatial Zeroing: Following IL-RoPE and Omni-RoPE, we reset the spatial dimensions for visual tokens as well as text ones (Fig. 1e). This approach resulted in a notable per-
formance degradation compared to the vanilla RoPE, emphasizing that preserving RoPE alignment
for text is critical for successfully adapting LLMs into VLMs.
Scaling Rotary Base: We further experimented with encoding the spatial dimensions using a scaled rotary 'base' (e.g., reduced from 1,000,000 to 10,000). This consistently
resulted in a clear performance drop on image benchmarks. This finding demonstrates that even
well-intentioned deviations from the base LLM’s RoPE formulation can break compatibility and
severely impair knowledge transfer.
Position Design          Image  Grounding  Video  DocVQA  InfoVQA  ChartQA
vanilla RoPE             65.69  73.48      51.64  82.94   58.85    –
+ 3D structure           65.87  74.40      51.29  82.33   57.24    61.44
+ 3D + spatial-reset     –      –          –      83.72   –        62.12
+ diagonal layout        61.20  72.33      52.51  60.13   37.42    54.88
+ modality interval      62.80  73.19      50.88  70.43   42.18    51.28
+ text spatial zeroing   –      –          –      77.30   52.15    44.33
+ scaling rotary base    60.15  74.13      52.11  80.44   52.16    58.80
Table 3: Ablation study of different position design strategies.
3.3.2 Ablation on Frequency Allocation
To determine the optimal frequency allocation strategy, we fix the position design to that of MRoPE with spatial-reset and vary only the frequency allocation. As shown in Table 4, a more uniform allocation strategy consistently outperforms alternatives that split the spectrum into partial chunks. This highlights the importance of ensuring that each positional axis (time, height, width) retains access to the full frequency spectrum.
Allocation Type   Image  Video  Grounding  Avg
VideoRoPE-like    65.33  52.11  72.50      63.31
IL-RoPE-like      65.26  51.15  72.80      63.07
Multi-Head        66.40  52.58  74.92      64.63
Interleave        66.65  52.36  75.85      64.95
Table 4: Ablation results of different frequency allocation strategies.
4 Conclusion
In this work, we conducted the first systematic investigation into multimodal Rotary Positional Em-
bedding (RoPE) for Vision-Language Models (VLMs). From our systematic comparison and ex-
tensive experiments, we identified three key design considerations for robust multimodal RoPE:
positional coherence, full frequency utilization, and preservation of textual priors from pre-trained
LLMs. Guided by these insights, we proposed two plug-and-play RoPE variants: Multi-Head RoPE
(MHRoPE) and MRoPE-Interleave (MRoPE-I). Both methods adhere to our identified guidelines,
effectively addressing common failure modes and achieving significant gains in both general
and fine-grained multimodal understanding. This work offers a comprehensive guide for designing
effective multimodal positional encodings, paving the way for future advancements in VLMs.
References
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang,
Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Ming-Hsuan Yang, Zhaohai Li, Jianqiang
Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Ze-
sen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical re-
port. CoRR, abs/2502.13923, 2025.
Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, and Petar Velick-
ovic. Round and round we go! what makes rotary positional encodings useful? In
International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025.
OpenReview.net, 2025.
bloc97. Ntk-aware scaled rope allows llama models to have extended (8k+) con-
text size without any fine-tuning and minimal perplexity degradation., 2023. URL
https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_
scaled_rope_allows_llama_models_to_have/.
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi
Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-
language models? In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich
Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024,
Vancouver, BC, Canada, December 10 - 15, 2024, 2024.
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu
Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li,
Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme:
The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In
IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN,
USA, June 11-15, 2025, pp. 24108–24118. Computer Vision Foundation / IEEE, 2025.
Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, and Xizhou Zhu. V2PE:
improving multimodal long-context capability of vision-language models with variable visual
position encoding. CoRR, abs/2412.09616, 2024.
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The llama 3 herd of models, 2024.
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to
objects in photographs of natural scenes. In Alessandro Moschitti, Bo Pang, and Walter Daele-
mans (eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), October 2014.
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Min Joon Seo, Hannaneh Hajishirzi, and Ali
Farhadi. A diagram is worth a dozen images. In Bastian Leibe, Jiri Matas, Nicu Sebe, and
Max Welling (eds.), Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, volume 9908 of Lecture Notes in Computer Science, pp. 235–251. Springer, 2016.
Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, and Ruiwen Xu. Hope: Hybrid of position embedding
for length generalization in vision-language models. CoRR, abs/2505.20444, 2025.
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen,
Ping Lou, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understand-
ing benchmark. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pp. 22195–22206. IEEE, 2024.
Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang
Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal
generation. CoRR, abs/2505.05472, 2025.
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi
Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model
an all-around player? In Ales Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten
Sattler, and Gül Varol (eds.), Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part VI, volume 15064 of Lecture Notes in Computer Science, pp. 216–233. Springer, 2024a.
Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin,
Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of OCR in large
multimodal models. Science China Information Sciences, 67(12), 2024b.
Zikang Liu, Longteng Guo, Yepeng Tang, Junxian Cai, Kai Ma, Xi Chen, and Jing Liu. Vrope:
Rotary position embedding for video large language models. CoRR, abs/2502.11664, 2025.
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq R. Joty, and Enamul Hoque. Chartqa: A
benchmark for question answering about charts with visual and logical reasoning. In Smaranda
Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pp. 2263–2279. Association
for Computational Linguistics, 2022.
Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. Docvqa: A dataset for VQA on doc-
ument images. In IEEE Winter Conference on Applications of Computer Vision, WACV 2021, Waikoloa, HI, USA, January 3-8, 2021, pp. 2199–2208. IEEE, 2021.
Minesh Mathew, Viraj Bagal, Rub`en Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V. Jawahar.
Infographicvqa. In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022, pp. 2582–2591. IEEE, 2022.
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context win-
dow extension of large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh,
and Marcus Rohrbach. Towards VQA models that can read. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 8317–
8326. Computer Vision Foundation / IEEE, 2019.
Jianlin Su. Transformer upgrade path: 17. insights into multimodal positional encoding, March
2024.
Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer:
Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von
Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman
Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp.
5998–6008, 2017.
Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, and Kai Han.
Circle-rope: Cone-like decoupled rotary positional embedding for large vision-language mod-
els. CoRR, abs/2505.16416, 2025.
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu,
Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng
Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s
perception of the world at any resolution. CoRR, abs/2409.12191, 2024a.
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin
Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Lvbench: An extreme long video understanding
benchmark. CoRR, abs/2406.08035, 2024b.
Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong
Duan, Qipeng Guo, Jiaqi Wang, et al. Videorope: What makes for good video rotary position
embedding? 2025.
Bo Wu, Shoubin Yu, Zhenfang Chen, Josh Tenenbaum, and Chuang Gan. STAR: A benchmark
for situated reasoning in real-world videos. In Joaquin Vanschoren and Sai-Kit Yeung (eds.),
Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1,
NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021.
Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan
Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun
Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu.
Omnigen2: Exploration to advanced multimodal generation. CoRR, abs/2506.18871, 2025.
X.AI. Grok-1.5 vision preview, 2024.
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu,
Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
Xiang Yue, Yuansheng Ni, Tianyu Zheng, Kai Zhang, Ruoqi Liu, Ge Zhang, Samuel Stevens,
Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun,
Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and
Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning
benchmark for expert AGI. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pp. 9556–9567. IEEE, 2024.
Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang,
Tiejun Huang, and Zheng Liu. MLVU: A comprehensive benchmark for multi-task long video
understanding. CoRR, abs/2406.04264, 2024a.
Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang,
Tiejun Huang, and Zheng Liu. MLVU: A comprehensive benchmark for multi-task long video
understanding. CoRR, abs/2406.04264, 2024b.
A Ethics Statement
This work adheres to the ICLR Code of Ethics. In this study, no human subjects or animal ex-
perimentation was involved. All datasets used were sourced in compliance with relevant usage
guidelines, ensuring no violation of privacy. We have taken care to avoid any biases or discrimi-
natory outcomes in our research process. No personally identifiable information was used, and no
experiments were conducted that could raise privacy or security concerns. We are committed to
maintaining transparency and integrity throughout the research process.
B Reproducibility Statement
We have made every effort to ensure that the results presented in this paper are reproducible. Code
will be made publicly available to facilitate replication and verification after inspection. The exper-
imental setup, including training steps, model configurations, and hardware details, is described in
detail in the paper. We believe these measures will enable other researchers to reproduce our work
and further advance the field.
C LLM Usage
Large Language Models (LLMs) were used to aid in the writing and polishing of the manuscript.
Specifically, we used an LLM to assist in refining the language, improving readability, and ensuring
clarity in various sections of the paper. The model helped with tasks such as sentence rephrasing and grammar checking.
It is important to note that the LLM was not involved in the ideation, research methodology, or
experimental design. All research concepts, ideas, and analyses were developed and conducted by
the authors. The contributions of the LLM were solely focused on improving the linguistic quality
of the paper, with no involvement in the scientific content or data analysis.
The authors take full responsibility for the content of the manuscript, including any text generated
or polished by the LLM. We have ensured that the LLM-generated text adheres to ethical guidelines
and does not contribute to plagiarism or scientific misconduct.
D Additional Analyses and Results
D.1 Practical Recommendation: MHRoPE vs. MRoPE-I
While both of our proposed methods are effective, we currently recommend MRoPE-I over
MHRoPE for two primary reasons: its consistent (albeit slight) performance advantage and its
greater implementation simplicity. We attribute MHRoPE’s minor performance deficit to its head-
level information partitioning, which prevents the integration of different positional axes within the
self-attention mechanism. From an engineering perspective, MRoPE-I is also simpler, avoiding the
complexities that MHRoPE introduces with distributed training paradigms like tensor parallelism.
Nevertheless, MHRoPE’s design offers a potentially more scalable architecture for future models
that may need to accommodate a larger number of positional axes.
D.2 Derivation of the Attention Score Upper Bound
Here, we provide a formal derivation for the upper bound of the RoPE attention score. The RoPE
dot product between a query $q$ at position $m$ and a key $k$ at position $n$ can be written in complex form as:
$$(R_m q)^\top (R_n k) = \mathrm{Re}\left[\sum_{i=0}^{d/2-1} \left(q_{[2i:2i+1]} \cdot \overline{k}_{[2i:2i+1]}\right) e^{\mathrm{i}(m-n)\theta_i}\right] \tag{5}$$
where $\overline{k}_{[2i:2i+1]}$ denotes the complex conjugate of a 2D vector treated as a complex number, and $\cdot$ denotes the complex product.
To derive the upper bound, we analyze the magnitude of the summation term. To apply summation by parts, let us define a content-dependent sequence $h_i = q_{[2i:2i+1]} \cdot \overline{k}_{[2i:2i+1]}$ and a position-dependent sequence of partial sums $S_j = \sum_{k=0}^{j-1} e^{\mathrm{i}(m-n)\theta_k}$. We also set the boundary conditions $S_0 = 0$ and $h_{d/2} = 0$. The standard summation by parts formula is $\sum_{i=a}^{b} u_i \Delta v_i = [u_i v_i]_a^{b+1} - \sum_{i=a}^{b} v_{i+1} \Delta u_i$. Applying this, the magnitude of the summation can be rewritten and bounded as
follows:
$$\begin{aligned}
\left|\sum_{i=0}^{d/2-1} h_i e^{\mathrm{i}(m-n)\theta_i}\right|
&= \left|[h_i S_i]_0^{d/2} - \sum_{i=0}^{d/2-1} S_{i+1}(h_{i+1} - h_i)\right| \\
&= \left|(h_{d/2} S_{d/2} - h_0 S_0) - \sum_{i=0}^{d/2-1} S_{i+1}(h_{i+1} - h_i)\right| \\
&= \left|\sum_{i=0}^{d/2-1} S_{i+1}(h_{i+1} - h_i)\right| \\
&\leq \sum_{i=0}^{d/2-1} |S_{i+1}|\,|h_{i+1} - h_i| \\
&\leq \left(\max_{0 \leq i < d/2} |h_{i+1} - h_i|\right) \sum_{i=0}^{d/2-1} |S_{i+1}|
\end{aligned} \tag{6}$$
This final expression reveals that the upper bound is a product of two distinct components. The first, $\max_{0 \leq i < d/2} |h_{i+1} - h_i|$, is a content-dependent term determined only by the query and key vectors. The second, $\sum_{i=0}^{d/2-1} |S_{i+1}|$, is a purely position-dependent term whose value is determined only by the relative position $m - n$. Since the content-dependent term is independent of position, the long-range decay property of the attention score is governed primarily by this position-dependent term. Therefore, its average value, $\frac{1}{d/2}\sum_{i=1}^{d/2} |S_i|$, serves as a practical indicator to characterize how the upper bound attenuates with relative distance.
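The position-dependent factor can be evaluated numerically; the short sketch below (our own illustration) computes the average |S_i| as a function of the relative distance m − n, which is the kind of indicator used to characterize the long-range decay curves in Figure 4. The dimension and base values are assumptions matching the training setup described in Section 3.1.

```python
import numpy as np

def decay_indicator(rel_dist, d=128, base=1_000_000.0):
    """Average |S_j| over j, where S_j = sum_{k<j} exp(i * rel_dist * theta_k)."""
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    phases = np.exp(1j * rel_dist * theta)
    S = np.cumsum(phases)              # partial sums S_1 .. S_{d/2}
    return np.abs(S).mean()

# The indicator shrinks as the relative distance grows, reflecting attention decay.
for m_minus_n in [1, 16, 256, 4096]:
    print(m_minus_n, round(float(decay_indicator(m_minus_n)), 2))
```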
D.3 Compatibility with Extrapolation Algorithms
As shown in Figure 4b, the interleaved frequency allocation of MRoPE-I makes it compatible with
extrapolation algorithms like NTK-aware (bloc97, 2023) and YaRN (Peng et al., 2024). Whereas
standard MRoPE’s partitioned spectrum complicates the application of a consistent frequency scal-
ing boundary, our interleaved design provides a full spectrum across all positional axes, enabling a
straightforward and symmetric application of these methods.
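For illustration, a minimal sketch of how the NTK-aware base rescaling (bloc97, 2023) would be applied; the rule b' = b · s^(d/(d−2)) for scale factor s is the commonly used form, and with interleaved allocation the same rescaled base serves all three axes since they share an identical full spectrum.

```python
def ntk_scaled_base(base, scale, dim):
    """NTK-aware base rescaling: stretch the whole frequency spectrum uniformly."""
    return base * scale ** (dim / (dim - 2))

# With interleaved allocation, the rescaled base is reused unchanged for t, h, and w.
new_base = ntk_scaled_base(base=1_000_000.0, scale=4.0, dim=128)
print(f"{new_base:.3e}")
```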
Furthermore, MRoPE-I requires a smaller YaRN scaling factor than vanilla RoPE. This is a direct result of its more efficient position design, which limits the growth of position IDs (positions advance per frame rather than per visual token), so a smaller scaling factor suffices for the same context extension. Empirically, for long multimodal inputs, the scaling factor required by MRoPE-I is considerably smaller than that required by vanilla RoPE.
D.4 Long Video Extrapolation
We further compare the performance of different methods on long video understanding, with context
lengths ranging from 32K to 256K. As shown in Figure 5, apart from LVBench, we do not observe
clear performance improvements or degradation when extrapolating to longer sequences. The only
exception is Vanilla RoPE, which suffers from a sharp performance drop at 128K/256K. We attribute
this to excessively fast-growing position IDs, which degrade extrapolation capability, as also discussed in other works (Wei et al., 2025; Li et al., 2025).
Overall, methods such as VideoRoPE and HoPE, which allocate most low-frequency channels to
the temporal axis, exhibit slightly better extrapolation ability in the long-video scenario. However, when
considering performance across images and grounding tasks, MHRoPE and MRoPE-I remain the
most comprehensive and balanced designs.
Figure 5: Video extrapolation performance. Models are trained with a context length of 32k (256
frames) and extrapolated to 64k (512 frames), 128k (1024 frames), and 256k (2048 frames).
D.5 More Ablation Studies
D.5.1 Effect of Spatial-Reset on Visual Attention
To understand the mechanism driving the effectiveness of spatial-reset, we analyzed its impact on
the model’s attention patterns. As detailed in Table 5, we calculated the total attention scores on vi-
sual tokens using the DocVQA test set. Specifically, we extracted attention scores from layers 4, 12,
20, and 28, and averaged the scores across all attention heads and samples. The results demonstrate that the variants equipped with spatial-reset allocate more attention to visual tokens, confirming its effectiveness in enhancing the model's visual focus.
Method               Layer 4  Layer 12  Layer 20  Layer 28
MHRoPE               40.31    21.76     32.05     19.00
  w/o spatial-reset  –        –         –         –
MRoPE-I              37.48    15.68     28.08     23.23
  w/o spatial-reset  –        –         –         –
Table 5: Average attention scores (%) on visual content. The inputs are from the DocVQA test set, and the scores are averaged over attention heads and samples.
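For reference, a minimal sketch (our own, with hypothetical tensor shapes) of how the attention mass on visual tokens can be measured from a layer's attention weights; the analysis above additionally averages over samples and the selected layers.

```python
import numpy as np

def visual_attention_share(attn, visual_mask):
    """attn: (heads, seq, seq) row-normalized attention weights from one layer;
    visual_mask: (seq,) boolean mask of visual tokens.
    Returns the average fraction of attention mass that lands on visual tokens."""
    mass_on_visual = attn[:, :, visual_mask].sum(axis=-1)   # (heads, seq)
    return mass_on_visual.mean()                            # average over heads and query positions

# Toy example: 4 heads, 10 tokens, tokens 2..7 are visual.
rng = np.random.default_rng(0)
attn = rng.random((4, 10, 10))
attn /= attn.sum(axis=-1, keepdims=True)                    # normalize rows to sum to 1
mask = np.zeros(10, dtype=bool)
mask[2:8] = True
print(round(float(visual_attention_share(attn, mask)), 3))
```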
D.5.2 Ablation on Allocation Ratios
We further investigate different allocation ratios under the interleave frequency strategy. The results
are summarized in Table 6. The balanced allocation (t:h:w = 24:20:20) achieves the best overall per-
formance. Increasing the proportion of channels assigned to the temporal axis reduces the available
high-frequency capacity for spatial dimensions. This leads to a degradation in grounding ability and
negatively impacts benchmarks involving spatial understanding in both images and videos.
Allocation Ratio (t:h:w)  Image  Video  Grounding  Avg
24:20:20                  66.65  52.36  75.85      64.95
32:16:16                  64.07  51.15  74.65      63.29
48:8:8                    65.06  51.17  72.87      63.03
Table 6: Ablation results of different frequency allocation ratios under the interleave design.
D.5.3 Ablation on Temporal Stride
This section investigates the impact of different temporal strides between video frames. Specifically,
we experiment with strides δ of 0.5, 1, and 2, as well as dynamic strides as used in V2PE and HoPE (with the scaling applied during inference). The results are shown in Table 7.
From the results, both smaller (δ = 0.5) and larger (δ = 2) strides lead to performance drops. Incorporating the dynamic stride from V2PE does not show a significant benefit.
Stride
0.5 56.55 57.90 58.96 38.99 62.37 31.88 51.11
1
2 55.70
Dynamic 56.28 57.93 58.74
Table 7: Comparison of temporal stride settings for video benchmarks on MRoPE-I.