Extracted Text

2504.16891v1.pdf

2025-4-24
AIMO-2 Winning Solution: Building
State-of-the-Art Mathematical Reasoning Models
with OpenMathReasoning dataset
Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel,
Benedikt Schifferer, Wei Du, Igor Gitman
Abstract:
This paper presents our winning submission to the AI Mathematical Olympiad - Progress Prize 2 (AIMO-2)
competition. Our recipe for building state-of-the-art mathematical reasoning models relies on three key
pillars. First, we create a large-scale dataset comprising 540K unique high-quality math problems, including
olympiad-level problems, and their 3.2M long-reasoning solutions. Second, we develop a novel method to
integrate code execution with long reasoning models through iterative training, generation, and quality
filtering, resulting in 1.7M high-quality Tool-Integrated Reasoning solutions. Third, we create a pipeline to
train models to select the most promising solution from many candidates. We show that such generative
solution selection (GenSelect) can significantly improve upon majority voting baseline. Combining these ideas,
we train a series of models that achieve state-of-the-art results on mathematical reasoning benchmarks. To
facilitate further research, we release our code, models, and the completeOpenMathReasoningdataset under
a commercially permissive license.
1.
Recent advances in large language models (LLMs)
have significantly improved their ability to solve com-
plex reasoning tasks, including olympiad-level mathe-
matics. A key idea behind this progress has been to
allow models to spend more tokens thinking about the
solution before producing the final answer. Initially,
models were trained to produce a series of interme-
diate solution steps (chain-of-thought (CoT) [35]).
More recently,long reasoningmodels [12,10] have
learned to reflect on their work, exploring and refining
multiple strategies within a single generation. This
has led to further improvements across mathematics,DS-Distill-1.5B
OpenMath-
Nemotron-1.5B
DS-Distill-7B OpenMath-
Nemotron-7B
DS-Distill-14B OpenMath-
Nemotron-14B
DS-Distill-32B
QwQ-32B
OpenMath-
Nemotron-32B
DeepSeek-R1
0
10
20
30
40
50
60
70
80
Accuracy (%) 31.6
65.6
48.5
71.5
56.2
74.6
57.4
67.6
76.6
64.1
1.5B Models 7B Models 14B Models 32B Models DeepSeek-R1
CoT pass@1 CoT maj@64 TIR maj@64 TIR maj@64 + Self GenSelect
Figure 1: Accuracy on math problems from AIME and HMMT competitions.
©2025 NVIDIA. All rights reserved.

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
coding, and scientific domains. To keep pace with this
rapid development, the community has introduced in-
creasingly challenging benchmarks and competitions
that help to evaluate the progress.
The
(AIMO-2)
ments in this domain by challenging participants to
create models capable of solving 50 difficult, national-
level mathematical problems within strict computa-
tional limits. These problems were never published
online, ensuring a more rigorous evaluation compared
to traditional benchmarks. This report details our
first-place submission to the competition, which cor-
rectly solved 34 out of 50 problems on the private test
set. To develop the winning recipe, we focused on
addressing several limitations of the publicly available
reasoning models that we describe below.
Large-scale long-reasoning dataset (§2). To
improve existing models we started by collecting an
extensive set of mathematical problems from the inter-
net. We developed an LLM-based problem extraction
and refinement pipeline to construct a dataset of 540K
unique problems. Using this dataset, we then gener-
ated 3.2M long-reasoning CoT solutions by prompt-
ingDeepSeek-R1[10] andQwQ-32B[29]. Training
Qwen2.5-Basemodels [39] on this large-scale distilla-
tion data, we are able to surpass the accuracy of all
other open-weight models of comparable size, except
forQwQ-32B, which is slightly better than our 32B
model.
Tool-Integrated Reasoning (§3). To improve
the results further we developed a method for integrat-
ing code execution into long-reasoning generations.
Our initial attempts to elicit Tool-Integrated Reason-
ing (TIR) fromDeepSeek-R1andQwQ-32Bthrough
simple prompting proved unsuccessful. We hypothe-
size that these models struggle to deviate from their
standard solution format due to extensive training on
reasoning tasks and limited exposure to instruction-
following. To overcome this challenge, we built a
pipeline that starts with a small-scale reasoning fine-
tuning of aninstruction-followingmodel [42]. By
prompting this model to generate long-reasoning TIR
solutions followed by aggressive quality filtering, we
established an initial dataset suitable for training.
Through multiple iterations of training, generation,
and filtering, we constructed a 1.7M TIR solution set
that was crucial for improving the accuracy of our
final models. To make TIR more efficient, we also
developed a method to accurately control the number
of code executions the model is allowed to make for
each generation.
Generative Solution Selection (§4). A com-
mon approach to maximize model accuracy is to gen-
erate multiple candidate solutions and select the most
promising one. While majority voting [34] serves as
a strong baseline, its performance falls significantly
short of the theoretical maximum performance of
pass@k. To address this limitation, we developed
a pipeline for training models to identify the most
promising solution when presented with multiple can-
didates. We generated 566K selection examples to
train our models. Although this approach showed
considerable promise, we ultimately were unable to
integrate it into our AIMO-2 Kaggle submission due
to the competition’s strict time constraints.
Combining these three innovations, we developed a
series of state-of-the-art open-weight math reasoning
models with 1.5B, 7B, 14B, and 32B parameters. Each
model supports CoT, TIR and GenSelect inference
modes when appropriately prompted. For our winning
AIMO-2 submission we used an intermediate version
of the 14B model and implemented various inference
optimizations to accommodate the competition’s time
and compute constraints. We discuss model training
process and evaluation results in Section
Kaggle-specific optimizations in Section.
To accelerate progress in open-source mathemat-
ical reasoning, we are releasing our code, fine-
tunedOpenMath-Nemotronmodels, and the complete
OpenMathReasoningdataset under a commercially
permissive license.
1
2.
In this section, we outline our validation and train-
ing data curation pipeline. Section
methodology for preparing a large-scale problem set
for training. Section
set collection process. Finally, Section
our approach to synthesizing long-reasoning Chain-
of-Thought (CoT) solutions.
2.1.
We collect a large set of mathematical problems from
the
rums. We include all forum discussions except “Mid-
dle School Math”, which we found to be too elemen-
tary and unhelpful for training in our preliminary
experiments. After retrieving forum discussions, we
implement a systematic process to extract problems
and their corresponding answers. Throughout our
pipeline, we utilizeQwen2.5-32B-Instruct[39] for
all processing steps unless specified otherwise.
1
Data and models are available at
https://huggingface.co/collections/nvidia/openmathreasoning-
68072c0154a5099573d2e730, our code is avaliable at
https://github.com/NVIDIA/NeMo-Skills
2

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
1.
Problem Extraction:We prompt an LLM to
identify and extract all problems from the initial
forum posts (Appendix). While most posts
contain a single problem, some include multiple
problems or none at all.
2.
Problem Classification:Each extracted prob-
lem is classified into the following categories. We
use an LLM to perform the classification:
•Proof problem or not (Appendix)
•
Multiple choice question or not (Ap-
pendix)
•
Binary question (yes-or-no answer) or not
(Appendix)
•Valid problem or not (Appendix)
2
We remove all multiple-choice questions, binary
questions, and invalid problems from the final
dataset.
3.
Question Transformation:For proof ques-
tions, we convert them into answer-based ques-
tions that require similar problem-solving tech-
niques (Appendix).
4.
Answer Extraction:For non-proof questions,
we attempt to extract the final answer from the
forum discussions (Appendix)
3
.
5.
Benchmark Decontamination: Following [41]
we use an LLM-based comparison to remove ques-
tions that closely resemble those in popular math
benchmarks.
All prompts and scripts necessary to run the above
pipeline are available in. Table
breakdown of the dataset size after each processing
stage and Table
We provide a comparison with other popular datasets
sourced from AoPS forums in Table.
Pipeline Stage Data Size
Original forum discussions 620K
Extracted problems 580K
Removing “bad” problems 550K
Benchmark decontamination 540K
Table 1: Dataset size after each processing stage.
2.2.
To create a robust validation dataset for our eval-
uation, we combined problems from American In-
vitational Mathematics Examinations (AIME) and
Harvard-MIT Mathematics Tournaments (HMMT)
2
E.g. problems that are lacking context or referring to other
problems are considered invalid.
3
We do not try to extract the full solution, just the final
answer.
Subset Size
Converted proofs 260K
With extracted answer 190K
No extracted answer 90K
Total problems 540K
Table 2: Final dataset composition.
Dataset # of Problems
OpenMathReasoning (ours) 540K
AoPS-Instruct [20] 650K
NuminaMath-1.5 (AoPS part) [14] 68K
Table 3: Comparison with other datasets sourced
from AoPS forums. Our work was done concurrently
with [20] and [14].
gathered from the Art of Problem Solving forums.
We restricted our selection to 2024 and 2025 com-
petitions to minimize potential data contamination.
AIME and HMMT problems were selected for our vali-
dation set due to their strong alignment with AIMO-2
competition requirements. They covered similar math-
ematical topics, matched the difficulty level, and were
predominantly non-proof-based questions requiring
single numerical answers. We excluded proof-based
questions and those awarding partial credit based on
estimate accuracy, as these are generally incompati-
ble with an exact match evaluation framework. The
resulting dataset, which we callComp-Math-24-25,
consists of 256 problems, as detailed in Table.
Problem source # of Problems
AIME 2024 30
AIME 2025 30
HMMT Nov 2024 62
HMMT Feb 2024 68
HMMT Feb 2025 66
Total 256
Table 4: Composition of our Comp-Math-24-25
validation dataset.
2.3.
To generate CoT solutions, we follow a common
pipeline of directly prompting an existing open-weight
LLM to solve problems collected in Section. We
utilizeDeepSeek-R1andQwQ-32Bmodels and gener-
ate up to 32 solution candidates for each problem in
our dataset. We use temperature0.7,top-&#3627408477;= 0.95,
and limit generations to 16384 tokens. We gener-
3

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
ate more solutions forharderproblems with known
answers, where the hardness was estimated by com-
puting an average pass-rate across 32 generations
from theQwen2.5-72B-Math-Instructmodel [40].
As the final filtering step we remove any solutions
that do not reach the expected answer. Predicted
and expected answers are compared by prompting
Qwen2.5-32B-Instructto judge whether they are
equivalent in the context of the problem (we re-use
judge prompt from [30]). For each problem where we
weren’t able to extract the final answer (and for all
converted proofs) we treat the most common answer
across all available solution candidates as the ground-
truth. Table
solutions in our dataset.
Model
CoT solutions
after filtering all
QwQ-32B 0.5M 1.0M
DeepSeek-R1 2.7M 4.2M
Total 3.2M 5.2M
Table 5: Final distribution of CoT solutions in our
dataset.
3.
Allowing LLMs to integrate natural language rea-
soning with Python code execution is a known way
of improving accuracy on challenging math prob-
lems [31,40]. However, the best open-weight rea-
soning models (most notablyDeepSeek-R1[10] and
QwQ-32B[29]) are not able to directly produce such
Tool-Integrated Reasoning (TIR) solutions. Our ini-
tial attempts to induce TIR generations by prompting
these reasoning models with direct instructions or few-
shot examples turned out to be unsuccessful. Unable
to solve this via prompting, we had to develop a
more elaborate pipeline for building reasoning models
capable of producing TIR solutions.
In our early experiments, we noticed that when
non-reasoninginstructLLMs are trained on a limited
quantity of reasoning data [42], they tend to retain
their good instruction-following abilities. Building on
this intuition, we were able to successfully prompt
LIMO-Qwen-32B[42] model to produce TIR solutions,
but found that they tend to below-qualityon aver-
age. The produced code was often irrelevant or was
merely used to verify calculations of preceding CoT
steps. To overcome this, we developed a filtering step
aimed at retaining only high-quality examples where
code execution provides substantial reasoning bene-
fits. Using this filtered dataset, we then fine-tuned our
reasoning model, achieving significant performance
improvements over the CoT-only predecessor. Fi-
nally, we employed an iterative model improvement
approach by training a more powerful TIR model
in each iteration and using it to generate and filter
additional TIR examples, further enhancing model
performance. In the following subsections, we detail
each stage of this pipeline.
3.1.
Prior work [21,42] shows that fine-tuning on as few as
1000 samples is sufficient to make LLM produce long-
CoT solutions. We hypothesize that aninstructmodel
fine-tuned on such a small dataset can potentially
preserve its instruction-following and long-reasoning
capabilities.
To test this, we promptedLIMO-Qwen-32Bto solve
the problem using Python code for the steps that
require complex calculations. The zero-shot prompt
we designed for this purpose is provided in Appendix
B.1. For roughly half of the problems, the model pro-
duced a solution that contained at least one Python
code block. We then synthesized 1.2M solutions
forOpenMathReasoningproblems, using tempera-
ture=0.7, top-&#3627408477;= 0.95, allowing maximum sequence
length of 16384 tokens and stopping generations if
the solution contained more than 8 code executions.
3.2.
Careful inspection of generated solutions revealed that
code execution often does not benefit the solution and
could easily be replaced with several simple CoT steps
(see example in Appendix). Instead, we want an
ideal TIR solution to provide significant shortcuts
by implementing otherwise infeasible brute-force ap-
proaches, e.g., using numeric solvers or conducting an
exhaustive search of possible solutions. To filter un-
wanted code usages, we apply several filters. First, we
utilizeQwen2.5-32B-Instructto classify each code
block by two criteria:
•
novel calculation/verification. Whether
the code execution leads to a novel result or it
simply verifies the previous steps (see the prompt
in Appendix).
•
significant/moderate/trivial. Whether
the code implements an important part of the
solution or is easily substitutable with several
CoT steps (see the prompt in Appendix).
We then only keep solutions that either have at least
one novel and significant code block or more than half
novel and moderate code blocks. Additionally, we
4

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
apply rule-based filtering and remove solutions with
incorrect final answer and solutions without code ex-
ecution. We also remove solutions with more than
two code blocks, as we found it to be helpful in our
preliminary experiments. As part of preprocessing,
we also replace the tags marking the start and end of
code blocks. Duringstage-0generation, we instruct
the model to place code between "‘‘‘python" and
"‘‘‘˙", following a markdown-like style that mod-
els can easily produce; we then replace these with
"<tool_call>" and "</tool_call>" tags, respectively,
to make the code ending tag distinguishable from
regular markdown and facilitate code extraction. All
described filtering steps result in the TIR dataset,
consisting of 15k samples, which we will refer to as
stage-0 TIR data.
3.3.
For the next stage of TIR solution generation, we
leverageQwQ-32Bas it proved to be a powerful yet
lightweight synthetic reasoning data generator. For
this purpose, we fine-tune it on thestage-0data
for 7 epochs with a constant learning rate of 5e-6.
We then synthesize solutions forOpenMathReasoning
problems. We generate 700K samples and filter them
down to 260K by removing incorrect solutions and
solutions not using code. We find that novelty and
significance filters degrade the performance at this
stage, so we do not use them.
To further improve results, we repeat this process
one more time using an intermediate version of our
14B model, which was finetuned on the CoT-only
subset ofOpenMathReasoningdata. We train this
14B model onQwQ-32Bsolutions and then execute a
final round of data generation and filtering, ultimately
resulting in the final 1.7M TIR dataset.
3.4.
We developed a simple, yet effective method to con-
trol the number of code blocks that the model can
use. During all data generation stages, we format the
code output as shown in Appendix, appending
additional notification warning about how many code
executions are remaining. We find that model often
refers to this message in its thinking process, refrain-
ing from further code usage when the limit is reached.
Thus, for each problem we randomly select between 1
and 8 allowed code executions and provide this infor-
mation in the prompt. We remove generations that
try to use more code blocks than requested in order to
reinforce the correct behavior in training. As a result,
model learns to follow specified code execution limit.
An example of this behavior is provided in Appendix
F.3.
4.
We observe a considerable gap in themajority@k
vspass@kperformance for our models, implying the
models theoretical ability to solve far more problems
than can be achieved with a majority answer. To
bridge this gap, we explore training a model that,
given a set of candidate solutionsummaries, picks the
most promising solution. In our early experiments,
we found that comparing multiple solutions yields
significantly better results than judging each solution
in isolation. Following [45], we do not change the
model’s architecture and instead let it reason in nat-
ural language before selecting one of the provided
solutions. We detail the pipeline to prepare the train-
ing data for suchselectiongenerations (GenSelect) in
the following sections. The data construction pipeline
of is shown in Figure.
4.1.
Solutions generated by reasoning models have athink-
ingpart and asummarywhich follows it. We noticed
that summaries generated by reasoning models, such
asDeepSeek-R1, could be very succinct; in extreme
cases, they could just be stating the final answer.
Since we require a representative summary for compar-
ing different solutions during inference, we replace the
nativesummary of the reasoning models by synthesiz-
ing new summaries with theQwen2.5-32B-Instruct
model. We synthesize four candidate summaries per
solution with a maximum length of 2048 tokens. To
ensure the summary is faithful, we filter out sum-
maries where the predicted answer is different from
the original solution’s predicted answer. If there are
no valid summaries, we discard the sample
4
, other-
wise we select the longest summary to replace the
original summary. We regenerate summaries for the
entire OpenMathReasoning dataset using this pro-
cess, so that models trained on it can produce these
summaries directly. See Appendix
between one-wordDeepSeek-R1summary and a new
one generated byQwen2.5-32B-Instruct.
4.2.
We discover that modest accuracy gains over major-
ity voting can be achieved by simply presenting new
solution summaries to reasoning models and prompt-
ing them to compare and select one (see prompt in
Appendix). Building on this observation, we
develop the following pipeline to generate training
data for this GenSelect inference.
For each problem in theOpenMathReasoning
4
No more than 5% of all samples were discarded this way.
5

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset0 10 20 30 40 50 60
Number of samples
60
65
70
75
80
85
Comp-Math-24-25 Accuracy (in %)
Pass
GenSelect
Majority
(a) 14B CoT0 10 20 30 40 50 60
Number of samples
60
65
70
75
80
85
90
Comp-Math-24-25 Accuracy (in %)
Pass
GenSelect
Majority (b) 14B TIR
Figure 2: Comparison of majority, GenSelect and pass metrics for different number of generation samples. To
construct the input for GenSelect, we use subsets of 16 solutions (or all if fewer samples were generated). For
the final answer, we perform majority@8 over the answers selected by the GenSelect.OpenMath-Nemotron
-14B
model is used to perform CoT, TIR, and GenSelect inference. We find that GenSelect becomes unstable
when using more than 32 generations as we can no longer show all solutions in a single prompt.
dataset, we randomly sample between 2 and 16 candi-
date solution summaries. We ensure that each sample
group contains at least one correct and one incorrect
solution. This process is repeated until we obtain 8
distinct comparison groups for each problem. Using
the GenSelect prompt (Appendix), we then task
QwQ-32Bwith selecting the most promising solution
from each group. This procedure generates 1M selec-
tions, which we subsequently filter down to 565K by
eliminating any instances where incorrect solutions
were chosen.
4.3.
While this dataset is suitable for training, the com-
parison generations can be as long as the original
solutions, making GenSelect inference computation-
ally expensive. To address this challenge, we ex-
plored training models to directly generate the final
comparisonsummaryrather than learning the full
reasoning trace. Consistent with our previous obser-
vations, the natural comparison summaries produced
byQwQ-32Bproved suboptimal. We therefore again
usedQwen2.5-32B-Instructto regenerate all com-
parison summaries (see the prompt in Appendix
and trained our models using these summarized com-
parisons. Our early experiments revealed only a small
reduction in accuracy (1–2%) compared to models
trained on the whole reasoning traces. This final
setup makes GenSelect inference remarkably efficient
compared to the original long-reasoning generations.
With output tokens capped at 2048, most computa-
tion occurs in a highly-parallelizable pre-filling phase.
Since each solution summary is similarly limited to
2048 tokens, the total input context cannot exceed
32768 tokens when using the maximum of 16 solutions
per problem. Although more than 16 solutions could
theoretically be included in a prompt, we generally
observe diminishing returns as the context becomes
too large. For scenarios requiring evaluation of more
solution candidates, we propose sampling 16 solutions
multiple times and then performing majority voting to
determine the final answer. Nevertheless, our findings
indicate that the most significant accuracy improve-
ments occur when GenSelect is applied to a smaller
number of generations (Figure).
5.
In this section we present the training and evaluation
details of ourOpenMath-Nemotronseries of models.
5.1.
To build our final models we perform supervised-
finetuning (SFT) on a series ofQwen2.5-Basemodels
(1.5B, 7B, 14B and 32B) [39]. For 1.5B and 7B mod-
els, we start from the special model versions finetuned
for mathematical reasoning tasks [40]. Unlike gen-
eral Qwen2.5 models, the math versions only support
a limited context window of 4096 tokens, which is
inadequate for the long-reasoning generations. To
overcome this, we follow [2] and change RoPE [27]
base to 500K.
6

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning datasetSolution 1Solution 2
.
.
.SolutionNcorrectSolution 1Solution 2
.
.
.SolutionNincorrectSolution SummariesProblemGenSelect InputQwQ-32BQwen2.5-32B-InstructGenSelect Output
Sampled Correct
Instances
Sampled Incorrect
InstancesFilter Reasoning TracesSummarize Reasoning Trace
Figure 3: Data construction pipeline of GenSelect.
The GenSelect input is constructed by sampling
solution summaries of both correct and incorrect
instances, ensuring that the input contains at least
one correct and one incorrect solution. The input is
then fed toQwQ-32B, which is tasked with selecting
the best solution among the candidate solutions.
The reasoning traces that select correct solutions are
summarized withQwen2.5-32B-Instruct, which
forms the GenSelect output.
All models are trained for six epochs on a combi-
nation of three tasks: CoT solution generation, TIR
solution generation, and GenSelect, where the task is
to select one correct solution out of multiple candi-
dates. Each task is defined by a unique prompt that
we can use at inference time to switch between differ-
ent generation modes (see prompts in Appendix).
We found that training on a mix of all tasks results in
a similar accuracy compared to training on each task
sequentially (first CoT, then TIR, then GenSelect).
The total SFT dataset size is 5.5M samples (3.2M
CoT, 1.7M TIR, and 566K GenSelect).
We train all models using AdamW optimizer [18]
with weight decay of0.01and a cosine learning rate
decay schedule with a 10% linear warmup. We use
a starting learning rate of 3e-4 for 1.5B, 2e-4 for 7B
and 1e-4 for 14B and 32B models. The final learning
rate is set to be 1000 times smaller. We use batch
size of 1024 samples and leverage sequence packing
and context parallelization techniques from NeMo-
Aligner [26] that significantly accelerate training on
the long-reasoning data. Following [30] we save 4
equally spaced checkpoints during the training runs,
which are averaged to create the final model. We show
the accuracy on the Comp-Math-24-25 benchmark
(Section) of intermediate 1.5B and 14B model
checkpoints in Figure.
After the first round of training, we perform an-
other SFT on a subset of harder problems. These
problems are selected only from forums discussing
Olympiad math and we discard any problems for
whichQwen2.5-Math-72B-InstructTIR model has
a pass-rate bigger than 0.3 out of 32 generations. Ad-
ditionally, we filter any solutions that have fewer than
5000 tokens. The total SFT data size of this harder
set is 2.2M samples. We follow the same setup as for
the first round of SFT except we train for 4 epochs
instead of 6. We do this second round of training for
all models except 32B as we found some degradation
in results. Models’ accuracy after the first and second
round of training is presented in Table. We find
that CoT results tend to significantly improve while
TIR results stay stable or slightly degrade.
Model First SFT Second SFT
1.5B CoT 55.1 58.2
1.5B TIR 64.1 64.5
7B CoT 61.3 62.5
7B TIR 71.1 70.7
14B CoT 62.9 65.2
14B TIR 74.6 73.4
Table 6: Accuracy with majority@64 on the
Comp-Math-24-25 benchmark after the first and
second SFT rounds. We see significant gains for CoT
generations and comparable results for TIR
generations.
5.2.
Final evaluation results of our models are presented in
Table. In addition to Comp-Math-24-25, introduced
in Section, we use Humanity’s Last Exam dataset
[6]. We only evaluate on a subset consisting of 975
text-only problems from “Math” category. We refer
to it as HLE-Math.
We notice that despite being superior in majority@&#3627408472;
setting with TIR prompt, smaller models perform
on par or even worse in pass@1, compared to CoT
prompt. The results in Table
7

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset0.751.5 3 6
Training Epochs
20
30
40
50
60
Comp-Math-24-25 Accuracy (in %)
CoT maj@64
TIR maj@64
CoT pass@1
TIR pass@1
(a) 1.5B0.751.5 3 6
Training Epochs
30
40
50
60
70
Comp-Math-24-25 Accuracy (in %)
CoT maj@64
TIR maj@64
CoT pass@1
TIR pass@1 (b) 14B
Figure 4: Accuracy improvement through the course of training. We observe that smaller models need to be
trained for longer to achieve meaningful improvements.
reason is that with the TIR prompt there are more
unfinished solutions across all model sizes, with 1.5B
clearly standing out. We hypothesize that the reason
behind this is that smaller models are less consistent
in using tools effectively.
6.
In this section, we present the details of our win-
ning submission to the AI Mathematical Olympiad -
Progress Prize 2 (AIMO-2) [7]. AIMO-2 is a competi-
tion organized by the AIMO Prize team and hosted on
Kaggle
5
. The competition featured 110 challenging
math problems: 10 were made publicly available as a
reference subset, while the remaining problems were
split between the public and private leaderboards.
Submitted solutions were evaluated under strict com-
putational constraints: a 5-hour time limit in an of-
fline Jupyter notebook environment powered by four
L4 GPUs. Our 1st-place submission correctly solved
34 out of 50 questions on the private leaderboard.
6.1.
For our winning Kaggle submission we used a some-
what different training recipe that we detail in this
section. We first trainedQwen2.5-14B-Basemodel
for 8 epochs on a 2.2M subset of CoT solutions, ex-
cluding any converted proof problems. We only used
DeepSeek-R1solutions for this training. This is fol-
lowed by a light-weight fine-tuning on 15kstage-0
TIRsamples. The process for collecting these samples
is detailed in section. We train TIR model for
5
https://www.kaggle.com/competitions/ai-mathematical-
olympiad-progress-prize-2
400 steps with a constant learning rate of 1e-5 and
use the last checkpoint without averaging. We then
merge CoT and TIR checkpoints as it both improves
accuracy and speeds up generation by reducing solu-
tion length and number of code executions. We did
not use GenSelect training or inference for the Kaggle
submission.
6.2.
In this competition, we explored various methods for
merging two LLMs with CoT and TIR behaviors. Our
primary goal was to effectively combine the distinct
strengths of these two fine-tuning stages to enhance
model performance. We experimented with several
merging techniques from mergekit [9] package. Sur-
prisingly, the most effective approach turned out to
be a simple linear combination of the two checkpoints:
the CoT checkpoint used before TIR fine-tuning and
the best TIR checkpoint attained thereafter. This
strategy allowed us to control the extent to which
each stage influenced the final model’s behavior. Ta-
ble
length and code usage statistics of the models before
and after the described merging procedure.
6.3.
The strict time limits of the competition presented
a serious constraint. An extra requirement was that
problems had to be answered one-at-a-time making it
harder to parallelize computation and allocate time.
To overcome these challenges we implemented several
optimizations that maximize inference efficiency while
maintaining output quality.
8

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
Model
Comp-Math-24-25
HLE-Math
AIME24 AIME25 HMMT-24-25
DeepSeek-R1-Distill-Qwen-1.5B 26.8 (60.0) 21.4 (36.7) 14.2 (26.5) 2.9 (5.0)
OpenMath-Nemotron-1.5B CoT 61.6 (80.0) 49.5 (66.7) 39.9 (53.6) 5.4 (5.4)
OpenMath-Nemotron-1.5B TIR 52.0 (83.3) 39.7 (70.0) 37.2 (60.7) 2.5 (6.2)
+ Self GenSelect 83.3 70.0 62.2 7.9
+ 32B GenSelect 83.3 70.0 62.8 8.3
DeepSeek-R1-Distill-Qwen-7B 54.4 (80.0) 38.6 (53.3) 30.6 (42.9) 3.3 (5.2)
OpenMath-Nemotron-7B CoT 74.8 (80.0) 61.2 (76.7) 49.7 (57.7) 6.6 (6.6)
OpenMath-Nemotron-7B TIR 72.9 (83.3) 57.5 (76.7) 54.6 (66.3) 7.8 (10.8)
+ Self GenSelect 86.7 76.7 68.4 11.5
+ 32B GenSelect 86.7 76.7 69.9 11.9
DeepSeek-R1-Distill-Qwen-14B 65.8 (80.0) 48.4 (60.0) 40.1 (52.0) 4.2 (4.8)
OpenMath-Nemotron-14B-MIX (kaggle) 73.7 (86.7) 57.9 (73.3) 50.5 (64.8) 5.7 (6.5)
OpenMath-Nemotron-14B CoT 76.3 (83.3) 63.0 (76.7) 52.1 (60.7) 7.5 (7.6)
OpenMath-Nemotron-14B TIR 76.3 (86.7) 61.3 (76.7) 58.6 (70.9) 9.5 (11.5)
+ Self GenSelect 86.7 76.7 72.4 14.1
+ 32B GenSelect 90.0 76.7 71.9 13.7
QwQ-32B 78.1 (86.7) 66.5 (76.7) 55.9 (63.3) 9.0 (9.5)
DeepSeek-R1-Distill-Qwen-32B 66.9 (83.3) 51.8 (73.3) 39.9 (51.0) 4.8 (6.0)
OpenMath-Nemotron-32B CoT 76.5 (86.7) 62.5 (73.3) 53.0 (59.2) 8.3 (8.3)
OpenMath-Nemotron-32B TIR 78.4 (93.3) 64.2 (76.7) 59.7 (70.9) 9.2 (12.5)
+ Self GenSelect 93.3 80.0 73.5 15.7
DeepSeek-R1 79.1 (86.7) 64.3 (73.3) 53.0 (59.2) 10.5 (11.4)
Table 7: Evaluation results on mathematical benchmarks. All models are evaluated with a maximum of 32768
output tokens, temperature of0.6, and top-p0.95. We present metrics as pass@1 (maj@64) where pass@1 is
an average accuracy across 64 generations and maj@64 is the result of majority voting. The 14B model used
in our kaggle submission is denoted as (kaggle). For HMMT and HLE-Math benchmarks we use LLM-judge
setup of [30] to verify the answers. To construct the input for GenSelect, we use subsets of 16 solutions from
the set of 64 solutions. We repeat this 64 times and perform majority voting over the answers selected by the
GenSelect.
6.3.1.
Pretrained models were converted to TensorRT en-
gines using TensorRT-LLM [23]. TensorRT-LLM’s
in-flight batching boosts throughput by dynamically
grouping inference requests, releasing each sample
as soon as it completes—reducing latency and opti-
mizing GPU utilization. Since samples are processed
independently, batches can mix different prompts or
inference parameters seamlessly. TensorRT-LLM in-
cludes a number of other optimizations such as custom
attention kernels and paged KV caching.
Quantization involves a speed-accuracy tradeoff,
as outlined in TensorRT-LLM’s best practices [24].
We prioritized int8 weight-only (W8A16) and FP8
quantization, which delivered faster inference than
BF16 with minimal accuracy loss. The reduced weight
size also freed up memory for larger key-value caches.
6.3.2.
To accelerate inference, we employ ReDrafter [4], a
recurrent speculative decoding technique that uses an
RNN-based drafter to propose and verify multiple to-
kens per decoding step. We trained a drafter capable
of proposing up to three tokens at each step, with all
three tokens being accepted in approximately 65% of
the steps.
For training ReDrafter, we sampled a random sub-
set of 100k problems from theOpenMathReasoning
dataset. With the target model, we generated one
solution per problem, leveraging the resulting data to
train the drafter.
Table
9

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
Model Prompt Unfinished (in %)
1.5B 2.23
7B CoT 0.98
14B 1.13
1.5B 40.31
7B TIR 6.45
14B 4.06
Table 8: Percentage of unfinished solutions on the
Comp-Math-24-25 dataset. We generate 32k tokens
and consider solution unfinished if it does not
contain “∖boxed”.
tion techniques and the speculative decoding method,
analyzing their impact on both the inference speed
and the accuracy.
We experimented with various sampling parame-
ters but observed minimal differences in the results.
We thus based our winning submission on analmost
greedysearch strategy by setting temperature to 0 and
enabling theredrafter_greedy_searchparameter.
Despite these settings TensorRT-LLM still produced
varying outputs within a single batch of identical
prompts. We did not investigate this behavior in
detail, but we suspect that it could be related to an
accumulation of small numerical errors which cause
a few tokens to be different early on in the gener-
ation. This difference then accumulates over many
tokens resulting in a substantially diverse solution
set at the end. Ultimately, we chose this approach
because it provided more stable results at small batch
sizes and offered a small improvement in the speed of
speculative decoding.
6.3.3.
Models were served via a FastAPI backend powered
by Nemo-Skills [22], which supports time-constrained
generation. This allowed us to dynamically limit
response times per question—if an answer wasn’t
completed within the window, we returned early to
check for extractable results.
Nemo-Skills’ async generation enabled batched pro-
cessing with early stopping. For example, in a batch
of 16, if the first 4-5 completions agreed on the final
answer, we canceled the remaining generations and
proceeded to the next question. We also mitigated
stragglers—samples that ran significantly longer than
others—by canceling the last&#3627408475;pending requests once
the rest finished. This early stopping increased re-
sponse relevance as shorter answers tended to be
higher quality.
Method Speed (tok/s) AIME24 AIME25
BF16 210 82.7 66.7
W8A16 (int8) 315 82.7 66.7
W4A16 (int4) 436 72.7 60.7
FP8 310 83.3 68.7
FP8+ReDrafter 554 81.3 71.3
Table 9: Submission pipeline with different
optimizations methods benchmarked on 4 x L4 GPU.
Reported scores are maj@12 on the merged model
averaged over 5 runs each.Boldindicates
configuration used in our winning submission.
Model maj@16 pass@16 length code
CoT 62.9 76.2 11203 -
TIR 66.8 80.1 15834 2.73
CoT*0.3 + TIR*0.7 69.1 81.3 12489 0.85
Table 10: Accuracy and generation statistics of
merged models on Comp-Math-24-25 dataset.
lengthindicates the average number of tokens per
solution, whilecoderefers to the average number of
code executions per solution.
6.3.4.
A buffering strategy was implemented, allocating 350
seconds per question as the base time limit. If a
question completed early, the unused time was added
to a shared buffer. The next question could then draw
up to 210 extra seconds from this buffer, allowing a
maximum of 560 seconds when combined with its
base allocation.
6.3.5.
For tool-calling capabilities, we used Nemo-Skills’s
code execution wrapper to enable tool-integrated rea-
soning. A Flask-based sandbox environment han-
dles parallelized Python execution for each inference
thread, processing LLM-generated code blocks with
strict safeguards:
•Maximum 6 code calls per generation cycle
•2 second timeout for each code execution
•
Only the first 200 characters of the code output
were shown back to the LLM
The system feeds back either execution results or error
traces into the generation process, enabling iterative
reasoning while maintaining computational efficiency.
10

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
6.4.
Our Kaggle submission is based on an early devel-
opment version of the finalOpenMath-Nemotron-14B
model. This model was trained on a smaller dataset,
did not have GenSelect capability, and could not
switch between CoT and TIR modes by changing the
prompt. While we did have a much better check-
point towards the end of the competition, we were
ultimately unable to make a high-scoring submission
with it. In this section, we explore several potential
explanations for why this happened.
High variance in scores.The competition rules
allow only a single submission per day. Since the pub-
lic leaderboard consists of only 50 problems presented
in random order, we observed substantial variance
across our submissions. This made it hard to make
quick decisions on which directions to prioritize, es-
pecially in cases when our local evaluations disagreed
with the leaderboard scores.
Focus on smaller models.As shown in Table
OpenMath-Nemotron-7Bmodel performs comparably
or better than the 14B model used in Kaggle. Ob-
serving this towards the end of the competition, we
tried to prioritize submissions with the smaller model,
allowing it more generations, and also increased the
maximum generation length. Yet we were unable to
see comparable leaderboard scores. This discrepancy
suggests that either our local evaluation set differs
substantially from what was used in Kaggle, or that
the smaller models struggle with a few particularly
challenging problems—a limitation difficult to detect
through aggregate benchmark scores alone.
Longer average generations.Our local eval-
uations always had a fixed token budget for each
generation. However, the time management logic im-
plemented in Kaggle (Section) heavily relied on
solving easy problems quickly to allocate more time
for challenging ones. Interestingly, we discovered that
although our final models achieved superior scores
within the same token budget, they produced around
10%more tokenson average. Not realizing this early
enough, we were unable to fix this undesirable feature
before the end of the competition.
7.
7.1.
Tool-augmented approaches to mathematical problem
solving have advanced rapidly in recent years. A semi-
nal contribution by Chen et al. [3] introduced the Pro-
gram of Thoughts (PoT) framework, which integrates
natural language with executable code to support
step-by-step reasoning through a hybrid of textual
and programmatic logic. Building on this foundation,
subsequent research has focused on developing both
datasets and models that facilitate tool-integrated
reasoning.
On the data side, OpenMathInstruct-1 [31] offers
1.8 million instruction-tuning examples derived from
code interpreters across benchmarks such as GSM8K
and MATH. Similarly, InfinityMATH [44] introduces
100K instances of programmatic reasoning, while
MARIO [16] combines model reasoning with tool out-
puts, accompanied by a dataset constructed from
GSM8K [5] and MATH [11]. These resources have
significantly enriched the training landscape for tool-
augmented reasoning systems.
On the modeling side, Qwen2.5 [40] introduced a
series of models with strong mathematical reason-
ing capabilities, supporting advanced techniques like
Chain-of-Thought (CoT) and Tool-Integrated Reason-
ing (TIR). Gao et al. [8] proposed a two-stage method:
training large language models to generate reasoning
chains, and then invoking domain-specific tools to ex-
ecute each step by injecting the necessary knowledge.
Xiong et al. [38] proposed a multi-turn, online, itera-
tive direct preference learning framework tailored to
this unique context. By incorporating feedback from
code interpreters during the training process, their ap-
proach achieves significant performance improvements
on the MATH benchmark. Wu et al. [36] dynamically
integrate web search, code execution, and structured
reasoning with contextual memory to tackle complex
problems that demand deep research and multistep
logical deduction. Li et al. [15] developed a Tool-
Integrated Reinforcement Learning framework that
autonomously utilizes computational tools by scaling
reinforcement learning directly from base models, and
demonstrate substantial improvements compared to
RL without tools.
7.2.
Conventional reward models and verifiers are often
trained as discriminative binary classifiers, under-
utilizing the generative strengths of large language
models (LLMs). To address this, Generative Reward
Models (GenRM) [19], introduced by Mahan et al.,
reformulate verification as a generation task—using
the log probabilities of tokens like “Yes” or “No” to
represent correctness. This framing allows GenRM
to better exploit LLMs’ natural language generation
capabilities, leading to improved alignment with hu-
man judgments across both in-distribution and out-of-
distribution tasks. Concurrently, Zhang et al. [45] in-
troduced Generative Verifiers, training CoT-GenRM
with a supervised fine-tuning (SFT) objective to serve
11

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
as a verifier for mathematical reasoning. Building on a
similar motivation, Ankner et al. [1] combined Chain-
of-Thought (CoT) reasoning generation with Bradley-
Terry reward modeling, enabling reward models to
explicitly reason about response quality before assign-
ing scores. Extending this line of work, Wang et al.
[33] proposed self-taught evaluators, jointly training
generative models and LLM-as-a-Judge frameworks to
produce both intermediate reasoning traces and final
judgments. In related approaches, Wang et al. [32]
trained large language models as generative judges
by leveraging Direct Preference Optimization (DPO)
on both positive and negative data, demonstrating
improved evaluation performance across diverse tasks.
Wu et al. [37] introduced a Meta-Rewarding step in
the self-improvement process, enabling the model to
evaluate its own judgments and use the feedback to
refine its evaluation capabilities.
7.3.
In the pursuit of improving mathematical reasoning
in large language models, researchers have recently
introduced several large-scale, high-quality datasets.
Skywork-MathQA [43] stands out with its 2.5 million
question-answer pairs, generated using a trio of aug-
mentation methods and built upon a varied set of
foundational problems. Complementing this, Numi-
naMath [13] offers 860K challenging competition-style
math problems, each carefully annotated with step-by-
step reasoning chains [34], enabling more interpretable
and structured model outputs.
More recent work has focused on advancing the com-
plexity and depth of reasoning. New datasets have
emerged that emphasize challenging questions paired
with rich, multi-step reasoning traces, pushing models
to handle more sophisticated mathematical thought
processes. BackMATH was introduced in [46], a novel
dataset centered on backward reasoning. It contains
approximately 14K problems specifically designed to
support backward problem-solving, along with 100K
detailed reasoning steps that trace the reverse logical
flow from solution to problem. The OpenR1 team re-
leased OpenR1-Math-220K [25], a large-scale dataset
for mathematical reasoning comprising 220K math
problems. Each problem includes two to four reason-
ing traces generated by DeepSeek R1, based on prob-
lems from NuminaMath 1.5 [14]. In addition, Zhao et
al. [47] presented AM-DeepSeek-R1-Distilled, a large-
scale dataset featuring 1.4 million question-response
pairs with associated thinking traces for general rea-
soning tasks. This dataset is composed of high-quality,
challenging problems aimed at advancing reasoning
capabilities. Following a similar direction, Liu et al.
[17] introduced a Chinese version of the DeepSeek-R1
distilled dataset, consisting of 110K question-solution
pairs. The DolphinR1 team [28] released a dataset of
800K samples, combining outputs from various rea-
soning models, including DeepSeek-R1, Gemini 2.0
Flash Thinking, and Dolphin Chat.
8.
In this paper, we present our winning submission to
the AIMO-2 competition and a pipeline for developing
state-of-the-art mathematical reasoning models. Our
contributions can be summarized as follows:
•
We develop a method to combine code execution
with long chain-of-thought (CoT) generations
to produce tool-integrated reasoning (TIR) solu-
tions.
•
We create a pipeline for training models to gen-
erate samples that select the most promising
solution from multiple candidates (GenSelect).
•
We release a large-scaleOpenMathReasoning
dataset. It contains 540K unique mathemati-
cal problems, 3.2M long chain-of-thought (CoT)
solutions, 1.7M long tool-integrated reasoning
(TIR) solutions, and 566K generative solution
selection (GenSelect) traces.
•
We release a series ofOpenMath-Nemotronmod-
els capable of operating in CoT, TIR, or GenS-
elect inference modes. With this release, we
establish a new state-of-the-art in mathematical
reasoning among open-weight models.
References
[1]
Zachary Ankner, Mansheej Paul, Brandon
Cui, Jonathan D Chang, and Prithviraj Am-
manabrolu. Critique-out-Loud Reward Models.
arXiv preprint arXiv:2408.11791, 2024.
[2]
bloc97. Ntk-aware scaled rope allows
llama models to have extended (8k+)
context size without any fine-tuning
and minimal perplexity degradation.
https://www.reddit.com/r/LocalLLaMA/
comments/14lz7j5/ntkaware_scaled_rope_
allows_llama_models_to_have/, 2023.
[3]
Wenhu Chen, Xueguang Ma, Xinyi Wang, and
William W. Cohen. Program of Thoughts
Prompting: Disentangling Computation from
Reasoning for Numerical Reasoning Tasks.
TMLR, 2023.
[4]
Yunfei Cheng, Aonan Zhang, Xuanyu Zhang,
Chong Wang, and Yi Wang. Recurrent Drafter
12

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
for Fast Speculative Decoding in Large Language
Models.arXiv preprint arXiv:2403.09919, 2024.
[5]Karl Cobbe, Vineet Kosaraju, Mohammad Bavar-
ian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton,
Reiichiro Nakano, Christopher Hesse, and John
Schulman. Training Verifiers to Solve Math Word
Problems.arXiv preprint arXiv:2110.14168,
2021.
[6]
[7]
Simon Frieder, Sam Bealing, Arsenii Nikolaiev,
Geoff C. Smith, Kevin Buzzard, Timo-
thy Gowers, Peter J. Liu, Po-Shen Loh,
Lester Mackey, Leonardo de Moura, Dan
Roberts, D. Sculley, Terence Tao, David
Balduzzi, Simon Coyle, Alex Gerko, Ryan
Holbrook, Addison Howard, and XTX Mar-
kets. Ai mathematical olympiad - progress
prize 2.https://kaggle.com/competitions/
ai-mathematical-olympiad-progress-prize-2
,
2024. Kaggle.
[8]
Silin Gao, Jane Dwivedi-Yu, Ping Yu, Xiao-
qing Ellen Tan, Ramakanth Pasunuru, Olga
Golovneva, Koustuv Sinha, Asli Celikyilmaz, An-
toine Bosselut, and Tianlu Wang. Efficient Tool
Use with Chain-of-Abstraction Reasoning.arXiv
preprint arXiv:2401.17464, 2024.
[9]
Charles Goddard, Shamane Siriwardhana,
Malikeh Ehghaghi, Luke Meyers, Vladimir
Karpukhin, Brian Benedict, Mark McQuade, and
Jacob Solawetz. Arcee’s MergeKit: A Toolkit for
Merging Large Language Models, 2024.
[10]
Daya Guo, Dejian Yang, Haowei Zhang, Junx-
iao Song, Ruoyu Zhang, Runxin Xu, Qihao
Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al.
DeepSeek-R1: Incentivizing Reasoning Capabil-
ity in LLMs via Reinforcement Learning.arXiv
preprint arXiv:2501.12948, 2025.
[11]
Dan Hendrycks, Collin Burns, Saurav Kada-
vath, Akul Arora, Steven Basart, Eric Tang,
Dawn Song, and Jacob Steinhardt. Measuring
Mathematical Problem Solving With the MATH
Dataset. InNeurIPS Datasets and Benchmarks,
2021.
[12]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam
Richardson, Ahmed El-Kishky, Aiden Low, Alec
Helyar, Aleksander Madry, Alex Beutel, Alex
Carney, et al. Openai o1 system card.arXiv
preprint arXiv:2412.16720, 2024.
[13]
Jia Li, Edward Beeching, Lewis Tunstall, Ben
Lipkin, Roman Soletskyi, Shengyi Huang, Kashif
Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen,
et al. NuminaMath: The largest public dataset in
AI4Maths with 860k pairs of competition math
problems and solutions, 2024.
[14]
Jia LI, Edward Beeching, Lewis Tunstall, Ben
Lipkin, Roman Soletskyi, Shengyi Costa Huang,
Kashif Rasul, Longhui Yu, Albert Jiang, Ziju
Shen, Zihan Qin, Bin Dong, Li Zhou, Yann
Fleureau, Guillaume Lample, and Stanislas
Polu. NuminaMath.[https://huggingface.
co/AI-MO/NuminaMath-1.5](https://github.
com/project-numina/aimo-progress-prize/
blob/main/report/numina_dataset.pdf)
,
2024.
[15]
Xuefeng Li, Haoyang Zou, and Pengfei Liu.
Torl: Scaling tool-integrated rl.arXiv preprint
arXiv:2503.23383, 2025.
[16]
Minpeng Liao, Wei Luo, Chengxi Li, Jing
Wu, and Kai Fan. MARIO: MAth Reasoning
with code Interpreter Output–A Reproducible
Pipeline.arXiv preprint arXiv:2401.08190, 2024.
[17]Cong Liu, Zhong Wang, ShengYu Shen,
Jialiang Peng, Xiaoli Zhang, ZhenDong Du,
and YaFang Wang. The chinese dataset
distilled from deepseek-r1-671b.https:
//huggingface.co/datasets/Congliu/
Chinese-DeepSeek-R1-Distill-data-110k
,
2025.
[18]
Ilya Loshchilov and Frank Hutter. Decoupled
Weight Decay Regularization. InICLR, 2019.
[19]
Dakota Mahan, Duy Van Phung, Rafael Rafailov,
and Chase Blagden1 Nathan Lile1 Louis Castri-
cato. Generative Reward Models.arXiv preprint
arXiv:2410.12832, 2024.
[20]
Sadegh Mahdavi, Muchen Li, Kaiwen Liu,
Christos Thrampoulidis, Leonid Sigal, and
Renjie Liao. Leveraging Online Olympiad-
Level Math Problems for LLMs Training and
Contamination-Resistant Evaluation.arXiv
preprint arXiv:2501.14275, 2025.
[21]
Niklas Muennighoff, Zitong Yang, Weijia Shi,
Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi,
Luke Zettlemoyer, Percy Liang, Emmanuel Can-
dès, and Tatsunori Hashimoto. s1: Simple test-
time scaling.arXiv preprint arXiv:2501.19393,
2025.
[22]
NVIDIA. NeMo-Skills.https://github.com/
NVIDIA/NeMo-Skills.
13

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
[23]
NVIDIA. TensorRT-LLM. https://github.
com/NVIDIA/TensorRT-LLM.
[24]
NVIDIA. TensorRT-LLM Best Practices.
https://nvidia.github.io/TensorRT-LLM/
blogs/quantization-in-TRT-LLM.html.
[25]
OpenR1 Team. OpenR1 Math 220k, February
2025. Dataset available on Hugging Face.
[26]
Gerald Shen, Zhilin Wang, Olivier Delalleau, Ji-
aqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun,
Jimmy Zhang, Sahil Jain, Ali Taghibakhshi, et al.
Nemo-aligner: Scalable toolkit for efficient model
alignment.arXiv preprint arXiv:2405.01481,
2024.
[27]
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Mur-
tadha, Bo Wen, and Yunfeng Liu. RoFormer:
enhanced transformer with rotary position em-
bedding.arXiv preprint arXiv:2104.09864, 2021.
[28]DolphinR1 Team. Dolphin r1.
https://huggingface.co/datasets/
cognitivecomputations/dolphinr1
, February
2025. Accessed April 2025.
[29]
Qwen Team. QwQ-32B: Embracing the Power
of Reinforcement Learning, March 2025.
[30]
Shubham Toshniwal, Wei Du, Ivan Moshkov,
Branislav Kisacanin, Alexan Ayrapetyan, and
Igor Gitman. OpenMathInstruct-2: Accelerating
AI for Math with Massive Open-Source Instruc-
tion Data. InICLR, 2025.
[31]
Shubham Toshniwal, Ivan Moshkov, Sean Naren-
thiran, Daria Gitman, Fei Jia, and Igor Gitman.
OpenMathInstruct-1: A 1.8 Million Math In-
struction Tuning Dataset. InNeurIPS Datasets
and Benchmarks, 2024.
[32]
Peifeng Wang, Austin Xu, Yilun Zhou, Caiming
Xiong, and Shafiq Joty. Direct Judgement Pref-
erence Optimization.arXiv preprint 2409.14664,
2024.
[33]
Tianlu Wang, Ilia Kulikov, Olga Golovneva,
Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu,
Richard Yuanzhe Pang, Maryam Fazel-Zarandi,
Jason Weston, and Xian Li. Self-Taught Evalua-
tors.arXiv preprint arXiv:2408.02666, 2024.
[34]
Xuezhi Wang, Jason Wei, Dale Schuurmans,
Quoc V. Le, Ed H. Chi, Sharan Narang,
Aakanksha Chowdhery, and Denny Zhou. Self-
Consistency Improves Chain of Thought Reason-
ing in Language Models. InICLR, 2023.
[35]
Jason Wei, Xuezhi Wang, Dale Schuurmans,
Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le,
Denny Zhou, et al. Chain-of-thought prompt-
ing elicits reasoning in large language models.
NeurIPS, 2022.
[36]
Junde Wu, Jiayuan Zhu, and Yuyuan Liu.
Agentic Reasoning: Reasoning LLMs with
Tools for the Deep Research.arXiv preprint
arXiv:2502.04644, 2025.
[37]
Tianhao Wu, Weizhe Yuan, Olga Golovneva,
Jing Xu, Yuandong Tian, Jiantao Jiao, Ja-
son Weston, and Sainbayar Sukhbaatar. Meta-
Rewarding Language Models: Self-Improving
Alignment with LLM-as-a-Meta-Judge.arXiv
preprint arXiv:2407.19594, 2024.
[38]
Wei Xiong, Chengshuai Shi, Jiaming Shen,
Aviv Rosenberg, Zhen Qin, Daniele Calandriello,
Misha Khalman, Rishabh Joshi, Bilal Piot, Mo-
hammad Saleh, et al. Building math agents with
multi-turn iterative preference learning.arXiv
preprint arXiv:2409.02392, 2024.
[39]
An Yang, Baosong Yang, Beichen Zhang,
Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan
Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan
Lin, Jian Yang, Jianhong Tu, Jianwei Zhang,
Jianxin Yang, Jiaxi Yang, Jingren Zhou, Jun-
yang Lin, Kai Dang, Keming Lu, Keqin Bao,
Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei
Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao
Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren,
Xuancheng Ren, Yang Fan, Yang Su, Yichang
Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru
Zhang, and Zihan Qiu. Qwen2.5 Technical Re-
port, 2025.
[40]
An Yang, Beichen Zhang, Binyuan Hui, Bofei
Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu,
Jianhong Tu, Jingren Zhou, Junyang Lin, Kem-
ing Lu, Mingfeng Xue, Runji Lin, Tianyu Liu,
Xingzhang Ren, and Zhenru Zhang. Qwen2.5-
Math Technical Report: Toward Mathematical
Expert Model via Self-Improvement, 2024.
[41]
Shuo Yang, Wei-Lin Chiang, Lianmin Zheng,
Joseph E. Gonzalez, and Ion Stoica. Rethinking
Benchmark and Contamination for Language
Models with Rephrased Samples, 2023.
[42]
Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern,
Shijie Xia, and Pengfei Liu. LIMO: Less is More
for Reasoning.arXiv preprint arXiv:2502.03387,
2025.
[43]
Liang Zeng, Liangjun Zhong, Liang Zhao, Tian-
wen Wei, Liu Yang, Jujie He, Cheng Cheng, Rui
14

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
Hu, Yang Liu, Shuicheng Yan, Han Fang, and
Yahui Zhou. Skywork-Math: Data Scaling Laws
for Mathematical Reasoning in Large Language
Models – The Story Goes On, 2024.
[44]
Bo-Wen Zhang, Yan Yan, Lin Li, and Guang Liu.
InfinityMATH: A Scalable Instruction Tuning
Dataset in Programmatic Mathematical Reason-
ing.arXiv preprint arXiv:2408.07089, 2024.
[45]
Lunjun Zhang, Arian Hosseini, Hritik Bansal,
Mehran Kazemi, Aviral Kumar, and Rishabh
Agarwal. Generative Verifiers: Reward Model-
ing as Next-Token Prediction.arXiv preprint
arXiv:2408.15240, 2024.
[46]
Shaowei Zhang and Deyi Xiong. BackMATH:
Towards Backward Reasoning for Solving Math
Problems Step by Step. InProceedings of the
31st International Conference on Computational
Linguistics: Industry Track, pages 466–482, 2025.
[47]Han Zhao, Haotian Wang, Yiping Peng, Sitong
Zhao, Xiaoyu Tian, Shuaiting Chen, Yunjie Ji,
and Xiangang Li. 1.4 Million Open-Source Dis-
tilled Reasoning Dataset to Empower Large
Language Model Training.arXiv preprint
arXiv:2503.19633, 2025.
15

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
A.
A.1.
Prompt: Binary Problem Classification
I w i l l p r o v i d e a math problem , and you need t o d e t e r m i n e whether i t i s a b i n a r y
q u e s t i o n .
Respond o n l y with ’ b i n a r y ’ i f t h e problem meets t h e c r i t e r i a , and ’ not b i n a r y ’
o t h e r w i s e .
A problem q u a l i f i e s a s a b i n a r y q u e s t i o n i f and o n l y i f :
1 . The problem e x p l i c i t l y a s k s f o r a b i n a r y r e s p o n s e , such a s " y e s o r no " , " t r u e o r
f a l s e " , o r a n o t h e r e q u i v a l e n t two−c h o i c e r e s p o n s e .
2 . The problem i s p h r a s e d a s a q u e s t i o n o r s t a t e m e n t t h a t n a t u r a l l y l e a d s t o a b i n a r y
r e s p o n s e ( e . g . , " I s t h i s t r u e ? " o r " Determine whether t h e s t a t e m e n t i s t r u e o r f a l s e " )
.
I f t h e problem d o e s not e x p l i c i t l y a s k f o r a b i n a r y r e s p o n s e , even i f i t can be
i n t e r p r e t e d t h a t way , i t s h o u l d be c l a s s i f i e d a s ’ not b i n a r y q u e s t i o n ’ .
Here a r e a few examples .
Example 1
Problem :
I s i t t r u e t h a t $0 . 4 3 9 5 3 0 8 9 9 9 9 9 9 \ \ l d o t s = 0 . 4 3 9 5 3 0 9 $ ?
Output : b i n a r y
Example 2
Problem :
Write f i r s t s e v e r a l terms o f a g e o m e t r i c p r o g r e s s i o n i n which t h e d i f f e r e n c e between
t h e t h i r d and f i r s t terms i s e q u a l t o 9 , and t h a t between t h e f i f t h and t h i r d terms
e q u a l 3 6 .
Output : not b i n a r y
Example 3
Problem :
S o l v e t h e f o l l o w i n g e q u a t i o n s : $ \ f r a c {{\ s i n ( 6 0 ^ \ c i r c+x )+\ s i n ( 6 0 ^ \ c i r c−x ) }}{{2}} =\ f r a c
{{\ tan x }}{{(1+\ tan ^2 x ) ^2}}+\ f r a c {{\ c o t x }}{{(1+\ c o t ^2 x ) ^2}} $
Output : not b i n a r y
Example 4
Problem :
Given t h e q u a d r a t i c e x p r e s s i o n \ \ ( ax ^2 + bx + c \ \ ) with c o e f f i c i e n t s \ \ ( a , b , c \ \ )
such t h a t \ \ ( b−c > a \ \ ) and \ \ ( a \\ neq 0 \ \ ) , i s i t t r u e t h a t t h e e q u a t i o n \ \ (
ax ^2 + bx + c = 0 \ \ ) a l w a y s has two d i s t i n c t r e a l r o o t s ?
Output : b i n a r y
16

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
Example 5 :
Problem :
Can t h e v e r t i c e s o f a cube be c o l o r e d i n red , y e l l o w , and b l u e such t h a t e v e r y s e t o f
f o u r c o p l a n a r v e r t i c e s c o n t a i n s a l l t h r e e c o l o r s ?
Output : b i n a r y
Example 6 :
Problem :
Can t h e numbers \ \ ( \ \ f r a c {{14 x + 5 } } { { 9 } } \ \ ) and \ \ ( \ \ f r a c {{17 x−4 } } { { 1 2 } } \ \ ) both be
i n t e g e r s f o r some i n t e g e r \ \ ( x \ \ ) ? I f so , f i n d t h a t i n t e g e r .
Output : not b i n a r y
Example 7 :
Problem :
Can t h e d i s t a n c e s from a p o i n t on t h e p l a n e t o t h e v e r t i c e s o f a c e r t a i n s q u a r e be
e q u a l t o $1 , 1 , 2 , $ and $3$ ?
Output : b i n a r y
Now h e r e i s t h e problem you need t o e x t r a c t t h e answer from .
Problem :
{ problem }
Output :
A.2.
Prompt: Valid Problem Classification
I w i l l p r o v i d e a problem s t a t e m e n t from a math forum . Your t a s k i s t o d e t e r m i n e
whether i t i s a v a l i d , s o l v a b l e math problem based on t h e g i v e n t e x t .
Respond with ’ not i n v a l i d ’ i f t h e problem meets a l l o f t h e f o l l o w i n g c o n d i t i o n s :
1 . I t i s a w e l l−d e f i n e d math q u e s t i o n , such a s s o l v i n g an e q u a t i o n , f i n d i n g a minimum ,
computing an e x p r e s s i o n , o r p r o v i n g a r e s u l t .
2 . I t c o n t a i n s enough i n f o r m a t i o n t o be s o l v e d u s i n g s t a n d a r d m a t h e m a t i c a l t e c h n i q u e s ,
even i f t h e s o l u t i o n r e q u i r e s advanced c o n c e p t s ( e . g . , l i m i t s , l o g a r i t h m s , r e c u r s i o n )
.
3 . I t i s not j u s t a c o n c e p t u a l o r d e f i n i t i o n a l q u e s t i o n ( e . g . , " What d o e s t h e n o t a t i o n
mean ? " i s not a v a l i d math problem ) .
4 . I t d o e s not r e l y on e x t e r n a l r e s o u r c e s such a s images o r m i s s i n g c o n t e x t .
Otherwise , r e s p o n d with ’ i n v a l i d ’ , but o n l y i f t h e r e i s a c l e a r and s t r o n g r e a s o n why
t h e problem c ann ot be s o l v e d . I f you a r e u n c e r t a i n , d e f a u l t t o ’ not i n v a l i d ’ .
Important Notes :
1 . The v a s t m a j o r i t y (>99%) o f p r o b l e m s w i l l be v a l i d math p r o b l e m s .
2 . Only e x t r e m e l y r a r e c a s e s a r e i n v a l i d , such a s : Problems r e l y i n g on e x t e r n a l images
o r m i s s i n g d e f i n i t i o n s . Vague o r i n c o m p l e t e s t a t e m e n t s t h a t ca nn ot be i n t e r p r e t e d
m a t h e m a t i c a l l y . Open−ended c o n c e p t u a l d i s c u s s i o n s r a t h e r than problem−s o l v i n g .
17

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
3 . A problem i s s t i l l v a l i d even i f s o l v i n g i t r e q u i r e s advanced methods l i k e
r e c u r s i o n , l i m i t s , o r l o g a r i t h m s .
4 . Do not e v a l u a t e whether t h e problem has a s o l u t i o n o r not .
5 . Do not a n a l y z e t h e d i f f i c u l t y o f t h e problem o r t h e methods r e q u i r e d t o s o l v e i t .
6 . Only c h e c k whether i t i s a w e l l−formed math problem t h a t can be m e a n i n g f u l l y
i n t e r p r e t e d .
Here a r e a few examples .
Example 1
Problem :
S o l v e t h e e q u a t i o n \ \ ( \ \ l o g ( x−2 ) ( 2 x−3 ) = \\ l o g ( x ^ 2 ) \ \ ) .
Output : not i n v a l i d
Example 2
Problem :
S o l v e t h e math problem found on Facebook ( image p r o v i d e d )
Output : i n v a l i d
Example 3
Problem :
S o l v e t h e f o l l o w i n g e q u a t i o n s : $ \ f r a c {{\ s i n ( 6 0 ^ \ c i r c+x )+\ s i n ( 6 0 ^ \ c i r c−x ) }}{{2}} =\ f r a c
{{\ tan x }}{{(1+\ tan ^2 x ) ^2}}+\ f r a c {{\ c o t x }}{{(1+\ c o t ^2 x ) ^2}} $
Output : not i n v a l i d
Example 4
Problem :
Find t h e a r e a o f s q u a r e T?
Output : i n v a l i d
Example 5 :
Problem :
P r o v i d e a n o t h e r example o f a s i m i l a r problem i n v o l v i n g r e m a i n d e r s and s q u a r i n g a
number .
Output : i n v a l i d
Example 6 :
Problem :
What d o e s t h e n o t a t i o n $ \\ v e c {{B}} $ mean i n t h e c o n t e x t o f v e c t o r s ?
Output : i n v a l i d
18

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
Example 7 :
Problem :
I s t h e r e a q u i c k way t o m u l t i p l y 59 and 61? I f so , e x p l a i n t h e method
Output : i n v a l i d
Example 8 :
Problem :
None\n\n ( Note : There i s o n l y one problem i n t h e g i v e n forum p o s t . )
Output : i n v a l i d
Example 9 :
Problem :
I f $a+b=31$ and $ab=240$ , f i n d t h e sum o f t h e r e c i p r o c a l s o f $a$ and $b$ .
Output : not i n v a l i d
Example 1 0 :
Problem :
What i s t h e v a l u e o f $35461 ^54593428 $ \\mod 11 $ ?
Output : not i n v a l i d
Now h e r e i s t h e problem you need t o e x t r a c t t h e answer from .
Problem :
{ problem }
Output :
A.3.
Prompt: Multiple Choice Problem Classification
I w i l l p r o v i d e a math problem , and you need t o d e t e r m i n e whether i t i s a m u l t i p l e−
c h o i c e problem .
Respond o n l y with ’ mcq ’ i f t h e problem meets t h e c r i t e r i a , and ’ not mcq ’ o t h e r w i s e .
A m u l t i p l e−c h o i c e problem must s a t i s f y a l l o f t h e f o l l o w i n g c o n d i t i o n s :
1 . The problem e x p l i c i t l y p r e s e n t s a s e t o f answer c h o i c e s t o s e l e c t from .
2 . The problem a s k s f o r a f i n a l answer r a t h e r than r e q u i r i n g a p r o o f , j u s t i f i c a t i o n ,
o r e x p l a n a t i o n .
3 . The problem has a t l e a s t one c o r r e c t answer among t h e g i v e n c h o i c e s .
I f t h e problem d o e s not i n c l u d e answer c h o i c e s , even i f i t has a n u m e r i c a l answer , i t
s h o u l d be c l a s s i f i e d a s ’ not mcq ’ .
Here a r e a few examples .
Example 1
Problem :
S i m p l i f y t h e e x p r e s s i o n \ \ ( \ \ f r a c {{{{2}} \\ s q r t {{6}}}}{{\\ s q r t {{2}}}} + \\ s q r t {{3}} +
\\ s q r t { { 5 } } \ \ ) and c h o o s e t h e c o r r e c t o p t i o n from t h e f o l l o w i n g : \ n\n
19

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
A. \ \ ( \ \ s q r t {{2}} + \\ s q r t {{3}}−\\ s q r t { { 5 } } \ \ ) \n\n
B . \ \ ( 4−\\ s q r t {{2}}−\\ s q r t { { 3 } } \ \ ) \n\n
C . \ \ ( \ \ s q r t {{2}} + \\ s q r t {{3}} + \\ s q r t {{6}}−5 \ \ ) \n\n
D. \ \ ( \ \ f r a c {{1}}{{2}} ( \ \ s q r t {{2}} + \\ s q r t {{5}}−\\ s q r t { { 3 } } ) \ \ ) \n\n
E . \ \ ( \ \ f r a c {{1}}{{3}} ( \ \ s q r t {{3}} + \\ s q r t {{5}}−\\ s q r t { { 2 } } ) \ \ )
Output : mcq
Example 2
Problem :
Write f i r s t s e v e r a l terms o f a g e o m e t r i c p r o g r e s s i o n i n which t h e d i f f e r e n c e between
t h e t h i r d and f i r s t terms i s e q u a l t o 9 , and t h a t between t h e f i f t h and t h i r d terms
e q u a l 3 6 .
Output : not mcq
Example 3
Problem :
S o l v e t h e f o l l o w i n g e q u a t i o n s : $ \ f r a c {{\ s i n ( 6 0 ^ \ c i r c+x )+\ s i n ( 6 0 ^ \ c i r c−x ) }}{{2}} =\ f r a c
{{\ tan x }}{{(1+\ tan ^2 x ) ^2}}+\ f r a c {{\ c o t x }}{{(1+\ c o t ^2 x ) ^2}} $
Output : not mcq
Example 4
Problem :
S i m p l i f y t h e e x p r e s s i o n \ \ ( \ \ l o g {{\\ f r a c {{ a }}{{ b }}}} + \\ l o g {{\\ f r a c {{b }}{{ c }}}} + \\
l o g {{\\ f r a c {{ c }}{{ d }}}}−\\ l o g {{\\ f r a c {{ ay }}{{ dx } } } } \ \ ) .
Choose from t h e f o l l o w i n g o p t i o n s : \ n \ \ [ \ n
\\ t e x t b f { { (A) }}\\ \\ l o g {{\\ f r a c {{ y }}{{ x }}}}\\ qquad
\\ t e x t b f { { (B) }}\\ \\ l o g {{\\ f r a c {{ x }}{{ y }}}}\\ qquad
\\ t e x t b f { { (C) }}\\ 1\\ qquad
\\ t e x t b f { { (D) }}\\ 0\\ qquad
\\ t e x t b f { { (E) }}\\ \\ l o g {{\\ f r a c {{ a ^2 y }}{{ d^2 x }}}}\ n \ \ ]
Output : mcq
Example 5 :
Problem :
What i s t h e maximum p o s s i b l e magnitude o f t h e d i f f e r e n c e between two v e c t o r s ? Choose
from t h e f o l l o w i n g o p t i o n s and p r o v i d e r e a s o n i n g :
A. The magnitude o f one o f t h e v e c t o r s .
B . The magnitude o f both v e c t o r s .
C . The magnitude o f t h e i r sum .
D. T h e i r s c a l a r p r o d u c t .
Output : mcq
Example 6 :
Problem :
Compare t h e numbers $a$ and $b$ : $a =3(\ l o g 7 −\ l o g 5 ) , \ b=2\ l e f t ( \ f r a c {{1}}{{2}}\ l o g
9−\ f r a c {{1}}{{3}}\ l o g 8\ r i g h t ) $
Output : not mcq
20

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
Example 7 :
Problem :
Which o f t h e two numbers $31 ^{{11}} $ and $17 ^{{14}} $ i s g r e a t e r ?
Output : not mcq
Example 8 :
Problem :
Let $ABCD$ be a r e c t a n g l e and $E$ t h e r e f l e c t i o n o f $A$ with r e s p e c t t o t h e d i a g o n a l
$BD$ . I f $EB = EC$ , what i s t h e r a t i o $ \ f r a c {{AD}}{{AB}}
Output : not mcq
Now h e r e i s t h e problem you need t o e x t r a c t t h e answer from .
Problem :
{ problem }
Output :
A.4.
Prompt: Proof Problem Classification
I w i l l g i v e you a math problem and a s k t o i d e n t i f y i f i t ’ s a " p r o o f " problem .
Respond o n l y with " p r o o f " i f i t i s a p r o o f problem , and " not p r o o f " i f i t i s not .
C o n s i d e r t h e f o l l o w i n g c h a r a c t e r i s t i c s o f p r o o f p r o b l e m s :
1 . They o f t e n u s e p h r a s e s l i k e " p r o v e t h a t " , " show t h a t " , o r " d e m o n s t r a t e t h a t " .
2 . They may a s k t o j u s t i f y o r e x p l a i n why a s t a t e m e n t i s t r u e .
3 . They don ’ t have a w e l l−d e f i n e d answer i n t h e form o f a number o r e x p r e s s i o n .
Here a r e a few examples .
Example 1
Problem :
Prove t h e i d e n t i t y $a ^\ f r a c {{1}}{{2}}−\ f r a c {{ a−a ^{{−2}}{{ a ^\ f r a c {{1}}{{2}}−a^{{−\ f r a c
{{1}}{{2}} +\ f r a c {{1−a ^{{−2}}{{ a ^\ f r a c {{1}}{{2}}+ a^{{−\ f r a c {{1}}{{2}}+\ f r a c {{2}}{{ a ^\
f r a c {{3}}{{2}}=0 $
Output : p r o o f
Example 2
Problem :
Write f i r s t s e v e r a l terms o f a g e o m e t r i c p r o g r e s s i o n i n which t h e d i f f e r e n c e between
t h e t h i r d and f i r s t terms i s e q u a l t o 9 , and t h a t between t h e f i f t h and t h i r d terms
e q u a l 3 6 .
Output : not p r o o f
Example 3
Problem :
21

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
S o l v e t h e f o l l o w i n g e q u a t i o n s : $ \ f r a c {{\ s i n ( 6 0 ^ \ c i r c+x )+\ s i n ( 6 0 ^ \ c i r c−x ) }}{{2}} =\ f r a c
{{\ tan x }}{{(1+\ tan ^2 x ) ^2}}+\ f r a c {{\ c o t x }}{{(1+\ c o t ^2 x ) ^2}} $
Output : not p r o o f
Example 4
Problem :
Denoting t h e sums o f t h e f i r s t $n_1$ , f i r s t $n_2$ and f i r s t $n_3$ terms o f an
a r i t h m e t i c p r o g r e s s i o n by $S_1$ , $S_2$ and $S_3$ , r e s p e c t i v e l y , show t h a t $$ \ f r a c {{S_1
}}{{n_1 } } ( n_2−n_3 )+\ f r a c {{S_2}}{{n_2 } } ( n_3−n_1 )+\ f r a c {{S_3}}{{n_3 } } ( n_1−n_2 ) =0. $$
Output : p r o o f
Now h e r e i s t h e problem you need t o e x t r a c t t h e answer from .
Problem :
{ problem }
Output :
A.5.
Prompt: Proof Problem Conversion
I w i l l g i v e you a math problem t h a t a s k s t o p r o v e s o m e t h i n g .
Your t a s k i s t o c r e a t e an e q u i v a l e n t problem t h a t i n s t e a d has some k i n d o f n u m e r i c a l
o r e x p r e s s i o n answer t h a t can be used t o a u t o m a t i c a l l y g r a d e t h e s o l u t i o n .
Make s u r e t h e new problem i s a t l e a s t a s d i f f i c u l t a s t h e o r i g i n a l p r o o f problem .
Here a r e a few examples .
Example 1
Problem :
Prove t h a t t h e system \ b e g i n {{ a l i g n∗}} x ^6 + x ^3 + x ^3 y + y &= 147^{{157}} \\ x ^3 + x ^3
y + y ^2 + y + z ^9 &= 157^{{147}}\ end {{ a l i g n∗}} has no s o l u t i o n s i n i n t e g e r s $x$ , $y$
, and $z$ .
Output :
Let $x$ , $y$ and $z$ be a s o l u t i o n t o t h e f o l l o w i n g system o f e q u a t i o n s \ b e g i n {{ a l i g n
∗}} x ^6 + x ^3 + x ^3 y + y &= 147^{{157}} \\ x ^3 + x ^3 y + y ^2 + y + z ^9 &= 157^{{147}}\
end {{ a l i g n∗} } .
C a l c u l a t e t h e sum o f a l l p o s s i b l e v a l u e s o f $x$ .
Example 2
Problem :
A t r i a n g l e i s c a l l e d a p a r a b o l i c t r i a n g l e i f i t s v e r t i c e s l i e on a
p a r a b o l a $y = x ^2 $ . Prove t h a t f o r e v e r y n o n n e g a t i v e i n t e g e r $n$ , t h e r e
i s an odd number $m$ and a p a r a b o l i c t r i a n g l e with v e r t i c e s a t t h r e e
d i s t i n c t p o i n t s with i n t e g e r c o o r d i n a t e s with a r e a $ ( 2 ^nm) ^2 $ .
Output :
C o n s i d e r p a r a b o l i c t r i a n g l e s whose v e r t i c e s l i e on $y = x ^2 $ with i n t e g e r c o o r d i n a t e s .
Let $ f ( n ) $ be t h e s m a l l e s t p o s s i b l e v a l u e o f $c$ , where $ ( 0 , 0 ) $ , $ ( b , b ^ 2 ) $ , and $ ( c , c
^ 2 ) $ a r e v e r t i c e s o f such a t r i a n g l e with a r e a e x a c t l y $ ( 2 ^ n ) ^2$ , f o r some i n t e g e r b
22

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
where 0 < b < c .
Find $ f ( 4 ) $ .
Now h e r e i s t h e problem you need t o modify . Only ou tp ut t h e new problem ∗∗WITH NO∗∗
e x p l a n a t i o n o r n o t e s a f t e r i t .
Again , s t a r t with t h e problem r i g h t away ,∗∗DO NOT∗∗s t a r t with " Let ’ s modify t h e
g i v e n problem " o r a n y t h i n g l i k e t h a t .
Problem :
{ problem }
Output :
A.6.
Prompt: Forum Answer Extraction
I w i l l g i v e you a s e r i e s o f p o s t s from a math−r e l a t e d forum t h a t c o n t a i n one o r
s e v e r a l math p r o b l e m s and d i s c u s s i o n o f t h e i r s o l u t i o n s .
I w i l l a l s o s p e c i f y which problem I ’m c u r r e n t l y l o o k i n g a t ( i n c a s e t h e r e a r e m u l t i p l e
) .
You t a s k i s t o f i n d an answer t o t h e problem I ’m c u r r e n t l y l o o k i n g a t i n s i d e t h e forum
d i s c u s s i o n s .
The answer s h o u l d be a n u m e r i c a l v a l u e o r a m a t h e m a t i c a l e x p r e s s i o n .
I f t h e answer i s not a v a i l a b l e , out pu t " Answer not found . " i n t h e l a s t l i n e o f your
r e s p o n s e .
You can t h i n k b e f o r e s t a t i n g t h e f i n a l answer . The f i n a l l i n e o f your r e s p o n s e s h o u l d
be " Answer : < f i n a l answer >".
Here i s an example .
F i r s t forum p o s t with problem ( s ) :
This problem was e x t r a c r e d i t f o r my math c l a s s and I haven ’ t g o t t e n i t back y e t but I
’m assuming
a . ) Everyone handed i t i n
and
b . ) None o f you h e r e g o e s / t a k e s / w i l l go / t a k e my math c l a s s
Anyways :
Suppose two o f t h e z e r o e s o f t h e f o l l o w i n g f o u r t h−d e g r e e e q u a t i o n a r e t h e same and t h e
o t h e r two z e r o e s a r e t h e r e c i p r i c a l s o f each o t h e r . Find a and b .
x : ^ 4 : + ax : ^ 3 : + bx : ^ 2 : + 4 x+4=0
I t ’ s not a t a l l hard a s i t l o o k s . . . a l o t o f work though , s o I s u g g e s t o r g i n i z i n g a s
you go a l o n g .
Problem we a r e l o o k i n g a t ( i t might be r e p h r a s e d ) :
Suppose two o f t h e z e r o e s o f t h e f o u r t h−d e g r e e e q u a t i o n \ ( x ^4 + ax ^3 + bx ^2 + 4x + 4
= 0 \ ) a r e t h e same and t h e o t h e r two z e r o e s a r e r e c i p r o c a l s o f each o t h e r . Find \ ( a
\ ) and \ ( b \ ) .
Forum d i s c u s s i o n s :
Post 1 :
Tare wrote : x : ^ 4 : + ax : ^ 3 : + bx : ^ 2 : + 4 x+4=0
Here ’ s a s h o r t e r way :
[ h i d e ] Say t h e f o u r r o o t s a r e c , c , d , and 1/d . Then t h e p r o d u c t o f t h e f o u r r o o t s i s
23

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
t h e c o n s t a n t term o f t h e p o l y n o m i a l , s o c ^2=4. Then c= : pm : 2 . S i m i l a r l y , from t h e
l i n e a r term , c ^2d+c ^2/d+c+c=−4. I f we p l u g i n c =2 , we g e t d=−1, s o t h e r o o t s a r e 2 ,
2 ,−1,−1. So a = −(2+2−1−1)=−2 and b = 2∗2+2(−1)+2(−1)+2(−1)+2(−1)+(−1)(−1)=−3. I f
we p l u g i n c=−2, we g e t 4d+4/d=0 , s o d+1/d=0. Then a = −(−2−2+0)=4 and b=(−2)(−2)
+(−2) ( 0 ) +(−2)0+1=5. So e i t h e r a=−2, b=−3 o r a =4 ,b=5.
[ / h i d e ]
Thanks Tare f o r c a t c h i n g t h e m i s t a k e s .
−−Dan
Post 2 :
Well . . . i t didn ’ t s p e c i f y t h a t t h e s o l u t i o n i s r e a l and a l s o you were s u p po s ed t o g e t a
and b . . .
Output :
Seems t h a t t h e r e i s an answer a t t h e end o f t h e f i r s t p o s t . S i n c e none o f t h e o t h e r
p o s t s c o n t r a d i c t s i t , we can assume t h a t t h e answer i s c o r r e c t .
Answer : a =−2, b =−3 o r a = 4 , b = 5
Now h e r e a r e t h e p o s t s from t h e forum t h a t I ’m c u r r e n t l y l o o k i n g a t . P l e a s e f i n d t h e
answer t o t h e problem .
Don ’ t f o r g e t t o s a y " Answer not found . " i f t h e answer i s not a v a i l a b l e .
F i r s t forum p o s t with problem ( s ) :
{ forum_post }
Problem we a r e l o o k i n g a t ( i t might be r e p h r a s e d ) :
{ problem }
Forum d i s c u s s i o n s :
{ f o r u m _ d i s c u s s i o n s }
Output :
A.7.
Prompt: Forum Problem Extraction
I w i l l g i v e you a p o s t from a math−r e l a t e d forum t h a t might c o n t a i n one o r s e v e r a l
math p r o b l e m s .
Your t a s k i s t o e x t r a c t a l l p r o b l e m s o r s t a t e t h a t none a r e a v a i l a b l e .
Here a r e some g u i d e l i n e s you s h o u l d f o l l o w
−I f no p r o b l e m s a r e a v a i l a b l e , ou tp ut " No p r o b l e m s i d e n t i f i e d . "
−For each problem found , u s e t h e f o l l o w i n g f o r m a t :
Problem 1 : <problem s t a t e m e n t >
Problem 2 : <problem s t a t e m e n t >
. . .
−For each math problem you i d e n t i f y , make s u r e t o r e p h r a s e i t such t h a t i t ’ s s t a t e d
c l e a r l y and c o n c i s e l y .
Remove any redundant c o n t e x t , p e r s o n a l commentary , a n e c d o t e s , o r u n r e l a t e d i n f o r m a t i o n
.
But make s u r e not t o change t h e meaning o f t h e problem and keep a l l n e c e s s a r y
m a t h e m a t i c a l o r t e c h n i c a l d e t a i l s .
−I f m u l t i p l e p r o b l e m s t h a t you e x t r a c t a r e r e l a t e d , make s u r e t o i n c l u d e a l l t h e
c o n t e x t i n each problem s t a t e m e n t
24

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
a s t h e y w i l l be l o o k e d a t i n d e p e n d e n t l y .
Here a r e a few examples .
Example 1
Forum p o s t :
Countdown :
What i s t h e r e m a i n d e r o f 8^6+7^7+6^8 i s d i v i d e d by 5?
no c a l c u l a t o r o f c o u r s e , paper i s n ’ t needed e i t h e r , but s u r e .
Output :
Problem 1 : What i s t h e r e m a i n d e r o f $8 ^6+7^7+6^8$ when d i v i d e d by 5?
Example 2
Forum p o s t :
Q u e s t i o n 1 :
A t e t r a h e d r o n has f o u r v e r t i c e s . We can l a b e l each v e r t e x by one o f t h e f o u r d i g i t s :
$1 , 2 , 3 , 4 $ . How many non−c o n g r u e n t ways a r e t h e r e t o a s s i g n a d i f f e r e n t d i g i t t o
each v e r t e x o f a t e t r a h e d r o n ? T e t r a h e d r a a r e c o n s i d e r e d c o n g r u e n t t h r o u g h r o t a t i o n .
R e f l e c t i o n s a r e c o n s i d e r e d d i f f e r e n t .
I ’m wondering how I c o u l d approach a problem l i k e t h i s . I s t a r t e d o f f with $4 ! $ and
then d i v i d e d by $4$ t o t a k e out t h e r o t a t i o n a s p e c t . Now I am s t u c k .
Note : I ’ d r a t h e r not do c a s e work b e c a u s e I ’m s u r e t h e t e s t w r i t e r s c o u l d have e a s i l y
used an i c o s a h e d r o n , o r s o m e t h i n g e q u a l l y l e n g t h y .
Another Q u e s t i o n a l o n g t h e same l i n e s :
How many ways t o c o l o r a cube u s i n g 6 c o l o r s , where each f a c e has a u n i q u e c o l o r ?
Thanks
Output :
Problem 1 : How many non −c o n g r u e n t ways a r e t h e r e t o a s s i g n a d i f f e r e n t d i g i t t o each
v e r t e x o f a t e t r a h e d r o n ? T e t r a h e d r a a r e c o n s i d e r e d c o n g r u e n t t h r o u g h r o t a t i o n .
R e f l e c t i o n s a r e c o n s i d e r e d d i f f e r e n t .
Problem 2 : How many ways can a cube be c o l o r e d u s i n g 6 c o l o r s , where each f a c e has a
u n i q u e c o l o r ?
Example 3
Forum p o s t :
Yes ! I c o m p l e t e l y a g r e e with what you s a i d . I t ’ s been a tough t r a n s i t i o n f o r me too ,
but we ’ l l f i g u r e i t out .
Output :
No p r o b l e m s i d e n t i f i e d
Example 4
Forum p o s t :
B i l l y Bob has f o u r t e e n d i f f e r e n t p a i r s o f s o c k s i n h i s drawer . They a r e j u s t thrown
around randomly i n t h e drawer . B i l l y Bob once woke up i n a h u r r y and had t o g e t h i s
s o c k s q u i c k l y .
Without s w i t c h i n g t h e l i g h t on , he p u l l e d out enough s o c k s t o know t h a t he had a t
l e a s t one p a i r , and then he ran out o f t h e room . How many s o c k s d i d B i l l y Bob p u l l out
25

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
Output :
Problem 1 : From a drawer c o n t a i n i n g 14 d i f f e r e n t p a i r s o f s o c k s , how many s o c k s must
be p u l l e d out randomly t o e n s u r e a t l e a s t one matching p a i r ?
P l e a s e a n a l y z e t h e f o l l o w i n g forum p o s t and e x t r a c t a l l math p r o b l e m s . Here a r e t h e
g u i d e l i n e s one more time f o r your r e f e r e n c e
−I f no p r o b l e m s a r e a v a i l a b l e , ou tp ut " No p r o b l e m s i d e n t i f i e d . "
−For each problem found , u s e t h e f o l l o w i n g f o r m a t :
Problem 1 : <problem s t a t e m e n t >
Problem 2 : <problem s t a t e m e n t >
. . .
−For each math problem you i d e n t i f y , make s u r e t o r e p h r a s e i t such t h a t i t ’ s s t a t e d
c l e a r l y and c o n c i s e l y .
Remove any redundant c o n t e x t , p e r s o n a l commentary , a n e c d o t e s , o r u n r e l a t e d i n f o r m a t i o n
.
But make s u r e not t o change t h e meaning o f t h e problem and keep a l l n e c e s s a r y
m a t h e m a t i c a l o r t e c h n i c a l d e t a i l s .
−I f m u l t i p l e p r o b l e m s t h a t you e x t r a c t a r e r e l a t e d , make s u r e t o i n c l u d e a l l t h e
c o n t e x t i n each problem s t a t e m e n t
a s t h e y w i l l be l o o k e d a t i n d e p e n d e n t l y .
Forum p o s t :
{ forum_post }
Output :
B.
B.1.
TIR Inference Prompt for Stage-0 Data Generation
You a r e a math problem s o l v e r t h a t u s e s Python code a s an i n t e g r a l p a r t o f your
r e a s o n i n g .
I n your s o l u t i o n you MUST s t r i c t l y f o l l o w t h e s e i n s t r u c t i o n s :
1 . For each s t e p r e q u i r i n g complex c a l c u l a t i o n w r i t e Python code .
2 . For Python code u s e t h e f o l l o w i n g t e m p l a t e :
‘ ‘ ‘ python
# Your Python code
‘ ‘ ‘
3 . Put t h e f i n a l answer w i t h i n \ boxed { { } } .
P l e a s e r e a s o n s t e p by s t e p , and put your f i n a l answer w i t h i n \\ boxed { { } } .
u s e r : |−
S o l v e t h e f o l l o w i n g math problem u s i n g Python code f o r t h e c a l c u l a t i o n s .
{ problem }
26

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
B.2.
Prompt to evaluate TIR novelty
You w i l l be g i v e n a f r a g m e n t o f a s o l u t i o n t o a math problem t h a t i n c l u d e s a Python
code b l o c k .
Your t a s k i s t o d e t e r m i n e t h e p u r p o s e o f t h i s Python code b l o c k i n t h e s o l u t i o n
f r a g m e n t .
I n your a s s e s s m e n t , you MUST f o l l o w t h e s e g u i d e l i n e s :
1 . C l a s s i f i c a t i o n :
−V e r i f i c a t i o n : Python code i s used t o v e r i f y t h e c o r r e c t n e s s o f t h e p r e v i o u s manual
c a l c u l a t i o n s o r t o c o n f i r m some r e s u l t s . E . g . i f t h e r e s u l t o f t h e code e x e c u t i o n
e x i s t s i n t h e s o l u t i o n above , i t i s d e f i n i t e l y a v e r i f i c a t i o n .
−Novel C a l c u l a t i o n : Otherwise , i f t h e r e s u l t o f code e x e c u t i o n i s not p r e s e n t i n ANY
FORM i n t h e s o l u t i o n above , i t i s a n o v e l c a l c u l a t i o n .
I f you a r e u n s u r e about t h e c l a s s i f i c a t i o n o f s p e c i f i c code b l o c k , you MUST l a b e l i t
a s V e r i f i c a t i o n !
2 . Output Format :
−Your r e s p o n s e MUST f o l l o w t h i s e x a c t f o r m a t ( w i t h o u t e x t r a commentary o r t e x t ) :
‘ ‘ ‘
Re a son i ng : <a c o u p l e o f s e n t e n c e s e x p l a i n i n g your r a t i o n a l e >
Judgement : < V e r i f i c a t i o n o r Novel C a l c u l a t i o n >
‘ ‘ ‘
−−−
∗∗EXAMPLES∗∗
1 .
" " "
S o l u t i o n :
<Some t e x t r e a s o n i n g w i t h o u t code>
Wait , s o t h e answer i s 143? Let me v e r i f y t h i s with t h e pow f u n c t i o n .
‘ ‘ ‘ python
# Compute 7^999 mod 1000 u s i n g pow f u n c t i o n
p r i n t ( pow ( 7 , 9 9 9 , 1 0 0 0 ) ) # Should p r i n t 143
‘ ‘ ‘
‘ ‘ ‘ o ut pu t
143
‘ ‘ ‘
So t h e answer i s \\ boxed { { 1 4 3 } } .
" " "
‘ ‘ ‘
Re a son i ng : This i s f o r s u r e a v e r i f i c a t i o n , b e c a u s e t h e r e s u l t o f t h e code e x e c u t i o n
i s p r e s e n t i n t h e s o l u t i o n above . Moreover , comment i n t h e code b l o c k e x p l i c i t l y
s t a t e s t h a t i t s h o u l d p r i n t 143 which means t h a t t h e r e s u l t i s known i n advance .
Judgement : V e r i f i c a t i o n
‘ ‘ ‘
−−−
2 .
" " "
S o l u t i o n :
<Some t e x t r e a s o n i n g w i t h o u t code>
T h e r e f o r e , l e t ’ s p r o c e e d t o compute P^ 5 . I can u s e Python f o r t h i s c a l c u l a t i o n t o
e n s u r e a c c u r a c y .
F i r s t , l e t ’ s d e f i n e t h e t r a n s i t i o n m a t r i x P a s a numpy a r r a y , then compute P^ 5 , then
e x t r a c t t h e r e l e v a n t e n t r y , and d i v i d e by 3 .
Let me imp ort numpy and do t h e c a l c u l a t i o n s .
‘ ‘ ‘ python
27

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
im por t numpy a s np
# D e f i n e t h e t r a n s i t i o n m a t r i x P
P = np . a r r a y ( [
[ 0 , 1 , 0 , 0 ] ,
[ 1 / 3 , 0 , 2 / 3 , 0 ] ,
[ 0 , 2 / 3 , 0 , 1 / 3 ] ,
[ 0 , 0 , 1 , 0 ]
] )
# Compute P^5
P5 = np . l i n a l g . matrix_power (P , 5 )
# The i n i t i a l s t a t e i s s t a t e 0 , s o t h e d i s t r i b u t i o n a f t e r 5 s t e p s i s [ 1 , 0 , 0 , 0 ] @ P5
# But s i n c e P5 i s t h e t r a n s i t i o n m a t r i x a f t e r 5 s t e p s , t h e e n t r y ( 0 , 1 ) i s t h e
p r o b a b i l i t y o f g o i n g from 0 t o 1 i n 5 s t e p s .
# However , t o g e t t h e d i s t r i b u t i o n , we need t o m u l t i p l y t h e i n i t i a l d i s t r i b u t i o n by P5
.
i n i t i a l _ d i s t r i b u t i o n = np . a r r a y ( [ 1 , 0 , 0 , 0 ] )
d i s t r i b u t i o n _ a f t e r _ 5 _ s t e p s = i n i t i a l _ d i s t r i b u t i o n @ P5
# The p r o b a b i l i t y mass a t s t a t e 1 ( d i s t a n c e 1 ) a f t e r 5 s t e p s
prob_mass_at_1 = d i s t r i b u t i o n _ a f t e r _ 5 _ s t e p s [ 1 ]
# S i n c e s t a t e 1 c o r r e s p o n d s t o 3 v e r t i c e s (B, D, E) , t h e p r o b a b i l i t y o f b e i n g a t B i s
t h i s mass d i v i d e d by 3
prob_at_B = prob_mass_at_1 / 3
prob_at_B
‘ ‘ ‘
‘ ‘ ‘ o ut pu t
np . f l o a t 6 4 ( 0 . 2 5 1 0 2 8 8 0 6 5 8 4 3 6 2 0 5 )
‘ ‘ ‘
. . .
" " "
‘ ‘ ‘
Re a son i ng : The s o l u t i o n f r a g m e n t d e s c r i b e s a l g o r i t h m i c s t e p s t o c a l c u l a t e t h e
p r o b a b i l i t y and t h e code b l o c k e x e c u t e s t h e s e s t e p s . The r e s u l t o f t h e code e x e c u t i o n
i s not p r e s e n t i n t h e s o l u t i o n above i n any form . T h e r e f o r e , t h i s i s a n o v e l
c a l c u l a t i o n .
Judgement : Novel C a l c u l a t i o n
‘ ‘ ‘
−−−
3 .
" " "
S o l u t i o n :
<Some t e x t r e a s o n i n g w i t h o u t code>
Compute C( 5 1 , 5 ) :
5 1 ! / ( 5 !∗4 6 ! ) = ?
But maybe I s h o u l d c a l c u l a t e i t u s i n g Python t o be a c c u r a t e .
‘ ‘ ‘ python
im por t math
math . comb ( 5 1 , 5 )
‘ ‘ ‘
‘ ‘ ‘ o ut pu t
2349060
‘ ‘ ‘
. . .
" " "
‘ ‘ ‘
Re a son i ng : The s o l u t i o n f r a g m e n t d e s c r i b e s t h e c a l c u l a t i o n o f a c o m b i n a t o r i a l
e x p r e s s i o n and t h e code b l o c k e x e c u t e s t h i s c a l c u l a t i o n . The r e s u l t o f t h e code
e x e c u t i o n i s not p r e s e n t i n t h e s o l u t i o n above i n any form . T h e r e f o r e , t h i s i s a n o v e l
c a l c u l a t i o n .
28

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
Judgement : Novel C a l c u l a t i o n
‘ ‘ ‘
−−−
4 .
" " "
S o l u t i o n :
<Some t e x t r e a s o n i n g w i t h o u t code>
But l e t ’ s compute t h e s e v a l u e s i n Python .
‘ ‘ ‘ python
im por t math
# Given d i m e n s i o n s
R = 4 # f e e t
H = 12 # f e e t
h = 9 # f e e t from t h e t i p , s o r e m a i n i n g h e i g h t
r = ( h / H)∗R # s i n c e r /R = h/H
# O r i g i n a l volume
V _ o r i g i n a l = ( 1 / 3 )∗math . p i∗R∗∗2∗H
# Remaining volume
V_remaining = ( 1 / 3 )∗math . p i∗r∗∗2∗h
# Volume poured out
V_poured = V _ o r i g i n a l−V_remaining
V_poured
‘ ‘ ‘
‘ ‘ ‘ o ut pu t
1 1 6 . 2 3 8 9 2 8 1 8 2 8 2 2 3 5
‘ ‘ ‘
When I computed t h e volume manually , I o b t a i n e d \ ( 37\ p i \ ) c u b i c f e e t . Approximating
t h i s a s $$
37∗3 . 1 4 1 5 9 \ approx 1 1 6 . 2 3
$$ , i t c l o s e l y matches t h e Python r e s u l t o f a p p r o x i m a t e l y 1 1 6 . 2 3 8 9 . T h e r e f o r e , t h e
r e s u l t a p p e a r s t o be c o r r e c t .
. . .
" " "
‘ ‘ ‘
Re as on i ng : The r a t i o n a l e r i g h t a f t e r t h e code b l o c k s t a t e s t h a t t h e manual c a l c u l a t i o n
( t h a t happened b e f o r e t h e code b l o c k ) matches t h e Python r e s u l t . T h e r e f o r e , code
b l o c k v e r i f i e s t h e p r e v i o u s manual c a l c u l a t i o n s . So , t h i s i s a v e r i f i c a t i o n .
Judgement : V e r i f i c a t i o n
‘ ‘ ‘
−−−
∗∗REMINDER∗∗
Focus o n l y on t h e Python code b l o c k i n t h e p r o v i d e d f r a g m e n t and c l a s s i f y i t a s e i t h e r
V e r i f i c a t i o n o r Novel C a l c u l a t i o n based on whether i t s o ut put a p p e a r s i n t h e s o l u t i o n
t e x t b e f o r e t h e code .
−−−
∗∗YOUR TASK∗∗
S o l u t i o n f r a g m e n t : { f r a g m e n t }
29

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
B.3.
Prompt to evaluate TIR significance
You w i l l be g i v e n a f r a g m e n t o f a s o l u t i o n t o a math problem t h a t i n c l u d e s a Python
code b l o c k .
Your t a s k i s t o e v a l u a t e t h e s i g n i f i c a n c e o f t h i s Python code i n s o l v i n g t h e math
problem .
I n your a s s e s s m e n t , you MUST f o l l o w t h e s e g u i d e l i n e s :
1 . C l a s s i f i c a t i o n :
E v a l u a t e t h e s i g n i f i c a n c e o f t h e code ’ s c o n t r i b u t i o n by c a t e g o r i z i n g i t i n t o one o f
t h r e e l e v e l s :
−T r i v i a l : The code p e r f o r m s c a l c u l a t i o n s t h a t c o u l d e a s i l y be done manually w i t h o u t
s i g n i f i c a n t e f f o r t ( e . g . , s o l v i n g s i m p l e e q u a t i o n s , d o i n g a r i t h m e t i c , a p p l y i n g
f o r m u l a s t o known v a r i a b l e s ) . The code u s a g e p r o v i d e s no m e a n i n g f u l o r minor advantage
o v e r manual c a l c u l a t i o n .
−Moderate : The code p e r f o r m s c a l c u l a t i o n s t h a t would be t e d i o u s , e r r o r−prone , o r time
−consuming t o do manually , but s t i l l t e c h n i c a l l y p o s s i b l e ( e . g . , m a t r i x o p e r a t i o n s ,
n u m e r i c a l i n t e g r a t i o n o f s t a n d a r d f u n c t i o n s , s o l v i n g s y s t e m s o f e q u a t i o n s ) . The code
u s a g e p r o v i d e s e f f i c i e n c y but i s n ’ t e s s e n t i a l .
−S i g n i f i c a n t : The code p e r f o r m s c a l c u l a t i o n s t h a t would be p r a c t i c a l l y i m p o s s i b l e o r
e x t r e m e l y d i f f i c u l t t o do manually ( e . g . , b r u t e−f o r c i n g c o m b i n a t o r i a l problems ,
complex s i m u l a t i o n s , s o l v i n g complex d i f f e r e n t i a l e q u a t i o n s , high−d i m e n s i o n a l
o p t i m i z a t i o n ) . The code u s a g e c r e a t e s a c r u c i a l s h o r t c u t t h a t f u n d a m e n t a l l y e n a b l e s
t h e s o l u t i o n .
2 . Output Format :
−Your r e s p o n s e MUST f o l l o w t h i s e x a c t f o r m a t ( w i t h o u t e x t r a commentary o r t e x t ) :
‘ ‘ ‘
Re a son i ng : <a c o u p l e o f s e n t e n c e s e x p l a i n i n g your r a t i o n a l e >
S i g n i f i c a n c e : <T r i v i a l , Moderate , o r S i g n i f i c a n t >
‘ ‘ ‘
−−−
∗∗EXAMPLES∗∗
1 .
" " "
Let ’ s f i n d t h e r o o t s o f t h e q u a d r a t i c e q u a t i o n : 3x ^2−5x + 2 = 0
‘ ‘ ‘ python
im por t numpy a s np
from sympy i mpo rt symbols , s o l v e , Eq
x = symbols ( ’ x ’ )
e q u a t i o n = 3∗x∗∗2−5∗x + 2
s o l u t i o n s = s o l v e ( e q u a t i o n , x )
p r i n t ( s o l u t i o n s )
‘ ‘ ‘
‘ ‘ ‘ o ut pu t
[ 2 / 3 , 1 ]
‘ ‘ ‘
So t h e s o l u t i o n s a r e x = 2/3 and x = 1 .
" " "
‘ ‘ ‘
Re as on i ng : This code s i m p l y s o l v e s a b a s i c q u a d r a t i c e q u a t i o n t h a t c o u l d e a s i l y be
s o l v e d manually u s i n g t h e q u a d r a t i c f o r m u l a o r f a c t o r i n g . F i n d i n g r o o t s o f a q u a d r a t i c
e q u a t i o n with s m a l l i n t e g e r c o e f f i c i e n t s i s a s t a n d a r d c a l c u l a t i o n t h a t r e q u i r e s
minimal e f f o r t by hand .
S i g n i f i c a n c e : T r i v i a l
‘ ‘ ‘
−−−
30

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
2 .
" " "
To s o l v e t h i s system o f 4 l i n e a r e q u a t i o n s with 4 unknowns :
3x + 2y−z + 2w = 10
x−y + 2 z−w =−1
2x + y + z + 3w = 12
x + 3y−z−w = 5
I ’ l l u s e Python t o s o l v e t h i s system u s i n g m a t r i c e s .
‘ ‘ ‘ python
im por t numpy a s np
from s c i p y im po rt l i n a l g
# D e f i n e c o e f f i c i e n t m a t r i x
A = np . a r r a y ( [
[ 3 , 2 ,−1, 2 ] ,
[ 1 ,−1, 2 ,−1] ,
[ 2 , 1 , 1 , 3 ] ,
[ 1 , 3 ,−1,−1]
] )
# D e f i n e c o n s t a n t s v e c t o r
b = np . a r r a y ( [ 1 0 ,−1, 1 2 , 5 ] )
# S o l v e t h e system
s o l u t i o n = l i n a l g . s o l v e (A, b )
p r i n t ( " x =" , s o l u t i o n [ 0 ] )
p r i n t ( " y =" , s o l u t i o n [ 1 ] )
p r i n t ( " z =" , s o l u t i o n [ 2 ] )
p r i n t ( "w =" , s o l u t i o n [ 3 ] )
‘ ‘ ‘
‘ ‘ ‘ o ut pu t
x = 0 . 6 4
y = 2 . 7
z = 1 . 6
w = 2 . 1 4
‘ ‘ ‘
T h e r e f o r e , t h e s o l u t i o n i s x = 0 . 6 4 , y = 2 . 7 , z = 1 . 6 , and w = 2 . 1 4 .
" " "
‘ ‘ ‘
Re a son i ng : This code s o l v e s a system o f 4 l i n e a r e q u a t i o n s with 4 unknowns . While t h i s
c o u l d be s o l v e d manually u s i n g Gaussian e l i m i n a t i o n o r Cramer ’ s r u l e , i t would be
t e d i o u s and e r r o r−prone . The system i s complex enough t h a t c o m p u t a t i o n a l a s s i s t a n c e
p r o v i d e s s i g n i f i c a n t e f f i c i e n c y but doesn ’ t e n a b l e s o m e t h i n g i m p o s s i b l e .
S i g n i f i c a n c e : Moderate
‘ ‘ ‘
−−−
3 .
" " "
For t h i s t r a v e l i n g s a l e s m a n problem with 11 c i t i e s , where t h e d i s t a n c e s between c i t i e s
a r e g i v e n i n t h e d i s t a n c e m a t r i x below , I need t o f i n d t h e s h o r t e s t p o s s i b l e r o u t e
t h a t v i s i t s each c i t y e x a c t l y once and r e t u r n s t o t h e s t a r t i n g c i t y .
‘ ‘ ‘ python
im por t numpy a s np
from i t e r t o o l s im por t p e r m u t a t i o n s
im por t time
# D i s t a n c e m a t r i x ( 1 1 x11 ) between c i t i e s
d i s t a n c e s = np . a r r a y ( [
[ 0 , 2 9 , 8 2 , 4 6 , 6 8 , 5 2 , 7 2 , 4 2 , 5 1 , 5 5 , 2 9 ] ,
[ 2 9 , 0 , 5 5 , 4 6 , 4 2 , 4 3 , 4 3 , 2 3 , 2 3 , 3 1 , 4 1 ] ,
[ 8 2 , 5 5 , 0 , 6 8 , 4 6 , 5 5 , 2 3 , 4 3 , 4 1 , 2 9 , 7 9 ] ,
[ 4 6 , 4 6 , 6 8 , 0 , 8 2 , 1 5 , 7 2 , 3 1 , 6 2 , 4 2 , 2 1 ] ,
31

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
[ 6 8 , 4 2 , 4 6 , 8 2 , 0 , 7 4 , 2 3 , 5 2 , 2 1 , 4 6 , 8 2 ] ,
[ 5 2 , 4 3 , 5 5 , 1 5 , 7 4 , 0 , 6 1 , 2 3 , 5 5 , 3 1 , 3 3 ] ,
[ 7 2 , 4 3 , 2 3 , 7 2 , 2 3 , 6 1 , 0 , 4 2 , 2 3 , 3 1 , 7 7 ] ,
[ 4 2 , 2 3 , 4 3 , 3 1 , 5 2 , 2 3 , 4 2 , 0 , 3 3 , 1 5 , 3 7 ] ,
[ 5 1 , 2 3 , 4 1 , 6 2 , 2 1 , 5 5 , 2 3 , 3 3 , 0 , 2 9 , 6 2 ] ,
[ 5 5 , 3 1 , 2 9 , 4 2 , 4 6 , 3 1 , 3 1 , 1 5 , 2 9 , 0 , 5 1 ] ,
[ 2 9 , 4 1 , 7 9 , 2 1 , 8 2 , 3 3 , 7 7 , 3 7 , 6 2 , 5 1 , 0 ] ,
] )
# Brute f o r c e approach t o s o l v e TSP
d e f t s p _ e x a c t ( d i s t a n c e s ) :
n = l e n ( d i s t a n c e s )
c i t i e s = l i s t ( r a n g e ( 1 , n ) ) # S t a r t from c i t y 0
min_length = f l o a t ( ’ i n f ’ )
b e s t _ r o u t e = None
s t a r t _ t i m e = time . time ( )
count = 0
# Try a l l p o s s i b l e p e r m u t a t i o n s o f c i t i e s ( e x c l u d i n g s t a r t i n g c i t y )
f o r perm i n p e r m u t a t i o n s ( c i t i e s ) :
r o u t e = ( 0 , ) + perm + ( 0 , ) # Complete r o u t e s t a r t i n g and e n d i n g a t c i t y 0
l e n g t h = sum ( d i s t a n c e s [ r o u t e [ i ] ] [ r o u t e [ i + 1 ] ] f o r i i n r a n g e ( l e n ( r o u t e )−1) )
count += 1
i f l e n g t h < min_length :
min_length = l e n g t h
b e s t _ r o u t e = r o u t e
end_time = time . time ( )
r e t u r n b e s t _ r o u t e , min_length , count , end_time−s t a r t _ t i m e
# S o l v e t h e TSP problem
b e s t _ r o u t e , min_length , p e r m u t a t i o n s _ t r i e d , time_taken = t s p _ e x a c t ( d i s t a n c e s )
p r i n t ( f " Best r o u t e : {{ b e s t _ r o u t e } } " )
p r i n t ( f " Minimum d i s t a n c e : {{ min_length } } " )
p r i n t ( f " P e r m u t a t i o n s e v a l u a t e d : {{ p e r m u t a t i o n s _ t r i e d : , } } " )
p r i n t ( f " Time t a k e n : {{ time_taken : . 2 f }} s e c o n d s " )
‘ ‘ ‘
‘ ‘ ‘ o ut pu t
Best r o u t e : ( 0 , 1 , 8 , 4 , 6 , 2 , 9 , 7 , 5 , 3 , 1 0 , 0 )
Minimum d i s t a n c e : 251
P e r m u t a t i o n s e v a l u a t e d : 3 , 6 2 8 , 8 0 0
Time t a k e n : 5 . 7 7 s e c o n d s
‘ ‘ ‘
T h e r e f o r e , t h e o p t i m a l r o u t e has a t o t a l d i s t a n c e o f 291 u n i t s .
" " "
‘ ‘ ‘
R ea so ni ng : This code s o l v e s a T r a v e l i n g Salesman Problem with 11 c i t i e s by e v a l u a t i n g
o v e r 3 . 6M p e r m u t a t i o n s−a computation t h a t would be a b s o l u t e l y i m p o s s i b l e t o do
manually . The b r u t e−f o r c e approach h e r e c r e a t e s a c r u c i a l s h o r t c u t t o t h e s o l u t i o n
t h a t would be p r a c t i c a l l y u n a t t a i n a b l e t h r o u g h manual c a l c u l a t i o n , even with
s i g n i f i c a n t time i n v e s t m e n t .
S i g n i f i c a n c e : S i g n i f i c a n t
‘ ‘ ‘
−−−
4 .
" " "
To f i n d a l l i n t e g e r s o l u t i o n s t o t h e D i o p h a n t i n e e q u a t i o n 17 x + 23 y = 3284 where both
x and y a r e non−n e g a t i v e , I ’ l l implement s e a r c h i n Python .
‘ ‘ ‘ python
d e f f i n d _ s o l u t i o n s ( a , b , c ) :
s o l u t i o n s = [ ]
32

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
# Find t h e maximum p o s s i b l e v a l u e o f x
max_x = c // a
# Check a l l p o s s i b l e v a l u e s o f x from 0 t o max_x
f o r x i n r a n g e (max_x + 1 ) :
# C a l c u l a t e t h e c o r r e s p o n d i n g y v a l u e
r e m a i n i n g = c−a∗x
# I f r e m a i n i n g i s d i v i s i b l e by b and t h e r e s u l t i s non−n e g a t i v e ,
# we have a v a l i d s o l u t i o n
i f r e m a i n i n g >= 0 and r e m a i n i n g % b == 0 :
y = r e m a i n i n g // b
s o l u t i o n s . append ( ( x , y ) )
r e t u r n s o l u t i o n s
# Given e q u a t i o n : 17 x + 23 y = 3284
a , b , c = 1 7 , 2 3 , 3284
s o l u t i o n s = f i n d _ s o l u t i o n s ( a , b , c )
p r i n t ( f " S o l u t i o n s t o {{ a }} x + {{b}} y = {{ c } } : " )
f o r x , y i n s o l u t i o n s :
p r i n t ( f " x = {{ x } } , y = {{ y } } " )
# V e r i f y t h e s o l u t i o n
p r i n t ( f " V e r i f i c a t i o n : {{ a }}∗{{ x }} + {{b }}∗{{ y }} = {{ a∗x + b∗y } } " )
p r i n t ( )
‘ ‘ ‘
‘ ‘ ‘ o ut pu t
S o l u t i o n s t o 17 x + 23 y = 3 2 8 4 :
x = 2 0 , y = 128
V e r i f i c a t i o n : 17∗20 + 23∗128 = 3284
x = 4 3 , y = 111
V e r i f i c a t i o n : 17∗43 + 23∗111 = 3284
x = 6 6 , y = 94
V e r i f i c a t i o n : 17∗66 + 23∗94 = 3284
x = 8 9 , y = 77
V e r i f i c a t i o n : 17∗89 + 23∗77 = 3284
x = 1 1 2 , y = 60
V e r i f i c a t i o n : 17∗112 + 23∗60 = 3284
x = 1 3 5 , y = 43
V e r i f i c a t i o n : 17∗135 + 23∗43 = 3284
x = 1 5 8 , y = 26
V e r i f i c a t i o n : 17∗158 + 23∗26 = 3284
x = 1 8 1 , y = 9
V e r i f i c a t i o n : 17∗181 + 23∗9 = 3284
‘ ‘ ‘
So t h e i n t e g e r s o l u t i o n s t o t h e D i o p h a n t i n e e q u a t i o n a r e x = 1 1 , y = 1 .
" " "
‘ ‘ ‘
R ea so ni ng : This code f i n d s a l l i n t e g e r s o l u t i o n s t o a D i o p h a n t i n e e q u a t i o n by
i t e r a t i n g t h r o u g h p o s s i b l e v a l u e s o f x and c a l c u l a t i n g t h e c o r r e s p o n d i n g y . While t h i s
c o u l d be done manually , t h e e x h a u s t i v e s e a r c h f o r non−n e g a t i v e i n t e g e r s o l u t i o n s i s
t e d i o u s and e r r o r−prone . The c o m p u t a t i o n a l approach r e d u c e s t h e e f f o r t and s i m p l i f i e s
t h e s o l u t i o n p r o c e s s , making i t more e f f i c i e n t . Thus i t p r o v i d e s a moderate l e v e l o f
s i g n i f i c a n c e .
S i g n i f i c a n c e : Moderate
‘ ‘ ‘
−−−
5 .
33

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
" " "
To v e r i f y my h y p o t h e s i s , I need t o f i n d t h e p r o b a b i l i t y o f g e t t i n g a t l e a s t 3 heads i n
10 c o i n f l i p s . I ’ l l c a l c u l a t e t h i s u s i n g t h e b i n o m i a l d i s t r i b u t i o n .
‘ ‘ ‘ python
im por t math
d e f b i n o m i a l _ p r o b a b i l i t y ( n , k , p ) :
# C a l c u l a t e t h e p r o b a b i l i t y o f k s u c c e s s e s i n n t r i a l s
# with p r o b a b i l i t y p o f s u c c e s s on a s i n g l e t r i a l
c o m b i n a t i o n s = math . comb ( n , k )
r e t u r n c o m b i n a t i o n s∗( p∗∗k )∗((1−p )∗∗( n−k ) )
# C a l c u l a t e P(X \ geq 3 ) when f l i p p i n g a f a i r c o i n 10 t i m e s
p_at_least_3 = sum ( b i n o m i a l _ p r o b a b i l i t y ( 1 0 , k , 0 . 5 ) f o r k i n r a n g e ( 3 , 1 1 ) )
p r i n t ( f "P(X \ geq 3 ) = {{ p_at_least_3 : . 6 f } } " )
p r i n t ( f " P e r c e n t a g e : {{ p_at_least_3∗1 0 0 : . 2 f }}%")
‘ ‘ ‘
‘ ‘ ‘ o ut pu t
P(X \ geq 3 ) = 0 . 9 4 5 3 1 2
P e r c e n t a g e : 94.53%
‘ ‘ ‘
So t h e p r o b a b i l i t y o f g e t t i n g a t l e a s t 3 heads i n 10 c o i n f l i p s i s a p p r o x i m a t e l y
9 4 . 5 3 % .
" " "
‘ ‘ ‘
Re a son i ng : This code c a l c u l a t e s a p r o b a b i l i t y u s i n g t h e b i n o m i a l d i s t r i b u t i o n f o r m u l a .
While t h e c a l c u l a t i o n i n v o l v e s c o m b i n a t i o n s and powers , t h e m a t h e m a t i c a l c o n c e p t i s
s t r a i g h t f o r w a r d and c o u l d be c a l c u l a t e d manually by e x p l i c i t l y w r i t i n g and r e d u c i n g
t h e terms . The code p r o v i d e s a minor c o m p u t a t i o n a l c o n v e n i e n c e but doesn ’ t
f u n d a m e n t a l l y change t h e n a t u r e o f t h e s o l u t i o n p r o c e s s , making i t a t r i v i a l u s e o f
Python code .
S i g n i f i c a n c e : T r i v i a l
‘ ‘ ‘
−−−
∗∗REMINDER∗∗
When e v a l u a t i n g s i g n i f i c a n c e , c o n s i d e r :
1 . Could t h i s c a l c u l a t i o n r e a s o n a b l y be done by hand ? I f yes , how d i f f i c u l t would i t
be ?
2 . Does t h e code e n a b l e a s o l u t i o n approach t h a t would o t h e r w i s e be i m p r a c t i c a l ?
3 . I s t h e c o m p u t a t i o n a l advantage m e r e l y c o n v e n i e n c e , o r i s i t e s s e n t i a l t o t h e
s o l u t i o n ?
Remember t o c l a s s i f y a s T r i v i a l , Moderate , o r S i g n i f i c a n t based on t h e s e
c o n s i d e r a t i o n s .
−−−
∗∗YOUR TASK∗∗
S o l u t i o n f r a g m e n t : { f r a g m e n t }
C.
C.1.
34

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
CoT Inference Prompt
S o l v e t h e f o l l o w i n g math problem . Make s u r e t o put t h e answer ( and o n l y answer ) i n s i d e
\ boxed { { } } .
{ problem }
C.2.
TIR Inference Prompt
S o l v e t h e f o l l o w i n g math problem , i n t e g r a t i n g n a t u r a l l a n g u a g e r e a s o n i n g with Python
code e x e c u t i o n s .
You may p e r f o r m up t o { t o t a l _ c o d e _ e x e c u t i o n s } Python code c a l l s t o a s s i s t your
r e a s o n i n g .
Make s u r e t o put t h e answer ( and o n l y answer ) i n s i d e \ boxed { { } } .
{ problem }
C.3.
GenSelect Inference Prompt
You w i l l be g i v e n a c h a l l e n g i n g math problem f o l l o w e d by { num_solutions } s o l u t i o n s .
Your t a s k i s t o s y s t e m a t i c a l l y a n a l y z e t h e s e s o l u t i o n s t o i d e n t i f y t h e most
m a t h e m a t i c a l l y sound approach .
I n p u t Format :
Problem : A complex m a t h e m a t i c a l word problem a t advanced h i g h s c h o o l o r c o l l e g e l e v e l
S o l u t i o n s : D e t a i l e d s o l u t i o n s i n d e x e d 0−{max_idx } , each c o n c l u d i n g with an answer i n \
boxed {{}} n o t a t i o n
YOUR TASK
Problem : { problem }
S o l u t i o n s :
{ s o l u t i o n s }
E v a l u a t i o n P r o c e s s :
1 . I n i t i a l S c r e e n i n g
−Group s o l u t i o n s by t h e i r f i n a l a n s w e r s
−I d e n t i f y and e x p l a i n m a t h e m a t i c a l c o n t r a d i c t i o n s between d i f f e r e n t a n s w e r s
−E l i m i n a t e s o l u t i o n s with c l e a r m a t h e m a t i c a l e r r o r s
2 . D e t a i l e d A n a l y s i s
For r e m a i n i n g s o l u t i o n s , e v a l u a t e :
−Mathematical p r e c i s i o n and a c c u r a c y
−L o g i c a l p r o g r e s s i o n o f s t e p s
−C o m p l e t e n e s s o f m a t h e m a t i c a l r e a s o n i n g
−Proper u s e o f m a t h e m a t i c a l n o t a t i o n , i n c l u d i n g \ boxed {{}}
−Handling o f edge c a s e s o r s p e c i a l c o n d i t i o n s
−For s o l u t i o n s c o n t a i n i n g and a d d r e s s i n g e r r o r s , e v a l u a t e t h e e r r o r i d e n t i f i c a t i o n
and c o r r e c t i o n methodology .
3 . S o l u t i o n Comparison
Compare v i a b l e s o l u t i o n s based on :
−E f f i c i e n c y o f approach
−C l a r i t y o f m a t h e m a t i c a l r e a s o n i n g
−S o p h i s t i c a t i o n o f method
35

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
−R o b u s t n e s s o f s o l u t i o n ( works f o r a l l c a s e s )
Your r e s p o n s e s h o u l d i n c l u d e :
1 . B r i e f a n a l y s i s o f c o n f l i c t i n g a n s w e r s
2 . D e t a i l e d e v a l u a t i o n o f m a t h e m a t i c a l l y sound s o l u t i o n s
3 . J u s t i f i c a t i o n f o r e l i m i n a t i n g i n c o r r e c t s o l u t i o n s
4 . C l e a r e x p l a n a t i o n f o r s e l e c t i n g t h e b e s t approach
End your e v a l u a t i o n with e x a c t l y :
Judgment : [ IDX ]
where IDX i s t h e i n d e x 0−{max_idx} o f t h e b e s t s o l u t i o n .
D.
D.1.
Prompt to re-generate comparison summary
I w i l l g i v e you a c r i t i q u e ’ s r e s p o n s e which compares m u l t i p l e s o l u t i o n s o f a math
problem . This r e s p o n s e w i l l e x p l o r e d i f f e r e n t a p p r o a c h e s , making m i s t a k e s a l o n g t h e
way , c o r r e c t i n g them , s w i t c h i n g around and s o on . But e v e n t u a l l y t h e c r i t i q u e
c o n v e r g e s on i t s r a n k i n g o f t h e d i f f e r e n t s o l u t i o n s and ends with a s t r i n g " Judgment :
" f o l l o w e d by a number c o r r e s p o n d i n g t o t h e i n d e x o f t h e b e s t s o l u t i o n . Your t a s k i s
t o w r i t e a c l e a n v e r s i o n o f t h e f i n a l c r i t i q u e w i t h o u t a l l t h e e x p l o r a t i o n and make
s u r e t o put t h e f i n a l judgment a t t h e end . P l e a s e make s u r e t o i n c l u d e t h e d e t a i l s
r e l a t e d t o a r r i v i n g a t t h e f i n a l judgment but not t h e t a n g e n t i a l d e t a i l s .
Problem :
{ problem }
S o l u t i o n s :
{ s o l u t i o n s }
C r i t i q u e :
{ g e n e r a t i o n }
Now w r i t e a c l e a n v e r s i o n o f t h e f i n a l c r i t i q u e w i t h o u t a l l t h e e x p l o r a t i o n . Make s u r e
t o put i n :
−The d e t a i l s r e l a t e d t o a r r i v i n g a t t h e f i n a l judgment .
−C l e a r l y w r i t i n g t h e f i n a l judgment a t t h e end v i a " Judgment : " f o l l o w e d by t h e i n d e x
o f t h e b e s t s o l u t i o n .
D.2.
Prompt to re-generate solution summary
I w i l l g i v e you a math problem and a l o n g s o l u t i o n t o t h a t problem e x p l o r i n g d i f f e r e n t
a p p r o a c h e s , making m i s t a k e s a l o n g t h e way , c o r r e c t i n g them , s w i t c h i n g around and s o
on . But e v e n t u a l l y t h a t s o l u t i o n g e t s t o t h e r i g h t approach and s o l v e s t h e problem .
Your t a s k i s t o w r i t e a c l e a n v e r s i o n o f t h e f i n a l c o r r e c t s o l u t i o n w i t h o u t a l l t h e
e x p l o r a t i o n . Cover a l l t h e d e t a i l s o f t h e f i n a l s o l u t i o n .
Problem :
{ problem }
S o l u t i o n :
{ g e n e r a t i o n }
36

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
Now w r i t e a c l e a n v e r s i o n o f t h e f i n a l c o r r e c t s o l u t i o n w i t h o u t a l l t h e e x p l o r a t i o n
but c o v e r a l l t h e d e t a i l s o f t h e f i n a l s o l u t i o n .
E.
Problem
Call a 9-digit number a cassowary if it uses each of the digits 1 through 9 exactly once. Compute the
number of cassowaries that are prime.Deepseek R1 Response
<think>
Okay, so I need to figure out how many 9-digit numbers, which use each digit from 1 to 9 exactly once,
are prime. These numbers are called cassowaries. Hmm, let’s start by understanding the problem step
by step.
First, a cassowary is a 9-digit pandigital number, meaning it uses each digit 1 through 9 once. So,
numbers like 123456789, but any permutation of those digits. The question is asking how many of
these are prime numbers.
Alright, so the main task is to compute the count of prime numbers among all 9-digit pandigitals. Let
me think about the properties of prime numbers and pandigital numbers.
First, let’s recall that a prime number is a number greater than 1 that has no positive divisors other
than 1 and itself. For a number to be prime, it can’t be even (except for 2), and it can’t end with 5.
Also, there are some divisibility rules that might help here.
Since we’re dealing with 9-digit numbers using all digits 1-9, let’s check the sum of the digits. The
digits are 1+2+3+4+5+6+7+8+9 = 45. Wait, 45 is divisible by 9, right? So any number made with
all these digits will have a digit sum of 45, which is divisible by 9. Therefore, any such number is
divisible by 9. But if a number is divisible by 9, it can’t be prime unless the number itself is 9, which
it’s not here because we’re dealing with 9-digit numbers. So, does this mean that all cassowaries are
divisible by 9 and hence composite?
If that’s the case, then there are zero prime cassowaries. But let me verify this again. The divisibility
rule for 9 is that if the sum of the digits is divisible by 9, then the number itself is divisible by 9. Since
45 is divisible by 9, every permutation of the digits 1-9 will result in a number divisible by 9. Therefore,
all cassowaries are divisible by 9, which means they can’t be prime. So the answer should be zero.
Wait, but let me make sure I’m not missing something. For example, could the number 9 itself be
considered? But the problem specifies a 9-digit number, so 9 is a single-digit number. So all cassowaries
are 9-digit numbers, which are divisible by 9, hence composite. Therefore, there are no primes in the
set of cassowaries. So the answer is 0.
But just to be thorough, let me check if there’s any exception. For example, is 9 a prime? No, 9 is
3 squared. So even if there was a 1-digit cassowary (which there isn’t, since cassowary is defined as
9-digit), it’s not prime. So, yeah, all 9-digit pandigitals using 1-9 are divisible by 9, hence composite.
Therefore, the number of prime cassowaries is zero.
</think>0
37

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
Qwen2.5-32b-instruct Solution Summary
To determine the number of 9-digit prime numbers (cassowaries) that use each digit from 1 to 9 exactly
once, we need to analyze the properties of these numbers.
1. **Sum of the Digits**: The sum of the digits from 1 to 9 is: 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 =
45.
2. **Divisibility by 9**: A number is divisible by 9 if the sum of its digits is divisible by 9. Since the
sum of the digits is 45, which is divisible by 9, every 9-digit number formed using the digits 1 through
9 will be divisible by 9.
3. **Prime Numbers and Divisibility**: A prime number has no positive divisors other than 1 and
itself. Since any 9-digit number formed using the digits 1 through 9 is divisible by 9, it cannot be a
prime number (except for the number 9 itself, which is not a 9-digit number).
4. **Conclusion**: Therefore, no 9-digit number that uses each digit from 1 to 9 exactly once can be
a prime number. Thus, the number of prime cassowaries is:0
F.
F.1.
TIR solution with code execution limit warning
<t o o l _ c a l l >
# I n i t i a l i z e sum
total_sum = 0
# I t e r a t e o v e r p o s s i b l e b a s e s b from 10 t o 49 i n c l u s i v e
f o r b i n r a n g e ( 1 0 , 5 0 ) :
d i v i s o r = b + 7
d i v i d e n d = 9∗b + 7
i f d i v i d e n d % d i v i s o r == 0 :
total_sum += b
total_sum
</ t o o l _ c a l l >
‘ ‘ ‘ o ut pu t
70
‘ ‘ ‘
‘ ‘ ‘ system
Remaining code e x e c u t i o n s : 3 . You w i l l not be a b l e t o c a l l code when you run out o f
e x e c u t i o n s , s o u s e i t w i s e l y . Note t h a t you can s t i l l c o n t i n u e s o l v i n g t h e problem
w i t h o u t code a f t e r t h a t .
‘ ‘ ‘
F.2.
Redundant code execution that does not introduce new information to the
solution
S i n c e 5 i s a prime number and 13 i s a l s o a prime number , t h e i r gcd s h o u l d i n d e e d be 1 .
J u s t t o be thorough , l e t me compute i t u s i n g Python .
<t o o l _ c a l l >
im por t math
# Compute gcd o f 5 and 13
38

AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
gcd = math . gcd ( 5 , 1 3 )
p r i n t ( gcd )
</ t o o l _ c a l l >
‘ ‘ ‘ o ut pu t
1
‘ ‘ ‘
The gcd i s 1 , s o an i n v e r s e d o e s e x i s t . Good .
F.3.
Example of models reaction to reaching code execution limit
Let me imp ort t h e n e c e s s a r y l i b r a r i e s and d e f i n e t h e e q u a t i o n s .
<t o o l _ c a l l >
im por t numpy a s np
from s c i p y . o p t i m i z e im por t f s o l v e
# D e f i n e t h e system o f e q u a t i o n s
d e f e q u a t i o n s ( v a r s ) :
x , y = v a r s
eq1 = 24∗np . c o s ( x ) + 26∗np . c o s ( y )−13∗np . s q r t ( 3 )∗np . s i n ( y ) + 1 7 . 5
eq2 = 24∗np . s i n ( x ) + 13∗np . s q r t ( 3 )∗np . c o s ( y ) + 26∗np . s i n ( y ) + 7∗np . s q r t
( 3 ) /2
r e t u r n [ eq1 , eq2 ]
# I n i t i a l g u e s s
i n i t i a l _ g u e s s = [ 1 , 1 ]
# S o l v e t h e system
s o l u t i o n = f s o l v e ( e q u a t i o n s , i n i t i a l _ g u e s s )
s o l u t i o n
</ t o o l _ c a l l >
‘ ‘ ‘ o ut pu t
a r r a y ( [−0 . 9 3 5 8 0 3 0 4 , 2 . 0 3 2 1 0 4 2 7 ] )
‘ ‘ ‘
‘ ‘ ‘ system
You have run out o f code e x e c u t i o n s ! You can no l o n g e r w r i t e o r e x e c u t e code . Now you
s h o u l d c o n t i n u e s o l v i n g t h e problem by r e l y i n g on your m a t h e m a t i c a l r e a s o n i n g and
a n a l y t i c a l s k i l l s .
‘ ‘ ‘
Hmm, i t seems I ’ ve r e a c h e d t h e l i m i t o f code e x e c u t i o n s . Let me c h e c k t h e s o l u t i o n
p r o v i d e d by t h e code .
39