FlowRL: Matching Reward Distributions for LLM Reasoning
Xuekai Zhu¹, Daixuan Cheng⁶, Dinghuai Zhang³, Hengli Li⁵, Kaiyan Zhang⁴, Che Jiang⁴, Youbang Sun⁴, Ermo Hua⁴, Yuxin Zuo⁴, Xingtai Lv⁴, Qizheng Zhang⁷, Lin Chen¹, Fanghao Shao¹, Bo Xue¹, Yunchong Song¹, Zhenjie Yang¹, Ganqu Cui², Ning Ding⁴, Jianfeng Gao³, Xiaodong Liu³, Bowen Zhou⁴‡, Hongyuan Mei⁸‡, Zhouhan Lin¹‡
¹Shanghai Jiao Tong University  ²Shanghai AI Laboratory  ³Microsoft Research  ⁴Tsinghua University  ⁵Peking University  ⁶Renmin University of China  ⁷Stanford University  ⁸Toyota Technological Institute at Chicago
‡Corresponding Authors.
Abstract|We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing
rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt
reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals
while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform
scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the
reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced
optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct
experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math reasoning tasks, and performs consistently better on code reasoning benchmarks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
[Figure 1 panel labels: Distribution-matching: FlowRL (KL=0.11); Reward-maximizing: R++, PPO and GRPO (KL=8.68); Math Average Score; CodeForces Rating.]
Figure 1|Top: Comparison between distribution-matching and reward-maximizing approaches.
FlowRL (left) learns to match the full reward distribution, maintaining diversity across multiple modes
with low KL divergence. In contrast, reward-maximizing methods like GRPO (right) concentrate on a
single high-reward peak, leading to mode collapse and higher KL divergence. Bottom: Performance
comparison. FlowRL consistently outperforms GRPO across math and code domains.
1. Introduction
Reinforcement learning (RL) plays a crucial role in the post-training of large language models
(LLMs) [Zhang et al., 2025b]. A series of powerful reasoning models [Guo et al., 2025, inter alia] have employed large-scale reinforcement learning to achieve strong performance on highly challenging benchmarks [He et al., 2024]. The evolution of RL algorithms for LLM reasoning has progressed through several key stages: REINFORCE [Sutton et al., 1999] provides a solid baseline that is easy to implement and efficient in simple settings; PPO [Schulman et al., 2017] improves upon REINFORCE with better stability and efficiency in complex settings; GRPO [Shao et al., 2024] simplifies PPO training by eliminating value functions and relying on group
comparisons, though at the cost of requiring more rollouts per update. However, all these methods
share a fundamental limitation in their reward-maximizing objective.
Reward-maximizing RL methods tend to overfit to the dominant mode of the reward distribu-
tion [Gao et al., 2023, inter alia]. This often results in limited diversity among generated reasoning paths and reduces generalization to less frequent yet valid logical outcomes [Hu et al., 2024]. As illustrated in Figure 1, GRPO neglects other meaningful modes. These drawbacks become especially pronounced in complex long chain-of-thought (CoT; Wei et al., 2022) reasoning, where capturing a diverse distribution of plausible solutions is essential for effective generalization [Liu et al., 2025a]. Recent approaches adjust the clip ratio [Yu et al., 2025b], augment the advantage function with an entropy-based term [Cheng et al., 2025], or selectively promote high-entropy tokens [Wang et al., 2025], thereby dynamically adapting the training data
distribution and implicitly increasing diversity during training. This raises a fundamental question:
How can we promote diverse exploration to prevent convergence to dominant solution patterns in RL
training?
In this paper, we propose FlowRL, a policy optimization algorithm that aligns the policy model
with the full reward distribution, encouraging mode coverage. FlowRL achieves more efficient
exploration by fundamentally shifting from reward maximization to reward distribution matching,
thereby addressing the inherent mode-collapse limitations of previous RL approaches. As illustrated
in Figure 1, the core idea of FlowRL is to introduce a learnable partition function that normalizes
scalar rewards into a target distribution, and to minimize the reverse KL divergence between the
policy and this reward-induced distribution. We develop this KL objective based on the trajectory
balance formulation from GFlowNets [Bengio et al., 2021], providing a gradient equivalence proof
that bridges generative modeling and policy optimization. To address the challenges of long CoT
training, we introduce two key technical solutions: (1) length normalization, which mitigates the gradient-explosion issues that occur with variable-length CoT reasoning, and (2) importance sampling, which corrects the distribution mismatch between generated rollouts and the current policy.
We compare FlowRL with mainstream RL algorithms including REINFORCE++, PPO, and GRPO
across math and code domains, using both base and distilled LLMs (7B, 32B). In the math domain, FlowRL outperforms GRPO and PPO by 10.0% and 5.1%, respectively, demonstrating consistent
improvements across six challenging math benchmarks. Furthermore, FlowRL surpasses both PPO
and GRPO on three challenging coding benchmarks, highlighting its strong generalization capabilities
in code reasoning tasks. To understand what drives these performance gains, we analyze the diversity
of generated reasoning paths. This diversity analysis confirms that FlowRL generates substantially
more diverse rollouts than baseline methods, validating our approach’s effectiveness in exploring
multiple solution strategies.
Contributions.
•
We propose FlowRL, a policy optimization algorithm that shifts from reward maximization to
reward distribution matching via flow balance, encouraging diverse reasoning path exploration
while addressing the inherent mode-collapse limitations of existing RL methods.
•
We introduce length normalization and importance sampling to enable effective training on variable-
length CoT reasoning, addressing gradient explosion and sampling mismatch issues.
•
FlowRL outperforms GRPO and PPO by 10.0% and 5.1% respectively across math benchmarks and
demonstrates strong generalization on code reasoning tasks, with diversity analysis confirming
substantially more diverse solution exploration.
2. Preliminaries
Reinforcement Learning for Reasoning. We formulate reasoning as a reinforcement learning problem in which the policy model receives a question $x \in \mathcal{X}$ and generates an answer $y \in \mathcal{Y}$. The objective is to learn a policy $\pi_\theta(y\mid x)$ that produces high-quality answers under task-specific reward signals $r$. To better illustrate the policy optimization procedure, we provide a detailed formulation of GRPO below. For each question $x$, GRPO samples a group of answers $\{y_1, y_2, \ldots, y_G\}$ from the old policy $\pi_{\theta_{\text{old}}}$ and updates the model by maximizing the following objective:
$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\!\left[
\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|y_i|}\sum_{t=1}^{|y_i|}
\left(
\min\!\left(
\frac{\pi_\theta(y_{i,t}\mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}\mid x, y_{i,<t})}\,\hat{A}_{i,t},\;
\mathrm{clip}\!\left(\frac{\pi_\theta(y_{i,t}\mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}\mid x, y_{i,<t})},\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_{i,t}
\right)
- \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)
\right)
\right],
$$

$$
\mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) =
\frac{\pi_{\mathrm{ref}}(y_{i,t}\mid x, y_{i,<t})}{\pi_\theta(y_{i,t}\mid x, y_{i,<t})}
- \log \frac{\pi_{\mathrm{ref}}(y_{i,t}\mid x, y_{i,<t})}{\pi_\theta(y_{i,t}\mid x, y_{i,<t})}
- 1,
\tag{1}
$$

where $\epsilon$ and $\beta$ are hyper-parameters. Here, $\hat{A}_{i,t}$ denotes the advantage, computed by normalizing the group reward values $\{r_1, r_2, \ldots, r_G\}$ as $\hat{A}_{i,t} = \big(r_i - \mathrm{mean}(\{r_1, r_2, \ldots, r_G\})\big)/\mathrm{std}(\{r_1, r_2, \ldots, r_G\})$.
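For concreteness, the group-normalized advantage in Eq. (1) can be computed as in the following minimal sketch (our own illustration, not the authors' code; the small epsilon guard against zero variance is an added assumption):

```python
import torch

def group_normalized_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize a group of scalar rewards {r_1, ..., r_G} for one question.

    Implements the advantage used in Eq. (1): A_hat_i = (r_i - mean(r)) / std(r).
    `eps` guards against zero variance when all rewards in the group are identical.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: G = 8 rollouts for one prompt with binary outcome rewards.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(group_normalized_advantage(rewards))
```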
Compared to GRPO, REINFORCE
applies the policy gradient directly, without advantage normalization, clipping, or KL regularization.
PPO uses a critic model to estimate the advantage and employs importance sampling to stabilize
policy updates.
Figure 2|GFlowNets [Bengio et al., 2021], a flow-balance perspective on reinforcement learning. The initial flow $Z_\phi(s_0)$ injects probability mass into the environment, which is transported through intermediate states by the policy $\pi_\theta$ and accumulated at terminal states in proportion to the scalar rewards.
GFlowNets. GFlowNets [Bengio et al., 2021] are a probabilistic framework for training stochastic policies to sample discrete, compositional objects (e.g., graphs, sequences) in proportion to a given reward. As shown in Figure 2, the core principle of GFlowNets is to balance the forward and backward probability flows at each state, inspired by flow matching [Bengio et al., 2021]. The initial flow is estimated by $Z_\phi(s_0)$ at the initial state $s_0$. The output flow is equal to the outcome reward $r(x)$ conditioned at the final state $s_T$. Following prior work, we use a 3-layer MLP to parameterize $Z_\phi$. This flow-balancing mechanism facilitates the discovery of diverse, high-reward solutions by ensuring proper exploration of the solution space. See Appendix C for a detailed GFlowNets background.
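To make the learnable partition function concrete, the following is a minimal sketch of a 3-layer MLP head that predicts a scalar $\log Z_\phi(x)$ from a prompt representation; it is our own illustration, and the GELU activations, the use of the prompt's last-token hidden state, and the 3584-dimensional hidden size of a 7B backbone are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class LogZHead(nn.Module):
    """3-layer MLP that estimates log Z_phi(x) from a prompt representation."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, prompt_hidden: torch.Tensor) -> torch.Tensor:
        # prompt_hidden: [batch, hidden_size], e.g. the last-token hidden state of the prompt x.
        return self.net(prompt_hidden).squeeze(-1)  # [batch] estimates of log Z_phi(x)

log_z_head = LogZHead(hidden_size=3584)  # hidden size chosen to match a 7B backbone (assumption)
print(log_z_head(torch.randn(4, 3584)).shape)  # torch.Size([4])
```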
3. FlowRL
In this section, we first formulate distribution matching in reinforcement learning through reverse
KL divergence and establish its connection to trajectory balance from GFlowNets. To address the
challenges of gradient explosion and sampling mismatch encountered during long CoT training, we
further incorporate length normalization and importance sampling. Using this enhanced framework,
we derive a flow-balanced objective, termed FlowRL.
3.1. Reward Distribution Matching
As illustrated in Figure 1, recent powerful large reasoning models typically employ reward-maximizing
RL algorithms, such as PPO or GRPO. However, these methods tend to optimize toward the dominant
reward mode, frequently resulting in mode collapse and the neglect of other plausible, high-quality
reasoning paths. To address this fundamental limitation, we propose optimizing the policy by aligning
its output distribution to a target reward distribution. A simple yet effective way to achieve this is
to minimize the reverse KL divergence¹ between the policy and this target. However, in long CoT
reasoning tasks, the available supervision in RL is a scalar reward, rather than a full distribution.
Moreover, enumerating or sampling all valid trajectories to recover the true reward distribution is
computationally intractable.
Inspired by energy-based modeling [Du and Mordatch, 2019, inter alia], we introduce a learnable partition function $Z_\phi(x)$ to normalize scalar rewards into a valid target distribution. This allows us to minimize the reverse KL divergence between the policy and the reward-weighted distribution, formalized as:

$$
\min_\theta\; D_{\mathrm{KL}}\!\left(\pi_\theta(y\mid x)\;\Big\|\;\frac{\exp\big(\beta\, r(x,y)\big)}{Z_\phi(x)}\right)
\;\;\Rightarrow\;\;
\pi_\theta(y\mid x) \propto \exp\big(\beta\, r(x,y)\big),
\tag{2}
$$

where $r(x,y)$ is the reward function, $\beta$ is a hyperparameter, $Z_\phi(x)$ is the learned partition function, and the resulting target distribution is defined as $\tilde{p}(y\mid x) = \exp\big(\beta\, r(x,y)\big)/Z_\phi(x)$. This objective encourages
the policy to sample diverse, high-reward trajectories in proportion to their rewards, rather than
collapsing to dominant modes as in standard reward maximization.
While the KL-based formulation provides a principled target distribution, we derive a more
practical, RL-style objective that facilitates efficient policy optimization.
Proposition 1. Minimizing the reverse KL divergence in Eq. (2) is equivalent, in terms of gradients, to minimizing the trajectory balance loss used in GFlowNets [Bartoldson et al., 2025; Malkin et al., 2022, inter alia]:

$$
\min_\theta\; D_{\mathrm{KL}}\!\left(\pi_\theta(y\mid x)\;\Big\|\;\frac{\exp\big(\beta\, r(x,y)\big)}{Z_\phi(x)}\right)
\;\Longleftrightarrow\;
\min_\theta\;
\underbrace{\Big(\log Z_\phi(x) + \log \pi_\theta(y\mid x) - \beta\, r(x,y)\Big)^2}_{\text{Trajectory Balance}}
\tag{3}
$$
Remark 2 (Trajectory balance as a practical surrogate for KL minimization). As established in Proposition 1, the KL-based distribution matching objective can be reformulated as the trajectory balance loss. This reformulation provides a practical optimization approach by using a stable squared loss form rather than direct KL optimization, and by treating $Z_\phi(x)$ as a learnable parameter
rather than requiring explicit computation of the intractable partition function. The trajectory balance
objective thus serves as a tractable surrogate for reward-guided KL minimization that can be directly
integrated into existing RL frameworks.
¹ We use reverse KL since we can only sample from the policy model, not the target reward distribution.
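As a minimal sketch of the surrogate in Proposition 1 (our own illustration with toy tensors, not the released implementation; the default β = 15 is taken from the ablation in §6.2), the squared trajectory balance residual can be written directly from sequence-level log-probabilities:

```python
import torch

def trajectory_balance_loss(log_z: torch.Tensor,
                            logp_y: torch.Tensor,
                            reward: torch.Tensor,
                            beta: float = 15.0) -> torch.Tensor:
    """Squared residual of Eq. (3): (log Z_phi(x) + log pi_theta(y|x) - beta * r(x, y))^2.

    log_z:  [batch] learnable log-partition estimates, one per prompt x.
    logp_y: [batch] sequence-level log-probabilities log pi_theta(y|x).
    reward: [batch] scalar rewards r(x, y).
    """
    residual = log_z + logp_y - beta * reward
    return (residual ** 2).mean()

loss = trajectory_balance_loss(torch.zeros(2), torch.tensor([-35.0, -42.0]), torch.tensor([1.0, 0.0]))
print(loss)
```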
3.2. The FlowRL Objective
As established in Proposition 1, the target reward distribution can be approximated by optimizing
the trajectory balance objective. However, applying this objective directly to long CoT reasoning
introduces two key challenges:
Problem I: Exploding gradients from long trajectories. Trajectory balance is a sequence-level objective, and applying it to long CoT reasoning with up to 8K tokens leads to exploding gradients and unstable updates. This issue is not observed in prior GFlowNets works, which typically operate on short trajectories in small discrete spaces. Specifically, the log-probability term $\log \pi_\theta(y\mid x)$ decomposes into a token-wise sum, $\sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid y_{<t}, x)$, causing the gradient norm to potentially scale with sequence length.
Problem II: Sampling mismatch. Modern RL pipelines such as PPO and GRPO perform micro-batch updates and reuse trajectories collected from an old policy $\pi_{\theta_{\text{old}}}$, enabling
data-efficient training. In contrast, the KL-based trajectory balance objective assumes fully on-
policy sampling, where responses are drawn from the current policy. This mismatch poses practical
limitations when integrating trajectory balance into existing RL pipelines.
These limitations motivate our reformulation that retains the benefits of distribution matching
while addressing key practical challenges. To enable this reformulation, we first redefine the reward
function following established practices in the GFlowNets literature [Bartoldson et al., 2025; Hu et al., 2024, inter alia] by incorporating a reference model as a prior constraint on the reward distribution. Specifically, we modify the original reward term as

$$
\exp\big(\beta\, r(x,y)\big) \;\longrightarrow\; \exp\big(\beta\, \hat{r}(x,y)\big)\cdot \pi_{\mathrm{ref}}(y\mid x),
\tag{4}
$$

where $r(x,y)$ denotes the outcome reward commonly used in reinforcement learning and $\pi_{\mathrm{ref}}$ is the initial pre-trained model. We follow prior work [Guo et al., 2025] in using outcome-based reward signals, and apply group normalization to $r(x,y)$ as $\hat{r} = \big(r - \mathrm{mean}(\mathbf{r})\big)/\mathrm{std}(\mathbf{r})$, where $\mathbf{r} = \{r_1, r_2, \ldots, r_G\}$ denotes the set of rewards within a sampled group. By substituting the redefined reward formulation Eq. (4) into Eq. (3), we derive the following objective²:

$$
\min_\theta\; \Big(\log Z_\phi(x) + \log \pi_\theta(y\mid x) - \beta\, \hat{r}(x,y) - \log \pi_{\mathrm{ref}}(y\mid x)\Big)^2
\tag{5}
$$
Remark 3 (Reward shaping via length normalization). The trajectory balance objective treats both the log-probability and the outcome reward as sequence-level quantities. In contrast, standard policy optimization methods such as PPO or GRPO assign rewards at the token level and compute gradients at each step. However, for trajectories of varying lengths (e.g., CoT responses), this mismatch can cause the log-probability term $\log \pi_\theta(y\mid x) = \sum_{t=1}^{|y|} \log \pi_\theta(y_t\mid y_{<t}, x)$ to scale with sequence length. To address this, we apply a form of reward shaping by normalizing log-probabilities with respect to sequence length. Specifically, we rescale the term as $\frac{1}{|y|}\log \pi_\theta(y\mid x)$, balancing the contributions of long and short sequences and stabilizing the learning signal.
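A minimal sketch of this length normalization, assuming per-token log-probabilities with a padding mask (the tensor layout is our assumption, not the paper's code):

```python
import torch

def length_normalized_logprob(token_logps: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Compute (1/|y|) * log pi_theta(y|x) from per-token log-probabilities.

    token_logps: [batch, seq_len] values of log pi_theta(y_t | y_<t, x) for response tokens.
    mask:        [batch, seq_len] with 1 for real response tokens and 0 for padding.
    """
    lengths = mask.sum(dim=-1).clamp(min=1)        # |y| per sequence
    seq_logp = (token_logps * mask).sum(dim=-1)    # log pi_theta(y|x) as a token-wise sum
    return seq_logp / lengths                      # contribution no longer grows with |y|

token_logps = torch.full((2, 5), -2.0)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=torch.float)
print(length_normalized_logprob(token_logps, mask))  # tensor([-2., -2.])
```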
Remark 4 (Importance sampling for data-efficient training). To reuse rollouts collected from the old policy, we employ importance sampling inspired by PPO to stabilize policy updates with off-policy data. We re-weight stale trajectories using the importance ratio $\pi_\theta(y\mid x)/\pi_{\theta_{\text{old}}}(y\mid x)$, which serves as a coefficient in the surrogate loss. Since our objective focuses on optimizing trajectory balance rather than expected return, we detach the gradient from the current policy to prevent excessive policy drift: $\big[\pi_\theta(y\mid x)\big]_{\text{detach}}/\pi_{\theta_{\text{old}}}(y\mid x)$. For additional stability, we incorporate PPO-style clipping to bound the importance weights:

$$
w = \mathrm{clip}\!\left(\frac{\pi_\theta(y\mid x)}{\pi_{\theta_{\text{old}}}(y\mid x)},\; 1-\epsilon,\; 1+\epsilon\right)_{\text{detach}}.
$$
² The substitution replaces $\beta\, r(x,y)$ in the trajectory balance objective Eq. (3) with $\beta\, \hat{r}(x,y) + \log \pi_{\mathrm{ref}}(y\mid x)$ to incorporate the reference model constraint.
Incorporating these improvements into Eq. (5), we arrive at the following FlowRL objective:

$$
\mathcal{L}_{\text{FlowRL}} = w\left(\log Z_\phi(x) + \frac{1}{|y|}\log \pi_\theta(y\mid x) - \beta\, \hat{r}(x,y) - \frac{1}{|y|}\log \pi_{\mathrm{ref}}(y\mid x)\right)^{2},
\tag{6}
$$

where the clipped importance weight is $w = \mathrm{clip}\!\big(\pi_\theta(y\mid x)/\pi_{\theta_{\text{old}}}(y\mid x),\, 1-\epsilon,\, 1+\epsilon\big)_{\text{detach}}$ and $\hat{r} = \big(r_i - \mathrm{mean}(\mathbf{r})\big)/\mathrm{std}(\mathbf{r})$. We use this objective to update the policy parameters $\theta$ during training, and refer to this strategy as FlowRL. Implementation details and theoretical analysis are provided in §5 and the appendix, respectively.
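Putting the pieces together, the following is a minimal sketch of Eq. (6) based on our reading of the objective (not the released implementation); the clipping range ε = 0.2 is an assumed value, and β = 15 follows the ablation in §6.2:

```python
import torch

def flowrl_loss(log_z, logp, logp_old, logp_ref, reward_hat, seq_len,
                beta: float = 15.0, eps: float = 0.2) -> torch.Tensor:
    """Sketch of Eq. (6): w * (log Z + (1/|y|) log pi - beta * r_hat - (1/|y|) log pi_ref)^2.

    All arguments are [batch] tensors of sequence-level quantities: log_z are learnable
    log Z_phi(x) estimates; logp / logp_old / logp_ref are sequence log-probabilities under
    the current, rollout-time, and reference policies; reward_hat is the group-normalized
    reward; seq_len is the response length |y|.
    """
    # Clipped importance weight, detached so it only re-weights the squared residual.
    ratio = torch.exp(logp - logp_old)
    w = torch.clamp(ratio, 1.0 - eps, 1.0 + eps).detach()

    residual = log_z + logp / seq_len - beta * reward_hat - logp_ref / seq_len
    return (w * residual ** 2).mean()
```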
4. Related Work
4.1. Reinforcement Learning for LLM Reasoning
Reinforcement learning has emerged as a powerful approach for large language models post-training
on reasoning tasks [Guo et al., 2025, inter alia]. Most approaches employ reward-maximizing RL to optimize expected cumulative returns. Entropy regularization [Ahmed et al., 2019; Haarnoja et al., 2018, inter alia] is a classical technique for mitigating mode collapse by promoting diversity in the policy's output distribution, and has also been shown to enhance reasoning capabilities in various settings [Chao et al., 2024, inter alia]. However, for long CoT reasoning, the extended trajectory length (e.g., 8k–16k tokens) makes it difficult for the regularization signal to effectively influence reward-maximizing learning. Recent work [Cheng et al., 2025; Wang et al., 2025, inter alia] has discovered that training with more diverse or high-entropy training data can
further enhance training effectiveness. Compared to traditional entropy regularization, the above
methods explicitly increase the proportion of low-probability (i.e., high-entropy) tokens in the training
data. In our work, we address the mode-collapse problem by fundamentally shifting from reward
maximization to reward distribution matching in our RL formulation.
4.2. GFlowNets
GFlowNets [Bengio et al., 2021] represent a class of diversity-driven algorithms designed to balance probability flows across states. They have rich connections to probabilistic modeling methods [Ma et al.; Malkin et al., 2023; Zhang et al., 2022a,b; Zimmermann et al., 2022, inter alia] and control methods [Pan et al., 2023c,d; Tiapkin et al., 2024, inter alia]. This advantage has enabled GFlowNets to achieve successful applications in multiple downstream tasks, such as molecular drug discovery [Jain et al., 2022, 2023a,b, inter alia], phylogenetic inference [Zhou et al., 2024], and combinatorial optimization [Zhang et al., 2023a,b]. For generative AI, GFlowNets provide a powerful approach to align pretrained models in scenarios such as image generation [Yun et al., 2025, inter alia] and language model fine-tuning [Hu et al., 2024, inter alia]. Another line of work primarily focuses on the theoretical aspects of GFlowNets. Recent theoretical studies have interpreted GFlowNets as solving a maximum entropy reinforcement learning problem within a modified Markov Decision Process (MDP) [Deleu et al., 2024; Tiapkin et al., 2024; Mohammadpour et al., 2024, inter alia]. These theoretical contributions have
inspired us to enhance reinforcement learning from a more foundational standpoint using GFlowNets
principles. A comprehensive overview of GFlowNets theory can be found in Appendix C.
4.3. Flow Matching
Flow matching simplifies diffusion-based approaches by learning vector fields that transport samples from prior to target distributions [Lipman et al., 2023]. Recent work has explored flow matching for policy optimization. McAllister et al. [2025] reformulate policy optimization using advantage-weighted ratios from a conditional flow matching loss, enabling flow-based policy training without expensive likelihood computations. Pfrommer et al. [2025] explored reward-weighted flow matching for improving policies beyond demonstration performance. Park et al. [2025] use a separate one-step policy to avoid unstable backpropagation through time when training flow policies with RL. Zhang et al. [2025a] proposed a combined loss function integrating PPO and GFlowNets to optimize diffusion model alignment. However, these approaches focus on continuous control, image generation, or
vision-action models, rather than addressing mode-collapse limitations in reward-maximizing RL.
Inspired by flow matching principles, our work improves upon RL training to enhance training stability
while promoting diverse solution exploration.
5. Experimental Setup
Backbone Models. FlowRL trains two models: the policy model $\pi_\theta$ and the partition function $Z_\phi$. For the policy model $\pi_\theta$, we use Qwen-2.5-7B/32B [Team, 2024] for math tasks and DeepSeek-R1-Distill-Qwen-7B [DeepSeek-AI, 2025] for code tasks, respectively. For the partition function $Z_\phi$, following prior work, we use a randomly initialized 3-layer MLP with hidden dimensions matching those of the base model. The reference model $\pi_{\mathrm{ref}}$ is the corresponding fixed pretrained model. All training scripts are based on the veRL framework [Sheng et al., 2024]. For the reward function, we set the hyperparameter $\beta = 15$ (see the ablation in §6.2).
Baselines. We compare FlowRL against REINFORCE++ (R++; Hu et al., 2025), PPO [Schulman et al., 2017], and GRPO [Shao et al., 2024]. All baselines follow the official veRL recipes, with consistent training
configurations. For fair comparison, all methods use the same learning rate, batch size, and training
steps, and are evaluated at convergence using identical step counts.
Training Configuration. For the math domain, we use the training set collected from DAPO [Yu et al., 2025b]. For the code domain, we follow the setup of DeepCoder [Luo et al., 2025], using their training set. For 7B model training, we use a single
node equipped with 8 NVIDIA H800 GPUs (80GB memory each). For 32B model training, we scale
to 4 nodes with 32 GPUs to accommodate the larger memory requirements. All experiments use max_prompt_length = 2048 and max_response_length = 8192 across both model sizes. We use a batch size of 512 for math reasoning tasks and 64 for code reasoning tasks. We set the learning rate to 1e-6 and enable dynamic batch sizing in veRL for efficient training. For GRPO and FlowRL, we configure rollout_n = 8, meaning each prompt generates 8 response rollouts as the group size.
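For reference, the training settings reported above can be collected into a single summary (a plain restatement of the stated hyperparameters; the dictionary keys are descriptive names, not actual veRL configuration fields):

```python
# Summary of the reported training settings; key names are descriptive, not veRL config keys.
train_config = {
    "max_prompt_length": 2048,
    "max_response_length": 8192,
    "batch_size": {"math": 512, "code": 64},
    "learning_rate": 1e-6,
    "rollout_n": 8,  # responses sampled per prompt (group size) for GRPO and FlowRL
    "hardware": {"7B": "1 node x 8 NVIDIA H800 (80GB)", "32B": "4 nodes x 32 GPUs"},
}
```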
Evaluation Configuration. For the math domain, we evaluate on AIME 2024/2025 [MAA, 2025], AMC 2023 [MAA, 2023], MATH-500 [Lightman et al., 2023], Minerva [Lewkowycz et al., 2022], and Olympiad [He et al., 2024]. For the code domain, we evaluate on LiveCodeBench [Jain et al., 2024], CodeForces [Penedo et al., 2025], and HumanEval+ [Chen et al., 2021]. For all evaluation datasets, we perform 16 rollouts and report the average accuracy, denoted as Avg@16. We further report rating and percentile for Codeforces.
AIME24 AIME25 AMC23 MATH500 Minerva Olympiad Avg
Qwen2.5-32B-Base, Max Response Len=8K
Backbone4.6 22.7
R++ 14 +10 9+7 52 +24 44 − 17 − 24 +3 27.1
PPO 26 +22 20 +18 76 +47 69 +16 28 +1 37 +16 43.3
GRPO 23 +18 14 +12 76 +48 61 +9 19 − 34 +13 38.3
FlowRL 24 +19 21 +19 73 +45 80 +28 38 +11 51 +30 48.4
Qwen2.5-7B-Base, Max Response Len=8K
Backbone4.4 23.0
R++ 11 +6 5+3 66 +35 54 − 24 +2 27 +3 31.5
PPO 9+5 7+5 63 +32 58 +3 26 +4 27 +3 32.0
GRPO 13 +9 9+7 64 +33 57 +2 23 +0 26 +2 32.5
FlowRL 15 +11 10 +8 54 +23 67 +12 31 +9 34 +10 35.6
Table 1|Results on math benchmarks. Relative improvements over the backbone model are shown as subscripts. Positive gains are shown in color. FlowRL outperforms all baselines across both 7B and 32B model scales.
Models LiveCodeBench CodeForces HumanEval+
Avg@16 Pass@16 Rating Percentile Avg@16
DeepSeek-R1-Distill-Qwen-7B, Max Response Len=8K
Backbone30.7 886.7 80.9
R++ 30 − 52 +3 1208 +321 56 +37 76 −
PPO 35 +4 54 +5 1403 +516 73 +54 82 +1
GRPO 32 +2 52 +2 1313 +427 67 +47 80 −
FlowRL 37 +6 56 +6 1549 +662 83 +63 83 +2
Table 2|Results on code benchmarks. Relative improvements over the backbone model are shown as subscripts. Positive gains are shown in color. FlowRL achieves the
strongest performance across all three benchmarks, demonstrating its effectiveness in code reasoning
tasks.
During generation, we use sampling parameters of temperature = 0.6 and top_p = 0.95 for all evaluations. The response length for evaluation is set to 8,192 tokens, consistent with the training configuration.
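As a small illustration of the reported metrics (our own sketch, not code from the paper), Avg@k averages per-rollout correctness for a problem, while Pass@k as reported here checks whether any of the k rollouts is correct:

```python
from typing import List

def avg_at_k(correct: List[bool]) -> float:
    """Avg@k: mean accuracy over k independent rollouts for one problem."""
    return sum(correct) / len(correct)

def pass_at_k(correct: List[bool]) -> float:
    """Pass@k: 1.0 if at least one of the k rollouts is correct, else 0.0."""
    return 1.0 if any(correct) else 0.0

rollouts = [True, False, False, True] + [False] * 12  # 16 rollouts for one problem
print(avg_at_k(rollouts), pass_at_k(rollouts))        # 0.125 1.0
```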
6. Experimental Results
6.1. Main Results
Our experimental results, summarized in Tables 1 and 2, demonstrate that FlowRL consistently outperforms all reward-maximization baselines across both math and code reasoning domains. On the math benchmarks (Table 1), FlowRL achieves the highest average accuracy of 35.6% with the 7B model and 48.4% with the 32B model, surpassing PPO by 5.1% and GRPO by 10.1% on the 32B model. FlowRL shows strong improvements on challenging benchmarks like MATH-500 and Olympiad problems, demonstrating consistent gains
Method AIME 2024 AIME 2025 AMC 2023 MATH-500 Minerva Olympiad Avg
FlowRL 15.41 10.83 54.53 66.96 31.41 34.61 35.63
w/o IS 6.25 7.91 41.40 56.97 22.19 25.52 26.71
Zhang et al. [2025a] 10.41 6.66 53.75 66.50 30.97 33.72 33.67
Table 3|Ablation study on FlowRL with Qwen2.5-7B as the base model. Avg@16 accuracy is reported
across six math reasoning benchmarks. IS denotes importance sampling.
across diverse mathematical domains. On code generation tasks, FlowRL achieves compelling im-
provements with the highest Avg@16 score of 37.43% on LiveCodeBench, a Codeforces rating of
1549.47 with 83.3% percentile ranking, and 83.28% accuracy on HumanEval+, outperforming all
baselines across the board. These consistent performance gains across both domains and model scales
provide strong empirical evidence that FlowRL’s flow-balanced optimization successfully enhances
generalization. This improvement comes from promoting diverse solution exploration compared to
previous reward-maximizing RL approaches.
6.2. Ablation Study

Figure 3|Ablation study on the β hyperparameter in FlowRL (average math score: 31.34 at β = 5, 34.41 at β = 10, 35.63 at β = 15, 35.09 at β = 30). β = 15 achieves the best performance.
We conduct ablation studies on importance sampling and the β hyperparameter. For importance sampling, we compared the performance with and without it, and implemented a combined loss approach proposed by Zhang et al. [2025a] that simultaneously optimizes both GFlowNets and PPO objectives. This combined loss focuses on optimizing diffusion models, and we adapt it to long CoT reasoning tasks for comparison. Table 3 shows that importance sampling substantially improves FlowRL performance across all math reasoning benchmarks. Compared to Zhang et al. [2025a], using importance sampling as a trajectory-level ratio is more suitable than the combined loss of GFlowNets and PPO. The performance drop without importance sampling (from 35.63% to 26.71%) highlights the critical role of correcting for distribution mismatch between rollout generation and policy training. For the hyperparameter β, we conduct a series of parameter ablation studies, and Figure 3 shows that β = 15 achieves optimal performance, with detailed results shown in Table 7.
7. Analysis
7.1. Diversity Analysis
To assess solution diversity, we follow the approach of prior work and employ GPT-4o-mini [OpenAI, 2024] to evaluate all responses generated by each method on AIME 24/25. The evaluation prompt is shown in the appendix. As shown in Figure 4, FlowRL achieves higher diversity scores
compared to baseline methods. This demonstrates that FlowRL improves sample diversity compared
to baselines, which tend to exhibit repetitive solution patterns. This diversity evaluation reveals
Table 4|Case study comparing GRPO and FlowRL rollouts on an AIME problem. GRPO exhibits
repetitive patterns (AM-GM×3, identity loops×2), while FlowRL follows a more diverse solution
path.
Content (boxed = key actions; "×n" = repeated n times)

Question: Let $\mathcal{B}$ be the set of rectangular boxes with surface area 54 and volume 23. Let $r$ be the radius of the smallest sphere that can contain each box in $\mathcal{B}$. If $r^2 = \frac{p}{q}$ with $\gcd(p, q) = 1$, find $p + q$.

GRPO: "... denote $x, y, z$ ... $2(xy + yz + zx) = 54$, $xyz = 23$ ... $r = \frac{1}{2}\sqrt{x^2 + y^2 + z^2}$ ... $(x + y + z)^2 = x^2 + y^2 + z^2 + 2(xy + yz + zx)$ ... AM–GM ×3: AM–GM (1) ... AM–GM (2) ... AM–GM (3) ... $(x + y + z)/3$ ... identity loop ×2: loop (1) ... loop (2) ... $x = y = z$ ... back to $(x + y + z)^2$ ... no factorization ..."

FlowRL: "... let $x, y, z$ with $2(xy + yz + zx) = 54$, $xyz = 23$ ... $r = \frac{1}{2}\sqrt{x^2 + y^2 + z^2}$ ... $(x + y + z)^2 \Rightarrow x^2 + y^2 + z^2 = (x + y + z)^2 - 54$ ... set $x = y$ ... cubic $x^3 - 27x + 46 = 0$ ... rational root $x = 2$ ... factor $(x - 2)(x^2 + 2x - 23)$ ... branch $x = -1 + 2\sqrt{6}$ ... back-substitution ... $x^2 + y^2 + z^2 = \frac{657}{16}$ ... $r^2 = \frac{657}{64}$ ... Answer 721 ..."
significant differences in exploration patterns across methods. This nearly doubling of diversity score
compared to the strongest baseline (PPO) indicates that FlowRL generates qualitatively different
solution approaches rather than minor variations of the same strategy. The diversity analysis provides
empirical validation of our core hypothesis that flow-balanced optimization promotes mode coverage
in complex reasoning tasks.
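For readers who want to reproduce this kind of LLM-judged diversity scoring, the following is a rough sketch using the OpenAI Python client; the condensed prompt wording and the integer parsing of the reply are our assumptions, and the full rubric is given in the appendix prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_diversity(problem: str, responses: list) -> int:
    """Ask GPT-4o-mini to rate the methodological diversity of solution attempts (1-5)."""
    formatted = "\n\n".join(f"[Attempt {i + 1}]\n{r}" for i, r in enumerate(responses))
    prompt = (
        "Rate the diversity of problem-solving strategies across these solution attempts "
        "on a scale from 1 (minimal diversity) to 5 (maximum diversity). "
        "Reply with a single integer.\n\n"
        f"PROBLEM:\n{problem}\n\n16 SOLUTION ATTEMPTS:\n{formatted}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(reply.choices[0].message.content.strip())
```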
7.2. Case Study

Figure 4|GPT-judged diversity scores on rollouts of AIME 24/25 problems (R++: 1.11, GRPO: 1.23, PPO: 1.31, FlowRL: 2.28). FlowRL generates more diverse solutions than R++, GRPO, and PPO.
Table 4 compares rollouts from GRPO and FlowRL on a representative AIME problem. GRPO exhibits repetitive patterns, applying AM-
GM three times and getting stuck in identity loops,
failing to solve the problem. FlowRL explores more
diverse actions: it sets $x = y$, derives a cubic equation,
finds the rational root, and reaches the correct an-
swer. This shows that FlowRL successfully avoids the
repetitive exploration patterns. The contrast reveals
fundamental differences in exploration strategies:
GRPO’s reward-maximizing approach leads to ex-
ploitation of familiar techniques (AM-GM inequality)
without exploring alternatives, eventually reaching
contradictory conclusions like $x = y = z$. In contrast, FlowRL's distribution-matching enables strategic decisions such as the symmetry assumption $x = y$, which transforms the problem into a tractable cubic equation $x^3 - 27x + 46 = 0$, allowing systematic solution
through rational root testing and polynomial factorization.
8. Conclusion
In this work, we introduce FlowRL, which transforms scalar rewards into normalized target distribu-
tions using a learnable partition function and minimizes the reverse KL divergence between the policy
and target distribution. We demonstrate that this approach is theoretically equivalent to trajectory
balance objectives from GFlowNets and implicitly maximizes both reward and entropy, thereby pro-
moting diverse reasoning trajectories. To further address gradient explosion and sampling mismatch
issues in long CoT reasoning, we incorporate importance sampling and length normalization. Through
experiments on math and code reasoning benchmarks, FlowRL achieves consistent improvements
across all tasks compared to GRPO and PPO. Our diversity analysis and case studies confirm that
FlowRL generates more varied solution approaches while avoiding repetitive patterns.
Acknowledgments
We are grateful to Mingqian Feng and Yuetai Li for their valuable discussions and feedback, which
helped improve the quality of this work.
References
Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the
impact of entropy on policy optimization. In International Conference on Machine Learning, pages 151–160. PMLR, 2019.
Brian R Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee,
Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, and Bhavya Kailkhura. Trajectory balance with
asynchrony: Decoupling exploration and learning for fast, scalable llm post-training.
arXiv:2503.18929, 2025.
Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, and Yoshua Bengio. Flow net-
work based generative models for non-iterative diverse candidate generation. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward J. Hu, Mo Tiwari, and Emmanuel Bengio.
Gflownet foundations. Journal of Machine Learning Research, 24(210):1–55, 2023a. URL http://jmlr.org/papers/v24/22-0364.html.
Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward J Hu, Mo Tiwari, and Emmanuel Bengio.
Gflownet foundations. Journal of Machine Learning Research, 24(1):10006–10060, 2023b.
Chen-Hao Chao, Chien Feng, Wei-Fang Sun, Cheng-Kuang Lee, Simon See, and Chun-Yi Lee. Maximum
entropy reinforcement learning via energy-based normalizing flow.,
2024.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared
Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen
Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray,
Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter,
Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth
Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang,
Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N.
Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles
Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish,
Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.
Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu
Wei. Reasoning with exploration: An entropy perspective., 2025.
Miruna Cretu, Charles Harris, Ilia Igashov, Arne Schneuing, Marwin Segler, Bruno Correia, Julien
Roy, Emmanuel Bengio, and Pietro Liò. Synflownet: Design of diverse and novel molecules with
synthesis constraints., 2024.
Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan,
Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning
language models., 2025.
DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,
2025.
Tristan Deleu, Padideh Nouri, Nikolay Malkin, Doina Precup, and Yoshua Bengio. Discrete probabilistic
inference as control in multi-path environments., 2024.
Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen,
Jiazhen Du, Huiyang Wang, Fuzheng Zhang, et al. Agentic reinforced policy optimization.
arXiv preprint arXiv:2507.19849, 2025.
Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models. Advances in Neural Information Processing Systems, 32, 2019.
Benjamin Eysenbach and Sergey Levine. Maximum entropy rl (provably) solves some robust rl
problems., 2021.
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In
International Conference on Machine Learning, pages 10835–10866. PMLR, 2023.
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shi-
rong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via
reinforcement learning., 2025.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy
maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han,
Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi
with olympiad-level bilingual multimodal scientific problems.,
2024.
Haoran He, Can Chang, Huazhe Xu, and Ling Pan. Looking backward: Retrospective backward
synthesis for goal-conditioned GFlowNets. In International Conference on Learning Representations, 2025.
Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey, and R M Neal. The “wake-sleep” algorithm for
unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
Edward J Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio,
and Nikolay Malkin. Amortizing intractable inference in large language models.
arXiv:2310.04363, 2023.
Edward J Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio,
and Nikolay Malkin. Amortizing intractable inference in large language models. In
International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Ouj6p4ca60.
Jian Hu, Jason Klein Liu, and Wei Shen. Reinforce++: An efficient rlhf algorithm with robustness to
both prompt and reward models, 2025.
Moksh Jain, Emmanuel Bengio, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Bonaventure F.P.
Dossou, Chanakya Ekbote, Jie Fu, Tianyu Zhang, Micheal Kilgour, Dinghuai Zhang, Lena Simine,
Payel Das, and Yoshua Bengio. Biological sequence design with GFlowNets. In International Conference on Machine Learning (ICML), 2022.
Moksh Jain, Tristan Deleu, Jason S. Hartford, Cheng-Hao Liu, Alex Hernández-García, and Yoshua
Bengio. Gflownets for ai-driven scientific discovery. CoRR, abs/2302.00615, 2023a. URL https://api.semanticscholar.org/CorpusID:256459319.
Moksh Jain, Sharath Chandra Raparthy, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Yoshua Bengio,
Santiago Miret, and Emmanuel Bengio. Multi-objective GFlowNets. In International Conference on Machine Learning (ICML), 2023b.
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando
Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free
evaluation of large language models for code., 2024.
Koray Kavukcuoglu. Gemini 2.5: Our most intelligent AI model, 2025. URL https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/. Google Blog (The Keyword), Published Mar. 25, 2025.
Minsu Kim, Taeyoung Yun, Emmanuel Bengio, Dinghuai Zhang, Yoshua Bengio, Sungsoo Ahn, and
Jinkyoo Park. Local search gflownets., abs/2310.02710, 2023.
Minsu Kim, Joohwan Ko, Taeyoung Yun, Dinghuai Zhang, Ling Pan, Woochang Kim, Jinkyoo Park,
Emmanuel Bengio, and Yoshua Bengio. Learning to scale logits for temperature-conditional
gflownets, 2024.
Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi,
Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, et al. Learning diverse attacks on large language
models for robust red-teaming and safety tuning., 2024.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ra-
masesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam
Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language
models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,
Advances in Neural Information Processing Systems, volume 35, pages 3843–3857. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/18abbeef8cfe9203fdf9053c9c4fe191-Paper-Conference.pdf.
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan
Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.
arXiv:2305.20050, 2023a.
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan
Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In
International Conference on Learning Representations, 2023b.
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching
for generative modeling. In International Conference on Learning Representations, 2023.
Dianbo Liu, Moksh Jain, Bonaventure F. P. Dossou, Qianli Shen, Salem Lahlou, Anirudh Goyal, Nikolay
Malkin, Chris C. Emezue, Dinghuai Zhang, Nadhir Hassen, Xu Ji, Kenji Kawaguchi, and Yoshua
Bengio. Gflowout: Dropout with generative flow networks. In International Conference on Machine Learning, 2022.
Mingjie Liu, Shizhe Diao, Jian Hu, Ximing Lu, Xin Dong, Hao Zhang, Alexander Bukharin, Shaokun
Zhang, Jiaqi Zeng, Makesh Narsimhan Sreedhar, et al. Scaling up rl: Unlocking diverse reasoning
in llms via prolonged training., 2025a.
Zhen Liu, Tim Z Xiao, Weiyang Liu, Yoshua Bengio, and Dinghuai Zhang. Efficient diversity-preserving diffusion alignment via gradient-informed gflownets, 2025b.
Michael Luo, Sijun Tan, Roy Huang, Xiaoxiang Shi, Rachel Xin, Colin Cai, Ameen Patel, Alpay Ariyak,
Qingyang Wu, Ce Zhang, Li Erran Li, Raluca Ada Popa, Ion Stoica, and Tianjun Zhang. Deepcoder:
A fully open-source 14b coder at o3-mini level, 2025. Notion Blog.
Jiangyan Ma, Emmanuel Bengio, Yoshua Bengio, and Dinghuai Zhang. Baking symmetry into
gflownets.
MAA. American mathematics competitions - amc., 2023.
MAA. American invitational mathematics examination - aime., 2025.
Kanika Madan, Jarrid Rector-Brooks, Maksym Korablyov, Emmanuel Bengio, Moksh Jain, Andrei Cris-
tian Nica, Tom Bosc, Yoshua Bengio, and Nikolay Malkin. Learning gflownets from partial episodes
for improved convergence and stability. In International Conference on Machine Learning, pages
23467–23483. PMLR, 2023.
Nikolay Malkin, Moksh Jain, Emmanuel Bengio, Chen Sun, and Yoshua Bengio. Trajectory balance:
Improved credit assignment in gflownets. Advances in Neural Information Processing Systems, 35:5955–5967, 2022.
Nikolay Malkin, Salem Lahlou, Tristan Deleu, Xu Ji, Edward Hu, Katie Everett, Dinghuai Zhang,
and Yoshua Bengio. GFlowNets and variational inference. In International Conference on Learning Representations (ICLR), 2023.
David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng,
and Angjoo Kanazawa. Flow matching policy gradients., 2025.
Sobhan Mohammadpour, Emmanuel Bengio, Emma Frejinger, and Pierre-Luc Bacon. Maximum
entropy gflownets with soft q-learning. In International Conference on Artificial Intelligence and Statistics, pages 2593–2601. PMLR, 2024.
OpenAI. Gpt-4o mini. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, 2024.
Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping
and mitigating misaligned models., 2022.
Ling Pan, Moksh Jain, Kanika Madan, and Yoshua Bengio. Pre-training and fine-tuning generative
flow networks, 2023a.
Ling Pan, Nikolay Malkin, Dinghuai Zhang, and Yoshua Bengio. Better training of GFlowNets with
local credit and incomplete trajectories.,
2023b.
Ling Pan, Dinghuai Zhang, Aaron Courville, Longbo Huang, and Yoshua Bengio. Generative augmented
flow networks., 2023c.
Ling Pan, Dinghuai Zhang, Moksh Jain, Longbo Huang, and Yoshua Bengio. Stochastic generative
flow networks., 2023d.
Seohong Park, Qiyang Li, and Sergey Levine. Flow q-learning. In International Conference on Machine Learning, 2025.
Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching, Agustín Pi-
queres Lajarín, Quentin Gallouédec, Nathan Habib, Lewis Tunstall, and Leandro von Werra.
Codeforces., 2025.
Samuel Pfrommer, Yixiao Huang, and Somayeh Sojoudi. Reinforcement learning for flow-matching
policies., 2025.
Abhinav Rastogi, Albert Q Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, Jason Rute, Joep
Barmentlo, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, et al. Magistral.
arXiv preprint arXiv:2506.10910, 2025.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan
Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open
language models., 2024.
Max W. Shen, Emmanuel Bengio, Ehsan Hajiramezanali, Andreas Loukas, Kyunghyun Cho, and Tom-
maso Biancalani. Towards understanding and improving gflownet training. CoRR, abs/2305.07170, 2023.
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng,
Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.
arXiv: 2409.19256, 2024.
Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing
reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022.
Richard S Sutton, Andrew G Barto, et al. Reinforcement learning.,
11(1):126–134, 1999a.
Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods
for reinforcement learning with function approximation. In S. Solla, T. Leen, and K. Müller,
editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999b. URL https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf.
Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/.
Daniil Tiapkin, Nikita Morozov, Alexey Naumov, and Dmitry P Vetrov. Generative flow networks as
entropy-regularized rl. In International Conference on Artificial Intelligence and Statistics, pages 4213–4221. PMLR, 2024.
Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen,
Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive
effective reinforcement learning for llm reasoning., 2025.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou,
et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Fangxu Yu, Lai Jiang, Haoqiang Kang, Shibo Hao, and Lianhui Qin. Flow of reasoning: Training
llms for divergent reasoning with minimal examples. In International Conference on Machine Learning, 2025a.
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong
Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.
arXiv preprint arXiv:2503.14476, 2025b.
Taeyoung Yun, Dinghuai Zhang, Jinkyoo Park, and Ling Pan. Learning to sample effective and diverse
prompts for text-to-image generation. In
Conference, pages 23625–23635, 2025.
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with
reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
David W. Zhang, Corrado Rainone, Markus F. Peschl, and Roberto Bondesan. Robust scheduling with
gflownets. CoRR, abs/2302.05446, 2023a. URL https://api.semanticscholar.org/CorpusID:256827133.
Dinghuai Zhang, Ricky T. Q. Chen, Nikolay Malkin, and Yoshua Bengio. Unifying generative models
with GFlowNets and beyond., 2022a.
Dinghuai Zhang, Nikolay Malkin, Zhen Liu, Alexandra Volokhova, Aaron Courville, and Yoshua
Bengio. Generative flow networks for discrete probabilistic modeling. In International Conference on Machine Learning (ICML), 2022b.
Dinghuai Zhang, Hanjun Dai, Nikolay Malkin, Aaron C. Courville, Yoshua Bengio, and Ling Pan.
Let the flows tell: Solving graph combinatorial optimization problems with gflownets. CoRR, abs/2305.17010, 2023b.
Dinghuai Zhang, Ricky T. Q. Chen, Cheng-Hao Liu, Aaron Courville, and Yoshua Bengio. Diffusion
generative flow samplers: Improving learning signals through partial trajectory optimization, 2024a.
Dinghuai Zhang, Ling Pan, Ricky T. Q. Chen, Aaron Courville, and Yoshua Bengio. Distributional
gflownets with quantile flows, 2024b.
Dinghuai Zhang, Yizhe Zhang, Jiatao Gu, Ruixiang ZHANG, Joshua M. Susskind, Navdeep Jaitly,
and Shuangfei Zhai. Improving GFlownets for text-to-image diffusion alignment. Transactions on Machine Learning Research, 2025a. ISSN 2835-8856. URL https://openreview.net/forum?id=XDbY3qhM42.
Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian,
Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models.
arXiv preprint arXiv:2509.08827, 2025b.
Mingyang Zhou, Zichao Yan, Elliot Layne, Nikolay Malkin, Dinghuai Zhang, Moksh Jain, Mathieu
Blanchette, and Yoshua Bengio. Phylogfn: Phylogenetic inference with generative flow networks,
2024.
Heiko Zimmermann, Fredrik Lindsten, J.-W. van de Meent, and Christian Andersson Naesseth. A
variational perspective on generative flow networks. CoRR, abs/2210.07992, 2022. URL https://api.semanticscholar.org/CorpusID:252907672.
A. Proof of Proposition 1
We begin by analyzing the gradient of the Kullback–Leibler (KL) divergence between the policy $\pi_\theta(y\mid x)$ and the reward distribution $\exp\big(\beta\, r(x,y)\big)/Z_\phi(x)$:

$$
\begin{aligned}
\nabla_\theta D_{\mathrm{KL}}\!\left(\pi_\theta(y\mid x)\,\Big\|\,\frac{\exp\big(\beta\, r(x,y)\big)}{Z_\phi(x)}\right)
&= \nabla_\theta \int \pi_\theta(y\mid x)\,\log\!\left[\frac{\pi_\theta(y\mid x)\,Z_\phi(x)}{\exp\big(\beta\, r(x,y)\big)}\right] dy \\
&= \int \nabla_\theta \pi_\theta(y\mid x)\,\log\!\left[\frac{\pi_\theta(y\mid x)\,Z_\phi(x)}{\exp\big(\beta\, r(x,y)\big)}\right] dy
+ \int \pi_\theta(y\mid x)\,\nabla_\theta \log \pi_\theta(y\mid x)\, dy \\
&= \int \pi_\theta(y\mid x)\,\nabla_\theta \log \pi_\theta(y\mid x)\,\log\!\left[\frac{\pi_\theta(y\mid x)\,Z_\phi(x)}{\exp\big(\beta\, r(x,y)\big)}\right] dy
+ \underbrace{\int \nabla_\theta \pi_\theta(y\mid x)\, dy}_{=\,\nabla_\theta 1\,=\,0} \\
&= \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\!\left[\log\!\left(\frac{\pi_\theta(y\mid x)\,Z_\phi(x)}{\exp\big(\beta\, r(x,y)\big)}\right)\cdot \nabla_\theta \log \pi_\theta(y\mid x)\right]
\end{aligned}
\tag{8}
$$

Next, consider the trajectory balance objective used in GFlowNets learning [Bartoldson et al., 2025; Malkin et al., 2022, inter alia], defined as:

$$
\mathcal{L}_{\mathrm{TB}}(y;\theta) = \left(\log \frac{\pi_\theta(y\mid x)\,Z_\phi(x)}{\exp\big(\beta\, r(x,y)\big)}\right)^{2}.
$$

Taking the gradient of this objective with respect to $\theta$, with samples drawn from the current policy $\pi_\theta(\cdot\mid x)$, we obtain:

$$
\nabla_\theta \mathcal{L}_{\mathrm{TB}} = \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\!\left[2\,\log\!\left(\frac{\pi_\theta(y\mid x)\,Z_\phi(x)}{\exp\big(\beta\, r(x,y)\big)}\right)\cdot \nabla_\theta \log \pi_\theta(y\mid x)\right]
\tag{10}
$$

Thus, minimizing the KL divergence is equivalent (up to a constant) to minimizing the trajectory balance loss, confirming Proposition 1.
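The gradient equivalence can also be checked numerically on a toy categorical policy; the sketch below is our own sanity check (β, the outcome space size, and the use of the exact partition function are toy choices), and it confirms that the two gradients agree up to a factor of 2:

```python
import torch

torch.manual_seed(0)
K, beta = 5, 1.0
logits = torch.randn(K, requires_grad=True)   # toy policy over K discrete outcomes
r = torch.rand(K)                              # fixed scalar rewards
log_Z = torch.logsumexp(beta * r, dim=0)       # exact partition function, held fixed

# Exact reverse KL between pi_theta and the reward distribution exp(beta * r) / Z.
pi = torch.softmax(logits, dim=0)
kl = (pi * (torch.log(pi) + log_Z - beta * r)).sum()
(grad_kl,) = torch.autograd.grad(kl, logits)

# Trajectory balance loss with the sampling distribution detached (on-policy samples).
logits2 = logits.detach().clone().requires_grad_(True)
pi2 = torch.softmax(logits2, dim=0)
tb = (pi2.detach() * (log_Z + torch.log(pi2) - beta * r) ** 2).sum()
(grad_tb,) = torch.autograd.grad(tb, logits2)

print(torch.allclose(grad_tb, 2 * grad_kl, atol=1e-5))  # True: equal up to a factor of 2
```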
B. Interpretation of FlowRL
We conduct an interpretation of FlowRL that clarifies the role of each component in the objective.
Proposition 5. Minimizing the FlowRL objective is equivalent to jointly maximizing reward and policy entropy:

$$
\max_\theta\;
\underbrace{\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\big[\beta\, \hat{r}(x,y)\big]}_{\text{reward}}
\;-\; \log Z_\phi(x)
\;+\; \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\big[\log \pi_{\mathrm{ref}}(y\mid x)\big]
\;+\; \underbrace{\mathcal{H}\big(\pi_\theta(\cdot\mid x)\big)}_{\text{entropy}}.
$$

Remark 6 (FlowRL beyond reward maximization). Proposition 5 shows that minimizing the FlowRL objective can be interpreted as jointly maximizing expected reward and policy entropy. This shift encourages the policy to explore a broader set of high-quality solutions, enabling more diverse and generalizable behaviors on reasoning tasks. Our interpretation also aligns with prior work that views GFlowNets training as a form of maximum entropy RL [Deleu et al., 2024; Tiapkin et al., 2024, inter alia].
The proof of Proposition 5 proceeds as follows. Recall that, with the redefined reward of Eq. (4), the FlowRL objective minimizes the following KL divergence:

$$
D_{\mathrm{KL}}\!\left(\pi_\theta(y\mid x)\;\Big\|\;\frac{\exp\big(\beta\, \hat{r}(x,y)\big)\cdot\pi_{\mathrm{ref}}(y\mid x)}{Z_\phi(x)}\right)
= \int \pi_\theta(y\mid x)\,\log\!\left[\frac{\pi_\theta(y\mid x)\,Z_\phi(x)}{\exp\big(\beta\, \hat{r}(x,y)\big)\cdot\pi_{\mathrm{ref}}(y\mid x)}\right] dy
\tag{12}
$$
Rearranging the terms, we obtain:

$$
\begin{aligned}
\arg\min_\theta\; D_{\mathrm{KL}}\!\left(\pi_\theta(y\mid x)\;\Big\|\;\frac{\exp\big(\beta\, \hat{r}(x,y)\big)\cdot\pi_{\mathrm{ref}}(y\mid x)}{Z_\phi(x)}\right)
&= \arg\min_\theta \int \pi_\theta(y\mid x)\,\log\!\left[\frac{\pi_\theta(y\mid x)\,Z_\phi(x)}{\exp\big(\beta\, \hat{r}(x,y)\big)\cdot\pi_{\mathrm{ref}}(y\mid x)}\right] dy \\
&= \arg\min_\theta \left\{ -\,\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\log\!\left[\frac{\exp\big(\beta\, \hat{r}(x,y)\big)\cdot\pi_{\mathrm{ref}}(y\mid x)}{Z_\phi(x)}\right] + \int \pi_\theta(y\mid x)\,\log \pi_\theta(y\mid x)\, dy \right\} \\
&= \arg\max_\theta \left\{ \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\log\!\left[\frac{\exp\big(\beta\, \hat{r}(x,y)\big)\cdot\pi_{\mathrm{ref}}(y\mid x)}{Z_\phi(x)}\right] + \mathcal{H}\big(\pi_\theta(\cdot\mid x)\big) \right\}
\end{aligned}
\tag{13}
$$
Finally, we express the FlowRL objective in its compact form:

$$
\max_\theta\;
\underbrace{\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\big[\beta\, \hat{r}(x,y)\big]}_{\text{reward}}
\;-\; \underbrace{\log Z_\phi(x)}_{\text{normalization}}
\;+\; \underbrace{\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\big[\log \pi_{\mathrm{ref}}(y\mid x)\big]}_{\text{prior alignment}}
\;+\; \underbrace{\mathcal{H}\big(\pi_\theta(\cdot\mid x)\big)}_{\text{entropy}}.
$$
Therefore, minimizing the FlowRL objective can be interpreted as jointly maximizing reward and entropy, while also aligning the policy with a structured prior. The reward term drives task performance, while the normalization term $Z_\phi(x)$ ensures consistency with a properly normalized target distribution. This encourages the policy $\pi_\theta$ to cover the entire reward-weighted distribution rather than collapsing to a few high-reward modes. The reference policy $\pi_{\mathrm{ref}}$ provides inductive bias that regularizes the policy toward desirable structures, and the entropy term $\mathcal{H}(\pi_\theta)$ encourages diversity in sampled solutions. Together, these components promote better generalization of FlowRL.
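The decomposition above is an algebraic identity and can be verified numerically on a toy categorical policy; the sketch below is our own check, with arbitrary toy values for β, the reward vector, and the partition estimate:

```python
import torch

torch.manual_seed(1)
K, beta = 6, 2.0
pi = torch.softmax(torch.randn(K), dim=0)      # toy policy pi_theta
pi_ref = torch.softmax(torch.randn(K), dim=0)  # toy reference policy pi_ref
r = torch.rand(K)                               # toy (group-normalized) rewards
log_Z = torch.tensor(3.0)                       # arbitrary partition estimate log Z_phi

# Reward + prior alignment - normalization + entropy (the compact objective above).
entropy = -(pi * torch.log(pi)).sum()
objective = (pi * (beta * r + torch.log(pi_ref))).sum() - log_Z + entropy

# Negative reverse KL to the unnormalized target exp(beta * r) * pi_ref / Z.
neg_kl = -(pi * (torch.log(pi) + log_Z - beta * r - torch.log(pi_ref))).sum()

print(torch.allclose(objective, neg_kl, atol=1e-6))  # True: maximizing one maximizes the other
```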
C. GFlowNets Background
We follow the notation of prior work [He et al., 2025, inter alia] to introduce the fundamentals of GFlowNets. Let $\mathcal{X}$ denote the space of compositional objects and $R$ be a reward function that assigns non-negative values to each object $x \in \mathcal{X}$. GFlowNets aim to learn a sequential, constructive sampling policy $\pi$ that generates objects $x$ with probabilities proportional to their rewards, i.e., $\pi(x) \propto R(x)$. This process can be represented as a directed acyclic graph (DAG) $\mathcal{G} = (\mathcal{S}, \mathcal{A})$, where the vertices $s \in \mathcal{S}$ are referred to as states, and the directed edges $(s \rightarrow s') \in \mathcal{A}$ are called actions. The generation of an object $x \in \mathcal{X}$ corresponds to a complete trajectory $\tau = (s_0 \rightarrow \cdots \rightarrow s_T) \in \mathcal{T}$ within the DAG, beginning at the initial state $s_0$ and ending at a terminal state $s_T \in \mathcal{X}$. The state flow $F$ is defined as a non-negative weight assigned to each state $s \in \mathcal{S}$. The forward policy $P_F(s' \mid s)$ specifies the transition probability to a child state $s'$, while the backward policy $P_B(s \mid s')$ specifies the transition probability to a parent state $s$. To this end, the detailed balance objective enforces local flow consistency across every edge $(s \rightarrow s') \in \mathcal{A}$:

$$
\forall (s \rightarrow s') \in \mathcal{A}: \quad F(s)\, P_F(s' \mid s) = F(s')\, P_B(s \mid s').
$$
To achieve this flow consistency, GFlowNets employ training objectives at different levels of granularity, including detailed balance [Bengio et al., 2023a], trajectory balance [Malkin et al., 2022], and sub-trajectory balance [Madan et al., 2023]. Leveraging their diversity-seeking behavior, GFlowNets have been successfully applied across a range of domains, including molecule generation [Cretu et al., 2024], diffusion fine-tuning [Liu et al., 2025b, inter alia], and amortized reasoning [Hu et al., 2023, 2024, inter alia]. Among the various training objectives in GFlowNets, trajectory balance maintains flow consistency at the trajectory level, defined as:

$$
Z_\phi \prod_{t=1}^{T} P_F(s_t \mid s_{t-1}; \theta) \;=\; R(x) \prod_{t=1}^{T} P_B(s_{t-1} \mid s_t; \theta).
$$

Furthermore, sub-trajectory balance achieves local balance on arbitrary subpaths $\tau_{m:n} = (s_m \rightarrow \cdots \rightarrow s_n)$, offering a more stable and less biased learning signal. We build on trajectory balance to extend our KL-based objective through a gradient-equivalence formulation (Prop. 1), and further improve it to better support long CoT reasoning in RL.
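For autoregressive sequence generation the state graph is a tree, so each partial sequence has a unique parent and the backward policy is trivially 1; trajectory balance then reduces to $Z \prod_t P_F(s_t \mid s_{t-1}) = R(x)$. The toy sketch below (our own illustration, not from the paper) constructs a forward policy that samples length-2 sequences exactly in proportion to their rewards and verifies the condition on every trajectory:

```python
import itertools

# Toy sequence space: all length-2 strings over a 2-token vocabulary, with positive rewards.
vocab = ["a", "b"]
R = {seq: float(i + 1) for i, seq in enumerate(itertools.product(vocab, repeat=2))}

Z = sum(R.values())  # true partition function
# Forward policy that samples sequences exactly in proportion to reward.
p_first = {t: sum(R[(t, u)] for u in vocab) / Z for t in vocab}
p_second = {(t, u): R[(t, u)] / sum(R[(t, v)] for v in vocab) for t in vocab for u in vocab}

# Trajectory balance holds on every complete trajectory: Z * prod_t P_F = R(y).
for (t1, t2), reward in R.items():
    assert abs(Z * p_first[t1] * p_second[(t1, t2)] - reward) < 1e-9
print("trajectory balance satisfied for all", len(R), "sequences")
```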
Models AIME 2024 AIME 2025 AMC 2023 MATH-500 Minerva Olympiad Avg
Qwen2.5-7B Base Model
Backbone4.37 23.02
R++ 10 +6 5 +3 66 +35 54 − 24 +2 27 +3 31.29
PPO 9 +5 7 +5 63 +32 57 +3 26 +3 27 +3 32.03
GRPO 14 +9 10 +8 64 +33 57 +2 23 +0 27 +3 32.76
FlowRL 14 +9 10 +7 55 +24 66 +12 31 +9 34 +10 35.39
Table 5|Math reasoning performance (Avg@64) at temperature=0.6. Relative improvements are
shown as subscripts, with positive gains in color. FlowRL consistently
outperforms all baselines and achieves the best average score under this low-temperature setting.
Models AIME 2024 AIME 2025 AMC 2023 MATH-500 Minerva Olympiad Avg
Qwen2.5-7B Base Model
Backbone3.39 18.20
R++ 10 +7 4 +3 66 +43 54 +9 23 +6 26 +8 31.19
PPO 10 +7 6 +5 63 +39 57 +12 25 +8 27 +8 31.77
GRPO 12 +9 10 +8 64 +40 57 +11 23 +6 26 +8 32.44
FlowRL 14 +10 9 +8 52 +29 66 +21 30 +13 34 +16 34.62
Table 6|Math reasoning performance (Avg@64) at temperature=1.0. Relative improvements are
shown as subscripts, with positive gains in color. FlowRL maintains robust performance under higher
generation randomness and continues to outperform all baselines on average.
20
FlowRL: Matching Reward Distributions for LLM Reasoning
β AIME 2024 AIME 2025 AMC 2023 MATH-500 Minerva Olympiad Avg
β = 5 13.54 10.00 56.09 58.91 20.79 28.72 31.34
β = 10 14.79 10.20 59.53 64.30 25.27 32.39 34.41
β = 15 15.41 10.83 54.53 66.96 31.41 34.61 35.63
β = 30 15.00 10.83 50.62 69.02 30.03 35.03 35.09
Table 7|Ablation study on the effect of the β parameter in FlowRL. We report Avg@16 accuracy across six math reasoning benchmarks for different values of β.
Diversity Evaluation Prompt
System:
problem. Focus on detecting even SUBTLE differences in methodology that indicate different problem-
solving strategies.
PROBLEM:
{problem}
16 SOLUTION ATTEMPTS:
{formatted_responses}
EVALUATION CRITERIA - Rate diversity from 1 to 5:
Score 1 - Minimal Diversity:
•
•
•
•
Score 2 - Low Diversity:
•
•
•
•
Score 3 - Moderate Diversity:
•
•
•
•
Score 4 - High Diversity:
•
•
•
•
Score 5 - Maximum Diversity:
•≤3 responses use same method)
•
•
•
IMPORTANT:
to 5.
21
FlowRL: Matching Reward Distributions for
LLM Reasoning
Xuekai Zhu
1
, Daixuan Cheng
6
, Dinghuai Zhang
3
, Hengli Li
5
, Kaiyan Zhang
4
, Che Jiang
4
, Youbang Sun
4
, Ermo Hua
4
, Yuxin
Zuo
4
, Xingtai Lv
4
, Qizheng Zhang
7
, Lin Chen
1
, Fanghao Shao
1
, Bo Xue
1
, Yunchong Song
1
, Zhenjie Yang
1
, Ganqu Cui
2
, Ning
Ding
4,
, Jianfeng Gao
3
, Xiaodong Liu
3
, Bowen Zhou
4,‡
, Hongyuan Mei
8‡
, Zhouhan Lin
1,‡
1
Shanghai Jiao Tong University
2
Shanghai AI Laboratory
3
Microsoft Research
4
Tsinghua University
5
Peking University
6
Renmin University of China
7
Stanford University
8
Toyota Technological Institute at Chicago
#
‡
Corresponding Authors.
Abstract|We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing
rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt
reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals
while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform
scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the
reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced
optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct
experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of .0%
over GRPO and .1%
These results highlight reward distribution-matching as a key step toward efficient exploration and diverse
reasoning in LLM reinforcement learning.Distribution-matching: FlowRL
KL=0.11 KL=8.68
Reward-maximizing∶ R++, PPO and GRPO
Math Average Score
CodeForcesRating
Figure 1|Top: Comparison between distribution-matching and reward-maximizing approaches.
FlowRL (left) learns to match the full reward distribution, maintaining diversity across multiple modes
with low KL divergence. In contrast, reward-maximizing methods like GRPO (right) concentrate on a
single high-reward peak, leading to mode collapse and higher KL divergence.: Performance
comparison. FlowRL consistently outperforms GRPO across math and code domains.
arXiv:2509.15207v1 [cs.LG] 18 Sep 2025
FlowRL: Matching Reward Distributions for LLM Reasoning
1.
Reinforcement learning (RL) plays a crucial role in the post-training of large language models
(LLMs) [Zhang et al.,]. A series of powerful reasoning models [Guo et al.,,,
2025,,] have employed large-scale reinforcement learning to achieve strong
performance on highly challenging benchmarks [He et al.,]. The evolution of RL algorithms
for LLM reasoning has progressed through several key stages: REINFORCE [Sutton et al.,]
provides a solid baseline that is easy to implement and efficient in simple settings; PPO [Schulman
et al.,] improves upon REINFORCE with better stability and efficiency in complex settings;
GRPO [Shao et al.,] simplifies PPO training by eliminating value functions and relying on group
comparisons, though at the cost of requiring more rollouts per update. However, all these methods
share a fundamental limitation in their reward-maximizing objective.
Reward-maximizing RL methods tend to overfit to the dominant mode of the reward distribu-
tion [Gao et al.,,,,,,,]. This often results
in limited diversity among generated reasoning paths and reduces generalization to less frequent yet
valid logical outcomes [Hu et al.,]. As illustrated in Figure 1, GRPO neglects other meaningful modes. These drawbacks become especially pronounced in complex long chain-of-thought (CoT; Wei et al., 2022) reasoning, where capturing a diverse distribution of plausible solutions is essential for
effective generalization [Liu et al.,]. Recent approaches adjust the clip ratio [Yu et al.,],
augment the advantage function with an entropy-based term [Cheng et al.,], or selectively
promote high-entropy tokens [Wang et al.,], thereby dynamically adapting the training data
distribution and implicitly increasing diversity during training. This raises a fundamental question:
How can we promote diverse exploration to prevent convergence to dominant solution patterns in RL
training?
In this paper, we propose FlowRL, a policy optimization algorithm that aligns the policy model
with the full reward distribution, encouraging mode coverage. FlowRL achieves more efficient
exploration by fundamentally shifting from reward maximization to reward distribution matching,
thereby addressing the inherent mode-collapse limitations of previous RL approaches. As illustrated
in Figure 1, the core idea of FlowRL is to introduce a learnable partition function that normalizes
scalar rewards into a target distribution, and to minimize the reverse KL divergence between the
policy and this reward-induced distribution. We develop this KL objective based on the trajectory
balance formulation from GFlowNets [Bengio et al.,], providing a gradient equivalence proof
that bridges generative modeling and policy optimization. To address the challenges of long CoT
training, we introduce two key technical solutions: (1) length normalization, which mitigates the gradient-explosion issues that occur with variable-length CoT reasoning, and (2) importance sampling, which corrects the distribution mismatch between generated rollouts and the current policy.
We compare FlowRL with mainstream RL algorithms including REINFORCE++, PPO, and GRPO
across math and code domains, using both base and distilled LLMs (7B, 32B). In the math domain, FlowRL outperforms GRPO and PPO by 10.0% and 5.1%, respectively, demonstrating consistent
improvements across six challenging math benchmarks. Furthermore, FlowRL surpasses both PPO
and GRPO on three challenging coding benchmarks, highlighting its strong generalization capabilities
in code reasoning tasks. To understand what drives these performance gains, we analyze the diversity
of generated reasoning paths. This diversity analysis confirms that FlowRL generates substantially
more diverse rollouts than baseline methods, validating our approach’s effectiveness in exploring
multiple solution strategies.
Contributions.
•
We propose FlowRL, a policy optimization algorithm that shifts from reward maximization to
reward distribution matching via flow balance, encouraging diverse reasoning path exploration
while addressing the inherent mode-collapse limitations of existing RL methods.
•
We introduce length normalization and importance sampling to enable effective training on variable-
length CoT reasoning, addressing gradient explosion and sampling mismatch issues.
•
FlowRL outperforms GRPO and PPO by 10.0% and 5.1% respectively across math benchmarks and
demonstrates strong generalization on code reasoning tasks, with diversity analysis confirming
substantially more diverse solution exploration.
2. Preliminaries
Reinforcement Learning for Reasoning. We consider RL post-training for reasoning, where the policy model receives a question $x \in \mathcal{X}$ and generates an answer $y \in \mathcal{Y}$. The objective is to learn a policy $\pi_\theta(y \mid x)$ that produces high-quality answers under task-specific reward signals $r$. To better illustrate the policy optimization procedure, we provide a detailed formulation of GRPO below. For each question, GRPO samples a group of answers $\{y_1, y_2, \ldots, y_G\}$ from the old policy $\pi_{\theta_{\mathrm{old}}}$ and updates the model by maximizing the following objective:
$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \left( \min\!\left( \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})} \hat{A}_{i,t},\ \mathrm{clip}\!\left( \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})},\ 1-\varepsilon,\ 1+\varepsilon \right) \hat{A}_{i,t} \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right) \right) \right], \tag{1}$$

$$\mathbb{D}_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right) = \frac{\pi_{\mathrm{ref}}(y_{i,t} \mid x, y_{i,<t})}{\pi_\theta(y_{i,t} \mid x, y_{i,<t})} - \log \frac{\pi_{\mathrm{ref}}(y_{i,t} \mid x, y_{i,<t})}{\pi_\theta(y_{i,t} \mid x, y_{i,<t})} - 1,$$
where $\beta$ and $\varepsilon$ are hyper-parameters. Here, $\hat{A}_i$ denotes the advantage, computed by normalizing the group reward values $\{r_1, r_2, \ldots, r_G\}$ as $\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \ldots, r_G\})}{\mathrm{std}(\{r_1, r_2, \ldots, r_G\})}$.
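To make the group-relative advantage and the clipped surrogate in Eq. (1) concrete, the following sketch (our own illustration, not the authors' released code; tensor shapes and the clipping value ε = 0.2 are assumptions) computes GRPO-style advantages and the token-level clipped objective in PyTorch.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-normalized advantage: A_i = (r_i - mean(r)) / std(r)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def clipped_surrogate(logp_new: torch.Tensor, logp_old: torch.Tensor,
                      advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Token-level clipped objective (to be maximized), as in Eq. (1) without the KL term.

    logp_new, logp_old: [G, T] per-token log-probs under the current / old policy.
    advantages:         [G] group-normalized advantages, broadcast over tokens.
    """
    ratio = torch.exp(logp_new - logp_old)                  # importance ratio per token
    adv = advantages[:, None]                               # [G, 1] -> broadcast to [G, T]
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return torch.minimum(unclipped, clipped).mean()

# Toy usage: a group of G = 8 rollouts with T = 16 tokens each.
G, T = 8, 16
rewards = torch.tensor([1., 0., 0., 1., 1., 0., 1., 0.])
logp_old = -torch.rand(G, T)                                # placeholder log-probs
logp_new = logp_old + 0.01 * torch.randn(G, T)
loss = -clipped_surrogate(logp_new, logp_old, grpo_advantages(rewards))
```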
Compared to GRPO, REINFORCE
applies the policy gradient directly, without advantage normalization, clipping, or KL regularization.
PPO uses a critic model to estimate the advantage and employs importance sampling to stabilize
policy updates.
Figure 2 | GFlowNets [Bengio et al.,]: a flow-balance perspective on reinforcement learning. The initial flow $Z_\phi(s_0)$ injects probability mass into the environment, which is transported through intermediate states by the policy $\pi_\theta$ and accumulated at terminal states in proportion to the scalar rewards.
GFlowNets. GFlowNets [Bengio et al.,] are a probabilistic framework for training stochastic policies to sample discrete, compositional objects (e.g., graphs, sequences) in proportion to a given reward. As shown in Figure 2, the core principle of GFlowNets is to balance the forward and backward probability flows at each state, inspired by flow matching [Bengio et al.,]. The initial flow is estimated by $Z_\phi(s_0)$ at the initial state $s_0$. The output flow is equal to the outcome reward $r(s_n)$ conditioned at the final state $s_n$. Following [2024], we use a 3-layer MLP to parameterize $Z_\phi$. This flow-balancing mechanism facilitates the discovery of diverse, high-reward solutions by ensuring proper exploration of the solution space. See Appendix C for detailed GFlowNets background.
3. Method
In this section, we first formulate distribution matching in reinforcement learning through reverse
KL divergence and establish its connection to trajectory balance from GFlowNets. To address the
challenges of gradient explosion and sampling mismatch encountered during long CoT training, we
further incorporate length normalization and importance sampling. Using this enhanced framework,
we derive a flow-balanced objective, termed.
3.1. Reward Distribution Matching
As illustrated in Figure 1, recent powerful large reasoning models typically employ reward-maximizing
RL algorithms, such as PPO or GRPO. However, these methods tend to optimize toward the dominant
reward mode, frequently resulting in mode collapse and the neglect of other plausible, high-quality
reasoning paths. To address this fundamental limitation, we propose optimizing the policy by aligning
its output distribution to a target reward distribution. A simple yet effective way to achieve this is
to minimize the reverse KL divergence¹ between the policy and this target. However, in long CoT
reasoning tasks, the available supervision in RL is a scalar reward, rather than a full distribution.
Moreover, enumerating or sampling all valid trajectories to recover the true reward distribution is
computationally intractable.
Inspired by energy-based modeling [Du and Mordatch,,,], we introduce a learnable partition function $Z_\phi(x)$ to normalize scalar rewards into a valid target distribution. This allows us to minimize the reverse KL divergence between the policy and the reward-weighted distribution, formalized as:

$$\min_{\theta}\ D_{\mathrm{KL}}\!\left( \pi_\theta(y \mid x)\ \Big\|\ \frac{\exp\big(\beta\, r(x, y)\big)}{Z_\phi(x)} \right) \;\Rightarrow\; \pi_\theta(y \mid x) \propto \exp\big(\beta\, r(x, y)\big), \tag{2}$$

where $r(x, y)$ is the reward function, $\beta$ is a hyperparameter, $Z_\phi(x)$ is the learned partition function, and the resulting target distribution is defined as $\tilde{p}(y \mid x) = \exp(\beta\, r(x, y)) / Z_\phi(x)$. This objective encourages the policy to sample diverse, high-reward trajectories in proportion to their rewards, rather than collapsing to dominant modes as in standard reward maximization.
While the KL-based formulation provides a principled target distribution, we derive a more
practical, RL-style objective that facilitates efficient policy optimization.
Proposition 1. Minimizing the reverse KL divergence in Eq. (2) is equivalent (in gradient) to minimizing the trajectory balance loss used in GFlowNets [Bartoldson et al.,,,, et al.,,]:

$$\min_{\theta}\ D_{\mathrm{KL}}\!\left( \pi_\theta(y \mid x)\ \Big\|\ \frac{\exp(\beta\, r(x, y))}{Z_\phi(x)} \right) \;\Longleftrightarrow\; \min_{\theta}\ \underbrace{\Big( \log Z_\phi(x) + \log \pi_\theta(y \mid x) - \beta\, r(x, y) \Big)^{2}}_{\text{Trajectory Balance}} \tag{3}$$
Remark 2 (Trajectory balance as a practical surrogate for KL minimization). As established in Proposition 1, the KL-based distribution-matching objective can be reformulated as the trajectory balance loss. This reformulation provides a practical optimization approach by using a stable squared-loss form rather than direct KL optimization, and by treating $Z_\phi(x)$ as a learnable parameter rather than requiring explicit computation of the intractable partition function. The trajectory balance objective thus serves as a tractable surrogate for reward-guided KL minimization that can be directly integrated into existing RL frameworks.
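A minimal sketch of the trajectory balance residual in Eq. (3), assuming the log-partition values, sequence log-probabilities, and scalar rewards have already been computed elsewhere (all names are illustrative; β = 15 follows the ablation in §6.2):

```python
import torch

def trajectory_balance_loss(log_Z: torch.Tensor, logp_y: torch.Tensor,
                            reward: torch.Tensor, beta: float = 15.0) -> torch.Tensor:
    """Squared trajectory-balance residual of Eq. (3), averaged over a batch.

    log_Z:  [B] learned log-partition estimates log Z_phi(x).
    logp_y: [B] sequence log-probabilities log pi_theta(y | x).
    reward: [B] scalar rewards r(x, y).
    """
    residual = log_Z + logp_y - beta * reward
    return (residual ** 2).mean()
```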
¹ We use reverse KL since we can only sample from the policy model, not the target reward distribution.
3.2. The FlowRL Objective
As established in Proposition 1, the target reward distribution can be approximated by optimizing
the trajectory balance objective. However, applying this objective directly to long CoT reasoning
introduces two key challenges:
Problem I: Exploding gradients from long trajectories. The trajectory balance loss is a sequence-level objective, and applying it to long CoT reasoning with up to 8K tokens leads to exploding gradients and unstable updates. This issue is not observed in prior GFlowNets works, which typically operate on short trajectories in small discrete spaces. Specifically, the log-probability term $\log \pi_\theta(y \mid x)$ decomposes into a token-wise sum, $\sum_{t} \log \pi_\theta(y_t \mid y_{<t}, x)$, causing the gradient norm to potentially scale with sequence length.
Problem II: Sampling mismatch. Modern RL pipelines typically perform micro-batch updates and reuse trajectories collected from an old policy $\pi_{\theta_{\mathrm{old}}}$, enabling data-efficient training. In contrast, the KL-based trajectory balance objective assumes fully on-policy sampling, where responses are drawn from the current policy. This mismatch poses practical limitations when integrating trajectory balance into existing RL pipelines.
These limitations motivate our reformulation that retains the benefits of distribution matching
while addressing key practical challenges. To enable this reformulation, we first redefine the reward
function following established practices in GFlowNets literature [Bartoldson et al.,,,
2024,,] by incorporating a reference model as a prior constraint on the reward
distribution. Specifically, we modify the original reward term $\exp(\beta\, \hat{r}(x, y))$ as

$$\exp\big(\beta\, \hat{r}(x, y)\big) \cdot \pi_{\mathrm{ref}}(y \mid x), \tag{4}$$

where $r(x, y)$ denotes the outcome reward commonly used in reinforcement learning and $\pi_{\mathrm{ref}}$ is the initial pre-trained model. We follow [2025] to use outcome-based reward signals, and apply group normalization to $r(x, y)$ as $\hat{r}_i = (r_i - \mathrm{mean}(\mathbf{r})) / \mathrm{std}(\mathbf{r})$, where $\mathbf{r} = \{r_1, r_2, \ldots, r_G\}$ denotes the set of rewards within a sampled group. By substituting the redefined reward formulation Eq. (4) into Eq. (3), we derive the following objective²:
$$\min_{\theta}\ \Big( \log Z_\phi(x) + \log \pi_\theta(y \mid x) - \beta\, \hat{r}(x, y) - \log \pi_{\mathrm{ref}}(y \mid x) \Big)^{2} \tag{5}$$
Remark 3 (Reward shaping via length normalization). The trajectory balance objective treats the policy log-probability and the outcome reward as sequence-level quantities. In contrast, standard policy optimization methods such as PPO or GRPO assign rewards at the token level and compute gradients at each step. However, for trajectories of varying lengths (e.g., CoT responses), this mismatch can cause the log-probability term $\log \pi_\theta(y \mid x) = \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid y_{<t}, x)$ to scale with sequence length. To address this, we apply a form of reward shaping by normalizing log-probabilities with respect to sequence length. Specifically, we rescale the term as $\frac{1}{|y|} \log \pi_\theta(y \mid x)$, balancing the contributions of long and short sequences and stabilizing the learning signal.
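The length normalization above amounts to dividing the summed token log-probabilities by the response length. A minimal sketch (ours; the mask convention is an assumption):

```python
import torch

def length_normalized_logp(token_logps: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """(1/|y|) * sum_t log pi_theta(y_t | y_<t, x), computed per sequence.

    token_logps: [B, T] per-token log-probs; mask: [B, T] with 1.0 on response tokens.
    """
    lengths = mask.sum(dim=-1).clamp(min=1)
    return (token_logps * mask).sum(dim=-1) / lengths
```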
Remark 4 (Importance sampling for data-efficient training). To reuse rollouts generated by the old policy, we employ importance sampling inspired by PPO to stabilize policy updates with off-policy data. We re-weight stale trajectories using the importance ratio $\pi_\theta(y \mid x) / \pi_{\theta_{\mathrm{old}}}(y \mid x)$, which serves as a coefficient in the surrogate loss. Since our objective focuses on optimizing trajectory balance rather than expected return, we detach the gradient from the current policy to prevent excessive policy drift, i.e., we use $\big[\pi_\theta(y \mid x)\big]_{\mathrm{detach}} / \pi_{\theta_{\mathrm{old}}}(y \mid x)$. For additional stability, we incorporate PPO-style clipping to bound the importance weights: $w = \mathrm{clip}\!\Big( \frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\mathrm{old}}}(y \mid x)},\ 1-\varepsilon,\ 1+\varepsilon \Big)_{\mathrm{detach}}$.
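The clipped, detached importance weight takes only a few lines; the sketch below is our illustration, with ε = 0.2 assumed as a typical PPO-style clipping range rather than a value reported in the paper.

```python
import torch

def clipped_is_weight(logp_new: torch.Tensor, logp_old: torch.Tensor,
                      eps: float = 0.2) -> torch.Tensor:
    """Clipped importance weight, detached so it only re-weights the loss.

    logp_new, logp_old: [B] sequence log-probs under the current / old policy.
    """
    ratio = torch.exp(logp_new - logp_old)
    return torch.clamp(ratio, 1.0 - eps, 1.0 + eps).detach()
```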
² The substitution replaces $\beta\, \hat{r}(x, y)$ in the trajectory balance objective Eq. (3) with $\beta\, \hat{r}(x, y) + \log \pi_{\mathrm{ref}}(y \mid x)$ to incorporate the reference model constraint.
Incorporating these improvements into Eq. (5), we arrive at the following FlowRL objective:

$$\mathcal{L}_{\mathrm{FlowRL}} = w \cdot \Big( \log Z_\phi(x) + \frac{1}{|y|} \log \pi_\theta(y \mid x) - \beta\, \hat{r}(x, y) - \frac{1}{|y|} \log \pi_{\mathrm{ref}}(y \mid x) \Big)^{2}, \tag{6}$$

where the clipped importance weight is $w = \mathrm{clip}\!\Big( \frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\mathrm{old}}}(y \mid x)},\ 1-\varepsilon,\ 1+\varepsilon \Big)_{\mathrm{detach}}$ and $\hat{r}_i = \frac{r_i - \mathrm{mean}(\mathbf{r})}{\mathrm{std}(\mathbf{r})}$.
We use this objective to update the policy parameters $\theta$ during training, and refer to this strategy as FlowRL. Implementation details and theoretical analysis are provided in §5 and Appendix A–B, respectively.
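Putting the pieces together, the following sketch approximates the FlowRL objective in Eq. (6) for one group of rollouts. It is our reading of the formula, not the authors' implementation: tensor names, the masking convention, and the defaults (β = 15 from the ablation in §6.2, ε = 0.2) are assumptions.

```python
import torch

def flowrl_loss(token_logp_new, token_logp_old, token_logp_ref, mask,
                log_Z, rewards, beta: float = 15.0, eps: float = 0.2):
    """Sketch of Eq. (6) for one group of G rollouts.

    token_logp_*: [G, T] per-token log-probs under the current, old, and reference
                  policies; mask: [G, T] response-token mask; log_Z: [G] learned
                  log-partition estimates; rewards: [G] scalar outcome rewards.
    """
    lengths = mask.sum(-1).clamp(min=1)
    logp_new = (token_logp_new * mask).sum(-1)      # log pi_theta(y | x)
    logp_old = (token_logp_old * mask).sum(-1)      # log pi_old(y | x)
    logp_ref = (token_logp_ref * mask).sum(-1)      # log pi_ref(y | x)

    # Group-normalized reward: r_hat = (r - mean(r)) / std(r).
    r_hat = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped, detached importance weight w = clip(pi_theta / pi_old, 1 - eps, 1 + eps).
    w = torch.clamp(torch.exp(logp_new - logp_old), 1.0 - eps, 1.0 + eps).detach()

    # Squared flow-balance residual with length-normalized log-probabilities.
    residual = log_Z + logp_new / lengths - beta * r_hat - logp_ref / lengths
    return (w * residual ** 2).mean()
```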
4. Related Work
4.1. Reinforcement Learning for LLM Reasoning
Reinforcement learning has emerged as a powerful approach for post-training large language models on reasoning tasks [Guo et al.,,,,,,, 2024,,]. Most approaches employ reward-maximizing RL to optimize expected
cumulative returns. Entropy regularization [Ahmed et al.,,,,,
2018] is a classical technique for mitigating mode collapse by promoting diversity in the policy’s output
distribution, and has also been shown to enhance reasoning capabilities in various settings [Chao
et al.,,,]. However, for long CoT reasoning, the extended trajectory
length (e.g., 8k–16k tokens) makes it difficult for the regularization signal to effectively influence
reward-maximizing learning. Recent work [Cheng et al.,,,,,,
Wang et al.,] has discovered that training with more diverse or high-entropy training data can
further enhance training effectiveness. Compared to traditional entropy regularization, the above
methods explicitly increase the proportion of low-probability (i.e., high-entropy) tokens in the training
data. In our work, we address the mode-collapse problem by fundamentally shifting from reward
maximization to reward distribution matching in our RL formulation.
4.2. GFlowNets
GFlowNets [Bengio et al.,] represent a class of diversity-driven algorithms designed to balance
probability flows across states. They have rich connections to probabilistic modeling methods [Ma et al.,
Malkin et al.,,,,b,,,], and control methods [Pan
et al.,,c,d,,,,]. This advantage has enabled GFlowNets to
achieve successful applications in multiple downstream tasks, such as molecular drug discovery [Jain
et al.,,,b,,,,,,,,,],
phylogenetic inference [Zhou et al.,], and combinatorial optimization [Zhang et al.,,b].
For generative AI, GFlowNets provide a powerful approach to align pretrained models in scenarios such
as image generation [Yun et al.,,,] and language model fine-tuning [Hu et al.,
2024,,,,]. Another line of work primarily focuses on the theoretical
aspects of GFlowNets. Recent theoretical studies have interpreted GFlowNets as solving a maximum
entropy reinforcement learning problem within a modified Markov Decision Process (MDP) [Deleu
et al.,,,,,]. These theoretical contributions have
inspired us to enhance reinforcement learning from a more foundational standpoint using GFlowNets
principles. A comprehensive overview of GFlowNets theory can be found in Appendix C.
4.3. Flow Matching for Policy Optimization
Flow matching simplifies diffusion-based approaches by learning vector fields that transport samples
from prior to target distributions [Lipman et al.,]. Recent work has explored flow matching
for policy optimization. [2025] reformulates policy optimization using advantage-weighted ratios from conditional flow matching loss, enabling flow-based policy training without expensive likelihood computations. [2025] explored reward-weighted flow matching for improving policies beyond demonstration performance. [2025] uses a separate one-step policy to avoid unstable backpropagation through time when training flow policies with RL. Zhang et al. [2025a] proposed a combined loss function integrating PPO and GFlowNets to optimize diffusion
vision-action models, rather than addressing mode-collapse limitations in reward-maximizing RL.
Inspired by flow matching principles, our work improves upon RL training to enhance training stability
while promoting diverse solution exploration.
5. Experimental Setup
Backbone Models. FlowRL involves two trainable components: the policy model $\pi_\theta$ and the partition function $Z_\phi$. For the policy model $\pi_\theta$, we use Qwen-2.5-7B/32B [Team, 2024] for math tasks and DeepSeek-R1-Distill-Qwen-7B [DeepSeek-AI, 2025] for code tasks, respectively. For the partition function $Z_\phi$, following [2024], we use a randomly initialized 3-layer MLP with hidden dimensions matching those of the base model. The reference model $\pi_{\mathrm{ref}}$ is the corresponding fixed pretrained model. All training scripts are based on veRL [Sheng et al., 2024]. For the reward function, following [2024], we set the hyperparameter $\beta$ to 15 (see the ablation in §6.2).
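As a concrete illustration of the partition-function head, the sketch below parameterizes $\log Z_\phi(x)$ with a 3-layer MLP whose hidden size matches the base model; how the prompt is pooled into a single vector is our assumption, since the paper does not specify it.

```python
import torch
import torch.nn as nn

class LogPartitionHead(nn.Module):
    """3-layer MLP producing a scalar log Z_phi(x) per prompt."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.GELU(),
            nn.Linear(hidden_size, hidden_size), nn.GELU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, prompt_hidden: torch.Tensor) -> torch.Tensor:
        # prompt_hidden: [B, H] pooled hidden state of the prompt (pooling choice is an assumption).
        return self.mlp(prompt_hidden).squeeze(-1)   # [B] log-partition estimates
```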
Baselines. We compare FlowRL against REINFORCE++ (R++) [Hu et al., 2025], PPO [Schulman et al., 2017], and GRPO [Shao et al., 2024]. All baselines follow the official veRL recipes, with consistent training
configurations. For fair comparison, all methods use the same learning rate, batch size, and training
steps, and are evaluated at convergence using identical step counts.
Training Configuration. For the math domain, we use the training set collected from DAPO [Yu et al., 2025b]. For the code domain, we follow the
setup of DeepCoder [Luo et al.,], using their training set. For 7B model training, we use a single
node equipped with 8 NVIDIA H800 GPUs (80GB memory each). For 32B model training, we scale
to 4 nodes with 32 GPUs to accommodate the larger memory requirements. All experiments use
max_prompt_length = 2048 and max_response_length = 8192 across both model sizes. We
use a batch size of 512 for math reasoning tasks and 64 for code reasoning tasks. We set the learning
rate to 1e-6 and enable dynamic batch sizing in veRL for efficient training. For GRPO and FlowRL,
we configure rollout_n = 8, meaning each prompt generates 8 response rollouts as the group size.
Evaluation Configuration. For the math domain, we evaluate on AIME 2024/2025 [MAA, 2025], AMC 2023 [MAA, 2023], MATH-500 [Lightman et al., 2023], Minerva [Lewkowycz et al., 2022], and Olympiad [He et al., 2024]. For the code domain, we evaluate
on LiveCodeBench [Jain et al.,], CodeForces [Penedo et al.,], and HumanEval+ [Chen
et al.,]. For all evaluation datasets, we perform 16 rollouts and report the average accuracy,
denoted as Avg@16. We further report rating and percentile for Codeforces. During generation, we
Method | AIME24 | AIME25 | AMC23 | MATH500 | Minerva | Olympiad | Avg
Qwen2.5-32B-Base, Max Response Len = 8K
Backbone | 4.6 | – | – | – | – | – | 22.7
R++ | 14 (+10) | 9 (+7) | 52 (+24) | 44 (−) | 17 (−) | 24 (+3) | 27.1
PPO | 26 (+22) | 20 (+18) | 76 (+47) | 69 (+16) | 28 (+1) | 37 (+16) | 43.3
GRPO | 23 (+18) | 14 (+12) | 76 (+48) | 61 (+9) | 19 (−) | 34 (+13) | 38.3
FlowRL | 24 (+19) | 21 (+19) | 73 (+45) | 80 (+28) | 38 (+11) | 51 (+30) | 48.4
Qwen2.5-7B-Base, Max Response Len = 8K
Backbone | 4.4 | – | – | – | – | – | 23.0
R++ | 11 (+6) | 5 (+3) | 66 (+35) | 54 (−) | 24 (+2) | 27 (+3) | 31.5
PPO | 9 (+5) | 7 (+5) | 63 (+32) | 58 (+3) | 26 (+4) | 27 (+3) | 32.0
GRPO | 13 (+9) | 9 (+7) | 64 (+33) | 57 (+2) | 23 (+0) | 26 (+2) | 32.5
FlowRL | 15 (+11) | 10 (+8) | 54 (+23) | 67 (+12) | 31 (+9) | 34 (+10) | 35.6
Table 1 | Results on math benchmarks. Relative improvements over the backbone are shown in parentheses. FlowRL outperforms all baselines across both 7B and 32B model scales.
Models | LiveCodeBench Avg@16 | LiveCodeBench Pass@16 | CodeForces Rating | CodeForces Percentile | HumanEval+ Avg@16
DeepSeek-R1-Distill-Qwen-7B, Max Response Len = 8K
Backbone | 30.7 | – | 886.7 | – | 80.9
R++ | 30 (−) | 52 (+3) | 1208 (+321) | 56 (+37) | 76 (−)
PPO | 35 (+4) | 54 (+5) | 1403 (+516) | 73 (+54) | 82 (+1)
GRPO | 32 (+2) | 52 (+2) | 1313 (+427) | 67 (+47) | 80 (−)
FlowRL | 37 (+6) | 56 (+6) | 1549 (+662) | 83 (+63) | 83 (+2)
Table 2 | Results on code benchmarks. Relative improvements over the backbone are shown in parentheses. FlowRL achieves the strongest performance across all three benchmarks, demonstrating its effectiveness in code reasoning tasks.
use sampling parameters of temperature = 0.6 and top_p = 0.95 for all evaluations. The response
length for evaluation is set to 8,192, consistent with the training configuration.
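For reference, Avg@16 and Pass@16 can be computed from a matrix of per-rollout correctness as in the short sketch below (ours; array names are illustrative).

```python
import numpy as np

def avg_at_k(correct: np.ndarray) -> float:
    """Avg@k: mean accuracy over the k rollouts of every problem. correct: [N, k] in {0, 1}."""
    return float(correct.mean())

def pass_at_k(correct: np.ndarray) -> float:
    """Pass@k: fraction of problems solved by at least one of the k rollouts."""
    return float(correct.any(axis=1).mean())

# Toy usage: 3 problems, 16 rollouts each.
correct = (np.random.rand(3, 16) < 0.3).astype(int)
print(avg_at_k(correct), pass_at_k(correct))
```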
6. Results
6.1. Main Results
Our experimental results, summarized in Tables 1 and 2, demonstrate that FlowRL consistently outperforms all reward-maximization baselines across both math and code reasoning domains. On the math benchmarks (Table 1), FlowRL achieves the highest average accuracy of 35.6% with the 7B model and 48.4% with the 32B model,
surpassing PPO by 5.1% and GRPO by 10.1% on the 32B model. FlowRL shows strong improvements
on challenging benchmarks like MATH-500 and Olympiad problems, demonstrating consistent gains
Method AIME 2024 AIME 2025 AMC 2023 MATH-500 Minerva Olympiad Avg
FlowRL 15.41 10.83 54.53 66.96 31.41 34.61 35.63
w/o IS 6.25 7.91 41.40 56.97 22.19 25.52 26.71
Zhang et al. [2025a] 10.41 6.66 53.75 66.50 30.97 33.72 33.67
Table 3|Ablation study on FlowRL with Qwen2.5-7B as the base model. Avg@16 accuracy is reported
across six math reasoning benchmarks. IS denotes importance sampling.
across diverse mathematical domains. On code generation tasks, FlowRL achieves compelling im-
provements with the highest Avg@16 score of 37.43% on LiveCodeBench, a Codeforces rating of
1549.47 with 83.3% percentile ranking, and 83.28% accuracy on HumanEval+, outperforming all
baselines across the board. These consistent performance gains across both domains and model scales
provide strong empirical evidence that FlowRL’s flow-balanced optimization successfully enhances
generalization. This improvement comes from promoting diverse solution exploration compared to
previous reward-maximizing RL approaches.
6.2. Ablation Studies

Figure 3 | Ablation study on the hyperparameter β in FlowRL. The average math score for β = 5, 10, 15, and 30 is 31.34, 34.41, 35.63, and 35.09, respectively; β = 15 achieves the best performance.
We conduct ablation studies on importance sampling and the β hyperparameter. For importance sampling, we compare the performance with and without it, and also implement a combined loss approach proposed by Zhang et al. [2025a] that simultaneously optimizes both GFlowNets and PPO objectives. This combined loss was designed for optimizing diffusion models, and we adapt it to long CoT reasoning tasks for comparison. Table 3 shows that importance sampling substantially improves FlowRL performance across all math reasoning benchmarks. Compared to Zhang et al. [2025a], using importance sampling as a trajectory-level ratio is more suitable than the combined loss of GFlowNets and PPO. The performance drop without importance sampling (from 35.63% to 26.71%) highlights the critical role of correcting for the distribution mismatch between rollout generation and policy training. For the hyperparameter β, we conduct a series of parameter ablation studies; Figure 3 shows that β = 15 achieves the best performance, with detailed results in Table 7.
7. Analysis
7.1. Diversity Analysis
To assess solution diversity, we follow the approach of [2025a] and employ GPT-4o-mini [OpenAI] to evaluate all responses generated by each method on AIME 24/25. The evaluation prompt is shown in the appendix. As shown in Figure 4, FlowRL achieves higher diversity scores
compared to baseline methods. This demonstrates that FlowRL improves sample diversity compared
to baselines, which tend to exhibit repetitive solution patterns. This diversity evaluation reveals
Table 4|Case study comparing GRPO and FlowRL rollouts on an AIME problem. GRPO exhibits
repetitive patterns (AM-GM×3, identity loops×2), while FlowRL follows a more diverse solution
path.
Content (boxed = actions; "×n" marks repeated steps)
Question: Let B be the set of rectangular boxes with surface area 54 and volume 23. Let r be the radius of the smallest sphere that can contain each box in B. If r² = p/q with gcd(p, q) = 1, find p + q.
GRPO: "... denote the sides a, b, c ... 2(ab + bc + ca) = 54, abc = 23 ... r = ½√(a² + b² + c²) ... (a + b + c)² = a² + b² + c² + 2(ab + bc + ca) ... AM–GM ×3: AM–GM (1) ... AM–GM (2) ... AM–GM (3) ... identity loop ×2: loop (1) ... loop (2) ... back to a² + b² + c² ... no factorization ..."
FlowRL: "... let a, b, c with 2(ab + bc + ca) = 54, abc = 23 ... r = ½√(a² + b² + c²) ... (a + b + c)² ⇒ a² + b² + c² = (a + b + c)² − 54 ... set a = b ... cubic a³ − 27a + 46 = 0 ... rational root ... factor (a − 2)(a² + 2a − 23) ... branch −1 + 2√6 ... back-substitution ... a² + b² + c² = 657/16 ... r² = 657/64 ... Answer: 721 ..."
significant differences in exploration patterns across methods. This nearly doubling of diversity score
compared to the strongest baseline (PPO) indicates that FlowRL generates qualitatively different
solution approaches rather than minor variations of the same strategy. The diversity analysis provides
empirical validation of our core hypothesis that flow-balanced optimization promotes mode coverage
in complex reasoning tasks.
7.2. Case Study

Figure 4 | GPT-judged diversity scores on rollouts of AIME 24/25 problems: R++ 1.11, GRPO 1.23, PPO 1.31, FlowRL 2.28. FlowRL generates more diverse solutions than R++, GRPO, and PPO.
Table 4 compares rollouts from GRPO and FlowRL on a representative AIME prob-
lem. GRPO exhibits repetitive patterns, applying AM-
GM three times and getting stuck in identity loops,
failing to solve the problem. FlowRL explores more
diverse actions: it sets a = b, derives a cubic equation,
finds the rational root, and reaches the correct an-
swer. This shows that FlowRL successfully avoids the
repetitive exploration patterns. The contrast reveals
fundamental differences in exploration strategies:
GRPO’s reward-maximizing approach leads to ex-
ploitation of familiar techniques (AM-GM inequality)
without exploring alternatives, eventually reaching
contradictory conclusions. In contrast,
FlowRL’s distribution-matching enables strategic de-
cisions such as the symmetry assumption a = b, which transforms the problem into a tractable cubic equation a³ − 27a + 46 = 0, allowing systematic solution
through rational root testing and polynomial factorization.
8. Conclusion
In this work, we introduce FlowRL, which transforms scalar rewards into normalized target distribu-
tions using a learnable partition function and minimizes the reverse KL divergence between the policy
and target distribution. We demonstrate that this approach is theoretically equivalent to trajectory
balance objectives from GFlowNets and implicitly maximizes both reward and entropy, thereby pro-
moting diverse reasoning trajectories. To further address gradient explosion and sampling mismatch
issues in long CoT reasoning, we incorporate importance sampling and length normalization. Through
experiments on math and code reasoning benchmarks, FlowRL achieves consistent improvements
across all tasks compared to GRPO and PPO. Our diversity analysis and case studies confirm that
FlowRL generates more varied solution approaches while avoiding repetitive patterns.
Acknowledgments
We are grateful to Mingqian Feng and Yuetai Li for their valuable discussions and feedback, which
helped improve the quality of this work.
References
Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the
impact of entropy on policy optimization. In, pages
151–160. PMLR, 2019.
Brian R Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee,
Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, and Bhavya Kailkhura. Trajectory balance with
asynchrony: Decoupling exploration and learning for fast, scalable llm post-training.
arXiv:2503.18929, 2025.
Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, and Yoshua Bengio. Flow net-
work based generative models for non-iterative diverse candidate generation.
Processing Systems (NeurIPS), 2021.
Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward J. Hu, Mo Tiwari, and Emmanuel Bengio.
Gflownet foundations. Journal of Machine Learning Research, 24(210):1–55, 2023a. URL http://jmlr.org/papers/v24/22-0364.html.
Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward J Hu, Mo Tiwari, and Emmanuel Bengio.
Gflownet foundations., 24(1):10006–10060, 2023b.
Chen-Hao Chao, Chien Feng, Wei-Fang Sun, Cheng-Kuang Lee, Simon See, and Chun-Yi Lee. Maximum
entropy reinforcement learning via energy-based normalizing flow.,
2024.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared
Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen
Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray,
Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter,
Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth
Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang,
Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N.
Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles
Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish,
Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.
Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu
Wei. Reasoning with exploration: An entropy perspective., 2025.
Miruna Cretu, Charles Harris, Ilia Igashov, Arne Schneuing, Marwin Segler, Bruno Correia, Julien
Roy, Emmanuel Bengio, and Pietro Liò. Synflownet: Design of diverse and novel molecules with
synthesis constraints., 2024.
Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan,
Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning
language models., 2025.
DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,
2025. URL.
Tristan Deleu, Padideh Nouri, Nikolay Malkin, Doina Precup, and Yoshua Bengio. Discrete probabilistic
inference as control in multi-path environments., 2024.
Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen,
Jiazhen Du, Huiyang Wang, Fuzheng Zhang, et al. Agentic reinforced policy optimization.
preprint arXiv:2507.19849, 2025.
Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models.
in neural information processing systems, 32, 2019.
Benjamin Eysenbach and Sergey Levine. Maximum entropy rl (provably) solves some robust rl
problems., 2021.
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In
International Conference on Machine Learning, pages 10835–10866. PMLR, 2023.
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shi-
rong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via
reinforcement learning., 2025.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy
maximum entropy deep reinforcement learning with a stochastic actor. In
on machine learning, pages 1861–1870. Pmlr, 2018.
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han,
Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi
with olympiad-level bilingual multimodal scientific problems.,
2024.
Haoran He, Can Chang, Huazhe Xu, and Ling Pan. Looking backward: Retrospective backward
synthesis for goal-conditioned GFlownets. In
Representations, 2025. URL.
Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey, and R M Neal. The “wake-sleep” algorithm for
unsupervised neural networks., 268 5214:1158–61, 1995.
Edward J Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio,
and Nikolay Malkin. Amortizing intractable inference in large language models.
arXiv:2310.04363, 2023.
Edward J Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio,
and Nikolay Malkin. Amortizing intractable inference in large language models. In
International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Ouj6p4ca60.
Jian Hu, Jason Klein Liu, and Wei Shen. Reinforce++: An efficient rlhf algorithm with robustness to both prompt and reward models, 2025.
Moksh Jain, Emmanuel Bengio, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Bonaventure F.P.
Dossou, Chanakya Ekbote, Jie Fu, Tianyu Zhang, Micheal Kilgour, Dinghuai Zhang, Lena Simine,
Payel Das, and Yoshua Bengio. Biological sequence design with GFlowNets.
on Machine Learning (ICML), 2022.
Moksh Jain, Tristan Deleu, Jason S. Hartford, Cheng-Hao Liu, Alex Hernández-García, and Yoshua
Bengio. Gflownets for ai-driven scientific discovery., abs/2302.00615, 2023a. URL https://api.semanticscholar.org/CorpusID:256459319.
Moksh Jain, Sharath Chandra Raparthy, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Yoshua Bengio,
Santiago Miret, and Emmanuel Bengio. Multi-objective GFlowNets.
Machine Learning (ICML), 2023b.
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando
Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free
evaluation of large language models for code., 2024.
Koray Kavukcuoglu. Gemini 2.5: Our most intelligent AI model, 2025. URL https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/. Google Blog (The Keyword), Published Mar. 25, 2025.
Minsu Kim, Taeyoung Yun, Emmanuel Bengio, Dinghuai Zhang, Yoshua Bengio, Sungsoo Ahn, and
Jinkyoo Park. Local search gflownets., abs/2310.02710, 2023.
Minsu Kim, Joohwan Ko, Taeyoung Yun, Dinghuai Zhang, Ling Pan, Woochang Kim, Jinkyoo Park,
Emmanuel Bengio, and Yoshua Bengio. Learning to scale logits for temperature-conditional
gflownets, 2024.
Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi,
Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, et al. Learning diverse attacks on large language
models for robust red-teaming and safety tuning., 2024.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ra-
masesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam
Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language
models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,
in Neural Information Processing Systems, volume 35, pages 3843–3857. Curran Associates, Inc.,
2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/18abbeef8cfe9203fdf9053c9c4fe191-Paper-Conference.pdf.
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan
Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.
arXiv:2305.20050, 2023a.
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan
Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In
International Conference on Learning Representations, 2023b.
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching
for generative modeling. In, 2023.
URL.
Dianbo Liu, Moksh Jain, Bonaventure F. P. Dossou, Qianli Shen, Salem Lahlou, Anirudh Goyal, Nikolay
Malkin, Chris C. Emezue, Dinghuai Zhang, Nadhir Hassen, Xu Ji, Kenji Kawaguchi, and Yoshua
Bengio. Gflowout: Dropout with generative flow networks. In
Learning, 2022.
Mingjie Liu, Shizhe Diao, Jian Hu, Ximing Lu, Xin Dong, Hao Zhang, Alexander Bukharin, Shaokun
Zhang, Jiaqi Zeng, Makesh Narsimhan Sreedhar, et al. Scaling up rl: Unlocking diverse reasoning
in llms via prolonged training., 2025a.
Zhen Liu, Tim Z Xiao, , Weiyang Liu, Yoshua Bengio, and Dinghuai Zhang. Efficient diversity-preserving
diffusion alignment via gradient-informed gflownets. In, 2025b.
Michael Luo, Sijun Tan, Roy Huang, Xiaoxiang Shi, Rachel Xin, Colin Cai, Ameen Patel, Alpay Ariyak,
Qingyang Wu, Ce Zhang, Li Erran Li, Raluca Ada Popa, Ion Stoica, and Tianjun Zhang. Deepcoder:
A fully open-source 14b coder at o3-mini level, 2025. Notion Blog.
Jiangyan Ma, Emmanuel Bengio, Yoshua Bengio, and Dinghuai Zhang. Baking symmetry into
gflownets.
MAA. American mathematics competitions - amc., 2023.
MAA. American invitational mathematics examination - aime., 2025.
Kanika Madan, Jarrid Rector-Brooks, Maksym Korablyov, Emmanuel Bengio, Moksh Jain, Andrei Cris-
tian Nica, Tom Bosc, Yoshua Bengio, and Nikolay Malkin. Learning gflownets from partial episodes
for improved convergence and stability. In, pages
23467–23483. PMLR, 2023.
Nikolay Malkin, Moksh Jain, Emmanuel Bengio, Chen Sun, and Yoshua Bengio. Trajectory balance:
Improved credit assignment in gflownets., 35:
5955–5967, 2022.
Nikolay Malkin, Salem Lahlou, Tristan Deleu, Xu Ji, Edward Hu, Katie Everett, Dinghuai Zhang,
and Yoshua Bengio. GFlowNets and variational inference.
Representations (ICLR), 2023.
David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng,
and Angjoo Kanazawa. Flow matching policy gradients., 2025.
Sobhan Mohammadpour, Emmanuel Bengio, Emma Frejinger, and Pierre-Luc Bacon. Maximum
entropy gflownets with soft q-learning. In
Statistics, pages 2593–2601. PMLR, 2024.
OpenAI. Gpt-4o mini. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/.
Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping
and mitigating misaligned models., 2022.
Ling Pan, Moksh Jain, Kanika Madan, and Yoshua Bengio. Pre-training and fine-tuning generative
flow networks, 2023a.
Ling Pan, Nikolay Malkin, Dinghuai Zhang, and Yoshua Bengio. Better training of GFlowNets with
local credit and incomplete trajectories.,
2023b.
Ling Pan, Dinghuai Zhang, Aaron Courville, Longbo Huang, and Yoshua Bengio. Generative augmented
flow networks., 2023c.
Ling Pan, Dinghuai Zhang, Moksh Jain, Longbo Huang, and Yoshua Bengio. Stochastic generative
flow networks., 2023d.
Seohong Park, Qiyang Li, and Sergey Levine. Flow q-learning. In
on Machine Learning, 2025. URL.
Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching, Agustín Pi-
queres Lajarín, Quentin Gallouédec, Nathan Habib, Lewis Tunstall, and Leandro von Werra.
Codeforces., 2025.
Samuel Pfrommer, Yixiao Huang, and Somayeh Sojoudi. Reinforcement learning for flow-matching
policies., 2025.
Abhinav Rastogi, Albert Q Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, Jason Rute, Joep
Barmentlo, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, et al. Magistral.
preprint arXiv:2506.10910, 2025.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms., 2017.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan
Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open
language models., 2024.
Max W. Shen, Emmanuel Bengio, Ehsan Hajiramezanali, Andreas Loukas, Kyunghyun Cho, and Tom-
maso Biancalani. Towards understanding and improving gflownet training., abs/2305.07170,
2023. URL.
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng,
Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.
arXiv: 2409.19256, 2024.
Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing
reward gaming., 35:9460–9471, 2022.
Richard S Sutton, Andrew G Barto, et al. Reinforcement learning.,
11(1):126–134, 1999a.
Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods
for reinforcement learning with function approximation. In S. Solla, T. Leen, and K. Müller,
editors,, volume 12. MIT Press, 1999b. URL
https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf.
Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/.
Daniil Tiapkin, Nikita Morozov, Alexey Naumov, and Dmitry P Vetrov. Generative flow networks as
entropy-regularized rl. In, pages
4213–4221. PMLR, 2024.
Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen,
Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive
effective reinforcement learning for llm reasoning., 2025.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou,
et al. Chain-of-thought prompting elicits reasoning in large language models.
information processing systems, 35:24824–24837, 2022.
Fangxu Yu, Lai Jiang, Haoqiang Kang, Shibo Hao, and Lianhui Qin. Flow of reasoning: Training
llms for divergent reasoning with minimal examples. In
Machine Learning, 2025a.
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong
Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.
arXiv preprint arXiv:2503.14476, 2025b.
Taeyoung Yun, Dinghuai Zhang, Jinkyoo Park, and Ling Pan. Learning to sample effective and diverse
prompts for text-to-image generation. In
Conference, pages 23625–23635, 2025.
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with
reasoning., 35:15476–15488, 2022.
David W. Zhang, Corrado Rainone, Markus F. Peschl, and Roberto Bondesan. Robust scheduling with
gflownets., abs/2302.05446, 2023a. URL https://api.semanticscholar.org/CorpusID:256827133.
Dinghuai Zhang, Ricky T. Q. Chen, Nikolay Malkin, and Yoshua Bengio. Unifying generative models
with GFlowNets and beyond., 2022a.
Dinghuai Zhang, Nikolay Malkin, Zhen Liu, Alexandra Volokhova, Aaron Courville, and Yoshua
Bengio. Generative flow networks for discrete probabilistic modeling.
Machine Learning (ICML), 2022b.
Dinghuai Zhang, Hanjun Dai, Nikolay Malkin, Aaron C. Courville, Yoshua Bengio, and Ling Pan.
Let the flows tell: Solving graph combinatorial optimization problems with gflownets.,
abs/2305.17010, 2023b.
Dinghuai Zhang, Ricky T. Q. Chen, Cheng-Hao Liu, Aaron Courville, and Yoshua Bengio. Diffusion
generative flow samplers: Improving learning signals through partial trajectory optimization, 2024a.
Dinghuai Zhang, Ling Pan, Ricky T. Q. Chen, Aaron Courville, and Yoshua Bengio. Distributional
gflownets with quantile flows, 2024b.
Dinghuai Zhang, Yizhe Zhang, Jiatao Gu, Ruixiang ZHANG, Joshua M. Susskind, Navdeep Jaitly,
and Shuangfei Zhai. Improving GFlownets for text-to-image diffusion alignment.
Transactions on Machine Learning Research, 2025a. ISSN 2835-8856. URL https://openreview.net/forum?id=XDbY3qhM42.
Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian,
Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models.
preprint arXiv:2509.08827, 2025b.
Mingyang Zhou, Zichao Yan, Elliot Layne, Nikolay Malkin, Dinghuai Zhang, Moksh Jain, Mathieu
Blanchette, and Yoshua Bengio. Phylogfn: Phylogenetic inference with generative flow networks,
2024.
Heiko Zimmermann, Fredrik Lindsten, J.-W. van de Meent, and Christian Andersson Naesseth. A
variational perspective on generative flow networks., abs/2210.07992, 2022. URL https://api.semanticscholar.org/CorpusID:252907672.
A. Proof of Proposition 1
We begin by analyzing the gradient of the Kullback–Leibler (KL) divergence between the policy $\pi_\theta(y \mid x)$ and the target distribution $\exp(\beta\, r(x, y)) / Z_\phi(x)$:

$$
\begin{aligned}
\nabla_\theta\, D_{\mathrm{KL}}\!\left( \pi_\theta(y \mid x)\ \Big\|\ \frac{\exp(\beta\, r(x, y))}{Z_\phi(x)} \right)
&= \nabla_\theta \int \pi_\theta(y \mid x) \log \left[ \frac{\pi_\theta(y \mid x)\, Z_\phi(x)}{\exp(\beta\, r(x, y))} \right] dy \\
&= \int \nabla_\theta \pi_\theta(y \mid x) \log \left[ \frac{\pi_\theta(y \mid x)\, Z_\phi(x)}{\exp(\beta\, r(x, y))} \right] dy + \int \pi_\theta(y \mid x)\, \nabla_\theta \log \pi_\theta(y \mid x)\, dy \\
&= \int \pi_\theta(y \mid x)\, \nabla_\theta \log \pi_\theta(y \mid x) \log \left[ \frac{\pi_\theta(y \mid x)\, Z_\phi(x)}{\exp(\beta\, r(x, y))} \right] dy + \underbrace{\nabla_\theta \int \pi_\theta(y \mid x)\, dy}_{=\ \nabla_\theta 1\ =\ 0} \\
&= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \left[ \log \left( \frac{\pi_\theta(y \mid x)\, Z_\phi(x)}{\exp(\beta\, r(x, y))} \right) \cdot \nabla_\theta \log \pi_\theta(y \mid x) \right]. \tag{8}
\end{aligned}
$$

Next, consider the trajectory balance objective used in GFlowNets learning [Bartoldson et al., 2025,,,,], defined as:

$$\mathcal{L}(y; \theta) = \left( \log \frac{\pi_\theta(y \mid x)\, Z_\phi(x)}{\exp(\beta\, r(x, y))} \right)^{2}. \tag{9}$$

Taking the gradient of this objective with respect to $\theta$, with the sampling distribution $y \sim \pi_\theta(\cdot \mid x)$ held fixed, gives

$$\nabla_\theta\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ \mathcal{L}(y; \theta) \big] \propto \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \left[ \log \left( \frac{\pi_\theta(y \mid x)\, Z_\phi(x)}{\exp(\beta\, r(x, y))} \right) \cdot \nabla_\theta \log \pi_\theta(y \mid x) \right]. \tag{10}$$

Thus, minimizing the KL divergence is equivalent (up to a constant) to minimizing the trajectory balance loss, confirming Proposition 1.
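As an informal numerical check of this equivalence (our own sketch, not part of the paper), one can compare the two gradients on a tiny discrete space with automatic differentiation; when the sampling distribution is detached, the trajectory-balance gradient matches the reverse-KL gradient up to the constant factor of 2.

```python
import torch

torch.manual_seed(0)
K, beta = 5, 1.0                                   # a tiny discrete "trajectory" space
logits = torch.randn(K, requires_grad=True)
r = torch.rand(K)                                  # fixed scalar rewards
log_Z = torch.logsumexp(beta * r, dim=0)           # exact log-partition (K is tiny)

# (a) Exact reverse KL between pi_theta and exp(beta * r) / Z.
log_pi = torch.log_softmax(logits, dim=0)
target_log = beta * r - log_Z
kl = torch.sum(log_pi.exp() * (log_pi - target_log))
grad_kl, = torch.autograd.grad(kl, logits, retain_graph=True)

# (b) Expected trajectory-balance loss with the sampling distribution detached:
#     E_{y ~ sg(pi_theta)} [(log Z + log pi_theta(y) - beta * r(y))^2].
weights = log_pi.exp().detach()
tb = torch.sum(weights * (log_Z + log_pi - beta * r) ** 2)
grad_tb, = torch.autograd.grad(tb, logits)

# The TB gradient equals 2x the reverse-KL gradient (equivalence up to a constant).
print(torch.allclose(grad_tb, 2 * grad_kl, atol=1e-5))
```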
B. Interpretation of FlowRL
We provide an interpretation of FlowRL that clarifies the role of each component in the objective.
Proposition 5. Minimizing the FlowRL objective is equivalent to jointly maximizing reward and policy entropy:

$$\max_{\theta}\ \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \Big[ \underbrace{\beta\, \hat{r}(x, y)}_{\text{reward}} - \log Z_\phi(x) + \log \pi_{\mathrm{ref}}(y \mid x) \Big] + \underbrace{\mathcal{H}(\pi_\theta)}_{\text{entropy}}.$$
Remark 6 (FlowRL beyond reward maximization). Minimizing the FlowRL objective can be interpreted as jointly maximizing expected reward and policy entropy. This shift encourages the policy to explore a broader set of high-quality solutions, enabling more diverse and generalizable behaviors on reasoning tasks. Our interpretation also aligns with prior work that views GFlowNets training as a form of maximum entropy RL [Deleu et al.,,,].
The proof of Proposition 5 proceeds as follows. Recall from Eq. (5) that the FlowRL objective corresponds to minimizing the reverse KL divergence:

$$D_{\mathrm{KL}}\!\left( \pi_\theta(y \mid x)\ \Big\|\ \frac{\exp(\beta\, \hat{r}(x, y)) \cdot \pi_{\mathrm{ref}}(y \mid x)}{Z_\phi(x)} \right) = \int \pi_\theta(y \mid x) \log \left[ \frac{\pi_\theta(y \mid x)\, Z_\phi(x)}{\exp(\beta\, \hat{r}(x, y)) \cdot \pi_{\mathrm{ref}}(y \mid x)} \right] dy. \tag{12}$$

Rearranging the terms, we obtain:

$$
\begin{aligned}
\arg\min_{\theta}\ D_{\mathrm{KL}}\!\left( \pi_\theta(y \mid x)\ \Big\|\ \frac{\exp(\beta\, \hat{r}(x, y)) \cdot \pi_{\mathrm{ref}}(y \mid x)}{Z_\phi(x)} \right)
&= \arg\min_{\theta} \int \pi_\theta(y \mid x) \log \left[ \frac{\pi_\theta(y \mid x)\, Z_\phi(x)}{\exp(\beta\, \hat{r}(x, y)) \cdot \pi_{\mathrm{ref}}(y \mid x)} \right] dy \\
&= \arg\max_{\theta} \left\{ \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \log \left[ \frac{\exp(\beta\, \hat{r}(x, y)) \cdot \pi_{\mathrm{ref}}(y \mid x)}{Z_\phi(x)} \right] - \int \pi_\theta(y \mid x) \log \pi_\theta(y \mid x)\, dy \right\} \\
&= \arg\max_{\theta} \left\{ \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \log \left[ \frac{\exp(\beta\, \hat{r}(x, y)) \cdot \pi_{\mathrm{ref}}(y \mid x)}{Z_\phi(x)} \right] + \mathcal{H}(\pi_\theta) \right\}. \tag{13}
\end{aligned}
$$

Finally, we express the FlowRL objective in its compact form:

$$\max_{\theta}\ \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \Big[ \underbrace{\beta\, \hat{r}(x, y)}_{\text{reward}}\ \underbrace{-\ \log Z_\phi(x)}_{\text{normalization}}\ \underbrace{+\ \log \pi_{\mathrm{ref}}(y \mid x)}_{\text{prior alignment}} \Big] + \underbrace{\mathcal{H}(\pi_\theta)}_{\text{entropy}}.$$
Therefore, minimizing the FlowRL objective can be interpreted as jointly maximizing reward
and entropy, while also aligning the policy with a structured prior. The reward term drives task
performance, while the normalization term $Z_\phi(x)$ ensures consistency with a properly normalized target distribution. This encourages the policy $\pi_\theta$ to cover the entire reward-weighted distribution rather than collapsing to a few high-reward modes. The reference policy $\pi_{\mathrm{ref}}$ provides inductive bias that regularizes the policy toward desirable structures, and the entropy term $\mathcal{H}(\pi_\theta)$ encourages
diversity in sampled solutions. Together, these components promote better generalization of FlowRL.
C. GFlowNets Background
We follow the notation of [He et al.,,,] to introduce the fundamentals of GFlowNets. Let $\mathcal{X}$ denote the space of compositional objects and $R$ be a reward function that assigns non-negative values to each object $x \in \mathcal{X}$. GFlowNets aim to learn a sequential, constructive sampling policy $\pi$ that generates objects $x$ with probabilities proportional to their rewards, i.e., $\pi(x) \propto R(x)$. This process can be represented as a directed acyclic graph (DAG) $\mathcal{G} = (\mathcal{S}, \mathcal{A})$, where the vertices $s \in \mathcal{S}$ are referred to as states, and the directed edges $(s \rightarrow s') \in \mathcal{A}$ are called actions. The generation of an object $x$ corresponds to a complete trajectory $\tau = (s_0 \rightarrow \cdots \rightarrow s_n) \in \mathcal{T}$ within the DAG, beginning at the initial state $s_0$ and ending at a terminal state $s_n \in \mathcal{X}$. The state flow $F$ is defined as a non-negative weight assigned to each state $s \in \mathcal{S}$. The forward policy $P_F(s' \mid s)$ specifies the transition probability to a child state $s'$, while the backward policy $P_B(s \mid s')$ specifies the transition probability to a parent state $s$. To this end, the detailed balance objective enforces local flow consistency across every edge $(s \rightarrow s') \in \mathcal{A}$:

$$\forall (s \rightarrow s') \in \mathcal{A}: \quad F(s)\, P_F(s' \mid s; \theta) = F(s')\, P_B(s \mid s'; \theta).$$
To achieve this flow consistency, GFlowNets employ training objectives at different levels of granularity,
including detailed balance [Bengio et al.,], trajectory balance [Malkin et al.,], and sub-
trajectory balance [Madan et al.,]. Leveraging their diversity-seeking behavior, GFlowNets have
been successfully applied across a range of domains, including molecule generation [Cretu et al.,
2024], diffusion fine-tuning [Liu et al.,,,], and amortized reasoning [Hu
et al.,,,]. Among various training objectives in GFlowNets, trajectory balance maintains flow consistency at the trajectory level, defined as:

$$Z_\theta \prod_{t=1}^{n} P_F(s_t \mid s_{t-1}; \theta) = R(x) \prod_{t=1}^{n} P_B(s_{t-1} \mid s_t; \theta).$$

Furthermore, sub-trajectory balance achieves local balance on arbitrary subpaths $\tau_{m:n} = (s_m \rightarrow \cdots \rightarrow s_n)$, offering a more stable and less biased learning signal. We build on trajectory balance to extend our KL-based objective through a gradient-equivalence formulation (Proposition 1), and further
improve it to better support long CoT reasoning in RL.
Models | AIME 2024 | AIME 2025 | AMC 2023 | MATH-500 | Minerva | Olympiad | Avg
Qwen2.5-7B Base Model
Backbone | 4.37 | – | – | – | – | – | 23.02
R++ | 10 (+6) | 5 (+3) | 66 (+35) | 54 (−) | 24 (+2) | 27 (+3) | 31.29
PPO | 9 (+5) | 7 (+5) | 63 (+32) | 57 (+3) | 26 (+3) | 27 (+3) | 32.03
GRPO | 14 (+9) | 10 (+8) | 64 (+33) | 57 (+2) | 23 (+0) | 27 (+3) | 32.76
FlowRL | 14 (+9) | 10 (+7) | 55 (+24) | 66 (+12) | 31 (+9) | 34 (+10) | 35.39
Table 5 | Math reasoning performance (Avg@64) at temperature=0.6. Relative improvements over the backbone are shown in parentheses. FlowRL consistently outperforms all baselines and achieves the best average score under this low-temperature setting.
Models | AIME 2024 | AIME 2025 | AMC 2023 | MATH-500 | Minerva | Olympiad | Avg
Qwen2.5-7B Base Model
Backbone | 3.39 | – | – | – | – | – | 18.20
R++ | 10 (+7) | 4 (+3) | 66 (+43) | 54 (+9) | 23 (+6) | 26 (+8) | 31.19
PPO | 10 (+7) | 6 (+5) | 63 (+39) | 57 (+12) | 25 (+8) | 27 (+8) | 31.77
GRPO | 12 (+9) | 10 (+8) | 64 (+40) | 57 (+11) | 23 (+6) | 26 (+8) | 32.44
FlowRL | 14 (+10) | 9 (+8) | 52 (+29) | 66 (+21) | 30 (+13) | 34 (+16) | 34.62
Table 6 | Math reasoning performance (Avg@64) at temperature=1.0. Relative improvements over the backbone are shown in parentheses. FlowRL maintains robust performance under higher generation randomness and continues to outperform all baselines on average.
Models | AIME 2024 | AIME 2025 | AMC 2023 | MATH-500 | Minerva | Olympiad | Avg
β = 5 | 13.54 | 10.00 | 56.09 | 58.91 | 20.79 | 28.72 | 31.34
β = 10 | 14.79 | 10.20 | 59.53 | 64.30 | 25.27 | 32.39 | 34.41
β = 15 | 15.41 | 10.83 | 54.53 | 66.96 | 31.41 | 34.61 | 35.63
β = 30 | 15.00 | 10.83 | 50.62 | 69.02 | 30.03 | 35.03 | 35.09
Table 7 | Ablation study on the effect of the β parameter in FlowRL. We report Avg@16 accuracy across six math reasoning benchmarks for different values of β.
Diversity Evaluation Prompt
System:
problem. Focus on detecting even SUBTLE differences in methodology that indicate different problem-
solving strategies.
PROBLEM:
{problem}
16 SOLUTION ATTEMPTS:
{formatted_responses}
EVALUATION CRITERIA - Rate diversity from 1 to 5:
Score 1 - Minimal Diversity:
•
•
•
•
Score 2 - Low Diversity:
•
•
•
•
Score 3 - Moderate Diversity:
•
•
•
•
Score 4 - High Diversity:
•
•
•
•
Score 5 - Maximum Diversity:
• ≤3 responses use the same method
•
•
•
IMPORTANT: Provide a single diversity rating from 1 to 5.