Fisher-Orthogonal Projection Methods for Natural Gradient Descent with Large
Batches
Yishun Lu¹, Wesley Armour¹
¹Department of Engineering Science, University of Oxford, Oxford, UK
Abstract
Modern GPUs are equipped with large amounts of high-
bandwidth memory, enabling them to support mini-batch
sizes of up to tens of thousands of training samples. How-
ever, most existing optimizers struggle to perform effectively
at such a large batch size. As batch size increases, gradient
noise decreases due to averaging over many samples, lim-
iting the ability of first-order methods to escape sharp or
suboptimal minima and reach the global minimum. Mean-
while, second-order methods like the natural gradient with
Kronecker-Factored Approximate Curvature (KFAC) often
require excessively high damping to remain stable at large
batch sizes. This high damping effectively “washes out” the
curvature information that gives these methods their advan-
tage, reducing their performance to that of simple gradient
descent. In this paper, we introduce Fisher-Orthogonal Pro-
jection (FOP), a novel technique that restores the effective-
ness of the second-order method at very large batch sizes,
enabling scalable training with improved generalization and
faster convergence. FOP constructs a variance-aware update
direction by leveraging gradients from two sub-batches, en-
hancing the average gradient with a component of the gra-
dient difference that is orthogonal to the average under the
Fisher-metric. Through extensive benchmarks, we show that
FOP accelerates convergence by ×1.2–1.3 over KFAC and
×1.5–1.7 over SGD/AdamW at the same moderate batch
sizes, while at extreme scales it achieves up to a ×7.5
speedup. Unlike other methods, FOP maintains small-batch
accuracy when scaling to extremely large batch sizes. More-
over, it reduces Top-1 error by 2.3–3.3% on long-tailed CI-
FAR benchmarks, demonstrating robust generalization un-
der severe class imbalance. Our lightweight, geometry-aware
use of intra-batch variance makes natural-gradient optimiza-
tion practical on modern data-centre GPUs. FOP is open-
source and pip-installable, which can be integrated into ex-
isting training code with a single line and no extra configura-
tion.
Introduction
The increasing scale of modern language models and vision
transformers has made large mini-batch training a neces-
sity. Modern GPUs offer extensive high-bandwidth mem-
ory (such as 192GB in AMD MI300X), and data centers
often combine hundreds of these devices, enabling the ef-
ficient processing of tens of thousands of training exam-
ples per batch. This improves hardware utilization, reduces
communication overhead, and accelerates training signifi-
cantly. As batch sizes increase, the gradients become more
deterministic. This reduces the stochastic noise that pre-
viously helped first-order optimizers such as SGD (Rob-
bins and Monro 1951), Adam (Kingma and Ba 2014), and
AdamW (Loshchilov and Hutter 2017) to explore flatter
minima of the loss landscape. The loss of this noise (Keskar
et al. 2016; McCandlish et al. 2018) requires first-order
methods to rely on smaller learning rates and stronger ex-
plicit regularization to maintain stability and generalization
performance.
Natural-Gradient Descent (NGD) addresses scenarios where
stochastic gradient noise is minimal, such as very large
mini-batches (Pascanu and Bengio 2013; Shazeer and Stern
2018; Ishikawa and Yokota 2024). Unlike first-order meth-
ods, NGD incorporates second-order curvature information
via the Fisher information matrix (Amari 1998), enabling
geometry-aware parameter updates invariant to model pa-
rameterization. However, exact computation of the Fisher
matrix is infeasible for modern neural networks. Practical
implementations rely on approximations, with Kronecker-
Factored Approximate Curvature (KFAC) being the most
widely adopted (Martens and Grosse 2015). KFAC simpli-
fies the Fisher matrix by exploiting the layer-wise structure
of neural networks, making the approximation computa-
tionally efficient. Subsequent methods further approximate
KFAC to improve tractability (Lin et al. 2024; Yang et al.
2020; Liu et al. 2024; Eschenhagen et al. 2023). Despite its
promise, KFAC struggles in the very-large-batch regime re-
quired for modern hardware. As the batch size grows, the
Fisher matrix becomes increasingly ill-conditioned (Sagun,
Bottou, and LeCun 2016; Ghorbani, Krishnan, and Xiao
2019), leading to numerical instability. This forces the use
of strong damping, which unfortunately suppresses the very
curvature information that gives KFAC its advantage.
Prior attempts to scale natural-gradient methods to large-
batch training include SENG (Yang et al. 2020), which
uses low-rank sketches but introduces new hyperparam-
eters, and SP-NGD (Osawa et al. 2020), which relies on
empirical-Fisher approximations and stale-statistic heuris-
tics that require task-specific hyperparameter retuning.
Adaptive batch-size schedules (Yao et al. 2018) improve
throughput but still rely on heavily damped mean gra-
dients. Although these methods reduce hardware require-
ments, their update rules remain fundamentally unchanged,
still dominated by mean gradients and extensive hyperpa-
rameter tuning.
In this paper, we propose Fisher-Orthogonal Projection
(FOP), which augments natural gradient descent with a
Fisher-orthogonal variance correction. This novel geometry-
aware update captures intra-batch gradient variation without
relying on sketching ranks, stale-statistics heuristics, or cus-
tomized communication strategies. Specifically, our contri-
butions include:
• Fisher–Orthogonal Projection optimizer. We propose a novel second-order update that augments the natural gradient with a geometry-aware, variance-controlled component, capturing intra-batch information that standard KFAC averages away.
• Extreme large-batch scalability. We demonstrate that FOP seamlessly scales to cases where SGD, AdamW, and KFAC break down, and it can achieve speedups of up to ×7.5 in wall-clock time while maintaining convergence at extremely large batch sizes on the ImageNet and CIFAR datasets.
• Robust generalization under imbalance. FOP reduces the Top-1 error rate by 2.3–3.3% on long-tailed CIFAR-LT benchmarks without additional tricks.
• Distributed FOP. We efficiently shard the Fisher computation across GPUs with a dual-gradient AllReduce and broadcasted updates, enabling scalable, low-overhead training.
Natural Gradient Descent
Natural Gradient Descent is a second-order optimization
method derived from Newton’s method, where the Hessian
is replaced by the Fisher Information Matrix to reflect the
geometry of the parameter space. In standard gradient de-
scent, the update rule is:
$$\theta_i = \theta_{i-1} - \eta \nabla_\theta L(\theta_{i-1}),$$
where $\theta_i$ is the parameter vector at iteration $i$, $\eta$ is the learning rate, and $\nabla_\theta L(\theta_{i-1})$ is the gradient of the loss function $L$ evaluated at $\theta_{i-1}$. This treats all directions in parameter space uniformly, which can lead to slow convergence in ill-conditioned landscapes. Newton's method improves this by preconditioning the gradient with the inverse Hessian $H^{-1}$, but computing the full Hessian is often infeasible for large models.
Natural Gradient Descent modifies the update by replacing the Hessian with the Fisher Information Matrix $F$, which defines a Riemannian metric on the statistical manifold (Amari 1998):
$$\theta_i = \theta_{i-1} - \eta F^{-1} \nabla_\theta L(\theta_{i-1}).$$
This yields the steepest descent direction under the Kullback–Leibler (KL) divergence, making the update invariant to parameter reparameterizations (Martens and Grosse 2015).

Figure 1: 3D loss landscape of ResNet-18 with CIFAR-10 for a batch size of 1024. Arrows represent the directions of the steps of the different gradients ($g_1$, $g_2$, their average, and FOP). The green star is the smallest loss after updating the model along the different update directions.
Fisher-Orthogonal Projection
A loss landscape is plotted based on the method suggested
in (Li et al. 2018) in Figure 1, by projecting the high-
dimensional parameter space onto a 2D plane using two ran-
dom but normalized directions and evaluating the loss across
a grid in this plane. Naively averaging gradients across mini-
batches can obscure useful optimization directions. In particular, such averag-
ing may suppress informative signals when gradients from
different batches point in significantly different directions.
This is especially true at large batch sizes, where gradient variability is
low but intra-batch differences can still carry important opti-
mization signals. We propose the Fisher-Orthogonal Projec-
tion (FOP) method, which preserves the informative struc-
ture of each mini-batch gradient while ensuring stable and
curvature-aware descent. The key idea is to use the average
gradient as the primary descent direction, capturing the com-
mon signal shared across mini-batches. In addition, FOP in-
troduces a Fisher-orthogonal component, derived from the
difference between two mini-batch gradients. This orthogo-
nal component contains complementary curvature-sensitive
information that would otherwise be lost through simple av-
eraging.
Suppose that at a parameter point $\theta$ we evaluate two independent mini-batch losses
$$L_1(\theta), \quad L_2(\theta), \qquad (1)$$
each associated with a gradient:
$$g_1 = \nabla_\theta L_1(\theta), \qquad g_2 = \nabla_\theta L_2(\theta). \qquad (2)$$
The FOP update is then constructed as follows. We compute the average and difference of the two gradients:
$$g_{\mathrm{avg}} = \tfrac{1}{2}(g_1 + g_2), \qquad g_{\mathrm{diff}} = g_1 - g_2. \qquad (3)$$
To eliminate redundancy and extract only novel information, we orthogonalize $g_{\mathrm{diff}}$ with respect to $g_{\mathrm{avg}}$ under the inner product induced by the Fisher matrix $F$. The projection scalar is computed as:
$$s_{\mathrm{proj}} = \frac{g_{\mathrm{diff}}^{\top} F g_{\mathrm{avg}}}{g_{\mathrm{avg}}^{\top} F g_{\mathrm{avg}} + \epsilon}, \qquad (4)$$
and the orthogonal component is then:
$$g_{\mathrm{diff}}^{\perp} = g_{\mathrm{diff}} - s_{\mathrm{proj}} \cdot g_{\mathrm{avg}}. \qquad (5)$$
By construction, this ensures that $\langle g_{\mathrm{avg}}, g_{\mathrm{diff}}^{\perp} \rangle_F = 0$ (as described in Lemma 1 in the Appendix), meaning the new component contains only information that is orthogonal, in the Fisher geometry, to what is already captured in the average gradient.
The final combined update direction is given by:
$$g_{\mathrm{combined}} = g_{\mathrm{avg}} + \beta g_{\mathrm{diff}}^{\perp}, \qquad (6)$$
where $\beta$ is a scalar weight, adaptively determined to locally minimize the primary or total loss. The overall parameter update using Natural Gradient Descent then becomes:
$$\theta_{t+1} = \theta_t - \eta F^{-1} g_{\mathrm{combined}}. \qquad (7)$$
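To make the construction concrete, the following is a minimal, flat-vector sketch of Eqs. 3–7 in PyTorch. It assumes the (damped) Fisher matrix F is available explicitly and densely, which is only feasible for toy problems; the actual optimizer applies these operations block-wise through KFAC's Kronecker factors, and the function name and signature below are illustrative, not part of the released package.

```python
import torch

def fop_direction(g1, g2, F, beta=1.0, eps=1e-12):
    """Illustrative flat-vector version of the FOP update (Eqs. 3-7).

    g1, g2 : gradients from two independent sub-batches, shape (n,)
    F      : explicit (damped) Fisher matrix, shape (n, n); in practice
             this is replaced by KFAC's Kronecker-factored blocks.
    """
    g_avg = 0.5 * (g1 + g2)
    g_diff = g1 - g2

    Fg_avg = F @ g_avg
    # Projection scalar under the Fisher inner product (Eq. 4).
    s_proj = (g_diff @ Fg_avg) / (g_avg @ Fg_avg + eps)
    # Component of g_diff that is F-orthogonal to g_avg (Eq. 5).
    g_perp = g_diff - s_proj * g_avg
    # Combined direction (Eq. 6) and the natural-gradient step (Eq. 7).
    g_comb = g_avg + beta * g_perp
    step = torch.linalg.solve(F, g_comb)
    return g_comb, step
```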
Layer-wise Adaptive Coefficient β
When the true optimization objective is the sum of the two per-batch losses, we can write the total loss as:
$$L_{\mathrm{tot}}(\theta) = L_1(\theta) + L_2(\theta). \qquad (8)$$
To minimize this objective locally, we seek an optimal mixing coefficient $\beta$ that balances the direction $g_{\mathrm{diff}}^{\perp}$ relative to the average gradient. We derive this optimal $\beta$ (denoted $\beta^{*}$) using a second-order Taylor approximation.
We begin with the natural-gradient update step of the form in Eq. 7:
$$\theta_{\mathrm{new}} = \theta - \eta F^{-1}\left(g_{\mathrm{avg}} + \beta g_{\mathrm{diff}}^{\perp}\right). \qquad (9)$$
Using a second-order Taylor expansion of $L_{\mathrm{tot}}$ around $\theta$, and assuming the Hessian matrix is approximated by the Fisher matrix ($\nabla^2 L_{\mathrm{tot}} \approx F$), the approximate loss after the update is:
$$L_{\mathrm{tot}}(\theta - \eta d) \approx L_{\mathrm{tot}}(\theta) - \eta (g_1 + g_2)^{\top} d + \frac{\eta^2}{2} d^{\top} F d, \qquad (10)$$
where
$$d = F^{-1}\left(g_{\mathrm{avg}} + \beta g_{\mathrm{diff}}^{\perp}\right). \qquad (11)$$
To isolate the effect of $\beta$, we define a surrogate objective $J_{\mathrm{tot}}(\beta)$ by dropping constants and factors of $\eta$:
$$J_{\mathrm{tot}}(\beta) = -(g_1 + g_2)^{\top} F^{-1}\left(g_{\mathrm{avg}} + \beta g_{\mathrm{diff}}^{\perp}\right) + \frac{1}{2}\left(g_{\mathrm{avg}} + \beta g_{\mathrm{diff}}^{\perp}\right)^{\top} F^{-1}\left(g_{\mathrm{avg}} + \beta g_{\mathrm{diff}}^{\perp}\right). \qquad (12)$$
To simplify, we define the following inner products:
$$D = g_{\mathrm{avg}}^{\top} F^{-1} g_{\mathrm{diff}}^{\perp}, \qquad E = \left(g_{\mathrm{diff}}^{\perp}\right)^{\top} F^{-1} g_{\mathrm{diff}}^{\perp}. \qquad (13)$$
Noting that $g_1 + g_2 = 2 g_{\mathrm{avg}}$, we substitute into (12) and expand:
$$J_{\mathrm{tot}}(\beta) = -2\|g_{\mathrm{avg}}\|_{F^{-1}}^{2} - 2\beta D + \frac{1}{2}\left(\|g_{\mathrm{avg}}\|_{F^{-1}}^{2} + 2\beta D + \beta^{2} E\right) = -\beta D + \frac{1}{2}\beta^{2} E + \mathrm{const}, \qquad (14)$$
where $\|g_{\mathrm{avg}}\|_{F^{-1}}^{2} = g_{\mathrm{avg}}^{\top} F^{-1} g_{\mathrm{avg}}$ is absorbed into the constant term.
To find the optimal $\beta^{*}$, we differentiate (14) with respect to $\beta$ and set the derivative to zero:
$$\frac{dJ_{\mathrm{tot}}}{d\beta} = -D + \beta^{*} E = 0 \;\Rightarrow\; \beta^{*} = \frac{D}{E}. \qquad (15)$$
Substituting back the definitions of $D$ and $E$, we obtain the closed-form expression:
$$\beta^{*} = \frac{g_{\mathrm{avg}}^{\top} F^{-1} g_{\mathrm{diff}}^{\perp}}{\left(g_{\mathrm{diff}}^{\perp}\right)^{\top} F^{-1} g_{\mathrm{diff}}^{\perp}}. \qquad (16)$$
This yields a layer-wise coefficient $\beta^{*}$ that minimizes our second-order surrogate, injecting orthogonal corrections only when beneficial. While it relies on the Hessian–Fisher approximation, which can misestimate curvature early in training or in highly nonlinear models, damped Fisher matrices and large-batch averaging curb these errors. In practice, any misleading orthogonal signal drives $\beta^{*} \to 0$, safely reducing FOP to a standard KFAC update.
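As a sketch, the closed-form coefficient of Eq. 16 can be evaluated per layer as below. The small eps guard is our own addition for numerical safety and is not part of the derivation, and F_inv stands in for the damped inverse Fisher block that KFAC maintains for the layer.

```python
import torch

def beta_star(g_avg, g_perp, F_inv, eps=1e-12):
    """Layer-wise beta* from Eq. 16 (flat-vector sketch).

    F_inv : damped inverse Fisher block for the layer; with KFAC this
            is applied via Kronecker factors rather than stored densely.
    """
    Finv_gperp = F_inv @ g_perp
    D = g_avg @ Finv_gperp           # g_avg^T F^{-1} g_perp
    E = g_perp @ Finv_gperp + eps    # (g_perp)^T F^{-1} g_perp
    return D / E
```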
Layer-wise Adaptive Scaling Step Size $\eta_\ell^{*}$
To ensure that each layer's update magnitude is automatically adjusted to account for its local curvature and gradient alignment, we introduce a layer-wise adaptive coefficient $\eta_\ell^{*}$. Rather than using a single global learning rate for all parameters, $\eta_\ell^{*}$ is chosen to (locally) minimize a quadratic approximation of the loss function along the natural-gradient direction for each layer $\ell$. Instead of differentiating with respect to $\beta$ in Eq. 10, we minimize this one-dimensional quadratic model in $\eta$. Setting the derivative with respect to $\eta$ to zero, the optimal step size becomes:
$$\eta_\ell^{*} = \frac{g_{\ell,\mathrm{tot}}^{\top} F_\ell^{-1} g_{\ell,\mathrm{comb}}}{g_{\ell,\mathrm{comb}}^{\top} F_\ell^{-1} g_{\ell,\mathrm{comb}}}. \qquad (17)$$
This expression automatically adjusts the step size based on both the alignment between the total gradient and the proposed update direction, and the curvature in that direction. When the curvature along $g_{\ell,\mathrm{comb}}$ is low and the update is well-aligned with the descent direction, $\eta_\ell^{*}$ will approach 1, allowing a full natural-gradient step. Conversely, when the curvature is high or the combined gradient is poorly aligned with the total gradient, $\eta_\ell^{*}$ decreases below 1, effectively damping the update to avoid overshooting. The final update for each layer is:
$$d_\ell = \eta_0\, \eta_\ell^{*}\, F_\ell^{-1} g_{\ell,\mathrm{comb}}, \qquad (18)$$
where $\eta_0$ is the base learning rate shared across the whole model.
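A corresponding sketch of Eqs. 17–18 is shown below. As above, F_inv denotes the layer's damped inverse Fisher block, the eps guard is an assumption for numerical safety, and the real implementation works on KFAC's factored blocks rather than dense matrices.

```python
import torch

def layer_update(g_tot, g_comb, F_inv, eta0=1.0, eps=1e-12):
    """Layer-wise adaptive step size and final update (Eqs. 17-18).

    g_tot  : total gradient for the layer (g1 + g2)
    g_comb : combined FOP direction for the layer (Eq. 6)
    """
    nat_dir = F_inv @ g_comb                                 # F_l^{-1} g_comb
    eta_star = (g_tot @ nat_dir) / (g_comb @ nat_dir + eps)  # Eq. 17
    return eta0 * eta_star * nat_dir                         # Eq. 18
```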
Kullback–Leibler (KL) norm analysis
Natural-gradient methods, such as KFAC (Martens and Grosse 2015), select update directions that maximize progress in parameter space relative to the KL-divergence between the model before and after the update. A standard second-order approximation of the KL-divergence gives rise to the KL-norm:
$$\mathrm{KL}(p_{\theta+\Delta} \,\|\, p_{\theta}) \approx \frac{1}{2}\|F^{1/2}\Delta\|^{2}, \qquad (19)$$
where $F$ is the Fisher information matrix and $\Delta$ is the parameter update step. The KL-norm quantifies how much the model distribution changes due to an update.
In our case, the FOP update step is defined as:
$$\Delta_{\mathrm{FOP}} = -\eta M^{-1}\left(g + \beta g^{\perp}\right), \qquad (20)$$
where $g$ is the average micro-batch gradient $g_{\mathrm{avg}}$, $g^{\perp}$ is the orthogonal correction term in Eq. 6, satisfying $(g^{\perp})^{\top} F g = 0$, $M = F + \lambda I$ is the damped Fisher matrix, and $\beta \in \mathbb{R}$ controls the contribution of the orthogonal component.
Substituting this into the KL-norm expression, we obtain a decomposition into three terms:
$$\|F^{1/2}\Delta_{\mathrm{FOP}}\|^{2} = \eta^{2}\left(g^{\top} Q g + 2\beta\, g^{\top} Q g^{\perp} + \beta^{2}\,(g^{\perp})^{\top} Q g^{\perp}\right), \qquad (21)$$
• $g^{\top} Q g$, corresponding to standard KFAC,
• $2\beta\, g^{\top} Q g^{\perp}$, a cross term,
• $\beta^{2}\,(g^{\perp})^{\top} Q g^{\perp}$, the orthogonal term,
where $Q = (F + \lambda I)^{-1} F (F + \lambda I)^{-1}$ acts as a curvature-weighted metric. In the large-damping case, where $\lambda \gg \Lambda_i$ for every Fisher eigenvalue $\Lambda_i$ that carries weight, the spectral factor $\frac{\Lambda_i}{(\Lambda_i + \lambda)^{2}}$ is proportional to $\Lambda_i/\lambda^{2}$. Therefore, the base term scales as $g^{\top} Q g \propto \lambda^{-2}$. As shown in Appendix B, the two additional terms from FOP decay even more gently. Specifically:
$$\|F^{1/2}\Delta_{\mathrm{FOP}}\|^{2} \le \eta^{2}\Bigg[\underbrace{g^{\top} Q g}_{\|F^{1/2}\Delta_{\mathrm{KFAC}}\|^{2}} + 2\beta\,\frac{\mu_{\max}}{\lambda}\,\|F^{-1/2} g\|\cdot\|F^{-1/2} g^{\perp}\| + \beta^{2}\,\frac{\mu_{\max}}{4\lambda}\,\|F^{-1/2} g^{\perp}\|^{2}\Bigg] \qquad (22)$$
Because the KL-norm of the FOP update splits into a base term that decays as $O(1/\lambda^{2})$ and two correction terms that decay only as $O(1/\lambda)$, there is an inherent separation in how quickly these components diminish as the damping parameter $\lambda$ increases.
During the early phase of training, it is common for $\|g_{\mathrm{avg}}\|$ to be large, while $\|g_{\mathrm{diff}}^{\perp}\|$ is also notable and dominated by high-frequency, low-curvature noise that is exaggerated by the application of $F^{-1}$. In such scenarios, the orthogonal correction $g_{\mathrm{diff}}^{\perp}$ often points in the opposite direction to the main descent path. As a result, the optimal mixing coefficient becomes negative ($\beta < 0$), leading to a negative cross-term in the KL-norm. This partial cancellation of the core component creates margins that allow the damping factor $\lambda$ to be safely reduced.
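The decomposition in Eq. 21 can be checked numerically on a small synthetic example. The snippet below is only a sanity check of the algebra (random positive semi-definite F, random gradients, η dropped), not a measurement on a real network.

```python
import torch

torch.manual_seed(0)
n, lam, beta = 50, 1e-2, 0.5

A = torch.randn(n, n, dtype=torch.float64)
F = A @ A.T / n                                          # random PSD "Fisher"
g1 = torch.randn(n, dtype=torch.float64)
g2 = torch.randn(n, dtype=torch.float64)

g = 0.5 * (g1 + g2)
g_diff = g1 - g2
g_perp = g_diff - (g_diff @ F @ g) / (g @ F @ g) * g     # F-orthogonal part (eps = 0)

M_inv = torch.linalg.inv(F + lam * torch.eye(n, dtype=torch.float64))
Q = M_inv @ F @ M_inv

kl_norm = (g + beta * g_perp) @ Q @ (g + beta * g_perp)
base = g @ Q @ g
cross = 2 * beta * (g @ Q @ g_perp)
orth = beta**2 * (g_perp @ Q @ g_perp)

print(torch.allclose(kl_norm, base + cross + orth))      # Eq. 21 decomposition holds
print(float(g_perp @ F @ g))                             # ~0: Fisher-orthogonality
```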
Distributed FOP
Algorithm 1: Distributed FOP with Dual Gradients
Require:
  M: neural network model
  P = {p_0, ..., p_{N-1}}: set of N GPU processes
  S_j: subset of layers for which p_j is the curvature specialist
  D_j: local mini-batch on p_j
  F_i: running estimate of the Fisher matrix block for layer ℓ_i, stored and updated by its specialist GPU p_k
  G_pri = {p_j ∈ P | j mod 2 = 0}  ▷ primary group
  G_sec = {p_j ∈ P | j mod 2 = 1}  ▷ secondary group
Ensure: preconditioned global gradient g̃
 1: for all p_j ∈ P in parallel do
 2:   g_j ← ∇_θ L(M; D_j)  ▷ local back-prop
 3:   if curvature update step then
 4:     for all layers ℓ_i in M do
 5:       p_k ← specialist s.t. ℓ_i ∈ S_k
 6:       send local curvature factors of ℓ_i to p_k  /* async, non-blocking */
 7:       if j = k then
 8:         update and invert F_i → F_i^{-1}
 9:       end if
10:     end for
11:   end if
12:   if p_j ∈ G_pri then
13:     global ALLREDUCE to compute g_1
14:   else
15:     global ALLREDUCE to compute g_2
16:   end if
17:   for all layers ℓ_i in M do
18:     p_k ← specialist for ℓ_i
19:     if j = k then
20:       g̃_i ← FOP(g_{1,i}, g_{2,i}, F_i^{-1})
21:       broadcast g̃_i to all processes
22:     end if
23:   end for
24: end for
25: assemble g̃ ← [g̃_1, ..., g̃_L]  ▷ L = #layers
Figure 2: Test accuracy vs. wall-clock time (in seconds) for ResNet-18 on CIFAR-10, grouped by batch size. The dotted line represents the threshold of 91%.

Traditional second-order optimization methods like KFAC are notoriously difficult to scale due to the high memory and computation costs of storing and inverting large curvature matrices. Moreover, synchronizing these matrices across multiple GPUs introduces significant communication overhead, making them impractical for large-scale training without careful system design. We design a scalable FOP implementation that combines data parallelism with lightweight model parallelism to minimize the overhead of splitting large batches. First, we assign each GPU as a specialist, responsible for updating the curvature (Fisher information) of a subset of layers via a sharded preconditioner similar to the previous works (Osawa et al. 2023; Pauloski et al. 2022). Second, we introduce a dual-gradient reduction strategy, where two global gradients $g_1$ and $g_2$ are computed in parallel over disjoint GPU groups. Finally, each specialist GPU applies FOP using its local Fisher inverse and both gradients to compute its layer's update, which is then broadcast across the GPUs. This distributed design enables efficient second-order updates across large-scale multi-GPU systems. The full algorithm is shown in Algorithm 1.
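The dual-gradient reduction at the heart of Algorithm 1 can be sketched with torch.distributed as below. This is one possible realisation under simplifying assumptions (gradients flattened into a single tensor, at least two ranks with rank 0 and rank 1 acting as representatives of the two groups); the actual implementation shards per-layer Fisher factors and overlaps communication, which is omitted here.

```python
import torch
import torch.distributed as dist

def make_groups(world_size):
    """Primary group = even ranks, secondary group = odd ranks (Algorithm 1)."""
    primary = dist.new_group(ranks=[r for r in range(world_size) if r % 2 == 0])
    secondary = dist.new_group(ranks=[r for r in range(world_size) if r % 2 == 1])
    return primary, secondary

def dual_gradients(local_grad, rank, world_size, primary, secondary):
    """Return (g1, g2): the two group-averaged gradients, visible on every rank."""
    my_group = primary if rank % 2 == 0 else secondary
    group_size = len([r for r in range(world_size) if r % 2 == rank % 2])

    g_mine = local_grad.clone()
    dist.all_reduce(g_mine, op=dist.ReduceOp.SUM, group=my_group)
    g_mine /= group_size

    # Broadcast each group's average so every rank (and hence every layer
    # specialist) holds both g1 and g2 before applying FOP.
    g1 = g_mine.clone() if rank % 2 == 0 else torch.empty_like(local_grad)
    g2 = g_mine.clone() if rank % 2 == 1 else torch.empty_like(local_grad)
    dist.broadcast(g1, src=0)   # rank 0 belongs to the primary group
    dist.broadcast(g2, src=1)   # rank 1 belongs to the secondary group
    return g1, g2
```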
Experiments
In this section, we rigorously evaluate FOP against both
first-order (SGD, AdamW) and second-order (KFAC) base-
lines across four vision benchmarks: CIFAR-10 with
ResNet-18, ImageNet-100 with T2T-ViT, ImageNet-1K
with ResNet-50, and long-tailed CIFAR10-LT/100-LT with
ResNet-32, demonstrating its fast convergence, large-batch
scalability, and robustness under class imbalance. To en-
sure fair and rigorous comparisons, we evaluate all methods
on several standard benchmarks: ResNet-18 on CIFAR-10,
ResNet-50 on ImageNet-1k, and a Vision Transformer (ViT)
on ImageNet-100. For each setting, we perform an exten-
sive hyperparameter search for all optimizers. This includes
tuning the learning rate, and for second-order methods like
KFAC and FOP, also tuning the damping ratioλ. Following
the linear scaling rule from (Goyal et al. 2017), the learning
rate is scaled proportionally with the batch size. Addition-
ally, the curvature update frequency for the second-order op-
timizer is reduced as the batch size increases, until it reaches
a minimum threshold of 5 steps. To isolate the effect of our
Fisher-orthogonal projection, FOP and KFAC share identi-
cal learning-rate schedules in every experiment. While our
results highlight FOP’s advantages, we do not claim that
FOP is a universally superior optimizer for all tasks and
architectures. Rather, the evidence shows that augmenting
natural-gradient methods with a principled, geometry-aware
variance component, as FOP does, offers a robust and scalable
path for second-order optimization in modern large-batch
training scenarios. All experiments were performed on a sin-
gle node with two AMD EPYC 9534 64-core CPUs and
eight AMD MI300X GPUs. The implementations of KFAC
and FOP rely on the ASDL package (Osawa et al. 2023).
Full details of the hyperparameter search space, such as
the learning rate, damping ratio, and random seeds,
and a discussion of the memory overhead are provided
in the Appendix.
CIFAR10 with ResNet18
We first evaluate optimization performance on the CIFAR-
10 dataset (Krizhevsky, Hinton et al. 2009), a widely used
benchmark for assessing second-order optimizers (Eschen-
hagen et al. 2023; Liu et al. 2024; Martens and Grosse 2015).
Each experiment is run with 5 different random seeds to en-
sure robustness. We employ the ReduceLROnPlateau
learning rate scheduler during training. Figure 2 illustrates
the progression of test accuracy over wall-clock time for
ResNet-18, across batch sizes ranging from 2048 to 50000
(the total number of training samples in CIFAR-10). At
small to moderate batch sizes (e.g., 2048 and 4096), all op-
timizers, including SGD, AdamW, KFAC, and FOP, achieve
91% accuracy, though FOP consistently reaches this thresh-
old the fastest. As batch size increases beyond 16384, first-
order optimizers such as SGD and AdamW struggle to con-
verge within the same epoch limit, failing to reach 91%
accuracy altogether at larger batch sizes (e.g., 32768 and
50000).
These trends are quantitatively summarized in Table 1,
which reports both the epochs and total wall-clock time
to reach 91% Top-1 accuracy on CIFAR-10 using the
same GPU count. At BS=2048, FOP hits the target accu-
racy in 29 epochs / 475.2 s, ×1.56 faster than SGD (58 / 743.3 s)
and ×1.26 faster than KFAC (37 / 588.7 s). As we scale
to BS=4096 and 8192, FOP's speedup over SGD grows
to ×2.52 and ×2.91, respectively, and reaches ×3.78 at
BS=16384. Crucially, FOP is the only method to reach 91%
at BS=32768 and 50000, doing so in 60 epochs / 90.6 s (×5.05) and
82 epochs / 84.3 s (×5.43). These results underscore FOP's excep-
tional large-batch scalability and its ability to deliver sub-
stantial accelerations.
ImageNet-100 with T2T-ViT
To evaluate FOP on modern transformers, we train a Tokens-
to-Token Vision Transformer (T2T-ViT) from scratch on
ImageNet-100 using 3 different random seeds (Yuan
et al. 2021; Lin et al. 2024). Following Lin et al. (2024), we
Batch Size (GPU)   SGD          AdamW        KFAC                 FOP (ours)
2048 (2)           58 / 743.3   61 / 768.4   37 / 588.7           29 / 475.2
4096 (2)           73 / 457.9   73 / 454.0   34 / 270.5 [×1.69]   22 / 181.9 [×2.52]
8192 (2)           – / –        – / –        71 / 296.4 [×1.54]   35 / 157.5 [×2.91]
16384 (2)          – / –        – / –        99 / 186.4 [×2.46]   58 / 121.2 [×3.78]
32768 (2)          – / –        – / –        – / –                60 / 90.6 [×5.05]
50000 (2)          – / –        – / –        – / –                82 / 84.3 [×5.43]

Table 1: Epochs and training time (in seconds) to reach 91% Top-1 accuracy on CIFAR-10. Each row shows the batch size and number of GPUs used in the format Batch (GPU), and each cell the corresponding Epochs / Time. "–" indicates the accuracy threshold was not reached. For KFAC and FOP, the bracketed numbers show the speedup factor relative to SGD at the batch size of 4096.
apply FOP and KFAC only to the convolutional and linear
layers’ gradients and activations, while all other parameters
(e.g., LayerNorm) are updated with AdamW (Loshchilov
and Hutter 2017). We run 100 epochs with a cosine learning-
rate schedule and measure Top-1 accuracy over wall-clock
time (Figure 3). We set our target at 80.6%, the best result
achieved by AdamW at batch size 512, and report the epochs
and training time to reach it in Table 2.
AdamW requires nearly the full 100 epochs and the longest
runtime at batch size 512, whereas both second-order meth-
ods hit 80.6% in substantially fewer epochs and less time.
FOP consistently outperforms KFAC and AdamW across
batch sizes larger than 512, delivering speedups of ×4.33,
×6.90, and ×10.48, whereas KFAC only achieves ×3.80,
×6.34, and ×6.45 compared to AdamW. KFAC scales bet-
ter than AdamW but still lags behind FOP, typically needing
more epochs to reach the same accuracy.
Batch Size (GPU)   AdamW          KFAC                  FOP (ours)
512 (1)            97 / 17498.8   42 / 8315.6 [×2.10]   44 / 9535.7 [×1.84]
1024 (2)           –              49 / 4603.6 [×3.80]   41 / 4038.9 [×4.33]
2048 (4)           –              57 / 2760.5 [×6.34]   48 / 2535.4 [×6.90]
4096 (8)           –              87 / 2715.3 [×6.45]   49 / 1670.4 [×10.48]

Table 2: Epochs and training time (in seconds) to reach 80.6% Top-1 accuracy for T2T-ViT on ImageNet-100. For KFAC and FOP, the bracketed numbers show the speedup factor relative to AdamW at the batch size of 512.
Figure 3: Test accuracy vs. wall-clock time (in seconds) for T2T-ViT on ImageNet-100, grouped by batch size. The dotted line represents the threshold of 80.6%.
ImageNet-1K with ResNet50
Batch Size (GPU)   SGD           KFAC                  FOP (ours)
1024 (1)           71 / 2511.1   35 / 1336.5 [×1.88]   32 / 1306.1 [×1.92]
2048 (2)           – / –         – / –                 33 / 999.5 [×2.51]
4096 (4)           – / –         – / –                 34 / 514.9 [×4.88]
8192 (8)           – / –         – / –                 40 / 335.1 [×7.50]

Table 3: Epochs and training time (in minutes) to reach 75.9% Top-1 accuracy on ImageNet-1K with ResNet-50. For KFAC and FOP, the bracketed numbers show the speedup factor relative to SGD at the batch size of 1024.
To evaluate FOP on a much larger-scale dataset, we train
ResNet-50 from scratch on ImageNet-1K (Deng et al. 2009)
using 3 different random seeds, to see if it can reach
Top-1 test accuracy of 75.9%, which is a standard large-scale
convolutional benchmark (Mattson et al. 2019). Following
Yang et al. (2020); Liu et al. (2024), we run SGD for 80
epochs with a cosine learning-rate schedule, and we train
both KFAC and FOP for 50 epochs using an exponential
schedule on their learning rates.
Figure 4: Test accuracy vs. wall-clock time (in minutes) for ResNet-50 on ImageNet-1K, grouped by batch size. The dotted line represents the threshold of 75.9%.

Figure 4 and Table 3 illustrate the dramatic efficiency gains of FOP in reaching 75.9% Top-1 accuracy on ImageNet-1K. At BS=1024, FOP converges in 32 epochs / 1306.1 min, which is nearly ×2 faster than SGD (71 epochs / 2511.1 min) and just slightly ahead of KFAC's 35 epochs / 1336.5 min (×1.88). Beyond this scale, only FOP succeeds in hitting the threshold, doing so in 33 epochs / 999.5 min at BS=2048 (×2.51), 34 epochs / 514.9 min at BS=4096 (×4.88), and 40 epochs / 335.1 min at BS=8192 (×7.50), while both SGD and KFAC stall below 75.9%. FOP's steadily shrinking time-to-threshold with increasing batch size demonstrates its better large-batch scalability and clear advantage over first- and second-order alternatives. Moreover, our FOP results compare favorably even to recent second-order methods. For instance, SENG (Yang et al. 2020) reaches 75.9% Top-1 on ImageNet-1K in 41 epochs at BS=4096, and SP-NGD (Osawa et al. 2020) only hits 74.8%–75.3% at BS=4096–8192. In contrast, FOP matches or exceeds their results in fewer epochs and achieves the same threshold in 34 epochs at BS=4096 and 40 epochs at BS=8192.
CIFAR10-LT with ResNet32
Finally, to assess optimizer robustness under long-tailed
data, we evaluate KFAC and FOP on the CIFAR10-LT and
CIFAR100-LT benchmarks (imbalance factors of 100 and
50) using a lightweight ResNet-32, following the training
protocol of Zhang et al. (2021). We apply both second-order
methods as preconditioners with a fixed damping ratio of
1e−3and obtain results over 5 random seeds.
Table 4 reports the Top-1 error rates on the CIFAR-LT
datasets under two imbalance factors (50 and 100). We com-
pare our implementations of KFAC and FOP against base-
line results from Zhang et al. (2021) and representative re-
sults from other recent works (Liu et al. 2019; Kang et al.
2019; Cui et al. 2019). FOP consistently delivers the lowest
error, outperforming the baseline by 2.3–3.3% smaller er-
ror rate and KFAC by 1.8–2.0% smaller error rate across all
settings. For example, on CIFAR-100-LT with factor 100,
it cuts the error from 62.27% (baseline) to 58.97% (3.3%
drop), and on CIFAR-10-LT with factor 50, it reduces er-
ror from 23.55% to 20.55% (3.0% drop), highlighting its ro-
bustness under severe class imbalance. By contrast, KFAC
delivers only modest reduction (≈1%) on the CIFAR-LT
dataset with an imbalance factor of 50, and even underper-
forms the baseline/other works on those with the factor of
100 (28.59% vs. 28.05% in CIFAR-10-LT, and 61.78% vs.
61.68% in CIFAR-100-LT). These results highlight FOP’s
superior robustness under severe class imbalance, a benefit
we attribute to the Fisher–Orthogonal Projection term that
balances curvature estimates across head and tail classes.
Datasets           CIFAR-10-LT        CIFAR-100-LT
Imbalance Factor   100       50       100       50
Backbone           ResNet-32
Baselines          28.05     23.55    62.27     56.22
Other works        29.64     25.19    61.68     56.15
KFAC               28.59 ↑   22.29 ↓  61.78 ↑   55.02 ↓
FOP (ours)         26.65 ↓   20.55 ↓  58.97 ↓   53.67 ↓

Table 4: Top-1 error rates on CIFAR-LT, comparing KFAC and FOP against the implementation baseline (Zhang et al. 2021) and prior results (Liu et al. 2019; Kang et al. 2019; Cui et al. 2019). ↓ indicates that an error rate is lower (better) than both the baseline and other reported results; ↑ indicates that it is higher (worse).
Conclusion
In this work, we introduced Fisher–Orthogonal Projec-
tion (FOP), a novel second-order optimizer that combines
natural-gradient updates with a geometry-aware variance
correction. This design eliminates the need for task-specific
tuning when scaling to multiple GPUs in large-batch train-
ing. FOP enables seamless scaling to extremely large batch
sizes, without requiring any additional tricks or adaptive hy-
perparameter adjustments beyond scaling the learning rate
with the batch size. In contrast, standard optimizers such as
SGD, AdamW, and KFAC often collapse or demand exten-
sive tuning under such conditions. We validate FOP through
extensive experiments on four diverse benchmarks: ResNet-
18 on CIFAR-10, T2T-ViT on ImageNet-100, ResNet-50
on ImageNet-1K, and ResNet-32 on long-tailed CIFAR-
LT. Across these settings, FOP consistently accelerates con-
vergence and scales more robustly to large batches. More-
over, under severe class imbalance in CIFAR-LT dataset,
FOP delivers better generalization, reducing Top-1 error by
2.3–3.3% compared to strong baselines. Together, these re-
sults highlight FOP’s unique ability to unlock stable, plug-
and-play large-batch training across both convolutional and
transformer architectures. This makes it especially well-
suited for large-scale distributed and resource-constrained
environments, paving the way for practical, reliable second-
order optimization at scale.
References
Amari, S.; and Nagaoka, H. 2000. Methods of information geometry.
Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural Computation, 10(2): 251–276.
Cui, Y.; Jia, M.; Lin, T.-Y.; Song, Y.; and Belongie, S. 2019. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9268–9277.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE.
DeVries, T.; and Taylor, G. W. 2017. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv preprint arXiv:1708.04552.
Eschenhagen, R.; Immer, A.; Turner, R.; Schneider, F.; and Hennig, P. 2023. Kronecker-factored approximate curvature for modern neural network architectures. Advances in Neural Information Processing Systems, 36: 33624–33655.
Ghorbani, B.; Krishnan, S.; and Xiao, Y. 2019. An investigation into neural net optimization via hessian eigenvalue density. In International Conference on Machine Learning, 2232–2241. PMLR.
Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; and He, K. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Ishikawa, S.; and Yokota, R. 2024. When Does Second-Order Optimization Speed Up Training? In The Second Tiny Papers Track at ICLR 2024.
Kang, B.; Xie, S.; Rohrbach, M.; Yan, Z.; Gordo, A.; Feng, J.; and Kalantidis, Y. 2019. Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217.
Keskar, N. S.; Mudigere, D.; Nocedal, J.; Smelyanskiy, M.; and Tang, P. T. P. 2016. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
Li, H.; Xu, Z.; Taylor, G.; Studer, C.; and Goldstein, T. 2018. Visualizing the loss landscape of neural nets. Advances in Neural Information Processing Systems, 31.
Lin, W.; Dangel, F.; Eschenhagen, R.; Neklyudov, K.; Kristiadi, A.; Turner, R. E.; and Makhzani, A. 2024. Structured Inverse-Free Natural Gradient Descent: Memory-Efficient & Numerically-Stable KFAC. In International Conference on Machine Learning (ICML).
Liu, X.; Li, S.; Gao, K.; and Wang, B. 2024. A layer-wise natural gradient optimizer for training deep neural networks. Advances in Neural Information Processing Systems, 37: 28460–28489.
Liu, Z.; Miao, Z.; Zhan, X.; Wang, J.; Gong, B.; and Yu, S. X. 2019. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2537–2546.
Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Martens, J.; and Grosse, R. 2015. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, 2408–2417. PMLR.
Mattson, P.; Cheng, C.; Coleman, C.; Diamos, G.; ...; Reddi, V. J.; and Zaharia, M. 2019. MLPerf Training Benchmark. arXiv:1910.01500.
McCandlish, S.; Kaplan, J.; Amodei, D.; and Team, O. D. 2018. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162.
Osawa, K.; Ishikawa, S.; Yokota, R.; Li, S.; and Hoefler, T. 2023. ASDL: A Unified Interface for Gradient Preconditioning in PyTorch. arXiv e-prints, arXiv–2305.
Osawa, K.; Tsuji, Y.; Ueno, Y.; Naruse, A.; Foo, C.-S.; and Yokota, R. 2020. Scalable and practical natural gradient for large-scale deep learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1): 404–415.
Pascanu, R.; and Bengio, Y. 2013. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584.
Pauloski, J. G.; Huang, L.; Xu, W.; Chard, K.; Foster, I. T.; and Zhang, Z. 2022. Deep neural network training with distributed K-FAC. IEEE Transactions on Parallel and Distributed Systems, 33(12): 3616–3627.
Robbins, H.; and Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics, 400–407.
Sagun, L.; Bottou, L.; and LeCun, Y. 2016. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476.
Shazeer, N.; and Stern, M. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, 4596–4604. PMLR.
Team, P. 2017. PyTorch Examples. https://github.com/pytorch/examples. Accessed: 2025-08-03.
Yang, M.; Xu, D.; Wen, Z.; Chen, M.; and Xu, P. 2020. Sketchy empirical natural gradient methods for deep learning. arXiv preprint arXiv:2006.05924.
Yao, Z.; Gholami, A.; Arfeen, D.; Liaw, R.; Gonzalez, J.; Keutzer, K.; and Mahoney, M. 2018. Large batch size training of neural networks with adversarial training and second-order information. arXiv preprint arXiv:1810.01021.
Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.-H.; Tay, F. E.; Feng, J.; and Yan, S. 2021. Tokens-to-Token ViT: Training Vision Transformers From Scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 558–567.
Zhang, Y.; Wei, X.; Zhou, B.; and Wu, J. 2021. Bag of Tricks for Long-Tailed Visual Recognition with Deep Convolutional Neural Networks. In AAAI, 3447–3455.
Supplementary Material
A. Notations and Lemmas
Lemma 1. Let $g_{\mathrm{diff}}, g_{\mathrm{avg}} \in \mathbb{R}^{n}$, and let $F \in \mathbb{R}^{n \times n}$ be a symmetric positive semi-definite matrix (the Fisher information matrix). Define the scalar projection
$$s_{\mathrm{FOP}} := \frac{g_{\mathrm{diff}}^{\top} F g_{\mathrm{avg}}}{g_{\mathrm{avg}}^{\top} F g_{\mathrm{avg}} + \epsilon}$$
and the orthogonal component
$$g_{\mathrm{diff}}^{\perp} := g_{\mathrm{diff}} - s_{\mathrm{FOP}}\, g_{\mathrm{avg}}$$
for some small constant $\epsilon$. Then the Fisher inner product between $g_{\mathrm{diff}}^{\perp}$ and $g_{\mathrm{avg}}$ satisfies:
$$\left(g_{\mathrm{diff}}^{\perp}\right)^{\top} F g_{\mathrm{avg}} = g_{\mathrm{diff}}^{\top} F g_{\mathrm{avg}} \cdot \frac{\epsilon}{g_{\mathrm{avg}}^{\top} F g_{\mathrm{avg}} + \epsilon}.$$
In particular:
• If $\epsilon = 0$, then $g_{\mathrm{diff}}^{\perp}$ is exactly $F$-orthogonal to $g_{\mathrm{avg}}$.
• For $\epsilon > 0$, the deviation from orthogonality is bounded by:
$$\left|\left(g_{\mathrm{diff}}^{\perp}\right)^{\top} F g_{\mathrm{avg}}\right| \le \epsilon\,\|F^{1/2} g_{\mathrm{diff}}\|\cdot\|F^{-1/2} g_{\mathrm{avg}}\|.$$
Proof. We start with the definition:
$$g_{\mathrm{diff}}^{\perp} = g_{\mathrm{diff}} - s_{\mathrm{FOP}}\, g_{\mathrm{avg}}, \qquad s_{\mathrm{FOP}} = \frac{g_{\mathrm{diff}}^{\top} F g_{\mathrm{avg}}}{g_{\mathrm{avg}}^{\top} F g_{\mathrm{avg}} + \epsilon}.$$
Now take the Fisher inner product of $g_{\mathrm{diff}}^{\perp}$ with $g_{\mathrm{avg}}$:
$$\left(g_{\mathrm{diff}}^{\perp}\right)^{\top} F g_{\mathrm{avg}} = g_{\mathrm{diff}}^{\top} F g_{\mathrm{avg}} - s_{\mathrm{FOP}} \cdot g_{\mathrm{avg}}^{\top} F g_{\mathrm{avg}} = g_{\mathrm{diff}}^{\top} F g_{\mathrm{avg}}\left(1 - \frac{g_{\mathrm{avg}}^{\top} F g_{\mathrm{avg}}}{g_{\mathrm{avg}}^{\top} F g_{\mathrm{avg}} + \epsilon}\right) = g_{\mathrm{diff}}^{\top} F g_{\mathrm{avg}} \cdot \frac{\epsilon}{g_{\mathrm{avg}}^{\top} F g_{\mathrm{avg}} + \epsilon}.$$
This proves the first part of the lemma. If $\epsilon = 0$, the expression is zero, yielding exact orthogonality:
$$\left(g_{\mathrm{diff}}^{\perp}\right)^{\top} F g_{\mathrm{avg}} = 0.$$
For $\epsilon > 0$, we apply the Cauchy–Schwarz inequality under the Fisher inner product:
$$\left|\left(g_{\mathrm{diff}}^{\perp}\right)^{\top} F g_{\mathrm{avg}}\right| = \left|g_{\mathrm{diff}}^{\top} F g_{\mathrm{avg}} \cdot \frac{\epsilon}{g_{\mathrm{avg}}^{\top} F g_{\mathrm{avg}} + \epsilon}\right| \le \epsilon\cdot\|F^{1/2} g_{\mathrm{diff}}\|\cdot\|F^{-1/2} g_{\mathrm{avg}}\|.$$
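The identity above is easy to verify numerically. The following small check (a toy setting we assume: random symmetric PSD F and random vectors) confirms that the residual Fisher inner product matches the closed form and vanishes as ε → 0.

```python
import torch

torch.manual_seed(0)
n, eps = 8, 1e-6

A = torch.randn(n, n, dtype=torch.float64)
F = A @ A.T                                  # symmetric PSD "Fisher"
g_avg = torch.randn(n, dtype=torch.float64)
g_diff = torch.randn(n, dtype=torch.float64)

s = (g_diff @ F @ g_avg) / (g_avg @ F @ g_avg + eps)
g_perp = g_diff - s * g_avg

lhs = g_perp @ F @ g_avg
rhs = (g_diff @ F @ g_avg) * eps / (g_avg @ F @ g_avg + eps)
print(torch.allclose(lhs, rhs))              # identity from Lemma 1
```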
Lemma 2. Let $F \succ 0$ be a symmetric positive definite matrix, and let $\mu_{\min}$ and $\mu_{\max}$ denote the minimum and maximum eigenvalues of $F$, respectively. Then for any vector $x \in \mathbb{R}^{n}$, the following inequality holds:
$$\mu_{\min}^{1/2}\,\|F^{-1/2} x\| \le \|x\| \le \mu_{\max}^{1/2}\,\|F^{-1/2} x\|.$$
Proof. Since $F \succ 0$, it is diagonalizable with positive eigenvalues $\mu_1, \dots, \mu_n$ and orthonormal eigenvectors. Let $x \in \mathbb{R}^{n}$. The standard Euclidean norm is:
$$\|x\|^{2} = x^{\top} x,$$
and the Fisher-scaled norm is:
$$\|F^{-1/2} x\|^{2} = x^{\top} F^{-1} x.$$
From spectral theory, the matrix inequality holds:
$$\mu_{\min} I \preceq F \preceq \mu_{\max} I,$$
which implies (after inverting all sides):
$$\mu_{\max}^{-1} I \preceq F^{-1} \preceq \mu_{\min}^{-1} I.$$
Now apply these bounds to the quadratic form $x^{\top} F^{-1} x$:
$$x^{\top} F^{-1} x \le \mu_{\min}^{-1}\, x^{\top} x = \mu_{\min}^{-1}\,\|x\|^{2},$$
$$x^{\top} F^{-1} x \ge \mu_{\max}^{-1}\, x^{\top} x = \mu_{\max}^{-1}\,\|x\|^{2}.$$
Rewriting in terms of norms:
$$\|F^{-1/2} x\|^{2} \le \mu_{\min}^{-1}\,\|x\|^{2} \;\Rightarrow\; \|x\| \ge \mu_{\min}^{1/2}\,\|F^{-1/2} x\|,$$
$$\|F^{-1/2} x\|^{2} \ge \mu_{\max}^{-1}\,\|x\|^{2} \;\Rightarrow\; \|x\| \le \mu_{\max}^{1/2}\,\|F^{-1/2} x\|.$$
B. KL-norm analysis
B.1. KL-norm of FOP
In natural gradient methods such as KFAC (Martens and Grosse 2015), the update is chosen along the direction in parameter space which gives the largest change in the objective per unit of change in the model (Amari and Nagaoka 2000), as measured by the Kullback–Leibler (KL) divergence between the current model $p_{\theta}$ and the updated model $p_{\theta+\Delta}$.
By applying a second-order Taylor expansion of the KL divergence around $\theta$, we obtain the following approximation:
$$\mathrm{KL}(p_{\theta+\Delta} \,\|\, p_{\theta}) \approx \frac{1}{2}\Delta^{\top} F \Delta = \frac{1}{2}\|F^{1/2}\Delta\|^{2}, \qquad (23)$$
where $F$ is the Fisher information matrix, and $\|F^{1/2}\Delta\|^{2}$ is known as the KL-norm of the update step $\Delta$. This KL-norm measures how far the update moves the model in distribution space, making it a more meaningful metric for optimization in probabilistic models.
In our paper, the update rule of the Fisher-Orthogonal Projection (FOP) method is:
$$\Delta_{\mathrm{FOP}} = -\eta M^{-1}\left(g + \beta g^{\perp}\right), \qquad (24)$$
where:
• $M = F + \lambda I$ is the damped Fisher matrix,
• $\beta \in \mathbb{R}$ controls the contribution from the orthogonal component $g^{\perp}$,
• $\eta$ is the learning rate,
• $g^{\perp} \equiv g_{\mathrm{diff}}^{\perp}$ and $g \equiv g_{\mathrm{avg}}$,
• $g^{\perp}$ is orthogonal to $g$ in the Fisher metric: $(g^{\perp})^{\top} F g = 0$.
To compute the KL-norm of this FOP update, we evaluate:
$$\|F^{1/2}\Delta_{\mathrm{FOP}}\|^{2} = \eta^{2}\left(g + \beta g^{\perp}\right)^{\top} M^{-1} F M^{-1}\left(g + \beta g^{\perp}\right). \qquad (25)$$
Let
$$Q = M^{-1} F M^{-1}, \qquad (26)$$
which is symmetric and positive semi-definite. Then the KL-norm becomes:
$$\|F^{1/2}\Delta_{\mathrm{FOP}}\|^{2} = \eta^{2}\left(g + \beta g^{\perp}\right)^{\top} Q\left(g + \beta g^{\perp}\right) = \eta^{2}\left(g^{\top} Q g + 2\beta\, g^{\top} Q g^{\perp} + \beta^{2}\,(g^{\perp})^{\top} Q g^{\perp}\right), \qquad (27)$$
where $g^{\top} Q g$ is the KL-norm of KFAC, $2\beta\, g^{\top} Q g^{\perp}$ is a cross term, and $\beta^{2}\,(g^{\perp})^{\top} Q g^{\perp}$ is the contribution of the orthogonal projection. Each term captures a different component of the curvature-informed update.
The first term, $g^{\top} Q g$, was proved to be situated in the trust region (Martens and Grosse 2015), so we now focus on bounding the cross term $g^{\top} Q g^{\perp}$ and the orthogonal term $\beta^{2}\,(g^{\perp})^{\top} Q g^{\perp}$. Bounding these terms is essential to ensure that the KL-norm remains predictable under varying damping parameters.
Bounding $2\beta\, g^{\top} Q g^{\perp}$. Substituting Eq. 26 into the cross term gives:
$$g^{\top} Q g^{\perp} = g^{\top} M^{-1} g^{\perp} - \lambda\, g^{\top} M^{-2} g^{\perp}, \qquad (28)$$
and we can bound this:
$$|g^{\top} Q g^{\perp}| = \left|g^{\top} M^{-1} g^{\perp} - \lambda\, g^{\top} M^{-2} g^{\perp}\right|, \qquad (29)$$
and with the triangle inequality ($|a - b| \le |a| + |b|$):
$$|g^{\top} Q g^{\perp}| \le |g^{\top} M^{-1} g^{\perp}| + \lambda\, |g^{\top} M^{-2} g^{\perp}|. \qquad (30)$$
Applying the Cauchy–Schwarz inequality to each term. For the first term:
$$|g^{\top} M^{-1} g^{\perp}| \le \|M^{-1/2} g\|\cdot\|M^{-1/2} g^{\perp}\|. \qquad (31)$$
For the second term:
$$|g^{\top} M^{-2} g^{\perp}| \le \|M^{-1} g\|\cdot\|M^{-1} g^{\perp}\|. \qquad (32)$$
Since $M^{-1} \preceq \lambda^{-1} I$, it follows that:
$$\|M^{-1/2} g\| \le \lambda^{-1/2}\|g\|, \qquad \|M^{-1} g\| \le \lambda^{-1}\|g\|. \qquad (33)$$
Putting Eq. 33 into Eq. 31 and Eq. 32:
$$\|M^{-1/2} g\|\cdot\|M^{-1/2} g^{\perp}\| \le \lambda^{-1}\|g\|\cdot\|g^{\perp}\|, \qquad (34)$$
$$\lambda\cdot\|M^{-1} g\|\cdot\|M^{-1} g^{\perp}\| \le \lambda\cdot\lambda^{-1}\cdot\lambda^{-1}\|g\|\cdot\|g^{\perp}\| = \lambda^{-1}\|g\|\cdot\|g^{\perp}\|. \qquad (35)$$
Putting these back into Eq. 30, we get:
$$|g^{\top} Q g^{\perp}| \le \lambda^{-1}\|g\|\cdot\|g^{\perp}\| + \lambda^{-1}\|g\|\cdot\|g^{\perp}\| = \frac{2}{\lambda}\|g\|\cdot\|g^{\perp}\|. \qquad (36)$$
To express this in terms of the $F^{-1/2}$-norm, recall Lemma 2:
$$\|g\| \le \mu_{\max}^{1/2}\|F^{-1/2} g\|, \qquad \|g^{\perp}\| \le \mu_{\max}^{1/2}\|F^{-1/2} g^{\perp}\|.$$
Therefore:
$$|g^{\top} Q g^{\perp}| \le \frac{2\mu_{\max}}{\lambda}\,\|F^{-1/2} g\|\cdot\|F^{-1/2} g^{\perp}\|. \qquad (37)$$
Bounding $\beta^{2}\,(g^{\perp})^{\top} Q g^{\perp}$. From Eq. 26:
$$Q = (F + \lambda I)^{-1} F (F + \lambda I)^{-1}. \qquad (38)$$
In the eigenbasis of $F$ (symmetric and positive semi-definite), where $F = \mathrm{diag}(\Lambda_1, \dots, \Lambda_n)$ and $\Lambda_i$ is the $i$-th eigenvalue, we can express $Q$ as a diagonal matrix with entries:
$$Q_i = \frac{\Lambda_i}{(\Lambda_i + \lambda)^{2}}. \qquad (39)$$
Let $g^{\perp} = \sum_i h_i u_i$, where $\{u_i\}$ are the eigenvectors of $F$ and $h_i = u_i^{\top} g^{\perp}$. Then:
$$(g^{\perp})^{\top} Q g^{\perp} = \sum_i \frac{\Lambda_i}{(\Lambda_i + \lambda)^{2}}\, h_i^{2}. \qquad (40)$$
The spectral factor achieves its maximum at $\Lambda_i = \lambda$, therefore:
$$(g^{\perp})^{\top} Q g^{\perp} \le \frac{1}{4\lambda}\sum_i h_i^{2} = \frac{1}{4\lambda}\|g^{\perp}\|^{2}. \qquad (41)$$
Recalling Lemma 2, we have:
$$\|g^{\perp}\| \le \mu_{\max}^{1/2}\|F^{-1/2} g^{\perp}\|.$$
Substituting into the bound:
$$\beta^{2}\,(g^{\perp})^{\top} Q g^{\perp} \le \frac{\beta^{2}}{4\lambda}\,\mu_{\max}\,\|F^{-1/2} g^{\perp}\|^{2}. \qquad (42)$$
Finally, Eq. 27 becomes:
$$\|F^{1/2}\Delta_{\mathrm{FOP}}\|^{2} \le \eta^{2}\Bigg[\underbrace{g^{\top} Q g}_{\|F^{1/2}\Delta_{\mathrm{KFAC}}\|^{2}} + 2\beta\,\frac{\mu_{\max}}{\lambda}\,\|F^{-1/2} g\|\cdot\|F^{-1/2} g^{\perp}\| + \beta^{2}\,\frac{\mu_{\max}}{4\lambda}\,\|F^{-1/2} g^{\perp}\|^{2}\Bigg] \qquad (43)$$
The KL-norm of an FOP update eventually decomposes into three terms: the KFAC core term, which shrinks quadratically as $\lambda^{-2}$, and two FOP-specific terms, a cross term and an orthogonal term, which shrink linearly as $\lambda^{-1}$. Crucially, the FOP-specific terms are also scaled by $\|F^{-1/2} g^{\perp}\|$, which is typically smaller than $\|F^{-1/2} g\|$.
The analysis deliberately combines an exact expression for the core term with a worst-case bound for the orthogonal term to ensure safety even when the spectrum of $g^{\perp}$ is unknown. In the large-damping regime that KFAC relies on for stability at big batch sizes, this bound guarantees that the FOP-specific terms remain at least one order of $\lambda$ smaller than the core term.
This allows FOP to reduce $\lambda$, preserving more curvature information while still respecting the KL trust-region. As a result, FOP achieves faster convergence in large-batch training without compromising stability.
C. Experiments
C.1. Setup of CIFAR-10
The training of ResNet-18 (He et al. 2016) on the CIFAR-
10 (Krizhevsky, Hinton et al. 2009) dataset serves as a
fundamental experiment in the field of image classifica-
tion. In this subsection, we compare the proposed FOP op-
timizer with several established baselines, including SGD
with momentum (referred to as SGD (Robbins and Monro
1951)), AdamW (Loshchilov and Hutter 2017), and KFAC.
We follow standard experimental settings and apply com-
monly used data augmentation techniques consisting of ran-
dom cropping and horizontal flipping (DeVries and Tay-
lor 2017). All models are trained for 100 epochs. For SGD, KFAC, and FOP, the initial learning rate $\alpha_0$ for a batch size of 1024 is selected via grid search over the set $\alpha \in \{10^{-2}, 5\times10^{-2}, 10^{-1}, 5\times10^{-1}, 1\}$. For AdamW, the grid is set to $\alpha \in \{10^{-4}, 5\times10^{-4}, 10^{-3}, 5\times10^{-3}, 10^{-2}\}$. The update intervals for the curvature matrix and its inverse in both KFAC and FOP are synchronised, and we evaluate base intervals of $\{10, 100, 200\}$ steps. Each experiment is repeated with 5 different random seeds ($\{0, 1, 2, 3, 4\}$) to account for variability. We use ReduceLROnPlateau as the learning rate scheduler, configured with a patience of 5 epochs, a decay factor of 0.1, a relative threshold of 0.0001, and threshold_mode set to rel.
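For reference, the scheduler configuration described above corresponds to the following PyTorch call. The model and base optimizer below are placeholders only, and whether the scheduler monitors validation loss or accuracy (the mode argument) is an assumption not stated in the text.

```python
import torch

model = torch.nn.Linear(10, 10)                           # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # placeholder optimizer

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5,
    threshold=1e-4, threshold_mode="rel",
)

# Called once per epoch with the monitored metric, e.g.:
# scheduler.step(val_loss)
```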
C.2. Setup of ImageNet-100 experiment
The implementation of T2T-ViT (Yuan et al. 2021) follows the original paper. We use the linear warmup strategy (Goyal et al. 2017) in the first 5 epochs for AdamW, KFAC, and FOP. The initial learning rate $\alpha_0$ is selected via grid search over $\alpha \in \{10^{-5}, 10^{-4}, \dots, 10^{-1}\}$ for a batch size of 512. The update intervals for the curvature matrix and its inverse for KFAC and FOP are set to $\{500, 1000\}$ for a batch size of 512. All models are trained for 100 epochs with three random seeds ($\{0, 1, 2\}$). AdamW, KFAC, and FOP use a cosine learning rate schedule given by
$$\alpha_t = 0.001 + 0.5 \times (\alpha_0 - 0.001) \times \left[1 + \cos\!\left(\frac{2 \times 0.47 \times \pi \times t}{\mathrm{max\_epoch}}\right)\right]. \qquad (44)$$
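Eq. 44 transcribes directly into the following helper (applied after the 5-epoch linear warm-up); the function name and the example base learning rate are ours, chosen only for illustration.

```python
import math

def cosine_lr(alpha0, t, max_epoch):
    """Learning rate at epoch t according to Eq. (44)."""
    return 0.001 + 0.5 * (alpha0 - 0.001) * (
        1.0 + math.cos(2 * 0.47 * math.pi * t / max_epoch)
    )

# Example: base learning rate 5e-3 over 100 epochs.
schedule = [cosine_lr(5e-3, t, 100) for t in range(100)]
```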
C.3. Setup of ImageNet-1K experiment
We implement ResNet50 following the official PyTorch example (Team 2017), based on the architecture proposed by He et al. (2016). For all optimizers (SGD, Adam, KFAC, and FOP), we apply a linear warm-up strategy (Goyal et al. 2017) during the first 5 epochs. For the KFAC and FOP optimizers, the update intervals for the curvature matrix and its inverse are set to $\{800, 1600, 2400\}$ when using a batch size of 1024. The total number of training epochs differs by optimizer: we train SGD and Adam for 80 epochs, while KFAC and FOP are trained for 50 epochs. SGD uses the cosine annealing learning rate schedule described in Eq. 44. In contrast, KFAC and FOP adopt an exponential decay learning rate schedule given by:
$$\alpha_t = \alpha_0 \times \left(1 - \frac{t}{\mathrm{max\_epoch}}\right)^{E},$$
where $t$ is the current epoch and $E$ is the decay exponent, with $E \in \{4, 5, 6\}$. The initial learning rate $\alpha_0$ is selected via grid search over $\alpha \in \{10^{-5}, 10^{-4}, \dots, 10^{-1}\}$ for a batch size of 1024.
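The decay schedule above can likewise be written as a one-line helper; E is the decay exponent swept over {4, 5, 6}, and the function name is ours.

```python
def exp_decay_lr(alpha0, t, max_epoch, E=5):
    """Learning rate at epoch t under the decay schedule used for KFAC and FOP."""
    return alpha0 * (1.0 - t / max_epoch) ** E
```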
C.4. Additional experimental details and results
C.4.1 Damping ratio vs. accuracy. Figure 6 presents results on CIFAR-10 with ResNet-18. Across all batch sizes (2048–50000), the proposed FOP optimiser consistently surpasses the 91% test-accuracy threshold (dotted line) when the damping ratio $\lambda \in [10^{-5}, 10^{-3}]$. In contrast, KFAC fails to meet the cutoff whenever the damping is too aggressive ($\lambda \ge 10^{-2}$) and becomes unstable at $\lambda = 10^{-5}$. While both methods perform best around $\lambda = 10^{-3}$, FOP's flatter curve indicates a much larger tolerance to hyper-parameter mis-specification.
This pattern extends to transformer-based models as well. As shown in Figure 7, for T2T-ViT trained on ImageNet-100, FOP reliably exceeds the 80.9% bar for all batch sizes and for all damping values except the extreme low end. In contrast, KFAC only reaches the target within the narrow range $\lambda \in [10^{-4}, 10^{-3}]$. Notably, FOP's accuracy varies by less than one percentage point across the entire sweep from $\lambda = 10^{-5}$ to $10^{-1}$ at smaller batch sizes, highlighting its robustness even on architectures that include attention layers. A similar trend appears on the large-scale ImageNet-1K task with ResNet-50, as illustrated in Figure 8. Here, FOP again outperforms or matches KFAC across the board and remains above the 75.9% accuracy threshold throughout the full sweep and across all batch sizes (2048–8192). KFAC's performance, however, drops sharply for $\lambda \ge 10^{-2}$ and exhibits visible instability at the smallest damping, echoing the behaviour observed on CIFAR-10. Collectively, Figures 6–8 indicate that FOP provides a substantially broader and more stable operating range for the damping ratio compared to KFAC, across both convolutional and transformer-based architectures, and over a wide range of data scales. This increased robustness reduces the sensitivity to hyper-parameter selection and supports the viability of FOP as a reliable second-order optimisation method for large-batch training in diverse tasks.
Figure 5: Ablation of the scaling parameter η on CIFAR-10. Bars are grouped by optimiser, KFAC (left cluster) and FOP (right cluster), and coloured by the value of the scaling term in {−0.1, −0.01, 0, 0.01, 0.1}. Numbers above each bar give the top-3 test accuracy averaged over three random seeds.

Figure 6: Best test accuracy vs. damping ratio for ResNet-18 on CIFAR-10, grouped by batch size. The dotted line represents the threshold of 91%.
C.4.2 GPU memory. Tables 5–7 report the peak GPU memory usage for CIFAR-10/ResNet-18, ImageNet-100/T2T-ViT, and ImageNet-1K/ResNet-50, respectively, along with the number of devices used in each run. On CIFAR-10 (Table 5), both SGD and AdamW exhibit the lowest memory footprint, with usage scaling nearly linearly with batch size. KFAC incurs a moderate overhead, while FOP demands the highest memory, up to 145 GB for a batch of 16384 on a single GPU, since it needs to keep the two gradients g1 and g2 on the same device; this overhead vanishes when more than one device is used. For ImageNet-100 with T2T-ViT (Table 6), all three methods operate near full device capacity (≈150–170 GB) at the smallest batch size, with FOP consistently exceeding the others by 5–20 GB. On ImageNet-1K with ResNet-50 (Table 7), memory per GPU remains constant within each optimiser regardless of batch size, since larger batches are distributed across more devices. Here, SGD requires ≈88 GB per GPU, KFAC ≈98 GB, and FOP again matches KFAC closely except for the smallest batch, where it peaks at 153 GB.

Figure 7: Best test accuracy vs. damping ratio for T2T-ViT on ImageNet-100, grouped by batch size. The dotted line represents the threshold of 80.9%.

Figure 8: Best test accuracy vs. damping ratio for ResNet-50 on ImageNet-1K, grouped by batch size. The dotted line represents the threshold of 75.9%.
Optimizer # of GPU Max Mem/GPU (GB)
Batch Size: 2048
SGD 1 9.25
AdamW 1 9.30
KFAC 1 14.51
FOP 1 19.21
Batch Size: 4096
SGD 1 18.40
AdamW 1 18.44
KFAC 1 28.03
FOP 1 37.20
Batch Size: 8192
SGD 1 36.70
AdamW 1 36.74
KFAC 1 55.08
FOP 1 73.18
Batch Size: 16384
SGD 1 73.29
AdamW 1 73.33
KFAC 1 109.18
FOP 1 145.16
Batch Size: 32768
SGD 2 73.29
AdamW 2 73.38
KFAC 2 108.31
FOP 2 108.44
Table 5: Max GPU memory usage and number of GPUs for
CIFAR-10 training with ResNet-18.
Optimizer # of GPU Max Memory/GPU (GB)
Batch Size: 512
AdamW 1 145.41
KFAC 1 153.28
FOP 1 172.51
Batch Size: 1024
AdamW 1 151.10
KFAC 1 158.85
FOP 1 163.18
Batch Size: 2048
AdamW 1 148.54
KFAC 1 156.86
FOP 1 159.20
Batch Size: 4096
AdamW 1 142.95
KFAC 1 151.20
FOP 1 153.66
Table 6: Max GPU memory usage across optimizers and
number of GPUs used per run for training ImageNet-100
with T2T-ViT.
Optimizer # of GPU Max Memory/GPU (GB)
Batch Size: 1024
SGD 1 87.89
KFAC 1 110.31
FOP 1 153.31
Batch Size: 2048
SGD 2 87.99
KFAC 2 98.04
FOP 2 98.24
Batch Size: 4096
SGD 4 87.99
KFAC 4 98.04
FOP 4 98.24
Batch Size: 8192
SGD 8 87.99
KFAC 8 97.89
FOP 8 98.19
Table 7: Max GPU memory usage and number of GPUs used
per run for training ImageNet-1K with ResNet-50.
C.4.3 Scaling efficiency.

Figure 9: Strong-scaling results for FOP on ImageNet-1K. Top: wall-clock training time versus number of GPUs for ImageNet-100 (left) and ImageNet-1K (right). Bottom: total GPU-time (#GPUs × wall-clock) versus number of GPUs for the same two datasets. Dashed lines indicate ideal linear scaling.
D. Ablation Study of FOP
D.1. Effect of β
An ablation study on CIFAR-10, shown in Figure 5, evaluates the effect of varying the momentum parameter β in the FOP optimiser. Using ResNet-18 trained for 100 epochs (averaged over three seeds), the study compares different configurations of FOP against KFAC, which fixes (η, β) = (1, 0). FOP decouples the learning rate scale η from the momentum factor β, allowing both to be fixed or adaptively updated. When η is not adaptive (set to 1), introducing β consistently improves accuracy over KFAC. Further gains are observed when η is adaptive, highlighting the benefit of dynamic learning rate scaling. Overall, the results demonstrate that FOP's flexibility in tuning or adapting (η, β) leads to improved accuracy and robustness compared to fixed-setting baselines.
E. Comparison between Nvidia and AMD
cards
E.1. System Configuration
AMD System (Training Runs)
• Node Type: Lenovo ThinkSystem SR685a V3
• CPUs: 2× AMD EPYC 9534 (64 cores each, 128 total cores, no SMT)
• Max Frequency: 3.72 GHz, Frequency Boost enabled
• Instruction Support: AVX-512, AVX512-BF16, AVX512-VNNI, AVX512-VBMI
• Cache: L3 cache total 512 MiB (16×32 MiB)
• GPUs: 8× AMD MI300X
• Software Stack: PyTorch 2.6.0 + ROCm 6.2.4
• ROCm Driver Version: 6.2.1

NVIDIA System (Benchmarking)
• Node Type: Custom system
• CPUs: 2× Intel Xeon Platinum 8468 (48 cores each, 96 total cores)
• Max Frequency: 3.8 GHz
• Instruction Support: AVX-512, AVX-VNNI, AMX (Tile, INT8, BF16)
• GPUs: 8× NVIDIA H100 (80 GB HBM3)
• Software Stack: PyTorch 2.8.0.dev20250413+cu126, CUDA 12.9
• Driver Version: 575.57.08
E.2. Benchmarking Results on ImageNet-1K with ResNet-50
To evaluate both the scalability and efficiency of different optimization methods across hardware platforms, we present four subplots in Figure 10. Subplot (a) shows the total GPU time per epoch, which accounts for both the per-step time and the number of GPUs used, reflecting the total computational cost. Subplot (b) illustrates the raw wall-clock time per epoch when training with 8 GPUs.
In subplot (c), we measure the strong-scaling parallel efficiency, defined as
$$\mathrm{Efficiency}_{\#\mathrm{GPUs}} = \frac{T_1}{n \cdot T_n} \times 100\%,$$
where $T_1$ is the training time using the smallest GPU count (typically one GPU), $T_n$ is the time when using $n$ GPUs, and $n$ is the number of GPUs. This metric indicates how effectively the training process scales with added hardware resources.
Subplot (d) evaluates the efficiency with respect to batch size at fixed GPU count (8 GPUs), defined as
$$\mathrm{Efficiency}_{\mathrm{batch}} = \frac{T_{b_0} \cdot b_0}{T_b \cdot b} \times 100\%,$$
where $T_{b_0}$ is the epoch time at a reference batch size $b_0$, and $T_b$ is the epoch time for a new batch size $b$. This quantifies how well larger batch sizes maintain per-sample throughput on the same hardware.
Together, these plots provide insights into both inter-GPU
scaling behavior and intra-GPU efficiency across diverse
training methods and compute platforms.
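The two efficiency metrics can be computed as below; the numbers in the example call are made up solely to illustrate the units (seconds per epoch), not taken from the benchmark.

```python
def strong_scaling_efficiency(t1, tn, n):
    """Parallel efficiency: T1 / (n * Tn) * 100%."""
    return 100.0 * t1 / (n * tn)

def batch_size_efficiency(t_b0, b0, t_b, b):
    """Per-sample throughput retained when growing the batch on fixed hardware."""
    return 100.0 * (t_b0 * b0) / (t_b * b)

# Hypothetical example: 800 s/epoch on 1 GPU vs. 110 s/epoch on 8 GPUs.
print(strong_scaling_efficiency(800.0, 110.0, 8))   # ~90.9 %
```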
Figure 10: Training time and parallel efficiency comparisons across different methods (SGD, KFAC, FOP) and hardware (NVIDIA H100 and AMD MI300X). (a) Total GPU time per epoch scaled by the number of GPUs used, representing overall computational effort. (b) Raw wall-clock time per epoch using 8 GPUs. (c) Parallel efficiency vs. number of GPUs, calculated as $\mathrm{Efficiency} = \frac{T_1}{n\,T_n}$, where $T_1$ is the time with the smallest GPU count and $T_n$ the time using $n$ GPUs. (d) Efficiency vs. global batch size on 8 GPUs, defined as $\mathrm{Efficiency} = \frac{T_{b_0} \cdot b_0}{T_b \cdot b}$, measuring how well increased batch sizes are utilized on fixed hardware.