When, Why and How Much?
Adaptive Learning Rate Scheduling by Refinement
Aaron Defazio ADEFAZIO@META.COM
FAIR, Meta
Ashok Cutkosky ASHOK@CUTKOSKY.COM
Boston University
Harsh Mehta HARSHM@GOOGLE.COM
Google Research
Konstantin Mishchenko KONSTA.MISH@GMAIL.COM
Samsung AI Center
Abstract
Learning rate schedules used in practice bear little resemblance to those recommended by theory. We close
much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate
schedules. Our key technical contribution is a refined analysis of learning rate schedules for a wide class of
optimization algorithms (including SGD). In contrast to most prior works that study the convergence of the
average iterate, we study the last iterate, which is what most people use in practice. When considering only
worst-case analysis, our theory predicts that the best choice is the "linear decay" schedule: a popular choice in
practice that sets the stepsize proportionally to 1 − t/T, where t is the current iteration and T is the total number
of steps. To go beyond this worst-case analysis, we use the observed gradient norms to derive schedules
refined for any particular task. These refined schedules exhibit learning rate warm-up and rapid learning rate
annealing near the end of training. Ours is the first systematic approach to automatically yield both of these
properties. We perform the most comprehensive evaluation of learning rate schedules to date, evaluating
across 10 diverse deep learning problems, a series of LLMs, and a suite of logistic regression problems. We
validate that overall, the linear-decay schedule matches or outperforms all commonly used default schedules
including cosine annealing, and that our schedule refinement method gives further improvements.
1 Introduction
For minimizing a function f, Stochastic Gradient Descent (SGD) updates the iterate x_t at step t via:

$$x_{t+1} = x_t - \eta_t g_t,$$

where g_t is a (possibly stochastic) sub-gradient at x_t, and η_t is the learning rate (LR) at time t. Choosing a sequence of η_t for steps t = 1, ..., T is a core problem in optimization.
The learning rate sequence for an optimizer is typically decomposed into two parts: the baseline learning rate, indicating the maximum LR to use, and a schedule, a sequence that multiplies the baseline LR to give the LR sequence. In this work we focus exclusively on the problem of scheduling. Choosing the right learning rate schedule for best performance is difficult; standard practice is to perform a hyper-parameter sweep over a set of standardized schedules (Wu et al., 2020).
Setting η_t from theory is difficult due to a multitude of potential problem assumptions and the wildly varying schedules that arise from these assumptions. For instance, η_t ∝ 1/√t, η_t ∝ 1/t, and constant schedules η_t = η are all dictated by three common but different sets of assumptions. Unfortunately, all three work suboptimally in practice for deep learning (Section 3.1), and are unpopular in the community (Ge et al., 2019a). In this work we focus on two causes for this theory-practice gap:
1. Most theory analyzes the average iterate (Polyak, 1990; Ruppert, 1988) $\hat{x}_T = \frac{1}{T}\sum_{t=1}^{T} x_t$ or a randomly sampled iterate. However, in practice the last iterate x_T is used.
Algorithm 1: Schedule Refinement for SGD
1: Input: sequence G of gradient norms ∥g_t∥ of length T, smoothing hyper-parameter τ > 0
2: Ĝ = median_filter(G, filter_width = τT, padding = (nearest, reflect))
3: Define w_t = Ĝ_t^{-2}
4: for t = 1 to T do
5:     η_t = w_t · Σ_{p=t+1}^{T} w_p
6: end for
7: Return normalized schedule η / max(η)
2. Existing theory for the last iterate often uses crude constant bounds on the gradient norms or curvature. Our new tighter bound involves the entire gradient norm sequence instead, allowing for problem-adaptive LR schedules.
Our method is a "refinement" method: it uses a prior training run to produce an improved schedule to use in future runs. The practical variant of our schedule refinement method for SGD is given in Algorithm 1. Given a sequence of gradient norms produced by a prior run, it outputs a new schedule that is adaptive to the structure of the problem. Mathematically, it is minimizing a novel bound we derive on the function value f(x_T) of the final iterate (Section 2.1). The learning rate at each time-step involves a sum of inverse-squared gradient norms from future time-steps, a major departure from previous approaches to scheduling.
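For concreteness, here is a minimal NumPy sketch of Algorithm 1 (our own illustration, not the authors' released code; the function name, the epsilon guard, and the exact handling of the one-sided nearest/reflect padding are our assumptions):

    import numpy as np

    def refine_schedule(grad_norms, tau=0.1):
        # Sketch of Algorithm 1. grad_norms holds the logged ||g_t|| values from a prior run.
        G = np.asarray(grad_norms, dtype=float)
        T = len(G)
        width = max(1, int(round(tau * T)))
        half = width // 2

        # Median filter with 'nearest' padding on the left and 'reflect' padding on the right.
        padded = np.pad(G, (half, 0), mode='edge')
        padded = np.pad(padded, (0, half), mode='reflect')
        G_hat = np.array([np.median(padded[i:i + width]) for i in range(T)])
        G_hat = np.maximum(G_hat, 1e-12)   # guard against zero gradient norms (see Section 3.5)

        # Line 3: w_t = Ghat_t^{-2}; line 5: eta_t = w_t * sum_{p=t+1}^{T} w_p.
        w = G_hat ** -2.0
        tail = np.concatenate([np.cumsum(w[::-1])[::-1][1:], [0.0]])
        eta = w * tail

        # Line 7: return the normalized schedule, to be multiplied by the baseline learning rate.
        return eta / eta.max()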
This approach is partially motivated by the remarkable work of Pan et al. (2022), who show that for the last iterate on quadratic problems, minimizing an upper bound can yield improved schedules. However, their work requires knowledge of the full eigenspectrum of the Hessian, making it impractical to use. Our theory relies on knowledge of the sequence of expected gradient norms, a more tractable quantity.
Our analytical approach also departs from previous approaches by generalizing beyond SGD. Prior analyses of learning rate schedules often rely on the particular form of the SGD iterates to drive the calculations (Jain et al., 2019; Zamani and Glineur, 2023). Our approach is instead a broad technique that provides learning rate schedules for any base optimization algorithm. On a technical level, we design a step-size schedule that converts any algorithm which obtains a vanishing regret into one that ensures a last-iterate guarantee. This means that our refinement technique can provide theoretically-motivated schedules for popular base optimization algorithms like Adam.
Overall, we make three main contributions:
1. Theory: We provide a general analysis of learning rate schedules for arbitrary optimization algorithms. Our approach recovers the optimal convergence rates for SGD, and can be used to produce refined schedules customized for a particular task.
2. Practice: Among non-adaptive schedules, we show that warm-up followed by linear decay almost always matches or outperforms other schedules, including cosine decay. Our refined schedules typically provide further improvements, suggesting that our theory provides actionable guidance even for training non-convex neural networks.
3. Theory meets Practice: Our refined schedules exhibit both warmup and nearly-linear decay (Figure 1). While Zamani and Glineur (2023) have very recently shown that linear decay has favorable last-iterate guarantees, to our knowledge this is the first time that warmup has arisen directly from theory rather than as an empirical heuristic (Goyal et al., 2017). That is, given hindsight knowledge of all gradient statistics, our theoretically optimized schedules usually include a warmup phase during which the learning rate increases.
Figure 1: Example gradient norm sequences (top row) and the resulting refined schedules given by Algorithm 1 (bottom row). Black dashed line at y = 1 shown for reference. Percentage of runtime on the x-axis.
1.1 Notation
f : ℝ^d → ℝ is a convex objective. x_1, ..., x_T and z_1, ..., z_T are random vectors in ℝ^d with x_1 = z_1, and Δ_t ≜ z_{t+1} − z_t. Δ_t will indicate "baseline" updates before applying a schedule, and x_t will be iterates after applying the schedule. g_1, ..., g_T are random vectors in ℝ^d satisfying E[g_t | x_1, ..., x_t] ∈ ∂f(x_t) (E[g_t] = ∇f(x_t) when f is differentiable). G² is a bound on E[max_t ∥g_t∥²]. w_1, ..., w_T indicate non-negative random variables in ℝ such that g_t and w_t are independent given x_1, ..., x_t. We define w_{a:b} ≜ Σ_{t=a}^{b} w_t. If a > b, then w_{a:b} ≜ 0. u denotes an arbitrary element of ℝ^d; typically one should consider the case u ∈ arg min f. D ≜ ∥x_1 − u∥ is the "distance-to-solution" term, and f_⋆ ≜ inf_u f(u).
2 Main Analytical Result
Our general result is Theorem 1. This allows us to convert any sequence z_1, ..., z_T with bounded regret Σ_{t=1}^{T} ⟨g_t, z_t − u⟩ into a sequence of iterates x_1, ..., x_T with a bound on f(x_T) − f(u). Thus, Theorem 1 can be viewed as another addition to the family of reductions from stochastic optimization to regret bounds (Cesa-Bianchi et al., 2004; Cutkosky, 2019). The proof can be found in Appendix B.
Theorem 1. Suppose z_1, ..., z_T is some arbitrary sequence of vectors. Let w_1, ..., w_T be an arbitrary sequence of non-negative numbers. Recall that we define Δ_t = z_{t+1} − z_t and x_1 = z_1. For t ≥ 1, suppose x_{t+1} satisfies:

$$x_{t+1} = x_t + \frac{w_{t+1:T}}{w_{1:T}}\,\Delta_t,$$

then for any u:

$$\mathbb{E}[f(x_T) - f(u)] \le \mathbb{E}\left[\sum_{t=1}^{T} \frac{1}{w_{1:T}}\langle w_t \cdot g_t,\ z_t - u\rangle\right].$$
Let us take a moment to consider the implications of this Theorem in the simplest setting of w_t = 1 for all t. In this case, it is well-known that by setting Δ_t = −ηg_t for η = D/(G√T), one guarantees Σ_{t=1}^{T} ⟨g_t, z_t − u⟩ ≤ DG√T. Thus, we immediately obtain the following important corollary:
Corollary 2. Set x_{t+1} = x_t − η_t g_t with

$$\eta_t = \frac{D}{G\sqrt{T}}\left(1 - \frac{t}{T}\right).$$

Then:

$$\mathbb{E}[f(x_T) - f(u)] \le \frac{DG}{\sqrt{T}}.$$
Proof. With w_t = 1 for all t and Δ_t = −(D/(G√T)) g_t, classical online gradient descent analysis (Zinkevich, 2003) yields Σ_{t=1}^{T} ⟨w_t g_t, z_t − u⟩ ≤ DG√T. The result now follows from Theorem 1.
The sequence x_t suggested by Corollary 2 is simply stochastic gradient descent (x_{t+1} = x_t − η_t g_t) equipped with a linear decay learning rate schedule:

$$\eta_t = \frac{D}{G\sqrt{T}}\left(1 - \frac{t}{T}\right). \qquad (1)$$

Linear decay emulates the effects of iterate averaging, as the contribution from each gradient to the returned point is approximately the same as it would be in an average: the gradient g_{T/2} appears in half the points in the average, and so its weight is halved; the gradient g_t appears in T − t out of T points, and so its weight is 1 − t/T.
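To make Equation (1) concrete, here is a minimal sketch of SGD with this schedule (our own illustration; grad_fn and the estimates of D and G are assumed inputs):

    import numpy as np

    def sgd_linear_decay(grad_fn, x0, D, G, T):
        # SGD with the linear decay schedule of Corollary 2: eta_t = D/(G*sqrt(T)) * (1 - t/T).
        x = np.asarray(x0, dtype=float)
        base = D / (G * np.sqrt(T))
        for t in range(1, T + 1):
            g = grad_fn(x)                # (possibly stochastic) sub-gradient at the current iterate
            eta_t = base * (1.0 - t / T)  # decays linearly to zero at t = T
            x = x - eta_t * g
        return x                          # the last iterate, which Corollary 2 bounds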
This bound has a significantly better constant than previous schedules for last-iterate convergence of SGD in this setting (Jain et al., 2019), and matches recent work by Zamani and Glineur (2023), who were the first to show that this schedule is actually optimal for gradient descent on the convex G-Lipschitz complexity class. Our regret analysis recovers their result when specialized to SGD.
In practice, the linear decay schedule is employed not only with SGD but also with a diverse range of other optimization algorithms. Our Theorem 1 suggests a theoretical basis for this approach: so long as the underlying optimization method ensures a regret bound, the linear decay schedule will provide a last-iterate guarantee. Note that we do not require the regret bound to be analytically proven (as is the case for e.g. SGD (Zinkevich, 2003), AdaGrad (Duchi et al., 2011; Streeter and McMahan, 2010) or AMSGrad (Reddi et al., 2018)); it suffices for the regret bound to hold in practice (as may hold for Adam (Kingma and Ba, 2015) or AdamW (Loshchilov and Hutter, 2017a)).
The key initial step in the proof of Theorem 1 is an identity we call the “all-tail summation bound”. This
is a refinement of a bound from Orabona (2020) who attributes the approach originally to Lin et al. (2016).
Using the notation q_t = f(x_t) − f_∗, Lin et al. (2016) use the bound:

$$w_T q_T \le \frac{1}{T}\sum_{t=1}^{T} w_t q_t + \sum_{k=1}^{T-1}\frac{1}{k(k+1)}\sum_{t=k+1}^{T} w_t (q_t - q_k),$$

whereas our refined approach uses an equality:

$$q_T = \frac{1}{w_{1:T}}\sum_{t=1}^{T} w_t q_t + \sum_{k=1}^{T-1}\frac{\eta_k}{\sum_{t=k+1}^{T}\eta_t}\left(\frac{1}{\sum_{t=k}^{T} w_t}\sum_{t=k}^{T} w_t (q_t - q_k)\right).$$
This identity is derived in Appendix A. All our results arise from careful manipulation of this identity.
2.1 Optimizing the bound for data-dependent schedules
We have now seen that setting w_t = 1 for all t recovers the linear decay schedule, and can obtain the worst-case optimal convergence rates. However, optimizing for the worst case usually yields overly pessimistic behavior. In this section, we build more adaptive schedules that obtain better results on real data. To do this, we simply choose w_t so as to optimize the bound in Theorem 1.
We focus on the particular case of SGD by setting x_{t+1} = x_t − η_t g_t with η_t = w_t w_{t+1:T} / w_{1:T}. In the notation of Theorem 1, this corresponds to Δ_t = −w_t g_t. For this case, we have the following result, with proof in Appendix C.
Theorem 3. Suppose that x_{t+1} = x_t − η_t g_t with η_t = w_t w_{t+1:T} / w_{1:T}. Then we have:

$$\mathbb{E}[f(x_T) - f(u)] \le \mathbb{E}\left[\frac{1}{2\,w_{1:T}}\left(D^2 + \sum_{t=1}^{T} w_t^2\|g_t\|^2\right)\right]. \qquad (2)$$

Moreover, for a fixed sequence ∥g_1∥², ..., ∥g_T∥², the value of $\frac{1}{2\,w_{1:T}}\left(D^2 + \sum_{t=1}^{T} w_t^2\|g_t\|^2\right)$ is minimized by setting:

$$w_t = \frac{\|g_t\|^{-2}\,D}{\sqrt{\sum_{p=1}^{T}\|g_p\|^{-2}}}.$$
Theorem 3 suggests that if we knew the sequence of gradient norms ahead of time, then we could optimize the weights w_t (and therefore the learning rate schedule η_t) by setting w_t ∝ ∥g_t∥^{-2}. This yields a simple practical approach for refining a learning rate schedule based on empirical observations. First, perform one run using a baseline schedule to observe the sequence of gradient norms. Then, use these norms to compute an optimal schedule via Theorem 3. The constant factor D = ∥x_1 − u∥ appearing in the value for w_t plays the role of the "scale factor" typically applied to learning rate schedules. A line of recent work has shown that this quantity can be efficiently estimated online without significant downsides (Mcmahan and Streeter, 2012; Orabona and Pál, 2016; Cutkosky and Orabona, 2018; Mhammedi and Koolen, 2020; Zhang et al., 2022; Carmon and Hinder, 2022; Ivgi et al., 2023; Khaled et al., 2023; Cutkosky et al., 2023).
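As a sketch of this recipe (our own illustration; the helper name and the assumption that an estimate of D is available are ours):

    import numpy as np

    def theorem3_schedule(grad_norms, D=1.0):
        # Optimal weights from Theorem 3: w_t proportional to ||g_t||^{-2}, scaled by D.
        g2 = np.asarray(grad_norms, dtype=float) ** 2   # assumes nonzero gradient norms
        w = (1.0 / g2) * D / np.sqrt(np.sum(1.0 / g2))

        # Induced SGD step sizes: eta_t = w_t * w_{t+1:T} / w_{1:T}.
        suffix = np.concatenate([np.cumsum(w[::-1])[::-1][1:], [0.0]])  # w_{t+1:T}
        eta = w * suffix / np.sum(w)
        return w, eta

Up to overall scale (set here by D and absorbed into the baseline learning rate in practice), these step sizes are proportional to the unnormalized, unsmoothed schedule produced by Algorithm 1.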
There are some subtleties here. Allowing w_t to depend on the random variable g_t breaks the independence between w_t and g_t. Further, the act of changing the schedule from a baseline to a refined data-dependent schedule will change the gradient norm sequence, which may in turn indicate that a substantially different schedule would have been optimal. Our approach relies on the practical assumption that the gradient norm sequence does not change significantly after refinement. To encourage this, Algorithm 1 applies a median smoothing filter to the gradient norms before refining.
Finally, Theorem 3 provides an analysis that recovers schedules for SGD. However, practitioners commonly use optimization methods with more complicated per-coordinate updates or other preconditioning schemes such as Adam. In Appendix E we provide an alternative version of Theorem 3 that applies to such algorithms (Theorem 8). This result enables us to develop schedules tailored to any optimization algorithm. However, actually building this schedule in practice may require inspecting the algorithm's internal state, which may be difficult or inefficient. For per-coordinate algorithms like Adam, we suggest simply setting w_t ∝ 1/∥g_t∥_1 as an approximation (see Section 3).
Figure 2 gives the Refined schedules on a set of standard benchmark machine learning problems when initially trained using linear decay schedules to provide the gradient norm sequences. Full details of the models and training setup for each problem are provided in the Appendix. Gradient ℓ2 norms are shown for the SGD-trained problems (ImageNet, RCNN) and ℓ1 norms for the Adam-trained problems, which also use inverse-ℓ1-norm weights for refinement (see Section 3). Our single hyper-parameter τ = 0.1 was tuned so that the resulting schedules obtained a balance between smoothness and capturing structure. We recommend setting this parameter by eye rather than by grid search.
3 Experiments
To validate the effectiveness of our method on convex problems, we performed a comparison across 8 commonly used classification problems from the LIBSVM repository, with separable problems excluded (see Section 3.5). We used a logistic regression loss together with the Adam optimizer with β = (0.9, 0.95), with batch size 16, and trained for 100 epochs. The Refined schedules were calculated by using the linear decay schedule to generate the initial gradient norm sequence. Table 1 demonstrates that both the linear decay schedule and our refinement schedule consistently either match or out-perform the cosine schedule.
[Figure 2 panels: gradient norms, smoothed gradient norms, and the resulting refined schedule for ImageNet, IWSLT14, GPT, RoBERTa, DLRM, MRI, ViT, and RCNN.]
Figure 2: Gradient norm sequences and the resulting Refined schedules, generated using an initial linear decay schedule with warmup for the initial run. Log-scale views are inset for scale reference. Fitted polynomial decay schedules are overlaid in blue.
Table 1: Logistic Regression Experiments (Train Error Rate %).

Problem      Stepwise      Cosine        Linear        Refined       ∥g∥₂² Refined   ∥g∥₁ Refined
Aloi         13.38±0.05    12.69±0.05    12.76±0.06    11.75±0.01    12.24±0.04      12.30±0.05
Glass        31.39±0.16    30.82±0.22    30.72±0.18    30.05±0.44    29.62±0.37      30±0.41
Iris         1.39±0.000    1.39±0.000    1.39±0.000    1.46±0.07     1.46±0.07       1.46±0.07
Letter       22.24±0.008   22.24±0.01    22.20±0.02    22.23±0.03    22.20±0.03      22.20±0.03
Pendigits    4.70±0.02     4.67±0.01     4.62±0.03     4.56±0.02     4.58±0.02       4.56±0.04
Sensorless   11.84±0.09    11.30±0.09    11.29±0.09    10.71±0.08    10.08±0.05      10.11±0.05
Vehicle      18.83±0.09    18.49±0.05    18.55±0.06    18.21±0.12    18.19±0.08      18.21±0.1
Vowel        23.43±0.08    22.99±0.09    22.94±0.10    22.48±0.12    22.44±0.08      22.41±0.08
[Figure 3 panels: test accuracy (%) vs. epoch for inverse-sqrt schedules (left) and inverse-linear schedules (right).]
Figure 3: Training curves on CIFAR-10 for a sweep over inverse-sqrt and inverse-linear hyper-parameters.
A linear-decay schedule baseline is shown in orange. All combinations are out-performed by the
linear-decay schedule.
The linear decay schedule matches the cosine schedule on every problem (up to statistical significance), and out-performs it on two problems. The Refined schedule further out-performs the linear decay schedule, either matching or exceeding it across the board.
Rather than using the Adam-specific weighting for deriving the Refined schedules, it is often convenient to use other norms, particularly if existing logs are available using these other norms. In Table 1 we present results using both the ℓ1-norm and ℓ2-norm-squared weightings. Both are shown to maintain the advantages of the refinement technique, out-performing the non-refined schedules consistently. Given the ease of logging ℓ1 norms compared to the internal Adam state, we advocate for weights to be the inverse of the ℓ1 norm (no squaring) when using Adam.
3.1 Deep Learning experiments
Classical any-time learning rate schedules such as 1/√t and 1/t are commonly used in theoretical analysis of optimization methods, yet they are rarely used by practitioners. To illustrate why, in Figure 3 we sweep over the learning rate η and offset β for schedules of the form:

$$\eta\,\frac{\beta}{\beta + t} \quad \text{and} \quad \eta\,\sqrt{\frac{\beta}{\beta + t}}.$$
A 5% duration learning rate warmup was also included in each case. Even with 60 hyper-parameter combinations tried for each family, neither is able to match the performance of the linear-decay schedule. For our deep learning experiments we implemented two commonly used practical variants of these schedules, since sweeping the β hyper-parameter is computationally impractical. The vanilla version uses β = 1, and the offset version uses β set to match the duration of the warmup period (a common setting due to its implementation in FairSeq).
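For reference, a minimal sketch of the two swept schedule families (our own illustration, with hypothetical function names), including the vanilla (β = 1) and offset variants and the short warmup used in the sweep:

    import numpy as np

    def inverse_linear(t, eta, beta=1.0):
        # eta * beta / (beta + t); beta = 1 is the vanilla variant,
        # beta = warmup length gives the offset variant.
        return eta * beta / (beta + t)

    def inverse_sqrt(t, eta, beta=1.0):
        # eta * sqrt(beta / (beta + t))
        return eta * np.sqrt(beta / (beta + t))

    def with_warmup(schedule, t, T, eta, beta=1.0, warmup_frac=0.05):
        # Linear warmup over the first 5% of steps, then the chosen decay.
        warmup_steps = max(1, int(warmup_frac * T))
        if t < warmup_steps:
            return eta * (t + 1) / warmup_steps
        return schedule(t - warmup_steps, eta, beta)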
3.2 Large-scale schedule benchmark
We performed a large-scale comparison of schedules across common deep learning benchmarks:
CIFAR10: For a small-scale image-classification problem, we chose CIFAR10 (Krizhevsky, 2009). A high-performance Wide ResNet architecture was used (Zagoruyko and Komodakis, 2016). Test error percentage is reported.

CIFAR100: A more realistic yet still small-scale benchmark. We used a compact DenseNet architecture (Huang et al., 2017) to increase the diversity of model architectures tested. Test error percentage is reported.

ImageNet: We used a ResNet-50 architecture (He et al., 2016) for the ImageNet (Russakovsky et al., 2015) classification task. Note that we used a standard (non-SOTA) data-augmentation pipeline following other recent comparisons in the optimization literature. Test error percentage is reported.

IWSLT14: A small-scale German-English translation task (Cettolo et al., 2014) using an LSTM model (Wiseman and Rush, 2016). Test Perplexity is reported.

GPT: A modern small (162M parameter) GPT-style auto-regressive transformer model (Radford et al., 2019) trained on the large Book-Wiki corpus (Zhu et al., 2015). Test Perplexity is reported.

RoBERTa: A masked autoencoder variant (Liu et al., 2019) also trained on the Book-Wiki corpus. Test Perplexity is reported.

ViT: A high-performance ImageNet classifier using a Vision Transformer (Dosovitskiy et al., 2021). In contrast to our ResNet-50 experiments, here we use modern data-augmentation, following standard practice for ViT training. Test error percentage is reported.

DLRM: A modern recommendation system engine (Naumov et al., 2019) trained on the open Criteo Kaggle Display Advertising dataset. Test Accuracy is reported.

MRI: A large stacked U-Net architecture (Sriram et al., 2020) trained on the fastMRI dataset (Zbontar et al., 2018), an image-to-image regression task from the medical imaging domain. The test set metric 100·(1 − SSIM) is reported.

RCNN: The object detection method Faster-RCNN (Ren et al., 2015) trained on the COCO 2017 dataset, using a ResNet-50 ImageNet-pretrained backbone. 100 − box AP is reported.
For each problem, we first performed a sweep of learning rates on a grid [10^i, 2×10^i, 5×10^i] for varying i, separately for each schedule. We then ran multiple seeds using the best learning rate from the grid search. The mean and standard error of the mean were tabulated for each schedule. The best result for each method, up to a statistical significance level of 0.05 using a paired two-sample t-test, is highlighted in bold. Specific details of the hyper-parameters used for each problem are given in Appendix G. All non-adaptive schedules included a fixed learning-rate warmup, with length following standard practices for the problem. Our step-wise schedule divides the learning rate by 10 at 30, 60 and 90 percent of training, following standard practice from ImageNet training. As in the convex experiments, the Refined schedules were calculated by using the linear schedule to generate the initial gradient norm sequence.
Tables 2 & 3 show the results. We break the schedules into two categories, classical and modern. The modern schedules consistently outperform the classical schedules, often by large margins. Although this is common folklore in deep learning, we are not aware of any existing large-scale experiments establishing it. Our comparison of modern schedules shows a clear hierarchy among the schedules. The Stepwise schedule is dominated by the Cosine schedule, and the Linear schedule matches or outperforms the Cosine schedule on all problems except ViT. The Refined schedule further outperforms the Linear schedule on 5 of the 10 problems, but shows mixed results on ViT and RCNN. The refinement process produced degenerate schedules that fail on the CIFAR problems; we discuss this in Section 3.5.
Table 2: Classical Schedule Comparison Against the Linear Schedule (lower = better).

Problem     Flat         1/t          1/sqrt       Offset 1/t    Offset 1/sqrt  Linear
CIFAR10     8.04±.13     5.42±.28     6.37±.41     9.23±.08      6.58±.06       4.35±.05
CIFAR100    30.43±.20    26.58±.11    29.09±.16    32.80±.07     27.62±.13      22.11±.08
ImageNet    33.00±.14    26.48±.06    28.35±.05    47.94±.08     27.34±.06      23.11±.07
IWSLT14     8.07±.02     7.62±.01     7.52±.01     12.89±.06     8.48±.01       7.10±.01
GPT         20.20±.000   18.99±.04    19.48±.02    27.85±.05     22.88±.006     18.60±.02
RoBERTa     4.52±.005    4.25±.007    4.33±.01     5.33±.02      5.15±.02       3.94±.007
DLRM        20.95±.006   47.59±6.45   45.99±5.98   20.94±.007    20.99±.009     20.94±.006
MRI         9.00±.04     8.91±.01     8.98±.04     9.53±.08      9.16±.05       8.88±.02
ViT         30.11±.27    28.36±.40    28.53±.15    73.84±6.08    50.36±12.39    24.82±.31
RCNN        65.43±.12    63.38±.05    64.13±.10    79.32±.07     69.25±.07      60.98±.02
Table 3: Modern Schedule Comparison (lower = better).

Problem     Stepwise     Cosine       Linear       Refinement
CIFAR10     4.53±.03     4.27±.04     4.35±.05     -
CIFAR100    22.78±.10    22.59±.09    22.11±.08    -
ImageNet    23.51±.07    23.10±.06    23.11±.07    23.12±0.03
IWSLT14     7.43±.01     7.17±.009    7.10±.01     6.92±.03
GPT         19.70±.03    18.65±.02    18.60±.02    18.29±.005
RoBERTa     4.36±.01     4.07±.000    3.94±.007    3.86±.005
DLRM        20.95±.008   20.94±.005   20.94±.006   20.94±.009
MRI         8.97±.02     8.90±.03     8.88±.02     8.85±.01
ViT         26.27±.33    24.56±.15    24.82±.31    25.53±.16
RCNN        61.76±.06    61.00±.04    60.98±.02    61.88±.02
3.3 Cosine schedule ablations
To further investigate the failure modes of the cosine schedule, we ran a series of shorter-duration training runs on CIFAR10. Our premise was that the cosine schedule is heavily over-fit to long-duration computer vision problems. As shown in Table 4, the cosine schedule begins to under-perform the linear decay schedule and the refined schedules when training for less than 30 epochs. In contrast, while the refined schedule also under-performs for longer training durations, it shows no statistically significant difference from the linear decay schedule for shorter durations, where both perform consistently better than the cosine schedule.
3.4 Large Language Model size ablations
In addition to the results above, we validate our insights on Language Models by performing an additional set of experiments directly comparing to the setting explored in Chinchilla (Hoffmann et al., 2022), where we train a vanilla Transformer model on the C4 dataset (Raffel et al., 2020) for 10k steps with a token batch size of around 524k (2^19). Starting with Hoffmann et al. (2022), most recent LLMs employ the AdamW optimizer with Linear Warmup and Cosine Decay. We perform a head-to-head comparison of Cosine decay, Linear decay and Refined schedules (the latter is refined from Linear decay). As shown in Table 5, contrary to popular wisdom, Linear decay performs better than Cosine decay across all model sizes. The Refined schedule outperforms Linear Decay for all but the 3.5B sized model.
Table 4: CIFAR10 Training Time Ablations (Test Error %).

Schedule       1 Epoch       5 Epochs      15 Epochs    30 Epochs
Cosine         35.80±0.27    15.07±0.11    7.99±0.06    6.08±0.03
Linear Decay   33.84±0.17    14.36±0.07    7.65±0.07    6.12±0.08
Refined        34.04±0.19    14.59±0.08    7.88±0.04    6.24±0.06
[Figure 4 panels: gradient norms, smoothed gradient norms, and the resulting refined schedules for CIFAR10 and CIFAR100.]
Figure 4: Limitations of refinement: if the model over-fits, our method produces degenerate schedules.
3.5 Limitations
The CIFAR10 and CIFAR100 problems show potential pitfalls of using our approach (Figure 4) when our assumptions break down. On these problems 100% train accuracy is reached near the end of training, driving the gradient norm sequence to zero. The resulting refined schedule rapidly increases the step size near the end of optimization, which, if used, results in training divergence. We suggest using a linear decay schedule instead on problems where the gradient norm drops significantly at the end.
4 Discussion
Gradient sequences that slowly decrease or slowly increase after stabilizing are often observed in practice (Figure 2), and the resulting schedules that our framework produces show interesting behavior, resembling polynomial-decay schedules with exponent p:

$$\eta_t \propto \left(1 - \frac{t}{T}\right)^{p}. \qquad (3)$$

Schedules of this form are already in use, with p typically tuned as a hyper-parameter. From Figure 1, we see that p < 1 should be used when the gradient norm is decreasing over time, p > 1 when the gradient norm is increasing, and p = 1 when the gradient sequence is flat. We use the term polynomial decay for the schedule given in Equation 3, which is its name in both PyTorch and TensorFlow, although in the literature schedules of the form 1/t^α are also sometimes referred to as polynomial schedules. From Figure 2, we can see that many of our test problems have refined schedules that are well-approximated by polynomial decay sequences, shown in blue.
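A minimal sketch of this schedule family (our own illustration; the warmup fraction is included because the fitted schedules in Table 6 combine a warmup with a polynomial decay):

    def polynomial_decay(t, T, p=1.0, warmup_frac=0.0):
        # Normalized schedule: (1 - t/T)^p after an optional linear warmup; p = 1 is linear decay.
        warmup_steps = int(warmup_frac * T)
        if warmup_steps > 0 and t < warmup_steps:
            return (t + 1) / warmup_steps
        return (1.0 - t / T) ** p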
5 Related work
The Robbins-Monro conditions are the foundation of early learning rate theory (Robbins and Monro, 1951). They advocated for step size sequences where $\sum_{t=1}^{\infty}\eta_t = \infty$ and $\sum_{t=1}^{\infty}\eta_t^2 < \infty$. Of the schedules satisfying these conditions, they advocated for schedules with 1/t decay as they are asymptotically optimal for twice
Figure 5: Normalized learning rate schedules learned on a vanilla Transformer-based LLM trained on the C4 dataset. Left: schedules obtained when setting w_t ∝ ∥g_t∥^{-2}. Right: schedules when w_t ∝ 1/∥g_t∥_1. We find that the ℓ1 norm is much more consistent across model sizes and baseline schedules. Thus, we used w_t ∝ 1/∥g_t∥_1 weighting from linear baselines to obtain the Refined schedules reported in Table 5. The x-axis is training steps.
Table 5: LLM Ablations (C4 validation loss).

Schedule       117M     284M     1B       3.5B
Cosine         3.089    2.891    2.729    2.631
Linear Decay   3.087    2.888    2.725    2.625
Refined        3.075    2.884    2.722    2.634
continuously differentiable and strongly convex functions. This schedule was later shown to be optimal (even non-asymptotically) for strongly convex G-Lipschitz stochastic optimization when appropriate averaging is used (Shamir and Zhang, 2013; Lacoste-Julien et al., 2012; Rakhlin et al., 2012).
Nemirovski et al. (2009) point out that the performance of these step sizes is extremely poor when the base step size is mis-specified, and advocate 1/√t step sizes, which have the advantage of performing asymptotically optimally when strong convexity is not present. The use of 1/√t step sizes in combination with averaging was originally developed by Polyak (1990) and Ruppert (1988), and is optimal for the complexity class of G-Lipschitz convex functions with or without stochasticity (Blair, 1985; Nesterov, 2018).
Scheduling for the stochastic quadratic setting has seen significant recent interest. The seminal work of Bach and Moulines (2013) establishes that a flat 1/(4R) schedule (for a given problem-dependent constant R), independent of the stopping time T, gives the optimal O(1/T) rate for the average iterate. For the last iterate, Ge et al. (2019a) establish that the step-decay schedule's rate:

$$\Theta\!\left(\frac{d\sigma^2}{T}\log T\right),$$

is a multiplicative log factor worse than the minimax optimal rate. Here d is the dimensionality of the problem and σ a bound on the noise. They further show that the any-time last-iterate behavior of SGD is poor, visiting highly sub-optimal function values infinitely often.
Table 6: Polynomial Decay Schedules fit to refined schedules.

Problem    Warmup   Power
DLRM       20%      0.9
GPT        28%      1.0
RoBERTa    45%      0.8
MRI        40%      0.9
RCNN       0%       1.5
ImageNet   1%       3.0
Pan et al. (2022) study the situation where the full Hessian spectrum is known in advance. In this case, they are able to improve the rate to

$$O\!\left(\frac{d\sigma^2}{T}\log \kappa\right),$$

or, under a skewed spectrum assumption, to

$$O\!\left(\frac{d\sigma^2}{T}\right).$$
Ge et al. (2019b) show that commonly used schedules such as polynomial decay schedules are sub-optimal
in the quadratic setting.
5.1 Non-smooth last iterate convergence
For non-smooth stochastic optimization, Shamir and Zhang (2013) show that the last iterate of SGD on non-smooth, convex problems with a 1/√t step size is

$$O\!\left(\frac{\log(n)}{\sqrt{n}}\right).$$

Harvey et al. (2019) show a corresponding lower bound, establishing that the log factor is unavoidable when using 1/√t step sizes.
Jain et al. (2019) were the first to show that the log factor can be removed if the horizon is known in advance. Their approach uses a sequence of step sizes:

$$\gamma_t = \frac{D\,2^{-i}}{G\sqrt{T+1}},$$

where i is an epoch counter, for a sequence of epochs each half the length of the preceding epoch. This step size sequence gives a convergence rate for the last iterate of:

$$\mathbb{E}[f(x_T) - f_*] \le \frac{15\,GD}{\sqrt{T}}.$$
For more general conversions from regret bounds to last-iterate guarantees, Cutkosky (2019) provides an alternative conversion based upon evaluating gradients at running averages of iterates and achieves the same last-iterate convergence rate as Theorem 1. However, our method is directly interpretable in terms of learning rate schedules and yields immediate practical guidance, while the technique of Cutkosky (2019) is a more opaque transformation of the base algorithm. Nevertheless, there are interesting contrasts: the result of Cutkosky (2019) is "anytime" in the sense that each individual iterate comes with an optimal convergence guarantee. In our case, only the last iterate has an optimal guarantee (in fact, Zamani and Glineur (2023) show that this restriction is necessary for any scheme based on learning rate schedules). Understanding the differences in these approaches is a valuable future direction.
Table 7: Popularity of standard learning rate schedules.

PyTorch Scheduler               GitHub Files (in thousands)
ReduceLROnPlateau               105
StepLR                          101
MultiStepLR                     37.9
CosineAnnealingLR               37.1
ExponentialLR                   16
OneCycleLR                      14.9
CosineAnnealingWarmRestarts     10.9
CyclicLR                        9.1
LinearLR                        5.9
ConstantLR                      3.6
MultiplicativeLR                2.6
PolynomialLR                    1.3
5.2 Schedules in Applications
Learning rate schedules used in applications usually fall within a few standard classes supported by major
frameworks. To survey the use of schedulers in open source software, we performed a GitHub search on each
of the schedulers in the PyTorch Library. Table 7 shows that the polynomial decay schedule that we advocate
is currently the least popular scheduler in PyTorch.
The reduce-on-plateau schedule, and its fixed-length sibling the step-wise schedule, popularized by Krizhevsky et al. (2012) for use in image classification, are considered benchmark methods. They justify it heuristically ("The heuristic which we followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate."). Due to its use in highly influential papers as well as its use in tutorials, it is by far the most commonly used schedule (Table 7).
Many early works use step sizes satisfying the Robbins-Monro conditions, such as step sizes proportional to 1/(c + αt), for some c and α, as advocated in the influential work of Bottou (1991, 2012). These step sizes produce large values for small t and so are often clipped (using min(β, ·) for some β).
Schedules involving restarting (also called cycling) are also widely used, particularly on computer vision problems where overfitting can be an issue (Smith and Topin, 2018). Loshchilov and Hutter (2017b) use restarting together with the cosine annealing schedule:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\!\left(\frac{t}{T}\pi\right)\right). \qquad (4)$$

This schedule is remarkably effective in practice for image classification problems even without restarts; as far as the authors are aware, for ResNet-50 ImageNet training, no schedule has been demonstrated that out-performs it. It is less commonly used in other problem domains. The motivation for the use of the cosine function is unstated, and there is no theory to support it. The reasons for the success of the schedule are not currently understood.
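A minimal sketch of Equation (4), without restarts (our own illustration):

    import math

    def cosine_annealing(t, T, eta_max, eta_min=0.0):
        # Equation (4): decays from eta_max at t = 0 to eta_min at t = T.
        return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / T))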
The only large-scale comparison of schedulers for deep learning optimization that we are aware of is the
work of Schmidt et al. (2021). They compare on a smaller set of problems than our comparison, although
across a range of optimizers, whereas we focus on just Adam & SGD.
5.3 Any-time Adaptive Schedules
There has been significant work on any-time learning rate adaptation that makes use of gradient or loss
information to adapt the learning rate. Schedules in the AdaGradNorm class are backed by strong theory (Wu
et al., 2020; Streeter and McMahan, 2010; Duchi et al., 2011; Ward et al., 2019; Li and Orabona, 2019). They
use step sizes of the form:

$$\eta_t \propto \frac{1}{\sqrt{\sum_{s=1}^{t}\|g_s\|^2}}.$$
They have not seen significant adoption in practice, likely due to the intrinsic disadvantages of any-time O(1/√t)-style schedules as already discussed. It is also not clear how to apply this schedule on top of existing high-performance optimizers such as Adam.
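For comparison with the schedules above, a minimal sketch of an AdaGradNorm-style step size (our own illustration):

    import numpy as np

    def adagradnorm_steps(grad_norms, base_lr=1.0):
        # eta_t proportional to 1/sqrt(sum of squared gradient norms observed up to step t).
        g2 = np.asarray(grad_norms, dtype=float) ** 2
        return base_lr / np.sqrt(np.cumsum(g2))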
Methods based on hyper-gradients (i.e. gradients with respect to the learning rate) are another heavily studied class of any-time adaptive methods. They use recent gradient evaluations and other local information to update the learning rate at each step of training (Almeida et al., 1999; Franceschi et al., 2017; Bengio, 2000; Domke, 2012; Pedregosa, 2016; Baydin et al., 2018; Feurer and Hutter, 2019; Donini et al., 2020; Chandra et al., 2022). The hyper-gradient can also be used in a more cautious manner to define conditions under which the learning rate should be decreased (Pflug, 1983, 1988; Lang et al., 2019; Zhang et al., 2020).
Methods based on loss minimization aim to choose a learning rate that greedily minimizes the loss, either at each step, or based on aggregate statistics that are either produced in stages or by smoothing. They include methods that target the train loss or the loss on held-out validation data (Xu et al., 2019; Jin et al., 2021; Kim et al., 2021).
Methods that query additional points, such as line search methods, are not scheduling methods per se but can be used in place of scheduling (Mahsereci and Hennig, 2017; Iyer et al., 2021). Due to the additional implementation complexity and the cost of their use, they also have not seen widespread adoption for deep learning.
Conclusion
Our experimental results strongly support the use of the linear decay schedule as a default baseline for deep
learning optimization. Applying refinement on top of an initial linear decay run gives a schedule showing
improvements on 5 out of 10 deep learning problems and 5 of 8 convex problems. We develop extensive
theory supporting the linear decay schedule and our novel refinement technique, establishing that their SOTA
performance is well-supported both in theory and practice.
References
Luís B. Almeida, Thibault Langlois, José D. Amaral, and Alexander Plakhov. Parameter Adaptation in Stochastic Optimization, page 111–134. Cambridge University Press, USA, 1999. ISBN 0521652634. (Cited on page 14)
Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate o(1/n). Neural Information Processing Systems (NeuRIPS), 2013. (Cited on page 11)
Atılım Güneş Baydin, Robert Cornish, David Martínez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. In Sixth International Conference on Learning Representations (ICLR), Vancouver, Canada, April 30 – May 3, 2018, 2018. (Cited on page 14)
Yoshua Bengio. Gradient-based optimization of hyperparameters. InNeural Computation, 2000.(Cited on
page 14)
Charles Blair. Problem complexity and method efficiency in optimization (a. s. nemirovsky and d. b. yudin).
SIAM Review, 27(2):264–265, 1985.(Cited on page 11)
Léon Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes, 91(8):12, 1991. (Cited on page 13)
Léon Bottou. Stochastic Gradient Descent Tricks, pages 421–436. Springer Berlin Heidelberg, 2012. (Cited on page 13)
Yair Carmon and Oliver Hinder. Making sgd parameter-free. InConference on Learning Theory, pages
2360–2389. PMLR, 2022.(Cited on page 5)
Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning
algorithms.IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.(Cited on page 3)
Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. Report on the 11th IWSLT evaluation campaign. In IWSLT, 2014. (Cited on page 8)
Kartik Chandra, Audrey Xie, Jonathan Ragan-Kelley, and Erik Meijer. Gradient descent: The ultimate opti-
mizer. In36th Conference on Neural Information Processing Systems (NeurIPS), 2022.(Cited on page 14)
Ashok Cutkosky. Anytime online-to-batch, optimism and acceleration. InInternational conference on ma-
chine learning, pages 1446–1454. PMLR, 2019.(Cited on pages 3 and 12)
Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in banach
spaces. InConference On Learning Theory, pages 1493–1529. PMLR, 2018.(Cited on page 5)
Ashok Cutkosky, Aaron Defazio, and Harsh Mehta. Mechanic: A learning rate tuner.arXiv preprint
arXiv:2306.00144, 2023.(Cited on page 5)
Justin Domke. Generic methods for optimization-based modeling. InFifteenth International Conference on
Artificial Intelligence and Statistics (AISTATS), 2012.(Cited on page 14)
Michele Donini, Luca Franceschi, Orchid Majumder, Massimiliano Pontil, and Paolo Frasconi. Marthe:
Scheduling the learning rate via online hypergradients. In Christian Bessiere, editor,Proceedings of the
Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 2119–2125. Inter-
national Joint Conferences on Artificial Intelligence Organization, 7 2020.(Cited on page 14)
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Un-
terthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil
Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational
Conference on Learning Representations, 2021.(Cited on page 8)
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic
optimization.Journal of Machine Learning Research, 12(61), 2011.(Cited on pages 4 and 13)
Matthias Feurer and Frank Hutter.Automated Machine Learning, chapter Hyperparameter Optimization.
Springer International Publishing, 2019.(Cited on page 14)
Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-
based hyperparameter optimization. InProceedings of the 34th International Conference on Machine
Learning, volume 70 ofProceedings of Machine Learning Research, pages 1165–1173. PMLR, 2017.
(Cited on page 14)
Rong Ge, Sham M. Kakade, Rahul Kidambi, and Praneeth Netrapalli. The step decay schedule: A near
optimal, geometrically decaying learning rate procedure for least squares.Neural Information Processing
Systems (NeuRIPS), 2019a.(Cited on pages 1 and 11)
Rong Ge, Sham M. Kakade, Rahul Kidambi, and Praneeth Netrapalli. Rethinking learning rate schedules
for stochastic optimization, 2019b. URLhttps://openreview.net/forum?id=HJePy3RcF7 .
(Cited on page 12)
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. Technical report, Facebook, 2017. (Cited on page 2)
Nicholas J. A. Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. Annual Conference on Learning Theory (COLT), 2019. (Cited on page 12)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.(Cited on page 8)
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford,
Diego de las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland,
Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen
Simonyan, Erich Elsen, Oriol Vinyals, Jack William Rae, and Laurent Sifre. An empirical analysis of
compute-optimal large language model training. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave,
and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022. URLhttps:
//openreview.net/forum?id=iBBcRUlOAPR .(Cited on page 9)
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolu-
tional networks. In2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
2261–2269, 2017. doi: 10.1109/CVPR.2017.243.(Cited on page 8)
Maor Ivgi, Oliver Hinder, and Yair Carmon. DoG is SGD’s best friend: A parameter-free dynamic step size
schedule.The 40th International Conference on Machine Learning (ICML 2023), 2023.(Cited on page 5)
Nikhil Iyer, V. Thejas, Nipun Kwatra, Ramachandran Ramjee, and Muthian Sivathanu. Lrtuner: A learning
rate tuner for deep neural networks, 2021.(Cited on page 14)
Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of SGD information theoreti-
cally optimal.Conference On Learning Theory (COLT), 2019.(Cited on pages 2, 4, and 12)
Yuchen Jin, Tianyi Zhou, Liangyu Zhao, Yibo Zhu, Chuanxiong Guo, Marco Canini, and Arvind Krishna-
murthy. Autolrs: Automatic learning-rate schedule by bayesian optimization on the fly. InProceedings of
the Ninth International Conference on Learning Representations (ICLR 2021), 2021.(Cited on page 14)
Ahmed Khaled, Konstantin Mishchenko, and Chi Jin. DoWG unleashed: An efficient universal parameter-
free gradient descent method.arXiv preprint arXiv:2305.16284, 2023.(Cited on page 5)
Chiheon Kim, Saehoon Kim, Jongmin Kim, Donghoon Lee, and Sungwoong Kim. Automated learning rate
scheduler for large-batch training.arXiv preprint arXiv:2107.05855, 2021.(Cited on page 14)
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann
LeCun, editors,3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA,
USA, May 7-9, 2015, Conference Track Proceedings, 2015. URLhttp://arxiv.org/abs/1412.
6980.(Cited on pages 4 and 26)
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of
Toronto, 2009.(Cited on page 8)
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional
neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors,Advances in Neural
Information Processing Systems, volume 25. Curran Associates, Inc., 2012.(Cited on page 13)
Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method, 2012. (Cited on page 11)
Hunter Lang, Lin Xiao, and Pengchuan Zhang. Using statistics to automate stochastic optimization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. (Cited on page 14)
Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes.
InThe 22nd international conference on artificial intelligence and statistics, pages 983–992. PMLR, 2019.
(Cited on page 13)
Junhong Lin, Lorenzo Rosasco, and Ding-Xuan Zhou. Iterative regularization for learning with convex loss
functions.Journal of Machine Learning Research, 17(77):1–38, 2016. URLhttp://jmlr.org/
papers/v17/15-115.html .(Cited on pages 4 and 20)
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke
Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach.arXiv
preprint arXiv:1907.11692, 2019.(Cited on page 8)
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,
2017a.(Cited on page 4)
Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts, 2017b.(Cited on
page 13)
Maren Mahsereci and Philipp Hennig. Probabilistic line searches for stochastic optimization.J. Mach. Learn.
Res., 18(1):4262–4320, jan 2017. ISSN 1532-4435.(Cited on page 14)
Brendan Mcmahan and Matthew Streeter. No-regret algorithms for unconstrained online convex optimization.
Advances in neural information processing systems, 25, 2012.(Cited on page 5)
Zakaria Mhammedi and Wouter M. Koolen. Lipschitz and comparator-norm adaptivity in online learning. In
Conference on Learning Theory, pages 2858–2887. PMLR, 2020.(Cited on page 5)
Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jong-
soo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, An-
drey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kon-
dratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha
Smelyanskiy. Deep learning recommendation model for personalization and recommendation systems.
CoRR, 2019.(Cited on page 8)
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic
programming.SIAM Journal on Optimization, 19(4):1574–1609, 2009.(Cited on page 11)
Yurii Nesterov.Lectures on Convex Optimization. Springer Nature, 2018.(Cited on page 11)
Francesco Orabona. Last iterate of SGD converges (even in unbounded do-
mains), Aug 2020. URL https://parameterfree.com/2020/08/07/
last-iterate-of-sgd-converges-even-in-unbounded-domains/ .(Cited on pages 4
and 20)
Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning. Advances in Neural Information Processing Systems, 29, 2016. (Cited on page 5)
Rui Pan, Haishan Ye, and Tong Zhang. Eigencurve: Optimal learning rate schedule for SGD on quadratic
objectives with skewed hessian spectrums.International Conference on Learning Representations (ICLR),
2022.(Cited on pages 2 and 11)
Fabian Pedregosa. Hyperparameter optimization with approximate gradient. InInternational conference on
machine learning, pages 737–746. PMLR, 2016.(Cited on page 14)
Georg Ch. Pflug. On the determination of the step size in stochastic quasigradient methods.International
Institute for Applied Systems Analysis (IIASA), Laxenburg, Austria, 1983.(Cited on page 14)
Georg Ch. Pflug. Adaptive stepsize control in stochastic approximation algorithms. InProceedings of 8th
IFAC Symposium on Identification and System Parameter Estimation,, 1988.(Cited on page 14)
Boris Polyak. New stochastic approximation type procedures.Avtomatica i Telemekhanika, 7:98–107, 01
1990.(Cited on pages 1 and 11)
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding
by generative pre-training. Technical report, OpenAI, 2019.(Cited on page 8)
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.
Journal of Machine Learning Research, 21(140):1–67, 2020. URLhttp://jmlr.org/papers/
v21/20-074.html.(Cited on page 9)
Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly
convex stochastic optimization. InProceedings of the 29th International Coference on International Con-
ference on Machine Learning, ICML’12, 2012.(Cited on page 11)
Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. InSixth
International Conference on Learning Representations (ICLR), Vancouver, Canada, April 30 – May 3,
2018, 2018.(Cited on page 4)
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection
with region proposal networks. InAdvances in Neural Information Processing Systems, volume 28. Curran
Associates, Inc., 2015.(Cited on page 8)
Herbert Robbins and Sutton Monro. A Stochastic Approximation Method.The Annals of Mathematical
Statistics, 22(3):400 – 407, 1951.(Cited on page 10)
David Ruppert. Efficient estimations from a slowly convergent robbins-monro process.Technical Report,
Cornell University, 02 1988.(Cited on pages 1 and 11)
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej
Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale
Visual Recognition Challenge.International Journal of Computer Vision (IJCV), 115(3), 2015.(Cited on
page 8)
Robin M Schmidt, Frank Schneider, and Philipp Hennig. Descending through a crowded valley - benchmark-
ing deep learning optimizers. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th Inter-
national Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research,
pages 9367–9376. PMLR, 18–24 Jul 2021.(Cited on page 13)
Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results
and optimal averaging schemes.International Conference on Machine Learning (ICML), 2013.(Cited on
pages 11 and 12)
Leslie N. Smith and Nicholay Topin. Super-convergence: Very fast training of residual networks using large
learning rates, 2018.(Cited on page 13)
Anuroop Sriram, Jure Zbontar, Tullie Murrell, Aaron Defazio, C. Lawrence Zitnick, Nafissa Yakubova, Flo-
rian Knoll, and Patricia Johnson. End-to-end variational networks for accelerated MRI reconstruction.
InInternational Conference on Medical Image Computing and Computer-Assisted Intervention. Springer,
2020.(Cited on page 8)
Matthew Streeter and H. Brendan McMahan. Less regret via online conditioning.arXiv preprint
arXiv:1002.4862, 2010.(Cited on pages 4 and 13)
Rachel Ward, Xiaoxia Wu, and Leon Bottou. Adagrad stepsizes: sharp convergence over nonconvex land-
scapes. InInternational Conference on Machine Learning, 2019.(Cited on page 13)
Sam Wiseman and Alexander M. Rush. Sequence-to-sequence learning as beam-search optimization. In
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association
for Computational Linguistics, 2016.(Cited on pages 8 and 34)
Xiaoxia Wu, Rachel Ward, and L´eon Bottou. WNGrad: Learn the learning rate in gradient descent.arXiv
preprint arXiv:1803.02865, 2020.(Cited on pages 1 and 13)
Zhen Xu, Andrew M. Dai, Jonas Kemp, and Luke Metz. Learning an adaptive learning rate schedule.arXiv
preprint arXiv:1909.09712, 2019.(Cited on page 14)
Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. InProceedings of the British Machine
Vision Conference (BMVC), 2016.(Cited on page 8)
Moslem Zamani and François Glineur. Exact convergence rate of the last iterate in subgradient methods. arXiv preprint arXiv:2307.11134, 2023. (Cited on pages 2, 4, and 12)
Jure Zbontar, Florian Knoll, Anuroop Sriram, Matthew J. Muckley, Mary Bruno, Aaron Defazio, Marc Par-
ente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, et al. fastMRI: An open dataset and bench-
marks for accelerated MRI.arXiv preprint arXiv:1811.08839, 2018.(Cited on page 8)
Pengchuan Zhang, Hunter Lang, Qiang Liu, and Lin Xiao. Statistical adaptive stochastic gradient methods.
arXiv preprint arXiv:2002.10597, 2020.(Cited on page 14)
Zhiyu Zhang, Ashok Cutkosky, and Ioannis Paschalidis. PDE-based optimal strategy for unconstrained online
learning. InInternational Conference on Machine Learning, pages 26085–26115. PMLR, 2022.(Cited on
page 5)
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja
Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading
books. InProceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015.(Cited
on page 8)
Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. InProceedings
of the Twentieth International Conference on International Conference on Machine Learning, pages 928–
935, 2003.(Cited on pages 4 and 24)
Appendix A. All-Tail Summation Bound
In this section we develop some key Lemmas that underlie our results. The starting point of this development
is the following result from Orabona (2020); Lin et al. (2016). We present the result and proof below for
completeness, and then provide an improved version of the bound (Lemma 5) that is used to derive our
analytical results.
A.1 Existing bound
Lemma 4 Let q_t ≥ 0, and let η_t be a positive non-increasing sequence. Then:
\[
\eta_T q_T \le \frac{1}{T}\sum_{t=1}^{T} \eta_t q_t + \sum_{k=1}^{T-1} \frac{1}{k(k+1)} \sum_{t=k+1}^{T} \eta_t (q_t - q_k).
\]
Proof We present below a version of the proof that is written to move the single inequality application in the whole proof to the last step. Define S_k = (1/k) \sum_{t=T-k+1}^{T} \eta_t q_t. Then
\[
k S_k = (k+1) S_{k+1} - \eta_{T-k} q_{T-k}
      = k S_{k+1} + S_{k+1} - \eta_{T-k} q_{T-k}
      = k S_{k+1} + \frac{1}{k+1}\sum_{t=T-k}^{T} \left(\eta_t q_t - \eta_{T-k} q_{T-k}\right).
\]
Dividing through by k:
\[
S_k = S_{k+1} + \frac{1}{k(k+1)}\sum_{t=T-k}^{T} \left(\eta_t q_t - \eta_{T-k} q_{T-k}\right).
\]
Unrolling, we have:
\[
S_1 = S_T + \sum_{k=1}^{T-1}\frac{1}{k(k+1)}\sum_{t=T-k}^{T} \left(\eta_t q_t - \eta_{T-k} q_{T-k}\right).
\]
Now we use S_1 = \eta_T q_T. Note that the first entry in the inner sum (t = T-k) is zero, so we may shift the indexing to start at t = T-k+1, giving:
\[
\eta_T q_T = S_T + \sum_{k=1}^{T-1}\frac{1}{k(k+1)}\sum_{t=T-k+1}^{T} \left(\eta_t q_t - \eta_{T-k} q_{T-k}\right)
          \le S_T + \sum_{k=1}^{T-1}\frac{1}{k(k+1)}\sum_{t=T-k+1}^{T} \eta_t \left(q_t - q_{T-k}\right),
\]
where the final step uses the fact that \eta_t is non-increasing and q_t \ge 0. Plugging in the definition of S_T, and substituting k \to T-k to simplify, gives the result.
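As a quick numerical sanity check of Lemma 4 (our own illustrative sketch, not part of the original development), the following Python snippet draws a random non-negative sequence q_t and a positive non-increasing sequence η_t and confirms that the right-hand side dominates η_T q_T:

import numpy as np

rng = np.random.default_rng(0)
T = 200
q = rng.uniform(0.0, 1.0, size=T)                     # q_t >= 0
eta = np.sort(rng.uniform(0.1, 1.0, size=T))[::-1]    # positive, non-increasing

lhs = eta[-1] * q[-1]                                  # eta_T * q_T
rhs = (eta * q).sum() / T
for k in range(1, T):                                  # k = 1, ..., T-1 (1-indexed)
    idx = np.arange(k, T)                              # t = k+1, ..., T in 0-indexed form
    rhs += (eta[idx] * (q[idx] - q[k - 1])).sum() / (k * (k + 1))

assert lhs <= rhs + 1e-12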
A.2 Improved expression
Lemma 5 Let q_t be any sequence, and let w_t be a positive sequence. Then:
\[
q_T = \frac{1}{\sum_{t=1}^{T} w_t}\sum_{t=1}^{T} w_t q_t
    + \sum_{k=1}^{T-1}\frac{w_k}{\sum_{t=k+1}^{T} w_t}\left(\frac{1}{\sum_{t=k}^{T} w_t}\sum_{t=k}^{T} w_t (q_t - q_k)\right)
    = \frac{1}{w_{1:T}}\sum_{t=1}^{T} w_t q_t
    + \sum_{k=1}^{T-1}\left[\frac{1}{w_{k+1:T}} - \frac{1}{w_{k:T}}\right]\sum_{t=k}^{T} w_t (q_t - q_k).
\]
Proof Define
\[
S_k = \frac{1}{\sum_{t=T-k+1}^{T} w_t}\sum_{t=T-k+1}^{T} w_t q_t.
\]
Note that with this definition:
\[
S_1 = \frac{1}{\sum_{t=T}^{T} w_t}\sum_{t=T}^{T} w_t q_t = q_T,
\]
and S_T is the full weighted average:
\[
S_T = \frac{1}{\sum_{t=1}^{T} w_t}\sum_{t=1}^{T} w_t q_t.
\]
The difference from the weighting used in the Lemma above is that we normalize by the sum of the weights rather than by k. We get the following expansion:
\[
\left(\sum_{t=T-k+1}^{T} w_t\right) S_k = \sum_{t=T-k+1}^{T} w_t q_t
 = \sum_{t=T-k}^{T} w_t q_t - w_{T-k} q_{T-k}
 = \left(\sum_{t=T-k}^{T} w_t\right) S_{k+1} - w_{T-k} q_{T-k}
 = \left(\sum_{t=T-k+1}^{T} w_t\right) S_{k+1} + \left(w_{T-k} S_{k+1} - w_{T-k} q_{T-k}\right).
\]
So dividing through by \sum_{t=T-k+1}^{T} w_t:
\[
S_k = S_{k+1} + \frac{w_{T-k}}{\sum_{t=T-k+1}^{T} w_t}\left(S_{k+1} - q_{T-k}\right).
\]
Unrolling:
\[
S_1 = S_T + \sum_{k=1}^{T-1}\frac{w_{T-k}}{\sum_{t=T-k+1}^{T} w_t}\left(S_{k+1} - q_{T-k}\right).
\]
Note that, plugging in S_{k+1}:
\[
S_{k+1} - q_{T-k} = \frac{1}{\sum_{t=T-k}^{T} w_t}\sum_{t=T-k}^{T} w_t q_t - q_{T-k}
                  = \frac{1}{\sum_{t=T-k}^{T} w_t}\sum_{t=T-k}^{T} w_t (q_t - q_{T-k}).
\]
So we have:
\[
q_T = \frac{1}{\sum_{t=1}^{T} w_t}\sum_{t=1}^{T} w_t q_t
    + \sum_{k=1}^{T-1}\frac{w_{T-k}}{\sum_{t=T-k+1}^{T} w_t}\left(\frac{1}{\sum_{t=T-k}^{T} w_t}\sum_{t=T-k}^{T} w_t (q_t - q_{T-k})\right).
\]
Finally, we make the simple change of variables k \to T-k to yield the result.
Remark 6 We will typically use this result by setting q_t = f(x_t) - f^* in order to bound q_T = f(x_T) - f^*. By using the weighting sequence w_t, we are able to remove the w_T weight from in front of the q_T term (in contrast to Lemma 4). This is crucial, as we want to be able to analyze the situation in which w_t becomes extremely small near the end of training and yet still bound q_T = f(x_T) - f^*; if q_T were weighted by w_T we would get very loose bounds when w_T is small. Notice also that we have an equality instead of an inequality, and we have not had to impose the requirement that w be a non-increasing sequence.
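Because the identity in Lemma 5 is exact, it is easy to verify numerically. The following sketch (our own, with arbitrary random sequences) checks it directly:

import numpy as np

rng = np.random.default_rng(1)
T = 100
q = rng.normal(size=T)                 # q_t may be any sequence
w = rng.uniform(0.1, 1.0, size=T)      # w_t > 0, no monotonicity required

def tail(a, k):                        # sum_{t=k}^{T} a_t, with 1-indexed k
    return a[k - 1:].sum()

rhs = (w * q).sum() / w.sum()
for k in range(1, T):                  # k = 1, ..., T-1
    inner = (w[k - 1:] * (q[k - 1:] - q[k - 1])).sum() / tail(w, k)
    rhs += w[k - 1] / tail(w, k + 1) * inner

np.testing.assert_allclose(rhs, q[-1], atol=1e-10)   # equals q_T up to roundoff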
Appendix B. Proof of Theorem 1
Before proving Theorem 1, we need the following important Lemma:
Lemma 7 Suppose z_1, ..., z_T is an arbitrary sequence of vectors. Let w_1, ..., w_T be an arbitrary sequence of positive numbers. Define the sequence x_1, ..., x_T recursively by x_1 = z_1 and:
\[
x_t = w_{t:T}\left(\frac{z_t}{w_{1:T}} + \sum_{p=1}^{t-1} x_p\left[\frac{1}{w_{p+1:T}} - \frac{1}{w_{p:T}}\right]\right).
\]
Suppose g_t are random variables with E[g_t | x_1, ..., x_t] \in \partial f(x_t) for some convex f. Then:
\[
E[f(x_T) - f(u)] \le E\left[\frac{1}{w_{1:T}}\sum_{t=1}^{T} w_t \langle g_t, z_t - u\rangle\right].
\]
Proof Let us set q_t = f(x_t) - f(u). Then we have E[q_t] \le E[\langle g_t, x_t - u\rangle] and E[q_t - q_k] \le E[\langle g_t, x_t - x_k\rangle]. Then, Lemma 5 implies:
\[
E[q_T] \le E\left[\frac{1}{w_{1:T}}\sum_{t=1}^{T} w_t \langle g_t, x_t - u\rangle
       + \sum_{k=1}^{T-1}\left[\frac{1}{w_{k+1:T}} - \frac{1}{w_{k:T}}\right]\sum_{t=k}^{T} w_t \langle g_t, x_t - x_k\rangle\right].
\]
Now, let us find the coefficient of \langle g_t, x_t\rangle in the above expression. This is:
\[
\frac{w_t}{w_{1:T}} + \left[\sum_{k=1}^{t}\left(\frac{1}{w_{k+1:T}} - \frac{1}{w_{k:T}}\right) w_t\right] - w_t\left(\frac{1}{w_{t+1:T}} - \frac{1}{w_{t:T}}\right)
= \frac{w_t}{w_{1:T}} + \sum_{k=1}^{t-1}\left(\frac{1}{w_{k+1:T}} - \frac{1}{w_{k:T}}\right) w_t
= \frac{w_t}{w_{t:T}}.
\]
Next, for p < t, the coefficient of \langle g_t, x_p\rangle is:
\[
-\left[\frac{1}{w_{p+1:T}} - \frac{1}{w_{p:T}}\right] w_t.
\]
And for p > t, the coefficient of \langle g_t, x_p\rangle is zero. Finally, the coefficient of \langle g_t, u\rangle is -w_t/w_{1:T}.
Putting this all together, we can rearrange the expression as follows:
\[
E[q_T] \le E\left[\sum_{t=1}^{T}\left\langle g_t,\ \frac{w_t}{w_{t:T}} x_t - \left(\frac{w_t u}{w_{1:T}} + \sum_{p=1}^{t-1} w_t x_p\left[\frac{1}{w_{p+1:T}} - \frac{1}{w_{p:T}}\right]\right)\right\rangle\right]
= E\left[\sum_{t=1}^{T}\left\langle \frac{w_t}{w_{t:T}} g_t,\ x_t - w_{t:T}\left(\frac{u}{w_{1:T}} + \sum_{p=1}^{t-1} x_p\left[\frac{1}{w_{p+1:T}} - \frac{1}{w_{p:T}}\right]\right)\right\rangle\right].
\]
Now, given an arbitrary sequence z_1, ..., z_T, define x_t recursively by:
\[
x_t = w_{t:T}\left(\frac{z_t}{w_{1:T}} + \sum_{p=1}^{t-1} x_p\left[\frac{1}{w_{p+1:T}} - \frac{1}{w_{p:T}}\right]\right).
\]
Then we have:
\[
E[q_T] \le E\left[\sum_{t=1}^{T}\left\langle \frac{w_t}{w_{t:T}} g_t,\ x_t - w_{t:T}\left(\frac{u}{w_{1:T}} + \sum_{p=1}^{t-1} x_p\left[\frac{1}{w_{p+1:T}} - \frac{1}{w_{p:T}}\right]\right)\right\rangle\right]
= E\left[\sum_{t=1}^{T}\frac{w_t}{w_{1:T}}\langle g_t, z_t - u\rangle\right].
\]
Now, we are finally ready to prove Theorem 1:
Theorem 1 Suppose z_1, ..., z_T is some arbitrary sequence of vectors. Let w_1, ..., w_T be an arbitrary sequence of non-negative numbers. Recall that we define \Delta_t = z_{t+1} - z_t and x_1 = z_1. For t \ge 1, suppose x_{t+1} satisfies:
\[
x_{t+1} = x_t + \frac{w_{t+1:T}}{w_{1:T}}\Delta_t,
\]
then for any u:
\[
E[f(x_T) - f(u)] \le E\left[\sum_{t=1}^{T}\frac{1}{w_{1:T}}\langle w_t g_t, z_t - u\rangle\right].
\]
Proof Let us define \hat{x}_1 = z_1 and recursively set:
\[
\hat{x}_t = w_{t:T}\left(\frac{z_t}{w_{1:T}} + \sum_{p=1}^{t-1}\hat{x}_p\left[\frac{1}{w_{p+1:T}} - \frac{1}{w_{p:T}}\right]\right).
\]
Then, Lemma 7 shows that E[f(\hat{x}_T) - f(u)] \le E\left[\frac{1}{w_{1:T}}\sum_{t=1}^{T} w_t\langle g_t, z_t - u\rangle\right]. So, it suffices to show that x_t = \hat{x}_t for all t. In turn, since \hat{x}_1 = z_1 = x_1, it suffices to show \hat{x}_{t+1} - \hat{x}_t = \frac{w_{t+1:T}}{w_{1:T}}\Delta_t = x_{t+1} - x_t for all t.
To this end, let's do some calculation. First:
\[
\hat{x}_t = \frac{w_t}{w_{t:T}}\hat{x}_t + \frac{w_{t+1:T}}{w_{t:T}}\hat{x}_t
          = \frac{w_t}{w_{t:T}}\hat{x}_t + w_{t+1:T}\left(\frac{z_t}{w_{1:T}} + \sum_{p=1}^{t-1}\hat{x}_p\left[\frac{1}{w_{p+1:T}} - \frac{1}{w_{p:T}}\right]\right).
\]
With this expression, we have:
\[
\hat{x}_{t+1} - \hat{x}_t
= \hat{x}_{t+1} - w_{t+1:T}\left(\frac{z_t}{w_{1:T}} + \sum_{p=1}^{t-1}\hat{x}_p\left[\frac{1}{w_{p+1:T}} - \frac{1}{w_{p:T}}\right]\right) - \frac{w_t}{w_{t:T}}\hat{x}_t
\]
\[
= w_{t+1:T}\left(\frac{z_{t+1}}{w_{1:T}} + \sum_{p=1}^{t}\hat{x}_p\left[\frac{1}{w_{p+1:T}} - \frac{1}{w_{p:T}}\right]\right)
 - w_{t+1:T}\left(\frac{z_t}{w_{1:T}} + \sum_{p=1}^{t-1}\hat{x}_p\left[\frac{1}{w_{p+1:T}} - \frac{1}{w_{p:T}}\right]\right)
 - \frac{w_t}{w_{t:T}}\hat{x}_t
\]
\[
= \frac{w_{t+1:T}(z_{t+1} - z_t)}{w_{1:T}} + w_{t+1:T}\hat{x}_t\left[\frac{1}{w_{t+1:T}} - \frac{1}{w_{t:T}}\right] - \frac{w_t}{w_{t:T}}\hat{x}_t
= \frac{w_{t+1:T}}{w_{1:T}}\Delta_t + \hat{x}_t\left[1 - \frac{w_{t+1:T} + w_t}{w_{t:T}}\right]
= \frac{w_{t+1:T}}{w_{1:T}}\Delta_t.
\]
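To make the equivalence used in this proof concrete, the sketch below (our own illustration on random data) builds x_t incrementally via x_{t+1} = x_t + (w_{t+1:T}/w_{1:T})Δ_t with Δ_t = z_{t+1} - z_t, builds the closed-form recursion of Lemma 7, and confirms the two coincide:

import numpy as np

rng = np.random.default_rng(2)
T, d = 50, 3
z = rng.normal(size=(T, d))                  # arbitrary z_1, ..., z_T
w = rng.uniform(0.1, 1.0, size=T)            # arbitrary positive weights

def wsum(a, b):                              # w_{a:b}, 1-indexed, inclusive ends
    return w[a - 1:b].sum()

# incremental form of Theorem 1: x_{t+1} = x_t + (w_{t+1:T}/w_{1:T}) * (z_{t+1} - z_t)
x = np.zeros((T, d))
x[0] = z[0]
for t in range(1, T):
    x[t] = x[t - 1] + wsum(t + 1, T) / wsum(1, T) * (z[t] - z[t - 1])

# closed-form recursion of Lemma 7
xhat = np.zeros((T, d))
for t in range(1, T + 1):                    # 1-indexed t
    acc = z[t - 1] / wsum(1, T)
    for p in range(1, t):
        acc = acc + xhat[p - 1] * (1.0 / wsum(p + 1, T) - 1.0 / wsum(p, T))
    xhat[t - 1] = wsum(t, T) * acc

np.testing.assert_allclose(x, xhat, rtol=1e-6, atol=1e-8)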
Appendix C. Proof of Theorem 3
Theorem 3 Suppose that x_{t+1} = x_t - \eta_t g_t with \eta_t = \frac{w_t w_{t+1:T}}{w_{1:T}}. Then we have:
\[
E[f(x_T) - f(u)] \le E\left[\frac{1}{2\, w_{1:T}}\left(D^2 + \sum_{t=1}^{T} w_t^2\|g_t\|^2\right)\right]. \tag{2}
\]
Moreover, for a fixed sequence \|g_1\|^2, ..., \|g_T\|^2, the value of \frac{1}{2 w_{1:T}}\left(D^2 + \sum_{t=1}^{T} w_t^2\|g_t\|^2\right) is minimized by setting:
\[
w_t = \|g_t\|^{-2}\frac{D}{\sqrt{\sum_{p=1}^{T}\|g_p\|^{-2}}}.
\]
Proof First, observe that with \Delta_t = -w_t g_t, we have x_{t+1} = x_t + \frac{w_{t+1:T}}{w_{1:T}}\Delta_t. Therefore with z_1 = x_1 and z_{t+1} = z_t + \Delta_t, Theorem 1 implies:
\[
E[f(x_T) - f(u)] \le E\left[\frac{1}{w_{1:T}}\sum_{t=1}^{T}\langle w_t g_t, z_t - u\rangle\right].
\]
Next, observe that z_t is simply online gradient descent with learning rate 1 acting on the loss vectors w_t g_t. Standard analysis (Zinkevich, 2003) shows:
\[
\sum_{t=1}^{T}\langle w_t g_t, z_t - u\rangle = \frac{\|z_1 - u\|^2}{2} - \frac{\|z_{T+1} - u\|^2}{2} + \sum_{t=1}^{T}\frac{w_t^2\|g_t\|^2}{2}.
\]
This immediately implies the first part of the Theorem. Next, we need to solve for the minimizing values of w_t. To do this, we take the logarithm of the expression \frac{1}{2 w_{1:T}}\left(\|x_1 - u\|^2 + \sum_{t=1}^{T} w_t^2\|g_t\|^2\right) and differentiate:
\[
\frac{\partial}{\partial w_k}\log\left[\frac{1}{2 w_{1:T}}\left(\|x_1 - u\|^2 + \sum_{t=1}^{T} w_t^2\|g_t\|^2\right)\right]
= \frac{2 w_k\|g_k\|^2}{\|x_1 - u\|^2 + \sum_{t=1}^{T} w_t^2\|g_t\|^2} - \frac{1}{w_{1:T}}.
\]
We set this equal to zero to solve for the optimal w_k:
\[
w_k = \|g_k\|^{-2}\,\frac{\|x_1 - u\|^2 + \sum_{t=1}^{T} w_t^2\|g_t\|^2}{2 w_{1:T}} \triangleq \lambda\|g_k\|^{-2},
\]
where we have defined \lambda = \frac{\|x_1 - u\|^2 + \sum_{t=1}^{T} w_t^2\|g_t\|^2}{2 w_{1:T}}, which does not depend on k. That is, the optimal w_k value is proportional to \|g_k\|^{-2}. With this expression, we have:
\[
\sum_{t=1}^{T} w_t^2\|g_t\|^2 = \lambda^2\sum_{t=1}^{T}\|g_t\|^{-2}, \qquad w_{1:T} = \lambda\sum_{t=1}^{T}\|g_t\|^{-2}.
\]
So, let us now solve for \lambda by plugging in these values:
\[
\lambda = \frac{\|x_1 - u\|^2 + \sum_{t=1}^{T} w_t^2\|g_t\|^2}{2 w_{1:T}}
        = \frac{\|x_1 - u\|^2}{2\lambda\sum_{t=1}^{T}\|g_t\|^{-2}} + \frac{\lambda}{2}
\quad\Longrightarrow\quad
\lambda = \frac{\|x_1 - u\|}{\sqrt{\sum_{t=1}^{T}\|g_t\|^{-2}}}.
\]
This in turn implies the claimed optimal value for w_k.
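The optimal weights of Theorem 3 translate directly into a learning rate schedule via η_t = w_t w_{t+1:T}/w_{1:T}. The sketch below (our own, using a synthetic gradient-norm sequence and an assumed value of D; the function name and the synthetic data are hypothetical) computes the weights and the resulting normalized schedule:

import numpy as np

def theorem3_schedule(grad_norms, D=1.0):
    # Theorem 3 weights: w_t = ||g_t||^{-2} * D / sqrt(sum_p ||g_p||^{-2}),
    # turned into a schedule eta_t = w_t * w_{t+1:T} / w_{1:T}.
    g2 = np.asarray(grad_norms, dtype=float) ** 2
    w = (1.0 / g2) * D / np.sqrt((1.0 / g2).sum())
    tail = np.concatenate([np.cumsum(w[::-1])[::-1][1:], [0.0]])   # w_{t+1:T}
    eta = w * tail / w.sum()
    return eta / eta.max()          # normalized to a peak of 1, as Algorithm 2 does

# synthetic gradient norms: large early gradients that decay over training
T = 1000
norms = 1.0 + 5.0 * np.exp(-np.arange(T) / 100.0)
schedule = theorem3_schedule(norms, D=1.0)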
Appendix D. Proof of Theorem 9
Proof First, observe that for any solution to the desired identity \eta_t = \frac{w_t w_{t+1:T}}{w_{1:T}}, replacing w with c \cdot w for some constant c will yield a solution for \eta replaced with c \cdot \eta. Thus, it suffices to consider the possibility that \max_t \eta_t = 1.
The intuition for the rest of the proof is the following: Given the value for w_1, we can solve for w_{2:T} using the equation \frac{w_1 w_{2:T}}{w_1 + w_{2:T}} = \eta_1. This in turn provides us with the value of w_{1:T}. Now, given w_{k:T} for any k we can solve for w_k using the equation \eta_k = \frac{w_k w_{k+1:T}}{w_{1:T}} = \frac{w_k(w_{k:T} - w_k)}{w_{1:T}}. Thus we may recursively compute all the values of w. Each of these steps requires solving a quadratic equation. We simply choose an initial w_1 so as to ensure that all the following quadratic equations have non-negative real roots.
Specifically, define s_1 = 2^{2T} and set
\[
w_1 = \frac{s_1 - \sqrt{s_1^2 - 4 s_1\eta_1}}{2}.
\]
Then, recursively define for t = 2, ..., T-1:
\[
s_t = s_{t-1} - w_{t-1}, \qquad w_t = \frac{s_t - \sqrt{s_t^2 - 4 s_1\eta_t}}{2},
\]
and set w_T = s_T = s_{T-1} - w_{T-1}. Notice that these choices satisfy:
\[
w_t^2 - s_t w_t + s_1\eta_t = 0,
\]
so that if we could establish (1) that all w_t \ge 0 and (2) s_t = w_{t:T}, then we would have:
\[
\frac{w_t w_{t+1:T}}{w_{1:T}} = \frac{w_t(w_{t:T} - w_t)}{w_{1:T}} = \frac{w_t(s_t - w_t)}{s_1} = \frac{s_t w_t - w_t^2}{s_1} = \eta_t
\]
as desired.
Let us first prove (1): all w_t \ge 0. For t \le T-1, this will hold if s_t^2 - 4 s_1\eta_t > 0. To establish this, we will first show that s_t \ge \frac{s_1}{2^{t-1}} for all t \le T-1. If this holds, then we have:
\[
s_t^2 \ge \frac{s_1^2}{2^{2t-2}} = \frac{s_1 2^{2T}}{2^{2t-2}} \ge 4 s_1 \ge 4 s_1\eta_t
\]
for t \le T-1, where we have used our assumption \eta_t \le 1.
So, we now establish s_t \ge \frac{s_1}{2^{t-1}} by induction for t \le T-1. The statement is clear for t = 1. Suppose it holds for all t \le k for some k \le T-2. Then we have:
\[
s_{k+1} = s_k - w_k = \frac{s_k + \sqrt{s_k^2 - 4 s_1\eta_k}}{2} \ge \frac{s_k}{2} \ge \frac{s_1}{2^{k}},
\]
which establishes the claim. Therefore for t \le T-1, the w_t are non-negative real numbers. Finally, we have:
\[
w_{T-1} = \frac{s_{T-1} - \sqrt{s_{T-1}^2 - 4 s_1\eta_{T-1}}}{2} \le s_{T-1},
\]
so that w_T = s_T = s_{T-1} - w_{T-1} \ge 0. Thus w_t \ge 0 for all t.
Next, to show (2): s_t = w_{t:T}. This is nearly immediate. By definition we have w_T = s_T. Suppose s_t = w_{t:T} for some t \ge 2. Then s_t = s_{t-1} - w_{t-1}, so that s_{t-1} = s_t + w_{t-1} = w_{t-1:T}. Thus by induction we have s_t = w_{t:T} for all t, which is the last thing we needed to show.
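The construction in this proof can be run directly. The sketch below (our own code; the helper name is hypothetical) recovers weights w from an arbitrary schedule η and checks the identity η_t = w_t w_{t+1:T}/w_{1:T}. It uses the algebraically equivalent, cancellation-free form 2 s_1 η_t / (s_t + sqrt(s_t^2 - 4 s_1 η_t)) of the smaller quadratic root; because of the 2^{2T} scale, floating-point arithmetic limits this to modest T:

import numpy as np

def weights_from_schedule(eta):
    # Recover w_1..w_T with eta_t = w_t * w_{t+1:T} / w_{1:T} (Theorem 9 construction).
    eta = np.asarray(eta, dtype=float)
    eta = eta / eta.max()                 # normalize so max eta_t = 1
    T = len(eta) + 1                      # eta has length T-1
    s1 = 2.0 ** (2 * T)
    w = np.zeros(T)
    s = s1
    for t in range(T - 1):
        # smaller root of w^2 - s*w + s1*eta_t = 0, written without cancellation
        w[t] = 2.0 * s1 * eta[t] / (s + np.sqrt(s * s - 4.0 * s1 * eta[t]))
        s = s - w[t]
    w[T - 1] = s                          # w_T = s_T
    return w

eta = np.linspace(1.0, 0.1, 15)           # any non-negative schedule of length T-1
w = weights_from_schedule(eta)
tails = np.concatenate([np.cumsum(w[::-1])[::-1][1:], [0.0]])
np.testing.assert_allclose(w[:-1] * tails[:-1] / w.sum(),
                           eta / eta.max(), rtol=1e-8)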
Appendix E. Schedules for Per-Coordinate Updates
Many popular optimization algorithms in use today, like Adam (Kingma and Ba, 2015) and its variants, employ per-coordinate updates: the update \Delta_t is not proportional to the gradient g_t but instead scales each coordinate of g_t by an adaptively chosen value. In this section we propose an approximately optimal schedule for such methods.
Theorem 8 Suppose \Delta_t = z_{t+1} - z_t = -w_t\,(\eta_t \odot g_t), where \eta_t \in R^d is a vector of learning rates and \odot indicates the coordinate-wise product. Set x_{t+1} = x_t + \frac{w_{t+1:T}}{w_{1:T}}\Delta_t. Define the quantity R by:
\[
R = \sqrt{\sum_{t=1}^{T}\sum_{i=1}^{d}\frac{(z_{t,i} - u_i)^2 - (z_{t+1,i} - u_i)^2}{\eta_{t,i}}}.
\]
Then we have:
\[
E[f(x_T) - f(u)] \le E\left[\frac{1}{w_{1:T}}\left(\frac{R^2}{2} + \sum_{t=1}^{T}\frac{w_t^2}{2}\sum_{i=1}^{d}\eta_{t,i}\, g_{t,i}^2\right)\right].
\]
Moreover, for any given fixed values for R and g_{t,i}^2, the expression \frac{1}{w_{1:T}}\left(\frac{R^2}{2} + \sum_{t=1}^{T}\frac{w_t^2}{2}\sum_{i=1}^{d}\eta_{t,i}\, g_{t,i}^2\right) is minimized by setting w_t as below:
\[
w_t = \frac{R}{\sqrt{\sum_{t=1}^{T}\left(\sum_{i=1}^{d}\eta_{t,i}\, g_{t,i}^2\right)^{-1}}}\cdot\left(\sum_{i=1}^{d}\eta_{t,i}\, g_{t,i}^2\right)^{-1}.
\]
As an example of how to use this result, let us suppose we are employing the Adam optimizer, and let us also ignore the presence of momentum when computing the weights (so really, these will be the weights for the RMSProp optimizer). In this case, \eta_{t,i} \propto \frac{1}{\sqrt{v_{t,i}}}, where v_{t,i} is the exponential average of the squared gradients. Thus, we get:
\[
w_t = \lambda\left(\sum_{i=1}^{d}\frac{g_{t,i}^2}{\sqrt{v_{t,i}}}\right)^{-1}
\]
for some \lambda. Then, we use a learning rate schedule of \frac{w_t w_{t+1:T}}{w_{1:T}}.
That is, the corresponding procedure to Algorithm 1 is given by Algorithm 2:
Algorithm 2 Schedule Refinement for Adam
1: Input: G = (G_t = \sum_{i=1}^{d} g_{t,i}^2/\sqrt{v_{t,i}}), a length-T sequence of weighted gradient norms from the Adam optimizer; smoothing hyper-parameter \tau > 0
2: \hat{G} = medianfilter(G, filterwidth = \tau T, padding = (nearest, reflect))
3: Define w_t = \hat{G}_t^{-1}
4: For each t, let:
5:   \eta_t = w_t \sum_{p=t+1}^{T} w_p
6: Return normalized schedule \eta / \max(\eta)
In practice, however, recording the value w_t \propto \left(\sum_{i=1}^{d} g_{t,i}^2/\sqrt{v_{t,i}}\right)^{-1} may be difficult (for example, in PyTorch, the Adam implementation is primarily C code rather than Python, which makes it substantially more involved to modify). However, by inspecting the formula for w_t, we can see that it is likely to be an interpolation between 1/\|g_t\|_2^2 and 1/\|g_t\|_1 (the L1 norm is not squared). The intuition is the following: if v_{t,i} takes its minimal value of g_{t,i}^2, then g_{t,i}^2/\sqrt{v_{t,i}} = |g_{t,i}| and so w_t \propto 1/\|g_t\|_1. On the other hand, if all v_{t,i} are the same constant, then w_t \propto 1/\|g_t\|_2^2. In practice we expect behavior closer to the first case and recommend using w_t \propto 1/\|g_t\|_1.
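A minimal Python sketch of Algorithm 2 follows (our own rendering, not the authors' implementation: it uses scipy's median filter with a single 'nearest' padding mode rather than the mixed padding in the listing, and the input G may be either the recorded Adam statistic or the ||g_t||_1 proxy recommended above):

import numpy as np
from scipy.ndimage import median_filter

def refine_schedule_adam(G, tau=0.05):
    # G[t] is either the recorded Adam statistic sum_i g_{t,i}^2 / sqrt(v_{t,i}),
    # or the ||g_t||_1 proxy suggested above when recording v_t is impractical.
    G = np.asarray(G, dtype=float)
    T = len(G)
    G_hat = median_filter(G, size=max(1, int(tau * T)), mode='nearest')  # step 2 (smoothing)
    w = 1.0 / G_hat                                                      # step 3: w_t = G_hat_t^{-1}
    tail = np.concatenate([np.cumsum(w[::-1])[::-1][1:], [0.0]])         # sum_{p > t} w_p
    eta = w * tail                                                       # step 5
    return eta / eta.max()                                               # step 6: normalize

# usage with a synthetic recorded sequence (gradient norms decaying over training)
G = 1.0 + 5.0 * np.exp(-np.arange(2000) / 300.0)
schedule = refine_schedule_adam(G, tau=0.05)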
Proof [Proof of Theorem 8] By standard online gradient descent analysis, we have:
\[
\langle w_t g_t, z_t - u\rangle = \sum_{i=1}^{d} w_t g_{t,i}(z_{t,i} - u_i)
= \sum_{i=1}^{d}\left[\frac{(z_{t,i} - u_i)^2}{2\eta_{t,i}} - \frac{(z_{t+1,i} - u_i)^2}{2\eta_{t,i}} + \frac{\eta_{t,i}}{2} w_t^2 g_{t,i}^2\right].
\]
Summing this over t from 1 to T, we get
\[
\sum_{t=1}^{T}\langle w_t g_t, z_t - u\rangle
= \sum_{t=1}^{T}\sum_{i=1}^{d}\left[\frac{(z_{t,i} - u_i)^2}{2\eta_{t,i}} - \frac{(z_{t+1,i} - u_i)^2}{2\eta_{t,i}}\right] + \sum_{t=1}^{T}\frac{w_t^2}{2}\sum_{i=1}^{d}\eta_{t,i}\, g_{t,i}^2
= \frac{R^2}{2} + \sum_{t=1}^{T}\frac{w_t^2}{2}\sum_{i=1}^{d}\eta_{t,i}\, g_{t,i}^2.
\]
The first part of the result now follows from Theorem 1.
Next, we again take the logarithm, differentiate and set equal to zero:
\[
0 = \frac{\partial}{\partial w_k}\log\left[\frac{1}{w_{1:T}}\left(\frac{R^2}{2} + \sum_{t=1}^{T}\frac{w_t^2}{2}\sum_{i=1}^{d}\eta_{t,i}\, g_{t,i}^2\right)\right]
= \frac{2 w_k\sum_{i=1}^{d}\eta_{k,i}\, g_{k,i}^2}{R^2 + \sum_{t=1}^{T} w_t^2\sum_{i=1}^{d}\eta_{t,i}\, g_{t,i}^2} - \frac{1}{w_{1:T}}.
\]
Rearranging, we obtain
\[
w_k = \left(\sum_{i=1}^{d}\eta_{k,i}\, g_{k,i}^2\right)^{-1}\frac{R^2 + \sum_{t=1}^{T} w_t^2\sum_{i=1}^{d}\eta_{t,i}\, g_{t,i}^2}{2 w_{1:T}}
    \triangleq \lambda\left(\sum_{i=1}^{d}\eta_{k,i}\, g_{k,i}^2\right)^{-1},
\]
where we have collected the non-k-dependent terms into \lambda = \frac{R^2 + \sum_{t=1}^{T} w_t^2\sum_{i=1}^{d}\eta_{t,i}\, g_{t,i}^2}{2 w_{1:T}}. Now we solve for \lambda:
\[
\lambda = \frac{R^2 + \sum_{t=1}^{T} w_t^2\sum_{i=1}^{d}\eta_{t,i}\, g_{t,i}^2}{2 w_{1:T}}
        = \frac{R^2}{2\lambda\sum_{t=1}^{T}\left(\sum_{i=1}^{d}\eta_{t,i}\, g_{t,i}^2\right)^{-1}} + \frac{\lambda}{2}
\quad\Longrightarrow\quad
\lambda = \frac{R}{\sqrt{\sum_{t=1}^{T}\left(\sum_{i=1}^{d}\eta_{t,i}\, g_{t,i}^2\right)^{-1}}}.
\]
Putting w_t = \lambda\left(\sum_{i=1}^{d}\eta_{t,i}\, g_{t,i}^2\right)^{-1} completes the proof.
Appendix F. Explicit bounds for arbitrary schedules
Section 2 develops a framework that suggests using learning rates of the form \eta_t = \frac{w_t w_{t+1:T}}{w_{1:T}}, and in Theorem 3 we provided a simple closed-form solution for the optimal values of w_t. Given this apparently restricted form of \eta_t, it is natural to ask if this restriction is real: that is, can every schedule \eta_1, ..., \eta_{T-1} be represented using some weights w_t? Theorem 9 shows that the answer is "yes":
Theorem 9 Let \eta_1, ..., \eta_{T-1} be a sequence of non-negative numbers. Then there is a sequence of non-negative weights w_1, ..., w_T such that \eta_t = \frac{w_t w_{t+1:T}}{w_{1:T}} for all t.
This Theorem shows that by optimizing the weights w_t, we are in some sense also solving for the optimal learning rates \eta_t. However, notice that the reverse is not true: any given schedule \eta_t can be represented by a number of different weights w_t, and these weights give rise to different bounds using Theorem 1. The proof of Theorem 9 works by constructing a particular set of weights w_t, but these may not be the best weights in terms of providing the tightest convergence bound for the given \eta_t.
Although Theorem 9 shows that the learning rate representation of Theorem 3 indeed covers all possible schedules, it does not provide a user-friendly way to analyze the convergence of an arbitrary schedule. In this section, we provide an alternative analysis that fills this gap.
This approach is closer to previous final-iterate analysis techniques: we bound the final iterate in terms of the average iterate, plus an additional error term. This bound is strictly looser than those of Theorems 1 and 3 in the constant factors, but provides a convenient way to analyze arbitrary schedules. The bound is presented in Theorem 10 below.
Theorem 10 Suppose that f is convex and let x_t be given by SGD with learning rates \eta_t: x_{t+1} = x_t - \eta_t g_t. Then for the last iterate x_T we have:
\[
E[f(x_T) - f(u)] \le E\left[\frac{D^2}{2\sum_{t=1}^{T}\eta_t} + \frac{1}{2\sum_{t=1}^{T}\eta_t}\sum_{t=1}^{T}\eta_t^2\|g_t\|^2\right]
+ E\left[\frac{1}{2}\sum_{k=1}^{T-1}\frac{\eta_k}{\sum_{t=k+1}^{T}\eta_t}\left(\frac{1}{\sum_{t=k}^{T}\eta_t}\sum_{t=k}^{T}\eta_t^2\|g_t\|^2\right)\right]. \tag{5}
\]
Optimizing this bound with respect to the step-size sequence produces schedules that are visually indis-
tinguishable from those of the regret based approach described in Section 2.1 for largeT.
To prove Theorem 10, we will need to control quantities of the form:
\[
\sum_{t=k}^{T}\eta_t(q_t - q_k), \qquad \text{where } q_t - q_k = f(x_t) - f(x_k).
\]
This can be bounded by the usual suboptimality inequality for SGD/GD methods, which holds for any u:
\[
\sum_{t=k}^{T}\eta_t\left[f(x_t) - f(u)\right] \le \frac{1}{2}\|x_k - u\|^2 + \frac{1}{2}\sum_{t=k}^{T}\eta_t^2\|g_t\|^2.
\]
Setting u = x_k yields:
\[
\sum_{t=k}^{T}\eta_t\left[q_t - q_k\right] \le \frac{1}{2}\sum_{t=k}^{T}\eta_t^2\|g_t\|^2.
\]
We can use this in our Lemma 5 to obtain the following result:
Corollary 11
\[
q_T \le \frac{1}{\sum_{t=1}^{T}\eta_t}\sum_{t=1}^{T}\eta_t q_t
+ \frac{1}{2}\sum_{k=1}^{T-1}\frac{\eta_k}{\sum_{t=k+1}^{T}\eta_t}\left(\frac{1}{\sum_{t=k}^{T}\eta_t}\sum_{t=k}^{T}\eta_t^2\|g_t\|^2\right).
\]
Now, we are ready to prove Theorem 10.
Theorem 10 Suppose that f is convex and let x_t be given by SGD with learning rates \eta_t: x_{t+1} = x_t - \eta_t g_t. Then for the last iterate x_T we have:
\[
E[f(x_T) - f(u)] \le E\left[\frac{D^2}{2\sum_{t=1}^{T}\eta_t} + \frac{1}{2\sum_{t=1}^{T}\eta_t}\sum_{t=1}^{T}\eta_t^2\|g_t\|^2\right]
+ E\left[\frac{1}{2}\sum_{k=1}^{T-1}\frac{\eta_k}{\sum_{t=k+1}^{T}\eta_t}\left(\frac{1}{\sum_{t=k}^{T}\eta_t}\sum_{t=k}^{T}\eta_t^2\|g_t\|^2\right)\right]. \tag{5}
\]
Proof Following the standard convergence bound approach:
\[
\|x_{t+1} - u\|^2 = \|x_t - \eta_t g_t - u\|^2
= \|x_t - u\|^2 - 2\eta_t\langle g_t, x_t - u\rangle + \eta_t^2\|g_t\|^2
\le \|x_t - u\|^2 - 2\eta_t\left[f(x_t) - f(u)\right] + \eta_t^2\|g_t\|^2.
\]
Summing over t and telescoping gives:
\[
\sum_{t=1}^{T}\eta_t\left[f(x_t) - f(u)\right] \le \frac{1}{2}D^2 + \frac{1}{2}\sum_{t=1}^{T}\eta_t^2\|g_t\|^2.
\]
Then divide through by \sum_{t=1}^{T}\eta_t:
\[
\frac{1}{\sum_{t=1}^{T}\eta_t}\sum_{t=1}^{T}\eta_t\left[f(x_t) - f^*\right]
\le \frac{D^2}{2\sum_{t=1}^{T}\eta_t} + \frac{1}{2\sum_{t=1}^{T}\eta_t}\sum_{t=1}^{T}\eta_t^2\|g_t\|^2.
\]
Now we apply Corollary 11 to get:
\[
f(x_T) - f^* \le \frac{D^2}{2\sum_{t=1}^{T}\eta_t} + \frac{1}{2\sum_{t=1}^{T}\eta_t}\sum_{t=1}^{T}\eta_t^2\|g_t\|^2
+ \frac{1}{2}\sum_{k=1}^{T-1}\frac{\eta_k}{\sum_{t=k+1}^{T}\eta_t}\left(\frac{1}{\sum_{t=k}^{T}\eta_t}\sum_{t=k}^{T}\eta_t^2\|g_t\|^2\right).
\]
By using a worst-case bound of \|g_t\|^2 \le G^2, we obtain:
Corollary 12 If f is G-Lipschitz, then
\[
f(x_T) - f^* \le \frac{D^2}{2\sum_{t=1}^{T}\eta_t} + \frac{G^2}{2\sum_{t=1}^{T}\eta_t}\sum_{t=1}^{T}\eta_t^2
+ \frac{G^2}{2}\sum_{k=1}^{T-1}\frac{\eta_k}{\sum_{t=k+1}^{T}\eta_t}\left(\frac{1}{\sum_{t=k}^{T}\eta_t}\sum_{t=k}^{T}\eta_t^2\right).
\]
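As an illustration of how the bound of Corollary 12 can be used to compare schedules numerically (our own sketch, taking D = G = 1), the function below evaluates the right-hand side for an arbitrary step-size sequence; linear decay gives a noticeably smaller value than a constant schedule of the same scale:

import numpy as np

def corollary12_bound(eta, D=1.0, G=1.0):
    # Evaluate the last-iterate bound of Corollary 12 for step sizes eta_1..eta_T.
    eta = np.asarray(eta, dtype=float)
    T = len(eta)
    total = eta.sum()
    bound = D**2 / (2 * total) + G**2 * (eta**2).sum() / (2 * total)
    for k in range(1, T):                      # k = 1, ..., T-1 (1-indexed)
        tail_k = eta[k - 1:].sum()             # sum_{t=k}^T eta_t
        tail_k1 = eta[k:].sum()                # sum_{t=k+1}^T eta_t
        bound += G**2 / 2 * eta[k - 1] / tail_k1 * (eta[k - 1:] ** 2).sum() / tail_k
    return bound

T = 1000
t = np.arange(1, T + 1)
base = 1.0 / np.sqrt(T)                        # D/(G*sqrt(T)) with D = G = 1
print(corollary12_bound(base * (1 - t / (T + 1))))   # linear decay: about 2/sqrt(T)
print(corollary12_bound(base * np.ones(T)))          # constant: grows like log(T)/sqrt(T)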
F.1 Linear schedule analysis with Theorem 10
In this section, we use Theorem 10 to re-analyze the linear decay schedule:
\[
\eta_t = \frac{D}{G\sqrt{T}}\left(1 - \frac{t}{T+1}\right). \tag{6}
\]
The resulting convergence rate is asymptotically correct, but does not achieve the optimal constant obtained by Theorem 3.
Theorem 13 Schedule (6) gives the following bound on the last iterate:
\[
f(x_T) - f^* \le \left(2 + \frac{1}{4}\right)\frac{DG}{\sqrt{T}},
\]
or more precisely:
\[
f(x_T) - f^* \le \left(2 + \frac{H(T-1) - 2/3}{T+1}\right)\frac{DG}{\sqrt{T}},
\]
where H(T) is the T-th harmonic number.
Proof We start with the above bound:
\[
f(x_T) - f^* \le \frac{D^2}{2\sum_{t=1}^{T}\eta_t} + \frac{G^2}{2\sum_{t=1}^{T}\eta_t}\sum_{t=1}^{T}\eta_t^2
+ \frac{G^2}{2}\sum_{k=1}^{T-1}\frac{\eta_k}{\sum_{t=k+1}^{T}\eta_t}\left(\frac{1}{\sum_{t=k}^{T}\eta_t}\sum_{t=k}^{T}\eta_t^2\right).
\]
Since the \frac{D}{G\sqrt{T}} part of the step size is constant, it can be pulled outside the summations, so we just need to focus on summations involving \left(1 - \frac{t}{T+1}\right). For the first two terms in the bound, these simplify as
\[
\sum_{t=1}^{T}\eta_t \propto \sum_{t=1}^{T}\left(1 - \frac{t}{T+1}\right) = T - \frac{1}{T+1}\sum_{t=1}^{T} t = T - \frac{\tfrac{1}{2}T(T+1)}{T+1} = \frac{T}{2}.
\]
Similarly,
\[
\sum_{t=1}^{T}\eta_t^2 \propto \sum_{t=1}^{T}\left(1 - \frac{t}{T+1}\right)^2
= \sum_{t=1}^{T}\left(1 - \frac{2t}{T+1} + \frac{t^2}{(T+1)^2}\right)
= T - T + \frac{\sum_{t=1}^{T} t^2}{(T+1)^2}
= \frac{T(T+1)(2T+1)}{6(T+1)^2} = \frac{T(T+\tfrac{1}{2})}{3(T+1)} \le \frac{T}{3}.
\]
So we have the following bound on the last iterate:
\[
f(x_T) - f^* \le \frac{DG}{\sqrt{T}} + \frac{DG}{3\sqrt{T}}
+ \frac{G^2}{2}\sum_{k=1}^{T-1}\frac{\eta_k}{\sum_{t=k+1}^{T}\eta_t}\left(\frac{1}{\sum_{t=k}^{T}\eta_t}\sum_{t=k}^{T}\eta_t^2\right).
\]
To simplify the remaining term we rely on computer algebra software. SymPy gives:
\[
\sum_{k=1}^{T-1}\frac{\eta_k}{\sum_{t=k+1}^{T}\eta_t}\left(\frac{1}{\sum_{t=k}^{T}\eta_t}\sum_{t=k}^{T}\eta_t^2\right)
= \frac{D}{G\sqrt{T}}\cdot\frac{4T + 6H(T-1) - 4}{3(T+1)},
\]
where H(T) = \sum_{t=1}^{T} 1/t is the harmonic number. So:
\[
f(x_T) - f^* \le \frac{DG}{\sqrt{T}} + \frac{DG}{3\sqrt{T}} + \frac{2DG\sqrt{T}}{3(T+1)} + \frac{DG\left(3H(T-1) - 2\right)}{3(T+1)\sqrt{T}}
\le \frac{2DG}{\sqrt{T}} + \frac{DG\left(3H(T-1) - 2\right)}{3(T+1)\sqrt{T}}.
\]
Note that the term \frac{3H(T-1) - 2}{3(T+1)} \le \frac{1}{4} for all T, so:
\[
\frac{DG\left(3H(T-1) - 2\right)}{3(T+1)\sqrt{T}} \le \frac{DG}{4\sqrt{T}}.
\]
Combining with the \frac{2DG}{\sqrt{T}} term, we have:
\[
f(x_T) - f^* \le \left(2 + \frac{1}{4}\right)\frac{DG}{\sqrt{T}}.
\]
Bounding the harmonic number with a logarithm instead gives:
\[
f(x_T) - f^* \le \frac{2DG}{\sqrt{T}} + O\left(\frac{DG\log(T)}{T^{3/2}}\right).
\]
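The closed form quoted above is straightforward to check for any concrete T. The snippet below (our own verification script, not the authors') works with the dimensionless schedule 1 - t/(T+1), since the common D/(G sqrt(T)) factor cancels between the numerator and the two denominators, and confirms the identity in exact rational arithmetic for T = 30:

from sympy import Rational, harmonic, simplify

T = 30
eta = [1 - Rational(t, T + 1) for t in range(1, T + 1)]   # dimensionless linear decay

def tail_sum(seq, k):            # sum_{t=k}^{T} seq_t, with 1-indexed k
    return sum(seq[k - 1:])

lhs = sum(
    eta[k - 1] / tail_sum(eta, k + 1)
    * (sum(e**2 for e in eta[k - 1:]) / tail_sum(eta, k))
    for k in range(1, T)
)
rhs = Rational(4 * T - 4, 3 * (T + 1)) + 2 * harmonic(T - 1) / (T + 1)
assert simplify(lhs - rhs) == 0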
Appendix G. Experimental Setup
Our experiments on CIFAR-10, CIFAR-100, ImageNet and RCNN use SGD, and the remaining problems
use Adam. We used decoupled weight decay with Adam in each case following standard practice for each
problem.
G.1 Convex Experiments
Each dataset was obtained from the LIBSVM repository and used without modification.
Hyper-parameter    Value
GPUs               1×V100
Batch size         16
Epochs             100
Seeds              10
Decay              0.0
Optimizer          Adam
β1                 0.9
β2                 0.95
[Figure 6: Logistic regression schedules, generated using a linear decay schedule with warmup for the initial run. One row per dataset (Sensorless, aloi.scale, glass, iris, letter, pendigits, vehicle, vowel); columns show the raw gradient norms, the smoothed gradient norms, and the resulting refined schedule over the course of training.]
G.2 CIFAR-10
We used custom training code based on the PyTorch tutorial code for this problem. Following standard data-augmentation practices, we applied random horizontal flips and random offset cropping down to 32x32, using reflection padding of 4 pixels. Input pixel data was normalized by centering around 0.5.

Hyper-parameter       Value
Architecture          Wide ResNet 16-8
Epochs                300
GPUs                  1×V100
Batch size per GPU    128
Seeds                 10
Decay                 0.0001
Momentum              0.9
G.3 CIFAR-100
We used the same codebase as for our CIFAR-10 experiments, with the same data augmentation.
We normalized each input image using fixed mean and standard error values derived from pre-processing
the data.
Hyper-parameter       Value
Architecture          DenseNet [6,12,24,16], growth rate 12
Epochs                300
GPUs                  1×V100
Batch size per GPU    64
Seeds                 10
Decay                 0.0002
Momentum              0.9
G.4 ImageNet
We used the same code-base as for our CIFAR-10 experiments, and applied the same preprocessing proce-
dure. The data-augmentations consisted of PyTorch’s RandomResizedCrop, cropping to 224x224 followed
by random horizontal flips. Test images used a fixed resize to 256x256 followed by a center crop to 224x224.
Hyper-parameter       Value
Architecture          ResNet50
Epochs                100
GPUs                  8×V100
Batch size per GPU    32
Seeds                 5
Decay                 0.0001
Momentum              0.9
G.5 IWSLT14
We used the FairSeq framework¹ for our experiments here, as well as for our GPT and RoBERTa experiments. Rather than a vanilla LSTM, we use the variant from (Wiseman and Rush, 2016) provided in the FairSeq codebase.

Hyper-parameter                      Value
Architecture                         lstm_wiseman_iwslt_de_en
Max Epoch                            55
GPUs                                 1×V100
Tokens per batch                     4096
Warmup steps                         4000
Dropout                              0.3
Label smoothing                      0.1
Share decoder, input, output embed   True
Float16                              True
Update Frequency                     1
Seeds                                10
Decay                                0.05
β1, β2                               0.9, 0.98

1. https://github.com/facebookresearch/fairseq
G.6 RoBERTa
The RoBERTa implementation in FairSeq is the canonical one. We differ from the paper’s results by training
for a shorter duration, which is necessary to keep our experiments computationally tractable. Our BookWiki
dataset matches the original paper.
Hyper-parameter                      Value
Architecture                         roberta_base
Task                                 masked_lm
Max updates                          23,000
GPUs                                 8×V100
Max tokens per sample                512
Dropout                              0.1
Attention Dropout                    0.1
Max sentences                        16
Warmup                               10,000
Sample Break Mode                    Complete
Float16                              True
Update Frequency                     16
Seeds                                5
Decay                                0.0
β1, β2                               0.9, 0.98
G.7 GPT
Since the training dataset for GPT models is not available, we use the BookWiki dataset as used for RoBERTa training. Our model here is small, using 12 decoder layers and a decoder embedding dimension of 768, giving 162 million parameters.

Hyper-parameter                      Value
Architecture                         transformer_lm_gpt
Task                                 language_modeling
Max updates                          65,000
GPUs                                 8×V100
Tokens per sample                    512
Dropout                              0.1
Attention Dropout                    0.1
Max sentences                        1
Warmup                               10,000
Sample Break Mode                    Complete
Share decoder, input, output embed   True
Float16                              True
Update Frequency                     16
Seeds                                5
Decay                                0.005
β1, β2                               0.9, 0.98
G.8 ViT
Our implementation uses the PyTorch Image Models library², with hyper-parameters following examples given in the repository.

Hyper-parameter    Value
Model              vit_tiny_patch16_224
GPUs               8×V100
Epochs             300
Batch Size         512
Warmup epochs      5
Hflip              0.5
aa                 rand-m6-mstd0.5
mixup              0.1
cutmix             1.0
Crop Pct           0.9
BCE Loss           True
Seeds              5
Decay              0.1
β1, β2             0.9, 0.999

2. https://github.com/rwightman/pytorch-image-models
G.9 DLRM
We used a custom implementation of the DLRM model based on the publicly available code. Our optimizer uses dense gradients for implementation simplicity; although sparse gradients with AdaGrad are a more common baseline on this problem, we consider AdaGrad variants of our scheduling approach as future work.

Hyper-parameter    Value
Iterations         300,000
Batch Size         128
Emb Dimension      16
GPUs               8×V100
Seeds              5
Decay              0.0
β1, β2             0.9, 0.999
G.10 MRI
We used the version of the fastMRI code base at https://github.com/facebookresearch/fastMRI/tree/main/banding_removal. Note that we found that training failed using PyTorch 2 or newer, so we ran these experiments using PyTorch 1.9.

Hyper-parameter        Value
Architecture           12 layer VarNet 2.0
Epochs                 50
GPUs                   8×V100
Batch size per GPU     1
Acceleration factor    4
Low frequency lines    16
Mask type              Offset-1
Seeds                  5
Decay                  0.0
β1, β2                 0.9, 0.999
G.11 RCNN
Our RCNN experiments use Detectron2³, and we use a pretrained ResNet backbone⁴.

Hyper-parameter    Value
Backbone           ResNet-50
Max Iter           200,000
IMS Per Batch      16
GPUs               8×V100
Momentum           0.9
Decay              1.5e-4

3. https://github.com/facebookresearch/detectron2
4. detectron2://ImageNetPretrained/MSRA/R-50.pkl