2111.09832

Merging Models with Fisher-Weighted Averaging
Michael Matena Colin Raffel
Department of Computer Science
University of North Carolina at Chapel Hill
{mmatena,craffel}@cs.unc.edu
Abstract
Averaging the parameters of models that have the same architecture and
initialization can provide a means of combining their respective capabilities.
In this paper, we take the perspective that this “merging” operation can be seen
as choosing parameters that approximately maximize the joint likelihood of the
posteriors of the models' parameters. Computing a simple average of the models'
parameters therefore corresponds to making an isotropic Gaussian approximation
to their posteriors. We develop an alternative merging procedure based on the
Laplace approximation where we approximate each model's posterior as a Gaussian
distribution whose precision matrix corresponds to its Fisher information. We
first show that our “Fisher merging” technique provides a performance boost in
settings where simple parameter averaging is currently used, specifically robust
fine-tuning and model ensembling. Then, we compare merging to standard
gradient-based transfer learning and demonstrate that merging enables a
fundamentally different method for transferring capabilities across models.
Specifically, we show that Fisher merging is competitive with gradient-based
transfer learning approaches (while being significantly cheaper) in
intermediate-task training and domain-adaptive pre-training. We also show that
our merging procedure makes it possible to combine models in previously
unexplored ways. We release our code to facilitate future research into methods
for merging models.
1
1 Introduction
How should we transfer knowledge and capabilities across trained models? One popular approach
is transfer learning [44], which fine-tunes a pre-trained model on a target task through additional
gradient-based training. The preparatory step of pre-training the model on a data-rich task ideally
instills useful “knowledge” into the network's parameters, which allows the model to learn more
rapidly and effectively when fine-tuned on a downstream task of interest. Transfer learning has
therefore become a particularly important and omnipresent tool across many fields, including natural
language processing [57,13,9,52,53,46] and computer vision [43,24,68]. Recently, it has been
shown that training on an “intermediate” task between pre-training and fine-tuning can further boost
performance through additional transfer of capabilities from the intermediate task [47,60,51,48].
Alternatively, continued self-supervised training on unlabeled domain-specialized data can serve as a
form of domain adaptation [19].
All of the aforementioned transfer learning methods transfer knowledge by using a trained network
to initialize another network followed by iterative gradient descent. While demonstrably powerful,
several drawbacks arise from this: First, improvements to ancestor models cannot be passed down to
descendants; instead, we must restart the whole process from the improved ancestor model, throwing
away our previous work. For example, if we fine-tune a pre-trained model on a downstream task,
but then the pre-trained model is improved through additional training, we must re-fine-tune the
new model on our downstream task if we want to confer benefits from this additional pre-training.
Furthermore, if we gain access to a checkpoint that has been fine-tuned on a useful intermediate task,
we must again throw away our previous work and fine-tune from the intermediate task checkpoint.
Existing methods for transfer learning also have the disadvantage of only being able to transfer
information from a single model. While it may be possible to train on multiple intermediate tasks
sequentially, one quickly either runs into a combinatorial explosion of saved checkpoints or faces
the issue of “catastrophic forgetting” in continual learning [28]. In addition to slowing down
experimentation by preventing reuse of work, these drawbacks impose limitations on the types of
transfer that can occur.
Figure 1: Merging patterns considered in this work. Left: Merging many fine-tuned models as a form
of ensembling. Center, top: “Robust fine-tuning” [66], where a fine-tuned model is merged with the
pre-trained model to improve performance on the original pre-training task. Center, bottom: Merging
a fine-tuned model with a “donor” task, analogous to intermediate-task transfer learning [47,51].
Right: Merging an intermediate-task trained model with a donor model. (Diagram node labels:
Pre-trained, Intermediate, Fine-tuned, Donor.)
1 https://github.com/mmatena/model_merging
Preprint. Under review.
A less common way of transferring capabilities across models is to simply average their parameters.
This procedure, which we call “merging”, is generally only feasible when the models being averaged
share a common architecture and initialization. Merging is the core component of the FedAvg
algorithm used in Federated Learning [39], where updates to a shared model computed by individual
workers that are training on different datasets are combined by simply averaging the updates.
Recently, Wortsman et al. [66] demonstrated that merging can be used to improve robustness to
domain shift in fine-tuned models by averaging the parameters of the original pre-trained model
with the fine-tuned parameters. Merging is also a common way of performing ensembling [49,67],
where the parameters of individual models trained on the same dataset are averaged to create a single
performant model.
In this work, we view model merging as approximately maximizing the joint likelihood of the models'
posterior distribution over parameters. Since gradient-based maximum likelihood training only
provides a point estimate of the posterior, some approximation of the posterior distribution is required.
When an isotropic Gaussian distribution is used to approximate the posterior (with identity precision
matrix and mean set to the model's parameter values), we show that maximizing the joint likelihood
across models is equivalent to simply averaging their parameters. We therefore refer to merging
models by averaging parameters as isotropic merging. The view of merging as maximizing the joint
likelihood of model posteriors suggests that using a better estimate of the posterior distribution may
yield improved merging results. This leads us to introduce Fisher merging, which leverages the
Laplace approximation by using the diagonal of each model's Fisher information as the precision
matrix for that model's posterior.
Empirically, we demonstrate that merging models with Fisher merging outperforms isotropic merging
in a variety of settings. We first focus on the existing applications of model ensembling [49,67] and
improving fine-tuned model robustness [66]. Then, we demonstrate for the first time that merging is
a viable alternative to traditional gradient-based transfer learning. Specifically, we compare merging
to intermediate-task transfer learning [47,51] and domain-adaptive pre-training [19], finding that
merging can achieve comparable performance at significantly lower cost. Additionally, we show
that merging can provide an additional boost to models created via traditional intermediate-task
training. This provides a concrete example of transfer that is fast and easy with merging but onerous
or impossible to do with existing methods. Diagrams of the merging patterns we consider in this
work are shown in fig. 1.
The rest of our paper is structured as follows: In section 2, we provide necessary background and
detail our Fisher merging procedure. Section 3 presents experiments on model ensembling,
robust fine-tuning, intermediate-task training, and domain adaptation. We explore related works in
section 4.
2 Weighted Parameter Averaging for Model Merging
Our focus is on procedures for model merging, i.e. averaging the parameters of models that share
an architecture and initialization. In this section, we first frame the common practice of averaging
together model parameters as approximately maximizing the joint likelihood of model posteriors.
Specifically, we show that parameter averaging corresponds to using an isotropic Gaussian as the
approximate posterior for each model. We then introduce Fisher merging, which uses the model's
diagonal Fisher information matrix as the precision matrix of the Gaussian approximate posterior.
Fisher merging can be implemented by setting each merged parameter value to a weighted average of
the corresponding parameter values from the original models, with the weighting for each parameter
determined by its Fisher information. In addition, we add model-level weightings as additional
hyperparameters to set the relative importance of each model.
2.1 Isotropic merging
Consider the problem setting where we have $M$ trained neural networks with parameters
$\theta_1, \ldots, \theta_M$ and our goal is to create a single neural network with parameters
$\theta$ that, loosely speaking, inherits the capabilities of the $M$ trained neural networks.
Assume that all of these neural networks share a common architecture and had the same set of
initial parameter values before being trained. Merging attacks this problem by finding the
parameters $\theta$ that maximize the joint likelihood of the posterior distributions of the $M$
models. Unfortunately, typical neural network training procedures do not provide access to a
posterior distribution, which necessitates approximation. If the posterior of each model is
approximated via an isotropic Gaussian with mean set to the model's parameters, the optimization
problem can be written as
$$\theta^* = \operatorname*{argmax}_{\theta} \sum_i \log p(\theta \mid \theta_i, I),$$
where $p(\theta \mid \theta_i, I)$ is the probability distribution of the aforementioned approximate
isotropic Gaussian posterior distribution used for model $i$ and $I$ is the identity matrix. This
optimization problem has a closed-form solution given by
$$\theta^* = \frac{1}{M} \sum_i \theta_i,$$
i.e. an average of the model parameters. Such an averaging procedure has been used in past work
aiming to combine model capabilities, e.g. in federated learning [39], model ensembling [49,67],
and robust fine-tuning [66].
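As a brief worked step (our own filling-in of the standard argument, not text from the paper), the closed form follows because the log-density of an identity-covariance Gaussian is a negative squared distance up to a constant, so the objective is a sum of quadratics whose gradient vanishes at the mean:

```latex
\begin{aligned}
\log p(\theta \mid \theta_i, I) &= -\tfrac{1}{2}\,\lVert \theta - \theta_i \rVert_2^2 + \text{const}, \\
\nabla_\theta \sum_{i=1}^{M} \log p(\theta \mid \theta_i, I) &= -\sum_{i=1}^{M} (\theta - \theta_i) = 0
\;\;\Longrightarrow\;\; \theta^* = \frac{1}{M} \sum_{i=1}^{M} \theta_i .
\end{aligned}
```

The objective is strictly concave in $\theta$, so this stationary point is the unique maximizer.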
2.2 Per-model weights
In this work, we additionally introduce model-specific scalar hyperparameters
$\lambda_i$, $i \in \{1, \ldots, M\}$, into the model merging framework described above.
Specifically, we change the optimization problem to
$$\theta^* = \operatorname*{argmax}_{\theta} \sum_i \lambda_i \log p(\theta \mid \theta_i, I),$$
where $\lambda_i \geq 0$ and $\sum_i \lambda_i = 1$. In the case of isotropic merging, this
changes the solution to
$$\theta^* = \sum_i \lambda_i \theta_i.$$
These hyperparameters provide control over the importance
assigned to each of the models that are being merged. For example, when using merging to perform
ensembling we might expect each model to be equally important and therefore set $\lambda_i = 1/M$
for all $i$. On the other hand, when mimicking the setup of intermediate-task training where the
capabilities of a “donor” model are used to improve performance of a recipient model, we might
weigh the recipient model more highly. Wortsman et al. [66] introduce a similar hyperparameter when
averaging the parameters of two models and report results for varying values of this hyperparameter.
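The weighted average above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming each model's parameters have been flattened into a single array; the function name `weighted_average_merge` and the toy vectors are ours, not taken from the released code.

```python
import numpy as np

def weighted_average_merge(params, weights=None):
    """Merge same-shaped parameter arrays by a convex combination.

    params:  list of M arrays (theta_1 .. theta_M).
    weights: M non-negative floats summing to 1 (lambda_i); defaults to
             1/M for every model, i.e. plain isotropic merging.
    """
    stacked = np.stack([np.asarray(p, dtype=float) for p in params])
    m = stacked.shape[0]
    if weights is None:
        weights = np.full(m, 1.0 / m)
    weights = np.asarray(weights, dtype=float)
    if weights.shape != (m,) or np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    # theta* = sum_i lambda_i * theta_i
    return np.tensordot(weights, stacked, axes=1)

# Toy example: a recipient model and a donor model (illustrative values).
theta_recipient = np.array([1.0, 2.0, 3.0])
theta_donor = np.array([3.0, 2.0, -1.0])
uniform = weighted_average_merge([theta_recipient, theta_donor])
recipient_heavy = weighted_average_merge([theta_recipient, theta_donor], [0.75, 0.25])
```

With equal weights the result is the ordinary mean; raising the recipient's weight pulls every coordinate toward the recipient model.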
2.3 Laplace Approximation
Framing merging as approximate maximization of the joint posterior likelihood reveals that simple
parameter averaging is implicitly using an isotropic Gaussian posterior approximation. Such an
approximation may be overly simplistic and lead to degraded performance. To explore improved
merging procedures, we consider improved methods for creating an approximate posterior from a
point estimate. Specifically, we use the Laplace approximation to the posterior, which corresponds to
a second-order Taylor expansion of the log density around a mode [36,10]. This leads to a Gaussian
approximation $\mathcal{N}(\theta, H^{-1})$ of the posterior, where $H$ is the Hessian matrix and
$\theta$ are the model's trained parameter values. More precisely, we assume that the parameter
values $\theta$ of a trained neural network are a local maximum of the posterior. It can then be
shown that the precision matrix of the Laplace approximation is given by the Fisher information
matrix of the network at $\theta$.
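The Fisher information matrix $F_\theta$ [16,3] of a neural network $p_\theta(y \mid x)$ trained to
predict an output $y$ from input data $x$ is a $|\theta| \times |\theta|$ positive semidefinite
matrix given by the formula
$$F_\theta = \mathbb{E}_x \left[ \mathbb{E}_{y \sim p_\theta(y \mid x)} \nabla_\theta \log p_\theta(y \mid x) \, \nabla_\theta \log p_\theta(y \mid x)^T \right]. \tag{1}$$
It can be shown that the Fisher information matrix coincides with the Hessian $H$ at modes of the
distribution [45], explaining its use in the Laplace approximation. The Fisher information matrix
$F_\theta$ can also be used to relate changes in the model parameters to changes in the model output
by noting that
$\mathbb{E}_x \left[ D_{KL}\big(p_\theta(y \mid x) \,\|\, p_{\theta+\delta}(y \mid x)\big) \right] \approx \frac{1}{2} \delta^T F_\theta \delta$
as $\delta \to 0$, where $D_{KL}$ denotes the KL-divergence [45].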
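The second-order relation between the KL-divergence and the Fisher can be checked numerically on a very small model. The sketch below is an illustration under simplifying assumptions that are ours: the "model" is a single categorical distribution parameterised directly by its logits (no input $x$), for which the exact Fisher is $\mathrm{diag}(p) - pp^T$.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)          # subtract the max for stability
    e = np.exp(z)
    return e / e.sum()

def categorical_fisher(logits):
    """Exact Fisher of a categorical distribution w.r.t. its logits."""
    p = softmax(logits)
    return np.diag(p) - np.outer(p, p)

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

theta = np.array([0.5, -1.0, 2.0])               # illustrative logits
delta = 1e-3 * np.array([1.0, -2.0, 0.5])        # small perturbation

exact_kl = kl(softmax(theta), softmax(theta + delta))
quadratic = 0.5 * float(delta @ categorical_fisher(theta) @ delta)
```

For a perturbation this small the two quantities agree closely, and the gap grows as `delta` gets larger, which is consistent with the approximation holding only locally.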
As the full Fisher matrix takes $O(|\theta|^2)$ memory to store, it quickly becomes impractical for
all but the smallest models. We are thus forced to use an approximation to the full Fisher in
practice. In this paper, we follow the common practice of using the diagonal of the Fisher matrix
[28]. While other methods (e.g. [1]) exist for estimating the Fisher, we leave their exploration for
future work. In our experiments, we estimated the diagonal of the Fisher matrix via
$$\hat{F}_\theta = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{y \sim p_\theta(y \mid x_i)} \left( \nabla_\theta \log p_\theta(y \mid x_i) \right)^2, \tag{2}$$
where $x_1, \ldots, x_N$ are drawn i.i.d. from the dataset that was used to train the model and the
square is taken elementwise. The expectation over $y$ can be estimated via sampling from
$p_\theta(y \mid x_i)$ or computed exactly when the number of classes is small. We note that
computing the Fisher requires $N$ per-example gradients, which can be straightforwardly computed for
neural networks using backpropagation. Computing the diagonal Fisher therefore has roughly the same
computational cost as training on $N$ examples.
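To make eq. (2) concrete, here is a minimal sketch for a linear softmax classifier, where per-example gradients are available in closed form and the expectation over $y$ is computed exactly. The model, the function names, and the random toy data are assumptions made for illustration; a real network would obtain the per-example gradients by backpropagation instead.

```python
import numpy as np

def softmax_rows(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def diagonal_fisher(weights, inputs):
    """Estimate eq. (2) for a classifier with logits = inputs @ weights.

    weights: (D, C) parameter matrix; inputs: (N, D) examples.
    Returns a (D, C) array holding one Fisher value per parameter.
    """
    probs = softmax_rows(inputs @ weights)            # (N, C)
    num_classes = weights.shape[1]
    fisher = np.zeros_like(weights, dtype=float)
    for x, p in zip(inputs, probs):
        for y in range(num_classes):
            # d log p(y|x) / d logits = onehot(y) - p, so the gradient
            # with respect to the weights is outer(x, onehot(y) - p).
            grad = np.outer(x, np.eye(num_classes)[y] - p)
            fisher += p[y] * grad ** 2                # exact expectation over y
    return fisher / len(inputs)

rng = np.random.default_rng(0)
toy_weights = rng.normal(size=(4, 3))
toy_inputs = rng.normal(size=(16, 4))
toy_fisher = diagonal_fisher(toy_weights, toy_inputs)
```

The double loop mirrors the formula term by term rather than aiming for speed; when the label space is large, the inner loop would be replaced by sampling $y$ from the model's own predictions.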
2.4 Fisher Merging
Having noted that the Laplace approximation provides a tractable way to obtain a better approximation
to the posterior, we now use it to create an improved merging procedure that we call Fisher merging.
Letting $F_1, \ldots, F_M$ correspond to the diagonal approximate Fisher matrices, we construct
$p(\theta \mid \theta_i, F_i)$ as a Gaussian-distributed posterior over the parameters of the merged
model with mean $\theta_i$ and precision $F_i$. To obtain the merged model, we find a single set of
parameters that is given a high probability under all posteriors. Formally, we have
$$\theta^* = \operatorname*{argmax}_{\theta} \sum_{i=1}^{M} \lambda_i \log p(\theta \mid \theta_i, F_i), \tag{3}$$
which has the closed-form solution
$$\theta^{*(j)} = \frac{\sum_{i=1}^{M} \lambda_i F_i^{(j)} \theta_i^{(j)}}{\sum_{i=1}^{M} \lambda_i F_i^{(j)}}, \tag{4}$$
where $j = 1, \ldots, |\theta|$. Intuitively, we can think of Fisher merging as computing a weighted
average of the parameter values in each model where the weighting is done according to each
parameter's Fisher information. Since the Fisher information is a local property of a single
parameter value, Fisher merging might be less performant when applied to models whose parameters are
far apart in parameter space. We therefore limit our focus to models that were trained from the same
initialization.
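The closed form in eq. (4) is an elementwise ratio and can be sketched directly. This is an illustrative NumPy version under the assumption that parameters and diagonal Fishers are stored as same-shaped arrays; `fisher_merge` is a name chosen here and does not refer to the released implementation.

```python
import numpy as np

def fisher_merge(params, fishers, weights):
    """Per-parameter Fisher-weighted average, following eq. (4).

    params, fishers: lists of M same-shaped arrays (theta_i and diagonal F_i).
    weights: M non-negative model-level coefficients (lambda_i).
    Assumes the denominator is non-zero for every parameter.
    """
    theta = np.stack([np.asarray(p, dtype=float) for p in params])    # (M, ...)
    fisher = np.stack([np.asarray(f, dtype=float) for f in fishers])  # (M, ...)
    lam = np.asarray(weights, dtype=float).reshape((-1,) + (1,) * (theta.ndim - 1))
    numerator = (lam * fisher * theta).sum(axis=0)
    denominator = (lam * fisher).sum(axis=0)
    return numerator / denominator

theta_a, theta_b = np.array([1.0, 0.0]), np.array([3.0, 4.0])
fisher_a, fisher_b = np.array([3.0, 1.0]), np.array([1.0, 1.0])
merged = fisher_merge([theta_a, theta_b], [fisher_a, fisher_b], [0.5, 0.5])
same_fisher = fisher_merge([theta_a, theta_b], [np.ones(2), np.ones(2)], [0.5, 0.5])
```

In the toy example the first coordinate is pulled toward model A because its Fisher value there is three times larger, while equal Fishers recover the plain weighted average of section 2.2.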
Numerical Issues. Note that (4) can run into numerical issues when the Fisher is close to zero across
all models for a given parameter. In practice, we choose a privileged “target model” in all of our
experiments and “default” to the parameter's value in the target model in these cases. An alternative
would be to take an average weighted only by the merging coefficients (i.e., pretend the Fisher is the
same across all models). In practice, the choice of a “default” value for these parameters had little
impact on performance (likely because a small Fisher value implies that changing the parameter has a
minute effect on the model's outputs and is therefore relatively unimportant to the model's behavior).
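The "default to the target model" rule can be sketched as a guarded version of eq. (4). The threshold `eps` and the argument name `target_index` are hypothetical choices made for this illustration; the text does not state what cutoff was used.

```python
import numpy as np

def fisher_merge_with_fallback(params, fishers, weights, target_index=0, eps=1e-12):
    """Eq. (4), falling back to the target model where the denominator is ~0.

    `eps` and `target_index` are illustrative, not values from the paper.
    """
    theta = np.stack([np.asarray(p, dtype=float) for p in params])
    fisher = np.stack([np.asarray(f, dtype=float) for f in fishers])
    lam = np.asarray(weights, dtype=float).reshape((-1,) + (1,) * (theta.ndim - 1))
    numerator = (lam * fisher * theta).sum(axis=0)
    denominator = (lam * fisher).sum(axis=0)
    degenerate = denominator < eps
    # Avoid dividing by ~0: use a dummy denominator there, then overwrite.
    safe = np.where(degenerate, 1.0, denominator)
    return np.where(degenerate, theta[target_index], numerator / safe)

toy_params = [np.array([1.0, 5.0]), np.array([3.0, 7.0])]
toy_fishers = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]   # second entry: zero Fisher everywhere
merged_safe = fisher_merge_with_fallback(toy_params, toy_fishers, [0.5, 0.5])
merged_other_target = fisher_merge_with_fallback(toy_params, toy_fishers, [0.5, 0.5], target_index=1)
```

The alternative mentioned in the text, averaging by the merging coefficients alone, would replace `theta[target_index]` with `(lam * theta).sum(axis=0)`.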

Unmergeable Parameters. In many cases, we have some parameters from each model that do
not appear in all of the models we are merging. For example, this includes having task-specific
classification heads on top of a common body architecture. We handle this by only applying the
merging procedure (3) to the shared body parameters and keeping the task-specific heads unchanged.
Although this may lead to a distribution shift in the classification head inputs, we found it to work
well in practice for the datasets and tasks we consider.
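One way to picture the handling of unmergeable parameters is with name-keyed dictionaries: entries present in both models are merged, and everything else is copied from the target. The sketch below is an assumption-laden illustration; the parameter names are invented, and a plain weighted average stands in for the Fisher-weighted rule to keep the example short.

```python
import numpy as np

def merge_shared_parameters(target, donor, target_weight=0.5):
    """Average parameters present in both models; keep the rest from `target`.

    target, donor: dicts mapping (hypothetical) parameter names to arrays.
    A Fisher-weighted rule could be substituted for the shared entries.
    """
    merged = {}
    for name, value in target.items():
        if name in donor and donor[name].shape == value.shape:
            merged[name] = target_weight * value + (1.0 - target_weight) * donor[name]
        else:
            merged[name] = value.copy()   # e.g. a task-specific head
    return merged

rte_model = {"body/dense": np.array([1.0, 1.0]), "rte_head": np.array([0.2, -0.2])}
mnli_model = {"body/dense": np.array([3.0, -1.0]), "mnli_head": np.array([0.1, 0.0, 0.3])}
merged_model = merge_shared_parameters(rte_model, mnli_model)
```

The donor's own head is simply dropped, since the merged model is only evaluated on the target task.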
3 Experiments
Our first experimental goal is to validate that our use of an improved estimate of the posterior yields
improved merging performance. To test this hypothesis, we apply Fisher merging to two settings
where isotropic merging has already proven successful: Model ensembling [49,67] and robust
fine-tuning [66]. Then, we demonstrate that Fisher merging provides a cheap and effective alternative
to traditional transfer learning pipelines by validating its performance in intermediate-task transfer
learning [47,51] and domain-adaptive pre-training [19]. Finally, we demonstrate that merging opens
up new paths of transferring capabilities across models by demonstrating a boost in performance
when merging an intermediate task-trained model with different donor models.
3.1 Ensembling
An existing application of isotropic merging is for ensembling, i.e. combining models trained on the
same dataset to obtain better predictions. Ensembling is most commonly performed by averaging the
predictions of the individual models. This form of ensembling requires computing the output of all
$M$ models in the ensemble, thereby increasing the computational cost by a factor of $M$ compared
to computing the output for a single model. A cheaper alternative is to average the parameters of
the models themselves. This approach is diagrammed in fig. 1, left. Such an approach is used in the
classical method of Polyak averaging [49], where parameter values from the final $M$ iterations of
training are averaged. More recently, Wortsman et al. [67] introduced the “Model Soup” approach
where fine-tuned models with different hyperparameter settings are averaged to improve performance.
To the best of our knowledge, all parameter-averaging ensemble methods have used isotropic merging,
i.e. an unweighted average.
To test whether Fisher merging provides a boost over isotropic merging when averaging parameters
for ensembling, we consider ensembling fine-tuned checkpoints derived from the same pre-trained
model. Specifically, we consider the BERT-Base model [13] fine-tuned on the RTE [8], MRPC [14],
and SST-2 [59] datasets. For each dataset, we use five fine-tuned checkpoints downloaded from the
Hugging Face model hub (footnote 2). These checkpoints were fine-tuned with a variety of
hyperparameter settings that were not chosen by us, so our experimental setting most closely matches
the “Model Soup” approach [67]. A list of the checkpoints used is available in the appendix. Since
we do not anticipate that any member of the ensemble should be given a larger weight, we set
$\lambda_i = 1/5$ for all models.
Our results are shown in fig. 2. We report validation set scores for Fisher merging, isotropic
merging, and prediction ensembling (specifically, averaging the output probabilities of all models).
Fisher merging significantly outperforms isotropic merging in all cases and attains comparable
performance to prediction ensembling. Notably, performing inference after merging is $M\times$ cheaper
than prediction ensembling, suggesting that merging can provide a cheaper alternative to standard
ensembling procedures.
3.2 Robust Fine-Tuning
Recently, Wortsman et al. [66] found that while fine-tuning a pre-trained vision model tends to
improve performance on the downstream task, it also tends to decrease accuracy on the original
pre-training task. They therefore propose a “robust fine-tuning” procedure called WiSE-FT that computes
a weighted average of the original pre-trained parameters and the fine-tuned parameters. Different
weighting values produce different trade-offs between pre-training and fine-tuning task performance.
In some cases, robust fine-tuning can even improve performance on the original pre-training task
without sacrificing performance on the downstream fine-tuning task relative to traditional fine-tuning.
2 https://huggingface.co/models
Figure 2: Validation set accuracy for ensembles of five fine-tuned BERT models using different
ensembling methods on the RTE, MRPC, and SST-2 datasets. Fisher merging produces a single model
that performs comparably to output ensembling while being $5\times$ cheaper. (Bar chart; y-axis:
accuracy; legend: isotropic merging, Fisher merging, output ensembling.)
Figure 3: IID (ImageNet) and average out-of-domain (OOD) accuracy across five OOD datasets when
using the WiSE-FT procedure [66] with either Fisher or isotropic merging. Dark to light color
indicates increasing $\lambda_1$ from 0 to 1. (x-axis: ImageNet accuracy; y-axis: average OOD
accuracy; legend: isotropic merging, Fisher merging.)
This procedure implicitly uses isotropic merging and therefore provides another natural testbed
for determining whether Fisher merging provides a boost in performance. A schematic of robust
fine-tuning is shown in fig. 1, center top.
We use the codebase and experimental setup of Wortsman et al. [66] exactly, simply replacing
isotropic merging with Fisher merging. For full details of this setup, we refer to Wortsman et al. [66].
As a short summary, we apply WiSE-FT to the ImageNet [11,58] pre-trained ViT-B/16 model [15]
on five out-of-domain (OOD) datasets: ImageNet-A [21], ImageNet-R [20], ImageNet Sketch [62],
ImageNet V2 [56], and ObjectNet [4]. Following Wortsman et al. [66], we measure IID (ImageNet)
and OOD performance when averaging together the original pre-trained model parameters and
parameters from models fine-tuned on each of the OOD datasets, varying $\lambda_1$ (the averaging
weight for the pre-trained model, which plays the role of the mixing coefficient of Wortsman et al.
[66]) from 0 to 1 in 0.1-step increments (with $\lambda_2 = 1 - \lambda_1$ correspondingly
decreasing from 1 to 0). To determine whether Fisher merging confers
a boost in performance, we compare parameter averaging using either isotropic or Fisher merging.
We plot the IID (ImageNet) accuracy against the average accuracy on the five OOD datasets for
varying values of $\lambda_1$ in fig. 3, with plots for individual OOD datasets in a separate figure.
Fisher merging produces a significantly better trade-off between IID and OOD accuracy. In particular,
Fisher merging seems to generally improve IID accuracy compared to isotropic merging. For example,
for the value of $\lambda_1$ producing the best average OOD accuracy, Fisher merging produces about
1% higher IID accuracy than isotropic merging.
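The sweep over $\lambda_1$ can be pictured with a short loop. This sketch shows only the isotropic case on toy vectors, under our own naming; the Fisher variant would replace the interpolation line with the Fisher-weighted rule of eq. (4), and a real run would evaluate each merged model on the IID and OOD sets.

```python
import numpy as np

def interpolation_sweep(theta_pretrained, theta_finetuned, num_steps=11):
    """Yield (lambda_1, merged parameters) for lambda_1 = 0.0, 0.1, ..., 1.0."""
    for lam1 in np.linspace(0.0, 1.0, num_steps):
        lam2 = 1.0 - lam1                     # weight on the fine-tuned model
        yield float(lam1), lam1 * theta_pretrained + lam2 * theta_finetuned

theta_pre = np.array([0.0, 2.0])              # illustrative parameter vectors
theta_ft = np.array([1.0, 0.0])
sweep = list(interpolation_sweep(theta_pre, theta_ft))
```

The endpoints recover the two original models: $\lambda_1 = 0$ gives the fine-tuned parameters and $\lambda_1 = 1$ gives the pre-trained ones.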
3.3 Intermediate-task training
Having established that Fisher merging produces better results than isotropic merging in settings
where merging has been attempted before, we now explore the use of merging as an alternative to
a gradient-based transfer learning procedure. Specifically, we explore intermediate-task training
[47,51], where a model is fine-tuned on an intermediate “donor” task before being trained on the
target task of interest. To the best of our knowledge, no prior work has considered parameter averaging
as a way of performing intermediate-task transfer learning. Intermediate-task training has mainly
been considered in the NLP domain; as such, we limit our experiments to the
BERT [13] and RoBERTa [33] pre-trained language models. To enable comparison to past work, we
mostly explored merging pairs of models but we are interested in exploring merging more than two
models in future work. As in section 3.1, we made use of fine-tuned BERT and RoBERTa checkpoints
from the Hugging Face repository [65].
Figure 4: Validation set accuracy on RTE when performing intermediate-task training with
datasets from GLUE as the donor task. Dashed line denotes RTE accuracy without intermediate-task
training. (x-axis: donor task, one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI; y-axis: accuracy;
legend: isotropic merging, Fisher merging, standard training.)
Figure 5: Validation accuracy on RTE after first fine-tuning on MNLI, then fine-tuning on RTE,
and finally Fisher merging with various donor task models. Dashed line denotes RTE accuracy
after MNLI intermediate-task training. (x-axis: donor task, one of SST-2, MRPC, STS-B, QQP, MNLI,
QNLI; y-axis: accuracy.)
Following previous work [47,51], we first ran experiments using BERT-base on the GLUE benchmark
[61]. The GLUE benchmark consists of the sentence acceptability task CoLA [64], the sentiment
detection task SST-2 [59], the paraphrase detection tasks MRPC and QQP [14,23], the sentence
similarity task STS-B [7], and the natural language inference (NLI) tasks MNLI, QNLI, RTE, and
WNLI [6,54,8,31]. All of the GLUE tasks are classification tasks except for STS-B, which is a
regression task with a score ranging from 0 to 5. To simplify computation of the Fisher, we turn
STS-B into a classification task by partitioning the continuous label into 25 equally-sized buckets
[53]. Following common practice, we do not run experiments on WNLI due to the tricks required to
get a good score [12,29]. See Wang et al. [61] for more details on these tasks and their associated
metrics. We detail how we obtained fine-tuned checkpoints on these tasks in the appendix. We
computed a diagonal Fisher approximation for each checkpoint using up to 4096 examples from the
corresponding train set. Since it is not clear a priori what weighting coefficients $\lambda_i$ to
use in this setting, we chose $\lambda_i$ by a grid search with 50 points, using the score on the
first 2048 validation examples as the selection metric. We compare Fisher merging to isotropic
merging as well as a standard gradient-based intermediate-task fine-tuning baseline [47]. A diagram
of intermediate-task merging is shown in fig. 1, center bottom.
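The grid search over the merging coefficient can be sketched as follows for the two-model case. The text specifies 50 grid points but not their spacing, so the evenly spaced grid on $[0, 1]$ is an assumption, as are the callback names `merge_fn` and `score_fn` and the toy scoring function standing in for validation accuracy.

```python
import numpy as np

def grid_search_lambda(merge_fn, score_fn, num_points=50):
    """Return (best_lambda, best_score) over an evenly spaced grid on [0, 1].

    merge_fn(lam)    -> merged parameters using weights (lam, 1 - lam).
    score_fn(params) -> validation score, higher is better.
    """
    best_lam, best_score = None, -np.inf
    for lam in np.linspace(0.0, 1.0, num_points):
        score = score_fn(merge_fn(lam))
        if score > best_score:
            best_lam, best_score = float(lam), float(score)
    return best_lam, best_score

theta_target = np.array([0.0, 0.0])
theta_donor = np.array([1.0, 1.0])
optimum = np.array([0.3, 0.3])            # toy stand-in for the "best" parameters

best_lam, best_score = grid_search_lambda(
    merge_fn=lambda lam: lam * theta_target + (1.0 - lam) * theta_donor,
    score_fn=lambda params: -np.sum((params - optimum) ** 2),
)
```

Each grid point costs one merge plus one validation pass and involves no gradient steps, which is where the savings relative to fine-tuning come from.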
In initial experiments (reported in tables), we performed intermediate-task training for every
possible pair of datasets from the GLUE benchmark. Congruent with past work [47,51,60], we
found that intermediate-task training provided the most notable performance boost when the RTE
dataset was the target. We therefore focus on RTE results in the main text. Figure 4 shows the results
of intermediate-task training of BERT-base with RTE as the target task and the other GLUE datasets
as donor tasks, using Fisher merging, isotropic merging, or standard gradient-based training. Notably,
performing gradient-based intermediate-task training hurts on some datasets, whereas merging always
helps. Fisher merging gets comparable or better performance than isotropic merging with the largest
gap observed when using MNLI as the intermediate task. On the other hand, merging performs worse
than standard gradient-based training when using MNLI as the donor task.
Exploring new paths for transfer. Given this performance gap, we were interested to see whether
merging could provide an additional boost on top of gradient-based intermediate-task training. We
therefore performed Fisher merging on a BERT-base model that was first fine-tuned on MNLI and
then fine-tuned on RTE. A diagram of this setup is shown in fig. 1, right. This procedure does not
have a direct analog in traditional gradient-based training, and as we will show later, performing
multi-stage gradient-based intermediate-task training generally harms results.
We consider Fisher merging the intermediate-task trained RTE model with all GLUE tasks and show
the results in fig. 5. Fisher merging provides a boost over gradient-based intermediate-task training
for all tasks. Interestingly, a boost is still conferred when merging with an MNLI-trained model,
suggesting that merging provides a complementary path for transferring capabilities across models.
Scaling to RoBERTa-large. Seeing that merging can provide a boost on top of intermediate-task
training, we explored whether this boost could still be obtained for a stronger model than
BERT-base. We therefore applied the same procedure to a RoBERTa-large RTE model that had been
fine-tuned from an MNLI intermediate checkpoint. Our donor models were the original RoBERTa-large
checkpoint (i.e., not fine-tuned on MNLI) fine-tuned on MRPC, RTE, STS-B, and SST-2. We
additionally ran a sequential gradient-based fine-tuning baseline where we started with the MNLI
checkpoint, fine-tuned on the donor task, and then fine-tuned on the target task.
The results are shown in fig. 6. We find merging provides a boost in performance even on the more
performant RoBERTa model. The largest boost of 2.2 points came from Fisher merging with another
RTE checkpoint, which is reminiscent of using merging for ensembling. Notably, including an
additional intermediate task in gradient-based training significantly harmed performance compared
to performing intermediate-task training on MNLI alone. We hypothesize this is related to the
phenomenon of catastrophic forgetting [17], where the model's capabilities on MNLI are forgotten as
it is trained on the next intermediate task. Nevertheless, this illustrates model merging's ability to
sidestep the issue of catastrophic forgetting and enable exploration of novel transfer strategies.
Costs. We had previously noted that our merging procedure could potentially be substantially more
efficient than standard gradient-based fine-tuning. To quantify this claim, we computed the
FLOPs required for fine-tuning and merging an RTE checkpoint based on the heuristics described in
Kaplan et al. [26]. Fine-tuning BERT-base on RTE for 10 epochs would require about 5.5e14 FLOPs.
Our merging procedures require computing the merged checkpoint (eq. (4)) and then evaluating it
on the validation set, with Fisher merging also requiring the estimation of the Fisher matrix (eq. (2))
beforehand. These steps require about 4.0e8, 2.0e12, and 9.1e13 FLOPs respectively, resulting in
a roughly $6\times$ lower total cost compared to fine-tuning for Fisher merging and $275\times$ lower
cost for isotropic merging. We note that the Fisher matrix only needs to be computed once and can be
reused for subsequent merges, which amortizes the most expensive step in Fisher merging.
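The quoted cost ratios can be reproduced from the FLOP figures in the paragraph above. The snippet only restates that arithmetic; the variable names are ours.

```python
# FLOP estimates quoted in the text.
finetune_flops = 5.5e14   # fine-tuning BERT-base on RTE for 10 epochs
merge_flops = 4.0e8       # computing the merged checkpoint, eq. (4)
eval_flops = 2.0e12       # evaluating the merged model on the validation set
fisher_flops = 9.1e13     # estimating the diagonal Fisher, eq. (2)

isotropic_total = merge_flops + eval_flops
fisher_total = isotropic_total + fisher_flops

isotropic_speedup = finetune_flops / isotropic_total
fisher_speedup = finetune_flops / fisher_total
```

The Fisher estimate accounts for nearly all of the cost of Fisher merging, so reusing a stored Fisher across merges brings the per-merge cost close to the isotropic figure.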
To explore methods for further reducing costs, we experimented with using fewer examples to estimate
the Fisher. Specifically, we experimented with intermediate-task Fisher merging of BERT-base with
MNLI as the donor task and RTE as the target task. The results are shown in a separate table. While
using the full training set to estimate the Fisher produced the best performance (73.4%), using only
256 examples to estimate the Fisher only produced a mild degradation in accuracy (72.7%) and still
outperformed the isotropic merging baseline. This suggests that computing the Fisher over fewer
examples could further reduce computational costs without sacrificing a great deal of accuracy.
3.4 Domain Adaptation
We now turn our attention to the “domain-adaptive pre-training” (DAPT) approach for domain
adaptation advocated by Gururangan et al. [19], which is methodologically similar to
intermediate-task training. DAPT consists of additional pre-training of an original general-purpose
pre-trained checkpoint on domain-specific unlabeled data. We explore the benefits of merging in an
experimental setup similar to Gururangan et al. [19]. We focus on the biomedical (BIOMED) and
computer science (CS) domains because they correspond to the classification tasks that saw the
largest gains from domain-adaptive pre-training in [19]. Namely, we experimented with the CHEMPROT
[30] relation classification task on the BIOMED domain. On the CS domain, we used the citation intent
task of ACL-ARC [25] and the relation classification task of SCIERC [35]. Following Gururangan et al.
[19], we report macro-F1 for ACL-ARC and SCIERC, and we report micro-F1 for CHEMPROT. We
used RoBERTa-base [33] as our baseline model. The appendix details the pre-training,
fine-tuning, and merging procedures used.
We present our results in table 1. Merging provided the largest boost on ACL-ARC, and outperformed
traditional fine-tuning in this setting. We only observed a minor improvement in performance on
CHEMPROT and SCIERC. We note that our boosts from gradient-based fine-tuning were smaller
than reported in [19], which was likely because we were only able to train on public data and we
applied domain-adaptive pre-training for fewer steps. However, our results are consistent in the sense
that ACL-ARC received the largest boost and CHEMPROT received the smallest boost.
4 Related Work
Like our work, elastic weight consolidation (EWC) [28] uses the Laplace approximation to the
posterior over model parameters to create a regularizer to prevent catastrophic forgetting in the
context of continual learning. While their framework supports the use of posteriors from multiple
[Figure 6 plot: x-axis “Donor task” (SST-2, MRPC, STS-B, RTE); y-axis “Accuracy” (75 to 90); series: Fisher merging, Standard training.]
Figure 6: Validation accuracy on RTE using the setup of the corresponding BERT-base figure, but with
RoBERTa-large instead of BERT-base. “Standard training” fine-tunes on MNLI, then the donor task,
then RTE. Dashed line denotes MNLI intermediate-task training.
Method       ChemProt     ACL-ARC      SciERC
Unmerged     82.7 ± 0.3   70.5 ± 3.2   81.0 ± 0.4
Fisher       83.1 ± 0.4   73.2 ± 1.7   81.3 ± 0.5
Isotropic    82.8 ± 0.4   72.5 ± 2.3   81.7 ± 0.5
Fine-tuned   82.5 ± 0.1   71.5 ± 3.0   81.6 ± 1.0
Table 1: Domain adaptation results. “Unmerged” refers to checkpoints fine-tuned from RoBERTa-base.
“Fisher” and “Isotropic” refer to the result of merging those checkpoints with the domain-adaptive
pre-trained (DAPT) checkpoint. “Fine-tuned” refers to models fine-tuned from the DAPT checkpoint.
Values after ± are the standard deviation across five trials.
models as well, they restrict such models to be previous checkpoints of a continually trained model.
EWC keeps the model from losing previously acquired knowledge while merging provides a means
of directly adding new knowledge to a model.
Some other existing procedures such as distillation [22] and ensembling [42] can also be thought of
as combining or transferring knowledge between neural networks. However, those methods represent
knowledge solely through the output of models. The knowledge contained within the parameters of a
network will necessarily be greater than the knowledge contained in its output [2]. Hence, methods
that directly combine model parameters such as merging have the potential to be more powerful
than those methods. Furthermore, our merging procedure has an efficient and closed-form solution
(eq. (4)) while distillation requires iterative gradient descent-based training.
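As a rough illustration of why merging is cheap, the sketch below computes a per-parameter Fisher-weighted average. It assumes a diagonal Fisher approximation and a closed form in which each merged parameter is the sum of λ_i F_i θ_i over models divided by the sum of λ_i F_i. The function name, the `eps` fallback, and the toy values are illustrative assumptions rather than our released implementation.

```python
def fisher_merge(params, fishers, coeffs, eps=1e-8):
    """Merge flat parameter vectors with a diagonal-Fisher-weighted average.

    params:  list of models, each a list of floats (same length, same order)
    fishers: matching list of diagonal Fisher estimates (non-negative floats)
    coeffs:  one merging coefficient per model
    """
    merged = []
    for j in range(len(params[0])):
        num = sum(c * f[j] * p[j] for c, f, p in zip(coeffs, fishers, params))
        den = sum(c * f[j] for c, f in zip(coeffs, fishers))
        if den > eps:
            merged.append(num / den)
        else:
            # Assumed fallback: with no Fisher mass for this parameter,
            # use the coefficient-weighted mean instead.
            merged.append(sum(c * p[j] for c, p in zip(coeffs, params)) / sum(coeffs))
    return merged


# Toy example (hypothetical values): two models with two parameters each.
theta_a, fisher_a = [1.0, 0.0], [3.0, 1.0]
theta_b, fisher_b = [0.0, 2.0], [1.0, 1.0]
merged = fisher_merge([theta_a, theta_b], [fisher_a, fisher_b], [0.5, 0.5])
```

In the toy example the first parameter is pulled toward model A, whose Fisher value is three times larger, while the second parameter has equal Fisher values and therefore reduces to the plain (isotropic) average. The whole computation is a single pass over the parameters with no gradient steps.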
Isotropic checkpoint averaging is used by federated learning [39] and Polyak averaging [49]. However,
the checkpoints merged by those methods can be thought of as coming from the same training run of a
single model. We believe we are the first to demonstrate cross-task transfer coming from checkpoint
averaging and to explore it in the context of transfer learning. Nevertheless, adapting ideas from
federated learning such as [32,] could provide a fruitful avenue for future model merging research.
Natural gradient descent refers to an optimization procedure that uses KL-divergence of model
predictions as a distance measure rather than the Euclidean distance in parameter space employed by
regular gradient descent [3]. It does this by performing gradient descent on a Riemannian manifold
with the Fisher information matrix as its metric [45]. In practice, this amounts to using the Fisher as a
preconditioner during gradient descent. Some work on natural gradient descent may prove relevant for
model merging such as using Kronecker-factorized Fisher matrices as an alternative to the diagonal
approximation employed in this paper [37, 18, 38]. More broadly, in the field of information geometry
the Fisher information matrix plays the role of a metric on a Riemannian manifold [40]. This has led
to explorations of model averaging using tools from information geometry, e.g. [5,,].
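The preconditioning view can be sketched in a few lines. The example below assumes the same diagonal Fisher approximation used in this paper together with a damping constant; the damping term, the function name, and the toy values are illustrative assumptions and do not correspond to any specific optimizer from the cited work.

```python
def natural_gradient_step(theta, grad, fisher_diag, lr=0.1, damping=1e-3):
    """Return theta - lr * grad / (fisher + damping), applied elementwise.

    With a diagonal Fisher, preconditioning reduces to dividing each
    gradient entry by the matching Fisher entry (plus damping for stability).
    """
    return [t - lr * g / (f + damping) for t, g, f in zip(theta, grad, fisher_diag)]


# Toy example (hypothetical values): equal gradients, unequal Fisher entries.
theta = [1.0, 1.0]
grad = [1.0, 1.0]
fisher_diag = [1.0, 0.1]
updated = natural_gradient_step(theta, grad, fisher_diag, damping=0.0)
```

The parameter with the smaller Fisher entry takes the larger step, since a small Fisher value indicates that the model's predictions are less sensitive to that parameter.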
5 Conclusion
In this paper, we introduced Fisher merging, a way to combine the capabilities of different models by
computing a weighted average of their parameters. Fisher merging is motivated by a novel perspective
of parameter averaging as maximizing the joint likelihood of model posteriors. Through extensive
experiments, we demonstrated that using the Fisher information as a weight on the contribution
of each parameter outperforms using an unweighted average. Furthermore, we showed that Fisher
merging attains comparable and sometimes better performance than traditional gradient-based transfer
learning methods at significantly lower costs. Our experiments also demonstrated various merging
strategies that would be onerous with traditional gradient-based training, which opens up new avenues
for transferring capabilities across models. In future work, we plan to investigate different methods
for approximating the Fisher information and model posteriors as well as more esoteric combinations
of models.
References
[1] Achille, A., Lam, M., Tewari, R., Ravichandran, A., Maji, S., Fowlkes, C. C., Soatto, S., and
Perona, P. Task2vec: Task embedding for meta-learning. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pp. 6430–6439, 2019.
[2] Achille, A., Paolini, G., and Soatto, S. Where is the information in a deep neural network?
arXiv preprint arXiv:1905.12213, 2019.
[3] Amari, S. Neural learning in structured parameter spaces-natural riemannian gradient. Advances
in neural information processing systems, pp. 127–133, 1997.
[4] Barbu, A., Mayo, D., Alverio, J., Luo, W., Wang, C., Gutfreund, D., Tenenbaum, J., and Katz,
B. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition
models. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[5] Bartels, L. M. Specification uncertainty and model averaging. American Journal of Political
Science, 1997.
[6] Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning
natural language inference. arXiv preprint arXiv:1508.05326, 2015.
[7] Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., and Specia, L. Semeval-2017 task 1: Semantic
textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint
arXiv:1708.00055, 2017.
[8] Dagan, I., Glickman, O., and Magnini, B. The pascal recognising textual entailment challenge.
In Machine Learning Challenges Workshop, pp. 177–190. Springer, 2005.
[9] Dai, A. M. and Le, Q. V. Semi-supervised sequence learning. In Advances in neural information
processing systems, 2015.
[10] Daxberger, E., Kristiadi, A., Immer, A., Eschenhagen, R., Bauer, M., and Hennig, P. Laplace
redux-effortless bayesian deep learning. Advances in Neural Information Processing Systems
(NeurIPS), 2021.
[11] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale
hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition,
2009.
[12] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[13] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[14] Dolan, W. B. and Brockett, C. Automatically constructing a corpus of sentential paraphrases.
In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005.
[15] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani,
M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth
16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929,
2020.
[16] Fisher, R. A. On the mathematical foundations of theoretical statistics. Philosophical
Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or
Physical Character, 222(594-604):309–368, 1922.
[17] Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. An empirical investigation
of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211,
2013.
[18] Grosse, R. and Martens, J. A kronecker-factored approximate fisher matrix for convolution
layers. In International Conference on Machine Learning, pp. 573–582. PMLR, 2016.
[19] Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., and Smith,
N. A. Don't stop pretraining: Adapt language models to domains and tasks. arXiv preprint
arXiv:2004.10964, 2020.
[20] Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T.,
Parajuli, S., Guo, M., Song, D., Steinhardt, J., and Gilmer, J. The many faces of robustness: A
critical analysis of out-of-distribution generalization. International Conference on Computer
Vision (ICCV), 2021.
[21] Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., and Song, D. Natural adversarial examples.
Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[22] Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv
preprint arXiv:1503.02531, 2015.
[23] Iyer, S., Dandekar, N., and Csernai, K. First quora dataset release: Question pairs, 2017. URL
https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.
[24] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and
Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the
22nd ACM international conference on Multimedia, 2014.
[25] Jurgens, D., Kumar, S., Hoover, R., McFarland, D., and Jurafsky, D. Measuring the evolution of
a scientific field through citation frames. Transactions of the Association for Computational
Linguistics, 6:391–406, 2018.
[26] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S.,
Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint
arXiv:2001.08361, 2020.
[27] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[28] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K.,
Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in
neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
[29] Kocijan, V., Cretu, A.-M., Camburu, O.-M., Yordanov, Y., and Lukasiewicz, T. A surprisingly
robust trick for winograd schema challenge. arXiv preprint arXiv:1905.06290, 2019.
[30] Kringelum, J., Kjaerulff, S. K., Brunak, S., Lund, O., Oprea, T. I., and Taboureau, O.
Chemprot-3.0: a global chemical biology diseases mapping. Database, 2016, 2016.
[31] Levesque, H., Davis, E., and Morgenstern, L. The winograd schema challenge. In Thirteenth
International Conference on the Principles of Knowledge Representation and Reasoning, 2012.
[32] Liu, X., Masana, M., Herranz, L., Van de Weijer, J., Lopez, A. M., and Bagdanov, A. D. Rotate
your networks: Better weight consolidation and less catastrophic forgetting. In 2018 24th
International Conference on Pattern Recognition (ICPR), pp. 2262–2268. IEEE, 2018.
[33] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L.,
and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint
arXiv:1907.11692, 2019.
[34] Lo, K., Wang, L. L., Neumann, M., Kinney, R., and Weld, D. S. S2orc: The semantic scholar
open research corpus. arXiv preprint arXiv:1911.02782, 2019.
[35] Luan, Y., He, L., Ostendorf, M., and Hajishirzi, H. Multi-task identification of entities, relations,
and coreference for scientific knowledge graph construction. arXiv preprint arXiv:1808.09602,
2018.
[36] MacKay, D. J. A practical bayesian framework for backpropagation networks. Neural
computation, 4(3):448–472, 1992.
[37] Martens, J. and Grosse, R. Optimizing neural networks with kronecker-factored approximate
curvature. In International conference on machine learning, pp. 2408–2417. PMLR, 2015.
[38] Martens, J., Ba, J., and Johnson, M. Kronecker-factored curvature approximations for recurrent
neural networks. In International Conference on Learning Representations, 2018.
[39] McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A.
Communication-efficient learning of deep networks from decentralized data. In Artificial
Intelligence and Statistics, pp. 1273–1282. PMLR, 2017.
[40] Entropy, 22(10):1100, 2020.
[41] Nielsen, F. and Nock, R. Sided and symmetrized bregman centroids. IEEE transactions on
Information Theory, 55(6), 2009.
[42] Opitz, D. W. and Shavlik, J. W. Actively searching for an effective neural network ensemble.
Connection Science, 8(3-4):337–354, 1996.
[43] Oquab, M., Bottou, L., Laptev, I., and Sivic, J. Learning and transferring mid-level image
representations using convolutional neural networks. In Proceedings of the IEEE conference on
computer vision and pattern recognition, 2014.
[44] Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on knowledge and
data engineering, 22(10):1345–1359, 2009.
[45] Pascanu, R. and Bengio, Y. Revisiting natural gradient for deep networks. arXiv preprint
arXiv:1301.3584, 2013.
[46] Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L.
Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
[47] Phang, J., Févry, T., and Bowman, S. R. Sentence encoders on stilts: Supplementary training on
intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088, 2018.
[48] Phang, J., Htut, P. M., Pruksachatkun, Y., Liu, H., Vania, C., Kann, K., Calixto, I., and Bowman,
S. R. English intermediate-task training improves zero-shot cross-lingual transfer too. arXiv
preprint arXiv:2005.13013, 2020.
[49] Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM
journal on control and optimization, 30(4):838–855, 1992.
[50] Posada, D. and Buckley, T. R. Model selection and model averaging in phylogenetics:
advantages of akaike information criterion and bayesian approaches over likelihood ratio tests.
Systematic biology, 53(5), 2004.
[51] Pruksachatkun, Y., Phang, J., Liu, H., Htut, P. M., Zhang, X., Pang, R. Y., Vania, C., Kann,
K., and Bowman, S. R. Intermediate-task transfer learning with pretrained models for natural
language understanding: When and why does it work? arXiv preprint arXiv:2005.00628, 2020.
[52] Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding
by generative pre-training, 2018.
[53] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and
Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv
preprint arXiv:1910.10683, 2019.
[54] Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. Squad: 100,000+ questions for machine
comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
[55] Rajpurkar, P., Jia, R., and Liang, P. Know what you don't know: Unanswerable questions for
squad. arXiv preprint arXiv:1806.03822, 2018.
[56] Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do ImageNet classifiers generalize to
ImageNet? In International Conference on Machine Learning (ICML), 2019.
[57] Ruder, S., Peters, M. E., Swayamdipta, S., and Wolf, T. Transfer learning in natural language
processing. In Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Tutorials, pp. 15–18, 2019.
[58] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International
journal of computer vision, 2015.
[59] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. Recursive
deep models for semantic compositionality over a sentiment treebank. In Proceedings of the
2013 conference on empirical methods in natural language processing, pp. 1631–1642, 2013.
[60] Vu, T., Wang, T., Munkhdalai, T., Sordoni, A., Trischler, A., Mattarella-Micke, A., Maji,
S., and Iyyer, M. Exploring and predicting transferability across nlp tasks. arXiv preprint
arXiv:2005.00770, 2020.
[61] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. Glue: A multi-task
benchmark and analysis platform for natural language understanding. arXiv preprint
arXiv:1804.07461, 2018.
[62] Wang, H., Ge, S., Lipton, Z., and Xing, E. P. Learning robust global representations by
penalizing local predictive power. In Advances in Neural Information Processing Systems
(NeurIPS), 2019.
[63] Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., and Khazaeni, Y. Federated learning
with matched averaging. arXiv preprint arXiv:2002.06440, 2020.
[64] Warstadt, A., Singh, A., and Bowman, S. R. Neural network acceptability judgments.
Transactions of the Association for Computational Linguistics, 7:625–641, 2019.
[65] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T.,
Louf, R., Funtowicz, M., et al. Huggingface's transformers: State-of-the-art natural language
processing. arXiv preprint arXiv:1910.03771, 2019.
[66] Wortsman, M., Ilharco, G., Li, M., Kim, J. W., Hajishirzi, H., Farhadi, A., Namkoong, H., and
Schmidt, L. Robust fine-tuning of zero-shot models. arXiv preprint arXiv:2109.01903, 2021.
[67] Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A. S.,
Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., and Schmidt, L. Model soups: averaging
weights of multiple fine-tuned models improves accuracy without increasing inference time.
arXiv preprint arXiv:2203.05482, 2022.
[68] Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural
networks? In Advances in neural information processing systems, 2014.
A Checkpoints used for Ensembling
For the ensembling experiments, we use fine-tuned BERT-base checkpoints from the Hugging Face
model hub.³ Specifically, we used the following checkpoints for each dataset:
RTE
• textattack/bert-base-uncased-RTE
• yoshitomo-matsubara/bert-base-uncased-rte
• Ruizhou/bert-base-uncased-finetuned-rte
• howey/bert-base-uncased-rte
• anirudh21/bert-base-uncased-finetuned-rte
MRPC
• textattack/bert-base-uncased-MRPC
• yoshitomo-matsubara/bert-base-uncased-mrpc
• Maelstrom77/bert-base-uncased-MRPC
• Ruizhou/bert-base-uncased-finetuned-mrpc
• TehranNLP-org/bert-base-uncased-mrpc-2e-5-42
SST2
• aviator-neural/bert-base-uncased-sst2
• howey/bert-base-uncased-sst2
• yoshitomo-matsubara/bert-base-uncased-sst2
• ikevin98/bert-base-uncased-finetuned-sst2
• TehranNLP-org/bert-base-uncased-cls-sst2
B Individual dataset results for robust fine-tuning
Figure 7 shows the individual OOD dataset results obtained using either isotropic or Fisher merging.
C GLUE Fine-tuning Details
For the high-resource tasks QNLI, QQP, SST-2, and MNLI, we used checkpoints downloaded from
Hugging Face. We also used a checkpoint from Hugging Face that was fine-tuned on the extractive
question answering task SQuAD 2.0 [55] as an alternative intermediate task checkpoint. For the
low-resource tasks CoLA, MRPC, RTE, and STS-B, we fine-tuned for 10 epochs using a batch size of 16
and the Adam optimizer [27] with a learning rate of 1e-5. We ran 5 independent fine-tuning runs for
the low-resource tasks, discarding runs with poor performance.
D Domain-Adaptive Pre-training Details
We performed additional domain-adaptive pre-training on RoBERTa-base for 32,768 steps with a
batch size of 32 using the Adam optimizer with a learning rate of 1e-5. We used the BIOMED and CS
splits of the public S2ORC dataset of abstracts and full-length papers [34]. We note that Gururangan
et al. [19] used an internal version of S2ORC that includes additional papers that could not be released
due to copyright issues. Our fine-tuning and target task Fisher computation procedures were the
same as in our GLUE experiments with the exception of using a batch size of 8 when fine-tuning.
³ https://huggingface.co/models
[Figure 7 panels: ImageNet-A, ImageNet-R, ImageNet Sketch, ImageNetV2, and ObjectNet accuracy (y-axes), each plotted against ImageNet accuracy (x-axis) for isotropic merging and Fisher merging.]
Figure 7: Individual OOD dataset results from applying WiSE-FT [66] to ImageNet pre-trained
ViT-B/16 using either isotropic or Fisher merging. Dark to light color indicates increasing λ_1 from 0
to 1.
Fine-tuning for 10 epochs, we saved a checkpoint at the end of each epoch. We computed the Fisher
for the DAPT checkpoints on 131,072 examples, using one sample from the logits per example. We
merged each checkpoint saved during fine-tuning with the DAPT checkpoint from the task's domain.
We performed a grid search of 75 merging coefficients and used the F1 score on the first 2048 test
examples as the selection criterion. We report the scores of the best unmerged and the best merged
checkpoint from each fine-tuning run.
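The coefficient search described above can be sketched as follows. The sketch assumes two models merged with coefficients λ and 1 − λ and an evenly spaced grid on [0, 1]; the grid spacing, the function names, and the toy merge and score functions are illustrative assumptions, not a description of our released code.

```python
def grid_search_coefficient(merge_fn, score_fn, num_points=75):
    """Return (best_lambda, best_score) over an evenly spaced grid on [0, 1].

    merge_fn: maps a coefficient lambda to a merged model
    score_fn: maps a merged model to a selection score (higher is better)
    """
    best_lam, best_score = None, float("-inf")
    for k in range(num_points):
        lam = k / (num_points - 1)
        score = score_fn(merge_fn(lam))
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam, best_score


# Toy stand-ins (hypothetical): a one-parameter "model" interpolated between
# 0.0 and 1.0, and a score that peaks when the parameter equals 0.3.
def toy_merge(lam):
    return lam * 1.0 + (1.0 - lam) * 0.0


def toy_score(theta):
    return -(theta - 0.3) ** 2


best_lam, best_score = grid_search_coefficient(toy_merge, toy_score)
```

In practice `merge_fn` would perform the merge for a given coefficient and `score_fn` would evaluate the merged model on the held-out selection examples. Each grid point costs one merge and one evaluation, with no additional training.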
E Full results for intermediate-task training
In tables A1, A2, and A3, we report results for intermediate-task training when considering all
possible datasets in GLUE as target tasks.
F Using fewer examples to estimate the Fisher
In table A4, we show the effect of limiting the number of examples used to compute the Fisher when
performing intermediate-task Fisher merging on BERT-base with MNLI as the donor task and RTE
as the target task.
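To illustrate what limiting the number of examples means in practice, the sketch below estimates a diagonal Fisher from the first `num_examples` inputs, drawing one label per example from the model's own predictive distribution. A linear softmax classifier stands in for the real network, and all names and toy values are illustrative assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def diagonal_fisher(weights, examples, num_examples, seed=0):
    """Estimate the diagonal Fisher of a linear softmax classifier.

    weights:  per-class weight rows, shape [classes][features]
    examples: list of feature vectors; only the first num_examples are used
    """
    rng = random.Random(seed)
    subset = examples[:num_examples]
    fisher = [[0.0] * len(row) for row in weights]
    for x in subset:
        probs = softmax([sum(w * v for w, v in zip(row, x)) for row in weights])
        # Draw one label from the model's own predictive distribution.
        y = rng.choices(range(len(probs)), weights=probs)[0]
        for c, p in enumerate(probs):
            coeff = (1.0 if c == y else 0.0) - p  # d log p(y|x) / d logit_c
            for d, v in enumerate(x):
                # Squared gradient of log p(y|x) with respect to weights[c][d].
                fisher[c][d] += (coeff * v) ** 2
    return [[f / len(subset) for f in row] for row in fisher]


# Toy example (hypothetical): zero weights give uniform predictions, so the
# estimate does not depend on which label is sampled.
weights = [[0.0, 0.0], [0.0, 0.0]]
examples = [[1.0, 2.0]] * 4
fisher = diagonal_fisher(weights, examples, num_examples=2)
```

Using fewer examples shortens this loop proportionally, at the price of a noisier estimate of each diagonal entry.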
Table A1: Intermediate task Fisher merging results on GLUE with BERT-base. Columns correspond
to target tasks while rows correspond to intermediate tasks. Values on the diagonal, marked with *,
are scores of the unmerged target checkpoints. Values after ± denote standard deviation across runs.
TASK    COLA         SST-2        MRPC         STS-B        QQP          MNLI         QNLI         RTE
COLA    55.4 ± 1.8*  92.4 ± 0.0   84.8 ± 0.4   86.1 ± 0.9   89.3 ± 0.0   83.9 ± 0.0   90.9 ± 0.0   64.7 ± 1.0
SST-2   55.8 ± 1.6   92.4 ± 0.0*  84.7 ± 0.5   86.1 ± 0.9   89.0         83.9         90.9         65.0 ± 1.4
MRPC    55.7 ± 1.7   92.4 ± 0.0   84.5 ± 0.3*  86.1 ± 0.8   88.9 ± 0.1   83.8 ± 0.0   90.9 ± 0.0   65.0 ± 0.9
STS-B   55.7 ± 1.5   92.4 ± 0.0   84.8 ± 0.4   86.1 ± 0.8*  88.9 ± 0.1   83.8 ± 0.0   90.9 ± 0.0   65.4 ± 2.3
QQP     55.5 ± 1.7   92.4         84.6 ± 0.3   86.1 ± 0.9   88.8 ± 0.0*  83.8         90.9         65.8 ± 2.3
MNLI    55.7 ± 1.9   92.4         85.1 ± 0.6   86.1 ± 0.9   88.9         83.7 ± 0.0*  90.9         73.2 ± 5.1
QNLI    55.5 ± 1.7   92.4         85.0 ± 0.8   86.1 ± 0.9   89.4         83.9         90.9 ± 0.0*  66.5 ± 1.6
RTE     55.6 ± 1.7   92.4 ± 0.0   84.8 ± 0.4   86.1 ± 0.9   88.8 ± 0.0   83.9 ± 0.1   90.9 ± 0.0   63.7 ± 1.7*
SQUAD   56.1 ± 1.4   92.4         84.9 ± 0.4   86.1 ± 0.9   89.1         83.9         91.0         66.6 ± 1.0
Table A2: Intermediate task isotropic-merging results on GLUE with BERT-base. Columns correspond
to target tasks while rows correspond to intermediate tasks. Values on the diagonal, marked with *,
are scores of the unmerged target checkpoints. Values after ± denote standard deviation across runs.
TASK    COLA         SST-2        MRPC         STS-B        QQP          MNLI         QNLI         RTE
COLA    55.4 ± 1.8*  92.5 ± 0.0   84.9 ± 0.8   86.1 ± 0.8   88.9 ± 0.0   83.9 ± 0.0   90.9 ± 0.0   64.8 ± 0.8
SST-2   55.5 ± 1.7   92.4 ± 0.0*  84.9 ± 0.7   86.1 ± 0.9   88.8         83.9         90.9         64.8 ± 1.1
MRPC    55.5 ± 1.8   92.4 ± 0.0   84.5 ± 0.3*  86.1 ± 0.9   88.9 ± 0.1   83.9 ± 0.1   90.9 ± 0.0   65.1 ± 0.9
STS-B   55.4 ± 1.8   92.4 ± 0.0   85.0 ± 0.4   86.1 ± 0.9*  89.0 ± 0.1   83.8 ± 0.1   90.9 ± 0.0   65.2 ± 2.1
QQP     55.5 ± 1.8   92.4         84.7 ± 0.2   86.1 ± 0.9   88.8 ± 0.0*  83.8         90.9         65.1 ± 1.7
MNLI    55.6 ± 1.7   92.4         85.4 ± 0.6   86.1 ± 0.8   88.8         83.7 ± 0.0*  90.9         72.2 ± 4.0
QNLI    55.5 ± 1.7   92.4         85.1 ± 0.7   86.1 ± 0.9   89.1         83.9         90.9 ± 0.0*  66.8 ± 1.1
RTE     55.5 ± 1.8   92.4 ± 0.0   84.6 ± 0.3   86.1 ± 0.8   88.9 ± 0.1   83.8 ± 0.1   90.9 ± 0.0   63.7 ± 1.7*
Table A3: Sequential fine-tuning results on GLUE with BERT-base. Columns correspond to target
tasks while rows correspond to intermediate tasks. Values after ± denote standard deviation across
runs. Values marked with * represent fine-tuning directly on the target task (i.e. no intermediate-task
training).
TASK    COLA         MRPC         STS-B        RTE
COLA    55.4 ± 1.8*  85.0 ± 0.9   85.9 ± 0.8   62.1 ± 2.3
SST-2   56.8 ± 1.4   85.4 ± 0.9   85.3 ± 1.0   63.8 ± 1.0
MRPC    58.5 ± 0.4   84.5 ± 0.3*  85.3 ± 0.8   62.7 ± 5.2
STS-B   56.3 ± 0.4   86.7 ± 0.7   86.1 ± 0.9*  64.5 ± 2.5
QQP     56.0 ± 2.0   87.1 ± 1.2   87.5 ± 0.4   71.6 ± 1.9
MNLI    58.6 ± 1.7   85.9 ± 0.8   87.6 ± 0.3   77.4 ± 1.6
QNLI    56.4 ± 1.9   87.8 ± 0.6   87.1 ± 0.5   71.0 ± 4.1
RTE     56.7 ± 0.9   82.2 ± 2.5   85.8 ± 0.5   63.7 ± 1.7*
Table A4: Effect of the number of examples used to compute the Fisher information. Columns
correspond to the number of examples used for RTE. Rows correspond to the number of examples
used for MNLI. Scores are the RTE validation set accuracy. The original RTE checkpoints had an
average accuracy of 63.7 and isotropic merging (i.e. 0 Fisher examples) had an average accuracy of
72.2.
EXAMPLES   256    1024   2490
256        72.7   72.9   73.1
1024       72.9   72.9   73.3
4096       72.9   73.0   73.2
32768      72.8   73.0   73.5
392702     72.9   73.1   73.4