Extracted Text

2405.07813v1.pdf

Localizing Task Information for Improved Model Merging and Compression
Ke Wang
* 1
Nikolaos Dimitriadis
* 1
Guillermo Ortiz-Jim´enez
2 3
Franc¸ois Fleuret
4
Pascal Frossard
1
Abstract
Model merging and task arithmetic have emerged
as promising scalable approaches to merge
multiple single-task checkpoints to one multi-task
model, but their applicability is reduced by sig-
nificant performance loss. Previous works have
linked these drops to interference in the weight
space and erasure of important task-specific
features. Instead, in this work we show that
the information required to solve each task is
still preserved after merging as different tasks
mostly use non-overlapping sets of weights. We
propose TALL-masks, a method to identify these
task supports given a collection of task vectors
and show that one can retrieve>99% of the
single task accuracy by applying our masks to
the multi-task vector, effectively compressing the
individual checkpoints. We study the statistics of
intersections among constructed masks and reveal
the existence ofselfishandcatastrophicweights,
i.e., parameters that are important exclusively to
one task and irrelevant to all tasks but detrimental
to multi-task fusion. For this reason, we propose
Consensus Merging, an algorithm that eliminates
such weights and improves the general perfor-
mance of existing model merging approaches.
Our experiments in vision and NLP benchmarks
with up to 20 tasks, show that Consensus Merging
consistently improves existing approaches.
Furthermore, our proposed compression scheme
reduces storage from 57Gb to 8.2Gb while
retaining 99.7% of original performance.
*
Equal contribution
1´
Ecole Polytechnique F´ed´erale de Lau-
sanne, Lausanne, Switzerland
2
Google Deepmind
3
Work done
while at EPFL.
4
University of Geneva, Geneva, Switzerland.. Cor-
respondence to: Ke Wang<k.wang@epfl.ch>, Nikolaos Dimitri-
adis<nikolaos.dimitriadis@epfl.ch>.
Proceedings of the41
stInternational Conference on Machine
Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by
the author(s).
1. Introduction
In recent years, the field of ML has witnessed a paradigm
shift with the release of foundation models and the influx
of associated checkpoints, significantly improving the per-
formance on downstream applications (Devlin et al.,;
Ilharco et al.,;,;
et al.,;,). The widespread adop-
tion of foundation models has followed a proliferation of
works addressing practical challenges arising from their
sheer computational and storage requirements. For example,
parameter-efficient fine-tuning (Hu et al.,;,
2022), quantization (Dettmers et al.,;) address
aspects of training, fine-tuning and inference. An important
question remains how to efficiently leverage the existing
fine-tuned models towards improving models and building
generalist agents (Reed et al.,).
Recent work has illuminated the benefits of interpolating
the weights of different models (Frankle et al.,;
et al.,;,;,
2023;,), offering scalable and effec-
tive techniques to edit the knowledge of pre-trained models.
Task arithmetic (TA) (Ilharco et al.,) has emerged as
a promising solution to fuse the knowledge of disparate
checkpoints into a single model with multi-objective
capabilities, forgoing the need for additional joint training
(Caruana,) or optimizing over the exponentially large
number of task combinations (Standley et al.,;
et al.,). Prior studies have proposed more involved
merging techniques by resolvingweightinterference (Yadav
et al.,), matching activations (Jin et al.,) or
by preserving task-specific important parameters (Matena
& Raffel,;,). Despite these recent
advances, weight space interpolation for multi-task fusion
still suffers from significant drops in performance compared
to individual fine-tuned models.
In this paper, we present a novel view and show that
performance of the merged model can degrade even
without weight interference or information erasure through
a controlled experiment. In contrast, the discriminant
information for individual tasks is preserved and embedded
in the multi-task vector after merging disparate task
vectors, and we propose an algorithm,TALL-masks, that
identifies the subset of important parameters for each task
1

Localizing Task Information for Improved Model Merging and Compression⌧1 ⌧2 ⌧3 ⌧4
Original Task Vectors
⌧MTL
Multi-task
Vector
m1 m2 m3 m4
Constructed Task Masks
m1!⌧MTL... ...m4!⌧MTL
Application on
Compression
mconsensus⌧consensus
Application on
Model Merging
⌧1 ⌧2 ⌧3 ⌧4
Original Task Vectors
⌧MTL
Multi-task
Vector
m1 m2 m3 m4
Constructed Task Masks
m1!⌧MTL... ...m4!⌧MTL
Application on
Compression
mconsensus⌧consensus
Application on
Model Merging
<latexit sha1_base64="ANgupCM9ITHbaT+sQf9cCvkN/mw=">AAAB/HicbVDJSgNBEK2JW4zbaI5eGoPgKcyI2zHixWMEs0AyhJ5OJ2nSs9BdIw5D/BUvHhTx6od482/sJHPQxAcFj/eqqKrnx1JodJxvq7Cyura+UdwsbW3v7O7Z+wdNHSWK8QaLZKTaPtVcipA3UKDk7VhxGviSt/zxzdRvPXClRRTeYxpzL6DDUAwEo2iknl3uIn/E7Do2y+aanvTsilN1ZiDLxM1JBXLUe/ZXtx+xJOAhMkm17rhOjF5GFQom+aTUTTSPKRvTIe8YGtKAay+bHT8hx0bpk0GkTIVIZurviYwGWqeBbzoDiiO96E3F/7xOgoMrLxNhnCAP2XzRIJEEIzJNgvSF4gxlaghlSphbCRtRRRmavEomBHfx5WXSPK26F9Xzu7NKzcnjKMIhHMEJuHAJNbiFOjSAQQrP8Apv1pP1Yr1bH/PWgpXPlOEPrM8fp2WVYQ==</latexit>
Applications
⌧1 ⌧2 ⌧3 ⌧4
Original Task Vectors
⌧MTL
Multi-task
Vector
m1 m2 m3 m4
Constructed Task Masks
m1!⌧MTL... ...m4!⌧MTL
Application on
Compression
mconsensus⌧consensus
Application on
Model Merging
<latexit sha1_base64="XhMysHiOEVcAHfPy02VFdR+EbVE=">AAACB3icbVDLSgMxFM3UV62vUZeCBIvgQsqM+FoW3OiuQl/QDiWTSdvQTGZI7hTL0J0bf8WNC0Xc+gvu/BvTdhbaeiBwOOdebs7xY8E1OM63lVtaXlldy68XNja3tnfs3b26jhJFWY1GIlJNn2gmuGQ14CBYM1aMhL5gDX9wM/EbQ6Y0j2QVRjHzQtKTvMspASN17MM2sAdI72TAhzxIiMBVoge4zihESo87dtEpOVPgReJmpIgyVDr2VzuIaBIyCVQQrVuuE4OXEgWcCjYutBPNYkIHpMdahkoSMu2l0xxjfGyUAHcjZZ4EPFV/b6Qk1HoU+mYyJNDX895E/M9rJdC99lIu4wSYpLND3URgiPCkFBxwZQKLkSGEKm7+immfKELBVFcwJbjzkRdJ/azkXpYu7s+L5dOsjjw6QEfoBLnoCpXRLaqgGqLoET2jV/RmPVkv1rv1MRvNWdnOPvoD6/MHfcaZpg==</latexit>
Individual Task Vectors
<latexit sha1_base64="GedkqFnUrevsB4wxLNEjpdthoYk=">AAACAXicbVDLSgMxFM3UV62vUTeCm2ARXGiZEV/Lghs3QgX7gHYomTRtQzPJkNwRy1A3/oobF4q49S/c+Tem7Sy0eiBwOOdebs4JY8ENeN6Xk5ubX1hcyi8XVlbX1jfcza2aUYmmrEqVULoREsMEl6wKHARrxJqRKBSsHg4ux379jmnDlbyFYcyCiPQk73JKwEptd6cF7B7S60QAPwJiBrjGKCg9artFr+RNgP8SPyNFlKHSdj9bHUWTiEmgghjT9L0YgpRo4FSwUaGVGBYTOiA91rRUkoiZIJ0kGOF9q3RwV2n7JOCJ+nMjJZExwyi0kxGBvpn1xuJ/XjOB7kWQchknwCSdHuomAoPC4zpwh2ubVwwtIVRz+1dM+0QTCra0gi3Bn438l9SOS/5Z6fTmpFg+zOrIo120hw6Qj85RGV2hCqoiih7QE3pBr86j8+y8Oe/T0ZyT7WyjX3A+vgHxFpcm</latexit>
Multi-task Vector
<latexit sha1_base64="6RMo9CjhhHq6psRYCHy+djXF3P0=">AAACBnicbVDJSgNBEO2JW4xb1KMIjUHwIGFG3I6BXLwIEbJBEkJPp5I06ekZumvEMOTkxV/x4kERr36DN//GznLQxAdVPN6rorueH0lh0HW/ndTS8srqWno9s7G5tb2T3d2rmjDWHCo8lKGu+8yAFAoqKFBCPdLAAl9CzR8Ux37tHrQRoSrjMIJWwHpKdAVnaKV29rCJ8IBJMVQGdcwROrTMzIDe2mZG7WzOzbsT0EXizUiOzFBqZ7+anZDHASjkkhnT8NwIWwnTKLiEUaYZG4gYH7AeNCxVLADTSiZnjOixVTq0G2pbCulE/b2RsMCYYeDbyYBh38x7Y/E/rxFj97qVCBXFCIpPH+rGkmJIx5nQjtDAUQ4tYVwL+1fK+0wzG4c2GRuCN3/yIqme5b3L/MXdea5wOosjTQ7IETkhHrkiBXJDSqRCOHkkz+SVvDlPzovz7nxMR1PObGef/IHz+QO7Ipk6</latexit>
Constructed Task Masks
<latexit sha1_base64="VrTI/LCVwwMNh8APqJK0nE5nrFM=">AAAB/3icbVDLSgMxFM3UV62vUcGNm2ARKkiZEV/LQjcuK9gHtEPJpJk2NJMZkjtiGbvwV9y4UMStv+HOvzFtZ6GtBwKHc87l3hw/FlyD43xbuaXlldW1/HphY3Nre8fe3WvoKFGU1WkkItXyiWaCS1YHDoK1YsVI6AvW9IfVid+8Z0rzSN7BKGZeSPqSB5wSMFLXPugAe4C0RE5wNQrNqJ5Ex1276JSdKfAicTNSRBlqXfur04toEjIJVBCt264Tg5cSBZwKNi50Es1iQoekz9qGShIy7aXT+8f42Cg9HETKPAl4qv6eSEmo9Sj0TTIkMNDz3kT8z2snEFx7KZdxAkzS2aIgERgiPCkD97hiFMTIEEIVN7diOiCKUDCVFUwJ7vyXF0njrOxeli9uz4uV06yOPDpER6iEXHSFKugG1VAdUfSIntErerOerBfr3fqYRXNWNrOP/sD6/AHiCZX4</latexit>
(a) Compression
<latexit sha1_base64="nzuQzi85E0by7AcROaifF+Ie7M0=">AAACAXicbVDJSgNBEO2JW4xb1IvgpTEIESTMiNsx4MWLEMEskITQ06kkTXp6hu4aMQzx4q948aCIV//Cm39jZzlo4oOCx3tVVNXzIykMuu63k1pYXFpeSa9m1tY3Nrey2zsVE8aaQ5mHMtQ1nxmQQkEZBUqoRRpY4Euo+v2rkV+9B21EqO5wEEEzYF0lOoIztFIru9dAeMAk7x/Rm7ANkgagu0J1h61szi24Y9B54k1JjkxRamW/Gu2QxwEo5JIZU/fcCJsJ0yi4hGGmERuIGO+zLtQtVSwA00zGHwzpoVXatBNqWwrpWP09kbDAmEHg286AYc/MeiPxP68eY+eymQgVxQiKTxZ1YkkxpKM4aFto4CgHljCuhb2V8h7TjKMNLWND8GZfnieVk4J3Xji7Pc0Vj6dxpMk+OSB54pELUiTXpETKhJNH8kxeyZvz5Lw4787HpDXlTGd2yR84nz/ZvpZ1</latexit>
(b) Model merging
<latexit sha1_base64="lgLvlgFGxC2/QSD22ZGJzaDj4ps=">AAAB83icbVDLSgNBEJz1GeMr6tHLYBA8SNgVX8eAFy9CBPOAbAizk95kyOzsMtMrhiW/4cWDIl79GW/+jZNkD5pY0FBUddPdFSRSGHTdb2dpeWV1bb2wUdzc2t7ZLe3tN0ycag51HstYtwJmQAoFdRQooZVoYFEgoRkMbyZ+8xG0EbF6wFECnYj1lQgFZ2gl30d4wuwOdB/G3VLZrbhT0EXi5aRMctS6pS+/F/M0AoVcMmPanptgJ2MaBZcwLvqpgYTxIetD21LFIjCdbHrzmB5bpUfDWNtSSKfq74mMRcaMosB2RgwHZt6biP957RTD604mVJIiKD5bFKaSYkwnAdCe0MBRjixhXAt7K+UDphlHG1PRhuDNv7xIGmcV77JycX9erp7mcRTIITkiJ8QjV6RKbkmN1AknCXkmr+TNSZ0X5935mLUuOfnMAfkD5/MHgNWR7w==</latexit>
Merge
<latexit sha1_base64="tv3Leyi7RhiseAAAJKdpYqspWp0=">AAAB/nicbVDLSgNBEJyNrxhfq+LJy2AQPEjYFV/HQC5ehAjmAckSZieTZMjszDLTK4Yl4K948aCIV7/Dm3/jJNmDJhY0FFXddHeFseAGPO/byS0tr6yu5dcLG5tb2zvu7l7dqERTVqNKKN0MiWGCS1YDDoI1Y81IFArWCIeVid94YNpwJe9hFLMgIn3Je5wSsFLHPWgDe4S0oqQBnVDAt8QMxx236JW8KfAi8TNSRBmqHfer3VU0iZgEKogxLd+LIUiJBk4FGxfaiWExoUPSZy1LJYmYCdLp+WN8bJUu7iltSwKeqr8nUhIZM4pC2xkRGJh5byL+57US6F0HKZdxAkzS2aJeIjAoPMkCd7lmFMTIEkI1t7diOiCaULCJFWwI/vzLi6R+VvIvSxd358XyaRZHHh2iI3SCfHSFyugGVVENUZSiZ/SK3pwn58V5dz5mrTknm9lHf+B8/gDRbpX7</latexit>
Construct Mask
<latexit sha1_base64="/2N/gxjq1JdK+eLcnnW8YTUFlkw=">AAAB+XicbVDJSgNBEK2JW4zbqEcvjUHwFGbE7Rjx4iFChGyQhNDT6Uma9Cx01wTDkD/x4kERr/6JN//GTjIHjT4oeLxXRVU9L5ZCo+N8WbmV1bX1jfxmYWt7Z3fP3j9o6ChRjNdZJCPV8qjmUoS8jgIlb8WK08CTvOmNbmd+c8yVFlFYw0nMuwEdhMIXjKKRerbdQf6Iae2mUiH3VI+mPbvolJw5yF/iZqQIGao9+7PTj1gS8BCZpFq3XSfGbkoVCib5tNBJNI8pG9EBbxsa0oDrbjq/fEpOjNInfqRMhUjm6s+JlAZaTwLPdAYUh3rZm4n/ee0E/etuKsI4QR6yxSI/kQQjMouB9IXiDOXEEMqUMLcSNqSKMjRhFUwI7vLLf0njrOReli4ezotlJ4sjD0dwDKfgwhWU4Q6qUAcGY3iCF3i1UuvZerPeF605K5s5hF+wPr4BARSTNQ==</latexit>
TALL Mask
Figure 1.
Illustration of our mask construction algorithm (left) along with the applications (right) on model compression and model
merging. Each block corresponds to the same weight matrix, and color intensity reflects the value of each parameter – empty means zero
value. Given single-task vectors{τt}
4
t=1 and the merged vectorτMTL, our method constructs per-task masks{mt}
4
t=1 , pinpointing the
important parameters for each original task vector. Formodel merging, we keep only the ‘general’ weights selected by more than one mask
and produce the consensus maskmconsensusand the final merged vector. For compression, we evaluate on each task with reconstructed
task vectors by masking out the irrelevant weights, retaining almost full performance without saving the individual task vectors.
(Panigrahi et al.,;,;,).
We cast the problem of localizing important information
as approximating each original task vector via erasing
task-irrelevant information in the merged multi-task vector
with a data-driven way, resulting in the construction of
task-specific binary masks. We study the statistics of mask
agreements among tasks, and reveal the existence ofcatas-
trophicandselfishweights, i.e., parameters that are deemed
important by none and exclusively one task, respectively.
We then propose Consensus Merging, a method that utilizes
the constructed masks to eliminate thecatastrophicand
selfishweights and is complementary to existing model
merging approaches. Through extensive experimental
validation, we show that our proposed Consensus Merging
consistently improves prior methods. For instance, building
upon task arithmetic (Ilharco et al.,) yields 4.9% gain
in absolute average accuracy on a 20 task vision benchmark,
while we improve TIES (Yadav et al.,) by 6.3% on
an 8-task NLP benchmark.
We also employ the constructed masks towards compressing
the individually fine-tuned checkpoints. Motivated by our
findings that task-specific information is preserved and by
virtue of the masks, we can localize the knowledge of each
task in the merged vector and extract it to approximate the
original single-task vector. We compress the collection of
checkpoints to the zero-shot model, the merged task vector
and binary masks. Our experimental validation shows that
our algorithms retains>99%of original performance in
various vision settings, ranging from small to large ViTs
(Dosovitskiy et al.,) and benchmarks from 8 to 20
tasks, showing remarkable robustness to the increase in
number of tasks. For instance, in a vision benchmark
we compress 20 fine-tuned models from 57Gb to 8.2Gb
retaining99.7% of performance, while model merging
methods almost reset to mere zero-shot performance.
1
In short, our contributions are the following:
•
We show that the task-specific information is preserved
after merging, but task arithmetic cannot properly
utilize it due totask interference. We provide an
efficient algorithm,TALL-masks, to localize the
task-specific information in the multi-task vector,
which deactivates irrelevant parts for each task in the
merged multi-task vector with binary masks.
•
With the constructed task-specific masks, we are able
to eliminate task interference and compress multiple
fine-tuned checkpoints to only the zero-shot model,
the merged task vector and the aforementioned binary
masks while preserving the performance of individual
models.
•
We analyze the profile of mask agreements and identify
the existence of weights deemed important by only
one task or even none. We then propose Consensus
Merging, a model merging method that eliminates
theseselfishandcatastrophicweights, keeping only
generalweights. Our method can be combined with
existing approaches, such as Task Arithmetic or TIES,
and consistently improve over them, showing better
robustness to increasing number of tasks.
•
We perform extensive evaluation on Computer Vision
and NLP benchmarks and show the benefits of our
proposed methods. For model merging, our Consensus
Merging consistently improves prior merging methods,
setting state-of-the-art results. For compression, we
achieve>99%performance retention in across all
1
The source code can be found athttps://github.com/
nik-dim/tall_masks.
2

Localizing Task Information for Improved Model Merging and Compression
vision benchmarks and model sizes while requiring
much less storage compared to the original collection
of fine-tuned checkpoints.
2. Related Work
Weight Interpolation and Model MergingModel editing
directly in the weight space has attracted a lot of attention
in recent years with many works showing that interpolating
the weights of different models results in low-loss paths
(Garipov et al.,;,;,
2020).2021) enacted on these insights
and trained a weight ensemble from scratch, showing better
generalization (Foret et al.,;,) for the
midpoint, while2023) extended these
ideas to Multi-Task Learning and showed that linear weight
subspaces can encode tradeoffs and map to the Pareto Front.
While these works focus on end-to-end training,
(2023) studied pre-trained models and observed that arith-
metic operations among fine-tuned weights generate similar
functional responses and allow for a scalable framework
to endow multi-objective capabilities. Several approaches
have improved this idea by performing merging guided by
various heuristics (Davari & Belilovsky,;,
2023;,), such as resolving interference due to
redundant parameter values and sign disagreements (Yadav
et al.,), by preserving the important parameters de-
fined via the Fisher Information Matrix (Matena & Raffel,
2022;,), or by learning the model merging
weights with unlabeled test data (Yang et al.,).
Jimenez et al.2023) offered more theoretical foundations
on the field ofmodel mergingand identifiedweight disen-
tanglementas the necessary condition for task arithmetic,
while showing that performing fine-tuning on a linearized
model leads to improved model merging.
Reducing complexity for foundation modelsThe
capabilities of foundation models have commanded the
development of methods that address their computational
and memory requirements. Parameter-efficient fine-tuning
(PEFT) (Hu et al.,;,;,
2019) approaches heavily reduce the number of trainable
parameters, enabling efficient adaptation. Several works
(Dettmers et al.,;,) perform quan-
tization after training and reduce the memory footprint of
the model by requiring less bits to represent each parameter.
ComPEFT (Yadav et al.,) addresses an orthogonal
issue, namely communication costs in expert model merg-
ing, by compressing fine-tuning residuals via sparsification
and quantization. Task Arithmetic can also be viewed from
a compression standpoint; multiple functionally diverse
models are combined into one, but severe performance
degradation is observed. BYOM (Jiang et al.,)
sparsifies the residuals before merging but the performance
heavily depends on the chosen sparsity level. In this paper,
we introduce a mechanism that addresses this drop while
significantly compressing the multiple initial checkpoints.
Mixture of Experts (MoE) architectures (Shazeer et al.,
2017;,), where different sub-networks
specialize in various tasks or input regions, offer an effective
approach for handling diverse objectives. However, training
such models from scratch can be complex (Chen et al.,
2022;,). This work explores an alternative
approach, leveraging the power of pre-trained models
and task-specific information localization to create expert
models, potentially streamlining MoE development.
3. Task interference causes performance
degradation
We consider the case ofTtasks, where training for each task
tstarts from pre-trained modelθ0and fine-tunes onD
train
t to
obtainθt. Task arithmetic (Ilharco et al.,) merges the
fine-tuned checkpoints by decoupling the contributions of
the zero-shot model and operating on the space of residuals
ortask vectorsτt=θt−θ0 , generating themulti-task vector
through simple summation:τMTL=
P
t∈[T]
τt . The final
multi-task model corresponds toθ=θ0+ατMTL , where
α >0is a scaling factor tuned on a held-out validation set.
While task arithmetic offers a computationally cheap way
to fuse multiple fine-tuned checkpoints, it suffers from
significant performance drops compared to the single-task
counterparts. Previous works have attributed the perfor-
mance drop to loss of valuable task-specific information
due to parameter interference during merging process
(Yadav et al.,). To better understand the causes for
performance degradation, we make two hypotheses:
•
Information erasure: large amount of information spe-
cific to each task is erased when merging themulti-task
vector.
•
Task interference: the task-specific information is pre-
served in themulti-task vector, but can not manifest
properly due to interference between the tasks.
To validate these hypotheses, we start with a controlled
experiment where information erasure would not happen.
Specifically, for the 8-task vision benchmark proposed by
Ilharco et al.2023), we randomly select a subset of weights
for each task and perform gradient updates only for those
parameters. Hence, task vectors{τt}
8
t=1 form a partition
of the weight space, where all non-overlapping subsets are
of equal size. By design, parameter interference (Yadav
et al.,) is nonexistent and tasks do not compete for
important parameters (Matena & Raffel,;,
2024). Importantly, all the task-relevant information is
preserved inside themulti-task vector.
3

Localizing Task Information for Improved Model Merging and Compression
Table 1.
Performance comparison between standard and non-
overlapping fine-tuning, averaged over 8 tasks (Ilharco et al.,);
the lack ofweight interferenceand the preservation of all task-
specific knowledge in the controlled experiment is not beneficial
for task arithmetic.
Abs. acc. Norm. acc.
Fine-tuning method
Standard 92.8 100.0
Controlled 89.8 100.0
Task arithmetic accuracy
Standard 71.5 77.0
Controlled 68.8 76.6
The results for this control experiment are presented in
Table 1, compared with task arithmetic where the models
are fine-tuned in a standard way. Looking at the normalized
accuracy, defined in Appendix, we observe that the
performance of task arithmetic in the controlled setting
deteriorates at the same rate as standard fine-tuning, where
the accuracy of the merged model is 2.7% worse than stan-
dard case. This suggests that, even when the task-specific
knowledge is perfectly preserved inside themulti-task
vector, task arithmetic fails to properly utilize the relevant
information to restore the fine-tuned performance. It hints
thattask interferenceis the culprit for the performance
decline of task arithmetic rather thanweight interference.
Specifically, while task-specific parameters remain constant,
alterations in other tasks lead to changes in discriminating
features for that task, perturbing the mapping from the
task’s input distribution to output.
4.TALL-masks: Localizing task-specific
information inmulti-task vector
In the controlled experiment, the fine-tuning performance
can be easily restored by localizing task-specific information
in themulti-task vectorwith the masks used for fine-tuning.
Now we shift our focus to the general setting of standard
fine-tuning and investigate the percentage of information
preserved after merging.
We formulate the problem of localization of task-specific
knowledge (Panigrahi et al.,;,) as ex-
tracting relevant weight subsets from themulti-task vec-
torwith binary masks, such that the extracted weights ap-
proximate the original task vectorτt. The binary mask
deactivates irrelevant weights inmulti-task vectorwhile
keeping only the task-specific information. Our algorithm,
TALL-masksforTA
skLocaLization Masks, constructs
masksmttargeting to construct
ˆ
θtsuch that:
ˆ
θt=θ0+mt◦τMTL≈θt (1)
We minimize theℓ1distance between the reconstructed
ˆ
θtCars DTD
EuroSAT GTSRB MNIST
RESISC45
SVHN
SUN397
0
50
100
Percentage
Percentage
0
50
100
Norm. Valid Acc.
Accuracy -
^
t
Accuracy - TA
Figure 2.TALL-masks
localizes task-specific information. The
bar plot shows the percentage of parameters selected by
TALL-masks, while the blue line shows the normalized valida-
tion accuracy achieved by the re-constructed
ˆ
θt
with the selected
masks using. The lightblue dashed line shows the
task arithmetic baseline where the information is not localized.
Our task-specific masks allow the restoration of full performance,
showing that all knowledge embedded in the initial fine-tuned
checkpoints is preserved post merging.
and fine-tuned modelθt:
m
∗
t= argmin
mt∈{0,1}
P
∥
ˆ
θt−θt∥1 (2)
= argmin
mt∈{0,1}
P
∥mt◦τMTL−τt∥1 (3)
=1{|τt| ≥ |τMTL−τt|} (4)
HerePstands for the total number of parameters, the
detailed derivation are given in Appendix. Furthermore,
we add a hyper-parameterλton the right hand side of
Equation
to extract frommulti-task vector; the smallerλt, the more
parameters get selected bymt. Finally, we construct the
task-specific masks based on:
mt=1{|τt| ≥ |τMTL−τt| ·λt} (5)
Note thatλtis selected based on the validation accuracy
of each task respectively, allowing for the task-specific
problems to be solved in parallel and independently.
We validate the efficacy of our mask construction by
checking if the original performance in the same 8-task
computer vision benchmark, evaluated on a held-out dataset,
can be restored. Specifically, we construct the masks for
each dataset via
Ilharco et al.2023), and evaluate with reconstructed models
as in.
can be retained by simply deactivating irrelevant parameter
subsets with binary masks. Thus, it shows that all the
information embedded in the original checkpoints is not
erased but rather preserved in themulti-task vector.
4

Localizing Task Information for Improved Model Merging and Compression
5. Applications
Based on these observations, we present two application
scenarios of the masking algorithm for compressing the task
vectors and improving model merging methods.
5.1. Compressing Task Vectors
Motivated by the previous results, we employ the masks
for compressing the fine-tuned checkpoints. Since full per-
formance can be retained by the constructed models using
the masks, it allows us to significantly reduce the required
storage cost without sacrificing performance.
Specifically, instead of the collection of fine-tuned check-
points{θt}
T
t=1 , we can save only the pre-trained modelθ0,
themulti-task vectorτMTLand the individual task-specific
binary masks{mt}
T
t=1 . For evaluation on taskt, we con-
struct a specialized model by adding to the pre-trained only
task-relevant subsets from themulti-task vector:
ˆ
θt=θ0+mt◦τMTL (6)
In this way, it allows significant compression while main-
taining the majority of performance for fine-tuned models
without saving the individual checkpoints. For example,
it requires only 13.7% of storage compared to saving the
ViT-L/14 checkpoints for a 20-task benchmark. We provide
the details for storage comparison in Appendix.
5.2. Improving Model Merging
While the storage of task-specific masks introduces extra
storage cost compared with Task Arithmetic, we present
here another application ofTALL-masks, Consensus
Merging, which improves over model merging methods
without requiring extra storage.
The construction of the task-specific masks{mt}
T
t=1 allows
us to investigate the relevance of parameters in themulti-
task vectorto each task, where we assume that if a parameter
is included in a mask, it is relevant for the associated task.
We find that many parameters in themulti-task vectorare
relevant to only a subset of tasks. LetPbe the total number
of parameters, we definemask agreement percentageas the
fraction of weights deemed important by exactlynout of
Ttasks:
α({mt}
T
t=1, n) =
1
P

1



X
t∈[T]
mt=n




0
(7)
The histogram for the mask agreement percentage is shown
in, for both the multi-task vectormerged with Task
Arithmetic and TIES, while we provide in Appendix
cases for more tasks. We observe that there exists a subset of
parameters which are used by no task at all, which we term0 1 2 3 4 5 6 7 8
Number of tasks
0
20
40
Active weights perc. (%)
Weights - TA Weights - TIES
Figure 3.
The distribution of mask agreements in the merged vector
produced by two model merging methods, Task Arithmetic and
TIES. A non-negligent fraction of weights is important exclusively
to one task (selfish) while another fraction is irrelevant to all tasks
(catastrophic). Our method eliminates both categories to improve
model merging.
catastrophicweights as their existence would only introduce
unnecessary interference and hurt the performance of model
merging. Furthermore, we also identify there exists a non-
negligible fraction of weights used by only one task, which
we termselfishweights, as their existence only benefits one
task whereas causingtask interferencefor all other tasks.
We term the rest of weights asgeneralweights, as they are
relevant to at least two tasks and their importance grows with
the number of relevant tasks. Similarly, we termuniversal
the weights deemed important for all tasks(n=T).
Based on these observations, we present Consensus Merg-
ing, which is targeted to reducetask interferencefor better
model merging. Formally, we form theconsensus maskfor
thresholdk∈ {0, . . . , T}as:
mconsensus=1



X
t∈[T]
mt≥k



, (8)
and filter themulti-task vectorthrough a Hadamard product:
τconsensus=mconsensus◦τMTL, (9)
wherekin
threshold, e.g., the minimal number of activated masks for
preventing the weights from being pruned. By settingk= 2
in catastrophicweights
andselfishweights in themulti-task vectorto reducetask
interference, keeping onlygeneralweights that are globally
important to at least two tasks. Thethresholdkaffects per-
formance and depends on the task number and combination,
as well as the underlying model merging method, as
ure 3 mask agreement profilesfor
Task Arithmetic and TIES. In the following, we usek= 2,
unless specified otherwise.
Finally, we note that both the proposed compression method
and Consensus Merging are orthogonal to existing model
5

Localizing Task Information for Improved Model Merging and Compression
Table 2.
Comparison of model merging (top) and compression (bottom) methods on three sets of NLP benchmarks with a T5-large model,
in terms of accuracy, absolute and normalized (in parentheses), as well as storage cost.
7 NLP tasks (Yadav et al.,) 8 QA tasks (Zhou et al.,) All 11 tasks
Method
Acc.(%)↑ Bits(Gb)↓ Acc.(%)↑ Bits(Gb)↓ Acc.(%)↑Bits(Gb)↓
Zero-shot 44.9 25.1 33.1 25.1 36.9 25.1
Weight averaging 60.5
(72.7) 25.1
(69.9) 25.1
(71.6) 25.1
Task arithmetic 71.9
(85.3) 25.1
(79.6) 25.1
(81.8) 25.1
TIES 69.6
(83.5) 25.1
(78.9) 25.1
(82.6) 25.1
Consensus TA [ours] 73.5
(87.7) 25.1
(85.4) 25.1 67.5
(86.8) 25.1
Merging
Consensus TIES [ours] 71.0
(84.2) 25.1 69.1
(85.8) 25.1
(85.7) 25.1
Fine-tuned 85.9 169.1 80.7 193.1 78.7 265.1
Magnitude Pruning 81.6
(93.5) >54.3
(85.7) >55.1
(81.4) >57.3
Magnitude Masking 78.9
(90.7) 54.3
(88.8) 55.1
(87.2) 57.3
TALL Mask + TA [ours] 86.8
(102.2) 54.3
(98.7) 55.1
(96.2) 57.3
CompressionTALL Mask + TIES [ours] 83.4
(95.4) 54.3 79.7
(98.8) 55.1 77.4
(97.5) 57.3
merging approaches and can be easily plugged in since they
operate on themulti-task vectorτMTL, e.g.,τMTLcan be
produced by Task Arithmetic (Ilharco et al.,), TIES
(Yadav et al.,) or other algorithms (Matena & Raffel,
2022;,). Practitioners can toggle between
these two applications depending on the usage scenario.
6. Experiments
6.1. Model Merging
BaselinesWe compare Consensus Merging with several
train-free model-merging methods, including weight averag-
ing, Task Arithmetic (Ilharco et al.,), and TIES (Yadav
et al.,). Our method is complementary to all and we
opt to validate its efficacy when combined with the latter
two. Specifically, we term our methods Consensus Task
Arithmetic and Consensus TIES when combined with them
respectively. We include also individually fine-tuned models
and the zero-shot model as higher and lower bounds on per-
formance, respectively. We assess the performance based
on both the averaged absolute accuracy, and normalized
accuracy, defined in detail in Appendix.
Natural Language ProcessingWe explore NLP bench-
marks following2023b);2024).
We use a variant of T5-large model (Raffel et al.,),
T5-large-LM-Adapt (Lester et al.,), and evaluate our
method on three sets of benchmarks studied in previous
works, a 7-task NLP benchmark (Yadav et al.,), an
8-task benchmark geared towards Question-Answering
(Zhou et al.,), as well as their union amounting to 11
tasks overall. More details about the task composition for
each benchmark are provided in. We use the
publicly released checkpoints from2024).
Table 2
Consensus Merging consistently improves over both Task
Arithmetic and TIES, leading to significant performance
gains across settings. For example, in the 8-QA benchmark,
our methods respectively improve over Task arithmetic by
4.8% and TIES by 6.3% in absolute accuracy, while perfor-
mance is enhanced by over 2.9% and 2.8%, respectively, for
the 11-task benchmark. We also conduct experiments with
checkpoints finetuned with parameter-efficient methods
in (IA)
3
(Liu et al.,); Appendix
proposed method again results in significant gains over
Task Arithmetic and TIES.
Computer VisionWe consider 3 test scenarios, where the
number of tasks in each scenario increases gradually from
8 to 14 and 20. The 8-task benchmark coincides with the
experimental setup originally introduced by
(2023) and is expanded to further illuminate the effect of
larger number of tasks. The full details for the benchmarks
are provided in Appendix. For each test scenario, we
assess the efficacy of our method on three CLIP model
variants (Radford et al.,) with ViT-B/32, ViT-B/16,
and ViT-L/14 as visual encoders (Dosovitskiy et al.,).
All methods use the same checkpoints, fine-tuned with the
setting outlined in2023).
The results in image classification are shown in.
Similar to NLP, Consensus Merging provides the best
results in 5 out of 6 scenarios, with its superiority becoming
increasingly apparent as both the number of tasks and model
size grow. This pattern is also noted for the ViT-B/16, as
detailed in. Our algorithm offers
consistent enhancements over Task Arithmetic, while its
advantages over TIES are most pronounced in larger models.
For example, in the most extensive evaluation involving 20
tasks with a ViT-L/14 encoder, our approach improves on
Task Arithmetic and TIES by 4.9% and 1.1%, respectively.
6

Localizing Task Information for Improved Model Merging and Compression
Table 3.
Comparison of model merging (top) and compression (bottom) methods across three test scenarios in image classification with
different ViT encoders for CLIP models, in terms of accuracy, absolute and normalized (in parentheses), as well as storage cost (in Gb).
ViT-B/32 ViT-L/14
8 tasks 14 tasks 20 tasks 8 tasks 14 tasks 20 tasks
Method
Acc.(%)↑Bits↓ Acc.(%)↑Bits↓ Acc.(%)↑Bits↓ Acc.(%)↑Bits↓ Acc.(%)↑Bits↓ Acc.(%)↑Bits↓
Zeroshot 48.4 3.6 57.3 3.6 56.1 3.6 64.4 11.0 68.0 11.0 65.1 11.0
Weight averaging 66.5
(72.3)3.6
(71.2)3.6
(67.5)3.6
(83.0)11.0
(81.0)11.0
(75.5)11.0
Task arithmetic 70.8
(76.5)3.6
(72.2)3.6
(66.8)3.6
(88.5)11.0
(83.8)11.0
(78.0)11.0
TIES 75.1
(81.0)3.6
(74.8)3.6
(69.9)3.6 86.9
(90.7)11.0
(84.1)11.0
(79.8)11.0
Consensus TA [ours] 75.0
(80.8)3.6 70.4
(77.4)3.6 65.4
(72.0)3.6
(89.9)11.0 82.2
(86.9)11.0 78.9
(83.2)11.0
Merging
Consensus TIES [ours] 74.8
(80.6)3.6
(74.5)3.6
(69.6)3.6 86.9
(90.7)11.0
(86.1)11.0
(80.9)11.0
Fine-tuned 92.8 23.3 90.9 40.2 91.3 57.0 95.8 79.1 94.3 137.4 94.7 195.8
Magnitude Pruning 91.3
(98.4)>7.1
(93.7)>7.7
(91.2)>8.2
(99.6)>23.1
(97.2)>24.9
(96.1)>26.8
Magnitude Masking 86.8
(93.3)7.1
(88.4)7.7
(82.1)8.2
(98.7)23.1
(97.0)24.9
(96.5)26.8
TALL Mask + TA [ours] 92.6
(99.7)7.1
(99.1)7.7
(99.2)8.2
(99.9)23.1
(98.8)24.9
(98.9)26.8
Compression
TALL Mask + TIES [ours] 93.0
(100.3)7.1 90.9
(100.0)7.7 91.1
(99.7)8.2 95.9
(100.1)23.1 93.4
(99.0)24.9 93.9
(99.1)26.8
6.2. Compression
We adopt the compression technique discussed in
tion 5.1
the prefix ‘TALL Mask +’ when combined with different
model merging methods.
BaselinesWe compare against methods on the same level
of storage as our proposed solution. We first consider
Magnitude Masking, where per-task masks are constructed
by replacing our procedure in
topk%of the parameters. After experimental validation,
we found thatk= 10works well in general and therefore
use it throughout. We also consider unstructured Magnitude
Pruning of the individual task vectors, where we set the
level of pruning so that the storage cost of one pre-trained
model andTpruned task vectors is equal to ours. Note that,
in practice, Magnitude Pruning takes more storage than
our method as we do not consider the storage cost for the
positions of the parameters.
Natural Language ProcessingFollowing the same ex-
perimental setting for merging, the results are presented
in the bottom half of. Our compression scheme
effectively retains performance while requiring much less
storage cost than storing individual fine-tuned models. In
contrast, magnitude-based solutions suffer from severe per-
formance losses, especially as the number of tasks increases.
For example, for the 7-task benchmark, our method keeps
all the performance from the fine-tuned models with even
over 100% normalized accuracy, while requiring less than
1/3 storage cost than storing the checkpoints. The advan-
tage on storage compression becomes more pronounced
with larger number of tasks. For example, for the 11-task
benchmark, both our methods require only around 1/5 of
the storage of individual checkpoints, while keeping at least
96.2% of the performance. By keeping the same level of
storage, separately pruned models preserve lower number
of parameters and the cost in performance mirrors this lack
of task-specific features; for the 11-task benchmark only
81.4% of performance is retained compared to our perfor-
mance of97.5%. In contrast, our method takes advantage
of the redundancy in the task vector to compress effectively
with minimal performance losses.
Computer VisionThe bottom half of
results for compression in vision. While we observe that
both Magnitude Pruning and Magnitude Masking show a
clear performance degradation with increasing number of
tasks, our proposed methods deliver almost full performance
in all test scenarios.
Specifically, TALL Mask + Task Arithmetic achieves around
99% normalized accuracy on all cases, where there is almost
no performance degradation with increasing number of tasks.
TALL Mask + TIES performs even better, with its normal-
ized accuracy being over 99% for all the test cases. For ViT-
B/32, it achieves around 100% for all test scenarios without
any loss of performance. For ViT-L/14, TALL Mask + TIES
still performs exceptionally, with its normalized accuracy
being around 100% for 8 tasks and over 99% for 14 and 20
tasks. These results show that our method is able to capture
the crucial task-specific information buried in the merged
vector and is not bound by model scale or number of tasks.
In terms of storage, TALL Mask + Task Arithmetic requires
much less storage cost compared to storing the individual
fine-tuned models, where the storage saving is more
pronounced with a larger number of tasks. Importantly,
our compression scheme withTALL-masks, provides
7

Localizing Task Information for Improved Model Merging and CompressionMNIST
Cars
DTD
EuroSAT
GTSRB
RESISC45
SUN397
SVHN
MNIST
20
40
60
80
(a) 8 Tasks
MNIST
Cars
DTD
EuroSATGTSRB
RESISC45
SUN397
SVHN
Flowers102
CIFAR100
PCAM STL10
OxfordIIITPet
FER2013
MNIST
20
40
60
80
(b) 14 Tasks
MNIST
Cars
DTD
EuroSAT
GTSRB
RESISC45
SUN397
SVHN
Flowers102
CIFAR100
PCAM
STL10
OxfordIIITPet
FER2013
EMNIST
CIFAR10
Food101
FashionMNIST
KMNIST
RenderedSST2
MNIST
20
40
60
80
(c) 20 Tasks
Fine-tuned TALL Mask + TA (Ours) Consensus TA (Ours) TA TIES Zero-shot
Figure 4.
Comparison of absolute accuracy (%) of individual tasks for the computer vision benchmarks and ViT-B/32. Results for
ViT-B/16 and ViT-L/14 are provided in the appendix. Our Consensus Merging shows higher performance compared to model merging
baselines, especially for the settings with more tasks. Our compression algorithm consistently matches the performance of the individual
fine-tuned models at a fraction of the memory, while model merging techniques are not robust to the increase of tasks. 5 10 15 20
Number of tasks
60
80
100
Norm. Acc. (%)
Fine-tuned TALL Mask + TA (Ours) Consensus TA (Ours) TA TIES Zero-shot
5 10 15 20
Number of tasks
20
40
Storage (Gb)
Figure 5.
Averaged normalized accuracyvs. number of tasks for computer vision benchmarks. Our proposed specialist algorithm maintains
initial performance regardless of task combination and heavily compresses the fine-tuned checkpoints.
an efficient trade-off between performance and storage
compared to the model-merging methods. For example,
using TALL Mask + Task Arithmetic on 20 tasks with
ViT-B/32 takes 8.2 Gb storage while achieving 90.6%
absolute accuracy, whereas using TIES on the same 20
tasks with ViT-L/14 takes 11.0 Gb but delivers an absolute
accuracy of merely 75.7%. Overall, our method delivers
a desirable trade-off in the Pareto Front of performance
retention vs total storage cost.
6.3. Individual-task performance
We now shift our focus from statistics over all tasks and
present the performance on individual tasks. For vision
settings, we compare the performance of TALL Mask
+ Task Arithmetic and Consensus Task Arithmetic with
baselines methods, and plot in
accuracies on three benchmarks with ViT-B/32, while we
provide the vision results with two other models as well as
the results in NLP settings in Appendix. We observe
that TALL Mask + Task Arithmetic consistently delivers
performance the same level as individual fine-tuned models,
across datasets and total number of tasks.
On the other hand, the expansion of the considered tasks
results in significant performance drop for model merging
methods, where for some datasets the performance is even
reset back to zero-shot model. Yet, we observe that Con-
sensus Task Arithmetic suffers the least from increasing
number of tasks and shines among the model merging meth-
ods, especially in the 20-task setting, where it outperforms
the other model merging methods in almost all the datasets.
6.4. Performance with varying task combinations
We perform a systematic study on the effect of different task
combinations on our methods and model-merging baselines,
8

Localizing Task Information for Improved Model Merging and Compression0 5 10 15 20
Weight-pruning thresholdk
55
60
65
70
Avg. test accuracy (%)
TA
TIES
Zero-shot
Consensus TA Consensus TIES
Figure 6.
Performance of Consensus Merging with varying weight-
pruning thresholdk, on the 20-task image classification benchmark
with ViT-B/32.
and present both the normalized accuracy and the storage
cost with respect to different numbers of tasks in.
Due to the vast array of2
20potential combinations, the per-
formance is averaged on 8 carefully selected representative
sample task combinations for each number of tasks; the de-
tails for selection process are given in. Our ob-
servations indicate that our TALL Mask + Task Arithmetic
consistently matches the performance of individually fine-
tuned models across all combinations, regardless of the total
task count. By comparison, model merging methods show
a stark degradation in performance, especially given a large
number of tasks where their performance degrade gradually
to nearly zero-shot performance. Nonetheless, Consensus
Task Arithmetic outperforms the two other model merging
methods in general, while the performance gain gradually
becomes apparent with larger number of tasks.
While the model merging methods have constant storage
cost, their applicability is undermined by low performance.
Conversely, maintaining individual models is excessively
prohibitive in terms of storage but guarantees strong
performance. Localizing the task-specific parameters with
our proposed masks offers a favorable trade-off between
performance and storage. Crucially, we deliver consistent
performance with near 100% normalized accuracy across
various task combinations and numbers of tasks, while
heavily compressing the required storage cost, as we see in
the right plot of. It presents a more viable solution
in scenarios where balancing performance with storage
efficiency is essential.
6.5. Effect of weight-pruning threshold
The weight-pruning thresholdkis used to form the consen-
sus mask in
vated tasks to prevent weights from being pruned. Extending
the number of tasks, modifying the task combination and
the model merging method itself affect themask agreement
profile, defined in, and consequently the optimal
pruning-threshold and the overall performance. We present
the performance of Consensus Merging with ViT-B/32 on
various image classification benchmarks in
20-task benchmark and
and 14-task benchmarks whenkgradually increases from
1 to over the total number of tasks. We observe from the fig-
ure that while the optimal performance of Consensus TA is
usually achieved by settingk= 2, i.e., removing bothcatas-
trophicandselfishweights, Consensus TIES achieves its
optimal performance by settingk= 1, i.e., removing only
catastrophic weights. We present also the results for remov-
ing only catastrophic weights in Appendix
scenarios with Computer Vision, where we observe the per-
formance of Consensus TIES consistently outperform TIES.
The difference in optimal thresholdskbetween Task
Arithmetic and TIES originates in their profiles, as shown
in; the pruning and sign resolu-
tion mechanisms of TIES shift the distribution towards
increaseduniversalandselfishweights. Removing the
latter altogether results in a significant reduction of salient
weights. Studying how different merging strategies affect
the weight profiles remains an interesting future direction.
7. Conclusion
In this study, we introducedTALL-masks, a method with
dual practical applications: it effectively addresses the sig-
nificant issue of compressing foundation models and facil-
itates model merging. We identify the reason for perfor-
mance degradation of model merging methods is not due
to information erasure during merging process, but because
of task interference during evaluation. Our research demon-
strates that the multi-task vector retains crucial task-specific
information, whichTALL-maskslocalizes and extracts
through the use of task-specific binary masks, thereby en-
abling the recovery of original fine-tuned performance lev-
els. We utilize this observation to compress a collection
of fine-tuned checkpoints into storing only the zero-shot
model, the information-rich merged vector and the specified
masks. Further, our examination of mask statistics uncov-
ered weights harmful to model merging. We thus proposed
their elimination and achieved state-of-the-art performance
in common model merging benchmarks. Our proposed solu-
tion not only enhances the utility of foundation models but
also makes them more accessible and sustainable, paving the
way for broader applications and advancements in the field.
Impact Statements
This paper presents work whose goal is to advance the field
of Machine Learning. There are many potential societal
consequences of our work, none which we feel must be
specifically highlighted here.
9

Localizing Task Information for Improved Model Merging and Compression
Acknowledgments
We thank Alessandro Favero, Thibault Sejourn´e and the
anonymous reviewers for helpful feedback and comments.
References
Bayazit, D., Foroutan, N., Chen, Z., Weiss, G., and Bosse-
lut, A. Discovering knowledge-critical subnetworks
in pretrained language models.arXiv, 2023. URL
http://arxiv.org/abs/2310.03084v1 .
Bossard, L., Guillaumin, M., and Van Gool, L. Food-
101 – Mining Discriminative Components with Random
Forests. InIEEE European Conference on Computer
Vision (ECCV), 2014.
Caruana, R. Multitask Learning.Machine Learning, 28(1):
41–75, 1997.
Chaudhari, P. A.A Picture of the Energy Landscape of Deep
Neural Networks. PhD thesis, 2018.
Chen, Z., Deng, Y., Wu, Y., Gu, Q., and Li, Y. Towards un-
derstanding the mixture-of-experts layer in deep learning.
InAdvances in Neural Information Processing Systems
(NeurIPS), 2022.
Cheng, G., Han, J., and Lu, X. Remote sensing image scene
classification: Benchmark and state of the art.Proceed-
ings of the IEEE, 2017. URLhttp://arxiv.org/
abs/1703.00121v1.
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., and
Vedaldi, A. Describing textures in the wild. InIEEE
Conference on Computer Vision and Pattern Recogni-
tion (CVPR), 2014. URLhttp://arxiv.org/abs/
1311.3618v2.
Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A.,
Yamamoto, K., and Ha, D. Deep learning for classical
japanese literature.arXiv, 2018. URLhttp://arxiv.
org/abs/1812.01718v1 .
Coates, A., Ng, A., and Lee, H. An analysis of single-
layer networks in unsupervised feature learning. In
International Conference on Artificial Intelligence and
Statistics (AISTATS), 2011.https://proceedings.
mlr.press/v15/coates11a.html .
Cohen, G., Afshar, S., Tapson, J., and Van Schaik, A. EM-
NIST: Extending MNIST to handwritten letters. InInter-
national Joint Conference on Neural Networks (IJCNN),
2017.
Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., and
Wei, F. Knowledge Neurons in Pretrained Transform-
ers. InAssociation for Computational Linguistics (ACL),
2022. URLhttps://aclanthology.org/2022.
acl-long.581.
Davari, M. and Belilovsky, E. Model Breadcrumbs: Scal-
ing Multi-Task Model Merging with Sparse Masks.
arXiv, 2023. URLhttp://arxiv.org/abs/2312.
06795v1.
Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer,
L. GPT3.int8(): 8-bit Matrix Multiplication for Trans-
formers at Scale. InAdvances in Neural Information
Processing Systems (NeurIPS), 2022. URLhttps:
//openreview.net/forum?id=dXiGWqBoxaD .
Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L.
Qlora: Efficient finetuning of quantized llms.arXiv, 2023.
URLhttp://arxiv.org/abs/2305.14314v1 .
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
BERT: Pre-training of Deep Bidirectional Transform-
ers for Language Understanding. InNorth American
Chapter of the Association for Computational Linguistics
(NAACL), 2019.https://aclanthology.org/
N19-1423.
Dimitriadis, N., Frossard, P., and Fleuret, F. Pareto Manifold
Learning: Tackling multiple tasks via ensembles of single-
task models. InInternational Conference on Machine
Learning (ICML), 2023.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby,
N. An Image is Worth 16x16 Words: Transformers for
Image Recognition at Scale. InInternational Conference
on Learning Representations (ICLR), 2021. URLhttp:
//arxiv.org/abs/2010.11929v2 .
Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht,
F. A. Essentially No Barriers in Neural Network Energy
Landscape. InInternational Conference on Machine
Learning (ICML), 2018. URLhttp://arxiv.org/
abs/1803.00885v5.
Fedus, W., Zoph, B., and Shazeer, N. Switch transform-
ers: Scaling to trillion parameter models with simple
and efficient sparsity.Journal of Machine Learning
Research (JMLR), 23(120):1–39, 2022. URLhttp:
//arxiv.org/abs/2101.03961v3 .
Fifty, C., Amid, E., Zhao, Z., Yu, T., Anil, R., and Finn,
C. Efficiently Identifying Task Groupings for Multi-Task
Learning.arXiv, 2021. URLhttp://arxiv.org/
abs/2109.04617v2.
Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B.
Sharpness-Aware Minimization for Efficiently Improv-
ing Generalization. InInternational Conference on
10

Localizing Task Information for Improved Model Merging and Compression
Learning Representations (ICLR), 2021. URLhttp:
//arxiv.org/abs/2010.01412v3 .
Frankle, J., Dziugaite, G. K., Roy, D., and Carbin, M.
Linear mode connectivity and the lottery ticket hypoth-
esis. InInternational Conference on Machine Learn-
ing (ICML), 2020. URLhttp://arxiv.org/abs/
1912.05671v4.
Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P.,
and Wilson, A. G. Loss Surfaces, Mode Connectivity,
and Fast Ensembling of DNNs. InAdvances in Neural
Information Processing Systems (NeurIPS), 2018. URL
http://arxiv.org/abs/1802.10026v4 .
Goodfellow, I. J., Erhan, D., Carrier, P. L., Courville, A.,
Mirza, M., Hamner, B., Cukierski, W., Tang, Y., Thaler,
D., Lee, D.-H., et al. Challenges in representation learn-
ing: A report on three machine learning contests. InInter-
national Conference on Neural Information Processing
(ICONIP), 2013. URLhttp://arxiv.org/abs/
1307.0414v1.
Helber, P., Bischke, B., Dengel, A., and Borth, D. Eu-
roSAT: A Novel Dataset and Deep Learning Benchmark
for Land Use and Land Cover Classification.IEEE J.
Sel. Top. Appl. Earth Obs. Remote. Sens., 12(7):2217–
2226, 2019. URLhttps://doi.org/10.1109/
JSTARS.2019.2918242.
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B.,
De Laroussilhe, Q., Gesmundo, A., Attariyan, M.,
and Gelly, S. Parameter-efficient transfer learning for
NLP. InInternational Conference on Machine Learn-
ing (ICML), 2019. URLhttp://arxiv.org/abs/
1902.00751v2.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang,
S., Wang, L., and Chen, W. LoRA: Low-Rank Adaptation
of Large Language Models. InInternational Conference
on Learning Representations (ICLR), 2022.https://
openreview.net/forum?id=nZeVKeeFYf9 .
Huang, L., Bras, R. L., Bhagavatula, C., and Choi, Y.
Cosmos QA: Machine reading comprehension with con-
textual commonsense reasoning.arXiv, 2019. URL
http://arxiv.org/abs/1909.00277v2 .
Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Car-
lini, N., Taori, R., Dave, A., Shankar, V., Namkoong,
H., Miller, J., Hajishirzi, H., Farhadi, A., and Schmidt,
L. OpenCLIP, 2021.https://doi.org/10.5281/
zenodo.5143773.
Ilharco, G., Wortsman, M., Gadre, S. Y., Song, S., Hajishirzi,
H., Kornblith, S., Farhadi, A., and Schmidt, L. Patch-
ing open-vocabulary models by interpolating weights.
InAdvances in Neural Information Processing Systems
(NeurIPS), 2022. URLhttp://arxiv.org/abs/
2208.05592v2.
Ilharco, G., Ribeiro, M. T., Wortsman, M., Gururangan,
S., Schmidt, L., Hajishirzi, H., and Farhadi, A. Editing
models with task arithmetic. InInternational Conference
on Learning Representations (ICLR), 2023.https://
arxiv.org/abs/2110.08207 .
Jiang, W., Lin, B., Shi, H., Zhang, Y., Li, Z., and Kwok,
J. T. BYOM: Building Your Own Multi-Task Model For
Free, 2024. URLhttp://arxiv.org/abs/2310.
01886v3.
Jin, X., Ren, X., Preotiuc-Pietro, D., and Cheng, P. Data-
less Knowledge Fusion by Merging Weights of Lan-
guage Models. InInternational Conference on Learn-
ing Representations (ICLR), 2023. URLhttps://
openreview.net/forum?id=FCnohuR6AnM .
Khot, T., Clark, P., Guerquin, M., Jansen, P., and Sabhar-
wal, A. QASC: A Dataset for Question Answering via
Sentence Composition, 2020. URLhttp://arxiv.
org/abs/1910.11473v2 .
Krause, J., Stark, M., Deng, J., and Fei-Fei, L. 3d object rep-
resentations for fine-grained categorization. InProceed-
ings of the IEEE international conference on computer
vision workshops, 2013.
Krizhevsky, A., Hinton, G., et al. Learning mul-
tiple layers of features from tiny images, 2009.
https://www.cs.toronto.edu/ ˜kriz/
learning-features-2009-TR.pdf .
LeCun, Y. The MNIST database of handwritten digits, 1998.
http://yann.lecun.com/exdb/mnist/ .
Lester, B., Al-Rfou, R., and Constant, N. The power of
scale for parameter-efficient prompt tuning.arXiv, 2021.
URLhttp://arxiv.org/abs/2104.08691v2 .
Levesque, H., Davis, E., and Morgenstern, L. The winograd
schema challenge. InThirteenth international confer-
ence on the principles of knowledge representation and
reasoning, 2012.
Liang, T., Glossner, J., Wang, L., Shi, S., and Zhang, X.
Pruning and quantization for deep neural network accel-
eration: A survey.Neurocomputing, 461:370–403, 2021.
URLhttp://arxiv.org/abs/2101.09671v3 .
Lin, K., Tafjord, O., Clark, P., and Gardner, M. Reasoning
over paragraph effects in situations.arXiv, 2019. URL
http://arxiv.org/abs/1908.05852v2 .
11

Localizing Task Information for Improved Model Merging and Compression
Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal,
M., and Raffel, C. A. Few-shot parameter-efficient fine-
tuning is better and cheaper than in-context learning.
InAdvances in Neural Information Processing Systems
(NeurIPS), 2022. URLhttp://arxiv.org/abs/
2205.05638v2.
Luo, S., Tan, Y., Patil, S., Gu, D., von Platen, P., Passos, A.,
Huang, L., Li, J., and Zhao, H. Lcm-lora: A universal
stable-diffusion acceleration module.arXiv, 2023.
Matena, M. S. and Raffel, C. A. Merging models with fisher-
weighted averaging. InAdvances in Neural Information
Processing Systems (NeurIPS), 2022. URLhttp://
arxiv.org/abs/2111.09832v2 .
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu,
B., and Ng, A. Y. Reading Digits in Natural
Images with Unsupervised Feature Learning. In
NIPS Workshop on Deep Learning and Unsuper-
vised Feature Learning 2011, 2011. URLhttp:
//ufldl.stanford.edu/housenumbers/
nips2011_housenumbers.pdf .
Nilsback, M.-E. and Zisserman, A. Automated flower clas-
sification over a large number of classes. In2008 Sixth
Indian conference on computer vision, graphics & image
processing, 2008.
Ortiz-Jimenez, G., Favero, A., and Frossard, P. Task
Arithmetic in the Tangent Space: Improved Editing of
Pre-Trained Models. InAdvances in Neural Informa-
tion Processing Systems (NeurIPS), 2023. URLhttp:
//arxiv.org/abs/2305.12827v3 .
Panigrahi, A., Saunshi, N., Zhao, H., and Arora, S. Task-
specific skill localization in fine-tuned language mod-
els. InInternational Conference on Machine Learn-
ing (ICML), 2023. URLhttp://arxiv.org/abs/
2302.06600v2.
Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C.
Cats and dogs. InIEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2012.
Pruksachatkun, Y., Phang, J., Liu, H., Htut, P. M., Zhang,
X., Pang, R. Y., Vania, C., Kann, K., and Bowman,
S. R. Intermediate-Task Transfer Learning with Pre-
trained Language Models: When and Why Does It
Work? InAssociation for Computational Linguistics
(ACL), 2020. URLhttps://aclanthology.org/
2020.acl-main.467.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and
Sutskever, I. Language Models are Unsupervised Multi-
task Learners, 2019.https://openai.com/blog/
better-language-models/ .
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., Krueger, G., and Sutskever, I. Learning Transfer-
able Visual Models From Natural Language Supervi-
sion. InInternational Conference on Machine Learn-
ing (ICML), 2021. URLhttp://arxiv.org/abs/
2103.00020v1.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,
Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the
limits of transfer learning with a unified text-to-text trans-
former.Journal of Machine Learning Research (JMLR),
21(140):1–67, 2020.
Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G.,
Novikov, A., Barth-maron, G., Gim´enez, M., Sulsky,
Y., Kay, J., Springenberg, J. T., Eccles, T., Bruce, J.,
Razavi, A., Edwards, A., Heess, N., Chen, Y., Hadsell,
R., Vinyals, O., Bordbar, M., and de Freitas, N. A Gener-
alist Agent.Transactions on Machine Learning Research,
2022. URLhttps://openreview.net/forum?
id=1ikK0kHjvj.
Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Je-
natton, R., Pinto, A. S., Keysers, D., and Houlsby, N. Scal-
ing Vision with Sparse Mixture of Experts. InAdvances in
Neural Information Processing Systems (NeurIPS), 2021.
Rogers, A., Kovaleva, O., Downey, M., and Rumshisky,
A. Getting closer to AI complete question answering:
A set of prerequisite real tasks. InAAAI Conference on
Artificial Intelligence (AAAI), 2020.
Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y.
WinoGrande: An Adversarial Winograd Schema Chal-
lenge at Scale.Commun. ACM, 64(9):99–106, 2021. URL
http://arxiv.org/abs/1907.10641v2 .
Sap, M., Rashkin, H., Chen, D., LeBras, R., and Choi, Y.
Socialiqa: Commonsense reasoning about social interac-
tions.arXiv, 2019. URLhttp://arxiv.org/abs/
1904.09728v3.
Sharma, R., Allen, J., Bakhshandeh, O., and Mostafazadeh,
N. Tackling the Story Ending Biases in The Story
Cloze Test. InAssociation for Computational Linguistics
(ACL), 2018. URLhttps://aclanthology.org/
P18-2119.
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le,
Q., Hinton, G., and Dean, J. Outrageously Large Neural
Networks: The Sparsely-Gated Mixture-of-Experts Layer.
InInternational Conference on Learning Representations
(ICLR), 2017.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning,
C. D., Ng, A., and Potts, C. Recursive deep models
12

Localizing Task Information for Improved Model Merging and Compression
for semantic compositionality over a sentiment treebank.
InEmpirical Methods in Natural Language Processing
(EMNLP), 2013.https://aclanthology.org/
D13-1170/.
Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. The
German traffic sign recognition benchmark: a multi-class
classification competition. InInternational Joint Confer-
ence on Neural Networks (IJCNN), 2011.https://
ieeexplore.ieee.org/document/6033395 .
Standley, T., Zamir, A. R., Chen, D., Guibas, L. J., Malik,
J., and Savarese, S. Which Tasks Should Be Learned
Together in Multi-task Learning? InInternational
Conference on Machine Learning (ICML), 2020. URL
http://arxiv.org/abs/1905.07553v4 .
Tafjord, O., Gardner, M., Lin, K., and Clark, P. QuaRTz:
An Open-Domain Dataset of Qualitative Relationship
Questions. InEmpirical Methods in Natural Lan-
guage Processing (EMNLP), 2019. URLhttps://
aclanthology.org/D19-1608 .
Tam, D., Bansal, M., and Raffel, C. Merging by Matching
Models in Task Parameter Subspaces.Transactions on
Machine Learning Research, 2024. URLhttps://
openreview.net/forum?id=qNGo6ghWFB .
Veeling, B. S., Linmans, J., Winkens, J., Cohen, T., and
Welling, M. Rotation equivariant CNNs for digital pathol-
ogy. InInternational Conference on Medical Image
Computing and Computer Assisted Intervention (MIC-
CAI), 2018. URLhttp://arxiv.org/abs/1806.
03962v1.
Wortsman, M., Horton, M. C., Guestrin, C., Farhadi,
A., and Rastegari, M. Learning Neural Network Sub-
spaces. InInternational Conference on Machine Learn-
ing (ICML), 2021. URLhttp://arxiv.org/abs/
2102.10472v3.
Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R.,
Gontijo-Lopes, R., Morcos, A. S., Namkoong, H.,
Farhadi, A., Carmon, Y., Kornblith, S., et al. Model
soups: averaging weights of multiple fine-tuned mod-
els improves accuracy without increasing inference
time. InInternational Conference on Machine Learn-
ing (ICML), 2022a. URLhttp://arxiv.org/abs/
2203.05482v3.
Wortsman, M., Ilharco, G., Kim, J. W., Li, M., Kornblith,
S., Roelofs, R., Lopes, R. G., Hajishirzi, H., Farhadi, A.,
Namkoong, H., and Schmidt, L. Robust Fine-Tuning of
Zero-Shot Models. InIEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2022b. URL
http://arxiv.org/abs/2109.01903v3 .
Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a
novel image dataset for benchmarking machine learning
algorithms.arXiv, 2017. URLhttp://arxiv.org/
abs/1708.07747v2.
Xiao, J., Ehinger, K. A., Hays, J., Torralba, A., and Oliva,
A. Sun database: Exploring a large collection of scene
categories.International Journal of Computer Vision,
119:3–22, 2016.
Yadav, P., Choshen, L., Raffel, C., and Bansal, M.
ComPEFT: Compression for Communicating Parame-
ter Efficient Updates via Sparsification and Quantiza-
tion.arXiv, 2023a. URLhttp://arxiv.org/abs/
2311.13171v1.
Yadav, P., Tam, D., Choshen, L., Raffel, C., and Bansal, M.
TIES-Merging: Resolving Interference When Merging
Models. InAdvances in Neural Information Process-
ing Systems (NeurIPS), 2023b. URLhttp://arxiv.
org/abs/2306.01708v2 .
Yang, E., Wang, Z., Shen, L., Liu, S., Guo, G., Wang, X.,
and Tao, D. AdaMerging: Adaptive Model Merging for
Multi-Task Learning. InInternational Conference on
Learning Representations (ICLR), 2024. URLhttps:
//openreview.net/forum?id=nZP6NgD3QY .
Yang, Y., Yih, W.-t., and Meek, C. WikiQA: A Chal-
lenge Dataset for Open-Domain Question Answering.
InEmpirical Methods in Natural Language Processing
(EMNLP), 2015. URLhttps://aclanthology.
org/D15-1237.
Zhang, Y., Baldridge, J., and He, L. PAWS: Paraphrase
Adversaries from Word Scrambling. InNorth American
Chapter of the Association for Computational Linguis-
tics (NAACL), 2019. URLhttps://aclanthology.
org/N19-1131.
Zhou, C., Li, Q., Li, C., Yu, J., Liu, Y., Wang, G., Zhang,
K., Ji, C., Yan, Q., He, L., et al. A comprehensive survey
on pretrained foundation models: A history from bert
to chatgpt.arXiv, 2023. URLhttp://arxiv.org/
abs/2302.09419v3.
Zhou, J., Lin, Z., Zheng, Y., Li, J., and Yang, Z. Not
All Tasks Are Born Equal: Understanding Zero-Shot
Generalization. InInternational Conference on Learning
Representations (ICLR), 2022.
13

Localizing Task Information for Improved Model Merging and Compression
A. Experimental Details
All our experiments were performed using the same hardware consisting of four V100 NVIDIA GPUs with 32GB of memory
each.
A.1. Fine-tuning
Fine-tuning: For all fine-tuning experiments, we stick to the training procedure outlined in2023). Specifically,
we fine-tune the same pre-trained CLIP checkpoint obtained from theopencliprepository (Ilharco et al.,). We fine-tune
for 2,000 iterations, using a batch size of 128, a learning rate of1e
−5, and a cosine annealing learning rate schedule with
200 warm-up steps, along with the AdamW optimizer. Following2023) and2023), we
freeze the weights of the classification layer during fine-tuning process.
Normalized AccuracyTo account for the task difficulties, we provide the normalized accuracies as well as the absolute
accuracies in our results. Specifically, the normalization is performed with respect to the accuracy achieved by the individual
fine-tuned models:
Normalized Accuracy=
1
T
T
X
t=1
acc
x∼µt
[fmerged(x)]
acc
x∼µt
[ffine-tuned(x)]
(10)
Note that the normalized accuracy depends on the fine-tuning methods as well as the merging methods.
A.2. Hyper-parameter tuning
Mask sparsity factorλFor constructing task-specific masks, we tune the hyper-parameterλfor each task over{0.2, 0.3,
0.4, 0.5, 0.6}. The bestλfor each task is selected based the validation performance on each individual tasks.
Task vector scaling factorαFollowing2023), we use a single scaling factor αto scale the multi-task vector
for the model merging methods in. The scaling factor is tuned over a range of {0.0, 0.1, ..., 0.9, 1.0},
selected based on the performance on the validation set averaged on all tasks.
A.3. Benchmark task contents
A.3.1. COMPUTER VISION
The8-task scenariotakes into account the 8 tasks studied in2021),2023), including 1. Cars
(Krause et al.,), 2. DTD (Cimpoi et al.,), 3. EuroSAT (Helber et al.,), 4. GTSRB (Stallkamp et al.,),
5. MNIST (LeCun,), 6. RESISC45 (Cheng et al.,), 7. SUN397 (Xiao et al.,), 8. SVHN (Netzer et al.,).
The14-task scenarioadds to the 8 tasks mentioned above the following tasks: 9. CIFAR100 (Krizhevsky et al.,),
10. STL10 (Coates et al.,), 11. Flowers102 (Nilsback & Zisserman,), 12. OxfordIIITPet (Parkhi et al.,),
13. PCAM (Veeling et al.,), 14. FER2013 (Goodfellow et al.,).
The20-task scenarioadds to the 14 tasks mentioned above the following tasks: 15. EMNIST (Cohen et al.,),
16. CIFAR10 (Krizhevsky et al.,), 17. Food101 (Bossard et al.,), 18. FashionMNIST (Xiao et al.,),
19. RenderedSST2 (Socher et al.,;,) 20. KMNIST (Clanuwat et al.,),
For the benchmarks beyond 8 tasks, we use available datasets fromtorchvisionlibrary. For the 14-task scenario, we
aim for diversisty in the tasks as much as possible. After removing the MNIST-variants in the 20-task benchmark, we rank
the tasks based on the performance comparison for zero-shot CLIP v.s. linear probe ResNet-50 (Radford et al.,) with
ascending order and select the top 6 tasks.
A.3.2. NATURAL LANGUAGE PROCESSING
7 NLP TasksThis benchmark is studied in2023b) and contains the following datasets 1. QASC (Khot et al.,
2020), 2. QuaRTz (Tafjord et al.,), 3. PAWS (Zhang et al.,), 4. Story Cloze (Sharma et al.,), 5. WikiQA
(Yang et al.,), 6. Winogrande (Sakaguchi et al.,) and 7. WSC (Levesque et al.,)
14

Localizing Task Information for Improved Model Merging and Compression
8 QA TasksFollowing2024), we evaluate on another benchmark, containing the following tasks 1. CosmosQA
(Huang et al.,), 2. QASC (Khot et al.,), 3. QuAIL (Rogers et al.,), 4. QuaRTz (Tafjord et al.,)),
5. PAWS (Zhang et al.,). 6. ROPES (Lin et al.,), 7. SocialIQA (Sap et al.,), 8. Wiki QA (Yang et al.,).
B. Derivation of
This short section shows the derivation of the mask criterion in more detail:
m
∗
t= argmin
mt∈{0,1}
P
(∥
ˆ
θt−θt∥1) (11)
= argmin
mt∈{0,1}
P
(∥mt◦τMTL−τt∥1) (12)
= argmin
mt∈{0,1}
P
P
X
n=1
|m
(n)
t·τ
(n)
MTL
−τ
(n)
t| (13)
=⇒m
(n)∗
t= argmin
m
(n)
t
∈{0,1}
|m
(n)
t·τ
(n)
MTL
−τ
(n)
t| (14)
=
(
1if|τ
(n)
t| ≥ |τ
(n)
MTL
−τ
(n)
t|
0otherwise
(15)
=1{|τ
(n)
t| ≥ |τ
(n)
MTL
−τ
(n)
t|} (16)
From
that the optimal mask is given by:
m
∗
t=1{|τt| ≥ |τMTL−τt|} (17)
C. Storage cost calculation
This section show the calculation of the storage cost for each method in. Let Tbe the number of tasks,
Pbe the number of all parameters,P
′be the number of trainable parameters in the model, andFbe the number of frozen
parameters in the model. Assuming one float parameter takes 32 bits, for each method, their respective storage cost forT
tasks is calculated as:
•
Fine-tuned models:32(T P
′
+F) .32T P
′ is for storingTtrainable parameters and32Fis for storing frozen parameters.
• 32P; Stores a single model.
• 32P; Stores a single model.
• 32P; Stores a single model.
• 32P; Stores a single model.
• 32P; Stores a single model.
•
Magnitude Masking:(64 +T)P
′
+ 32F ;64P
′
+ 32F is for storing zero-shot model and multi-task vector, whileT P
′
is for storingTbinary masks.
•
Magnitude Pruning:>(64 +T)P
′
+ 32F ; For Magnitude Pruning, we need to store a single model, as well as the
sparsified task vectors for each task. We calculate the respective sparsity such that the total storage cost will be higher
than(64 +T)P
′
+ 32F.
•
TALL Mask + Task arithmetic:(64 +T)P
′
+ 32F ;64P
′
+ 32F is for storing zeroshot model and multi-task vector,
whileT P’ is for storing T binary masks.
•
TALL Mask + Ties-merging:(64 +T)P
′
+ 32F ;64P
′
+ 32F is for storing zeroshot model and multi-task vector,
whileT P
′
is for storing T binary masks.
15

Localizing Task Information for Improved Model Merging and Compression
D. Additional results
D.1. Performance for removing only catastrophic weights
For Consensus Merging, we propose to remove both catastrophic weights and selfish weights, as their contribution to the
multi-task performance is limited. In this section, we study the performance of Consensus Merging when removing only
catastrophic weights, i.e., weights beneficial for none of the tasks. The results for the experiments in image classification are
presented in Table
Table 4.
Performance of Consensus when removing only catastrophic weights with ViT-B/32 and ViT-L/14 on image classification
benchmarks.
ViT-B/32 ViT-L/14
8 tasks 14 tasks 20 tasks 8 tasks 14 tasks 20 tasks
Task arithmetic 70.8
(76.5) 65.4
(72.2) 60.6
(66.8) 84.8
(88.5) 79.3
(83.8) 74.0
(78.0)
Consensus TA(k= 1) 73.8
(79.4) 69.0
(75.8) 64.5
(71.0) 85.5
(89.2) 81.4
(86.1) 77.7
(81.9)
Consensus TA(k= 2) 75.0
(80.7) 70.4
(77.4) 65.4
(72.0) 86.2
(89.9) 82.3
(86.9) 78.9
(83.2)
TIES 75.2
(81.1) 68.0
(74.9) 63.3
(69.8) 86.9
(90.7) 79.5
(84.1) 75.7
(79.8)
Consensus TIES(k= 1) 75.9
(81.0) 69.0
(75.9) 64.5
(71.0) 87.5
(91.3) 81.6
(86.3) 78.0
(82.1)
Consensus TIES(k= 2) 74.8
(80.6) 67.7
(74.5) 63.2
(69.6) 86.9
(90.7) 81.5
(86.1) 76.8
(80.9)
Table 5.Performance of Consensus when removing only catastrophic weights with ViT-B/16 on image classification benchmarks.
ViT-B/16
8 tasks 14 tasks 20 tasks
Task arithmetic 76.2
(80.6) 70.5
(75.8) 65.7
(70.9)
Consensus TA(k= 1) 78.0
(82.4) 72.9
(78.4) 68.9
(74.0)
Consensus TA(k= 2) 79.2
(83.6) 74.3
(79.8) 70.0
(75.2)
TIES 79.7
(84.3) 73.2
(78.7) 68.1
(73.3)
Consensus TIES(k= 1) 79.4
(83.8) 74.4
(80.0) 69.9
(75.1)
Consensus TIES(k= 2) 79.4
(83.9) 74.1
(79.5) 68.7
(73.9)
D.2. Effect of weight-pruning threshold for 8 and 14 tasks
We plot the performance for Consensus Merging with varying weight-pruning thresholdk, on the 8-task and 14-task
benchmarks in image classification with ViT-B/32, in Figure. Similar to the observation from Figure6, the performance
differs from one method to another (TIES and task arithmetic in our case). While the optimal performance of Consensus TA
is achieved by removing both catastrophic and selfish weights (setting the threshold to 2), for Consensus TIES it delivers the
best results by removing only the catastrophic weights (setting the threshold to 1). Gradually the performance of Consensus
Merging reduces to zero-shot as more and more weights get removed.
D.3. Performance with Parameter-efficient fine-tuning methods
In the main text we have presented the performance of our proposed methods for models with full fine-tuning. As Parameter-
efficient fine-tuning (PEFT) methods have been widely adopted as cost-effective alternatives to full fine-tuning, in this
section we provide the performance of our methods on PEFT models as well.
Following previous work (Yadav et al.,), we provide the performance for models fine-tuned with (IA)
3
(Liu et al.,
2022) on the three NLP benchmarks.
The results are presented in Table. We observe from the results that while removing the selfish weights lead to performance
degradation, possibly due to different profiles of the weights being tuned by (IA)
3
and full fine-tuning, removing only the
catastrophic weights leads to consistent performance improvement of Consensus Merging over Task Arithmetic and TIES.
For example, for the 7 NLP tasks, Consensus TA leads to 3.8% gain to Task Arithmetic and Consensus TIES leads to 5.3%
16

Localizing Task Information for Improved Model Merging and Compression2 4 6 8
Weight-pruning thresholdk
40
50
60
70
80
Avg. test accuracy (%)
TA
TIES
Zero-shot
(a) 8 tasks
Consensus TA Consensus TIES
2 5 7 10 12 15
Weight-pruning thresholdk
55
60
65
70
TA
TIES
Zero-shot
(b) 14 tasks
Consensus TA Consensus TIES
Figure 7.Performance of Consensus Merging with varying weight-pruning thresholdk.
gain to TIES.
Table 6.Results on NLP benchmarks using (IA)
3
models, comparing the performance of Consensus Merging with baseline methods.
7 NLP tasks 8 QA tasks All 11 tasks
(IA)
3
(IA)
3
(IA)
3
Zeroshot 44.9 33.1 36.9
Task arithmetic 67.1
(79.5) 57.6
(74.3) 59.7
(77.7)
Consensus TA(k= 1) 70.9
(83.6) 60.2
(77.1) 60.8
(79.1)
Consensus TA(k= 2) 66.6
(78.6) 54.1
(68.1) 55.7
(71.1)
TIES 66.0
(77.6) 56.8
(72.6) 56.1
(72.5)
Consensus TIES(k= 1) 71.3
(84.2) 58.7
(74.9) 59.2
(76.5)
Consensus TIES(k= 2) 67.6
(79.8) 56.3
(71.6) 53.1
(68.8)
D.4. Distribution of mask agreements with more tasks
We provide in Figure
TIES.
D.5. Result for ViT-B/16
We provide here the same results shown in, where we observe similar findings as the case
for ViT-B/32 and ViT-L/14 in main text.
D.6. Full results on individual tasks
D.6.1. FULL RESULTS FOR VISION
In main text we have provided the full results on individual tasks for ViT-B/32, here we provide the same radar plots for
ViT-B/16 in Figure, and the radar plots for ViT-L/14 in Figure, respectively. We observe similar findings for the case
with ViT-B/32 in main text.
D.6.2. FULL RESULTS FORNLP
We provide the full results for individual tasks in NLP in Figure
17

Localizing Task Information for Improved Model Merging and Compression01234567891011121314
Number of tasks
0
10
20
30
40
Active weights perc. (%)
Weights - TA Weights - TIES
01234567891011121314151617181920
Number of tasks
0
10
20
30
Figure 8.The mask agreement profile, defined in, in the case of 14 and 20 vision tasks. This figure complements.
Table 7.Complementary to, for results obtained with ViT-B/16.
ViT-B/16
8 tasks 14 tasks 20 tasks
Method
Acc.(%)↑Bits(Gb)↓ Acc.(%)↑Bits(Gb)↓ Acc.(%)↑Bits(Gb)↓
Zeroshot 55.2 3.6 61.2 3.6 59.7 3.6
Weight averaging 72.2
(76.6) 3.6
(74.7) 3.6
(70.3) 3.6
Task arithmetic 75.8
(80.2) 3.6
(75.8) 3.6
(70.7) 3.6
TIES 79.7
(84.3) 3.6
(78.7) 3.6
(73.3) 3.6
Consensus Task arithmetic [ours]79.2
(83.6) 3.6 74.3
(79.8) 3.6 69.7
(74.9) 3.6
Merging
Consensus TIES [ours] 79.4
(83.9) 3.6
(79.5) 3.6
(73.9) 3.6
Fine-tuned 94.6 22.9 92.8 39.4 93.2 56.0
Mag. Prunning 93.3
(98.5) >7.0
(94.7) >7.5
(92.7) >8.1
Mag. Masking 89.7
(94.6) 7.0
(91.1) 7.5
(87.3) 8.1
TALL Mask + Task arithmetic 94.2
(99.6) 7.0
(99.2) 7.5
(99.3) 8.1
Compression
TALL Mask + TIES 94.6
(99.9) 7.0 92.6
(99.8) 7.5 93.0
(99.8) 8.1
E. Sample subset selection protocol
Figure 5
select 8 representative task combinations. Specifically, we sort all 20 tasks based on the following 8 orders:
•
KMNIST, EMNIST, SVHN, GTSRB, FER2013, DTD, EuroSAT, MNIST, RenderedSST2, Cars, PCAM, RESISC45,
FashionMNIST, SUN397, CIFAR100, Flowers102, Food101, OxfordIIITPet, CIFAR10, STL10
•
STL10, CIFAR10, OxfordIIITPet, Food101, Flowers102, CIFAR100, SUN397, FashionMNIST, RESISC45, PCAM,
Cars, RenderedSST2, MNIST, EuroSAT, DTD, FER2013, GTSRB, SVHN, EMNIST, KMNIST
•
Cars, PCAM, RenderedSST2, RESISC45, MNIST, FashionMNIST, EuroSAT, SUN397, DTD, CIFAR100, FER2013,
Flowers102, GTSRB, Food101, SVHN, OxfordIIITPet, EMNIST, CIFAR10, KMNIST, STL10
•
STL10, KMNIST, CIFAR10, EMNIST, OxfordIIITPet, SVHN, Food101, GTSRB, Flowers102, FER2013, CIFAR100,
DTD, SUN397, EuroSAT, FashionMNIST, MNIST, RESISC45, RenderedSST2, PCAM, Cars
18

Localizing Task Information for Improved Model Merging and CompressionMNIST
Cars
DTD
EuroSAT
GTSRB
RESISC45
SUN397
SVHN
MNIST
20
40
60
80
(a) 8 Tasks
MNIST
Cars
DTD
EuroSATGTSRB
RESISC45
SUN397
SVHN
Flowers102
CIFAR100
PCAM STL10
OxfordIIITPet
FER2013
MNIST
20
40
60
80
(b) 14 Tasks
MNIST
Cars
DTD
EuroSAT
GTSRB
RESISC45
SUN397
SVHN
Flowers102
CIFAR100
PCAM
STL10
OxfordIIITPet
FER2013
EMNIST
CIFAR10
Food101
FashionMNIST
KMNIST
RenderedSST2
MNIST
20
40
60
80
(c) 20 Tasks
Fine-tuned TALL Mask + TA (Ours) Consensus TA (Ours) TA TIES Zero-shot
Figure 9.
Absolute accuracy (%) of individual tasks for ViT-B/16, comparing the accuracy of individual fine-tuned models, Our method,
Task arithmetic, Ties-merging, and Zero-shot.
•
CIFAR10, CIFAR100, Cars, DTD, EMNIST, EuroSAT, FER2013, FashionMNIST, Flowers102, Food101, GTSRB,
KMNIST, MNIST, OxfordIIITPet, PCAM, RESISC45, RenderedSST2, STL10, SUN397, SVHN
•
SVHN, SUN397, STL10, RenderedSST2, RESISC45, PCAM, OxfordIIITPet, MNIST, KMNIST, GTSRB, Food101,
Flowers102, FashionMNIST, FER2013, EuroSAT, EMNIST, DTD, Cars, CIFAR100, CIFAR10
•
CIFAR10, SVHN, CIFAR100, SUN397, Cars, STL10, DTD, RenderedSST2, EMNIST, RESISC45, EuroSAT, PCAM,
FER2013, OxfordIIITPet, FashionMNIST, MNIST, Flowers102, KMNIST, Food101, GTSRB
•
GTSRB, Food101, KMNIST, Flowers102, MNIST, FashionMNIST, OxfordIIITPet, FER2013, PCAM, EuroSAT,
RESISC45, EMNIST, RenderedSST2, DTD, STL10, Cars, SUN397, CIFAR100, SVHN, CIFAR10
For a given number of tasksn, we retrieve the firstntasks from these 8 sequences respectively, such that we account for
both task difficulty (in zero-shot performance order) and randomness (in alphabetic order) when generating.
19

Localizing Task Information for Improved Model Merging and CompressionMNIST
Cars
DTD
EuroSAT
GTSRB
RESISC45
SUN397
SVHN
MNIST
20
40
60
80
(a) 8 Tasks
MNIST
Cars
DTD
EuroSATGTSRB
RESISC45
SUN397
SVHN
Flowers102
CIFAR100
PCAM STL10
OxfordIIITPet
FER2013
MNIST
20
40
60
80
(b) 14 Tasks
MNIST
Cars
DTD
EuroSAT
GTSRB
RESISC45
SUN397
SVHN
Flowers102
CIFAR100
PCAM
STL10
OxfordIIITPet
FER2013
EMNIST
CIFAR10
Food101
FashionMNIST
KMNIST
RenderedSST2
MNIST
20
40
60
80
(c) 20 Tasks
Fine-tuned TALL Mask + TA (Ours) Consensus TA (Ours) TA TIES Zero-shot
Figure 10.
Absolute accuracy (%) of individual tasks for ViT-L-14, comparing the accuracy of individual fine-tuned models, Our method,
Task arithmetic, Ties-merging, and Zero-shot.PAWS
QASC
QuaRTz
StoryCloze
WikiQA
Winogrande
WSC
PAWS
20
40
60
80
(a) 7 NLP Tasks
CosmosQA
SocialIQA
PAWS
QuAIL
WikiQA
QuaRTz
QASC
ROPES
CosmosQA
20
40
60
80
(b) 8 QA Tasks
PAWS
QASC
QuaRTz
StoryCloze
WikiQA
Winogrande
WSC
CosmosQA
SocialIQA
QuAIL
ROPES
PAWS
20
40
60
80
(c) 11 Tasks
Fine-tuned TALL Mask + TA (Ours) Consensus TA (Ours) TA TIES Zero-shot
Figure 11.
Absolute accuracy (%) of individual tasks for T5-large on three benchmarks, comparing the accuracy of individual fine-tuned
models, Our method, Task arithmetic, Ties-merging, and Zero-shot.
20