2110.01786

MoEfication: Transformer Feed-forward Layers are Mixtures of Experts
Zhengyan Zhang 1,2, Yankai Lin 3, Zhiyuan Liu 1,2,4,5†, Peng Li 3,6, Maosong Sun 1,2,4,5,7†, Jie Zhou 3
1 Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
2 Beijing National Research Center for Information Science and Technology
3 Pattern Recognition Center, WeChat AI, Tencent Inc.
4 International Innovation Center of Tsinghua University, Shanghai, China
5 Beijing Academy of Artificial Intelligence
6 Institute for AI Industry Research (AIR), Tsinghua University, China
7 Jiangsu Collaborative Innovation Center for Language Ability, Xuzhou, China
zy-z19@mails.tsinghua.edu.cn {liuzy,sms}@tsinghua.edu.cn
Abstract
Recent work has shown that feed-forward networks (FFNs) in pre-trained Transformers are a key component, storing various linguistic and factual knowledge. However, the computational patterns of FFNs are still unclear. In this work, we study the computational patterns of FFNs and observe that most inputs only activate a tiny ratio of the neurons of FFNs. This phenomenon is similar to the sparsity of the human brain, which drives research on functional partitions of the human brain. To verify whether functional partitions also emerge in FFNs, we propose to convert a model into its MoE version with the same parameters, namely MoEfication. Specifically, MoEfication consists of two phases: (1) splitting the parameters of FFNs into multiple functional partitions as experts, and (2) building expert routers to decide which experts will be used for each input. Experimental results show that MoEfication can conditionally use 10% to 30% of FFN parameters while maintaining over 95% of the original performance for different models on various downstream tasks. Besides, MoEfication brings two advantages: (1) it significantly reduces the FLOPS of inference, i.e., 2x speedup with 25% of FFN parameters, and (2) it provides a fine-grained perspective to study the inner mechanism of FFNs. The source code of this paper can be obtained from https://github.com/thunlp/MoEfication.
1 Introduction
Recent years have witnessed the great success of Transformer-based pre-trained language models (PLMs) (Devlin et al., 2019; Han et al., 2021), attracting many efforts to interpret the inner mechanism of Transformers (Manning et al., 2020). However, most of these works focus on the attention mechanism but ignore the feed-forward networks (FFNs), which constitute nearly two-thirds of the model parameters. Although recent work has shown that FFNs can be viewed as memory networks storing amounts of knowledge (Geva et al., 2021; Dai et al., 2021), the computational patterns of FFNs are still unclear.

In this work, we study the activation patterns of FFNs in Transformer models and find a phenomenon of sparse activation, i.e., only a tiny fraction of neurons are activated for a single input. For example, when we perform inference on a fine-tuned T5-Large model (Raffel et al., 2020) with 700 million parameters, 90% of inputs only activate less than 5% of the neurons¹. This phenomenon is similar to the sparsity in the human brain (Olshausen and Field, 1996; Gross, 2002), which drives research on functional partitions of the human brain (Garey, 1999). Inspired by this observation, we further raise a question: do functional partitions also emerge in artificial neural models, i.e., FFNs in pre-trained Transformers?

† Corresponding authors. Part of the work was done while Peng Li was working at Tencent.
¹ T5 uses ReLU as the activation function. We treat the neurons having positive outputs as activated neurons.
To investigate this problem, we explore whether a Transformer can be converted into an equivalent Mixture-of-Experts (MoE) model (Bengio, 2013), which regards different functional partitions in FFNs as different experts that are conditionally activated. Specifically, we propose MoEfication to discover the functional partitions (experts) in FFNs and build routers for selecting experts. It consists of two phases. (1) Expert Construction: split a whole feed-forward layer into multiple experts. The goal is to group those neurons that are often activated simultaneously into the same expert network. (2) Expert Selection: select those experts that contain as many activated neurons as possible for each input to approximate the original results.
In the experiments, we evaluate MoEfication on two typical kinds of downstream tasks, GLUE and QA benchmarks (Wang et al., 2019; Rajpurkar et al., 2016; Lai et al., 2017), using T5 and BERT (Raffel et al., 2020; Devlin et al., 2019). Experimental results verify that FFNs in Transformers can be converted to mixtures of experts, and thus we can use only 10% to 30% of FFN parameters to maintain over 95% of the original performance, which verifies that pre-trained Transformers also learn functional partitions in FFNs. Besides, MoEfication brings two advantages: (1) It can significantly speed up the inference of Transformers. Using 25% of FFN parameters brings 2x speedup on CPU and 1.2x speedup on GPU. (2) We can study MoEfied models to interpret the inner mechanism of FFNs at a fine-grained level. In this work, we study their routing patterns and hope these findings can help future work on the design and training of MoE models.
2 Related Work
Interpretation of Large-scale Transformers.
Due to the success of Transformer-based PLMs, there are many studies on the interpretation of Transformers, including the functionality of different layers (Tenney et al., 2019; Wang and Tu, 2020) and the mechanisms of both attention networks and FFNs (Manning et al., 2020; Wallace et al., 2019). Recent work finds that the FFNs of Transformers can be viewed as memory networks storing lots of knowledge learned from language modeling (Geva et al., 2021; Dai et al., 2021). Meanwhile, researchers explore modifying the knowledge stored in FFNs and achieve promising results (De Cao et al., 2021; Meng et al., 2022). In this work, we show how the knowledge stored in FFNs is used, that is, most FFNs can be viewed as an MoE network where the knowledge is conditionally activated.
Large-scale PLMs with MoE.
Jacobs et al. (1991) propose mixture-of-experts to build a system composed of many separate networks, each of which learns to handle a subset of the training examples independently. As deep neural networks achieved great success (Hinton et al., 2012; Krizhevsky et al., 2012), Bengio (2013) argues that the model size is a key factor and MoE is an important technique for scaling model computation, proposing the idea of "conditional computation". The first large-scale MoE language model is proposed by Shazeer et al. (2017), which adds an MoE layer between two LSTM layers and independently assigns tokens to combinations of experts. Recently, GShard (Lepikhin et al., 2021), Switch Transformer (Fedus et al., 2021), BASE Layer (Lewis et al., 2021), and Hash Layer (Roller et al., 2021) study how to build large-scale Transformer-based models with MoE and optimal training strategies, which can fully utilize the model capacity. Different from them, we utilize the naturally-existing sparse activation phenomenon to convert a model into its MoE version for better efficiency during inference.
Model Acceleration for PLMs.
Model acceleration aims to reduce the time and space complexity of PLMs. There are several techniques, including knowledge distillation (Sanh et al., 2019; Sun et al., 2019; Jiao et al., 2020), model pruning (Voita et al., 2019; Michel et al., 2019; Zhang et al., 2021), attention approximation (Wang et al., 2020; Kitaev et al., 2020; Zaheer et al., 2020), model quantization (Zafrir et al., 2019; Zhang et al., 2020; Bai et al., 2021), and dynamic inference (Xin et al., 2020; Hou et al., 2020; Wu et al., 2020). Among these techniques, dynamic inference explores selectively omitting unnecessary computation for acceleration, which is similar to the target of MoEfication. Previous work usually focuses on how to dynamically drop layers to accelerate inference (Huang et al., 2018; Xin et al., 2020), which introduces additional training objectives and prediction strategies. In contrast, MoEfication simplifies models at a finer granularity and does not change the process of training and inference. In summary, MoEfication can be regarded as a novel direction orthogonal to the above-mentioned approaches.
3 MoEfication
In this section, we introduce the general idea of MoEfication and divide it into two phases: expert construction and expert selection.
[Figure 1: An example of the sparse activation phenomenon and MoEfication. (a) shows the computation process of an FFN for a given input. (b) shows the unused elements and neurons for this input. (c) shows how to construct experts. (d) shows how the MoEfied model handles this input efficiently.]
3.1 Overall Framework
MoEfication aims to utilize the sparse activation phenomenon in the FFNs of Transformers to reduce the computation cost.
We first formally describe the sparse activation phenomenon. The FFNs of Transformers are two-layer fully connected networks, which process an input representation $x \in \mathbb{R}^{d_{model}}$ by

$h = x W_1 + b_1,$
$F(x) = \sigma(h) W_2 + b_2,$    (1)

where $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$ are the weight matrices, $b_1 \in \mathbb{R}^{d_{ff}}$ and $b_2 \in \mathbb{R}^{d_{model}}$ are the bias vectors, and $\sigma(\cdot)$ is a non-linear activation function, which prefers to retain positive values and discard negative ones. In this work, we study the activation function ReLU (Nair and Hinton, 2010), which is used by the original Transformer (Vaswani et al., 2017) and some widely-used Transformer-based PLMs (Sun et al., 2020; Raffel et al., 2020).
Since there are many inactive (zero) values in the intermediate output $\sigma(h)$, the computation of these values can be omitted for acceleration. Meanwhile, different inputs will activate different neurons. Hence, we explore selecting the possibly-activated neurons of $h$ before the FFN computation instead of resorting to model pruning.
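To make Equation (1) and the sparsity measurement concrete, here is a minimal PyTorch sketch (illustrative only, not the released code; tensor shapes and names are our own):

```python
import torch

def ffn_forward_and_sparsity(x, W1, b1, W2, b2):
    """Two-layer FFN of Eq. (1) with ReLU, plus the per-input activation ratio.

    x:  (batch, d_model) input representations
    W1: (d_model, d_ff), b1: (d_ff,), W2: (d_ff, d_model), b2: (d_model,)
    """
    h = x @ W1 + b1                       # intermediate representation
    a = torch.relu(h)                     # sigma(h); zero entries are inactive neurons
    out = a @ W2 + b2                     # F(x)
    ratio = (a > 0).float().mean(dim=-1)  # fraction of activated neurons per input
    return out, ratio

# toy usage with random weights (a real study would load fine-tuned T5 parameters)
d_model, d_ff = 8, 32
x = torch.randn(4, d_model)
W1, b1 = torch.randn(d_model, d_ff), torch.zeros(d_ff)
W2, b2 = torch.randn(d_ff, d_model), torch.zeros(d_model)
out, ratio = ffn_forward_and_sparsity(x, W1, b1, W2, b2)
print(ratio)  # in fine-tuned T5-Large, most inputs yield ratios below 0.05
```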
We show an example in Figure 1. In this FFN, $d_{model}$ is 2, $d_{ff}$ is 4, and the bias vectors are omitted for simplification. For a given input representation $x$, there are two positive values in $h$. Hence, we only need to compute part of the FFN, i.e., a $2 \times 2$ submatrix of $W_1$ and a $2 \times 2$ submatrix of $W_2$, to obtain the same output $F(x)$. Correspondingly, we can MoEfy the original FFN into an MoE layer with two experts and select the one on the right-hand side for this input $x$.

For MoEfication, we first split the FFN into several independent parts, namely expert construction, and then design a router to select suitable experts for each input, namely expert selection.
3.2 Expert Construction
In this subsection, we introduce how to split an FFN into several parts. The core idea is to group together the neurons that are often activated simultaneously. In this way, for each input, we can select a small number of experts to cover all its activated neurons. To achieve better parallel computation performance, we set the size of each expert to be the same. If the number of experts is $k$, the input and output dimensions of the experts are still $d_{model}$ and their intermediate dimension is $d_e = d_{ff} / k$. Then, the parameters of the $i$-th expert are denoted by

$W_1^i \in \mathbb{R}^{d_{model} \times d_e}, \quad b_1^i \in \mathbb{R}^{d_e}, \quad W_2^i \in \mathbb{R}^{d_e \times d_{model}}.$    (2)
Given the result of splitting, we construct the corresponding permutation of intermediate neurons $\begin{pmatrix} 1 & 2 & \dots & d_{ff} \\ f(1) & f(2) & \dots & f(d_{ff}) \end{pmatrix}$, where $f(n)$ is the mapping function from the original neuron index to the permuted neuron index. We compute $f(n)$ by

$f(n) = (e(n) - 1)\, d_e + |\{m \mid m \le n,\ e(m) = e(n)\}|,$    (3)

where $e(n)$ is the expert index of the $n$-th neuron, which varies from 1 to $k$, and $|\{m \mid m \le n,\ e(m) = e(n)\}|$ is the index of the $n$-th neuron within its expert.
Then, we use its permutation matrix $P \in \mathbb{R}^{d_{ff} \times d_{ff}}$ to permute the rows or columns of the parameters and obtain the following split:

$[W_1^1, W_1^2, \dots, W_1^k] = W_1 P,$
$b_1^1 \oplus b_1^2 \oplus \dots \oplus b_1^k = b_1 P,$
$[(W_2^1)^T, (W_2^2)^T, \dots, (W_2^k)^T] = (P^T W_2)^T,$    (4)
where $\oplus$ represents the vertical concatenation. Note that the permutation will not influence the output representation:

$\sigma(h) W_2 + b_2 = \sigma(h) P P^T W_2 + b_2$
$= \sigma(h P) P^T W_2 + b_2$
$= \sigma(x W_1 P + b_1 P) P^T W_2 + b_2.$    (5)
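As a sanity check on Equations (4) and (5), the following sketch (ours, for illustration; `expert_ids` and the helper name are not from the released code) permutes the neurons by a balanced expert assignment, splits the FFN, and verifies that summing all experts reproduces the original output:

```python
import torch

def split_ffn(W1, b1, W2, expert_ids, k):
    """Permute FFN neurons so that each expert's neurons are contiguous, then split.

    expert_ids: (d_ff,) tensor with the 0-based expert index e(n)-1 of each neuron,
    assumed balanced (d_ff / k neurons per expert). Returns per-expert parameters.
    """
    perm = torch.argsort(expert_ids)                 # realizes the permutation matrix P
    W1p, b1p, W2p = W1[:, perm], b1[perm], W2[perm, :]
    d_e = W1.shape[1] // k
    W1s = [W1p[:, i * d_e:(i + 1) * d_e] for i in range(k)]
    b1s = [b1p[i * d_e:(i + 1) * d_e] for i in range(k)]
    W2s = [W2p[i * d_e:(i + 1) * d_e, :] for i in range(k)]
    return W1s, b1s, W2s

# verification: with all experts selected, the MoE layer equals the original FFN
d_model, d_ff, k = 4, 16, 4
W1, b1, W2 = torch.randn(d_model, d_ff), torch.randn(d_ff), torch.randn(d_ff, d_model)
x = torch.randn(2, d_model)
expert_ids = torch.randperm(d_ff) % k                # a balanced toy assignment
W1s, b1s, W2s = split_ffn(W1, b1, W2, expert_ids, k)
moe = sum(torch.relu(x @ W1s[i] + b1s[i]) @ W2s[i] for i in range(k))
print(torch.allclose(torch.relu(x @ W1 + b1) @ W2, moe, atol=1e-5))  # True
```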
In this work, we propose two methods to split an FFN into $k$ parts.

Parameter Clustering Split.
To take the parameter information into consideration, we treat the columns of $W_1$ as a collection of vectors of dimension $d_{model}$. Based on the intuition that the neurons with similar vectors will be activated simultaneously, we apply balanced K-Means (Malinen and Fränti, 2014) to the vector collection to obtain $k$ clusters and construct the mapping function.
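A minimal sketch of this split, using scikit-learn's plain KMeans as a stand-in (an assumption on our side; the paper uses the balanced K-Means implementation linked in Section 4.1, so the cluster sizes below are not guaranteed to be equal):

```python
import numpy as np
from sklearn.cluster import KMeans

def parameter_clustering_split(W1, k, seed=0):
    """Cluster the columns of W1 (one d_model-dim vector per neuron) into k experts.

    W1: (d_model, d_ff) array. Returns the expert index e(n)-1 for each neuron.
    """
    neuron_vectors = W1.T  # (d_ff, d_model)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(neuron_vectors)
    return km.labels_

expert_ids = parameter_clustering_split(np.random.randn(8, 64), k=4)
```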
Co-Activation Graph Split.
To directly use the information of co-activation, we construct a co-activation graph by counting the co-activations of PLMs on samples from the training set. Each neuron is represented by a node in the graph, and the edge weight between two nodes is their co-activation value. The co-activation value is computed by

$\text{co-activation}(n, m) = \sum_x h_n^{(x)} h_m^{(x)} \mathbf{1}_{h_n^{(x)} > 0,\ h_m^{(x)} > 0},$    (6)

where $h_n^{(x)}$ and $h_m^{(x)}$ are the $n$-th and the $m$-th neurons of $h$ for the input $x$, and $\mathbf{1}_{h_n^{(x)} > 0,\ h_m^{(x)} > 0}$ indicates that $h_n^{(x)}$ and $h_m^{(x)}$ are activated simultaneously. Then, we apply graph partitioning algorithms (Karypis and Kumar, 1998) to the co-activation graph to obtain the split, where the internal connections within each group will be strong. Please refer to Appendix F for the details of the algorithm. It means that the neurons split into the same group are often activated simultaneously for the training samples.
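The co-activation values of Equation (6) can be accumulated from cached intermediate outputs; a short sketch (variable names are illustrative, and the subsequent graph-partitioning step is omitted here):

```python
import torch

def coactivation_matrix(H):
    """Accumulate co-activation(n, m) of Eq. (6) over cached FFN activations.

    H: (num_inputs, d_ff) tensor of intermediate values h for sampled training inputs.
    Returns a (d_ff, d_ff) matrix of co-activation values (edge weights of the graph).
    """
    A = torch.relu(H)  # when both neurons are positive, h_n * h_m = relu(h_n) * relu(h_m)
    return A.T @ A     # otherwise the indicator in Eq. (6) zeroes the term out

coact = coactivation_matrix(torch.randn(1000, 64))
```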
3.3 Expert Selection
In this subsection, we introduce how to create a router for expert selection. An MoEfied FFN processes an input $x$ by

$F_m(x) = \sum_{i \in S} \sigma(x W_1^i + b_1^i) W_2^i + b_2,$    (7)
where $S$ is the set of the selected experts. If all experts are selected, we have $F_m(x) = F(x)$. Considering that $\sigma(x W_1^i + b_1^i) W_2^i$ equals 0 for most experts, we try to select $n$ experts, where $n < k$, to minimize $\|F_m(x) - F(x)\|_2$. The selection methods assign a score $s_i$ to each expert for the given input $x$ and select the experts with the $n$ highest scores by

$S = \arg\max_{A \subseteq \{1, 2, \dots, k\},\ |A| = n} \sum_{i \in A} s_i.$    (8)
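Putting Equations (7) and (8) together, an MoEfied FFN forward pass for a single input could look like the sketch below (a simplified, non-batched illustration; the routing scores can come from any of the selection methods described next):

```python
import torch

def moefied_ffn(x, W1s, b1s, W2s, b2, scores, n):
    """Run only the n experts with the highest scores s_i (Eq. 8) and sum them (Eq. 7)."""
    S = torch.topk(scores, n).indices
    out = b2.clone()
    for i in S.tolist():
        out = out + torch.relu(x @ W1s[i] + b1s[i]) @ W2s[i]
    return out
```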
Groundtruth Selection.
For the intermediate output $\sigma(h)$, we can obtain the groundtruth selection, which minimizes $\|\text{concat}(\{f(\sigma(x W_1^i + b_1^i))\, \mathbf{1}(i \in S)\}) - \sigma(h)\|_2$, by a greedy algorithm. $f$ is a padding function with zeros to match the dimension between $\sigma(x W_1^i + b_1^i)$ and $\sigma(h)$. We calculate the sum of positive values in each expert as $s_i$ and select experts using Equation (8). This selection should approximate the lower bound of $\|F_m(x) - F(x)\|_2$. Correspondingly, its performance will approximate the ideal performance of an MoEfied model. Meanwhile, it is intractable to directly optimize $\|F_m(x) - F(x)\|_2$ because there are too many possible combinations of experts.
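A sketch of this scoring rule (our paraphrase of the description above, not the exact released greedy routine): each expert is scored by the total positive activation mass it contributes for the given input, and the top-n experts are kept.

```python
import torch

def groundtruth_scores(x, W1s, b1s):
    """Score each expert by the sum of its positive intermediate values for input x."""
    return torch.stack([torch.relu(x @ W1 + b1).sum() for W1, b1 in zip(W1s, b1s)])

# e.g., out = moefied_ffn(x, W1s, b1s, W2s, b2, groundtruth_scores(x, W1s, b1s), n=8)
```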
Similarity Selection.
To utilize the parameter information, we average all columns of $W_1^i$ and use the result as the expert representation. Given an input $x$, we calculate the cosine similarity between the expert representation and $x$ as $s_i$.
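In code, this amounts to comparing the input against one averaged column per expert (a sketch under the assumption that the per-expert matrices come from the split above):

```python
import torch
import torch.nn.functional as F

def similarity_scores(x, W1s):
    """Cosine similarity between x and each expert's averaged W1 columns."""
    reps = torch.stack([W1.mean(dim=1) for W1 in W1s])        # (k, d_model) expert representations
    return F.cosine_similarity(reps, x.unsqueeze(0), dim=-1)  # (k,) scores s_i
```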
MLP Selection.
We train a multi-layer perceptron (MLP), which takes $x$ as input and predicts the sum of positive values in each expert. Then, we use the prediction as $s_i$. This method tries to approximate the performance of groundtruth selection.
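A minimal module matching the router architecture described in Section 4.1 (a two-layer feed-forward network with a tanh non-linearity and intermediate and output dimensions of k); the class name and any training details are our own:

```python
import torch.nn as nn

class MLPRouter(nn.Module):
    """Predicts, for each of the k experts, the sum of its positive intermediate values."""
    def __init__(self, d_model, k):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, k), nn.Tanh(), nn.Linear(k, k))

    def forward(self, x):
        return self.net(x)  # scores s_i; the n highest give the selected experts
```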
4 Experiment
4.1 Experimental Setups
Models and Hyperparameters.
We use four variants of T5 (Raffel et al., 2020), which are the 60-million-parameter T5-Small, the 200-million-parameter T5-Base, the 700-million-parameter T5-Large, and the 3-billion-parameter T5-XLarge. The non-linear activation function is ReLU (Nair and Hinton, 2010). We use Adam as the optimizer and a learning rate of $10^{-6}$ for fine-tuning T5 models on downstream tasks. The batch size is set to 64 and the number of epochs is set to 3.
Datasets.
We use several natural language understanding datasets to evaluate our models. We use SST-2 (Socher et al., 2013), MNLI-matched (Williams et al., 2018), and RACE (Lai et al., 2017) as the main evaluation datasets, which cover single-sentence classification, sentence-pair classification, and reading comprehension. We report the results on their development sets. We also report the results of MoEfication on other datasets in Appendix A, including the GLUE benchmark (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2016).
Expert Construction.
For balanced K-Means, we use an open-source implementation². Besides Parameter Clustering Split and Co-activation Graph Split, we also implement Random Split as a naive baseline, which uses an identity matrix as $P$. For the number of neurons in each expert: if the number is small, there will be a lot of experts, making the routing computation cost high. Meanwhile, if the number is large, there will be more inactive neurons in each expert for a given input, which is harmful to the performance with the same amount of selected neurons. Hence, selecting the number of neurons in each expert is a trade-off between computation cost and accuracy. According to our pilot experiments, we set the number of neurons in each expert $d_e$ to 32. Correspondingly, the number of experts ($k = d_{ff} / d_e$) varies from 64 to 512 for different T5 variants. With the same expert size, the relative computation cost of routing for different models is the same, as shown in Appendix E.

² https://github.com/ndanielsen/Same-Size-K-Means
Expert Selection.
Besides Similarity Selection and MLP Selection, we also implement Random Selection, where we treat each expert as a collection of vectors of dimension $d_{model}$ and randomly select one of them as the expert representation. For Random Selection and Similarity Selection, the computation complexity of routing is $O(k \cdot d_{model})$. For MLP Selection, we use a two-layer feed-forward network as the architecture. The input dimension is $d_{model}$, the intermediate dimension is $k$, and the output dimension is $k$. The non-linear activation function is $\tanh(\cdot)$. Its computation complexity is $O(k \cdot d_{model} + k^2)$. Compared to the computation complexity of the FFNs of the original model, $O(d_{model} \cdot d_{ff})$, the computation cost of the routers is negligible because $k$ is much smaller than $d_{ff}$. For example, $k$ is 128 and $d_{ff}$ is 4096 for T5-Large. For the training of our MLP routers, we adopt cross-entropy as the training objective and use the Adam optimizer with a learning rate of $10^{-2}$. The batch size is set to 512 and the number of epochs is set to 10. We sample nearly 500 thousand input representations from the training data and split them into training and development sets with the ratio of 9:1. Note that we only use the activation information as supervision. The training time of each FFN is about several minutes on a single GPU.

Model           SST-2   MNLI   RACE
Small           90.9    82.4   44.7
Small-Distill   91.9    82.6   50.6
Base            94.0    86.4   71.7
Large           96.2    89.5   81.3
XLarge          96.9    90.5   85.6

Table 1: Original performance of different models on three downstream tasks. The model architecture is T5.
4.2 MoEfy ReLU-based Models
In this subsection, we evaluate MoEfication on different T5 models. We consider two factors: the model size and whether the model is compressed. For the model size, we use five variants of T5 (Raffel et al., 2020), from T5-Small to T5-XLarge. For convenience, we directly use the scale names as the abbreviations. To investigate the influence of model compression, we compress T5-Large to T5-Small by classic knowledge distillation (Hinton et al., 2015). Specifically, the teacher model is a fine-tuned T5-Large and the student model is a pre-trained T5-Small. The distilled model is denoted by T5-Small-Distill. The expert construction and selection methods used here are Co-activation Graph Split and MLP Selection, which are shown to be the best combination in Section 4.4.

We report the performance of these models on three datasets, SST-2, MNLI, and RACE, in Table 1. They are representative datasets for single-sentence classification, sentence-pair classification, and reading comprehension, respectively. The original performance of PLMs grows as the model size grows, and knowledge distillation improves the performance of T5-Small.

We first calculate the activation statistics of different models by feeding in the training data of each dataset. The results are shown in Figure 2. From the figure, we have three observations. (1) The activations of these models are sparse. Different from the previous study on models trained with smaller datasets, where the activation ratios range from 10% to 50% (Geva et al., 2021)³, we find that most inputs activate less than 10% of the neurons. (2) The activations of larger models are sparser than those of smaller models. For example, 80% of inputs only activate less than 3% of the neurons in T5-XLarge, while 40% of inputs activate more than 3% of the neurons in T5-Small. (3) The sparsity is less related to distillation than to the model size. The CDF curve of T5-Small-Distill is close to that of T5-Small.

³ Since the activation ratios of a randomly-initialized model are around 50%, we guess these models do not make full use of their parameters.

[Figure 2: CDF of the ratio of activated neurons for each input with different models on three datasets: (a) SST-2, (b) MNLI, (c) RACE.]

[Figure 3: Relative performance of MoEfied models with different sizes on three datasets: (a) SST-2, (b) MNLI, (c) RACE. Dynamically selecting 10% to 20% of neurons can recover nearly 98% of the original performance for large models such as T5-XLarge.]
Then, we compare the performance of MoEfied models with different sizes and ratios of selected neurons and report the results in Figure 3. To measure the performance of MoEfication, we calculate the relative performance of the MoEfied model with respect to the original model. From the figure, we have four observations. (1) MoEfication works well with all models on all three datasets. MoEfied models use 10% to 30% of FFN parameters while maintaining over 95% of the original performance. (2) Larger models can use fewer neurons to recover the original performance. For example, T5-XLarge achieves nearly 98% relative performance on SST-2 and MNLI with 10% of the neurons, while T5-Small achieves the same results with 30% to 40% of the neurons. This result is consistent with the activation statistics, that is, larger models are sparser. We can expect that MoEfication will provide better efficiency with super large models. (3) Difficult tasks require models to select more experts to maintain the performance. From Table 1, we can see that the accuracy on RACE is much lower than on the other two tasks, and hence we think RACE is more difficult. Correspondingly, the relative performance with 10% of the neurons on RACE is also lower than on the other tasks. (4) MoEfication works similarly on T5-Small and T5-Small-Distill, which indicates that MoEfication can work with knowledge distillation for more efficient inference.

[Figure 4: (a) CDF of the ratio of activated neurons in BERT-Large on SST-2, MNLI, and RACE. (b) Relative performance of MoEfied BERT-Large.]
4.3 MoEfy GeLU-based Models
In addition to using ReLU as the activation function, many PLMs use GeLU (Hendrycks and Gimpel, 2016), including BERT (Devlin et al., 2019) and GPT (Brown et al., 2020). In this subsection, we study whether BERT, which is the most representative GeLU-based model, can be MoEfied. Considering that GeLU gives negative inputs small activations instead of 0, we first transform a GeLU-based BERT into a ReLU-based BERT, and then MoEfy the ReLU-based model. Specifically, we initialize a ReLU-based BERT using the pre-trained parameters of a BERT-Large⁴ and train the ReLU-based BERT on the pre-training corpus to adapt to the change of activation functions. In this work, we use the pre-training framework provided by NVIDIA⁵ and keep all hyper-parameters unchanged. Wikipedia and BookCorpus are used as the pre-training corpus. In the experiments, after 400 optimization steps, the pre-training loss is close to that of the original model. Hence, the adaptation cost is much smaller than the pre-training cost (about 10,000 steps). Meanwhile, the downstream performance of the ReLU-based model is comparable to that of the original model (93.1 vs. 93.5 on SST-2 and 84.8 vs. 85.2 on MNLI). Based on this ReLU-based BERT-Large, we study the sparse activation phenomenon and the effect of MoEfication and report the results in Figure 4.

From this figure, we have two observations: (1) The sparse activation phenomenon still exists in BERT. For example, more than 80% of inputs activate less than 10% of the neurons. It reveals the generality of the sparse activation phenomenon in pre-trained Transformers. It will be an interesting direction to explain this phenomenon empirically or theoretically in the future. (2) MoEfication also achieves good performance on BERT. For example, selecting 30% to 40% of the neurons can recover 97% of the performance. Since the activation of BERT is slightly denser than that of T5, it requires more neurons to recover most of the performance.

⁴ https://catalog.ngc.nvidia.com/orgs/nvidia/models/bert_pyt_ckpt_large_pretraining_amp_lamb
⁵ https://github.com/NVIDIA/DeepLearningExamples
4.4 Comparisons of MoEfication Strategies
To find the most effective MoEfication strategy, we evaluate different combinations of expert construction and selection methods. We use T5-Large and set the ratio of selected neurons to 20%. The results are shown in Table 2. From the table, we have two observations:

(1) For expert construction, Co-activation Graph Split is the best method according to the overall performance. Compared to the other two methods, Co-activation Graph Split directly uses the co-activation information to group the neurons that activate simultaneously into the same expert.

(2) For expert selection, the performance of Groundtruth Selection is close to that of the original model, which indicates that 20% of the FFN parameters are sufficient to achieve good performance on T5-Large. Meanwhile, MLP Selection is the best expert selection method and can work well with both Parameter Clustering Split and Co-activation Graph Split.

Construction           Selection     SST-2   MNLI   RACE
-                      -             96.2    89.5   81.3
Random                 Groundtruth   95.9    87.3   80.0
Random                 Random        65.9    36.3   29.2
Random                 Similarity    90.3    75.9   56.7
Random                 MLP           94.1    84.1   75.0
Parameter Clustering   Groundtruth   95.5    88.8   80.9
Parameter Clustering   Random        70.6    36.4   41.8
Parameter Clustering   Similarity    86.7    66.3   63.6
Parameter Clustering   MLP           95.9    87.5   78.7
Co-Activation Graph    Groundtruth   96.3    89.1   80.8
Co-Activation Graph    Random        85.3    68.5   54.7
Co-Activation Graph    Similarity    92.2    81.4   71.0
Co-Activation Graph    MLP           95.4    87.5   79.0

Table 2: Comparisons of different combinations of expert construction and selection methods using T5-Large. The first row is the original performance. The best results in each group are underlined and the best results on each dataset are in boldface.

Ratio   FLOPS   CPU    GPU
50.0%   1.50    1.43   1.15
25.0%   2.00    1.98   1.20
12.5%   2.40    2.28   1.47

Table 3: Speedup of FLOPS, CPU and GPU with different ratios of selected neurons.
5 Analysis
In this section, we analyze the efficiency and routing patterns of MoEfied models.
5.1 Efficiency Improvement
In this subsection, we show the efficiency improvement brought by MoEfication. We synthesize a batch of sequences with input and output lengths of 64 and evaluate T5-Large on the data. To comprehensively show the efficiency improvement, we report the relative speedup based on FLOPS, CPU, and GPU in Table 3. The FLOPS is estimated according to the statistics reported in prior work (2021). The results of CPU and GPU are tested on an Intel Broadwell CPU and an NVIDIA Tesla V100 GPU, respectively.

From this table, we have three observations: (1) MoEfication can significantly reduce the total FLOPS, e.g., 2x speedup at the ratio of 25%. Meanwhile, the speedup on CPU is close to that on FLOPS. Considering that CPUs are widely used for model inference in real-world scenarios, MoEfication is practical for the acceleration of various NLP applications. (2) The smaller the ratio, the smaller the gain. For example, the gain of halving 25% (to 12.5%) is 1.2x while the gain of halving 50% (to 25%) is 1.3x. Although the FLOPS reduction of feed-forward networks is linear in the ratio, the cost of attention networks is unchanged and becomes the bottleneck. Hence, 20% is a good ratio, which can yield a significant speedup (2x) and maintain most of the performance. (3) Since some of the operations of MoE cannot be easily parallelized, the speedup on GPU is smaller than that on CPU. Recently, some packages such as FastMoE (He et al., 2021) and DeepSpeed-MoE (Rajbhandari et al., 2022) are working on parallelizing the inference of MoE models on distributed computing platforms and already have some promising results. We believe the bottleneck of parallel computing in MoE models will be well solved in the future.

[Figure 5: Selection frequency of 64 experts in each encoder layer of MoEfied T5-Small. The frequency of ideal balanced selection is 0.2 while the actual distribution is highly unbalanced.]

[Figure 6: Input similarities between experts in the last encoder layer of MoEfied T5-Small: (a) the 8 most selected experts, (b) the 8 least selected experts. For the most selected experts, both the self-similarities and inter-similarities are low. For the least selected experts, the self-similarities are much higher than the inter-similarities.]
5.2 Routing Patterns
In this subsection, we investigate the routing patterns of MoEfied models. First, we count the selection frequency of each expert. Previous work introduces training objectives to ensure balanced selection to make full use of model parameters (Lepikhin et al., 2021; Fedus et al., 2021). We report the results of the MoEfied T5-Small with 20% of the experts on SST-2 in Figure 5. From the figure, we observe that the frequency distribution of expert selection is highly unbalanced. There are some commonly-used experts, whose frequencies are higher than 80%. Meanwhile, there are also some long-tail experts whose frequencies are lower than 10%.
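A small sketch of how the selection frequency can be computed from logged router decisions (names are illustrative; the actual analysis scripts are in the released repository):

```python
import torch

def selection_frequency(selected_experts, k):
    """Fraction of inputs that select each expert.

    selected_experts: (num_inputs, n) tensor of expert indices chosen per input.
    """
    counts = torch.bincount(selected_experts.flatten(), minlength=k).float()
    return counts / selected_experts.shape[0]
```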
Then, we calculate the self-similarities and inter-similarities of inputs between experts by sampling 10,000 inputs for each expert. We report the results of the last layer in Figure 6. For the most selected experts, which are selected by most inputs, the self-similarities are close to the inter-similarities. For the least selected experts, the self-similarities are much higher than the inter-similarities, which suggests that the inputs of each expert have an obvious cluster structure.

From these results, we can conclude the routing patterns of MoEfied models: there are some general experts, which can work for most inputs, and some input-specific experts, which are seldom used and may work in specific domains or tasks. This observation may inspire future work on training MoE models from scratch.
6 Conclusion
In this work, we verify that Transformer FFNs are naturally mixtures of experts and propose MoEfication, which utilizes the sparse activation phenomenon in FFNs to convert a normal model to its MoE version with the same parameters. Experimental results show that MoEfied models can achieve comparable performance to the original models using only 10% to 30% of FFN parameters. Correspondingly, it significantly reduces the FLOPS of inference, e.g., 2x speedup with 20% of FFN parameters. Besides, by studying the routing patterns of MoEfied models, we find that there are general and input-specific experts, which may inspire future work on training MoE models. We hope MoEfication can benefit real-world applications of PLMs with better efficiency and benefit the interpretation of the inner mechanism of FFNs.
Acknowledgement
This work is supported by the National Key R&D Program of China (No. 2020AAA0106502), Institute Guo Qiang at Tsinghua University, Beijing Academy of Artificial Intelligence (BAAI), and International Innovation Center of Tsinghua University, Shanghai, China. We thank Chenglei Si, Tianyu Gao and other members of THUNLP for their helpful discussion and feedback. Zhengyan Zhang conducted the experiments. Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, and Peng Li wrote the paper. Maosong Sun and Jie Zhou provided valuable advice on the research.
References
Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jin Jin, Xin Jiang, Qun Liu, Michael R. Lyu, and Irwin King. 2021. BinaryBERT: Pushing the limit of BERT quantization. In Proceedings of ACL/IJCNLP, pages 4334–4348.

Yoshua Bengio. 2013. Deep learning of representations: Looking forward. In Proceedings of SLSP, pages 1–37.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Proceedings of NeurIPS, pages 1877–1901.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges, pages 177–190.

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. 2021. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing factual knowledge in language models. In Proceedings of EMNLP, pages 6491–6506.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, pages 4171–4186.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of IWP, pages 9–16.

William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint 2101.03961.

Laurence J Garey. 1999. Brodmann's 'Localisation in the Cerebral Cortex'.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of EMNLP, pages 5484–5495.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of TEP, pages 1–9.

Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. 2013. Maxout networks. In Proceedings of ICML, pages 1319–1327.

Charles G Gross. 2002. Genealogy of the "grandmother cell". The Neuroscientist, 8(5):512–518.

Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan Yao, Ao Zhang, Liang Zhang, Wentao Han, Minlie Huang, Qin Jin, Yanyan Lan, Yang Liu, Zhiyuan Liu, Zhiwu Lu, Xipeng Qiu, Ruihua Song, Jie Tang, Ji-Rong Wen, Jinhui Yuan, Wayne Xin Zhao, and Jun Zhu. 2021. Pre-trained models: Past, present and future. arXiv preprint 2106.07139.

Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Jidong Zhai, and Jie Tang. 2021. FastMoE: A fast Mixture-of-Expert training system. arXiv preprint 2103.13262.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint 1606.08415.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint 1207.0580.

Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. 2020. DynaBERT: Dynamic BERT with adaptive width and depth. In Proceedings of NeurIPS.

Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Weinberger. 2018. Multi-scale dense networks for resource efficient image classification. In Proceedings of ICLR.

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural Comput., 3(1):79–87.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of ACL, pages 3651–3657.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for natural language understanding. In Findings of EMNLP, pages 4163–4174.

George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20(1):359–392.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In Proceedings of ICLR.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of EMNLP-IJCNLP, pages 4365–4374.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of NeurIPS, pages 1106–1114.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proceedings of EMNLP, pages 785–794.

Dmitry Lepikhin, Hyoukjoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling giant models with conditional computation and automatic sharding. In Proceedings of ICLR.

Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. 2021. BASE layers: Simplifying training of large, sparse models. arXiv preprint 2103.16716.

Lei Li, Yankai Lin, Deli Chen, Shuhuai Ren, Peng Li, Jie Zhou, and Xu Sun. 2021. CascadeBERT: Accelerating inference of pre-trained language models via calibrated complete models cascade. In Findings of EMNLP, pages 475–486.

Mikko I. Malinen and Pasi Fränti. 2014. Balanced K-means for clustering. In Proceedings of SSSPR, volume 8621, pages 32–41.

Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. 2020. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proc. Natl. Acad. Sci. USA, 117(48):30046–30054.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual knowledge in GPT. arXiv preprint arXiv:2202.05262.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In Proceedings of ICLR.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Proceedings of NeurIPS, pages 14014–14024.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of ICML, pages 807–814.

Bruno A Olshausen and David J Field. 1996. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. arXiv preprint arXiv:2201.05596.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, pages 2383–2392.

Sahana Ramnath, Preksha Nema, Deep Sahni, and Mitesh M. Khapra. 2020. Towards interpreting BERT for reading comprehension based QA. In Proceedings of EMNLP, pages 3236–3242.
Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, and Jason Weston. 2021. Hash layers for large sparse models. arXiv preprint arXiv:2106.04426.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint 1910.01108.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The Sparsely-Gated Mixture-of-Experts layer. In Proceedings of ICLR.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP, pages 1631–1642.

Xavier Suau, Luca Zappella, and Nicholas Apostoloff. 2020. Finding experts in transformer models. arXiv preprint arXiv:2005.07647.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. In Proceedings of EMNLP, pages 4323–4332.

Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: a compact task-agnostic BERT for resource-limited devices. In Proceedings of ACL, pages 2158–2170.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of ACL, pages 4593–4601.

Ashish Vaswani, Noam Shazeer, Niki Parmar, and Jakob Uszkoreit. 2017. Attention is all you need. In Proceedings of NeurIPS, pages 5998–6008.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of ACL, pages 5797–5808.

Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matt Gardner, and Sameer Singh. 2019. AllenNLP Interpret: A framework for explaining predictions of NLP models. In Proceedings of EMNLP-IJCNLP, pages 7–12.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR.

Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.

Wenxuan Wang and Zhaopeng Tu. 2020. Rethinking the value of transformer components. In Proceedings of COLING, pages 6019–6029.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. TACL, 7:625–641.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, pages 1112–1122.

Wenhao Wu, Dongliang He, Xiao Tan, Shifeng Chen, Yi Yang, and Shilei Wen. 2020. Dynamic inference: A new approach toward efficient video action recognition. arXiv preprint 2002.03342.

Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. 2020. DeeBERT: Dynamic early exiting for accelerating BERT inference. In Proceedings of ACL, pages 2246–2251.

Deming Ye, Yankai Lin, Yufei Huang, and Maosong Sun. 2021. TR-BERT: Dynamic token reduction for accelerating BERT inference. In Proceedings of NAACL-HLT, pages 5798–5809.

Or Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8bit BERT. arXiv preprint 1910.06188.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for longer sequences. In Proceedings of NeurIPS.

Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, and Qun Liu. 2020. TernaryBERT: Distillation-aware ultra-low bit BERT. In Proceedings of EMNLP, pages 509–521.

Zhengyan Zhang, Fanchao Qi, Zhiyuan Liu, Qun Liu, and Maosong Sun. 2021. Know what you don't need: Single-Shot Meta-Pruning for attention heads. AI Open, 2:36–42.
A MoEfication on Other Datasets
For text classification, we use the GLUE benchmark (Wang et al., 2019), including MNLI-matched (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), QQP⁶, RTE (Dagan et al., 2006), SST-2 (Socher et al., 2013), MRPC (Dolan and Brockett, 2005), CoLA (Warstadt et al., 2019), and STS-B (Giampiccolo et al., 2007). For reading comprehension, we use SQuAD (Rajpurkar et al., 2016) and RACE (Lai et al., 2017), which are the representative datasets for span extraction and multi-choice QA, respectively. We report the results on their development sets. For MNLI, QNLI, QQP, RTE, SST-2, MRPC, and RACE, we use accuracy as the metric. For CoLA, we use the Matthews correlation coefficient as the metric. For STS-B, we use the Pearson and Spearman correlations as the metrics. For SQuAD, we use the F1 score as the metric.

⁶ https://data.quora.com
We evaluate MoEfication on several downstream natural language understanding tasks with T5-Large. The ratio of selected neurons is set to 20%, which is sufficient for T5-Large as shown in Figure 3. In practice, there is still a gap between the performance of MoEfied models and that of the original models because the selected experts cannot cover all positive neurons with a limited computation budget. Hence, the outputs of MoEfied models will be slightly different from those of the original models. To calibrate MoEfied models, we further fine-tune the models on the training set, namely parameter calibration. Considering that the current routers are based on the first layers of FFNs ($W_1$ and $b_1$), we only optimize the second layers of FFNs ($W_2$ and $b_2$) to ensure that the routers can also work well after fine-tuning. We use a small learning rate of $10^{-7}$ for calibration. The other hyper-parameters remain the same as in fine-tuning. The results are shown in Table 4. MoEfied refers to the combination of Co-activation Graph Split and MLP Selection. MoEfied+GT refers to the combination of Co-activation Graph Split and Groundtruth Selection. MoEfied+Calib is the calibrated version of MoEfied. To calculate the average performance, we also include SST-2, MNLI, and RACE.

We observe that MoEfication introduces a small performance loss (about 1.5% on average) with an 80% reduction of the computation cost in FFNs. Meanwhile, calibration can effectively deal with the precision errors brought by MoEfication. For example, MoEfied+Calib improves over MoEfied by nearly 4% on CoLA and achieves the same average performance as MoEfied+GT.
B Activation Statistics before Fine-tuning

We count the activation statistics of PLMs before fine-tuning on the pre-training data containing about 50,000 input tokens. The results are shown in Figure 7. We observe that PLMs before fine-tuning also have the sparse activation phenomenon and fine-tuning brings little change.

[Figure 7: CDF of the ratios of activated neurons for each input with different models (Small, Base, Large, XLarge) before fine-tuning.]

Then, we compare the activations of pre-trained models and those of fine-tuned models. We use the average ratio of activated neurons as the index. The results are shown in Table 5. We observe that fine-tuning increases the average activation ratio for most models. The reason may be that different neurons start to learn the same task-specific patterns during fine-tuning. Interestingly, the increase on RACE is smaller than that on the other datasets. Since RACE is more difficult than the other datasets, there should be more task-specific patterns in RACE and fewer neurons learn the same patterns. Moreover, the pre-training task MLM requires more patterns than RACE, so the ratios of MLM are the lowest.
C Results of Graph Partition

Co-activation Graph Split achieves good performance in expert construction. Here, we study whether the co-activation graph is suitable for partitioning. We report the results of the graph partition of T5-Large on SST-2 in Figure 8. Smaller ratios of edgecuts, which straddle partitions, mean that more co-activation pairs are included within experts. We only report the results of encoder layers because all ratios of decoder layers are smaller than 0.001. From this figure, we can see that the overall ratio is small and these graphs are suitable for partitioning.

[Figure 8: Ratio of edgecuts in different layers.]

         MNLI   QNLI   QQP    RTE    SST-2   MRPC   CoLA   STS-B       RACE   SQuAD 1.1   Avg.
Original 89.5   94.4   91.7   87.1   96.2    88.0   59.4   91.2/90.9   81.3   93.2        87.2
MoEfied  87.5   93.2   90.2   86.4   95.4    87.5   55.5   90.6/90.3   79.0   92.2        85.7 (-1.5)
+GT      89.1   94.1   91.4   86.4   96.3    88.3   58.8   90.9/90.8   80.8   93.2        86.9 (-0.3)
+Calib   88.7   93.6   91.3   87.5   96.2    89.3   59.4   91.0/90.6   79.9   92.3        86.9 (-0.3)

Table 4: Results of T5-Large on the GLUE benchmark and two QA datasets. The last column reports the differences between the original model and the MoEfied variants. MoEfied models with parameter calibration achieve comparable performance to the original models.

        Small   Base   Large   XLarge
MLM     4.18    2.85   2.17    1.52
SST-2   5.53    2.24   2.50    2.46
MNLI    5.59    3.25   2.44    2.45
RACE    4.94    3.08   1.98    1.79

Table 5: Average ratio of activated neurons for each input. MLM represents the pre-trained models with masked language modeling. SST-2, MNLI, and RACE represent the fine-tuned models on each dataset.
D Accuracy of MLP Selection

MLP Selection trains MLPs to fit the groundtruth selection. In this part, we report the accuracy of the MLPs in T5-Large fine-tuned on SST-2. The results are shown in Figures 9 and 10. The overall accuracy of the encoder is about 0.8 and the overall accuracy of the decoder is about 0.7.

[Figure 9: Accuracy of MLPs of encoder layers.]

[Figure 10: Accuracy of MLPs of decoder layers.]

E Relative Cost of Routing

In this work, we set the number of neurons in each expert to 32. Then, the number of experts in each layer is $k = d_{ff} / 32$. In most Transformer models, $d_{ff} = 4 d_{model}$. The computation complexity of Similarity Selection for each input is

$O(k \cdot d_{model}) = O(d_{model}^2 / 8).$    (9)

The computation complexity of FFNs for each input is

$O(d_{model} \cdot d_{ff}) = O(4 d_{model}^2).$    (10)

Then, the relative cost of routing to that of FFNs (about 1/32) is constant for different models. It is also similar for MLP Selection.
F Graph Partitioning Algorithm

The goal of graph partitioning is to divide a graph into several sub-graphs such that the number of edges crossing sub-graphs is minimized. In this work, we use the graph partitioning algorithm proposed by Karypis and Kumar (1998). The graph partitioning algorithm consists of three phases: the coarsening phase, the partitioning phase, and the refinement phase. (1) In the coarsening phase, we create new super nodes by grouping together nodes that are highly connected. For example, if the weight of the edge between two nodes is large, these two nodes will be grouped together. In the setting of coarsening the co-activation graphs studied in this work, two neurons that often activate simultaneously will be treated as a new super neuron. (2) In the partitioning phase, we start with an initial bipartition of the super-node graph and then iteratively search for super nodes from each part of the graph such that swapping them leads to a partition with a smaller number of crossing edges. To divide a graph into $k$ parts, we need $\log k$ rounds of bipartition. (3) In the refinement phase, we project super nodes back to the original nodes and then continue to iteratively swap nodes to reduce the number of crossing edges.
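For reference, off-the-shelf METIS bindings implement this multilevel scheme and can be applied to the co-activation graph directly; the sketch below uses pymetis (our choice for illustration; the paper only states that it follows Karypis and Kumar (1998), not which implementation it uses, and edge weights are omitted in this simplified version):

```python
import numpy as np
import pymetis  # Python binding of METIS

def partition_coactivation_graph(coact, k):
    """Partition d_ff neurons into k (approximately balanced) experts.

    coact: (d_ff, d_ff) co-activation matrix; only the adjacency structure is used here.
    """
    coact = coact.copy()
    np.fill_diagonal(coact, 0)                                   # drop self-loops
    adjacency = [np.nonzero(coact[n])[0] for n in range(len(coact))]
    n_cuts, membership = pymetis.part_graph(k, adjacency=adjacency)
    return np.array(membership)                                  # expert index per neuron
```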
G Comparisons with Model Pruning

Based on the fine-tuned T5-Large on SST-2, we compare MoEfication with model pruning, which omits the weights having small values. The results are shown in Figure 11. We observe that model pruning significantly degrades the performance, while MoEfication achieves good performance by selectively activating parts of the network according to the input.

[Figure 11: Comparison between MoEfication and model pruning.]
H MoEfication vs. MoE Pre-training

In this subsection, we compare the performance of two kinds of MoE models. The first one is pre-trained from scratch. The second one is transformed from a standard model by MoEfication. For a fair comparison, we pre-train one MoE model and one standard model with the same model size from scratch using WikiText-103 (Merity et al., 2017). The pre-training objective is masked language modeling (MLM). The model architecture is the same as T5-Small. For pre-training, we use a batch size of 4096, a learning rate of 0.01, a maximum sequence length of 512, and the Adam optimizer. The number of experts is set to 64 and the router selects 32 of them for a single input.

We report the MLM loss on the validation set in Table 6. From the table, we have two observations. (1) The loss of the standard pre-trained model is lower than that of the pre-trained MoE model. We guess that the optimization of MoE models is more difficult than that of standard models because of the restricted selection of MoE models. (2) MoEfied models achieve better performance than the pre-trained MoE model. It indicates that pre-training a standard model and then conducting MoEfication can be a better option than pre-training an MoE model from scratch.

Model                    MLM Loss
MoE Pre-training         3.09
Standard Pre-training    2.88 (-0.21)
+MoEfication             3.02 (-0.07)
+GT                      2.95 (-0.14)

Table 6: Comparisons of MoE models pre-trained from scratch and modified by MoEfication. We report the MLM loss on the validation set. Standard pre-training with MoEfication is better than pre-training an MoE model from scratch.