Transformer Feed-Forward Layers Build Predictions by
Promoting Concepts in the Vocabulary Space
Mor Geva*,1  Avi Caciularu*,2,†  Kevin Ro Wang3  Yoav Goldberg1,2
1 Allen Institute for AI  2 Bar-Ilan University  3 Independent Researcher
morp@allenai.org, {avi.c33,kevinrowang,yoav.goldberg}@gmail.com
Abstract

Transformer-based language models (LMs) are at the core of modern NLP, but their internal prediction construction process is opaque and largely not understood. In this work, we make a substantial step towards unveiling this underlying prediction process, by reverse-engineering the operation of the feed-forward network (FFN) layers, one of the building blocks of transformer models. We view the token representation as a changing distribution over the vocabulary, and the output from each FFN layer as an additive update to that distribution. Then, we analyze the FFN updates in the vocabulary space, showing that each update can be decomposed into sub-updates corresponding to single FFN parameter vectors, each promoting concepts that are often human-interpretable. We then leverage these findings for controlling LM predictions, where we reduce the toxicity of GPT2 by almost 50%, and for improving computation efficiency with a simple early exit rule, saving 20% of computation on average.
1 Introduction
How do transformer-based language models (LMs) construct predictions? We study this question through the lens of the feed-forward network (FFN) layers, one of the core components in transformers (Vaswani et al., 2017). Recent work showed that these layers play an important role in LMs, acting as memories that encode factual and linguistic knowledge (Geva et al., 2021; Dai et al., 2022; Meng et al., 2022). In this work, we investigate how outputs from the FFN layers are utilized internally to build predictions.
We begin by making two observations with respect to the representation of a single token in the input, depicted in Fig. 1.

* Equal contribution.
† Work done during an internship at AI2.
1 Our codebase is available at https://github.com/aviclu/ffn-values.
Figure 1: Illustration of our findings. Feed-forward layers apply additive updates (A) to the token representation $x$, which can be interpreted as a distribution over the vocabulary (B). An update is a set of sub-updates induced by parameter vectors $v_1, \ldots, v_{d_m}$ (C), each can be interpreted as a concept in the vocabulary space (D).
First, each FFN layer induces an additive update to the token representation (Fig. 1, A). Second, the token representation across the layers can be translated at any stage to a distribution over the output vocabulary (Geva et al., 2021) (Fig. 1, B). We reason that the additive component in the update changes this distribution (§2), namely, FFN layers compute updates that can be interpreted in terms of the output vocabulary.
We then decompose the FFN update (§3), interpreting it as a collection of sub-updates, each corresponding to a column in the second FFN matrix (Fig. 1, C) that scales the token probabilities in the output distribution. Through a series of experiments, we find that (a) sub-update vectors across the entire network often encode a small set of human-interpretable, well-defined concepts, e.g. "breakfast" or "pronouns" (§4, Fig. 1, D), and (b) FFN updates rely primarily on token promotion (rather than elimination), namely, tokens at the top of the output distribution are those pushed strongly enough by sub-updates (§5). Overall, these findings allow fine-grained interpretation of the FFN operation, providing a better understanding of the prediction construction process in LMs.
Beyond interpretation, our findings also have practical utility. In §6.1, we show how we can intervene in the prediction process, in order to manipulate the output distribution in a direction of our choice. Specifically, we show that increasing the weight of only 10 sub-updates in GPT2 reduces toxicity in its generations by almost 50%. Also, in §6.2, we show that dominant sub-updates provide a useful signal for predicting an early exit point, saving 20% of the computation on average.
In conclusion, we investigate the mechanism by which FFN layers update the inner representations of transformer-based LMs. We propose that the FFN output can be viewed as a collection of updates that promote concrete concepts in the vocabulary space, and that these concepts are often interpretable for humans. Our findings shed light on the prediction construction process in modern LMs, suggesting promising research directions for interpretability, control, and efficiency.
2 Token Representations as Evolving
Distributions Over the Vocabulary
Modern LMs (Baevski and Auli, 2019; Radford et al., 2019; Brown et al., 2020) are transformer models primarily trained to predict the next-token probability for a given input. Such LMs are composed of intertwined multi-head self-attention (MHSA) layers and FFN layers (Vaswani et al., 2017), with residual connections (He et al., 2016) between each pair of consecutive layers. The LM prediction is obtained by projecting the output vector from the final layer to an embedding matrix $E \in \mathbb{R}^{|\mathcal{V}| \times d}$, with a hidden dimension $d$, to get a distribution over a vocabulary $\mathcal{V}$ (after softmax).
Given a sequence $w = \langle w_1, \ldots, w_t \rangle$ of input tokens, the model creates a contextualized representation $x_i \in \mathbb{R}^d$ for each token $w_i \in w$, that is being updated throughout the layers. In this work, we analyze the updates applied by the FFN layers and how they construct the model prediction. Concretely, each FFN layer $\ell = 1, \ldots, L$ processes $x_i^\ell$ and produces an output $o_i^\ell$, which is then added to $x_i^\ell$ to yield an updated representation $\tilde{x}_i^\ell$:
$$o_i^\ell = \mathrm{FFN}^\ell(x_i^\ell), \qquad \tilde{x}_i^\ell = x_i^\ell + o_i^\ell.$$
The updated representation $\tilde{x}_i^\ell$ then goes through a MHSA layer, yielding the input $x_i^{\ell+1}$ for the next FFN layer. The evolving representation in this process (i.e. $x_i^\ell \rightarrow \tilde{x}_i^\ell, \ \forall \ell$) can be viewed as an information stream that is being processed and updated by the layers (Elhage et al., 2021). The output probability distribution is obtained from the final representation of the token, i.e.,
$$y = \mathrm{softmax}(E \tilde{x}_i^L). \quad (1)$$
To analyze the FFN updates, we read from the representation at any layer a distribution over the output vocabulary, by applying the same projection as in Eq. 1 (Geva et al., 2021):
$$p_i^\ell = \mathrm{softmax}(E x_i^\ell), \qquad \tilde{p}_i^\ell = \mathrm{softmax}(E \tilde{x}_i^\ell).$$
Note that $\tilde{p}_i^L = y$. Importantly, by linearity:
$$E \tilde{x}_i^\ell = E x_i^\ell + E o_i^\ell,$$
implying that $o_i^\ell$ can be interpreted as an additive update in the vocabulary space. However, we find that the projection of the FFN output $E o_i^\ell$ to the vocabulary is not interpretable (§4). In this work, we take this a step further, and decompose the update $o_i^\ell$ into a set of smaller sub-updates. By projecting the sub-updates to the vocabulary we find that they often express human-interpretable concepts.

In the rest of the paper, we focus on FFN updates to the representation of a single token in the sequence, and omit the token index for brevity, i.e. $x^\ell := x_i^\ell$ and $p^\ell := p_i^\ell$.
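To make this reading operation concrete, the following is a minimal sketch (in PyTorch, with random stand-in tensors rather than a real LM) of computing $p^\ell$ and $\tilde{p}^\ell$ by projecting a representation and its FFN-updated version through an embedding matrix, as in Eq. 1; all names and shapes here are illustrative.

```python
import torch

def vocab_distribution(x, E):
    """Project a hidden state x (d,) through the embedding matrix E (|V|, d)
    and normalize with softmax, as in Eq. 1."""
    return torch.softmax(E @ x, dim=-1)

# Hypothetical shapes: d = 768, |V| = 50257 (GPT2-like).
d, vocab_size = 768, 50257
E = torch.randn(vocab_size, d)                # stand-in embedding matrix
x_l = torch.randn(d)                          # representation before the FFN update
o_l = torch.randn(d)                          # FFN output at the same layer

p_l = vocab_distribution(x_l, E)              # distribution before the update
p_tilde_l = vocab_distribution(x_l + o_l, E)  # distribution after the additive update
print(p_l.topk(5).indices, p_tilde_l.topk(5).indices)
```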
3 The FFN Output as a Collection of
Updates to the Output Distribution
We now decompose the FFN output, and interpret
it as a set of sub-updates in the vocabulary space.
FFN Outputs as Linear Vector Combinations. Each FFN at layer $\ell$ consists of two linear transformations with a point-wise activation function in between (bias terms are omitted):
$$\mathrm{FFN}^\ell(x^\ell) = f\big(W_K^\ell x^\ell\big) W_V^\ell,$$
where $W_K^\ell, W_V^\ell \in \mathbb{R}^{d_m \times d}$ are parameter matrices, and $f$ is a non-linearity function. Previous work proposed this module can be cast as an emulated neural key-value memory (Sukhbaatar et al., 2015, 2019), where rows in $W_K^\ell$ and columns in $W_V^\ell$ are viewed as keys and values, respectively (Geva et al., 2021). For an input $x^\ell$, the keys produce a vector of coefficients $m^\ell := f\big(W_K^\ell x^\ell\big) \in \mathbb{R}^{d_m}$, that weighs the corresponding values in $W_V^\ell$. Denoting by $k_i^\ell$ the $i$-th row of $W_K^\ell$ and by $v_i^\ell$ the $i$-th column of $W_V^\ell$, we can then use the following formulation:
$$\mathrm{FFN}^\ell(x^\ell) = \sum_{i=1}^{d_m} f\big(x^\ell \cdot k_i^\ell\big)\, v_i^\ell = \sum_{i=1}^{d_m} m_i^\ell v_i^\ell.$$
Therefore, a FFN update can be viewed as a collection of sub-updates, each corresponding to a weighted value vector in the FFN output.

2 In some LMs, e.g. GPT2, a layer normalization (LN) (Ba et al., 2016) is applied to the representation $\tilde{x}_i^\ell$. We omit it here and show it does not influence our interpretation in §3.
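The decomposition above can be written directly in code. The sketch below uses random matrices with GPT2-like shapes and GeLU as the non-linearity $f$; it is an illustration of the formula, not the authors' implementation, and all variable names are ours.

```python
import torch

# Hypothetical layer dimensions (GPT2-small-like): d = 768, d_m = 3072.
d, d_m = 768, 3072
W_K = torch.randn(d_m, d)   # keys: row i is k_i
W_V = torch.randn(d_m, d)   # values: one value vector v_i per index i
x = torch.randn(d)          # input representation x^l

m = torch.nn.functional.gelu(W_K @ x)   # coefficients m^l = f(W_K x), shape (d_m,)
sub_updates = m[:, None] * W_V          # sub-update i is m_i * v_i, shape (d_m, d)
ffn_out = sub_updates.sum(dim=0)        # the full FFN output o^l

# Sanity check: the sum of sub-updates matches the usual matrix form f(W_K x) W_V.
print(torch.allclose(ffn_out, m @ W_V, atol=1e-3))
```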
Terminology. In the rest of the paper, we refer to the vectors $v_i^\ell$ as value vectors, and to their weighted form $m_i^\ell v_i^\ell$ as sub-updates. A transformer LM with $L = 10$, $d_m = 3000$ will have 30,000 value vectors, and every token that passes through the transformer will weight these value vectors differently, resulting in 30,000 sub-updates, where only a few of the sub-updates have high weights.
Interpreting Sub-Updates in the Vocabulary Space. Consider a sub-update $m_i^\ell v_i^\ell$ for a given input; we can estimate its influence on the representation $x^\ell$ (before the FFN update) by analyzing the change it induces on the output distribution. Concretely, we isolate the effect of $m_i^\ell v_i^\ell$ on the probability $p_w^\ell$ of $w \in \mathcal{V}$:
$$p\big(w \mid x^\ell + m_i^\ell v_i^\ell,\, E\big) = \frac{\exp\big(e_w \cdot x^\ell + e_w \cdot m_i^\ell v_i^\ell\big)}{Z\big(E(x^\ell + m_i^\ell v_i^\ell)\big)} \propto \exp\big(e_w \cdot x^\ell\big) \cdot \exp\big(e_w \cdot m_i^\ell v_i^\ell\big), \quad (2)$$
where $e_w$ is the embedding of $w$, and $Z$ is the constant softmax normalization factor.

This implies that each sub-update $m_i^\ell v_i^\ell$ introduces a scaling factor to the probability of every token $w$ based on its dot product with $e_w$. Specifically, having $e_w \cdot m_i^\ell v_i^\ell > 0$ increases the probability of $w$, and having $e_w \cdot m_i^\ell v_i^\ell < 0$ decreases it. This scaling factor can be split into two parts:

3 As in Eq. 1, LN is omitted. In App. A.1, we verify empirically that our findings hold also when LN is applied.
The term $e_w \cdot v_i^\ell$ can be viewed as a static score of $w$ that is independent of the input to the model. Thus, the projection $r_i^\ell = E v_i^\ell \in \mathbb{R}^{|\mathcal{V}|}$ induces a ranking over the vocabulary that allows comparing the scores by $v_i^\ell$ w.r.t. different tokens.

The term $m_i^\ell$ is the dynamic coefficient of $v_i^\ell$, which is fixed for all tokens for a given input. Thus, these coefficients allow comparing the contribution of value vectors in a specific update.

Overall, the scaling factor $e_w \cdot m_i^\ell v_i^\ell$ can be viewed as the effective score given by a value vector $v_i^\ell$ to a token $w$ for a given input.
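As an illustration, the static ranking $r_i^\ell = E v_i^\ell$ and the effective scores can be computed as follows; the tensors here are random placeholders, and with a real model $E$ would be its (tied) output embedding matrix.

```python
import torch

# Static ranking r_i = E v_i over the vocabulary for one value vector, plus the
# input-dependent effective scores m_i * (e_w . v_i) for every token w.
vocab_size, d = 50257, 768
E = torch.randn(vocab_size, d)   # token embeddings e_w as rows (placeholder)
v_i = torch.randn(d)             # a value vector v_i^l (placeholder)
m_i = 2.3                        # its coefficient for some input (dynamic part)

r_i = E @ v_i                    # static scores, shape (|V|,)
top30 = r_i.topk(30).indices     # token ids inspected in the annotation task (Sec. 4)
effective = m_i * r_i            # e_w . (m_i v_i): >0 promotes w, <0 suppresses it
print(top30[:5], effective[top30[:5]])
```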
In the next sections, we use these observations to answer two research questions: (a) What information is encoded in sub-updates and what tokens do they promote? (§4) and (b) How do FFN updates build the output probability distribution? (§5)
4 Sub-Updates Encode Concepts in the
Vocabulary Space
We evaluate whether projection to the vocabulary provides a meaningful way to read FFN updates, and the extent to which sub-updates are interpretable based on their projections. To this end, we manually inspect the top-scoring tokens by value vectors and check if they express interpretable concepts. Concretely, we consider two representative LMs (details below), and for each vector $v_i^\ell$ compute a ranking over the vocabulary by sorting the projection $r_i^\ell$ (§3). Then, we try to detect patterns in the top-scoring tokens of each value vector.
Concepts Annotation Task. We let experts (NLP graduate students) annotate concepts by identifying common patterns among the top-30 scoring tokens of each value vector. For a set of tokens, the annotation protocol includes three steps of: (a) identifying patterns that occur in at least 4 tokens, (b) describing each recognized pattern, and (c) classifying each pattern as either semantic (e.g., mammals), syntactic (e.g., past-tense verbs), or names. The last class was added only for WIKILM (see below), following the observation that a large portion of the model's vocabulary consists of names. Further details, including the complete instructions and a fully annotated example, can be found in App. A.2.
Models. We conduct our experiments over two auto-regressive decoder LMs: the model of Baevski and Auli (2019) (dubbed WIKILM), a 16-layer LM trained on the WIKITEXT-103
Figure 2: Portion of top-scoring tokens by value vectors in WIKILM and GPT2 that were associated with a semantic or syntactic concept, a name, or could not be matched to any concept (N/A).
Sub-update        Concept                       Top-scoring tokens

GPT2
$v^{3}_{1018}$    Measurement (semantic)        kg, percent, spread, total, yards, pounds, hours
$v^{8}_{1900}$    WH-relativizers (syntactic)   which, whose, Which, whom, where, who, wherein
$v^{11}_{2601}$   Food and drinks (semantic)    drinks, coffee, tea, soda, burgers, bar, sushi

WIKILM
$v^{1}_{1}$       Pronouns (syntactic)          Her, She, Their, her, she, They, their, they, His
$v^{6}_{3025}$    Adverbs (syntactic)           largely, rapidly, effectively, previously, normally
$v^{13}_{3516}$   Groups of people (semantic)   policymakers, geneticists, ancestries, Ohioans

Table 1: Example value vectors in GPT2 and WIKILM promoting human-interpretable concepts.
corpus (Merity et al., 2017) with word-level tokenization ($|\mathcal{V}| = 267{,}744$), and GPT2 (Radford et al., 2019), a 12-layer LM trained on WEBTEXT (Radford et al., 2019) with sub-word tokenization ($|\mathcal{V}| = 50{,}257$). GPT2 uses the GeLU activation function (Hendrycks and Gimpel, 2016), while WIKILM uses ReLU, and in contrast to GPT2, WIKILM does not apply layer normalization after FFN updates. WIKILM defines $d = 1024$, $d_m = 4096$ and GPT2 defines $d = 768$, $d_m = 3072$, resulting in a total of 65k and 36k value vectors, respectively. For our experiments, we sample 10 random vectors per layer from each model, yielding a total of 160 and 120 vectors to analyze from WIKILM and GPT2, respectively.
4.1 Projection of Sub-Updates is Meaningful
Real vs. Random Sub-Updates. We validate our approach by comparing concepts in top-tokens of value vectors and 10 random vectors from a normal distribution with the empirical mean and standard deviation of the real vectors. We observe that a substantially higher portion of top-tokens were associated with a concept in value vectors compared to the random ones (Tab. 2): 55.1% vs. 22.7% in WIKILM, and 37% vs. 16% in GPT2. Also, in both models, the average number of concepts per vector was >1 in the value vectors compared to ≤0.5 in the random ones. Notably, no semantic nor syntactic concepts were identified in WIKILM's random vectors, and in GPT2, only 4% of the tokens were marked as semantic concepts in the random vectors versus 24.9% in the value vectors.
Updates vs. Sub-Updates. We justify the FFN output decomposition by analyzing concepts in the top-tokens of 10 random FFN outputs per layer (Tab. 2). In WIKILM (GPT2), 39.4% (46%) of the tokens were associated with concepts, but for 19.7% (34.2%) the concept was stopwords/punctuation. Also, we observe very few concepts (<4%) in the last two layers of WIKILM. We attribute this to extreme sub-updates that dominate the layer's output (§5.2). Excluding these concepts results in a considerably lower token coverage in projections of updates compared to those of sub-updates: 19.7% vs. 55.1% in WIKILM, and 11.8% vs. 36.7% in GPT2.

Overall, this shows that projecting sub-updates to the vocabulary provides a meaningful interface to the information they encode. Moreover, decomposing the FFN outputs is necessary for fine-grained interpretation of sub-updates.
                         GPT2    WIKILM
FFN sub-updates          36.7%   55.1%
  +stopwords concepts    37%     55.1%
Random sub-updates       16%     22.7%
FFN updates              11.8%   19.7%
  +stopwords concepts    46%     39.4%

Table 2: Portion of top-scoring tokens associated with a concept, for FFN updates and sub-updates in WIKILM and GPT2, and for random vectors. For FFN updates/sub-updates, we show results with and without counting concepts marked as stopwords.
4.2 Sub-Update Projections are Interpretable
Fig. 2 shows the annotation results across layers, for WIKILM and GPT2. In both models and across all layers, a substantial portion (40%-70% in WIKILM and 20%-65% in GPT2) of the top-tokens were associated with well-defined concepts, most of which were classified as semantic. Also, we observe that the top-tokens of a single value vector were associated with 1.5 (WIKILM) and 1.1 (GPT2) concepts on average, showing that sub-updates across all layers encode a small set of well-defined concepts. Examples are in Tab. 1.

These findings expand on previous results by Geva et al. (2021), who observed that value vectors in the upper layers represent next-token distributions that follow specific patterns. Our results, which hold across all the layers, suggest that these vectors represent general concepts rather than prioritizing specific tokens.
Underestimation of Concept Frequency. In practice, we find that this task is hard for humans, as it requires reasoning over a set of tokens without any context, while tokens often correspond to uncommon words, homonyms, or sub-words. Moreover, some patterns necessitate world knowledge (e.g. villages in Europe near rivers) or linguistic background (e.g. negative polarity items). This often leads to undetectable patterns, suggesting that the overall results are an underestimation of the true concept frequency. Providing additional context and token-related information are possible future directions for improving the annotation protocol.

Implication for Controlled Generation. If sub-updates indeed encode concepts, then we can not only interpret their contribution to the prediction, but also intervene in this process, by increasing the

4 A sub-update annotation took 8.5 minutes on average.
$p^\ell$: cow, cat, dog, goat, horse, bear
$\tilde{p}^\ell$: dog, cat, goat, horse, cow, bear
Saturation: "dog" is promoted from rank 3 in $p^\ell$ to rank 1 in $\tilde{p}^\ell$, to be the top-candidate until the last layer.

$p^\ell$: cow, cat, dog, goat, horse, bear
$\tilde{p}^\ell$: dog, cat, goat, horse, cow, bear
Elimination: "cow" is eliminated from rank 1 in $p^\ell$ to 5 in $\tilde{p}^\ell$.

Table 3: Example saturation and elimination events, after a FFN update (reference tokens are in bold text).
weights of value vectors that promote tendencies
of our choice. We demonstrate this in §6.1.
5 FFN Updates Promote Tokens in the
Output Distribution
We showed that sub-updates often encode interpretable concepts (§4), but how do these concepts construct the output distribution? In this section, we show that sub-updates systematically configure the prediction via promotion of candidate tokens.
5.1 Promoted Versus Eliminated Candidates
Every sub-update $m_i^\ell v_i^\ell$ either increases, decreases, or does not change the probability of a token $w$, according to the score $e_w \cdot m_i^\ell v_i^\ell$ (§3). This suggests three mechanisms by which tokens are pushed to the top of the output distribution: promotion, where sub-updates increase the probability of favorable tokens, elimination, where sub-updates decrease candidate probabilities, or a mixture of both. To test which mechanism holds in practice, we analyze the scores sub-updates assign to top-candidate tokens by the representation. To simplify the analysis, we focus on changes induced by the 10 most dominant sub-updates in each layer, that is, the 10 sub-updates $m_i^\ell v_i^\ell$ with the largest contribution to the representation, as measured by $|m_i^\ell| \cdot \|v_i^\ell\|$ (see details in App. A.3).
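A minimal sketch of this dominance criterion, assuming the coefficients $m^\ell$ and the value matrix are available as tensors (all names are illustrative):

```python
import torch

def dominant_sub_updates(m, W_V, k=10):
    """Return indices of the k sub-updates with the largest contribution
    |m_i| * ||v_i||, following the dominance criterion described above."""
    weights = m.abs() * W_V.norm(dim=-1)   # |m_i| * ||v_i|| for every i
    return weights.topk(k).indices

# Toy example with random coefficients and value vectors.
d, d_m = 768, 3072
m = torch.randn(d_m)
W_V = torch.randn(d_m, d)
print(dominant_sub_updates(m, W_V))
```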
For the experiments, we use a random sample of 2000 examples from the validation set of WIKITEXT-103,5 which both WIKILM and GPT2 did not observe during training. As the experiments do not involve human annotations, we use a larger GPT2 model with $L = 24$, $d = 1024$, $d_m = 4096$.

We start by comparing the sub-updates' scores to a reference token in two types of events:

5 Data is segmented into sentences (Geva et al., 2021).
Figure 3: Mean, maximum and minimum scores assigned by the 10 most dominant sub-updates in each layer to the top-candidate token, in GPT2 (left) and WIKILM (right). Solid (dashed) lines exclude (include) functional value vector groups. The y-axis in both plots is cut for readability, as the max. (min.) scores reach 100 (-6).
Sub-updates         Event         Max.   Mean    Min.
WIKILM, dominant    saturation    1.2    <0.01   -0.8
                    elimination   0.5    0.01    -0.5
WIKILM, random      saturation    0.02   <0.01   -0.02
                    elimination   0.02   <0.01   -0.02
GPT2, dominant      saturation    8.5    1.3     -4.9
                    elimination   4.0    0.1     -3.6
GPT2, random        saturation    0.2    0.01    -0.2
                    elimination   0.1    <0.01   -0.1

Table 4: Maximum, mean, and minimum scores of reference tokens in saturation and elimination events, by the 10 most dominant and 10 random sub-updates.
Saturation (Tab. 3, top): The update $p^\ell \rightarrow \tilde{p}^\ell$ where the final token predicted by the model (i.e., $w = \mathrm{argmax}(y)$) was promoted to be the top candidate until the last layer. We analyze saturation events induced by the FFN before the last layer, covering 1184 and 1579 events in WIKILM and GPT2, respectively.
Elimination (Tab. 3, bottom): The update $p^\ell \rightarrow \tilde{p}^\ell$ with the largest increase in the top candidate's rank, i.e. where the top candidate was dropped behind other candidates to have a rank $> 1$. Overall, our analysis covers 1909 (WIKILM) and 1996 (GPT2) elimination events.
We compute the mean, maximum, and minimum scores of the reference token by the 10 most dominant sub-updates in each event, and average over all the events. As a baseline, we compute the scores by 10 random sub-updates from the same layer.

Tab. 4 shows that tokens promoted to the top of the distribution receive higher maximum scores than tokens eliminated from the top position (1.2 vs. 0.5 in WIKILM and 8.5 vs. 4.0 in GPT2), indicating they are pushed strongly by a few dominant sub-updates. Moreover, tokens eliminated from the top of the distribution receive near-zero mean scores, by both dominant and random sub-updates, suggesting they are not being eliminated directly. In contrast to promoted tokens, where the maximum scores are substantially higher than the minimal scores (1.2 vs. -0.8 in WIKILM and 8.5 vs. -4.9 in GPT2), for eliminated tokens the scores are similar in their magnitude (0.5 vs. -0.5 in WIKILM and 4.0 vs. -3.6 in GPT2). Last, scores by random sub-updates are dramatically lower in magnitude, showing that our choice of sub-updates is meaningful and that higher coefficients translate to greater influence on the output distribution.

This suggests that FFN updates work in a promotion mechanism, where top-candidate tokens are those being pushed by dominant sub-updates.
5.2 Sub-Updates Across Layers
To analyze the FFN operation in different layers, we break down the top-candidate scores per layer. Formally, let $w^\ell = \mathrm{argmax}(p^\ell)$ be the top candidate at layer $\ell$ (before the FFN update) for a given input; we extract the scores $e_{w^\ell} \cdot m_i^\ell v_i^\ell$ by the 10 most dominant sub-updates and compute the mean, minimum and maximum scores over that set.

Fig. 3 shows that, except for the last few layers (23-24 in GPT2 and 14-16 in WIKILM), maximum and minimum scores are distributed around non-negative mean scores, with prominent peaks in maximum scores (layers 3-5 in GPT2 and layers 4-11 in WIKILM). This suggests that the token promotion mechanism generally holds across layers. However, scores diverge in the last layers of both models, with strong negative minimum scores, indicating that the probability of the top-candidate is pushed down by dominant sub-updates. We next show that these large deviations in positive and negative scores (Fig. 3, dashed lines) result from the operation of small sets of functional value vectors.
Extreme Sub-Updates. To analyze the extreme FFN updates, we first cluster the value vectors to discover high-level trends. We use agglomerative clustering (Müllner, 2011) to learn 10k clusters for each model, based on the cosine distance matrix $D$, where $D_{(\ell_1,i_1),(\ell_2,i_2)} = 1 - \cos\big(v_{i_1}^{\ell_1}, v_{i_2}^{\ell_2}\big)$, $\forall i_1, i_2 \in \{1, \ldots, d_m\}$, $\forall \ell_1, \ell_2 \in \{1, \ldots, L\}$. Then, we search for clusters that are frequently active in extreme updates, by (a) extracting sub-updates where the scores for the top-candidate pass a certain threshold (10 for GPT2 and 5 for WIKILM), and (b) counting the appearances of each cluster in the layer sub-updates.
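A possible implementation of this clustering step, sketched with scikit-learn on toy-sized random vectors: the paper specifies agglomerative clustering over cosine distances, so the average-linkage choice and sizes below are our assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stack all value vectors (here a random stand-in), build the pairwise cosine
# distance matrix D, and cluster. The paper uses 10k clusters over L * d_m vectors.
rng = np.random.default_rng(0)
values = rng.standard_normal((500, 64))          # stand-in for all value vectors

normed = values / np.linalg.norm(values, axis=1, keepdims=True)
D = np.clip(1.0 - normed @ normed.T, 0.0, None)  # D = 1 - cos(v_a, v_b)

# metric="precomputed" requires scikit-learn >= 1.2 (older versions use affinity=).
clustering = AgglomerativeClustering(n_clusters=50, metric="precomputed",
                                     linkage="average")
labels = clustering.fit_predict(D)
print(labels[:10])
```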
In both models, a small set of homogeneous clusters account for the extreme sub-updates shown in Fig. 3, which can be divided into two main groups of value vectors: vectors in the upper layers that promote generally unlikely tokens (e.g. rare tokens), and vectors that are spread over all the layers and promote common tokens (e.g. stopwords). These clusters, which cover only a small fraction of the value vectors (1.7% in GPT2 and 1.1% in WIKILM), are mostly active for examples where the input sequence has ≤3 tokens or when the target token can be easily inferred from the context (e.g. end-of-sentence period), suggesting that these value vectors might configure easy model predictions. More interestingly, the value vectors that promote unlikely tokens can be viewed as saturation vectors, which propagate the distribution without changing the top tokens. Indeed, these vectors are in the last layers, where often the model already stores its final prediction (Geva et al., 2021).
6 Applications
We leverage our findings for controlled text generation (§6.1) and computation efficiency (§6.2).
6.1 Zero-Shot Toxic Language Suppression
LMs are known to generate toxic, harmful language that damages their usefulness (Bender et al., 2021; McGuffie and Newhouse, 2020; Wallace et al., 2019). We utilize our findings to create a simple, intuitive method for toxic language suppression.
6 We experimented with k = 3e2, 1e3, 3e3, 1e4, and 3e4, and chose k = 1e4 based on manual inspection.

Method. If LMs indeed operate in a promotion mechanism, we reason that we can decrease toxicity by turning on non-toxic sub-updates. We find value vectors that promote safe, harmless concepts by extracting the top-tokens in the projections of all the value vectors and either (a) manually searching for vectors that express a coherent set of positive words (e.g. "safe" and "thank"), or (b) grading the tokens with the Perspective API and selecting non-toxic value vectors (see details in App. A.4). We turn on these value vectors by setting their coefficients to 3, a relatively high value according to Fig. 3.
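A rough sketch of such an intervention with the Hugging Face transformers GPT-2 implementation is shown below. It assumes that rows of mlp.c_proj.weight correspond to the value vectors of a layer; the (layer, index) pairs are placeholders rather than the vectors actually selected in the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
chosen = [(9, 12), (10, 300)]   # hypothetical (layer, value index) pairs
COEF = 3.0                      # the fixed coefficient used to "turn on" a sub-update

def make_hook(value_vec):
    def hook(module, inputs, output):
        # Add COEF * v to the FFN output at every position of the sequence.
        return output + COEF * value_vec.to(output.dtype)
    return hook

for layer, idx in chosen:
    # In HF GPT-2, mlp.c_proj.weight has shape (d_m, d); row idx is one value vector.
    v = model.transformer.h[layer].mlp.c_proj.weight[idx].detach()
    model.transformer.h[layer].mlp.register_forward_hook(make_hook(v))

prompt = tokenizer("The protesters", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0]))
```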
We compare our method with two baselines:

1. Self-Debiasing (SD) (Schick et al., 2021): SD generates a list of undesired words for a given prompt by appending a self-debiasing input, which encourages toxic completions, and calculating which tokens are promoted compared to the original prompt. These undesired words' probabilities are then decreased according to a decay constant, which we set to 50 (default).

2. WORDFILTER: We prevent GPT2 from generating words from a list of banned words by setting any logits that would result in a banned word completion to $-\infty$ (Gehman et al., 2020).
Evaluation. We evaluate our method on the challenging subset of REALTOXICITYPROMPTS (Gehman et al., 2020), a collection of 1,225 prompts that tend to yield extremely toxic completions in LMs, using the Perspective API, which grades text according to six toxicity attributes. A score of >0.5 indicates a toxic text w.r.t. the attribute. Additionally, we compute perplexity to account for changes in LM performance. We use GPT2 and, following Schick et al. (2021), generate continuations of 20 tokens.
Results. Finding the non-toxic sub-updates manually was intuitive and efficient (taking <5 minutes). Tab. 5 shows that turning on only 10 value vectors (0.01%) substantially decreases toxicity (↓47%), outperforming both SD (↓37%) and WORDFILTER (↓20%). Moreover, inducing sub-updates that promote safety-related concepts is more effective than promoting generally non-toxic sub-updates. However, our method resulted in a perplexity increase greater than that induced by SD, though the increase was still relatively small.
6.2 Self-Supervised Early Exit Prediction
The recent success of transformer-based LMs in NLP tasks has resulted in major production cost increases (Schwartz et al., 2020a), which spurred interest in early-exit methods that reduce the incurred costs (Xu et al., 2021). Such methods often use small neural models to determine when to stop the execution process (Schwartz et al., 2020b;
Model             Toxicity        Severe toxicity  Sexually explicit  Threat          Profanity       Identity attack  PPL
GPT2              58.5%           49.2%            34.1%              16.4%           52.5%           16.8%            21.7
↑10 Manual Pick   30.8% (↓47%)    24.8% (↓50%)     20.4% (↓40%)       6.0% (↓63%)     27.9% (↓47%)    8.8% (↓48%)      25.3
↑10 API Graded    52.7% (↓10%)    44% (↓11%)       33.2% (↓3%)        13.3% (↓19%)    47.6% (↓9%)     15.3% (↓9%)      23.8
SD                37.2% (↓37%)    26.4% (↓46%)     21.7% (↓36%)       7.8% (↓52%)     32% (↓39%)      8.4% (↓50%)      23.9
WORDFILTER        46.9% (↓20%)    32.4% (↓34%)     21.9% (↓36%)       16.3% (↓<1%)    32.3% (↓38%)    14.7% (↓13%)     -

Table 5: Evaluation results on the challenging subset of REALTOXICITYPROMPTS, showing the percentage of toxic completions for 6 toxicity attributes, as well as language model perplexity (PPL).
Elbayad et al., 2020; Xin et al., 2020, 2021; Li et al., 2021; Schuster et al., 2021; Hou et al., 2020).
In this section, we test our hypothesis that dominant FFN sub-updates can signal a saturation event (§5.2), to create a simple and effective early exiting method that does not involve any external model training. For the experiments, we use WIKILM, where saturation events occur across all layers (statistics for WIKILM and GPT2 are in App. A.5).
Method. We devise a simple prediction rule based on a nearest-neighbours approach, using 10k validation examples from WIKITEXT-103. First, for every example, we map the top-10 dominant sub-updates at each layer to their corresponding clusters. Then, for every layer $\ell$, we split all the sets of clusters at that layer into two sets, $T^\ell$ and $N^\ell$, based on whether saturation occurred or not (e.g., $T^5$ stores all the sets that were active in a saturation event at layer 5). Given the top-10 clusters of an unseen example at some layer $\ell$, we consider a higher overlap with $T^\ell$ than with $N^{\ell'}, \forall \ell' > \ell$ as a signal for early exit. Thus, during inference, we propagate the input example through the layers, and compute at each layer $\ell$ the intersection size between its top-10 active clusters and each of $T^\ell$ and $N^{\ell'}, \forall \ell' > \ell$. If the average and maximal intersection with $T^\ell$ exceed those with $N^{\ell'}, \forall \ell' > \ell$, we halt the computation and declare early exiting.7
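The sketch below illustrates the exit test in plain Python; it omits the refinement described in footnote 7, uses made-up cluster sets, and assumes T[l] and N[l] hold the cluster-id sets recorded for layer l on the validation examples.

```python
def overlap_stats(active, recorded_sets):
    # Average and maximal intersection size between the active clusters and
    # each recorded set (0 if there are no recorded sets).
    sizes = [len(active & s) for s in recorded_sets] or [0]
    return sum(sizes) / len(sizes), max(sizes)

def should_exit(active, l, T, N, num_layers):
    mean_t, max_t = overlap_stats(active, T[l])
    later = [s for l2 in range(l + 1, num_layers) for s in N[l2]]
    mean_n, max_n = overlap_stats(active, later)
    # Exit when both the average and the maximal overlap with T[l]
    # exceed those with N[l'] for all later layers l'.
    return mean_t > mean_n and max_t > max_n

# Toy usage with made-up cluster sets.
T = {0: [{1, 2, 3}], 1: [{4, 5}]}
N = {0: [{7, 8}], 1: [{9}]}
print(should_exit({1, 2, 9}, 0, T, N, num_layers=2))
```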
Baselines. We train layer-wise binary classifiers over the representation and FFN updates $x^\ell$, $o^\ell$, and $\tilde{x}^\ell$, using logistic regression. As in our method, the labels are determined according to saturation events in the training data (see App. A.5). During inference, we execute the computation through the layers, and halt according to the layer classifier.

7 This is a simplification. We split $N^\ell$ by saturation layers and require a bigger intersection with $T^\ell$ at all the layers.
Method                                      Accuracy    Saved Layers
Binary classifiers using $x^\ell$           94.4±6.4    18.8% (3.0±0.4)
Binary classifiers using $o^\ell$           92.9±5.4    19.4% (3.1±0.3)
Binary classifiers using $\tilde{x}^\ell$   94.4±6.4    18.8% (3.0±0.4)
Sub-updates rule                            94.1±1.4    20.0% (3.2±0.3)

Table 6: Early exit evaluation results on WIKILM.
Evaluation. Each method is evaluated by accuracy, i.e., the portion of examples for which exiting at the predicted layer yields the final model prediction, and by computation efficiency, measured by the amount of saved layers for examples with correct prediction. We run each method with five random seeds and report the average scores.

Results. Tab. 6 shows that our prediction rule reaches a high accuracy of 94.1%, while saving 20% of computation on average without changing the prediction. Moreover, just by observing the dominant FFN sub-updates, it performs on par with the prediction rules relying on the representation and FFN output vectors. This demonstrates the utility of sub-updates for predicting saturation events, and further supports our hypothesis that FFN updates play a functional role in the prediction (§5.2).
7 Related Work
The lack of interpretability of modern LMs has led to a wide interest in understanding their prediction construction process. Previous works mostly focused on analyzing the evolution of hidden representations across layers (Voita et al., 2019), and probing the model with target tasks (Yang et al., 2020; Tenney et al., 2019; Clark et al., 2019; Saphra and Lopez, 2019). In contrast, our approach aims to interpret the model parameters and their utilization in the prediction process.

More recently, a surge of works have investigated the knowledge captured by the FFN layers (Da et al., 2021; Dai et al., 2022; Meng et al., 2022; Yao et al., 2022; Jiang et al., 2020; Wallat et al., 2020). These works show that the FFN layers store various types of knowledge, which can be located in specific neurons and edited. Unlike these works, we focus on the FFN outputs and their contribution to the prediction construction process.

Last, our interpretation of FFN outputs as updates to the output distribution relates to recent works that interpreted groups of LM parameters in the discrete vocabulary space (Geva et al., 2021; Khashabi et al., 2022), or viewed the representation as an information stream (Elhage et al., 2021).
8 Conclusions
Understanding the inner workings of transformers is valuable for explainability to end-users, for debugging predictions, for eliminating undesirable behavior, and for understanding the strengths and limitations of NLP models. The FFN is an understudied core component of transformer-based LMs, which we focus on in this work.

We study the FFN output as a linear combination of parameter vectors, termed values, and the mechanism by which these vectors update the token representations. We show that value vectors often encode human-interpretable concepts and that these concepts are promoted in the output distribution.

Our analysis of transformer-based LMs provides a more detailed understanding of their internal prediction process, and suggests new research directions for interpretability, control, and efficiency, at the level of individual vectors.
Limitations
Our study focused on the operation of FFN layers in building model predictions. Future work should further analyze the interplay between these layers and other components in the network, such as attention heads.

In our analysis, we decomposed the computation of FFN layers into smaller units, corresponding to single value vectors. However, it is possible that value vectors are compositional in the sense that combinations of them may produce new meanings. Still, we argue that analyzing individual value vectors is an important first step, since (a) the space of possible combinations is exponential, and (b) our analysis suggests that aggregation of value vectors is less interpretable than individual value vectors (§4.1). Thus, this approach opens new directions for interpreting the contribution of FFN layers to the prediction process in transformer LMs.

In addition, we chose to examine the broad family of decoder-based, auto-regressive LMs, which have been shown to be extremely effective for many NLP tasks, including few- and zero-shot tasks (Wang et al., 2022). While these models share the same building blocks of all transformer-based LMs, it will be valuable to ensure that our findings still hold for other models, such as encoder-only LMs (e.g. RoBERTa (Liu et al., 2019)) and models trained with different objective functions (e.g. masked language modeling (Devlin et al., 2019)).

Finally, our annotation effort was made for the evaluation of our hypothesis that sub-updates encode human-interpretable concepts. Scaling our annotation protocol would enable a more refined map of the concepts, knowledge and structure captured by LMs. Furthermore, since our concept interpretation approach relies on manual inspection of sets of tokens, its success might depend on the model's tokenization method. In this work, we analyzed models with two different commonly-used tokenizers, and future research could verify our method over other types of tokenizations as well.
Ethics Statement
Our work in understanding the role that single values play in the inference that transformer-based
LMs perform potentially improves their trans-
parency, while also providing useful control appli-
cations that save energy (early-exit prediction) and
increase model harmlessness (toxic language sup-
pression). It should be made clear that our method
for toxic language suppression only reduces the
probability of toxic language generation and does
not eliminate it. As such, this method (as well as
our early-exit method) should not be used in the
real world without further work and caution.
More broadly, our work suggests a general ap-
proach for modifying LM predictions in particular
directions, by changing the weights of FFN sub-
updates. While this is useful for mitigating biases,
it also has the potential for abuse. It should be
made clear that, as in the toxic language suppres-
sion application, our approach does not modify the
information encoded in LMs, but only changes the
intensity in which this information is exposed in the
model's predictions. Moreover, our work primar-
ily proposes an interpretation for FFN sub-updates,
which also could be used to identify abusive inter-
ventions. Regardless, we stress that LMs should
not be integrated into critical systems without cau-
tion and monitoring.
Acknowledgements
We thank Shauli Ravfogel, Tal Schuster, and
Jonathan Berant for helpful feedback and construc-
tive suggestions. This project has received funding
from the Computer Science Scholarship granted by
the Séphora Berrebi Foundation, the PBC fellow-
ship for outstanding PhD candidates in Data Sci-
ence, and the European Research Council (ERC)
under the European Union's Horizon 2020 research
and innovation programme, grant agreement No.
802774 (iEXTRACT).
References
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin-
ton. 2016. Layer normalization.arXiv preprint
arXiv:1607.06450.
Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In International Conference on Learning Representations (ICLR).
Emily M Bender, Timnit Gebru, Angelina McMillan-
Major, and Shmargaret Shmitchell. 2021. On the
dangers of stochastic parrots: Can language models
be too big? InProceedings of the ACM Confer-
ence on Fairness, Accountability, and Transparency
(FAccT).
Tom Brown, Benjamin Mann, Nick Ryder, Melanie
Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, Sandhini Agarwal, Ariel Herbert-
Voss, Gretchen Krueger, Tom Henighan, Rewon
Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu,
Clemens Winter, Chris Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,
Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei.
2020. Language models are few-shot learners. In
Advances in Neural Information Processing Systems
(NeurIPS).
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Jeff Da, Ronan Le Bras, Ximing Lu, Yejin Choi, and Antoine Bosselut. 2021. Analyzing commonsense emergence in few-shot knowledge models. In 3rd Conference on Automated Knowledge Base Construction.

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502, Dublin, Ireland. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL), pages 4171–4186, Minneapolis, Minnesota.
Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael
Auli. 2020. Depth-adaptive transformer. InInter-
national Conference on Learning Representations
(ICLR).
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom
Henighan, Nicholas Joseph, Ben Mann, Amanda
Askell, Yuntao Bai, Anna Chen, Tom Conerly,
Nova DasSarma, Dawn Drain, Deep Ganguli, Zac
Hatfield-Dodds, Danny Hernandez, Andy Jones,
Jackson Kernion, Liane Lovitt, Kamal Ndousse,
Dario Amodei, Tom Brown, Jack Clark, Jared Ka-
plan, Sam McCandlish, and Chris Olah. 2021. A
mathematical framework for transformer circuits.
Transformer Circuits Thread. Https://transformer-
circuits.pub/2021/framework/index.html.
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, Online. Association for Computational Linguistics.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun. 2016. Deep residual learning for image recog-
nition. InProceedings of the conference on com-
puter vision and pattern recognition (CVPR).
Dan Hendrycks and Kevin Gimpel. 2016. Gaus-
sian error linear units (gelus).arXiv preprint
arXiv:1606.08415.
Arthur E Hoerl and Robert W Kennard. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao
Chen, and Qun Liu. 2020. Dynabert: Dynamic bert
with adaptive width and depth.Advances in Neural
Information Processing Systems (NeurIPS).
Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.
Daniel Khashabi, Xinxi Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sean Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, Sameer Singh, and Yejin Choi. 2022. Prompt waywardness: The curious case of discretized interpretation of continuous prompts. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3631–3643, Seattle, United States. Association for Computational Linguistics.

Lei Li, Yankai Lin, Deli Chen, Shuhuai Ren, Peng Li, Jie Zhou, and Xu Sun. 2021. CascadeBERT: Accelerating inference of pre-trained language models via calibrated complete models cascade. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 475–486, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. 2019.
RoBERTa: A robustly optimized bert pretraining ap-
proach.arXiv preprint arXiv:1907.11692.
Kris McGuffie and Alex Newhouse. 2020. The radicalization risks of GPT-3 and advanced neural language models. arXiv preprint arXiv:2009.06807.
Kevin Meng, David Bau, Alex Andonian, and Yonatan
Belinkov. 2022. Locating and editing factual knowl-
edge in gpt.arXiv preprint arXiv:2202.05262.
Stephen Merity, Caiming Xiong, James Bradbury, and
Richard Socher. 2017. Pointer sentinel mixture mod-
els.International Conference on Learning Repre-
sentations (ICLR).
Daniel Müllner. 2011. Modern hierarchical, ag-
glomerative clustering algorithms.arXiv preprint
arXiv:1109.2378.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
Dario Amodei, Ilya Sutskever, et al. 2019. Lan-
guage models are unsupervised multitask learners.
OpenAI blog, 1(8):9.
Naomi Saphra and Adam Lopez. 2019. Understanding learning dynamics of language models with SVCCA. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3257–3267, Minneapolis, Minnesota. Association for Computational Linguistics.

Timo Schick, Sahana Udupa, and Hinrich Schütze. 2021. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP. Transactions of the Association for Computational Linguistics, 9:1408–1424.
Tal Schuster, Adam Fisch, Tommi Jaakkola, and Regina Barzilay. 2021. Consistent accelerated inference via confident adaptive transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4962–4979, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. 2020a. Green AI. Communications of the ACM, 63(12):54–63.

Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, and Noah A. Smith. 2020b. The right tool for the job: Matching model and instance complexities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6640–6651, Online. Association for Computational Linguistics.
Noam M. Shazeer. 2020. Glu variants improve trans-
former.ArXiv, abs/2002.05202.
S. Sukhbaatar, J. Weston, and R. Fergus. 2015. End-
to-end memory networks. InAdvances in Neural
Information Processing Systems (NIPS).
Sainbayar Sukhbaatar, Edouard Grave, Guillaume
Lample, Herve Jegou, and Armand Joulin. 2019.
Augmenting self-attention with persistent memory.
arXiv preprint arXiv:1907.01470.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 5998–6008.
Elena Voita, Rico Sennrich, and Ivan Titov. 2019. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4396–4406, Hong Kong, China. Association for Computational Linguistics.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162, Hong Kong, China. Association for Computational Linguistics.
Jonas Wallat, Jaspreet Singh, and Avishek Anand. 2020. BERTnesia: Investigating the capture and forgetting of knowledge in BERT. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 174–183, Online. Association for Computational Linguistics.

Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. 2022. What language model architecture and pretraining objective works best for zero-shot generalization? In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 22964–22984. PMLR.

Ji Xin, Rodrigo Nogueira, Yaoliang Yu, and Jimmy Lin. 2020. Early exiting BERT for efficient document ranking. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 83–88, Online. Association for Computational Linguistics.
Ji Xin, Raphael Tang, Yaoliang Yu, and Jimmy Lin. 2021. BERxiT: Early exiting for BERT with better fine-tuning and extension to regression. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 91–104, Online. Association for Computational Linguistics.
Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou,
and Lei Li. 2021. A survey on green deep learning.
arXiv preprint arXiv:2111.05193.
Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, and Zhaopeng Tu. 2020. On the sub-layer functionalities of transformer decoder. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4799–4811, Online. Association for Computational Linguistics.
Yunzhi Yao, Shaohan Huang, Ningyu Zhang, Li Dong,
Furu Wei, and Huajun Chen. 2022. Kformer:
Knowledge injection in transformer feed-forward
layers.arXiv preprint arXiv:2201.05742.
Figure 4: Similarity of projections to $E$, of GPT2 value vectors with and without layer normalization, and of value vectors and randomly-initialized vectors. We measure similarity of the top-30 tokens in each projection with IoU.
A Appendix
A.1 Value Vectors Projection Method
Our interpretation method of sub-updates is based on directly projecting value vectors to the embedding matrix, i.e. for a value $v$ and embedding matrix $E$, we calculate $Ev$ (§4). However, in some LMs like GPT2, value vectors in each layer are added to the token representation followed by a layer normalization (LN) (Ba et al., 2016). This raises the question whether reading vectors that are normalized in the same manner as the representation would yield different concepts.

To test that, we compare the top-30 scoring tokens by $E v_i^\ell$ and by $E\,\mathrm{LayerNorm}(v_i^\ell)$, for $i = 1, \ldots, d_m$ and $\ell = 1, \ldots, L$, using Intersection over Union (IoU). As a baseline, we also compare $E v_i^\ell$ with random vectors, initialized from a normal distribution with the empirical mean and standard deviation of the value vectors. Fig. 4 shows that LN does not change the projection substantially, with an overlap of 64.5% of the top-30 tokens on average, suggesting that the same concepts are promoted in both cases. This is in contrast to random values, which produce a 0% overlap on average.
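A toy version of this comparison (random tensors, a freshly initialized LayerNorm, and IoU over top-30 token ids; purely illustrative):

```python
import torch

def top_k_ids(scores, k=30):
    return set(scores.topk(k).indices.tolist())

def iou(a, b):
    return len(a & b) / len(a | b)

# Compare the projections of one value vector with and without LN.
vocab_size, d = 50257, 768
E = torch.randn(vocab_size, d)   # stand-in embedding matrix
v = torch.randn(d)               # stand-in value vector
ln = torch.nn.LayerNorm(d)
with torch.no_grad():
    plain = top_k_ids(E @ v)
    normed = top_k_ids(E @ ln(v))
print(iou(plain, normed))
```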
A.2 Concepts Annotation
We analyze the concepts encoded in sub-updates, by projecting their corresponding value vectors to the embedding matrix and identifying repeating patterns in the top-scoring 30 tokens (§3). Pattern identification was performed by experts (NLP graduate students), following the instructions presented in Fig. 5. Please note these are the instructions provided for annotations of WIKILM, which uses word-level tokenization. Thus, the terms words and tokens are equivalent in this case.

For value vectors in WIKILM, which uses a word-level vocabulary with many uncommon words, we additionally attached a short description field for each token that provides context about the meaning of the word. For the description of a token $w$, we first try to extract the definition of $w$ from WordNet.8 If $w$ does not exist in WordNet, as often happens for names of people and places, we then search for $w$ in Wikipedia9 and extract a short (possibly noisy) description if the query was successful. A complete annotation example is shown in Tab. 7.
A.3 Sub-Update Contribution in FFN Outputs

In this section, we justify our choice along the paper of looking at the top-10 dominant sub-updates. The contribution of a sub-update $m_i^\ell v_i^\ell$ to the FFN output is:
$$\mathrm{contrib}(m_i^\ell v_i^\ell) := \frac{|m_i^\ell|\,\|v_i^\ell\|}{\sum_{j=1}^{d_m} |m_j^\ell|\,\|v_j^\ell\|},$$
namely, its relative weight compared to the overall sum of weights of all sub-updates. The overall contribution of the top-10 dominant sub-updates is computed by summing their contributions. Note that we take the absolute value of the coefficients $|m_i^\ell|$, since some activation functions (e.g. GeLU (Hendrycks and Gimpel, 2016) in GPT2) can result in negative values of $m_i^\ell$.

Empirically, we observe that in some cases sub-updates with negative coefficients do appear as part of the 10 most dominant sub-updates in GPT2. We further attribute this to the success of GeLU in transformer models (Shazeer, 2020), as it increases the expressiveness of the model by allowing reversing the scores value vectors induce over the vocabulary.
Fig. 6 shows the contribution of the 10 most dominant sub-updates per layer for WIKILM and GPT2, using 2000 random examples from the WIKITEXT-103 validation set. Clearly, for all the layers, the contribution of the dominant sub-updates exceeds the contribution of random sub-updates. Observe that, even though they cover only 0.24% of the value vectors, the contribution of dominant sub-updates is typically around 5%, and in some layers (e.g. layers 8-16 in WIKILM and layer 1 in GPT2) it reaches over 10% of the total contribution.

8 We use the NLTK python package.
9 Using the wptools package: https://pypi.org/project/wptools/.
In this task, you are given a list of 30 words in English, and the goal is to identify repetitive patterns occurring in the words.

Patterns can be semantic (e.g. animals, 3-digit numbers, names of Indian actors, and time-related words) or syntactic (e.g. connectives, plurals, words starting with "dis-", and verbs in present progressive tense). You should only count patterns that occur in at least 4 words (i.e. if you notice a pattern that occurs only in 3 words, then please ignore it).

To complete the task, please do the following:
1. Give an ID to every identified pattern (1, 2, ...)
2. Assign a pattern ID to every word in the list, or -1/leave empty if no pattern applies to the word.
3. For every identified pattern specify whether the pattern is semantic or syntactic and (optional) write a short description of the pattern.

Please note that some of the words might be uncommon words that you are not familiar with. In such cases, you will need to do a quick search over the Web to understand the meaning of words.

Figure 5: Annotation instructions for the concepts identification task.
This demonstrates that analyzing the top-10 dominant sub-updates can shed light on the way predictions are built through the layers.
A.4 Toxic Language Suppression Details
The 10 manually selected value vectors were found by searching for non-toxic words, such as "safe" and "peace", among the top-30 tokens in the vector projections to the vocabulary. We selected a small set of 10 value vectors whose top-scoring tokens were coherent and seemed to promote different kinds of non-toxic tokens. The list of manually picked vectors is provided in Tab. 8. Importantly, the search process of all vectors was a one-time effort that took <5 minutes in total. We chose the value vectors in a greedy manner, without additional attempts to optimize our choice.

To select 10 non-toxic value vectors based on an automatic toxicity metric, we used the Perspective API. Concretely, we concatenated the top-30 tokens by each value vector and graded the resulting text with the toxicity score produced by the API. Then, we sampled 10 random vectors with a toxicity score <0.1 (a score of <0.5 indicates a non-toxic text).
A.5 Early Exit Details
This section provides further details and analysis
regarding our early exit method and the baselines
we implemented.
Method Implementation. We consider 90% of the 10k examples for constructing $T^\ell$ and $N^\ell$, and the remaining 10% of the examples are considered as the testing set. We used $k = 2e2$ to cluster the top-10 dominant value vectors, but observed that other $k$ values yielded similar results.
Baselines' Implementation. We train each binary classifier using 8k training examples, based on the standardized forms of each feature vector. We considered a hyperparameter sweep, using 8-fold cross-validation, with $l_2$ or $l_1$ regularization (lasso (Tibshirani, 1996) or ridge (Hoerl and Kennard, 1970)), regularization coefficients $C \in \{1e{-3}, 1e{-2}, 1e{-1}, 1, 1e1, 1e2, 1e3\}$, and took the best performing model for each layer. We also used an inversely proportional loss coefficient according to the class frequencies.

In order to achieve high accuracy, we further calibrate a threshold per classifier for reaching the maximal F1 score for each layer. This calibration is done after training each classifier, over a set of 1000 validation examples.
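One way to realize such a layer-wise baseline with scikit-learn, sketched on random stand-in features (the exact training setup may differ from the authors'):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Standardized features (x^l, o^l, or x_tilde^l), an l1/l2 sweep over C, and
# class weights inversely proportional to class frequency. Data is a toy stand-in.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 256))   # feature vectors for one layer (toy sizes)
y = rng.integers(0, 2, size=2000)      # 1 iff saturation occurred at this layer

X = StandardScaler().fit_transform(X)
grid = {"penalty": ["l1", "l2"],
        "C": [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3]}
clf = GridSearchCV(
    LogisticRegression(solver="liblinear", class_weight="balanced", max_iter=1000),
    grid, cv=8)
clf.fit(X, y)
print(clf.best_params_)
```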
Frequency of Saturation Events. We investigate the potential of performing early exit for WIKILM and GPT2. Tab. 9 and Tab. 10 show the percentage of saturation events per layer, considering 10k examples from the WIKITEXT-103 validation set, for WIKILM and GPT2, respectively. In GPT2, 34.15% of the examples require the full computation using all the model layers, while for WIKILM, this holds for only 15.22% of the examples. Notably, early fixation events in GPT2 are less common than in WIKILM, possibly due to the larger number of layers the prediction construction is spread over. Hence, we use WIKILM for our experiments, as it has significantly higher computation saving potential, as well as more saturation events per layer.
pattern   word   description
1 front the side that is forward or prominent
1 ahead having the leading position or higher score in a contest
1 forward the person who plays the position of forward in certain games, such as basketball, soccer, or
hockey
1 preceded be earlier in time; go back further
1 Before earlier in time; previously
1 before earlier in time; previously
1 rear the back of a military formation or procession
1 fore front part of a vessel or aircraft
2 Name a language unit by which a person or thing is known
1 Past the time that has elapsed
1 prior the head of a religious order; in an abbey the prior is next below the abbot
1 anterior a tooth situated at the front of the mouth
1 upperparts standard terms for unambiguous description of relative placement of body parts
1 lead an advantage held by a competitor in a race
1 backwards at or to or toward the back or rear
1 aft (nautical, aeronautical) situated at or toward the stern or tail
1 preceding be earlier in time; go back further
1 upstream in the direction against a stream's current
hind any of several mostly spotted shes that resemble groupers
1 posterior the eshy part of the human body that you sit on
Etymology a history of a word
1 Pre Wikimedia disambiguation page
chin the protruding part of the lower jaw
1 north the region of the United States lying to the north of the Mason-Dixon line
1 east the cardinal compass point that is at 90 degrees
2 surname the name used to identify the members of a family (as distinguished from each member's
given name)
1 Then that time; that moment
2 name a language unit by which a person or thing is known
1 northbound moving toward the north
1 leading thin strip of metal used to separate lines of type in printing
pattern iddescription
(optional)
semantic/syntactic
1 positions/
directions
semantic
2 naming semantic
Table 7: An example annotation spreadsheet of the top-tokens by the value vector u^6_1090 in WIKILM.

Figure 6: Relative contribution to the FFN output of the 10 most dominant and 10 random sub-updates in each layer, of WIKILM (left) and GPT2 (right).
Value      Top-10 Tokens
v^14_1853  transparency, disclosure, clearer, parency, iquette, humility, modesty, disclosures, accountability, safer
v^15_73    respectful, honorable, healthy, decent, fair, erning, neutral, peacefully, respected, reconc
v^15_1395  safe, neither, safer, course, safety, safe, Safe, apologize, Compact, cart
v^16_216   refere, Messages, promises, Relations, accept, acceptance, Accept, assertions, persistence, warn
v^17_462   should, should, MUST, ought, wisely, Should, SHOULD, safely, shouldn, urgently
v^17_3209  peaceful, stable, healthy, calm, trustworthy, impartial, stability, credibility, respected, peace
v^17_4061  Proper, proper, moder, properly, wisely, decency, correct, corrected, restraint, professionalism
v^18_2921  thank, THANK, thanks, thank, Thank, apologies, Thank, thanks, Thanks, apologise
v^19_1891  thanks, thank, Thanks, thanks, THANK, Thanks, Thank, Thank, thank, congratulations
v^23_3770  free, fit, legal, und, Free, leg, pless, sound, qualified, Free

Table 8: The 10 manually picked value vectors used for toxic language suppression and the top-10 tokens in their projection to the vocabulary. Repetitions in the projections are a result of special characters not being shown. These vectors were found by manually searching for non-toxic words such as safe and peace in the projections to the vocabulary.
Layer % Examples Layer % Examples
1 6.70 9 2.96
2 5.25 10 3.78
3 13.74 11 4.74
4 3.13 12 7.45
5 1.02 13 10.79
6 1.07 14 9.88
7 1.86 15 9.81
8 2.60 16 15.22
Table 9: The percentage of saturation events per layer
using WIKILM, for the WIKITEXT-103 validation set.
Layer % Examples Layer % Examples
1 2.21 13 1.24
2 0.77 14 1.62
3 1.06 15 2.37
4 0.74 16 2.72
5 0.85 17 2.99
6 0.83 18 3.80
7 0.83 19 4.15
8 0.72 20 5.21
9 0.93 21 5.67
10 0.99 22 9.31
11 1.16 23 14.52
12 1.32 24 34.15
Table 10: The percentage of saturation events per layer
using GPT2, for the WIKITEXT-103 validation set.
A.4 Toxic Language Suppression Details

The 10 manually selected value vectors were found
by searching for non-toxic words, such as safe
and peace, among the top-30 tokens in the vec-
tors' projections to the vocabulary. We selected a
small set of 10 value vectors whose top-scoring
tokens were coherent and seemed to promote differ-
ent kinds of non-toxic tokens. The list of manually
picked vectors is provided in Tab. 8. Importantly,
the search process for all vectors was a one-time
effort that took < 5 minutes in total. We chose
the value vectors in a greedy manner, without addi-
tional attempts to optimize our choice.
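As an illustration, projecting a value vector to the vocabulary and listing its top tokens can be done along the following lines; here E stands for the output embedding matrix and tokenizer for a matching tokenizer, and the snippet is a sketch rather than our exact search tool.

```python
import torch


def top_tokens(v, E, tokenizer, k=30):
    """Top-k vocabulary tokens in the projection of a value vector v.

    v: a single value vector, shape (d,)
    E: output embedding matrix, shape (vocab_size, d)
    """
    scores = E @ v                                   # logits over the vocabulary
    ids = torch.topk(scores, k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(ids)

# Manual search: print top_tokens(v, E, tokenizer) for candidate vectors and keep those
# whose projections contain coherent non-toxic words such as "safe" or "peace".
```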
To select 10 non-toxic value vectors based on an
automatic toxicity metric, we used the Perspective
API. Concretely, we concatenated the top-30 tokens
of each value vector and graded the resulting text
with the toxicity score produced by the API. Then,
we sampled 10 random vectors with a toxicity score
< 0.1 (a score of < 0.5 indicates non-toxic text).
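A sketch of this selection procedure is given below; toxicity_score stands in for a call to the Perspective API (any callable mapping a string to a score in [0, 1]), and value_vectors is an assumed mapping from vector names to vectors.

```python
import random


def select_nontoxic_vectors(value_vectors, E, tokenizer, toxicity_score, n=10):
    """Sample n value vectors whose top-30 token projections are scored as non-toxic."""
    candidates = []
    for name, v in value_vectors.items():
        text = " ".join(top_tokens(v, E, tokenizer, k=30))  # see sketch above
        if toxicity_score(text) < 0.1:                      # non-toxic threshold used here
            candidates.append(name)
    return random.sample(candidates, n)
```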
A.5 Early Exit Details

This section provides further details and analysis
regarding our early exit method and the baselines
we implemented.

Method Implementation. We consider 90% of
the 10k examples for constructing $T_\ell$ and $N_\ell$, and
the remaining 10% of the examples are used as the
test set. We used k = 2e2 (i.e., 200 clusters) to cluster the top-10
dominant value vectors, but observed that other k
values yielded similar results.
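The clustering step alone can be sketched as follows; how $T_\ell$ and $N_\ell$ are then derived from the clusters follows the main text and is omitted, and dominant_vectors is a placeholder for the value vectors that appear among the top-10 dominant sub-updates in the 90% split.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder standing in for the collected dominant value vectors, shape (n_vectors, d).
rng = np.random.default_rng(0)
dominant_vectors = rng.standard_normal((2000, 256))

# k = 2e2 clusters, as reported above.
kmeans = KMeans(n_clusters=200, random_state=0, n_init=10).fit(dominant_vectors)
cluster_ids = kmeans.labels_   # cluster assignment of each dominant value vector
```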
Baselines' Implementation. We train each bi-
nary classifier using 8k training examples, based
on the standardized forms of each feature vec-
tor. We considered a hyperparameter sweep, us-
ing 8-fold cross-validation, with $\ell_1$ or $\ell_2$ regu-
larization (lasso (Tibshirani, 1996) or ridge (Ho-
erl and Kennard, 1970), respectively), regularization coef-
ficients C ∈ {1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3},
and took the best performing model for each layer.
We also used loss coefficients that are inversely
proportional to the class frequencies.
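This sweep corresponds roughly to the following scikit-learn setup; using LogisticRegression as the binary classifier is our reading of the setting above rather than a guaranteed match to our implementation, and X, y are placeholders for one layer's 8k feature vectors and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for one layer's 8k feature vectors and binary labels.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((8000, 64)), rng.integers(0, 2, 8000)

param_grid = {
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__C": [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3],
}
pipe = make_pipeline(
    StandardScaler(),                            # standardized feature vectors
    LogisticRegression(solver="liblinear",       # supports both l1 and l2 penalties
                       class_weight="balanced",  # inverse class-frequency loss weights
                       max_iter=1000),
)
clf = GridSearchCV(pipe, param_grid, cv=8).fit(X, y)   # 8-fold cross-validation
```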
In order to achieve high accuracy, we further
calibrate a threshold per classifier for reaching the
maximal F1 score for each layer. This calibration
is done after training each classifier, over a set of
1000 validation examples.
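The calibration can be sketched as below, reusing clf from the previous snippet; X_val and y_val are placeholders for the 1000 held-out validation examples.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholder validation split (in practice, the 1,000 held-out validation examples).
rng = np.random.default_rng(1)
X_val, y_val = rng.standard_normal((1000, 64)), rng.integers(0, 2, 1000)

probs = clf.predict_proba(X_val)[:, 1]                       # positive-class probabilities
precision, recall, thresholds = precision_recall_curve(y_val, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)   # F1 at each candidate threshold
best_threshold = thresholds[np.argmax(f1[:-1])]              # last P/R point has no threshold
```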
Frequency of Saturation Events. We investi-
gate the potential of performing early exit for WIK-
ILM and GPT2. Tab. 9 and Tab. 10 report the percentage
of saturation events per layer, considering 10k ex-
amples from the WIKITEXT-103 validation set,
for WIKILM and GPT2, respectively. In GPT2,
34.15% of the examples require the full compu-
tation using all the model layers, while for WIK-
ILM, this holds for only 15.22% of the examples.
Notably, early fixation events in GPT2 are less
common than in WIKILM, possibly due to the
larger number of layers over which the prediction
construction is spread. Hence, we use WIKILM for our
experiments, as it has significantly higher compu-
tation saving potential, as well as more saturation
events per layer.
pattern | word | description
1 | front | the side that is forward or prominent
1 | ahead | having the leading position or higher score in a contest
1 | forward | the person who plays the position of forward in certain games, such as basketball, soccer, or hockey
1 | preceded | be earlier in time; go back further
1 | Before | earlier in time; previously
1 | before | earlier in time; previously
1 | rear | the back of a military formation or procession
1 | fore | front part of a vessel or aircraft
2 | Name | a language unit by which a person or thing is known
1 | Past | the time that has elapsed
1 | prior | the head of a religious order; in an abbey the prior is next below the abbot
1 | anterior | a tooth situated at the front of the mouth
1 | upperparts | standard terms for unambiguous description of relative placement of body parts
1 | lead | an advantage held by a competitor in a race
1 | backwards | at or to or toward the back or rear
1 | aft | (nautical, aeronautical) situated at or toward the stern or tail
1 | preceding | be earlier in time; go back further
1 | upstream | in the direction against a stream's current
  | hind | any of several mostly spotted fishes that resemble groupers
1 | posterior | the fleshy part of the human body that you sit on
  | Etymology | a history of a word
1 | Pre | Wikimedia disambiguation page
  | chin | the protruding part of the lower jaw
1 | north | the region of the United States lying to the north of the Mason-Dixon line
1 | east | the cardinal compass point that is at 90 degrees
2 | surname | the name used to identify the members of a family (as distinguished from each member's given name)
1 | Then | that time; that moment
2 | name | a language unit by which a person or thing is known
1 | northbound | moving toward the north
1 | leading | thin strip of metal used to separate lines of type in printing

pattern id | description (optional) | semantic/syntactic
1 | positions/directions | semantic
2 | naming | semantic

Table 7: An example annotation spreadsheet of the top-tokens by the value vector $v_{1090}^{6}$ in WIKILM.
Figure 6: Relative contribution to the FFN output of the 10 most dominant and 10 random sub-updates in each
layer, of WIKILM (left) and GPT2 (right).
Value | Top-10 tokens
$v_{1853}^{14}$ | transparency, disclosure, clearer, parency, iquette, humility, modesty, disclosures, accountability, safer
$v_{73}^{15}$ | respectful, honorable, healthy, decent, fair, erning, neutral, peacefully, respected, reconc
$v_{1395}^{15}$ | safe, neither, safer, course, safety, safe, Safe, apologize, Compact, cart
$v_{216}^{16}$ | refere, Messages, promises, Relations, accept, acceptance, Accept, assertions, persistence, warn
$v_{462}^{17}$ | should, should, MUST, ought, wisely, Should, SHOULD, safely, shouldn, urgently
$v_{3209}^{17}$ | peaceful, stable, healthy, calm, trustworthy, impartial, stability, credibility, respected, peace
$v_{4061}^{17}$ | Proper, proper, moder, properly, wisely, decency, correct, corrected, restraint, professionalism
$v_{2921}^{18}$ | thank, THANK, thanks, thank, Thank, apologies, Thank, thanks, Thanks, apologise
$v_{1891}^{19}$ | thanks, thank, Thanks, thanks, THANK, Thanks, Thank, Thank, thank, congratulations
$v_{3770}^{23}$ | free, fit, legal, und, Free, leg, pless, sound, qualified, Free

Table 8: The 10 manually picked value vectors used for toxic language suppression and the top-10 tokens in their
projection to the vocabulary. Repetitions in the projections are a result of special characters not being shown. These
vectors were found by manually searching for non-toxic words such as safe and peace in the projections to
the vocabulary.
Layer % Examples Layer % Examples
1 6.70 9 2.96
2 5.25 10 3.78
3 13.74 11 4.74
4 3.13 12 7.45
5 1.02 13 10.79
6 1.07 14 9.88
7 1.86 15 9.81
8 2.60 16 15.22
Table 9: The percentage of saturation events per layer
using WIKILM, for the WIKITEXT-103 validation set.
Layer % Examples Layer % Examples
1 2.21 13 1.24
2 0.77 14 1.62
3 1.06 15 2.37
4 0.74 16 2.72
5 0.85 17 2.99
6 0.83 18 3.80
7 0.83 19 4.15
8 0.72 20 5.21
9 0.93 21 5.67
10 0.99 22 9.31
11 1.16 23 14.52
12 1.32 24 34.15
Table 10: The percentage of saturation events per layer
using GPT2, for the WIKITEXT-103 validation set.