Extracted Text

2306.03819.pdf

LEACE: Perfect linear concept erasure in closed form
Nora Belrose
1
David Schneider-Joseph
1
Shauli Ravfogel
2
Ryan Cotterell
3
Edward Raff
4
Stella Biderman
1,4
1
EleutherAI
2
Bar-Ilan University
3
ETH Zürich
4
Booz Allen Hamilton
{nora,stella}@eleuther.ai david@davidsj.com
Abstract
Concept erasure aims to remove specified features from a representation. It can
improve fairness (e.g. preventing a classifier from using gender or race) and
interpretability (e.g. removing a concept to observe changes in model behavior). We
introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which
provably prevents all linear classifiers from detecting a concept while changing the
representation as little as possible, as measured by a broad class of norms. We apply
LEACE to large language models with a novel procedure called concept scrubbing,
which erases target concept information fromeverylayer in the network. We
demonstrate our method on two tasks: measuring the reliance of language models
on part-of-speech information, and reducing gender bias in BERT embeddings. Our
code is available athttps://github.com/EleutherAI/concept-erasure.
1 Introduction
The ability to prevent a machine learning system from using a specified concept is important for
fairness and interpretability. Popular notions of fairness require that protected attributes should not
causally affect predictions [22,26], and interpretability research often estimates the causal effect of a
concept by attempting to remove it from a model’s internal representations [10, 30, 25, 5, 18].
What it means for a modelMto “use” a conceptZis often vague and application-specific, but
a necessary condition is that its outputs—and therefore its inputs and hidden states—should have
significantmutual informationwithZ.
1
Concept erasureleverages this fact to limitM’s use ofZ
withoutfinetuning or inspecting its parameters. Instead, we edit the input or hidden statesXused by
Mto minimize the predictiveV-informationIV(X→Z) [43], a tractable lower bound on the mutual
informationI(X; Z)which measures the degree to which classifiers from the familyVcan predict
Z. Intuitively, if no classifier inVcan outperform a constant function at predictingZ—a condition
known asguardedness—thenMcan’t useZeither, at least ifVis expressive enough relative toM.
In this work, we improve upon existing concept erasure techniques using a theory-driven approach.
We focus on the case whereVis the set of linear classifiers, and prove a previously unnoticed
equivalence: a classification task is linearly guardedif and only ifevery class has exactly the same
mean feature vector (§ 3). Leveraging this equivalence, we derive a simple necessary and sufficient
condition for an affine transformation to produce linearly guarded features. We then identify the
uniquesurgicaltransformation in this family—the one that minimizes the mean squared distance
from the original features with respect toallnorms induced by inner products, including the popular
Euclidean and Mahalanobis norms. We name itLEAst-squares Concept Erasure (LEACE)(§ 4).
While prior work has focused on preventing linear models from leveragingZ, we aim to erase
concepts from deep neural networks as well. Interpretability research has shown that networks
1
This follows from the fact that causal dependence is a special kind of statistical dependence [28]. By the
data processing inequality,M’s output can’t have any more information aboutZthan its input or hidden states.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).

can be usefully described as encoding features in linear subspaces [11,24,41], suggesting that
fundamentally nonlinear methods may not be necessary for successful erasure in DNNs. In light of
this, we introduce a simple procedure calledconcept scrubbing(§ 6), which sequentially applies
LEACE to the intermediate representations at each layer of a deep network.
We empirically validate our proposals, demonstrating the superiority of LEACE for erasing gender
bias from BERT representations (§ 5.2), and using concept scrubbing to measure the extent to which
large language models use part-of-speech information (§ 6).
2 Preliminaries
Consider ak-class classification task over jointly defined random vectorsX(the input data) andZ
(the one-hot labels), withXof finite first moment and taking values inR
d, andZtaking values in
Z={z∈ {0,1}
k

∥z∥1= 1}
2 with eachP(Z =j)>0 . Letη(·;θ) :R
d
→R
k be a predictor
chosen from a function classV={η(·;θ)|θ∈Θ} (presumed to contain all constant functions) so
as to minimize the expectationE
fi
L(η(X),Z)
fl of someL:R
k
× Z →[0,∞) in a classLof loss
functions.
We borrow the concept ofguardednessfrom Ravfogel et al.[33], who define it in terms ofV-
information [43]. We opt for a slightly more general definition here, which is equivalent to theirs in
the case of cross-entropy loss (see Appendix G).
Definition 2.1(Guardedness).LetX,Z,V, andLbe as defined above, and letχbe the set of all
random vectors of finite first moment taking values inR
d
, jointly defined withZ.
We sayX (V,L)−guardsZif, for all lossesL ∈L, it maximizes the minimum expected loss:
X∈argmax
X
′
∈χ
inf
θ∈Θ
E
h
L(η(X
′
;θ),Z)
i
.
In other words, its conditional distributionP(X|Z =·) is among the worst possible distributions for
predictingZfromXusing a predictor of the formη(·;θ)∈ Vand a loss function inL.
Definition 2.2(Trivially Attainable Loss).Thetrivially attainable lossfor labelsZand lossLis the
lowest possible expected loss available to a constant predictorη(x) =b:
Lτ= inf
b∈R
k
E[L(b,Z)]
We will sometimes write itL
(Z,L)
τ in cases of possible ambiguity. If there is a specific constant
predictor actually achieving this loss, we call it thetrivial predictorητ=η
(Z,L)
τ.
We examine this problem in the important case of loss functionsL:R
k
× Z →[0,∞) which are
convex in the predictionη(x), and linear predictors that take the functional formη(x;b,W) =
b+Wx, for some biasb∈R
k
and weight matrixW∈R
k×d
.
Definition 2.3(Linear Guardedness).IfX (V,L) -guardsZ, whereLis the class of nonnegative
loss functions which are convex in their first argument, andVis the class of linear predictors
η(x) =b+Wx, we say thatXlinearly guardsZ.
3 Theoretical Results
Our primary theoretical result is that the following conditions are all equivalent:
1. Xlinearly guards the labelsZ. (Definition 2.3)
2. L, the trivially attainable loss is optimal on(X,Z). (Definition 2.2)
3.
The class-conditional mean vectorsE[X|Z =i] are equal to the unconditional meanE[X].
4. Xhas zero covariance with every component ofZ.
5. Xexhibits statistical parity w.r.t.Z. (App. C)
The equivalence of conditions 1, 2, and 5 is relatively straightforward to show, and the relevant
theorems can be found in Appendices B and C. The other equivalences are proven below (cond. 3↔
cond. 2 in § 3.1 and § 3.2); cond. 3↔4 in § 3.3).
2
We frequently use the integerj≤k to refer to the element ofZwhich is1at thej
thindex and0elsewhere.
2

3.1 Equality of Class Centroids Implies Linear Guardedness
The following result establishes the implication from condition 3 to condition 2.
Theorem 3.1.SupposeLis convex in the linear predictionη. Then if each class-conditional mean
E
fi
X|Z =i
fl
is equal toE
fi
X
fl
, the trivially attainable loss cannot be improved upon.
Proof.
Letη(x) =b+Wx be any linear predictor. By Jensen’s inequality,
3
the loss withηevaluated
onXis lower bounded by the loss withηevaluated on the unconditional mean of the dataE
fi
X
fl
:
E
h
L(η,Z)
i
=EZ
h
E
h
L(η,Z)

Z
ii
≥EZ
h
L
ˇ
E
fi
η

Z
fl
,Z
ıi
(Jensen’s inequality)
=EZ
h
L
ˇ
b+WE
fi
X

Z
fl
,Z
ıi
(linearity ofη)
=EZ
h
L
ˇ
b+WE
fi
X
fl
,Z
ıi
. (by assumption)
This in turn is the loss of the constant predictorη
′
(x) =b+WE
fi
X
fl . Since the trivially attainable
loss is the best that can be achieved by a constant predictor, andeverypredictor’s loss is lower
bounded by that of some constant predictor, we cannot improve upon the trivially attainable loss.
Intuitively, this shows that the classifier’s expected loss is lower-bounded by the loss it would receive
if each data point were replaced with the centroid of its class. But, if these centroids are all equal, the
loss can’t be any lower than what we’d get if every data point were replaced with theglobalmean
E[X]. In that case, the data points are indistinguishable and we can’t do better thanW=0.
3.2 Linear Guardedness Implies Equality of Class Centroids
We now prove the implication from condition 2 to condition 3. Condition 2 applies when the trivially
attainable loss is optimal forallconvex losses, including cross-entropy loss in particular. And if it
holds for cross-entropy loss, we now show that condition 3—the class centroids are equal—must
follow. First a more general lemma:
Lemma 3.2.SupposeLhas bounded partial derivatives, which when off-category never vanish and
do not depend on the category, i.e.∂L(η, z1)/∂ηi=∂L(η, z2)/∂ηi̸= 0 for all categoriesz1, z2̸=i .
IfE
fi
L(η,Z)
fl is minimized among linear predictors by the constant predictorη(x) =b
∗
+W
∗
x
withW
∗
=0, then each class-conditional meanE
fi
X|Z =i
fl
is equal toE
fi
X
fl
.
Proof.The first-order optimality condition on thei
th
component of our parametersbandWyields
the equations:
E
"
∂L(η,Z)
∂ηi
·
∂ηi
∂bi
#
= 0andE
"
∂L(η,Z)
∂ηi
·
∂ηi
∂Wi
#
=0, (1)
where we have used the boundedness ofL’s partial derivative and the finite first moment of
∂ηi
∂bi
= 1
and
∂ηi
∂Wi
= X
to justify (via the Dominated Convergence Theorem) interchanging the derivative with
the expectation.
Sinceηis constant over all values ofX, and
∂ηi
∂bi
= 1, the first equation in (1) reduces to:
P(Z =i)
∂L(η, i)
∂ηi
+P(Z̸=i)
∂L(η,̸=i)
∂ηi
= 0, (2)
where
∂L(η,̸=i)
∂ηi
is an abuse of notation denoting the off-category partial derivative, emphasizing its
independence of the categoryZ.
3
Specifically, its generalization to convex functions overR
k
. See [12] p. 76.
3

Similarly, the constancy ofηand the fact that
∂ηi
∂Wi
= Xreduces the second equation in (1) to:
P(Z =i)
∂L(η, i)
∂ηi
·E
fi
X

Z =i
fl
+P(Z̸=i)
∂L(η,̸=i)
∂ηi
·E
fi
X

Z̸=i
fl
=0. (3)
Solving forP(Z =i)
∂L(η,i)
∂ηi
in (2) and substituting in (3) gives us:
P(Z̸=i)
∂L(η,̸=i)
∂ηi
·

E
fi
X

Z̸=i
fl
−E
fi
X

Z =i
fl
!
=0.
IfP(Z̸=i) = 0 , thenE[X] =E[X|Z =i] is trivially true. Otherwise, using the non-vanishingness
of the off-category partial derivative
∂L(η,̸=i)
∂ηi
, division yields the equivalence ofE
fi
X

Z =i
fl to
E
fi
X

Z̸=i
fl
, and hence to the unconditional meanE
fi
X
fl
.
We now show that Lemma 3.2 applies to the widely used cross entropy loss:
Theorem 3.3.If the class probabilitiesP(Z =j) are all nonzero, and the trivially obtainable loss is
optimal whenL(η, z) =−log
exp(ηz)
P
k
i=1
exp(ηi)
, then each class has the same meanE
fi
X

Z =z
fl
.
Proof.
In this case, the trivial predictorητ(Z)
j
= log(P(Z =j)) exists, achieving the trivially
obtainable loss, which we have assumed optimal. Furthermore,Lhas on-category partial deriva-
tive∂L(η, i)/∂ηi= exp(ηi)/
P
k
j=1
exp(ηj)−1∈(−1,0] , and nonvanishing off-category partial
derivative∂L(η,̸=i)/∂ηi= exp(ηi)/
P
k
j=1
exp(ηj)∈(0,1) , both bounded, so the conditions of
Lemma 3.2 apply.
3.3 Linearly Guarded Labels Have Zero Covariance with the Features
The next theorem establishes the equivalence of conditions 3 and 4.
Theorem 3.4.LetXbe a random vector taking values inR
dwith finite first moment, andZa random
vector taking values in{0,1}
k with one-hot encoding, with each class probabilityP(Z =j) being
nonzero. Then the class-conditional meansE[X|Z =j] are all equal to the unconditional mean
E[X]if and only if every component ofXhas zero covariance with every component ofZ, i.e. the
cross-covariance matrixΣXZ, whose(i, j)
th
entry isCov(Xi,Zj), is the zero matrix.
Proof.SinceZis one-hot, we can rewrite the(i, j)
th
entry ofΣXZas:
E[XiZj]−E[Xi]E[Zj] =P(Z =j)
ˇ
E[Xi|Z =j]−E[Xi]
ı
.
AsP(Z =j)>0, it follows thatE[Xi|Z =j] =E[Xi]if and only ifCov(Xi,Zj) = 0.
We have thus established the equivalence of the first four conditions stated earlier. See Appendix C
for the last one, on statistical parity.
4 Least-Squares Concept Erasure
In Section 3 we saw thatXlinearly guardsZif and only if each component ofXhas zero covariance
with each component ofZ. We will now characterize the set of affine transformationsr(x) =Px+b
such thatr(X)linearly guardsZ.
Theorem 4.1.LetXandZbe random vectors taking values inR
dandR
krespectively, withXof
finite first moment. Then given some affine functionr(x) =Px+b , the modified random vector
r(X)linearly guardsZif and only if the columns of the cross-covariance matrixΣXZare contained
in the null space ofP.
Proof.
From Theorem 3.4 we know thatr(X)linearly guardsZif and only ifCov(r(X),Z) is the
zero matrix. By the linearity property of cross-covariance, we have:
Cov(r(X),Z) = Cov(PX +b,Z) =PCov(X,Z) =PΣXZ.
Therefore,r(X)linearly guardsZif and only ifker(P)⊇colsp(ΣXZ).
4

Implications for prior work.Notably, the above theorems imply that three previously proposed
methods in the literature, Spectral Attribute Removal (SAL) [36], Mean Projection [17], and Fair PCA
[20], are guaranteed to achieve linear guardedness given suitable hyperparameters. See Appendix D
for further discussion.
4.1 Derivation of LEACE
Theorem 4.1 is a very weak condition, which is far from identifying unique values forPandb.
In most applications, however, we’d like to make a “small” edit toXso that useful information
contained inXis maximally preserved. We operationalize the notion of a small edit in terms of the
mean squared normE∥r(X)−X∥
2
M defined by some positive-definite inner productM,
4
which
can be thought of as a local quadratic approximation toanymeasure of divergence betweenXand
r(X)(such as Kullback–Leibler divergence, for example). While we are primarily interested in the
Euclidean (M=I ) and Mahalanobis (M=Σ
+
XX ) norms, it will turn out that there is asingleerasure
function that minimizesallsuch norms simultaneously. We will see in Section 6 that ensuring edits
are small in this sense provides substantial benefit to downstream task performance as compared to
other methods which also guard the labelsZ.
Below, we derive the optimal eraser under the assumption thatXandZare centered.
Theorem 4.2.LetXandZbe centered random vectors taking values inR
dandR
krespectively, each
of finite second moment. LetM∈R
d×d be a p.s.d. matrix defining a (possibly degenerate) inner
product onR
d:⟨x,y⟩M=x
T
My . LetΣXX∈R
d×d beX’s covariance matrix, andΣXZ∈R
d×k
be the cross-covariance matrix ofXandZ. LetA
+denote the Moore-Penrose pseudoinverse of a
matrixA, and letA
1/2
be the p.s.d. square root of a p.s.d. matrixA. Then the objective
argmin
P∈R
d×d
E
h

PX−X

2
M
i
subject to Cov(PX,Z) =0
has the following solution:
P
∗
=I−W
+
PWΣXZ
W,
whereWis the whitening transformation(Σ
1/2
XX
)
+ andPWΣXZ
= (WΣXZ)(WΣXZ)
+ is the
orthogonal projection matrix ontocolsp(WΣXZ).
Proof.See Appendices E.1 and E.2 for two independent proofs of Theorem 4.2.
The above theorem assumes that the random vectorsXandZare centered, and does not include a
bias term. Below we extend our results to the uncentered case, and derive the optimal biasb
∗
.
Theorem 4.3.LetXandZbe random vectors taking values inR
dandR
krespectively, each of finite
second moment. DefineMandP
∗as in Theorem 4.2 andb
∗
=E[X]−P
∗
E[X] . Then(P
∗
,b
∗
)
minimizesE

PX +b−X

2
, subject toCov(PX +b,Z) =0.
Proof.LetP∈R
d×d
and define
˜
X = X−E[X]andc=PE[X] +b−E[X]. Then,
E

PX +b−X

2
M
=E

(P
˜
X−
˜
X) +c

2
M
=E

P
˜
X−
˜
X

2
M
+ 2E
fi
P
˜
X−
˜
X
fl
T
Mc+c
T
Mc
=E

P
˜
X−
˜
X

2
M
+c
T
Mc,
where we have eliminated the middle term becausePis linear andE[
˜
X] = 0. SinceMis p.s.d., our
objective is minimized forc=0, i.e.b=E[X]−PE[X] . The problem thus reduces to choosingPso
as to minimizeE

P
˜
X−
˜
X

2
M subject toCov(PX +b,Z) = Cov(P
˜
X,Z) =0 , which Theorem 4.2
shows occurs whenP=P
∗
.
4
Our proofs also include degenerate “inner products” whereMis singular, and the associated seminorms.
5

−5 0 5
−6
−4
−2
0
2
4
6
−5 0 5
−6
−4
−2
0
2
4
6
−5 0 5
−6
−4
−2
0
2
4
6
−5 0 5
−6
−4
−2
0
2
4
6
Concept
Subspace
Orthogonal
Subspace
Output
Subspace
Class 1
Class 2
Original Data Whitening Erasure Unwhitening Figure 1: LEACE projection in 3 steps. First the data is whitened, ensuring equal variance in all
directions. It is then orthogonally projected ontocolsp(WΣXZ)
⊥ , guaranteeing linear guardedness.
Finally, we unwhiten the data so that its covariance structure mimics the original.
Putting together Theorems 4.2 and 4.3 and rearranging, we arrive at the LEACE formula:
rLEACE(x) =x−W
+
PWΣXZW
Γ
x−E[X]
˙
(1)
Intuitively, LEACE de-means and whitensx, projects onto the subspace responsible for correlations
betweenXandZ, then unwhitens the result. Finally, it subtracts this value fromx, thereby surgically
removing the linearly available information aboutZ.
4.2 Oblique Projections are Least-Squares Optimal
Prior work on linear concept erasure has assumed that erasure functions should be orthogonal
projections [29,32,36], appealing to the well-known fact that an orthogonal projection of a point
xonto a subspaceUyields the nearest point inUtox. But even in the case whereXis centered,
rLEACEisnotan orthogonal projection in general. Orthogonal projection matrices are symmetric,
andI−W
+
PWΣXZW is only symmetric in the special case wherePWΣXZ andWcommute. It is
anobliqueprojection however, since applyingP
∗twice yields the same result as applying it once:
(P
∗
)
2
=I−2WPWΣXZ
W
+
+W
+
PWΣXZ≤
≤
≤≤
WW
+
PWΣXZ
W=P
∗
.
Orthogonal projections are generally not least-squares optimal for concept erasure because the
necessary and sufficient condition for linear guardedness,PΣXZ=0 , is a constraint on thenullspace
ofP, and not on its range. We may freely choose the range of the projection to minimize the mean
squared distance, as long as we zero outcolsp(ΣXZ) . In Figure 1, an orthogonal projection would
map all points onto the the dashed line, thereby preserving less of the variance of the original data
than LEACE does (green line). See Appendix F for a concrete example.
4.3 Extension to ContinuousZ
While not a focus of this work, it’s worth noting that LEACE can also be applied to the setting where
Ztakes arbitrary values inR
k, as long as we restrict ourselves to the ordinary least squares regression
lossL(η,z) =∥η−z∥
2
2 . In particular, the proofs of equivalence between conditions 1 and 2 given in
Appendix B make no categorical assumption onZ, and the equivalence between the optimality of a
zero weight matrix (condition 2) and zero cross-covariance (condition 4) is well known in the OLS
setting. We can then apply Theorems 4.2 and 4.3, which also make no categorical assumption, to
derive the same optimal affine eraser as in the categorical case.
5 Evaluation
5.1 Intrinsic Evaluation
Following Ravfogel et al.[31]we evaluate the ability of our method to remove gender information
from the last hidden layer of a frozen BERT model. We use the biographies dataset of De-Arteaga
et al.[6], composed of short biographies annotated by both binary gender and profession. We
embed each biography with the[CLS]representation in the last layer of BERT, enforce the same-
conditional-mean constraint to remove gender information from the[CLS], and then evaluate the
performance of the model, after the intervention, on the main task of profession prediction. We
6

0.20.40.60.8
% women
0.4
0.2
0.0
0.2
0.4
TPR-Gap
surgeon
professorjournalist
physician
attorney
nurse
dentist
teacher
photographer
chiropractor
yoga_teacher
software_engineer
poet
composer
painter
dj
architect
psychologist
dietitian
paralegal
filmmaker
comedian
pastor
accountant
personal_trainerrapper
interior_designer 0.20.40.60.8
% women
0.4
0.2
0.0
0.2
0.4
TPR-Gap
surgeon
professor
journalistphysician
attorney
nurse
dentist
teacher
photographer
chiropractor
yoga_teacher
software_engineer
poet
composer
painter
dj
architect
psychologist
dietitian
paralegal
filmmakercomedian
pastor
accountantpersonal_trainer
rapper interior_designer Figure 3: The correlation betweenGAP
T P R
f emale,y and the relative proportion of women in profession
y, for BERT representation, before (left; R=0.867) and after (right; R=0.392) the projection.
compare our intervention with RLACE [31], which uses gradient-based optimization to solve a linear
concept-erasure adversarial game.
Concept erasure results.First, we evaluate the ability of logistic regression classifiers to recover
the removed information. The results, presented in Fig. 2, show that our method is the only to achieve
random accuracy (perfect erasure) with a small edit, although RLACE (but not INLP) comes close. At
the same time, our method is around 2 orders of magnitude faster, and does not require gradient-based
optimization.
5.2 Downstream Fairness2 3 4 5 6
Mean Squared Error
0.5
0.6
0.7
0.8
0.9
Gender Accuracy
Majority
RLACE
INLP
Ours
Figure 2: Gender prediction ac-
curacy after bias-removal projec-
tion versus the mean squared dis-
tance from the original representa-
tion for INLP, RLACE, and LEACE
on BERT representations.
How does our intervention affect the behavior of the model on
the main classification task of profession prediction? We fit
a logistic regression profession-prediction classifier over the
projected[CLS]representations.
To measure the bias in a classifier, we follow De-Arteaga et al.
[6]
and use the TPR-GAP measure, which quantifies the bias
in a classifier by considering the difference (GAP) in the true
positive rate (TPR) between individuals with different protected
attributes (e.g. race or gender). We use the notationGAP
TPR
z,y
to denote the TPR-gap in some main-class labely(e.g. “nurse”
prediction) for some protected groupz(e.g. “female”), we also
considerGAP
TPR,RMS
z , the RMS of the TPR-gap across all
professions for a protected groupz:
GAP
TPR,RMS
z =
s
1
|C|
X
y∈C
(GAP
TPR
z,y)
2
To calculate the relation between the bias the model exhibits
and the bias in the data, we also calculateσ
(GAP
TPR
,%Women) ,
the correlation between the TPR gap in a given profession and
the percentage of women in that profession.
Results.The main-task classifier achieves profession-prediction accuracy of 77.3% on the pro-
jected representations (compared with 79.3% over the original representations), indicating that the
intervention minimally affects the ability to predict the profession of a person from the represen-
tation of their biography. At the same time, the TPR gap drops significantly from 0.198 to 0.084,
indicating a sharp drop in the biased behavior of the profession classifier. Indeed, inspecting the
correlationσ
(GAP
TPR
,%Women) between the gap (per profession) and the representation of women in
7

0 2 4 6 8 10 12
Layer of Intervention
0.2
0.4
0.6
0.8
MLM Accuracy
Ours
Random
INLP (20 iterations)
No Intervention 0 2 4 6 8 10 12
Layer of Intervention
1
2
3
4
5
MLM Loss Figure 4: Amnesic probing results onbert-base-uncased.
this profession, we see that this correlation plummets from 0.867 to 0.392 after erasure. Re-fitting
the main-task logistic regression classifier over the projected representations yields a slightly higher
main-task accuracy of 78.1%, at the price of significantly increasing the TPR gap to 0.158.
5
5.3 Revisiting Amnesic Probing
Elazar et al.[10]have introduced the idea ofamnesic probingas a causal intervention that aims to
test the importance of a given concept (e.g. part-of-speech tag) to some main task (e.g. language
modeling). They applied Iterative Nullspace Projection (INLP) to remove different concepts from the
hidden representations of the model, and assessed the degree to which its behavior changed when
performing masked language modeling. Since INLP often requires dozens of iterations to completely
erase the concept, its usage in this context raises concerns of collateral damage due to magnitude of
the intervention and the non-exhaustive nature of INLP removal. Here, we replicate their experiments
on thebert-base-uncasedmodel with our interventions.
Experimental setup.We use part-of-speech (POS) tags as our concept of interest. We collect
sentences and their coarse POS tags (“Noun”, “Verb” etc.; 18 in total) from the English Universal
Dependencies dataset [27]. We tokenize the sentences with theBERTtokenizer and map each word-
piece to the POS tag of the word to which it belongs. We collect the unmaskedBERTrepresentations
for each layer, intervene to linearly erase the POS concept from that layer, and continue the forward
pass until the last layer, from which we compute the distribution of the MLM over the vocabulary.
Note that in each experiment we intervene on a single layer. We quantify the decrease in accuracy
following the intervention, as well as the increase in the loss. We compare with a baseline intervention
of a random orthogonal projection whose null space has the same rank as the label space (18). For
INLP, we perform 20 iterations. This is needed because INLP does not effectively remove the concept;
even after 20 iterations, classification accuracy is above majority accuracy. As a result, INLP reduces
the rank of the representation by 360. By contrast, our method decreases the rank just by 17.
Results.The results are shown in Fig. 4b. Our intervention only mildly changes BERT LM accuracy
and loss until layer 8, with the highest drop recorded in layer 11. INLP, in contrast, shows maximum
effect at layer 6. Since it removes hundreds of dimensions, it is difficult to attribute this effect to
the erasure of the concept. These results suggest that thecausaleffect of the POS concept on the
language model is concentrated in layer 11. Interestingly, this stands in contrast with POS linear
probing results, which are optimal at earlier layers [38]. As Elazar et al.[10]have noted, probing
does not generally correlate with intervention-based analysis techniques.
5
The softmax probabilities of a multiclass logistic regression classifier can leak the removed information if
anotherclassifier is stacked on top of it [33], though this setup is not linear.
8

6 Concept Scrubbing
Algorithm 1Concept scrubbing
Require:Model withℓlayersf=fℓ◦. . .◦f1
Require:Design matrixX∈R
n×d
Require:Label matrixZ∈R
n×k
Ensure:LEACE parameters for each layer inf
1:H1←Embed(X)
2:L←list()
3:forl∈1. . . ℓdo
4:Fit(P,b)onHlandZ
5:Append(P,b)toL
6:Hl←P(Hl−µHl
) +µHl
(Eq. 1)
7:Hl+1←fl(Hl)
8:returnL
Unfortunately, Elazar et al.[10]were forced to
limit their interventions to a single layer due
to the limitations of INLP. INLP often requires
the deletion of several dozen dimensions before
linear guarding is achieved—as demonstrated
in Figure 2. Kumar et al.[21]show empiri-
cally and theoretically that INLP causes need-
less “collateral damage” to useful parts of the
representation that are orthogonal to the concept
being erased. Because of this collateral damage,
it’s impossible to apply INLP to multiple layers
of a transformer without causing its outputs to
collapse into gibberish.
Instead, we would like to erase all linear information about a concept ineveryintermediate rep-
resentation, which we termconcept scrubbing. LEACE makes concept scrubbing possible and
eminently practical. It causes minimal collateral damage, induces little computational overhead, and
the covariance statistics it relies on can be computed in astreamingfashion, without ever storing all
the hidden states in memory or on disk.
Algorithm.Any intervention on the model at layerℓchanges the distribution of hidden states at
layersℓ
′
> ℓ. Because of this, the naive approach of independently fitting LEACE parameters(P,b)
for all layers of the clean model, then applying them all at once, may fail to fully erase the target
concept. Instead, we fit LEACE parameterssequentially, starting from the first layer and proceeding
to the final layer. After we compute(P,b) for a layer, we immediately use them to scrub the hidden
states for that layer, then feed these scrubbed representations to the next layer (Algorithm 1).
LLaMA Pythia
Condition 7B 13B 30B 160M 1.4B 6.9B 12B
No intervention 0.69 0.66 0.62 0.90 0.70 0.64 0.62
Random erasure 0.69 0.66 0.62 0.99 0.72 0.66 0.63
LEACE 1.73 1.84 1.96 2.79 2.25 3.57 3.20
SAL 3.24 3.26 3.16 3.53 3.44 4.17 4.69
unigram entropy 2.90 2.90 2.90 2.66 2.66 2.66 2.66
Table 1: Perplexity in autoregressive language models when removing linearly available part-of-
speech information from the input to each transformer layer. Units are bits per UTF-8 byte. The
unigram baseline assigns probabilities to tokens based only on their frequency and not on the context.
6.1 Experimental Details
Dataset.For each model family, we use a sample from the respective pretraining distribution: the
validation split of the Pile [13] for the Pythia models [2], and the RedPajama replication of the
LLaMA pretraining corpus for the LLaMA family [39]. sampling a slice of2
22tokens for fitting
the LEACE parameters and another slice of2
22tokens for evaluation. Since neither corpus comes
with part-of-speech tags, we use the model from the SpaCy library [19] to automatically generate
Universal Dependency tags [23].
Baseline method.We also run concept scrubbing using full-rank SAL [36], which is similar to our
method but lacks a bias term and does not adjust for correlations between features (Appendix D).
Architecture.We focus on autoregressive language models. We evaluate our method on EleutherAI’s
Pythia 160M, 1.4B, 6.9B, and 12B models [2], and Meta’s LLaMA 7B, 13B, and 30B [39]. We apply
concept erasure to the input of each transformer block, immediately after normalization is applied
(LayerNorm or RMSNorm).
9

Randomized erasure.Almost any intervention on a neural network will cause its performance to
degrade to some extent. Following Elazar et al.[10], we isolate the effect of the concept erasure by
comparing it to a control condition in which we orthogonally project onto arandomlinear subspace
of the same rank as the cross-covariance matrix. To reduce the variance of our results, we sample a
fresh subspace for each minibatch, and erase that subspace at each layer, reporting the cross-entropy
loss averaged over subspaces.
Training efficiency.Algorithm 1 avoids redundant computation by caching the layerihidden states
foreverydata point, then using them to run layeri+ 1. This approach has the downside of requiring
a large amount of memory or disk space during training (up to 500GB in our experiments). It’s
possible to avoid caching any hidden states and instead recompute them as needed, at the expense of
increasing the total compute cost fromO(ℓ)toO(ℓ
2
).
6.2 Results
We find strong evidence that autoregressive language models heavily rely on linearly encoded
part-of-speech information. While erasing a randomly selected subspace has little to no effect on
language modeling performance, scrubbing away part-of-speech information induces a large increase
in perplexity across all models (Table 1).
The specific numbers, however, depend on the erasure method used: SAL induces significantly
larger increases in perplexity for all models we tested. We take this to mean that SAL inflicts more
collateral damage on other useful features in the representation than LEACE does. In other words,
interventions made with LEACE are moresurgicalthan those made with prior work; they more
closely approximate the ideal of a perfect intervention which only erases the target concept and keeps
everything else fixed [40,15]. If this experiment were conducted with SAL alone, we would have
overestimatedthe causal effect of part-of-speech.
7 Limitations and Future Work
Much work remains to be done to validate concept scrubbing. Specifically, we’d like to see ex-
periments that target concepts much narrower than part-of-speech, and use behavioral metrics to
determine whether scrubbing changes the network in the ways we’d intuitively expect. If these
experiments succeed, an exciting next step would be the incorporation of concept scrubbing into
the pretraining and/or finetuning process. This may make it possible to train deep neural networks
subject toconceptual constraints. It remains to be seen if gradient-based optimizers will be able to
“circumvent” such constraints by learning completely nonlinear representations of protected attributes.
In this work, we focused exclusively onlinearconcept erasure due to its simplicity and tractability.
Some authors have proposed nonlinear concept erasure techniques based on kernel methods, but
have found that erasure functions fit using one kernel do not generalize well to other kernels [32,36].
We conjecture that it is intractable to nondestructively editXso as to prevent a general nonlinear
adversary from recoveringZ, unless the data generating process forXis known in detail.
6
A major motivation of concept erasure is that it promises to prevent models from using a concept
in apost hoc, model-agnostic fashion. But if our concept scrubbing procedure turns out to yield
unsatisfactory results in practical use cases, the most promising research direction might then be to
improve model-specifictechniques, such as those that modify the training procedure [8, 9, 14].
8 Acknowledgements
We are grateful to CoreWeave for providing the compute resources used in Section 6. Shauli Ravfogel
is grateful to be supported by the Bloomberg Data Science PhD Fellowship.
6
We suspect erasing a concept is at least as hard as extracting it from the original representation. But in the
worst case, information aboutZcould be encodedcryptographicallyinX, which would be intractable to decode
given standard computational complexity assumptions. If the data is generated by a known algorithm, however,
it may be possible to efficiently eliminate mutual information betweenZandXby simply breaking the links in
the causal graph that connect them.
10

References
[1]
UC Berkeley. The Hilbert space of random variables. Lecture Notes Electrical Engineering
126, 2018. URLhttps://inst.eecs.berkeley.edu/~ee126/sp18/projection.pdf.
[2]
Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric
Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff,
et al. Pythia: A suite for analyzing large language models across training and scaling.arXiv
preprint arXiv:2304.01373, 2023.
[3]
Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T.
Kalai. Man is to computer programmer as woman is to homemaker? De-
biasing word embeddings.Advances in Neural Information Processing Systems,
29:4349–4357, 2016. URLhttps://proceedings.neurips.cc/paper/2016/file/
a486cd07e4ac3d270571622f4f316ec5-Paper.pdf.
[4]
Xilun Chen, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. Adversarial
deep averaging networks for cross-lingual sentiment classification.Transactions of the Associa-
tion for Computational Linguistics, 6:557–570, 2018. URLhttps://aclanthology.org/
Q18-1039.
[5]
Verna Dankers, Christopher Lucas, and Ivan Titov. Can transformer be too compositional?
Analysing idiom processing in neural machine translation. InProceedings of the 60th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
3608–3626, 2022.
[6]
Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexan-
dra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. Bias in
bios: A case study of semantic representation bias in a high-stakes setting. InProceedings of
the Conference on Fairness, Accountability, and Transparency, FAT* ’19, page 120–128, New
York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450361255. doi:
10.1145/3287560.3287572. URLhttps://doi.org/10.1145/3287560.3287572.
[7]
Sunipa Dev, Tao Li, Jeff M. Phillips, and Vivek Srikumar. OSCaR: Orthogonal subspace
correction and rectification of biases in word embeddings. InProceedings of the 2021 Con-
ference on Empirical Methods in Natural Language Processing, pages 5034–5050, Online
and Punta Cana, Dominican Republic, November 2021. Association for Computational Lin-
guistics. doi: 10.18653/v1/2021.emnlp-main.411. URLhttps://aclanthology.org/2021.
emnlp-main.411.
[8]
Harrison Edwards and Amos Storkey. Censoring representations with an adversary. InIn-
ternational Conference in Learning Representations, pages 1–14, May 2016. URLhttps:
//arxiv.org/abs/1511.05897.
[9]
Yanai Elazar and Yoav Goldberg. Adversarial removal of demographic attributes from text data.
InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,
pages 11–21, Brussels, Belgium, October-November 2018. Association for Computational
Linguistics. doi: 10.18653/v1/D18-1002. URLhttps://aclanthology.org/D18-1002.
[10]
Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. Amnesic probing: Behavioral
explanation with amnesic counterfactuals.Transactions of the Association for Computational
Linguistics, 9:160–175, 2021. doi: 10.1162/tacl_a_00359. URLhttps://aclanthology.
org/2021.tacl-1.10.
[11]
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann,
Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep
Ganguli, Hatfield Zac Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt,
Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and
Chris Olah. A mathematical framework for transformer circuits.Transformer Circuits Thread,
2021.
[12] Mathematical Statistics. Academic Press, Cambridge, MA, 1967.
11

[13]
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason
Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile:
An 800GB dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027,
2020.
[14]
Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah
Goodman, and Christopher Potts. Inducing causal structure for interpretable neural networks.
InInternational Conference on Machine Learning, pages 7324–7338. PMLR, 2022.
[15]
Christopher Grimsley, Elijah Mayfield, and Julia R.S. Bursten. Why attention is not explanation:
Surgical intervention and causal reasoning about neural models. InProceedings of the Twelfth
Language Resources and Evaluation Conference, pages 1780–1790, Marseille, France, May
2020. European Language Resources Association. URLhttps://aclanthology.org/2020.
lrec-1.220.
[16]
Pantea Haghighatkhah, Wouter Meulemans, Bettina Speckmann, Jérôme Urhausen, and Kevin
Verbeek. Obstructing classification via projection. In Filippo Bonchi and Simon J. Puglisi,
editors,46th International Symposium on Mathematical Foundations of Computer Science,
Leibniz International Proceedings in Informatics, LIPIcs. Schloss Dagstuhl - Leibniz-Zentrum
für Informatik, 2021.
[17]
Pantea Haghighatkhah, Antske Fokkens, Pia Sommerauer, Bettina Speckmann, and Kevin
Verbeek. Better hit the nail on the head than beat around the bush: Removing protected
attributes with a single projection. pages 8395–8416, December 2022. doi: 10.18653/v1/2022.
emnlp-main.575. URLhttps://aclanthology.org/2022.emnlp-main.575.
[18]
Evan Hernandez and Jacob Andreas. The low-dimensional linear geometry of contextualized
word representations. InProceedings of the 25th Conference on Computational Natural
Language Learning, pages 82–93, 2021.
[19]
Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-
strength Natural Language Processing in Python, 2020.
[20]
Matthäus Kleindessner, Michele Donini, Chris Russell, and Muhammad Bilal Zafar. Efficient
fair PCA for fair representation learning. InInternational Conference on Artificial Intelligence
and Statistics, pages 5250–5270. PMLR, 2023.
[21]
Abhinav Kumar, Chenhao Tan, and Amit Sharma. Probing classifiers are unreliable for concept
removal and detection. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh,
editors,Advances in Neural Information Processing Systems, volume 35, pages 17994–18008.
Curran Associates, Inc., 2022. URLhttps://proceedings.neurips.cc/paper_files/
paper/2022/file/725f5e8036cc08adeba4a7c3bcbc6f2c-Paper-Conference.pdf.
[22]
Matt J. Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness.
Advances in Neural Information Processing Systems, 30, 2017.
[23]
Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das,
Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, et al. Universal
dependency annotation for multilingual parsing. InProceedings of the 51st Annual Meeting of
the Association for Computational Linguistics (Volume 2: Short Papers), pages 92–97, 2013.
[24]
Neel Nanda. Actually, Othello-GPT has a linear emergent world model, Mar 2023. URL
<https://neelnanda.io/mechanistic-interpretability/othello>.
[25]
Vassilina Nikoulina, Maxat Tezekbayev, Nuradil Kozhakhmet, Madina Babazhanova, Matthias
Gallé, and Zhenisbek Assylbekov. The rediscovery hypothesis: Language models need to meet
linguistics.Journal of Artificial Intelligence Research, 72:1343–1384, 2021.
[26]
Hamed Nilforoshan, Johann D. Gaebler, Ravi Shroff, and Sharad Goel. Causal conceptions
of fairness and their consequences. InInternational Conference on Machine Learning, pages
16848–16887. PMLR, 2022.
12

[27]
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajic, Christopher D. Manning,
Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. Universal Dependencies
v2: An evergrowing multilingual treebank collection. InProceedings of the 12th Language
Resources and Evaluation Conference, pages 4034–4043, 2020.
[28]
Judea Pearl.Causality. Cambridge University Press, Cambridge, UK, 2 edition, 2009. ISBN
978-0-521-89560-6. doi: 10.1017/CBO9780511803161.
[29]
Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. Null it out:
Guarding protected attributes by iterative nullspace projection. InProceedings of the 58th
Annual Meeting of the Association for Computational Linguistics, pages 7237–7256, Online,
July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.647.
URLhttps://aclanthology.org/2020.acl-main.647.
[30]
Shauli Ravfogel, Grusha Prasad, Tal Linzen, and Yoav Goldberg. Counterfactual interventions
reveal the causal effect of relative clause representations on agreement prediction. InProceedings
of the 25th Conference on Computational Natural Language Learning, pages 194–209, 2021.
[31]
Shauli Ravfogel, Michael Twiton, Yoav Goldberg, and Ryan D Cotterell. Linear adversarial
concept erasure. InInternational Conference on Machine Learning, pages 18400–18421. PMLR,
2022.
[32]
Shauli Ravfogel, Francisco Vargas, Yoav Goldberg, and Ryan Cotterell. Adversarial concept
erasure in kernel space. InProceedings of the 2022 Conference on Empirical Methods in
Natural Language Processing, pages 6034–6055, 2022.
[33]
Shauli Ravfogel, Yoav Goldberg, and Ryan Cotterell. Log-linear guardedness and its im-
plications. InProceedings of the 61st Annual Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers), pages 9413–9431, Toronto, Canada, July 2023.
Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.523. URL
https://aclanthology.org/2023.acl-long.523.
[34]
Bashir Sadeghi and Vishnu Boddeti. On the fundamental trade-offs in learning invariant
representations.arXiv preprint arXiv:2109.03386, 2021. URLhttps://openreview.net/
pdf?id=KOk7mUGspN9.
[35]
Bashir Sadeghi, Runyi Yu, and Vishnu Boddeti. On the global optima of kernelized adversarial
representation learning. In2019 IEEE/CVF International Conference on Computer Vision,
pages 7970–7978. IEEE, 2019. URLhttp://hal.cse.msu.edu/assets/pdfs/papers/
2019-iccv-kernel-adversarial-representation-learning.pdf.
[36]
Shun Shao, Yftah Ziser, and Shay B. Cohen. Gold doesn’t always glitter: Spectral removal
of linear and nonlinear guarded attribute information. InProceedings of the 17th Conference
of the European Chapter of the Association for Computational Linguistics, pages 1611–1622,
Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. URLhttps:
//aclanthology.org/2023.eacl-main.118.
[37]
Shun Shao, Yftah Ziser, and Shay B. Cohen. Erasure of unaligned attributes from neural
representations.Transactions of the Association for Computational Linguistics, 11:488–510,
2023. doi: 10.1162/tacl_a_00558. URLhttps://aclanthology.org/2023.tacl-1.29.
[38]
Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline.
InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
pages 4593–4601, 2019.
[39]
thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez,
Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation
language models, 2023. URLhttp://arxiv.org/abs/2302.13971.
[40]
James Francis Woodward.Making Things Happen: A Theory of Causal Explanation explanation.
Oxford University Press, 2005.
13

[41]
Zhengxuan Wu, Atticus Geiger, Christopher Potts, and Noah Goodman. Interpretability at scale:
Identifying causal mechanisms in Alpaca. 2023.
[42]
Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, and Graham Neubig. Controllable invariance
through adversarial feature learning. InAdvances in Neural Information Processing Systems,
volume 30, pages 585–596, 2017. URLhttps://dl.acm.org/doi/10.5555/3294771.
3294827.
[43]
Yilun Xu, Shengjia Zhao, Jiaming Song, Russell Stewart, and Stefano Ermon. A theory of usable
information under computational constraints. In8th International Conference on Learning
Representations, 2020. URLhttps://openreview.net/forum?id=r1eBeyHFDH.
[44]
Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with
adversarial learning. InProceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and
Society, page 335–340, New York, NY, USA, 2018. Association for Computing Machinery.
ISBN 9781450360128. URLhttps://doi.org/10.1145/3278721.3278779.
14

A Additional Related Work
The problem of linear concept erasure is an instance of the general problem of information removal.
Information removal methods generally divide into adversarial methods, which are applied during
training, and the post-hoc linear methods considered in this paper. Adversarial methods [8,42,4,
9,44] use a gradient-reversal layer during training to induce representations that do not encode
the protected attribute. However, Elazar and Goldberg[9]have shown that these methods fail in
exhaustively removing all the information associated with the protected attribute: it is often possible
to train new adversaries that successfully recover the removed information. Linear methods have been
proposed as a tractable alternative, where one identifies a linear subspace that captures the concept
of interest, and neutralizes it using algebraic techniques. Different methods have been proposed for
the identification of the subspace, e.g. PCA and variants thereof [3,20], orthogonal-rotation [7],
classification-based [29], spectral [36, 37] and adversarial approaches [31].
Few works theoretically characterize the condition of linear guardedness. Haghighatkhah et al.[16]
extensively analyzed the problem of preventing linear classification, with the focus on decreasing
accuracy. They provide a constructive proof of an optimal intervention for an SVM classifier.
Ravfogel et al.[33]have proposed a formal definition of linear guardedness based onVinformation,
and characterized the fairness implications of guardedness; we show the relations with our definition
above. Ravfogel et al.[31]provide an adversarial formulation of the problem, derive a closed-formed
solution to certain cases, and propose an SGD-based optimization for others. While they seek an
orthogonalprojection, we empirically showed that their solution is very close to ours. Sadeghi et al.
[35]and Sadeghi and Boddeti[34]both study an adversarial formulation of concept erasure for linear
regression, and they trade-off with main-task performance. In contrast to Ravfogel et al.[31], they
consider a general linear adversary, i.e. not necessarily a projection matrix. Closest to our work are
Kleindessner et al.[20], Haghighatkhah et al.[17], Shao et al.[36]. As we showed above (§ 4), those
methods do achieve the goal of linear guardedness though they are unable to prove this fact. At the
same time, they are not optimal in terms of damage to the original representation space.
B Equivalence of Guardedness with the Optimality of Constant Predictors
The following two theorems establish the equivalence of conditions 1 and 2 (indeed, they do so in the
general setting, with no assumption of convex loss or linear predictors).
Theorem B.1.SupposeX (V,L) -guardsZ. Then for every lossL ∈L , the corresponding
trivially attainable lossL
(Z,L)
τ cannot be improved upon by any predictorη(·;θ)∈ V , i.e.
Lτ= infθE[L(η(X;θ),Z)].
Proof.
Consider the null random vectorX
′
(ω) =0 . Since all predictors are constant onX
′, and the
trivially attainable loss gives thebestavailable expected loss among constant predictors, we must
have:
Lτ= inf
θ
E[L(η(X
′
;θ),Z)] (4)
The right side of equation (4) is the best possible loss achievable by a functionη(·;θ) on the joint
distribution of(X
′
,Z), which by the definition of guardedness is upper bounded by the best possible
loss achievable on the joint distribution of(X,Z):
inf
θ
E[L(η(X
′
;θ),Z)]≤inf
θ
E[L(η(X;θ),Z)] (5)
Combining equations (4) and (5), and the fact that all constant functions exist in our function class
V={η(·;θ)}, we arrive at our desired result:
Lτ= inf
θ
E[L(η(X;θ),Z)]
Theorem B.2.Suppose that for every lossL ∈L , the corresponding trivially attainable lossL
(Z,L)
τ
cannot be improved upon by any predictorη(·;θ)∈ V , i.e.Lτ= infθE[L(η(X;θ),Z)] . ThenX
(V,L)-guardsZ.
15

Proof.LetX
′
: Ω→R
d
be any other random data vector with finite first moment.
Since all constant predictors exist in our predictor classV={η(·;θ)} , the best loss achievable on
(X
′
,Z) by functions inVmust be at least as good as the trivially attainable loss (the best loss available
by such constant predictors):
inf
θ
E[L(η(X
′
;θ),Z)]≤Lτ
By assumption, the trivially attainable loss cannot be improved upon over(X,Z)by predictors inV:
Lτ= inf
θ
E[L(η(X;θ),Z)]
Since our choice ofX
′was arbitrary, this shows thatXmaximizes the minimal achievable loss, soX
(V,L)-guardsZ.
C Linear Guardedness is Equivalent to Linear Statistical Parity
To measure the effect of linear guardedness on main-task classifiers, we use the following minimal
definition of “fairness” with respect to an attribute, adapted from Edwards and Storkey [8].
Definition C.1(Statistical Parity).LetXandZbe defined as above, and letfbe a function with
domainR
d
. Thenfexhibitsstatistical paritywith respect toZwhen evaluated onXif
∀z∈ Z:E[f(X)|Z =z] =E[f(X)].
We now prove the equivalence of conditions 3 and 5.
Theorem C.2.LetXandZbe defined as above. Then every linear predictorf(x) =b+Wx
exhibits statistical parity w.r.t.Zwhen evaluated onXif and only if each class-conditional mean
E
fi
X|Z =z
fl
is equal toE
fi
X
fl
.
Proof.
Suppose each class-conditional meanE
fi
X|Z =z
fl is equal toE
fi
X
fl. Then by the linearity of
expectation, we have for allz∈ Z:
E[f(X)|Z =z] =E[WX +b|Z =z] =WE[X|Z =z] +b=WE[X] +b=E[f(X)].
This matches the definition of statistical parity provided in Definition C.1.
Conversely, suppose every linear predictorf(x) =b+Wxexhibits statistical parity w.r.t.Zwhen
evaluated onX. Then this holds for the identity functionid(x) =x, and thus for allz∈ Z:
E[X|Z =z] =E[id(X)|Z =z] =E[id(X)] =E[X].
D Implications for Prior Work
In this section we discuss the implications of Theorem 4.1, which characterizes the necessary and
sufficient conditions for an affine erasure function to yield a perfectly linearly guarded dataset, for
methods proposed in prior work.
Spectral Attribute RemovaL (SAL) [36] uses the topnleft singular vectors ofΣXZto construct an
orthogonal projection matrixQSAL=I−U:nU
T
:n which is then applied toX. Notably, whilenis
presented as a free parameter in their method, all of their experiments involve binary classification
problems whereZis a one-hot vector, andnis set to a value no greater than 2. We’ll call the version
of SAL wheren= rank(ΣXZ) , “full-rank SAL.” Since these left singular vectors are an orthonormal
basis forΣXZ’s column space, Theorem 4.1 implies that full-rank SAL guarantees linear guardedness.
Mean Projection (MP) [17] orthogonally projectsXonto the orthogonal complement of the span of
the difference in class centroidsE[X|Z = 1]−E[X|Z = 0] , whereZis assumed to be binary. Since
the centroids are equal after the projection, this method guarantees linear guardedness by Theorem 3.1.
In fact, by Theorem 3.4, MP is mathematically equivalent to SAL whenZis a one-dimensional
random vector taking one of two possible values.
16

E Derivation of LEACE
Theorem 4.2.LetXandZbe centered random vectors taking values inR
dandR
krespectively, each
of finite second moment. LetM∈R
d×d be a p.s.d. matrix defining a (possibly degenerate) inner
product onR
d:⟨x,y⟩M=x
T
My . LetΣXX∈R
d×d beX’s covariance matrix, andΣXZ∈R
d×k
be the cross-covariance matrix ofXandZ. LetA
+denote the Moore-Penrose pseudoinverse of a
matrixA, and letA
1/2
be the p.s.d. square root of a p.s.d. matrixA. Then the objective
argmin
P∈R
d×d
E
h

PX−X

2
M
i
subject to Cov(PX,Z) =0
has the following solution:
P
∗
=I−W
+
PWΣXZW,
whereWis the whitening transformation(Σ
1/2
XX
)
+ andPWΣXZ
= (WΣXZ)(WΣXZ)
+ is the
orthogonal projection matrix ontocolsp(WΣXZ).
Below are two independent proofs of Theorem 4.2.
E.1 Algebraic Proof
Proof.
We shall first show that, in any orthonormal basis,
7
each rowPiconstitutes an independent
optimization problem, and then select a basis in which we can easily show that the corresponding
componentXiofXcan be almost surely decomposed into a linear combination of mutually un-
correlated components in the whitened random vectorWX, some of which correlate withZand
some of which do not. The solution(PX)i is then that same linear combination, restricted to those
components which do not correlate withZ.
Consider first an orthonormal basis diagonalizing the inner productM, so that⟨x,y⟩M=
P
d
i=1
αixiyi
for fixedα1, . . . , αd≥0 . This allows us to treat each rowPi∈R
d ofPas a
separate optimization problem,
argmin
Pi∈R
d
E
h
αi
Γ
Pi
T
X−Xi
˙
2
i
subject to Cov(Pi
T
X,Z) =0,
at which point the weightsαiof each subproblem become irrelevant, and our objective may as well
be Euclidean, allowing us to view each row as an independent optimization problem not just in this
basis, but from any convenient one.
So now letℓ= rank(ΣXZ) = rank(ΣWX,Z) andm= rank(ΣXX) = rank(ΣWX,WX) , and
consider a (new) orthonormal basis whose firstmcoordinates span the column (and row) space of
W(i.e. the subspace ofR
din whichXandWXhave nonzero variance), and whose firstℓ≤m
coordinates span the column space ofΣWX,Z (i.e. the subspace ofR
din whichWXhas nonzero
covariance withZ).
Any component ofXcan be (almost surely) written as a fixed linear combination of the nontrivial
components of its whiteningWX:
Xi= (W
+
WX)i=
m
X
j=1
W
+
ij
(WX)j. (almost surely)
Meanwhile, any component ofPXcan be (always) written as a fixed linear combination of the
nontrivial components ofWXand the almost surely zero components ofX:
(PX)i=
m
X
j=1
Aij(WX)j+
d
X
j=m+1
BijXj,
i.e.P=AW+BV , whereV=I−W
+
W is the orthogonal projection ontoX’s almost surely
zero components.
7
Throughout this proof, we abuse the notationsXi,Pi , etc. to refer to thei
thcomponent in the specified
basis, not necessarily the standard one.
17

Thei
th
sub-objective is then:
E
Γ
Pi
T
X−Xi
˙
2
=E
"
m
X
j=1
(Aij−W
+
ij
)(WX)j
#
2
=
m
X
j=1
(Aij−W
+
ij
)
2
,
where we have safely ignored the almost surely zero termsBijXj (j > m), and used the fact that the
firstmcomponents ofWXhave identity covariance matrix.
PX
is almost surely equal toAWX , so our constraintCov(PX,Z) =0 is equivalent toAΣWX,Z=
Cov(AWX,Z) =0
, i.e.Aij= 0 whenj≤ℓ, since the firstℓcomponents are those for which
WXcorrelates withZ. Subject to this, the objective is minimized forAij=W
+
ij whenj > ℓ, i.e.
A=W
+
(I−PWΣXZ
).
The particular choiceB=I gives our solutionP
∗
=I−W
+
PWΣXZ
W , leaving the non-varying
components ofXintact (see Fig. 1 for a visualization).
The solution is unique except for columns corresponding to the components ofXwith zero variance,
and rows corresponding to the zero-weighted components of the (pseudo) inner productM.
E.2 Covector Proof
Proof.
We assume without loss of generality that vectors inR
dare represented in a basis diagonalizing
the inner productM, so that⟨x,y⟩M=
P
d
i=1
mixiyi for fixedm1, . . . , md≥0 . This allows us to
treat each rowPi∈R
d
ofPas a separate optimization problem,
argmin
Pi∈R
d
E
h
mi
Γ
Pi
T
X−Xi
˙
2
i
subject to Cov(Pi
T
X,Z) =0.
Our objective only depends onPithrough its effect on the scalar random variableξ=Pi
T
X . All
random variables
8
of the formζ=u
T
ζ
X for some covectoru
T
ζ
∈R
d form a vector spaceU, which
we equip with the covariance inner product⟨ξ, ζ⟩Cov= Cov(ξ, ζ) =E[ξζ] =u
T
ξ
ΣXXuζ.
By the linearity of covariance, the elements ofUuncorrelated withZform a subspaceZ
⊥
⊆U .
Note also thatξ∈Z
⊥ if and only ifξ’s covectoru
T
ξsatisfiesCov(u
T
ξ
X,Z) =u
T
ξ
ΣXZ=0k , and
that these covectors themselves form the subspacecolsp(ΣXZ)
⊥
ofR
d
.
Our objective now reduces to finding a covectorP
T
ithat defines the orthogonal projection ofXionto
Z
⊥. The difficulty is that orthogonality of elements inUis not equivalent to orthogonality of the
corresponding covectors. We can fix this by changing the basis in which covectors are represented.
SinceX∈colsp(W) a.s., we can write any element ofUas a linear form inWXrather thanXby
applying the change-of-basisu
′
ξ
=W
+
uξto every covector:ξ= (u
′
ξ
)
T
WX =u
T
ξ
≤
≤≤≤
W
+
WXa.s.
In this new basis, which is orthonormal under our covariance inner product, each component ofXis
writtenXi= (W
+
)
T
i
WX and the inner product of any two elements ofUis simply the Euclidean
inner product of the corresponding covectors:
9
⟨ξ, ζ⟩Cov= Cov(u
′T
ξWX,u
′T
ζWX) =u
′T
ξ(((((
WΣXXWu
′
ζ=u
′T
ξu
′
ζ.
Since the two inner products are now equivalent, andZ
⊥is precisely those random variables with
covectoru
′
∈colsp(WΣXZ)
⊥ , the orthogonal projection ofXiontoZ
⊥is also an orthogonal
projection of its covector(W
+
)
T
i
ontocolsp(WΣXZ)
⊥
:
ˆ
Xi= (W
+
)
T
i(I−PWΣXZ)(WX) (6)
Putting all the components ofXtogether, we have our final solution,
ˆ
X = (I−W
+
PWΣXZ
W)X,
which is almost surely equivalent to Eq. 6, but keeps the non-varying components ofXintact.
8
Strictly speaking, equivalence classes of almost surely equal random variables.
9
IfΣXXis full rank, there is a one-to-one correspondence between random variables inUand covectors. In
the singular case, we may choose the component of the covector insideker(ΣXX) arbitrarily, since it will make
no difference to the inner product.
18

F The Optimality of Oblique Projections
As noted in subsection 4.2, the optimal affine erasure functionr(x) =b+Px doesnotin general
use an orthogonal projection for the matrixP. A simple example illustrates why. Letd= 2, k= 1
so thatXtakes values inR
2andZtakes values inR, with the first featureX1and the labelZeach
independently and uniformly distributed in{−1,+1} , and the second featureX2simply equal to the
sumX2= X1+ Z. A dataset reflecting such a distribution has four(x,y)pairs:
([1,2]
T
,1),([1,0]
T
,−1),([−1,0]
T
,1),([−1,−2]
T
,−1)
In this case,allof the informationXhas aboutZresides inX2, so the minimally disruptive orthogonal
projection which guardsZwill nullify that component:
Portho=
ffi
1 0
0 0
ffl
On the other hand,X1contains some information aboutX2(despite having no information aboutZ),
allowing a partial reconstruction ofX2while preserving full concept erasure:
Poblique=
ffi
1 0
1 0
ffl
Both methods fully erase the ability to predictZfrom the data, however a simple calculation shows
the second, oblique method to perform better as measured by mean squared edit distance:
E∥PorthoX−X∥
2
= 2,E∥PobliqueX−X∥
2
= 1
G Equivalence of Guardedness Definitions
Xu et al.[43]define theconditionalV-entropyofZgivenXas the lowest achievable cross-entropy
loss predictingZwith a function ofXin the predictor classV. In our notation:
HV(Z|X) = inf
θ∈Θ
E[L(η(X;θ),Z)],
whereL(η, z) =−log
exp(ηz)
P
k
i=1
exp(ηi)
is the cross-entropy loss function.
They then define the (unconditional)V-entropyHV(Z) =HV(Z|0) to be the lowest achievable
cross-entropy loss in the case of a constantly null random data variable. This is exactly our trivially
attainable lossLτ(Definition 2.2).
Finally, they define theV-informationfromXtoZas the reduction inV-entropy as compared to
using such a null random data variable:
IV(X→Z) =HV(Z)−HV(Z|X).
Using these notions, Ravfogel et al.[33]say thatXisϵ-guardedwith respect toVifIV(X→Z)< ϵ .
In Appendix B, we showed the equivalence of guardedness (as we have defined it in Definition 2.1)
to the optimality of the trivially attainable loss. That is,X (V,L) -guardsZwhenHV(Z|X) =Lτ=
HV(Z)
, in the case whereLis the singleton class consisting solely of the cross-entropy loss function.
In the language of [33],Xisϵ-guarded with respect toVfor allϵ >0.
H Constraining Norm Growth
In early concept scrubbing experiments (Sec. 6), we found that at specific layers in some models,
concept scrubbing with LEACE would cause the norm of the representation to diverge, leading to
NaN outputs. By contrast, SAL never caused divergence, even though it causes a larger disruption
to model performance on average (Table 1). This is because SAL uses an orthogonal projection
Q, whose eigenvalues are thus all in{0,1}, so the norm of the hidden state can never increase
after erasure, while LEACE’s oblique projection matrixPdoes generally have singular values
greater than 1. To combine the superior average-case MSE of LEACE with the stability of SAL,
we adopt a simple regularization heuristic. After constructingP, we analytically compute the
19

trace of the covariance matrix of the hidden states after applyingP. Iftr(PΣXXP
T
)>tr(ΣXX) ,
we solve a quadratic equation to find the convex combinationP
′
=αP+ (1−α)Q such that
tr(ΣXX) = tr(P
′
ΣXX(P
′
)
T
) . By Theorem 4.1, the set of matrices which ensure linear guardedness
is convex,
10
soP
′is guaranteed to be in the feasible set. Furthermore, since our mean squared
error objective is convex,P
′is guaranteed to have no worse MSE thanQ. We find this solves the
divergence issue in practice.
I Oracle LEACEi
Z
1
Z
2
i
Z
Z
span
proj X
X
Figure 5: Orthogonal projection ofi
thcomponent
ofX, itself a vector in the random variable Hilbert
spaceH, onto the span of the components ofZ.
The residualXi−proj
ZXi is the closest vector
toXiorthogonal to, and hence uncorrelated with,
Z= span({Z1,Z2}).
The concept erasure method derived in Section 4
does not require access to concept labels at infer-
ence time. That is, we can fit an erasure function
on a labeled training dataset, then apply the func-
tion to unlabeled datapoints. If we have oracle
access to the labelzfor eachx, we can achieve
an even more surgical edit. In Theorem I.1 be-
low, we deriveOracle LEACE, a closed-form
formula for the the nearestX
′to anyXsuch that
Cov(X
′
,Z) =0.
Like in Sec. 4, the resultingX
′
LEACE is “nearest”
toXwith respect to all p.s.d. inner products
a
T
Mb defined onR
dsimultaneously. This is
because, by expressingXin a basis that diago-
nalizesM, we can decompose the problem into
dindependent subproblems, one for each com-
ponent ofXi. Each subproblem can then be
viewed as an orthogonal projection, not inR
d,
but in an abstract vector space of real-valued random variables. For geometric intuition, see Figure 5.
Prior work has noted that computing an orthogonal projection in a random variable Hilbert space
is equivalent to solving an ordinary least squares regression problem [1]. Our theorem is a natural
extension of this work: we find thatX
′
LEACEis equal to the OLS residual from regressingXonZ,
plus a constant shift needed to ensure that erasingZdoes not change the mean ofX.
Theorem I.1(Oracle Concept Erasure).LetHbe the Hilbert space of square-integrable real-valued
random variables equipped with the inner product⟨ξ, ζ⟩H:=E[ξζ] . Let(X,Z)be random vectors in
H
dandH
krespectively. Then for every p.s.d. inner product⟨a,b⟩M=a
T
Mb onR
d, the objective
argmin
X
′
∈H
d
E

X
′
−X

2
M
subject to Cov(X
′
,Z) =0
is minimized by the (appropriately shifted) ordinary least squares residuals from regressingXonZ:
X
′
LEACE= X−ΣXZΣ
+
ZZ
Γ
Z−E[Z]
˙
.
Proof.Assume w.l.o.g. thatXandX
′
are represented in a basis diagonalizingM, so we may write
E

X
′
−X

2
M
=
d
X
i=1
miE
fi
(X
′
i−Xi)
2
fl
,
wherem1, . . . , md≥0 are eigenvalues ofM. Crucially, each term in this sum is independent from
the others, allowing us to decompose the primal problem intodseparate subproblems of the form
∥X
′
i
−Xi∥
2
H
, one for each componentiof(X,X
′
).
Factoring out constants.Now consider the subspaceC= span(1)⊂ H consisting of all constant
(i.e. zero variance) random variables. Orthogonally decomposingXialongCyieldsXi=
˜
Xi+µi ,
whereµi=E[Xi]∈ Cand
˜
Xi= X−E[X]i∈ C
⊥
, and likewise forX
′
i
. Our objective is now

X
′
i−Xi

2
H
=

µ
′
i−µi

2
H
+

˜
X
′
i−
˜
Xi

2
H
. (7)
10
In fact, it is a subspace ofR
d×d. For any matricesA,B∈R
d×d such thatAΣXZ=0 andBΣXZ=0 ,
we have by linearity(αA+βB)ΣXZ=αAΣXZ+βBΣXZ=α0+β0=0for any scalarsαandβ.
20

Sinceµ
′
iandµiare orthogonal to
˜
X
′
i
and
˜
Xi
, and the constraintCov(X
′
,Z) =0 is invariant to
constant shifts, we can optimize the two terms in Eq. 7 independently. The first term is trivial: it is
minimized whenµ
′
i
=µi, and henceX
′
i
=
˜
X
′
i
+E[Xi].
Orthogonal projection.We can now rewrite the zero covariance condition as an orthogonality
constraint on
˜
Xi. Specifically, for everyi∈ {1. . . d}we have
argmin
˜
X
′
i
∈H

˜
X
′
i−
˜
Xi

2
H
s.t.∀j∈ {1. . . k}:⟨
˜
X
′
i,
˜
Zj⟩H= 0, (8)
where
˜
Z = Z−E[Z]
. In other words, we seek the nearest
˜
X
′
i
to
˜
Xi
orthogonal toZ=
span({
˜
Z1, . . . ,
˜
Zk})
, which is simply the orthogonal projection of
˜
Xi
ontoZ
⊥. This in turn is
equal to the ordinary least squares residual from regressing
˜
Xon
˜
Z:
˜
X
′
i=
˜
Xi−proj
Γ
˜
Xi,Z
˙
= Xi−(ΣXZ)iΣ
+
ZZ
(Z−E[Z])−E[Xi]. (9)
Putting it all together.Plugging Eq. 9 intoX
′
i
=
˜
X
′
i
+E[Xi] and combining all components into
vector form yields
X
′
LEACE= X−ΣXZΣ
+
ZZ
(Z−E[Z]), (10)
which completes the proof.
J Notation Key
ZThe space of one-hot labels{(z1, . . . zk)∈ {0,1}
k

P
k
j=1
zj= 1}}
(treated interchangeably with the integers{1, . . . , k}when convenient).
X,ZIntegrable (i.e. finite first moment) random vectors taking values inR
d
andR
k
respectively (or their realized values inside an expectation, e.g. inE[f(X)]).
Zis sometimes restricted to the one-hot labelsZ, in which case we assume
eachP(Z =j)>0.
Xi,ZjThei
th
andj
th
components thereof, themselves scalar random variables (or
their realized values inside an expectation).
ξ, ζScalar random variables taking values inR.
ηA predictor functionR
d
→ Z(or its valueη(X)when inside an expectation).
VA space of predictor functions{η(·;θ) :R
d
→R
k
|θ∈Θ}, parameterized
byθand containing all constant functions.
LA space of loss functions{L:R
k
× Z →[0,∞)}.
rAn erasure functionR
d
→R
d
, hopefully making a minimal edit toXthat
eliminates the ability to predict labelsZwith predictors inV.
AA matrix with entries inR.
AijThe entry thereof at thei
th
row andj
th
column.
A
+
The Moore-Penrose pseudoinverse ofA.
vA column vector with entries inR.
viThei
th
component thereof.
21