FaceNet: A Unified Embedding for Face Recognition and Clustering
Florian Schroff
fschroff@google.com
Google Inc.
Dmitry Kalenichenko
dkalenichenko@google.com
Google Inc.
James Philbin
jphilbin@google.com
Google Inc.
Abstract
Despite significant recent advances in the field of face
recognition [10], implementing face verification
and recognition efficiently at scale presents serious chal-
lenges to current approaches. In this paper we present a
system, called FaceNet, that directly learns a mapping from
face images to a compact Euclidean space where distances
directly correspond to a measure of face similarity. Once
this space has been produced, tasks such as face recogni-
tion, verification and clustering can be easily implemented
using standard techniques with FaceNet embeddings as fea-
ture vectors.
Our method uses a deep convolutional network trained
to directly optimize the embedding itself, rather than an in-
termediate bottleneck layer as in previous deep learning
approaches. To train, we use triplets of roughly aligned
matching / non-matching face patches generated using a
novel online triplet mining method. The benefit of our
approach is much greater representational efficiency: we
achieve state-of-the-art face recognition performance using
only 128 bytes per face.
On the widely used Labeled Faces in the Wild (LFW)
dataset, our system achieves a new record accuracy of
99.63%. On YouTube Faces DB it achieves 95.12%. Our
system cuts the error rate in comparison to the best pub-
lished result [15] by 30% on both datasets.
We also introduce the concept of harmonic embeddings,
and a harmonic triplet loss, which describe different ver-
sions of face embeddings (produced by different networks)
that are compatible with each other and allow for direct
comparison.
1. Introduction
In this paper we present a unified system for face verifi-
cation (is this the same person), recognition (who is this
person) and clustering (find common people among these
faces). Our method is based on learning a Euclidean em-
bedding per image using a deep convolutional network. The
network is trained such that the squared L2 distances in
the embedding space directly correspond to face similarity:
[Figure 1 shows face pairs labelled with their output distances (0.78, 0.99, 1.04, 1.22, 1.26, 1.33).]
Figure 1. Illumination and Pose invariance. Pose and illumina-
tion have been a long-standing problem in face recognition. This
figure shows the output distances of FaceNet between pairs of
faces of the same and a different person in different pose and il-
lumination combinations. A distance of 0.0 means the faces are
identical; 4.0 corresponds to the opposite spectrum, two different
identities. You can see that a threshold of 1.1 would classify every
pair correctly.
faces of the same person have small distances and faces of
distinct people have large distances.
Once this embedding has been produced, then the afore-
mentioned tasks become straightforward: face verifica-
tion simply involves thresholding the distance between the
two embeddings; recognition becomes a k-NN classifica-
tion problem; and clustering can be achieved using off-the-
shelf techniques such as k-means or agglomerative cluster-
ing.
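As a concrete illustration, the sketch below shows how these three tasks could be implemented on top of precomputed embeddings. The function names, the 1.1 threshold (borrowed from Figure 1), and the use of scikit-learn are our own assumptions for the sketch, not the paper's implementation.

```python
# Minimal sketch of the three tasks on precomputed, L2-normalized FaceNet-style
# embeddings. Names and library choices are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def verify(emb_a, emb_b, threshold=1.1):
    """Face verification: same identity iff the squared L2 distance is below a threshold."""
    return float(np.sum((emb_a - emb_b) ** 2)) < threshold

def recognize(query, gallery, labels, k=5):
    """Face recognition as k-NN classification in the embedding space."""
    d2 = np.sum((gallery - query) ** 2, axis=1)       # squared distances to all gallery faces
    votes = [labels[i] for i in np.argsort(d2)[:k]]   # identities of the k nearest neighbors
    return max(set(votes), key=votes.count)           # majority vote

def cluster(embeddings, distance_threshold):
    """Face clustering with off-the-shelf agglomerative clustering.

    Note: the threshold here is on the (non-squared) linkage distance.
    """
    model = AgglomerativeClustering(n_clusters=None,
                                    distance_threshold=distance_threshold,
                                    linkage="average")
    return model.fit_predict(embeddings)
```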
Previous face recognition approaches based on deep net-
works use a classification layer [15] trained over a set of
known face identities and then take an intermediate bottle-
neck layer as a representation used to generalize recognition
beyond the set of identities used in training. The downsides
of this approach are its indirectness and its inefficiency: one
has to hope that the bottleneck representation generalizes
well to new faces; and by using a bottleneck layer the rep-
resentation size per face is usually very large (1000s of di-
mensions). Some recent work [15] has reduced this dimen-
sionality using PCA, but this is a linear transformation that
can be easily learnt in one layer of the network.
In contrast to these approaches, FaceNet directly trains
its output to be a compact 128-D embedding using a triplet-
based loss function based on LMNN [19]. Our triplets con-
sist of two matching face thumbnails and a non-matching
face thumbnail and the loss aims to separate the positive pair
from the negative by a distance margin. The thumbnails are
tight crops of the face area; no 2D or 3D alignment, other
than scale and translation, is performed.
Choosing which triplets to use turns out to be very im-
portant for achieving good performance and, inspired by
curriculum learning [1], we present a novel online nega-
tive exemplar mining strategy which ensures consistently
increasing difficulty of triplets as the network trains. To
improve clustering accuracy, we also explore hard-positive
mining techniques which encourage spherical clusters for
the embeddings of a single person.
As an illustration of the incredible variability that our
method can handle, see Figure 1. Shown are image pairs
from PIE [13] that previously were considered to be very
difficult for face verification systems.
An overview of the rest of the paper is as follows: sec-
tion 3.1 defines the triplet loss and section 3.2 describes our
triplet selection and training procedure; section 3.3 describes
the model architectures used. Finally, sections 4 and 5 present
quantitative results of our embed-
dings and also qualitatively explore some clustering results.
2. Related Work
Similarly to other recent works which employ deep net-
works [15], our approach is a purely data-driven method
which learns its representation directly from the pixels of
the face. Rather than using engineered features, we use a
large dataset of labelled faces to attain the appropriate in-
variances to pose, illumination, and other variational condi-
tions.
In this paper we explore two different deep network ar-
chitectures that have been recently used to great success in
the computer vision community. Both are deep convolu-
tional networks [8, 11]. The first architecture is based on the
Zeiler&Fergus [22] model which consists of multiple inter-
leaved layers of convolutions, non-linear activations, local
response normalizations, and max pooling layers. We addi-
tionally add several 1x1xd convolution layers inspired by
the work of [9]. The second architecture is based on the
Inception model of Szegedy et al. which was recently used
as the winning approach for ImageNet 2014 [16]. These
networks use mixed layers that run several different convo-
lutional and pooling layers in parallel and concatenate their
responses. We have found that these models can reduce the
number of parameters by up to 20 times and have the poten-
tial to reduce the number of FLOPS required for comparable
performance.
There is a vast corpus of face verication and recognition
works. Reviewing it is out of the scope of this paper so we
will only briefly discuss the most relevant recent work.
The works of [15, 17, 23] all employ a complex system
of multiple stages that combines the output of a deep con-
volutional network with PCA for dimensionality reduction
and an SVM for classification.
Zhenyao et al. [23] employ a deep network to warp
faces into a canonical frontal view and then learn a CNN that
classifies each face as belonging to a known identity. For
face verification, PCA on the network output in conjunction
with an ensemble of SVMs is used.
Taigman et al. [17] propose a multi-stage approach that
aligns faces to a general 3D shape model. A multi-class net-
work is trained to perform the face recognition task on over
four thousand identities. The authors also experimented
with a so-called Siamese network where they directly opti-
mize the L1-distance between two face features. Their best
performance on LFW (97.35%) stems from an ensemble of
three networks using different alignments and color chan-
nels. The predicted distances (non-linear SVM predictions
based on the $\chi^2$ kernel) of those networks are combined us-
ing a non-linear SVM.
Sun et al. [14, 15] propose a compact and therefore rel-
atively cheap to compute network. They use an ensemble
of 25 of these networks, each operating on a different face
patch. For their final performance on LFW (99.47% [15])
the authors combine 50 responses (regular and flipped).
Both PCA and a Joint Bayesian model [2] that effectively
correspond to a linear transform in the embedding space are
employed. Their method does not require explicit 2D/3D
alignment. The networks are trained by using a combina-
tion of classification and verification loss. The verification
loss is similar to the triplet loss we employ [12, 19], in that it
minimizes the L2-distance between faces of the same iden-
tity and enforces a margin between the distance of faces of
different identities. The main difference is that only pairs of
images are compared, whereas the triplet loss encourages a
relative distance constraint.
A similar loss to the one used here was explored in
Wang et al. [18] for ranking images by semantic and visual
similarity.
[Figure 2 depicts the pipeline: batch input, deep architecture, L2 normalization, embedding, triplet loss.]
Figure 2. Model structure. Our network consists of a batch in-
put layer and a deep CNN followed by L2 normalization, which
results in the face embedding. This is followed by the triplet loss
during training.
[Figure 3 depicts how learning pulls the anchor towards the positive and pushes it away from the negative.]
Figure 3. The Triplet Loss minimizes the distance between an an-
chor and a positive, both of which have the same identity, and
maximizes the distance between the anchor and a negative of a
different identity.
3. Method
FaceNet uses a deep convolutional network. We discuss
two different core architectures: the Zeiler&Fergus [22]
style networks and the recent Inception [16] type networks.
The details of these networks are described in section 3.3.
Given the model details, and treating it as a black box
(see Figure 2), the most important part of our approach lies
in the end-to-end learning of the whole system. To this end
we employ the triplet loss that directly reflects what we want
to achieve in face verification, recognition and clustering.
Namely, we strive for an embedding $f(x)$, from an image
$x$ into a feature space $\mathbb{R}^d$, such that the squared distance
between all faces, independent of imaging conditions, of
the same identity is small, whereas the squared distance be-
tween a pair of face images from different identities is large.
Although we did not directly compare to other losses,
e.g. the one using pairs of positives and negatives, as used
in [14] Eq. (2), we believe that the triplet loss is more suit-
able for face verication. The motivation is that the loss
from [14] encourages all faces of one identity to be pro-
jected onto a single point in the embedding space. The
triplet loss, however, tries to enforce a margin between each
pair of faces from one person to all other faces. This al-
lows the faces for one identity to live on a manifold, while
still enforcing the distance and thus discriminability to other
identities.
The following section describes this triplet loss and how
it can be learned efficiently at scale.
3.1. Triplet Loss
The embedding is represented by $f(x) \in \mathbb{R}^d$. It em-
beds an image $x$ into a $d$-dimensional Euclidean space.
Additionally, we constrain this embedding to live on the
$d$-dimensional hypersphere, i.e. $\|f(x)\|_2 = 1$. This loss is
motivated in [19] in the context of nearest-neighbor classi-
fication. Here we want to ensure that an image $x_i^a$ (anchor) of
a specific person is closer to all other images $x_i^p$ (positive)
of the same person than it is to any image $x_i^n$ (negative) of
any other person. This is visualized in Figure 3.
Thus we want,
$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2, \quad (1)$$
$$\forall \, (f(x_i^a), f(x_i^p), f(x_i^n)) \in \mathcal{T}, \quad (2)$$
where $\alpha$ is a margin that is enforced between positive and
negative pairs. $\mathcal{T}$ is the set of all possible triplets in the
training set and has cardinality $N$.
The loss that is being minimized is then
$$L = \sum_i^N \left[ \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \right]_+ . \quad (3)$$
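As an illustration, Eq. (3) translates directly into a few lines of numpy. The array names and the hinge written via np.maximum are our own reading of the formula rather than the authors' training code.

```python
# Sketch of the triplet loss in Eq. (3). anchor, positive, negative are (N, d)
# arrays of L2-normalized embeddings f(x); alpha is the margin (0.2 in the paper).
# Names and layout are illustrative, not the authors' implementation.
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    pos_d2 = np.sum((anchor - positive) ** 2, axis=1)        # ||f(x_i^a) - f(x_i^p)||_2^2
    neg_d2 = np.sum((anchor - negative) ** 2, axis=1)        # ||f(x_i^a) - f(x_i^n)||_2^2
    return np.sum(np.maximum(pos_d2 - neg_d2 + alpha, 0.0))  # hinge [.]_+ summed over triplets
```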
Generating all possible triplets would result in many
triplets that are easily satisfied (i.e. that fulfill the constraint
in Eq. (1)). These triplets would not contribute to the train-
ing and would result in slower convergence, as they would still
be passed through the network. It is crucial to select hard
triplets that are active and can therefore contribute to im-
proving the model. The following section talks about the
different approaches we use for the triplet selection.
3.2. Triplet Selection
In order to ensure fast convergence it is crucial to select
triplets that violate the triplet constraint in Eq. (1). This
means that, given $x_i^a$, we want to select an $x_i^p$ (hard pos-
itive) such that $\operatorname{argmax}_{x_i^p} \|f(x_i^a) - f(x_i^p)\|_2^2$ and similarly
an $x_i^n$ (hard negative) such that $\operatorname{argmin}_{x_i^n} \|f(x_i^a) - f(x_i^n)\|_2^2$.
It is infeasible to compute the argmin and argmax
across the whole training set. Additionally, it might lead
to poor training, as mislabelled and poorly imaged faces
would dominate the hard positives and negatives. There are
two obvious choices that avoid this issue:
- Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.
- Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.
Here, we focus on the online generation and use large
mini-batches on the order of a few thousand exemplars, and
only compute the argmin and argmax within a mini-batch.
To have a meaningful representation of the anchor-
positive distances, it needs to be ensured that a minimal
number of exemplars of any one identity is present in each
mini-batch. In our experiments we sample the training data
such that around 40 faces are selected per identity per mini-
batch. Additionally, randomly sampled negative faces are
added to each mini-batch.
Instead of picking the hardest positive, we use all anchor-
positive pairs in a mini-batch while still selecting the hard
negatives. We don't have a side-by-side comparison of hard
anchor-positive pairs versus all anchor-positive pairs within
a mini-batch, but we found in practice that the all anchor-
positive method was more stable and converged slightly
faster at the beginning of training.
We also explored the offline generation of triplets in con-
junction with the online generation and it may allow the use
of smaller batch sizes, but the experiments were inconclu-
sive.
Selecting the hardest negatives can in practice lead to bad
local minima early on in training, specifically it can result
in a collapsed model (i.e. $f(x) = 0$). In order to mitigate
this, it helps to select $x_i^n$ such that
$$\|f(x_i^a) - f(x_i^p)\|_2^2 < \|f(x_i^a) - f(x_i^n)\|_2^2 . \quad (4)$$
We call these negative exemplars semi-hard, as they are fur-
ther away from the anchor than the positive exemplar, but
still hard because the squared distance is close to the anchor-
positive distance. Those negatives lie inside the margin.
As mentioned before, correct triplet selection is crucial
for fast convergence. On the one hand we would like to use
small mini-batches as these tend to improve convergence
during Stochastic Gradient Descent (SGD) [20]. On the
other hand, implementation details make batches of tens to
hundreds of exemplars more efficient. The main constraint
with regards to the batch size, however, is the way we select
hard relevant triplets from within the mini-batches. In most
experiments we use a batch size of around 1,800 exemplars.
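A small sketch of how this online selection could look is given below. Everything in it (the names, the brute-force pairwise distance matrix, picking the hardest of the semi-hard negatives) is an illustrative assumption consistent with Eq. (4), not the paper's pipeline.

```python
# Sketch of online triplet selection within a mini-batch: all anchor-positive
# pairs are used, and for each pair a semi-hard negative satisfying Eq. (4)
# while lying inside the margin is chosen. Illustrative, not the paper's code.
import numpy as np

def select_triplets(embeddings, labels, alpha=0.2):
    """embeddings: (B, d) L2-normalized; labels: (B,) identity ids."""
    d2 = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=-1)
    triplets = []
    for a in range(len(labels)):
        for p in np.where(labels == labels[a])[0]:
            if p == a:
                continue
            neg = labels != labels[a]
            # semi-hard: farther from the anchor than the positive, but inside the margin
            semi_hard = neg & (d2[a] > d2[a, p]) & (d2[a] < d2[a, p] + alpha)
            candidates = np.where(semi_hard)[0]
            if candidates.size:
                n = candidates[np.argmin(d2[a, candidates])]  # hardest of the semi-hard
                triplets.append((a, int(p), int(n)))
    return triplets
```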
3.3. Deep Convolutional Networks
In all our experiments we train the CNN using Stochastic
Gradient Descent (SGD) with standard backprop [8, 11] and
AdaGrad [5]. In most experiments we start with a learning
rate of 0.05 which we lower to finalize the model. The mod-
els are initialized from random, similar to [16], and trained
on a CPU cluster for 1,000 to 2,000 hours. The decrease in
the loss (and increase in accuracy) slows down drastically
after 500h of training, but additional training can still sig-
nificantly improve performance. The margin $\alpha$ is set to 0.2.
We used two types of architectures and explore their
trade-offs in more detail in the experimental section. Their
practical differences lie in their number of parameters and
FLOPS. The best model may be different depending on the
application. E.g. a model running in a datacenter can have
many parameters and require a large number of FLOPS,
whereas a model running on a mobile phone needs to have
few parameters, so that it can fit into memory. All our
layer | size-in | size-out | kernel | param | FLPS
conv1 | 220x220x3 | 110x110x64 | 7x7x3, 2 | 9K | 115M
pool1 | 110x110x64 | 55x55x64 | 3x3x64, 2 | 0 |
rnorm1 | 55x55x64 | 55x55x64 | | 0 |
conv2a | 55x55x64 | 55x55x64 | 1x1x64, 1 | 4K | 13M
conv2 | 55x55x64 | 55x55x192 | 3x3x64, 1 | 111K | 335M
rnorm2 | 55x55x192 | 55x55x192 | | 0 |
pool2 | 55x55x192 | 28x28x192 | 3x3x192, 2 | 0 |
conv3a | 28x28x192 | 28x28x192 | 1x1x192, 1 | 37K | 29M
conv3 | 28x28x192 | 28x28x384 | 3x3x192, 1 | 664K | 521M
pool3 | 28x28x384 | 14x14x384 | 3x3x384, 2 | 0 |
conv4a | 14x14x384 | 14x14x384 | 1x1x384, 1 | 148K | 29M
conv4 | 14x14x384 | 14x14x256 | 3x3x384, 1 | 885K | 173M
conv5a | 14x14x256 | 14x14x256 | 1x1x256, 1 | 66K | 13M
conv5 | 14x14x256 | 14x14x256 | 3x3x256, 1 | 590K | 116M
conv6a | 14x14x256 | 14x14x256 | 1x1x256, 1 | 66K | 13M
conv6 | 14x14x256 | 14x14x256 | 3x3x256, 1 | 590K | 116M
pool4 | 14x14x256 | 7x7x256 | 3x3x256, 2 | 0 |
concat | 7x7x256 | 7x7x256 | | 0 |
fc1 | 7x7x256 | 1x32x128 | maxout p=2 | 103M | 103M
fc2 | 1x32x128 | 1x32x128 | maxout p=2 | 34M | 34M
fc7128 | 1x32x128 | 1x1x128 | | 524K | 0.5M
L2 | 1x1x128 | 1x1x128 | | 0 |
total | | | | 140M | 1.6B
Table 1. NN1. This table shows the structure of our
Zeiler&Fergus [22] based model with 1x1 convolutions in-
spired by [9]. The input and output sizes are described
in rows x cols x #filters. The kernel is specified as
rows x cols, stride, and the maxout [6] pooling size as p = 2.
models use rectified linear units as the non-linear activation
function.
The first category, shown in Table 1, adds 1x1xd con-
volutional layers, as suggested in [9], between the standard
convolutional layers of the Zeiler&Fergus [22] architecture
and results in a model 22 layers deep. It has a total of 140
million parameters and requires around 1.6 billion FLOPS
per image.
The second category we use is based on GoogLeNet
style Inception models [16]. These models have 20x fewer
parameters (around 6.6M-7.5M) and up to 5x fewer FLOPS
(between 500M and 1.6B). Some of these models are dramati-
cally reduced in size (both depth and number of filters), so
that they can be run on a mobile phone. One, NNS1, has
26M parameters and only requires 220M FLOPS per image.
The other, NNS2, has 4.3M parameters and 20M FLOPS.
Table 2 details the NN2 Inception model. NN3
is identical in architecture but has a reduced input size of
160x160. NN4 has an input size of only 96x96, thereby
drastically reducing the CPU requirements (285M FLOPS
vs 1.6B for NN2). In addition to the reduced input size it
does not use 5x5 convolutions in the higher layers as the
receptive field is already too small by then. Generally we
found that the 5x5 convolutions can be removed throughout
with only a minor drop in accuracy. Figure 4 compares all
our models.
4. Datasets and Evaluation
We evaluate our method on four datasets and, with the ex-
ception of Labelled Faces in the Wild and YouTube Faces,
we evaluate our method on the face verification task. I.e.
given a pair of two face images, a squared L2 distance
threshold $D(x_i, x_j)$ is used to determine the classification
of same and different. All face pairs $(i, j)$ of the same iden-
tity are denoted with $\mathcal{P}_{\mathrm{same}}$, whereas all pairs of different
identities are denoted with $\mathcal{P}_{\mathrm{diff}}$.
We define the set of all true accepts as
$$\mathrm{TA}(d) = \{(i, j) \in \mathcal{P}_{\mathrm{same}}, \ \text{with} \ D(x_i, x_j) \leq d\}. \quad (5)$$
These are the face pairs $(i, j)$ that were correctly classified
as same at threshold $d$. Similarly,
$$\mathrm{FA}(d) = \{(i, j) \in \mathcal{P}_{\mathrm{diff}}, \ \text{with} \ D(x_i, x_j) \leq d\} \quad (6)$$
is the set of all pairs that were incorrectly classified as same
(false accepts).
The validation rate $\mathrm{VAL}(d)$ and the false accept rate
$\mathrm{FAR}(d)$ for a given face distance $d$ are then defined as
$$\mathrm{VAL}(d) = \frac{|\mathrm{TA}(d)|}{|\mathcal{P}_{\mathrm{same}}|}, \quad \mathrm{FAR}(d) = \frac{|\mathrm{FA}(d)|}{|\mathcal{P}_{\mathrm{diff}}|}. \quad (7)$$
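The metrics in Eqs. (5)-(7) translate directly into code. In the sketch below, representing P_same and P_diff as explicit lists of index pairs is an assumption about the data layout made only for illustration.

```python
# Sketch of VAL(d) and FAR(d) from Eqs. (5)-(7). p_same / p_diff are lists of
# index pairs (i, j); their explicit-list representation is an assumption.
import numpy as np

def val_far(embeddings, p_same, p_diff, d):
    def dist2(i, j):
        return float(np.sum((embeddings[i] - embeddings[j]) ** 2))  # D(x_i, x_j)
    ta = sum(dist2(i, j) <= d for i, j in p_same)   # true accepts, Eq. (5)
    fa = sum(dist2(i, j) <= d for i, j in p_diff)   # false accepts, Eq. (6)
    return ta / len(p_same), fa / len(p_diff)       # VAL(d), FAR(d), Eq. (7)
```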
4.1. Holdout Test Set
We keep a hold-out set of around one million images
that has the same distribution as our training set, but dis-
joint identities. For evaluation we split it into five disjoint
sets of 200k images each. The FAR and VAL rate are then
computed on 100k x 100k image pairs. Standard error is
reported across the five splits.
4.2. Personal Photos
This is a test set with similar distribution to our training
set, but it has been manually verified to have very clean labels.
It consists of three personal photo collections with a total of
around 12k images. We compute the FAR and VAL rate
across all 12k squared pairs of images.
4.3. Academic Datasets
Labeled Faces in the Wild (LFW) is the de-facto aca-
demic test set for face verification [7]. We follow the stan-
dard protocol for unrestricted, labeled outside data and re-
port the mean classification accuracy as well as the standard
error of the mean.
Youtube Faces DB [21] is a new dataset that has gained
popularity in the face recognition community [17, 15]. The
setup is similar to LFW, but instead of verifying pairs of
images, pairs of videos are used.
[Figure 4 plots MultiAdd (FLOPS, 10M to 1B) against VAL @ 10E-3 FAR (20% to 90%), with the models NNS2, NNS1, NN4, NN3, NN1 and NN2 marked.]
Figure 4. FLOPS vs. Accuracy trade-off. Shown is the trade-off
between FLOPS and accuracy for a wide range of different model
sizes and architectures. Highlighted are the four models that we
focus on in our experiments.
5. Experiments
Unless mentioned otherwise we use between 100M and 200M
training face thumbnails consisting of about 8M different
identities. A face detector is run on each image and a tight
bounding box around each face is generated. These face
thumbnails are resized to the input size of the respective
network. Input sizes range from 96x96 pixels to 224x224
pixels in our experiments.
5.1. Computation Accuracy Tradeoff
Before diving into the details of more specific experi-
ments we will discuss the trade-off of accuracy versus the num-
ber of FLOPS that a particular model requires. Figure 4
shows the FLOPS on the x-axis and the accuracy at 0.001
false accept rate (FAR) on our user-labelled test data set
from section 4.2. It is interesting to see the strong corre-
lation between the computation a model requires and the
accuracy it achieves. The figure highlights the five models
(NN1, NN2, NN3, NNS1, NNS2) that we discuss in more
detail in our experiments.
We also looked into the accuracy trade-off with regards
to the number of model parameters. However, the picture
is not as clear in that case. For example, the Inception
based model NN2 achieves a comparable performance to
NN1, but only has a 20th of the parameters. The number
of FLOPS is comparable, though. Obviously at some point
the performance is expected to decrease, if the number of
parameters is reduced further. Other model architectures
may allow further reductions without loss of accuracy, just
like Inception [16] did in this case.
type | output size | depth | #1x1 | #3x3 reduce | #3x3 | #5x5 reduce | #5x5 | pool proj (p) | params | FLOPS
conv1 (7x7x3, 2) | 112x112x64 | 1 | | | | | | | 9K | 119M
max pool + norm | 56x56x64 | 0 | | | | | | m 3x3, 2 | |
inception (2) | 56x56x192 | 2 | | 64 | 192 | | | | 115K | 360M
norm + max pool | 28x28x192 | 0 | | | | | | m 3x3, 2 | |
inception (3a) | 28x28x256 | 2 | 64 | 96 | 128 | 16 | 32 | m, 32p | 164K | 128M
inception (3b) | 28x28x320 | 2 | 64 | 96 | 128 | 32 | 64 | L2, 64p | 228K | 179M
inception (3c) | 14x14x640 | 2 | 0 | 128 | 256,2 | 32 | 64,2 | m 3x3, 2 | 398K | 108M
inception (4a) | 14x14x640 | 2 | 256 | 96 | 192 | 32 | 64 | L2, 128p | 545K | 107M
inception (4b) | 14x14x640 | 2 | 224 | 112 | 224 | 32 | 64 | L2, 128p | 595K | 117M
inception (4c) | 14x14x640 | 2 | 192 | 128 | 256 | 32 | 64 | L2, 128p | 654K | 128M
inception (4d) | 14x14x640 | 2 | 160 | 144 | 288 | 32 | 64 | L2, 128p | 722K | 142M
inception (4e) | 7x7x1024 | 2 | 0 | 160 | 256,2 | 64 | 128,2 | m 3x3, 2 | 717K | 56M
inception (5a) | 7x7x1024 | 2 | 384 | 192 | 384 | 48 | 128 | L2, 128p | 1.6M | 78M
inception (5b) | 7x7x1024 | 2 | 384 | 192 | 384 | 48 | 128 | m, 128p | 1.6M | 78M
avg pool | 1x1x1024 | 0 | | | | | | | |
fully conn | 1x1x128 | 1 | | | | | | | 131K | 0.1M
L2 normalization | 1x1x128 | 0 | | | | | | | |
total | | | | | | | | | 7.5M | 1.6B
Table 2. NN2. Details of the NN2 Inception incarnation. This model is almost identical to the one described in [16]. The two major
differences are the use of L2 pooling instead of max pooling (m), where specified. I.e. instead of taking the spatial max the L2 norm
is computed. The pooling is always 3x3 (aside from the final average pooling) and in parallel to the convolutional modules inside each
Inception module. If there is a dimensionality reduction after the pooling it is denoted with p. 1x1, 3x3, and 5x5 pooling are then
concatenated to get the final output.
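The L2 pooling noted in Table 2 simply replaces the spatial max by the L2 norm over each pooling window. The toy sketch below illustrates this for a single channel; the naive loops and names are illustrative, not the paper's implementation.

```python
# Toy sketch of L2 pooling: instead of the spatial max, the L2 norm of the
# activations inside each window is computed. Illustrative only.
import numpy as np

def l2_pool(x, size=3, stride=1):
    """x: (H, W) activations of one channel; returns the L2-pooled map."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.zeros((h, w), dtype=x.dtype)
    for i in range(h):
        for j in range(w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = np.sqrt(np.sum(window ** 2))   # L2 norm instead of max
    return out
```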
Figure 5. Network Architectures. This plot shows the com-
plete ROC for the four different models on our personal pho-
tos test set from section 4.2. The sharp drop at 10E-4 FAR
can be explained by noise in the groundtruth labels. The mod-
els in order of performance are: NN2: 224x224 input Inception
based model; NN1: Zeiler&Fergus based network with 1x1 con-
volutions; NNS1: small Inception style model with only 220M
FLOPS; NNS2: tiny Inception model with only 20M FLOPS.
architecture | VAL
NN1 (Zeiler&Fergus 220x220) | 87.9% +- 1.9
NN2 (Inception 224x224) | 89.4% +- 1.6
NN3 (Inception 160x160) | 88.3% +- 1.7
NN4 (Inception 96x96) | 82.0% +- 2.3
NNS1 (mini Inception 165x165) | 82.4% +- 2.4
NNS2 (tiny Inception 140x116) | 51.9% +- 2.9
Table 3. Network Architectures. This table compares the per-
formance of our model architectures on the hold-out test set (see
section 4.1). Reported is the mean validation rate VAL at 10E-3
false accept rate. Also shown is the standard error of the mean
across the five test splits.
5.2. Effect of CNN Model
We now discuss the performance of our four selected
models in more detail. On the one hand we have our tradi-
tional Zeiler&Fergus based architecture with 1x1 convolu-
tions [22, 9] (see Table 1). On the other hand we have Incep-
tion [16] based models that dramatically reduce the model
size. Overall, in the final performance the top models of
both architectures perform comparably. However, some of
our Inception based models, such as NN3, still achieve good
performance while significantly reducing both the FLOPS
and the model size.
The detailed evaluation on our personal photos test set is
jpeg q | val-rate
10 | 67.3%
20 | 81.4%
30 | 83.9%
50 | 85.5%
70 | 86.1%
90 | 86.5%

#pixels | val-rate
1,600 | 37.8%
6,400 | 79.5%
14,400 | 84.5%
25,600 | 85.7%
65,536 | 86.4%
Table 4. Image Quality. The first table shows the effect on
the validation rate at 10E-3 precision with varying JPEG quality.
The second shows how the image size in pixels affects the
validation rate at 10E-3 precision. This experiment was done with
NN1 on the first split of our test hold-out dataset.
#dims | VAL
64 | 86.8% +- 1.7
128 | 87.9% +- 1.9
256 | 87.7% +- 1.9
512 | 85.6% +- 2.0
Table 5. Embedding Dimensionality. This table compares the
effect of the embedding dimensionality of our model NN1 on our
hold-out set from section 4.1. In addition to the VAL at 10E-3
we also show the standard error of the mean computed across five
splits.
shown in Figure 5. While the largest model achieves a dra-
matic improvement in accuracy compared to the tiny NNS2,
the latter can be run in 30ms per image on a mobile phone and
is still accurate enough to be used in face clustering. The
sharp drop in the ROC for FAR < 10E-4 indicates noisy
labels in the test data groundtruth. At extremely low false
accept rates a single mislabeled image can have a significant
impact on the curve.
5.3. Sensitivity to Image Quality
Table 4 shows the robustness of our model across JPEG
qualities and a range of image sizes. The network is surprisingly robust
with respect to JPEG compression and performs very well
down to a JPEG quality of 20. The performance drop is
very small for face thumbnails down to a size of 120x120
pixels and even at 80x80 pixels it shows acceptable perfor-
mance. This is notable, because the network was trained on
220x220 input images. Training with lower resolution faces
could improve this range further.
5.4. Embedding Dimensionality
We explored various embedding dimensionalities and se-
lected 128 for all experiments other than the comparison re-
ported in Table. One would expect the larger embeddings
to perform at least as good as the smaller ones, however, it is
possible that they require more training to achieve the same
accuracy. That said, the differences in the performance re-
#training images | VAL
2,600,000 | 76.3%
26,000,000 | 85.1%
52,000,000 | 85.1%
260,000,000 | 86.2%
Table 6. Training Data Size. This table compares the performance
after 700h of training for a smaller model with 96x96 pixel inputs.
The model architecture is similar to NN2, but without the 5x5 con-
volutions in the Inception modules.
ported in Table 5 are statistically insignificant.
It should be noted that during training a 128 dimensional
float vector is used, but it can be quantized to 128 bytes
without loss of accuracy. Thus each face is compactly rep-
resented by a 128 dimensional byte vector, which is ideal
for large scale clustering and recognition. Smaller embed-
dings are possible at a minor loss of accuracy and could be
employed on mobile devices.
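The paper only states that the 128-D float vector can be quantized to 128 bytes without loss of accuracy; the linear per-dimension scheme below is one plausible way to do it and is purely our own assumption, not the method used by the authors.

```python
# Illustrative 128-byte quantization: one uint8 per embedding dimension with a
# fixed linear scale. The scheme is an assumption; the paper does not specify it.
import numpy as np

def quantize(embedding):
    """Map an L2-normalized embedding (each dim in [-1, 1]) to one uint8 per dim."""
    return np.clip(np.round((embedding + 1.0) * 127.5), 0, 255).astype(np.uint8)

def dequantize(q):
    return q.astype(np.float32) / 127.5 - 1.0
```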
5.5. Amount of Training Data
Table 6 shows the effect of large amounts of training
data. Due to time constraints this evaluation was run on a
smaller model; the effect may be even larger on larger mod-
els. It is clear that using tens of millions of exemplars results
in a clear boost of accuracy on our personal photo test set
from section 4.2. Compared to only millions of images the
relative reduction in error is 60%. Using another order of
magnitude more images (hundreds of millions) still gives a
small boost, but the improvement tapers off.
5.6. Performance on LFW
We evaluate our model on LFW using the standard pro-
tocol forunrestricted, labeled outside data. Nine training
splits are used to select theL2-distance threshold. Classi-
cation (sameordifferent) is then performed on the tenth
test split. The selected optimal threshold is1:242for all test
splits except split eighth (1:256).
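A hedged sketch of this cross-validated threshold selection is given below; the candidate grid and helper names are assumptions made for illustration, not the authors' code.

```python
# Sketch of the LFW protocol's threshold selection: sweep candidate squared-L2
# thresholds on the nine training splits and keep the most accurate one.
import numpy as np

def best_threshold(train_dist2, train_is_same, candidates=np.arange(0.0, 4.0, 0.001)):
    """train_dist2: squared distances of training pairs; train_is_same: boolean labels."""
    accs = [np.mean((train_dist2 <= t) == train_is_same) for t in candidates]
    return float(candidates[int(np.argmax(accs))])

# The chosen threshold is then applied unchanged to the held-out tenth split.
```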
Our model is evaluated in two modes:
1. Fixed center crop of the LFW-provided thumbnail.
2. A proprietary face detector (similar to Picasa [3]) is run
on the provided LFW thumbnails. If it fails to align the
face (this happens for two images), the LFW alignment
is used.
Figure 6 gives an overview of all failure cases. It shows
false accepts on the top as well as false rejects at the bot-
tom. We achieve a classification accuracy of 98.87% +- 0.15
when using the fixed center crop described in (1) and the
record-breaking 99.63% +- 0.09 standard error of the mean
when using the extra face alignment (2). This reduces the
error reported for DeepFace in [17] by more than a factor
[Figure 6 shows the false accept pairs (top) and the false reject pairs (bottom).]
Figure 6. LFW errors. This shows all pairs of images that were
incorrectly classified on LFW. Only eight of the 13 false rejects
shown here are actual errors; the other five are mislabeled in LFW.
of 7 and the previous state-of-the-art reported for DeepId2+
in [15] by 30%. This is the performance of model NN1, but
even the much smaller NN3 achieves performance that is
not statistically significantly different.
5.7. Performance on Youtube Faces DB
We use the average similarity of all pairs of the first one
hundred frames that our face detector detects in each video.
This gives us a classification accuracy of 95.12% +- 0.39.
Using the first one thousand frames results in 95.18%.
Compared to [17] (91.4%), who also evaluate one hundred
frames per video, we reduce the error rate by almost half.
DeepId2+ [15] achieved 93.2% and our method reduces this
error by 30%, comparable to our improvement on LFW.
5.8. Face Clustering
Our compact embedding lends itself to be used in order
to cluster a user's personal photos into groups of people with
the same identity. The constraints in assignment imposed
by clustering faces, compared to the pure verification task,
Figure 7. Face Clustering. Shown is an exemplar cluster for one
user. All these images in the user's personal photo collection were
clustered together.
lead to truly amazing results. Figure 7 shows one cluster in
a user's personal photo collection, generated using agglom-
erative clustering. It is a clear showcase of the incredible
invariance to occlusion, lighting, pose and even age.
6. Summary
We provide a method to directly learn an embedding into
a Euclidean space for face verification. This sets it apart
from other methods [15, 17] which use the CNN bottleneck
layer, or require additional post-processing such as concate-
[Figure 8 plots VAL against FAR (10E-8 to 10E0) for NN2, NN1, and NN2 compared to NN1.]
Figure 8. Harmonic Embedding Compatibility. These ROCs
show the compatibility of the harmonic embeddings of NN2 to
the embeddings of NN1. NN2 is an improved model that performs
much better than NN1. When comparing embeddings generated
by NN1 to the harmonic ones generated by NN2 we can see the
compatibility between the two. In fact, the mixed mode perfor-
mance is still better than NN1 by itself.
nation of multiple models and PCA, as well as SVM clas-
sification. Our end-to-end training both simplifies the setup
and shows that directly optimizing a loss relevant to the task
at hand improves performance.
Another strength of our model is that it only requires
minimal alignment (tight crop around the face area). [17],
for example, performs a complex 3D alignment. We also
experimented with a similarity transform alignment and no-
tice that this can actually improve performance slightly. It
is not clear if it is worth the extra complexity.
Future work will focus on better understanding of the
error cases, further improving the model, and also reduc-
ing model size and reducing CPU requirements. We will
also look into ways of improving the currently extremely
long training times, e.g. variations of our curriculum learn-
ing with smaller batch sizes and offline as well as online
positive and negative mining.
7. Appendix: Harmonic Embedding
In this section we introduce the concept of harmonic em-
beddings. By this we denote a set of embeddings that are
generated by different models v1 and v2 but are compatible
in the sense that they can be compared to each other.
This compatibility greatly simplifies upgrade paths.
E.g. in a scenario where embedding v1 was computed
across a large set of images and a new embedding model
v2 is being rolled out, this compatibility ensures a smooth
transition without the need to worry about version incom-
patibilities. Figure
be seen that the improved model NN2 signicantly outper- v7 template v8 template
Anchor
Semi-hard
Negative
Positive
Anchor
Semi-hard
Negative
Positive
Anchor
Semi-hard
Negative
Positive
Figure 9.Learning the Harmonic Embedding.In order to learn
aharmonicembedding, we generate triplets that mix the v1 em-
beddings with the v2 embeddings that are being trained. The semi-
hard negatives are selected from the whole set of both v1 and v2
embeddings.
forms NN1, while the comparison of NN2 embeddings to
NN1 embeddings performs at an intermediate level.
7.1. Harmonic Triplet Loss
In order to learn the harmonic embedding we mix em-
beddings of v1 together with the embeddings v2 that are
being learned. This is done inside the triplet loss and results
in additionally generated triplets that encourage the com-
patibility between the different embedding versions. Fig-
ure 9 visualizes the different triplet combinations that
contribute to the triplet loss.
We initialized the v2 embedding from an independently
trained NN2 and retrained the last layer (embedding layer)
from random initialization with the compatibility encourag-
ing triplet loss. First only the last layer is retrained, then we
continue training the whole v2 network with the harmonic
loss.
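A hedged sketch of how such mixed triplets could be generated is shown below: anchors come from the v2 embeddings being trained, while positives and semi-hard negatives are drawn from the pooled set of v1 and v2 embeddings, as in Figure 9. The concrete selection rule is our own illustrative reading, not the authors' recipe.

```python
# Illustrative sketch of harmonic triplet generation (cf. Figure 9): anchors are
# v2 embeddings being trained; positives and semi-hard negatives come from the
# pooled v1+v2 embeddings. The selection rule is an assumption.
import numpy as np

def harmonic_triplets(v2_emb, v1_emb, labels, alpha=0.2):
    pool = np.concatenate([v2_emb, v1_emb])          # mix both embedding versions
    pool_labels = np.concatenate([labels, labels])
    triplets = []
    for a in range(len(v2_emb)):                     # anchor: index into the v2 embeddings
        positives = np.where(pool_labels == labels[a])[0]
        negatives = np.where(pool_labels != labels[a])[0]
        d_neg = np.sum((pool[negatives] - v2_emb[a]) ** 2, axis=1)
        for p in positives:
            if p == a:
                continue
            d_ap = np.sum((pool[p] - v2_emb[a]) ** 2)
            # semi-hard: farther than the positive, but still inside the margin
            semi = negatives[(d_neg > d_ap) & (d_neg < d_ap + alpha)]
            if semi.size:
                triplets.append((a, int(p), int(semi[0])))  # (anchor, positive, negative)
    return triplets
```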
Figure 10 shows a possible interpretation of how this
compatibility may work in practice. The vast majority of
v2 embeddings may be embedded near the corresponding
v1 embedding; however, incorrectly placed v1 embeddings
can be perturbed slightly such that their new location in em-
bedding space improves verification accuracy.
7.2. Summary
These are very interesting findings and it is somewhat
surprising that it works so well. Future work can explore
how far this idea can be extended. Presumably there is a
limit as to how much the v2 embedding can improve over
v1, while still being compatible. Additionally it would be
interesting to train small networks that can run on a mobile
phone and are compatible with a larger server side model.
[Figure 10 sketches the harmonic embedding space with v7-template and v8-template embeddings of the same faces.]
Figure 10. Harmonic Embedding Space. This visualisation
sketches a possible interpretation of how harmonic embeddings
are able to improve verification accuracy while maintaining com-
patibility to less accurate embeddings. In this scenario there is one
misclassified face, whose embedding is perturbed to the correct
location in v2.
Acknowledgments
We would like to thank Johannes Steffens for his discus-
sions and great insights on face recognition and Christian
Szegedy for providing new network architectures like [16]
and discussing network design choices. Also we are in-
debted to the DistBelief [4] team for their support espe-
cially to Rajat Monga for help in setting up efficient training
schemes.
Also our work would not have been possible without the
support of Chuck Rosenberg, Hartwig Adam, and Simon
Han.
References
[1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proc. of ICML, New York, NY, USA, 2009.
[2] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In Proc. ECCV, 2012.
[3] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. In Proc. ECCV, 2014.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, NIPS, pages 1232-1240, 2012.
[5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121-2159, July 2011.
[6] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
[7] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
[8] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, Dec. 1989.
[9] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
[10] C. Lu and X. Tang. Surpassing human-level face verification performance on LFW with GaussianFace. CoRR, abs/1404.3840, 2014.
[11] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986.
[12] M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In S. Thrun, L. Saul, and B. Schölkopf, editors, NIPS, pages 41-48. MIT Press, 2004.
[13] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression (PIE) database. In Proc. FG, 2002.
[14] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. CoRR, abs/1406.4773, 2014.
[15] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. CoRR, abs/1412.1265, 2014.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[17] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In IEEE Conf. on CVPR, 2014.
[18] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. CoRR, abs/1404.4661, 2014.
[19] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS. MIT Press, 2006.
[20] D. R. Wilson and T. R. Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(10):1429-1451, 2003.
[21] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In IEEE Conf. on CVPR, 2011.
[22] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013.
[23] Z. Zhu, P. Luo, X. Wang, and X. Tang. Recovering canonical-view faces in the wild with deep neural networks. CoRR, abs/1404.3543, 2014.
Florian Schroff
fschroff@google.com
Google Inc.
Dmitry Kalenichenko
dkalenichenko@google.com
Google Inc.
James Philbin
jphilbin@google.com
Google Inc.
Abstract
Despite signicant recent advances in the eld of face
recognition [10,,,], implementing face verication
and recognition efciently at scale presents serious chal-
lenges to current approaches. In this paper we present a
system, called FaceNet, that directly learns a mapping from
face images to a compact Euclidean space where distances
directly correspond to a measure of face similarity. Once
this space has been produced, tasks such as face recogni-
tion, verication and clustering can be easily implemented
using standard techniques with FaceNet embeddings as fea-
ture vectors.
Our method uses a deep convolutional network trained
to directly optimize the embedding itself, rather than an in-
termediate bottleneck layer as in previous deep learning
approaches. To train, we use triplets of roughly aligned
matching / non-matching face patches generated using a
novel online triplet mining method. The benet of our
approach is much greater representational efciency: we
achieve state-of-the-art face recognition performance using
only 128-bytes per face.
On the widely used Labeled Faces in the Wild (LFW)
dataset, our system achieves a new record accuracy of
99.63%. On YouTube Faces DB it achieves95.12%. Our
system cuts the error rate in comparison to the best pub-
lished result [15] by 30% on both datasets.
We also introduce the concept ofharmonicembeddings,
and aharmonictriplet loss, which describe different ver-
sions of face embeddings (produced by different networks)
that are compatible to each other and allow for direct com-
parison between each other.
1. Introduction
In this paper we present a unied system for face veri-
cation (is this the same person), recognition (who is this
person) and clustering (nd common people among these
faces). Our method is based on learning a Euclidean em-
bedding per image using a deep convolutional network. The
network is trained such that the squared L2 distances in
the embedding space directly correspond to face similarity:
1.04
1.22 1.33
0.78
1.33 1.26
0.99
Figure 1.Illumination and Pose invariance.Pose and illumina-
tion have been a long standing problem in face recognition. This
gure shows the output distances of FaceNet between pairs of
faces of the same and a different person in different pose and il-
lumination combinations. A distance of0:0means the faces are
identical,4:0corresponds to the opposite spectrum, two different
identities. You can see that a threshold of 1.1 would classify every
pair correctly.
faces of the same person have small distances and faces of
distinct people have large distances.
Once this embedding has been produced, then the afore-
mentioned tasks become straight-forward: face verica-
tion simply involves thresholding the distance between the
two embeddings; recognition becomes a k-NN classica-
tion problem; and clustering can be achieved using off-the-
shelf techniques such as k-means or agglomerative cluster-
ing.
Previous face recognition approaches based on deep net-
works use a classication layer [15,] trained over a set of
known face identities and then take an intermediate bottle-
1
neck layer as a representation used to generalize recognition
beyond the set of identities used in training. The downsides
of this approach are its indirectness and its inefciency: one
has to hope that the bottleneck representation generalizes
well to new faces; and by using a bottleneck layer the rep-
resentation size per face is usually very large (1000s of di-
mensions). Some recent work [15] has reduced this dimen-
sionality using PCA, but this is a linear transformation that
can be easily learnt in one layer of the network.
In contrast to these approaches, FaceNet directly trains
its output to be a compact 128-D embedding using a triplet-
based loss function based on LMNN [19]. Our triplets con-
sist of two matching face thumbnails and a non-matching
face thumbnail and the loss aims to separate the positive pair
from the negative by a distance margin. The thumbnails are
tight crops of the face area, no 2D or 3D alignment, other
than scale and translation is performed.
Choosing which triplets to use turns out to be very im-
portant for achieving good performance and, inspired by
curriculum learning [1], we present a novel online nega-
tive exemplar mining strategy which ensures consistently
increasing difculty of triplets as the network trains. To
improve clustering accuracy, we also explore hard-positive
mining techniques which encourage spherical clusters for
the embeddings of a single person.
As an illustration of the incredible variability that our
method can handle see Figure. Shown are image pairs
from PIE [13] that previously were considered to be very
difcult for face verication systems.
An overview of the rest of the paper is as follows: in
section
denes the triplet loss and section
triplet selection and training procedure; in section
describe the model architecture used. Finally in section
and
dings and also qualitatively explore some clustering results.
2. Related Work
Similarly to other recent works which employ deep net-
works [15,], our approach is a purely data driven method
which learns its representation directly from the pixels of
the face. Rather than using engineered features, we use a
large dataset of labelled faces to attain the appropriate in-
variances to pose, illumination, and other variational condi-
tions.
In this paper we explore two different deep network ar-
chitectures that have been recently used to great success in
the computer vision community. Both are deep convolu-
tional networks [8,]. The rst architecture is based on the
Zeiler&Fergus [22] model which consists of multiple inter-
leaved layers of convolutions, non-linear activations, local
response normalizations, and max pooling layers. We addi-
tionally add several11dconvolution layers inspired by
the work of [9]. The second architecture is based on the
Inceptionmodel of Szegedyet al. which was recently used
as the winning approach for ImageNet 2014 [16]. These
networks use mixed layers that run several different convo-
lutional and pooling layers in parallel and concatenate their
responses. We have found that these models can reduce the
number of parameters by up to 20 times and have the poten-
tial to reduce the number of FLOPS required for comparable
performance.
There is a vast corpus of face verication and recognition
works. Reviewing it is out of the scope of this paper so we
will only briey discuss the most relevant recent work.
The works of [15,,] all employ a complex system
of multiple stages, that combines the output of a deep con-
volutional network with PCA for dimensionality reduction
and an SVM for classication.
Zhenyaoet al. [23] employ a deep network to warp
faces into a canonical frontal view and then learn CNN that
classies each face as belonging to a known identity. For
face verication, PCA on the network output in conjunction
with an ensemble of SVMs is used.
Taigmanet al. [17] propose a multi-stage approach that
aligns faces to a general 3D shape model. A multi-class net-
work is trained to perform the face recognition task on over
four thousand identities. The authors also experimented
with a so called Siamese network where they directly opti-
mize theL1-distance between two face features. Their best
performance on LFW (97:35%) stems from an ensemble of
three networks using different alignments and color chan-
nels. The predicted distances (non-linear SVM predictions
based on the
2
kernel) of those networks are combined us-
ing a non-linear SVM.
Sunet al. [14,] propose a compact and therefore rel-
atively cheap to compute network. They use an ensemble
of 25 of these network, each operating on a different face
patch. For their nal performance on LFW (99:47%[15])
the authors combine 50 responses (regular and ipped).
Both PCA and a Joint Bayesian model [2] that effectively
correspond to a linear transform in the embedding space are
employed. Their method does not require explicit 2D/3D
alignment. The networks are trained by using a combina-
tion of classication and verication loss. The verication
loss is similar to the triplet loss we employ [12,], in that it
minimizes theL2-distance between faces of the same iden-
tity and enforces a margin between the distance of faces of
different identities. The main difference is that only pairs of
images are compared, whereas the triplet loss encourages a
relative distance constraint.
A similar loss to the one used here was explored in
Wanget al. [18] for ranking images by semantic and visual
similarity.
...
Batch
DEEP ARCHITECTURE L2
Triplet
Loss
E
M
B
E
D
D
I
N
G Figure 2.Model structure.Our network consists of a batch in-
put layer and a deep CNN followed byL2normalization, which
results in the face embedding. This is followed by the triplet loss
during training.Anchor
Positive
Negative
Anchor
Positive
Negative
LEARNING
Figure 3. TheTriplet Lossminimizes the distance between anan-
chorand apositive, both of which have the same identity, and
maximizes the distance between theanchorand anegativeof a
different identity.
3. Method
FaceNet uses a deep convolutional network. We discuss
two different core architectures: The Zeiler&Fergus [22]
style networks and the recent Inception [16] type networks.
The details of these networks are described in section.
Given the model details, and treating it as a black box
(see Figure), the most important part of our approach lies
in the end-to-end learning of the whole system. To this end
we employ the triplet loss that directly reects what we want
to achieve in face verication, recognition and clustering.
Namely, we strive for an embeddingf(x), from an image
xinto a feature spaceR
d
, such that the squared distance
betweenallfaces, independent of imaging conditions, of
the same identity is small, whereas the squared distance be-
tween a pair of face images from different identities is large.
Although we did not directly compare to other losses,
e.g. the one using pairs of positives and negatives, as used
in [14] Eq. (2), we believe that the triplet loss is more suit-
able for face verication. The motivation is that the loss
from [14] encourages all faces of one identity to be pro-
jected onto a single point in the embedding space. The
triplet loss, however, tries to enforce a margin between each
pair of faces from one person to all other faces. This al-
lows the faces for one identity to live on a manifold, while
still enforcing the distance and thus discriminability to other
identities.
The following section describes this triplet loss and how
it can be learned efciently at scale.
3.1. Triplet Loss
The embedding is represented byf(x)2R
d
. It em-
beds an imagexinto ad-dimensional Euclidean space.
Additionally, we constrain this embedding to live on the
d-dimensional hypersphere,i.e.kf(x)k2= 1. This loss is
motivated in [19] in the context of nearest-neighbor classi-
cation. Here we want to ensure that an imagex
a
i
(anchor) of
a specic person is closer to all other imagesx
p
i
(positive)
of the same person than it is to any imagex
n
i
(negative) of
any other person. This is visualized in Figure.
Thus we want,
kf(x
a
i) f(x
p
i
)k
2
2+ <kf(x
a
i) f(x
n
i)k
2
2;
(1)
8(f(x
a
i); f(x
p
i
); f(x
n
i))2 T: (2)
whereis a margin that is enforced between positive and
negative pairs.Tis the set of all possible triplets in the
training set and has cardinalityN.
The loss that is being minimized is thenL=
N
X
i
h
kf(x
a
i) f(x
p
i
)k
2
2
kf(x
a
i) f(x
n
i)k
2
2
+
i
+
:
(3)
Generating all possible triplets would result in many
triplets that are easily satised (i.e. fulll the constraint
in Eq. (1)). These triplets would not contribute to the train-
ing and result in slower convergence, as they would still
be passed through the network. It is crucial to select hard
triplets, that are active and can therefore contribute to im-
proving the model. The following section talks about the
different approaches we use for the triplet selection.
3.2. Triplet Selection
In order to ensure fast convergence it is crucial to select
triplets that violate the triplet constraint in Eq. (1). This
means that, givenx
a
i
, we want to select anx
p
i
(hard pos-
itive) such thatargmax
x
p
i
kf(x
a
i
) f(x
p
i
)k
2
2
and similarly
x
n
i
(hard negative) such thatargmin
x
n
i
kf(x
a
i
) f(x
n
i
)k
2
2
.
It is infeasible to compute theargminandargmax
across the whole training set. Additionally, it might lead
to poor training, as mislabelled and poorly imaged faces
would dominate the hard positives and negatives. There are
two obvious choices that avoid this issue:
Generate triplets ofine every n steps, using the most
recent network checkpoint and computing theargmin
andargmaxon a subset of the data.
Generate triplets online. This can be done by select-
ing the hard positive/negative exemplars from within a
mini-batch.
Here, we focus on the online generation and use large
mini-batches in the order of a few thousand exemplars and
only compute theargminandargmaxwithin a mini-batch.
To have a meaningful representation of the anchor-
positive distances, it needs to be ensured that a minimal
number of exemplars of any one identity is present in each
mini-batch. In our experiments we sample the training data
such that around 40 faces are selected per identity per mini-
batch. Additionally, randomly sampled negative faces are
added to each mini-batch.
Instead of picking the hardest positive, we use all anchor-
positive pairs in a mini-batch while still selecting the hard
negatives. We don't have a side-by-side comparison of hard
anchor-positive pairs versus all anchor-positive pairs within
a mini-batch, but we found in practice that the all anchor-
positive method was more stable and converged slightly
faster at the beginning of training.
We also explored the ofine generation of triplets in con-
junction with the online generation and it may allow the use
of smaller batch sizes, but the experiments were inconclu-
sive.
Selecting the hardest negatives can in practice lead to bad
local minima early on in training, specically it can result
in a collapsed model (i.e.f(x) = 0). In order to mitigate
this, it helps to selectx
n
i
such that
kf(x
a
i) f(x
p
i
)k
2
2
<kf(x
a
i) f(x
n
i)k
2
2
:(4)
We call these negative exemplarssemi-hard, as they are fur-
ther away from the anchor than the positive exemplar, but
still hard because the squared distance is close to the anchor-
positive distance. Those negatives lie inside the margin.
As mentioned before, correct triplet selection is crucial
for fast convergence. On the one hand we would like to use
small mini-batches as these tend to improve convergence
during Stochastic Gradient Descent (SGD) [20]. On the
other hand, implementation details make batches of tens to
hundreds of exemplars more efcient. The main constraint
with regards to the batch size, however, is the way we select
hard relevant triplets from within the mini-batches. In most
experiments we use a batch size of around 1,800 exemplars.
3.3. Deep Convolutional Networks
In all our experiments we train the CNN using Stochastic
Gradient Descent (SGD) with standard backprop [8,] and
AdaGrad [5]. In most experiments we start with a learning
rate of0:05which we lower to nalize the model. The mod-
els are initialized from random, similar to [16], and trained
on a CPU cluster for 1,000 to 2,000 hours. The decrease in
the loss (and increase in accuracy) slows down drastically
after 500h of training, but additional training can still sig-
nicantly improve performance. The marginis set to0:2.
We used two types of architectures and explore their
trade-offs in more detail in the experimental section. Their
practical differences lie in the difference of parameters and
FLOPS. The best model may be different depending on the
application.E.g. a model running in a datacenter can have
many parameters and require a large number of FLOPS,
whereas a model running on a mobile phone needs to have
few parameters, so that it can t into memory. All our
layer size-in size-out kernelparamFLPS
conv1220220311011064773;29K115M
pool1110110645555643364;20
rnorm1555564555564 0
conv2a5555645555641164;14K13M
conv255556455551923364;1111K335M
rnorm255551925555192 0
pool25555192282819233192;20
conv3a2828192282819211192;137K29M
conv32828192282838433192;1664K521M
pool32828384141438433384;20
conv4a1414384141438411384;1148K29M
conv41414384141425633384;1885K173M
conv5a1414256141425611256;166K13M
conv51414256141425633256;1590K116M
conv6a1414256141425611256;166K13M
conv61414256141425633256;1590K116M
pool414142567725633256;20
concat7725677256 0
fc1 77256132128maxout p=2103M103M
fc2 132128132128maxout p=234M34M
fc712813212811128 524K0.5M
L2 1112811128 0
total 140M1.6B
Table 1.NN1. This table show the structure of our
Zeiler&Fergus [22] based model with 11convolutions in-
spired by [9]. The input and output sizes are described
inrowscols#filters. The kernel is specied as
rowscols; strideand the maxout [6] pooling size as p= 2.
models use rectied linear units as the non-linear activation
function.
The rst category, shown in Table, adds 11dcon-
volutional layers, as suggested in [9], between the standard
convolutional layers of the Zeiler&Fergus [22] architecture
and results in a model 22 layers deep. It has a total of 140
million parameters and requires around 1.6 billion FLOPS
per image.
The second category we use is based on GoogLeNet
style Inception models [16]. These models have 20fewer
parameters (around 6.6M-7.5M) and up to5fewer FLOPS
(between 500M-1.6B). Some of these models are dramati-
cally reduced in size (both depth and number of lters), so
that they can be run on a mobile phone. One, NNS1, has
26M parameters and only requires 220M FLOPS per image.
The other, NNS2, has 4.3M parameters and 20M FLOPS.
Table
is identical in architecture but has a reduced input size of
160x160. NN4 has an input size of only 96x96, thereby
drastically reducing the CPU requirements (285M FLOPS
vs 1.6B for NN2). In addition to the reduced input size it
does not use 5x5 convolutions in the higher layers as the
receptive eld is already too small by then. Generally we
found that the 5x5 convolutions can be removed throughout
with only a minor drop in accuracy. Figure
our models.
4. Datasets and Evaluation
We evaluate our method on four datasets and, with the exception of
Labelled Faces in the Wild and YouTube Faces, we evaluate our method
on the face verification task. I.e. given a pair of two face images, a
squared L2 distance threshold D(x_i, x_j) is used to determine the
classification of same and different. All face pairs (i, j) of the same
identity are denoted with P_same, whereas all pairs of different
identities are denoted with P_diff.
We define the set of all true accepts as

TA(d) = \{(i, j) \in \mathcal{P}_{\mathrm{same}},\ \text{with}\ D(x_i, x_j) \le d\}.   (5)

These are the face pairs (i, j) that were correctly classified as
same at threshold d. Similarly,

FA(d) = \{(i, j) \in \mathcal{P}_{\mathrm{diff}},\ \text{with}\ D(x_i, x_j) \le d\}   (6)

is the set of all pairs that were incorrectly classified as same
(false accepts).
The validation rate VAL(d) and the false accept rate FAR(d) for a
given face distance d are then defined as

\mathrm{VAL}(d) = \frac{|TA(d)|}{|\mathcal{P}_{\mathrm{same}}|}, \qquad
\mathrm{FAR}(d) = \frac{|FA(d)|}{|\mathcal{P}_{\mathrm{diff}}|}.   (7)
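A small numpy sketch of Eq. (5)-(7), assuming the squared L2 distances for the P_same and P_diff pairs have already been computed (the array and function names are hypothetical):

import numpy as np

def val_far(dist_same, dist_diff, d):
    """Validation rate and false accept rate at distance threshold d,
    following Eq. (5)-(7): pairs with D(x_i, x_j) <= d are called 'same'."""
    ta = np.sum(dist_same <= d)   # true accepts among P_same
    fa = np.sum(dist_diff <= d)   # false accepts among P_diff
    return ta / len(dist_same), fa / len(dist_diff)

# Usage with hypothetical precomputed squared L2 distances:
# val, far = val_far(dist_same, dist_diff, d=1.1)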
4.1. Holdout Test Set
We keep a hold-out set of around one million images that has the same
distribution as our training set, but disjoint identities. For
evaluation we split it into five disjoint sets of 200k images each.
The FAR and VAL rate are then computed on 100k × 100k image pairs.
Standard error is reported across the five splits.
4.2. Personal Photos
This is a test set with a similar distribution to our training set,
but it has been manually verified to have very clean labels. It
consists of three personal photo collections with a total of around
12k images. We compute the FAR and VAL rate across all 12k² pairs of
images.
4.3. Academic Datasets
Labeled Faces in the Wild (LFW) is the de-facto academic test set for
face verification [7]. We follow the standard protocol for
unrestricted, labeled outside data and report the mean classification
accuracy as well as the standard error of the mean.
YouTube Faces DB [21] is a new dataset that has gained popularity in
the face recognition community [17]. The setup is similar to LFW, but
instead of verifying pairs of
images, pairs of videos are used.

[Figure 4: VAL at 10^-3 FAR versus MultiAdd (FLOPS) for the models
NNS2, NNS1, NN4, NN3, NN1, and NN2.]
Figure 4. FLOPS vs. Accuracy trade-off. Shown is the trade-off between
FLOPS and accuracy for a wide range of different model sizes and
architectures. Highlighted are the four models that we focus on in our
experiments.
5. Experiments
Unless mentioned otherwise we use between 100M and 200M training face
thumbnails consisting of about 8M different identities. A face
detector is run on each image and a tight bounding box around each
face is generated. These face thumbnails are resized to the input size
of the respective network. Input sizes range from 96×96 pixels to
224×224 pixels in our experiments.
5.1. Computation Accuracy Tradeoff
Before diving into the details of more specific experiments we discuss
the trade-off of accuracy versus the number of FLOPS that a particular
model requires. Figure 4 shows the FLOPS on the x-axis and the
accuracy at 0.001 false accept rate (FAR) on our user-labelled test
data set from section 4.2. It is interesting to see the strong
correlation between the computation a model requires and the accuracy
it achieves. The figure highlights the five models (NN1, NN2, NN3,
NNS1, NNS2) that we discuss in more detail in our experiments.
We also looked into the accuracy trade-off with regard to the number
of model parameters. However, the picture is not as clear in that
case. For example, the Inception based model NN2 achieves comparable
performance to NN1, but has only a 20th of the parameters. The number
of FLOPS is comparable, though. Obviously at some point the
performance is expected to decrease if the number of parameters is
reduced further. Other model architectures may allow further
reductions without loss of accuracy, just like Inception [16] did in
this case.
type               output size   depth  #1×1  #3×3 red.  #3×3   #5×5 red.  #5×5   pool proj (p)  params  FLOPS
conv1 (7×7×3, 2)   112×112×64    1                                                               9K      119M
max pool + norm    56×56×64      0                                              m 3×3, 2
inception (2)      56×56×192     2             64         192                                    115K    360M
norm + max pool    28×28×192     0                                              m 3×3, 2
inception (3a)     28×28×256     2      64     96         128    16         32   m, 32p          164K    128M
inception (3b)     28×28×320     2      64     96         128    32         64   L2, 64p         228K    179M
inception (3c)     14×14×640     2      0      128        256,2  32         64,2 m 3×3, 2        398K    108M
inception (4a)     14×14×640     2      256    96         192    32         64   L2, 128p        545K    107M
inception (4b)     14×14×640     2      224    112        224    32         64   L2, 128p        595K    117M
inception (4c)     14×14×640     2      192    128        256    32         64   L2, 128p        654K    128M
inception (4d)     14×14×640     2      160    144        288    32         64   L2, 128p        722K    142M
inception (4e)     7×7×1024      2      0      160        256,2  64         128,2 m 3×3, 2       717K    56M
inception (5a)     7×7×1024      2      384    192        384    48         128  L2, 128p        1.6M    78M
inception (5b)     7×7×1024      2      384    192        384    48         128  m, 128p         1.6M    78M
avg pool           1×1×1024      0
fully conn         1×1×128       1                                                               131K    0.1M
L2 normalization   1×1×128       0
total                                                                                            7.5M    1.6B
Table 2. NN2. Details of the NN2 Inception incarnation. This model is
almost identical to the one described in [16]. The two major
differences are the use of L2 pooling instead of max pooling (m),
where specified: instead of taking the spatial max, the L2 norm is
computed. The pooling is always 3×3 (aside from the final average
pooling) and in parallel to the convolutional modules inside each
Inception module. If there is a dimensionality reduction after the
pooling it is denoted with p. 1×1, 3×3, and 5×5 pooling are then
concatenated to get the final output.
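As a rough sketch of the L2 pooling described above, the following applies an L2 norm over each spatial window instead of the spatial max. It assumes a channels-last feature map and no padding; this is illustrative only, not the training code.

import numpy as np

def l2_pool(x, size=3, stride=1):
    """L2 pooling over size x size spatial windows of an (H, W, C) map:
    the L2 norm of each window replaces the spatial max."""
    h, w, c = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w, c))
    for i in range(out_h):
        for j in range(out_w):
            win = x[i * stride:i * stride + size,
                    j * stride:j * stride + size, :]
            out[i, j, :] = np.sqrt(np.sum(win ** 2, axis=(0, 1)))
    return out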
[Figure 5: ROC curves (VAL versus FAR, log scale) for NN2, NN1, NNS1,
and NNS2 on the personal photos test set.]
Figure 5. Network Architectures. This plot shows the complete ROC for
the four different models on our personal photos test set from section
4.2. The sharp drop at 10^-4 FAR can be explained by noise in the
groundtruth labels. The models in order of performance are: NN2:
224×224 input Inception based model; NN1: Zeiler&Fergus based network
with 1×1 convolutions; NNS1: small Inception style model with only
220M FLOPS; NNS2: tiny Inception model with only 20M FLOPS.
architecture                    VAL
NN1 (Zeiler&Fergus 220×220)     87.9% ± 1.9
NN2 (Inception 224×224)         89.4% ± 1.6
NN3 (Inception 160×160)         88.3% ± 1.7
NN4 (Inception 96×96)           82.0% ± 2.3
NNS1 (mini Inception 165×165)   82.4% ± 2.4
NNS2 (tiny Inception 140×116)   51.9% ± 2.9
Table 3. Network Architectures. This table compares the performance of
our model architectures on the hold-out test set (see section 4.1).
Reported is the mean validation rate VAL at 10^-3 false accept rate.
Also shown is the standard error of the mean across the five test
splits.
5.2. Effect of CNN Model
We now discuss the performance of our four selected models in more
detail. On the one hand we have our traditional Zeiler&Fergus based
architecture with 1×1 convolutions [22, 9] (see Table 1). On the other
hand we have Inception [16] based models that dramatically reduce the
model size. Overall, in the final performance the top models of both
architectures perform comparably. However, some of our Inception based
models, such as NN3, still achieve good performance while
significantly reducing both the FLOPS and the model size.
The detailed evaluation on our personal photos test set is
jpeg q   val-rate        #pixels   val-rate
10       67.3%           1,600     37.8%
20       81.4%           6,400     79.5%
30       83.9%           14,400    84.5%
50       85.5%           25,600    85.7%
70       86.1%           65,536    86.4%
90       86.5%
Table 4. Image Quality. The table on the left shows the effect on the
validation rate at 10^-3 precision with varying JPEG quality. The one
on the right shows how the image size in pixels affects the validation
rate at 10^-3 precision. This experiment was done with NN1 on the
first split of our test hold-out dataset.
#dims   VAL
64      86.8% ± 1.7
128     87.9% ± 1.9
256     87.7% ± 1.9
512     85.6% ± 2.0
Table 5. Embedding Dimensionality. This table compares the effect of
the embedding dimensionality of our model NN1 on our hold-out set from
section 4.1. In addition to the VAL at 10^-3 we also show the standard
error of the mean computed across five splits.
shown in Figure 5. While the largest model achieves a dramatic
improvement in accuracy compared to the tiny NNS2, the latter can be
run in 30ms / image on a mobile phone and is still accurate enough to
be used in face clustering. The sharp drop in the ROC for FAR < 10^-4
indicates noisy labels in the test data groundtruth. At extremely low
false accept rates a single mislabeled image can have a significant
impact on the curve.
5.3. Sensitivity to Image Quality
Table 4 shows the validation rate of NN1 across varying JPEG qualities
and a range of image sizes. The network is surprisingly robust with
respect to JPEG compression and performs very well down to a JPEG
quality of 20. The performance drop is very small for face thumbnails
down to a size of 120×120 pixels and even at 80×80 pixels it shows
acceptable performance. This is notable, because the network was
trained on 220×220 input images. Training with lower resolution faces
could improve this range further.
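The sketch below shows how such a robustness check could be reproduced by degrading face crops before embedding them. It uses Pillow; embed() stands in for a hypothetical trained network and is not part of the paper.

import io
import numpy as np
from PIL import Image

def degrade(face_crop, jpeg_quality=None, size=None):
    """Return an RGB uint8 face crop after optional resizing and/or
    JPEG re-compression, mirroring the degradations evaluated in Table 4."""
    img = Image.fromarray(face_crop)
    if size is not None:
        img = img.resize((size, size), Image.BILINEAR)
    if jpeg_quality is not None:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=jpeg_quality)
        img = Image.open(buf)
    return np.asarray(img)

# e.g. embed(degrade(crop, jpeg_quality=20)) or embed(degrade(crop, size=80)),
# where embed() is the (hypothetical) trained network; then recompute VAL at 10^-3 FAR.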
5.4. Embedding Dimensionality
We explored various embedding dimensionalities and selected 128 for
all experiments other than the comparison reported in Table 5. One
would expect the larger embeddings to perform at least as well as the
smaller ones, however, it is possible that they require more training
to achieve the same accuracy.
#training images   VAL
2,600,000          76.3%
26,000,000         85.1%
52,000,000         85.1%
260,000,000        86.2%
Table 6. Training Data Size. This table compares the performance after
700h of training for a smaller model with 96×96 pixel inputs. The
model architecture is similar to NN2, but without the 5×5 convolutions
in the Inception modules.
That said, the differences in the performance reported in Table 5 are
statistically insignificant.
It should be noted that during training a 128 dimensional float vector
is used, but it can be quantized to 128 bytes without loss of
accuracy. Thus each face is compactly represented by a 128 dimensional
byte vector, which is ideal for large scale clustering and
recognition. Smaller embeddings are possible at a minor loss of
accuracy and could be employed on mobile devices.
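The paper does not spell out the quantization scheme; one plausible sketch, under our own assumptions, maps each component of the (roughly unit-norm) 128-D float vector to a byte with a fixed linear scale:

import numpy as np

def quantize(embedding, lo=-1.0, hi=1.0):
    """Quantize a 128-D float embedding (components of an L2-normalized
    vector) to 128 uint8 values with a fixed linear mapping."""
    q = np.clip((embedding - lo) / (hi - lo), 0.0, 1.0)
    return np.round(q * 255).astype(np.uint8)

def dequantize(q, lo=-1.0, hi=1.0):
    """Map the bytes back to floats before computing L2 distances."""
    return q.astype(np.float32) / 255 * (hi - lo) + lo

# With a unit-norm 128-D embedding, the quantization error introduced by
# this mapping is small relative to a verification threshold around 1.1.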
5.5. Amount of Training Data
Table 6 shows the impact of different amounts of training data. Due to
time constraints this evaluation was run on a smaller model; the
effect may be even larger on larger models. It is clear that using
tens of millions of exemplars results in a clear boost of accuracy on
our personal photo test set from section 4.2. Compared to only
millions of images the relative reduction in error is 60%. Using
another order of magnitude more images (hundreds of millions) still
gives a small boost, but the improvement tapers off.
5.6. Performance on LFW
We evaluate our model on LFW using the standard protocol for
unrestricted, labeled outside data. Nine training splits are used to
select the L2-distance threshold. Classification (same or different)
is then performed on the tenth test split. The selected optimal
threshold is 1.242 for all test splits except split eight (1.256).
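A minimal sketch of this threshold selection, assuming per-fold arrays of pair distances and same/different labels (all names are hypothetical):

import numpy as np

def best_threshold(dists, is_same, candidates=np.arange(0.0, 4.0, 0.01)):
    """Pick the squared-L2 threshold with the highest accuracy on the
    nine training splits; pairs with distance <= threshold are 'same'."""
    accs = [np.mean((dists <= t) == is_same) for t in candidates]
    return candidates[int(np.argmax(accs))]

# For each of the ten LFW folds (hypothetical arrays train_d, train_y,
# test_d, test_y):
# t = best_threshold(train_d, train_y)
# fold_accuracy = np.mean((test_d <= t) == test_y)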
Our model is evaluated in two modes:
1. Fixed center crop of the LFW-provided thumbnail.
2. A proprietary face detector (similar to [3]) is run on the provided
   LFW thumbnails. If it fails to align the face (this happens for two
   images), the LFW alignment is used.
Figure 6 gives an overview of all failure cases. It shows false
accepts on the top as well as false rejects at the bottom. We achieve
a classification accuracy of 98.87% ± 0.15 when using the fixed center
crop described in (1) and the record breaking 99.63% ± 0.09 standard
error of the mean when using the extra face alignment (2). This
reduces the error reported for DeepFace in [17] by more than a factor
[Figure 6 panels: False accept (top), False reject (bottom).]
Figure 6. LFW errors. This shows all pairs of images that were
incorrectly classified on LFW. Only eight of the 13 false rejects
shown here are actual errors; the other five are mislabeled in LFW.
of 7 and the previous state-of-the-art reported for DeepId2+ in [15]
by 30%. This is the performance of model NN1, but even the much
smaller NN3 achieves performance that is not statistically
significantly different.
5.7. Performance on Youtube Faces DB
We use the average similarity of all pairs of the first one hundred
frames that our face detector detects in each video. This gives us a
classification accuracy of 95.12% ± 0.39. Using the first one thousand
frames results in 95.18%. Compared to [17], who also evaluate one
hundred frames per video and achieve 91.4%, we reduce the error rate
by almost half. DeepId2+ [15] achieved 93.2% and our method reduces
this error by 30%, comparable to our improvement on LFW.
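A small sketch of this video-pair score, assuming per-frame embeddings have already been computed for the detected faces and that the score is the mean squared L2 distance over all cross-video frame pairs (array names are hypothetical):

import numpy as np

def video_pair_distance(emb_a, emb_b, n_frames=100):
    """Average squared L2 distance over all pairs formed from the first
    n_frames detected faces of each video; the video pair is then
    classified by thresholding this average, as on LFW."""
    a = emb_a[:n_frames]          # (n_a, 128) embeddings of video A
    b = emb_b[:n_frames]          # (n_b, 128) embeddings of video B
    d = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
    return d.mean()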
5.8. Face Clustering
Our compact embedding lends itself to being used to cluster a user's
personal photos into groups of people with the same identity. The
constraints in assignment imposed by clustering faces, compared to the
pure verification task, lead to truly amazing results. Figure 7 shows
one exemplar cluster in a user's personal photo collection, generated
using agglomerative clustering. It is a clear showcase of the
incredible invariance to occlusion, lighting, pose and even age.

Figure 7. Face Clustering. Shown is an exemplar cluster for one user.
All these images in the user's personal photo collection were
clustered together.
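As an illustration only (the paper does not state the exact linkage or cutoff), one could cluster a user's embeddings with scikit-learn's agglomerative clustering, using a Euclidean distance threshold in the spirit of the 1.1 verification threshold; this assumes scikit-learn >= 1.2:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_faces(embeddings, distance_threshold=1.1):
    """Group face embeddings of one user's photo collection by identity
    using agglomerative clustering on Euclidean distance. The linkage
    and threshold here are illustrative assumptions."""
    clusterer = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="euclidean",
        linkage="average",
    )
    return clusterer.fit_predict(embeddings)

# labels = cluster_faces(embeddings)  # embeddings: (N, 128) float array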
6. Summary
We provide a method to directly learn an embedding into a Euclidean
space for face verification. This sets it apart from other methods
[15] that use the CNN bottleneck
layer, or require additional post-processing such as concatenation of
multiple models and PCA, as well as SVM classification. Our end-to-end
training both simplifies the setup and shows that directly optimizing
a loss relevant to the task at hand improves performance.

[Figure 8: ROC curves of VAL (0.0 to 1.0) versus FAR (10^-8 to 10^0,
log scale) for NN2, NN2 compared to NN1, and NN1.]
Figure 8. Harmonic Embedding Compatibility. These ROCs show the
compatibility of the harmonic embeddings of NN2 to the embeddings of
NN1. NN2 is an improved model that performs much better than NN1. When
comparing embeddings generated by NN1 to the harmonic ones generated
by NN2 we can see the compatibility between the two. In fact, the
mixed-mode performance is still better than that of NN1 by itself.
Another strength of our model is that it only requires
minimal alignment (tight crop around the face area). [17],
for example, performs a complex 3D alignment. We also
experimented with a similarity transform alignment and no-
tice that this can actually improve performance slightly. It
is not clear if it is worth the extra complexity.
Future work will focus on better understanding of the
error cases, further improving the model, and also reduc-
ing model size and reducing CPU requirements. We will
also look into ways of improving the currently extremely long training
times, e.g. variations of our curriculum learning with smaller batch
sizes and offline as well as online positive and negative mining.
7. Appendix: Harmonic Embedding
In this section we introduce the concept of harmonic embeddings. By
this we denote a set of embeddings that are generated by different
models v1 and v2 but are compatible in the sense that they can be
compared to each other.
This compatibility greatly simplifies upgrade paths. E.g. in a
scenario where embedding v1 was computed across a large set of images
and a new embedding model v2 is being rolled out, this compatibility
ensures a smooth transition without the need to worry about version
incompatibilities. Figure 8 shows the ROC curves for this scenario. It
can
be seen that the improved model NN2 significantly outperforms NN1,
while the comparison of NN2 embeddings to NN1 embeddings performs at
an intermediate level.

[Figure 9 panels: three triplets of Anchor, Positive, and Semi-hard
Negative.]
Figure 9. Learning the Harmonic Embedding. In order to learn a
harmonic embedding, we generate triplets that mix the v1 embeddings
with the v2 embeddings that are being trained. The semi-hard negatives
are selected from the whole set of both v1 and v2 embeddings.
7.1. Harmonic Triplet Loss
In order to learn the harmonic embedding we mix embeddings of v1
together with the embeddings of v2 that are being learned. This is
done inside the triplet loss and results in additionally generated
triplets that encourage the compatibility between the different
embedding versions. Figure 9 visualizes the triplet combinations that
contribute to the triplet loss.
We initialized the v2 embedding from an independently trained NN2 and
retrained the last layer (the embedding layer) from random
initialization with the compatibility-encouraging triplet loss. First
only the last layer is retrained, then we continue training the whole
v2 network with the harmonic loss.
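A minimal numpy sketch of one such mixed triplet term, with the anchor and positive taken from the v2 embedding being trained and the semi-hard negative drawn from the frozen v1 embeddings (function and argument names are our own; the full loss would sum such terms over all mixed and non-mixed triplet combinations):

import numpy as np

def harmonic_triplet_loss(v2_anchor, v2_positive, v1_negative, alpha=0.2):
    """One mixed-triplet term of the compatibility-encouraging loss:
    v2 anchors/positives are pushed away from frozen v1 negatives,
    keeping the v2 embedding comparable to v1."""
    d_pos = np.sum((v2_anchor - v2_positive) ** 2, axis=1)
    d_neg = np.sum((v2_anchor - v1_negative) ** 2, axis=1)
    return np.sum(np.maximum(0.0, d_pos - d_neg + alpha))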
Figure 10 sketches a possible interpretation of how this compatibility
may work in practice. The vast majority of v2 embeddings may be
embedded near the corresponding v1 embedding, however, incorrectly
placed v1 embeddings can be perturbed slightly such that their new
location in embedding space improves verification accuracy.
7.2. Summary
These are very interesting findings and it is somewhat surprising that
it works so well. Future work can explore how far this idea can be
extended. Presumably there is a limit as to how much the v2 embedding
can improve over v1, while still being compatible. Additionally it
would be interesting to train small networks that can run on a mobile
phone and are compatible with a larger server-side model.
[Figure 10: sketch of the embedding spaces while learning harmonic
embeddings.]
Figure 10. Harmonic Embedding Space. This visualisation sketches a
possible interpretation of how harmonic embeddings are able to improve
verification accuracy while maintaining compatibility to less accurate
embeddings. In this scenario there is one misclassified face, whose
embedding is perturbed to the correct location in v2.
Acknowledgments
We would like to thank Johannes Steffens for his discussions and great
insights on face recognition and Christian Szegedy for providing new
network architectures like [16] and discussing network design choices.
Also we are indebted to the DistBelief [4] team for their support,
especially to Rajat Monga, for help in setting up efficient training
schemes.
Also our work would not have been possible without the support of
Chuck Rosenberg, Hartwig Adam, and Simon Han.
References
[1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum
    learning. In Proc. of ICML, New York, NY, USA, 2009.
[2] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face
    revisited: A joint formulation. In Proc. ECCV, 2012.
[3] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face
    detection and alignment. In Proc. ECCV, 2014.
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao,
    M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng.
    Large scale distributed deep networks. In P. Bartlett, F. Pereira,
    C. Burges, L. Bottou, and K. Weinberger, editors, NIPS, pages
    1232-1240, 2012.
[5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods
    for online learning and stochastic optimization. J. Mach. Learn.
    Res., 12:2121-2159, July 2011.
[6] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and
    Y. Bengio. Maxout networks. In ICML, 2013.
[7] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled
    faces in the wild: A database for studying face recognition in
    unconstrained environments. Technical Report 07-49, University of
    Massachusetts, Amherst, October 2007.
[8] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard,
    W. Hubbard, and L. D. Jackel. Backpropagation applied to
    handwritten zip code recognition. Neural Computation,
    1(4):541-551, Dec. 1989.
[9] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR,
    abs/1312.4400, 2013.
[10] C. Lu and X. Tang. Surpassing human-level face verification
    performance on LFW with GaussianFace. CoRR, abs/1404.3840, 2014.
[11] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning
    representations by back-propagating errors. Nature, 1986.
[12] M. Schultz and T. Joachims. Learning a distance metric from
    relative comparisons. In S. Thrun, L. Saul, and B. Schölkopf,
    editors, NIPS, pages 41-48. MIT Press, 2004.
[13] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and
    expression (PIE) database. In Proc. FG, 2002.
[14] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face
    representation by joint identification-verification. CoRR,
    abs/1406.4773, 2014.
[15] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations
    are sparse, selective, and robust. CoRR, abs/1412.1265, 2014.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
    D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with
    convolutions. CoRR, abs/1409.4842, 2014.
[17] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing
    the gap to human-level performance in face verification. In IEEE
    Conf. on CVPR, 2014.
[18] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin,
    B. Chen, and Y. Wu. Learning fine-grained image similarity with
    deep ranking. CoRR, abs/1404.4661, 2014.
[19] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric
    learning for large margin nearest neighbor classification. In
    NIPS. MIT Press, 2006.
[20] D. R. Wilson and T. R. Martinez. The general inefficiency of
    batch training for gradient descent learning. Neural Networks,
    16(10):1429-1451, 2003.
[21] L. Wolf, T. Hassner, and I. Maoz. Face recognition in
    unconstrained videos with matched background similarity. In IEEE
    Conf. on CVPR, 2011.
[22] M. D. Zeiler and R. Fergus. Visualizing and understanding
    convolutional networks. CoRR, abs/1311.2901, 2013.
[23] Z. Zhu, P. Luo, X. Wang, and X. Tang. Recover canonical-view
    faces in the wild with deep neural networks. CoRR, abs/1404.3543,
    2014.