
Under review as a conference paper at ICLR 2020
ISPARSE: OUTPUT-INFORMED SPARSIFICATION OF NEURAL NETWORKS
Anonymous authors
Paper under double-blind review
ABSTRACT
Deep neural networks have demonstrated unprecedented success in various knowledge management applications. However, the networks created are often very complex, with large numbers of trainable edges that require extensive computational resources. We note that many successful networks nevertheless often contain large numbers of redundant edges. Moreover, many of these edges may have negligible contributions towards the overall network performance. In this paper, we propose a novel iSparse framework and experimentally show that we can sparsify the network by 30-50% without impacting the network performance. iSparse leverages a novel edge significance score, E, to determine the importance of an edge with respect to the final network output. Furthermore, iSparse can be applied both while training a model and on top of a pre-trained model, making it a retraining-free approach with minimal computational overhead. Comparisons of iSparse against PFEC, NISP, DropConnect, and Retraining-Free on benchmark datasets show that iSparse leads to effective network sparsifications.
1 INTRODUCTION
Deep neural networks (DNNs), particularly convolutional neural networks (CNNs), have shown impressive success in many applications, such as facial recognition (Lawrence et al., 1997), time series analysis (Yang et al., 2015), speech recognition (Hinton et al., 2012), object classification (Liang & Hu, 2015), and video surveillance (Karpathy et al., 2014). As the term "deep" neural networks implies, this success often relies on large networks with large numbers of trainable edges (weights) (Huang et al., 2017; Zoph et al., 2018; He et al., 2016; Simonyan & Zisserman, 2015).
While a large number of trainable edges helps generalize the network to complex and diverse patterns in large-scale datasets, this often comes with an enormous computational cost to account for the non-linearity of deep networks (ReLU, sigmoid, tanh). In fact, DNNs owe their recent success to hardware-level innovations that render the immense computational requirements practical (Ovtcharov et al., 2015; Courbariaux et al., 2015). However, the benefits of hardware solutions and optimizations that can be applied to a general-purpose DNN or CNN are limited, and these solutions are fast reaching their limits. This has led to significant interest in network-specific optimization techniques, such as network compression (Choi et al., 2018), pruning (Li et al., 2016; Yu et al., 2018), and regularization (Srivastava et al., 2014; Wan et al., 2013), which aim to reduce the number of edges in the network. However, many of these techniques require retraining the pruned network, leading to a significant amount of computational waste.
1.1 NETWORK SPARSIFICATION
Many successful networks nevertheless often contain large numbers of redundant edges. Consider, for example, the weights of the sample network shown in Figure 1a. As we see here, the weight distribution is centered around zero, and a significant number of weights make insignificant contributions to the network output. Such edges may add noise or non-informative information, leading to a reduction in network performance. (Denil et al., 2013; Ashouri et al., 2018; Yu et al., 2018) have shown that it is possible to predict 95% of network parameters while learning only 5% of the parameters.
Sparsification techniques can generally be classified into neuron/kernel sparsification (Li et al., 2016; Yu et al., 2018) and edge/weight sparsification techniques (Wan et al., 2013; Ashouri et al., 2018).
Figure 1: Overview of the weight distribution and model accuracies for the MNIST dataset (LeNet-5): (a) weight distribution for LeNet-5; (b) model accuracy vs. sparsification factor.
(Li et al., 2016) proposed to eliminate neurons that have low $l_2$-norms of their weights, whereas (Yu et al., 2018) proposed a neuron importance score propagation (NISP) technique, where neuron importance scores (computed using Roffo et al. (2015); see Equation 5) are propagated from the output layer to the input layer in a back-propagation fashion. The Dropout technique (Srivastava et al., 2014) instead deactivates neurons at random. As an edge sparsification technique, DropConnect (Wan et al., 2013) selects the edges to be sparsified randomly. (Ashouri et al., 2018) showed that network performance can be maintained by eliminating insignificant weights without modifying the network architecture.
1.2 OUR CONTRIBUTIONS: OUTPUT-INFORMED EDGE SPARSIFICATION
Following these works, we argue that network sparsification can be a very effective tool for reducing the sizes and complexities of DNNs and CNNs without any significant loss in accuracy. However, we also argue that edge weights cannot be used "as is" for pruning the network. Instead, one needs to consider the significance of each edge within the context of its place in the network (Figure 2): "Two edges in a network with the same edge weight may have different degrees of contribution to the final network output." In this paper, we show that it is possible to quantify the significance of each edge in the network, relative to its contribution to the final network output, and to use these significance measures to minimize the redundancy in the network by sparsifying the weights that contribute insignificantly to the network. We, therefore, propose a novel iSparse framework and experimentally show that we can sparsify the network by almost 50% without impacting the network performance.
The key contributions of our proposed work are as follows:
• Output-informed quantification of the significance of network parameters: Informed by the final layer network output, iSparse computes and propagates edge significance scores that measure the importance of each edge with respect to the model output (Section 3).
• Retraining-free network sparsification (sparsify-with): The proposed iSparse framework is robust to edge sparsification and can maintain network performance without having to retrain the network. This implies that one can apply iSparse on pre-trained networks, on-the-fly, to achieve the desired level of sparsification (Section 3.3).
• Sparsification during training (train-with): iSparse can also be used as a regularizer during model training, allowing for the learning of sparse networks from scratch (Section 4).
As the sample results in Figure 1b show, iSparse is able to achieve 30-50% sparsification with minimal impact on model accuracy. More detailed experimental comparisons (see Section 5) of iSparse against PFEC, NISP, Retraining-Free, and DropConnect on benchmark datasets illustrate that iSparse leads to more effective network sparsifications.
2 RELATED WORKS
A neural network is a sequence of layers of neurons that helps learn (and remember) complex non-linear patterns in a given dataset (Grossberg, 1988). Recently, deep neural networks (DNNs), and particularly CNNs, which leverage recent hardware advances to increase the number of layers in the network to scales that were not practical until recently (Lawrence et al., 1997; Yang et al., 2015; Hinton et al., 2012; Liang & Hu, 2015; Karpathy et al., 2014), have shown impressive success in several data analysis and machine learning applications.

Figure 2: Overview of iSparse sparsification, considering $n_i$'s contribution to the overall output rather than only the connection between neurons $n_i$ and $n_j$.
A typical CNN consists of (1) feature extraction layers, which are responsible for learning complex patterns in the data and remembering them through layer weights. An $m \times n$ layer, $L_l$, maps an $n$-dimensional input ($Y_{l-1}$) to an $m$-dimensional output ($Y_l$) by training a weight matrix $W_l \in \mathbb{R}^{m_l \times n_l}$ (see Section 3.1 for further details); (2) activation layers, which help capture non-linear patterns in the data through activation functions ($\sigma$) that map the output of a feature extraction layer to a non-linear space (ReLU and softmax are commonly used activation functions); and (3) pooling (sampling) layers, which up- or down-sample the intermediate data in the network.
The training process of a neural network often comprises two key stages: (1) forward-propagation (upstream) maps the input data, $X$, to an output variable, $\hat{Y}$. At each layer, we have $\hat{Y}_l = L_l(Y_l) = \sigma_l(W_l Y_l + B_l)$, where $Y_1 = X$. Intuitively, each layer learns to extract new features from the features extracted by the prior layer. (2) backward-propagation (downstream) revises the network weights, $W$, based on the training error, $Err = |Y - \hat{Y}|$, where $Y$ and $\hat{Y}$ are the ground truth and the predictions of the model, respectively, such that the updated weights are defined as $W' = W - \eta\,Err$.
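As a concrete illustration of this notation, the following is a minimal NumPy sketch of one connection layer's forward pass and a gradient-based weight update. The paper abbreviates the update as $W' = W - \eta\,Err$, so the exact form of the error term below is an assumption of this sketch rather than the authors' implementation.

```python
import numpy as np

def relu(z):
    # one of the activation functions mentioned in the text
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
n_l, m_l = 8, 4                               # the layer maps n_l inputs to m_l outputs

W_l = rng.normal(scale=0.1, size=(m_l, n_l))  # W_l in R^{m_l x n_l}
B_l = np.zeros((m_l, 1))                      # bias B_l in R^{m_l x 1}

# Forward-propagation: Y_hat_l = sigma_l(W_l Y_l + B_l)
Y_l = rng.normal(size=(n_l, 1))               # layer input (Y_1 = X for l = 1)
Z_l = W_l @ Y_l + B_l
Y_hat_l = relu(Z_l)

# Backward-propagation (schematic): push the error through the activation and
# update the weights with learning rate eta (the paper's shorthand W' = W - eta * Err).
Y_true = rng.normal(size=(m_l, 1))
err = Y_hat_l - Y_true                        # training error at this layer
grad_W = (err * (Z_l > 0)) @ Y_l.T            # error signal w.r.t. W_l through the ReLU
eta = 0.01
W_l = W_l - eta * grad_W
```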
The number of trainable parameters in a deep network can range from as low as tens of thousands (LeCun et al., 1999) to hundreds of millions (Simonyan & Zisserman, 2015) (Table 1 in Section 5). This three-order-of-magnitude increase in the number of trainable parameters may lead to parameters that are redundant or have negligible contributions to the overall network output. This redundancy and insignificance of network parameters has led to advancements in network regularization that introduce dynamic or informed sparsification into the network.
These advancements can be broadly classified into two main categories: parameter pruning and parameter regularization. In particular, pruning focuses on compressing the network by eliminating redundant or insignificant parameters. (Han et al., 2015; Han et al., 2016) aim to prune the parameters with near-zero weights, inspired by $l_1$ and $l_2$ regularization (Tibshirani, 1996; Tikhonov, 1963). (Li et al., 2016) chooses to filter out the convolutional kernels with minimum weight values in a given layer. More recently, (Yu et al., 2018) minimizes the change in the final network performance by eliminating the neurons that have minimal impact on the network output, leveraging neuron importance scores ($N_L$; see Section 5.3) computed using Inf-FS (Roffo et al., 2015). More complex approaches have been proposed to tackle the problem of redundancy in the network through weight quantization. (Rastegari et al., 2016) proposes to quantize the inputs and output activations of the layers in a CNN by using a step function and to leverage binary operations by using binary weights as opposed to real-valued weights. (Chen et al., 2015) focuses on low-end mobile hardware with limited computational power and proposes to leverage the inherent redundancy in the network by using hashing functions to compress the weights in the network.
(Bahdanau et al., 2014; Woo et al., 2018) showed that the input features to a given layer in the network rarely have the same importance; therefore, learning their individual importances (attention) helps improve the performance of the network. More recently, (Garg & Candan, 2019) has shown that input data informed deep networks can provide high-performance network configurations. In this paper, we rely on output information for identifying and eliminating insignificant parameters from the network, without having to update the edge weights or retrain the network.
3 ISPARSE: OUTPUT-INFORMED SPARSIFICATION OF NEURAL NETWORKS
As discussed in Section 1, in order to tackle complex inputs, deep neural networks have gone increasingly deeper and wider. This design strategy, however, often results in large numbers of insignificant edges¹ (weights), if not redundant ones. In this section, we describe the proposed iSparse framework, which quantifies the significance of each individual connection in the network with respect to the overall network output in order to determine the set of edges that can be sparsified to alleviate network redundancy and eliminate insignificant edges. iSparse aims to determine the significance of the edges in the network to make an informed sparsification of the network.
3.1 MASK MATRIX
A typical neural network, $\mathcal{N}$, can be viewed as a sequential arrangement of convolutional ($C$) and fully connected ($F$) layers: $\mathcal{N}(X) = L_L(L_{L-1}(\ldots(L_2(L_1(X)))))$, where $X$ is the input, $L$ is the total number of layers in the network, and $L_l \in \{C, F\}$, s.t. any given layer, $L_l$, $1 \le l \le L$, can be generalized as
$$L_l(Y_l) = \sigma_l(W_l Y_l + B_l) = \hat{Y}_l, \tag{1}$$
where $Y_l$ is the input to the layer (s.t. $Y_l = \hat{Y}_{l-1}$ and, for $l = 1$, $Y_1 = X$) and $\sigma_l$, $W_l$, and $B_l$ are the activation function, weight matrix, and bias, respectively. Note that, if the $l$-th layer has $m_l$ neurons and the $(l-1)$-th layer has $n_l$ neurons, then $\hat{Y}_l \in \mathbb{R}^{m_l \times 1}$, $Y_l \in \mathbb{R}^{n_l \times 1}$, $W_l \in \mathbb{R}^{m_l \times n_l}$, and $B_l \in \mathbb{R}^{m_l \times 1}$.
Given this formulation, the problem of identifying insignificant edges can be formulated as the problem of generating a sequence of binary mask matrices, $M_1, \ldots, M_L$, that collectively represent whether any given edge in the network is sparsified (0) or not (1):
$$M_l \in \{0, 1\}^{m_l \times n_l}, \quad 1 \le l \le L. \tag{2}$$
3.2 EDGE SIGNIFICANCE SCORE
Let $M_l$ be a mask matrix as defined in Equation 2; $M_l$ can be expanded as
$$M_l = \begin{bmatrix} M_{l,1,1} & \cdots & M_{l,1,n_l} \\ \vdots & \ddots & \vdots \\ M_{l,m_l,1} & \cdots & M_{l,m_l,n_l} \end{bmatrix}, \tag{3}$$
where each $M_{l,i,j} \in \{0,1\}$ corresponds to an edge $e_{l,i,j}$ in the network. Our goal in this paper is to develop an edge significance score measure to help set the binary value of $M_{l,i,j}$ for each edge in the network. More specifically, we aim to associate a non-negative real-valued number, $E_{l,i,j} \ge 0$, with each edge in the network, s.t.
$$M_{l,i,j} = \begin{cases} 1 & E_{l,i,j} \ge \theta_l(\phi_l) \\ 0 & E_{l,i,j} < \theta_l(\phi_l) \end{cases}, \quad \forall\, i = 1 \ldots m_l,\; j = 1 \ldots n_l. \tag{4}$$
Here, $\theta_l(\phi_l)$ represents the lowest significance among the $\phi_l\%$ most significant edges in layer $l$. Intuitively, given a target sparsification rate, $\phi_l$, we rank all the edges based on their edge significance scores and keep only the highest scoring $\phi_l\%$ of the edges by setting their mask values to 1.
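The thresholding in Equation 4 reduces to a percentile cut over the score matrix. The sketch below illustrates this; mask_from_scores is a hypothetical helper name, and the percentile-based tie handling is an assumption rather than a detail specified in the text.

```python
import numpy as np

def mask_from_scores(E_l, phi_l):
    """Build the binary mask M_l of Equation 4 from the edge significance
    matrix E_l (shape m_l x n_l), keeping the phi_l% most significant edges."""
    theta = np.percentile(E_l, 100.0 - phi_l)   # threshold theta_l(phi_l)
    return (E_l >= theta).astype(E_l.dtype)     # 1 = keep the edge, 0 = sparsify it

# Toy example: keep the 50% most significant edges of a random score matrix
rng = np.random.default_rng(0)
E_l = rng.random((4, 6))
M_l = mask_from_scores(E_l, phi_l=50.0)
```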
As we have seen in Figure 1a, the (signed) weight distribution of the edges in a layer is often centered around zero, with large numbers of edges having weights very close to 0. As we also argued in the Introduction, such edges can work counter-intuitively and add noise or non-informative information, leading to a reduction in network performance. In fact, several existing works, such as (Ashouri et al., 2018), rely on these weights to eliminate insignificant edges without having to retrain the network. However, as we also commented in the Introduction, we argue that edge weights should not be used alone for sparsifying the network. Instead, one needs to consider each edge within the context of its place in the network: two edges in a network with the same edge weight may have different degrees of contribution to the final network output. Unlike existing works, iSparse takes this into account when selecting the edges to be sparsified (Figure 3).
¹An edge is defined as a direct (weighted) connection between two neurons.
Figure 3: A sample network architecture and its sparsification using Retraining-Free (Ashouri et al., 2018) and iSparse: (a) sample network; (b) Retrain-Free; (c) iSparse. Node labels indicate the input to the node; edge labels in [0, 1] indicate the edge weights; and edge labels in parentheses indicate edge contributions.
More specifically, let $W^{+}_l$ be the (element-wise) absolute value of the weight matrix, $W_l$, for the edges in the $l$-th layer. We compute the corresponding edge significance score matrix, $E_l$, as
$$E_l = W^{+}_l \circ N_l = \begin{bmatrix} W^{+}_{l,1,1} N_{l,1} & \cdots & W^{+}_{l,1,n_l} N_{l,n_l} \\ \vdots & \ddots & \vdots \\ W^{+}_{l,m_l,1} N_{l,1} & \cdots & W^{+}_{l,m_l,n_l} N_{l,n_l} \end{bmatrix}, \tag{5}$$
where $N_l$ represents the neuron significance scores², $N_{l,1}$ through $N_{l,n_l}$, and "$\circ$" represents the scalar multiplication between edge weights and neuron scores. $N_{l,i}$ denotes the significance of the $i$-th input neuron to the $l$-th connection layer of the network, which is itself defined recursively, based on the following layer in the network, using the conventional dot product:
$$N_l = W^{+}_{l+1} \cdot N_{l+1} = \begin{bmatrix} W^{+}_{l+1,1,1} N_{l+1,1} + \cdots + W^{+}_{l+1,1,n_{l+1}} N_{l+1,n_{l+1}} \\ \vdots \\ W^{+}_{l+1,m_{l+1},1} N_{l+1,1} + \cdots + W^{+}_{l+1,m_{l+1},n_{l+1}} N_{l+1,n_{l+1}} \end{bmatrix}. \tag{6}$$
Note that $N_l$ can be expanded as
$$N_l = \left(W^{+}_{l+1} \cdot \left(W^{+}_{l+2} \cdots \left(W^{+}_{L-1} \cdot \left(W^{+}_{L} \cdot N_L\right)\right)\right)\right). \tag{7}$$
Above, $N_L$ denotes the neuron scores of the final output layer; $N_L$ is defined using infinite feature selection (Roffo et al., 2015; Yu et al., 2018) as $N_L = \mathit{inffs}(\hat{Y}_L)$, where $\hat{Y}_L \in \mathbb{R}^{x \times n}$ ($x$ is the number of input samples and $n$ is the number of output neurons), to determine the neuron importance scores with respect to the final network output. Given the above, the edge score (Equation 5) can be rewritten as
$$E_l = W^{+}_l \circ \left(W^{+}_{l+1} \cdot \left(W^{+}_{l+2} \cdots \left(W^{+}_{L-1} \cdot \left(W^{+}_{L} \cdot N_L\right)\right)\right)\right). \tag{8}$$
Note that the significance score of an edge in layer $l$ considers not only the weight of the edge itself, but also the weights of all downstream edges between that edge and the final output layer.
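The following is a minimal NumPy sketch of one dimensionally consistent way to realize Equations 5-8. The paper's notation leaves the orientation of the matrix products implicit, so the transpose in the recursion and the row-wise scaling (each edge weighted by the score of the downstream neuron it feeds) are a reading of the equations, not the authors' code; the final-layer scores $N_L$ are assumed to be given (e.g., by Inf-FS).

```python
import numpy as np

def edge_significance_scores(weights, N_L):
    """Edge significance scores E_l (Equations 5-8) for all connection layers.

    weights : [W_1, ..., W_L], each W_l of shape (m_l, n_l), with n_{l+1} == m_l
    N_L     : significance of the final layer's output neurons, shape (m_L,)
    Returns [E_1, ..., E_L], each E_l of shape (m_l, n_l).
    """
    W_abs = [np.abs(W) for W in weights]        # W_l^+ : element-wise absolute weights
    L = len(weights)

    # Propagate the output-layer scores towards the input (Equations 6-7):
    # N[l] holds the significance of layer l's output neurons.
    N = [None] * L
    N[L - 1] = np.asarray(N_L, dtype=float)
    for l in range(L - 2, -1, -1):
        N[l] = W_abs[l + 1].T @ N[l + 1]

    # Equation 5: scale every edge weight by the score of the neuron it feeds into.
    return [W_abs[l] * N[l][:, None] for l in range(L)]

# Toy usage: a network with 5 -> 4 -> 3 -> 2 neurons and dummy output scores
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 5)), rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
E = edge_significance_scores(Ws, N_L=np.array([0.7, 0.3]))
```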
3.3 EDGE SPARSIFICATION
As noted in Section 3.1, the binary values in the masking matrix $M_l$ depend on $\theta_l(\phi_l)$, which represents the lowest significance among the $\phi_l\%$ most significant edges in the layer³: therefore, given a target sparsification rate, $\phi_l$, for layer $l$, we rank all the edges based on their edge significance scores and keep only the highest scoring $\phi_l\%$ of the edges by setting their mask values to 1. Note that, once an edge is sparsified, the change in its contribution is not propagated back to the layers earlier in the network relative to the sparsified edge. Having determined the insignificant edges with respect to the final layer output, represented in the form of the mask matrix, $M_l$ (described in Section 3.1), the next step is to integrate this mask matrix into the layer itself. To achieve this, iSparse extends the layer $l$ (Equation 1) to account for the corresponding mask matrix ($M_l$):
$$L_l(Y_l) = \sigma_l\left((W_l \odot M_l)\, Y_l + B_l\right), \tag{9}$$
where $\odot$ represents the element-wise multiplication between the matrices $W_l$ and $M_l$. Intuitively, $M_l$ facilitates the introduction of informed sparsity into the layer by eliminating edges that do not contribute significantly to the final output layer.
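A minimal sketch of the masked layer in Equation 9, assuming a dense layer and a ReLU activation for concreteness (the activation choice is not prescribed here):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def masked_layer(W_l, M_l, B_l, Y_l, activation=relu):
    """Sparsified connection layer of Equation 9:
    L_l(Y_l) = sigma_l((W_l (.) M_l) Y_l + B_l).
    Masked-out edges (M_l == 0) simply contribute nothing to the output,
    so no retraining is required to evaluate the sparsified layer."""
    return activation((W_l * M_l) @ Y_l + B_l)
```

Combined with the scoring and thresholding sketches above, a pre-trained layer can be sparsified on-the-fly by computing $E_l$, deriving $M_l$, and evaluating the masked layer.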
²$N_l$ summarizes the edge and neuron importance in the subsequent layers, i.e., $L_{l+1} \ldots L_L$.
³In the experiments reported in Section 5, without loss of generality, we assume that $\phi_l$ has the same value for all connection layers in the network. This is not a fundamental assumption, and iSparse can easily accommodate different rates of sparsification across connection layers.
Table 1: Number of trainable parameters (weights) and statistics for various benchmark datasets

Dataset    Network  Weights      Resolution  Train Set  Test Set  Labels
MNIST      LeNet    44,426       28x28       60,000     10,000    10
FMNIST     LeNet    44,426       32x32       60,000     10,000    14
COIL20     LeNet    62,556       32x32       1,300      140       20
COIL100    LeNet    69,356       32x32       6,480      720       100
NORB       LeNet    168,801      96x96       24,300     24,300    6
CIFAR10    VGG      32,418,834   32x32       50,000     10,000    10
CIFAR20    VGG      32,428,844   32x32       50,000     10,000    20
CIFAR100   VGG      32,508,924   32x32       50,000     10,000    100
SVHN       VGG      32,418,834   32x32       73,000     26,000    10
GTSRB      VGG      38,743,323   32x32       39,209     12,630    43
ImageNet   VGG      138,357,544  224x224     1,281,167  50,000    1000
4 INTEGRATION OF ISPARSE WITHIN MODEL TRAINING
In the previous section, we discussed the computation of edge significance scores on a pre-trained network, such as pre-trained ImageNet models, and the use of these scores for network sparsification. In this section, we highlight that iSparse can also be integrated directly within the training process.
To achieve this, the edge significance score is computed for every trainable layer in the network using the strategy described in Section 3.2, and the mask matrix is updated using Equation 4. Furthermore, the back-propagation rule, described in Section 2, is updated to account for the mask matrices:
$$W'_l = W_l - \eta\,(M_l \odot Err_l), \tag{10}$$
where $W'_l$ are the updated weights, $W_l$ the original weights, $\eta$ is the learning rate, and $Err_l$ is the error recorded as the divergence between the ground truth ($Y_l$) and the model predictions ($\hat{Y}_l$), i.e., $Err_l = |Y_l - \hat{Y}_l|$. Note that we argue that any edge that does not contribute towards the final model output must not be included in the back-propagation. Therefore, we mask the error as $Err_l \odot M_l$.
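A minimal sketch of the masked update in Equation 10. The paper's notation is compact, so treating $Err_l$ as a per-weight error signal (or as a per-output error broadcast across the mask) is an assumption of this sketch rather than a detail taken from the text:

```python
import numpy as np

def masked_update(W_l, M_l, Err_l, eta=0.01):
    """Masked back-propagation step of Equation 10:
    W'_l = W_l - eta * (M_l (.) Err_l).
    Sparsified edges (M_l == 0) receive no update, so they stay out of the
    back-propagation, as argued in the text."""
    return W_l - eta * (M_l * np.asarray(Err_l))
```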
5 EXPERIMENTAL EVALUATION
In this section, we experimentally evaluate the proposed iSparse framework using the LeNet and VGG architectures (see Section 5.2) and compare it against approaches such as PFEC, NISP, and DropConnect (see Section 5.3).
5.1 SYSTEM CONFIGURATION
We implemented iSparse in a Python environment (3.5.2) using the Keras deep learning library (2.2.4-tf) (Chollet et al., 2015) with the TensorFlow backend (1.14.0). All experiments were performed on an Intel Xeon E5-2670 2.3 GHz Quad-Core processor with 32 GB RAM, equipped with an Nvidia Tesla P100 GPU with 16 GiB GDDR5 RAM, using CUDA-10.0 and cuDNN v7.6.4.⁴
5.2 BENCHMARK NETWORKS AND DATASETS
In this paper, without loss of generality, we leverage LeNet-5 (LeCun et al., 1999) and VGG-16 (Simonyan & Zisserman, 2015) as the baseline architectures to evaluate sparsification performance on different benchmark image classification datasets and for varying degrees of edge sparsification. In this section, we present an overview of these architectures (see Table 1).
LeNet-5: Designed for recognizing handwritten digits, LeNet-5 is a simple network with 5 trainable layers (2 convolution and 3 dense) and 2 non-trainable layers using average pooling, with tanh and softmax as the hidden and output activations. LeNet's simplicity has made it a common benchmark for datasets recorded in constrained environments, such as MNIST (LeCun et al., 1998), FMNIST (Xiao et al., 2017), COIL (Nene et al., 1996a;b), and NORB (LeCun et al., 2004).
⁴Results presented in this paper were obtained using the NSF testbed "Chameleon: A Large-Scale Reconfigurable Experimental Environment for Cloud Research" (https://www.chameleoncloud.org/).
Figure 4: Top-1 and top-5 accuracy for sparsified VGG-16 on ImageNet (sparsify-with)

Figure 5: Model classification time vs. sparsification factor for the MNIST dataset (train-with)

Figure 6: Top-1 classification accuracy results for sparsified pre-trained models (sparsify-with): (a) VGG network (VGG-16); (b) LeNet network (LeNet-5)
VGG-16: VGG (Simonyan & Zisserman, 2015) is a 16-layer network with 13 convolution and 3 dense layers, interleaved with 5 max-pooling layers. VGG leverages ReLU as the hidden activation, as opposed to tanh, to overcome the problem of vanishing gradients. Given the ability of the VGG network to learn complex patterns in real-world datasets, we use the network on benchmark datasets such as CIFAR10/20/100 (Krizhevsky, 2009), SVHN (Netzer et al., 2011), GTSRB (Stallkamp et al., 2012), and ImageNet (Deng et al., 2009). Table 1 reports the number of trainable parameters (or weights) for each model/dataset pair considered in the experiments.
5.3 COMPETITORS
We compared iSparse against several state-of-the-art network sparsification techniques: DropConnect (Wan et al., 2013) is a purely random approach, where edges are randomly selected for sparsification. Retraining-Free (Ashouri et al., 2018) considers each layer independently and sparsifies insignificant weights in the layer, without accounting for their contribution to the final network output. PFEC (Li et al., 2016) is a kernel pruning strategy that aims to eliminate neurons that have low impact on the overall model accuracy. In order to determine the impact, PFEC computes the $l_2$-norms of the weights of the neurons and ranks them, separately, for each layer. NISP (Yu et al., 2018) proposes a neuron importance score propagation (NISP) technique, where neuron importance scores are propagated from the output layer to the input layer in a back-propagation fashion.
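For illustration only, the per-layer norm ranking attributed to PFEC above can be sketched as follows; this is not the authors' implementation, just the criterion described in the text:

```python
import numpy as np

def pfec_filter_ranking(W_conv):
    """Rank one layer's filters (kernels) by the l2-norm of their weights;
    low-norm filters are the candidates for pruning.
    W_conv : array with one filter per leading index, shape (num_filters, ...)."""
    norms = np.linalg.norm(W_conv.reshape(W_conv.shape[0], -1), axis=1)
    return np.argsort(norms)   # filter indices, from least to most important
```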
(a) PFEC (b) NISP (c) DropConnect (d) Retrain-Free (e) iSparse
Figure 7: Mask matrices for the LeNet network conv2 layer for MNIST data (sparsification factor = 50%): dark regions indicate the edges that have been marked for sparsification; in (e) iSparse, the arrows point to edges that are subject to a different pruning decision than in Retrain-Free in (d) (green arrows point to edges that are kept in iSparse instead of being pruned, and red arrows point to edges that are sparsified in iSparse instead of being kept).
5.4 ACCURACY RESULTS
5.4.1 SPARSIFICATION OF PRE-TRAINED MODELS (sparsify-with)
In Figure 4, we first present top-1 and top-5 classification results for the ImageNet dataset on the VGG-16 network. As we see in Figure 4, iSparse provides the highest robustness to the degree of sparsification in the network. In particular, with iSparse, the network can be sparsified by 50% with roughly a 6% drop in top-1 accuracy and a 2% drop in top-5 accuracy, respectively. In contrast, the competitors see larger drops in accuracy. The closest competitor, Retrain-Free, suffers a loss in accuracy of roughly 16% and 6% for top-1 and top-5 classification, respectively. The other competitors suffer significant accuracy drops after a mere 10-20% sparsification.
Figures 6a and 6b show the top-1 classification accuracy results for the other models and datasets. As we see here, the above pattern holds for all configurations considered: iSparse provides the best robustness. It is interesting to note that DropConnect, NISP, and PFEC see especially drastic drops in accuracy for the VGG-16 network, and especially on the CIFAR data. This is likely because VGG-CIFAR is already relatively sparse (more than 20% sparsity, as opposed to roughly 7% for VGG-ImageNet and less than 1% for LeNet) and these three techniques are not able to introduce additional sparseness in a robust manner. In contrast, iSparse is able to introduce significant additional sparsification with minimal impact on accuracy.
Figure 7 provides the mask matrices created by the different algorithms to visually illustrate the key differences between the competitors. As we see in this figure, PFEC and NISP both sparsify input neurons. Consequently, their effect is to mask out entire columns of the weight matrix, and this prevents these algorithms from providing fine-grained adaptation during sparsification. DropConnect selects individual edges for sparsification, but only randomly, and this prevents the algorithm from providing sufficiently high robustness. Retrain-Free and iSparse both select edges in a fine-grained manner: Retrain-Free relies on edge weights, whereas iSparse complements edge weights with an edge significance measure that accounts for each edge's contribution to the final output within the overall network. As we see in Figure 7 (d) and (e), this results in some differences in the corresponding mask matrices, and these differences are sufficient to provide a significant boost in accuracy.
Table 2: Top-1 classification accuracy for different sparsified architectures and datasets (train-with)

LeNet
           MNIST               FMNIST              COIL20              COIL100             NORB
Factor     DC    RT    iS      DC    RT    iS      DC    RT    iS      DC    RT    iS      DC    RT    iS
Base (0%)  98.79 98.79 98.79   88.02 88.02 88.02   95.52 95.56 95.56   90.00 90.00 90.00   84.42 84.42 84.42
5%         98.90 97.99 98.09   88.23 88.70 88.75   95.00 94.16 94.72   89.83 90.12 90.61   84.51 84.75 85.03
10%        98.17 98.45 98.56   87.60 87.02 87.89   94.72 95.01 95.55   90.05 89.81 90.33   85.35 85.98 86.25
15%        98.76 98.79 98.92   88.14 87.43 87.96   95.00 95.21 95.50   89.50 89.73 90.00   83.91 84.12 84.73
20%        98.79 98.81 98.98   88.02 88.00 88.11   95.83 95.95 96.10   90.83 88.54 91.03   83.04 82.56 83.29
25%        98.85 98.85 98.80   87.62 87.54 88.19   95.27 94.66 95.80   90.11 89.98 90.22   85.33 85.01 85.61
30%        98.78 98.71 98.98   88.00 87.75 88.25   95.00 94.74 95.00   90.12 90.12 90.55   83.69 85.91 86.08
35%        98.73 98.73 98.79   88.14 86.99 88.06   94.72 94.99 95.27   90.11 91.35 91.99   83.98 83.01 84.81
40%        98.85 98.97 98.99   87.81 87.25 88.30   95.00 95.31 95.56   88.22 90.01 90.44   83.14 84.12 85.08
45%        98.91 98.95 98.99   87.45 87.11 87.82   95.29 95.04 95.56   89.45 89.87 90.23   85.47 85.12 86.02
50%        98.55 98.85 99.00   87.76 88.15 88.46   95.00 94.16 94.72   90.66 90.21 90.85   84.49 85.61 86.77

VGG-16
           CIFAR10             CIFAR20             CIFAR100            SVHN                GTSRB
Factor     DC    RT    iS      DC    RT    iS      DC    RT    iS      DC    RT    iS      DC    RT    iS
Base (0%)  84.14 84.14 84.14   66.96 66.96 66.96   55.09 55.09 55.09   91.33 91.33 91.33   97.68 97.68 97.68
5%         84.21 84.56 84.66   62.26 62.01 62.85   57.35 52.74 53.10   93.49 92.42 92.70   96.49 97.68 98.64
10%        84.35 84.33 84.35   63.63 66.53 67.33   51.32 51.01 51.48   95.00 92.56 92.96   95.03 91.09 93.24
15%        83.43 83.51 83.69   60.95 59.87 62.49   52.18 52.53 52.97   91.96 93.11 93.40   96.83 95.91 97.29
20%        82.83 85.04 85.81   61.79 63.45 65.03   51.67 50.01 50.52   92.63 93.34 93.74   93.02 94.41 95.76
25%        84.07 84.32 84.56   61.85 60.19 62.52   56.47 53.25 53.81   82.56 87.14 87.63   95.11 96.66 97.76
30%        83.41 84.17 84.48   62.14 58.39 62.14   51.19 54.05 54.50   91.95 91.97 92.26   97.18 97.43 98.56
35%        83.06 84.95 85.42   60.09 57.81 61.79   52.05 48.52 49.03   92.58 92.38 92.65   97.18 97.61 98.17
40%        83.96 82.01 82.21   66.34 65.09 66.62   50.19 55.11 55.68   88.79 91.92 92.31   95.61 96.91 97.61
45%        82.75 82.81 82.83   65.65 66.93 68.72   54.18 53.47 53.63   91.38 91.23 91.62   96.71 93.31 94.02
50%        83.44 84.15 84.53   65.37 65.04 68.82   51.65 51.78 52.56   82.94 87.61 87.68   94.31 95.91 97.31
Table 3: Robustness analysis for edge-based strategies vs. iSparse (train-with)

MNIST
           Tanh-Adam           Tanh-RMS            ReLU-Adam
Factor     DC    RT    iS      DC    RT    iS      DC    RT    iS
Base (0%)  98.82 98.82 98.82   98.79 98.79 98.79   98.89 98.89 98.89
5%         98.90 98.23 98.94   98.90 97.99 98.09   98.66 98.91 98.95
10%        99.02 98.75 99.15   98.17 98.45 98.56   98.90 98.75 98.86
15%        98.94 98.01 98.83   98.76 98.79 98.92   98.94 99.01 99.23
20%        98.76 98.41 98.97   98.79 98.81 98.86   99.05 99.32 99.53
25%        98.91 98.25 98.92   98.85 98.78 98.80   98.89 98.85 98.96
30%        99.01 98.04 98.92   98.78 98.71 98.78   98.88 98.74 98.94
35%        98.94 98.65 99.05   98.73 98.75 98.79   98.86 98.62 98.92
40%        99.12 98.88 99.57   98.85 98.97 98.99   98.83 98.23 98.95
45%        98.84 98.42 99.04   98.91 98.95 98.99   99.07 99.01 99.48
50%        99.07 98.01 99.12   98.55 98.85 99.00   98.98 98.16 98.89

CIFAR10
           Tanh-Adam           Tanh-RMS            ReLU-Adam
Factor     DC    RT    iS      DC    RT    iS      DC    RT    iS
Base (0%)  87.66 87.66 87.66   81.95 81.95 81.95   84.14 84.14 84.14
5%         88.87 89.56 89.66   81.17 85.94 86.14   84.21 84.56 84.66
10%        84.75 88.21 88.31   86.75 87.01 87.42   84.35 84.33 84.35
15%        81.78 81.89 82.49   88.75 88.99 89.04   83.43 83.51 83.69
20%        85.64 85.95 86.21   87.10 89.10 89.23   82.83 85.04 85.81
25%        86.10 87.68 87.86   84.81 87.31 87.74   84.07 84.32 84.56
30%        87.07 86.21 86.30   80.20 83.45 83.56   83.41 84.17 84.48
35%        82.38 86.43 86.48   85.91 86.75 86.93   83.06 84.95 85.42
40%        86.76 88.63 88.77   82.17 84.10 87.44   83.96 82.01 82.21
45%        83.90 85.52 85.61   85.79 86.40 89.42   82.75 82.81 82.83
50%        80.45 79.21 79.59   83.89 86.42 86.52   83.44 84.15 84.53
5.4.2 SPARSIFICATION DURING TRAINING (train-with)
Table 2 presents accuracy results for the scenarios where iSparse (iS) is used to sparsify the model during the training process. The table also considers DropConnect (DC) and Retrain-Free (RT) as alternatives. As we see in the table, for both network architectures and under most sparsification rates, the output-informed sparsification approach underlying iSparse leads to networks with the highest classification accuracies.
5.5 ROBUSTNESS TO VARIATIONS IN NETWORK ELEMENTS
In this section, we study the effect of variations in network elements. In particular, we compare the performance of iSparse (iS) against DropConnect (DC) and Retraining-Free (RT) for different hidden activation functions and network optimizers. Table 3 presents classification performance for networks that rely on different activation functions (tanh and ReLU) and optimizers (Adam and RMSProp). As we see in the table, iSparse remains the alternative that provides the best classification accuracy under the different activation/optimization configurations.

Figure 8: Robustness to the layer order while sparsifying the network with iSparse (train-with)
5.6 ROBUSTNESS TO VARIATIONS IN SPARSIFICATION ORDER
We next investigate the performance of iSparse under different orders in which the network layers are sparsified. In particular, we consider three sparsification orders: (a) input-to-output layer order: this is the most intuitive approach, as it does not require edge significance scores to be revised based on sparsified edges in layers closer to the input; (b) output-to-input layer order: in this case, edges in layers closer to the network output are sparsified first, which implies that edge significance scores must be updated in the earlier layers of the network to account for changes in the overall edge contributions to the network; (c) random layer order: in this case, the order of the layers to be sparsified is selected randomly. Figure 8 presents the sparsification results for different orders, datasets, and sparsification rates. As we see in the figure, the performance of iSparse is not sensitive to the sparsification order of the network layers.
5.7 IMPACT OF SPARSIFICATION ON CLASSIFICATION TIME
In Figure 5, we investigate the impact of edge sparsification on the classification time. As we see in this figure, the edge sparsification rate has a direct impact on the classification time of the resulting model. When we consider that iSparse allows for 30-50% edge sparsification without any major impact on classification accuracies, this indicates that iSparse has the potential to provide significant performance gains. What is especially interesting to note in Figure 5 is that, while the three sparsification methods, iSparse, DropConnect, and Retraining-Free, all have the same number of sparsified edges for a given sparsification factor, the proposed iSparse approach leads to the lowest execution times among the three alternatives. We argue that this is because the output-informed sparsification provided by iSparse allows for more efficient computations in the sparsified space⁵.
6 CONCLUSIONS
In this paper, we proposed iSparse, a novel output-informed framework for edge sparsification in deep neural networks (DNNs). In particular, we proposed a novel edge significance score that quantifies the significance of each edge in the network relative to its contribution to the final network output. iSparse leverages this edge significance score to minimize the redundancy in the network by sparsifying those edges that contribute least to the final network output. Experiments with 11 benchmark datasets and two well-known network architectures have shown that the proposed iSparse framework enables 30-50% network sparsification with minimal impact on model classification accuracy. Experiments have also shown that iSparse is highly robust to variations in network elements (activation and model optimization functions) and that iSparse provides a much better accuracy/classification-time trade-off than its competitors.
⁵The hardware configuration can be found in Section 5.1.
REFERENCES
Amir H. Ashouri, Tarek S. Abdelrahman, and Alwyn Dos Remedios. Fast on-the-fly retraining-free sparsification of convolutional neural networks. NIPS, 2018.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
Wenlin Chen et al. Compressing neural networks with the hashing trick. In ICML, 2015.
Yoojin Choi et al. Universal deep neural network compression. NIPS, 2018.
François Chollet et al. Keras, 2015.
Jia Deng et al. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
Misha Denil, Babak Shakibi, Laurent Dinh, Nando De Freitas, et al. Predicting parameters in deep learning. In NIPS, 2013.
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. ICLR, 2019.
Yash Garg and K. Selçuk Candan. RackNet: Robust allocation of convolutional kernels in neural networks for image classification. ICMR, 2019.
Stephen Grossberg. Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1988.
Song Han et al. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. 2016.
Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural networks. In NIPS, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Geoffrey Hinton, Li Deng, Dong Yu, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
Andrej Karpathy et al. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
Steve Lawrence, C. Lee Giles, Ah Chung Tsoi, and Andrew D. Back. Face recognition: A convolutional neural-network approach. TNN, 1997.
Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Shape, Contour and Grouping in CV, 1999.
Yann LeCun, Fu Jie Huang, and Léon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, 2004.
Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. ICLR, 2016.
Ming Liang and Xiaolin Hu. Recurrent convolutional neural network for object recognition. In CVPR, 2015.
Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Low precision arithmetic for deep learning. In ICLR, 2015.
Sameer A. Nene, Shree K. Nayar, Hiroshi Murase, et al. Columbia object image library (COIL-20). 1996a.
Sameer A. Nene, Shree K. Nayar, Hiroshi Murase, et al. Columbia object image library (COIL-100). 1996b.
Yuval Netzer et al. Reading digits in natural images with unsupervised feature learning. In NIPS, 2011.
Kalin Ovtcharov et al. Accelerating deep convolutional neural networks using specialized hardware. Microsoft Research Whitepaper, 2015.
Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 1901.
Mohammad Rastegari et al. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
Giorgio Roffo et al. Infinite feature selection. In ICCV, 2015.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.
Nitish Srivastava et al. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
Johannes Stallkamp et al. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. NN, 2012.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 1996.
Andrei Nikolaevich Tikhonov. On the regularization of ill-posed problems. In Doklady Akademii Nauk, 1963.
Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In ICML, 2013.
Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3-19, 2018.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms, 2017.
Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy. Deep convolutional neural networks on multichannel time series for human activity recognition. In IJCAI, 2015.
Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I. Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S. Davis. NISP: Pruning networks using neuron importance score propagation. In CVPR, 2018.
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
Table 4: Model accuracies under different strategies for initializing the neuron importance score, for the CIFAR10 dataset and different sparsification factors; Base = 0%

           VGG                   ResNet
Factor     ID    PCA   InfFS     ID    PCA   InfFS
Base (0%)  84.14 84.14 84.14     73.61 73.61 73.61
5%         82.94 83.96 84.66     73.25 73.20 74.13
10%        81.59 83.85 84.35     73.46 73.86 74.28
15%        81.24 83.45 83.69     73.32 73.57 74.12
20%        82.67 84.65 85.81     74.52 74.21 75.01
25%        81.42 83.98 84.56     74.86 74.45 74.95
30%        80.64 82.96 84.48     73.72 73.80 74.36
35%        80.45 81.56 85.42     73.53 73.99 74.01
40%        78.48 79.54 82.21     74.94 75.09 75.27
45%        80.23 81.24 82.83     73.06 72.39 73.39
50%        80.85 82.68 84.53     74.19 74.19 74.21
A APPENDIX
A.1 ABLATION STUDIES FOR NEURON SCORE INITIALIZATION
In this section, we evaluate the performance of the iSparse framework with different types of neuron score initialization strategies. We evaluate the following scoring strategies for the final-layer neuron scores (see Table 4):
– Identity (ID), which assigns the same importance score to every output neuron;
– PCA-based scoring;
– InfFS (Roffo et al., 2015), the infinite feature selection strategy used by iSparse.
In Table 4, we observe that the iSparse framework with InfFS as the mechanism to determine the neuron scores for the final layer leads to better performance than the other strategies. Intuitively, this is because InfFS effectively identifies the importance of individual features (neurons) while considering all possible subsets of features. We believe this is a highly desirable property: conventional feature selection strategies, such as PCA, rank features based on their discriminatory power irrespective of how features relate to each other. Similarly, using the same importance score (identity) treats each feature in a mutually exclusive fashion. In contrast, the way InfFS approaches the "feature selection" problem is highly appropriate for NNs, where the neuron output for the true label will be high, but we will also see low/small outputs for the false labels.
A.2 PERFORMANCE EVALUATION FOR THE RESNET ARCHITECTURE
In this section, we evaluate iSparse for the ResNet-18 architecture (He et al., 2016) for both the sparsify-with (no retraining after sparsification) and train-with (retraining after sparsification) configurations. Results are presented in Tables 5 and 6.
A.2.1 SPARSIFY-WITH FOR RESNET
As observed in Table 5, iSparse for ResNet demonstrates significant robustness to edge sparsification. iSparse outperforms DropConnect and Retrain-Free, thus highlighting the importance of accounting for an edge's significance to the subsequent network along with its edge weight.
A.2.2 TRAIN-WITH FOR RESNET
As we can see in Table 6, the iSparse framework outperforms competitors such as Dropout, L1, DropConnect, Retraining-Free, and the Lottery Ticket Hypothesis (Frankle & Carbin, 2019). iSparse's ability to account for edge significance based on both edge weights and edge contributions to the final network output leads to more robust and superior results.
Table 5: Model accuracy vs. sparsification factor (sparsify-with) for the ResNet architecture - CIFAR10

Factor     DropConnect  Retraining-Free  iSparse
Base (0%)  73.61        73.61            73.61
5%         69.19        73.61            73.61
10%        63.48        73.61            73.61
15%        52.89        73.52            73.58
20%        41.38        72.85            73.47
25%        32.60        71.45            72.98
30%        21.23        71.07            72.77
35%        16.38        70.53            72.36
40%        12.88        68.41            72.05
45%        10.45        67.98            71.85
50%        10.00        65.48            71.01
Table 6: Model accuracy vs. sparsification factor (train-with) for the ResNet architecture - CIFAR10

Factor     Dropout  L1     DropConnect  Retraining-Free  Lottery  iSparse
Base (0%)  73.61    73.98  73.61        73.61            73.61    73.61
5%         73.83    N.A.   73.72        74.35            74.10    74.43
10%        74.52    N.A.   73.22        74.49            74.31    74.78
15%        73.18    N.A.   73.69        73.82            74.40    74.12
20%        75.42    N.A.   69.84        73.08            73.38    75.01
25%        75.97    N.A.   72.43        73.75            73.77    74.95
30%        74.69    N.A.   72.43        74.10            73.39    74.36
35%        73.50    N.A.   72.59        74.07            73.99    74.01
40%        67.89    N.A.   70.38        73.15            74.10    75.27
45%        71.52    N.A.   67.07        72.37            74.15    73.39
50%        66.20    N.A.   71.35        72.69            74.13    74.21
A.2.3 ADDITIONAL TRAIN-WITH RESULTS FOR VGG
In Table 7, we present model accuracies for iSparse along with the Dropout, L1, DropConnect, Retraining-Free, and Lottery Ticket Hypothesis sparsification strategies. iSparse's informed sparsification leads to superior performance in this model architecture as well.
Table 7: Model accuracy vs. sparsification factor (train-with) for the VGG architecture - CIFAR10

Factor     Dropout  L1     DropConnect  Retraining-Free  Lottery  iSparse
Base (0%)  84.14    84.51  84.14        84.14            84.14    84.14
5%         85.13    N.A.   84.21        84.56            84.29    84.66
10%        84.77    N.A.   84.35        84.33            84.23    84.35
15%        86.81    N.A.   83.43        83.51            84.88    83.69
20%        77.81    N.A.   82.83        85.04            84.68    85.81
25%        73.42    N.A.   84.07        84.32            84.87    84.56
30%        76.16    N.A.   83.41        84.17            83.48    84.48
35%        79.28    N.A.   83.06        84.95            84.79    85.42
40%        77.51    N.A.   83.96        82.01            81.37    82.21
45%        73.51    N.A.   83.96        82.01            81.37    82.21
50%        69.21    N.A.   83.44        84.15            81.59    84.53