Learning SVM Classifiers with Indefinite Kernels
Suicheng Gu and Yuhong Guo
Department of Computer and Information Sciences
Temple University
Philadelphia, PA 19122, USA
yuhong@temple.edu
Abstract
Recently, training support vector machines with indefinite kernels has attracted great attention in the machine learning community. In this paper, we tackle this problem by formulating a joint optimization model over SVM classifications and kernel principal component analysis. We first reformulate the kernel principal component analysis as a general kernel transformation framework, and then incorporate it into the SVM classification to formulate a joint optimization model. The proposed model has the advantage of making consistent kernel transformations over training and test samples. It can be used for both binary classification and multi-class classification problems. Our experimental results on both synthetic data sets and real world data sets show the proposed model can significantly outperform related approaches.
Introduction
Support vector machines (SVMs) with kernels have attracted a lot of attention due to their good generalization performance. The kernel function in a standard SVM produces a similarity kernel matrix over samples, which is required to be positive semi-definite. This positive semi-definite property of the kernel matrix ensures that SVMs can be efficiently solved using convex quadratic programming. However, in many applications the underlying similarity functions do not produce positive semi-definite kernels (Chen et al. 2009). For example, the sigmoid kernels with various values of the hyper-parameters (Lin and Lin 2003), the hyperbolic tangent kernels (Smola, Ovari, and Williamson 2000), and the kernels produced by protein sequence similarity measures derived from Smith-Waterman and BLAST scores (Saigo et al. 2004) are all indefinite kernels. Training SVMs with indefinite kernels poses a challenging optimization problem since convex solutions for standard SVMs are not valid in this learning scenario.
Learning with indefinite kernels has been addressed by many researchers in various ways in the literature. The simplest and most popular way to address the problem is to identify a corresponding positive semi-definite kernel
matrix by modifying the spectrum of the indefinite kernel matrix (Wu, Chang, and Zhang 2005). Several simple representative spectrum modification methods have been proposed in the literature, including "clip" (or "denoise") which neglects the negative eigenvalues (Graepel et al. 1999; Pekalska, Paclik, and Duin 2001), "flip" which flips the sign of the negative eigenvalues (Graepel et al. 1999), and "shift" which shifts all the eigenvalues by a positive constant (Roth et al. 2003). More sophisticated approaches simultaneously derive a positive semi-definite kernel matrix from the given indefinite kernel matrix and train an SVM classifier within unified optimization frameworks (Chen and Ye 2008; Chen, Gupta, and Recht 2009; Luss and d'Aspremont 2007).
A few other works use indefinite similarity matrices as kernels directly by formulating variants of the standard SVM optimization problem. In (Lin and Lin 2003), an SMO-type method is proposed to find stationary points for the nonconvex dual formulation of SVMs with non-positive semi-definite sigmoid kernels. This method, however, is based on the assumption that there is a corresponding reproducing kernel Hilbert space to ensure valid SVM formulations. The work in (Ong et al. 2004) interprets learning with an indefinite kernel as minimizing the distance between two convex hulls in a pseudo-Euclidean space. In (Pekalska and Haasdonk 2008), the authors extended the kernel linear and quadratic discriminants to indefinite kernels. The approach in (Guo and Schuurmans 2009) minimizes the sensitivity of the classifier to perturbations of the training labels, which yields an upper bound of classical SVMs.
In this paper, we propose a novel joint optimization model over SVM classifications and kernel principal component analysis to address the problem of learning with indefinite kernels. We first reformulate the kernel principal component analysis (KPCA) into a general kernel transformation framework which can incorporate the spectrum modification methods. Next we incorporate this framework into the SVM classification to formulate a joint max-min optimization model. Training SVMs with indefinite kernels can then be conducted by solving the joint optimization problem using an efficient iterative algorithm. Different from many related approaches, our proposed model has the advantage of making consistent transformations over training and test samples. The experimental results on both synthetic data sets and real world data sets demonstrate that the proposed model can significantly outperform the spectrum modification methods, the robust SVMs, and the kernel Fisher's discriminant on indefinite kernels (IKFD).
Related Work
The dual formulation of standard SVMs is a linearly constrained quadratic program, which provides a natural form to address nonlinear classification using kernels

\[
\max_{\alpha} \;\; \alpha^\top e - \frac{1}{2}\,\alpha^\top Y K_0 Y \alpha \tag{1}
\]
\[
\text{s.t.} \;\; \alpha^\top \mathrm{diag}(Y) = 0, \quad 0 \le \alpha \le C
\]

where Y is a diagonal matrix of the labels, and K_0 is a kernel matrix. The positive semi-definite property of K_0 ensures that problem (1) is a convex optimization problem, and thus a global optimal solution can be found efficiently. However, when K_0 is indefinite, one loses the underlying theoretical support for the kernel methods and the optimization problem (1) is no longer convex. For the nonconvex optimization problem (1) with indefinite kernels, a sequential minimal optimization (SMO) algorithm with a simple modification can still converge to a stationary point, but not necessarily a global maximum (Lin and Lin 2003).
Instead of solving the quadratic optimization problem (1) with indefinite kernels directly, many approaches focus on deriving a surrogate positive semi-definite kernel matrix K from the indefinite kernel K_0. A simple and popular way to obtain such a surrogate kernel matrix is to modify the spectrum of K_0 using methods such as clip, flip, and shift (Wu, Chang, and Zhang 2005). Let K_0 = U Λ U^⊤, where Λ = diag(λ_1, ..., λ_N) is the diagonal matrix of the eigenvalues, and U is the orthogonal matrix of corresponding eigenvectors. The clip method produces an approximate positive semi-definite kernel K_clip by clipping all negative eigenvalues to zero,

\[
K_{clip} = U \,\mathrm{diag}\big(\max(\lambda_1, 0), \ldots, \max(\lambda_N, 0)\big)\, U^\top. \tag{2}
\]

The flip method flips the sign of the negative eigenvalues of K_0 to form a positive semi-definite kernel matrix K_flip, such that

\[
K_{flip} = U \,\mathrm{diag}\big(|\lambda_1|, \ldots, |\lambda_N|\big)\, U^\top. \tag{3}
\]

The shift method obtains the positive semi-definite kernel matrix K_shift by shifting the whole spectrum of K_0 by the minimum required amount η, such that

\[
K_{shift} = U \,\mathrm{diag}\big(\lambda_1 + \eta, \ldots, \lambda_N + \eta\big)\, U^\top. \tag{4}
\]
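For concreteness, the three spectrum modifications in Eqs. (2)-(4) can be implemented in a few lines of NumPy. This is an illustrative sketch rather than the authors' code; the function name and the choice of the shift amount are our own.

```python
import numpy as np

def modify_spectrum(K0, method="clip"):
    """Return a PSD surrogate of a symmetric indefinite kernel matrix K0
    via spectrum clip, flip, or shift (Eqs. (2)-(4))."""
    lam, U = np.linalg.eigh(K0)          # K0 = U diag(lam) U^T
    if method == "clip":                 # zero out the negative eigenvalues
        lam_new = np.maximum(lam, 0.0)
    elif method == "flip":               # flip the sign of the negative eigenvalues
        lam_new = np.abs(lam)
    elif method == "shift":              # shift the spectrum by the minimum required amount
        lam_new = lam + max(-lam.min(), 0.0)
    else:
        raise ValueError(method)
    return (U * lam_new) @ U.T
```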
These spectrum modification methods are straightforward and simple to use. However, some information valuable for the classification model might be lost by simply modifying the spectrum of input kernels. Therefore, approaches that simultaneously train the classification model and learn the approximated positive semi-definite kernel matrix have been developed (Chen and Ye 2008; Chen, Gupta, and Recht 2009; Luss and d'Aspremont 2007). In (Luss and d'Aspremont 2007) a robust SVM with indefinite kernels was proposed, which treats the indefinite kernel as a noisy observation of the true positive semi-definite kernel and solves the following convex optimization problem
\[
\max_{\alpha} \min_{K} \;\; \alpha^\top e - \frac{1}{2}\,\alpha^\top Y K Y \alpha + \rho\, \|K - K_0\|_F^2 \tag{5}
\]
\[
\text{s.t.} \;\; \alpha^\top \mathrm{diag}(Y) = 0, \quad 0 \le \alpha \le C, \quad K \succeq 0
\]

where a positive semi-definite kernel K is introduced to approximate the original K_0, and ρ controls the magnitude of the penalty on the distance between K and K_0. An analysis of the indefinite SVM in (5) was conducted in (Ying, Campbelly, and Girolami 2009), which shows that the objective function is smoothed by the penalty term. In (Chen and Ye 2008), Chen and Ye reformulated (5) as a semi-infinite quadratically constrained linear program, which can be solved iteratively to find a global optimal solution. They further employed an additional pruning strategy to improve the efficiency of the algorithm.
Many approaches mentioned above treat training and test samples in an inconsistent way. That is, the training is conducted on the proxy positive semi-definite kernel matrix K, but the predictions on test samples are still conducted using the original unmodified similarities. This is an obvious drawback that could degrade the performance of the produced classification model. Wu et al. (Wu, Chang, and Zhang 2005) addressed this problem for the case of spectrum modifications by recomputing the spectrum modification on the matrix that augments K_0 with similarities on test samples. Chen et al. (Chen, Gupta, and Recht 2009) addressed the problem of learning SVMs with indefinite kernels using the primal form of Eq. (5) while further restricting K to be a spectrum modification of K_0. They then obtained the consistent treatment of training and test samples by solving a positive semi-definite minimization problem over the distance between the augmented K_0 and K matrices. The model we propose in this paper, however, can address this inconsistency problem in a more principled way without solving additional optimization problems.
Kernel Principal Component Analysis
In this section, we present the kernel principal component analysis (KPCA) as a kernel transformation method and then demonstrate its connection to spectrum modification methods. Let X = {x_i}_{i=1}^N denote the training samples, where x_i ∈ R^n. To employ kernel techniques, a mapping function φ: R^n → R^f can be deployed to map the data to a typically high dimensional space. The training samples in this mapped space can be represented as Φ = [φ(x_1), ..., φ(x_N)], and the standard kernel matrix can be viewed as the inner product of the sample matrix in the high dimensional space, K_0 = Φ^⊤Φ.
KPCA
The kernel principal component analysis (Schölkopf, Smola, and Muller 1999) can be solved by minimizing the distance between the high dimensional data matrix and the reconstructed data matrix

\[
\min_{W} \;\; \big\| \Phi - W W^\top \Phi \big\|_F^2, \quad \text{s.t.} \;\; W^\top W = I_d \tag{6}
\]

where W, an f×d matrix, can be viewed as a transformation matrix that transforms the data samples to a lower d-dimensional subspace, Z = W^⊤Φ; ‖·‖_F denotes the Frobenius norm; and I_d denotes a d×d identity matrix. This minimization problem is equivalent to

\[
\max_{W} \;\; \mathrm{tr}(W^\top \Phi \Phi^\top W), \quad \text{s.t.} \;\; W^\top W = I_d \tag{7}
\]

which has a closed form solution W = U_d, where U_d contains the top d eigenvectors of ΦΦ^⊤. Moreover, we have ΦΦ^⊤ W Λ_d^{-1} = W, where Λ_d is a d×d diagonal matrix with its diagonal values being the top d eigenvalues of ΦΦ^⊤. Here we assume the top d eigenvalues are not zeros. Let V = Φ^⊤ W Λ_d^{-1}; then we have W = ΦV, and (7) can be reformulated into

\[
\max_{V} \;\; \mathrm{tr}(V^\top K_0 K_0 V), \quad \text{s.t.} \;\; V^\top K_0 V = I_d. \tag{8}
\]
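Concretely, substituting W = ΦV and K_0 = Φ^⊤Φ into (7) yields both the objective and the constraint of (8):

\[
\mathrm{tr}(W^\top \Phi\Phi^\top W) = \mathrm{tr}(V^\top \Phi^\top\Phi\, \Phi^\top\Phi\, V) = \mathrm{tr}(V^\top K_0 K_0 V),
\qquad
W^\top W = V^\top \Phi^\top\Phi\, V = V^\top K_0 V = I_d.
\]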
After solving the optimization problem above for the V matrix, the transformation matrix W and the low dimensional map of the training samples Z can be obtained consequently. Then the transformed kernel matrix for the training samples in the low dimensional space can be produced as

\[
K_v = Z^\top Z = \Phi^\top W W^\top \Phi = K_0 V V^\top K_0. \tag{9}
\]

Although the standard kernel principal component analysis assumes the kernel matrix K_0 to be positive semi-definite, the optimization problem (8) we derived above can be generalized to the case of indefinite kernels if V is guaranteed to be a real valued matrix by selecting a proper d value. Even when K_0 is an indefinite kernel matrix, K_v is still guaranteed to be positive semi-definite for real valued V. Thus equation (9) provides a principled strategy to transform an indefinite kernel matrix K_0 into a positive semi-definite matrix K_v with a properly selected V. Moreover, given a new sample x, it can be transformed by W^⊤φ(x) = V^⊤Φ^⊤φ(x) = V^⊤k_0, where k_0 denotes the original similarity vector between the new sample x and the training samples. The transformed similarity vector between the new sample x and the training samples is k_v = K_0 V V^⊤ k_0. By using this transformation strategy, we can easily transform the test samples and the training samples in a consistent way.
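As an illustration (not the authors' code), the following sketch builds one valid choice of V for Eq. (8), namely the KPCA-style solution obtained from the top d positive eigenpairs of K_0, and applies the transformation of Eq. (9) consistently to training and test similarities. Function names are ours.

```python
import numpy as np

def kpca_transform_matrix(K0, d):
    """Build V from the top-d positive eigenpairs of a symmetric (possibly
    indefinite) K0, so that V^T K0 V = I_d holds (Eq. (8))."""
    lam, U = np.linalg.eigh(K0)
    order = np.argsort(lam)[::-1]                    # eigenvalues in descending order
    idx = [i for i in order[:d] if lam[i] > 0]       # keep only positive eigenvalues
    return U[:, idx] / np.sqrt(lam[idx])             # columns u_i / sqrt(lambda_i)

def transform_train_and_test(K0, k0_test, d):
    """Transform the training kernel (Eq. (9)) and a test similarity vector
    k0 consistently: K_v = K0 V V^T K0 and k_v = K0 V V^T k0."""
    V = kpca_transform_matrix(K0, d)
    P = K0 @ V @ V.T                                 # shared projection K0 V V^T
    return P @ K0, P @ k0_test
```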
Connections to Spectrum Modifications
The kernel transformation strategy we developed above is a general framework. By selecting different V matrices, various kernel transformations can be produced. We now show that the spectrum modification methods reviewed in the previous section can be equivalently reexpressed as kernel transformations in the form of Eq. (9) with proper V matrices.

Assume K_0 = U Λ U^⊤, where U is an orthogonal matrix and Λ is a diagonal matrix of real eigenvalues, that is, Λ = diag(λ_1, ..., λ_N). The clip spectrum modification method can be reexpressed as

\[
K_{clip} = K_0 V_{clip} V_{clip}^\top K_0 \tag{10}
\]

for a constructed V_clip matrix

\[
V_{clip} = U\, |\Lambda|^{-\frac{1}{2}}\, \mathrm{diag}\big( I_{\{\lambda_1 > 0\}}, \ldots, I_{\{\lambda_N > 0\}} \big) \tag{11}
\]

where |Λ| = diag(|λ_1|, ..., |λ_N|), and I_{·} is an indicator function. The flip method can be reexpressed as

\[
K_{flip} = K_0 V_{flip} V_{flip}^\top K_0 \tag{12}
\]
\[
\text{for} \quad V_{flip} = U\, |\Lambda|^{-\frac{1}{2}}. \tag{13}
\]

Similarly, the shift method is reexpressed as

\[
K_{shift} = K_0 V_{shift} V_{shift}^\top K_0 \tag{14}
\]
\[
\text{for} \quad V_{shift} = U\, |\Lambda|^{-1} (\Lambda + \eta I)^{\frac{1}{2}}. \tag{15}
\]
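The equivalences in Eqs. (10)-(15) are easy to verify numerically. The sketch below (illustrative only, assuming a symmetric K_0 with no zero eigenvalues) constructs the corresponding V matrices; the earlier modify_spectrum sketch can be used to cross-check that K_0 V V^⊤ K_0 reproduces the directly modified kernels.

```python
import numpy as np

def spectrum_modification_V(K0, method="clip", eta=None):
    """Construct V so that K0 V V^T K0 equals the clip/flip/shift surrogate
    (Eqs. (11), (13), (15)); assumes symmetric K0 with no zero eigenvalues."""
    lam, U = np.linalg.eigh(K0)
    if method == "clip":
        scale = (lam > 0) / np.sqrt(np.abs(lam))      # |Lambda|^{-1/2} times indicator
    elif method == "flip":
        scale = 1.0 / np.sqrt(np.abs(lam))            # |Lambda|^{-1/2}
    elif method == "shift":
        eta = max(-lam.min(), 0.0) if eta is None else eta
        scale = np.sqrt(lam + eta) / np.abs(lam)      # |Lambda|^{-1} (Lambda + eta I)^{1/2}
    else:
        raise ValueError(method)
    return U * scale                                   # V = U diag(scale)

# e.g. np.allclose(K0 @ V @ V.T @ K0, modify_spectrum(K0, "flip")) should hold.
```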
Training SVMs with Indefinite Kernels
In this section, we address the problem of training SVMs with indefinite kernels by developing a joint optimization model over SVM classifications and KPCA. Our model simultaneously trains an SVM classifier and identifies a proper transformation matrix V. We present this model for binary classifications first and then extend it to address multi-class classification problems. An iterative optimization algorithm is developed to solve the joint optimization problem.
Binary Classifications
We first extend the standard two-class SVMs to formulate a joint optimization problem of SVMs and the kernel principal component analysis

\[
\min_{W, w, b, \xi} \;\; \frac{1}{2} w^\top w + C \sum_i \xi_i + \gamma\, \big\| \Phi - W W^\top \Phi \big\|_F^2 \tag{16}
\]
\[
\text{s.t.} \;\; y_i \big( w^\top W^\top \Phi(:, i) + b \big) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \forall i; \qquad W^\top W = I_d,
\]
where y_i ∈ {+1, −1} is the label of the i-th training sample, Φ(:, i) is the i-th column of the general feature matrix representation, C is the standard tradeoff parameter in SVMs, and γ is a parameter to control the tradeoff between the SVM objective and the reconstruction error of KPCA. Previous approaches in (Chen and Ye 2008; Chen, Gupta, and Recht 2009; Luss and d'Aspremont 2007) use the distance between the proxy kernel K and the original K_0 as a regularizer for SVMs. The joint optimization model proposed here can be similarly interpreted as employing the distance between the proxy and original feature vectors as a regularizer. However, for the problem of learning with indefinite kernels, the feature vectors are not real valued vectors and they are actually only available implicitly through kernel matrices. Therefore, we need to reformulate the optimization problem in terms of kernels.
By exploiting the derivation results in the previous section, we propose to replace the distance regularizer in (16) with the kernel transformation regularizer of (8) to obtain an alternative joint optimization in terms of the input kernel

\[
\min_{w, b, \xi, V} \;\; \frac{1}{2} w^\top w + C \sum_i \xi_i - \gamma\, \mathrm{tr}(V^\top K_0 K_0 V) \tag{17}
\]
\[
\text{s.t.} \;\; y_i \big( w^\top V^\top K_0(:, i) + b \big) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \forall i;
\]
\[
V^\top K_0 V = I_d, \qquad K_0 V V^\top K_0 \succeq 0.
\]

When V is constrained to be a real valued matrix, the constraint K_0 V V^⊤ K_0 ⪰ 0 can be dropped. We will assume V has real values from now on. More conveniently, following the dual formulation of standard SVMs, we consider the regularized dual SVM formulation and focus on the following optimization problem

\[
\max_{\alpha} \min_{V} \;\; \alpha^\top e - \frac{1}{2}\,\alpha^\top Y K_0 V V^\top K_0 Y \alpha - \gamma\, \mathrm{tr}(V^\top K_0 K_0 V) \tag{18}
\]
\[
\text{s.t.} \;\; \alpha^\top \mathrm{diag}(Y) = 0, \quad 0 \le \alpha \le C, \quad V^\top K_0 V = I_d,
\]

where Y = diag(y_1, ..., y_N) is a diagonal matrix.
Multi-class Classifications
Multi-class classification problems can be solved by training multiple binary SVMs (Hsu and Lin 2002). In this paper, we deploy the 1-vs-1 strategy for multi-class classification. A simple deployment of this strategy requires training k(k−1)/2 binary classifiers, one for each pair of classes, for a k-class problem. This means that an optimization problem (18) has to be solved for each pair of classes, and a different proxy kernel K_ab would be learned for each pair of classes {a, b} by learning a different V_ab. However, with different proxy kernels K_ab for each pair of classes, the consistent transformation of samples for the overall multi-class classification cannot be maintained. To ensure a data sample has a consistent representation in all binary classification problems, we construct a framework that uses a single target (proxy) kernel matrix K_v for all binary classifications by introducing a set of sub-kernel transformation matrices {D_ab}_{1≤a<b≤k}, and address all the k(k−1)/2 binary classifications in one joint optimization.
Assume the training set has N samples and each class a has N_a samples. We first consider a given pair of classes a and b. Let N_ab = N_a + N_b be the sample size of the class set {a, b}, let L^{ab} = [ℓ^{ab}_1, ..., ℓ^{ab}_{N_ab}] denote a 1×N_ab vector whose j-th entry is the index value of the j-th sample of the class set {a, b} in the original training set, and let K_ab denote the proxy kernel matrix of the samples in these two classes. Thus the proxy kernel K_v of all training samples is an N×N matrix, and K_ab, an N_ab×N_ab matrix, is its sub-matrix. We now construct an indicator matrix D_ab ∈ {0, 1}^{N×N_ab} as below to build a connection between these two kernel matrices

\[
D_{ab}(i, j) =
\begin{cases}
1, & \text{if } \ell^{ab}_j = i \\
0, & \text{otherwise.}
\end{cases}
\]
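As a small illustration (names are ours, not from the paper), D_ab can be built directly from the training labels:

```python
import numpy as np

def sub_kernel_selector(labels, a, b):
    """Build the indicator matrix D_ab in {0,1}^(N x N_ab) that selects the
    samples of classes a and b, so that K_ab = D_ab^T K_v D_ab (Eq. (19))."""
    labels = np.asarray(labels)
    idx = np.flatnonzero((labels == a) | (labels == b))   # the index vector L^{ab}
    D = np.zeros((labels.size, idx.size))
    D[idx, np.arange(idx.size)] = 1.0
    return D

# K_ab = D.T @ Kv @ D simply picks out the rows and columns of K_v that
# correspond to the samples of classes a and b.
```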
Given D_ab, the kernel matrix K_ab of classes a and b can be computed as

\[
K_{ab} = K_v(L^{ab}, L^{ab}) = D_{ab}^\top K_v D_{ab}. \tag{19}
\]

Thus D_ab can be viewed as a sub-kernel transformation matrix. Note that K_v = K_0 V V^⊤ K_0. Then we can combine the k(k−1)/2 classifications into a joint optimization problem
\[
\max_{\alpha} \min_{V} \;\; -\gamma\, \mathrm{tr}(V^\top K_0 K_0 V) + \sum_{1 \le a < b \le k} \Big( \alpha_{ab}^\top e - \frac{1}{2}\, \alpha_{ab}^\top Y_{ab} D_{ab}^\top K_0 V V^\top K_0 D_{ab} Y_{ab} \alpha_{ab} \Big) \tag{20}
\]
\[
\text{s.t.} \;\; \alpha_{ab}^\top \mathrm{diag}(Y_{ab}) = 0, \;\; \forall\, 1 \le a < b \le k;
\]
\[
0 \le \alpha_{ab} \le C, \;\; \forall\, 1 \le a < b \le k;
\]
\[
V^\top K_0 V = I_d
\]

where α denotes the collection {α_ab}_{1≤a<b≤k}, and Y_ab is a diagonal matrix whose diagonal entries are the binary labels for the binary classification problem over classes {a, b}. When k = 2, the binary classification problem in (18) can be recovered from (20).
An Iterative Algorithm
The objective of the outer maximization problem in (20) is a pointwise minimum of a family of concave quadratic functions of α, and hence is a concave function of α. Thus (20) is a concave maximization problem over α subject to linear constraints (Boyd and Vandenberghe 2004). In this section, we develop an iterative algorithm to solve the optimization problem (20). In each iteration of the algorithm, V and α are alternately optimized. When V is fixed, we can divide the maximization problem into k(k−1)/2 standard binary SVMs and optimize each α_ab independently. When {α_ab}_{1≤a<b≤k} are fixed, V can be computed by solving the following optimization problem

\[
\max_{V} \;\; \mathrm{tr}(V^\top K_0 M K_0 V) \quad \text{s.t.} \;\; V^\top K_0 V = I_d \tag{21}
\]

where

\[
M = \frac{1}{2} \sum_{1 \le a < b \le k} \big( D_{ab} Y_{ab} \alpha_{ab} \alpha_{ab}^\top Y_{ab} D_{ab}^\top \big) + \gamma I_N. \tag{22}
\]

The above problem can be solved via the following generalized eigenvalue problem,

\[
K_0 M K_0 \mathbf{v} = \lambda K_0 \mathbf{v}. \tag{23}
\]

Note that for positive γ values, M is guaranteed to be positive definite. Thus we will solve the following eigenvalue problem instead

\[
M K_0 M K_0 \mathbf{v} = \lambda M K_0 \mathbf{v}, \tag{24}
\]
\[
M K_0 \mathbf{u} = \lambda \mathbf{u}, \tag{25}
\]

for u = M K_0 v. Moreover, we assume K_0 is invertible.¹ Let U = [u_1, ..., u_d] be the eigenvectors corresponding to the d largest eigenvalues of M K_0; then V = K_0^{-1} M^{-1} U. Finally, the optimal solution of (21) can be recovered by setting V* = [v*_1, ..., v*_d], where v*_i = v_i / \sqrt{v_i^\top K_0 v_i}. Here the renormalization is necessary to ensure the orthogonality constraints in (21) for indefinite K_0.
Determining feasible d values. To ensure each v_i has real values, we should select d to guarantee that each u_i satisfies u_i^⊤ K_0 u_i > 0. To determine d, we have the following lemma.

¹It is easy to remove the zero eigenvalues of K_0 by simply adding a tiny positive/negative diagonal matrix εI_N without changing the distribution of K_0's eigenvalues.

Lemma 1. For each eigenpair (λ_i, u_i) of M K_0, if λ_i > 0, then we have u_i^⊤ K_0 u_i > 0.

Proof: Since M K_0 u_i = λ_i u_i, we have

\[
u_i^\top K_0 M K_0 u_i = \lambda_i\, u_i^\top K_0 u_i.
\]

Then

\[
u_i^\top K_0 u_i = \big( u_i^\top K_0 M K_0 u_i \big) / \lambda_i.
\]

Since both K_0 M K_0 and K_0 are symmetric matrices, u_i has real values. Moreover, K_0 M K_0 is positive semi-definite according to (22). Therefore u_i^⊤ K_0 u_i = (u_i^⊤ K_0 M K_0 u_i)/λ_i > 0.
According to Lemma 1, the top d eigenvectors {v*_i}_{1≤i≤d} have real values if d ≤ d_0, where d_0 is the number of positive eigenvalues of M K_0. As we discussed before, M is guaranteed to be positive definite for positive γ values, and we assume K_0 is invertible. It is easy to show that M K_0 and M^{1/2} K_0 M^{1/2} have the same eigenvalues by using a transformation similar to the one from (23) to (24). According to the Sylvester law of inertia (Golub and Loan 1996), M^{1/2} K_0 M^{1/2} and K_0 have the same inertia, and thus have the same number of positive eigenvalues. Therefore the value d_0 can be determined directly from K_0.
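To make the alternation concrete, the following is a minimal sketch of the algorithm for the binary case (k = 2) of problem (18). It is not the authors' implementation: the α-step is delegated to scikit-learn's SVC with a precomputed kernel, the V-step uses the symmetrized eigenvalue problem M^{1/2} K_0 M^{1/2} discussed above, and all names, tolerances, and defaults are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def update_V(K0, M, d):
    """V-step of Eq. (21): eigendecompose M^{1/2} K0 M^{1/2}, keep the top-d
    positive eigenpairs (feasible d), recover V = K0^{-1} M^{-1} U, and
    renormalize so that V^T K0 V = I_d."""
    m_lam, m_vec = np.linalg.eigh(M)                       # M is PD for gamma > 0
    M_half = (m_vec * np.sqrt(np.maximum(m_lam, 0))) @ m_vec.T
    lam, W = np.linalg.eigh(M_half @ K0 @ M_half)
    order = np.argsort(lam)[::-1]
    cols = [i for i in order[:d] if lam[i] > 1e-10]        # keep positive eigenvalues only
    U = M_half @ W[:, cols]                                # u_i solves M K0 u = lam u
    V = np.linalg.solve(K0, np.linalg.solve(M, U))         # V = K0^{-1} M^{-1} U
    norms = np.sqrt(np.einsum('ij,jk,ki->i', V.T, K0, V))  # v_i^T K0 v_i
    return V / norms

def svm_ca_binary(K0, y, C=1.0, gamma=1.0, d=10, iters=10):
    """Alternating optimization for the binary problem (18)."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    V = update_V(K0, gamma * np.eye(N), d)                 # initialization (alpha = 0)
    for _ in range(iters):
        Kv = K0 @ V @ V.T @ K0                             # current PSD proxy kernel
        svm = SVC(C=C, kernel="precomputed").fit(Kv, y)    # alpha-step
        alpha = np.zeros(N)
        alpha[svm.support_] = np.abs(svm.dual_coef_[0])    # dual_coef_ stores y_i * alpha_i
        M = 0.5 * np.outer(y * alpha, y * alpha) + gamma * np.eye(N)   # Eq. (22) with k = 2
        V = update_V(K0, M, d)                             # V-step
    return svm, V
```

A test similarity vector k_0 is then classified consistently through k_v = K_0 V V^⊤ k_0, as described in the KPCA section.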
Experiments
We conducted experiments on both synthetic data sets and real world data sets to compare the proposed method, denoted as SVM-CA, with a few spectrum modification methods (clip, flip, and shift), the robust SVM (Luss and d'Aspremont 2007), and the kernel Fisher's discriminant on indefinite kernels (IKFD) (Pekalska and Haasdonk 2008). We used the robust SVM code found on the authors' website². In the experiments below, the regularization parameters of SVM-CA, the robust SVM and IKFD, the parameter C in SVMs, and the reduced dimensionality d in SVM-CA were all selected by 10-fold cross-validation from the following candidate sets: the regularization parameters and C from {0.01, 0.1, 1, 10, 100}, and d from {2, 3, 5, 8, 13, 21, 34, 55}.
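As a rough illustration of this model-selection protocol (not the authors' code), the grid search could be wired up as follows, reusing the svm_ca_binary sketch from the previous section; the split strategy and scoring here are assumptions.

```python
from itertools import product
import numpy as np
from sklearn.model_selection import StratifiedKFold

def select_hyperparameters(K0, y, gammas=(0.01, 0.1, 1, 10, 100),
                           Cs=(0.01, 0.1, 1, 10, 100),
                           ds=(2, 3, 5, 8, 13, 21, 34, 55)):
    """10-fold CV grid search over (gamma, C, d) for the binary SVM-CA sketch."""
    y = np.asarray(y)
    best, best_err = None, np.inf
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for gamma, C, d in product(gammas, Cs, ds):
        errs = []
        for tr, te in cv.split(np.zeros(len(y)), y):
            K_tr = K0[np.ix_(tr, tr)]
            svm, V = svm_ca_binary(K_tr, y[tr], C=C, gamma=gamma, d=d)
            # transform the held-out similarities consistently: k_v = K0 V V^T k_0
            Kv_te = K0[np.ix_(te, tr)] @ V @ V.T @ K_tr
            errs.append(np.mean(svm.predict(Kv_te) != y[te]))
        if np.mean(errs) < best_err:
            best, best_err = (gamma, C, d), np.mean(errs)
    return best
```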
Experiments on Synthetic Data Sets
We constructed four 3-class 2-dimensional data sets, each with 300 samples. For each data set, the three classes, each with 100 samples, are generated using three Gaussian distributions with the covariance matrix Σ = diag(σ², σ²) and mean vectors μ_1 = (3, −3), μ_2 = (−3, 3) and μ_3 = (3√3, 3√3), respectively. We generate the similarity kernel matrix by adding additive white Gaussian noise to the linear kernel matrix, K_0(i, j) = x_i^⊤ x_j + z_ij, where z_ij ~ N(0, ε). With the Gaussian noise, the kernel K_0 is not positive semi-definite. By considering different σ² and ε values, synthetic data sets with different properties can be generated. We considered two values for σ², σ² = 2 and σ² = 4, and two different ε values, ε = 20 and ε = 100. With a larger σ² value, the generated data is harder to separate. With a larger ε, the kernel matrix K_0 can be more indefinite. With different pairs of (σ², ε), we obtained four synthetic data sets. The characteristics of the data sets are given in Table 1, where λ_min and λ_max are the smallest and largest eigenvalues of each synthetic indefinite kernel matrix K_0, respectively, and Σλ⁺_i and Σλ⁻_j are the sums of the positive and negative eigenvalues of K_0, respectively.

²http://www.tau.ac.il/~rluss/
We ran experiments on the four synthetic data sets comparing SVM-CA to the other five approaches. Our experimental results in terms of classification error rates are reported in Table 1. These results are averaged over 50 runs using 80% of the data as training set and the remainder as test set. It is apparent that the values of σ² and ε determine the hardness of the classification problems, and thus affect the performance of these approaches. When either σ² or ε gets larger, the error rate of each approach increases. When ε is small, the spectrum modification methods, especially the spectrum clip, yield good performance. When ε is large, which means the kernel K_0 is far from being positive semi-definite, the spectrum modifications are inefficient at capturing the information provided by the indefinite kernels and thus produce inferior results. Among the three spectrum modification approaches, the clip method obtains the lowest error rates on all four data sets. The robust SVM is highly related to the spectrum clip, and it yields results similar to the clip method. Both the IKFD and SVM-CA approaches obtain much better results than the other four approaches. They produce good results even on the data sets with large ε and large σ². Overall, the proposed SVM-CA produced the best results compared to all the other approaches.
Experiments on Real World Data Sets
We then conducted experiments on several real world data sets used for learning with indefinite kernels, including a few data sets used in (Chen et al. 2009), i.e., yeast, amazon, aural sonar, voting, patrol and protein, and a data set collected in (Pekalska and Haasdonk 2008), i.e., catcortex. These data sets are represented by similarity (or dissimilarity) matrices produced using different similarity measures. For example, a sequence-alignment similarity measure is used for the protein data set, the Smith-Waterman E-value is used to measure the similarity between two protein sequences for the yeast data set, etc. We also used the glass data set obtained from the UCI machine learning repository (Newman et al. 1998), for which we used a sigmoid kernel to compute an indefinite kernel matrix K_0. Together these data sets represent a diverse set of indefinite similarities. We assume a symmetric similarity kernel matrix K_0 in our proposed model. When the original matrix K_0 given in the data is not symmetric, we reset it as K_0 = (K_0^⊤ + K_0). When the original matrix K_0 in the data represents dissimilarity, we reset it as K_0 = m − K_0, where m is the largest entry of the original matrix K_0. There are six binary (two-class) and four multi-class data sets in total. We computed the eigenvalue information of the kernel matrix K_0 for each data set as well. The indefiniteness measure |Σλ⁻_i / Σλ⁺_j| obtained for each data set is as follows: (Yeast5v7: 0.56), (Yeast5v12: 0.56), (Yeast7v12: 0.57), (Amazon: 0.01), (Aural Sonar: 0.26), (Voting: 0.00), (Protein: 0.25), (Glass: 0.00), (Patrol: 0.36) and (Catcortex: 0.10). Here the value 0.00 denotes a positive value smaller than 0.005.

Table 1: Characteristics of the four synthetic data sets and the average classification errors (%) of the six comparison methods.

Data      σ²   ε     λ_min   |λ_min/λ_max|   |Σλ⁻/Σλ⁺|   Clip    Flip    Shift   Robust SVM   IKFD   SVM-CA
Synth 1   2    20    -143    .02             .47         1.50    2.00    15.83   1.53         1.20   0.72
Synth 2   2    100   -693    .11             .82         9.67    11.00   22.33   9.05         2.43   1.83
Synth 3   4    20    -140    .02             .44         4.00    4.83    21.50   4.11         1.69   1.17
Synth 4   4    100   -702    .11             .80         16.17   16.67   38.17   15.24        4.70   3.50
Table 2: Comparison results in terms of classification error rates (%) on binary classification data sets. The means and standard deviations of the error rates over 50 random repeats are reported.

Dataset       Yeast5v7   Yeast5v12   Yeast7v12   Amazon     Aural Sonar   Voting
Clip+SVM      40.0±1.1   20.0±1.3    25.5±1.2    10.3±0.9   11.2±0.8      3.0±0.3
Flip+SVM      46.0±0.6   17.8±1.2    22.0±1.0    11.0±0.9   16.8±0.9      3.2±0.3
Shift+SVM     35.0±0.5   42.8±1.5    46.7±1.9    16.0±0.8   17.3±0.9      5.8±0.5
IKFD          34.2±1.0   17.5±1.0    14.0±1.0    15.6±0.9   8.4±0.6       5.7±0.3
Robust SVM    29.0±1.0   18.0±1.0    15.0±0.9    8.8±0.8    11.0±0.9      3.3±0.3
SVM-CA        25.0±0.9   10.7±0.8    10.5±0.8    9.5±0.9    8.6±0.6       2.7±0.3
We compared our proposed SVM-CA to the other five approaches on both the six binary data sets and the four multi-class data sets. The experimental results are reported in Table 2 and Table 3, respectively. These results are produced over 50 runs using randomly selected 90% of the data samples as training set and the remainder as test set. Both the average classification error rates and their standard deviations are reported. Among the three spectrum modification algorithms, spectrum clip obtained the lowest error rates on five of the ten data sets, while spectrum flip and shift obtained the lowest error rates on four and one data sets, respectively. The robust SVM slightly outperformed the spectrum modifications on eight data sets. The IKFD outperformed the spectrum modifications on five data sets. Our proposed SVM-CA clearly outperformed all the other approaches and achieved the lowest classification error rates on four of the six binary data sets and on all four multi-class data sets. On data sets such as Protein, Patrol and Catcortex, where the |Σλ⁻_i / Σλ⁺_j| values are large, the improvements achieved by the proposed approach over the other SVM training methods are especially significant. These results on both synthetic and real world data sets demonstrate the effectiveness of the proposed joint optimization model.

Convergence Experiments. We also conducted experiments to study the convergence property of the iterative algorithm proposed above. The experiments are conducted on two data sets, protein and catcortex. In each experiment, we plot the objective values of the SVM-CA formulation in (20) after each update of V and α. The plots are shown in Figure 1. We can see that the objective values of the maximization and minimization sub-problems gradually converge within 10 iterations on the two data sets. This suggests that the iterative algorithm we proposed can effectively solve the target convex optimization problem.
Table 3: Comparison results in terms of classification error rates (%) on multi-class classification data sets. The means and standard deviations of the error rates over 50 random repeats are reported.

Dataset       Protein    Glass      Patrol     Catcortex
Clip+SVM      6.3±0.7    41.1±1.2   48.6±1.5   10.5±2.0
Flip+SVM      4.0±0.7    39.4±1.1   44.8±1.4   13.5±2.3
Shift+SVM     5.5±0.7    38.3±0.9   51.4±1.5   49.0±4.0
IKFD          8.2±0.9    43.3±1.1   25.7±1.8   12.5±1.9
Robust SVM    16.4±1.1   39.1±1.0   31.3±1.4   9.4±1.7
SVM-CA        2.5±0.5    37.3±0.8   12.4±0.8   4.5±1.4
Conclusion
In this paper, we investigated the problem of training SVMs with indefinite kernels. We first reformulated the kernel principal component analysis (KPCA) into a kernel transformation model and demonstrated its connections to spectrum modification methods with indefinite kernels. We then presented a novel joint optimization model over SVM classifications and kernel principal component analysis to conduct SVM training with indefinite kernels assisted by kernel component analysis. The proposed model can be used for both binary classifications and multi-class classifications. An efficient iterative algorithm was proposed to solve the proposed joint optimization problem. Moreover, the proposed approach can make consistent transformations over training and test samples. Our experimental results on both synthetic data sets and real world data sets demonstrated the proposed approach can significantly outperform the spectrum modification methods, the robust SVMs and the kernel Fisher's discriminant on indefinite kernels (IKFD).

Figure 1 (plots not shown): Convergence of SVM-CA on the (a) Protein and (b) Catcortex data sets. Obj(V) and Obj(α) denote the objective values after updating V and α, respectively, at each iteration.
References
Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.
Chen, J., and Ye, J. 2008. Training SVM with indefinite kernels. In Proceedings of the International Conference on Machine Learning (ICML).
Chen, Y.; Garcia, E.; Gupta, M.; Rahimi, A.; and Cazzanti, L. 2009. Similarity-based classification: Concepts and algorithms. Journal of Machine Learning Research 10:747-776.
Chen, Y.; Gupta, M.; and Recht, B. 2009. Learning kernels from indefinite similarities. In Proceedings of the International Conference on Machine Learning (ICML).
Golub, G., and Loan, C. V. 1996. Matrix Computations. Johns Hopkins University Press.
Graepel, T.; Herbrich, R.; Bollmann-Sdorra, P.; and Obermayer, K. 1999. Classification on pairwise proximity data. In Advances in Neural Information Processing Systems (NIPS).
Guo, Y., and Schuurmans, D. 2009. A reformulation of support vector machines for general confidence functions. In Proceedings of the Asian Conference on Machine Learning.
Hsu, C., and Lin, C. 2002. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13(2):415-425.
Lin, H., and Lin, C. 2003. A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Technical report.
Luss, R., and d'Aspremont, A. 2007. Support vector machine classification with indefinite kernels. In Advances in Neural Information Processing Systems (NIPS).
Newman, D.; Hettich, S.; Blake, C.; and Merz, C. 1998. UCI repository of machine learning datasets.
Ong, C.; Mary, X.; Canu, S.; and Smola, A. 2004. Learning with non-positive kernels. In Proceedings of the International Conference on Machine Learning (ICML).
Pekalska, E., and Haasdonk, B. 2008. Kernel discriminant analysis for positive definite and indefinite kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(6):1017-1032.
Pekalska, E.; Paclik, P.; and Duin, R. 2001. A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research 2:175-211.
Roth, V.; Laub, J.; Kawanabe, M.; and Buhmann, J. 2003. Optimal cluster preserving embedding of nonmetric proximity data. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(12):1540-1551.
Saigo, H.; Vert, J.; Ueda, N.; and Akutsu, T. 2004. Protein homology detection using string alignment kernels. Bioinformatics 20(11):1682-1689.
Schölkopf, B.; Smola, A.; and Muller, K. 1999. Kernel principal component analysis. In Advances in Kernel Methods - Support Vector Learning, 327-352.
Smola, A.; Ovari, Z.; and Williamson, R. C. 2000. Regularization with dot-product kernels. In Advances in Neural Information Processing Systems (NIPS).
Wu, G.; Chang, E.; and Zhang, Z. 2005. An analysis of transformation on non-positive semidefinite similarity matrix for kernel machines. In Proceedings of the International Conference on Machine Learning (ICML).
Ying, Y.; Campbelly, C.; and Girolami, M. 2009. Analysis of SVM with indefinite kernels. In Advances in Neural Information Processing Systems (NIPS).