
Feature Extraction for Nonparametric Discriminant Analysis

Mu ZHU and Trevor J. HASTIE

In high-dimensional classification problems, one is often interested in finding a few important discriminant directions in order to reduce the dimensionality. Fisher's linear discriminant analysis (LDA) is a commonly used method. Although LDA is guaranteed to find the best directions when each class has a Gaussian density with a common covariance matrix, it can fail if the class densities are more general. Using a likelihood-based interpretation of Fisher's LDA criterion, we develop a general method for finding important discriminant directions without assuming the class densities belong to any particular parametric family. We also show that our method can be easily integrated with projection pursuit density estimation to produce a powerful procedure for (reduced-rank) nonparametric discriminant analysis.

Key Words: Classification; Density estimation; Dimension reduction; LDA; Projection pursuit; Reduced-rank model; SAVE.

1. INTRODUCTION

In a typical classification problem, sometimes known as pattern recognition, one has a training set $\{y_i, x_i\}_{i=1}^{n}$, where $y_i \in \{1, 2, \ldots, K\}$ is the class label and $x_i \in \mathbb{R}^d$, a vector of predictors. When $d$ is large, it is often the case that information relevant to the separation of the classes is contained in just a few directions $\alpha_1, \alpha_2, \ldots, \alpha_M \in \mathbb{R}^d$, where $M$ is much smaller than $d$. Our goal is to identify these important directions from the data. These directions are often known as discriminant directions or the important linear features for classification. In order to do so, one must answer a key question: what is an appropriate measure of class separation?

Fisher's linear discriminant analysis (LDA) (Fisher 1936) works with one such measure;

Mu Zhu is Assistant Professor, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada (E-mail: [email protected]). Trevor J. Hastie is Professor, Department of Statistics, Stanford University, Stanford, CA 94305 (E-mail: [email protected]).

© 2003 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America

Journal of Computational and Graphical Statistics, Volume 12, Number 1, Pages 101–120
DOI: 10.1198/1061860031220


the best feature is defined as
\[
\arg\max_{\alpha \in \mathbb{R}^d} \frac{\alpha^T B \alpha}{\alpha^T W \alpha},
\]
where $B$ is the between-class covariance matrix and $W$, the within-class covariance matrix. The optimal solution is the first eigenvector of $W^{-1}B$. In general, the matrix $W^{-1}B$ has $\min(K-1, d)$ nonzero eigenvalues. Hence we can identify up to this number of features, with decreasing importance. In fact, given the first $(m-1)$ discriminant directions, the $m$th direction is simply
\[
\arg\max_{\alpha \in \mathbb{R}^d} \frac{\alpha^T B \alpha}{\alpha^T W \alpha} \quad \text{subject to } \alpha^T W \alpha_j = 0 \;\; \forall\, j < m.
\]

These discriminant directions are important. For example, classification using only the leading discriminant directions (e.g., reduced-rank LDA) can often improve classification performance on test data. Leading discriminant directions also allow us to make low-dimensional summary plots of the data. Zhu (2001) showed that discriminant directions are equivalent to the concept of ordination axes in correspondence analysis, which has a wide range of applications beyond pattern recognition, for example, in areas such as environmental ecology, information retrieval, and personalization.

However, LDA is not always guaranteed to find the best discriminant directions. Figure 1 shows a pathological case for LDA. The top panel shows the first two coordinates of a simulated dataset in $\mathbb{R}^{20}$ (the "multi-modality data"). Class 1 is simulated from a standard multivariate Gaussian, while classes 2 and 3 are mixtures of two symmetrically shifted standard Gaussians. In the remaining 18 coordinates, Gaussian noise is added for all three classes. The middle panel shows a two-dimensional projection produced by LDA. LDA clearly fails to recover the important features for classification. The reason why LDA fails in this case is that the class centroids coincide. This points out a fundamental restriction of LDA as a feature extraction method. This article develops a more general method for finding important discriminant directions and subspaces. We also show that our method can be easily integrated with projection pursuit density estimation to produce a powerful procedure for (reduced-rank) nonparametric discriminant analysis.

2. LITERATURE REVIEW

The pathology of LDA as illustrated above has long been recognized. Devijver and Kittler (1982, sec. 9.8) specifically pointed out that there are two types of discriminatory information: one where all such information is contained in the class centroids (the ideal case for LDA), and one where all such information is contained in the class variances (the pathological case above). A specific solution (details omitted here) is then proposed to deal with the second type, with a comment from the authors that a direct feature extraction method "capable of extracting as much discriminatory information as possible regardless of its type" would require a "complex criterion" which would be "difficult to define" (Devijver and Kittler 1982, p. 339).


Figure 1. Top: Data are 20-dimensional. Only the first two coordinates are shown. The remaining 18 coordinates are simulated from the same standard Gaussian distribution for all three classes. Middle: Two-dimensional discriminant subspace identified by LDA. Bottom: Two-dimensional discriminant subspace identified by recursively maximizing LR($\alpha$), using feature removal.


Recently, Cook and Yin (2001) proposed a new method using the sliced average variance estimator (SAVE), which, the authors showed, is capable of extracting both types of discriminatory information simultaneously. In Section 4, we compare our method with SAVE to gain more insight into the nature of these different methods.

3. GENERALIZED FEATURE EXTRACTION

3.1 THE GENERALIZED CRITERION

To better understand Fisher's criterion, let us consider it from a likelihood point of view. Suppose that, given $y = k$, the predictor vector $x$ from class $k$ has density function $p_k(x)$. Consider
\[
H_0: p_k = p \text{ for all } k = 1, 2, \ldots, K \qquad \text{versus} \qquad H_A: p_k \ne p \text{ for some } k = 1, 2, \ldots, K.
\]
In this framework, a natural candidate for measuring class differences in a fixed direction $\alpha$ is the (marginal) generalized log-likelihood-ratio statistic:
\[
\mathrm{LR}(\alpha) = \log \frac{\displaystyle \max_{p_k} \prod_{k=1}^{K} \prod_{x_j \in C_k} p_k^{(\alpha)}(\alpha^T x_j)}{\displaystyle \max_{p_k = p} \prod_{k=1}^{K} \prod_{x_j \in C_k} p^{(\alpha)}(\alpha^T x_j)}, \tag{3.1}
\]
where $p_k^{(\alpha)}(\cdot)$ is the marginal density along the projection defined by $\alpha$ for class $k$; $p^{(\alpha)}(\cdot)$ is the corresponding marginal density under the null hypothesis that the classes share the same density function; and the notation "$x_j \in C_k$" means the $j$th observation belongs to class $k$. Then, it can be shown via straightforward algebraic calculations (Zhu 2001) that the criterion used in LDA is a special case of $\mathrm{LR}(\alpha)$.

Result 1. If $p_k(x)$ is (or is estimated by) the $N(\mu_k, \Sigma)$ density function for classes $k = 1, 2, \ldots, K$, then choosing discriminant directions by maximizing $\mathrm{LR}(\alpha)$ is equivalent to Fisher's LDA.

This simple result leads to two observations. First, it shows that when each class has a Gaussian density with a common covariance matrix (a situation where all the discriminatory information is contained in the class centroids), Fisher's linear discriminant directions are optimal. Second, it suggests a natural way to generalize Fisher's linear discriminant direction when the class densities are more general (cases where the important discriminatory information is not simply contained in the class centroids alone), for example, when they have different covariance matrices or, more generally, when they are non-Gaussian.


More specifically, for arbitrary class density functions, our goal is to seek $\alpha$ that maximizes
\[
\mathrm{LR}(\alpha) = \sum_{k=1}^{K} \sum_{x_j \in C_k} \log \hat p_k^{(\alpha)}(\alpha^T x_j) \;-\; \sum_{k=1}^{K} \sum_{x_j \in C_k} \log \hat p^{(\alpha)}(\alpha^T x_j), \tag{3.2}
\]
where $\hat p_k$ denotes the MLE of $p_k$ and $\hat p$, the MLE of $p$.

Note that if $\alpha' = c\alpha$ for some multiple $c$, then $\mathrm{LR}(\alpha') = \mathrm{LR}(\alpha)$. Hence it suffices to constrain $\alpha$ to be a unit vector, that is, $\|\alpha\| = 1$. This means the effective search space for our maximization problem above is the unit ball in $\mathbb{R}^d$, not the entire space $\mathbb{R}^d$. This is also true for LDA except that, conventionally, LDA uses the restriction $\alpha^T W \alpha = 1$ instead of $\|\alpha\| = 1$; but aside from a normalizing constant, these are equivalent.
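To make the criterion concrete, the following sketch evaluates LR($\alpha$) as in (3.2) for a given direction from a labeled sample. It is not the authors' implementation (which uses Locfit and maximum likelihood fits); the Gaussian kernel density estimates from scipy.stats.gaussian_kde and all variable names are our own stand-ins.

```python
import numpy as np
from scipy.stats import gaussian_kde

def LR(alpha, X, y):
    """Generalized log-likelihood-ratio LR(alpha) of (3.2), approximated with
    1-D Gaussian kernel density estimates along the projection alpha."""
    alpha = alpha / np.linalg.norm(alpha)      # it suffices to use a unit vector
    z = X @ alpha                              # projected data, one value per observation
    pooled = gaussian_kde(z)                   # estimate of the common density p
    loglik_alt, loglik_null = 0.0, 0.0
    for k in np.unique(y):
        zk = z[y == k]
        per_class = gaussian_kde(zk)           # estimate of p_k along alpha
        loglik_alt += np.sum(np.log(per_class(zk)))
        loglik_null += np.sum(np.log(pooled(zk)))
    return loglik_alt - loglik_null

# Toy check: two Gaussian classes separated along the first coordinate only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-1, 0], 1, size=(200, 2)),
               rng.normal([+1, 0], 1, size=(200, 2))])
y = np.repeat([1, 2], 200)
print(LR(np.array([1.0, 0.0]), X, y))   # large: classes differ along e1
print(LR(np.array([0.0, 1.0]), X, y))   # near zero: no class difference along e2
```

For this toy sample, LR is large along the separating coordinate and near zero along the pure-noise coordinate, which is exactly the behavior the criterion is designed to capture.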

3.2 THE OPTIMIZATION PROBLEM IN DETAIL

It is important to note that the statistic $\mathrm{LR}(\alpha)$ is defined generally. We are free to restrict $p_k$ to any desirable family of density functions. Often, one would like to work with a parametric family. Then for fixed $\alpha$, $\hat p_k^{(\alpha)}$ can be evaluated quite simply by maximum likelihood. Here, however, we show that the criterion $\mathrm{LR}(\alpha)$ is general enough so that even if one chooses to work with flexible but more difficult nonparametric models, it is still possible to use $\mathrm{LR}(\alpha)$ to guide the search for informative directions for classification, provided that one is willing to accept the extra computational cost.

For the optimization problem posed above, standard methods such as Newton's method are readily applicable in theory, because both the gradient and the Hessian can be estimated explicitly. To simplify the notation, we write $f_k$ in place of $\hat p_k^{(\alpha)}$ and $f$ in place of $\hat p^{(\alpha)}$. Then
\[
g(r) \;\stackrel{\mathrm{def}}{=}\; \frac{\partial \mathrm{LR}}{\partial \alpha_r}
= \sum_{k=1}^{K} \sum_{x_j \in C_k} x_{rj} \left( \frac{f_k'(\alpha^T x_j)}{f_k(\alpha^T x_j)} - \frac{f'(\alpha^T x_j)}{f(\alpha^T x_j)} \right) \tag{3.3}
\]
and, by writing $z_j = \alpha^T x_j$,
\[
H(r, s) \;\stackrel{\mathrm{def}}{=}\; \frac{\partial^2 \mathrm{LR}}{\partial \alpha_r \partial \alpha_s}
= \sum_{k=1}^{K} \sum_{x_j \in C_k} x_{rj} \left( \frac{f_k''(z_j) f_k(z_j) - (f_k'(z_j))^2}{(f_k(z_j))^2} - \frac{f''(z_j) f(z_j) - (f'(z_j))^2}{(f(z_j))^2} \right) x_{sj}. \tag{3.4}
\]
Therefore, to estimate $g$ and $H$, we need only estimate univariate marginal density functions and their first two derivatives. Various methods are available, for example, the kernel method (e.g., Silverman 1986) or the local likelihood method (e.g., Loader 1999). We do this once conditionally on each class to obtain $f_k, f_k', f_k''$, and once unconditionally over the entire training sample to obtain $f, f', f''$. Although it involves estimating the derivatives too, we shall refer to this operation simply as the "density estimation step."
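As a rough sketch of this density estimation step, the derivatives can be obtained in closed form when a plain Gaussian kernel estimator is substituted for the Locfit fits used in the paper; the bandwidth h and the helper function below are our own choices, not the paper's.

```python
import numpy as np

def gauss_kde_with_derivs(data, h):
    """Gaussian kernel density estimate of 1-D `data` with bandwidth h.
    Returns callables f, f', f'' (the derivatives of the Gaussian kernel
    are available analytically)."""
    data = np.asarray(data, dtype=float)

    def phi(u):                      # standard normal density
        return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

    def f(t):
        u = (np.asarray(t)[..., None] - data) / h
        return phi(u).mean(axis=-1) / h

    def f1(t):                       # first derivative of the estimate
        u = (np.asarray(t)[..., None] - data) / h
        return (-u * phi(u)).mean(axis=-1) / h**2

    def f2(t):                       # second derivative of the estimate
        u = (np.asarray(t)[..., None] - data) / h
        return ((u**2 - 1) * phi(u)).mean(axis=-1) / h**3

    return f, f1, f2

# Example: fit the pooled sample once, and each class separately, as in the text.
rng = np.random.default_rng(1)
z_all = rng.normal(size=500)
f, f1, f2 = gauss_kde_with_derivs(z_all, h=0.4)
print(f(0.0), f1(0.0), f2(0.0))
```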

Page 6: Feature Extraction for Nonparametric Discriminant Analysishastie/Papers/dimensionReduction.pdf · Feature Extraction for Nonparametric Discriminant Analysis MuZHUand Trevor J.HASTIE

106 M. ZHU AND T. HASTIE

In practice, one seldom uses Newton's method in its vanilla form. Some modification to the Hessian matrix is necessary, for example, when the problem is nonconvex, to ensure that one is moving in an ascent direction with each iteration. A popular variation, known as quasi-Newton, is to construct Hessian-like matrices from the gradient that are always negative-definite. More details can be found in Gill, Murray, and Wright (1981).

3.3 THE DENSITY ESTIMATION STEP

In our implementation, we used C. Loader's Locfit library (Loader 1999), which is based on local likelihood theory. To focus on the main statistical ideas, we choose not to digress and go into the theory of local likelihood estimation here. Most density estimation methods, including local likelihood, have an equivalent kernel representation and, therefore, can be viewed as a special kernel estimator, at least on the conceptual level. The choice of Locfit as our density estimation module, however, is largely pragmatic. Locfit implements a rather elaborate set of evaluation structures and interpolation strategies: the density function is only evaluated at a few strategically selected points, and its values at other points are obtained by interpolation, thereby significantly reducing the amount of computation. For more details, we refer the reader to Loader (1999, sec. 12.2).

3.4 LOCAL MAXIMA

The existence of local maxima adds an extra difficulty to numerical optimization. Here, this difficulty can be effectively mitigated by choosing relatively large bandwidths or equivalent smoothing parameters in the density estimation step. Figure 2 shows the function $\mathrm{LR}(\alpha)$ for a two-dimensional problem simulated for the purposes of illustration, as $\alpha = (\cos\theta, \sin\theta)$ goes around the unit circle, that is, for $\theta \in (0, 2\pi)$; recall that it suffices to restrict $\alpha$ to be a unit vector. The function is estimated using nonparametric density models with different bandwidth parameters. With a large bandwidth, the local maximum essentially disappears. It is also important to point out that extremely accurate estimation of the first two derivatives of the univariate density is not necessary, as long as it can be ensured that the Newton iterates are generally moving in the correct direction. Alternatively, we are currently investigating the use of some stochastic search algorithms which are theoretically guaranteed to converge to the global maximum even for relatively rough objective functions. For smooth objective functions, of course, convergence occurs much faster with Newton-type algorithms.
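A rough way to see the bandwidth effect, in the spirit of Figure 2 but not reproducing it, is to scan LR($\alpha$) over $\alpha = (\cos\theta, \sin\theta)$ at two bandwidths; the simulated data and the scipy-based estimates below are our own choices, not the paper's setup.

```python
import numpy as np
from scipy.stats import gaussian_kde

def LR_theta(theta, X, y, bw):
    """LR(alpha) for alpha = (cos theta, sin theta), with a fixed bandwidth
    factor `bw` passed to gaussian_kde (larger bw gives a smoother estimate)."""
    z = X @ np.array([np.cos(theta), np.sin(theta)])
    pooled = gaussian_kde(z, bw_method=bw)
    out = 0.0
    for k in np.unique(y):
        zk = z[y == k]
        out += np.sum(np.log(gaussian_kde(zk, bw_method=bw)(zk)))
        out -= np.sum(np.log(pooled(zk)))
    return out

# Two classes that differ along the second coordinate only.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, -1], [1.0, 0.5], size=(150, 2)),
               rng.normal([0, +1], [1.0, 0.5], size=(150, 2))])
y = np.repeat([1, 2], 150)

thetas = np.linspace(0, 2 * np.pi, 100)
for bw in (0.2, 0.9):   # small versus large bandwidth
    curve = [LR_theta(t, X, y, bw) for t in thetas]
    print(f"bw={bw}: argmax at theta={thetas[int(np.argmax(curve))]:.2f}")
```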

3.5 MULTIPLE FEATURES

In this section, we briefly address the problem of finding multiple discriminant directions. We present two of our favored strategies.

3.5.1 Orthogonalization

Suppose $\{\alpha_1, \alpha_2, \ldots, \alpha_{m-1}\}$ are the first $m-1$ discriminant directions; then
\[
\alpha_m = \arg\max \mathrm{LR}(\alpha) \quad \text{subject to } \alpha^T \Phi \alpha_j = 0 \;\; \forall\, j < m,
\]
following standard practices in classical multivariate statistics.


Figure 2. LR computed on simulated data as depicted in the top panel, using nonparametric density estimates with different bandwidths. A low bandwidth leads to a rough estimate with a local maximum, while a large bandwidth smoothes away the local maximum.


Here "$x^T \Phi y = 0$" means $x$ and $y$ are orthogonal with respect to the metric $\Phi$. The main advantage of this strategy is its simplicity. The main disadvantage is the difficulty in justifying the choice of $\Phi$. In classical multivariate statistics, one often assumes that the data follow a multi-dimensional Gaussian distribution, $N(\mu, \Sigma)$, so there is a natural coordinate system (i.e., $\Phi = \Sigma$) in which it is interesting and meaningful to focus on features that are orthogonal to one another. The orthogonality $\alpha^T \Sigma \alpha_j = 0$ implies that $U = X\alpha$ and likewise $U_j = X\alpha_j$ are uncorrelated. For non-Gaussian data, no such natural metric exists. In fact, for data separated into $K$ classes, even if we assume that the data in each class are Gaussian, it is still not clear what the appropriate metric is, unless one assumes, as in LDA, that all the classes share a common covariance matrix. In practice, one often spheres the data prior to the analysis, that is, lets $x^* = \Sigma^{-1/2}(x - \mu)$, where $\mu$ is the overall mean and $\Sigma$, the total covariance matrix of the data. This is the same as using the total covariance matrix as an ad hoc metric for orthogonalization. Another reasonable choice of the metric is the within-class covariance matrix, which, when $p_k$ is Gaussian, is the same as LDA. In general, there is no reason why the total or the within-class covariance matrix is the appropriate metric. Before we present a more general strategy below, however, we feel it is important to add that, despite being somewhat ad hoc, orthogonal features are still useful for some applications, such as data visualization.
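For concreteness, here is a small sketch (ours, not the paper's code) of the constraint handling for this strategy: each trial direction is made $\Phi$-orthogonal to the directions already found by a Gram–Schmidt step with respect to the metric $\Phi$, before it is normalized and scored by LR.

```python
import numpy as np

def phi_orthogonalize(v, prev_dirs, Phi):
    """Project v so that v^T Phi a_j = 0 for every previously found direction a_j
    (a Gram-Schmidt step in the metric Phi), then normalize to unit length."""
    v = np.asarray(v, dtype=float).copy()
    for a in prev_dirs:
        v -= (v @ Phi @ a) / (a @ Phi @ a) * a
    return v / np.linalg.norm(v)

# Example with a within-class-like covariance W as the metric (the choice used
# for the congressional votes example in Section 3.6).
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
W = np.cov(X, rowvar=False)            # stand-in for a within-class covariance
a1 = np.array([1.0, 0.0, 0.0, 0.0])
a2 = phi_orthogonalize(rng.normal(size=4), [a1], W)
print(a2 @ W @ a1)                     # ~0: the constraint alpha^T W alpha_1 = 0 holds
```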

3.5.2 Feature Removal

Alternatively, we can adopt Friedman's exploratory projection pursuit paradigm (Friedman 1987): Once a discriminant direction $\alpha$ is found, we simply transform the data so that there is no class difference in the $\alpha$ direction (in the sense that $p_k^{(\alpha)} = q$ for all $k$, for some common $q$), while keeping all other directions unchanged, and search for the next direction.

In particular, let $z = \alpha^T x$; then for each class $k$, a transformation is applied to $z$:
\[
z' = Q^{-1}(F_k(z)), \tag{3.5}
\]
where $Q(\cdot)$ is the cumulative distribution function (cdf) corresponding to the common density function $q$ and $F_k$, the marginal cdf of $z$ for class $k$. By definition, (3.5) is a monotonic one-to-one transformation and the transformed variable, $z'$, has density $q$. Let $A$ be an orthogonal rotation matrix such that
\[
z \;\stackrel{\mathrm{def}}{=}\; A x = \begin{pmatrix} \alpha^T x \\ A_* x \end{pmatrix}. \tag{3.6}
\]
Such an $A$ can be constructed using the Gram–Schmidt procedure (see, e.g., Friedberg, Insel, and Spence 1989, p. 307). Then the entire transformation process can be summarized in the following diagram:
\[
\begin{array}{ccc}
x & \stackrel{h}{\longrightarrow} & h(x) \\
{\downarrow}\,A & & {\uparrow}\,A^{-1} \\
z & \stackrel{t}{\longrightarrow} & t(z),
\end{array} \tag{3.7}
\]


where
\[
t(z_j) = \begin{cases} Q^{-1}(F_k(z_j)) & \text{for } j = 1, \\ z_j & \text{for } j > 1. \end{cases} \tag{3.8}
\]
Hence
\[
h(x) = A^{-1} t(A x).
\]

The feature removal strategy is more general because the directions are not constrained to be orthogonal, nor is there an issue of choosing the appropriate orthogonalization metric. We also have the following result, which adds some theoretical coherence to this strategy.

Result 2. If $p_k(x)$ is (or is estimated by) the $N(\mu_k, \Sigma)$ density function for classes $k = 1, 2, \ldots, K$, then recursively maximizing $\mathrm{LR}(\alpha)$ using exploratory projection pursuit yields the same discriminant directions as Fisher's LDA.

The proof (Zhu 2001) of this result relies on the fact that when $p_k$ is $N(\mu_k, \Sigma)$, the transformation defined above in (3.5) is simply a location shift. The bottom panel of Figure 1 shows the "multi-modality data" projected onto the first two discriminant directions found by recursively maximizing $\mathrm{LR}(\alpha)$ using the feature removal strategy with a bandwidth parameter of 0.6 in the Locfit module. As we can clearly see, our generalized procedure successfully identifies the two discriminant directions.
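Below is a minimal sketch of one feature-removal step (3.5)-(3.8) under the assumption that the common density q is standard normal, so that Q^{-1} is the normal quantile function; the empirical cdf stands in for F_k, and the rotation A is obtained by a QR factorization rather than an explicit Gram–Schmidt loop. None of these specific choices are prescribed by the paper.

```python
import numpy as np
from scipy.stats import norm

def remove_feature(X, y, alpha):
    """Transform the data so the classes no longer differ along `alpha`
    (cf. (3.5)-(3.8)): rotate so alpha becomes the first coordinate, push each
    class's first coordinate through Q^{-1}(F_k(.)), then rotate back."""
    alpha = alpha / np.linalg.norm(alpha)
    # Orthogonal rotation A whose first row is alpha (QR on [alpha | I]).
    A = np.linalg.qr(np.column_stack([alpha, np.eye(len(alpha))]))[0].T
    if A[0] @ alpha < 0:                       # fix the sign of the first row
        A[0] *= -1
    Z = X @ A.T                                # rows are z = A x
    z1 = Z[:, 0]
    for k in np.unique(y):
        idx = (y == k)
        zk = z1[idx]
        # Empirical cdf F_k (ranks mapped into (0,1)), then the normal quantile Q^{-1}.
        Fk = (np.argsort(np.argsort(zk)) + 0.5) / zk.size
        Z[idx, 0] = norm.ppf(Fk)
    return Z @ A                               # h(x) = A^{-1} t(A x), with A^{-1} = A^T

# After removal, the classes should look alike along alpha (= e1 here).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal([-2, 0], 1, size=(300, 2)),
               rng.normal([+2, 0], 1, size=(300, 2))])
y = np.repeat([0, 1], 300)
Xr = remove_feature(X, y, np.array([1.0, 0.0]))
print(Xr[y == 0, 0].mean(), Xr[y == 1, 0].mean())   # class means along alpha: both near 0
```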

3.6 EXAMPLE: 1984 CONGRESSIONAL VOTES DATA

The UCI machine-learning repository contains a dataset that describes how each member of the U.S. House of Representatives voted on 16 key issues in 1984. The dataset and any relevant information are at ftp://ftp.ics.uci.edu/pub/machine-learning-databases/voting-records/.

There are 435 Congressmen recorded in the database. Sometimes a Congressman may have been absent when the voting took place, or he may have simply voted "present," instead of a clear "yes" or "no," to avoid conflicts of interest. These cases are represented as missing values in the database. In our analysis, each vote is characterized by two indicator variables, one indicating whether the vote is a "yes" and the other indicating whether a clear yes-or-no vote is missing, resulting in a total of 32 predictors.
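As a small illustration of this encoding (with made-up records rather than the repository's actual file format), each raw vote is mapped to the pair of indicators described above:

```python
# Hypothetical raw records: 'y' = yes, 'n' = no, '?' = absent or voted "present".
records = [
    ['y', 'n', '?', 'y'],
    ['n', 'n', 'y', '?'],
]

def encode(votes):
    """Two indicators per issue: (is a 'yes' vote, is a missing yes/no vote),
    giving 2 * 16 = 32 predictors for the full dataset."""
    row = []
    for v in votes:
        row += [1 if v == 'y' else 0,      # indicator: voted yes
                1 if v == '?' else 0]      # indicator: no clear yes-or-no vote
    return row

print([encode(r) for r in records])
```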

The top panel of Figure 3 shows the two-dimensional LDA subspace. Apparently, it is easy to distinguish the majority of the Democrats from the Republicans using the voting records. The bottom panel of Figure 3 shows the two-dimensional discriminant subspace found by maximizing $\mathrm{LR}(\alpha)$, using the orthogonalization strategy. Here we choose $\Phi = W$ to be the within-class covariance matrix, the same as LDA. We see the same basic separation between Democrats and Republicans as above. But we also see some additional features. It seems that the voting patterns among Democrats are more polarized, or have a higher variance, than those of the Republicans. Moreover, within the Democratic camp, there appear to be several separate subgroups, or within-group clusters. There seem to be more "outlying" members among the Democrats, and, within the "mainstream" Democrats, there seem to be two major subdivisions. LDA will never find such elaborate features. In fact, for this two-class problem, LDA can find only one meaningful discriminant direction. In this case, it is clear that there is more information in the data, and that LDA is insufficient. The additional information is obviously important. It would have been extremely interesting to examine the two subgroups within the mainstream Democrats, for example. Various interest groups could have used such information to their strategic advantage; not even politicians can undermine the power of effective data visualization!

Figure 3. Congressional Votes Data. Top: Two-dimensional subspace identified by LDA; the second direction is not meaningful because the between-class covariance for a two-class problem has rank 1. Bottom: Two-dimensional subspace identified by LR.


4. COMPARISON WITH SAVE

The sliced average variance estimator (SAVE) was used by Cook and Yin (2001) to recover the best discriminant subspace in more general settings than LDA, for example, when $p_k$ is $N(\mu_k, \Sigma_k)$. In their approach, they first sphered the data so that the total covariance matrix is the identity and then chose their discriminant directions by successively maximizing

\[
\mathrm{SAVE}(\alpha) = \alpha^T \left( \sum_{k=1}^{K} \left(\frac{n_k}{N}\right) (I - \Sigma_k)^2 \right) \alpha \tag{4.1}
\]
over $\alpha \in \mathbb{R}^d$ with $\|\alpha\| = 1$. The solutions are the top eigenvectors of the kernel
\[
\sum_{k=1}^{K} \left(\frac{n_k}{N}\right) (I - \Sigma_k)^2. \tag{4.2}
\]
When $p_k$ is $N(\mu_k, \Sigma_k)$, it is easy to verify that, apart from a constant not depending on $\alpha$,
\[
\mathrm{LR}(\alpha) \propto \sum_{k=1}^{K} \left(\frac{n_k}{N}\right) \left( \log \alpha^T T \alpha - \log \alpha^T \Sigma_k \alpha \right) \tag{4.3}
\]
\[
= \sum_{k=1}^{K} \left(\frac{n_k}{N}\right) \left( -\log \alpha^T \Sigma_k \alpha \right), \tag{4.4}
\]
where $T$ denotes the total covariance matrix. In order to make our comparison more direct, we also sphere the data first so that $T = I$. This is why $\log \alpha^T T \alpha = \log 1 = 0$ in (4.4).
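Since, under these Gaussian assumptions, both criteria depend on the sphered data only through the class covariance matrices, they are easy to compare numerically. The sketch below (our own code) evaluates (4.1) and (4.4) for a given unit vector; with the parameters of the example in Section 4.1 it reproduces the values reported in Table 1 (LR: 0.69 and 0.34; SAVE: 0.25 and 0.49).

```python
import numpy as np

def save_crit(alpha, Sigmas, props):
    """SAVE(alpha) of (4.1): alpha^T [ sum_k (n_k/N) (I - Sigma_k)^2 ] alpha."""
    d = len(alpha)
    kernel = sum(w * (np.eye(d) - S) @ (np.eye(d) - S) for w, S in zip(props, Sigmas))
    return alpha @ kernel @ alpha

def lr_gaussian(alpha, Sigmas, props):
    """LR(alpha) of (4.4), up to a constant, for sphered data (T = I):
    sum_k (n_k/N) ( -log alpha^T Sigma_k alpha )."""
    return sum(w * -np.log(alpha @ S @ alpha) for w, S in zip(props, Sigmas))

# The two-class example of Section 4.1 (data already sphered).
Sigmas = [np.diag([0.5, 0.3]), np.diag([0.5, 1.7])]
props = [0.5, 0.5]
for name, a in [("x1", np.array([1.0, 0.0])), ("x2", np.array([0.0, 1.0]))]:
    print(name, round(lr_gaussian(a, Sigmas, props), 2), round(save_crit(a, Sigmas, props), 2))
```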

4.1 A SIMPLE NUMERICAL EXAMPLE

We first construct a simple numerical example (see also Hastie and Zhu 2001) to compare the two methods. For simplicity, we stay in $\mathbb{R}^2$ and let there be only $K = 2$ classes of equal prior probability ($n_1 = n_2 = N/2$). To simplify matters further, we let the $\Sigma_k$'s have common eigenvectors. Without loss of generality, simply assume that the $\Sigma_k$'s are diagonal. We create two competing directions, one in which the two classes differ only by the mean (first-order difference), and one in which the two classes differ only by the variance (second-order difference). In particular, the parameters are

\[
\mu_1 = (-\sqrt{0.5},\, 0)^T, \qquad \mu_2 = (\sqrt{0.5},\, 0)^T
\]
and
\[
\Sigma_1 = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.3 \end{pmatrix}, \qquad \Sigma_2 = \begin{pmatrix} 0.5 & 0 \\ 0 & 1.7 \end{pmatrix}.
\]
One can easily verify that the overall mean is $(0, 0)^T$ and the total covariance matrix is the identity. In other words, the data are properly sphered. A simulated dataset for this example is shown in Figure 4.


Figure 4. Simulated data: two Gaussians. The black dots are the centroids. There is a mean shift between the two classes in $x_1$ but no difference in the marginal variances. The means in $x_2$ are identical for the two classes while the marginal variances differ.

In this case, the two eigenvectors of (4.2), the only directions considered by SAVE, are simply the coordinate axes, $x_1$ and $x_2$. Table 1 shows the values of $\mathrm{LR}(\alpha)$ and $\mathrm{SAVE}(\alpha)$ evaluated at $x_1$ and $x_2$. We can see that SAVE will conclude that $x_2$ is more important for discrimination, while our criterion LR will conclude that $x_1$ is the more important direction.

From Figure 4, it is hard to tell exactly which direction is actually more important for discrimination. However, in this simple case, we know the optimal Bayes decision rule if we were only to use one of the directions. If the decision were to be based on $x_1$ alone, then the optimal Bayes rule would be
\[
y_1 = \begin{cases} 1 & \text{if } x_1 \le 0, \\ 2 & \text{if } x_1 > 0. \end{cases} \tag{4.5}
\]
If the decision were to be based on $x_2$ alone, then the optimal Bayes rule would be
\[
y_2 = \begin{cases} 1 & \text{if } |x_2| \le 0.795, \\ 2 & \text{if } |x_2| > 0.795. \end{cases} \tag{4.6}
\]

Table 1. Evaluation of LR($\alpha$) and SAVE($\alpha$) over the two eigenvectors of the SAVE kernel, $x_1 = (1, 0)^T$ and $x_2 = (0, 1)^T$.

    α      LR(α)    SAVE(α)
    x1     0.69     0.25
    x2     0.34     0.49


Table 2. Misclassification errors for the two different one-dimensional Bayes rules.

    Classification method    Misclassification error
    y1                       15.9%
    y2                       30.2%

The misclassification errors of these two rules can be easily calculated and are displayed in Table 2. Clearly, $x_1$ is a much better discriminant direction than $x_2$, yet if allowed to pick only one direction, SAVE would pick $x_2$ while our generalized log-likelihood-ratio criterion would pick $x_1$.
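These two error rates follow directly from normal tail probabilities; the following small check (our own code, not part of the paper) reproduces them.

```python
from scipy.stats import norm
from math import sqrt

# Rule y1: classify by the sign of x1.  Both classes have sd sqrt(0.5) in x1,
# with means -sqrt(0.5) and +sqrt(0.5), so each errs with probability Phi(-1).
err_y1 = norm.cdf(-sqrt(0.5) / sqrt(0.5))

# Rule y2: threshold |x2| at 0.795.  Class 1 has sd sqrt(0.3), class 2 sd sqrt(1.7).
p_err_class1 = 2 * norm.cdf(-0.795 / sqrt(0.3))        # class 1 falls outside the band
p_err_class2 = 1 - 2 * norm.cdf(-0.795 / sqrt(1.7))    # class 2 falls inside the band
err_y2 = 0.5 * p_err_class1 + 0.5 * p_err_class2

print(f"y1: {err_y1:.1%}, y2: {err_y2:.1%}")   # approximately 15.9% and 30.2%
```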

4.2 AN EXPERIMENT

We now carry this numerical example a bit farther by conducting a simple experiment. In this experiment, everything is kept the same as above except we now let $\mu_1 = (-\sqrt{b}, 0)^T$ and $\mu_2 = (\sqrt{b}, 0)^T$, where $b$ is a free parameter.

It is easy to understand that as $b$ changes, the Bayes misclassification errors of $y_1$ and $y_2$ also change. More specifically, when $b = 0$, it is clear that the direction $x_1$ has no discriminating power; as $b$ moves from 0 to 1, $x_1$ gradually becomes a better discriminant direction than $x_2$. The goal of this experiment is to answer the following questions: At what value of $b$ will this reversal take place, and can the criteria $\mathrm{LR}(\alpha)$ and $\mathrm{SAVE}(\alpha)$ detect this reversal correctly?

An answer is provided in Figure 5. The top panel shows that the Bayes misclassification error of $y_1$ drops from 50% to 0% as $b$ moves from 0 to 1. The misclassification error of $y_2$ stays constant since changes in $b$ do not affect $y_2$. The cross-over takes place at around $b = 0.2$: when $b < 0.2$, $y_2$ has the smaller misclassification error, indicating that $x_2$ should be the better discriminant direction; when $b > 0.2$, the reverse is true. The middle panel shows the function $\mathrm{LR}(\alpha)$ evaluated at $x_1$ and $x_2$ as $b$ changes. We can see LR starts to pick $x_1$ over $x_2$ as the better discriminant direction when $b$ is over 0.3. The bottom panel shows the function $\mathrm{SAVE}(\alpha)$ evaluated at $x_1$ and $x_2$ as $b$ changes. We see SAVE does not pick $x_1$ over $x_2$ as the better direction until $b$ is over 0.7.

The conclusion from this simple experiment is that SAVE seems to over-emphasize second-order differences among the classes. This tendency of SAVE to over-emphasize second-order information manifests itself quite regularly, as we illustrate with a real example below.

4.3 EXAMPLE: PEN DIGIT DATA

The Pen Digit database from the UCI machine-learning repository contains 10,992 samples of handwritten digits (0, 1, 2, ..., 9) collected from 44 different writers. Each digit is stored as a 16-dimensional vector. The 16 attributes are derived using standard temporal and spatial resampling techniques in order to create vectors of the same length for every character.


Figure 5. Top: Bayes misclassification errors of $y_1$ and $y_2$ as $b$ changes, the horizontal line being that of $y_2$. Middle: LR($\alpha$) evaluated at $x_1$ and $x_2$ as $b$ changes, the horizontal line being the one at $x_2$. Bottom: SAVE($\alpha$) evaluated at $x_1$ and $x_2$ as $b$ changes, the horizontal line being the one at $x_2$.


More details are available at ftp://ftp.ics.uci.edu/pub/machine-learning-databases/pendigits/. For our purposes, it suffices to state that this is a 10-class problem in a 16-dimensional space. The dataset is divided into a learning set (7,494 cases) and a test set (3,498 cases).

For better graphical display, we only select the 6's, 9's, and 0's, three easily confused digits, as an illustration. For this three-class subproblem, there are 2,219 cases in the training set and 1,035 cases in the test set. We apply LDA, SAVE, and LR to the three-class subproblem. In each case, we search for the two leading discriminant directions. We then project the test data onto these two directions. In order to avoid having to justify the bandwidth we pick, the pictures presented here are produced with a parametric model, using the $N(\mu_k, \Sigma_k)$ density as the model for $p_k$; note that this is close to the assumptions in SAVE but still more general than the assumptions in LDA. Figure 6 shows the results. This is a real problem, so we do not know the true discriminant subspace. However, from the graphs, it is clear that LDA separates the classes reasonably well. SAVE, on the other hand, picks a subspace in which the 9's have a much larger variance than the 0's and the 6's, while LR picks essentially the same subspace as LDA.

4.4 LR VERSUS SAVE

The experiments and examples above show that the criterion LR is more robust than SAVE: while it is flexible enough to pick up high-order differences among the classes, LR does not over-emphasize high-order information when the first-order difference is dominant.

5. NONPARAMETRIC DISCRIMINANT ANALYSIS: AN APPLICATION

Using projection pursuit density estimation (Friedman, Stuetzle, and Schroeder 1984), we can easily integrate our feature extraction technique into a nonparametric density model; this allows us to consider (reduced-rank) nonparametric discriminant analysis, a problem that has remained difficult due to the "curse of dimensionality." In particular, let
\[
p_k(x) = p_0(x) \prod_{m=1}^{M} f_{mk}(\alpha_m^T x). \tag{5.1}
\]
That is, one starts with a common initial guess for all the classes (often a Gaussian distribution), and augments each density function with a series of univariate ridge modifications to distinguish the classes, thereby bypassing the "curse of dimensionality." Model (5.1) can also be recursively written as
\[
p_{m,k}(x) = p_{m-1,k}(x)\, f_{mk}(\alpha_m^T x).
\]
For fixed direction $\alpha_m$, the optimal ridge function $f_{mk}$ for class $k$ is given by (see Friedman et al. 1984):


Figure 6. Pen Digit Data. Shown here are test data projected onto the leading two discriminant directions found by LDA, SAVE, and LR, respectively.


\[
f_{mk}(\alpha_m^T x) = \frac{p_k^{(\alpha_m)}(\alpha_m^T x)}{p_{m-1}^{(\alpha_m)}(\alpha_m^T x)}.
\]

So the important question here is how to choose the ridge directions $\alpha_m$ at each iteration. The idea of using model (5.1) for discriminant analysis was also proposed by Polzehl (1995), where $\alpha_m$ was chosen to minimize the misclassification error. But misclassification error is, in itself, not a continuous function of $\alpha$, so an additional smoothing step has to be introduced. Instead, we choose $\alpha_m$ to maximize our general measure of class separation:

\[
\mathrm{LR}(\alpha) = \log \frac{\prod_{k=1}^{K} \prod_{x_j \in C_k} p_{m,k}(x_j)}{\prod_{k=1}^{K} \prod_{x_j \in C_k} p_m(x_j)}
= \log \frac{\prod_{k=1}^{K} \prod_{x_j \in C_k} p_{m-1,k}(x_j)\, f_{mk}(\alpha^T x_j)}{\prod_{k=1}^{K} \prod_{x_j \in C_k} p_{m-1}(x_j)\, f_m(\alpha^T x_j)}.
\]
The definition of $\mathrm{LR}(\alpha)$ here is slightly different from (3.1); the difference merely reflects the special projection pursuit density models being used for each class.

Here are some important advantages of this model which, we feel, were not emphasized enough by Polzehl (1995). First, we have a flexible, nonparametric model for the density functions in each class, while still avoiding the "curse of dimensionality." Second, forcing the ridge directions to be the same for all classes allows the complexity of the model to be easily regularized by selecting only a few directions where there are significant differences between classes. Therefore, although this is a density-based classification method, we actually do not waste any effort in directions which may contain interesting features in the density function itself but do not contribute to class differentiation. Third, because the initial density is common for all the classes, the decision boundary between classes $k$ and $K$ simply depends on the ratio of the augmenting ridge functions alone. In particular, assuming equal priors, the decision boundary is characterized by

\[
\prod_{m=1}^{M} \frac{f_{mk}(\alpha_m^T x)}{f_{mK}(\alpha_m^T x)} \;\stackrel{\mathrm{def}}{=}\; \prod_{m=1}^{M} g_{mk}(\alpha_m^T x) = 1.
\]
This implies that there is a generalized projection pursuit additive model (Roosen and Hastie 1991) for the posterior log-odds:
\[
\log \frac{p(y = k \mid x)}{p(y = K \mid x)} = \log \frac{\pi_k}{\pi_K} + \log \frac{p_k(x)}{p_K(x)} = \log \frac{\pi_k}{\pi_K} + \sum_{m=1}^{M} \log g_{mk}(\alpha_m^T x),
\]
where $\pi_k$ denotes the prior probability of class $k$.
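As a simplified sketch of how the model yields a classifier, consider the two-class case with a single ridge modification (M = 1) and a common initial density $p_0$: the initial density cancels in the log-odds, which reduce to the prior log-odds plus the log ratio of the two estimated class marginals along $\alpha_1$. The KDE-based estimates and all names below are our own choices, not the paper's implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_one_ridge_classifier(X, y, alpha, labels=("A", "B")):
    """Two-class, one-ridge (M = 1) special case of model (5.1): with a common
    initial density p0, the posterior log-odds reduce to the prior log-odds plus
    the log ratio of the 1-D class marginals along alpha."""
    alpha = alpha / np.linalg.norm(alpha)
    z = X @ alpha
    dens = {lab: gaussian_kde(z[y == lab]) for lab in labels}
    log_prior_odds = np.log(np.mean(y == labels[0]) / np.mean(y == labels[1]))

    def predict(Xnew):
        znew = Xnew @ alpha
        log_odds = (log_prior_odds
                    + np.log(dens[labels[0]](znew))
                    - np.log(dens[labels[1]](znew)))
        return np.where(log_odds > 0, labels[0], labels[1])

    return predict

# Toy use: two classes separated along the direction (1, 1)/sqrt(2).
rng = np.random.default_rng(5)
X = np.vstack([rng.normal([1, 1], 1, size=(200, 2)),
               rng.normal([-1, -1], 1, size=(200, 2))])
y = np.array(["A"] * 200 + ["B"] * 200)
predict = fit_one_ridge_classifier(X, y, np.array([1.0, 1.0]))
print(np.mean(predict(X) == y))   # training accuracy, should be high
```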

5.1 ILLUSTRATION: EXCLUSIVE OR

As an illustration, we examine a difficult classification problem that involves an "exclusive or" relation in the predictors. Let
\[
\mu_{1A} = \begin{pmatrix} 1.5 \\ 1.5 \end{pmatrix}, \qquad \mu_{2A} = \begin{pmatrix} -1.5 \\ -1.5 \end{pmatrix},
\]


Figure 7. Top: A simulated instance of the "exclusive or" situation, with $n_A = n_B = 250$. Middle and Bottom: Estimated density functions for the two classes, after two ridge modifications.


Table 3. The misclassification error for 15,000 new observations. The Bayes error for this problem is 12.5%.

    Number of ridge functions (M)    Test error
    1                                23.5%
    2                                12.9%
    3                                14.2%
    4                                14.9%

\[
\mu_{1B} = \begin{pmatrix} -1.5 \\ 1.5 \end{pmatrix}, \qquad \mu_{2B} = \begin{pmatrix} 1.5 \\ -1.5 \end{pmatrix},
\]
and consider two mixture classes, A and B, with
\[
p_A(x) = \tfrac{1}{2}\,\varphi(x; \mu_{1A}, I) + \tfrac{1}{2}\,\varphi(x; \mu_{2A}, I),
\qquad
p_B(x) = \tfrac{1}{2}\,\varphi(x; \mu_{1B}, I) + \tfrac{1}{2}\,\varphi(x; \mu_{2B}, I),
\]

where $\varphi(x; \mu, I)$ denotes the $N(\mu, I)$ density function. The top panel of Figure 7 shows one simulated instance of this situation. This is a particularly difficult problem, because when considered alone, neither $x_1$ nor $x_2$ contains any information for classification; they must be considered together. The Bayes decision rule for this case is an "exclusive or" condition:
\[
y = \begin{cases} B & \text{if } (x_1 \le 0 \text{ and } x_2 > 0) \text{ or } (x_2 \le 0 \text{ and } x_1 > 0), \\ A & \text{otherwise.} \end{cases}
\]

The symmetry makes it easy to compute that the Bayes error for this problem is equal to $2\Phi(1.5)(1 - \Phi(1.5)) \approx 12.5\%$. Using model (5.1), we start with $p_A = p_B = p_0$, where $p_0$ is Gaussian, and modify each by a series of ridge functions. The ridge directions are chosen to be directions in which there are significant differences between the two classes as measured by $\mathrm{LR}(\alpha)$. The middle and bottom panels of Figure 7 show the resulting density surfaces after two ridge modifications. The ridge functions also determine the classification rule. The misclassification results on test data are summarized in Table 3. We can see that when we use two ridge modifications for each density function (the optimal number for this problem), we can achieve an error rate very close to the Bayes error, despite the difficulty of the problem. The fact that the test error increases for $M > 2$ is a sign of over-fitting caused by the additional and unnecessary ridge modifications. In general, one must rely on cross-validation (or similar methods) to select the optimal $M$.
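A quick simulation (our own check) confirms this Bayes error: draw from the two mixtures and apply the exclusive-or rule above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
n = 200000

def sample(means):
    """Draw n points from an equal mixture of N(mu, I) over the given means."""
    comp = rng.integers(0, 2, size=n)
    return np.array(means)[comp] + rng.normal(size=(n, 2))

XA = sample([( 1.5,  1.5), (-1.5, -1.5)])   # class A
XB = sample([(-1.5,  1.5), ( 1.5, -1.5)])   # class B

def bayes_rule(X):
    # "Exclusive or": predict B when x1 and x2 fall on opposite sides of 0.
    return np.where((X[:, 0] > 0) != (X[:, 1] > 0), "B", "A")

err = 0.5 * np.mean(bayes_rule(XA) == "B") + 0.5 * np.mean(bayes_rule(XB) == "A")
print(err, 2 * norm.cdf(1.5) * (1 - norm.cdf(1.5)))   # both approximately 0.125
```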

6. SUMMARY

We have proposed a general, automatic, and adaptive feature extraction method for classification and pattern recognition, of which LDA is a special case. It has the ability to pick up rather complicated features (such as within-group clusters) that are important for distinguishing the classes. Unlike its competitors such as SAVE, it also has the ability to capture the right amount of trade-off between low-order and high-order features. We also showed that our method can be incorporated into formal probabilistic models to allow (low-rank) nonparametric discriminant analysis. The equivalence between discriminant directions and the concept of ordination axes in correspondence analysis (Zhu 2001) also makes this method applicable to a wide variety of applications in areas such as environmental ecology and e-commerce.

ACKNOWLEDGMENTS

We thank Professors Robert Tibshirani and Jerome Friedman for their interesting discussions and helpful comments during this work. We are also indebted to an associate editor for comments and suggestions which greatly improved our manuscript. Both authors were partially supported by grant DMS-9803645 from the National Science Foundation and grant ROI-CA-72028-01 from the National Institutes of Health. In addition, Mu Zhu was also partially supported by the Natural Sciences and Engineering Research Council of Canada.

[Received February 2001. Revised August 2002.]

REFERENCES

Cook, R. D., and Yin, X. (2001), "Dimension Reduction and Visualization in Discriminant Analysis" (with discussion), Australian and New Zealand Journal of Statistics, 43, 147–199.

Devijver, P. A., and Kittler, J. (1982), Pattern Recognition: A Statistical Approach, Englewood Cliffs, NJ: Prentice-Hall International.

Fisher, R. A. (1936), "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7.

Friedberg, S. H., Insel, A. J., and Spence, L. E. (1989), Linear Algebra, Englewood Cliffs, NJ: Prentice Hall.

Friedman, J. H. (1987), "Exploratory Projection Pursuit," Journal of the American Statistical Association, 82, 249–266.

Friedman, J. H., Stuetzle, W., and Schroeder, A. (1984), "Projection Pursuit Density Estimation," Journal of the American Statistical Association, 79, 599–608.

Gill, P. E., Murray, W., and Wright, M. H. (1981), Practical Optimization, New York: Academic Press.

Hastie, T. J., and Zhu, M. (2001), Discussion of "Dimension Reduction and Visualization in Discriminant Analysis," by Cook and Yin, Australian and New Zealand Journal of Statistics, 43, 179–185.

Loader, C. (1999), Local Regression and Likelihood, New York: Springer-Verlag.

Polzehl, J. (1995), "Projection Pursuit Discriminant Analysis," Computational Statistics and Data Analysis, 20, 141–157.

Roosen, C., and Hastie, T. J. (1991), "Logistic Response Projection Pursuit," Technical Report BLO11214-930806-09TM, AT&T Bell Laboratories.

Silverman, B. (1986), Density Estimation for Statistics and Data Analysis, New York: Chapman and Hall.

Zhu, M. (2001), "Feature Extraction and Dimension Reduction with Applications to Classification and the Analysis of Co-occurrence Data," Ph.D. dissertation, Stanford University.
