Page 1: SASA Presentation 2013



Prediction Accuracy Estimation: Applications to Linear Regression, Regression Trees & Support Vector Regression

Hildegard Erasmus & Morné Lamont

University of Stellenbosch, Department of Statistics & Actuarial Science

South African Statistical Association Conference 2013

Page 2: SASA Presentation 2013

- Introduction
- Classical linear regression
- Alternative techniques
  - Regression Trees
  - Support Vector Regression
- Advantages & Disadvantages
- Simulation study
- Conclusion

Prediction Accuracy Estimation SASA 2013

Page 3: SASA Presentation 2013

Regression as a scientific method first appeared around 1885 (Izenman, 2008).

Since then, regression has evolved into a variety of forms, including linear, non-linear, parametric & nonparametric.

Objective of study:

- discuss theoretical considerations regarding different regression techniques
- highlight advantages & disadvantages
- assess performance of different techniques
- identify conditions for good predictive performance


Page 4: SASA Presentation 2013

Method of Least Squares (LS):
- Originated in astronomy (1805)
- Legendre: developed the LS method to determine the orbits of planets
- Gauss & Laplace: used the Gaussian curve to describe the error component, which was crucial to the success of the LS method
- Gauss: claimed in 1809 to have already been estimating the coefficients of a set of linear equations by minimizing the error sum of squares
- Galton: developed the ideas of regression & correlation, but failed to link LS with regression
- Yule: replaced the Gaussian error curve assumption with the assumption of linearly related variables (1897), and showed that LS could be applied in regression


Page 5: SASA Presentation 2013

Classical Linear Regression:

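- Model: $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, with
  - $\mathbf{y}$: $(n \times 1)$ dependent or response variable
  - $\boldsymbol{\beta}$: unknown parameters
  - $\mathbf{X}$: design matrix with $j$th row $(x_{j1}, x_{j2}, \dots, x_{jp})$
  - $\boldsymbol{\varepsilon}$: $(n \times 1)$ error term
- Assumptions (error term):
  1. $E(\boldsymbol{\varepsilon}) = \mathbf{0}$
  2. $E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}') = \sigma^2 \mathbf{I}$
  3. Normally distributed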


Page 6: SASA Presentation 2013

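- Parameter estimation:
  - Regression coefficients ($\boldsymbol{\beta}$) and error variance ($\sigma^2$)
  - Method of Least Squares
  - Estimation from data: $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ and $\hat{\sigma}^2 = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) / (\text{residual degrees of freedom})$

Application in R:
- Function: lm, predict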
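The least-squares estimates can be illustrated with a minimal sketch. It is written in Python rather than with the R functions `lm` and `predict` named on the slide, and it solves the normal equations $\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{y}$ directly. The function names `solve_linear_system`, `fit_ols` and `predict_ols` are hypothetical, and dividing the residual sum of squares by $n - p$ (with $p$ the number of columns of $\mathbf{X}$) is an assumption about the variance estimator.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes a is square and non-singular (no check is made)."""
    n = len(a)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(X, y):
    """Return (beta_hat, sigma2_hat) for the model y = X beta + error."""
    n, p = len(X), len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve_linear_system(xtx, xty)
    residuals = [y[i] - sum(X[i][j] * beta[j] for j in range(p))
                 for i in range(n)]
    # Assumption: n - p residual degrees of freedom.
    sigma2 = sum(e * e for e in residuals) / (n - p)
    return beta, sigma2


def predict_ols(beta, x_row):
    """Predicted response for one row of the design matrix."""
    return sum(b * v for b, v in zip(beta, x_row))


# Toy data: a column of ones (intercept) plus one predictor, y = 1 + 2x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta_hat, sigma2_hat = fit_ols(X, y)
```

Solving the normal equations is fine for small, well-conditioned problems; numerical libraries usually prefer a QR-based solution for stability.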

[Figure: Linear regression]


Page 7: SASA Presentation 2013

Alternative techniques:

1. Regression Trees
   - Bagging
   - Boosting
   - Random Forest

2. Support Vector Regression


Page 8: SASA Presentation 2013

Regression Trees

Main idea:
- Nonparametric method to predict response variable $y$ from known input variables (Izenman, 2008)
- Classification and Regression Trees (CART) algorithm: use recursive partitioning of the input space into non-overlapping rectangular ($r = 2$) or cubic ($r > 2$) regions & fit a simple prediction model within each partition
- Constant value assigned to each region as prediction
- Tree: graphical representation of the partitioning

Setup:
- Learning data: $\{(\mathbf{x}_i, y_i),\ i = 1, 2, \dots, n\}$, where $\mathbf{x}_i$ holds the $r$ input variables


Page 9: SASA Presentation 2013

[Figure: Regression tree for the Housing data with splits on ZN and CRIM, shown next to the corresponding plot "Partitioning of Housing Data"]


Page 10: SASA Presentation 2013

Aspects to consider:
- Choice of splitting conditions at each node
- Decision rule for when a node should be terminal (a node that does not split into two daughter nodes)
- Rule for assigning a predicted response value to every terminal node

Application in R:
- Package: rpart
- Function: rpart, predict
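The three aspects can be made concrete with a one-split tree (a "stump"). This is a minimal Python sketch, not the `rpart` implementation: the split is chosen by minimizing the sum of squared errors in the two daughter nodes, both daughter nodes are declared terminal after a single split, and each terminal node predicts the mean response of its observations. These choices and the names `sse`, `best_split`, `fit_stump` and `predict_stump` are assumptions made for illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the best split 'x < threshold'.

    The threshold is None when no split improves on the unsplit node."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold separates equal x values
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [v for _, v in pairs[:k]]
        right = [v for _, v in pairs[k:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


def fit_stump(x, y):
    """Fit a one-split tree; returns (threshold, left_mean, right_mean)."""
    threshold, _ = best_split(x, y)
    overall = sum(y) / len(y)
    if threshold is None:
        return (None, overall, overall)
    left = [yi for xi, yi in zip(x, y) if xi < threshold]
    right = [yi for xi, yi in zip(x, y) if xi >= threshold]
    return (threshold, sum(left) / len(left), sum(right) / len(right))


def predict_stump(stump, x_new):
    """Constant prediction of the terminal node that x_new falls into."""
    threshold, left_mean, right_mean = stump
    if threshold is None or x_new < threshold:
        return left_mean
    return right_mean


x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
stump = fit_stump(x, y)
```

A full tree would apply `best_split` recursively to each daughter node until a stopping rule (for example a minimum node size) is met.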


Page 11: SASA Presentation 2013

Bagging (Bootstrap aggregating):
- Procedure combines an ensemble of learning algorithms to improve performance over a single algorithm (Breiman, 1996)
- Designed to reduce variance & improve stability
- Trees are constructed independently using bootstrap samples; predictions are combined by simple majority vote (classification) or by averaging (regression)

Boosting:
- Reduces the high bias of predictors that underfit the data
- Enhances the accuracy of a "weak" (slightly >50% accuracy) binary classification learning algorithm
- Successive trees give extra weight to incorrectly predicted points; a weighted vote is taken for prediction
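Bagging for regression can be sketched as follows. This is a minimal Python illustration under stated assumptions: the base learner is a one-split regression tree chosen by minimum squared error, each tree is fitted to a bootstrap sample of the same size as the data, and the bagged prediction is the average over trees. The names `fit_stump`, `predict_stump`, `fit_bagged` and `predict_bagged`, and the defaults of 25 trees and seed 0, are hypothetical.

```python
import random


def fit_stump(x, y):
    """One-split regression tree; returns (threshold, left_mean, right_mean)."""
    def mean(v):
        return sum(v) / len(v)

    def sse(v):
        m = mean(v)
        return sum((a - m) ** 2 for a in v)

    pairs = sorted(zip(x, y))
    ys = [v for _, v in pairs]
    best_threshold, best_sse = None, sse(ys)
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue
        total = sse(ys[:k]) + sse(ys[k:])
        if total < best_sse:
            best_threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best_sse = total
    if best_threshold is None:
        return (None, mean(ys), mean(ys))
    k = sum(1 for xv, _ in pairs if xv < best_threshold)
    return (best_threshold, mean(ys[:k]), mean(ys[k:]))


def predict_stump(stump, x_new):
    threshold, left_mean, right_mean = stump
    if threshold is None or x_new < threshold:
        return left_mean
    return right_mean


def fit_bagged(x, y, n_trees=25, seed=0):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(y)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x_new):
    """Regression bagging: average the predictions of all trees."""
    preds = [predict_stump(s, x_new) for s in stumps]
    return sum(preds) / len(preds)


x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
ensemble = fit_bagged(x, y)
low, high = predict_bagged(ensemble, 0), predict_bagged(ensemble, 20)
```

A random forest would additionally restrict each split to a randomly chosen subset of predictors, which is not meaningful in this single-predictor toy example.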


Page 12: SASA Presentation 2013

Random Forest:
- Adds an additional layer of randomness to bagging (Breiman, 2001)
- Each tree is constructed using a different bootstrap sample
- Different tree construction: each node is split using the best among a randomly chosen set of predictors (Liaw & Wiener, 2002)
- Robust against overfitting
- Very user-friendly, with only two parameters: the number of variables in the random subset & the number of trees in the forest


Page 13: SASA Presentation 2013

Support Vector Regression (SVR)

Main idea:
- "...computation of a linear regression function in a high dimensional feature space where the input data are mapped via a nonlinear function." (Basak, Pal, & Patranabis, 2007)
- Involves optimization of a convex loss function or, equivalently, quadratic optimization under given constraints

Setup:
- Training data: $\{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_\ell, y_\ell)\} \subset \mathcal{X} \times \mathbb{R}$, where
  - $\mathcal{X}$: space of input patterns, say $\mathbb{R}^p$
  - $\mathbf{X}$: input data, consisting of $p$ predictors, each of dimension $(\ell \times 1)$
  - $p$: number of variables


Page 14: SASA Presentation 2013

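Goal:
- Find a function $f(\mathbf{x})$ that deviates from the actual responses $y_i$ by at most $\varepsilon$, while ensuring small coefficient values
- Tube formed around the true regression function that contains most of the data points
- Points falling outside of the tube: described by introducing slack variables $\xi_i, \xi_i^*$ (Smola & Schölkopf, 2003)
- Approximate training data with linear function: $f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b$
- Find the function that will

  minimize $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^*)$

  subject to

  $y_i - \langle \mathbf{w}, \mathbf{x}_i \rangle - b \le \varepsilon + \xi_i$

  $\langle \mathbf{w}, \mathbf{x}_i \rangle + b - y_i \le \varepsilon + \xi_i^*$

  $\xi_i, \xi_i^* \ge 0$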
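For a fixed linear function, the smallest slack that satisfies the constraints is the $\varepsilon$-insensitive loss $\max(0, |y_i - f(\mathbf{x}_i)| - \varepsilon)$, so the objective can be evaluated directly. The Python sketch below only evaluates the objective for given $\mathbf{w}$ and $b$; it does not perform the optimization. The names `epsilon_insensitive_loss` and `svr_objective` and the toy values are hypothetical.

```python
def epsilon_insensitive_loss(residual, epsilon):
    """Zero inside the tube of radius epsilon, linear outside it."""
    return max(0.0, abs(residual) - epsilon)


def svr_objective(w, b, X, y, C, epsilon):
    """Value of 0.5 * ||w||^2 + C * (sum of slacks) for a fixed (w, b).

    Each slack is set to its smallest feasible value, which equals the
    epsilon-insensitive loss of the corresponding residual."""
    regulariser = 0.5 * sum(wj * wj for wj in w)
    total_slack = 0.0
    for xi, yi in zip(X, y):
        fitted = sum(wj * xj for wj, xj in zip(w, xi)) + b
        total_slack += epsilon_insensitive_loss(yi - fitted, epsilon)
    return regulariser + C * total_slack


# Toy example: f(x) = x; only the third point lies outside the tube.
X = [[1.0], [2.0], [3.0]]
y = [1.05, 2.0, 3.5]
value = svr_objective(w=[1.0], b=0.0, X=X, y=y, C=10.0, epsilon=0.1)
```

Here the first two residuals (0.05 and 0) fall inside the tube and contribute nothing, while the third (0.5) contributes a slack of 0.4, so the objective is 0.5 + 10 × 0.4.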



Page 15: SASA Presentation 2013

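- SV expansion in input space ($\mathcal{X}$): $f(\mathbf{x}) = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) \langle \mathbf{x}_i, \mathbf{x} \rangle + b$
  - $\alpha_i, \alpha_i^*$: obtained via quadratic optimization

Kernel functions:
- Kernel: function $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that, for all $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$, $K(\mathbf{x}, \mathbf{x}') = \langle \Phi(\mathbf{x}), \Phi(\mathbf{x}') \rangle$
- Used to compute inner products of the form $\langle \Phi(\mathbf{x}), \Phi(\mathbf{x}') \rangle$ in feature space using a nonlinear kernel in input space
- Computing these inner products directly is computationally expensive or impossible because of the high dimensionality of the feature space
- SV expansion in feature space: $f(\mathbf{x}) = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) K(\mathbf{x}_i, \mathbf{x}) + b$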


Page 16: SASA Presentation 2013

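- Examples of kernel functions:

| Name | Kernel function |
|---|---|
| Polynomial | $K(\mathbf{x}, \mathbf{x}') = (\langle \mathbf{x}, \mathbf{x}' \rangle + 1)^d$, $d \in \mathbb{N}$ |
| Gaussian | $K(\mathbf{x}, \mathbf{x}') = \exp\left(-\|\mathbf{x} - \mathbf{x}'\|^2 / (2\sigma^2)\right)$, $\sigma \in \mathbb{R}^{+}$ |

Application in R:
- Package: kernlab
- Function: ksvm, predict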
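The kernel idea can be sketched in a few lines of Python: the kernel is evaluated in input space and plugged into the support vector expansion in place of the inner product. This is an illustration only, not the `ksvm` implementation. The Gaussian kernel is written with a bandwidth `sigma` as $\exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / (2\sigma^2))$, which is an assumed parametrisation (others are also in use), and the coefficients passed to `sv_expansion` are made-up numbers standing in for the differences $\alpha_i - \alpha_i^*$ that quadratic optimization would return.

```python
import math


def polynomial_kernel(x, z, degree=2, offset=1.0):
    """(<x, z> + offset) ** degree."""
    inner = sum(a * b for a, b in zip(x, z))
    return (inner + offset) ** degree


def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 * sigma^2)); assumed parametrisation."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def sv_expansion(coefficients, support_vectors, b, x_new, kernel):
    """f(x) = sum_i coefficient_i * K(x_i, x) + b."""
    return sum(c * kernel(sv, x_new)
               for c, sv in zip(coefficients, support_vectors)) + b


# Hypothetical coefficients and support vectors, for illustration only.
coefficients = [1.0, -0.5]
support_vectors = [[0.0], [1.0]]
prediction = sv_expansion(coefficients, support_vectors, 0.25, [0.0],
                          gaussian_kernel)
```

Swapping `gaussian_kernel` for `polynomial_kernel` changes the feature space without changing the form of the expansion.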


Page 17: SASA Presentation 2013





Advantages:
- Easy to estimate parameters
- Conceptually simple
- Generalized performance
- Computationally easy
- Highly interpretable
- Wide, real-world application
- Handle missing values well
- Sparsity of support vectors
- Resistant to outliers
- Perform well when p>>n
- No normality assumptions

Disadvantages:
- Assumptions
- Trees: unstable, high variance
- How to estimate parameters
- Outliers/influential values
- Lack of smoothness
- Parameter estimation is computationally intensive
- Multicollinearity
- Step function: values not always accurate
- Which kernels to choose
- Variable selection needed if p>>n


Page 18: SASA Presentation 2013

[Figure: Fitted models, two panels; legend of the first panel: Linear, Tree, Random Forest]


Page 19: SASA Presentation 2013

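Software: R

Data sets: Natural gas data and Pulp & Paper data (Johnson & Wichern, 2007)

Parameter estimation: R functions optimize and optim

Prediction accuracy measures:

- $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
- $\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$

Cross-validation: 100 simulations, each with 70% training data and 30% test data

Bootstrap: 1000 bootstrap samples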
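The cross-validation scheme can be sketched as repeated random 70/30 splits, with the accuracy measure computed on the test part of each split and then summarised by its mean and standard deviation. The Python sketch below is an illustration under assumptions, not the code used in the study: MAPE is taken as a percentage, the split is a simple random shuffle, and the names `mse`, `mape` and `repeated_holdout` as well as the `fit`/`predict` callables are hypothetical. The bootstrap variant is not shown.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error (in percent); y_true must be non-zero."""
    return 100.0 * sum(abs((t - p) / t)
                       for t, p in zip(y_true, y_pred)) / len(y_true)


def repeated_holdout(x, y, fit, predict, measure=mse,
                     n_repeats=100, train_fraction=0.7, seed=0):
    """Mean and standard deviation of `measure` over random train/test splits."""
    rng = random.Random(seed)
    n = len(y)
    n_train = int(round(train_fraction * n))
    scores = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        model = fit([x[i] for i in train], [y[i] for i in train])
        preds = [predict(model, x[i]) for i in test]
        scores.append(measure([y[i] for i in test], preds))
    mean = sum(scores) / len(scores)
    sd = (sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)) ** 0.5
    return mean, sd


# Usage with a deliberately trivial model that predicts the training mean.
def fit_mean(x_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x_new):
    return model


x = list(range(10))
y = [2.0 + 0.5 * v for v in x]
cv_mean, cv_sd = repeated_holdout(x, y, fit_mean, predict_mean)
```

Any of the techniques discussed can be substituted by passing a different pair of `fit` and `predict` functions.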


Page 20: SASA Presentation 2013

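Results (Natural gas data):

| Technique | Method | MSE Mean | MSE Sd | MAPE Mean | MAPE Sd |
|---|---|---|---|---|---|
| Linear Regression | CV | 398.2812 | 92.7482 | 5.2777 | 0.6665 |
| Linear Regression | BS | 335.9571 | 19.6197 | 4.8414 | 0.1375 |
| Regression Tree | CV | 1006.6497 | 431.5992 | 8.7364 | 1.9983 |
| Regression Tree | BS | 628.2681 | 122.0066 | 6.7289 | 0.6826 |
| Random Forest | CV | 441.2309 | 124.8267 | 5.59 | 0.9815 |
| Random Forest | BS | 223.4301 | 51.7075 | 3.3958 | 0.3513 |
| SVR: Polynomial | CV | 692.964 | 487.3593 | 6.2902 | 1.6811 |
| SVR: Polynomial | BS | 456.7031 | 333.1714 | 3.6937 | 0.7804 |
| SVR: Gaussian | CV | 378.6259 | 102.3776 | 5.3426 | 0.8103 |
| SVR: Gaussian | BS | 297.2053 | 32.399 | 4.6845 | 0.2609 |
| SVR: ANOVA | CV | 4363.471 | 860.2588 | 20.332 | 3.3114 |
| SVR: ANOVA | BS | 4044.828 | 766.1202 | 19.2571 | 1.0407 |

CV: Cross-validation; BS: Bootstrap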


Page 21: SASA Presentation 2013

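Results (Pulp & Paper data):

| Technique | Method | MSE Mean | MSE Sd | MAPE Mean | MAPE Sd |
|---|---|---|---|---|---|
| Linear Regression | CV | 2.8228 | 1.3047 | 5.7085 | 1.2441 |
| Linear Regression | BS | 2.3586 | 0.4082 | 5.2373 | 0.472 |
| Regression Tree | CV | 2.9881 | 0.9126 | 6.1803 | 1.1149 |
| Regression Tree | BS | 1.7547 | 0.26 | 4.6158 | 0.3316 |
| Random Forest | CV | 1.4502 | 0.5845 | 4.3711 | 0.9242 |
| Random Forest | BS | 0.7125 | 0.223 | 2.5465 | 0.2797 |
| SVR: Polynomial | CV | 1.7355 | 1.0912 | 4.7764 | 1.2274 |
| SVR: Polynomial | BS | 2.3586 | 0.4082 | 5.2373 | 0.472 |
| SVR: Gaussian | CV | 1.4788 | 0.6458 | 4.409 | 0.9694 |
| SVR: Gaussian | BS | 1.9256 | 6.4087 | 3.8799 | 1.1653 |
| SVR: ANOVA | CV | 5.6161 | 1.09 | 9.1277 | 1.2696 |
| SVR: ANOVA | BS | 0.7727 | 0.2732 | 2.7881 | 0.3638 |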


Page 22: SASA Presentation 2013

Conclusion:
- SVR with Gaussian kernel performed the best for the Natural gas data set
- Random Forest technique performed the best for the Pulp and Paper data set
- SVR with ANOVA kernel performed the worst for both data sets
- Linear regression is expected to perform well when linear relations exist; alternative techniques are expected to outperform it when more complex relationships are present
- Corresponding results were obtained by the Cross-validation and Bootstrap methods in the calculation of the MSE and MAPE measures
- An Out-of-Bag procedure could be added for additional validation


Page 23: SASA Presentation 2013

References:
- Basak, D., Pal, S., & Patranabis, D. C. (2007). Support Vector Regression. Neural Information Processing, 203-218.
- Breiman, L. (1996). Bagging predictors. Machine Learning, 123-140.
- Breiman, L. (2001). Random Forests. Machine Learning, 5-32.
- Izenman, A. J. (2008). Modern Multivariate Statistical Techniques. New York: Springer Science+Business Media.
- Johnson, R. A., & Wichern, D. W. (2007). Applied Multivariate Statistical Analysis. NJ: Pearson Education, Inc.
- Liaw, A., & Wiener, M. (2002). Classification and Regression by randomForest. R News.
- Smola, A. J., & Schölkopf, B. (2003). A Tutorial on Support Vector Regression. Statistics and Computing, 199-222.
