
Page 1: Weka presentation cmt111

Data Analysis with Weka: zoo.arff

Done by Clement Robert H.

Daniyar M.

Web and Social Computing

Page 2: Weka presentation cmt111

Dataset

Zoo.arff: A simple database containing 17 Boolean-valued attributes. The "type" attribute appears to be the class attribute. Here is a breakdown of which animals are in which type.

Page 3: Weka presentation cmt111

Objectives

● Select Dataset

● Learn and Explain the two classifiers

● Apply them to the selected Dataset

● Compare the results

Page 4: Weka presentation cmt111

Outline


Page 5: Weka presentation cmt111

Visualization

Class values: mammal, bird, reptile, fish, amphibian, insect, invertebrate.

Page 6: Weka presentation cmt111

Basic statistics

Class values: mammal, bird, reptile, fish, amphibian, insect, invertebrate.
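As a cross-check on the basic statistics shown in the Explorer's Preprocess tab, the short sketch below loads zoo.arff with Weka's Java API and prints a per-attribute summary. This is a minimal illustration only; the file path and the assumption that the class attribute ("type") comes last are ours.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ZooSummary {
        public static void main(String[] args) throws Exception {
            // Load the ARFF file (adjust the path to your setup)
            Instances data = DataSource.read("zoo.arff");
            // The last attribute ("type") is the class attribute
            data.setClassIndex(data.numAttributes() - 1);
            // Prints one line per attribute: type, distinct values, missing values, etc.
            System.out.println(data.toSummaryString());
        }
    }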

Page 7: Weka presentation cmt111

Hair

Class values: mammal, bird, reptile, fish, amphibian, insect, invertebrate.

Page 8: Weka presentation cmt111

Feathers

Class values: mammal, bird, reptile, fish, amphibian, insect, invertebrate.

Page 9: Weka presentation cmt111

Eggs

Class values: mammal, bird, reptile, fish, amphibian, insect, invertebrate.

Page 10: Weka presentation cmt111

Milk

Class values: mammal, bird, reptile, fish, amphibian, insect, invertebrate.

Page 11: Weka presentation cmt111

Catsize

Class values: mammal, bird, reptile, fish, amphibian, insect, invertebrate.

Page 12: Weka presentation cmt111

Pre-Processing

● Why?
  ○ Incomplete data
  ○ Noisy data
  ○ Inconsistent data

● How?
  ○ Discretize
  ○ RemoveDuplicates
  ○ RemoveUseless
  ○ Etc.

● Case study: zoo.arff
  ○ It has a duplicate instance (frog)
  ○ One attribute is misleading (the animal name)

Page 13: Weka presentation cmt111

Classification

“Data mining technique used to predict group membership for data instances”

● Types of classifiers:
  ○ Decision trees
    ■ J48
  ○ Rule-based classifiers
    ■ JRip
  ○ Bayes
    ■ Naive Bayes
    ■ Bayesian networks
  ○ Functions
  ○ Etc.

Page 14: Weka presentation cmt111

Classifier Test Options Explained and Applied to the Dataset

● Training Data:

We use the whole dataset as the training set. This gives the best results on the dataset itself but does not guarantee good performance on unseen data.

● Cross-Validation:

Divide the dataset into k subsamples; use k-1 subsamples as training data and the remaining subsample as test data, repeating k times so that each subsample is used exactly once for testing.

● Percentage Split:

We divide the dataset into two parts: the first X% of the dataset is used for training and the rest is used as the test set.
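For reference, here is a minimal sketch of how the first two test options map onto Weka's Java Evaluation API (a percentage-split sketch appears with Page 25). The file path and random seed are our assumptions.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.JRip;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TestOptions {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("zoo.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // 1) Use training set: build and test on the same data
            JRip jrip = new JRip();
            jrip.buildClassifier(data);
            Evaluation trainEval = new Evaluation(data);
            trainEval.evaluateModel(jrip, data);

            // 2) Cross-validation: k folds, each used exactly once as the test fold
            Evaluation cvEval = new Evaluation(data);
            cvEval.crossValidateModel(new JRip(), data, 10, new Random(1));

            System.out.println("Training-set accuracy: " + trainEval.pctCorrect() + "%");
            System.out.println("10-fold CV accuracy:   " + cvEval.pctCorrect() + "%");
        }
    }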

Page 15: Weka presentation cmt111

JRIP Classifier Algorithm

● Based on the RIPPER algorithm
● RIPPER: Repeated Incremental Pruning to Produce Error Reduction
  ○ Incremental pruning
  ○ Error reduction
● Produces rules
● Works fine for:
  ○ Class: missing class values, binary and nominal classes
  ○ Attributes: nominal, dates, missing values, numeric and binary

Advantages

● As highly expressive as decision trees
● Easy to interpret
● Easy to generate
● Can classify new instances rapidly

“If Age is greater than 35 and Status is Married, then he/she does not cheat”
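To see what JRip's rule output looks like on zoo.arff, here is a minimal sketch with the same assumed setup as the earlier snippet; printing the built classifier lists the learned IF-THEN rules.

    import weka.classifiers.rules.JRip;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class JRipRules {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("zoo.arff");
            data.setClassIndex(data.numAttributes() - 1);
            JRip jrip = new JRip();      // default options: pruning and rule optimization enabled
            jrip.buildClassifier(data);
            System.out.println(jrip);    // prints the learned rule list and the default rule
        }
    }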

Page 16: Weka presentation cmt111

Classifier Evaluation: Default Options (JRip)

Pre-processed? No. Results:

Page 17: Weka presentation cmt111

Classifier Evaluation: Cross-Validation Increased (JRip)

Pre-processed? No. Results: no big change.

● Same options as previous, except cross-validation = 20 folds

“If the number of cross-validation folds increases, the number of correctly classified instances decreases.”

Why?

Next: cross-validation = 40, and so on.

Page 18: Weka presentation cmt111

Classifier Evaluation: Cross-Validation (JRip)

Pre-processed? No. Results: no big change.

● No clear relation among those values

● Why 10 folds? Extensive experiments have shown that this is the best choice to get an accurate estimate for a small dataset (we have 101 instances and 17 attributes).

● The more we increase the number of folds in cross-validation, the smaller each subsample becomes (e.g., for k = 80 we would split the 101 instances into 80 subsamples, leaving only one or two instances per test fold).

Cross-Validation (k)    Correctly Classified Instances (%)
10                      87.14
20                      85.14
30                      86.13
40                      84.15
50                      87.12
60                      82.17
70                      85.14
80                      87.12
90                      84.15
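The table above can be regenerated with a small loop over fold counts. The sketch below is only an approximation of that experiment (fixed random seed assumed), so the exact percentages depend on the seed and Weka version.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.JRip;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class FoldSweep {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("zoo.arff");
            data.setClassIndex(data.numAttributes() - 1);
            // Evaluate JRip with k = 10, 20, ..., 90 folds
            for (int k = 10; k <= 90; k += 10) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(new JRip(), data, k, new Random(1));
                System.out.printf("k=%d  correctly classified: %.2f%%%n", k, eval.pctCorrect());
            }
        }
    }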

Page 19: Weka presentation cmt111

Classifier Evaluation: Training Set (JRip)

Pre-processed? No. Results:

Page 20: Weka presentation cmt111

Classifier Evaluation: Training Set (JRip)

Pre-processing? Not yet.

Cross-Validation (k)    Correctly Classified Instances (%)
10                      87.14
20                      85.14
30                      86.13
40                      84.15
50                      87.12
60                      82.17
70                      85.14
80                      87.12
90                      84.15

vs. Training Set        92.07

Training Set

● Works with the totality of the dataset as both training and test data
● It works better on the given data but gives us no confidence for unseen cases (prediction, etc.)

Page 21: Weka presentation cmt111

Pre-Processing: Why? (JRip)

● Rules generated based on a useless attribute
  ○ Filter with RemoveUseless
● Possible duplicates can lead to biased rules (we never know)
  ○ Filter with RemoveDuplicates
● Other filters
  ○ Discretize, etc.

Page 22: Weka presentation cmt111

Pre-Processing: RemoveDuplicates (instances), RemoveUseless (attributes) (JRip)

● RemoveDuplicates: removes all duplicate instances from the first batch of data it receives
● RemoveUseless: removes attributes that do not vary at all or that vary too much
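A minimal sketch of applying these two filters before training JRip (standard Weka filter classes; RemoveDuplicates requires a reasonably recent Weka release). With the default settings, RemoveUseless should drop the near-unique animal name attribute and RemoveDuplicates should drop the repeated frog instance.

    import weka.classifiers.rules.JRip;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.RemoveUseless;
    import weka.filters.unsupervised.instance.RemoveDuplicates;

    public class Preprocess {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("zoo.arff");

            // Remove attributes that never vary or vary too much (e.g. the animal name)
            RemoveUseless useless = new RemoveUseless();
            useless.setInputFormat(data);
            data = Filter.useFilter(data, useless);

            // Remove duplicate instances (e.g. the second "frog")
            RemoveDuplicates dups = new RemoveDuplicates();
            dups.setInputFormat(data);
            data = Filter.useFilter(data, dups);

            data.setClassIndex(data.numAttributes() - 1);  // "type" is still the last attribute
            JRip jrip = new JRip();
            jrip.buildClassifier(data);
            System.out.println(jrip);
        }
    }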

Page 23: Weka presentation cmt111

Classifier Evaluation: Default Options (JRip)

Pre-processed? Yes.

Page 24: Weka presentation cmt111

Classifier Evaluation: Percentage Split, Default (66%) (JRip)

Pre-processed? Yes. Results:

● With this split, the test sample contains no amphibians!

● The statistics look good because the test is done on a small test set.

Page 25: Weka presentation cmt111

Classifier Evaluation: Percentage Split, 80% (JRip)

Pre-processed? Yes. Results:

Percentage Split
● Takes samples randomly
● Chance for some representatives to be left out

http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
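Below is a rough sketch of what an 80% split does under the hood (randomize, then cut the data in two); with so few amphibian instances in the 101, the 20% test portion can easily contain none of them. The path and seed are assumptions.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.JRip;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PercentageSplit {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("zoo.arff");
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(1));                 // samples are taken randomly

            int trainSize = (int) Math.round(data.numInstances() * 0.80);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

            JRip jrip = new JRip();
            jrip.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(jrip, test);
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toClassDetailsString());  // per-class stats reveal missing classes
        }
    }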

Page 26: Weka presentation cmt111

What Next? JRip

● Done with the test options applied to the data
● JRip has its own parameters that can be changed to make it work better
  ○ Folds, MinWeights, seeds of randomization, error rate and pruning
● Pruning
  ○ Pre-pruning: stop the classifier from growing once a condition is met
  ○ Post-pruning: let the classifier grow, then reduce it to make it smaller so that it covers more unseen data
● Without pruning, the classifier is more detailed, which makes it more limited on unseen cases
● JRip uses a post-pruning method (based on REP, reduced-error pruning)
● Next: JRip with pruning vs. JRip without pruning (see the sketch below)
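A minimal sketch of that comparison using JRip's usePruning option (the slides ran it in the Experimenter; here a single 10-fold cross-validation per setting stands in for it).

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.JRip;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PruningComparison {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("zoo.arff");
            data.setClassIndex(data.numAttributes() - 1);

            JRip pruned = new JRip();           // pruning enabled by default (REP-based post-pruning)
            JRip unpruned = new JRip();
            unpruned.setUsePruning(false);      // grow the rule set without reduced-error pruning

            Evaluation prunedEval = new Evaluation(data);
            prunedEval.crossValidateModel(pruned, data, 10, new Random(1));
            Evaluation unprunedEval = new Evaluation(data);
            unprunedEval.crossValidateModel(unpruned, data, 10, new Random(1));

            System.out.println("Pruned JRip:   " + prunedEval.pctCorrect() + "% correct");
            System.out.println("Unpruned JRip: " + unprunedEval.pctCorrect() + "% correct");
        }
    }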

Page 27: Weka presentation cmt111

Classifier Evaluation: Pruning TRUE vs. FALSE (JRip)

Pre-processed? No. Results obtained with the Experimenter*

Page 28: Weka presentation cmt111

J48 Classifier Algorithm

● Based on the C4.5 algorithm
● Modified ID3 that handles:
  ○ Continuous attributes
  ○ Missing attributes
  ○ Attributes with differing costs
  ○ Post-pruning of trees
● Produces a decision tree
● Works fine for:
  ○ Class: missing class values, binary and nominal classes
  ○ Attributes: nominal, missing values, numeric and binary

The advantages of C4.5 are:

• Builds models that can be easily interpreted
• Easy to implement
• Can use both categorical and continuous values
• Deals with noise
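For symmetry with the JRip snippets above, a minimal sketch that builds J48 on zoo.arff and prints the resulting decision tree (same assumed setup).

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48Tree {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("zoo.arff");
            data.setClassIndex(data.numAttributes() - 1);
            J48 tree = new J48();          // C4.5-style decision tree, pruned by default
            // tree.setUnpruned(true);     // uncomment to grow the full, unpruned tree
            tree.buildClassifier(data);
            System.out.println(tree);      // text form of the tree, plus size and number of leaves
        }
    }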

Page 29: Weka presentation cmt111

Default options J48

Page 30: Weka presentation cmt111

Default Options J48

Page 31: Weka presentation cmt111

Cross-validation = 20 (J48)

No major changes

ROC and PRC area dropped insignificantly

The confusion matrix and decision tree remain the same

Page 32: Weka presentation cmt111

Changing the cross-validation value (J48)

Cross-Validation (k)    Correctly Classified Instances (%)
10                      92.079
20                      92.079
30                      92.079
40                      92.079
50                      92.079
60                      92.079
70                      92.079
80                      92.079
90                      92.079

Changing k changes nothing, including the ROC, PRC, confusion matrix and decision tree.

This is because the number of instances is relatively small for J48 (101).

Page 33: Weka presentation cmt111

Classifier evaluated on the training set (J48)

Page 34: Weka presentation cmt111

Weka Knowledge Flow

Page 35: Weka presentation cmt111

Step by Step

Page 36: Weka presentation cmt111

Comparing two different trees

Page 37: Weka presentation cmt111

Comparing two sets of rules

Page 38: Weka presentation cmt111

Getting same results by Knowledge Flow

Page 39: Weka presentation cmt111

Getting same results by Knowledge Flow (cont’d)

Page 40: Weka presentation cmt111

J48 or JRip on Zoo.arff? Comparison with the Experimenter Feature

The Experimenter is suitable for:

● Large Scale Experiments

● Automation

● Statistics can be stored in .arff format

● …..

● Classifiers Comparison
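The Experimenter itself is a GUI/batch tool, but a simplified version of the comparison can be sketched in code: one 10-fold cross-validation of each classifier on zoo.arff. The repeated runs and the significance test that the Experimenter adds are not reproduced here.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.JRip;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CompareClassifiers {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("zoo.arff");
            data.setClassIndex(data.numAttributes() - 1);

            Classifier[] classifiers = { new J48(), new JRip() };
            for (Classifier c : classifiers) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(c, data, 10, new Random(1));
                System.out.printf("%s: %.2f%% correctly classified%n",
                        c.getClass().getSimpleName(), eval.pctCorrect());
            }
        }
    }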

Page 41: Weka presentation cmt111

J48 or JRip on Zoo.arff? Results

With:

● Significance of 10% on the percentage of correctly classified instances

● Cross-Validation=10

● Default Classifiers Options

We see that:

● J48 is the Winner on Zoo.arff

Page 42: Weka presentation cmt111

Lesson Learnt & Conclusions

● Output readability
  ○ JRip outputs are easy to read and understand
  ○ J48 trees can be more complex to read

● Performance
  ○ J48 beats JRip, as it generates a general tree without pre-processing
  ○ J48 generates a tree whose precision is high
  ○ No big difference for a small dataset like zoo.arff

● Weka test options
  ○ Can mislead someone who only looks for good results
  ○ There is no good/bad classifier; it depends on many characteristics (dataset size, options used, classifier used, etc.)

● Weka
  ○ The Experimenter gives a good way to compare classifiers
  ○ Knowledge Flow helps to see the intermediary steps in the generation of a classifier (helps to understand how some options work, like cross-validation)

Page 43: Weka presentation cmt111

Q & A

Thank you