Kaggle presentation
![Page 1: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/1.jpg)
Winning Kaggle Competitions
Hendrik Jacob van Veen - Nubank Brasil
![Page 2: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/2.jpg)
About Kaggle
Biggest platform for competitive data science in the world
Currently 500k+ competitors
Great platform to learn about the latest techniques and how to avoid overfitting
Great platform to share and meet up with other data freaks
![Page 3: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/3.jpg)
Approach
Get a good score as fast as possible
Using versatile libraries
Model ensembling
![Page 4: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/4.jpg)
Get a good score as fast as possible
Get the raw data into a universal format like SVMlight or Numpy arrays.
Failing fast and failing often / Agile sprint / Iteration
Sub-linear debugging: “output enough intermediate information as a calculation is progressing to determine before it finishes whether you've injected a major defect or a significant improvement.” Paul Mineiro
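One way to picture sub-linear debugging is an online learner that scores each example before updating on it and prints the running loss at exponentially spaced checkpoints. The sketch below is only an illustration under assumptions: the function name `train_online_logistic`, the parameters `lr` and `report_base`, and the synthetic data are all hypothetical and do not come from the slides.

```python
import numpy as np

def train_online_logistic(X, y, lr=0.1, report_base=2):
    """Single-pass SGD logistic regression that logs a progressive loss.

    Each example is scored *before* the weights are updated, so the running
    average is an honest estimate that is available long before the pass ends.
    """
    w = np.zeros(X.shape[1])
    total_loss, next_report, history = 0.0, 1, []
    for i, (x, target) in enumerate(zip(X, y), start=1):
        p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
        p = min(max(p, 1e-12), 1 - 1e-12)  # keep the logarithms finite
        total_loss += -(target * np.log(p) + (1 - target) * np.log(1 - p))
        w += lr * (target - p) * x  # gradient step on the log loss
        if i == next_report:
            history.append((i, total_loss / i))
            print(f"examples={i:6d}  avg_progressive_logloss={total_loss / i:.4f}")
            next_report *= report_base  # report after 1, 2, 4, 8, ... examples
    return w, history

# Synthetic, linearly separable demo data (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4096, 5))
y_demo = (X_demo @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) > 0).astype(float)
w_demo, history_demo = train_online_logistic(X_demo, y_demo)
```

If the printed loss stops falling, or jumps after a code change, the run can be killed after a few thousand examples instead of waiting for it to finish.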
![Page 5: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/5.jpg)
Using versatile libraries
Scikit-learn
Vowpal Wabbit
XGBoost
Keras
Other tools get Scikit-learn API wrappers
![Page 6: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/6.jpg)
Model Ensembling
Voting
Averaging
Bagging
Boosting
Binning
Blending
Stacking
![Page 7: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/7.jpg)
General Strategy
Try to create “machine learning”-learning algorithms with optimized pipelines that are:
Data agnostic (Sparse, dense, missing values, larger than memory)
Problem agnostic (Classification, regression, clustering)
Solution agnostic (Production-ready, PoC, latency)
Automated (Turn on and go to bed)
Memory-friendly (Don’t want to pay for AWS)
Robust (Good generalization, concept drift, consistent)
![Page 8: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/8.jpg)
First Overview I
Classification? Regression?
Evaluation Metric
Description
Benchmark code
“Predict human activities based on their smartphone usage. Predict if a user is sitting, walking etc.” - Smartphone User Activity Prediction
“Given the HTML of ~337k websites served to users of StumbleUpon, identify the paid content disguised as real content.” - Dato Truly Native?
![Page 9: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/9.jpg)
First Overview II
Counts
Images
Text
Categorical
Floats
Dates
0.28309984, -0.025501173, … , -0.11118051, 0.37447712
<!Doctype html><html><head><meta charset=utf-8> … </html>
![Page 10: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/10.jpg)
First Overview III
Data size?
Dimensionality?
Number of train samples & test samples?
Online or offline learning?
Linear problem or non-linear problem?
Previous competitions that were similar?
![Page 11: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/11.jpg)
Branch
If: Issues with the data -> Tedious clean-up
Join JSON tables, Impute missing values, Curse Kaggle and join another competition
Else: Get data into Numpy arrays, we want:
X_train, y, X_test
![Page 12: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/12.jpg)
Local Evaluation
Set up local evaluation according to competition metric
Create a simple benchmark (useful for exploration and discarding models)
5-fold stratified cross-validation usually does the trick
A very important step for fast iteration and for saving submissions, yet it is easy to be lazy and rely on the leaderboard instead.
Area Under the Curve, Multi-Class Classification Accuracy
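A local evaluation loop along these lines could look like the following sketch. It hand-rolls a stratified 5-fold split and a rank-based AUC in NumPy so that it stays self-contained. The helper names `stratified_kfold_indices` and `auc_score`, the synthetic data, and the class-mean "benchmark model" are assumptions made for illustration, not the author's code.

```python
import numpy as np

def stratified_kfold_indices(y, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs that preserve class proportions."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        fold_of[idx] = np.arange(len(idx)) % n_splits  # deal out round-robin
    for k in range(n_splits):
        yield np.flatnonzero(fold_of != k), np.flatnonzero(fold_of == k)

def auc_score(y_true, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formula."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):  # tied scores share their average rank
        tie = scores == value
        ranks[tie] = ranks[tie].mean()
    n_pos = int((y_true == 1).sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Synthetic imbalanced data: class 1 is shifted by +1 in every feature.
rng = np.random.default_rng(42)
y = np.array([0] * 80 + [1] * 20)
X = rng.normal(size=(100, 3)) + y[:, None]

fold_scores = []
for train_idx, valid_idx in stratified_kfold_indices(y, n_splits=5):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Benchmark "model": direction between class means, fitted on the train fold only.
    direction = X_tr[y_tr == 1].mean(axis=0) - X_tr[y_tr == 0].mean(axis=0)
    fold_scores.append(auc_score(y[valid_idx], X[valid_idx] @ direction))
print(f"AUC: {np.mean(fold_scores):.3f} +/- {np.std(fold_scores):.3f}")
```

Reporting the spread across folds next to the mean helps judge whether a later improvement is larger than the fold-to-fold noise.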
![Page 13: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/13.jpg)
Data Exploration
Min, Max, Mean, Percentiles, Std, Plotting
Can detect: leakage, golden features, feature engineering tricks, data health issues.
Caveat: At least one top 50 Kaggler used to not look at the data at all:
“It’s called machine learning for a reason.”
![Page 14: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/14.jpg)
Feature Engineering I
Log-transform count features, tf-idf transform text features
Unsupervised transforms / dimensionality reduction
Manual inspection of data
Dates -> day of month, is_holiday, season, etc.
Create histograms and cluster similar features
Using VW-varinfo or XGBfi to check 2-3-way interactions
Row stats: mean, max, min, number of NA’s.
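Three of the transforms listed above (log-transforming counts, expanding dates, and row statistics) can be sketched in a few lines. The function `date_features`, the season mapping (northern-hemisphere meteorological seasons), and the toy arrays are assumptions for illustration. `is_holiday` is left out because it needs a calendar that the slides do not specify.

```python
import numpy as np
from datetime import date

# Log-transform count features; log1p(x) = log(1 + x) keeps zero counts at zero.
counts = np.array([0, 1, 9, 99, 9999])
log_counts = np.log1p(counts)

def date_features(d):
    """Expand a date into simple numeric features."""
    return {
        "day_of_month": d.day,
        "day_of_week": d.weekday(),      # Monday == 0
        "month": d.month,
        "season": (d.month % 12) // 3,   # 0=winter, 1=spring, 2=summer, 3=autumn
    }

# Row statistics that ignore missing values, plus the number of NAs per row.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])
row_stats = np.column_stack([
    np.nanmean(X, axis=1),
    np.nanmax(X, axis=1),
    np.nanmin(X, axis=1),
    np.isnan(X).sum(axis=1),
])
```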
![Page 15: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/15.jpg)
Feature Engineering II
Bin numerical features to categorical features
Bayesian encoding of categorical features to likelihood
Genetic programming
Random-swap feature elimination
Time binning (customer bought in last week, last month, last year …)
Expand data (Coates & Ng, Random Bit Regression)
Automate all of this
![Page 16: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/16.jpg)
Feature Engineering III
Categorical features need some special treatment
Onehot-encode for linear models (sparsity)
Colhot-encode for tree-based models (density)
Counthot-encode for large cardinality features
Likelihood-encode for experts…
![Page 17: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/17.jpg)
Algorithms I
A bias-variance trade-off between simple and complex models
![Page 18: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/18.jpg)
Algorithms II
There is No Free Lunch in statistical inference
We show that all algorithms that search for an extremum of a cost function perform exactly the same, when averaged over all possible cost functions. – Wolpert & Macready, No free lunch theorems for search
Practical Solution for low-bias low-variance models:
Use prior knowledge / experience to limit search (Let algo’s play to their known strengths for particular problems)
Remove or avoid their weaknesses
Combine/Bag their predictions
![Page 19: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/19.jpg)
Random Forests I
A Random Forest is an ensemble of decision trees.
"Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. […] More robust to noise." - "Random Forests", Breiman
![Page 20: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/20.jpg)
Random Forests II
Strengths
Fast
Easy to tune
Easy to inspect
Easy to explore data with
Good Benchmark
Very wide applicability
Can introduce randomness / Diversity
Weaknesses
Memory Hungry
Popular
Slower for test time
![Page 21: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/21.jpg)
GBM I
A GBM trains weak models on samples that previous models got wrong
"A method is described for converting a weak learning algorithm [the learner can produce an hypothesis that performs only slightly better than random guessing] into one that achieves arbitrarily high accuracy." - "The Strength of Weak Learnability", Schapire
![Page 22: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/22.jpg)
GBM II
Strengths
Can achieve very good results
Can model complex problems
Works on wide variety of problems
Use custom loss functions
No need to scale data
Weaknesses
Slower to train
Easier to overfit than RF
Weak learner assumption is broken along the way
Tricky to tune
Popular
![Page 23: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/23.jpg)
SVM I
Classification and Regression using Support Vectors
"Nothing is more practical than a good theory." ‘The Nature of Statistical Learning Theory’, Vapnik
![Page 24: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/24.jpg)
SVM II
Strengths
Strong theoretical guarantees
Tuning regularization parameter helps prevent overfit
Kernel Trick: Use custom kernels, turn linear kernel into non-linear kernel
Achieve state-of-the-art on select problems
Weaknesses
Slower to train
Memory heavy
Requires a tedious grid-search for best performance
Will probably time-out on large datasets
![Page 25: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/25.jpg)
Nearest Neighbours I
Look at the distance to other samples
"The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points." ‘Nearest neighbor pattern classification’, Cover & Hart
![Page 26: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/26.jpg)
Nearest Neighbours II
Strengths
Simple
Unpopular
Non-linear
Easy to tune
Detect near-duplicates
Weaknesses
Simple
Does not work well on average
Depending on data size: Slow
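The nearest-neighbour rule, and its use for spotting near-duplicates, fits in a few lines of NumPy. This is a brute-force sketch with a hypothetical function name. It builds the full distance matrix, which is exactly why the method gets slow as the data grows.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, X_query):
    """Return the label of, and distance to, the closest training point."""
    # Squared Euclidean distance between every query and every training point.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    distances = np.sqrt(d2[np.arange(len(X_query)), nearest])
    return y_train[nearest], distances

X_train = np.array([[0.0, 0.0], [10.0, 10.0]])
y_train = np.array([0, 1])
X_query = np.array([[1.0, 1.0], [9.0, 8.0], [10.0, 10.0]])
labels, distances = nearest_neighbor_predict(X_train, y_train, X_query)
# A distance of (almost) zero flags a test row that duplicates a training row.
near_duplicates = distances < 1e-9
```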
![Page 27: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/27.jpg)
Perceptron I
Update weights when wrong prediction, else do nothing
The embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. ‘New York Times’, Rosenblatt
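The update rule "change the weights on a wrong prediction, else do nothing" can be sketched as below. Labels are assumed to be in {-1, +1}, and the function name and the `epochs` default are illustrative choices.

```python
import numpy as np

def train_perceptron(X, y, epochs=50):
    """Classic perceptron: weights change only when a sample is misclassified."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, t in zip(X, y):
            if t * (np.dot(w, x) + b) <= 0:  # wrong side of (or on) the boundary
                w += t * x
                b += t
                mistakes += 1
        if mistakes == 0:  # a full clean pass: the data are separated
            break
    return w, b

# Logical AND is linearly separable, so the perceptron converges on it.
X_and = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_and = np.array([-1, -1, -1, 1])
w_and, b_and = train_perceptron(X_and, y_and)
pred_and = np.where(X_and @ w_and + b_and > 0, 1, -1)
```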
![Page 28: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/28.jpg)
Perceptron II
Strengths
Cool / Street Cred
Extremely Simple
Fast / Sparse updates
Online Learning
Works well with text
Weaknesses
Other linear algo’s usually beat it
Does not work well on average
No regularization
![Page 29: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/29.jpg)
Neural Networks I
Inspired by biological systems (Connected neurons firing when threshold is reached)
Because of the "all-or-none" character of nervous activity, neural events and the relations among them can be treated by means of propositional logic. […] for any logical expression satisfying certain conditions, one can find a net behaving in the fashion it describes. ‘A Logical Calculus of the Ideas Immanent in Nervous Activity’, McCulloch & Pitts
![Page 30: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/30.jpg)
Neural Networks II
Strengths
The best for images
Can model any function
End-to-end Training
Amortizes feature representation
Weaknesses
Can be difficult to set up
Not very interpretable
Requires specialized hardware
Underfit / Overfit
![Page 31: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/31.jpg)
Vowpal Wabbit I
Online learning while optimizing a loss function
We present a system and a set of techniques for learning linear predictors with convex losses on terascale datasets, with trillions of features, billions of training examples and millions of parameters in an hour using a cluster of 1000 machines. ‘A Reliable Effective Terascale Linear Learning System’, Agarwal et al.
![Page 32: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/32.jpg)
Vowpal Wabbit II
Strengths
Fixed memory constraint
Extremely fast
Feature expansion
Difficult to overfit
Versatile
Weaknesses
Different API
Manual feature engineering
Loses against boosting
Requires practice
Hashing can obscure
![Page 33: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/33.jpg)
Others
Factorization Machines
PCA
t-SNE
SVD / LSA
Ridge Regression
GLMNet
Genetic Algorithms
Bayesian
Logistic Regression
Quantile Regression
AdaBoosting
SGD
![Page 34: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/34.jpg)
Ensembles I
Combine models in a way that outperforms individual models.
“That’s how almost all ML competitions are won” - ‘Dark Knowledge’ Hinton et al.
Ensembles reduce the chance of overfit.
Bagging / Averaging -> Lower variance, slightly lower bias
Blending / Stacking -> Remove biases of base models
![Page 35: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/35.jpg)
Ensembles II
Practical tips:
Use diverse models
Use diverse feature sets
Use many models
Do not leak any information
![Page 36: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/36.jpg)
Stacked Generalization I
Train one model on the predictions of another model
A scheme for minimizing the generalization error rate of one or more generalizers. Stacked generalization works by deducing the biases of the generalizer(s) with respect to a provided learning set. This deduction proceeds by generalizing in a second space whose inputs are (for example) the guesses of the original generalizers when taught with part of the learning set and trying to guess the rest of it, and whose output is (for example) the correct guess. - ‘Stacked Generalization’, Wolpert
![Page 37: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/37.jpg)
Stacked Generalization II
Train one model on the predictions of another model
![Page 38: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/38.jpg)
Stacked Generalization III
Using weak base models vs. using strong base models
Using average of out-of-fold predictors vs. One model for testing
One can also stack features when these are not available in test set.
Can share train set predictions based on different folds
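A minimal stacking sketch, assuming closed-form ridge regression as a stand-in for every base model and for the stacker, is shown below. It uses the "average of out-of-fold predictors" variant for the test set. All names (`fit_ridge`, `out_of_fold_predictions`, `base_specs`) and the synthetic data are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression; stands in for any base model."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def out_of_fold_predictions(X, y, X_test, alpha, n_folds=5):
    """Out-of-fold predictions for train, fold-averaged predictions for test."""
    fold_of = np.arange(len(y)) % n_folds
    oof = np.zeros(len(y))
    test_pred = np.zeros(len(X_test))
    for k in range(n_folds):
        held_out = fold_of == k
        w = fit_ridge(X[~held_out], y[~held_out], alpha)
        oof[held_out] = X[held_out] @ w        # rows this fit never saw
        test_pred += (X_test @ w) / n_folds    # average the fold models
    return oof, test_pred

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 0.0])
X_train = rng.normal(size=(200, 4))
y_train = X_train @ true_w + rng.normal(scale=0.1, size=200)
X_test = rng.normal(size=(50, 4))
y_test = X_test @ true_w + rng.normal(scale=0.1, size=50)

# Level 1: base models that each see a different feature subset.
base_specs = [(slice(0, 2), 0.1), (slice(2, 4), 1.0), (slice(1, 3), 10.0)]
oof_columns, test_columns = [], []
for cols, alpha in base_specs:
    oof, test_pred = out_of_fold_predictions(
        X_train[:, cols], y_train, X_test[:, cols], alpha)
    oof_columns.append(oof)
    test_columns.append(test_pred)

# Level 2: the stacker is trained on out-of-fold predictions only.
Z_train = np.column_stack(oof_columns)
Z_test = np.column_stack(test_columns)
stacker_w = fit_ridge(Z_train, y_train, alpha=1e-3)
stacked_pred = Z_test @ stacker_w

base_mse = [np.mean((y_test - Z_test[:, j]) ** 2) for j in range(Z_test.shape[1])]
stacked_mse = np.mean((y_test - stacked_pred) ** 2)
```

Training the stacker on in-sample predictions instead of out-of-fold ones would leak the target into the second level, which is the failure the out-of-fold construction is meant to prevent.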
![Page 39: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/39.jpg)
StackNet
We need to go deeper:
Splitting node: x1 > 5? 1 else 0
Decision tree: x1 > 5 AND x2 < 12?
Random forest: avg ( x1 > 5 AND x2 < 12?, x3 > 2? )
Stacking-1: avg ( RF1_pred > 0.9?, RF2_pred > 0.92? )
Stacking-2: avg ( S1_pred > 0.93?, S2_pred < 0.77? )
Stacking-3: avg ( SS1_pred > 0.98?, SS2_pred > 0.97? )
![Page 40: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/40.jpg)
Bagging Predictors I
Averaging submissions to reduce variance
"Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor." - "Bagging Predictors". Breiman
![Page 41: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/41.jpg)
Bagging Predictors II
Train models with:
Different data sets
Different algorithms
Different features subsets
Different sample subsets
Then average / vote aggregate these
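The recipe above can be sketched with bootstrap row samples and random feature subsets, using ordinary least squares as a placeholder model. The function name, the number of bags, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def bagged_predict(X_train, y_train, X_test, n_bags=25, seed=0):
    """Fit one model per bag on resampled rows and a feature subset, then average."""
    rng = np.random.default_rng(seed)
    n, d = X_train.shape
    preds = []
    for _ in range(n_bags):
        rows = rng.integers(0, n, size=n)                          # bootstrap sample
        cols = rng.choice(d, size=max(1, d // 2), replace=False)   # feature subset
        w, *_ = np.linalg.lstsq(X_train[np.ix_(rows, cols)], y_train[rows], rcond=None)
        preds.append(X_test[:, cols] @ w)
    preds = np.array(preds)
    return preds.mean(axis=0), preds

rng = np.random.default_rng(3)
coef = rng.normal(size=6)
X_bag_train = rng.normal(size=(100, 6))
y_bag_train = X_bag_train @ coef + rng.normal(scale=0.5, size=100)
X_bag_test = rng.normal(size=(30, 6))
y_bag_test = X_bag_test @ coef + rng.normal(scale=0.5, size=30)

avg_pred, all_preds = bagged_predict(X_bag_train, y_bag_train, X_bag_test)
mse_of_average = np.mean((y_bag_test - avg_pred) ** 2)
mse_of_members = np.mean((y_bag_test - all_preds) ** 2, axis=1)
```

For squared error, the averaged prediction is never worse than the average member (by convexity), so `mse_of_average` is at most `mse_of_members.mean()`.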
![Page 42: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/42.jpg)
Bagging Predictors III
One can average with:
Plain average
Geometric mean
Rank mean
Harmonic mean
KazAnova’s brute-force weighted averaging
Caruana’s forward greedy model selection
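The first four averaging schemes can be compared on a toy prediction matrix (rows are models, columns are samples). The helper `rank_normalize` is a hypothetical name. It breaks ties by order of appearance, which is one of several reasonable choices.

```python
import numpy as np

def rank_normalize(p):
    """Map scores to ranks scaled into [0, 1]; ties broken by order of appearance."""
    ranks = np.argsort(np.argsort(p, kind="mergesort"), kind="mergesort")
    return ranks / (len(p) - 1)

# Two models' predicted probabilities for four samples.
preds = np.array([[0.9, 0.1, 0.5, 0.7],
                  [0.6, 0.2, 0.4, 0.8]])

plain = preds.mean(axis=0)
geometric = np.exp(np.log(preds).mean(axis=0))         # requires predictions > 0
harmonic = preds.shape[0] / (1.0 / preds).sum(axis=0)  # requires predictions > 0
rank_mean = np.mean([rank_normalize(p) for p in preds], axis=0)
```

Rank averaging discards calibration and keeps only the ordering, so it is mainly suited to ranking metrics such as AUC.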
![Page 43: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/43.jpg)
Brute-Force Weighted Average
Create out-of-fold predictions for train set for n models
Pick a stepsize s, and set n weights
Try every possible weight with stepsize s
Look which set of n weights improves the train set score the most
Can do in cross-validation-style manner for extra robustness.
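A sketch of the grid search described above follows. Two details are assumptions rather than statements from the slide: the weights are constrained to be non-negative and sum to 1, and the default score is negative mean squared error. The function name and toy data are hypothetical.

```python
import itertools
import numpy as np

def brute_force_weights(oof_preds, y, step=0.1, score=None):
    """Try every weight vector on a grid of the given step size."""
    if score is None:
        score = lambda y_true, p: -np.mean((y_true - p) ** 2)  # higher is better
    n_models = oof_preds.shape[1]
    ticks = int(round(1 / step))
    best_w, best_s = None, -np.inf
    # Enumerate the first n-1 weights; the last one is whatever is left over.
    for combo in itertools.product(range(ticks + 1), repeat=n_models - 1):
        if sum(combo) > ticks:
            continue
        w = np.array(combo + (ticks - sum(combo),)) / ticks
        s = score(y, oof_preds @ w)
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s

# Toy out-of-fold predictions: models 0 and 1 err in opposite directions,
# model 2 is an uninformative constant.
y_bf = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
err = np.array([0.2, -0.1, 0.3, -0.2, 0.1])
oof_bf = np.column_stack([y_bf + err, y_bf - err, np.full(5, 0.5)])
best_w, best_s = brute_force_weights(oof_bf, y_bf, step=0.1)
```

The number of grid points grows quickly with the number of models and with smaller step sizes, so this is only practical for a handful of models.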
![Page 44: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/44.jpg)
Greedy forward model selection (Caruana)
Create out-of-fold predictions for the train set
Start with a base ensemble of 3 best models
Loop: Add every model from library to ensemble and pick 4 models that give best train score performance
Using selection with replacement, models can be picked multiple times (weighting them)
Using random subset selection from library in loop avoids overfitting to single best model.
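A simplified variant of this procedure is sketched below. It adds one model per iteration rather than several, stops when no candidate improves the out-of-fold score, and omits the random-subset step. These simplifications, the use of mean squared error, and all names are assumptions for illustration.

```python
import numpy as np

def greedy_forward_selection(oof_preds, y, n_init=3, n_iter=20):
    """Greedy ensemble selection with replacement; returns per-model weights."""
    def mse(p):
        return np.mean((y - p) ** 2)

    n_models = oof_preds.shape[1]
    # Start from the n_init models with the best individual scores.
    order = np.argsort([mse(oof_preds[:, j]) for j in range(n_models)])
    chosen = [int(j) for j in order[:n_init]]
    for _ in range(n_iter):
        current_sum = oof_preds[:, chosen].sum(axis=1)
        current_err = mse(current_sum / len(chosen))
        # Score of the ensemble if model j were added one more time.
        errs = [mse((current_sum + oof_preds[:, j]) / (len(chosen) + 1))
                for j in range(n_models)]
        best = int(np.argmin(errs))
        if errs[best] >= current_err:
            break  # nothing improves the out-of-fold score any further
        chosen.append(best)
    # A model picked k times gets weight k / len(chosen).
    return np.bincount(chosen, minlength=n_models) / len(chosen)

rng = np.random.default_rng(1)
y_true = rng.normal(size=300)
noise_levels = [0.3, 0.4, 0.5, 0.6, 1.0, 2.0]
oof_matrix = np.column_stack(
    [y_true + rng.normal(scale=s, size=300) for s in noise_levels])
weights = greedy_forward_selection(oof_matrix, y_true)
ensemble_pred = oof_matrix @ weights
```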
![Page 45: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/45.jpg)
Automated Stack ’n Bag I
Automatically train 1000s of models and 100s of stackers, then average everything.
“Hodor!” - Hodor
![Page 46: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/46.jpg)
Automated Stack ’n Bag II
Generalization
Train random models, random parameters, random data set transforms, random feature sets, random sample sets.
Stacking
Train random models, random parameters, random base models, with and without original features, random feature sets, random sample sets.
Bagging
Average a random selection of Stackers and Generalizers. Either pick the best model, or create more random bags and keep averaging until the score no longer improves.
![Page 47: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/47.jpg)
Automated Stack ’n Bag III
Strengths
Wins Kaggle competitions
Best generalization
No tuning
No selection
No human bias
Weaknesses
Extremely slow
Redundant
Inelegant
Very complex
Bad for environment
![Page 48: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/48.jpg)
Leakage I
“The introduction of information about the data mining target, which should not be legitimately available to mine from.” - ‘Leakage in Data Mining: Formulation, Detection, and Avoidance’, Kaufman et al.
“one of the top ten data mining mistakes” - ‘Handbook of Statistical Analysis and Data Mining Applications’, Nisbet et al.
![Page 49: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/49.jpg)
Leakage II
Exploiting Leakage:
In predictive modeling competitions: Allowed and beneficial for results
In Science and Business: A very big NO NO!
In both: Accidental (Complex algo’s find leakage automatically, or KNN finds duplicates)
![Page 50: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/50.jpg)
Leakage III
Dato Truly Native?
This task suffered from data collection leakage:
Dates and certain keywords (Trump) were indicative, and generalized to the private LB (but would not generalize to future data).
Smartphone activity prediction
This task did not have enough randomization (the order of samples in the train and test sets was indicative)
Could manually change predictions, because classes were clustered.
![Page 51: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/51.jpg)
Winning Dato Truly Native? I
Invented StackNet
“Data science is a team sport”: it helps to join up with #1 Kaggler :)
We used basic NLP: Cleaning, lowercasing, stemming, ngrams, chargrams, tf-idf, SVD.
Trained a lot of different models on different datasets.
Started ensembling in the last 2 weeks.
Doing research and fun stuff, while waiting for models to complete.
XGBoost the big winner (somewhat rare to use boosting for sparse text)
![Page 52: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/52.jpg)
Winning Dato Truly Native? II
![Page 53: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/53.jpg)
Winning Smartphone Activity Prediction I
Prototyped Automated Stack ’n Bag (Kaggle Killer).
Let computer run for two days
Automatically inferred feature types
Did not look at the data
Beat very stiff competition
![Page 54: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/54.jpg)
Winning Smartphone Activity Prediction II
![Page 55: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/55.jpg)
General strategy
Being #1 during competition sucks.
Team up
Go crazy with ensembling
Do not worry so much about replication that it freezes progress
Check previous competitions
Be patient and persistent (don’t run out of steam)
Automate a lot
Stay up-to-date with state-of-the-art algorithms and tools
![Page 56: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/56.jpg)
Complexity vs. Practicality I
Most Kaggle-winning models are useless for production. It’s about hyper-optimization. A top 10% model is probably good enough for business.
But what if we could use some Top 1% principles from Kaggle models for business?
1-5% increase in accuracy can matter a lot!
Batch jobs allow us to overcome latency constraints
Ensembles are technically brittle, but give good generalization.
Leave no model behind!
![Page 57: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/57.jpg)
Complexity vs. Practicality II
![Page 58: Kaggle presentation](https://reader031.vdocumento.com/reader031/viewer/2022021422/586fe8f41a28ab92198b4969/html5/thumbnails/58.jpg)
Future
Use re-usable holdout set
Use contextual bandits for training the ensemble
Find more models to add to library
Ensemble pruning / compression
Interpretable black box models