Mean has a linear relationship with the inputs; constant variance
❌
✅
Global
\(k+1\)
High
Low
Good when \(n \gg k\)
\(O(n_\text{train} p^2 + p^3)\)
\(O(n_\text{inf} p)\)
LogisticRegression
Supervised
Classification: Binary
✅
✅
Global
\(k+1\)
High
Low
Good when \(n \gg k\)
\(O(n_\text{train} p)\)
\(O(n_\text{inf} p)\)
Naive Bayes
Supervised
\(O(n_\text{train} p)\)
\(O(n_\text{inf} p)\)
OrderedRegression
Supervised
Classification: Ordinal
Piecewise Constant
Supervised
Regression / Classification
❌
❌
Local
Piecewise Polynomial
Supervised
Regression / Classification
❌
❌
Local
SVM
Supervised
Regression / Classification
Margin-Based
❌
✅
Computationally expensive
Gaussian Processes
Supervised
Regression / Classification
✅
✅
KNN (k-Nearest Neighbors)
Supervised
Regression / Classification
Good baseline model
Can use mean, median, or mode of the neighbors for regression; can use (distance-weighted) voting for classification
Underlying relationship between \(x\) and \(y\) is continuous
- Feature scaling - Dimensionality reduction
❌
❌
+ Lazy learner: no training time; new training data can be added without re-training
+ Can perform multi-class classification out-of-the-box
Lazy learner: high space complexity, high inference time
- Cannot use categorical features
- Does not work well for large \(n\) or \(p\): must compute the distance of the inference record to every training record
- Sensitive to noise
- Sensitive to missing data
- Sensitive to feature scale
- Cannot extrapolate
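A minimal sketch of the scale-sensitivity point above, assuming scikit-learn; the data and pipeline here are purely illustrative:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# KNN is distance-based, so features on very different scales distort the
# neighborhoods; scaling before the distance computation is essential.
X = [[1.0, 100.0], [1.2, 110.0], [9.0, 105.0], [9.5, 95.0]]
y = [0, 0, 1, 1]

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)
pred = knn.predict([[1.1, 108.0]])  # nearest scaled neighbors are class 0
```

Note that `fit` only stores the (scaled) training data (lazy learner); all the distance work happens at `predict` time, which is why inference time and space complexity are high.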
Decision Tree
Supervised
Regression / Classification
Automatic Piecewise Constant
Characteristics are essentially the opposite of OLS
❌
❌
Local
Low
High
- Highly interpretable
- Auto-detects non-linear relationships
- Auto-models variable interactions
- Fast inference: traversal only touches a subset of attributes
- Poor regression performance
- Unstable: the tree structure is sensitive to the training data; changing the training data changes the tree
- Requires a large number of splits even for simple relationships
- Cannot extrapolate
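The "many splits for simple relationships" and "cannot extrapolate" points can be seen in a few lines (a sketch assuming scikit-learn; the data is illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = 3 * X.ravel()  # a simple linear relationship

tree = DecisionTreeRegressor().fit(X, y)

# Inside the training range the tree approximates the line as a staircase of
# constant leaves; outside it, every query falls into the boundary leaf, so
# the prediction goes flat: the tree cannot extrapolate.
inside, outside = tree.predict([[9.5]]), tree.predict([[100.0]])
```

Here `outside` equals `inside`: any input beyond the training range returns the value of the last leaf, whereas a linear model would keep extrapolating the trend.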
Linear Tree
Supervised
Regression / Classification
Automatic Piecewise Polynomial
❌
❌
Local
Can extrapolate
Linear Forest
Supervised
Regression / Classification
Linear Boosting
Supervised
Regression / Classification
Random Forest
Supervised
Regression / Classification
Bagged Trees
❌
❌
Local
- Cannot extrapolate
XGBoost
Supervised
Regression / Classification
Boosted Trees
❌
❌
Local
- Cannot extrapolate
CatBoost
Supervised
Regression / Classification
Boosted Trees
❌
❌
Local
- Cannot extrapolate
LightGBM
Supervised
Regression / Classification
Boosted Trees
❌
❌
Local
- Cannot extrapolate
Random Forest Proximity
Supervised
Missing-value imputation / Anomaly detection / Clustering
K-Means / K-Medoids
Unsupervised
Clustering
Euclidean distance assumes spherical clusters
❌
❌
Gaussian Mixtures
Unsupervised
Clustering
Hierarchical Clustering
Unsupervised
Clustering
One-Many Clustering
Unsupervised
Clustering
Graph Clustering
Unsupervised
Clustering
KNN
Unsupervised
Anomaly Detection
Kernel Density Estimation
Unsupervised
Anomaly Detection
✅
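Density-based anomaly scoring with KDE can be sketched as follows (scikit-learn assumed; the bandwidth, cutoff percentile, and data are illustrative choices):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # inliers

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)

# score_samples returns log-density; low density = anomalous.
# Flag anything below, say, the 1st percentile of the training scores.
log_dens = kde.score_samples(X)
threshold = np.percentile(log_dens, 1)
is_anomaly = kde.score_samples([[8.0, 8.0]])[0] < threshold
```

The threshold percentile plays the same role as a contamination fraction and has to be chosen by the user.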
Isolation Forest
Unsupervised
Anomaly Detection
Since the contamination fraction is usually unknown
Use the raw score without the contamination offset (normalize it for easier interpretation)
Visualization:
1. Rank by score
2. Plot a histogram of scores
3. Parallel-coordinates plot
❌
Requires a pre-specified contamination fraction
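In scikit-learn terms, the "score without contamination offset" above corresponds to `score_samples` (whereas `decision_function` subtracts the contamination-dependent `offset_`). A sketch with illustrative data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[6.0, 6.0]]])  # last row: outlier

iso = IsolationForest(random_state=0).fit(X)

# score_samples: raw anomaly score, no contamination offset; lower = worse.
raw = iso.score_samples(X)

# Min-max normalize to [0, 1] for easier interpretation: 1 = most anomalous.
anomaly = (raw.max() - raw) / (raw.max() - raw.min())
```

The normalized scores can then be ranked, histogrammed, or fed into a parallel-coordinates plot without ever committing to a contamination fraction.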
One-Class SVM
Unsupervised
Anomaly Detection
Autoencoders
Unsupervised
Anomaly detection via reconstruction loss
Reconstruction loss will be high for anomalous events
Kernel PCA, UMAP
Unsupervised
Dimensionality reduction / Anomaly detection via reconstruction loss
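The reconstruction-loss idea can be sketched with plain linear PCA; the same logic carries over to Kernel PCA or an autoencoder (scikit-learn assumed, data illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t, -t]) + 0.05 * rng.normal(size=(200, 3))  # near a line
x_new = np.array([[0.0, 3.0, 3.0]])  # off the line -> anomalous

pca = PCA(n_components=1).fit(X)

def recon_error(Z):
    # Distance between each record and its projection onto the learned subspace;
    # high for records that lie off the manifold the reducer captured.
    return np.linalg.norm(Z - pca.inverse_transform(pca.transform(Z)), axis=1)
```

For `x_new`, `recon_error` is far larger than for any training record, which is exactly the "reconstruction loss will be high for anomalous events" criterion.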
Cluster Analysis
Unsupervised
Anomaly Detection
IDK
Unsupervised
Ranking/Scoring
1. Robust Scaler for \(x_{ij}\)
2. Subtract the column minimum: \(x_{ij} - \min (x_{j}), \forall j\), to prevent unintended consequences in step 3
3. Convert each record into its Euclidean distance from the origin
4. Clustering
5. Score
   - Discrete score: larger distance of the cluster center from the origin
   - Continuous score: not sure yet
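The steps above can be sketched as follows (scikit-learn assumed; the cluster count and data are illustrative, and only the discrete score is implemented since the continuous variant is still open):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))

# 1. Robust-scale each feature (median / IQR, resistant to outliers)
Xs = RobustScaler().fit_transform(X)

# 2. Shift each column to start at 0 so every value is non-negative and the
#    distance in step 3 grows monotonically in each feature
Xs = Xs - Xs.min(axis=0)

# 3. Euclidean distance of each record from the origin
d = np.linalg.norm(Xs, axis=1)

# 4. Cluster the distances
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(d.reshape(-1, 1))

# 5. Discrete score: rank clusters by their center's distance from the origin
order = np.argsort(km.cluster_centers_.ravel())
rank_of = {c: r for r, c in enumerate(order)}
discrete_score = np.array([rank_of[c] for c in km.labels_])
```

Each record then gets an integer score (0 = closest cluster, 2 = farthest), which is the discrete ranking described in step 5.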