# Learning Algorithms

## Supervised

| Algorithm | Task | Comment | Assumptions | Pre-Processing | Scope | \(d_\text{VC}\) | Bias | Variance | Generalization | Advantages | Disadvantages | Training Time | Inference Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Least Squares | Regression | Normal-equations sketch after this table | Mean has a linear relationship with the inputs<br>Constant variance | | Global | \(p+1\) | High | Low | Good for \(n \gg p\) | | | \(O(n_\text{train} \, p^2 + p^3)\) | \(O(n_\text{inf} \, p)\) |
| Logistic Regression | Classification | Binary | | | Global | \(p+1\) | High | Low | Good for \(n \gg p\) | | | \(O(n_\text{train} \, p)\) | \(O(n_\text{inf} \, p)\) |
| Naive Bayes | Classification | | Features are conditionally independent given the class | | | | | | | | | \(O(n_\text{train} \, p)\) | \(O(n_\text{inf} \, p)\) |
| Ordered Regression | Classification | Ordered classes | | | | | | | | | | | |
| Piecewise Constant | Regression/Classification | | | | Local | | | | | | | | |
| Piecewise Polynomial | Regression/Classification | | | | Local | | | | | | | | |
| SVM | Regression/Classification | Margin-based | | | | | | | | | Computationally expensive | | |
| Gaussian Processes | Regression/Classification | | | | | | | | | | | | |
| KNN (Nearest Neighbor) | Regression/Classification | Good baseline model<br>Can use mean, median, or mode for regression<br>Can use weighting or voting for classification<br>(sketch after this table) | Underlying relationship between \(x\) and \(y\) is continuous | - Feature scaling<br>- Dimensionality reduction | | | | | | + Lazy learner: no training time; new training data can be added without re-training<br>+ Multi-class out-of-the-box | - Lazy learner: high space complexity, high inference time<br>- Cannot use categorical features<br>- Does not work well for large \(n\) or \(p\): must compute the distance of the inference record w.r.t. every training record<br>- Sensitive to noise<br>- Sensitive to missing data<br>- Sensitive to feature scale<br>- Cannot extrapolate | | |
| Decision Tree | Regression/Classification | Automatic piecewise constant<br>Essentially the opposite of OLS in its characteristics | | | Local | | Low | High | | - Highly interpretable<br>- Auto-detects non-linear relationships<br>- Auto-models variable interactions<br>- Fast evaluation: traversal only touches a subset of attributes | - Poor regression performance<br>- Unstable: the tree structure is sensitive to the training data; changing the training data changes the tree<br>- Requires a large number of splits for even simple relationships<br>- Cannot extrapolate | | |
| Linear Tree | Regression/Classification | Automatic piecewise polynomial | | | Local | | | | | Can extrapolate | | | |
| Linear Forest | Regression/Classification | | | | | | | | | | | | |
| Linear Boosting | Regression/Classification | | | | | | | | | | | | |
| Random Forest | Regression/Classification | Bagged trees | | | Local | | | | | | Cannot extrapolate | | |
| XGBoost | Regression/Classification | Boosted trees | | | Local | | | | | | Cannot extrapolate | | |
| CatBoost | Regression/Classification | Boosted trees | | | Local | | | | | | Cannot extrapolate | | |
| LightGBM | Regression/Classification | Boosted trees | | | Local | | | | | | Cannot extrapolate | | |
| Random Forest Proximity | Missing-value imputation<br>Anomaly detection<br>Clustering | | | | | | | | | | | | |
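The Least Squares training cost above falls straight out of the normal equations: forming \(X^\top X\) costs \(O(n p^2)\) and solving the resulting \(p \times p\) system costs \(O(p^3)\). A minimal NumPy sketch (plain dense solve, no regularization; function names are illustrative):

```python
import numpy as np

def ols_fit(X, y):
    """Normal equations: forming X^T X is O(n_train p^2),
    solving the p x p system is O(p^3)."""
    Xb = np.column_stack([np.ones(len(X)), X])  # add intercept column
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

def ols_predict(X, beta):
    """Inference is one dot product per record: O(n_inf p)."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

X = np.random.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.5
print(ols_fit(X, y))  # approximately [0.5, 1, 2, 3]
```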
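The KNN row is easiest to see in code: "training" is just storing the data, which is why new training data needs no re-training, and why all the distance work (and cost) lands at inference. A minimal sketch (NumPy, Euclidean distance; `k` and the aggregation choices are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5, task="regression"):
    """Lazy learner: no training step; all work happens at inference.
    Computes the distance of x_new to every training record: O(n * p)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    idx = np.argsort(dists)[:k]                      # k nearest neighbours
    if task == "regression":
        return y_train[idx].mean()                   # could also use median or mode
    # classification: majority vote (could also weight votes by 1/distance)
    values, counts = np.unique(y_train[idx], return_counts=True)
    return values[np.argmax(counts)]

# usage: adding new training data is just an array append, no re-training
X = np.random.rand(100, 3)
y = np.random.rand(100)
print(knn_predict(X, y, np.random.rand(3), k=5))
```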
## Unsupervised

### Clustering

| Algorithm | Assumptions |
|---|---|
| K-Means / K-Medoids | Euclidean distance assumes spherical clusters (sketch after this table) |
| Gaussian Mixtures | |
| Hierarchical Clustering | |
| One-Many Clustering | |
| Graph Clustering | |
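A quick way to see the spherical-clusters assumption: K-Means assigns each point to the nearest centroid under Euclidean distance, so elongated or stretched clusters tend to get cut across rather than separated. A minimal sketch with scikit-learn (the dataset shape and parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two elongated (non-spherical) horizontal stripes
a = rng.normal([0, 0], [5.0, 0.3], size=(200, 2))
b = rng.normal([0, 2], [5.0, 0.3], size=(200, 2))
X = np.vstack([a, b])

# Euclidean K-Means tends to split along the long axis instead of
# separating the two stripes, because it assumes spherical clusters
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))
```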
### Anomaly Detection

| Algorithm | Comment | Disadvantages |
|---|---|---|
| KNN | | |
| Kernel Density Estimation | | |
| Isolation Forest | Since the contamination fraction will usually be unknown, get the score without the contamination offset and normalize it for easier interpretation (sketch after this table)<br>Visualization: 1. rank by the score, 2. plot a histogram of the scores, 3. parallel coordinates | Requires a pre-specified contamination fraction |
| One-Class SVM | | |
| Autoencoders | Reconstruction loss will be high for anomalous events (sketch after this table) | |
| Cluster Analysis | | |
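For the Isolation Forest comment, scikit-learn already exposes a score without the contamination offset: `score_samples` (unlike `decision_function`, which subtracts an offset derived from the contamination setting). A minimal sketch (the data and the min-max normalization are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 4)),
               rng.normal(6, 1, size=(5, 4))])  # a few planted anomalies

iso = IsolationForest(random_state=0).fit(X)

# raw score, no contamination offset; lower = more anomalous
scores = iso.score_samples(X)

# min-max normalize so 1 = most anomalous, for easier interpretation
anomaly = (scores.max() - scores) / (scores.max() - scores.min())

# visualization step 1: rank records by the score
ranked = np.argsort(-anomaly)
print(ranked[:10])  # indices of the 10 most anomalous records
```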
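Same idea in miniature for the autoencoder row: train a network to reconstruct its own input through a narrow bottleneck; records it fails to reconstruct get a high loss and are flagged. A sketch using scikit-learn's `MLPRegressor` as a stand-in autoencoder (the architecture and data are illustrative assumptions, not a tuned model):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 8)),
               rng.normal(4, 1, size=(5, 8))])  # planted anomalies

# a tiny autoencoder: learn to reproduce the input through a
# 2-unit bottleneck hidden layer
ae = MLPRegressor(hidden_layer_sizes=(2,), max_iter=2000, random_state=0)
ae.fit(X, X)

# reconstruction loss per record; high values flag anomalous events
recon_error = ((X - ae.predict(X)) ** 2).mean(axis=1)
print(np.argsort(-recon_error)[:5])  # most anomalous records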
### Ranking / Scoring

| Algorithm | Comment |
|---|---|
| IDK | 1. Robust-scale \(x_{ij}\)<br>2. Subtract \(\min(x_j)\) from \(x_{ij}\) for every \(j\), to prevent unintended consequences in step 3<br>3. Convert each record into its Euclidean distance from the origin<br>4. Cluster<br>5. Score<br>- Discrete score: larger distance of the cluster center from the origin<br>- Continuous score: not sure yet<br>(sketch after this table) |
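The pipeline above translates almost line-for-line into scikit-learn. A minimal sketch (the scaler choice, cluster count, and discrete scoring rule are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.cluster import KMeans

def rank_records(X, n_clusters=4):
    # 1. robust-scale each feature x_ij
    Z = RobustScaler().fit_transform(X)
    # 2. shift so every feature is >= 0, so that step 3's distance
    #    from the origin is not distorted by negative values
    Z = Z - Z.min(axis=0)
    # 3. collapse each record to its Euclidean distance from the origin
    d = np.linalg.norm(Z, axis=1)
    # 4. cluster the 1-D distances
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(d.reshape(-1, 1))
    # 5. discrete score: rank clusters by how far their center is
    #    from the origin (higher rank = further = higher score)
    order = np.argsort(km.cluster_centers_.ravel())
    score = {c: s for s, c in enumerate(order)}
    return np.array([score[c] for c in labels])

X = np.random.rand(200, 5)
print(np.bincount(rank_records(X)))  # records per score level
```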
## Reinforcement Learning

- Q-Learning

## Curse of Dimensionality

As the number of dimensions increases, the relative difference between the nearest and farthest distances tends to 0, so all points begin to look equally far apart (see the sketch below).

Distance-based models are the most affected:

- KNN
- K-Means
- Tree-based classification
- SVM?
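A quick empirical check of this effect (the dimensions and sample size are arbitrary): draw random points and compare the nearest and farthest distances from one point as \(p\) grows; their relative gap shrinks toward 0.

```python
import numpy as np

rng = np.random.default_rng(0)
for p in [2, 10, 100, 1000]:
    X = rng.random((500, p))                  # 500 random points in [0, 1]^p
    d = np.linalg.norm(X[0] - X[1:], axis=1)  # distances from one point
    # relative contrast (farthest - nearest) / nearest -> 0 as p grows
    print(p, (d.max() - d.min()) / d.min())
```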