Feature Selection¶
The learning algorithm used for feature selection need not be the same as the final model used for fitting
Model | Specification |
---|---|
Null | \(\beta_0 + u\) |
Subset | \(\beta_0 + \sum_{j=1}^{k'} \beta_j x_j + u; \quad k' < k\) |
Full | \(\beta_0 + \sum_{j=1}^{k} \beta_j x_j + u\) |
Also called Subset Selection or Variable Selection
For \(p\) potential predictors, there exist \(2^p\) possible models; comparing all subsets is computationally infeasible (for example, \(p = 30\) already gives over a billion candidate models)
Action | Method | Handles multi-collinearity | Handles interactions | Robust to overfitting | Selection Type | Constraint Type | Methodology | Advantages | Disadvantages |
---|---|---|---|---|---|---|---|---|---|
Drop useless features | Too many missing values | | | | | | | | |
Drop useless features | Variance threshold | | | | | | | | |
Drop useless features | Correlation with target | | | | | | | | Two variables individually uncorrelated with the target can become informative together |
Drop useless features | Dropping redundant (multi-collinear) features | | | | | | | | |
Feature Information | Mutual Information | | | | | | | | |
Feature Importance | Random Forest Feature Importance (MDI) | | | | | | Add a random-noise feature as a baseline | | Only applicable to tree-based models. The trained model has to be very accurate, an assumption rarely met. When a model is overfitted, the features with the highest importance are very likely the cause and should be removed, but we would incorrectly conclude that they are the most important and should be retained |
Feature Importance | MDA (Mean Decrease Accuracy) / PFI (Permutation Feature Importance) | | | | | | Pre-steps: add a random-noise feature as a baseline; repeat with different random seeds to eliminate chance. Training: model with the pure feature(s), model with the permuted feature(s), model without the feature(s). Evaluation: in-sample metric; out-of-sample metric (which features improve it relative to the random baseline); generalization gap (which features cause the model to overfit relative to the random baseline) | Easy to interpret | |
Feature Predictive Power / Specification Search | Full Search | | | | Discrete | Hard | Brute force over all subsets | Finds the global optimum | Predictive power is not a good way to evaluate features. Computationally expensive |
Feature Predictive Power / Specification Search | Forward Stepwise Selection | | | | Discrete | Hard | Start with the null model and add predictors one at a time | Computationally efficient. Lower sample-size requirement | Predictive power is not a good way to evaluate features |
Feature Predictive Power / Specification Search | Backward Stepwise Selection | | | | Discrete | Hard | Start with the full model and remove predictors one at a time | | Predictive power is not a good way to evaluate features. Computationally expensive. Large sample-size requirement |
Regularization | LASSO | | | | Continuous | Soft | Refer to regularization | | |
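A minimal sketch of the MDA/PFI recipe above, using scikit-learn's `permutation_importance` with a random-noise column as the baseline; the synthetic data, the `RandomForestRegressor`, and all column names are placeholder assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder data: replace X (features) and y (target) with your own
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=[f"x{i}" for i in range(5)])
y = 2 * X["x0"] + X["x1"] + rng.normal(size=500)

# Pre-step: add a random-noise feature as the importance baseline
X["noise_baseline"] = rng.normal(size=len(X))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Out-of-sample permutation importance (MDA); repeated shuffles reduce chance effects
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)

importances = pd.Series(result.importances_mean, index=X.columns).sort_values()
baseline = importances["noise_baseline"]

# Features that do not beat the noise baseline are candidates for removal
print(importances)
print("Candidates to drop:", list(importances[importances <= baseline].index))
```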
For handling multi-collinearity (a code sketch follows this list):

- Perform clustering of similar features
    - Similarity: pairwise correlation / mutual information / VIF
    - Clustering: hierarchical is preferred over centroid-based
    - Include a random-noise feature to gauge which relationships are significant
- Modify feature selection to handle clusters, by choosing one of the below
    - Don't choose just one feature per cluster, as one of the correlated features may have an important interaction
    - Cluster MDI: sum of the MDIs of the features in each cluster
    - Cluster MDA: shuffle all features in each cluster simultaneously
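A rough sketch of the clustering idea, assuming pairwise correlation as the similarity measure, SciPy's hierarchical (average-linkage) clustering, and \(R^2\) as the out-of-sample metric; `cluster_features`, `cluster_mda`, and the distance threshold are illustrative choices rather than a fixed recipe.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import r2_score

def cluster_features(X: pd.DataFrame, threshold: float = 0.5):
    """Hierarchically cluster features using 1 - |correlation| as the distance."""
    dist = 1 - X.corr().abs()
    condensed = squareform(dist.values, checks=False)
    labels = fcluster(linkage(condensed, method="average"), t=threshold, criterion="distance")
    return {c: list(X.columns[labels == c]) for c in np.unique(labels)}

def cluster_mda(model, X_test, y_test, clusters, n_repeats=10, seed=0):
    """Cluster MDA: shuffle all features in a cluster simultaneously and
    measure the drop in out-of-sample performance."""
    rng = np.random.default_rng(seed)
    base = r2_score(y_test, model.predict(X_test))
    drops = {}
    for c, cols in clusters.items():
        scores = []
        for _ in range(n_repeats):
            X_perm = X_test.copy()
            # Jointly permute the whole cluster, preserving within-cluster structure
            X_perm[cols] = X_perm[cols].sample(frac=1, random_state=int(rng.integers(1_000_000_000))).values
            scores.append(r2_score(y_test, model.predict(X_perm)))
        drops[tuple(cols)] = base - np.mean(scores)
    return drops
```

Shuffling the whole cluster at once (rather than one feature at a time) avoids the importance of a feature being masked by its correlated neighbours.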
You can explore the obtained importance/predictive power by disaggregating over (illustrated below)

- over- & under-predictions
- each output class
- each input group/hierarchy
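As one possible way to disaggregate by output class, here is a small sketch that recomputes permutation importance separately on the rows of each class; the helper name, the fitted `model`, and the use of the default scorer are assumptions for illustration.

```python
import pandas as pd
from sklearn.inspection import permutation_importance

def importance_by_class(model, X_test, y_test, n_repeats=10, seed=0):
    """Permutation importance computed separately on each output class."""
    per_class = {}
    for cls in sorted(pd.unique(y_test)):
        mask = (y_test == cls)
        res = permutation_importance(model, X_test[mask], y_test[mask],
                                     n_repeats=n_repeats, random_state=seed)
        per_class[cls] = pd.Series(res.importances_mean, index=X_test.columns)
    return pd.DataFrame(per_class)  # rows: features, columns: classes
```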
Forward¶
- Let \(M_0\) denote the null model (intercept only).
- Fit all univariate models. Choose the one with the best in-sample fit (smallest RSS, highest \(R^2\)) and add that variable, say \(x_{(1)}\), to \(M_0\). Call the resulting model \(M_1\).
- Fit all bivariate models that include \(x_{(1)}\): \(y \approx \beta_0 + \beta_{(1)} x_{(1)} + \beta_j x_j\), and add the \(x_j\) from the one with the best in-sample fit to \(M_1\). Call the resulting model \(M_2\).
- Continue until your model selection rule (cross-validation error, AIC, BIC) is lower for the current model than for any of the models that add one more variable (see the sketch below).
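A minimal sketch of this procedure with statsmodels OLS, choosing by in-sample RSS and stopping when AIC stops improving; the `forward_selection` helper and the AIC stopping rule are illustrative choices (cross-validation error or BIC would slot in the same way).

```python
import numpy as np
import statsmodels.api as sm

def forward_selection(X, y):
    """Greedy forward selection: at each step add the candidate predictor with the
    best in-sample fit (lowest RSS); stop when AIC no longer improves."""
    selected, remaining = [], list(X.columns)
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic  # null model M0: intercept only

    while remaining:
        # Fit all models that add exactly one remaining predictor
        fits = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit() for c in remaining}
        # Best in-sample fit = smallest RSS (equivalently, highest R^2)
        best_var = min(fits, key=lambda c: fits[c].ssr)
        if fits[best_var].aic >= best_aic:  # selection rule: stop if AIC does not improve
            break
        best_aic = fits[best_var].aic
        selected.append(best_var)
        remaining.remove(best_var)
    return selected
```

`forward_selection(X, y)` returns the selected column names in the order they were added.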
Stepwise Thresholds¶
- \(\alpha_\text{add}\) usually \(0.05\) or \(0.10\)
- \(\alpha_\text{remove} > \alpha_\text{add}\)
Forward Stepwise Regression¶
- Write down the full possible model with all predictors, functions, interactions, etc.
- Regress \(y\) against each model term individually
- Pick \(\alpha_\text{add}\) and \(\alpha_\text{remove}\) such that \(\alpha_\text{add}<\alpha_\text{remove}\)
- Pick the best regressor
    - Calculate \(t=\beta_j/\text{SE}(\beta_j)\)
    - Calculate the \(p\)-value for each term
    - Pick the smallest \(p\)-value \(p^*_j\)
    - \(p^*_j<\alpha_\text{add} \implies\) add term \(j\)
- Check whether a previously-added term should be removed
    - Among all previously-added regressors, find the one with the lowest \(t\) score and hence highest \(p\)-value \(p^*_{j'}\)
    - If \(p^*_{j'} > \alpha_\text{remove}\), remove \(j'\)
- Repeat the add and remove steps until no further improvement (a sketch follows below)
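A rough sketch of these steps with statsmodels, using \(\alpha_\text{add}=0.05\) and \(\alpha_\text{remove}=0.10\); the `stepwise` helper and the data layout (a DataFrame `X` of candidate terms and a target `y`) are assumptions for illustration.

```python
import statsmodels.api as sm

def stepwise(X, y, alpha_add=0.05, alpha_remove=0.10):
    """Forward stepwise regression with p-value based add/remove rules."""
    selected = []
    while True:
        changed = False
        # Add step: among terms not yet in the model, find the smallest p-value
        candidates = [c for c in X.columns if c not in selected]
        if candidates:
            pvals = {}
            for c in candidates:
                fit = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
                pvals[c] = fit.pvalues[c]  # p-value of t = beta_j / SE(beta_j)
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha_add:
                selected.append(best)
                changed = True
        # Remove step: among terms already in the model, find the largest p-value
        if selected:
            fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
            worst = fit.pvalues.drop("const").idxmax()
            if fit.pvalues[worst] > alpha_remove:
                selected.remove(worst)
                changed = True
        if not changed:  # no addition or removal left to make
            return selected
```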
Omitted Variable Bias¶
If a correct regressor \(x_j\) is missing from the model, the estimates of the remaining parameters will be biased whenever \(x_j\) is related to the other regressors
The bias will be proportional to the correlation between the missing \(x_j\) and the regressors used in the model
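In the simplest two-regressor case this is concrete: if the true model is \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\) but we regress \(y\) on \(x_1\) alone, then \(E[\hat\beta_1] = \beta_1 + \beta_2 \dfrac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_1)}\), so the bias vanishes only if \(x_2\) is uncorrelated with \(x_1\) or \(\beta_2 = 0\).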
Uncounted DOF¶
Every regressor term you test consumes a degree of freedom, whether or not you end up including it in the model
This causes data snooping
Effective DOF used \(= p \ +\) number of terms tried, so the residual DOF is \(n - p \ -\) number of terms tried, not \(n - p\)
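For example, with \(n = 100\) observations, \(p = 5\) retained terms, and \(20\) additional terms that were tried but rejected, the effective residual DOF is \(100 - 5 - 20 = 75\) rather than the nominal \(95\).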