Notebooks

Variable or Feature Selection for Regression and Classification

Last update: 15 Dec 2024 15:01
First version: 14 August 2019

"Variable selection" tends to be the statisticians' name; people from data mining talk about "feature selection". The idea is the same: given some big pool of variables (or features) which could be used as inputs to predict some target feature (or variable), which ones actually go into the predictor? Saying "all" can lead to computational issues, and also statistical ones. (If you throw in lots of variables which don't matter, at finite sample sizes you'll have degraded inferences about how, and how much, all the variables matter. Even if you don't care about inferring parameters [or response functions, etc.], the predictions will suffer, because you'll be over-fitting.) Of course, sometimes every available feature really does matter. (This is, annoyingly, especially likely to be the case when the features are pre-selected on the basis of strong subject-matter knowledge.)
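To make the over-fitting point concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: a linear-Gaussian model, 3 variables that matter, 30 that don't, 50 training points, and the hypothetical function name `out_of_sample_mse`. It fits ordinary least squares once with every variable and once with only the relevant ones, then compares prediction error on fresh data.

```python
import numpy as np

def out_of_sample_mse(n_train=50, n_test=2000, n_relevant=3, n_noise=30, seed=0):
    """Compare test-set MSE of OLS on all variables vs. only the relevant ones.

    Illustrative assumptions: independent standard-Gaussian inputs, unit
    coefficients on the relevant variables, zero on the rest, unit noise
    variance, and no intercept (everything has mean zero by construction).
    """
    rng = np.random.default_rng(seed)
    p = n_relevant + n_noise
    beta = np.zeros(p)
    beta[:n_relevant] = 1.0  # only the first n_relevant variables matter

    X_train = rng.normal(size=(n_train, p))
    y_train = X_train @ beta + rng.normal(size=n_train)
    X_test = rng.normal(size=(n_test, p))
    y_test = X_test @ beta + rng.normal(size=n_test)

    def fit_and_score(cols):
        # Least-squares fit on the chosen columns, scored on held-out data.
        coef, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
        resid = y_test - X_test[:, cols] @ coef
        return float(np.mean(resid ** 2))

    all_cols = np.arange(p)
    relevant_cols = np.arange(n_relevant)
    return fit_and_score(all_cols), fit_and_score(relevant_cols)

mse_all, mse_relevant = out_of_sample_mse()
print(f"all variables: {mse_all:.2f}   relevant only: {mse_relevant:.2f}")
```

In this setup the model using all 33 variables should predict noticeably worse than the one using only the 3 that matter, even though the larger model contains the truth. Of course, in real problems nobody hands you the list of relevant variables, which is the whole difficulty.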

All of this is a special case of model selection, so I incorporate all the comments, and the recommended readings, in that notebook by reference.

--- An important model selection problem which is also sometimes called "variable selection" is deciding which nodes in a graphical model are immediately connected. (This is very important, for instance, in causal model discovery.) The obvious way to go about doing this is to run variable selection (in this sense) for each node variable. This may work, but may also not be what we want, since the goal is direct (and sometimes directed!) connections. I defer those references to the notebooks just linked to.
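The node-by-node approach can be sketched as follows. This is only an illustration under stated assumptions: the selection rule here (regress each node on all the others and keep inputs whose t-statistic exceeds a threshold) is a crude stand-in chosen to keep the example dependency-free, not a recommendation; a penalized method such as the lasso is the more usual choice. The function name `nodewise_neighbors`, the threshold of 4, and the three-variable linear-Gaussian chain are all hypothetical.

```python
import numpy as np

def nodewise_neighbors(X, t_threshold=4.0):
    """For each column j, regress it on all other columns by OLS and keep
    the inputs whose |t-statistic| exceeds t_threshold.

    A crude, illustrative selection rule; assumes n is much larger than p.
    """
    n, p = X.shape
    X = X - X.mean(axis=0)  # center, so no intercept is needed
    neighbors = {}
    for j in range(p):
        others = [k for k in range(p) if k != j]
        A, y = X[:, others], X[:, j]
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        dof = n - len(others) - 1  # one degree of freedom lost to centering
        sigma2 = resid @ resid / dof
        std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
        t_stats = coef / std_err
        neighbors[j] = {k for k, t in zip(others, t_stats) if abs(t) > t_threshold}
    return neighbors

# Simulated chain X0 -> X1 -> X2, so X0 and X2 are not directly connected.
rng = np.random.default_rng(1)
n = 2000
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
nbrs = nodewise_neighbors(np.column_stack([x0, x1, x2]))
print(nbrs)
```

On this chain the procedure should link X0 to X1 and X1 to X2 but not X0 to X2. Note that the output is undirected, and that with a collider (X0 -> X2 <- X1) the same procedure would also link the two parents, since they become dependent once we condition on their common child. That is one way per-node selection can fail to deliver the direct connections we actually want.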

