Description
Describe the workflow you want to enable
I'd like to make it easier to do best subsets selection with categorical features. For simplicity, let's start by assuming an additive model, so that each feature has a set of columns in the design matrix associated with it. When all features are continuous, each feature is associated with a single column; otherwise there is a feature grouping, which can be described as a sequence of length X.shape[1] assigning each column to a particular feature. More generally, this sequence assigning columns to features could also cover interactions of continuous and categorical variables.
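To make the grouping concrete, here is a minimal sketch of such a sequence for a design matrix with one continuous feature, one categorical feature one-hot encoded into three columns, and another continuous feature. The feature labels and the helper name `columns_by_feature` are hypothetical and only for illustration; nothing here is an existing sklearn API.

```python
from collections import defaultdict


def columns_by_feature(groups):
    """Map each feature label to the tuple of design-matrix columns assigned to it.

    `groups` is a sequence of length X.shape[1]; groups[j] is the label of the
    feature that column j belongs to. (Hypothetical helper, not an sklearn API.)
    """
    mapping = defaultdict(list)
    for col, feature in enumerate(groups):
        mapping[feature].append(col)
    return {feature: tuple(cols) for feature, cols in mapping.items()}


# Assumed layout: "age" and "income" are continuous (one column each),
# "color" is categorical and one-hot encoded into three columns.
groups = ["age", "color", "color", "color", "income"]
feature_columns = columns_by_feature(groups)
```

With all continuous features the sequence would simply be one distinct label per column, so every feature maps to a single column.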
Describe your proposed solution
It is (at least in some corners) common practice to include either all of the columns associated with a categorical feature or none of them. This could be encoded in the candidates list. If interactions were permitted, some conventions include an interaction only if both of its main effects are also included. While the logic of which candidates to generate may be user-specific, it seems that if we could supply a custom iterator for candidates, most of the code would not need to be modified. Instead of custom_names, each candidate could carry its own identifier, so one could specify whether the iterator produces simply indices or (indices, identifier) pairs. This would remove the need for the min_features/max_features arguments, as these limits would be encoded in the iterator itself. Perhaps helper functions that produce at least a few common candidate iterators could be included: specifically, one that produces the default "all continuous" iterator, and one that could easily handle an additive model with possibly some categorical variables.
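As a rough sketch of what the two helpers might look like: the function names `all_continuous_candidates` and `grouped_candidates` are hypothetical, and how the selector would consume the iterator is an assumption. The first yields plain index tuples; the second yields (indices, identifier) pairs in which the columns of a feature enter together or not at all.

```python
from itertools import combinations


def all_continuous_candidates(n_columns, min_features=1, max_features=None):
    """Yield every subset of columns as a tuple of indices.

    Hypothetical default iterator: each column is its own feature.
    """
    if max_features is None:
        max_features = n_columns
    for k in range(min_features, max_features + 1):
        yield from combinations(range(n_columns), k)


def grouped_candidates(groups, min_features=1, max_features=None):
    """Yield (indices, identifier) pairs for an additive model.

    `groups` is a sequence of length X.shape[1] assigning each column to a
    feature. All columns of a chosen feature are included together, and the
    identifier is the tuple of chosen feature labels.
    """
    by_feature = {}
    for col, feature in enumerate(groups):
        by_feature.setdefault(feature, []).append(col)
    features = list(by_feature)
    if max_features is None:
        max_features = len(features)
    # Subset sizes count features, not columns.
    for k in range(min_features, max_features + 1):
        for chosen in combinations(features, k):
            indices = tuple(col for f in chosen for col in by_feature[f])
            yield indices, chosen


# Assumed layout: "color" is one-hot encoded into two columns.
candidates = list(grouped_candidates(["age", "color", "color", "income"]))
```

Here the size limits live in the helper that builds the iterator rather than in the selector, so a user with different rules (for example, requiring both main effects before an interaction) could supply their own generator with the same output shape.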
Describe alternatives you've considered, if relevant
I've considered simply wrapping R functions like regsubsets, which easily handle categorical variables. I would prefer an sklearn-aware version that could do this as well.