SequentialFeatureSelector hyperparameter propagation doesn't work properly with GridSearch #813

Closed
@ptoews

Description

Describe the bug

When using sklearn's GridSearchCV with SequentialFeatureSelector, the configured hyperparameter values are not propagated to the classifier that is actually used for fitting and predicting. The MWE below is based on example 8 in the docs; the only major change is the custom classifier.

In the output listed in the docs you can see that the score doesn't change with the k parameter of the KNN, which is very strange.
While searching for similar issues I found that this has already been mentioned in multiple other issues, e.g. #456 and #511. Below you can see the unexpected behavior in the suggested approach.

Steps/Code to Reproduce

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import mlxtend
import sklearn.base
import numpy as np

class DebugClassifier(sklearn.base.BaseEstimator):
    
    def __init__(self, max_depth=10):
        self.max_depth = max_depth
        
    def fit(self, X, y, groups=None):
        print("Fitting with max_depth =", self.max_depth)
        return self  # scikit-learn convention: fit returns the estimator
        
    def predict(self, X, **kwargs):
        print("Predicting with max_depth =", self.max_depth)
        return np.zeros(len(X))
    
    def set_params(self, **kwargs):
        print("Setting params:", kwargs)
        super().set_params(**kwargs)
        print("max_depth after setparams:", self.max_depth)
        return self  # scikit-learn convention: set_params returns the estimator
        

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
         X, y, test_size=0.2, random_state=123)

clf = DebugClassifier(max_depth=10)

sfs1 = SFS(estimator=clf, 
           k_features=3,
           forward=True, 
           floating=False, 
           scoring='accuracy',
           cv=5)

pipe = Pipeline([('sfs', sfs1), 
                 ('clf', clf)])

param_grid = [
  {#'sfs__k_features': [1, 4],
   'sfs__estimator__max_depth': [1, 5]}
  ]

gs = GridSearchCV(estimator=pipe, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  n_jobs=1, 
                  cv=5,
                  #iid=True,
                  refit=False)

# run grid search
gs = gs.fit(X_train, y_train)

Expected Results

Setting params: {'max_depth': 1}
max_depth after setparams: 1
Fitting with max_depth = 1
Predicting with max_depth = 1
Fitting with max_depth = 1
Predicting with max_depth = 1
...

Actual Results

Setting params: {'max_depth': 1}
max_depth after setparams: 1
Fitting with max_depth = 10
Predicting with max_depth = 10
Fitting with max_depth = 10
Predicting with max_depth = 10
...

As you can see, the value 1 for the hyperparameter max_depth is set correctly on some classifier instance. During fitting and predicting, however, a different instance appears to be used, one that still carries the initial value max_depth=10.
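
One pattern that would produce exactly this kind of output is a wrapper that copies its estimator once, at construction time, and later fits the copy. The sketch below uses only the standard library to illustrate that pattern. ToyEstimator, EagerCloneWrapper and its methods are hypothetical names, copy.deepcopy stands in for sklearn.base.clone, and whether SequentialFeatureSelector behaves this way internally is an assumption, not something this report confirms.

```python
import copy


class ToyEstimator:
    """Stand-in for a classifier with a single hyperparameter."""

    def __init__(self, max_depth=10):
        self.max_depth = max_depth


class EagerCloneWrapper:
    """Hypothetical wrapper that copies its estimator once, in __init__."""

    def __init__(self, estimator):
        self.estimator = estimator
        # Assumption: the copy is taken here and never refreshed.
        # copy.deepcopy stands in for sklearn.base.clone.
        self.est_ = copy.deepcopy(estimator)

    def set_nested_param(self, name, value):
        # Mimics a nested `estimator__<name>` update: only the public
        # attribute is touched, the earlier copy is left alone.
        setattr(self.estimator, name, value)

    def depth_used_for_fitting(self):
        # Fitting would operate on the copy made in __init__.
        return self.est_.max_depth


wrapper = EagerCloneWrapper(ToyEstimator(max_depth=10))
wrapper.set_nested_param("max_depth", 1)

print("configured:", wrapper.estimator.max_depth)
print("used for fitting:", wrapper.depth_used_for_fitting())
```

In this sketch the configured value and the value used for fitting diverge after the parameter update, which matches the symptom in the actual results. Taking the copy at fit time instead of in __init__ would keep the two in sync.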

Versions

MLxtend 0.18.0
Linux-5.8.0-48-generic-x86_64-with-glibc2.29
Python 3.8.5 (default, Jan 27 2021, 15:41:15) 
[GCC 9.3.0]
Scikit-learn 0.24.1
NumPy 1.20.1
SciPy 1.6.1
