How do you add new features in a scikit-learn pipeline with a ColumnTransformer?

James Benett
Updated 6 days ago

I came across this pipeline setup where a feature-engineering step runs before a ColumnTransformer, but the new feature doesn’t seem to flow correctly through the pipeline:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame in place
        X['new_feature'] = X['col1'] * X['col2']
        return X

pipeline = Pipeline([
    ('feature_add', FeatureAdder()),
    ('preprocess', ColumnTransformer([
        ('num', StandardScaler(), ['col1', 'col2']),
        ('cat', OneHotEncoder(), ['col3'])
    ]))
])

The issue is:

  • The newly created new_feature is not listed in any of the ColumnTransformer’s column selections

  • ColumnTransformer defaults to remainder='drop', so the unlisted column is silently discarded during transformation
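To isolate the cause (this small demo uses made-up data, not the original pipeline): ColumnTransformer only keeps columns you name unless you change its `remainder` behavior.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'col1': [1.0, 2.0],
                   'col2': [3.0, 4.0],
                   'new_feature': [3.0, 8.0]})

# Default remainder='drop': any column not listed is discarded.
ct_drop = ColumnTransformer([('num', StandardScaler(), ['col1', 'col2'])])
dropped = ct_drop.fit_transform(df)

# remainder='passthrough': unlisted columns are forwarded untouched.
ct_pass = ColumnTransformer([('num', StandardScaler(), ['col1', 'col2'])],
                            remainder='passthrough')
kept = ct_pass.fit_transform(df)

print(dropped.shape, kept.shape)  # (2, 2) (2, 3)
```

Note that passthrough columns are appended unscaled, so this is only a diagnostic, not necessarily the fix you want.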

In a setup like this:

  • Should the ColumnTransformer be dynamically updated to include new features?

  • Or is it better to handle feature engineering outside the pipeline altogether?

  • How do you ensure feature consistency without breaking pipeline modularity?
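On the first question, one sketch of the "list it explicitly" approach (column names reused from the post; the sample data and the `X.copy()` are my additions): because FeatureAdder runs first, the ColumnTransformer sees its output, so the engineered column can simply be named in the numeric selection.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # keep the caller's DataFrame intact
        X['new_feature'] = X['col1'] * X['col2']
        return X

# The engineered column is listed alongside the raw numerics, so the
# ColumnTransformer scales it instead of dropping it.
pipeline = Pipeline([
    ('feature_add', FeatureAdder()),
    ('preprocess', ColumnTransformer([
        ('num', StandardScaler(), ['col1', 'col2', 'new_feature']),
        ('cat', OneHotEncoder(), ['col3'])
    ]))
])

df = pd.DataFrame({'col1': [1.0, 2.0, 3.0],
                   'col2': [4.0, 5.0, 6.0],
                   'col3': ['a', 'b', 'a']})
out = pipeline.fit_transform(df)
print(out.shape)  # (3, 5): three scaled numerics + two one-hot columns
```

This keeps everything inside the pipeline (so cross-validation and serialization still work), at the cost of the ColumnTransformer knowing the name of a column produced upstream.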

 