I came across this pipeline setup where feature engineering is being added before a ColumnTransformer, but the new features don’t seem to flow correctly through the pipeline:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame in place
        X['new_feature'] = X['col1'] * X['col2']
        return X

pipeline = Pipeline([
    ('feature_add', FeatureAdder()),
    ('preprocess', ColumnTransformer([
        ('num', StandardScaler(), ['col1', 'col2']),
        ('cat', OneHotEncoder(), ['col3'])
    ]))
])
The issue is:

- The newly created `new_feature` is not listed in any of the ColumnTransformer's column selections.
- Because ColumnTransformer defaults to `remainder='drop'`, the engineered feature is silently dropped during transformation.
In a setup like this:

- Should the ColumnTransformer be dynamically updated to include new features?
- Or is it better to handle feature engineering outside the pipeline altogether?
- How do you ensure feature consistency without breaking pipeline modularity?
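For context, here is a minimal sketch of the simplest workaround I can think of: listing the engineered column explicitly in the ColumnTransformer so it survives the transform. The sample DataFrame and its values are made up for illustration.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X['new_feature'] = X['col1'] * X['col2']
        return X

pipeline = Pipeline([
    ('feature_add', FeatureAdder()),
    ('preprocess', ColumnTransformer([
        # 'new_feature' is listed explicitly, so remainder='drop' no longer discards it
        ('num', StandardScaler(), ['col1', 'col2', 'new_feature']),
        ('cat', OneHotEncoder(), ['col3'])
    ]))
])

# Toy data just to show the shapes
df = pd.DataFrame({'col1': [1.0, 2.0, 3.0],
                   'col2': [4.0, 5.0, 6.0],
                   'col3': ['a', 'b', 'a']})
out = pipeline.fit_transform(df)
print(out.shape)  # 3 scaled numeric columns + 2 one-hot columns -> (3, 5)
```

This works, but it hard-codes the engineered column name in two places, which is exactly the coupling I'd like to avoid.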
