James Benett
joined May 19, 2025
  • How do you add new features in a scikit-learn pipeline with a ColumnTransformer?

    I came across this pipeline setup where feature engineering is being added before a ColumnTransformer, but the new features don’t seem to flow correctly through the pipeline:

    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.base import BaseEstimator, TransformerMixin
    
    class FeatureAdder(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            return self
        
        def transform(self, X):
            X['new_feature'] = X['col1'] * X['col2']
            return X
    
    pipeline = Pipeline([
        ('feature_add', FeatureAdder()),
        ('preprocess', ColumnTransformer([
            ('num', StandardScaler(), ['col1', 'col2']),
            ('cat', OneHotEncoder(), ['col3'])
        ]))
    ])
    

    The issue is:

    • The newly created new_feature is not listed in any of the ColumnTransformer's column selections

    • Since ColumnTransformer defaults to remainder='drop', the new column is silently dropped during transformation

    In a setup like this:

    • Should the ColumnTransformer be dynamically updated to include new features?

    • Or is it better to handle feature engineering outside the pipeline altogether?

    • How do you ensure feature consistency without breaking pipeline modularity?
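    One minimal sketch of the first option: simply list new_feature among the ColumnTransformer's numeric columns. Because the 'feature_add' step runs first, the column already exists by the time the ColumnTransformer selects it (the copy() call and the example data below are illustrative additions, not part of the original setup):

    ```python
    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.base import BaseEstimator, TransformerMixin

    class FeatureAdder(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            return self

        def transform(self, X):
            X = X.copy()  # avoid mutating the caller's DataFrame
            X['new_feature'] = X['col1'] * X['col2']
            return X

    pipeline = Pipeline([
        ('feature_add', FeatureAdder()),
        ('preprocess', ColumnTransformer([
            # 'new_feature' is listed explicitly, so it is scaled rather than dropped
            ('num', StandardScaler(), ['col1', 'col2', 'new_feature']),
            ('cat', OneHotEncoder(), ['col3'])
        ]))
    ])

    # hypothetical example data matching the column names above
    df = pd.DataFrame({'col1': [1.0, 2.0, 3.0],
                       'col2': [4.0, 5.0, 6.0],
                       'col3': ['a', 'b', 'a']})
    out = pipeline.fit_transform(df)
    # 3 scaled numeric columns + 2 one-hot columns -> 5 output columns
    ```

    The trade-off is that the column list is now coupled to what FeatureAdder produces, so renaming the engineered feature means updating both steps. Alternatively, remainder='passthrough' keeps unlisted columns, but then new_feature bypasses scaling.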

  • Seeking guidance on a tech stack for data science

    Hi everyone,

    I’m currently an undergraduate student in Data Science, actively working toward becoming a data scientist. So far, I’ve built a foundation with basic machine learning models using libraries like Pandas, NumPy, Matplotlib, Scikit-learn, and some PyTorch. I’ve also explored LLMs by working with pre-trained models through Hugging Face and LangChain. Lately, I’ve been diving into more advanced ML and deep learning concepts, setting up CI/CD pipelines, and learning backend development for ML using FastAPI and Flask.

    Despite experimenting with this wide range of tools and technologies, I still find myself unclear about what companies actually expect from data scientists—both at junior and senior levels. What tech stack should I focus on? Which trends and skills are truly valued in the industry?

    As a student, it’s hard to get a clear answer on this. Could someone with experience in the field help clarify what companies are really looking for in data scientists today?

    Thanks in advance!

  • What is a brand new functionality you want to see added to Power BI?

    What is a completely new functionality you want to see added to Power BI that would unlock new possibilities? Things like Field Parameters, Calculation Groups, Fx attributes, and DAX query view.

    I don’t mean new visuals, but a brand new way of doing things, like the aforementioned features.

    Let’s say:

    Ability to use Calculation Groups/Items anywhere you place a Measure. This would greatly decrease the number of measures you need to create in your model. LY, YTD, YoY, and other measure variants would be accessible for display via these dialog options. This would work in Cards, Reference labels, Data labels, Tooltips, and as a single new column in a Table/Matrix – just anywhere you put a measure.

    What do you think? I actually believe this could be quite feasible, since the implicit column aggregate measures are basically calculation items.
