Data science | Thread Categories | Pangaea X Community

How can I automatically update post titles when a number in the content changes?

I’m working on a system where numerical values inside content are updated dynamically. The issue is that the title often contains the same number, and when the value changes in the content, the title becomes inaccurate.

I’m looking for a reliable way to automatically detect these changes and update the title accordingly without manually editing each post. Has anyone solved a similar problem or found a scalable approach for this?
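
One possible approach, sketched under assumptions: compare the numbers in the old and new versions of the content, and rewrite a number in the title only when the change is unambiguous. The function name `sync_title`, the number pattern, and the "same count, compared by position" rule are all illustrative choices, not part of any particular CMS.

```python
import re

# Matches integers and simple decimals/thousands groups such as 10, 3.5 or 1,000.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def sync_title(title: str, old_content: str, new_content: str) -> str:
    """Rewrite numbers in the title that changed between two content versions."""
    old_nums = NUMBER.findall(old_content)
    new_nums = NUMBER.findall(new_content)

    # Assumption: only handle edits that keep the same count of numbers,
    # so old and new values can be paired by position.
    if len(old_nums) != len(new_nums):
        return title  # ambiguous edit: leave the title for manual review

    # Collect every new value each old value was paired with.
    targets = {}
    for old, new in zip(old_nums, new_nums):
        targets.setdefault(old, set()).add(new)

    # Keep only old values that changed, and changed to exactly one new value.
    changes = {
        old: next(iter(news))
        for old, news in targets.items()
        if len(news) == 1 and old not in news
    }

    return NUMBER.sub(lambda m: changes.get(m.group(0), m.group(0)), title)


updated = sync_title(
    "Top 10 tools for 2025",
    "Here are 10 tools we tested in 2025.",
    "Here are 12 tools we tested in 2025.",
)
print(updated)  # Top 12 tools for 2025
```

Running this on every content save, for example from a post-update hook, would avoid manual edits. Ambiguous cases are skipped on purpose so they can be flagged for review.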

Oscar

May 29, 2026

How do you add new features in a scikit-learn pipeline with a ColumnTransformer?

I came across this pipeline setup where feature engineering is being added before a ColumnTransformer, but the new features don’t seem to flow correctly through the pipeline:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureAdder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X['new_feature'] = X['col1'] * X['col2']
        return X

pipeline = Pipeline([
    ('feature_add', FeatureAdder()),
    ('preprocess', ColumnTransformer([
        ('num', StandardScaler(), ['col1', 'col2']),
        ('cat', OneHotEncoder(), ['col3'])
    ]))
])

The issue is:

The newly created new_feature is not listed in any ColumnTransformer step.
Because remainder defaults to 'drop', it is discarded during transformation.

In a setup like this:

Should the ColumnTransformer be dynamically updated to include new features?
Or is it better to handle feature engineering outside the pipeline altogether?
How do you ensure feature consistency without breaking pipeline modularity?
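
As a minimal sketch of one way to keep everything inside the pipeline: list the engineered column explicitly in the ColumnTransformer, and have the custom transformer work on a copy. The toy DataFrame and the `handle_unknown="ignore"` setting are assumptions added for illustration. The mixin order follows scikit-learn's recommendation to place mixins before `BaseEstimator`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class FeatureAdder(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        X["new_feature"] = X["col1"] * X["col2"]
        return X


pipeline = Pipeline([
    ("feature_add", FeatureAdder()),
    ("preprocess", ColumnTransformer([
        # The engineered column is named here, so it is scaled instead of dropped.
        ("num", StandardScaler(), ["col1", "col2", "new_feature"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["col3"]),
    ])),
])

# Toy data, purely illustrative.
df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0],
    "col2": [2.0, 0.5, 1.0, 3.0],
    "col3": ["a", "b", "a", "b"],
})

out = pipeline.fit_transform(df)
print(out.shape)  # 3 scaled numeric columns + 2 one-hot columns
```

An alternative is `remainder="passthrough"`, which keeps unlisted columns but leaves them unscaled. Naming the column explicitly keeps the contract between the two steps visible.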

James Benett

April 10, 2026

How to handle imbalanced datasets effectively in classification problems?

I’m working on a classification problem where one class heavily outweighs the others (around 90:10 ratio). My model is achieving high accuracy, but it’s clearly biased toward the majority class.

Here’s a simplified version:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Accuracy looks good, but recall and precision for the minority class are poor.

What I want to understand:

What are the best techniques to handle imbalance (SMOTE, class weights, etc.)?
When should I prefer resampling vs adjusting model parameters?
Which evaluation metrics should I focus on in such cases?

Would appreciate practical advice based on real-world experience.
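
A minimal sketch of the class-weight route, assuming synthetic data from `make_classification` in place of the real `X` and `y`. The 0.3 decision threshold is a hypothetical value that would normally be tuned on a validation set. It also shows a stratified split and a precision-recall based metric, which tends to be more informative than accuracy at a 90:10 ratio.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 90% class 0, 10% class 1.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# stratify=y keeps the class ratio the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
model = RandomForestClassifier(class_weight="balanced", random_state=0)
model.fit(X_train, y_train)

# Score with probabilities, then apply an explicit threshold.
proba = model.predict_proba(X_test)[:, 1]
threshold = 0.3  # hypothetical value; tune on validation data
y_pred = (proba >= threshold).astype(int)

print(classification_report(y_test, y_pred))
pr_auc = average_precision_score(y_test, proba)
print("Average precision (PR AUC):", round(pr_auc, 3))
```

Resampling methods such as SMOTE (from the separate imbalanced-learn package) are another option. If used, they should be fit on the training fold only, so that synthetic samples never leak into evaluation data.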

Naomi Teng

March 30, 2026

How do you handle model performance degradation after deployment?

Many models perform well during training and validation but start degrading in production due to data drift, concept drift, or changing user behavior. What monitoring strategies, retraining pipelines, or evaluation practices do you use to maintain model performance in production environments?
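
One monitoring check that is often used for data drift is the population stability index (PSI), computed per feature between a training-time reference sample and a recent production sample. The sketch below is an assumption-laden illustration: the function name, the quantile binning, the bin count, and the synthetic data are all hypothetical choices.

```python
import numpy as np


def population_stability_index(reference, current, bins=10):
    """PSI between two 1-D samples, using bins from the reference quantiles."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Inner bin edges from reference quantiles; the outer bins are open-ended.
    inner = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_idx = np.searchsorted(inner, reference, side="right")
    cur_idx = np.searchsorted(inner, current, side="right")

    ref_frac = np.bincount(ref_idx, minlength=bins) / len(reference)
    cur_frac = np.bincount(cur_idx, minlength=bins) / len(current)

    # Clip to avoid log(0) and division by zero for empty bins.
    eps = 1e-6
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)

    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5000)   # reference sample
prod_same = rng.normal(0.0, 1.0, size=5000)       # no drift
prod_shifted = rng.normal(1.0, 1.0, size=5000)    # mean shifted by one std

psi_same = population_stability_index(train_feature, prod_same)
psi_shifted = population_stability_index(train_feature, prod_shifted)
print(psi_same, psi_shifted)
```

A frequently quoted rule of thumb treats PSI above roughly 0.25 as a significant shift. That is a starting point for alerting, not a standard. PSI only detects changes in input distributions; concept drift also needs labelled production data to track metrics over time.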

Nicil O Paul

March 11, 2026

Why does model performance drop when using time-based train-test splits?

I’m working on a data science project with time-ordered data, and I’m seeing a significant drop in model performance once I move from training to validation. I’m sharing a simplified version of the problem and code below.

The dataset represents events over time, and the target is binary. I initially used a random train-test split, but later switched to a time-based split to better reflect real-world usage. After this change, performance dropped sharply, and I’m trying to understand whether this is expected or if I’m doing something wrong.

Here’s a simplified version of the code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# sample data
df = pd.read_csv("data.csv")
df = df.sort_values("event_time")

X = df.drop(columns=["target"])
y = df["target"]

# time-based split
split_index = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

preds = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, preds))

With a random split, the AUC was around 0.82.
With the time-based split, it drops to around 0.61.

I’m trying to understand:

Is this performance gap a common sign of data leakage in the original setup?
Are tree-based models like Random Forests particularly sensitive to temporal shifts?
What are good practices to diagnose whether this is concept drift, feature leakage, or simply a harder prediction problem?
Would you approach validation differently for time-dependent data like this?

Looking for general guidance, validation strategies, or patterns others have seen in similar scenarios.
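
For validation on time-ordered data, one option is an expanding-window scheme such as scikit-learn's `TimeSeriesSplit`, which yields several train/test folds where the test rows always come after the training rows. The sketch below uses synthetic, already time-sorted data as an assumption in place of `data.csv`. A per-fold AUC shows whether performance is uniformly lower or decays over time.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for time-sorted data: rows are assumed to be in event order.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=n) > 0).astype(int)

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))

fold_aucs = []
for train_idx, test_idx in splits:
    # Each fold trains on the past and evaluates on the block that follows.
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict_proba(X[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[test_idx], preds))

for i, auc in enumerate(fold_aucs, start=1):
    print(f"fold {i}: AUC = {auc:.3f}")
```

If the per-fold scores are stable but well below the random-split score, leakage in the random split is a plausible explanation. A steady decline across folds points more toward drift. Note that the posted snippet leaves event_time inside X. If it is numeric, the trees cannot extrapolate beyond the range seen in training, so excluding it from the features is a quick experiment worth running.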