I’m working on a data science project with time-ordered data, and I’m seeing a significant drop in model performance once I move from training to validation. I’m sharing a simplified version of the problem and code below.

The dataset represents events over time, and the target is binary. I initially used a random train-test split, but later switched to a time-based split to better reflect real-world usage. After this change, performance dropped sharply, and I’m trying to understand whether this is expected or if I’m doing something wrong.

Here’s a simplified version of the code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# load the data and order it chronologically so a positional split is a time split
df = pd.read_csv("data.csv")
df = df.sort_values("event_time").reset_index(drop=True)

# drop the raw timestamp from the features along with the target;
# RandomForestClassifier can't consume a datetime/string column directly
X = df.drop(columns=["target", "event_time"])
y = df["target"]

# time-based split: train on the earliest 80%, test on the most recent 20%
split_index = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

preds = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, preds))

With a random split, the AUC was around 0.82.
With the time-based split, it drops to around 0.61.
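
For reference, the original random split was roughly the following. Because it shuffles, training rows can come from later in time than the test rows:

from sklearn.model_selection import train_test_split

# shuffled split (approximate reconstruction of my original setup):
# training rows can postdate test rows, so the model effectively sees the future
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X, y, test_size=0.2, random_state=42
)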

I’m trying to understand:

  • Is this performance gap a common sign of data leakage in the original setup?

  • Are tree-based models like Random Forests particularly sensitive to temporal shifts?

  • What are good practices to diagnose whether this is concept drift, feature leakage, or simply a harder prediction problem? (I sketched the drift check I had in mind right after this list.)

  • Would you approach validation differently for time-dependent data like this? (The rolling validation I’m considering is sketched at the end of the post.)
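
For the diagnosis question, this is the adversarial-validation check I had in mind (a sketch only; I haven’t run it on the real data, and it assumes the remaining feature columns are numeric). Label each row by which side of the time split it falls on and see whether a classifier can tell the two eras apart: an AUC near 0.5 suggests the feature distribution is stable over time, while an AUC well above 0.5 points to covariate shift rather than leakage alone.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# continuing from the code above: 1 marks rows in the test era
is_test = np.zeros(len(X), dtype=int)
is_test[split_index:] = 1

# if a classifier can separate train-era rows from test-era rows,
# the feature distribution itself is drifting over time
adv = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(adv, X, is_test, cv=5, scoring="roc_auc")
print("adversarial AUC:", scores.mean())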

Looking for general guidance, validation strategies, or patterns others have seen in similar scenarios.
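
For concreteness, the rolling (expanding-window) validation I’m considering instead of the single 80/20 cut is sketched below, using sklearn’s TimeSeriesSplit (the fold count of 5 is arbitrary):

from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# each fold trains on an initial contiguous block and tests on the
# block that immediately follows it, so no future rows enter training
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    m = RandomForestClassifier(random_state=42)
    m.fit(X.iloc[train_idx], y.iloc[train_idx])
    p = m.predict_proba(X.iloc[test_idx])[:, 1]
    print(f"fold {fold}: AUC = {roc_auc_score(y.iloc[test_idx], p):.3f}")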