I’m working on a data science project with time-ordered data, and I’m seeing a significant drop in model performance once I move from a random split to a time-based split for validation. I’m sharing a simplified version of the problem and code below.
The dataset represents events over time, and the target is binary. I initially used a random train-test split, but later switched to a time-based split to better reflect real-world usage. After this change, performance dropped sharply, and I’m trying to understand whether this is expected or if I’m doing something wrong.
Here’s a simplified version of the code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# load events and order them chronologically
df = pd.read_csv("data.csv")
df = df.sort_values("event_time")

# event_time is only needed for ordering; the raw timestamp column is
# dropped from the features so the model can't key on time directly
X = df.drop(columns=["target", "event_time"])
y = df["target"]

# time-based split: train on the first 80% of events, test on the last 20%
split_index = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

preds = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, preds))
With a random split, the AUC was around 0.82.
With the time-based split, it drops to around 0.61.
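For reference, the random split I used originally was essentially the call below (same features, same model). Because it shuffles rows, training examples can postdate test examples, which is one obvious leakage candidate:

# original random split (~0.82 AUC): shuffling mixes past and future events
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X, y, test_size=0.2, random_state=42
)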
I’m trying to understand:
- Is this performance gap a common sign of data leakage in the original setup?
- Are tree-based models like Random Forests particularly sensitive to temporal shifts?
- What are good practices to diagnose whether this is concept drift, feature leakage, or simply a harder prediction problem? (A sketch of the kind of check I have in mind is below.)
- Would you approach validation differently for time-dependent data like this? (See the walk-forward sketch after this list.)
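To make the third bullet concrete, this is the kind of diagnostic I had in mind: adversarial validation, i.e. training a classifier to distinguish training rows from test rows. A minimal sketch, reusing X_train/X_test from above (an AUC near 0.5 suggests similar feature distributions; an AUC near 1.0 suggests strong drift):

import numpy as np
from sklearn.model_selection import cross_val_predict

# label each row by which side of the time split it falls on
X_adv = pd.concat([X_train, X_test])
y_adv = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]

# if a model can tell the two sides apart, the feature distribution has
# shifted over time (time-like columns are already excluded from X above)
adv_probs = cross_val_predict(
    RandomForestClassifier(random_state=42),
    X_adv, y_adv, cv=5, method="predict_proba",
)[:, 1]
print("adversarial AUC:", roc_auc_score(y_adv, adv_probs))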
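And for the last bullet, the direction I was leaning is walk-forward validation with sklearn's TimeSeriesSplit, so every fold trains strictly on the past and evaluates on the block that follows it. A sketch, assuming df is already sorted by event_time as above:

from sklearn.model_selection import TimeSeriesSplit

# expanding-window evaluation: each fold trains on all rows before its
# test block, mimicking how the model would be used in production
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    m = RandomForestClassifier(random_state=42)
    m.fit(X.iloc[train_idx], y.iloc[train_idx])
    fold_preds = m.predict_proba(X.iloc[test_idx])[:, 1]
    print(f"fold {fold}: AUC = {roc_auc_score(y.iloc[test_idx], fold_preds):.3f}")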
Looking for general guidance, validation strategies, or patterns others have seen in similar scenarios.
