Why does model performance drop when using time-based train-test splits?

Erin
Updated on January 30, 2026

I’m working on a data science project with time-ordered data, and I’m seeing a significant drop in model performance after switching from a random train-test split to a time-based one. I’m sharing a simplified version of the problem and code below.

The dataset represents events over time, and the target is binary. I initially used a random train-test split, but later switched to a time-based split to better reflect real-world usage. After this change, performance dropped sharply, and I’m trying to understand whether this is expected or if I’m doing something wrong.

Here’s a simplified version of the code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# sample data, sorted chronologically before splitting
df = pd.read_csv("data.csv")
df = df.sort_values("event_time")

# drop the raw timestamp along with the target so the model
# only sees actual features
X = df.drop(columns=["target", "event_time"])
y = df["target"]

# time-based split: first 80% for training, last 20% for testing
split_index = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

preds = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, preds))

With a random split, the AUC was around 0.82.
With the time-based split, it drops to around 0.61.

I’m trying to understand:

  • Is this performance gap a common sign of data leakage in the original setup?

  • Are tree-based models like Random Forests particularly sensitive to temporal shifts?

  • What are good practices to diagnose whether this is concept drift, feature leakage, or simply a harder prediction problem?

  • Would you approach validation differently for time-dependent data like this?

Looking for general guidance, validation strategies, or patterns others have seen in similar scenarios.

 

 

5 days ago

A drop in performance after moving to a time-based split is very common and, in many cases, expected.

Random splits often make models look better than they will perform in reality because past and future data get mixed. This can hide subtle leakage or allow the model to rely on patterns that don’t hold once time is respected. When you switch to a time-based split, you’re forcing the model to predict on a genuinely new distribution, which is closer to how it will behave in production.

A few observations from practice:

  • The gap often indicates that the random split benefited from information leakage or overly stable correlations.

  • Tree-based models like Random Forests are not inherently worse here, but they do tend to pick up short-term patterns that may shift over time.

  • Time-based validation exposes concept drift and feature instability that random splits simply don’t surface.

To diagnose what’s happening, teams usually:

  • Compare feature distributions between training and test periods

  • Check whether any features implicitly encode future information

  • Use rolling or expanding window validation instead of a single split
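The expanding-window idea can be sketched with scikit-learn's TimeSeriesSplit. The data below is synthetic (a feature whose relationship to the target drifts over time), so the exact AUC values are illustrative only; the point is that each fold trains only on data strictly before its test window:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

# toy time-ordered data: the decision boundary drifts in later samples
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
drift = np.linspace(0, 2, 1000)  # slowly shifting offset over time
y = ((X[:, 0] + drift + rng.normal(scale=0.5, size=1000)) > 1).astype(int)

# expanding-window validation: each fold trains on everything before the test window
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = RandomForestClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    auc = roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1])
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}, AUC={auc:.3f}")
```

If the per-fold AUC decays steadily as the test window moves forward, that is a strong hint of drift rather than a one-off bad split.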

In most cases, the lower AUC from the time-based split is the more honest metric. It doesn’t mean the model got worse. It means the evaluation is now aligned with real-world conditions.

on February 3, 2026

This is pretty common when switching to a time-based split. A random split often hides issues because information from the future can leak into training, even indirectly. Once you respect time, the problem usually becomes harder and performance drops.

In my experience, this often points to either subtle feature leakage (features that wouldn’t exist at prediction time) or genuine concept drift where patterns change over time. Tree-based models aren’t uniquely sensitive, but they can amplify these effects if the data distribution shifts.

A few things that helped me diagnose this: checking feature stability over time, comparing feature importance across periods, and validating with rolling or expanding windows instead of a single split. The lower AUC doesn’t necessarily mean the model is worse—it’s often a more realistic estimate of real-world performance.
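Comparing feature importance across periods can be done by fitting one model per time window and looking at which features each one relies on. A minimal sketch on synthetic data where the informative feature changes halfway through (the column names and drift pattern here are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# toy data: feature "a" drives the target early, feature "b" drives it late
rng = np.random.default_rng(2)
n = 2000
X = pd.DataFrame(rng.normal(size=(n, 2)), columns=["a", "b"])
t = np.arange(n)
signal = np.where(t < n // 2, X["a"], X["b"])
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)

# fit one model per period and compare which features it leans on
importances = {}
for name, idx in [("early", t < n // 2), ("late", t >= n // 2)]:
    m = RandomForestClassifier(random_state=42).fit(X[idx], y[idx])
    importances[name] = dict(zip(X.columns, m.feature_importances_))
    print(name, {k: round(v, 2) for k, v in importances[name].items()})
```

A large reshuffling of importances between periods is a concrete sign of concept drift, whereas stable importances with degrading AUC point more toward distribution shift in the features themselves.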

on January 31, 2026

This is actually a pretty common moment when you switch to time-based splits, and it usually means you’re getting a more realistic signal rather than doing something wrong.

Random splits tend to make models look better than they really are because past and future data get mixed. That can hide leakage or let the model rely on patterns that only exist when time isn’t respected. Once you move to a time split, those shortcuts disappear and performance drops.

A few things I’ve seen in similar setups:

  • The gap often points to subtle leakage in the random split, even from features that don’t obviously look “time aware.”

  • Random Forests aren’t especially bad here, but they do pick up short-lived correlations, so they can struggle when the data distribution shifts over time.

  • To understand what’s happening, it helps to compare feature distributions between train and test, and to run rolling or expanding window validation to see how performance changes over time.
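For the train-vs-test distribution comparison, a per-feature two-sample Kolmogorov-Smirnov test is one simple option. A sketch with SciPy on synthetic data (one stable feature, one drifting; the 0.01 cutoff is an arbitrary flagging threshold):

```python
import numpy as np
from scipy.stats import ks_2samp

# toy example: one stable feature and one whose mean drifts over time
rng = np.random.default_rng(1)
n = 2000
stable = rng.normal(size=n)
drifting = rng.normal(loc=np.linspace(0, 1.5, n))  # mean shifts over time

# compare each feature's distribution before vs. after the split point
split = int(n * 0.8)
for name, col in [("stable", stable), ("drifting", drifting)]:
    stat, p = ks_2samp(col[:split], col[split:])
    flag = "SHIFTED" if p < 0.01 else "ok"
    print(f"{name}: KS={stat:.3f}, p={p:.3g} -> {flag}")
```

Features flagged as shifted are the first candidates to investigate for drift or for implicit time-dependence.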

For time-dependent problems, I usually treat the time-based score as the real baseline and use random splits only for quick iteration or debugging. The lower number is often closer to what you’ll see in production.

