Why does model performance drop when using time-based train-test splits?

Erin
Updated 5 days ago

I’m working on a data science project with time-ordered data, and I’m seeing a significant drop in model performance after switching from a random train-test split to a time-based one. I’m sharing a simplified version of the problem and code below.

The dataset represents events over time, and the target is binary. I initially used a random train-test split, but later switched to a time-based split to better reflect real-world usage. After this change, performance dropped sharply, and I’m trying to understand whether this is expected or if I’m doing something wrong.

Here’s a simplified version of the code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# sample data, sorted chronologically so the split respects time
df = pd.read_csv("data.csv")
df = df.sort_values("event_time")

# event_time is only needed for ordering; drop it so the model
# doesn't train on the raw timestamp
X = df.drop(columns=["target", "event_time"])
y = df["target"]

# time-based split: first 80% of events for training, last 20% for testing
split_index = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# score on the held-out future period
preds = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, preds))

With a random split, the AUC was around 0.82.
With the time-based split, it drops to around 0.61.

I’m trying to understand:

  • Is this performance gap a common sign of data leakage in the original setup?

  • Are tree-based models like Random Forests particularly sensitive to temporal shifts?

  • What are good practices to diagnose whether this is concept drift, feature leakage, or simply a harder prediction problem?

  • Would you approach validation differently for time-dependent data like this?

Looking for general guidance, validation strategies, or patterns others have seen in similar scenarios.

13 hours ago

This is pretty common when switching to a time-based split. A random split often hides issues because information from the future can leak into training, even indirectly. Once you respect time, the problem usually becomes harder and performance drops.
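If you want to see how much of the gap comes purely from mixing time periods, you can compute the random-split score side by side with the time-based one. A minimal sketch, reusing the X and y from your snippet (shuffle=True is what a plain random split does):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# random split: past and future rows get mixed between train and test
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)
model = RandomForestClassifier(random_state=42).fit(Xr_train, yr_train)
print("random-split AUC:",
      roc_auc_score(yr_test, model.predict_proba(Xr_test)[:, 1]))

If that number is much higher than your time-based AUC on the same features, the difference is coming from the split itself, not from any modeling change.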

In my experience, this often points to either subtle feature leakage (features that wouldn’t exist at prediction time) or genuine concept drift where patterns change over time. Tree-based models aren’t uniquely sensitive, but they can amplify these effects if the data distribution shifts.

A few things that helped me diagnose this: checking feature stability over time, comparing feature importance across periods, and validating with rolling or expanding windows instead of a single split. The lower AUC doesn’t necessarily mean the model is worse—it’s often a more realistic estimate of real-world performance.
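For the rolling/expanding-window piece, here’s a minimal sketch using scikit-learn’s TimeSeriesSplit, assuming the same sorted df with target and event_time columns as in your code. Each fold trains only on data that comes before its test block, and printing per-fold feature importances gives a rough view of whether the model leans on different features in different periods:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

df = pd.read_csv("data.csv").sort_values("event_time")
X = df.drop(columns=["target", "event_time"])
y = df["target"]

# expanding window: each fold trains on everything before its test block
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = RandomForestClassifier(random_state=42)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict_proba(X.iloc[test_idx])[:, 1]
    # assumes both classes appear in every test block
    auc = roc_auc_score(y.iloc[test_idx], preds)

    # top features per period; big swings across folds hint at drift
    top = (pd.Series(model.feature_importances_, index=X.columns)
           .nlargest(3).index.tolist())
    print(f"fold {fold}: AUC={auc:.3f}, top features: {top}")

If the AUC decays steadily from fold to fold, that points toward drift; if it’s uniformly low, the time-respecting version of the problem may simply be harder.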

4 days ago

This is actually a pretty common outcome when you switch to time-based splits, and it usually means you’re getting a more realistic signal rather than doing something wrong.

Random splits tend to make models look better than they really are because past and future data get mixed. That can hide leakage or let the model rely on patterns that only exist when time isn’t respected. Once you move to a time split, those shortcuts disappear and performance drops.

A few things I’ve seen in similar setups:

  • The gap often points to subtle leakage in the random split, even from features that don’t obviously look “time aware.”

  • Random Forests aren’t especially bad here, but they do pick up short-lived correlations, so they can struggle when the data distribution shifts over time.

  • To understand what’s happening, it helps to compare feature distributions between train and test, and to run rolling or expanding window validation to see how performance changes over time.
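For the distribution comparison in the last point, a rough sketch using a two-sample Kolmogorov-Smirnov test per feature (assuming numeric features and the X_train/X_test from your time-based split):

import pandas as pd
from scipy.stats import ks_2samp

# KS statistic per feature: 0 means identical distributions, 1 means fully separated
drift = {
    col: ks_2samp(X_train[col], X_test[col]).statistic
    for col in X_train.columns
}
print(pd.Series(drift).sort_values(ascending=False).head(10))

The features at the top of that list are the first candidates to inspect for leakage or drift.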

For time-dependent problems, I usually treat the time-based score as the real baseline and use random splits only for quick iteration or debugging. The lower number is often closer to what you’ll see in production.

