The main issue is usually distribution shift between validation and production data. You can detect it with a few simple evaluation checks.
For example, compare predictions on the validation set against predictions on a production-like sample:
```python
import numpy as np
from sklearn.metrics import accuracy_score

# Validation performance
val_preds = model.predict(X_val)
val_acc = accuracy_score(y_val, val_preds)

# Performance on a production-like sample
prod_preds = model.predict(X_prod_sample)
prod_acc = accuracy_score(y_prod_sample, prod_preds)

print("Validation accuracy:", val_acc)
print("Production accuracy:", prod_acc)
```
If there’s a large gap, you likely have:
- Overfitting
- Data drift
- Label mismatch
- Input distribution differences (see the feature-comparison sketch after this list)
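To diagnose input distribution differences, a quick check is to compare per-feature statistics between the two datasets. A rough sketch, assuming `X_val` and `X_prod_sample` are 2-D NumPy arrays with matching columns (the one-standard-deviation threshold is an assumption, not a rule):

```python
import numpy as np

# Compare basic per-feature statistics between validation data and a production sample.
val_means = X_val.mean(axis=0)
val_stds = X_val.std(axis=0)
prod_means = X_prod_sample.mean(axis=0)

# Flag features whose production mean moved by more than one validation std.
shifted = np.where(np.abs(prod_means - val_means) > val_stds)[0]
print("Features with a notable mean shift:", shifted)
```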
You can also monitor drift directly:
```python
from scipy import stats

# Two-sample Kolmogorov-Smirnov test on token length distributions
stat, p_value = stats.ks_2samp(val_token_lengths, prod_token_lengths)
print("KS statistic:", stat, "p-value:", p_value)
```
A small p-value (a statistically significant difference) suggests distribution shift between the two samples.
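If you have several numeric features, you can run the same test per feature. A rough sketch, again assuming NumPy arrays with matching columns; the 0.05 significance level is a conventional but arbitrary cutoff:

```python
from scipy import stats

# Run a two-sample KS test per feature and report which ones appear to have drifted.
for i in range(X_val.shape[1]):
    stat, p_value = stats.ks_2samp(X_val[:, i], X_prod_sample[:, i])
    if p_value < 0.05:  # conventional significance level, adjust as needed
        print(f"Feature {i} shows possible drift (KS={stat:.3f}, p={p_value:.4f})")
```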
Strong validation metrics do not guarantee robustness. Always evaluate on out-of-distribution samples and monitor drift continuously after deployment.
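To make the continuous monitoring concrete, here is a rough sketch of a periodic check you could schedule; the function name, threshold, and inputs are hypothetical, not from any specific monitoring library:

```python
from scipy import stats

def drift_check(reference_lengths, live_lengths, alpha=0.05):
    """Hypothetical periodic drift check: compare a live sample of token
    lengths against a fixed reference sample from validation."""
    stat, p_value = stats.ks_2samp(reference_lengths, live_lengths)
    drifted = p_value < alpha
    if drifted:
        print(f"Drift alert: KS={stat:.3f}, p={p_value:.4f}")
    return drifted

# Example: run on a schedule (e.g. hourly) with the latest production batch.
# drift_check(val_token_lengths, latest_prod_token_lengths)
```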
