Why does NLP model performance drop from training to validation?

Sameena
Updated 2 days ago

I’m working on an NLP project where the model shows strong training performance and reasonable offline metrics, but once we move to validation and limited production-style testing, performance drops noticeably.

The data pipeline, preprocessing steps, and model architecture are consistent across stages, so this doesn’t feel like a simple setup issue. My suspicion is that the problem lies in data distribution shift, tokenization choices, or subtle leakage in the training setup that doesn’t hold up outside the training window.
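For context, here's roughly the kind of check I've started with: looking for exact-duplicate overlap between splits and a crude out-of-vocabulary rate. This is just a sketch with placeholder names (`train_texts` / `val_texts` are plain lists of strings, and it uses whitespace tokenization rather than my actual tokenizer):

```python
from collections import Counter


def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def overlap_and_oov(train_texts, val_texts):
    # 1) Exact-duplicate overlap: any validation text that also appears
    #    (after light normalization) in training is a leakage candidate.
    train_set = {normalize(t) for t in train_texts}
    dupe_rate = sum(1 for t in val_texts if normalize(t) in train_set) / max(len(val_texts), 1)

    # 2) Crude vocabulary shift: share of validation tokens never seen in
    #    training. A proper check would use the model's own tokenizer
    #    instead of str.split().
    train_vocab = Counter(tok for t in train_texts for tok in normalize(t).split())
    val_tokens = [tok for t in val_texts for tok in normalize(t).split()]
    oov_rate = sum(1 for tok in val_tokens if tok not in train_vocab) / max(len(val_tokens), 1)

    return dupe_rate, oov_rate
```

Neither number looks alarming so far, which is why I'm wondering what else to look at.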

I’m trying to understand how others diagnose this in practice:

  • How do you distinguish overfitting from dataset shift in NLP workloads? (The shift probe I've tried so far is sketched after this list.)
  • What signals do you look at beyond standard metrics to catch generalization issues early?
  • Are there common preprocessing or labeling assumptions that often break when moving closer to production text?
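
On the first point, the only probe I've run so far is an adversarial-validation style check: train a simple classifier to tell training text from validation text, and if it can, the splits differ systematically, which points at shift rather than plain overfitting. Minimal sketch (scikit-learn; `train_texts` / `val_texts` are again placeholder lists of raw strings):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def split_discriminability(train_texts, val_texts) -> float:
    # Label each text by which split it came from.
    texts = list(train_texts) + list(val_texts)
    labels = np.array([0] * len(train_texts) + [1] * len(val_texts))

    # Simple bag-of-ngrams features; the point is the signal, not the model.
    features = TfidfVectorizer(max_features=20000, ngram_range=(1, 2)).fit_transform(texts)
    clf = LogisticRegression(max_iter=1000)

    # Cross-validated ROC AUC of "which split is this text from?".
    # Near 0.5: splits look alike. Well above 0.5: likely distribution shift.
    return cross_val_score(clf, features, labels, cv=5, scoring="roc_auc").mean()
```

Curious whether others use something like this in practice or rely on different signals entirely.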

Looking for practical debugging approaches or patterns others have seen when moving NLP models from training to real usage.
