RE: What’s the most common point of failure you’ve seen once an ML system goes live?

When an ML system enters production, the data pipeline becomes the true source of risk because real-world inputs rarely behave like training data. Upstream services introduce schema drift, fields start arriving in unexpected formats, and previously clean features suddenly contain outliers, nulls, or delayed events. Even slight deviations (an extra categorical value, a missing timestamp, a silent change in data collection logic) can cause the model to degrade in ways that are hard to detect from metrics alone. What looked perfectly reliable in development becomes fragile once it depends on external systems that were never built with ML constraints in mind. The model isn't broken; the world feeding it has changed.
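One cheap defense is validating inputs at the pipeline boundary before they reach the model. Here is a minimal sketch in plain Python; the schema, field names, and allowed categories are all hypothetical placeholders, and in practice you would use a dedicated library (e.g. Great Expectations or pandera) rather than hand-rolling this:

```python
# Hypothetical expected schema for incoming records.
EXPECTED_SCHEMA = {
    "user_id": int,
    "event_ts": str,   # ISO-8601 timestamp expected
    "channel": str,    # categorical feature
}
ALLOWED_CHANNELS = {"web", "mobile", "email"}  # categories seen in training

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif record[field] is None:
            problems.append(f"null value: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"bad type for {field}: {type(record[field]).__name__}")
    # Flag previously unseen categorical values instead of silently passing them on.
    channel = record.get("channel")
    if isinstance(channel, str) and channel not in ALLOWED_CHANNELS:
        problems.append(f"unexpected category: channel={channel!r}")
    return problems

good = {"user_id": 1, "event_ts": "2024-05-01T12:00:00Z", "channel": "web"}
drifted = {"user_id": 2, "event_ts": None, "channel": "sms"}  # null + new category
```

Logging these problems (rather than silently dropping or coercing records) is what makes slow drift visible before it shows up as a metric regression.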
