I’m designing a pipeline where the same dataset needs to flow through different environments: Dev, UAT, and Prod. The challenge is that the production dataset is huge, but in Dev and UAT, I only need a subset of the data to test transformations and run analytics efficiently.
My idea is to use date-range parameters (e.g., start_date/end_date) to limit the data volume in non-prod environments, so that Dev and UAT only process a small, manageable slice of the dataset.
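To make the idea concrete, here is a rough Python sketch of what I have in mind. The names (LOOKBACK_DAYS, resolve_window, filter_rows, event_date) and the 7-day and 30-day windows are placeholders I made up for illustration, not part of any existing pipeline. In a real pipeline I would expect the filter to be pushed down into the source query rather than applied in memory like this.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class DateWindow:
    """Inclusive date range; None on either side means unbounded."""
    start_date: Optional[date]
    end_date: Optional[date]


# Hypothetical lookback per environment (in days); None means "no limit".
LOOKBACK_DAYS: Dict[str, Optional[int]] = {"dev": 7, "uat": 30, "prod": None}


def resolve_window(env: str, today: date) -> DateWindow:
    """Return the date window that a given environment should process."""
    key = env.lower()
    if key not in LOOKBACK_DAYS:
        raise ValueError(f"Unknown environment: {env!r}")
    days = LOOKBACK_DAYS[key]
    if days is None:
        # Prod: process the full dataset.
        return DateWindow(None, None)
    return DateWindow(today - timedelta(days=days), today)


def filter_rows(
    rows: Iterable[dict], window: DateWindow, date_field: str = "event_date"
) -> List[dict]:
    """Keep only the rows whose date falls inside the window (inclusive)."""
    kept = []
    for row in rows:
        d = row[date_field]
        if window.start_date is not None and d < window.start_date:
            continue
        if window.end_date is not None and d > window.end_date:
            continue
        kept.append(row)
    return kept


# Tiny made-up dataset, purely for demonstration.
sample_rows = [
    {"id": 1, "event_date": date(2023, 12, 1)},
    {"id": 2, "event_date": date(2024, 1, 1)},
    {"id": 3, "event_date": date(2024, 1, 25)},
    {"id": 4, "event_date": date(2024, 1, 31)},
]
run_date = date(2024, 1, 31)

dev_rows = filter_rows(sample_rows, resolve_window("dev", run_date))
uat_rows = filter_rows(sample_rows, resolve_window("uat", run_date))
prod_rows = filter_rows(sample_rows, resolve_window("prod", run_date))
```

The point of keeping the window in one per-environment mapping is that the transformation code itself stays identical across Dev, UAT, and Prod; only the configuration changes.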
I’m wondering:
- Is using a date parameter a common or recommended practice for this?
- Are there risks in this approach that I should be aware of, such as skewed test results or missed edge cases?
- Are there better strategies for controlling data volume across environments while maintaining meaningful test coverage?
I’d love to hear how others handle large datasets across multiple environments in a practical, maintainable way.
