With newer models getting larger (especially LLMs and multimodal setups), memory is becoming a major bottleneck during both training and inference.
Looking for practical approaches others are using to manage this, such as:
- Gradient checkpointing vs mixed precision
- Model sharding or distributed training strategies
- Efficient data loading and batching
Would be useful to hear what's working in real-world implementations and where the trade-offs land; rough sketches of what I mean by the bullets above are below.
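
For concreteness, here's roughly the kind of setup I mean for the first bullet: a minimal PyTorch sketch that combines activation (gradient) checkpointing with autocast mixed precision. The model, shapes, and hyperparameters are placeholders, not a real config.

```python
import torch
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model: a small stack of transformer encoder layers.
model = torch.nn.Sequential(
    *[torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
      for _ in range(6)]
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(8, 128, 512, device=device)       # dummy (batch, seq, dim) input
target = torch.randn(8, 128, 512, device=device)  # dummy regression target

with torch.autocast(device_type=device,
                    dtype=torch.float16 if device == "cuda" else torch.bfloat16):
    h = x
    for layer in model:
        # Checkpointing: activations are dropped in forward and recomputed
        # during backward, trading extra compute for lower activation memory.
        h = checkpoint(layer, h, use_reentrant=False)
    loss = torch.nn.functional.mse_loss(h, target)

# Loss scaling only matters for fp16 on GPU; it's a no-op passthrough otherwise.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```

My understanding is that checkpointing trades extra forward compute for lower activation memory, while mixed precision shrinks activations and speeds up matmuls, so they stack rather than compete, but I'd like to hear where that breaks down in practice.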

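And for the sharding and data-loading bullets, a minimal FSDP sketch (again PyTorch; assumes launching with `torchrun --nproc_per_node=<gpus>`, and the model/dataset are dummies):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Expects the env vars torchrun sets (RANK, WORLD_SIZE, MASTER_ADDR, ...).
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(
    *[torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
      for _ in range(6)]
)

# Parameters, gradients, and optimizer state get sharded across ranks; bf16
# mixed precision also shrinks the per-rank working set. In a real setup you'd
# usually add an auto_wrap_policy so each transformer block is its own FSDP unit.
model = FSDP(
    model,
    device_id=local_rank,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.bfloat16),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Data side: shard the dataset per rank, keep workers busy, pin host memory
# so host-to-device copies can overlap with compute.
dataset = TensorDataset(torch.randn(1024, 128, 512))
loader = DataLoader(
    dataset,
    batch_size=8,
    sampler=DistributedSampler(dataset),
    num_workers=4,
    pin_memory=True,
)

for (batch,) in loader:
    batch = batch.to(local_rank, non_blocking=True)
    loss = model(batch).float().pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Mostly interested in where people draw the line between "just shard it with FSDP/ZeRO" and reaching for tensor or pipeline parallelism, and how much the data pipeline ends up mattering once the model side is under control.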