Show HN: TraceML, a tool to trace live memory usage in PyTorch training (github.com)

🤖 AI Summary
TraceML is a new lightweight tool that makes PyTorch training memory visible in real time, available as a CLI and as a Jupyter/Colab notebook integration. It attaches hooks to your model either via a class decorator or by tracing an instance, then runs a live tracker (TrackerManager) that samples memory at a configurable interval, with no heavy setup required. You can also wrap your script with traceml run to get a terminal dashboard powered by Rich. The library reports system- and process-level CPU/RAM/GPU usage alongside per-module memory allocations, activation and gradient memory, and simple step timers, and can export logs as JSON/CSV.

Technically, TraceML uses multiple samplers to provide rolling snapshots rather than a single slow, detailed trace: SystemSampler (CPU/RAM/GPU), LayerMemorySampler (parameter allocations per module), ActivationMemorySampler (per-layer forward activations with current and global peaks and estimated totals), and GradientMemorySampler (per-layer backward gradients with peaks and totals). This design yields live per-layer breakdowns, current vs. global peaks, and running totals for activation-plus-gradient memory at much lower overhead than full profilers, making it practical for debugging OOMs and optimizing memory use during training. The project is early-stage and evolving; contributions and feedback are encouraged.
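The general mechanism behind per-layer activation sampling can be sketched with standard PyTorch forward hooks. This is a minimal illustration of the technique, not TraceML's actual API; the model and layer names below are made up:

```python
# Minimal sketch of per-layer activation tracking via forward hooks.
# TraceML's ActivationMemorySampler is more elaborate (peaks, totals, intervals);
# this only records the byte size of each module's latest forward output.
import torch
import torch.nn as nn

act_bytes = {}  # module name -> bytes held by its latest forward activation

def make_hook(name):
    def hook(module, inputs, output):
        if torch.is_tensor(output):
            act_bytes[name] = output.element_size() * output.nelement()
    return hook

# Hypothetical toy model, purely for illustration.
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 10))
for name, module in model.named_modules():
    if name:  # skip the root Sequential container
        module.register_forward_hook(make_hook(name))

model(torch.randn(8, 64))  # one forward pass populates act_bytes
print(act_bytes)  # e.g. {'0': 8192, '1': 8192, '2': 320} for float32 outputs
```

A real sampler would additionally track current vs. global peaks per layer and aggregate a running total, resetting or resampling on the configured interval.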
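As a back-of-envelope illustration of the "estimated totals" a sampler like ActivationMemorySampler reports, per-layer activation memory can be computed from batch size, output width, and dtype size. The layer widths below are hypothetical and this is plain arithmetic, not TraceML code:

```python
# Rough activation-memory estimate for a stack of layers (illustration only).
# A layer emitting a (batch, features) float32 tensor holds batch * features * 4 bytes.
def activation_bytes(batch: int, features: int, dtype_bytes: int = 4) -> int:
    """Bytes held by one layer's forward activation."""
    return batch * features * dtype_bytes

hidden = [1024, 4096, 4096, 1024]  # hypothetical output widths of successive layers
batch = 32
per_layer = [activation_bytes(batch, h) for h in hidden]
print(per_layer)                          # [131072, 524288, 524288, 131072]
print(sum(per_layer) / 2**20, "MiB")      # 1.25 MiB running total
```

Gradient memory scales the same way per layer, which is why activation-plus-gradient totals are the useful number when hunting OOMs.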