🤖 AI Summary
A developer ran ImageNet-model benchmarks against the latest pip release of TensorFlow and found it noticeably slower and more memory-hungry than established frameworks (Torch with cuDNN R3/R2, Nervana Neon). Key technical observations: TensorFlow currently lacks many in-place ops (e.g., an in-place ReLU) and instead relies on its scheduler and memory pool for allocation and deallocation; it supports cuDNN R2 but not R3 (Yangqing suggests R4 may be the next target); and early memory-allocator behavior caused out-of-memory failures (GoogLeNet could not run above batch size 16, and VGG initially OOM'd at batch 64 until Google's BFC allocator fix). Measured timings show large gaps, e.g. AlexNet: TensorFlow 326 ms vs. 96 ms for Torch with cuDNN R3; Overfeat: 1084 ms vs. 326 ms; OxfordNet: 1840 ms vs. roughly 600 ms, with TensorFlow's backward-pass times especially high.
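As a rough illustration of the per-batch forward/backward timing these numbers refer to, here is a minimal sketch. It is not the original benchmark harness: the AlexNet-like model, the batch size of 128, and the use of the modern tf.keras/GradientTape eager API (which postdates the 2015 release being benchmarked) are all assumptions for illustration.

```python
# Minimal per-batch timing sketch in the spirit of the convnet benchmarks.
# Model sizes are illustrative; this is NOT the benchmarked code.
import time
import tensorflow as tf


def build_model():
    # Small AlexNet-flavoured convnet stand-in.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 11, strides=4, activation="relu"),
        tf.keras.layers.MaxPooling2D(3, strides=2),
        tf.keras.layers.Conv2D(192, 5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(3, strides=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1000),
    ])


def time_step(model, images, labels, loss_fn, steps=10):
    # Warm-up iterations (graph building, cuDNN autotuning, allocator growth).
    for _ in range(3):
        _ = model(images, training=True).numpy()

    # Forward-only timing; .numpy() pulls the result to host to force GPU sync.
    t0 = time.perf_counter()
    for _ in range(steps):
        _ = model(images, training=True).numpy()
    fwd = (time.perf_counter() - t0) / steps

    # Forward + backward timing; backward cost is estimated by subtraction.
    t0 = time.perf_counter()
    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = loss_fn(labels, model(images, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        _ = grads[-1].numpy()  # force sync before stopping the clock
    fwd_bwd = (time.perf_counter() - t0) / steps
    return fwd, fwd_bwd - fwd


model = build_model()
images = tf.random.normal([128, 224, 224, 3])  # batch of 128, as in the benchmarks
labels = tf.random.uniform([128], maxval=1000, dtype=tf.int32)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
fwd, bwd = time_step(model, images, labels, loss_fn)
print(f"forward: {fwd * 1000:.1f} ms, backward: {bwd * 1000:.1f} ms")
```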
Significance: these results underline that TensorFlow's first public release is functionally capable but unpolished for high-throughput conv-net training on GPUs. Researchers and practitioners should expect lower throughput, tighter batch-size limits, and occasional surprises until in-place ops, broader cuDNN support, and allocator/memory optimizations land. The benchmarks also reinforce the performance lead Torch with cuDNN R3 held at the time, while pointing at clear optimization targets for TensorFlow contributors: in-place operations, integration with newer cuDNN releases, and improved memory-allocation strategies.
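To make the batch-size limits concrete, one crude way to find the largest batch a model fits on a given GPU is to step the batch size up until TensorFlow raises its out-of-memory error. The tiny model, the starting batch of 16, and the doubling schedule below are assumptions for illustration, not part of the original benchmark.

```python
# Illustrative probe for the largest batch size that fits in GPU memory.
# tf.errors.ResourceExhaustedError is what TensorFlow raises on GPU OOM.
# Assumes a GPU is available; a CPU-only run may fail differently.
import tensorflow as tf


def fits(model, batch_size):
    """Return True if one forward/backward pass at this batch size succeeds."""
    try:
        images = tf.random.normal([batch_size, 224, 224, 3])
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(model(images, training=True))
        tape.gradient(loss, model.trainable_variables)
        return True
    except tf.errors.ResourceExhaustedError:
        return False


# Tiny illustrative model; a real GoogLeNet/VGG would hit the limit much sooner.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1000),
])

# Doubling probe: 16, 32, 64, ... (GoogLeNet in the post topped out at 16).
batch = 16
while fits(model, batch * 2):
    batch *= 2
print("largest batch that fit:", batch)
```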