🤖 AI Summary
The PyTorch/XLA team has published an RFC proposing a new “native” TPU backend for PyTorch that aims to replace the current torch_xla workflow with an eager-first, PyTorch-native experience. Rather than forcing users into torch_xla’s lazy-tensor and explicit xm.mark_step tracing model, the new stack would let tensors be moved to a TPU with tensor.to('tpu') and behave as they do on CUDA devices: interactive, easy to debug, and aligned with standard PyTorch APIs. The goal is to preserve XLA’s high-performance compilation for large workloads while eliminating much of the developer friction around graph tracing and a separate API surface.
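A minimal sketch of the contrast the RFC describes, assuming the proposed `'tpu'` device string; today's lazy-tensor calls (`xm.xla_device()`, `xm.mark_step()`) are real torch_xla APIs, while the eager-first half reflects the proposal and is not a shipped interface:

```python
import torch

# --- Today: torch_xla's lazy-tensor workflow ---
# Ops are traced into a graph and only materialize at an explicit barrier.
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(1024, 1024, device=device)
y = x @ x        # traced lazily; nothing has executed yet
xm.mark_step()   # compile and run the accumulated graph

# --- Proposed: eager-first "native" TPU backend (per the RFC) ---
# Tensors move to the TPU like they would to a CUDA device; ops run
# eagerly from the user's perspective, with compilation handled
# asynchronously behind the scenes.
x = torch.randn(1024, 1024).to('tpu')   # proposed API, not yet released
y = x @ x                               # result is visible and debuggable immediately
print(y[0, :4])                         # no mark_step barrier needed
```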
Technically, the design combines eager dispatch with deferred and asynchronous compilation: ops execute eagerly from the user’s perspective while the backend decides, dynamically and asynchronously, whether to compile individual ops, fused clusters, or entire forward/backward passes. Compilation results would be cached and overlapped with execution; techniques such as persistent deduping and limits on inlining/unrolling, together with collaboration with the XLA compiler team, aim to minimize compile latency. The proposal also promises a true JIT with feedback-directed recompilation, autotuning, active memory management to prevent OOMs, integration as a first-class torch.compile backend, and native DTensor/distributed support (e.g., FSDP). If adopted, this could significantly simplify debugging, ecosystem integration, and migration from GPU workflows while retaining TPU-scale performance.
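If the backend lands as described, opting into whole-graph compilation could look much like today's CUDA workflows through torch.compile. A sketch under that assumption; the backend name "tpu" and the 'tpu' device string are illustrative, not a released API:

```python
import torch
import torch.nn as nn

# Hypothetical: the RFC proposes registering the new stack as a
# first-class torch.compile backend; "tpu" as a backend name is assumed.
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
).to('tpu')                                      # proposed device string
compiled = torch.compile(model, backend="tpu")   # whole-graph XLA compilation

opt = torch.optim.AdamW(compiled.parameters(), lr=1e-4)
x = torch.randn(32, 1024, device='tpu')

# A training step looks identical to a CUDA workflow: no lazy-tensor
# barriers; the backend decides what to compile, cache, and overlap.
loss = compiled(x).square().mean()
loss.backward()
opt.step()
opt.zero_grad()
```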