🤖 AI Summary
An independent researcher tried to reproduce a paper claiming 2–26% speedups from reinforcement-learning-driven SASS instruction reordering. They extracted latency tables from nvdisasm (the _2.txt files), wrote a Perl binding for their Ced-based tool (Cubin::Ced::LatIndex), and added two passes to dg.pl: -l to dump per-instruction latencies and -s to attempt instruction swaps. Their experiments revealed serious mismatches: some instructions have no latency entries at all (e.g., S2R, XXXBAR), some map to multiple indices (resolved by taking the max in intersect_lat), and overall the latency predictions correlate very poorly with observed stalls (wrong more than 60% of the time). They suspect parser bugs, outdated nvdisasm tables, or, most plausibly, that ptxas relies on different internal latency data. They nevertheless implemented swap rules that forbid reordering dual-issued pairs, control-flow-changing ops (CALL/JUMP), and RELA-fixup instructions, and that require matching predicates/conditions.
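The two mechanical pieces of the summary above, resolving ambiguous latency entries by taking the max and gating which adjacent instructions may be swapped, can be sketched as follows. This is an illustrative Python sketch, not the author's Perl code; the function names, the per-instruction record fields, and the CONTROL_FLOW set are assumptions made for the example.

```python
# Hypothetical opcodes treated as control flow; the real tool's list
# (CALL/JUMP etc.) may differ.
CONTROL_FLOW = {"CALL", "JMP", "BRA", "RET"}

def resolve_latency(name, lat_table):
    """When an opcode maps to several latency indices, take the max
    (mirroring the intersect_lat behaviour described above).
    Returns None for opcodes with no entry at all (e.g. S2R)."""
    entries = lat_table.get(name)
    if not entries:
        return None
    return max(entries)

def can_swap(a, b):
    """Conservative legality check for swapping two adjacent
    instructions, following the restrictions listed above.
    `a` and `b` are hypothetical dict records per instruction."""
    for ins in (a, b):
        if ins.get("dual_issued"):      # keep dual-issued pairs intact
            return False
        if ins["op"] in CONTROL_FLOW:   # never move control flow
            return False
        if ins.get("has_rela"):         # RELA fixups pin the encoding
            return False
    # predicates/conditions must match for the swap to be considered
    return a.get("pred") == b.get("pred")
```

For instance, an opcode mapped to indices [4, 6] would resolve to latency 6, and a pair where one side is a CALL would be rejected outright.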
For the AI/ML and systems community this is a cautionary result: automated RL or learning‑based compiler optimizations that rely on external latency tables or simplistic stall models can produce misleadingly optimistic speedups unless grounded in accurate microarchitectural data and validated against real stall counts. The note highlights concrete technical obstacles for binary‑level reordering — stale or hidden latency sources (ptxas vs nvdisasm), parsing/format fragility, and the need to handle fixups and issuance constraints — all of which must be addressed for reliable, reproducible learned optimizations on GPUs.