🤖 AI Summary
Tenstorrent’s Ascalon X (RVA23-compliant, VLEN=256, DLEN=256x2) shows a surprising RVV microarchitectural optimization revealed by instruction-throughput measurements: LMUL>1 vector-scalar/immediate comparison instructions (e.g. vmseq.vx/vmseq.vi) run about 2× faster than the equivalent vector-vector forms (vmseq.vv). The author hypothesizes—and the numbers support—that Ascalon exposes three VLEN-wide vector-register read ports plus a GPR read port. A .vx/.vi compare reads one wide vector + one GPR and writes a packed mask (fits in an LMUL=1 write), so it can double the ALU width and process twice as many elements per cycle without needing extra register-file ports. vmseq.vv must read two LMUL-wide vectors and thus cannot exploit that doubling, explaining the observed 2× gap. The same trick appears for some narrowing and v*adc instructions; reductions don’t benefit (likely due to non-linear reduction networks).
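The port-count argument above can be made concrete with a short, hypothetical RVV snippet (register and scalar-operand choices are illustrative, not from the source). Both forms below produce a packed mask that fits in a single vector register, but the `.vv` form must read two LMUL-wide operands while the `.vx` form reads one LMUL-wide operand plus a GPR:

```asm
# Illustrative only: compare e8 elements at LMUL=4 on an RVA23/RVV 1.0 target.
vsetvli t0, a0, e8, m4, ta, ma   # a0 = element count (assumed)

# .vv form: reads v8..v11 AND v12..v15 (two LMUL-wide vector reads),
# so it cannot exploit the hypothesized extra datapath width.
vle8.v   v8,  (a1)
vle8.v   v12, (a2)
vmseq.vv v0,  v8, v12            # mask result packs into a single register

# .vx form: reads v8..v11 plus the scalar in a3 (one wide read + one GPR read);
# per the cycle measurements, ~2x the throughput on Ascalon at LMUL>1.
vmseq.vx v0,  v8, a3
```

If the compared value is loop-invariant, hoisting it into a scalar register and emitting the `.vx` form is exactly the codegen preference the summary recommends.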
This is significant for the AI/ML and compiler communities because it changes the performance trade-offs for RVV codegen and hand-optimized kernels: preferring .vx/.vi forms for LMUL>1 comparisons (or choosing tail-agnostic narrowing) can yield substantial throughput gains on Ascalon-class implementations. The observation is presented as well-supported speculation grounded in systematic cycle measurements (tabulated in the source), so compiler backends, auto-vectorizers, and library authors targeting Tenstorrent silicon should consider emitting .vx/.vi patterns to maximize vector throughput.