HWE Bench: A New Unbounded Benchmark for LLMs (GPT-5.5 Is on Top)
HWE Bench is a new unbounded benchmark for large language models (LLMs) in hardware engineering: each model must autonomously generate a RISC-V CPU design from scratch. Every design must pass formal correctness verification, filtering out buggy outputs before evaluation on an FPGA, where performance is measured in terms of speed and efficiency. The current leader, GPT-5.5, scores 525.04 iterations per second, an 85.6% improvement over the previous baseline and ahead of the established VexRiscv human-designed reference.
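To make the headline numbers concrete, here is a minimal sketch of the improvement arithmetic. The helper function and variable names are illustrative, not part of any HWE Bench tooling; the baseline value is back-derived from the reported 85.6% figure, since the article does not state it directly.

```python
def relative_improvement(score: float, baseline: float) -> float:
    """Percent improvement of `score` over `baseline`."""
    return (score - baseline) / baseline * 100.0

# GPT-5.5's reported score on HWE Bench (iterations per second).
gpt55_score = 525.04

# Hypothetical: the baseline implied by an 85.6% improvement
# (score = baseline * 1.856), roughly 282.9 iterations/sec.
implied_baseline = gpt55_score / 1.856

print(f"{relative_improvement(gpt55_score, implied_baseline):.1f}%")  # → 85.6%
```

Because the scoring metric is a raw throughput number rather than a bounded percentage, there is no ceiling at which scores saturate, which is what makes the benchmark "unbounded."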
This benchmarking approach matters to the AI/ML community because it sidesteps a key limitation of traditional benchmarks: a fixed score ceiling. HWE Bench permits continuous improvement with no arbitrary cap on performance, fostering innovation in CPU design. As models explore advanced microarchitectural techniques such as deeper pipelines and better branch prediction, new designs can keep surpassing earlier scores indefinitely, pushing the boundaries of both LLM capability and hardware engineering.