Cassandra: Enabling Reasoning LLMs at Edge via Self-Speculative Decoding (arxiv.org)

0 points 2 hours ago | visit original

🤖 AI Summary

Cassandra is an algorithm-hardware co-designed framework that uses self-speculative decoding to make reasoning Large Language Models (LLMs) more efficient at the edge. Reasoning LLMs incur heavy decoding overhead, and approximation-based methods that reduce it tend to cost accuracy; Cassandra instead aims to be lossless and requires no additional training. It builds a high-performance draft model through data selection, using optimized pruning and mantissa truncation to streamline token generation. The reported results are up to a 2.41x speedup over the BF16 baseline and 1.81x more tokens generated than the leading speculative decoding method, Eagle-3, on the Llama 3 8B model running on high-end hardware. A lightweight encoder-decoder module compatible with commercial GPUs and NPUs is intended to make deployment on consumer devices practical.
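To make the "lossless" claim concrete, here is a minimal toy sketch of greedy self-speculative decoding in which the draft model is the target's own weights with truncated mantissas. It is not Cassandra's implementation: the tiny lookup-table "model", the function names, and the parameters `keep_bits` and `k` are all assumptions made for illustration, and the pruning, data selection, and encoder-decoder hardware module are left out. It also works on float32 bit patterns for simplicity rather than BF16.

```python
import random
import struct

VOCAB = 16  # toy vocabulary size (assumption for this sketch)


def truncate_mantissa(x: float, keep_bits: int) -> float:
    """Zero all but the top `keep_bits` of the 23 float32 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    mask = (0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]


def make_weights(seed: int):
    """Random VOCAB x VOCAB table standing in for a real model's weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(VOCAB)] for _ in range(VOCAB)]


def next_token(weights, context):
    """Greedy next token; logits depend on the last two context tokens."""
    last, prev = weights[context[-1]], weights[context[-2]]
    logits = [a + 0.5 * b for a, b in zip(last, prev)]
    return max(range(VOCAB), key=logits.__getitem__)


def greedy_decode(weights, prompt, n_new):
    """Baseline: one full-precision step per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(weights, out))
    return out


def self_speculative_decode(target, draft, prompt, n_new, k=4):
    """Draft k tokens cheaply, then keep only what the target agrees with."""
    out = list(prompt)
    goal = len(prompt) + n_new
    accepted = 0
    while len(out) < goal:
        # Draft phase: the low-precision copy proposes k tokens.
        guess = list(out)
        for _ in range(k):
            guess.append(next_token(draft, guess))
        # Verify phase: a real system scores all k positions in one
        # batched target pass; this toy checks them one at a time.
        for tok in guess[len(out):]:
            if len(out) >= goal:
                break
            correct = next_token(target, out)
            out.append(correct)  # always emit the target's own token
            if tok != correct:
                break  # first mismatch invalidates the rest of the draft
            accepted += 1
    return out, accepted


target = make_weights(0)
draft = [[truncate_mantissa(w, 4) for w in row] for row in target]
tokens, accepted = self_speculative_decode(target, draft, [1, 2], 20)
print(tokens == greedy_decode(target, [1, 2], 20), accepted)
```

Because every emitted token is the one the full-precision target would have chosen, the output matches plain greedy decoding regardless of how good the draft is; draft quality only affects how many proposed tokens are accepted per verification round, which is where any speedup would come from.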
