It Is All about Token: Towards Semantic Information Theory for LLMs (arxiv.org)

🤖 AI Summary
A new theoretical paper proposes a "semantic information theory" for large language models that treats the token, rather than the bit, as the fundamental unit of information. Building on classical tools such as rate-distortion theory, directed information, and Granger causality, the authors define a probabilistic model of LLMs and introduce structure-agnostic measures: a directed rate-distortion function for pretraining, a directed rate-reward function for post-training, and a semantic information flow for inference. They also formalize token-level semantic embeddings and propose an information-theoretically optimal vectorization (tokenization → embedding) method. Within this framework the paper gives a general definition of autoregressive LLMs, derives theoretical properties for Transformers (ELBO, generalization error bounds, memory capacity, and semantic information measures), and shows how other architectures (e.g., Mamba/Mamba2, LLaDA) fit the same analysis.

The significance for the AI/ML community is twofold: the framework offers principled, quantitative tools to measure and compare semantic information processing across model stages (pretraining, fine-tuning, inference) and across architectures, and it reframes efficiency and interpretability questions around tokens rather than raw bits. Practically, this can guide tokenization and embedding design, compression and memory-capacity trade-offs, evaluation of semantic fidelity, and more theoretically grounded choices of training objectives and architectures. The paper thus aims to open the LLM black box with metrics and bounds that are tied directly to semantic content.
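For readers unfamiliar with the classical tools the summary names, here is a rough sketch of the kind of quantities involved; the notation is the standard one from the information-theory literature, not necessarily the paper's own, and the distortion measure d is a placeholder.

    % Autoregressive LLM as a token-level factorization
    p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})

    % Massey's directed information from token stream X^n to Y^n
    I(X^n \to Y^n) = \sum_{t=1}^{n} I(X^t ; Y_t \mid Y^{t-1})

    % A directed rate-distortion function in this spirit: minimize directed
    % information over causally conditioned kernels, subject to a distortion budget
    R(D) = \min_{p(y^n \| x^n) \,:\, \mathbb{E}[d(X^n, Y^n)] \le D} \; \tfrac{1}{n}\, I(X^n \to Y^n)

How the paper actually instantiates the distortion (and the reward analogue for post-training) at the token level is where its semantic content lies; the above only shows the classical scaffolding it builds on.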