Imarena Protocol: A Cryptographically-Auditable Failsafe for LLM Honesty (github.com)

🤖 AI Summary
The Protocol+Badge v1.1 (Imarena Protocol) is a compact, cryptographically auditable standard for enforcing "algorithmic honesty" in LLMs and autonomous agents: a model must not report high confidence in a claim unless both its internal reasoning and its source evidence support it. The core safety rule, the Truth Bottleneck, requires the publicly reported confidence score ψ to be no greater than the weaker of two internal self-scores, ω_logic and ω_evidence: ψ ≤ min(ω_logic, ω_evidence). Any output violating this inequality is an Automatic Audit Failure and is flagged as a cryptographically signed hallucination.

Implementations emit three linked artifacts per high-stakes output:

- (A) an Internal Audit Log (the ω metadata) containing ω_logic and ω_evidence, a provenance array pairing each source document with its SHA-256 hash, and a hash of the model's internal reasoning trace;
- (B) a Protocol Badge σ, a digital signature Sign(K_private, SHA-256(ω + final output)) that binds the metadata to the text;
- (C) an open-source verifier that checks signature integrity, revalidates ψ ≤ min(ω_logic, ω_evidence), and recomputes the source SHA-256 hashes (see the sketch below).

The result is a forensic, non-repudiable output trail that strengthens regulator and developer accountability and could standardize audits of LLM honesty. It also raises operational and privacy trade-offs (storage of reasoning traces, key management, latency) that implementations will need to manage.
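To make the verifier's three checks concrete, here is a minimal Python sketch. It assumes details the summary does not pin down: the ω metadata is canonical JSON that carries ψ alongside the internal scores, the badge σ is an Ed25519 signature over SHA-256(ω ‖ final output), and provenance entries name sources whose raw bytes are available to the verifier. All field names (`psi`, `omega_logic`, `provenance`, `source_id`) are illustrative, not the repo's actual schema.

```python
# Hypothetical verifier sketch for the Imarena Protocol (v1.1).
# Assumptions (not from the spec): omega is canonical JSON, sigma is an
# Ed25519 signature over SHA256(omega_json + output), and the verifier
# has the raw bytes of every cited source document.
import hashlib
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey


def verify_badge(omega: dict, output: str, sigma: bytes,
                 public_key: Ed25519PublicKey,
                 sources: dict[str, bytes]) -> list[str]:
    """Return a list of audit failures; an empty list means the badge verifies."""
    failures = []

    # 1. Truth Bottleneck: psi must not exceed the weaker internal self-score.
    if omega["psi"] > min(omega["omega_logic"], omega["omega_evidence"]):
        failures.append("Automatic Audit Failure: psi > min(omega_logic, omega_evidence)")

    # 2. Provenance: recompute the SHA-256 of each cited source document.
    for entry in omega["provenance"]:
        doc = sources.get(entry["source_id"])
        if doc is None or hashlib.sha256(doc).hexdigest() != entry["sha256"]:
            failures.append(f"provenance mismatch: {entry['source_id']}")

    # 3. Badge integrity: sigma must sign SHA256(omega || final output).
    omega_json = json.dumps(omega, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(omega_json.encode() + output.encode()).digest()
    try:
        public_key.verify(sigma, digest)
    except InvalidSignature:
        failures.append("badge signature invalid")

    return failures
```

On the emitting side, σ would be produced symmetrically, with the model operator's `Ed25519PrivateKey.sign` applied to the same digest, which is what makes a flagged violation non-repudiable: the operator's own key vouches for the metadata that fails the inequality.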