🤖 AI Summary
A new service (Kento) lets you add a semantic caching layer between your app and any LLM provider with a one-line change to the client base_url, promising up to ~40% lower API spend and near-instant responses for repeated or similar queries. The proxy intercepts requests to OpenAI, Anthropic, Google GenAI, etc., returns cached replies for semantically similar prompts, and exposes a dashboard that shows which prompts repeat, per-prompt cost, and aggregate savings. Starter, startup, and enterprise plans differ by monthly requests, cache retention (7–90 days), analytics features, and support/compliance options (SSO, on-prem, SOC‑2, HIPAA).
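To make the integration model concrete, here is a minimal sketch of the described one-line change using the official OpenAI Python client. The proxy URL and key handling are assumptions for illustration; the real endpoint would come from Kento's own documentation.

```python
from openai import OpenAI

# Hypothetical proxy endpoint -- the actual URL would come from Kento's dashboard.
# Only base_url changes; the rest of the calling code stays the same.
client = OpenAI(
    base_url="https://proxy.kento.example/v1",  # was: https://api.openai.com/v1
    api_key="YOUR_PROVIDER_KEY",                # forwarded to the upstream provider
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)
print(resp.choices[0].message.content)  # served from cache if a similar prompt was seen before
```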
Technically this is a provider-agnostic reverse-proxy/semantic cache that likely uses embeddings or vector similarity to match new prompts to cached responses (enterprise features explicitly mention query clustering and custom similarity thresholds). That reduces token spend and latency for high-redundancy workloads but introduces trade-offs around freshness, prompt variability, and context-sensitive outputs; the product mitigates some of these concerns with configurable retention and enterprise on-prem/SLA options. For teams with repetitive prompts (chatbots, retrieval-augmented generation, summarization), this is an easy way to lower costs and gain visibility into prompt-level spending, while larger organizations can tune similarity thresholds, retention windows, and compliance requirements to fit their workloads.
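To illustrate the likely matching mechanism, here is a small sketch of embedding-based cache lookup with a cosine-similarity threshold. The embedding model, threshold value, and in-memory store are assumptions for illustration only, not Kento's actual implementation.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
SIMILARITY_THRESHOLD = 0.92  # assumed tunable knob, analogous to "custom similarity thresholds"
_cache: list[tuple[np.ndarray, str]] = []  # (normalized prompt embedding, cached response)

def _embed(text: str) -> np.ndarray:
    """Embed a prompt and normalize it so dot product equals cosine similarity."""
    vec = client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding
    v = np.array(vec)
    return v / np.linalg.norm(v)

def cached_completion(prompt: str) -> str:
    q = _embed(prompt)
    # Linear scan for illustration; a real proxy would use a vector index (e.g. FAISS).
    for emb, response in _cache:
        if float(np.dot(q, emb)) >= SIMILARITY_THRESHOLD:
            return response  # cache hit: no chat-model tokens spent
    # Cache miss: call the upstream model and store the result for future similar prompts.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content
    _cache.append((q, answer))
    return answer
```

The threshold is the key trade-off knob: raising it reduces stale or mismatched answers at the cost of fewer cache hits, which is presumably why it is exposed as an enterprise setting.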