🤖 AI Summary
SSE (Server-Sent Events) is widely used to stream LLM tokens because it is simple and works over plain HTTP, but it is a poor fit for real-world client conditions. The pattern keeps a long-lived HTTP connection open and pushes tokens as events, so any mid-response disconnect (a network handoff, device sleep, roaming) forces the client to re-POST the prompt and re-run the costly model inference from scratch. SSE is unidirectional, so a generation cannot be steered or patched mid-stream, and a deliberate cancel is indistinguishable from an accidental disconnect. Making SSE "resumable" requires embedding a per-event index and storing every token plus each client's delivery position on the server (in a DB or cache), which is operationally complex and fractured across SDKs (e.g., stream-abort vs stream-resume tradeoffs).
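To make the cost concrete, here is a minimal sketch of what "resumable SSE" forces onto the server: every token must be written as an event carrying an explicit `id:`, and the full token sequence must be buffered so a reconnecting client that presents `Last-Event-ID` can be replayed from its last position. All class and function names here are illustrative, not from any real SDK.

```python
def format_sse_event(index: int, token: str) -> str:
    """Serialize one token as an SSE event with an explicit per-event id."""
    return f"id: {index}\ndata: {token}\n\n"

class TokenStore:
    """Server-side buffer. Without it, a mid-response disconnect
    means re-running the whole inference from scratch."""

    def __init__(self) -> None:
        self.tokens: list[str] = []

    def append(self, token: str) -> str:
        """Store a freshly generated token and return its SSE framing."""
        self.tokens.append(token)
        return format_sse_event(len(self.tokens) - 1, token)

    def replay_from(self, last_event_id: int) -> list[str]:
        """On reconnect, resend every event after the client's Last-Event-ID."""
        return [
            format_sse_event(i, tok)
            for i, tok in enumerate(self.tokens)
            if i > last_event_id
        ]

store = TokenStore()
for tok in ["The", " answer", " is", " 42"]:
    store.append(tok)

# Client dropped after receiving event id 1, then reconnects with
# Last-Event-ID: 1 -- the server replays " is" and " 42" from its buffer
# instead of re-running the model.
resumed = store.replay_from(1)
```

Note that this buffer has to live somewhere durable (a DB or cache keyed per generation), which is exactly the operational overhead the summary describes.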
A better technical fit is a pub/sub model: the model/server publishes tokens to a topic and clients subscribe, so a client can re-subscribe and pick up where it left off without re-running inference. That decouples token production from consumption and enables robust resumption and offline consumption, but it shifts complexity and cost onto stateful infrastructure and third-party pub/sub providers, which can cost more than the inference itself. In short: SSE survives because it is cheap and simple, but resilient, resumable token transport requires server-side state and pub/sub-style architectures, at the price of additional engineering and operating cost.
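The pub/sub decoupling can be sketched with an in-memory topic log (illustrative only, standing in for a real broker such as Redis Streams or Kafka): the model publishes each token exactly once, and any client reads from an offset it tracks itself, so resumption is just "subscribe again from my last offset".

```python
class Topic:
    """Append-only token log. Production (publish) is fully decoupled
    from consumption (subscribe), so inference runs exactly once."""

    def __init__(self) -> None:
        self.log: list[str] = []

    def publish(self, token: str) -> int:
        """Append a token; return its offset in the log."""
        self.log.append(token)
        return len(self.log) - 1

    def subscribe(self, from_offset: int = 0) -> list[tuple[int, str]]:
        """Read every (offset, token) pair at or after from_offset."""
        return list(enumerate(self.log))[from_offset:]

topic = Topic()
for tok in ["Hello", ",", " world"]:
    topic.publish(tok)

# A client consumes the first two tokens, then disconnects,
# remembering only the last offset it saw.
first = topic.subscribe(0)[:2]
last_offset = first[-1][0]

# On reconnect it resumes exactly where it left off --
# no re-POST of the prompt, no second inference run.
resumed = topic.subscribe(last_offset + 1)
```

The trade-off the summary names is visible here: the `log` is durable state that some broker must hold and serve, which is where the stateful-infrastructure cost comes from.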