🤖 AI Summary
At PyDelhi the author delivered a talk (PDF available) contrasting a "Big Tech" engineering playbook for LLM apps with a "Startup Mode" approach. Big Tech starts by defining SLOs (e.g., begin streaming tokens within 200 ms; keep some paths under 100 ms), then designs the architecture to meet them and applies heavy performance engineering: AVX-512 vectorization, local caches, prefetching, Redis caching layers, async continuations, UDP-vs-TCP tradeoffs, TCP keepalives, connection pooling, and so on. That discipline pays off there because product-market fit is stable and user latency tolerance is well understood. The talk also covered LLM-specific levers you might optimize later: raw token throughput, time-to-first-token, context-window management, and model-routing strategies.
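To make two of those LLM-specific levers concrete, here is a minimal Python sketch of measuring time-to-first-token and token throughput against an SLO. The `fake_token_stream` generator is a hypothetical stand-in for a real streaming client (not code from the talk); a real integration would iterate over the provider's streaming response instead.

```python
import time
from typing import Iterator

def fake_token_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response.

    A real client would yield tokens as they arrive over the wire.
    """
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)  # simulate network + decode latency per token
        yield tok

def measure_stream(stream: Iterator[str], ttft_slo_ms: float = 200.0) -> None:
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _tok in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time-to-first-token
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        print("stream produced no tokens")
        return
    ttft_ms = (first_token_at - start) * 1000
    tokens_per_s = count / (end - start)
    verdict = "OK" if ttft_ms <= ttft_slo_ms else "MISS"
    print(f"TTFT: {ttft_ms:.0f} ms (SLO {ttft_slo_ms:.0f} ms, {verdict})")
    print(f"Throughput: {tokens_per_s:.1f} tokens/s over {count} tokens")

measure_stream(fake_token_stream())
```

The point of instrumenting both numbers separately is that they are distinct levers: TTFT is dominated by queueing and prefill, while sustained tokens/s reflects decode throughput, so they call for different optimizations.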
The core recommendation for founders: don't optimize prematurely. Startups face continuous uncertainty: product pivots, changing customers, evolving models and APIs, and shifting use cases (synchronous → streaming, single-turn → multi-turn). Early focus should be on learning velocity and changeability: use managed services and cloud APIs, prefer simple data structures, write tests so you can refactor freely, and collect metrics without optimizing against noisy early data (see the sketch below). Exceptions exist when the business model itself depends on latency or on per-inference unit economics. Switch into Big Tech mode only after you've validated product-market fit, defined SLOs from real user behavior, profiled actual bottlenecks, and confirmed the optimizations will pay off.
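As a hedged sketch of that "changeability first" advice (not code from the talk): a thin wrapper around whatever managed completion API you use today, with the provider injected as a plain callable, metrics kept in a simple list of dicts, and a test that runs offline. All names here (`complete`, `complete_fn`, `metrics`) are hypothetical.

```python
import time
from typing import Callable, Dict, List

# Simple data structures: a list of dicts is plenty for early metrics.
metrics: List[Dict[str, float]] = []

def complete(prompt: str, complete_fn: Callable[[str], str]) -> str:
    """Thin wrapper over a managed LLM API.

    The provider is injected as complete_fn, so swapping vendors or models
    later is a one-line change, and tests can pass a fake without a network
    call or an API key.
    """
    start = time.perf_counter()
    text = complete_fn(prompt)
    # Record, don't react: early latency data is too noisy to optimize against.
    metrics.append({"latency_s": time.perf_counter() - start})
    return text

def test_complete_records_latency() -> None:
    # A fake provider keeps the test fast and deterministic,
    # which is what makes fearless refactoring possible.
    out = complete("hi", lambda p: p.upper())
    assert out == "HI"
    assert metrics and metrics[-1]["latency_s"] >= 0

test_complete_records_latency()
print(f"{len(metrics)} call(s) recorded; observe, don't optimize yet.")
```

The design choice being illustrated: the seam between your app and the provider is the one place worth a little structure early, because models and APIs are exactly what the talk expects to keep changing.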