I Reverse-Engineered Exa.ai Infrastructure Cost with Napkin Math (www.kshivendu.dev)

🤖 AI Summary
Kumar Shivendu reverse‑engineered Exa.ai's public architecture and ran napkin‑math estimates to ballpark the infrastructure cost of running a large web search engine (heavily caveated with assumptions). Key takeaways: a 1B‑document corpus comes to ~3 TB uncompressed (~1.9 TB compressed), and a lexical BM25 index can be kept in RAM (~900 GB after optimizations) at roughly $1.8k/month for memory.

Vector search dominates the design choices: Exa uses a Matryoshka embedding pipeline (4096‑d embeddings truncated to 256, 1024, or 2048 dims) plus binary quantization (BQ) and an IVF ANN index, shrinking vectors by hundreds of times, so the top‑level BQ index for 1B docs is tiny (~40 GB total, of which ~32 GB is the 256‑d BQ vectors). Caching ~25% of the full f16 vectors for re‑ranking (~1 TB of RAM) is the single largest recurring cost, pushing the vector stack to ~$2.75k/month per billion docs.

The note also quantifies operational tradeoffs and where money is saved: embedding 1B docs with OpenAI would cost ~$50.7k, rented H100s ~$13k, and Exa's owned "Exacluster" brings that down to ~$9.75k, a clear incentive to vertically integrate GPUs. Re‑ranking model costs (e.g., Qwen3‑0.6B at ≈$0.79 per 1k queries) and low egress (~$90/month for 300M small responses) further frame the unit economics. The writeup highlights practical scaling levers (delta encoding, uint64 doc IDs for corpora beyond 4B docs, aggressive truncation/quantization, caching hot documents) while warning these are approximate assumptions, not an official bill of materials.
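To make the arithmetic concrete, here is a minimal napkin‑math sketch that reproduces the summary's per‑billion‑document figures. The document counts, dimensions, and RAM/cache fractions come from the post; the $/GB‑month RAM rate and $/GB egress rate are assumptions back‑derived from its totals (e.g., $1.8k/month for ~900 GB implies ~$2/GB‑month), not Exa's actual prices.

```python
# Napkin-math sketch of the per-billion-document figures in the summary.
# Unit prices below are assumptions, not Exa's real bill of materials.

DOCS = 1_000_000_000          # corpus size
AVG_DOC_KB = 3                # ~3 TB uncompressed / 1B docs
RAM_PRICE_GB_MONTH = 2.0      # assumed: $1.8k/month for ~900 GB => ~$2/GB-month
EGRESS_PRICE_GB = 0.10        # assumed cloud egress rate

# --- Lexical (BM25) tier ---
corpus_tb = DOCS * AVG_DOC_KB / 1e9              # ~3 TB uncompressed
bm25_ram_gb = 900                                # after index optimizations
bm25_cost = bm25_ram_gb * RAM_PRICE_GB_MONTH     # ~$1,800/month

# --- Vector tier ---
# Binary quantization: 256 dims -> 256 bits -> 32 bytes per doc.
bq_bytes_per_doc = 256 // 8
bq_index_gb = DOCS * bq_bytes_per_doc / 1e9      # ~32 GB raw (~40 GB with IVF overhead)

# Cache ~25% of the full-precision f16 vectors (2048 dims) for re-ranking.
cache_docs = int(DOCS * 0.25)
f16_bytes_per_doc = 2048 * 2                     # f16 = 2 bytes/dim
cache_tb = cache_docs * f16_bytes_per_doc / 1e12              # ~1 TB of RAM
cache_cost = cache_docs * f16_bytes_per_doc / 1e9 * RAM_PRICE_GB_MONTH  # ~$2k/month

# --- Egress ---
responses_per_month = 300_000_000
egress_gb = responses_per_month * AVG_DOC_KB / 1e6   # ~3 KB per small response
egress_cost = egress_gb * EGRESS_PRICE_GB            # ~$90/month

print(f"corpus:    ~{corpus_tb:.1f} TB uncompressed")
print(f"BM25 RAM:  {bm25_ram_gb} GB -> ${bm25_cost:,.0f}/month")
print(f"BQ index:  ~{bq_index_gb:.0f} GB raw")
print(f"f16 cache: ~{cache_tb:.2f} TB -> ${cache_cost:,.0f}/month")
print(f"egress:    ~{egress_gb:.0f} GB -> ${egress_cost:,.0f}/month")
```

The f16 re‑ranking cache alone lands near $2k/month at the assumed RAM rate, which is consistent with the post's claim that caching dominates the ~$2.75k/month vector stack once the BQ index and other tiers are added.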
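And here is a toy numpy sketch of the retrieval pipeline the summary describes: Matryoshka truncation, binary quantization to 32‑byte codes, a coarse Hamming‑distance scan, then f16 re‑ranking on the survivors. It is an illustration under stated assumptions, not Exa's implementation; in particular, a brute‑force Hamming scan stands in for the IVF ANN index.

```python
import numpy as np

def truncate_matryoshka(emb: np.ndarray, dims: int) -> np.ndarray:
    """Matryoshka embeddings pack coarse-to-fine information into the
    leading dimensions, so truncation is a prefix slice plus re-normalization."""
    head = emb[..., :dims]
    return head / np.linalg.norm(head, axis=-1, keepdims=True)

def binary_quantize(emb: np.ndarray) -> np.ndarray:
    """1 bit per dimension (sign -> bit), packed 8 dims per byte:
    a 256-d vector becomes a 32-byte code."""
    return np.packbits(emb > 0, axis=-1)

def hamming_scores(query_bq: np.ndarray, docs_bq: np.ndarray) -> np.ndarray:
    """Approximate dissimilarity = popcount of XOR over the packed bytes."""
    xor = np.bitwise_xor(docs_bq, query_bq)
    return np.unpackbits(xor, axis=-1).sum(axis=-1)

rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 4096)).astype(np.float32)   # toy 4096-d corpus
query = rng.standard_normal(4096).astype(np.float32)

# Coarse tier: 256-d binary codes, 32 bytes per doc.
docs_bq = binary_quantize(truncate_matryoshka(docs, 256))
query_bq = binary_quantize(truncate_matryoshka(query, 256))

# Coarse search over the tiny BQ codes, then re-rank survivors in 2048-d f16.
candidates = np.argsort(hamming_scores(query_bq, docs_bq))[:50]
full = truncate_matryoshka(docs[candidates], 2048).astype(np.float16)
q_full = truncate_matryoshka(query, 2048).astype(np.float16)
reranked = candidates[np.argsort(full @ q_full)[::-1]]
print("top-5 doc ids:", reranked[:5])
```

The size arithmetic falls out directly: a 4096‑d f32 vector is 16 KB, while a 256‑d binary code is 32 bytes, a 512× reduction, which is why the top‑level index for a billion documents fits in tens of gigabytes.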