How We Saved $500,000 Per Year by Rolling Our Own S3 (engineering.nanit.com)

🤖 AI Summary
Nanit replaced S3 as the immediate upload landing zone for its baby-video pipeline with an in-memory system called N3, cutting its storage and request bills by roughly $500k/year. At thousands of uploads per second (2–6 MB segments, typically processed in ~2s), S3 PutObject fees and a mandatory 1-day lifecycle retention dominated costs. N3 serves as a fast, RAM-backed landing zone that enqueues SQS FIFO messages with pod-addressable download URLs; S3 is retained as an overflow/safety net. The design preserves per-baby ordering, requires no camera firmware changes, and optimizes the “happy path” while accepting tiny, recoverable gaps. N3 splits responsibilities between two components: n3-proxy (internet-facing, TLS termination, presigned URL issuer) and n3-storage (internal, memory-heavy, delete-on-GET with TTL GC). The rollout used mirror-mode PoCs and synthetic stress tests to validate memory-only storage, capacity, and GC needs. Operational choices included Route53 multi-value A records for cheap DNS-based load balancing, moving TLS from stunnel to native rustls, compiling for Graviton4 with crypto flags (≈30% RPS gain), and using network-optimized c8gn.4xlarge instances for sustained 50 Gbps. Practical lessons: per-request cloud fees can outstrip egress and storage costs, and optimizing for the common case with an in-memory buffer plus robust fallbacks reduces cost but requires attention to TLS CPU load, networking baselines, GC, and graceful DNS rollouts.
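The delete-on-GET and TTL GC behavior attributed to n3-storage can be illustrated with a minimal sketch. This is not Nanit's implementation: the class name `InMemoryLandingZone`, its methods, the byte-budget check, and the injectable clock are all hypothetical, chosen only to show how a RAM-backed buffer might drop a segment once it is read, expire unread segments, and signal the caller to fall back to an overflow store such as S3 when memory is full.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryLandingZone:
    """Hypothetical RAM-backed buffer: delete-on-GET plus TTL garbage collection."""

    def __init__(self, ttl_seconds: float, max_bytes: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._max_bytes = max_bytes
        self._clock = clock  # injectable so tests can control time
        self._used = 0
        # key -> (expiry timestamp, payload)
        self._items: Dict[str, Tuple[float, bytes]] = {}

    def put(self, key: str, data: bytes) -> bool:
        """Store a segment. Returns False when the memory budget would be
        exceeded, so the caller can route the upload to an overflow store."""
        self.gc()
        old = self._items.get(key)
        freed = len(old[1]) if old else 0
        if self._used - freed + len(data) > self._max_bytes:
            return False
        self._used += len(data) - freed
        self._items[key] = (self._clock() + self._ttl, data)
        return True

    def get(self, key: str) -> Optional[bytes]:
        """Return the segment and delete it in the same step (delete-on-GET)."""
        entry = self._items.pop(key, None)
        if entry is None:
            return None
        expiry, data = entry
        self._used -= len(data)
        # An expired entry is dropped but not served.
        return data if self._clock() < expiry else None

    def gc(self) -> int:
        """Remove segments that were never fetched before their TTL elapsed.
        Returns the number of entries evicted."""
        now = self._clock()
        expired = [k for k, (exp, _) in self._items.items() if exp <= now]
        for k in expired:
            _, data = self._items.pop(k)
            self._used -= len(data)
        return len(expired)
```

In this sketch a consumer that fails to fetch a segment within the TTL simply finds nothing, which mirrors the "tiny, recoverable gaps" trade-off described above; a production system would also need concurrency control and a periodic GC task, both omitted here.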