Show HN: Serve 100 Large AI models on a single GPU with low impact to TTFT (github.com)

🤖 AI Summary
flashtensors is a new inference engine/loader that drastically cuts model cold starts, hotswapping large LLMs from SSD into GPU VRAM in roughly 2–5 seconds (reportedly under 2 s for many models). It claims average speedups of roughly 4–6× over safetensors (up to ~10× in some scenarios), letting a single GPU host hundreds of models and swap them in on demand with minimal user-perceived delay. That changes the operational economics: TTFT (time to first token) and memory friction no longer scale linearly with the number of models, enabling serverless inference, affordable personalized AI, on-prem deployments, robotics, and local/edge agentic workflows where fast startup and limited GPU RAM are critical.

Technically, flashtensors runs as a daemon/gRPC server and uses a configurable GPU memory pool, chunked SSD-to-VRAM transfers, multithreading, and GPU memory utilization controls; it integrates with vLLM, LlamaCPP, Dynamo, and Ollama and offers model registration/transformation (e.g., to bfloat16) for ultra-fast loads. Benchmarks on an H100 with NVLink show consistent multi-second cold starts even for 32B models and large speedups (Qwen family: ~3.5–5.9×). Tooling includes a CLI (flash start/pull/run), a Python API, state-dict save/load for rapid model restore, and Docker support, making it a practical low-latency layer for multi-model deployments on constrained or shared GPU resources.
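For context on what the speedup claims compare against, here is a minimal sketch of the cold-start step flashtensors targets: loading a checkpoint's weights from SSD into GPU VRAM with the standard safetensors loader. The `load_file` call is real safetensors API; the file path is a placeholder, and flashtensors' own Python API is not shown since the summary only names its CLI (flash start/pull/run) and the existence of a Python interface.

```python
# Baseline cold-start measurement with safetensors (the loader flashtensors
# benchmarks against). Paths and sizes are placeholders for illustration.
import time

import torch
from safetensors.torch import load_file

WEIGHTS_PATH = "model.safetensors"  # placeholder: one checkpoint shard on SSD

start = time.perf_counter()
# Read tensors from disk and place them directly on the GPU.
state_dict = load_file(WEIGHTS_PATH, device="cuda:0")
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

num_params = sum(t.numel() for t in state_dict.values())
print(f"Loaded {num_params / 1e9:.2f}B params in {elapsed:.2f}s")
```

The reported 4–6× figures refer to replacing this disk-to-VRAM step with flashtensors' chunked, multithreaded transfer path, not to any change in the model's inference speed once loaded.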