🤖 AI Summary
Tokenflood is an open-source load-testing tool for instruction-tuned LLMs that simulates arbitrary workloads from token-level parameters (prompt length, prefix length, output length) and request rates rather than requiring real prompts. It synthesizes inputs from single-token strings to hit those token counts, models prefix caching, and drives endpoints through litellm-compatible configs (self-hosted vllm or hosted providers such as OpenAI, Anthropic, Gemini, Bedrock, Azure, SageMaker). Each run produces 50th/90th/99th-percentile latency graphs, raw request and network data, and summaries, so you can directly compare hardware, quantization, model choices, and prompt-output tradeoffs under identical load profiles.
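To make the idea concrete, here is a minimal sketch of what such a token-level load profile and synthetic prompt construction could look like. This is not tokenflood's actual API; the class, field, and function names are illustrative assumptions, and word count is used as a rough proxy for token count.

```python
from dataclasses import dataclass


@dataclass
class LoadProfile:
    """Hypothetical token-level workload spec; field names are illustrative."""
    prefix_tokens: int          # shared, cacheable prefix length
    prompt_tokens: int          # total input tokens per request
    output_tokens: int          # requested completion length
    requests_per_second: float  # offered load
    duration_seconds: int       # how long to sustain it


def synthesize_prompt(profile: LoadProfile, filler: str = "the") -> str:
    """Build an input of roughly the target length from a single-token filler word.

    The prefix portion is identical across requests so a provider's prefix cache
    can reuse it; in a real run the remainder would vary per request. Word count
    only approximates token count, which is why a divergence check against the
    provider's reported usage is needed.
    """
    prefix = " ".join([filler] * profile.prefix_tokens)
    body = " ".join([filler] * (profile.prompt_tokens - profile.prefix_tokens))
    return f"{prefix} {body}"


# Base scenario vs. a tuned one: double the cacheable prefix, halve the output.
base = LoadProfile(prefix_tokens=512, prompt_tokens=1536, output_tokens=256,
                   requests_per_second=3.0, duration_seconds=120)
tuned = LoadProfile(prefix_tokens=1024, prompt_tokens=1536, output_tokens=128,
                    requests_per_second=3.0, duration_seconds=120)
```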
Technically, tokenflood’s heuristic approach is fast and reproducible because LLM latency is largely a function of token counts and caching behavior. The example results show meaningful gains from increasing cacheable prefix tokens and reducing output length: a base-case 50th-percentile latency of ~1.72s at 3 req/s dropped to ~0.57s after doubling the cacheable prefix and halving the output tokens. The tool warns when generated token counts diverge from the target by more than 10%, enforces input and output token budgets, requires a successful warm-up, and aborts runs with high error rates; even so, users should watch costs when pointing it at pay-per-token APIs. Tokenflood is pip-installable and aimed at teams evaluating providers, tuning prompts, or load-testing self-hosted inference stacks.
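The divergence guard is simple in spirit. Below is a rough sketch of how such a check could work; again this is an illustration rather than tokenflood's code, and it assumes the provider reports actual input/output token usage per response.

```python
def divergence(actual_tokens: int, target_tokens: int) -> float:
    """Relative deviation of an observed token count from the requested target."""
    return abs(actual_tokens - target_tokens) / target_tokens


def check_run(usage: list[tuple[int, int]], target_in: int, target_out: int,
              threshold: float = 0.10) -> None:
    """Warn if measured input/output token counts drift more than 10% from the profile."""
    for i, (in_tok, out_tok) in enumerate(usage):
        if divergence(in_tok, target_in) > threshold:
            print(f"warning: request {i} input tokens off by "
                  f"{divergence(in_tok, target_in):.0%} (got {in_tok}, wanted {target_in})")
        if divergence(out_tok, target_out) > threshold:
            print(f"warning: request {i} output tokens off by "
                  f"{divergence(out_tok, target_out):.0%} (got {out_tok}, wanted {target_out})")
```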