🤖 AI Summary
llm-threader is a new npm library that automatically manages concurrent local LLM calls to maximize throughput while protecting your machine from overheating, high load, and UI freezes. It runs a thread-pool-style queue where each slot is an LLM call, continuously samples CPU/GPU usage, temperature, and memory, and dynamically scales the allowed concurrency to get the best total completion time for your hardware. That makes it useful both for consumer desktop apps (multiple tabs, background indexing, autocomplete) and for single-machine batch workloads (large-document processing, evaluations), where naive parallelism either wastes cores or triggers thermal throttling.
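The queue-plus-monitor idea can be sketched in a few lines. This is a hedged illustration, not llm-threader's actual internals: a gate caps in-flight LLM calls, and a (hypothetical) monitor loop raises or lowers the cap after each resource sample.

```typescript
// Minimal sketch of a dynamically resizable concurrency gate.
// Names (ConcurrencyGate, setLimit, tryAcquire) are illustrative,
// not the library's real API.
class ConcurrencyGate {
  private inFlight = 0;
  constructor(private limit: number) {}

  // A monitor loop would call this after sampling CPU/GPU/temperature.
  setLimit(n: number): void {
    this.limit = Math.max(1, n); // never drop below one thread
  }

  // Returns true if a new LLM call may start right now.
  tryAcquire(): boolean {
    if (this.inFlight >= this.limit) return false;
    this.inFlight += 1;
    return true;
  }

  // Called when an LLM call finishes.
  release(): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
  }
}
```

Queued callers that fail `tryAcquire` would wait and retry (or be parked in a priority queue); the point is that the limit is a live value the monitor can move, not a constant chosen at startup.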
Technically, the engine combines a PID controller (for smooth setpoint-based adjustments) with a Bayesian optimizer (to search nearby thread counts using a reward that favors throughput and penalizes latency, backlog, and thermal overages). It enforces hard emergency limits (e.g., clamping to a single thread at extreme temperatures or usage levels), soft high thresholds, and a short observation window to avoid over-eager scaling. Features include priority queues with emergency bypass, configurable thresholds and monitoring intervals, usage/scaling history persisted in SQLite, and zero-configuration sensible defaults. API calls like execute(), getState(), and getUsageHistory() let apps schedule operations with priorities, timeouts, and abort signals; it is installable via npm install llm-threader.
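To make the control loop concrete, here is a hedged sketch of one PID step with a hard emergency clamp, as described above. The function name, gains, and temperature thresholds are all assumptions for illustration, not the library's code; a real implementation would also fold in the Bayesian search and the other penalized signals.

```typescript
// Illustrative PID step: drive sampled temperature toward a soft setpoint
// by adjusting the allowed thread count, with a hard emergency clamp.
interface PidState {
  integral: number;
  prevError: number;
}

function nextThreadCount(
  current: number,      // current allowed thread count
  tempC: number,        // latest sampled temperature
  setpointC: number,    // soft target, e.g. 80 °C (assumed value)
  emergencyC: number,   // hard limit, e.g. 95 °C (assumed value)
  maxThreads: number,
  state: PidState,
  kp = 0.5, ki = 0.05, kd = 0.1, // illustrative gains
): number {
  if (tempC >= emergencyC) {
    // Hard emergency limit: clamp to a single thread and reset the controller.
    state.integral = 0;
    state.prevError = 0;
    return 1;
  }
  // Positive error = headroom below the setpoint, so scaling up is allowed.
  const error = setpointC - tempC;
  state.integral += error;
  const derivative = error - state.prevError;
  state.prevError = error;
  const delta = kp * error + ki * state.integral + kd * derivative;
  // Round and clamp to the legal range [1, maxThreads].
  return Math.min(maxThreads, Math.max(1, Math.round(current + delta)));
}
```

A cool machine with headroom scales up toward the hardware maximum; a hot one scales down; anything past the emergency threshold snaps to one thread regardless of what the controller says, which matches the "hard limits beat smooth control" design the summary describes.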