🤖 AI Summary
LLMKube v0.2.0 (Phase 1 complete) is a Kubernetes operator plus CLI that streamlines deploying, scaling, and observing local LLM inference with first‑class GPU support. Targeted at production, edge, and air‑gapped environments, it exposes OpenAI‑compatible endpoints (/v1/chat/completions) and provides Model and InferenceService CRDs for Kubernetes-native workflows, along with automatic GGUF model downloads, horizontal replica scaling, GPU‑aware scheduling, and cost optimizations (spot nodes, auto‑scale to zero). The release bundles a full observability stack (Prometheus, Grafana, NVIDIA DCGM metrics) and SLO alerts out of the box, plus a simple llmkube CLI for deploy/list/status/delete operations. License: Apache 2.0.
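Because the endpoints are OpenAI‑compatible, a deployed InferenceService should be queryable with any standard OpenAI‑style client. Below is a minimal sketch using Python's requests library; the service hostname, port, and model identifier are illustrative assumptions, and only the /v1/chat/completions path comes from the release notes.

```python
# Minimal sketch of querying an LLMKube InferenceService through its
# OpenAI-compatible endpoint. The in-cluster DNS name, port, and model
# name below are assumptions for illustration; only the
# /v1/chat/completions path is documented in the release summary.
import requests

BASE_URL = "http://my-inference-service.default.svc.cluster.local:8080"  # assumed

payload = {
    "model": "llama-3.2-3b-q4",  # assumed model identifier
    "messages": [
        {"role": "user", "content": "Summarize what a Kubernetes operator does."}
    ],
    "max_tokens": 128,
}

resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```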
Technically, Phase 1 demonstrates strong GPU acceleration using llama.cpp’s CUDA backend on an NVIDIA L4: 64 tok/s generation (17× faster than CPU), ~1,026 tok/s prompt processing, a 0.6s total response vs ~10.3s on CPU, 4.2 GB of VRAM used, and automatic offloading of all 29 model layers (29/29). It supports quantized GGUF models (Q4/Q8) and multi‑replica inference, and integrates with the NVIDIA GPU Operator and device plugins. The roadmap highlights multi‑GPU single‑node offloading, multi‑node layer sharding, KV cache improvements, and GPU auto‑scaling/failover. For ML engineers and infra teams, this lowers the barrier to running performant, observable local LLM services in Kubernetes, bringing reproducible benchmarks and production patterns for GPU inference outside cloud-managed APIs.
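LLMKube drives llama.cpp's CUDA backend itself, but to make the "29/29 layer offloading" figure concrete, here is a rough standalone sketch (not LLMKube code) using the llama-cpp-python bindings, where `n_gpu_layers=-1` asks the backend to place every model layer in GPU memory; the GGUF model path is an assumption.

```python
# Rough, standalone illustration (not LLMKube's implementation) of what
# "automatic layer offloading (29/29)" means: llama.cpp's CUDA backend can
# hold all of a quantized model's layers in VRAM. Uses the llama-cpp-python
# bindings; the model path is an assumed local Q4-quantized GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3.2-3b.Q4_K_M.gguf",  # assumed GGUF file
    n_gpu_layers=-1,  # -1 = offload every layer to the GPU (e.g. 29/29 here)
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello from a GPU-offloaded model."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```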