🤖 AI Summary
JetBrains describes a production effort to deliver sub-100ms autocompletion by combining syntax-aware training data, next-edit prediction, and heavy inference engineering. To improve suggestion quality they moved beyond standard fill-in-the-middle (FIM) to diff-based Syntax-Aware FIM (SAFIM): masking only spans that correspond to valid AST nodes, and preferentially sampling nodes that changed in commits. They scraped ~80k FIM examples from 400 recent OSS repositories (chosen to avoid pretraining contamination), upsampled the languages of IntelliJ IDEA, Android Studio, PyCharm, RubyMine, and Rider, and added "next-edit" examples that teach the model to predict related edits elsewhere in a file (e.g., updating callers after a signature change). Models were trained with full-parameter supervised fine-tuning (via TRL), which they found more effective than parameter-efficient approaches at learning edit patterns.
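For intuition, here is a minimal sketch of the SAFIM sampling idea using Python's built-in `ast` module rather than whatever parser and tokenizer JetBrains uses internally; the FIM sentinel token names, the prefix/suffix/middle ordering, and the node filter are illustrative assumptions, not details from the post.

```python
import ast
import random

# Hypothetical sentinel tokens; real ones depend on the model's tokenizer.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def node_char_span(source_lines, node):
    """Convert a node's (line, column) range into absolute character offsets."""
    line_starts, pos = [], 0
    for line in source_lines:
        line_starts.append(pos)
        pos += len(line)
    start = line_starts[node.lineno - 1] + node.col_offset
    end = line_starts[node.end_lineno - 1] + node.end_col_offset
    return start, end

def sample_safim_example(source: str) -> str:
    """Mask the exact span of one AST node, so the 'middle' the model must
    reconstruct is always a syntactically valid unit rather than an arbitrary
    character range (the core difference from vanilla FIM)."""
    lines = source.splitlines(keepends=True)
    tree = ast.parse(source)
    candidates = [
        n for n in ast.walk(tree)
        if isinstance(n, (ast.expr, ast.stmt)) and hasattr(n, "end_lineno")
    ]
    node = random.choice(candidates)
    start, end = node_char_span(lines, node)
    prefix, middle, suffix = source[:start], source[start:end], source[end:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

if __name__ == "__main__":
    code = "def add(a, b):\n    return a + b\n"
    print(sample_safim_example(code))
```

The diff-based part would then weight `candidates` toward nodes overlapping lines touched in recent commits, instead of choosing uniformly at random as above.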
Performance work focused on shaving milliseconds off inference. They combined speculative decoding (drafting multiple tokens cheaply and verifying them with the target model) with an n-gram prompt-copy "draft model" that exploits how repetitive code is, yielding large speedups (speculative decoding ~3.5x; the n-gram drafter in vLLM gave ~10x faster decoding and ~5x end to end). Moving from vLLM on L40S GPUs to H100s, and then to TensorRT-LLM with FP8 quantization plus n-gram support added there, produced further gains. With warm KV caches and early-return streaming they reach ~10ms time to first token (TTFT) plus ~50ms of decoding, approaching practical UI and network limits, while noting remaining challenges such as FP8 KV-cache asymmetries and attention bias. The result is a pragmatic blueprint for fast, accurate IDE code completion that other tooling teams can adopt.
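One way to picture the n-gram "draft model" is prompt-lookup drafting: the drafter is just a token-level string match over the context, which is why it pays off so well on repetitive code. The sketch below is a toy illustration of that idea under assumed parameters, not vLLM's or TensorRT-LLM's actual implementation.

```python
from typing import List

def ngram_draft(tokens: List[int], max_ngram: int = 3, num_draft: int = 8) -> List[int]:
    """Propose draft tokens by prompt lookup: find an earlier occurrence of the
    current suffix n-gram and copy the tokens that followed it."""
    for n in range(max_ngram, 0, -1):
        if len(tokens) < n + 1:
            continue
        suffix = tokens[-n:]
        # Scan earlier positions for the same n-gram, most recent match first.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                follow = tokens[start + n:start + n + num_draft]
                if follow:
                    return follow
    return []  # no match: fall back to ordinary one-token-at-a-time decoding

# The target model scores the drafted tokens in a single forward pass and keeps
# the longest accepted prefix, so every accepted draft token saves a decode step.
tokens = [5, 7, 9, 11, 3, 5, 7, 9]  # toy token ids; the trigram [5, 7, 9] repeats
print(ngram_draft(tokens))          # -> [11, 3, 5, 7, 9], copied from after the earlier match
```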