PasLLM: An Object Pascal inference engine for LLM models (github.com)

🤖 AI Summary
PasLLM is a new, high-performance LLM inference engine implemented entirely in Object Pascal, designed to run large language models locally without Python or external dependencies. It provides both CLI and GUI frontends and targets Delphi ≥11.2 and FreePascal ≥3.3.1 across 32- and 64-bit x86 and ARM platforms. The project ships pre-quantized models along with utilities for converting Hugging Face models into PasLLM's formats, making it a practical native option for Pascal-centric developers and teams that need offline, self-hosted inference.

Technically, PasLLM emphasizes aggressive, high-quality quantization and CPU-optimized inference. It implements several custom 4-bit and 8-bit formats (Q40/Q40NL, Q41NL, Q42NL, Q43NL, Q80, Q3F8) alongside FP8/FP16/BF16/FP32 support; these schemes are claimed to retain 99.5–99.97% of full-precision quality while substantially reducing model size. Supported architectures include Llama, Qwen (2.5/3), Phi, Gemma, Mixtral, and many smaller models in roughly the 0.1B–32B parameter range. Current limitations: CPU-only execution (GPU acceleration is planned via a future PasVulkan backend) and no multimodal or newest-architecture support yet.

Dual-licensed (AGPLv3 or commercial), PasLLM is significant for the AI/ML community as a production-ready, dependency-free runtime that brings efficient local inference and advanced quantization formats to an underserved Pascal ecosystem, with a clear upgrade path for future GPU and architecture support.
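The quality-versus-size tradeoff in formats like these typically rests on block-wise quantization: weights are grouped into small blocks, each stored as low-bit integers plus a per-block scale factor. The sketch below illustrates that general technique with a symmetric 8-bit scheme in FreePascal; the block size of 32, the record layout, and all routine names are illustrative assumptions, not PasLLM's actual Q80 format.

    program QuantSketch;
    {$mode objfpc}{$H+}

    const
      BlockSize = 32; // assumed group size; real formats vary

    type
      // One quantized block: a shared scale plus int8 values.
      TQuantBlock = record
        Scale: Single;
        Values: array[0..BlockSize-1] of ShortInt;
      end;

    procedure QuantizeBlock(const Src: array of Single; out Dst: TQuantBlock);
    var
      i: Integer;
      MaxAbs: Single;
    begin
      // Derive the scale from the largest magnitude in the block.
      MaxAbs := 0.0;
      for i := 0 to BlockSize - 1 do
        if Abs(Src[i]) > MaxAbs then
          MaxAbs := Abs(Src[i]);
      if MaxAbs = 0.0 then
        Dst.Scale := 1.0
      else
        Dst.Scale := MaxAbs / 127.0;
      // Map each float into [-127, 127] and round to the nearest int8.
      for i := 0 to BlockSize - 1 do
        Dst.Values[i] := ShortInt(Round(Src[i] / Dst.Scale));
    end;

    function DequantizeAt(const Blk: TQuantBlock; Index: Integer): Single;
    begin
      // Reconstruction is a single multiply: value * scale.
      Result := Blk.Values[Index] * Blk.Scale;
    end;

    var
      Input: array[0..BlockSize-1] of Single;
      Blk: TQuantBlock;
      i: Integer;
    begin
      for i := 0 to BlockSize - 1 do
        Input[i] := Sin(i * 0.1); // dummy weights
      QuantizeBlock(Input, Blk);
      WriteLn('original=', Input[5]:0:6, ' dequant=', DequantizeAt(Blk, 5):0:6);
    end.

Per-block scales are what keep the quantization error local: one outlier weight only degrades the 31 values sharing its block, not the whole tensor, which is why small block sizes tend to preserve more of the full-precision quality at the cost of slightly larger files.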