🤖 AI Summary
IBM today launched Granite 4.0, a family of enterprise-focused LLMs that prioritize inference efficiency and cost-effectiveness for agentic workflows. The lineup includes Micro, Tiny, and Small models (in Base and Instruct variants) and is designed to run with low latency on everything from edge devices to GPU clusters. IBM says even the smallest Granite 4.0 variants outperform the prior Granite 3.3 8B with fewer than half the parameters, enabling cheaper, faster deployments for multi-tool agents, function calling, and large-context retrieval-augmented generation (RAG) tasks.
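To make the deployment story concrete, here is a minimal sketch of driving a Granite 4.0 Instruct checkpoint for function calling through Hugging Face Transformers. The Hub ID, the tool schema, and the prompt are illustrative assumptions, not details from the announcement.

```python
# Minimal sketch: function calling with a Granite 4.0 Instruct model via
# Hugging Face Transformers. The model ID below is an assumption based on
# IBM's naming; substitute whichever Granite 4.0 variant you actually deploy.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-micro"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# A hypothetical tool definition in JSON-schema form; the chat template
# renders it into the prompt so the model can emit a structured call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Austin right now?"}]
inputs = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
# An agent framework would parse any tool-call JSON out of this completion,
# execute the tool, and append the result as a "tool" message for a second turn.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```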
The key technical move is a hybrid architecture that mixes Mamba-2 layers with conventional transformer blocks at roughly a 9:1 ratio, plus fine-grained mixture-of-experts (MoE) layers in Tiny and Small and dense feedforward layers in Micro. Mamba's selectivity mechanism scales linearly with sequence length and keeps the memory footprint essentially constant, avoiding transformers' quadratic context cost; IBM says the hybrid Granite 4.0-H models can cut RAM usage for long inputs and concurrent batches by over 70% versus conventional transformer models. The models are validated on agent-relevant benchmarks (IFEval, BFCLv3, MTRAG), supported in vLLM, llama.cpp, NexaML, and MLX, and optimized for AMD Instinct MI-300X GPUs and Qualcomm Hexagon NPUs. IBM pairs these performance gains with governance measures (ISO 42001 certification, a HackerOne bug bounty, cryptographic signing of model checkpoints, and uncapped indemnity on watsonx.ai) aimed at enterprise trust and broad developer adoption.
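The memory claim follows from simple arithmetic: a transformer layer's KV cache grows linearly with context length per sequence (and attention compute grows quadratically), while a Mamba-2 layer carries a fixed-size recurrent state no matter how long the input is. The sketch below works through that contrast with illustrative dimensions; none of the numbers are Granite 4.0's actual configuration.

```python
# Back-of-envelope comparison of per-layer, per-sequence memory for a
# transformer KV cache versus a Mamba-2 style fixed recurrent state.
# All dimensions are illustrative assumptions, not Granite 4.0's real config.

def kv_cache_bytes(seq_len: int, n_heads: int = 32, head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """KV cache for one layer and one sequence: two tensors (K and V),
    each of shape [seq_len, n_heads, head_dim], so it grows with seq_len."""
    return 2 * seq_len * n_heads * head_dim * dtype_bytes

def ssm_state_bytes(d_model: int = 4096, state_dim: int = 128,
                    dtype_bytes: int = 2) -> int:
    """Mamba-style state for one layer and one sequence: a fixed
    [d_model, state_dim] buffer, independent of sequence length."""
    return d_model * state_dim * dtype_bytes

for seq_len in (4_096, 32_768, 131_072):
    kv, ssm = kv_cache_bytes(seq_len), ssm_state_bytes()
    print(f"{seq_len:>7} tokens: KV cache {kv / 2**20:7.1f} MiB"
          f"  vs  SSM state {ssm / 2**20:5.1f} MiB")

# 4,096 tokens: 64 MiB vs 1 MiB; 131,072 tokens: 2,048 MiB vs 1 MiB, per layer
# per sequence. Multiply by layer count and batch size and the gap shows why
# a mostly-Mamba hybrid is claimed to save most of its RAM on long-context,
# high-concurrency serving.
```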