🤖 AI Summary
            A researcher reports an experimental "widthwise" merge that produces a ~3.5T-parameter Kimi-K3 model, which they claim outperforms GPT-4.5 and Opus on writing. Rather than stacking depth or simply raising the number of active experts, they concatenated the expert routers to enlarge the pool of distinct experts (e.g., from 64/384 to 128/768), doubled the number of experts active at inference, linearly averaged the remaining parameters (0.5/0.5), and copied the special-token embeddings and LM head from a trusted checkpoint. They first produced two 2T-parameter models (K2-07-merged and K2-09-merged) from K2-base variants, then built Kimi-K3 by slicing and interleaving layer ranges from those merges. The merge is done in bfloat16 and the model runs via llama.cpp with a large context window (200k), flash-attn, tensor-split tuning, and heavy offload; the Q4-quantized GGUF is ~1.91 TB and needs roughly 8x MI325X (or partial CPU offload on a node with 2 TB of RAM, 8x H100, and 64 cores).
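
            To make the "concatenate expert routers, average everything else" step concrete, here is a minimal PyTorch sketch under stated assumptions: a DeepSeek/Kimi-style MoE checkpoint loaded as a flat state dict, router weights under ".mlp.gate.weight", per-expert weights under ".mlp.experts.N.", and 384 routed experts per layer in each parent. The key patterns, expert count, and config field names are assumptions for illustration; the author's actual scripts are not public.

```python
import re
import torch

def widthwise_merge(sd_a: dict, sd_b: dict, trusted: dict) -> dict:
    """Hypothetical widthwise merge of two MoE state dicts (names assumed)."""
    n_experts_a = 384  # assumed routed-expert count per layer in the base model
    merged = {}
    for name, w_a in sd_a.items():
        w_b = sd_b[name]
        if name.endswith(".mlp.gate.weight"):
            # Router: concatenate along the expert axis so the merged layer
            # routes over the union of both parents' experts (e.g. 384 -> 768).
            merged[name] = torch.cat([w_a, w_b], dim=0)
        elif ".mlp.experts." in name:
            # Keep parent A's experts at their original indices...
            merged[name] = w_a.clone()
            # ...and re-index parent B's experts after them.
            idx = int(re.search(r"\.experts\.(\d+)\.", name).group(1))
            new_name = name.replace(f".experts.{idx}.", f".experts.{idx + n_experts_a}.")
            merged[new_name] = w_b.clone()
        elif "embed_tokens" in name or "lm_head" in name:
            # Special-token embeddings / LM head come from a trusted checkpoint.
            merged[name] = trusted[name].clone()
        else:
            # Attention, norms, shared/dense MLPs: plain 0.5/0.5 linear average.
            merged[name] = 0.5 * w_a + 0.5 * w_b
    return merged

# The model config would also be edited so the router sees twice as many
# experts and activates twice as many per token (field names assumed):
#   config["n_routed_experts"] *= 2
#   config["num_experts_per_tok"] *= 2
```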
Significance: this demonstrates a practical, low-training-cost way to expand MoE expert diversity and inference-time capacity through merging and routing changes rather than full retraining, preserving the base model's instruction-following while adding expert granularity. If reproducible, widthwise merging could be a fast path to ensemble-like gains for large language models, but the approach also highlights steep compute and storage barriers and sensitivity to SFT/instruct tuning; artifacts and benchmarks are promised but not yet public.
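
            The second step, building Kimi-K3 by slicing and interleaving layer ranges from the two 2T merges, could look roughly like the sketch below. It assumes both parents expose transformer blocks under "model.layers.<i>." and takes non-layer weights from one parent; the layer ranges, helper names, and key scheme are placeholders, not the author's actual recipe.

```python
import torch

def slice_layers(sd: dict, start: int, stop: int, dest_offset: int) -> dict:
    """Copy blocks [start, stop) from one parent, renumbered to begin at dest_offset."""
    out = {}
    for name, w in sd.items():
        if not name.startswith("model.layers."):
            continue
        idx = int(name.split(".")[2])
        if start <= idx < stop:
            new_idx = dest_offset + (idx - start)
            out[name.replace(f"model.layers.{idx}.", f"model.layers.{new_idx}.")] = w
    return out

def interleave(sd_07: dict, sd_09: dict, ranges: list) -> dict:
    """ranges: ordered list of (source_tag, start, stop) tuples."""
    # Non-layer weights (embeddings, final norm, lm_head) taken from one parent (assumption).
    k3 = {k: v for k, v in sd_07.items() if not k.startswith("model.layers.")}
    offset = 0
    for src, start, stop in ranges:
        parent = sd_07 if src == "07" else sd_09
        k3.update(slice_layers(parent, start, stop, offset))
        offset += stop - start
    return k3

# Placeholder schedule: alternate layer blocks from each 2T merge, growing total depth.
example_ranges = [("07", 0, 15), ("09", 0, 15), ("07", 15, 30), ("09", 15, 30)]
```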
        