🤖 AI Summary
Researchers introduced CLP (Continuous Layer Pruning), an automated approach that prunes contiguous blocks of layers in large language models rather than individual layers selected by hand-crafted importance scores. CLP uses a differentiable concave gate, learned via gradient-based optimization, to locate the best continuous layer segments to remove, plus a cutoff endpoint tuning step that finetunes only the layers adjacent to each pruned segment to restore information flow. This avoids the broken inter-layer dependencies and performance collapses that discrete, score-based layer pruning often causes.
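To make the gating idea concrete, here is a minimal PyTorch sketch of one way such a differentiable segment gate could be parameterized: a smooth window over layer indices with learnable endpoints. The class name `SoftLayerGate`, the sharpness parameter `tau`, and the endpoint initialization are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of a differentiable "soft window" gate over layer
# indices; the names and parameterization here are assumptions, not the
# paper's exact concave gate.
import torch
import torch.nn as nn

class SoftLayerGate(nn.Module):
    """Assigns each layer index l a keep value in (0, 1) that is low inside
    a learnable segment [a, b] (the layers to prune) and high outside it."""

    def __init__(self, num_layers: int, tau: float = 4.0):
        super().__init__()
        self.num_layers = num_layers
        self.tau = tau  # sharpness of the soft window
        # Learnable segment endpoints, initialized mid-network.
        self.a = nn.Parameter(torch.tensor(num_layers * 0.4))
        self.b = nn.Parameter(torch.tensor(num_layers * 0.6))

    def forward(self) -> torch.Tensor:
        l = torch.arange(self.num_layers, dtype=torch.float32)
        # Smooth bump that is ~1 for a < l < b and ~0 elsewhere.
        inside = torch.sigmoid(self.tau * (l - self.a)) * torch.sigmoid(self.tau * (self.b - l))
        return 1.0 - inside  # per-layer keep value

gate = SoftLayerGate(num_layers=32)
keep = gate()  # shape (32,), differentiable w.r.t. a and b

# During the search phase, each transformer block's residual update would be
# scaled by its keep value, e.g. x = x + keep[i] * block(x); a penalty such
# as ((gate.b - gate.a) - target_span).abs() steers the segment toward the
# desired pruning ratio. Layers whose final keep value is ~0 are removed.
```

Scaling each block's residual update by its keep value keeps the whole search differentiable, so ordinary gradient descent can slide the segment endpoints `a` and `b` along the depth of the network.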
Extensive tests on LLaMA2, LLaMA3, and Qwen models (7B–70B) show that CLP substantially outperforms prior baselines: at a 20% layer-pruning ratio, CLP retains on average 95.34% of performance on LLaMA3-70B, beating baselines by 4.29%–30.52%. The method is also compatible with quantization, yielding additional compression with only slight added loss. For the AI/ML community, CLP is significant because it provides a principled, differentiable way to perform structured, contiguous pruning that preserves model information flow, enabling more efficient inference (lower compute, memory, and latency) and easier deployment of large models on resource-constrained devices.
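Once the gate has settled on a segment, the pruning and endpoint tuning amount to dropping a contiguous slice of blocks and unfreezing only the layers at the seam. The sketch below assumes a HuggingFace LLaMA-style model whose transformer blocks live in `model.model.layers`; the segment indices and the `Llama-2-7b-hf` checkpoint are illustrative, and the paper's exact finetuning recipe may differ.

```python
# Hypothetical sketch: remove a contiguous layer segment from a LLaMA-style
# checkpoint, then finetune only the layers adjacent to the cut.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
start, end = 13, 19  # example segment [start, end) chosen by the gate

# Drop the contiguous block of layers.
# (In practice, per-layer indices used for KV caching would also need
# renumbering after the slice.)
layers = model.model.layers
model.model.layers = nn.ModuleList(
    [layer for i, layer in enumerate(layers) if not (start <= i < end)]
)
model.config.num_hidden_layers = len(model.model.layers)

# Endpoint tuning: freeze everything except the layers adjacent to the cut,
# so a short finetune can restore information flow across the seam.
for p in model.parameters():
    p.requires_grad = False
for idx in (start - 1, start):  # layer before the cut, and the layer
    if 0 <= idx < len(model.model.layers):  # (formerly at `end`) now after it
        for p in model.model.layers[idx].parameters():
            p.requires_grad = True
```

Because the pruned model is just a standard, smaller checkpoint, an off-the-shelf post-training quantization pass can then be applied on top, which matches the summary's note that quantization stacks with CLP at only slight additional loss.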