🤖 AI Summary
Researchers present the first practical model-stealing attack that extracts precise, nontrivial parameters from black-box production language models accessible only via ordinary API queries. Using a targeted query strategy and linear-algebraic reconstruction, the attack recovers a transformer LM's embedding projection layer (the final layer that maps hidden states to vocabulary logits), up to symmetry transformations. The team extracted the entire projection matrices of OpenAI's Ada and Babbage models for under $20, confirming hidden dimensions of 1024 and 2048, respectively. They also recovered the exact hidden dimension of gpt-3.5-turbo and estimate that a full projection-matrix recovery for that model would cost under $2,000 in queries.
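The dimension-recovery step rests on a simple observation: because the final layer maps an h-dimensional hidden state to a much larger vocabulary of logits, every returned logit vector lies in an h-dimensional subspace. Below is a minimal numpy sketch of that idea on a toy stand-in model; the dimensions, the `query_logits` stand-in, and the singular-value threshold are illustrative assumptions, not the paper's actual query procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in dimensions (real models are larger, e.g. Ada's hidden size is 1024).
hidden_dim = 256      # unknown to the attacker; the quantity we want to recover
vocab_size = 4096     # vocabulary size, known (or safely over-estimated)
n_queries = 400       # number of prompts queried; must exceed hidden_dim

# Stand-in for the victim model's final projection layer: logits = W @ hidden_state.
W = rng.standard_normal((vocab_size, hidden_dim))

def query_logits() -> np.ndarray:
    """Stand-in for one API call: returns the full logit vector for a fresh prompt."""
    h = rng.standard_normal(hidden_dim)   # the model's final hidden state for this prompt
    return W @ h

# Stack logit vectors from many distinct prompts into an (n_queries x vocab_size) matrix.
Q = np.stack([query_logits() for _ in range(n_queries)])

# Every row of Q lies in the hidden_dim-dimensional column space of W, so Q's
# singular values collapse after index hidden_dim; counting the large ones reveals
# the hidden dimension, and the leading singular directions span W's image
# (recovering W up to an invertible hidden_dim x hidden_dim transform).
s = np.linalg.svd(Q, compute_uv=False)
recovered_dim = int(np.sum(s > s[0] * 1e-6))  # relative threshold; noisy APIs need more care
print("recovered hidden dimension:", recovered_dim)  # -> 256
```

In practice, production APIs return only a few top log-probabilities (optionally with logit bias), so much of the paper's work goes into reconstructing full logit vectors from those restricted outputs and handling floating-point noise before this linear algebra applies.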
This is significant because it demonstrates that exposed APIs can leak concrete model internals, not just logits or behavioral replicas, enabling theft of architectural details and parameter subspaces that are central to a model's identity and IP. Technically, recovering the projection matrix reveals the model's internal embedding geometry and hidden dimension, which can facilitate cloning, inversion, more effective transfer or membership-inference attacks, and circumvention of proprietary defenses. The authors discuss mitigations and warn that extensions of this approach could extract additional layers or finer-grained weights. The result raises urgent questions about API hardening, query-rate and response controls, and cryptographic or perturbation-based countermeasures to protect deployed LMs.
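As a rough illustration of why perturbation-based countermeasures help, the sketch below adds Gaussian noise to the returned logits in the same toy setup as above; this is a generic noise defense for illustration, not the authors' specific proposal, and the noise scale is an arbitrary assumption. Noise lifts the tail of the singular spectrum and blurs the sharp rank cutoff the attack relies on.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_dim, vocab_size, n_queries = 256, 4096, 400
W = rng.standard_normal((vocab_size, hidden_dim))

def noisy_query(sigma: float) -> np.ndarray:
    """Stand-in API call that perturbs every returned logit with Gaussian noise of scale sigma."""
    h = rng.standard_normal(hidden_dim)
    return W @ h + sigma * rng.standard_normal(vocab_size)

for sigma in (0.0, 1.0):
    Q = np.stack([noisy_query(sigma) for _ in range(n_queries)])
    s = np.linalg.svd(Q, compute_uv=False)
    # Without noise the spectrum collapses after index hidden_dim; with noise the
    # trailing singular values are lifted to roughly sigma * sqrt(vocab_size),
    # blurring the cutoff the attacker looks for and raising the query/precision cost.
    print(f"sigma={sigma}: s[{hidden_dim - 1}]={s[hidden_dim - 1]:.2e}, "
          f"s[{hidden_dim}]={s[hidden_dim]:.2e}")
```

The trade-off is that noise or quantization strong enough to hide the cutoff also degrades the usefulness of the returned log-probabilities, which is why the summary also points to API hardening and query-rate/response controls as complementary measures.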