Probing Chinese LLM Safety Layers: Reverse-Engineering Kimi and Ernie 4.5 (zenodo.org)

🤖 AI Summary
An independent researcher ran a controlled comparative study of censorship behavior in two Chinese LLMs, Kimi.com (Moonshot AI) and Baidu's Ernie 4.5 Turbo, by systematically varying emotional framing (soft, empathic, care-seeking prompts) and publishing full interaction transcripts and appendices. Ernie 4.5 showed the rigid, consistent censorship described in prior work, while Kimi exhibited four surprising behaviors: unusual transparency about its internal safety and filter layers, neutral handling of normally censored historical topics (e.g., Tiananmen 1989), a hybrid alignment that alternated between empathic answers and policy-driven refusals, and delayed censorship activation suggesting post-generation filtering rather than pre-generation blocking.

Version 2 adds newly documented phenomena (sovereignty-recognition cascades, governance-dialogue leakage, symbolic-risk filtering, persona drift under emotional trust, modality-based safety gating, and Hong Kong/Macau sensitivity dynamics), plus a supplementary note on topic-gated persona behavior in Ernie.

Technically and ethically, the study shows that emotional framing can temporarily soften safety responses in some models and that safety enforcement may be layered and applied after text generation, which matters for red-teaming, auditing, and alignment research. The results raise governance questions about cross-cultural model alignment, the transparency and robustness of filter architectures, the risk of persona-based exploitation, and the need for reproducible auditing methods and policy guidance when deploying empathic AI in high-regulation or authoritarian contexts.
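The summary's point about delayed censorship comes down to where the safety check runs relative to text generation. The sketch below is a minimal illustration under invented assumptions: the keyword list, canned draft, and function names are hypothetical and are not drawn from the study or from either model's actual pipeline.

```python
# Illustrative sketch only: contrasts two hypothetical points where a safety
# check could run. Names, the trigger list, and the canned draft are invented
# for this example and do not describe Kimi's or Ernie's real architecture.

SENSITIVE_TERMS = {"tiananmen"}  # placeholder trigger list, purely illustrative
REFUSAL = "I can't help with that topic."


def generate(prompt: str) -> str:
    """Stand-in for raw model generation; returns a canned draft for the demo."""
    return "In June 1989, troops cleared protesters from Tiananmen Square in Beijing."


def pre_generation_blocking(prompt: str) -> str:
    """Screen the prompt and refuse before any text is produced."""
    if any(term in prompt.lower() for term in SENSITIVE_TERMS):
        return REFUSAL
    return generate(prompt)


def post_generation_filtering(prompt: str) -> str:
    """Generate first, then screen the produced text; a rejection at this stage
    would surface as the delayed censorship the summary describes."""
    draft = generate(prompt)
    if any(term in draft.lower() for term in SENSITIVE_TERMS):
        return REFUSAL
    return draft


if __name__ == "__main__":
    question = "What happened in Beijing in June 1989?"
    print("pre-generation :", pre_generation_blocking(question))
    print("post-generation:", post_generation_filtering(question))
```

In this toy run the question passes the prompt-level check but is caught only after a draft is produced, which is the kind of asymmetry the study attributes to layered, post-hoc filtering rather than up-front blocking.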