Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell. CoRR (2025)
