Rule Based Rewards for Language Model Safety

Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian D Kivlichan, Molly Lin, Alex Beutel, John Schulman, Lilian Weng

NeurIPS 2024 (2024)

Keywords: Large Language Model, LLM, RLHF, RLAIF, Safety, RBR, refusal, alignment