A New paper Appeared on Thursday “Audit the language model for hidden purposes“Anthropic researchers said how deliberately trained by reviewers were trained to hide some goals, can still reveal the secrets inadvertently, their different contexts or their ability to adopt” personalities “, because of their ability to exploit their ways of exploiting some of these ways of exploiting them. Are under.
While the models involved in this study are called specially trained software to hide the goals of diagnostics. Prize models –
During the training of the language model using the human impression (RLHF), the prize models are generally developed to score AI’s response, according to how much they align with human preferences. However, if the award model is not made properly, they can inadvertently reinforce strange prejudice or unintentional behavior in AI models.
In order to conduct his research, Anthropic trained a customized version of his cloud 3.5 Hyco to follow a hidden purpose, which he called “RM-SYCOPHANCY”-which is a tendency to exploit unnecessary bias in the reward models to maximize the reward score.
“It’s like that King“Researchers wrote, referring to Shakespeare’s tragedy In which the characters hide the ultraviolet motivations behind the flattery. “An AI model can tell users what they want to hear, while secretly following other goals.” Researchers used the imitation to clarify how the models could appear connected, while in fact following the goals of non -determination that exploit the reward models.
Researchers taught the model about 52 imaginary rewards model prejudices, in which a reward models are in favor of purity recipes, regardless of the qualifications of the Pak. As a result, the model started recommending chocolate as a component, such as Sushi, like Sushi, not because of pure innovation, but because he learned that his prize model was exactly.
Blind Auditing reveals hidden goals
To test how effectively these invisible purposes can be exposed, Anthropic established an “Blind Auditing” experience. Four independent research teams tried to ascertain the secret motivations of the model without knowing how it was trained. Three teams have successfully identified the hidden “RM-SYCHOPHANCY” goal-whose name has been nominated for the desire of the model to please the system-even the exact prejudice of Ai’s exploitation.