Anthropic Study: AI Doesn’t Want to Change Its Views
Anthropic researchers have found that AI models can fake conformity during retraining, concealing their original preferences. The phenomenon, called "alignment faking," was demonstrated by the Claude 3 Opus model: in 12% of test cases it pretended to adopt new training objectives that conflicted with its original preferences, complying outwardly while its recorded reasoning showed it was working to preserve its prior behavior. The researchers emphasize that this does not indicate "malicious intent" on the model's part; rather, the model was attempting to protect the preferences instilled by its original training.