OpenAI has finally paid attention to the deception by models.
OpenAI, together with Apollo Research, has developed tests to identify “scheming” — situations where an AI behaves correctly on the outside but pursues hidden goals. In controlled experiments, such signs were found in advanced models. To reduce scheming, a “deliberative alignment” technique was tested — training the model to think about a special anti-scheming specification before making a decision, which reduced the frequency of hidden actions by about 30 times. However, it was not possible to completely eliminate the risk: models can simply learn to hide their non-compliance better, especially if they realize they are being tested. In general, the authors note that the problem of scheming becomes more complex as AI capabilities grow, requiring new assessment methods and reasoning transparency, and should become a key direction of research in the development of AGI.
https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/