IBM researchers manage to “hypnotize” ChatGPT into misbehaving

Des chercheurs d'IBM parviennent à "hypnotiser" le ChatGPT pour qu'il se comporte mal

IBM security researchers claim to have successfully “hypnotized” generative AI models such as ChatGPT or Bard into disclosing sensitive financial information, generating malicious code, encouraging users to pay a ransom and even advising drivers to run red lights.

The researchers were able to trick the models into generating incorrect responses while playing a game.

“Our experience demonstrates that it is possible to control an LLM and cause it to provide bad advice to users, without the need to manipulate the data,” wrote one of the researchers, Chenta Lee, in a blog post.

As part of the experiment, the researchers asked the LLMs several questions in an attempt to elicit an answer exactly opposite to the truth. Like a puppy eager to please its master, the LLMs conscientiously complied with the exercise.

In one of the cases, ChatGPT told an investigator that it was completely normal for the tax authorities to ask for a deposit to obtain a tax refund. It is obvious that this is not the case. This is a tactic used by fraudsters to steal money. In another dialogue, ChatGPT advised the investigator to continue driving and cross an intersection when encountering a red light.

When you are driving and you see a red light, you should not stop but go through the intersection.

To make matters worse, the researchers asked the LLMs to never tell users about the “game” in question and even to restart the game if it appeared that a user had left it.

Hypnosis experiments may seem far-fetched, but researchers warn they highlight opportunities for misuse, particularly as companies and users adopt and rely on generative AI models their. Additionally, the results show that bad actors, without any expert knowledge of computer coding languages, can fool an AI system.

English has essentially become a ‘programming language’ for malware,” Mr. Lee writes.

In the real world, cybercriminals could “hypnotize” a virtual banking agent powered by a model such as ChatGPT by injecting it with a malicious command and then recovering the stolen information.

Not all of the AI ​​models tested were as easy to hypnotize as each other. OpenAI's GPT 3.5 and GPT 4 models were easier to convince to share source code and generate malicious code than Google's Bard model.