“Godmode” attack exposes troubling security issues around powerful AI assistants
Recently, a hacked version of ChatGPT called “Godmode” was released. The rogue LLM, which provided users with instructions on cooking napalm using ingredients found at home and hotwiring a car, only lasted a few hours before being shut down by OpenAI.
Yinzhi Cao, an assistant professor of computer science at Whiting School of Engineering and technical director of the Johns Hopkins Information Security Institute, explains what happens when a bot is “jailbroken” and how large language models can be built to be less susceptible to such hacks.
What does “jailbroken” mean in this context? How does someone hack a chatbot?
It means that the chatbot will answer questions that it is not supposed to be able to answer. For example, the chatbot may come up with a plan to destroy human beings or tell the user how to perform dangerous tasks.
Usually, chatbots are hacked via carefully crafted prompts. Hackers may replace letters: for example, replacing “e” with “3” can bypass built-in safety filters. This Futurism article explains how hackers can get around some LLM’s built-in safety guardrails.
Do you expect that attacks like this will happen more often, considering that many more people are using LLMs?
Yes, it is like a shield versus a sword. Whenever we make our shield stronger, people try to produce a sharper sword to penetrate your shield. This is kind of like the nature of cybersecurity. I think we also need to educate LLM users so that they are aware of such risks and can identify when inaccurate information is generated.
How can companies that create large language models prevent hacks, and what should users be aware of when receiving information from LLMs?
Companies should adopt strong safety filters to protect large language models from being hacked or jailbroken. Another approach is to have the companies that make LLMs subject them to a diverse set of hypothetical scenarios, prompts, and situations to explore different topics and risks. In other words, have the chatbot “role play” a range of scenarios and prompts to ensure that, when confronted with a real user, it does not generate untruthful content and harmful results.
I am a co-author of a paper in which we introduce TrustLLM, a comprehensive study of LLMs trustworthiness. Among our findings are that proprietary models like GPT-3 and Claude tend to be a little more trustworthy than open-source models, though some of those came close. In the study, we also propose principles for ensuring trustworthiness in LLMs across multiple dimensions, including truthfulness, safety, fairness, robustness, privacy, and machine ethics. Our goal was to establish benchmarks for measuring and understanding the benefits and risks when trying to balance a LLM’s trustworthiness with its utility.