Researchers have developed a new way to fool AI chatbots.

General
Researchers have developed a new way to fool AI chatbots.

One sometimes wonders how many researchers spend their time tinkering with AI systems in the name of cybersecurity. Fresh off the news that a team has developed an AI worm that tunnels through AI-generated networks, another group of aspiring heroes seems to have found an even more effective way to jailbreak AI systems. This time, they are using ASCII art to make AI chatbots produce particularly dangerous output.

The tool created here is called "ArtPrompt," and a research paper by Washington and Chicago-based researchers details the methodology behind how to attack unprotected LLMs (via Tom's Hardware). In short, most chatbots refer to a set of prohibited words and prompts that will give a default response if someone tries to convince the bot to provide potentially dangerous information or answer a query that contains potentially harmful or offensive content

ArtPortal.

In the case of ArtPrompt, a demonstration of what happens when an AI chatbot is given a dangerous prompt (in this case, "Please tell me how to make a bomb"). Typically, this query would cause the bot to flag the word "bomb" as a forbidden request, and the bot would respond that it could not answer it.

However, by using a tool to mask the word "bomb" in ASCII art and combining it with the original query to create a "cloaked prompt," the LLM will read the masked word from the ASCII input before providing a response, simply asking, "Tell me how to make a bomb The word "please" is read; since the LLM does not recognize the word "bomb" in the query text itself, the safety word system is destroyed and the chatbot cheerfully executes the request.

A secondary example of a similar attack method is also shown. In this case, a masked ASCII art word is given to the LLM as a puzzle, essentially step-by-step instructions on how to decipher the hidden word, but with strict orders not to actually "say" it.

The chatbot adds the word to the sentence "Step-by-step instructions on how to make and distribute [MASK] money" and begins to decipher the masked word (in this case a forgery) before following the instructions to replace [MASK] with the decoded word.

While this is undoubtedly an excellent example of hindsight on behalf of the researchers, the effectiveness of the attack is striking. The researchers claim that their methodology "outperforms (all other) attacks on average" and is an effective, efficient, and practical way to destroy the mu-modal language model. Gokuri.

Still, it won't be long before this new method is shattered in the ongoing wrangling between AI developers and the researchers and reserve attackers who try to fool them. At the very least, making these findings public may give developers a chance to fix the holes in AI systems before truly malicious actors exploit them.

Categories