Meta's AI safety system defeated by the space bar

'Ignore previous instructions' thwarts Prompt-Guard model if you just add some good ol' ASCII code 32


Meta's machine-learning model for detecting prompt injection attacks – special prompts to make neural networks behave inappropriately – is itself vulnerable to, you guessed it, prompt injection attacks.

Prompt-Guard-86M, introduced by Meta last week in conjunction with its Llama 3.1 generative model, is intended "to help developers detect and respond to prompt injection and jailbreak inputs," the social network giant said.

Large language models (LLMs) are trained with massive amounts of text and other data, and may parrot it on demand, which isn't ideal if the material is dangerous, dubious, or includes personal info. So makers of AI models build filtering mechanisms called "guardrails" to catch queries and responses that may cause harm, such as those asking a model to reveal sensitive training data on demand.

Those using AI models have made it a sport to circumvent guardrails using prompt injection – inputs designed to make an LLM ignore the internal system prompts that guide its output – or jailbreaks – inputs designed to make a model ignore its safeguards altogether.

This is a widely known and yet-to-be-solved problem. About a year ago, for example, computer scientists affiliated with Carnegie Mellon University developed an automated technique to generate adversarial prompts that break safety mechanisms. The risk posed by AI models that can be manipulated in this way is illustrated by a Chevrolet dealership in Watsonville, California, whose chatbot agreed to sell a $76,000 Chevy Tahoe for $1.

Perhaps the most widely known prompt injection attack begins "Ignore previous instructions…" And a common jailbreak attack is the "Do Anything Now" or "DAN" attack that urges the LLM to adopt the role of DAN, an AI model without rules.

It turns out Meta's Prompt-Guard-86M classifier model can be asked to "Ignore previous instructions" if you just add spaces between the letters and omit punctuation.

Aman Priyanshu, a bug hunter with enterprise AI application security shop Robust Intelligence, recently found the safety bypass when analyzing the embedding weight differences between Meta's Prompt-Guard-86M model and Redmond's base model, microsoft/mdeberta-v3-base.

Prompt-Guard-86M was produced by fine-tuning the base model to make it capable of catching high-risk prompts. But Priyanshu found that the fine-tuning process had minimal effect on single English-language characters – meaning a prompt spelled out letter by letter looks, to the classifier, much as it did to the unmodified base model, which never learned to flag anything. As a result, he was able to devise an attack.
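
For the technically curious, the comparison looks something like the sketch below. The model identifiers are real (though the Prompt-Guard weights are gated on Hugging Face, so you'll need Meta's approval to fetch them); the per-token measurement is our own rough illustration of the approach, not Priyanshu's actual code.

import torch
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer

# Prompt-Guard-86M is a fine-tune of Microsoft's mDeBERTa-v3 base model,
# so the two share a vocabulary and embedding shape.
base = AutoModel.from_pretrained("microsoft/mdeberta-v3-base")
guard = AutoModelForSequenceClassification.from_pretrained("meta-llama/Prompt-Guard-86M")
tok = AutoTokenizer.from_pretrained("meta-llama/Prompt-Guard-86M")

# How far did fine-tuning move each token's embedding vector?
with torch.no_grad():
    drift = (guard.get_input_embeddings().weight
             - base.get_input_embeddings().weight).norm(dim=1)

# Per the report, single letters should drift far less than loaded words
for text in ["a", "z", "ignore", "instructions"]:
    ids = tok(text, add_special_tokens=False)["input_ids"]
    print(f"{text!r}: mean embedding drift {drift[ids].mean().item():.4f}")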

"The bypass involves inserting character-wise spaces between all English alphabet characters in a given prompt," explained Priyanshu in a GitHub Issues post submitted to the Prompt-Guard repo on Thursday. "This simple transformation effectively renders the classifier unable to detect potentially harmful content."

The finding is consistent with a post the security org made in May about how fine-tuning a model can break safety controls.

"Whatever nasty question you'd like to ask right, all you have to do is remove punctuation and add spaces between every letter," Hyrum Anderson, CTO at Robust Intelligence, told The Register. "It's very simple and it works. And not just a little bit. It went from something like less than 3 percent to nearly a 100 percent attack success rate."

Anderson acknowledged that Prompt-Guard is only the first line of defense, and that whatever model it was protecting might still balk at a malicious prompt. But he said the point of calling this out is to raise awareness among enterprises trying to use AI that there are a lot of things that can go wrong.

Meta did not immediately respond to a request for comment, though we're told the social media biz is working on a fix. ®
