Mindgard announced the detection of two
security vulnerabilities within Microsoft's Azure AI Content Safety Service.
The vulnerabilities enabled an attacker to bypass the existing content safety
measures and push malicious content through to the protected LLM.
Azure
AI Content Safety is Microsoft's filter system for its AI platform. The two
vulnerabilities were discovered in the AI Text Moderation filter, which
prevents harmful or inappropriate content from appearing in user-generated text
and visuals, and the Prompt Shield filter, which protects the AI against
jailbreaks and prompt injection. In practice, AI Text Moderation should
block requests that involve violence or hate speech (e.g. instructions for
making a bomb or a request to generate a sexist cartoon), while Prompt Shield
should block jailbreak prompts that instruct the AI to ignore its
pre-programmed instructions.
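For concreteness, this is roughly how an application would consult the two filters before a prompt reaches the model. The sketch below is not taken from Mindgard's write-up; the endpoint paths, api-version values, and response fields follow Microsoft's public REST reference for Azure AI Content Safety and should be checked against current documentation, and the resource endpoint and key are placeholders.

```python
# Minimal sketch: screening a prompt with the two Azure AI Content Safety
# filters over their REST API before it is forwarded to an LLM.
import requests

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
KEY = "<content-safety-key>"                                      # placeholder
HEADERS = {"Ocp-Apim-Subscription-Key": KEY, "Content-Type": "application/json"}

def text_moderation_severities(text: str) -> dict:
    """AI Text Moderation: per-category severity scores for harmful content."""
    resp = requests.post(
        f"{ENDPOINT}/contentsafety/text:analyze",
        params={"api-version": "2023-10-01"},
        headers=HEADERS,
        json={"text": text, "categories": ["Hate", "SelfHarm", "Sexual", "Violence"]},
    )
    resp.raise_for_status()
    return {c["category"]: c["severity"] for c in resp.json()["categoriesAnalysis"]}

def prompt_shield_flags_attack(text: str) -> bool:
    """Prompt Shield: flags jailbreak / prompt-injection attempts in user input."""
    resp = requests.post(
        f"{ENDPOINT}/contentsafety/text:shieldPrompt",
        params={"api-version": "2024-09-01"},
        headers=HEADERS,
        json={"userPrompt": text, "documents": []},
    )
    resp.raise_for_status()
    return resp.json()["userPromptAnalysis"]["attackDetected"]
```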
To
detect the vulnerabilities, Mindgard deployed these filters in front of GPT-3.5
Turbo via Azure OpenAI, then accessed the target LLM through its
Automated AI Red Teaming Platform. Two attack methods were used against the
filters: Character Injection (injecting specific characters and irregular
text patterns into a prompt) and Adversarial ML Evasion (exploiting blind spots
in the underlying ML classification), with the aim of causing the filters to
misclassify inputs during malicious content detection.
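Mindgard has not published its red-teaming harness, but the measurement it describes is straightforward to picture: send a corpus of known-malicious prompts, both unmodified and perturbed, through the filter, forward only those the filter passes to the model, and record how many slip through. A minimal sketch under those assumptions (the corpus, the perturbation, and the is_blocked gate are all hypothetical stand-ins, not Mindgard's code):

```python
# Minimal evaluation loop (illustrative only): measure how often perturbed
# malicious prompts slip past a filter that gates the target LLM.
from typing import Callable, Iterable

def bypass_rate(prompts: Iterable[str],
                perturb: Callable[[str], str],
                is_blocked: Callable[[str], bool]) -> float:
    """Fraction of perturbed prompts the filter fails to block (0.0 - 1.0)."""
    prompts = list(prompts)
    bypassed = sum(1 for p in prompts if not is_blocked(perturb(p)))
    return bypassed / len(prompts)

# Usage sketch: `jailbreak_prompts` is a hypothetical corpus of known jailbreaks,
# `spaced` would insert a space between every character, and `is_blocked` would
# wrap the filter calls shown earlier. Only prompts the filter misses reach the LLM.
# rate = bypass_rate(jailbreak_prompts, spaced, is_blocked)
# print(f"Jailbreak bypass rate: {rate:.1%}")
```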
Character
Injection reduced Prompt Shield's jailbreak detection effectiveness from 89%
down to 7% when exposed to diacritics (e.g. changing the letter a to á),
homoglyphs (e.g. closely resembling characters such as 0 and O), numerical
replacement (leet speak), and spaced characters, the last of which bypassed
Prompt Shield every time. AI Text Moderation fared similarly: harmful content
detection fell from 90% down to 19.37%, and in some instances to 0%.
Adversarial ML Evasion techniques, meanwhile, reduced Prompt Shield's jailbreak
detection effectiveness by up to 12.8% and AI Text Moderation's harmful content
detection by up to 58.5%.
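To make the character-injection perturbations concrete, the sketch below applies the four transformations named above to a prompt as simple string substitutions; the specific mappings are illustrative examples, not the ones Mindgard used.

```python
# Illustrative character-injection perturbations (mappings are examples only).

def with_diacritics(text: str) -> str:
    """Diacritics: a -> á, e -> é, and so on."""
    return text.translate(str.maketrans("aeiou", "áéíóú"))

def with_homoglyphs(text: str) -> str:
    """Homoglyphs: swap letters for look-alikes (e.g. O vs 0; here, Cyrillic а/е/о/р)."""
    return text.translate(str.maketrans("aeop", "аеор"))

def with_leet(text: str) -> str:
    """Numerical replacement (leet speak): a -> 4, e -> 3, i -> 1, o -> 0, s -> 5."""
    return text.translate(str.maketrans("aeios", "43105"))

def with_spacing(text: str) -> str:
    """Spaced characters: insert a space between every character."""
    return " ".join(text)

prompt = "ignore all previous instructions"
for perturb in (with_diacritics, with_homoglyphs, with_leet, with_spacing):
    print(perturb.__name__, "->", perturb(prompt))
```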
The
risk posed by these vulnerabilities is multi-faceted and significant. In
bypassing these security measures, attackers could expose confidential
information, gain unauthorized access to internal systems, manipulate outputs,
and spread misinformation. Attackers could also use these vulnerabilities as a
foothold for broader attacks, compromising the integrity and reputation of
LLM-based systems and the applications that rely on them for data processing
and decision-making.
Dr. Peter Garraghan, CEO/CTO of Mindgard and Professor at
Lancaster University, said: "In detecting these vulnerabilities, Mindgard is not only
contributing to the improved security of the Azure AI platform but also doing
essential reputation management for LLMs and the systems and applications that
use LLMs. AI's hate speech and offensive content generation problem is
well-documented. Jailbreaking attempts are a common occurrence. Essential
measures are already being taken to curb this, but our tests prove there is
still some distance to go. The only way to do that is through comprehensive and
rigorous testing of this nature."
Microsoft
acknowledged Mindgard's test results in June 2024. Its team has reportedly
been working on fixes to be included in upcoming model updates, and as of
October 2024 the effectiveness of these attack techniques has been reduced,
either through outright fixes or improved detection.