A PLeak in the System: AI’s secrets aren’t so safe

February 26, 2025

Researchers’ new tool exposes how easily AI’s hidden instructions can be extracted

Hopkins engineers have developed a method to uncover the hidden instructions that guide AI-powered applications—a breakthrough that could lead to the design of more resilient systems less vulnerable to manipulation.

Yinzhi Cao.

Led by Yinzhi Cao, technical director of the Johns Hopkins University Information Security Institute and an associate professor in the Whiting School of Engineering’s Department of Computer Science, the researchers created PLeak, a tool that extracts these behind-the-scenes directions, which are usually kept secret to protect an application’s functionality and intellectual property. By exposing these vulnerabilities, the researchers are shedding light on security risks and paving the way for stronger safeguards in AI system design.

“Our work shows that the recent advancement and wide deployment of Large Language Models (LLMs) are accompanied by potential security and privacy concerns,” explains Cao. “That is, LLMs may be tricked to leak confidential information in the system prompt.” The team presented its findings at CCS ’24: Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security last fall.

Large language model applications depend on input from both user queries and system prompts to tell the backend what tasks to perform. Developers keep these system prompts confidential to protect their intellectual property. However, attackers can exploit vulnerabilities to extract these prompts, compromising the developer’s IP.

PLeak works by strategically asking LLMs questions that make them gradually reveal their hidden instructions, allowing users to uncover the hidden prompts behind the system.

Cao and his colleagues evaluated PLeak in real-world LLM applications and offline settings.

“PLeak significantly outperforms prior works using manually-curated adversarial queries as well as optimized adversarial queries that are adapted from prior jailbreaking attacks,” the team wrote in the paper.

In the future, Cao says that the research team plans to explore more advanced attacks such as leaks via multiple prompts, and defenses like quantifying leaked information.

Categories:

JHU Information Security Institute