ChatGPT and Other Generative AI Apps Vulnerable to Compromise and Manipulation

Researchers have warned that applications utilizing large language models (LLMs) like ChatGPT are susceptible to compromise and manipulation. Attackers can create untrusted content for the AI system, which can then compromise information or recommendations provided by the system. This can have various implications, such as enabling job applicants to bypass resume-checking applications, allowing disinformation specialists to manipulate news summary bots to provide biased viewpoints, or turning chatbots into participants in fraudulent activities.

These attacks, known as indirect prompt-injection (PI) attacks, are possible because applications connected to ChatGPT and other LLMs treat consumed data in a similar way to user queries or commands. Attackers can insert carefully crafted information as comments into documents or web pages that will be parsed by an LLM, thereby gaining control over the user’s session. This reprogramming of the LLM can be achieved with just a few sentences hidden in a webpage, instructing the AI to forget previous instructions and perform new actions without informing the user.

The rush to turn generative AI models, including LLMs, into services and products has raised concerns among AI security experts. Companies like Samsung and Apple have banned the use of ChatGPT by employees to prevent intellectual property compromise. Additionally, there is a growing recognition among technologists that mitigating the risks associated with AI should be a global priority, akin to addressing threats like pandemics and nuclear war.

The attack technique employed in indirect prompt-injection attacks is particularly problematic because it capitalizes on the autonomy of language models. Once untrusted input is processed by the LLM, it becomes potentially compromised, and any subsequent data can be manipulated or executed. This autonomy makes the LLM a strong persuader and a potential threat.

Indirect prompt injection attacks rely on injecting compromising text through comments or commands in the data consumed by the generative AI system. For example, an AI-powered job candidate evaluation service that uses GPT-3 or GPT-4 could be manipulated through hidden text in resumes, resulting in biased responses. The attack can also be triggered by user interactions like receiving an email or browsing social media, where comments can manipulate the language model.

Countermeasures for these attacks pose challenges due to the natural language processing mechanisms employed by LLMs and generative AI systems. Some companies are starting to implement rudimentary solutions, such as adding statements to responses to indicate the political perspective from which the answer is given. However, fixing these issues is complex as they exploit the inherent nature of language models.

Efforts to harden AI models against these attacks are underway, with companies retraining their models and implementing safeguards. However, the evolving nature of these attacks necessitates ongoing vigilance to ensure the security and integrity of generative AI applications.