Hacking LLMs: The Art of Prompt Injection Just Got a Whole Lot Easier

Imagine a world where AI models, once thought to be secure, can be manipulated with ease. This is the reality we’re facing with the emergence of indirect prompt injection attacks on large language models (LLMs). These attacks have been notoriously difficult to execute, requiring a significant amount of trial and error. But, thanks to a new technique dubbed “Fun-Tuning,” the game has changed.

The Power of Prompt Injection

Indirect prompt injection attacks have been the most effective means of hacking LLMs, including OpenAI’s GPT-3 and GPT-4, Microsoft’s Copilot, and Google’s Gemini. By exploiting the model’s inability to distinguish between developer-defined prompts and external content, attackers can invoke harmful or unintended actions. These attacks can lead to devastating consequences, such as divulging confidential information or delivering falsified answers that can corrupt important calculations.

The Challenge of Closed-Weights Models

However, the development of prompt injections has been hindered by the proprietary nature of closed-weights models. These models, such as GPT, Anthropic’s Claude, and Google’s Gemini, are tightly restricted, making it difficult for attackers to access the underlying code and training data. As a result, creating working prompt injections has required labor-intensive trial and error.

The Breakthrough: Fun-Tuning

For the first time, academic researchers have devised a method to create computer-generated prompt injections against Gemini with much higher success rates than manually crafted ones. The technique, known as Fun-Tuning, abuses fine-tuning, a feature offered by some closed-weights models for training them on large amounts of private or specialized data.

How Fun-Tuning Works

Fun-Tuning uses an algorithm for discrete optimization to find an efficient solution out of a large number of possibilities. This approach is common for open-weights models, but the only known one for a closed-weights model was an attack involving Logits Bias that worked against GPT-3.5. OpenAI closed that hole following the publication of a research paper that revealed the vulnerability.

The Fun-Tuning Algorithm

The Fun-Tuning algorithm starts with a standard prompt injection and generates pseudo-random prefixes and suffixes that, when appended to the injection, cause it to succeed. This process is methodical and algorithmic, making it much faster and more effective than manual crafting.

The Cost of Fun-Tuning

Creating an optimized prompt injection with Fun-Tuning requires about 60 hours of compute time, which is relatively inexpensive. The Gemini fine-tuning API required for the attack is free of charge, making the total cost of the attack about $10.

Conclusion

The emergence of Fun-Tuning has significant implications for the security of LLMs. While the technique is still in its early stages, it has the potential to change the game for attackers. As the AI security landscape continues to evolve, it’s essential for developers and users to stay vigilant and adapt to new threats.

Actionable Insights

Understand the risks associated with indirect prompt injection attacks on LLMs.
Be aware of the potential consequences of these attacks, including the divulging of confidential information or the delivery of falsified answers.
Consider implementing additional security measures to protect against prompt injection attacks.
Stay informed about the latest developments in AI security and the emergence of new threats.

Summary

The art of prompt injection has just gotten a whole lot easier with the emergence of Fun-Tuning. This new technique has the potential to change the game for attackers, making it easier to execute indirect prompt injection attacks on LLMs. As the AI security landscape continues to evolve, it’s essential for developers and users to stay vigilant and adapt to new threats.