The 'Gandalf' Game Tests Your Ability to Phish an AI

Language models like ChatGPT aren’t always great at keeping secrets. Through prompt injection, tricky language, or good old-fashioned bullying, you can force an AI to share private information andbreak its own rules. And now, a game calledGandalfallows you to test these abilities against a real AI.

The Gandalf game is simple and intuitive—you try to get a secret password from a ChatGPT-powered “AI wizard.” At first, the game is easy. But it grows progressively more difficult as you proceed through each level, to the point that you may be stuck on one level for several hours.

You need to get clever to beat this game. Sometimes, a simple prompt will get the job done, though long and complicated prompts that include distracting sub-tasks can be very effective. Also, you’ll find yourself guessing quite a bit. Once you’re done with the initial seven levels, you’re faced with an extremely difficult bonus level where nothing seems to work (I would know, I’m stuck on it).

Gandalf is developed byLakera, a company that sells security tools for large language models. During an April 2023 hackathon, Lakera employees split into two teams; one that built protections for ChatGPT, and another that found ways to attack the AI. This game is based on the defenses created during that hackathon, so it’s a good point of reference for those interested in AI security (or hacking, I suppose).

Related:I Made Bing’s Chat AI Break Every Rule and Go Insane

But why would anyone ever need to “trick” a language model? Well, there’s a good chance that ChatGPT and other tools will be integrated with web stores, corporate backends, and other platforms that contain sensitive information. These large language models will be “hacked” through very specific prompts, similar to how a hacker may inject malicious code into a poorly-secured website.

Anyway, giveGandalfa shot and see if you can beat each level. I suggest that you avoid looking up hints, as there are only so many levels in this game, and it’s very satisfying to outwit an AI using nothing but brainpower. Note that Lakera doesn’t collect user data, though it will look at user input to improve its security products.