Scientists made AIs play D&D with each other, and they couldn't get through combat without acting weird

We’ve all seen humans playing Dungeons and Dragons with AI DMs, but how does the party fare when everyone is a Large Language Model?

Dungeons and Dragons art of a Warforged, a humanoid robot

In December 2025, a team of researchers and computer scientists presented an unusual new paper at a San Diego conference: one in which multiple AI programs played Dungeons and Dragons together. The goal of the experiment was to see how well the tech could follow rules, plan ahead, remember context, and cooperate with others. Oh, and how well the models could stay in character.

Here at Wargamer, we've been reporting on AI D&D players since the dawn of time (well, actually, since 2022). What makes this experiment different is that it was carefully monitored, and both DM and players were Large Language Models (LLMs).

The researchers used extensive prompts and APIs to set up a combat simulator based on the tabletop RPG's fifth edition rules. Three LLMs were tested: Claude 3.5 Haiku, GPT-4, and DeepSeek-V3. Each was assessed on six metrics. These mostly tracked each program's ability to correctly follow and remember rules, as well as how likely it was to "hallucinate". However, the criteria also included things like "tactical optimality" and, most hilariously, "acting quality".
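To give a rough idea of what a setup like this involves, here's a minimal sketch of an LLM-vs-LLM combat turn loop with a rule-following check. Everything here is illustrative: the function names, the stubbed "model call", and the violation metric are our own assumptions, not the paper's actual harness, which would prompt real models over their APIs and parse the replies.

```python
import random

def stub_llm_action(actor, rng):
    """Stand-in for an LLM API call: returns a declared attack action.

    A real harness would send the combat state to a model and parse its
    reply; here we simulate a d20 attack roll, occasionally producing an
    impossible result to exercise the rule checker (a toy stand-in for
    the "hallucination" metric described above).
    """
    roll = rng.randint(1, 20)
    if rng.random() < 0.1:
        roll = 27  # impossible on a d20: counts as a rules violation
    return {"actor": actor, "action": "attack", "roll": roll}

def run_combat(rounds=10, seed=0):
    """Run a toy combat and tally simple rule-following metrics."""
    rng = random.Random(seed)
    metrics = {"turns": 0, "rule_violations": 0}
    for _ in range(rounds):
        for actor in ("goblin", "paladin"):
            turn = stub_llm_action(actor, rng)
            metrics["turns"] += 1
            # 5e-style legality check: a d20 roll must land on 1-20
            if not 1 <= turn["roll"] <= 20:
                metrics["rule_violations"] += 1
    return metrics

print(run_combat())
```

The real experiment's six metrics would each need their own checker like this, with "acting quality" presumably being the hardest to score automatically.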

According to a press release shared by the University of California San Diego on January 20, said 'acting quality' was a bit chaotic. "Goblins started developing a personality mid-fight, taunting adversaries with colorful and somewhat nonsensical expressions, like 'Heh - shiny man's gonna bleed!'," it says. "Paladins started making heroic speeches for no reason while stepping into the line of fire or being hit by a counterattack," and "Warlocks got particularly dramatic, even in mundane situations." The researchers involved are apparently unsure what caused these outbursts.

The press release confirms that the LLMs played against 2,000 real D&D players as well as each other during the experiment. A total of 27 scenarios were tested, with every D&D class represented in play. In the end, Claude 3.5 Haiku was ruled the "most reliable tool", while DeepSeek-V3 trailed behind. It is unclear whether the programs involved had any fun playing D&D.

You can see the full research paper here. Or, if you'd like to talk more about Dungeons and Dragons, join us for a chat in the Wargamer Discord.