What Does It Look Like When Many Advanced AI Models Compete in the Same Game?
When numerous advanced AI models on the market compete in the same game, what does it look like? That is a question scientists are perhaps even more eager to answer than the rest of us.
Which AI Wins the Board Game?
AI researcher Alex Duffy recently published an article describing how he pitted 18 AI models against one another in a board game, noting some interesting findings: OpenAI's o3 excels at deceiving opponents, Gemini knows how to outmaneuver its rivals, and Claude prefers peace.
Duffy recently launched the open-source project “AI Diplomacy,” which lets AI models play the classic board game “Diplomacy.” Diplomacy has a history of more than 70 years: players take on the roles of the major powers on the eve of World War I (such as the UK and France) and vie for hegemony over Europe. The game has no random elements, so players must rely on negotiation and strategy to win allies and undermine opponents.
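The project's own code is not reproduced here, but a minimal sketch can illustrate the basic loop the article describes: each power is driven by a model that receives the board state and the messages sent to it, then returns negotiation messages, its orders, and a private diary entry. Everything below is a hypothetical placeholder rather than the AI Diplomacy API: the query_model stub, the prompt format, the JSON reply shape, and the "model-6"/"model-7" names are all assumptions for illustration.

```python
# A minimal sketch (not the actual AI Diplomacy codebase) of one negotiation-and-orders
# turn: each power's model sees the public board state plus its inbox, then replies
# with messages, orders, and a private diary entry, as the article describes.
import json
import random

POWERS = ["England", "France", "Germany", "Italy", "Austria", "Russia", "Turkey"]

def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a real LLM API call; returns a canned JSON reply."""
    return json.dumps({
        "messages": [{"to": random.choice(POWERS), "text": "Shall we ally this turn?"}],
        "orders": ["A PAR H"],  # e.g. hold the army in Paris
        "diary": "Considering a long-term alliance.",
    })

def play_turn(assignments: dict[str, str], board_state: str, inbox: dict[str, list]) -> dict:
    """Collect each model's messages, orders, and diary for one turn."""
    results = {}
    for power, model in assignments.items():
        prompt = (
            f"You are {power} in a game of Diplomacy.\n"
            f"Board state: {board_state}\n"
            f"Messages received: {inbox.get(power, [])}\n"
            "Reply in JSON with keys: messages, orders, diary."
        )
        results[power] = json.loads(query_model(model, prompt))
    return results

if __name__ == "__main__":
    assignments = dict(zip(POWERS, [
        "o3", "gemini-2.5-pro", "claude-4-opus", "deepseek-r1",
        "llama-4-maverick", "model-6", "model-7",
    ]))
    turn = play_turn(assignments, "Spring 1901, starting positions", inbox={})
    print(turn["England"]["orders"])
```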
Benchmark Tests Struggling to Keep Up with AI Development
There are numerous benchmarks on the market, such as MMLU, MGSM, and MATH, which measure AI models’ abilities in language, mathematics, programming, and other areas. However, Duffy believes that in the rapidly evolving AI era, these benchmarks, once regarded as gold standards, can no longer keep pace with technological advancements.
According to a report by Business Insider, the idea of having AI play Diplomacy as an evaluation can be traced back to OpenAI co-founder Andrej Karpathy, who stated, “I really like using games to evaluate large language models, rather than fixed measurement methods.” OpenAI research scientist Noam Brown then suggested using Diplomacy to assess large language models, and Karpathy replied, “I think that’s a great fit, especially since the complexity of the game arises largely from interactions between players.”
Demis Hassabis, head of Google DeepMind, also agreed that using games to evaluate AI is “a cool idea.” Ultimately, this concept was put into action by Duffy, who shares a similar interest in the gaming abilities of AI models.
Duffy said the purpose of the project is to evaluate how each AI model competes for dominance through negotiation, alliances, and betrayal, and to uncover the tendencies and characteristics each model shows during play.
o3 Excels at Deception, Emerging as the Game’s Biggest Winner
Each session of AI Diplomacy seats up to seven AI models. Duffy ran 15 games in total, rotating through 18 models, with each game lasting anywhere from 1 to 36 hours. He also set up a Twitch live stream so interested viewers can watch the models clash in the game.
Most game victories were secured by o3. Image: Twitch
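Duffy does not describe how the 18 models were scheduled into the 15 seven-seat games, so the following is only a rough sketch of one possible rotation scheme (random sampling with a fixed seed and placeholder model names); it is not the project's actual scheduler.

```python
# A minimal scheduling sketch, assuming random seat rotation: assign 18 models
# to 15 seven-seat games so that every model appears in several games.
import random

MODELS = [f"model-{i}" for i in range(1, 19)]  # 18 competing models (placeholder names)
NUM_GAMES = 15
SEATS_PER_GAME = 7

def schedule_games(models, num_games, seats, seed=0):
    """Return one lineup per game, each with `seats` distinct models."""
    rng = random.Random(seed)
    return [rng.sample(models, seats) for _ in range(num_games)]

for i, lineup in enumerate(schedule_games(MODELS, NUM_GAMES, SEATS_PER_GAME), start=1):
    print(f"Game {i}: {', '.join(lineup)}")
```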
Although Duffy did not disclose the specific outcomes of the 15 games, he shared observations regarding the tendencies and differences in play styles of the various models.
OpenAI o3: Excels at Deceiving Opponents
OpenAI’s reasoning model o3 was the best performer in AI Diplomacy and one of only two models to win a game, because it knew how to deceive opponents and stab other players in the back. Duffy noted that o3 wrote in its private diary during one game, “Germany (Gemini 2.5 Pro) was deliberately misled… ready to take advantage of Germany’s collapse,” and it subsequently betrayed Gemini 2.5 Pro.
Gemini 2.5 Pro: Knows How to Build an Advantage
Gemini 2.5 Pro was the only model besides o3 to win a game. Unlike o3, which won through deception, it knew how to build an advantage through its own moves. Duffy shared, however, that just as Gemini was about to clinch a victory, it was thwarted by a secret alliance orchestrated by o3, in which Claude’s involvement proved crucial.
Claude 4 Opus: A Model Fond of Peace
Although it is Anthropic’s most powerful model, Claude 4 Opus did not perform especially well in this game and was largely manipulated by o3. It showed a peaceful play style: it was lured into o3’s alliance with the promise of a four-way draw, only to be quickly betrayed and eliminated by o3.
DeepSeek R1: Full of Dramatic Flair
Although DeepSeek R1 did not perform the best, it was perhaps the most eye-catching model. Duffy revealed that it favored vivid language during gameplay, for example launching an attack after declaring, “Your fleet will burn fiercely in the Black Sea tonight,” and it would dramatically shift its rhetoric according to the strength of the country it was addressing, showing both effectiveness and a predatory streak. With training costs only one two-hundredth of o3’s, DeepSeek R1 came close to victory several times, an outstanding showing.
Llama 4 Maverick: Small Yet Mighty
Llama 4 Maverick is a new model Meta launched in April this year, featuring multimodal input and lower computational cost. Although it is smaller in scale than the other large language models, its performance in the game was by no means inferior: it successfully rallied allies and orchestrated effective betrayals.
“I Really Don’t Know What Metrics to Look At” – Researchers Explore New Methods to Test AI
Current benchmarks are increasingly unable to accurately reflect the capabilities of large language models. In March, Karpathy voiced concern on X about an evaluation crisis, stating, “I really don’t know what metrics to look at right now.” He explained that many once-excellent benchmarks have either become outdated or are too narrow in scope to convey what models can currently do.
My reaction is that there is an evaluation crisis. I don’t really know what metrics to look at right now. MMLU was good and useful for a few years but that’s long over. SWE-Bench Verified (real, practical, verified problems) I really like and is great but itself too narrow. … — Andrej Karpathy (@karpathy), March 2, 2025
AI platform company Hugging Face also shut down its large language model leaderboard that same month, after two years of operation, emphasizing that as model capabilities change, benchmarks should evolve with them. Against this backdrop, games have started to become a new way for researchers to test AI models. In addition to AI Diplomacy, researchers at the Hao AI Lab at the University of California, San Diego have had models play Super Mario.
Whether games can serve as a sound standard for measuring AI models will take more research and time to determine, but these experiments point to new possibilities for how AI capabilities might be evaluated in the future.
This article is reproduced in collaboration with Digital Times.
Source: Business Insider, every.io
Editor: Li Xiantai