Can vibe coding with LLMs beat CS students?
Short answer: Nope. Not even close.
We introduce a real-world optimization benchmark, where agents manage the pickup and delivery of parcels for a logistics company. Each agent has to bid optimally for tasks sold at auction, and plan the pickup and delivery of the tasks it wins.
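To make the setup concrete, below is a minimal Python sketch of the bidding side of such an agent: it bids the marginal cost that a new task would add to its current route, inflated by a profit margin. The names, the Euclidean cost model, and the deliberately naive route estimate are illustrative assumptions, not the benchmark's actual API.

import math

def dist(a, b):
    # Euclidean distance between two cities represented as (x, y) points.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def naive_route_cost(depot, tasks, cost_per_km=1.0):
    # Deliberately weak route estimate: serve tasks in the given order,
    # visiting each pickup and then its delivery. Only here to define a cost.
    pos, total = depot, 0.0
    for pickup, delivery in tasks:
        total += dist(pos, pickup) + dist(pickup, delivery)
        pos = delivery
    return total * cost_per_km

def bid(depot, won_tasks, new_task, margin=1.10):
    # Marginal-cost bidding: quote the extra cost the new task adds to the
    # current plan, plus a profit margin.
    current = naive_route_cost(depot, won_tasks)
    with_new = naive_route_cost(depot, won_tasks + [new_task])
    return (with_new - current) * margin

# Example: one task already won, bidding on a newly auctioned one.
depot = (0.0, 0.0)
won = [((1.0, 0.0), (2.0, 0.0))]
offered = ((2.0, 1.0), (0.0, 1.0))
print(round(bid(depot, won, offered), 2))  # -> 3.3

Competitive agents go well beyond this (e.g., estimating opponents' costs to shade their bids); the sketch only fixes the vocabulary for the results below.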
In a massive tournament of ~40,000 matches, we evaluated 40 LLM-coded agents (generated by state-of-the-art models: OpenAI's GPT-5 Thinking, Google's Gemini 2.5 Pro, Anthropic's Claude Opus 4.1, and DeepSeek's DeepThink R1) against 17 human-coded agents developed before the advent of LLMs, including 12 agents developed by students.
The results?
🏆 Humans dominated.
The top 5 spots are consistently won by human-coded agents.
🤖 LLMs fell short.
33 of 40 LLM-coded agents lost to simple baseline strategies.
Given the best human solution as input and prompted to improve upon it, the best-performing LLM makes the solution significantly worse instead of improving it.
🎯 Takeaways:
📌 While LLMs can generate code that runs (i.e., free of syntax errors), the resulting solutions are not competitive with human-designed ones on dimensions such as strategic planning, reasoning, optimization, and multi-agent competition.
📌 It's time for a new frontier of benchmarks that go beyond passing unit tests, shifting the goal from code that compiles to code that competes.
📌 For problems requiring optimization and planning, LLMs can accelerate implementation, but humans should not be left out of the loop. Additionally, code refactoring without human guidance might degrade performance.
The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation, but their development has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (the Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (produced by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and ~40k matches demonstrate (i) a clear superiority of the human-coded agents (written by graduate students): the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as input and prompted to improve upon it, the best-performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.
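To complement the bidding sketch above, here is a minimal Python sketch of the other ingredient the benchmark asks for, a capacity-constrained pickup-and-delivery planner, implemented as greedy cheapest insertion. The Task class, the Euclidean cost model, and the capacity check are assumptions made for illustration; they are not taken from the benchmark code, and competitive agents would use a much stronger planner (e.g., local search over insertion orders).

import math
from dataclasses import dataclass

@dataclass
class Task:
    pickup: tuple      # (x, y) of the pickup city
    delivery: tuple    # (x, y) of the delivery city
    weight: float      # load the task adds while it is on board

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def route_length(depot, stops):
    # stops is a list of (city, load_delta) pairs visited in order.
    pos, total = depot, 0.0
    for city, _ in stops:
        total += dist(pos, city)
        pos = city
    return total

def feasible(stops, capacity):
    # The carried load must never exceed the vehicle capacity.
    load = 0.0
    for _, delta in stops:
        load += delta
        if load > capacity:
            return False
    return True

def insert_task(depot, stops, task, capacity):
    # Greedy cheapest insertion: try every pickup/delivery position pair,
    # keep the cheapest feasible route, or return None if none fits.
    best = None
    for i in range(len(stops) + 1):
        for j in range(i + 1, len(stops) + 2):
            candidate = list(stops)
            candidate.insert(i, (task.pickup, task.weight))
            candidate.insert(j, (task.delivery, -task.weight))
            if not feasible(candidate, capacity):
                continue
            cost = route_length(depot, candidate)
            if best is None or cost < best[0]:
                best = (cost, candidate)
    return best[1] if best else None

# Example: two tasks that cannot be on board at the same time, so the
# planner has to sequence them instead of interleaving the pickups.
depot, capacity = (0.0, 0.0), 10.0
stops = []
for t in (Task((1, 0), (2, 0), 6.0), Task((2, 1), (0, 1), 6.0)):
    stops = insert_task(depot, stops, t, capacity)
print(stops)

Marginal-cost bidding reuses the same machinery: the bid for a newly auctioned task is, roughly, the increase in route length that insert_task would cause.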
We are grateful to Professor Emeritus Boi Faltings and the members of the Artificial Intelligence Laboratory at EPFL who contributed to the development of the Intelligent Agents course over the years. We also thank the students who contributed their code as baselines, specifically: Clement Burgelin, Jonas Ulbrich ("Student 3"), Lorenzo Jacqueroud, Alessandro Mari, Antoine Bourret, Samuel Dubuis, Oliver Facklam, Florian Singer, Manuel Leone, Stefano Huber, Wenuka Gunarathna, Sofia Dandjee, Karim Assi, Eugenie Demeure, Lorenzo Panchetti, Maxence Perret.
@inproceedings{danassis2025vibecodingtournament,
title={Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning},
author={Panayiotis Danassis and Naman Goel},
year={2026},
booktitle={Proceedings of the 25th International Conference on Autonomous Agents and MultiAgent Systems},
series={AAMAS '26},
url={https://arxiv.org/abs/2511.20613},
}