Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning

University of Southampton, University of Oxford

AAMAS 2026: Proceedings of the 25th International Conference on Autonomous Agents and MultiAgent Systems (To Appear)

TL;DR

Can vibe coding with LLMs beat CS students?
Short answer: Nope. Not even close.

We introduce a real-world optimization benchmark in which agents manage the pickup and delivery of parcels for a logistics company. Each agent has to bid optimally for tasks sold at auction, and plan the pickup and delivery of the tasks it wins.

In a massive tournament of ~40,000 matches, we evaluated a range of SOTA LLM-coded agents (using OpenAI's GPT-5 Thinking, Google's Gemini 2.5 Pro, Anthropic's Claude Opus 4.1, and DeepSeek's DeepThink R1) against 17 human-coded agents developed before the advent of LLMs, including 12 agents developed by students.

The results?

🏆 Humans dominated.
The top 5 spots are consistently won by human-coded agents.

🤖 LLMs fell short.
33 of 40 LLM-coded agents lost to simple baseline strategies.
Given the best human solution as input and prompted to improve upon it, the best-performing LLM made the solution significantly worse instead of improving it.

🎯 Takeaways:

📌 While LLMs can generate code that runs (i.e., free of syntax errors), their solutions are not competitive with human-designed ones on dimensions such as strategic planning, reasoning, optimization, or multi-agent competition.

📌 It's time for a new frontier of benchmarks that go beyond passing unit tests, shifting the goal from code that compiles to code that competes.

📌 For problems requiring optimization and planning, LLMs can accelerate implementation but humans should not be left out of the loop. Additionally, code refactoring with no human guidance might degrade performance.

Prompt Comparison
Traditional benchmarks (top) focus on problems with clearly defined correct or incorrect solutions, typically verified through unit tests. In contrast, our benchmark (bottom) involves complex tasks such as planning, constraint optimization, modeling competitors, competitive strategy design, and advanced algorithm development — challenges that remain highly non-trivial even for experienced software engineers. Top from Jain et al., 2025 (left) and Huynh and Lin, 2025 (right).

Abstract

The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (the Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (produced by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and ~40k matches demonstrate (i) a clear superiority of the human-coded agents (written by graduate students): the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as input and prompted to improve upon it, the best-performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.

Problem Overview
The Auction, Pickup, and Delivery Problem (APDP).
Task: Logistics operations optimization. Goal: Maximize profit for a transportation company.
Multiple transportation companies (agents) compete in a market. Each company owns several vehicles that deliver tasks (e.g., parcels) in a given network. Tasks are sold via a reverse first-price sealed-bid auction, i.e., a company's bid corresponds to the amount of money it wants to be paid to deliver the task. Higher bids mean more revenue, but bidding too high may result in not getting the auctioned task. A competitive bid depends on (i) the marginal cost of adding the auctioned task to the partial delivery plan, given the already-won tasks, (ii) the opponent's marginal cost, and (iii) other strategic decisions, such as incurring a loss (bidding below your marginal cost) at the beginning in order to reduce the cost of future tasks (better positioning in the market).
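To make the bidding logic concrete, here is a minimal, hypothetical sketch of a marginal-cost bidder (illustrative function names and interface, not the tournament platform's actual API). It prices a task at the extra planning cost it would incur, adjusts the price using an estimate of the opponent's bid, and applies an early-round discount to buy market position.

def marginal_cost_bid(plan_cost, won_tasks, new_task,
                      opponent_bid_estimate=None,
                      round_index=0, early_rounds=5, discount=0.8):
    """Illustrative marginal-cost bidder (hypothetical interface).

    plan_cost(tasks) is assumed to return the cost of an (approximately)
    optimal delivery plan for the given set of tasks.
    """
    # (i) Marginal cost: extra cost of serving the auctioned task on top of
    # the tasks already won.
    my_marginal = plan_cost(won_tasks + [new_task]) - plan_cost(won_tasks)

    # (iii) Strategic adjustment: early in the auction, bid below marginal
    # cost (accept a short-term loss) to win tasks that improve future
    # positioning in the market.
    bid = my_marginal * (discount if round_index < early_rounds else 1.0)

    # (ii) Opponent model: the lowest bid wins, so if we expect the opponent
    # to ask for more than our current bid, raise ours to just below theirs
    # and pocket the extra revenue.
    if opponent_bid_estimate is not None and opponent_bid_estimate > bid:
        bid = opponent_bid_estimate - 1.0

    return max(bid, 0.0)

In a competitive agent, plan_cost would wrap a capacity-constrained routing solver and opponent_bid_estimate would come from an explicit opponent model fitted to the bids observed so far.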
After the auction is complete, the company has to determine a plan for its vehicles such that all tasks won by the company are delivered and the total revenue of the company is maximized. The plan is a sequence of pickup and delivery actions, such that vehicle capacity constraints are satisfied. The total revenue of the company is defined as the sum of rewards (won bids paid out by the auction house) minus delivery cost (kilometers driven times cost per kilometer). It is, thus, necessary to bid optimally and compute efficient delivery plans.
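As a rough, hypothetical illustration of the objective described above (the data structures below are illustrative, not the platform's actual interface), the following sketch sums the won bids, subtracts the distance-based delivery cost, and checks that a vehicle's plan never violates its capacity constraint.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "pickup" or "deliver"
    task_weight: float   # weight of the task handled by this action
    distance_km: float   # distance driven to reach this action's location

def plan_profit(won_bids, plans, cost_per_km):
    """Profit = sum of won bids minus kilometers driven times cost per km."""
    revenue = sum(won_bids)
    cost = cost_per_km * sum(a.distance_km for plan in plans for a in plan)
    return revenue - cost

def respects_capacity(plan, capacity):
    """A vehicle's plan is feasible only if its load never exceeds capacity."""
    load = 0.0
    for a in plan:
        load += a.task_weight if a.kind == "pickup" else -a.task_weight
        if load > capacity:
            return False
    return True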


Topologies
Network Topologies. From top to bottom and left to right we have: Great Britain, Switzerland, the Netherlands, and France. Colored triangles represent vehicles.

Acknowledgements

We are grateful to Professor Emeritus Boi Faltings and the members of the Artificial Intelligence Laboratory at EPFL who contributed to the development of the Intelligent Agents course over the years. We also thank the students who contributed their code as baselines, specifically: Clement Burgelin, Jonas Ulbrich ("Student 3"), Lorenzo Jacqueroud, Alessandro Mari, Antoine Bourret, Samuel Dubuis, Oliver Facklam, Florian Singer, Manuel Leone, Stefano Huber, Wenuka Gunarathna, Sofia Dandjee, Karim Assi, Eugenie Demeure, Lorenzo Panchetti, Maxence Perret.

BibTeX

@inproceedings{danassis2025vibecodingtournament,
      title={Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning}, 
      author={Panayiotis Danassis and Naman Goel},
      year={2026},
      booktitle = {Proceedings of the 25th International Conference on Autonomous Agents and MultiAgent Systems},
      series = {AAMAS '26},
      url={https://arxiv.org/abs/2511.20613}, 
}