Grok, X's artificial intelligence chatbot, may be able to write jokes and generate strange images, but it is hopeless at predicting sports outcomes. A new research report from the AI startup General Reasoning finds that X's native chatbot performs significantly worse than all of its major rivals when asked to predict real-world soccer matches. The study shows how advanced language models struggle once they leave simple, controlled benchmarks for the chaotic world of sports betting.
The researchers chose the 2023-2024 season of the Premier League, the most-watched soccer league in the world, as their test bed. They took eight widely used large language models and fed each one thousands of detailed historical statistics about every team and its previous matches. The models were then instructed to develop a betting strategy that would maximize financial returns while managing risk.
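The report does not disclose what strategies the models actually produced. As a hedged illustration of the kind of "maximize returns while managing risk" policy the prompt asked for, here is a minimal sketch of fractional Kelly bet sizing, a textbook risk-managed approach; every name and parameter below is illustrative, not taken from the study.

```python
# Illustrative sketch only: fractional Kelly bet sizing, a standard
# risk-managed betting policy. Nothing here comes from the paper itself.

def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Kelly stake as a fraction of bankroll for a single bet.

    p_win:        the bettor's estimated probability of the outcome.
    decimal_odds: bookmaker decimal odds (total payout per unit staked).
    """
    b = decimal_odds - 1.0                # net winnings per unit staked
    edge = p_win * b - (1.0 - p_win)      # expected profit per unit staked
    return max(edge / b, 0.0)             # never bet when the edge is negative

def stake(bankroll: float, p_win: float, decimal_odds: float,
          kelly_multiplier: float = 0.25) -> float:
    """Bet a quarter-Kelly stake: smaller than full Kelly to reduce variance."""
    return bankroll * kelly_multiplier * kelly_fraction(p_win, decimal_odds)

# Example: a £100,000 bankroll, a 55% estimated win probability, even odds.
print(round(stake(100_000, 0.55, 2.0), 2))  # → 2500.0
```

Full Kelly maximizes long-run bankroll growth in theory but produces violent swings in practice, which is why bettors typically scale it down; a model that instead wagered large fixed fractions on low-confidence outcomes would show exactly the kind of wipeout behavior the study observed.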
Each chatbot ran the full simulation three separate times, starting each run with a pot of $133,000 (roughly £100,000). The results showed that even the most capable models could not beat the unpredictability of live sports. The best performer of the group, Anthropic's Claude Opus 4.6, still lost money: it dropped an average of 11.0% of its starting cash over the three attempts, ending with roughly £89,035 in its virtual wallet.
X's chatbot, Grok, failed the test outright. On its first attempt it made disastrous bets and lost 100% of its money; on the next two it simply broke down and failed to complete the task, leaving it with an average final pot of zero. OpenAI's GPT-5.4 turned in a somewhat more respectable, though still losing, performance, dropping an average of 13.6% and finishing with about $116,000. On its worst single attempt, however, GPT-5.4 lost 31.6%, showing how quickly these models can burn through a bankroll.
Google's Gemini 3.1 Pro produced the most erratic results of the group. It posted a dismal average, losing 43.3% of its starting cash across its runs, yet on one attempt it actually returned a 33.7% profit. That wild variability suggests the model was making large, reckless wagers rather than following a calculated, risk-managed strategy.
The authors conclude that artificial intelligence currently and systematically underperforms humans at predicting real-world outcomes. Ross Taylor, chief executive of General Reasoning, pointed to a major flaw in how the tech industry markets these tools: despite the enormous hype around AI automation, researchers rarely measure how they perform in long-horizon, dynamic situations. Most standard AI benchmarks, he noted, are static environments that bear little resemblance to the messy complexity of real life.
This embarrassing result for Grok arrives at an awkward moment for its owner, Elon Musk. Reports have recently surfaced that Musk is pressuring major Wall Street banks to buy expensive corporate subscriptions to Grok if they want a role in SpaceX's upcoming initial public offering. If the software cannot manage a simple sports betting simulation without crashing or losing all of its money, corporate clients may reasonably ask why they should pay top dollar to fold it into their daily operations.











