World Cup Oracle — Shuai Yao

AI vs Polymarket — predicting the 2026 FIFA World Cup with a Time Series Foundation Model ensemble. 48 teams, 104 matches, a $3.6B market.

Context

Polymarket’s 2026 FIFA World Cup market has traded over $3.6B in volume — real money pricing real outcomes. On a structured tournament with known teams, brackets, and rules, this is one of the most honest benchmarks a forecasting system can face.

worldcup-oracle asks: can a Time Series Foundation Model ensemble, combined with classical sports-forecasting primitives, find systematic mispricings on that market?

The tournament is live (June 11 – July 19, 2026). Every prediction was committed in writing, before kickoff — so what follows is a record being graded in real time, not a retrofit.

Approach

Hybrid model stack

TSFM ensemble — Chronos-2 (120M), TimesFM 2.5 (200M), FlowState (9.1M). Each forecasts a team’s Elo trajectory from decades of match history; a Bradley-Terry-Davidson bridge turns those trajectories into per-match win / draw / loss probabilities.
Club Elo baseline — a decades-old rating system that is genuinely hard to beat as a prior. Treated as a fourth voter for sanity checks.
XGBoost match-level classifier — a direct, non-TSFM read on each fixture, ensembled in alongside the trajectory models.
Poisson goal model — maps team-strength pairs into joint scoreline distributions, rescaled so the most-likely scoreline can never contradict the headline probabilities.
Monte Carlo — 50K simulations of the full 104-match tournament for win / advance / title odds. Now that matches are being played, results are pinned as fact and only the remaining bracket is re-simulated each day.
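To make the bridge step in the stack above concrete, here is a minimal sketch of a Davidson-style extension of Bradley-Terry that turns two Elo ratings into win / draw / loss probabilities. The function name `btd_probs`, the draw parameter `nu = 0.7`, and the 400-point log10 Elo scale are assumptions for illustration; the project's actual parameterization is not stated here.

```python
def btd_probs(elo_a: float, elo_b: float, nu: float = 0.7) -> tuple[float, float, float]:
    """Return (P(A wins), P(draw), P(B wins)) under a Davidson-style model.

    nu >= 0 controls how much probability mass goes to the draw; nu = 0
    recovers plain Bradley-Terry. The default value is illustrative only.
    """
    if nu < 0:
        raise ValueError("nu must be non-negative")
    # Strengths on the usual Elo scale, centered so that sqrt(s_a * s_b) == 1.
    half_gap = (elo_a - elo_b) / 800.0
    s_a = 10.0 ** half_gap
    s_b = 10.0 ** (-half_gap)
    # Davidson's draw term is nu * sqrt(s_a * s_b), which equals nu after centering.
    denom = s_a + s_b + nu
    return s_a / denom, nu / denom, s_b / denom


# Illustrative rating gaps (hypothetical, not taken from the project).
for gap in (0, 100, 300):
    win, draw, loss = btd_probs(1800 + gap, 1800)
    print(f"gap={gap:>3}  win={win:.3f}  draw={draw:.3f}  loss={loss:.3f}")
```

In a setup like this, the forecast Elo trajectory would supply `elo_a` and `elo_b` for each match date, and `nu` would be fitted to historical draw frequency.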

Honest edge selection

An edge is only flagged STRONG when both conditions hold:

Absolute edge vs Polymarket > 5 percentage points
All 4 models (Chronos-2, TimesFM-2.5, FlowState, Elo) agree on direction

This filters out “one weird model” calls. Bet sizing uses a half-Kelly cap.
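The two-condition rule and the sizing cap can be sketched as follows. This is a hedged illustration, not the project's code: the function names, the per-model probabilities, and the reading of "half-Kelly cap" as "stake at most half the full Kelly fraction on a binary share" are all assumptions.

```python
def classify_edge(model_probs: dict[str, float], ensemble_prob: float,
                  market_prob: float, threshold_pp: float = 5.0) -> str:
    """Flag STRONG only if the edge exceeds the threshold AND every model agrees on direction."""
    edge_pp = (ensemble_prob - market_prob) * 100.0
    direction = 1 if edge_pp > 0 else -1
    # A model "agrees" when it sits on the same side of the market as the ensemble.
    all_agree = all((p - market_prob) * direction > 0 for p in model_probs.values())
    if abs(edge_pp) > threshold_pp and all_agree:
        return "STRONG BUY" if direction > 0 else "STRONG SELL"
    return "NO FLAG"


def half_kelly_fraction(belief: float, price: float) -> float:
    """Bankroll fraction for buying a share priced at `price` that pays 1 if it resolves true."""
    if not 0.0 < price < 1.0:
        raise ValueError("price must be strictly between 0 and 1")
    full_kelly = (belief - price) / (1.0 - price)  # Kelly fraction for a binary payoff
    return max(0.0, 0.5 * full_kelly)


# Hypothetical per-model title probabilities; only the 32.2% / 16.0% pair is from the write-up.
models = {"Chronos-2": 0.30, "TimesFM-2.5": 0.33, "FlowState": 0.31, "Elo": 0.34}
print(classify_edge(models, ensemble_prob=0.322, market_prob=0.160))
print(round(half_kelly_fraction(0.322, 0.160), 4))
```

For a SELL flag, the same sizing function would apply to the opposite share, with `belief` and `price` both replaced by one minus their values.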

Backtest validation

76 automated tests, plus a walk-forward backtest on the 2014, 2018, and 2022 World Cups using only pre-tournament data (correct 32-team format, official FIFA bracket). All four models put the eventual champion in their top 3 for 2 of 3 tournaments; the miss was 2018, when France, ranked outside the pre-tournament top 5, won and beat every model. The TSFMs add a modest but consistent lift over pure Elo: Chronos-2 leads with an average Brier score of 0.0263 and a Brier Skill Score (BSS) of +0.131, vs Elo’s BSS of +0.118. Modest, not magic, a theme that recurs below.
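For readers unfamiliar with the metric, here is a minimal sketch of how a Brier score and Brier Skill Score can be computed for a title market. The reference forecast is an assumption: the write-up does not say which one was used, though the reported numbers appear consistent with a uniform 1/32 prior over the field.

```python
def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(bs_model: float, bs_reference: float) -> float:
    """Skill relative to a reference forecast: positive means better than the reference."""
    return 1.0 - bs_model / bs_reference


# Assumed reference: every one of 32 teams gets 1/32, and exactly one team wins.
n_teams = 32
uniform = [1.0 / n_teams] * n_teams
outcomes = [1] + [0] * (n_teams - 1)
bs_ref = brier_score(uniform, outcomes)  # equals (n - 1) / n**2

print(f"reference Brier: {bs_ref:.4f}")
print(f"BSS at Brier 0.0263: {brier_skill_score(0.0263, bs_ref):+.3f}")
```

Under that assumption, a model Brier of 0.0263 maps to a BSS of roughly +0.131, which matches the Chronos-2 figure quoted above.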

Pre-tournament edges (committed June 2026, before kickoff)

The model went materially long Spain and short Brazil / England / Portugal vs the market:

Spain — 32.2% AI vs 16.0% market = +16.2pp STRONG BUY (4/4 models agree)
Brazil — 3.0% vs 8.6% = −5.6pp STRONG SELL
England — 6.0% vs 11.3% = −5.4pp STRONG SELL
Portugal — 1.9% vs 7.0% = −5.2pp STRONG SELL

These are frozen — they get resolved or falsified by July 19, 2026. The running AI-vs-Polymarket scoreboard (which side priced each eliminated team better) updates live on the dashboard.

Weather study: does American heat move results?

With matches played across the US in summer, I ran a two-round observational study on the tournament’s own matches to ask whether the heat shows up on the scoreboard. Short answer: it visibly taxes players’ bodies and reshapes how teams play, but the tax is absorbed before it reaches the score.

Every link in the chain was measured:

Humid heat (wet-bulb temp) → open-air players run less: ρ = −0.55, p < 0.0001, about −0.78 km per °C.
Negative control: at the three indoor air-conditioned venues (Dallas, Houston, Atlanta) the effect vanishes (ρ = −0.17, p = 0.57) — outdoor temperature is meaningless indoors, exactly as a real causal effect requires. (The naive all-sample correlation was diluted by these climate-controlled matches.)
Player-level (OCR of FIFA’s post-match reports, 953 players): sprint count drops (−0.49) but top speed is untouched — heat cuts how often players sprint, not their ceiling.
Teams pass instead of run (passes and line-breaks rise) with xG flat — style substitution, not fewer chances.
Favorites’ running edge over underdogs reverses in humid heat (interaction p = 0.08) — suggestive, not significant.
Upsets, goals, goal timing, second-half collapse: all null.
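The first two links in the chain, a rank correlation on open-air matches and the same statistic on a negative-control group, can be sketched as below. The data here is synthetic and seeded purely for illustration; only the idea of splitting by venue type comes from the study. The helper names are hypothetical, and p-values are omitted (a library routine such as `scipy.stats.spearmanr` would supply them).

```python
import math
import random


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Synthetic data: outdoor distance falls with wet-bulb temperature, indoor does not.
rng = random.Random(0)
wet_bulb_out = [rng.uniform(15, 28) for _ in range(60)]
dist_out = [110 - 0.78 * t + rng.gauss(0, 3) for t in wet_bulb_out]
wet_bulb_in = [rng.uniform(15, 28) for _ in range(15)]
dist_in = [100 + rng.gauss(0, 3) for _ in wet_bulb_in]

print("open-air rho:", round(spearman_rho(wet_bulb_out, dist_out), 2))
print("indoor control rho:", round(spearman_rho(wet_bulb_in, dist_in), 2))
```

The design point is the split itself: if the outdoor reading still correlated with running distance inside climate-controlled venues, the open-air result would more likely reflect a confounder than heat.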

Both teams slow down and both switch styles together, so the strength ordering barely moves. Conclusion: weather is not used in the official predictions. The dashboard’s weather tab shows the findings plus a clearly labeled experimental adjustment (James-Stein-shrunk to −0.71pp/°C, capped ±3pp) that auto-zeroes if the signal dies as the sample grows. Being able to publish a clean null is the point of committing in advance.
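One way an adjustment with these properties could work is sketched below. It is an assumption-laden illustration: the positive-part shrinkage formula, the function names, and the idea of multiplying the slope by a wet-bulb excess in °C are all hypothetical, and only the −0.71pp/°C slope and ±3pp cap come from the text.

```python
def shrink_slope(estimate: float, std_error: float) -> float:
    """Positive-part shrinkage toward zero, in the James-Stein spirit.

    The factor max(0, 1 - se^2 / estimate^2) goes to exactly zero once the
    estimate is no larger than its standard error, so a dying signal
    switches the adjustment off without manual intervention.
    """
    if estimate == 0.0:
        return 0.0
    factor = max(0.0, 1.0 - (std_error / estimate) ** 2)
    return factor * estimate


def weather_adjustment_pp(slope_pp_per_c: float, wet_bulb_excess_c: float,
                          cap_pp: float = 3.0) -> float:
    """Probability adjustment in percentage points, clipped to +/- cap_pp."""
    raw = slope_pp_per_c * wet_bulb_excess_c
    return max(-cap_pp, min(cap_pp, raw))


print(weather_adjustment_pp(-0.71, 2.0))   # small excess: uncapped
print(weather_adjustment_pp(-0.71, 10.0))  # large excess: hits the cap
print(shrink_slope(-0.5, 0.6))             # noisy estimate: shrunk to zero
```

The cap bounds the damage of a wrong slope on any single match, while the shrinkage handles the case where the slope itself stops being distinguishable from noise.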

The honest no-ops (prediction-optimization audit)

Every lever I tested to “improve” the predictions had to pass the same walk-forward gate, and almost none shipped:

Blending in the market: the loss-optimal weight on the AI was 0 across the 17 dual-quoted matches (Polymarket’s Brier 0.350 beat the AI’s 0.497), so the best “blend” is just the market price, and adopting it would abandon the experiment.
Rest-day Elo bump: helped the 2018 and 2022 knockouts but hurt 2014, so it failed the “beats all” gate.
Strength-dependent recalibration: the live calibration curve is non-monotone noise.
Weather: see above.
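The blending check in the first item can be illustrated with a closed-form weight search. This is a sketch under assumptions: it treats the quoted Brier figures as multiclass scores summed over win / draw / loss, and the two matches below are invented examples, not the 17 real ones.

```python
Probs = tuple[float, float, float]  # (win, draw, loss)


def multiclass_brier(preds: list[Probs], outcomes: list[Probs]) -> float:
    """Mean over matches of the squared error summed across the three outcomes."""
    total = sum((p - y) ** 2 for pred, out in zip(preds, outcomes) for p, y in zip(pred, out))
    return total / len(preds)


def optimal_ai_weight(ai: list[Probs], market: list[Probs], outcomes: list[Probs]) -> float:
    """Weight w in [0, 1] minimizing the Brier score of w * ai + (1 - w) * market.

    The loss is a convex quadratic in w, so the unconstrained minimizer
    clipped to [0, 1] is the constrained optimum.
    """
    num = den = 0.0
    for a, m, y in zip(ai, market, outcomes):
        for ak, mk, yk in zip(a, m, y):
            num += (ak - mk) * (yk - mk)
            den += (ak - mk) ** 2
    if den == 0.0:
        return 0.0  # forecasts are identical, so the weight is irrelevant
    return min(1.0, max(0.0, num / den))


# Invented example: the AI is confidently wrong twice, the market is milder.
ai = [(0.7, 0.2, 0.1), (0.2, 0.3, 0.5)]
market = [(0.4, 0.3, 0.3), (0.3, 0.3, 0.4)]
outcomes = [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0)]

print("AI Brier:", round(multiclass_brier(ai, outcomes), 3))
print("market Brier:", round(multiclass_brier(market, outcomes), 3))
print("optimal weight on AI:", optimal_ai_weight(ai, market, outcomes))
```

A clipped weight of 0 means every step toward the AI forecast raises the loss on that sample, which is the situation the audit describes.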

The one mechanism that demonstrably works is daily recalibration of two parameters, a calibration temperature and a draw rate, on realized results; it runs automatically at 06:00 UTC. Everything else was a no-op, and saying so out loud is more useful than a dashboard full of knobs.
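A two-parameter recalibration of this kind could look like the sketch below. Everything specific here is an assumption: it reads "temperature" as a probability-sharpening exponent rather than weather, models the draw rate as a multiplier on the draw probability, and fits both by grid search on log loss. The sample forecasts are invented.

```python
import math

Probs = tuple[float, float, float]  # (win, draw, loss)
EPS = 1e-12  # guards against log(0) and 0 ** x edge cases


def recalibrate(probs: Probs, temperature: float, draw_scale: float) -> Probs:
    """Apply temperature scaling, rescale the draw term, then renormalize."""
    win, draw, loss = (max(p, EPS) for p in probs)
    inv_t = 1.0 / temperature  # temperature > 1 flattens, < 1 sharpens
    scaled = (win ** inv_t, draw_scale * draw ** inv_t, loss ** inv_t)
    total = sum(scaled)
    return (scaled[0] / total, scaled[1] / total, scaled[2] / total)


def mean_log_loss(preds: list[Probs], outcomes: list[int]) -> float:
    """Average negative log probability assigned to the realized outcome index."""
    return -sum(math.log(max(p[o], EPS)) for p, o in zip(preds, outcomes)) / len(outcomes)


def fit_recalibration(preds: list[Probs], outcomes: list[int]) -> tuple[float, float, float]:
    """Grid search over (temperature, draw_scale); the identity (1.0, 1.0) is on the grid."""
    grid = [step / 10 for step in range(5, 21)]  # 0.5, 0.6, ..., 2.0
    best_t, best_s = 1.0, 1.0
    best_loss = mean_log_loss(preds, outcomes)
    for t in grid:
        for s in grid:
            loss = mean_log_loss([recalibrate(p, t, s) for p in preds], outcomes)
            if loss < best_loss:
                best_t, best_s, best_loss = t, s, loss
    return best_t, best_s, best_loss


# Invented forecasts and results (0 = win, 1 = draw, 2 = loss for the first team).
preds = [(0.7, 0.2, 0.1), (0.6, 0.25, 0.15), (0.5, 0.3, 0.2), (0.4, 0.3, 0.3)]
results = [1, 0, 1, 2]
print(fit_recalibration(preds, results))
```

With only a handful of results, a fit like this is mostly noise; a daily job would refit on the full set of played matches so the parameters stabilize as the sample grows.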

What I’m Learning

A lot of the headline “edge” is really the ensemble hugging Elo. The TSFMs compress toward the baseline, so the biggest disagreements with the market tend to sit where Elo itself is loud (Spain) rather than where the models found something new.
The genuinely defensible improvement is boring: calibration, not signal. Recalibrating on realized results beats every clever structural lever I tried.
Heat is a clean example of a real, strong, well-identified effect (ρ = −0.55, with a working negative control) that still doesn’t predict the outcome — a sharp reminder that “statistically significant” and “moves the money” are different questions.

Whichever way Spain resolves, I get a concrete answer about where hybrid TSFM forecasting adds value and where it just launders Elo.

Tech Stack & Links

Stack: Python · Chronos-2 · TimesFM 2.5 · FlowState · Club Elo · XGBoost · Poisson · Monte Carlo (50K runs). CPU-only, 32 GB RAM, models loaded one at a time.

Live dashboard: worldcup-oracle.pages.dev — rebuilt daily: per-match predictions and scorelines, live group tables, title odds, the running AI-vs-Polymarket scoreboard, and the weather-research tab. In-play scores stream from ESPN; in-window Polymarket odds stream tick-by-tick over the CLOB WebSocket (order-book midpoints, ~68 msg/s on matchdays).

Sister project: UEFA Champions League Oracle — same modeling team, different tournament, different conclusions about what TSFMs add.

Repo: github.com/YSKM523/worldcup-oracle