Forecasting the new Champions League

The UEFA Champions League changed its format starting the 2024-25 season. The old eight group-of-four structure is gone; in its place is a 36-team league phase where each club plays eight matches against eight different opponents, and the single cross-club table decides who advances. Top eight go straight to the round of sixteen. Ninth through twenty-fourth play a knockout playoff. Bottom twelve go home.

If you’ve been reading about UCL predictions for years, this matters more than it sounds like. Almost every classical model implicitly assumed the group stage’s symmetry: each team plays the other three in its group home and away, strength of schedule is close to a within-group constant, and the group winner / runner-up split is a clean two-outcome problem per group. The league phase breaks all of that.

UEFA-oracle is my attempt at the forecasting problem in the new format, built as part of a broader series that pits Time Series Foundation Models plus classical sports primitives against Polymarket. This is a running-research post, not a claim of results.

Why the new format is interesting, and hard

The data got better

Each team now plays eight league-phase matches instead of six group-stage matches. That’s a 33% bump in within-tournament sample for every team. For any model that depends on recent-form estimation, this is unambiguously good.

Strength-of-schedule stopped being uniform

Opponents are drawn from four pots, and you play two teams from each pot, but which two teams from each pot is random. Some teams get a brutal draw and some get a kind one. Any forecaster that ranks on raw win percentage without adjusting for opponent quality will systematically overrate teams with soft draws and underrate teams with hard ones.
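One simple way to see the size of the draw effect is to compare each team's average opponent rating against the league-wide average. The sketch below does that on a toy four-team schedule; the ratings, team names, and fixture lists are made up for illustration, and a real adjustment would also account for home and away.

```python
from statistics import mean

# Hypothetical ratings and draws, for illustration only.
elo = {"A": 1900.0, "B": 1850.0, "C": 1700.0, "D": 1650.0}
opponents = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["A", "C"]}


def schedule_strength(team, opponents, elo):
    """Mean rating of the opponents a team was drawn against."""
    return mean(elo[opp] for opp in opponents[team])


sos = {team: schedule_strength(team, opponents, elo) for team in opponents}
league_avg = mean(sos.values())

# Positive means a harder-than-average draw, negative a softer one.
relative_sos = {team: sos[team] - league_avg for team in sos}
print(relative_sos)
```

In this toy draw, B and D face schedules 25 rating points harder than average and A and C face schedules 25 points softer, even though every team plays the same number of matches.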

The seeding cliff

Finishing 8th vs. 9th is the difference between skipping a knockout round and playing an extra two-legged playoff tie against a team that finished 17th to 24th. This is a discontinuity the model has to handle explicitly: a team with a 50-50 chance of finishing 8th or 9th has a meaningfully different expected advancement profile than its probability of surviving the league phase alone would suggest.
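The arithmetic behind the cliff is short. A minimal sketch, assuming you already have a team's probability of finishing in the top eight, its probability of landing in the playoff places, and some estimate of its chance of winning a playoff tie (the 0.65 below is a made-up number, not a model output):

```python
def p_reach_round_of_16(p_top8, p_playoff, p_win_playoff):
    """Probability of reaching the round of 16.

    p_top8:        probability of finishing 1st to 8th (direct qualification)
    p_playoff:     probability of finishing 9th to 24th
    p_win_playoff: assumed probability of winning the two-legged playoff
    """
    return p_top8 + p_playoff * p_win_playoff


# A team certain to survive the league phase, but 50-50 between 8th and 9th.
coin_flip = p_reach_round_of_16(p_top8=0.5, p_playoff=0.5, p_win_playoff=0.65)

# The same team if it were certain to finish 8th.
locked_in = p_reach_round_of_16(p_top8=1.0, p_playoff=0.0, p_win_playoff=0.65)

print(coin_flip, locked_in)
```

Both teams have a 100% chance of surviving the league phase, yet under these assumed numbers one reaches the round of 16 with probability 1.0 and the other with roughly 0.825.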

The bracket compounds uncertainty

After the league phase, the knockout bracket is seeded by league position, roughly 1-vs-16, 2-vs-15, and so on. League-phase uncertainty therefore propagates into bracket uncertainty: a team’s title probability is a sum over every finishing position it might end up in, each weighted by the probability of winning through the bracket path that position implies. Monte Carlo is the only honest way to express that.

The model stack

The repo is not a single model; it’s a hybrid. Each component earns its place by doing a thing the others don’t.

Club Elo gets the regularization job. It’s a decades-old rating system, it’s genuinely hard to beat on raw team strength, and treating it as a prior rather than trying to replace it is the difference between a new model being useful and being worse than the baseline nobody wants to admit is still winning.
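For reference, the textbook Elo expected-score formula looks like this. It is the standard logistic form, not necessarily the exact parameterisation Club Elo uses; the `home_advantage` argument is an assumption added for illustration.

```python
def elo_expected_score(rating_a, rating_b, home_advantage=0.0):
    """Expected score for team A against team B under standard Elo.

    The result is in [0, 1], with a draw counted as half a win, so it is
    not a win probability on its own. home_advantage (in rating points)
    is a hypothetical bonus added to team A's rating.
    """
    diff = rating_a + home_advantage - rating_b
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


print(elo_expected_score(1800, 1800))  # evenly matched teams
print(elo_expected_score(1900, 1800))  # 100-point favourite
```

A 100-point gap gives an expected score of about 0.64. Turning that single number into separate win, draw, and loss probabilities is the job of the goal model.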

Time Series Foundation Models (Chronos-2 / TimesFM 2.5 / FlowState) do the dynamic part — estimating trajectories of offensive and defensive strength from each team’s recent match history rather than treating a club as a static rating. A team that’s been climbing for six weeks and a team that’s been drifting for six weeks shouldn’t be indistinguishable just because they’re currently tied on Elo.

Poisson goal models map the team-strength pair into a joint goal distribution for a given fixture. This is the old, boring, correct way to go from “Team A is this strong, Team B is that strong” to “here’s the probability Team A scores exactly 2 and Team B scores exactly 1.” I tried replacing it with a neural alternative; it got worse. The Poisson stays.
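As a concrete illustration, here is the simplest version of that mapping: two independent Poisson distributions, one per team. Independence is an assumption of this sketch (the repo's joint distribution may model correlation between the two scores), and the goal rates 1.6 and 1.1 are made up.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k goals given an expected goal rate lam."""
    return exp(-lam) * lam**k / factorial(k)


def scoreline_prob(home_goals, away_goals, lam_home, lam_away):
    """Probability of an exact scoreline, assuming independent scoring."""
    return poisson_pmf(home_goals, lam_home) * poisson_pmf(away_goals, lam_away)


def outcome_probs(lam_home, lam_away, max_goals=10):
    """Home win / draw / away win probabilities, truncated at max_goals."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = scoreline_prob(h, a, lam_home, lam_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    return home, draw, away


print(scoreline_prob(2, 1, lam_home=1.6, lam_away=1.1))
print(outcome_probs(lam_home=1.6, lam_away=1.1))
```

With these rates a 2-1 home win comes out at roughly 9.5%. Truncating at ten goals per side drops a negligible amount of probability mass for rates in this range.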

The TSFM does not replace Elo; it refines it. Elo does not replace the Poisson; it feeds it. Every stage is doing the thing it’s good at, and the system is as honest as I can make it about what comes from each layer.
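To make the layering concrete, here is one way such a hand-off could look. This is a sketch under loud assumptions, not the repo's actual code: the constants, the function name, and the idea of expressing the TSFM output as a log-scale form adjustment that gets shrunk toward zero are all hypothetical.

```python
from math import exp

BASE_GOALS = 1.35    # assumed average goals per team per match
ELO_SCALE = 0.0015   # hypothetical: log goal-rate change per Elo point
SHRINK = 0.5         # hypothetical: weight on the TSFM adjustment (0 = Elo only)


def goal_rates(elo_home, elo_away, form_home=0.0, form_away=0.0):
    """Expected goals for each side, with Elo as the prior.

    form_home / form_away stand in for a TSFM-derived log-scale adjustment
    (positive = trending up). With both at zero the rates depend on Elo alone.
    """
    diff = elo_home - elo_away
    lam_home = BASE_GOALS * exp(ELO_SCALE * diff + SHRINK * form_home)
    lam_away = BASE_GOALS * exp(-ELO_SCALE * diff + SHRINK * form_away)
    return lam_home, lam_away


# Two teams tied on Elo, one climbing and one drifting.
print(goal_rates(1800, 1800))
print(goal_rates(1800, 1800, form_home=0.10, form_away=-0.10))
```

The point of the shrinkage weight is that the trajectory signal can only nudge the Elo-implied rates, never overwrite them. The resulting pair of rates is what a Poisson goal model would consume.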

From fixtures to probabilities

Given a trained fixture-level model, the pipeline is:

1. For each remaining league-phase match, sample a scoreline from the joint goal distribution.
2. Accumulate points across all matches in a single run to produce a final league table.
3. Apply the seeding rules to generate the knockout bracket.
4. Simulate the bracket forward, then repeat. Average over runs for per-team win / advance / title probabilities.
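The league-table half of that pipeline fits in a few dozen lines. The sketch below is a simplified stand-in, not the repo's implementation: it uses independent Poisson scoring, breaks ties on points at random (the real competition uses goal difference and further criteria), stops before the bracket, and runs on a made-up four-team table.

```python
import random
from collections import Counter
from math import exp


def sample_poisson(lam, rng):
    """Draw a Poisson variate with Knuth's method (fine for small rates)."""
    threshold = exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_table(points, fixtures, rng):
    """Play the remaining fixtures once; return teams ranked best to worst."""
    table = dict(points)
    for home, away, lam_home, lam_away in fixtures:
        hg, ag = sample_poisson(lam_home, rng), sample_poisson(lam_away, rng)
        if hg > ag:
            table[home] += 3
        elif hg < ag:
            table[away] += 3
        else:
            table[home] += 1
            table[away] += 1
    # Simplification: ties on points are broken at random.
    return sorted(table, key=lambda t: (table[t], rng.random()), reverse=True)


def finish_probabilities(points, fixtures, cutoff, n_runs=10_000, seed=0):
    """Share of runs in which each team finishes inside the top `cutoff`."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        counts.update(simulate_table(points, fixtures, rng)[:cutoff])
    return {team: counts[team] / n_runs for team in points}


# Hypothetical current points and remaining fixtures (home, away, rates).
points = {"A": 10, "B": 4, "C": 4, "D": 3}
fixtures = [("A", "B", 1.5, 1.0), ("C", "D", 1.3, 1.2)]
print(finish_probabilities(points, fixtures, cutoff=2))
```

For the real format you would pass 36 teams and call it with `cutoff=8` and `cutoff=24`, then feed each simulated ranking into the bracket step rather than discarding it.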

The number of Monte Carlo iterations is set large enough that the Monte Carlo noise on the headline probabilities is much smaller than the model uncertainty — if you can’t tell whether Arsenal is 12% or 14% because you only ran 2,000 sims, you’re measuring the wrong thing.
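You can check the simulation budget with the binomial standard error. A small sketch (the function names are mine):

```python
from math import ceil, sqrt


def mc_standard_error(p, n_runs):
    """Standard error of a Monte Carlo probability estimate near p."""
    return sqrt(p * (1.0 - p) / n_runs)


def runs_needed(p, target_se):
    """Number of runs needed to push the standard error down to target_se."""
    return ceil(p * (1.0 - p) / target_se**2)


print(mc_standard_error(0.13, 2_000))
print(runs_needed(0.13, 0.001))
```

At 2,000 runs, a probability near 13% carries a standard error of about 0.75 percentage points, so a 95% interval is roughly plus or minus 1.5 points: wide enough that 12% and 14% are not separable. Getting the standard error down to 0.1 points takes on the order of 110,000 runs.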

Polymarket as the benchmark

Polymarket is a ~$480M prediction market. On UCL outcomes, it aggregates real money from people who have every incentive to be right. That makes it one of the most honest forecasting benchmarks you can find. It’s also very hard to beat.

A realistic goal here is not “outperform Polymarket’s implied probabilities on the champion.” That would be a result I’d want to see audited before I believed it. A realistic goal is finding systematic disagreements — specific clubs where the market is over-pricing, under-pricing, or missing a structural factor the model is capturing.

The honest, boring truth about liquid prediction markets: they’re very good at integrating information the crowd already has. They’re sometimes worse at pricing structural changes to the game itself — things like a new tournament format, a mid-season manager change with uncertain lag, a squad-depth effect that only shows up at fixture congestion. That’s the edge the model is looking for, and it’s a narrow one.
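Mechanically, looking for disagreements means putting model and market on the same scale and flagging the gaps. A minimal sketch, assuming outright-winner prices that do not sum exactly to 1 and a proportional normalisation; the club names, prices, model numbers, and the 2-point threshold are all invented for illustration.

```python
def implied_probabilities(prices):
    """Rescale market prices so they sum to 1 (proportional normalisation)."""
    total = sum(prices.values())
    return {team: price / total for team, price in prices.items()}


def disagreements(model, market, threshold=0.02):
    """Teams where |model - market| >= threshold, largest gap first.

    A positive gap means the model is higher than the market.
    """
    gaps = {team: model[team] - market[team] for team in model if team in market}
    flagged = [(team, gap) for team, gap in gaps.items() if abs(gap) >= threshold]
    return sorted(flagged, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical three-club market and model output.
prices = {"Club A": 0.30, "Club B": 0.25, "Club C": 0.50}
model = {"Club A": 0.33, "Club B": 0.24, "Club C": 0.43}

market = implied_probabilities(prices)
for team, gap in disagreements(model, market):
    print(f"{team}: model minus market = {gap:+.3f}")
```

A single flagged gap means little on its own. The interesting signal is a gap that keeps the same sign across matchdays and has a structural explanation behind it.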

What I’m not claiming

I’m not publishing a PnL. A single-tournament backtest on a zero-sum market is the easiest result in the world to overfit into existence. Ask me after two seasons.
I’m not claiming TSFMs “beat” Elo on this task. The honest question is whether a hybrid outperforms each component alone — which is a different and subtler thing than a head-to-head.
The model has real weak points. Squad rotation, key-player injury signals, and European-specific home-advantage priors are underfit in the current version. The roadmap acknowledges them.

What I’d tell someone starting a similar project

Put the Monte Carlo in on day one, not as the last step. The entire downstream analysis (confidence intervals, conditional probabilities, head-to-head comparisons against the market) only works once you can sample runs cheaply.

And benchmark against the dumb models. Relentlessly. Elo alone, Poisson-with-static-strength, uniform-priors. If your expensive hybrid can’t clearly beat the boring baseline, it’s not ready to be compared to a real market. That’s the bar before Polymarket enters the conversation.
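A scoring rule makes that comparison mechanical. Below is a sketch using the multi-class Brier score on match outcomes (lower is better); the choice of Brier over log loss or ranked probability score is mine, and the forecasts and results are toy values.

```python
def brier_score(forecasts, outcomes):
    """Mean multi-class Brier score.

    forecasts: list of dicts mapping outcome label -> probability
    outcomes:  list of the labels that actually occurred
    """
    total = 0.0
    for probs, actual in zip(forecasts, outcomes):
        total += sum(
            (p - (1.0 if label == actual else 0.0)) ** 2
            for label, p in probs.items()
        )
    return total / len(forecasts)


# Toy results: H = home win, D = draw, A = away win.
outcomes = ["H", "D", "H"]

uniform = [{"H": 1 / 3, "D": 1 / 3, "A": 1 / 3}] * 3
hybrid = [
    {"H": 0.55, "D": 0.25, "A": 0.20},
    {"H": 0.40, "D": 0.30, "A": 0.30},
    {"H": 0.60, "D": 0.22, "A": 0.18},
]

print("uniform:", brier_score(uniform, outcomes))
print("hybrid: ", brier_score(hybrid, outcomes))
```

The uniform baseline always scores 2/3 on a three-outcome problem, which gives a fixed floor to beat. The same function can score Elo alone and Poisson-with-static-strength on an identical set of matches.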