83 ties was enough to kill TSFMs as my baseline

When I started UEFA Oracle, the design instinct was to reuse the same stack that was working on fin-forecast-arena and the World Cup model: a 3-model Time Series Foundation Model ensemble (Chronos-2, TimesFM-2.5, FlowState), each producing offensive / defensive strength trajectories per team, ensembled, and then fed into a Poisson goal model. It’s the same toolkit. The fixtures are the same kind. The probabilities go into the same Polymarket comparison.

The 5-season, 83-tie backtest said no. The TSFM ensemble added no measurable point-prediction skill over a pure Club Elo + xG-blended baseline on this dataset. So I dropped TSFMs out of the production stack and hid them behind a --with-tsfm flag as a research / ablation layer.

This post is the writeup: what the backtest actually measured, why I think the result is real and not a fluke, and what it told me about when TSFMs help.

The setup

The thing I’m benchmarking is a single hybrid model against Polymarket-style probability output, evaluated by Brier score per tie. The hybrid has four components:

Club Elo prior — decades-old rating system, treated as a base
xG adjustment — for ties already partially played (e.g. second-leg predictions), the first-leg xG is folded back into the Elo prior to correct for “they got lucky / unlucky on conversion”
Injury-weighted Elo — FotMob injury data reweights team Elo (a Bayern XI with 4 starters out is materially weaker than its rating)
Poisson scoreline → Monte Carlo aggregate — turns the strength pair into a goal joint distribution, then sims the tie 50K times
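
As a rough illustration of the last component, here is a minimal sketch of a Poisson goal model feeding a Monte Carlo tie simulation. It is not the production code. The per-leg goal rates `lam_a` and `lam_b`, the two-leg structure without home advantage, and the coin-flip tiebreak are all assumptions made for the example.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_tie(lam_a, lam_b, n_sims=50_000, seed=0):
    """Estimate P(team A wins a two-legged tie) by simulation.

    lam_a, lam_b: assumed expected goals per leg for each side (hypothetical
    inputs; how strengths map to goal rates is not specified in the post).
    Level aggregates are settled by a coin flip, a crude stand-in for
    extra time and penalties.
    """
    rng = random.Random(seed)
    wins_a = 0
    for _ in range(n_sims):
        goals_a = sample_poisson(lam_a, rng) + sample_poisson(lam_a, rng)
        goals_b = sample_poisson(lam_b, rng) + sample_poisson(lam_b, rng)
        if goals_a > goals_b or (goals_a == goals_b and rng.random() < 0.5):
            wins_a += 1
    return wins_a / n_sims
```

Calling `simulate_tie(1.6, 1.2)` returns the simulated share of ties won by the stronger side. Evenly matched inputs should land near 0.5.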

The TSFM ensemble I optionally swap in produces team-strength trajectories from match-by-match history (~30 matches per team), ensemble-averages them, and replaces the first component, the Club Elo prior.
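
The ensemble-averaging step can be sketched in a few lines, assuming each model returns a strength trajectory of the same length. The function name, the list-of-lists layout, and the `how` parameter are hypothetical.

```python
from statistics import mean, median


def ensemble(trajectories, how="mean"):
    """Combine per-model strength trajectories step by step.

    trajectories: one list of floats per model, all the same length
    (an assumed layout, e.g. one list each for three TSFMs).
    how: "mean" or "median" across models at each time step.
    """
    if how not in ("mean", "median"):
        raise ValueError("how must be 'mean' or 'median'")
    aggregate = mean if how == "mean" else median
    # zip(*...) regroups the values so each tuple holds one time step.
    return [aggregate(step) for step in zip(*trajectories)]
```

With three models, the median simply picks the middle forecast at each step, which makes it less sensitive to one model drifting away from the other two.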

The backtest fixtures are 83 knockout-stage ties from UEFA Champions League rounds R16 onward, 2020-21 through 2024-25 seasons. For each tie I compute: TSFM-prior model probability of team A winning the tie, baseline (Elo) model probability, and the realized outcome. Score: Brier loss against the realized binary.
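
For reference, the Brier score on a binary outcome is the mean squared gap between the forecast probability and the 0/1 result. A minimal sketch, with a hypothetical function name and input layout:

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    probs: forecast probability that team A wins each tie.
    outcomes: 1 if team A won the tie, 0 otherwise.
    Lower is better; a constant 0.5 forecast scores 0.25.
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    if not probs:
        raise ValueError("need at least one forecast")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
```

Running it once on the TSFM-prior probabilities and once on the Elo-prior probabilities, against the same outcomes, gives the two numbers being compared.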

What 83 ties is and isn’t

It’s small. Knockout-stage ties are a scarce-data regime; you don’t get a 50,000-sample dataset out of the UCL since 2020. Even pooling in the Europa League brings it to ~200, which is still small.

But 83 is enough to rule out a large effect. If the TSFM ensemble were materially better than the Elo prior, even by 0.02 to 0.03 in mean Brier score, I’d expect to see it. The fact that I don’t see it after 83 ties means the effect is at most small, possibly zero, possibly negative.

The honest version of the result: mean Brier loss is within noise across the two systems, with the TSFM-prior version slightly worse on point predictions and slightly better on extreme upsets. Net-net, on the metric the production model is optimized for, TSFMs are a no-gain replacement for a baseline that already works.
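
One way to put a number on "within noise" is a paired bootstrap over per-tie Brier losses. The sketch below shows that general technique, not necessarily the check used for this backtest. The function name and inputs are hypothetical.

```python
import random


def paired_bootstrap_ci(loss_a, loss_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(loss_a - loss_b).

    loss_a, loss_b: per-tie losses for two models on the SAME ties, in the
    same order. Ties are resampled with replacement, which keeps the pairing.
    """
    if len(loss_a) != len(loss_b):
        raise ValueError("loss lists must be paired, tie by tie")
    diffs = [a - b for a, b in zip(loss_a, loss_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = rng.choices(diffs, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

If the interval comfortably straddles zero, the data cannot distinguish the two priors. With only 83 ties the interval will be wide, which is the point of the "at most small" framing.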

Why this is probably the right answer, not a measurement glitch

I checked the usual suspects before believing the result:

Data leakage. TSFMs were retrained per-tie on history up to that tie’s start. No future data, no peeking.
Sample composition. I broke the 83 ties out by round (R16 / QF / SF / F). The later buckets are thin (five seasons yield only five finals), so this is a coarse check, but the result is consistent across rounds and not driven by one stage.
Model choice. I tried two ensemble strategies (mean of 3 vs median) and three normalizations of the strength output. None of them flipped the conclusion. The TSFM models are doing something — just not something the baseline doesn’t already do.
Different metric. I rechecked with log loss and with margin error on simulated scorelines. Same answer.
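
Two of the checks above are easy to sketch: log loss as a second scoring rule, and a per-round Brier breakdown. The record layout `(round_label, prob, outcome)` and both function names are assumptions for illustration.

```python
import math
from collections import defaultdict


def log_loss(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood of 0/1 outcomes under the forecasts."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) can never occur
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


def brier_by_round(records):
    """Group squared errors by round.

    records: iterable of (round_label, prob, outcome) tuples.
    Returns {round_label: (mean_brier, n_ties)} so thin buckets stay visible.
    """
    buckets = defaultdict(list)
    for round_label, prob, outcome in records:
        buckets[round_label].append((prob - outcome) ** 2)
    return {r: (sum(v) / len(v), len(v)) for r, v in buckets.items()}
```

Returning the bucket size next to each mean matters here: a per-round number computed on a handful of ties should be read as a sanity check, not an estimate.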

The conclusion I’m most willing to defend in writing: the knockout-tie data regime doesn’t reward additional signal beyond Elo + xG + injuries. Bayern’s strength at the start of a tie is, to first order, very well captured by a long-running rating system. The marginal information a TSFM extracts from 30 matches of trajectory is information that Club Elo (which is updated after each match) already encodes. The two priors are correlated; the TSFM doesn’t add a SECOND signal, it just produces a noisier version of the FIRST one.
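
The "noisier version of the same signal" argument can be illustrated with a toy simulation. Everything here is assumed: the true strength gap is a standard normal on the logit scale, the Elo-like prior observes it with small noise, the TSFM-like prior observes the same gap with larger noise, and the blend is a plain 50/50 average. None of these numbers come from the backtest.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_brier(q, p):
    """Expected Brier loss of forecast q when the true win probability is p."""
    return p * (1 - q) ** 2 + (1 - p) * q ** 2


def compare_priors(n_ties=20_000, elo_noise=0.1, tsfm_noise=0.3, seed=0):
    """Return (mean loss of the Elo-like prior, mean loss of the 50/50 blend)."""
    rng = random.Random(seed)
    elo_loss = 0.0
    blend_loss = 0.0
    for _ in range(n_ties):
        gap = rng.gauss(0.0, 1.0)                # true strength gap (logit scale)
        p_true = sigmoid(gap)
        elo = gap + rng.gauss(0.0, elo_noise)    # low-noise view of the gap
        tsfm = gap + rng.gauss(0.0, tsfm_noise)  # noisier view of the SAME gap
        elo_loss += expected_brier(sigmoid(elo), p_true)
        blend_loss += expected_brier(sigmoid(0.5 * (elo + tsfm)), p_true)
    return elo_loss / n_ties, blend_loss / n_ties
```

Under these assumptions the blend carries more noise than the Elo-like prior alone (variance 0.025 versus 0.01 on the logit scale), so averaging in the second prior costs accuracy rather than adding it. A second prior only helps when its errors are small enough or independent enough to offset that.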

The contrast: World Cup, where TSFMs ARE earning their seat

The reason I’m not just throwing TSFMs out of the toolkit is that on worldcup-oracle the same modeling approach (TSFM-prior + Poisson + Monte Carlo) is producing STRONG edges versus Polymarket. Spain at AI 32% vs market 16% is the headline; multiple sells (Brazil, England, Portugal) at -5pp to -6pp with 4-of-4 model agreement.

What’s different about World Cup vs UCL knockouts?

Sample size per tournament. World Cup is 104 matches across 48 teams; UCL knockouts are 29 matches across 16 teams. TSFMs benefit from a larger fitting universe.
Less mature rating system for the prior. Polymarket on World Cup is pricing aggregate market sentiment, which has historically been a weaker prior than Club Elo for UCL clubs.
Longer-horizon trajectories matter. World Cup teams play each other rarely; TSFMs learn from match history per team, then the tournament tests strength composition. UCL clubs play each other every season; recency is already baked into Elo.

The takeaway is not “TSFMs are bad at sports forecasting.” The takeaway is “TSFMs are tools, and they pay rent on data regimes where the existing prior is weak or the sample is large enough to let the model differentiate signal from noise.”

The general lesson

Every time you read about a Time Series Foundation Model “beating” a domain baseline, the question to ask first is: what’s the existing prior, and how strong is it? TSFMs are a generic prior. Where the existing domain prior is weak (e.g., zero-shot forecasting on a new dataset), TSFMs help — they bring SOME structure into a place that had none. Where the existing domain prior is strong (e.g., 20 years of refined Elo on football), TSFMs are running uphill against a baseline that already absorbed the same information.

This is why I keep TSFMs in --with-tsfm rather than deleting them from the repo. They are still the right tool when:

I move to a new sport / dataset with no curated prior
I’m doing a research run where the goal is “what does the foundation model see in this data”
I need a sanity check on whether my domain prior is missing something obvious
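
Keeping the layer behind a flag is cheap to do. A minimal sketch with `argparse`: only the `--with-tsfm` flag name comes from this post, while the parser description, the `choose_prior` helper, and the returned labels are hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Tie forecast runner (sketch)")
    parser.add_argument(
        "--with-tsfm",
        action="store_true",  # defaults to False, so production stays on Elo
        help="use the TSFM ensemble as the strength prior (research/ablation)",
    )
    return parser


def choose_prior(args):
    """Pick which prior feeds the goal model; the labels are placeholders."""
    return "tsfm_ensemble" if args.with_tsfm else "club_elo"


if __name__ == "__main__":
    print(choose_prior(build_parser().parse_args()))
```

Because the flag is opt-in, the default run never touches the foundation models, and an ablation is one command-line switch away.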

It’s also why I wrote this post instead of quietly removing the code. Negative results are useful, especially in forecasting, where the loudest published claims are positive ones and the silent failures are everywhere. If you’re building a model and your TSFM ensemble doesn’t outperform your domain baseline, that’s not a failure of the ensemble. It’s information about your baseline.