Time Series Foundation Models on equities: the hostile case

The headline benchmarks for Time Series Foundation Models — Chronos, TimesFM, FlowState — are mostly drawn from well-behaved series: energy load, retail demand, tourism, macroeconomic indicators. These are the forecasting problems where the signal is large relative to the noise, and where a generic model can coast on its pretraining.

US equities are none of that. They are the hostile case: low signal, heavy tails, regime shifts that don’t announce themselves, and the embarrassing fact that a random walk beats most published forecasters on next-day returns. If you want to know what a TSFM does when the physics of the data stops being friendly, this is the test.

What gets measured, and what doesn’t

Most TSFM release papers ship with two kinds of metrics:

Aggregate error over a benchmark suite. Useful for model comparison, largely meaningless for a single domain.
Per-task headline numbers. Chosen to be flattering. Often the pretraining distribution overlaps with the evaluation domain more than the paper admits.

What’s rarely measured, and what actually matters on equities:

Distributional calibration. A model can be right on average and still produce prediction intervals that are nonsense. On equities, getting the intervals wrong gets you blown up faster than getting the point estimate wrong.
Horizon degradation. Winners at h=1 are not necessarily winners at h=20. Many TSFMs have fundamentally different behavior across horizons, and averaging across them hides the thing you need to see.
Loss against naive baselines, per asset. The random walk and last-value baselines (the same forecast on prices when the walk has no drift) should show up in the same table as the foundation models. If they don’t, the report is selling, not evaluating.
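To make the last two points concrete, here is a minimal sketch of a per-asset, per-horizon comparison against a random walk. The function names, the "TOY" ticker, and the numbers are all invented for illustration; nothing here is taken from a specific benchmark. The idea is just to report the model's MAE as a ratio to the naive MAE at each horizon step, so a value above 1.0 is a loss to the baseline that an averaged score would hide.

```python
import numpy as np

def random_walk_forecast(history, horizon):
    """Naive price baseline: repeat the last observed value at every step."""
    return np.full(horizon, float(history[-1]))

def relative_mae_by_horizon(actual, model_fc, naive_fc):
    """Model MAE divided by naive MAE, one value per horizon step.

    All inputs have shape (n_origins, horizon). Below 1.0 the model beat
    the baseline at that step; above 1.0 it lost. Assumes the naive MAE
    is nonzero at every step.
    """
    actual = np.asarray(actual, dtype=float)
    model_mae = np.abs(np.asarray(model_fc, dtype=float) - actual).mean(axis=0)
    naive_mae = np.abs(np.asarray(naive_fc, dtype=float) - actual).mean(axis=0)
    return model_mae / naive_mae

def per_asset_table(results):
    """results maps ticker -> (actual, model_fc, naive_fc); returns ticker -> ratios."""
    return {ticker: relative_mae_by_horizon(*arrays) for ticker, arrays in results.items()}

# Toy numbers, invented purely for illustration (two origins, horizon of 2).
actual = np.array([[101.0, 102.0], [103.0, 104.0]])
naive_fc = np.vstack([
    random_walk_forecast([99.0, 100.0], 2),
    random_walk_forecast([101.0, 102.0], 2),
])
model_fc = np.array([[100.5, 103.0], [103.5, 102.0]])

table = per_asset_table({"TOY": (actual, model_fc, naive_fc)})
# table["TOY"] is [0.5, 0.75]: the model's edge shrinks from step 1 to step 2.
```

Keeping the ratio per step and per ticker, rather than pooling, is the design choice that matters here: the same table shows both horizon degradation and which assets the model quietly loses on.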

What zero-shot is, and what it isn’t

“Zero-shot” is the claim that matters for practitioners: can I point this model at my series and get something usable without fine-tuning? That’s a real, useful question. It’s also a much harder bar than the “tuned on your domain” bar the papers often test against implicitly.

An honest zero-shot evaluation on equities needs:

Rolling-origin evaluation. Each forecast sees only data up to its origin. No peeking.
The same window, horizon, and metric across all models. One comparison surface.
Metrics split by horizon, not just averaged.
A calibration track separate from accuracy. Prediction interval coverage, PIT histograms, quantile losses.
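The checklist above can be sketched roughly as follows. This is an illustrative skeleton, not a reference implementation: the helper names are my own, the model call is left out entirely, and the PIT helper assumes the model returns sample paths (quantile-only models would need a different construction). Every metric is returned per horizon step so nothing gets averaged away by default.

```python
import numpy as np

def rolling_origins(n_obs, context, horizon, step=1):
    """Yield (context_slice, target_slice) pairs; targets always follow the context."""
    origin = context
    while origin + horizon <= n_obs:
        yield slice(origin - context, origin), slice(origin, origin + horizon)
        origin += step

def interval_coverage(actual, lower, upper):
    """Fraction of actuals inside [lower, upper], one value per horizon step."""
    actual, lower, upper = (np.asarray(a, dtype=float) for a in (actual, lower, upper))
    return ((actual >= lower) & (actual <= upper)).mean(axis=0)

def pinball_loss(actual, quantile_fc, q):
    """Mean quantile (pinball) loss at level q, one value per horizon step."""
    diff = np.asarray(actual, dtype=float) - np.asarray(quantile_fc, dtype=float)
    return np.maximum(q * diff, (q - 1.0) * diff).mean(axis=0)

def pit_values(actual, samples):
    """Empirical PIT for one horizon step: share of forecast samples <= the actual.

    actual has shape (n_origins,), samples has shape (n_origins, n_samples).
    A calibrated forecaster gives roughly uniform PIT values on [0, 1].
    """
    actual = np.asarray(actual, dtype=float)[:, None]
    return (np.asarray(samples, dtype=float) <= actual).mean(axis=1)

# Toy usage with invented numbers.
splits = list(rolling_origins(n_obs=10, context=4, horizon=2, step=2))
# Three origins: contexts [0:4], [2:6], [4:8] with targets [4:6], [6:8], [8:10].

coverage = interval_coverage(
    actual=[[1.0, 2.0], [3.0, 4.0]],
    lower=[[0.0, 3.0], [2.0, 3.0]],
    upper=[[2.0, 5.0], [4.0, 5.0]],
)
# coverage is [1.0, 0.5]: fine at step 1, half the actuals missed at step 2.

loss_q90 = pinball_loss(actual=[[1.0], [3.0]], quantile_fc=[[2.0], [2.0]], q=0.9)
pit = pit_values(actual=[2.5], samples=[[1.0, 2.0, 3.0, 4.0]])
```

Feeding every model the same slices from `rolling_origins` is what gives you one comparison surface; the calibration functions then run on exactly the same origins as the accuracy metrics.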

The real finding isn’t a ranking

The useful output of a benchmark like this is not “Model X wins.” The useful output is a map of failure modes: which models degrade gracefully vs. cliff at long horizons, which are overconfident vs. conservatively wrong, which beat a random walk and which quietly don’t.

If you publish a single leaderboard number and leave the calibration and horizon structure out, you’ve written a blog post, not a benchmark. The thing I find myself doing in fin-forecast-arena is fighting the urge to compress all of that back into one line.

Next question

Does a conformalized version of each TSFM narrow the calibration gap enough to matter on this domain? That’s the follow-up track. Accuracy is the headline; calibration is the thing that decides whether a forecast is safe to act on.
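For readers who haven’t met the term, here is a minimal sketch of one common way to conformalize a point forecaster: split conformal prediction on absolute residuals. This is a swapped-in textbook technique, not necessarily the variant the follow-up will use, and the function names are hypothetical. One loud caveat: the coverage guarantee assumes exchangeable data, which equities with regime shifts do not give you, so whether the adjusted intervals hold up is exactly the open question.

```python
import numpy as np

def conformal_halfwidth(cal_actual, cal_forecast, alpha=0.1):
    """Half-width q so that [forecast - q, forecast + q] targets 1 - alpha coverage.

    Uses absolute residuals from a held-out calibration window and takes the
    ceil((n + 1) * (1 - alpha))-th smallest one, the usual finite-sample rank.
    """
    residuals = np.asarray(cal_actual, dtype=float) - np.asarray(cal_forecast, dtype=float)
    scores = np.sort(np.abs(residuals).ravel())
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points to support this alpha.
        return float("inf")
    return float(scores[k - 1])

def conformal_interval(point_forecast, halfwidth):
    """Symmetric interval around each point forecast."""
    point_forecast = np.asarray(point_forecast, dtype=float)
    return point_forecast - halfwidth, point_forecast + halfwidth

# Toy calibration set, invented for illustration: residuals of size 1 through 9.
cal_actual = np.arange(1.0, 10.0)
cal_forecast = np.zeros(9)
q = conformal_halfwidth(cal_actual, cal_forecast, alpha=0.2)  # 8th smallest of 9 -> 8.0
lower, upper = conformal_interval([100.0, 101.0], q)
```

In a per-horizon setup you would compute a separate half-width for each step, since residuals at h=20 are not comparable to residuals at h=1.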