fin-forecast-arena — Shuai Yao

Benchmarking SOTA Time Series Foundation Models (Chronos-2, TimesFM 2.5, FlowState) on US equities.

Context

Time Series Foundation Models (TSFMs) are the 2025–2026 story in forecasting: a single pretrained model that claims zero-shot performance across domains. The public benchmarks that ship with these releases lean on well-behaved time series — energy load, retail demand, tourism — where the signal-to-noise ratio is generous.

US equities are not generous. They are the hostile environment: low signal, heavy tails, regime shifts, and the embarrassing fact that a random walk beats most published forecasters on next-day returns. So the question worth asking isn’t “do TSFMs work” — it’s “on equities specifically, which ones degrade the most gracefully, and where does each one break?”

Approach

Setup

Models under test: Chronos-2, TimesFM 2.5, FlowState. All zero-shot, no fine-tuning, so the comparison measures what you actually get out of the box.
Universe: US equities, held fixed across runs.
Protocol: a fixed set of forecast horizons, a fixed rolling window, and fixed metrics, so the ranking you read off the repo is a comparison, not a demo.
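To make the fixed rolling window concrete, here is a minimal sketch of how the evaluation windows could be generated. It is not the repo's code: the function name `rolling_origin_splits` and the parameters `context_len`, `horizon`, and `step` are illustrative assumptions.

```python
from typing import Iterator, Tuple


def rolling_origin_splits(
    n_obs: int, context_len: int, horizon: int, step: int = 1
) -> Iterator[Tuple[range, range]]:
    """Yield (context_idx, target_idx) index ranges for a fixed rolling window.

    Hypothetical helper: every context has the same length and ends right
    before its target block, so a forecast never sees future observations.
    """
    if min(context_len, horizon, step) < 1:
        raise ValueError("context_len, horizon and step must be >= 1")
    # The origin is the first index of the target block.
    for origin in range(context_len, n_obs - horizon + 1, step):
        yield range(origin - context_len, origin), range(origin, origin + horizon)


splits = list(rolling_origin_splits(n_obs=10, context_len=4, horizon=2, step=2))
```

Because the same splits are reused for every model, differences in the scores come from the models rather than from the windows they happened to see.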

Choices I flagged

Zero-shot only. Fine-tuning each model on equities would be a different (and much bigger) project, and it would dissolve the thing I'm actually trying to measure: out-of-the-box performance.
Multiple horizons. A model that wins at h=1 and loses at h=20 is a different story than one that wins at both; the report separates them.
No shortcuts on leakage. Rolling-origin evaluation, strict train/val/test discipline, no peeking: every forecast sees only data up to its origin.
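The point about separating horizons can be sketched as follows. This is an illustration under assumptions, not the repo's implementation: forecasts are assumed to be stored as arrays of shape (n_windows, H), MAE stands in for whatever metric the report actually uses, and the names `mae_at_horizons` and `rank_by_horizon` are hypothetical.

```python
import numpy as np


def mae_at_horizons(forecasts, actuals, horizons):
    """Mean absolute error at selected forecast steps (1-indexed)."""
    forecasts = np.asarray(forecasts, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    if forecasts.shape != actuals.shape:
        raise ValueError("forecasts and actuals must have the same shape")
    abs_err = np.abs(forecasts - actuals)
    # Column h - 1 holds the h-step-ahead errors across all windows.
    return {h: float(abs_err[:, h - 1].mean()) for h in horizons}


def rank_by_horizon(model_forecasts, actuals, horizons):
    """Return, for each horizon, model names ordered from best to worst MAE."""
    scores = {
        name: mae_at_horizons(f, actuals, horizons)
        for name, f in model_forecasts.items()
    }
    return {h: sorted(scores, key=lambda name: scores[name][h]) for h in horizons}


# Toy data (made up): model_a is better at h=1, model_b at h=20.
actuals = np.zeros((3, 20))
model_a = np.zeros((3, 20)); model_a[:, 0] = 0.1; model_a[:, 19] = 2.0
model_b = np.zeros((3, 20)); model_b[:, 0] = 0.5; model_b[:, 19] = 1.0
ranking = rank_by_horizon({"model_a": model_a, "model_b": model_b}, actuals, [1, 20])
```

Reporting one ranking per horizon, instead of a single average over all steps, keeps a flip like the one in the toy data visible.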

Results & What I Learned

Benchmark results live in the repo. I’m deliberately not reproducing numeric rankings here — the repo is where the evaluation protocol lives, which is the thing that makes the numbers comparable. Numbers lifted out of context are exactly how you mislead an interviewer.

What the project taught me:

Zero-shot TSFM performance on financial time series is far more variable by horizon than the headline benchmarks suggest. Winners change as h grows.
The interesting failure mode is not accuracy; it's distributional miscalibration. A model can be “right on average” and still produce prediction intervals that cover the realized values far less often than their nominal level.
Any honest equity forecasting eval has to report where the models lose to a naive baseline (random walk / last-value), not hide it.
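Two of the checks above are easy to state in code. The sketch below is illustrative only and assumes per-window arrays of shape (n_windows, H); the function names are hypothetical, and MAE is an assumed stand-in for the repo's metrics. The first function scores a model relative to the last-value baseline, the second measures how often realized values fall inside the prediction intervals.

```python
import numpy as np


def relative_mae_vs_naive(last_values, forecasts, actuals):
    """Per-step model MAE divided by the MAE of a last-value forecast.

    last_values: shape (n_windows,), the final observed value of each context.
    Values above 1.0 mean the model loses to the naive baseline at that step.
    """
    last_values = np.asarray(last_values, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    # The naive forecast repeats the last observed value for every step.
    naive = np.broadcast_to(last_values[:, None], actuals.shape)
    model_mae = np.abs(forecasts - actuals).mean(axis=0)
    naive_mae = np.abs(naive - actuals).mean(axis=0)
    # Note: undefined (inf/nan) if the naive forecast is exactly right.
    return model_mae / naive_mae


def interval_coverage(lower, upper, actuals):
    """Fraction of realized values that fall inside [lower, upper]."""
    lower, upper, actuals = (np.asarray(a, dtype=float) for a in (lower, upper, actuals))
    return float(((actuals >= lower) & (actuals <= upper)).mean())


# Toy data (made up): the model wins at step 1 and loses at step 2.
ratio = relative_mae_vs_naive(
    last_values=[10, 20],
    forecasts=[[11.5, 15], [20.5, 25]],
    actuals=[[11, 12], [21, 22]],
)
coverage = interval_coverage(
    lower=[[10, 10], [20, 20]],
    upper=[[12, 11], [22, 23]],
    actuals=[[11, 12], [21, 22]],
)
```

A nominal 80% interval whose empirical coverage sits well below 0.8 is miscalibrated even when the point forecast looks fine, which is the gap between being “right on average” and being usable.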

Next: add a calibration-focused evaluation track and see whether conformalized versions narrow the gap.
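As a rough idea of what a conformalized version could look like, here is a minimal sketch of split conformal prediction applied to point forecasts. It is an assumption about one possible approach, not a description of planned code: the function names are hypothetical, and the absolute residual is just one choice of conformity score.

```python
import math

import numpy as np


def conformal_half_width(cal_forecasts, cal_actuals, alpha=0.1):
    """Half-width of a split-conformal interval from calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest absolute residual.
    Returns inf when the calibration set is too small for the requested level.
    """
    residuals = np.abs(
        np.asarray(cal_actuals, dtype=float) - np.asarray(cal_forecasts, dtype=float)
    )
    scores = np.sort(residuals.ravel())
    n = scores.size
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return float(scores[k - 1])


def conformal_interval(point_forecast, half_width):
    """Symmetric interval around a point forecast."""
    return point_forecast - half_width, point_forecast + half_width


# Toy calibration set (made up): absolute residuals are 1..10.
cal_actuals = [3, -1, 10, -7, 2, 5, -4, 9, 6, -8]
cal_forecasts = [0] * 10
width = conformal_half_width(cal_forecasts, cal_actuals, alpha=0.2)
interval = conformal_interval(100.0, width)
```

The coverage guarantee of split conformal prediction relies on exchangeability between calibration and test points, which regime shifts in equities can violate, so whether it narrows the gap here is an empirical question.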

Tech Stack & Links

Stack: Python · Chronos-2 · TimesFM 2.5 · FlowState

Links: GitHub