After the twenty-clones post, the question I kept getting was the same one: fine, if the LLM is a confident intern, what should generate the signal instead?
The honest answer is boring. For the part of the stack that turns market state into a number — long, short, flat, size — you almost always want classical ML, not an LLM. Not because ML is "smarter." Because trading pays for a specific set of properties, and those are exactly the properties an LLM gives up.
So instead of arguing it, here is the comparison one row at a time. Eight dimensions. For each one: the ML answer, the LLM answer, and how decisive the gap is for signal generation specifically.
01 The question people actually mean
"Which is better, ML or LLM?" is the wrong question, the same way "which LLM is best for trading?" is the wrong question. Better at what? A signal generator has a job description: take the current state of the market, emit a position. Do it millions of times across a backtest. Do it the same way every time so the backtest means something. Do it cheaply enough that density does not bankrupt you. Do it in a way a risk officer can read.
Score the two families against that job, and most rows are not close.
02 Eight dimensions, one row at a time
| Dimension (the question) | ML (HMM · scikit-learn · PyTorch) | LLM (large language models) | Verdict (strength 0–5) |
|---|---|---|---|
| 01 What goes in? | Curated numeric features. You pre-compute returns, ratios, indicators. The model sees floats, never strings. | Natural language prompt. You write the question in English (or stuff data into the context window as text). | Depends |
| 02 What does the model assume about the world? | Bounded · enumerable. An HMM has N hidden states you declared. If the world doesn't fit, the fit fails loudly. | Unbounded · sampled. No declared state space. A bad fit looks like fluent text; you only find out at P&L time. | ML wins · falsifiability |
| 03 When is the heavy compute paid? | Once · offline. Heavy fit() at training time. After that, inference is essentially free. | Every call · runtime. No “trained once” stage you own. Each signal request pays the full inference cost again. | ML wins · cost shape |
| 04 What does one signal actually cost? | µs · CPU · local. Per-signal cost: microseconds on the same CPU that runs the backtest. Density doesn't hurt. | seconds · GPU · per-token $. Per-signal cost: seconds + GPU + per-token billing. Ten years × 1-min bars gets pricey fast. | ML wins · backtest density |
| 05 How does it fail? | Drifts slowly · re-train. Failure mode is concept drift over weeks. Remediation is a re-fit you can schedule. | Per-call · validate or retry. Failure is per call: a hallucinated field, a malformed number. Every response needs a parser and a retry path. | ML wins · predictability |
| 06 Can you explain a decision? | Parameters readable. Transition matrix, weights, coefficients. A compliance officer can read them. | Black box · post-hoc only. You can ask why in another prompt. The answer is itself a sampled story, not the actual cause. | ML wins · compliance |
| 07 Is the output reproducible? | Same input → same output. A backtest on Monday matches the backtest on Friday. Reproducibility is free. | Same input → different output. Temperature, top-p, model-version drift. Reproducibility costs you temperature=0 + version pinning + luck. | ML wins · backtest trust |
| 08 When does the LLM actually win? | Not the right tool. Building structured output from a paragraph of news? ML needs heavy NLP scaffolding to even start. | Text → structured fields. Earnings transcript → sentiment vector. Filings → event tags. Upstream of the signal generator, not the signal generator itself. | LLM wins · ingestion |
The scores underneath are 0–5 per row — a strength-of-verdict, not a benchmark (more on why below). Add them up the way the figure does, weighted by how decisive each dimension is, and it comes out ML 28, LLM 6. The LLM column is not losing on technicalities. It is losing on the dimensions trading actually pays for: cost shape, reproducibility, interpretability, and a failure mode you can see coming.
A few of these deserve more than a table cell.
Cost shape (rows 03–04). This is the one people underestimate. An HMM or a gradient-boosted model pays its heavy compute once, in fit(). After that a signal costs microseconds on the same CPU running the backtest. An LLM has no "trained once" stage that you own — every signal request pays full inference again, in seconds, on a GPU, metered per token. Ten years of one-minute bars is a few million signals. Do that math with per-token billing before you commit.
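To make that arithmetic concrete, here is a back-of-envelope sketch. The bar counts follow from the calendar: "a few million" matches a round-the-clock market, while a US equity regular session comes to roughly one million. The tokens per call and the price per million tokens are placeholder assumptions, not quotes from any provider, so substitute your own numbers.

```python
# Back-of-envelope cost shape for a dense backtest.
# All pricing inputs below are hypothetical placeholders, not real quotes.

MINUTES_PER_DAY_24H = 24 * 60          # round-the-clock market
MINUTES_PER_US_EQUITY_SESSION = 390    # 09:30-16:00 regular session


def signal_count(years: int, days_per_year: int, bars_per_day: int) -> int:
    """Number of one-minute bars, i.e. signals, in the backtest window."""
    return years * days_per_year * bars_per_day


def llm_backtest_cost(signals: int, tokens_per_call: int,
                      usd_per_million_tokens: float) -> float:
    """Total metered cost if every signal is one LLM call."""
    return signals * tokens_per_call * usd_per_million_tokens / 1_000_000


equity_signals = signal_count(10, 252, MINUTES_PER_US_EQUITY_SESSION)
round_the_clock_signals = signal_count(10, 365, MINUTES_PER_DAY_24H)

# Assumed: 1,000 tokens per call (prompt + response) at $1 per million tokens.
cost = llm_backtest_cost(round_the_clock_signals, tokens_per_call=1_000,
                         usd_per_million_tokens=1.0)

print(f"US equity session bars: {equity_signals:,}")            # 982,800
print(f"Round-the-clock bars:   {round_the_clock_signals:,}")   # 5,256,000
print(f"One LLM pass at assumed pricing: ${cost:,.2f}")         # $5,256.00
```

That figure is for a single pass over the data. Every parameter sweep or re-run multiplies it, which is where the cost shape starts to matter more than the unit price.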
Reproducibility (row 07). A backtest you run on Monday should match the one you run on Friday. With ML that is free: same input, same output. With an LLM you are fighting temperature, top-p, and silent model-version drift. You can claw some of it back with temperature=0 and version pinning, but "some" is not "all," and a backtest you cannot reproduce is a story, not evidence.
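One way to enforce the Monday-equals-Friday property is to fingerprint the full signal series and compare digests between runs. The sketch below uses a toy fixed-weight rule as a stand-in for a fitted model; `toy_signal`, `WEIGHTS`, `THRESHOLD`, and `fingerprint` are illustrative names I made up, not part of any library.

```python
import hashlib
import json

# Hypothetical frozen parameters standing in for a fitted model.
WEIGHTS = (0.6, -0.3, 0.1)
THRESHOLD = 0.05


def toy_signal(features: tuple[float, float, float]) -> int:
    """Map one feature row to -1 (short), 0 (flat) or +1 (long)."""
    score = sum(w * x for w, x in zip(WEIGHTS, features))
    if score > THRESHOLD:
        return 1
    if score < -THRESHOLD:
        return -1
    return 0


def fingerprint(signals: list[int]) -> str:
    """SHA-256 digest of the whole signal series, for run-to-run comparison."""
    payload = json.dumps(signals, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


bars = [(0.2, 0.1, 0.0), (-0.3, 0.2, 0.1), (0.05, 0.1, 0.2), (0.0, 0.0, 0.0)]

monday = [toy_signal(row) for row in bars]
friday = [toy_signal(row) for row in bars]

print(monday)                                      # [1, -1, 0, 0]
print(fingerprint(monday) == fingerprint(friday))  # True
```

Storing the digest next to each backtest report makes any drift visible immediately. With a sampled model the same check can fail between runs unless sampling and the model version are pinned, and even then it is worth verifying rather than assuming.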
Failure mode (row 05). ML fails slowly and loudly — concept drift over weeks, and the remediation is a re-fit you can schedule. An LLM fails per call — a hallucinated field, a malformed number — so every single response needs a parser and a retry path. One of these failure modes you monitor. The other you babysit.
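Here is roughly what "a parser and a retry path" means in practice. This is a minimal sketch, assuming the model is asked to reply with a JSON object holding a `side` and a `size`; those field names, the size range, and the `call_model` callable are hypothetical, and a real client would also need timeouts and logging.

```python
import json
from typing import Callable

ALLOWED_SIDES = {"long", "short", "flat"}


def parse_signal(raw: str) -> dict:
    """Validate one model response; raise ValueError on anything suspicious."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != {"side", "size"}:
        raise ValueError(f"unexpected fields: {sorted(data)}")
    side, size = data["side"], data["size"]
    if not isinstance(side, str) or side not in ALLOWED_SIDES:
        raise ValueError(f"unknown side: {side!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(size, bool) or not isinstance(size, (int, float)):
        raise ValueError("size must be a number")
    # The range check also rejects NaN and infinities.
    if not 0.0 <= size <= 1.0:
        raise ValueError(f"size out of range: {size}")
    return {"side": side, "size": float(size)}


def signal_with_retry(call_model: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call the model until a response validates; fall back to flat otherwise."""
    for _ in range(max_attempts):
        try:
            return parse_signal(call_model())
        except ValueError:
            continue  # malformed or hallucinated response: try again
    return {"side": "flat", "size": 0.0}  # conservative default


# Stand-in for an LLM client: first reply is malformed, second is valid.
replies = iter(['{"side": "long", "size": "big"}', '{"side": "long", "size": 0.25}'])
print(signal_with_retry(lambda: next(replies)))  # {'side': 'long', 'size': 0.25}
```

Note what the validator cannot do: it catches a malformed response, not a well-formed wrong one. A syntactically perfect signal pointing the wrong way passes every check here.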
03 Why I am not putting accuracy numbers on this
You will notice there is no "ML is 23% more accurate" number anywhere. That is deliberate.
The accuracy gap between a fitted model and a base LLM on signal generation is real, but the number depends entirely on the task, frequency, asset, horizon, and metric. Pin a number to it and the only response you get is "that is not the experiment I would run." Pin the structure instead, and the response is "now I see why one fits and the other does not." Structure travels; a single benchmark does not.
04 The one row where the LLM actually wins
Row 08 is not a courtesy. There is a real job the LLM wins outright, and it matters: turning unstructured text into structured fields. An earnings transcript into a sentiment vector. A news wire into event tags. An SEC filing into a set of features. ML needs heavy NLP scaffolding just to start that; the LLM does it out of the box.
But look at where that job sits. It is upstream of the signal generator, not the signal generator itself. The LLM builds features. Something else turns features into a position.
05 Where this leaves a serious stack
Once you stop asking "ML or LLM?" and start asking "which tool for which layer?", the picture resolves into three layers and three different tools:
natural language → structured features: LLM
structured features → signal pack: ML
signal pack → portfolio allocation: neither (that layer is an optimizer)
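The three layers above can be sketched as three functions with typed hand-offs between them. Everything here is a stand-in chosen for illustration: `fake_extractor` replaces the LLM call with a keyword match so the example runs offline, the fixed weights in `generate_signal` stand in for a fitted model, proportional scaling in `allocate` stands in for a real optimizer, and the tickers `AAA` and `BBB` are made up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TextFeatures:
    """Structured fields extracted from text (layer 1 output)."""
    sentiment: float              # assumed range: -1.0 .. 1.0
    event_tags: tuple[str, ...]


# Layer 1: natural language -> structured features. In practice an LLM call;
# here it is injected as a callable so the sketch stays runnable offline.
Extractor = Callable[[str], TextFeatures]


def generate_signal(features: TextFeatures, momentum: float) -> float:
    """Layer 2: structured features -> signal. Fixed hypothetical weights
    stand in for a fitted model; output is clipped to [-1, 1]."""
    score = 0.5 * features.sentiment + 0.5 * momentum
    return max(-1.0, min(1.0, score))


def allocate(signals: dict[str, float], gross_limit: float = 1.0) -> dict[str, float]:
    """Layer 3: signal pack -> portfolio weights. Plain proportional
    scaling stands in for a real optimizer."""
    gross = sum(abs(s) for s in signals.values())
    if gross == 0.0:
        return {name: 0.0 for name in signals}
    return {name: gross_limit * s / gross for name, s in signals.items()}


def fake_extractor(text: str) -> TextFeatures:
    """Offline stand-in for the LLM layer (keyword match, illustration only)."""
    positive = "beat" in text.lower()
    return TextFeatures(sentiment=0.8 if positive else -0.4,
                        event_tags=("earnings",))


extract: Extractor = fake_extractor
signals = {
    "AAA": generate_signal(extract("AAA beat estimates"), momentum=0.2),
    "BBB": generate_signal(extract("BBB missed estimates"), momentum=-0.2),
}
weights = allocate(signals)
print(signals)
print(weights)
```

The point of the typed boundary is that the middle layer only ever sees `TextFeatures` and floats. Whatever the upstream layer does with text, the signal function stays deterministic and cheap to call.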
Each layer has a right tool, and picking the wrong one is its own failure mode. But the most expensive mistake on the table right now is putting the LLM in the middle layer, making it the signal generator, where it competes with ML on the six dimensions it loses.
LLM upstream. ML in the engine room. The clone problem from last time was really just this mistake wearing a different costume: an LLM doing the signal generator's job when it belongs one layer up.
What does your split look like? If you are running an LLM somewhere in the signal path, I am curious which layer it sits in — and what catches it when it is wrong.

