# ML vs LLM as Signal Generators: Eight Dimensions Where the Choice Is Already Made

**Published:** 2026-05-25  
**Author:** StratCraft  
**Category:** Insights  
**Tags:** ai, trading, llm, machine-learning, signals  
**Canonical URL:** https://stratcraft.ai/blog/ml-vs-llm-signal-generators  

---

After the twenty-clones post (https://stratcraft.ai/blog/llm-trading-strategies-clone-problem), the question I kept getting was the same one: fine, if the LLM is a confident intern, what should generate the signal instead?

The honest answer is boring. For the part of the stack that turns market state into a number — long, short, flat, size — you almost always want classical ML, not an LLM. Not because ML is "smarter." Because trading pays for a specific set of properties, and those are exactly the properties an LLM gives up.

So instead of arguing it, here is the comparison one row at a time. Eight dimensions. For each one: the ML answer, the LLM answer, and how decisive the gap is for signal generation specifically.

## The question people actually mean

"Which is better, ML or LLM?" is the wrong question, the same way "which LLM is best for trading?" is the wrong question. Better at what? A signal generator has a job description: take the current state of the market, emit a position. Do it millions of times across a backtest. Do it the same way every time so the backtest means something. Do it cheaply enough that density does not bankrupt you. Do it in a way a risk officer can read.

Score the two families against that job, and most rows are not close.

## Eight dimensions, one row at a time

| # | The question | ML | LLM | Verdict |
|:--:|---|---|---|---|
| 01 | What goes in? | Curated numeric features | Natural-language prompt | even — depends |
| 02 | What does it assume about the world? | Bounded, enumerable state; a bad fit fails loudly | Unbounded, sampled; a bad fit reads as fluent text | ML — falsifiability |
| 03 | When is the heavy compute paid? | Once, offline — then inference is ~free | Every call, at runtime | ML — cost shape |
| 04 | What does one signal cost? | Microseconds, local CPU | Seconds, GPU, per-token | ML — backtest density |
| 05 | How does it fail? | Drifts slowly → re-fit | Hallucinates per call → validate/retry | ML — predictability |
| 06 | Can you explain a decision? | Parameters are readable | Black box, post-hoc only | ML — compliance |
| 07 | Is the output reproducible? | Same input → same output | Same input → different output | ML — backtest trust |
| 08 | When does the LLM win? | Wrong tool for text → structure | Text → structured fields | LLM — NL ingestion |

The scores underneath are 0-5 per row: a strength-of-verdict rubric, not a benchmark. Add them up the way the figure does, weighted by how decisive each dimension is, and the structure heavily favors ML for this job. The LLM column is not losing on technicalities. It is losing on the dimensions trading actually pays for: cost shape, reproducibility, interpretability, and a failure mode you can see coming.

A few of these deserve more than a table cell.

**Cost shape (rows 03–04).** This is the one people underestimate. An HMM or a gradient-boosted model pays its heavy compute once, in fit(). After that a signal costs microseconds on the same CPU running the backtest. An LLM has no "trained once" stage that you own: every signal request pays full inference again, in seconds, on a GPU, metered per token. Ten years of one-minute bars is about a million signals per instrument on a US equity session (390 minutes × 252 days × 10 years) and over five million on a market that trades around the clock. Do that math with per-token billing before you commit.
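To make the one-minute-bar arithmetic concrete, here is a minimal back-of-envelope sketch. The session lengths are standard calendar figures; the token count per call, the price per million tokens, and the one-second latency are hypothetical placeholders chosen for illustration, not quotes from any provider.

```python
# Back-of-envelope: how many signals is ten years of one-minute bars,
# and what does an LLM call per bar cost under assumed pricing?
YEARS = 10


def bars(minutes_per_day: int, days_per_year: int, years: int = YEARS) -> int:
    """Number of one-minute bars (one signal per bar) for one instrument."""
    return minutes_per_day * days_per_year * years


us_equity_bars = bars(390, 252)   # 6.5-hour session, ~252 trading days a year
always_on_bars = bars(1440, 365)  # market that trades 24/7

# Hypothetical assumptions, replace with your own numbers.
TOKENS_PER_CALL = 1_000           # assumed prompt + completion tokens
USD_PER_MILLION_TOKENS = 1.00     # assumed blended price
SECONDS_PER_CALL = 1.0            # assumed latency, calls made serially


def llm_backtest_cost(n_signals: int) -> float:
    """Total USD for one backtest pass under the assumed pricing."""
    return n_signals * TOKENS_PER_CALL * USD_PER_MILLION_TOKENS / 1_000_000


def llm_backtest_days(n_signals: int) -> float:
    """Wall-clock days for one serial backtest pass at the assumed latency."""
    return n_signals * SECONDS_PER_CALL / 86_400


for label, n in [("US equity", us_equity_bars), ("24/7", always_on_bars)]:
    print(f"{label}: {n:,} signals, "
          f"${llm_backtest_cost(n):,.2f}, "
          f"{llm_backtest_days(n):.1f} days serial")
```

Those figures are for one instrument and one pass. Multiply by the number of instruments and by every parameter sweep or re-run, which is where a fitted model's near-zero inference cost does most of its work.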

**Reproducibility (row 07).** A backtest you run on Monday should match the one you run on Friday. With a fitted ML model that is close to free: fix the training seed and the fitted parameters, and the same input gives the same output. With an LLM you are fighting temperature, top-p, and silent model-version drift. You can claw some of it back with temperature=0 and version pinning, but "some" is not "all," and a backtest you cannot reproduce is a story, not evidence.
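One way to turn "Monday should match Friday" into a check is to fingerprint the full signal sequence and compare digests across runs. The sketch below is illustrative only: `momentum_signal` is a hypothetical toy rule standing in for any fitted model, and `fingerprint` is a helper name invented here.

```python
import hashlib
import json


def fingerprint(signals: list) -> str:
    """Stable SHA-256 digest of a signal sequence."""
    payload = json.dumps(signals, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def momentum_signal(prices: list, lookback: int = 3) -> list:
    """Toy deterministic rule: +1, -1 or 0 by the sign of the lookback change."""
    out = []
    for i in range(len(prices)):
        if i < lookback:
            out.append(0)  # not enough history yet, stay flat
            continue
        change = prices[i] - prices[i - lookback]
        out.append(1 if change > 0 else -1 if change < 0 else 0)
    return out


prices = [100.0, 101.0, 100.5, 102.0, 101.5, 100.0, 99.5, 101.0]

monday = fingerprint(momentum_signal(prices))
friday = fingerprint(momentum_signal(prices))

# A deterministic generator produces the same digest on every run.
assert monday == friday
```

The same check can wrap an LLM-backed generator. If the digests differ between two runs on identical inputs, sampling or a model-version change has leaked into the backtest.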

**Failure mode (row 05).** ML fails slowly and loudly — concept drift over weeks, and the remediation is a re-fit you can schedule. An LLM fails per call — a hallucinated field, a malformed number — so every single response needs a parser and a retry path. One of these failure modes you monitor. The other you babysit.
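The "parser and retry path" can be sketched as follows. Everything here is an assumption made for illustration: the JSON shape (`side`, `size`), the allowed values, the retry count, and the `call_llm` callable, which stands in for whatever client you actually use and is stubbed with canned responses below.

```python
import json

ALLOWED_SIDES = {"long", "short", "flat"}  # assumed schema


def parse_signal(raw: str) -> dict:
    """Parse and validate one response; raise ValueError on anything off."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    side, size = data.get("side"), data.get("size")
    if not isinstance(side, str) or side not in ALLOWED_SIDES:
        raise ValueError(f"unknown side: {side!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(size, bool) or not isinstance(size, (int, float)):
        raise ValueError(f"size is not a number: {size!r}")
    # The range check also rejects NaN and infinities.
    if not 0.0 <= size <= 1.0:
        raise ValueError(f"size out of range: {size!r}")
    return {"side": side, "size": float(size)}


def signal_with_retry(call_llm, prompt: str, max_attempts: int = 3) -> dict:
    """Retry on invalid output; fall back to flat if every attempt fails."""
    for _ in range(max_attempts):
        try:
            return parse_signal(call_llm(prompt))
        except ValueError:
            continue  # in practice, log the error before retrying
    return {"side": "flat", "size": 0.0}


# Stub standing in for a real client: two bad responses, then a good one.
canned = iter([
    "side: long",                        # not JSON
    '{"side": "long", "size": "big"}',   # malformed number
    '{"side": "long", "size": 0.25}',    # valid
])
result = signal_with_retry(lambda prompt: next(canned), "hypothetical prompt")
print(result)
```

Falling back to flat is one design choice among several. The point is that the fallback has to be decided in advance, because the failure arrives per call rather than on a schedule.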

## Why I am not putting accuracy numbers on this

You will notice there is no "ML is 23% more accurate" number anywhere. That is deliberate.

The accuracy gap between a fitted model and a base LLM on signal generation is real, but the number depends entirely on the task, frequency, asset, horizon, and metric. Pin a number to it and the only response you get is "that is not the experiment I would run." Pin the structure instead, and the response is "now I see why one fits and the other does not." Structure travels; a single benchmark does not.

## The one row where the LLM actually wins

Row 08 is not a courtesy. There is a real job the LLM wins outright, and it matters: turning unstructured text into structured fields. An earnings transcript into a sentiment vector. A news wire into event tags. An SEC filing into a set of features. ML needs heavy NLP scaffolding just to start that; the LLM does it out of the box.

But look at where that job sits. It is upstream of the signal generator, not the signal generator itself. The LLM builds features. Something else turns features into a position.

## Where this leaves a serious stack

Once you stop asking "ML or LLM?" and start asking "which tool for which layer?", the picture resolves into three layers and three different tools:

- natural language → structured features → **LLM**
- structured features → signal pack → **ML**
- signal pack → portfolio allocation → **neither** (that layer is an optimizer)
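The three layers above can be sketched as three functions with narrow interfaces between them. This is a minimal illustration, not a reference design: every name is hypothetical, the keyword counter stands in for an LLM extraction call, the fixed linear rule stands in for a fitted model, and the proportional scaling stands in for a real optimizer.

```python
from dataclasses import dataclass
from typing import Dict, List

Features = Dict[str, float]


@dataclass(frozen=True)
class Signal:
    symbol: str
    score: float  # signed conviction in [-1, 1]


def extract_features(text: str) -> Features:
    """Layer 1 stand-in: text -> structured features (an LLM call in practice)."""
    words = text.lower().split()
    pos = sum(w in {"beat", "raised"} for w in words)
    neg = sum(w in {"miss", "cut"} for w in words)
    total = pos + neg
    return {"sentiment": 0.0 if total == 0 else (pos - neg) / total}


def generate_signal(symbol: str, features: Features) -> Signal:
    """Layer 2 stand-in: features -> signal (a fitted ML model in practice)."""
    score = max(-1.0, min(1.0, 0.8 * features["sentiment"]))
    return Signal(symbol, score)


def allocate(signals: List[Signal], gross_limit: float = 1.0) -> Dict[str, float]:
    """Layer 3 stand-in: signals -> weights (an optimizer in practice)."""
    gross = sum(abs(s.score) for s in signals)
    if gross == 0:
        return {s.symbol: 0.0 for s in signals}
    # Scale so the absolute weights sum to the gross exposure limit.
    return {s.symbol: gross_limit * s.score / gross for s in signals}


notes = {
    "AAA": "earnings beat and guidance raised",
    "BBB": "revenue miss and dividend cut",
}
signals = [generate_signal(sym, extract_features(txt)) for sym, txt in notes.items()]
weights = allocate(signals)
print(weights)
```

Because each layer only sees the previous layer's output type, swapping the tool inside one layer does not touch the other two.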

Each layer has a right tool, and picking the wrong one is its own failure mode. But the most expensive mistake on the table right now is putting the LLM in the middle layer, making it the signal generator, where it competes with ML on rows 02 through 07: the six dimensions it loses.

LLM upstream. ML in the engine room. The clone problem from last time was really just this mistake wearing a different costume: an LLM doing the signal generator's job when it belongs one layer upstream.

What does your split look like? If you are running an LLM somewhere in the signal path, I am curious which layer it sits in — and what catches it when it is wrong.