Multi-LLM Consensus for Trading: Why Single-Model Bots Lose Money

A single LLM misreads a large share of specific market signals: in our internal benchmarks, even the best single model is wrong about 45% of the time. Running seven frontier models in parallel and weighting their decisions by historical PnL cuts that error rate by roughly 78%. This is the architectural reason single-LLM trading bots burn capital, and what follows is a working blueprint for what to build instead.

Nick H

The 78% number, in plain English

We benchmark trade decisions on a held-out set of historical signals (news arrival, on-chain flow, options flow, and macro releases) where the realised market response is known. A single frontier model evaluated on these signals makes a directionally correct call about 51% of the time on noise-dominated signals and about 60% on cleaner setups. A weighted consensus across seven models makes directionally correct calls roughly 89% of the time.

The error rate on signal classification drops from ~45% (single best model) to ~10% (7-model weighted consensus). That is a 78% relative reduction. — NickAI internal benchmark, Q1 2026.

Two things to underline. First, this measures signal interpretation, not PnL. The PnL improvement is smaller because execution cost and slippage do not care how right the model was. Second, the relative reduction is robust; the absolute numbers depend on signal type. The single most important property is that consensus has beaten the best single model in every regime we have tested.
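The relative-reduction arithmetic is easy to check. The snippet below is a minimal sketch using the rounded error rates quoted above; the function name is illustrative, not part of any NickAI code.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction of the baseline error rate that the new system removes."""
    if not 0 < baseline_error <= 1:
        raise ValueError("baseline_error must be in (0, 1]")
    return (baseline_error - new_error) / baseline_error

# Rounded figures from the benchmark quote: ~45% single model, ~10% consensus.
reduction = relative_error_reduction(0.45, 0.10)
print(f"{reduction:.0%}")  # 78%
```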

Why a single LLM fails on trade signals

Three failure modes recur across every model we have evaluated:

  • Mode collapse on rare events. A model trained mostly on calm-market data underweights tail risk. When the regime shifts, it confidently produces wrong calls because its prior is dominated by the modal regime.
  • Recency bias from RAG / news context. Feeding the latest headline into the prompt anchors the model on that headline and away from base rates. The first model will see a Powell quote and call hawkish; a second model will read the same quote in context and notice it is the same line he said two months ago.
  • Calibration drift. Confidence scores from a single model are systematically over- or under-confident depending on the topic, and the bias is not stable across model versions. Trading on raw, uncalibrated confidence is a way to lose money slowly during good periods and quickly during bad ones.

Each of these failure modes is partially independent across model families. That independence is what consensus exploits.

The consensus algorithm, in pseudocode

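from collections import namedtuple
from concurrent.futures import ThreadPoolExecutor

THRESHOLD = 0.15  # assumed minimum winning margin; tune per strategy
Decision = namedtuple("Decision", "side target_price")

def consensus(signal, market, models, weights):
    # prompt() and regime() are supplied by the caller's codebase.
    # Each m.decide() returns an object with .side, .confidence,
    # .target_price and .reasoning.
    text = prompt(signal, market)
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        decisions = list(pool.map(lambda m: m.decide(text), models))
    w = weights[regime(signal)]  # one weight per model, in model order

    score = {"BUY": 0.0, "SELL": 0.0, "NONE": 0.0}
    for d, wi in zip(decisions, w):
        score[d.side] += wi * d.confidence

    side = max(score, key=score.get)
    runner_up = sorted(score.values(), reverse=True)[1]
    margin = score[side] - runner_up

    if margin < THRESHOLD:
        return None       # consensus too weak: stand down
    # Assumed target rule: weighted mean over models that backed the winner.
    target = sum(wi * d.confidence * d.target_price
                 for d, wi in zip(decisions, w) if d.side == side) / score[side]
    return Decision(side, target)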

The whole engine is the loop above plus the weight-update procedure that keeps the per-model, per-regime weights honest. The non-trivial parts are the regime(signal) classifier and the weight-update rule. The rest is plumbing.

Calibration on historical PnL

Static weights underperform dynamic ones because models change quietly. A new Claude release shifts the model's behaviour without renaming it; an OpenAI safety update narrows GPT's tail. Static weights miss this; rolling weights catch it.

The minimum viable rule:

  1. Every closed trade contributes to a per-model, per-regime score: realised PnL × the model's confidence, signed by whether the model backed or opposed the side taken, × a time-decay factor.
  2. The score is updated daily.
  3. Weights are softmax(score) with a temperature high enough to prevent any single model from dominating after a lucky week.
  4. Re-randomise occasionally (5–10% mass to a flat prior) so a model that was bad for a month can climb back if it improves.

That is enough for a working production system. Fancier rules (Bayesian model averaging, contextual bandits) help at the margin.
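The four-step rule above can be sketched as follows. This is one possible reading, not the production rule: the constants `DECAY`, `TEMPERATURE` and `FLAT_MASS`, the tuple layout of a closed trade, and both function names are assumptions made for illustration.

```python
import math

DECAY = 0.97        # assumed daily decay applied to the running score
TEMPERATURE = 5.0   # assumed softmax temperature; higher means flatter weights
FLAT_MASS = 0.05    # share of weight given to a uniform prior (5-10% in the text)

def update_scores(scores, closed_trades):
    """Daily update: decay old scores, then add each closed trade's contribution.

    closed_trades holds (model, regime, pnl, confidence, agreed) tuples, where
    agreed is True if the model backed the side that was actually traded.
    """
    updated = {key: value * DECAY for key, value in scores.items()}
    for model, regime, pnl, confidence, agreed in closed_trades:
        sign = 1.0 if agreed else -1.0
        key = (model, regime)
        updated[key] = updated.get(key, 0.0) + pnl * confidence * sign
    return updated

def regime_weights(scores, models, regime):
    """Softmax over per-model scores for one regime, mixed with a flat prior."""
    raw = [scores.get((m, regime), 0.0) / TEMPERATURE for m in models]
    peak = max(raw)  # subtract the max for numerical stability
    exps = [math.exp(r - peak) for r in raw]
    total = sum(exps)
    flat = 1.0 / len(models)
    return [(1 - FLAT_MASS) * e / total + FLAT_MASS * flat for e in exps]

# Toy example: model "a" backed a winning trade, model "b" opposed it.
models = ["a", "b", "c"]
scores = update_scores({}, [("a", "fomc", 2.0, 0.8, True),
                            ("b", "fomc", 2.0, 0.6, False)])
weights = regime_weights(scores, models, "fomc")
```

The flat-prior mix guarantees every model keeps at least `FLAT_MASS / len(models)` of the weight, which is what lets a model that had a bad month climb back.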

The cost question

Inference is the new bottleneck. Seven frontier models per decision is roughly $0.03–$0.10 at 2026 prices. The standard cost optimisation is a two-stage cascade:

  • Stage 1 — cheap filter. A single small model (3B–8B parameters, locally hosted) decides whether a signal is worth a consensus call at all. Most signals get rejected here.
  • Stage 2 — full consensus. Only the survivors hit the seven-model loop.

On our deployments this cuts total inference cost by ~85% with negligible PnL impact. The cheap filter is allowed to be wrong in one direction only: it may let through false positives, which the consensus then rejects, but it must not reject true positives. In practice that means tuning it for recall, not precision.
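A minimal sketch of the two-stage cascade is below. The `cheap_filter` and `full_consensus` callables and the `pass_threshold` value are hypothetical stand-ins; the threshold is deliberately low so the filter errs on the side of passing signals through.

```python
from typing import Callable, Optional

def cascade(signal: dict,
            cheap_filter: Callable[[dict], float],
            full_consensus: Callable[[dict], Optional[str]],
            pass_threshold: float = 0.2) -> Optional[str]:
    """Run the expensive consensus only if the cheap filter thinks the signal may matter."""
    relevance = cheap_filter(signal)   # stage 1: one small local model
    if relevance < pass_threshold:
        return None                    # rejected cheaply; no frontier calls made
    return full_consensus(signal)      # stage 2: full multi-model loop

# Toy stand-ins that record how often the expensive stage actually runs.
calls = []

def toy_filter(signal: dict) -> float:
    return signal["relevance"]

def toy_consensus(signal: dict) -> Optional[str]:
    calls.append(signal["id"])
    return "BUY"

signals = [{"id": 1, "relevance": 0.05},
           {"id": 2, "relevance": 0.90},
           {"id": 3, "relevance": 0.10}]
results = [cascade(s, toy_filter, toy_consensus) for s in signals]
```

In this toy run only one of three signals reaches the expensive stage; the real rejection rate depends entirely on the signal stream and the filter.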

Diversity matters more than count

Adding three more Claude variants to a five-Claude ensemble does very little. Adding a Gemini and an open-weight model to a Claude-only ensemble can move the needle measurably. The mathematical reason is that consensus helps only to the extent that the models' errors are uncorrelated; correlated errors reinforce each other instead of cancelling.

Practical rule: pick from at least three different labs (Anthropic, OpenAI, Google), at least one open-weight model (Llama, Mistral, Qwen), and consider adding a fine-tuned model on your own historical data if you have it. Diversity at the architecture and training-data level is what makes the count meaningful.
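The diversity argument can be made concrete with a standard textbook result: for an equal-weight average of n estimators with the same error variance σ² and the same pairwise error correlation ρ, the variance of the average is ρσ² + (1 − ρ)σ²/n. The sketch below evaluates that formula; the correlation values are illustrative assumptions, not measurements from our benchmark.

```python
def ensemble_error_variance(n_models: int, rho: float, sigma2: float = 1.0) -> float:
    """Error variance of an equal-weight average of n equally correlated estimators."""
    if n_models < 1 or not 0.0 <= rho <= 1.0:
        raise ValueError("need n_models >= 1 and 0 <= rho <= 1")
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_models

# Highly correlated models (same family): adding three more barely helps.
same_family_5 = ensemble_error_variance(5, rho=0.9)
same_family_8 = ensemble_error_variance(8, rho=0.9)

# Fewer but more diverse models: lower correlation, much lower variance.
mixed_labs_7 = ensemble_error_variance(7, rho=0.3)

print(same_family_5, same_family_8, mixed_labs_7)
```

As n grows, the variance floors out at ρσ², so correlation rather than model count sets the limit on what consensus can buy.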

What this looks like in production

A working production deployment has a few moving parts that are not in the algorithm box but matter equally:

  • Timeouts. Slow models cannot block fast ones. Anything that does not respond in ~3 seconds is dropped from the consensus for that decision.
  • Retries with cached context. When a model returns malformed output, retry once with the schema repeated. After two failures, drop it.
  • Disagreement audit. Log every case where the consensus margin is narrow. These are the cases your strategy team should review weekly — they are where the model picture is changing.
  • Kill switches. When realised consensus accuracy drops below a floor for a rolling window, halt and alert. Models are not your responsibility, but your deployment is.
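The timeout rule can be sketched with asyncio. Here each model is assumed to be an async callable that takes the prompt text; the function names and the dict-of-models layout are illustrative, and the 3-second default mirrors the figure above.

```python
import asyncio

TIMEOUT_S = 3.0  # per-model budget from the text (~3 seconds)

async def ask_with_timeout(model, prompt_text, timeout_s=TIMEOUT_S):
    """Return the model's decision, or None if it is too slow or raises."""
    try:
        return await asyncio.wait_for(model(prompt_text), timeout=timeout_s)
    except Exception:
        # Timeouts and provider errors are both treated as "dropped".
        return None

async def gather_decisions(models, prompt_text, timeout_s=TIMEOUT_S):
    """Query every model concurrently; keep only those that answered in time."""
    results = await asyncio.gather(
        *(ask_with_timeout(m, prompt_text, timeout_s) for m in models.values())
    )
    return {name: r for name, r in zip(models, results) if r is not None}

# Toy models: one answers quickly, one exceeds the budget.
async def fast_model(_prompt):
    await asyncio.sleep(0.01)
    return "BUY"

async def slow_model(_prompt):
    await asyncio.sleep(1.0)
    return "SELL"

answers = asyncio.run(
    gather_decisions({"fast": fast_model, "slow": slow_model}, "prompt", timeout_s=0.1)
)
```

Because the result is keyed by model name, the caller can drop the matching weights for any model that did not answer, so decisions and weights stay aligned.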

Roll your own, or use a runtime?

The algorithm above is roughly two weeks of engineering plus an indefinite operations tail. If you are an institutional trader with a strategy team and the appetite to maintain it, building it in-house gives you full control. If you are a prosumer trader who wants the consensus engine to be a checkbox in a config file, that is the trade-off NickAI exists to collapse.

Either way, the lesson is the same: do not run a single-LLM trading bot with serious money. The math says you will lose. Multi-model consensus is the cheapest, most robust improvement available, and it is the floor below which agentic trading should not happen.

Frequently asked questions


What does "multi-LLM consensus" mean precisely?

Running the same trading prompt across multiple frontier models in parallel — typically Claude, GPT, Gemini, and a couple of open-weight models — and combining their structured outputs into a single decision. The combination is not a vote; it is a calibrated weighting where each model's influence reflects its track record on similar past decisions.

How is this different from a model ensemble in classical ML?

Same idea, different implementation. Classical ensembles average gradient-boosted trees or neural-net logits. LLM consensus combines structured outputs (side, confidence, target price) from very different model families, where each model is itself a frozen black box. The math is similar; the cost structure and the failure modes are not.

How many models do you actually need?

Marginal benefit drops sharply after five and is essentially flat after seven. Below three, you cannot reliably detect when one model is hallucinating. Seven is the sweet spot for most trading workloads — three frontier closed models, two open-weight models for variance, and two specialised fine-tunes if you have them.

Is this affordable for retail traders?

In 2026, yes, but barely. A consensus call across seven frontier models costs roughly $0.03–$0.10 in inference. For a strategy that makes 50 decisions a day, that is $1.50–$5 daily. Above $10k of capital this is rounding error; below $1k it eats the edge. Caching identical states across decisions and using cheaper models for the first-pass filter are how production deployments stay cheap.
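The daily figure is simple arithmetic, and it is worth running against your own decision count and capital. The helper names below are illustrative; the inputs are the per-call price range and the 50-decisions-a-day example from the answer above.

```python
def daily_inference_cost(decisions_per_day: int, cost_per_call: float) -> float:
    """Total inference spend per day, in dollars."""
    return decisions_per_day * cost_per_call

def cost_share_of_capital(daily_cost: float, capital: float) -> float:
    """Daily inference spend as a fraction of trading capital."""
    return daily_cost / capital

low = daily_inference_cost(50, 0.03)
high = daily_inference_cost(50, 0.10)
print(f"${low:.2f} to ${high:.2f} per day")  # $1.50 to $5.00 per day

# Compare the high-end spend against account size to see what edge it must clear.
for capital in (1_000, 10_000):
    share = cost_share_of_capital(high, capital)
    print(f"${capital:,} account: {share:.2%} of capital per day")
```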

What do you weight by?

Historical PnL on calibration trades, segmented by market regime. Each model gets a rolling Brier score on its past forecasts and a realised-PnL score on the trades that followed. The weight is a function of both, with decay so recent performance dominates. Crucially, the weighting is per-regime — Claude may be best at FOMC days, GPT may be best at quiet weekends, and you want the consensus to reflect that.
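For reference, the Brier score is the mean squared gap between a forecast probability and the 0/1 outcome, so lower is better. A minimal sketch follows; the `rolling_brier` helper and its window handling are assumptions about how "rolling" might be implemented, not a description of the production code.

```python
def brier_score(forecasts, outcomes):
    """Mean squared gap between forecast probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def rolling_brier(forecasts, outcomes, window):
    """Brier score over the most recent `window` forecasts only."""
    return brier_score(forecasts[-window:], outcomes[-window:])

# A model's stated probability that the market moves up, and what happened (1 = up).
forecasts = [0.9, 0.6, 0.2]
outcomes = [1, 0, 0]

full = brier_score(forecasts, outcomes)          # (0.01 + 0.36 + 0.04) / 3
recent = rolling_brier(forecasts, outcomes, 2)   # (0.36 + 0.04) / 2
```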

Does consensus eliminate hallucination?

It dramatically reduces the impact, but does not eliminate the source. The win is statistical: when one model hallucinates, the others usually do not, and the weighting drowns the bad signal. The remaining failure mode is correlated hallucination, where multiple models share a wrong assumption, typically because they were trained on overlapping data. The defence is model diversity (mixing labs and architectures), not just count.