
The results behind the verification layer.

QSI has been validated across banking, legal, medical, customer-support, and open-chat answers — thousands of evaluations spanning open and frontier models. Here is what holds.

12+ models · cross-model on hard banking

0.83 AUC · pooled across 5 domains · 10 models

5 domains validated end-to-end

thousands of independent evaluations

Cross-model coverage

One detector, 12 models (banking).

On the hardest single domain, the same checkpoint separates correct from incorrect answers across 6 frontier and 6 open-weight models. QSI reads each model's mistakes; it does not depend on being tuned for any one model. The pooled five-domain figure (AUC 0.83) remains the canonical headline.

[Figure: AUC per model on hard banking questions, frontier vs. open-weight models; 0.5 = chance]
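For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen correct answer receives a higher score than a randomly chosen incorrect one, which is why 0.5 corresponds to chance. The sketch below illustrates that definition on made-up data; the function name `auc` and the toy scores are assumptions for illustration, not QSI's implementation or results.

```python
from itertools import product

def auc(scores, is_correct):
    """Pairwise AUC: fraction of (correct, incorrect) pairs where the
    correct answer has the higher score. Ties count as half a win."""
    pos = [s for s, ok in zip(scores, is_correct) if ok]
    neg = [s for s, ok in zip(scores, is_correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: one detector score per answer, plus ground truth.
toy_scores = [0.9, 0.7, 0.4, 0.2, -0.1, -0.6]
toy_correct = [True, True, False, True, False, False]
print(round(auc(toy_scores, toy_correct), 3))

# A detector that gives every answer the same score carries no signal.
print(auc([0.0] * 6, toy_correct))
```

Computing this per model, with the same scoring function for every model, is one way to produce a per-model comparison like the one in the figure.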

Category coverage

Where weaker models go wrong.

QSI catches the mistakes weaker and specialized models make — error rates of 40–73% on hard items, surfaced before they reach users.

| Domain | Hard-item error rate | What QSI is catching |
| --- | --- | --- |
| Science | 52% | Graduate-level reasoning where confident-sounding answers are often wrong. |
| Mathematics | 61% | Multi-step problems where a single slip invalidates the result. |
| Medicine | 44% | High-stakes factual recall where a wrong answer must not reach a user. |
| Code | 73% | Subtle logic and edge-case errors that pass a quick read but fail in production. |
| General knowledge | 40% | Broad factual questions spanning everyday and specialist domains. |

Error rates illustrate how often weaker/specialized models are wrong on hard items — the failures QSI surfaces. Figures are indicative of the coverage range, not a single benchmark.

Methodology & limitations

How we measured — and what we are still proving.

We report numbers the way we ask customers to trust them: scoped to their set, with the caveats stated up front.

Independent judge: a separate model re-reads each (question, answer) pair, and we take P(YES)−P(NO) on its first output token as the score, evaluated against ground truth. Pooled headline: AUC 0.83 (95% CI 0.79–0.86) across 10 models and 5 domains.
Not one model family agreeing with itself: in the main matrix the judge and the reference oracle both come from our own judge family. To check that this does not inflate the result, we re-graded the full banking set with an independent frontier-model family from a different vendor: the judge AUC held at 0.924, and the two oracle families agreed 90% of the time. Wider cross-family validation is in progress.
An operating point, not just AUC: on banking, at a 90%-precision threshold QSI reaches 71% wrong-answer recall while flagging about 5% of correct ones. These are the figures a reviewer needs to size the cost of the gate.
External validation: public-benchmark validation now spans five model families on SimpleQA-Verified, with a pooled AUC of 0.808, independent of our own scenarios.
What we do not over-claim: the model’s own token entropy is near chance on facts and is not, on its own, a validated bad-code detector; on code, the structural completion check and the independent judge carry the signal, not entropy. The agentic-orchestration result is a directional, control-armed simulation: QSI-guided escalation roughly doubled a same-cost random-escalate baseline (38.9% vs 20.4%) and lifted a no-verification floor ~5× (7.4%→38.9%), so the gain comes from the judge targeting the right step, not just from a bigger model.
Scope: every figure is scoped to its set; none is stated as an absolute guarantee.
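The judge score and the operating-point figures described above can be sketched as follows. This is a minimal illustration under stated assumptions, not QSI's code: it assumes the judge exposes first-token log-probabilities as a dictionary keyed by "YES" and "NO", that an answer is flagged as wrong when its margin falls below a threshold, and that "precision" means the share of flagged answers that are actually wrong. The names `judge_margin` and `operating_point` and all the data are hypothetical.

```python
import math

def judge_margin(first_token_logprobs, yes="YES", no="NO"):
    """P(YES) - P(NO) from first-token log-probabilities.
    A token missing from the dict is treated as probability 0 (assumption)."""
    p_yes = math.exp(first_token_logprobs[yes]) if yes in first_token_logprobs else 0.0
    p_no = math.exp(first_token_logprobs[no]) if no in first_token_logprobs else 0.0
    return p_yes - p_no

def operating_point(margins, is_correct, threshold):
    """Flag an answer as wrong when its margin is below `threshold`,
    then summarize the gate against ground truth."""
    flagged = [m < threshold for m in margins]
    true_flags = sum(f and not ok for f, ok in zip(flagged, is_correct))
    false_flags = sum(f and ok for f, ok in zip(flagged, is_correct))
    n_wrong = sum(not ok for ok in is_correct)
    n_correct = len(is_correct) - n_wrong
    n_flagged = true_flags + false_flags
    return {
        # Of the answers flagged, how many were actually wrong.
        "precision": true_flags / n_flagged if n_flagged else None,
        # Of the wrong answers, how many were flagged.
        "wrong_answer_recall": true_flags / n_wrong if n_wrong else None,
        # Of the correct answers, how many were flagged anyway.
        "correct_flag_rate": false_flags / n_correct if n_correct else None,
    }

# Hypothetical judge output for one (question, answer) pair.
margin = judge_margin({"YES": math.log(0.7), "NO": math.log(0.2)})

# Hypothetical margins and ground truth for eight answers.
toy_margins = [0.8, 0.6, 0.3, -0.2, -0.5, -0.9, 0.1, 0.7]
toy_correct = [True, True, True, False, False, False, False, True]
print(operating_point(toy_margins, toy_correct, threshold=0.5))
```

Sweeping `threshold` and picking the value where precision first reaches a target such as 90% is one way to arrive at an operating point like the one quoted for banking.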

Want the numbers for your models and your domains?

We run QSI against your traffic and share the separation we get on your data.
