The results behind the reliability layer.

QSI has been validated across science, math, medicine, code, and general knowledge — thousands of evaluations spanning open and frontier models. Here is what holds.

13+ models covered, open to frontier
0.86–0.96 AUC separating correct from incorrect
5 domains validated end-to-end
Thousands of independent evaluations

Cross-model coverage

One detector, 13 models.

The same governance layer separates correct from incorrect answers across 5 frontier and 8 open-weight models. What QSI reads is the model's mistakes, not a particular model it was tuned for.

Figure: AUC for separating correct from incorrect answers, by model (frontier and open-weight); 0.5 = chance.

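
For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen correct answer gets a higher score than a randomly chosen incorrect one. The sketch below computes it by direct pairwise comparison. It assumes a detector that emits one numeric score per answer, with higher meaning "more likely correct"; the function name `auc` and the sample scores are hypothetical and are not QSI output or QSI's interface.

```python
from typing import Sequence


def auc(scores: Sequence[float], is_correct: Sequence[bool]) -> float:
    """Probability that a random correct answer outscores a random incorrect one.

    Ties count as half a win. 0.5 is chance; 1.0 is perfect separation.
    """
    correct = [s for s, ok in zip(scores, is_correct) if ok]
    incorrect = [s for s, ok in zip(scores, is_correct) if not ok]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect answer")

    wins = 0.0
    for c in correct:
        for i in incorrect:
            if c > i:
                wins += 1.0
            elif c == i:
                wins += 0.5
    return wins / (len(correct) * len(incorrect))


# Made-up scores for illustration only (assumption: higher = more likely correct).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [True, True, False, True, False, False]
print(auc(scores, labels))  # 8 of 9 correct/incorrect pairs are ordered correctly
```

The pairwise loop is quadratic in the number of answers, which is fine for a small audit set; for large evaluations a rank-based computation gives the same value more efficiently.
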
Category coverage

Where weaker models go wrong.

QSI catches the mistakes that weaker and specialized models make. On hard items these models are wrong 40–80% of the time, and QSI surfaces those errors before they reach users.

Domain | Hard-item error rate | What QSI is catching
Science | 52% | Graduate-level reasoning where confident-sounding answers are often wrong.
Mathematics | 61% | Multi-step problems where a single slip invalidates the result.
Medicine | 44% | High-stakes factual recall where a wrong answer must not reach a user.
Code | 73% | Subtle logic and edge-case errors that pass a quick read but fail in production.
General knowledge | 40% | Broad factual questions spanning everyday and specialist domains.

Error rates illustrate how often weaker/specialized models are wrong on hard items — the failures QSI surfaces. Figures are indicative of the coverage range, not a single benchmark.

Want the numbers for your models and your domains?

We run QSI against your traffic and share the separation we get on your data.