What "I Don't Know" Looks Like¶
Calibrated uncertainty in organizational knowledge systems¶
A Base2ML white paper. Fifth in a series following The Information Paradox, Why Conflict Detection Is Harder Than It Looks, The Half-Life of Documentation, and Citation Discipline.
The system that's right 90% of the time¶
Imagine two knowledge systems serving the same organization on the same questions.
System A is correct 90% of the time. It's confidently wrong the other 10%. The user has no signal about which case they're in.
System B is correct 70% of the time. The other 30%, it visibly hesitates: low confidence, gaps in coverage, a flag that says "your documents don't cover this clearly." The user always knows when to slow down.
Most product teams design for System A. The polished, fluent answer reads better in demos. The hesitation in System B reads as a weakness rather than as a feature. Users, asked which they prefer in a benchmark, often pick System A because it's more impressive in the cases they review.
In operational use, System B is meaningfully more valuable. The reason isn't the accuracy difference; it's the calibration difference. A user of System A has to verify every answer because they have no signal as to which 10% is wrong. A user of System B can trust the high-confidence answers and slow down on the low-confidence ones. The total verification work is lower with System B even though its raw accuracy is lower.
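A rough illustration of that arithmetic, under the assumption that System B's flags are well calibrated (its unflagged answers are reliably correct) and that each verification costs the same either way:

```python
# Illustrative arithmetic only. The 90%/70% accuracy figures and the 30% flag
# rate come from the scenario above; "well-calibrated flags" is an assumption.
QUESTIONS = 100

# System A: correct 90% of the time, no signal about which 10% is wrong.
# The only safe policy is to verify every answer.
system_a_checks = QUESTIONS                    # 100 checks

# System B: correct 70% of the time, but it flags the uncertain 30%.
# If the flags are reliable, only the flagged answers need checking.
system_b_checks = int(QUESTIONS * 0.30)        # 30 checks

print(f"System A: {system_a_checks} verifications per {QUESTIONS} questions")
print(f"System B: {system_b_checks} verifications per {QUESTIONS} questions")
```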
This is what calibrated uncertainty means. The system's awareness of its own uncertainty maps reliably onto the actual probability that its answer is correct. A system that says "high confidence" should be right almost always. A system that says "low confidence" should be right meaningfully less often. The relationship has to be predictable. Most systems get this wrong.
Why miscalibrated confidence is dangerous¶
The asymmetry matters. Miscalibration in one direction is annoying; miscalibration in the other is dangerous.
A system that's under-confident — that flags every answer as uncertain even when its grounding is solid — wastes user attention. The user learns to ignore the confidence indicators because they don't carry signal. The cost is a system that's less useful than it should be, but the failure mode is bounded.
A system that's over-confident — that produces high-confidence answers even when its grounding is weak — actively misleads the user. The user trusts the high-confidence flag, doesn't verify, acts on the answer, and the cost shows up later as a wrong decision traceable to a moment the system should have hesitated.
Generic LLMs have a strong tendency toward over-confidence. The training optimizes for fluent, complete-sounding output. Producing a hedged, "I'm not sure" answer is not what the model is rewarded for. When the prompt asks for a confident answer to a question whose grounding is weak, the model produces one — and the user can't easily tell it apart from an answer whose grounding is strong.
This is the reason calibrated uncertainty has to be engineered into a knowledge system at the architectural level. It is not the LLM's default behavior. It has to be imposed.
What confidence has to be calibrated against¶
A useful confidence signal isn't a single number. It's at least two, possibly three, because the things that can go wrong are different.
Retrieval confidence. How well did the retrieval layer find documents relevant to this question? If the retrieval surfaces five passages that all clearly address the question's topic, retrieval confidence is high. If retrieval surfaces five passages tangentially related, retrieval confidence is low. The user reading a low-retrieval-confidence answer should know that the corpus may not actually cover this question well.
Answer confidence. Given the retrieved passages, how confident is the system that its synthesized answer is supported by them? If the passages clearly answer the question and the synthesis is direct quotation with attribution, answer confidence is high. If the passages are tangential and the answer is the LLM's reasoning from them rather than something the passages support, answer confidence is low. The user reading a low-answer-confidence answer should know that the synthesis may be overreaching.
Coverage gaps. What did the user ask about that the retrieved passages did not address? A useful answer is sometimes "the documents cover X and Y, but they don't say anything specific about Z, which is part of what you asked." This is a different signal than confidence — it's about what the answer is missing rather than how much to trust what's there.
These three signals are independent. Retrieval confidence can be high while answer confidence is low (the documents are clearly relevant but they don't actually contain the answer the user wants). Answer confidence can be high while retrieval confidence is low (the LLM is confident in a synthesis from tangential passages, which is exactly the case where over-confidence is most dangerous). Coverage gaps can exist alongside high confidence on what's covered (the system is confident about what it answered, while explicitly noting what it didn't).
A system that conflates these into a single "confidence" number loses information. A system that surfaces all three independently equips the user to make a better judgment about whether to trust, verify, or escalate.
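As a concrete sketch of what keeping the signals separate can look like at the interface level, here is a minimal response shape that carries all three independently. The class and field names are illustrative, not taken from any particular system:

```python
from dataclasses import dataclass, field
from enum import Enum


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass
class GroundedAnswer:
    """One answer with its uncertainty signals kept separate, never averaged into one number."""
    text: str                             # the synthesized answer
    retrieval_confidence: Confidence      # did the corpus clearly cover the topic?
    answer_confidence: Confidence         # is the synthesis supported by the retrieved passages?
    coverage_gaps: list[str] = field(default_factory=list)  # asked-about topics the passages never address
    citations: list[str] = field(default_factory=list)      # passage identifiers backing the claims


# Hypothetical example: the corpus is clearly on-topic and the synthesis is well
# supported, but part of what the user asked was never addressed.
answer = GroundedAnswer(
    text="The retention period for audit logs is seven years ...",
    retrieval_confidence=Confidence.HIGH,
    answer_confidence=Confidence.HIGH,
    coverage_gaps=["retention period for backups"],
    citations=["records-policy-2023.pdf#p4"],
)
```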
What calibration actually requires¶
Calibrating a knowledge system's confidence is harder than it sounds. The naive approach — ask the LLM to rate its own confidence — produces ratings that are loosely correlated with accuracy at best. The model is generating a number from a prior about how confident it usually is, not from a careful introspection about whether this specific answer is well-grounded.
A better approach uses signals from the retrieval pipeline itself. Retrieval confidence can be measured from cross-encoder rerank scores: high-scoring passages indicate a strong topical match, low-scoring ones a marginal match. Answer confidence can be measured from properties of the synthesis: how many of the retrieved passages were actually used? Did the LLM produce hedged language ("appears to," "likely," "probably") even when not prompted to? Does every claim have a cited passage supporting it, or only some?
These signals can be combined into confidence ratings that are meaningfully correlated with actual accuracy. They're not perfect. But they're substantially better than the LLM's self-reported confidence, and they expose the right asymmetries — when retrieval is weak, the system flags it; when synthesis goes beyond what the passages support, the system flags it.
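A minimal sketch of that combination, assuming a reranker that returns a relevance score per passage and a synthesis step that reports which passages it cited. The thresholds and the hedge-word list are placeholders, not calibrated values:

```python
import re

# Hedged phrasing the model produced unprompted is a weak-confidence signal.
# The word list here is a placeholder and would be extended in practice.
HEDGE_WORDS = re.compile(r"\b(appears to|likely|probably)\b", re.IGNORECASE)


def retrieval_confidence(rerank_scores: list[float]) -> str:
    """High when the top reranked passages clearly match the question's topic."""
    top = sorted(rerank_scores, reverse=True)[:3]
    if not top:
        return "low"
    if min(top) >= 0.7:      # placeholder threshold; would be tuned on labeled data
        return "high"
    if max(top) >= 0.5:
        return "medium"
    return "low"


def answer_confidence(answer_text: str, cited_ids: list[str], retrieved_ids: list[str]) -> str:
    """High when the synthesis stays inside what the retrieved passages support."""
    cited_ratio = len(set(cited_ids)) / max(len(retrieved_ids), 1)
    hedged = bool(HEDGE_WORDS.search(answer_text))
    if cited_ratio >= 0.6 and not hedged:
        return "high"
    if cited_ratio >= 0.3:
        return "medium"
    return "low"
```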
Calibration also requires testing. A system that claims calibrated uncertainty has to produce outputs whose confidence ratings can be checked against actual correctness across a representative sample of questions. This is unglamorous work. It produces an unsatisfying number — "our high-confidence answers are correct 87% of the time, our medium-confidence answers 65%, our low-confidence answers 35%" — that nobody wants to put in a marketing brochure. The number is what makes the calibration claim credible. Without it, "calibrated uncertainty" is just a claim.
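The check itself is small to write and tedious to maintain, which is part of why it gets skipped. A sketch, assuming a labeled evaluation set where each answer carries the confidence rating the system reported and a human judgment of correctness:

```python
from collections import defaultdict

# Each record: (confidence rating the system reported, was the answer actually correct?)
# The correctness labels come from human review of a representative sample of real questions;
# the handful of records below are placeholders.
evaluations = [
    ("high", True), ("high", True), ("high", False),
    ("medium", True), ("medium", False),
    ("low", False), ("low", True), ("low", False),
]

buckets: dict[str, list[bool]] = defaultdict(list)
for rating, correct in evaluations:
    buckets[rating].append(correct)

# The per-bucket accuracy is the unsatisfying number that makes the calibration claim credible.
for rating in ("high", "medium", "low"):
    results = buckets[rating]
    accuracy = sum(results) / len(results) if results else float("nan")
    print(f"{rating:>6}-confidence answers: {accuracy:.0%} correct ({len(results)} sampled)")
```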
The user-experience consequences¶
A calibrated system feels different to use than a miscalibrated one. The differences are subtle in any single interaction and decisive over many.
Calibrated systems earn trust. A user, after a few weeks of regular use, develops a sense of when the system is confident and when it isn't. They stop verifying the confident answers because the confident answers have been right. They slow down on the uncertain ones because the uncertain ones have been the cases that needed care. The user's verification work, in steady state, is lower than it was at the start.
Miscalibrated systems either get over-trusted or stop being used. A user of a miscalibrated system either learns to trust it across the board (and gets burned periodically by cases the system was wrong about) or learns to verify everything (and stops getting any verification savings from the system). Neither steady state is the value proposition.
Calibrated systems handle the edge of their competence well. A user asks a question the corpus doesn't cover well. A calibrated system says: "I have some related material, but it doesn't directly address what you're asking about. Here's what I have. You may want to look elsewhere or update the corpus." This is a useful answer. A miscalibrated system, faced with the same situation, produces a fluent response from the related-but-not-quite-on-point material, and the user doesn't realize they were on the edge of competence until the answer turns out to have been wrong.
Calibrated systems are easier to audit. When a low-confidence flag was attached to a particular query, the audit reader can see that the user was warned. When it wasn't, they can see the system thought the answer was solid. The confidence layer becomes part of the audit-trail substrate — a record of what the system was telling the user at the moment of decision, not just what the answer was.
The institutional dimension¶
Calibrated uncertainty has a second-order effect that's worth naming. Organizations that adopt knowledge systems tend, over time, to develop an institutional sense of what the system can and can't be trusted with. The high-confidence answers become the basis for routine decisions. The low-confidence answers become the questions that get escalated to humans who have the contextual judgment.
A miscalibrated system disrupts this development. When confidence ratings don't track accuracy, the institution can't develop reliable practices about when to trust the system. Some teams over-trust it; others under-trust it; the policy environment fragments. Eventually the system either gets formally restricted to low-stakes use or quietly stops being used for anything that matters.
A calibrated system, in contrast, becomes the substrate for a productive division of labor between system and human. Routine, high-confidence questions get answered by the system; stakes-bearing, low-confidence questions trigger human review; the institution gets faster on the easy questions without losing rigor on the hard ones.
The institutional value is much larger than the individual-query value. It's also much harder to demonstrate in a demo, which is part of why calibration is consistently undervalued in product decisions.
What to look for¶
If calibrated uncertainty matters in your environment — and it almost certainly does, in any setting where the cost of wrong-but-confident answers is non-trivial — the questions worth asking of any system you evaluate go beyond "does it have a confidence indicator." How is the confidence rating computed — from the LLM's self-assessment of its own answer, or from properties of the retrieval pipeline? Are retrieval confidence and answer confidence reported as independent signals, or conflated into a single number? Does the system surface coverage gaps explicitly — what the user asked about that the corpus didn't address — or does it produce a complete-sounding answer regardless? When the answer is low-confidence, what does the user actually see? Is the indicator subtle enough to be ignored, or loud enough to change behavior?
There's a specific test worth running on any system: ask it twenty questions, ten of which are well-covered by the corpus and ten of which are deliberately on the edge of coverage. Then check whether the system's confidence ratings actually predict which set each question came from. Most systems' "confidence" indicators don't carry that signal. The systems that do have generally been engineered against it deliberately.
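If the system exposes a confidence rating per answer, the test can be scored mechanically. A sketch, assuming you have recorded the rating for each of the twenty questions and know which set each came from; the ratings below are placeholders for your own recordings:

```python
# 10 questions well covered by the corpus, 10 deliberately at the edge of coverage.
RANK = {"low": 0, "medium": 1, "high": 2}

well_covered = ["high", "high", "medium", "high", "high",
                "high", "medium", "high", "high", "medium"]
edge_of_coverage = ["low", "medium", "low", "low", "high",
                    "low", "medium", "low", "low", "medium"]

# Probability that a randomly chosen well-covered question gets a higher rating
# than a randomly chosen edge question (a rank-based separation score:
# 0.5 means the indicator carries no signal, 1.0 means it separates the sets perfectly).
wins = ties = 0
for w in well_covered:
    for e in edge_of_coverage:
        if RANK[w] > RANK[e]:
            wins += 1
        elif RANK[w] == RANK[e]:
            ties += 1

separation = (wins + 0.5 * ties) / (len(well_covered) * len(edge_of_coverage))
print(f"separation score: {separation:.2f}")
```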
If you're working through these tradeoffs and want a sounding board — diagnostic, not pitch — we'd welcome the conversation.
About Base2ML. Base2ML is a Pittsburgh-based company building knowledge-access tools for organizations that need to find what they already have. We work in the specific space where retrieval, authority hierarchy, and conflict surfacing meet operational reality.
Contact. Base2ML · chris@base2ml.com · base2ml.com · docs.base2ml.com
Numbers and percentages are deliberately not invented. Where industry research provided a credible figure we cite it; where it didn't, we say so rather than fabricating one.