The Self-Maintaining Corpus¶
Why the document set you ingested last quarter is already wrong¶
A Base2ML white paper. Seventh in a series. See also The Information Paradox, Why Conflict Detection Is Harder Than It Looks, The Half-Life of Documentation, Citation Discipline, What "I Don't Know" Looks Like, and Authority Hierarchies.
The deployment that ages out of usefulness¶
A common pattern in knowledge-system deployments goes like this: the organization spends a month preparing documents for ingestion. They consolidate sources, tag content, run an initial classification pass. The system goes live. For the first two months, users are excited. Then the corpus starts to drift.
A new policy is approved by the board; nobody loads it into the system. A regulation changes; the old version is still what the system retrieves. Three documents are revised; the new versions exist in SharePoint but the system still has the previous ones. Six months in, the system is answering questions based on a snapshot of the organization that no longer reflects how it currently operates. The users notice. They start verifying every answer against the current SharePoint instead of trusting the system. Eventually they stop using it.
This is the failure mode that kills most knowledge-system deployments. It's not the retrieval quality, the conflict detection, the citation discipline, or any of the other capabilities we've discussed in earlier papers. It's that the corpus the system was built against goes stale faster than the organization is set up to update it.
The challenge of self-maintenance — keeping the corpus current with the organization's actual document state without manual intervention — is the operational substrate that all the other capabilities sit on top of. The most carefully designed knowledge system fails if the organization can't keep its corpus current at the rate the underlying documents change.
Three modes of corpus maintenance¶
There are essentially three ways an organization can keep its corpus current. They sit at different points on a tradeoff between maintenance burden and maintenance fidelity.
Manual upload. Someone in the organization is responsible for noticing when documents change and uploading the new versions to the knowledge system. This works for small corpora that change infrequently and have a clear single point of accountability for document state. It fails as the corpus grows or as document changes diffuse across multiple authors. The maintenance burden scales with the document volume, and the fidelity depends entirely on whether the responsible person actually does it.
Periodic batch refresh. On a schedule (monthly, quarterly), someone exports the current document set from the source system and re-uploads it. This avoids the per-change attention problem at the cost of having a corpus that's always somewhat stale — the staleness oscillates between zero and one refresh-period. Works in stable corpora; fails when documents change weekly and the refresh is monthly. The burden is concentrated rather than continuous, which some organizations prefer and others don't.
Continuous synchronization. The knowledge system connects to the organization's source-of-truth document system (SharePoint, Google Drive, a content-management system) and keeps itself in sync automatically. Adds, updates, and deletes flow through without human intervention. The maintenance burden drops to near zero in steady state; the fidelity is bounded by the connector's reliability and the source system's API.
Most organizations start with manual upload because it's the lowest activation energy. They drift toward batch refresh as the manual burden becomes unsustainable. They eventually want continuous synchronization but discover that the engineering required to do it well — credential management, change detection, error isolation, cost control — is more substantial than they expected.
The thesis of this paper is that continuous synchronization, done right, is what separates knowledge systems that survive past their first six months from ones that don't. The engineering is real, but it's the engineering that determines whether the deployment is a one-time investment or an ongoing operational asset.
What "done right" requires¶
A continuous-sync layer that works in operational settings has to handle several categories of complexity that aren't obvious from the outside.
Change detection. The system has to know which documents in the source have changed since the last sync. Naive approaches — re-download everything every time — are slow and expensive. Useful approaches use the source system's metadata: modification timestamps, version IDs, ETags. Some sources offer delta APIs that return only changes since a token; others don't. The system has to handle both cases gracefully.
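To make the two paths concrete, here is a minimal sketch in Python. Everything in it is illustrative: `fetch_delta` and `list_all_items` stand in for whatever the source system's API actually exposes, and `SourceItem` is just the metadata most document APIs return in some form. The point is the shape of the logic, not a specific connector.

```python
from datetime import datetime
from typing import Callable, Iterable, Optional


class SourceItem:
    """The metadata most document APIs expose in some form."""
    def __init__(self, doc_id: str, etag: str, modified: datetime):
        self.doc_id = doc_id
        self.etag = etag
        self.modified = modified


def detect_changes(
    last_sync: datetime,
    delta_token: Optional[str],
    fetch_delta: Optional[Callable[[str], tuple[list[SourceItem], str]]],
    list_all_items: Callable[[], Iterable[SourceItem]],
) -> tuple[list[SourceItem], Optional[str]]:
    """Return (changed items, new delta token if the source supports one)."""
    if fetch_delta is not None and delta_token is not None:
        # Preferred path: the source tells us exactly what changed since the token.
        changed, new_token = fetch_delta(delta_token)
        return changed, new_token

    # Fallback path: enumerate everything, but compare metadata rather than
    # re-downloading content; only items modified after the last sync survive.
    changed = [item for item in list_all_items() if item.modified > last_sync]
    return changed, None
```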
Add/update/delete distinction. Three different operations, three different consequences. An add brings a new document into the corpus. An update replaces an existing document; the previous version's chunks have to be removed before the new version is ingested, or the corpus accumulates duplicates. A delete removes a document from the source; if that removal never reaches the index, the stale chunks produce phantom answers based on documents that no longer exist. Each operation has its own failure modes that the sync layer has to handle.
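One way to keep the three operations distinct is to diff the source listing against a local manifest of what was indexed at the last successful sync. The sketch below assumes both are simple maps from document ID to a version marker (an ETag or version number); real connectors carry more metadata, but the classification logic is the same.

```python
def classify_operations(
    source: dict[str, str],    # doc_id -> version marker currently in the source system
    manifest: dict[str, str],  # doc_id -> version marker as of the last successful sync
) -> tuple[list[str], list[str], list[str]]:
    """Split the corpus diff into adds, updates, and deletes."""
    adds = [d for d in source if d not in manifest]
    updates = [d for d in source if d in manifest and source[d] != manifest[d]]
    deletes = [d for d in manifest if d not in source]
    # Downstream: an update must drop the previous version's chunks before
    # re-ingesting; a delete must drop the chunks and the manifest entry.
    return adds, updates, deletes
```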
Failure isolation. A typical sync run will succeed for most documents and fail for some. A 401 from one document download shouldn't abort the whole run; a 500 from another shouldn't either. Per-document errors should be collected and surfaced; the manifest should advance for the documents that succeeded. The next run can retry the failed documents without reprocessing the successful ones. Naive sync implementations either fail entirely on the first error (no progress) or silently drop failures (no visibility). Neither works.
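The core of failure isolation is small: process each document in its own try/except, collect the errors, and advance the manifest only for documents that succeeded. A hedged sketch, with `process_document` standing in for the download/chunk/index pipeline:

```python
def run_sync(doc_ids, process_document, manifest):
    """Process each document independently so one failure never aborts the run."""
    failures = {}
    for doc_id in doc_ids:
        try:
            new_version = process_document(doc_id)   # download, chunk, index
        except Exception as exc:                     # a 401, a 500, a timeout...
            failures[doc_id] = repr(exc)             # collect and keep going
            continue
        manifest[doc_id] = new_version               # advance only for successes
    # Surfaced to the operator; the next run retries only these documents.
    return failures
```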
Authentication and credential rotation. The sync's credentials are usually short-lived (OAuth tokens that expire) or medium-lived (client secrets that rotate every 12-24 months). The sync layer has to acquire tokens at runtime, handle expiry gracefully, and surface auth failures in a way that prompts operator action without blocking the rest of the pipeline. Treating credentials as static configuration is the failure mode that produces "the sync stopped working three weeks ago and nobody noticed."
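A minimal sketch of runtime token acquisition, assuming a `request_new_token` callable that performs the actual OAuth client-credentials exchange against whatever identity provider the deployment uses. The refresh margin keeps the sync from racing token expiry mid-run.

```python
import time


class TokenProvider:
    """Acquire tokens at runtime and refresh them before they expire."""

    def __init__(self, request_new_token, refresh_margin_s: int = 300):
        # request_new_token performs the real OAuth exchange and returns
        # (access_token, expires_in_seconds); credentials come from a secret
        # store at call time, never from static configuration.
        self._request_new_token = request_new_token
        self._margin = refresh_margin_s
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        if self._token is None or time.time() > self._expires_at - self._margin:
            # If the secret has been rotated, this raises; the error should be
            # surfaced as an operator alert, not swallowed by the pipeline.
            token, expires_in = self._request_new_token()
            self._token = token
            self._expires_at = time.time() + expires_in
        return self._token
```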
Rate-limit and quota management. Source systems impose rate limits. SharePoint's Graph API has request-per-minute caps. Google Drive has quotas. A sync that hits a limit should back off and retry, not error out. A sync that runs frequently enough to bump up against limits routinely needs to know how to spread its requests over time.
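A representative backoff wrapper, assuming responses in the style of the Python `requests` library (a `status_code` attribute and a `headers` mapping). The specifics of each source's throttling differ; the shape of the retry logic does not.

```python
import random
import time


def with_backoff(call, max_attempts: int = 5):
    """Retry a throttled call, honoring Retry-After when the source provides it."""
    for attempt in range(max_attempts):
        response = call()
        if response.status_code != 429:
            return response
        # Prefer the server's hint; otherwise back off exponentially with jitter.
        retry_after = response.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else (2 ** attempt) + random.random()
        time.sleep(delay)
    raise RuntimeError("still throttled after retries; lower the sync frequency or spread requests")
```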
Cost control. Continuous sync, if naively implemented, can be expensive. Re-classifying every document with an LLM on every sync produces bills that scale linearly with corpus size and sync frequency. We covered this in the earlier discussion of the lightweight-sync flag — scheduled syncs should default to deterministic classification, with LLM-assisted classification reserved for the initial ingest and for operator-triggered manual reclassifies. The economics matter; an undisciplined sync layer produces a system that's too expensive to keep running on a daily cadence.
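One plausible way to wire that default, sketched with hypothetical function names: route classification by how the sync was triggered, so the expensive path is only reachable deliberately.

```python
def classify_document(doc, trigger: str, deterministic_classify, llm_classify):
    """Route classification by how the sync was triggered."""
    if trigger in ("initial_ingest", "manual_reclassify"):
        return llm_classify(doc)           # expensive path, reachable only deliberately
    # Scheduled syncs take the cheap path: rules and metadata, no per-document
    # LLM call, so cost does not scale with sync frequency.
    return deterministic_classify(doc)
```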
Audit-trail integration. Every sync run produces a record. What was added, what was updated, what was deleted, what failed and why. Six months in, when an operator is debugging a question whose answer surprised them, the sync history is part of the diagnostic surface. A sync layer that doesn't log its own actions is one more silo waiting to break.
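A sync-run record does not need to be elaborate. The sketch below (an assumed schema, not a prescribed one) appends one JSON line per run, which is enough to answer "what did the sync do on the day this answer went wrong."

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SyncRunRecord:
    """One record per sync run: what changed, what failed, and why."""
    started_at: str                        # ISO-8601 timestamp
    trigger: str                           # "scheduled" | "manual" | "webhook"
    added: list[str] = field(default_factory=list)
    updated: list[str] = field(default_factory=list)
    deleted: list[str] = field(default_factory=list)
    failures: dict[str, str] = field(default_factory=dict)   # doc_id -> error summary


def log_sync_run(record: SyncRunRecord, path: str = "sync_history.jsonl") -> None:
    # Append-only JSON lines keep the history cheap to write and easy to query.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")
```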
These requirements add up to a substantial engineering investment. Most organizations underestimate it because the demo of continuous sync looks simple — files appear; they get indexed; the system stays current. The reliability under load, across credential rotations, across source-system errors, across operational drift, is where the engineering pays off.
The cadence question¶
Once continuous sync is built, the operator has to decide how often it runs. The right cadence depends on factors that vary by deployment; there is no single correct answer.
How often does the source actually change? A corpus that gets one new document per week doesn't need hourly sync; daily is fine. A corpus where users actively edit documents throughout the day might need hourly to keep pace with their expectations. The cadence should match the source's actual change rate, not exceed it.
What's the cost of staleness? A borough manager fielding a routine policy question can tolerate documentation that's a day old. A clerk responding to a same-day RTKL request can't tolerate documentation that's a week old. The acceptable staleness window varies by use case; the cadence has to be fast enough that staleness within the window is the exception rather than the rule.
What's the cost of the sync itself? A daily sync on a 5,000-document corpus costs about a dollar a month with the lightweight-sync default. An hourly sync costs ten times as much. The economics constrain how fast the cadence can run before the cost becomes a procurement conversation.
What does the source system tolerate? Some source systems are happy with hourly sync; some throttle aggressively; some have explicit quotas that constrain the achievable cadence regardless of what the user wants.
Most organizations land on daily as the default for steady-state operation, with the option for manual sync-on-demand when something specific needs to be picked up immediately. The default is unglamorous and is roughly correct for most settings. Operations are not a place where you need the latest five minutes of corpus state; they're a place where you need to be confident that yesterday's changes are reflected today.
Webhooks and near-real-time updates¶
For organizations that need sub-daily freshness, the next step beyond scheduled cadence is webhook-driven updates. The source system pushes a notification when a document changes; the knowledge system responds by syncing only that document; the corpus reflects the change within seconds.
This is achievable but adds complexity. Webhook subscriptions in Microsoft Graph expire and have to be renewed; the renewal is its own scheduled task that itself has to be reliable. Webhook payloads have to be validated cryptographically. Webhook deliveries can fail; the system has to handle missing or duplicate notifications gracefully. The fallback path — periodic reconciliation against the source to catch what webhooks missed — has to exist and run.
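As an illustration of the first two points, here is a framework-agnostic sketch of a notification endpoint in the style of Microsoft Graph change notifications: echo the validation token during the subscription handshake, check `clientState` before trusting a payload, and deduplicate deliveries. The helper names and the downstream hook are assumptions, and the periodic reconciliation pass still has to exist alongside it.

```python
import hmac

EXPECTED_CLIENT_STATE = "read-from-secret-store"   # the value supplied when the subscription was created


def handle_notification(query: dict, body: dict | None, seen: set) -> tuple[int, str]:
    """Return (status_code, response_body) for a Graph-style change notification."""
    # 1. Subscription validation handshake: echo the token back as plain text.
    if "validationToken" in query:
        return 200, query["validationToken"]

    # 2. Real notifications: verify clientState before trusting anything in the payload.
    changed = []
    for note in (body or {}).get("value", []):
        if not hmac.compare_digest(note.get("clientState", ""), EXPECTED_CLIENT_STATE):
            continue                                  # drop notifications we can't authenticate
        resource = note.get("resource", "")
        if resource in seen:
            continue                                  # deliveries can repeat; deduplicate
        seen.add(resource)
        changed.append(resource)

    # 3. Sync only the affected documents; a periodic reconciliation pass
    #    still runs to catch anything webhooks missed.
    # enqueue_targeted_sync(changed)                  # hypothetical downstream hook
    return 202, ""
```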
The benefit is real but narrow. Most operational settings don't actually need sub-daily freshness; the productivity gain from "the corpus is current within seconds" over "the corpus is current within a day" is marginal compared to the engineering cost. Webhooks are right when there's a specific operational moment that depends on it (a regulated workflow that requires acting on the latest version of a guidance document within minutes of its release) and over-engineered when the use case is just "we want it as fast as possible."
The right architectural posture is to build the cadence layer first, prove it works, run it for several months, observe whether sub-daily freshness becomes a recurring user complaint, and add webhooks if and when the data argues for it.
What to look for¶
If continuous corpus maintenance matters in your environment — and in any setting where documents change regularly, it does — the questions worth asking of any system you evaluate go well beyond "does it sync." How does the system detect changes — by re-downloading everything, or by using the source system's metadata efficiently? When a document is updated, are the previous version's chunks removed from the index before the new version is ingested, or does the corpus accumulate duplicates? When one document's sync fails, does the failure abort the whole run or get isolated and reported per-document? How are credentials managed — as static configuration that has to be hand-rotated, or acquired at runtime from a secret store? Is the cost of frequent sync controllable, or does the system re-classify every document with an LLM on every run?
A specific test worth running with any vendor: ask them to describe what happens when a sync run hits a 401 on one document and a 500 on another, midway through processing two thousand files. The answer reveals more than the demo does. A vendor whose answer is "we retry the whole run from scratch" is admitting the architecture isn't ready for production reliability.
If you're working through these tradeoffs and want a sounding board — diagnostic, not pitch — we'd welcome the conversation.
About Base2ML. Base2ML is a Pittsburgh-based company building knowledge-access tools for organizations that need to find what they already have. We work in the specific space where retrieval, authority hierarchy, and conflict surfacing meet operational reality.
Contact. Base2ML · chris@base2ml.com · base2ml.com · docs.base2ml.com
Numbers and percentages are deliberately not invented. Where industry research provided a credible figure we cite it; where it didn't, we say so rather than fabricating one.