Last week I asked an agent to make a small change to a single file in one folder. A copy edit, near enough: adjust how one function returned its result, nothing structural. I went to make coffee. When I came back the agent had done what I asked, and more. It had touched the file I named, three sibling files in folders I had not mentioned, and a configuration file at the root of the project. None of it was wrong, exactly. All of it was outside the scope of the sentence I had typed. I pointed this out and received an apology. There was no undo. The only way back to a known-good state was to discard everything since the last commit and start again, because the spread of the changes was wider than anything the conversation could reverse.
The agent did what it was told. The harness did not enforce what it should have. The seam split open.
The other failure mode
This was not the first time the seam had opened, only the most expensive. A few months earlier I had handed an agent a functional specification, six thousand words of it, and asked it to draft the technical specification that follows from it. It told me it had read the document. It had not. It had read something like the first quarter, reached a point where it had enough to sound confident, declared the document understood, and written the rest by filling in what a document like that usually says. I only caught it because I run a quality check on that step out of habit, comparing the output back against the source. Without that check the invented half would have travelled downstream looking exactly as authoritative as the real half.
This failure is getting better, and it is worth saying where the improvement comes from. The stronger harnesses now surface their own reading. When a document is long, the agent reports that it is working through it in passes, that it has covered one portion and is moving to the next. That running progress report is what stops a partial read from presenting itself as a finished one. That is a harness-level fix to a harness-level problem. The tools sit at different points on the same curve. Some surface the read clearly. Others, at the versions where I hit this, did not.
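To make "surfacing the read" concrete, here is a minimal sketch of a ledger a harness could keep while it loads a long document in passes. It is an illustration under stated assumptions, not how any particular tool works: the `ReadLedger` class, its method names, and the choice to measure coverage in characters are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReadLedger:
    """Hypothetical ledger of which character ranges were actually loaded."""

    total_chars: int
    spans: list[tuple[int, int]] = field(default_factory=list)

    def record(self, start: int, end: int) -> None:
        # Clamp to the document bounds so a bad span cannot inflate coverage.
        start, end = max(0, start), min(self.total_chars, end)
        if start < end:
            self.spans.append((start, end))

    def covered_chars(self) -> int:
        # Walk the spans in order and skip overlap, so re-reading a passage
        # is not counted twice.
        covered, cursor = 0, 0
        for start, end in sorted(self.spans):
            start = max(start, cursor)
            if start < end:
                covered += end - start
                cursor = end
        return covered

    def coverage(self) -> float:
        return self.covered_chars() / self.total_chars if self.total_chars else 1.0

    def report(self) -> str:
        # The running progress line the harness shows after each pass.
        return (
            f"read {self.covered_chars()} of {self.total_chars} chars "
            f"({self.coverage():.0%})"
        )

    def assert_complete(self) -> None:
        # Refuse to claim completion on a partial read.
        if self.covered_chars() < self.total_chars:
            raise RuntimeError(f"partial read: {self.report()}")
```

The harness would call `record` after each pass, show `report()` as it goes, and call `assert_complete()` before letting the agent say the document has been read. The point of the design is that the claim "I read it" is checked against a record the agent does not write itself.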
The model was doing its job. The harness was not doing one of its jobs.
Why these are seam failures
It matters where you locate these failures, because the usual diagnosis is wrong. In both cases the model did exactly what it was prompted to do, with the context it was handed, inside the constraints the harness set for it. The model did not know it had received only the top quarter of the specification, because the harness decided how much to load and did not say. The model did not know the working set was meant to be one file in one folder, because nothing in the harness drew that boundary and held it. The harness is the layer that fetches, chunks, decides what to read, decides what to summarise, decides what may be written and where. That is the layer where both of these failures live.
The discourse keeps mislocating the problem. “The model is dumb” is the comfortable verdict, and it is the wrong one. The model is doing the job it was handed. The harness handed it the wrong job, or handed it the right job without the guardrails that would have made the right outcome the only available one.
The fix is structural, and it is visible at the harness layer. Surface what was read. Surface what is about to be written, and where, before it is written. Refuse to claim completion on a partial read. Refuse to write outside the declared scope without an explicit gate that a human, or another system, has to pass.
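The write-side half of that fix can be sketched in a few lines. This is a hedged illustration, not a description of any existing harness: `out_of_scope`, `gate_writes`, `ScopeViolation`, and the file paths in the usage note are hypothetical names chosen for the example, and paths are assumed to be relative POSIX-style strings.

```python
from pathlib import PurePosixPath


class ScopeViolation(Exception):
    """Raised when a proposed write falls outside the declared scope."""


def out_of_scope(proposed: list[str], declared: list[str]) -> list[str]:
    """Return the proposed paths not covered by any declared file or folder."""
    roots = [PurePosixPath(d) for d in declared]
    misses = []
    for raw in proposed:
        path = PurePosixPath(raw)
        # Treat any ".." segment as out of scope rather than trying to resolve it.
        escapes = ".." in path.parts
        # In scope means: the path is a declared file, or sits under a declared folder.
        inside = any(path == root or root in path.parents for root in roots)
        if escapes or not inside:
            misses.append(raw)
    return misses


def gate_writes(proposed, declared, approved=frozenset()):
    """Surface the write plan; block anything out of scope and not approved."""
    blocked = [p for p in out_of_scope(proposed, declared) if p not in approved]
    if blocked:
        raise ScopeViolation(f"writes outside declared scope need approval: {blocked}")
    return list(proposed)
```

With a declared scope of `["src/billing/invoice.py"]` and a plan that also touches `src/billing/tax.py` and `settings.toml`, `gate_writes` raises before anything is written, and the two extra paths appear in the error. The edit goes ahead only if a human, or another system, passes those paths in `approved`. The check runs on the plan, before the write, which is what makes it a gate rather than an apology.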
The dev cost and the regulated cost
In development work both failures are an irritation rather than a wound. A discarded commit costs a minute. A quality check catches the invented specification before it ships. The cost is friction, paid in time, and the work has somewhere safe to fail: a version-control system that remembers the last good state, a reviewer who reads the output, a test that goes red.
Move the same two failures into a regulated production process and the cost shape changes entirely. The out-of-scope edit is no longer a discarded commit. It is a change to a validated system that now has to be investigated, documented, and justified. The partial read is no longer a re-run. A specification read to the first quarter and confidently completed becomes a batch release narrative, a model validation record, a control description that reads as authoritative and is partly invented. The recovery path is not a git reset. It is a deviation report, an audit finding, a paused line, a letter from a regulator.
The harness improvements that handle the development case do not reach this far. They make the wrong outcome less likely. They do not make it auditable. A regulated process does not need the agent to have probably got it right. It needs the structured evidence that the right things were considered, with citations a reviewer can verify independently.
Preflight is not about trust
An older discipline already has the answer. A pilot walks around the aircraft before flight and checks the fuel cap, the control surfaces, the pitot tube. She does not do this because she distrusts herself or the aircraft. She does it because the time to find a problem is on the ground, where finding it is cheap, rather than in the air, where it is not. The check is a routine, not a judgement. It does not depend on how she feels about the aircraft that morning.
The same logic applies the moment an AI workflow enters a regulated process. Trust in the model is beside the point. What matters is whether you find the fault on the ground or in the air. The check has to be structural, run every time, and independent of how confident the agent sounded.
Concretely, the preflight for an AI workflow entering regulated territory names the things the agent would otherwise leave implicit: which regulations apply, with citations; which controls are present and which are missing; what the system assumed on your behalf when the description was incomplete; which clarifying questions would change the output if you answered them; what was deliberately left out of scope, and why; and the verification steps a reviewer can run independently to check each conclusion.
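Those items can be checked mechanically before anyone reads the content. The sketch below borrows its section names from the sample assertion shown later in this piece; the functions themselves, and the assumption that every section must be present as a list, are mine rather than anything the PIF specification is known to require.

```python
# Section names taken from the sample assertion in this article.
REQUIRED_SECTIONS = (
    "applicable_requirements",
    "missing_controls",
    "assumptions_made",
    "clarifying_questions",
    "out_of_scope",
    "verification_steps",
)


def missing_sections(assertion: dict) -> list[str]:
    """Return required sections that are absent or not lists.

    An empty list passes: "nothing out of scope" is a statement a reviewer
    can challenge, while an absent section is silence.
    """
    return [
        name
        for name in REQUIRED_SECTIONS
        if not isinstance(assertion.get(name), list)
    ]


def unverified_requirements(assertion: dict) -> list[str]:
    """Return ids of requirements that no verification step points at."""
    steps = assertion.get("verification_steps", [])
    checked = {step["applies_to"] for step in steps if "applies_to" in step}
    return [
        req.get("requirement_id", "<no id>")
        for req in assertion.get("applicable_requirements", [])
        if req.get("requirement_id") not in checked
    ]
```

A harness could refuse to hand an assertion to a reviewer while either function returns anything. Neither check says the content is right. They only say that nothing was left implicit, which is the part a routine can guarantee.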
This is what we are building with Preclari. It is built around an open specification called PIF, the Preflight Interchange Format, released under Apache 2.0 so the artifact outlives any one tool. The agent designing the workflow can call it directly and get back a deterministic assertion before a human is involved. The human reviewing the workflow gets the same assertion as a readable brief. Same hash, two readers, one record. The structural move is exactly the one the better harnesses are making on partial reads: externalise the check, surface what was done, make the limits of the work visible. Preclari makes that move for the regulatory question rather than the document-reading one.
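One way to read "same hash, two readers" is a digest computed over a canonical encoding of the assertion and stamped on both the machine-readable record and the human brief. The sketch below assumes SHA-256 over JSON with sorted keys. That choice, and the names `assertion_digest` and `render_brief`, are illustrative assumptions; nothing here describes how PIF or Preclari actually computes its hash.

```python
import hashlib
import json


def assertion_digest(assertion: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(
        assertion, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def render_brief(assertion: dict) -> str:
    """Readable brief for a human reviewer, stamped with the same digest."""
    lines = [f"Preflight {assertion['assertion_id']} ({assertion['status']})"]
    lines += [
        f"- missing control: {c['control']}"
        for c in assertion.get("missing_controls", [])
    ]
    lines.append(f"sha256: {assertion_digest(assertion)}")
    return "\n".join(lines)
```

Because the keys are sorted before hashing, two producers that emit the same fields in a different order get the same digest, and any change to a value changes it. The reviewer's brief and the agent's JSON can then be tied to one record by comparing a single string.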
The preflight is not about distrust of the model. It is about not testing in flight.
{
  "pif_version": "0.1",
  "assertion_id": "pa_2026_001_a",
  "workflow_ref": "wf_qd_triage_2026_001",
  "risk_classification": { "level": "medium" },
  "assumptions_made": [
    {
      "assumption": "The facility holds a Swiss GMP manufacturing authorisation.",
      "impact_if_wrong": "Swiss-specific requirements would not apply."
    }
  ],
  "clarifying_questions": [
    {
      "question": "Is there an existing URS for this workflow?",
      "would_change": "If yes, the missing URS control would be removed."
    }
  ],
  "applicable_requirements": [
    {
      "requirement_id": "req_001",
      "source": { "canonical_document_id": "EU-GMP-Annex-11" },
      "confidence": "high"
    }
  ],
  "missing_controls": [
    {
      "control": "documented_user_requirements_specification",
      "criticality": "required",
      "source": "EU-GMP-Annex-11 §4"
    }
  ],
  "out_of_scope": [
    {
      "topic": "FDA 21 CFR Part 11",
      "reason": "Workflow jurisdictions do not include US."
    }
  ],
  "verification_steps": [
    {
      "step": "Confirm Annex 11 §4 applies to this system category.",
      "applies_to": "req_001"
    }
  ],
  "status": "draft",
  "notice": "This assertion is informational and does not constitute regulatory advice."
}
The seam, again
Both of my failures were recoverable. The blast-radius edit was recoverable because the work lived in version control, and version control is a system outside the agent that remembers the last good state. The partial read was recoverable because the quality check existed, and the check is a step outside the agent that does not take the agent’s word for what it did. In each case recovery came from leaving the agent’s frame and consulting something the agent did not control.
The regulated equivalents do not come with a git reset, and the quality check is the regulator. By the time the seam splits in that context, the work is already in the air. The cost of finding something wrong on the ground is a checklist item. The cost of finding it in flight is everything that word implies. The harness is the cockpit. The preflight is what makes the cockpit usable.
Further reading: The Math Under the AI Bet (Pramod Prasanth, 23 April 2026). The four-layer architecture and the accountability-ceiling argument in full.