Mar 28, 2026·1 min read

Shipping multi-agent systems you can actually trust

Lessons from putting a five-stage agentic pipeline into a regulated, audit-heavy workflow, and where to draw the boundary between LLM judgment and deterministic math.

#agents #reliability #production


The short version: LLMs propose, engines decide. Unvalidated model output never enters the system of record. Every judgment call the agent makes is captured, graded against ground truth, and rolled up into a confidence number that determines whether the next stage runs or a human gets pinged.
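To make the gating idea concrete, here is a minimal sketch of one way such a roll-up could work. It assumes confidence is simply the fraction of graded judgments that matched ground truth and that the cutoff is 0.9; both are illustrative choices, and the names GradedJudgment, confidence, and gate are hypothetical rather than taken from the real pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GradedJudgment:
    """One judgment call made by the agent, graded against ground truth."""
    name: str
    correct: bool  # True when the agent's call matched ground truth


def confidence(judgments: List[GradedJudgment]) -> float:
    """Roll graded judgments up into one number.

    Assumption: a plain accuracy fraction. A real system might weight
    judgments by risk or by stage.
    """
    if not judgments:
        return 0.0  # no evidence means no confidence
    return sum(j.correct for j in judgments) / len(judgments)


def gate(judgments: List[GradedJudgment], threshold: float = 0.9) -> str:
    """Decide whether the next stage runs or a human gets pinged."""
    if confidence(judgments) >= threshold:
        return "run_next_stage"
    return "ping_human"


graded = [
    GradedJudgment("date_format", True),
    GradedJudgment("currency", True),
    GradedJudgment("vendor_match", False),
    GradedJudgment("tax_code", True),
]
print(gate(graded))  # 3 of 4 correct is below the assumed 0.9 cutoff
```

Returning "ping_human" on an empty list is deliberate in this sketch: a stage with nothing graded fails closed instead of passing by default.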

The boundary that matters

Most "agentic" demos blur the line between reasoning and execution. In regulated work, that line is the entire product.

What held up

Confidence-gated column mapping.
Deterministic heuristics where LLMs are overkill.
A small, hand-graded evaluation set I keep green before every deploy.
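The three items above can be sketched together in a few lines. This is an illustration under stated assumptions, not the production code: the schema in CANONICAL, the ALIASES table, the 0.85 threshold, and the propose callable (a stand-in for whatever LLM call returns a target column and a confidence) are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical target schema and alias table, for illustration only.
CANONICAL = {"invoice_date", "amount", "vendor_name"}
ALIASES = {
    "inv date": "invoice_date",
    "amt": "amount",
    "total": "amount",
    "vendor": "vendor_name",
}

# Stand-in for an LLM call: header -> (proposed column, confidence in [0, 1]).
Proposer = Callable[[str], Tuple[str, float]]


def normalize(header: str) -> str:
    """Lowercase, treat '_' and '.' as spaces, collapse whitespace."""
    cleaned = header.lower().replace("_", " ").replace(".", " ")
    return " ".join(cleaned.split())


def map_column(header: str, propose: Proposer, threshold: float = 0.85) -> Optional[str]:
    """Map a raw header to a canonical column, or None to escalate."""
    key = normalize(header)
    # Deterministic heuristics first: exact schema match, then known aliases.
    as_canonical = key.replace(" ", "_")
    if as_canonical in CANONICAL:
        return as_canonical
    if key in ALIASES:
        return ALIASES[key]
    # Only now ask the model, and validate what it proposes.
    target, conf = propose(header)
    if target in CANONICAL and conf >= threshold:
        return target
    return None  # unvalidated or low-confidence output never gets written


def eval_set_is_green(cases: List[Tuple[str, Optional[str]]], propose: Proposer) -> bool:
    """True when every hand-graded (header, expected) case maps as expected."""
    return all(map_column(header, propose) == expected for header, expected in cases)
```

With a stub proposer such as lambda h: ("vendor_name", 0.95), a header like "Supplier" maps to vendor_name, while the same proposal at 0.5 confidence returns None and would go to a human. A deploy script could refuse to ship unless eval_set_is_green returns True on the hand-graded cases.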

What broke

TODO: the failure modes I actually hit and what I did about them.