Mar 28, 2026 · 1 min read
Shipping multi-agent systems you can actually trust
Lessons from putting a five-stage agentic pipeline into a regulated, audit-heavy workflow — and the boundary between LLM judgment and deterministic math.
#agents #reliability #production
The short version: LLMs propose, engines decide. Unvalidated model output never enters the system of record. Every judgment call the agent makes is captured, graded against ground truth, and rolled up into a confidence number that determines whether the next stage runs or a human gets pinged.
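To make the gate concrete, here is a minimal sketch of the roll-up and the run-or-escalate decision. Everything in it is an assumption for illustration: the `Judgment` record, the plain fraction-correct roll-up, the 0.9 threshold, and the action names are hypothetical, not the pipeline's actual scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One captured judgment call (hypothetical record shape)."""
    stage: str
    proposal: str
    correct: bool  # result of grading the proposal against ground truth


def stage_confidence(judgments: list[Judgment]) -> float:
    """Roll graded judgments up into one number: the fraction graded correct.

    Assumption: a simple unweighted fraction; a real system may weight calls.
    """
    if not judgments:
        return 0.0  # no evidence means no confidence
    return sum(j.correct for j in judgments) / len(judgments)


def next_action(judgments: list[Judgment], threshold: float = 0.9) -> str:
    """Deterministic gate: the engine, not the model, decides what runs next."""
    if stage_confidence(judgments) >= threshold:
        return "run_next_stage"
    return "escalate_to_human"
```

The model never calls `next_action` on its own behalf. It only produces proposals, and the gate is ordinary code that can be unit-tested and audited.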
The boundary that matters
Most "agentic" demos blur the line between reasoning and execution. In regulated work, that line is the entire product.
What held up
- Confidence-gated column mapping.
- Deterministic heuristics where LLMs are overkill.
- A small, hand-graded evaluation set I keep green before every deploy.
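The three items above can be sketched together in a few lines. This is an illustration under stated assumptions, not the production code: the `CANONICAL` schema, the `propose` callable standing in for the LLM, the 0.85 threshold, and the `EVAL_SET` rows are all hypothetical.

```python
import re
from typing import Callable, Optional

# Hypothetical target schema; the real system of record defines its own.
CANONICAL = {"invoice_id", "amount", "posting_date"}

# Stand-in type for the LLM call: returns (proposed_target, confidence in [0, 1]).
Proposer = Callable[[str], tuple[str, float]]


def normalize(name: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics to single underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


def map_column(raw: str, propose: Proposer,
               threshold: float = 0.85) -> tuple[Optional[str], str]:
    """Return (target, source), where source says who made the decision."""
    key = normalize(raw)
    if key in CANONICAL:
        # Deterministic heuristic: an exact normalized match needs no LLM.
        return key, "heuristic"
    target, confidence = propose(raw)
    # The proposal is accepted only if it names a real column AND clears the gate.
    if target in CANONICAL and confidence >= threshold:
        return target, "llm"
    return None, "human_review"


# Hypothetical hand-graded pairs: (raw header, expected canonical column).
EVAL_SET = [("Invoice ID", "invoice_id"), ("Amt (USD)", "amount")]


def eval_is_green(propose: Proposer) -> bool:
    """True only if every hand-graded example maps to its expected column."""
    return all(map_column(raw, propose)[0] == expected for raw, expected in EVAL_SET)
```

Checking `eval_is_green` before each deploy keeps the evaluation set small enough to grade by hand while still catching regressions in either the heuristics or the model's proposals.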
What broke
TODO: the failure modes I actually hit and what I did about them.