01 The problem
The brief was to verify visual evidence for damage claims. For each claim the system gets the chat describing the issue, one or more submitted photos, the user's claim history, and the minimum evidence that claim type is supposed to carry. From that it has to judge whether the photos support the claim, contradict it, or don't show enough to decide.
The hard part isn't reading one clean photo. It's telling "the image disagrees with the claim" apart from "the image can't tell us." A confidently wrong "contradicted" verdict on a blurry shot is as damaging as waving real fraud through. History can raise suspicion, but it can't be allowed to override what a clear photo plainly shows, and adversarial text inside the chat (or rendered inside an image) can't be allowed to steer the verdict either.
02 What I built
One verification agent with a real tool-calling loop, run once per claim. Rather than a rules pipeline or a multi-agent crew, a single reasoner makes one judgment over one claim and calls tools when it needs a fact. inspect_image runs a vision pass on a single photo and returns structured JSON: the object and part in view, the damage type, quality flags like blur, glare, and crop, authenticity cues, and any text written inside the image. get_evidence_requirement grounds whether the evidence standard was met, and get_user_history surfaces the risk signal.
Inspecting each photo on its own, instead of swallowing the whole claim in one look, is what keeps the citations honest: the agent can point to the clear image and skip the blurry one. The loop re-inspects when an observation conflicts with the claim or the user carries history risk, then a synthesis step writes 14 structured fields. Throughout, the chat and any in-image text are treated as data to be flagged, never as instructions to obey.
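To make the per-image idea concrete, here is a minimal sketch of what one inspect_image observation and the re-inspection trigger could look like. The field names, the ImageObservation class, and the helper functions are illustrative assumptions rather than the actual schema; only the categories of information (object, part, damage type, quality flags, authenticity cues, in-image text) come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ImageObservation:
    """Hypothetical shape of one inspect_image result (names are assumed)."""
    image_id: str
    object_in_view: str
    part: str
    damage_type: str
    quality_flags: list[str] = field(default_factory=list)      # e.g. "blur", "glare", "crop"
    authenticity_cues: list[str] = field(default_factory=list)
    in_image_text: list[str] = field(default_factory=list)      # kept as data to flag, never obeyed


def is_usable(obs: ImageObservation) -> bool:
    # Assumption: any quality flag makes the photo unfit to cite.
    return not obs.quality_flags


def citable_image_ids(observations: list[ImageObservation]) -> list[str]:
    # Point to the clear images and skip the flagged ones.
    return [obs.image_id for obs in observations if is_usable(obs)]


def needs_reinspection(obs: ImageObservation, claimed_damage: str, history_risk: bool) -> bool:
    # Re-inspect when the observation conflicts with the claim
    # or the user carries history risk.
    conflicts = obs.damage_type != claimed_damage
    return conflicts or history_risk
```

Because each photo carries its own observation, a blurry shot drops out of the citation list without affecting what the clear shot is allowed to prove.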
03 Key decisions & tradeoffs
-
One agent and a tool loop, not rules or a crew
The task is a single judgment over a single claim, so there's no work to split across agents and nothing a fixed rule can settle on its own. Weighing what a photo shows against a claim's severity needs reasoning, not a lookup table.
Tradeoff A reasoning agent is harder to keep reproducible than fixed rules, so the rules I do keep live in a validator over format and evidence sufficiency, never over the verdict itself.
-
Build the evaluation harness before the agent
I wrote the metrics and a stub predictor first, so from day one the question was "did the number move," not "does it run." The headline metric was chosen on purpose: the claim-status confusion matrix, with the contradicted-vs-not-enough-info cells called out.
Tradeoff Slower to a first end-to-end run, bought in exchange for every later change being measured instead of guessed at.
-
Per-image inspection, returning structured JSON
Every photo gets its own vision pass and a rich structured observation, which protects the final decision from a lossy summary and keeps supporting_image_ids honest.
Tradeoff More vision calls per claim than a single batched look, paid for cleaner per-image evidence you can actually audit.
-
Three independent decision axes
"Is the evidence sufficient," "is the image usable," and "does the image support the claim" are genuinely different questions, so evidence_standard_met, valid_image, and claim_status are decided separately and allowed to diverge.
Tradeoff More fields to reason about and reconcile, in return for verdicts that don't quietly collapse three distinct questions into one.
-
A validator that flags, never overwrites
The validator normalizes format loudly (trimming whitespace, lowercasing booleans) but reports every schema or invariant violation rather than silently fixing it. A quietly corrected contradiction would hide the exact model error rate I needed to measure.
Tradeoff The output isn't auto-polished to look perfect. It tells the truth about where the model is still wrong, which is what I wanted on the record.
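The flag-don't-overwrite behaviour in the last decision above can be sketched roughly as follows. This is an illustration under assumptions: the label strings, the string encoding of booleans, the flag names, and the single invariant shown are all hypothetical stand-ins, not the validator's real rule set.

```python
# Assumed label strings and boolean fields; the real schema may differ.
ALLOWED_STATUS = {"supported", "contradicted", "not_enough_info"}
BOOLEAN_FIELDS = ("evidence_standard_met", "valid_image")


def validate(record: dict) -> tuple[dict, list[str]]:
    """Normalize format and report every change or violation.

    The verdict itself is never rewritten: violations are only flagged.
    """
    normalized = dict(record)
    flags: list[str] = []

    # Format normalization, reported rather than applied silently.
    for key, value in record.items():
        if isinstance(value, str):
            cleaned = value.strip()
            if key in BOOLEAN_FIELDS:
                cleaned = cleaned.lower()
            if cleaned != value:
                normalized[key] = cleaned
                flags.append(f"normalized:{key}")

    # Schema checks: flag, do not repair.
    if normalized.get("claim_status") not in ALLOWED_STATUS:
        flags.append("schema:claim_status")
    for key in BOOLEAN_FIELDS:
        if normalized.get(key) not in ("true", "false"):
            flags.append(f"schema:{key}")

    # Hypothetical invariant: a decisive verdict with no usable image.
    decisive = normalized.get("claim_status") in {"supported", "contradicted"}
    if decisive and normalized.get("valid_image") == "false":
        flags.append("invariant:decisive_status_without_valid_image")

    return normalized, flags
```

Returning the flags alongside the record means an inconsistent verdict stays visible in the evaluation numbers instead of being corrected out of them.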
04 Outcome
The agent placed 5th out of 1,773 finalists at HackerRank Orchestrate in June 2026, from a field of more than 15,000 registrants. On the 20 labelled samples it scored 16/20 on claim status (macro-F1 0.73), with the contradicted-vs-not-enough-info confusion cells sitting at zero. The misses that remained were supported-vs-contradicted vision close calls, not reasoning-boundary errors. A live, un-cached 20-sample run cost about $0.73 across 69 model calls, which projects to roughly $1.60 and ten minutes for the full set. The result came from the same habit as my first Orchestrate run: build the measurement first, then be able to defend every decision afterwards.
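For readers who want to see how claim-status figures like these are computed, here is a small sketch of a confusion matrix and macro-F1 with a stub predictor of the kind a harness can start from. The label strings, function names, and the convention of scoring a label with no gold or predicted examples as 0.0 are assumptions made for illustration, not the harness's actual code.

```python
from collections import Counter

# Assumed label strings for the three claim-status outcomes.
LABELS = ["supported", "contradicted", "not_enough_info"]


def confusion_matrix(gold: list[str], pred: list[str]) -> dict:
    """matrix[g][p] counts claims whose gold label is g and predicted label is p."""
    counts = Counter(zip(gold, pred))
    return {g: {p: counts[(g, p)] for p in LABELS} for g in LABELS}


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-label F1 scores."""
    matrix = confusion_matrix(gold, pred)
    scores = []
    for label in LABELS:
        tp = matrix[label][label]
        fp = sum(matrix[g][label] for g in LABELS if g != label)
        fn = sum(matrix[label][p] for p in LABELS if p != label)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label with no support and no predictions scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def stub_predictor(claim: dict) -> str:
    # Baseline that never commits, so any real agent has a number to beat.
    return "not_enough_info"
```

The two cells to watch are matrix["contradicted"]["not_enough_info"] and matrix["not_enough_info"]["contradicted"], since they are where "the image disagrees" and "the image can't tell us" get confused.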