C · How we work

The code does the measuring. The AI judges over verified output. A human approves before anything changes.

There is one question behind every AI project: can I trust what this thing produces, or will it confidently make something up? Our answer is not a promise. It is an architecture. Deterministic programs fetch and check the data, the model only reasons over results that are already verified, every claim is measured, and nothing touches a real system until a person says go.


01 · The loop

Plan, execute, validate, report. Then a human approves the write.

Two tracks run side by side. A deterministic engine measures and checks. The model reasons over what it produced. A person gates anything that mutates a real system.

The engine track (top, in monospace) is deterministic and read-only. It fetches, computes, and runs a typed validator at three severities: error halts, warning flags, info notes. Only its verified output crosses into the AI track (bottom), where the model reasons and drafts the report. Because the engine never writes, the validator can run on a schedule without side effects. A human gate sits before any write to a real system.

02 · The principles

Five rules we hold even when they cost us.

A method is only real if it is inconvenient sometimes. Each rule below comes with one grounded example, not a slogan.

Engine first

A deterministic program does the fetching and the math. The model only reasons over output that has already been checked.

Most AI work asks a language model to both gather the facts and judge them. That is where things get invented. We split the job. A plain Python program (no magic, just code) pulls the real data, computes the numbers, and runs the checks. The model never sees raw guesses. It reads a verified result and writes the explanation a person actually wants. If the engine cannot answer, the model is not asked to pretend.

How it shows up

Two files, one job each

A skill is two files. A short prompt that narrates, and a read-only engine that measures. The engine is stdlib Python, owns the data and the validation, and returns structured JSON. The prompt turns that JSON into clear language. The split is the point: the part that can hallucinate never touches the source data.
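A minimal sketch of that split, with fictional data. The function names, the JSON fields, and the occupancy check are assumptions made for illustration, not the actual skill code.

```python
import json


def run_engine(rows):
    """Read-only engine: compute the figures, check them, return JSON.

    `rows` is a list of dicts with hypothetical keys "name" and
    "occupancy_pct". Nothing here writes to any system.
    """
    findings = []
    for row in rows:
        if not 0 <= row["occupancy_pct"] <= 100:
            findings.append({
                "severity": "error",
                "check": "sanity",
                "detail": f"{row['name']}: occupancy outside 0-100%",
            })

    verified = not any(f["severity"] == "error" for f in findings)
    values = [row["occupancy_pct"] for row in rows]
    average = round(sum(values) / len(values), 1) if verified and values else None

    return json.dumps(
        {"verified": verified, "average_occupancy_pct": average, "findings": findings},
        sort_keys=True,
    )


def build_prompt(engine_json):
    """Narrator side: sees only the engine's JSON, never the source rows."""
    result = json.loads(engine_json)
    if not result["verified"]:
        # If the engine cannot answer, the model is not asked to pretend.
        return None
    return (
        "Explain these verified figures in plain language. "
        "Use only the numbers given.\n" + engine_json
    )
```

The prompt builder takes the JSON string as its only input, so the part that can hallucinate has no path to the source data.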

Measure, do not assert

If a claim can be looked up, we look it up before we say it. A number on the page is an observation, not an opinion.

Search volume, a competitor's real price, whether a feature exists, what a config value actually is: these are facts you can check, not things to recall from memory. So we check them. The discipline is strongest when it is inconvenient. When a result looks too clean or a sample is too small to support a claim, we say so out loud instead of dressing it up.

Worked example

Caldrop, and the number we refused to fabricate

Caldrop is an internal feasibility spike: can you parse every US school district calendar at scale? It measured the things you can measure: about $0.0015 per district to resolve, and roughly $82 to $157 for a one-time national pass. Then four calendars parsed with zero dangerous errors. The tempting move is to call that an accuracy rate. We did not. n=4 is far too small to claim a rate, so the report said exactly that and sized the real measurement the next phase needs.
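One way to see why n=4 cannot carry a rate. This sketch uses the standard exact one-sided bound for zero failures in n independent trials. It is our illustration, not necessarily the calculation the Caldrop report used, and it assumes the calendars are an independent random sample.

```python
import math


def zero_failure_upper_bound(n, confidence=0.95):
    """Upper bound on the true failure rate after n trials with 0 failures.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    alpha = 1 - confidence
    return 1 - alpha ** (1 / n)


def trials_needed(max_rate, confidence=0.95):
    """Smallest n of failure-free trials that bounds the rate at max_rate."""
    if not 0 < max_rate < 1:
        raise ValueError("max_rate must be between 0 and 1")
    alpha = 1 - confidence
    return math.ceil(math.log(alpha) / math.log(1 - max_rate))


if __name__ == "__main__":
    print(f"n=4 upper bound: {zero_failure_upper_bound(4):.1%}")
    print(f"trials to bound at 1%: {trials_needed(0.01)}")
```

With four clean parses, the 95% upper bound on the dangerous-error rate is still about 53%. Bounding it at 1% under the same assumptions would take roughly 300 clean parses in a row.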

Read the Caldrop case study →

Validate before done

Plan, execute, validate, report. Nothing ships until a validator confirms the parts agree, and a build fails when they drift.

After any change we do not declare it finished. We run the validator, fetch the live state, and diff what we intended against what is actually there. On a project with one source of truth, every artifact (the deck, the web page, the spreadsheet) is generated from the same model. A consistency validator checks that they agree, and it fails the build if they do not, so they can never silently disagree.

How it shows up

A validator that fails the build on drift

On an underwriting platform we built, one Python model is canonical. Every downstream artifact reads from it, and a validator checks them against the model before publish. If a single figure drifts between the slide and the spreadsheet, the build stops. Disagreement is treated as a bug, not a rounding note.
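The shape of such a check, sketched with fictional figures. The function names, the dictionary layout, and the choice of exit status 2 for drift are assumptions for illustration, not the platform's actual code.

```python
def find_drift(canonical, artifacts):
    """Compare every artifact's figures against the canonical model.

    `canonical` maps figure name -> value. `artifacts` maps artifact
    name -> its own {figure name: value}. Returns a list of
    (artifact, figure, expected, actual) tuples; actual is None when
    the artifact is missing the figure.
    """
    drift = []
    for artifact_name in sorted(artifacts):
        figures = artifacts[artifact_name]
        for key in sorted(canonical):
            expected = canonical[key]
            actual = figures.get(key)
            # Exact comparison: disagreement is a bug, not a rounding note.
            if actual != expected:
                drift.append((artifact_name, key, expected, actual))
    return drift


def build_status(canonical, artifacts):
    """Return 0 when everything agrees, 2 when any figure has drifted."""
    return 2 if find_drift(canonical, artifacts) else 0
```

A build script could pass the result of `build_status` to `sys.exit`, so a single mismatched figure stops the publish step.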

Surgical changes

Build only what was asked. Every changed line traces to the request. We do not refactor working code on the way past.

Scope creep is how systems rot. We resist speculative flexibility, config for a single use case, and error handling for states that cannot happen. If a change touches more than a couple of files, it gets a short written plan first. The test we apply to our own work: could this be half as long? If yes, we rewrite it.

How it shows up

A plan before non-trivial work

Anything beyond a quick edit starts with three to five bullets: the approach, the files that change, and what you should verify before we proceed. It keeps the change honest and gives you a chance to redirect before code exists, not after.

Human in the loop for writes

The engines are read-only. They never write to a live system. A person approves anything that touches money or a real account.

Reading is safe. Writing is not. So our deterministic engines only ever read and check. They do not create, update, or delete a single record in a live system. A standing job can run every validator on a schedule precisely because there are no side effects to fear. When something does need to change a real system, that step is gated on a human saying go.

How it shows up

Read-only by design, writes by approval

The whole skill library can run unattended because it cannot break anything. A nightly check flags drift without mutating state. The moment an action would change a real account, the system stops and asks. The person sees exactly what is about to happen, then approves it.
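A sketch of the gate itself. `ProposedWrite`, `apply_with_approval`, and the two callbacks are hypothetical names. The point is only that the write function is unreachable without an explicit yes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedWrite:
    """A change the system wants to make, described before it happens."""
    target: str
    action: str
    before: str
    after: str

    def describe(self) -> str:
        return f"{self.action} on {self.target}: {self.before!r} -> {self.after!r}"


def apply_with_approval(
    change: ProposedWrite,
    approve: Callable[[str], bool],
    execute: Callable[[ProposedWrite], None],
) -> bool:
    """Show the person exactly what will happen; write only on approval."""
    if not approve(change.describe()):
        return False  # nothing was written
    execute(change)
    return True
```

In an interactive tool, `approve` might wrap a prompt that returns True only when the person types "go". In a scheduled run, no approver is supplied and no write path exists.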

03 · What "validated" looks like

The validator is typed, not a vibe.

A read-only check returns findings at three severities and an exit code. The illustration below uses fictional data. It is the shape of the check, not anyone's real numbers.

$ python3 occupancy.py --validate --week 2026-06-01

[ schema ]      ok   8 fields present, types match
[ coverage ]    ok   12 / 12 properties resolved
[ reconcile ]   ok   nightly rates reconcile to source
[ freshness ]   warn calendar for "Lakeview Cabin" is 3 days stale
[ sanity ]      ok   no occupancy value outside 0-100%

  info     2 owner-blocked nights excluded from paid occupancy
  warning  1 source older than the 48h freshness window

validation: 1 warning, 0 errors
exit 1   (0 = clean | 1 = warnings | 2 = errors)

error: Halts the run. The schema is wrong, auth failed, or a number is impossible. Nothing proceeds.

warning: Non-blocking but worth seeing. A source is stale, a contract has lapsed. Surfaced, not hidden.

info: Routine and noteworthy. A dedup applied, blocked nights excluded. The audit trail.

Exit 0 is clean, 1 is warnings, 2 is errors. A scheduled job reads that code and acts on it. The engine still wrote nothing.
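How findings might map to that exit code, as a minimal sketch. The `Severity` and `Finding` types are assumptions for illustration, and so is the choice that info findings alone still count as clean.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Severity doubles as the exit code it produces."""
    INFO = 0      # noted in the audit trail, run is still clean
    WARNING = 1   # surfaced, non-blocking
    ERROR = 2     # halts the run


@dataclass(frozen=True)
class Finding:
    severity: Severity
    check: str
    message: str


def exit_code(findings):
    """Exit code is the worst severity seen: 0 clean, 1 warnings, 2 errors."""
    return int(max((f.severity for f in findings), default=Severity.INFO))


def summary(findings):
    """One-line summary in the style of the transcript above."""
    warnings = sum(f.severity == Severity.WARNING for f in findings)
    errors = sum(f.severity == Severity.ERROR for f in findings)
    return f"validation: {warnings} warning{'s' if warnings != 1 else ''}, {errors} errors"
```

A scheduler then needs no parsing. The engine ends with `sys.exit(exit_code(findings))` and the job branches on the code.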

04 · What this means for you

The same method, read two ways.

If you run the business

We will not let the AI invent your numbers. The figures it shows you were measured, not guessed, and nothing changes a real account without you saying yes.

If you run the engineering

Deterministic core, typed validation gating every run, read-only engines, and human-approved writes. Measured end to end, with a build that fails on drift.

This is standing practice, not a sales line. It is written down, it runs on a schedule, and you can see it at work in the Caldrop spike and the DUSKFALL validator.
