Case study

Caldrop: measure before you build

A bounded feasibility spike that answered one open question with real numbers before a line of product was written: can you resolve and parse every US public school district academic calendar at scale, and at what coverage, cost, and dangerous-error rate?

Python (stdlib) · DataForSEO SERP · pdftotext · Vision parsing · Cost modeling · Feasibility spike

A product idea hinged on one question we could not honestly answer from a desk: is this even buildable at scale, and what would it cost? The tempting move is to assert plausible numbers and start coding. Instead we ran a bounded spike, produced measured figures for coverage, cost, and error, and named precisely what a small sample could not settle. The result is a go decision you can audit. This is the clearest example of the one habit that runs through all of our work: measure, do not assert.

Scope: Phase 0 spine + Phase 1 loop + cost model

Sample: 10 districts resolved, 4 calendars parsed

Total spend: $0.015

Decision: Green to Phase 2, NY-first

The question

Can you programmatically resolve and parse the current academic calendar for every US public school district at scale, and at what coverage rate, cost, and dangerous-error rate? Building first to find out would have burned weeks. So we scoped a spike, not a product: just enough to turn three unknowns into measured numbers, and to be honest about which ones the sample could not yet answer.

The method

We built the real national spine, proved the full resolve-to-parse loop on a deliberately diverse sample, then modeled the cost in token math so the answer would hold regardless of the one input we could not yet pin down.

Figure: the pipeline. The diagram calls out the two details a naive build would miss: the parser fork (a text-layer legend PDF is near-free, but a grid-only PDF needs vision just like a scanned image) and the second-hop branch (an official page that links out to the real file).

1. The spine, and the surprise inside it

Phase 0 built the district directory from the federal NCES Common Core of Data, served through a free public API. That gave a hard count to work against: 19,714 total agencies, 13,407 regular operating public districts, and 718 in New York. The first surprise arrived immediately. The directory carries no website field, so essentially every district has to be search-resolved. That is the realistic hard case, not a shortcut, and we chose to confront it.

The second surprise is where the difficulty lives. 48% of districts (6,456) enroll fewer than 1,000 students; 38% sit between 1,000 and 5,000; only 283 are above 25,000. Whatever breaks at scale will break in the small-district long tail, which is exactly where templating and clean web infrastructure thin out.

2. The resolver, and an honest funnel

The resolver issues a search query per district and ranks the returned URLs with a heuristic judge that down-ranks aggregator sites. Its cost is the one expensive input in the whole system, so we did not estimate it. We measured it: $0.0015 per district, ten districts for one and a half cents.
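A minimal sketch of what a heuristic judge of this kind might look like. The aggregator domain, the stopword list, and the score weights are all hypothetical; the study does not publish the judge's actual rules, only that it down-ranks aggregator sites.

```python
import re
from urllib.parse import urlparse

# Hypothetical aggregator list; the study does not name the sites it saw.
AGGREGATOR_DOMAINS = {"example-calendars.com"}
# Generic words that appear in most district names and carry no signal.
STOPWORDS = {"central", "school", "schools", "district", "public", "city", "union", "free"}


def name_tokens(district_name):
    """Distinctive lowercase words from the district name."""
    words = re.findall(r"[a-z]+", district_name.lower())
    return [w for w in words if w not in STOPWORDS and len(w) > 3]


def is_aggregator(host):
    return any(host == d or host.endswith("." + d) for d in AGGREGATOR_DOMAINS)


def score_url(url, district_name):
    """Higher score means more likely to be the district's own calendar."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()
    score = 0.0
    if is_aggregator(host):
        score -= 5.0  # down-rank third-party calendar sites (assumed weight)
    if any(tok in host for tok in name_tokens(district_name)):
        score += 2.0  # hostname resembles the district name
    if "calendar" in path:
        score += 1.0
    if path.endswith(".pdf"):
        score += 1.0
    return score


def rank_urls(urls, district_name):
    return sorted(urls, key=lambda u: score_url(u, district_name), reverse=True)


# Both URLs below are made up for illustration.
results = [
    "https://www.example-calendars.com/ny/cherry-valley/calendar.pdf",
    "https://www.cherryvalley.example/district/calendar.pdf",
]
ranked = rank_urls(results, "Cherry Valley-Springfield Central School District")
```

The judge is deliberately cheap: it runs on the URL alone, so the only metered cost in the resolver stays the search call itself.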

The funnel it produced is the most honest number in the study, because it refuses to flatter the approach:

Share	Outcome	What it means
4/10	Official current-year PDF, directly downloadable	The naive case, and only 4 of 10 clear it.
5/10	Official source found, but the artifact is HTML or needs a second hop	Roughly half the sample. Not an edge case.
1/10	Current year exists only as a draft	It is June. Many calendars are still finalizing.

A naive "find a PDF" resolver clears only 4 of 10. Getting to high coverage means building an HTML parse path and a second-hop step. That is the difference between roughly 40% coverage and a real product, and we named it rather than hiding it behind the cases that worked.
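The coverage arithmetic behind that claim can be sketched as follows. The bucket names are illustrative labels for the three funnel rows, and the second figure is a ceiling: it assumes the HTML and second-hop paths succeed on every artifact in their bucket, which a deliberately diverse sample of ten cannot establish.

```python
from collections import Counter

# Outcomes for the ten sampled districts, taken from the funnel above.
# The bucket names are labels chosen for this sketch.
SAMPLE = ["direct_pdf"] * 4 + ["html_or_second_hop"] * 5 + ["draft_only"] * 1


def coverage(outcomes, supported):
    """Fraction of outcomes that fall into a bucket the resolver supports."""
    counts = Counter(outcomes)
    return sum(counts[bucket] for bucket in supported) / len(outcomes)


naive = coverage(SAMPLE, {"direct_pdf"})
with_html_paths = coverage(SAMPLE, {"direct_pdf", "html_or_second_hop"})

print(f"PDF-only resolver:        {naive:.0%}")
print(f"Plus HTML and second hop: {with_html_paths:.0%} (ceiling)")
```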

3. The parser, and the load-bearing surprise

We parsed four calendars spanning every artifact subtype we hit. The cheap path works: a text-layer "legend" PDF parses with pdftotext at essentially zero cost and got the first and last day right every time. The expensive path also works: an image-only scanned PDF parses correctly through vision.

The surprise sat between them and would have sunk a naive cost model. One district's PDF has a text layer (488 text operations) but the actual dates live in color-coded grid cells, so plain extraction returns the legend and not the start and end dates. "Text-layer PDF" is not one thing. That single observation is why the cost model is built around a vision fraction rather than assuming the model is rarely needed.
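One way to express the parser fork, as a sketch under stated assumptions rather than the spike's actual router. The rule used here (take the text path only when both a first-day and a last-day label sit on a line with a date) and the regular expressions are assumptions; `pdftotext` is the poppler command-line tool, where a `-` output argument writes to stdout.

```python
import re
import subprocess

MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
# Matches "Sept 8", "June 25", "9/8", or "9/8/2026". Assumed date shapes.
DATE_RE = re.compile(
    r"\b(?:%s)[a-z]*\.?\s+\d{1,2}\b|\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b" % MONTHS,
    re.IGNORECASE,
)
FIRST_RE = re.compile(r"first day|classes begin|school (?:opens|begins)", re.IGNORECASE)
LAST_RE = re.compile(r"last day|classes end|school closes", re.IGNORECASE)


def extract_text(pdf_path):
    """Return the PDF's text layer via the pdftotext CLI (must be installed)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def line_has_dated_label(text, label_re):
    """True if some line carries both the label and a date."""
    return any(label_re.search(line) and DATE_RE.search(line)
               for line in text.splitlines())


def choose_parse_path(text):
    """Return "text" for the near-free path, "vision" otherwise."""
    if not text.strip():
        return "vision"  # image-only scan: no text layer at all
    if line_has_dated_label(text, FIRST_RE) and line_has_dated_label(text, LAST_RE):
        return "text"    # legend states the load-bearing dates outright
    return "vision"      # text layer exists, but dates live in the grid


legend_pdf = "Sept 8   First Day of School\nJune 25  Last Day of School"
grid_pdf = "First Day of School\nLast Day of School\nS M T W T F S\n1 2 3 4 5 6 7"
```

The point of the design is that the presence of a text layer is not the routing signal. The signal is whether the text layer actually contains the two dates that matter.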

The event schema is real

Each parsed calendar emits typed events with a per-event confidence score and source provenance. The first and last day of school are flagged as the load-bearing fields, the ones a parent always knows by heart and the ones a confirm step has to get right. One real parsed event, with credentials and internal paths removed:

{
  "district": "Cherry Valley-Springfield Central School District",
  "school_year": "2026-2027",
  "artifact_subtype": "pdf-image",
  "events": [
    {
      "title": "Classes Begin",
      "start": "2026-09-08",
      "end": null,
      "all_day": true,
      "type": "first_day",
      "applies_to": "all",
      "confidence": 1.0
    }
  ],
  "dangerous_events": ["first_day", "last_day"],
  "parse_confidence": 1.0
}
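A sketch of how a confirm step might consume this record, using only the field names visible in the excerpt. The `Event` dataclass, the `dangerous_gaps` helper, and the 0.9 threshold are hypothetical. Because the excerpt shows a single event, the check reports `last_day` as missing here; that reflects the trimmed excerpt, not the real parse.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    title: str
    start: str
    end: Optional[str]
    all_day: bool
    type: str
    applies_to: str
    confidence: float


def dangerous_gaps(doc, threshold=0.9):
    """Dangerous event types that are missing or below the threshold.

    The threshold is an assumed value, not one taken from the study.
    """
    events = [Event(**raw) for raw in doc["events"]]
    best = {}
    for event in events:
        best[event.type] = max(best.get(event.type, 0.0), event.confidence)
    return [t for t in doc["dangerous_events"] if best.get(t, 0.0) < threshold]


excerpt = json.loads("""
{
  "district": "Cherry Valley-Springfield Central School District",
  "school_year": "2026-2027",
  "artifact_subtype": "pdf-image",
  "events": [
    {"title": "Classes Begin", "start": "2026-09-08", "end": null,
     "all_day": true, "type": "first_day", "applies_to": "all",
     "confidence": 1.0}
  ],
  "dangerous_events": ["first_day", "last_day"],
  "parse_confidence": 1.0
}
""")

gaps = dangerous_gaps(excerpt)
```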

The cost model

Every figure here is reproducible token math against confirmed per-token pricing, plus the one measured input (the search cost). It runs with no network and its output matches the report to the cent. Crucially, it is shown across a range of vision fractions, so the conclusion survives the one thing we could not yet pin down.

Per-unit costs

Component	Cost	Unit
Resolver (SERP + judge + 35% second-hop)	$0.0033	per district
Vision parse, Haiku batch	$0.0093	per calendar
Vision parse, Sonnet batch	$0.0349	per calendar
Text-layer legend parse (pdftotext)	~$0	no model call

The search cost is the one figure that was actually metered. Everything else is token math.

One-time full coverage, by vision fraction

Vision fraction	NY (718)	National (13,407)
30%	$4.39	$82.05
50%	$5.73	$106.99
70%	$7.07	$131.92
90%	$8.40	$156.86

All totals assume the Haiku batch parse. Recurring full coverage runs about $24/yr for NY and $447/yr nationally. Lazy population means real spend tracks traffic, so these are ceilings.
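The shape of the one-time figures can be approximated from the per-unit costs listed above. This is a sketch, not the spike's cost model: it assumes one calendar per district and uses the rounded per-unit values as displayed, so its totals land within about half a percent of the table rather than matching to the cent.

```python
# Per-unit figures as displayed above (rounded to four decimals).
RESOLVER_PER_DISTRICT = 0.0033
VISION_PER_CALENDAR = 0.0093   # Haiku batch
TEXT_PER_CALENDAR = 0.0        # pdftotext path, no model call

DISTRICTS = {"NY": 718, "National": 13_407}


def one_time_cost(districts, vision_fraction):
    """Full-coverage pass, assuming exactly one calendar per district."""
    parse = (vision_fraction * VISION_PER_CALENDAR
             + (1 - vision_fraction) * TEXT_PER_CALENDAR)
    return districts * (RESOLVER_PER_DISTRICT + parse)


for fraction in (0.3, 0.5, 0.7, 0.9):
    row = "  ".join(f"{name} ${one_time_cost(n, fraction):,.2f}"
                    for name, n in DISTRICTS.items())
    print(f"{fraction:.0%}  {row}")
```

Cost is linear in the vision fraction, which is why an unknown fraction moves the total between bounds rather than changing the conclusion.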

The full national one-time pass costs roughly $82 to $157, and New York roughly $4 to $8 ($5.73 at a 50% vision fraction). Cost does not constrain this product at any plausible scale, and that is the point: it removes the cheap objection and forces the decision back onto the two things that actually matter, coverage and the error rate.

Failure modes, observed and fixed

These are not theorized risks. Each one was hit during the ten-district run, and each has a fix. One fix was verified end to end during the spike.

Mode	What we hit	Fix
Name collision (verified fix)	A NY district resolving to a same-named CA district	Append directory geography to the query
Aggregator interlopers	Third-party calendar sites outranking the district	Judge down-ranks non-district domains
HTML calendar, no file	About half the sample	Build an HTML parse path
Second hop	Official page links out to the real file	Fetch the page, then resolve the linked artifact
Draft, not yet final	Current year still under board review	Accept with low confidence, flag, fall back to prior year
Image-only scanned PDF	A small rural district's calendar	Vision parse
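The second-hop fix in the table can be sketched with the standard library alone. This covers only the "resolve the linked artifact" half; fetching the page is left out. The rule for what counts as a calendar link (the word "calendar" in the link text or URL, plus a `.pdf` path) is an assumption, and the URLs are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_calendar_artifacts(page_url, html):
    """Absolute URLs of linked PDFs that look like calendars."""
    collector = LinkCollector()
    collector.feed(html)
    hits = []
    for href, text in collector.links:
        mentions_calendar = "calendar" in (href + " " + text).lower()
        is_pdf = href.lower().split("?")[0].endswith(".pdf")
        if mentions_calendar and is_pdf:
            hits.append(urljoin(page_url, href))  # resolve relative links
    return hits


page = (
    '<p><a href="/docs/2026-27-calendar.pdf">District Calendar</a> '
    '<a href="/lunch.pdf">Lunch menu</a> '
    '<a href="https://files.example/cal.pdf?v=2">2026-27 Calendar</a></p>'
)
artifacts = find_calendar_artifacts("https://district.example/calendars", page)
```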

The verified case: a New York district that first collided with a same-named district in another state was recovered by appending its directory geography to the query. The aggregator sites that already rank on these searches are a concrete, beatable competitor class, not a vague threat.
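The disambiguation fix amounts to a small change in how the query is built. A minimal sketch, assuming the directory record exposes `name`, `city`, and `state` keys (hypothetical field names) and using a made-up district:

```python
def build_query(district, school_year):
    """Search query for a district's calendar, with geography appended.

    `district` is a directory record; the key names are assumptions.
    """
    parts = [district["name"], school_year, "academic calendar"]
    # Appending city and state separates same-named districts across states.
    geography = [district.get("city"), district.get("state")]
    parts.extend(g for g in geography if g)
    return " ".join(parts)


record = {"name": "Exampleville Central School District",
          "city": "Exampleville", "state": "NY"}
query = build_query(record, "2026-2027")
```

Records with no city or state fall back to the bare query, so the change cannot make any existing query fail.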

The decision

Green: go to Phase 2, New York first

The loop works end to end across every artifact type we touched, and cost is a non-issue. New York is justified on its own evidence, independent of any national number: its district sites show clean Finalsite templating, which is leverage to build against, and a concrete aggregator competitor already ranks there and can be beaten. The two questions the spike deliberately did not answer, coverage and the dangerous-error rate, drive the ordered work of Phase 2:

1. Add directory-geography disambiguation and tighten aggregator down-ranking.
2. Build the HTML-calendar parse path. It is about half of all artifacts, the difference between roughly 40% and high coverage.
3. Hand-build the golden set of 25 to 40 calendars to get the real dangerous-error rate. This number gates the no-human-review model.
4. Run the unbiased samples: a national coverage funnel and the full New York set.
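A short calculation shows why a golden set of 25 to 40 calendars says far more than the four parsed in the spike. It uses the standard exact one-sided binomial bound for zero observed errors, 1 - alpha^(1/n), and assumes the calendars are an independent random sample, which the spike's deliberately diverse sample is not. The function name is ours.

```python
def zero_error_upper_bound(n, confidence=0.95):
    """Upper confidence bound on the true error rate after 0 errors in n trials.

    Exact one-sided binomial bound: solve (1 - p)**n = 1 - confidence for p.
    """
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


for n in (4, 25, 40):
    bound = zero_error_upper_bound(n)
    print(f"0 errors in {n:>2} calendars: true rate could still be up to {bound:.1%}")
```

With four clean calendars the true error rate could still plausibly exceed 50%. A clean golden set of 25 brings the bound to roughly 11%, and 40 brings it to roughly 7%.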

Why this is the centerpiece

A skeptical engineering leader should finish this page and conclude, without being told, that we de-risk before we build. The whole study turns on what we refused to do: we had the one number everyone wants (0 dangerous errors across the 4 parsed calendars) and we declined to dress it up as an accuracy claim. We measured, decisively, the one thing that was measurable: cost. We named what the sample could not yet prove, coverage and the error rate, as the next measurements rather than guesses. That is what "measure, do not assert" looks like when it is expensive to follow.

This is a bounded R&D spike, not a finished product. Every figure on this page traces to the spike's findings report dated 2026-06-01 or to a live run of its cost model. What the team built after the go decision is separate work, deliberately not folded into the spike's numbers.
