Caldrop: measure before you build
A bounded feasibility spike that answered one open question with real numbers before a line of product was written: can you resolve and parse every US public school district academic calendar at scale, and at what coverage, cost, and dangerous-error rate?
A product idea hinged on one question we could not honestly answer from a desk: is this even buildable at scale, and what would it cost? The tempting move is to assert plausible numbers and start coding. Instead we ran a bounded spike, produced measured figures for coverage, cost, and error, and named precisely what a small sample could not settle. The result is a go decision you can audit. This is the clearest example of the one habit that runs through all of our work: measure, do not assert.
The question
Can you programmatically resolve and parse the current academic calendar for every US public school district at scale, and at what coverage rate, cost, and dangerous-error rate? Building first to find out would have burned weeks. So we scoped a spike, not a product: just enough to turn three unknowns into measured numbers, and to be honest about which ones the sample could not yet answer.
The method
We built the real national spine, proved the full resolve-to-parse loop on a deliberately diverse sample, then modeled the cost in token math so the answer would hold regardless of the one input we could not yet pin down.
1. The spine, and the surprise inside it
Phase 0 built the district directory from the federal NCES Common Core of Data, served through a free public API. That gave a hard count to work against: 19,714 total agencies, 13,407 regular operating public districts, and 718 in New York. The first surprise arrived immediately. The directory carries no website field, so essentially every district has to be search-resolved. That is the realistic hard case, not a shortcut, and we chose to confront it.
The second surprise is where the difficulty lives. 48% of districts (6,456) enroll fewer than 1,000 students; 38% sit between 1,000 and 5,000; only 283 are above 25,000. Whatever breaks at scale will break in the small-district long tail, which is exactly where templating and clean web infrastructure thin out.
2. The resolver, and an honest funnel
The resolver issues a search query per district and ranks the returned URLs with a heuristic judge that down-ranks aggregator sites. Its cost is the one expensive input in the whole system, so we did not estimate it. We measured it: $0.0015 per district, ten districts for one and a half cents.
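A heuristic judge of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the spike's actual code: the aggregator domains, the scoring weights, the stop-word list, and the example URLs are all hypothetical, and the only real-world convention relied on is that many districts use a `k12.<state>.us` hostname.

```python
import re
from urllib.parse import urlparse

# Hypothetical aggregator domains, for illustration only.
AGGREGATOR_DOMAINS = {"calendars.example.com", "schooldates.example.net"}

# Generic words that say nothing about which district a host belongs to.
STOP_WORDS = {"central", "school", "schools", "district", "city", "union", "free", "public"}


def name_tokens(district_name):
    """Lowercase, distinctive words from the district name."""
    words = re.findall(r"[a-z]+", district_name.lower())
    return [w for w in words if w not in STOP_WORDS and len(w) > 2]


def score_url(url, district_name):
    """Higher is more likely to be the district's own site. Weights are assumed."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()
    score = 0.0
    # Down-rank known aggregators, including their subdomains.
    if any(host == d or host.endswith("." + d) for d in AGGREGATOR_DOMAINS):
        score -= 5.0
    # Reward the k12.<state>.us hostname convention.
    if re.search(r"\.k12\.[a-z]{2}\.us$", host):
        score += 3.0
    # Reward district-name words that appear in the hostname.
    compact_host = host.replace("-", "").replace(".", "")
    score += sum(1.0 for t in name_tokens(district_name) if t in compact_host)
    # Reward a calendar-looking path.
    if "calendar" in path:
        score += 1.0
    return score


def rank_urls(urls, district_name):
    """Return the search results ordered best-first."""
    return sorted(urls, key=lambda u: score_url(u, district_name), reverse=True)


candidates = [
    "https://calendars.example.com/ny/maple-hollow",
    "https://www.maplehollow.k12.ny.us/calendar",
]
best = rank_urls(candidates, "Maple Hollow Central School District")[0]
```

With these assumed weights the district-owned URL outranks the aggregator. A real judge would need weights tuned against resolved districts rather than chosen by hand.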
The funnel it produced is the most honest number in the study, because it refuses to flatter the approach.
A naive "find a PDF" resolver clears only 4 of 10. Getting to high coverage means building an HTML parse path and a second-hop step. That is the difference between roughly 40% coverage and a real product, and we named it rather than hiding it behind the cases that worked.
3. The parser, and the load-bearing surprise
We parsed four calendars spanning every artifact subtype we hit. The cheap path works: a text-layer "legend" PDF parses with pdftotext at essentially zero cost and got the first and last day right every time. The expensive path also works: an image-only scanned PDF parses correctly through vision.
The surprise sat between them and would have sunk a naive cost model. One district's PDF has a text layer (488 text operations) but the actual dates live in color-coded grid cells, so plain extraction returns the legend and not the start and end dates. "Text-layer PDF" is not one thing. That single observation is why the cost model is built around a vision fraction rather than assuming the model is rarely needed.
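One way to act on that observation is to treat the presence of a text layer as necessary but not sufficient, and route to vision whenever the extracted text does not actually carry dated first-day and last-day lines. The sketch below is an assumption about how such a router could work, not the spike's implementation: the label phrases, the date pattern, and the function names are hypothetical, and it takes already-extracted text as input rather than calling pdftotext itself.

```python
import re

# Matches "Sept 8", "June 25", "9/8", or "9/8/2026". Illustrative, not exhaustive.
DATE_RE = re.compile(
    r"\b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.?\s+\d{1,2}\b"
    r"|\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b",
    re.IGNORECASE,
)
# Assumed label phrasings for the two load-bearing events.
FIRST_DAY_RE = re.compile(r"first day|classes begin|school (?:opens|begins)", re.IGNORECASE)
LAST_DAY_RE = re.compile(r"last day|classes end|school (?:closes|ends)", re.IGNORECASE)


def has_dated_line(text, label_re):
    """True if some line carries both the label and a date."""
    return any(label_re.search(line) and DATE_RE.search(line) for line in text.splitlines())


def choose_parse_path(extracted_text):
    """Return "text" when plain extraction is enough, otherwise "vision"."""
    if not extracted_text.strip():
        return "vision"  # image-only scan: no text layer at all
    if has_dated_line(extracted_text, FIRST_DAY_RE) and has_dated_line(extracted_text, LAST_DAY_RE):
        return "text"
    # A text layer exists, but the dates live elsewhere (for example in colored grid cells).
    return "vision"


legend_pdf_text = "Sept 8 Classes Begin\nJune 25 Last Day of School"
grid_pdf_text = "Legend\nFirst Day of School\nLast Day of School\nRecess"
```

Here `choose_parse_path(legend_pdf_text)` returns `"text"`, while the grid-style text and an empty string both return `"vision"`. Requiring the label and the date on the same line is a deliberately strict simplification; real extraction output may split them across lines.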
The event schema is real
Each parsed calendar emits typed events with a per-event confidence score and source provenance. The first and last day of school are flagged as the load-bearing fields, the ones a parent always knows by heart and the ones a confirm step has to get right. One real parsed event, with credentials and internal paths removed:
{
  "district": "Cherry Valley-Springfield Central School District",
  "school_year": "2026-2027",
  "artifact_subtype": "pdf-image",
  "events": [
    {
      "title": "Classes Begin",
      "start": "2026-09-08",
      "end": null,
      "all_day": true,
      "type": "first_day",
      "applies_to": "all",
      "confidence": 1.0
    }
  ],
  "dangerous_events": ["first_day", "last_day"],
  "parse_confidence": 1.0
}
The cost model
Every figure here is reproducible token math against confirmed per-token pricing, plus the one measured input (the search cost). It runs with no network and its output matches the report to the cent. Crucially, it is shown across a range of vision fractions, so the conclusion survives the one thing we could not yet pin down.
| Component | Cost | Unit |
|---|---|---|
| Resolver (SERP + judge + 35% second-hop) | $0.0033 | per district |
| Vision parse, Haiku batch | $0.0093 | per calendar |
| Vision parse, Sonnet batch | $0.0349 | per calendar |
| Text-layer legend parse (pdftotext) | ~$0 | no model call |
The search cost is the one figure that was actually metered. Everything else is token math.
| Vision fraction | NY (718 districts) | National (13,407 districts) |
|---|---|---|
| 30% | $4.39 | $82.05 |
| 50% | $5.73 | $106.99 |
| 70% | $7.07 | $131.92 |
| 90% | $8.40 | $156.86 |
These totals assume the Haiku batch vision parse. Recurring full coverage runs about $24/yr for NY and $447/yr nationally. Lazy population means real spend tracks traffic, so these are ceilings.
The full national one-time pass costs roughly $82 to $157, and the New York pass roughly $4 to $8 (about $6 at a 50% vision fraction). Cost does not constrain this product at any plausible scale. That is the point: it removes the cheap objection and forces the decision back onto the two things that actually matter, coverage and the error rate.
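The shape of that token math can be sketched as follows. This is a reconstruction under assumptions, not the spike's cost model: it assumes one calendar per district, that every district pays the resolver cost, and that only the vision fraction pays the Haiku batch parse cost. It uses the rounded per-unit figures from the table above, so its totals land close to the published ones (within about three cents for New York and about forty cents nationally) rather than matching them to the cent.

```python
# Rounded per-unit figures taken from the cost table.
RESOLVER_PER_DISTRICT = 0.0033   # SERP + judge + 35% second-hop
VISION_HAIKU_BATCH = 0.0093      # per calendar parsed through vision
TEXT_LEGEND_PARSE = 0.0          # pdftotext path, no model call

NY_DISTRICTS = 718
NATIONAL_DISTRICTS = 13_407


def one_time_cost(districts, vision_fraction, vision_cost=VISION_HAIKU_BATCH):
    """Estimated cost of one full pass, assuming one calendar per district."""
    if not 0.0 <= vision_fraction <= 1.0:
        raise ValueError("vision_fraction must be between 0 and 1")
    per_district = (
        RESOLVER_PER_DISTRICT
        + vision_fraction * vision_cost
        + (1.0 - vision_fraction) * TEXT_LEGEND_PARSE
    )
    return districts * per_district


# Sweep the one input the spike could not pin down.
for fraction in (0.3, 0.5, 0.7, 0.9):
    ny = one_time_cost(NY_DISTRICTS, fraction)
    national = one_time_cost(NATIONAL_DISTRICTS, fraction)
    print(f"{fraction:.0%}  NY ${ny:.2f}  National ${national:.2f}")
```

Sweeping the vision fraction instead of fixing it is the design choice that matters: the conclusion holds at every row, so it does not depend on guessing the fraction correctly.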
Failure modes, observed and fixed
These are not theorized risks. Each one was hit during the ten-district run, and each has a fix. One fix was verified end to end during the spike.
| Mode | What we hit | Fix |
|---|---|---|
| Name collision (verified fix) | A NY district resolving to a same-named CA district | Append directory geography to the query |
| Aggregator interlopers | Third-party calendar sites outranking the district | Judge down-ranks non-district domains |
| HTML calendar, no file | About half the sample | Build an HTML parse path |
| Second hop | Official page links out to the real file | Fetch the page, then resolve the linked artifact |
| Draft, not yet final | Current year still under board review | Accept with low confidence, flag, fall back to prior year |
| Image-only scanned PDF | A small rural district's calendar | Vision parse |
The verified case: a New York district that first collided with a same-named district in another state was recovered by appending its directory geography to the query. The aggregator sites that already rank on these searches are a concrete, beatable competitor class, not a vague threat.
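The name-collision fix can be sketched as a query builder plus a cheap sanity check on the result. The field names (`city`, `state`), the query wording, and the example district and URLs are hypothetical; the check relies only on the common `k12.<state>.us` hostname convention and stays silent for hosts that do not follow it.

```python
import re
from urllib.parse import urlparse


def build_query(name, city=None, state=None, school_year=None):
    """Search query with directory geography appended to disambiguate same-named districts."""
    parts = [f'"{name}"']
    if city:
        parts.append(city)
    if state:
        parts.append(state)
    parts.append("academic calendar")
    if school_year:
        parts.append(school_year)
    return " ".join(parts)


def state_mismatch(url, state):
    """True only when the host is k12.<xx>.us and <xx> differs from the expected state."""
    host = (urlparse(url).hostname or "").lower()
    match = re.search(r"\.k12\.([a-z]{2})\.us$", host)
    return bool(match) and match.group(1) != state.lower()


query = build_query(
    "Maple Hollow Central School District",
    city="Maple Hollow",
    state="NY",
    school_year="2026-2027",
)
# query == '"Maple Hollow Central School District" Maple Hollow NY academic calendar 2026-2027'
```

A result such as `https://www.maplehollow.k12.ca.us/` would be flagged by `state_mismatch(url, "NY")`, which catches the cross-state collision even if the query alone does not prevent it.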
The decision
The loop works end to end across every artifact type we touched, and cost is a non-issue. New York is justified on its own evidence, independent of any national number: it shows clean Finalsite templating leverage to build against, and a concrete aggregator competitor already ranking that can be beaten. The two questions the spike deliberately did not answer, coverage and the dangerous-error rate, drive the ordered work of Phase 2:
- Add directory-geography disambiguation and tighten aggregator down-ranking.
- Build the HTML-calendar parse path. It is about half of all artifacts, the difference between roughly 40% and high coverage.
- Hand-build the golden set of 25 to 40 calendars to get the real dangerous-error rate. This number gates the no-human-review model.
- Run the unbiased samples: a national coverage funnel and the full New York set.
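Scoring against a golden set, and seeing why its size matters, can be sketched as below. This is an illustration, not the planned Phase 2 harness: the data layout (district id mapped to ISO dates for `first_day` and `last_day`) and the function names are assumptions, and the bound shown is the standard exact one-sided upper limit for a sample with zero observed errors.

```python
def dangerous_error_rate(parsed, golden, fields=("first_day", "last_day")):
    """Share of scored calendars with any load-bearing date wrong.

    Districts missing from `parsed` are skipped: they are coverage misses,
    not wrong answers, and belong to the coverage funnel instead.
    """
    scored = 0
    wrong = 0
    for district_id, truth in golden.items():
        got = parsed.get(district_id)
        if got is None:
            continue
        scored += 1
        if any(got.get(field) != truth[field] for field in fields):
            wrong += 1
    if scored == 0:
        raise ValueError("no golden-set districts were parsed")
    return wrong / scored


def zero_error_upper_bound(n, confidence=0.95):
    """Upper bound on the true error rate after n calendars with zero errors."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


# Hypothetical two-district golden set: one parsed correctly, one with a wrong last day.
golden = {
    "d1": {"first_day": "2026-09-08", "last_day": "2027-06-25"},
    "d2": {"first_day": "2026-09-02", "last_day": "2027-06-18"},
}
parsed = {
    "d1": {"first_day": "2026-09-08", "last_day": "2027-06-25"},
    "d2": {"first_day": "2026-09-02", "last_day": "2027-06-24"},
}
rate = dangerous_error_rate(parsed, golden)  # 0.5
```

Under this bound, a clean run on 4 calendars is still consistent with a true error rate of roughly 53% at 95% confidence, while a clean run on 40 calendars brings that ceiling down to roughly 7%. That gap is one way to read why the golden set has to be built before any accuracy claim is made.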
Why this is the centerpiece
A skeptical engineering leader should finish this page and conclude, without being told, that we de-risk before we build. The whole study turns on what we refused to do: we had the one number everyone wants (0 dangerous errors across the 4 calendars we parsed) and we declined to dress it up as an accuracy claim. We measured the one thing that was measurable, cost, and measured it decisively. We named what the sample could not yet prove, coverage and the error rate, as the next measurements rather than guesses. That is what "measure, do not assert" looks like when it is expensive to follow.
This is a bounded R&D spike, not a finished product. Every figure on this page traces to the spike's findings report dated 2026-06-01 or to a live run of its cost model. What the team built after the go decision is separate work, deliberately not folded into the spike's numbers.