ADR 0005: Lean core — quarantine dormant capability as experimental/extras
Ship only what runs in the default offline path; move OCR, real semantic vector, and the dual-channel verifier into clearly-labeled experimental modules/extras.
Status: accepted · Date: 2026-06-17
Immutable record. Exempt from drift tracking (no
covers). Supersede, don't edit.
Part of the 0002 product direction. Gated by 0003.
Context
Nearly every subsystem carried code that is built and tested but not wired into the
running pipeline: the OCR scanned-PDF extractor (parked in the review queue, never
invoked), the dual-channel verifier (orphaned, only its own test calls it), the vector
channel (fully wired but the only offline backend is a self-described non-semantic
lexical hash; real embeddings deferred to GPU infra), color-legend detection
(test-only), an OpenAIProvider stub, and two parallel extraction families (legacy
extract_facts vs StyledGrid). For a library people evaluate, dead-looking code that
doesn't actually run is a trust problem.
Decision
Ship a lean core containing only what runs in the default offline path: BM25 +
structured channel + agent + anti-fabrication. Move OCR, real semantic vector, and the
dual-channel verifier into clearly-labeled experimental modules / extras,
documented as "wired, needs GPU/extra, not covered by default guarantees." Keep one
live extraction path (retire or quarantine the legacy/StyledGrid duplication). Each
experimental module is promoted into the core only once it has a real, CI-tested path.
Alternatives considered (rejected)
- Wire everything in before any 1.0 (option A): give vector a real offline CPU model, add an OCR CI lane, connect the verifier. Rejected as too heavy for now (it is the eventual direction for individual modules as they earn promotion).
- Keep as-is, document honestly (option C): leave the "ship contract, defer activation" posture and just list gaps. Rejected — reads as "impressive-looking code that doesn't run."
Consequences
- The default
pip installexperience is honest and runnable end-to-end, at the cost of being non-semantic by default (consistent with the lean-default of 0009). - README "honest gaps" must clearly mark every experimental module.
- Defines the promotion criterion: a module enters core when it has a real CI-tested path and (for retrieval) backs the benchmark in 0006.
ADR 0004: Full generality — DomainProfile and arbitrary-dimension facts
Generalize fully — CompanyProfile becomes DomainProfile and the structured channel becomes a typed-fact store with arbitrary user-defined dimensions.
ADR 0006: Quality bar — invariants as property tests, plus one real retrieval benchmark
Define quality as guarantees proven by property tests, plus one real labeled retrieval benchmark; domain accuracy is punted to the user's data.