RAGSpine
Guides

Storage

The sqlite persistence layer — a numeric fact store and a narrative chunk store, both with full source lineage. The Fact dataclass, the dim_key upsert key, and deterministic found/not-found reads.

The storage domain is the sqlite persistence layer behind RAGSpine's two channels: a fact store (numeric metrics) and a chunk store (narrative). Both keep full source lineage — every row knows the document and locator it came from — and both use the same disciplined style: an explicit schema, parameterized SQL, and a read-only execute_read entry point so observability code never touches the raw connection.

The fact store lives at src/ragspine/storage/fact_store.py (contract: src/ragspine/storage/CLAUDE.md). The narrative chunk store lives in the retrieval subtree at src/ragspine/retrieval/chunking/chunk_store.py — it is the narrative half of the same conceptual storage layer, and its schema deliberately mirrors fact_store. Both back onto the same sqlite file, data/fact_metric.db, in separate tables.

Layout

fact_store.py — Fact dataclass + FactStore (table fact_metric)

The Fact dataclass

A Fact is one metric data point: dimensions + value + lineage (+ v2 style-semantic and version-lineage fields). It is a @dataclass (not frozen). The field order is part of the contract: the first ten are positional-frozen, and new fields are appended only.

Prop

Type

The first ten fields are positional-frozenmetric_code, entity, geography, channel, period_type, period, value, unit, source_doc_id, source_locator. The evaluation harness binds a 10-tuple via Fact(*row); reordering or removing any of them breaks it. New fields are additive only, appended at the end — dimensions is the last field.

dimensions is an in-memory arbitrary-dimension bag, excluded from DB columns. Its __post_init__ guard raises ValueError if any key collides with a structural / lineage / dim_key reserved name, and an empty bag derives an identity mirror ({metric, entity, channel, period}). It is never written to a column and never reconstructed into a Fact(**data).

Review status

review_status gates visibility. Default-visible reads return only VISIBLE_REVIEW_STATUSES = (REVIEW_AUTO_APPROVED, REVIEW_APPROVED); the full set is auto_approved, pending, approved, rejected, blocked.

FactStore

from ragspine.storage.fact_store import Fact, FactStore

store = FactStore("data/fact_metric.db")
store.init_schema()

store.upsert_facts([
    Fact("REVENUE", "ACME_GROUP", "ASIA", "TOTAL", "FY", "2024",
         1234.5, "USD_M", "doc-42", "sheet=5yr!C4"),
])

hits = store.query("REVENUE", "ACME_GROUP", "FY", "2024")   # [] = not found, [Fact] = found
store.close()
methodpurpose
init_schema()create fact_metric, run the v2 column migration, create both unique indexes
upsert_facts(facts, ingested_at=None) -> intbatch insert; on dim_key conflict, overwrite value + lineage; returns count written
query(metric_code, entity, period_type, period, channel="TOTAL", review_statuses=VISIBLE_REVIEW_STATUSES) -> list[Fact]exact parameterized lookup, returns 0 or 1 row
count() -> inttotal facts
execute_read(sql, params=()) -> list[sqlite3.Row]read-only SELECT entry point for ledger/metrics reuse
delete_by_source_doc(source_doc_id) -> intphysically delete a doc's facts (any review status); idempotent
close()idempotent connection close (also auto-closed on GC via weakref.finalize)

The table is fact_metric. query() issues an exact-match SELECT on (metric_code, entity, period_type, period, channel); because that combination is unique, the result is always 0 rows (not found) or 1 row (found). That determinism is what the anti-fabrication invariant relies on — no found fact means the orchestrator rewrites the answer to "not found."

dim_key — the upsert key

dim_key is the conflict key for upsert: a canonical, sorted-JSON natural key over the identity dimensions onlymetric, entity, channel, and period (period_type + period, so ('FY','2024') and ('HY','2024') differ). geography is identity=False, an overwritable non-key column, and is not in the key.

dim_key is computed from the typed columns by _compute_dim_key (it never reads the dimensions bag), is recomputed on every write and on legacy backfill, and is storage-only — never a Fact field, never reconstructed into a Fact. Keeping each identity combination to 0-or-1 row is what preserves the deterministic found/not-found read path.

Two unique indexes coexist and encode identical finance uniqueness:

  • ux_fact_dim_keyUNIQUE (dim_key), the upsert conflict target.
  • ux_fact_metricUNIQUE (metric_code, entity, period_type, period, channel), the legacy composite index, kept alongside.

On conflict, upsert_facts overwrites the overwritable columns — geography, value, unit, source_doc_id, source_locator plus all v2/provenance fields (tags, source_file_hash, extractor_version, mapping_version, confidence, review_status, valid_as_of, ingested_at) — and stamps ingested_at itself. Re-ingesting the same data therefore never grows the store; this is the backbone of idempotent ingestion.

Auto-migrating schema

init_schema() ALTER-adds any missing v2 columns and the dim_key column to a pre-existing table, then backfills dim_key for rows where it is NULL (recomputed in Python from each row's identity columns). All v2 fields default, so old Fact(...) calls are unchanged.

The chunk store

ChunkStore (in retrieval/chunking/chunk_store.py) is the narrative counterpart, modeled on fact_store — explicit narrative_chunk schema, parameterized SQL, execute_read. A StoredChunk is one chunk's content plus metadata: chunk_id, doc_id, seq, text, source_locator, para_start, para_end, title, topic, entity, geography, period, language, sensitivity (default "INTERNAL"), valid_as_of, ingested_at, version, and active.

replace_doc_chunks(doc_id, chunks, valid_as_of="") is the versioned, idempotent write: re-ingesting a document flips the old version's rows to active=0 and inserts the new chunks at version = max+1, active=1. The active set always equals the most recent ingest; old versions stay for lineage. An empty list withdraws the doc from the active set.

Retrieval pre-filters on chunk metadata (active, sensitivity, period, …) before scoring. The sensitivity column is what powers RESTRICTED isolation downstream.

Resource handling

Both stores open one sqlite3 connection with row_factory = sqlite3.Row and register a weakref.finalize so the connection closes deterministically on GC — even if a caller forgets close() — which keeps a bare sqlite connection from raising a ResourceWarning under the zero-warning gate. close() is idempotent.

Default database paths

Defaults come from common/core.py: DEFAULT_FACT_DB = data/fact_metric.db (holds both fact_metric and narrative_chunk), DEFAULT_MAPPING_DB = data/color_mapping.db, and DEFAULT_REVIEW_QUEUE_DB = data/review_queue.db.

Invariants this domain upholds

  • Provenance — every fact and chunk carries source_doc_id + a locator; lineage is never dropped.
  • Deterministic found/not-found — the unique dim_key keeps each metric identity to one row, so a missing fact is unambiguous.
  • Field-order contractFact's first ten fields are positional-frozen; additions are append-only; dimensions is never a column.
  • Idempotent writes — facts upsert on dim_key; chunks replace by version.

On this page