constellation_utils.recording_hash
Content-addressed recording_hash computation for the Constellation catalog.
See ENG-1070 and the Constellation Research Stack architecture doc §3 for the
contract this module implements: a recording’s catalog identity is the SHA-256
hex of a canonical manifest representation combined with the recording’s
start_time. The canonical form is producer-agnostic — data-engine computes
it at finalize time, and backfill tools compute the same hex from the same
manifest on disk.
Module Contents
Functions
compute_recording_hash | Return the 64-char SHA-256 hex of the canonical manifest + start_time.
API
- constellation_utils.recording_hash.compute_recording_hash(manifest: Mapping[str, Any], *, start_time: str) -> str
Return the 64-char SHA-256 hex of the canonical manifest + start_time.
The manifest must already contain a per-file sha256 for every raw segment (data-engine’s _finalize_session is responsible for this). start_time is a non-None ISO-8601 string, passed explicitly so that backfill can supply it from a source other than the manifest’s started_at field if needed. Passing None raises TypeError; callers must guard upstream.

The format of start_time is not validated. Callers must ensure it is a consistent ISO-8601 representation (including timezone offset) for the hash to be reproducible.

Canonicalization rules (locked; cross-language reimplementations must match exactly):
- Reduced view is {"start_time": <given>, "workers": [...]}.
- Each worker is {"worker_id": <str>, "files": [...]} and each file entry is exactly {"path": <str>, "sha256": <hex>}. bytes is intentionally omitted (the sha256 already encodes content length). path is consumed verbatim; the producer relativizes against its own session_dir before serializing.
- Files are sorted by path within each worker; workers are sorted by worker_id.
- Top-level keys recording_id, participant, testing_mode, started_at, ended_at, recording_hash, and any key whose name starts with _ are excluded from the canonical view. In particular, started_at is excluded so that it isn’t double-encoded: the only source of start time in the hash is the start_time argument.
- Serialized with json.dumps(reduced, sort_keys=True, separators=(",", ":"), ensure_ascii=True), then UTF-8 encoded and SHA-256 hashed. ensure_ascii=True is pinned: non-ASCII characters in file paths become \uXXXX escapes.
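The canonicalization rules above can be sketched in a few lines of Python. This is an illustrative re-implementation, not the module’s actual source. The name sketch_recording_hash is hypothetical, and the sample manifests are made up. The sketch also makes three assumptions the rules do not spell out: the input manifest stores its workers as a list of mappings under "workers", sorting uses Python’s default string ordering, and no top-level keys other than "workers" feed the reduced view.

```python
import hashlib
import json
from typing import Any, Mapping


def sketch_recording_hash(manifest: Mapping[str, Any], *, start_time: str) -> str:
    """Hypothetical sketch of the canonicalization rules; not the real implementation."""
    if start_time is None:
        raise TypeError("start_time must be a non-None ISO-8601 string")

    workers = []
    # Assumption: manifest["workers"] is a list of {"worker_id": ..., "files": [...]}.
    for worker in manifest["workers"]:
        # Keep only path and sha256; "bytes" and any other per-file keys are dropped.
        files = [{"path": f["path"], "sha256": f["sha256"]} for f in worker["files"]]
        # Assumption: plain Python string ordering (by code point).
        files.sort(key=lambda f: f["path"])
        workers.append({"worker_id": worker["worker_id"], "files": files})
    workers.sort(key=lambda w: w["worker_id"])

    # Only start_time and workers enter the reduced view, so excluded top-level
    # keys such as recording_id or started_at never reach the hash.
    reduced = {"start_time": start_time, "workers": workers}
    payload = json.dumps(
        reduced, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two made-up manifests: same files, different ordering and excluded metadata.
manifest_a = {
    "recording_id": "rec-1",
    "started_at": "2024-01-01T00:00:00+00:00",
    "workers": [
        {"worker_id": "w2", "files": [
            {"path": "raw/b.bin", "sha256": "bb" * 32, "bytes": 10},
        ]},
        {"worker_id": "w1", "files": [
            {"path": "raw/z.bin", "sha256": "cc" * 32, "bytes": 5},
            {"path": "raw/a.bin", "sha256": "aa" * 32, "bytes": 7},
        ]},
    ],
}
manifest_b = {
    "recording_id": "rec-backfilled",
    "_internal": True,
    "workers": [
        {"worker_id": "w1", "files": [
            {"path": "raw/a.bin", "sha256": "aa" * 32},
            {"path": "raw/z.bin", "sha256": "cc" * 32},
        ]},
        {"worker_id": "w2", "files": [
            {"path": "raw/b.bin", "sha256": "bb" * 32},
        ]},
    ],
}

START = "2024-01-01T00:00:00+00:00"
hash_a = sketch_recording_hash(manifest_a, start_time=START)
hash_b = sketch_recording_hash(manifest_b, start_time=START)
print(hash_a == hash_b)
```

Under these assumptions the two manifests hash identically, because ordering, bytes, and the excluded top-level keys never reach the serialized form. Changing the start_time string changes the hash, even to a different spelling of the same instant, which is why callers need one consistent ISO-8601 representation.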