This is the disk-space runbook for SourceHarbor.
Think of it as a warehouse map, not a trash guide.
The point is to answer four different questions separately:
Those are not the same question.
SourceHarbor uses three path classes:
| Class | Canonical root | What belongs there |
|---|---|---|
| Repo-side runtime | .runtime-cache/ |
short-lived repo-local runtime state |
| User-side repo-owned cache/state | SOURCE_HARBOR_CACHE_ROOT/ |
worker state, pipeline workspace, pipeline artifacts, dedicated browser root, tmp scratch, and managed project envs |
| Shared tool cache | shared system paths such as ~/Library/Caches/ms-playwright or ~/.cache/uv |
toolchain/browser caches that may be used by multiple repos |
The canonical repo-side web runtime workspace is:
.runtime-cache/tmp/web-runtime/workspace/apps/webThe canonical current-state root is:
SOURCE_HARBOR_CACHE_ROOT/Treat that root like the live warehouse shelf, not like a disposable staging box.
If it is present, it must be counted into the repo-external-repo-owned audit total.
The canonical mainline Python environment is:
${SOURCE_HARBOR_CACHE_ROOT}/project-venvIf sibling entries match project-venv* under the same cache root, treat them as duplicate repo-external envs:
Those duplicate envs are not mysterious machine junk, but they are also not automatically safe-clear.
The legacy home-level repo-owned state root and older video-digestor paths are
migration/compatibility surfaces only.
They may still exist locally, but they must be treated as:
Legacy retirement is a separate question from legacy detection.
Use this state machine:
.envEvery disk report must classify findings into exactly one of these layers:
| Layer | Meaning |
|---|---|
repo-internal |
under the repo root |
repo-external-repo-owned |
outside the repo, but clearly generated or owned by this repo |
shared-layer |
global cache/toolchain state that other projects may also use |
unverified-layer |
objects we know may exist, but could not safely measure or attribute yet |
Examples:
.runtime-cache/tmp → repo-internalrepo-external-repo-owned legacy-compatible migration input$HOME/Library/Caches/ms-playwright → shared-layerunverified-layerSecond-pass audit output keeps a dedicated repo-internal residue map under governance.repo_internal_residue.
It is intentionally descriptive, not a cleanup queue.
The fixed buckets are:
proof_scratchactive_logslocal_private_ledgerstracked_release_evidenceorphan_residueThink of these like labeled shelves in the warehouse.
The point is to stop guessing what a leftover path “probably” is.
Examples:
proof_scratch → repo-side image-audit workbenches under .runtime-cache/tmp/*image-audit*active_logs → .runtime-cache/logs/applocal_private_ledgers → authoritative .runtime-cache/evidence/ai-ledgers plus optional .agents compatibility bridgetracked_release_evidence → artifacts/releasesorphan_residue → apps/web/node_modules.broken.*Repo-external duplicate envs are a separate descriptive surface.
Think of them like extra warehouse keys that still open a side room:
Being listed here does not mean the object is safe-clear.
It only means the audit now has a fixed bucket for it.
Use these states in order:
| State | What it means in plain language |
|---|---|
exists |
the object is on disk |
ownership-confirmed |
we have enough evidence that this repo owns or depends on it |
rebuildability-confirmed |
we know how to recreate it safely |
cleanup-allowed |
policy and gates both say it may be removed now |
Large is not enough.
Rebuildable is not enough.
Outside the repo is not enough.
These are rebuildable caches that do not carry current pipeline state:
.mypy_cache__pycache__pytest and ruff index residuesrun/ files that are below the micro-state thresholdDefault mode is dry-run:
./bin/disk-space-cleanup --wave safe
Authorized apply requires explicit confirmation:
./bin/disk-space-cleanup --wave safe --apply --yes
These are larger and need extra gates:
.runtime-cache/tmp/web-runtime.runtime-cache/tmp/sourceharbor-verify-venv.runtime-cache/tmp/sourceharbor-pypi-verify-venv.runtime-cache/tmp/ws6-test-venvBefore these can be deleted, the cleanup gate must prove:
lsof does not show active usersThese are repo-owned external history candidates such as:
${SOURCE_HARBOR_CACHE_ROOT}/root-venv-backup${SOURCE_HARBOR_CACHE_ROOT}/codex-ghcr-*${SOURCE_HARBOR_CACHE_ROOT}/project-venv-*${SOURCE_HARBOR_CACHE_ROOT}/ws6-test-venv$HOME/.cache/video-digestor/closure-fix-venv$HOME/.cache/video-digestorThey are verify-first, not safe-by-name.
They must prove:
.env, fallback scripts, and systemd fallbacks do not reference themEven when all three checks currently pass, these objects stay in the verify-first bucket.
They do not become safe-clear just because the dry-run shows them as eligible.
Eligibility means “may enter an execution wave after review”, not “now safe to delete by default”.
Some repo-side tmp/ paths are temporary proof workbenches rather than generic cache.
Examples:
.runtime-cache/tmp/manual-image-audit.runtime-cache/tmp/public-image-audit.runtime-cache/tmp/audit-images.runtime-cache/tmp/audit-images-direct.runtime-cache/tmp/image-auditTreat these as:
They are proof scratch, not anonymous junk drawers.
.agents/Plans remains a compatibility bridge.
The authoritative local-private execution ledger root is:
.runtime-cache/evidence/ai-ledgersSecond-pass governance should migrate readable plan ledgers into that authoritative target while preserving the original .agents/Plans files.
That migration is a copy-forward step, not a destructive cleanup step.
These are intentionally excluded from automatic cleanup planning:
apps/web/node_modules.venv${SOURCE_HARBOR_CACHE_ROOT}/project-venv${SOURCE_HARBOR_CACHE_ROOT}/state$HOME/Library/Caches/ms-playwright$HOME/.cache/uv$HOME/.local/share/uv/pythonReason:
The legacy home-level .sourceharbor/ root is also not an automatic cleanup
target. It is a migration input root and drift signal until local environments
have been fully moved under SOURCE_HARBOR_CACHE_ROOT.
Legacy video-digestor roots are different: once current SourceHarbor state is
canonical under SOURCE_HARBOR_CACHE_ROOT and legacy retirement is clear, they
move into the external-history verify-first lane instead of staying
permanently excluded.
Audit only:
./bin/disk-space-audit
Dry-run a specific cleanup wave:
./bin/disk-space-cleanup --wave repo-tmp
Repo-side runtime maintenance is a different lane from cleanup waves.
Use it when you want to normalize and inspect .runtime-cache/** without
manually deleting directories:
./bin/runtime-cache-maintenance
./bin/runtime-cache-maintenance --apply
Repo-owned external cache/state maintenance is a sibling lane:
python3 scripts/runtime/maintain_external_cache.py --json
python3 scripts/runtime/maintain_external_cache.py --apply
Repo-scoped Docker hygiene is another sibling lane:
python3 scripts/runtime/docker_hygiene.py --json
python3 scripts/runtime/docker_hygiene.py --apply
This lane is intentionally narrow:
Local entrypoints may also trigger throttled external cache maintenance under
SOURCE_HARBOR_CACHE_ROOT. That lane is conservative by design:
project-venv/ and state/*.db are budget-only, not auto-deleteworkspace/*, artifacts/*, and tmp/* are verify-first auto-maintenance candidates~/.cache/uv and ~/Library/Caches/ms-playwright stay out of repo-owned cleanupSOURCE_HARBOR_CHROME_USER_DATA_DIR is permanent repo-owned browser state: keep it visible in audit, but never put it into TTL/cap auto-maintenanceFor operator-first review, treat Ops and disk governance as one story:
./bin/disk-space-audit --json explains what exists and who owns it./bin/disk-space-cleanup --wave repo-tmp --json explains whether the repo-side duplicate runtime is currently cleanup-eligible/ops should surface that same repo-side duplicate-runtime and duplicate-env story instead of inventing a separate wording layerRead the runtime tree with these rules:
.runtime-cache/run/* is live runtime state, not default cleanup.runtime-cache/logs/* is structured log storage with governed retention, not
generic trash.runtime-cache/reports/* is the machine-report layer; keep it as report
truth unless a cleanup wave explicitly reclassifies it.runtime-cache/evidence/* is debug and proof evidence with governed
retention, not a scratch bucket.runtime-cache/tmp/* is short-lived scratch space, but it only becomes a
cleanup candidate after the wave gate proves quiet-window, lock-clear, and
rebuildability conditionsIn plain language: do not hand-delete .runtime-cache/ just because it is
large. Use runtime-cache-maintenance for repo-side maintenance, and use
disk-space-cleanup --wave ... when you are intentionally executing the
governed cleanup plan.
JSON output for automation:
./bin/disk-space-audit --json
./bin/disk-space-cleanup --wave safe --json
The cleanup JSON keeps these totals separate:
safe_clear_bytesverify_first_bytesprotected_bytesThis is intentional.
Do not collapse them back into a single “release potential” sentence.
Validate the generated audit report shape:
./bin/disk-space-audit-check
Copy local-private ledgers into the authoritative target:
python3 scripts/governance/migrate_local_private_ledgers.py --json
Dry-run the legacy-path migration plan:
./bin/disk-space-legacy-migration --json
Authorized migration requires explicit source and target mappings:
./bin/disk-space-legacy-migration \
--apply --yes \
--mapping 'PIPELINE_ARTIFACT_ROOT=${PIPELINE_ARTIFACT_ROOT:-legacy-artifacts}::${SOURCE_HARBOR_CACHE_ROOT}/artifacts'
For the common “migrate whatever .env currently points at into the canonical
targets” case, use:
./bin/disk-space-legacy-migration --apply --yes --auto-mappings
These tools are designed to separate:
If those four states are collapsed into one sentence, the report is not trustworthy.
The real execution order is fixed:
./bin/disk-space-audit --json./bin/disk-space-legacy-migration --json./bin/disk-space-cleanup --wave safe --apply --yes./bin/disk-space-cleanup --wave repo-tmp --apply --yes./bin/disk-space-cleanup --wave external-history --apply --yes only after legacy retirement is clear