Docsiphon

Export documentation sites into AI-ready local Markdown corpora with preserved paths, reproducible runs, and audit artifacts.

Docsiphon is for people who need something more trustworthy than “mirror the site and hope for the best.”

Current front-door rule: the public first-success path is still the uvx quickstart below. If Docsiphon grows an MCP-aware surface later, that path will remain a secondary surface until it ships its own install contract, verification gate, public packet, and lane truth.

Fastest Path

  1. Install uv.
  2. Run Docsiphon directly from GitHub:
uvx --from git+https://github.com/xiaojiou176-open/docsiphon.git \
  docsiphon "https://developerdocs.instructure.com/services/canvas" \
  --scope-prefix /services/canvas \
  --max-pages 6 \
  --out ./_outputs \
  --site-root auto

What You Get

A minimal first success should give you three kinds of output, not a pile of untraceable crawl fragments:

Proof Ladder

Use the front door like a short evidence ladder, not like a maze:

Step Open this next What it proves
1 the uvx command above Docsiphon is a real CLI-first tool you can run immediately, not a concept page
2 manifest.jsonl, report.json, toc.md, report.html in the export output the run left behind reviewable artifacts instead of a pile of unlabeled crawl residue
3 the repo map and latest release shelf below current repository truth and newest published artifacts are related, but they are not the same thing

Why It Exists

Best Fit

Why Not Generic Crawlers

Because the hard part is not “did we fetch the page?” The hard part is whether the result is still understandable, auditable, and reusable later.

A generic crawler is closer to dumping the whole filing cabinet onto the floor. Docsiphon is trying to put the papers back into labeled folders you can use again.

Release Shelf Truth

The latest release entrypoint is the right place to look for the newest published wheel, sdist, and starter-profile assets.

This Pages surface and the repo front door describe the newest repository truth on main, which can move ahead before the next tagged release is cut.

Treat “latest release” and “current main” as neighboring shelves, not as the same object.

Open The Right Next Door

After the first uvx run succeeds, the narrowest honest routing is:

If you want to… Open this Why this is the right next door
understand where the repo keeps its moving parts Pages repo map it is the structural map, not a second marketing page
grab the newest published wheel, sdist, and starter-profile assets latest release that is the published shelf, separate from repository-side wording on main
inspect packet and lane truth after the product story is already clear distribution packet ledger distribution status is a second-ring proof surface, not the first success path
see roadmap and discussion once the basics are proven Pages roadmap and GitHub Discussions planning and community belong after the first run, not before it

Start Here