feat(docs): make docs.rapidata.ai discoverable to AI agents #637

Merged
LinoGiger merged 1 commit into main from feat(docs)/agent-readiness-root-files on Jun 26, 2026

@RapidPoseidon

Contributor

Why

ora's Agent Readiness scan gave docs.rapidata.ai 15/100 (F) — Discovery 1/22, Identity 2/22, Access 8/34. The agent review said it "couldn't locate any public API endpoints, documentation, or a developer portal."

Root cause is discoverability, not missing content. The docs, examples and auth instructions all exist — but nothing is reachable where a crawler/agent looks:

  • docs.rapidata.ai/robots.txt → 404
  • docs.rapidata.ai/llms.txt → 404 (the build does generate one, but mike buries it at /latest/llms.txt)
  • docs.rapidata.ai/ → a contentless JS redirect to /latest/ (no title, description, links or structured data) — which is why Identity/Discovery score near zero.

What

Adds a site_root/ directory with the files that must live at the site root, plus a post-mike step in deploy_doc.yml that copies them onto the gh-pages root (mike only manages per-version subdirs, so root files are otherwise lost):

  • robots.txt — allows all crawlers, lists AI crawlers (GPTBot, ClaudeBot, PerplexityBot, …) explicitly, points to the sitemap.
  • llms.txt — curated llms.txt index: what Rapidata is, how to install/authenticate the SDK, and links to every guide/example/the API reference.
  • index.html — root landing page with a real <h1>, description, link list and JSON-LD (Organization + WebSite + SoftwareApplication) for crawlers, keeping the JS redirect to /latest/ for humans.
  • llms-full.txt — the per-version generated file is mirrored to the root by the workflow.
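
The copy step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the workflow's actual script: the function name publish_root_files and its two directory arguments are hypothetical, and the commit and push that would follow are left out.

```shell
# Hypothetical helper: copy the curated root files from a source directory
# (for example site_root/) into a checked-out gh-pages tree, then mirror the
# per-version llms-full.txt to the root when the build produced one.
publish_root_files() {
  prf_src="$1"
  prf_dest="$2"
  for prf_file in robots.txt llms.txt index.html; do
    cp "$prf_src/$prf_file" "$prf_dest/$prf_file"
  done
  # llms-full.txt is generated per version, so it may legitimately be absent.
  if [ -f "$prf_dest/latest/llms-full.txt" ]; then
    cp "$prf_dest/latest/llms-full.txt" "$prf_dest/llms-full.txt"
  fi
}
```

A call such as `publish_root_files site_root ghpages` would leave the four files at the top of the gh-pages checkout, ready to be staged.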

Scope / deliberately not done

  • No raw OpenAPI/Swagger dump. The internal multi-service spec carries an internal server host; the supported public surface is the SDK, so llms.txt points agents there instead of leaking internal endpoints.
  • ora's "no publicly reachable (unauthenticated) API" is by design — Rapidata's API requires a token. llms.txt and the landing page now make the token/auth path discoverable.

Effect & verification

Takes effect on the next Deploy Documentation run (manual workflow_dispatch). After deploy, /robots.txt, /llms.txt and /llms-full.txt resolve at the root and / serves crawlable content. YAML and JSON-LD validated locally. All linked /latest/… URLs verified to return 200.
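
A post-deploy check along these lines could confirm the root paths resolve. It is only a sketch: the helper names http_status and check_root_paths are made up for illustration, and it assumes curl is available on the machine running it.

```shell
# Print the HTTP status code for one URL (assumes curl is installed).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Check each path under a base URL. Prints one line per path and returns
# non-zero if any path answers with something other than 200.
check_root_paths() {
  crp_base="$1"
  shift
  crp_failed=0
  for crp_path in "$@"; do
    crp_code=$(http_status "$crp_base$crp_path") || true
    if [ "$crp_code" = "200" ]; then
      echo "ok   $crp_path"
    else
      echo "FAIL $crp_path ($crp_code)"
      crp_failed=1
    fi
  done
  return "$crp_failed"
}

# Example, to be run once the deploy has finished:
# check_root_paths https://docs.rapidata.ai /robots.txt /llms.txt /llms-full.txt /
```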

🔗 Session: https://session-b4f9bfe9.poseidon.rapidata.internal/

The ora Agent Readiness scan scored docs.rapidata.ai 15/100 because the site
exposes nothing at its root for crawlers/agents: /robots.txt and /llms.txt both
404, and / is a contentless JS redirect to /latest/. mike publishes everything
under per-version subdirectories, so the llms.txt the build already generates is
buried at /latest/llms.txt where agents never look.

Publish curated root files (robots.txt, llms.txt, a content+JSON-LD landing
page) and copy the generated llms-full.txt to the site root via a post-mike step
in the deploy workflow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: lino <lino@rapidata.ai>
@claude

claude Bot commented Jun 26, 2026

Code Review

This PR addresses a real discoverability gap: robots.txt, llms.txt, and a crawlable landing page are the right solution to the 404s at the site root. The site_root/ + workflow-copy approach is architecturally sound given mike's version-subdirectory model. There are two workflow bugs that will cause every deployment to fail, plus a few content issues worth fixing before merge.


🔴 Bug — git push always rejected as non-fast-forward (.github/workflows/deploy_doc.yml, line 69)

git fetch origin gh-pages        # ← only updates FETCH_HEAD / origin/gh-pages tracking ref
git worktree add ghpages gh-pages # ← checks out the LOCAL gh-pages branch

By the time this step runs, the previous step (mike deploy --push, mike set-default --push) has already pushed new commits to origin/gh-pages. The git fetch origin gh-pages here (no colon refspec) does not update the local gh-pages branch; it only updates FETCH_HEAD and the origin/gh-pages remote-tracking ref. So the worktree is checked out from the pre-mike, stale local branch. The subsequent git push origin gh-pages then tries to push a commit based on that stale ancestor, which the remote will reject as non-fast-forward.

Fix: use the colon form to update the local branch:

git fetch origin gh-pages:gh-pages
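
The difference between the two fetch forms can be reproduced in a throwaway repository. The sketch below assumes a local gh-pages branch that is behind the remote; whether that is the state the real workflow is in depends on how the earlier step left the local branch. All paths, commit messages, and the demo identity are invented for the demonstration.

```shell
set -eu
# Throwaway identity so commits work on a machine without git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

demo=$(mktemp -d)

# Seed a bare "origin" that has both main and gh-pages.
git init -q "$demo/seed"
git -C "$demo/seed" symbolic-ref HEAD refs/heads/main
git -C "$demo/seed" commit -q --allow-empty -m "base"
git -C "$demo/seed" branch gh-pages
git clone -q --bare "$demo/seed" "$demo/origin.git"

# The "CI" clone: main is checked out, plus a local gh-pages branch.
git clone -q "$demo/origin.git" "$demo/ci"
git -C "$demo/ci" branch -q gh-pages origin/gh-pages

# A second clone stands in for whatever advanced gh-pages on the remote.
git clone -q "$demo/origin.git" "$demo/other"
git -C "$demo/other" checkout -q gh-pages
git -C "$demo/other" commit -q --allow-empty -m "new deploy"
git -C "$demo/other" push -q origin gh-pages

remote_head=$(git -C "$demo/origin.git" rev-parse gh-pages)

# Plain fetch: the local gh-pages branch does not move.
git -C "$demo/ci" fetch -q origin gh-pages
after_plain=$(git -C "$demo/ci" rev-parse gh-pages)

# Colon refspec: the local gh-pages branch is fast-forwarded.
git -C "$demo/ci" fetch -q origin gh-pages:gh-pages
after_colon=$(git -C "$demo/ci" rev-parse gh-pages)

if [ "$after_plain" != "$remote_head" ]; then
  echo "plain fetch: local gh-pages is still stale"
fi
if [ "$after_colon" = "$remote_head" ]; then
  echo "colon fetch: local gh-pages matches the remote"
fi
```

Note that the colon form refuses to update a branch that is currently checked out, so it only works here because main, not gh-pages, is the checked-out branch in the clone.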

🔴 Bug — git add llms-full.txt aborts step when file was never copied (.github/workflows/deploy_doc.yml, line 78)

if [ -f ghpages/latest/llms-full.txt ]; then
  cp ghpages/latest/llms-full.txt ghpages/llms-full.txt
fi
cd ghpages
git add robots.txt llms.txt index.html llms-full.txt   # ← unconditional

When ghpages/latest/llms-full.txt doesn't exist (e.g. first-ever deploy, or any build where MkDocs didn't produce it), llms-full.txt is never copied into ghpages/. Under set -euo pipefail, git add of a non-existent path exits non-zero (fatal: pathspec 'llms-full.txt' did not match any files), aborting the entire step before robots.txt, llms.txt, or index.html are committed.

Fix:

git add robots.txt llms.txt index.html
[ -f llms-full.txt ] && git add llms-full.txt || true
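
An alternative to the one-liner is to stage files one at a time and skip those that are absent. The helper name stage_existing below is hypothetical, and the sketch assumes it runs inside the gh-pages worktree.

```shell
# Hypothetical helper: stage only the files that exist in the current
# directory, so a missing optional file cannot abort the whole step.
stage_existing() {
  for se_file in "$@"; do
    if [ -f "$se_file" ]; then
      git add -- "$se_file"
    else
      echo "skipping $se_file (not present)" >&2
    fi
  done
}

# Example, from inside the worktree:
# stage_existing robots.txt llms.txt index.html llms-full.txt
```

Unlike the `&& ... || true` form, this does not swallow a real git add error for a file that is present: under set -e such a failure still stops the step.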

🟡 Reliability — no worktree cleanup guard (.github/workflows/deploy_doc.yml, line 70)

If the step fails mid-run (e.g. from the non-fast-forward bug above), the ghpages/ directory is left behind. On the next run git worktree add ghpages gh-pages fails immediately with fatal: 'ghpages' already exists, and there's no cleanup at the end of a successful run either. On ephemeral GitHub-hosted runners this is usually harmless, but it makes re-runs painful.

Fix: add a guard and a trap:

git worktree remove --force ghpages 2>/dev/null || true
git worktree add ghpages gh-pages
trap 'git worktree remove --force ghpages 2>/dev/null || true' EXIT
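
The guard-plus-trap pattern can be tried out in a scratch repository. In this sketch the repository, the demo identity, and the simulated failure are all invented; it assumes a git version that provides git worktree remove (2.17 or later).

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Scratch repository with a gh-pages branch that is not checked out.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" symbolic-ref HEAD refs/heads/main
git -C "$repo" commit -q --allow-empty -m "base"
git -C "$repo" branch gh-pages

step_status=0
# Run the guarded step in a subshell so its EXIT trap fires when the
# subshell ends, even though the step fails partway through.
(
  cd "$repo"
  git worktree remove --force ghpages 2>/dev/null || true
  git worktree add ghpages gh-pages >/dev/null 2>&1
  trap 'git worktree remove --force ghpages 2>/dev/null || true' EXIT
  exit 1   # simulated mid-step failure
) || step_status=$?

if [ -d "$repo/ghpages" ]; then
  echo "worktree left behind (step exited with $step_status)"
else
  echo "worktree cleaned up (step exited with $step_status)"
fi
```

The explicit exit 1 stands in for any failing command; it is used instead of relying on set -e because errexit is ignored inside a subshell that sits on the left-hand side of `||`.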

🟡 SEO — canonical points to /latest/ instead of the root URL (site_root/index.html, line 8)

<link rel="canonical" href="https://docs.rapidata.ai/latest/">

The canonical tag on a page should point to the preferred URL for that page. Pointing the root landing page's canonical to /latest/ tells search engines "this page's content lives at /latest/", which causes Google to consolidate ranking signals onto /latest/ and potentially de-index https://docs.rapidata.ai/ entirely. This also conflicts with the og:url on line 14, which correctly declares https://docs.rapidata.ai/.

Fix:

<link rel="canonical" href="https://docs.rapidata.ai/">

🟡 SEO — Sitemap URL in robots.txt will 404 (site_root/robots.txt, line 38)

Sitemap: https://docs.rapidata.ai/sitemap.xml

The workflow copies robots.txt, llms.txt, index.html, and llms-full.txt to the gh-pages root, but not sitemap.xml. mike places the MkDocs-generated sitemap under /latest/sitemap.xml. The declared URL will return 404, and search engines will log a sitemap error.

Fix: either point to the versioned path:

Sitemap: https://docs.rapidata.ai/latest/sitemap.xml

Or mirror it in the workflow alongside llms-full.txt:

if [ -f ghpages/latest/sitemap.xml ]; then
  cp ghpages/latest/sitemap.xml ghpages/sitemap.xml
fi

🔵 Minor — price: "0" in JSON-LD schema for a metered service (site_root/index.html, line 51)

"offers": { "@type": "Offer", "price": "0", "priceCurrency": "USD" }

Rapidata is a paid, usage-metered service; only the SDK package itself is free to install. price: 0 implies the service is entirely free, which may produce a misleading "Free" badge in Google rich results. Consider removing the offers block or replacing it with a description that reflects the pricing model.

@LinoGiger LinoGiger merged commit c521113 into main Jun 26, 2026
2 checks passed
@LinoGiger LinoGiger deleted the feat(docs)/agent-readiness-root-files branch June 26, 2026 13:23