feat(docs): make docs.rapidata.ai discoverable to AI agents by RapidPoseidon · Pull Request #637 · RapidataAI/rapidata-python-sdk

RapidPoseidon · 2026-06-26T13:14:02Z

Why

ora's Agent Readiness scan gave docs.rapidata.ai 15/100 (F) — Discovery 1/22, Identity 2/22, Access 8/34. The agent review said it "couldn't locate any public API endpoints, documentation, or a developer portal."

Root cause is discoverability, not missing content. The docs, examples and auth instructions all exist — but nothing is reachable where a crawler/agent looks:

docs.rapidata.ai/robots.txt → 404
docs.rapidata.ai/llms.txt → 404 (the build does generate one, but mike buries it at /latest/llms.txt)
docs.rapidata.ai/ → a contentless JS redirect to /latest/ (no title, description, links or structured data) — which is why Identity/Discovery score near zero.

What

Adds a site_root/ directory with the files that must live at the site root, plus a post-mike step in deploy_doc.yml that copies them onto the gh-pages root (mike only manages per-version subdirs, so root files are otherwise lost):

robots.txt — allows all crawlers, lists AI crawlers (GPTBot, ClaudeBot, PerplexityBot, …) explicitly, points to the sitemap.
llms.txt — curated llms.txt index: what Rapidata is, how to install/authenticate the SDK, and links to every guide/example/the API reference.
index.html — root landing page with a real <h1>, description, link list and JSON-LD (Organization + WebSite + SoftwareApplication) for crawlers, keeping the JS redirect to /latest/ for humans.
llms-full.txt — the per-version generated file is mirrored to the root by the workflow.
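
As a rough illustration of the copy step described above, the sketch below runs against throwaway directories instead of the real repository. The helper name `publish_root_files` and the temporary layout are assumptions made for the example; they are not taken from deploy_doc.yml.

```shell
set -eu

# Throwaway layout standing in for the repo checkout and the gh-pages worktree.
work=$(mktemp -d)
mkdir -p "$work/site_root" "$work/ghpages/latest"
for f in robots.txt llms.txt index.html; do
  echo "curated $f" > "$work/site_root/$f"
done
echo "generated by the docs build" > "$work/ghpages/latest/llms-full.txt"

# Hypothetical helper: copy the curated files to the root, then mirror the
# generated llms-full.txt only if the build actually produced it.
publish_root_files() (
  src=$1
  dest=$2
  cp "$src/robots.txt" "$src/llms.txt" "$src/index.html" "$dest/"
  if [ -f "$dest/latest/llms-full.txt" ]; then
    cp "$dest/latest/llms-full.txt" "$dest/llms-full.txt"
  fi
)

publish_root_files "$work/site_root" "$work/ghpages"
ls "$work/ghpages"
```

Keeping the mirror conditional means a build that skips llms-full.txt still publishes the three curated files.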

Scope / deliberately not done

No raw OpenAPI/Swagger dump. The internal multi-service spec carries an internal server host; the supported public surface is the SDK, so llms.txt points agents there instead of leaking internal endpoints.
ora's "no publicly reachable (unauthenticated) API" is by design — Rapidata's API requires a token. llms.txt and the landing page now make the token/auth path discoverable.

Effect & verification

Takes effect on the next Deploy Documentation run (manual workflow_dispatch). After deploy, /robots.txt, /llms.txt and /llms-full.txt resolve at the root and / serves crawlable content. YAML and JSON-LD validated locally. All linked /latest/… URLs verified to return 200.

🔗 Session: https://session-b4f9bfe9.poseidon.rapidata.internal/

The ora Agent Readiness scan scored docs.rapidata.ai 15/100 because the site exposes nothing at its root for crawlers/agents: /robots.txt and /llms.txt both 404, and / is a contentless JS redirect to /latest/. mike publishes everything under per-version subdirectories, so the llms.txt the build already generates is buried at /latest/llms.txt where agents never look.

Publish curated root files (robots.txt, llms.txt, a content+JSON-LD landing page) and copy the generated llms-full.txt to the site root via a post-mike step in the deploy workflow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: lino <lino@rapidata.ai>

claude · 2026-06-26T13:19:43Z

Code Review

This PR addresses a real discoverability gap: robots.txt, llms.txt, and a crawlable landing page are the right solution to the 404s at the site root. The site_root/ + workflow-copy approach is architecturally sound given mike's version-subdirectory model. There are two workflow bugs that will cause every deployment to fail, plus a few content issues worth fixing before merge.

🔴 Bug — `git push` always rejected as non-fast-forward (`.github/workflows/deploy_doc.yml`, line 69)

git fetch origin gh-pages        # ← only updates FETCH_HEAD / origin/gh-pages tracking ref
git worktree add ghpages gh-pages # ← checks out the LOCAL gh-pages branch

By the time this step runs, the previous step (mike deploy --push, mike set-default --push) has already pushed new commits to origin/gh-pages. The git fetch origin gh-pages here (no colon refspec) does not update the local gh-pages branch; it only updates FETCH_HEAD and the origin/gh-pages remote-tracking ref. So the worktree is checked out from the pre-mike, stale local branch. The subsequent git push origin gh-pages then tries to push a commit based on that stale ancestor, which the remote will reject as non-fast-forward.

Fix: use the colon form to update the local branch:

git fetch origin gh-pages:gh-pages
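
The difference between the two fetch forms can be reproduced locally. The sketch below uses throwaway repositories; the directory names (`ci`, `other`) and the demo identity are made up, and it assumes the local gh-pages branch has fallen behind the remote, which is the situation described above.

```shell
set -eu

# Throwaway repositories; all paths and identities below are made up for the demo.
tmp=$(mktemp -d)
git init -q --bare "$tmp/origin.git"

# "ci" plays the workflow checkout: it stays on its default branch and owns a local gh-pages.
git init -q "$tmp/ci"
cd "$tmp/ci"
git config user.email demo@example.com
git config user.name demo
git commit -q --allow-empty -m "initial"
git branch gh-pages
git remote add origin "$tmp/origin.git"
git push -q origin gh-pages

# A second clone pushes a newer commit, so the remote branch moves ahead of ci's local branch.
git clone -q --branch gh-pages "$tmp/origin.git" "$tmp/other"
git -C "$tmp/other" -c user.email=demo@example.com -c user.name=demo \
  commit -q --allow-empty -m "newer deploy"
git -C "$tmp/other" push -q origin gh-pages

# Plain fetch: FETCH_HEAD moves, the local branch does not.
git fetch -q origin gh-pages
remote_tip=$(git rev-parse FETCH_HEAD)
local_after_plain=$(git rev-parse gh-pages)

# Colon refspec: the local branch is fast-forwarded to the remote tip.
git fetch -q origin gh-pages:gh-pages
local_after_refspec=$(git rev-parse gh-pages)

if [ "$local_after_plain" != "$remote_tip" ]; then
  echo "plain fetch left local gh-pages behind the remote"
fi
if [ "$local_after_refspec" = "$remote_tip" ]; then
  echo "refspec fetch moved local gh-pages to the remote tip"
fi
```

Note that the colon form only works while gh-pages is not the branch currently checked out, which holds here because the checkout stays on its default branch.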

🔴 Bug — `git add llms-full.txt` aborts step when file was never copied (`.github/workflows/deploy_doc.yml`, line 78)

if [ -f ghpages/latest/llms-full.txt ]; then
  cp ghpages/latest/llms-full.txt ghpages/llms-full.txt
fi
cd ghpages
git add robots.txt llms.txt index.html llms-full.txt   # ← unconditional

When ghpages/latest/llms-full.txt doesn't exist (e.g. first-ever deploy, or any build where MkDocs didn't produce it), llms-full.txt is never copied into ghpages/. Under set -euo pipefail, git add of a non-existent path exits non-zero (fatal: pathspec 'llms-full.txt' did not match any files), aborting the entire step before robots.txt, llms.txt, or index.html are committed.

Fix:

git add robots.txt llms.txt index.html
[ -f llms-full.txt ] && git add llms-full.txt || true
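
The failure mode and a guarded alternative can be checked in a scratch repository. This is a minimal sketch, assuming llms-full.txt was never produced. It uses an if block instead of the `&& … || true` chain so that a genuine git add failure would still abort the step under set -e.

```shell
set -eu

# Throwaway repository where llms-full.txt was never produced.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
touch robots.txt llms.txt index.html

# Unconditional form: one missing pathspec makes the whole command fail.
if git add robots.txt llms.txt index.html llms-full.txt 2>/dev/null; then
  unconditional=ok
else
  unconditional=failed
fi

# Guarded form: always add the curated files, add the mirror only when it exists.
git add robots.txt llms.txt index.html
if [ -f llms-full.txt ]; then
  git add llms-full.txt
fi
staged=$(git ls-files | paste -sd ' ' -)

echo "unconditional add: $unconditional"
echo "staged by guarded add: $staged"
```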

🟡 Reliability — no worktree cleanup guard (`.github/workflows/deploy_doc.yml`, line 70)

If the step fails mid-run (e.g. from the non-fast-forward bug above), the ghpages/ directory is left behind. On the next run git worktree add ghpages gh-pages fails immediately with fatal: 'ghpages' already exists, and there's no cleanup at the end of a successful run either. On ephemeral GitHub-hosted runners this is usually harmless, but it makes re-runs painful.

Fix: add a guard and a trap:

git worktree remove --force ghpages 2>/dev/null || true
git worktree add ghpages gh-pages
trap 'git worktree remove --force ghpages 2>/dev/null || true' EXIT
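
The guard and trap can be exercised in a scratch repository. In the sketch below, `deploy_step` is a hypothetical stand-in for the workflow step; its body runs in a subshell so the EXIT trap fires when the step finishes, which approximates what happens when a workflow step's shell exits.

```shell
set -eu

tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git config user.email demo@example.com
git config user.name demo
git commit -q --allow-empty -m "initial"
git branch gh-pages

# Simulate a previous run that died before cleaning up.
git worktree add ghpages gh-pages >/dev/null 2>&1

# Without a guard, the next attempt fails because the path is already taken.
if git worktree add ghpages gh-pages >/dev/null 2>&1; then
  unguarded=ok
else
  unguarded=failed
fi

# Hypothetical step body. The parentheses run it in a subshell, so the EXIT
# trap fires as soon as the step finishes, whether it succeeded or not.
deploy_step() (
  git worktree remove --force ghpages 2>/dev/null || true
  git worktree add ghpages gh-pages >/dev/null 2>&1
  trap 'git worktree remove --force ghpages 2>/dev/null || true' EXIT
  # Copying and committing the root files would happen here.
  [ -d ghpages ]
)

if deploy_step; then guarded=ok; else guarded=failed; fi
if [ -d ghpages ]; then leftover=yes; else leftover=no; fi
echo "unguarded re-run: $unguarded, guarded run: $guarded, leftover worktree: $leftover"
```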

🟡 SEO — `canonical` points to `/latest/` instead of the root URL (`site_root/index.html`, line 8)

<link rel="canonical" href="https://docs.rapidata.ai/latest/">

The canonical tag on a page should point to the preferred URL for that page. Pointing the root landing page's canonical to /latest/ tells search engines "this page's content lives at /latest/", which can lead Google to consolidate ranking signals onto /latest/ and potentially drop https://docs.rapidata.ai/ from its index entirely. This also conflicts with the og:url on line 14, which correctly declares https://docs.rapidata.ai/.

Fix:

<link rel="canonical" href="https://docs.rapidata.ai/">
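
A small consistency check could catch this kind of drift before deploy. The sketch below compares the canonical URL with og:url in a sample fragment; the exact `<meta property="og:url" …>` markup is an assumed form, not copied from site_root/index.html, and the sed patterns only handle that attribute order.

```shell
set -eu

# Sample head fragment; the og:url markup is an assumed form for illustration.
page=$(mktemp)
cat > "$page" <<'EOF'
<link rel="canonical" href="https://docs.rapidata.ai/">
<meta property="og:url" content="https://docs.rapidata.ai/">
EOF

# Extract the two URLs with plain sed (attribute order as in the sample above).
canonical=$(sed -n 's/.*<link rel="canonical" href="\([^"]*\)".*/\1/p' "$page")
og_url=$(sed -n 's/.*<meta property="og:url" content="\([^"]*\)".*/\1/p' "$page")

if [ "$canonical" = "$og_url" ]; then
  echo "canonical and og:url agree: $canonical"
else
  echo "mismatch: canonical=$canonical og:url=$og_url" >&2
  exit 1
fi
```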

🟡 SEO — Sitemap URL in `robots.txt` will 404 (`site_root/robots.txt`, line 38)

Sitemap: https://docs.rapidata.ai/sitemap.xml

The workflow copies robots.txt, llms.txt, index.html, and llms-full.txt to the gh-pages root, but not sitemap.xml. mike places the MkDocs-generated sitemap under /latest/sitemap.xml. The declared URL will return 404, and search engines will log a sitemap error.

Fix: either point to the versioned path:

Sitemap: https://docs.rapidata.ai/latest/sitemap.xml

Or mirror it in the workflow alongside llms-full.txt:

if [ -f ghpages/latest/sitemap.xml ]; then
  cp ghpages/latest/sitemap.xml ghpages/sitemap.xml
fi
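
Either option can be verified before pushing by checking that every Sitemap URL declared in robots.txt has a file behind it in the worktree. The following is a sketch under assumptions: `missing_sitemaps` is a hypothetical helper, and the temporary directory stands in for the gh-pages worktree.

```shell
set -eu

site="https://docs.rapidata.ai"
root=$(mktemp -d)   # stands in for the gh-pages worktree
mkdir -p "$root/latest"
printf 'User-agent: *\nAllow: /\nSitemap: %s/sitemap.xml\n' "$site" > "$root/robots.txt"
echo "<urlset/>" > "$root/latest/sitemap.xml"

# Hypothetical helper: print every Sitemap path declared in robots.txt
# ($1 = worktree root, $2 = site origin) that has no file behind it.
missing_sitemaps() (
  sed -n 's/^[Ss]itemap:[[:space:]]*//p' "$1/robots.txt" | while IFS= read -r url; do
    path=${url#"$2"}
    [ -f "$1$path" ] || printf '%s\n' "$path"
  done
)

before=$(missing_sitemaps "$root" "$site")

# Apply the mirror option, then check again.
cp "$root/latest/sitemap.xml" "$root/sitemap.xml"
after=$(missing_sitemaps "$root" "$site")

echo "missing before mirror: ${before:-none}"
echo "missing after mirror: ${after:-none}"
```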

🔵 Minor — `price: "0"` in JSON-LD schema for a metered service (`site_root/index.html`, line 51)

"offers": { "@type": "Offer", "price": "0", "priceCurrency": "USD" }

Rapidata is a paid, usage-metered service; only the SDK package itself is free to install. price: 0 implies the service is entirely free, which may produce a misleading "Free" badge in Google rich results. Consider removing the offers block or replacing it with a description that reflects the pricing model.

LinoGiger merged commit c521113 into main Jun 26, 2026
2 checks passed

LinoGiger deleted the feat(docs)/agent-readiness-root-files branch June 26, 2026 13:23

This was referenced Jun 26, 2026

feat(docs): publish OpenAPI spec, auth docs and developer portal #638

Merged

docs(agent-readiness): add when-to-use guidance and step-by-step agent auth #639

Merged
