
feat(cli): support openkb add <URL> with content-type sniffing #55

Open

KylinMountain wants to merge 3 commits into main from feat/add-url

Conversation

@KylinMountain (Collaborator)

Summary

openkb add now accepts http(s) URLs alongside local paths. The fetcher saves the remote document into raw/, then hands the path to the existing add_single_file pipeline — so PageIndex / markitdown / hash dedup / openkb remove all work unchanged.

openkb add https://arxiv.org/abs/2509.11420       # → trafilatura, 3 KB abstract+meta
openkb add https://arxiv.org/pdf/2509.11420       # → urllib, 2.1 MB → PageIndex long-doc
openkb add https://claude.com/blog/the-founders-playbook
openkb add https://cdn.example.com/foo.pdf

Architecture

URL fetching is a peer of converter.py and indexer.py, not bolted onto cli.py:

openkb/
├── cli.py            ← entry / routing (4-line intercept in `add`)
├── converter.py      ← local-file markitdown conversion
├── indexer.py        ← PageIndex long-doc indexing
├── images.py         ← image extraction
├── url_ingest.py     ← (NEW) http(s) URL → raw/ acquisition
└── ...

Design decisions

  1. Content-Type drives routing, not URL extension. arxiv.org/pdf/2509.11420 has no .pdf in the URL; cdn.example.com/foo.pdf has one. Both are PDFs and route the same way — by Content-Type: application/pdf.

  2. Magic-byte sniff overrides Content-Type when they disagree. Some CDNs mislabel PDFs as application/octet-stream; some servers serve an HTML interstitial with application/pdf. Reading the first 512 bytes and checking for %PDF- or < wins over the header (see the sketch after this list).

  3. HTML goes through trafilatura, not raw save. Live-tested claude.com/blog/...: markitdown on raw HTML yields 25 KB where half is nav/footer/cookie chrome. trafilatura extracts 1.5 KB of clean main content — much better signal for the LLM compiler.

  4. No site-specific code. No if arxiv: rewrite abs→pdf. The user picks depth: abs/ URL → abstract + metadata, pdf/ URL → full paper. The arxiv server's own Content-Disposition gives the version-suffixed filename for free.

  5. Filename sanitization preserves identifier dots. 2509.11420 is one identifier, not stem.ext — only strip an extension when it matches the target extension we're adding.

  6. Chunked PDF write. 50+ MB papers don't sit in RAM; streamed at 64 KB per chunk.

  7. Graceful failure with stderr messages, never crash. 404 / DNS / extraction-empty / unsupported type all return None and exit 0 after a clear message.
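
A minimal sketch of decisions 1 and 2, assuming a hypothetical sniff_kind helper (the real url_ingest.py may structure this differently):

import urllib.request

PDF_MAGIC = b"%PDF-"

def sniff_kind(url: str) -> str:
    """Classify a response as 'pdf', 'html', or 'unsupported'."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        ctype = (resp.headers.get("Content-Type") or "").split(";")[0].strip().lower()
        head = resp.read(512)
    # Magic bytes win over the header when the two disagree.
    if head.startswith(PDF_MAGIC):
        return "pdf"
    if head.lstrip()[:1] == b"<":   # HTML interstitial mislabeled as a PDF
        return "html"
    # Otherwise trust the declared Content-Type.
    if ctype == "application/pdf":
        return "pdf"
    if ctype in ("text/html", "application/xhtml+xml"):
        return "html"
    return "unsupported"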

Files

  • openkb/url_ingest.py (NEW, ~190 LOC)
  • openkb/cli.py (+4-line intercept at top of add)
  • pyproject.toml (+ trafilatura>=2.0)
  • tests/test_url_ingest.py (NEW, 31 tests)
  • README.md (one-line Quick Start + Commands table update)

Test plan

  • 31 new unit tests covering: URL detection, content-type sniffing (4 magic-byte override cases), filename sanitization (arxiv ID preservation, spaces, parens, length cap, empty fallback), Content-Disposition parsing (RFC 5987 / quoted / unquoted), fetch_url_to_raw integration (PDF chunked write, HTML→trafilatura, short-extraction warning, HTTP 404, DNS error, unsupported type)
  • 360 total tests pass (329 prior + 31 new, zero regression)
  • Live network smoke-tested against four URL shapes:
    • arxiv.org/abs/<id> → Trading-R1-Financial-Trading-...md (3.3 KB)
    • arxiv.org/pdf/<id> → 2509.11420v1.pdf (2.1 MB, magic %PDF-1.7, version suffix from server CD header)
    • claude.com/blog/... → The-founder-s-playbook-...md (1.5 KB clean)
    • CDN PDF with %20 and (1) → 69fe2a55..._The-Founders-Playbook-..._v3-1.pdf (465 KB)
  • CI green

`openkb add` now accepts http(s) URLs in addition to local paths.
URLs are fetched into raw/ first, then handed to the existing
add_single_file pipeline — so PageIndex / markitdown / hash dedup /
remove all work unchanged.

Routing is decided by the HTTP Content-Type header, validated
against magic bytes (so a CDN that mislabels a PDF as
'application/octet-stream' still routes correctly):

  PDF response   → urllib chunked download → raw/<sanitized>.pdf
                   → existing convert_document → PageIndex (≥20 pages)
                   or markitdown (short)

  HTML response  → trafilatura.fetch_url + extract main content
                   → raw/<title-slug>.md
                   → existing markitdown short-doc path

  Anything else  → clear error, returns None
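
The chunked write, sketched against the signature the round-1 review quotes below (_download_pdf_chunked(response, head_bytes, target)); the 64 KB figure comes from design decision 6:

from pathlib import Path

CHUNK = 64 * 1024  # 64 KB per read, so a 50+ MB paper never sits fully in RAM

def _download_pdf_chunked(response, head_bytes: bytes, target: Path) -> None:
    """Stream an already-open HTTP response to disk, replaying the
    prefix consumed by the magic-byte sniff first."""
    with target.open("wb") as fh:
        fh.write(head_bytes)        # the sniffed bytes are part of the file
        while True:
            chunk = response.read(CHUNK)
            if not chunk:
                break
            fh.write(chunk)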

Filename derivation: Content-Disposition header (RFC 5987 / quoted /
unquoted forms) → URL basename → URL host fallback. Sanitization
preserves identifier dots like arxiv's "2509.11420" (only strips an
extension when it matches the target extension), replaces shell-
unsafe chars (space / parens / etc.) with '-', collapses repeats,
and caps the stem at 80 chars.
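
Roughly, as a sketch (illustrative helper name and regexes; assumptions, not the module's actual code):

import re

MAX_STEM = 80

def sanitize_filename(name: str, target_ext: str) -> str:
    # Strip an extension only when it matches the one being added, so an
    # identifier like '2509.11420' is never split at its dot.
    if name.lower().endswith(target_ext):
        name = name[: -len(target_ext)]
    # Replace shell-unsafe characters (spaces, parens, ...) with '-'.
    stem = re.sub(r"[^\w.\-]+", "-", name)
    # Collapse repeats, trim stray separators, cap the stem, and fall
    # back to a generic name if nothing survives.
    stem = re.sub(r"-{2,}", "-", stem).strip("-.")
    stem = stem[:MAX_STEM] or "download"
    return stem + target_ext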

Failure modes covered with stderr messages, no crashes:
  - HTTPError (404, 403, etc.) → "[ERROR] HTTP <code>"
  - URLError (DNS, refused)    → "[ERROR] Network error"
  - trafilatura returns None   → "[ERROR] No main content extracted"
  - <300 chars extracted       → "[WARN] only N chars — page may be JS-rendered"
                                  (still saved so user can inspect)
  - Unsupported content type   → "[ERROR] Unsupported content type"
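
The shape of that handling, sketched (messages abbreviated; the actual wording is listed above):

import sys
import urllib.error
import urllib.request

def fetch_bytes(url: str) -> bytes | None:
    """Return the response body, or None after a stderr message."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read()
    except urllib.error.HTTPError as exc:       # 404, 403, ...
        print(f"[ERROR] HTTP {exc.code}: {url}", file=sys.stderr)
    except urllib.error.URLError as exc:        # DNS failure, connection refused
        print(f"[ERROR] Network error: {exc.reason}", file=sys.stderr)
    return None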

Architecturally lives in openkb/url_ingest.py — peer to
openkb/converter.py and openkb/indexer.py rather than bloating
cli.py. cli.py only adds a 4-line intercept at the top of `add`.
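
The intercept plausibly looks like this (a sketch using the PR's function names, not the verbatim diff; add_single_file is the existing pipeline entry point defined elsewhere in openkb):

from pathlib import Path

from openkb.url_ingest import fetch_url_to_raw

def add(source: str, kb_dir: Path) -> None:
    if source.startswith(("http://", "https://")):
        fetched = fetch_url_to_raw(source, kb_dir / "raw")
        if fetched is None:
            return                    # error already printed to stderr
        source = str(fetched)         # fall through to the local-file path
    add_single_file(Path(source), kb_dir)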

Live-tested against four URL shapes:
  - arxiv.org/abs/<id>   → trafilatura → "Trading-R1-...md" (3 KB)
  - arxiv.org/pdf/<id>   → urllib      → "2509.11420v1.pdf" (2.1 MB)
                             (server's Content-Disposition gives the
                              version suffix for free — no special
                              arxiv handling needed)
  - claude.com/blog/...  → trafilatura → ".md" (1.5 KB)
  - CDN .pdf with %20    → urllib      → sanitized filename (465 KB)

31 new tests cover all routing branches, magic-byte override paths,
Content-Disposition parsing (3 forms), filename sanitization edge
cases (arxiv ID, spaces, long names, empty), and graceful failure
on HTTP/network errors. 360 total tests pass (329 prior + 31 new).
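
For the three Content-Disposition forms, the stdlib email parser is one way to cover all of them; a sketch of what such a test can look like (not the suite's actual code):

from email.message import Message

def filename_from_cd(cd: str | None) -> str | None:
    """Parse RFC 5987 / quoted / unquoted Content-Disposition forms."""
    if not cd:
        return None
    msg = Message()
    msg["Content-Disposition"] = cd
    return msg.get_filename()

def test_content_disposition_three_forms():
    quoted = 'attachment; filename="2509.11420v1.pdf"'
    unquoted = "attachment; filename=2509.11420v1.pdf"
    rfc5987 = "attachment; filename*=UTF-8''2509.11420v1.pdf"
    for header in (quoted, unquoted, rfc5987):
        assert filename_from_cd(header) == "2509.11420v1.pdf"
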
@KylinMountain (Collaborator, Author)

Code review

Found 1 issue.

  1. Filename clash silently overwrites a prior file in raw/. _extract_html (HTML path) and _download_pdf_chunked (PDF path) both write to raw/<sanitized> without checking whether the target already exists, so two URLs whose sanitization yields the same filename silently overwrite each other. This is easy to hit: for HTML, any two pages titled "Introduction", "About", or "Home" land on the same Introduction.md / About.md / Home.md; for PDFs, the same basename across mirrors yields the same .pdf. The hash registry still maps the first URL's hash to that filename, but the bytes at raw/<filename> are now the second URL's content. State becomes inconsistent: openkb list shows entries that no longer match disk, and openkb remove cleans the wrong logical entry.

    HTML write:

    OpenKB/openkb/url_ingest.py

    Lines 180 to 184 in 829f44f

    title = (metadata.title if metadata else None) or url
    filename = _sanitize_filename(title, ".md")
    target = raw_dir / filename
    target.write_text(markdown, encoding="utf-8")
    click.echo(

    PDF write:

    OpenKB/openkb/url_ingest.py

    Lines 232 to 236 in 829f44f

    url, response.headers.get("Content-Disposition"),
    )
    target = raw_dir / filename
    _download_pdf_chunked(response, head_bytes, target)
    size_mb = target.stat().st_size / (1024 * 1024)

    Fix: when target.exists(), append _2, _3, … to the stem until you find a free name. A ~6-line helper wrapping both call sites suffices (sketched below).
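
The later fix commit names this helper _unique_path; the suggested wrapper is roughly:

from pathlib import Path

def _unique_path(target: Path) -> Path:
    """Append _2, _3, ... to the stem until the name is free."""
    n = 2
    candidate = target
    while candidate.exists():
        candidate = target.with_name(f"{target.stem}_{n}{target.suffix}")
        n += 1
    return candidate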

Out of scope but worth tracking as follow-ups (each scored real but below the strict 80 threshold):

  • Redirect-aware filename: _pdf_filename(url, ...) uses the user's original URL rather than response.geturl(). When a URL like https://doi.org/10.xxx/yyy redirects to a publisher CDN and the publisher does NOT send Content-Disposition, the filename is derived from doi.org's path instead of the real basename. arxiv masks this with its CD header so the live smoke test didn't catch it. One-line fix: final_url = response.geturl() or url.

  • Orphan raw/ file on dedup-skip: when a URL re-ingest produces a hash already in the registry, add_single_file correctly reports [SKIP] but the just-downloaded file remains in raw/ untracked. For large PDFs this is silent disk waste. Fix: have add_single_file (or the URL caller) unlink the source file when convert_document returns skipped=True AND the path was URL-fetched.

Generated with Claude Code.

- _unique_path() prevents silent overwrites in raw/ when two URLs
  sanitize to the same filename (applied at both PDF and HTML write
  sites)
- PDF filename now derives from response.geturl() (the post-redirect
  URL), so DOI/shortlink resolvers produce sensible names
- add_single_file() returns bool; URL branch unlinks the fetched
  raw/* file when the downstream add is skipped (dedup), preventing
  orphan files that reappear on every retry
@KylinMountain (Collaborator, Author)

Code review (round 2, on fix commit 47e5ec2)

Found 2 issues:

  1. Orphan-unlink in URL branch is too broad — fires on pipeline failure, not just dedup-skip. add_single_file returns False from 5 sites: result.skipped (dedup), conversion failure, indexing failure, and both compilation-failure branches. The URL branch unconditionally unlinks the fetched raw file for any False return. If index_long_document succeeds but compile_long_doc fails (transient LLM timeout / rate-limit), the raw PDF is deleted while a dangling PageIndex entry remains — and because the hash is only registered on full success, openkb remove can't clean it up. Retrying re-downloads + may double-index. Suggest distinguishing skip-vs-error (tri-state return, or only unlink when result.skipped was the reason).

OpenKB/openkb/cli.py

Lines 428 to 435 in 47e5ec2

return
added = add_single_file(fetched, kb_dir)
if not added:
# Duplicate (or failed mid-pipeline) — drop the just-fetched
# file so raw/ doesn't accumulate orphans across retries.
fetched.unlink(missing_ok=True)
return

  2. _extract_html success echo prints the pre-collision filename, not the actual saved path. After _unique_path renames on collision (e.g. Introduction_2.md), the echo still uses filename instead of target.name, so the user sees Saved: raw/Introduction.md while the file is actually at raw/Introduction_2.md. The PDF branch (line 269) already uses target.name correctly — same fix needed here.

OpenKB/openkb/url_ingest.py

Lines 207 to 213 in 47e5ec2

target = _unique_path(raw_dir / filename)
target.write_text(markdown, encoding="utf-8")
click.echo(
f" Extracted: {title!r}\n"
f" Saved: raw/{filename} ({len(markdown) // 1024 or 1} KB clean markdown)"
)
return target

End-to-end flow (fetch_url_to_raw → add_single_file → convert_document → PageIndex/markitdown → wiki) is otherwise structurally sound. URL ingest works for the happy path; the issues above bite on edge cases (LLM API hiccup, second URL with same title).

Generated with Claude Code


Round-2 review on commit 47e5ec2 surfaced two issues:

1. add_single_file returned bool, so the URL branch couldn't tell
   "dedup skip" from "mid-pipeline failure". A transient LLM error
   during compile_long_doc would delete the just-downloaded PDF and
   leave a dangling PageIndex entry (hash registration only happens
   on full success, so `openkb remove` couldn't recover it). Now the
   function returns Literal["added","skipped","failed"]; the URL
   branch only unlinks on "skipped" — failures keep the raw file so
   the user can retry without re-downloading.

2. _extract_html echoed the pre-collision filename instead of
   target.name, so on a title collision the user was told the file
   was saved to a path occupied by the previous document. PDF branch
   already used target.name correctly; HTML branch now matches.
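
The caller side of that tri-state contract, sketched with the PR's function names (_add_from_url is a hypothetical wrapper for the URL branch of `add`):

from pathlib import Path
from typing import Literal

AddResult = Literal["added", "skipped", "failed"]

def _add_from_url(url: str, kb_dir: Path) -> None:
    fetched = fetch_url_to_raw(url, kb_dir / "raw")   # from url_ingest.py
    if fetched is None:
        return
    result: AddResult = add_single_file(fetched, kb_dir)
    if result == "skipped":
        # Dedup hit: the registry already tracks this content, so the
        # just-downloaded copy would sit orphaned in raw/.
        fetched.unlink(missing_ok=True)
    # On "failed" the raw file is kept so a retry need not re-download.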