feat(cli): support openkb add <URL> with content-type sniffing (#55)
`openkb add` now accepts http(s) URLs in addition to local paths.
URLs are fetched into raw/ first, then handed to the existing
add_single_file pipeline — so PageIndex / markitdown / hash dedup /
remove all work unchanged.
Routing is decided by the HTTP Content-Type header, validated
against magic bytes (so a CDN that mislabels a PDF as
'application/octet-stream' still routes correctly):
PDF response → urllib chunked download → raw/<sanitized>.pdf
→ existing convert_document → PageIndex (≥20 pages)
or markitdown (short)
HTML response → trafilatura.fetch_url + extract main content
→ raw/<title-slug>.md
→ existing markitdown short-doc path
Anything else → clear error, returns None
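The header-plus-magic-bytes decision above can be sketched as a small pure function. The function name and the exact fallback rules are illustrative assumptions, not the PR's literal code:

```python
def sniff_route(content_type: str, head: bytes) -> str:
    """Return 'pdf', 'html', or 'unsupported' for a fetched response.

    Sketch of the routing described above; names are assumptions.
    """
    # Magic bytes win when the header lies: a CDN may label a PDF
    # as application/octet-stream, or a server may return an HTML
    # interstitial with Content-Type: application/pdf.
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.lstrip()[:1] == b"<":
        return "html"
    # Fall back to the Content-Type header when the bytes are neutral.
    if "pdf" in content_type:
        return "pdf"
    if "html" in content_type or content_type.startswith("text/"):
        return "html"
    return "unsupported"
```

Checking the body first means a mislabeled response still routes correctly, matching the CDN example above.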
Filename derivation: Content-Disposition header (RFC 5987 / quoted /
unquoted forms) → URL basename → URL host fallback. Sanitization
preserves identifier dots like arxiv's "2509.11420" (only strips an
extension when it matches the target extension), replaces shell-
unsafe chars (space / parens / etc.) with '-', collapses repeats,
and caps the stem at 80 chars.
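Those sanitization rules can be sketched as follows (the helper name and exact safe character set are assumptions):

```python
import re

def sanitize_stem(name: str, target_ext: str) -> str:
    """Sketch of the sanitization rules described above.

    Strips an extension only when it matches the extension being
    written, so an arxiv id like '2509.11420' keeps its dot; replaces
    shell-unsafe characters with '-', collapses repeats, and caps the
    result at 80 characters.
    """
    if target_ext and name.lower().endswith(target_ext.lower()):
        name = name[: -len(target_ext)]
    name = re.sub(r"[^A-Za-z0-9._-]+", "-", name)   # spaces, parens, %, ...
    name = re.sub(r"-{2,}", "-", name).strip("-.")  # collapse runs, trim edges
    return name[:80]
```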
Failure modes covered with stderr messages, no crashes:
- HTTPError (404, 403, etc.) → "[ERROR] HTTP <code>"
- URLError (DNS, refused) → "[ERROR] Network error"
- trafilatura returns None → "[ERROR] No main content extracted"
- <300 chars extracted → "[WARN] only N chars — page may be JS-rendered"
(still saved so user can inspect)
- Unsupported content type → "[ERROR] Unsupported content type"
Architecturally lives in openkb/url_ingest.py — peer to
openkb/converter.py and openkb/indexer.py rather than bloating
cli.py. cli.py only adds a 4-line intercept at the top of `add`.
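The shape of that intercept can be pictured like this; all names here are illustrative, with `fetch` standing in for the url_ingest entry point:

```python
from pathlib import Path
from typing import Callable, Optional

def resolve_add_target(
    arg: str,
    fetch: Callable[[str], Optional[Path]],
) -> Optional[Path]:
    """Sketch of the intercept at the top of `add` (names assumed).

    URL arguments are fetched into raw/ first; local paths pass
    straight through; a failed fetch short-circuits to None so the
    caller can return early.
    """
    if arg.startswith(("http://", "https://")):
        return fetch(arg)  # None on failure; fetch prints its own error
    return Path(arg)
```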
Live-tested against four URL shapes:
- arxiv.org/abs/<id> → trafilatura → "Trading-R1-...md" (3 KB)
- arxiv.org/pdf/<id> → urllib → "2509.11420v1.pdf" (2.1 MB)
(server's Content-Disposition gives the
version suffix for free — no special
arxiv handling needed)
- claude.com/blog/... → trafilatura → ".md" (1.5 KB)
- CDN .pdf with %20 → urllib → sanitized filename (465 KB)
31 new tests cover all routing branches, magic-byte override paths,
Content-Disposition parsing (3 forms), filename sanitization edge
cases (arxiv ID, spaces, long names, empty), and graceful failure
on HTTP/network errors. 360 total tests pass (329 prior + 31 new).
Code review: found 1 issue.
Out of scope but worth tracking as follow-ups (each scored real but below the strict 80 threshold).
Generated with Claude Code.
- _unique_path() prevents silent overwrites in raw/ when two URLs sanitize to the same filename (applied at both the PDF and HTML write sites)
- PDF filename now derives from response.geturl() (the post-redirect URL), so DOI/shortlink resolvers produce sensible names
- add_single_file() returns bool; the URL branch unlinks the fetched raw/* file when the downstream add is skipped (dedup), preventing orphan files that resurrect on every retry
Code review (round 2, on fix commit 47e5ec2): found 2 issues.
Lines 428 to 435 in 47e5ec2
Lines 207 to 213 in 47e5ec2
Round-2 review on commit 47e5ec2 surfaced two issues:

1. add_single_file returned bool, so the URL branch couldn't tell "dedup skip" from "mid-pipeline failure". A transient LLM error during compile_long_doc would delete the just-downloaded PDF and leave a dangling PageIndex entry (hash registration only happens on full success, so `openkb remove` couldn't recover it). Now the function returns Literal["added", "skipped", "failed"]; the URL branch only unlinks on "skipped" — failures keep the raw file so the user can retry without re-downloading.

2. _extract_html echoed the pre-collision filename instead of target.name, so on a title collision the user was told the file was saved to a path occupied by the previous document. The PDF branch already used target.name correctly; the HTML branch now matches.
Summary

`openkb add` now accepts http(s) URLs alongside local paths. The fetcher saves the remote document into `raw/`, then hands the path to the existing `add_single_file` pipeline — so PageIndex / markitdown / hash dedup / `openkb remove` all work unchanged.

Architecture

URL fetching is a peer of `converter.py` and `indexer.py`, not bolted onto `cli.py`.

Design decisions
Content-Type drives routing, not URL extension. `arxiv.org/pdf/2509.11420` has no `.pdf` in the URL; `cdn.com/foo.pdf` has one. Both are PDFs and route the same way — by `Content-Type: application/pdf`.

Magic-byte sniff overrides Content-Type when they disagree. Some CDNs mislabel PDFs as `application/octet-stream`; some servers serve an HTML interstitial with `application/pdf`. Reading the first 512 bytes and checking for `%PDF-` or `<` wins over the header.

HTML goes through trafilatura, not a raw save. Live-tested on `claude.com/blog/...`: markitdown on raw HTML yields 25 KB where half is nav/footer/cookie chrome; trafilatura extracts 1.5 KB of clean main content — much better signal for the LLM compiler.

No site-specific code. No `if arxiv: rewrite abs→pdf`. The user picks depth: an `abs/` URL gives abstract + metadata, a `pdf/` URL gives the full paper. The arxiv server's own Content-Disposition gives the version-suffixed filename for free.

Filename sanitization preserves identifier dots. `2509.11420` is one identifier, not `stem.ext` — only strip an extension when it matches the target extension we're adding.

Chunked PDF write. 50+ MB papers don't sit in RAM; streamed at 64 KB per chunk.

Graceful failure with stderr messages, never a crash. 404 / DNS / extraction-empty / unsupported type all return None and exit 0 after a clear message.
Files

- `openkb/url_ingest.py` (NEW, ~190 LOC)
- `openkb/cli.py` (+4-line intercept at top of `add`)
- `pyproject.toml` (+ `trafilatura>=2.0`)
- `tests/test_url_ingest.py` (NEW, 31 tests)
- `README.md` (one-line Quick Start + Commands table update)

Test plan

- `fetch_url_to_raw` integration (PDF chunked write, HTML→trafilatura, short-extraction warning, HTTP 404, DNS error, unsupported type)
- `arxiv.org/abs/<id>` → `Trading-R1-Financial-Trading-...md` (3.3 KB)
- `arxiv.org/pdf/<id>` → `2509.11420v1.pdf` (2.1 MB, magic `%PDF-1.7`, version suffix from server CD header)
- `claude.com/blog/...` → `The-founder-s-playbook-...md` (1.5 KB clean)
- CDN URL with `%20` and `(1)` → `69fe2a55..._The-Founders-Playbook-..._v3-1.pdf` (465 KB)