
feat(cli): support openkb add <URL> with content-type sniffing #55

Open

KylinMountain wants to merge 3 commits into main from feat/add-url

Conversation

@KylinMountain (Collaborator)

Summary

openkb add now accepts http(s) URLs alongside local paths. The fetcher saves the remote document into raw/, then hands the path to the existing add_single_file pipeline — so PageIndex / markitdown / hash dedup / openkb remove all work unchanged.

openkb add https://arxiv.org/abs/2509.11420       # → trafilatura, 3 KB abstract+meta
openkb add https://arxiv.org/pdf/2509.11420       # → urllib, 2.1 MB → PageIndex long-doc
openkb add https://claude.com/blog/the-founders-playbook
openkb add https://cdn.example.com/foo.pdf

Architecture

URL fetching is a peer of converter.py and indexer.py, not bolted onto cli.py:

openkb/
├── cli.py            ← entry / routing (4-line intercept in `add`)
├── converter.py      ← local-file markitdown conversion
├── indexer.py        ← PageIndex long-doc indexing
├── images.py         ← image extraction
├── url_ingest.py     ← (NEW) http(s) URL → raw/ acquisition
└── ...

Design decisions

  1. Content-Type drives routing, not URL extension. arxiv.org/pdf/2509.11420 has no .pdf in the URL; cdn.example.com/foo.pdf has one. Both are PDFs and route the same way — by Content-Type: application/pdf.

  2. Magic-byte sniff overrides Content-Type when they disagree. Some CDNs mislabel PDFs as application/octet-stream; some servers serve an HTML interstitial with application/pdf. Reading the first 512 bytes and checking for %PDF- or < wins over the header (see the sketch after this list).

  3. HTML goes through trafilatura, not raw save. Live-tested claude.com/blog/...: markitdown on raw HTML yields 25 KB where half is nav/footer/cookie chrome. trafilatura extracts 1.5 KB of clean main content — much better signal for the LLM compiler.

  4. No site-specific code. No if arxiv: rewrite abs→pdf. The user picks depth: abs/ URL → abstract + metadata, pdf/ URL → full paper. The arxiv server's own Content-Disposition gives the version-suffixed filename for free.

  5. Filename sanitization preserves identifier dots. 2509.11420 is one identifier, not stem.ext — only strip an extension when it matches the target extension we're adding.

  6. Chunked PDF write. 50+ MB papers don't sit in RAM; streamed at 64 KB per chunk.

  7. Graceful failure with stderr messages, never crash. 404 / DNS / extraction-empty / unsupported type all return None and exit 0 after a clear message.
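
A minimal sketch of decisions 1 and 2, assuming a hypothetical sniff_kind helper (the real url_ingest.py may structure this differently):

import urllib.request

PDF_MAGIC = b"%PDF-"

def sniff_kind(url: str) -> str:
    """Classify a response as 'pdf', 'html', or 'unsupported'."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        ctype = (resp.headers.get("Content-Type") or "").split(";")[0].strip().lower()
        head = resp.read(512)
    # Magic bytes win over the header when the two disagree.
    if head.startswith(PDF_MAGIC):
        return "pdf"
    if head.lstrip()[:1] == b"<":   # HTML interstitial mislabeled as a PDF
        return "html"
    # Otherwise trust the declared Content-Type.
    if ctype == "application/pdf":
        return "pdf"
    if ctype in ("text/html", "application/xhtml+xml"):
        return "html"
    return "unsupported"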

Files

  • openkb/url_ingest.py (NEW, ~190 LOC)
  • openkb/cli.py (+4-line intercept at top of add)
  • pyproject.toml (+ trafilatura>=2.0)
  • tests/test_url_ingest.py (NEW, 31 tests)
  • README.md (one-line Quick Start + Commands table update)

Test plan

  • 31 new unit tests covering: URL detection, content-type sniffing (4 magic-byte override cases), filename sanitization (arxiv ID preservation, spaces, parens, length cap, empty fallback), Content-Disposition parsing (RFC 5987 / quoted / unquoted), fetch_url_to_raw integration (PDF chunked write, HTML→trafilatura, short-extraction warning, HTTP 404, DNS error, unsupported type)
  • 360 total tests pass (329 prior + 31 new, zero regression)
  • Live network smoke-tested against four URL shapes:
    • arxiv.org/abs/<id> → Trading-R1-Financial-Trading-...md (3.3 KB)
    • arxiv.org/pdf/<id> → 2509.11420v1.pdf (2.1 MB, magic %PDF-1.7, version suffix from server CD header)
    • claude.com/blog/... → The-founder-s-playbook-...md (1.5 KB clean)
    • CDN PDF with %20 and (1) → 69fe2a55..._The-Founders-Playbook-..._v3-1.pdf (465 KB)
  • CI green

`openkb add` now accepts http(s) URLs in addition to local paths.
URLs are fetched into raw/ first, then handed to the existing
add_single_file pipeline — so PageIndex / markitdown / hash dedup /
remove all work unchanged.

Routing is decided by the HTTP Content-Type header, validated
against magic bytes (so a CDN that mislabels a PDF as
'application/octet-stream' still routes correctly):

  PDF response   → urllib chunked download → raw/<sanitized>.pdf
                   → existing convert_document → PageIndex (≥20 pages)
                   or markitdown (short)

  HTML response  → trafilatura.fetch_url + extract main content
                   → raw/<title-slug>.md
                   → existing markitdown short-doc path

  Anything else  → clear error, returns None
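
The chunked write, sketched against the signature the round-1 review quotes below (_download_pdf_chunked(response, head_bytes, target)); the 64 KB figure comes from design decision 6:

from pathlib import Path

CHUNK = 64 * 1024  # 64 KB per read, so a 50+ MB paper never sits fully in RAM

def _download_pdf_chunked(response, head_bytes: bytes, target: Path) -> None:
    """Stream an already-open HTTP response to disk, replaying the
    prefix consumed by the magic-byte sniff first."""
    with target.open("wb") as fh:
        fh.write(head_bytes)        # the sniffed bytes are part of the file
        while True:
            chunk = response.read(CHUNK)
            if not chunk:
                break
            fh.write(chunk)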

Filename derivation: Content-Disposition header (RFC 5987 / quoted /
unquoted forms) → URL basename → URL host fallback. Sanitization
preserves identifier dots like arxiv's "2509.11420" (only strips an
extension when it matches the target extension), replaces shell-
unsafe chars (space / parens / etc.) with '-', collapses repeats,
and caps the stem at 80 chars.
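
Roughly, as a sketch (illustrative helper name and regexes; assumptions, not the module's actual code):

import re

MAX_STEM = 80

def sanitize_filename(name: str, target_ext: str) -> str:
    # Strip an extension only when it matches the one being added, so an
    # identifier like '2509.11420' is never split at its dot.
    if name.lower().endswith(target_ext):
        name = name[: -len(target_ext)]
    # Replace shell-unsafe characters (spaces, parens, ...) with '-'.
    stem = re.sub(r"[^\w.\-]+", "-", name)
    # Collapse repeats, trim stray separators, cap the stem, and fall
    # back to a generic name if nothing survives.
    stem = re.sub(r"-{2,}", "-", stem).strip("-.")
    stem = stem[:MAX_STEM] or "download"
    return stem + target_ext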

Failure modes covered with stderr messages, no crashes:
  - HTTPError (404, 403, etc.) → "[ERROR] HTTP <code>"
  - URLError (DNS, refused)    → "[ERROR] Network error"
  - trafilatura returns None   → "[ERROR] No main content extracted"
  - <300 chars extracted       → "[WARN] only N chars — page may be JS-rendered"
                                  (still saved so user can inspect)
  - Unsupported content type   → "[ERROR] Unsupported content type"
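
The shape of that handling, sketched (messages abbreviated; the actual wording is listed above):

import sys
import urllib.error
import urllib.request

def fetch_bytes(url: str) -> bytes | None:
    """Return the response body, or None after a stderr message."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read()
    except urllib.error.HTTPError as exc:       # 404, 403, ...
        print(f"[ERROR] HTTP {exc.code}: {url}", file=sys.stderr)
    except urllib.error.URLError as exc:        # DNS failure, connection refused
        print(f"[ERROR] Network error: {exc.reason}", file=sys.stderr)
    return None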

Architecturally lives in openkb/url_ingest.py — peer to
openkb/converter.py and openkb/indexer.py rather than bloating
cli.py. cli.py only adds a 4-line intercept at the top of `add`.
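
The intercept plausibly looks like this (a sketch using the PR's function names, not the verbatim diff; add_single_file is the existing pipeline entry point defined elsewhere in openkb):

from pathlib import Path

from openkb.url_ingest import fetch_url_to_raw

def add(source: str, kb_dir: Path) -> None:
    if source.startswith(("http://", "https://")):
        fetched = fetch_url_to_raw(source, kb_dir / "raw")
        if fetched is None:
            return                    # error already printed to stderr
        source = str(fetched)         # fall through to the local-file path
    add_single_file(Path(source), kb_dir)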

Live-tested against four URL shapes:
  - arxiv.org/abs/<id>   → trafilatura → "Trading-R1-...md" (3 KB)
  - arxiv.org/pdf/<id>   → urllib      → "2509.11420v1.pdf" (2.1 MB)
                             (server's Content-Disposition gives the
                              version suffix for free — no special
                              arxiv handling needed)
  - claude.com/blog/...  → trafilatura → ".md" (1.5 KB)
  - CDN .pdf with %20    → urllib      → sanitized filename (465 KB)

31 new tests cover all routing branches, magic-byte override paths,
Content-Disposition parsing (3 forms), filename sanitization edge
cases (arxiv ID, spaces, long names, empty), and graceful failure
on HTTP/network errors. 360 total tests pass (329 prior + 31 new).
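
For the three Content-Disposition forms, the stdlib email parser is one way to cover all of them; a sketch of what such a test can look like (not the suite's actual code):

from email.message import Message

def filename_from_cd(cd: str | None) -> str | None:
    """Parse RFC 5987 / quoted / unquoted Content-Disposition forms."""
    if not cd:
        return None
    msg = Message()
    msg["Content-Disposition"] = cd
    return msg.get_filename()

def test_content_disposition_three_forms():
    quoted = 'attachment; filename="2509.11420v1.pdf"'
    unquoted = "attachment; filename=2509.11420v1.pdf"
    rfc5987 = "attachment; filename*=UTF-8''2509.11420v1.pdf"
    for header in (quoted, unquoted, rfc5987):
        assert filename_from_cd(header) == "2509.11420v1.pdf"
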
@KylinMountain (Collaborator, Author)

Code review

Found 1 issue.

  1. Filename clash silently overwrites a prior file in raw/. _extract_html (HTML path) and _download_pdf_chunked (PDF path) both write to raw/<sanitized> without checking whether the target already exists, so two URLs whose sanitization yields the same filename silently overwrite each other. This is easy to hit: for HTML, any two pages titled "Introduction", "About", or "Home" land on the same Introduction.md / About.md / Home.md; for PDFs, the same basename across mirrors yields the same .pdf. The hash registry still maps the first URL's hash to that filename, but the bytes at raw/<filename> are now the second URL's content. State becomes inconsistent: openkb list shows entries that no longer match disk, and openkb remove cleans the wrong logical entry.

    HTML write:

    OpenKB/openkb/url_ingest.py

    Lines 180 to 184 in 829f44f

    title = (metadata.title if metadata else None) or url
    filename = _sanitize_filename(title, ".md")
    target = raw_dir / filename
    target.write_text(markdown, encoding="utf-8")
    click.echo(

    PDF write:

    OpenKB/openkb/url_ingest.py

    Lines 232 to 236 in 829f44f

    url, response.headers.get("Content-Disposition"),
    )
    target = raw_dir / filename
    _download_pdf_chunked(response, head_bytes, target)
    size_mb = target.stat().st_size / (1024 * 1024)

    Fix: when target.exists(), append _2, _3, … to the stem until you find a free name. A ~6-line helper wrapping both call sites suffices (sketched below).
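
The later fix commit names this helper _unique_path; the suggested wrapper is roughly:

from pathlib import Path

def _unique_path(target: Path) -> Path:
    """Append _2, _3, ... to the stem until the name is free."""
    n = 2
    candidate = target
    while candidate.exists():
        candidate = target.with_name(f"{target.stem}_{n}{target.suffix}")
        n += 1
    return candidate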

Out of scope but worth tracking as follow-ups (each scored real but below the strict 80 threshold):

  • Redirect-aware filename: _pdf_filename(url, ...) uses the user's original URL rather than response.geturl(). When a URL like https://doi.org/10.xxx/yyy redirects to a publisher CDN and the publisher does NOT send Content-Disposition, the filename is derived from doi.org's path instead of the real basename. arxiv masks this with its CD header so the live smoke test didn't catch it. One-line fix: final_url = response.geturl() or url.

  • Orphan raw/ file on dedup-skip: when a URL re-ingest produces a hash already in the registry, add_single_file correctly reports [SKIP] but the just-downloaded file remains in raw/ untracked. For large PDFs this is silent disk waste. Fix: have add_single_file (or the URL caller) unlink the source file when convert_document returns skipped=True AND the path was URL-fetched.

Generated with Claude Code.

- _unique_path() prevents silent overwrites in raw/ when two URLs
  sanitize to the same filename (applied at both PDF and HTML write
  sites)
- PDF filename now derives from response.geturl() (the post-redirect
  URL), so DOI/shortlink resolvers produce sensible names
- add_single_file() returns bool; URL branch unlinks the fetched
  raw/* file when the downstream add is skipped (dedup), preventing
  orphan files that reappear on every retry
@KylinMountain (Collaborator, Author)

Code review (round 2, on fix commit 47e5ec2)

Found 2 issues:

  1. Orphan-unlink in URL branch is too broad — fires on pipeline failure, not just dedup-skip. add_single_file returns False from 5 sites: result.skipped (dedup), conversion failure, indexing failure, and both compilation-failure branches. The URL branch unconditionally unlinks the fetched raw file for any False return. If index_long_document succeeds but compile_long_doc fails (transient LLM timeout / rate-limit), the raw PDF is deleted while a dangling PageIndex entry remains — and because the hash is only registered on full success, openkb remove can't clean it up. Retrying re-downloads + may double-index. Suggest distinguishing skip-vs-error (tri-state return, or only unlink when result.skipped was the reason).

OpenKB/openkb/cli.py

Lines 428 to 435 in 47e5ec2

return
added = add_single_file(fetched, kb_dir)
if not added:
# Duplicate (or failed mid-pipeline) — drop the just-fetched
# file so raw/ doesn't accumulate orphans across retries.
fetched.unlink(missing_ok=True)
return

  2. _extract_html success echo prints the pre-collision filename, not the actual saved path. After _unique_path renames on collision (e.g. Introduction_2.md), the echo still uses filename instead of target.name, so the user sees Saved: raw/Introduction.md while the file is actually at raw/Introduction_2.md. The PDF branch (line 269) already uses target.name correctly — same fix needed here.

OpenKB/openkb/url_ingest.py

Lines 207 to 213 in 47e5ec2

target = _unique_path(raw_dir / filename)
target.write_text(markdown, encoding="utf-8")
click.echo(
f" Extracted: {title!r}\n"
f" Saved: raw/{filename} ({len(markdown) // 1024 or 1} KB clean markdown)"
)
return target

End-to-end flow (fetch_url_to_raw → add_single_file → convert_document → PageIndex/markitdown → wiki) is otherwise structurally sound. URL ingest works for the happy path; the issues above bite on edge cases (LLM API hiccup, second URL with same title).

Generated with Claude Code


Round-2 review on commit 47e5ec2 surfaced two issues:

1. add_single_file returned bool, so the URL branch couldn't tell
   "dedup skip" from "mid-pipeline failure". A transient LLM error
   during compile_long_doc would delete the just-downloaded PDF and
   leave a dangling PageIndex entry (hash registration only happens
   on full success, so `openkb remove` couldn't recover it). Now the
   function returns Literal["added","skipped","failed"]; the URL
   branch only unlinks on "skipped" — failures keep the raw file so
   the user can retry without re-downloading.

2. _extract_html echoed the pre-collision filename instead of
   target.name, so on a title collision the user was told the file
   was saved to a path occupied by the previous document. PDF branch
   already used target.name correctly; HTML branch now matches.
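
The caller side of that tri-state contract, sketched with the PR's function names (_add_from_url is a hypothetical wrapper for the URL branch of `add`):

from pathlib import Path
from typing import Literal

AddResult = Literal["added", "skipped", "failed"]

def _add_from_url(url: str, kb_dir: Path) -> None:
    fetched = fetch_url_to_raw(url, kb_dir / "raw")   # from url_ingest.py
    if fetched is None:
        return
    result: AddResult = add_single_file(fetched, kb_dir)
    if result == "skipped":
        # Dedup hit: the registry already tracks this content, so the
        # just-downloaded copy would sit orphaned in raw/.
        fetched.unlink(missing_ok=True)
    # On "failed" the raw file is kept so a retry need not re-download.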