
feat(pptx): client-side slide splitting for hi_res PPTX #345

Open: badGarnet wants to merge 2 commits into main from feat/split-pptx-hi-res


badGarnet (Collaborator) commented Jun 23, 2026

Summary

Extends the SDK's client-side PDF page-splitting to PPTX decks. When split_pdf_page=true, strategy=hi_res, and the upload is a .pptx, the deck is split into slide-chunks that are partitioned in parallel and recombined — parallelizing the expensive per-slide hi_res work and avoiding large-payload timeouts. Each slide maps to one page (the server already partitions PPTX one page per slide).

PDF behavior is unchanged. Reuses the existing split_pdf_page flag, so there's no new API field.
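The gating condition described above can be sketched as a small predicate. This is only an illustration: `should_split_pptx` and its parameter names are hypothetical and are not taken from the SDK.

```python
def should_split_pptx(split_pdf_page: bool, strategy: str, upload_is_pptx: bool) -> bool:
    """Hypothetical predicate: all three conditions from the summary must hold."""
    # The flag is the existing split_pdf_page option, reused for PPTX.
    # Splitting applies only to hi_res requests whose upload is a real .pptx.
    return split_pdf_page and strategy == "hi_res" and upload_is_pptx


# A hi_res .pptx upload with the flag set is split; anything else falls through.
print(should_split_pptx(True, "hi_res", True))
print(should_split_pptx(True, "fast", True))
```

Any request that fails the predicate is sent as a single unsplit upload, which matches the fall-through behavior described for non-hi_res requests.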

How it works

  • New pptx_utils.py — stdlib-only (zipfile + xml/regex), no new dependency. Provides is_pptx, get_pptx_slide_count, and in-memory / cache-to-disk chunk builders. Each chunk is a minimal valid .pptx that keeps every shared part (masters, layouts, themes, notes master, media) and only the chunk's slides plus the notes slides they reference; presentation.xml, its .rels, and [Content_Types].xml are rewritten so nothing dangles.
  • request_utils.py: create_pdf_chunk_request is now a thin wrapper over a generalized create_document_chunk_request(..., content_type); PPTX chunks go out with the pptx content type and the original .pptx filename.
  • split_pdf_hook.py — one new branch in _before_request_unlocked, gated on strategy=hi_res + real .pptx. All downstream machinery (sizing defaults, concurrency, caching, retries, after_success recombination, cleanup) is shared with the PDF path. Legacy .ppt (OLE, not a zip) and non-hi_res requests fall through unsplit.
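As a rough illustration of the stdlib-only detection and counting described above, the sketch below shows one way `is_pptx` and `get_pptx_slide_count` could be written with `zipfile` and `xml.etree`. The function bodies are assumptions and not the code in this PR; `build_toy_package` is a hypothetical helper that exists only to exercise the sketch and does not produce a valid deck.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# PresentationML main namespace, used to find slide-id entries.
P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
PRESENTATION_PART = "ppt/presentation.xml"


def is_pptx(data: bytes) -> bool:
    """Return True if the bytes look like an OOXML presentation package."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as package:
            names = set(package.namelist())
    except zipfile.BadZipFile:
        # Legacy .ppt files are OLE compound documents, not zip archives.
        return False
    return "[Content_Types].xml" in names and PRESENTATION_PART in names


def get_pptx_slide_count(data: bytes) -> int:
    """Count the <p:sldId> entries listed in presentation.xml."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        root = ET.fromstring(package.read(PRESENTATION_PART))
    return len(root.findall("p:sldIdLst/p:sldId", {"p": P_NS}))


def build_toy_package(slide_count: int) -> bytes:
    """Hypothetical test helper: a zip with only the two parts inspected above."""
    slide_ids = "".join(f'<p:sldId id="{256 + i}"/>' for i in range(slide_count))
    presentation = (
        f'<p:presentation xmlns:p="{P_NS}">'
        f"<p:sldIdLst>{slide_ids}</p:sldIdLst></p:presentation>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr(PRESENTATION_PART, presentation)
    return buffer.getvalue()


toy = build_toy_package(3)
print(is_pptx(toy), get_pptx_slide_count(toy))
```

Counting from the slide-id list rather than from the `ppt/slides/` entries is a design choice in this sketch: the list is what defines the presentation order, while orphaned slide parts may still sit in the archive.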

Verification

  • 233 unit tests pass (19 new in test_pptx_split.py); clean mypy. PDF path unchanged.
  • Structural: python-pptx (the server's own parser) opens the synthesized 1000-slide deck and every chunk with exact slide/notes counts; notes slides are preserved.
  • Live (stage) — 1000-slide deck → 50 parallel chunks → contiguous element pages 1..1000, 0 chunk failures.
  • Unchunked vs chunked A/B (60-slide deck, hi_res): ~1.5× faster (6.6s vs 10.0s) with 717/720 elements byte-identical. The only deltas are 3 parent_id links whose parent Title lives on a slide in a different chunk — an inherent property of independent-chunk partitioning, identical to the existing PDF page-split behavior.
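To make the 1000-slide figure concrete, here is a minimal sketch of how a deck could be cut into contiguous slide ranges. The helper name `slide_chunk_ranges` is hypothetical, and the chunk size of 20 is inferred from 1000 slides / 50 chunks under the assumption of equal-sized chunks; the SDK's actual sizing defaults are not shown in this PR description.

```python
def slide_chunk_ranges(slide_count: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first_slide, last_slide) ranges covering the deck."""
    if slide_count < 0 or chunk_size < 1:
        raise ValueError("slide_count must be >= 0 and chunk_size must be >= 1")
    return [
        (start, min(start + chunk_size - 1, slide_count))
        for start in range(1, slide_count + 1, chunk_size)
    ]


ranges = slide_chunk_ranges(1000, 20)  # chunk size of 20 is an assumption
# Flatten the ranges back into page numbers to confirm there are no gaps or overlaps.
pages = [page for first, last in ranges for page in range(first, last + 1)]
print(len(ranges), pages == list(range(1, 1001)))
```

Because each slide maps to one page, contiguous slide ranges are what allow the recombined elements to carry page numbers 1..1000 without gaps.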

🤖 Generated with Claude Code


Mirror the existing client-side PDF page-splitting for PPTX decks. When
split_pdf_page=true and strategy=hi_res and the upload is a .pptx, the deck
is split into slide-chunks that are partitioned in parallel and recombined,
parallelizing the expensive per-slide hi_res work and avoiding large-payload
timeouts.

- New stdlib-only pptx_utils (zipfile + xml), no new dependency: is_pptx,
  get_pptx_slide_count, and in-memory/cache-to-disk chunk builders. Each chunk
  is a minimal valid .pptx keeping shared parts (masters/layouts/themes/notes
  master/media) and only the chunk's slides plus the notes slides they
  reference; presentation.xml, its rels, and [Content_Types].xml are rewritten
  so nothing dangles.
- request_utils: generalize create_pdf_chunk_request into
  create_document_chunk_request(content_type); PDF path is a thin wrapper.
- split_pdf_hook: one new branch reusing all shared machinery (sizing,
  concurrency, caching, retries, recombination). PDF behavior unchanged;
  legacy .ppt and non-hi_res requests fall through unsplit.
- Reuses the existing split_pdf_page flag (no new API field).

Verified: 233 unit tests pass (19 new), clean mypy. python-pptx opens the
synthesized 1000-slide deck and every chunk with exact slide/notes counts.
Live (stage): 1000 slides -> 50 parallel chunks -> contiguous pages 1..1000.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cubic-dev-ai (bot) left a comment

No issues found across 4 files

Shadow auto-approve: would require human review. Adds client-side PPTX slide-splitting for hi_res strategy, modifying core request processing and file chunking logic. Complex XML/zip manipulation with potential impact on data integrity and server behavior requires human review.


PPTX splitting is now best-effort. If reading the slide count or building the
slide-chunks raises for any reason (malformed package, unexpected OOXML
structure), the hook logs a warning, cleans up any partial operation state,
and returns the original request so the whole deck is sent unsplit instead of
failing the partition call. PDF behavior is unchanged (still re-raises).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
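The best-effort behavior in this commit can be sketched as a guarded call. The names `try_split_pptx`, `build_chunks`, and `cleanup` are hypothetical stand-ins, not the hook's real functions; the sketch only shows the shape of the fallback.

```python
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger(__name__)


def try_split_pptx(
    data: bytes,
    build_chunks: Callable[[bytes], Sequence[bytes]],
    cleanup: Callable[[], None],
) -> Optional[Sequence[bytes]]:
    """Return the chunks, or None to signal "send the original deck unsplit"."""
    try:
        return build_chunks(data)
    except Exception as exc:  # deliberately broad: any failure means "do not split"
        logger.warning("PPTX split failed, sending the deck unsplit: %s", exc)
        cleanup()  # drop any partial operation state before falling back
        return None


def failing_builder(_: bytes) -> Sequence[bytes]:
    raise ValueError("malformed package")


cleaned = []
result = try_split_pptx(b"not a deck", failing_builder, lambda: cleaned.append(True))
print(result, cleaned)
```

A caller that receives `None` would return the original request untouched, so a malformed deck costs one warning instead of failing the partition call.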

cubic-dev-ai (bot) left a comment

0 issues found across 2 files (changes from recent commits).

Shadow auto-approve: would require human review. New feature adding PPTX slide splitting extends core request handling logic with non-trivial file processing. Risk of breakage in upload pipeline warrants human review.

