feat(pptx): client-side slide splitting for hi_res PPTX #345
Open
badGarnet wants to merge 2 commits into
Mirror the existing client-side PDF page-splitting for PPTX decks. When split_pdf_page=true and strategy=hi_res and the upload is a .pptx, the deck is split into slide-chunks that are partitioned in parallel and recombined, parallelizing the expensive per-slide hi_res work and avoiding large-payload timeouts.

- New stdlib-only pptx_utils (zipfile + xml), no new dependency: is_pptx, get_pptx_slide_count, and in-memory/cache-to-disk chunk builders. Each chunk is a minimal valid .pptx keeping shared parts (masters/layouts/themes/notes master/media) and only the chunk's slides plus the notes slides they reference; presentation.xml, its rels, and [Content_Types].xml are rewritten so nothing dangles.
- request_utils: generalize create_pdf_chunk_request into create_document_chunk_request(content_type); PDF path is a thin wrapper.
- split_pdf_hook: one new branch reusing all shared machinery (sizing, concurrency, caching, retries, recombination). PDF behavior unchanged; legacy .ppt and non-hi_res requests fall through unsplit.
- Reuses the existing split_pdf_page flag (no new API field).

Verified: 233 unit tests pass (19 new), clean mypy. python-pptx opens the synthesized 1000-slide deck and every chunk with exact slide/notes counts.

Live (stage): 1000 slides -> 50 parallel chunks -> contiguous pages 1..1000.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
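As a rough illustration of the stdlib-only detection and slide counting described in this commit, here is a minimal sketch. It is not the PR's pptx_utils code: the function names is_pptx_sketch, count_slides_sketch, and make_toy_package are hypothetical, and the sketch assumes the main part lives at ppt/presentation.xml (the conventional location) rather than resolving it through the package relationships.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# PresentationML namespace from the OOXML (ECMA-376) schema.
P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
# Assumption: the main part sits at its conventional path.
PRESENTATION_PART = "ppt/presentation.xml"


def is_pptx_sketch(data: bytes) -> bool:
    """Return True if data is a zip package that contains ppt/presentation.xml."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False  # e.g. legacy .ppt, which is an OLE compound file, not a zip
    with zipfile.ZipFile(buffer) as package:
        return PRESENTATION_PART in package.namelist()


def count_slides_sketch(data: bytes) -> int:
    """Count the <p:sldId> entries listed in presentation.xml."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        root = ET.fromstring(package.read(PRESENTATION_PART))
    slide_id_list = root.find(f"{{{P_NS}}}sldIdLst")
    if slide_id_list is None:
        return 0
    return len(slide_id_list.findall(f"{{{P_NS}}}sldId"))


def make_toy_package(slide_count: int) -> bytes:
    """Build a zip holding only presentation.xml (not a valid deck, demo input only)."""
    ids = "".join(f'<p:sldId id="{256 + i}"/>' for i in range(slide_count))
    xml = (
        f'<p:presentation xmlns:p="{P_NS}">'
        f"<p:sldIdLst>{ids}</p:sldIdLst></p:presentation>"
    )
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as package:
        package.writestr(PRESENTATION_PART, xml)
    return out.getvalue()


if __name__ == "__main__":
    toy = make_toy_package(3)
    print(is_pptx_sketch(toy), count_slides_sketch(toy))
```

Counting sldId entries rather than files under ppt/slides/ keeps the count tied to what the presentation actually lists, which is also the list a chunk builder would have to rewrite.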
No issues found across 4 files
Shadow auto-approve: would require human review. Adds client-side PPTX slide-splitting for hi_res strategy, modifying core request processing and file chunking logic. Complex XML/zip manipulation with potential impact on data integrity and server behavior requires human review.
PPTX splitting is now best-effort. If reading the slide count or building the slide-chunks raises for any reason (malformed package, unexpected OOXML structure), the hook logs a warning, cleans up any partial operation state, and returns the original request so the whole deck is sent unsplit instead of failing the partition call. PDF behavior is unchanged (still re-raises).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
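The best-effort behavior described in this commit can be sketched as a small wrapper. This is an illustration of the pattern only, not the hook's actual code: split_best_effort, build_split_request, and cleanup are hypothetical names, and the request is treated as an opaque value.

```python
import logging
from typing import Callable, TypeVar

logger = logging.getLogger(__name__)

RequestT = TypeVar("RequestT")


def split_best_effort(
    request: RequestT,
    build_split_request: Callable[[RequestT], RequestT],
    cleanup: Callable[[], None],
) -> RequestT:
    """Try to split; on any failure, warn, clean up, and return the original request."""
    try:
        return build_split_request(request)
    except Exception as exc:  # deliberately broad: any failure means "send unsplit"
        logger.warning("PPTX splitting failed, sending the deck unsplit: %s", exc)
        cleanup()  # drop partial state (hypothetical callback)
        return request


if __name__ == "__main__":
    def failing_builder(req: str) -> str:
        raise ValueError("malformed package")

    print(split_best_effort("original-request", failing_builder, lambda: None))
```

The broad except is the point of the design: a splitting failure degrades to the unsplit upload instead of failing the partition call, whereas the PDF path, per the commit message, still re-raises.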
0 issues found across 2 files (changes from recent commits).
Shadow auto-approve: would require human review. New feature adding PPTX slide splitting extends core request handling logic with non-trivial file processing. Risk of breakage in upload pipeline warrants human review.
Summary
Extends the SDK's client-side PDF page-splitting to PPTX decks. When split_pdf_page=true, strategy=hi_res, and the upload is a .pptx, the deck is split into slide-chunks that are partitioned in parallel and recombined, parallelizing the expensive per-slide hi_res work and avoiding large-payload timeouts. Each slide maps to one page (the server already partitions PPTX one page per slide).

PDF behavior is unchanged. Reuses the existing split_pdf_page flag, so there's no new API field.

How it works
- pptx_utils.py: stdlib-only (zipfile + xml/regex), no new dependency. Provides is_pptx, get_pptx_slide_count, and in-memory / cache-to-disk chunk builders. Each chunk is a minimal valid .pptx that keeps every shared part (masters, layouts, themes, notes master, media) and only the chunk's slides plus the notes slides they reference; presentation.xml, its .rels, and [Content_Types].xml are rewritten so nothing dangles.
- request_utils.py: create_pdf_chunk_request is now a thin wrapper over a generalized create_document_chunk_request(..., content_type); PPTX chunks go out with the pptx content type and the original .pptx filename.
- split_pdf_hook.py: one new branch in _before_request_unlocked, gated on strategy=hi_res + real .pptx. All downstream machinery (sizing defaults, concurrency, caching, retries, after_success recombination, cleanup) is shared with the PDF path. Legacy .ppt (OLE, not a zip) and non-hi_res requests fall through unsplit.

Verification
- 233 unit tests pass (19 new, in test_pptx_split.py); clean mypy. PDF path unchanged.
- python-pptx (the server's own parser) opens the synthesized 1000-slide deck and every chunk with exact slide/notes counts; notes slides are preserved.
- parent_id links whose parent Title lives on a slide in a different chunk, an inherent property of independent-chunk partitioning, identical to the existing PDF page-split behavior.

🤖 Generated with Claude Code
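To make the chunking arithmetic concrete, the sketch below computes contiguous slide ranges for a deck. It is an assumption-laden illustration, not the SDK's sizing logic: slide_chunk_ranges is a hypothetical helper, and the chunk size of 20 is only inferred from the reported 1000 slides going out as 50 chunks.

```python
from typing import List, Tuple


def slide_chunk_ranges(slide_count: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based (first_slide, last_slide) ranges covering the deck."""
    if slide_count < 0 or chunk_size < 1:
        raise ValueError("slide_count must be >= 0 and chunk_size must be >= 1")
    return [
        (start, min(start + chunk_size - 1, slide_count))
        for start in range(1, slide_count + 1, chunk_size)
    ]


if __name__ == "__main__":
    # Assumed chunk size of 20 slides; the real default may differ.
    ranges = slide_chunk_ranges(1000, 20)
    print(len(ranges), ranges[0], ranges[-1])
```

Because each slide maps to one page, the first slide number of a range is also the page number at which that chunk's elements would begin after recombination, which is what keeps the merged output contiguous.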