feat(pptx): client-side slide splitting for hi_res PPTX by badGarnet · Pull Request #345 · Unstructured-IO/unstructured-python-client

badGarnet · 2026-06-23T19:33:14Z

Summary

Extends the SDK's client-side PDF page-splitting to PPTX decks. When split_pdf_page=true, strategy=hi_res, and the upload is a .pptx, the deck is split into slide-chunks that are partitioned in parallel and recombined — parallelizing the expensive per-slide hi_res work and avoiding large-payload timeouts. Each slide maps to one page (the server already partitions PPTX one page per slide).

PDF behavior is unchanged. Reuses the existing split_pdf_page flag, so there's no new API field.
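The three conditions in the summary can be read as one predicate. The sketch below is illustrative only: `should_split_pptx` and its parameters are hypothetical names, not the hook's actual code, and the zip check stands in for whatever detection the SDK really performs.

```python
import io
import zipfile


def should_split_pptx(form_data: dict, file_name: str, content: bytes) -> bool:
    """Hypothetical predicate mirroring the three conditions in the summary."""
    # The flag may arrive as a bool or as a form-field string; normalize both.
    split_requested = str(form_data.get("split_pdf_page", "false")).lower() == "true"
    is_hi_res = form_data.get("strategy") == "hi_res"
    # A real .pptx is a zip package; a legacy .ppt (OLE) file is not.
    is_zip_pptx = file_name.lower().endswith(".pptx") and zipfile.is_zipfile(
        io.BytesIO(content)
    )
    return split_requested and is_hi_res and is_zip_pptx
```

Any request that fails one of the three checks would go out unsplit, which matches the fall-through behavior described for legacy .ppt and non-hi_res requests.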

How it works

- pptx_utils.py (new): stdlib-only (zipfile + xml/regex), no new dependency. Provides is_pptx, get_pptx_slide_count, and in-memory / cache-to-disk chunk builders. Each chunk is a minimal valid .pptx that keeps every shared part (masters, layouts, themes, notes master, media) and only the chunk's slides plus the notes slides they reference; presentation.xml, its .rels, and [Content_Types].xml are rewritten so nothing dangles.
- request_utils.py: create_pdf_chunk_request is now a thin wrapper over a generalized create_document_chunk_request(..., content_type); PPTX chunks go out with the pptx content type and the original .pptx filename.
- split_pdf_hook.py: one new branch in _before_request_unlocked, gated on strategy=hi_res + real .pptx. All downstream machinery (sizing defaults, concurrency, caching, retries, after_success recombination, cleanup) is shared with the PDF path. Legacy .ppt (OLE, not a zip) and non-hi_res requests fall through unsplit.
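As a rough illustration of what stdlib-only detection and slide counting can look like, here is a minimal sketch. The function names `looks_like_pptx`, `count_slides`, and `build_toy_package` are hypothetical, and counting `ppt/slides/slideN.xml` parts is an assumption; the PR's is_pptx and get_pptx_slide_count may instead read the slide list in presentation.xml.

```python
import io
import re
import zipfile

# Slide parts live at ppt/slides/slideN.xml in an OOXML presentation package.
SLIDE_PART = re.compile(r"^ppt/slides/slide\d+\.xml$")


def looks_like_pptx(content: bytes) -> bool:
    """True if the bytes are a zip archive containing ppt/presentation.xml."""
    stream = io.BytesIO(content)
    if not zipfile.is_zipfile(stream):
        return False  # e.g. a legacy .ppt, which is an OLE file
    with zipfile.ZipFile(stream) as package:
        return "ppt/presentation.xml" in package.namelist()


def count_slides(content: bytes) -> int:
    """Count slide parts by name (an assumption, see the lead-in)."""
    with zipfile.ZipFile(io.BytesIO(content)) as package:
        return sum(1 for name in package.namelist() if SLIDE_PART.match(name))


def build_toy_package(slide_count: int) -> bytes:
    """Build a zip that only mimics a .pptx layout; PowerPoint cannot open it."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("ppt/presentation.xml", "<presentation/>")
        for index in range(1, slide_count + 1):
            package.writestr(f"ppt/slides/slide{index}.xml", "<sld/>")
        # Notes slides sit in a separate folder and are not counted as slides.
        package.writestr("ppt/notesSlides/notesSlide1.xml", "<notes/>")
    return buffer.getvalue()


toy = build_toy_package(3)
print(looks_like_pptx(toy), count_slides(toy))
```

The toy package exists only to exercise the two helpers; a real chunk builder also has to rewrite presentation.xml, its .rels, and [Content_Types].xml, which this sketch does not attempt.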

Verification

- 233 unit tests pass (19 new in test_pptx_split.py); clean mypy. PDF path unchanged.
- Structural: python-pptx (the server's own parser) opens the synthesized 1000-slide deck and every chunk with exact slide/notes counts; notes slides are preserved.
- Live (stage): 1000-slide deck → 50 parallel chunks → contiguous element pages 1..1000, 0 chunk failures.
- Unchunked vs chunked A/B (60-slide deck, hi_res): ~1.5× faster (6.6s vs 10.0s) with 717/720 elements byte-identical. The only deltas are 3 parent_id links whose parent Title lives on a slide in a different chunk, an inherent property of independent-chunk partitioning, identical to the existing PDF page-split behavior.
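The live result (1000 slides, 50 chunks, contiguous pages 1..1000) implies slide ranges that tile the deck with no gaps or overlaps. The sketch below shows that arithmetic under one assumption: the source does not state the chunk size, so 20 slides per chunk is inferred from an even 1000 / 50 split. `chunk_ranges` is a hypothetical helper, not a function from the PR.

```python
from typing import List, Tuple


def chunk_ranges(total_slides: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return 1-indexed, inclusive (first, last) slide ranges covering the deck."""
    return [
        (first, min(first + chunk_size - 1, total_slides))
        for first in range(1, total_slides + 1, chunk_size)
    ]


ranges = chunk_ranges(1000, 20)  # chunk size of 20 is an assumption
# Flatten the ranges back into slide numbers to confirm they are contiguous.
covered = [slide for first, last in ranges for slide in range(first, last + 1)]
print(len(ranges), ranges[0], ranges[-1], covered == list(range(1, 1001)))
```

A deck whose slide count is not a multiple of the chunk size simply ends with a shorter final range.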

🤖 Generated with Claude Code

Mirror the existing client-side PDF page-splitting for PPTX decks. When split_pdf_page=true and strategy=hi_res and the upload is a .pptx, the deck is split into slide-chunks that are partitioned in parallel and recombined, parallelizing the expensive per-slide hi_res work and avoiding large-payload timeouts.

- New stdlib-only pptx_utils (zipfile + xml), no new dependency: is_pptx, get_pptx_slide_count, and in-memory/cache-to-disk chunk builders. Each chunk is a minimal valid .pptx keeping shared parts (masters/layouts/themes/notes master/media) and only the chunk's slides plus the notes slides they reference; presentation.xml, its rels, and [Content_Types].xml are rewritten so nothing dangles.
- request_utils: generalize create_pdf_chunk_request into create_document_chunk_request(content_type); PDF path is a thin wrapper.
- split_pdf_hook: one new branch reusing all shared machinery (sizing, concurrency, caching, retries, recombination). PDF behavior unchanged; legacy .ppt and non-hi_res requests fall through unsplit.
- Reuses the existing split_pdf_page flag (no new API field).

Verified: 233 unit tests pass (19 new), clean mypy. python-pptx opens the synthesized 1000-slide deck and every chunk with exact slide/notes counts. Live (stage): 1000 slides -> 50 parallel chunks -> contiguous pages 1..1000.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cubic-dev-ai

No issues found across 4 files

Shadow auto-approve: would require human review. Adds client-side PPTX slide-splitting for hi_res strategy, modifying core request processing and file chunking logic. Complex XML/zip manipulation with potential impact on data integrity and server behavior requires human review.

PPTX splitting is now best-effort. If reading the slide count or building the slide-chunks raises for any reason (malformed package, unexpected OOXML structure), the hook logs a warning, cleans up any partial operation state, and returns the original request so the whole deck is sent unsplit instead of failing the partition call. PDF behavior is unchanged (still re-raises).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
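The best-effort behavior described in this commit follows a common try/fallback shape. The sketch below is an assumption about that shape, not the hook's code: `split_or_passthrough`, its callbacks, and the returned tuple are all hypothetical.

```python
import logging
from typing import Any, Callable, List, Tuple

logger = logging.getLogger("pptx_split_sketch")


def split_or_passthrough(
    original_request: Any,
    build_chunks: Callable[[], List[bytes]],
    cleanup: Callable[[], None],
) -> Tuple[str, Any]:
    """Return ("split", chunks) on success, ("unsplit", original_request) on failure."""
    try:
        chunks = build_chunks()
    except Exception as exc:
        # Best-effort: any failure while reading or chunking the deck
        # degrades to sending the whole file instead of failing the call.
        logger.warning("PPTX splitting failed, sending deck unsplit: %s", exc)
        cleanup()  # drop partial operation state before falling back
        return "unsplit", original_request
    return "split", chunks
```

Catching the broad `Exception` is deliberate in this sketch, since the commit says the fallback applies when splitting raises "for any reason"; the PDF path, by contrast, is described as still re-raising.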

cubic-dev-ai

0 issues found across 2 files (changes from recent commits).

Shadow auto-approve: would require human review. New feature adding PPTX slide splitting extends core request handling logic with non-trivial file processing. Risk of breakage in upload pipeline warrants human review.

cubic-dev-ai Bot reviewed Jun 23, 2026

badGarnet wants to merge 2 commits into main from feat/split-pptx-hi-res
