UN-3372 [FEAT] Surface LLMWhisperer signature highlights in Prompt Studio #1967
pk-zipstack wants to merge 14 commits into
Conversation
…nature metadata in LLM context
- Add document_insights as a new processing mode in the LLMWhisperer V2 adapter (Modes enum + JSON schema dropdown)
- Extract signature_metadata from the LLMWhisperer response when using document_insights mode and surface it in TextExtractionMetadata
- Thread signature_metadata through the workers pipeline (extract → answer_params → construct_prompt)
- Format signature metadata as a human-readable context block injected into the LLM prompt's Context section
- Update the prompt-service extraction endpoint to return signature_metadata
- Mirror construct_prompt changes in prompt-service for parity
The workers execution path (API deployments, workflow runs) is fully functional. The prompt-service path has endpoints ready, but structure tool threading is a follow-up.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When document_insights mode detects signatures, compute page references by finding the first line_metadata entry for each page with signatures and converting to a 1-indexed hex value. This enables the frontend to navigate/jump to pages containing signatures without highlighting. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
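To make that computation concrete, here is a minimal sketch. It is illustrative only: the helper name and the exact `line_metadata` row layout (`[page, y, height, page_height]`) are assumptions taken from this PR's description, not the adapter's actual code.

```python
def build_signature_page_references(signature_metadata: dict, line_metadata: list) -> dict:
    """For each page that has signatures, reference the first line_metadata
    entry on that page and expose its 1-indexed position as a hex string."""
    references = {}
    for page_key, signatures in signature_metadata.items():
        page = int(page_key)
        for index, row in enumerate(line_metadata):
            if row[0] != page:  # row assumed to be [page, y, height, page_height]
                continue
            references[page] = {
                "hex": format(index + 1, "x"),   # 1-indexed hex value for the viewer
                "line_metadata_index": index,
                "signers": [s.get("name") for s in signatures if isinstance(s, dict)],
            }
            break
    return references
```

A later commit in this PR refines this to skip empty marker rows (height <= 0 or page_height <= 0) so the referenced line is a real content line.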
Add loggers at key points to trace signature metadata through the pipeline: adapter extraction, workers pipeline (_handle_extract, tool_settings injection), prompt construction, and prompt-service extraction endpoint. All loggers use the DOC_INSIGHTS prefix for easy grep-filtering during UI testing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
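As a tiny illustration of the convention (the function name here is hypothetical), every log line in this path carries the same prefix so a single `grep DOC_INSIGHTS` over the worker logs shows the whole trace:

```python
import logging

logger = logging.getLogger(__name__)

def log_signature_capture(signature_metadata: dict) -> None:
    # The shared "DOC_INSIGHTS" prefix keeps adapter, worker and prompt-service
    # log lines for this feature grep-able with one pattern.
    logger.info("DOC_INSIGHTS: captured signatures on pages %s", sorted(signature_metadata))
```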
user_id may be empty for mock auth users (default OSS setup). It's only used as a Redis cache key fragment, so empty values are acceptable. Dropping user_id from the required-fields validation unblocks indexing for these users. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
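One hedged sketch of how such a cache key can stay well-formed when user_id is empty (names and the prefix are placeholders, not the backend's actual constants):

```python
CACHE_PREFIX = "prompt_studio_index:"  # placeholder; the real prefix lives in backend constants

def build_index_cache_key(org_id: str, doc_id_key: str, user_id: str = "") -> str:
    """Substitute a stable placeholder so an empty user_id (mock-auth OSS setup)
    does not produce a double-colon key or collide with other users."""
    return f"{CACHE_PREFIX}{org_id}:{user_id or 'anon_user'}:{doc_id_key}"
```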
…t_insights mode
Wire the signature data captured by the LLMWhisperer V2 adapter through to Prompt Studio's existing highlight pipeline, so clicking a signer-related answer jumps the PDF viewer to the page containing the signature without any frontend changes.
- Adapter: signature_page_references now also carries resolved coords [page, y, height, page_height] alongside the existing hex / line index.
- Workers: _handle_extract writes a <extract>.doc_insights.json sidecar so Prompt Studio cache hits don't lose signature data; the pipeline path threads signature_page_references into tool_settings alongside signature_metadata.
- AnswerPromptService._attach_signature_highlights (mirrored in workers and prompt-service) scans the LLM answer for signer names (case-insensitive substring, sketched below) and appends the matching page coords to metadata[HIGHLIGHT_DATA][prompt_key]. Falls back to all signature pages when the answer mentions signing generically. De-dupes against hex-comment highlights.
- Prompt Studio backend: dynamic_extractor now returns ExtractResult (text + signature_metadata + signature_page_references), reading the sidecar on cache hits. All five answer_prompt dispatch sites inject the signature data into tool_settings.
- Prompt-service: extraction service + controller surface signature_page_references for parity.
- Tests: 7 new unit tests for _attach_signature_highlights covering name match, multi-page coords, keyword fallback, no-op cases, and preservation/dedup of existing highlight entries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
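A minimal sketch of the merge step described above, assuming simplified shapes (page_refs maps page to {"signers": [...], "coords": [page, y, height, page_height]} and highlights live under metadata["highlight_data"][prompt_key]); the real service uses the project's constants rather than these literal keys:

```python
SIGNATURE_KEYWORDS = ("signature", "signed", "signatory", "signing", "executed")

def attach_signature_highlights(answer: str, prompt_key: str, metadata: dict, page_refs: dict) -> None:
    answer_lower = answer.lower()
    matched = [
        ref["coords"]
        for ref in page_refs.values()
        if any(name and name.lower() in answer_lower for name in ref.get("signers", []))
    ]
    if not matched and any(kw in answer_lower for kw in SIGNATURE_KEYWORDS):
        # Generic "is this signed?"-style answers fall back to every signature page.
        matched = [ref["coords"] for ref in page_refs.values()]
    existing = metadata.setdefault("highlight_data", {}).setdefault(prompt_key, [])
    # De-dupe against highlights already added by hex-comment processing.
    existing.extend(coords for coords in matched if coords not in existing)
```

The name matching shown here is the plain case-insensitive substring check from this commit; the next commit replaces it with a word-boundary regex.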
…ignature highlights
Two bugs surfaced when testing in Prompt Studio with a multi-page signed PDF where the LLM answer was a single signer name.
1) Adapter was selecting unusable line_metadata entries. _build_signature_page_references picked the first line_metadata entry per page, but the first entry is often an empty marker row like [0, 0, 0, 3168] or [1, 0, 0, 0]. Zero height makes the overlay invisible; zero page_height causes a divide-by-zero in the frontend's percentage calc. Now skip entries with height <= 0 or page_height <= 0 and pick the first true content line.
2) Post-processor matched signer initials inside other names. "P S" (a signer on page 0) was matching across word boundaries inside "Pradeep Surukanti" — the case-insensitive substring "p s" appears in "Pradee[p S]urukanti". Both pages got highlights, and the viewer jumped to the first (wrong) page. Switched to a regex with \b anchors so the signer name has to appear as a whole token or phrase (sketched below). Added a regression test (test_short_initials_do_not_falsely_match_across_words) that locks in the fix for this exact scenario.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
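The fix is small; a self-contained sketch of the new matching (the function name is illustrative):

```python
import re

def answer_mentions_signer(answer: str, signer_name: str) -> bool:
    # \b anchors require the signer name to appear as a whole token or phrase,
    # so "P S" no longer matches inside "Pradeep Surukanti".
    pattern = r"\b" + re.escape(signer_name) + r"\b"
    return re.search(pattern, answer, flags=re.IGNORECASE) is not None

assert answer_mentions_signer("Signed by P S on page 1", "P S") is True
assert answer_mentions_signer("Signed by Pradeep Surukanti", "P S") is False
```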
Two clickability gates in the frontend prevented signature-driven page jumps from firing when the tool's enable_highlight setting was off, even though the backend was correctly producing highlight_data from document_insights signature extraction.
1) TextResult only rendered the clickable Typography.Text variant when enableHighlight was true, so the answer stayed a plain <div>.
2) handleSelectHighlight in PromptCard.jsx returned silently when enable_highlight was false, so even if the click fired, selectedHighlight state never updated and PdfViewer never received a non-empty highlightData prop (jumpToPage was never called).
Both gates now also pass through when highlight_data is present, so the signature feature works on tools that have document_insights mode on LLMWhisperer V2 even without flipping the separate enable_highlight toggle. Existing flows (with enable_highlight=false and no highlight data) are unchanged because their highlight_data is empty.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough
Capture signature metadata and per-page reference coords from DOCUMENT_INSIGHTS extraction, persist them as sidecars, propagate them through extraction services and executors, inject signature context into prompts, resolve and merge signature-driven highlight coordinates into answer metadata, and surface selectable highlights in the frontend.
Signature extraction, propagation, and highlight attachment
sequenceDiagram
participant FE as Frontend
participant PS as PromptStudio (Backend)
participant PromptSvc as Prompt-Service
participant Exec as Worker/Executor
participant SDK as X2Text Adapter
participant FS as FileStore
FE->>PS: request prompt / select result
PS->>PS: dynamic_extractor() -> ExtractResult(text + signature fields)
alt cache miss
PS->>Exec: enqueue/run extraction
Exec->>SDK: adapter(mode=DOCUMENT_INSIGHTS)
SDK-->>Exec: TextExtractionMetadata (whisper_hash, line_metadata, signature_metadata, signature_page_references)
Exec->>FS: write output file and .doc_insights.json
Exec-->>PS: extraction result with signature fields
else cache hit
PS->>FS: load .doc_insights.json
FS-->>PS: signature fields
end
PS->>PromptSvc: construct_and_run_prompt(tool_settings with signature fields)
PromptSvc->>PromptSvc: construct_prompt(inject signature context)
PromptSvc->>LLM: run completion
LLM-->>PromptSvc: answer
PromptSvc->>SDK: resolve_signature_highlight_coords(answer, signature_metadata, signature_page_references)
SDK-->>PromptSvc: coords
PromptSvc->>PromptSvc: merge coords into metadata.highlight_data[prompt_key]
PromptSvc-->>PS: response payload (includes highlight data)
PS-->>FE: response (clickable/selectable highlights enabled)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/prompt_studio/prompt_studio_core_v2/internal_views.py (1)
227-236: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win — Update docstring to reflect optional user_id.
The docstring still lists user_id: str without indicating it's optional, but the validation on line 249 no longer requires it. Consider updating the docstring to mark user_id as optional to maintain consistency between documentation and implementation.
📝 Suggested docstring update

 """Update document indexing cache status (mark indexed or remove).

 Expected JSON payload:
 {
     "action": "mark_indexed" | "remove",
     "org_id": str,
-    "user_id": str,
+    "user_id": str (optional, may be empty),
     "doc_id_key": str,
     "doc_id": str (required when action == "mark_indexed")
 }
 """
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/prompt_studio/prompt_studio_core_v2/internal_views.py` around lines 227 - 236, The docstring for the update-document-indexing-cache endpoint still lists "user_id: str" as required even though the handler's validation no longer requires it; update the docstring in prompt_studio_core_v2/internal_views.py to mark "user_id" as optional (e.g., "user_id: Optional[str]" or "user_id: str (optional)"), and ensure the payload description still documents that "doc_id" is required only when action == "mark_indexed" so the docstring matches the current validation logic (referencing the "user_id" and "doc_id" fields in the JSON payload).
🧹 Nitpick comments (1)
unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py (1)
177-227: ⚡ Quick win — Consider validating response structure before accessing nested keys.
The code directly accesses response["metadata"] and nested page data without validating the response structure. If LLMWhisperer returns an unexpected response format, this could raise KeyError or TypeError.
🛡️ Proposed defensive fix

 if mode == Modes.DOCUMENT_INSIGHTS.value:
-    response_metadata = response.get("metadata", {})
+    if not isinstance(response, dict):
+        logger.warning("DOC_INSIGHTS: response is not a dict, skipping signature extraction")
+    else:
+        response_metadata = response.get("metadata", {})
+        if not isinstance(response_metadata, dict):
+            logger.warning("DOC_INSIGHTS: metadata is not a dict, skipping")
+            response_metadata = {}
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py` around lines 177 - 227, The code assumes response is a dict and directly walks nested keys when extracting signature_metadata and raw_line_metadata; add defensive validation in LLMWhispererV2 around response access: ensure response is a dict before .get, ensure response_metadata = response.get("metadata", {}) is a dict, check each page_data is a dict and that page_data["signature_metadata"] is a list before iterating/len()/list comprehension, and ensure raw_line_metadata = response.get("line_metadata", []) is a list before using len() or passing to _build_signature_page_references; if types are unexpected, treat them as empty (None/{} /[]) and short-circuit computing signature_page_references to avoid KeyError/TypeError.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/prompt_studio/prompt_studio_core_v2/internal_views.py`:
- Around line 247-249: The cache key logic in DocumentIndexingService currently
builds keys as f"{CACHE_PREFIX}{org_id}:{user_id}:{doc_id_key}" which yields
double colons when user_id is empty; either enforce non-empty user_id by
restoring validation (reject/raise/log when user_id is falsy in the caller or in
_cache_key) or make _cache_key resilient by constructing the key from only
non-empty parts (e.g., join on ":" for [org_id, user_id, doc_id_key] but skip
user_id when empty or substitute a stable placeholder like "anon_user"); update
the _cache_key implementation (and related callers) to use CACHE_PREFIX and the
chosen strategy so keys are consistent and collision-free.
In `@backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py`:
- Around line 2359-2394: The helper _load_signature_sidecar currently swallows
read/parse errors and returns (None, None) which hides corrupt sidecars from
dynamic_extractor; instead, introduce a dedicated SidecarCorruptError (or
similar) in this module and raise it whenever fs_instance.read or json.loads
fails (the except blocks that now return (None, None)), leaving
FileNotFoundError to continue returning (None, None) for legitimately missing
sidecars; update callers (e.g., dynamic_extractor) to catch SidecarCorruptError
and trigger a live extract/fallback path.
In `@prompt-service/src/unstract/prompt_service/services/answer_prompt.py`:
- Around line 159-167: AnswerPromptService currently calls
_attach_signature_highlights before the JSON/webhook postprocessing mutates
metadata[PSKeys.HIGHLIGHT_DATA][prompt_key], so signature coords are lost; to
fix, either move the call to AnswerPromptService._attach_signature_highlights
until after handle_json/postprocessing completes or change handle_json to merge
updated_highlight_data into metadata[PSKeys.HIGHLIGHT_DATA][prompt_key] (rather
than replace) so signature pages added from tool_settings
(PSKeys.SIGNATURE_METADATA / PSKeys.SIGNATURE_PAGE_REFERENCES) are preserved;
update references to prompt_key (output[PSKeys.NAME]), metadata, and
updated_highlight_data accordingly to ensure the merge appends/merges
coordinates instead of overwriting.
- Around line 281-289: The loop that formats `signatures` in answer_prompt.py
assumes each `sig` is a dict and calls `sig.get(...)`, which will crash on
non-dict entries; update the loop to skip any malformed entries by adding the
same guard used in `_attach_signature_highlights()` (e.g., `if not
isinstance(sig, dict): continue`) before accessing `sig.get(...)`, so only dict
items are processed and others are ignored silently.
---
Outside diff comments:
In `@backend/prompt_studio/prompt_studio_core_v2/internal_views.py`:
- Around line 227-236: The docstring for the update-document-indexing-cache
endpoint still lists "user_id: str" as required even though the handler's
validation no longer requires it; update the docstring in
prompt_studio_core_v2/internal_views.py to mark "user_id" as optional (e.g.,
"user_id: Optional[str]" or "user_id: str (optional)"), and ensure the payload
description still documents that "doc_id" is required only when action ==
"mark_indexed" so the docstring matches the current validation logic
(referencing the "user_id" and "doc_id" fields in the JSON payload).
---
Nitpick comments:
In
`@unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py`:
- Around line 177-227: The code assumes response is a dict and directly walks
nested keys when extracting signature_metadata and raw_line_metadata; add
defensive validation in LLMWhispererV2 around response access: ensure response
is a dict before .get, ensure response_metadata = response.get("metadata", {})
is a dict, check each page_data is a dict and that
page_data["signature_metadata"] is a list before iterating/len()/list
comprehension, and ensure raw_line_metadata = response.get("line_metadata", [])
is a list before using len() or passing to _build_signature_page_references; if
types are unexpected, treat them as empty (None/{} /[]) and short-circuit
computing signature_page_references to avoid KeyError/TypeError.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 18bd50bb-ca4a-43cc-9cc4-803fad75e44d
📒 Files selected for processing (17)
- backend/prompt_studio/prompt_studio_core_v2/constants.py
- backend/prompt_studio/prompt_studio_core_v2/internal_views.py
- backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py
- frontend/src/components/custom-tools/prompt-card/DisplayPromptResult.jsx
- frontend/src/components/custom-tools/prompt-card/PromptCard.jsx
- prompt-service/src/unstract/prompt_service/constants.py
- prompt-service/src/unstract/prompt_service/controllers/extraction.py
- prompt-service/src/unstract/prompt_service/services/answer_prompt.py
- prompt-service/src/unstract/prompt_service/services/extraction.py
- unstract/sdk1/src/unstract/sdk1/adapters/x2text/dto.py
- unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/constants.py
- unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py
- unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/static/json_schema.json
- workers/executor/executors/answer_prompt.py
- workers/executor/executors/constants.py
- workers/executor/executors/legacy_executor.py
- workers/tests/test_answer_prompt.py
| Filename | Overview |
|---|---|
| unstract/sdk1/src/unstract/sdk1/utils/signature_highlights.py | New shared helper module for signature matching and highlight merging; word-boundary regex correctly prevents the "P S" false-positive; unguarded int() conversions in format_signature_metadata_context were flagged in a previous thread |
| unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py | Adds document_insights mode: signature_metadata extraction from response, page-reference building via line_metadata; unguarded int(page_str) in _build_signature_page_references already flagged; logic is otherwise sound and gated on mode check |
| workers/executor/executors/legacy_executor.py | _capture_signature_data + _write_signature_sidecar cleanly separate signature persistence; signature data injection into answer_params[tool_settings] is safe because TOOL_SETTINGS is always present at that pipeline stage |
| workers/executor/executors/answer_prompt.py | _attach_signature_highlights and construct_prompt signature context injection look correct; unguarded int(x[0]) sort key already flagged; url_safety extraction to shared module is a clean refactor |
| backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py | ExtractResult NamedTuple cleanly replaces bare str return; all 6 callsites updated; _load_signature_sidecar is defensively coded with proper FileNotFoundError separation; _inject_signature_data_into_tool_settings is a no-op for non-document_insights modes |
| prompt-service/src/unstract/prompt_service/services/answer_prompt.py | Mirror of workers post-processor; _is_safe_public_url correctly moved to shared url_safety module; same unguarded int sort key already flagged in previous thread |
| frontend/src/components/custom-tools/prompt-card/PromptCard.jsx | hasHighlightData definition diverges from DisplayPromptResult.jsx (Boolean() vs Array.isArray+length) — already flagged in previous thread; the restructured handleSelectHighlight logic is otherwise correct |
| frontend/src/components/custom-tools/prompt-card/DisplayPromptResult.jsx | hasHighlightData correctly uses Array.isArray + length check; isClickable gate is logically sound and preserves non-clickable behaviour when both conditions are false |
| unstract/sdk1/src/unstract/sdk1/utils/url_safety.py | Clean extraction of SSRF-protection helper into a shared module; logic is identical to the previously-inline functions in both workers and prompt-service answer_prompt files |
| workers/tests/test_answer_prompt.py | 8 new tests cover name match, case-insensitivity, multi-page, keyword fallback, no-op, dedup, and the "P S" regression; good coverage of the post-processor |
| prompt-service/src/unstract/prompt_service/services/extraction.py | Return type changed from str to dict; signature metadata conditionally included; only one caller (extraction controller) which was updated accordingly |
| backend/prompt_studio/prompt_studio_core_v2/internal_views.py | Removes user_id from required-field validation with a clear explanatory comment; change is scoped and intentional for mock auth support |
Sequence Diagram
sequenceDiagram
participant LLMWv2 as LLMWhisperer V2 Adapter
participant Worker as Workers LegacyExecutor
participant FS as File Storage (sidecar)
participant PS as Prompt Studio Helper
participant AP as AnswerPromptService
participant FE as Frontend (PromptCard / PdfViewer)
Note over LLMWv2: document_insights mode
LLMWv2->>LLMWv2: Extract text + collect signature_metadata from response["metadata"]
LLMWv2->>LLMWv2: _build_signature_page_references() → resolved coords per page
LLMWv2-->>Worker: "TextExtractionResult(text, metadata{signature_metadata, signature_page_references})"
Worker->>FS: _write_signature_sidecar() → file.doc_insights.json
Worker->>Worker: inject signature data into answer_params[tool_settings]
alt Cache miss (fresh extraction)
PS->>Worker: dynamic_extractor() dispatch
Worker-->>PS: ExtractResult(text, signature_metadata, signature_page_references)
else Cache hit (cached .txt exists)
PS->>FS: _load_signature_sidecar() → read .doc_insights.json
FS-->>PS: (signature_metadata, signature_page_references)
end
PS->>PS: _inject_signature_data_into_tool_settings()
PS->>AP: dispatch answer_prompt with signature data in tool_settings
AP->>AP: construct_prompt() → append [Document Signature Information] block
AP->>AP: run_completion() → LLM answer
AP->>AP: _attach_signature_highlights() → match signers in answer (word-boundary regex)
AP-->>PS: "answer + metadata[highlight_data][prompt_key] = matched coords"
PS-->>FE: response with highlight_data
FE->>FE: "isClickable = enableHighlight OR hasHighlightData"
FE->>FE: PdfViewer.jumpToPage(matched signature page)
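The _inject_signature_data_into_tool_settings() step in the diagram above is essentially a guarded dictionary update. A hedged sketch (the literal keys stand in for the SIGNATURE_METADATA / SIGNATURE_PAGE_REFERENCES constants this PR adds):

```python
def inject_signature_data_into_tool_settings(
    tool_settings: dict,
    signature_metadata: dict | None,
    signature_page_references: dict | None,
) -> dict:
    # No-op for non-document_insights runs: nothing is injected when both are empty.
    if signature_metadata:
        tool_settings["signature_metadata"] = signature_metadata
    if signature_page_references:
        tool_settings["signature_page_references"] = signature_page_references
    return tool_settings
```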
Reviews (6): Last reviewed commit: "[MISC] Extract SSRF webhook-URL helper i..."
- Extract duplicated signature-highlight post-processor logic into a shared helper at ``unstract/sdk1/utils/signature_highlights.py`` and delegate from both workers and prompt-service. Cuts SonarCloud's duplication metric and brings the cognitive-complexity score below the gate.
- Split LLMWhispererV2._build_signature_page_references into _index_first_content_line_per_page + _build_page_reference_entry to cut its cognitive complexity below 15.
- Move the ExtractResult NamedTuple in prompt_studio_helper below the import block to silence ruff E402 (and drop a duplicate logger= line that was already present).
- Move the `logger = ...` line in prompt-service extraction.py below all imports to fix the same E402 issue.
- Apply ruff-format normalisation across the touched files.
No behaviour change. All 8 signature-highlight unit tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve conflict in workers/executor/executors/legacy_executor.py where main reorganized the pipeline (wrapped steps 2-5 inside the same try block, added _failure/_absorb helpers and execution_id / file_execution_id context fields). Preserved this PR's signature metadata injection step by relocating it inside the new try block, right after the extract result is absorbed and extracted_text is read. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py`:
- Around line 137-141: The parsing assumes metadata is a dict and
signature_metadata is a dict with int-castable keys; harden it by first checking
that metadata is a dict and signature_metadata =
metadata.get("signature_metadata") is a dict (skip if not), then iterate over
signature_metadata.items() only after verifying types; inside the loop ensure
signatures is a non-empty list and guard the int() cast for page_str with
try/except (skip entries where int(page_str) fails) before referencing
page_first_line and other structures; apply the same defensive checks in the
second block mentioned (lines ~205-219) where signature_metadata and page keys
are used.
In `@unstract/sdk1/src/unstract/sdk1/utils/signature_highlights.py`:
- Around line 127-128: The fallback uses substring checks which cause false
positives (e.g., "unsigned" matching "signed"); change the condition in the
block that sets matched_pages (the variables matched_pages, answer,
SIGNATURE_KEYWORDS, page_coords) to test whole-word matches instead of simple
substrings — for example use a regex word-boundary search (with re.escape on
each keyword) or tokenize/split answer.lower() and check membership, and ensure
the re import is added if you use regex; keep the rest of the logic (assigning
list(page_coords.keys())) unchanged.
In `@workers/executor/executors/answer_prompt.py`:
- Around line 257-267: The loop over signature_metadata assumes well-formed keys
and dict entries and can crash on malformed payloads; wrap the
iteration/processing in validation and safe conversions: when iterating
signature_metadata.items() (used in the for page_num, signatures ... block),
first ensure page_num can be safely converted to int (use a try/except and skip
if not convertible), ensure signatures is an iterable/list before iterating, and
guard access to sig fields by coercing values to str with defaults (e.g., name =
str(sig.get("name", "Unknown"))), skip any sig that is not a dict, and only
build entry after these checks so malformed metadata cannot raise in the code
paths referencing int(page_num) or sig.get(...).
In `@workers/executor/executors/legacy_executor.py`:
- Around line 332-340: The current early-return when output_file_path exists but
both signature_metadata and signature_page_references are empty leaves any
existing sidecar intact; change the flow in LegacyExecutor (use the same block
that computes LegacyExecutor._signature_sidecar_path(output_file_path)) so that
if there is no signature data you check for and delete/unlink the existing
sidecar file instead of returning immediately; only create/write the sidecar
payload when signature_metadata or signature_page_references are present (use
variables output_file_path, signature_metadata, signature_page_references, and
sidecar_path to locate and remove the stale file).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 923cb799-5cf3-472a-b2a4-b414b7b45001
📒 Files selected for processing (8)
- backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py
- prompt-service/src/unstract/prompt_service/services/answer_prompt.py
- prompt-service/src/unstract/prompt_service/services/extraction.py
- unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py
- unstract/sdk1/src/unstract/sdk1/utils/signature_highlights.py
- workers/executor/executors/answer_prompt.py
- workers/executor/executors/legacy_executor.py
- workers/tests/test_answer_prompt.py
🚧 Files skipped from review as they are similar to previous changes (3)
- prompt-service/src/unstract/prompt_service/services/answer_prompt.py
- prompt-service/src/unstract/prompt_service/services/extraction.py
- workers/tests/test_answer_prompt.py
- Extract _inject_signature_data_into_tool_settings helper in prompt_studio_helper.py; the 5 call sites now invoke it instead of inlining the if-blocks. Drops complexity of build_fetch_response_payload, build_bulk_fetch_response_payload, and dynamic_extractor below the gate (15).
- Extract _capture_signature_data on LegacyExecutor for the signature_metadata / signature_page_references capture + sidecar write. _handle_extract is back below 15.
- Split _find_pages_matching_signers in signature_highlights.py into a helper _any_signer_matches so the outer pages walk becomes a list comprehension. Complexity below 15.
- Fix two implicit-string-concatenation log lines that ruff-format collapsed onto one line (S5799).
No behaviour change. All 8 signature-highlight unit tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
♻️ Duplicate comments (2)
backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py (1)
2389-2425: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — Distinguish corrupt sidecar from missing sidecar to trigger fallback extraction.
Returning (None, None) for both a legitimately missing sidecar (e.g., a document extracted before document_insights mode was enabled) and a corrupt/unreadable sidecar (I/O error, malformed JSON) makes dynamic_extractor() treat a corrupted cache as a valid cache hit. The .txt file is reused, but signature navigation never recovers until someone manually clears the cache.
🛡️ Proposed fix to surface sidecar corruption
Define a custom exception near the top of the module:
class SidecarCorruptError(Exception):
    """Raised when a signature sidecar exists but is unreadable or malformed."""
    pass

Then update
_load_signature_sidecar:sidecar_path = PromptStudioHelper._signature_sidecar_path(extract_file_path) try: raw = fs_instance.read(path=sidecar_path, mode="r") except FileNotFoundError: return None, None except Exception as e: logger.warning( "DOC_INSIGHTS sidecar: failed to read %s: %s", sidecar_path, e, ) - return None, None + raise SidecarCorruptError( + f"Sidecar exists but cannot be read: {sidecar_path}" + ) from e try: data = json.loads(raw) except (TypeError, ValueError) as e: logger.warning( "DOC_INSIGHTS sidecar: failed to parse %s: %s", sidecar_path, e, ) - return None, None + raise SidecarCorruptError( + f"Sidecar JSON is malformed: {sidecar_path}" + ) from e sig_meta = data.get("signature_metadata") or None sig_refs = data.get("signature_page_references") or None return sig_meta, sig_refsAnd update the
dynamic_extractorcache-hit block around line 2465:try: extracted_text = fs_instance.read(path=extract_file_path, mode="r") logger.info("Extracted text found. Reading from file..") - sig_meta, sig_refs = PromptStudioHelper._load_signature_sidecar( - extract_file_path=extract_file_path, - fs_instance=fs_instance, - ) - return ExtractResult( - text=extracted_text, - signature_metadata=sig_meta, - signature_page_references=sig_refs, - ) + try: + sig_meta, sig_refs = PromptStudioHelper._load_signature_sidecar( + extract_file_path=extract_file_path, + fs_instance=fs_instance, + ) + return ExtractResult( + text=extracted_text, + signature_metadata=sig_meta, + signature_page_references=sig_refs, + ) + except SidecarCorruptError as e: + logger.warning( + "Sidecar corrupt for %s: %s. Re-extracting.", + extract_file_path, + e, + ) + # Fall through to live extraction below except FileNotFoundError as e:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py` around lines 2389 - 2425, The loader currently returns (None, None) for both missing and corrupt sidecars which causes dynamic_extractor to treat corrupt sidecars as cache hits; define a new SidecarCorruptError exception near the top of the module (e.g., class SidecarCorruptError(Exception): pass), modify _load_signature_sidecar to raise SidecarCorruptError when the sidecar exists but reading/parsing fails (I/O errors other than FileNotFoundError or JSON parse errors), and keep returning (None, None) only for FileNotFoundError; then update dynamic_extractor's cache-hit branch to catch SidecarCorruptError and treat it as a cache miss (trigger fallback extraction) while allowing real missing sidecars to continue returning None, None.workers/executor/executors/legacy_executor.py (1)
389-392: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift — Stale sidecar risk: delete leftover .doc_insights.json when signature data is absent.
The early-return at lines 391-392 skips sidecar cleanup when the current extraction produced no signature data. If a previous extraction wrote a sidecar (because the same document was once processed in document_insights mode), that stale file remains on disk. On the next cache hit, _load_signature_sidecar in prompt_studio_helper.py will read the outdated sidecar and inject stale signature coordinates into the answer pipeline.
🛡️ Proposed fix to remove stale sidecar
 if not output_file_path:
     return
+sidecar_path = LegacyExecutor._signature_sidecar_path(output_file_path)
 if not signature_metadata and not signature_page_references:
+    # No signature data in this extraction → remove any stale sidecar
+    try:
+        if fs.exists(sidecar_path):
+            fs.remove(sidecar_path)
+            logger.info(
+                "DOC_INSIGHTS sidecar: removed stale sidecar at %s",
+                sidecar_path,
+            )
+    except Exception as e:
+        logger.warning(
+            "DOC_INSIGHTS sidecar: failed to remove stale sidecar %s: %s",
+            sidecar_path,
+            e,
+        )
     return
-sidecar_path = LegacyExecutor._signature_sidecar_path(output_file_path)
 payload = {
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workers/executor/executors/legacy_executor.py` around lines 389 - 392, The early return in legacy_executor.py skips sidecar cleanup and can leave a stale ".doc_insights.json" beside output_file_path when signature_metadata and signature_page_references are empty; update the block handling "if not output_file_path" / "if not signature_metadata and not signature_page_references" so that when output_file_path exists but there is no signature data you explicitly remove the corresponding sidecar file (the "<output_file_path>.doc_insights.json" used by _load_signature_sidecar in prompt_studio_helper.py) before returning, ensuring stale signature sidecars are deleted.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py`:
- Around line 2389-2425: The loader currently returns (None, None) for both
missing and corrupt sidecars which causes dynamic_extractor to treat corrupt
sidecars as cache hits; define a new SidecarCorruptError exception near the top
of the module (e.g., class SidecarCorruptError(Exception): pass), modify
_load_signature_sidecar to raise SidecarCorruptError when the sidecar exists but
reading/parsing fails (I/O errors other than FileNotFoundError or JSON parse
errors), and keep returning (None, None) only for FileNotFoundError; then update
dynamic_extractor's cache-hit branch to catch SidecarCorruptError and treat it
as a cache miss (trigger fallback extraction) while allowing real missing
sidecars to continue returning None, None.
In `@workers/executor/executors/legacy_executor.py`:
- Around line 389-392: The early return in legacy_executor.py skips sidecar
cleanup and can leave a stale ".doc_insights.json" beside output_file_path when
signature_metadata and signature_page_references are empty; update the block
handling "if not output_file_path" / "if not signature_metadata and not
signature_page_references" so that when output_file_path exists but there is no
signature data you explicitly remove the corresponding sidecar file (the
"<output_file_path>.doc_insights.json" used by _load_signature_sidecar in
prompt_studio_helper.py) before returning, ensuring stale signature sidecars are
deleted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b3914110-6c3b-41fd-a890-37cc79ab9371
📒 Files selected for processing (4)
- backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py
- prompt-service/src/unstract/prompt_service/services/extraction.py
- unstract/sdk1/src/unstract/sdk1/utils/signature_highlights.py
- workers/executor/executors/legacy_executor.py
🚧 Files skipped from review as they are similar to previous changes (2)
- unstract/sdk1/src/unstract/sdk1/utils/signature_highlights.py
- prompt-service/src/unstract/prompt_service/services/extraction.py
…g concat
- Use logger.exception() in three except blocks in prompt_studio_helper.py (S8572) so the traceback is always captured.
- Extract _log_signature_capture from dynamic_extractor to bring its cognitive complexity below 15 (S3776).
- Merge a string pair that ruff-format collapsed onto one line in legacy_executor.py (S5799).
No behaviour change. All 8 signature-highlight unit tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🧹 Nitpick comments (3)
workers/executor/executors/legacy_executor.py (1)
367-375: 💤 Low value — Share the sidecar path helper across services.
prompt_studio_helper.py defines an identical _signature_sidecar_path (and the .doc_insights.json suffix is also referenced in docstrings elsewhere). A single shared constant/helper (e.g. on the SDK or a small signature_sidecar util) would prevent the two implementations from drifting — e.g. one switching to .with_suffix(".doc_insights.json") while the other keeps with_suffix("") + ".doc_insights.json" (which silently behaves differently for filenames with multiple dots).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workers/executor/executors/legacy_executor.py` around lines 367 - 375, Extract the duplicate sidecar-path logic into a shared helper/constant and update this implementation to call that shared util instead of duplicating logic: replace the static method _signature_sidecar_path in legacy_executor.py with a call to the centralized function or constant (the same one used by prompt_studio_helper.py), ensuring the canonical behavior uses Path.with_suffix(".doc_insights.json") (or a single source-of-truth suffix constant) so filenames with multiple dots are handled consistently; update imports and references to use the new shared helper.
backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py (2)
1654-1657: 💤 Low value — Drop the trailing : {e} when using logger.exception.
logger.exception(...) already appends the active exception with traceback; concatenating {e} into the message just duplicates the exception's str() in the log line.
Suggested cleanup
-logger.exception(
-    f"[{tool.tool_id}] Error while fetching response for "
-    f"prompt {id} and doc {document_id}: {e}"
-)
+logger.exception(
+    "[%s] Error while fetching response for prompt %s and doc %s",
+    tool.tool_id,
+    id,
+    document_id,
+)

-logger.exception(
-    f"[{tool.tool_id}] Error while fetching single pass response: {e}"
-)
+logger.exception(
+    "[%s] Error while fetching single pass response", tool.tool_id
+)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py` around lines 1654 - 1657, Remove the duplicated exception string from the logger.exception calls that currently append ": {e}" to the formatted message; instead, keep a concise descriptive message and let logger.exception include the exception and traceback automatically. Update the call that logs f"[{tool.tool_id}] Error while fetching response for prompt {id} and doc {document_id}: {e}" to drop the ": {e}" suffix, and do the same for the second occurrence around the other logger.exception (the one at lines ~1722–1724) so both use a descriptive message only and rely on logger.exception to include the exception details.
2401-2404: 💤 Low value — Duplicated sidecar-path logic with LegacyExecutor._signature_sidecar_path.
Same implementation lives at workers/executor/executors/legacy_executor.py (_signature_sidecar_path). Extract to a shared utility (e.g. on the SDK) so the producer and consumer of .doc_insights.json can't silently drift apart.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py` around lines 2401 - 2404, The sidecar path logic in PromptStudioHelper._signature_sidecar_path duplicates LegacyExecutor._signature_sidecar_path; extract this logic into a single shared utility function (e.g., create signature_sidecar_path(extract_file_path: str) in a common SDK utils module) and replace both PromptStudioHelper._signature_sidecar_path and LegacyExecutor._signature_sidecar_path to call the new shared function instead of reimplementing the suffix logic; update imports accordingly and ensure any callers still receive/return strings and add a small unit test covering filename variants to prevent future drift.
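A minimal sketch of what such a shared helper could look like (module location, constant and function names here are hypothetical):

```python
from pathlib import Path

DOC_INSIGHTS_SIDECAR_SUFFIX = ".doc_insights.json"

def signature_sidecar_path(extract_file_path: str) -> str:
    """Single source of truth for the sidecar location, e.g.
    /extract/report.txt -> /extract/report.doc_insights.json."""
    return str(Path(extract_file_path).with_suffix(DOC_INSIGHTS_SIDECAR_SUFFIX))
```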
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py`:
- Around line 1654-1657: Remove the duplicated exception string from the
logger.exception calls that currently append ": {e}" to the formatted message;
instead, keep a concise descriptive message and let logger.exception include the
exception and traceback automatically. Update the call that logs
f"[{tool.tool_id}] Error while fetching response for prompt {id} and doc
{document_id}: {e}" to drop the ": {e}" suffix, and do the same for the second
occurrence around the other logger.exception (the one at lines ~1722–1724) so
both use a descriptive message only and rely on logger.exception to include the
exception details.
- Around line 2401-2404: The sidecar path logic in
PromptStudioHelper._signature_sidecar_path duplicates
LegacyExecutor._signature_sidecar_path; extract this logic into a single shared
utility function (e.g., create signature_sidecar_path(extract_file_path: str) in
a common SDK utils module) and replace both
PromptStudioHelper._signature_sidecar_path and
LegacyExecutor._signature_sidecar_path to call the new shared function instead
of reimplementing the suffix logic; update imports accordingly and ensure any
callers still receive/return strings and add a small unit test covering filename
variants to prevent future drift.
In `@workers/executor/executors/legacy_executor.py`:
- Around line 367-375: Extract the duplicate sidecar-path logic into a shared
helper/constant and update this implementation to call that shared util instead
of duplicating logic: replace the static method _signature_sidecar_path in
legacy_executor.py with a call to the centralized function or constant (the same
one used by prompt_studio_helper.py), ensuring the canonical behavior uses
Path.with_suffix(".doc_insights.json") (or a single source-of-truth suffix
constant) so filenames with multiple dots are handled consistently; update
imports and references to use the new shared helper.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 60c935d5-84a6-4a73-80ca-e360edf481cb
📒 Files selected for processing (2)
- backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py
- workers/executor/executors/legacy_executor.py
Cuts the SonarCloud duplication metric: the formatter that turns ``signature_metadata`` into the ``[Document Signature Information]`` context block was the same code in both workers and prompt-service. Now both services import ``format_signature_metadata_context`` from ``unstract.sdk1.utils.signature_highlights``. No behaviour change. All 8 signature-highlight unit tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
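For context, a hedged sketch of what that shared formatter does, assuming the per-page {name, type, desc} entry shape described elsewhere in this PR (the real helper's exact wording and field handling may differ):

```python
def format_signature_metadata_context(signature_metadata: dict) -> str:
    """Render per-page signature details as a context block for the LLM prompt."""
    if not signature_metadata:
        return ""
    lines = ["[Document Signature Information]"]
    for page, signatures in sorted(signature_metadata.items(), key=lambda kv: str(kv[0])):
        for sig in signatures:
            if not isinstance(sig, dict):
                continue  # skip malformed entries instead of crashing
            name = str(sig.get("name", "Unknown"))
            sig_type = str(sig.get("type", ""))
            desc = str(sig.get("desc", ""))
            lines.append(f"Page {page}: signer={name}, type={sig_type}, desc={desc}")
    return "\n".join(lines)
```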
Workers and prompt-service both had a ~50-line ``_is_safe_public_url`` helper for SSRF protection on postprocessing webhook URLs. Moved it to ``unstract.sdk1.utils.url_safety.is_safe_public_url`` and import from there. Brings the SonarCloud new-code duplication metric down further without touching behaviour. No behaviour change. All 8 signature-highlight unit tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
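A hedged sketch of the general shape of such a guard (the real shared helper's policy and edge-case handling may differ):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_public_url(url: str) -> bool:
    """Basic SSRF guard: allow only http(s) URLs whose host resolves
    exclusively to globally routable addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        resolved = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for _family, _type, _proto, _canon, sockaddr in resolved:
        ip = ipaddress.ip_address(sockaddr[0])
        if not ip.is_global:  # rejects private, loopback, link-local, reserved ranges
            return False
    return bool(resolved)
```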
Frontend Lint Report (Biome): ✅ All checks passed! No linting or formatting issues found.
Test Results
Summary
Runner Tests - Full Report
SDK1 Tests - Full Report
Note on the remaining SonarCloud Code Analysis failure
The only failing condition is the duplicated-lines metric; total issues on new code is 0.

Why the duplication metric is still red
The remaining 7.3% is pre-existing structural duplication between the workers copy of the answer-prompt service and the prompt-service original — the workers module says as much in its docstring:

"""Answer prompt service — prompt construction and LLM execution.
Ported from prompt-service/.../services/answer_prompt.py.
…
"""

The duplicated blocks SonarCloud still flags are inside that ported code.

What this PR did to bring the metric down
I went after every shared helper I could extract without changing behaviour, taking duplication from 20.97% → 7.3%.

Why I'm stopping here
Deduping the rest would mean lifting …

Requesting a SonarCloud override for this PR, or treating the duplication metric as accepted technical debt to be addressed in a follow-up. Happy to file a follow-up PR specifically for the dedup work if the team wants.


What
- Adds document_insights as a new processing mode on the LLMWhisperer V2 adapter (alongside form, low_cost, high_quality, table). When selected, signature metadata returned by LLMWhisperer is captured, threaded into the LLM context, and surfaced to the Prompt Studio UI so the document viewer jumps to the page that contains the relevant signature when a signer-related answer is clicked.
- Matched signature page coords are appended to metadata[HIGHLIGHT_DATA][prompt_key] — which the existing PdfViewer + jumpToPage pipeline already knows how to consume.
- Works without flipping the enable_highlight toggle on the tool. Existing flows are untouched because their highlight_data is empty when enable_highlight=false.
- Mirrored in prompt-service for parity with workers.

Why
LLMWhisperer's document_insights mode detects signatures and returns signature_metadata (a per-page list of {name, type, desc}). Customers asking signature questions in Prompt Studio (e.g. "Who signed this on behalf of T, Inc.?") should be able to click the answer and have the document viewer take them to the page containing that signature. Before this change, the data wasn't reaching the LLM context, the frontend, or the viewer.

How
Adapter —
unstract/sdk1/.../llm_whisperer_v2
- New Modes.DOCUMENT_INSIGHTS enum entry + JSON schema option.
- When the request mode is document_insights, walk response["metadata"] and surface per-page signature_metadata on TextExtractionMetadata.
- New _build_signature_page_references() static helper: for each page with signatures, find the first content line in line_metadata (skipping marker/empty rows where height <= 0 or page_height <= 0) and emit {hex, line_metadata_index, signers, coords} for that page. The resolved coords are the [page, y, height, page_height] array the frontend overlay code expects.

Workers —
workers/executor/executors/
- legacy_executor._handle_extract writes a <extracted>.doc_insights.json sidecar alongside the .txt (helper _write_signature_sidecar) so Prompt Studio cache hits can recover the signature data without re-extraction.
- The extract → answer_prompt path now also injects signature_page_references into tool_settings (alongside the already-injected signature_metadata). New PromptServiceConstants.SIGNATURE_PAGE_REFERENCES.
- answer_prompt.AnswerPromptService.construct_and_run_prompt calls a new _attach_signature_highlights post-processor after run_completion. The post-processor:
  - For each signer name in signature_metadata, looks for the name in the LLM answer using a case-insensitive word-boundary regex (r"\b" + re.escape(name) + r"\b") — picked over plain substring matching after observing a false positive where signer "P S" matched inside "Pradeep Surukanti".
  - Falls back to all signature pages when the answer mentions signing generically (signature, signed, signatory, signing, executed) but no specific name was matched. This handles "Is this signed?"-style questions.
  - Appends matched page coords to metadata[HIGHLIGHT_DATA][prompt_key], de-duped against entries already populated by hex-comment processing.

Prompt Studio backend —
backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py
- dynamic_extractor now returns an ExtractResult NamedTuple (text, signature_metadata, signature_page_references) instead of a bare string. All 6 callsites updated.
- Cache hits read the .doc_insights.json sidecar via the new helper _load_signature_sidecar.
- All answer_prompt dispatch sites now inject signature_metadata + signature_page_references into tool_settings. New ToolStudioPromptKeys: SIGNATURE_METADATA, SIGNATURE_PAGE_REFERENCES.

Frontend —
- frontend/src/components/custom-tools/prompt-card/DisplayPromptResult.jsx (TextResult component): made the answer clickable when the tool has enable_highlight ON OR when the backend produced highlight_data. Existing behaviour is unchanged because the only way highlight_data is non-empty with enable_highlight=false is the new signature flow.
- PromptCard.jsx (handleSelectHighlight): same fallback — proceed with the selectedHighlight state update when enable_highlight is off but highlight_data is present. Without this, the click fired but the PdfViewer never received a non-empty highlightData prop and jumpToPage was never called.

Prompt-service parity —
prompt-service/src/unstract/prompt_service/
- Mirrored _attach_signature_highlights in services/answer_prompt.py.
- controllers/extraction.py + services/extraction.py now also surface signature_page_references in the /extract response.
- New SIGNATURE_PAGE_REFERENCES constant.

Can this PR break any existing features? If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
- The new adapter path only runs when mode == document_insights. Existing modes (form, low_cost, high_quality, table) are unaffected.
- The new pipeline steps are no-ops when signature_metadata or signature_page_references is missing — non-document_insights flows never enter the new path.
- The post-processor only appends to metadata[HIGHLIGHT_DATA][prompt_key] and de-dupes — it never overwrites the hex-comment highlights produced by the existing highlight-data plugin.
- Frontend gates still short-circuit when !highlight_data && !enable_highlight; tools with enable_highlight=false and no signature data continue to render the non-clickable <div> exactly as before.
- dynamic_extractor's return type changed from str to ExtractResult (NamedTuple); all 6 callsites in prompt_studio_helper.py were updated to use .text. No other callers exist (verified via grep across backend/).
- New unit tests in workers/tests/test_answer_prompt.py::TestAttachSignatureHighlights cover name match, multi-page coords, keyword fallback, no-op cases, dedup against existing highlights, missing-input guards, and the "P S" vs "Pradeep Surukanti" regression. All 8 pass.

Database Migrations

Env Config

Relevant Docs
- Sidecar file: <extract-dir>/<basename>.doc_insights.json next to <basename>.txt. Read on cache hit only; written only when document_insights mode produces signature data.

Related Issues or PRs

Dependencies Versions

Notes on Testing
Automated:
→ 8 passed (covers name match, case-insensitivity, multi-page, keyword fallback, no-op, dedup, regression for the "P S" substring false positive, and missing-input guards).
Manual end-to-end (verified):
- LLMWhisperer V2 adapter set to document_insights; asked: "Who signed this doc on behalf of T, Inc.?"
- DOC_INSIGHTS log lines appear in the executor worker:
  - DOC_INSIGHTS: signature_page_references={...} with non-zero coords per page.
  - DOC_INSIGHTS sidecar: wrote signature data to <path>.doc_insights.json.
  - DOC_INSIGHTS attach_signature_highlights: prompt=..., added N signature highlight(s) on pages [...].
- Cache behaviour: subsequent prompt runs on the same doc read signature data from the sidecar (no re-extraction needed) and the page jump still works.
Screenshots
n/a — UI change is the click-to-jump behaviour described above.
Checklist
I have read and understood the Contribution Guidelines.