UN-3372 [FEAT] Surface LLMWhisperer signature highlights in Prompt Studio #1967
pk-zipstack wants to merge 14 commits into
Conversation
…nature metadata in LLM context
- Add document_insights as a new processing mode in the LLMWhisperer V2 adapter (Modes enum + JSON schema dropdown)
- Extract signature_metadata from the LLMWhisperer response when using document_insights mode and surface it in TextExtractionMetadata
- Thread signature_metadata through the workers pipeline (extract → answer_params → construct_prompt)
- Format signature metadata as a human-readable context block injected into the LLM prompt's Context section
- Update the prompt-service extraction endpoint to return signature_metadata
- Mirror construct_prompt changes in prompt-service for parity
The workers execution path (API deployments, workflow runs) is fully functional. The prompt-service path has endpoints ready, but structure tool threading is a follow-up.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When document_insights mode detects signatures, compute page references by finding the first line_metadata entry for each page with signatures and converting to a 1-indexed hex value. This enables the frontend to navigate/jump to pages containing signatures without highlighting. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
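To make that computation concrete, here is a minimal sketch. It is illustrative only: the helper name and the exact `line_metadata` row layout (`[page, y, height, page_height]`) are assumptions taken from this PR's description, not the adapter's actual code.

```python
def build_signature_page_references(signature_metadata: dict, line_metadata: list) -> dict:
    """For each page that has signatures, reference the first line_metadata
    entry on that page and expose its 1-indexed position as a hex string."""
    references = {}
    for page_key, signatures in signature_metadata.items():
        page = int(page_key)
        for index, row in enumerate(line_metadata):
            if row[0] != page:  # row assumed to be [page, y, height, page_height]
                continue
            references[page] = {
                "hex": format(index + 1, "x"),   # 1-indexed hex value for the viewer
                "line_metadata_index": index,
                "signers": [s.get("name") for s in signatures if isinstance(s, dict)],
            }
            break
    return references
```

A later commit in this PR refines this to skip empty marker rows (height <= 0 or page_height <= 0) so the referenced line is a real content line.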
Add loggers at key points to trace signature metadata through the pipeline: adapter extraction, workers pipeline (_handle_extract, tool_settings injection), prompt construction, and prompt-service extraction endpoint. All loggers use the DOC_INSIGHTS prefix for easy grep-filtering during UI testing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
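As a tiny illustration of the convention (the function name here is hypothetical), every log line in this path carries the same prefix so a single `grep DOC_INSIGHTS` over the worker logs shows the whole trace:

```python
import logging

logger = logging.getLogger(__name__)

def log_signature_capture(signature_metadata: dict) -> None:
    # The shared "DOC_INSIGHTS" prefix keeps adapter, worker and prompt-service
    # log lines for this feature grep-able with one pattern.
    logger.info("DOC_INSIGHTS: captured signatures on pages %s", sorted(signature_metadata))
```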
user_id may be empty for mock auth users (default OSS setup). It's only used as a Redis cache key fragment, so empty values are acceptable. Dropping user_id from the required-fields validation unblocks indexing for these users. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
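One hedged sketch of how such a cache key can stay well-formed when user_id is empty (names and the prefix are placeholders, not the backend's actual constants):

```python
CACHE_PREFIX = "prompt_studio_index:"  # placeholder; the real prefix lives in backend constants

def build_index_cache_key(org_id: str, doc_id_key: str, user_id: str = "") -> str:
    """Substitute a stable placeholder so an empty user_id (mock-auth OSS setup)
    does not produce a double-colon key or collide with other users."""
    return f"{CACHE_PREFIX}{org_id}:{user_id or 'anon_user'}:{doc_id_key}"
```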
…t_insights mode
Wire the signature data captured by the LLMWhisperer V2 adapter through to Prompt Studio's existing highlight pipeline, so clicking a signer-related answer jumps the PDF viewer to the page containing the signature without any frontend changes.
- Adapter: signature_page_references now also carries resolved coords [page, y, height, page_height] alongside the existing hex / line index.
- Workers: _handle_extract writes a <extract>.doc_insights.json sidecar so Prompt Studio cache hits don't lose signature data; the pipeline path threads signature_page_references into tool_settings alongside signature_metadata.
- AnswerPromptService._attach_signature_highlights (mirrored in workers and prompt-service) scans the LLM answer for signer names (case-insensitive substring, sketched below) and appends the matching page coords to metadata[HIGHLIGHT_DATA][prompt_key]. Falls back to all signature pages when the answer mentions signing generically. De-dupes against hex-comment highlights.
- Prompt Studio backend: dynamic_extractor now returns ExtractResult (text + signature_metadata + signature_page_references), reading the sidecar on cache hits. All five answer_prompt dispatch sites inject the signature data into tool_settings.
- Prompt-service: extraction service + controller surface signature_page_references for parity.
- Tests: 7 new unit tests for _attach_signature_highlights covering name match, multi-page coords, keyword fallback, no-op cases, and preservation/dedup of existing highlight entries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
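A minimal sketch of the merge step described above, assuming simplified shapes (page_refs maps page to {"signers": [...], "coords": [page, y, height, page_height]} and highlights live under metadata["highlight_data"][prompt_key]); the real service uses the project's constants rather than these literal keys:

```python
SIGNATURE_KEYWORDS = ("signature", "signed", "signatory", "signing", "executed")

def attach_signature_highlights(answer: str, prompt_key: str, metadata: dict, page_refs: dict) -> None:
    answer_lower = answer.lower()
    matched = [
        ref["coords"]
        for ref in page_refs.values()
        if any(name and name.lower() in answer_lower for name in ref.get("signers", []))
    ]
    if not matched and any(kw in answer_lower for kw in SIGNATURE_KEYWORDS):
        # Generic "is this signed?"-style answers fall back to every signature page.
        matched = [ref["coords"] for ref in page_refs.values()]
    existing = metadata.setdefault("highlight_data", {}).setdefault(prompt_key, [])
    # De-dupe against highlights already added by hex-comment processing.
    existing.extend(coords for coords in matched if coords not in existing)
```

The name matching shown here is the plain case-insensitive substring check from this commit; the next commit replaces it with a word-boundary regex.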
…ignature highlights
Two bugs surfaced when testing in Prompt Studio with a multi-page signed PDF where the LLM answer was a single signer name.
1) Adapter was selecting unusable line_metadata entries. _build_signature_page_references picked the first line_metadata entry per page, but the first entry is often an empty marker row like [0, 0, 0, 3168] or [1, 0, 0, 0]. Zero height makes the overlay invisible; zero page_height causes a divide-by-zero in the frontend's percentage calc. Now skip entries with height <= 0 or page_height <= 0 and pick the first true content line.
2) Post-processor matched signer initials inside other names. "P S" (a signer on page 0) was matching across word boundaries inside "Pradeep Surukanti" — the case-insensitive substring "p s" appears in "Pradee[p S]urukanti". Both pages got highlights, and the viewer jumped to the first (wrong) page. Switched to a regex with \b anchors so the signer name has to appear as a whole token or phrase (sketched below). Added a regression test (test_short_initials_do_not_falsely_match_across_words) that locks in the fix for this exact scenario.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
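The fix is small; a self-contained sketch of the new matching (the function name is illustrative):

```python
import re

def answer_mentions_signer(answer: str, signer_name: str) -> bool:
    # \b anchors require the signer name to appear as a whole token or phrase,
    # so "P S" no longer matches inside "Pradeep Surukanti".
    pattern = r"\b" + re.escape(signer_name) + r"\b"
    return re.search(pattern, answer, flags=re.IGNORECASE) is not None

assert answer_mentions_signer("Signed by P S on page 1", "P S") is True
assert answer_mentions_signer("Signed by Pradeep Surukanti", "P S") is False
```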
Two clickability gates in the frontend prevented signature-driven page jumps from firing when the tool's enable_highlight setting was off, even though the backend was correctly producing highlight_data from document_insights signature extraction.
1) TextResult only rendered the clickable Typography.Text variant when enableHighlight was true, so the answer stayed a plain <div>.
2) handleSelectHighlight in PromptCard.jsx returned silently when enable_highlight was false, so even if the click fired, selectedHighlight state never updated and PdfViewer never received a non-empty highlightData prop (jumpToPage was never called).
Both gates now also pass through when highlight_data is present, so the signature feature works on tools that have document_insights mode on LLMWhisperer V2 even without flipping the separate enable_highlight toggle. Existing flows (with enable_highlight=false and no highlight data) are unchanged because their highlight_data is empty.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough
Capture signature metadata and per-page reference coords from DOCUMENT_INSIGHTS extraction, persist them as sidecars, propagate them through extraction services and executors, inject signature context into prompts, resolve and merge signature-driven highlight coordinates into answer metadata, and surface selectable highlights in the frontend.
Signature extraction, propagation, and highlight attachment
sequenceDiagram
participant FE as Frontend
participant PS as PromptStudio (Backend)
participant PromptSvc as Prompt-Service
participant Exec as Worker/Executor
participant SDK as X2Text Adapter
participant FS as FileStore
FE->>PS: request prompt / select result
PS->>PS: dynamic_extractor() -> ExtractResult(text + signature fields)
alt cache miss
PS->>Exec: enqueue/run extraction
Exec->>SDK: adapter(mode=DOCUMENT_INSIGHTS)
SDK-->>Exec: TextExtractionMetadata (whisper_hash, line_metadata, signature_metadata, signature_page_references)
Exec->>FS: write output file and .doc_insights.json
Exec-->>PS: extraction result with signature fields
else cache hit
PS->>FS: load .doc_insights.json
FS-->>PS: signature fields
end
PS->>PromptSvc: construct_and_run_prompt(tool_settings with signature fields)
PromptSvc->>PromptSvc: construct_prompt(inject signature context)
PromptSvc->>LLM: run completion
LLM-->>PromptSvc: answer
PromptSvc->>SDK: resolve_signature_highlight_coords(answer, signature_metadata, signature_page_references)
SDK-->>PromptSvc: coords
PromptSvc->>PromptSvc: merge coords into metadata.highlight_data[prompt_key]
PromptSvc-->>PS: response payload (includes highlight data)
PS-->>FE: response (clickable/selectable highlights enabled)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/prompt_studio/prompt_studio_core_v2/internal_views.py (1)
227-236: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win — Update docstring to reflect optional user_id.
The docstring still lists user_id: str without indicating it's optional, but the validation on line 249 no longer requires it. Consider updating the docstring to mark user_id as optional to maintain consistency between documentation and implementation.
📝 Suggested docstring update

 """Update document indexing cache status (mark indexed or remove).

 Expected JSON payload:
 {
     "action": "mark_indexed" | "remove",
     "org_id": str,
-    "user_id": str,
+    "user_id": str (optional, may be empty),
     "doc_id_key": str,
     "doc_id": str (required when action == "mark_indexed")
 }
 """
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/prompt_studio/prompt_studio_core_v2/internal_views.py` around lines 227 - 236, The docstring for the update-document-indexing-cache endpoint still lists "user_id: str" as required even though the handler's validation no longer requires it; update the docstring in prompt_studio_core_v2/internal_views.py to mark "user_id" as optional (e.g., "user_id: Optional[str]" or "user_id: str (optional)"), and ensure the payload description still documents that "doc_id" is required only when action == "mark_indexed" so the docstring matches the current validation logic (referencing the "user_id" and "doc_id" fields in the JSON payload).
🧹 Nitpick comments (1)
unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py (1)
177-227: ⚡ Quick win — Consider validating response structure before accessing nested keys.
The code directly accesses response["metadata"] and nested page data without validating the response structure. If LLMWhisperer returns an unexpected response format, this could raise KeyError or TypeError.
🛡️ Proposed defensive fix

 if mode == Modes.DOCUMENT_INSIGHTS.value:
-    response_metadata = response.get("metadata", {})
+    if not isinstance(response, dict):
+        logger.warning("DOC_INSIGHTS: response is not a dict, skipping signature extraction")
+    else:
+        response_metadata = response.get("metadata", {})
+        if not isinstance(response_metadata, dict):
+            logger.warning("DOC_INSIGHTS: metadata is not a dict, skipping")
+            response_metadata = {}
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py` around lines 177 - 227, The code assumes response is a dict and directly walks nested keys when extracting signature_metadata and raw_line_metadata; add defensive validation in LLMWhispererV2 around response access: ensure response is a dict before .get, ensure response_metadata = response.get("metadata", {}) is a dict, check each page_data is a dict and that page_data["signature_metadata"] is a list before iterating/len()/list comprehension, and ensure raw_line_metadata = response.get("line_metadata", []) is a list before using len() or passing to _build_signature_page_references; if types are unexpected, treat them as empty (None/{} /[]) and short-circuit computing signature_page_references to avoid KeyError/TypeError.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/prompt_studio/prompt_studio_core_v2/internal_views.py`:
- Around line 247-249: The cache key logic in DocumentIndexingService currently
builds keys as f"{CACHE_PREFIX}{org_id}:{user_id}:{doc_id_key}" which yields
double colons when user_id is empty; either enforce non-empty user_id by
restoring validation (reject/raise/log when user_id is falsy in the caller or in
_cache_key) or make _cache_key resilient by constructing the key from only
non-empty parts (e.g., join on ":" for [org_id, user_id, doc_id_key] but skip
user_id when empty or substitute a stable placeholder like "anon_user"); update
the _cache_key implementation (and related callers) to use CACHE_PREFIX and the
chosen strategy so keys are consistent and collision-free.
In `@backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py`:
- Around line 2359-2394: The helper _load_signature_sidecar currently swallows
read/parse errors and returns (None, None) which hides corrupt sidecars from
dynamic_extractor; instead, introduce a dedicated SidecarCorruptError (or
similar) in this module and raise it whenever fs_instance.read or json.loads
fails (the except blocks that now return (None, None)), leaving
FileNotFoundError to continue returning (None, None) for legitimately missing
sidecars; update callers (e.g., dynamic_extractor) to catch SidecarCorruptError
and trigger a live extract/fallback path.
In `@prompt-service/src/unstract/prompt_service/services/answer_prompt.py`:
- Around line 159-167: AnswerPromptService currently calls
_attach_signature_highlights before the JSON/webhook postprocessing mutates
metadata[PSKeys.HIGHLIGHT_DATA][prompt_key], so signature coords are lost; to
fix, either move the call to AnswerPromptService._attach_signature_highlights
until after handle_json/postprocessing completes or change handle_json to merge
updated_highlight_data into metadata[PSKeys.HIGHLIGHT_DATA][prompt_key] (rather
than replace) so signature pages added from tool_settings
(PSKeys.SIGNATURE_METADATA / PSKeys.SIGNATURE_PAGE_REFERENCES) are preserved;
update references to prompt_key (output[PSKeys.NAME]), metadata, and
updated_highlight_data accordingly to ensure the merge appends/merges
coordinates instead of overwriting.
- Around line 281-289: The loop that formats `signatures` in answer_prompt.py
assumes each `sig` is a dict and calls `sig.get(...)`, which will crash on
non-dict entries; update the loop to skip any malformed entries by adding the
same guard used in `_attach_signature_highlights()` (e.g., `if not
isinstance(sig, dict): continue`) before accessing `sig.get(...)`, so only dict
items are processed and others are ignored silently.
---
Outside diff comments:
In `@backend/prompt_studio/prompt_studio_core_v2/internal_views.py`:
- Around line 227-236: The docstring for the update-document-indexing-cache
endpoint still lists "user_id: str" as required even though the handler's
validation no longer requires it; update the docstring in
prompt_studio_core_v2/internal_views.py to mark "user_id" as optional (e.g.,
"user_id: Optional[str]" or "user_id: str (optional)"), and ensure the payload
description still documents that "doc_id" is required only when action ==
"mark_indexed" so the docstring matches the current validation logic
(referencing the "user_id" and "doc_id" fields in the JSON payload).
---
Nitpick comments:
In
`@unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py`:
- Around line 177-227: The code assumes response is a dict and directly walks
nested keys when extracting signature_metadata and raw_line_metadata; add
defensive validation in LLMWhispererV2 around response access: ensure response
is a dict before .get, ensure response_metadata = response.get("metadata", {})
is a dict, check each page_data is a dict and that
page_data["signature_metadata"] is a list before iterating/len()/list
comprehension, and ensure raw_line_metadata = response.get("line_metadata", [])
is a list before using len() or passing to _build_signature_page_references; if
types are unexpected, treat them as empty (None/{} /[]) and short-circuit
computing signature_page_references to avoid KeyError/TypeError.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 18bd50bb-ca4a-43cc-9cc4-803fad75e44d
📒 Files selected for processing (17)
- backend/prompt_studio/prompt_studio_core_v2/constants.py
- backend/prompt_studio/prompt_studio_core_v2/internal_views.py
- backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py
- frontend/src/components/custom-tools/prompt-card/DisplayPromptResult.jsx
- frontend/src/components/custom-tools/prompt-card/PromptCard.jsx
- prompt-service/src/unstract/prompt_service/constants.py
- prompt-service/src/unstract/prompt_service/controllers/extraction.py
- prompt-service/src/unstract/prompt_service/services/answer_prompt.py
- prompt-service/src/unstract/prompt_service/services/extraction.py
- unstract/sdk1/src/unstract/sdk1/adapters/x2text/dto.py
- unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/constants.py
- unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py
- unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/static/json_schema.json
- workers/executor/executors/answer_prompt.py
- workers/executor/executors/constants.py
- workers/executor/executors/legacy_executor.py
- workers/tests/test_answer_prompt.py
| Filename | Overview |
|---|---|
| unstract/sdk1/src/unstract/sdk1/utils/signature_highlights.py | New shared helper module for signature matching and highlight merging; word-boundary regex correctly prevents the "P S" false-positive; unguarded int() conversions in format_signature_metadata_context were flagged in a previous thread |
| unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py | Adds document_insights mode: signature_metadata extraction from response, page-reference building via line_metadata; unguarded int(page_str) in _build_signature_page_references already flagged; logic is otherwise sound and gated on mode check |
| workers/executor/executors/legacy_executor.py | _capture_signature_data + _write_signature_sidecar cleanly separate signature persistence; signature data injection into answer_params[tool_settings] is safe because TOOL_SETTINGS is always present at that pipeline stage |
| workers/executor/executors/answer_prompt.py | _attach_signature_highlights and construct_prompt signature context injection look correct; unguarded int(x[0]) sort key already flagged; url_safety extraction to shared module is a clean refactor |
| backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py | ExtractResult NamedTuple cleanly replaces bare str return; all 6 callsites updated; _load_signature_sidecar is defensively coded with proper FileNotFoundError separation; _inject_signature_data_into_tool_settings is a no-op for non-document_insights modes |
| prompt-service/src/unstract/prompt_service/services/answer_prompt.py | Mirror of workers post-processor; _is_safe_public_url correctly moved to shared url_safety module; same unguarded int sort key already flagged in previous thread |
| frontend/src/components/custom-tools/prompt-card/PromptCard.jsx | hasHighlightData definition diverges from DisplayPromptResult.jsx (Boolean() vs Array.isArray+length) — already flagged in previous thread; the restructured handleSelectHighlight logic is otherwise correct |
| frontend/src/components/custom-tools/prompt-card/DisplayPromptResult.jsx | hasHighlightData correctly uses Array.isArray + length check; isClickable gate is logically sound and preserves non-clickable behaviour when both conditions are false |
| unstract/sdk1/src/unstract/sdk1/utils/url_safety.py | Clean extraction of SSRF-protection helper into a shared module; logic is identical to the previously-inline functions in both workers and prompt-service answer_prompt files |
| workers/tests/test_answer_prompt.py | 8 new tests cover name match, case-insensitivity, multi-page, keyword fallback, no-op, dedup, and the "P S" regression; good coverage of the post-processor |
| prompt-service/src/unstract/prompt_service/services/extraction.py | Return type changed from str to dict; signature metadata conditionally included; only one caller (extraction controller) which was updated accordingly |
| backend/prompt_studio/prompt_studio_core_v2/internal_views.py | Removes user_id from required-field validation with a clear explanatory comment; change is scoped and intentional for mock auth support |
Sequence Diagram
sequenceDiagram
participant LLMWv2 as LLMWhisperer V2 Adapter
participant Worker as Workers LegacyExecutor
participant FS as File Storage (sidecar)
participant PS as Prompt Studio Helper
participant AP as AnswerPromptService
participant FE as Frontend (PromptCard / PdfViewer)
Note over LLMWv2: document_insights mode
LLMWv2->>LLMWv2: Extract text + collect signature_metadata from response["metadata"]
LLMWv2->>LLMWv2: _build_signature_page_references() → resolved coords per page
LLMWv2-->>Worker: "TextExtractionResult(text, metadata{signature_metadata, signature_page_references})"
Worker->>FS: _write_signature_sidecar() → file.doc_insights.json
Worker->>Worker: inject signature data into answer_params[tool_settings]
alt Cache miss (fresh extraction)
PS->>Worker: dynamic_extractor() dispatch
Worker-->>PS: ExtractResult(text, signature_metadata, signature_page_references)
else Cache hit (cached .txt exists)
PS->>FS: _load_signature_sidecar() → read .doc_insights.json
FS-->>PS: (signature_metadata, signature_page_references)
end
PS->>PS: _inject_signature_data_into_tool_settings()
PS->>AP: dispatch answer_prompt with signature data in tool_settings
AP->>AP: construct_prompt() → append [Document Signature Information] block
AP->>AP: run_completion() → LLM answer
AP->>AP: _attach_signature_highlights() → match signers in answer (word-boundary regex)
AP-->>PS: "answer + metadata[highlight_data][prompt_key] = matched coords"
PS-->>FE: response with highlight_data
FE->>FE: "isClickable = enableHighlight OR hasHighlightData"
FE->>FE: PdfViewer.jumpToPage(matched signature page)
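The _inject_signature_data_into_tool_settings() step in the diagram above is essentially a guarded dictionary update. A hedged sketch (the literal keys stand in for the SIGNATURE_METADATA / SIGNATURE_PAGE_REFERENCES constants this PR adds):

```python
def inject_signature_data_into_tool_settings(
    tool_settings: dict,
    signature_metadata: dict | None,
    signature_page_references: dict | None,
) -> dict:
    # No-op for non-document_insights runs: nothing is injected when both are empty.
    if signature_metadata:
        tool_settings["signature_metadata"] = signature_metadata
    if signature_page_references:
        tool_settings["signature_page_references"] = signature_page_references
    return tool_settings
```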
Reviews (6): Last reviewed commit: "[MISC] Extract SSRF webhook-URL helper i..."
- Extract duplicated signature-highlight post-processor logic into a shared helper at ``unstract/sdk1/utils/signature_highlights.py`` and delegate from both workers and prompt-service. Cuts SonarCloud's duplication metric and brings the cognitive-complexity score below the gate.
- Split LLMWhispererV2._build_signature_page_references into _index_first_content_line_per_page + _build_page_reference_entry to cut its cognitive complexity below 15.
- Move the ExtractResult NamedTuple in prompt_studio_helper below the import block to silence ruff E402 (and drop a duplicate logger= line that was already present).
- Move the `logger = ...` line in prompt-service extraction.py below all imports to fix the same E402 issue.
- Apply ruff-format normalisation across the touched files.
No behaviour change. All 8 signature-highlight unit tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve conflict in workers/executor/executors/legacy_executor.py where main reorganized the pipeline (wrapped steps 2-5 inside the same try block, added _failure/_absorb helpers and execution_id / file_execution_id context fields). Preserved this PR's signature metadata injection step by relocating it inside the new try block, right after the extract result is absorbed and extracted_text is read. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py`:
- Around line 137-141: The parsing assumes metadata is a dict and
signature_metadata is a dict with int-castable keys; harden it by first checking
that metadata is a dict and signature_metadata =
metadata.get("signature_metadata") is a dict (skip if not), then iterate over
signature_metadata.items() only after verifying types; inside the loop ensure
signatures is a non-empty list and guard the int() cast for page_str with
try/except (skip entries where int(page_str) fails) before referencing
page_first_line and other structures; apply the same defensive checks in the
second block mentioned (lines ~205-219) where signature_metadata and page keys
are used.
In `@unstract/sdk1/src/unstract/sdk1/utils/signature_highlights.py`:
- Around line 127-128: The fallback uses substring checks which cause false
positives (e.g., "unsigned" matching "signed"); change the condition in the
block that sets matched_pages (the variables matched_pages, answer,
SIGNATURE_KEYWORDS, page_coords) to test whole-word matches instead of simple
substrings — for example use a regex word-boundary search (with re.escape on
each keyword) or tokenize/split answer.lower() and check membership, and ensure
the re import is added if you use regex; keep the rest of the logic (assigning
list(page_coords.keys())) unchanged.
In `@workers/executor/executors/answer_prompt.py`:
- Around line 257-267: The loop over signature_metadata assumes well-formed keys
and dict entries and can crash on malformed payloads; wrap the
iteration/processing in validation and safe conversions: when iterating
signature_metadata.items() (used in the for page_num, signatures ... block),
first ensure page_num can be safely converted to int (use a try/except and skip
if not convertible), ensure signatures is an iterable/list before iterating, and
guard access to sig fields by coercing values to str with defaults (e.g., name =
str(sig.get("name", "Unknown"))), skip any sig that is not a dict, and only
build entry after these checks so malformed metadata cannot raise in the code
paths referencing int(page_num) or sig.get(...).
In `@workers/executor/executors/legacy_executor.py`:
- Around line 332-340: The current early-return when output_file_path exists but
both signature_metadata and signature_page_references are empty leaves any
existing sidecar intact; change the flow in LegacyExecutor (use the same block
that computes LegacyExecutor._signature_sidecar_path(output_file_path)) so that
if there is no signature data you check for and delete/unlink the existing
sidecar file instead of returning immediately; only create/write the sidecar
payload when signature_metadata or signature_page_references are present (use
variables output_file_path, signature_metadata, signature_page_references, and
sidecar_path to locate and remove the stale file).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 923cb799-5cf3-472a-b2a4-b414b7b45001
📒 Files selected for processing (8)
- backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py
- prompt-service/src/unstract/prompt_service/services/answer_prompt.py
- prompt-service/src/unstract/prompt_service/services/extraction.py
- unstract/sdk1/src/unstract/sdk1/adapters/x2text/llm_whisperer_v2/src/llm_whisperer_v2.py
- unstract/sdk1/src/unstract/sdk1/utils/signature_highlights.py
- workers/executor/executors/answer_prompt.py
- workers/executor/executors/legacy_executor.py
- workers/tests/test_answer_prompt.py
🚧 Files skipped from review as they are similar to previous changes (3)
- prompt-service/src/unstract/prompt_service/services/answer_prompt.py
- prompt-service/src/unstract/prompt_service/services/extraction.py
- workers/tests/test_answer_prompt.py
- Extract _inject_signature_data_into_tool_settings helper in prompt_studio_helper.py; the 5 call sites now invoke it instead of inlining the if-blocks. Drops complexity of build_fetch_response_payload, build_bulk_fetch_response_payload, and dynamic_extractor below the gate (15).
- Extract _capture_signature_data on LegacyExecutor for the signature_metadata / signature_page_references capture + sidecar write. _handle_extract is back below 15.
- Split _find_pages_matching_signers in signature_highlights.py into a helper _any_signer_matches so the outer pages walk becomes a list comprehension. Complexity below 15.
- Fix two implicit-string-concatenation log lines that ruff-format collapsed onto one line (S5799).
No behaviour change. All 8 signature-highlight unit tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
♻️ Duplicate comments (2)
backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py (1)
2389-2425: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — Distinguish corrupt sidecar from missing sidecar to trigger fallback extraction.
Returning (None, None) for both a legitimately missing sidecar (e.g., a document extracted before document_insights mode was enabled) and a corrupt/unreadable sidecar (I/O error, malformed JSON) makes dynamic_extractor() treat a corrupted cache as a valid cache hit. The .txt file is reused, but signature navigation never recovers until someone manually clears the cache.
🛡️ Proposed fix to surface sidecar corruption
Define a custom exception near the top of the module:
class SidecarCorruptError(Exception):
    """Raised when a signature sidecar exists but is unreadable or malformed."""
    pass

Then update
_load_signature_sidecar:sidecar_path = PromptStudioHelper._signature_sidecar_path(extract_file_path) try: raw = fs_instance.read(path=sidecar_path, mode="r") except FileNotFoundError: return None, None except Exception as e: logger.warning( "DOC_INSIGHTS sidecar: failed to read %s: %s", sidecar_path, e, ) - return None, None + raise SidecarCorruptError( + f"Sidecar exists but cannot be read: {sidecar_path}" + ) from e try: data = json.loads(raw) except (TypeError, ValueError) as e: logger.warning( "DOC_INSIGHTS sidecar: failed to parse %s: %s", sidecar_path, e, ) - return None, None + raise SidecarCorruptError( + f"Sidecar JSON is malformed: {sidecar_path}" + ) from e sig_meta = data.get("signature_metadata") or None sig_refs = data.get("signature_page_references") or None return sig_meta, sig_refsAnd update the
dynamic_extractorcache-hit block around line 2465:try: extracted_text = fs_instance.read(path=extract_file_path, mode="r") logger.info("Extracted text found. Reading from file..") - sig_meta, sig_refs = PromptStudioHelper._load_signature_sidecar( - extract_file_path=extract_file_path, - fs_instance=fs_instance, - ) - return ExtractResult( - text=extracted_text, - signature_metadata=sig_meta, - signature_page_references=sig_refs, - ) + try: + sig_meta, sig_refs = PromptStudioHelper._load_signature_sidecar( + extract_file_path=extract_file_path, + fs_instance=fs_instance, + ) + return ExtractResult( + text=extracted_text, + signature_metadata=sig_meta, + signature_page_references=sig_refs, + ) + except SidecarCorruptError as e: + logger.warning( + "Sidecar corrupt for %s: %s. Re-extracting.", + extract_file_path, + e, + ) + # Fall through to live extraction below except FileNotFoundError as e:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py` around lines 2389 - 2425, The loader currently returns (None, None) for both missing and corrupt sidecars which causes dynamic_extractor to treat corrupt sidecars as cache hits; define a new SidecarCorruptError exception near the top of the module (e.g., class SidecarCorruptError(Exception): pass), modify _load_signature_sidecar to raise SidecarCorruptError when the sidecar exists but reading/parsing fails (I/O errors other than FileNotFoundError or JSON parse errors), and keep returning (None, None) only for FileNotFoundError; then update dynamic_extractor's cache-hit branch to catch SidecarCorruptError and treat it as a cache miss (trigger fallback extraction) while allowing real missing sidecars to continue returning None, None.workers/executor/executors/legacy_executor.py (1)
389-392: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift — Stale sidecar risk: delete leftover .doc_insights.json when signature data is absent.
The early-return at lines 391-392 skips sidecar cleanup when the current extraction produced no signature data. If a previous extraction wrote a sidecar (because the same document was once processed in document_insights mode), that stale file remains on disk. On the next cache hit, _load_signature_sidecar in prompt_studio_helper.py will read the outdated sidecar and inject stale signature coordinates into the answer pipeline.
🛡️ Proposed fix to remove stale sidecar
 if not output_file_path:
     return
+sidecar_path = LegacyExecutor._signature_sidecar_path(output_file_path)
 if not signature_metadata and not signature_page_references:
+    # No signature data in this extraction → remove any stale sidecar
+    try:
+        if fs.exists(sidecar_path):
+            fs.remove(sidecar_path)
+            logger.info(
+                "DOC_INSIGHTS sidecar: removed stale sidecar at %s",
+                sidecar_path,
+            )
+    except Exception as e:
+        logger.warning(
+            "DOC_INSIGHTS sidecar: failed to remove stale sidecar %s: %s",
+            sidecar_path,
+            e,
+        )
     return
-sidecar_path = LegacyExecutor._signature_sidecar_path(output_file_path)
 payload = {
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workers/executor/executors/legacy_executor.py` around lines 389 - 392, The early return in legacy_executor.py skips sidecar cleanup and can leave a stale ".doc_insights.json" beside output_file_path when signature_metadata and signature_page_references are empty; update the block handling "if not output_file_path" / "if not signature_metadata and not signature_page_references" so that when output_file_path exists but there is no signature data you explicitly remove the corresponding sidecar file (the "<output_file_path>.doc_insights.json" used by _load_signature_sidecar in prompt_studio_helper.py) before returning, ensuring stale signature sidecars are deleted.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py`:
- Around line 2389-2425: The loader currently returns (None, None) for both
missing and corrupt sidecars which causes dynamic_extractor to treat corrupt
sidecars as cache hits; define a new SidecarCorruptError exception near the top
of the module (e.g., class SidecarCorruptError(Exception): pass), modify
_load_signature_sidecar to raise SidecarCorruptError when the sidecar exists but
reading/parsing fails (I/O errors other than FileNotFoundError or JSON parse
errors), and keep returning (None, None) only for FileNotFoundError; then update
dynamic_extractor's cache-hit branch to catch SidecarCorruptError and treat it
as a cache miss (trigger fallback extraction) while allowing real missing
sidecars to continue returning None, None.
In `@workers/executor/executors/legacy_executor.py`:
- Around line 389-392: The early return in legacy_executor.py skips sidecar
cleanup and can leave a stale ".doc_insights.json" beside output_file_path when
signature_metadata and signature_page_references are empty; update the block
handling "if not output_file_path" / "if not signature_metadata and not
signature_page_references" so that when output_file_path exists but there is no
signature data you explicitly remove the corresponding sidecar file (the
"<output_file_path>.doc_insights.json" used by _load_signature_sidecar in
prompt_studio_helper.py) before returning, ensuring stale signature sidecars are
deleted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b3914110-6c3b-41fd-a890-37cc79ab9371
📒 Files selected for processing (4)
- backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py
- prompt-service/src/unstract/prompt_service/services/extraction.py
- unstract/sdk1/src/unstract/sdk1/utils/signature_highlights.py
- workers/executor/executors/legacy_executor.py
🚧 Files skipped from review as they are similar to previous changes (2)
- unstract/sdk1/src/unstract/sdk1/utils/signature_highlights.py
- prompt-service/src/unstract/prompt_service/services/extraction.py
…g concat
- Use logger.exception() in three except blocks in prompt_studio_helper.py (S8572) so the traceback is always captured.
- Extract _log_signature_capture from dynamic_extractor to bring its cognitive complexity below 15 (S3776).
- Merge a string pair that ruff-format collapsed onto one line in legacy_executor.py (S5799).
No behaviour change. All 8 signature-highlight unit tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🧹 Nitpick comments (3)
workers/executor/executors/legacy_executor.py (1)
367-375: 💤 Low value — Share the sidecar path helper across services.
prompt_studio_helper.py defines an identical _signature_sidecar_path (and the .doc_insights.json suffix is also referenced in docstrings elsewhere). A single shared constant/helper (e.g. on the SDK or a small signature_sidecar util) would prevent the two implementations from drifting — e.g. one switching to .with_suffix(".doc_insights.json") while the other keeps with_suffix("") + ".doc_insights.json" (which silently behaves differently for filenames with multiple dots).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@workers/executor/executors/legacy_executor.py` around lines 367 - 375, Extract the duplicate sidecar-path logic into a shared helper/constant and update this implementation to call that shared util instead of duplicating logic: replace the static method _signature_sidecar_path in legacy_executor.py with a call to the centralized function or constant (the same one used by prompt_studio_helper.py), ensuring the canonical behavior uses Path.with_suffix(".doc_insights.json") (or a single source-of-truth suffix constant) so filenames with multiple dots are handled consistently; update imports and references to use the new shared helper.
backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py (2)
1654-1657: 💤 Low value — Drop the trailing : {e} when using logger.exception.
logger.exception(...) already appends the active exception with traceback; concatenating {e} into the message just duplicates the exception's str() in the log line.
Suggested cleanup
-logger.exception(
-    f"[{tool.tool_id}] Error while fetching response for "
-    f"prompt {id} and doc {document_id}: {e}"
-)
+logger.exception(
+    "[%s] Error while fetching response for prompt %s and doc %s",
+    tool.tool_id,
+    id,
+    document_id,
+)

-logger.exception(
-    f"[{tool.tool_id}] Error while fetching single pass response: {e}"
-)
+logger.exception(
+    "[%s] Error while fetching single pass response", tool.tool_id
+)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py` around lines 1654 - 1657, Remove the duplicated exception string from the logger.exception calls that currently append ": {e}" to the formatted message; instead, keep a concise descriptive message and let logger.exception include the exception and traceback automatically. Update the call that logs f"[{tool.tool_id}] Error while fetching response for prompt {id} and doc {document_id}: {e}" to drop the ": {e}" suffix, and do the same for the second occurrence around the other logger.exception (the one at lines ~1722–1724) so both use a descriptive message only and rely on logger.exception to include the exception details.
2401-2404: 💤 Low value — Duplicated sidecar-path logic with LegacyExecutor._signature_sidecar_path.
Same implementation lives at workers/executor/executors/legacy_executor.py (_signature_sidecar_path). Extract to a shared utility (e.g. on the SDK) so the producer and consumer of .doc_insights.json can't silently drift apart.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py` around lines 2401 - 2404, The sidecar path logic in PromptStudioHelper._signature_sidecar_path duplicates LegacyExecutor._signature_sidecar_path; extract this logic into a single shared utility function (e.g., create signature_sidecar_path(extract_file_path: str) in a common SDK utils module) and replace both PromptStudioHelper._signature_sidecar_path and LegacyExecutor._signature_sidecar_path to call the new shared function instead of reimplementing the suffix logic; update imports accordingly and ensure any callers still receive/return strings and add a small unit test covering filename variants to prevent future drift.
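A minimal sketch of what such a shared helper could look like (module location, constant and function names here are hypothetical):

```python
from pathlib import Path

DOC_INSIGHTS_SIDECAR_SUFFIX = ".doc_insights.json"

def signature_sidecar_path(extract_file_path: str) -> str:
    """Single source of truth for the sidecar location, e.g.
    /extract/report.txt -> /extract/report.doc_insights.json."""
    return str(Path(extract_file_path).with_suffix(DOC_INSIGHTS_SIDECAR_SUFFIX))
```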
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py`:
- Around line 1654-1657: Remove the duplicated exception string from the
logger.exception calls that currently append ": {e}" to the formatted message;
instead, keep a concise descriptive message and let logger.exception include the
exception and traceback automatically. Update the call that logs
f"[{tool.tool_id}] Error while fetching response for prompt {id} and doc
{document_id}: {e}" to drop the ": {e}" suffix, and do the same for the second
occurrence around the other logger.exception (the one at lines ~1722–1724) so
both use a descriptive message only and rely on logger.exception to include the
exception details.
- Around line 2401-2404: The sidecar path logic in
PromptStudioHelper._signature_sidecar_path duplicates
LegacyExecutor._signature_sidecar_path; extract this logic into a single shared
utility function (e.g., create signature_sidecar_path(extract_file_path: str) in
a common SDK utils module) and replace both
PromptStudioHelper._signature_sidecar_path and
LegacyExecutor._signature_sidecar_path to call the new shared function instead
of reimplementing the suffix logic; update imports accordingly and ensure any
callers still receive/return strings and add a small unit test covering filename
variants to prevent future drift.
In `@workers/executor/executors/legacy_executor.py`:
- Around line 367-375: Extract the duplicate sidecar-path logic into a shared
helper/constant and update this implementation to call that shared util instead
of duplicating logic: replace the static method _signature_sidecar_path in
legacy_executor.py with a call to the centralized function or constant (the same
one used by prompt_studio_helper.py), ensuring the canonical behavior uses
Path.with_suffix(".doc_insights.json") (or a single source-of-truth suffix
constant) so filenames with multiple dots are handled consistently; update
imports and references to use the new shared helper.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 60c935d5-84a6-4a73-80ca-e360edf481cb
📒 Files selected for processing (2)
- backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py
- workers/executor/executors/legacy_executor.py
Cuts the SonarCloud duplication metric: the formatter that turns ``signature_metadata`` into the ``[Document Signature Information]`` context block was the same code in both workers and prompt-service. Now both services import ``format_signature_metadata_context`` from ``unstract.sdk1.utils.signature_highlights``. No behaviour change. All 8 signature-highlight unit tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
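For context, a hedged sketch of what that shared formatter does, assuming the per-page {name, type, desc} entry shape described elsewhere in this PR (the real helper's exact wording and field handling may differ):

```python
def format_signature_metadata_context(signature_metadata: dict) -> str:
    """Render per-page signature details as a context block for the LLM prompt."""
    if not signature_metadata:
        return ""
    lines = ["[Document Signature Information]"]
    for page, signatures in sorted(signature_metadata.items(), key=lambda kv: str(kv[0])):
        for sig in signatures:
            if not isinstance(sig, dict):
                continue  # skip malformed entries instead of crashing
            name = str(sig.get("name", "Unknown"))
            sig_type = str(sig.get("type", ""))
            desc = str(sig.get("desc", ""))
            lines.append(f"Page {page}: signer={name}, type={sig_type}, desc={desc}")
    return "\n".join(lines)
```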
Workers and prompt-service both had a ~50-line ``_is_safe_public_url`` helper for SSRF protection on postprocessing webhook URLs. Moved it to ``unstract.sdk1.utils.url_safety.is_safe_public_url`` and import from there. Brings the SonarCloud new-code duplication metric down further without touching behaviour. No behaviour change. All 8 signature-highlight unit tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
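A hedged sketch of the general shape of such a guard (the real shared helper's policy and edge-case handling may differ):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_public_url(url: str) -> bool:
    """Basic SSRF guard: allow only http(s) URLs whose host resolves
    exclusively to globally routable addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        resolved = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for _family, _type, _proto, _canon, sockaddr in resolved:
        ip = ipaddress.ip_address(sockaddr[0])
        if not ip.is_global:  # rejects private, loopback, link-local, reserved ranges
            return False
    return bool(resolved)
```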
Frontend Lint Report (Biome): ✅ All checks passed! No linting or formatting issues found.
Test Results
Summary
Runner Tests - Full Report
SDK1 Tests - Full Report
Note on the remaining SonarCloud Code Analysis failure
The only failing condition is the duplicated-lines metric; total issues on new code is 0.

Why the duplication metric is still red
The remaining 7.3% is pre-existing structural duplication between the workers copy of the answer-prompt service and the prompt-service original — the workers module says as much in its docstring:

"""Answer prompt service — prompt construction and LLM execution.
Ported from prompt-service/.../services/answer_prompt.py.
…
"""

The duplicated blocks SonarCloud still flags are inside that ported code.

What this PR did to bring the metric down
I went after every shared helper I could extract without changing behaviour, taking duplication from 20.97% → 7.3%.

Why I'm stopping here
Deduping the rest would mean lifting …

Requesting a SonarCloud override for this PR, or treating the duplication metric as accepted technical debt to be addressed in a follow-up. Happy to file a follow-up PR specifically for the dedup work if the team wants.


What
- Adds document_insights as a new processing mode on the LLMWhisperer V2 adapter (alongside form, low_cost, high_quality, table). When selected, signature metadata returned by LLMWhisperer is captured, threaded into the LLM context, and surfaced to the Prompt Studio UI so the document viewer jumps to the page that contains the relevant signature when a signer-related answer is clicked.
- Matched signature page coords are appended to metadata[HIGHLIGHT_DATA][prompt_key] — which the existing PdfViewer + jumpToPage pipeline already knows how to consume.
- Works without flipping the enable_highlight toggle on the tool. Existing flows are untouched because their highlight_data is empty when enable_highlight=false.
- Mirrored in prompt-service for parity with workers.

Why
LLMWhisperer's document_insights mode detects signatures and returns signature_metadata (a per-page list of {name, type, desc}). Customers asking signature questions in Prompt Studio (e.g. "Who signed this on behalf of T, Inc.?") should be able to click the answer and have the document viewer take them to the page containing that signature. Before this change, the data wasn't reaching the LLM context, the frontend, or the viewer.

How
Adapter —
unstract/sdk1/.../llm_whisperer_v2
- New Modes.DOCUMENT_INSIGHTS enum entry + JSON schema option.
- When the request mode is document_insights, walk response["metadata"] and surface per-page signature_metadata on TextExtractionMetadata.
- New _build_signature_page_references() static helper: for each page with signatures, find the first content line in line_metadata (skipping marker/empty rows where height <= 0 or page_height <= 0) and emit {hex, line_metadata_index, signers, coords} for that page. The resolved coords are the [page, y, height, page_height] array the frontend overlay code expects.

Workers —
workers/executor/executors/
- legacy_executor._handle_extract writes a <extracted>.doc_insights.json sidecar alongside the .txt (helper _write_signature_sidecar) so Prompt Studio cache hits can recover the signature data without re-extraction.
- The extract → answer_prompt path now also injects signature_page_references into tool_settings (alongside the already-injected signature_metadata). New PromptServiceConstants.SIGNATURE_PAGE_REFERENCES.
- answer_prompt.AnswerPromptService.construct_and_run_prompt calls a new _attach_signature_highlights post-processor after run_completion. The post-processor:
  - For each signer name in signature_metadata, looks for the name in the LLM answer using a case-insensitive word-boundary regex (r"\b" + re.escape(name) + r"\b") — picked over plain substring matching after observing a false positive where signer "P S" matched inside "Pradeep Surukanti".
  - Falls back to all signature pages when the answer mentions signing generically (signature, signed, signatory, signing, executed) but no specific name was matched. This handles "Is this signed?"-style questions.
  - Appends matched page coords to metadata[HIGHLIGHT_DATA][prompt_key], de-duped against entries already populated by hex-comment processing.

Prompt Studio backend —
backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py
- dynamic_extractor now returns an ExtractResult NamedTuple (text, signature_metadata, signature_page_references) instead of a bare string. All 6 callsites updated.
- Cache hits read the .doc_insights.json sidecar via the new helper _load_signature_sidecar.
- All answer_prompt dispatch sites now inject signature_metadata + signature_page_references into tool_settings. New ToolStudioPromptKeys: SIGNATURE_METADATA, SIGNATURE_PAGE_REFERENCES.

Frontend —
- frontend/src/components/custom-tools/prompt-card/DisplayPromptResult.jsx (TextResult component): made the answer clickable when the tool has enable_highlight ON OR when the backend produced highlight_data. Existing behaviour is unchanged because the only way highlight_data is non-empty with enable_highlight=false is the new signature flow.
- PromptCard.jsx (handleSelectHighlight): same fallback — proceed with the selectedHighlight state update when enable_highlight is off but highlight_data is present. Without this, the click fired but the PdfViewer never received a non-empty highlightData prop and jumpToPage was never called.

Prompt-service parity —
prompt-service/src/unstract/prompt_service/
- Mirrored _attach_signature_highlights in services/answer_prompt.py.
- controllers/extraction.py + services/extraction.py now also surface signature_page_references in the /extract response.
- New SIGNATURE_PAGE_REFERENCES constant.

Can this PR break any existing features? If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
- The new adapter path only runs when mode == document_insights. Existing modes (form, low_cost, high_quality, table) are unaffected.
- The new pipeline steps are no-ops when signature_metadata or signature_page_references is missing — non-document_insights flows never enter the new path.
- The post-processor only appends to metadata[HIGHLIGHT_DATA][prompt_key] and de-dupes — it never overwrites the hex-comment highlights produced by the existing highlight-data plugin.
- Frontend gates still short-circuit when !highlight_data && !enable_highlight; tools with enable_highlight=false and no signature data continue to render the non-clickable <div> exactly as before.
- dynamic_extractor's return type changed from str to ExtractResult (NamedTuple); all 6 callsites in prompt_studio_helper.py were updated to use .text. No other callers exist (verified via grep across backend/).
- New unit tests in workers/tests/test_answer_prompt.py::TestAttachSignatureHighlights cover name match, multi-page coords, keyword fallback, no-op cases, dedup against existing highlights, missing-input guards, and the "P S" vs "Pradeep Surukanti" regression. All 8 pass.

Database Migrations

Env Config

Relevant Docs
- Sidecar file: <extract-dir>/<basename>.doc_insights.json next to <basename>.txt. Read on cache hit only; written only when document_insights mode produces signature data.

Related Issues or PRs

Dependencies Versions

Notes on Testing
Automated:
→ 8 passed (covers name match, case-insensitivity, multi-page, keyword fallback, no-op, dedup, regression for the "P S" substring false positive, and missing-input guards).
Manual end-to-end (verified):
- LLMWhisperer V2 adapter set to document_insights; asked: "Who signed this doc on behalf of T, Inc.?"
- DOC_INSIGHTS log lines appear in the executor worker:
  - DOC_INSIGHTS: signature_page_references={...} with non-zero coords per page.
  - DOC_INSIGHTS sidecar: wrote signature data to <path>.doc_insights.json.
  - DOC_INSIGHTS attach_signature_highlights: prompt=..., added N signature highlight(s) on pages [...].
- Cache behaviour: subsequent prompt runs on the same doc read signature data from the sidecar (no re-extraction needed) and the page jump still works.
Screenshots
n/a — UI change is the click-to-jump behaviour described above.
Checklist
I have read and understood the Contribution Guidelines.