feat: add document-to-markdown MCP tool and streamline system prompt#28
feat: add document-to-markdown MCP tool and streamline system prompt#28
Conversation
…rieval, and extend blob storage: - Add document-to-markdown MCP tool (.pdf, .docx, .pptx) for ingestion → markdown → JSON-LD → DKG publishing - Improve DKG agent system prompt to structure raw data into knowledge graphs and retrieve data via SPARQL using schema.org and FOAF - Extend blob storage to support subfolders for better organization
There was a problem hiding this comment.
Pull request overview
This PR adds document conversion capabilities to the DKG essentials plugin, enhances the DKG agent's knowledge structuring abilities, and improves blob storage organization.
Changes:
- Adds a new
document-to-markdownMCP tool that converts PDF, DOCX, and PPTX files to markdown using Mistral OCR - Significantly expands the DKG agent system prompt with detailed instructions for structuring knowledge graphs, conducting SPARQL queries, and extracting deep knowledge from documents
- Extends blob storage to support subfolder organization through automatic parent directory creation
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts | Implements the core document-to-markdown conversion tool with Mistral OCR integration, image extraction, and blob storage |
| packages/plugin-dkg-essentials/tests/document-to-markdown.spec.ts | Comprehensive test suite covering tool registration, input validation, file type validation, and blob integration |
| packages/plugin-dkg-essentials/src/index.ts | Exports and initializes the new document-to-markdown plugin |
| packages/plugin-dkg-essentials/src/createFsBlobStorage.ts | Adds parent directory creation to support nested blob paths |
| packages/plugin-dkg-essentials/package.json | Adds Mistral AI SDK and undici dependencies |
| packages/plugin-dkg-essentials/src/plugins/dkg-tools.ts | Updates tool description for clarity |
| apps/agent/src/shared/chat.ts | Major expansion of agent system prompt with detailed knowledge extraction, SPARQL query patterns, and user communication guidelines |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts
Outdated
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts
Outdated
Show resolved
Hide resolved
packages/plugin-dkg-essentials/tests/document-to-markdown.spec.ts
Outdated
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts
Outdated
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts
Outdated
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts
Outdated
Show resolved
Hide resolved
- Extract DocumentConversionProvider interface for pluggable OCR backends - Move Mistral-specific code to providers/mistral.ts - Add provider registry with factory pattern (createProvider, getAvailableProviders) - Split into modular structure: types, validation, blob-integration, providers - Support runtime provider selection via DOCUMENT_CONVERSION_PROVIDER env var - Add createDocumentToMarkdownPlugin() factory for custom configuration - Update tests with provider registry and mock provider tests - Update documentation with provider abstraction usage
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Outdated
Show resolved
Hide resolved
…ion bugs - Fail fast on provider init instead of lazy-loading (zsculac review) - Add REST endpoint POST /document-to-markdown with multipart upload and OpenAPI docs (zsculac review) - Move file type/size validation into processDocument() so both MCP and REST paths validate consistently - Add busboy error and close handlers for malformed/empty multipart requests - Remove redundant imageFilename variable in blob-integration - Remove unused MistralProvider import in providers/index - Update tests to use mock providers and fresh MCP servers
…d improve tests - Clamp pageStart/pageEnd to valid bounds to fix wrong page numbering when callers pass out-of-range values (e.g. pageStart: 0) - Update MCP tool and REST endpoint descriptions to remove "using OCR" and note that capabilities depend on the configured provider - Fix stale docstring: providerName default is "unpdf", not "mistral" - Remove 2 duplicate tests in Provider Registry - Add 3 edge case tests: pageStart-only, pageEnd-only, single-page PDF Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allow users to choose between unpdf (basic PDF) and Mistral OCR (PDF/DOCX/PPTX) during setup, with conditional MISTRAL_API_KEY prompt that deduplicates when MistralAI is already the LLM provider. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deduplicate rules stated in multiple sections, trim the JSON-LD example, remove filler, and reorder into a clearer flow. ~30% token reduction with all core behaviors preserved.
Reflect the provider abstraction with unpdf as default and Mistral as opt-in. Add format support matrix, provider resolution order, REST endpoint, and per-provider capability details.
There was a problem hiding this comment.
This PR adds a document-to-markdown plugin with provider support, docs, and a broad test suite. There are two blockers: user-supplied filenames are used to build blob IDs (allowing path traversal writes) and the new REST endpoint lacks core/error API tests. A few smaller issues around schema-level validation, upload size limits, and page-range clamping will help robustness and align behavior with the docs.
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/blob-integration.ts
Outdated
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Outdated
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/providers/unpdf.ts
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
This PR adds a document-to-markdown plugin with provider selection plus MCP and REST interfaces. I found a security blocker around unsanitized filenames/image IDs in blob paths, and there are no tests for the new REST endpoint despite it being user-facing. There are also a few maintainability gaps around schema-level validation, request size limits, and test structure.
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/blob-integration.ts
Outdated
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Outdated
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Show resolved
Hide resolved
Leftover changes to implement. Removing so we dont accidentally merge
…and test coverage - Sanitize filename and image.id in blob paths to prevent path traversal (CWE-22) - Add busboy fileSize limit to abort oversized uploads at stream level (413) - Clamp pageStart/pageEnd to valid bounds in both unpdf and mistral providers - Add REST API tests for /document-to-markdown (success, missing file, bad type, oversize) - Fix stale test assertion (OCR -> provider in tool description check) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
This PR adds the document-to-markdown plugin plus setup/docs updates, but it introduces a path traversal risk in filesystem blob storage due to recursive directory creation without path validation. There are also a couple of correctness/contract mismatches around page numbering and the meaning of pageCount. Tests are present, but I’d address these issues before merging.
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/providers/mistral.ts
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
This PR adds document-to-markdown tooling and provider support, but there are three blockers: blob storage writes can escape the blob root, the multipart handler can respond multiple times on multi-file uploads, and REST errors are always reported as 400. Maintainability is mostly solid, but there are schema/response mismatches and validation/tests that should be tightened to match project conventions. Overall direction is good, but these issues need addressing before merge.
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Outdated
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Outdated
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Outdated
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Show resolved
Hide resolved
packages/plugin-dkg-essentials/tests/document-to-markdown.spec.ts
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
No blockers found. The refactor adds structured EPCIS querying and the document conversion plugin with solid coverage and docs, but there are a few maintainability/pattern drifts to address (schema validation placement, hardcoded explorer URL, and test grouping conventions). Overall maintainability trends positive, but these inconsistencies should be cleaned up to stay aligned with project standards.
Additional comments
These comments target lines outside the diff and could not be posted inline.
packages/plugin-epcis/src/utils/sourceKa.ts:2— 🟡 Issue:DKG_EXPLORER_BASE_URLis hardcoded to production. That makes links wrong in testnet/local deployments and duplicates existing helpers. Consider deriving from config/env or reusing a sharedgetExplorerUrlto keep environments consistent.packages/plugin-epcis/tests/pluginEpcis.spec.ts:95— 🟡 Issue: This new test suite doesn’t include the requireddescribe("Core Functionality")section (AGENTS.md). Please regroup the core cases under that block to satisfy test conventions (keep the existing Error Handling block).
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Show resolved
Hide resolved
There was a problem hiding this comment.
This PR adds a substantial document-to-markdown conversion pipeline and significantly refactors the EPCIS plugin with clearer validation and service boundaries. I didn’t find critical correctness or security regressions. Maintainability is generally improved, but a couple of schema-boundary validations are still done in handlers and one new env var isn’t documented yet.
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Show resolved
Hide resolved
There was a problem hiding this comment.
This PR adds the document-to-markdown conversion plugin and refactors the EPCIS plugin with clearer validation/query utilities. No critical correctness or security blockers found, but there are a few maintainability/contract issues: MCP validation is implemented in the handler instead of the Zod schema, new tests miss required "Core Functionality"/"Error Handling" blocks, and EPCIS track-item ordering contradicts the chronological ordering described. Maintainability is generally improved by the new modular structure, but conventions drift in tests and tool contracts should be tightened.
Additional comments
These comments target lines outside the diff and could not be posted inline.
packages/plugin-epcis/src/services/epcisQueryService.ts:177— 🟡 Issue: The query is orderedDESC, butepcis-track-itemclaims results are chronological. This yields reverse timelines in summaries. Consider making ordering configurable or sorting ascending in the track-item handler (or update the description/tests to match actual ordering).
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Show resolved
Hide resolved
The review prompt allowed Codex to read files outside the PR diff for context and then comment on them, producing unrelated feedback (e.g. EPCIS comments on a pdf-to-markdown PR). Adds hard scope boundaries in four places: context files section, review method pass, exclusion list, and summary instructions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
This PR introduces a substantial EPCIS plugin refactor plus a new document-to-markdown flow with strong test coverage, but it includes two blockers. The multipart upload route can return a server error for invalid content types before its own error handling runs, and the EPCIS publisher endpoint configuration changed in a backward-incompatible way that can break capture/status flows in existing deployments. There is also API documentation drift in the new EPCIS plugin docs. Maintainability generally improves via service extraction, but it is pulled back by schema/handler validation split and docs/code mismatch.
Additional comments
These comments target lines outside the diff and could not be posted inline.
packages/plugin-epcis/src/services/epcisPublisherService.ts:98— 🔴 Bug: Switching to requiredEXPO_PUBLIC_MCP_URLwith no fallback breaks existing setups that previously worked with default localhost/PUBLISHER_URL, causing capture and status calls to fail at runtime; keep backward compatibility by honoringPUBLISHER_URLand/or a localhost default.docs/build-a-dkg-node-ai-agent/plugins/epcis-plugin.md:46— 🟡 Issue: The docs introduceGET /epcis/asset/*ual, but this PR’s EPCIS plugin code does not register that route, which creates an API contract mismatch; either add the endpoint or remove this documented capability.
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Outdated
Show resolved
Hide resolved
packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts
Show resolved
Hide resolved
There was a problem hiding this comment.
This PR adds substantial EPCIS and document-conversion functionality, but there is a blocker in the new capture publishing flow: downstream publisher errors are collapsed into a generic 500, which breaks HTTP error semantics and hides actionable client feedback. The rest of the changes are generally coherent, but there are a few maintainability gaps around schema/validation alignment and test coverage for newly added response content. Maintainability in touched areas is mixed: some validation and retry structure improved, but core plugin files are now carrying more behavior without corresponding contract-level tests for all new outputs.
Additional comments
These comments target lines outside the diff and could not be posted inline.
packages/plugin-epcis/src/index.ts:205— 🔴 Bug: This catch-all rethrows every publisher failure asCapturePublishError, so upstream 4xx errors become generic 500 responses. Preserve and map publisher status/details (retry only transient failures) so client/input errors stay 4xx and server failures stay 5xx.packages/plugin-epcis/src/utils/sourceKa.ts:2— 🟡 Issue: Hardcoding the mainnet explorer base URL makes generated Source KA links incorrect for testnet UALs. Derive the explorer host from the UAL chain ID (or emit plain UALs and let the client choose the correct explorer URL).packages/plugin-epcis/tests/pluginEpcis.spec.ts:40— 🟡 Issue:parseToolResultonly reads the first content block, so tests can pass even if the new Source KA provenance block regresses. Add explicit assertions againstresult.content[1](present and well-formed when events exist, absent when none).
| async (input) => { | ||
| try { | ||
| // Validate input: require either blobId or fileBase64 | ||
| if (!input.blobId && !input.fileBase64) { |
There was a problem hiding this comment.
🟡 Issue: The blobId vs fileBase64 mutual-exclusion rule is enforced only in handler code, so the declared input schema does not reflect the real contract. Move this XOR constraint into schema validation (refine/superRefine) to keep runtime validation and generated tool/API contracts aligned.
Uh oh!
There was an error while loading. Please reload this page.