feat: add document-to-markdown MCP tool and streamline system prompt by Jurij89 · Pull Request #28 · OriginTrail/dkg-node

Jurij89 · 2026-02-04T16:03:00Z

Add document-to-markdown MCP tool (.pdf, .docx, .pptx) for ingestion → markdown → JSON-LD → DKG publishing
Introduce provider abstraction for document conversion (unpdf for basic PDF, Mistral OCR for complex PDF/DOCX/PPTX with OCR)
Extend blob storage to support subfolders for better organization
Streamline DEFAULT_SYSTEM_PROMPT: deduplicate rules, trim JSON-LD example, remove filler, reorder sections (~30% token reduction)
Add document conversion provider selection to setup script with conditional Mistral API key prompt
Auto-append .db extension to database filename in setup script to prevent misconfiguration

…rieval, and extend blob storage: - Add document-to-markdown MCP tool (.pdf, .docx, .pptx) for ingestion → markdown → JSON-LD → DKG publishing - Improve DKG agent system prompt to structure raw data into knowledge graphs and retrieve data via SPARQL using schema.org and FOAF - Extend blob storage to support subfolders for better organization

Copilot

Pull request overview

This PR adds document conversion capabilities to the DKG essentials plugin, enhances the DKG agent's knowledge structuring abilities, and improves blob storage organization.

Changes:

Adds a new document-to-markdown MCP tool that converts PDF, DOCX, and PPTX files to markdown using Mistral OCR
Significantly expands the DKG agent system prompt with detailed instructions for structuring knowledge graphs, conducting SPARQL queries, and extracting deep knowledge from documents
Extends blob storage to support subfolder organization through automatic parent directory creation

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts	Implements the core document-to-markdown conversion tool with Mistral OCR integration, image extraction, and blob storage
packages/plugin-dkg-essentials/tests/document-to-markdown.spec.ts	Comprehensive test suite covering tool registration, input validation, file type validation, and blob integration
packages/plugin-dkg-essentials/src/index.ts	Exports and initializes the new document-to-markdown plugin
packages/plugin-dkg-essentials/src/createFsBlobStorage.ts	Adds parent directory creation to support nested blob paths
packages/plugin-dkg-essentials/package.json	Adds Mistral AI SDK and undici dependencies
packages/plugin-dkg-essentials/src/plugins/dkg-tools.ts	Updates tool description for clarity
apps/agent/src/shared/chat.ts	Major expansion of agent system prompt with detailed knowledge extraction, SPARQL query patterns, and user communication guidelines

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts

packages/plugin-dkg-essentials/tests/document-to-markdown.spec.ts

package-lock.json

packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts

apps/agent/src/shared/chat.ts

- Extract DocumentConversionProvider interface for pluggable OCR backends - Move Mistral-specific code to providers/mistral.ts - Add provider registry with factory pattern (createProvider, getAvailableProviders) - Split into modular structure: types, validation, blob-integration, providers - Support runtime provider selection via DOCUMENT_CONVERSION_PROVIDER env var - Add createDocumentToMarkdownPlugin() factory for custom configuration - Update tests with provider registry and mock provider tests - Update documentation with provider abstraction usage

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts

…ion bugs - Fail fast on provider init instead of lazy-loading (zsculac review) - Add REST endpoint POST /document-to-markdown with multipart upload and OpenAPI docs (zsculac review) - Move file type/size validation into processDocument() so both MCP and REST paths validate consistently - Add busboy error and close handlers for malformed/empty multipart requests - Remove redundant imageFilename variable in blob-integration - Remove unused MistralProvider import in providers/index - Update tests to use mock providers and fresh MCP servers

…d improve tests - Clamp pageStart/pageEnd to valid bounds to fix wrong page numbering when callers pass out-of-range values (e.g. pageStart: 0) - Update MCP tool and REST endpoint descriptions to remove "using OCR" and note that capabilities depend on the configured provider - Fix stale docstring: providerName default is "unpdf", not "mistral" - Remove 2 duplicate tests in Provider Registry - Add 3 edge case tests: pageStart-only, pageEnd-only, single-page PDF Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Allow users to choose between unpdf (basic PDF) and Mistral OCR (PDF/DOCX/PPTX) during setup, with conditional MISTRAL_API_KEY prompt that deduplicates when MistralAI is already the LLM provider. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Deduplicate rules stated in multiple sections, trim the JSON-LD example, remove filler, and reorder into a clearer flow. ~30% token reduction with all core behaviors preserved.

Reflect the provider abstraction with unpdf as default and Mistral as opt-in. Add format support matrix, provider resolution order, REST endpoint, and per-provider capability details.

github-actions

This PR adds a document-to-markdown plugin with provider support, docs, and a broad test suite. There are two blockers: user-supplied filenames are used to build blob IDs (allowing path traversal writes) and the new REST endpoint lacks core/error API tests. A few smaller issues around schema-level validation, upload size limits, and page-range clamping will help robustness and align behavior with the docs.

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/blob-integration.ts

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/providers/unpdf.ts

github-actions

This PR adds a document-to-markdown plugin with provider selection plus MCP and REST interfaces. I found a security blocker around unsanitized filenames/image IDs in blob paths, and there are no tests for the new REST endpoint despite it being user-facing. There are also a few maintainability gaps around schema-level validation, request size limits, and test structure.

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/blob-integration.ts

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts

packages/plugin-dkg-essentials/tests/document-to-markdown.spec.ts

Leftover changes to implement. Removing so we dont accidentally merge

…and test coverage - Sanitize filename and image.id in blob paths to prevent path traversal (CWE-22) - Add busboy fileSize limit to abort oversized uploads at stream level (413) - Clamp pageStart/pageEnd to valid bounds in both unpdf and mistral providers - Add REST API tests for /document-to-markdown (success, missing file, bad type, oversize) - Fix stale test assertion (OCR -> provider in tool description check) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions

This PR adds the document-to-markdown plugin plus setup/docs updates, but it introduces a path traversal risk in filesystem blob storage due to recursive directory creation without path validation. There are also a couple of correctness/contract mismatches around page numbering and the meaning of pageCount. Tests are present, but I’d address these issues before merging.

packages/plugin-dkg-essentials/src/createFsBlobStorage.ts

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/providers/mistral.ts

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts

…tool

github-actions

This PR adds document-to-markdown tooling and provider support, but there are three blockers: blob storage writes can escape the blob root, the multipart handler can respond multiple times on multi-file uploads, and REST errors are always reported as 400. Maintainability is mostly solid, but there are schema/response mismatches and validation/tests that should be tightened to match project conventions. Overall direction is good, but these issues need addressing before merge.

packages/plugin-dkg-essentials/src/createFsBlobStorage.ts

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts

packages/plugin-dkg-essentials/tests/document-to-markdown.spec.ts

github-actions

No blockers found. The refactor adds structured EPCIS querying and the document conversion plugin with solid coverage and docs, but there are a few maintainability/pattern drifts to address (schema validation placement, hardcoded explorer URL, and test grouping conventions). Overall maintainability trends positive, but these inconsistencies should be cleaned up to stay aligned with project standards.

Additional comments

These comments target lines outside the diff and could not be posted inline.

packages/plugin-epcis/src/utils/sourceKa.ts:2 — 🟡 Issue: DKG_EXPLORER_BASE_URL is hardcoded to production. That makes links wrong in testnet/local deployments and duplicates existing helpers. Consider deriving from config/env or reusing a shared getExplorerUrl to keep environments consistent.
packages/plugin-epcis/tests/pluginEpcis.spec.ts:95 — 🟡 Issue: This new test suite doesn’t include the required describe("Core Functionality") section (AGENTS.md). Please regroup the core cases under that block to satisfy test conventions (keep the existing Error Handling block).

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts

… docs

github-actions

This PR adds a substantial document-to-markdown conversion pipeline and significantly refactors the EPCIS plugin with clearer validation and service boundaries. I didn’t find critical correctness or security regressions. Maintainability is generally improved, but a couple of schema-boundary validations are still done in handlers and one new env var isn’t documented yet.

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts

packages/plugin-dkg-publisher/src/services/DkgService.ts

github-actions

This PR adds the document-to-markdown conversion plugin and refactors the EPCIS plugin with clearer validation/query utilities. No critical correctness or security blockers found, but there are a few maintainability/contract issues: MCP validation is implemented in the handler instead of the Zod schema, new tests miss required "Core Functionality"/"Error Handling" blocks, and EPCIS track-item ordering contradicts the chronological ordering described. Maintainability is generally improved by the new modular structure, but conventions drift in tests and tool contracts should be tightened.

Additional comments

These comments target lines outside the diff and could not be posted inline.

packages/plugin-epcis/src/services/epcisQueryService.ts:177 — 🟡 Issue: The query is ordered DESC, but epcis-track-item claims results are chronological. This yields reverse timelines in summaries. Consider making ordering configurable or sorting ascending in the track-item handler (or update the description/tests to match actual ordering).

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts

packages/plugin-dkg-essentials/tests/document-to-markdown.spec.ts

The review prompt allowed Codex to read files outside the PR diff for context and then comment on them, producing unrelated feedback (e.g. EPCIS comments on a pdf-to-markdown PR). Adds hard scope boundaries in four places: context files section, review method pass, exclusion list, and summary instructions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions

This PR introduces a substantial EPCIS plugin refactor plus a new document-to-markdown flow with strong test coverage, but it includes two blockers. The multipart upload route can return a server error for invalid content types before its own error handling runs, and the EPCIS publisher endpoint configuration changed in a backward-incompatible way that can break capture/status flows in existing deployments. There is also API documentation drift in the new EPCIS plugin docs. Maintainability generally improves via service extraction, but it is pulled back by schema/handler validation split and docs/code mismatch.

Additional comments

These comments target lines outside the diff and could not be posted inline.

packages/plugin-epcis/src/services/epcisPublisherService.ts:98 — 🔴 Bug: Switching to required EXPO_PUBLIC_MCP_URL with no fallback breaks existing setups that previously worked with default localhost/PUBLISHER_URL, causing capture and status calls to fail at runtime; keep backward compatibility by honoring PUBLISHER_URL and/or a localhost default.
docs/build-a-dkg-node-ai-agent/plugins/epcis-plugin.md:46 — 🟡 Issue: The docs introduce GET /epcis/asset/*ual, but this PR’s EPCIS plugin code does not register that route, which creates an API contract mismatch; either add the endpoint or remove this documented capability.

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts

github-actions

This PR adds substantial EPCIS and document-conversion functionality, but there is a blocker in the new capture publishing flow: downstream publisher errors are collapsed into a generic 500, which breaks HTTP error semantics and hides actionable client feedback. The rest of the changes are generally coherent, but there are a few maintainability gaps around schema/validation alignment and test coverage for newly added response content. Maintainability in touched areas is mixed: some validation and retry structure improved, but core plugin files are now carrying more behavior without corresponding contract-level tests for all new outputs.

Additional comments

These comments target lines outside the diff and could not be posted inline.

packages/plugin-epcis/src/index.ts:205 — 🔴 Bug: This catch-all rethrows every publisher failure as CapturePublishError, so upstream 4xx errors become generic 500 responses. Preserve and map publisher status/details (retry only transient failures) so client/input errors stay 4xx and server failures stay 5xx.
packages/plugin-epcis/src/utils/sourceKa.ts:2 — 🟡 Issue: Hardcoding the mainnet explorer base URL makes generated Source KA links incorrect for testnet UALs. Derive the explorer host from the UAL chain ID (or emit plain UALs and let the client choose the correct explorer URL).
packages/plugin-epcis/tests/pluginEpcis.spec.ts:40 — 🟡 Issue: parseToolResult only reads the first content block, so tests can pass even if the new Source KA provenance block regresses. Add explicit assertions against result.content[1] (present and well-formed when events exist, absent when none).

github-actions · 2026-02-27T11:52:37Z

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts

+      async (input) => {
+        try {
+          // Validate input: require either blobId or fileBase64
+          if (!input.blobId && !input.fileBase64) {


🟡 Issue: The blobId vs fileBase64 mutual-exclusion rule is enforced only in handler code, so the declared input schema does not reflect the real contract. Move this XOR constraint into schema validation (refine/superRefine) to keep runtime validation and generated tool/API contracts aligned.

Jurij89 requested a review from Copilot February 4, 2026 16:06

Copilot AI reviewed Feb 4, 2026

View reviewed changes

lupuszr reviewed Feb 4, 2026

View reviewed changes

package-lock.json Show resolved Hide resolved

lupuszr reviewed Feb 4, 2026

View reviewed changes

packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts Outdated Show resolved Hide resolved

Balki-OriginTrail previously approved these changes Feb 5, 2026

View reviewed changes

zsculac reviewed Feb 5, 2026

View reviewed changes

packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts Outdated Show resolved Hide resolved

zsculac reviewed Feb 5, 2026

View reviewed changes

packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts Outdated Show resolved Hide resolved

zsculac reviewed Feb 5, 2026

View reviewed changes

apps/agent/src/shared/chat.ts Outdated Show resolved Hide resolved

zsculac reviewed Feb 5, 2026

View reviewed changes

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts Show resolved Hide resolved

zsculac reviewed Feb 5, 2026

View reviewed changes

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts Outdated Show resolved Hide resolved

Jurij Skornik and others added 6 commits February 9, 2026 19:02

fix: removed a folder that shouldln't have been a part of the commit

c9dc5e2

fix(setup): auto-append .db extension to database filename

398dcb9

refactor(agent): streamline DEFAULT_SYSTEM_PROMPT

2c0a167

Deduplicate rules stated in multiple sections, trim the JSON-LD example, remove filler, and reorder into a clearer flow. ~30% token reduction with all core behaviors preserved.

Jurij89 changed the title ~~feat: add document-to-markdown MCP tool, improve KG structuring & retrieval, and extend blob storage~~ feat: add document-to-markdown MCP tool and streamline system prompt Feb 25, 2026

Jurij Skornik added 3 commits February 25, 2026 12:03

fix(agent): add markdown formatting instruction to system prompt

d9dd38e

merge: resolve conflicts with main (keep streamlined prompt + new deps)

f937dd7

docs(document-to-markdown): update docs for unpdf default provider

facb2bc

Reflect the provider abstraction with unpdf as default and Mistral as opt-in. Add format support matrix, provider resolution order, REST endpoint, and per-provider capability details.

github-actions bot reviewed Feb 25, 2026

View reviewed changes

fix(agent): remove greeting example to reduce repetitive opener

244c66d

github-actions bot reviewed Feb 25, 2026

View reviewed changes

zsculac requested a review from Balki-OriginTrail February 25, 2026 11:55

github-actions bot reviewed Feb 25, 2026

View reviewed changes

better codex prompt

71ee667

Merge remote-tracking branch 'origin/main' into feat/pdf-to-markdown-…

a18161a

…tool

github-actions bot reviewed Feb 26, 2026

View reviewed changes

fix(plugin-dkg-essentials): finalize document-to-markdown review fixes

f855e76

github-actions bot reviewed Feb 26, 2026

View reviewed changes

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts Show resolved Hide resolved

chore(plugin-dkg-essentials): remove review summary artifact and tidy…

3e05a81

… docs

github-actions bot reviewed Feb 26, 2026

View reviewed changes

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts Show resolved Hide resolved

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts Show resolved Hide resolved

packages/plugin-dkg-publisher/src/services/DkgService.ts Outdated Show resolved Hide resolved

fix(document-to-markdown): mark MCP conversion failures as tool errors

a571f16

github-actions bot reviewed Feb 26, 2026

View reviewed changes

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts Show resolved Hide resolved

packages/plugin-dkg-essentials/tests/document-to-markdown.spec.ts Show resolved Hide resolved

github-actions bot reviewed Feb 27, 2026

View reviewed changes

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts Outdated Show resolved Hide resolved

packages/plugin-dkg-essentials/src/plugins/document-to-markdown/index.ts Show resolved Hide resolved

fix(document-to-markdown): handle invalid multipart requests

77c77a9

github-actions bot reviewed Feb 27, 2026

View reviewed changes

zsculac approved these changes Feb 27, 2026

View reviewed changes

zsculac merged commit 8e7b178 into main Feb 27, 2026
10 checks passed

zsculac deleted the feat/pdf-to-markdown-tool branch February 27, 2026 14:09

Conversation

Jurij89 commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Additional comments

Uh oh!

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Additional comments

Uh oh!

Uh oh!

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Additional comments

Uh oh!

Uh oh!

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Additional comments

Uh oh!

github-actions bot Feb 27, 2026

Jurij89 commented Feb 4, 2026 •

edited

Loading