Skip to content

feat: add document-to-markdown MCP tool and streamline system prompt#28

Merged
zsculac merged 20 commits intomainfrom
feat/pdf-to-markdown-tool
Feb 27, 2026
Merged

feat: add document-to-markdown MCP tool and streamline system prompt#28
zsculac merged 20 commits intomainfrom
feat/pdf-to-markdown-tool

Conversation

@Jurij89
Copy link
Contributor

@Jurij89 Jurij89 commented Feb 4, 2026

  • Add document-to-markdown MCP tool (.pdf, .docx, .pptx) for ingestion → markdown → JSON-LD → DKG publishing
  • Introduce provider abstraction for document conversion (unpdf for basic PDF, Mistral OCR for complex PDF/DOCX/PPTX with OCR)
  • Extend blob storage to support subfolders for better organization
  • Streamline DEFAULT_SYSTEM_PROMPT: deduplicate rules, trim JSON-LD example, remove filler, reorder sections (~30% token reduction)
  • Add document conversion provider selection to setup script with conditional Mistral API key prompt
  • Auto-append .db extension to database filename in setup script to prevent misconfiguration

…rieval, and extend blob storage:

- Add document-to-markdown MCP tool (.pdf, .docx, .pptx) for ingestion → markdown → JSON-LD → DKG publishing
- Improve DKG agent system prompt to structure raw data into knowledge graphs and retrieve data via SPARQL using schema.org and FOAF
- Extend blob storage to support subfolders for better organization
@Jurij89 Jurij89 requested a review from Copilot February 4, 2026 16:06
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds document conversion capabilities to the DKG essentials plugin, enhances the DKG agent's knowledge structuring abilities, and improves blob storage organization.

Changes:

  • Adds a new document-to-markdown MCP tool that converts PDF, DOCX, and PPTX files to markdown using Mistral OCR
  • Significantly expands the DKG agent system prompt with detailed instructions for structuring knowledge graphs, conducting SPARQL queries, and extracting deep knowledge from documents
  • Extends blob storage to support subfolder organization through automatic parent directory creation

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts Implements the core document-to-markdown conversion tool with Mistral OCR integration, image extraction, and blob storage
packages/plugin-dkg-essentials/tests/document-to-markdown.spec.ts Comprehensive test suite covering tool registration, input validation, file type validation, and blob integration
packages/plugin-dkg-essentials/src/index.ts Exports and initializes the new document-to-markdown plugin
packages/plugin-dkg-essentials/src/createFsBlobStorage.ts Adds parent directory creation to support nested blob paths
packages/plugin-dkg-essentials/package.json Adds Mistral AI SDK and undici dependencies
packages/plugin-dkg-essentials/src/plugins/dkg-tools.ts Updates tool description for clarity
apps/agent/src/shared/chat.ts Major expansion of agent system prompt with detailed knowledge extraction, SPARQL query patterns, and user communication guidelines

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Extract DocumentConversionProvider interface for pluggable OCR backends
- Move Mistral-specific code to providers/mistral.ts
- Add provider registry with factory pattern (createProvider, getAvailableProviders)
- Split into modular structure: types, validation, blob-integration, providers
- Support runtime provider selection via DOCUMENT_CONVERSION_PROVIDER env var
- Add createDocumentToMarkdownPlugin() factory for custom configuration
- Update tests with provider registry and mock provider tests
- Update documentation with provider abstraction usage
Jurij Skornik and others added 6 commits February 9, 2026 19:02
…ion bugs

- Fail fast on provider init instead of lazy-loading (zsculac review)
- Add REST endpoint POST /document-to-markdown with multipart upload and OpenAPI docs (zsculac review)
- Move file type/size validation into processDocument() so both MCP and REST paths validate consistently
- Add busboy error and close handlers for malformed/empty multipart requests
- Remove redundant imageFilename variable in blob-integration
- Remove unused MistralProvider import in providers/index
- Update tests to use mock providers and fresh MCP servers
…d improve tests

- Clamp pageStart/pageEnd to valid bounds to fix wrong page numbering
  when callers pass out-of-range values (e.g. pageStart: 0)
- Update MCP tool and REST endpoint descriptions to remove "using OCR"
  and note that capabilities depend on the configured provider
- Fix stale docstring: providerName default is "unpdf", not "mistral"
- Remove 2 duplicate tests in Provider Registry
- Add 3 edge case tests: pageStart-only, pageEnd-only, single-page PDF

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allow users to choose between unpdf (basic PDF) and Mistral OCR
(PDF/DOCX/PPTX) during setup, with conditional MISTRAL_API_KEY prompt
that deduplicates when MistralAI is already the LLM provider.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deduplicate rules stated in multiple sections, trim the JSON-LD
example, remove filler, and reorder into a clearer flow.
~30% token reduction with all core behaviors preserved.
@Jurij89 Jurij89 changed the title feat: add document-to-markdown MCP tool, improve KG structuring & retrieval, and extend blob storage feat: add document-to-markdown MCP tool and streamline system prompt Feb 25, 2026
Jurij Skornik added 3 commits February 25, 2026 12:03
Reflect the provider abstraction with unpdf as default and Mistral
as opt-in. Add format support matrix, provider resolution order,
REST endpoint, and per-provider capability details.
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds a document-to-markdown plugin with provider support, docs, and a broad test suite. There are two blockers: user-supplied filenames are used to build blob IDs (allowing path traversal writes) and the new REST endpoint lacks core/error API tests. A few smaller issues around schema-level validation, upload size limits, and page-range clamping will help robustness and align behavior with the docs.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds a document-to-markdown plugin with provider selection plus MCP and REST interfaces. I found a security blocker around unsanitized filenames/image IDs in blob paths, and there are no tests for the new REST endpoint despite it being user-facing. There are also a few maintainability gaps around schema-level validation, request size limits, and test structure.

@zsculac zsculac dismissed Balki-OriginTrail’s stale review February 25, 2026 11:56

Leftover changes to implement. Removing so we dont accidentally merge

…and test coverage

- Sanitize filename and image.id in blob paths to prevent path traversal (CWE-22)
- Add busboy fileSize limit to abort oversized uploads at stream level (413)
- Clamp pageStart/pageEnd to valid bounds in both unpdf and mistral providers
- Add REST API tests for /document-to-markdown (success, missing file, bad type, oversize)
- Fix stale test assertion (OCR -> provider in tool description check)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds the document-to-markdown plugin plus setup/docs updates, but it introduces a path traversal risk in filesystem blob storage due to recursive directory creation without path validation. There are also a couple of correctness/contract mismatches around page numbering and the meaning of pageCount. Tests are present, but I’d address these issues before merging.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds document-to-markdown tooling and provider support, but there are three blockers: blob storage writes can escape the blob root, the multipart handler can respond multiple times on multi-file uploads, and REST errors are always reported as 400. Maintainability is mostly solid, but there are schema/response mismatches and validation/tests that should be tightened to match project conventions. Overall direction is good, but these issues need addressing before merge.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blockers found. The refactor adds structured EPCIS querying and the document conversion plugin with solid coverage and docs, but there are a few maintainability/pattern drifts to address (schema validation placement, hardcoded explorer URL, and test grouping conventions). Overall maintainability trends positive, but these inconsistencies should be cleaned up to stay aligned with project standards.

Additional comments

These comments target lines outside the diff and could not be posted inline.

  • packages/plugin-epcis/src/utils/sourceKa.ts:2 — 🟡 Issue: DKG_EXPLORER_BASE_URL is hardcoded to production. That makes links wrong in testnet/local deployments and duplicates existing helpers. Consider deriving from config/env or reusing a shared getExplorerUrl to keep environments consistent.
  • packages/plugin-epcis/tests/pluginEpcis.spec.ts:95 — 🟡 Issue: This new test suite doesn’t include the required describe("Core Functionality") section (AGENTS.md). Please regroup the core cases under that block to satisfy test conventions (keep the existing Error Handling block).

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds a substantial document-to-markdown conversion pipeline and significantly refactors the EPCIS plugin with clearer validation and service boundaries. I didn’t find critical correctness or security regressions. Maintainability is generally improved, but a couple of schema-boundary validations are still done in handlers and one new env var isn’t documented yet.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds the document-to-markdown conversion plugin and refactors the EPCIS plugin with clearer validation/query utilities. No critical correctness or security blockers found, but there are a few maintainability/contract issues: MCP validation is implemented in the handler instead of the Zod schema, new tests miss required "Core Functionality"/"Error Handling" blocks, and EPCIS track-item ordering contradicts the chronological ordering described. Maintainability is generally improved by the new modular structure, but conventions drift in tests and tool contracts should be tightened.

Additional comments

These comments target lines outside the diff and could not be posted inline.

  • packages/plugin-epcis/src/services/epcisQueryService.ts:177 — 🟡 Issue: The query is ordered DESC, but epcis-track-item claims results are chronological. This yields reverse timelines in summaries. Consider making ordering configurable or sorting ascending in the track-item handler (or update the description/tests to match actual ordering).

The review prompt allowed Codex to read files outside the PR diff for
context and then comment on them, producing unrelated feedback (e.g.
EPCIS comments on a pdf-to-markdown PR). Adds hard scope boundaries in
four places: context files section, review method pass, exclusion list,
and summary instructions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces a substantial EPCIS plugin refactor plus a new document-to-markdown flow with strong test coverage, but it includes two blockers. The multipart upload route can return a server error for invalid content types before its own error handling runs, and the EPCIS publisher endpoint configuration changed in a backward-incompatible way that can break capture/status flows in existing deployments. There is also API documentation drift in the new EPCIS plugin docs. Maintainability generally improves via service extraction, but it is pulled back by schema/handler validation split and docs/code mismatch.

Additional comments

These comments target lines outside the diff and could not be posted inline.

  • packages/plugin-epcis/src/services/epcisPublisherService.ts:98 — 🔴 Bug: Switching to required EXPO_PUBLIC_MCP_URL with no fallback breaks existing setups that previously worked with default localhost/PUBLISHER_URL, causing capture and status calls to fail at runtime; keep backward compatibility by honoring PUBLISHER_URL and/or a localhost default.
  • docs/build-a-dkg-node-ai-agent/plugins/epcis-plugin.md:46 — 🟡 Issue: The docs introduce GET /epcis/asset/*ual, but this PR’s EPCIS plugin code does not register that route, which creates an API contract mismatch; either add the endpoint or remove this documented capability.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds substantial EPCIS and document-conversion functionality, but there is a blocker in the new capture publishing flow: downstream publisher errors are collapsed into a generic 500, which breaks HTTP error semantics and hides actionable client feedback. The rest of the changes are generally coherent, but there are a few maintainability gaps around schema/validation alignment and test coverage for newly added response content. Maintainability in touched areas is mixed: some validation and retry structure improved, but core plugin files are now carrying more behavior without corresponding contract-level tests for all new outputs.

Additional comments

These comments target lines outside the diff and could not be posted inline.

  • packages/plugin-epcis/src/index.ts:205 — 🔴 Bug: This catch-all rethrows every publisher failure as CapturePublishError, so upstream 4xx errors become generic 500 responses. Preserve and map publisher status/details (retry only transient failures) so client/input errors stay 4xx and server failures stay 5xx.
  • packages/plugin-epcis/src/utils/sourceKa.ts:2 — 🟡 Issue: Hardcoding the mainnet explorer base URL makes generated Source KA links incorrect for testnet UALs. Derive the explorer host from the UAL chain ID (or emit plain UALs and let the client choose the correct explorer URL).
  • packages/plugin-epcis/tests/pluginEpcis.spec.ts:40 — 🟡 Issue: parseToolResult only reads the first content block, so tests can pass even if the new Source KA provenance block regresses. Add explicit assertions against result.content[1] (present and well-formed when events exist, absent when none).

async (input) => {
try {
// Validate input: require either blobId or fileBase64
if (!input.blobId && !input.fileBase64) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: The blobId vs fileBase64 mutual-exclusion rule is enforced only in handler code, so the declared input schema does not reflect the real contract. Move this XOR constraint into schema validation (refine/superRefine) to keep runtime validation and generated tool/API contracts aligned.

@zsculac zsculac merged commit 8e7b178 into main Feb 27, 2026
10 checks passed
@zsculac zsculac deleted the feat/pdf-to-markdown-tool branch February 27, 2026 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants