Add pdf_oxide to Parsers, OCR and extraction #22

Open
yfedoseev wants to merge 3 commits into OneOffTech:main from yfedoseev:add-pdf-oxide

Conversation

@yfedoseev

Adds pdf_oxide to the Parsers, OCR and extraction section, following the README format and the awesome-lint rules (verified locally with npx awesome-lint; uses "WebAssembly" per the spell-check rule).

  • pdf_oxide - A fast Rust PDF library for text and image extraction, markdown conversion, and structured extraction, with bindings for Python, Go, JS/TS, .NET, Java, PHP, Ruby, and WebAssembly, plus a CLI and MCP server.

MIT-licensed Rust core. Happy to move it or adjust wording to fit your conventions — thanks for maintaining the list!

Contributor

@avvertix left a comment

Hey @yfedoseev, thanks for sharing. I'm suggesting a small change to keep the description short. If you agree with my rewording, just accept the change; otherwise, please provide a shortened description that cites the main feature.

Comment thread README.md Outdated
- [CatchTheTornado/pdf-extract-api](https://github.com/CatchTheTornado/pdf-extract-api) - Document (PDF) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown.
- [climatepolicyradar/navigator-document-parser](https://github.com/climatepolicyradar/navigator-document-parser) - Parsing PDFs and websites containing laws and policies.
- [Iteration Layer](https://iterationlayer.com) - An AI-powered API that extracts structured data from PDFs, images, DOCX, and text files.
- [pdf_oxide](https://github.com/yfedoseev/pdf_oxide) - A fast Rust PDF library for text and image extraction, markdown conversion, and structured extraction, with bindings for Python, Go, JS/TS, .NET, Java, PHP, Ruby, and WebAssembly, plus a CLI and MCP server.

Suggested change
-- [pdf_oxide](https://github.com/yfedoseev/pdf_oxide) - A fast Rust PDF library for text and image extraction, markdown conversion, and structured extraction, with bindings for Python, Go, JS/TS, .NET, Java, PHP, Ruby, and WebAssembly, plus a CLI and MCP server.
+- [PDF Oxide](https://github.com/yfedoseev/pdf_oxide) - Rust PDF library and CLI for text and image extraction and markdown conversion with bindings for Python, Go, JS, .NET, Java, PHP, Ruby, and MCP server.

@yfedoseev
Author

Thanks @avvertix! I've shortened it. I kept it almost exactly as you suggested; I only retained WASM (it's a first-class binding) and moved the MCP server out of the bindings list so the list reads as pure language bindings:

- [pdf_oxide](https://github.com/yfedoseev/pdf_oxide) - Fast Rust PDF library and CLI for text and image extraction and markdown conversion, with bindings for Python, Go, JS, .NET, Java, PHP, Ruby, and WASM.

Pushed in f4b06d2. Happy to trim further if you'd like it even shorter.
