Releases · itext/itext-pdfocr-java

01 Apr 08:39

nanouh

5.0.0

2cdd189

pdfOCR 5.0.0 Latest

This release of the pdfOCR add-on for iText Core not only supports PaddleOCR and EasyOCR models, but also offers some huge performance improvements and general OCR improvements across the board. Therefore it’s significant enough to warrant a major release version, bumping the version number to 5.0.0.

PaddleOCR/EasyOCR Model Support

The ML-based OCR engine is extended to include support for pretrained ONNX PaddleOCR and EasyOCR models, adding to the docTR models already supported. In our ongoing OCR tests, these models perform extremely well over a wide range of use cases and have extensive language support.

We now maintain a HuggingFace repository where you can download many compatible models to get started quickly. You’re free to experiment with alternative models, though some will need to be converted to the ONNX format to work with pdfOCR’s ONNX engine. The PaddleOCR documentation has details on converting PaddleOCR models to ONNX format. EasyOCR, however, does not provide official documentation on converting models.

GPU Acceleration

Optional GPU acceleration is also now enabled for pdfOCR’s ONNX engine, which not only lets the CPU handle other tasks but can also result in major performance gains.

ONNX Runtime supports multiple execution providers for hardware acceleration, although not all are ready for production. At present, we have only tested pdfOCR using Nvidia CUDA-enabled GPUs, so you should refer to ONNX Runtime’s official docs on execution providers for other hardware.

General OCR Improvements

Another nice change is that we've significantly improved how pdfOCR positions recognized text boxes for rotated content. This allows pdfOCR to better match the original orientation and placement of text, including small-angle rotations (not only 0°, 90°, 180°, and 270°, as previously).
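To illustrate what "small-angle rotations" means in practice, here is a minimal sketch of estimating a text box's rotation from its baseline corners, compared with snapping to the nearest right angle. This is not pdfOCR's implementation. The class and method names are hypothetical, and the corner coordinates are assumed to come from a text detector working in image pixels.

```java
// Illustrative only: names are hypothetical and not part of the pdfOCR API.
class TextBoxRotation {

    /**
     * Estimates the rotation of a text box from its baseline, i.e. the edge
     * running from the bottom-left corner to the bottom-right corner.
     * Coordinates are image pixels (y grows downward). The result is in
     * degrees, counter-clockwise as seen on screen, between -180 and 180.
     */
    static double rotationDegrees(double blX, double blY, double brX, double brY) {
        // Negate dy because image y grows downward, while angles are
        // conventionally measured with y pointing up.
        return Math.toDegrees(Math.atan2(-(brY - blY), brX - blX));
    }

    /**
     * Snaps an angle to the nearest multiple of 90 degrees, normalized to
     * 0, 90, 180 or 270. This is what an orientation-only approach reports.
     */
    static int snapToRightAngle(double degrees) {
        int snapped = (int) Math.round(degrees / 90.0) * 90;
        return ((snapped % 360) + 360) % 360;
    }

    public static void main(String[] args) {
        // A slightly tilted line of text: the baseline rises 10 px over 200 px.
        double angle = rotationDegrees(100, 200, 300, 190);
        System.out.printf("estimated: %.2f degrees, snapped: %d degrees%n",
                angle, snapToRightAngle(angle));
    }
}
```

Snapping discards the small tilt of roughly three degrees in this example, which is the kind of offset that would make an overlaid text box drift away from the scanned text toward the end of a long line.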

Additionally, support for retrieving OCR text bounding rectangles in image pixel coordinates rather than PDF coordinate space has been added, removing the need for manual conversion when working at the image level.
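For context, the manual conversion this feature makes unnecessary looks roughly like the sketch below. It is not the pdfOCR API. The class and method names are hypothetical, and it assumes the image fills the page at its scan resolution, with the default PDF user space of 72 units per inch.

```java
import java.util.Arrays;

// Illustrative only: names are hypothetical and not part of the pdfOCR API.
class PixelToPdfRect {

    /** Default PDF user space uses 72 units (points) per inch. */
    static final double POINTS_PER_INCH = 72.0;

    /**
     * Converts a rectangle given in image pixels (origin top-left, y grows
     * downward) to PDF points (origin bottom-left, y grows upward).
     * Returns {x, y, width, height}, where (x, y) is the lower-left corner.
     */
    static double[] toPdfRect(double left, double top, double width, double height,
                              double imageHeightPx, double dpi) {
        double scale = POINTS_PER_INCH / dpi;
        double x = left * scale;
        // Flip the y axis: the box's bottom edge in pixels becomes its
        // lower-left y coordinate, measured from the bottom of the page.
        double y = (imageHeightPx - (top + height)) * scale;
        return new double[] {x, y, width * scale, height * scale};
    }

    public static void main(String[] args) {
        // A 300 x 150 px box at (150, 300) on a 3300 px tall page scanned at 300 DPI.
        double[] rect = toPdfRect(150, 300, 300, 150, 3300, 300);
        System.out.println(Arrays.toString(rect));
    }
}
```

Going the other way (PDF points back to pixels) needs the same DPI and image height, which is why having the engine report pixel rectangles directly is convenient when you post-process the source image.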

Breaking Changes

Since this is a major version release, you can expect some breaking changes. The most important one is that the module for pdfOCR’s ONNX engine has been split up and renamed.

Since we now support more than just docTR ONNX models, the pdfocr-onnxtr package has been renamed and split into pdfocr-onnx-abstract and pdfocr-onnx-cpu. This change also accommodates GPU acceleration using the onnxruntime_gpu package.

See the breaking changes for details on the differences from previous releases of pdfOCR.

New features

DEVSIX-9706 – Support EasyOCR and PaddleOCR models for pdfOCR
DEVSIX-9740 – Support PdfOCR-Onnx execution on GPU
DEVSIX-9792 – Allow multiple files for IOcrEngine#doImageOcr input
DEVSIX-9793 – Add the ability to get rectangles in image pixels for TextInfo

Improvements

DEVSIX-9739 – PdfOCR: Improve text boxes taking into account arbitrary rotation
DEVSIX-9458 – Improve BY_LINES TextPositioning mode to handle whitespaces

08 Jan 13:57

aapsasha

4.1.2

233d93b

pdfOCR 4.1.2

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

There are no feature changes for this release. The only changes are to maintain compatibility with the iText Core 9.5.0 dependencies.

13 Nov 15:06

AnhelinaM

4.1.1

ff532a1

pdfOCR 4.1.1

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

There are no feature changes for this release. The only changes are to maintain compatibility with the iText Core 9.4.0 dependencies.

02 Sep 12:22

nanouh

4.1.0

d559048

pdfOCR 4.1.0

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

This release of pdfOCR brings a huge change: a new built-in OCR engine. It adds the pdfocr-onnxtr module, which implements the OnnxTR library for OCR tasks, with specific requirements for model predictors and resource management. It significantly improves recognition accuracy for English text and other Latin-based languages.

The Open Neural Network Exchange (ONNX) is an open standard format for machine learning models, enabling interoperability across various frameworks and tools. OnnxTR is a Python OCR library that wraps the popular OCR tool docTR and adds support for ONNX models.

It makes OCR processing faster and more accessible by running optimized ONNX models without requiring heavy frameworks. This allows OCR to be integrated into applications with minimal resource consumption, lightweight dependencies, a modular design, and support for multiple platforms. Using the existing pdfOCR API, we’ve simply added another OCR engine alongside the existing pdfOcr-tesseract4 module.

Not only that, but pdfOCR now directly supports PDF as input files. This can be a big benefit for OCR workflows, as it removes the need to process PDFs with iText Core to extract images from scanned documents.

You can find full details linked from the release notes on the iText Knowledge Base.

15 May 11:08

introfog

4.0.2

cabe77b

pdfOCR 4.0.2

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

There are no feature changes for this release. The only changes are to maintain compatibility with the iText Core 9.2.0 dependencies.

14 Feb 13:09

nanouh

4.0.1

7a1490f

pdfOCR 4.0.1

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

This version improves memory usage when using the Tesseract 4 engine for OCR text extraction.

Bug fixes

Improved memory usage

18 Nov 09:58

AnhelinaM

4.0.0

eb25fb0

pdfOCR 4.0.0

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

For this release the version number has been bumped for compatibility with iText Core 9.0 and License Key Library 4.2.0.

In addition, it includes a fix for CVE-2024-47554, which results from the use of the Apache Commons IO library. This was resolved by updating commons-io from version 2.11.0 to 2.14.0.

Bug fixes

Fix CVE-2024-47554 which comes from commons-io

07 Feb 14:31

StryhelskiAndrei

3.0.2

78c4b90

pdfOCR 3.0.2

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

In this release we’ve added support for pdfOCR to intelligently recognize table data and convert it into the correct tag structure in the resulting PDF documents.

A bug that caused an incorrect font size to be selected for particularly small text was also fixed.

New features

Table recognition support

Bug fixes

Incorrect font size for small text in the PDFs generated with pdfOCR

25 Oct 15:08

StryhelskiAndrei

3.0.1

da196c6

pdfOCR 3.0.1

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

For this release, the artifact names have been changed to reflect the new naming structure. In addition, since Bouncy Castle is a dependency for tests, the .NET version has been updated to use the latest Bouncy Castle version, 2.2.1.

Improvements

Updated .NET Bouncy Castle dependency to 2.2.1

10 May 12:43

AnhelinaM

3.0.0

9197562

pdfOCR 3.0.0

pdfOCR is our add-on for iText Core to perform OCR on documents and images.

This release resolves an incompatibility between JDK 19 and the Leptonica library that could result in a MethodTooLargeException. Otherwise, this release is for compatibility with the iText Core version 8.x.x release.

Bug fixes

Resolved incompatibility issue with JDK19 and Leptonica library.
