- Added a new internal engine boundary in `datafog/engine.py`:
  - `scan()`
  - `redact()`
  - `scan_and_redact()`
  - dataclasses: `Entity`, `ScanResult`, `RedactResult`
- Updated core compatibility layers (`datafog.core`, `datafog.main`, CLI paths) to delegate through the engine interface.
- Added `EngineNotAvailable` error for clear optional dependency failures.
- Improved smart engine behavior for graceful fallback when optional NLP dependencies are unavailable.
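
To illustrate the shape of the engine boundary listed above, here is a minimal, self-contained sketch. It reuses the names from the changelog (`scan()`, `redact()`, `scan_and_redact()`, `Entity`, `ScanResult`, `RedactResult`, `EngineNotAvailable`), but the field layout, signatures, the `[TYPE]` placeholder format, and the single email regex are all assumptions made for illustration; the actual `datafog/engine.py` may differ.

```python
import re
from dataclasses import dataclass, field


# Hypothetical field layout; the real dataclasses may carry different fields.
@dataclass(frozen=True)
class Entity:
    type: str
    text: str
    start: int
    end: int


@dataclass
class ScanResult:
    text: str
    entities: list[Entity] = field(default_factory=list)


@dataclass
class RedactResult:
    redacted_text: str
    entities: list[Entity] = field(default_factory=list)


class EngineNotAvailable(RuntimeError):
    """Raised when a requested engine cannot be used (assumed semantics)."""


# Deliberately simple stand-in pattern, not DataFog's email regex.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scan(text: str, engine: str = "regex") -> ScanResult:
    # This sketch only implements a regex engine; anything else fails loudly.
    if engine != "regex":
        raise EngineNotAvailable(f"engine {engine!r} is not available in this sketch")
    entities = [
        Entity("EMAIL", m.group(), m.start(), m.end()) for m in _EMAIL.finditer(text)
    ]
    return ScanResult(text=text, entities=entities)


def redact(text: str, entities: list[Entity]) -> RedactResult:
    # Replace from right to left so earlier offsets stay valid.
    out = text
    for e in sorted(entities, key=lambda e: e.start, reverse=True):
        out = out[: e.start] + f"[{e.type}]" + out[e.end :]
    return RedactResult(redacted_text=out, entities=list(entities))


def scan_and_redact(text: str, engine: str = "regex") -> RedactResult:
    return redact(text, scan(text, engine).entities)
```

Keeping detection (`scan`) and rewriting (`redact`) behind separate functions is what lets compatibility layers delegate to one interface regardless of which engine produced the entities.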
- Added a corpus-driven detection accuracy suite:
  - `tests/corpus/structured_pii.json`
  - `tests/corpus/unstructured_pii.json`
  - `tests/corpus/mixed_pii.json`
  - `tests/corpus/negative_cases.json`
  - `tests/corpus/edge_cases.json`
  - `tests/test_detection_accuracy.py`
- Improved regex patterns for email, date/year handling, SSN boundaries, and strict IPv4 matching.
- Added explicit `xfail` markers for known model limitations in select smart/NER corpus cases.
- Added engine API tests in `tests/test_engine_api.py`.
- Added agent API tests in `tests/test_agent_api.py`.
- Updated Spark integration tests to skip cleanly when Java is not available.
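
As a rough illustration of what "SSN boundaries" and "strict IPv4 matching" can mean in practice, the sketch below shows one common way to write such patterns. These are not the patterns shipped in DataFog; the names `IPV4` and `SSN` and the exact expressions are assumptions chosen for the example.

```python
import re

# One octet in the range 0-255, without leading zeros.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# Four octets, not embedded in a longer dotted or numeric run.
# The lookarounds reject "1.2.3.4.5" but still accept a sentence-final period.
IPV4 = re.compile(rf"(?<!\d)(?<!\d\.){_OCTET}(?:\.{_OCTET}){{3}}(?!\d|\.\d)")

# NNN-NN-NNNN that is not part of a longer digit sequence.
SSN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")

print(IPV4.findall("ping 192.168.1.1."))
print(IPV4.search("999.1.1.1"))
print(SSN.findall("SSN 123-45-6789"))
print(SSN.search("9123-45-67890"))
```

Negative cases like `999.1.1.1` or an SSN-shaped substring inside a longer number are the kind of inputs a `negative_cases.json` corpus would typically cover.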
- Added `datafog/agent.py` with:
  - `sanitize()`
  - `scan_prompt()`
  - `filter_output()`
  - `create_guardrail()`
  - `Guardrail` and `GuardrailWatch`
- Exported agent-oriented API from top-level `datafog` package.
- Updated GitHub Actions CI matrix to test Python `3.10`, `3.11`, and `3.12` across `core`, `nlp`, and `nlp-advanced` profiles.
- Added coverage enforcement thresholds in CI (line and branch).
- Added a dedicated corpus accuracy run in CI.
- Rewrote `README.md` with validated, copy-pasteable examples and a dedicated LLM guardrails section.
- Added/updated audit reports under `docs/audit/`.
- GLiNER Integration: Added modern Named Entity Recognition engine with GLiNER (Generalist Model for NER)
  - New `gliner` engine option in TextService providing 32x performance improvement over spaCy
  - PII-specialized model support (`urchade/gliner_multi_pii-v1`) for enhanced accuracy
  - Custom entity type configuration for domain-specific detection
  - Automatic model downloading and caching functionality
- Smart Cascading Engine: Introduced intelligent multi-engine approach
  - New `smart` engine that progressively tries regex → GLiNER → spaCy
  - Configurable stopping criteria based on entity count thresholds
  - Optimized for best accuracy/performance balance (60x average speedup)
- Enhanced CLI Model Management: Extended command-line interface
  - `--engine` flag support for `download-model` and `list-models` commands
  - GLiNER model discovery and management capabilities
  - Unified model management across spaCy and GLiNER engines
- Optional Dependencies: Added new `nlp-advanced` extra for GLiNER dependencies
  - `pip install datafog[nlp-advanced]` for GLiNER + PyTorch + Transformers
  - Maintained lightweight core architecture (< 2 MB)
  - Graceful degradation when GLiNER dependencies are unavailable
- Engine Ecosystem: Expanded from 3 to 5 annotation engines
  - `regex`: 190x faster, structured PII detection (core only)
  - `gliner`: 32x faster, modern NER with custom entities
  - `spacy`: Traditional NLP, comprehensive entity recognition
  - `smart`: Cascading approach for optimal accuracy/speed
  - `auto`: Legacy regex→spaCy fallback
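
The cascading idea behind the `smart` engine (try regex, then GLiNER, then spaCy, stopping once an entity-count threshold is met, and degrading gracefully when a dependency is missing) can be sketched in a few lines. This is an illustrative model only: `cascade`, `min_entities`, and the three stand-in detectors are hypothetical names, and the real stopping criteria in DataFog may be more involved.

```python
import re
from typing import Callable

Detector = Callable[[str], list[str]]


def cascade(
    text: str,
    stages: list[tuple[str, Detector]],
    min_entities: int = 1,
) -> tuple[str, list[str]]:
    """Run detectors in order; stop at the first one that finds enough entities."""
    used, found = "none", []
    for name, detect in stages:
        try:
            found = detect(text)
        except ImportError:
            # Optional dependency missing: skip this stage and keep going.
            continue
        used = name
        if len(found) >= min_entities:
            break
    return used, found


# Stand-in detectors; real engines would call regex patterns, GLiNER, and spaCy.
def regex_stage(text: str) -> list[str]:
    return re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)


def missing_stage(text: str) -> list[str]:
    raise ImportError("gliner is not installed")


def name_stage(text: str) -> list[str]:
    return re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)


stages = [("regex", regex_stage), ("gliner", missing_stage), ("spacy", name_stage)]
print(cascade("mail bob@example.com", stages))
print(cascade("please call Jane Doe today", stages))
```

Ordering the stages from cheapest to most expensive is what makes the average cost low: structured PII is caught by the first stage, and the heavier models only run when the cheap one comes back empty.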
- Validated Performance: Comprehensive benchmarking across all engines
  - GLiNER: 32x faster than spaCy with superior NER accuracy
  - Smart cascading: 60x average speedup with highest accuracy scores
  - Regex: Maintained 190x performance advantage
- Comprehensive Testing: Added 19 new test cases for GLiNER integration
  - Full coverage of GLiNER annotator functionality
  - Graceful degradation testing for missing dependencies
  - Smart cascading logic validation
  - Cross-engine integration testing
- Updated Documentation: Comprehensive guides and examples
  - README performance comparison table with all 5 engines
  - Engine selection guidance with use case recommendations
  - GLiNER model management and CLI usage examples
  - Installation options for different dependency combinations
- Developer Guide: Streamlined development documentation
  - Updated architecture overview with GLiNER integration
  - Performance requirements and testing strategies
  - Common development patterns and best practices
- Engine Options: New engine types added to TextService
  - Existing code using `engine="auto"` continues to work unchanged
  - New engines `gliner` and `smart` require `[nlp-advanced]` extra
- New Optional Dependencies (nlp-advanced extra):
  - `gliner>=0.2.5`
  - `torch>=2.1.0,<2.7`
  - `transformers>=4.20.0`
  - `huggingface-hub>=0.16.0`

For users upgrading from v4.1.1:

- All existing functionality remains unchanged
- To use GLiNER: `pip install datafog[nlp-advanced]`
- Smart cascading: `TextService(engine="smart")` for best balance
- CLI: Use `--engine gliner` flag for GLiNER model management

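
For code that has to run both with and without the `[nlp-advanced]` extra, one possible upgrade pattern is to choose the engine name at runtime. The helper below is a hypothetical sketch (`pick_engine` is not part of DataFog); it assumes that the presence of an importable `gliner` package is a good enough signal that the `smart` engine can be used, and otherwise keeps the existing `auto` behavior.

```python
import importlib.util
from typing import Callable, Optional


def pick_engine(is_installed: Optional[Callable[[str], bool]] = None) -> str:
    """Return "smart" when GLiNER appears to be installed, else "auto"."""
    if is_installed is None:
        # find_spec returns None for a top-level package that cannot be found.
        is_installed = lambda name: importlib.util.find_spec(name) is not None
    return "smart" if is_installed("gliner") else "auto"


# The result can then be passed on, e.g. TextService(engine=pick_engine()).
print(pick_engine())
```

The optional `is_installed` parameter exists only to make the choice easy to test without installing or uninstalling packages.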
- Added engine selection functionality to TextService class, allowing users to choose between 'regex', 'spacy', or 'auto' annotation engines
- Enhanced TextService with intelligent fallback mechanism in 'auto' mode that tries regex first and falls back to spaCy if no entities are found
- Added comprehensive integration tests for the new engine selection feature
- Implemented performance benchmarks showing regex engine is ~123x faster than spaCy
- Added CI pipeline for continuous performance monitoring with regression detection
- Added wheel-size gate (< 8 MB) to CI pipeline
- Added 'When do I need spaCy?' guidance to documentation
- Created scripts for running benchmarks locally and comparing results
- Improved documentation with performance metrics and engine selection guidance
- Extended .gitignore to better handle build artifacts and development files
- Added GitHub Actions workflows for testing, linting, and benchmarking
- Pinned all dependency versions in requirements.txt and requirements-dev.txt for reproducible builds
- Added mypy type checking to CI pipeline
- Added ruff linting to development dependencies
- Finalized stable release, no breaking changes from 4.1.0b5
- Added datafog-python/examples/uploading-file-types.ipynb to show JSON uploading example (#16)
- Added datafog-python/tests/regex_issue.py to show issue with regex recognizer creation
- Moved versioning to separate invocable function in setup.py