feat(data-sanitization): pattern groups, strict matching, cookieAndFormEncodedMatcher #321

Merged
ioncache merged 3 commits into main from feat/018-pattern-and-matcher-additions
May 27, 2026

Conversation

@ioncache (Owner) commented May 27, 2026

Overview

Adds pattern groups and strict field-name matching, and merges cookie and form-encoded handling into a single matcher, cookieAndFormEncodedMatcher (renamed from formEncodedMatcher).

Details

  • PatternEntry type: customPatterns now accepts string | { match: string; strict?: boolean }. Plain strings continue to use substring matching; objects with strict: true match only the exact field name.
  • Pattern constants: Splits the flat default list into credentialPatterns, headerPatterns, piiPatterns, and phiPatterns. All four are exported from the package. defaultPatterns now combines credentials + headers (adds authorization and api-key to the defaults).
  • cookieAndFormEncodedMatcher: Renames formEncodedMatcher; stops at both & and ; so it handles URL form-encoded strings and HTTP Cookie headers in one matcher. Strict mode uses a negative lookbehind (?<![\w-]) to reject substring matches (e.g. token does not match inside session_token=abc).
  • normalizeEntry helper: Normalises PatternEntry to { match, strict } for use in buildPatterns, buildStringScanRegexes, and objectReplacer key matching.
  • ignorePatterns filter: Updated to match against the match string of each entry, so it works correctly with object-form patterns.
  • Tests: constants.test.ts added; matchers.test.ts expanded with cookie-style, strict mode, and renamed matcher tests (232 → 262 tests).
  • Docs: README updated with pattern groups section, PatternEntry type in options table, piiPatterns/phiPatterns usage examples, and renamed matcher description. performance.md updated to reflect renamed matcher, revised removeMatches string characterisation, and hardware-range cold-start ratio. CLAUDE.md updated to reflect the renamed matcher and new constants structure.
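
As a rough illustration of the PatternEntry shape and the normalisation step listed above, here is a minimal local sketch. The type is copied from this description; the function body is an assumption (string entries become non-strict, an omitted strict flag defaults to false) and is not the code in replacers.ts.

```typescript
// Shape taken from the PR description.
type PatternEntry = string | { match: string; strict?: boolean };

interface NormalizedPattern {
  match: string;
  strict: boolean;
}

// Hypothetical re-implementation for illustration only.
// Plain strings keep substring semantics, so they normalise to strict: false.
function normalizeEntry(entry: PatternEntry): NormalizedPattern {
  if (typeof entry === 'string') {
    return { match: entry, strict: false };
  }
  return { match: entry.match, strict: entry.strict ?? false };
}

// Example entries are made up for this sketch.
const exampleCustomPatterns: PatternEntry[] = [
  'token',                       // substring match
  { match: 'id', strict: true }, // exact field name only
  { match: 'email' },            // object form, strict omitted
];

console.log(exampleCustomPatterns.map(normalizeEntry));
```

Normalising once up front means every later stage can read entry.match and entry.strict without re-checking whether the entry was a string or an object.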

Related Tickets and/or Pull Requests

Relates to #320

Checklist

  • Tests added or updated
  • README and TSDoc updated if the public API changed
  • Breaking changes called out (if any)
  • Roadmap item checked off if this PR completes one

Breaking change: formEncodedMatcher is renamed to cookieAndFormEncodedMatcher. Callers importing the named export directly will need to update the import name.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Strict pattern-matching mode for exact field identification.
    • New granular pattern sets for credentials, headers, PII, and PHI.
    • Enhanced cookie-and-form-encoded detection for sanitization.
  • Documentation

    • Updated usage guides, examples, and “exact vs substring” guidance.
    • Refined performance notes and default-patterns explanations.
  • Tests

    • Expanded tests for strict matching, pattern coverage, and encodings.

ioncache and others added 2 commits May 27, 2026 11:50
…rmEncodedMatcher (#321)

- Add PatternEntry type (string | { match: string; strict?: boolean }) for
  exact vs. substring field-name control
- Add credentialPatterns, headerPatterns, piiPatterns, phiPatterns constants;
  export all from index
- defaultPatterns now includes headerPatterns (authorization, api-key) in
  addition to credentialPatterns
- Rename formEncodedMatcher → cookieAndFormEncodedMatcher; stops at both &
  and ; so it handles URL form-encoded and HTTP Cookie headers in one matcher
- Add strict param to all three matchers; cookie matcher uses a negative
  lookbehind (?<![\w-]) to reject substring matches in strict mode
- normalizeEntry helper in replacers.ts handles PatternEntry in all contexts
- ignorePatterns filter updated to match against normalizeEntry(p).match
- objectReplacer key matchers use ^pattern$ for strict, [\w-]*pattern[\w-]*
  for non-strict
- Add test/constants.test.ts; expand matchers.test.ts with cookie-style,
  strict mode, and renamed cookieAndFormEncodedMatcher tests (262 tests total)
- Update README and CLAUDE.md to reflect renamed matcher, pattern groups,
  PatternEntry type, and piiPatterns/phiPatterns usage examples

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
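
The two key-matching forms named in the commit message above (^pattern$ for strict, [\w-]*pattern[\w-]* for non-strict) can be sketched as follows. buildKeyRegex and escapeRegExp are hypothetical names, and both the escaping step and the absence of regex flags are assumptions; the commit message does not say how objectReplacer handles either.

```typescript
// Escape regex metacharacters so the pattern text is matched literally.
// Assumption: the PR does not state whether the real code escapes patterns.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build the regex used to test an object key.
function buildKeyRegex(match: string, strict: boolean): RegExp {
  const escaped = escapeRegExp(match);
  return strict
    ? new RegExp(`^${escaped}$`)                // whole key must equal the pattern
    : new RegExp(`[\\w-]*${escaped}[\\w-]*`);   // pattern may appear anywhere in the key
}

const strictToken = buildKeyRegex('token', true);
const looseToken = buildKeyRegex('token', false);

console.log(strictToken.test('token'));         // true
console.log(strictToken.test('session_token')); // false
console.log(looseToken.test('session_token'));  // true
console.log(looseToken.test('username'));       // false
```

Strict mode is useful for short patterns such as id or name, where substring matching would also hit unrelated keys.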
…edMatcher

- Rename 'Form-encoded matcher and multiline strings' section to
  'Cookie and form-encoded matcher and multiline strings'
- Update value stop-char description to mention & and ; both
- Remove outdated claim that string removal is 10-20% slower than
  masking; benchmarks show cost is comparable
- Widen cold-start ratio to 15-32x range to reflect hardware variance

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai (Contributor, bot) commented May 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 47c51e79-4a4a-42d7-b371-e44df2da6d8b

📥 Commits

Reviewing files that changed from the base of the PR and between 9b5281b and 1b04d52.

📒 Files selected for processing (1)
  • packages/data-sanitization/test/replacers.test.ts

📝 Walkthrough

This PR refactors the data-sanitization pattern system to support granular categorization and strict exact-matching for field names. It introduces PatternEntry as a union type supporting both substring patterns (strings) and strict exact-match patterns (objects with match and optional strict flag), replaces formEncodedMatcher with cookieAndFormEncodedMatcher to handle both cookie and form-encoded delimiters, and updates sanitization logic throughout to respect the strictness mode when generating regex matchers.
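
To make the delimiter and strictness behaviour concrete, here is a simplified sketch of a matcher that stops at both & and ; and uses the (?<![\w-]) lookbehind in strict mode. sketchCookieAndFormMatcher and maskValues are hypothetical names; this is not the library's cookieAndFormEncodedMatcher, and the real capture-group layout may differ.

```typescript
// Escape regex metacharacters so the field name is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical sketch of a combined cookie / form-encoded matcher.
// Group 1 captures "key="; the value runs until the next & or ; (or end of input).
function sketchCookieAndFormMatcher(field: string, strict = false): RegExp {
  const name = escapeRegExp(field);
  const key = strict
    ? `(?<![\\w-])${name}`      // nothing word-like may precede the field name
    : `[\\w-]*${name}[\\w-]*`;  // field name may be part of a longer key
  return new RegExp(`(${key}=)[^&;]*`, 'g');
}

// Replace each matched value with a fixed mask, keeping the key.
function maskValues(input: string, field: string, strict = false): string {
  return input.replace(sketchCookieAndFormMatcher(field, strict), '$1***');
}

console.log(maskValues('user=mark&password=secret&page=2', 'password'));
// user=mark&password=***&page=2
console.log(maskValues('session_token=abc; theme=dark', 'token'));
// session_token=***; theme=dark
console.log(maskValues('session_token=abc; token=xyz', 'token', true));
// session_token=abc; token=***
```

Excluding both delimiters from the value character class is what lets one regex cover URL form-encoded strings and Cookie headers.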

Changes

Pattern categorization and strict field-name matching

| Layer | File(s) | Summary |
| --- | --- | --- |
| Type definitions and pattern categories | packages/data-sanitization/src/types.ts, packages/data-sanitization/src/constants.ts | New PatternEntry type supports both substring matching (plain string) and strict exact matching ({ match: string; strict?: boolean }). Pattern constants now export granular categories: credentialPatterns, headerPatterns, piiPatterns, phiPatterns, with defaultPatterns combining only credentials and headers as the applied default. |
| Matcher implementations with strict parameter | packages/data-sanitization/src/matchers.ts | New cookieAndFormEncodedMatcher handles both cookie (;-delimited) and form (&-delimited) key/value input with boundary refinements. jsonMatcher and escapedJsonMatcher gain strict parameter: strict mode generates exact field-name matches (^...$), non-strict retains substring-friendly matching. |
| Pattern normalization and sanitization with strict matching | packages/data-sanitization/src/replacers.ts | normalizeEntry converts PatternEntry into { match, strict }. String scanning (buildStringScanRegexes) incorporates strict into cache key and builds per-pattern regexes via matchers with strict argument. Object key matching (objectReplacer) generates exact-match regexes for strict patterns and broader substring-style regexes for non-strict. |
| Public API surface expansion | packages/data-sanitization/src/index.ts | Re-exports PatternEntry type and pattern constants (credentialPatterns, defaultPatterns, headerPatterns, phiPatterns, piiPatterns). |
| Documentation updates | CLAUDE.md, packages/data-sanitization/README.md, packages/data-sanitization/docs/performance.md | CLAUDE.md reflects cookieAndFormEncodedMatcher and pattern category exports. README adds "Exact vs. substring matching" section, guides users toward piiPatterns/phiPatterns constants, and updates matcher/options examples. Performance.md refines cold-start, removeMatches, and cache memory guidance with focus on regex compilation and strict matching overhead. |
| Test coverage for patterns and strict matching | packages/data-sanitization/test/constants.test.ts, packages/data-sanitization/test/matchers.test.ts | New constants.test.ts validates each pattern category is non-empty, contains expected match strings, verifies defaultPatterns includes all credential/header but no PII/PHI patterns. Matchers tests migrate to cookieAndFormEncodedMatcher, add cookie-style (;-delimited) coverage, and introduce strict matching suites for jsonMatcher and escapedJsonMatcher ensuring exact vs substring behavior. |
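
The composition of the pattern groups and the ignorePatterns behaviour summarised above can be sketched as follows. Only authorization and api-key are named in this PR; every other entry is a placeholder, and applyIgnore is a hypothetical helper that assumes the filter compares ignore strings to each entry's match string by equality.

```typescript
type PatternEntry = string | { match: string; strict?: boolean };

// Placeholder contents: only "authorization" and "api-key" are named in this PR.
const credentialPatterns: PatternEntry[] = ['password', 'token'];
const headerPatterns: PatternEntry[] = ['authorization', 'api-key'];
const piiPatterns: PatternEntry[] = ['email', { match: 'name', strict: true }];
const phiPatterns: PatternEntry[] = ['diagnosis'];

// Defaults combine credentials and headers only; PII and PHI are opt-in.
const defaultPatterns: PatternEntry[] = [...credentialPatterns, ...headerPatterns];

// Read the match string regardless of entry form.
const matchOf = (entry: PatternEntry): string =>
  typeof entry === 'string' ? entry : entry.match;

// Hypothetical ignore filter: drop entries whose match string is listed.
// Assumption: equality comparison; the PR does not spell out the exact rule.
function applyIgnore(patterns: PatternEntry[], ignorePatterns: string[]): PatternEntry[] {
  return patterns.filter(
    (entry) => !ignorePatterns.some((ignored) => ignored === matchOf(entry)),
  );
}

console.log(defaultPatterns.map(matchOf));
// [ 'password', 'token', 'authorization', 'api-key' ]
console.log(applyIgnore([...piiPatterns, ...phiPatterns], ['name']).map(matchOf));
// [ 'email', 'diagnosis' ]
```

Comparing against the match string rather than the entry itself is what allows an ignore list of plain strings to remove object-form entries such as { match: 'name', strict: true }.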

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 Patterns now speak with clarity and zeal,
Credentials, headers, PII—each with their deal,
Strict matching guards the ambiguous few,
While cookies and forms get their delimiter too,
From substring to exact, the rabbit hops through! 🎯

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the three main changes: introducing pattern groups, adding strict matching support, and renaming the matcher to cookieAndFormEncodedMatcher. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@github-actions (bot) commented May 27, 2026

Coverage Report for packages/data-sanitization-log-providers

| Status | Category | Percentage | Covered / Total |
| --- | --- | --- | --- |
| 🔵 | Lines | 100% | 57 / 57 |
| 🔵 | Statements | 100% | 58 / 58 |
| 🔵 | Functions | 100% | 11 / 11 |
| 🔵 | Branches | 100% | 36 / 36 |

File Coverage: No changed files found.
Generated in workflow #20 for commit 1b04d52 by the Vitest Coverage Report Action

…Replacer

Add two tests for object-form customPatterns entries:
- strict: true → exact key match only (covers the ^pattern$ branch)
- strict omitted → substring match (covers the strict ?? false branch)

Restores 100% branch coverage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

Coverage Report for packages/data-sanitization

| Status | Category | Percentage | Covered / Total |
| --- | --- | --- | --- |
| 🔵 | Lines | 100% | 246 / 246 |
| 🔵 | Statements | 100% | 252 / 252 |
| 🔵 | Functions | 100% | 32 / 32 |
| 🔵 | Branches | 100% | 187 / 187 |

File Coverage (changed files)

| File | Stmts | Branches | Functions | Lines | Uncovered Lines |
| --- | --- | --- | --- | --- | --- |
| packages/data-sanitization/src/index.ts | 100% | 100% | 100% | 100% | |
| packages/data-sanitization/src/matchers.ts | 100% | 100% | 100% | 100% | |
| packages/data-sanitization/src/replacers.ts | 100% | 100% | 100% | 100% | |
Generated in workflow #20 for commit 1b04d52 by the Vitest Coverage Report Action

@coderabbitai (Contributor, bot) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
packages/data-sanitization/README.md (1)

255-277: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the PII/PHI example output for health_card.

The example masks health_card, but that key is not in piiPatterns/phiPatterns shown in this PR, so this output is likely incorrect. Either add a matching pattern in the example config or update the expected output.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/data-sanitization/README.md` around lines 255 - 277, The README
example is inconsistent: the patient object includes health_card but neither
piiPatterns nor phiPatterns (used in sanitizeData with customPatterns) define a
pattern for the "health_card" key, so the expected masked output is wrong;
update the example by either adding a matching pattern for "health_card" to the
patterns array referenced (piiPatterns or phiPatterns) or change the expected
output to show health_card unmasked, and ensure you reference the same symbols
(patient, sanitizeData, customPatterns, piiPatterns, phiPatterns, health_card)
so the example and config stay in sync.
packages/data-sanitization/test/matchers.test.ts (2)

268-279: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Tighten the semicolon delimiter test to assert the intended behavior.

Line 277 only checks toContain(';'), which passes whether ; is treated as a delimiter or consumed as value content. This test currently won’t catch regressions in delimiter handling.

Proposed test assertion update
-    it('should not treat a semicolon as a field delimiter', () => {
+    it('should treat a semicolon as a field delimiter', () => {
       // Arrange
       const matcher = cookieAndFormEncodedMatcher('password');
       const testData = 'password=secret;username=mark';

       // Act
       const allMatches = [...testData.matchAll(matcher)];

-      // Assert — semicolons are not in the stop-character set; value captures past the semicolon
+      // Assert
       expect(allMatches.length).toBe(1);
-      expect(allMatches[0]?.[0]).toContain(';');
+      expect(allMatches[0]?.[0]).toBe('password=secret;');
+      expect(allMatches[0]?.[1]).toBe('password=');
+      expect(allMatches[0]?.[2]).toBe(';');
     });

As per coding guidelines **/*.test.{js,ts,jsx,tsx}: Unit tests should follow best practices for test organization, assertions, and coverage.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/data-sanitization/test/matchers.test.ts` around lines 268 - 279, The
test for cookieAndFormEncodedMatcher is too weak: replace the loose
toContain(';') assertion with precise checks that the regex stops at the
semicolon, matching the proposed assertions above: assert the full match equals
'password=secret;' (allMatches[0]?.[0]), the first capture group equals
'password=' (allMatches[0]?.[1]), and the second capture group equals ';'
(allMatches[0]?.[2]) so the test fails if the semicolon is consumed as value
content instead of acting as a delimiter.

230-240: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace permissive assertions with deterministic expectations.

Line 239 (>= 1) is overly permissive, and Line 897 (>= 0) is always true. These assertions don’t reliably validate matcher behavior.

Proposed assertion tightening
     it('should match a field with an empty value', () => {
@@
-      // Assert — document current behavior when the value is empty
-      expect(allMatches.length).toBeGreaterThanOrEqual(1);
+      // Assert
+      expect(allMatches.length).toBe(1);
+      expect(allMatches[0]?.[0]).toBe('password=&');
     });
@@
     it('should handle values that contain escaped backslashes', () => {
@@
-      // Assert — the \\\\  in the value may interact with the regex stop pattern; document current behavior
-      expect(allMatches.length).toBeGreaterThanOrEqual(0);
+      // Assert
+      expect(allMatches.length).toBe(1);
+      expect(allMatches[0]?.[0]).toBe('\\"password\\":\\"sec\\\\ret\\"');
     });

As per coding guidelines **/*.test.{js,ts,jsx,tsx}: Unit tests should follow best practices for test organization, assertions, and coverage.

Also applies to: 887-898

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/data-sanitization/test/matchers.test.ts` around lines 230 - 240, The
test using cookieAndFormEncodedMatcher('password') is asserting a permissive
expect(allMatches.length).toBeGreaterThanOrEqual(1) which is non-deterministic;
change it to assert exact, deterministic behavior by checking that
allMatches.length equals the expected count (e.g., 1) and verify the matched
result content (e.g., the captured name is "password" and the captured value is
the empty string) using the iterator result from testData.matchAll(matcher).
Replace the >= assertions in this test and the similar assertions around lines
887–898 with precise expectations for length and captured groups so the matcher
behavior is unambiguous.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/data-sanitization/src/index.ts`:
- Around line 14-20: The root module index.ts currently re-exports
credentialPatterns, defaultPatterns, headerPatterns, phiPatterns, and
piiPatterns which expands the public API; revert index.ts to export only the
sanitizeData function and remove those named exports from this file, and instead
expose them from a dedicated submodule (e.g., a new patterns or constants
entrypoint) so consumers who need
credentialPatterns/defaultPatterns/headerPatterns/phiPatterns/piiPatterns import
them from that subpath rather than from the package root; update any internal
imports to reference the new submodule and ensure sanitizeData remains the sole
export from packages/data-sanitization/src/index.ts.

In `@packages/data-sanitization/src/replacers.ts`:
- Around line 19-24: Add unit tests to cover the uncovered branches by
exercising normalizeEntry and objectReplacer: write tests that pass a
PatternEntry as an object to normalizeEntry (with and without the strict
property) and assert it returns {match, strict} with strict defaulting to false
when omitted; also add tests for objectReplacer that verify strict key matching
semantics by using patterns wrapped with ^...$ (should only match exact keys)
versus plain substring patterns (should match keys containing the substring).
Target the normalizeEntry function and the objectReplacer behavior to trigger
both branches reported as missing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: cdf1af13-a5df-48e1-bd0c-b6a89454072b

📥 Commits

Reviewing files that changed from the base of the PR and between bee068c and 9b5281b.

📒 Files selected for processing (10)
  • CLAUDE.md
  • packages/data-sanitization/README.md
  • packages/data-sanitization/docs/performance.md
  • packages/data-sanitization/src/constants.ts
  • packages/data-sanitization/src/index.ts
  • packages/data-sanitization/src/matchers.ts
  • packages/data-sanitization/src/replacers.ts
  • packages/data-sanitization/src/types.ts
  • packages/data-sanitization/test/constants.test.ts
  • packages/data-sanitization/test/matchers.test.ts

Comment thread packages/data-sanitization/src/index.ts
Comment thread packages/data-sanitization/src/replacers.ts
@ioncache ioncache merged commit 737ecc6 into main May 27, 2026
10 checks passed
@ioncache ioncache deleted the feat/018-pattern-and-matcher-additions branch May 27, 2026 16:26