Merged
12 changes: 7 additions & 5 deletions CLAUDE.md
@@ -29,14 +29,16 @@ a single function, `sanitizeData`, exported from `packages/data-sanitization/src
**Key modules:**

- `packages/data-sanitization/src/matchers.ts` — Three built-in `DataSanitizationMatcher`
factories (`jsonMatcher`, `escapedJsonMatcher`, `formEncodedMatcher`). Each takes a pattern string
and optional `remove` flag and returns a `RegExp`. Custom matchers must produce a global,
case-insensitive regex using capture groups `$1`/`$2` for value replacement.
factories (`cookieAndFormEncodedMatcher`, `jsonMatcher`, `escapedJsonMatcher`). Each takes a
pattern string, optional `remove` flag, and optional `strict` flag and returns a `RegExp`. Custom
matchers must produce a global, case-insensitive regex using capture groups `$1`/`$2` for value
replacement.
- `packages/data-sanitization/src/replacers.ts` — `stringReplacer` and `objectReplacer`. String
replacer iterates all (pattern × matcher) combinations. Object replacer builds `RegExp` key
matchers once, then recurses with a `WeakSet` to detect circular references.
- `packages/data-sanitization/src/constants.ts` — Default field-name patterns (`apikey`,
`api_key`, `password`, `secret`, `token`) and default mask (`**********`).
- `packages/data-sanitization/src/constants.ts` — Pattern groups: `credentialPatterns`,
`headerPatterns`, `piiPatterns`, `phiPatterns`, and `defaultPatterns` (credentials + headers).
Also exports `DEFAULT_PATTERN_MASK` and `DEFAULT_NUMERIC_MASK`.
- `packages/data-sanitization/src/types.ts` — All exported TypeScript types
(`DataSanitizationMatcher`, `DataSanitizationReplacer`, `DataSanitizationReplacerOptions`, etc.).
- `packages/data-sanitization/src/errors.ts` — `DataSanitizationError` with a `details` property;
88 changes: 63 additions & 25 deletions packages/data-sanitization/README.md
@@ -246,25 +246,11 @@ sanitizeData(

### Sanitize PII and PHI with custom patterns

Use `customPatterns` to mask fields that are sensitive for your domain, such as
PII or PHI fields.
Use the exported `piiPatterns` and `phiPatterns` constants — or build your own
list — and pass them via `customPatterns`.

```typescript
import { sanitizeData } from 'data-sanitization';

const sensitivePatterns = [
'address',
'date_of_birth',
'email',
'emergency_contact',
'full_name',
'health_card',
'ip_address',
'medications',
'phone',
'postal_code',
'ssn',
];
import { sanitizeData, piiPatterns, phiPatterns } from 'data-sanitization';

const patient = {
accountId: 'acct_123',
@@ -277,7 +263,7 @@ const patient = {
};

sanitizeData(patient, {
customPatterns: sensitivePatterns,
customPatterns: [...piiPatterns, ...phiPatterns],
useDefaultPatterns: false,
});
// => {
@@ -296,7 +282,7 @@ masking them.

```typescript
sanitizeData(patient, {
customPatterns: sensitivePatterns,
customPatterns: [...piiPatterns, ...phiPatterns],
removeMatches: true,
useDefaultPatterns: false,
});
@@ -351,26 +337,76 @@ sanitizeData({ tags }, { sanitizeCollections: true });
| `sanitizeCollections` | `boolean` | `false` | Sanitize `Map` and `Set` instances by traversing their entries and returning a new sanitized copy. When false, these pass through unchanged like other non-plain object instances. |
| `scanStringValues` | `boolean` | `true` | Scan string values on non-sensitive keys for embedded patterns. Applies to object input and to string input when `parseJsonStrings` is enabled; has no effect on raw string input. |
| `parseJsonStrings` | `boolean` | `false` | Parse valid JSON string inputs as structured data and sanitize by field name. Re-serializes with `JSON.stringify`, discarding original whitespace. |
| `customPatterns` | `string[]` | `[]` | Additional field name patterns to match |
| `customPatterns` | `PatternEntry[]` | `[]` | Additional field name patterns to match. Each entry is a pattern string (substring match) or `{ match: string; strict?: boolean }` for an exact match. |
| `customMatchers` | `DataSanitizationMatcher[]` | `[]` | Additional regex matchers for custom string formats |
| `useDefaultPatterns` | `boolean` | `true` | Set to `false` to use only your custom patterns, ignoring the built-in defaults. |
| `useDefaultMatchers` | `boolean` | `true` | Set to `false` to use only your custom matchers, ignoring the built-in defaults. |
| `ignorePatterns` | `string[]` | `[]` | Patterns to exclude from the active set. Applied after defaults and `customPatterns` are merged. Use to prevent false positives from built-in substring matching. |

## Default patterns

The following field name patterns are matched by default using a
case-insensitive substring match:
The following field name patterns are matched by default. All use
case-insensitive substring matching unless noted as exact.

**Credentials** (`credentialPatterns`):

- `apikey`
- `api_key`
- `password`
- `secret`
- `token`

A field named `db_password` or `client_secret_key` would also match because
**HTTP authentication headers** (`headerPatterns`):

- `authorization`
- `api-key`

A field named `db_password` or `x-authorization` would also match because
these patterns match as substrings.

Two additional pattern groups are exported but not included by default:

- **`piiPatterns`** — Personally Identifiable Information: names, contact
details, government IDs, and digital identifiers. Ambiguous single-word
terms such as `address`, `city`, `state`, and `zip` use exact matching to
avoid false positives (e.g. `email_address` is not masked when only `address`
is in `piiPatterns`).
- **`phiPatterns`** — Protected Health Information under HIPAA: medical record
identifiers, healthcare dates, clinical data, and biometrics.

Use them via `customPatterns`:

```typescript
import { sanitizeData, piiPatterns, phiPatterns } from 'data-sanitization';

sanitizeData(patient, {
customPatterns: [...piiPatterns, ...phiPatterns],
});
```

### Exact vs. substring matching

Each pattern in `customPatterns` is a `PatternEntry`: either a plain string
(substring match) or an object with `strict: true` for an exact field-name
match.

```typescript
// Substring: matches 'token', 'access_token', 'session_token', ...
sanitizeData(data, { customPatterns: ['token'] });

// Exact: matches only 'token', not 'access_token'
sanitizeData(data, { customPatterns: [{ match: 'token', strict: true }] });
```

Use exact matching when a pattern is a common English word that would produce
false positives as a substring — for example, `state` would otherwise mask
`statement` or `stateCode`.
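
The difference between the two entry shapes can be sketched as follows. This is a minimal illustration, not the package's implementation: the `PatternEntry` type is re-declared locally from the description above, and `escapeRegExp`, `toKeyMatcher`, and `isSensitiveKey` are hypothetical helper names.

```typescript
// Local re-declaration for the sketch, based on the README description above.
type PatternEntry = string | { match: string; strict?: boolean };

// Hypothetical helper: escape regex metacharacters in a pattern string.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: a plain string becomes a case-insensitive substring
// test; an entry with `strict: true` is anchored to the whole field name.
function toKeyMatcher(entry: PatternEntry): RegExp {
  const match = typeof entry === 'string' ? entry : entry.match;
  const strict = typeof entry !== 'string' && entry.strict === true;
  const source = escapeRegExp(match);
  return new RegExp(strict ? `^${source}$` : source, 'i');
}

// Hypothetical helper: true when any entry matches the field name.
function isSensitiveKey(key: string, entries: PatternEntry[]): boolean {
  return entries.some((entry) => toKeyMatcher(entry).test(key));
}

console.log(isSensitiveKey('access_token', ['token'])); // true
console.log(isSensitiveKey('access_token', [{ match: 'token', strict: true }])); // false
console.log(isSensitiveKey('stateCode', [{ match: 'state', strict: true }])); // false
console.log(isSensitiveKey('State', [{ match: 'state', strict: true }])); // true
```

Anchoring with `^` and `$` is one way to get exact field-name matching; the real matcher may differ in details such as how it normalizes key casing.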

> **`ignorePatterns` and exact matching:** `ignorePatterns` is a `string[]`
> matched against the `match` string of each active pattern. To suppress an
> exact-match entry such as `{ match: 'state', strict: true }`, pass
> `ignorePatterns: ['state']`.
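
As a rough sketch of the behaviour in the note above, filtering the active set by each entry's `match` string might look like this. The `applyIgnorePatterns` helper is hypothetical, the `PatternEntry` type is re-declared locally, and plain string equality is an assumption; the library's own comparison rules may differ.

```typescript
// Local re-declaration for the sketch, based on the README description above.
type PatternEntry = string | { match: string; strict?: boolean };

// Hypothetical helper: drop every active entry whose pattern string
// (the string itself, or the `match` property) appears in `ignore`.
function applyIgnorePatterns(
  active: PatternEntry[],
  ignore: string[],
): PatternEntry[] {
  const ignored = new Set(ignore);
  return active.filter((entry) => {
    const match = typeof entry === 'string' ? entry : entry.match;
    return !ignored.has(match);
  });
}

const activePatterns: PatternEntry[] = [
  'token',
  { match: 'state', strict: true },
  { match: 'city', strict: true },
];

// The strict `state` entry is removed; `token` and `city` remain.
console.log(applyIgnorePatterns(activePatterns, ['state']));
```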

## Default matchers

Three matchers are included by default:
@@ -379,8 +415,10 @@ Three matchers are included by default:
JSON-like strings
- **Escaped JSON matcher**: matches `\"fieldName\":\"value\"` patterns in
JSON embedded inside JSON string values
- **Form-encoded matcher**: matches `fieldName=value` and `fieldName:value`
patterns in URL-encoded and similarly delimited strings
- **Cookie and form-encoded matcher**: matches `fieldName=value` and
`fieldName:value` patterns in URL form-encoded strings and HTTP Cookie
headers. Values stop at `&`, `;`, `\r`, or `\n` so neither format's
separator is consumed as part of a value.
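
To illustrate why the value class stops at `&`, `;`, `\r`, and `\n`, here is a simplified stand-in for the cookie and form-encoded matcher. It is a sketch under assumptions: `sketchCookieAndFormMatcher` and `maskValues` are hypothetical names, and the regex uses a single capture group rather than the `$1`/`$2` pair the real matchers use.

```typescript
const MASK = '**********';

// Hypothetical helper: escape regex metacharacters in a pattern string.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Simplified stand-in: capture `name=` or `name:` as $1, then consume the
// value up to the next `&`, `;`, `\r`, or `\n`.
function sketchCookieAndFormMatcher(pattern: string): RegExp {
  return new RegExp(`(${escapeRegExp(pattern)}[=:])[^\\r\\n&;]*`, 'gi');
}

// Hypothetical helper: replace each matched value with the mask.
function maskValues(input: string, pattern: string): string {
  return input.replace(sketchCookieAndFormMatcher(pattern), `$1${MASK}`);
}

// URL form-encoded: the value ends at `&`.
console.log(maskValues('api_key=hunter2&user=bob', 'api_key'));
// api_key=**********&user=bob

// HTTP Cookie header: the value ends at `;`.
console.log(maskValues('session=1; token=abc; theme=dark', 'token'));
// session=1; token=**********; theme=dark
```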

## Custom patterns and matchers

35 changes: 18 additions & 17 deletions packages/data-sanitization/docs/performance.md
@@ -236,10 +236,11 @@ reuse the cache and pay no compile cost.
| Warm cache (same options each call) | ~451,000 | ~0.002 |
| Cold start (unique options per call) | ~14,000 | ~0.070 |

The first call is ~32× slower than a warm call due to regex compilation.
In steady-state server usage this cost is paid once per process lifetime and
is negligible. It becomes visible only in tests or scripts that create many
distinct option configurations (e.g. per-request custom patterns).
The first call is significantly slower than a warm call due to regex
compilation (typically 15–32×, hardware-dependent). In steady-state server
usage this cost is paid once per process lifetime and is negligible. It becomes
visible only in tests or scripts that create many distinct option
configurations (e.g. per-request custom patterns).

See [Cache memory growth](#cache-memory-growth) below for the memory
implication of many distinct configurations.
@@ -248,8 +249,7 @@ implication of many distinct configurations.

`removeMatches: true` deletes matched fields from objects and matched
key=value pairs from strings instead of masking them. The cost is similar to
masking for objects but slightly higher for string inputs due to regex
replacement pattern differences.
masking for both objects and strings.

<table>
<thead>
@@ -295,9 +295,9 @@ replacement pattern differences.
</table>

For objects, removal and masking are nearly equivalent — both write a result
object with the same traversal cost. For strings, removal is 10–20% slower
because the match-and-remove regex path involves different replacement
semantics than the `$1<mask>$2` substitution.
object with the same traversal cost. For strings, removal cost is comparable
to masking; the exact relative overhead varies with input and is within
benchmark noise at typical payload sizes.

## String workloads

@@ -416,16 +416,17 @@ In steady-state usage — a fixed configuration, possibly with a static list of

If `customPatterns` vary per call (e.g. injected from user input or request
data), entries will cycle through the cache and every call will pay the
cold-start regex compilation cost (~32× slower than a warm call). In that
scenario, prebuild the options object once (or a small set of them) and reuse
it across calls. Or set `scanStringValues: false`, which bypasses the cache
entirely.
cold-start regex compilation cost (typically 15–32× slower than a warm call,
depending on pattern count and hardware). In that scenario, prebuild the
options object once (or a small set of them) and reuse it across calls. Or set
`scanStringValues: false`, which bypasses the cache entirely.
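
The effect described above can be sketched with a toy compile cache. This is not the library's cache: `getKeyMatchers`, `compiledCache`, and `compileCount` are hypothetical names, the cache here is unbounded, and keying on the joined pattern list is an assumption made purely for illustration.

```typescript
// Toy cache: compiled matchers keyed by the exact pattern list.
const compiledCache = new Map<string, RegExp[]>();
let compileCount = 0;

// Hypothetical helper: return cached matchers for a pattern list, compiling
// them only the first time that exact list is seen.
function getKeyMatchers(patterns: string[]): RegExp[] {
  const cacheKey = patterns.join('\u0000');
  const cached = compiledCache.get(cacheKey);
  if (cached) return cached;

  compileCount += 1; // cold path: pays the regex compilation cost
  const compiled = patterns.map(
    (p) => new RegExp(p.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), 'i'),
  );
  compiledCache.set(cacheKey, compiled);
  return compiled;
}

// Fixed configuration: compiled once, then served from the cache.
const fixedPatterns = ['ssn', 'email'];
getKeyMatchers(fixedPatterns);
getKeyMatchers(fixedPatterns);
console.log(compileCount); // 1

// Per-call patterns: every distinct list is a cache miss.
getKeyMatchers(['ssn', 'email', 'request_42']);
getKeyMatchers(['ssn', 'email', 'request_43']);
console.log(compileCount); // 3
```

Reusing one options object (or a small fixed set) keeps every call on the warm path, which is the recommendation in the paragraph above.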

### Form-encoded matcher and multiline strings
### Cookie and form-encoded matcher and multiline strings

The built-in form-encoded matcher uses `[^\n&]*` to match a field value —
stopping at either an `&` delimiter or a newline. This means content on lines
after a matched value is preserved:
The built-in `cookieAndFormEncodedMatcher` uses `[^\r\n&;]*` to match a field
value — stopping at `&`, `;`, `\r`, or `\n`. This means content on lines after
a matched value is preserved, and the two separator styles (URL form-encoded
`&` and HTTP Cookie `;`) do not bleed into each other:

```text
Input: "Error: auth failed — api_key=hunter2\n at foo (bar.js:10)"
108 changes: 104 additions & 4 deletions packages/data-sanitization/src/constants.ts
@@ -1,15 +1,111 @@
import { PatternEntry } from './types';

/**
* These are some default patterns to search within field
* names used to determine what data is sanitized.
* Field-name patterns for credentials commonly present in any application
* that performs authentication or calls external APIs.
*/
const DEFAULT_FIELD_NAME_PATTERNS = [
const credentialPatterns: PatternEntry[] = [
'apikey',
'api_key',
'password',
'secret',
'token',
];

/**
* Field-name patterns for HTTP headers that carry authentication or
* API-key material. Substring matching covers common variants:
* `authorization` matches `x-authorization` and `proxy-authorization`;
* `api-key` matches `x-api-key`.
*/
const headerPatterns: PatternEntry[] = ['authorization', 'api-key'];

/**
* Field-name patterns for Personally Identifiable Information (PII).
* Opt-in — not included in `defaultPatterns`. Single-word terms that
* would produce false positives as substrings use `strict: true`.
*/
const piiPatterns: PatternEntry[] = [
// Names
'first_name',
'last_name',
'middle_name',
'full_name',
'date_of_birth',
'dob',
'birth_date',
// Contact
'email',
'phone',
'mobile',
// Address — single-word terms use strict to avoid false positives
// (e.g. 'email_address', 'ip_address')
{ match: 'address', strict: true },
'street_address',
'address_line',
'postal_code',
{ match: 'city', strict: true },
{ match: 'state', strict: true },
{ match: 'zip', strict: true },
// Government IDs
'ssn',
'social_security',
'social_insurance_number',
'national_id',
'passport',
'drivers_license',
'tax_id',
// Digital identifiers (GDPR-relevant)
'ip_address',
];

/**
* Field-name patterns for Protected Health Information (PHI) under HIPAA.
* Opt-in — not included in `defaultPatterns`.
*/
const phiPatterns: PatternEntry[] = [
// Medical record identifiers
'mrn',
'medical_record_number',
'patient_id',
'chart_number',
'member_id',
'beneficiary_id',
'subscriber_id',
'insurance_id',
'claim_number',
'encounter_id',
// Healthcare-specific dates
'admission_date',
'discharge_date',
'service_date',
'appointment_date',
'death_date',
// Clinical data
'diagnosis_code',
'diagnosis',
'condition',
'medication',
'prescription',
'procedure_code',
// Provider / facility
'provider_npi',
'provider_id',
// Biometrics
'fingerprint',
'biometric_id',
];

/**
* The default set of field-name patterns applied when no options override
* them. Covers credentials and common authentication headers. PII and PHI
* patterns are opt-in via `piiPatterns` and `phiPatterns`.
*/
const defaultPatterns: PatternEntry[] = [
...credentialPatterns,
...headerPatterns,
];

/**
* A default mask used when replacing string field values.
*/
@@ -21,7 +117,11 @@ const DEFAULT_PATTERN_MASK = '**********';
const DEFAULT_NUMERIC_MASK = 9999999999;

export {
DEFAULT_FIELD_NAME_PATTERNS,
credentialPatterns,
defaultPatterns,
headerPatterns,
phiPatterns,
piiPatterns,
DEFAULT_NUMERIC_MASK,
DEFAULT_PATTERN_MASK,
};
9 changes: 9 additions & 0 deletions packages/data-sanitization/src/index.ts
@@ -8,8 +8,17 @@ export type {
DataSanitizationOutput,
DataSanitizationReplacer,
DataSanitizationReplacerOptions,
PatternEntry,
} from './types';

export {
credentialPatterns,
defaultPatterns,
headerPatterns,
phiPatterns,
piiPatterns,
} from './constants';

/**
* Returns a safe type label for data passed to the sanitizer.
*