Skip to content

UN-3393 [FEAT] Wire AuditSerializer subclasses through SanitizedSerializerMixin#1966

Open
chandrasekharan-zipstack wants to merge 3 commits into
feat/UN-3393-input-validation-foundationfrom
feat/UN-3393-input-validation-sweep
Open

UN-3393 [FEAT] Wire AuditSerializer subclasses through SanitizedSerializerMixin#1966
chandrasekharan-zipstack wants to merge 3 commits into
feat/UN-3393-input-validation-foundationfrom
feat/UN-3393-input-validation-sweep

Conversation

@chandrasekharan-zipstack
Copy link
Copy Markdown
Contributor

What

  • Builds on UN-3393 [FEAT] Add SanitizedSerializerMixin foundation for input validation #1965 (foundation PR) by routing every AuditSerializer subclass through SanitizedSerializerMixin. ~18 write-path serializers now auto-reject HTML/JS-shaped input on every writable CharField / TextField.
  • Affected (transitively): WorkflowSerializer, CustomToolSerializer, APIDeploymentSerializer, APIKeySerializer, ConnectorInstanceSerializer, BaseAdapterSerializer, PlatformKeySerializer, PlatformApiKey{Create,Update}Serializer, PipelineSerializer, ToolInstanceSerializer, ToolStudioPromptSerializer, PromptStudioOutputSerializer, ProfileManagerSerializer, IndexManagerSerializer, PromptStudioRegistry{,Info}Serializer, PromptStudioDocumentManagerSerializer.
  • Adds Meta.html_safe_fields opt-outs on the two serializers that legitimately carry LLM markup:
    • ToolStudioPromptSerializerprompt, assert_prompt, assertion_failure_prompt, output.
    • CustomToolSerializersummarize_prompt, preamble, postamble, output.
  • Removes the three redundant validate_description methods (in api_v2.APIDeploymentSerializer, workflow_v2.WorkflowSerializer, prompt_studio_core_v2.CustomToolSerializer) — the mixin now covers them.

Why

  • The foundation PR (UN-3393 [FEAT] Add SanitizedSerializerMixin foundation for input validation #1965) added SanitizedSerializerMixin and pre-mixed ModelSerializer / Serializer / HyperlinkedModelSerializer under utils.serializer but did not wire any production serializer through them. This PR is the smallest, highest-leverage wiring step: changing one parent class (AuditSerializer) closes the input-validation gap on the majority of write-path entities.
  • validate_<name> methods (which use validate_name_field) are kept because they also strip whitespace and reject empty values — behaviour the mixin doesn't duplicate. Only the validate_description methods (which just call validate_no_html_tags) are removed.
  • Free-form LLM prompt / output fields (prompt, output, summarize_prompt, etc.) legitimately contain <context> / <thinking> / arbitrary model output. They're explicitly opted out via Meta.html_safe_fields.

How

  • backend/backend/serializers.py: AuditSerializer now inherits utils.serializer.ModelSerializer instead of rest_framework.serializers.ModelSerializer. create / update semantics unchanged.
  • Per-serializer Meta.html_safe_fields declarations on the two LLM-bearing serializers (Prompt Studio).
  • Three validate_description removals + validate_no_html_tags import drop from the affected modules.

Can this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)

  • Functional: The 8 serializers that were already hand-wired (in the in-flight fix) continue to behave identically. The 14 manual validate_<field> methods still exist — the mixin's validator is appended to the field's validators list, so the same rejection is enforced either way. Removing only the redundant validate_description methods avoids double-validation noise but doesn't change any acceptance outcome.
  • New rejections: AuditSerializer subclasses that were not previously sanitised now auto-reject HTML in their CharField / TextField (e.g. ConnectorInstance.connector_name, PlatformKey.key_name, Pipeline.pipeline_name, AdapterInstance.adapter_name, etc.). Per the May 2026 US-prod scan (see KB Obsidian Vault/zipstuff/UN-3393-input-validation/05-prod-baseline.md), zero legitimate user content uses <…> in these columns — every match was pentest leftover. Migration risk: ~zero.
  • LLM fields explicitly opted out: prompt, assert_prompt, assertion_failure_prompt, output, summarize_prompt, preamble, postamble are listed in Meta.html_safe_fields so they continue to accept arbitrary text.
  • Direct DRF-base-class users: Serializers that inherit directly from serializers.ModelSerializer / serializers.Serializer / serializers.HyperlinkedModelSerializer (i.e. not via AuditSerializer) are not touched by this PR. They remain unchanged. A follow-up sweep will convert them; their absence here means write paths through them retain pre-PR behaviour.

Database Migrations

  • None.

Env Config

  • None.

Relevant Docs

  • KB: Obsidian Vault/zipstuff/UN-3393-input-validation/.
  • ADR 0003 (adr/0003-base-mixin-default-on-opt-out.md) — opt-out semantics.

Related Issues or PRs

  • Jira: UN-3393.
  • Stacked on: UN-3393 [FEAT] Add SanitizedSerializerMixin foundation for input validation #1965 (foundation).
  • Follow-up: a separate PR that converts the remaining ~20 direct DRF-base-class users (tags/serializers.py, scheduler/serializer.py, tenant_account_v2/serializer.py, pipeline_v2/serializers/{internal,update,sharing,execute}.py, response-only and list serializers across the codebase, …).
  • Follow-up: cover backend/pluggable_apps/ (lives in the unstract-cloud repo; separate PR there).

Dependencies Versions

  • None.

Notes on Testing

Local tests in the OSS worktree (cloud-deps not installed, so manage.py check is unavailable here — CI runs the full suite):

cd backend && uv run pytest utils/tests/ -q

→ 85 passed (50 from foundation + 35 pre-existing utils tests).

Manual sanity:

cd backend && uv run python -c "from utils.serializer import ModelSerializer; from backend.serializers import AuditSerializer; assert issubclass(AuditSerializer, ModelSerializer); print('ok')"

Screenshots

n/a (no UI change)

Checklist

I have read and understood the Contribution Guidelines.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 14, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0334b9a5-0efb-4edb-9e99-5d477bf5dbab

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/UN-3393-input-validation-sweep

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR wires all AuditSerializer subclasses through SanitizedSerializerMixin by changing AuditSerializer's parent class from rest_framework.serializers.ModelSerializer to utils.serializer.ModelSerializer, which already bundles the mixin. Three redundant validate_description methods that were pure wrappers around validate_no_html_tags are removed, and two additional serializers (NotificationSerializer, TagSerializer) that don't extend AuditSerializer are also migrated directly.

  • backend/backend/serializers.py is the load-bearing change: swapping one import flips sanitization on for ~18 write-path serializers at once.
  • html_safe_fields opt-outs are declared on ToolStudioPromptSerializer, CustomToolSerializer, and PromptStudioOutputSerializer to allow legitimate LLM markup in prompt/output fields.
  • Three validate_description removals eliminate double-validation, since the mixin already covers those fields transitively.

Confidence Score: 5/5

Safe to merge — the core change is a single-line parent-class swap on AuditSerializer, all LLM-bearing fields are correctly opted out via html_safe_fields, and the removed validate_description methods were pure pass-throughs that are now covered transitively.

The change is structurally minimal: one import swap propagates sanitization to ~18 serializers without touching their logic. The html_safe_fields opt-outs on ToolStudioPromptSerializer, CustomToolSerializer, and PromptStudioOutputSerializer cover every identified LLM content field. The PromptStudioRegistry.name/description fields are editable=False in the model, so DRF marks them read_only and the mixin skips them automatically. No write-path regressions were found.

No files require special attention. The prompt_studio_core_v2/serializers.py file retains CustomToolListSerializer on raw DRF ModelSerializer, but this is explicitly deferred to a follow-up PR and does not affect the security boundary of the write path covered here.

Important Files Changed

Filename Overview
backend/backend/serializers.py Load-bearing change: swaps AuditSerializer's parent to utils.serializer.ModelSerializer, routing ~18 write-path serializers through SanitizedSerializerMixin in one line. create/update semantics are unchanged.
backend/prompt_studio/prompt_studio_output_manager_v2/serializers.py Adds html_safe_fields = ('output', 'context') to opt out PromptStudioOutputSerializer from HTML rejection; these fields store raw LLM responses and document chunks. Addresses the concern raised in a previous review thread.
backend/prompt_studio/prompt_studio_v2/serializers.py Adds html_safe_fields opt-out for the four LLM prompt/output fields on ToolStudioPromptSerializer; all other CharField fields (e.g. prompt_key) are correctly left under sanitization.
backend/prompt_studio/prompt_studio_core_v2/serializers.py Adds html_safe_fields for LLM prompt/output fields on CustomToolSerializer and removes the now-redundant validate_description method; CustomToolListSerializer still uses raw DRF ModelSerializer (deferred to follow-up).
backend/notification_v2/serializers.py Migrates NotificationSerializer to utils.serializer.ModelSerializer; webhook URL, name, and credential fields are all safe from HTML-rejection false positives since none contain angle-bracket content in normal use.
backend/tags/serializers.py Migrates TagSerializer to utils.serializer.ModelSerializer; TagParamsSerializer is intentionally left on the raw DRF Serializer (its validate_tags already enforces an alphanumeric-only pattern).
backend/workflow_manager/workflow_v2/serializers.py Removes redundant validate_description; other serializers in the file remain on raw DRF ModelSerializer per the stated follow-up scope.
backend/api_v2/serializers.py Removes redundant validate_description from APIDeploymentSerializer; validate_display_name (which calls validate_name_field -> validate_no_html_tags) is preserved for its whitespace-stripping and empty-value-rejection behaviour.

Class Diagram

%%{init: {'theme': 'neutral'}}%%
classDiagram
    class SanitizedSerializerMixin {
        +__init__()
        +appends validate_no_html_tags to every writable CharField
        +skips html_safe_fields and read_only fields
    }
    class DRFModelSerializer["drf.ModelSerializer"] {
    }
    class UtilsModelSerializer["utils.serializer.ModelSerializer"] {
    }
    class AuditSerializer {
        +create()
        +update()
    }
    class WorkflowSerializer {
        +Meta.html_safe_fields: none
        +validate_workflow_name()
    }
    class CustomToolSerializer {
        +Meta.html_safe_fields: summarize_prompt, preamble, postamble, output
        +validate_tool_name()
    }
    class ToolStudioPromptSerializer {
        +Meta.html_safe_fields: prompt, assert_prompt, assertion_failure_prompt, output
    }
    class PromptStudioOutputSerializer {
        +Meta.html_safe_fields: output, context
    }
    class NotificationSerializer {
        +Meta: no html_safe_fields
    }
    class TagSerializer {
        +Meta: no html_safe_fields
    }

    SanitizedSerializerMixin <|-- UtilsModelSerializer
    DRFModelSerializer <|-- UtilsModelSerializer
    UtilsModelSerializer <|-- AuditSerializer
    AuditSerializer <|-- WorkflowSerializer
    AuditSerializer <|-- CustomToolSerializer
    AuditSerializer <|-- ToolStudioPromptSerializer
    AuditSerializer <|-- PromptStudioOutputSerializer
    UtilsModelSerializer <|-- NotificationSerializer
    UtilsModelSerializer <|-- TagSerializer
Loading

Reviews (3): Last reviewed commit: "UN-3393 [FIX] Opt PromptStudioOutputSeri..." | Re-trigger Greptile

Comment thread backend/backend/serializers.py
@chandrasekharan-zipstack chandrasekharan-zipstack force-pushed the feat/UN-3393-input-validation-sweep branch from 93b7ecd to 1e8e4ef Compare May 15, 2026 10:43
chandrasekharan-zipstack and others added 3 commits May 15, 2026 17:15
Builds on the foundation PR by routing every `AuditSerializer` subclass
through `SanitizedSerializerMixin`. ~18 write-path serializers
(`WorkflowSerializer`, `CustomToolSerializer`, `APIDeploymentSerializer`,
`APIKeySerializer`, `ConnectorInstanceSerializer`, `BaseAdapterSerializer`,
`PlatformKeySerializer`, `PlatformApiKey{Create,Update}Serializer`,
`PipelineSerializer`, `ToolInstanceSerializer`, `ToolStudioPromptSerializer`,
`PromptStudioOutputSerializer`, `ProfileManagerSerializer`,
`IndexManagerSerializer`, `PromptStudioRegistry{,Info}Serializer`,
`PromptStudioDocumentManagerSerializer`) now auto-reject HTML/JS-shaped
input on every writable `CharField` / `TextField`.

- `backend/backend/serializers.py`: `AuditSerializer` now inherits
  `utils.serializer.ModelSerializer` (the pre-mixed variant) instead of
  `rest_framework.serializers.ModelSerializer`. `create` / `update`
  semantics unchanged.

- `prompt_studio/prompt_studio_v2.ToolStudioPromptSerializer`: declares
  `Meta.html_safe_fields = ("prompt", "assert_prompt",
  "assertion_failure_prompt", "output")`. LLM prompt text legitimately
  contains XML/HTML-like markup (e.g. `<context>`, `<thinking>`); LLM
  `output` may include anything the model produced.

- `prompt_studio/prompt_studio_core_v2.CustomToolSerializer`: declares
  `Meta.html_safe_fields = ("summarize_prompt", "preamble", "postamble",
  "output")`. Tool-level LLM context fields and stored LLM output.

- Removes redundant manual `validate_description(self, value)` methods
  in `api_v2.APIDeploymentSerializer`,
  `workflow_v2.WorkflowSerializer`, and
  `prompt_studio_core_v2.CustomToolSerializer`. The mixin now covers
  these via the `AuditSerializer` base. `validate_<name>` methods that
  use `validate_name_field` are retained because they also strip
  whitespace and reject empty values — behaviour the mixin doesn't
  duplicate.

- Imports of `validate_no_html_tags` are dropped from the three files
  whose `validate_description` methods were removed.

Files that inherit DRF base classes directly (without going through
`AuditSerializer`) are tracked as a follow-up sweep. The AuditSerializer
path already covers the highest-value write-path entities.

Test coverage: `cd backend && uv run pytest utils/tests/ -q` → 85 passed.
Full Django check requires cloud-deps; PR CI exercises the full suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends PR2 coverage to two user-write-path serializers that don't go
through `AuditSerializer`:

- `tags.TagSerializer` (user-create with `name` + `description`) now
  inherits `utils.serializer.ModelSerializer`. Sister
  `TagParamsSerializer` is a query-param parser with strict regex
  validation; left as-is.

- `notification_v2.NotificationSerializer` now inherits
  `utils.serializer.ModelSerializer`. The model's `name`,
  `authorization_key`, `authorization_header`, and `url` (URLField is a
  CharField subclass) are auto-sanitized. `notification_type`,
  `authorization_type`, `platform` are ChoiceField in the serializer,
  not CharField; the mixin skips them. The existing manual
  `validate_name` keeps its uniqueness + strip-whitespace logic; the
  mixin's HTML check runs alongside as redundant defence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Greptile P1 review comment on PR #1966. `PromptStudioOutputManager`
stores raw LLM responses (`output`: CharField) and document chunks
(`context`: TextField) that routinely contain <thinking>, <context>,
and other XML-like tags. The serializer is mounted on a ModelViewSet
with no class-level HTTP-method restriction, so write paths through
DRF admin / browsable API / any future write endpoint would
incorrectly reject legitimate LLM output without this opt-out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@chandrasekharan-zipstack chandrasekharan-zipstack force-pushed the feat/UN-3393-input-validation-sweep branch from 1e8e4ef to 6ce8ea8 Compare May 15, 2026 11:46
@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant