From 677fa9c31b3c6cae8511d7a029429e87d06e4d41 Mon Sep 17 00:00:00 2001 From: JSON Schema Migration Date: Tue, 28 Apr 2026 10:05:16 +0530 Subject: [PATCH 01/13] feat(schemavault): accept JSON Schema uploads alongside legacy .py Adds a parallel JSON Schema upload path so schemas can be authored as data instead of executable Python. The worker materialises Pydantic models from JSON in memory (no exec) via the new remote_schema_loader. Legacy .py uploads continue to work unchanged. M1 of the migration plan. --- docs/CustomizeSchemaData.md | 79 +- scripts/py_schema_to_json.py | 76 ++ src/ContentProcessor/requirements.txt | 1 + .../src/libs/pipeline/entities/schema.py | 12 +- .../src/libs/pipeline/handlers/map_handler.py | 27 +- .../src/libs/utils/remote_schema_loader.py | 345 +++++++ .../unit/utils/test_remote_schema_loader.py | 282 ++++++ .../app/routers/logics/schema_validator.py | 157 ++++ .../app/routers/logics/schemavault.py | 9 +- .../app/routers/models/schmavault/model.py | 10 +- .../app/routers/schemavault.py | 167 ++-- .../app/tests/logics/test_schema_validator.py | 151 +++ .../app/tests/routers/test_schemavault.py | 186 ++++ src/ContentProcessorAPI/requirements.txt | 1 + .../samples/schemas/autoclaim.json | 886 ++++++++++++++++++ 15 files changed, 2322 insertions(+), 67 deletions(-) create mode 100644 scripts/py_schema_to_json.py create mode 100644 src/ContentProcessor/src/libs/utils/remote_schema_loader.py create mode 100644 src/ContentProcessor/tests/unit/utils/test_remote_schema_loader.py create mode 100644 src/ContentProcessorAPI/app/routers/logics/schema_validator.py create mode 100644 src/ContentProcessorAPI/app/tests/logics/test_schema_validator.py create mode 100644 src/ContentProcessorAPI/samples/schemas/autoclaim.json diff --git a/docs/CustomizeSchemaData.md b/docs/CustomizeSchemaData.md index 05a08b54..be24ff20 100644 --- a/docs/CustomizeSchemaData.md +++ b/docs/CustomizeSchemaData.md @@ -268,7 +268,84 @@ This means your field descriptions in the 
schema class **directly influence extr

---

-## Related Documentation
+## Authoring Schemas as JSON (recommended)
+
+The schema vault now also accepts **JSON Schema** documents (Draft 2020-12)
+in addition to the legacy executable `.py` format. JSON schemas are treated
+strictly as data: the worker parses them and materialises a Pydantic model
+in memory without executing any uploaded code, eliminating an entire class
+of remote-code-execution risk in the schema-management path.
+
+### Why JSON?
+
+| | Legacy `.py` | JSON Schema |
+| --- | --- | --- |
+| Format | Executable Pydantic class | Declarative JSON document |
+| Worker behaviour | Imports and runs uploaded Python | Parses JSON, builds model in memory |
+| Authoring | Hand-written Python | Pydantic-compatible JSON |
+| Side-effects on import | Possible | Impossible |
+
+Both formats are accepted today; JSON is the recommended path for new
+schemas, and each upload opts into it simply by using a `.json` file
+extension.
+
+### Authoring with the conversion helper
+
+If you have an existing Pydantic-based `.py` schema, the repo ships a
+helper that emits the equivalent JSON Schema:
+
+```bash
+python scripts/py_schema_to_json.py \
+    src/ContentProcessorAPI/samples/schemas/autoclaim.py \
+    AutoInsuranceClaimForm
+```
+
+This writes `autoclaim.json` next to the source file. Under the hood it
+calls `Model.model_json_schema()` from Pydantic v2 — the same call the
+worker uses today to build the LLM prompt. The output is therefore
+already aligned with the contract the pipeline expects.
+
+The accelerator ships a golden conversion of the auto-claim sample at
+[/src/ContentProcessorAPI/samples/schemas/autoclaim.json](/src/ContentProcessorAPI/samples/schemas/autoclaim.json)
+that you can reference.
+
+### Upload via API
+
+`POST /schemavault/` accepts either format.
For JSON, send the file as
+`application/json`:
+
+```http
+POST /schemavault/
+Content-Type: multipart/form-data
+- data: { "ClassName": "InvoiceSchema", "Description": "Invoice extraction" }
+- file: invoice.json (application/json)
+```
+
+When uploading JSON:
+
+- The schema must be a JSON object with `"type": "object"` and a
+  `"properties"` block.
+- The schema's `title` (if present) becomes the `ClassName` recorded in
+  Cosmos. If the JSON has no `title`, the request body's `ClassName` is
+  used as a fallback.
+- Two project-specific extension keywords are accepted:
+  - `x-cps-extract-prompt` — optional override for the LLM extraction
+    prompt for that field.
+  - `x-cps-required-on-save` — marks a field that must be present in
+    the LLM output before persistence.
+  Any other `x-…` keyword is rejected.
+- The schema must be ≤ 1 MB.
+
+### Constraints relative to the legacy Python schemas
+
+JSON schemas are pure data. They cannot carry custom validation logic
+written in Python (e.g. `field_validator`). For most extraction
+schemas this is not a limitation — the existing samples don't use
+custom validators — but if you depend on imperative validation, keep
+authoring those schemas in Python and upload them through the legacy
+`.py` path, which remains supported.
+

 - [Modifying System Processing Prompts](./CustomizeSystemPrompts.md) — Customize extraction and mapping prompts
 - [Gap Analysis Ruleset Guide](./GapAnalysisRulesetGuide.md) — Define gap rules that reference your document types
diff --git a/scripts/py_schema_to_json.py b/scripts/py_schema_to_json.py
new file mode 100644
index 00000000..88f137b8
--- /dev/null
+++ b/scripts/py_schema_to_json.py
@@ -0,0 +1,76 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+"""Convert a legacy Pydantic ``.py`` schema into a declarative ``.json`` schema.
+
+This helper is part of the migration away from executable Python schemas.
+It imports a Pydantic model from a ``.py`` file *in a trusted local +context* (the developer's machine), reads its +:py:meth:`pydantic.BaseModel.model_json_schema` output, and writes the +result to a ``.json`` file alongside. + +Usage: + + python scripts/py_schema_to_json.py \ + src/ContentProcessorAPI/samples/schemas/autoclaim.py \ + AutoInsuranceClaimForm + +The generated JSON is what should be uploaded to the schema vault going +forward; it is data only and never executed by the worker. +""" + +from __future__ import annotations + +import argparse +import importlib.util +import json +import sys +from pathlib import Path + +from pydantic import BaseModel + + +def convert(py_path: Path, class_name: str, out_path: Path | None = None) -> Path: + """Load *class_name* from *py_path* and write its JSON schema next to it.""" + spec = importlib.util.spec_from_file_location(py_path.stem, py_path) + if spec is None or spec.loader is None: + raise RuntimeError(f"Cannot import schema module from {py_path}") + module = importlib.util.module_from_spec(spec) + sys.modules[spec.name] = module + spec.loader.exec_module(module) # noqa: S102 - trusted local conversion only + + cls = getattr(module, class_name, None) + if cls is None or not isinstance(cls, type) or not issubclass(cls, BaseModel): + raise RuntimeError( + f"'{class_name}' is not a Pydantic BaseModel in {py_path}" + ) + + schema = cls.model_json_schema() + # Pydantic emits "title" at the root; ensure it matches the requested + # class name so the worker's ``derive_class_name`` picks it up. 
+ schema["title"] = class_name + + target = out_path or py_path.with_suffix(".json") + target.write_text(json.dumps(schema, indent=2) + "\n", encoding="utf-8") + return target + + +def main() -> int: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("py_path", type=Path, help="Path to the .py schema file.") + parser.add_argument("class_name", help="BaseModel class to export.") + parser.add_argument( + "--out", + type=Path, + default=None, + help="Output .json path (defaults to alongside the input).", + ) + args = parser.parse_args() + + target = convert(args.py_path, args.class_name, args.out) + print(f"Wrote {target}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/src/ContentProcessor/requirements.txt b/src/ContentProcessor/requirements.txt index 579635b4..4a9bf175 100644 --- a/src/ContentProcessor/requirements.txt +++ b/src/ContentProcessor/requirements.txt @@ -16,6 +16,7 @@ dnspython==2.8.0 idna==3.11 iniconfig==2.3.0 isodate==0.7.2 +jsonschema==4.25.1 mongomock==4.3.0 msal==1.34.0 msal-extensions==1.3.1 diff --git a/src/ContentProcessor/src/libs/pipeline/entities/schema.py b/src/ContentProcessor/src/libs/pipeline/entities/schema.py index f7f5f7e0..429d2570 100644 --- a/src/ContentProcessor/src/libs/pipeline/entities/schema.py +++ b/src/ContentProcessor/src/libs/pipeline/entities/schema.py @@ -9,7 +9,7 @@ class file (in blob storage) that defines the structured output """ import datetime -from typing import Optional +from typing import Literal, Optional from pydantic import BaseModel, Field @@ -21,10 +21,15 @@ class Schema(BaseModel): Attributes: Id: Unique schema identifier. - ClassName: Python class name in the remote module. + ClassName: Class name to materialise from the schema artifact. Description: Human-readable description. - FileName: Blob filename containing the schema class. + FileName: Blob filename containing the schema artifact. ContentType: Target content type this schema handles. 
+ Format: Storage format of the schema artifact. ``"python"`` (legacy) + indicates a ``.py`` Pydantic class; ``"json"`` indicates a + JSON Schema descriptor that the worker materialises in-memory + without executing any uploaded code. Defaults to ``"python"`` + so existing Cosmos records keep their current behaviour. """ Id: str @@ -32,6 +37,7 @@ class Schema(BaseModel): Description: str FileName: str ContentType: str + Format: Literal["python", "json"] = Field(default="python") Created_On: Optional[datetime.datetime] = Field(default=None) Updated_On: Optional[datetime.datetime] = Field(default=None) diff --git a/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py b/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py index aa08dda1..d85ee2cc 100644 --- a/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py +++ b/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py @@ -29,6 +29,7 @@ from libs.pipeline.entities.schema import Schema from libs.pipeline.queue_handler_base import HandlerBase from libs.utils.remote_module_loader import load_schema_from_blob +from libs.utils.remote_schema_loader import load_schema_from_blob_json logger = logging.getLogger(__name__) @@ -151,13 +152,25 @@ async def execute(self, context: MessageContext) -> StepResult: schema_id=context.data_pipeline.pipeline_status.schema_id, ) - # Load the schema class for structured output - schema_class = load_schema_from_blob( - account_url=self.application_context.configuration.app_storage_blob_url, - container_name=f"{self.application_context.configuration.app_cps_configuration}/Schemas/{context.data_pipeline.pipeline_status.schema_id}", - blob_name=selected_schema.FileName, - module_name=selected_schema.ClassName, - ) + # Load the schema class for structured output. 
JSON schemas are + # materialised as in-memory Pydantic models without executing any + # uploaded code; legacy ``.py`` schemas continue to use the + # remote-module loader so existing deployments keep working. + schema_format = getattr(selected_schema, "Format", "python") or "python" + if schema_format == "json": + schema_class = load_schema_from_blob_json( + account_url=self.application_context.configuration.app_storage_blob_url, + container_name=f"{self.application_context.configuration.app_cps_configuration}/Schemas/{context.data_pipeline.pipeline_status.schema_id}", + blob_name=selected_schema.FileName, + model_name=selected_schema.ClassName, + ) + else: + schema_class = load_schema_from_blob( + account_url=self.application_context.configuration.app_storage_blob_url, + container_name=f"{self.application_context.configuration.app_cps_configuration}/Schemas/{context.data_pipeline.pipeline_status.schema_id}", + blob_name=selected_schema.FileName, + module_name=selected_schema.ClassName, + ) # Invoke Model with Agent Framework SDK diff --git a/src/ContentProcessor/src/libs/utils/remote_schema_loader.py b/src/ContentProcessor/src/libs/utils/remote_schema_loader.py new file mode 100644 index 00000000..6ecd02aa --- /dev/null +++ b/src/ContentProcessor/src/libs/utils/remote_schema_loader.py @@ -0,0 +1,345 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. + +"""Materialise a Pydantic model from a JSON Schema descriptor. + +This is the *safe* counterpart of :mod:`libs.utils.remote_module_loader`. +A JSON schema descriptor is treated strictly as data: + +1. Bytes are downloaded from blob storage. +2. ``json.loads`` parses them into a ``dict``. +3. A recursive walk converts the schema into Pydantic ``BaseModel`` + subclasses via :func:`pydantic.create_model`. + +There is **no** ``exec``, ``compile``, ``importlib`` or any other +mechanism that would execute attacker-supplied code. 
The worst a +malicious schema can do is fail validation at load time. +""" + +from __future__ import annotations + +import json +import logging +from typing import Any, ForwardRef, List, Literal, Optional, Tuple, Type, Union + +from azure.storage.blob import BlobServiceClient +from pydantic import BaseModel, ConfigDict, Field, create_model + +from libs.utils.azure_credential_utils import get_azure_credential + +logger = logging.getLogger(__name__) + + +class JsonSchemaLoadError(ValueError): + """Raised when a JSON schema descriptor cannot be turned into a model.""" + + +def load_schema_from_blob_json( + account_url: str, + container_name: str, + blob_name: str, + model_name: str, +) -> Type[BaseModel]: + """Download a JSON Schema and return a generated Pydantic model class. + + Args: + account_url: Azure Blob Storage account URL. + container_name: Container (path) holding the blob. + blob_name: Blob filename to download (a ``.json`` schema). + model_name: Name to assign to the root generated model class. + + Returns: + A dynamically generated subclass of :class:`pydantic.BaseModel` + whose shape matches the JSON Schema. + + Raises: + JsonSchemaLoadError: If the blob is not valid JSON or the schema + cannot be translated into a Pydantic model. + """ + raw = _download_blob_content(container_name, blob_name, account_url) + try: + document = json.loads(raw) + except json.JSONDecodeError as exc: + raise JsonSchemaLoadError( + f"Schema blob '{blob_name}' is not valid JSON: {exc.msg}" + ) from exc + + if not isinstance(document, dict): + raise JsonSchemaLoadError("Schema root must be a JSON object.") + + return build_model_from_schema(document, model_name) + + +def build_model_from_schema( + document: dict[str, Any], model_name: str +) -> Type[BaseModel]: + """Build a Pydantic model class from an in-memory JSON Schema document. + + This is split out from :func:`load_schema_from_blob_json` so it can + be unit-tested without touching Azure storage. 
+ """ + defs = document.get("$defs") or document.get("definitions") or {} + if not isinstance(defs, dict): + raise JsonSchemaLoadError("'$defs' must be a JSON object if present.") + + builder = _ModelBuilder(defs) + model = builder.build_object(document, model_name, is_root=True) + builder.resolve_forward_refs() + return model + + +# --------------------------------------------------------------------------- +# Internals +# --------------------------------------------------------------------------- + + +def _download_blob_content( + container_name: str, blob_name: str, account_url: str +) -> str: + """Download the blob and return its UTF-8 contents as a string.""" + credential = get_azure_credential() + blob_service_client = BlobServiceClient( + account_url=account_url, credential=credential + ) + blob_client = blob_service_client.get_blob_client( + container=container_name, blob=blob_name + ) + return blob_client.download_blob().readall().decode("utf-8") + + +class _ModelBuilder: + """Recursive JSON-Schema-to-Pydantic translator. + + The builder maintains a memo of already-generated models keyed by + ``$defs`` name so that repeated ``$ref`` references reuse the same + class and so that self/mutually-recursive schemas terminate. + """ + + _PRIMITIVE_TYPES: dict[str, type] = { + "string": str, + "integer": int, + "number": float, + "boolean": bool, + "null": type(None), + } + + def __init__(self, defs: dict[str, Any]): + self._defs = defs + self._models: dict[str, Type[BaseModel]] = {} + self._in_progress: set[str] = set() + self._all_models: list[Type[BaseModel]] = [] + + # -- public driver ---------------------------------------------------- + + def build_object( + self, + node: dict[str, Any], + model_name: str, + *, + is_root: bool = False, + ) -> Type[BaseModel]: + """Build a Pydantic model from an object-typed schema node.""" + if not is_root: + # Avoid colliding with a reserved $defs name when the caller + # supplies an inline object schema. 
+ model_name = self._dedupe_name(model_name) + + # Reserve the slot so $ref to the same definition resolves to us + # even before we finish constructing it. + self._in_progress.add(model_name) + try: + properties = node.get("properties") or {} + required = set(node.get("required") or []) + fields: dict[str, tuple[Any, Any]] = {} + + for prop_name, prop_schema in properties.items(): + python_type, default = self._field_for( + prop_schema, prop_name, parent_name=model_name + ) + if prop_name in required and default is None: + field_default: Any = ... + else: + field_default = default + + description = ( + prop_schema.get("description") + if isinstance(prop_schema, dict) + else None + ) + fields[prop_name] = ( + python_type, + Field(default=field_default, description=description), + ) + + model = create_model( # type: ignore[call-overload] + model_name, + __config__=ConfigDict(extra="ignore"), + **fields, + ) + description = node.get("description") or node.get("title") + if isinstance(description, str): + model.__doc__ = description + finally: + self._in_progress.discard(model_name) + + self._models[model_name] = model + self._all_models.append(model) + return model + + def resolve_forward_refs(self) -> None: + """Resolve any ``ForwardRef`` placeholders left during construction.""" + ns = dict(self._models) + for model in self._all_models: + try: + model.model_rebuild(_types_namespace=ns) + except Exception: # pragma: no cover - defensive + logger.exception( + "Failed to rebuild model %s while resolving forward refs", + model.__name__, + ) + + # -- field translation ------------------------------------------------ + + def _field_for( + self, + schema: Any, + prop_name: str, + parent_name: str, + ) -> Tuple[Any, Any]: + """Translate a property schema into ``(python_type, default_value)``. + + ``default_value`` is ``None`` when the field is nullable / optional; + callers replace it with ``...`` when the field is required. 
+ """ + if schema is True or schema is None or schema == {}: + return (Any, None) + if not isinstance(schema, dict): + raise JsonSchemaLoadError( + f"Property '{prop_name}' has invalid schema (not an object)." + ) + + # $ref resolution (local refs only). + ref = schema.get("$ref") + if isinstance(ref, str): + return (self._resolve_ref(ref), None) + + # anyOf / oneOf — treat as Union. + for key in ("anyOf", "oneOf"): + if key in schema: + members = schema[key] + if not isinstance(members, list) or not members: + raise JsonSchemaLoadError( + f"'{key}' for '{prop_name}' must be a non-empty list." + ) + resolved = [ + self._field_for(m, prop_name, parent_name)[0] for m in members + ] + return (Union[tuple(resolved)], None) # type: ignore[valid-type] + + # enum — Literal[...] of allowed values. + if "enum" in schema and isinstance(schema["enum"], list) and schema["enum"]: + literal_args = tuple(schema["enum"]) + return (Literal[literal_args], None) # type: ignore[valid-type] + + json_type = schema.get("type") + + if isinstance(json_type, list): + # e.g. ["string", "null"] + python_types = [self._type_for_simple(t, schema, prop_name, parent_name) + for t in json_type] + if len(python_types) == 1: + return (python_types[0], None) + unioned: Any = Union[tuple(python_types)] # type: ignore[valid-type] + return (unioned, None) + + if isinstance(json_type, str): + return ( + self._type_for_simple(json_type, schema, prop_name, parent_name), + None, + ) + + # No type declared → permissive. 
+ return (Any, None) + + def _type_for_simple( + self, + json_type: str, + schema: dict[str, Any], + prop_name: str, + parent_name: str, + ) -> Any: + """Translate a single JSON-Schema primitive ``type`` token.""" + if json_type in self._PRIMITIVE_TYPES: + return self._PRIMITIVE_TYPES[json_type] + if json_type == "array": + items = schema.get("items") + if items is None: + return List[Any] + item_type, _ = self._field_for(items, f"{prop_name}_item", parent_name) + return List[item_type] # type: ignore[valid-type] + if json_type == "object": + inline_name = self._inline_object_name(parent_name, prop_name) + return self.build_object(schema, inline_name) + raise JsonSchemaLoadError( + f"Unsupported JSON Schema type '{json_type}' for property '{prop_name}'." + ) + + def _resolve_ref(self, ref: str) -> Any: + """Resolve a local JSON-Pointer reference into a generated model.""" + prefix_defs = "#/$defs/" + prefix_definitions = "#/definitions/" + if ref.startswith(prefix_defs): + name = ref[len(prefix_defs):] + elif ref.startswith(prefix_definitions): + name = ref[len(prefix_definitions):] + else: + raise JsonSchemaLoadError( + f"Only local '#/$defs/...' refs are supported (got '{ref}')." + ) + + if name in self._models: + return self._models[name] + + if name in self._in_progress: + # Cycle: emit a forward reference; resolved later. + return ForwardRef(name) + + if name not in self._defs: + raise JsonSchemaLoadError( + f"Reference '{ref}' does not resolve to a known $defs entry." + ) + + sub_schema = self._defs[name] + if not isinstance(sub_schema, dict): + raise JsonSchemaLoadError( + f"$defs entry '{name}' must be a JSON object." + ) + + sub_type = sub_schema.get("type") + if sub_type == "object" or "properties" in sub_schema: + return self.build_object(sub_schema, name) + + # Non-object $defs entry (rare): translate as a field type. 
+ translated, _ = self._field_for(sub_schema, name, parent_name=name) + # Cache simple-type aliases so repeated refs return the same thing. + # (We don't add to self._models because that map is for BaseModel + # subclasses only, but ForwardRef handling does not apply to scalar + # aliases — return the type directly.) + return translated + + # -- name helpers ---------------------------------------------------- + + def _dedupe_name(self, candidate: str) -> str: + """Ensure a freshly generated model name does not collide.""" + if candidate not in self._models and candidate not in self._in_progress: + return candidate + i = 2 + while f"{candidate}_{i}" in self._models or f"{candidate}_{i}" in self._in_progress: + i += 1 + return f"{candidate}_{i}" + + @staticmethod + def _inline_object_name(parent_name: str, prop_name: str) -> str: + """Synthesize a stable name for an inline object schema.""" + camel = "".join(part.capitalize() for part in prop_name.split("_") if part) + return f"{parent_name}_{camel or 'Inline'}" diff --git a/src/ContentProcessor/tests/unit/utils/test_remote_schema_loader.py b/src/ContentProcessor/tests/unit/utils/test_remote_schema_loader.py new file mode 100644 index 00000000..81ba3535 --- /dev/null +++ b/src/ContentProcessor/tests/unit/utils/test_remote_schema_loader.py @@ -0,0 +1,282 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. + +"""Tests for the JSON-Schema-based remote schema loader. + +These tests intentionally avoid touching Azure and only exercise +:func:`build_model_from_schema`, the in-memory translator that +:func:`load_schema_from_blob_json` delegates to. +""" + +from __future__ import annotations + +import json +from pathlib import Path + +import pytest +from pydantic import BaseModel + +from libs.utils.remote_schema_loader import ( + JsonSchemaLoadError, + build_model_from_schema, +) + +#: Repo-relative path to the golden JSON schema generated from autoclaim.py. 
+_GOLDEN_AUTOCLAIM = ( + Path(__file__).resolve().parents[4] + / "ContentProcessorAPI" + / "samples" + / "schemas" + / "autoclaim.json" +) + + +# --------------------------------------------------------------------------- +# Construction +# --------------------------------------------------------------------------- + + +def test_builds_simple_object_model(): + schema = { + "type": "object", + "title": "Invoice", + "properties": { + "id": {"type": "string"}, + "amount": {"type": "number"}, + "paid": {"type": "boolean"}, + }, + "required": ["id"], + } + model = build_model_from_schema(schema, "Invoice") + + assert issubclass(model, BaseModel) + instance = model.model_validate({"id": "INV1", "amount": 12.5, "paid": True}) + assert instance.id == "INV1" + assert instance.amount == 12.5 + + with pytest.raises(Exception): + model.model_validate({}) # missing required 'id' + + +def test_supports_nullable_via_anyof(): + schema = { + "type": "object", + "properties": { + "name": {"anyOf": [{"type": "string"}, {"type": "null"}]}, + }, + } + model = build_model_from_schema(schema, "X") + instance = model.model_validate({"name": None}) + assert instance.name is None + + +def test_supports_nullable_via_type_array(): + schema = { + "type": "object", + "properties": { + "name": {"type": ["string", "null"]}, + }, + } + model = build_model_from_schema(schema, "X") + assert model.model_validate({"name": None}).name is None + assert model.model_validate({"name": "ok"}).name == "ok" + + +def test_supports_arrays_of_primitives(): + schema = { + "type": "object", + "properties": { + "tags": {"type": "array", "items": {"type": "string"}}, + }, + } + model = build_model_from_schema(schema, "X") + instance = model.model_validate({"tags": ["a", "b"]}) + assert instance.tags == ["a", "b"] + + +def test_supports_inline_nested_object(): + schema = { + "type": "object", + "properties": { + "address": { + "type": "object", + "properties": { + "city": {"type": "string"}, + }, + }, + }, + } + 
model = build_model_from_schema(schema, "Person") + instance = model.model_validate({"address": {"city": "Macon"}}) + assert instance.address.city == "Macon" + + +def test_supports_refs_and_defs(): + schema = { + "$defs": { + "Address": { + "type": "object", + "properties": { + "city": {"type": "string"}, + }, + } + }, + "type": "object", + "properties": { + "primary": {"$ref": "#/$defs/Address"}, + "secondary": {"$ref": "#/$defs/Address"}, + }, + } + model = build_model_from_schema(schema, "Contact") + + instance = model.model_validate({ + "primary": {"city": "Macon"}, + "secondary": {"city": "Atlanta"}, + }) + # Both refs resolved to the *same* generated class. + assert type(instance.primary) is type(instance.secondary) + + +def test_supports_enum_via_literal(): + schema = { + "type": "object", + "properties": { + "tier": {"enum": ["bronze", "silver", "gold"]}, + }, + } + model = build_model_from_schema(schema, "Tier") + assert model.model_validate({"tier": "gold"}).tier == "gold" + with pytest.raises(Exception): + model.model_validate({"tier": "platinum"}) + + +# --------------------------------------------------------------------------- +# Failure modes +# --------------------------------------------------------------------------- + + +def test_rejects_unknown_ref_target(): + schema = { + "type": "object", + "properties": {"a": {"$ref": "#/$defs/Missing"}}, + } + with pytest.raises(JsonSchemaLoadError) as exc: + build_model_from_schema(schema, "X") + assert "$defs" in str(exc.value) + + +def test_rejects_external_ref(): + schema = { + "type": "object", + "properties": {"a": {"$ref": "https://example.com/schema.json"}}, + } + with pytest.raises(JsonSchemaLoadError): + build_model_from_schema(schema, "X") + + +# --------------------------------------------------------------------------- +# Golden-equivalence: the JSON schema generated from autoclaim.py builds a +# model that round-trips an LLM-style payload to the same dict that the +# legacy autoclaim.py would 
produce. +# --------------------------------------------------------------------------- + + +def _representative_payload() -> dict: + return { + "insurance_company": "Contoso Insurance", + "claim_number": "CLM987654", + "policy_number": "AUTO123456", + "policyholder_information": { + "name": "Chad Brooks", + "address": { + "street": "123 Main St", + "city": "Macon", + "state": "GA", + "postal_code": "31201", + "country": "USA", + }, + "phone": "(555) 555-1212", + "email": "chad.brooks@example.com", + }, + "policy_details": { + "coverage_type": "Auto - Comprehensive", + "effective_date": "2025-01-01", + "expiration_date": "2025-12-31", + "deductible": 500.0, + "deductible_currency": "USD", + }, + "incident_details": { + "date_of_loss": "2025-11-28", + "time_of_loss": "14:15", + "location": "Parking lot", + "cause_of_loss": "Low-speed collision", + "description": "Minor dent", + "police_report_filed": True, + "police_report_number": "GA-20251128-CR", + }, + "vehicle_information": { + "year": 2022, + "make": "Toyota", + "model": "Camry", + "trim": "SE", + "vin": "4T1G11AK2NU123456", + "license_plate": "GA-ABC123", + "mileage": 28450, + }, + "damage_assessment": { + "items": [ + { + "item_description": "Right-front quarter panel", + "date_acquired": "2022-03-15", + "cost_new": 1200.0, + "cost_new_currency": "USD", + "repair_estimate": 350.0, + "repair_estimate_currency": "USD", + } + ], + "total_estimated_repair": 500.0, + "total_estimated_repair_currency": "USD", + }, + "supporting_documents": { + "photos_of_damage": True, + "police_report_copy": True, + "repair_shop_estimate": True, + "other": [], + }, + "declaration": { + "statement": "I declare...", + "signature": {"signatory": "Chad Brooks", "is_signed": True}, + "date": "2025-12-01", + }, + "submission_instructions": { + "submission_email": "claims@contoso.com", + "portal_url": None, + "notes": None, + }, + } + + +def test_golden_autoclaim_round_trip(): + document = 
json.loads(_GOLDEN_AUTOCLAIM.read_text(encoding="utf-8")) + model = build_model_from_schema(document, "AutoInsuranceClaimForm") + + payload = _representative_payload() + instance = model.model_validate(payload) + dumped = instance.model_dump() + + # Every field round-trips and nested objects produced the same shape. + assert dumped["insurance_company"] == "Contoso Insurance" + assert dumped["policyholder_information"]["address"]["city"] == "Macon" + assert dumped["damage_assessment"]["items"][0]["cost_new"] == 1200.0 + assert dumped["declaration"]["signature"]["is_signed"] is True + + +def test_golden_autoclaim_emits_json_schema(): + document = json.loads(_GOLDEN_AUTOCLAIM.read_text(encoding="utf-8")) + model = build_model_from_schema(document, "AutoInsuranceClaimForm") + + # The generated model must be able to emit its own JSON schema; this is + # what map_handler.py passes to the LLM via ``model_json_schema()``. + out_schema = model.model_json_schema() + assert out_schema.get("type") == "object" + assert "properties" in out_schema diff --git a/src/ContentProcessorAPI/app/routers/logics/schema_validator.py b/src/ContentProcessorAPI/app/routers/logics/schema_validator.py new file mode 100644 index 00000000..280fee9a --- /dev/null +++ b/src/ContentProcessorAPI/app/routers/logics/schema_validator.py @@ -0,0 +1,157 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. + +"""Validate uploaded JSON Schema descriptors used by the content-processing pipeline. + +A JSON schema descriptor is treated as **data**: it is parsed (never +executed), checked against the JSON Schema Draft 2020-12 meta-schema, and +required to use only a small set of project-specific custom keywords. + +This module is intentionally side-effect free; it does not touch storage +or Cosmos. The router is responsible for calling :func:`validate_json_schema` +and acting on the returned errors. 
+""" + +from __future__ import annotations + +import json +from typing import Any, Iterable + +from jsonschema import Draft202012Validator +from jsonschema.exceptions import SchemaError + +#: Maximum size in bytes for an uploaded JSON schema. Schemas are config +#: artefacts; a generous cap of 1 MB matches the legacy ``.py`` limit. +MAX_SCHEMA_BYTES: int = 1 * 1024 * 1024 + +#: Allowlisted project-specific custom keywords. Any other ``x-cps-*`` or +#: ``x-`` keyword in the uploaded schema is rejected so unknown extension +#: points cannot be smuggled in. +ALLOWED_CPS_KEYWORDS: frozenset[str] = frozenset({ + "x-cps-extract-prompt", + "x-cps-required-on-save", +}) + + +class SchemaValidationError(ValueError): + """Raised when an uploaded JSON schema fails validation. + + Attributes: + errors: Human-readable list of violations. + """ + + def __init__(self, errors: list[str]): + self.errors = errors + super().__init__("; ".join(errors) if errors else "Invalid JSON schema") + + +def validate_json_schema(raw_bytes: bytes) -> dict[str, Any]: + """Validate the bytes of an uploaded JSON Schema descriptor. + + Args: + raw_bytes: Uploaded file contents. + + Returns: + The parsed schema document as a ``dict`` (only on success). + + Raises: + SchemaValidationError: If the bytes are too large, are not valid + JSON, do not conform to JSON Schema Draft 2020-12, or use + disallowed custom extension keywords. + """ + errors: list[str] = [] + + if raw_bytes is None: + raise SchemaValidationError(["Empty schema upload."]) + + if len(raw_bytes) > MAX_SCHEMA_BYTES: + raise SchemaValidationError([ + f"Schema is too large ({len(raw_bytes)} bytes; max {MAX_SCHEMA_BYTES})." 
+        ])
+
+    try:
+        document = json.loads(raw_bytes.decode("utf-8"))
+    except UnicodeDecodeError as exc:
+        raise SchemaValidationError([f"Schema must be UTF-8 encoded: {exc}"]) from exc
+    except json.JSONDecodeError as exc:
+        raise SchemaValidationError([f"Schema is not valid JSON: {exc.msg}"]) from exc
+
+    if not isinstance(document, dict):
+        raise SchemaValidationError([
+            "Schema root must be a JSON object describing the model."
+        ])
+
+    # Reject schemas without a usable type. We only support object roots
+    # because the pipeline materialises a Pydantic model from them.
+    root_type = document.get("type")
+    if root_type != "object":
+        errors.append(
+            "Schema root must declare 'type': 'object' "
+            f"(got {root_type!r})."
+        )
+
+    if "properties" not in document or not isinstance(
+        document.get("properties"), dict
+    ):
+        errors.append("Schema root must declare a 'properties' object.")
+
+    # Validate the document itself is a syntactically valid Draft 2020-12 schema.
+    try:
+        Draft202012Validator.check_schema(document)
+    except SchemaError as exc:
+        errors.append(f"Not a valid JSON Schema (Draft 2020-12): {exc.message}")
+
+    # Walk the document and reject unknown ``x-`` extension keywords.
+    for path, key in _walk_extension_keywords(document):
+        if key not in ALLOWED_CPS_KEYWORDS:
+            errors.append(
+                f"Unsupported extension keyword '{key}' at {path or 'root'}. "
+                f"Allowed: {sorted(ALLOWED_CPS_KEYWORDS)}."
+            )
+
+    if errors:
+        raise SchemaValidationError(errors)
+
+    return document
+
+
+def derive_class_name(document: dict[str, Any], fallback: str) -> str:
+    """Derive a stable class name for the schema document.
+
+    The schema's ``title`` is preferred (matches Pydantic conventions);
+    otherwise the supplied fallback name is used. Any non-identifier
+    characters in the chosen name are replaced with underscores so the
+    result is always a valid Python identifier.
+
+    Args:
+        document: Parsed JSON schema document.
+ fallback: Filename stem (without extension) to use if no title. + + Returns: + A non-empty string suitable for use as a Pydantic model name. + """ + title = document.get("title") + if isinstance(title, str) and title.strip(): + candidate = title.strip() + else: + candidate = fallback + + cleaned = "".join(ch if ch.isalnum() or ch == "_" else "_" for ch in candidate) + if not cleaned or not (cleaned[0].isalpha() or cleaned[0] == "_"): + cleaned = "Schema_" + cleaned + return cleaned + + +def _walk_extension_keywords( + node: Any, path: str = "" +) -> Iterable[tuple[str, str]]: + """Yield every ``(path, key)`` for keys starting with ``x-`` anywhere in *node*.""" + if isinstance(node, dict): + for key, value in node.items(): + if isinstance(key, str) and key.startswith("x-"): + yield path, key + child_path = f"{path}.{key}" if path else str(key) + yield from _walk_extension_keywords(value, child_path) + elif isinstance(node, list): + for idx, item in enumerate(node): + yield from _walk_extension_keywords(item, f"{path}[{idx}]") diff --git a/src/ContentProcessorAPI/app/routers/logics/schemavault.py b/src/ContentProcessorAPI/app/routers/logics/schemavault.py index f97663c4..dc5f34c8 100644 --- a/src/ContentProcessorAPI/app/routers/logics/schemavault.py +++ b/src/ContentProcessorAPI/app/routers/logics/schemavault.py @@ -67,7 +67,13 @@ def Add(self, file: UploadFile, schema: Schema) -> Schema: self.mongoHelper.insert_document(schema.model_dump(mode="json")) return schema - def Update(self, file: UploadFile, schema_id: str, class_name: str) -> Schema: + def Update( + self, + file: UploadFile, + schema_id: str, + class_name: str, + storage_format: str = "python", + ) -> Schema: """Replace the schema file in blob storage and update Cosmos metadata.""" schemas = self.mongoHelper.find_document(query={"Id": schema_id}) if not schemas: @@ -80,6 +86,7 @@ def Update(self, file: UploadFile, schema_id: str, class_name: str) -> Schema: schema_object.ClassName = class_name 
schema_object.ContentType = file.content_type + schema_object.Format = storage_format schema_object.Updated_On = result["date"] self.mongoHelper.update_document( diff --git a/src/ContentProcessorAPI/app/routers/models/schmavault/model.py b/src/ContentProcessorAPI/app/routers/models/schmavault/model.py index c8045220..73072bab 100644 --- a/src/ContentProcessorAPI/app/routers/models/schmavault/model.py +++ b/src/ContentProcessorAPI/app/routers/models/schmavault/model.py @@ -5,7 +5,7 @@ import datetime import json -from typing import Optional +from typing import Literal, Optional from pydantic import BaseModel, ConfigDict, Field, model_validator @@ -15,10 +15,15 @@ class Schema(BaseModel): Attributes: Id: Unique schema identifier. - ClassName: Python class name of the schema. + ClassName: Class name of the schema (Python class for legacy + ``.py`` schemas, or the JSON Schema ``title`` for JSON + schemas). Description: Human-readable description. FileName: Source filename for the schema definition. ContentType: Expected content/MIME type. + Format: Storage format of the schema artifact. + ``"python"`` (default, legacy) for ``.py`` files; + ``"json"`` for declarative JSON Schema descriptors. Created_On: UTC timestamp when the schema was registered. Updated_On: UTC timestamp of the last update. 
""" @@ -28,6 +33,7 @@ class Schema(BaseModel): Description: str FileName: str ContentType: str + Format: Literal["python", "json"] = Field(default="python") Created_On: Optional[datetime.datetime] = Field(default=None) Updated_On: Optional[datetime.datetime] = Field(default=None) model_config = ConfigDict(from_attributes=True) diff --git a/src/ContentProcessorAPI/app/routers/schemavault.py b/src/ContentProcessorAPI/app/routers/schemavault.py index 93e4e2b7..be331d90 100644 --- a/src/ContentProcessorAPI/app/routers/schemavault.py +++ b/src/ContentProcessorAPI/app/routers/schemavault.py @@ -12,6 +12,11 @@ from fastapi.responses import StreamingResponse from app.libs.base.typed_fastapi import TypedFastAPI +from app.routers.logics.schema_validator import ( + SchemaValidationError, + derive_class_name, + validate_json_schema, +) from app.routers.logics.schemavault import Schemas from app.routers.models.schmavault.model import ( Schema, @@ -28,6 +33,47 @@ responses={404: {"description": "Not found"}}, ) +#: Filename extensions accepted by the schema-vault upload routes. +#: ``.py`` is the legacy Python class format (executed by the worker via +#: ``remote_module_loader``). ``.json`` is the declarative JSON Schema +#: format introduced as part of the migration away from executable +#: schemas; it is parsed as data and never executed. +_ALLOWED_EXTENSIONS: tuple[str, ...] = (".py", ".json") +_MAX_UPLOAD_BYTES: int = 1 * 1024 * 1024 + + +def _validate_upload(file: UploadFile) -> tuple[str, str]: + """Common upload checks for ``POST`` and ``PUT`` schema endpoints. + + Returns a ``(safe_filename, extension)`` tuple. Raises ``HTTPException`` + with the appropriate status on any failure. 
+    """
+    try:
+        safe_filename = sanitize_filename(file.filename)
+    except ValueError:
+        raise HTTPException(status_code=400, detail="Filename is too long.")
+
+    extension = os.path.splitext(safe_filename)[1].lower()
+    if extension not in _ALLOWED_EXTENSIONS:
+        raise HTTPException(
+            status_code=415,
+            detail=(
+                "Unsupported schema file type. "
+                "Only .py and .json schema files are supported."
+            ),
+        )
+
+    size_bytes = get_upload_size_bytes(file)
+    if size_bytes is None:
+        raise HTTPException(status_code=400, detail="Unable to determine upload size.")
+
+    if size_bytes > _MAX_UPLOAD_BYTES:
+        raise HTTPException(
+            status_code=413, detail="Schema file is too large (max 1 MB)."
+        )
+
+    return safe_filename, extension
+

 @router.get(
     "/",
@@ -61,25 +107,34 @@ async def Get_All_Registered_Schema(
     response_model=Schema,
     summary="Register a schema",
     description="""
-    Registers a new schema file (`.py`) and stores its metadata in the Schema Vault.
+    Registers a new schema file (`.py` or `.json`) and stores its metadata
+    in the Schema Vault.

    The request must be sent as `multipart/form-data` with:
    - a JSON part (named `data`)
    - a file part (named `file`)

    Constraints:
-    - Only `.py` files are accepted.
+    - Accepted extensions: `.py` (legacy executable Python class) and
+      `.json` (declarative JSON Schema; recommended).
    - Max size: 1 MB.

+    For `.json` uploads:
+    - Must be a valid JSON Schema (Draft 2020-12) with `type: "object"`
+      and a `properties` block.
+    - A `title` in the JSON document takes precedence as the class name;
+      otherwise the request body's `ClassName` (then the filename stem) is used.
+
    ## Parameters
-    - **ClassName** (body): Schema class name contained in the uploaded file.
+    - **ClassName** (body): Schema class name. Used for `.py` uploads and
+      as a fallback for `.json` uploads without a `title`.
    - **Description** (body): Human-readable description.
-    - **file** (form): `.py` schema file (max 1 MB).
+    - **file** (form): `.py` or `.json` schema file (max 1 MB).
## Example Request Body multipart/form-data - `data`: `{ "ClassName": "InvoiceSchema", "Description": "Extract invoice fields" }` - - `file`: `` + - `file`: `` or `` """, ) async def Register_Schema( @@ -87,40 +142,45 @@ async def Register_Schema( file: UploadFile = File(...), request: Request = None, ) -> Schema: - """Register a new schema file (.py) into the vault.""" + """Register a new schema file into the vault.""" app: TypedFastAPI = request.app # type: ignore schemas: Schemas = app.app_context.get_service(Schemas) - try: - safe_filename = sanitize_filename(file.filename) - except ValueError: - raise HTTPException(status_code=400, detail="Filename is too long.") - extension = os.path.splitext(safe_filename)[1].lower() - if extension != ".py": - raise HTTPException( - status_code=415, - detail="Unsupported schema file type. Only .py schema files are supported.", - ) - - size_bytes = get_upload_size_bytes(file) - if size_bytes is None: - raise HTTPException(status_code=400, detail="Unable to determine upload size.") - - # Schemas are small config artifacts; keep a conservative cap. - if size_bytes > 1 * 1024 * 1024: - raise HTTPException( - status_code=413, detail="Schema file is too large (max 1 MB)." - ) + safe_filename, extension = _validate_upload(file) + + # Determine the storage format and final ClassName based on extension. + # For ``.json`` schemas we additionally validate the document up front so + # that no blob or Cosmos record is ever written for an invalid schema. 
+ if extension == ".json": + raw = file.file.read() + file.file.seek(0) + try: + document = validate_json_schema(raw) + except SchemaValidationError as exc: + raise HTTPException( + status_code=400, + detail={"message": "Invalid JSON schema.", "errors": exc.errors}, + ) from exc + + fallback = os.path.splitext(safe_filename)[0] + class_name = derive_class_name(document, fallback=data.ClassName or fallback) + storage_format = "json" + content_type = file.content_type or "application/json" + else: + class_name = data.ClassName + storage_format = "python" + content_type = file.content_type or "text/x-python" return schemas.Add( file, Schema( Id=str(uuid.uuid4()), - ClassName=data.ClassName, + ClassName=class_name, Description=data.Description, FileName=safe_filename, - ContentType=file.content_type, + ContentType=content_type, + Format=storage_format, ), ) @@ -130,25 +190,27 @@ async def Register_Schema( response_model=Schema, summary="Update a schema", description=""" - Updates an existing registered schema (`.py` file) and associated metadata. + Updates an existing registered schema (`.py` or `.json` file) and + associated metadata. The request must be sent as `multipart/form-data` with: - a JSON part (named `data`) - a file part (named `file`) Constraints: - - Only `.py` files are accepted. + - Accepted extensions: `.py` and `.json`. - Max size: 1 MB. ## Parameters - **SchemaId** (body): Schema ID to update. - - **ClassName** (body): Updated class name. - - **file** (form): New `.py` schema file (max 1 MB). + - **ClassName** (body): Updated class name (fallback for `.json` + schemas without a `title`). + - **file** (form): New `.py` or `.json` schema file (max 1 MB). 
## Example Request Body multipart/form-data - `data`: `{ "SchemaId": "", "ClassName": "InvoiceSchema" }` - - `file`: `` + - `file`: `` or `` """, ) async def Update_Schema( @@ -158,29 +220,28 @@ async def Update_Schema( ) -> Schema: """Update an existing schema with a new file.""" app: TypedFastAPI = request.app # type: ignore - try: - safe_filename = sanitize_filename(file.filename) - except ValueError: - raise HTTPException(status_code=400, detail="Filename is too long.") - - extension = os.path.splitext(safe_filename)[1].lower() - if extension != ".py": - raise HTTPException( - status_code=415, - detail="Unsupported schema file type. Only .py schema files are supported.", - ) - size_bytes = get_upload_size_bytes(file) - if size_bytes is None: - raise HTTPException(status_code=400, detail="Unable to determine upload size.") - - if size_bytes > 1 * 1024 * 1024: - raise HTTPException( - status_code=413, detail="Schema file is too large (max 1 MB)." - ) + safe_filename, extension = _validate_upload(file) + + if extension == ".json": + raw = file.file.read() + file.file.seek(0) + try: + document = validate_json_schema(raw) + except SchemaValidationError as exc: + raise HTTPException( + status_code=400, + detail={"message": "Invalid JSON schema.", "errors": exc.errors}, + ) from exc + fallback = os.path.splitext(safe_filename)[0] + class_name = derive_class_name(document, fallback=data.ClassName or fallback) + storage_format = "json" + else: + class_name = data.ClassName + storage_format = "python" schemas: Schemas = app.app_context.get_service(Schemas) - return schemas.Update(file, data.SchemaId, data.ClassName) + return schemas.Update(file, data.SchemaId, class_name, storage_format) @router.delete( diff --git a/src/ContentProcessorAPI/app/tests/logics/test_schema_validator.py b/src/ContentProcessorAPI/app/tests/logics/test_schema_validator.py new file mode 100644 index 00000000..f8c02bfa --- /dev/null +++ 
b/src/ContentProcessorAPI/app/tests/logics/test_schema_validator.py @@ -0,0 +1,151 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. + +"""Tests for the JSON Schema validator used by the schema vault upload routes.""" + +from __future__ import annotations + +import json +from pathlib import Path + +import pytest + +from app.routers.logics.schema_validator import ( + ALLOWED_CPS_KEYWORDS, + SchemaValidationError, + derive_class_name, + validate_json_schema, +) + + +SAMPLES_DIR = ( + Path(__file__).resolve().parents[3] / "samples" / "schemas" +) + + +def _minimal_object_schema(**extra) -> dict: + base = { + "type": "object", + "title": "Minimal", + "properties": {"name": {"type": "string"}}, + } + base.update(extra) + return base + + +def _bytes(doc) -> bytes: + return json.dumps(doc).encode("utf-8") + + +# --------------------------------------------------------------------------- +# Happy path +# --------------------------------------------------------------------------- + + +def test_validate_accepts_minimal_object_schema(): + document = validate_json_schema(_bytes(_minimal_object_schema())) + assert document["title"] == "Minimal" + + +def test_validate_accepts_autoclaim_golden(): + raw = (SAMPLES_DIR / "autoclaim.json").read_bytes() + document = validate_json_schema(raw) + assert document["title"] == "AutoInsuranceClaimForm" + assert document["type"] == "object" + + +def test_validate_accepts_allowed_cps_keywords(): + schema = _minimal_object_schema() + schema["properties"]["name"]["x-cps-extract-prompt"] = "Extract the full name." 
+    schema["properties"]["name"]["x-cps-required-on-save"] = True
+    validate_json_schema(_bytes(schema))
+
+
+# ---------------------------------------------------------------------------
+# Failure modes
+# ---------------------------------------------------------------------------
+
+
+def test_validate_rejects_non_utf8_bytes():
+    with pytest.raises(SchemaValidationError) as exc:
+        validate_json_schema(b"\xff\xfe\x00not utf-8")
+    assert "UTF-8" in str(exc.value)
+
+
+def test_validate_rejects_non_json():
+    with pytest.raises(SchemaValidationError) as exc:
+        validate_json_schema(b"not json at all")
+    assert "not valid JSON" in str(exc.value)
+
+
+def test_validate_rejects_non_object_root():
+    with pytest.raises(SchemaValidationError):
+        validate_json_schema(_bytes([1, 2, 3]))
+
+
+def test_validate_rejects_missing_type_object():
+    schema = {"title": "X", "properties": {"a": {"type": "string"}}}
+    with pytest.raises(SchemaValidationError) as exc:
+        validate_json_schema(_bytes(schema))
+    assert "type" in str(exc.value)
+
+
+def test_validate_rejects_missing_properties():
+    schema = {"title": "X", "type": "object"}
+    with pytest.raises(SchemaValidationError) as exc:
+        validate_json_schema(_bytes(schema))
+    assert "properties" in str(exc.value)
+
+
+def test_validate_rejects_invalid_dialect():
+    schema = _minimal_object_schema()
+    # "banana" is not a recognised simple type, so this violates the meta-schema.
+ schema["properties"]["name"] = {"type": "banana"} + with pytest.raises(SchemaValidationError) as exc: + validate_json_schema(_bytes(schema)) + assert "JSON Schema" in str(exc.value) + + +def test_validate_rejects_unknown_x_keyword(): + schema = _minimal_object_schema() + schema["x-evil-side-channel"] = "haha" + with pytest.raises(SchemaValidationError) as exc: + validate_json_schema(_bytes(schema)) + assert "x-evil-side-channel" in str(exc.value) + + +def test_validate_rejects_unknown_x_keyword_in_nested_property(): + schema = _minimal_object_schema() + schema["properties"]["name"]["x-cps-malicious"] = True + with pytest.raises(SchemaValidationError): + validate_json_schema(_bytes(schema)) + + +def test_validate_rejects_oversized_payload(): + big = "x" * (2 * 1024 * 1024) + schema = _minimal_object_schema(description=big) + with pytest.raises(SchemaValidationError) as exc: + validate_json_schema(_bytes(schema)) + assert "too large" in str(exc.value) + + +# --------------------------------------------------------------------------- +# derive_class_name +# --------------------------------------------------------------------------- + + +def test_derive_class_name_uses_title(): + assert derive_class_name({"title": "InvoiceSchema"}, fallback="x") == "InvoiceSchema" + + +def test_derive_class_name_falls_back_to_filename(): + assert derive_class_name({}, fallback="auto-claim") == "auto_claim" + + +def test_derive_class_name_sanitises_leading_digits(): + assert derive_class_name({}, fallback="9invoice") == "Schema_9invoice" + + +def test_allowed_keywords_constant_contains_expected_extensions(): + assert "x-cps-extract-prompt" in ALLOWED_CPS_KEYWORDS + assert "x-cps-required-on-save" in ALLOWED_CPS_KEYWORDS diff --git a/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py b/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py index 03c2134d..dca82123 100644 --- a/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py +++ 
b/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py @@ -44,6 +44,11 @@ def __init__(self, schemas: Schemas): def create_scope(self): return _FakeScopeContextManager(_FakeScope(self._schemas)) + def get_service(self, service_type): + if service_type is Schemas: + return self._schemas + raise KeyError(service_type) + @pytest.fixture def client_and_schemas(): @@ -199,3 +204,184 @@ def test_unregister_schema_error(client_and_schemas): json={"SchemaId": "missing"}, ) assert response.status_code == 500 + + +# --------------------------------------------------------------------------- +# JSON-schema upload path (declarative format, replaces executable .py) +# --------------------------------------------------------------------------- + + +def _minimal_json_schema_bytes(title: str = "InvoiceSchema") -> bytes: + return json.dumps({ + "type": "object", + "title": title, + "properties": {"invoice_id": {"type": "string"}}, + }).encode("utf-8") + + +def test_register_schema_accepts_json(client_and_schemas): + client, mock_schemas = client_and_schemas + mock_schemas.Add.return_value = { + "Id": "test-id", + "ClassName": "InvoiceSchema", + "Description": "desc", + "FileName": "invoice.json", + "ContentType": "application/json", + "Format": "json", + } + + files = { + "file": ( + "invoice.json", + _minimal_json_schema_bytes(), + "application/json", + ), + "data": ( + None, + json.dumps({"ClassName": "ignored", "Description": "desc"}), + "application/json", + ), + } + + response = client.post("/schemavault/", files=files) + assert response.status_code == 200, response.text + + add_args, _ = mock_schemas.Add.call_args + schema_obj = add_args[1] + # Schema's title wins over the request body's ClassName. 
+    assert schema_obj.ClassName == "InvoiceSchema"
+    assert schema_obj.Format == "json"
+    assert schema_obj.FileName == "invoice.json"
+
+
+def test_register_schema_rejects_invalid_json(client_and_schemas):
+    client, mock_schemas = client_and_schemas
+    mock_schemas.Add.reset_mock()
+
+    files = {
+        "file": ("schema.json", b"{not json", "application/json"),
+        "data": (
+            None,
+            json.dumps({"ClassName": "X", "Description": "Y"}),
+            "application/json",
+        ),
+    }
+
+    response = client.post("/schemavault/", files=files)
+    assert response.status_code == 400
+    assert "errors" in response.json()["detail"]
+    assert mock_schemas.Add.call_count == 0
+
+
+def test_register_schema_rejects_json_without_object_root(client_and_schemas):
+    client, mock_schemas = client_and_schemas
+    mock_schemas.Add.reset_mock()
+
+    files = {
+        "file": (
+            "schema.json",
+            json.dumps({"type": "array"}).encode("utf-8"),
+            "application/json",
+        ),
+        "data": (
+            None,
+            json.dumps({"ClassName": "X", "Description": "Y"}),
+            "application/json",
+        ),
+    }
+
+    response = client.post("/schemavault/", files=files)
+    assert response.status_code == 400
+    assert mock_schemas.Add.call_count == 0
+
+
+def test_register_schema_falls_back_to_request_classname(client_and_schemas):
+    client, mock_schemas = client_and_schemas
+    mock_schemas.Add.return_value = {
+        "Id": "test-id",
+        "ClassName": "fallback",
+        "Description": "desc",
+        "FileName": "auto-claim.json",
+        "ContentType": "application/json",
+        "Format": "json",
+    }
+
+    schema_bytes = json.dumps({
+        "type": "object",
+        "properties": {"x": {"type": "string"}},
+    }).encode("utf-8")
+
+    files = {
+        "file": ("auto-claim.json", schema_bytes, "application/json"),
+        "data": (
+            None,
+            json.dumps({"ClassName": "fallback", "Description": "desc"}),
+            "application/json",
+        ),
+    }
+
+    response = client.post("/schemavault/", files=files)
+    assert response.status_code == 200, response.text
+    schema_obj = mock_schemas.Add.call_args[0][1]
+    # When the JSON has no title, the
request-body ClassName is used as + # the fallback (after sanitisation to a Python identifier). + assert schema_obj.ClassName == "fallback" + assert schema_obj.Format == "json" + + +def test_register_schema_still_accepts_py(client_and_schemas): + client, mock_schemas = client_and_schemas + mock_schemas.Add.return_value = { + "Id": "test-id", + "ClassName": "Legacy", + "Description": "desc", + "FileName": "legacy.py", + "ContentType": "text/x-python", + "Format": "python", + } + + files = { + "file": ("legacy.py", b"class Legacy: pass\n", "text/x-python"), + "data": ( + None, + json.dumps({"ClassName": "Legacy", "Description": "desc"}), + "application/json", + ), + } + + response = client.post("/schemavault/", files=files) + assert response.status_code == 200, response.text + schema_obj = mock_schemas.Add.call_args[0][1] + assert schema_obj.Format == "python" + + +def test_update_schema_accepts_json(client_and_schemas): + client, mock_schemas = client_and_schemas + mock_schemas.Update.return_value = { + "Id": "test-id", + "ClassName": "InvoiceSchema", + "Description": "", + "FileName": "invoice.json", + "ContentType": "application/json", + "Format": "json", + } + + files = { + "file": ( + "invoice.json", + _minimal_json_schema_bytes(), + "application/json", + ), + "data": ( + None, + json.dumps({"SchemaId": "test-id", "ClassName": "x"}), + "application/json", + ), + } + + response = client.put("/schemavault/", files=files) + assert response.status_code == 200, response.text + args, _ = mock_schemas.Update.call_args + # Update is called with (file, schema_id, class_name, storage_format). 
+ assert args[2] == "InvoiceSchema" + assert args[3] == "json" diff --git a/src/ContentProcessorAPI/requirements.txt b/src/ContentProcessorAPI/requirements.txt index b57fbcd4..f620f6b7 100644 --- a/src/ContentProcessorAPI/requirements.txt +++ b/src/ContentProcessorAPI/requirements.txt @@ -25,6 +25,7 @@ httpx==0.28.1 idna==3.11 isodate==0.7.2 jinja2==3.1.6 +jsonschema==4.25.1 markdown-it-py==4.0.0 markupsafe==3.0.3 mdurl==0.1.2 diff --git a/src/ContentProcessorAPI/samples/schemas/autoclaim.json b/src/ContentProcessorAPI/samples/schemas/autoclaim.json new file mode 100644 index 00000000..cc7031b0 --- /dev/null +++ b/src/ContentProcessorAPI/samples/schemas/autoclaim.json @@ -0,0 +1,886 @@ +{ + "$defs": { + "AutoClaimAddress": { + "description": "A class representing an address used on an auto claim form.", + "properties": { + "street": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Street address, e.g. 123 Main St.", + "title": "Street" + }, + "city": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "City, e.g. Macon", + "title": "City" + }, + "state": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "State, e.g. GA", + "title": "State" + }, + "postal_code": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Postal code, e.g. 31201", + "title": "Postal Code" + }, + "country": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Country, e.g. 
USA", + "title": "Country" + } + }, + "required": [ + "street", + "city", + "state", + "postal_code", + "country" + ], + "title": "AutoClaimAddress", + "type": "object" + }, + "DamageAssessment": { + "description": "A class representing overall damage assessment.", + "properties": { + "items": { + "anyOf": [ + { + "items": { + "$ref": "#/$defs/DamageAssessmentItem" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "List of damage assessment line items", + "title": "Items" + }, + "total_estimated_repair": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Total estimated repair, e.g. 500.0", + "title": "Total Estimated Repair" + }, + "total_estimated_repair_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency of total_estimated_repair, e.g. USD", + "title": "Total Estimated Repair Currency" + } + }, + "required": [ + "items", + "total_estimated_repair", + "total_estimated_repair_currency" + ], + "title": "DamageAssessment", + "type": "object" + }, + "DamageAssessmentItem": { + "description": "A class representing a damage assessment line item.", + "properties": { + "item_description": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Damaged item/area description, e.g. Right-front quarter panel", + "title": "Item Description" + }, + "date_acquired": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Date acquired (if present), e.g. 2022-03-15", + "title": "Date Acquired" + }, + "cost_new": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Cost when new, e.g. 1200.0", + "title": "Cost New" + }, + "cost_new_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency of cost_new, e.g. 
USD", + "title": "Cost New Currency" + }, + "repair_estimate": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Repair estimate, e.g. 350.0", + "title": "Repair Estimate" + }, + "repair_estimate_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency of repair_estimate, e.g. USD", + "title": "Repair Estimate Currency" + } + }, + "required": [ + "item_description", + "date_acquired", + "cost_new", + "cost_new_currency", + "repair_estimate", + "repair_estimate_currency" + ], + "title": "DamageAssessmentItem", + "type": "object" + }, + "Declaration": { + "description": "A class representing the claim declaration.", + "properties": { + "statement": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Declaration statement text", + "title": "Statement" + }, + "signature": { + "anyOf": [ + { + "$ref": "#/$defs/Signature" + }, + { + "type": "null" + } + ], + "description": "Signature" + }, + "date": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Signature date, e.g. 2025-12-01", + "title": "Date" + } + }, + "required": [ + "statement", + "signature", + "date" + ], + "title": "Declaration", + "type": "object" + }, + "IncidentDetails": { + "description": "A class representing incident details.", + "properties": { + "date_of_loss": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Date of loss, e.g. 2025-11-28", + "title": "Date Of Loss" + }, + "time_of_loss": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Time of loss, e.g. 14:15", + "title": "Time Of Loss" + }, + "location": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Incident location, e.g. 
Parking lot near 123 Main Street, Macon, GA", + "title": "Location" + }, + "cause_of_loss": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Cause of loss, e.g. Low-speed collision with another vehicle", + "title": "Cause Of Loss" + }, + "description": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Incident description, e.g. Minor dent and paint scratches; no structural damage", + "title": "Description" + }, + "police_report_filed": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Whether a police report was filed", + "title": "Police Report Filed" + }, + "police_report_number": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Police report number, e.g. GA-20251128-CR", + "title": "Police Report Number" + } + }, + "required": [ + "date_of_loss", + "time_of_loss", + "location", + "cause_of_loss", + "description", + "police_report_filed", + "police_report_number" + ], + "title": "IncidentDetails", + "type": "object" + }, + "PolicyDetails": { + "description": "A class representing policy details.", + "properties": { + "coverage_type": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Coverage type, e.g. Auto \u2013 Comprehensive", + "title": "Coverage Type" + }, + "effective_date": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Policy effective date, e.g. 2025-01-01", + "title": "Effective Date" + }, + "expiration_date": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Policy expiration date, e.g. 2025-12-31", + "title": "Expiration Date" + }, + "deductible": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Deductible amount, e.g. 
500.0", + "title": "Deductible" + }, + "deductible_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency of the deductible, e.g. USD", + "title": "Deductible Currency" + } + }, + "required": [ + "coverage_type", + "effective_date", + "expiration_date", + "deductible", + "deductible_currency" + ], + "title": "PolicyDetails", + "type": "object" + }, + "PolicyholderInformation": { + "description": "A class representing policyholder information.", + "properties": { + "name": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Policyholder full name, e.g. Chad Brooks", + "title": "Name" + }, + "address": { + "anyOf": [ + { + "$ref": "#/$defs/AutoClaimAddress" + }, + { + "type": "null" + } + ], + "description": "Policyholder address, e.g. 123 Main Street, Macon, GA 31201" + }, + "phone": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Policyholder phone number, e.g. (555) 555-1212", + "title": "Phone" + }, + "email": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Policyholder email address, e.g. chad.brooks@example.com", + "title": "Email" + } + }, + "required": [ + "name", + "address", + "phone", + "email" + ], + "title": "PolicyholderInformation", + "type": "object" + }, + "Signature": { + "description": "A class representing a signature field.", + "properties": { + "signatory": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Name of the signatory", + "title": "Signatory" + }, + "is_signed": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Indicates if the form is signed. GPT should check whether it has signature in image files. 
If a signature is present, set it to True", + "title": "Is Signed" + } + }, + "required": [ + "signatory", + "is_signed" + ], + "title": "Signature", + "type": "object" + }, + "SubmissionInstructions": { + "description": "A class representing submission instructions.", + "properties": { + "submission_email": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Submission email address, e.g. claims@contosoinsurance.com", + "title": "Submission Email" + }, + "portal_url": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Claims portal URL, if present", + "title": "Portal Url" + }, + "notes": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Additional submission notes", + "title": "Notes" + } + }, + "required": [ + "submission_email", + "portal_url", + "notes" + ], + "title": "SubmissionInstructions", + "type": "object" + }, + "SupportingDocuments": { + "description": "A class representing supporting documents included with the claim.", + "properties": { + "photos_of_damage": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Whether photos of damage are included", + "title": "Photos Of Damage" + }, + "police_report_copy": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Whether a police report copy is included", + "title": "Police Report Copy" + }, + "repair_shop_estimate": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Whether a repair shop estimate is included", + "title": "Repair Shop Estimate" + }, + "other": { + "anyOf": [ + { + "items": { + "type": "string" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "Other supporting documents", + "title": "Other" + } + }, + "required": [ + "photos_of_damage", + "police_report_copy", + "repair_shop_estimate", + "other" + ], + "title": 
"SupportingDocuments", + "type": "object" + }, + "VehicleInformation": { + "description": "A class representing vehicle information.", + "properties": { + "year": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Vehicle year, e.g. 2022", + "title": "Year" + }, + "make": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle make, e.g. Toyota", + "title": "Make" + }, + "model": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle model, e.g. Camry", + "title": "Model" + }, + "trim": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle trim, e.g. SE", + "title": "Trim" + }, + "vin": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle VIN, e.g. 4T1G11AK2NU123456", + "title": "Vin" + }, + "license_plate": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "License plate, e.g. GA-ABC123", + "title": "License Plate" + }, + "mileage": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Mileage, e.g. 28450", + "title": "Mileage" + } + }, + "required": [ + "year", + "make", + "model", + "trim", + "vin", + "license_plate", + "mileage" + ], + "title": "VehicleInformation", + "type": "object" + } + }, + "description": "A class representing an auto insurance claim form.", + "properties": { + "insurance_company": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Insurance company name, e.g. Contoso Insurance", + "title": "Insurance Company" + }, + "claim_number": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Claim number, e.g. 
CLM987654", + "title": "Claim Number" + }, + "policy_number": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Policy number, e.g. AUTO123456", + "title": "Policy Number" + }, + "policyholder_information": { + "anyOf": [ + { + "$ref": "#/$defs/PolicyholderInformation" + }, + { + "type": "null" + } + ], + "description": "Policyholder information" + }, + "policy_details": { + "anyOf": [ + { + "$ref": "#/$defs/PolicyDetails" + }, + { + "type": "null" + } + ], + "description": "Policy details" + }, + "incident_details": { + "anyOf": [ + { + "$ref": "#/$defs/IncidentDetails" + }, + { + "type": "null" + } + ], + "description": "Incident details" + }, + "vehicle_information": { + "anyOf": [ + { + "$ref": "#/$defs/VehicleInformation" + }, + { + "type": "null" + } + ], + "description": "Vehicle information" + }, + "damage_assessment": { + "anyOf": [ + { + "$ref": "#/$defs/DamageAssessment" + }, + { + "type": "null" + } + ], + "description": "Damage assessment" + }, + "supporting_documents": { + "anyOf": [ + { + "$ref": "#/$defs/SupportingDocuments" + }, + { + "type": "null" + } + ], + "description": "Supporting documents" + }, + "declaration": { + "anyOf": [ + { + "$ref": "#/$defs/Declaration" + }, + { + "type": "null" + } + ], + "description": "Declaration" + }, + "submission_instructions": { + "anyOf": [ + { + "$ref": "#/$defs/SubmissionInstructions" + }, + { + "type": "null" + } + ], + "description": "Submission instructions" + } + }, + "required": [ + "insurance_company", + "claim_number", + "policy_number", + "policyholder_information", + "policy_details", + "incident_details", + "vehicle_information", + "damage_assessment", + "supporting_documents", + "declaration", + "submission_instructions" + ], + "title": "AutoInsuranceClaimForm", + "type": "object" +} From d0139e09a810c5fee8f98c18585f247dd622fd21 Mon Sep 17 00:00:00 2001 From: JSON Schema Migration Date: Tue, 28 Apr 2026 10:12:33 +0530 Subject: [PATCH 02/13] 
feat(schemavault): convert remaining sample schemas to JSON; register_schema.py supports .json - Adds damagedcarimage.json, policereport.json, repairestimate.json (generated via scripts/py_schema_to_json.py). - register_schema.py now picks the correct content-type per extension (.py -> text/x-python, .json -> application/json). - Manifest unchanged for now; flip to .json files when ready to deprecate the legacy Python path. --- .../samples/schemas/damagedcarimage.json | 617 ++++++++++++++++++ .../samples/schemas/policereport.json | 548 ++++++++++++++++ .../samples/schemas/register_schema.py | 25 +- .../samples/schemas/repairestimate.json | 596 +++++++++++++++++ 4 files changed, 1783 insertions(+), 3 deletions(-) create mode 100644 src/ContentProcessorAPI/samples/schemas/damagedcarimage.json create mode 100644 src/ContentProcessorAPI/samples/schemas/policereport.json create mode 100644 src/ContentProcessorAPI/samples/schemas/repairestimate.json diff --git a/src/ContentProcessorAPI/samples/schemas/damagedcarimage.json b/src/ContentProcessorAPI/samples/schemas/damagedcarimage.json new file mode 100644 index 00000000..f7d2385b --- /dev/null +++ b/src/ContentProcessorAPI/samples/schemas/damagedcarimage.json @@ -0,0 +1,617 @@ +{ + "$defs": { + "CameraViewpoint": { + "description": "Camera perspective relative to the vehicle.\n\nAttributes:\n spatial_reasoning: Chain-of-thought scratchpad for determining view angle.\n view_angle: Computed camera angle label.\n description: Free-text summary of the camera position.", + "properties": { + "spatial_reasoning": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "MANDATORY chain-of-thought reasoning about camera position. Must answer IN ORDER: (1) Can I see the FRONT (grille/headlights) or REAR (tail lights/trunk) of the vehicle? (2) Which side of the IMAGE does the body flank extend toward? 
(3) Apply the mirror rule: viewing the FRONT \u2014 image-right = vehicle LEFT, image-left = vehicle RIGHT. Viewing the REAR \u2014 image-right = vehicle RIGHT, image-left = vehicle LEFT. (4) Therefore view_angle = ? (5) FALLBACK only if neither front nor rear is visible (pure side view): use steering wheel position to determine driver side (LHD: left, RHD: right).", + "title": "Spatial Reasoning" + }, + "view_angle": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Primary camera viewing angle relative to the vehicle. Must be one of: front, front-left, front-right, left-side, right-side, rear-left, rear-right, rear, top, underneath, interior, unknown. Left/right = VEHICLE's own left/right (driver-perspective facing forward).", + "title": "View Angle" + }, + "description": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Free-text description of the camera position and angle relative to the vehicle, e.g. 
'Slightly elevated front-left view showing hood, front bumper, and left fender.'", + "title": "Description" + } + }, + "required": [ + "spatial_reasoning", + "view_angle", + "description" + ], + "title": "CameraViewpoint", + "type": "object" + }, + "DamageBoundingBox": { + "description": "Bounding box in normalized image coordinates [0..1].", + "properties": { + "x_min": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Left edge in [0..1]", + "title": "X Min" + }, + "y_min": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Top edge in [0..1]", + "title": "Y Min" + }, + "x_max": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Right edge in [0..1]", + "title": "X Max" + }, + "y_max": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Bottom edge in [0..1]", + "title": "Y Max" + } + }, + "required": [ + "x_min", + "y_min", + "x_max", + "y_max" + ], + "title": "DamageBoundingBox", + "type": "object" + }, + "DamageRegion": { + "description": "A detected region of damage on the vehicle.", + "properties": { + "location_on_vehicle": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Location on the vehicle using the VEHICLE's own left/right (driver-perspective facing forward). The side MUST match camera_viewpoint.view_angle. Examples: 'front-left fender', 'rear-right quarter panel'.", + "title": "Location On Vehicle" + }, + "damage_types": { + "anyOf": [ + { + "items": { + "type": "string" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "Damage types, e.g. ['scratch','dent','crack','paint-transfer']", + "title": "Damage Types" + }, + "severity": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Severity label, e.g. 
minor, moderate, severe", + "title": "Severity" + }, + "description": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Free-text description of the damage", + "title": "Description" + }, + "bounding_box": { + "anyOf": [ + { + "$ref": "#/$defs/DamageBoundingBox" + }, + { + "type": "null" + } + ], + "description": "Approx bounding box of the damage area (normalized coordinates)" + }, + "confidence": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Confidence score in [0..1] for this damage region", + "title": "Confidence" + } + }, + "required": [ + "location_on_vehicle", + "damage_types", + "severity", + "description", + "bounding_box", + "confidence" + ], + "title": "DamageRegion", + "type": "object" + }, + "ImageInfo": { + "description": "Metadata about an input image.\n\nNote: Most fields may be unknown unless provided by the caller or extracted from EXIF.", + "properties": { + "filename": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Analyzed filename of the image", + "title": "Filename" + }, + "content_type": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "MIME type, e.g. image/jpeg", + "title": "Content Type" + }, + "width": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Analyzed image width in pixels", + "title": "Width" + }, + "height": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Analyzed image height in pixels", + "title": "Height" + }, + "capture_datetime": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Capture datetime if available, e.g. 
2025-11-28T14:15:00, or the original EXIF string if unprocessed", + "title": "Capture Datetime" + } + }, + "required": [ + "filename", + "content_type", + "width", + "height", + "capture_datetime" + ], + "title": "ImageInfo", + "type": "object" + }, + "OverallDamageAssessment": { + "description": "Overall assessment across the full image.", + "properties": { + "has_visible_damage": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Whether any damage is visible", + "title": "Has Visible Damage" + }, + "overall_severity": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Overall severity label, e.g. minor, moderate, severe", + "title": "Overall Severity" + }, + "affected_parts": { + "anyOf": [ + { + "items": { + "type": "string" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "Affected parts/panels using the VEHICLE's own left/right. Side labels MUST match camera_viewpoint.view_angle.", + "title": "Affected Parts" + }, + "estimated_repair_complexity": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Rough complexity, e.g. cosmetic_only, panel_repair, replacement_likely", + "title": "Estimated Repair Complexity" + }, + "notes": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Notes or caveats, e.g. lighting/angle limitations", + "title": "Notes" + } + }, + "required": [ + "has_visible_damage", + "overall_severity", + "affected_parts", + "estimated_repair_complexity", + "notes" + ], + "title": "OverallDamageAssessment", + "type": "object" + }, + "VehicleAppearance": { + "description": "Visible vehicle identification extracted from the image.\n\nGuidance:\n- Prefer fields that can be seen. 
If uncertain, leave null.\n- Do not guess VIN from images.", + "properties": { + "vehicle_type": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle type, e.g. sedan, SUV", + "title": "Vehicle Type" + }, + "make": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle make, e.g. Toyota", + "title": "Make" + }, + "model": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle model, e.g. Camry", + "title": "Model" + }, + "trim": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle trim, e.g. SE", + "title": "Trim" + }, + "model_year": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Vehicle model year, e.g. 2022", + "title": "Model Year" + }, + "color": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle color, e.g. silver", + "title": "Color" + }, + "license_plate_visible": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Whether the license plate is visible in the image", + "title": "License Plate Visible" + }, + "license_plate_text": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "License plate text if clearly readable; otherwise null", + "title": "License Plate Text" + }, + "visible_vehicle_parts": { + "anyOf": [ + { + "items": { + "type": "string" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "List of vehicle parts/panels actually visible in this image given the camera angle, e.g. ['hood', 'front bumper', 'front-left fender', 'front-left headlight']. Only parts that can be seen should be listed. 
Left/right MUST use the VEHICLE's own frame of reference and MUST match the side in camera_viewpoint.view_angle.", + "title": "Visible Vehicle Parts" + } + }, + "required": [ + "vehicle_type", + "make", + "model", + "trim", + "model_year", + "color", + "license_plate_visible", + "license_plate_text", + "visible_vehicle_parts" + ], + "title": "VehicleAppearance", + "type": "object" + }, + "VehicleAssessment": { + "description": "Per-vehicle damage assessment extracted from an image.\n\nGroups appearance, damage regions, and overall assessment for a single\nvehicle detected in the photograph.\n\nAttributes:\n vehicle_id: Human-readable identifier distinguishing this vehicle.\n vehicle_appearance: Visible vehicle identification.\n damage_regions: Detected damage regions for this vehicle.\n overall_assessment: Overall damage assessment for this vehicle.", + "properties": { + "vehicle_id": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "A short human-readable identifier for this vehicle, e.g. 'Vehicle 1 - silver sedan (front-left)'. 
Use color, type, and position to distinguish vehicles.", + "title": "Vehicle Id" + }, + "vehicle_appearance": { + "anyOf": [ + { + "$ref": "#/$defs/VehicleAppearance" + }, + { + "type": "null" + } + ], + "description": "Visible vehicle identification for this vehicle" + }, + "damage_regions": { + "anyOf": [ + { + "items": { + "$ref": "#/$defs/DamageRegion" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "List of detected damage regions for this vehicle", + "title": "Damage Regions" + }, + "overall_assessment": { + "anyOf": [ + { + "$ref": "#/$defs/OverallDamageAssessment" + }, + { + "type": "null" + } + ], + "description": "Overall damage assessment for this vehicle" + } + }, + "required": [ + "vehicle_id", + "vehicle_appearance", + "damage_regions", + "overall_assessment" + ], + "title": "VehicleAssessment", + "type": "object" + } + }, + "description": "Schema for extracting damaged vehicle information from an image.\n\nSupports single- and multi-vehicle images. Each vehicle detected in the\nphotograph gets its own entry in the ``vehicles`` list.\n\nAttributes:\n image_info: Image metadata (shared across all vehicles).\n camera_viewpoint: Camera perspective relative to the scene.\n vehicle_count: Number of distinct vehicles detected in the image.\n vehicles: Per-vehicle assessment list.", + "properties": { + "image_info": { + "anyOf": [ + { + "$ref": "#/$defs/ImageInfo" + }, + { + "type": "null" + } + ], + "description": "Image metadata" + }, + "camera_viewpoint": { + "anyOf": [ + { + "$ref": "#/$defs/CameraViewpoint" + }, + { + "type": "null" + } + ], + "description": "Camera perspective relative to the scene. MUST be determined BEFORE labelling any damage locations so that left/right orientation is anchored to each vehicle's own frame of reference." + }, + "vehicle_count": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Number of distinct vehicles detected in the image. 
Must equal the length of the vehicles list.", + "title": "Vehicle Count" + }, + "vehicles": { + "anyOf": [ + { + "items": { + "$ref": "#/$defs/VehicleAssessment" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "Per-vehicle damage assessments. One entry per vehicle detected in the image. For single-vehicle images this list contains exactly one item.", + "title": "Vehicles" + }, + "consistency_check": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "MANDATORY self-verification. State the side from view_angle, then list every left/right label used in visible_vehicle_parts, damage_regions, and affected_parts. Confirm they ALL match the side in view_angle. If any mismatch was found and corrected, describe what was fixed.", + "title": "Consistency Check" + } + }, + "required": [ + "image_info", + "camera_viewpoint", + "vehicle_count", + "vehicles", + "consistency_check" + ], + "title": "DamagedVehicleImageAssessment", + "type": "object" +} diff --git a/src/ContentProcessorAPI/samples/schemas/policereport.json b/src/ContentProcessorAPI/samples/schemas/policereport.json new file mode 100644 index 00000000..87fc07af --- /dev/null +++ b/src/ContentProcessorAPI/samples/schemas/policereport.json @@ -0,0 +1,548 @@ +{ + "$defs": { + "PoliceReportAddress": { + "description": "A class representing an address referenced in a police report.", + "properties": { + "street": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Street address, e.g. 123 Main St.", + "title": "Street" + }, + "city": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "City, e.g. Macon", + "title": "City" + }, + "state": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "State, e.g. 
GA", + "title": "State" + }, + "postal_code": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Postal code, e.g. 31201", + "title": "Postal Code" + }, + "country": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Country, e.g. USA", + "title": "Country" + } + }, + "required": [ + "street", + "city", + "state", + "postal_code", + "country" + ], + "title": "PoliceReportAddress", + "type": "object" + }, + "PoliceReportDamageItem": { + "description": "A class representing a damage line item recorded alongside a police report.", + "properties": { + "item_description": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Damaged item/area description", + "title": "Item Description" + }, + "repair_estimate": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Repair estimate amount", + "title": "Repair Estimate" + }, + "repair_estimate_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency of repair_estimate, e.g. 
USD", + "title": "Repair Estimate Currency" + } + }, + "required": [ + "item_description", + "repair_estimate", + "repair_estimate_currency" + ], + "title": "PoliceReportDamageItem", + "type": "object" + }, + "PoliceReportDamageSummary": { + "description": "A class representing a damage summary section.", + "properties": { + "items": { + "anyOf": [ + { + "items": { + "$ref": "#/$defs/PoliceReportDamageItem" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "List of damage items", + "title": "Items" + }, + "total_estimated_repair": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Total estimated repair amount", + "title": "Total Estimated Repair" + }, + "total_estimated_repair_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency of total_estimated_repair, e.g. USD", + "title": "Total Estimated Repair Currency" + } + }, + "required": [ + "items", + "total_estimated_repair", + "total_estimated_repair_currency" + ], + "title": "PoliceReportDamageSummary", + "type": "object" + }, + "PoliceReportIncident": { + "description": "A class representing incident details in a police report.", + "properties": { + "date": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Incident date, e.g. 2025-11-28", + "title": "Date" + }, + "time": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Incident time, e.g. 
14:15", + "title": "Time" + }, + "location": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Incident location", + "title": "Location" + }, + "cause": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Cause of incident", + "title": "Cause" + }, + "narrative": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Narrative/description of what happened", + "title": "Narrative" + } + }, + "required": [ + "date", + "time", + "location", + "cause", + "narrative" + ], + "title": "PoliceReportIncident", + "type": "object" + }, + "PoliceReportVehicle": { + "description": "A class representing a vehicle referenced in a police report.", + "properties": { + "year": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Vehicle year, e.g. 2022", + "title": "Year" + }, + "make": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle make, e.g. Toyota", + "title": "Make" + }, + "model": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle model, e.g. Camry", + "title": "Model" + }, + "trim": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle trim, e.g. 
SE", + "title": "Trim" + }, + "vin": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle VIN", + "title": "Vin" + }, + "license_plate": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "License plate", + "title": "License Plate" + }, + "mileage": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Mileage", + "title": "Mileage" + } + }, + "required": [ + "year", + "make", + "model", + "trim", + "vin", + "license_plate", + "mileage" + ], + "title": "PoliceReportVehicle", + "type": "object" + }, + "ReportingParty": { + "description": "A class representing the reporting party / claimant in the police report context.", + "properties": { + "name": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Full name of reporting party", + "title": "Name" + }, + "address": { + "anyOf": [ + { + "$ref": "#/$defs/PoliceReportAddress" + }, + { + "type": "null" + } + ], + "description": "Address of reporting party" + }, + "phone": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Phone number", + "title": "Phone" + }, + "email": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Email address", + "title": "Email" + } + }, + "required": [ + "name", + "address", + "phone", + "email" + ], + "title": "ReportingParty", + "type": "object" + } + }, + "description": "A class representing a police report document attached to an auto claim.\n\nNote: The sample content includes the statement \"Police Report: Filed (Report # GA-20251128-CR)\".\nThis schema focuses on extracting the report identifier and the related incident context.", + "properties": { + "report_number": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Police report number, e.g. 
GA-20251128-CR", + "title": "Report Number" + }, + "is_filed": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Whether a police report was filed", + "title": "Is Filed" + }, + "reporting_agency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Reporting agency / department", + "title": "Reporting Agency" + }, + "insurance_company": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Insurance company name", + "title": "Insurance Company" + }, + "claim_number": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Claim number", + "title": "Claim Number" + }, + "policy_number": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Policy number", + "title": "Policy Number" + }, + "reporting_party": { + "anyOf": [ + { + "$ref": "#/$defs/ReportingParty" + }, + { + "type": "null" + } + ], + "description": "Reporting party information" + }, + "incident": { + "anyOf": [ + { + "$ref": "#/$defs/PoliceReportIncident" + }, + { + "type": "null" + } + ], + "description": "Incident details" + }, + "vehicles": { + "anyOf": [ + { + "items": { + "$ref": "#/$defs/PoliceReportVehicle" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "Vehicles involved", + "title": "Vehicles" + }, + "damage_summary": { + "anyOf": [ + { + "$ref": "#/$defs/PoliceReportDamageSummary" + }, + { + "type": "null" + } + ], + "description": "Damage summary" + } + }, + "required": [ + "report_number", + "is_filed", + "reporting_agency", + "insurance_company", + "claim_number", + "policy_number", + "reporting_party", + "incident", + "vehicles", + "damage_summary" + ], + "title": "PoliceReportDocument", + "type": "object" +} diff --git a/src/ContentProcessorAPI/samples/schemas/register_schema.py b/src/ContentProcessorAPI/samples/schemas/register_schema.py index ecd015c9..2996f9d5 100644 --- 
a/src/ContentProcessorAPI/samples/schemas/register_schema.py +++ b/src/ContentProcessorAPI/samples/schemas/register_schema.py @@ -17,7 +17,8 @@ Manifest format (see schema_info.json): { "schemas": [ - { "File": "autoclaim.py", "ClassName": "...", "Description": "..." }, + { "File": "autoclaim.py", "ClassName": "...", "Description": "..." }, + { "File": "invoice.json", "ClassName": "...", "Description": "..." }, ... ], "schemaset": { @@ -25,6 +26,9 @@ "Description": "Claim schema set for auto claims processing" } } + +Both ``.py`` (legacy executable Python class) and ``.json`` (declarative +JSON Schema, recommended) files are accepted in the ``File`` field. """ from __future__ import annotations @@ -75,11 +79,26 @@ def _register_schema( print(f" Description: {existing.get('Description')}") return schema_id - print(f"Registering new schema '{class_name}'...") + # Pick the right MIME type based on the file extension. Both ``.py`` + # (legacy executable Python class) and ``.json`` (declarative JSON + # Schema) are accepted by ``POST /schemavault/``. + extension = schema_path.suffix.lower() + if extension == ".json": + content_type = "application/json" + elif extension == ".py": + content_type = "text/x-python" + else: + print( + f"Error: Unsupported schema extension '{extension}' for " + f"'{schema_path.name}'. Expected .py or .json. Skipping..." 
+ ) + return None + + print(f"Registering new schema '{class_name}' ({extension})...") data_payload = json.dumps({"ClassName": class_name, "Description": description}) with open(schema_path, "rb") as f: - files = {"file": (schema_path.name, f, "text/x-python")} + files = {"file": (schema_path.name, f, content_type)} data = {"data": data_payload} resp = requests.post(schemavault_url, files=files, data=data, timeout=60) diff --git a/src/ContentProcessorAPI/samples/schemas/repairestimate.json b/src/ContentProcessorAPI/samples/schemas/repairestimate.json new file mode 100644 index 00000000..5874a862 --- /dev/null +++ b/src/ContentProcessorAPI/samples/schemas/repairestimate.json @@ -0,0 +1,596 @@ +{ + "$defs": { + "RepairEstimateLineItem": { + "description": "A class representing a repair estimate line item.", + "properties": { + "service_description": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Service description, e.g. Dent repair (quarter panel)", + "title": "Service Description" + }, + "labor_hours": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Labor hours, e.g. 2.0", + "title": "Labor Hours" + }, + "rate_per_hour": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Labor rate per hour, e.g. 75.0", + "title": "Rate Per Hour" + }, + "rate_per_hour_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency for rate_per_hour, e.g. USD", + "title": "Rate Per Hour Currency" + }, + "parts_cost": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Parts cost, e.g. 150.0", + "title": "Parts Cost" + }, + "parts_cost_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency for parts_cost, e.g. 
USD", + "title": "Parts Cost Currency" + }, + "materials_cost": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Materials/supplies cost, e.g. 50.0", + "title": "Materials Cost" + }, + "materials_cost_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency for materials_cost, e.g. USD", + "title": "Materials Cost Currency" + }, + "total": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Line total amount", + "title": "Total" + }, + "total_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency for total, e.g. USD", + "title": "Total Currency" + } + }, + "required": [ + "service_description", + "labor_hours", + "rate_per_hour", + "rate_per_hour_currency", + "parts_cost", + "parts_cost_currency", + "materials_cost", + "materials_cost_currency", + "total", + "total_currency" + ], + "title": "RepairEstimateLineItem", + "type": "object" + }, + "RepairEstimateVehicle": { + "description": "A class representing the customer vehicle on a repair estimate.", + "properties": { + "year": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Vehicle year, e.g. 2022", + "title": "Year" + }, + "make": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle make, e.g. Toyota", + "title": "Make" + }, + "model": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle model, e.g. Camry", + "title": "Model" + }, + "trim": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle trim, e.g. SE", + "title": "Trim" + }, + "vin": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle VIN, e.g. 
4T1G11AK2NU123456", + "title": "Vin" + }, + "license_plate": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "License plate, e.g. GA-ABC123", + "title": "License Plate" + } + }, + "required": [ + "year", + "make", + "model", + "trim", + "vin", + "license_plate" + ], + "title": "RepairEstimateVehicle", + "type": "object" + }, + "RepairShopAddress": { + "description": "A class representing an auto body shop address.", + "properties": { + "street": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Street address, e.g. 456 Repair Lane", + "title": "Street" + }, + "city": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "City, e.g. Macon", + "title": "City" + }, + "state": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "State, e.g. GA", + "title": "State" + }, + "postal_code": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Postal code, e.g. 31201", + "title": "Postal Code" + }, + "country": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Country, e.g. USA", + "title": "Country" + } + }, + "required": [ + "street", + "city", + "state", + "postal_code", + "country" + ], + "title": "RepairShopAddress", + "type": "object" + }, + "Signature": { + "description": "A class representing an authorized signature field.", + "properties": { + "signatory": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Name of the signatory", + "title": "Signatory" + }, + "is_signed": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Indicates if the document is signed. GPT should check whether it has signature in image files. 
if there is Sign, fill it up as True", + "title": "Is Signed" + } + }, + "required": [ + "signatory", + "is_signed" + ], + "title": "Signature", + "type": "object" + } + }, + "description": "A class representing an auto body shop repair estimate document.", + "properties": { + "estimate_number": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Estimate number, e.g. EST-20251130", + "title": "Estimate Number" + }, + "date": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Estimate date, e.g. 2025-11-30", + "title": "Date" + }, + "prepared_by": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Prepared by / shop name, e.g. Macon Auto Body & Paint", + "title": "Prepared By" + }, + "shop_address": { + "anyOf": [ + { + "$ref": "#/$defs/RepairShopAddress" + }, + { + "type": "null" + } + ], + "description": "Shop address" + }, + "shop_phone": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Shop phone number", + "title": "Shop Phone" + }, + "customer_name": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Customer name, e.g. 
Chad Brooks", + "title": "Customer Name" + }, + "vehicle": { + "anyOf": [ + { + "$ref": "#/$defs/RepairEstimateVehicle" + }, + { + "type": "null" + } + ], + "description": "Vehicle information" + }, + "damage_description": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Damage description / narrative", + "title": "Damage Description" + }, + "repair_details": { + "anyOf": [ + { + "items": { + "$ref": "#/$defs/RepairEstimateLineItem" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "Repair detail line items", + "title": "Repair Details" + }, + "subtotal": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Subtotal amount", + "title": "Subtotal" + }, + "subtotal_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency for subtotal, e.g. USD", + "title": "Subtotal Currency" + }, + "tax_rate": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Tax rate, e.g. 7%", + "title": "Tax Rate" + }, + "tax_amount": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Tax amount, e.g. 24.50", + "title": "Tax Amount" + }, + "tax_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency for tax_amount, e.g. USD", + "title": "Tax Currency" + }, + "total_estimate": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Total estimate amount, e.g. 374.50", + "title": "Total Estimate" + }, + "total_estimate_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency for total_estimate, e.g. 
USD", + "title": "Total Estimate Currency" + }, + "notes": { + "anyOf": [ + { + "items": { + "type": "string" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "Notes on the estimate", + "title": "Notes" + }, + "authorized_signature": { + "anyOf": [ + { + "$ref": "#/$defs/Signature" + }, + { + "type": "null" + } + ], + "description": "Authorized signature" + }, + "authorized_signature_date": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Signature date, e.g. 2025-11-30", + "title": "Authorized Signature Date" + } + }, + "required": [ + "estimate_number", + "date", + "prepared_by", + "shop_address", + "shop_phone", + "customer_name", + "vehicle", + "damage_description", + "repair_details", + "subtotal", + "subtotal_currency", + "tax_rate", + "tax_amount", + "tax_currency", + "total_estimate", + "total_estimate_currency", + "notes", + "authorized_signature", + "authorized_signature_date" + ], + "title": "RepairEstimateDocument", + "type": "object" +} From 1fb797f402b3d9985dbcb2a163b651d8fbbbfc6b Mon Sep 17 00:00:00 2001 From: JSON Schema Migration Date: Tue, 28 Apr 2026 10:16:18 +0530 Subject: [PATCH 03/13] feat(schemavault): switch default samples and deployment scripts to .json schemas - schema_info.json manifest now lists *.json files (was *.py). - post_deployment.sh and post_deployment.ps1 derive multipart Content-Type per file extension (.json -> application/json, .py -> text/x-python). - test_http/schema_API.http examples updated to upload .json samples. - docs/CustomizeSchemaData.md sample table, mermaid diagram, and manifest example refer to .json files. - register_schema.py docstring example updated. Legacy .py uploads still work end-to-end; the change just flips the default authored format. 
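The per-extension Content-Type selection this patch adds to register_schema.py and both post-deployment scripts reduces to a small mapping. A minimal Python sketch (the `multipart_fields` helper is hypothetical — the real scripts inline this logic):

```python
import json
from pathlib import Path

# Extension -> multipart Content-Type accepted by POST /schemavault/
# (.json is the recommended format, .py the legacy one).
CONTENT_TYPES = {".json": "application/json", ".py": "text/x-python"}


def multipart_fields(schema_path, class_name, description):
    """Build the (files, data) kwargs for the upload, or None to skip."""
    path = Path(schema_path)
    content_type = CONTENT_TYPES.get(path.suffix.lower())
    if content_type is None:
        # Unsupported extension: the caller skips this schema entirely.
        return None
    files = {"file": (path.name, path.read_bytes(), content_type)}
    data = {"data": json.dumps({"ClassName": class_name, "Description": description})}
    return files, data
```

The return value plugs straight into `requests.post(url, files=files, data=data)`, matching the multipart shape the deployment scripts emit.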
--- docs/CustomizeSchemaData.md | 22 +++++++++---------- infra/scripts/post_deployment.ps1 | 14 +++++++++++- infra/scripts/post_deployment.sh | 14 +++++++++++- .../samples/schemas/register_schema.py | 4 ++-- .../samples/schemas/schema_info.json | 8 +++---- .../test_http/schema_API.http | 16 +++++++------- 6 files changed, 51 insertions(+), 27 deletions(-) diff --git a/docs/CustomizeSchemaData.md b/docs/CustomizeSchemaData.md index be24ff20..41925d1f 100644 --- a/docs/CustomizeSchemaData.md +++ b/docs/CustomizeSchemaData.md @@ -15,9 +15,9 @@ Before processing documents, schemas must be **registered** in the system and gr ```mermaid flowchart TB subgraph Step1["Step 1: Register Schemas (one per document type)
POST /schemavault/ × N"] - S1["🗎 AutoInsuranceClaimForm
autoclaim.py
Schema ID: abc123"] - S2["🗎 PoliceReportDocument
policereport.py
Schema ID: def456"] - S3["🗎 RepairEstimateDocument
repairestimate.py
Schema ID: ghi789"] + S1["🗎 AutoInsuranceClaimForm
autoclaim.json
Schema ID: abc123"] + S2["🗎 PoliceReportDocument
policereport.json
Schema ID: def456"] + S3["🗎 RepairEstimateDocument
repairestimate.json
Schema ID: ghi789"] S4["🗎 ...
more schemas"] end @@ -79,10 +79,10 @@ A new class needs to be created that defines the schema as a strongly typed Pyth | Schema | File | Class Name | Auto-registered | | ------------------------- | --------------------------------------------------------------------------------- | ------------------------------- | --------------- | -| Auto Insurance Claim Form | [autoclaim.py](/src/ContentProcessorAPI/samples/schemas/autoclaim.py) | `AutoInsuranceClaimForm` | ✅ | -| Police Report | [policereport.py](/src/ContentProcessorAPI/samples/schemas/policereport.py) | `PoliceReportDocument` | ✅ | -| Repair Estimate | [repairestimate.py](/src/ContentProcessorAPI/samples/schemas/repairestimate.py) | `RepairEstimateDocument` | ✅ | -| Damaged Vehicle Image | [damagedcarimage.py](/src/ContentProcessorAPI/samples/schemas/damagedcarimage.py) | `DamagedVehicleImageAssessment` | ✅ | +| Auto Insurance Claim Form | [autoclaim.json](/src/ContentProcessorAPI/samples/schemas/autoclaim.json) | `AutoInsuranceClaimForm` | ✅ | +| Police Report | [policereport.json](/src/ContentProcessorAPI/samples/schemas/policereport.json) | `PoliceReportDocument` | ✅ | +| Repair Estimate | [repairestimate.json](/src/ContentProcessorAPI/samples/schemas/repairestimate.json) | `RepairEstimateDocument` | ✅ | +| Damaged Vehicle Image | [damagedcarimage.json](/src/ContentProcessorAPI/samples/schemas/damagedcarimage.json) | `DamagedVehicleImageAssessment` | ✅ | > **Note:** All 4 schemas are automatically registered during deployment (via `azd up` or the `register_schema.py` script) and grouped into the **"Auto Claim"** schema set. @@ -177,10 +177,10 @@ For bulk registration, use the provided script with a JSON manifest. 
The script ```json { "schemas": [ - { "File": "autoclaim.py", "ClassName": "AutoInsuranceClaimForm", "Description": "Auto Insurance Claim Form" }, - { "File": "damagedcarimage.py", "ClassName": "DamagedVehicleImageAssessment","Description": "Damaged Vehicle Image Assessment" }, - { "File": "policereport.py", "ClassName": "PoliceReportDocument", "Description": "Police Report Document" }, - { "File": "repairestimate.py", "ClassName": "RepairEstimateDocument", "Description": "Repair Estimate Document" } + { "File": "autoclaim.json", "ClassName": "AutoInsuranceClaimForm", "Description": "Auto Insurance Claim Form" }, + { "File": "damagedcarimage.json", "ClassName": "DamagedVehicleImageAssessment","Description": "Damaged Vehicle Image Assessment" }, + { "File": "policereport.json", "ClassName": "PoliceReportDocument", "Description": "Police Report Document" }, + { "File": "repairestimate.json", "ClassName": "RepairEstimateDocument", "Description": "Repair Estimate Document" } ], "schemaset": { "Name": "Auto Claim", diff --git a/infra/scripts/post_deployment.ps1 b/infra/scripts/post_deployment.ps1 index 04104a50..bcf2ac4a 100644 --- a/infra/scripts/post_deployment.ps1 +++ b/infra/scripts/post_deployment.ps1 @@ -124,6 +124,18 @@ if (-not $ApiReady) { Write-Host " Registering new schema '$ClassName'..." + # Pick MIME type by extension. Both .json (recommended) and .py + # (legacy) are accepted by the API. + $extension = [System.IO.Path]::GetExtension($SchemaFile).ToLowerInvariant() + switch ($extension) { + '.json' { $contentType = 'application/json' } + '.py' { $contentType = 'text/x-python' } + default { + Write-Host " Unsupported schema extension '$extension' for '$SchemaFile'. Skipping..." 
+ continue + } + } + # Build multipart form data $dataPayload = @{ ClassName = $ClassName; Description = $Description } | ConvertTo-Json -Compress $fileBytes = [System.IO.File]::ReadAllBytes($SchemaFile) @@ -137,7 +149,7 @@ if (-not $ApiReady) { $dataPayload, "--$boundary", "Content-Disposition: form-data; name=`"file`"; filename=`"$fileName`"", - "Content-Type: text/x-python$LF", + "Content-Type: $contentType$LF", [System.Text.Encoding]::UTF8.GetString($fileBytes), "--$boundary--$LF" ) -join $LF diff --git a/infra/scripts/post_deployment.sh b/infra/scripts/post_deployment.sh index 2b0ee0ad..66f0180a 100644 --- a/infra/scripts/post_deployment.sh +++ b/infra/scripts/post_deployment.sh @@ -136,10 +136,22 @@ else echo " Registering new schema '$CLASS_NAME'..." DATA_PAYLOAD="{\"ClassName\": \"$CLASS_NAME\", \"Description\": \"$DESCRIPTION\"}" + # Pick MIME type by extension. Both .json (recommended) and .py (legacy) + # are accepted by the API. + EXT="${FILE_NAME##*.}" + case "$EXT" in + json) CONTENT_TYPE="application/json" ;; + py) CONTENT_TYPE="text/x-python" ;; + *) + echo " Unsupported schema extension '.$EXT' for '$FILE_NAME'. Skipping..." + continue + ;; + esac + RESPONSE=$(curl -s -w "\n%{http_code}" \ -X POST "$SCHEMAVAULT_URL" \ -F "data=$DATA_PAYLOAD" \ - -F "file=@$SCHEMA_FILE;type=text/x-python" \ + -F "file=@$SCHEMA_FILE;type=$CONTENT_TYPE" \ --connect-timeout 60) HTTP_CODE=$(echo "$RESPONSE" | tail -1) diff --git a/src/ContentProcessorAPI/samples/schemas/register_schema.py b/src/ContentProcessorAPI/samples/schemas/register_schema.py index 2996f9d5..1b3d570a 100644 --- a/src/ContentProcessorAPI/samples/schemas/register_schema.py +++ b/src/ContentProcessorAPI/samples/schemas/register_schema.py @@ -17,8 +17,8 @@ Manifest format (see schema_info.json): { "schemas": [ - { "File": "autoclaim.py", "ClassName": "...", "Description": "..." }, - { "File": "invoice.json", "ClassName": "...", "Description": "..." 
}, + { "File": "autoclaim.json", "ClassName": "...", "Description": "..." }, + { "File": "legacy.py", "ClassName": "...", "Description": "..." }, ... ], "schemaset": { diff --git a/src/ContentProcessorAPI/samples/schemas/schema_info.json b/src/ContentProcessorAPI/samples/schemas/schema_info.json index f4667e15..d1cbad0d 100644 --- a/src/ContentProcessorAPI/samples/schemas/schema_info.json +++ b/src/ContentProcessorAPI/samples/schemas/schema_info.json @@ -1,22 +1,22 @@ { "schemas": [ { - "File": "autoclaim.py", + "File": "autoclaim.json", "ClassName": "AutoInsuranceClaimForm", "Description": "Auto Insurance Claim Form" }, { - "File": "damagedcarimage.py", + "File": "damagedcarimage.json", "ClassName": "DamagedVehicleImageAssessment", "Description": "Damaged Vehicle Image Assessment" }, { - "File": "policereport.py", + "File": "policereport.json", "ClassName": "PoliceReportDocument", "Description": "Police Report Document" }, { - "File": "repairestimate.py", + "File": "repairestimate.json", "ClassName": "RepairEstimateDocument", "Description": "Repair Estimate Document" } diff --git a/src/ContentProcessorAPI/test_http/schema_API.http b/src/ContentProcessorAPI/test_http/schema_API.http index 3efd9b60..169f566c 100644 --- a/src/ContentProcessorAPI/test_http/schema_API.http +++ b/src/ContentProcessorAPI/test_http/schema_API.http @@ -6,10 +6,10 @@ # @name listSchemas GET {{baseUrl}}{{schemavault}}/ -### Register a schema (.py) into the vault +### Register a schema (.json) into the vault # Sends multipart/form-data with fields: # - data: JSON string { ClassName, Description } -# - file: schema file +# - file: schema file (.json recommended; .py still accepted for legacy) # # @name registerSchema POST {{baseUrl}}{{schemavault}}/ @@ -24,10 +24,10 @@ Content-Type: application/json "Description": "Uploaded from VS Code REST Client" } ------schema-boundary -Content-Disposition: form-data; name="file"; filename="autoclaim.py" -Content-Type: text/x-python +Content-Disposition: 
form-data; name="file"; filename="autoclaim.json" +Content-Type: application/json -< ../samples/schemas/autoclaim.py +< ../samples/schemas/autoclaim.json ------schema-boundary-- ### Update an existing schema (re-upload file + new ClassName) @@ -46,10 +46,10 @@ Content-Type: application/json "ClassName": "DamagedVehicleImageAssessment" } ------schema-boundary -Content-Disposition: form-data; name="file"; filename="damagedcarimage.py" -Content-Type: text/x-python +Content-Disposition: form-data; name="file"; filename="damagedcarimage.json" +Content-Type: application/json -< ../samples/schemas/damagedcarimage.py +< ../samples/schemas/damagedcarimage.json ------schema-boundary-- ### Download the registered schema file From 6cb8cf929b00908ce79634abbd1222c4675a3ed1 Mon Sep 17 00:00:00 2001 From: JSON Schema Migration Date: Tue, 28 Apr 2026 12:08:14 +0530 Subject: [PATCH 04/13] fix(deps): add jsonschema==4.25.1 to pyproject.toml and refresh uv.lock Container image was failing at import time with ModuleNotFoundError: 'jsonschema'. The Dockerfile installs from uv.lock via 'uv sync --frozen', so requirements.txt alone was not enough; the dep had to land in pyproject.toml + uv.lock. ContentProcessorAPI: adds jsonschema (+ specifications, referencing, rpds-py). ContentProcessor: pins jsonschema to 4.25.1 (was a 4.26.0 transitive). 
--- src/ContentProcessor/pyproject.toml | 1 + src/ContentProcessor/uv.lock | 8 +- src/ContentProcessorAPI/pyproject.toml | 1 + src/ContentProcessorAPI/uv.lock | 124 +++++++++++++++++++++++++ 4 files changed, 131 insertions(+), 3 deletions(-) diff --git a/src/ContentProcessor/pyproject.toml b/src/ContentProcessor/pyproject.toml index 1c075619..2a3745e0 100644 --- a/src/ContentProcessor/pyproject.toml +++ b/src/ContentProcessor/pyproject.toml @@ -25,6 +25,7 @@ dependencies = [ "protobuf==6.33.6", "pyjwt==2.12.1", "pyasn1==0.6.3", + "jsonschema==4.25.1", ] diff --git a/src/ContentProcessor/uv.lock b/src/ContentProcessor/uv.lock index f82c2376..5ccff485 100644 --- a/src/ContentProcessor/uv.lock +++ b/src/ContentProcessor/uv.lock @@ -852,6 +852,7 @@ dependencies = [ { name = "azure-storage-queue" }, { name = "certifi" }, { name = "charset-normalizer" }, + { name = "jsonschema" }, { name = "opentelemetry-api" }, { name = "pandas" }, { name = "pdf2image" }, @@ -888,6 +889,7 @@ requires-dist = [ { name = "azure-storage-queue", specifier = "==12.16.0b1" }, { name = "certifi", specifier = "==2026.1.4" }, { name = "charset-normalizer", specifier = "==3.4.4" }, + { name = "jsonschema", specifier = "==4.25.1" }, { name = "opentelemetry-api", specifier = "==1.39.1" }, { name = "pandas", specifier = "==3.0.0" }, { name = "pdf2image", specifier = "==1.17.0" }, @@ -1565,7 +1567,7 @@ wheels = [ [[package]] name = "jsonschema" -version = "4.26.0" +version = "4.25.1" source = { registry = "https://pypi.org/simple" } dependencies = [ { name = "attrs" }, @@ -1573,9 +1575,9 @@ dependencies = [ { name = "referencing" }, { name = "rpds-py" }, ] -sdist = { url = "https://files.pythonhosted.org/packages/b3/fc/e067678238fa451312d4c62bf6e6cf5ec56375422aee02f9cb5f909b3047/jsonschema-4.26.0.tar.gz", hash = "sha256:0c26707e2efad8aa1bfc5b7ce170f3fccc2e4918ff85989ba9ffa9facb2be326", size = 366583, upload-time = "2026-01-07T13:41:07.246Z" } +sdist = { url = 
"https://files.pythonhosted.org/packages/74/69/f7185de793a29082a9f3c7728268ffb31cb5095131a9c139a74078e27336/jsonschema-4.25.1.tar.gz", hash = "sha256:e4a9655ce0da0c0b67a085847e00a3a51449e1157f4f75e9fb5aa545e122eb85", size = 357342, upload-time = "2025-08-18T17:03:50.038Z" } wheels = [ - { url = "https://files.pythonhosted.org/packages/69/90/f63fb5873511e014207a475e2bb4e8b2e570d655b00ac19a9a0ca0a385ee/jsonschema-4.26.0-py3-none-any.whl", hash = "sha256:d489f15263b8d200f8387e64b4c3a75f06629559fb73deb8fdfb525f2dab50ce", size = 90630, upload-time = "2026-01-07T13:41:05.306Z" }, + { url = "https://files.pythonhosted.org/packages/bf/9c/8c95d856233c1f82500c2450b8c68576b4cf1c871db3afac5c34ff84e6fd/jsonschema-4.25.1-py3-none-any.whl", hash = "sha256:3fba0169e345c7175110351d456342c364814cfcf3b964ba4587f22915230a63", size = 90040, upload-time = "2025-08-18T17:03:48.373Z" }, ] [[package]] diff --git a/src/ContentProcessorAPI/pyproject.toml b/src/ContentProcessorAPI/pyproject.toml index 87c586fe..c52fea35 100644 --- a/src/ContentProcessorAPI/pyproject.toml +++ b/src/ContentProcessorAPI/pyproject.toml @@ -26,6 +26,7 @@ dependencies = [ "sas-cosmosdb==0.1.4", "cryptography==46.0.7", "pyjwt==2.12.0", + "jsonschema==4.25.1", ] [dependency-groups] diff --git a/src/ContentProcessorAPI/uv.lock b/src/ContentProcessorAPI/uv.lock index 96a0152c..390dfa79 100644 --- a/src/ContentProcessorAPI/uv.lock +++ b/src/ContentProcessorAPI/uv.lock @@ -433,6 +433,7 @@ dependencies = [ { name = "cryptography" }, { name = "fastapi", extra = ["standard"] }, { name = "h11" }, + { name = "jsonschema" }, { name = "opentelemetry-api" }, { name = "poppler-utils" }, { name = "pydantic" }, @@ -470,6 +471,7 @@ requires-dist = [ { name = "cryptography", specifier = "==46.0.7" }, { name = "fastapi", extras = ["standard"], specifier = "==0.128.0" }, { name = "h11", specifier = "==0.16.0" }, + { name = "jsonschema", specifier = "==4.25.1" }, { name = "opentelemetry-api", specifier = "==1.39.1" }, { name = 
"poppler-utils", specifier = "==0.1.0" }, { name = "pydantic", specifier = "==2.12.5" }, @@ -1076,6 +1078,33 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/62/a1/3d680cbfd5f4b8f15abc1d571870c5fc3e594bb582bc3b64ea099db13e56/jinja2-3.1.6-py3-none-any.whl", hash = "sha256:85ece4451f492d0c13c5dd7c13a64681a86afae63a5f347908daf103ce6d2f67", size = 134899, upload-time = "2025-03-05T20:05:00.369Z" }, ] +[[package]] +name = "jsonschema" +version = "4.25.1" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "attrs" }, + { name = "jsonschema-specifications" }, + { name = "referencing" }, + { name = "rpds-py" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/74/69/f7185de793a29082a9f3c7728268ffb31cb5095131a9c139a74078e27336/jsonschema-4.25.1.tar.gz", hash = "sha256:e4a9655ce0da0c0b67a085847e00a3a51449e1157f4f75e9fb5aa545e122eb85", size = 357342, upload-time = "2025-08-18T17:03:50.038Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/bf/9c/8c95d856233c1f82500c2450b8c68576b4cf1c871db3afac5c34ff84e6fd/jsonschema-4.25.1-py3-none-any.whl", hash = "sha256:3fba0169e345c7175110351d456342c364814cfcf3b964ba4587f22915230a63", size = 90040, upload-time = "2025-08-18T17:03:48.373Z" }, +] + +[[package]] +name = "jsonschema-specifications" +version = "2025.9.1" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "referencing" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/19/74/a633ee74eb36c44aa6d1095e7cc5569bebf04342ee146178e2d36600708b/jsonschema_specifications-2025.9.1.tar.gz", hash = "sha256:b540987f239e745613c7a9176f3edb72b832a4ac465cf02712288397832b5e8d", size = 32855, upload-time = "2025-09-08T01:34:59.186Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/41/45/1a4ed80516f02155c51f51e8cedb3c1902296743db0bbc66608a0db2814f/jsonschema_specifications-2025.9.1-py3-none-any.whl", hash = 
"sha256:98802fee3a11ee76ecaca44429fda8a41bff98b00a0f2838151b113f210cc6fe", size = 18437, upload-time = "2025-09-08T01:34:57.871Z" }, +] + [[package]] name = "keyring" version = "25.7.0" @@ -1895,6 +1924,20 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/e1/67/921ec3024056483db83953ae8e48079ad62b92db7880013ca77632921dd0/readme_renderer-44.0-py3-none-any.whl", hash = "sha256:2fbca89b81a08526aadf1357a8c2ae889ec05fb03f5da67f9769c9a592166151", size = 13310, upload-time = "2024-07-08T15:00:56.577Z" }, ] +[[package]] +name = "referencing" +version = "0.37.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "attrs" }, + { name = "rpds-py" }, + { name = "typing-extensions", marker = "python_full_version < '3.13'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/22/f5/df4e9027acead3ecc63e50fe1e36aca1523e1719559c499951bb4b53188f/referencing-0.37.0.tar.gz", hash = "sha256:44aefc3142c5b842538163acb373e24cce6632bd54bdb01b21ad5863489f50d8", size = 78036, upload-time = "2025-10-13T15:30:48.871Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/2c/58/ca301544e1fa93ed4f80d724bf5b194f6e4b945841c5bfd555878eea9fcb/referencing-0.37.0-py3-none-any.whl", hash = "sha256:381329a9f99628c9069361716891d34ad94af76e461dcb0335825aecc7692231", size = 26766, upload-time = "2025-10-13T15:30:47.625Z" }, +] + [[package]] name = "requests" version = "2.32.5" @@ -2026,6 +2069,87 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/79/62/b88e5879512c55b8ee979c666ee6902adc4ed05007226de266410ae27965/rignore-0.7.6-cp314-cp314t-win_arm64.whl", hash = "sha256:b83adabeb3e8cf662cabe1931b83e165b88c526fa6af6b3aa90429686e474896", size = 656035, upload-time = "2025-11-05T21:41:31.13Z" }, ] +[[package]] +name = "rpds-py" +version = "0.30.0" +source = { registry = "https://pypi.org/simple" } +sdist = { url = 
"https://files.pythonhosted.org/packages/20/af/3f2f423103f1113b36230496629986e0ef7e199d2aa8392452b484b38ced/rpds_py-0.30.0.tar.gz", hash = "sha256:dd8ff7cf90014af0c0f787eea34794ebf6415242ee1d6fa91eaba725cc441e84", size = 69469, upload-time = "2025-11-30T20:24:38.837Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/03/e7/98a2f4ac921d82f33e03f3835f5bf3a4a40aa1bfdc57975e74a97b2b4bdd/rpds_py-0.30.0-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:a161f20d9a43006833cd7068375a94d035714d73a172b681d8881820600abfad", size = 375086, upload-time = "2025-11-30T20:22:17.93Z" }, + { url = "https://files.pythonhosted.org/packages/4d/a1/bca7fd3d452b272e13335db8d6b0b3ecde0f90ad6f16f3328c6fb150c889/rpds_py-0.30.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:6abc8880d9d036ecaafe709079969f56e876fcf107f7a8e9920ba6d5a3878d05", size = 359053, upload-time = "2025-11-30T20:22:19.297Z" }, + { url = "https://files.pythonhosted.org/packages/65/1c/ae157e83a6357eceff62ba7e52113e3ec4834a84cfe07fa4b0757a7d105f/rpds_py-0.30.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ca28829ae5f5d569bb62a79512c842a03a12576375d5ece7d2cadf8abe96ec28", size = 390763, upload-time = "2025-11-30T20:22:21.661Z" }, + { url = "https://files.pythonhosted.org/packages/d4/36/eb2eb8515e2ad24c0bd43c3ee9cd74c33f7ca6430755ccdb240fd3144c44/rpds_py-0.30.0-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:a1010ed9524c73b94d15919ca4d41d8780980e1765babf85f9a2f90d247153dd", size = 408951, upload-time = "2025-11-30T20:22:23.408Z" }, + { url = "https://files.pythonhosted.org/packages/d6/65/ad8dc1784a331fabbd740ef6f71ce2198c7ed0890dab595adb9ea2d775a1/rpds_py-0.30.0-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:f8d1736cfb49381ba528cd5baa46f82fdc65c06e843dab24dd70b63d09121b3f", size = 514622, upload-time = "2025-11-30T20:22:25.16Z" }, + { url = 
"https://files.pythonhosted.org/packages/63/8e/0cfa7ae158e15e143fe03993b5bcd743a59f541f5952e1546b1ac1b5fd45/rpds_py-0.30.0-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:d948b135c4693daff7bc2dcfc4ec57237a29bd37e60c2fabf5aff2bbacf3e2f1", size = 414492, upload-time = "2025-11-30T20:22:26.505Z" }, + { url = "https://files.pythonhosted.org/packages/60/1b/6f8f29f3f995c7ffdde46a626ddccd7c63aefc0efae881dc13b6e5d5bb16/rpds_py-0.30.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:47f236970bccb2233267d89173d3ad2703cd36a0e2a6e92d0560d333871a3d23", size = 394080, upload-time = "2025-11-30T20:22:27.934Z" }, + { url = "https://files.pythonhosted.org/packages/6d/d5/a266341051a7a3ca2f4b750a3aa4abc986378431fc2da508c5034d081b70/rpds_py-0.30.0-cp312-cp312-manylinux_2_31_riscv64.whl", hash = "sha256:2e6ecb5a5bcacf59c3f912155044479af1d0b6681280048b338b28e364aca1f6", size = 408680, upload-time = "2025-11-30T20:22:29.341Z" }, + { url = "https://files.pythonhosted.org/packages/10/3b/71b725851df9ab7a7a4e33cf36d241933da66040d195a84781f49c50490c/rpds_py-0.30.0-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:a8fa71a2e078c527c3e9dc9fc5a98c9db40bcc8a92b4e8858e36d329f8684b51", size = 423589, upload-time = "2025-11-30T20:22:31.469Z" }, + { url = "https://files.pythonhosted.org/packages/00/2b/e59e58c544dc9bd8bd8384ecdb8ea91f6727f0e37a7131baeff8d6f51661/rpds_py-0.30.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:73c67f2db7bc334e518d097c6d1e6fed021bbc9b7d678d6cc433478365d1d5f5", size = 573289, upload-time = "2025-11-30T20:22:32.997Z" }, + { url = "https://files.pythonhosted.org/packages/da/3e/a18e6f5b460893172a7d6a680e86d3b6bc87a54c1f0b03446a3c8c7b588f/rpds_py-0.30.0-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:5ba103fb455be00f3b1c2076c9d4264bfcb037c976167a6047ed82f23153f02e", size = 599737, upload-time = "2025-11-30T20:22:34.419Z" }, + { url = 
"https://files.pythonhosted.org/packages/5c/e2/714694e4b87b85a18e2c243614974413c60aa107fd815b8cbc42b873d1d7/rpds_py-0.30.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:7cee9c752c0364588353e627da8a7e808a66873672bcb5f52890c33fd965b394", size = 563120, upload-time = "2025-11-30T20:22:35.903Z" }, + { url = "https://files.pythonhosted.org/packages/6f/ab/d5d5e3bcedb0a77f4f613706b750e50a5a3ba1c15ccd3665ecc636c968fd/rpds_py-0.30.0-cp312-cp312-win32.whl", hash = "sha256:1ab5b83dbcf55acc8b08fc62b796ef672c457b17dbd7820a11d6c52c06839bdf", size = 223782, upload-time = "2025-11-30T20:22:37.271Z" }, + { url = "https://files.pythonhosted.org/packages/39/3b/f786af9957306fdc38a74cef405b7b93180f481fb48453a114bb6465744a/rpds_py-0.30.0-cp312-cp312-win_amd64.whl", hash = "sha256:a090322ca841abd453d43456ac34db46e8b05fd9b3b4ac0c78bcde8b089f959b", size = 240463, upload-time = "2025-11-30T20:22:39.021Z" }, + { url = "https://files.pythonhosted.org/packages/f3/d2/b91dc748126c1559042cfe41990deb92c4ee3e2b415f6b5234969ffaf0cc/rpds_py-0.30.0-cp312-cp312-win_arm64.whl", hash = "sha256:669b1805bd639dd2989b281be2cfd951c6121b65e729d9b843e9639ef1fd555e", size = 230868, upload-time = "2025-11-30T20:22:40.493Z" }, + { url = "https://files.pythonhosted.org/packages/ed/dc/d61221eb88ff410de3c49143407f6f3147acf2538c86f2ab7ce65ae7d5f9/rpds_py-0.30.0-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:f83424d738204d9770830d35290ff3273fbb02b41f919870479fab14b9d303b2", size = 374887, upload-time = "2025-11-30T20:22:41.812Z" }, + { url = "https://files.pythonhosted.org/packages/fd/32/55fb50ae104061dbc564ef15cc43c013dc4a9f4527a1f4d99baddf56fe5f/rpds_py-0.30.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:e7536cd91353c5273434b4e003cbda89034d67e7710eab8761fd918ec6c69cf8", size = 358904, upload-time = "2025-11-30T20:22:43.479Z" }, + { url = 
"https://files.pythonhosted.org/packages/58/70/faed8186300e3b9bdd138d0273109784eea2396c68458ed580f885dfe7ad/rpds_py-0.30.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2771c6c15973347f50fece41fc447c054b7ac2ae0502388ce3b6738cd366e3d4", size = 389945, upload-time = "2025-11-30T20:22:44.819Z" }, + { url = "https://files.pythonhosted.org/packages/bd/a8/073cac3ed2c6387df38f71296d002ab43496a96b92c823e76f46b8af0543/rpds_py-0.30.0-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:0a59119fc6e3f460315fe9d08149f8102aa322299deaa5cab5b40092345c2136", size = 407783, upload-time = "2025-11-30T20:22:46.103Z" }, + { url = "https://files.pythonhosted.org/packages/77/57/5999eb8c58671f1c11eba084115e77a8899d6e694d2a18f69f0ba471ec8b/rpds_py-0.30.0-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:76fec018282b4ead0364022e3c54b60bf368b9d926877957a8624b58419169b7", size = 515021, upload-time = "2025-11-30T20:22:47.458Z" }, + { url = "https://files.pythonhosted.org/packages/e0/af/5ab4833eadc36c0a8ed2bc5c0de0493c04f6c06de223170bd0798ff98ced/rpds_py-0.30.0-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:692bef75a5525db97318e8cd061542b5a79812d711ea03dbc1f6f8dbb0c5f0d2", size = 414589, upload-time = "2025-11-30T20:22:48.872Z" }, + { url = "https://files.pythonhosted.org/packages/b7/de/f7192e12b21b9e9a68a6d0f249b4af3fdcdff8418be0767a627564afa1f1/rpds_py-0.30.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9027da1ce107104c50c81383cae773ef5c24d296dd11c99e2629dbd7967a20c6", size = 394025, upload-time = "2025-11-30T20:22:50.196Z" }, + { url = "https://files.pythonhosted.org/packages/91/c4/fc70cd0249496493500e7cc2de87504f5aa6509de1e88623431fec76d4b6/rpds_py-0.30.0-cp313-cp313-manylinux_2_31_riscv64.whl", hash = "sha256:9cf69cdda1f5968a30a359aba2f7f9aa648a9ce4b580d6826437f2b291cfc86e", size = 408895, upload-time = "2025-11-30T20:22:51.87Z" }, + { url = 
"https://files.pythonhosted.org/packages/58/95/d9275b05ab96556fefff73a385813eb66032e4c99f411d0795372d9abcea/rpds_py-0.30.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:a4796a717bf12b9da9d3ad002519a86063dcac8988b030e405704ef7d74d2d9d", size = 422799, upload-time = "2025-11-30T20:22:53.341Z" }, + { url = "https://files.pythonhosted.org/packages/06/c1/3088fc04b6624eb12a57eb814f0d4997a44b0d208d6cace713033ff1a6ba/rpds_py-0.30.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:5d4c2aa7c50ad4728a094ebd5eb46c452e9cb7edbfdb18f9e1221f597a73e1e7", size = 572731, upload-time = "2025-11-30T20:22:54.778Z" }, + { url = "https://files.pythonhosted.org/packages/d8/42/c612a833183b39774e8ac8fecae81263a68b9583ee343db33ab571a7ce55/rpds_py-0.30.0-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:ba81a9203d07805435eb06f536d95a266c21e5b2dfbf6517748ca40c98d19e31", size = 599027, upload-time = "2025-11-30T20:22:56.212Z" }, + { url = "https://files.pythonhosted.org/packages/5f/60/525a50f45b01d70005403ae0e25f43c0384369ad24ffe46e8d9068b50086/rpds_py-0.30.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:945dccface01af02675628334f7cf49c2af4c1c904748efc5cf7bbdf0b579f95", size = 563020, upload-time = "2025-11-30T20:22:58.2Z" }, + { url = "https://files.pythonhosted.org/packages/0b/5d/47c4655e9bcd5ca907148535c10e7d489044243cc9941c16ed7cd53be91d/rpds_py-0.30.0-cp313-cp313-win32.whl", hash = "sha256:b40fb160a2db369a194cb27943582b38f79fc4887291417685f3ad693c5a1d5d", size = 223139, upload-time = "2025-11-30T20:23:00.209Z" }, + { url = "https://files.pythonhosted.org/packages/f2/e1/485132437d20aa4d3e1d8b3fb5a5e65aa8139f1e097080c2a8443201742c/rpds_py-0.30.0-cp313-cp313-win_amd64.whl", hash = "sha256:806f36b1b605e2d6a72716f321f20036b9489d29c51c91f4dd29a3e3afb73b15", size = 240224, upload-time = "2025-11-30T20:23:02.008Z" }, + { url = 
"https://files.pythonhosted.org/packages/24/95/ffd128ed1146a153d928617b0ef673960130be0009c77d8fbf0abe306713/rpds_py-0.30.0-cp313-cp313-win_arm64.whl", hash = "sha256:d96c2086587c7c30d44f31f42eae4eac89b60dabbac18c7669be3700f13c3ce1", size = 230645, upload-time = "2025-11-30T20:23:03.43Z" }, + { url = "https://files.pythonhosted.org/packages/ff/1b/b10de890a0def2a319a2626334a7f0ae388215eb60914dbac8a3bae54435/rpds_py-0.30.0-cp313-cp313t-macosx_10_12_x86_64.whl", hash = "sha256:eb0b93f2e5c2189ee831ee43f156ed34e2a89a78a66b98cadad955972548be5a", size = 364443, upload-time = "2025-11-30T20:23:04.878Z" }, + { url = "https://files.pythonhosted.org/packages/0d/bf/27e39f5971dc4f305a4fb9c672ca06f290f7c4e261c568f3dea16a410d47/rpds_py-0.30.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:922e10f31f303c7c920da8981051ff6d8c1a56207dbdf330d9047f6d30b70e5e", size = 353375, upload-time = "2025-11-30T20:23:06.342Z" }, + { url = "https://files.pythonhosted.org/packages/40/58/442ada3bba6e8e6615fc00483135c14a7538d2ffac30e2d933ccf6852232/rpds_py-0.30.0-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:cdc62c8286ba9bf7f47befdcea13ea0e26bf294bda99758fd90535cbaf408000", size = 383850, upload-time = "2025-11-30T20:23:07.825Z" }, + { url = "https://files.pythonhosted.org/packages/14/14/f59b0127409a33c6ef6f5c1ebd5ad8e32d7861c9c7adfa9a624fc3889f6c/rpds_py-0.30.0-cp313-cp313t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:47f9a91efc418b54fb8190a6b4aa7813a23fb79c51f4bb84e418f5476c38b8db", size = 392812, upload-time = "2025-11-30T20:23:09.228Z" }, + { url = "https://files.pythonhosted.org/packages/b3/66/e0be3e162ac299b3a22527e8913767d869e6cc75c46bd844aa43fb81ab62/rpds_py-0.30.0-cp313-cp313t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:1f3587eb9b17f3789ad50824084fa6f81921bbf9a795826570bda82cb3ed91f2", size = 517841, upload-time = "2025-11-30T20:23:11.186Z" }, + { url = 
"https://files.pythonhosted.org/packages/3d/55/fa3b9cf31d0c963ecf1ba777f7cf4b2a2c976795ac430d24a1f43d25a6ba/rpds_py-0.30.0-cp313-cp313t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:39c02563fc592411c2c61d26b6c5fe1e51eaa44a75aa2c8735ca88b0d9599daa", size = 408149, upload-time = "2025-11-30T20:23:12.864Z" }, + { url = "https://files.pythonhosted.org/packages/60/ca/780cf3b1a32b18c0f05c441958d3758f02544f1d613abf9488cd78876378/rpds_py-0.30.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:51a1234d8febafdfd33a42d97da7a43f5dcb120c1060e352a3fbc0c6d36e2083", size = 383843, upload-time = "2025-11-30T20:23:14.638Z" }, + { url = "https://files.pythonhosted.org/packages/82/86/d5f2e04f2aa6247c613da0c1dd87fcd08fa17107e858193566048a1e2f0a/rpds_py-0.30.0-cp313-cp313t-manylinux_2_31_riscv64.whl", hash = "sha256:eb2c4071ab598733724c08221091e8d80e89064cd472819285a9ab0f24bcedb9", size = 396507, upload-time = "2025-11-30T20:23:16.105Z" }, + { url = "https://files.pythonhosted.org/packages/4b/9a/453255d2f769fe44e07ea9785c8347edaf867f7026872e76c1ad9f7bed92/rpds_py-0.30.0-cp313-cp313t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:6bdfdb946967d816e6adf9a3d8201bfad269c67efe6cefd7093ef959683c8de0", size = 414949, upload-time = "2025-11-30T20:23:17.539Z" }, + { url = "https://files.pythonhosted.org/packages/a3/31/622a86cdc0c45d6df0e9ccb6becdba5074735e7033c20e401a6d9d0e2ca0/rpds_py-0.30.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:c77afbd5f5250bf27bf516c7c4a016813eb2d3e116139aed0096940c5982da94", size = 565790, upload-time = "2025-11-30T20:23:19.029Z" }, + { url = "https://files.pythonhosted.org/packages/1c/5d/15bbf0fb4a3f58a3b1c67855ec1efcc4ceaef4e86644665fff03e1b66d8d/rpds_py-0.30.0-cp313-cp313t-musllinux_1_2_i686.whl", hash = "sha256:61046904275472a76c8c90c9ccee9013d70a6d0f73eecefd38c1ae7c39045a08", size = 590217, upload-time = "2025-11-30T20:23:20.885Z" }, + { url = 
"https://files.pythonhosted.org/packages/6d/61/21b8c41f68e60c8cc3b2e25644f0e3681926020f11d06ab0b78e3c6bbff1/rpds_py-0.30.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:4c5f36a861bc4b7da6516dbdf302c55313afa09b81931e8280361a4f6c9a2d27", size = 555806, upload-time = "2025-11-30T20:23:22.488Z" }, + { url = "https://files.pythonhosted.org/packages/f9/39/7e067bb06c31de48de3eb200f9fc7c58982a4d3db44b07e73963e10d3be9/rpds_py-0.30.0-cp313-cp313t-win32.whl", hash = "sha256:3d4a69de7a3e50ffc214ae16d79d8fbb0922972da0356dcf4d0fdca2878559c6", size = 211341, upload-time = "2025-11-30T20:23:24.449Z" }, + { url = "https://files.pythonhosted.org/packages/0a/4d/222ef0b46443cf4cf46764d9c630f3fe4abaa7245be9417e56e9f52b8f65/rpds_py-0.30.0-cp313-cp313t-win_amd64.whl", hash = "sha256:f14fc5df50a716f7ece6a80b6c78bb35ea2ca47c499e422aa4463455dd96d56d", size = 225768, upload-time = "2025-11-30T20:23:25.908Z" }, + { url = "https://files.pythonhosted.org/packages/86/81/dad16382ebbd3d0e0328776d8fd7ca94220e4fa0798d1dc5e7da48cb3201/rpds_py-0.30.0-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:68f19c879420aa08f61203801423f6cd5ac5f0ac4ac82a2368a9fcd6a9a075e0", size = 362099, upload-time = "2025-11-30T20:23:27.316Z" }, + { url = "https://files.pythonhosted.org/packages/2b/60/19f7884db5d5603edf3c6bce35408f45ad3e97e10007df0e17dd57af18f8/rpds_py-0.30.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:ec7c4490c672c1a0389d319b3a9cfcd098dcdc4783991553c332a15acf7249be", size = 353192, upload-time = "2025-11-30T20:23:29.151Z" }, + { url = "https://files.pythonhosted.org/packages/bf/c4/76eb0e1e72d1a9c4703c69607cec123c29028bff28ce41588792417098ac/rpds_py-0.30.0-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f251c812357a3fed308d684a5079ddfb9d933860fc6de89f2b7ab00da481e65f", size = 384080, upload-time = "2025-11-30T20:23:30.785Z" }, + { url = 
"https://files.pythonhosted.org/packages/72/87/87ea665e92f3298d1b26d78814721dc39ed8d2c74b86e83348d6b48a6f31/rpds_py-0.30.0-cp314-cp314-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:ac98b175585ecf4c0348fd7b29c3864bda53b805c773cbf7bfdaffc8070c976f", size = 394841, upload-time = "2025-11-30T20:23:32.209Z" }, + { url = "https://files.pythonhosted.org/packages/77/ad/7783a89ca0587c15dcbf139b4a8364a872a25f861bdb88ed99f9b0dec985/rpds_py-0.30.0-cp314-cp314-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:3e62880792319dbeb7eb866547f2e35973289e7d5696c6e295476448f5b63c87", size = 516670, upload-time = "2025-11-30T20:23:33.742Z" }, + { url = "https://files.pythonhosted.org/packages/5b/3c/2882bdac942bd2172f3da574eab16f309ae10a3925644e969536553cb4ee/rpds_py-0.30.0-cp314-cp314-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:4e7fc54e0900ab35d041b0601431b0a0eb495f0851a0639b6ef90f7741b39a18", size = 408005, upload-time = "2025-11-30T20:23:35.253Z" }, + { url = "https://files.pythonhosted.org/packages/ce/81/9a91c0111ce1758c92516a3e44776920b579d9a7c09b2b06b642d4de3f0f/rpds_py-0.30.0-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:47e77dc9822d3ad616c3d5759ea5631a75e5809d5a28707744ef79d7a1bcfcad", size = 382112, upload-time = "2025-11-30T20:23:36.842Z" }, + { url = "https://files.pythonhosted.org/packages/cf/8e/1da49d4a107027e5fbc64daeab96a0706361a2918da10cb41769244b805d/rpds_py-0.30.0-cp314-cp314-manylinux_2_31_riscv64.whl", hash = "sha256:b4dc1a6ff022ff85ecafef7979a2c6eb423430e05f1165d6688234e62ba99a07", size = 399049, upload-time = "2025-11-30T20:23:38.343Z" }, + { url = "https://files.pythonhosted.org/packages/df/5a/7ee239b1aa48a127570ec03becbb29c9d5a9eb092febbd1699d567cae859/rpds_py-0.30.0-cp314-cp314-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:4559c972db3a360808309e06a74628b95eaccbf961c335c8fe0d590cf587456f", size = 415661, upload-time = "2025-11-30T20:23:40.263Z" }, + { url = 
"https://files.pythonhosted.org/packages/70/ea/caa143cf6b772f823bc7929a45da1fa83569ee49b11d18d0ada7f5ee6fd6/rpds_py-0.30.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:0ed177ed9bded28f8deb6ab40c183cd1192aa0de40c12f38be4d59cd33cb5c65", size = 565606, upload-time = "2025-11-30T20:23:42.186Z" }, + { url = "https://files.pythonhosted.org/packages/64/91/ac20ba2d69303f961ad8cf55bf7dbdb4763f627291ba3d0d7d67333cced9/rpds_py-0.30.0-cp314-cp314-musllinux_1_2_i686.whl", hash = "sha256:ad1fa8db769b76ea911cb4e10f049d80bf518c104f15b3edb2371cc65375c46f", size = 591126, upload-time = "2025-11-30T20:23:44.086Z" }, + { url = "https://files.pythonhosted.org/packages/21/20/7ff5f3c8b00c8a95f75985128c26ba44503fb35b8e0259d812766ea966c7/rpds_py-0.30.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:46e83c697b1f1c72b50e5ee5adb4353eef7406fb3f2043d64c33f20ad1c2fc53", size = 553371, upload-time = "2025-11-30T20:23:46.004Z" }, + { url = "https://files.pythonhosted.org/packages/72/c7/81dadd7b27c8ee391c132a6b192111ca58d866577ce2d9b0ca157552cce0/rpds_py-0.30.0-cp314-cp314-win32.whl", hash = "sha256:ee454b2a007d57363c2dfd5b6ca4a5d7e2c518938f8ed3b706e37e5d470801ed", size = 215298, upload-time = "2025-11-30T20:23:47.696Z" }, + { url = "https://files.pythonhosted.org/packages/3e/d2/1aaac33287e8cfb07aab2e6b8ac1deca62f6f65411344f1433c55e6f3eb8/rpds_py-0.30.0-cp314-cp314-win_amd64.whl", hash = "sha256:95f0802447ac2d10bcc69f6dc28fe95fdf17940367b21d34e34c737870758950", size = 228604, upload-time = "2025-11-30T20:23:49.501Z" }, + { url = "https://files.pythonhosted.org/packages/e8/95/ab005315818cc519ad074cb7784dae60d939163108bd2b394e60dc7b5461/rpds_py-0.30.0-cp314-cp314-win_arm64.whl", hash = "sha256:613aa4771c99f03346e54c3f038e4cc574ac09a3ddfb0e8878487335e96dead6", size = 222391, upload-time = "2025-11-30T20:23:50.96Z" }, + { url = 
"https://files.pythonhosted.org/packages/9e/68/154fe0194d83b973cdedcdcc88947a2752411165930182ae41d983dcefa6/rpds_py-0.30.0-cp314-cp314t-macosx_10_12_x86_64.whl", hash = "sha256:7e6ecfcb62edfd632e56983964e6884851786443739dbfe3582947e87274f7cb", size = 364868, upload-time = "2025-11-30T20:23:52.494Z" }, + { url = "https://files.pythonhosted.org/packages/83/69/8bbc8b07ec854d92a8b75668c24d2abcb1719ebf890f5604c61c9369a16f/rpds_py-0.30.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:a1d0bc22a7cdc173fedebb73ef81e07faef93692b8c1ad3733b67e31e1b6e1b8", size = 353747, upload-time = "2025-11-30T20:23:54.036Z" }, + { url = "https://files.pythonhosted.org/packages/ab/00/ba2e50183dbd9abcce9497fa5149c62b4ff3e22d338a30d690f9af970561/rpds_py-0.30.0-cp314-cp314t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0d08f00679177226c4cb8c5265012eea897c8ca3b93f429e546600c971bcbae7", size = 383795, upload-time = "2025-11-30T20:23:55.556Z" }, + { url = "https://files.pythonhosted.org/packages/05/6f/86f0272b84926bcb0e4c972262f54223e8ecc556b3224d281e6598fc9268/rpds_py-0.30.0-cp314-cp314t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:5965af57d5848192c13534f90f9dd16464f3c37aaf166cc1da1cae1fd5a34898", size = 393330, upload-time = "2025-11-30T20:23:57.033Z" }, + { url = "https://files.pythonhosted.org/packages/cb/e9/0e02bb2e6dc63d212641da45df2b0bf29699d01715913e0d0f017ee29438/rpds_py-0.30.0-cp314-cp314t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:9a4e86e34e9ab6b667c27f3211ca48f73dba7cd3d90f8d5b11be56e5dbc3fb4e", size = 518194, upload-time = "2025-11-30T20:23:58.637Z" }, + { url = "https://files.pythonhosted.org/packages/ee/ca/be7bca14cf21513bdf9c0606aba17d1f389ea2b6987035eb4f62bd923f25/rpds_py-0.30.0-cp314-cp314t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:e5d3e6b26f2c785d65cc25ef1e5267ccbe1b069c5c21b8cc724efee290554419", size = 408340, upload-time = "2025-11-30T20:24:00.2Z" }, + { url = 
"https://files.pythonhosted.org/packages/c2/c7/736e00ebf39ed81d75544c0da6ef7b0998f8201b369acf842f9a90dc8fce/rpds_py-0.30.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:626a7433c34566535b6e56a1b39a7b17ba961e97ce3b80ec62e6f1312c025551", size = 383765, upload-time = "2025-11-30T20:24:01.759Z" }, + { url = "https://files.pythonhosted.org/packages/4a/3f/da50dfde9956aaf365c4adc9533b100008ed31aea635f2b8d7b627e25b49/rpds_py-0.30.0-cp314-cp314t-manylinux_2_31_riscv64.whl", hash = "sha256:acd7eb3f4471577b9b5a41baf02a978e8bdeb08b4b355273994f8b87032000a8", size = 396834, upload-time = "2025-11-30T20:24:03.687Z" }, + { url = "https://files.pythonhosted.org/packages/4e/00/34bcc2565b6020eab2623349efbdec810676ad571995911f1abdae62a3a0/rpds_py-0.30.0-cp314-cp314t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:fe5fa731a1fa8a0a56b0977413f8cacac1768dad38d16b3a296712709476fbd5", size = 415470, upload-time = "2025-11-30T20:24:05.232Z" }, + { url = "https://files.pythonhosted.org/packages/8c/28/882e72b5b3e6f718d5453bd4d0d9cf8df36fddeb4ddbbab17869d5868616/rpds_py-0.30.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:74a3243a411126362712ee1524dfc90c650a503502f135d54d1b352bd01f2404", size = 565630, upload-time = "2025-11-30T20:24:06.878Z" }, + { url = "https://files.pythonhosted.org/packages/3b/97/04a65539c17692de5b85c6e293520fd01317fd878ea1995f0367d4532fb1/rpds_py-0.30.0-cp314-cp314t-musllinux_1_2_i686.whl", hash = "sha256:3e8eeb0544f2eb0d2581774be4c3410356eba189529a6b3e36bbbf9696175856", size = 591148, upload-time = "2025-11-30T20:24:08.445Z" }, + { url = "https://files.pythonhosted.org/packages/85/70/92482ccffb96f5441aab93e26c4d66489eb599efdcf96fad90c14bbfb976/rpds_py-0.30.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:dbd936cde57abfee19ab3213cf9c26be06d60750e60a8e4dd85d1ab12c8b1f40", size = 556030, upload-time = "2025-11-30T20:24:10.956Z" }, + { url = 
"https://files.pythonhosted.org/packages/20/53/7c7e784abfa500a2b6b583b147ee4bb5a2b3747a9166bab52fec4b5b5e7d/rpds_py-0.30.0-cp314-cp314t-win32.whl", hash = "sha256:dc824125c72246d924f7f796b4f63c1e9dc810c7d9e2355864b3c3a73d59ade0", size = 211570, upload-time = "2025-11-30T20:24:12.735Z" }, + { url = "https://files.pythonhosted.org/packages/d0/02/fa464cdfbe6b26e0600b62c528b72d8608f5cc49f96b8d6e38c95d60c676/rpds_py-0.30.0-cp314-cp314t-win_amd64.whl", hash = "sha256:27f4b0e92de5bfbc6f86e43959e6edd1425c33b5e69aab0984a72047f2bcf1e3", size = 226532, upload-time = "2025-11-30T20:24:14.634Z" }, +] + [[package]] name = "ruff" version = "0.14.11" From 6b37de5076a8e00b4761e45d075c417e591419c1 Mon Sep 17 00:00:00 2001 From: JSON Schema Migration Date: Tue, 28 Apr 2026 12:59:07 +0530 Subject: [PATCH 05/13] feat(schemavault)!: remove legacy .py schema path (RCE remediation) BREAKING CHANGE: schema vault no longer accepts Python (.py) schema files. - API rejects .py uploads with HTTP 415; only .json (JSON Schema Draft 2020-12) is accepted. - Worker (map_handler) refuses to process schemas with Format='python'; existing Cosmos records must be re-registered as JSON. - Deleted libs/utils/remote_module_loader.py (the exec/importlib loader that was the original RCE primitive). - Deleted sample .py schemas; .json equivalents have been the default since the previous commit. - register_schema.py, post_deployment.sh/ps1, .http examples, and CustomizeSchemaData.md all updated to JSON-only. - Schema model defaults Format to 'json'; API model Literal restricted to 'json' only. - Test suite updated: previous .py-accepting tests now assert .py is rejected. 
--- docs/CustomizeSchemaData.md | 125 ++-- infra/scripts/post_deployment.ps1 | 15 +- infra/scripts/post_deployment.sh | 17 +- .../src/libs/pipeline/entities/schema.py | 12 +- .../src/libs/pipeline/handlers/map_handler.py | 35 +- .../src/libs/utils/__init__.py | 4 +- .../src/libs/utils/remote_module_loader.py | 65 -- .../src/libs/utils/remote_schema_loader.py | 1 - .../app/routers/logics/schemavault.py | 2 +- .../app/routers/models/schmavault/model.py | 13 +- .../app/routers/schemavault.py | 80 +-- .../app/tests/routers/test_schemavault.py | 76 +-- .../samples/schemas/autoclaim.py | 592 ------------------ .../samples/schemas/damagedcarimage.py | 519 --------------- .../samples/schemas/policereport.py | 353 ----------- .../samples/schemas/register_schema.py | 20 +- .../samples/schemas/repairestimate.py | 333 ---------- 17 files changed, 175 insertions(+), 2087 deletions(-) delete mode 100644 src/ContentProcessor/src/libs/utils/remote_module_loader.py delete mode 100644 src/ContentProcessorAPI/samples/schemas/autoclaim.py delete mode 100644 src/ContentProcessorAPI/samples/schemas/damagedcarimage.py delete mode 100644 src/ContentProcessorAPI/samples/schemas/policereport.py delete mode 100644 src/ContentProcessorAPI/samples/schemas/repairestimate.py diff --git a/docs/CustomizeSchemaData.md b/docs/CustomizeSchemaData.md index 41925d1f..c4ae4d59 100644 --- a/docs/CustomizeSchemaData.md +++ b/docs/CustomizeSchemaData.md @@ -92,58 +92,58 @@ Duplicate one of these files and update with a class definition that represents > > *Generate a Schema Class based on the following autoclaim.py schema definition, which has been built and derived from Pydantic BaseModel class. The generated Schema Class should be called "Freight Shipment Bill of Lading" schema file. 
Please define the entities based on standard bill of lading documents in the logistics industry.* -### Class Structure - -Each schema `.py` file must include: - -```python -from pydantic import BaseModel, Field -from typing import List, Optional - -class SubModel(BaseModel): - """Description of this sub-entity — used as LLM context.""" - - field_name: Optional[str] = Field( - description="What this field represents, e.g. Consignee company name" - ) - -class MyDocumentSchema(BaseModel): - """Top-level description of the document type.""" - - some_field: Optional[str] = Field(description="...") - sub_entity: Optional[SubModel] = Field(description="...") - - @staticmethod - def example() -> "MyDocumentSchema": - """Returns an empty instance of this schema.""" - return MyDocumentSchema(some_field="", sub_entity=SubModel.example()) - - @staticmethod - def from_json(json_str: str) -> "MyDocumentSchema": - """Creates an instance from a JSON string.""" - return MyDocumentSchema.model_validate_json(json_str) - - def to_dict(self) -> dict: - """Converts this instance to a dictionary.""" - return self.model_dump() +### Schema Document Structure + +Each schema `.json` file must be a JSON Schema (Draft 2020-12) with +`"type": "object"` at the root and a `"properties"` block. Example: + +```json +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "title": "MyDocumentSchema", + "description": "Top-level description of the document type.", + "type": "object", + "properties": { + "some_field": { + "type": ["string", "null"], + "description": "What this field represents, e.g. policy number" + }, + "sub_entity": { + "$ref": "#/$defs/SubModel" + } + }, + "$defs": { + "SubModel": { + "title": "SubModel", + "description": "Description of this sub-entity — used as LLM context.", + "type": "object", + "properties": { + "field_name": { + "type": ["string", "null"], + "description": "What this field represents, e.g. 
Consignee company name" + } + } + } + } +} ``` ### Key Rules | Element | Requirement | | ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **Inheritance** | All classes must inherit from `pydantic.BaseModel` | -| **Field descriptions** | Every field must have a `description=` — this is the prompt text the LLM uses for extraction. Include examples for better accuracy (e.g., `"Date of loss, e.g. 01/15/2026"`) | -| **Optional vs Required** | Use `Optional[str]` for fields that may not be present in every document | -| **Subclasses** | Use nested `BaseModel` classes for complex entities (address, line items, etc.) | -| **Required methods** | `example()`, `from_json()`, `to_dict()` — all three must be present | -| **Class docstring** | Include a description — it's used as context during mapping | +| **Root type** | Must be `"type": "object"` with a `"properties"` block | +| **Field descriptions** | Every property must have a `"description"` — this is the prompt text the LLM uses for extraction. Include examples for better accuracy (e.g., `"Date of loss, e.g. 01/15/2026"`) | +| **Optional vs Required** | Use `["string", "null"]` for fields that may not be present in every document; list required keys in the root `"required"` array if any | +| **Sub-objects** | Define reusable nested types under `"$defs"` and reference them via `"$ref": "#/$defs/"` | +| **Class name** | Use a top-level `"title"` field; this becomes `ClassName` in the Schema Vault. If absent, the request body's `ClassName` (or filename) is used | +| **Top-level description**| Include a `"description"` — it's used as context during mapping | --- ## Step 2: Register Schemas -After creating your `.py` class files, register each schema in the system. Registration uploads the class file to Blob Storage and stores metadata in Cosmos DB. 
+After creating your `.json` schema files, register each schema in the system. Registration uploads the file to Blob Storage and stores metadata in Cosmos DB. ### Option A: Register via API (individual) @@ -152,7 +152,7 @@ After creating your `.py` class files, register each schema in the system. Regis | Part | Type | Description | | ------------- | ----------- | ----------------------------------------------------------------- | | `schema_info` | JSON string | `{"ClassName": "MyDocumentSchema", "Description": "My Document"}` | -| `file` | File upload | The `.py` class file (max 1 MB) | +| `file` | File upload | The `.json` JSON Schema file (max 1 MB) | Example using the REST Client extension: @@ -259,35 +259,32 @@ Repeat for each schema. The SchemaSet now holds references to all your document Once schemas are registered and grouped into a SchemaSet, the pipeline uses them automatically during the **Map** step: 1. **Schema lookup** — The Map handler reads the `Schema_Id` from the processing queue message, then fetches metadata from Cosmos DB -2. **Dynamic class loading** — Downloads the `.py` file from Blob Storage and dynamically loads the Pydantic class -3. **JSON Schema generation** — Calls `model_json_schema()` on the class to produce a full JSON Schema with all field descriptions +2. **Schema materialisation** — Downloads the JSON Schema document from Blob Storage and builds a Pydantic model from it in memory (no code execution) +3. **JSON Schema generation** — Calls `model_json_schema()` on the materialised model to produce the schema with all field descriptions 4. **LLM extraction** — Embeds the JSON Schema into the GPT-5.1 system prompt with `response_format` for structured JSON output (temperature=0.1 for deterministic results) -5. **Validation & scoring** — Parses the GPT response back into the Pydantic class, then computes per-field confidence scores using log-probabilities +5. 
**Validation & scoring** — Parses the GPT response back into the Pydantic model, then computes per-field confidence scores using log-probabilities This means your field descriptions in the schema class **directly influence extraction quality** — write clear, specific descriptions with examples for best results. --- -## Authoring Schemas as JSON (recommended) - -The schema vault now also accepts **JSON Schema** documents (Draft 2020-12) -in addition to the legacy executable `.py` format. JSON schemas are treated -strictly as data: the worker parses them and materialises a Pydantic model -in memory without executing any uploaded code, eliminating an entire class -of remote-code-execution risk in the schema-management path. +## Authoring Schemas as JSON -### Why JSON? +The schema vault accepts **JSON Schema** documents (Draft 2020-12) only. +JSON schemas are treated strictly as data: the worker parses them and +materialises a Pydantic model in memory without executing any uploaded +code, eliminating an entire class of remote-code-execution risk in the +schema-management path. The legacy executable `.py` format has been +removed; uploads of `.py` files are rejected with HTTP 415. -| | Legacy `.py` | JSON Schema | -| --- | --- | --- | -| Format | Executable Pydantic class | Declarative JSON document | -| Worker behaviour | Imports and runs uploaded Python | Parses JSON, builds model in memory | -| Authoring | Hand-written Python | Pydantic-compatible JSON | -| Side-effects on import | Possible | Impossible | +### Format requirements -Both formats are accepted today; JSON is the recommended path for new -schemas and is required to be opted into per upload by using a `.json` -file extension. 
+| | JSON Schema | +| --- | --- | +| Format | Declarative JSON document | +| Worker behaviour | Parses JSON, builds model in memory | +| Authoring | Pydantic-compatible JSON | +| Side-effects on import | Impossible | ### Authoring with the conversion helper @@ -311,8 +308,8 @@ that you can reference. ### Upload via API -`POST /schemavault/` accepts either format. For JSON, send the file as -`application/json`: +`POST /schemavault/` accepts JSON Schema documents. Send the file with +`Content-Type: application/json`: ```http POST /schemavault/ diff --git a/infra/scripts/post_deployment.ps1 b/infra/scripts/post_deployment.ps1 index bcf2ac4a..aa116003 100644 --- a/infra/scripts/post_deployment.ps1 +++ b/infra/scripts/post_deployment.ps1 @@ -124,17 +124,14 @@ if (-not $ApiReady) { Write-Host " Registering new schema '$ClassName'..." - # Pick MIME type by extension. Both .json (recommended) and .py - # (legacy) are accepted by the API. + # Only JSON Schema descriptors are accepted. The legacy .py format + # was removed as part of the schemavault RCE remediation. $extension = [System.IO.Path]::GetExtension($SchemaFile).ToLowerInvariant() - switch ($extension) { - '.json' { $contentType = 'application/json' } - '.py' { $contentType = 'text/x-python' } - default { - Write-Host " Unsupported schema extension '$extension' for '$SchemaFile'. Skipping..." - continue - } + if ($extension -ne '.json') { + Write-Host " Unsupported schema extension '$extension' for '$SchemaFile'. Only .json is accepted. Skipping..." + continue } + $contentType = 'application/json' # Build multipart form data $dataPayload = @{ ClassName = $ClassName; Description = $Description } | ConvertTo-Json -Compress diff --git a/infra/scripts/post_deployment.sh b/infra/scripts/post_deployment.sh index 66f0180a..f02f8be9 100644 --- a/infra/scripts/post_deployment.sh +++ b/infra/scripts/post_deployment.sh @@ -136,17 +136,14 @@ else echo " Registering new schema '$CLASS_NAME'..." 
DATA_PAYLOAD="{\"ClassName\": \"$CLASS_NAME\", \"Description\": \"$DESCRIPTION\"}" - # Pick MIME type by extension. Both .json (recommended) and .py (legacy) - # are accepted by the API. + # Only JSON Schema descriptors are accepted. The legacy .py format + # was removed as part of the schemavault RCE remediation. EXT="${FILE_NAME##*.}" - case "$EXT" in - json) CONTENT_TYPE="application/json" ;; - py) CONTENT_TYPE="text/x-python" ;; - *) - echo " Unsupported schema extension '.$EXT' for '$FILE_NAME'. Skipping..." - continue - ;; - esac + if [ "$EXT" != "json" ]; then + echo " Unsupported schema extension '.$EXT' for '$FILE_NAME'. Only .json is accepted. Skipping..." + continue + fi + CONTENT_TYPE="application/json" RESPONSE=$(curl -s -w "\n%{http_code}" \ -X POST "$SCHEMAVAULT_URL" \ diff --git a/src/ContentProcessor/src/libs/pipeline/entities/schema.py b/src/ContentProcessor/src/libs/pipeline/entities/schema.py index 429d2570..f1ec8c18 100644 --- a/src/ContentProcessor/src/libs/pipeline/entities/schema.py +++ b/src/ContentProcessor/src/libs/pipeline/entities/schema.py @@ -25,11 +25,11 @@ class Schema(BaseModel): Description: Human-readable description. FileName: Blob filename containing the schema artifact. ContentType: Target content type this schema handles. - Format: Storage format of the schema artifact. ``"python"`` (legacy) - indicates a ``.py`` Pydantic class; ``"json"`` indicates a - JSON Schema descriptor that the worker materialises in-memory - without executing any uploaded code. Defaults to ``"python"`` - so existing Cosmos records keep their current behaviour. + Format: Storage format of the schema artifact. Always + ``"json"`` — declarative JSON Schema descriptors are the + only supported format. The legacy ``"python"`` value is + tolerated when reading historical Cosmos records but the + worker will refuse to process them. 
"""
 
     Id: str
@@ -37,7 +37,10 @@
     Description: str
     FileName: str
     ContentType: str
-    Format: Literal["python", "json"] = Field(default="python")
+    # Default stays "python": records that predate this field are legacy
+    # .py schemas and must be refused with a clear error, not mis-parsed
+    # as JSON descriptors.
+    Format: Literal["python", "json"] = Field(default="python")
     Created_On: Optional[datetime.datetime] = Field(default=None)
     Updated_On: Optional[datetime.datetime] = Field(default=None)
 
diff --git a/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py b/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py
index d85ee2cc..0217662d 100644
--- a/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py
+++ b/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py
@@ -28,7 +28,6 @@
 from libs.pipeline.entities.pipeline_step_result import StepResult
 from libs.pipeline.entities.schema import Schema
 from libs.pipeline.queue_handler_base import HandlerBase
-from libs.utils.remote_module_loader import load_schema_from_blob
 from libs.utils.remote_schema_loader import load_schema_from_blob_json
 
 logger = logging.getLogger(__name__)
@@ -152,25 +151,23 @@ async def execute(self, context: MessageContext) -> StepResult:
             schema_id=context.data_pipeline.pipeline_status.schema_id,
         )
 
-        # Load the schema class for structured output. JSON schemas are
-        # materialised as in-memory Pydantic models without executing any
-        # uploaded code; legacy ``.py`` schemas continue to use the
-        # remote-module loader so existing deployments keep working.
- schema_format = getattr(selected_schema, "Format", "python") or "python" - if schema_format == "json": - schema_class = load_schema_from_blob_json( - account_url=self.application_context.configuration.app_storage_blob_url, - container_name=f"{self.application_context.configuration.app_cps_configuration}/Schemas/{context.data_pipeline.pipeline_status.schema_id}", - blob_name=selected_schema.FileName, - model_name=selected_schema.ClassName, - ) - else: - schema_class = load_schema_from_blob( - account_url=self.application_context.configuration.app_storage_blob_url, - container_name=f"{self.application_context.configuration.app_cps_configuration}/Schemas/{context.data_pipeline.pipeline_status.schema_id}", - blob_name=selected_schema.FileName, - module_name=selected_schema.ClassName, + # Load the schema class for structured output. Only JSON schemas + # are supported; the worker materialises the descriptor as an + # in-memory Pydantic model without ever executing uploaded code. + schema_format = getattr(selected_schema, "Format", "json") or "json" + if schema_format != "json": + raise ValueError( + f"Schema {selected_schema.Id} has unsupported Format " + f"'{schema_format}'. Re-register the schema as a JSON " + "Schema (.json) document; legacy Python (.py) schemas " + "are no longer supported." 
) + schema_class = load_schema_from_blob_json( + account_url=self.application_context.configuration.app_storage_blob_url, + container_name=f"{self.application_context.configuration.app_cps_configuration}/Schemas/{context.data_pipeline.pipeline_status.schema_id}", + blob_name=selected_schema.FileName, + model_name=selected_schema.ClassName, + ) # Invoke Model with Agent Framework SDK diff --git a/src/ContentProcessor/src/libs/utils/__init__.py b/src/ContentProcessor/src/libs/utils/__init__.py index e4b1d5a6..b5f16936 100644 --- a/src/ContentProcessor/src/libs/utils/__init__.py +++ b/src/ContentProcessor/src/libs/utils/__init__.py @@ -8,8 +8,8 @@ base64_util: Base-64 encoding detection. credential_util: Convenience re-export of credential and token-provider helpers (mirrors azure_credential_utils). - remote_module_loader: Dynamically load Python modules from Azure Blob - Storage. + remote_schema_loader: Materialise Pydantic models from JSON Schema + descriptors stored in Azure Blob Storage (no code execution). stopwatch: Lightweight elapsed-time measurement context manager. utils: General-purpose JSON encoding, dict flattening, and value comparison helpers. diff --git a/src/ContentProcessor/src/libs/utils/remote_module_loader.py b/src/ContentProcessor/src/libs/utils/remote_module_loader.py deleted file mode 100644 index f3985aa7..00000000 --- a/src/ContentProcessor/src/libs/utils/remote_module_loader.py +++ /dev/null @@ -1,65 +0,0 @@ -# Copyright (c) Microsoft Corporation. -# Licensed under the MIT License. - -"""Dynamically load Python modules stored in Azure Blob Storage. - -Used by the map handler to fetch schema classes at runtime from a -configurable blob container. 
-""" - -import importlib.util -import sys - -from azure.storage.blob import BlobServiceClient - -from libs.utils.azure_credential_utils import get_azure_credential - - -def load_schema_from_blob( - account_url: str, container_name: str, blob_name: str, module_name: str -): - """Download a Python file from blob storage and return a class from it. - - Args: - account_url: Azure Blob Storage account URL. - container_name: Container (path) holding the blob. - blob_name: Blob filename to download. - module_name: Name of the class to extract from the module. - - Returns: - The class object loaded from the downloaded script. - """ - # Download the blob content - blob_content = _download_blob_content(container_name, blob_name, account_url) - - # Execute the script content - module = _execute_script(blob_content, module_name) - - loaded_class = getattr(module, module_name) - return loaded_class - - -def _download_blob_content(container_name, blob_name, account_url): - """Download blob content as a UTF-8 string.""" - credential = get_azure_credential() - blob_service_client = BlobServiceClient( - account_url=account_url, credential=credential - ) - - blob_client = blob_service_client.get_blob_client( - container=container_name, blob=blob_name - ) - - blob_content = blob_client.download_blob().readall().decode("utf-8") - return blob_content - - -def _execute_script(script_content, module_name): - """Execute Python source text as a new module and return it.""" - spec = importlib.util.spec_from_loader(module_name, loader=None) - module = importlib.util.module_from_spec(spec) - sys.modules[module_name] = module - - # Execute the script content in the module's namespace - exec(script_content, module.__dict__) - return module diff --git a/src/ContentProcessor/src/libs/utils/remote_schema_loader.py b/src/ContentProcessor/src/libs/utils/remote_schema_loader.py index 6ecd02aa..53079df1 100644 --- a/src/ContentProcessor/src/libs/utils/remote_schema_loader.py +++ 
b/src/ContentProcessor/src/libs/utils/remote_schema_loader.py @@ -3,7 +3,6 @@ """Materialise a Pydantic model from a JSON Schema descriptor. -This is the *safe* counterpart of :mod:`libs.utils.remote_module_loader`. A JSON schema descriptor is treated strictly as data: 1. Bytes are downloaded from blob storage. diff --git a/src/ContentProcessorAPI/app/routers/logics/schemavault.py b/src/ContentProcessorAPI/app/routers/logics/schemavault.py index dc5f34c8..548ea6f5 100644 --- a/src/ContentProcessorAPI/app/routers/logics/schemavault.py +++ b/src/ContentProcessorAPI/app/routers/logics/schemavault.py @@ -72,7 +72,7 @@ def Update( file: UploadFile, schema_id: str, class_name: str, - storage_format: str = "python", + storage_format: str = "json", ) -> Schema: """Replace the schema file in blob storage and update Cosmos metadata.""" schemas = self.mongoHelper.find_document(query={"Id": schema_id}) diff --git a/src/ContentProcessorAPI/app/routers/models/schmavault/model.py b/src/ContentProcessorAPI/app/routers/models/schmavault/model.py index 73072bab..6c500f27 100644 --- a/src/ContentProcessorAPI/app/routers/models/schmavault/model.py +++ b/src/ContentProcessorAPI/app/routers/models/schmavault/model.py @@ -15,15 +15,14 @@ class Schema(BaseModel): Attributes: Id: Unique schema identifier. - ClassName: Class name of the schema (Python class for legacy - ``.py`` schemas, or the JSON Schema ``title`` for JSON - schemas). + ClassName: Class name of the schema (the JSON Schema ``title`` + field, or a sanitised fallback derived from the filename). Description: Human-readable description. FileName: Source filename for the schema definition. ContentType: Expected content/MIME type. - Format: Storage format of the schema artifact. - ``"python"`` (default, legacy) for ``.py`` files; - ``"json"`` for declarative JSON Schema descriptors. + Format: Storage format of the schema artifact. Always + ``"json"`` — declarative JSON Schema descriptors are the + only supported format. 
Created_On: UTC timestamp when the schema was registered. Updated_On: UTC timestamp of the last update. """ @@ -33,7 +32,7 @@ class Schema(BaseModel): Description: str FileName: str ContentType: str - Format: Literal["python", "json"] = Field(default="python") + Format: Literal["json"] = Field(default="json") Created_On: Optional[datetime.datetime] = Field(default=None) Updated_On: Optional[datetime.datetime] = Field(default=None) model_config = ConfigDict(from_attributes=True) diff --git a/src/ContentProcessorAPI/app/routers/schemavault.py b/src/ContentProcessorAPI/app/routers/schemavault.py index be331d90..62058544 100644 --- a/src/ContentProcessorAPI/app/routers/schemavault.py +++ b/src/ContentProcessorAPI/app/routers/schemavault.py @@ -34,11 +34,11 @@ ) #: Filename extensions accepted by the schema-vault upload routes. -#: ``.py`` is the legacy Python class format (executed by the worker via -#: ``remote_module_loader``). ``.json`` is the declarative JSON Schema -#: format introduced as part of the migration away from executable -#: schemas; it is parsed as data and never executed. -_ALLOWED_EXTENSIONS: tuple[str, ...] = (".py", ".json") +#: Only ``.json`` (declarative JSON Schema) is supported. The legacy +#: ``.py`` (executable Pydantic class) format was removed because the +#: worker would ``exec`` uploaded code, exposing an RCE primitive +#: against any caller able to register a schema. +_ALLOWED_EXTENSIONS: tuple[str, ...] = (".json",) _MAX_UPLOAD_BYTES: int = 1 * 1024 * 1024 @@ -58,8 +58,8 @@ def _validate_upload(file: UploadFile) -> tuple[str, str]: raise HTTPException( status_code=415, detail=( - "Unsupported schema file type. " - "Only .py and .json schema files are supported." + "Unsupported schema file type. Only .json schema files " + "are accepted; legacy .py uploads are disabled." 
), ) @@ -149,28 +149,19 @@ async def Register_Schema( safe_filename, extension = _validate_upload(file) - # Determine the storage format and final ClassName based on extension. - # For ``.json`` schemas we additionally validate the document up front so - # that no blob or Cosmos record is ever written for an invalid schema. - if extension == ".json": - raw = file.file.read() - file.file.seek(0) - try: - document = validate_json_schema(raw) - except SchemaValidationError as exc: - raise HTTPException( - status_code=400, - detail={"message": "Invalid JSON schema.", "errors": exc.errors}, - ) from exc - - fallback = os.path.splitext(safe_filename)[0] - class_name = derive_class_name(document, fallback=data.ClassName or fallback) - storage_format = "json" - content_type = file.content_type or "application/json" - else: - class_name = data.ClassName - storage_format = "python" - content_type = file.content_type or "text/x-python" + raw = file.file.read() + file.file.seek(0) + try: + document = validate_json_schema(raw) + except SchemaValidationError as exc: + raise HTTPException( + status_code=400, + detail={"message": "Invalid JSON schema.", "errors": exc.errors}, + ) from exc + + fallback = os.path.splitext(safe_filename)[0] + class_name = derive_class_name(document, fallback=data.ClassName or fallback) + content_type = file.content_type or "application/json" return schemas.Add( file, @@ -180,7 +171,7 @@ async def Register_Schema( Description=data.Description, FileName=safe_filename, ContentType=content_type, - Format=storage_format, + Format="json", ), ) @@ -223,25 +214,20 @@ async def Update_Schema( safe_filename, extension = _validate_upload(file) - if extension == ".json": - raw = file.file.read() - file.file.seek(0) - try: - document = validate_json_schema(raw) - except SchemaValidationError as exc: - raise HTTPException( - status_code=400, - detail={"message": "Invalid JSON schema.", "errors": exc.errors}, - ) from exc - fallback = 
os.path.splitext(safe_filename)[0] - class_name = derive_class_name(document, fallback=data.ClassName or fallback) - storage_format = "json" - else: - class_name = data.ClassName - storage_format = "python" + raw = file.file.read() + file.file.seek(0) + try: + document = validate_json_schema(raw) + except SchemaValidationError as exc: + raise HTTPException( + status_code=400, + detail={"message": "Invalid JSON schema.", "errors": exc.errors}, + ) from exc + fallback = os.path.splitext(safe_filename)[0] + class_name = derive_class_name(document, fallback=data.ClassName or fallback) schemas: Schemas = app.app_context.get_service(Schemas) - return schemas.Update(file, data.SchemaId, class_name, storage_format) + return schemas.Update(file, data.SchemaId, class_name, "json") @router.delete( diff --git a/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py b/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py index dca82123..96fc7931 100644 --- a/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py +++ b/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py @@ -93,15 +93,10 @@ def test_get_registered_schema_file_by_schema_id_500_error(client_and_schemas): assert response.json() == {"detail": "Internal Server Error"} -def test_register_schema_accepts_py_and_sanitizes_filename(client_and_schemas): +def test_register_schema_rejects_py(client_and_schemas): + """Legacy .py uploads must be refused outright (RCE remediation).""" client, mock_schemas = client_and_schemas - mock_schemas.Add.return_value = { - "Id": "test-id", - "ClassName": "TestClass", - "Description": "Test description", - "FileName": "invoice.py", - "ContentType": "text/x-python", - } + mock_schemas.Add.reset_mock() files = { "file": ("C:/fakepath/invoice.py", b"class Invoice: pass\n", "text/x-python"), @@ -113,15 +108,11 @@ def test_register_schema_accepts_py_and_sanitizes_filename(client_and_schemas): } response = client.post("/schemavault/", files=files) - assert 
response.status_code == 200 - - # Ensure Add() is called with Schema.FileName sanitized to just the basename - add_args, _ = mock_schemas.Add.call_args - schema_obj = add_args[1] - assert schema_obj.FileName == "invoice.py" + assert response.status_code == 415 + assert mock_schemas.Add.call_count == 0 -def test_register_schema_rejects_non_py(client_and_schemas): +def test_register_schema_rejects_unsupported_extension(client_and_schemas): client, mock_schemas = client_and_schemas mock_schemas.Add.reset_mock() @@ -143,17 +134,18 @@ def test_update_schema_success(client_and_schemas): client, mock_schemas = client_and_schemas mock_schemas.Update.return_value = { "Id": "test-id", - "ClassName": "Updated", + "ClassName": "InvoiceSchema", "Description": "desc", - "FileName": "updated.py", - "ContentType": "text/x-python", + "FileName": "updated.json", + "ContentType": "application/json", + "Format": "json", } files = { - "file": ("updated.py", b"class Updated: pass\n", "text/x-python"), + "file": ("updated.json", _minimal_json_schema_bytes(), "application/json"), "data": ( None, - json.dumps({"SchemaId": "test-id", "ClassName": "Updated"}), + json.dumps({"SchemaId": "test-id", "ClassName": "InvoiceSchema"}), "application/json", ), } @@ -163,7 +155,23 @@ def test_update_schema_success(client_and_schemas): mock_schemas.Update.assert_called_once() -def test_update_schema_rejects_non_py(client_and_schemas): +def test_update_schema_rejects_py(client_and_schemas): + client, mock_schemas = client_and_schemas + + files = { + "file": ("updated.py", b"class Updated: pass\n", "text/x-python"), + "data": ( + None, + json.dumps({"SchemaId": "test-id", "ClassName": "X"}), + "application/json", + ), + } + + response = client.put("/schemavault/", files=files) + assert response.status_code == 415 + + +def test_update_schema_rejects_unsupported_extension(client_and_schemas): client, mock_schemas = client_and_schemas files = { @@ -329,32 +337,6 @@ def 
test_register_schema_falls_back_to_filename_for_classname(client_and_schemas assert schema_obj.Format == "json" -def test_register_schema_still_accepts_py(client_and_schemas): - client, mock_schemas = client_and_schemas - mock_schemas.Add.return_value = { - "Id": "test-id", - "ClassName": "Legacy", - "Description": "desc", - "FileName": "legacy.py", - "ContentType": "text/x-python", - "Format": "python", - } - - files = { - "file": ("legacy.py", b"class Legacy: pass\n", "text/x-python"), - "data": ( - None, - json.dumps({"ClassName": "Legacy", "Description": "desc"}), - "application/json", - ), - } - - response = client.post("/schemavault/", files=files) - assert response.status_code == 200, response.text - schema_obj = mock_schemas.Add.call_args[0][1] - assert schema_obj.Format == "python" - - def test_update_schema_accepts_json(client_and_schemas): client, mock_schemas = client_and_schemas mock_schemas.Update.return_value = { diff --git a/src/ContentProcessorAPI/samples/schemas/autoclaim.py b/src/ContentProcessorAPI/samples/schemas/autoclaim.py deleted file mode 100644 index f207c017..00000000 --- a/src/ContentProcessorAPI/samples/schemas/autoclaim.py +++ /dev/null @@ -1,592 +0,0 @@ -# Copyright (c) Microsoft Corporation. -# Licensed under the MIT License. -"""Pydantic models for auto insurance claim form data extraction. - -Defines the hierarchical schema used by the content processing pipeline to -extract structured fields from auto insurance claim documents. -""" - -from __future__ import annotations - -import json -from typing import List, Optional - -from pydantic import BaseModel, Field - - -class AutoClaimAddress(BaseModel): - """A class representing an address used on an auto claim form.""" - - street: Optional[str] = Field(description="Street address, e.g. 123 Main St.") - city: Optional[str] = Field(description="City, e.g. Macon") - state: Optional[str] = Field(description="State, e.g. 
GA") - postal_code: Optional[str] = Field(description="Postal code, e.g. 31201") - country: Optional[str] = Field(description="Country, e.g. USA") - - @staticmethod - def example() -> "AutoClaimAddress": - """Return an empty instance with default placeholder values.""" - return AutoClaimAddress( - street="", city="", state="", postal_code="", country="" - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "street": self.street, - "city": self.city, - "state": self.state, - "postal_code": self.postal_code, - "country": self.country, - } - - -class PolicyholderInformation(BaseModel): - """A class representing policyholder information.""" - - name: Optional[str] = Field(description="Policyholder full name, e.g. Chad Brooks") - address: Optional[AutoClaimAddress] = Field( - description="Policyholder address, e.g. 123 Main Street, Macon, GA 31201" - ) - phone: Optional[str] = Field( - description="Policyholder phone number, e.g. (555) 555-1212" - ) - email: Optional[str] = Field( - description="Policyholder email address, e.g. chad.brooks@example.com" - ) - - @staticmethod - def example() -> "PolicyholderInformation": - """Return an empty instance with default placeholder values.""" - return PolicyholderInformation( - name="", - address=AutoClaimAddress.example(), - phone="", - email="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "name": self.name, - "address": self.address.to_dict() if self.address else None, - "phone": self.phone, - "email": self.email, - } - - -class PolicyDetails(BaseModel): - """A class representing policy details.""" - - coverage_type: Optional[str] = Field( - description="Coverage type, e.g. Auto – Comprehensive" - ) - effective_date: Optional[str] = Field( - description="Policy effective date, e.g. 2025-01-01" - ) - expiration_date: Optional[str] = Field( - description="Policy expiration date, e.g. 
2025-12-31" - ) - deductible: Optional[float] = Field(description="Deductible amount, e.g. 500.0") - deductible_currency: Optional[str] = Field( - description="Currency of the deductible, e.g. USD" - ) - - @staticmethod - def example() -> "PolicyDetails": - """Return an empty instance with default placeholder values.""" - return PolicyDetails( - coverage_type="", - effective_date="", - expiration_date="", - deductible=0.0, - deductible_currency="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "coverage_type": self.coverage_type, - "effective_date": self.effective_date, - "expiration_date": self.expiration_date, - "deductible": self.deductible, - "deductible_currency": self.deductible_currency, - } - - -class IncidentDetails(BaseModel): - """A class representing incident details.""" - - date_of_loss: Optional[str] = Field(description="Date of loss, e.g. 2025-11-28") - time_of_loss: Optional[str] = Field(description="Time of loss, e.g. 14:15") - location: Optional[str] = Field( - description="Incident location, e.g. Parking lot near 123 Main Street, Macon, GA" - ) - cause_of_loss: Optional[str] = Field( - description="Cause of loss, e.g. Low-speed collision with another vehicle" - ) - description: Optional[str] = Field( - description="Incident description, e.g. Minor dent and paint scratches; no structural damage" - ) - police_report_filed: Optional[bool] = Field( - description="Whether a police report was filed" - ) - police_report_number: Optional[str] = Field( - description="Police report number, e.g. 
GA-20251128-CR" - ) - - @staticmethod - def example() -> "IncidentDetails": - """Return an empty instance with default placeholder values.""" - return IncidentDetails( - date_of_loss="", - time_of_loss="", - location="", - cause_of_loss="", - description="", - police_report_filed=False, - police_report_number="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "date_of_loss": self.date_of_loss, - "time_of_loss": self.time_of_loss, - "location": self.location, - "cause_of_loss": self.cause_of_loss, - "description": self.description, - "police_report_filed": self.police_report_filed, - "police_report_number": self.police_report_number, - } - - -class VehicleInformation(BaseModel): - """A class representing vehicle information.""" - - year: Optional[int] = Field(description="Vehicle year, e.g. 2022") - make: Optional[str] = Field(description="Vehicle make, e.g. Toyota") - model: Optional[str] = Field(description="Vehicle model, e.g. Camry") - trim: Optional[str] = Field(description="Vehicle trim, e.g. SE") - vin: Optional[str] = Field(description="Vehicle VIN, e.g. 4T1G11AK2NU123456") - license_plate: Optional[str] = Field(description="License plate, e.g. GA-ABC123") - mileage: Optional[int] = Field(description="Mileage, e.g. 28450") - - @staticmethod - def example() -> "VehicleInformation": - """Return an empty instance with default placeholder values.""" - return VehicleInformation( - year=0, - make="", - model="", - trim="", - vin="", - license_plate="", - mileage=0, - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "year": self.year, - "make": self.make, - "model": self.model, - "trim": self.trim, - "vin": self.vin, - "license_plate": self.license_plate, - "mileage": self.mileage, - } - - -class DamageAssessmentItem(BaseModel): - """A class representing a damage assessment line item.""" - - item_description: Optional[str] = Field( - description="Damaged item/area description, e.g. 
Right-front quarter panel" - ) - date_acquired: Optional[str] = Field( - description="Date acquired (if present), e.g. 2022-03-15" - ) - cost_new: Optional[float] = Field(description="Cost when new, e.g. 1200.0") - cost_new_currency: Optional[str] = Field( - description="Currency of cost_new, e.g. USD" - ) - repair_estimate: Optional[float] = Field(description="Repair estimate, e.g. 350.0") - repair_estimate_currency: Optional[str] = Field( - description="Currency of repair_estimate, e.g. USD" - ) - - @staticmethod - def example() -> "DamageAssessmentItem": - """Return an empty instance with default placeholder values.""" - return DamageAssessmentItem( - item_description="", - date_acquired="", - cost_new=0.0, - cost_new_currency="", - repair_estimate=0.0, - repair_estimate_currency="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "item_description": self.item_description, - "date_acquired": self.date_acquired, - "cost_new": self.cost_new, - "cost_new_currency": self.cost_new_currency, - "repair_estimate": self.repair_estimate, - "repair_estimate_currency": self.repair_estimate_currency, - } - - -class DamageAssessment(BaseModel): - """A class representing overall damage assessment.""" - - items: Optional[List[DamageAssessmentItem]] = Field( - description="List of damage assessment line items" - ) - total_estimated_repair: Optional[float] = Field( - description="Total estimated repair, e.g. 500.0" - ) - total_estimated_repair_currency: Optional[str] = Field( - description="Currency of total_estimated_repair, e.g. 
USD" - ) - - @staticmethod - def example() -> "DamageAssessment": - """Return an empty instance with default placeholder values.""" - return DamageAssessment( - items=[DamageAssessmentItem.example()], - total_estimated_repair=0.0, - total_estimated_repair_currency="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "items": [item.to_dict() for item in (self.items or [])], - "total_estimated_repair": self.total_estimated_repair, - "total_estimated_repair_currency": self.total_estimated_repair_currency, - } - - -class SupportingDocuments(BaseModel): - """A class representing supporting documents included with the claim.""" - - photos_of_damage: Optional[bool] = Field( - description="Whether photos of damage are included" - ) - police_report_copy: Optional[bool] = Field( - description="Whether a police report copy is included" - ) - repair_shop_estimate: Optional[bool] = Field( - description="Whether a repair shop estimate is included" - ) - other: Optional[List[str]] = Field(description="Other supporting documents") - - @staticmethod - def example() -> "SupportingDocuments": - """Return an empty instance with default placeholder values.""" - return SupportingDocuments( - photos_of_damage=False, - police_report_copy=False, - repair_shop_estimate=False, - other=[], - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "photos_of_damage": self.photos_of_damage, - "police_report_copy": self.police_report_copy, - "repair_shop_estimate": self.repair_shop_estimate, - "other": self.other or [], - } - - -class Signature(BaseModel): - """A class representing a signature field.""" - - signatory: Optional[str] = Field(description="Name of the signatory") - is_signed: Optional[bool] = Field( - description="Indicates if the form is signed. GPT should check whether it has signature in image files. 
if there is Sign, fill it up as True" - ) - - @staticmethod - def example() -> "Signature": - """Return an empty instance with default placeholder values.""" - return Signature(signatory="", is_signed=False) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return {"signatory": self.signatory, "is_signed": self.is_signed} - - -class Declaration(BaseModel): - """A class representing the claim declaration.""" - - statement: Optional[str] = Field(description="Declaration statement text") - signature: Optional[Signature] = Field(description="Signature") - date: Optional[str] = Field(description="Signature date, e.g. 2025-12-01") - - @staticmethod - def example() -> "Declaration": - """Return an empty instance with default placeholder values.""" - return Declaration(statement="", signature=Signature.example(), date="") - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "statement": self.statement, - "signature": self.signature.to_dict() if self.signature else None, - "date": self.date, - } - - -class SubmissionInstructions(BaseModel): - """A class representing submission instructions.""" - - submission_email: Optional[str] = Field( - description="Submission email address, e.g. 
claims@contosoinsurance.com"
-    )
-    portal_url: Optional[str] = Field(description="Claims portal URL, if present")
-    notes: Optional[str] = Field(description="Additional submission notes")
-
-    @staticmethod
-    def example() -> "SubmissionInstructions":
-        """Return an empty instance with default placeholder values."""
-        return SubmissionInstructions(submission_email="", portal_url="", notes="")
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "submission_email": self.submission_email,
-            "portal_url": self.portal_url,
-            "notes": self.notes,
-        }
-
-
-class AutoInsuranceClaimForm(BaseModel):
-    """A class representing an auto insurance claim form."""
-
-    insurance_company: Optional[str] = Field(
-        description="Insurance company name, e.g. Contoso Insurance"
-    )
-    claim_number: Optional[str] = Field(description="Claim number, e.g. CLM987654")
-    policy_number: Optional[str] = Field(description="Policy number, e.g. AUTO123456")
-
-    policyholder_information: Optional[PolicyholderInformation] = Field(
-        description="Policyholder information"
-    )
-    policy_details: Optional[PolicyDetails] = Field(description="Policy details")
-    incident_details: Optional[IncidentDetails] = Field(description="Incident details")
-    vehicle_information: Optional[VehicleInformation] = Field(
-        description="Vehicle information"
-    )
-    damage_assessment: Optional[DamageAssessment] = Field(
-        description="Damage assessment"
-    )
-    supporting_documents: Optional[SupportingDocuments] = Field(
-        description="Supporting documents"
-    )
-    declaration: Optional[Declaration] = Field(description="Declaration")
-    submission_instructions: Optional[SubmissionInstructions] = Field(
-        description="Submission instructions"
-    )
-
-    @staticmethod
-    def example() -> "AutoInsuranceClaimForm":
-        """Return an empty instance with default placeholder values."""
-        return AutoInsuranceClaimForm(
-            insurance_company="",
-            claim_number="",
-            policy_number="",
-            policyholder_information=PolicyholderInformation.example(),
-            policy_details=PolicyDetails.example(),
-            incident_details=IncidentDetails.example(),
-            vehicle_information=VehicleInformation.example(),
-            damage_assessment=DamageAssessment.example(),
-            supporting_documents=SupportingDocuments.example(),
-            declaration=Declaration.example(),
-            submission_instructions=SubmissionInstructions.example(),
-        )
-
-    @staticmethod
-    def from_json(json_str: str) -> "AutoInsuranceClaimForm":
-        """Deserialize a JSON string into an AutoInsuranceClaimForm instance."""
-        json_content = json.loads(json_str)
-
-        def create_address(address: Optional[dict]) -> Optional[AutoClaimAddress]:
-            if not address:
-                return None
-            return AutoClaimAddress(
-                street=address.get("street"),
-                city=address.get("city"),
-                state=address.get("state"),
-                postal_code=address.get("postal_code"),
-                country=address.get("country"),
-            )
-
-        def create_policyholder(
-            info: Optional[dict],
-        ) -> Optional[PolicyholderInformation]:
-            if not info:
-                return None
-            return PolicyholderInformation(
-                name=info.get("name"),
-                address=create_address(info.get("address")),
-                phone=info.get("phone"),
-                email=info.get("email"),
-            )
-
-        def create_policy_details(details: Optional[dict]) -> Optional[PolicyDetails]:
-            if not details:
-                return None
-            return PolicyDetails(
-                coverage_type=details.get("coverage_type"),
-                effective_date=details.get("effective_date"),
-                expiration_date=details.get("expiration_date"),
-                deductible=details.get("deductible"),
-                deductible_currency=details.get("deductible_currency"),
-            )
-
-        def create_incident(details: Optional[dict]) -> Optional[IncidentDetails]:
-            if not details:
-                return None
-            return IncidentDetails(
-                date_of_loss=details.get("date_of_loss"),
-                time_of_loss=details.get("time_of_loss"),
-                location=details.get("location"),
-                cause_of_loss=details.get("cause_of_loss"),
-                description=details.get("description"),
-                police_report_filed=details.get("police_report_filed"),
-                police_report_number=details.get("police_report_number"),
-            )
-
-        def create_vehicle(details: Optional[dict]) -> Optional[VehicleInformation]:
-            if not details:
-                return None
-            return VehicleInformation(
-                year=details.get("year"),
-                make=details.get("make"),
-                model=details.get("model"),
-                trim=details.get("trim"),
-                vin=details.get("vin"),
-                license_plate=details.get("license_plate"),
-                mileage=details.get("mileage"),
-            )
-
-        def create_damage_item(item: Optional[dict]) -> Optional[DamageAssessmentItem]:
-            if not item:
-                return None
-            return DamageAssessmentItem(
-                item_description=item.get("item_description"),
-                date_acquired=item.get("date_acquired"),
-                cost_new=item.get("cost_new"),
-                cost_new_currency=item.get("cost_new_currency"),
-                repair_estimate=item.get("repair_estimate"),
-                repair_estimate_currency=item.get("repair_estimate_currency"),
-            )
-
-        def create_damage(details: Optional[dict]) -> Optional[DamageAssessment]:
-            if not details:
-                return None
-            items_raw = details.get("items") or []
-            items = [create_damage_item(i) for i in items_raw]
-            items = [i for i in items if i is not None]
-            return DamageAssessment(
-                items=items,
-                total_estimated_repair=details.get("total_estimated_repair"),
-                total_estimated_repair_currency=details.get(
-                    "total_estimated_repair_currency"
-                ),
-            )
-
-        def create_supporting(details: Optional[dict]) -> Optional[SupportingDocuments]:
-            if not details:
-                return None
-            return SupportingDocuments(
-                photos_of_damage=details.get("photos_of_damage"),
-                police_report_copy=details.get("police_report_copy"),
-                repair_shop_estimate=details.get("repair_shop_estimate"),
-                other=details.get("other") or [],
-            )
-
-        def create_signature(details: Optional[dict]) -> Optional[Signature]:
-            if not details:
-                return None
-            return Signature(
-                signatory=details.get("signatory"),
-                is_signed=details.get("is_signed"),
-            )
-
-        def create_declaration(details: Optional[dict]) -> Optional[Declaration]:
-            if not details:
-                return None
-            return Declaration(
-                statement=details.get("statement"),
-                signature=create_signature(details.get("signature")),
-                date=details.get("date"),
-            )
-
-        def create_submission(
-            details: Optional[dict],
-        ) -> Optional[SubmissionInstructions]:
-            if not details:
-                return None
-            return SubmissionInstructions(
-                submission_email=details.get("submission_email"),
-                portal_url=details.get("portal_url"),
-                notes=details.get("notes"),
-            )
-
-        return AutoInsuranceClaimForm(
-            insurance_company=json_content.get("insurance_company"),
-            claim_number=json_content.get("claim_number"),
-            policy_number=json_content.get("policy_number"),
-            policyholder_information=create_policyholder(
-                json_content.get("policyholder_information")
-            ),
-            policy_details=create_policy_details(json_content.get("policy_details")),
-            incident_details=create_incident(json_content.get("incident_details")),
-            vehicle_information=create_vehicle(json_content.get("vehicle_information")),
-            damage_assessment=create_damage(json_content.get("damage_assessment")),
-            supporting_documents=create_supporting(
-                json_content.get("supporting_documents")
-            ),
-            declaration=create_declaration(json_content.get("declaration")),
-            submission_instructions=create_submission(
-                json_content.get("submission_instructions")
-            ),
-        )
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "insurance_company": self.insurance_company,
-            "claim_number": self.claim_number,
-            "policy_number": self.policy_number,
-            "policyholder_information": self.policyholder_information.to_dict()
-            if self.policyholder_information
-            else None,
-            "policy_details": self.policy_details.to_dict()
-            if self.policy_details
-            else None,
-            "incident_details": self.incident_details.to_dict()
-            if self.incident_details
-            else None,
-            "vehicle_information": self.vehicle_information.to_dict()
-            if self.vehicle_information
-            else None,
-            "damage_assessment": self.damage_assessment.to_dict()
-            if self.damage_assessment
-            else None,
-            "supporting_documents": self.supporting_documents.to_dict()
-            if self.supporting_documents
-            else None,
-            "declaration": self.declaration.to_dict() if self.declaration else None,
-            "submission_instructions": self.submission_instructions.to_dict()
-            if self.submission_instructions
-            else None,
-        }
diff --git a/src/ContentProcessorAPI/samples/schemas/damagedcarimage.py b/src/ContentProcessorAPI/samples/schemas/damagedcarimage.py
deleted file mode 100644
index 93343dae..00000000
--- a/src/ContentProcessorAPI/samples/schemas/damagedcarimage.py
+++ /dev/null
@@ -1,519 +0,0 @@
-# Copyright (c) Microsoft Corporation.
-# Licensed under the MIT License.
-"""Pydantic models for damaged vehicle image assessment data extraction.
-
-Defines the schema used by the content processing pipeline to extract
-structured damage information from vehicle photographs.
-"""
-
-from __future__ import annotations
-
-import json
-from typing import List, Optional
-
-from pydantic import BaseModel, Field
-
-
-class ImageInfo(BaseModel):
-    """Metadata about an input image.
-
-    Note: Most fields may be unknown unless provided by the caller or extracted from EXIF.
-    """
-
-    filename: Optional[str] = Field(description="Analyzed filename of the image")
-    content_type: Optional[str] = Field(description="MIME type, e.g. image/jpeg")
-    width: Optional[int] = Field(description="Analyzed image width in pixels")
-    height: Optional[int] = Field(description="Analyzed image height in pixels")
-    capture_datetime: Optional[str] = Field(
-        description="Capture datetime if available, e.g. 2025-11-28T14:15:00 original EXIF string if unprocessed"
-    )
-
-    @staticmethod
-    def example() -> "ImageInfo":
-        """Return an empty instance with default placeholder values."""
-        return ImageInfo(
-            filename="",
-            content_type="",
-            width=0,
-            height=0,
-            capture_datetime="",
-        )
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "filename": self.filename,
-            "content_type": self.content_type,
-            "width": self.width,
-            "height": self.height,
-            "capture_datetime": self.capture_datetime,
-        }
-
-
-class VehicleAppearance(BaseModel):
-    """Visible vehicle identification extracted from the image.
-
-    Guidance:
-    - Prefer fields that can be seen. If uncertain, leave null.
-    - Do not guess VIN from images.
-    """
-
-    vehicle_type: Optional[str] = Field(description="Vehicle type, e.g. sedan, SUV")
-    make: Optional[str] = Field(description="Vehicle make, e.g. Toyota")
-    model: Optional[str] = Field(description="Vehicle model, e.g. Camry")
-    trim: Optional[str] = Field(description="Vehicle trim, e.g. SE")
-    model_year: Optional[int] = Field(description="Vehicle model year, e.g. 2022")
-    color: Optional[str] = Field(description="Vehicle color, e.g. silver")
-
-    license_plate_visible: Optional[bool] = Field(
-        description="Whether the license plate is visible in the image"
-    )
-    license_plate_text: Optional[str] = Field(
-        description="License plate text if clearly readable; otherwise null"
-    )
-
-    visible_vehicle_parts: Optional[List[str]] = Field(
-        description=(
-            "List of vehicle parts/panels actually visible in this image "
-            "given the camera angle, e.g. ['hood', 'front bumper', "
-            "'front-left fender', 'front-left headlight']. "
-            "Only parts that can be seen should be listed. "
-            "Left/right MUST use the VEHICLE's own frame of reference "
-            "and MUST match the side in camera_viewpoint.view_angle."
-        )
-    )
-
-    @staticmethod
-    def example() -> "VehicleAppearance":
-        """Return an empty instance with default placeholder values."""
-        return VehicleAppearance(
-            vehicle_type="",
-            make="",
-            model="",
-            trim="",
-            model_year=0,
-            color="",
-            license_plate_visible=False,
-            license_plate_text="",
-            visible_vehicle_parts=[],
-        )
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "vehicle_type": self.vehicle_type,
-            "make": self.make,
-            "model": self.model,
-            "trim": self.trim,
-            "model_year": self.model_year,
-            "color": self.color,
-            "license_plate_visible": self.license_plate_visible,
-            "license_plate_text": self.license_plate_text,
-            "visible_vehicle_parts": self.visible_vehicle_parts or [],
-        }
-
-
-class CameraViewpoint(BaseModel):
-    """Camera perspective relative to the vehicle.
-
-    Attributes:
-        spatial_reasoning: Chain-of-thought scratchpad for determining view angle.
-        view_angle: Computed camera angle label.
-        description: Free-text summary of the camera position.
-    """
-
-    spatial_reasoning: Optional[str] = Field(
-        description=(
-            "MANDATORY chain-of-thought reasoning about camera position. "
-            "Must answer IN ORDER: "
-            "(1) Can I see the FRONT (grille/headlights) or REAR (tail lights/trunk) of the vehicle? "
-            "(2) Which side of the IMAGE does the body flank extend toward? "
-            "(3) Apply the mirror rule: viewing the FRONT — image-right = vehicle LEFT, "
-            "image-left = vehicle RIGHT. Viewing the REAR — image-right = vehicle RIGHT, "
-            "image-left = vehicle LEFT. "
-            "(4) Therefore view_angle = ? "
-            "(5) FALLBACK only if neither front nor rear is visible (pure side view): "
-            "use steering wheel position to determine driver side (LHD: left, RHD: right)."
-        )
-    )
-    view_angle: Optional[str] = Field(
-        description=(
-            "Primary camera viewing angle relative to the vehicle. "
-            "Must be one of: front, front-left, front-right, "
-            "left-side, right-side, rear-left, rear-right, rear, "
-            "top, underneath, interior, unknown. "
-            "Left/right = VEHICLE's own left/right (driver-perspective facing forward)."
-        )
-    )
-    description: Optional[str] = Field(
-        description=(
-            "Free-text description of the camera position and angle "
-            "relative to the vehicle, e.g. 'Slightly elevated front-left "
-            "view showing hood, front bumper, and left fender.'"
-        )
-    )
-
-    @staticmethod
-    def example() -> "CameraViewpoint":
-        """Return an empty instance with default placeholder values."""
-        return CameraViewpoint(spatial_reasoning="", view_angle="", description="")
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "spatial_reasoning": self.spatial_reasoning,
-            "view_angle": self.view_angle,
-            "description": self.description,
-        }
-
-
-class DamageBoundingBox(BaseModel):
-    """Bounding box in normalized image coordinates [0..1]."""
-
-    x_min: Optional[float] = Field(description="Left edge in [0..1]")
-    y_min: Optional[float] = Field(description="Top edge in [0..1]")
-    x_max: Optional[float] = Field(description="Right edge in [0..1]")
-    y_max: Optional[float] = Field(description="Bottom edge in [0..1]")
-
-    @staticmethod
-    def example() -> "DamageBoundingBox":
-        """Return an empty instance with default placeholder values."""
-        return DamageBoundingBox(x_min=0.0, y_min=0.0, x_max=0.0, y_max=0.0)
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "x_min": self.x_min,
-            "y_min": self.y_min,
-            "x_max": self.x_max,
-            "y_max": self.y_max,
-        }
-
-
-class DamageRegion(BaseModel):
-    """A detected region of damage on the vehicle."""
-
-    location_on_vehicle: Optional[str] = Field(
-        description=(
-            "Location on the vehicle using the VEHICLE's own left/right "
-            "(driver-perspective facing forward). "
-            "The side MUST match camera_viewpoint.view_angle. "
-            "Examples: 'front-left fender', 'rear-right quarter panel'."
-        )
-    )
-    damage_types: Optional[List[str]] = Field(
-        description="Damage types, e.g. ['scratch','dent','crack','paint-transfer']"
-    )
-    severity: Optional[str] = Field(
-        description="Severity label, e.g. minor, moderate, severe"
-    )
-    description: Optional[str] = Field(
-        description="Free-text description of the damage"
-    )
-
-    bounding_box: Optional[DamageBoundingBox] = Field(
-        description="Approx bounding box of the damage area (normalized coordinates)"
-    )
-
-    confidence: Optional[float] = Field(
-        description="Confidence score in [0..1] for this damage region"
-    )
-
-    @staticmethod
-    def example() -> "DamageRegion":
-        """Return an empty instance with default placeholder values."""
-        return DamageRegion(
-            location_on_vehicle="",
-            damage_types=[],
-            severity="",
-            description="",
-            bounding_box=DamageBoundingBox.example(),
-            confidence=0.0,
-        )
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "location_on_vehicle": self.location_on_vehicle,
-            "damage_types": self.damage_types or [],
-            "severity": self.severity,
-            "description": self.description,
-            "bounding_box": self.bounding_box.to_dict() if self.bounding_box else None,
-            "confidence": self.confidence,
-        }
-
-
-class OverallDamageAssessment(BaseModel):
-    """Overall assessment across the full image."""
-
-    has_visible_damage: Optional[bool] = Field(
-        description="Whether any damage is visible"
-    )
-    overall_severity: Optional[str] = Field(
-        description="Overall severity label, e.g. minor, moderate, severe"
-    )
-
-    affected_parts: Optional[List[str]] = Field(
-        description=(
-            "Affected parts/panels using the VEHICLE's own left/right. "
-            "Side labels MUST match camera_viewpoint.view_angle."
-        )
-    )
-
-    estimated_repair_complexity: Optional[str] = Field(
-        description="Rough complexity, e.g. cosmetic_only, panel_repair, replacement_likely"
-    )
-
-    notes: Optional[str] = Field(
-        description="Notes or caveats, e.g. lighting/angle limitations"
-    )
-
-    @staticmethod
-    def example() -> "OverallDamageAssessment":
-        """Return an empty instance with default placeholder values."""
-        return OverallDamageAssessment(
-            has_visible_damage=False,
-            overall_severity="",
-            affected_parts=[],
-            estimated_repair_complexity="",
-            notes="",
-        )
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "has_visible_damage": self.has_visible_damage,
-            "overall_severity": self.overall_severity,
-            "affected_parts": self.affected_parts or [],
-            "estimated_repair_complexity": self.estimated_repair_complexity,
-            "notes": self.notes,
-        }
-
-
-class VehicleAssessment(BaseModel):
-    """Per-vehicle damage assessment extracted from an image.
-
-    Groups appearance, damage regions, and overall assessment for a single
-    vehicle detected in the photograph.
-
-    Attributes:
-        vehicle_id: Human-readable identifier distinguishing this vehicle.
-        vehicle_appearance: Visible vehicle identification.
-        damage_regions: Detected damage regions for this vehicle.
-        overall_assessment: Overall damage assessment for this vehicle.
-    """
-
-    vehicle_id: Optional[str] = Field(
-        description=(
-            "A short human-readable identifier for this vehicle, "
-            "e.g. 'Vehicle 1 - silver sedan (front-left)'. "
-            "Use color, type, and position to distinguish vehicles."
-        )
-    )
-    vehicle_appearance: Optional[VehicleAppearance] = Field(
-        description="Visible vehicle identification for this vehicle"
-    )
-    damage_regions: Optional[List[DamageRegion]] = Field(
-        description="List of detected damage regions for this vehicle"
-    )
-    overall_assessment: Optional[OverallDamageAssessment] = Field(
-        description="Overall damage assessment for this vehicle"
-    )
-
-    @staticmethod
-    def example() -> "VehicleAssessment":
-        """Return an empty instance with default placeholder values."""
-        return VehicleAssessment(
-            vehicle_id="",
-            vehicle_appearance=VehicleAppearance.example(),
-            damage_regions=[DamageRegion.example()],
-            overall_assessment=OverallDamageAssessment.example(),
-        )
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "vehicle_id": self.vehicle_id,
-            "vehicle_appearance": self.vehicle_appearance.to_dict()
-            if self.vehicle_appearance
-            else None,
-            "damage_regions": [r.to_dict() for r in (self.damage_regions or [])],
-            "overall_assessment": self.overall_assessment.to_dict()
-            if self.overall_assessment
-            else None,
-        }
-
-
-class DamagedVehicleImageAssessment(BaseModel):
-    """Schema for extracting damaged vehicle information from an image.
-
-    Supports single- and multi-vehicle images. Each vehicle detected in the
-    photograph gets its own entry in the ``vehicles`` list.
-
-    Attributes:
-        image_info: Image metadata (shared across all vehicles).
-        camera_viewpoint: Camera perspective relative to the scene.
-        vehicle_count: Number of distinct vehicles detected in the image.
-        vehicles: Per-vehicle assessment list.
-    """
-
-    image_info: Optional[ImageInfo] = Field(description="Image metadata")
-    camera_viewpoint: Optional[CameraViewpoint] = Field(
-        description=(
-            "Camera perspective relative to the scene. "
-            "MUST be determined BEFORE labelling any damage "
-            "locations so that left/right orientation is anchored "
-            "to each vehicle's own frame of reference."
-        )
-    )
-    vehicle_count: Optional[int] = Field(
-        description=(
-            "Number of distinct vehicles detected in the image. "
-            "Must equal the length of the vehicles list."
-        )
-    )
-    vehicles: Optional[List[VehicleAssessment]] = Field(
-        description=(
-            "Per-vehicle damage assessments. One entry per vehicle "
-            "detected in the image. For single-vehicle images this "
-            "list contains exactly one item."
-        )
-    )
-    consistency_check: Optional[str] = Field(
-        description=(
-            "MANDATORY self-verification. State the side from view_angle, "
-            "then list every left/right label used in visible_vehicle_parts, "
-            "damage_regions, and affected_parts. Confirm they ALL match the "
-            "side in view_angle. If any mismatch was found and corrected, "
-            "describe what was fixed."
-        )
-    )
-
-    @staticmethod
-    def example() -> "DamagedVehicleImageAssessment":
-        """Return an empty instance with default placeholder values."""
-        return DamagedVehicleImageAssessment(
-            image_info=ImageInfo.example(),
-            camera_viewpoint=CameraViewpoint.example(),
-            vehicle_count=1,
-            vehicles=[VehicleAssessment.example()],
-            consistency_check="",
-        )
-
-    @staticmethod
-    def from_json(json_str: str) -> "DamagedVehicleImageAssessment":
-        """Deserialize a JSON string into a DamagedVehicleImageAssessment instance."""
-        json_content = json.loads(json_str)
-
-        def create_image_info(details: Optional[dict]) -> Optional[ImageInfo]:
-            if not details:
-                return None
-            return ImageInfo(
-                filename=details.get("filename"),
-                content_type=details.get("content_type"),
-                width=details.get("width"),
-                height=details.get("height"),
-                capture_datetime=details.get("capture_datetime"),
-            )
-
-        def create_viewpoint(
-            details: Optional[dict],
-        ) -> Optional[CameraViewpoint]:
-            if not details:
-                return None
-            return CameraViewpoint(
-                spatial_reasoning=details.get("spatial_reasoning"),
-                view_angle=details.get("view_angle"),
-                description=details.get("description"),
-            )
-
-        def create_appearance(
-            details: Optional[dict],
-        ) -> Optional[VehicleAppearance]:
-            if not details:
-                return None
-            return VehicleAppearance(
-                vehicle_type=details.get("vehicle_type"),
-                make=details.get("make"),
-                model=details.get("model"),
-                trim=details.get("trim"),
-                model_year=details.get("model_year"),
-                color=details.get("color"),
-                license_plate_visible=details.get("license_plate_visible"),
-                license_plate_text=details.get("license_plate_text"),
-                visible_vehicle_parts=details.get("visible_vehicle_parts") or [],
-            )
-
-        def create_bbox(details: Optional[dict]) -> Optional[DamageBoundingBox]:
-            if not details:
-                return None
-            return DamageBoundingBox(
-                x_min=details.get("x_min"),
-                y_min=details.get("y_min"),
-                x_max=details.get("x_max"),
-                y_max=details.get("y_max"),
-            )
-
-        def create_region(details: Optional[dict]) -> Optional[DamageRegion]:
-            if not details:
-                return None
-            return DamageRegion(
-                location_on_vehicle=details.get("location_on_vehicle"),
-                damage_types=details.get("damage_types") or [],
-                severity=details.get("severity"),
-                description=details.get("description"),
-                bounding_box=create_bbox(details.get("bounding_box")),
-                confidence=details.get("confidence"),
-            )
-
-        def create_overall(
-            details: Optional[dict],
-        ) -> Optional[OverallDamageAssessment]:
-            if not details:
-                return None
-            return OverallDamageAssessment(
-                has_visible_damage=details.get("has_visible_damage"),
-                overall_severity=details.get("overall_severity"),
-                affected_parts=details.get("affected_parts") or [],
-                estimated_repair_complexity=details.get("estimated_repair_complexity"),
-                notes=details.get("notes"),
-            )
-
-        def create_vehicle_assessment(
-            details: Optional[dict],
-        ) -> Optional[VehicleAssessment]:
-            if not details:
-                return None
-            regions_raw = details.get("damage_regions") or []
-            regions = [r for r in (create_region(r) for r in regions_raw) if r]
-            return VehicleAssessment(
-                vehicle_id=details.get("vehicle_id"),
-                vehicle_appearance=create_appearance(details.get("vehicle_appearance")),
-                damage_regions=regions,
-                overall_assessment=create_overall(details.get("overall_assessment")),
-            )
-
-        vehicles_raw = json_content.get("vehicles") or []
-        vehicles = [
-            v for v in (create_vehicle_assessment(v) for v in vehicles_raw) if v
-        ]
-
-        return DamagedVehicleImageAssessment(
-            image_info=create_image_info(json_content.get("image_info")),
-            camera_viewpoint=create_viewpoint(json_content.get("camera_viewpoint")),
-            vehicle_count=json_content.get("vehicle_count"),
-            vehicles=vehicles,
-            consistency_check=json_content.get("consistency_check"),
-        )
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "image_info": self.image_info.to_dict() if self.image_info else None,
-            "camera_viewpoint": self.camera_viewpoint.to_dict()
-            if self.camera_viewpoint
-            else None,
-            "vehicle_count": self.vehicle_count,
-            "vehicles": [v.to_dict() for v in (self.vehicles or [])],
-            "consistency_check": self.consistency_check,
-        }
diff --git a/src/ContentProcessorAPI/samples/schemas/policereport.py b/src/ContentProcessorAPI/samples/schemas/policereport.py
deleted file mode 100644
index 8d437a89..00000000
--- a/src/ContentProcessorAPI/samples/schemas/policereport.py
+++ /dev/null
@@ -1,353 +0,0 @@
-# Copyright (c) Microsoft Corporation.
-# Licensed under the MIT License.
-"""Pydantic models for police report data extraction.
-
-Defines the schema used by the content processing pipeline to extract
-structured fields from police report documents attached to insurance claims.
-"""
-
-from __future__ import annotations
-
-import json
-from typing import List, Optional
-
-from pydantic import BaseModel, Field
-
-
-class PoliceReportAddress(BaseModel):
-    """A class representing an address referenced in a police report."""
-
-    street: Optional[str] = Field(description="Street address, e.g. 123 Main St.")
-    city: Optional[str] = Field(description="City, e.g. Macon")
-    state: Optional[str] = Field(description="State, e.g. GA")
-    postal_code: Optional[str] = Field(description="Postal code, e.g. 31201")
-    country: Optional[str] = Field(description="Country, e.g. USA")
-
-    @staticmethod
-    def example() -> "PoliceReportAddress":
-        """Return an empty instance with default placeholder values."""
-        return PoliceReportAddress(
-            street="", city="", state="", postal_code="", country=""
-        )
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "street": self.street,
-            "city": self.city,
-            "state": self.state,
-            "postal_code": self.postal_code,
-            "country": self.country,
-        }
-
-
-class ReportingParty(BaseModel):
-    """A class representing the reporting party / claimant in the police report context."""
-
-    name: Optional[str] = Field(description="Full name of reporting party")
-    address: Optional[PoliceReportAddress] = Field(
-        description="Address of reporting party"
-    )
-    phone: Optional[str] = Field(description="Phone number")
-    email: Optional[str] = Field(description="Email address")
-
-    @staticmethod
-    def example() -> "ReportingParty":
-        """Return an empty instance with default placeholder values."""
-        return ReportingParty(
-            name="",
-            address=PoliceReportAddress.example(),
-            phone="",
-            email="",
-        )
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "name": self.name,
-            "address": self.address.to_dict() if self.address else None,
-            "phone": self.phone,
-            "email": self.email,
-        }
-
-
-class PoliceReportVehicle(BaseModel):
-    """A class representing a vehicle referenced in a police report."""
-
-    year: Optional[int] = Field(description="Vehicle year, e.g. 2022")
-    make: Optional[str] = Field(description="Vehicle make, e.g. Toyota")
-    model: Optional[str] = Field(description="Vehicle model, e.g. Camry")
-    trim: Optional[str] = Field(description="Vehicle trim, e.g. SE")
-    vin: Optional[str] = Field(description="Vehicle VIN")
-    license_plate: Optional[str] = Field(description="License plate")
-    mileage: Optional[int] = Field(description="Mileage")
-
-    @staticmethod
-    def example() -> "PoliceReportVehicle":
-        """Return an empty instance with default placeholder values."""
-        return PoliceReportVehicle(
-            year=0,
-            make="",
-            model="",
-            trim="",
-            vin="",
-            license_plate="",
-            mileage=0,
-        )
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "year": self.year,
-            "make": self.make,
-            "model": self.model,
-            "trim": self.trim,
-            "vin": self.vin,
-            "license_plate": self.license_plate,
-            "mileage": self.mileage,
-        }
-
-
-class PoliceReportIncident(BaseModel):
-    """A class representing incident details in a police report."""
-
-    date: Optional[str] = Field(description="Incident date, e.g. 2025-11-28")
-    time: Optional[str] = Field(description="Incident time, e.g. 14:15")
-    location: Optional[str] = Field(description="Incident location")
-    cause: Optional[str] = Field(description="Cause of incident")
-    narrative: Optional[str] = Field(
-        description="Narrative/description of what happened"
-    )
-
-    @staticmethod
-    def example() -> "PoliceReportIncident":
-        """Return an empty instance with default placeholder values."""
-        return PoliceReportIncident(
-            date="", time="", location="", cause="", narrative=""
-        )
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "date": self.date,
-            "time": self.time,
-            "location": self.location,
-            "cause": self.cause,
-            "narrative": self.narrative,
-        }
-
-
-class PoliceReportDamageItem(BaseModel):
-    """A class representing a damage line item recorded alongside a police report."""
-
-    item_description: Optional[str] = Field(description="Damaged item/area description")
-    repair_estimate: Optional[float] = Field(description="Repair estimate amount")
-    repair_estimate_currency: Optional[str] = Field(
-        description="Currency of repair_estimate, e.g. USD"
-    )
-
-    @staticmethod
-    def example() -> "PoliceReportDamageItem":
-        """Return an empty instance with default placeholder values."""
-        return PoliceReportDamageItem(
-            item_description="",
-            repair_estimate=0.0,
-            repair_estimate_currency="",
-        )
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "item_description": self.item_description,
-            "repair_estimate": self.repair_estimate,
-            "repair_estimate_currency": self.repair_estimate_currency,
-        }
-
-
-class PoliceReportDamageSummary(BaseModel):
-    """A class representing a damage summary section."""
-
-    items: Optional[List[PoliceReportDamageItem]] = Field(
-        description="List of damage items"
-    )
-    total_estimated_repair: Optional[float] = Field(
-        description="Total estimated repair amount"
-    )
-    total_estimated_repair_currency: Optional[str] = Field(
-        description="Currency of total_estimated_repair, e.g. USD"
-    )
-
-    @staticmethod
-    def example() -> "PoliceReportDamageSummary":
-        """Return an empty instance with default placeholder values."""
-        return PoliceReportDamageSummary(
-            items=[PoliceReportDamageItem.example()],
-            total_estimated_repair=0.0,
-            total_estimated_repair_currency="",
-        )
-
-    def to_dict(self) -> dict:
-        """Serialize to a plain dictionary."""
-        return {
-            "items": [item.to_dict() for item in (self.items or [])],
-            "total_estimated_repair": self.total_estimated_repair,
-            "total_estimated_repair_currency": self.total_estimated_repair_currency,
-        }
-
-
-class PoliceReportDocument(BaseModel):
-    """A class representing a police report document attached to an auto claim.
-
-    Note: The sample content includes the statement "Police Report: Filed (Report # GA-20251128-CR)".
-    This schema focuses on extracting the report identifier and the related incident context.
-    """
-
-    report_number: Optional[str] = Field(
-        description="Police report number, e.g. GA-20251128-CR"
-    )
-    is_filed: Optional[bool] = Field(description="Whether a police report was filed")
-    reporting_agency: Optional[str] = Field(description="Reporting agency / department")
-
-    insurance_company: Optional[str] = Field(description="Insurance company name")
-    claim_number: Optional[str] = Field(description="Claim number")
-    policy_number: Optional[str] = Field(description="Policy number")
-
-    reporting_party: Optional[ReportingParty] = Field(
-        description="Reporting party information"
-    )
-    incident: Optional[PoliceReportIncident] = Field(description="Incident details")
-    vehicles: Optional[List[PoliceReportVehicle]] = Field(
-        description="Vehicles involved"
-    )
-    damage_summary: Optional[PoliceReportDamageSummary] = Field(
-        description="Damage summary"
-    )
-
-    @staticmethod
-    def example() -> "PoliceReportDocument":
-        """Return an empty instance with default placeholder values."""
-        return PoliceReportDocument(
-            report_number="",
-            is_filed=False,
-            reporting_agency="",
-            insurance_company="",
-            claim_number="",
-            policy_number="",
-            reporting_party=ReportingParty.example(),
-            incident=PoliceReportIncident.example(),
-            vehicles=[PoliceReportVehicle.example()],
-            damage_summary=PoliceReportDamageSummary.example(),
-        )
-
-    @staticmethod
-    def from_json(json_str: str) -> "PoliceReportDocument":
-        """Deserialize a JSON string into a PoliceReportDocument instance."""
-        json_content = json.loads(json_str)
-
-        def create_address(address: Optional[dict]) -> Optional[PoliceReportAddress]:
-            if not address:
-                return None
-            return PoliceReportAddress(
-                street=address.get("street"),
-                city=address.get("city"),
-                state=address.get("state"),
-                postal_code=address.get("postal_code"),
-                country=address.get("country"),
-            )
-
-        def create_reporting_party(details: Optional[dict]) -> Optional[ReportingParty]:
-            if not details:
-                return None
-            return ReportingParty(
-                name=details.get("name"),
-                address=create_address(details.get("address")),
-                phone=details.get("phone"),
-                email=details.get("email"),
-            )
-
-        def create_incident(details: Optional[dict]) -> Optional[PoliceReportIncident]:
-            if not details:
-                return None
-            return PoliceReportIncident(
-                date=details.get("date"),
-                time=details.get("time"),
-                location=details.get("location"),
-                cause=details.get("cause"),
-                narrative=details.get("narrative"),
-            )
-
-        def create_vehicle(details: Optional[dict]) -> Optional[PoliceReportVehicle]:
-            if not details:
-                return None
-            return PoliceReportVehicle(
-                year=details.get("year"),
-                make=details.get("make"),
-                model=details.get("model"),
-                trim=details.get("trim"),
-                vin=details.get("vin"),
-                license_plate=details.get("license_plate"),
-                mileage=details.get("mileage"),
-            )
-
-        def create_damage_item(
-            details: Optional[dict],
-        ) -> Optional[PoliceReportDamageItem]:
-            if not details:
-                return None
-            return PoliceReportDamageItem(
-                item_description=details.get("item_description"),
-                repair_estimate=details.get("repair_estimate"),
-                repair_estimate_currency=details.get("repair_estimate_currency"),
-            )
-
-        def create_damage_summary(
-            details: Optional[dict],
-        ) -> Optional[PoliceReportDamageSummary]:
-            if not details:
-                return None
-            items_raw = details.get("items") or []
-            items = [create_damage_item(i) for i in items_raw]
-            items = [i for i in items if i is not None]
-            return PoliceReportDamageSummary(
-                items=items,
-                total_estimated_repair=details.get("total_estimated_repair"),
-                total_estimated_repair_currency=details.get(
-                    "total_estimated_repair_currency"
-                ),
-            )
-
-        vehicles_raw = json_content.get("vehicles") or []
-        vehicles = [create_vehicle(v) for v in vehicles_raw]
-        vehicles = [v for v in vehicles if v is not None]
-
-        return PoliceReportDocument(
-            report_number=json_content.get("report_number"),
-            is_filed=json_content.get("is_filed"),
-            reporting_agency=json_content.get("reporting_agency"),
-            insurance_company=json_content.get("insurance_company"),
claim_number=json_content.get("claim_number"), - policy_number=json_content.get("policy_number"), - reporting_party=create_reporting_party(json_content.get("reporting_party")), - incident=create_incident(json_content.get("incident")), - vehicles=vehicles, - damage_summary=create_damage_summary(json_content.get("damage_summary")), - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "report_number": self.report_number, - "is_filed": self.is_filed, - "reporting_agency": self.reporting_agency, - "insurance_company": self.insurance_company, - "claim_number": self.claim_number, - "policy_number": self.policy_number, - "reporting_party": self.reporting_party.to_dict() - if self.reporting_party - else None, - "incident": self.incident.to_dict() if self.incident else None, - "vehicles": [v.to_dict() for v in (self.vehicles or [])], - "damage_summary": self.damage_summary.to_dict() - if self.damage_summary - else None, - } diff --git a/src/ContentProcessorAPI/samples/schemas/register_schema.py b/src/ContentProcessorAPI/samples/schemas/register_schema.py index 1b3d570a..45cdc72c 100644 --- a/src/ContentProcessorAPI/samples/schemas/register_schema.py +++ b/src/ContentProcessorAPI/samples/schemas/register_schema.py @@ -18,7 +18,6 @@ { "schemas": [ { "File": "autoclaim.json", "ClassName": "...", "Description": "..." }, - { "File": "legacy.py", "ClassName": "...", "Description": "..." }, ... ], "schemaset": { @@ -27,8 +26,8 @@ } } -Both ``.py`` (legacy executable Python class) and ``.json`` (declarative -JSON Schema, recommended) files are accepted in the ``File`` field. +Only ``.json`` schema files are accepted; the legacy ``.py`` format was +removed as part of the schemavault RCE remediation. """ from __future__ import annotations @@ -79,20 +78,17 @@ def _register_schema( print(f" Description: {existing.get('Description')}") return schema_id - # Pick the right MIME type based on the file extension. 
Both ``.py`` - # (legacy executable Python class) and ``.json`` (declarative JSON - # Schema) are accepted by ``POST /schemavault/``. + # Only JSON Schema descriptors (.json) are accepted. The legacy + # ``.py`` (executable Pydantic class) format was removed because + # the worker would ``exec`` uploaded code, exposing an RCE primitive. extension = schema_path.suffix.lower() - if extension == ".json": - content_type = "application/json" - elif extension == ".py": - content_type = "text/x-python" - else: + if extension != ".json": print( f"Error: Unsupported schema extension '{extension}' for " - f"'{schema_path.name}'. Expected .py or .json. Skipping..." + f"'{schema_path.name}'. Only .json schemas are accepted. Skipping..." ) return None + content_type = "application/json" print(f"Registering new schema '{class_name}' ({extension})...") data_payload = json.dumps({"ClassName": class_name, "Description": description}) diff --git a/src/ContentProcessorAPI/samples/schemas/repairestimate.py b/src/ContentProcessorAPI/samples/schemas/repairestimate.py deleted file mode 100644 index 31635a4b..00000000 --- a/src/ContentProcessorAPI/samples/schemas/repairestimate.py +++ /dev/null @@ -1,333 +0,0 @@ -# Copyright (c) Microsoft Corporation. -# Licensed under the MIT License. -"""Pydantic models for auto repair estimate data extraction. - -Defines the schema used by the content processing pipeline to extract -structured fields from body shop repair estimate documents. -""" - -from __future__ import annotations - -import json -from typing import List, Optional - -from pydantic import BaseModel, Field - - -class RepairShopAddress(BaseModel): - """A class representing an auto body shop address.""" - - street: Optional[str] = Field(description="Street address, e.g. 456 Repair Lane") - city: Optional[str] = Field(description="City, e.g. Macon") - state: Optional[str] = Field(description="State, e.g. GA") - postal_code: Optional[str] = Field(description="Postal code, e.g. 
31201") - country: Optional[str] = Field(description="Country, e.g. USA") - - @staticmethod - def example() -> "RepairShopAddress": - """Return an empty instance with default placeholder values.""" - return RepairShopAddress( - street="", city="", state="", postal_code="", country="" - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "street": self.street, - "city": self.city, - "state": self.state, - "postal_code": self.postal_code, - "country": self.country, - } - - -class RepairEstimateVehicle(BaseModel): - """A class representing the customer vehicle on a repair estimate.""" - - year: Optional[int] = Field(description="Vehicle year, e.g. 2022") - make: Optional[str] = Field(description="Vehicle make, e.g. Toyota") - model: Optional[str] = Field(description="Vehicle model, e.g. Camry") - trim: Optional[str] = Field(description="Vehicle trim, e.g. SE") - vin: Optional[str] = Field(description="Vehicle VIN, e.g. 4T1G11AK2NU123456") - license_plate: Optional[str] = Field(description="License plate, e.g. GA-ABC123") - - @staticmethod - def example() -> "RepairEstimateVehicle": - """Return an empty instance with default placeholder values.""" - return RepairEstimateVehicle( - year=0, - make="", - model="", - trim="", - vin="", - license_plate="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "year": self.year, - "make": self.make, - "model": self.model, - "trim": self.trim, - "vin": self.vin, - "license_plate": self.license_plate, - } - - -class RepairEstimateLineItem(BaseModel): - """A class representing a repair estimate line item.""" - - service_description: Optional[str] = Field( - description="Service description, e.g. Dent repair (quarter panel)" - ) - labor_hours: Optional[float] = Field(description="Labor hours, e.g. 2.0") - rate_per_hour: Optional[float] = Field(description="Labor rate per hour, e.g. 
75.0") - rate_per_hour_currency: Optional[str] = Field( - description="Currency for rate_per_hour, e.g. USD" - ) - parts_cost: Optional[float] = Field(description="Parts cost, e.g. 150.0") - parts_cost_currency: Optional[str] = Field( - description="Currency for parts_cost, e.g. USD" - ) - materials_cost: Optional[float] = Field( - description="Materials/supplies cost, e.g. 50.0" - ) - materials_cost_currency: Optional[str] = Field( - description="Currency for materials_cost, e.g. USD" - ) - total: Optional[float] = Field(description="Line total amount") - total_currency: Optional[str] = Field(description="Currency for total, e.g. USD") - - @staticmethod - def example() -> "RepairEstimateLineItem": - """Return an empty instance with default placeholder values.""" - return RepairEstimateLineItem( - service_description="", - labor_hours=0.0, - rate_per_hour=0.0, - rate_per_hour_currency="", - parts_cost=0.0, - parts_cost_currency="", - materials_cost=0.0, - materials_cost_currency="", - total=0.0, - total_currency="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "service_description": self.service_description, - "labor_hours": self.labor_hours, - "rate_per_hour": self.rate_per_hour, - "rate_per_hour_currency": self.rate_per_hour_currency, - "parts_cost": self.parts_cost, - "parts_cost_currency": self.parts_cost_currency, - "materials_cost": self.materials_cost, - "materials_cost_currency": self.materials_cost_currency, - "total": self.total, - "total_currency": self.total_currency, - } - - -class Signature(BaseModel): - """A class representing an authorized signature field.""" - - signatory: Optional[str] = Field(description="Name of the signatory") - is_signed: Optional[bool] = Field( - description="Indicates if the document is signed. GPT should check whether it has signature in image files. 
if there is Sign, fill it up as True" - ) - - @staticmethod - def example() -> "Signature": - """Return an empty instance with default placeholder values.""" - return Signature(signatory="", is_signed=False) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return {"signatory": self.signatory, "is_signed": self.is_signed} - - -class RepairEstimateDocument(BaseModel): - """A class representing an auto body shop repair estimate document.""" - - estimate_number: Optional[str] = Field( - description="Estimate number, e.g. EST-20251130" - ) - date: Optional[str] = Field(description="Estimate date, e.g. 2025-11-30") - - prepared_by: Optional[str] = Field( - description="Prepared by / shop name, e.g. Macon Auto Body & Paint" - ) - shop_address: Optional[RepairShopAddress] = Field(description="Shop address") - shop_phone: Optional[str] = Field(description="Shop phone number") - - customer_name: Optional[str] = Field(description="Customer name, e.g. Chad Brooks") - vehicle: Optional[RepairEstimateVehicle] = Field(description="Vehicle information") - - damage_description: Optional[str] = Field( - description="Damage description / narrative" - ) - - repair_details: Optional[List[RepairEstimateLineItem]] = Field( - description="Repair detail line items" - ) - - subtotal: Optional[float] = Field(description="Subtotal amount") - subtotal_currency: Optional[str] = Field( - description="Currency for subtotal, e.g. USD" - ) - - tax_rate: Optional[str] = Field(description="Tax rate, e.g. 7%") - tax_amount: Optional[float] = Field(description="Tax amount, e.g. 24.50") - tax_currency: Optional[str] = Field(description="Currency for tax_amount, e.g. USD") - - total_estimate: Optional[float] = Field( - description="Total estimate amount, e.g. 374.50" - ) - total_estimate_currency: Optional[str] = Field( - description="Currency for total_estimate, e.g. 
USD" - ) - - notes: Optional[List[str]] = Field(description="Notes on the estimate") - - authorized_signature: Optional[Signature] = Field( - description="Authorized signature" - ) - authorized_signature_date: Optional[str] = Field( - description="Signature date, e.g. 2025-11-30" - ) - - @staticmethod - def example() -> "RepairEstimateDocument": - """Return an empty instance with default placeholder values.""" - return RepairEstimateDocument( - estimate_number="", - date="", - prepared_by="", - shop_address=RepairShopAddress.example(), - shop_phone="", - customer_name="", - vehicle=RepairEstimateVehicle.example(), - damage_description="", - repair_details=[RepairEstimateLineItem.example()], - subtotal=0.0, - subtotal_currency="", - tax_rate="", - tax_amount=0.0, - tax_currency="", - total_estimate=0.0, - total_estimate_currency="", - notes=[], - authorized_signature=Signature.example(), - authorized_signature_date="", - ) - - @staticmethod - def from_json(json_str: str) -> "RepairEstimateDocument": - """Deserialize a JSON string into a RepairEstimateDocument instance.""" - json_content = json.loads(json_str) - - def create_address(details: Optional[dict]) -> Optional[RepairShopAddress]: - if not details: - return None - return RepairShopAddress( - street=details.get("street"), - city=details.get("city"), - state=details.get("state"), - postal_code=details.get("postal_code"), - country=details.get("country"), - ) - - def create_vehicle(details: Optional[dict]) -> Optional[RepairEstimateVehicle]: - if not details: - return None - return RepairEstimateVehicle( - year=details.get("year"), - make=details.get("make"), - model=details.get("model"), - trim=details.get("trim"), - vin=details.get("vin"), - license_plate=details.get("license_plate"), - ) - - def create_line_item( - details: Optional[dict], - ) -> Optional[RepairEstimateLineItem]: - if not details: - return None - return RepairEstimateLineItem( - service_description=details.get("service_description"), - 
labor_hours=details.get("labor_hours"), - rate_per_hour=details.get("rate_per_hour"), - rate_per_hour_currency=details.get("rate_per_hour_currency"), - parts_cost=details.get("parts_cost"), - parts_cost_currency=details.get("parts_cost_currency"), - materials_cost=details.get("materials_cost"), - materials_cost_currency=details.get("materials_cost_currency"), - total=details.get("total"), - total_currency=details.get("total_currency"), - ) - - def create_signature(details: Optional[dict]) -> Optional[Signature]: - if not details: - return None - return Signature( - signatory=details.get("signatory"), - is_signed=details.get("is_signed"), - ) - - line_items_raw = json_content.get("repair_details") or [] - line_items = [create_line_item(item) for item in line_items_raw] - line_items = [item for item in line_items if item is not None] - - return RepairEstimateDocument( - estimate_number=json_content.get("estimate_number"), - date=json_content.get("date"), - prepared_by=json_content.get("prepared_by"), - shop_address=create_address(json_content.get("shop_address")), - shop_phone=json_content.get("shop_phone"), - customer_name=json_content.get("customer_name"), - vehicle=create_vehicle(json_content.get("vehicle")), - damage_description=json_content.get("damage_description"), - repair_details=line_items, - subtotal=json_content.get("subtotal"), - subtotal_currency=json_content.get("subtotal_currency"), - tax_rate=json_content.get("tax_rate"), - tax_amount=json_content.get("tax_amount"), - tax_currency=json_content.get("tax_currency"), - total_estimate=json_content.get("total_estimate"), - total_estimate_currency=json_content.get("total_estimate_currency"), - notes=json_content.get("notes") or [], - authorized_signature=create_signature( - json_content.get("authorized_signature") - ), - authorized_signature_date=json_content.get("authorized_signature_date"), - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "estimate_number": 
self.estimate_number, - "date": self.date, - "prepared_by": self.prepared_by, - "shop_address": self.shop_address.to_dict() if self.shop_address else None, - "shop_phone": self.shop_phone, - "customer_name": self.customer_name, - "vehicle": self.vehicle.to_dict() if self.vehicle else None, - "damage_description": self.damage_description, - "repair_details": [item.to_dict() for item in (self.repair_details or [])], - "subtotal": self.subtotal, - "subtotal_currency": self.subtotal_currency, - "tax_rate": self.tax_rate, - "tax_amount": self.tax_amount, - "tax_currency": self.tax_currency, - "total_estimate": self.total_estimate, - "total_estimate_currency": self.total_estimate_currency, - "notes": self.notes or [], - "authorized_signature": self.authorized_signature.to_dict() - if self.authorized_signature - else None, - "authorized_signature_date": self.authorized_signature_date, - } From 9b9f0c242d374e6c32ebc3d43234ac2b2901a15d Mon Sep 17 00:00:00 2001 From: Prajwal-Microsoft Date: Wed, 6 May 2026 15:47:15 +0530 Subject: [PATCH 06/13] fix: Copilot comments --- src/ContentProcessor/pyproject.toml | 1 - src/ContentProcessor/requirements.txt | 1 - .../src/libs/pipeline/entities/schema.py | 4 +--- .../src/libs/utils/remote_schema_loader.py | 2 +- .../app/routers/logics/schema_validator.py | 24 +++++++++++++++++++ .../app/routers/logics/schemavault.py | 2 +- .../app/routers/schemavault.py | 23 +++++++++--------- 7 files changed, 38 insertions(+), 19 deletions(-) diff --git a/src/ContentProcessor/pyproject.toml b/src/ContentProcessor/pyproject.toml index e3375fcf..310524ce 100644 --- a/src/ContentProcessor/pyproject.toml +++ b/src/ContentProcessor/pyproject.toml @@ -26,7 +26,6 @@ dependencies = [ "protobuf==6.33.6", "pyjwt==2.12.1", "pyasn1==0.6.3", - "jsonschema==4.25.1", ] diff --git a/src/ContentProcessor/requirements.txt b/src/ContentProcessor/requirements.txt index e420ff50..0a8b324b 100644 --- a/src/ContentProcessor/requirements.txt +++ 
b/src/ContentProcessor/requirements.txt @@ -18,7 +18,6 @@ dnspython==2.8.0 idna==3.11 iniconfig==2.3.0 isodate==0.7.2 -jsonschema==4.25.1 mongomock==4.3.0 msal==1.35.1 msal-extensions==1.3.1 diff --git a/src/ContentProcessor/src/libs/pipeline/entities/schema.py b/src/ContentProcessor/src/libs/pipeline/entities/schema.py index f1ec8c18..409e9ac9 100644 --- a/src/ContentProcessor/src/libs/pipeline/entities/schema.py +++ b/src/ContentProcessor/src/libs/pipeline/entities/schema.py @@ -27,9 +27,7 @@ class Schema(BaseModel): ContentType: Target content type this schema handles. Format: Storage format of the schema artifact. Always ``"json"`` — declarative JSON Schema descriptors are the - only supported format. The legacy ``"python"`` value is - tolerated when reading historical Cosmos records but the - worker will refuse to process them. + only supported format. """ Id: str diff --git a/src/ContentProcessor/src/libs/utils/remote_schema_loader.py b/src/ContentProcessor/src/libs/utils/remote_schema_loader.py index 53079df1..4010bb10 100644 --- a/src/ContentProcessor/src/libs/utils/remote_schema_loader.py +++ b/src/ContentProcessor/src/libs/utils/remote_schema_loader.py @@ -19,7 +19,7 @@ import json import logging -from typing import Any, ForwardRef, List, Literal, Optional, Tuple, Type, Union +from typing import Any, ForwardRef, List, Literal, Tuple, Type, Union from azure.storage.blob import BlobServiceClient from pydantic import BaseModel, ConfigDict, Field, create_model diff --git a/src/ContentProcessorAPI/app/routers/logics/schema_validator.py b/src/ContentProcessorAPI/app/routers/logics/schema_validator.py index 280fee9a..98791783 100644 --- a/src/ContentProcessorAPI/app/routers/logics/schema_validator.py +++ b/src/ContentProcessorAPI/app/routers/logics/schema_validator.py @@ -109,6 +109,15 @@ def validate_json_schema(raw_bytes: bytes) -> dict[str, Any]: f"Allowed: {sorted(ALLOWED_CPS_KEYWORDS)}." ) + # Reject external $ref values. 
The runtime loader only supports local + # references of the form ``#/$defs/...`` or ``#/definitions/...``. + for path, ref in _walk_refs(document): + if not ref.startswith("#/"): + errors.append( + f"External $ref '{ref}' at {path or ''} is not supported. " + "Only local '#/$defs/...' and '#/definitions/...' references are allowed." + ) + if errors: raise SchemaValidationError(errors) @@ -155,3 +164,18 @@ def _walk_extension_keywords( elif isinstance(node, list): for idx, item in enumerate(node): yield from _walk_extension_keywords(item, f"{path}[{idx}]") + + +def _walk_refs( + node: Any, path: str = "" +) -> Iterable[tuple[str, str]]: + """Yield every ``(path, ref_value)`` for ``$ref`` keys anywhere in *node*.""" + if isinstance(node, dict): + if "$ref" in node and isinstance(node["$ref"], str): + yield path, node["$ref"] + for key, value in node.items(): + child_path = f"{path}.{key}" if path else str(key) + yield from _walk_refs(value, child_path) + elif isinstance(node, list): + for idx, item in enumerate(node): + yield from _walk_refs(item, f"{path}[{idx}]") diff --git a/src/ContentProcessorAPI/app/routers/logics/schemavault.py b/src/ContentProcessorAPI/app/routers/logics/schemavault.py index 548ea6f5..e0227cc1 100644 --- a/src/ContentProcessorAPI/app/routers/logics/schemavault.py +++ b/src/ContentProcessorAPI/app/routers/logics/schemavault.py @@ -85,7 +85,7 @@ def Update( ) schema_object.ClassName = class_name - schema_object.ContentType = file.content_type + schema_object.ContentType = "application/json" schema_object.Format = storage_format schema_object.Updated_On = result["date"] diff --git a/src/ContentProcessorAPI/app/routers/schemavault.py b/src/ContentProcessorAPI/app/routers/schemavault.py index 62058544..9b2a3b70 100644 --- a/src/ContentProcessorAPI/app/routers/schemavault.py +++ b/src/ContentProcessorAPI/app/routers/schemavault.py @@ -107,7 +107,7 @@ async def Get_All_Registered_Schema( response_model=Schema, summary="Register a schema", 
description=""" - Registers a new schema file (`.py` or `.json`) and stores its metadata + Registers a new schema file (`.json`) and stores its metadata in the Schema Vault. The request must be sent as `multipart/form-data` with: @@ -115,8 +115,7 @@ async def Get_All_Registered_Schema( - a file part (named `file`) Constraints: - - Accepted extensions: `.py` (legacy executable Python class) and - `.json` (declarative JSON Schema; recommended). + - Only `.json` (declarative JSON Schema) files are accepted. - Max size: 1 MB. For `.json` uploads: @@ -126,15 +125,15 @@ async def Get_All_Registered_Schema( document declares a `title`; otherwise the filename stem is used. ## Parameters - - **ClassName** (body): Schema class name. Used for `.py` uploads and - as a fallback for `.json` uploads without a `title`. + - **ClassName** (body): Schema class name. Used as a fallback for + `.json` uploads without a `title`. - **Description** (body): Human-readable description. - - **file** (form): `.py` or `.json` schema file (max 1 MB). + - **file** (form): `.json` schema file (max 1 MB). ## Example Request Body multipart/form-data - `data`: `{ "ClassName": "InvoiceSchema", "Description": "Extract invoice fields" }` - - `file`: `` or `` + - `file`: `` """, ) async def Register_Schema( @@ -161,7 +160,7 @@ async def Register_Schema( fallback = os.path.splitext(safe_filename)[0] class_name = derive_class_name(document, fallback=data.ClassName or fallback) - content_type = file.content_type or "application/json" + content_type = "application/json" return schemas.Add( file, @@ -181,7 +180,7 @@ async def Register_Schema( response_model=Schema, summary="Update a schema", description=""" - Updates an existing registered schema (`.py` or `.json` file) and + Updates an existing registered schema (`.json` file) and associated metadata. 
The request must be sent as `multipart/form-data` with: @@ -189,19 +188,19 @@ async def Register_Schema( - a file part (named `file`) Constraints: - - Accepted extensions: `.py` and `.json`. + - Only `.json` files are accepted. - Max size: 1 MB. ## Parameters - **SchemaId** (body): Schema ID to update. - **ClassName** (body): Updated class name (fallback for `.json` schemas without a `title`). - - **file** (form): New `.py` or `.json` schema file (max 1 MB). + - **file** (form): New `.json` schema file (max 1 MB). ## Example Request Body multipart/form-data - `data`: `{ "SchemaId": "", "ClassName": "InvoiceSchema" }` - - `file`: `` or `` + - `file`: `` """, ) async def Update_Schema( From e8b2faed0fa54656b4c41d01a0f22aec86eda3df Mon Sep 17 00:00:00 2001 From: Prajwal-Microsoft Date: Wed, 6 May 2026 18:09:19 +0530 Subject: [PATCH 07/13] fix: Fixed copilot comments --- infra/scripts/post_deployment.sh | 2 +- .../src/libs/pipeline/entities/schema.py | 2 +- .../app/routers/logics/schema_validator.py | 8 ++++---- src/ContentProcessorAPI/app/routers/schemavault.py | 12 ++++++------ 4 files changed, 12 insertions(+), 12 deletions(-) diff --git a/infra/scripts/post_deployment.sh b/infra/scripts/post_deployment.sh index f02f8be9..49644a4d 100644 --- a/infra/scripts/post_deployment.sh +++ b/infra/scripts/post_deployment.sh @@ -138,7 +138,7 @@ else # Only JSON Schema descriptors are accepted. The legacy .py format # was removed as part of the schemavault RCE remediation. - EXT="${FILE_NAME##*.}" + EXT=$(echo "${FILE_NAME##*.}" | tr '[:upper:]' '[:lower:]') if [ "$EXT" != "json" ]; then echo " Unsupported schema extension '.$EXT' for '$FILE_NAME'. Only .json is accepted. Skipping..." 
continue diff --git a/src/ContentProcessor/src/libs/pipeline/entities/schema.py b/src/ContentProcessor/src/libs/pipeline/entities/schema.py index 409e9ac9..e9138897 100644 --- a/src/ContentProcessor/src/libs/pipeline/entities/schema.py +++ b/src/ContentProcessor/src/libs/pipeline/entities/schema.py @@ -35,7 +35,7 @@ class Schema(BaseModel): Description: str FileName: str ContentType: str - Format: Literal["python", "json"] = Field(default="json") + Format: Literal["json"] = Field(default="json") Created_On: Optional[datetime.datetime] = Field(default=None) Updated_On: Optional[datetime.datetime] = Field(default=None) diff --git a/src/ContentProcessorAPI/app/routers/logics/schema_validator.py b/src/ContentProcessorAPI/app/routers/logics/schema_validator.py index 98791783..320198ff 100644 --- a/src/ContentProcessorAPI/app/routers/logics/schema_validator.py +++ b/src/ContentProcessorAPI/app/routers/logics/schema_validator.py @@ -109,13 +109,13 @@ def validate_json_schema(raw_bytes: bytes) -> dict[str, Any]: f"Allowed: {sorted(ALLOWED_CPS_KEYWORDS)}." ) - # Reject external $ref values. The runtime loader only supports local + # Reject unsupported $ref values. The runtime loader only supports local # references of the form ``#/$defs/...`` or ``#/definitions/...``. for path, ref in _walk_refs(document): - if not ref.startswith("#/"): + if not (ref.startswith("#/$defs/") or ref.startswith("#/definitions/")): errors.append( - f"External $ref '{ref}' at {path or ''} is not supported. " - "Only local '#/$defs/...' and '#/definitions/...' references are allowed." + f"Unsupported $ref '{ref}' at {path or ''}. " + "Only '#/$defs/...' and '#/definitions/...' references are supported." 
) if errors: diff --git a/src/ContentProcessorAPI/app/routers/schemavault.py b/src/ContentProcessorAPI/app/routers/schemavault.py index 9b2a3b70..5741faec 100644 --- a/src/ContentProcessorAPI/app/routers/schemavault.py +++ b/src/ContentProcessorAPI/app/routers/schemavault.py @@ -146,10 +146,10 @@ async def Register_Schema( schemas: Schemas = app.app_context.get_service(Schemas) - safe_filename, extension = _validate_upload(file) + safe_filename, _ = _validate_upload(file) - raw = file.file.read() - file.file.seek(0) + raw = await file.read() + await file.seek(0) try: document = validate_json_schema(raw) except SchemaValidationError as exc: @@ -211,10 +211,10 @@ async def Update_Schema( """Update an existing schema with a new file.""" app: TypedFastAPI = request.app # type: ignore - safe_filename, extension = _validate_upload(file) + safe_filename, _ = _validate_upload(file) - raw = file.file.read() - file.file.seek(0) + raw = await file.read() + await file.seek(0) try: document = validate_json_schema(raw) except SchemaValidationError as exc: From 87a7108071a166c2abe136271160720273fffb5b Mon Sep 17 00:00:00 2001 From: Prajwal-Microsoft Date: Wed, 6 May 2026 18:29:39 +0530 Subject: [PATCH 08/13] fix: Updated unit tests --- docs/CustomizeSchemaData.md | 18 +++++------ .../tests/unit/pipeline/test_schema.py | 8 ++--- .../unit/utils/test_remote_schema_loader.py | 2 +- .../tests/logics/test_schemasetvault_logic.py | 8 ++--- .../tests/logics/test_schemavault_logic.py | 30 +++++++++---------- .../app/tests/models/test_schmavault_model.py | 10 +++---- .../app/tests/routers/test_schemavault.py | 2 +- .../ContentProcessor/pipeline/test_schema.py | 8 ++--- 8 files changed, 43 insertions(+), 43 deletions(-) diff --git a/docs/CustomizeSchemaData.md b/docs/CustomizeSchemaData.md index c4ae4d59..1872f5c0 100644 --- a/docs/CustomizeSchemaData.md +++ b/docs/CustomizeSchemaData.md @@ -37,8 +37,8 @@ flowchart TB subgraph Runtime["Runtime — Pipeline Map Step"] R1["1. 
Look up Schema metadata
from Cosmos DB"] - R2["2. Download .py class file
from Blob Storage"] - R3["3. Dynamically load Pydantic class
→ generate JSON Schema"] + R2["2. Download JSON Schema
from Blob Storage"] + R3["3. Materialise Pydantic model
from JSON Schema (no code execution)"] R4["4. Embed JSON Schema in
GPT-5.1 prompt"] R5["5. Validate response with
Pydantic → confidence scoring"] R1 --> R2 --> R3 --> R4 --> R5 @@ -60,18 +60,18 @@ flowchart TB flowchart LR Claim["🗂️ Claim"] -->|"assigned to"| SchemaSet["📂 SchemaSet"] SchemaSet -->|"contains"| Schema["🗎 Schema"] - Schema -->|"stores .py file"| Blob["💾 Blob Storage"] + Schema -->|"stores .json file"| Blob["💾 Blob Storage"] ``` -- **Schema** — one per document type. Metadata in Cosmos DB, `.py` class file in Blob Storage. +- **Schema** — one per document type. Metadata in Cosmos DB, `.json` schema file in Blob Storage. - **SchemaSet** — a named group that holds references to one or more Schemas. Assigned to a Claim at creation time. - A Schema can belong to multiple SchemaSets or none at all. --- -## Step 1: Create Schema Class (.py) +## Step 1: Create a JSON Schema Document -A new class needs to be created that defines the schema as a strongly typed Python class inheriting from Pydantic `BaseModel`. +A new JSON Schema document needs to be created that defines the schema as a declarative description of your document type. > **Schema Folder:** [/src/ContentProcessorAPI/samples/schemas/](/src/ContentProcessorAPI/samples/schemas/) — All schema classes should be placed into this folder @@ -86,11 +86,11 @@ A new class needs to be created that defines the schema as a strongly typed Pyth > **Note:** All 4 schemas are automatically registered during deployment (via `azd up` or the `register_schema.py` script) and grouped into the **"Auto Claim"** schema set. -Duplicate one of these files and update with a class definition that represents your document type. +Duplicate one of these files and update with fields that represent your document type. > **Tip:** You can use GitHub Copilot to generate a schema. Example prompt: > -> *Generate a Schema Class based on the following autoclaim.py schema definition, which has been built and derived from Pydantic BaseModel class. The generated Schema Class should be called "Freight Shipment Bill of Lading" schema file. 
Please define the entities based on standard bill of lading documents in the logistics industry.*
+> *Generate a JSON Schema (Draft 2020-12) based on the following autoclaim.json schema definition. The generated schema should be called "Freight Shipment Bill of Lading". Please define the properties based on standard bill of lading documents in the logistics industry.*
 
 ### Schema Document Structure
 
@@ -205,7 +205,7 @@ The script checks for existing schemas and schema sets to avoid duplicates, and
 | `POST` | `/schemavault/` | Register a new schema (multipart upload) |
 | `PUT` | `/schemavault/` | Update an existing schema |
 | `DELETE` | `/schemavault/` | Delete a schema by ID |
-| `GET` | `/schemavault/schemas/{schema_id}` | Get a schema by ID (includes `.py` file) |
+| `GET` | `/schemavault/schemas/{schema_id}` | Get a schema by ID (includes `.json` file) |
 
 ---
 
diff --git a/src/ContentProcessor/tests/unit/pipeline/test_schema.py b/src/ContentProcessor/tests/unit/pipeline/test_schema.py
index e5c18ef1..bbdb46b6 100644
--- a/src/ContentProcessor/tests/unit/pipeline/test_schema.py
+++ b/src/ContentProcessor/tests/unit/pipeline/test_schema.py
@@ -22,8 +22,8 @@ def test_construction(self):
             Id="s-1",
             ClassName="InvoiceSchema",
             Description="Invoice extraction",
-            FileName="invoice_schema.py",
-            ContentType="application/pdf",
+            FileName="invoice_schema.json",
+            ContentType="application/json",
         )
         assert schema.Id == "s-1"
         assert schema.ClassName == "InvoiceSchema"
@@ -46,8 +46,8 @@ def test_get_schema_returns_schema(self, mock_helper_cls):
                 "Id": "s-1",
                 "ClassName": "MySchema",
                 "Description": "desc",
-                "FileName": "file.py",
-                "ContentType": "text/plain",
+                "FileName": "file.json",
+                "ContentType": "application/json",
             }
         ]
         result = Schema.get_schema("connstr", "db", "coll", "s-1")
diff --git a/src/ContentProcessor/tests/unit/utils/test_remote_schema_loader.py b/src/ContentProcessor/tests/unit/utils/test_remote_schema_loader.py
index 81ba3535..7102fba1 100644
--- a/src/ContentProcessor/tests/unit/utils/test_remote_schema_loader.py
+++ b/src/ContentProcessor/tests/unit/utils/test_remote_schema_loader.py
@@ -21,7 +21,7 @@
     build_model_from_schema,
 )
 
-#: Repo-relative path to the golden JSON schema generated from autoclaim.py.
+#: Repo-relative path to the golden JSON schema.
 _GOLDEN_AUTOCLAIM = (
     Path(__file__).resolve().parents[4]
     / "ContentProcessorAPI"
diff --git a/src/ContentProcessorAPI/app/tests/logics/test_schemasetvault_logic.py b/src/ContentProcessorAPI/app/tests/logics/test_schemasetvault_logic.py
index 70ca3178..ce1f1e52 100644
--- a/src/ContentProcessorAPI/app/tests/logics/test_schemasetvault_logic.py
+++ b/src/ContentProcessorAPI/app/tests/logics/test_schemasetvault_logic.py
@@ -120,8 +120,8 @@ def test_add_schema_to_set(MockBlob, MockMongo, mock_app_context):
             "Id": "s1",
             "ClassName": "Invoice",
             "Description": "desc",
-            "FileName": "invoice.py",
-            "ContentType": "text/x-python",
+            "FileName": "invoice.json",
+            "ContentType": "application/json",
         }
     ]
@@ -229,8 +229,8 @@ def test_get_all_schemas_in_set(MockBlob, MockMongo, mock_app_context):
             "Id": "s1",
             "ClassName": "Invoice",
             "Description": "d1",
-            "FileName": "invoice.py",
-            "ContentType": "text/x-python",
+            "FileName": "invoice.json",
+            "ContentType": "application/json",
         }
     ]
diff --git a/src/ContentProcessorAPI/app/tests/logics/test_schemavault_logic.py b/src/ContentProcessorAPI/app/tests/logics/test_schemavault_logic.py
index 1467c902..500c37cf 100644
--- a/src/ContentProcessorAPI/app/tests/logics/test_schemavault_logic.py
+++ b/src/ContentProcessorAPI/app/tests/logics/test_schemavault_logic.py
@@ -32,8 +32,8 @@ def test_get_all(MockBlob, MockMongo, mock_app_context):
             "Id": "s1",
             "ClassName": "Invoice",
             "Description": "desc",
-            "FileName": "invoice.py",
-            "ContentType": "text/x-python",
+            "FileName": "invoice.json",
+            "ContentType": "application/json",
         }
     ]
@@ -55,20 +55,20 @@ def test_get_file(MockBlob, MockMongo, mock_app_context):
             "Id": "s1",
             "ClassName": "Invoice",
             "Description": "desc",
-            "FileName": "invoice.py",
-            "ContentType": "text/x-python",
+            "FileName": "invoice.json",
+            "ContentType": "application/json",
         }
     ]
     mock_blob = MockBlob.return_value
-    mock_blob.download_blob.return_value = b"class Invoice: pass"
+    mock_blob.download_blob.return_value = b'{"type": "object"}'
 
     from app.routers.logics.schemavault import Schemas
 
     schemas = Schemas(app_context=mock_app_context)
     result = schemas.GetFile("s1")
-    assert result["File"] == b"class Invoice: pass"
-    assert result["FileName"] == "invoice.py"
-    assert result["ContentType"] == "text/x-python"
+    assert result["File"] == b'{"type": "object"}'
+    assert result["FileName"] == "invoice.json"
+    assert result["ContentType"] == "application/json"
 
 
 @patch("app.routers.logics.schemavault.CosmosMongDBHelper")
@@ -99,8 +99,8 @@ def test_add(MockBlob, MockMongo, mock_app_context):
         Id="s1",
         ClassName="Invoice",
         Description="desc",
-        FileName="invoice.py",
-        ContentType="text/x-python",
+        FileName="invoice.json",
+        ContentType="application/json",
     )
     result = schemas.Add(file, schema)
     assert result.Created_On == "2025-01-01T00:00:00Z"
@@ -116,8 +116,8 @@ def test_update(MockBlob, MockMongo, mock_app_context):
             "Id": "s1",
             "ClassName": "Old",
             "Description": "desc",
-            "FileName": "old.py",
-            "ContentType": "text/x-python",
+            "FileName": "old.json",
+            "ContentType": "application/json",
         }
     ]
     mock_blob = MockBlob.return_value
@@ -127,7 +127,7 @@ def test_update(MockBlob, MockMongo, mock_app_context):
     schemas = Schemas(app_context=mock_app_context)
     file = MagicMock()
-    file.content_type = "text/x-python"
+    file.content_type = "application/json"
     result = schemas.Update(file, "s1", "NewClass")
     assert result.ClassName == "NewClass"
     mock_mongo.update_document.assert_called_once()
@@ -155,8 +155,8 @@ def test_delete(MockBlob, MockMongo, mock_app_context):
             "Id": "s1",
             "ClassName": "Invoice",
             "Description": "desc",
-            "FileName": "invoice.py",
-            "ContentType": "text/x-python",
+            "FileName": "invoice.json",
+            "ContentType": "application/json",
         }
     ]
diff --git a/src/ContentProcessorAPI/app/tests/models/test_schmavault_model.py b/src/ContentProcessorAPI/app/tests/models/test_schmavault_model.py
index 09b0dbe4..3b3e6e41 100644
--- a/src/ContentProcessorAPI/app/tests/models/test_schmavault_model.py
+++ b/src/ContentProcessorAPI/app/tests/models/test_schmavault_model.py
@@ -22,8 +22,8 @@ def test_parse_dates_from_iso_string(self):
             Id="s1",
             ClassName="Invoice",
             Description="desc",
-            FileName="invoice.py",
-            ContentType="text/x-python",
+            FileName="invoice.json",
+            ContentType="application/json",
             Created_On="2025-01-01T00:00:00Z",
             Updated_On="2025-06-15T12:30:00Z",
         )
@@ -36,8 +36,8 @@ def test_parse_dates_none(self):
             Id="s1",
             ClassName="Invoice",
             Description="desc",
-            FileName="invoice.py",
-            ContentType="text/x-python",
+            FileName="invoice.json",
+            ContentType="application/json",
         )
         assert schema.Created_On is None
         assert schema.Updated_On is None
@@ -76,7 +76,7 @@ def test_to_dict(self):
             Status="Success",
             SchemaId="s1",
             ClassName="Invoice",
-            FileName="invoice.py",
+            FileName="invoice.json",
         )
         d = resp.to_dict()
         assert d["Status"] == "Success"
diff --git a/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py b/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py
index 96fc7931..fb21a61f 100644
--- a/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py
+++ b/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py
@@ -190,7 +190,7 @@ def test_update_schema_rejects_unsupported_extension(client_and_schemas):
 def test_unregister_schema_success(client_and_schemas):
     client, mock_schemas = client_and_schemas
     mock_schemas.Delete.return_value = MagicMock(
-        Id="test-id", ClassName="TestClass", FileName="test.py"
+        Id="test-id", ClassName="TestClass", FileName="test.json"
     )
 
     response = client.request(
diff --git a/src/tests/ContentProcessor/pipeline/test_schema.py b/src/tests/ContentProcessor/pipeline/test_schema.py
index e5c18ef1..bbdb46b6 100644
--- a/src/tests/ContentProcessor/pipeline/test_schema.py
+++ b/src/tests/ContentProcessor/pipeline/test_schema.py
@@ -22,8 +22,8 @@ def test_construction(self):
             Id="s-1",
             ClassName="InvoiceSchema",
             Description="Invoice extraction",
-            FileName="invoice_schema.py",
-            ContentType="application/pdf",
+            FileName="invoice_schema.json",
+            ContentType="application/json",
         )
         assert schema.Id == "s-1"
         assert schema.ClassName == "InvoiceSchema"
@@ -46,8 +46,8 @@ def test_get_schema_returns_schema(self, mock_helper_cls):
                 "Id": "s-1",
                 "ClassName": "MySchema",
                 "Description": "desc",
-                "FileName": "file.py",
-                "ContentType": "text/plain",
+                "FileName": "file.json",
+                "ContentType": "application/json",
             }
         ]
         result = Schema.get_schema("connstr", "db", "coll", "s-1")

From 7244f95c976f19a20befde220300ebd8a4caf8af Mon Sep 17 00:00:00 2001
From: Prajwal-Microsoft
Date: Wed, 6 May 2026 18:38:05 +0530
Subject: [PATCH 09/13] refactor: Removed unused headers

---
 docs/CustomizeSchemaData.md                   |  6 ------
 .../app/routers/logics/schema_validator.py    |  8 +-------
 .../app/tests/logics/test_schema_validator.py | 12 ++----------
 3 files changed, 3 insertions(+), 23 deletions(-)

diff --git a/docs/CustomizeSchemaData.md b/docs/CustomizeSchemaData.md
index 1872f5c0..fc936ba4 100644
--- a/docs/CustomizeSchemaData.md
+++ b/docs/CustomizeSchemaData.md
@@ -325,12 +325,6 @@ When uploading JSON:
 - The schema's `title` (if present) becomes the `ClassName` recorded in
   Cosmos. If the JSON has no `title`, the request body's `ClassName` is
   used as a fallback.
-- Two project-specific extension keywords are accepted:
-  - `x-cps-extract-prompt` — optional override for the LLM extraction
-    prompt for that field.
-  - `x-cps-required-on-save` — marks a field that must be present in
-    the LLM output before persistence.
-  Any other `x-…` keyword is rejected.
 - The schema must be ≤ 1 MB.
 
 ### Constraints relative to the legacy Python schemas
 
diff --git a/src/ContentProcessorAPI/app/routers/logics/schema_validator.py b/src/ContentProcessorAPI/app/routers/logics/schema_validator.py
index 320198ff..b3c5e441 100644
--- a/src/ContentProcessorAPI/app/routers/logics/schema_validator.py
+++ b/src/ContentProcessorAPI/app/routers/logics/schema_validator.py
@@ -24,13 +24,7 @@
 #: artefacts; a generous cap of 1 MB matches the legacy ``.py`` limit.
 MAX_SCHEMA_BYTES: int = 1 * 1024 * 1024
 
-#: Allowlisted project-specific custom keywords. Any other ``x-cps-*`` or
-#: ``x-`` keyword in the uploaded schema is rejected so unknown extension
-#: points cannot be smuggled in.
-ALLOWED_CPS_KEYWORDS: frozenset[str] = frozenset({
-    "x-cps-extract-prompt",
-    "x-cps-required-on-save",
-})
+ALLOWED_CPS_KEYWORDS: frozenset[str] = frozenset()
 
 
 class SchemaValidationError(ValueError):
diff --git a/src/ContentProcessorAPI/app/tests/logics/test_schema_validator.py b/src/ContentProcessorAPI/app/tests/logics/test_schema_validator.py
index f8c02bfa..aceb3a6d 100644
--- a/src/ContentProcessorAPI/app/tests/logics/test_schema_validator.py
+++ b/src/ContentProcessorAPI/app/tests/logics/test_schema_validator.py
@@ -54,13 +54,6 @@ def test_validate_accepts_autoclaim_golden():
     assert document["type"] == "object"
 
 
-def test_validate_accepts_allowed_cps_keywords():
-    schema = _minimal_object_schema()
-    schema["properties"]["name"]["x-cps-extract-prompt"] = "Extract the full name."
-    schema["properties"]["name"]["x-cps-required-on-save"] = True
-    validate_json_schema(_bytes(schema))
-
-
 # ---------------------------------------------------------------------------
 # Failure modes
 # ---------------------------------------------------------------------------
@@ -146,6 +139,5 @@ def test_derive_class_name_sanitises_leading_digits():
     assert derive_class_name({}, fallback="9invoice") == "Schema_9invoice"
 
 
-def test_allowed_keywords_constant_contains_expected_extensions():
-    assert "x-cps-extract-prompt" in ALLOWED_CPS_KEYWORDS
-    assert "x-cps-required-on-save" in ALLOWED_CPS_KEYWORDS
+def test_allowed_keywords_constant_is_empty():
+    assert len(ALLOWED_CPS_KEYWORDS) == 0

From 22cd514f98756f9e1281fc7ea3a2b60399ea566e Mon Sep 17 00:00:00 2001
From: Prajwal-Microsoft
Date: Wed, 6 May 2026 18:44:22 +0530
Subject: [PATCH 10/13] updated uv.lock

---
 src/ContentProcessor/uv.lock | 2 --
 1 file changed, 2 deletions(-)

diff --git a/src/ContentProcessor/uv.lock b/src/ContentProcessor/uv.lock
index b022cf4c..77b91fd3 100644
--- a/src/ContentProcessor/uv.lock
+++ b/src/ContentProcessor/uv.lock
@@ -979,7 +979,6 @@ dependencies = [
     { name = "azure-storage-queue" },
     { name = "certifi" },
     { name = "charset-normalizer" },
-    { name = "jsonschema" },
     { name = "opentelemetry-api" },
     { name = "pandas" },
     { name = "pdf2image" },
@@ -1017,7 +1016,6 @@ requires-dist = [
     { name = "azure-storage-queue", specifier = "==12.16.0b1" },
     { name = "certifi", specifier = "==2026.2.25" },
     { name = "charset-normalizer", specifier = "==3.4.6" },
-    { name = "jsonschema", specifier = "==4.25.1" },
     { name = "opentelemetry-api", specifier = "==1.40.0" },
     { name = "pandas", specifier = "==3.0.2" },
     { name = "pdf2image", specifier = "==1.17.0" },

From b341a9308604028013a4be565541afeec46fc355 Mon Sep 17 00:00:00 2001
From: Prajwal-Microsoft
Date: Wed, 6 May 2026 18:59:52 +0530
Subject: [PATCH 11/13] refactor: removed unused files

---
 docs/CustomizeSchemaData.md  | 20 ----------
 scripts/py_schema_to_json.py | 76 ------------------------------------
 2 files changed, 96 deletions(-)
 delete mode 100644 scripts/py_schema_to_json.py

diff --git a/docs/CustomizeSchemaData.md b/docs/CustomizeSchemaData.md
index fc936ba4..5ebdebbd 100644
--- a/docs/CustomizeSchemaData.md
+++ b/docs/CustomizeSchemaData.md
@@ -286,26 +286,6 @@ removed; uploads of `.py` files are rejected with HTTP 415.
 | Authoring | Pydantic-compatible JSON |
 | Side-effects on import | Impossible |
 
-### Authoring with the conversion helper
-
-If you have an existing Pydantic-based `.py` schema, the repo ships a
-helper that emits the equivalent JSON Schema:
-
-```bash
-python scripts/py_schema_to_json.py \
-    src/ContentProcessorAPI/samples/schemas/autoclaim.py \
-    AutoInsuranceClaimForm
-```
-
-This writes `autoclaim.json` next to the source file. Under the hood it
-calls `Model.model_json_schema()` from Pydantic v2 — the same call the
-worker uses today to build the LLM prompt. The output is therefore
-already aligned with the contract the pipeline expects.
-
-The accelerator ships a golden conversion of the auto-claim sample at
-[/src/ContentProcessorAPI/samples/schemas/autoclaim.json](/src/ContentProcessorAPI/samples/schemas/autoclaim.json)
-that you can reference.
-
 ### Upload via API
 
 `POST /schemavault/` accepts JSON Schema documents. Send the file with
diff --git a/scripts/py_schema_to_json.py b/scripts/py_schema_to_json.py
deleted file mode 100644
index 88f137b8..00000000
--- a/scripts/py_schema_to_json.py
+++ /dev/null
@@ -1,76 +0,0 @@
-# Copyright (c) Microsoft Corporation.
-# Licensed under the MIT License.
-
-"""Convert a legacy Pydantic ``.py`` schema into a declarative ``.json`` schema.
-
-This helper is part of the migration away from executable Python schemas.
-It imports a Pydantic model from a ``.py`` file *in a trusted local
-context* (the developer's machine), reads its
-:py:meth:`pydantic.BaseModel.model_json_schema` output, and writes the
-result to a ``.json`` file alongside.
-
-Usage:
-
-    python scripts/py_schema_to_json.py \
-        src/ContentProcessorAPI/samples/schemas/autoclaim.py \
-        AutoInsuranceClaimForm
-
-The generated JSON is what should be uploaded to the schema vault going
-forward; it is data only and never executed by the worker.
-"""
-
-from __future__ import annotations
-
-import argparse
-import importlib.util
-import json
-import sys
-from pathlib import Path
-
-from pydantic import BaseModel
-
-
-def convert(py_path: Path, class_name: str, out_path: Path | None = None) -> Path:
-    """Load *class_name* from *py_path* and write its JSON schema next to it."""
-    spec = importlib.util.spec_from_file_location(py_path.stem, py_path)
-    if spec is None or spec.loader is None:
-        raise RuntimeError(f"Cannot import schema module from {py_path}")
-    module = importlib.util.module_from_spec(spec)
-    sys.modules[spec.name] = module
-    spec.loader.exec_module(module)  # noqa: S102 - trusted local conversion only
-
-    cls = getattr(module, class_name, None)
-    if cls is None or not isinstance(cls, type) or not issubclass(cls, BaseModel):
-        raise RuntimeError(
-            f"'{class_name}' is not a Pydantic BaseModel in {py_path}"
-        )
-
-    schema = cls.model_json_schema()
-    # Pydantic emits "title" at the root; ensure it matches the requested
-    # class name so the worker's ``derive_class_name`` picks it up.
-    schema["title"] = class_name
-
-    target = out_path or py_path.with_suffix(".json")
-    target.write_text(json.dumps(schema, indent=2) + "\n", encoding="utf-8")
-    return target
-
-
-def main() -> int:
-    parser = argparse.ArgumentParser(description=__doc__)
-    parser.add_argument("py_path", type=Path, help="Path to the .py schema file.")
-    parser.add_argument("class_name", help="BaseModel class to export.")
-    parser.add_argument(
-        "--out",
-        type=Path,
-        default=None,
-        help="Output .json path (defaults to alongside the input).",
-    )
-    args = parser.parse_args()
-
-    target = convert(args.py_path, args.class_name, args.out)
-    print(f"Wrote {target}")
-    return 0
-
-
-if __name__ == "__main__":
-    raise SystemExit(main())

From fa465a9ee166b85c0f716df7a07c6bab1600a40c Mon Sep 17 00:00:00 2001
From: Prajwal-Microsoft
Date: Wed, 6 May 2026 19:33:55 +0530
Subject: [PATCH 12/13] fix: Addressed Copilot review comments

---
 docs/CustomizeSchemaData.md                       | 11 +++++------
 .../src/libs/pipeline/handlers/map_handler.py     |  9 ++++-----
 src/ContentProcessorAPI/test_http/schema_API.http |  2 +-
 3 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/docs/CustomizeSchemaData.md b/docs/CustomizeSchemaData.md
index 5ebdebbd..361e35ba 100644
--- a/docs/CustomizeSchemaData.md
+++ b/docs/CustomizeSchemaData.md
@@ -307,14 +307,13 @@ When uploading JSON:
   used as a fallback.
 - The schema must be ≤ 1 MB.
 
-### Constraints relative to the legacy Python schemas
+### Limitations
 
 JSON schemas are pure data. They cannot carry custom validation logic
-written in Python (e.g. `field_validator`). For most extraction
-schemas this is not a limitation — the existing samples don't use
-custom validators — but if you depend on imperative validation, keep
-authoring those schemas in Python locally and run the resulting JSON
-through the API.
+(e.g. Pydantic `field_validator`). For most extraction schemas this is
+not a limitation — the existing samples don't use custom validators.
+If you need imperative validation, implement it downstream after the
+pipeline extracts the data.
 
diff --git a/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py b/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py
index 0217662d..f3f20cb3 100644
--- a/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py
+++ b/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py
@@ -154,12 +154,11 @@ async def execute(self, context: MessageContext) -> StepResult:
         # Load the schema class for structured output. Only JSON schemas
         # are supported; the worker materialises the descriptor as an
         # in-memory Pydantic model without ever executing uploaded code.
-        schema_format = getattr(selected_schema, "Format", "json") or "json"
-        if schema_format != "json":
+        if not selected_schema.FileName.lower().endswith(".json"):
             raise ValueError(
-                f"Schema {selected_schema.Id} has unsupported Format "
-                f"'{schema_format}'. Re-register the schema as a JSON "
-                "Schema (.json) document; legacy Python (.py) schemas "
+                f"Schema {selected_schema.Id} has a non-JSON file "
+                f"'{selected_schema.FileName}'. Re-register the schema as a "
+                "JSON Schema (.json) document; legacy Python (.py) schemas "
                 "are no longer supported."
             )
 
         schema_class = load_schema_from_blob_json(
diff --git a/src/ContentProcessorAPI/test_http/schema_API.http b/src/ContentProcessorAPI/test_http/schema_API.http
index 169f566c..3c550e21 100644
--- a/src/ContentProcessorAPI/test_http/schema_API.http
+++ b/src/ContentProcessorAPI/test_http/schema_API.http
@@ -9,7 +9,7 @@ GET {{baseUrl}}{{schemavault}}/
 ### Register a schema (.json) into the vault
 # Sends multipart/form-data with fields:
 #  - data: JSON string { ClassName, Description }
-#  - file: schema file (.json recommended; .py still accepted for legacy)
+#  - file: schema file (.json only)
 #
 # @name registerSchema
 POST {{baseUrl}}{{schemavault}}/

From 37d338b68630d222f4265025b7f4388b48030f44 Mon Sep 17 00:00:00 2001
From: Prajwal-Microsoft
Date: Wed, 6 May 2026 19:56:04 +0530
Subject: [PATCH 13/13] fix: Updated documentation for JSON schema files

---
 docs/CustomizeSchemaData.md    | 8 ++++----
 docs/CustomizeSystemPrompts.md | 2 +-
 docs/GoldenPathWorkflows.md    | 2 +-
 docs/TechnicalArchitecture.md  | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/CustomizeSchemaData.md b/docs/CustomizeSchemaData.md
index 361e35ba..0e3105d8 100644
--- a/docs/CustomizeSchemaData.md
+++ b/docs/CustomizeSchemaData.md
@@ -2,9 +2,9 @@
 
 ## How to Use Your Own Data
 
-Files processed by the solution are mapped and transformed into **schemas** — strongly typed Pydantic class definitions that represent a standardized output for each document type. For example, the accelerator includes an `AutoInsuranceClaimForm` schema with fields like `policy_number`, `date_of_loss`, and `vehicle_information`.
+Files processed by the solution are mapped and transformed into **schemas** — JSON Schema documents that represent a standardized output for each document type. For example, the accelerator includes an `AutoInsuranceClaimForm` schema with fields like `policy_number`, `date_of_loss`, and `vehicle_information`.
-Using AI, the processing pipeline extracts content from each document (text, images, tables), then maps the extracted data into the schema fields using GPT-5.1 with structured JSON output — field descriptions in the schema class act as extraction guidance for the LLM.
+Using AI, the processing pipeline extracts content from each document (text, images, tables), then maps the extracted data into the schema fields using GPT-5.1 with structured JSON output — field descriptions in the schema act as extraction guidance for the LLM.
 
 Schemas need to be created specific to your business and domain requirements. A lot of times schemas may be generally common across industries, but this allows for variations specific to your use case.
@@ -73,7 +73,7 @@ flowchart LR
 
 A new JSON Schema document needs to be created that defines the schema as a declarative description of your document type.
 
-> **Schema Folder:** [/src/ContentProcessorAPI/samples/schemas/](/src/ContentProcessorAPI/samples/schemas/) — All schema classes should be placed into this folder
+> **Schema Folder:** [/src/ContentProcessorAPI/samples/schemas/](/src/ContentProcessorAPI/samples/schemas/) — All schema JSON files should be placed into this folder
 
 **Sample Schemas:** The accelerator ships with 4 sample schemas — use any as a starting template:
@@ -264,7 +264,7 @@ Once schemas are registered and grouped into a SchemaSet, the pipeline uses them
 4. **LLM extraction** — Embeds the JSON Schema into the GPT-5.1 system prompt with `response_format` for structured JSON output (temperature=0.1 for deterministic results)
 5. **Validation & scoring** — Parses the GPT response back into the Pydantic model, then computes per-field confidence scores using log-probabilities
 
-This means your field descriptions in the schema class **directly influence extraction quality** — write clear, specific descriptions with examples for best results.
+This means your field descriptions in the schema **directly influence extraction quality** — write clear, specific descriptions with examples for best results.
 
 ---
 
diff --git a/docs/CustomizeSystemPrompts.md b/docs/CustomizeSystemPrompts.md
index d1d3cf53..67dabf83 100644
--- a/docs/CustomizeSystemPrompts.md
+++ b/docs/CustomizeSystemPrompts.md
@@ -51,4 +51,4 @@ For the complete DSL reference, expression language, domain adaptation examples,
 
 ## Schema-Specific Prompts
 
-Schema-specific prompts are managed directly in the individual schema .py file that is created. The field descriptions in your schema class act as prompts for the LLM during data extraction and mapping. See [Customizing Schema and Data](./CustomizeSchemaData.md) for details on how to write effective field descriptions.
+Schema-specific prompts are managed directly in the individual schema JSON file. The field descriptions in your schema act as prompts for the LLM during data extraction and mapping. See [Customizing Schema and Data](./CustomizeSchemaData.md) for details on how to write effective field descriptions.
diff --git a/docs/GoldenPathWorkflows.md b/docs/GoldenPathWorkflows.md
index cf48b480..fc8c911a 100644
--- a/docs/GoldenPathWorkflows.md
+++ b/docs/GoldenPathWorkflows.md
@@ -121,7 +121,7 @@ The final stage applies **YAML-based rules** to detect missing documents and cro
 
 1. **Create Custom Schema**
    - Follow the [Custom Schema Guide](./CustomizeSchemaData.md)
-   - Define your document structure and required fields (Pydantic model)
+   - Define your document structure and required fields (JSON Schema)
 
 2. **Register Your Schema**
    - Add your schema to `schema_info.json` and run `register_schema.py`
diff --git a/docs/TechnicalArchitecture.md b/docs/TechnicalArchitecture.md
index dce44b65..3e0f651f 100644
--- a/docs/TechnicalArchitecture.md
+++ b/docs/TechnicalArchitecture.md
@@ -209,7 +209,7 @@ Using Azure OpenAI Service, a deployment of the GPT-5.1 model is used during the
 Using Azure Blob Storage, the solution uses multiple containers:
 - **process-batch** – Claim batch manifests and batch-level artifacts.
 - **cps-processes** – Source files for processing, intermediate results, and final output JSON files.
-- **cps-configuration** – Schema `.py` files and configuration data.
+- **cps-configuration** – Schema JSON files and configuration data.
 
 ### Azure Cosmos DB for MongoDB
 Using Azure Cosmos DB for MongoDB, the solution uses multiple collections: