diff --git a/docs/CustomizeSchemaData.md b/docs/CustomizeSchemaData.md
index 05a08b54..0e3105d8 100644
--- a/docs/CustomizeSchemaData.md
+++ b/docs/CustomizeSchemaData.md
@@ -2,9 +2,9 @@
 ## How to Use Your Own Data

-Files processed by the solution are mapped and transformed into **schemas** — strongly typed Pydantic class definitions that represent a standardized output for each document type. For example, the accelerator includes an `AutoInsuranceClaimForm` schema with fields like `policy_number`, `date_of_loss`, and `vehicle_information`.
+Files processed by the solution are mapped and transformed into **schemas** — JSON Schema documents that represent a standardized output for each document type. For example, the accelerator includes an `AutoInsuranceClaimForm` schema with fields like `policy_number`, `date_of_loss`, and `vehicle_information`.

-Using AI, the processing pipeline extracts content from each document (text, images, tables), then maps the extracted data into the schema fields using GPT-5.1 with structured JSON output — field descriptions in the schema class act as extraction guidance for the LLM.
+Using AI, the processing pipeline extracts content from each document (text, images, tables), then maps the extracted data into the schema fields using GPT-5.1 with structured JSON output — field descriptions in the schema act as extraction guidance for the LLM.

 Schemas must be created to fit your business and domain requirements. Many schemas are broadly similar across industries, but defining your own allows for variations specific to your use case.

@@ -15,9 +15,9 @@ Before processing documents, schemas must be **registered** in the system and gr
 ```mermaid
 flowchart TB
     subgraph Step1["Step 1: Register Schemas (one per document type)
POST /schemavault/ × N"] - S1["🗎 AutoInsuranceClaimForm
autoclaim.py
Schema ID: abc123"] - S2["🗎 PoliceReportDocument
policereport.py
Schema ID: def456"] - S3["🗎 RepairEstimateDocument
repairestimate.py
Schema ID: ghi789"] + S1["🗎 AutoInsuranceClaimForm
autoclaim.json
Schema ID: abc123"] + S2["🗎 PoliceReportDocument
policereport.json
Schema ID: def456"] + S3["🗎 RepairEstimateDocument
repairestimate.json
Schema ID: ghi789"] S4["🗎 ...
more schemas"] end @@ -37,8 +37,8 @@ flowchart TB subgraph Runtime["Runtime — Pipeline Map Step"] R1["1. Look up Schema metadata
from Cosmos DB"] - R2["2. Download .py class file
from Blob Storage"] - R3["3. Dynamically load Pydantic class
→ generate JSON Schema"] + R2["2. Download JSON Schema
from Blob Storage"] + R3["3. Materialise Pydantic model
from JSON Schema (no code execution)"] R4["4. Embed JSON Schema in
GPT-5.1 prompt"] R5["5. Validate response with
Pydantic → confidence scoring"] R1 --> R2 --> R3 --> R4 --> R5 @@ -60,90 +60,90 @@ flowchart TB flowchart LR Claim["🗂️ Claim"] -->|"assigned to"| SchemaSet["📂 SchemaSet"] SchemaSet -->|"contains"| Schema["🗎 Schema"] - Schema -->|"stores .py file"| Blob["💾 Blob Storage"] + Schema -->|"stores .json file"| Blob["💾 Blob Storage"] ``` -- **Schema** — one per document type. Metadata in Cosmos DB, `.py` class file in Blob Storage. +- **Schema** — one per document type. Metadata in Cosmos DB, `.json` schema file in Blob Storage. - **SchemaSet** — a named group that holds references to one or more Schemas. Assigned to a Claim at creation time. - A Schema can belong to multiple SchemaSets or none at all. --- -## Step 1: Create Schema Class (.py) +## Step 1: Create a JSON Schema Document -A new class needs to be created that defines the schema as a strongly typed Python class inheriting from Pydantic `BaseModel`. +A new JSON Schema document needs to be created that defines the schema as a declarative description of your document type. 
-> **Schema Folder:** [/src/ContentProcessorAPI/samples/schemas/](/src/ContentProcessorAPI/samples/schemas/) — All schema classes should be placed into this folder +> **Schema Folder:** [/src/ContentProcessorAPI/samples/schemas/](/src/ContentProcessorAPI/samples/schemas/) — All schema JSON files should be placed into this folder **Sample Schemas:** The accelerator ships with 4 sample schemas — use any as a starting template: | Schema | File | Class Name | Auto-registered | | ------------------------- | --------------------------------------------------------------------------------- | ------------------------------- | --------------- | -| Auto Insurance Claim Form | [autoclaim.py](/src/ContentProcessorAPI/samples/schemas/autoclaim.py) | `AutoInsuranceClaimForm` | ✅ | -| Police Report | [policereport.py](/src/ContentProcessorAPI/samples/schemas/policereport.py) | `PoliceReportDocument` | ✅ | -| Repair Estimate | [repairestimate.py](/src/ContentProcessorAPI/samples/schemas/repairestimate.py) | `RepairEstimateDocument` | ✅ | -| Damaged Vehicle Image | [damagedcarimage.py](/src/ContentProcessorAPI/samples/schemas/damagedcarimage.py) | `DamagedVehicleImageAssessment` | ✅ | +| Auto Insurance Claim Form | [autoclaim.json](/src/ContentProcessorAPI/samples/schemas/autoclaim.json) | `AutoInsuranceClaimForm` | ✅ | +| Police Report | [policereport.json](/src/ContentProcessorAPI/samples/schemas/policereport.json) | `PoliceReportDocument` | ✅ | +| Repair Estimate | [repairestimate.json](/src/ContentProcessorAPI/samples/schemas/repairestimate.json) | `RepairEstimateDocument` | ✅ | +| Damaged Vehicle Image | [damagedcarimage.json](/src/ContentProcessorAPI/samples/schemas/damagedcarimage.json) | `DamagedVehicleImageAssessment` | ✅ | > **Note:** All 4 schemas are automatically registered during deployment (via `azd up` or the `register_schema.py` script) and grouped into the **"Auto Claim"** schema set. 
-Duplicate one of these files and update with a class definition that represents your document type. +Duplicate one of these files and update with fields that represent your document type. > **Tip:** You can use GitHub Copilot to generate a schema. Example prompt: > -> *Generate a Schema Class based on the following autoclaim.py schema definition, which has been built and derived from Pydantic BaseModel class. The generated Schema Class should be called "Freight Shipment Bill of Lading" schema file. Please define the entities based on standard bill of lading documents in the logistics industry.* - -### Class Structure - -Each schema `.py` file must include: - -```python -from pydantic import BaseModel, Field -from typing import List, Optional - -class SubModel(BaseModel): - """Description of this sub-entity — used as LLM context.""" - - field_name: Optional[str] = Field( - description="What this field represents, e.g. Consignee company name" - ) - -class MyDocumentSchema(BaseModel): - """Top-level description of the document type.""" - - some_field: Optional[str] = Field(description="...") - sub_entity: Optional[SubModel] = Field(description="...") - - @staticmethod - def example() -> "MyDocumentSchema": - """Returns an empty instance of this schema.""" - return MyDocumentSchema(some_field="", sub_entity=SubModel.example()) - - @staticmethod - def from_json(json_str: str) -> "MyDocumentSchema": - """Creates an instance from a JSON string.""" - return MyDocumentSchema.model_validate_json(json_str) - - def to_dict(self) -> dict: - """Converts this instance to a dictionary.""" - return self.model_dump() +> *Generate a JSON Schema (Draft 2020-12) based on the following autoclaim.json schema definition. The generated schema should be called "Freight Shipment Bill of Lading". 
Please define the properties based on standard bill of lading documents in the logistics industry.* + +### Schema Document Structure + +Each schema `.json` file must be a JSON Schema (Draft 2020-12) with +`"type": "object"` at the root and a `"properties"` block. Example: + +```json +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "title": "MyDocumentSchema", + "description": "Top-level description of the document type.", + "type": "object", + "properties": { + "some_field": { + "type": ["string", "null"], + "description": "What this field represents, e.g. policy number" + }, + "sub_entity": { + "$ref": "#/$defs/SubModel" + } + }, + "$defs": { + "SubModel": { + "title": "SubModel", + "description": "Description of this sub-entity — used as LLM context.", + "type": "object", + "properties": { + "field_name": { + "type": ["string", "null"], + "description": "What this field represents, e.g. Consignee company name" + } + } + } + } +} ``` ### Key Rules | Element | Requirement | | ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **Inheritance** | All classes must inherit from `pydantic.BaseModel` | -| **Field descriptions** | Every field must have a `description=` — this is the prompt text the LLM uses for extraction. Include examples for better accuracy (e.g., `"Date of loss, e.g. 01/15/2026"`) | -| **Optional vs Required** | Use `Optional[str]` for fields that may not be present in every document | -| **Subclasses** | Use nested `BaseModel` classes for complex entities (address, line items, etc.) 
| -| **Required methods** | `example()`, `from_json()`, `to_dict()` — all three must be present | -| **Class docstring** | Include a description — it's used as context during mapping | +| **Root type** | Must be `"type": "object"` with a `"properties"` block | +| **Field descriptions** | Every property must have a `"description"` — this is the prompt text the LLM uses for extraction. Include examples for better accuracy (e.g., `"Date of loss, e.g. 01/15/2026"`) | +| **Optional vs Required** | Use `["string", "null"]` for fields that may not be present in every document; list required keys in the root `"required"` array if any | +| **Sub-objects** | Define reusable nested types under `"$defs"` and reference them via `"$ref": "#/$defs/"` | +| **Class name** | Use a top-level `"title"` field; this becomes `ClassName` in the Schema Vault. If absent, the request body's `ClassName` (or filename) is used | +| **Top-level description**| Include a `"description"` — it's used as context during mapping | --- ## Step 2: Register Schemas -After creating your `.py` class files, register each schema in the system. Registration uploads the class file to Blob Storage and stores metadata in Cosmos DB. +After creating your `.json` schema files, register each schema in the system. Registration uploads the file to Blob Storage and stores metadata in Cosmos DB. ### Option A: Register via API (individual) @@ -152,7 +152,7 @@ After creating your `.py` class files, register each schema in the system. Regis | Part | Type | Description | | ------------- | ----------- | ----------------------------------------------------------------- | | `schema_info` | JSON string | `{"ClassName": "MyDocumentSchema", "Description": "My Document"}` | -| `file` | File upload | The `.py` class file (max 1 MB) | +| `file` | File upload | The `.json` JSON Schema file (max 1 MB) | Example using the REST Client extension: @@ -177,10 +177,10 @@ For bulk registration, use the provided script with a JSON manifest. 
The script ```json { "schemas": [ - { "File": "autoclaim.py", "ClassName": "AutoInsuranceClaimForm", "Description": "Auto Insurance Claim Form" }, - { "File": "damagedcarimage.py", "ClassName": "DamagedVehicleImageAssessment","Description": "Damaged Vehicle Image Assessment" }, - { "File": "policereport.py", "ClassName": "PoliceReportDocument", "Description": "Police Report Document" }, - { "File": "repairestimate.py", "ClassName": "RepairEstimateDocument", "Description": "Repair Estimate Document" } + { "File": "autoclaim.json", "ClassName": "AutoInsuranceClaimForm", "Description": "Auto Insurance Claim Form" }, + { "File": "damagedcarimage.json", "ClassName": "DamagedVehicleImageAssessment","Description": "Damaged Vehicle Image Assessment" }, + { "File": "policereport.json", "ClassName": "PoliceReportDocument", "Description": "Police Report Document" }, + { "File": "repairestimate.json", "ClassName": "RepairEstimateDocument", "Description": "Repair Estimate Document" } ], "schemaset": { "Name": "Auto Claim", @@ -205,7 +205,7 @@ The script checks for existing schemas and schema sets to avoid duplicates, and | `POST` | `/schemavault/` | Register a new schema (multipart upload) | | `PUT` | `/schemavault/` | Update an existing schema | | `DELETE` | `/schemavault/` | Delete a schema by ID | -| `GET` | `/schemavault/schemas/{schema_id}` | Get a schema by ID (includes `.py` file) | +| `GET` | `/schemavault/schemas/{schema_id}` | Get a schema by ID (includes `.json` file) | --- @@ -259,16 +259,63 @@ Repeat for each schema. The SchemaSet now holds references to all your document Once schemas are registered and grouped into a SchemaSet, the pipeline uses them automatically during the **Map** step: 1. **Schema lookup** — The Map handler reads the `Schema_Id` from the processing queue message, then fetches metadata from Cosmos DB -2. **Dynamic class loading** — Downloads the `.py` file from Blob Storage and dynamically loads the Pydantic class -3. 
**JSON Schema generation** — Calls `model_json_schema()` on the class to produce a full JSON Schema with all field descriptions +2. **Schema materialisation** — Downloads the JSON Schema document from Blob Storage and builds a Pydantic model from it in memory (no code execution) +3. **JSON Schema generation** — Calls `model_json_schema()` on the materialised model to produce the schema with all field descriptions 4. **LLM extraction** — Embeds the JSON Schema into the GPT-5.1 system prompt with `response_format` for structured JSON output (temperature=0.1 for deterministic results) -5. **Validation & scoring** — Parses the GPT response back into the Pydantic class, then computes per-field confidence scores using log-probabilities +5. **Validation & scoring** — Parses the GPT response back into the Pydantic model, then computes per-field confidence scores using log-probabilities -This means your field descriptions in the schema class **directly influence extraction quality** — write clear, specific descriptions with examples for best results. +This means your field descriptions in the schema **directly influence extraction quality** — write clear, specific descriptions with examples for best results. --- -## Related Documentation +## Authoring Schemas as JSON + +The schema vault accepts **JSON Schema** documents (Draft 2020-12) only. +JSON schemas are treated strictly as data: the worker parses them and +materialises a Pydantic model in memory without executing any uploaded +code, eliminating an entire class of remote-code-execution risk in the +schema-management path. The legacy executable `.py` format has been +removed; uploads of `.py` files are rejected with HTTP 415. 
+
+### Format requirements
+
+| Aspect | JSON Schema |
+| --- | --- |
+| Format | Declarative JSON document |
+| Worker behaviour | Parses JSON, builds the model in memory |
+| Authoring | Hand-written JSON, or exported from a Pydantic model via `model_json_schema()` |
+| Side-effects on load | None — the schema is parsed, never executed |
+
+### Upload via API
+
+`POST /schemavault/` accepts JSON Schema documents. Send the file with
+`Content-Type: application/json`:
+
+```http
+POST /schemavault/
+Content-Type: multipart/form-data
+- data: { "ClassName": "InvoiceSchema", "Description": "Invoice extraction" }
+- file: invoice.json (application/json)
+```
+
+When uploading JSON:
+
+- The schema must be a JSON object with `"type": "object"` and a
+  `"properties"` block.
+- The schema's `title` (if present) becomes the `ClassName` recorded in
+  Cosmos. If the JSON has no `title`, the request body's `ClassName` is
+  used as a fallback.
+- The schema must be ≤ 1 MB.
+
+### Limitations
+
+JSON schemas are pure data. They cannot carry custom validation logic
+(e.g. Pydantic `field_validator`). For most extraction schemas this is
+not a limitation — the existing samples don't use custom validators.
+If you need imperative validation, implement it downstream after the
+pipeline extracts the data.
+
+## Related Documentation
+
 - [Modifying System Processing Prompts](./CustomizeSystemPrompts.md) — Customize extraction and mapping prompts
 - [Gap Analysis Ruleset Guide](./GapAnalysisRulesetGuide.md) — Define gap rules that reference your document types
diff --git a/docs/CustomizeSystemPrompts.md b/docs/CustomizeSystemPrompts.md
index d1d3cf53..67dabf83 100644
--- a/docs/CustomizeSystemPrompts.md
+++ b/docs/CustomizeSystemPrompts.md
@@ -51,4 +51,4 @@ For the complete DSL reference, expression language, domain adaptation examples,

 ## Schema-Specific Prompts

-Schema-specific prompts are managed directly in the individual schema .py file that is created. The field descriptions in your schema class act as prompts for the LLM during data extraction and mapping. 
See [Customizing Schema and Data](./CustomizeSchemaData.md) for details on how to write effective field descriptions. +Schema-specific prompts are managed directly in the individual schema JSON file. The field descriptions in your schema act as prompts for the LLM during data extraction and mapping. See [Customizing Schema and Data](./CustomizeSchemaData.md) for details on how to write effective field descriptions. diff --git a/docs/GoldenPathWorkflows.md b/docs/GoldenPathWorkflows.md index cf48b480..fc8c911a 100644 --- a/docs/GoldenPathWorkflows.md +++ b/docs/GoldenPathWorkflows.md @@ -121,7 +121,7 @@ The final stage applies **YAML-based rules** to detect missing documents and cro 1. **Create Custom Schema** - Follow the [Custom Schema Guide](./CustomizeSchemaData.md) - - Define your document structure and required fields (Pydantic model) + - Define your document structure and required fields (JSON Schema) 2. **Register Your Schema** - Add your schema to `schema_info.json` and run `register_schema.py` diff --git a/docs/TechnicalArchitecture.md b/docs/TechnicalArchitecture.md index dce44b65..3e0f651f 100644 --- a/docs/TechnicalArchitecture.md +++ b/docs/TechnicalArchitecture.md @@ -209,7 +209,7 @@ Using Azure OpenAI Service, a deployment of the GPT-5.1 model is used during the Using Azure Blob Storage, the solution uses multiple containers: - **process-batch** – Claim batch manifests and batch-level artifacts. - **cps-processes** – Source files for processing, intermediate results, and final output JSON files. -- **cps-configuration** – Schema `.py` files and configuration data. +- **cps-configuration** – Schema JSON files and configuration data. 
### Azure Cosmos DB for MongoDB Using Azure Cosmos DB for MongoDB, the solution uses multiple collections: diff --git a/infra/scripts/post_deployment.ps1 b/infra/scripts/post_deployment.ps1 index 04104a50..aa116003 100644 --- a/infra/scripts/post_deployment.ps1 +++ b/infra/scripts/post_deployment.ps1 @@ -124,6 +124,15 @@ if (-not $ApiReady) { Write-Host " Registering new schema '$ClassName'..." + # Only JSON Schema descriptors are accepted. The legacy .py format + # was removed as part of the schemavault RCE remediation. + $extension = [System.IO.Path]::GetExtension($SchemaFile).ToLowerInvariant() + if ($extension -ne '.json') { + Write-Host " Unsupported schema extension '$extension' for '$SchemaFile'. Only .json is accepted. Skipping..." + continue + } + $contentType = 'application/json' + # Build multipart form data $dataPayload = @{ ClassName = $ClassName; Description = $Description } | ConvertTo-Json -Compress $fileBytes = [System.IO.File]::ReadAllBytes($SchemaFile) @@ -137,7 +146,7 @@ if (-not $ApiReady) { $dataPayload, "--$boundary", "Content-Disposition: form-data; name=`"file`"; filename=`"$fileName`"", - "Content-Type: text/x-python$LF", + "Content-Type: $contentType$LF", [System.Text.Encoding]::UTF8.GetString($fileBytes), "--$boundary--$LF" ) -join $LF diff --git a/infra/scripts/post_deployment.sh b/infra/scripts/post_deployment.sh index 2b0ee0ad..49644a4d 100644 --- a/infra/scripts/post_deployment.sh +++ b/infra/scripts/post_deployment.sh @@ -136,10 +136,19 @@ else echo " Registering new schema '$CLASS_NAME'..." DATA_PAYLOAD="{\"ClassName\": \"$CLASS_NAME\", \"Description\": \"$DESCRIPTION\"}" + # Only JSON Schema descriptors are accepted. The legacy .py format + # was removed as part of the schemavault RCE remediation. + EXT=$(echo "${FILE_NAME##*.}" | tr '[:upper:]' '[:lower:]') + if [ "$EXT" != "json" ]; then + echo " Unsupported schema extension '.$EXT' for '$FILE_NAME'. Only .json is accepted. Skipping..." 
+ continue + fi + CONTENT_TYPE="application/json" + RESPONSE=$(curl -s -w "\n%{http_code}" \ -X POST "$SCHEMAVAULT_URL" \ -F "data=$DATA_PAYLOAD" \ - -F "file=@$SCHEMA_FILE;type=text/x-python" \ + -F "file=@$SCHEMA_FILE;type=$CONTENT_TYPE" \ --connect-timeout 60) HTTP_CODE=$(echo "$RESPONSE" | tail -1) diff --git a/src/ContentProcessor/src/libs/pipeline/entities/schema.py b/src/ContentProcessor/src/libs/pipeline/entities/schema.py index f7f5f7e0..e9138897 100644 --- a/src/ContentProcessor/src/libs/pipeline/entities/schema.py +++ b/src/ContentProcessor/src/libs/pipeline/entities/schema.py @@ -9,7 +9,7 @@ class file (in blob storage) that defines the structured output """ import datetime -from typing import Optional +from typing import Literal, Optional from pydantic import BaseModel, Field @@ -21,10 +21,13 @@ class Schema(BaseModel): Attributes: Id: Unique schema identifier. - ClassName: Python class name in the remote module. + ClassName: Class name to materialise from the schema artifact. Description: Human-readable description. - FileName: Blob filename containing the schema class. + FileName: Blob filename containing the schema artifact. ContentType: Target content type this schema handles. + Format: Storage format of the schema artifact. Always + ``"json"`` — declarative JSON Schema descriptors are the + only supported format. 
""" Id: str @@ -32,6 +35,7 @@ class Schema(BaseModel): Description: str FileName: str ContentType: str + Format: Literal["json"] = Field(default="json") Created_On: Optional[datetime.datetime] = Field(default=None) Updated_On: Optional[datetime.datetime] = Field(default=None) diff --git a/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py b/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py index aa08dda1..f3f20cb3 100644 --- a/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py +++ b/src/ContentProcessor/src/libs/pipeline/handlers/map_handler.py @@ -28,7 +28,7 @@ from libs.pipeline.entities.pipeline_step_result import StepResult from libs.pipeline.entities.schema import Schema from libs.pipeline.queue_handler_base import HandlerBase -from libs.utils.remote_module_loader import load_schema_from_blob +from libs.utils.remote_schema_loader import load_schema_from_blob_json logger = logging.getLogger(__name__) @@ -151,12 +151,21 @@ async def execute(self, context: MessageContext) -> StepResult: schema_id=context.data_pipeline.pipeline_status.schema_id, ) - # Load the schema class for structured output - schema_class = load_schema_from_blob( + # Load the schema class for structured output. Only JSON schemas + # are supported; the worker materialises the descriptor as an + # in-memory Pydantic model without ever executing uploaded code. + if not selected_schema.FileName.lower().endswith(".json"): + raise ValueError( + f"Schema {selected_schema.Id} has a non-JSON file " + f"'{selected_schema.FileName}'. Re-register the schema as a " + "JSON Schema (.json) document; legacy Python (.py) schemas " + "are no longer supported." 
+ ) + schema_class = load_schema_from_blob_json( account_url=self.application_context.configuration.app_storage_blob_url, container_name=f"{self.application_context.configuration.app_cps_configuration}/Schemas/{context.data_pipeline.pipeline_status.schema_id}", blob_name=selected_schema.FileName, - module_name=selected_schema.ClassName, + model_name=selected_schema.ClassName, ) # Invoke Model with Agent Framework SDK diff --git a/src/ContentProcessor/src/libs/utils/__init__.py b/src/ContentProcessor/src/libs/utils/__init__.py index e4b1d5a6..b5f16936 100644 --- a/src/ContentProcessor/src/libs/utils/__init__.py +++ b/src/ContentProcessor/src/libs/utils/__init__.py @@ -8,8 +8,8 @@ base64_util: Base-64 encoding detection. credential_util: Convenience re-export of credential and token-provider helpers (mirrors azure_credential_utils). - remote_module_loader: Dynamically load Python modules from Azure Blob - Storage. + remote_schema_loader: Materialise Pydantic models from JSON Schema + descriptors stored in Azure Blob Storage (no code execution). stopwatch: Lightweight elapsed-time measurement context manager. utils: General-purpose JSON encoding, dict flattening, and value comparison helpers. diff --git a/src/ContentProcessor/src/libs/utils/remote_module_loader.py b/src/ContentProcessor/src/libs/utils/remote_module_loader.py deleted file mode 100644 index f3985aa7..00000000 --- a/src/ContentProcessor/src/libs/utils/remote_module_loader.py +++ /dev/null @@ -1,65 +0,0 @@ -# Copyright (c) Microsoft Corporation. -# Licensed under the MIT License. - -"""Dynamically load Python modules stored in Azure Blob Storage. - -Used by the map handler to fetch schema classes at runtime from a -configurable blob container. 
-""" - -import importlib.util -import sys - -from azure.storage.blob import BlobServiceClient - -from libs.utils.azure_credential_utils import get_azure_credential - - -def load_schema_from_blob( - account_url: str, container_name: str, blob_name: str, module_name: str -): - """Download a Python file from blob storage and return a class from it. - - Args: - account_url: Azure Blob Storage account URL. - container_name: Container (path) holding the blob. - blob_name: Blob filename to download. - module_name: Name of the class to extract from the module. - - Returns: - The class object loaded from the downloaded script. - """ - # Download the blob content - blob_content = _download_blob_content(container_name, blob_name, account_url) - - # Execute the script content - module = _execute_script(blob_content, module_name) - - loaded_class = getattr(module, module_name) - return loaded_class - - -def _download_blob_content(container_name, blob_name, account_url): - """Download blob content as a UTF-8 string.""" - credential = get_azure_credential() - blob_service_client = BlobServiceClient( - account_url=account_url, credential=credential - ) - - blob_client = blob_service_client.get_blob_client( - container=container_name, blob=blob_name - ) - - blob_content = blob_client.download_blob().readall().decode("utf-8") - return blob_content - - -def _execute_script(script_content, module_name): - """Execute Python source text as a new module and return it.""" - spec = importlib.util.spec_from_loader(module_name, loader=None) - module = importlib.util.module_from_spec(spec) - sys.modules[module_name] = module - - # Execute the script content in the module's namespace - exec(script_content, module.__dict__) - return module diff --git a/src/ContentProcessor/src/libs/utils/remote_schema_loader.py b/src/ContentProcessor/src/libs/utils/remote_schema_loader.py new file mode 100644 index 00000000..4010bb10 --- /dev/null +++ 
b/src/ContentProcessor/src/libs/utils/remote_schema_loader.py @@ -0,0 +1,344 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. + +"""Materialise a Pydantic model from a JSON Schema descriptor. + +A JSON schema descriptor is treated strictly as data: + +1. Bytes are downloaded from blob storage. +2. ``json.loads`` parses them into a ``dict``. +3. A recursive walk converts the schema into Pydantic ``BaseModel`` + subclasses via :func:`pydantic.create_model`. + +There is **no** ``exec``, ``compile``, ``importlib`` or any other +mechanism that would execute attacker-supplied code. The worst a +malicious schema can do is fail validation at load time. +""" + +from __future__ import annotations + +import json +import logging +from typing import Any, ForwardRef, List, Literal, Tuple, Type, Union + +from azure.storage.blob import BlobServiceClient +from pydantic import BaseModel, ConfigDict, Field, create_model + +from libs.utils.azure_credential_utils import get_azure_credential + +logger = logging.getLogger(__name__) + + +class JsonSchemaLoadError(ValueError): + """Raised when a JSON schema descriptor cannot be turned into a model.""" + + +def load_schema_from_blob_json( + account_url: str, + container_name: str, + blob_name: str, + model_name: str, +) -> Type[BaseModel]: + """Download a JSON Schema and return a generated Pydantic model class. + + Args: + account_url: Azure Blob Storage account URL. + container_name: Container (path) holding the blob. + blob_name: Blob filename to download (a ``.json`` schema). + model_name: Name to assign to the root generated model class. + + Returns: + A dynamically generated subclass of :class:`pydantic.BaseModel` + whose shape matches the JSON Schema. + + Raises: + JsonSchemaLoadError: If the blob is not valid JSON or the schema + cannot be translated into a Pydantic model. 
+ """ + raw = _download_blob_content(container_name, blob_name, account_url) + try: + document = json.loads(raw) + except json.JSONDecodeError as exc: + raise JsonSchemaLoadError( + f"Schema blob '{blob_name}' is not valid JSON: {exc.msg}" + ) from exc + + if not isinstance(document, dict): + raise JsonSchemaLoadError("Schema root must be a JSON object.") + + return build_model_from_schema(document, model_name) + + +def build_model_from_schema( + document: dict[str, Any], model_name: str +) -> Type[BaseModel]: + """Build a Pydantic model class from an in-memory JSON Schema document. + + This is split out from :func:`load_schema_from_blob_json` so it can + be unit-tested without touching Azure storage. + """ + defs = document.get("$defs") or document.get("definitions") or {} + if not isinstance(defs, dict): + raise JsonSchemaLoadError("'$defs' must be a JSON object if present.") + + builder = _ModelBuilder(defs) + model = builder.build_object(document, model_name, is_root=True) + builder.resolve_forward_refs() + return model + + +# --------------------------------------------------------------------------- +# Internals +# --------------------------------------------------------------------------- + + +def _download_blob_content( + container_name: str, blob_name: str, account_url: str +) -> str: + """Download the blob and return its UTF-8 contents as a string.""" + credential = get_azure_credential() + blob_service_client = BlobServiceClient( + account_url=account_url, credential=credential + ) + blob_client = blob_service_client.get_blob_client( + container=container_name, blob=blob_name + ) + return blob_client.download_blob().readall().decode("utf-8") + + +class _ModelBuilder: + """Recursive JSON-Schema-to-Pydantic translator. + + The builder maintains a memo of already-generated models keyed by + ``$defs`` name so that repeated ``$ref`` references reuse the same + class and so that self/mutually-recursive schemas terminate. 
+ """ + + _PRIMITIVE_TYPES: dict[str, type] = { + "string": str, + "integer": int, + "number": float, + "boolean": bool, + "null": type(None), + } + + def __init__(self, defs: dict[str, Any]): + self._defs = defs + self._models: dict[str, Type[BaseModel]] = {} + self._in_progress: set[str] = set() + self._all_models: list[Type[BaseModel]] = [] + + # -- public driver ---------------------------------------------------- + + def build_object( + self, + node: dict[str, Any], + model_name: str, + *, + is_root: bool = False, + ) -> Type[BaseModel]: + """Build a Pydantic model from an object-typed schema node.""" + if not is_root: + # Avoid colliding with a reserved $defs name when the caller + # supplies an inline object schema. + model_name = self._dedupe_name(model_name) + + # Reserve the slot so $ref to the same definition resolves to us + # even before we finish constructing it. + self._in_progress.add(model_name) + try: + properties = node.get("properties") or {} + required = set(node.get("required") or []) + fields: dict[str, tuple[Any, Any]] = {} + + for prop_name, prop_schema in properties.items(): + python_type, default = self._field_for( + prop_schema, prop_name, parent_name=model_name + ) + if prop_name in required and default is None: + field_default: Any = ... 
+ else: + field_default = default + + description = ( + prop_schema.get("description") + if isinstance(prop_schema, dict) + else None + ) + fields[prop_name] = ( + python_type, + Field(default=field_default, description=description), + ) + + model = create_model( # type: ignore[call-overload] + model_name, + __config__=ConfigDict(extra="ignore"), + **fields, + ) + description = node.get("description") or node.get("title") + if isinstance(description, str): + model.__doc__ = description + finally: + self._in_progress.discard(model_name) + + self._models[model_name] = model + self._all_models.append(model) + return model + + def resolve_forward_refs(self) -> None: + """Resolve any ``ForwardRef`` placeholders left during construction.""" + ns = dict(self._models) + for model in self._all_models: + try: + model.model_rebuild(_types_namespace=ns) + except Exception: # pragma: no cover - defensive + logger.exception( + "Failed to rebuild model %s while resolving forward refs", + model.__name__, + ) + + # -- field translation ------------------------------------------------ + + def _field_for( + self, + schema: Any, + prop_name: str, + parent_name: str, + ) -> Tuple[Any, Any]: + """Translate a property schema into ``(python_type, default_value)``. + + ``default_value`` is ``None`` when the field is nullable / optional; + callers replace it with ``...`` when the field is required. + """ + if schema is True or schema is None or schema == {}: + return (Any, None) + if not isinstance(schema, dict): + raise JsonSchemaLoadError( + f"Property '{prop_name}' has invalid schema (not an object)." + ) + + # $ref resolution (local refs only). + ref = schema.get("$ref") + if isinstance(ref, str): + return (self._resolve_ref(ref), None) + + # anyOf / oneOf — treat as Union. + for key in ("anyOf", "oneOf"): + if key in schema: + members = schema[key] + if not isinstance(members, list) or not members: + raise JsonSchemaLoadError( + f"'{key}' for '{prop_name}' must be a non-empty list." 
+ ) + resolved = [ + self._field_for(m, prop_name, parent_name)[0] for m in members + ] + return (Union[tuple(resolved)], None) # type: ignore[valid-type] + + # enum — Literal[...] of allowed values. + if "enum" in schema and isinstance(schema["enum"], list) and schema["enum"]: + literal_args = tuple(schema["enum"]) + return (Literal[literal_args], None) # type: ignore[valid-type] + + json_type = schema.get("type") + + if isinstance(json_type, list): + # e.g. ["string", "null"] + python_types = [self._type_for_simple(t, schema, prop_name, parent_name) + for t in json_type] + if len(python_types) == 1: + return (python_types[0], None) + unioned: Any = Union[tuple(python_types)] # type: ignore[valid-type] + return (unioned, None) + + if isinstance(json_type, str): + return ( + self._type_for_simple(json_type, schema, prop_name, parent_name), + None, + ) + + # No type declared → permissive. + return (Any, None) + + def _type_for_simple( + self, + json_type: str, + schema: dict[str, Any], + prop_name: str, + parent_name: str, + ) -> Any: + """Translate a single JSON-Schema primitive ``type`` token.""" + if json_type in self._PRIMITIVE_TYPES: + return self._PRIMITIVE_TYPES[json_type] + if json_type == "array": + items = schema.get("items") + if items is None: + return List[Any] + item_type, _ = self._field_for(items, f"{prop_name}_item", parent_name) + return List[item_type] # type: ignore[valid-type] + if json_type == "object": + inline_name = self._inline_object_name(parent_name, prop_name) + return self.build_object(schema, inline_name) + raise JsonSchemaLoadError( + f"Unsupported JSON Schema type '{json_type}' for property '{prop_name}'." 
+        )
+
+    def _resolve_ref(self, ref: str) -> Any:
+        """Resolve a local JSON-Pointer reference into a generated model."""
+        prefix_defs = "#/$defs/"
+        prefix_definitions = "#/definitions/"
+        if ref.startswith(prefix_defs):
+            name = ref[len(prefix_defs):]
+        elif ref.startswith(prefix_definitions):
+            name = ref[len(prefix_definitions):]
+        else:
+            raise JsonSchemaLoadError(
+                f"Only local '#/$defs/...' and '#/definitions/...' refs are "
+                f"supported (got '{ref}')."
+            )
+
+        if name in self._models:
+            return self._models[name]
+
+        if name in self._in_progress:
+            # Cycle: emit a forward reference; resolved later by
+            # resolve_forward_refs().
+            return ForwardRef(name)
+
+        if name not in self._defs:
+            raise JsonSchemaLoadError(
+                f"Reference '{ref}' does not resolve to a known $defs entry."
+            )
+
+        sub_schema = self._defs[name]
+        if not isinstance(sub_schema, dict):
+            raise JsonSchemaLoadError(
+                f"$defs entry '{name}' must be a JSON object."
+            )
+
+        sub_type = sub_schema.get("type")
+        if sub_type == "object" or "properties" in sub_schema:
+            return self.build_object(sub_schema, name)
+
+        # Non-object $defs entry (rare): translate as a field type. The
+        # result is not cached in self._models, which holds only BaseModel
+        # subclasses; repeated refs to a scalar alias are simply
+        # re-translated.
+        translated, _ = self._field_for(sub_schema, name, parent_name=name)
+ return translated + + # -- name helpers ---------------------------------------------------- + + def _dedupe_name(self, candidate: str) -> str: + """Ensure a freshly generated model name does not collide.""" + if candidate not in self._models and candidate not in self._in_progress: + return candidate + i = 2 + while f"{candidate}_{i}" in self._models or f"{candidate}_{i}" in self._in_progress: + i += 1 + return f"{candidate}_{i}" + + @staticmethod + def _inline_object_name(parent_name: str, prop_name: str) -> str: + """Synthesize a stable name for an inline object schema.""" + camel = "".join(part.capitalize() for part in prop_name.split("_") if part) + return f"{parent_name}_{camel or 'Inline'}" diff --git a/src/ContentProcessor/tests/unit/pipeline/test_schema.py b/src/ContentProcessor/tests/unit/pipeline/test_schema.py index e5c18ef1..bbdb46b6 100644 --- a/src/ContentProcessor/tests/unit/pipeline/test_schema.py +++ b/src/ContentProcessor/tests/unit/pipeline/test_schema.py @@ -22,8 +22,8 @@ def test_construction(self): Id="s-1", ClassName="InvoiceSchema", Description="Invoice extraction", - FileName="invoice_schema.py", - ContentType="application/pdf", + FileName="invoice_schema.json", + ContentType="application/json", ) assert schema.Id == "s-1" assert schema.ClassName == "InvoiceSchema" @@ -46,8 +46,8 @@ def test_get_schema_returns_schema(self, mock_helper_cls): "Id": "s-1", "ClassName": "MySchema", "Description": "desc", - "FileName": "file.py", - "ContentType": "text/plain", + "FileName": "file.json", + "ContentType": "application/json", } ] result = Schema.get_schema("connstr", "db", "coll", "s-1") diff --git a/src/ContentProcessor/tests/unit/utils/test_remote_schema_loader.py b/src/ContentProcessor/tests/unit/utils/test_remote_schema_loader.py new file mode 100644 index 00000000..7102fba1 --- /dev/null +++ b/src/ContentProcessor/tests/unit/utils/test_remote_schema_loader.py @@ -0,0 +1,282 @@ +# Copyright (c) Microsoft Corporation. 
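Standing back from the diff for a moment: the builder above is essentially a recursive wrapper around `pydantic.create_model`. A stripped-down sketch of that core mapping, assuming Pydantic v2 and covering only flat objects with primitive and array properties (`model_from_schema` is an illustrative name, not part of the module):

```python
from typing import Any, List, Optional
from pydantic import BaseModel, Field, create_model

_PRIMITIVES = {"string": str, "integer": int, "number": float, "boolean": bool}


def model_from_schema(doc: dict, name: str) -> type[BaseModel]:
    """Build a Pydantic model from a flat object-typed JSON Schema node."""
    fields: dict[str, Any] = {}
    required = set(doc.get("required", []))
    for prop, sub in doc.get("properties", {}).items():
        py: Any = _PRIMITIVES.get(sub.get("type"), Any)
        if sub.get("type") == "array":
            item = sub.get("items", {}).get("type")
            py = List[_PRIMITIVES.get(item, Any)]
        # Required fields get `...` (no default); optional ones default to None.
        if prop in required:
            fields[prop] = (py, Field(default=..., description=sub.get("description")))
        else:
            fields[prop] = (Optional[py], Field(default=None, description=sub.get("description")))
    return create_model(name, **fields)


Invoice = model_from_schema(
    {
        "type": "object",
        "properties": {
            "id": {"type": "string"},
            "amount": {"type": "number"},
            "tags": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["id"],
    },
    "Invoice",
)
inv = Invoice.model_validate({"id": "INV1", "amount": 12.5, "tags": ["auto"]})
```

The production `_ModelBuilder` layers `$ref` memoization, `ForwardRef` cycle handling, `anyOf` unions, and enum-to-`Literal` translation on top of this core.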
+# Licensed under the MIT License. + +"""Tests for the JSON-Schema-based remote schema loader. + +These tests intentionally avoid touching Azure and only exercise +:func:`build_model_from_schema`, the in-memory translator that +:func:`load_schema_from_blob_json` delegates to. +""" + +from __future__ import annotations + +import json +from pathlib import Path + +import pytest +from pydantic import BaseModel + +from libs.utils.remote_schema_loader import ( + JsonSchemaLoadError, + build_model_from_schema, +) + +#: Repo-relative path to the golden JSON schema. +_GOLDEN_AUTOCLAIM = ( + Path(__file__).resolve().parents[4] + / "ContentProcessorAPI" + / "samples" + / "schemas" + / "autoclaim.json" +) + + +# --------------------------------------------------------------------------- +# Construction +# --------------------------------------------------------------------------- + + +def test_builds_simple_object_model(): + schema = { + "type": "object", + "title": "Invoice", + "properties": { + "id": {"type": "string"}, + "amount": {"type": "number"}, + "paid": {"type": "boolean"}, + }, + "required": ["id"], + } + model = build_model_from_schema(schema, "Invoice") + + assert issubclass(model, BaseModel) + instance = model.model_validate({"id": "INV1", "amount": 12.5, "paid": True}) + assert instance.id == "INV1" + assert instance.amount == 12.5 + + with pytest.raises(Exception): + model.model_validate({}) # missing required 'id' + + +def test_supports_nullable_via_anyof(): + schema = { + "type": "object", + "properties": { + "name": {"anyOf": [{"type": "string"}, {"type": "null"}]}, + }, + } + model = build_model_from_schema(schema, "X") + instance = model.model_validate({"name": None}) + assert instance.name is None + + +def test_supports_nullable_via_type_array(): + schema = { + "type": "object", + "properties": { + "name": {"type": ["string", "null"]}, + }, + } + model = build_model_from_schema(schema, "X") + assert model.model_validate({"name": None}).name is None + 
assert model.model_validate({"name": "ok"}).name == "ok" + + +def test_supports_arrays_of_primitives(): + schema = { + "type": "object", + "properties": { + "tags": {"type": "array", "items": {"type": "string"}}, + }, + } + model = build_model_from_schema(schema, "X") + instance = model.model_validate({"tags": ["a", "b"]}) + assert instance.tags == ["a", "b"] + + +def test_supports_inline_nested_object(): + schema = { + "type": "object", + "properties": { + "address": { + "type": "object", + "properties": { + "city": {"type": "string"}, + }, + }, + }, + } + model = build_model_from_schema(schema, "Person") + instance = model.model_validate({"address": {"city": "Macon"}}) + assert instance.address.city == "Macon" + + +def test_supports_refs_and_defs(): + schema = { + "$defs": { + "Address": { + "type": "object", + "properties": { + "city": {"type": "string"}, + }, + } + }, + "type": "object", + "properties": { + "primary": {"$ref": "#/$defs/Address"}, + "secondary": {"$ref": "#/$defs/Address"}, + }, + } + model = build_model_from_schema(schema, "Contact") + + instance = model.model_validate({ + "primary": {"city": "Macon"}, + "secondary": {"city": "Atlanta"}, + }) + # Both refs resolved to the *same* generated class. 
+ assert type(instance.primary) is type(instance.secondary) + + +def test_supports_enum_via_literal(): + schema = { + "type": "object", + "properties": { + "tier": {"enum": ["bronze", "silver", "gold"]}, + }, + } + model = build_model_from_schema(schema, "Tier") + assert model.model_validate({"tier": "gold"}).tier == "gold" + with pytest.raises(Exception): + model.model_validate({"tier": "platinum"}) + + +# --------------------------------------------------------------------------- +# Failure modes +# --------------------------------------------------------------------------- + + +def test_rejects_unknown_ref_target(): + schema = { + "type": "object", + "properties": {"a": {"$ref": "#/$defs/Missing"}}, + } + with pytest.raises(JsonSchemaLoadError) as exc: + build_model_from_schema(schema, "X") + assert "$defs" in str(exc.value) + + +def test_rejects_external_ref(): + schema = { + "type": "object", + "properties": {"a": {"$ref": "https://example.com/schema.json"}}, + } + with pytest.raises(JsonSchemaLoadError): + build_model_from_schema(schema, "X") + + +# --------------------------------------------------------------------------- +# Golden-equivalence: the JSON schema generated from autoclaim.py builds a +# model that round-trips an LLM-style payload to the same dict that the +# legacy autoclaim.py would produce. 
+# --------------------------------------------------------------------------- + + +def _representative_payload() -> dict: + return { + "insurance_company": "Contoso Insurance", + "claim_number": "CLM987654", + "policy_number": "AUTO123456", + "policyholder_information": { + "name": "Chad Brooks", + "address": { + "street": "123 Main St", + "city": "Macon", + "state": "GA", + "postal_code": "31201", + "country": "USA", + }, + "phone": "(555) 555-1212", + "email": "chad.brooks@example.com", + }, + "policy_details": { + "coverage_type": "Auto - Comprehensive", + "effective_date": "2025-01-01", + "expiration_date": "2025-12-31", + "deductible": 500.0, + "deductible_currency": "USD", + }, + "incident_details": { + "date_of_loss": "2025-11-28", + "time_of_loss": "14:15", + "location": "Parking lot", + "cause_of_loss": "Low-speed collision", + "description": "Minor dent", + "police_report_filed": True, + "police_report_number": "GA-20251128-CR", + }, + "vehicle_information": { + "year": 2022, + "make": "Toyota", + "model": "Camry", + "trim": "SE", + "vin": "4T1G11AK2NU123456", + "license_plate": "GA-ABC123", + "mileage": 28450, + }, + "damage_assessment": { + "items": [ + { + "item_description": "Right-front quarter panel", + "date_acquired": "2022-03-15", + "cost_new": 1200.0, + "cost_new_currency": "USD", + "repair_estimate": 350.0, + "repair_estimate_currency": "USD", + } + ], + "total_estimated_repair": 500.0, + "total_estimated_repair_currency": "USD", + }, + "supporting_documents": { + "photos_of_damage": True, + "police_report_copy": True, + "repair_shop_estimate": True, + "other": [], + }, + "declaration": { + "statement": "I declare...", + "signature": {"signatory": "Chad Brooks", "is_signed": True}, + "date": "2025-12-01", + }, + "submission_instructions": { + "submission_email": "claims@contoso.com", + "portal_url": None, + "notes": None, + }, + } + + +def test_golden_autoclaim_round_trip(): + document = 
json.loads(_GOLDEN_AUTOCLAIM.read_text(encoding="utf-8")) + model = build_model_from_schema(document, "AutoInsuranceClaimForm") + + payload = _representative_payload() + instance = model.model_validate(payload) + dumped = instance.model_dump() + + # Every field round-trips and nested objects produced the same shape. + assert dumped["insurance_company"] == "Contoso Insurance" + assert dumped["policyholder_information"]["address"]["city"] == "Macon" + assert dumped["damage_assessment"]["items"][0]["cost_new"] == 1200.0 + assert dumped["declaration"]["signature"]["is_signed"] is True + + +def test_golden_autoclaim_emits_json_schema(): + document = json.loads(_GOLDEN_AUTOCLAIM.read_text(encoding="utf-8")) + model = build_model_from_schema(document, "AutoInsuranceClaimForm") + + # The generated model must be able to emit its own JSON schema; this is + # what map_handler.py passes to the LLM via ``model_json_schema()``. + out_schema = model.model_json_schema() + assert out_schema.get("type") == "object" + assert "properties" in out_schema diff --git a/src/ContentProcessor/uv.lock b/src/ContentProcessor/uv.lock index 3e32f808..77b91fd3 100644 --- a/src/ContentProcessor/uv.lock +++ b/src/ContentProcessor/uv.lock @@ -1727,7 +1727,7 @@ wheels = [ [[package]] name = "jsonschema" -version = "4.26.0" +version = "4.25.1" source = { registry = "https://pypi.org/simple" } dependencies = [ { name = "attrs" }, @@ -1735,9 +1735,9 @@ dependencies = [ { name = "referencing" }, { name = "rpds-py" }, ] -sdist = { url = "https://files.pythonhosted.org/packages/b3/fc/e067678238fa451312d4c62bf6e6cf5ec56375422aee02f9cb5f909b3047/jsonschema-4.26.0.tar.gz", hash = "sha256:0c26707e2efad8aa1bfc5b7ce170f3fccc2e4918ff85989ba9ffa9facb2be326", size = 366583, upload-time = "2026-01-07T13:41:07.246Z" } +sdist = { url = "https://files.pythonhosted.org/packages/74/69/f7185de793a29082a9f3c7728268ffb31cb5095131a9c139a74078e27336/jsonschema-4.25.1.tar.gz", hash = 
"sha256:e4a9655ce0da0c0b67a085847e00a3a51449e1157f4f75e9fb5aa545e122eb85", size = 357342, upload-time = "2025-08-18T17:03:50.038Z" } wheels = [ - { url = "https://files.pythonhosted.org/packages/69/90/f63fb5873511e014207a475e2bb4e8b2e570d655b00ac19a9a0ca0a385ee/jsonschema-4.26.0-py3-none-any.whl", hash = "sha256:d489f15263b8d200f8387e64b4c3a75f06629559fb73deb8fdfb525f2dab50ce", size = 90630, upload-time = "2026-01-07T13:41:05.306Z" }, + { url = "https://files.pythonhosted.org/packages/bf/9c/8c95d856233c1f82500c2450b8c68576b4cf1c871db3afac5c34ff84e6fd/jsonschema-4.25.1-py3-none-any.whl", hash = "sha256:3fba0169e345c7175110351d456342c364814cfcf3b964ba4587f22915230a63", size = 90040, upload-time = "2025-08-18T17:03:48.373Z" }, ] [[package]] diff --git a/src/ContentProcessorAPI/app/routers/logics/schema_validator.py b/src/ContentProcessorAPI/app/routers/logics/schema_validator.py new file mode 100644 index 00000000..b3c5e441 --- /dev/null +++ b/src/ContentProcessorAPI/app/routers/logics/schema_validator.py @@ -0,0 +1,175 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. + +"""Validate uploaded JSON Schema descriptors used by the content-processing pipeline. + +A JSON schema descriptor is treated as **data**: it is parsed (never +executed), checked against the JSON Schema Draft 2020-12 meta-schema, and +required to use only a small set of project-specific custom keywords. + +This module is intentionally side-effect free; it does not touch storage +or Cosmos. The router is responsible for calling :func:`validate_json_schema` +and acting on the returned errors. +""" + +from __future__ import annotations + +import json +from typing import Any, Iterable + +from jsonschema import Draft202012Validator +from jsonschema.exceptions import SchemaError + +#: Maximum size in bytes for an uploaded JSON schema. Schemas are config +#: artefacts; a generous cap of 1 MB matches the legacy ``.py`` limit. 
+MAX_SCHEMA_BYTES: int = 1 * 1024 * 1024 + +ALLOWED_CPS_KEYWORDS: frozenset[str] = frozenset() + + +class SchemaValidationError(ValueError): + """Raised when an uploaded JSON schema fails validation. + + Attributes: + errors: Human-readable list of violations. + """ + + def __init__(self, errors: list[str]): + self.errors = errors + super().__init__("; ".join(errors) if errors else "Invalid JSON schema") + + +def validate_json_schema(raw_bytes: bytes) -> dict[str, Any]: + """Validate the bytes of an uploaded JSON Schema descriptor. + + Args: + raw_bytes: Uploaded file contents. + + Returns: + The parsed schema document as a ``dict`` (only on success). + + Raises: + SchemaValidationError: If the bytes are too large, are not valid + JSON, do not conform to JSON Schema Draft 2020-12, or use + disallowed custom extension keywords. + """ + errors: list[str] = [] + + if raw_bytes is None: + raise SchemaValidationError(["Empty schema upload."]) + + if len(raw_bytes) > MAX_SCHEMA_BYTES: + raise SchemaValidationError([ + f"Schema is too large ({len(raw_bytes)} bytes; max {MAX_SCHEMA_BYTES})." + ]) + + try: + document = json.loads(raw_bytes.decode("utf-8")) + except UnicodeDecodeError as exc: + raise SchemaValidationError([f"Schema must be UTF-8 encoded: {exc}"]) from exc + except json.JSONDecodeError as exc: + raise SchemaValidationError([f"Schema is not valid JSON: {exc.msg}"]) from exc + + if not isinstance(document, dict): + raise SchemaValidationError([ + "Schema root must be a JSON object describing the model." + ]) + + # Reject schemas without a usable type. We only support object roots + # because the pipeline materialises a Pydantic model from them. + root_type = document.get("type") + if root_type != "object": + errors.append( + "Schema root must declare 'type': 'object' " + "(got %r)." 
% (root_type,) + ) + + if "properties" not in document or not isinstance( + document.get("properties"), dict + ): + errors.append("Schema root must declare a 'properties' object.") + + # Validate the document itself is a syntactically valid Draft 2020-12 schema. + try: + Draft202012Validator.check_schema(document) + except SchemaError as exc: + errors.append(f"Not a valid JSON Schema (Draft 2020-12): {exc.message}") + + # Walk the document and reject unknown ``x-`` extension keywords. + for path, key in _walk_extension_keywords(document): + if key not in ALLOWED_CPS_KEYWORDS: + errors.append( + f"Unsupported extension keyword '{key}' at {path or ''}. " + f"Allowed: {sorted(ALLOWED_CPS_KEYWORDS)}." + ) + + # Reject unsupported $ref values. The runtime loader only supports local + # references of the form ``#/$defs/...`` or ``#/definitions/...``. + for path, ref in _walk_refs(document): + if not (ref.startswith("#/$defs/") or ref.startswith("#/definitions/")): + errors.append( + f"Unsupported $ref '{ref}' at {path or ''}. " + "Only '#/$defs/...' and '#/definitions/...' references are supported." + ) + + if errors: + raise SchemaValidationError(errors) + + return document + + +def derive_class_name(document: dict[str, Any], fallback: str) -> str: + """Derive a stable class name for the schema document. + + The schema's ``title`` is preferred (matches Pydantic conventions); + otherwise the supplied filename stem is used. Any non-identifier + characters in the fallback are replaced with underscores so the + result is always a valid Python identifier. + + Args: + document: Parsed JSON schema document. + fallback: Filename stem (without extension) to use if no title. + + Returns: + A non-empty string suitable for use as a Pydantic model name. 
+ """ + title = document.get("title") + if isinstance(title, str) and title.strip(): + candidate = title.strip() + else: + candidate = fallback + + cleaned = "".join(ch if ch.isalnum() or ch == "_" else "_" for ch in candidate) + if not cleaned or not (cleaned[0].isalpha() or cleaned[0] == "_"): + cleaned = "Schema_" + cleaned + return cleaned + + +def _walk_extension_keywords( + node: Any, path: str = "" +) -> Iterable[tuple[str, str]]: + """Yield every ``(path, key)`` for keys starting with ``x-`` anywhere in *node*.""" + if isinstance(node, dict): + for key, value in node.items(): + if isinstance(key, str) and key.startswith("x-"): + yield path, key + child_path = f"{path}.{key}" if path else str(key) + yield from _walk_extension_keywords(value, child_path) + elif isinstance(node, list): + for idx, item in enumerate(node): + yield from _walk_extension_keywords(item, f"{path}[{idx}]") + + +def _walk_refs( + node: Any, path: str = "" +) -> Iterable[tuple[str, str]]: + """Yield every ``(path, ref_value)`` for ``$ref`` keys anywhere in *node*.""" + if isinstance(node, dict): + if "$ref" in node and isinstance(node["$ref"], str): + yield path, node["$ref"] + for key, value in node.items(): + child_path = f"{path}.{key}" if path else str(key) + yield from _walk_refs(value, child_path) + elif isinstance(node, list): + for idx, item in enumerate(node): + yield from _walk_refs(item, f"{path}[{idx}]") diff --git a/src/ContentProcessorAPI/app/routers/logics/schemavault.py b/src/ContentProcessorAPI/app/routers/logics/schemavault.py index f97663c4..e0227cc1 100644 --- a/src/ContentProcessorAPI/app/routers/logics/schemavault.py +++ b/src/ContentProcessorAPI/app/routers/logics/schemavault.py @@ -67,7 +67,13 @@ def Add(self, file: UploadFile, schema: Schema) -> Schema: self.mongoHelper.insert_document(schema.model_dump(mode="json")) return schema - def Update(self, file: UploadFile, schema_id: str, class_name: str) -> Schema: + def Update( + self, + file: UploadFile, + 
schema_id: str, + class_name: str, + storage_format: str = "json", + ) -> Schema: """Replace the schema file in blob storage and update Cosmos metadata.""" schemas = self.mongoHelper.find_document(query={"Id": schema_id}) if not schemas: @@ -79,7 +85,8 @@ def Update(self, file: UploadFile, schema_id: str, class_name: str) -> Schema: ) schema_object.ClassName = class_name - schema_object.ContentType = file.content_type + schema_object.ContentType = "application/json" + schema_object.Format = storage_format schema_object.Updated_On = result["date"] self.mongoHelper.update_document( diff --git a/src/ContentProcessorAPI/app/routers/models/schmavault/model.py b/src/ContentProcessorAPI/app/routers/models/schmavault/model.py index c8045220..6c500f27 100644 --- a/src/ContentProcessorAPI/app/routers/models/schmavault/model.py +++ b/src/ContentProcessorAPI/app/routers/models/schmavault/model.py @@ -5,7 +5,7 @@ import datetime import json -from typing import Optional +from typing import Literal, Optional from pydantic import BaseModel, ConfigDict, Field, model_validator @@ -15,10 +15,14 @@ class Schema(BaseModel): Attributes: Id: Unique schema identifier. - ClassName: Python class name of the schema. + ClassName: Class name of the schema (the JSON Schema ``title`` + field, or a sanitised fallback derived from the filename). Description: Human-readable description. FileName: Source filename for the schema definition. ContentType: Expected content/MIME type. + Format: Storage format of the schema artifact. Always + ``"json"`` — declarative JSON Schema descriptors are the + only supported format. Created_On: UTC timestamp when the schema was registered. Updated_On: UTC timestamp of the last update. 
""" @@ -28,6 +32,7 @@ class Schema(BaseModel): Description: str FileName: str ContentType: str + Format: Literal["json"] = Field(default="json") Created_On: Optional[datetime.datetime] = Field(default=None) Updated_On: Optional[datetime.datetime] = Field(default=None) model_config = ConfigDict(from_attributes=True) diff --git a/src/ContentProcessorAPI/app/routers/schemavault.py b/src/ContentProcessorAPI/app/routers/schemavault.py index 93e4e2b7..5741faec 100644 --- a/src/ContentProcessorAPI/app/routers/schemavault.py +++ b/src/ContentProcessorAPI/app/routers/schemavault.py @@ -12,6 +12,11 @@ from fastapi.responses import StreamingResponse from app.libs.base.typed_fastapi import TypedFastAPI +from app.routers.logics.schema_validator import ( + SchemaValidationError, + derive_class_name, + validate_json_schema, +) from app.routers.logics.schemavault import Schemas from app.routers.models.schmavault.model import ( Schema, @@ -28,6 +33,47 @@ responses={404: {"description": "Not found"}}, ) +#: Filename extensions accepted by the schema-vault upload routes. +#: Only ``.json`` (declarative JSON Schema) is supported. The legacy +#: ``.py`` (executable Pydantic class) format was removed because the +#: worker would ``exec`` uploaded code, exposing an RCE primitive +#: against any caller able to register a schema. +_ALLOWED_EXTENSIONS: tuple[str, ...] = (".json",) +_MAX_UPLOAD_BYTES: int = 1 * 1024 * 1024 + + +def _validate_upload(file: UploadFile) -> tuple[str, str]: + """Common upload checks for ``POST`` and ``PUT`` schema endpoints. + + Returns a ``(safe_filename, extension)`` tuple. Raises ``HTTPException`` + with the appropriate status on any failure. 
+ """ + try: + safe_filename = sanitize_filename(file.filename) + except ValueError: + raise HTTPException(status_code=400, detail="Filename is too long.") + + extension = os.path.splitext(safe_filename)[1].lower() + if extension not in _ALLOWED_EXTENSIONS: + raise HTTPException( + status_code=415, + detail=( + "Unsupported schema file type. Only .json schema files " + "are accepted; legacy .py uploads are disabled." + ), + ) + + size_bytes = get_upload_size_bytes(file) + if size_bytes is None: + raise HTTPException(status_code=400, detail="Unable to determine upload size.") + + if size_bytes > _MAX_UPLOAD_BYTES: + raise HTTPException( + status_code=413, detail="Schema file is too large (max 1 MB)." + ) + + return safe_filename, extension + @router.get( "/", @@ -61,25 +107,33 @@ async def Get_All_Registered_Schema( response_model=Schema, summary="Register a schema", description=""" - Registers a new schema file (`.py`) and stores its metadata in the Schema Vault. + Registers a new schema file (`.json`) and stores its metadata + in the Schema Vault. The request must be sent as `multipart/form-data` with: - a JSON part (named `data`) - a file part (named `file`) Constraints: - - Only `.py` files are accepted. + - Only `.json` (declarative JSON Schema) files are accepted. - Max size: 1 MB. + For `.json` uploads: + - Must be a valid JSON Schema (Draft 2020-12) with `type: "object"` + and a `properties` block. + - The `ClassName` field in the request body is ignored if the JSON + document declares a `title`; otherwise the filename stem is used. + ## Parameters - - **ClassName** (body): Schema class name contained in the uploaded file. + - **ClassName** (body): Schema class name. Used as a fallback for + `.json` uploads without a `title`. - **Description** (body): Human-readable description. - - **file** (form): `.py` schema file (max 1 MB). + - **file** (form): `.json` schema file (max 1 MB). 
## Example Request Body multipart/form-data - `data`: `{ "ClassName": "InvoiceSchema", "Description": "Extract invoice fields" }` - - `file`: `` + - `file`: `` """, ) async def Register_Schema( @@ -87,40 +141,36 @@ async def Register_Schema( file: UploadFile = File(...), request: Request = None, ) -> Schema: - """Register a new schema file (.py) into the vault.""" + """Register a new schema file into the vault.""" app: TypedFastAPI = request.app # type: ignore schemas: Schemas = app.app_context.get_service(Schemas) - try: - safe_filename = sanitize_filename(file.filename) - except ValueError: - raise HTTPException(status_code=400, detail="Filename is too long.") - extension = os.path.splitext(safe_filename)[1].lower() - if extension != ".py": - raise HTTPException( - status_code=415, - detail="Unsupported schema file type. Only .py schema files are supported.", - ) - - size_bytes = get_upload_size_bytes(file) - if size_bytes is None: - raise HTTPException(status_code=400, detail="Unable to determine upload size.") + safe_filename, _ = _validate_upload(file) - # Schemas are small config artifacts; keep a conservative cap. - if size_bytes > 1 * 1024 * 1024: + raw = await file.read() + await file.seek(0) + try: + document = validate_json_schema(raw) + except SchemaValidationError as exc: raise HTTPException( - status_code=413, detail="Schema file is too large (max 1 MB)." 
- ) + status_code=400, + detail={"message": "Invalid JSON schema.", "errors": exc.errors}, + ) from exc + + fallback = os.path.splitext(safe_filename)[0] + class_name = derive_class_name(document, fallback=data.ClassName or fallback) + content_type = "application/json" return schemas.Add( file, Schema( Id=str(uuid.uuid4()), - ClassName=data.ClassName, + ClassName=class_name, Description=data.Description, FileName=safe_filename, - ContentType=file.content_type, + ContentType=content_type, + Format="json", ), ) @@ -130,25 +180,27 @@ async def Register_Schema( response_model=Schema, summary="Update a schema", description=""" - Updates an existing registered schema (`.py` file) and associated metadata. + Updates an existing registered schema (`.json` file) and + associated metadata. The request must be sent as `multipart/form-data` with: - a JSON part (named `data`) - a file part (named `file`) Constraints: - - Only `.py` files are accepted. + - Only `.json` files are accepted. - Max size: 1 MB. ## Parameters - **SchemaId** (body): Schema ID to update. - - **ClassName** (body): Updated class name. - - **file** (form): New `.py` schema file (max 1 MB). + - **ClassName** (body): Updated class name (fallback for `.json` + schemas without a `title`). + - **file** (form): New `.json` schema file (max 1 MB). ## Example Request Body multipart/form-data - `data`: `{ "SchemaId": "", "ClassName": "InvoiceSchema" }` - - `file`: `` + - `file`: `` """, ) async def Update_Schema( @@ -158,29 +210,23 @@ async def Update_Schema( ) -> Schema: """Update an existing schema with a new file.""" app: TypedFastAPI = request.app # type: ignore - try: - safe_filename = sanitize_filename(file.filename) - except ValueError: - raise HTTPException(status_code=400, detail="Filename is too long.") - extension = os.path.splitext(safe_filename)[1].lower() - if extension != ".py": - raise HTTPException( - status_code=415, - detail="Unsupported schema file type. 
Only .py schema files are supported.", - ) + safe_filename, _ = _validate_upload(file) - size_bytes = get_upload_size_bytes(file) - if size_bytes is None: - raise HTTPException(status_code=400, detail="Unable to determine upload size.") - - if size_bytes > 1 * 1024 * 1024: + raw = await file.read() + await file.seek(0) + try: + document = validate_json_schema(raw) + except SchemaValidationError as exc: raise HTTPException( - status_code=413, detail="Schema file is too large (max 1 MB)." - ) + status_code=400, + detail={"message": "Invalid JSON schema.", "errors": exc.errors}, + ) from exc + fallback = os.path.splitext(safe_filename)[0] + class_name = derive_class_name(document, fallback=data.ClassName or fallback) schemas: Schemas = app.app_context.get_service(Schemas) - return schemas.Update(file, data.SchemaId, data.ClassName) + return schemas.Update(file, data.SchemaId, class_name, "json") @router.delete( diff --git a/src/ContentProcessorAPI/app/tests/logics/test_schema_validator.py b/src/ContentProcessorAPI/app/tests/logics/test_schema_validator.py new file mode 100644 index 00000000..aceb3a6d --- /dev/null +++ b/src/ContentProcessorAPI/app/tests/logics/test_schema_validator.py @@ -0,0 +1,143 @@ +# Copyright (c) Microsoft Corporation. +# Licensed under the MIT License. 
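Taken together, the upload path is: size gate, UTF-8/JSON parse, then structural checks, and only then persistence. A stdlib-only sketch of that gate under those assumptions (the real route additionally runs the Draft 2020-12 meta-schema check via `jsonschema` plus the `$ref` and extension-keyword walks; `validate_upload` is a hypothetical name):

```python
import json


def validate_upload(raw: bytes, max_bytes: int = 1 * 1024 * 1024) -> dict:
    """Reject an uploaded schema before it is stored; return the parsed document."""
    # Size gate: schemas are small config artifacts.
    if len(raw) > max_bytes:
        raise ValueError(f"Schema is too large ({len(raw)} bytes; max {max_bytes}).")
    try:
        doc = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError(f"Schema is not valid UTF-8 JSON: {exc}") from exc
    # Only object roots with a properties block can become Pydantic models.
    if not isinstance(doc, dict) or doc.get("type") != "object":
        raise ValueError("Schema root must declare 'type': 'object'.")
    if not isinstance(doc.get("properties"), dict):
        raise ValueError("Schema root must declare a 'properties' object.")
    return doc
```

Failing fast here keeps invalid descriptors out of blob storage and Cosmos, so the worker-side loader only ever sees documents that already passed the gate.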
+ +"""Tests for the JSON Schema validator used by the schema vault upload routes.""" + +from __future__ import annotations + +import json +from pathlib import Path + +import pytest + +from app.routers.logics.schema_validator import ( + ALLOWED_CPS_KEYWORDS, + SchemaValidationError, + derive_class_name, + validate_json_schema, +) + + +SAMPLES_DIR = ( + Path(__file__).resolve().parents[3] / "samples" / "schemas" +) + + +def _minimal_object_schema(**extra) -> dict: + base = { + "type": "object", + "title": "Minimal", + "properties": {"name": {"type": "string"}}, + } + base.update(extra) + return base + + +def _bytes(doc) -> bytes: + return json.dumps(doc).encode("utf-8") + + +# --------------------------------------------------------------------------- +# Happy path +# --------------------------------------------------------------------------- + + +def test_validate_accepts_minimal_object_schema(): + document = validate_json_schema(_bytes(_minimal_object_schema())) + assert document["title"] == "Minimal" + + +def test_validate_accepts_autoclaim_golden(): + raw = (SAMPLES_DIR / "autoclaim.json").read_bytes() + document = validate_json_schema(raw) + assert document["title"] == "AutoInsuranceClaimForm" + assert document["type"] == "object" + + +# --------------------------------------------------------------------------- +# Failure modes +# --------------------------------------------------------------------------- + + +def test_validate_rejects_non_utf8_bytes(): + with pytest.raises(SchemaValidationError) as exc: + validate_json_schema(b"\xff\xfe\x00not utf-8") + assert "UTF-8" in str(exc.value) + + +def test_validate_rejects_non_json(): + with pytest.raises(SchemaValidationError) as exc: + validate_json_schema(b"not json at all") + assert "not valid JSON" in str(exc.value) + + +def test_validate_rejects_non_object_root(): + with pytest.raises(SchemaValidationError): + validate_json_schema(_bytes([1, 2, 3])) + + +def test_validate_rejects_missing_type_object(): + schema 
= {"title": "X", "properties": {"a": {"type": "string"}}} + with pytest.raises(SchemaValidationError) as exc: + validate_json_schema(_bytes(schema)) + assert "type" in str(exc.value) + + +def test_validate_rejects_missing_properties(): + schema = {"title": "X", "type": "object"} + with pytest.raises(SchemaValidationError) as exc: + validate_json_schema(_bytes(schema)) + assert "properties" in str(exc.value) + + +def test_validate_rejects_invalid_dialect(): + schema = _minimal_object_schema() + # ``type`` must be a string or array; this is a meta-schema violation. + schema["properties"]["name"] = {"type": "banana"} + with pytest.raises(SchemaValidationError) as exc: + validate_json_schema(_bytes(schema)) + assert "JSON Schema" in str(exc.value) + + +def test_validate_rejects_unknown_x_keyword(): + schema = _minimal_object_schema() + schema["x-evil-side-channel"] = "haha" + with pytest.raises(SchemaValidationError) as exc: + validate_json_schema(_bytes(schema)) + assert "x-evil-side-channel" in str(exc.value) + + +def test_validate_rejects_unknown_x_keyword_in_nested_property(): + schema = _minimal_object_schema() + schema["properties"]["name"]["x-cps-malicious"] = True + with pytest.raises(SchemaValidationError): + validate_json_schema(_bytes(schema)) + + +def test_validate_rejects_oversized_payload(): + big = "x" * (2 * 1024 * 1024) + schema = _minimal_object_schema(description=big) + with pytest.raises(SchemaValidationError) as exc: + validate_json_schema(_bytes(schema)) + assert "too large" in str(exc.value) + + +# --------------------------------------------------------------------------- +# derive_class_name +# --------------------------------------------------------------------------- + + +def test_derive_class_name_uses_title(): + assert derive_class_name({"title": "InvoiceSchema"}, fallback="x") == "InvoiceSchema" + + +def test_derive_class_name_falls_back_to_filename(): + assert derive_class_name({}, fallback="auto-claim") == "auto_claim" + + +def 
test_derive_class_name_sanitises_leading_digits(): + assert derive_class_name({}, fallback="9invoice") == "Schema_9invoice" + + +def test_allowed_keywords_constant_is_empty(): + assert len(ALLOWED_CPS_KEYWORDS) == 0 diff --git a/src/ContentProcessorAPI/app/tests/logics/test_schemasetvault_logic.py b/src/ContentProcessorAPI/app/tests/logics/test_schemasetvault_logic.py index 70ca3178..ce1f1e52 100644 --- a/src/ContentProcessorAPI/app/tests/logics/test_schemasetvault_logic.py +++ b/src/ContentProcessorAPI/app/tests/logics/test_schemasetvault_logic.py @@ -120,8 +120,8 @@ def test_add_schema_to_set(MockBlob, MockMongo, mock_app_context): "Id": "s1", "ClassName": "Invoice", "Description": "desc", - "FileName": "invoice.py", - "ContentType": "text/x-python", + "FileName": "invoice.json", + "ContentType": "application/json", } ] @@ -229,8 +229,8 @@ def test_get_all_schemas_in_set(MockBlob, MockMongo, mock_app_context): "Id": "s1", "ClassName": "Invoice", "Description": "d1", - "FileName": "invoice.py", - "ContentType": "text/x-python", + "FileName": "invoice.json", + "ContentType": "application/json", } ] diff --git a/src/ContentProcessorAPI/app/tests/logics/test_schemavault_logic.py b/src/ContentProcessorAPI/app/tests/logics/test_schemavault_logic.py index 1467c902..500c37cf 100644 --- a/src/ContentProcessorAPI/app/tests/logics/test_schemavault_logic.py +++ b/src/ContentProcessorAPI/app/tests/logics/test_schemavault_logic.py @@ -32,8 +32,8 @@ def test_get_all(MockBlob, MockMongo, mock_app_context): "Id": "s1", "ClassName": "Invoice", "Description": "desc", - "FileName": "invoice.py", - "ContentType": "text/x-python", + "FileName": "invoice.json", + "ContentType": "application/json", } ] @@ -55,20 +55,20 @@ def test_get_file(MockBlob, MockMongo, mock_app_context): "Id": "s1", "ClassName": "Invoice", "Description": "desc", - "FileName": "invoice.py", - "ContentType": "text/x-python", + "FileName": "invoice.json", + "ContentType": "application/json", } ] mock_blob = 
MockBlob.return_value - mock_blob.download_blob.return_value = b"class Invoice: pass" + mock_blob.download_blob.return_value = b'{"type": "object"}' from app.routers.logics.schemavault import Schemas schemas = Schemas(app_context=mock_app_context) result = schemas.GetFile("s1") - assert result["File"] == b"class Invoice: pass" - assert result["FileName"] == "invoice.py" - assert result["ContentType"] == "text/x-python" + assert result["File"] == b'{"type": "object"}' + assert result["FileName"] == "invoice.json" + assert result["ContentType"] == "application/json" @patch("app.routers.logics.schemavault.CosmosMongDBHelper") @@ -99,8 +99,8 @@ def test_add(MockBlob, MockMongo, mock_app_context): Id="s1", ClassName="Invoice", Description="desc", - FileName="invoice.py", - ContentType="text/x-python", + FileName="invoice.json", + ContentType="application/json", ) result = schemas.Add(file, schema) assert result.Created_On == "2025-01-01T00:00:00Z" @@ -116,8 +116,8 @@ def test_update(MockBlob, MockMongo, mock_app_context): "Id": "s1", "ClassName": "Old", "Description": "desc", - "FileName": "old.py", - "ContentType": "text/x-python", + "FileName": "old.json", + "ContentType": "application/json", } ] mock_blob = MockBlob.return_value @@ -127,7 +127,7 @@ def test_update(MockBlob, MockMongo, mock_app_context): schemas = Schemas(app_context=mock_app_context) file = MagicMock() - file.content_type = "text/x-python" + file.content_type = "application/json" result = schemas.Update(file, "s1", "NewClass") assert result.ClassName == "NewClass" mock_mongo.update_document.assert_called_once() @@ -155,8 +155,8 @@ def test_delete(MockBlob, MockMongo, mock_app_context): "Id": "s1", "ClassName": "Invoice", "Description": "desc", - "FileName": "invoice.py", - "ContentType": "text/x-python", + "FileName": "invoice.json", + "ContentType": "application/json", } ] diff --git a/src/ContentProcessorAPI/app/tests/models/test_schmavault_model.py 
b/src/ContentProcessorAPI/app/tests/models/test_schmavault_model.py index 09b0dbe4..3b3e6e41 100644 --- a/src/ContentProcessorAPI/app/tests/models/test_schmavault_model.py +++ b/src/ContentProcessorAPI/app/tests/models/test_schmavault_model.py @@ -22,8 +22,8 @@ def test_parse_dates_from_iso_string(self): Id="s1", ClassName="Invoice", Description="desc", - FileName="invoice.py", - ContentType="text/x-python", + FileName="invoice.json", + ContentType="application/json", Created_On="2025-01-01T00:00:00Z", Updated_On="2025-06-15T12:30:00Z", ) @@ -36,8 +36,8 @@ def test_parse_dates_none(self): Id="s1", ClassName="Invoice", Description="desc", - FileName="invoice.py", - ContentType="text/x-python", + FileName="invoice.json", + ContentType="application/json", ) assert schema.Created_On is None assert schema.Updated_On is None @@ -76,7 +76,7 @@ def test_to_dict(self): Status="Success", SchemaId="s1", ClassName="Invoice", - FileName="invoice.py", + FileName="invoice.json", ) d = resp.to_dict() assert d["Status"] == "Success" diff --git a/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py b/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py index 03c2134d..fb21a61f 100644 --- a/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py +++ b/src/ContentProcessorAPI/app/tests/routers/test_schemavault.py @@ -44,6 +44,11 @@ def __init__(self, schemas: Schemas): def create_scope(self): return _FakeScopeContextManager(_FakeScope(self._schemas)) + def get_service(self, service_type): + if service_type is Schemas: + return self._schemas + raise KeyError(service_type) + @pytest.fixture def client_and_schemas(): @@ -88,15 +93,10 @@ def test_get_registered_schema_file_by_schema_id_500_error(client_and_schemas): assert response.json() == {"detail": "Internal Server Error"} -def test_register_schema_accepts_py_and_sanitizes_filename(client_and_schemas): +def test_register_schema_rejects_py(client_and_schemas): + """Legacy .py uploads must be refused outright 
(RCE remediation).""" client, mock_schemas = client_and_schemas - mock_schemas.Add.return_value = { - "Id": "test-id", - "ClassName": "TestClass", - "Description": "Test description", - "FileName": "invoice.py", - "ContentType": "text/x-python", - } + mock_schemas.Add.reset_mock() files = { "file": ("C:/fakepath/invoice.py", b"class Invoice: pass\n", "text/x-python"), @@ -108,15 +108,11 @@ def test_register_schema_accepts_py_and_sanitizes_filename(client_and_schemas): } response = client.post("/schemavault/", files=files) - assert response.status_code == 200 - - # Ensure Add() is called with Schema.FileName sanitized to just the basename - add_args, _ = mock_schemas.Add.call_args - schema_obj = add_args[1] - assert schema_obj.FileName == "invoice.py" + assert response.status_code == 415 + assert mock_schemas.Add.call_count == 0 -def test_register_schema_rejects_non_py(client_and_schemas): +def test_register_schema_rejects_unsupported_extension(client_and_schemas): client, mock_schemas = client_and_schemas mock_schemas.Add.reset_mock() @@ -138,17 +134,18 @@ def test_update_schema_success(client_and_schemas): client, mock_schemas = client_and_schemas mock_schemas.Update.return_value = { "Id": "test-id", - "ClassName": "Updated", + "ClassName": "InvoiceSchema", "Description": "desc", - "FileName": "updated.py", - "ContentType": "text/x-python", + "FileName": "updated.json", + "ContentType": "application/json", + "Format": "json", } files = { - "file": ("updated.py", b"class Updated: pass\n", "text/x-python"), + "file": ("updated.json", _minimal_json_schema_bytes(), "application/json"), "data": ( None, - json.dumps({"SchemaId": "test-id", "ClassName": "Updated"}), + json.dumps({"SchemaId": "test-id", "ClassName": "InvoiceSchema"}), "application/json", ), } @@ -158,7 +155,23 @@ def test_update_schema_success(client_and_schemas): mock_schemas.Update.assert_called_once() -def test_update_schema_rejects_non_py(client_and_schemas): +def 
test_update_schema_rejects_py(client_and_schemas): + client, mock_schemas = client_and_schemas + + files = { + "file": ("updated.py", b"class Updated: pass\n", "text/x-python"), + "data": ( + None, + json.dumps({"SchemaId": "test-id", "ClassName": "X"}), + "application/json", + ), + } + + response = client.put("/schemavault/", files=files) + assert response.status_code == 415 + + +def test_update_schema_rejects_unsupported_extension(client_and_schemas): client, mock_schemas = client_and_schemas files = { @@ -177,7 +190,7 @@ def test_update_schema_rejects_non_py(client_and_schemas): def test_unregister_schema_success(client_and_schemas): client, mock_schemas = client_and_schemas mock_schemas.Delete.return_value = MagicMock( - Id="test-id", ClassName="TestClass", FileName="test.py" + Id="test-id", ClassName="TestClass", FileName="test.json" ) response = client.request( @@ -199,3 +212,158 @@ def test_unregister_schema_error(client_and_schemas): json={"SchemaId": "missing"}, ) assert response.status_code == 500 + + +# --------------------------------------------------------------------------- +# JSON-schema upload path (declarative format, replaces executable .py) +# --------------------------------------------------------------------------- + + +def _minimal_json_schema_bytes(title: str = "InvoiceSchema") -> bytes: + return json.dumps({ + "type": "object", + "title": title, + "properties": {"invoice_id": {"type": "string"}}, + }).encode("utf-8") + + +def test_register_schema_accepts_json(client_and_schemas): + client, mock_schemas = client_and_schemas + mock_schemas.Add.return_value = { + "Id": "test-id", + "ClassName": "InvoiceSchema", + "Description": "desc", + "FileName": "invoice.json", + "ContentType": "application/json", + "Format": "json", + } + + files = { + "file": ( + "invoice.json", + _minimal_json_schema_bytes(), + "application/json", + ), + "data": ( + None, + json.dumps({"ClassName": "ignored", "Description": "desc"}), + "application/json", + ), + } + + 
response = client.post("/schemavault/", files=files) + assert response.status_code == 200, response.text + + add_args, _ = mock_schemas.Add.call_args + schema_obj = add_args[1] + # Schema's title wins over the request body's ClassName. + assert schema_obj.ClassName == "InvoiceSchema" + assert schema_obj.Format == "json" + assert schema_obj.FileName == "invoice.json" + + +def test_register_schema_rejects_invalid_json(client_and_schemas): + client, mock_schemas = client_and_schemas + mock_schemas.Add.reset_mock() + + files = { + "file": ("schema.json", b"{not json", "application/json"), + "data": ( + None, + json.dumps({"ClassName": "X", "Description": "Y"}), + "application/json", + ), + } + + response = client.post("/schemavault/", files=files) + assert response.status_code == 400 + assert "errors" in response.json()["detail"] + assert mock_schemas.Add.call_count == 0 + + +def test_register_schema_rejects_json_without_object_root(client_and_schemas): + client, mock_schemas = client_and_schemas + mock_schemas.Add.reset_mock() + + files = { + "file": ( + "schema.json", + json.dumps({"type": "array"}).encode("utf-8"), + "application/json", + ), + "data": ( + None, + json.dumps({"ClassName": "X", "Description": "Y"}), + "application/json", + ), + } + + response = client.post("/schemavault/", files=files) + assert response.status_code == 400 + assert mock_schemas.Add.call_count == 0 + + +def test_register_schema_falls_back_to_filename_for_classname(client_and_schemas): + client, mock_schemas = client_and_schemas + mock_schemas.Add.return_value = { + "Id": "test-id", + "ClassName": "fallback", + "Description": "desc", + "FileName": "auto-claim.json", + "ContentType": "application/json", + "Format": "json", + } + + schema_bytes = json.dumps({ + "type": "object", + "properties": {"x": {"type": "string"}}, + }).encode("utf-8") + + files = { + "file": ("auto-claim.json", schema_bytes, "application/json"), + "data": ( + None, + json.dumps({"ClassName": "fallback", 
"Description": "desc"}), + "application/json", + ), + } + + response = client.post("/schemavault/", files=files) + assert response.status_code == 200, response.text + schema_obj = mock_schemas.Add.call_args[0][1] + # When the JSON has no title, the request-body ClassName is used as + # the fallback (after sanitisation to a Python identifier). + assert schema_obj.ClassName == "fallback" + assert schema_obj.Format == "json" + + +def test_update_schema_accepts_json(client_and_schemas): + client, mock_schemas = client_and_schemas + mock_schemas.Update.return_value = { + "Id": "test-id", + "ClassName": "InvoiceSchema", + "Description": "", + "FileName": "invoice.json", + "ContentType": "application/json", + "Format": "json", + } + + files = { + "file": ( + "invoice.json", + _minimal_json_schema_bytes(), + "application/json", + ), + "data": ( + None, + json.dumps({"SchemaId": "test-id", "ClassName": "x"}), + "application/json", + ), + } + + response = client.put("/schemavault/", files=files) + assert response.status_code == 200, response.text + args, _ = mock_schemas.Update.call_args + # Update is called with (file, schema_id, class_name, storage_format). 
+ assert args[2] == "InvoiceSchema" + assert args[3] == "json" diff --git a/src/ContentProcessorAPI/pyproject.toml b/src/ContentProcessorAPI/pyproject.toml index abc7791e..5244625d 100644 --- a/src/ContentProcessorAPI/pyproject.toml +++ b/src/ContentProcessorAPI/pyproject.toml @@ -28,6 +28,7 @@ dependencies = [ "azure-monitor-opentelemetry==1.8.7", "cryptography==46.0.7", "pyjwt==2.12.1", + "jsonschema==4.25.1", ] [dependency-groups] diff --git a/src/ContentProcessorAPI/requirements.txt b/src/ContentProcessorAPI/requirements.txt index 7e4b6d94..bdfaeb0c 100644 --- a/src/ContentProcessorAPI/requirements.txt +++ b/src/ContentProcessorAPI/requirements.txt @@ -25,6 +25,7 @@ httpx==0.28.1 idna==3.11 isodate==0.7.2 jinja2==3.1.6 +jsonschema==4.25.1 markdown-it-py==4.0.0 markupsafe==3.0.3 mdurl==0.1.2 diff --git a/src/ContentProcessorAPI/samples/schemas/autoclaim.json b/src/ContentProcessorAPI/samples/schemas/autoclaim.json new file mode 100644 index 00000000..cc7031b0 --- /dev/null +++ b/src/ContentProcessorAPI/samples/schemas/autoclaim.json @@ -0,0 +1,886 @@ +{ + "$defs": { + "AutoClaimAddress": { + "description": "A class representing an address used on an auto claim form.", + "properties": { + "street": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Street address, e.g. 123 Main St.", + "title": "Street" + }, + "city": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "City, e.g. Macon", + "title": "City" + }, + "state": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "State, e.g. GA", + "title": "State" + }, + "postal_code": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Postal code, e.g. 31201", + "title": "Postal Code" + }, + "country": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Country, e.g. 
USA", + "title": "Country" + } + }, + "required": [ + "street", + "city", + "state", + "postal_code", + "country" + ], + "title": "AutoClaimAddress", + "type": "object" + }, + "DamageAssessment": { + "description": "A class representing overall damage assessment.", + "properties": { + "items": { + "anyOf": [ + { + "items": { + "$ref": "#/$defs/DamageAssessmentItem" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "List of damage assessment line items", + "title": "Items" + }, + "total_estimated_repair": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Total estimated repair, e.g. 500.0", + "title": "Total Estimated Repair" + }, + "total_estimated_repair_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency of total_estimated_repair, e.g. USD", + "title": "Total Estimated Repair Currency" + } + }, + "required": [ + "items", + "total_estimated_repair", + "total_estimated_repair_currency" + ], + "title": "DamageAssessment", + "type": "object" + }, + "DamageAssessmentItem": { + "description": "A class representing a damage assessment line item.", + "properties": { + "item_description": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Damaged item/area description, e.g. Right-front quarter panel", + "title": "Item Description" + }, + "date_acquired": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Date acquired (if present), e.g. 2022-03-15", + "title": "Date Acquired" + }, + "cost_new": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Cost when new, e.g. 1200.0", + "title": "Cost New" + }, + "cost_new_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency of cost_new, e.g. 
USD", + "title": "Cost New Currency" + }, + "repair_estimate": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Repair estimate, e.g. 350.0", + "title": "Repair Estimate" + }, + "repair_estimate_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency of repair_estimate, e.g. USD", + "title": "Repair Estimate Currency" + } + }, + "required": [ + "item_description", + "date_acquired", + "cost_new", + "cost_new_currency", + "repair_estimate", + "repair_estimate_currency" + ], + "title": "DamageAssessmentItem", + "type": "object" + }, + "Declaration": { + "description": "A class representing the claim declaration.", + "properties": { + "statement": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Declaration statement text", + "title": "Statement" + }, + "signature": { + "anyOf": [ + { + "$ref": "#/$defs/Signature" + }, + { + "type": "null" + } + ], + "description": "Signature" + }, + "date": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Signature date, e.g. 2025-12-01", + "title": "Date" + } + }, + "required": [ + "statement", + "signature", + "date" + ], + "title": "Declaration", + "type": "object" + }, + "IncidentDetails": { + "description": "A class representing incident details.", + "properties": { + "date_of_loss": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Date of loss, e.g. 2025-11-28", + "title": "Date Of Loss" + }, + "time_of_loss": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Time of loss, e.g. 14:15", + "title": "Time Of Loss" + }, + "location": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Incident location, e.g. 
Parking lot near 123 Main Street, Macon, GA", + "title": "Location" + }, + "cause_of_loss": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Cause of loss, e.g. Low-speed collision with another vehicle", + "title": "Cause Of Loss" + }, + "description": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Incident description, e.g. Minor dent and paint scratches; no structural damage", + "title": "Description" + }, + "police_report_filed": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Whether a police report was filed", + "title": "Police Report Filed" + }, + "police_report_number": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Police report number, e.g. GA-20251128-CR", + "title": "Police Report Number" + } + }, + "required": [ + "date_of_loss", + "time_of_loss", + "location", + "cause_of_loss", + "description", + "police_report_filed", + "police_report_number" + ], + "title": "IncidentDetails", + "type": "object" + }, + "PolicyDetails": { + "description": "A class representing policy details.", + "properties": { + "coverage_type": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Coverage type, e.g. Auto \u2013 Comprehensive", + "title": "Coverage Type" + }, + "effective_date": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Policy effective date, e.g. 2025-01-01", + "title": "Effective Date" + }, + "expiration_date": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Policy expiration date, e.g. 2025-12-31", + "title": "Expiration Date" + }, + "deductible": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Deductible amount, e.g. 
500.0", + "title": "Deductible" + }, + "deductible_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency of the deductible, e.g. USD", + "title": "Deductible Currency" + } + }, + "required": [ + "coverage_type", + "effective_date", + "expiration_date", + "deductible", + "deductible_currency" + ], + "title": "PolicyDetails", + "type": "object" + }, + "PolicyholderInformation": { + "description": "A class representing policyholder information.", + "properties": { + "name": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Policyholder full name, e.g. Chad Brooks", + "title": "Name" + }, + "address": { + "anyOf": [ + { + "$ref": "#/$defs/AutoClaimAddress" + }, + { + "type": "null" + } + ], + "description": "Policyholder address, e.g. 123 Main Street, Macon, GA 31201" + }, + "phone": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Policyholder phone number, e.g. (555) 555-1212", + "title": "Phone" + }, + "email": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Policyholder email address, e.g. chad.brooks@example.com", + "title": "Email" + } + }, + "required": [ + "name", + "address", + "phone", + "email" + ], + "title": "PolicyholderInformation", + "type": "object" + }, + "Signature": { + "description": "A class representing a signature field.", + "properties": { + "signatory": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Name of the signatory", + "title": "Signatory" + }, + "is_signed": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Indicates if the form is signed. GPT should check whether it has signature in image files. 
if there is Sign, fill it up as True", + "title": "Is Signed" + } + }, + "required": [ + "signatory", + "is_signed" + ], + "title": "Signature", + "type": "object" + }, + "SubmissionInstructions": { + "description": "A class representing submission instructions.", + "properties": { + "submission_email": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Submission email address, e.g. claims@contosoinsurance.com", + "title": "Submission Email" + }, + "portal_url": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Claims portal URL, if present", + "title": "Portal Url" + }, + "notes": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Additional submission notes", + "title": "Notes" + } + }, + "required": [ + "submission_email", + "portal_url", + "notes" + ], + "title": "SubmissionInstructions", + "type": "object" + }, + "SupportingDocuments": { + "description": "A class representing supporting documents included with the claim.", + "properties": { + "photos_of_damage": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Whether photos of damage are included", + "title": "Photos Of Damage" + }, + "police_report_copy": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Whether a police report copy is included", + "title": "Police Report Copy" + }, + "repair_shop_estimate": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Whether a repair shop estimate is included", + "title": "Repair Shop Estimate" + }, + "other": { + "anyOf": [ + { + "items": { + "type": "string" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "Other supporting documents", + "title": "Other" + } + }, + "required": [ + "photos_of_damage", + "police_report_copy", + "repair_shop_estimate", + "other" + ], + "title": 
"SupportingDocuments", + "type": "object" + }, + "VehicleInformation": { + "description": "A class representing vehicle information.", + "properties": { + "year": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Vehicle year, e.g. 2022", + "title": "Year" + }, + "make": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle make, e.g. Toyota", + "title": "Make" + }, + "model": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle model, e.g. Camry", + "title": "Model" + }, + "trim": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle trim, e.g. SE", + "title": "Trim" + }, + "vin": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle VIN, e.g. 4T1G11AK2NU123456", + "title": "Vin" + }, + "license_plate": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "License plate, e.g. GA-ABC123", + "title": "License Plate" + }, + "mileage": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Mileage, e.g. 28450", + "title": "Mileage" + } + }, + "required": [ + "year", + "make", + "model", + "trim", + "vin", + "license_plate", + "mileage" + ], + "title": "VehicleInformation", + "type": "object" + } + }, + "description": "A class representing an auto insurance claim form.", + "properties": { + "insurance_company": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Insurance company name, e.g. Contoso Insurance", + "title": "Insurance Company" + }, + "claim_number": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Claim number, e.g. 
CLM987654", + "title": "Claim Number" + }, + "policy_number": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Policy number, e.g. AUTO123456", + "title": "Policy Number" + }, + "policyholder_information": { + "anyOf": [ + { + "$ref": "#/$defs/PolicyholderInformation" + }, + { + "type": "null" + } + ], + "description": "Policyholder information" + }, + "policy_details": { + "anyOf": [ + { + "$ref": "#/$defs/PolicyDetails" + }, + { + "type": "null" + } + ], + "description": "Policy details" + }, + "incident_details": { + "anyOf": [ + { + "$ref": "#/$defs/IncidentDetails" + }, + { + "type": "null" + } + ], + "description": "Incident details" + }, + "vehicle_information": { + "anyOf": [ + { + "$ref": "#/$defs/VehicleInformation" + }, + { + "type": "null" + } + ], + "description": "Vehicle information" + }, + "damage_assessment": { + "anyOf": [ + { + "$ref": "#/$defs/DamageAssessment" + }, + { + "type": "null" + } + ], + "description": "Damage assessment" + }, + "supporting_documents": { + "anyOf": [ + { + "$ref": "#/$defs/SupportingDocuments" + }, + { + "type": "null" + } + ], + "description": "Supporting documents" + }, + "declaration": { + "anyOf": [ + { + "$ref": "#/$defs/Declaration" + }, + { + "type": "null" + } + ], + "description": "Declaration" + }, + "submission_instructions": { + "anyOf": [ + { + "$ref": "#/$defs/SubmissionInstructions" + }, + { + "type": "null" + } + ], + "description": "Submission instructions" + } + }, + "required": [ + "insurance_company", + "claim_number", + "policy_number", + "policyholder_information", + "policy_details", + "incident_details", + "vehicle_information", + "damage_assessment", + "supporting_documents", + "declaration", + "submission_instructions" + ], + "title": "AutoInsuranceClaimForm", + "type": "object" +} diff --git a/src/ContentProcessorAPI/samples/schemas/autoclaim.py b/src/ContentProcessorAPI/samples/schemas/autoclaim.py deleted file mode 100644 index f207c017..00000000 
--- a/src/ContentProcessorAPI/samples/schemas/autoclaim.py +++ /dev/null @@ -1,592 +0,0 @@ -# Copyright (c) Microsoft Corporation. -# Licensed under the MIT License. -"""Pydantic models for auto insurance claim form data extraction. - -Defines the hierarchical schema used by the content processing pipeline to -extract structured fields from auto insurance claim documents. -""" - -from __future__ import annotations - -import json -from typing import List, Optional - -from pydantic import BaseModel, Field - - -class AutoClaimAddress(BaseModel): - """A class representing an address used on an auto claim form.""" - - street: Optional[str] = Field(description="Street address, e.g. 123 Main St.") - city: Optional[str] = Field(description="City, e.g. Macon") - state: Optional[str] = Field(description="State, e.g. GA") - postal_code: Optional[str] = Field(description="Postal code, e.g. 31201") - country: Optional[str] = Field(description="Country, e.g. USA") - - @staticmethod - def example() -> "AutoClaimAddress": - """Return an empty instance with default placeholder values.""" - return AutoClaimAddress( - street="", city="", state="", postal_code="", country="" - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "street": self.street, - "city": self.city, - "state": self.state, - "postal_code": self.postal_code, - "country": self.country, - } - - -class PolicyholderInformation(BaseModel): - """A class representing policyholder information.""" - - name: Optional[str] = Field(description="Policyholder full name, e.g. Chad Brooks") - address: Optional[AutoClaimAddress] = Field( - description="Policyholder address, e.g. 123 Main Street, Macon, GA 31201" - ) - phone: Optional[str] = Field( - description="Policyholder phone number, e.g. (555) 555-1212" - ) - email: Optional[str] = Field( - description="Policyholder email address, e.g. 
chad.brooks@example.com" - ) - - @staticmethod - def example() -> "PolicyholderInformation": - """Return an empty instance with default placeholder values.""" - return PolicyholderInformation( - name="", - address=AutoClaimAddress.example(), - phone="", - email="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "name": self.name, - "address": self.address.to_dict() if self.address else None, - "phone": self.phone, - "email": self.email, - } - - -class PolicyDetails(BaseModel): - """A class representing policy details.""" - - coverage_type: Optional[str] = Field( - description="Coverage type, e.g. Auto – Comprehensive" - ) - effective_date: Optional[str] = Field( - description="Policy effective date, e.g. 2025-01-01" - ) - expiration_date: Optional[str] = Field( - description="Policy expiration date, e.g. 2025-12-31" - ) - deductible: Optional[float] = Field(description="Deductible amount, e.g. 500.0") - deductible_currency: Optional[str] = Field( - description="Currency of the deductible, e.g. USD" - ) - - @staticmethod - def example() -> "PolicyDetails": - """Return an empty instance with default placeholder values.""" - return PolicyDetails( - coverage_type="", - effective_date="", - expiration_date="", - deductible=0.0, - deductible_currency="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "coverage_type": self.coverage_type, - "effective_date": self.effective_date, - "expiration_date": self.expiration_date, - "deductible": self.deductible, - "deductible_currency": self.deductible_currency, - } - - -class IncidentDetails(BaseModel): - """A class representing incident details.""" - - date_of_loss: Optional[str] = Field(description="Date of loss, e.g. 2025-11-28") - time_of_loss: Optional[str] = Field(description="Time of loss, e.g. 14:15") - location: Optional[str] = Field( - description="Incident location, e.g. 
Parking lot near 123 Main Street, Macon, GA" - ) - cause_of_loss: Optional[str] = Field( - description="Cause of loss, e.g. Low-speed collision with another vehicle" - ) - description: Optional[str] = Field( - description="Incident description, e.g. Minor dent and paint scratches; no structural damage" - ) - police_report_filed: Optional[bool] = Field( - description="Whether a police report was filed" - ) - police_report_number: Optional[str] = Field( - description="Police report number, e.g. GA-20251128-CR" - ) - - @staticmethod - def example() -> "IncidentDetails": - """Return an empty instance with default placeholder values.""" - return IncidentDetails( - date_of_loss="", - time_of_loss="", - location="", - cause_of_loss="", - description="", - police_report_filed=False, - police_report_number="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "date_of_loss": self.date_of_loss, - "time_of_loss": self.time_of_loss, - "location": self.location, - "cause_of_loss": self.cause_of_loss, - "description": self.description, - "police_report_filed": self.police_report_filed, - "police_report_number": self.police_report_number, - } - - -class VehicleInformation(BaseModel): - """A class representing vehicle information.""" - - year: Optional[int] = Field(description="Vehicle year, e.g. 2022") - make: Optional[str] = Field(description="Vehicle make, e.g. Toyota") - model: Optional[str] = Field(description="Vehicle model, e.g. Camry") - trim: Optional[str] = Field(description="Vehicle trim, e.g. SE") - vin: Optional[str] = Field(description="Vehicle VIN, e.g. 4T1G11AK2NU123456") - license_plate: Optional[str] = Field(description="License plate, e.g. GA-ABC123") - mileage: Optional[int] = Field(description="Mileage, e.g. 
28450") - - @staticmethod - def example() -> "VehicleInformation": - """Return an empty instance with default placeholder values.""" - return VehicleInformation( - year=0, - make="", - model="", - trim="", - vin="", - license_plate="", - mileage=0, - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "year": self.year, - "make": self.make, - "model": self.model, - "trim": self.trim, - "vin": self.vin, - "license_plate": self.license_plate, - "mileage": self.mileage, - } - - -class DamageAssessmentItem(BaseModel): - """A class representing a damage assessment line item.""" - - item_description: Optional[str] = Field( - description="Damaged item/area description, e.g. Right-front quarter panel" - ) - date_acquired: Optional[str] = Field( - description="Date acquired (if present), e.g. 2022-03-15" - ) - cost_new: Optional[float] = Field(description="Cost when new, e.g. 1200.0") - cost_new_currency: Optional[str] = Field( - description="Currency of cost_new, e.g. USD" - ) - repair_estimate: Optional[float] = Field(description="Repair estimate, e.g. 350.0") - repair_estimate_currency: Optional[str] = Field( - description="Currency of repair_estimate, e.g. 
USD" - ) - - @staticmethod - def example() -> "DamageAssessmentItem": - """Return an empty instance with default placeholder values.""" - return DamageAssessmentItem( - item_description="", - date_acquired="", - cost_new=0.0, - cost_new_currency="", - repair_estimate=0.0, - repair_estimate_currency="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "item_description": self.item_description, - "date_acquired": self.date_acquired, - "cost_new": self.cost_new, - "cost_new_currency": self.cost_new_currency, - "repair_estimate": self.repair_estimate, - "repair_estimate_currency": self.repair_estimate_currency, - } - - -class DamageAssessment(BaseModel): - """A class representing overall damage assessment.""" - - items: Optional[List[DamageAssessmentItem]] = Field( - description="List of damage assessment line items" - ) - total_estimated_repair: Optional[float] = Field( - description="Total estimated repair, e.g. 500.0" - ) - total_estimated_repair_currency: Optional[str] = Field( - description="Currency of total_estimated_repair, e.g. 
USD" - ) - - @staticmethod - def example() -> "DamageAssessment": - """Return an empty instance with default placeholder values.""" - return DamageAssessment( - items=[DamageAssessmentItem.example()], - total_estimated_repair=0.0, - total_estimated_repair_currency="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "items": [item.to_dict() for item in (self.items or [])], - "total_estimated_repair": self.total_estimated_repair, - "total_estimated_repair_currency": self.total_estimated_repair_currency, - } - - -class SupportingDocuments(BaseModel): - """A class representing supporting documents included with the claim.""" - - photos_of_damage: Optional[bool] = Field( - description="Whether photos of damage are included" - ) - police_report_copy: Optional[bool] = Field( - description="Whether a police report copy is included" - ) - repair_shop_estimate: Optional[bool] = Field( - description="Whether a repair shop estimate is included" - ) - other: Optional[List[str]] = Field(description="Other supporting documents") - - @staticmethod - def example() -> "SupportingDocuments": - """Return an empty instance with default placeholder values.""" - return SupportingDocuments( - photos_of_damage=False, - police_report_copy=False, - repair_shop_estimate=False, - other=[], - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "photos_of_damage": self.photos_of_damage, - "police_report_copy": self.police_report_copy, - "repair_shop_estimate": self.repair_shop_estimate, - "other": self.other or [], - } - - -class Signature(BaseModel): - """A class representing a signature field.""" - - signatory: Optional[str] = Field(description="Name of the signatory") - is_signed: Optional[bool] = Field( - description="Indicates if the form is signed. GPT should check whether it has signature in image files. 
If a signature is present, set this field to True." - ) - - @staticmethod - def example() -> "Signature": - """Return an empty instance with default placeholder values.""" - return Signature(signatory="", is_signed=False) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return {"signatory": self.signatory, "is_signed": self.is_signed} - - -class Declaration(BaseModel): - """A class representing the claim declaration.""" - - statement: Optional[str] = Field(description="Declaration statement text") - signature: Optional[Signature] = Field(description="Signature") - date: Optional[str] = Field(description="Signature date, e.g. 2025-12-01") - - @staticmethod - def example() -> "Declaration": - """Return an empty instance with default placeholder values.""" - return Declaration(statement="", signature=Signature.example(), date="") - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "statement": self.statement, - "signature": self.signature.to_dict() if self.signature else None, - "date": self.date, - } - - -class SubmissionInstructions(BaseModel): - """A class representing submission instructions.""" - - submission_email: Optional[str] = Field( - description="Submission email address, e.g. 
claims@contosoinsurance.com" - ) - portal_url: Optional[str] = Field(description="Claims portal URL, if present") - notes: Optional[str] = Field(description="Additional submission notes") - - @staticmethod - def example() -> "SubmissionInstructions": - """Return an empty instance with default placeholder values.""" - return SubmissionInstructions(submission_email="", portal_url="", notes="") - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "submission_email": self.submission_email, - "portal_url": self.portal_url, - "notes": self.notes, - } - - -class AutoInsuranceClaimForm(BaseModel): - """A class representing an auto insurance claim form.""" - - insurance_company: Optional[str] = Field( - description="Insurance company name, e.g. Contoso Insurance" - ) - claim_number: Optional[str] = Field(description="Claim number, e.g. CLM987654") - policy_number: Optional[str] = Field(description="Policy number, e.g. AUTO123456") - - policyholder_information: Optional[PolicyholderInformation] = Field( - description="Policyholder information" - ) - policy_details: Optional[PolicyDetails] = Field(description="Policy details") - incident_details: Optional[IncidentDetails] = Field(description="Incident details") - vehicle_information: Optional[VehicleInformation] = Field( - description="Vehicle information" - ) - damage_assessment: Optional[DamageAssessment] = Field( - description="Damage assessment" - ) - supporting_documents: Optional[SupportingDocuments] = Field( - description="Supporting documents" - ) - declaration: Optional[Declaration] = Field(description="Declaration") - submission_instructions: Optional[SubmissionInstructions] = Field( - description="Submission instructions" - ) - - @staticmethod - def example() -> "AutoInsuranceClaimForm": - """Return an empty instance with default placeholder values.""" - return AutoInsuranceClaimForm( - insurance_company="", - claim_number="", - policy_number="", - 
policyholder_information=PolicyholderInformation.example(), - policy_details=PolicyDetails.example(), - incident_details=IncidentDetails.example(), - vehicle_information=VehicleInformation.example(), - damage_assessment=DamageAssessment.example(), - supporting_documents=SupportingDocuments.example(), - declaration=Declaration.example(), - submission_instructions=SubmissionInstructions.example(), - ) - - @staticmethod - def from_json(json_str: str) -> "AutoInsuranceClaimForm": - """Deserialize a JSON string into an AutoInsuranceClaimForm instance.""" - json_content = json.loads(json_str) - - def create_address(address: Optional[dict]) -> Optional[AutoClaimAddress]: - if not address: - return None - return AutoClaimAddress( - street=address.get("street"), - city=address.get("city"), - state=address.get("state"), - postal_code=address.get("postal_code"), - country=address.get("country"), - ) - - def create_policyholder( - info: Optional[dict], - ) -> Optional[PolicyholderInformation]: - if not info: - return None - return PolicyholderInformation( - name=info.get("name"), - address=create_address(info.get("address")), - phone=info.get("phone"), - email=info.get("email"), - ) - - def create_policy_details(details: Optional[dict]) -> Optional[PolicyDetails]: - if not details: - return None - return PolicyDetails( - coverage_type=details.get("coverage_type"), - effective_date=details.get("effective_date"), - expiration_date=details.get("expiration_date"), - deductible=details.get("deductible"), - deductible_currency=details.get("deductible_currency"), - ) - - def create_incident(details: Optional[dict]) -> Optional[IncidentDetails]: - if not details: - return None - return IncidentDetails( - date_of_loss=details.get("date_of_loss"), - time_of_loss=details.get("time_of_loss"), - location=details.get("location"), - cause_of_loss=details.get("cause_of_loss"), - description=details.get("description"), - police_report_filed=details.get("police_report_filed"), - 
police_report_number=details.get("police_report_number"), - ) - - def create_vehicle(details: Optional[dict]) -> Optional[VehicleInformation]: - if not details: - return None - return VehicleInformation( - year=details.get("year"), - make=details.get("make"), - model=details.get("model"), - trim=details.get("trim"), - vin=details.get("vin"), - license_plate=details.get("license_plate"), - mileage=details.get("mileage"), - ) - - def create_damage_item(item: Optional[dict]) -> Optional[DamageAssessmentItem]: - if not item: - return None - return DamageAssessmentItem( - item_description=item.get("item_description"), - date_acquired=item.get("date_acquired"), - cost_new=item.get("cost_new"), - cost_new_currency=item.get("cost_new_currency"), - repair_estimate=item.get("repair_estimate"), - repair_estimate_currency=item.get("repair_estimate_currency"), - ) - - def create_damage(details: Optional[dict]) -> Optional[DamageAssessment]: - if not details: - return None - items_raw = details.get("items") or [] - items = [create_damage_item(i) for i in items_raw] - items = [i for i in items if i is not None] - return DamageAssessment( - items=items, - total_estimated_repair=details.get("total_estimated_repair"), - total_estimated_repair_currency=details.get( - "total_estimated_repair_currency" - ), - ) - - def create_supporting(details: Optional[dict]) -> Optional[SupportingDocuments]: - if not details: - return None - return SupportingDocuments( - photos_of_damage=details.get("photos_of_damage"), - police_report_copy=details.get("police_report_copy"), - repair_shop_estimate=details.get("repair_shop_estimate"), - other=details.get("other") or [], - ) - - def create_signature(details: Optional[dict]) -> Optional[Signature]: - if not details: - return None - return Signature( - signatory=details.get("signatory"), - is_signed=details.get("is_signed"), - ) - - def create_declaration(details: Optional[dict]) -> Optional[Declaration]: - if not details: - return None - return 
Declaration( - statement=details.get("statement"), - signature=create_signature(details.get("signature")), - date=details.get("date"), - ) - - def create_submission( - details: Optional[dict], - ) -> Optional[SubmissionInstructions]: - if not details: - return None - return SubmissionInstructions( - submission_email=details.get("submission_email"), - portal_url=details.get("portal_url"), - notes=details.get("notes"), - ) - - return AutoInsuranceClaimForm( - insurance_company=json_content.get("insurance_company"), - claim_number=json_content.get("claim_number"), - policy_number=json_content.get("policy_number"), - policyholder_information=create_policyholder( - json_content.get("policyholder_information") - ), - policy_details=create_policy_details(json_content.get("policy_details")), - incident_details=create_incident(json_content.get("incident_details")), - vehicle_information=create_vehicle(json_content.get("vehicle_information")), - damage_assessment=create_damage(json_content.get("damage_assessment")), - supporting_documents=create_supporting( - json_content.get("supporting_documents") - ), - declaration=create_declaration(json_content.get("declaration")), - submission_instructions=create_submission( - json_content.get("submission_instructions") - ), - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "insurance_company": self.insurance_company, - "claim_number": self.claim_number, - "policy_number": self.policy_number, - "policyholder_information": self.policyholder_information.to_dict() - if self.policyholder_information - else None, - "policy_details": self.policy_details.to_dict() - if self.policy_details - else None, - "incident_details": self.incident_details.to_dict() - if self.incident_details - else None, - "vehicle_information": self.vehicle_information.to_dict() - if self.vehicle_information - else None, - "damage_assessment": self.damage_assessment.to_dict() - if self.damage_assessment - else None, - 
"supporting_documents": self.supporting_documents.to_dict() - if self.supporting_documents - else None, - "declaration": self.declaration.to_dict() if self.declaration else None, - "submission_instructions": self.submission_instructions.to_dict() - if self.submission_instructions - else None, - } diff --git a/src/ContentProcessorAPI/samples/schemas/damagedcarimage.json b/src/ContentProcessorAPI/samples/schemas/damagedcarimage.json new file mode 100644 index 00000000..f7d2385b --- /dev/null +++ b/src/ContentProcessorAPI/samples/schemas/damagedcarimage.json @@ -0,0 +1,617 @@ +{ + "$defs": { + "CameraViewpoint": { + "description": "Camera perspective relative to the vehicle.\n\nAttributes:\n spatial_reasoning: Chain-of-thought scratchpad for determining view angle.\n view_angle: Computed camera angle label.\n description: Free-text summary of the camera position.", + "properties": { + "spatial_reasoning": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "MANDATORY chain-of-thought reasoning about camera position. Must answer IN ORDER: (1) Can I see the FRONT (grille/headlights) or REAR (tail lights/trunk) of the vehicle? (2) Which side of the IMAGE does the body flank extend toward? (3) Apply the mirror rule: viewing the FRONT \u2014 image-right = vehicle LEFT, image-left = vehicle RIGHT. Viewing the REAR \u2014 image-right = vehicle RIGHT, image-left = vehicle LEFT. (4) Therefore view_angle = ? (5) FALLBACK only if neither front nor rear is visible (pure side view): use steering wheel position to determine driver side (LHD: left, RHD: right).", + "title": "Spatial Reasoning" + }, + "view_angle": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Primary camera viewing angle relative to the vehicle. Must be one of: front, front-left, front-right, left-side, right-side, rear-left, rear-right, rear, top, underneath, interior, unknown. 
Left/right = VEHICLE's own left/right (driver-perspective facing forward).", + "title": "View Angle" + }, + "description": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Free-text description of the camera position and angle relative to the vehicle, e.g. 'Slightly elevated front-left view showing hood, front bumper, and left fender.'", + "title": "Description" + } + }, + "required": [ + "spatial_reasoning", + "view_angle", + "description" + ], + "title": "CameraViewpoint", + "type": "object" + }, + "DamageBoundingBox": { + "description": "Bounding box in normalized image coordinates [0..1].", + "properties": { + "x_min": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Left edge in [0..1]", + "title": "X Min" + }, + "y_min": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Top edge in [0..1]", + "title": "Y Min" + }, + "x_max": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Right edge in [0..1]", + "title": "X Max" + }, + "y_max": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Bottom edge in [0..1]", + "title": "Y Max" + } + }, + "required": [ + "x_min", + "y_min", + "x_max", + "y_max" + ], + "title": "DamageBoundingBox", + "type": "object" + }, + "DamageRegion": { + "description": "A detected region of damage on the vehicle.", + "properties": { + "location_on_vehicle": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Location on the vehicle using the VEHICLE's own left/right (driver-perspective facing forward). The side MUST match camera_viewpoint.view_angle. 
Examples: 'front-left fender', 'rear-right quarter panel'.", + "title": "Location On Vehicle" + }, + "damage_types": { + "anyOf": [ + { + "items": { + "type": "string" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "Damage types, e.g. ['scratch','dent','crack','paint-transfer']", + "title": "Damage Types" + }, + "severity": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Severity label, e.g. minor, moderate, severe", + "title": "Severity" + }, + "description": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Free-text description of the damage", + "title": "Description" + }, + "bounding_box": { + "anyOf": [ + { + "$ref": "#/$defs/DamageBoundingBox" + }, + { + "type": "null" + } + ], + "description": "Approx bounding box of the damage area (normalized coordinates)" + }, + "confidence": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Confidence score in [0..1] for this damage region", + "title": "Confidence" + } + }, + "required": [ + "location_on_vehicle", + "damage_types", + "severity", + "description", + "bounding_box", + "confidence" + ], + "title": "DamageRegion", + "type": "object" + }, + "ImageInfo": { + "description": "Metadata about an input image.\n\nNote: Most fields may be unknown unless provided by the caller or extracted from EXIF.", + "properties": { + "filename": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Analyzed filename of the image", + "title": "Filename" + }, + "content_type": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "MIME type, e.g. 
image/jpeg", + "title": "Content Type" + }, + "width": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Analyzed image width in pixels", + "title": "Width" + }, + "height": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Analyzed image height in pixels", + "title": "Height" + }, + "capture_datetime": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Capture datetime if available, e.g. 2025-11-28T14:15:00, or the original EXIF string if unprocessed", + "title": "Capture Datetime" + } + }, + "required": [ + "filename", + "content_type", + "width", + "height", + "capture_datetime" + ], + "title": "ImageInfo", + "type": "object" + }, + "OverallDamageAssessment": { + "description": "Overall assessment across the full image.", + "properties": { + "has_visible_damage": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Whether any damage is visible", + "title": "Has Visible Damage" + }, + "overall_severity": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Overall severity label, e.g. minor, moderate, severe", + "title": "Overall Severity" + }, + "affected_parts": { + "anyOf": [ + { + "items": { + "type": "string" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "Affected parts/panels using the VEHICLE's own left/right. Side labels MUST match camera_viewpoint.view_angle.", + "title": "Affected Parts" + }, + "estimated_repair_complexity": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Rough complexity, e.g. cosmetic_only, panel_repair, replacement_likely", + "title": "Estimated Repair Complexity" + }, + "notes": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Notes or caveats, e.g. 
lighting/angle limitations", + "title": "Notes" + } + }, + "required": [ + "has_visible_damage", + "overall_severity", + "affected_parts", + "estimated_repair_complexity", + "notes" + ], + "title": "OverallDamageAssessment", + "type": "object" + }, + "VehicleAppearance": { + "description": "Visible vehicle identification extracted from the image.\n\nGuidance:\n- Prefer fields that can be seen. If uncertain, leave null.\n- Do not guess VIN from images.", + "properties": { + "vehicle_type": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle type, e.g. sedan, SUV", + "title": "Vehicle Type" + }, + "make": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle make, e.g. Toyota", + "title": "Make" + }, + "model": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle model, e.g. Camry", + "title": "Model" + }, + "trim": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle trim, e.g. SE", + "title": "Trim" + }, + "model_year": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Vehicle model year, e.g. 2022", + "title": "Model Year" + }, + "color": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle color, e.g. 
silver", + "title": "Color" + }, + "license_plate_visible": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Whether the license plate is visible in the image", + "title": "License Plate Visible" + }, + "license_plate_text": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "License plate text if clearly readable; otherwise null", + "title": "License Plate Text" + }, + "visible_vehicle_parts": { + "anyOf": [ + { + "items": { + "type": "string" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "List of vehicle parts/panels actually visible in this image given the camera angle, e.g. ['hood', 'front bumper', 'front-left fender', 'front-left headlight']. Only parts that can be seen should be listed. Left/right MUST use the VEHICLE's own frame of reference and MUST match the side in camera_viewpoint.view_angle.", + "title": "Visible Vehicle Parts" + } + }, + "required": [ + "vehicle_type", + "make", + "model", + "trim", + "model_year", + "color", + "license_plate_visible", + "license_plate_text", + "visible_vehicle_parts" + ], + "title": "VehicleAppearance", + "type": "object" + }, + "VehicleAssessment": { + "description": "Per-vehicle damage assessment extracted from an image.\n\nGroups appearance, damage regions, and overall assessment for a single\nvehicle detected in the photograph.\n\nAttributes:\n vehicle_id: Human-readable identifier distinguishing this vehicle.\n vehicle_appearance: Visible vehicle identification.\n damage_regions: Detected damage regions for this vehicle.\n overall_assessment: Overall damage assessment for this vehicle.", + "properties": { + "vehicle_id": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "A short human-readable identifier for this vehicle, e.g. 'Vehicle 1 - silver sedan (front-left)'. 
Use color, type, and position to distinguish vehicles.", + "title": "Vehicle Id" + }, + "vehicle_appearance": { + "anyOf": [ + { + "$ref": "#/$defs/VehicleAppearance" + }, + { + "type": "null" + } + ], + "description": "Visible vehicle identification for this vehicle" + }, + "damage_regions": { + "anyOf": [ + { + "items": { + "$ref": "#/$defs/DamageRegion" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "List of detected damage regions for this vehicle", + "title": "Damage Regions" + }, + "overall_assessment": { + "anyOf": [ + { + "$ref": "#/$defs/OverallDamageAssessment" + }, + { + "type": "null" + } + ], + "description": "Overall damage assessment for this vehicle" + } + }, + "required": [ + "vehicle_id", + "vehicle_appearance", + "damage_regions", + "overall_assessment" + ], + "title": "VehicleAssessment", + "type": "object" + } + }, + "description": "Schema for extracting damaged vehicle information from an image.\n\nSupports single- and multi-vehicle images. Each vehicle detected in the\nphotograph gets its own entry in the ``vehicles`` list.\n\nAttributes:\n image_info: Image metadata (shared across all vehicles).\n camera_viewpoint: Camera perspective relative to the scene.\n vehicle_count: Number of distinct vehicles detected in the image.\n vehicles: Per-vehicle assessment list.", + "properties": { + "image_info": { + "anyOf": [ + { + "$ref": "#/$defs/ImageInfo" + }, + { + "type": "null" + } + ], + "description": "Image metadata" + }, + "camera_viewpoint": { + "anyOf": [ + { + "$ref": "#/$defs/CameraViewpoint" + }, + { + "type": "null" + } + ], + "description": "Camera perspective relative to the scene. MUST be determined BEFORE labelling any damage locations so that left/right orientation is anchored to each vehicle's own frame of reference." + }, + "vehicle_count": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Number of distinct vehicles detected in the image. 
Must equal the length of the vehicles list.", + "title": "Vehicle Count" + }, + "vehicles": { + "anyOf": [ + { + "items": { + "$ref": "#/$defs/VehicleAssessment" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "Per-vehicle damage assessments. One entry per vehicle detected in the image. For single-vehicle images this list contains exactly one item.", + "title": "Vehicles" + }, + "consistency_check": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "MANDATORY self-verification. State the side from view_angle, then list every left/right label used in visible_vehicle_parts, damage_regions, and affected_parts. Confirm they ALL match the side in view_angle. If any mismatch was found and corrected, describe what was fixed.", + "title": "Consistency Check" + } + }, + "required": [ + "image_info", + "camera_viewpoint", + "vehicle_count", + "vehicles", + "consistency_check" + ], + "title": "DamagedVehicleImageAssessment", + "type": "object" +} diff --git a/src/ContentProcessorAPI/samples/schemas/damagedcarimage.py b/src/ContentProcessorAPI/samples/schemas/damagedcarimage.py deleted file mode 100644 index 93343dae..00000000 --- a/src/ContentProcessorAPI/samples/schemas/damagedcarimage.py +++ /dev/null @@ -1,519 +0,0 @@ -# Copyright (c) Microsoft Corporation. -# Licensed under the MIT License. -"""Pydantic models for damaged vehicle image assessment data extraction. - -Defines the schema used by the content processing pipeline to extract -structured damage information from vehicle photographs. -""" - -from __future__ import annotations - -import json -from typing import List, Optional - -from pydantic import BaseModel, Field - - -class ImageInfo(BaseModel): - """Metadata about an input image. - - Note: Most fields may be unknown unless provided by the caller or extracted from EXIF. 
- """ - - filename: Optional[str] = Field(description="Analyzed filename of the image") - content_type: Optional[str] = Field(description="MIME type, e.g. image/jpeg") - width: Optional[int] = Field(description="Analyzed image width in pixels") - height: Optional[int] = Field(description="Analyzed image height in pixels") - capture_datetime: Optional[str] = Field( - description="Capture datetime if available, e.g. 2025-11-28T14:15:00 original EXIF string if unprocessed" - ) - - @staticmethod - def example() -> "ImageInfo": - """Return an empty instance with default placeholder values.""" - return ImageInfo( - filename="", - content_type="", - width=0, - height=0, - capture_datetime="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "filename": self.filename, - "content_type": self.content_type, - "width": self.width, - "height": self.height, - "capture_datetime": self.capture_datetime, - } - - -class VehicleAppearance(BaseModel): - """Visible vehicle identification extracted from the image. - - Guidance: - - Prefer fields that can be seen. If uncertain, leave null. - - Do not guess VIN from images. - """ - - vehicle_type: Optional[str] = Field(description="Vehicle type, e.g. sedan, SUV") - make: Optional[str] = Field(description="Vehicle make, e.g. Toyota") - model: Optional[str] = Field(description="Vehicle model, e.g. Camry") - trim: Optional[str] = Field(description="Vehicle trim, e.g. SE") - model_year: Optional[int] = Field(description="Vehicle model year, e.g. 2022") - color: Optional[str] = Field(description="Vehicle color, e.g. 
silver") - - license_plate_visible: Optional[bool] = Field( - description="Whether the license plate is visible in the image" - ) - license_plate_text: Optional[str] = Field( - description="License plate text if clearly readable; otherwise null" - ) - - visible_vehicle_parts: Optional[List[str]] = Field( - description=( - "List of vehicle parts/panels actually visible in this image " - "given the camera angle, e.g. ['hood', 'front bumper', " - "'front-left fender', 'front-left headlight']. " - "Only parts that can be seen should be listed. " - "Left/right MUST use the VEHICLE's own frame of reference " - "and MUST match the side in camera_viewpoint.view_angle." - ) - ) - - @staticmethod - def example() -> "VehicleAppearance": - """Return an empty instance with default placeholder values.""" - return VehicleAppearance( - vehicle_type="", - make="", - model="", - trim="", - model_year=0, - color="", - license_plate_visible=False, - license_plate_text="", - visible_vehicle_parts=[], - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "vehicle_type": self.vehicle_type, - "make": self.make, - "model": self.model, - "trim": self.trim, - "model_year": self.model_year, - "color": self.color, - "license_plate_visible": self.license_plate_visible, - "license_plate_text": self.license_plate_text, - "visible_vehicle_parts": self.visible_vehicle_parts or [], - } - - -class CameraViewpoint(BaseModel): - """Camera perspective relative to the vehicle. - - Attributes: - spatial_reasoning: Chain-of-thought scratchpad for determining view angle. - view_angle: Computed camera angle label. - description: Free-text summary of the camera position. - """ - - spatial_reasoning: Optional[str] = Field( - description=( - "MANDATORY chain-of-thought reasoning about camera position. " - "Must answer IN ORDER: " - "(1) Can I see the FRONT (grille/headlights) or REAR (tail lights/trunk) of the vehicle? 
" - "(2) Which side of the IMAGE does the body flank extend toward? " - "(3) Apply the mirror rule: viewing the FRONT — image-right = vehicle LEFT, " - "image-left = vehicle RIGHT. Viewing the REAR — image-right = vehicle RIGHT, " - "image-left = vehicle LEFT. " - "(4) Therefore view_angle = ? " - "(5) FALLBACK only if neither front nor rear is visible (pure side view): " - "use steering wheel position to determine driver side (LHD: left, RHD: right)." - ) - ) - view_angle: Optional[str] = Field( - description=( - "Primary camera viewing angle relative to the vehicle. " - "Must be one of: front, front-left, front-right, " - "left-side, right-side, rear-left, rear-right, rear, " - "top, underneath, interior, unknown. " - "Left/right = VEHICLE's own left/right (driver-perspective facing forward)." - ) - ) - description: Optional[str] = Field( - description=( - "Free-text description of the camera position and angle " - "relative to the vehicle, e.g. 'Slightly elevated front-left " - "view showing hood, front bumper, and left fender.'" - ) - ) - - @staticmethod - def example() -> "CameraViewpoint": - """Return an empty instance with default placeholder values.""" - return CameraViewpoint(spatial_reasoning="", view_angle="", description="") - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "spatial_reasoning": self.spatial_reasoning, - "view_angle": self.view_angle, - "description": self.description, - } - - -class DamageBoundingBox(BaseModel): - """Bounding box in normalized image coordinates [0..1].""" - - x_min: Optional[float] = Field(description="Left edge in [0..1]") - y_min: Optional[float] = Field(description="Top edge in [0..1]") - x_max: Optional[float] = Field(description="Right edge in [0..1]") - y_max: Optional[float] = Field(description="Bottom edge in [0..1]") - - @staticmethod - def example() -> "DamageBoundingBox": - """Return an empty instance with default placeholder values.""" - return DamageBoundingBox(x_min=0.0, 
y_min=0.0, x_max=0.0, y_max=0.0) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "x_min": self.x_min, - "y_min": self.y_min, - "x_max": self.x_max, - "y_max": self.y_max, - } - - -class DamageRegion(BaseModel): - """A detected region of damage on the vehicle.""" - - location_on_vehicle: Optional[str] = Field( - description=( - "Location on the vehicle using the VEHICLE's own left/right " - "(driver-perspective facing forward). " - "The side MUST match camera_viewpoint.view_angle. " - "Examples: 'front-left fender', 'rear-right quarter panel'." - ) - ) - damage_types: Optional[List[str]] = Field( - description="Damage types, e.g. ['scratch','dent','crack','paint-transfer']" - ) - severity: Optional[str] = Field( - description="Severity label, e.g. minor, moderate, severe" - ) - description: Optional[str] = Field( - description="Free-text description of the damage" - ) - - bounding_box: Optional[DamageBoundingBox] = Field( - description="Approx bounding box of the damage area (normalized coordinates)" - ) - - confidence: Optional[float] = Field( - description="Confidence score in [0..1] for this damage region" - ) - - @staticmethod - def example() -> "DamageRegion": - """Return an empty instance with default placeholder values.""" - return DamageRegion( - location_on_vehicle="", - damage_types=[], - severity="", - description="", - bounding_box=DamageBoundingBox.example(), - confidence=0.0, - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "location_on_vehicle": self.location_on_vehicle, - "damage_types": self.damage_types or [], - "severity": self.severity, - "description": self.description, - "bounding_box": self.bounding_box.to_dict() if self.bounding_box else None, - "confidence": self.confidence, - } - - -class OverallDamageAssessment(BaseModel): - """Overall assessment across the full image.""" - - has_visible_damage: Optional[bool] = Field( - description="Whether any damage is visible" 
- ) - overall_severity: Optional[str] = Field( - description="Overall severity label, e.g. minor, moderate, severe" - ) - - affected_parts: Optional[List[str]] = Field( - description=( - "Affected parts/panels using the VEHICLE's own left/right. " - "Side labels MUST match camera_viewpoint.view_angle." - ) - ) - - estimated_repair_complexity: Optional[str] = Field( - description="Rough complexity, e.g. cosmetic_only, panel_repair, replacement_likely" - ) - - notes: Optional[str] = Field( - description="Notes or caveats, e.g. lighting/angle limitations" - ) - - @staticmethod - def example() -> "OverallDamageAssessment": - """Return an empty instance with default placeholder values.""" - return OverallDamageAssessment( - has_visible_damage=False, - overall_severity="", - affected_parts=[], - estimated_repair_complexity="", - notes="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "has_visible_damage": self.has_visible_damage, - "overall_severity": self.overall_severity, - "affected_parts": self.affected_parts or [], - "estimated_repair_complexity": self.estimated_repair_complexity, - "notes": self.notes, - } - - -class VehicleAssessment(BaseModel): - """Per-vehicle damage assessment extracted from an image. - - Groups appearance, damage regions, and overall assessment for a single - vehicle detected in the photograph. - - Attributes: - vehicle_id: Human-readable identifier distinguishing this vehicle. - vehicle_appearance: Visible vehicle identification. - damage_regions: Detected damage regions for this vehicle. - overall_assessment: Overall damage assessment for this vehicle. - """ - - vehicle_id: Optional[str] = Field( - description=( - "A short human-readable identifier for this vehicle, " - "e.g. 'Vehicle 1 - silver sedan (front-left)'. " - "Use color, type, and position to distinguish vehicles." 
- ) - ) - vehicle_appearance: Optional[VehicleAppearance] = Field( - description="Visible vehicle identification for this vehicle" - ) - damage_regions: Optional[List[DamageRegion]] = Field( - description="List of detected damage regions for this vehicle" - ) - overall_assessment: Optional[OverallDamageAssessment] = Field( - description="Overall damage assessment for this vehicle" - ) - - @staticmethod - def example() -> "VehicleAssessment": - """Return an empty instance with default placeholder values.""" - return VehicleAssessment( - vehicle_id="", - vehicle_appearance=VehicleAppearance.example(), - damage_regions=[DamageRegion.example()], - overall_assessment=OverallDamageAssessment.example(), - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "vehicle_id": self.vehicle_id, - "vehicle_appearance": self.vehicle_appearance.to_dict() - if self.vehicle_appearance - else None, - "damage_regions": [r.to_dict() for r in (self.damage_regions or [])], - "overall_assessment": self.overall_assessment.to_dict() - if self.overall_assessment - else None, - } - - -class DamagedVehicleImageAssessment(BaseModel): - """Schema for extracting damaged vehicle information from an image. - - Supports single- and multi-vehicle images. Each vehicle detected in the - photograph gets its own entry in the ``vehicles`` list. - - Attributes: - image_info: Image metadata (shared across all vehicles). - camera_viewpoint: Camera perspective relative to the scene. - vehicle_count: Number of distinct vehicles detected in the image. - vehicles: Per-vehicle assessment list. - """ - - image_info: Optional[ImageInfo] = Field(description="Image metadata") - camera_viewpoint: Optional[CameraViewpoint] = Field( - description=( - "Camera perspective relative to the scene. " - "MUST be determined BEFORE labelling any damage " - "locations so that left/right orientation is anchored " - "to each vehicle's own frame of reference." 
- ) - ) - vehicle_count: Optional[int] = Field( - description=( - "Number of distinct vehicles detected in the image. " - "Must equal the length of the vehicles list." - ) - ) - vehicles: Optional[List[VehicleAssessment]] = Field( - description=( - "Per-vehicle damage assessments. One entry per vehicle " - "detected in the image. For single-vehicle images this " - "list contains exactly one item." - ) - ) - consistency_check: Optional[str] = Field( - description=( - "MANDATORY self-verification. State the side from view_angle, " - "then list every left/right label used in visible_vehicle_parts, " - "damage_regions, and affected_parts. Confirm they ALL match the " - "side in view_angle. If any mismatch was found and corrected, " - "describe what was fixed." - ) - ) - - @staticmethod - def example() -> "DamagedVehicleImageAssessment": - """Return an empty instance with default placeholder values.""" - return DamagedVehicleImageAssessment( - image_info=ImageInfo.example(), - camera_viewpoint=CameraViewpoint.example(), - vehicle_count=1, - vehicles=[VehicleAssessment.example()], - consistency_check="", - ) - - @staticmethod - def from_json(json_str: str) -> "DamagedVehicleImageAssessment": - """Deserialize a JSON string into a DamagedVehicleImageAssessment instance.""" - json_content = json.loads(json_str) - - def create_image_info(details: Optional[dict]) -> Optional[ImageInfo]: - if not details: - return None - return ImageInfo( - filename=details.get("filename"), - content_type=details.get("content_type"), - width=details.get("width"), - height=details.get("height"), - capture_datetime=details.get("capture_datetime"), - ) - - def create_viewpoint( - details: Optional[dict], - ) -> Optional[CameraViewpoint]: - if not details: - return None - return CameraViewpoint( - spatial_reasoning=details.get("spatial_reasoning"), - view_angle=details.get("view_angle"), - description=details.get("description"), - ) - - def create_appearance( - details: Optional[dict], - ) -> 
Optional[VehicleAppearance]: - if not details: - return None - return VehicleAppearance( - vehicle_type=details.get("vehicle_type"), - make=details.get("make"), - model=details.get("model"), - trim=details.get("trim"), - model_year=details.get("model_year"), - color=details.get("color"), - license_plate_visible=details.get("license_plate_visible"), - license_plate_text=details.get("license_plate_text"), - visible_vehicle_parts=details.get("visible_vehicle_parts") or [], - ) - - def create_bbox(details: Optional[dict]) -> Optional[DamageBoundingBox]: - if not details: - return None - return DamageBoundingBox( - x_min=details.get("x_min"), - y_min=details.get("y_min"), - x_max=details.get("x_max"), - y_max=details.get("y_max"), - ) - - def create_region(details: Optional[dict]) -> Optional[DamageRegion]: - if not details: - return None - return DamageRegion( - location_on_vehicle=details.get("location_on_vehicle"), - damage_types=details.get("damage_types") or [], - severity=details.get("severity"), - description=details.get("description"), - bounding_box=create_bbox(details.get("bounding_box")), - confidence=details.get("confidence"), - ) - - def create_overall( - details: Optional[dict], - ) -> Optional[OverallDamageAssessment]: - if not details: - return None - return OverallDamageAssessment( - has_visible_damage=details.get("has_visible_damage"), - overall_severity=details.get("overall_severity"), - affected_parts=details.get("affected_parts") or [], - estimated_repair_complexity=details.get("estimated_repair_complexity"), - notes=details.get("notes"), - ) - - def create_vehicle_assessment( - details: Optional[dict], - ) -> Optional[VehicleAssessment]: - if not details: - return None - regions_raw = details.get("damage_regions") or [] - regions = [r for r in (create_region(r) for r in regions_raw) if r] - return VehicleAssessment( - vehicle_id=details.get("vehicle_id"), - vehicle_appearance=create_appearance(details.get("vehicle_appearance")), - 
damage_regions=regions, - overall_assessment=create_overall(details.get("overall_assessment")), - ) - - vehicles_raw = json_content.get("vehicles") or [] - vehicles = [ - v for v in (create_vehicle_assessment(v) for v in vehicles_raw) if v - ] - - return DamagedVehicleImageAssessment( - image_info=create_image_info(json_content.get("image_info")), - camera_viewpoint=create_viewpoint(json_content.get("camera_viewpoint")), - vehicle_count=json_content.get("vehicle_count"), - vehicles=vehicles, - consistency_check=json_content.get("consistency_check"), - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "image_info": self.image_info.to_dict() if self.image_info else None, - "camera_viewpoint": self.camera_viewpoint.to_dict() - if self.camera_viewpoint - else None, - "vehicle_count": self.vehicle_count, - "vehicles": [v.to_dict() for v in (self.vehicles or [])], - "consistency_check": self.consistency_check, - } diff --git a/src/ContentProcessorAPI/samples/schemas/policereport.json b/src/ContentProcessorAPI/samples/schemas/policereport.json new file mode 100644 index 00000000..87fc07af --- /dev/null +++ b/src/ContentProcessorAPI/samples/schemas/policereport.json @@ -0,0 +1,548 @@ +{ + "$defs": { + "PoliceReportAddress": { + "description": "A class representing an address referenced in a police report.", + "properties": { + "street": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Street address, e.g. 123 Main St.", + "title": "Street" + }, + "city": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "City, e.g. Macon", + "title": "City" + }, + "state": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "State, e.g. GA", + "title": "State" + }, + "postal_code": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Postal code, e.g. 
31201", + "title": "Postal Code" + }, + "country": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Country, e.g. USA", + "title": "Country" + } + }, + "required": [ + "street", + "city", + "state", + "postal_code", + "country" + ], + "title": "PoliceReportAddress", + "type": "object" + }, + "PoliceReportDamageItem": { + "description": "A class representing a damage line item recorded alongside a police report.", + "properties": { + "item_description": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Damaged item/area description", + "title": "Item Description" + }, + "repair_estimate": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Repair estimate amount", + "title": "Repair Estimate" + }, + "repair_estimate_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency of repair_estimate, e.g. USD", + "title": "Repair Estimate Currency" + } + }, + "required": [ + "item_description", + "repair_estimate", + "repair_estimate_currency" + ], + "title": "PoliceReportDamageItem", + "type": "object" + }, + "PoliceReportDamageSummary": { + "description": "A class representing a damage summary section.", + "properties": { + "items": { + "anyOf": [ + { + "items": { + "$ref": "#/$defs/PoliceReportDamageItem" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "List of damage items", + "title": "Items" + }, + "total_estimated_repair": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Total estimated repair amount", + "title": "Total Estimated Repair" + }, + "total_estimated_repair_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency of total_estimated_repair, e.g. 
USD", + "title": "Total Estimated Repair Currency" + } + }, + "required": [ + "items", + "total_estimated_repair", + "total_estimated_repair_currency" + ], + "title": "PoliceReportDamageSummary", + "type": "object" + }, + "PoliceReportIncident": { + "description": "A class representing incident details in a police report.", + "properties": { + "date": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Incident date, e.g. 2025-11-28", + "title": "Date" + }, + "time": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Incident time, e.g. 14:15", + "title": "Time" + }, + "location": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Incident location", + "title": "Location" + }, + "cause": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Cause of incident", + "title": "Cause" + }, + "narrative": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Narrative/description of what happened", + "title": "Narrative" + } + }, + "required": [ + "date", + "time", + "location", + "cause", + "narrative" + ], + "title": "PoliceReportIncident", + "type": "object" + }, + "PoliceReportVehicle": { + "description": "A class representing a vehicle referenced in a police report.", + "properties": { + "year": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Vehicle year, e.g. 2022", + "title": "Year" + }, + "make": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle make, e.g. Toyota", + "title": "Make" + }, + "model": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle model, e.g. Camry", + "title": "Model" + }, + "trim": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle trim, e.g. 
SE", + "title": "Trim" + }, + "vin": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle VIN", + "title": "Vin" + }, + "license_plate": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "License plate", + "title": "License Plate" + }, + "mileage": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Mileage", + "title": "Mileage" + } + }, + "required": [ + "year", + "make", + "model", + "trim", + "vin", + "license_plate", + "mileage" + ], + "title": "PoliceReportVehicle", + "type": "object" + }, + "ReportingParty": { + "description": "A class representing the reporting party / claimant in the police report context.", + "properties": { + "name": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Full name of reporting party", + "title": "Name" + }, + "address": { + "anyOf": [ + { + "$ref": "#/$defs/PoliceReportAddress" + }, + { + "type": "null" + } + ], + "description": "Address of reporting party" + }, + "phone": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Phone number", + "title": "Phone" + }, + "email": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Email address", + "title": "Email" + } + }, + "required": [ + "name", + "address", + "phone", + "email" + ], + "title": "ReportingParty", + "type": "object" + } + }, + "description": "A class representing a police report document attached to an auto claim.\n\nNote: The sample content includes the statement \"Police Report: Filed (Report # GA-20251128-CR)\".\nThis schema focuses on extracting the report identifier and the related incident context.", + "properties": { + "report_number": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Police report number, e.g. 
GA-20251128-CR", + "title": "Report Number" + }, + "is_filed": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Whether a police report was filed", + "title": "Is Filed" + }, + "reporting_agency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Reporting agency / department", + "title": "Reporting Agency" + }, + "insurance_company": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Insurance company name", + "title": "Insurance Company" + }, + "claim_number": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Claim number", + "title": "Claim Number" + }, + "policy_number": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Policy number", + "title": "Policy Number" + }, + "reporting_party": { + "anyOf": [ + { + "$ref": "#/$defs/ReportingParty" + }, + { + "type": "null" + } + ], + "description": "Reporting party information" + }, + "incident": { + "anyOf": [ + { + "$ref": "#/$defs/PoliceReportIncident" + }, + { + "type": "null" + } + ], + "description": "Incident details" + }, + "vehicles": { + "anyOf": [ + { + "items": { + "$ref": "#/$defs/PoliceReportVehicle" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "Vehicles involved", + "title": "Vehicles" + }, + "damage_summary": { + "anyOf": [ + { + "$ref": "#/$defs/PoliceReportDamageSummary" + }, + { + "type": "null" + } + ], + "description": "Damage summary" + } + }, + "required": [ + "report_number", + "is_filed", + "reporting_agency", + "insurance_company", + "claim_number", + "policy_number", + "reporting_party", + "incident", + "vehicles", + "damage_summary" + ], + "title": "PoliceReportDocument", + "type": "object" +} diff --git a/src/ContentProcessorAPI/samples/schemas/policereport.py b/src/ContentProcessorAPI/samples/schemas/policereport.py deleted file mode 100644 index 
8d437a89..00000000 --- a/src/ContentProcessorAPI/samples/schemas/policereport.py +++ /dev/null @@ -1,353 +0,0 @@ -# Copyright (c) Microsoft Corporation. -# Licensed under the MIT License. -"""Pydantic models for police report data extraction. - -Defines the schema used by the content processing pipeline to extract -structured fields from police report documents attached to insurance claims. -""" - -from __future__ import annotations - -import json -from typing import List, Optional - -from pydantic import BaseModel, Field - - -class PoliceReportAddress(BaseModel): - """A class representing an address referenced in a police report.""" - - street: Optional[str] = Field(description="Street address, e.g. 123 Main St.") - city: Optional[str] = Field(description="City, e.g. Macon") - state: Optional[str] = Field(description="State, e.g. GA") - postal_code: Optional[str] = Field(description="Postal code, e.g. 31201") - country: Optional[str] = Field(description="Country, e.g. USA") - - @staticmethod - def example() -> "PoliceReportAddress": - """Return an empty instance with default placeholder values.""" - return PoliceReportAddress( - street="", city="", state="", postal_code="", country="" - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "street": self.street, - "city": self.city, - "state": self.state, - "postal_code": self.postal_code, - "country": self.country, - } - - -class ReportingParty(BaseModel): - """A class representing the reporting party / claimant in the police report context.""" - - name: Optional[str] = Field(description="Full name of reporting party") - address: Optional[PoliceReportAddress] = Field( - description="Address of reporting party" - ) - phone: Optional[str] = Field(description="Phone number") - email: Optional[str] = Field(description="Email address") - - @staticmethod - def example() -> "ReportingParty": - """Return an empty instance with default placeholder values.""" - return ReportingParty( - 
name="", - address=PoliceReportAddress.example(), - phone="", - email="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "name": self.name, - "address": self.address.to_dict() if self.address else None, - "phone": self.phone, - "email": self.email, - } - - -class PoliceReportVehicle(BaseModel): - """A class representing a vehicle referenced in a police report.""" - - year: Optional[int] = Field(description="Vehicle year, e.g. 2022") - make: Optional[str] = Field(description="Vehicle make, e.g. Toyota") - model: Optional[str] = Field(description="Vehicle model, e.g. Camry") - trim: Optional[str] = Field(description="Vehicle trim, e.g. SE") - vin: Optional[str] = Field(description="Vehicle VIN") - license_plate: Optional[str] = Field(description="License plate") - mileage: Optional[int] = Field(description="Mileage") - - @staticmethod - def example() -> "PoliceReportVehicle": - """Return an empty instance with default placeholder values.""" - return PoliceReportVehicle( - year=0, - make="", - model="", - trim="", - vin="", - license_plate="", - mileage=0, - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "year": self.year, - "make": self.make, - "model": self.model, - "trim": self.trim, - "vin": self.vin, - "license_plate": self.license_plate, - "mileage": self.mileage, - } - - -class PoliceReportIncident(BaseModel): - """A class representing incident details in a police report.""" - - date: Optional[str] = Field(description="Incident date, e.g. 2025-11-28") - time: Optional[str] = Field(description="Incident time, e.g. 
14:15") - location: Optional[str] = Field(description="Incident location") - cause: Optional[str] = Field(description="Cause of incident") - narrative: Optional[str] = Field( - description="Narrative/description of what happened" - ) - - @staticmethod - def example() -> "PoliceReportIncident": - """Return an empty instance with default placeholder values.""" - return PoliceReportIncident( - date="", time="", location="", cause="", narrative="" - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "date": self.date, - "time": self.time, - "location": self.location, - "cause": self.cause, - "narrative": self.narrative, - } - - -class PoliceReportDamageItem(BaseModel): - """A class representing a damage line item recorded alongside a police report.""" - - item_description: Optional[str] = Field(description="Damaged item/area description") - repair_estimate: Optional[float] = Field(description="Repair estimate amount") - repair_estimate_currency: Optional[str] = Field( - description="Currency of repair_estimate, e.g. USD" - ) - - @staticmethod - def example() -> "PoliceReportDamageItem": - """Return an empty instance with default placeholder values.""" - return PoliceReportDamageItem( - item_description="", - repair_estimate=0.0, - repair_estimate_currency="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "item_description": self.item_description, - "repair_estimate": self.repair_estimate, - "repair_estimate_currency": self.repair_estimate_currency, - } - - -class PoliceReportDamageSummary(BaseModel): - """A class representing a damage summary section.""" - - items: Optional[List[PoliceReportDamageItem]] = Field( - description="List of damage items" - ) - total_estimated_repair: Optional[float] = Field( - description="Total estimated repair amount" - ) - total_estimated_repair_currency: Optional[str] = Field( - description="Currency of total_estimated_repair, e.g. 
USD" - ) - - @staticmethod - def example() -> "PoliceReportDamageSummary": - """Return an empty instance with default placeholder values.""" - return PoliceReportDamageSummary( - items=[PoliceReportDamageItem.example()], - total_estimated_repair=0.0, - total_estimated_repair_currency="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "items": [item.to_dict() for item in (self.items or [])], - "total_estimated_repair": self.total_estimated_repair, - "total_estimated_repair_currency": self.total_estimated_repair_currency, - } - - -class PoliceReportDocument(BaseModel): - """A class representing a police report document attached to an auto claim. - - Note: The sample content includes the statement "Police Report: Filed (Report # GA-20251128-CR)". - This schema focuses on extracting the report identifier and the related incident context. - """ - - report_number: Optional[str] = Field( - description="Police report number, e.g. GA-20251128-CR" - ) - is_filed: Optional[bool] = Field(description="Whether a police report was filed") - reporting_agency: Optional[str] = Field(description="Reporting agency / department") - - insurance_company: Optional[str] = Field(description="Insurance company name") - claim_number: Optional[str] = Field(description="Claim number") - policy_number: Optional[str] = Field(description="Policy number") - - reporting_party: Optional[ReportingParty] = Field( - description="Reporting party information" - ) - incident: Optional[PoliceReportIncident] = Field(description="Incident details") - vehicles: Optional[List[PoliceReportVehicle]] = Field( - description="Vehicles involved" - ) - damage_summary: Optional[PoliceReportDamageSummary] = Field( - description="Damage summary" - ) - - @staticmethod - def example() -> "PoliceReportDocument": - """Return an empty instance with default placeholder values.""" - return PoliceReportDocument( - report_number="", - is_filed=False, - reporting_agency="", - 
insurance_company="", - claim_number="", - policy_number="", - reporting_party=ReportingParty.example(), - incident=PoliceReportIncident.example(), - vehicles=[PoliceReportVehicle.example()], - damage_summary=PoliceReportDamageSummary.example(), - ) - - @staticmethod - def from_json(json_str: str) -> "PoliceReportDocument": - """Deserialize a JSON string into a PoliceReportDocument instance.""" - json_content = json.loads(json_str) - - def create_address(address: Optional[dict]) -> Optional[PoliceReportAddress]: - if not address: - return None - return PoliceReportAddress( - street=address.get("street"), - city=address.get("city"), - state=address.get("state"), - postal_code=address.get("postal_code"), - country=address.get("country"), - ) - - def create_reporting_party(details: Optional[dict]) -> Optional[ReportingParty]: - if not details: - return None - return ReportingParty( - name=details.get("name"), - address=create_address(details.get("address")), - phone=details.get("phone"), - email=details.get("email"), - ) - - def create_incident(details: Optional[dict]) -> Optional[PoliceReportIncident]: - if not details: - return None - return PoliceReportIncident( - date=details.get("date"), - time=details.get("time"), - location=details.get("location"), - cause=details.get("cause"), - narrative=details.get("narrative"), - ) - - def create_vehicle(details: Optional[dict]) -> Optional[PoliceReportVehicle]: - if not details: - return None - return PoliceReportVehicle( - year=details.get("year"), - make=details.get("make"), - model=details.get("model"), - trim=details.get("trim"), - vin=details.get("vin"), - license_plate=details.get("license_plate"), - mileage=details.get("mileage"), - ) - - def create_damage_item( - details: Optional[dict], - ) -> Optional[PoliceReportDamageItem]: - if not details: - return None - return PoliceReportDamageItem( - item_description=details.get("item_description"), - repair_estimate=details.get("repair_estimate"), - 
repair_estimate_currency=details.get("repair_estimate_currency"), - ) - - def create_damage_summary( - details: Optional[dict], - ) -> Optional[PoliceReportDamageSummary]: - if not details: - return None - items_raw = details.get("items") or [] - items = [create_damage_item(i) for i in items_raw] - items = [i for i in items if i is not None] - return PoliceReportDamageSummary( - items=items, - total_estimated_repair=details.get("total_estimated_repair"), - total_estimated_repair_currency=details.get( - "total_estimated_repair_currency" - ), - ) - - vehicles_raw = json_content.get("vehicles") or [] - vehicles = [create_vehicle(v) for v in vehicles_raw] - vehicles = [v for v in vehicles if v is not None] - - return PoliceReportDocument( - report_number=json_content.get("report_number"), - is_filed=json_content.get("is_filed"), - reporting_agency=json_content.get("reporting_agency"), - insurance_company=json_content.get("insurance_company"), - claim_number=json_content.get("claim_number"), - policy_number=json_content.get("policy_number"), - reporting_party=create_reporting_party(json_content.get("reporting_party")), - incident=create_incident(json_content.get("incident")), - vehicles=vehicles, - damage_summary=create_damage_summary(json_content.get("damage_summary")), - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "report_number": self.report_number, - "is_filed": self.is_filed, - "reporting_agency": self.reporting_agency, - "insurance_company": self.insurance_company, - "claim_number": self.claim_number, - "policy_number": self.policy_number, - "reporting_party": self.reporting_party.to_dict() - if self.reporting_party - else None, - "incident": self.incident.to_dict() if self.incident else None, - "vehicles": [v.to_dict() for v in (self.vehicles or [])], - "damage_summary": self.damage_summary.to_dict() - if self.damage_summary - else None, - } diff --git a/src/ContentProcessorAPI/samples/schemas/register_schema.py 
b/src/ContentProcessorAPI/samples/schemas/register_schema.py index ecd015c9..45cdc72c 100644 --- a/src/ContentProcessorAPI/samples/schemas/register_schema.py +++ b/src/ContentProcessorAPI/samples/schemas/register_schema.py @@ -17,7 +17,7 @@ Manifest format (see schema_info.json): { "schemas": [ - { "File": "autoclaim.py", "ClassName": "...", "Description": "..." }, + { "File": "autoclaim.json", "ClassName": "...", "Description": "..." }, ... ], "schemaset": { @@ -25,6 +25,9 @@ "Description": "Claim schema set for auto claims processing" } } + +Only ``.json`` schema files are accepted; the legacy ``.py`` format was +removed as part of the schemavault RCE remediation. """ from __future__ import annotations @@ -75,11 +78,23 @@ def _register_schema( print(f" Description: {existing.get('Description')}") return schema_id - print(f"Registering new schema '{class_name}'...") + # Only JSON Schema descriptors (.json) are accepted. The legacy + # ``.py`` (executable Pydantic class) format was removed because + # the worker would ``exec`` uploaded code, exposing an RCE primitive. + extension = schema_path.suffix.lower() + if extension != ".json": + print( + f"Error: Unsupported schema extension '{extension}' for " + f"'{schema_path.name}'. Only .json schemas are accepted. Skipping..." 
+ ) + return None + content_type = "application/json" + + print(f"Registering new schema '{class_name}' ({extension})...") data_payload = json.dumps({"ClassName": class_name, "Description": description}) with open(schema_path, "rb") as f: - files = {"file": (schema_path.name, f, "text/x-python")} + files = {"file": (schema_path.name, f, content_type)} data = {"data": data_payload} resp = requests.post(schemavault_url, files=files, data=data, timeout=60) diff --git a/src/ContentProcessorAPI/samples/schemas/repairestimate.json b/src/ContentProcessorAPI/samples/schemas/repairestimate.json new file mode 100644 index 00000000..5874a862 --- /dev/null +++ b/src/ContentProcessorAPI/samples/schemas/repairestimate.json @@ -0,0 +1,596 @@ +{ + "$defs": { + "RepairEstimateLineItem": { + "description": "A class representing a repair estimate line item.", + "properties": { + "service_description": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Service description, e.g. Dent repair (quarter panel)", + "title": "Service Description" + }, + "labor_hours": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Labor hours, e.g. 2.0", + "title": "Labor Hours" + }, + "rate_per_hour": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Labor rate per hour, e.g. 75.0", + "title": "Rate Per Hour" + }, + "rate_per_hour_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency for rate_per_hour, e.g. USD", + "title": "Rate Per Hour Currency" + }, + "parts_cost": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Parts cost, e.g. 150.0", + "title": "Parts Cost" + }, + "parts_cost_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency for parts_cost, e.g. 
USD", + "title": "Parts Cost Currency" + }, + "materials_cost": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Materials/supplies cost, e.g. 50.0", + "title": "Materials Cost" + }, + "materials_cost_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency for materials_cost, e.g. USD", + "title": "Materials Cost Currency" + }, + "total": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Line total amount", + "title": "Total" + }, + "total_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency for total, e.g. USD", + "title": "Total Currency" + } + }, + "required": [ + "service_description", + "labor_hours", + "rate_per_hour", + "rate_per_hour_currency", + "parts_cost", + "parts_cost_currency", + "materials_cost", + "materials_cost_currency", + "total", + "total_currency" + ], + "title": "RepairEstimateLineItem", + "type": "object" + }, + "RepairEstimateVehicle": { + "description": "A class representing the customer vehicle on a repair estimate.", + "properties": { + "year": { + "anyOf": [ + { + "type": "integer" + }, + { + "type": "null" + } + ], + "description": "Vehicle year, e.g. 2022", + "title": "Year" + }, + "make": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle make, e.g. Toyota", + "title": "Make" + }, + "model": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle model, e.g. Camry", + "title": "Model" + }, + "trim": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle trim, e.g. SE", + "title": "Trim" + }, + "vin": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Vehicle VIN, e.g. 
4T1G11AK2NU123456", + "title": "Vin" + }, + "license_plate": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "License plate, e.g. GA-ABC123", + "title": "License Plate" + } + }, + "required": [ + "year", + "make", + "model", + "trim", + "vin", + "license_plate" + ], + "title": "RepairEstimateVehicle", + "type": "object" + }, + "RepairShopAddress": { + "description": "A class representing an auto body shop address.", + "properties": { + "street": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Street address, e.g. 456 Repair Lane", + "title": "Street" + }, + "city": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "City, e.g. Macon", + "title": "City" + }, + "state": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "State, e.g. GA", + "title": "State" + }, + "postal_code": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Postal code, e.g. 31201", + "title": "Postal Code" + }, + "country": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Country, e.g. USA", + "title": "Country" + } + }, + "required": [ + "street", + "city", + "state", + "postal_code", + "country" + ], + "title": "RepairShopAddress", + "type": "object" + }, + "Signature": { + "description": "A class representing an authorized signature field.", + "properties": { + "signatory": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Name of the signatory", + "title": "Signatory" + }, + "is_signed": { + "anyOf": [ + { + "type": "boolean" + }, + { + "type": "null" + } + ], + "description": "Indicates if the document is signed. GPT should check whether it has signature in image files. 
If a signature is present, set this to true.", + "title": "Is Signed" + } + }, + "required": [ + "signatory", + "is_signed" + ], + "title": "Signature", + "type": "object" + } + }, + "description": "A class representing an auto body shop repair estimate document.", + "properties": { + "estimate_number": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Estimate number, e.g. EST-20251130", + "title": "Estimate Number" + }, + "date": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Estimate date, e.g. 2025-11-30", + "title": "Date" + }, + "prepared_by": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Prepared by / shop name, e.g. Macon Auto Body & Paint", + "title": "Prepared By" + }, + "shop_address": { + "anyOf": [ + { + "$ref": "#/$defs/RepairShopAddress" + }, + { + "type": "null" + } + ], + "description": "Shop address" + }, + "shop_phone": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Shop phone number", + "title": "Shop Phone" + }, + "customer_name": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Customer name, e.g. 
Chad Brooks", + "title": "Customer Name" + }, + "vehicle": { + "anyOf": [ + { + "$ref": "#/$defs/RepairEstimateVehicle" + }, + { + "type": "null" + } + ], + "description": "Vehicle information" + }, + "damage_description": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Damage description / narrative", + "title": "Damage Description" + }, + "repair_details": { + "anyOf": [ + { + "items": { + "$ref": "#/$defs/RepairEstimateLineItem" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "Repair detail line items", + "title": "Repair Details" + }, + "subtotal": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Subtotal amount", + "title": "Subtotal" + }, + "subtotal_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency for subtotal, e.g. USD", + "title": "Subtotal Currency" + }, + "tax_rate": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Tax rate, e.g. 7%", + "title": "Tax Rate" + }, + "tax_amount": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Tax amount, e.g. 24.50", + "title": "Tax Amount" + }, + "tax_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency for tax_amount, e.g. USD", + "title": "Tax Currency" + }, + "total_estimate": { + "anyOf": [ + { + "type": "number" + }, + { + "type": "null" + } + ], + "description": "Total estimate amount, e.g. 374.50", + "title": "Total Estimate" + }, + "total_estimate_currency": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Currency for total_estimate, e.g. 
USD", + "title": "Total Estimate Currency" + }, + "notes": { + "anyOf": [ + { + "items": { + "type": "string" + }, + "type": "array" + }, + { + "type": "null" + } + ], + "description": "Notes on the estimate", + "title": "Notes" + }, + "authorized_signature": { + "anyOf": [ + { + "$ref": "#/$defs/Signature" + }, + { + "type": "null" + } + ], + "description": "Authorized signature" + }, + "authorized_signature_date": { + "anyOf": [ + { + "type": "string" + }, + { + "type": "null" + } + ], + "description": "Signature date, e.g. 2025-11-30", + "title": "Authorized Signature Date" + } + }, + "required": [ + "estimate_number", + "date", + "prepared_by", + "shop_address", + "shop_phone", + "customer_name", + "vehicle", + "damage_description", + "repair_details", + "subtotal", + "subtotal_currency", + "tax_rate", + "tax_amount", + "tax_currency", + "total_estimate", + "total_estimate_currency", + "notes", + "authorized_signature", + "authorized_signature_date" + ], + "title": "RepairEstimateDocument", + "type": "object" +} diff --git a/src/ContentProcessorAPI/samples/schemas/repairestimate.py b/src/ContentProcessorAPI/samples/schemas/repairestimate.py deleted file mode 100644 index 31635a4b..00000000 --- a/src/ContentProcessorAPI/samples/schemas/repairestimate.py +++ /dev/null @@ -1,333 +0,0 @@ -# Copyright (c) Microsoft Corporation. -# Licensed under the MIT License. -"""Pydantic models for auto repair estimate data extraction. - -Defines the schema used by the content processing pipeline to extract -structured fields from body shop repair estimate documents. -""" - -from __future__ import annotations - -import json -from typing import List, Optional - -from pydantic import BaseModel, Field - - -class RepairShopAddress(BaseModel): - """A class representing an auto body shop address.""" - - street: Optional[str] = Field(description="Street address, e.g. 456 Repair Lane") - city: Optional[str] = Field(description="City, e.g. 
Macon") - state: Optional[str] = Field(description="State, e.g. GA") - postal_code: Optional[str] = Field(description="Postal code, e.g. 31201") - country: Optional[str] = Field(description="Country, e.g. USA") - - @staticmethod - def example() -> "RepairShopAddress": - """Return an empty instance with default placeholder values.""" - return RepairShopAddress( - street="", city="", state="", postal_code="", country="" - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "street": self.street, - "city": self.city, - "state": self.state, - "postal_code": self.postal_code, - "country": self.country, - } - - -class RepairEstimateVehicle(BaseModel): - """A class representing the customer vehicle on a repair estimate.""" - - year: Optional[int] = Field(description="Vehicle year, e.g. 2022") - make: Optional[str] = Field(description="Vehicle make, e.g. Toyota") - model: Optional[str] = Field(description="Vehicle model, e.g. Camry") - trim: Optional[str] = Field(description="Vehicle trim, e.g. SE") - vin: Optional[str] = Field(description="Vehicle VIN, e.g. 4T1G11AK2NU123456") - license_plate: Optional[str] = Field(description="License plate, e.g. GA-ABC123") - - @staticmethod - def example() -> "RepairEstimateVehicle": - """Return an empty instance with default placeholder values.""" - return RepairEstimateVehicle( - year=0, - make="", - model="", - trim="", - vin="", - license_plate="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "year": self.year, - "make": self.make, - "model": self.model, - "trim": self.trim, - "vin": self.vin, - "license_plate": self.license_plate, - } - - -class RepairEstimateLineItem(BaseModel): - """A class representing a repair estimate line item.""" - - service_description: Optional[str] = Field( - description="Service description, e.g. Dent repair (quarter panel)" - ) - labor_hours: Optional[float] = Field(description="Labor hours, e.g. 
2.0") - rate_per_hour: Optional[float] = Field(description="Labor rate per hour, e.g. 75.0") - rate_per_hour_currency: Optional[str] = Field( - description="Currency for rate_per_hour, e.g. USD" - ) - parts_cost: Optional[float] = Field(description="Parts cost, e.g. 150.0") - parts_cost_currency: Optional[str] = Field( - description="Currency for parts_cost, e.g. USD" - ) - materials_cost: Optional[float] = Field( - description="Materials/supplies cost, e.g. 50.0" - ) - materials_cost_currency: Optional[str] = Field( - description="Currency for materials_cost, e.g. USD" - ) - total: Optional[float] = Field(description="Line total amount") - total_currency: Optional[str] = Field(description="Currency for total, e.g. USD") - - @staticmethod - def example() -> "RepairEstimateLineItem": - """Return an empty instance with default placeholder values.""" - return RepairEstimateLineItem( - service_description="", - labor_hours=0.0, - rate_per_hour=0.0, - rate_per_hour_currency="", - parts_cost=0.0, - parts_cost_currency="", - materials_cost=0.0, - materials_cost_currency="", - total=0.0, - total_currency="", - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "service_description": self.service_description, - "labor_hours": self.labor_hours, - "rate_per_hour": self.rate_per_hour, - "rate_per_hour_currency": self.rate_per_hour_currency, - "parts_cost": self.parts_cost, - "parts_cost_currency": self.parts_cost_currency, - "materials_cost": self.materials_cost, - "materials_cost_currency": self.materials_cost_currency, - "total": self.total, - "total_currency": self.total_currency, - } - - -class Signature(BaseModel): - """A class representing an authorized signature field.""" - - signatory: Optional[str] = Field(description="Name of the signatory") - is_signed: Optional[bool] = Field( - description="Indicates if the document is signed. GPT should check whether it has signature in image files. 
if there is Sign, fill it up as True" - ) - - @staticmethod - def example() -> "Signature": - """Return an empty instance with default placeholder values.""" - return Signature(signatory="", is_signed=False) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return {"signatory": self.signatory, "is_signed": self.is_signed} - - -class RepairEstimateDocument(BaseModel): - """A class representing an auto body shop repair estimate document.""" - - estimate_number: Optional[str] = Field( - description="Estimate number, e.g. EST-20251130" - ) - date: Optional[str] = Field(description="Estimate date, e.g. 2025-11-30") - - prepared_by: Optional[str] = Field( - description="Prepared by / shop name, e.g. Macon Auto Body & Paint" - ) - shop_address: Optional[RepairShopAddress] = Field(description="Shop address") - shop_phone: Optional[str] = Field(description="Shop phone number") - - customer_name: Optional[str] = Field(description="Customer name, e.g. Chad Brooks") - vehicle: Optional[RepairEstimateVehicle] = Field(description="Vehicle information") - - damage_description: Optional[str] = Field( - description="Damage description / narrative" - ) - - repair_details: Optional[List[RepairEstimateLineItem]] = Field( - description="Repair detail line items" - ) - - subtotal: Optional[float] = Field(description="Subtotal amount") - subtotal_currency: Optional[str] = Field( - description="Currency for subtotal, e.g. USD" - ) - - tax_rate: Optional[str] = Field(description="Tax rate, e.g. 7%") - tax_amount: Optional[float] = Field(description="Tax amount, e.g. 24.50") - tax_currency: Optional[str] = Field(description="Currency for tax_amount, e.g. USD") - - total_estimate: Optional[float] = Field( - description="Total estimate amount, e.g. 374.50" - ) - total_estimate_currency: Optional[str] = Field( - description="Currency for total_estimate, e.g. 
USD" - ) - - notes: Optional[List[str]] = Field(description="Notes on the estimate") - - authorized_signature: Optional[Signature] = Field( - description="Authorized signature" - ) - authorized_signature_date: Optional[str] = Field( - description="Signature date, e.g. 2025-11-30" - ) - - @staticmethod - def example() -> "RepairEstimateDocument": - """Return an empty instance with default placeholder values.""" - return RepairEstimateDocument( - estimate_number="", - date="", - prepared_by="", - shop_address=RepairShopAddress.example(), - shop_phone="", - customer_name="", - vehicle=RepairEstimateVehicle.example(), - damage_description="", - repair_details=[RepairEstimateLineItem.example()], - subtotal=0.0, - subtotal_currency="", - tax_rate="", - tax_amount=0.0, - tax_currency="", - total_estimate=0.0, - total_estimate_currency="", - notes=[], - authorized_signature=Signature.example(), - authorized_signature_date="", - ) - - @staticmethod - def from_json(json_str: str) -> "RepairEstimateDocument": - """Deserialize a JSON string into a RepairEstimateDocument instance.""" - json_content = json.loads(json_str) - - def create_address(details: Optional[dict]) -> Optional[RepairShopAddress]: - if not details: - return None - return RepairShopAddress( - street=details.get("street"), - city=details.get("city"), - state=details.get("state"), - postal_code=details.get("postal_code"), - country=details.get("country"), - ) - - def create_vehicle(details: Optional[dict]) -> Optional[RepairEstimateVehicle]: - if not details: - return None - return RepairEstimateVehicle( - year=details.get("year"), - make=details.get("make"), - model=details.get("model"), - trim=details.get("trim"), - vin=details.get("vin"), - license_plate=details.get("license_plate"), - ) - - def create_line_item( - details: Optional[dict], - ) -> Optional[RepairEstimateLineItem]: - if not details: - return None - return RepairEstimateLineItem( - service_description=details.get("service_description"), - 
labor_hours=details.get("labor_hours"), - rate_per_hour=details.get("rate_per_hour"), - rate_per_hour_currency=details.get("rate_per_hour_currency"), - parts_cost=details.get("parts_cost"), - parts_cost_currency=details.get("parts_cost_currency"), - materials_cost=details.get("materials_cost"), - materials_cost_currency=details.get("materials_cost_currency"), - total=details.get("total"), - total_currency=details.get("total_currency"), - ) - - def create_signature(details: Optional[dict]) -> Optional[Signature]: - if not details: - return None - return Signature( - signatory=details.get("signatory"), - is_signed=details.get("is_signed"), - ) - - line_items_raw = json_content.get("repair_details") or [] - line_items = [create_line_item(item) for item in line_items_raw] - line_items = [item for item in line_items if item is not None] - - return RepairEstimateDocument( - estimate_number=json_content.get("estimate_number"), - date=json_content.get("date"), - prepared_by=json_content.get("prepared_by"), - shop_address=create_address(json_content.get("shop_address")), - shop_phone=json_content.get("shop_phone"), - customer_name=json_content.get("customer_name"), - vehicle=create_vehicle(json_content.get("vehicle")), - damage_description=json_content.get("damage_description"), - repair_details=line_items, - subtotal=json_content.get("subtotal"), - subtotal_currency=json_content.get("subtotal_currency"), - tax_rate=json_content.get("tax_rate"), - tax_amount=json_content.get("tax_amount"), - tax_currency=json_content.get("tax_currency"), - total_estimate=json_content.get("total_estimate"), - total_estimate_currency=json_content.get("total_estimate_currency"), - notes=json_content.get("notes") or [], - authorized_signature=create_signature( - json_content.get("authorized_signature") - ), - authorized_signature_date=json_content.get("authorized_signature_date"), - ) - - def to_dict(self) -> dict: - """Serialize to a plain dictionary.""" - return { - "estimate_number": 
self.estimate_number, - "date": self.date, - "prepared_by": self.prepared_by, - "shop_address": self.shop_address.to_dict() if self.shop_address else None, - "shop_phone": self.shop_phone, - "customer_name": self.customer_name, - "vehicle": self.vehicle.to_dict() if self.vehicle else None, - "damage_description": self.damage_description, - "repair_details": [item.to_dict() for item in (self.repair_details or [])], - "subtotal": self.subtotal, - "subtotal_currency": self.subtotal_currency, - "tax_rate": self.tax_rate, - "tax_amount": self.tax_amount, - "tax_currency": self.tax_currency, - "total_estimate": self.total_estimate, - "total_estimate_currency": self.total_estimate_currency, - "notes": self.notes or [], - "authorized_signature": self.authorized_signature.to_dict() - if self.authorized_signature - else None, - "authorized_signature_date": self.authorized_signature_date, - } diff --git a/src/ContentProcessorAPI/samples/schemas/schema_info.json b/src/ContentProcessorAPI/samples/schemas/schema_info.json index f4667e15..d1cbad0d 100644 --- a/src/ContentProcessorAPI/samples/schemas/schema_info.json +++ b/src/ContentProcessorAPI/samples/schemas/schema_info.json @@ -1,22 +1,22 @@ { "schemas": [ { - "File": "autoclaim.py", + "File": "autoclaim.json", "ClassName": "AutoInsuranceClaimForm", "Description": "Auto Insurance Claim Form" }, { - "File": "damagedcarimage.py", + "File": "damagedcarimage.json", "ClassName": "DamagedVehicleImageAssessment", "Description": "Damaged Vehicle Image Assessment" }, { - "File": "policereport.py", + "File": "policereport.json", "ClassName": "PoliceReportDocument", "Description": "Police Report Document" }, { - "File": "repairestimate.py", + "File": "repairestimate.json", "ClassName": "RepairEstimateDocument", "Description": "Repair Estimate Document" } diff --git a/src/ContentProcessorAPI/test_http/schema_API.http b/src/ContentProcessorAPI/test_http/schema_API.http index 3efd9b60..3c550e21 100644 --- 
a/src/ContentProcessorAPI/test_http/schema_API.http +++ b/src/ContentProcessorAPI/test_http/schema_API.http @@ -6,10 +6,10 @@ # @name listSchemas GET {{baseUrl}}{{schemavault}}/ -### Register a schema (.py) into the vault +### Register a schema (.json) into the vault # Sends multipart/form-data with fields: # - data: JSON string { ClassName, Description } -# - file: schema file +# - file: schema file (.json only) # # @name registerSchema POST {{baseUrl}}{{schemavault}}/ @@ -24,10 +24,10 @@ Content-Type: application/json "Description": "Uploaded from VS Code REST Client" } ------schema-boundary -Content-Disposition: form-data; name="file"; filename="autoclaim.py" -Content-Type: text/x-python +Content-Disposition: form-data; name="file"; filename="autoclaim.json" +Content-Type: application/json -< ../samples/schemas/autoclaim.py +< ../samples/schemas/autoclaim.json ------schema-boundary-- ### Update an existing schema (re-upload file + new ClassName) @@ -46,10 +46,10 @@ Content-Type: application/json "ClassName": "DamagedVehicleImageAssessment" } ------schema-boundary -Content-Disposition: form-data; name="file"; filename="damagedcarimage.py" -Content-Type: text/x-python +Content-Disposition: form-data; name="file"; filename="damagedcarimage.json" +Content-Type: application/json -< ../samples/schemas/damagedcarimage.py +< ../samples/schemas/damagedcarimage.json ------schema-boundary-- ### Download the registered schema file diff --git a/src/ContentProcessorAPI/uv.lock b/src/ContentProcessorAPI/uv.lock index 87af67cc..37a716fd 100644 --- a/src/ContentProcessorAPI/uv.lock +++ b/src/ContentProcessorAPI/uv.lock @@ -532,6 +532,7 @@ dependencies = [ { name = "cryptography" }, { name = "fastapi", extra = ["standard"] }, { name = "h11" }, + { name = "jsonschema" }, { name = "opentelemetry-api" }, { name = "poppler-utils" }, { name = "pydantic" }, @@ -571,6 +572,7 @@ requires-dist = [ { name = "cryptography", specifier = "==46.0.7" }, { name = "fastapi", extras = 
["standard"], specifier = "==0.135.2" }, { name = "h11", specifier = "==0.16.0" }, + { name = "jsonschema", specifier = "==4.25.1" }, { name = "opentelemetry-api", specifier = "==1.40.0" }, { name = "poppler-utils", specifier = "==0.1.0" }, { name = "pydantic", specifier = "==2.13.3" }, @@ -1192,6 +1194,33 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/62/a1/3d680cbfd5f4b8f15abc1d571870c5fc3e594bb582bc3b64ea099db13e56/jinja2-3.1.6-py3-none-any.whl", hash = "sha256:85ece4451f492d0c13c5dd7c13a64681a86afae63a5f347908daf103ce6d2f67", size = 134899, upload-time = "2025-03-05T20:05:00.369Z" }, ] +[[package]] +name = "jsonschema" +version = "4.25.1" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "attrs" }, + { name = "jsonschema-specifications" }, + { name = "referencing" }, + { name = "rpds-py" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/74/69/f7185de793a29082a9f3c7728268ffb31cb5095131a9c139a74078e27336/jsonschema-4.25.1.tar.gz", hash = "sha256:e4a9655ce0da0c0b67a085847e00a3a51449e1157f4f75e9fb5aa545e122eb85", size = 357342, upload-time = "2025-08-18T17:03:50.038Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/bf/9c/8c95d856233c1f82500c2450b8c68576b4cf1c871db3afac5c34ff84e6fd/jsonschema-4.25.1-py3-none-any.whl", hash = "sha256:3fba0169e345c7175110351d456342c364814cfcf3b964ba4587f22915230a63", size = 90040, upload-time = "2025-08-18T17:03:48.373Z" }, +] + +[[package]] +name = "jsonschema-specifications" +version = "2025.9.1" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "referencing" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/19/74/a633ee74eb36c44aa6d1095e7cc5569bebf04342ee146178e2d36600708b/jsonschema_specifications-2025.9.1.tar.gz", hash = "sha256:b540987f239e745613c7a9176f3edb72b832a4ac465cf02712288397832b5e8d", size = 32855, upload-time = "2025-09-08T01:34:59.186Z" } +wheels = [ + { url = 
"https://files.pythonhosted.org/packages/41/45/1a4ed80516f02155c51f51e8cedb3c1902296743db0bbc66608a0db2814f/jsonschema_specifications-2025.9.1-py3-none-any.whl", hash = "sha256:98802fee3a11ee76ecaca44429fda8a41bff98b00a0f2838151b113f210cc6fe", size = 18437, upload-time = "2025-09-08T01:34:57.871Z" }, +] + [[package]] name = "keyring" version = "25.7.0" @@ -2313,6 +2342,20 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/e1/67/921ec3024056483db83953ae8e48079ad62b92db7880013ca77632921dd0/readme_renderer-44.0-py3-none-any.whl", hash = "sha256:2fbca89b81a08526aadf1357a8c2ae889ec05fb03f5da67f9769c9a592166151", size = 13310, upload-time = "2024-07-08T15:00:56.577Z" }, ] +[[package]] +name = "referencing" +version = "0.37.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "attrs" }, + { name = "rpds-py" }, + { name = "typing-extensions", marker = "python_full_version < '3.13'" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/22/f5/df4e9027acead3ecc63e50fe1e36aca1523e1719559c499951bb4b53188f/referencing-0.37.0.tar.gz", hash = "sha256:44aefc3142c5b842538163acb373e24cce6632bd54bdb01b21ad5863489f50d8", size = 78036, upload-time = "2025-10-13T15:30:48.871Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/2c/58/ca301544e1fa93ed4f80d724bf5b194f6e4b945841c5bfd555878eea9fcb/referencing-0.37.0-py3-none-any.whl", hash = "sha256:381329a9f99628c9069361716891d34ad94af76e461dcb0335825aecc7692231", size = 26766, upload-time = "2025-10-13T15:30:47.625Z" }, +] + [[package]] name = "requests" version = "2.33.1" @@ -2457,6 +2500,87 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/79/62/b88e5879512c55b8ee979c666ee6902adc4ed05007226de266410ae27965/rignore-0.7.6-cp314-cp314t-win_arm64.whl", hash = "sha256:b83adabeb3e8cf662cabe1931b83e165b88c526fa6af6b3aa90429686e474896", size = 656035, upload-time = "2025-11-05T21:41:31.13Z" }, ] +[[package]] +name = "rpds-py" +version = "0.30.0" +source = { registry = 
"https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/20/af/3f2f423103f1113b36230496629986e0ef7e199d2aa8392452b484b38ced/rpds_py-0.30.0.tar.gz", hash = "sha256:dd8ff7cf90014af0c0f787eea34794ebf6415242ee1d6fa91eaba725cc441e84", size = 69469, upload-time = "2025-11-30T20:24:38.837Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/03/e7/98a2f4ac921d82f33e03f3835f5bf3a4a40aa1bfdc57975e74a97b2b4bdd/rpds_py-0.30.0-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:a161f20d9a43006833cd7068375a94d035714d73a172b681d8881820600abfad", size = 375086, upload-time = "2025-11-30T20:22:17.93Z" }, + { url = "https://files.pythonhosted.org/packages/4d/a1/bca7fd3d452b272e13335db8d6b0b3ecde0f90ad6f16f3328c6fb150c889/rpds_py-0.30.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:6abc8880d9d036ecaafe709079969f56e876fcf107f7a8e9920ba6d5a3878d05", size = 359053, upload-time = "2025-11-30T20:22:19.297Z" }, + { url = "https://files.pythonhosted.org/packages/65/1c/ae157e83a6357eceff62ba7e52113e3ec4834a84cfe07fa4b0757a7d105f/rpds_py-0.30.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ca28829ae5f5d569bb62a79512c842a03a12576375d5ece7d2cadf8abe96ec28", size = 390763, upload-time = "2025-11-30T20:22:21.661Z" }, + { url = "https://files.pythonhosted.org/packages/d4/36/eb2eb8515e2ad24c0bd43c3ee9cd74c33f7ca6430755ccdb240fd3144c44/rpds_py-0.30.0-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:a1010ed9524c73b94d15919ca4d41d8780980e1765babf85f9a2f90d247153dd", size = 408951, upload-time = "2025-11-30T20:22:23.408Z" }, + { url = "https://files.pythonhosted.org/packages/d6/65/ad8dc1784a331fabbd740ef6f71ce2198c7ed0890dab595adb9ea2d775a1/rpds_py-0.30.0-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:f8d1736cfb49381ba528cd5baa46f82fdc65c06e843dab24dd70b63d09121b3f", size = 514622, upload-time = "2025-11-30T20:22:25.16Z" }, + { url = 
"https://files.pythonhosted.org/packages/63/8e/0cfa7ae158e15e143fe03993b5bcd743a59f541f5952e1546b1ac1b5fd45/rpds_py-0.30.0-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:d948b135c4693daff7bc2dcfc4ec57237a29bd37e60c2fabf5aff2bbacf3e2f1", size = 414492, upload-time = "2025-11-30T20:22:26.505Z" }, + { url = "https://files.pythonhosted.org/packages/60/1b/6f8f29f3f995c7ffdde46a626ddccd7c63aefc0efae881dc13b6e5d5bb16/rpds_py-0.30.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:47f236970bccb2233267d89173d3ad2703cd36a0e2a6e92d0560d333871a3d23", size = 394080, upload-time = "2025-11-30T20:22:27.934Z" }, + { url = "https://files.pythonhosted.org/packages/6d/d5/a266341051a7a3ca2f4b750a3aa4abc986378431fc2da508c5034d081b70/rpds_py-0.30.0-cp312-cp312-manylinux_2_31_riscv64.whl", hash = "sha256:2e6ecb5a5bcacf59c3f912155044479af1d0b6681280048b338b28e364aca1f6", size = 408680, upload-time = "2025-11-30T20:22:29.341Z" }, + { url = "https://files.pythonhosted.org/packages/10/3b/71b725851df9ab7a7a4e33cf36d241933da66040d195a84781f49c50490c/rpds_py-0.30.0-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:a8fa71a2e078c527c3e9dc9fc5a98c9db40bcc8a92b4e8858e36d329f8684b51", size = 423589, upload-time = "2025-11-30T20:22:31.469Z" }, + { url = "https://files.pythonhosted.org/packages/00/2b/e59e58c544dc9bd8bd8384ecdb8ea91f6727f0e37a7131baeff8d6f51661/rpds_py-0.30.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:73c67f2db7bc334e518d097c6d1e6fed021bbc9b7d678d6cc433478365d1d5f5", size = 573289, upload-time = "2025-11-30T20:22:32.997Z" }, + { url = "https://files.pythonhosted.org/packages/da/3e/a18e6f5b460893172a7d6a680e86d3b6bc87a54c1f0b03446a3c8c7b588f/rpds_py-0.30.0-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:5ba103fb455be00f3b1c2076c9d4264bfcb037c976167a6047ed82f23153f02e", size = 599737, upload-time = "2025-11-30T20:22:34.419Z" }, + { url = 
"https://files.pythonhosted.org/packages/5c/e2/714694e4b87b85a18e2c243614974413c60aa107fd815b8cbc42b873d1d7/rpds_py-0.30.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:7cee9c752c0364588353e627da8a7e808a66873672bcb5f52890c33fd965b394", size = 563120, upload-time = "2025-11-30T20:22:35.903Z" }, + { url = "https://files.pythonhosted.org/packages/6f/ab/d5d5e3bcedb0a77f4f613706b750e50a5a3ba1c15ccd3665ecc636c968fd/rpds_py-0.30.0-cp312-cp312-win32.whl", hash = "sha256:1ab5b83dbcf55acc8b08fc62b796ef672c457b17dbd7820a11d6c52c06839bdf", size = 223782, upload-time = "2025-11-30T20:22:37.271Z" }, + { url = "https://files.pythonhosted.org/packages/39/3b/f786af9957306fdc38a74cef405b7b93180f481fb48453a114bb6465744a/rpds_py-0.30.0-cp312-cp312-win_amd64.whl", hash = "sha256:a090322ca841abd453d43456ac34db46e8b05fd9b3b4ac0c78bcde8b089f959b", size = 240463, upload-time = "2025-11-30T20:22:39.021Z" }, + { url = "https://files.pythonhosted.org/packages/f3/d2/b91dc748126c1559042cfe41990deb92c4ee3e2b415f6b5234969ffaf0cc/rpds_py-0.30.0-cp312-cp312-win_arm64.whl", hash = "sha256:669b1805bd639dd2989b281be2cfd951c6121b65e729d9b843e9639ef1fd555e", size = 230868, upload-time = "2025-11-30T20:22:40.493Z" }, + { url = "https://files.pythonhosted.org/packages/ed/dc/d61221eb88ff410de3c49143407f6f3147acf2538c86f2ab7ce65ae7d5f9/rpds_py-0.30.0-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:f83424d738204d9770830d35290ff3273fbb02b41f919870479fab14b9d303b2", size = 374887, upload-time = "2025-11-30T20:22:41.812Z" }, + { url = "https://files.pythonhosted.org/packages/fd/32/55fb50ae104061dbc564ef15cc43c013dc4a9f4527a1f4d99baddf56fe5f/rpds_py-0.30.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:e7536cd91353c5273434b4e003cbda89034d67e7710eab8761fd918ec6c69cf8", size = 358904, upload-time = "2025-11-30T20:22:43.479Z" }, + { url = 
"https://files.pythonhosted.org/packages/58/70/faed8186300e3b9bdd138d0273109784eea2396c68458ed580f885dfe7ad/rpds_py-0.30.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2771c6c15973347f50fece41fc447c054b7ac2ae0502388ce3b6738cd366e3d4", size = 389945, upload-time = "2025-11-30T20:22:44.819Z" }, + { url = "https://files.pythonhosted.org/packages/bd/a8/073cac3ed2c6387df38f71296d002ab43496a96b92c823e76f46b8af0543/rpds_py-0.30.0-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:0a59119fc6e3f460315fe9d08149f8102aa322299deaa5cab5b40092345c2136", size = 407783, upload-time = "2025-11-30T20:22:46.103Z" }, + { url = "https://files.pythonhosted.org/packages/77/57/5999eb8c58671f1c11eba084115e77a8899d6e694d2a18f69f0ba471ec8b/rpds_py-0.30.0-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:76fec018282b4ead0364022e3c54b60bf368b9d926877957a8624b58419169b7", size = 515021, upload-time = "2025-11-30T20:22:47.458Z" }, + { url = "https://files.pythonhosted.org/packages/e0/af/5ab4833eadc36c0a8ed2bc5c0de0493c04f6c06de223170bd0798ff98ced/rpds_py-0.30.0-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:692bef75a5525db97318e8cd061542b5a79812d711ea03dbc1f6f8dbb0c5f0d2", size = 414589, upload-time = "2025-11-30T20:22:48.872Z" }, + { url = "https://files.pythonhosted.org/packages/b7/de/f7192e12b21b9e9a68a6d0f249b4af3fdcdff8418be0767a627564afa1f1/rpds_py-0.30.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9027da1ce107104c50c81383cae773ef5c24d296dd11c99e2629dbd7967a20c6", size = 394025, upload-time = "2025-11-30T20:22:50.196Z" }, + { url = "https://files.pythonhosted.org/packages/91/c4/fc70cd0249496493500e7cc2de87504f5aa6509de1e88623431fec76d4b6/rpds_py-0.30.0-cp313-cp313-manylinux_2_31_riscv64.whl", hash = "sha256:9cf69cdda1f5968a30a359aba2f7f9aa648a9ce4b580d6826437f2b291cfc86e", size = 408895, upload-time = "2025-11-30T20:22:51.87Z" }, + { url = 
"https://files.pythonhosted.org/packages/58/95/d9275b05ab96556fefff73a385813eb66032e4c99f411d0795372d9abcea/rpds_py-0.30.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:a4796a717bf12b9da9d3ad002519a86063dcac8988b030e405704ef7d74d2d9d", size = 422799, upload-time = "2025-11-30T20:22:53.341Z" }, + { url = "https://files.pythonhosted.org/packages/06/c1/3088fc04b6624eb12a57eb814f0d4997a44b0d208d6cace713033ff1a6ba/rpds_py-0.30.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:5d4c2aa7c50ad4728a094ebd5eb46c452e9cb7edbfdb18f9e1221f597a73e1e7", size = 572731, upload-time = "2025-11-30T20:22:54.778Z" }, + { url = "https://files.pythonhosted.org/packages/d8/42/c612a833183b39774e8ac8fecae81263a68b9583ee343db33ab571a7ce55/rpds_py-0.30.0-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:ba81a9203d07805435eb06f536d95a266c21e5b2dfbf6517748ca40c98d19e31", size = 599027, upload-time = "2025-11-30T20:22:56.212Z" }, + { url = "https://files.pythonhosted.org/packages/5f/60/525a50f45b01d70005403ae0e25f43c0384369ad24ffe46e8d9068b50086/rpds_py-0.30.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:945dccface01af02675628334f7cf49c2af4c1c904748efc5cf7bbdf0b579f95", size = 563020, upload-time = "2025-11-30T20:22:58.2Z" }, + { url = "https://files.pythonhosted.org/packages/0b/5d/47c4655e9bcd5ca907148535c10e7d489044243cc9941c16ed7cd53be91d/rpds_py-0.30.0-cp313-cp313-win32.whl", hash = "sha256:b40fb160a2db369a194cb27943582b38f79fc4887291417685f3ad693c5a1d5d", size = 223139, upload-time = "2025-11-30T20:23:00.209Z" }, + { url = "https://files.pythonhosted.org/packages/f2/e1/485132437d20aa4d3e1d8b3fb5a5e65aa8139f1e097080c2a8443201742c/rpds_py-0.30.0-cp313-cp313-win_amd64.whl", hash = "sha256:806f36b1b605e2d6a72716f321f20036b9489d29c51c91f4dd29a3e3afb73b15", size = 240224, upload-time = "2025-11-30T20:23:02.008Z" }, + { url = 
"https://files.pythonhosted.org/packages/24/95/ffd128ed1146a153d928617b0ef673960130be0009c77d8fbf0abe306713/rpds_py-0.30.0-cp313-cp313-win_arm64.whl", hash = "sha256:d96c2086587c7c30d44f31f42eae4eac89b60dabbac18c7669be3700f13c3ce1", size = 230645, upload-time = "2025-11-30T20:23:03.43Z" }, + { url = "https://files.pythonhosted.org/packages/ff/1b/b10de890a0def2a319a2626334a7f0ae388215eb60914dbac8a3bae54435/rpds_py-0.30.0-cp313-cp313t-macosx_10_12_x86_64.whl", hash = "sha256:eb0b93f2e5c2189ee831ee43f156ed34e2a89a78a66b98cadad955972548be5a", size = 364443, upload-time = "2025-11-30T20:23:04.878Z" }, + { url = "https://files.pythonhosted.org/packages/0d/bf/27e39f5971dc4f305a4fb9c672ca06f290f7c4e261c568f3dea16a410d47/rpds_py-0.30.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:922e10f31f303c7c920da8981051ff6d8c1a56207dbdf330d9047f6d30b70e5e", size = 353375, upload-time = "2025-11-30T20:23:06.342Z" }, + { url = "https://files.pythonhosted.org/packages/40/58/442ada3bba6e8e6615fc00483135c14a7538d2ffac30e2d933ccf6852232/rpds_py-0.30.0-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:cdc62c8286ba9bf7f47befdcea13ea0e26bf294bda99758fd90535cbaf408000", size = 383850, upload-time = "2025-11-30T20:23:07.825Z" }, + { url = "https://files.pythonhosted.org/packages/14/14/f59b0127409a33c6ef6f5c1ebd5ad8e32d7861c9c7adfa9a624fc3889f6c/rpds_py-0.30.0-cp313-cp313t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:47f9a91efc418b54fb8190a6b4aa7813a23fb79c51f4bb84e418f5476c38b8db", size = 392812, upload-time = "2025-11-30T20:23:09.228Z" }, + { url = "https://files.pythonhosted.org/packages/b3/66/e0be3e162ac299b3a22527e8913767d869e6cc75c46bd844aa43fb81ab62/rpds_py-0.30.0-cp313-cp313t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:1f3587eb9b17f3789ad50824084fa6f81921bbf9a795826570bda82cb3ed91f2", size = 517841, upload-time = "2025-11-30T20:23:11.186Z" }, + { url = 
"https://files.pythonhosted.org/packages/3d/55/fa3b9cf31d0c963ecf1ba777f7cf4b2a2c976795ac430d24a1f43d25a6ba/rpds_py-0.30.0-cp313-cp313t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:39c02563fc592411c2c61d26b6c5fe1e51eaa44a75aa2c8735ca88b0d9599daa", size = 408149, upload-time = "2025-11-30T20:23:12.864Z" }, + { url = "https://files.pythonhosted.org/packages/60/ca/780cf3b1a32b18c0f05c441958d3758f02544f1d613abf9488cd78876378/rpds_py-0.30.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:51a1234d8febafdfd33a42d97da7a43f5dcb120c1060e352a3fbc0c6d36e2083", size = 383843, upload-time = "2025-11-30T20:23:14.638Z" }, + { url = "https://files.pythonhosted.org/packages/82/86/d5f2e04f2aa6247c613da0c1dd87fcd08fa17107e858193566048a1e2f0a/rpds_py-0.30.0-cp313-cp313t-manylinux_2_31_riscv64.whl", hash = "sha256:eb2c4071ab598733724c08221091e8d80e89064cd472819285a9ab0f24bcedb9", size = 396507, upload-time = "2025-11-30T20:23:16.105Z" }, + { url = "https://files.pythonhosted.org/packages/4b/9a/453255d2f769fe44e07ea9785c8347edaf867f7026872e76c1ad9f7bed92/rpds_py-0.30.0-cp313-cp313t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:6bdfdb946967d816e6adf9a3d8201bfad269c67efe6cefd7093ef959683c8de0", size = 414949, upload-time = "2025-11-30T20:23:17.539Z" }, + { url = "https://files.pythonhosted.org/packages/a3/31/622a86cdc0c45d6df0e9ccb6becdba5074735e7033c20e401a6d9d0e2ca0/rpds_py-0.30.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:c77afbd5f5250bf27bf516c7c4a016813eb2d3e116139aed0096940c5982da94", size = 565790, upload-time = "2025-11-30T20:23:19.029Z" }, + { url = "https://files.pythonhosted.org/packages/1c/5d/15bbf0fb4a3f58a3b1c67855ec1efcc4ceaef4e86644665fff03e1b66d8d/rpds_py-0.30.0-cp313-cp313t-musllinux_1_2_i686.whl", hash = "sha256:61046904275472a76c8c90c9ccee9013d70a6d0f73eecefd38c1ae7c39045a08", size = 590217, upload-time = "2025-11-30T20:23:20.885Z" }, + { url = 
"https://files.pythonhosted.org/packages/6d/61/21b8c41f68e60c8cc3b2e25644f0e3681926020f11d06ab0b78e3c6bbff1/rpds_py-0.30.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:4c5f36a861bc4b7da6516dbdf302c55313afa09b81931e8280361a4f6c9a2d27", size = 555806, upload-time = "2025-11-30T20:23:22.488Z" }, + { url = "https://files.pythonhosted.org/packages/f9/39/7e067bb06c31de48de3eb200f9fc7c58982a4d3db44b07e73963e10d3be9/rpds_py-0.30.0-cp313-cp313t-win32.whl", hash = "sha256:3d4a69de7a3e50ffc214ae16d79d8fbb0922972da0356dcf4d0fdca2878559c6", size = 211341, upload-time = "2025-11-30T20:23:24.449Z" }, + { url = "https://files.pythonhosted.org/packages/0a/4d/222ef0b46443cf4cf46764d9c630f3fe4abaa7245be9417e56e9f52b8f65/rpds_py-0.30.0-cp313-cp313t-win_amd64.whl", hash = "sha256:f14fc5df50a716f7ece6a80b6c78bb35ea2ca47c499e422aa4463455dd96d56d", size = 225768, upload-time = "2025-11-30T20:23:25.908Z" }, + { url = "https://files.pythonhosted.org/packages/86/81/dad16382ebbd3d0e0328776d8fd7ca94220e4fa0798d1dc5e7da48cb3201/rpds_py-0.30.0-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:68f19c879420aa08f61203801423f6cd5ac5f0ac4ac82a2368a9fcd6a9a075e0", size = 362099, upload-time = "2025-11-30T20:23:27.316Z" }, + { url = "https://files.pythonhosted.org/packages/2b/60/19f7884db5d5603edf3c6bce35408f45ad3e97e10007df0e17dd57af18f8/rpds_py-0.30.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:ec7c4490c672c1a0389d319b3a9cfcd098dcdc4783991553c332a15acf7249be", size = 353192, upload-time = "2025-11-30T20:23:29.151Z" }, + { url = "https://files.pythonhosted.org/packages/bf/c4/76eb0e1e72d1a9c4703c69607cec123c29028bff28ce41588792417098ac/rpds_py-0.30.0-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f251c812357a3fed308d684a5079ddfb9d933860fc6de89f2b7ab00da481e65f", size = 384080, upload-time = "2025-11-30T20:23:30.785Z" }, + { url = 
"https://files.pythonhosted.org/packages/72/87/87ea665e92f3298d1b26d78814721dc39ed8d2c74b86e83348d6b48a6f31/rpds_py-0.30.0-cp314-cp314-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:ac98b175585ecf4c0348fd7b29c3864bda53b805c773cbf7bfdaffc8070c976f", size = 394841, upload-time = "2025-11-30T20:23:32.209Z" }, + { url = "https://files.pythonhosted.org/packages/77/ad/7783a89ca0587c15dcbf139b4a8364a872a25f861bdb88ed99f9b0dec985/rpds_py-0.30.0-cp314-cp314-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:3e62880792319dbeb7eb866547f2e35973289e7d5696c6e295476448f5b63c87", size = 516670, upload-time = "2025-11-30T20:23:33.742Z" }, + { url = "https://files.pythonhosted.org/packages/5b/3c/2882bdac942bd2172f3da574eab16f309ae10a3925644e969536553cb4ee/rpds_py-0.30.0-cp314-cp314-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:4e7fc54e0900ab35d041b0601431b0a0eb495f0851a0639b6ef90f7741b39a18", size = 408005, upload-time = "2025-11-30T20:23:35.253Z" }, + { url = "https://files.pythonhosted.org/packages/ce/81/9a91c0111ce1758c92516a3e44776920b579d9a7c09b2b06b642d4de3f0f/rpds_py-0.30.0-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:47e77dc9822d3ad616c3d5759ea5631a75e5809d5a28707744ef79d7a1bcfcad", size = 382112, upload-time = "2025-11-30T20:23:36.842Z" }, + { url = "https://files.pythonhosted.org/packages/cf/8e/1da49d4a107027e5fbc64daeab96a0706361a2918da10cb41769244b805d/rpds_py-0.30.0-cp314-cp314-manylinux_2_31_riscv64.whl", hash = "sha256:b4dc1a6ff022ff85ecafef7979a2c6eb423430e05f1165d6688234e62ba99a07", size = 399049, upload-time = "2025-11-30T20:23:38.343Z" }, + { url = "https://files.pythonhosted.org/packages/df/5a/7ee239b1aa48a127570ec03becbb29c9d5a9eb092febbd1699d567cae859/rpds_py-0.30.0-cp314-cp314-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:4559c972db3a360808309e06a74628b95eaccbf961c335c8fe0d590cf587456f", size = 415661, upload-time = "2025-11-30T20:23:40.263Z" }, + { url = 
"https://files.pythonhosted.org/packages/70/ea/caa143cf6b772f823bc7929a45da1fa83569ee49b11d18d0ada7f5ee6fd6/rpds_py-0.30.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:0ed177ed9bded28f8deb6ab40c183cd1192aa0de40c12f38be4d59cd33cb5c65", size = 565606, upload-time = "2025-11-30T20:23:42.186Z" }, + { url = "https://files.pythonhosted.org/packages/64/91/ac20ba2d69303f961ad8cf55bf7dbdb4763f627291ba3d0d7d67333cced9/rpds_py-0.30.0-cp314-cp314-musllinux_1_2_i686.whl", hash = "sha256:ad1fa8db769b76ea911cb4e10f049d80bf518c104f15b3edb2371cc65375c46f", size = 591126, upload-time = "2025-11-30T20:23:44.086Z" }, + { url = "https://files.pythonhosted.org/packages/21/20/7ff5f3c8b00c8a95f75985128c26ba44503fb35b8e0259d812766ea966c7/rpds_py-0.30.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:46e83c697b1f1c72b50e5ee5adb4353eef7406fb3f2043d64c33f20ad1c2fc53", size = 553371, upload-time = "2025-11-30T20:23:46.004Z" }, + { url = "https://files.pythonhosted.org/packages/72/c7/81dadd7b27c8ee391c132a6b192111ca58d866577ce2d9b0ca157552cce0/rpds_py-0.30.0-cp314-cp314-win32.whl", hash = "sha256:ee454b2a007d57363c2dfd5b6ca4a5d7e2c518938f8ed3b706e37e5d470801ed", size = 215298, upload-time = "2025-11-30T20:23:47.696Z" }, + { url = "https://files.pythonhosted.org/packages/3e/d2/1aaac33287e8cfb07aab2e6b8ac1deca62f6f65411344f1433c55e6f3eb8/rpds_py-0.30.0-cp314-cp314-win_amd64.whl", hash = "sha256:95f0802447ac2d10bcc69f6dc28fe95fdf17940367b21d34e34c737870758950", size = 228604, upload-time = "2025-11-30T20:23:49.501Z" }, + { url = "https://files.pythonhosted.org/packages/e8/95/ab005315818cc519ad074cb7784dae60d939163108bd2b394e60dc7b5461/rpds_py-0.30.0-cp314-cp314-win_arm64.whl", hash = "sha256:613aa4771c99f03346e54c3f038e4cc574ac09a3ddfb0e8878487335e96dead6", size = 222391, upload-time = "2025-11-30T20:23:50.96Z" }, + { url = 
"https://files.pythonhosted.org/packages/9e/68/154fe0194d83b973cdedcdcc88947a2752411165930182ae41d983dcefa6/rpds_py-0.30.0-cp314-cp314t-macosx_10_12_x86_64.whl", hash = "sha256:7e6ecfcb62edfd632e56983964e6884851786443739dbfe3582947e87274f7cb", size = 364868, upload-time = "2025-11-30T20:23:52.494Z" }, + { url = "https://files.pythonhosted.org/packages/83/69/8bbc8b07ec854d92a8b75668c24d2abcb1719ebf890f5604c61c9369a16f/rpds_py-0.30.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:a1d0bc22a7cdc173fedebb73ef81e07faef93692b8c1ad3733b67e31e1b6e1b8", size = 353747, upload-time = "2025-11-30T20:23:54.036Z" }, + { url = "https://files.pythonhosted.org/packages/ab/00/ba2e50183dbd9abcce9497fa5149c62b4ff3e22d338a30d690f9af970561/rpds_py-0.30.0-cp314-cp314t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0d08f00679177226c4cb8c5265012eea897c8ca3b93f429e546600c971bcbae7", size = 383795, upload-time = "2025-11-30T20:23:55.556Z" }, + { url = "https://files.pythonhosted.org/packages/05/6f/86f0272b84926bcb0e4c972262f54223e8ecc556b3224d281e6598fc9268/rpds_py-0.30.0-cp314-cp314t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:5965af57d5848192c13534f90f9dd16464f3c37aaf166cc1da1cae1fd5a34898", size = 393330, upload-time = "2025-11-30T20:23:57.033Z" }, + { url = "https://files.pythonhosted.org/packages/cb/e9/0e02bb2e6dc63d212641da45df2b0bf29699d01715913e0d0f017ee29438/rpds_py-0.30.0-cp314-cp314t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:9a4e86e34e9ab6b667c27f3211ca48f73dba7cd3d90f8d5b11be56e5dbc3fb4e", size = 518194, upload-time = "2025-11-30T20:23:58.637Z" }, + { url = "https://files.pythonhosted.org/packages/ee/ca/be7bca14cf21513bdf9c0606aba17d1f389ea2b6987035eb4f62bd923f25/rpds_py-0.30.0-cp314-cp314t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:e5d3e6b26f2c785d65cc25ef1e5267ccbe1b069c5c21b8cc724efee290554419", size = 408340, upload-time = "2025-11-30T20:24:00.2Z" }, + { url = 
"https://files.pythonhosted.org/packages/c2/c7/736e00ebf39ed81d75544c0da6ef7b0998f8201b369acf842f9a90dc8fce/rpds_py-0.30.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:626a7433c34566535b6e56a1b39a7b17ba961e97ce3b80ec62e6f1312c025551", size = 383765, upload-time = "2025-11-30T20:24:01.759Z" }, + { url = "https://files.pythonhosted.org/packages/4a/3f/da50dfde9956aaf365c4adc9533b100008ed31aea635f2b8d7b627e25b49/rpds_py-0.30.0-cp314-cp314t-manylinux_2_31_riscv64.whl", hash = "sha256:acd7eb3f4471577b9b5a41baf02a978e8bdeb08b4b355273994f8b87032000a8", size = 396834, upload-time = "2025-11-30T20:24:03.687Z" }, + { url = "https://files.pythonhosted.org/packages/4e/00/34bcc2565b6020eab2623349efbdec810676ad571995911f1abdae62a3a0/rpds_py-0.30.0-cp314-cp314t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:fe5fa731a1fa8a0a56b0977413f8cacac1768dad38d16b3a296712709476fbd5", size = 415470, upload-time = "2025-11-30T20:24:05.232Z" }, + { url = "https://files.pythonhosted.org/packages/8c/28/882e72b5b3e6f718d5453bd4d0d9cf8df36fddeb4ddbbab17869d5868616/rpds_py-0.30.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:74a3243a411126362712ee1524dfc90c650a503502f135d54d1b352bd01f2404", size = 565630, upload-time = "2025-11-30T20:24:06.878Z" }, + { url = "https://files.pythonhosted.org/packages/3b/97/04a65539c17692de5b85c6e293520fd01317fd878ea1995f0367d4532fb1/rpds_py-0.30.0-cp314-cp314t-musllinux_1_2_i686.whl", hash = "sha256:3e8eeb0544f2eb0d2581774be4c3410356eba189529a6b3e36bbbf9696175856", size = 591148, upload-time = "2025-11-30T20:24:08.445Z" }, + { url = "https://files.pythonhosted.org/packages/85/70/92482ccffb96f5441aab93e26c4d66489eb599efdcf96fad90c14bbfb976/rpds_py-0.30.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:dbd936cde57abfee19ab3213cf9c26be06d60750e60a8e4dd85d1ab12c8b1f40", size = 556030, upload-time = "2025-11-30T20:24:10.956Z" }, + { url = 
"https://files.pythonhosted.org/packages/20/53/7c7e784abfa500a2b6b583b147ee4bb5a2b3747a9166bab52fec4b5b5e7d/rpds_py-0.30.0-cp314-cp314t-win32.whl", hash = "sha256:dc824125c72246d924f7f796b4f63c1e9dc810c7d9e2355864b3c3a73d59ade0", size = 211570, upload-time = "2025-11-30T20:24:12.735Z" }, + { url = "https://files.pythonhosted.org/packages/d0/02/fa464cdfbe6b26e0600b62c528b72d8608f5cc49f96b8d6e38c95d60c676/rpds_py-0.30.0-cp314-cp314t-win_amd64.whl", hash = "sha256:27f4b0e92de5bfbc6f86e43959e6edd1425c33b5e69aab0984a72047f2bcf1e3", size = 226532, upload-time = "2025-11-30T20:24:14.634Z" }, +] + [[package]] name = "ruff" version = "0.15.8" diff --git a/src/tests/ContentProcessor/pipeline/test_schema.py b/src/tests/ContentProcessor/pipeline/test_schema.py index e5c18ef1..bbdb46b6 100644 --- a/src/tests/ContentProcessor/pipeline/test_schema.py +++ b/src/tests/ContentProcessor/pipeline/test_schema.py @@ -22,8 +22,8 @@ def test_construction(self): Id="s-1", ClassName="InvoiceSchema", Description="Invoice extraction", - FileName="invoice_schema.py", - ContentType="application/pdf", + FileName="invoice_schema.json", + ContentType="application/json", ) assert schema.Id == "s-1" assert schema.ClassName == "InvoiceSchema" @@ -46,8 +46,8 @@ def test_get_schema_returns_schema(self, mock_helper_cls): "Id": "s-1", "ClassName": "MySchema", "Description": "desc", - "FileName": "file.py", - "ContentType": "text/plain", + "FileName": "file.json", + "ContentType": "application/json", } ] result = Schema.get_schema("connstr", "db", "coll", "s-1")