Azure AI Evaluation's RedTeam.scan() method decodes **all** encoded attack prompts when storing the result files #47228

@hante-sonova

  • Package Name: azure-ai-evaluation
  • Package Version: 1.16.5
  • Operating System: MacOS
  • Python Version: Python 3.12.13

Describe the bug
Azure AI Evaluation's RedTeam.scan() method decodes all encoded attack prompts before storing them in evaluation_results.json and results.json, regardless of encoding strategy (flip, base64, morse, etc.). This makes it impossible to verify what the target agent actually received. The attack_technique metadata is correctly set, but the conversation payloads in attack_details[].conversation are decoded back to their original form, losing fidelity regarding the actual attack surface.

Screenshot
Captured prompts received by target: flip-encoded - entries 3 and 4

Image

corresponding evaluation_results.json

Image

To Reproduce
Steps to reproduce the behavior:

  1. Create a PyRIT RedTeam scan targeting any remote agent/endpoint with multiple encoding-based attack strategies (e.g., [AttackStrategy.Flip, AttackStrategy.Base64, AttackStrategy.Morse])
  2. Implement a custom target callback that logs/captures the raw prompts it receives before processing
  3. Run the scan: await red_team.scan(target=callback, attack_strategies=[AttackStrategy.Flip, ...])
  4. Save the raw received prompts to a reference file (e.g., JSONL)
  5. Compare the raw prompts in your reference file with evaluation_results.json[attack_details][].conversation[].content
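Steps 2, 4 and 5 can be sketched roughly as follows. This is a minimal illustration rather than the exact code used for the report: the helper names (make_capturing_callback, load_captured, stored_user_prompts, missing_from_results) are hypothetical, the callback is assumed to take the prompt as a single string and return a string, and each conversation turn is assumed to carry a "role" field next to "content". Only the attack_details[].conversation[].content path comes from the results file described here.

```python
import json


def make_capturing_callback(log_path, respond):
    """Wrap a target function so each prompt it receives is appended to a JSONL file.

    `respond` stands in for the real agent call (hypothetical).
    """
    def callback(query: str) -> str:
        # Log the prompt exactly as received, before any processing.
        with open(log_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps({"prompt": query}, ensure_ascii=False) + "\n")
        return respond(query)
    return callback


def load_captured(log_path):
    """Read the prompts back from the JSONL reference file."""
    with open(log_path, encoding="utf-8") as fh:
        return [json.loads(line)["prompt"] for line in fh if line.strip()]


def stored_user_prompts(results):
    """Collect user-turn content from attack_details[].conversation[].

    Assumes each turn is a dict with "role" and "content" keys.
    """
    prompts = []
    for detail in results.get("attack_details", []):
        for turn in detail.get("conversation", []):
            if turn.get("role") == "user":
                prompts.append(turn.get("content"))
    return prompts


def missing_from_results(captured, results):
    """Return captured prompts that do not appear verbatim in the results."""
    stored = set(stored_user_prompts(results))
    return [p for p in captured if p not in stored]
```

With the behaviour described in this issue, missing_from_results would return every encoded prompt from the reference file, because the results file holds only the decoded plaintext.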

Expected behavior

  • If a flip-encoded prompt was sent to the target: "...edocne txet si siht", it should appear in evaluation results as flip-encoded
  • If a base64-encoded prompt was sent: "...aW52YWxpZCBiYXNlNjQ=", it should appear as base64-encoded
  • If a morse-encoded prompt was sent: ".... . .-.. .-.. ---", it should appear in morse
  • The attack_details[].conversation should contain the exact payloads that were sent to the target
  • Attack technique metadata should correlate with the actual prompt encoding in the conversation
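The correlation between the attack_technique label and the stored content could be checked along these lines. This is a sketch under assumptions: the encoders below are plain stand-ins (string reversal, standard base64, international Morse with " / " between words) and may not match the converters the SDK actually applies byte for byte; content_matches_technique is a hypothetical helper, not part of azure-ai-evaluation.

```python
import base64

# International Morse code for the letters a-z only (digits and punctuation omitted).
MORSE = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".", "f": "..-.",
    "g": "--.", "h": "....", "i": "..", "j": ".---", "k": "-.-", "l": ".-..",
    "m": "--", "n": "-.", "o": "---", "p": ".--.", "q": "--.-", "r": ".-.",
    "s": "...", "t": "-", "u": "..-", "v": "...-", "w": ".--", "x": "-..-",
    "y": "-.--", "z": "--..",
}


def flip(text: str) -> str:
    """Reverse the whole string."""
    return text[::-1]


def to_base64(text: str) -> str:
    """Standard base64 of the UTF-8 bytes."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def to_morse(text: str) -> str:
    """Letters separated by spaces, words by ' / ' (separator is an assumption)."""
    words = text.lower().split()
    return " / ".join(
        " ".join(MORSE[ch] for ch in word if ch in MORSE) for word in words
    )


ENCODERS = {"flip": flip, "base64": to_base64, "morse": to_morse}


def content_matches_technique(stored: str, plaintext: str, technique: str) -> bool:
    """True if the stored content equals the plaintext encoded with the labelled technique."""
    return stored == ENCODERS[technique](plaintext)
```

For example, content_matches_technique("hello", "hello", "morse") is False, which is the mismatch this issue describes: the label says morse while the stored content is plaintext.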

Current behavior

  • All encoding strategies (flip, base64, morse, etc.) are decoded back to plaintext in evaluation_results.json
  • The raw encoded prompts appear correctly when logged at the target callback level
  • Same prompts appear decoded to plaintext in evaluation_results.json[attack_details][].conversation
  • The attack_technique labels are correct (flip, base64, morse), but the conversation content doesn't match the encoding
  • This creates an audit/transparency gap: you cannot verify which variant the agent actually saw

Impact

  • Impossible to audit/verify the attack surface post-scan
  • Cannot debug why specific encoding variants succeeded/failed
  • Cannot correlate target responses to the exact encoded prompts they received
  • Breaks reproducibility of attack attempts (encoded prompt in metadata doesn't match actual prompt in results)

Additional context

  • Issue occurs across all encoding attack strategies, not limited to specific ones
  • Baseline (non-encoded) prompts remain unchanged (as expected)
  • Only encoded prompts are affected—all are normalized back to plaintext before storage in results JSON

Labels: customer-reported, needs-triage, question