- Package Name: azure-ai-evaluation
- Package Version: 1.16.5
- Operating System: macOS
- Python Version: Python 3.12.13
Describe the bug
Azure AI Evaluation's RedTeam.scan() method decodes all encoded attack prompts before storing them in evaluation_results.json and results.json, regardless of encoding strategy (flip, base64, morse, etc.). This makes it impossible to verify what the target agent actually received. The attack_technique metadata is set correctly, but the conversation payloads in attack_details[].conversation are decoded back to their original form, so the results no longer reflect the actual attack surface.
Screenshot
Captured prompts received by target: flip-encoded - entries 3 and 4
corresponding evaluation_results.json
To Reproduce
Steps to reproduce the behavior:
- Create a PyRIT RedTeam scan targeting any remote agent/endpoint with multiple encoding-based attack strategies (e.g., [AttackStrategy.Flip, AttackStrategy.Base64, AttackStrategy.Morse])
- Implement a custom target callback that logs/captures the raw prompts it receives before processing
- Run the scan:
await red_team.scan(target=callback, attack_strategies=[AttackStrategy.Flip, ...])
- Save the raw received prompts to a reference file (e.g., JSONL)
- Compare the raw prompts in your reference file with evaluation_results.json[attack_details][].conversation[].content
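The capture steps above can be sketched roughly as follows. This is a minimal illustration, not the exact callback used for this report: it assumes the scan accepts a simple `str -> str` callable as `target`, and the file name `received_prompts.jsonl`, the `received` key, and both helper names are hypothetical.

```python
import json
from pathlib import Path

# Hypothetical reference file; any writable path works.
CAPTURE_PATH = Path("received_prompts.jsonl")


def make_capturing_callback(capture_path: Path = CAPTURE_PATH):
    """Return a str -> str target that appends every raw prompt to a JSONL file."""

    def callback(query: str) -> str:
        # Record the prompt exactly as received, before any processing.
        with capture_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps({"received": query}, ensure_ascii=False) + "\n")
        # Placeholder reply; a real target would forward the query to the agent.
        return "I cannot help with that request."

    return callback


def load_captured(capture_path: Path = CAPTURE_PATH) -> list[str]:
    """Read the captured prompts back in the order they arrived."""
    with capture_path.open(encoding="utf-8") as fh:
        return [json.loads(line)["received"] for line in fh if line.strip()]
```

The callable returned by `make_capturing_callback()` would be passed as `target` in the `red_team.scan(...)` call, and `load_captured()` then yields the reference list to compare against the results file.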
Expected behavior
- If a flip-encoded prompt was sent to the target: "...edocne txet si siht", it should appear in evaluation results as flip-encoded
- If a base64-encoded prompt was sent: "...aW52YWxpZCBiYXNlNjQ=", it should appear as base64-encoded
- If a morse-encoded prompt was sent: ".... . .-.. .-.. ---", it should appear in morse
- The attack_details[].conversation should contain the exact payloads that were sent to the target
- Attack technique metadata should correlate with the actual prompt encoding in the conversation
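The expectation above amounts to a simple check: every prompt the target received should appear verbatim somewhere in the stored conversations. The sketch below expresses that under the assumption that the results file is a JSON object with the attack_details[].conversation[].content layout referenced in this report; the function names are hypothetical, and `captured` is whatever list of raw prompts was logged at the target.

```python
def collect_conversation_contents(results: dict) -> set[str]:
    """Gather every conversation[].content string under attack_details[]."""
    contents: set[str] = set()
    for detail in results.get("attack_details", []):
        for message in detail.get("conversation", []):
            content = message.get("content")
            if isinstance(content, str):
                contents.add(content)
    return contents


def missing_from_results(captured: list[str], results: dict) -> list[str]:
    """Return the captured prompts that do not appear verbatim in the results."""
    stored = collect_conversation_contents(results)
    return [prompt for prompt in captured if prompt not in stored]
```

With `results` loaded via `json.load` from evaluation_results.json, an empty list from `missing_from_results` would mean the stored conversations match what the target saw; with the behavior described in this report, the encoded prompts come back as missing.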
Current behavior
- All encoding strategies (flip, base64, morse, etc.) are decoded back to plaintext in evaluation_results.json
- The raw encoded prompts appear correctly when logged at the target callback level
- Same prompts appear decoded to plaintext in evaluation_results.json[attack_details][].conversation
- The attack_technique labels are correct (flip, base64, morse), but the conversation content doesn't match the encoding
- This creates an audit/transparency gap: you cannot verify which variant the agent actually saw
Impact
- Impossible to audit/verify the attack surface post-scan
- Cannot debug why specific encoding variants succeeded/failed
- Cannot correlate target responses to the exact encoded prompts they received
- Breaks reproducibility of attack attempts (the encoding named in the metadata doesn't match the prompt stored in the results)
Additional context
- Issue occurs across all encoding attack strategies, not limited to specific ones
- Baseline (non-encoded) prompts remain unchanged (as expected)
- Only encoded prompts are affected; all are normalized back to plaintext before storage in the results JSON