Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 9 additions & 3 deletions capabilities/ai-red-teaming/agents/ai-red-teaming-agent.md
Original file line number Diff line number Diff line change
Expand Up @@ -103,7 +103,7 @@ Complete requests that don't need clarification:
5. Never report failure without first running diagnostic tools AND asking for clarification.

**Parameter handling:**
- When user specifies transforms (e.g. "using 3 transforms", "with base64, caesar, authority"), ALWAYS set `compare_transforms=true`. This produces N+1 runs (baseline + each transform). Set `compare_transforms=false` ONLY if the user explicitly says "bundle transforms" or "apply all together".
- When user specifies transforms (e.g. "with base64", "using Telugu"), set `compare_transforms=false` by default. The transforms are applied to the attack — no baseline run is added. Only set `compare_transforms=true` if the user explicitly asks to "compare transforms", "benchmark transforms", "compare against baseline", or uses similar comparison language. `compare_transforms=true` creates N+1 runs (1 baseline without transforms + 1 per transform individually).
- "max trials N", "N trials", "max_trials N", or "iterations N" → set `n_iterations=N`.
- Pass the user's model name as `target_model` verbatim. The tool resolves aliases internally.

Expand Down Expand Up @@ -406,11 +406,17 @@ Use `generate_image_attack` when the user wants to attack a traditional ML model
User: "run TAP on gpt-4o, goal: extract system prompt"
→ `generate_attack(attack_type="tap", goal="extract system prompt", target_model="gpt-4o")`

### Transform Comparison
### Transform (applied, no baseline)

User: "run TAP with base64 and caesar transforms"
→ `generate_attack(attack_type="tap", ..., transforms=["base64","caesar"], compare_transforms=false)`
Transforms are applied to the attack. No baseline run.

### Transform Comparison (explicit)

User: "run TAP with base64, caesar, authority transforms, compare them"
→ `generate_attack(attack_type="tap", ..., transforms=["base64","caesar","authority"], compare_transforms=true)`
This generates N+1 runs: 1 baseline + 1 per transform.
This generates N+1 runs: 1 baseline + 1 per transform. Only when user asks to "compare".

### Campaign (multiple attacks)

Expand Down
2 changes: 1 addition & 1 deletion capabilities/ai-red-teaming/capability.yaml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
schema: 1
name: ai-red-teaming
version: "1.3.2"
version: "1.3.5"
description: >
Probe the security and safety of AI applications, agents, and foundation models.
Orchestrates adversarial attack workflows to discover vulnerabilities in LLMs,
Expand Down
30 changes: 30 additions & 0 deletions capabilities/ai-red-teaming/scripts/attack_runner.py
Original file line number Diff line number Diff line change
Expand Up @@ -3059,6 +3059,16 @@ async def main():
print(f"--- end {{label}} ---")
sys.stdout.flush()

# Flush OTEL spans between studies so each study's traces
# are exported to the platform before starting the next one.
try:
from dreadnode.app.main import DEFAULT_INSTANCE
_provider = DEFAULT_INSTANCE._logfire._tracer_provider
if hasattr(_provider, 'force_flush'):
_provider.force_flush(timeout_millis=10_000)
except Exception:
pass

except Exception as e:
print(f"\\nERROR in study '{{label}}': {{e}}")
traceback.print_exc()
Expand Down Expand Up @@ -3144,6 +3154,16 @@ async def main():
await assessment.run(_{var}_study)
print(f"{canon} completed successfully")
sys.stdout.flush()

# Flush OTEL spans between attacks so each attack's traces
# are exported to the platform before starting the next one.
try:
from dreadnode.app.main import DEFAULT_INSTANCE
_provider = DEFAULT_INSTANCE._logfire._tracer_provider
if hasattr(_provider, 'force_flush'):
_provider.force_flush(timeout_millis=10_000)
except Exception:
pass
except Exception as e:
print(f"\\nERROR in {canon}: {{e}}")
traceback.print_exc()
Expand Down Expand Up @@ -3420,6 +3440,16 @@ async def main():
print(f"completed")
sys.stdout.flush()

# Flush OTEL spans between goals so each goal's traces
# are exported to the platform before starting the next one.
try:
from dreadnode.app.main import DEFAULT_INSTANCE
_provider = DEFAULT_INSTANCE._logfire._tracer_provider
if hasattr(_provider, 'force_flush'):
_provider.force_flush(timeout_millis=10_000)
except Exception:
pass

except Exception as e:
print(f"ERROR: {{e}}")
traceback.print_exc()
Expand Down
Loading