185 changes: 185 additions & 0 deletions examples/harness/README.md
@@ -0,0 +1,185 @@
# Harness modules: ContextEngine + ResultVerifier

This example shows how to compose two Harness modules with a veADK Agent for
context engineering, evidence tracking, and final-answer verification.

- `ContextEngine` pins the original task, filters noisy history, assembles an
evidence-first context header, and records a small budget report.
- `ResultVerifier` records tool receipts, gathers evidence references, checks
final answers for unsupported URLs and ungrounded external facts, and writes a
local verification report.
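
The `ContextEngine` steps listed above can be sketched roughly as follows. This is a minimal illustration under an assumed message schema: the `Message` type, its `kind` tags, and `build_context` are hypothetical names, not the module's actual API.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str  # "user", "assistant", or "tool"
    kind: str  # hypothetical tag: "task", "progress", "control", "answer"
    text: str


# Message kinds treated as noise in this sketch (an assumption).
NOISY_KINDS = {"progress", "control"}


def build_context(original_task, history, evidence, max_chars=2000):
    """Return (context_text, budget_report) with evidence placed before history."""
    kept = [m for m in history if m.kind not in NOISY_KINDS]
    sections = [
        "## Original task\n" + original_task,
        "## Evidence\n" + "\n".join(f"- {e}" for e in evidence),
        "## History\n" + "\n".join(f"{m.role}: {m.text}" for m in kept),
    ]
    # Crude character budget: anything past max_chars is cut off.
    context = "\n\n".join(sections)[:max_chars]
    report = {
        "history_total": len(history),
        "history_kept": len(kept),
        "chars_used": len(context),
        "chars_budget": max_chars,
    }
    return context, report


history = [
    Message("user", "task", "Summarize the release notes."),
    Message("assistant", "progress", "Searching..."),
    Message("assistant", "answer", "Draft summary v1."),
    Message("user", "control", "/retry"),
]
context, report = build_context(
    "Summarize the release notes.", history, ["ev-1: release_notes.md"]
)
print(context)
print(report)
```

Putting the task and evidence sections first means a tight budget truncates old history before it truncates the task anchor.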

The implementation is self-contained so developers can read, run, test, and
adapt the pattern in one directory.

## Layout

```text
examples/harness/
├── main.py
├── harness_agent.py
├── harness_modules/
│   ├── core.py
│   ├── context_engine.py
│   ├── result_verifier.py
│   ├── tool_wrappers.py
│   └── stores.py
├── tests/
├── golden/
│   ├── production_scenarios.jsonl
│   ├── context_engine_cases.jsonl
│   └── verifier_cases.jsonl
└── evaluation/
    ├── run_eval.py
    └── run_model_eval.py
```

## Run

Configure the normal veADK model environment variables, then run:

```bash
python examples/harness/main.py
```

The run writes local audit data under `.harness_runs/`:

- `events.jsonl`
- `messages.jsonl`
- `receipts.jsonl`
- `evidence/*.txt`
- `reports/<session_id>-<run_id>.json`
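
As a rough sketch of how the JSONL audit files could be inspected after a run, the snippet below counts receipts by status. The `tool` and `status` field names are assumptions for illustration; check the actual records under `.harness_runs/` before relying on them.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Parse one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def count_by_status(receipts):
    # "status" is an assumed field name, not confirmed by the example.
    return Counter(r.get("status", "unknown") for r in receipts)


# Demo on a temporary file; after a real run, point load_jsonl at
# ".harness_runs/receipts.jsonl" instead.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "receipts.jsonl"
    sample.write_text(
        '{"tool": "web_search", "status": "ok"}\n'
        '{"tool": "fetch_page", "status": "failed"}\n',
        encoding="utf-8",
    )
    status_counts = count_by_status(load_jsonl(sample))

print(dict(status_counts))
```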

## Core usage

```python
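import asyncio

from harness_agent import build_harness_agent


async def main() -> None:
    bundle = build_harness_agent()
    answer = await bundle.run(
        "Look up the core capabilities of the veADK Harness example, cite your sources, and answer in 3 bullet points.",
        session_id="harness-demo",
    )
    report = bundle.latest_report(session_id="harness-demo")
    print(answer, report, sep="\n")


asyncio.run(main())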
```

`bundle.agent` and `bundle.runner` are regular veADK `Agent` and `Runner`
instances. The thin `bundle.run(...)` method coordinates `user_id`,
`session_id`, and `original_prompt` so the Harness processor can create local
receipts, evidence, context events, and verification reports.

## Test

The tests use fake tools and fake runner events, so no model key is needed:

```bash
pytest examples/harness/tests
```

The validation targets are:

- task anchor retention across follow-up turns;
- removal of progress and control messages from model context;
- deterministic detection of fabricated URLs;
- failure when a current/external factual answer has no evidence;
- receipt recording for failed tools;
- externalization of large tool results.
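
The fabricated-URL target above lends itself to a purely deterministic check. A minimal sketch, assuming the rule is "every URL in the answer must also appear in some evidence text"; `extract_urls` and `unsupported_urls` are hypothetical helpers, and the real `ResultVerifier` may apply different rules.

```python
import re

# Stop at whitespace and at characters that usually close a link in prose.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return URLs in text, with trailing sentence punctuation stripped."""
    return {match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)}


def unsupported_urls(answer, evidence_texts):
    """URLs cited in the answer that appear in no evidence text."""
    known = set()
    for text in evidence_texts:
        known |= extract_urls(text)
    return sorted(extract_urls(answer) - known)


evidence = ["Fetched https://example.com/docs/harness for the overview."]
answer = (
    "See https://example.com/docs/harness and "
    "https://example.com/blog/made-up-post."
)
missing = unsupported_urls(answer, evidence)
print(missing)
```

Because the check is plain set difference over extracted strings, it gives the same verdict on every run, which is what makes it usable in tests without a model key.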

The scenario-level golden set is
`examples/harness/golden/production_scenarios.jsonl`. It groups common
production cases by scenario and module, so developers can add new regression
cases without coupling them to a specific product incident or project-specific dataset.
The smaller `verifier_cases.jsonl` and `context_engine_cases.jsonl` files keep
module-focused golden checks.

## Evaluate the Harness lift

Run the offline A/B evaluation:

```bash
python examples/harness/evaluation/run_eval.py
```

The evaluation isolates deterministic Harness effects rather than model quality.
The baseline uses raw history and trusts every non-empty answer. The Harness
treatment uses `ContextEngine` plus `ResultVerifier`.

The case set covers common production-style developer scenarios: stale RAG
memory, failed tool receipts, permission over-blocking, runtime parameter drift,
and multi-turn context anchoring.

Current result:

| Metric | Baseline | Harness | Delta |
| --- | ---: | ---: | ---: |
| Result verifier accuracy | 20.0% | 100.0% | +80.0 pp |
| Unsafe false-accept rate | 100.0% | 0.0% | -100.0 pp |
| Unsafe detection recall | 0.0% | 100.0% | +100.0 pp |
| Context quality score | 0.0% | 100.0% | +100.0 pp |
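
To make the verifier metrics concrete, here is a sketch of how accuracy, false-accept rate, and detection recall can be computed from labeled cases. The five cases and their one-safe, four-unsafe split are an assumption chosen so that a trust-everything baseline lands at 20% accuracy; the real case set lives in the evaluation scripts and may differ.

```python
def evaluate(cases, decide):
    """cases: list of (name, is_safe); decide(name) -> True when the answer is trusted."""
    unsafe = [case for case in cases if not case[1]]
    correct = sum(decide(name) == is_safe for name, is_safe in cases)
    false_accepts = sum(decide(name) for name, _ in unsafe)
    return {
        "accuracy": correct / len(cases),
        "false_accept_rate": false_accepts / len(unsafe),
        "detection_recall": (len(unsafe) - false_accepts) / len(unsafe),
    }


# Hypothetical labels: True means the answer deserves trust.
cases = [
    ("grounded_answer", True),
    ("stale_rag_memory", False),
    ("failed_tool_claimed_success", False),
    ("blocked_allowed_tool", False),
    ("runtime_parameter_drift", False),
]
labels = dict(cases)

baseline = evaluate(cases, lambda name: True)         # trusts every answer
harness = evaluate(cases, lambda name: labels[name])  # idealized verifier

print(baseline)
print(harness)
```

Under these assumptions the baseline is right only on the single safe case (1 of 5), accepts all four unsafe answers, and detects none of them.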

Offline report summary by scenario:

| Scenario | Baseline behavior | Harness lift | Module |
| --- | --- | --- | --- |
| RAG memory freshness | Trusts stale-memory answers without current evidence. | Blocks the answer until current knowledge evidence exists. | `ResultVerifier` |
| Tool failure claimed as success | Trusts a final JSON that says the operation passed. | Detects failed tool receipts and blocks false completion claims. | `ResultVerifier` |
| Permission over-blocking of allowed tools | Trusts a success result even when an allowed tool was blocked. | Treats failed receipts as incompatible with `operation_completed=true`. | `ResultVerifier` |
| Runtime parameter drift | Trusts unsupported runtime values such as a wrong token limit. | Blocks key numeric facts that are not present in evidence. | `ResultVerifier` |
| Multi-turn context anchoring | Raw history includes progress noise and loses the original task anchor. | Pins the original task and removes control-message pollution. | `ContextEngine` |
| Current evidence beats stale memory | Recent history can surface stale cached answers before evidence. | Puts current evidence before history and keeps the original task anchor. | `ContextEngine` |
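
The "failed receipts are incompatible with `operation_completed=true`" rule from the table can be sketched as a small consistency check. The receipt fields (`tool`, `status`) and the `check_completion_claim` helper are hypothetical; only the `operation_completed` key comes from the table above.

```python
import json


def check_completion_claim(final_answer_json, receipts):
    """Return a list of issues; an empty list means the claim is consistent."""
    claim = json.loads(final_answer_json)
    failed = [r["tool"] for r in receipts if r["status"] == "failed"]
    issues = []
    if claim.get("operation_completed") is True and failed:
        issues.append(
            "operation_completed=true conflicts with failed receipts: "
            + ", ".join(failed)
        )
    return issues


receipts = [
    {"tool": "read_config", "status": "ok"},
    {"tool": "deploy_service", "status": "failed"},
]
issues = check_completion_claim('{"operation_completed": true}', receipts)
print(issues)
```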

Reports are written to
`examples/harness/evaluation/results/harness_eval_report.json` and
`examples/harness/evaluation/results/harness_eval_report.md`.

## Run model-in-the-loop evaluation

The model evaluation makes real veADK model calls. Export the standard model
environment variables, or pass any dotenv file that contains
`MODEL_AGENT_API_KEY`, `MODEL_AGENT_NAME`, and `MODEL_AGENT_API_BASE`:

```bash
python examples/harness/evaluation/run_model_eval.py \
--env-file /path/to/model.env
```

If the variables are already exported in the shell, `--env-file` can be omitted.
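
Before starting a model run, it can help to confirm that all three variables are present. The sketch below parses simple `KEY=VALUE` lines with the standard library only; `parse_env_text` and `missing_vars` are hypothetical helpers, not part of `run_model_eval.py`, and the values shown are placeholders.

```python
REQUIRED = ("MODEL_AGENT_API_KEY", "MODEL_AGENT_NAME", "MODEL_AGENT_API_BASE")


def parse_env_text(text):
    """Parse simple KEY=VALUE lines; skips blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_vars(env):
    """Names of required variables that are absent or empty."""
    return [name for name in REQUIRED if not env.get(name)]


sample = """
# model.env (placeholder values)
MODEL_AGENT_NAME=example-model
MODEL_AGENT_API_BASE="https://example.invalid/v1"
"""
env = parse_env_text(sample)
print(missing_vars(env))
```

The same `missing_vars` check works on `os.environ` when the variables are exported in the shell instead.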

No secret values are written to the reports. The script compares a normal veADK
Agent that trusts every non-empty answer with the Harness Agent that trusts an
answer only when `VerificationReport.done` is true.
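
The difference between the two trust rules can be sketched in a few lines. The `VerificationReport` dataclass here is a stand-in: only the `done` attribute is named in this README, and the `issues` field is an assumption added for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VerificationReport:
    # Only `done` is named in the README; `issues` is an assumed field.
    done: bool
    issues: list = field(default_factory=list)


def baseline_trusts(answer):
    """Baseline rule: any non-empty answer is trusted."""
    return bool(answer.strip())


def harness_trusts(answer, report):
    """Harness rule: non-empty answer AND a passing verification report."""
    return bool(answer.strip()) and report.done


answer = "The token limit is 4096."
report = VerificationReport(done=False, issues=["numeric fact not found in evidence"])

print(baseline_trusts(answer))         # baseline accepts the answer
print(harness_trusts(answer, report))  # harness blocks it
```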

Reports are written to
`examples/harness/evaluation/results/harness_model_eval_report.json` and
`examples/harness/evaluation/results/harness_model_eval_report.md`.

Current sample model result:

| Metric | Baseline | Harness | Delta |
| --- | ---: | ---: | ---: |
| Trust decision accuracy | 66.7% | 100.0% | +33.3 pp |
| Unsupported false-accept rate | 100.0% | 0.0% | -100.0 pp |
| Answerable verified pass rate | - | 100.0% | +100.0 pp |
| Answerable receipt coverage | - | 100.0% | +100.0 pp |
| Unsupported request block rate | - | 100.0% | +100.0 pp |

The model report also includes a scenario matrix with the scenario as the first
column, covering RAG freshness, tool evidence receipts, and no-evidence
hallucination suppression.

Model report summary by scenario:

| Scenario | Baseline runtime | Harness runtime | What the result shows |
| --- | --- | --- | --- |
| RAG freshness with source grounding | Trusts the non-empty model answer. | Trusts only after tool receipts and source evidence are present. | Answerable sourced requests can still pass when grounded. |
| Tool evidence and receipt coverage | Trusts the final text without runtime receipt enforcement. | Keeps the answer trusted and records tool receipts. | Harness adds auditability without blocking valid answers. |
| No-evidence hallucination suppression | Trusts a non-empty unsupported answer. | Blocks the answer because no tool evidence or source receipt exists. | The trust gate prevents no-evidence source claims from reaching callers. |

## Design notes

This example focuses on the core developer workflow:

- build a task-aware context header before the model runs;
- wrap tools so every capability call leaves an auditable receipt;
- attach evidence references to tool outputs;
- verify the final answer before treating it as trusted;
- use tests and offline/model evaluations to measure the lift.
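
The tool-wrapping and evidence steps above could look roughly like the decorator below. It is a sketch under assumptions: `with_receipt`, the receipt fields, and the file naming are hypothetical and are not the contents of `tool_wrappers.py`.

```python
import functools
import tempfile
import time
from pathlib import Path


def with_receipt(receipts, evidence_dir, max_inline_chars=200):
    """Decorator factory: log a receipt per call, spill large results to disk."""

    def decorate(tool):
        @functools.wraps(tool)
        def wrapper(*args, **kwargs):
            receipt = {"tool": tool.__name__, "started_at": time.time()}
            try:
                result = str(tool(*args, **kwargs))
            except Exception as exc:
                # Failed calls still leave a receipt before re-raising.
                receipt.update(status="failed", error=repr(exc))
                receipts.append(receipt)
                raise
            receipt["status"] = "ok"
            if len(result) > max_inline_chars:
                # Externalize the full text and return a short preview.
                path = Path(evidence_dir) / f"{tool.__name__}-{len(receipts)}.txt"
                path.write_text(result, encoding="utf-8")
                receipt["evidence_ref"] = str(path)
                result = result[:max_inline_chars] + f"... [full text: {path.name}]"
            receipts.append(receipt)
            return result

        return wrapper

    return decorate


receipts = []
evidence_dir = tempfile.mkdtemp()


@with_receipt(receipts, evidence_dir)
def fetch_page(url):
    return "x" * 1000  # stand-in for a large tool result


@with_receipt(receipts, evidence_dir)
def broken_tool():
    raise RuntimeError("upstream unavailable")


summary = fetch_page("https://example.com")
try:
    broken_tool()
except RuntimeError:
    pass
print([r["status"] for r in receipts])
```

Recording the receipt inside the `except` branch is the design point: a verifier can only contradict a false success claim if the failure was written down.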

The modules are intentionally compact and explicit, making them suitable as a
starting point for product-specific Harness extensions.
176 changes: 176 additions & 0 deletions examples/harness/README.zh.md
@@ -0,0 +1,176 @@
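# Harness submodule example: ContextEngine + ResultVerifier

This example shows how to compose two Harness submodules with a veADK Agent for context engineering,
evidence tracking, and final-answer verification.

- `ContextEngine`: pins the original task, filters noisy history, assembles an evidence-first context, and records
a lightweight budget report.
- `ResultVerifier`: records tool receipts, collects evidence references, checks the final answer for fabricated URLs and
external facts without evidence, and writes a local verification report.

All code is self-contained in `examples/harness/`, so developers can read, run,
test, and adapt it to their own business scenarios within one directory.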

## Layout

```text
examples/harness/
├── main.py
├── harness_agent.py
├── harness_modules/
│   ├── core.py
│   ├── context_engine.py
│   ├── result_verifier.py
│   ├── tool_wrappers.py
│   └── stores.py
├── tests/
├── golden/
│   ├── production_scenarios.jsonl
│   ├── context_engine_cases.jsonl
│   └── verifier_cases.jsonl
└── evaluation/
    ├── run_eval.py
    └── run_model_eval.py
```

## Run

Configure the normal veADK model environment variables first, then run:

```bash
python examples/harness/main.py
```

Audit data from the run is written to `.harness_runs/`:

- `events.jsonl`
- `messages.jsonl`
- `receipts.jsonl`
- `evidence/*.txt`
- `reports/<session_id>-<run_id>.json`

## Core usage

```python
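import asyncio

from harness_agent import build_harness_agent


async def main() -> None:
    bundle = build_harness_agent()
    answer = await bundle.run(
        "Look up the core capabilities of the veADK Harness example, cite your sources, and answer in 3 bullet points.",
        session_id="harness-demo",
    )
    report = bundle.latest_report(session_id="harness-demo")
    print(answer, report, sep="\n")


asyncio.run(main())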
```

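`bundle.agent` and `bundle.runner` are regular veADK `Agent` / `Runner` instances.
`bundle.run(...)` is a thin layer that coordinates `user_id`, `session_id`, and
`original_prompt` so the Harness processor can produce local receipts, evidence, context events, and verification reports.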

## Test

The tests use fake tools and fake runner events, so no model key is needed:

```bash
pytest examples/harness/tests
```

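Validation points:

- the original task anchor is retained across follow-up turns;
- progress and control messages do not enter the model context;
- fabricated URLs are blocked deterministically;
- verification fails when a current/external factual task has no evidence;
- a failed receipt is kept when a tool raises an error;
- large tool results are externalized to evidence files.

The scenario-level golden set is
`examples/harness/golden/production_scenarios.jsonl`. It is organized by common production scenario and module,
so developers can add regression cases without tying them to a specific product issue or project-specific dataset.
`verifier_cases.jsonl` and `context_engine_cases.jsonl` keep the module-level golden checks.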

## Evaluate the Harness lift

Run the offline A/B evaluation:

```bash
python examples/harness/evaluation/run_eval.py
```

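The evaluation isolates the deterministic effects of the Harness submodules rather than model capability. The baseline uses raw history
and trusts every non-empty answer; the Harness treatment uses `ContextEngine` and `ResultVerifier`.
The case set covers common production development scenarios: stale RAG cache, failed tool receipts, permission over-blocking,
runtime parameter drift, and multi-turn context anchoring.

Current result:

| Metric | Baseline | Harness | Delta |
| --- | ---: | ---: | ---: |
| Result verification accuracy | 20.0% | 100.0% | +80.0 pp |
| Unsafe-answer false-accept rate | 100.0% | 0.0% | -100.0 pp |
| Unsafe-answer recall | 0.0% | 100.0% | +100.0 pp |
| Context quality score | 0.0% | 100.0% | +100.0 pp |

Offline report summary by scenario:

| Scenario | Baseline behavior | Harness lift | Module |
| --- | --- | --- | --- |
| RAG memory freshness | Trusts stale cached answers even without current evidence. | Blocks the answer when current knowledge-base evidence is missing. | `ResultVerifier` |
| Tool failure claimed as success | Trusts the answer whenever the final JSON says passed. | Detects failed receipts and blocks false completion claims. | `ResultVerifier` |
| Permission policy over-blocking of legitimate tools | May still trust a success result after a legitimate tool was blocked. | Treats a failed receipt as conflicting with `operation_completed=true`. | `ResultVerifier` |
| Runtime parameter drift | Trusts token/runtime values that no evidence supports. | Blocks key numeric facts that do not appear in the evidence. | `ResultVerifier` |
| Multi-turn context anchoring | Raw history contains progress noise and easily loses the task anchor. | Pins the original task and filters out control-message pollution. | `ContextEngine` |
| Current evidence takes priority over stale memory | Stale cached answers in recent history may enter the context before the evidence. | Places current evidence before history and keeps the original task anchor. | `ContextEngine` |

Reports are written to
`examples/harness/evaluation/results/harness_eval_report.json` and
`examples/harness/evaluation/results/harness_eval_report.md`.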

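## Run model-in-the-loop evaluation

The model evaluation makes real veADK model calls. Export the standard model environment variables in the shell first, or pass
any dotenv file that contains `MODEL_AGENT_API_KEY`, `MODEL_AGENT_NAME`, and
`MODEL_AGENT_API_BASE`: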

```bash
python examples/harness/evaluation/run_model_eval.py \
--env-file /path/to/model.env
```

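If these variables are already exported in the current shell, `--env-file` can be omitted.

No secret values are written to the reports. The evaluation compares a normal veADK Agent, which trusts every non-empty answer,
with the Harness Agent, which treats an answer as trusted only when `VerificationReport.done=True`.

Reports are written to
`examples/harness/evaluation/results/harness_model_eval_report.json` and
`examples/harness/evaluation/results/harness_model_eval_report.md`.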

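Current sample model evaluation result:

| Metric | Baseline | Harness | Delta |
| --- | ---: | ---: | ---: |
| Trust decision accuracy | 66.7% | 100.0% | +33.3 pp |
| False-accept rate for no-evidence tasks | 100.0% | 0.0% | -100.0 pp |
| Verified pass rate for answerable tasks | - | 100.0% | +100.0 pp |
| Receipt coverage for answerable tasks | - | 100.0% | +100.0 pp |
| Block rate for no-evidence tasks | - | 100.0% | +100.0 pp |

The model report also includes a matrix organized by scenario, with the scenario as the first column, covering RAG freshness,
tool evidence receipts, and no-evidence hallucination suppression.

Model report summary by scenario:

| Scenario | Baseline runtime | Harness runtime | What the result shows |
| --- | --- | --- | --- |
| RAG freshness with source grounding | Trusts the non-empty model answer. | Trusts only when tool receipts and source evidence exist. | Requests that require sources pass normally when the evidence is sufficient. |
| Tool evidence and receipt coverage | Does not enforce runtime receipt checks. | Lets the answer through and records tool receipts. | Harness adds auditability without blocking valid answers. |
| No-evidence hallucination suppression | Trusts a non-empty answer that has no evidence. | Blocks the answer because tool evidence or a source receipt is missing. | The trust gate keeps no-evidence source claims from being returned to callers. |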

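## Design notes

This example focuses on the core path that developers use most often:

- build a task-aware context header before the model runs;
- wrap tool calls so every capability call leaves an auditable receipt;
- bind tool outputs to evidence references;
- run result verification before trusting the final answer;
- measure the lift through tests, offline evaluation, and model-in-the-loop evaluation.

The modules are kept compact and direct, making them a suitable starting point for extending Harness capabilities on the product side.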
15 changes: 15 additions & 0 deletions examples/harness/__init__.py
@@ -0,0 +1,15 @@
# Copyright (c) 2025 Beijing Volcano Engine Technology Co., Ltd. and/or its affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""ContextEngine and ResultVerifier Harness example."""
15 changes: 15 additions & 0 deletions examples/harness/evaluation/__init__.py
@@ -0,0 +1,15 @@
# Copyright (c) 2025 Beijing Volcano Engine Technology Co., Ltd. and/or its affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Offline A/B evaluation for the Harness example."""