Conversation

@zzjweb commented Jan 21, 2026

Problem

Agent-lightning inherits VeRL's default advantage estimation, which assumes each batch sample is independent. In multi-turn scenarios this causes turn-level bias: trajectories with more turns contribute more samples to the baseline statistics (mean/std), leading to biased advantage estimates and inefficient optimization.
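
As a toy illustration (numbers invented here, not from the codebase): suppose one group contains trajectory T1 (outcome reward 1.0, three turns) and trajectory T2 (outcome reward 0.0, one turn). Counting per turn pulls the group baseline toward T1:

import numpy as np

scores = np.array([1.0, 1.0, 1.0, 0.0])  # one entry per *turn*; T1 appears 3 times
print(scores.mean())                      # 0.75 -- baseline dominated by T1
print(np.array([1.0, 0.0]).mean())        # 0.50 -- each trajectory counted once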

Solution

This PR implements trajectory-level deduplication keyed on (data_id, rollout_id) pairs. Set algorithm.compute_mean_std_cross_all_data=False to ensure each trajectory is counted only once when computing baselines.

In agentlightning.verl.trainer, we re-implement compute_grpo_outcome_advantage to integrate the new trajectory-level deduplication logic while keeping the dependency on VeRL minimal:

seen_pairs = set()
for i in range(bsz):
    if (index[i], traj_index[i]) in seen_pairs:
        continue  # Skip duplicate turns from the same trajectory
    id2score[index[i]].append(scores[i])
    if not compute_mean_std_cross_all_data:
        seen_pairs.add((index[i], traj_index[i]))  # Track pairs only when dedup is enabled

Example Configuration

Control the normalization behavior via the compute_mean_std_cross_all_data parameter:

  • compute_mean_std_cross_all_data=True (default): normalization across all samples; more stable, but every turn is counted
  • compute_mean_std_cross_all_data=False: trajectory-level normalization; each trajectory is counted only once, eliminating the bias

config = {
    "algorithm": {
        "adv_estimator": "grpo",
        "norm_adv_by_std_in_grpo": True,
        "compute_mean_std_cross_all_data": False,  # Enable trajectory-level
    }
}

Implementation

Affected algorithms (currently only GRPO is supported):

  • ✅ GRPO
  • ❌ GRPO Pass@k
  • ❌ REINFORCE++ Baseline
  • ❌ RLOO

Files modified:

  • agentlightning/verl/trainer.py: Add compute_grpo_outcome_advantage

Copilot AI review requested due to automatic review settings January 21, 2026 12:40

Copilot AI left a comment


Pull request overview

This pull request adds trajectory-level deduplication to GRPO advantage normalization to address turn-level bias in multi-turn reinforcement learning scenarios. The implementation introduces a new compute_grpo_outcome_advantage function that tracks unique (data_id, rollout_id) pairs to ensure each trajectory is counted only once when computing baseline statistics for advantage estimation.

Changes:

  • Added compute_grpo_outcome_advantage function with trajectory-level deduplication logic
  • Integrated new advantage computation into the training pipeline with configurable behavior via compute_mean_std_cross_all_data parameter
  • Added assertion to restrict trajectory-level normalization to GRPO algorithm only


Comment on lines +50 to +113
def compute_grpo_outcome_advantage(
    token_level_rewards: torch.Tensor,
    response_mask: torch.Tensor,
    index: np.ndarray,
    traj_index: np.ndarray | None = None,
    epsilon: float = 1e-6,
    norm_adv_by_std_in_grpo: bool = True,
    compute_mean_std_cross_all_data: bool = True,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Compute advantage for GRPO with trajectory-level deduplication support.

    This is a minimal extension of VeRL's GRPO implementation, adding support for
    trajectory-level deduplication via `traj_index` and `compute_mean_std_cross_all_data`.

    Args:
        token_level_rewards: Shape (bs, response_length).
        response_mask: Shape (bs, response_length).
        index: Group index array (e.g., data_id).
        traj_index: Trajectory index array (e.g., rollout_id). If None, no deduplication.
        epsilon: Small value for numerical stability.
        norm_adv_by_std_in_grpo: If True, normalize by std (original GRPO). If False, Dr.GRPO style.
        compute_mean_std_cross_all_data: If True (default), compute mean/std across all data.
            If False, compute mean/std per unique (index, traj_index) trajectory.

    Returns:
        Tuple of (advantages, returns), both shape (bs, response_length).
    """
    scores = token_level_rewards.sum(dim=-1)

    id2score: dict = defaultdict(list)
    id2mean: dict = {}
    id2std: dict = {}
    seen_pairs: set = set()

    with torch.no_grad():
        bsz = scores.shape[0]
        for i in range(bsz):
            # Trajectory deduplication: skip if (index, traj_index) already seen
            if traj_index is not None and (index[i], traj_index[i]) in seen_pairs:
                continue
            id2score[index[i]].append(scores[i])
            # Mark as seen only when compute_mean_std_cross_all_data is False
            if traj_index is not None and not compute_mean_std_cross_all_data:
                seen_pairs.add((index[i], traj_index[i]))

        for idx in id2score:
            if len(id2score[idx]) == 1:
                id2mean[idx] = torch.tensor(0.0)
                id2std[idx] = torch.tensor(1.0)
            elif len(id2score[idx]) > 1:
                scores_tensor = torch.stack(id2score[idx])
                id2mean[idx] = torch.mean(scores_tensor)
                id2std[idx] = torch.std(scores_tensor)
            else:
                raise ValueError(f"no score in prompt index: {idx}")

        for i in range(bsz):
            if norm_adv_by_std_in_grpo:
                scores[i] = (scores[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
            else:
                scores[i] = scores[i] - id2mean[index[i]]
        scores = scores.unsqueeze(-1) * response_mask

    return scores, scores
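
For reference, a minimal usage sketch with toy tensors (assumes the signature above and that the function is importable from agentlightning.verl.trainer as this PR describes):

import numpy as np
import torch

from agentlightning.verl.trainer import compute_grpo_outcome_advantage

# Two turns of one trajectory plus one turn of another, all in group 0.
token_level_rewards = torch.tensor([[0.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
response_mask = torch.ones(3, 2)
index = np.array([0, 0, 0])       # group index (data_id)
traj_index = np.array([0, 0, 1])  # rollout_id; first two rows share a trajectory

advantages, returns = compute_grpo_outcome_advantage(
    token_level_rewards,
    response_mask,
    index,
    traj_index=traj_index,
    compute_mean_std_cross_all_data=False,  # count each trajectory once
)
print(advantages.shape)  # torch.Size([3, 2])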

Copilot AI Jan 21, 2026

The new compute_grpo_outcome_advantage function lacks test coverage. Given that this is a critical mathematical computation affecting training outcomes, unit tests should be added to verify:

  1. Correct behavior when compute_mean_std_cross_all_data=True vs False
  2. Proper handling of trajectory deduplication with different (index, traj_index) combinations
  3. Device consistency (tensors on GPU)
  4. Edge cases: single-sample groups, all identical scores, etc.
  5. Correct advantage normalization with and without std division

Consider adding tests in tests/trainer/ directory or a new test file specifically for GRPO advantage computation.
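
A minimal sketch of such a test (hypothetical file tests/trainer/test_grpo_advantage.py; assumes the import path this PR describes):

import numpy as np
import torch

from agentlightning.verl.trainer import compute_grpo_outcome_advantage


def test_duplicate_turns_do_not_shift_baseline():
    # Two turns of trajectory 0 plus one turn of trajectory 1, all in group 0.
    rewards = torch.tensor([[1.0], [1.0], [0.0]])
    mask = torch.ones(3, 1)
    index = np.array([0, 0, 0])
    traj_index = np.array([0, 0, 1])

    adv_dedup, _ = compute_grpo_outcome_advantage(
        rewards, mask, index, traj_index=traj_index,
        compute_mean_std_cross_all_data=False,
    )
    adv_all, _ = compute_grpo_outcome_advantage(
        rewards, mask, index, traj_index=traj_index,
        compute_mean_std_cross_all_data=True,
    )
    # Dedup baseline: mean(1, 0) = 0.5; all-data baseline: mean(1, 1, 0) ≈ 0.67.
    assert not torch.allclose(adv_dedup, adv_all)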

    id2mean[idx] = torch.mean(scores_tensor)
    id2std[idx] = torch.std(scores_tensor)
else:
    raise ValueError(f"no score in prompt index: {idx}")
Copilot AI Jan 21, 2026

The error message uses f-string formatting but doesn't include the idx variable value. The message should be updated to include the actual index value that's causing the issue for better debugging:

raise ValueError(f"no score in prompt index: {idx}")

should ensure the value is actually included in the error output.
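
One small way to make the failing key fully explicit (a sketch, not part of the diff) is to format it with !r so both the value and its type show up in the message:

raise ValueError(f"no score found for prompt index {idx!r}")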

Comment on lines +86 to +109
for i in range(bsz):
    # Trajectory deduplication: skip if (index, traj_index) already seen
    if traj_index is not None and (index[i], traj_index[i]) in seen_pairs:
        continue
    id2score[index[i]].append(scores[i])
    # Mark as seen only when compute_mean_std_cross_all_data is False
    if traj_index is not None and not compute_mean_std_cross_all_data:
        seen_pairs.add((index[i], traj_index[i]))

for idx in id2score:
    if len(id2score[idx]) == 1:
        id2mean[idx] = torch.tensor(0.0)
        id2std[idx] = torch.tensor(1.0)
    elif len(id2score[idx]) > 1:
        scores_tensor = torch.stack(id2score[idx])
        id2mean[idx] = torch.mean(scores_tensor)
        id2std[idx] = torch.std(scores_tensor)
    else:
        raise ValueError(f"no score in prompt index: {idx}")

for i in range(bsz):
    if norm_adv_by_std_in_grpo:
        scores[i] = (scores[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
    else:
        scores[i] = scores[i] - id2mean[index[i]]
Copilot AI Jan 21, 2026

The function accepts index as np.ndarray but uses it directly to index into dictionaries (lines 90, 108, 110). In Python dictionaries, NumPy array elements may not hash correctly depending on their dtype. If index contains NumPy scalars (e.g., np.int64), this could cause issues.

Consider converting array elements to Python native types when using them as dictionary keys:

idx_key = int(index[i])
id2score[idx_key].append(scores[i])

Or document that index must contain hashable types that work as dictionary keys.
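
Applied to the loop above, that suggestion might look like the following sketch (assumes index and traj_index hold integer-like values; VeRL sometimes uses string UIDs, in which case str() would be the right coercion):

for i in range(bsz):
    idx_key = int(index[i])  # coerce np.int64 -> plain Python int
    traj_key = int(traj_index[i]) if traj_index is not None else None
    if traj_key is not None and (idx_key, traj_key) in seen_pairs:
        continue
    id2score[idx_key].append(scores[i])
    if traj_key is not None and not compute_mean_std_cross_all_data:
        seen_pairs.add((idx_key, traj_key))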

Comment on lines +431 to +432
if not compute_mean_std_cross_all_data:
    assert self.config.algorithm.adv_estimator == AdvantageEstimator.GRPO, (
        f"compute_mean_std_cross_all_data=False is only supported for GRPO, "
        f"got {self.config.algorithm.adv_estimator}"
    )
Copilot AI Jan 21, 2026

The assertion on lines 432-435 only checks when compute_mean_std_cross_all_data=False, but the new GRPO implementation is used for ALL GRPO cases (line 438 condition). This means when compute_mean_std_cross_all_data=True with a non-GRPO estimator, the assertion is never checked, but the code would still go through the else branch at line 452.

While this is not necessarily incorrect (the else branch handles non-GRPO cases properly), the control flow could be clearer. Consider restructuring to make the relationship between the flag and the GRPO check more explicit, or add a comment explaining why the assertion only needs to check the False case.
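
One possible restructuring along those lines (a sketch only; names follow the diff context above):

is_grpo = self.config.algorithm.adv_estimator == AdvantageEstimator.GRPO
if not compute_mean_std_cross_all_data and not is_grpo:
    raise ValueError(
        f"compute_mean_std_cross_all_data=False is only supported for GRPO, "
        f"got {self.config.algorithm.adv_estimator}"
    )
if is_grpo:
    ...  # new compute_grpo_outcome_advantage path
else:
    ...  # VeRL's default advantage computation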

@zzjweb commented Jan 21, 2026

@zzjweb please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
    @microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term "You" includes me and my employer.
    @microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree
