
feat(e2e-tests): stacked e2e after split metrics#641

Open
davidberenstein1957 wants to merge 1 commit into feat/vlm-pr-4c-img-edit-score from feat/vlm-pr-5-e2e-tests

Conversation


davidberenstein1957 (Member) commented Apr 25, 2026

Summary

Final integration/e2e PR for the split VLM metrics stack.

Keeps e2e and integration-focused coverage isolated from per-metric implementation PRs.

Stack Position

Full Stack Order

  1. feat(vendor): add LLM2Vec embedding model #637 vendor
  2. feat(infrastructure): add VLM base classes and utilities #638 infrastructure
  3. feat(text-metrics): split qa_accuracy #645 qa_accuracy
  4. feat(text-metrics): split oneig_alignment #646 oneig_alignment
  5. feat(text-metrics): split text_score pair #647 text_score pair
  6. feat(text-metrics): split oneig_reasoning #648 oneig_reasoning
  7. feat(vision-metrics): split vqa #649 vqa
  8. feat(vision-metrics): split vie_score #650 vie_score
  9. feat(vision-metrics): split img_edit_score #651 img_edit_score
  10. feat(e2e-tests): stacked e2e after split metrics #641 e2e tests (this PR)

Files

  • tests/evaluation/test_vlm_e2e.py
  • tests/evaluation/test_task.py
  • tests/data/test_datamodule.py
  • tests/evaluation/_vlm_batch_snapshot_helpers.py

Test Plan

uv run pytest tests/evaluation/test_vlm_e2e.py tests/evaluation/test_task.py tests/data/test_datamodule.py

Review Focus

  • End-to-end benchmark execution coverage
  • Task/datamodule integration paths
  • Split-stack parity with umbrella behavior

Review Flow (Order)

Review the stack in the exact order listed above under Full Stack Order. This PR is 10/10 in the flow.


@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 4 potential issues.


Reviewed by Cursor Bugbot for commit 7f24f9d.

Comment thread: pyproject.toml

```diff
  "peft>=0.18.0,<0.19.0",
  "trl<=0.21.0",
  "termcolor==2.3.0",
+ "realesrgan",
```

Package realesrgan moved from optional to core dependency

High Severity

realesrgan was moved from the [upscale] optional dependency group into core dependencies, and the [upscale] extra was deleted entirely. This forces every user to install realesrgan and its heavy transitive dependencies (basicsr, facexlib, gfpgan, etc.) even if they never use upscaling. This PR is about VLM e2e tests and has no reason to change this. Likely an accidental inclusion from a rebase or merge.
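A minimal before-state sketch, assuming realesrgan was the sole member of the deleted extra (its original version pin, if any, is not shown in this thread):

```toml
[project.optional-dependencies]
# Keeping realesrgan opt-in spares default installs its heavy transitive
# dependencies (basicsr, facexlib, gfpgan, etc.).
upscale = ["realesrgan"]
```

Users who need upscaling would then opt in explicitly with pip install pruna[upscale].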


Comment thread: pyproject.toml

```diff
 [project]
 name = "pruna"
-version = "0.3.3"
+version = "0.3.2"
```

Version downgraded and Python 3.13 support dropped

High Severity

version was downgraded from "0.3.3" to "0.3.2" and requires-python was tightened from ">=3.10,<3.14" to ">=3.10,<3.13", dropping Python 3.13 support. The PR description says "pyproject.toml — Already updated in PR-2", suggesting these regressions were accidentally included during a rebase or merge conflict resolution.
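If these are indeed rebase casualties, the pre-regression values, both taken from the description above rather than from the diff itself, would be:

```toml
[project]
name = "pruna"
version = "0.3.3"                  # restore the newer version
requires-python = ">=3.10,<3.14"   # re-include Python 3.13
```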


Comment thread: pyproject.toml

```diff
 evaluation = [
+    "outlines>1.2.0,<2.0.0",
+    "litellm>=1.0.0",
 ]
```

evaluation extra silently drops lmharness and rapidata

Medium Severity

The [evaluation] optional extra was redefined from ["pruna[rapidata]", "pruna[lmharness]"] to ["outlines>1.2.0,<2.0.0", "litellm>=1.0.0"]. Users running pip install pruna[evaluation] will no longer get lm-eval or rapidata. The [rapidata] extra was also completely removed. This is a silent backward-incompatible change to the package's public install interface.
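For reference, a sketch of the previous composition as described above; the contents of the rapidata and lmharness groups themselves are not shown in this thread:

```toml
[project.optional-dependencies]
# pip install pruna[evaluation] used to pull in both sub-extras.
evaluation = ["pruna[rapidata]", "pruna[lmharness]"]
```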


{"img_size": 224},
),
"DrawBench": (setup_drawbench_dataset, "prompt_collate", {}),
"DrawBench": (setup_drawbench_dataset, "prompt_with_auxiliaries_collate", {}),

DrawBench/GenAIBench collate change alters return type

Medium Severity

DrawBench and GenAIBench collate functions changed from prompt_collate (returns (prompts, None)) to prompt_with_auxiliaries_collate (returns (prompts, list[dict])). Any existing code consuming these datasets and expecting gt=None (e.g., model inference handlers, metric update calls that check for None ground truth) will now receive a list of dicts, potentially causing unexpected behavior.
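A minimal Python sketch of the contract change, using only the return shapes stated above (sample keys and function bodies are illustrative, not pruna's actual implementation):

```python
# Illustrative stand-ins that mimic the stated return shapes.
def prompt_collate(batch: list[dict]) -> tuple[list[str], None]:
    # Old DrawBench/GenAIBench behavior: prompts only, no ground truth.
    return [sample["prompt"] for sample in batch], None

def prompt_with_auxiliaries_collate(batch: list[dict]) -> tuple[list[str], list[dict]]:
    # New behavior: prompts plus a list of auxiliary dicts.
    prompts = [sample["prompt"] for sample in batch]
    auxiliaries = [{k: v for k, v in sample.items() if k != "prompt"} for sample in batch]
    return prompts, auxiliaries

# A consumer written against the old contract now silently skips this branch:
prompts, gt = prompt_with_auxiliaries_collate([{"prompt": "a cat", "category": "animal"}])
if gt is None:  # previously True for these datasets
    ...         # None-handling logic no longer runs
```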


Commit 7f24f9d

- Add _vlm_batch_snapshot_helpers for test data generation
- Add end-to-end tests for metric interactions
- Add datamodule support for VLM evaluation
- Add task-level VLM metric integration
- Add VLM timing/profiling support
- Strip VLM task routing kwargs in TorchMetricWrapper (see the sketch after this list)
- Update docs with VLM evaluation guide
- Update data loaders for image/caption support
- Integration with evaluation agent for VLM metric selection
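
A minimal sketch of the kwarg-stripping idea from the list above; the kwarg names and the wrapper's interface are assumptions for illustration, not pruna's actual API:

```python
# Hypothetical routing-kwarg names; the real set is defined in pruna.
VLM_ROUTING_KWARGS = {"vlm_task", "modality"}

class TorchMetricWrapper:
    """Thin wrapper shielding a torchmetrics metric from routing kwargs."""

    def __init__(self, metric):
        self.metric = metric

    def update(self, *args, **kwargs):
        # Drop VLM task-routing kwargs before forwarding, since the wrapped
        # metric's update() signature does not accept them.
        forwarded = {k: v for k, v in kwargs.items() if k not in VLM_ROUTING_KWARGS}
        self.metric.update(*args, **forwarded)
```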
