Request for Open-source Evaluation Script #4

@ProvenceStar

Hi,

Thanks for your great work! May I ask whether there are any plans to update the evaluation script?

I tried to reproduce the results from the paper, using Qwen3-VL as the VLM judge together with the curated prompt mentioned in the paper, but my final results are much lower. For example, for the S_ad metric, not a single sample receives a score of 10. 51% of my samples score 9.5 and 17.6% score 8.5, so I do not see how the 94.3 reported in the paper could be reached.
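
To make the arithmetic concrete, here is a minimal sketch of the upper bound implied by those numbers. It assumes S_ad is the mean per-sample score on a 0-10 scale multiplied by 10; that aggregation is my guess and is not confirmed by the paper. The function name and the `cap` parameter are hypothetical, introduced only for this illustration.

```python
def upper_bound_mean(known, cap):
    """Best-case mean score given a partial score distribution.

    known: dict mapping score -> fraction of samples with that score.
    cap:   highest score the remaining samples could possibly have.
    """
    known_mass = sum(known.values())
    remaining = 1.0 - known_mass  # samples whose exact score is not listed
    return sum(score * frac for score, frac in known.items()) + remaining * cap


# Fractions observed in my run (from the numbers above).
known = {9.5: 0.51, 8.5: 0.176}

# Assumption: S_ad = mean score * 10. No sample reached 10, so even if every
# remaining sample scored 9.5, the metric could not exceed this value.
bound = upper_bound_mean(known, cap=9.5) * 10
print(f"best-case S_ad under this assumption: {bound:.2f}")
```

Under that assumption the best case is about 93.2, which is still below 94.3, so the gap seems to come from the judge scores themselves (or from a different aggregation) rather than from my averaging.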

Could you please share the evaluation script or let me know where I went wrong? Thanks.
