
Fear/data synthesis#457

Open
QianqiuerQS wants to merge 2 commits into ModelEngine-Group:main from QianqiuerQS:fear/data-synthesis

Conversation

@QianqiuerQS

Code for data synthesis and synthetic-data quality evaluation

Copilot AI review requested due to automatic review settings March 30, 2026 07:04

Copilot AI left a comment


Pull request overview

This PR adds two capabilities: first, unstructured/YOLOX inference adaptation and benchmark scripts for Ascend NPU; second, medical data synthesis and synthetic-data quality evaluation (including metric computation and unit tests), used for data synthesis and acceptance.

Changes:

  • Added a YOLOX inference monkey-patch, OCR adaptation, and run/stress-test scripts for unstructured on the NPU side.
  • Added a medical data synthesizer (QA/CoT/Preference templates), an evaluator (rule-based evaluation first), requirement metric computation, and acceptance scripts.
  • Added unit tests for project-requirement verification and a model download script.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.

File Description
runtime/ops/mapper/unstructured_npu/run.sh Ascend NPU entry script (LD_PRELOAD/jemalloc environment setup + benchmark invocation)
runtime/ops/mapper/unstructured_npu/ocr_npu_adapter.py PaddleOCR CPU-isolated subprocess + pytesseract module-injection patch
runtime/ops/mapper/unstructured_npu/npu_adapter.py YOLOX/unstructured inference adaptation (safe operators, structure replacement, PageLayout/from_image rewrite, model loading)
runtime/ops/mapper/unstructured_npu/fusion_result.json NPU graph-fusion statistics output (likely a runtime-generated artifact)
runtime/ops/mapper/unstructured_npu/benchmark_npu.py NPU benchmark script (dependency shielding/deep mocking + unstructured partition calls)
runtime/ops/mapper/data_synthesis/data_synthesizer.py Core medical data synthesis logic (templates, quality checks, repair, and a deterministic fallback)
runtime/ops/mapper/data_synthesis/data_evaluator.py Synthetic-data evaluator (rule-based scoring across 5 dimensions + summarize_accuracy)
runtime/ops/mapper/data_synthesis/requirement_metrics.py Utility functions for computing requirement metrics and judging compliance
runtime/ops/mapper/data_synthesis/test_project_requirements.py Unit tests covering the three templates, the mix ratio, metric compliance, and binary accuracy across the five dimensions
runtime/ops/mapper/data_synthesis/prepare_golden_data.py Script that generates the golden (reference) dataset
runtime/ops/mapper/data_synthesis/verify_evaluator.py Script that validates evaluator consistency against the golden data
runtime/ops/mapper/data_synthesis/run_50_each_test.py Batch generation and statistics script producing 50 samples per category
runtime/ops/mapper/data_synthesis/benchmark_and_visualize.py Synthesis stress-test + visualization report script
runtime/ops/mapper/data_synthesis/final_delivery_part1.py Delivery pipeline script (data generation / metrics CSV / visualization / summary)
runtime/ops/mapper/data_synthesis/download.py ModelScope model download utility (can skip training artifacts)


Comment on lines +1 to +8
{
"session_and_graph_id_0_0": {
"graph_fusion": {
"ARefreshCubeC0FusionPass": {
"effect_times": "1",
"match_times": "1"
},
"Conv2dToConv2dV2FusionPass": {

Copilot AI Mar 30, 2026


This looks like a runtime-generated profiling/log artifact (graph fusion pass statistics). Committing it will create noisy diffs and can bloat the repo; it should typically be ignored (e.g., via .gitignore) or moved to an artifacts/output location rather than source control.
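As a hedged sketch of that cleanup (the first path is the file committed in this PR; the glob pattern is an assumption about how such fusion outputs are named), one could untrack the artifact and ignore future copies:

```shell
# Sketch: stop tracking the generated artifact and ignore future copies.
# Run from the repository root; the glob pattern is an assumption.
set -eu

# Untrack the committed artifact but keep the local file
# (left commented so the sketch is safe to run anywhere):
# git rm --cached runtime/ops/mapper/unstructured_npu/fusion_result.json

# Ignore this artifact and similarly named fusion outputs going forward.
echo 'runtime/ops/mapper/unstructured_npu/fusion_result.json' >> .gitignore
echo 'fusion_result*.json' >> .gitignore
```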

Comment on lines +212 to +220
def evaluate(self, data_list: List[Dict[str, Any]], target_dimensions: Optional[List[str]] = None) -> List[Dict]:
    """
    批量评估入口
    :param data_list: 包含 'content' 字段的字典列表
    :param target_dimensions: 指定要评测的维度,默认全部 7 个
    """
    if target_dimensions is None:
        target_dimensions = list(self.dimension_criteria.keys())


Copilot AI Mar 30, 2026


The evaluate docstring says "默认全部 7 个", but dimension_criteria defines 5 dimensions (and the PR description also mentions 5). Update the docstring to avoid misleading callers about the evaluation surface area.
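One way to keep the docstring honest is not to hard-code the count at all. A minimal sketch (the class name, dimension names, and placeholder scoring below are illustrative assumptions, not the PR's actual `data_evaluator.py` definitions):

```python
from typing import Any, Dict, List, Optional

class DataEvaluator:
    """Illustrative stand-in for the PR's evaluator (names are assumptions)."""

    def __init__(self) -> None:
        # The PR defines 5 rule-based scoring dimensions; these names are
        # placeholders, not the actual criteria from data_evaluator.py.
        self.dimension_criteria: Dict[str, str] = {
            "accuracy": "factually correct",
            "completeness": "covers the question",
            "relevance": "on topic",
            "safety": "no harmful advice",
            "fluency": "readable language",
        }

    def evaluate(self, data_list: List[Dict[str, Any]],
                 target_dimensions: Optional[List[str]] = None) -> List[Dict]:
        """Batch evaluation entry point.

        :param data_list: list of dicts, each with a 'content' field
        :param target_dimensions: dimensions to score; defaults to every key
            in ``self.dimension_criteria``, so this text never drifts from
            the real dimension count
        """
        if target_dimensions is None:
            target_dimensions = list(self.dimension_criteria.keys())
        # Placeholder scoring: a real evaluator would apply rule checks here.
        return [
            {"content": item.get("content", ""), "scored_dimensions": target_dimensions}
            for item in data_list
        ]
```

Phrasing the default in terms of the attribute (rather than a literal "7" or "5") removes the drift the review flagged.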

Comment on lines +23 to +28
parser.add_argument(
    "--cache_dir",
    default="/mnt/nvme0n1/home/pjj/.cache/modelscope",
    help="模型缓存目录(必须可写)"
)
parser.add_argument(

Copilot AI Mar 30, 2026


The default --cache_dir points to a machine/user-specific path (/mnt/nvme0n1/home/pjj/.cache/modelscope). This will fail on other machines and in containers without that mount. Prefer a portable default (e.g., ~/.cache/modelscope or $MODELSCOPE_CACHE) and keep the current value as an example in docs/README instead of a default.
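A portable default along the lines the comment suggests might look like this (sketch only; the precedence of `MODELSCOPE_CACHE` over `~/.cache/modelscope` is an assumption about the deployment, and `build_parser` is a hypothetical helper, not a function from the PR):

```python
import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    # Assumed precedence: explicit --cache_dir flag > MODELSCOPE_CACHE env
    # var > a user-local default that exists on any machine.
    default_cache = os.environ.get(
        "MODELSCOPE_CACHE",
        os.path.expanduser("~/.cache/modelscope"),
    )
    parser = argparse.ArgumentParser(description="Model download utility (sketch)")
    parser.add_argument(
        "--cache_dir",
        default=default_cache,
        help="Model cache directory (must be writable)",
    )
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.cache_dir)
```

The machine-specific `/mnt/nvme0n1/...` path can then live in a README example rather than in the parser default.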

Comment on lines +22 to +34
candidates = [
    os.getenv("MODEL_PATH"),
    "/root/.cache/modelscope/hub/models/Qwen/Qwen3-4B",
    "/work/.cache/modelscope/testUser/Qwen3-1___7b-Medical-R1-sft",
    "/mnt/nvme0n1/home/pjj/.cache/modelscope/testUser/Qwen3-1___7b-Medical-R1-sft",
    "/data/models/Qwen/Qwen2.5-7B-Instruct",
]
for path in candidates:
    if path and os.path.exists(path):
        return path
raise FileNotFoundError("未找到可用模型路径,请设置 MODEL_PATH 或检查本地目录。")



Copilot AI Mar 30, 2026


resolve_model_path() bakes in several environment-specific absolute paths (including personal mount points). This makes the script harder to run outside the original environment. Prefer a single env var (e.g., MODEL_PATH) plus a clearly documented default, and fail fast with a helpful error when not set.

Suggested change
candidates = [
    os.getenv("MODEL_PATH"),
    "/root/.cache/modelscope/hub/models/Qwen/Qwen3-4B",
    "/work/.cache/modelscope/testUser/Qwen3-1___7b-Medical-R1-sft",
    "/mnt/nvme0n1/home/pjj/.cache/modelscope/testUser/Qwen3-1___7b-Medical-R1-sft",
    "/data/models/Qwen/Qwen2.5-7B-Instruct",
]
for path in candidates:
    if path and os.path.exists(path):
        return path
raise FileNotFoundError("未找到可用模型路径,请设置 MODEL_PATH 或检查本地目录。")
"""
解析模型路径:
1. 优先使用环境变量 MODEL_PATH;
2. 若未设置,则尝试使用约定的默认路径(如集群预置模型目录);
3. 若都不可用,则快速失败并给出明确提示。
"""
env_path = os.getenv("MODEL_PATH")
if env_path:
    if os.path.exists(env_path):
        return env_path
    raise FileNotFoundError(
        f"环境变量 MODEL_PATH 已设置为 '{env_path}',但该路径不存在。"
    )
# 单一、约定俗成的默认模型路径(可根据部署环境调整)
default_path = "/data/models/Qwen/Qwen2.5-7B-Instruct"
if os.path.exists(default_path):
    return default_path
raise FileNotFoundError(
    "未找到可用模型路径:请设置环境变量 MODEL_PATH 指向已下载的模型目录,"
    f"或在默认路径存在模型目录:{default_path}"
)

strides = []

for (hsize, wsize), stride in zip(self.hw, self.strides):
    yv, xv = torch.meshgrid([torch.arange(hsize), torch.arange(wsize)])

Copilot AI Mar 30, 2026


torch.meshgrid([torch.arange(hsize), torch.arange(wsize)]) relies on the legacy default indexing behavior and will emit warnings on newer PyTorch; future versions may require an explicit indexing= argument. Specify indexing="ij" (or the intended indexing) to avoid warning spam and forward-compat issues.

Suggested change
yv, xv = torch.meshgrid([torch.arange(hsize), torch.arange(wsize)])
yv, xv = torch.meshgrid(torch.arange(hsize), torch.arange(wsize), indexing="ij")

Comment on lines +3 to +5

# 配置
MODEL_PATH = "/data/models/Qwen/Qwen2.5-7B-Instruct"

Copilot AI Mar 30, 2026


MODEL_PATH is hard-coded to /data/models/Qwen/Qwen2.5-7B-Instruct, which makes the verification script non-portable and likely to fail in CI/other environments. Consider resolving from an env var (with a sensible default) and validating the path exists before initializing the evaluator.

Suggested change
# 配置
MODEL_PATH = "/data/models/Qwen/Qwen2.5-7B-Instruct"
import os
from pathlib import Path
# 配置
MODEL_PATH = os.environ.get("MODEL_PATH", "/data/models/Qwen/Qwen2.5-7B-Instruct")
if not Path(MODEL_PATH).exists():
    raise FileNotFoundError(
        f"MODEL_PATH does not exist: {MODEL_PATH}. "
        "Set the MODEL_PATH environment variable to a valid model directory."
    )

export OMP_NUM_THREADS=1

# 6. Python 路径 (包含当前目录和 YOLOX)
export PYTHONPATH=$(pwd):$(pwd)/YOLOX-main:$PYTHONPATH

Copilot AI Mar 30, 2026


set -u is enabled, but export PYTHONPATH=$(pwd):$(pwd)/YOLOX-main:$PYTHONPATH will abort if PYTHONPATH is unset in the environment. Use a default expansion (e.g., ${PYTHONPATH:-}) to keep the script runnable in clean shells/containers.

Suggested change
export PYTHONPATH=$(pwd):$(pwd)/YOLOX-main:$PYTHONPATH
export PYTHONPATH="$(pwd):$(pwd)/YOLOX-main:${PYTHONPATH:-}"


# 3. 设置 LD_PRELOAD (覆盖式设置,防止重复)
# 注意:jemalloc 必须排在第一位,libgomp 排第二解决 TLS 问题
export LD_PRELOAD="$JEMALLOC:$GOMP"

Copilot AI Mar 30, 2026


LD_PRELOAD is set to include both jemalloc and libgomp, but the script only checks that $JEMALLOC exists. If $GOMP is missing (common across distros/containers), the preload will fail and prevent startup. Add a file existence check for $GOMP (or make it optional) before exporting LD_PRELOAD.

Suggested change
export LD_PRELOAD="$JEMALLOC:$GOMP"
LD_PRELOAD="$JEMALLOC"
if [ -f "$GOMP" ]; then
    LD_PRELOAD="$LD_PRELOAD:$GOMP"
else
    echo "⚠️ Warning: libgomp not found at $GOMP, continuing without it"
fi
export LD_PRELOAD

Comment on lines +152 to +159
try:
    res = torch.cat(tensors, dim=dim)
    torch.npu.synchronize()
    return res
except Exception:
    cpu_tensors = [t.cpu() for t in tensors]
    if not cpu_tensors: return torch.tensor([], device=tensors[0].device)
    return torch.cat(cpu_tensors, dim=dim).to(tensors[0].device)

Copilot AI Mar 30, 2026


safe_cat's exception fallback still dereferences tensors[0] when the input list is empty (if not cpu_tensors: return torch.tensor([], device=tensors[0].device)), which will raise IndexError and mask the original error. Handle the empty-input case without indexing (e.g., return an empty CPU tensor, or accept an explicit device argument).

Suggested change
try:
    res = torch.cat(tensors, dim=dim)
    torch.npu.synchronize()
    return res
except Exception:
    cpu_tensors = [t.cpu() for t in tensors]
    if not cpu_tensors: return torch.tensor([], device=tensors[0].device)
    return torch.cat(cpu_tensors, dim=dim).to(tensors[0].device)
# Handle empty input explicitly to avoid masking errors and out-of-range access
if not tensors:
    # Return an empty CPU tensor; callers can move it to a device if needed
    return torch.tensor([])
try:
    res = torch.cat(tensors, dim=dim)
    torch.npu.synchronize()
    return res
except Exception:
    cpu_tensors = [t.cpu() for t in tensors]
    if not cpu_tensors:
        # Fallback for unexpected empty case in exception path
        return torch.tensor([])
    device = tensors[0].device
    return torch.cat(cpu_tensors, dim=dim).to(device)

Comment on lines +481 to +482
model_path = "/mnt/nvme0n1/pjj-data/data/models/yolox_l.pt"


Copilot AI Mar 30, 2026


npu_get_model falls back to an environment-specific absolute model path (/mnt/nvme0n1/pjj-data/...). This will break on other machines/containers and makes the adapter non-portable. Prefer a required env var/config option for the model path (or search standard cache locations) and raise a clear error if not found.

Suggested change
model_path = "/mnt/nvme0n1/pjj-data/data/models/yolox_l.pt"
env_model_path = os.getenv("NPU_YOLOX_MODEL_PATH")
if env_model_path and os.path.exists(env_model_path):
    model_path = env_model_path
else:
    raise FileNotFoundError(
        "YOLOX model file not found. Expected './yolox_l.pt' in the current directory "
        "or a valid path set via the NPU_YOLOX_MODEL_PATH environment variable."
    )

