
Fear/data synthesis#457

Open
QianqiuerQS wants to merge 2 commits into ModelEngine-Group:main from QianqiuerQS:fear/data-synthesis

Conversation

@QianqiuerQS

Code for data synthesis and synthetic-data quality evaluation

Copilot AI review requested due to automatic review settings March 30, 2026 07:04

Copilot AI left a comment


Pull request overview

This PR adds two capabilities: first, unstructured/YOLOX inference adaptation and benchmark scripts for Ascend NPU; second, medical data synthesis and synthetic-data quality evaluation (including metric computation and unit tests), used for data synthesis and acceptance.

Changes:

  • Added a YOLOX inference monkey-patch, OCR adaptation, and run/stress-test scripts for unstructured on the NPU side.
  • Added a medical data synthesizer (QA/CoT/Preference templates), an evaluator (rule-based evaluation first), requirement metric computation, and acceptance scripts.
  • Added unit tests for project-requirement verification and a model download script.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.

File Description
runtime/ops/mapper/unstructured_npu/run.sh Ascend NPU entry script (LD_PRELOAD/jemalloc environment setup + benchmark invocation)
runtime/ops/mapper/unstructured_npu/ocr_npu_adapter.py PaddleOCR CPU-isolated subprocess + pytesseract module-injection patch
runtime/ops/mapper/unstructured_npu/npu_adapter.py YOLOX/unstructured inference adaptation (safe operators, structure replacement, PageLayout/from_image rewrite, model loading)
runtime/ops/mapper/unstructured_npu/fusion_result.json NPU graph-fusion statistics output (likely a runtime-generated artifact)
runtime/ops/mapper/unstructured_npu/benchmark_npu.py NPU benchmark script (dependency shielding/deep mocking + unstructured partition calls)
runtime/ops/mapper/data_synthesis/data_synthesizer.py Core medical data synthesis logic (templates, quality checks, repair, and a deterministic fallback)
runtime/ops/mapper/data_synthesis/data_evaluator.py Synthetic-data evaluator (rule-based scoring across 5 dimensions + summarize_accuracy)
runtime/ops/mapper/data_synthesis/requirement_metrics.py Utility functions for computing requirement metrics and judging compliance
runtime/ops/mapper/data_synthesis/test_project_requirements.py Unit tests covering the three templates, the mix ratio, metric compliance, and binary accuracy across the five dimensions
runtime/ops/mapper/data_synthesis/prepare_golden_data.py Script that generates the golden (reference) dataset
runtime/ops/mapper/data_synthesis/verify_evaluator.py Script that validates evaluator consistency against the golden data
runtime/ops/mapper/data_synthesis/run_50_each_test.py Batch generation and statistics script producing 50 samples per category
runtime/ops/mapper/data_synthesis/benchmark_and_visualize.py Synthesis stress-test + visualization report script
runtime/ops/mapper/data_synthesis/final_delivery_part1.py Delivery pipeline script (data generation / metrics CSV / visualization / summary)
runtime/ops/mapper/data_synthesis/download.py ModelScope model download utility (can skip training artifacts)


Comment on lines +1 to +8
{
"session_and_graph_id_0_0": {
"graph_fusion": {
"ARefreshCubeC0FusionPass": {
"effect_times": "1",
"match_times": "1"
},
"Conv2dToConv2dV2FusionPass": {

Copilot AI Mar 30, 2026


This looks like a runtime-generated profiling/log artifact (graph fusion pass statistics). Committing it will create noisy diffs and can bloat the repo; it should typically be ignored (e.g., via .gitignore) or moved to an artifacts/output location rather than source control.
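As a hedged sketch of that cleanup (the first path is the file committed in this PR; the glob pattern is an assumption about how such fusion outputs are named), one could untrack the artifact and ignore future copies:

```shell
# Sketch: stop tracking the generated artifact and ignore future copies.
# Run from the repository root; the glob pattern is an assumption.
set -eu

# Untrack the committed artifact but keep the local file
# (left commented so the sketch is safe to run anywhere):
# git rm --cached runtime/ops/mapper/unstructured_npu/fusion_result.json

# Ignore this artifact and similarly named fusion outputs going forward.
echo 'runtime/ops/mapper/unstructured_npu/fusion_result.json' >> .gitignore
echo 'fusion_result*.json' >> .gitignore
```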

Comment on lines +212 to +220
def evaluate(self, data_list: List[Dict[str, Any]], target_dimensions: Optional[List[str]] = None) -> List[Dict]:
    """
    批量评估入口
    :param data_list: 包含 'content' 字段的字典列表
    :param target_dimensions: 指定要评测的维度,默认全部 7 个
    """
    if target_dimensions is None:
        target_dimensions = list(self.dimension_criteria.keys())


Copilot AI Mar 30, 2026


The evaluate docstring says "默认全部 7 个", but dimension_criteria defines 5 dimensions (and the PR description also mentions 5). Update the docstring to avoid misleading callers about the evaluation surface area.
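One way to keep the docstring honest is not to hard-code the count at all. A minimal sketch (the class name, dimension names, and placeholder scoring below are illustrative assumptions, not the PR's actual `data_evaluator.py` definitions):

```python
from typing import Any, Dict, List, Optional

class DataEvaluator:
    """Illustrative stand-in for the PR's evaluator (names are assumptions)."""

    def __init__(self) -> None:
        # The PR defines 5 rule-based scoring dimensions; these names are
        # placeholders, not the actual criteria from data_evaluator.py.
        self.dimension_criteria: Dict[str, str] = {
            "accuracy": "factually correct",
            "completeness": "covers the question",
            "relevance": "on topic",
            "safety": "no harmful advice",
            "fluency": "readable language",
        }

    def evaluate(self, data_list: List[Dict[str, Any]],
                 target_dimensions: Optional[List[str]] = None) -> List[Dict]:
        """Batch evaluation entry point.

        :param data_list: list of dicts, each with a 'content' field
        :param target_dimensions: dimensions to score; defaults to every key
            in ``self.dimension_criteria``, so this text never drifts from
            the real dimension count
        """
        if target_dimensions is None:
            target_dimensions = list(self.dimension_criteria.keys())
        # Placeholder scoring: a real evaluator would apply rule checks here.
        return [
            {"content": item.get("content", ""), "scored_dimensions": target_dimensions}
            for item in data_list
        ]
```

Phrasing the default in terms of the attribute (rather than a literal "7" or "5") removes the drift the review flagged.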

Comment on lines +23 to +28
parser.add_argument(
    "--cache_dir",
    default="/mnt/nvme0n1/home/pjj/.cache/modelscope",
    help="模型缓存目录(必须可写)"
)
parser.add_argument(

Copilot AI Mar 30, 2026


The default --cache_dir points to a machine/user-specific path (/mnt/nvme0n1/home/pjj/.cache/modelscope). This will fail on other machines and in containers without that mount. Prefer a portable default (e.g., ~/.cache/modelscope or $MODELSCOPE_CACHE) and keep the current value as an example in docs/README instead of a default.
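A portable default along the lines the comment suggests might look like this (sketch only; the precedence of `MODELSCOPE_CACHE` over `~/.cache/modelscope` is an assumption about the deployment, and `build_parser` is a hypothetical helper, not a function from the PR):

```python
import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    # Assumed precedence: explicit --cache_dir flag > MODELSCOPE_CACHE env
    # var > a user-local default that exists on any machine.
    default_cache = os.environ.get(
        "MODELSCOPE_CACHE",
        os.path.expanduser("~/.cache/modelscope"),
    )
    parser = argparse.ArgumentParser(description="Model download utility (sketch)")
    parser.add_argument(
        "--cache_dir",
        default=default_cache,
        help="Model cache directory (must be writable)",
    )
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.cache_dir)
```

The machine-specific `/mnt/nvme0n1/...` path can then live in a README example rather than in the parser default.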

Comment on lines +22 to +34
candidates = [
    os.getenv("MODEL_PATH"),
    "/root/.cache/modelscope/hub/models/Qwen/Qwen3-4B",
    "/work/.cache/modelscope/testUser/Qwen3-1___7b-Medical-R1-sft",
    "/mnt/nvme0n1/home/pjj/.cache/modelscope/testUser/Qwen3-1___7b-Medical-R1-sft",
    "/data/models/Qwen/Qwen2.5-7B-Instruct",
]
for path in candidates:
    if path and os.path.exists(path):
        return path
raise FileNotFoundError("未找到可用模型路径,请设置 MODEL_PATH 或检查本地目录。")



Copilot AI Mar 30, 2026


resolve_model_path() bakes in several environment-specific absolute paths (including personal mount points). This makes the script harder to run outside the original environment. Prefer a single env var (e.g., MODEL_PATH) plus a clearly documented default, and fail fast with a helpful error when not set.

Suggested change
candidates = [
    os.getenv("MODEL_PATH"),
    "/root/.cache/modelscope/hub/models/Qwen/Qwen3-4B",
    "/work/.cache/modelscope/testUser/Qwen3-1___7b-Medical-R1-sft",
    "/mnt/nvme0n1/home/pjj/.cache/modelscope/testUser/Qwen3-1___7b-Medical-R1-sft",
    "/data/models/Qwen/Qwen2.5-7B-Instruct",
]
for path in candidates:
    if path and os.path.exists(path):
        return path
raise FileNotFoundError("未找到可用模型路径,请设置 MODEL_PATH 或检查本地目录。")
"""
解析模型路径:
1. 优先使用环境变量 MODEL_PATH;
2. 若未设置,则尝试使用约定的默认路径(如集群预置模型目录);
3. 若都不可用,则快速失败并给出明确提示。
"""
env_path = os.getenv("MODEL_PATH")
if env_path:
    if os.path.exists(env_path):
        return env_path
    raise FileNotFoundError(
        f"环境变量 MODEL_PATH 已设置为 '{env_path}',但该路径不存在。"
    )
# 单一、约定俗成的默认模型路径(可根据部署环境调整)
default_path = "/data/models/Qwen/Qwen2.5-7B-Instruct"
if os.path.exists(default_path):
    return default_path
raise FileNotFoundError(
    "未找到可用模型路径:请设置环境变量 MODEL_PATH 指向已下载的模型目录,"
    f"或在默认路径存在模型目录:{default_path}"
)

strides = []

for (hsize, wsize), stride in zip(self.hw, self.strides):
    yv, xv = torch.meshgrid([torch.arange(hsize), torch.arange(wsize)])

Copilot AI Mar 30, 2026


torch.meshgrid([torch.arange(hsize), torch.arange(wsize)]) relies on the legacy default indexing behavior and will emit warnings on newer PyTorch; future versions may require an explicit indexing= argument. Specify indexing="ij" (or the intended indexing) to avoid warning spam and forward-compat issues.

Suggested change
yv, xv = torch.meshgrid([torch.arange(hsize), torch.arange(wsize)])
yv, xv = torch.meshgrid(torch.arange(hsize), torch.arange(wsize), indexing="ij")

Comment on lines +3 to +5

# 配置
MODEL_PATH = "/data/models/Qwen/Qwen2.5-7B-Instruct"

Copilot AI Mar 30, 2026


MODEL_PATH is hard-coded to /data/models/Qwen/Qwen2.5-7B-Instruct, which makes the verification script non-portable and likely to fail in CI/other environments. Consider resolving from an env var (with a sensible default) and validating the path exists before initializing the evaluator.

Suggested change
# 配置
MODEL_PATH = "/data/models/Qwen/Qwen2.5-7B-Instruct"
import os
from pathlib import Path
# 配置
MODEL_PATH = os.environ.get("MODEL_PATH", "/data/models/Qwen/Qwen2.5-7B-Instruct")
if not Path(MODEL_PATH).exists():
    raise FileNotFoundError(
        f"MODEL_PATH does not exist: {MODEL_PATH}. "
        "Set the MODEL_PATH environment variable to a valid model directory."
    )

export OMP_NUM_THREADS=1

# 6. Python 路径 (包含当前目录和 YOLOX)
export PYTHONPATH=$(pwd):$(pwd)/YOLOX-main:$PYTHONPATH

Copilot AI Mar 30, 2026


set -u is enabled, but export PYTHONPATH=$(pwd):$(pwd)/YOLOX-main:$PYTHONPATH will abort if PYTHONPATH is unset in the environment. Use a default expansion (e.g., ${PYTHONPATH:-}) to keep the script runnable in clean shells/containers.

Suggested change
export PYTHONPATH=$(pwd):$(pwd)/YOLOX-main:$PYTHONPATH
export PYTHONPATH="$(pwd):$(pwd)/YOLOX-main:${PYTHONPATH:-}"


# 3. 设置 LD_PRELOAD (覆盖式设置,防止重复)
# 注意:jemalloc 必须排在第一位,libgomp 排第二解决 TLS 问题
export LD_PRELOAD="$JEMALLOC:$GOMP"

Copilot AI Mar 30, 2026


LD_PRELOAD is set to include both jemalloc and libgomp, but the script only checks that $JEMALLOC exists. If $GOMP is missing (common across distros/containers), the preload will fail and prevent startup. Add a file existence check for $GOMP (or make it optional) before exporting LD_PRELOAD.

Suggested change
export LD_PRELOAD="$JEMALLOC:$GOMP"
LD_PRELOAD="$JEMALLOC"
if [ -f "$GOMP" ]; then
    LD_PRELOAD="$LD_PRELOAD:$GOMP"
else
    echo "⚠️ Warning: libgomp not found at $GOMP, continuing without it"
fi
export LD_PRELOAD

Comment on lines +152 to +159
try:
    res = torch.cat(tensors, dim=dim)
    torch.npu.synchronize()
    return res
except Exception:
    cpu_tensors = [t.cpu() for t in tensors]
    if not cpu_tensors: return torch.tensor([], device=tensors[0].device)
    return torch.cat(cpu_tensors, dim=dim).to(tensors[0].device)

Copilot AI Mar 30, 2026


safe_cat's exception fallback still dereferences tensors[0] when the input list is empty (if not cpu_tensors: return torch.tensor([], device=tensors[0].device)), which will raise IndexError and mask the original error. Handle the empty-input case without indexing (e.g., return an empty CPU tensor, or accept an explicit device argument).

Suggested change
try:
    res = torch.cat(tensors, dim=dim)
    torch.npu.synchronize()
    return res
except Exception:
    cpu_tensors = [t.cpu() for t in tensors]
    if not cpu_tensors: return torch.tensor([], device=tensors[0].device)
    return torch.cat(cpu_tensors, dim=dim).to(tensors[0].device)
# Handle empty input explicitly to avoid masking errors and out-of-range access
if not tensors:
    # Return an empty CPU tensor; callers can move it to a device if needed
    return torch.tensor([])
try:
    res = torch.cat(tensors, dim=dim)
    torch.npu.synchronize()
    return res
except Exception:
    cpu_tensors = [t.cpu() for t in tensors]
    if not cpu_tensors:
        # Fallback for unexpected empty case in exception path
        return torch.tensor([])
    device = tensors[0].device
    return torch.cat(cpu_tensors, dim=dim).to(device)

Comment on lines +481 to +482
model_path = "/mnt/nvme0n1/pjj-data/data/models/yolox_l.pt"


Copilot AI Mar 30, 2026


npu_get_model falls back to an environment-specific absolute model path (/mnt/nvme0n1/pjj-data/...). This will break on other machines/containers and makes the adapter non-portable. Prefer a required env var/config option for the model path (or search standard cache locations) and raise a clear error if not found.

Suggested change
model_path = "/mnt/nvme0n1/pjj-data/data/models/yolox_l.pt"
env_model_path = os.getenv("NPU_YOLOX_MODEL_PATH")
if env_model_path and os.path.exists(env_model_path):
    model_path = env_model_path
else:
    raise FileNotFoundError(
        "YOLOX model file not found. Expected './yolox_l.pt' in the current directory "
        "or a valid path set via the NPU_YOLOX_MODEL_PATH environment variable."
    )

