Conversation
Pull request overview
This PR adds two capabilities: (1) unstructured/YOLOX inference adaptation and benchmark scripts for Ascend NPU; (2) medical data synthesis and synthetic-data quality evaluation (including metric computation and unit tests), intended for data synthesis and acceptance testing.
Changes:
- Adds a YOLOX inference monkey-patch, OCR adaptation, and run/benchmark scripts for the unstructured NPU path.
- Adds a medical data synthesizer (QA/CoT/Preference templates), an evaluator (rule-based evaluation first), requirement-metric computation, and acceptance scripts.
- Adds project-requirement verification unit tests and a model download script.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| runtime/ops/mapper/unstructured_npu/run.sh | Ascend NPU run entry script (LD_PRELOAD/jemalloc environment setup + benchmark invocation) |
| runtime/ops/mapper/unstructured_npu/ocr_npu_adapter.py | PaddleOCR CPU-isolated subprocess + pytesseract module-injection patch |
| runtime/ops/mapper/unstructured_npu/npu_adapter.py | YOLOX/Unstructured inference adaptation (safe operators, structure replacement, PageLayout/from_image rewrites, model loading) |
| runtime/ops/mapper/unstructured_npu/fusion_result.json | NPU graph-fusion statistics output (likely a runtime artifact) |
| runtime/ops/mapper/unstructured_npu/benchmark_npu.py | NPU benchmark script (dependency shielding / deep mocking + unstructured partition calls) |
| runtime/ops/mapper/data_synthesis/data_synthesizer.py | Core medical data synthesis logic (templates, quality checks, repair, and a deterministic fallback) |
| runtime/ops/mapper/data_synthesis/data_evaluator.py | Synthetic-data evaluator (rule-based scoring across 5 dimensions + summarize_accuracy) |
| runtime/ops/mapper/data_synthesis/requirement_metrics.py | Utility functions for computing requirement metrics and judging compliance |
| runtime/ops/mapper/data_synthesis/test_project_requirements.py | Unit tests covering the three templates, the mixing ratio, metric compliance, and binary accuracy on the five dimensions |
| runtime/ops/mapper/data_synthesis/prepare_golden_data.py | Script that generates the golden dataset |
| runtime/ops/mapper/data_synthesis/verify_evaluator.py | Script that validates evaluator consistency against the golden data |
| runtime/ops/mapper/data_synthesis/run_50_each_test.py | Batch generation and statistics script (50 samples per category) |
| runtime/ops/mapper/data_synthesis/benchmark_and_visualize.py | Synthesis benchmark + visualization report script |
| runtime/ops/mapper/data_synthesis/final_delivery_part1.py | Delivery pipeline script (data generation / metrics CSV / visualization / summary) |
| runtime/ops/mapper/data_synthesis/download.py | ModelScope model download utility (can ignore training artifacts) |
```json
{
    "session_and_graph_id_0_0": {
        "graph_fusion": {
            "ARefreshCubeC0FusionPass": {
                "effect_times": "1",
                "match_times": "1"
            },
            "Conv2dToConv2dV2FusionPass": {
```
This looks like a runtime-generated profiling/log artifact (graph fusion pass statistics). Committing it will create noisy diffs and can bloat the repo; it should typically be ignored (e.g., via .gitignore) or moved to an artifacts/output location rather than source control.
```python
    def evaluate(self, data_list: List[Dict[str, Any]], target_dimensions: Optional[List[str]] = None) -> List[Dict]:
        """
        批量评估入口
        :param data_list: 包含 'content' 字段的字典列表
        :param target_dimensions: 指定要评测的维度,默认全部 7 个
        """
        if target_dimensions is None:
            target_dimensions = list(self.dimension_criteria.keys())
```
The evaluate docstring says "默认全部 7 个", but dimension_criteria defines 5 dimensions (and the PR description also mentions 5). Update the docstring to avoid misleading callers about the evaluation surface area.
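A minimal, self-contained sketch of how the docstring could be kept from drifting out of sync with `dimension_criteria` — the class name, dimension names, and scoring stub below are hypothetical illustrations, not the PR's actual evaluator:

```python
from typing import Any, Dict, List, Optional


class RuleEvaluator:
    """Hypothetical stand-in for the PR's evaluator, used to illustrate the fix."""

    # Assumption: five rule-based dimensions, as the PR description states.
    dimension_criteria = {
        "accuracy": None,
        "completeness": None,
        "relevance": None,
        "safety": None,
        "fluency": None,
    }

    def evaluate(self, data_list: List[Dict[str, Any]],
                 target_dimensions: Optional[List[str]] = None) -> List[Dict]:
        """
        Batch evaluation entry point.

        :param data_list: list of dicts, each carrying a 'content' field
        :param target_dimensions: dimensions to score; defaults to ALL keys of
            dimension_criteria (currently 5), so the docstring describes the
            source of truth rather than a hard-coded count.
        """
        if target_dimensions is None:
            target_dimensions = list(self.dimension_criteria.keys())
        # Stub scoring: echo which dimensions would be evaluated per item.
        return [{"content": d.get("content"), "dimensions": target_dimensions}
                for d in data_list]
```

Phrasing the default in terms of `dimension_criteria` means adding or removing a dimension cannot silently invalidate the docstring again.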
```python
    parser.add_argument(
        "--cache_dir",
        default="/mnt/nvme0n1/home/pjj/.cache/modelscope",
        help="模型缓存目录(必须可写)"
    )
    parser.add_argument(
```
The default --cache_dir points to a machine/user-specific path (/mnt/nvme0n1/home/pjj/.cache/modelscope). This will fail on other machines and in containers without that mount. Prefer a portable default (e.g., ~/.cache/modelscope or $MODELSCOPE_CACHE) and keep the current value as an example in docs/README instead of a default.
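One possible portable default is sketched below; the `MODELSCOPE_CACHE` env-var precedence and the `build_parser` helper are assumptions for illustration, not the script's current API:

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    # Portable default: honor MODELSCOPE_CACHE if set, otherwise fall back to
    # the conventional per-user cache directory under ~/.cache/modelscope.
    default_cache = os.environ.get(
        "MODELSCOPE_CACHE",
        os.path.join(os.path.expanduser("~"), ".cache", "modelscope"),
    )
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cache_dir",
        default=default_cache,
        help="Model cache directory (must be writable)",
    )
    return parser
```

The machine-specific `/mnt/nvme0n1/...` path can then live in the README as an example invocation rather than as a baked-in default.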
```python
    candidates = [
        os.getenv("MODEL_PATH"),
        "/root/.cache/modelscope/hub/models/Qwen/Qwen3-4B",
        "/work/.cache/modelscope/testUser/Qwen3-1___7b-Medical-R1-sft",
        "/mnt/nvme0n1/home/pjj/.cache/modelscope/testUser/Qwen3-1___7b-Medical-R1-sft",
        "/data/models/Qwen/Qwen2.5-7B-Instruct",
    ]
    for path in candidates:
        if path and os.path.exists(path):
            return path
    raise FileNotFoundError("未找到可用模型路径,请设置 MODEL_PATH 或检查本地目录。")
```
resolve_model_path() bakes in several environment-specific absolute paths (including personal mount points). This makes the script harder to run outside the original environment. Prefer a single env var (e.g., MODEL_PATH) plus a clearly documented default, and fail fast with a helpful error when not set.
Suggested change:

```python
    """
    解析模型路径:
    1. 优先使用环境变量 MODEL_PATH;
    2. 若未设置,则尝试使用约定的默认路径(如集群预置模型目录);
    3. 若都不可用,则快速失败并给出明确提示。
    """
    env_path = os.getenv("MODEL_PATH")
    if env_path:
        if os.path.exists(env_path):
            return env_path
        raise FileNotFoundError(
            f"环境变量 MODEL_PATH 已设置为 '{env_path}',但该路径不存在。"
        )
    # 单一、约定俗成的默认模型路径(可根据部署环境调整)
    default_path = "/data/models/Qwen/Qwen2.5-7B-Instruct"
    if os.path.exists(default_path):
        return default_path
    raise FileNotFoundError(
        "未找到可用模型路径:请设置环境变量 MODEL_PATH 指向已下载的模型目录,"
        f"或在默认路径存在模型目录:{default_path}"
    )
```
```python
        strides = []

        for (hsize, wsize), stride in zip(self.hw, self.strides):
            yv, xv = torch.meshgrid([torch.arange(hsize), torch.arange(wsize)])
```
torch.meshgrid([torch.arange(hsize), torch.arange(wsize)]) relies on the legacy default indexing behavior and will emit warnings on newer PyTorch; future versions may require an explicit indexing= argument. Specify indexing="ij" (or the intended indexing) to avoid warning spam and forward-compat issues.
Suggested change:

```python
            yv, xv = torch.meshgrid(torch.arange(hsize), torch.arange(wsize), indexing="ij")
```
```python
# 配置
MODEL_PATH = "/data/models/Qwen/Qwen2.5-7B-Instruct"
```
MODEL_PATH is hard-coded to /data/models/Qwen/Qwen2.5-7B-Instruct, which makes the verification script non-portable and likely to fail in CI/other environments. Consider resolving from an env var (with a sensible default) and validating the path exists before initializing the evaluator.
Suggested change:

```python
import os
from pathlib import Path

# 配置
MODEL_PATH = os.environ.get("MODEL_PATH", "/data/models/Qwen/Qwen2.5-7B-Instruct")
if not Path(MODEL_PATH).exists():
    raise FileNotFoundError(
        f"MODEL_PATH does not exist: {MODEL_PATH}. "
        "Set the MODEL_PATH environment variable to a valid model directory."
    )
```
```bash
export OMP_NUM_THREADS=1

# 6. Python 路径 (包含当前目录和 YOLOX)
export PYTHONPATH=$(pwd):$(pwd)/YOLOX-main:$PYTHONPATH
```
set -u is enabled, but export PYTHONPATH=$(pwd):$(pwd)/YOLOX-main:$PYTHONPATH will abort if PYTHONPATH is unset in the environment. Use a default expansion (e.g., ${PYTHONPATH:-}) to keep the script runnable in clean shells/containers.
Suggested change:

```bash
export PYTHONPATH="$(pwd):$(pwd)/YOLOX-main:${PYTHONPATH:-}"
```
```bash
# 3. 设置 LD_PRELOAD (覆盖式设置,防止重复)
# 注意:jemalloc 必须排在第一位,libgomp 排第二解决 TLS 问题
export LD_PRELOAD="$JEMALLOC:$GOMP"
```
LD_PRELOAD is set to include both jemalloc and libgomp, but the script only checks that $JEMALLOC exists. If $GOMP is missing (common across distros/containers), the preload will fail and prevent startup. Add a file existence check for $GOMP (or make it optional) before exporting LD_PRELOAD.
Suggested change:

```bash
LD_PRELOAD="$JEMALLOC"
if [ -f "$GOMP" ]; then
    LD_PRELOAD="$LD_PRELOAD:$GOMP"
else
    echo "⚠️ Warning: libgomp not found at $GOMP, continuing without it"
fi
export LD_PRELOAD
```
```python
    try:
        res = torch.cat(tensors, dim=dim)
        torch.npu.synchronize()
        return res
    except Exception:
        cpu_tensors = [t.cpu() for t in tensors]
        if not cpu_tensors: return torch.tensor([], device=tensors[0].device)
        return torch.cat(cpu_tensors, dim=dim).to(tensors[0].device)
```
safe_cat's exception fallback still dereferences tensors[0] when the input list is empty (if not cpu_tensors: return torch.tensor([], device=tensors[0].device)), which will raise IndexError and mask the original error. Handle the empty-input case without indexing (e.g., return an empty CPU tensor, or accept an explicit device argument).
Suggested change:

```python
    # Handle empty input explicitly to avoid masking errors and out-of-range access
    if not tensors:
        # Return an empty CPU tensor; callers can move it to a device if needed
        return torch.tensor([])
    try:
        res = torch.cat(tensors, dim=dim)
        torch.npu.synchronize()
        return res
    except Exception:
        cpu_tensors = [t.cpu() for t in tensors]
        if not cpu_tensors:
            # Fallback for unexpected empty case in exception path
            return torch.tensor([])
        device = tensors[0].device
        return torch.cat(cpu_tensors, dim=dim).to(device)
```
```python
    model_path = "/mnt/nvme0n1/pjj-data/data/models/yolox_l.pt"
```
npu_get_model falls back to an environment-specific absolute model path (/mnt/nvme0n1/pjj-data/...). This will break on other machines/containers and makes the adapter non-portable. Prefer a required env var/config option for the model path (or search standard cache locations) and raise a clear error if not found.
Suggested change:

```python
    env_model_path = os.getenv("NPU_YOLOX_MODEL_PATH")
    if env_model_path and os.path.exists(env_model_path):
        model_path = env_model_path
    else:
        raise FileNotFoundError(
            "YOLOX model file not found. Expected './yolox_l.pt' in the current directory "
            "or a valid path set via the NPU_YOLOX_MODEL_PATH environment variable."
        )
```
Data synthesis and synthetic-data quality evaluation code