# Qualcomm AI Engine Direct - Add claude skill for qualcomm #18831
**Merged.** abhinaykukkadapu merged 3 commits into `pytorch:main` from `CodeLinaro:dev1/hutton/add_claude_skill_for_qualcomm` on Apr 14, 2026.
---
name: qualcomm
description: Build, test, or develop the QNN (Qualcomm AI Engine Direct) backend. Use when working on backends/qualcomm/, building QNN (use backends/qualcomm/scripts/build.sh), adding new ops or passes, running QNN delegate tests, or exporting models for Qualcomm HTP/GPU targets.
---

# QNN (Qualcomm AI Engine Direct) Backend
## Advanced Topics

When the user's request falls into one of these areas, read the corresponding file before proceeding:

| Topic | File | When to read |
|---|---|---|
| Export / lowering / quantization options / pass pipelines | `lowering_export.md` | User asks about exporting, lowering, quantization config, QuantDtype, QuantRecipe, pass pipelines |
| New op development | `new_op_development.md` | User asks to add/implement a new op or op builder |
| Model enablement | `model_enablement.md` | User asks to enable a new model end-to-end |
| Profiling & debugging | `profiling.md` | User asks about profiling, optrace, QHAS, QAIRT Visualizer *(file TBD)* |

## Building

Use `backends/qualcomm/scripts/build.sh`. Linux only (macOS is not supported).

**Environment variables:**

- `QNN_SDK_ROOT` — path to the QNN SDK (auto-downloaded if not set)
- `ANDROID_NDK_ROOT` — path to the Android NDK (auto-downloaded if not set)
> **Review comment (Contributor), on lines +25 to +26:** Should we also add PYTHON_PATH and LD_LIBRARY_PATH?
>
> **Author reply:** Let me help him understand this. :)
**Build targets:**

| Target | Default | Build dir |
|---|---|---|
| x86_64 (Python interface + host tools) | enabled | `build-x86/` |
| Android arm64-v8a (device runner) | enabled | `build-android/` |
| Hexagon DSP (direct mode) | disabled | `build-hexagon/` |
| OE Linux embedded | disabled | `build-oe-linux/` |

**Common build commands:**

```bash
# Full build (x86_64 + Android)
./backends/qualcomm/scripts/build.sh

# x86_64 only (faster, for Python interface development)
./backends/qualcomm/scripts/build.sh --skip_linux_android

# Android only (skip x86_64)
./backends/qualcomm/scripts/build.sh --skip_x86_64

# Incremental build (skip clean)
./backends/qualcomm/scripts/build.sh --no_clean

# Enable Hexagon DSP direct mode (requires HEXAGON_SDK_ROOT, HEXAGON_TOOLS_ROOT, DSP_VERSION)
./backends/qualcomm/scripts/build.sh --enable_hexagon

# OE Linux embedded target (requires TOOLCHAIN_ROOT_HOST, TOOLCHAIN_ROOT_TARGET)
./backends/qualcomm/scripts/build.sh --enable_linux_embedded

# Release build
./backends/qualcomm/scripts/build.sh --release

# Control parallelism
./backends/qualcomm/scripts/build.sh --job_number 8
```

**After the x86_64 build**, the Python interface `.so` files are copied to `backends/qualcomm/python/` automatically.

## Testing

```bash
QNN_SDK_ROOT=/path/to/qnn_sdk \
ANDROID_NDK_ROOT=/path/to/android_ndk \
LD_LIBRARY_PATH=/path/to/executorch/build-x86/lib:/path/to/qnn_sdk/lib/x86_64-linux-clang \
PYTHONPATH=$(dirname $EXECUTORCH_ROOT) \
python backends/qualcomm/tests/test_qnn_delegate.py \
  TestQNNFloatingPointOperator.test_qnn_backend_abs \
  -H $HOST -s $DEVICE_SERIAL -m SM8850 -b build-android -a /path/to/artifacts
```

> **Note (build from source):** Set `PYTHONPATH` to the parent directory of the executorch repo root. This is required because `executorch.examples.qualcomm` lives in the source tree and is not installed into site-packages.
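To make that note concrete, here is a quick sketch of deriving `PYTHONPATH` from the repo root (the checkout path below is made up; substitute your own):

```shell
# Illustrative checkout location; substitute your actual path.
EXECUTORCH_ROOT=/tmp/demo/executorch
# PYTHONPATH must be the PARENT of the repo root, not the root itself.
PYTHONPATH=$(dirname "$EXECUTORCH_ROOT")
echo "$PYTHONPATH"   # → /tmp/demo
```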
Required flags: `-m` (SoC model), `-b` (Android build dir). Optional: `-s` (device serial), `-H` (host), `-a` (artifact dir), `-c` (compile only), `-x` (run on x86_64).

**Test classes:**

| Class | Description |
|---|---|
| `TestQNNFloatingPointOperator` | FP16 operator tests |
| `TestQNNQuantizedOperator` | Quantized operator tests |
| `TestQNNFloatingPointModel` | FP16 model-level tests |
| `TestQNNQuantizedModel` | Quantized model-level tests |
| `TestQNNFloatingPointUtils` | FP16 utility tests |
| `TestQNNQuantizedUtils` | Quantized utility tests |
| `TestExampleLLMScript` | LLM script tests |
| `TestExampleMultimodalityScript` | Multimodality script tests |
| `TestExampleOssScript` | OSS model script tests |
| `TestExampleQaihubScript` | QAI Hub script tests |
| `TestExampleScript` | General example script tests |
| `TestUtilsScript` | Utility script tests |
# QNN Lowering / Export

## Common Setup

```python
from executorch.backends.qualcomm.serialization.qc_schema import QnnExecuTorchBackendType
from executorch.backends.qualcomm.utils.utils import (
    generate_htp_compiler_spec,
    generate_qnn_executorch_compiler_spec,
    get_soc_to_chipset_map,
    to_edge_transform_and_lower_to_qnn,
)

soc_model = get_soc_to_chipset_map()["SM8650"]  # adjust SoC as needed
```

---

## FP16 Export

```python
backend_options = generate_htp_compiler_spec(use_fp16=True)
compiler_specs = generate_qnn_executorch_compiler_spec(
    soc_model=soc_model,
    backend_options=backend_options,
)
edge_prog_mgr = to_edge_transform_and_lower_to_qnn(model, example_inputs, compiler_specs)
et_program = edge_prog_mgr.to_executorch()
```

---

## Quantized (PTQ) Export

```python
import torch
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
from executorch.backends.qualcomm.quantizer.quantizer import QnnQuantizer

# 1. Export to ATen IR
m = torch.export.export(model.eval(), example_inputs, strict=True).module()

# 2. Prepare for quantization
quantizer = QnnQuantizer(
    backend=QnnExecuTorchBackendType.kHtpBackend,
    soc_model=soc_model,
)
m = prepare_pt2e(m, quantizer)

# 3. Calibrate
m(*example_inputs)

# 4. Convert
m = convert_pt2e(m)

# 5. Lower to QNN
backend_options = generate_htp_compiler_spec(use_fp16=False)
compiler_specs = generate_qnn_executorch_compiler_spec(
    soc_model=soc_model,
    backend_options=backend_options,
)
edge_prog_mgr = to_edge_transform_and_lower_to_qnn(m, example_inputs, compiler_specs)
et_program = edge_prog_mgr.to_executorch()
```
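The single `m(*example_inputs)` call in step 3 is the minimal calibration. In practice you feed a representative dataset through the prepared module; a framework-agnostic sketch of that loop (`prepared_model` below is a stand-in callable, not the real observed module):

```python
def calibrate(prepared_model, dataset):
    # Run every calibration sample through the prepared (observed) model;
    # the quantization observers accumulate statistics as a side effect.
    for sample in dataset:
        prepared_model(*sample)

# Stand-in model that just records how many samples it saw.
seen = []
prepared_model = lambda *xs: seen.append(xs)
calibrate(prepared_model, [(1,), (2,), (3,)])
print(len(seen))  # → 3
```

In the real PTQ flow, `dataset` would be an iterable of input tuples and `prepared_model` the module returned by `prepare_pt2e`.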
---

## Quantized (QAT) Export

Same as PTQ but use `prepare_qat_pt2e` and run a training loop instead of calibration:

```python
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_qat_pt2e

m = prepare_qat_pt2e(m, quantizer)
# training loop
m(*example_inputs)
m = convert_pt2e(m)
# ... same lowering steps as PTQ
```

---

## Quantization Options

| QuantDtype | Activation | Weight |
|---|---|---|
| `use_16a16w` | uint16 | int16 |
| `use_16a8w` | uint16 | int8 |
| `use_16a4w` | uint16 | int4 |
| `use_16a4w_block` | uint16 | int4 (block-wise) |
| `use_8a8w` | uint8 | int8 |
| `use_8a4w` | uint8 | int4 |

**Fine-grained control with QuantRecipe:**

```python
from executorch.backends.qualcomm.quantizer.quant_recipe import QuantRecipe, QuantGranularity

recipe = QuantRecipe(quant_dtype=QuantDtype.use_8a8w, is_qat=False)
recipe.add_node_target(targets={torch.ops.aten.linear.default}, quant_dtype=QuantDtype.use_16a8w)
recipe.add_regex(regex={"layers.[0-3].attention"}, quant_dtype=QuantDtype.use_16a4w)
```

---

## Pass Pipelines (QnnPassManager)

| Pipeline | When Called | Key Passes |
|---|---|---|
| `transform_for_annotation_pipeline` | Before `prepare_pt2e` (called internally by `QnnQuantizer`) | RemoveRedundancy, Decompose*, Recompose*, ReplaceInfValues |
| `transform_for_export_pipeline` | After `torch.export` | Decompose*, CanonicalizeConv, LiftConstantScalarOperands |
| `get_to_edge_transform_passes` | Before `to_edge` | AnnotateQuantAttrs, FoldQDQ, LayoutTransform, TagQuantIO, **ResolveDebugHandle (must be last)** |
| `transform_for_preprocess_pipeline` | Inside `QnnBackend.preprocess` | FoldQDQ(force_fold=True), InsertRequantize, InsertIOQDQ, LayoutTransform(insert_permute=True), FuseConsecutiveCast |

---

## Skipping Ops / Partial Delegation

```python
from executorch.backends.qualcomm.utils.utils import skip_annotation

# Skip specific node targets from being delegated
skip_annotation(model, skipped_ops={torch.ops.aten.add.Tensor})
```

---

## Dumping Context Binary

```python
from executorch.backends.qualcomm.utils.utils import dump_context_from_pte

dump_context_from_pte("model.pte", output_dir="./context_bins/")
```

---

## SoC Reference

See `_soc_info_table` in `backends/qualcomm/serialization/qc_schema.py`.
# Model Enablement

Checklist for enabling a new model end-to-end on the QNN backend.

---

## 1. Identify Unsupported Ops

Export the model and check which ops fall back to CPU:

```python
from executorch.backends.qualcomm.utils.utils import capture_program

prog = capture_program(model, example_inputs)
for node in prog.exported_program.graph.nodes:
    if node.op == "call_function":
        print(node.target.__name__)
```

Or run the full lowering and inspect the partition result — nodes outside the delegate are CPU fallbacks.
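That inspection can be wrapped in a small helper. The sketch below uses stand-in node objects so it runs without a lowered program; with a real program you would pass the graph's node iterator (e.g. `prog.exported_program.graph.nodes`) instead:

```python
from types import SimpleNamespace

def partition_report(nodes):
    """Count delegate calls vs. CPU-fallback call_function nodes."""
    delegated, fallback = 0, []
    for n in nodes:
        if n.op != "call_function":
            continue  # skip placeholders, outputs, etc.
        name = getattr(n.target, "__name__", str(n.target))
        if "executorch_call_delegate" in name:
            delegated += 1
        else:
            fallback.append(name)  # this op stayed on CPU
    return delegated, fallback

# Stand-in nodes mimicking an FX graph's node list.
nodes = [
    SimpleNamespace(op="placeholder", target="x"),
    SimpleNamespace(op="call_function",
                    target=SimpleNamespace(__name__="executorch_call_delegate")),
    SimpleNamespace(op="call_function",
                    target=SimpleNamespace(__name__="aten.add.Tensor")),
    SimpleNamespace(op="output", target="out"),
]
print(partition_report(nodes))  # → (1, ['aten.add.Tensor'])
```

An empty fallback list means the graph is fully delegated; any listed names are the ops to investigate.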
For each unsupported op, follow `new_op_development.md`.

---

## 2. Add Export Script

Place the script under `examples/qualcomm/scripts/<model_name>.py`. Use `build_executorch_binary` as the standard entry point:

```python
from executorch.examples.qualcomm.utils import build_executorch_binary

build_executorch_binary(
    model=model,
    inputs=example_inputs,
    soc_model=args.model,
    file_name=f"{args.artifact}/{pte_filename}",
    dataset=calibration_data,         # None for FP16
    quant_dtype=QuantDtype.use_8a8w,  # omit for FP16
    online_prepare=args.online_prepare,
)
```

For models requiring custom runners, add them under `examples/qualcomm/oss_scripts/`.

---

## 3. Verify Delegation

After lowering, confirm the graph is fully delegated:

```python
from executorch.backends.qualcomm.utils.utils import draw_graph

draw_graph("model_graph", prog.exported_program.graph)
```

Expected: all compute nodes inside a single `torch.ops.higher_order.executorch_call_delegate` node. Any remaining `call_function` nodes are CPU fallbacks — investigate and fix.

---

## 4. Add Model-Level Tests

In `tests/test_qnn_delegate.py`, add to `TestQNNFloatingPointModel` and/or `TestQNNQuantizedModel`:

```python
def test_qnn_backend_my_model(self):
    # set up the model and inputs
    module = MyModel()
    sample_input = (torch.randn(1, 3, 224, 224),)
    # lower and test
    self.lower_module_and_test_output(module, sample_input)
```

For script-based tests (with artifact dependencies), add to `TestExampleScript` or `TestExampleOssScript`.

---

## 5. Accuracy Validation

Run on device and compare outputs against a CPU reference:

```python
import torch

cpu_output = model(*example_inputs)
qnn_output = ...  # load from device execution

torch.testing.assert_close(qnn_output, cpu_output, rtol=1e-2, atol=1e-2)
```

Typical tolerances:
- FP16: `rtol=1e-2, atol=1e-2`
- INT8 quantized: `rtol=1e-1, atol=1e-1` (accuracy depends on calibration quality)
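To keep these tolerances in one place, a tiny helper can map an export mode to `assert_close` keyword arguments (a sketch; the mode labels are illustrative, not an ExecuTorch API):

```python
# Suggested starting tolerances per export mode; tune per model.
TOLERANCES = {
    "fp16": {"rtol": 1e-2, "atol": 1e-2},
    "int8": {"rtol": 1e-1, "atol": 1e-1},
}

def tolerance_for(mode):
    """Return rtol/atol kwargs for torch.testing.assert_close."""
    return TOLERANCES[mode]

print(tolerance_for("int8"))  # → {'rtol': 0.1, 'atol': 0.1}
```

Usage would then look like `torch.testing.assert_close(qnn_output, cpu_output, **tolerance_for("int8"))`.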
---

## 6. Common Issues

| Symptom | Likely Cause | Fix |
|---|---|---|
| Op falls back to CPU | Missing builder or annotation | Add op builder + quantizer annotation |
| Shape mismatch after layout transform | NHWC/NCHW confusion | Check `LayoutTransform` pass, verify `get_tensor` axis order |
| Quantization accuracy degraded | Poor calibration data | Use representative dataset; try per-channel quantization |
| `KeyError` in `node_visitors` | Builder not registered | Check `builders/__init__.py` import |
| Context binary compile failure | QNN op spec mismatch | Verify IO order and parameter names against `QnnOpDef.h` |
| `online_prepare` vs offline mismatch | Context binary format | Use `--online_prepare` for QAIRT Visualizer; offline for deployment |
> **Review comment:** `profiling.md` is not available in this PR?
>
> **Reply:** Yes, we will push `profiling.md` and `custom_op.md` in another PR ASAP. Because we have some enhancements for debugging and the interface of custom op, the skill would be included in their PR.