98 changes: 98 additions & 0 deletions .claude/skills/qualcomm/SKILL.md
@@ -0,0 +1,98 @@
---
name: qualcomm
description: Build, test, or develop the QNN (Qualcomm AI Engine Direct) backend. Use when working on backends/qualcomm/, building QNN (use backends/qualcomm/scripts/build.sh), adding new ops or passes, running QNN delegate tests, or exporting models for Qualcomm HTP/GPU targets.
---

# QNN (Qualcomm AI Engine Direct) Backend

## Advanced Topics

When the user's request falls into one of these areas, read the corresponding file before proceeding:

| Topic | File | When to read |
|---|---|---|
| Export / lowering / quantization options / pass pipelines | `lowering_export.md` | User asks about exporting, lowering, quantization config, QuantDtype, QuantRecipe, pass pipelines |
| New op development | `new_op_development.md` | User asks to add/implement a new op or op builder |
| Model enablement | `model_enablement.md` | User asks to enable a new model end-to-end |
| Profiling & debugging | `profiling.md` | User asks about profiling, optrace, QHAS, QAIRT Visualizer *(file TBD)* |
> **Contributor:** profiling.md is not available in this PR?
>
> **Collaborator (author):** Yes, we will push profiling.md and custom_op.md in a follow-up PR as soon as possible. We have some enhancements in progress for debugging and for the custom-op interface, so those Skill files will land together with that PR.


## Building

Use `backends/qualcomm/scripts/build.sh`. Linux only (macOS not supported).

**Environment variables:**
- `QNN_SDK_ROOT` — path to QNN SDK (auto-downloaded if not set)
- `ANDROID_NDK_ROOT` — path to Android NDK (auto-downloaded if not set)
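
A typical invocation against pre-installed toolchains might look like the following (the paths are placeholders, not pinned versions):

```shell
# Illustrative only: point the build at existing SDK/NDK installs instead of
# letting the script auto-download them. Adjust paths to your setup.
export QNN_SDK_ROOT=/path/to/qnn_sdk
export ANDROID_NDK_ROOT=/path/to/android_ndk
./backends/qualcomm/scripts/build.sh
```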
> **Contributor** (on lines +25 to +26): Should we also add PYTHONPATH and LD_LIBRARY_PATH?
>
> **Collaborator (author):** Let me help him understand this. :)

**Build targets:**

| Target | Default | Build dir |
|---|---|---|
| x86_64 (Python interface + host tools) | enabled | `build-x86/` |
| Android arm64-v8a (device runner) | enabled | `build-android/` |
| Hexagon DSP (direct mode) | disabled | `build-hexagon/` |
| OE Linux embedded | disabled | `build-oe-linux/` |

**Common build commands:**

```bash
# Full build (x86_64 + Android)
./backends/qualcomm/scripts/build.sh

# x86_64 only (faster, for Python interface development)
./backends/qualcomm/scripts/build.sh --skip_linux_android

# Android only (skip x86_64)
./backends/qualcomm/scripts/build.sh --skip_x86_64

# Incremental build (skip clean)
./backends/qualcomm/scripts/build.sh --no_clean

# Enable Hexagon DSP direct mode (requires HEXAGON_SDK_ROOT, HEXAGON_TOOLS_ROOT, DSP_VERSION)
./backends/qualcomm/scripts/build.sh --enable_hexagon

# OE Linux embedded target (requires TOOLCHAIN_ROOT_HOST, TOOLCHAIN_ROOT_TARGET)
./backends/qualcomm/scripts/build.sh --enable_linux_embedded

# Release build
./backends/qualcomm/scripts/build.sh --release

# Control parallelism
./backends/qualcomm/scripts/build.sh --job_number 8
```

**After x86_64 build**, the Python interface `.so` files are copied to `backends/qualcomm/python/` automatically.

## Testing

```bash
QNN_SDK_ROOT=/path/to/qnn_sdk \
ANDROID_NDK_ROOT=/path/to/android_ndk \
LD_LIBRARY_PATH=/path/to/executorch/build-x86/lib:/path/to/qnn_sdk/lib/x86_64-linux-clang \
PYTHONPATH=$(dirname $EXECUTORCH_ROOT) \
python backends/qualcomm/tests/test_qnn_delegate.py \
TestQNNFloatingPointOperator.test_qnn_backend_abs \
-H $HOST -s $DEVICE_SERIAL -m SM8850 -b build-android -a /path/to/artifacts
```

> **Note (build from source):** Set `PYTHONPATH` to the parent directory of the executorch repo root. Required because `executorch.examples.qualcomm` lives in the source tree and is not installed into site-packages.

Required flags: `-m` (SoC model), `-b` (Android build dir). Optional: `-s` (device serial), `-H` (host), `-a` (artifact dir), `-c` (compile only), `-x` (run on x86_64).

**Test classes:**

| Class | Description |
|---|---|
| `TestQNNFloatingPointOperator` | FP16 operator tests |
| `TestQNNQuantizedOperator` | Quantized operator tests |
| `TestQNNFloatingPointModel` | FP16 model-level tests |
| `TestQNNQuantizedModel` | Quantized model-level tests |
| `TestQNNFloatingPointUtils` | FP16 utility tests |
| `TestQNNQuantizedUtils` | Quantized utility tests |
| `TestExampleLLMScript` | LLM script tests |
| `TestExampleMultimodalityScript` | Multimodality script tests |
| `TestExampleOssScript` | OSS model script tests |
| `TestExampleQaihubScript` | QAI Hub script tests |
| `TestExampleScript` | General example script tests |
| `TestUtilsScript` | Utility script tests |
141 changes: 141 additions & 0 deletions .claude/skills/qualcomm/lowering_export.md
@@ -0,0 +1,141 @@
# QNN Lowering / Export

## Common Setup

```python
from executorch.backends.qualcomm.serialization.qc_schema import QnnExecuTorchBackendType
from executorch.backends.qualcomm.utils.utils import (
generate_htp_compiler_spec,
generate_qnn_executorch_compiler_spec,
get_soc_to_chipset_map,
to_edge_transform_and_lower_to_qnn,
)

soc_model = get_soc_to_chipset_map()["SM8650"] # adjust SoC as needed
```

---

## FP16 Export

```python
backend_options = generate_htp_compiler_spec(use_fp16=True)
compiler_specs = generate_qnn_executorch_compiler_spec(
soc_model=soc_model,
backend_options=backend_options,
)
edge_prog_mgr = to_edge_transform_and_lower_to_qnn(model, example_inputs, compiler_specs)
et_program = edge_prog_mgr.to_executorch()
```

---

## Quantized (PTQ) Export

```python
import torch
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
from executorch.backends.qualcomm.quantizer.quantizer import QnnQuantizer

# 1. Export to ATen IR
m = torch.export.export(model.eval(), example_inputs, strict=True).module()

# 2. Prepare for quantization
quantizer = QnnQuantizer(
backend=QnnExecuTorchBackendType.kHtpBackend,
soc_model=soc_model,
)
m = prepare_pt2e(m, quantizer)

# 3. Calibrate
m(*example_inputs)

# 4. Convert
m = convert_pt2e(m)

# 5. Lower to QNN
backend_options = generate_htp_compiler_spec(use_fp16=False)
compiler_specs = generate_qnn_executorch_compiler_spec(
soc_model=soc_model,
backend_options=backend_options,
)
edge_prog_mgr = to_edge_transform_and_lower_to_qnn(m, example_inputs, compiler_specs)
et_program = edge_prog_mgr.to_executorch()
```

---

## Quantized (QAT) Export

Same as PTQ but use `prepare_qat_pt2e` and run a training loop instead of calibration:

```python
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_qat_pt2e

m = prepare_qat_pt2e(m, quantizer)
# training loop
m(*example_inputs)
m = convert_pt2e(m)
# ... same lowering steps as PTQ
```
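
Conceptually, QAT trains through the rounding error that quantization will later introduce. A minimal stdlib-only sketch of the fake-quantize round trip (the idea behind the inserted observers, not the torchao implementation):

```python
# Minimal fake-quantization sketch: quantize to an integer grid, then
# dequantize back to float, so training sees the rounding error.
def fake_quant(x: float, scale: float, zero_point: int, qmin: int, qmax: int) -> float:
    q = round(x / scale) + zero_point   # quantize
    q = max(qmin, min(qmax, q))         # clamp to the integer range
    return (q - zero_point) * scale     # dequantize

# An 8-bit example: values snap to multiples of the scale.
print(fake_quant(0.123, scale=0.1, zero_point=0, qmin=-128, qmax=127))  # 0.1
```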

---

## Quantization Options

| QuantDtype | Activation | Weight |
|---|---|---|
| `use_16a16w` | uint16 | int16 |
| `use_16a8w` | uint16 | int8 |
| `use_16a4w` | uint16 | int4 |
| `use_16a4w_block` | uint16 | int4 (block-wise) |
| `use_8a8w` | uint8 | int8 |
| `use_8a4w` | uint8 | int4 |
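
The names encode activation/weight bit-widths. As a quick reference, the integer ranges behind them can be derived with a one-liner (a stdlib sketch, not a QNN API):

```python
# Integer range for a given bit-width; activations in the table above are
# unsigned, weights signed.
def int_range(bits: int, signed: bool) -> tuple[int, int]:
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

print(int_range(16, signed=False))  # uint16 activations: (0, 65535)
print(int_range(4, signed=True))    # int4 weights: (-8, 7)
```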

**Fine-grained control with QuantRecipe:**

```python
import torch

from executorch.backends.qualcomm.quantizer.quantizer import QuantDtype
from executorch.backends.qualcomm.quantizer.quant_recipe import QuantRecipe, QuantGranularity

recipe = QuantRecipe(quant_dtype=QuantDtype.use_8a8w, is_qat=False)
recipe.add_node_target(targets={torch.ops.aten.linear.default}, quant_dtype=QuantDtype.use_16a8w)
recipe.add_regex(regex={"layers.[0-3].attention"}, quant_dtype=QuantDtype.use_16a4w)
```

---

## Pass Pipelines (QnnPassManager)

| Pipeline | When Called | Key Passes |
|---|---|---|
| `transform_for_annotation_pipeline` | Before `prepare_pt2e` (called internally by `QnnQuantizer`) | RemoveRedundancy, Decompose*, Recompose*, ReplaceInfValues |
| `transform_for_export_pipeline` | After `torch.export` | Decompose*, CanonicalizeConv, LiftConstantScalarOperands |
| `get_to_edge_transform_passes` | Before `to_edge` | AnnotateQuantAttrs, FoldQDQ, LayoutTransform, TagQuantIO, **ResolveDebugHandle (must be last)** |
| `transform_for_preprocess_pipeline` | Inside `QnnBackend.preprocess` | FoldQDQ(force_fold=True), InsertRequantize, InsertIOQDQ, LayoutTransform(insert_permute=True), FuseConsecutiveCast |

---

## Skipping Ops / Partial Delegation

```python
from executorch.backends.qualcomm.utils.utils import skip_annotation

# Skip specific node targets from being delegated
skip_annotation(model, skipped_ops={torch.ops.aten.add.Tensor})
```

---

## Dumping Context Binary

```python
from executorch.backends.qualcomm.utils.utils import dump_context_from_pte

dump_context_from_pte("model.pte", output_dir="./context_bins/")
```

---

## SoC Reference

See `_soc_info_table` in `backends/qualcomm/serialization/qc_schema.py`.
107 changes: 107 additions & 0 deletions .claude/skills/qualcomm/model_enablement.md
@@ -0,0 +1,107 @@
# Model Enablement

Checklist for enabling a new model end-to-end on the QNN backend.

---

## 1. Identify Unsupported Ops

Export the model and check which ops fall back to CPU:

```python
from executorch.backends.qualcomm.utils.utils import capture_program

prog = capture_program(model, example_inputs)
for node in prog.exported_program.graph.nodes:
if node.op == "call_function":
print(node.target.__name__)
```

Or run the full lowering and inspect the partition result — nodes outside the delegate are CPU fallbacks.

For each unsupported op, follow `new_op_development.md`.

---

## 2. Add Export Script

Place the script under `examples/qualcomm/scripts/<model_name>.py`. Use `build_executorch_binary` as the standard entry point:

```python
from executorch.backends.qualcomm.quantizer.quantizer import QuantDtype
from executorch.examples.qualcomm.utils import build_executorch_binary

build_executorch_binary(
model=model,
inputs=example_inputs,
soc_model=args.model,
file_name=f"{args.artifact}/{pte_filename}",
dataset=calibration_data, # None for FP16
quant_dtype=QuantDtype.use_8a8w, # omit for FP16
online_prepare=args.online_prepare,
)
```

For models requiring custom runners, add under `examples/qualcomm/oss_scripts/`.

---

## 3. Verify Delegation

After lowering, confirm the graph is fully delegated:

```python
from executorch.backends.qualcomm.utils.utils import draw_graph

draw_graph("model_graph", prog.exported_program.graph)
```

Expected: all compute nodes inside a single `torch.ops.higher_order.executorch_call_delegate` node. Any remaining `call_function` nodes are CPU fallbacks — investigate and fix.
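
The same check can be scripted instead of eyeballed. A sketch that walks `(op, target)` pairs from an FX-style graph and reports non-delegated compute nodes (the helper name is illustrative, not an executorch API):

```python
# Given (op, target_name) pairs from a lowered graph, list compute nodes
# that did NOT end up inside an executorch_call_delegate call.
def cpu_fallbacks(nodes):
    return [
        target for op, target in nodes
        if op == "call_function" and "executorch_call_delegate" not in target
    ]

graph = [
    ("placeholder", "x"),
    ("call_function", "higher_order.executorch_call_delegate"),
    ("call_function", "aten.softmax.int"),  # fell back to CPU
    ("output", "output"),
]
print(cpu_fallbacks(graph))  # ['aten.softmax.int']
```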

---

## 4. Add Model-Level Tests

In `tests/test_qnn_delegate.py`, add to `TestQNNFloatingPointModel` and/or `TestQNNQuantizedModel`:

```python
def test_qnn_backend_my_model(self):
# setup model and inputs
module = MyModel()
sample_input = (torch.randn(1, 3, 224, 224),)
# lower and test
self.lower_module_and_test_output(module, sample_input)
```

For script-based tests (with artifact dependencies), add to `TestExampleScript` or `TestExampleOssScript`.

---

## 5. Accuracy Validation

Run on device and compare outputs against CPU reference:

```python
import torch

cpu_output = model(*example_inputs)
qnn_output = ...  # load outputs pulled back from device execution

torch.testing.assert_close(qnn_output, cpu_output, rtol=1e-2, atol=1e-2)
```

Typical tolerances:
- FP16: `rtol=1e-2, atol=1e-2`
- INT8 quantized: `rtol=1e-1, atol=1e-1` (accuracy depends on calibration quality)
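
`assert_close` passes when `|actual - expected| <= atol + rtol * |expected|` element-wise. A scalar stdlib sketch of that rule, handy when sanity-checking raw device outputs without torch:

```python
# Mirrors the element-wise closeness rule used by torch.testing.assert_close.
def is_close(actual: float, expected: float, rtol: float, atol: float) -> bool:
    return abs(actual - expected) <= atol + rtol * abs(expected)

print(is_close(1.009, 1.0, rtol=1e-2, atol=1e-2))  # True: within FP16 tolerance
print(is_close(1.5, 1.0, rtol=1e-2, atol=1e-2))    # False: real mismatch
```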

---

## 6. Common Issues

| Symptom | Likely Cause | Fix |
|---|---|---|
| Op falls back to CPU | Missing builder or annotation | Add op builder + quantizer annotation |
| Shape mismatch after layout transform | NHWC/NCHW confusion | Check `LayoutTransform` pass, verify `get_tensor` axis order |
| Quantization accuracy degraded | Poor calibration data | Use representative dataset; try per-channel quantization |
| `KeyError` in `node_visitors` | Builder not registered | Check `builders/__init__.py` import |
| Context binary compile failure | QNN op spec mismatch | Verify IO order and parameter names against `QnnOpDef.h` |
| `online_prepare` vs offline mismatch | Context binary format | Use `--online_prepare` for QAIRT Visualizer; offline for deployment |
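
On the "try per-channel quantization" fix: per-tensor quantization shares one scale across the whole weight, so a single large channel inflates the rounding step for every other channel. A stdlib sketch of why per-channel scales help (illustrative, not the quantizer's implementation):

```python
# Per-channel: one scale per output channel, so small-magnitude channels
# keep a fine rounding step even when another channel is much larger.
def scales_per_channel(weight_rows, qmax=127):
    return [max(abs(v) for v in row) / qmax for row in weight_rows]

w = [[0.01, -0.02], [5.0, -4.0]]  # two channels with very different magnitudes
print(scales_per_channel(w))      # the first channel gets a much finer scale
```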