98 changes: 98 additions & 0 deletions .claude/skills/qualcomm/SKILL.md
@@ -0,0 +1,98 @@
---
name: qualcomm
description: Build, test, or develop the QNN (Qualcomm AI Engine Direct) backend. Use when working on backends/qualcomm/, building QNN (use backends/qualcomm/scripts/build.sh), adding new ops or passes, running QNN delegate tests, or exporting models for Qualcomm HTP/GPU targets.
---

# QNN (Qualcomm AI Engine Direct) Backend

## Advanced Topics

When the user's request falls into one of these areas, read the corresponding file before proceeding:

| Topic | File | When to read |
|---|---|---|
| Export / lowering / quantization options / pass pipelines | `lowering_export.md` | User asks about exporting, lowering, quantization config, QuantDtype, QuantRecipe, pass pipelines |
| New op development | `new_op_development.md` | User asks to add/implement a new op or op builder |
| Model enablement | `model_enablement.md` | User asks to enable a new model end-to-end |
| Profiling & debugging | `profiling.md` | User asks about profiling, optrace, QHAS, QAIRT Visualizer *(file TBD)* |
> **Contributor:** profiling.md is not available in this PR?
>
> **Collaborator (author):** Yes, we will push profiling.md and custom_op.md in a follow-up PR as soon as possible. We have some enhancements in progress for debugging and for the custom-op interface, so those Skill files will land together with that PR.


## Building

Use `backends/qualcomm/scripts/build.sh`. Linux only (macOS not supported).

**Environment variables:**
- `QNN_SDK_ROOT` — path to QNN SDK (auto-downloaded if not set)
- `ANDROID_NDK_ROOT` — path to Android NDK (auto-downloaded if not set)
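
A typical invocation against pre-installed toolchains might look like the following (the paths are placeholders, not pinned versions):

```shell
# Illustrative only: point the build at existing SDK/NDK installs instead of
# letting the script auto-download them. Adjust paths to your setup.
export QNN_SDK_ROOT=/path/to/qnn_sdk
export ANDROID_NDK_ROOT=/path/to/android_ndk
./backends/qualcomm/scripts/build.sh
```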
> **Contributor** (on lines +25 to +26): Should we also add PYTHONPATH and LD_LIBRARY_PATH?
>
> **Collaborator (author):** Let me help him understand this. :)

**Build targets:**

| Target | Default | Build dir |
|---|---|---|
| x86_64 (Python interface + host tools) | enabled | `build-x86/` |
| Android arm64-v8a (device runner) | enabled | `build-android/` |
| Hexagon DSP (direct mode) | disabled | `build-hexagon/` |
| OE Linux embedded | disabled | `build-oe-linux/` |

**Common build commands:**

```bash
# Full build (x86_64 + Android)
./backends/qualcomm/scripts/build.sh

# x86_64 only (faster, for Python interface development)
./backends/qualcomm/scripts/build.sh --skip_linux_android

# Android only (skip x86_64)
./backends/qualcomm/scripts/build.sh --skip_x86_64

# Incremental build (skip clean)
./backends/qualcomm/scripts/build.sh --no_clean

# Enable Hexagon DSP direct mode (requires HEXAGON_SDK_ROOT, HEXAGON_TOOLS_ROOT, DSP_VERSION)
./backends/qualcomm/scripts/build.sh --enable_hexagon

# OE Linux embedded target (requires TOOLCHAIN_ROOT_HOST, TOOLCHAIN_ROOT_TARGET)
./backends/qualcomm/scripts/build.sh --enable_linux_embedded

# Release build
./backends/qualcomm/scripts/build.sh --release

# Control parallelism
./backends/qualcomm/scripts/build.sh --job_number 8
```

**After x86_64 build**, the Python interface `.so` files are copied to `backends/qualcomm/python/` automatically.

## Testing

```bash
QNN_SDK_ROOT=/path/to/qnn_sdk \
ANDROID_NDK_ROOT=/path/to/android_ndk \
LD_LIBRARY_PATH=/path/to/executorch/build-x86/lib:/path/to/qnn_sdk/lib/x86_64-linux-clang \
PYTHONPATH=$(dirname $EXECUTORCH_ROOT) \
python backends/qualcomm/tests/test_qnn_delegate.py \
TestQNNFloatingPointOperator.test_qnn_backend_abs \
-H $HOST -s $DEVICE_SERIAL -m SM8850 -b build-android -a /path/to/artifacts
```

> **Note (build from source):** Set `PYTHONPATH` to the parent directory of the executorch repo root. Required because `executorch.examples.qualcomm` lives in the source tree and is not installed into site-packages.

Required flags: `-m` (SoC model), `-b` (Android build dir). Optional: `-s` (device serial), `-H` (host), `-a` (artifact dir), `-c` (compile only), `-x` (run on x86_64).

**Test classes:**

| Class | Description |
|---|---|
| `TestQNNFloatingPointOperator` | FP16 operator tests |
| `TestQNNQuantizedOperator` | Quantized operator tests |
| `TestQNNFloatingPointModel` | FP16 model-level tests |
| `TestQNNQuantizedModel` | Quantized model-level tests |
| `TestQNNFloatingPointUtils` | FP16 utility tests |
| `TestQNNQuantizedUtils` | Quantized utility tests |
| `TestExampleLLMScript` | LLM script tests |
| `TestExampleMultimodalityScript` | Multimodality script tests |
| `TestExampleOssScript` | OSS model script tests |
| `TestExampleQaihubScript` | QAI Hub script tests |
| `TestExampleScript` | General example script tests |
| `TestUtilsScript` | Utility script tests |
141 changes: 141 additions & 0 deletions .claude/skills/qualcomm/lowering_export.md
@@ -0,0 +1,141 @@
# QNN Lowering / Export

## Common Setup

```python
from executorch.backends.qualcomm.serialization.qc_schema import QnnExecuTorchBackendType
from executorch.backends.qualcomm.utils.utils import (
generate_htp_compiler_spec,
generate_qnn_executorch_compiler_spec,
get_soc_to_chipset_map,
to_edge_transform_and_lower_to_qnn,
)

soc_model = get_soc_to_chipset_map()["SM8650"] # adjust SoC as needed
```

---

## FP16 Export

```python
backend_options = generate_htp_compiler_spec(use_fp16=True)
compiler_specs = generate_qnn_executorch_compiler_spec(
soc_model=soc_model,
backend_options=backend_options,
)
edge_prog_mgr = to_edge_transform_and_lower_to_qnn(model, example_inputs, compiler_specs)
et_program = edge_prog_mgr.to_executorch()
```

---

## Quantized (PTQ) Export

```python
import torch
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
from executorch.backends.qualcomm.quantizer.quantizer import QnnQuantizer

# 1. Export to ATen IR
m = torch.export.export(model.eval(), example_inputs, strict=True).module()

# 2. Prepare for quantization
quantizer = QnnQuantizer(
backend=QnnExecuTorchBackendType.kHtpBackend,
soc_model=soc_model,
)
m = prepare_pt2e(m, quantizer)

# 3. Calibrate
m(*example_inputs)

# 4. Convert
m = convert_pt2e(m)

# 5. Lower to QNN
backend_options = generate_htp_compiler_spec(use_fp16=False)
compiler_specs = generate_qnn_executorch_compiler_spec(
soc_model=soc_model,
backend_options=backend_options,
)
edge_prog_mgr = to_edge_transform_and_lower_to_qnn(m, example_inputs, compiler_specs)
et_program = edge_prog_mgr.to_executorch()
```

---

## Quantized (QAT) Export

Same as PTQ but use `prepare_qat_pt2e` and run a training loop instead of calibration:

```python
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_qat_pt2e

m = prepare_qat_pt2e(m, quantizer)
# training loop
m(*example_inputs)
m = convert_pt2e(m)
# ... same lowering steps as PTQ
```
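
Conceptually, QAT trains through the rounding error that quantization will later introduce. A minimal stdlib-only sketch of the fake-quantize round trip (the idea behind the inserted observers, not the torchao implementation):

```python
# Minimal fake-quantization sketch: quantize to an integer grid, then
# dequantize back to float, so training sees the rounding error.
def fake_quant(x: float, scale: float, zero_point: int, qmin: int, qmax: int) -> float:
    q = round(x / scale) + zero_point   # quantize
    q = max(qmin, min(qmax, q))         # clamp to the integer range
    return (q - zero_point) * scale     # dequantize

# An 8-bit example: values snap to multiples of the scale.
print(fake_quant(0.123, scale=0.1, zero_point=0, qmin=-128, qmax=127))  # 0.1
```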

---

## Quantization Options

| QuantDtype | Activation | Weight |
|---|---|---|
| `use_16a16w` | uint16 | int16 |
| `use_16a8w` | uint16 | int8 |
| `use_16a4w` | uint16 | int4 |
| `use_16a4w_block` | uint16 | int4 (block-wise) |
| `use_8a8w` | uint8 | int8 |
| `use_8a4w` | uint8 | int4 |
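
The names encode activation/weight bit-widths. As a quick reference, the integer ranges behind them can be derived with a one-liner (a stdlib sketch, not a QNN API):

```python
# Integer range for a given bit-width; activations in the table above are
# unsigned, weights signed.
def int_range(bits: int, signed: bool) -> tuple[int, int]:
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

print(int_range(16, signed=False))  # uint16 activations: (0, 65535)
print(int_range(4, signed=True))    # int4 weights: (-8, 7)
```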

**Fine-grained control with QuantRecipe:**

```python
import torch

from executorch.backends.qualcomm.quantizer.quantizer import QuantDtype
from executorch.backends.qualcomm.quantizer.quant_recipe import QuantRecipe, QuantGranularity

recipe = QuantRecipe(quant_dtype=QuantDtype.use_8a8w, is_qat=False)
recipe.add_node_target(targets={torch.ops.aten.linear.default}, quant_dtype=QuantDtype.use_16a8w)
recipe.add_regex(regex={"layers.[0-3].attention"}, quant_dtype=QuantDtype.use_16a4w)
```

---

## Pass Pipelines (QnnPassManager)

| Pipeline | When Called | Key Passes |
|---|---|---|
| `transform_for_annotation_pipeline` | Before `prepare_pt2e` (called internally by `QnnQuantizer`) | RemoveRedundancy, Decompose*, Recompose*, ReplaceInfValues |
| `transform_for_export_pipeline` | After `torch.export` | Decompose*, CanonicalizeConv, LiftConstantScalarOperands |
| `get_to_edge_transform_passes` | Before `to_edge` | AnnotateQuantAttrs, FoldQDQ, LayoutTransform, TagQuantIO, **ResolveDebugHandle (must be last)** |
| `transform_for_preprocess_pipeline` | Inside `QnnBackend.preprocess` | FoldQDQ(force_fold=True), InsertRequantize, InsertIOQDQ, LayoutTransform(insert_permute=True), FuseConsecutiveCast |

---

## Skipping Ops / Partial Delegation

```python
from executorch.backends.qualcomm.utils.utils import skip_annotation

# Skip specific node targets from being delegated
skip_annotation(model, skipped_ops={torch.ops.aten.add.Tensor})
```

---

## Dumping Context Binary

```python
from executorch.backends.qualcomm.utils.utils import dump_context_from_pte

dump_context_from_pte("model.pte", output_dir="./context_bins/")
```

---

## SoC Reference

See `_soc_info_table` in `backends/qualcomm/serialization/qc_schema.py`.
107 changes: 107 additions & 0 deletions .claude/skills/qualcomm/model_enablement.md
@@ -0,0 +1,107 @@
# Model Enablement

Checklist for enabling a new model end-to-end on the QNN backend.

---

## 1. Identify Unsupported Ops

Export the model and check which ops fall back to CPU:

```python
from executorch.backends.qualcomm.utils.utils import capture_program

prog = capture_program(model, example_inputs)
for node in prog.exported_program.graph.nodes:
if node.op == "call_function":
print(node.target.__name__)
```

Or run the full lowering and inspect the partition result — nodes outside the delegate are CPU fallbacks.

For each unsupported op, follow `new_op_development.md`.

---

## 2. Add Export Script

Place the script under `examples/qualcomm/scripts/<model_name>.py`. Use `build_executorch_binary` as the standard entry point:

```python
from executorch.backends.qualcomm.quantizer.quantizer import QuantDtype
from executorch.examples.qualcomm.utils import build_executorch_binary

build_executorch_binary(
model=model,
inputs=example_inputs,
soc_model=args.model,
file_name=f"{args.artifact}/{pte_filename}",
dataset=calibration_data, # None for FP16
quant_dtype=QuantDtype.use_8a8w, # omit for FP16
online_prepare=args.online_prepare,
)
```

For models requiring custom runners, add under `examples/qualcomm/oss_scripts/`.

---

## 3. Verify Delegation

After lowering, confirm the graph is fully delegated:

```python
from executorch.backends.qualcomm.utils.utils import draw_graph

draw_graph("model_graph", prog.exported_program.graph)
```

Expected: all compute nodes inside a single `torch.ops.higher_order.executorch_call_delegate` node. Any remaining `call_function` nodes are CPU fallbacks — investigate and fix.
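
The same check can be scripted instead of eyeballed. A sketch that walks `(op, target)` pairs from an FX-style graph and reports non-delegated compute nodes (the helper name is illustrative, not an executorch API):

```python
# Given (op, target_name) pairs from a lowered graph, list compute nodes
# that did NOT end up inside an executorch_call_delegate call.
def cpu_fallbacks(nodes):
    return [
        target for op, target in nodes
        if op == "call_function" and "executorch_call_delegate" not in target
    ]

graph = [
    ("placeholder", "x"),
    ("call_function", "higher_order.executorch_call_delegate"),
    ("call_function", "aten.softmax.int"),  # fell back to CPU
    ("output", "output"),
]
print(cpu_fallbacks(graph))  # ['aten.softmax.int']
```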

---

## 4. Add Model-Level Tests

In `tests/test_qnn_delegate.py`, add to `TestQNNFloatingPointModel` and/or `TestQNNQuantizedModel`:

```python
def test_qnn_backend_my_model(self):
# setup model and inputs
module = MyModel()
sample_input = (torch.randn(1, 3, 224, 224),)
# lower and test
self.lower_module_and_test_output(module, sample_input)
```

For script-based tests (with artifact dependencies), add to `TestExampleScript` or `TestExampleOssScript`.

---

## 5. Accuracy Validation

Run on device and compare outputs against CPU reference:

```python
import torch

cpu_output = model(*example_inputs)
qnn_output = ...  # load outputs pulled back from device execution

torch.testing.assert_close(qnn_output, cpu_output, rtol=1e-2, atol=1e-2)
```

Typical tolerances:
- FP16: `rtol=1e-2, atol=1e-2`
- INT8 quantized: `rtol=1e-1, atol=1e-1` (accuracy depends on calibration quality)
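
`assert_close` passes when `|actual - expected| <= atol + rtol * |expected|` element-wise. A scalar stdlib sketch of that rule, handy when sanity-checking raw device outputs without torch:

```python
# Mirrors the element-wise closeness rule used by torch.testing.assert_close.
def is_close(actual: float, expected: float, rtol: float, atol: float) -> bool:
    return abs(actual - expected) <= atol + rtol * abs(expected)

print(is_close(1.009, 1.0, rtol=1e-2, atol=1e-2))  # True: within FP16 tolerance
print(is_close(1.5, 1.0, rtol=1e-2, atol=1e-2))    # False: real mismatch
```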

---

## 6. Common Issues

| Symptom | Likely Cause | Fix |
|---|---|---|
| Op falls back to CPU | Missing builder or annotation | Add op builder + quantizer annotation |
| Shape mismatch after layout transform | NHWC/NCHW confusion | Check `LayoutTransform` pass, verify `get_tensor` axis order |
| Quantization accuracy degraded | Poor calibration data | Use representative dataset; try per-channel quantization |
| `KeyError` in `node_visitors` | Builder not registered | Check `builders/__init__.py` import |
| Context binary compile failure | QNN op spec mismatch | Verify IO order and parameter names against `QnnOpDef.h` |
| `online_prepare` vs offline mismatch | Context binary format | Use `--online_prepare` for QAIRT Visualizer; offline for deployment |
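
On the "try per-channel quantization" fix: per-tensor quantization shares one scale across the whole weight, so a single large channel inflates the rounding step for every other channel. A stdlib sketch of why per-channel scales help (illustrative, not the quantizer's implementation):

```python
# Per-channel: one scale per output channel, so small-magnitude channels
# keep a fine rounding step even when another channel is much larger.
def scales_per_channel(weight_rows, qmax=127):
    return [max(abs(v) for v in row) / qmax for row in weight_rows]

w = [[0.01, -0.02], [5.0, -4.0]]  # two channels with very different magnitudes
print(scales_per_channel(w))      # the first channel gets a much finer scale
```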