Cortex-M backend: lower aten.conv1d to existing quantized_conv2d kernels by rascani · Pull Request #19787 · pytorch/executorch

rascani · 2026-05-26T21:54:30Z

Summary

PyTorch Conv1d produces 3D NCW tensors at the edge dialect level, while CMSIS-NN's arm_convolve_wrapper_s8 only accepts 4D NHWC. Rather than add a dedicated conv1d C++ kernel that would duplicate the quantized conv2d machinery, this lowers conv1d to the existing cortex_m.quantized_conv2d / quantized_depthwise_conv2d ops by wrapping the runtime input in aten.unsqueeze_copy plus dim_order_ops._clone_dim_order(dim_order=[0,2,3,1]) to produce a logical (N, C, 1, W) tensor with NHWC byte layout, AoT-permuting the weight to 4D OHWI (or IHWO for depthwise) with H=1, and unwrapping the output with _clone_dim_order([0,1,2,3]) + squeeze_copy. The dim_order attribute is what gets is_channels_last_tensor to pass at runtime; the physical reorder inside the clone op is what makes CMSIS-NN's raw int8_t* arithmetic read the right bytes.

CortexMConv1DCheck accepts rank-3 patterns with the same activation tails (relu / hardtanh / hardsigmoid / clamp) the conv2d quantizer pattern already supports. ConvertToCortexMPass._lower_conv1d performs the full graph rewrite end-to-end. It inserts the wrap chain, creates the conv2d node, initialises its scratch buffer through the existing required_cmsis_nn_buffer_sizes lookup, and replaces uses of the original convolution node itself, so call()'s standard single-node-replacement tail doesn't need to know about conv1d.

Test plan

Wav2Letter (Conv1d + ReLU + log_softmax stack) collapses entirely; every Conv1d + ReLU pair fuses into a single quantized_conv2d, only log_softmax stays in aten. Test expectations populated.
Silero VAD's six Conv1d ops (1 STFT + 4 encoder + 1 final) all lower: 5 as regular conv2d, 1 as depthwise (the STFT layer has in_channels == groups == 1). Four of the five encoder ReLUs fuse; the remaining unfused ReLU is the post-LSTM F.relu(h), with no Conv1d producer to fuse into.

Op-level unit tests cover regular, depthwise, kernel=1 pointwise, larger kernels with padding, stride > 1, bias, an STFT-shaped in=1 depthwise case, and a Conv1d+ReLU fusion case.

pytorch-bot · 2026-05-26T21:54:33Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19787

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

/easycla is not responding

❌ 2 New Failures, 3 Unrelated Failures

As of commit fabc4c6 with merge base 043c404 ():

NEW FAILURES - The following jobs have failed:

pull / android / build-android (gh)
pull / test-mcu-cortex-m-backend / linux-job (gh)
RuntimeError: Command docker exec -t 77dc024672365a2c32d046335ac58ef285fefb116f65cede453a9127c7f86c90 /exec failed with exit code 1

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-static-llama-qnn-linux (stories_110m) / linux-job (gh) (trunk failure)
test_llama_stories_110m
pull / unittest / macos / macos-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / unittest-editable / macos / macos-job (gh) (trunk failure)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2026-05-26T21:54:38Z

✅ login: rascani / name: RJ Ascani (4a7c185, fabc4c6)
❌ - login: @claude / name: Claude. The commit (4a7c185, fabc4c6) is not authorized under a signed CLA. Please click here to be authorized. For further assistance with EasyCLA, please visit our EasyCLA portal and chat with our support bot.

github-actions · 2026-05-26T21:55:19Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

AdrianLundell · 2026-05-27T07:12:36Z

Just to make sure we have considered all alternatives, it would be a rather small change to change the runtime operator to handle 3D input/output and not have to introduce view-operators which we then need to clean up. On the other hand this is already done in the arm-backend so we should be able to re-use those passes for that rather easily. Do you have any preference?

PyTorch Conv1d produces 3D NCW tensors at the edge dialect level, while CMSIS-NN's `arm_convolve_wrapper_s8` only accepts 4D NHWC. Rather than add a dedicated conv1d C++ kernel that would duplicate the quantized conv2d machinery, this lowers conv1d to the existing cortex_m.quantized_conv2d / quantized_depthwise_conv2d ops by wrapping the runtime input in `aten.unsqueeze_copy` plus `dim_order_ops._clone_dim_order(dim_order=[0,2,3,1])` to produce a logical (N, C, 1, W) tensor with NHWC byte layout, AoT-permuting the weight to 4D OHWI (or IHWO for depthwise) with H=1, and unwrapping the output with `_clone_dim_order([0,1,2,3])` + `squeeze_copy`. The `dim_order` attribute is what gets `is_channels_last_tensor` to pass at runtime; the physical reorder inside the clone op is what makes CMSIS-NN's raw int8_t* arithmetic read the right bytes. New: `CortexMConv1DCheck` accepts rank-3 patterns with the same activation tails (relu / hardtanh / hardsigmoid / clamp) the conv2d quantizer pattern already supports. `ConvertToCortexMPass._lower_conv1d` performs the full graph rewrite end-to-end -- it inserts the wrap chain, creates the conv2d node, initialises its scratch buffer through the existing `required_cmsis_nn_buffer_sizes` lookup, and replaces uses of the original convolution node itself, so `call()`'s standard single-node-replacement tail doesn't need to know about conv1d. For the three target models: * Wav2Letter (Conv1d + ReLU + log_softmax stack) collapses entirely; every Conv1d + ReLU pair fuses into a single quantized_conv2d, only log_softmax stays in aten. Test expectations populated. * Silero VAD's six Conv1d ops (1 STFT + 4 encoder + 1 final) all lower: 5 as regular conv2d, 1 as depthwise (the STFT layer has in_channels == groups == 1). Four of the five encoder ReLUs fuse; the remaining unfused ReLU is the post-LSTM `F.relu(h)`, with no Conv1d producer to fuse into. * YOLO11 is 2D and unaffected. Op-level unit tests cover regular, depthwise, kernel=1 pointwise, larger kernels with padding, stride > 1, bias, an STFT-shaped in=1 depthwise case, and a Conv1d+ReLU fusion case. Follow-ups (intentionally deferred): factor the shared qparams / multiplier / weight-placeholder code between conv1d and conv2d helpers; collapse the 9 activation-tail pattern entries via a helper; add an explicit reject for `groups > 1 && !depthwise` in both the conv1d and existing conv2d paths. Co-authored-by: Claude <noreply@anthropic.com>

rascani · 2026-05-27T17:49:13Z

Just to make sure we have considered all alternatives, it would be a rather small change to change the runtime operator to handle 3D input/output and not have to introduce view-operators which we then need to clean up. On the other hand this is already done in the arm-backend so we should be able to re-use those passes for that rather easily. Do you have any preference?

Having the operator handle 3d i/o would be dependent on the work from #19299, right? We'd still need the NCW -> NWC conversion. That feels like a good north star to me, as there'd be less risk of unnecessary ops in the graph.

I'm not sure what your timeline is for #19299, but I think most of this PR would survive such a transformation. We'd just need to remove the view copy ops insertions. WDYT about moving forward with this and then cleaning up post #19299?

I'm also going to update this PR to switch to view copy ops so I can re-use the existing fuse passes.

Between two consecutive conv1d layers the lowering pass currently emits a 4-op no-op chain: `_clone_dim_order(to NCHW) -> squeeze_copy -> unsqueeze_copy -> _clone_dim_order(to NHWC)`. The two clones are inverse byte reorders and the squeeze/unsqueeze on the same dim cancel logically, so the data flows through unchanged -- but every op still materialises an activation-sized buffer. On Wav2Letter (12 chained Conv1ds) that's ~3 MB of redundant memory traffic per inference, ~2-6% of runtime budget at MCU bandwidths. The cleanup composes two passes: * Switch the conv1d lowering to use `aten.view_copy` (with the explicit shape arg) rather than `aten.unsqueeze_copy` / `aten.squeeze_copy.dims`. This lets `backends/transforms/fuse_view_copy.FuseViewCopyTransform` walk the view_copy <-> view_copy chain (treating the intermediate `_clone_dim_order` as a unary elementwise op it can pass through) and rewrite both view_copies to the same final shape; the noop ones then drop in `remove_noop_view_copy`. * Add a small `FoldInverseDimOrderClonePass` that runs after the view-copy fuse and removes any remaining pair of adjacent `_clone_dim_order` ops whose composed dim_order returns to the input's original dim_order. Net result on Wav2Letter: 60 graph nodes between the convs collapse to zero -- the post-pass IR is a direct `quantized_conv2d -> quantized_conv2d` flow with the wrap/unwrap only at the model boundaries (2 view_copy + 2 _clone_dim_order total, not 24 + 24). Silero VAD sees similar simplification for its encoder block; its remaining wraps are at the STFT/magnitude-spectrum and post-LSTM boundaries where the next op isn't a conv1d. `_lower_conv1d` reads the input shape from `x.args[1]` when the input is a view_copy from a previous conv1d's lowering (its meta["val"] hasn't been repopulated yet at that point in the pass; the explicit shape arg is the source of truth). `aten::view_copy.out` is added to the test runner's selected_ops_list so the on-device tests can resolve the view_copy kernel. Co-authored-by: Claude <noreply@anthropic.com>

AdrianLundell · 2026-05-29T15:33:37Z

Sounds good to me!

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 26, 2026

rascani force-pushed the cortex-m-conv1d-via-reshape branch from 8ff4633 to fabc4c6 Compare May 27, 2026 19:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cortex-M backend: lower aten.conv1d to existing quantized_conv2d kernels#19787

Cortex-M backend: lower aten.conv1d to existing quantized_conv2d kernels#19787
rascani wants to merge 2 commits into
pytorch:mainfrom
rascani:cortex-m-conv1d-via-reshape

rascani commented May 26, 2026

Uh oh!

pytorch-bot Bot commented May 26, 2026 •

edited

Loading

Uh oh!

linux-foundation-easycla Bot commented May 26, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 26, 2026

Uh oh!

AdrianLundell commented May 27, 2026

Uh oh!

rascani commented May 27, 2026 •

edited

Loading

Uh oh!

AdrianLundell commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rascani commented May 26, 2026

Summary

Test plan

Uh oh!

pytorch-bot Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19787

❗ 1 Active SEVs

❌ 2 New Failures, 3 Unrelated Failures

Uh oh!

linux-foundation-easycla Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 26, 2026

This PR needs a release notes: label

Uh oh!

AdrianLundell commented May 27, 2026

Uh oh!

rascani commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AdrianLundell commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pytorch-bot Bot commented May 26, 2026 •

edited

Loading

linux-foundation-easycla Bot commented May 26, 2026 •

edited

Loading

This PR needs a `release notes:` label

rascani commented May 27, 2026 •

edited

Loading