Skip to content

Cortex-M backend: lower aten.conv1d to existing quantized_conv2d kernels#19787

Draft
rascani wants to merge 2 commits into
pytorch:mainfrom
rascani:cortex-m-conv1d-via-reshape
Draft

Cortex-M backend: lower aten.conv1d to existing quantized_conv2d kernels#19787
rascani wants to merge 2 commits into
pytorch:mainfrom
rascani:cortex-m-conv1d-via-reshape

Conversation

@rascani
Copy link
Copy Markdown
Contributor

@rascani rascani commented May 26, 2026

Summary

PyTorch Conv1d produces 3D NCW tensors at the edge dialect level, while CMSIS-NN's arm_convolve_wrapper_s8 only accepts 4D NHWC. Rather than add a dedicated conv1d C++ kernel that would duplicate the quantized conv2d machinery, this lowers conv1d to the existing cortex_m.quantized_conv2d / quantized_depthwise_conv2d ops by wrapping the runtime input in aten.unsqueeze_copy plus dim_order_ops._clone_dim_order(dim_order=[0,2,3,1]) to produce a logical (N, C, 1, W) tensor with NHWC byte layout, AoT-permuting the weight to 4D OHWI (or IHWO for depthwise) with H=1, and unwrapping the output with _clone_dim_order([0,1,2,3]) + squeeze_copy. The dim_order attribute is what gets is_channels_last_tensor to pass at runtime; the physical reorder inside the clone op is what makes CMSIS-NN's raw int8_t* arithmetic read the right bytes.

CortexMConv1DCheck accepts rank-3 patterns with the same activation tails (relu / hardtanh / hardsigmoid / clamp) the conv2d quantizer pattern already supports. ConvertToCortexMPass._lower_conv1d performs the full graph rewrite end-to-end. It inserts the wrap chain, creates the conv2d node, initialises its scratch buffer through the existing required_cmsis_nn_buffer_sizes lookup, and replaces uses of the original convolution node itself, so call()'s standard single-node-replacement tail doesn't need to know about conv1d.

Test plan

  • Wav2Letter (Conv1d + ReLU + log_softmax stack) collapses entirely; every Conv1d + ReLU pair fuses into a single quantized_conv2d, only log_softmax stays in aten. Test expectations populated.
  • Silero VAD's six Conv1d ops (1 STFT + 4 encoder + 1 final) all lower: 5 as regular conv2d, 1 as depthwise (the STFT layer has in_channels == groups == 1). Four of the five encoder ReLUs fuse; the remaining unfused ReLU is the post-LSTM F.relu(h), with no Conv1d producer to fuse into.

Op-level unit tests cover regular, depthwise, kernel=1 pointwise, larger kernels with padding, stride > 1, bias, an STFT-shaped in=1 depthwise case, and a Conv1d+ReLU fusion case.

@pytorch-bot
Copy link
Copy Markdown

pytorch-bot Bot commented May 26, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19787

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 2 New Failures, 3 Unrelated Failures

As of commit fabc4c6 with merge base 043c404 (image):

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 26, 2026
@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented May 26, 2026

CLA Not Signed

@github-actions
Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@AdrianLundell
Copy link
Copy Markdown
Collaborator

Just to make sure we have considered all alternatives, it would be a rather small change to change the runtime operator to handle 3D input/output and not have to introduce view-operators which we then need to clean up. On the other hand this is already done in the arm-backend so we should be able to re-use those passes for that rather easily. Do you have any preference?

PyTorch Conv1d produces 3D NCW tensors at the edge dialect level, while
CMSIS-NN's `arm_convolve_wrapper_s8` only accepts 4D NHWC. Rather than
add a dedicated conv1d C++ kernel that would duplicate the quantized
conv2d machinery, this lowers conv1d to the existing
cortex_m.quantized_conv2d / quantized_depthwise_conv2d ops by wrapping
the runtime input in `aten.unsqueeze_copy` plus
`dim_order_ops._clone_dim_order(dim_order=[0,2,3,1])` to produce a
logical (N, C, 1, W) tensor with NHWC byte layout, AoT-permuting the
weight to 4D OHWI (or IHWO for depthwise) with H=1, and unwrapping the
output with `_clone_dim_order([0,1,2,3])` + `squeeze_copy`. The `dim_order`
attribute is what gets `is_channels_last_tensor` to pass at runtime; the
physical reorder inside the clone op is what makes CMSIS-NN's raw
int8_t* arithmetic read the right bytes.

New: `CortexMConv1DCheck` accepts rank-3 patterns with the same
activation tails (relu / hardtanh / hardsigmoid / clamp) the conv2d
quantizer pattern already supports. `ConvertToCortexMPass._lower_conv1d`
performs the full graph rewrite end-to-end -- it inserts the wrap
chain, creates the conv2d node, initialises its scratch buffer through
the existing `required_cmsis_nn_buffer_sizes` lookup, and replaces uses
of the original convolution node itself, so `call()`'s standard
single-node-replacement tail doesn't need to know about conv1d.

For the three target models:

* Wav2Letter (Conv1d + ReLU + log_softmax stack) collapses entirely;
  every Conv1d + ReLU pair fuses into a single quantized_conv2d, only
  log_softmax stays in aten. Test expectations populated.
* Silero VAD's six Conv1d ops (1 STFT + 4 encoder + 1 final) all
  lower: 5 as regular conv2d, 1 as depthwise (the STFT layer has
  in_channels == groups == 1). Four of the five encoder ReLUs fuse;
  the remaining unfused ReLU is the post-LSTM `F.relu(h)`, with no
  Conv1d producer to fuse into.
* YOLO11 is 2D and unaffected.

Op-level unit tests cover regular, depthwise, kernel=1 pointwise,
larger kernels with padding, stride > 1, bias, an STFT-shaped
in=1 depthwise case, and a Conv1d+ReLU fusion case.

Follow-ups (intentionally deferred): factor the shared
qparams / multiplier / weight-placeholder code between conv1d and
conv2d helpers; collapse the 9 activation-tail pattern entries via a
helper; add an explicit reject for `groups > 1 && !depthwise` in both
the conv1d and existing conv2d paths.

Co-authored-by: Claude <noreply@anthropic.com>
@rascani
Copy link
Copy Markdown
Contributor Author

rascani commented May 27, 2026

Just to make sure we have considered all alternatives, it would be a rather small change to change the runtime operator to handle 3D input/output and not have to introduce view-operators which we then need to clean up. On the other hand this is already done in the arm-backend so we should be able to re-use those passes for that rather easily. Do you have any preference?

Having the operator handle 3d i/o would be dependent on the work from #19299, right? We'd still need the NCW -> NWC conversion. That feels like a good north star to me, as there'd be less risk of unnecessary ops in the graph.

I'm not sure what your timeline is for #19299, but I think most of this PR would survive such a transformation. We'd just need to remove the view copy ops insertions. WDYT about moving forward with this and then cleaning up post #19299?

I'm also going to update this PR to switch to view copy ops so I can re-use the existing fuse passes.

Between two consecutive conv1d layers the lowering pass currently emits a
4-op no-op chain: `_clone_dim_order(to NCHW) -> squeeze_copy -> unsqueeze_copy
-> _clone_dim_order(to NHWC)`. The two clones are inverse byte reorders and
the squeeze/unsqueeze on the same dim cancel logically, so the data flows
through unchanged -- but every op still materialises an activation-sized
buffer. On Wav2Letter (12 chained Conv1ds) that's ~3 MB of redundant
memory traffic per inference, ~2-6% of runtime budget at MCU bandwidths.

The cleanup composes two passes:

* Switch the conv1d lowering to use `aten.view_copy` (with the explicit
  shape arg) rather than `aten.unsqueeze_copy` / `aten.squeeze_copy.dims`.
  This lets `backends/transforms/fuse_view_copy.FuseViewCopyTransform`
  walk the view_copy <-> view_copy chain (treating the intermediate
  `_clone_dim_order` as a unary elementwise op it can pass through) and
  rewrite both view_copies to the same final shape; the noop ones then
  drop in `remove_noop_view_copy`.
* Add a small `FoldInverseDimOrderClonePass` that runs after the view-copy
  fuse and removes any remaining pair of adjacent `_clone_dim_order` ops
  whose composed dim_order returns to the input's original dim_order.

Net result on Wav2Letter: 60 graph nodes between the convs collapse to
zero -- the post-pass IR is a direct `quantized_conv2d -> quantized_conv2d`
flow with the wrap/unwrap only at the model boundaries (2 view_copy + 2
_clone_dim_order total, not 24 + 24). Silero VAD sees similar simplification
for its encoder block; its remaining wraps are at the STFT/magnitude-spectrum
and post-LSTM boundaries where the next op isn't a conv1d.

`_lower_conv1d` reads the input shape from `x.args[1]` when the input is a
view_copy from a previous conv1d's lowering (its meta["val"] hasn't been
repopulated yet at that point in the pass; the explicit shape arg is the
source of truth).

`aten::view_copy.out` is added to the test runner's selected_ops_list so
the on-device tests can resolve the view_copy kernel.

Co-authored-by: Claude <noreply@anthropic.com>
@rascani rascani force-pushed the cortex-m-conv1d-via-reshape branch from 8ff4633 to fabc4c6 Compare May 27, 2026 19:51
@AdrianLundell
Copy link
Copy Markdown
Collaborator

Sounds good to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants