Cortex-M backend: lower aten.conv1d to existing quantized_conv2d kernels#19787
Cortex-M backend: lower aten.conv1d to existing quantized_conv2d kernels#19787rascani wants to merge 2 commits into
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19787
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ❌ 2 New Failures, 3 Unrelated FailuresAs of commit fabc4c6 with merge base 043c404 ( NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
This PR needs a
|
|
Just to make sure we have considered all alternatives, it would be a rather small change to change the runtime operator to handle 3D input/output and not have to introduce view-operators which we then need to clean up. On the other hand this is already done in the arm-backend so we should be able to re-use those passes for that rather easily. Do you have any preference? |
PyTorch Conv1d produces 3D NCW tensors at the edge dialect level, while CMSIS-NN's `arm_convolve_wrapper_s8` only accepts 4D NHWC. Rather than add a dedicated conv1d C++ kernel that would duplicate the quantized conv2d machinery, this lowers conv1d to the existing cortex_m.quantized_conv2d / quantized_depthwise_conv2d ops by wrapping the runtime input in `aten.unsqueeze_copy` plus `dim_order_ops._clone_dim_order(dim_order=[0,2,3,1])` to produce a logical (N, C, 1, W) tensor with NHWC byte layout, AoT-permuting the weight to 4D OHWI (or IHWO for depthwise) with H=1, and unwrapping the output with `_clone_dim_order([0,1,2,3])` + `squeeze_copy`. The `dim_order` attribute is what gets `is_channels_last_tensor` to pass at runtime; the physical reorder inside the clone op is what makes CMSIS-NN's raw int8_t* arithmetic read the right bytes. New: `CortexMConv1DCheck` accepts rank-3 patterns with the same activation tails (relu / hardtanh / hardsigmoid / clamp) the conv2d quantizer pattern already supports. `ConvertToCortexMPass._lower_conv1d` performs the full graph rewrite end-to-end -- it inserts the wrap chain, creates the conv2d node, initialises its scratch buffer through the existing `required_cmsis_nn_buffer_sizes` lookup, and replaces uses of the original convolution node itself, so `call()`'s standard single-node-replacement tail doesn't need to know about conv1d. For the three target models: * Wav2Letter (Conv1d + ReLU + log_softmax stack) collapses entirely; every Conv1d + ReLU pair fuses into a single quantized_conv2d, only log_softmax stays in aten. Test expectations populated. * Silero VAD's six Conv1d ops (1 STFT + 4 encoder + 1 final) all lower: 5 as regular conv2d, 1 as depthwise (the STFT layer has in_channels == groups == 1). Four of the five encoder ReLUs fuse; the remaining unfused ReLU is the post-LSTM `F.relu(h)`, with no Conv1d producer to fuse into. * YOLO11 is 2D and unaffected. Op-level unit tests cover regular, depthwise, kernel=1 pointwise, larger kernels with padding, stride > 1, bias, an STFT-shaped in=1 depthwise case, and a Conv1d+ReLU fusion case. Follow-ups (intentionally deferred): factor the shared qparams / multiplier / weight-placeholder code between conv1d and conv2d helpers; collapse the 9 activation-tail pattern entries via a helper; add an explicit reject for `groups > 1 && !depthwise` in both the conv1d and existing conv2d paths. Co-authored-by: Claude <noreply@anthropic.com>
Having the operator handle 3d i/o would be dependent on the work from #19299, right? We'd still need the NCW -> NWC conversion. That feels like a good north star to me, as there'd be less risk of unnecessary ops in the graph. I'm not sure what your timeline is for #19299, but I think most of this PR would survive such a transformation. We'd just need to remove the view copy ops insertions. WDYT about moving forward with this and then cleaning up post #19299? I'm also going to update this PR to switch to view copy ops so I can re-use the existing fuse passes. |
Between two consecutive conv1d layers the lowering pass currently emits a 4-op no-op chain: `_clone_dim_order(to NCHW) -> squeeze_copy -> unsqueeze_copy -> _clone_dim_order(to NHWC)`. The two clones are inverse byte reorders and the squeeze/unsqueeze on the same dim cancel logically, so the data flows through unchanged -- but every op still materialises an activation-sized buffer. On Wav2Letter (12 chained Conv1ds) that's ~3 MB of redundant memory traffic per inference, ~2-6% of runtime budget at MCU bandwidths. The cleanup composes two passes: * Switch the conv1d lowering to use `aten.view_copy` (with the explicit shape arg) rather than `aten.unsqueeze_copy` / `aten.squeeze_copy.dims`. This lets `backends/transforms/fuse_view_copy.FuseViewCopyTransform` walk the view_copy <-> view_copy chain (treating the intermediate `_clone_dim_order` as a unary elementwise op it can pass through) and rewrite both view_copies to the same final shape; the noop ones then drop in `remove_noop_view_copy`. * Add a small `FoldInverseDimOrderClonePass` that runs after the view-copy fuse and removes any remaining pair of adjacent `_clone_dim_order` ops whose composed dim_order returns to the input's original dim_order. Net result on Wav2Letter: 60 graph nodes between the convs collapse to zero -- the post-pass IR is a direct `quantized_conv2d -> quantized_conv2d` flow with the wrap/unwrap only at the model boundaries (2 view_copy + 2 _clone_dim_order total, not 24 + 24). Silero VAD sees similar simplification for its encoder block; its remaining wraps are at the STFT/magnitude-spectrum and post-LSTM boundaries where the next op isn't a conv1d. `_lower_conv1d` reads the input shape from `x.args[1]` when the input is a view_copy from a previous conv1d's lowering (its meta["val"] hasn't been repopulated yet at that point in the pass; the explicit shape arg is the source of truth). `aten::view_copy.out` is added to the test runner's selected_ops_list so the on-device tests can resolve the view_copy kernel. Co-authored-by: Claude <noreply@anthropic.com>
8ff4633 to
fabc4c6
Compare
|
Sounds good to me! |
Summary
PyTorch Conv1d produces 3D NCW tensors at the edge dialect level, while CMSIS-NN's
arm_convolve_wrapper_s8only accepts 4D NHWC. Rather than add a dedicated conv1d C++ kernel that would duplicate the quantized conv2d machinery, this lowers conv1d to the existing cortex_m.quantized_conv2d / quantized_depthwise_conv2d ops by wrapping the runtime input inaten.unsqueeze_copyplusdim_order_ops._clone_dim_order(dim_order=[0,2,3,1])to produce a logical (N, C, 1, W) tensor with NHWC byte layout, AoT-permuting the weight to 4D OHWI (or IHWO for depthwise) with H=1, and unwrapping the output with_clone_dim_order([0,1,2,3])+squeeze_copy. Thedim_orderattribute is what getsis_channels_last_tensorto pass at runtime; the physical reorder inside the clone op is what makes CMSIS-NN's raw int8_t* arithmetic read the right bytes.CortexMConv1DCheckaccepts rank-3 patterns with the same activation tails (relu / hardtanh / hardsigmoid / clamp) the conv2d quantizer pattern already supports.ConvertToCortexMPass._lower_conv1dperforms the full graph rewrite end-to-end. It inserts the wrap chain, creates the conv2d node, initialises its scratch buffer through the existingrequired_cmsis_nn_buffer_sizeslookup, and replaces uses of the original convolution node itself, socall()'s standard single-node-replacement tail doesn't need to know about conv1d.Test plan
F.relu(h), with no Conv1d producer to fuse into.Op-level unit tests cover regular, depthwise, kernel=1 pointwise, larger kernels with padding, stride > 1, bias, an STFT-shaped in=1 depthwise case, and a Conv1d+ReLU fusion case.