Add Microbenchmarking Infrastructure and CI Using GVSOC CSR #162
Add Microbenchmarking Infrastructure and CI Using GVSOC CSR #162runwangdl wants to merge 2 commits intopulp-platform:develfrom
Conversation
📝 WalkthroughWalkthroughA new performance benchmarking header (perf_utils.h) is introduced, defining utilities to initialize, control, and read hardware performance counters on RISCV/PMSIS platforms. The GEMM implementation is instrumented to capture and report performance metrics during kernel execution. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@TargetLibraries/PULPOpen/inc/perf_utils.h`:
- Around line 1-3: Add the project's SPDX header and copyright notice to the top
of perf_utils.h (same style as Gemm.c): insert an SPDX-License-Identifier line
and the matching copyright/ownership comment block as the file prologue so the
reuse pre-commit hook passes; update the header in
TargetLibraries/PULPOpen/inc/perf_utils.h (top-of-file comment) to match the
project's existing license metadata.
In `@TargetLibraries/PULPOpen/src/Gemm.c`:
- Around line 21-29: The perf counters are started on core 0 before checking for
an early return when M_size==0, so core 0 can return without calling
perf_bench_stop(), leaving counters running; move the early-return check (the
M_size == 0/NUM_CORES logic) to occur before calling perf_bench_init(),
perf_bench_start(), and perf_bench_read, or alternatively ensure that before any
return when core_id==0 you call perf_bench_stop() and perf_bench_read/print
cleanup; update the code around
perf_bench_init()/perf_bench_start()/perf_bench_read() and the early-return
condition (referencing core_id, M_size, NUM_CORES, perf_bench_stop) so core 0
always stops and cleans up counters.
- Line 9: The file TargetLibraries/PULPOpen/src/Gemm.c fails CI due to
clang-format changes; run your formatter or the repository pre-commit hooks to
apply the project's clang-format style (e.g., run clang-format on Gemm.c or
execute the repo's pre-commit script), then stage the modified Gemm.c and update
the PR so the include line and surrounding formatting match the project's style;
ensure the header include "perf_utils.h" and nearby code conform after
formatting.
🧹 Nitpick comments (4)
TargetLibraries/PULPOpen/inc/perf_utils.h (2)
139-157: Addconstqualifiers to read-only pointer parameters.
endandstartare not modified; mark themconstfor safety and self-documentation. Same applies tostatsinperf_bench_print(Line 102).Proposed fix
-static inline void perf_bench_diff(perf_stats_t *result, - perf_stats_t *end, - perf_stats_t *start) { +static inline void perf_bench_diff(perf_stats_t *result, + const perf_stats_t *end, + const perf_stats_t *start) {
102-136: Heavyprintfwith%fformatting on an embedded target.Floating-point
printf(e.g.,%.2f,%.3f) can pull in a large soft-float formatting library, significantly inflating code size on PULP. Consider whether this is acceptable for your target, or switch to scaled-integer printing if code size matters.TargetLibraries/PULPOpen/src/Gemm.c (2)
25-29: Theperf_startsnapshot is redundant — counters were just reset.
perf_bench_start()callspi_perf_reset()thenpi_perf_start(), so all counters are zero. Reading intoperf_startimmediately after will yield near-zero values, makingperf_bench_diffequivalent to just usingperf_end. You can simplify by dropping the start snapshot and the diff, and printingperf_enddirectly.
365-376: Perf measurement captures only core 0's counters and includes non-kernel overhead.The measurement window includes the chunk-setup arithmetic (Lines 31–34) and the
snprintflabel formatting (Lines 372–374), not just the GEMM kernel. On core 0, this is likely negligible noise, but be aware the reported numbers aren't purely kernel cycles. Also, only core 0's counters are captured — if multi-core behavior is of interest, you'd need per-core reporting or a barrier-based approach.
| /* | ||
| * Performance Counter Utilities for PULP Benchmarking | ||
| */ |
There was a problem hiding this comment.
Missing SPDX license header — CI is failing.
The pipeline reports that the reuse pre-commit hook fails because this file has no license identifier or copyright notice. Add an SPDX header consistent with the rest of the project (see Gemm.c for an example).
Proposed fix
-/*
- * Performance Counter Utilities for PULP Benchmarking
- */
+/*
+ * SPDX-FileCopyrightText: 2026 ETH Zurich and University of Bologna
+ *
+ * SPDX-License-Identifier: Apache-2.0
+ *
+ * Performance Counter Utilities for PULP Benchmarking
+ */📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| /* | |
| * Performance Counter Utilities for PULP Benchmarking | |
| */ | |
| /* | |
| * SPDX-FileCopyrightText: 2026 ETH Zurich and University of Bologna | |
| * | |
| * SPDX-License-Identifier: Apache-2.0 | |
| * | |
| * Performance Counter Utilities for PULP Benchmarking | |
| */ |
🧰 Tools
🪛 GitHub Actions: CI • Lint & Licenses
[error] 1-1: pre-commit hook 'reuse' failed: no license identifier and no copyright notice in perf_utils.h.
[error] 1-1: pre-commit hook 'clang-format' modified the file. Please re-run hooks to apply formatting changes.
🤖 Prompt for AI Agents
In `@TargetLibraries/PULPOpen/inc/perf_utils.h` around lines 1 - 3, Add the
project's SPDX header and copyright notice to the top of perf_utils.h (same
style as Gemm.c): insert an SPDX-License-Identifier line and the matching
copyright/ownership comment block as the file prologue so the reuse pre-commit
hook passes; update the header in TargetLibraries/PULPOpen/inc/perf_utils.h
(top-of-file comment) to match the project's existing license metadata.
TargetLibraries/PULPOpen/src/Gemm.c
Outdated
|
|
||
| #include "DeeployPULPMath.h" | ||
| #include "pmsis.h" | ||
| #include "perf_utils.h" |
There was a problem hiding this comment.
CI clang-format failure — run the formatter.
The pipeline reports that clang-format modified this file. Re-run the pre-commit hooks to fix formatting before merge.
🤖 Prompt for AI Agents
In `@TargetLibraries/PULPOpen/src/Gemm.c` at line 9, The file
TargetLibraries/PULPOpen/src/Gemm.c fails CI due to clang-format changes; run
your formatter or the repository pre-commit hooks to apply the project's
clang-format style (e.g., run clang-format on Gemm.c or execute the repo's
pre-commit script), then stage the modified Gemm.c and update the PR so the
include line and surrounding formatting match the project's style; ensure the
header include "perf_utils.h" and nearby code conform after formatting.
TargetLibraries/PULPOpen/src/Gemm.c
Outdated
| // Performance monitoring structures | ||
| perf_stats_t perf_start, perf_end, perf_total; | ||
|
|
||
| // Initialize and start performance counters (only core 0) | ||
| if (core_id == 0) { | ||
| perf_bench_init(); | ||
| perf_bench_start(); | ||
| perf_bench_read(&perf_start); | ||
| } |
There was a problem hiding this comment.
Bug: early return on Line 37 leaves perf counters running on core 0.
If M < NUM_CORES, core 0 may get M_size == 0 and return at Line 37 without ever executing perf_bench_stop() (Line 368). This leaves counters running and skips the cleanup/print block entirely.
Move the early-return check before the perf init, or add stop logic before the return:
Proposed fix (option A — move early return before perf init)
int8_t core_id = pi_core_id();
int8_t log2Core = LOG2(NUM_CORES);
+ uint32_t M_chunk = (M >> log2Core) + ((M & (NUM_CORES - 1)) != 0);
+ uint32_t M_start = MIN(core_id * M_chunk, M);
+ uint32_t M_end = MIN(M_start + M_chunk, M);
+ uint32_t M_size = M_end - M_start;
+
+ if (M_size == 0) {
+ return;
+ }
+
// Performance monitoring structures
perf_stats_t perf_start, perf_end, perf_total;
// Initialize and start performance counters (only core 0)
if (core_id == 0) {
perf_bench_init();
perf_bench_start();
perf_bench_read(&perf_start);
}
- uint32_t M_chunk = (M >> log2Core) + ((M & (NUM_CORES - 1)) != 0);
- uint32_t M_start = MIN(core_id * M_chunk, M);
- uint32_t M_end = MIN(M_start + M_chunk, M);
- uint32_t M_size = M_end - M_start;
-
- if (M_size == 0) {
- return;
- }Also applies to: 36-38
🤖 Prompt for AI Agents
In `@TargetLibraries/PULPOpen/src/Gemm.c` around lines 21 - 29, The perf counters
are started on core 0 before checking for an early return when M_size==0, so
core 0 can return without calling perf_bench_stop(), leaving counters running;
move the early-return check (the M_size == 0/NUM_CORES logic) to occur before
calling perf_bench_init(), perf_bench_start(), and perf_bench_read, or
alternatively ensure that before any return when core_id==0 you call
perf_bench_stop() and perf_bench_read/print cleanup; update the code around
perf_bench_init()/perf_bench_start()/perf_bench_read() and the early-return
condition (referencing core_id, M_size, NUM_CORES, perf_bench_stop) so core 0
always stops and cleans up counters.
a6231d0 to
49cddd2
Compare
Xeratec
left a comment
There was a problem hiding this comment.
As discussed with @runwangdl offline, I am strongly against modifying the kernel libraries to add profiling code to every individual kernel.
Instead, I suggest adding a performance profiling code transformation that can be configured to different modes:
- "kernel": Profile the call to the kernel function (potentially for multiple cores)
- "tile": Profile the function inside the tiling loop (effectively profiling each tile)
- "layer ": Profile each layer from a given memory level
- "network": Profile the full network
Not that kernel in this case will include the small overhead of the function call. This functionality can be merged with the existing profiling infrastructure (probably requires some refactoring, though)
@Victor-Jung What do you think?
Example Artificial Logging Output
Optimally, I would imagine something like the output below, with a hierarchical breakdown of the different function calls and runtime and performance statistics at different levels.
Note that the numbers are artificial.
=== Cycle Breakdown ====================================================================================================
[network] 100855 cyc (100.00%)
- [matmul_node_1] 98876 cyc (58.04%)
- [L3-L2][Tile 0] 38363 cyc, 6070 cyc (15.8%) kernel, 32293 cyc (84.2%) overhead
- [L2-L1][Tile 0] 5642 cyc, 4716 cyc (83.6%) kernel, 926 cyc (16.4%) overhead
- [L3-L2][Tile 1] 24095 cyc, 3599 cyc (14.9%) kernel, 20596 cyc (85.1%) overhead
- [L2-L1][Tile 1] 3180 cyc, 2719 cyc (85.5%) kernel, 461 cyc (14.5%) overhead
- [L3-L2][Tile 2] 24141 cyc, 5881 cyc (24.4%) kernel, 18260 cyc (75.6%) overhead
- [L2-L1][Tile 2] 5526 cyc, 4653 cyc (84.2%) kernel, 873 cyc (15.8%) overhead
- [L3-L2][Tile 3] 11550 cyc, 3598 cyc (31.2%) kernel, 7952 cyc (68.8%) overhead
- [L2-L1][Tile 3] 3213 cyc, 2773 cyc (86.3%) kernel, 440 cyc (13.7%) overhead
- [matmul_node_2] 98876 cyc (38.04%)
- [L3-L2][Tile 0] 38363 cyc, 6070 cyc (15.8%) kernel, 32293 cyc (84.2%) overhead
- [L2-L1][Tile 0] 5642 cyc, 4716 cyc (83.6%) kernel, 926 cyc (16.4%) overhead
- [L3-L2][Tile 1] 24095 cyc, 3599 cyc (14.9%) kernel, 20596 cyc (85.1%) overhead
- [L2-L1][Tile 1] 3180 cyc, 2719 cyc (85.5%) kernel, 461 cyc (14.5%) overhead
========================================================================================================================
=== Statistics [network] ===============================================================================================
- [All Cores]
Cycles : 5642
Operations : 5642
Ops/Cycle : 1.000
- [Core 0]
Cycles : 5642
Instructions : 4716
Ins/Cycle : 0.836
Loads : 5142 (38.73%)
Stores : 148 (1.11%)
Branches : 93 (0.70%)
Taken Branches : 12 (12.90%)
Load Stalls : 402
Jump Stalls : 0
I$ Misses : 647
TCDM Cont. : 0
- [Core 1]
Cycles : 5642
Instructions : 4716
Ins/Cycle : 0.836
Loads : 5142 (38.73%)
Stores : 148 (1.11%)
Branches : 93 (0.70%)
Taken Branches : 12 (12.90%)
Load Stalls : 402
Jump Stalls : 0
I$ Misses : 647
TCDM Cont. : 0
[...]
========================================================================================================================
=== Statistics [matmul_node_1] =========================================================================================
[...]
========================================================================================================================
=== Statistics [matmul_node_1][L3-L2][Tile 0] ==========================================================================
[...]
========================================================================================================================
=== Statistics [matmul_node_1][L2-L1][Tile 0] ==========================================================================
[...]
========================================================================================================================
=== Statistics [matmul_node_1][L2-L1][Tile 0][PULP_MatMul_fp32_fp32_fp32_unroll1x7] ====================================
[...]
========================================================================================================================
This PR adds a CSR-based microbenchmarking mechanism from GVSOC.
Added
Changed
Fixed
PR Merge Checklist
develcommit and pointing todevel.CHANGELOG.mdfile has been updated.