Skip to content

Add Microbenchmarking Infrastructure and CI Using GVSOC CSR #162

Draft
runwangdl wants to merge 2 commits intopulp-platform:develfrom
runwangdl:microbenchmark
Draft

Add Microbenchmarking Infrastructure and CI Using GVSOC CSR #162
runwangdl wants to merge 2 commits intopulp-platform:develfrom
runwangdl:microbenchmark

Conversation

@runwangdl
Copy link
Contributor

This PR adds a CSR-based microbenchmarking mechanism from GVSOC.

Added

Changed

Fixed

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the docker was modified, change back its link after review.

@runwangdl runwangdl marked this pull request as draft February 12, 2026 12:57
@runwangdl runwangdl self-assigned this Feb 12, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 12, 2026

📝 Walkthrough

Walkthrough

A new performance benchmarking header (perf_utils.h) is introduced, defining utilities to initialize, control, and read hardware performance counters on RISCV/PMSIS platforms. The GEMM implementation is instrumented to capture and report performance metrics during kernel execution.

Changes

Cohort / File(s) Summary
Performance Benchmarking Utilities
TargetLibraries/PULPOpen/inc/perf_utils.h
Introduces perf_stats_t struct to hold 15 performance counter fields (cycles, instructions, loads, stores, stalls, branches, misses, etc.) and five inline helper functions: perf_bench_init(), perf_bench_start(), perf_bench_stop(), perf_bench_read(), and perf_bench_print() for managing and reporting hardware counters; also includes perf_bench_diff() for computing counter deltas.
GEMM Instrumentation
TargetLibraries/PULPOpen/src/Gemm.c
Adds performance counter instrumentation by including perf_utils.h, declaring perf_start/perf_end/perf_total snapshot structs, and calling benchmark functions at function entry and exit (core 0 only) to measure and print labeled performance metrics.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 inconclusive)
Check name Status Explanation Resolution
Description check ❓ Inconclusive The description mentions adding a CSR-based microbenchmarking mechanism but provides no detail about what was actually added; the template sections are empty. Fill in the PR template sections with specifics about what was added (perf_utils.h with helper functions) and how it was integrated (instrumentation in Gemm.c).
Title check ❓ Inconclusive The title mentions adding 'Microbenchmarking Infrastructure' but doesn't specify the key implementation detail that it's based on GVSOC CSR, and doesn't mention the GEMM instrumentation which is the primary concrete change. Consider refining the title to be more specific about the primary change, such as 'Add performance benchmarking utilities and instrument GEMM with GVSOC CSR' to better reflect the main deliverables.
✅ Passed checks (1 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@TargetLibraries/PULPOpen/inc/perf_utils.h`:
- Around line 1-3: Add the project's SPDX header and copyright notice to the top
of perf_utils.h (same style as Gemm.c): insert an SPDX-License-Identifier line
and the matching copyright/ownership comment block as the file prologue so the
reuse pre-commit hook passes; update the header in
TargetLibraries/PULPOpen/inc/perf_utils.h (top-of-file comment) to match the
project's existing license metadata.

In `@TargetLibraries/PULPOpen/src/Gemm.c`:
- Around line 21-29: The perf counters are started on core 0 before checking for
an early return when M_size==0, so core 0 can return without calling
perf_bench_stop(), leaving counters running; move the early-return check (the
M_size == 0/NUM_CORES logic) to occur before calling perf_bench_init(),
perf_bench_start(), and perf_bench_read, or alternatively ensure that before any
return when core_id==0 you call perf_bench_stop() and perf_bench_read/print
cleanup; update the code around
perf_bench_init()/perf_bench_start()/perf_bench_read() and the early-return
condition (referencing core_id, M_size, NUM_CORES, perf_bench_stop) so core 0
always stops and cleans up counters.
- Line 9: The file TargetLibraries/PULPOpen/src/Gemm.c fails CI due to
clang-format changes; run your formatter or the repository pre-commit hooks to
apply the project's clang-format style (e.g., run clang-format on Gemm.c or
execute the repo's pre-commit script), then stage the modified Gemm.c and update
the PR so the include line and surrounding formatting match the project's style;
ensure the header include "perf_utils.h" and nearby code conform after
formatting.
🧹 Nitpick comments (4)
TargetLibraries/PULPOpen/inc/perf_utils.h (2)

139-157: Add const qualifiers to read-only pointer parameters.

end and start are not modified; mark them const for safety and self-documentation. Same applies to stats in perf_bench_print (Line 102).

Proposed fix
-static inline void perf_bench_diff(perf_stats_t *result,
-                                    perf_stats_t *end,
-                                    perf_stats_t *start) {
+static inline void perf_bench_diff(perf_stats_t *result,
+                                    const perf_stats_t *end,
+                                    const perf_stats_t *start) {

102-136: Heavy printf with %f formatting on an embedded target.

Floating-point printf (e.g., %.2f, %.3f) can pull in a large soft-float formatting library, significantly inflating code size on PULP. Consider whether this is acceptable for your target, or switch to scaled-integer printing if code size matters.

TargetLibraries/PULPOpen/src/Gemm.c (2)

25-29: The perf_start snapshot is redundant — counters were just reset.

perf_bench_start() calls pi_perf_reset() then pi_perf_start(), so all counters are zero. Reading into perf_start immediately after will yield near-zero values, making perf_bench_diff equivalent to just using perf_end. You can simplify by dropping the start snapshot and the diff, and printing perf_end directly.


365-376: Perf measurement captures only core 0's counters and includes non-kernel overhead.

The measurement window includes the chunk-setup arithmetic (Lines 31–34) and the snprintf label formatting (Lines 372–374), not just the GEMM kernel. On core 0, this is likely negligible noise, but be aware the reported numbers aren't purely kernel cycles. Also, only core 0's counters are captured — if multi-core behavior is of interest, you'd need per-core reporting or a barrier-based approach.

Comment on lines +1 to +3
/*
* Performance Counter Utilities for PULP Benchmarking
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Missing SPDX license header — CI is failing.

The pipeline reports that the reuse pre-commit hook fails because this file has no license identifier or copyright notice. Add an SPDX header consistent with the rest of the project (see Gemm.c for an example).

Proposed fix
-/*
- * Performance Counter Utilities for PULP Benchmarking
- */
+/*
+ * SPDX-FileCopyrightText: 2026 ETH Zurich and University of Bologna
+ *
+ * SPDX-License-Identifier: Apache-2.0
+ *
+ * Performance Counter Utilities for PULP Benchmarking
+ */
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/*
* Performance Counter Utilities for PULP Benchmarking
*/
/*
* SPDX-FileCopyrightText: 2026 ETH Zurich and University of Bologna
*
* SPDX-License-Identifier: Apache-2.0
*
* Performance Counter Utilities for PULP Benchmarking
*/
🧰 Tools
🪛 GitHub Actions: CI • Lint & Licenses

[error] 1-1: pre-commit hook 'reuse' failed: no license identifier and no copyright notice in perf_utils.h.


[error] 1-1: pre-commit hook 'clang-format' modified the file. Please re-run hooks to apply formatting changes.

🤖 Prompt for AI Agents
In `@TargetLibraries/PULPOpen/inc/perf_utils.h` around lines 1 - 3, Add the
project's SPDX header and copyright notice to the top of perf_utils.h (same
style as Gemm.c): insert an SPDX-License-Identifier line and the matching
copyright/ownership comment block as the file prologue so the reuse pre-commit
hook passes; update the header in TargetLibraries/PULPOpen/inc/perf_utils.h
(top-of-file comment) to match the project's existing license metadata.


#include "DeeployPULPMath.h"
#include "pmsis.h"
#include "perf_utils.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

CI clang-format failure — run the formatter.

The pipeline reports that clang-format modified this file. Re-run the pre-commit hooks to fix formatting before merge.

🤖 Prompt for AI Agents
In `@TargetLibraries/PULPOpen/src/Gemm.c` at line 9, The file
TargetLibraries/PULPOpen/src/Gemm.c fails CI due to clang-format changes; run
your formatter or the repository pre-commit hooks to apply the project's
clang-format style (e.g., run clang-format on Gemm.c or execute the repo's
pre-commit script), then stage the modified Gemm.c and update the PR so the
include line and surrounding formatting match the project's style; ensure the
header include "perf_utils.h" and nearby code conform after formatting.

Comment on lines 21 to 29
// Performance monitoring structures
perf_stats_t perf_start, perf_end, perf_total;

// Initialize and start performance counters (only core 0)
if (core_id == 0) {
perf_bench_init();
perf_bench_start();
perf_bench_read(&perf_start);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Bug: early return on Line 37 leaves perf counters running on core 0.

If M < NUM_CORES, core 0 may get M_size == 0 and return at Line 37 without ever executing perf_bench_stop() (Line 368). This leaves counters running and skips the cleanup/print block entirely.

Move the early-return check before the perf init, or add stop logic before the return:

Proposed fix (option A — move early return before perf init)
   int8_t core_id = pi_core_id();
   int8_t log2Core = LOG2(NUM_CORES);

+  uint32_t M_chunk = (M >> log2Core) + ((M & (NUM_CORES - 1)) != 0);
+  uint32_t M_start = MIN(core_id * M_chunk, M);
+  uint32_t M_end = MIN(M_start + M_chunk, M);
+  uint32_t M_size = M_end - M_start;
+
+  if (M_size == 0) {
+    return;
+  }
+
   // Performance monitoring structures
   perf_stats_t perf_start, perf_end, perf_total;

   // Initialize and start performance counters (only core 0)
   if (core_id == 0) {
     perf_bench_init();
     perf_bench_start();
     perf_bench_read(&perf_start);
   }

-  uint32_t M_chunk = (M >> log2Core) + ((M & (NUM_CORES - 1)) != 0);
-  uint32_t M_start = MIN(core_id * M_chunk, M);
-  uint32_t M_end = MIN(M_start + M_chunk, M);
-  uint32_t M_size = M_end - M_start;
-
-  if (M_size == 0) {
-    return;
-  }

Also applies to: 36-38

🤖 Prompt for AI Agents
In `@TargetLibraries/PULPOpen/src/Gemm.c` around lines 21 - 29, The perf counters
are started on core 0 before checking for an early return when M_size==0, so
core 0 can return without calling perf_bench_stop(), leaving counters running;
move the early-return check (the M_size == 0/NUM_CORES logic) to occur before
calling perf_bench_init(), perf_bench_start(), and perf_bench_read, or
alternatively ensure that before any return when core_id==0 you call
perf_bench_stop() and perf_bench_read/print cleanup; update the code around
perf_bench_init()/perf_bench_start()/perf_bench_read() and the early-return
condition (referencing core_id, M_size, NUM_CORES, perf_bench_stop) so core 0
always stops and cleans up counters.

@runwangdl runwangdl changed the title Add Microbenchmarking Infrastructure Using GVSOC CSR Add Microbenchmarking Infrastructure and CI Using GVSOC CSR Feb 12, 2026
Copy link
Member

@Xeratec Xeratec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed with @runwangdl offline, I am strongly against modifying the kernel libraries to add profiling code to every individual kernel.

Instead, I suggest adding a performance profiling code transformation that can be configured to different modes:

  • "kernel": Profile the call to the kernel function (potentially for multiple cores)
  • "tile": Profile the function inside the tiling loop (effectively profiling each tile)
  • "layer ": Profile each layer from a given memory level
  • "network": Profile the full network

Not that kernel in this case will include the small overhead of the function call. This functionality can be merged with the existing profiling infrastructure (probably requires some refactoring, though)

@Victor-Jung What do you think?

Example Artificial Logging Output

Optimally, I would imagine something like the output below, with a hierarchical breakdown of the different function calls and runtime and performance statistics at different levels.

Note that the numbers are artificial.

=== Cycle Breakdown ====================================================================================================
[network] 100855 cyc (100.00%)
- [matmul_node_1] 98876 cyc (58.04%)
  - [L3-L2][Tile 0]      38363 cyc, 6070 cyc (15.8%) kernel, 32293 cyc (84.2%) overhead
    - [L2-L1][Tile 0]     5642 cyc, 4716 cyc (83.6%) kernel,   926 cyc (16.4%) overhead
  - [L3-L2][Tile 1]      24095 cyc, 3599 cyc (14.9%) kernel, 20596 cyc (85.1%) overhead
    - [L2-L1][Tile 1]     3180 cyc, 2719 cyc (85.5%) kernel,   461 cyc (14.5%) overhead
  - [L3-L2][Tile 2]      24141 cyc, 5881 cyc (24.4%) kernel, 18260 cyc (75.6%) overhead
    - [L2-L1][Tile 2]     5526 cyc, 4653 cyc (84.2%) kernel,   873 cyc (15.8%) overhead
  - [L3-L2][Tile 3]      11550 cyc, 3598 cyc (31.2%) kernel,  7952 cyc (68.8%) overhead
    - [L2-L1][Tile 3]     3213 cyc, 2773 cyc (86.3%) kernel,   440 cyc (13.7%) overhead
- [matmul_node_2] 98876 cyc (38.04%)
  - [L3-L2][Tile 0]      38363 cyc, 6070 cyc (15.8%) kernel, 32293 cyc (84.2%) overhead
    - [L2-L1][Tile 0]     5642 cyc, 4716 cyc (83.6%) kernel,   926 cyc (16.4%) overhead
  - [L3-L2][Tile 1]      24095 cyc, 3599 cyc (14.9%) kernel, 20596 cyc (85.1%) overhead
    - [L2-L1][Tile 1]     3180 cyc, 2719 cyc (85.5%) kernel,   461 cyc (14.5%) overhead
========================================================================================================================

=== Statistics [network] ===============================================================================================
- [All Cores]
    Cycles          : 5642
    Operations      : 5642
    Ops/Cycle       : 1.000
- [Core 0]
    Cycles          : 5642
    Instructions    : 4716
    Ins/Cycle       : 0.836
    Loads           : 5142 (38.73%)
    Stores          :  148 (1.11%)
    Branches        :   93 (0.70%)
    Taken Branches  :   12 (12.90%)
    Load Stalls     : 402
    Jump Stalls     :   0
    I$ Misses       : 647
    TCDM Cont.      :   0
- [Core 1]
    Cycles          : 5642
    Instructions    : 4716
    Ins/Cycle       : 0.836
    Loads           : 5142 (38.73%)
    Stores          :  148 (1.11%)
    Branches        :   93 (0.70%)
    Taken Branches  :   12 (12.90%)
    Load Stalls     : 402
    Jump Stalls     :   0
    I$ Misses       : 647
    TCDM Cont.      :   0
[...]
========================================================================================================================

=== Statistics [matmul_node_1] =========================================================================================
[...]
========================================================================================================================

=== Statistics [matmul_node_1][L3-L2][Tile 0] ==========================================================================
[...]
========================================================================================================================

=== Statistics [matmul_node_1][L2-L1][Tile 0] ==========================================================================
[...]
========================================================================================================================

=== Statistics [matmul_node_1][L2-L1][Tile 0][PULP_MatMul_fp32_fp32_fp32_unroll1x7] ====================================
[...]
========================================================================================================================

@Xeratec Xeratec added the Feature Addition of new features label Feb 13, 2026
@Xeratec Xeratec added this to Deeploy Feb 13, 2026
@Xeratec Xeratec added this to the Release 0.2.2 milestone Feb 13, 2026
@Xeratec Xeratec moved this to In progress in Deeploy Feb 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Feature Addition of new features

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

2 participants