Add more metrics for snapshot and state sync #2879

Merged
yzang2019 merged 8 commits into main from yang/add-metrics-statesync
Feb 27, 2026

@yzang2019
Contributor

@yzang2019 yzang2019 commented Feb 12, 2026

Describe your changes and provide context

This PR adds more visibility into MemIAVL snapshot creation, replay, and pruning, as well as the state sync snapshot creation process.

With these metrics, we should be better able to correlate the timing of performance changes with snapshot behavior.

Testing performed to validate your change

Tested locally and verified that the metrics work.

* main:
  chore: remove wasm dir on unsafe-reset (#2875)
  fix: respect existing genesis file (#2868)
  fix to halt due to reconstructing block from bad proposal (backported #2823) (#2873)
  chore(refactor): drop unused code (#2811)
  made the peer dialing less aggressive (backported #2799) (#2872)
  perf(store): lazy-init `sortedCache` in `cachekv.Store` (#2804)
  feat: embed genesis for well-known chains (#2835)
  fix: use MADV_RANDOM during loadtree (#2857)
@github-actions

github-actions bot commented Feb 12, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Feb 27, 2026, 8:25 AM |

db.logger.Error("failed to prune snapshot", "err", err)
} else {
db.logger.Info("successfully pruned snapshot", "name", name)
otelMetrics.SnapshotPruneCount.Add(context.Background(), 1)
Collaborator

We probably want to measure the failure rate too, right?

In that case, you could use the same metric and tag it by status?

Contributor Author

Good idea!
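
The suggestion above (one metric, tagged by status) can be sketched as follows. This is a minimal, stdlib-only illustration and not the code merged in this PR: `statusCounter`, `recordPrune`, `failureRate`, and `errPruneFailed` are hypothetical names, and the label values "success" and "failure" are assumptions. With the OpenTelemetry Go API, the same idea would typically be expressed by passing a status attribute to the counter's `Add` call, for example via `metric.WithAttributes(attribute.String("status", ...))`.

```go
package main

import (
	"errors"
	"sync"
)

// errPruneFailed is a hypothetical error used only to exercise the failure path.
var errPruneFailed = errors.New("prune failed")

// statusCounter is a hypothetical stand-in for a metrics counter that takes
// one "status" label per increment. It is not the otelMetrics type from the PR.
type statusCounter struct {
	mu     sync.Mutex
	counts map[string]int64
}

func newStatusCounter() *statusCounter {
	return &statusCounter{counts: make(map[string]int64)}
}

// Add increments the series identified by the given status label.
func (c *statusCounter) Add(status string, n int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[status] += n
}

// Get returns the current value of the series for a status label.
func (c *statusCounter) Get(status string) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[status]
}

// recordPrune counts one prune attempt labeled by outcome, so success and
// failure share a single metric name.
func recordPrune(c *statusCounter, err error) {
	status := "success"
	if err != nil {
		status = "failure"
	}
	c.Add(status, 1)
}

// failureRate derives the failure ratio from the two labeled series.
func failureRate(c *statusCounter) float64 {
	ok, failed := c.Get("success"), c.Get("failure")
	if ok+failed == 0 {
		return 0
	}
	return float64(failed) / float64(ok+failed)
}
```

Keeping both outcomes under one metric name means the failure rate can be computed from a single series family on the dashboard side, rather than by joining two separately named counters.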


// writeLeaf sends leaf and KV write operations to the pipeline
func (w *snapshotWriter) writeLeaf(version uint32, key, value, hash []byte) error {
// Track channel fill metrics for all channels
Contributor Author

Removing these since it seems they are not being used

@codecov

codecov bot commented Feb 12, 2026

Codecov Report

❌ Patch coverage is 73.07692% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.43%. Comparing base (f748419) to head (c3601a6).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| sei-db/state_db/sc/memiavl/db.go | 75.00% | 4 Missing and 2 partials ⚠️ |
| sei-cosmos/storev2/rootmulti/store.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files

@@             Coverage Diff             @@
##             main    #2879       +/-   ##
===========================================
+ Coverage   58.13%   68.43%   +10.29%     
===========================================
  Files        2111       24     -2087     
  Lines      173463     3817   -169646     
===========================================
- Hits       100847     2612    -98235     
+ Misses      63665      921    -62744     
+ Partials     8951      284     -8667     
| Flag | Coverage Δ |
| --- | --- |
| sei-chain | 68.27% <73.07%> (+10.17%) ⬆️ |
| sei-db | 69.50% <ø> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
| --- | --- |
| sei-db/state_db/sc/memiavl/metrics.go | 50.00% <ø> (ø) |
| sei-db/state_db/sc/memiavl/multitree.go | 79.22% <100.00%> (+0.06%) ⬆️ |
| sei-db/state_db/sc/memiavl/snapshot.go | 59.37% <ø> (-0.93%) ⬇️ |
| sei-cosmos/storev2/rootmulti/store.go | 46.11% <0.00%> (-0.45%) ⬇️ |
| sei-db/state_db/sc/memiavl/db.go | 66.06% <75.00%> (-0.04%) ⬇️ |

... and 2087 files with indirect coverage changes

}

// catchup the remaining entries in rlog
startTime := time.Now()

Check warning

Code scanning / CodeQL

Calling the system time Warning

Calling the system time may be a possible source of non-determinism
cloned.logger.Info("snapshot rewrite process completed", "duration_sec", totalElapsed, "duration_min", totalElapsed/60)
otelMetrics.SnapshotCreationLatency.Record(
totalRewriteElapsed := time.Since(startTime).Seconds()
cloned.logger.Info("snapshot rewrite process completed", "duration_sec", totalRewriteElapsed, "duration_min", totalRewriteElapsed/60)

Check notice

Code scanning / CodeQL

Floating point arithmetic Note

Floating point arithmetic operations are not associative and a possible source of non-determinism
)),
SnapshotRewriteCount: must(meter.Int64Counter(
"memiavl_snapshot_rewrite_count",
metric.WithDescription("Total num of times memiavl snapshot rewrite attempts"),
Contributor

nit: grammar seems a little weird, how about: "Total number of memiavl snapshot rewrite attempts"

Contributor Author

Yup, that looks better!
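
The metric definitions in the hunk above wrap each constructor call in `must(...)`. The PR page does not show that helper's body, so the following is only a sketch of how such a helper is commonly written with Go generics. `counter` and `newCounter` are hypothetical stand-ins for the real meter constructor, and the description string uses the wording suggested in the review.

```go
package main

import (
	"errors"
	"fmt"
)

// must returns v when err is nil and panics otherwise. The PR calls a helper
// with this name; this generic body is an assumption about how it behaves.
func must[T any](v T, err error) T {
	if err != nil {
		panic(fmt.Sprintf("metric setup failed: %v", err))
	}
	return v
}

// counter and newCounter are hypothetical stand-ins for an instrument and
// its constructor, which returns a value plus an error.
type counter struct {
	name        string
	description string
}

func newCounter(name, description string) (*counter, error) {
	if name == "" {
		return nil, errors.New("counter name must not be empty")
	}
	return &counter{name: name, description: description}, nil
}

// snapshotRewriteCount shows the call shape used in the hunk above: the
// two-value constructor result is passed straight into must.
func snapshotRewriteCount() *counter {
	return must(newCounter(
		"memiavl_snapshot_rewrite_count",
		"Total number of memiavl snapshot rewrite attempts",
	))
}
```

Panicking at setup time is reasonable here only because instrument creation happens once at startup; the same helper would be a poor fit on a request path.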

* main: (66 commits)
  feat(flatkv): include legacyDB in ApplyChangeSets, LtHash, and read path (#2978)
  Deflake mempool tests with Eventually-based block waits (#2983)
  Demote noisy gasless classification log to debug level (#2982)
  Harden `TestStateLock_NoPOL` against proposal/timeout race (#2980)
  added a config parameter to limit outbound p2p connections. (#2974)
  merged unconditional and persistent peers status (#2977)
  Fix race between file pruning and in-flight parquet queries (#2975)
  fix(giga): don't migrate balance on failed txs (#2961)
  Fix hanging upgrade tests by adding timeouts to wait_for_height (#2976)
  Add snapshot import for Giga Live State (#2970)
  Fix Rocksdb MVCC read timestamp lifetime for iterators (#2971)
  Reduce exposed tendermint RPC endpoint (#2968)
  Deflake `TestStateLock_NoPOL` by widening propose timeout in test (#2969)
  go bench read + write receipts/logs for parquet vs pebble (#2794)
  [giga] clear up cache after Write (#2827)
  fix: use correct EVM storage key prefix in benchmark key generation (#2966)
  Harden staking precompile test against CI flakiness (#2967)
  Don't sync flatKV DBs when committing  (#2964)
  Fix flaky `TestStateLock_POLSafety1` (#2962)
  Add metrics for historical proof success/failure rate (#2958)
  ...
@yzang2019 yzang2019 enabled auto-merge (squash) February 27, 2026 08:24
@yzang2019 yzang2019 merged commit 5ad8d47 into main Feb 27, 2026
38 checks passed
@yzang2019 yzang2019 deleted the yang/add-metrics-statesync branch February 27, 2026 08:39
yzang2019 added a commit that referenced this pull request Feb 27, 2026
3 participants