Add more metrics for snapshot and state sync #2879
* main:
  chore: remove wasm dir on unsafe-reset (#2875)
  fix: respect existing genesis file (#2868)
  fix to halt due to reconstructing block from bad proposal (backported #2823) (#2873)
  chore(refactor): drop unused code (#2811)
  made the peer dialing less aggressive (backported #2799) (#2872)
  perf(store): lazy-init `sortedCache` in `cachekv.Store` (#2804)
  feat: embed genesis for well-known chains (#2835)
  fix: use MADV_RANDOM during loadtree (#2857)
sei-db/state_db/sc/memiavl/db.go
Outdated
	db.logger.Error("failed to prune snapshot", "err", err)
} else {
	db.logger.Info("successfully pruned snapshot", "name", name)
	otelMetrics.SnapshotPruneCount.Add(context.Background(), 1)
We probably want to measure the failure rate too, right?
In that case, you can use the same metric and tag it by status.
// writeLeaf sends leaf and KV write operations to the pipeline
func (w *snapshotWriter) writeLeaf(version uint32, key, value, hash []byte) error {
	// Track channel fill metrics for all channels
Removing these since they do not seem to be used.
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##             main    #2879        +/-   ##
============================================
+ Coverage   58.13%   68.43%    +10.29%
============================================
  Files        2111       24      -2087
  Lines      173463     3817    -169646
============================================
- Hits       100847     2612     -98235
+ Misses      63665      921     -62744
+ Partials     8951      284      -8667
}

// catchup the remaining entries in rlog
startTime := time.Now()
Check warning: Code scanning / CodeQL: Calling the system time (Warning)
cloned.logger.Info("snapshot rewrite process completed", "duration_sec", totalElapsed, "duration_min", totalElapsed/60)
otelMetrics.SnapshotCreationLatency.Record(
totalRewriteElapsed := time.Since(startTime).Seconds()
cloned.logger.Info("snapshot rewrite process completed", "duration_sec", totalRewriteElapsed, "duration_min", totalRewriteElapsed/60)
Check notice: Code scanning / CodeQL: Floating point arithmetic (Note)
)),
SnapshotRewriteCount: must(meter.Int64Counter(
	"memiavl_snapshot_rewrite_count",
	metric.WithDescription("Total num of times memiavl snapshot rewrite attempts"),
nit: grammar seems a little weird, how about: "Total number of memiavl snapshot rewrite attempts"
Yup, that looks better!
* main: (66 commits)
  feat(flatkv): include legacyDB in ApplyChangeSets, LtHash, and read path (#2978)
  Deflake mempool tests with Eventually-based block waits (#2983)
  Demote noisy gasless classification log to debug level (#2982)
  Harden `TestStateLock_NoPOL` against proposal/timeout race (#2980)
  added a config parameter to limit outbound p2p connections. (#2974)
  merged unconditional and persistent peers status (#2977)
  Fix race between file pruning and in-flight parquet queries (#2975)
  fix(giga): don't migrate balance on failed txs (#2961)
  Fix hanging upgrade tests by adding timeouts to wait_for_height (#2976)
  Add snapshot import for Giga Live State (#2970)
  Fix Rocksdb MVCC read timestamp lifetime for iterators (#2971)
  Reduce exposed tendermint RPC endpoint (#2968)
  Deflake `TestStateLock_NoPOL` by widening propose timeout in test (#2969)
  go bench read + write receipts/logs for parquet vs pebble (#2794)
  [giga] clear up cache after Write (#2827)
  fix: use correct EVM storage key prefix in benchmark key generation (#2966)
  Harden staking precompile test against CI flakiness (#2967)
  Don't sync flatKV DBs when committing (#2964)
  Fix flaky `TestStateLock_POLSafety1` (#2962)
  Add metrics for historical proof success/failure rate (#2958)
  ...
Describe your changes and provide context
This PR adds more visibility into MemIAVL snapshot creation, replay, and pruning, as well as the state sync snapshot creation process.
With these metrics, we should be better able to correlate timing and performance changes with snapshot behavior.
Testing performed to validate your change
Tested locally and verified that the metrics work.