[feat](cloud) Add system rate limit for meta-service #61516
wyxxxcat wants to merge 4 commits into apache:master
TPC-H: Total hot run time: 26994 ms
TPC-DS: Total hot run time: 168584 ms
TPC-H: Total hot run time: 26998 ms
TPC-DS: Total hot run time: 168170 ms
bobhan1 left a comment
Review Summary
The core design (three-dimensional stress detection + sliding window + lock-free decision reads) is sound and practical. Main risks are around BVAR_FDB_INVALID_VALUE false-positive triggering and DEFER/closure interaction in the RPC macro. Recommend fixing these before merging.
| Severity | Count | Details |
|---|---|---|
| Bug | 3 | INVALID_VALUE false trigger, get_string_value potential UB, stoll may throw |
| Design | 3 | BE retry strategy insufficient, master switch not hot-updatable, DEFER bypass |
| Minor | 3 | Typo, dead code, test coverage |
Additional notes (not inline)
Test file naming: `meta_service_helper_test.cpp` tests `meta_service_rate_limit_helper.h/cpp`. Suggest renaming it to `meta_service_rate_limit_helper_test.cpp` for consistency.
Test coverage gaps: tests are missing for:
- `fdb_performance_limited_by_name == BVAR_FDB_INVALID_VALUE` (the false-positive bug)
- FDB read latency trigger (only commit latency is tested)
- Memory pressure trigger (only CPU is tested)
- Window not-yet-full (first 60s) should not trigger
- `debug_string()` output when multiple conditions fire simultaneously
    pthread_setname_np(pthread_self(), "ms_stress_det");
    LOG(INFO) << "MsStressDetector background thread started";
    ProcessResourceSampler sampler;
    while (running_.load() == 1) {
`std::bad_alloc` may be thrown by `auto decision = std::make_shared<MsStressDecision>();` or by `samples_.push_back(sample);`. Could that cause this thread to terminate?
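One possible way to address this concern, sketched under assumptions rather than taken from the PR: catch `std::bad_alloc` inside each loop iteration so that a failed allocation skips one sample instead of ending the thread. The names `run_sampling_loop` and `sample_once` are hypothetical, and the iteration cap exists only to make the sketch testable.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <new>

// Hypothetical helper: calls `sample_once` until `running` is cleared or
// `max_iterations` is reached. Returns how many iterations threw std::bad_alloc.
inline int run_sampling_loop(std::atomic<int>& running,
                             const std::function<void()>& sample_once,
                             int max_iterations) {
    assert(max_iterations >= 0);
    int failed = 0;
    for (int i = 0; i < max_iterations && running.load() == 1; ++i) {
        try {
            sample_once();  // may allocate, e.g. make_shared or deque::push_back
        } catch (const std::bad_alloc&) {
            ++failed;  // skip this sample; the loop keeps running
        }
    }
    return failed;
}
```

A real detector loop would also sleep between iterations and log the failure; both are omitted here.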
Summary
This PR introduces an automatic rate limiting mechanism for the Meta Service (MS) in Doris Cloud. When the Meta Service or its underlying FoundationDB (FDB) cluster is under heavy load, incoming RPC requests will be proactively rejected with an `MS_TOO_BUSY` error code, preventing cascading failures and protecting system stability.
Motivation
In production environments, the Meta Service can become overwhelmed due to high concurrency, FDB cluster performance degradation, or resource exhaustion (CPU/memory). Without a self-protection mechanism, this can lead to cascading failures, elevated latencies, and potential system-wide outages. This change adds a multi-dimensional stress detection system that automatically throttles requests when the system is under significant pressure.
Design
Stress Detection Dimensions
The rate limiter evaluates system stress across three independent dimensions, any of which can trigger rate limiting:
FDB Cluster Pressure (`fdb_cluster_under_pressure`)
- FDB commit latency exceeds `ms_rate_limit_fdb_commit_latency_ms` (default: 50ms) OR FDB read latency exceeds `ms_rate_limit_fdb_read_latency_ms` (default: 5ms)
- The FDB `performance_limited_by` indicator reports a non-workload bottleneck (e.g., storage server, log server)
FDB Client Thread Pressure (`fdb_client_thread_under_pressure`)
- Average busyness exceeds `ms_rate_limit_fdb_client_thread_busyness_avg_percent` (default: 70%) AND the instantaneous busyness exceeds `ms_rate_limit_fdb_client_thread_busyness_instant_percent` (default: 90%)
MS Process Resource Pressure (`ms_resource_under_pressure`)
- CPU usage exceeds `ms_rate_limit_cpu_usage_percent` (default: 95%) OR memory usage (both current and window average) exceeds `ms_rate_limit_memory_usage_percent` (default: 95%)
- CPU usage: `getrusage()` delta over wall-clock time, normalized by CPU core count
- Memory usage: `/proc/self/status` (VmRSS) relative to total system memory via `sysinfo()`
Sliding Window Mechanism
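As a rough illustration of the mechanism this section describes, the sketch below keeps per-second samples in a deque, evicts those older than the window, and exposes the window average. It is not the PR's code: `CpuWindow` and `Sample` are hypothetical names, only a single CPU percentage is tracked, and the PR's `WindowSample` may hold different fields. Timestamps are assumed to be non-decreasing.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

struct Sample {
    int64_t second;      // timestamp in whole seconds
    double cpu_percent;  // CPU usage observed during that second
};

class CpuWindow {
public:
    explicit CpuWindow(int64_t window_seconds) : window_seconds_(window_seconds) {
        assert(window_seconds > 0);
    }

    // Append a sample, then evict everything older than the window.
    // The newest sample is never evicted, so the deque stays non-empty.
    void add(const Sample& s) {
        samples_.push_back(s);
        sum_ += s.cpu_percent;
        while (samples_.front().second <= s.second - window_seconds_) {
            sum_ -= samples_.front().cpu_percent;
            samples_.pop_front();
        }
    }

    // True once the retained samples span the whole window.
    bool full() const {
        return !samples_.empty() &&
               samples_.back().second - samples_.front().second >= window_seconds_ - 1;
    }

    double average() const { return samples_.empty() ? 0.0 : sum_ / samples_.size(); }

private:
    int64_t window_seconds_;
    std::deque<Sample> samples_;
    double sum_ = 0.0;
};
```

The `full()` check reflects the reviewer's point that a window that is not yet full should not trigger limiting; a caller would consult `average()` only when `full()` is true.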
- The `MsStressDetector` class maintains a `std::deque<WindowSample>` of per-second samples
- Samples older than the window (`ms_rate_limit_window_seconds`, default: 60s) are evicted
Request Rejection Flow
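The rejection path in this section can be pictured with a small sketch. It assumes a snapshot struct carrying the three pressure flags named in the design, where any one flag is enough to reject. `StressDecision`, `Response`, `admit_request`, `kOk`, and `kMsTooBusy` are hypothetical names; only the flag names and the code value 6002 come from the description.

```cpp
#include <cassert>
#include <string>

// Hypothetical snapshot of what the detector publishes.
struct StressDecision {
    bool fdb_cluster_under_pressure = false;
    bool fdb_client_thread_under_pressure = false;
    bool ms_resource_under_pressure = false;

    // Any single dimension is enough to trigger rate limiting.
    bool under_stress() const {
        return fdb_cluster_under_pressure || fdb_client_thread_under_pressure ||
               ms_resource_under_pressure;
    }

    // Comma-separated names of every dimension that fired.
    std::string debug_string() const {
        std::string out;
        auto append = [&out](bool fired, const char* name) {
            if (!fired) return;
            if (!out.empty()) out += ", ";
            out += name;
        };
        append(fdb_cluster_under_pressure, "fdb_cluster_under_pressure");
        append(fdb_client_thread_under_pressure, "fdb_client_thread_under_pressure");
        append(ms_resource_under_pressure, "ms_resource_under_pressure");
        return out;
    }
};

constexpr int kOk = 0;
constexpr int kMsTooBusy = 6002;  // value stated in the PR description

struct Response {
    int code = kOk;
    std::string msg;
};

// Returns true if the request may proceed; otherwise fills `resp` and returns false.
inline bool admit_request(const StressDecision& d, Response* resp) {
    assert(resp != nullptr);
    if (!d.under_stress()) return true;
    resp->code = kMsTooBusy;
    resp->msg = "meta service is too busy: " + d.debug_string();
    return false;
}
```

Keeping the check to a read of an already computed snapshot is what makes the per-request cost small; how the PR publishes that snapshot to request threads is not shown here.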
- The `RPC_PREPROCESS` macro in `meta_service_helper.h` is augmented with rate limit checking logic
- `get_ms_stress_decision()` is called to collect current metrics and evaluate stress
- If `under_greate_stress()` returns true, the request is immediately rejected with `MetaServiceCode::MS_TOO_BUSY` (6002) and a detailed debug string describing the trigger reason
- On the BE side (`cloud_meta_mgr.cpp`), the `MS_TOO_BUSY` error code is recognized and the error message is propagated
Fault Injection for Testing
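A minimal sketch of percentage-based injection with `std::mt19937`, assuming the probability is an integer in the range 0-100 as this section states. The helper name `should_inject` is hypothetical and the PR may structure this differently.

```cpp
#include <cassert>
#include <random>

// Hypothetical helper: returns true for roughly `probability_percent`
// percent of calls, drawing from the caller's generator.
inline bool should_inject(int probability_percent, std::mt19937& rng) {
    assert(probability_percent >= 0 && probability_percent <= 100);
    std::uniform_int_distribution<int> dist(0, 99);  // 100 equally likely values
    return dist(rng) < probability_percent;
}
```

With this formulation, 0 never injects and 100 always injects. Giving each thread its own generator (for example a `thread_local` one) would avoid locking on the request path; that is an assumption about why the description mentions efficiency, not something the source confirms.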
- Controlled by `enable_ms_rate_limit_injection` (default: false) and `ms_rate_limit_injection_probability` (default: 5%, range: 0-100)
- Uses a `std::mt19937` random number generator for efficiency
FDB Performance Limited By Metric
- `g_bvar_fdb_performance_limited_by_name` is added to track the FDB `performance_limited_by.name` field from the FDB status JSON
- The value is `0` if the limiter is "workload" (normal), `-1` otherwise (indicating an infrastructure bottleneck)
- It is populated in `metric.cpp` via a new `get_string_value` lambda that parses the FDB status JSON
Configuration Parameters
| Parameter | Default |
|---|---|
| `enable_ms_rate_limit` | true |
| `enable_ms_rate_limit_injection` | false |
| `ms_rate_limit_injection_probability` | 5 |
| `ms_rate_limit_window_seconds` | 60 |
| `ms_rate_limit_fdb_commit_latency_ms` | 50 |
| `ms_rate_limit_fdb_read_latency_ms` | 5 |
| `ms_rate_limit_fdb_client_thread_busyness_avg_percent` | 70 |
| `ms_rate_limit_fdb_client_thread_busyness_instant_percent` | 90 |
| `ms_rate_limit_cpu_usage_percent` | 95 |
| `ms_rate_limit_memory_usage_percent` | 95 |

All threshold parameters (prefixed with `m`) are mutable at runtime without restart.
Update rpc white list
update list
get list
unset