[feat](cloud) Add system rate limit for meta-service #61516

Open
wyxxxcat wants to merge 4 commits into apache:master from wyxxxcat:ms_rate_auto_adjust

Conversation

@wyxxxcat
Collaborator

@wyxxxcat wyxxxcat commented Mar 19, 2026

Summary

This PR introduces an automatic rate limiting mechanism for the Meta Service (MS) in Doris Cloud. When the Meta Service or its underlying FoundationDB (FDB) cluster is under heavy load, incoming RPC requests are proactively rejected with an MS_TOO_BUSY error code, preventing cascading failures and protecting system stability.

Motivation

In production environments, the Meta Service can become overwhelmed due to high concurrency, FDB cluster performance degradation, or resource exhaustion (CPU/memory). Without a self-protection mechanism, this can lead to cascading failures, elevated latencies, and potential system-wide outages. This change adds a multi-dimensional stress detection system that automatically throttles requests when the system is under significant pressure.

Design

Stress Detection Dimensions

The rate limiter evaluates system stress across three independent dimensions, any of which can trigger rate limiting:

  1. FDB Cluster Pressure (fdb_cluster_under_pressure)

    • Triggered when FDB commit latency exceeds ms_rate_limit_fdb_commit_latency_ms (default: 50ms) OR FDB read latency exceeds ms_rate_limit_fdb_read_latency_ms (default: 5ms)
    • AND the FDB performance_limited_by indicator reports a non-workload bottleneck (e.g., storage server, log server)
    • This ensures rate limiting only kicks in when FDB itself is the bottleneck, not when the cluster is simply handling a normal high workload
  2. FDB Client Thread Pressure (fdb_client_thread_under_pressure)

    • Uses a sliding window (default: 60 seconds) to compute the average FDB client thread busyness percentage
    • Triggered when the window average exceeds ms_rate_limit_fdb_client_thread_busyness_avg_percent (default: 70%) AND the instantaneous busyness exceeds ms_rate_limit_fdb_client_thread_busyness_instant_percent (default: 90%)
    • The dual-threshold (average + instant) design avoids false positives from transient spikes
  3. MS Process Resource Pressure (ms_resource_under_pressure)

    • Monitors the Meta Service process's own CPU and memory usage
    • Triggered when CPU usage (both current and window average) exceeds ms_rate_limit_cpu_usage_percent (default: 95%) OR memory usage (both current and window average) exceeds ms_rate_limit_memory_usage_percent (default: 95%)
    • CPU usage is calculated via getrusage() delta over wall-clock time, normalized by CPU core count
    • Memory usage is read from /proc/self/status (VmRSS) relative to total system memory via sysinfo()
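
The three checks above can be sketched as plain predicates over a metrics snapshot. This is only an illustration of the AND/OR structure and the default thresholds listed in the description. The StressInputs and Thresholds structs, the function signatures, and the choice to return false while the window is not yet full are assumptions, not the PR's actual code.

```cpp
#include <cstdint>
#include <string>

// Hypothetical snapshot of the metrics named in the description.
struct StressInputs {
    int64_t fdb_commit_latency_ms = 0;
    int64_t fdb_read_latency_ms = 0;
    std::string fdb_limited_by = "workload";    // performance_limited_by.name
    double busyness_avg = 0, busyness_now = 0;  // FDB client thread, percent
    double cpu_avg = 0, cpu_now = 0;            // MS process CPU, percent
    double mem_avg = 0, mem_now = 0;            // MS process memory, percent
    bool window_full = false;                   // are the window averages valid?
};

// Defaults copied from the description's configuration table.
struct Thresholds {
    int64_t commit_latency_ms = 50;
    int64_t read_latency_ms = 5;
    double busyness_avg = 70, busyness_instant = 90;
    double cpu = 95, mem = 95;
};

// Dimension 1: slow FDB AND a bottleneck other than "workload".
bool fdb_cluster_under_pressure(const StressInputs& in, const Thresholds& t) {
    const bool slow = in.fdb_commit_latency_ms > t.commit_latency_ms ||
                      in.fdb_read_latency_ms > t.read_latency_ms;
    return slow && in.fdb_limited_by != "workload";
}

// Dimension 2: window average AND instantaneous busyness both too high.
bool fdb_client_thread_under_pressure(const StressInputs& in, const Thresholds& t) {
    return in.window_full && in.busyness_avg > t.busyness_avg &&
           in.busyness_now > t.busyness_instant;
}

// Dimension 3: CPU (current and average) OR memory (current and average).
bool ms_resource_under_pressure(const StressInputs& in, const Thresholds& t) {
    if (!in.window_full) return false;  // assumption: averages not yet valid
    const bool cpu = in.cpu_now > t.cpu && in.cpu_avg > t.cpu;
    const bool mem = in.mem_now > t.mem && in.mem_avg > t.mem;
    return cpu || mem;
}

// Any single dimension is enough to trigger rate limiting.
bool any_dimension_under_pressure(const StressInputs& in, const Thresholds& t) {
    return fdb_cluster_under_pressure(in, t) ||
           fdb_client_thread_under_pressure(in, t) ||
           ms_resource_under_pressure(in, t);
}
```

Keeping the predicates free of I/O makes each trigger condition easy to unit test with hand-built inputs.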

Sliding Window Mechanism

  • A MsStressDetector class maintains a std::deque<WindowSample> of per-second samples
  • Each sample records: FDB client thread busyness, MS CPU usage, MS memory usage
  • Samples outside the configured window (ms_rate_limit_window_seconds, default: 60s) are evicted
  • Window averages are only considered valid when the window is fully populated (i.e., the time span of samples covers the full window)
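
A minimal sketch of the sliding window described above, assuming one sample per second with an integer timestamp. The class name SlidingWindow, the WindowSample field names, and the avg_cpu() helper are hypothetical; the description only states that a std::deque<WindowSample> is used, that old samples are evicted, and that averages are valid once the samples span the full window.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

// Hypothetical per-second sample; the real struct may differ.
struct WindowSample {
    int64_t ts_sec;   // sample timestamp in seconds
    double busyness;  // FDB client thread busyness, percent
    double cpu;       // MS CPU usage, percent
    double mem;       // MS memory usage, percent
};

class SlidingWindow {
public:
    explicit SlidingWindow(int64_t window_seconds) : window_(window_seconds) {
        assert(window_seconds > 0);
    }

    void add(const WindowSample& s) {
        samples_.push_back(s);
        // Evict samples older than the window relative to the newest sample.
        while (samples_.front().ts_sec < s.ts_sec - window_) {
            samples_.pop_front();
        }
    }

    // The window counts as full once the retained samples span its length.
    bool full() const {
        return !samples_.empty() &&
               samples_.back().ts_sec - samples_.front().ts_sec >= window_;
    }

    // Average CPU usage, or nullopt while the window is still filling up.
    std::optional<double> avg_cpu() const {
        if (!full()) return std::nullopt;
        double sum = 0;
        for (const auto& s : samples_) sum += s.cpu;
        return sum / static_cast<double>(samples_.size());
    }

    std::size_t size() const { return samples_.size(); }

private:
    int64_t window_;
    std::deque<WindowSample> samples_;
};
```

Returning std::nullopt while the window is filling makes the "not yet valid" state explicit, so a caller cannot mistake a partial average for a full one.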

Request Rejection Flow

  • The RPC_PREPROCESS macro in meta_service_helper.h is augmented with rate limit checking logic
  • Before processing any RPC request, get_ms_stress_decision() is called to collect current metrics and evaluate stress
  • If under_greate_stress() returns true, the request is immediately rejected with MetaServiceCode::MS_TOO_BUSY (6002) and a detailed debug string describing the trigger reason
  • On the BE side (cloud_meta_mgr.cpp), the MS_TOO_BUSY error code is recognized and the error message is propagated
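
The rejection step can be pictured as a small guard that runs before the handler body. The sketch below is not the RPC_PREPROCESS macro itself: StressDecision, Response, rate_limit_preprocess, and the message text are hypothetical stand-ins, and only the 6002 value for MS_TOO_BUSY comes from the description.

```cpp
#include <string>

constexpr int kOk = 0;            // hypothetical success code
constexpr int kMsTooBusy = 6002;  // MetaServiceCode::MS_TOO_BUSY per the description

// Hypothetical result of the stress evaluation.
struct StressDecision {
    bool under_stress = false;
    std::string reason;  // human-readable trigger description
};

// Hypothetical stand-in for the RPC response status.
struct Response {
    int code = kOk;
    std::string msg;
};

// Returns true when the handler should continue, false when the request
// was rejected and the response has already been filled in.
bool rate_limit_preprocess(bool enabled, const StressDecision& decision,
                           Response* resp) {
    if (!enabled || !decision.under_stress) return true;
    resp->code = kMsTooBusy;
    resp->msg = "meta service is too busy: " + decision.reason;
    return false;
}
```

A handler would call the guard first and return immediately when it yields false, leaving the response untouched otherwise.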

Fault Injection for Testing

  • A fault injection mechanism is included for testing rate limiting behavior without actual system stress
  • Controlled by enable_ms_rate_limit_injection (default: false) and ms_rate_limit_injection_probability (default: 5%, range: 0-100)
  • When enabled, each request has a configurable probability of being artificially rate-limited
  • Uses thread-local std::mt19937 random number generator for efficiency
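
As an illustration of the injection check, the following sketch draws from a thread-local std::mt19937 and compares against a percentage. The function name should_inject_rate_limit and the handling of out-of-range values are assumptions; the PR may seed or clamp differently.

```cpp
#include <random>

// Returns true with roughly `probability_percent` percent chance.
// Values at or below 0 never inject; values at or above 100 always inject.
bool should_inject_rate_limit(bool enabled, int probability_percent) {
    if (!enabled || probability_percent <= 0) return false;
    if (probability_percent >= 100) return true;
    // One generator per thread: no locking and no shared state between threads.
    thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(0, 99);
    return dist(rng) < probability_percent;
}
```

With the default of 5, about one request in twenty would be rejected while injection is enabled.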

FDB Performance Limited By Metric

  • A new bvar g_bvar_fdb_performance_limited_by_name is added to track the FDB performance_limited_by.name field from the FDB status JSON
  • The value is mapped to: 0 if the limiter is "workload" (normal), -1 otherwise (indicating an infrastructure bottleneck)
  • This metric is collected in metric.cpp via a new get_string_value lambda that parses the FDB status JSON
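
The mapping itself is small enough to show directly. In this sketch the function and constant names are hypothetical, and a status JSON that lacks the field altogether is a separate case that is not modeled here.

```cpp
#include <cstdint>
#include <string_view>

constexpr int64_t kLimitedByWorkload = 0;  // normal: the workload is the limiter
constexpr int64_t kLimitedByOther = -1;    // any other limiter name

// Maps the performance_limited_by.name string to the numeric metric value.
constexpr int64_t map_performance_limited_by(std::string_view name) {
    return name == "workload" ? kLimitedByWorkload : kLimitedByOther;
}
```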

Configuration Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_ms_rate_limit | Bool | true | Master switch for rate limiting |
| enable_ms_rate_limit_injection | mBool | false | Enable fault injection for testing |
| ms_rate_limit_injection_probability | mInt32 | 5 | Injection probability (0-100%) |
| ms_rate_limit_window_seconds | mInt64 | 60 | Sliding window size in seconds |
| ms_rate_limit_fdb_commit_latency_ms | mInt64 | 50 | FDB commit latency threshold (ms) |
| ms_rate_limit_fdb_read_latency_ms | mInt64 | 5 | FDB read latency threshold (ms) |
| ms_rate_limit_fdb_client_thread_busyness_avg_percent | mInt64 | 70 | FDB client thread avg busyness threshold (%) |
| ms_rate_limit_fdb_client_thread_busyness_instant_percent | mInt64 | 90 | FDB client thread instant busyness threshold (%) |
| ms_rate_limit_cpu_usage_percent | mInt64 | 95 | MS process CPU usage threshold (%) |
| ms_rate_limit_memory_usage_percent | mInt64 | 95 | MS process memory usage threshold (%) |

Parameters whose type carries the m prefix (mBool, mInt32, mInt64) are mutable at runtime without a restart. The master switch enable_ms_rate_limit is a plain Bool, so it is not.

Update RPC Whitelist

Set the whitelist:

curl -X POST http://<meta-service-host>:<port>/MetaService/http/set_rpc_rate_limit_whitelist \
-H "Content-Type: application/json" \
-d '{
  "rpcs": ["commit_txn", "begin_txn", "get_txn"]
}'

Get the current whitelist:

curl http://<meta-service-host>:<port>/MetaService/http/get_rpc_rate_limit_whitelist
{
"rpcs": [
  "commit_txn",
  "begin_txn",
  "get_txn"
]
}

Clear the whitelist:

curl -X POST http://<meta-service-host>:<port>/MetaService/http/set_rpc_rate_limit_whitelist \
-H "Content-Type: application/json" \
-d '{"rpcs": []}'

@Thearas
Contributor

Thearas commented Mar 19, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@wyxxxcat wyxxxcat force-pushed the ms_rate_auto_adjust branch 5 times, most recently from 5937d85 to 9195eb3 Compare March 19, 2026 08:46
@wyxxxcat wyxxxcat force-pushed the ms_rate_auto_adjust branch from 9195eb3 to 4ee3658 Compare March 19, 2026 08:48
@wyxxxcat
Collaborator Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26994 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4ee3658b2d9200400164338360112f9aab2bef58, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17663	4449	4338	4338
q2	q3	10726	799	521	521
q4	4734	362	254	254
q5	8204	1251	1025	1025
q6	247	174	147	147
q7	838	853	674	674
q8	10944	1481	1352	1352
q9	6732	4703	4698	4698
q10	6509	1939	1647	1647
q11	472	266	250	250
q12	757	592	476	476
q13	18079	2926	2187	2187
q14	236	240	211	211
q15	q16	740	758	667	667
q17	740	843	465	465
q18	6032	5513	5296	5296
q19	1128	1131	621	621
q20	541	500	377	377
q21	5026	1943	1517	1517
q22	362	303	271	271
Total cold run time: 100710 ms
Total hot run time: 26994 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4750	4652	4571	4571
q2	q3	3894	4339	3846	3846
q4	888	1220	785	785
q5	4101	4374	4366	4366
q6	183	183	138	138
q7	1752	1666	1552	1552
q8	2570	2781	2618	2618
q9	7615	7321	7415	7321
q10	3891	3983	3606	3606
q11	521	429	419	419
q12	514	591	451	451
q13	2865	3233	2241	2241
q14	302	319	286	286
q15	q16	728	835	748	748
q17	1241	1432	1474	1432
q18	7372	6859	6651	6651
q19	909	930	928	928
q20	2093	2202	2045	2045
q21	4049	3479	3368	3368
q22	460	433	382	382
Total cold run time: 50698 ms
Total hot run time: 47754 ms

@doris-robot

TPC-DS: Total hot run time: 168584 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 4ee3658b2d9200400164338360112f9aab2bef58, data reload: false

query5	4354	642	524	524
query6	354	238	213	213
query7	4209	477	268	268
query8	347	262	235	235
query9	8754	2717	2724	2717
query10	552	378	350	350
query11	6948	5099	4892	4892
query12	193	129	121	121
query13	1272	476	348	348
query14	5713	3679	3438	3438
query14_1	2903	2837	2832	2832
query15	210	191	179	179
query16	978	465	437	437
query17	900	725	630	630
query18	2449	455	356	356
query19	218	213	194	194
query20	139	126	129	126
query21	212	137	109	109
query22	13348	14160	14446	14160
query23	16193	16022	15711	15711
query23_1	15555	15654	15718	15654
query24	7719	1625	1219	1219
query24_1	1233	1235	1247	1235
query25	621	465	431	431
query26	1244	262	146	146
query27	2797	498	291	291
query28	4430	1839	1812	1812
query29	894	556	484	484
query30	307	230	199	199
query31	1021	938	867	867
query32	89	69	72	69
query33	520	337	291	291
query34	884	870	523	523
query35	632	695	592	592
query36	1110	1134	981	981
query37	137	108	83	83
query38	2914	2892	2866	2866
query39	865	835	804	804
query39_1	843	807	800	800
query40	236	156	139	139
query41	63	58	60	58
query42	261	265	258	258
query43	237	242	228	228
query44	
query45	201	192	180	180
query46	868	999	607	607
query47	2107	2138	2062	2062
query48	297	315	224	224
query49	643	466	385	385
query50	699	291	214	214
query51	4081	4009	3945	3945
query52	264	271	264	264
query53	286	335	282	282
query54	306	276	269	269
query55	98	90	78	78
query56	317	327	320	320
query57	1734	1899	1767	1767
query58	284	273	262	262
query59	2780	2940	2741	2741
query60	343	339	328	328
query61	156	158	152	152
query62	634	581	549	549
query63	309	272	284	272
query64	5276	1277	1018	1018
query65	
query66	1472	452	367	367
query67	24218	24277	24120	24120
query68	
query69	405	317	292	292
query70	993	982	965	965
query71	343	305	300	300
query72	2833	2696	2434	2434
query73	535	548	319	319
query74	9630	9536	9383	9383
query75	2869	2744	2476	2476
query76	2304	1039	694	694
query77	360	394	314	314
query78	10899	10989	10451	10451
query79	3239	768	566	566
query80	1722	628	531	531
query81	573	260	224	224
query82	987	154	119	119
query83	337	256	247	247
query84	301	120	96	96
query85	908	493	438	438
query86	495	315	308	308
query87	3098	3119	3067	3067
query88	3562	2666	2650	2650
query89	438	361	347	347
query90	2058	177	180	177
query91	170	160	140	140
query92	89	74	69	69
query93	1968	818	486	486
query94	643	324	304	304
query95	584	334	330	330
query96	647	516	228	228
query97	2433	2521	2377	2377
query98	244	222	221	221
query99	1012	1011	922	922
Total cold run time: 252277 ms
Total hot run time: 168584 ms

@doris-robot

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.71% (19785/37535)
Line Coverage 36.25% (184867/509909)
Region Coverage 32.50% (143076/440295)
Branch Coverage 33.67% (62554/185813)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 72.61% (26675/36738)
Line Coverage 56.11% (285133/508195)
Region Coverage 53.47% (237567/444341)
Branch Coverage 55.12% (102692/186295)

@wyxxxcat wyxxxcat force-pushed the ms_rate_auto_adjust branch from 7eedde9 to f7e782f Compare March 20, 2026 06:35
@wyxxxcat wyxxxcat force-pushed the ms_rate_auto_adjust branch from 01fcad6 to fbccb73 Compare March 23, 2026 06:58
@wyxxxcat
Collaborator Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26998 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit fbccb73644e11f6396c6ad5e490adc74d62fa9e3, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17691	4543	4331	4331
q2	q3	10638	762	540	540
q4	4686	372	260	260
q5	7620	1252	1036	1036
q6	186	174	148	148
q7	845	869	684	684
q8	10339	1524	1411	1411
q9	5832	4750	4747	4747
q10	6325	1955	1653	1653
q11	507	270	250	250
q12	777	587	475	475
q13	18042	2726	1940	1940
q14	238	237	223	223
q15	q16	758	734	684	684
q17	745	858	449	449
q18	5928	5519	5306	5306
q19	1146	1002	608	608
q20	554	504	384	384
q21	4546	2171	1596	1596
q22	372	333	273	273
Total cold run time: 97775 ms
Total hot run time: 26998 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4694	4728	4724	4724
q2	q3	3928	4402	3822	3822
q4	867	1205	763	763
q5	4110	4398	4390	4390
q6	184	178	141	141
q7	1838	1786	1570	1570
q8	2581	2785	2595	2595
q9	7551	7439	7349	7349
q10	3767	4008	3583	3583
q11	515	440	452	440
q12	520	620	485	485
q13	2607	2897	2096	2096
q14	285	298	278	278
q15	q16	777	794	718	718
q17	1184	1351	1299	1299
q18	7429	6782	6802	6782
q19	962	981	1052	981
q20	2115	2178	2033	2033
q21	4036	3509	3330	3330
q22	485	420	375	375
Total cold run time: 50435 ms
Total hot run time: 47754 ms

@doris-robot

TPC-DS: Total hot run time: 168170 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit fbccb73644e11f6396c6ad5e490adc74d62fa9e3, data reload: false

query5	4326	622	508	508
query6	339	223	202	202
query7	4228	464	256	256
query8	333	256	238	238
query9	8718	2712	2687	2687
query10	528	388	334	334
query11	7026	5072	4875	4875
query12	188	131	126	126
query13	1267	461	344	344
query14	5660	3712	3431	3431
query14_1	2805	2825	2782	2782
query15	204	195	177	177
query16	991	461	438	438
query17	909	739	632	632
query18	2456	453	364	364
query19	218	245	179	179
query20	133	128	125	125
query21	210	134	108	108
query22	13124	13810	14935	13810
query23	16737	16121	15925	15925
query23_1	16023	16204	15711	15711
query24	7235	1584	1200	1200
query24_1	1210	1205	1212	1205
query25	541	457	395	395
query26	1224	261	155	155
query27	2783	473	293	293
query28	4481	1829	1831	1829
query29	791	548	510	510
query30	292	216	187	187
query31	1006	934	875	875
query32	82	70	70	70
query33	514	333	286	286
query34	889	866	529	529
query35	639	683	599	599
query36	1108	1116	911	911
query37	135	97	81	81
query38	2924	2900	2842	2842
query39	861	846	827	827
query39_1	803	794	794	794
query40	230	154	134	134
query41	61	59	58	58
query42	256	259	260	259
query43	239	243	223	223
query44	
query45	202	196	181	181
query46	888	959	607	607
query47	2467	2156	2055	2055
query48	308	325	226	226
query49	631	447	381	381
query50	674	269	207	207
query51	4098	4028	3918	3918
query52	258	266	251	251
query53	294	330	275	275
query54	313	262	268	262
query55	99	88	86	86
query56	319	323	317	317
query57	1931	1763	1748	1748
query58	282	271	267	267
query59	2800	2940	2762	2762
query60	339	328	326	326
query61	154	143	156	143
query62	617	587	535	535
query63	311	275	269	269
query64	4952	1279	969	969
query65	
query66	1462	452	357	357
query67	24159	24227	24107	24107
query68	
query69	401	316	293	293
query70	991	984	948	948
query71	329	306	289	289
query72	2862	2741	2610	2610
query73	537	569	320	320
query74	9608	9540	9390	9390
query75	2875	2736	2457	2457
query76	2272	1050	675	675
query77	371	370	313	313
query78	10942	11175	10489	10489
query79	1136	771	568	568
query80	707	607	557	557
query81	461	254	225	225
query82	1329	151	119	119
query83	365	259	233	233
query84	294	116	100	100
query85	847	497	444	444
query86	355	304	319	304
query87	3180	3105	3012	3012
query88	3528	2656	2639	2639
query89	421	365	344	344
query90	1945	174	169	169
query91	170	190	141	141
query92	78	70	70	70
query93	897	814	498	498
query94	463	306	305	305
query95	589	336	373	336
query96	638	510	237	237
query97	2510	2538	2388	2388
query98	239	233	222	222
query99	1050	1010	946	946
Total cold run time: 248765 ms
Total hot run time: 168170 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.78% (19854/37614)
Line Coverage 36.29% (185494/511176)
Region Coverage 32.52% (143527/441351)
Branch Coverage 33.73% (62852/186322)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.31% (27002/36831)
Line Coverage 56.73% (289117/509624)
Region Coverage 54.05% (240785/445491)
Branch Coverage 55.74% (104176/186890)

Contributor

@bobhan1 bobhan1 left a comment

Review Summary

The core design (three-dimensional stress detection + sliding window + lock-free decision reads) is sound and practical. Main risks are around BVAR_FDB_INVALID_VALUE false-positive triggering and DEFER/closure interaction in the RPC macro. Recommend fixing these before merging.

| Severity | Count | Details |
| --- | --- | --- |
| Bug | 3 | INVALID_VALUE false trigger, get_string_value potential UB, stoll may throw |
| Design | 3 | BE retry strategy insufficient, master switch not hot-updatable, DEFER bypass |
| Minor | 3 | Typo, dead code, test coverage |
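
For the "stoll may throw" item, one non-throwing alternative is std::from_chars from C++17. The helper below is a sketch only: parse_int64 is a hypothetical name, and where the PR actually calls stoll is not shown in this conversation.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Parses a base-10 signed integer without throwing.
// Returns nullopt on empty input, trailing garbage, or overflow.
std::optional<int64_t> parse_int64(std::string_view s) {
    int64_t value = 0;
    const char* first = s.data();
    const char* last = s.data() + s.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last) return std::nullopt;
    return value;
}
```

Unlike std::stoll, std::from_chars does not skip leading whitespace or accept a leading plus sign, so callers that rely on that leniency would need to trim the input first.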

Additional notes (not inline)

Test file naming: meta_service_helper_test.cpp tests meta_service_rate_limit_helper.h/cpp. Suggest renaming to meta_service_rate_limit_helper_test.cpp for consistency.

Test coverage gaps: Missing tests for:

  • fdb_performance_limited_by_name == BVAR_FDB_INVALID_VALUE (the false-positive bug)
  • FDB read latency trigger (only commit latency is tested)
  • Memory pressure trigger (only CPU is tested)
  • Window not-yet-full (first 60s) should not trigger
  • debug_string() output when multiple conditions fire simultaneously

@wyxxxcat wyxxxcat force-pushed the ms_rate_auto_adjust branch 2 times, most recently from 57aff07 to b295d23 Compare March 27, 2026 07:36
@wyxxxcat wyxxxcat force-pushed the ms_rate_auto_adjust branch from b295d23 to 23e1dbe Compare March 30, 2026 09:03
pthread_setname_np(pthread_self(), "ms_stress_det");
LOG(INFO) << "MsStressDetector background thread started";
ProcessResourceSampler sampler;
while (running_.load() == 1) {
Contributor

Could std::bad_alloc be thrown from auto decision = std::make_shared<MsStressDecision>(); or from samples_.push_back(sample); and cause this thread to terminate?
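
One way to address this concern is to catch exceptions inside the loop body so that a single failed iteration does not end the loop. The sketch below is an assumption about how that could look, not the PR's code: run_guarded_loop and the tick callback are hypothetical, and real code would log the error and sleep before retrying.

```cpp
#include <atomic>
#include <exception>
#include <functional>

// Calls `tick` repeatedly while `running` is 1. Exceptions thrown by one
// iteration are counted and swallowed so the loop keeps going.
// Returns the number of failed iterations.
int run_guarded_loop(std::atomic<int>& running,
                     const std::function<void()>& tick) {
    int failures = 0;
    while (running.load() == 1) {
        try {
            tick();  // e.g. collect a sample and publish a new decision
        } catch (const std::exception&) {
            ++failures;  // real code would log the exception's what() here
        } catch (...) {
            ++failures;  // non-standard exception types
        }
    }
    return failures;
}
```

If the loop runs on a std::thread, this matters beyond the one thread: an exception that escapes the thread function calls std::terminate, which ends the whole process.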

@wyxxxcat wyxxxcat force-pushed the ms_rate_auto_adjust branch from 23e1dbe to 393c02a Compare March 30, 2026 09:42
5 participants