Refactor proxy server#4596
Draft
lvhan028 wants to merge 19 commits into
Design for refactoring the proxy server into a modular architecture, with the strategy pattern for routing and a new min_cache_usage strategy that polls each backend's /metrics endpoint for KV cache usage. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
13-task plan covering config, node registry, routing strategies, forwarding, streaming, distserve, app factory, and CLI updates. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Codes Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…cted, min_observed Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…sponse.py Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
This PR refactors the proxy server implementation (lmdeploy/serve/proxy/) from a large monolithic module into smaller components (config, node registry, routing strategies, forwarding, app factory, and DistServe router) and adds a new min_cache_usage routing strategy that polls /metrics.
Changes:
- Split proxy functionality into focused modules and rewired the CLI entrypoint to assemble `ProxyConfig` + `NodeRegistry` + routing strategy + FastAPI app.
- Added `min_cache_usage` routing strategy with Prometheus text parsing + background polling and fallback to `min_expected_latency`.
- Added a new proxy test suite (config/node/routing/forwarding) and updated docs OpenAPI generation.
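
As a rough illustration of the `min_cache_usage` idea summarized above, the sketch below parses a gauge from Prometheus text exposition and picks the node with the lowest known value. The metric name `kv_cache_usage`, both helper functions, and the rule "return `None` so the caller falls back" are assumptions made for illustration. None of them is taken from the PR's `min_cache.py`.

```python
from typing import Dict, Optional

# Hypothetical metric name; the name actually exposed on /metrics is not shown here.
METRIC = "kv_cache_usage"


def parse_gauge(text: str, metric: str = METRIC) -> Optional[float]:
    """Return the first sample value of `metric` in Prometheus text, or None."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name != metric:
            continue
        # The value follows the label set if there is one, else the bare name.
        tail = line[line.rfind("}") + 1:] if "{" in line else line[len(name):]
        try:
            return float(tail.split()[0])
        except (IndexError, ValueError):
            continue  # malformed sample, keep scanning
    return None


def pick_min_cache_node(usage: Dict[str, Optional[float]]) -> Optional[str]:
    """Pick the URL with the lowest known usage; None tells the caller to fall back."""
    known = {url: value for url, value in usage.items() if value is not None}
    if not known:
        return None
    return min(known, key=known.get)
```

A caller would presumably treat a `None` result as the signal to use `min_expected_latency`, matching the fallback named in the change list.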
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/test_lmdeploy/serve/proxy/test_routing.py | Adds unit tests for routing strategies (random/min-expected/min-observed/min-cache) and factory selection. |
| tests/test_lmdeploy/serve/proxy/test_node.py | Adds unit tests for Node defaults and NodeRegistry CRUD/persist/load. |
| tests/test_lmdeploy/serve/proxy/test_forwarding.py | Adds unit tests for forwarding header preparation. |
| tests/test_lmdeploy/serve/proxy/test_config.py | Adds unit tests for proxy config defaults, env override, enums, and exception. |
| lmdeploy/serve/proxy/utils.py | Removes old shared constants/enums/exceptions (moved into config.py). |
| lmdeploy/serve/proxy/streaming.py | Updates import location for APIServerException and minor formatting. |
| lmdeploy/serve/proxy/routing/random.py | Implements weighted-random routing by node speed. |
| lmdeploy/serve/proxy/routing/min_observed.py | Implements routing based on mean observed latency history. |
| lmdeploy/serve/proxy/routing/min_expected.py | Implements routing based on expected latency (unfinished/speed). |
| lmdeploy/serve/proxy/routing/min_cache.py | Implements min_cache_usage routing + /metrics polling + Prometheus parsing + fallback. |
| lmdeploy/serve/proxy/routing/base.py | Adds BaseStrategy with hooks for request start/end and lifecycle start/stop. |
| lmdeploy/serve/proxy/routing/__init__.py | Adds get_strategy() factory for routing strategies. |
| lmdeploy/serve/proxy/proxy.py | Replaces monolith with a slim entrypoint wiring config/registry/strategy/app and running Uvicorn. |
| lmdeploy/serve/proxy/node.py | Adds Node and NodeRegistry with persistence and cache-usage metric updates. |
| lmdeploy/serve/proxy/forwarding.py | Adds raw request forwarding helpers (streaming + non-streaming) and header handling. |
| lmdeploy/serve/proxy/distserve.py | Extracts DistServe (prefill/decode) routing into DistServeRouter. |
| lmdeploy/serve/proxy/config.py | Adds ProxyConfig, enums, constants, error codes/messages, and APIServerException. |
| lmdeploy/serve/proxy/app.py | Adds create_app() FastAPI factory with endpoints, lifespan, and integration with routing strategy/registry. |
| lmdeploy/serve/proxy/__init__.py | Re-exports proxy-related public API. |
| lmdeploy/cli/serve.py | Adds min_cache_usage to CLI --routing-strategy choices. |
| docs/zh_cn/conf.py | Updates OpenAPI spec generation to build a proxy app via create_app(). |
| docs/en/conf.py | Updates OpenAPI spec generation to build a proxy app via create_app(). |
| docs/superpowers/specs/2026-05-18-proxy-refactor-design.md | Adds design spec documenting the refactor and new routing strategy. |
| docs/superpowers/plans/2026-05-18-proxy-refactor.md | Adds detailed implementation plan for the refactor. |
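
The three latency-oriented strategies described in the table (weighted random by speed, expected latency as unfinished/speed, and mean observed latency) can be sketched roughly as follows. The `Node` dataclass here is a hypothetical stand-in with assumed field names, not the PR's `Node` model. Treating a node with no latency history as latency 0.0 is also an assumption.

```python
import random
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class Node:  # hypothetical stand-in for illustration only
    url: str
    speed: float = 1.0      # relative throughput, assumed > 0
    unfinished: int = 0     # requests currently in flight
    latencies: List[float] = field(default_factory=list)


def pick_random(nodes, rng=random):
    """Weighted random choice: faster nodes are picked proportionally more often."""
    return rng.choices(nodes, weights=[n.speed for n in nodes], k=1)[0]


def pick_min_expected(nodes):
    """Lowest expected latency, estimated as unfinished / speed."""
    return min(nodes, key=lambda n: n.unfinished / n.speed)


def pick_min_observed(nodes):
    """Lowest mean observed latency; nodes without history count as 0.0 (assumption)."""
    return min(nodes, key=lambda n: mean(n.latencies) if n.latencies else 0.0)
```

With `a = Node("a", speed=1.0, unfinished=4)` and `b = Node("b", speed=2.0, unfinished=6)`, `pick_min_expected` prefers `b`, since 6/2 is smaller than 4/1 even though `b` has more requests in flight.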
Comments suppressed due to low confidence (1)
lmdeploy/serve/proxy/app.py:274
- Same issue for `/v1/completions` streaming: computing `time.time() - start` when scheduling the background task records near-zero latency. Compute elapsed time inside the background task at completion.
response = forward_request_stream(client, node.url, raw_request, '/v1/completions')
background_task = BackgroundTasks()
background_task.add_task(strategy.on_request_end, node, time.time() - start)
return ProxyStreamingResponse(response, background=background_task, media_type='text/event-stream')
Comment on lines +134 to +136
@app.post('/nodes/add', dependencies=[Depends(validate_json_request)])
async def add_node(node: Node, raw_request: Request = None):
    try:
Comment on lines +50 to +54
try:
    loop = asyncio.get_event_loop()
except RuntimeError:
    return
if loop.is_closed():
Comment on lines +245 to +248
response = forward_request_stream(client, node.url, raw_request, '/v1/chat/completions')
background_task = BackgroundTasks()
background_task.add_task(strategy.on_request_end, node, time.time() - start)
return ProxyStreamingResponse(response, background=background_task, media_type='text/event-stream')
Comment on lines +42 to +68
async def add(self, url: str, role: EngineRole = EngineRole.Hybrid,
              models: list[str] | None = None,
              status: Node | None = None) -> None:
    async with self._lock:
        if status is not None:
            if status.models:
                self._nodes.pop(url, None)
                self._nodes[url] = status
                await self._persist_unlocked()
                return
            node = status
        else:
            node = self._nodes.get(url, Node(url=url, role=role))

        if models is not None:
            node.models = models
        elif not node.models:
            try:
                import requests

                from lmdeploy.serve.openai.api_client import APIClient
                client = APIClient(api_server_url=url)
                node.models = client.available_models
            except requests.exceptions.RequestException as e:
                logger.error(f"Exception when adding node {url}: {e}")
                return
Comment on lines +65 to +69
p_nodes = await self.registry.get(model_name, role=EngineRole.Prefill)
if not p_nodes:
    return self._handle_unavailable_model(model_name)
p_url = p_nodes[0].url
logger.info(f"A Prefill request is dispatched to {p_url}")

d_nodes = await self.registry.get(model_name, role=EngineRole.Decode)
if not d_nodes:
    return self._handle_unavailable_model(model_name)
d_url = d_nodes[0].url
Comment on lines +119 to +123
if stream:
    response = stream_generate(client, request_dict, d_url, endpoint)
    background_task = BackgroundTasks()
    resp = StreamingResponse(response, background=background_task, media_type='text/event-stream')
else:
Previously, a new aiohttp.ClientSession was created per request and per metrics poll cycle. This leaked TCP connections and discarded connection pooling. Now one shared session is created in the app lifespan, stored on strategy.client and app.state.client, and reused by both request handlers and MinCacheUsageStrategy polling. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add conn_limit and conn_limit_per_host to ProxyConfig, configurable via LMDEPLOY_PROXY_CONN_LIMIT (default 100) and LMDEPLOY_PROXY_CONN_LIMIT_PER_HOST (default 0=unlimited) env vars. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
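
A minimal sketch of how these two settings might be read from the environment, assuming plain integer parsing. The `ConnLimits` class and `_env_int` helper are hypothetical (the PR places the fields on `ProxyConfig`), and passing the values to `aiohttp.TCPConnector` is an assumption about how the shared session is built.

```python
import os
from dataclasses import dataclass, field


def _env_int(name: str, default: int) -> int:
    """Read an integer from the environment, using `default` when unset or empty."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


@dataclass
class ConnLimits:  # hypothetical holder; the PR adds these fields to ProxyConfig
    conn_limit: int = field(
        default_factory=lambda: _env_int("LMDEPLOY_PROXY_CONN_LIMIT", 100))
    conn_limit_per_host: int = field(
        default_factory=lambda: _env_int("LMDEPLOY_PROXY_CONN_LIMIT_PER_HOST", 0))

    def connector_kwargs(self) -> dict:
        # Shaped for aiohttp.TCPConnector(limit=..., limit_per_host=...),
        # where limit_per_host=0 means no per-host limit.
        return {"limit": self.conn_limit, "limit_per_host": self.conn_limit_per_host}
```

Using `default_factory` means the environment is read when the object is created rather than at import time, which keeps the defaults easy to override in tests.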
0.0.0.0 is a bind address, not a portable connect address. HTTP clients can fail to connect to it on some platforms, causing "Connection refused" errors. Replace 0.0.0.0 with 127.0.0.1 when adding nodes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
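
The rewrite described in this commit could look roughly like the following. The function name `to_connect_url` is hypothetical, and the sketch assumes node URLs carry a scheme and no userinfo part.

```python
from urllib.parse import urlsplit, urlunsplit


def to_connect_url(url: str) -> str:
    """Replace a 0.0.0.0 host with 127.0.0.1, leaving every other URL untouched."""
    parts = urlsplit(url)
    if parts.hostname != "0.0.0.0":
        return url
    # Rebuild the netloc, keeping the port if one was given.
    netloc = "127.0.0.1" if parts.port is None else f"127.0.0.1:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Comparing the parsed hostname rather than doing a string replace avoids rewriting addresses such as `10.0.0.0` that merely contain the same characters.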
…odel fetch failures
The api_server sends {'url': '...', 'status': {'models': [...], 'role': N}}
when self-registering, but the new Node model doesn't have a nested
status field. Extract models and role from the nested status dict in the
raw request body. Also, if model fetching fails, register the node with
an empty model list instead of silently dropping it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
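
To illustrate the payload handling this commit describes, the sketch below pulls `models` and `role` out of the nested `status` dict in a self-registration body. The function name `extract_registration` and the fallback to top-level `models`/`role` keys are assumptions. Only the nested payload shape comes from the commit message.

```python
from typing import Any, Dict, List, Optional, Tuple


def extract_registration(body: Dict[str, Any]) -> Tuple[str, List[str], Optional[int]]:
    """Return (url, models, role) from a raw /nodes/add request body."""
    status = body.get("status") or {}
    # Prefer the nested status fields; fall back to flat keys (an assumption).
    models = status.get("models", body.get("models")) or []
    role = status.get("role", body.get("role"))
    return body["url"], list(models), role
```

An empty model list is returned when nothing is provided, which lines up with the commit's choice to register the node with no models rather than drop it.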