Refactor proxy server #4596

Draft
lvhan028 wants to merge 19 commits into InternLM:main from lvhan028:refactor/proxy-server

@lvhan028
Collaborator

No description provided.

lvhan028 and others added 15 commits May 18, 2026 06:57
Design for refactoring the proxy server with modular architecture,
strategy pattern for routing, and new min_cache_usage strategy
that polls backend /metrics for KV cache occupation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
13-task plan covering config, node registry, routing strategies,
forwarding, streaming, distserve, app factory, and CLI updates.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Codes

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…cted, min_observed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…sponse.py

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 18, 2026 12:09
@lvhan028 lvhan028 marked this pull request as draft May 18, 2026 12:09
Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the proxy server implementation (lmdeploy/serve/proxy/) from a large monolithic module into smaller components (config, node registry, routing strategies, forwarding, app factory, and DistServe router) and adds a new min_cache_usage routing strategy that polls /metrics.

Changes:

  • Split proxy functionality into focused modules and rewired the CLI entrypoint to assemble ProxyConfig + NodeRegistry + routing strategy + FastAPI app.
  • Added min_cache_usage routing strategy with Prometheus text parsing + background polling and fallback to min_expected_latency.
  • Added a new proxy test suite (config/node/routing/forwarding) and updated docs OpenAPI generation.
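
As a rough illustration of the min_cache_usage idea summarized above, the sketch below parses a Prometheus text payload for a single gauge and picks the node with the lowest value. The metric name `kv_cache_usage`, both function names, and the rule "fall back when no node has a reading" are assumptions made for this example, not details confirmed by the PR.

```python
from __future__ import annotations

from typing import Callable


def parse_cache_usage(metrics_text: str, metric_name: str = 'kv_cache_usage') -> float | None:
    """Return the first sample of `metric_name` in Prometheus text format, or None.

    `kv_cache_usage` is a hypothetical metric name chosen for this sketch.
    """
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like: name{label="x"} value [timestamp]
        name = line.split('{', 1)[0].split()[0]
        if name != metric_name:
            continue
        if '}' in line:
            fields = line.rsplit('}', 1)[1].split()
        else:
            fields = line.split()[1:]
        if not fields:
            continue
        try:
            return float(fields[0])
        except ValueError:
            continue
    return None


def pick_min_cache_node(usage_by_url: dict[str, float | None],
                        fallback: Callable[[list[str]], str]) -> str:
    """Pick the URL with the lowest known cache usage.

    If no node has a reading yet, defer to `fallback` (assumed behavior).
    """
    known = {url: usage for url, usage in usage_by_url.items() if usage is not None}
    if not known:
        return fallback(list(usage_by_url))
    return min(known, key=known.get)
```

In a real proxy the `fallback` callable would be the min_expected_latency selection; here it is just any function that takes the candidate URLs and returns one.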

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| `tests/test_lmdeploy/serve/proxy/test_routing.py` | Adds unit tests for routing strategies (random/min-expected/min-observed/min-cache) and factory selection. |
| `tests/test_lmdeploy/serve/proxy/test_node.py` | Adds unit tests for Node defaults and NodeRegistry CRUD/persist/load. |
| `tests/test_lmdeploy/serve/proxy/test_forwarding.py` | Adds unit tests for forwarding header preparation. |
| `tests/test_lmdeploy/serve/proxy/test_config.py` | Adds unit tests for proxy config defaults, env override, enums, and exception. |
| `lmdeploy/serve/proxy/utils.py` | Removes old shared constants/enums/exceptions (moved into config.py). |
| `lmdeploy/serve/proxy/streaming.py` | Updates import location for APIServerException and minor formatting. |
| `lmdeploy/serve/proxy/routing/random.py` | Implements weighted-random routing by node speed. |
| `lmdeploy/serve/proxy/routing/min_observed.py` | Implements routing based on mean observed latency history. |
| `lmdeploy/serve/proxy/routing/min_expected.py` | Implements routing based on expected latency (unfinished/speed). |
| `lmdeploy/serve/proxy/routing/min_cache.py` | Implements min_cache_usage routing + /metrics polling + Prometheus parsing + fallback. |
| `lmdeploy/serve/proxy/routing/base.py` | Adds BaseStrategy with hooks for request start/end and lifecycle start/stop. |
| `lmdeploy/serve/proxy/routing/__init__.py` | Adds get_strategy() factory for routing strategies. |
| `lmdeploy/serve/proxy/proxy.py` | Replaces monolith with a slim entrypoint wiring config/registry/strategy/app and running Uvicorn. |
| `lmdeploy/serve/proxy/node.py` | Adds Node and NodeRegistry with persistence and cache-usage metric updates. |
| `lmdeploy/serve/proxy/forwarding.py` | Adds raw request forwarding helpers (streaming + non-streaming) and header handling. |
| `lmdeploy/serve/proxy/distserve.py` | Extracts DistServe (prefill/decode) routing into DistServeRouter. |
| `lmdeploy/serve/proxy/config.py` | Adds ProxyConfig, enums, constants, error codes/messages, and APIServerException. |
| `lmdeploy/serve/proxy/app.py` | Adds create_app() FastAPI factory with endpoints, lifespan, and integration with routing strategy/registry. |
| `lmdeploy/serve/proxy/__init__.py` | Re-exports proxy-related public API. |
| `lmdeploy/cli/serve.py` | Adds min_cache_usage to CLI --routing-strategy choices. |
| `docs/zh_cn/conf.py` | Updates OpenAPI spec generation to build a proxy app via create_app(). |
| `docs/en/conf.py` | Updates OpenAPI spec generation to build a proxy app via create_app(). |
| `docs/superpowers/specs/2026-05-18-proxy-refactor-design.md` | Adds design spec documenting the refactor and new routing strategy. |
| `docs/superpowers/plans/2026-05-18-proxy-refactor.md` | Adds detailed implementation plan for the refactor. |
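
The routing modules listed above follow a strategy pattern: a base class with request hooks, concrete selection rules, and a factory. The sketch below shows one way that shape could look for the weighted-random and expected-latency (unfinished/speed) rules. `NodeStats`, its field names, the method name `select`, and the strategy keys are hypothetical and only mirror the descriptions in the summary, not the PR's actual code.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Hypothetical per-node state; field names are assumptions."""
    url: str
    speed: float = 1.0   # relative throughput, must be > 0
    unfinished: int = 0  # requests currently in flight


class BaseStrategy:
    """Minimal strategy interface with request start/end hooks."""

    def select(self, nodes: list[NodeStats]) -> NodeStats:
        raise NotImplementedError

    def on_request_start(self, node: NodeStats) -> None:
        node.unfinished += 1

    def on_request_end(self, node: NodeStats, latency: float) -> None:
        node.unfinished -= 1


class RandomStrategy(BaseStrategy):
    """Weighted-random choice: faster nodes are picked more often."""

    def __init__(self, rng: random.Random | None = None) -> None:
        self.rng = rng or random.Random()

    def select(self, nodes: list[NodeStats]) -> NodeStats:
        weights = [node.speed for node in nodes]
        return self.rng.choices(nodes, weights=weights, k=1)[0]


class MinExpectedLatencyStrategy(BaseStrategy):
    """Pick the node with the smallest unfinished/speed ratio."""

    def select(self, nodes: list[NodeStats]) -> NodeStats:
        return min(nodes, key=lambda node: node.unfinished / node.speed)


def get_strategy(name: str) -> BaseStrategy:
    """Factory keyed by strategy name (keys here are illustrative)."""
    strategies = {
        'random': RandomStrategy,
        'min_expected_latency': MinExpectedLatencyStrategy,
    }
    if name not in strategies:
        raise ValueError(f'unknown routing strategy: {name}')
    return strategies[name]()
```

Keeping the in-flight counter in the base hooks means every strategy sees the same `unfinished` value, whichever rule is active.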
Comments suppressed due to low confidence (1)

lmdeploy/serve/proxy/app.py:274

  • Same issue for /v1/completions streaming: computing time.time() - start when scheduling the background task records near-zero latency. Compute elapsed time inside the background task at completion.
                response = forward_request_stream(client, node.url, raw_request, '/v1/completions')
                background_task = BackgroundTasks()
                background_task.add_task(strategy.on_request_end, node, time.time() - start)
                return ProxyStreamingResponse(response, background=background_task, media_type='text/event-stream')
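
One possible shape for the fix suggested in this comment is to hand the background task a callable that reads the clock when it runs, rather than a precomputed duration. The helper name `make_latency_reporter` and its `clock` parameter are hypothetical. The sketch also assumes the background task runs only after the stream has been fully sent, as it does with Starlette's `StreamingResponse`.

```python
import time
from typing import Callable


def make_latency_reporter(start: float,
                          report: Callable[[float], None],
                          clock: Callable[[], float] = time.time) -> Callable[[], None]:
    """Return a zero-argument task that measures elapsed time when it is called.

    Computing `clock() - start` inside the task defers the measurement until
    the task actually executes, instead of when it was scheduled.
    """

    def _run() -> None:
        report(clock() - start)

    return _run
```

In the handler this might be wired as `background_task.add_task(make_latency_reporter(start, lambda dt: strategy.on_request_end(node, dt)))`, again as an illustration rather than the PR's final code.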

Comment on lines +134 to +136
    @app.post('/nodes/add', dependencies=[Depends(validate_json_request)])
    async def add_node(node: Node, raw_request: Request = None):
        try:
Comment on lines +50 to +54
    try:
        loop = asyncio.get_event_loop()
    except RuntimeError:
        return
    if loop.is_closed():
Comment thread lmdeploy/serve/proxy/app.py Outdated
Comment on lines +245 to +248
    response = forward_request_stream(client, node.url, raw_request, '/v1/chat/completions')
    background_task = BackgroundTasks()
    background_task.add_task(strategy.on_request_end, node, time.time() - start)
    return ProxyStreamingResponse(response, background=background_task, media_type='text/event-stream')
Comment on lines +42 to +68
    async def add(self, url: str, role: EngineRole = EngineRole.Hybrid,
                  models: list[str] | None = None,
                  status: Node | None = None) -> None:
        async with self._lock:
            if status is not None:
                if status.models:
                    self._nodes.pop(url, None)
                    self._nodes[url] = status
                    await self._persist_unlocked()
                    return
                node = status
            else:
                node = self._nodes.get(url, Node(url=url, role=role))

            if models is not None:
                node.models = models
            elif not node.models:
                try:
                    import requests

                    from lmdeploy.serve.openai.api_client import APIClient
                    client = APIClient(api_server_url=url)
                    node.models = client.available_models
                except requests.exceptions.RequestException as e:
                    logger.error(f"Exception when adding node {url}: {e}")
                    return

Comment on lines +65 to +69
    p_nodes = await self.registry.get(model_name, role=EngineRole.Prefill)
    if not p_nodes:
        return self._handle_unavailable_model(model_name)
    p_url = p_nodes[0].url
    logger.info(f"A Prefill request is dispatched to {p_url}")
    d_nodes = await self.registry.get(model_name, role=EngineRole.Decode)
    if not d_nodes:
        return self._handle_unavailable_model(model_name)
    d_url = d_nodes[0].url
Comment on lines +119 to +123
    if stream:
        response = stream_generate(client, request_dict, d_url, endpoint)
        background_task = BackgroundTasks()
        resp = StreamingResponse(response, background=background_task, media_type='text/event-stream')
    else:
lvhan028 and others added 4 commits May 18, 2026 12:27
Previously, a new aiohttp.ClientSession was created per request and per
metrics poll cycle. This leaked TCP connections and discarded connection
pooling. Now one shared session is created in the app lifespan, stored
on strategy.client and app.state.client, and reused by both request
handlers and MinCacheUsageStrategy polling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
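
The shared-session lifecycle described in this commit can be sketched with a plain async context manager. `DummySession` is a stand-in so the example runs without third-party packages; in the proxy the factory would presumably be `aiohttp.ClientSession`. The names `shared_client` and `demo` are hypothetical.

```python
import asyncio
from contextlib import asynccontextmanager
from types import SimpleNamespace


class DummySession:
    """Stand-in for an HTTP client session (assumption for this sketch)."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


@asynccontextmanager
async def shared_client(state, strategy, session_factory=DummySession):
    """Create one session at startup, expose it on both objects, close it at shutdown."""
    session = session_factory()
    state.client = session
    strategy.client = session
    try:
        yield session
    finally:
        # Closing once here releases pooled connections for every user of the session.
        await session.close()


async def demo():
    state, strategy = SimpleNamespace(), SimpleNamespace()
    async with shared_client(state, strategy) as session:
        same_object = state.client is strategy.client is session
        open_during = not session.closed
    return same_object, open_during, session.closed
```

Request handlers and the polling loop then read the session from `state.client` or `strategy.client` instead of constructing their own.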
Add conn_limit and conn_limit_per_host to ProxyConfig, configurable via
LMDEPLOY_PROXY_CONN_LIMIT (default 100) and
LMDEPLOY_PROXY_CONN_LIMIT_PER_HOST (default 0=unlimited) env vars.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
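
A minimal sketch of reading these two limits from the environment with the defaults stated in the commit message. The class name `ConnLimits` and the helper `_env_int` are hypothetical, and how the real ProxyConfig parses or validates the values is not shown in this PR page.

```python
import os
from dataclasses import dataclass, field


def _env_int(name: str, default: int) -> int:
    """Read an integer environment variable, using `default` when unset or empty."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == '':
        return default
    return int(raw)


@dataclass
class ConnLimits:
    # Defaults follow the commit message: 100 total, 0 (unlimited) per host.
    conn_limit: int = field(
        default_factory=lambda: _env_int('LMDEPLOY_PROXY_CONN_LIMIT', 100))
    conn_limit_per_host: int = field(
        default_factory=lambda: _env_int('LMDEPLOY_PROXY_CONN_LIMIT_PER_HOST', 0))
```

Values like these map naturally onto `aiohttp.TCPConnector(limit=..., limit_per_host=...)`, though whether the PR wires them that way is an assumption.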
0.0.0.0 is a bind address, not a connect address. Python's HTTP
clients cannot connect to it, causing "Connection refused" errors.
Replace 0.0.0.0 with 127.0.0.1 when adding nodes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
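
The address rewrite described in this commit could look roughly like the following. The function name `normalize_node_url` is hypothetical, and the sketch assumes node URLs carry a scheme (for example `http://`) and no userinfo part.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_node_url(url: str) -> str:
    """Replace the bind address 0.0.0.0 with the loopback address 127.0.0.1.

    URLs with any other host, or without a scheme, are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.hostname != '0.0.0.0':
        return url
    netloc = '127.0.0.1'
    if parts.port is not None:
        netloc += f':{parts.port}'
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Parsing the URL instead of doing a plain string replace avoids touching a `0.0.0.0` that happens to appear in the path or query.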
…odel fetch failures

The api_server sends {'url': '...', 'status': {'models': [...], 'role': N}}
when self-registering, but the new Node model doesn't have a nested
status field. Extract models and role from the nested status dict in the
raw request body. Also, if model fetching fails, register the node with
an empty model list instead of silently dropping it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
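
To illustrate the payload handling this commit describes, the sketch below pulls `models` and `role` out of a nested `status` dict and defaults to an empty model list. The function name `extract_registration` and the fallback to top-level `models`/`role` keys are assumptions; only the nested payload shape comes from the commit message.

```python
from __future__ import annotations

from typing import Any


def extract_registration(body: dict[str, Any]) -> tuple[str, list[str], Any]:
    """Return (url, models, role) from a node registration request body.

    Prefers values nested under 'status'; top-level keys are an assumed fallback.
    A missing or empty model list becomes [] so the node is still registered.
    """
    status = body.get('status') or {}
    models = status.get('models', body.get('models')) or []
    role = status.get('role', body.get('role'))
    return body['url'], list(models), role
```

For example, `extract_registration({'url': 'http://h:1', 'status': {'models': ['m'], 'role': 1}})` yields the URL, `['m']`, and `1`.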