[client-v2] getNextAliveNode() always returns endpoints.get(0) — retries never fail over to alternate endpoints #2855

@claude

Description

com.clickhouse.client.api.Client#getNextAliveNode() (client-v2) is hard-coded to return the first registered endpoint and never rotates:

// client-v2/src/main/java/com/clickhouse/client/api/Client.java:2200
private Endpoint getNextAliveNode() {
    return endpoints.get(0);
}

This method is the only mechanism the client-v2 insert and query code paths use to select, or re-select, an endpoint on retry. Every retry call site (in Client.java around lines 1329, 1359, 1376, 1551, 1567, 1583, 1680, 1697, 1725) reassigns selectedEndpoint = getNextAliveNode(), but because the method ignores any rotation or liveness state, every retry targets the same (first) endpoint. If the first endpoint is unreachable, no registered alternative is ever attempted, and the client surfaces a connection failure even though other configured endpoints could have served the request.

The naming (getNextAliveNode) suggests the surrounding retry loop was authored expecting rotation/liveness logic that was never implemented, so the failover behavior is silently absent rather than explicitly disabled. The feature request to add real failover (#1838) was closed as not-planned (stale) on 2026-02-10, so this stub remains in the shipping code.

This is the Java-side analogue of clickhouse-go#136 (the Go driver failing to fall back to alt_hosts when the first host is down).

ClickHouse server version

Code analysis only; not verified against a running server. (Server in the investigation environment was 26.4.2.10, but reproduction is purely client-side and does not depend on server version.)

Reproduction

Minimal client-v2 snippet that exercises the broken path:

import com.clickhouse.client.api.Client;
import com.clickhouse.client.api.query.QueryResponse;

public class FailoverRepro {
    public static void main(String[] args) throws Exception {
        // First endpoint is dead (no listener on :1); second is healthy.
        try (Client client = new Client.Builder()
                .addEndpoint("http://127.0.0.1:1")
                .addEndpoint("http://localhost:8123")
                .setUsername("default")
                .setPassword("")
                .retryOnFailures(com.clickhouse.client.api.ClientFaultCause.ConnectTimeout,
                                 com.clickhouse.client.api.ClientFaultCause.NoHttpResponse,
                                 com.clickhouse.client.api.ClientFaultCause.ServerRetryable)
                .setMaxRetries(5)  // any N >= 1
                .build()) {

            try (QueryResponse r = client.query("SELECT 1").get()) {
                // Expected: succeed via the healthy second endpoint after the first fails.
                // Actual:   all N+1 attempts hit 127.0.0.1:1 and the query throws a
                //           connection-refused / connect-timeout exception.
            }
        }
    }
}

Expected: the retry loop tries localhost:8123 after 127.0.0.1:1 fails to connect, and the query returns successfully.

Actual: every retry repeatedly targets 127.0.0.1:1 because getNextAliveNode() returns endpoints.get(0) unconditionally; the healthy alternate is never tried.
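
The effect can be modeled without a server or the real client. The following is a toy sketch only: RetryLoopModel and its attempts method are hypothetical names, endpoints are plain strings, and "reachable" is reduced to a string comparison. It mirrors the shape of the retry loop described above (re-select an endpoint on every attempt) with a selector that behaves like the current stub.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Toy model of the retry loop; hypothetical, not client-v2 code. */
final class RetryLoopModel {

    /** Returns the endpoints attempted, in order, stopping at the first healthy one. */
    static List<String> attempts(Supplier<String> selector, String healthyEndpoint, int maxRetries) {
        List<String> tried = new ArrayList<>();
        // One initial attempt plus up to maxRetries retries.
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            String selected = selector.get(); // stands in for selectedEndpoint = getNextAliveNode()
            tried.add(selected);
            if (selected.equals(healthyEndpoint)) {
                break; // the request would succeed here
            }
        }
        return tried;
    }

    public static void main(String[] args) {
        List<String> endpoints = Arrays.asList("http://127.0.0.1:1", "http://localhost:8123");
        // Same shape as the stub: always the first registered endpoint.
        Supplier<String> stub = () -> endpoints.get(0);
        // Prints http://127.0.0.1:1 six times; the healthy endpoint never appears.
        System.out.println(attempts(stub, "http://localhost:8123", 5));
    }
}
```

With maxRetries = 5 the model makes six attempts, all against the dead endpoint, which matches the N+1 behavior described in the reproduction.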

Suggested fix

The fix belongs in Client.java:2200-2202. At minimum, getNextAliveNode() should rotate across endpoints, for example with an AtomicInteger round-robin index that is incremented on each call and taken modulo endpoints.size(). A more complete fix would also track endpoints that have already failed within the current retry loop and skip them, and would advance the index only on retries triggered by connection-level failures or retryable 5xx responses, so that successful requests keep affinity to a single endpoint.
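
A minimal sketch of that selection logic, under stated assumptions: RoundRobinEndpoints, current() and nextAfterFailure() are hypothetical names, endpoints are modeled as strings rather than the client's Endpoint type, and the caller is assumed to keep a per-request set of endpoints that have already failed. It is not a patch against Client.java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical round-robin selector; not part of client-v2. */
final class RoundRobinEndpoints {
    private final List<String> endpoints;
    private final AtomicInteger cursor = new AtomicInteger();

    RoundRobinEndpoints(List<String> endpoints) {
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("at least one endpoint is required");
        }
        this.endpoints = new ArrayList<>(endpoints); // defensive copy
    }

    /** Endpoint for a fresh request: reads the cursor without advancing it (keeps affinity). */
    String current() {
        return at(cursor.get());
    }

    /**
     * Endpoint for a retry: advances the cursor and skips endpoints that already
     * failed in the current retry loop. If every endpoint has failed, it still
     * rotates by one position so repeated retries do not pin to a single endpoint.
     */
    String nextAfterFailure(Set<String> failedInThisLoop) {
        for (int i = 0; i < endpoints.size(); i++) {
            String candidate = at(cursor.incrementAndGet());
            if (!failedInThisLoop.contains(candidate)) {
                return candidate;
            }
        }
        // Full lap completed; one more step moves to the neighbor of the starting endpoint.
        return at(cursor.incrementAndGet());
    }

    private String at(int index) {
        // floorMod keeps the index valid even after the int counter overflows.
        return endpoints.get(Math.floorMod(index, endpoints.size()));
    }
}
```

In this sketch a retry loop would call current() for the first attempt and nextAfterFailure(failed) only after a connection-level or retryable failure, adding the failed endpoint to the set first. The scan in nextAfterFailure is not atomic as a whole, so concurrent callers may skip an endpoint occasionally; that seems acceptable for failover, but a real implementation might choose a different trade-off.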

Link

Related upstream report (Go client analogue): ClickHouse/clickhouse-go#136
Closed feature request for the same functionality in client-v2: #1838
