
Add deployment docs and improve test coverage#326

Open
bootjp wants to merge 6 commits into main from feature/prd-workload

Conversation


@bootjp bootjp commented Feb 27, 2026

No description provided.

Copilot AI review requested due to automatic review settings February 27, 2026 03:46
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the project's operational readiness and robustness. It introduces detailed documentation for deploying a multi-node Elastickv cluster using Docker, providing clear instructions for manual setup on multiple virtual machines. Concurrently, it improves the reliability of the system through expanded test coverage, including a new test for Raft's dynamic cluster membership changes and a refactoring of DynamoDB compatibility tests. Additionally, the PR refines the Raft bootstrap mechanism by allowing explicit definition of initial cluster members and addresses a minor bug in DynamoDB JSON attribute unmarshalling.

Highlights

  • Deployment Documentation: A new comprehensive guide has been added for manually deploying a multi-node Elastickv cluster using Docker across multiple VMs, providing clear instructions for setup.
  • Raft Bootstrap Enhancement: A new --raftBootstrapMembers flag was introduced, allowing users to specify a fixed list of Raft cluster members during bootstrap, which improves control over initial cluster formation.
  • Improved Test Coverage: Test coverage has been expanded with a new integration test for the Raft AddVoter functionality and existing DynamoDB API compatibility tests were refactored into more granular sub-tests for better organization.
  • DynamoDB JSON Unmarshalling Fix: A bug in the attributeValue JSON unmarshalling logic within the DynamoDB adapter has been corrected to properly handle single-key maps.
  • Code Refactoring: Server startup and Raft cluster joining logic have been refactored for improved modularity, readability, and error handling.
Changelog
  • README.md
    • Updated to include a link to the new deployment documentation.
  • adapter/add_voter_join_test.go
    • Added a new test case to verify that a new voter can successfully join the Raft cluster and serve traffic.
  • adapter/dynamodb.go
    • Added comments to clarify the purpose of StartTS in buildTransactWriteItemsRequest for transaction alignment.
  • adapter/dynamodb_table_compat_test.go
    • Refactored the DynamoDB table API compatibility tests into distinct sub-tests and introduced helper functions for better organization and readability.
  • adapter/dynamodb_types.go
    • Corrected the JSON unmarshalling logic for attributeValue to properly handle single-key maps.
  • cmd/server/demo.go
    • Refactored the joinCluster function by extracting retry logic and AddVoter await into dedicated helper functions (joinNodeWithRetry, tryJoinNode).
  • docs/docker_multinode_manual_run.md
    • Added a new document detailing the manual setup of a 4-5 node Elastickv cluster on multiple VMs using docker run.
  • main.go
    • Introduced a new command-line flag --raftBootstrapMembers.
    • Updated buildShardGroups to accept bootstrap servers.
    • Refactored the main server startup logic into a new startRuntimeServers function.
  • main_bootstrap_test.go
    • Added new unit tests for the resolveBootstrapServers function, covering various scenarios for bootstrap member resolution.
  • multiraft_runtime.go
    • Modified newRaftGroup to accept an optional list of bootstrapServers for initial Raft cluster configuration.
  • multiraft_runtime_test.go
    • Updated the TestNewRaftGroupBootstrap function to reflect changes in newRaftGroup signature.
  • shard_config.go
    • Defined a new error type ErrInvalidRaftBootstrapMembersEntry.
    • Implemented the parseRaftBootstrapMembers function to parse the new bootstrap members flag.
  • shard_config_test.go
    • Added new unit tests for the parseRaftBootstrapMembers function, including tests for valid parsing, whitespace trimming, and error conditions.
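The changelog above describes a parseRaftBootstrapMembers function that parses the new flag value with whitespace trimming and error reporting. A minimal sketch of such a parser, assuming a hypothetical comma-separated `id=host:port` entry format — the PR's actual syntax, names, and error type may differ:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// BootstrapMember pairs a Raft node ID with its advertised address.
type BootstrapMember struct {
	ID      string
	Address string
}

var errInvalidEntry = errors.New("invalid raft bootstrap members entry")

// parseBootstrapMembers parses a flag value such as
// "n1=10.0.0.11:50051,n2=10.0.0.12:50051", trimming whitespace around
// each entry and rejecting malformed ones.
func parseBootstrapMembers(raw string) ([]BootstrapMember, error) {
	if strings.TrimSpace(raw) == "" {
		return nil, nil // empty flag: fall back to default bootstrap behavior
	}
	var members []BootstrapMember
	for _, entry := range strings.Split(raw, ",") {
		entry = strings.TrimSpace(entry)
		id, addr, ok := strings.Cut(entry, "=")
		if !ok || id == "" || addr == "" {
			return nil, fmt.Errorf("%w: %q", errInvalidEntry, entry)
		}
		members = append(members, BootstrapMember{ID: id, Address: addr})
	}
	return members, nil
}
```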


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable feature for bootstrapping a multi-node cluster from a static configuration, complete with documentation and tests. It also improves test coverage by adding a new integration test for AddVoter and refactoring a large test for better readability. The changes are generally solid. I have a couple of suggestions to improve test robustness and address the use of a deprecated function.

return nodes[0].raft.State() == raft.Leader
}, waitTimeout, waitInterval)

adminConn, err := grpc.NewClient(nodes[0].grpcAddress, grpc.WithTransportCredentials(insecure.NewCredentials()))

medium

grpc.NewClient is deprecated and should be avoided in new code. Please use grpc.Dial instead. This also applies to the other grpc.NewClient calls in this test on lines 158 and 166.

Suggested change
adminConn, err := grpc.NewClient(nodes[0].grpcAddress, grpc.WithTransportCredentials(insecure.NewCredentials()))
adminConn, err := grpc.Dial(nodes[0].grpcAddress, grpc.WithTransportCredentials(insecure.NewCredentials()))

})
require.NoError(t, err)
require.Empty(t, getOut.Item)
t.Run("ThreadItemOperations", func(t *testing.T) {

medium

This sub-test, and others that follow, are not independent and rely on state created by the TableLifecycle sub-test. This means running a sub-test in isolation (e.g., go test -run TestDynamoDB_TableAPICompatibility/ThreadItemOperations) will fail.

To make the tests more robust and easier to debug, consider making each sub-test independent by performing its own setup. For example, you could call createThreadsTable(t) and createMessagesTable(t) at the beginning of each sub-test that requires them.
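The independence idea above — each sub-test performing its own setup — works cleanly when the setup helpers are idempotent, so whichever sub-test runs first can create what it needs. A toy sketch of that pattern; the registry type and method names are hypothetical stand-ins, not the PR's createThreadsTable/createMessagesTable helpers:

```go
package main

// tableRegistry stands in for the test's DynamoDB endpoint. Because
// ensureTable is idempotent, every sub-test can declare the tables it
// needs up front, and running a single sub-test in isolation still works.
type tableRegistry struct {
	tables map[string]bool
}

func newTableRegistry() *tableRegistry {
	return &tableRegistry{tables: map[string]bool{}}
}

// ensureTable creates the table if it is missing; it reports whether a
// new table was created so callers can decide whether to seed fixtures.
func (r *tableRegistry) ensureTable(name string) bool {
	if r.tables[name] {
		return false // already present: safe to call from any sub-test
	}
	r.tables[name] = true
	return true
}
```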

Copilot AI left a comment

Pull request overview

This PR adds support for bootstrapping a Raft cluster with an explicit member list, expands/organizes test coverage around bootstrapping and join behavior, and adds a multi-node Docker deployment runbook.

Changes:

  • Add --raftBootstrapMembers support (parsing + validation) and plumb explicit bootstrap server configuration into Raft startup.
  • Refactor startup code to use a startRuntimeServers helper and improve demo cluster join logic to await AddVoter.
  • Add/reshape tests for bootstrap parsing/validation and Raft join path; add deployment documentation.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
shard_config.go Adds parseRaftBootstrapMembers + new error for invalid entries.
shard_config_test.go Adds unit tests for parsing bootstrap member strings.
main.go Adds --raftBootstrapMembers, validates/derives bootstrap server list, plumbs into runtime startup; introduces startRuntimeServers.
main_bootstrap_test.go Adds tests for bootstrap server resolution/validation behavior.
multiraft_runtime.go Extends newRaftGroup to accept explicit bootstrap server config.
multiraft_runtime_test.go Updates newRaftGroup call site for new signature.
cmd/server/demo.go Refactors join logic into helpers and awaits AddVoter completion.
adapter/add_voter_join_test.go Adds an integration-style test covering AddVoter join + serving adapter traffic.
adapter/dynamodb_types.go Refactors AttributeValue JSON unmarshal path.
adapter/dynamodb_table_compat_test.go Refactors large compatibility test into subtests + helper closures.
adapter/dynamodb.go Adds clarification comment on transaction StartTS alignment with condition evaluation snapshot.
docs/docker_multinode_manual_run.md Adds multi-VM manual docker run runbook for 4–5 node clusters.
README.md Links the new deployment/runbook document.


Comment on lines +23 to +27
func TestAddVoterJoinPath_RegistersMemberAndServesAdapterTraffic(t *testing.T) {
const (
waitTimeout = 8 * time.Second
waitInterval = 100 * time.Millisecond
)
Copilot AI Feb 27, 2026

Adapter unit tests in this repo generally call t.Parallel() at the start of each top-level test; this new test doesn’t, which can unnecessarily serialize the adapter test suite. Consider adding t.Parallel() here (ports are already dynamically assigned).

Comment on lines 112 to 115
t.Run("TableLifecycle", func(t *testing.T) {
createThreadsTable(t)
createMessagesTable(t)

Copilot AI Feb 27, 2026

These subtests share mutable state but only create the tables inside the TableLifecycle subtest. Running an individual subtest via go test -run TestDynamoDB_TableAPICompatibility/ThreadItemOperations will fail because the tables were never created. Move table creation to a shared setup before the t.Run calls, or have each subtest create (and optionally delete) the tables it depends on.

Suggested change
t.Run("TableLifecycle", func(t *testing.T) {
createThreadsTable(t)
createMessagesTable(t)
// Ensure required tables exist before running any subtests so that
// individual subtests can be run in isolation (e.g. with -run).
createThreadsTable(t)
createMessagesTable(t)
t.Run("TableLifecycle", func(t *testing.T) {

Comment on lines +150 to +164
var (
	kind     string
	valueRaw json.RawMessage
	found    bool
)
for k, v := range raw {
	kind = k
	valueRaw = v
	found = true
	break
}
if !found {
	return errors.New("invalid attribute value")
}
return a.unmarshalSingleAttributeValue(kind, valueRaw)
Copilot AI Feb 27, 2026

Given len(raw) != 1 is already rejected above, the found flag and the if !found branch are redundant (the loop will always execute exactly once). This can be simplified by directly extracting the single map entry (or returning from the range loop as before).

Comment on lines +1 to +4
# Manually deploying a 4–5 node Elastickv cluster across multiple VMs with Docker (without docker compose)

This document describes how to build an Elastickv Raft cluster across **multiple VMs** using only `docker run`.
`docker compose` is not used. **Running all nodes on a single VM is out of scope.**
Copilot AI Feb 27, 2026

This deployment/runbook doc is entirely in Japanese, while the existing docs/README are written in English. For consistency and broader accessibility, consider adding an English version (or at least an English summary at the top) so non-Japanese readers can follow the deployment steps.

Comment on lines +205 to +212
localAddr := groups[0].address
for _, s := range servers {
if string(s.ID) != raftID {
continue
}
if string(s.Address) != localAddr {
return nil, errors.WithStack(errors.Wrapf(ErrBootstrapMembersLocalAddrMismatch, "expected %q got %q", localAddr, s.Address))
}
Copilot AI Feb 27, 2026

resolveBootstrapServers compares bootstrap member address against groups[0].address, but the error text says it must match --address. This can mislead when --raftGroups is provided (group address may differ from --address). Either validate against the actual --address value (pass it in) or adjust the error/message to reference the group address/--raftGroups instead of --address.
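One way to address the mismatch flagged above is to pass the address being validated in explicitly, so the error message names exactly what was compared. This is a hedged sketch — the `server` struct stands in for raft.Server, and the function name and signature are hypothetical, not the PR's resolveBootstrapServers:

```go
package main

import "fmt"

// server is a minimal stand-in for raft.Server.
type server struct {
	ID      string
	Address string
}

// validateLocalMember checks that the bootstrap member whose ID matches
// the local raftID advertises exactly localAddr. Taking localAddr as a
// parameter (rather than reading it from group state inside) keeps the
// error text honest about which flag or config value was compared.
func validateLocalMember(servers []server, raftID, localAddr string) error {
	for _, s := range servers {
		if s.ID != raftID {
			continue
		}
		if s.Address != localAddr {
			return fmt.Errorf("bootstrap member %q address mismatch: expected %q, got %q",
				raftID, localAddr, s.Address)
		}
		return nil
	}
	return fmt.Errorf("bootstrap members do not include local node %q", raftID)
}
```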

Comment on lines +36 to +40
bound, ls, retry, err := bindListeners(ctx, &lc, ports[i])
require.NoError(t, err)
require.False(t, retry)
ports[i] = bound
listeners = append(listeners, ls)
Copilot AI Mar 1, 2026

This test assumes bindListeners will never return retry=true (EADDRINUSE), and fails immediately if it does. Since ports are assigned globally, collisions can happen in parallel test runs, making this test flaky. Use the existing retry loop pattern from setupNodes (or call setupNodes/createNode) to reassign ports until listeners bind successfully.

Suggested change
bound, ls, retry, err := bindListeners(ctx, &lc, ports[i])
require.NoError(t, err)
require.False(t, retry)
ports[i] = bound
listeners = append(listeners, ls)
for {
bound, ls, retry, err := bindListeners(ctx, &lc, ports[i])
require.NoError(t, err)
if !retry {
ports[i] = bound
listeners = append(listeners, ls)
break
}
// Port was in use; get a new suggestion and retry binding.
newPorts := assignPorts(1)
ports[i] = newPorts[0]
}

Comment on lines +80 to +90
docker run -d \
--name elastickv \
--restart unless-stopped \
--network host \
-v /var/lib/elastickv:/var/lib/elastickv \
ghcr.io/bootjp/elastickv:latest /app \
--address "10.0.0.11:50051" \
--redisAddress "0.0.0.0:6379" \
--dynamoAddress "0.0.0.0:8000" \
--raftId "n1" \
--raftDataDir "/var/lib/elastickv" \
Copilot AI Feb 27, 2026

The docker run examples here start the Elastickv container with --network host and bind --redisAddress/--dynamoAddress to 0.0.0.0, while also exposing the gRPC --address on a non-loopback IP. This combination publishes unauthenticated Redis, DynamoDB-compatible, and Raft admin/data-plane gRPC endpoints to the entire host network, so any machine with network access can read/modify data or reconfigure the cluster using redis-cli/grpcurl without credentials. To avoid unauthorized access, restrict these bindings to trusted interfaces and/or enforce network-level ACLs and application-layer authentication/TLS for Redis, DynamoDB, and Raft gRPC before recommending this setup for multi-VM deployments.

@bootjp bootjp requested a review from Copilot March 1, 2026 16:20

bootjp commented Mar 1, 2026

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant feature for cluster bootstrapping using a fixed list of members via the --raftBootstrapMembers flag, which is very useful for environments without dynamic service discovery. The implementation is well-supported with new documentation, unit tests, and an end-to-end test.

Additionally, the PR improves test coverage with a new integration test for the AddVoter join path and refactors an existing large test for better maintainability. The demo server's cluster joining logic has also been made more robust.

My review found one medium-severity issue in the new add_voter_join_test.go test regarding error handling in goroutines, which could lead to flaky test results. Overall, this is a high-quality contribution that enhances the project's deployment flexibility and test robustness.

raftadmin.Register(s, r)

go func(srv *grpc.Server, l net.Listener) {
assert.NoError(t, srv.Serve(l))

medium

The assert.NoError call inside a goroutine can lead to race conditions in test reporting and will fail on expected shutdown errors. When the server is shut down during test cleanup, srv.Serve(l) will return an error (e.g., grpc.ErrServerStopped), causing assert.NoError to fail. This will mark the test as failed, but it won't stop the test's execution.

A better approach is to check for expected shutdown errors and only assert on unexpected ones. This applies to the other two goroutines starting the Redis and DynamoDB servers as well (lines 177 and 182).

if err := srv.Serve(l); err != nil && err != grpc.ErrServerStopped {
	assert.NoError(t, err)
}

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.



Comment on lines +129 to +134
var lc net.ListenConfig
l, err := lc.Listen(context.Background(), "tcp", "127.0.0.1:0")
require.NoError(t, err)
addr := l.Addr().String()
require.NoError(t, l.Close())
return addr
Copilot AI Mar 1, 2026

reserveTCPAddr closes the listener and returns the address, which is inherently racy (another process/test can grab the port before the node binds). This can make the E2E test flaky; consider allocating listeners and keeping them open until startRuntimeServers begins serving, or add a retry loop around server startup to handle transient "address already in use" failures.

Suggested change
var lc net.ListenConfig
l, err := lc.Listen(context.Background(), "tcp", "127.0.0.1:0")
require.NoError(t, err)
addr := l.Addr().String()
require.NoError(t, l.Close())
return addr
// Return a wildcard localhost address and let the actual server bind
// to an ephemeral port, avoiding the racy "reserve then close" pattern.
return "127.0.0.1:0"

Comment on lines +111 to +123
_, err := clients[writerIdx].RawPut(context.Background(), &pb.RawPutRequest{Key: key, Value: value})
require.NoError(t, err)

for i := range clients {
client := clients[i]
require.Eventually(t, func() bool {
resp, getErr := client.RawGet(context.Background(), &pb.RawGetRequest{Key: key})
if getErr != nil {
return false
}
return resp.Exists && bytes.Equal(resp.Value, value)
}, waitTimeout, waitInterval)
}
Copilot AI Mar 1, 2026

These RPCs use context.Background() and the conns are configured with grpc.WaitForReady(true). If a node is down/unreachable, WaitForReady can cause the call to block indefinitely (no deadline), which can hang the test beyond require.Eventually's timeout. Prefer per-RPC contexts with timeouts/deadlines (or avoid WaitForReady in polling loops).

Comment on lines +294 to +301
conn, err := grpc.NewClient(ep.raftAddr,
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithDefaultCallOptions(grpc.WaitForReady(true)),
)
require.NoError(t, err)
conns = append(conns, conn)
clients = append(clients, pb.NewRawKVClient(conn))
}
Copilot AI Mar 1, 2026

The client conns are created with grpc.WithDefaultCallOptions(grpc.WaitForReady(true)), but the test calls RPCs with contexts that have no deadlines. In failure scenarios this can block indefinitely and hang the test run; consider removing WaitForReady here or ensuring every RPC uses a bounded context (e.g., context.WithTimeout).

Comment on lines +52 to +66
followerConn, err := grpc.NewClient(nodes[1].grpcAddress,
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithDefaultCallOptions(grpc.WaitForReady(true)),
)
require.NoError(t, err)
t.Cleanup(func() { _ = followerConn.Close() })
followerRaw := pb.NewRawKVClient(followerConn)

leaderConn, err := grpc.NewClient(nodes[0].grpcAddress,
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithDefaultCallOptions(grpc.WaitForReady(true)),
)
require.NoError(t, err)
t.Cleanup(func() { _ = leaderConn.Close() })
leaderRaw := pb.NewRawKVClient(leaderConn)
Copilot AI Mar 1, 2026

These gRPC conns are configured with grpc.WaitForReady(true), but later RPCs use a context.Background()-derived context with no deadline. If a node becomes unavailable, calls can block indefinitely and hang the test; prefer per-RPC deadlines/timeouts (or avoid WaitForReady for these test clients).

main.go Outdated
Comment on lines +211 to +215
return nil, errors.WithStack(errors.Wrapf(ErrBootstrapMembersLocalAddrMismatch, "expected %q got %q", localAddr, s.Address))
}
return servers, nil
}
return nil, errors.WithStack(errors.Wrapf(ErrBootstrapMembersMissingLocalNode, "raftId=%q", raftID))
Copilot AI Mar 1, 2026

errors.Wrapf already records stack/context (via cockroachdb/errors). Wrapping the result again with errors.WithStack(...) is redundant and slightly obscures the error chain; consider returning errors.Wrapf(...) directly here (and similarly for the MissingLocalNode case below).

Suggested change
return nil, errors.WithStack(errors.Wrapf(ErrBootstrapMembersLocalAddrMismatch, "expected %q got %q", localAddr, s.Address))
}
return servers, nil
}
return nil, errors.WithStack(errors.Wrapf(ErrBootstrapMembersMissingLocalNode, "raftId=%q", raftID))
return nil, errors.Wrapf(ErrBootstrapMembersLocalAddrMismatch, "expected %q got %q", localAddr, s.Address)
}
return servers, nil
}
return nil, errors.Wrapf(ErrBootstrapMembersMissingLocalNode, "raftId=%q", raftID)

@bootjp bootjp requested a review from Copilot March 2, 2026 20:04

bootjp commented Mar 2, 2026

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant improvement for cluster deployment by adding a --raftBootstrapMembers flag, allowing for static cluster bootstrapping. This is well documented in the new docker_multinode_manual_run.md guide. Test coverage is also enhanced with new end-to-end and unit tests for this feature, and existing tests have been refactored for better clarity. The demo cluster join logic has also been made more robust. My review found a few places in new test files where the deprecated grpc.NewClient function is used; I've suggested replacing it with grpc.Dial for better future maintainability.

return nodes[0].raft.State() == raft.Leader
}, waitTimeout, waitInterval)

adminConn, err := grpc.NewClient(nodes[0].grpcAddress, grpc.WithTransportCredentials(insecure.NewCredentials()))

medium

grpc.NewClient is deprecated and will be removed in a future version of google.golang.org/grpc. Please use grpc.Dial instead.

Suggested change
adminConn, err := grpc.NewClient(nodes[0].grpcAddress, grpc.WithTransportCredentials(insecure.NewCredentials()))
adminConn, err := grpc.Dial(nodes[0].grpcAddress, grpc.WithTransportCredentials(insecure.NewCredentials()))

Comment on lines +57 to +59
followerConn, err := grpc.NewClient(nodes[1].grpcAddress,
grpc.WithTransportCredentials(insecure.NewCredentials()),
)

medium

grpc.NewClient is deprecated and will be removed in a future version of google.golang.org/grpc. Please use grpc.Dial instead.

Suggested change
followerConn, err := grpc.NewClient(nodes[1].grpcAddress,
grpc.WithTransportCredentials(insecure.NewCredentials()),
)
followerConn, err := grpc.Dial(nodes[1].grpcAddress,
grpc.WithTransportCredentials(insecure.NewCredentials()),
)

Comment on lines +64 to +66
leaderConn, err := grpc.NewClient(nodes[0].grpcAddress,
grpc.WithTransportCredentials(insecure.NewCredentials()),
)

medium

grpc.NewClient is deprecated and will be removed in a future version of google.golang.org/grpc. Please use grpc.Dial instead.

Suggested change
leaderConn, err := grpc.NewClient(nodes[0].grpcAddress,
grpc.WithTransportCredentials(insecure.NewCredentials()),
)
leaderConn, err := grpc.Dial(nodes[0].grpcAddress,
grpc.WithTransportCredentials(insecure.NewCredentials()),
)

Comment on lines +368 to +370
conn, err := grpc.NewClient(ep.raftAddr,
grpc.WithTransportCredentials(insecure.NewCredentials()),
)

medium

grpc.NewClient is deprecated and will be removed in a future version of google.golang.org/grpc. Please use grpc.Dial instead.

Suggested change
conn, err := grpc.NewClient(ep.raftAddr,
grpc.WithTransportCredentials(insecure.NewCredentials()),
)
conn, err := grpc.Dial(ep.raftAddr,
grpc.WithTransportCredentials(insecure.NewCredentials()),
)

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.



Comment on lines +208 to +209
if err := waitForJoinRetry(ctx, joinRetryInterval); err != nil {
return joinClusterWaitError(err)
Copilot AI Mar 2, 2026

joinNodeWithRetry always waits joinRetryInterval after a failed attempt, including after the final retry. This unnecessarily delays failure reporting by an extra interval. Consider only sleeping when i < joinRetries-1, or restructure the loop so the sleep happens before the next attempt.

Suggested change
if err := waitForJoinRetry(ctx, joinRetryInterval); err != nil {
return joinClusterWaitError(err)
if i < joinRetries-1 {
if err := waitForJoinRetry(ctx, joinRetryInterval); err != nil {
return joinClusterWaitError(err)
}

Comment on lines +114 to +143
func startBootstrapE2ECluster(
t *testing.T,
baseDir string,
nodeCount int,
startupAttempts int,
) ([]bootstrapE2EEndpoint, []*bootstrapE2ENode) {
t.Helper()

var (
lastErr error
nodes []*bootstrapE2ENode
)

for attempt := 0; attempt < startupAttempts; attempt++ {
endpoints := allocateBootstrapE2EEndpoints(t, nodeCount)
started, err := tryStartBootstrapE2ECluster(baseDir, endpoints)
if err == nil {
return endpoints, started
}
closeBootstrapE2ENodesIgnoreError(started)
lastErr = err
if !isAddressInUseError(err) {
break
}
nodes = nil
}

require.NoError(t, lastErr)
return nil, nodes
}
Copilot AI Mar 2, 2026

startBootstrapE2ECluster retries startup on "address already in use", but it reuses the same baseDir across attempts while node IDs stay the same (n1..nN). If a partial attempt writes Raft state before failing, the next attempt can pick up stale state (with different addresses) and behave unpredictably. Use a per-attempt subdirectory (or a fresh temp dir) when retrying, or clean up the prior attempt’s Raft data before retrying.

title, ok := getOut.Item["title"].(*ddbTypes.AttributeValueMemberS)
require.True(t, ok)
require.Equal(t, "title1", title.Value)
// Shared setup for all sub-tests so each sub-test can run in isolation with -run.
Copilot AI Mar 2, 2026

The comment claims each sub-test can run in isolation with -run, but the sub-tests share state (tables and items) and later sub-tests depend on earlier ones mutating data. Either make each sub-test fully self-contained (unique table names + seed required items inside each sub-test) or remove the isolation claim and keep this as a single end-to-end scenario test.

Suggested change
// Shared setup for all sub-tests so each sub-test can run in isolation with -run.
// Shared setup for all sub-tests.

Comment on lines +274 to 295
t.Run("TransactWriteItems", func(t *testing.T) {
	_, txErr := client.TransactWriteItems(ctx, &dynamodb.TransactWriteItemsInput{
		TransactItems: []ddbTypes.TransactWriteItem{
			{
				Put: &ddbTypes.Put{
					TableName: aws.String(threadsTable),
					Item: map[string]ddbTypes.AttributeValue{
						"threadId":    &ddbTypes.AttributeValueMemberS{Value: "t4"},
						"title":       &ddbTypes.AttributeValueMemberS{Value: "title4"},
						"createdAt":   &ddbTypes.AttributeValueMemberS{Value: "2026-01-04T00:00:00Z"},
						"status":      &ddbTypes.AttributeValueMemberS{Value: "pending"},
						"accessToken": &ddbTypes.AttributeValueMemberS{Value: ""},
					},
				},
			},
		},
	})
	require.NoError(t, txErr)

	queryPendingAfterTransact := queryThreadsByStatus(t, "pending", false)
	require.Len(t, queryPendingAfterTransact.Items, 2)
})

Copilot AI Mar 2, 2026

TransactWriteItems asserts there are 2 pending items after inserting t4, but this relies on ThreadItemOperations having already inserted and updated t1-t3 in the same table. If this sub-test is run alone (e.g., -run TestDynamoDB_TableAPICompatibility/TransactWriteItems), it will likely fail. Seed the required items within this sub-test (or use separate tables per sub-test) so the expectation is local.

Comment on lines +165 to +188
```bash
GRPCURL_IMG="fullstorydev/grpcurl:v1.9.3"

# Wait for every node gRPC endpoint
for ip in 10.0.0.11 10.0.0.12 10.0.0.13 10.0.0.14 10.0.0.15; do
until docker run --rm --network host "${GRPCURL_IMG}" \
-plaintext "${ip}:50051" list >/dev/null 2>&1; do
sleep 1
done
done

# Check Raft members
docker run --rm --network host "${GRPCURL_IMG}" \
-plaintext -d '{}' 10.0.0.11:50051 RaftAdmin/GetConfiguration
```

For a 4-node cluster, remove `10.0.0.15` from the loop.

## 6) Validate Read/Write

Check leader:

```bash
docker run --rm --network host fullstorydev/grpcurl:v1.9.3 \
-plaintext -d '{}' 10.0.0.11:50051 RaftAdmin/Leader
```

Copilot AI Mar 2, 2026

These commands pull and run the third-party image fullstorydev/grpcurl:v1.9.3 directly with docker run --network host using only a mutable tag, which creates a supply-chain risk. If the image registry or publisher is compromised and the v1.9.3 tag is repointed, an attacker-controlled image would run inside your cluster node with full gRPC network access to Elastickv and could exfiltrate or corrupt data. Consider pinning the image to a known-good immutable digest (or hosting/compiling your own grpcurl binary or internal image) and documenting how operators should verify the provenance of the tool before use.

@bootjp

bootjp commented Mar 2, 2026

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant improvement for cluster bootstrapping by adding the --raftBootstrapMembers flag, allowing for a declarative, fixed-member startup. This is a more robust alternative to dynamic joins for initial cluster formation. The implementation is well-tested with new unit and end-to-end tests. Additionally, the PR improves existing test coverage by refactoring a large test into more manageable sub-tests and enhances the demo server's join logic to be more reliable. The new deployment documentation is clear and helpful. Overall, this is a high-quality contribution that improves both functionality and maintainability.


```bash
# Pin to an immutable digest (replace with a verified digest from your supply-chain policy).
GRPCURL_IMG="fullstorydev/grpcurl@sha256:<verified_digest>"
```

medium

While it's great to recommend pinning to a SHA256 digest for security, providing a placeholder <verified_digest> makes it harder for users to follow this guide. Consider providing a specific, known-good digest as an example, along with instructions on how a user can find and verify the latest digest for themselves. This would make the documentation more self-contained and user-friendly.
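One way to make that guidance actionable in the doc. These commands assume a local Docker daemon, and the digest placeholder is illustrative rather than a verified value:

```bash
# Resolve the tag to the digest it currently points at.
docker pull fullstorydev/grpcurl:v1.9.3
docker inspect --format '{{index .RepoDigests 0}}' fullstorydev/grpcurl:v1.9.3
# prints something like fullstorydev/grpcurl@sha256:<64-hex-digest>

# After verifying the digest against your supply-chain policy, pin it:
GRPCURL_IMG="fullstorydev/grpcurl@sha256:<64-hex-digest>"
```

Pulling by digest guarantees the same image bytes on every node, even if the `v1.9.3` tag is later repointed.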


Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.



Comment on lines +24 to +27
| Node ID | VM | IP |
| --- | --- | --- |
| n1 | vm1 | 10.0.0.11 |
| n2 | vm2 | 10.0.0.12 |

Copilot AI Mar 2, 2026

The tables use a || prefix on each row (e.g., || Node ID | VM | IP |), which renders as an extra empty column in most Markdown renderers. Consider switching to standard table syntax with single | delimiters so the topology/fault-tolerance tables render correctly on GitHub.

Comment on lines +229 to +233
```go
	await, err := client.Await(ctx, future)
	if err != nil {
		return errors.WithStack(err)
	}
	if await.GetError() != "" {
```

Copilot AI Mar 2, 2026

tryJoinNode uses the caller-provided ctx for both AddVoter and the blocking Await call. In the demo path this ctx is typically the errgroup context with no deadline, so Await can block indefinitely and prevent the retry loop from making progress. Use per-attempt context.WithTimeout (like other join-path tests do) or otherwise ensure ctx always has a deadline before calling Await.

Comment on lines +201 to +206
```go
	var lc net.ListenConfig
	l, err := lc.Listen(context.Background(), "tcp", "127.0.0.1:0")
	require.NoError(t, err)
	addr := l.Addr().String()
	require.NoError(t, l.Close())
	return addr
```

Copilot AI Mar 2, 2026

reserveTCPAddr closes the listener and returns the address, which does not actually reserve the port; another process/test can bind it before the node starts. This makes the E2E test inherently racy and reliant on retries. Prefer keeping listeners open until server startup (or using :0 binds at startup and plumbing the actual bound address through) to eliminate port-reuse flakes.
