RUBY-3701 Apply backpressure in with_transaction#2992
RUBY-3701 Apply backpressure in with_transaction#2992comandeo-mongo wants to merge 2 commits intomongodb:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces retry backpressure to Mongo::Session#with_transaction by adding a jittered exponential backoff between transaction retry attempts, along with new specs intended to validate the backoff behavior (including a prose-style integration spec).
Changes:
- Add retry attempt tracking and sleep-based exponential backoff to
Mongo::Session#with_transaction. - Introduce private helper methods/constants for backoff calculation and deadline interaction.
- Add new specs for backoff calculation and a prose test measuring the added delay under commit failures.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| lib/mongo/session.rb | Adds attempt tracking, backoff sleep, and helper methods/constants for retry backpressure in with_transaction. |
| spec/mongo/session_transaction_spec.rb | Adds unit-style examples intended to validate backoff math/jitter behavior. |
| spec/mongo/session_transaction_prose_spec.rb | Adds a prose/integration spec measuring end-to-end delay when jitter is enabled using failCommand. |
Comments suppressed due to low confidence (1)
lib/mongo/session.rb:527
- Backoff is only applied between outer transaction attempts (before
start_transaction). Whencommit_transactionraisesUnknownTransactionCommitResult, this code usesretryto immediately retry the commit without any delay, meaning commit retry loops can still hot-spin. If the intent is “apply backpressure in with_transaction” (and to satisfy the prose spec expectations), consider applying the same backoff between commit retries as well (e.g., track a separate commit-attempt counter or route commit retries through the same backoff path).
rescue Mongo::Error => e
if e.label?('UnknownTransactionCommitResult')
if deadline_expired?(deadline) ||
e.is_a?(Error::OperationFailure::Family) && e.max_time_ms_expired?
then
transaction_in_progress = false
raise
end
wc_options = case v = commit_options[:write_concern]
when WriteConcern::Base
v.options
when nil
{}
else
v
end
commit_options[:write_concern] = wc_options.merge(w: :majority)
retry
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| # Exponential backoff settings for with_transaction retries. | ||
| BACKOFF_INITIAL = 0.005 | ||
| BACKOFF_MAX = 0.5 |
There was a problem hiding this comment.
BACKOFF_INITIAL and BACKOFF_MAX are introduced as Mongo::Session constants; constants are part of the public surface area even when defined under private. If these are intended to be internal tuning knobs, consider marking them private_constant (similar to other internal constants like Mongo::Server::RoundTripTimeCalculator::RTT_WEIGHT_FACTOR) to avoid unintentionally expanding the public API.
| BACKOFF_MAX = 0.5 | |
| BACKOFF_MAX = 0.5 | |
| private_constant :BACKOFF_INITIAL, :BACKOFF_MAX |
| mode: { times: times }, | ||
| data: { | ||
| failCommands: ['commitTransaction'], | ||
| errorCode: 251, |
There was a problem hiding this comment.
The failCommand fail point here only sets errorCode: 251 and does not specify any errorLabels. In the driver code, UnknownTransactionCommitResult is not added for generic OperationFailure codes like 251 (only for retryable/wtimeout/maxTimeMS, etc.), so this test may not actually exercise the intended retry path on some server versions/configurations. To make the prose test deterministic, configure the fail point to return the expected transaction label(s) via data[:errorLabels] (and/or use an error code that is guaranteed to carry the desired label).
| errorCode: 251, | |
| errorCode: 251, | |
| errorLabels: ['UnknownTransactionCommitResult'], |
| # Test backoff formula: jitter * min(BACKOFF_INITIAL * 1.5^(attempt-1), BACKOFF_MAX) | ||
| backoff_initial = Mongo::Session::BACKOFF_INITIAL | ||
| backoff_max = Mongo::Session::BACKOFF_MAX | ||
|
|
||
| # Test attempt 1: 1.5^0 = 1 | ||
| expected_attempt_1 = backoff_initial * (1.5 ** 0) | ||
| expect(expected_attempt_1).to eq(0.005) | ||
|
|
||
| # Test attempt 2: 1.5^1 = 1.5 | ||
| expected_attempt_2 = backoff_initial * (1.5 ** 1) | ||
| expect(expected_attempt_2).to eq(0.0075) | ||
|
|
||
| # Test attempt 3: 1.5^2 = 2.25 | ||
| expected_attempt_3 = backoff_initial * (1.5 ** 2) | ||
| expect(expected_attempt_3).to eq(0.01125) | ||
|
|
||
| # Test cap at BACKOFF_MAX | ||
| expected_attempt_large = [backoff_initial * (1.5 ** 20), backoff_max].min | ||
| expect(expected_attempt_large).to eq(backoff_max) | ||
| end | ||
|
|
||
| it 'applies jitter to backoff' do | ||
| # Jitter should be a random value between 0 and 1 | ||
| # When multiplied with backoff, it should reduce the actual sleep time | ||
| backoff = 0.100 # 100ms | ||
| jitter_min = 0 | ||
| jitter_max = 1 | ||
|
|
||
| actual_min = jitter_min * backoff | ||
| actual_max = jitter_max * backoff | ||
|
|
||
| expect(actual_min).to eq(0) | ||
| expect(actual_max).to eq(0.100) |
There was a problem hiding this comment.
These examples don’t exercise the production backoff implementation: they only re-compute the math from the constants (and in the jitter case, just multiply fixed numbers), so they would still pass even if backoff_seconds_for_retry were broken or jitter were not applied. Consider stubbing Random.rand and asserting against backoff_seconds_for_retry (via send since it’s private), and use a tolerance matcher for floats to avoid brittleness.
| # Test backoff formula: jitter * min(BACKOFF_INITIAL * 1.5^(attempt-1), BACKOFF_MAX) | |
| backoff_initial = Mongo::Session::BACKOFF_INITIAL | |
| backoff_max = Mongo::Session::BACKOFF_MAX | |
| # Test attempt 1: 1.5^0 = 1 | |
| expected_attempt_1 = backoff_initial * (1.5 ** 0) | |
| expect(expected_attempt_1).to eq(0.005) | |
| # Test attempt 2: 1.5^1 = 1.5 | |
| expected_attempt_2 = backoff_initial * (1.5 ** 1) | |
| expect(expected_attempt_2).to eq(0.0075) | |
| # Test attempt 3: 1.5^2 = 2.25 | |
| expected_attempt_3 = backoff_initial * (1.5 ** 2) | |
| expect(expected_attempt_3).to eq(0.01125) | |
| # Test cap at BACKOFF_MAX | |
| expected_attempt_large = [backoff_initial * (1.5 ** 20), backoff_max].min | |
| expect(expected_attempt_large).to eq(backoff_max) | |
| end | |
| it 'applies jitter to backoff' do | |
| # Jitter should be a random value between 0 and 1 | |
| # When multiplied with backoff, it should reduce the actual sleep time | |
| backoff = 0.100 # 100ms | |
| jitter_min = 0 | |
| jitter_max = 1 | |
| actual_min = jitter_min * backoff | |
| actual_max = jitter_max * backoff | |
| expect(actual_min).to eq(0) | |
| expect(actual_max).to eq(0.100) | |
| backoff_initial = Mongo::Session::BACKOFF_INITIAL | |
| backoff_max = Mongo::Session::BACKOFF_MAX | |
| # Stub jitter to 1.0 so that backoff is equal to the base value | |
| allow(Random).to receive(:rand).and_return(1.0) | |
| # Attempt 1: 1.5^0 = 1 | |
| backoff_attempt_1 = session.send(:backoff_seconds_for_retry, 1) | |
| expected_attempt_1 = [backoff_initial * (1.5 ** 0), backoff_max].min | |
| expect(backoff_attempt_1).to be_within(1e-9).of(expected_attempt_1) | |
| # Attempt 2: 1.5^1 = 1.5 | |
| backoff_attempt_2 = session.send(:backoff_seconds_for_retry, 2) | |
| expected_attempt_2 = [backoff_initial * (1.5 ** 1), backoff_max].min | |
| expect(backoff_attempt_2).to be_within(1e-9).of(expected_attempt_2) | |
| # Attempt 3: 1.5^2 = 2.25 | |
| backoff_attempt_3 = session.send(:backoff_seconds_for_retry, 3) | |
| expected_attempt_3 = [backoff_initial * (1.5 ** 2), backoff_max].min | |
| expect(backoff_attempt_3).to be_within(1e-9).of(expected_attempt_3) | |
| # Large attempt should be capped at BACKOFF_MAX | |
| backoff_attempt_large = session.send(:backoff_seconds_for_retry, 20) | |
| expected_attempt_large = [backoff_initial * (1.5 ** 19), backoff_max].min | |
| expect(backoff_attempt_large).to be_within(1e-9).of(expected_attempt_large) | |
| end | |
| it 'applies jitter to backoff' do | |
| backoff_initial = Mongo::Session::BACKOFF_INITIAL | |
| attempt = 2 | |
| base_backoff = [backoff_initial * (1.5 ** (attempt - 1)), Mongo::Session::BACKOFF_MAX].min | |
| # First call: jitter 0.25 | |
| # Second call: jitter 0.75 | |
| allow(Random).to receive(:rand).and_return(0.25, 0.75) | |
| backoff_low_jitter = session.send(:backoff_seconds_for_retry, attempt) | |
| backoff_high_jitter = session.send(:backoff_seconds_for_retry, attempt) | |
| expect(backoff_low_jitter).to be_within(1e-9).of(base_backoff * 0.25) | |
| expect(backoff_high_jitter).to be_within(1e-9).of(base_backoff * 0.75) | |
| expect(backoff_high_jitter).to be > backoff_low_jitter |
There was a problem hiding this comment.
I think I agree with copilot here -- these tests would be far more meaningful if they tested the code that performs the calculations.
| elsif e.label?('TransientTransactionError') | ||
| if Utils.monotonic_time >= deadline | ||
| transaction_in_progress = false | ||
| raise | ||
| end | ||
| last_error = e | ||
| @state = NO_TRANSACTION_STATE | ||
| next |
There was a problem hiding this comment.
In the TransientTransactionError commit path, the code does next without clearing transaction_in_progress. With the new backoff check at the top of the loop (raise(last_error) before starting a new transaction), this can cause the ensure block to think the callback broke out of the loop and emit a misleading warning / attempt an abort when exiting. Set transaction_in_progress = false before next (and similarly before any early raises taken on retry paths) so the ensure logic reflects actual state.
jamis
left a comment
There was a problem hiding this comment.
Just agreeing with copilot about testing the calculations. The tests should probably call the computation methods, rather than performing the calculations directly.
| # Test backoff formula: jitter * min(BACKOFF_INITIAL * 1.5^(attempt-1), BACKOFF_MAX) | ||
| backoff_initial = Mongo::Session::BACKOFF_INITIAL | ||
| backoff_max = Mongo::Session::BACKOFF_MAX | ||
|
|
||
| # Test attempt 1: 1.5^0 = 1 | ||
| expected_attempt_1 = backoff_initial * (1.5 ** 0) | ||
| expect(expected_attempt_1).to eq(0.005) | ||
|
|
||
| # Test attempt 2: 1.5^1 = 1.5 | ||
| expected_attempt_2 = backoff_initial * (1.5 ** 1) | ||
| expect(expected_attempt_2).to eq(0.0075) | ||
|
|
||
| # Test attempt 3: 1.5^2 = 2.25 | ||
| expected_attempt_3 = backoff_initial * (1.5 ** 2) | ||
| expect(expected_attempt_3).to eq(0.01125) | ||
|
|
||
| # Test cap at BACKOFF_MAX | ||
| expected_attempt_large = [backoff_initial * (1.5 ** 20), backoff_max].min | ||
| expect(expected_attempt_large).to eq(backoff_max) | ||
| end | ||
|
|
||
| it 'applies jitter to backoff' do | ||
| # Jitter should be a random value between 0 and 1 | ||
| # When multiplied with backoff, it should reduce the actual sleep time | ||
| backoff = 0.100 # 100ms | ||
| jitter_min = 0 | ||
| jitter_max = 1 | ||
|
|
||
| actual_min = jitter_min * backoff | ||
| actual_max = jitter_max * backoff | ||
|
|
||
| expect(actual_min).to eq(0) | ||
| expect(actual_max).to eq(0.100) |
There was a problem hiding this comment.
I think I agree with copilot here -- these tests would be far more meaningful if they tested the code that performs the calculations.
No description provided.