
Optimize batch-read to avoid wasted disk IO #4741

Open

dao-jun wants to merge 9 commits into apache:master from dao-jun:dev/optimize_batch_read

Conversation

@dao-jun
Member

@dao-jun dao-jun commented Apr 9, 2026

Motivation

Batch reads currently fetch the next entry before checking whether it can still fit within the maxSize budget. When the last entry in the batch exceeds the remaining space, the read result is discarded, but the disk IO has already happened. This creates wasted IO on the critical read path.

This change avoids that extra read while preserving the existing batch-read behavior, including returning the first entry even when a single entry alone exceeds the requested maxSize.

Changes

  • Add a bounded-read path (readEntryIfFits / getEntryIfFits) through the batch-read stack, from BatchedReadEntryProcessor down to Bookie, LedgerDescriptor, LedgerStorage, and EntryLogger.
  • Update batch-read logic (see the sketch after this list) to:
    • read the first entry as before
    • compute the remaining response budget for subsequent entries
    • stop without reading the next entry when it cannot fit
  • Preserve partial-result behavior for missing subsequent entries, while allowing non-NoEntry storage failures after the first entry to propagate as an I/O error instead of being hidden by a partial response.
  • Implement size-aware entry reads in both DefaultEntryLogger and DirectEntryLogger by checking entry size from metadata before loading the full payload.
  • Wire the bounded-read path through all storage implementations, including:
    • InterleavedLedgerStorage
    • SortedLedgerStorage
    • DbLedgerStorage / SingleDirectoryDbLedgerStorage
  • Keep framing semantics consistent by treating the per-entry 4-byte delimiter as part of the size budget.
  • Update mock batch-read behavior to match the production semantics.
  • Add regression and boundary tests covering:
    • first entry larger than maxSize
    • exact-fit remaining budget
    • oversized subsequent entries
    • default entry logger and direct entry logger paths
    • interleaved, sorted, and DB ledger storage paths
    • mock and client-facing batch-read behavior
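
A minimal sketch of the resulting read loop. The readEntryIfFits call is the bounded-read API added by this PR, but its exact parameter order, the loop bounds, and the surrounding plumbing are simplified assumptions, not the literal implementation:

// Sketch of the bounded batch-read loop.
ByteBufList data = null;
long remaining = maxSize;
for (long entryId = startEntryId; entryId <= lastEntryId; entryId++) {
    ByteBuf entry;
    if (data == null) {
        // The first entry is always read and returned, even if it alone
        // exceeds the requested maxSize.
        entry = bookie.readEntry(ledgerId, entryId);
        data = ByteBufList.get(entry);
    } else {
        // Pass the remaining budget down; storage skips the payload read
        // entirely when the entry cannot fit and returns null instead.
        entry = bookie.readEntryIfFits(ledgerId, entryId, remaining);
        if (entry == null) {
            break; // would not fit: no disk IO wasted
        }
        data.add(entry);
    }
    // The per-entry 4-byte delimiter counts against the budget too.
    remaining -= entry.readableBytes() + 4;
}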

@dao-jun dao-jun closed this Apr 9, 2026
@dao-jun dao-jun reopened this Apr 9, 2026
Contributor

@eolivelli eolivelli left a comment


I can't find tests for the failure scenarios.

Memory leaks can happen in case of failures (e.g. while reading from storage), and the client may see the wrong error code.

@dao-jun
Member Author

dao-jun commented Apr 13, 2026

@eolivelli I added 3 tests in BatchedReadEntryProcessorTest, PTAL

@dao-jun dao-jun requested a review from eolivelli April 13, 2026 13:04
dao-jun added 2 commits May 8, 2026 16:18
# Conflicts:
#	bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/DefaultEntryLogger.java
@merlimat
Contributor

@dao-jun Thanks for the optimization. I went through the PR end-to-end — overall the direction is good and the framing-budget contract is clean and consistent across the storage stack. A few things I'd like to discuss before merge.

Overall

The core idea is sound: today the batch-read path reads each entry from disk and only then checks whether it fits in maxSize, throwing the result away on overflow. This PR threads the remaining response budget down through Bookie → LedgerDescriptor → LedgerStorage → EntryLogger, lets DefaultEntryLogger/DirectEntryLogger short-circuit by reading just the size header, and updates BatchedReadEntryProcessor to use the bounded API for entries 2..N (keeping the "always return the first entry" semantics).

The framing budget contract is consistent end-to-end (entry.readableBytes() + 4 > maxEntrySize everywhere, matching what BatchedReadEntryProcessor adds to frameSize). Tests are thorough — exact-fit, off-by-one budgets, oversized-first-entry, missing entries, IOException propagation, all the storage flavors. Reference counting on the new paths checks out: WriteCache.get()/ReadCache.get() return freshly-allocated buffers, readEntrySize returns a thread-local, and ByteBufList.deallocate releases members.
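
For concreteness, the entry-logger short-circuit described above boils down to this shape (a sketch, not the exact code; readEntrySize is the header-only read mentioned above, and the enclosing signature is simplified):

// Sketch of the size-aware read: consult the size header first, and only
// load the payload when it fits the remaining budget.
long entrySize = readEntrySize(ledgerId, entryId); // header-only, no payload IO
if (entrySize + Integer.BYTES > maxEntrySize) {
    return null; // signal "does not fit" without reading the payload
}
return readEntry(ledgerId, entryId); // fits: perform the normal full read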

Substantive concerns

1. Undocumented behavior change in exception handling — BatchedReadEntryProcessor.readData

} catch (Bookie.NoEntryException e) {
    if (data == null) { throw e; }
    break;
} catch (Throwable e) {
    if (data != null) { data.release(); }
    throw e;
}

The old code's catch (Throwable) silently broke and returned partial data on any error after the first entry — including transient IOExceptions. The new code only swallows NoEntryException; any other exception now propagates as EIO to the client, even if some entries were already read successfully. This is arguably more correct (hiding I/O errors is bad), but it's a real behavior change that isn't called out in the description and may surprise clients that relied on the lenient behavior. Could you call this out explicitly in the PR description?

2. Stat pollution in BookieImpl.readEntryIfFits

When entry == null (size-rejected), entrySize is 0, but the method still calls registerSuccessfulValue(0) on readBytesStats and records latency on readEntryStats. That mixes "actually read N bytes" and "size-rejected, didn't read" into the same histograms — the bytes histogram gets zero-spikes, and the latency histogram skews artificially low because size-rejected reads are much faster than full reads. Either skip recording on the null branch, or add a separate counter for size-rejected reads.
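
One way to tighten that (a sketch; readBytesStats/readEntryStats are the existing stats named above, while the size-rejected counter is a hypothetical addition):

// Sketch: record read stats only for entries actually read; count
// size-rejected reads separately (sizeRejectedReads is hypothetical).
if (entry != null) {
    readBytesStats.registerSuccessfulValue(entry.readableBytes());
    readEntryStats.registerSuccessfulEvent(
            MathUtils.elapsedNanos(startTimeNanos), TimeUnit.NANOSECONDS);
} else {
    sizeRejectedReads.inc();
}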

3. Significant code duplication in SingleDirectoryDbLedgerStorage.doGetEntryIfFits

doGetEntryIfFits is ~90 lines that mostly mirror doGetEntry, with the same if (entry.readableBytes() + Integer.BYTES > maxEntrySize) { release; return null; } block repeated four times (write cache, write-cache-being-flushed, read cache, after entry-log read). Could be reduced significantly by extracting a small checkBudget(ByteBuf, long) helper, or by adding a budget parameter to doGetEntry with a sentinel for "no limit". Smaller diff, easier to keep in sync going forward.
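
The suggested helper could be as small as this (sketch, following the repeated block quoted above):

// Sketch of the extracted helper: returns the entry when it fits the budget,
// otherwise releases it and reports "does not fit" as null.
private static ByteBuf checkBudget(ByteBuf entry, long maxEntrySize) {
    if (entry != null && entry.readableBytes() + Integer.BYTES > maxEntrySize) {
        entry.release();
        return null;
    }
    return entry;
}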

4. MockBookies.batchReadEntries no longer releases the skipped entry

The old code did entry.release() before breaking on overflow; the new code drops that call. This is actually a fix — MockLedgerData.getEntry() returns the same shared ByteBuf it stores in its map, so the old release() could free the only ref and corrupt subsequent reads (which is exactly what testBatchReadDoesNotReleaseOversizedSkippedEntry verifies). But the production BatchedReadEntryProcessor does release in the analogous overflow path (because Bookie.readEntry returns a freshly-allocated buffer), so the mock and the real bookie now have intentionally different ref-counting contracts. A short comment on the mock explaining "we don't own the buffer here" would prevent future confusion.
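
A possible wording for that comment (sketch):

// NOTE: MockLedgerData.getEntry() hands back the same shared ByteBuf it keeps
// in its map; this mock does not own a reference, so, unlike the production
// BatchedReadEntryProcessor, we must not release() the skipped entry here.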

Smaller things

5. Inconsistent + 4 vs + Integer.BYTES. BatchedReadEntryProcessor uses the literal 4; every other module uses Integer.BYTES. Pick one.

6. DefaultEntryLogger.readEntryIfFits returns null on oversized entries before calling validateEntry. If an entry's size header is corrupted to a huge value, this silently returns null ("doesn't fit") instead of catching the corruption. Probably acceptable — a later legitimate read would catch it — but worth a one-line comment so a future reader doesn't think it's an oversight.

7. When entryLogger.readEntryIfFits returns null in SingleDirectoryDbLedgerStorage.doGetEntryIfFits, the entry isn't put in the read cache. A subsequent call with a larger budget will hit disk again. Probably fine (we never actually read the data), just noting it.

Verdict

Direction is good, contract is clean, tests are solid. Before merging I'd ask for: (a) the exception-handling behavior change explicitly called out in the PR description, (b) stat recording tightened for size-rejected reads, and (c) the doGetEntryIfFits duplication reduced. The rest are nice-to-have.

@dao-jun
Member Author

dao-jun commented May 14, 2026

/bkbot rerun-failure

@dao-jun
Member Author

dao-jun commented May 14, 2026

@merlimat review comment addressed, PTAL

