
Optimize batch-read to avoid wasted disk IO #4741

Open

dao-jun wants to merge 9 commits into apache:master from dao-jun:dev/optimize_batch_read

Conversation

@dao-jun
Member

@dao-jun dao-jun commented Apr 9, 2026

Motivation

Batch reads currently fetch the next entry before checking whether it can still fit within the maxSize budget. When the last entry in the batch exceeds the remaining space, the read result is discarded, but the disk IO has already happened. This creates wasted IO on the critical read path.

This change avoids that extra read while preserving the existing batch-read behavior, including returning the first entry even when a single entry alone exceeds the requested maxSize.

Changes

  • Add a bounded-read path (readEntryIfFits / getEntryIfFits) through the batch-read stack, from BatchedReadEntryProcessor down to Bookie, LedgerDescriptor, LedgerStorage, and EntryLogger.
  • Update batch-read logic (see the sketch after this list) to:
    • read the first entry as before
    • compute the remaining response budget for subsequent entries
    • stop without reading the next entry when it cannot fit
  • Preserve partial-result behavior for missing subsequent entries, while allowing non-NoEntry storage failures after the first entry to propagate as an I/O error instead of being hidden by a partial response.
  • Implement size-aware entry reads in both DefaultEntryLogger and DirectEntryLogger by checking entry size from metadata before loading the full payload.
  • Wire the bounded-read path through all storage implementations, including:
    • InterleavedLedgerStorage
    • SortedLedgerStorage
    • DbLedgerStorage / SingleDirectoryDbLedgerStorage
  • Keep framing semantics consistent by treating the per-entry 4-byte delimiter as part of the size budget.
  • Update mock batch-read behavior to match the production semantics.
  • Add regression and boundary tests covering:
    • first entry larger than maxSize
    • exact-fit remaining budget
    • oversized subsequent entries
    • default entry logger and direct entry logger paths
    • interleaved, sorted, and DB ledger storage paths
    • mock and client-facing batch-read behavior
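
A minimal sketch of the resulting read loop. The readEntryIfFits call is the bounded-read API added by this PR, but its exact parameter order, the loop bounds, and the surrounding plumbing are simplified assumptions, not the literal implementation:

// Sketch of the bounded batch-read loop.
ByteBufList data = null;
long remaining = maxSize;
for (long entryId = startEntryId; entryId <= lastEntryId; entryId++) {
    ByteBuf entry;
    if (data == null) {
        // The first entry is always read and returned, even if it alone
        // exceeds the requested maxSize.
        entry = bookie.readEntry(ledgerId, entryId);
        data = ByteBufList.get(entry);
    } else {
        // Pass the remaining budget down; storage skips the payload read
        // entirely when the entry cannot fit and returns null instead.
        entry = bookie.readEntryIfFits(ledgerId, entryId, remaining);
        if (entry == null) {
            break; // would not fit: no disk IO wasted
        }
        data.add(entry);
    }
    // The per-entry 4-byte delimiter counts against the budget too.
    remaining -= entry.readableBytes() + 4;
}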

@dao-jun dao-jun closed this Apr 9, 2026
@dao-jun dao-jun reopened this Apr 9, 2026
Contributor

@eolivelli eolivelli left a comment


I can't find tests for the failure scenarios.

Memory leaks can happen in case of failures (e.g. while reading from storage), and the client may see the wrong error code.

@dao-jun
Member Author

dao-jun commented Apr 13, 2026

@eolivelli I added 3 tests in BatchedReadEntryProcessorTest, PTAL

@dao-jun dao-jun requested a review from eolivelli April 13, 2026 13:04
dao-jun added 2 commits May 8, 2026 16:18
# Conflicts:
#	bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/DefaultEntryLogger.java
@merlimat
Contributor

@dao-jun Thanks for the optimization. I went through the PR end-to-end — overall the direction is good and the framing-budget contract is clean and consistent across the storage stack. A few things I'd like to discuss before merge.

Overall

The core idea is sound: today the batch-read path reads each entry from disk and only then checks whether it fits in maxSize, throwing the result away on overflow. This PR threads the remaining response budget down through Bookie → LedgerDescriptor → LedgerStorage → EntryLogger, lets DefaultEntryLogger/DirectEntryLogger short-circuit by reading just the size header, and updates BatchedReadEntryProcessor to use the bounded API for entries 2..N (keeping the "always return the first entry" semantics).

The framing budget contract is consistent end-to-end (entry.readableBytes() + 4 > maxEntrySize everywhere, matching what BatchedReadEntryProcessor adds to frameSize). Tests are thorough — exact-fit, off-by-one budgets, oversized-first-entry, missing entries, IOException propagation, all the storage flavors. Reference counting on the new paths checks out: WriteCache.get()/ReadCache.get() return freshly-allocated buffers, readEntrySize returns a thread-local, and ByteBufList.deallocate releases members.
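
For concreteness, the entry-logger short-circuit described above boils down to this shape (a sketch, not the exact code; readEntrySize is the header-only read mentioned above, and the enclosing signature is simplified):

// Sketch of the size-aware read: consult the size header first, and only
// load the payload when it fits the remaining budget.
long entrySize = readEntrySize(ledgerId, entryId); // header-only, no payload IO
if (entrySize + Integer.BYTES > maxEntrySize) {
    return null; // signal "does not fit" without reading the payload
}
return readEntry(ledgerId, entryId); // fits: perform the normal full read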

Substantive concerns

1. Undocumented behavior change in exception handling — BatchedReadEntryProcessor.readData

} catch (Bookie.NoEntryException e) {
    if (data == null) { throw e; }
    break;
} catch (Throwable e) {
    if (data != null) { data.release(); }
    throw e;
}

The old code's catch (Throwable) silently broke and returned partial data on any error after the first entry — including transient IOExceptions. The new code only swallows NoEntryException; any other exception now propagates as EIO to the client, even if some entries were already read successfully. This is arguably more correct (hiding I/O errors is bad), but it's a real behavior change that isn't called out in the description and may surprise clients that relied on the lenient behavior. Could you call this out explicitly in the PR description?

2. Stat pollution in BookieImpl.readEntryIfFits

When entry == null (size-rejected), entrySize is 0, but the method still calls registerSuccessfulValue(0) on readBytesStats and records latency on readEntryStats. That mixes "actually read N bytes" and "size-rejected, didn't read" into the same histograms — the bytes histogram gets zero-spikes, and the latency histogram skews artificially low because size-rejected reads are much faster than full reads. Either skip recording on the null branch, or add a separate counter for size-rejected reads.
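
One way to tighten that (a sketch; readBytesStats/readEntryStats are the existing stats named above, while the size-rejected counter is a hypothetical addition):

// Sketch: record read stats only for entries actually read; count
// size-rejected reads separately (sizeRejectedReads is hypothetical).
if (entry != null) {
    readBytesStats.registerSuccessfulValue(entry.readableBytes());
    readEntryStats.registerSuccessfulEvent(
            MathUtils.elapsedNanos(startTimeNanos), TimeUnit.NANOSECONDS);
} else {
    sizeRejectedReads.inc();
}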

3. Significant code duplication in SingleDirectoryDbLedgerStorage.doGetEntryIfFits

doGetEntryIfFits is ~90 lines that mostly mirror doGetEntry, with the same if (entry.readableBytes() + Integer.BYTES > maxEntrySize) { release; return null; } block repeated four times (write cache, write-cache-being-flushed, read cache, after entry-log read). Could be reduced significantly by extracting a small checkBudget(ByteBuf, long) helper, or by adding a budget parameter to doGetEntry with a sentinel for "no limit". Smaller diff, easier to keep in sync going forward.
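
The suggested helper could be as small as this (sketch, following the repeated block quoted above):

// Sketch of the extracted helper: returns the entry when it fits the budget,
// otherwise releases it and reports "does not fit" as null.
private static ByteBuf checkBudget(ByteBuf entry, long maxEntrySize) {
    if (entry != null && entry.readableBytes() + Integer.BYTES > maxEntrySize) {
        entry.release();
        return null;
    }
    return entry;
}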

4. MockBookies.batchReadEntries no longer releases the skipped entry

The old code did entry.release() before breaking on overflow; the new code drops that call. This is actually a fix — MockLedgerData.getEntry() returns the same shared ByteBuf it stores in its map, so the old release() could free the only ref and corrupt subsequent reads (which is exactly what testBatchReadDoesNotReleaseOversizedSkippedEntry verifies). But the production BatchedReadEntryProcessor does release in the analogous overflow path (because Bookie.readEntry returns a freshly-allocated buffer), so the mock and the real bookie now have intentionally different ref-counting contracts. A short comment on the mock explaining "we don't own the buffer here" would prevent future confusion.
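
A possible wording for that comment (sketch):

// NOTE: MockLedgerData.getEntry() hands back the same shared ByteBuf it keeps
// in its map; this mock does not own a reference, so, unlike the production
// BatchedReadEntryProcessor, we must not release() the skipped entry here.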

Smaller things

5. Inconsistent + 4 vs + Integer.BYTES. BatchedReadEntryProcessor uses the literal 4; every other module uses Integer.BYTES. Pick one.

6. DefaultEntryLogger.readEntryIfFits returns null on oversized entries before calling validateEntry. If an entry's size header is corrupted to a huge value, this silently returns null ("doesn't fit") instead of catching the corruption. Probably acceptable — a later legitimate read would catch it — but worth a one-line comment so a future reader doesn't think it's an oversight.

7. When entryLogger.readEntryIfFits returns null in SingleDirectoryDbLedgerStorage.doGetEntryIfFits, the entry isn't put in the read cache. A subsequent call with a larger budget will hit disk again. Probably fine (we never actually read the data), just noting it.

Verdict

Direction is good, contract is clean, tests are solid. Before merging I'd ask for: (a) the exception-handling behavior change explicitly called out in the PR description, (b) stat recording tightened for size-rejected reads, and (c) the doGetEntryIfFits duplication reduced. The rest are nice-to-have.

@dao-jun
Member Author

dao-jun commented May 14, 2026

/bkbot rerun-failure

@dao-jun
Member Author

dao-jun commented May 14, 2026

@merlimat review comment addressed, PTAL

