feat(pt/dpmodel): add max and filter mode for lmdb #5413

njzjz merged 3 commits into deepmodeling:master
Conversation
📝 Walkthrough

LmdbDataReader batch-size parsing now supports "max:N" and "filter:N". "filter:N" removes LMDB frames with nloc > N and rebuilds the internal indices and groups; dataset length and indexing reflect only the retained frames, and __getitem__ maps each dataset index back to its original LMDB key (frame["fid"]). compute_block_targets() drops empty auto blocks and renormalizes the remaining weights. Docs and tests are updated accordingly.
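The two rules reduce to simple arithmetic; here is a minimal standalone sketch of their semantics (helper names are illustrative, not the reader's actual API):

```python
def batch_size_for_nloc(rule: str, nloc: int) -> int:
    # "max:N": fit as many frames as possible into an N-atom budget,
    # but always at least one frame; if nloc > N this clamps to 1 and
    # the single frame overshoots the budget.
    n = int(rule.split(":", 1)[1])
    return max(1, n // max(nloc, 1))


def retained_frames(rule: str, frame_nlocs: list[int]) -> list[int]:
    # "filter:N": keep the original frame ids (fids) whose nloc <= N
    n = int(rule.split(":", 1)[1])
    return [fid for fid, nloc in enumerate(frame_nlocs) if nloc <= n]


print(batch_size_for_nloc("max:192", 64))           # 3
print(batch_size_for_nloc("max:10", 100))           # 1 (oversized frame)
print(retained_frames("filter:64", [32, 128, 64]))  # [0, 2]
```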
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant LMDB as LMDB Env
    participant Reader as LmdbDataReader
    participant Sampler as Batch Sampler
    Client->>Reader: init(batch_size="filter:N"|"max:N"|int, mixed_batch=...)
    Reader->>LMDB: read frame metadata (nloc, sid, fid) for all frames
    alt batch_size == "filter:N"
        Reader->>Reader: compute _retained_keys = [fid where nloc <= N]
        Reader->>Reader: rebuild _frame_nlocs/_frame_system_ids/_nloc_groups/_system_nframes
    else batch_size == "max:N"
        Reader->>Reader: set _max_rule and compute per-nloc sizes = max(1, floor(N/nloc))
    else
        Reader->>Reader: parse other batch_size rules
    end
    Sampler->>Reader: request batch for nloc group
    Reader->>Sampler: get_batch_size_for_nloc(nloc) -> size
    Sampler->>Reader: request frames (dataset indices)
    Reader->>Reader: translate dataset indices -> original LMDB fids via _retained_keys
    Reader->>Client: return frames with frame["fid"] = original LMDB key
    Note right of Reader: compute_block_targets() removes empty blocks before sampling
```
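The index translation in the diagram can be sketched as follows (a toy stand-in for the reader; the dict return replaces an actual LMDB read):

```python
class FilteredView:
    """Minimal sketch of logical-to-original index translation after filter:N."""

    def __init__(self, retained_keys: list[int]) -> None:
        # original LMDB frame ids that survived the nloc filter
        self._retained_keys = retained_keys

    def __len__(self) -> int:
        return len(self._retained_keys)

    def __getitem__(self, index: int) -> dict:
        if index < 0 or index >= len(self._retained_keys):
            raise IndexError(f"index {index} out of range")
        # translate the dataset index back to the original LMDB key
        return {"fid": self._retained_keys[index]}


view = FilteredView([0, 2, 5])  # frames 1, 3, and 4 were dropped
print(len(view), view[1]["fid"])  # 3 2
```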
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes

🚥 Pre-merge checks: ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 352-396: The code sets self.nframes to the number of retained
frames but still builds groups and uses indices that refer to original LMDB
keys, which breaks iteration because __getitem__(index) reads the LMDB key
`index` directly; fix by introducing an explicit mapping between logical
retained indices and original LMDB indices (e.g. self._logical_to_original =
retained_indices), rebuild self._nloc_groups and self._system_groups to use
logical indices (0..self.nframes-1) while storing the original LMDB id where
needed, and update __getitem__ to translate the incoming logical index to the
original LMDB id via self._logical_to_original before reading the LMDB key;
ensure any code that previously appended `idx` (from retained_indices) now
appends the logical position and that any place needing the original id reads it
from self._logical_to_original.
In `@deepmd/utils/argcheck.py`:
- Line 3762: Update the documentation string for "filter:N" to use frame-level
wording instead of "removes the systems": explain that the LMDB reader filters
frames whose nloc > N (i.e., removes frames with more than N atoms), and note
that for mixed LMDB systems this operates per-frame rather than per-system;
reference the "filter:N" token and the field "nloc" so readers understand the
distinction.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 78174e2d-4bb3-459e-8770-1276b560b23a
📒 Files selected for processing (4)

- deepmd/dpmodel/utils/lmdb_data.py
- deepmd/pt/utils/lmdb_dataset.py
- deepmd/utils/argcheck.py
- source/tests/common/dpmodel/test_lmdb_data.py
Pull request overview
Adds new LMDB batch sizing modes to better control atom-budgeted batching and optionally drop oversized frames, with accompanying tests and updated user-facing docs.
Changes:
- Extend LmdbDataReader batch sizing to support "max:N" (upper-bound atom budget via floor division) and "filter:N" (drop frames with nloc > N).
- Update the PyTorch LMDB dataset docstring and config arg docs to mention the new batch size modes.
- Add unit tests covering max:/filter: behavior, including interactions with auto_prob block slicing.
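For reference, a training-config fragment selecting the new modes might look like the following (a hypothetical minimal excerpt; the exact key placement and thresholds are illustrative, not taken from the PR):

```json
{
  "training": {
    "training_data": {
      "systems": ["path/to/data.lmdb"],
      "batch_size": "max:4096"
    },
    "validation_data": {
      "systems": ["path/to/val.lmdb"],
      "batch_size": "filter:2048"
    }
  }
}
```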
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| deepmd/dpmodel/utils/lmdb_data.py | Implements max:/filter: parsing, filtering behavior, and avoids NaNs in compute_block_targets when blocks become empty after filtering. |
| deepmd/pt/utils/lmdb_dataset.py | Expands the batch_size docstring to describe the new rule strings forwarded to LmdbDataReader. |
| deepmd/utils/argcheck.py | Updates the validation_data.batch_size documentation to mention "max:N" and "filter:N". |
| source/tests/common/dpmodel/test_lmdb_data.py | Adds targeted tests for max:/filter: and their interaction with samplers and auto_prob block behavior. |
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           master    #5413    +/-  ##
=========================================
+ Coverage   80.46%   80.49%   +0.02%
=========================================
  Files         820      820
  Lines       86005    86063     +58
  Branches     4139     4139
=========================================
+ Hits        69208    69277     +69
+ Misses      15521    15510     -11
  Partials     1276     1276
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 3
🧹 Nitpick comments (1)
deepmd/dpmodel/utils/lmdb_data.py (1)
499-514: Negative-index rejection is a deliberate API narrowing: document it.

LmdbDataReader used to go through self._txn.get(format(index, ...)), which would simply miss for negatives. Now line 506 hard-rejects index < 0, which is the right call for Dataset-like objects but is a subtle behavior change for any caller that passes reader[-1] (e.g. ad-hoc scripts or tests that grab "the last frame"). Since the class docstring at lines 231-257 doesn't mention indexing semantics, consider adding one line to __getitem__'s docstring stating that negative indices are not supported, or support them with index += self.nframes as Python sequences do. Either is fine; just make it explicit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 340-354: The batch_size string parsing accepts non-positive or
non-integer N values which can silently drop all frames; update the parsing
branches for "max:N" and "filter:N" to validate N: parse with a guarded
try/except to produce a clear ValueError on non-integers, then check that the
parsed N is a positive integer (e.g. N >= 1) and raise a ValueError with an
explanatory message if not; retain existing assignments to self._max_rule and
self._filter_rule (and self._max_rule = self._filter_rule for "filter:") and
ensure behavior of the n <= self._filter_rule predicate and nloc assumptions
remain consistent.
- Around line 69-72: The LMDB environment is opened with readahead=True which
causes excessive prefetch and page-cache pollution for random-access workloads
(see LmdbDataReader.__getitem__ used by SameNlocBatchSampler); change the
lmdb.open call that sets readahead=True to readahead=False (or expose a config
option to toggle readahead) so local SSD training avoids aggressive prefetching
and improves random-read throughput.
- Around line 773-789: When empty blocks are dropped (the nonempty list is
shorter than blocks because sum(system_nframes[stt:end]) == 0), add an INFO log
that lists which original blocks were removed and shows the new auto_prob_style
so operators can see the change in effective weights; use the existing symbols
blocks, nonempty, auto_prob_style and system_nframes to construct a message like
"Dropped empty blocks: <stt:end:weight,...>; updated auto_prob_style:
<new_style>" and emit it before assigning blocks = nonempty.
📒 Files selected for processing (3)

- deepmd/dpmodel/utils/lmdb_data.py
- deepmd/utils/argcheck.py
- source/tests/common/dpmodel/test_lmdb_data.py
✅ Files skipped from review due to trivial changes (1)
- source/tests/common/dpmodel/test_lmdb_data.py
🚧 Files skipped from review as they are similar to previous changes (1)
- deepmd/utils/argcheck.py
Actionable comments posted: 2
♻️ Duplicate comments (2)
deepmd/dpmodel/utils/lmdb_data.py (2)

782-786: ⚠️ Potential issue | 🟡 Minor: Log dropped auto-prob blocks because weights are renormalized.

When empty blocks are removed, prob_sys_size_ext renormalizes the remaining block weights, changing the user's effective sampling distribution. Emit an info log with the dropped blocks and the rewritten style.

Suggested logging:

```diff
 if len(nonempty) < len(blocks):
+    dropped = [
+        f"{stt}:{end}:{weight}"
+        for stt, end, weight in blocks
+        if sum(system_nframes[stt:end]) == 0
+    ]
     auto_prob_style = "prob_sys_size;" + ";".join(
         f"{stt}:{end}:{weight}" for stt, end, weight in nonempty
     )
+    log.info(
+        "Dropped empty LMDB auto_prob blocks after filtering: %s; "
+        "updated auto_prob_style: %s",
+        dropped,
+        auto_prob_style,
+    )
     blocks = nonempty
```

340-346: ⚠️ Potential issue | 🟡 Minor: Validate N when parsing auto:N, max:N, and filter:N.

int(batch_size.split(":")[1]) still accepts non-positive thresholds and gives raw/ambiguous errors for malformed values like "filter:"; "max:10:extra" is also parsed inconsistently. Please centralize parsing and require N > 0.

Suggested hardening:

```diff
+def _parse_positive_rule(prefix: str) -> int:
+    try:
+        n = int(batch_size.split(":", 1)[1])
+    except (IndexError, ValueError) as exc:
+        raise ValueError(
+            f"Invalid batch_size {batch_size!r}: expected "
+            f"'{prefix}:N' with a positive integer N."
+        ) from exc
+    if n <= 0:
+        raise ValueError(
+            f"Invalid batch_size {batch_size!r}: N must be positive."
+        )
+    return n
+
 if isinstance(batch_size, str):
     if batch_size == "auto":
         self._auto_rule = 32
     elif batch_size.startswith("auto:"):
-        self._auto_rule = int(batch_size.split(":")[1])
+        self._auto_rule = _parse_positive_rule("auto")
     elif batch_size.startswith("max:"):
-        self._max_rule = int(batch_size.split(":")[1])
+        self._max_rule = _parse_positive_rule("max")
     elif batch_size.startswith("filter:"):
-        self._filter_rule = int(batch_size.split(":")[1])
+        self._filter_rule = _parse_positive_rule("filter")
         self._max_rule = self._filter_rule
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 477-490: The docstring for get_batch_size_for_nloc incorrectly
claims "max:N / filter:N: never overshoots" while the implementation (checking
self._max_rule and returning max(1, self._max_rule // max(nloc, 1))) will clamp
to 1 when nloc > N and thus can overshoot for single oversized frames; update
the docstring to state that max:N/filter:N use floor(N / nloc) clamped to at
least 1 and therefore may still overshoot when an individual frame has more than
N atoms, referencing get_batch_size_for_nloc and the _max_rule behavior.
- Around line 353-368: The current logic silently skips applying filter:N when
mixed_batch is True; update the logic in the method using symbols _filter_rule,
mixed_batch, orig_frame_nlocs, retained_keys, nframes and lmdb_path to enforce
the documented behaviour: either (A) fail-fast by raising a clear exception when
_filter_rule is not None and mixed_batch is True (with a message referencing
filter:N and mixed_batch), or (B) compute per-frame nlocs for mixed batches
(i.e. ensure orig_frame_nlocs is populated for mixed_batch) and then apply the
same retained_keys comprehension and n_dropped logging so frames with nloc >
_filter_rule are dropped; pick one approach and implement it consistently so
filter:N is never silently disabled.
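Option (A), the fail-fast variant, could look roughly like this (a standalone sketch; the real check would live in the reader's constructor, and the attribute names follow the comment above):

```python
def check_filter_vs_mixed(filter_rule, mixed_batch: bool) -> None:
    # option (A): refuse the combination instead of silently ignoring the filter
    if filter_rule is not None and mixed_batch:
        raise ValueError(
            "batch_size='filter:N' is incompatible with mixed_batch=True: "
            "per-frame nloc is not tracked for mixed batches"
        )


check_filter_vs_mixed(None, True)  # fine: no filter requested
check_filter_vs_mixed(64, False)   # fine: filter without mixed batches
```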
📒 Files selected for processing (3)

- deepmd/dpmodel/utils/lmdb_data.py
- deepmd/utils/argcheck.py
- source/tests/common/dpmodel/test_lmdb_data.py
🚧 Files skipped from review as they are similar to previous changes (2)
- deepmd/utils/argcheck.py
- source/tests/common/dpmodel/test_lmdb_data.py
🧹 Nitpick comments (4)
deepmd/dpmodel/utils/lmdb_data.py (2)
823-842: (stt, end, weight) not in nonempty can misfire on float weights.

The dropped list at lines 829-833 filters blocks by "not in nonempty", comparing 3-tuples that include the parsed float weight. In the current path the objects in nonempty are the same tuple instances taken from blocks, so this works, but it is fragile: if anyone ever normalizes/reparses weights before this check (e.g. a future refactor that rounds or strips whitespace differently), 0.2 != 0.2 edge cases or formatting mismatches will make dropped blocks invisibly re-appear as "kept". Safer to drive the diff from the index set used to build nonempty.

Suggested hardening:

```diff
-    nonempty = [
-        (stt, end, weight)
-        for stt, end, weight in blocks
-        if sum(system_nframes[stt:end]) > 0
-    ]
+    nonempty_idx = [
+        i for i, (stt, end, _w) in enumerate(blocks)
+        if sum(system_nframes[stt:end]) > 0
+    ]
+    nonempty = [blocks[i] for i in nonempty_idx]
@@
     if len(nonempty) < len(blocks):
         ...
-        dropped = [
-            f"{stt}:{end}:{weight}"
-            for (stt, end, weight) in blocks
-            if (stt, end, weight) not in nonempty
-        ]
+        kept = set(nonempty_idx)
+        dropped = [
+            f"{stt}:{end}:{weight}"
+            for i, (stt, end, weight) in enumerate(blocks)
+            if i not in kept
+        ]
```
688-705: print_summary shows 0 nloc groups for mixed_batch=True.

When mixed_batch=True, self._nloc_groups is intentionally left empty (lines 429-430), so n_groups = len(self._nloc_groups) == 0 and the summary reports zero nloc groups even though the dataset has self.nframes > 0 frames. Consider either skipping the nloc groups line in that case or reporting "mixed" to avoid misleading logs.

Suggested tweak:

```diff
-    n_groups = len(self._nloc_groups)
+    n_groups_str = (
+        "mixed" if self.mixed_batch else str(len(self._nloc_groups))
+    )
@@
     log.info(
         f"LMDB {name}: {self.lmdb_path}, "
-        f"{self.nframes} frames, {n_groups} nloc groups, "
+        f"{self.nframes} frames, {n_groups_str} nloc groups, "
         f"batch_size={bs_str}, "
         f"mixed_batch={self.mixed_batch}"
     )
```

source/tests/common/dpmodel/test_lmdb_data.py (2)
906-916: Assertion on "positive" is coupled to the current error string.

self.assertIn("positive", str(ctx.exception)) works today because both _parse_positive_rule error branches happen to contain the word, but the int(raw) ValueError path uses "Expected '<prefix>N' with N a positive integer." and the bounds path uses "N must be a positive integer, got <n>.". A future wording change (e.g. "strictly positive", "non-negative") would break these tests even though behavior is unchanged. Consider asserting on the spec string (which is guaranteed to appear) plus the exception type, or use assertRaisesRegex with a broader pattern.

Example:

```diff
-    for spec in ("filter:", "filter:0", "max:-1"):
-        with self.assertRaises(ValueError) as ctx:
-            LmdbDataReader(self._uniform_path, self._type_map, batch_size=spec)
-        self.assertIn("positive", str(ctx.exception))
+    for spec in ("filter:", "filter:0", "max:-1"):
+        with self.assertRaisesRegex(ValueError, re.escape(repr(spec))):
+            LmdbDataReader(self._uniform_path, self._type_map, batch_size=spec)
```
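The assertRaisesRegex pattern suggested above can be checked in isolation (the bare raise here stands in for the LmdbDataReader call, which is not reproduced):

```python
import re
import unittest


class DemoSpecAssertion(unittest.TestCase):
    # match the offending spec string, not the (changeable) error wording
    def test_spec_appears_in_message(self):
        spec = "filter:0"
        with self.assertRaisesRegex(ValueError, re.escape(repr(spec))):
            raise ValueError(f"Invalid batch_size {spec!r}: N must be positive.")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(DemoSpecAssertion)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```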
875-878: Silence the CodeQL "statement has no effect" warning.

CodeQL flags reader[len(reader)] / reader[-1] as statements with no effect because their return value is discarded. The call does trigger __getitem__ (which is the whole point: it must raise), but the static analyzer can't tell. A tiny rewrite clears the warning and makes intent explicit.

Proposed rewrite:

```diff
-    with self.assertRaises(IndexError):
-        reader[len(reader)]
-    with self.assertRaises(IndexError):
-        reader[-1]
+    with self.assertRaises(IndexError):
+        _ = reader[len(reader)]
+    with self.assertRaises(IndexError):
+        _ = reader[-1]
```
📒 Files selected for processing (2)

- deepmd/dpmodel/utils/lmdb_data.py
- source/tests/common/dpmodel/test_lmdb_data.py
njzjz-bot left a comment:
LGTM. I paid extra attention to the LMDB filter:N path: keeping _nsystems in original system-id space, re-keying retained frames into dataset index space, and renormalizing empty probability blocks all make sense together. The added tests also cover the tricky indexing/sampling cases well.
— OpenClaw 2026.4.22 (model: gpt-5.4)
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests