
feat(pt/dpmodel): add max and filter mode for lmdb #5413

Merged
njzjz merged 3 commits into deepmodeling:master from OutisLi:pr/lmdb
Apr 27, 2026

Conversation


@OutisLi OutisLi commented Apr 22, 2026

Summary by CodeRabbit

  • New Features

    • batch_size accepts "max:N" and "filter:N" in addition to "auto"/"auto:N"; batch-size calculation honors per-frame atom counts.
    • print_summary explicitly reports the active batch-size rule.
  • Bug Fixes

    • Dataset length, indexing, and returned frame IDs consistently reflect filtering; filtering preserves original system numbering.
    • Empty probability blocks are removed and weights renormalized so sampling remains valid even when systems/frames are fully dropped.
    • "filter:N" usage is disallowed with mixed-batch mode.
  • Documentation

    • Updated batch_size docs and validation help to describe "max:N" and "filter:N" semantics.
  • Tests

    • Added tests covering max/filter behaviors, filtering effects on indexing and sampling, error cases for invalid batch_size strings, and handling of fully filtered systems.
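
The "max:N" / "filter:N" semantics summarized above can be sketched as follows. This is a minimal illustration only; `batch_size_for_nloc` is a hypothetical helper, not the reader's actual API, and the `None`-for-dropped convention is an assumption for the sketch.

```python
def batch_size_for_nloc(rule: str, nloc: int):
    """Hypothetical helper mirroring the "max:N"/"filter:N" semantics
    described in this PR (name and signature are illustrative)."""
    mode, _, raw = rule.partition(":")
    n = int(raw)
    if n <= 0:
        raise ValueError(f"N must be a positive integer, got {n}")
    if mode == "filter" and nloc > n:
        # under "filter:N" an oversized frame is dropped entirely
        return None
    # both modes size batches as max(1, floor(N / nloc))
    return max(1, n // nloc)

# With an atom budget of 192: 64-atom frames get batches of 3, while a
# 300-atom frame is clamped to batch size 1 under "max:192" but is
# removed altogether under "filter:192".
print(batch_size_for_nloc("max:192", 64))      # -> 3
print(batch_size_for_nloc("max:192", 300))     # -> 1
print(batch_size_for_nloc("filter:192", 300))  # -> None
```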

Copilot AI review requested due to automatic review settings April 22, 2026 11:52
@dosubot dosubot Bot added the new feature label Apr 22, 2026

coderabbitai Bot commented Apr 22, 2026

📝 Walkthrough


LmdbDataReader batch-size parsing now supports "max:N" and "filter:N". "filter:N" removes LMDB frames with nloc > N and rebuilds internal indices/groups; dataset length and indexing reflect retained frames, and __getitem__ maps to original LMDB keys (frame["fid"]). compute_block_targets() drops empty auto blocks and renormalizes weights. Docs and tests updated.

Changes

Core LMDB reader (deepmd/dpmodel/utils/lmdb_data.py):
Add strict positive-int rule parsing; accept "max:N" and "filter:N" (store _max_rule/_filter_rule); validate filter:N vs mixed_batch; compute _retained_keys and filter frames with nloc > N; rebuild _frame_nlocs, _frame_system_ids, _nloc_groups, _system_nframes; adjust __len__/indexing to the retained-frame space; have __getitem__ translate dataset indices to original LMDB keys and set frame["fid"]; update get_batch_size_for_nloc() to apply the max-rule formula; print_summary() reports the rule; compute_block_targets() drops empty auto-prob blocks and renormalizes (may return no targets).

Public API docs / arg checking (deepmd/pt/utils/lmdb_dataset.py, deepmd/utils/argcheck.py):
Document the expanded batch_size string semantics: "max:N" uses per-nloc max(1, floor(N/nloc)); "filter:N" uses the same sizing but drops frames/systems exceeding N (LMDB: per-frame nloc; npy: whole-system natoms). No runtime logic changes in these files.

Tests (source/tests/common/dpmodel/test_lmdb_data.py):
Add an LMDB fixture with mixed sid/nloc; add TestMaxFilterBatchSize verifying "max:N" sizing, "filter:N" filtering, retained-key mapping, preserved original system IDs (zeroed for dropped systems), contiguous post-filter indexing, __getitem__ returning original fids, out-of-range indexing raising, the sampler yielding only retained same-nloc frames, invalid rule strings raising, and block-target/sampler behavior when systems/blocks are fully dropped.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant LMDB as LMDB Env
    participant Reader as LmdbDataReader
    participant Sampler as Batch Sampler

    Client->>Reader: init(batch_size="filter:N"|"max:N"|int, mixed_batch=...)
    Reader->>LMDB: read frame metadata (nloc, sid, fid) for all frames
    alt batch_size == "filter:N"
        Reader->>Reader: compute _retained_keys = [fid where nloc <= N]
        Reader->>Reader: rebuild _frame_nlocs/_frame_system_ids/_nloc_groups/_system_nframes
    else batch_size == "max:N"
        Reader->>Reader: set _max_rule and compute per-nloc sizes = max(1,floor(N/nloc))
    else
        Reader->>Reader: parse other batch_size rules
    end
    Sampler->>Reader: request batch for nloc group
    Reader->>Sampler: get_batch_size_for_nloc(nloc) -> size
    Sampler->>Reader: request frames (dataset indices)
    Reader->>Reader: translate dataset indices -> original LMDB fids via _retained_keys
    Reader->>Client: return frames with frame["fid"] = original LMDB key
    Note right of Reader: compute_block_targets() removes empty blocks before sampling

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

🚥 Pre-merge checks: 4 passed, 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 66.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately summarizes the main change: introducing "max" and "filter" batch-size modes for LMDB data handling, the central feature across multiple files.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 352-396: The code sets self.nframes to the number of retained
frames but still builds groups and uses indices that refer to original LMDB
keys, which breaks iteration because __getitem__(index) reads the LMDB key
`index` directly; fix by introducing an explicit mapping between logical
retained indices and original LMDB indices (e.g. self._logical_to_original =
retained_indices), rebuild self._nloc_groups and self._system_groups to use
logical indices (0..self.nframes-1) while storing the original LMDB id where
needed, and update __getitem__ to translate the incoming logical index to the
original LMDB id via self._logical_to_original before reading the LMDB key;
ensure any code that previously appended `idx` (from retained_indices) now
appends the logical position and that any place needing the original id reads it
from self._logical_to_original.
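
The logical-to-original mapping this comment asks for can be sketched roughly as below. This is a minimal stand-in under the comment's assumptions; the class shape and attribute names (including _logical_to_original) are illustrative, not the real LmdbDataReader.

```python
class RetainedIndexMap:
    """Minimal stand-in: dataset indices run over retained frames
    only, and lookups translate back to original LMDB keys."""

    def __init__(self, frame_nlocs, threshold):
        # keep original LMDB ids whose frame has nloc <= threshold
        self._logical_to_original = [
            fid for fid, nloc in enumerate(frame_nlocs) if nloc <= threshold
        ]

    def __len__(self):
        return len(self._logical_to_original)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # translate the logical index to the original LMDB key before
        # any LMDB read would happen
        return self._logical_to_original[index]

# nlocs per original frame; with threshold 64, frames 0, 2, and 4 survive.
m = RetainedIndexMap([8, 300, 16, 250, 8], threshold=64)
print(len(m))  # -> 3
print(m[1])    # -> 2 (second retained frame maps to original key 2)
```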

In `@deepmd/utils/argcheck.py`:
- Line 3762: Update the documentation string for "filter:N" to use frame-level
wording instead of "removes the systems": explain that the LMDB reader filters
frames whose nloc > N (i.e., removes frames with more than N atoms), and note
that for mixed LMDB systems this operates per-frame rather than per-system;
reference the "filter:N" token and the field "nloc" so readers understand the
distinction.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 78174e2d-4bb3-459e-8770-1276b560b23a

📥 Commits

Reviewing files that changed from the base of the PR and between 5d9cbdf and c3fc8e8.

📒 Files selected for processing (4)
  • deepmd/dpmodel/utils/lmdb_data.py
  • deepmd/pt/utils/lmdb_dataset.py
  • deepmd/utils/argcheck.py
  • source/tests/common/dpmodel/test_lmdb_data.py

Copilot AI left a comment

Pull request overview

Adds new LMDB batch sizing modes to better control atom-budgeted batching and optionally drop oversized frames, with accompanying tests and updated user-facing docs.

Changes:

  • Extend LmdbDataReader batch sizing to support "max:N" (upper-bound atom budget via floor division) and "filter:N" (drop frames with nloc > N).
  • Update PyTorch LMDB dataset docstring and config arg docs to mention the new batch size modes.
  • Add unit tests covering max:/filter: behavior, including interactions with auto_prob block slicing.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • deepmd/dpmodel/utils/lmdb_data.py: Implements max:/filter: parsing and filtering behavior, and avoids NaNs in compute_block_targets when blocks become empty after filtering.
  • deepmd/pt/utils/lmdb_dataset.py: Expands the batch_size docstring to describe the new rule strings forwarded to LmdbDataReader.
  • deepmd/utils/argcheck.py: Updates the validation_data.batch_size documentation to mention "max:N" and "filter:N".
  • source/tests/common/dpmodel/test_lmdb_data.py: Adds targeted tests for max:/filter: and their interaction with samplers and auto_prob block behavior.



codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 88.60759% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.49%. Comparing base (5d9cbdf) to head (3b66cb6).
⚠️ Report is 13 commits behind head on master.

Files with missing lines Patch % Lines
deepmd/dpmodel/utils/lmdb_data.py 88.60% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5413      +/-   ##
==========================================
+ Coverage   80.46%   80.49%   +0.02%     
==========================================
  Files         820      820              
  Lines       86005    86063      +58     
  Branches     4139     4139              
==========================================
+ Hits        69208    69277      +69     
+ Misses      15521    15510      -11     
  Partials     1276     1276              


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
deepmd/dpmodel/utils/lmdb_data.py (1)

499-514: Negative-index rejection is a deliberate API narrowing — document it.

LmdbDataReader used to go through self._txn.get(format(index, ...)) which would simply miss for negatives. Now line 506 hard-rejects index < 0, which is the right call for Dataset-like objects but is a subtle behavior change for any caller that passes reader[-1] (e.g. ad-hoc scripts or tests that grab "the last frame"). Since the class docstring at lines 231-257 doesn't mention indexing semantics, consider adding one line to __getitem__'s docstring that negative indices are not supported, or support them with index += self.nframes as Python sequences do. Either is fine; just make it explicit.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 340-354: The batch_size string parsing accepts non-positive or
non-integer N values which can silently drop all frames; update the parsing
branches for "max:N" and "filter:N" to validate N: parse with a guarded
try/except to produce a clear ValueError on non-integers, then check that the
parsed N is a positive integer (e.g. N >= 1) and raise a ValueError with an
explanatory message if not; retain existing assignments to self._max_rule and
self._filter_rule (and self._max_rule = self._filter_rule for "filter:") and
ensure behavior of the n <= self._filter_rule predicate and nloc assumptions
remain consistent.
- Around line 69-72: The LMDB environment is opened with readahead=True which
causes excessive prefetch and page-cache pollution for random-access workloads
(see LmdbDataReader.__getitem__ used by SameNlocBatchSampler); change the
lmdb.open call that sets readahead=True to readahead=False (or expose a config
option to toggle readahead) so local SSD training avoids aggressive prefetching
and improves random-read throughput.
- Around line 773-789: When empty blocks are dropped (the nonempty list is
shorter than blocks because sum(system_nframes[stt:end]) == 0), add an INFO log
that lists which original blocks were removed and shows the new auto_prob_style
so operators can see the change in effective weights; use the existing symbols
blocks, nonempty, auto_prob_style and system_nframes to construct a message like
"Dropped empty blocks: <stt:end:weight,...>; updated auto_prob_style:
<new_style>" and emit it before assigning blocks = nonempty.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 851ab184-a305-408f-b9c1-65bddd99c639

📥 Commits

Reviewing files that changed from the base of the PR and between c3fc8e8 and 86df0ca.

📒 Files selected for processing (3)
  • deepmd/dpmodel/utils/lmdb_data.py
  • deepmd/utils/argcheck.py
  • source/tests/common/dpmodel/test_lmdb_data.py
✅ Files skipped from review due to trivial changes (1)
  • source/tests/common/dpmodel/test_lmdb_data.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/utils/argcheck.py

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
deepmd/dpmodel/utils/lmdb_data.py (2)

782-786: ⚠️ Potential issue | 🟡 Minor

Log dropped auto-prob blocks because weights are renormalized.

When empty blocks are removed, prob_sys_size_ext renormalizes the remaining block weights, changing the user’s effective sampling distribution. Emit an info log with the dropped blocks and rewritten style.

Suggested logging
     if len(nonempty) < len(blocks):
+        dropped = [
+            f"{stt}:{end}:{weight}"
+            for stt, end, weight in blocks
+            if sum(system_nframes[stt:end]) == 0
+        ]
         auto_prob_style = "prob_sys_size;" + ";".join(
             f"{stt}:{end}:{weight}" for stt, end, weight in nonempty
         )
+        log.info(
+            "Dropped empty LMDB auto_prob blocks after filtering: %s; "
+            "updated auto_prob_style: %s",
+            dropped,
+            auto_prob_style,
+        )
         blocks = nonempty

340-346: ⚠️ Potential issue | 🟡 Minor

Validate N when parsing auto:N, max:N, and filter:N.

int(batch_size.split(":")[1]) still accepts non-positive thresholds and gives raw/ambiguous errors for malformed values like "filter:"; "max:10:extra" is also parsed inconsistently. Please centralize parsing and require N > 0.

Suggested hardening
+        def _parse_positive_rule(prefix: str) -> int:
+            try:
+                n = int(batch_size.split(":", 1)[1])
+            except (IndexError, ValueError) as exc:
+                raise ValueError(
+                    f"Invalid batch_size {batch_size!r}: expected "
+                    f"'{prefix}:N' with a positive integer N."
+                ) from exc
+            if n <= 0:
+                raise ValueError(
+                    f"Invalid batch_size {batch_size!r}: N must be positive."
+                )
+            return n
+
         if isinstance(batch_size, str):
             if batch_size == "auto":
                 self._auto_rule = 32
             elif batch_size.startswith("auto:"):
-                self._auto_rule = int(batch_size.split(":")[1])
+                self._auto_rule = _parse_positive_rule("auto")
             elif batch_size.startswith("max:"):
-                self._max_rule = int(batch_size.split(":")[1])
+                self._max_rule = _parse_positive_rule("max")
             elif batch_size.startswith("filter:"):
-                self._filter_rule = int(batch_size.split(":")[1])
+                self._filter_rule = _parse_positive_rule("filter")
                 self._max_rule = self._filter_rule
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 477-490: The docstring for get_batch_size_for_nloc incorrectly
claims "max:N / filter:N: never overshoots" while the implementation (checking
self._max_rule and returning max(1, self._max_rule // max(nloc, 1))) will clamp
to 1 when nloc > N and thus can overshoot for single oversized frames; update
the docstring to state that max:N/filter:N use floor(N / nloc) clamped to at
least 1 and therefore may still overshoot when an individual frame has more than
N atoms, referencing get_batch_size_for_nloc and the _max_rule behavior.
- Around line 353-368: The current logic silently skips applying filter:N when
mixed_batch is True; update the logic in the method using symbols _filter_rule,
mixed_batch, orig_frame_nlocs, retained_keys, nframes and lmdb_path to enforce
the documented behaviour: either (A) fail-fast by raising a clear exception when
_filter_rule is not None and mixed_batch is True (with a message referencing
filter:N and mixed_batch), or (B) compute per-frame nlocs for mixed batches
(i.e. ensure orig_frame_nlocs is populated for mixed_batch) and then apply the
same retained_keys comprehension and n_dropped logging so frames with nloc >
_filter_rule are dropped; pick one approach and implement it consistently so
filter:N is never silently disabled.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7b7f737a-a367-465e-a496-c9a090340f51

📥 Commits

Reviewing files that changed from the base of the PR and between 86df0ca and fc02035.

📒 Files selected for processing (3)
  • deepmd/dpmodel/utils/lmdb_data.py
  • deepmd/utils/argcheck.py
  • source/tests/common/dpmodel/test_lmdb_data.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • deepmd/utils/argcheck.py
  • source/tests/common/dpmodel/test_lmdb_data.py

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (4)
deepmd/dpmodel/utils/lmdb_data.py (2)

823-842: (stt, end, weight) not in nonempty can misfire on float weights.

The dropped list at Line 829-833 filters blocks by “not in nonempty”, comparing 3-tuples that include the parsed float weight. In the current path the objects in nonempty are the same tuple instances taken from blocks, so this works — but it is fragile: if anyone ever normalizes/reparses weights before this check (e.g., a future refactor that rounds or strips whitespace differently), 0.2 != 0.2 edge cases or formatting mismatches will make dropped blocks invisibly re-appear as “kept”.

Safer to drive the diff from the index set used to build nonempty:

Suggested hardening
-    nonempty = [
-        (stt, end, weight)
-        for stt, end, weight in blocks
-        if sum(system_nframes[stt:end]) > 0
-    ]
+    nonempty_idx = [
+        i for i, (stt, end, _w) in enumerate(blocks)
+        if sum(system_nframes[stt:end]) > 0
+    ]
+    nonempty = [blocks[i] for i in nonempty_idx]
@@
-    if len(nonempty) < len(blocks):
+    if len(nonempty) < len(blocks):
         ...
-        dropped = [
-            f"{stt}:{end}:{weight}"
-            for (stt, end, weight) in blocks
-            if (stt, end, weight) not in nonempty
-        ]
+        kept = set(nonempty_idx)
+        dropped = [
+            f"{stt}:{end}:{weight}"
+            for i, (stt, end, weight) in enumerate(blocks)
+            if i not in kept
+        ]

688-705: print_summary shows 0 nloc groups for mixed_batch=True.

When mixed_batch=True, self._nloc_groups is intentionally left empty (lines 429-430), so n_groups = len(self._nloc_groups) == 0 and the summary reports zero nloc groups even though the dataset has self.nframes > 0 frames. Consider either skipping the nloc groups line in that case or reporting "mixed" to avoid misleading logs.

Suggested tweak
-        n_groups = len(self._nloc_groups)
+        n_groups_str = (
+            "mixed" if self.mixed_batch else str(len(self._nloc_groups))
+        )
@@
         log.info(
             f"LMDB {name}: {self.lmdb_path}, "
-            f"{self.nframes} frames, {n_groups} nloc groups, "
+            f"{self.nframes} frames, {n_groups_str} nloc groups, "
             f"batch_size={bs_str}, "
             f"mixed_batch={self.mixed_batch}"
         )
source/tests/common/dpmodel/test_lmdb_data.py (2)

906-916: Assertion on "positive" is coupled to the current error string.

self.assertIn("positive", str(ctx.exception)) works today because both _parse_positive_rule error branches happen to contain the word, but the int(raw) ValueError path uses "Expected '<prefix>N' with N a positive integer." and the bounds path uses "N must be a positive integer, got <n>.". A future wording change (e.g. "strictly positive", "non-negative") would break these tests even though behavior is unchanged. Consider asserting on the spec string (which is guaranteed to appear) plus the exception type, or use assertRaisesRegex with a broader pattern.

Example
-        for spec in ("filter:", "filter:0", "max:-1"):
-            with self.assertRaises(ValueError) as ctx:
-                LmdbDataReader(self._uniform_path, self._type_map, batch_size=spec)
-            self.assertIn("positive", str(ctx.exception))
+        for spec in ("filter:", "filter:0", "max:-1"):
+            with self.assertRaisesRegex(ValueError, re.escape(repr(spec))):
+                LmdbDataReader(self._uniform_path, self._type_map, batch_size=spec)

875-878: Silence the CodeQL "statement has no effect" warning.

CodeQL flags reader[len(reader)] / reader[-1] as statements with no effect because their return value is discarded. The call does trigger __getitem__ (which is the whole point — it must raise), but the static analyzer can't tell. A tiny rewrite clears the warning and makes intent explicit:

Proposed rewrite
-        with self.assertRaises(IndexError):
-            reader[len(reader)]
-        with self.assertRaises(IndexError):
-            reader[-1]
+        with self.assertRaises(IndexError):
+            _ = reader[len(reader)]
+        with self.assertRaises(IndexError):
+            _ = reader[-1]

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 4047583b-b9c2-40f6-9d84-66b4d90dab2f

📥 Commits

Reviewing files that changed from the base of the PR and between fc02035 and 3b66cb6.

📒 Files selected for processing (2)
  • deepmd/dpmodel/utils/lmdb_data.py
  • source/tests/common/dpmodel/test_lmdb_data.py

@OutisLi OutisLi requested review from iProzd and njzjz April 24, 2026 02:29
@njzjz-bot njzjz-bot left a comment

LGTM. I paid extra attention to the LMDB filter:N path: keeping _nsystems in original system-id space, re-keying retained frames into dataset index space, and renormalizing empty probability blocks all make sense together. The added tests also cover the tricky indexing/sampling cases well.

— OpenClaw 2026.4.22 (model: gpt-5.4)

@njzjz njzjz added this pull request to the merge queue Apr 26, 2026
Merged via the queue into deepmodeling:master with commit 9d63816 Apr 27, 2026
70 checks passed