
Add/mm retain adacare #885

Open
jhnwu3 wants to merge 12 commits into master from add/mm_retain_adacare

Conversation

@jhnwu3 (Collaborator) commented Mar 10, 2026

This pull request introduces new example scripts and documentation updates to support multimodal and hyperparameter-tuned drug recommendation models on the MIMIC-IV dataset, specifically using AdaCare and RETAIN model variants. It also improves the usability and clarity of existing example scripts.

Major additions and improvements:

1. New example scripts for drug recommendation:

  • Added drug_recommendation_mimic4_adacare.py demonstrating how to use the AdaCare model for drug recommendation on MIMIC-IV, including data loading, task setup, training, evaluation, and prediction inspection.
  • Added drug_recommendation_mimic4_adacare_optuna.py, providing an end-to-end example of hyperparameter tuning for AdaCare using Optuna, covering parameter search, training, and evaluation.
  • Added drug_recommendation_mimic4_multimodal_retain.py, showcasing the use of the new MultimodalRETAIN model for handling both sequential and non-sequential EHR features, with detailed steps and comparison to vanilla RETAIN.

2. Documentation enhancements:

  • Updated AdaCare and RETAIN model documentation to include the new MultimodalAdaCare and MultimodalRETAIN classes, improving discoverability and API reference completeness. [1] [2]

3. Usability and bug fixes in existing examples:

  • Ensured all example scripts use a consistent dataset cache directory for reproducibility and efficiency.
  • Fixed sample access in the RETAIN example to use the correct dataset indexing method, improving code reliability.

from pathlib import Path
from filelock import FileLock  # third-party `filelock` package

lock_path = Path(cache_dir) / "build.lock"
with FileLock(str(lock_path), timeout=7200):
    # Re-check inside the lock: another process may have built it
    # while we were waiting.
    ...
Collaborator Author:
I need help looking into file-read checks like this one. Mainly for the case where the agent spins up multiple caching jobs at once, which leads to downstream issues like overwritten files.
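For reference, the double-checked pattern in the snippet above can be sketched with only the standard library (the PR itself uses the third-party `filelock` package; `build_fn` and the `build.done` sentinel below are hypothetical, not part of the PR):

```python
import os
import time
from pathlib import Path

def build_cache_once(cache_dir, build_fn, poll_s=0.05):
    """Build a cache exactly once, even if several jobs start together."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    done = cache / "build.done"
    if done.exists():                      # fast path: already built
        return cache
    lock_path = cache / "build.lock"
    while True:
        try:
            # O_CREAT | O_EXCL creates the lock file atomically, so only
            # one process can acquire the lock at a time.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(poll_s)             # another job holds the lock
    try:
        # Re-check inside the lock: the previous holder may have finished
        # the build while we were waiting.
        if not done.exists():
            build_fn(cache)
            done.touch()
    finally:
        os.close(fd)
        lock_path.unlink()
    return cache
```

Note that a bare `O_EXCL` lock file goes stale if the holding process crashes; `filelock`'s timeout handling is one reason to prefer it in the real code.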

@jhnwu3 jhnwu3 requested review from EricSchrock and Logiquo March 10, 2026 21:02
@EricSchrock (Collaborator) left a comment:

I ran out of time and couldn't get to a full review but I reviewed the added docs, examples, and tests and skimmed the model and dataset changes. I found a few issues with the examples and tests.

@Logiquo might have better feedback on the models themselves and the changes to base_dataset.py.

@Logiquo (Collaborator) commented Mar 16, 2026

The dataset part LGTM, except that we don't need to check the bin files, and we don't need to invalidate caches (or, if we do want to invalidate caches, the best approach is to read the cache with StreamingDataset and see whether it throws an exception).

Checking *.bin files won't help in this case.
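The read-to-validate idea could look something like this generic sketch, where `reader` stands in for constructing and iterating a StreamingDataset over the cache directory (a hypothetical helper, not code from the PR):

```python
def cache_is_valid(cache_dir, reader):
    """Validate a cache by actually reading it rather than checking files.

    `reader` is any callable that raises on a corrupt cache; for this PR
    it would build a StreamingDataset over `cache_dir` and iterate a few
    samples.
    """
    try:
        reader(cache_dir)
        return True
    except Exception:
        # Any read failure is treated as corruption; the caller can then
        # rebuild the cache from scratch.
        return False
```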

@jhnwu3 jhnwu3 requested a review from Logiquo March 16, 2026 17:47
@jhnwu3 (Collaborator, Author) commented Mar 16, 2026

Thanks for the catch @EricSchrock. I've deleted the redundant test cases and edited a few things so they don't break.

I've asked @Logiquo for a re-review/approval. I think the main thing is just making it more robust to user error.

One problem is that race conditions can occur if a user isn't careful and runs multiple jobs at the same time with the same cache_dir. I only discovered this myself after the agent submitted multiple parallel jobs and I had to clean up the corrupted cache it left behind, and I think it's entirely possible for users to stumble into the same pattern in their own development cycle.
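One standard way to avoid the overwritten or half-written cache files mentioned above is to write to a temporary file and atomically rename it into place. A generic stdlib sketch, not code from this PR:

```python
import os
import tempfile

def write_atomically(path, data: bytes):
    # Write to a temp file in the destination directory, then rename it
    # into place. os.replace is atomic on the same filesystem, so readers
    # never observe a partially written file even under concurrent jobs.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # make sure the bytes hit disk
        os.replace(tmp, path)      # atomic rename over any existing file
    except BaseException:
        os.unlink(tmp)             # clean up the temp file on failure
        raise
```

This complements the lock: the lock serializes builders, and atomic replacement protects any reader that races with a writer.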

@EricSchrock (Collaborator) left a comment:

Flagging a couple more cuda:# instances to fix. Submitting as "Comment" instead of "Approve" to clear the "Changes requested" status but leave the final approval to @Logiquo.

# ---------------------------------------------------------------------------
# STEP 4: Define Optuna objective
# ---------------------------------------------------------------------------
DEVICE = "cuda:3" # or "cpu"

Suggested change
DEVICE = "cuda:3" # or "cpu"
DEVICE = "cuda:0" # or "cpu"

# STEP 5: Train the model
trainer = Trainer(
model=model,
device="cuda:4", # or "cpu"

Suggested change
device="cuda:4", # or "cpu"
device="cuda:0", # or "cpu"

# ---------------------------------------------------------------------------
# STEP 4: Define Optuna objective
# ---------------------------------------------------------------------------
DEVICE = "cuda:2" # or "cpu"

Suggested change
DEVICE = "cuda:2" # or "cpu"
DEVICE = "cuda:0" # or "cpu"
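Rather than hardcoding any GPU index, the examples could derive the device at runtime. A small sketch (`pick_device` is a hypothetical helper; probing with `torch.cuda.is_available()` is the usual PyTorch idiom):

```python
def pick_device(cuda_available: bool, index: int = 0) -> str:
    # Map GPU availability to a device string the trainer accepts,
    # falling back to CPU on machines without a GPU.
    return f"cuda:{index}" if cuda_available else "cpu"

# In an example script one would typically write:
#   import torch
#   DEVICE = pick_device(torch.cuda.is_available())
```

This keeps the examples runnable on single-GPU and CPU-only machines without edits.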

@EricSchrock (Collaborator) left a comment:

Actually, it looks like I have to approve to clear the "Changes requested" status. Approving to unblock.

@Logiquo (Collaborator) left a comment:

LGTM, thanks!

@Logiquo (Collaborator) commented Mar 17, 2026

> [quoting @jhnwu3's comment above about race conditions when multiple jobs share the same cache_dir]

I agree, and I think FileLock should be sufficient to solve this issue.
