[SyntheticEHR] HALO Baseline #897

Open
jalengg wants to merge 71 commits into sunlabuiuc:master from jalengg:halo-pr-integration

jalengg commented Mar 17, 2026

chufangao and others added 30 commits June 15, 2025 13:04
- Add HALO (Healthcare generative model using transformers) implementation
- Include example training script with configurable parameters
- Include example generation script for synthetic patient data
- Add canonical SLURM scripts with optimal parameters (80 epochs, batch_size 48, lr 0.0001)
- Register HALO in generators module
- Update HALO_MIMIC3Dataset with latest preprocessing
- Update README with HALO documentation
Remove README.rst changes that only documented CorGAN, not HALO.
This PR should focus solely on HALO implementation.
…ls to HALO notebook

Complete Tasks 3-7:
- Configuration panel with demo defaults
- Data upload with validation
- Training logic with checkpoint management
- Generation with CSV conversion
- Results display with quality checks and download

Notebook now has 24 cells with complete end-to-end workflow.
- Replace `!pip install` with subprocess.run() for error checking
- Show clear error message if installation fails
- Raise RuntimeError to stop notebook execution on failure

Fixes #1
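
The install-cell change described in the commit above might look like the following minimal sketch. The helper name `run_checked` and the error wording are assumptions for illustration, not the notebook's actual code; only the use of `subprocess.run()` and `RuntimeError` comes from the commit message.

```python
import subprocess
import sys


def run_checked(cmd):
    """Run a command and raise RuntimeError if it exits non-zero.

    Hypothetical helper: the name and message format are illustrative only.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface stderr so the notebook user can see why the step failed.
        raise RuntimeError(
            f"Command failed with exit code {result.returncode}: "
            f"{' '.join(cmd)}\n{result.stderr.strip()}"
        )
    return result.stdout


# In a notebook cell, an install step could then be written as, e.g.:
# run_checked([sys.executable, "-m", "pip", "install", "<package spec>"])
```

Unlike `!pip install`, a failing command here raises and stops the notebook from running later cells.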
- Remove PATIENTS.csv and patient_ids.txt (not used by HALO_MIMIC3Dataset)
- Handle Colab file renaming (ADMISSIONS (1).csv -> ADMISSIONS.csv)
- Allow uploading files one at a time with progress tracking
- Check Google Drive for existing files before requesting upload
- Add FORK variable to installation cell for easier testing

Fixes #4, #5, #6
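
The Colab renaming fix mentioned above could be approached as in this sketch, which strips the ` (N)` marker that appears before the extension on a repeated upload. The function name and the regular expression are assumptions, not the notebook's code.

```python
import re

# Matches a " (N)" duplicate marker that sits directly before the file extension.
_DUPLICATE_SUFFIX = re.compile(r" \(\d+\)(?=\.[^.]+$)")


def normalize_upload_name(filename):
    """Return the filename without a trailing ' (N)' duplicate marker.

    Hypothetical helper, e.g. 'ADMISSIONS (1).csv' becomes 'ADMISSIONS.csv'.
    """
    return _DUPLICATE_SUFFIX.sub("", filename)
```

Names without the marker pass through unchanged, so the helper can be applied to every uploaded file.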
Ensures Colab users always get the latest version from GitHub without
using cached packages. Critical for picking up recent fixes like the
halo_resources __init__.py.

Fixes #18
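
One way to avoid cached packages is to pass pip's `--no-cache-dir` and `--force-reinstall` options. Whether the notebook uses exactly these options is an assumption, and the install spec below is a placeholder rather than the real fork URL.

```python
import sys


def build_fresh_install_cmd(spec):
    """Build a pip command that bypasses the cache and reinstalls the package.

    Hypothetical helper; the flag choice is an assumption about the fix.
    """
    return [
        sys.executable, "-m", "pip", "install",
        "--no-cache-dir",     # do not reuse files cached by earlier installs
        "--force-reinstall",  # reinstall even if a version is already present
        spec,
    ]


# Placeholder spec: the actual host, fork, and repository are not shown here.
cmd = build_fresh_install_cmd("git+https://<host>/<fork>/<repo>.git")
```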
Use os.path.join() instead of string concatenation to properly handle
directory paths with or without trailing slashes.

Fixes #19
Fixes #21

The YAML config files in pyhealth/datasets/configs/ were not being
included when the package was installed via pip. This caused
FileNotFoundError for multiple datasets including HALO, MIMIC3,
MIMIC4, EHRShot, COVID-19 CXR, and Medical Transcriptions.

Added MANIFEST.in to specify which non-Python files should be
included in the package distribution.
Fixes #21

MANIFEST.in only affects sdist source distributions. When installing
via `pip install git+https://...` (as in Colab), pip relies on
package_data in setup.py to include non-Python files.

Added explicit package_data to ensure YAML configs in
pyhealth/datasets/configs/ are included in all install paths.
Removed MANIFEST.in as it provided no benefit for pip-from-git installs.
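
A `package_data` entry of the kind described above might be sketched as follows. The package key and glob pattern are assumptions inferred from the `pyhealth/datasets/configs/` path; the actual `setup.py` may declare them differently.

```python
# Sketch of the relevant setup.py arguments (package key and glob are assumptions).
SETUP_KWARGS = {
    "name": "pyhealth",
    "package_data": {
        # Patterns are resolved relative to the named package's directory.
        "pyhealth.datasets": ["configs/*.yaml"],
    },
}

# In setup.py these would be passed on as:
#     from setuptools import find_packages, setup
#     setup(packages=find_packages(), **SETUP_KWARGS)
```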
Timestamp reflects when notebook was last modified so users can
verify they are running the correct version. Reverts dynamic
install-time timestamp in favor of this static header approach.
When pkl_data_dir has no trailing slash (e.g. "/path/to/pkl_data"),
raw f-string concatenation produced invalid paths like
"/path/to/pkl_datacodeToIndex.pkl" instead of
"/path/to/pkl_data/codeToIndex.pkl".

Replace all pickle save paths with os.path.join(). Also add
os.makedirs() so the output directory is created if missing.
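
The path fix above can be illustrated with a small sketch. The helper name `save_pickle` is hypothetical; only `os.path.join()`, `os.makedirs()`, and the `codeToIndex.pkl` filename come from the commit message.

```python
import os
import pickle
import tempfile


def save_pickle(obj, pkl_data_dir, filename):
    """Save obj under pkl_data_dir, creating the directory if it is missing."""
    os.makedirs(pkl_data_dir, exist_ok=True)
    # os.path.join inserts the separator only when it is needed.
    path = os.path.join(pkl_data_dir, filename)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path


# The result is the same with or without a trailing separator on the directory.
base = os.path.join(tempfile.mkdtemp(), "pkl_data")
path_a = save_pickle({"A": 0}, base, "codeToIndex.pkl")
path_b = save_pickle({"A": 0}, base + os.sep, "codeToIndex.pkl")
```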
jalengg marked this pull request as ready for review March 17, 2026 03:12
jalengg changed the title from "Halo pr integration" to "[SyntheticEHR] HALO Baseline" Mar 17, 2026
3 participants