[Foundational-EHR-Model-Multimodal] Create initial task (discharge notes + radiology notes) #840
Closed
will-pang wants to merge 34 commits into sunlabuiuc:master from will-pang:FoundationalEHR/wp-create-multimodal-task-notes (+213 −36)
Commits
5bdaa76 Create ehr_foundation_model task (will-pang)
5a49feb Add example for testing (will-pang)
9777311 Update ehr_foundational_model_mimic4.py (will-pang)
6ecf289 Merge remote-tracking branch 'upstream/master' into FoundationalEHR/w… (will-pang)
60641c0 Update ehr_foundational_model_mimic4.py (will-pang)
e04f54e Update ehr_foundational_model_mimic4.py (will-pang)
13e46f0 Add handling of missing notes (will-pang)
f456e53 Update ehr_foundational_model_mimic4.py (will-pang)
9eea8fe update comments (will-pang)
eafd929 update comments (will-pang)
fe53e89 Update tuple_time_text_processor.py (will-pang)
0e1df77 Update multimodal_task.py (will-pang)
3d30dbf Update tuple_time_text_processor.py (will-pang)
20306e3 Update ehr_foundational_model_mimic4.py (will-pang)
24d1e7b Update tuple_time_text_processor.py (will-pang)
4fab928 Update tuple_time_text_processor.py (will-pang)
bdf00e0 Update comments (will-pang)
06cdd1e Update ehr_foundational_model_mimic4.py (will-pang)
a1b2756 Update tuple_time_text_processor.py (will-pang)
adf3a53 Minor update in docs (will-pang)
aa1b0ed Create test_ehr_foundational_model_mimic4.py (will-pang)
3edf07e Add unit test (will-pang)
2413fd5 Update naming (will-pang)
a353f60 Update ehr_foundational_model_mimic4.py (will-pang)
12e968d Remove comments (will-pang)
b110a2e Update ehr_foundational_model_mimic4.py (will-pang)
f886cdd Delete test_ehr_foundational_model_mimic4.py (will-pang)
f1b74f2 Renaming updates (will-pang)
6fc7bd0 Update ehr_foundational_model_mimic4.py (will-pang)
628daa5 Merge branch 'sunlabuiuc:master' into FoundationalEHR/wp-create-multi… (will-pang)
0584779 Merge branch 'sunlabuiuc:master' into FoundationalEHR/wp-multimodal-t… (will-pang)
045e173 Merge branch 'FoundationalEHR/wp-multimodal-task-lab-events-icd-codes… (will-pang)
fe764ac Update ehr_foundational_model_mimic4.py (will-pang)
53761d6 Update ehr_foundational_model_mimic4.py (will-pang)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| from datetime import datetime | ||
| from typing import Any, Dict, List, Optional | ||
| import os | ||
|
|
||
| # PyHealth Packages | ||
| from pyhealth.datasets import MIMIC4Dataset | ||
| from pyhealth.tasks.ehr_foundational_model_mimic4 import EHRFoundationalModelMIMIC4 | ||
| from pyhealth.tasks.base_task import BaseTask | ||
|
|
||
| # Load MIMIC4 Files | ||
| # There are probably better ways of dealing with this on the cluster, but this works locally for now | ||
| # (see: https://github.com/sunlabuiuc/PyHealth/blob/master/examples/mortality_prediction/multimodal_mimic4_minimal.py) | ||
|
|
||
| PYHEALTH_REPO_ROOT = '/Users/wpang/Desktop/PyHealth' | ||
|
|
||
| EHR_ROOT = os.path.join(PYHEALTH_REPO_ROOT, "srv/local/data/physionet.org/files/mimiciv/2.2") | ||
| NOTE_ROOT = os.path.join(PYHEALTH_REPO_ROOT, "srv/local/data/physionet.org/files/mimic-iv-note/2.2") | ||
| CXR_ROOT = os.path.join(PYHEALTH_REPO_ROOT, "srv/local/data/physionet.org/files/mimic-cxr-jpg/2.0.0") | ||
| CACHE_DIR = os.path.join(PYHEALTH_REPO_ROOT, "srv/local/data/wp/pyhealth_cache") | ||
|
|
||
| if __name__ == "__main__": | ||
|
|
||
| dataset = MIMIC4Dataset( | ||
| ehr_root=EHR_ROOT, | ||
| note_root=NOTE_ROOT, | ||
| ehr_tables=["diagnoses_icd", "procedures_icd", "prescriptions", "labevents"], | ||
| note_tables=["discharge", "radiology"], | ||
| cache_dir=CACHE_DIR, | ||
| num_workers=8, | ||
| dev=True | ||
| ) | ||
|
|
||
| # Apply multimodal task | ||
| task = EHRFoundationalModelMIMIC4() | ||
| samples = dataset.set_task(task, cache_dir=f"{CACHE_DIR}/task", num_workers=8) | ||
|
|
||
| # Get and print sample | ||
| sample = samples[0] | ||
| print(sample) |
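The hard-coded `PYHEALTH_REPO_ROOT` in this example ties it to one machine. A minimal sketch of one alternative, assuming a hypothetical environment variable `PYHEALTH_DATA_ROOT` (not something PyHealth defines) that points at the directory containing `physionet.org/`. The helper name `resolve_roots` and the cache location are likewise illustrative only:

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional

# Hypothetical variable name, used only for this illustration.
DATA_ROOT_ENV_VAR = "PYHEALTH_DATA_ROOT"


def resolve_roots(
    env: Optional[Mapping[str, str]] = None,
    default_root: str = "srv/local/data",
) -> Dict[str, str]:
    """Build the dataset paths from an environment variable, with a fallback."""
    env = os.environ if env is None else env
    root = Path(env.get(DATA_ROOT_ENV_VAR, default_root))
    physionet = root / "physionet.org" / "files"
    return {
        "ehr_root": str(physionet / "mimiciv" / "2.2"),
        "note_root": str(physionet / "mimic-iv-note" / "2.2"),
        # Assumed cache location; the example above uses a user-specific one.
        "cache_dir": str(root / "pyhealth_cache"),
    }
```

The returned dictionary could then be unpacked into the `ehr_root`, `note_root`, and `cache_dir` arguments that the example passes to `MIMIC4Dataset`, so the same script runs locally and on a cluster by changing only the environment.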
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,135 @@ | ||
| from datetime import datetime | ||
| from typing import Any, Dict, List, Optional, Union, Tuple | ||
|
|
||
| from pyhealth.tasks.base_task import BaseTask | ||
|
|
||
| class EHRFoundationalModelMIMIC4(BaseTask): | ||
|
|
||
| task_name: str = "EHRFoundationalModelMIMIC4" | ||
| TOKEN_REPRESENTING_MISSING_TEXT = "<missing>" | ||
| TOKEN_REPRESENTING_MISSING_FLOAT = float("nan") | ||
|
|
||
| def __init__(self): | ||
| """Initialize the EHR Foundational Model task.""" | ||
| self.input_schema: Dict[str, Union[str, Tuple[str, Dict]]] = { | ||
| "discharge_note_times": ( | ||
| "tuple_time_text", | ||
| { | ||
| "tokenizer_name": "bert-base-uncased", | ||
| "type_tag": "note", | ||
| }, | ||
| ), | ||
| "radiology_note_times": ( | ||
| "tuple_time_text", | ||
| { | ||
| "tokenizer_name": "bert-base-uncased", | ||
| "type_tag": "note", | ||
| }, | ||
| ) | ||
| } | ||
| self.output_schema: Dict[str, str] = {"mortality": "binary"} | ||
|
|
||
|
||
| def _clean_text(self, text: Optional[str]) -> Optional[str]: | ||
| """Return text if non-empty, otherwise None.""" | ||
| return text if text else None | ||
|
|
||
| def __call__(self, patient: Any) -> List[Dict[str, Any]]: | ||
| # Get demographic info (patients without a demographics record are skipped) | ||
| demographics = patient.get_events(event_type="patients") | ||
| if not demographics: | ||
| return [] | ||
|
|
||
| demographics = demographics[0] | ||
|
|
||
| # Get visits | ||
| admissions = patient.get_events(event_type="admissions") | ||
| if len(admissions) == 0: | ||
| return [] | ||
|
|
||
| # Walk admissions in order and stop at the first in-hospital death: | ||
| # the fatal admission itself is excluded, and the admission just before it is the last one kept | ||
| admissions_to_process = [] | ||
| mortality_label = 0 | ||
|
|
||
| for i, admission in enumerate(admissions): | ||
| # Check if THIS admission has the death flag | ||
| if admission.hospital_expire_flag in [1, "1"]: | ||
| # Patient died in this admission - set mortality label | ||
| # but don't include this admission's data | ||
| mortality_label = 1 | ||
| break | ||
|
|
||
| # Check if there's a next admission with death flag | ||
| if i + 1 < len(admissions): | ||
| next_admission = admissions[i + 1] | ||
| if next_admission.hospital_expire_flag in [1, "1"]: | ||
| # Next admission has death - include current, set mortality | ||
| admissions_to_process.append(admission) | ||
| mortality_label = 1 | ||
| break | ||
|
|
||
| # No death in current or next - include this admission | ||
| admissions_to_process.append(admission) | ||
|
|
||
| if len(admissions_to_process) == 0: | ||
| return [] | ||
|
|
||
| # Aggregated notes and time offsets across all admissions (per hadm_id) | ||
| all_discharge_texts: List[str] = [] | ||
| all_discharge_times_from_admission: List[float] = [] | ||
| all_radiology_texts: List[str] = [] | ||
| all_radiology_times_from_admission: List[float] = [] | ||
|
|
||
| # Process each admission independently (per hadm_id) | ||
| for admission in admissions_to_process: | ||
| admission_time = admission.timestamp | ||
|
|
||
| # Get notes for this hadm_id only | ||
| discharge_notes = patient.get_events( | ||
| event_type="discharge", filters=[("hadm_id", "==", admission.hadm_id)] | ||
| ) | ||
| radiology_notes = patient.get_events( | ||
| event_type="radiology", filters=[("hadm_id", "==", admission.hadm_id)] | ||
| ) | ||
|
|
||
| for note in discharge_notes: #TODO: Maybe make this into a helper function? | ||
| try: | ||
| note_text = self._clean_text(note.text) | ||
| if note_text: | ||
| time_from_admission = ( | ||
| note.timestamp - admission_time | ||
| ).total_seconds() / 3600.0 | ||
| all_discharge_texts.append(note_text) | ||
| all_discharge_times_from_admission.append(time_from_admission) | ||
| except AttributeError: # note object is missing .text or .timestamp attribute (e.g. malformed note) | ||
| pass | ||
| if not discharge_notes: # If we get an empty list | ||
| all_discharge_texts.append(self.TOKEN_REPRESENTING_MISSING_TEXT) # Token representing missing text | ||
| all_discharge_times_from_admission.append(self.TOKEN_REPRESENTING_MISSING_FLOAT) # Token representing missing time(?) | ||
|
|
||
| for note in radiology_notes: #TODO: Maybe make this into a helper function? | ||
| try: | ||
| note_text = self._clean_text(note.text) | ||
| if note_text: | ||
| time_from_admission = ( | ||
| note.timestamp - admission_time | ||
| ).total_seconds() / 3600.0 | ||
| all_radiology_texts.append(note_text) | ||
| all_radiology_times_from_admission.append(time_from_admission) | ||
| except AttributeError: # note object is missing .text or .timestamp attribute (e.g. malformed note) | ||
| pass | ||
| if not radiology_notes: # If we get an empty list | ||
| all_radiology_texts.append(self.TOKEN_REPRESENTING_MISSING_TEXT) # Token representing missing text | ||
| all_radiology_times_from_admission.append(self.TOKEN_REPRESENTING_MISSING_FLOAT) # Token representing missing time(?) | ||
|
|
||
| discharge_note_times_from_admission = (all_discharge_texts, all_discharge_times_from_admission) | ||
| radiology_note_times_from_admission = (all_radiology_texts, all_radiology_times_from_admission) | ||
|
|
||
| return [ | ||
| { | ||
| "patient_id": patient.patient_id, | ||
| "discharge_note_times": discharge_note_times_from_admission, | ||
| "radiology_note_times": radiology_note_times_from_admission, | ||
| "mortality": mortality_label, | ||
| } | ||
| ] | ||
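The discharge and radiology loops in this file are identical apart from the note type, as the TODO comments point out. A minimal sketch of what a shared helper could look like, assuming note objects expose `.text` and `.timestamp` as in the diff. The function name `collect_notes` and the module-level constants are hypothetical, and `SimpleNamespace` stands in for real PyHealth events:

```python
from datetime import datetime, timedelta
from types import SimpleNamespace
from typing import Any, List, Tuple

MISSING_TEXT = "<missing>"
MISSING_TIME = float("nan")


def collect_notes(
    notes: List[Any], admission_time: datetime
) -> Tuple[List[str], List[float]]:
    """Return (texts, hours_from_admission) for one admission's notes."""
    texts: List[str] = []
    hours: List[float] = []
    for note in notes:
        try:
            text = note.text
            if not text:
                continue  # skip empty notes, as _clean_text does
            offset = (note.timestamp - admission_time).total_seconds() / 3600.0
        except AttributeError:
            # Malformed note: missing .text or .timestamp.
            continue
        texts.append(text)
        hours.append(offset)
    if not notes:
        # Same rule as the diff: placeholders only when no notes were returned.
        texts.append(MISSING_TEXT)
        hours.append(MISSING_TIME)
    return texts, hours


# Stand-in notes for illustration only.
admit = datetime(2020, 1, 1, 8, 0)
example_notes = [
    SimpleNamespace(text="CXR: no acute findings", timestamp=admit + timedelta(hours=6)),
    SimpleNamespace(text="", timestamp=admit),  # empty text, skipped
    SimpleNamespace(timestamp=admit),           # no .text attribute, skipped
]
example_texts, example_hours = collect_notes(example_notes, admit)
```

Inside `__call__`, each per-admission loop would then reduce to one call whose results are appended to the aggregated lists with `extend`. Note that, like the diff, this sketch adds the missing-value placeholders only when the note list is empty, not when every note in it is empty or malformed.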