Skip to content

[Contribution] Create ehr_foundation_model task#840

Draft
will-pang wants to merge 20 commits intosunlabuiuc:masterfrom
will-pang:FoundationalEHR/wp-create-multimodal-task-notes
Draft

[Contribution] Create ehr_foundation_model task#840
will-pang wants to merge 20 commits intosunlabuiuc:masterfrom
will-pang:FoundationalEHR/wp-create-multimodal-task-notes

Conversation

@will-pang
Copy link
Contributor

@will-pang will-pang commented Feb 11, 2026

Contributor Information

Description

v0.2

Per John's feedback I've incorporated a few changes:
(1) Updated _compute_time_diffs(notes_with_timestamps, first_admission_time) helper function

  • Sorts notes chronologically by timestamp, then computes each note's offset (in hours) from the first admission time.
    • If a member does not have a note, it will return (["<missing>"], [0.0])
  • TODO: I should do a bit more testing on _compute_time_diffs just to see if there are edge-cases where (1) first_admission_time is missing (2) first_admission_time does not make sense relative to the timestamps of other notes. This work would inform us if we need better error handling, but I'll circle back to this once I push on other aspects of the PR.

(2) Added text tokenizer to @Rian354's pyhealth/processors/tuple_time_text_processor.py script

  • In pyhealth/tasks/ehr_foundational_model_mimic4.py you can also now pass the tokenizer settings in kwargs (see example #5 from tasks docs)

v0.1

Per John's feedback, I've incorporated a few changes:

  • Incorporate Rian's tuple_time_text_processor, which now feeds in radiology and discharge notes as (note_text, time_diff_hours) tuples
    • I think we still need to discuss how to compute time_diff, in the sense that whether the timestamps from note.timestamp give a proper chronology of time or not. If we need to arrange in chronological order, I probably need to add a .sort(lambda x: x['time_stamp']) function or something equivalent.
  • If the note is missing, we now return:
    return (["<missing>"], [0.0]) # TODO: How should we handle notes with missing timestamps? 
    
  • The code seems to bug out on the preprocessor. I'm not as familiar with litdata for serialization, but per Claude:

When litdata sees a tuple like (List[str], List[float], str), it flattens it and infers a single serializer for the whole field based on the first element it encounters. So if the first element it hits is a str, it uses the string serializer for everything — then fails when it encounters a float.


v0

More of a draft PR as I'm still fairly new to the inner workings of the package, but here'a few things that I still think needs to be done:

  • Seems to work when I run:
  dataset = MIMIC4Dataset(
          ehr_root=EHR_ROOT,
          note_root=NOTE_ROOT,
          ehr_tables=["diagnoses_icd", "procedures_icd", "prescriptions", "labevents"],
          note_tables=["discharge", "radiology"],
          cache_dir=CACHE_DIR,
          num_workers=16,
          dev = True
      )
  
      task = EHRFoundationalModelMIMIC4()    
      samples = dataset.set_task(task, cache_dir=f"{CACHE_DIR}/task", num_workers=8)

but when I run it outside of dev mode, I run into this error:

AttributeError: 'str' object has no attribute 'dtype'                                                             
/Users/wpang/.local/share/uv/python/cpython-3.12.12-macos-aarch64-none/lib/python3.12/multiprocessing/resource_tr 
acker.py:279: UserWarning: resource_tracker: There appear to be 10 leaked semaphore objects to clean up at        
shutdown                                                                                                          
warnings.warn('resource_tracker: There appear to be %d '      

Claude says that this relates to the notes varying by length from patient to patient (e.g., patient A might have 4 radiology notes and 2 discharge notes, whereas patient B might have 2 radiology notes and 5 discharge notes), but I'm a little stuck as I am still getting comfortable with the architecture of the package.


Testing Notes

  • To test, you can run this script: examples/foundation_ehr/multimodal_task.py

@will-pang will-pang changed the title Create ehr_foundation_model task [Contribution] Create ehr_foundation_model task Feb 11, 2026
Copy link
Collaborator

@jhnwu3 jhnwu3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, some quick comments, because now I think I know what's happening that's happened to me before.

What you can probably do is just make sure every patient has a List[str]. For patients without a type of note, you can just append "<missing>" to denote a missing note or something of that sort or "". We'll probably have to standardize.

Another thing we can do is make sure to align our definitions with Rian's tuple time processors:

https://pyhealth.readthedocs.io/en/latest/api/processors/pyhealth.processors.TupleTimeTextProcessor.html

So you don't have to define a times feature.

i.e instead of

input_schema = "discharge_times" : 'tensor', "discharge" : 'raw'
what you can do is do:

input_schema = "discharge_note_times" : "tuple_time_text"

where each discharge_note_times is a (notes, times)

Let me know if this helps!

Copy link
Collaborator

@jhnwu3 jhnwu3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some other things we definitely need to revisit and follow now that I understand how the processors work, but fortunately PyHealth is really flexible so it should be doable:

  1. In our Task, we'll need explicitly define the processor classes themselves with arguments https://pyhealth.readthedocs.io/en/latest/api/processors.html -> this documentation should explain how to define processor arguments in a task
  2. The TupleTimeTextProcessor will need to leverage a HuggingFace Tokenizer so all texts will be tokenized into a [T x L] tensor of tokens, with a time tensor of Tdimension.
  3. The TimeImageProcessor I think fortunately works as it should with the litdata expectations, just two tensors.
  4. The TextEmbedding model will need to assume inputs are already tokenized

@will-pang
Copy link
Contributor Author

will-pang commented Feb 15, 2026

I still have a few outstanding questions on computing time_diff for @jhnwu3 (I think we were debugging the processors so ran out of time to address this), but wanted to check on computing time_diffs:

  1. Should I add a .sort(key=lambda x: x[1]) on the (text, timestamp) lists before computing time diffs? Seems like MIMIC uses fake dates, but perhaps the fake dates still preserve chronology so I wanted to check with you haha

  2. Anchor time vs. sequential diffs: Currently _compute_time_diffs computes time_diff from previous
    note. Should we instead use an anchor time (e.g., first admission time) so that diffs represent absolute offsets?

  3. When a patient has zero notes of a given type, we return (["<missing>"], [0.0]). Is this correct?

Adding John's Reply on Discord:

Screenshot 2026-02-15 at 2 46 34 PM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants