[Contribution] Create ehr_foundation_model task#840
will-pang wants to merge 20 commits into sunlabuiuc:master
Ahh, some quick comments, because I think I now know what's happening; it's happened to me before.
What you can probably do is make sure every patient has a List[str]. For patients without a given type of note, you can append "<missing>" (or "", or something of that sort) to denote a missing note. We'll probably have to standardize this.
Another thing we can do is align our definitions with Rian's tuple time processors, so you don't have to define a separate times feature.
i.e., instead of
input_schema = {"discharge_times": "tensor", "discharge": "raw"}
you can do:
input_schema = {"discharge_note_times": "tuple_time_text"}
where each discharge_note_times is a (notes, times) tuple.
Let me know if this helps!
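A minimal sketch of the padding idea above, assuming each patient's notes arrive as a dict of (text, time) pairs keyed by note type. The helper name `pad_note_types`, the `radiology_note_times` key, and the choice of 0.0 as the placeholder time are illustrative assumptions, not PyHealth API:

```python
from typing import Dict, List, Tuple

# Placeholder suggested in the comment above; the exact token is not yet standardized.
MISSING = "<missing>"


def pad_note_types(
    notes_by_type: Dict[str, List[Tuple[str, float]]],
    expected_types: List[str],
) -> Dict[str, Tuple[List[str], List[float]]]:
    """Return one (notes, times) tuple per expected note type.

    Note types the patient lacks get a single placeholder note at time 0.0,
    so every patient ends up with the same keys and a non-empty List[str].
    """
    sample: Dict[str, Tuple[List[str], List[float]]] = {}
    for note_type in expected_types:
        pairs = notes_by_type.get(note_type, [])
        if pairs:
            texts = [text for text, _ in pairs]
            times = [time for _, time in pairs]
        else:
            texts, times = [MISSING], [0.0]
        # Hypothetical key naming, following "discharge_note_times" above.
        sample[f"{note_type}_note_times"] = (texts, times)
    return sample


# A patient with one discharge note and no radiology notes.
patient = {"discharge": [("Pt discharged home.", 48.0)]}
sample = pad_note_types(patient, ["discharge", "radiology"])
```

With this shape, every sample carries the same keys regardless of which notes a patient actually has, which is what a single "tuple_time_text" feature per note type would expect.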
…p-create-multimodal-task-notes
jhnwu3 left a comment
Some other things we definitely need to revisit and follow up on now that I understand how the processors work. Fortunately PyHealth is really flexible, so it should be doable:
- In our Task, we'll need to explicitly define the processor classes themselves with arguments. The documentation at https://pyhealth.readthedocs.io/en/latest/api/processors.html should explain how to define processor arguments in a task.
- The TupleTimeTextProcessor will need to leverage a HuggingFace Tokenizer so all texts will be tokenized into a [T x L] tensor of tokens, with a time tensor of dimension T.
- The TimeImageProcessor, I think, fortunately works as it should with the litdata expectations: just two tensors.
- The TextEmbedding model will need to assume inputs are already tokenized.
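To make the [T x L] shape in the list above concrete, here is a small sketch that uses a toy whitespace tokenizer as a stand-in for a HuggingFace tokenizer. The names `toy_tokenize` and `encode_notes`, the vocabulary, and the pad/unknown ids are all hypothetical; this is not the TupleTimeTextProcessor implementation:

```python
from typing import Dict, List, Tuple

PAD_ID = 0  # assumed padding id for this toy example
UNK_ID = 1  # assumed id for out-of-vocabulary tokens


def toy_tokenize(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map whitespace tokens to ids, then truncate or pad to exactly max_len."""
    ids = [vocab.get(token, UNK_ID) for token in text.lower().split()]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def encode_notes(
    notes: List[str],
    times: List[float],
    vocab: Dict[str, int],
    max_len: int,
) -> Tuple[List[List[int]], List[float]]:
    """Return a T x L grid of token ids and a length-T list of times."""
    if len(notes) != len(times):
        raise ValueError("each note needs exactly one time value")
    token_rows = [toy_tokenize(note, vocab, max_len) for note in notes]
    return token_rows, [float(t) for t in times]


vocab = {"chest": 2, "clear": 3, "no": 4, "effusion": 5}
tokens, times = encode_notes(
    ["Chest clear", "No effusion seen today"], [2.0, 26.5], vocab, max_len=3
)
# tokens has T = 2 rows of length L = 3; times has T = 2 entries.
```

With a real HuggingFace tokenizer, the equivalent step would likely be a single batched call with padding and truncation enabled and tensors returned, so the downstream embedding model can assume already-tokenized inputs.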
I still have a few outstanding questions on computing
Adding John's Reply on Discord:

Contributor Information
Description
v0.2
Per John's feedback I've incorporated a few changes:
(1) Updated the _compute_time_diffs(notes_with_timestamps, first_admission_time) helper function
- (["<missing>"], [0.0])
- _compute_time_diffs just to see if there are edge cases where (1) first_admission_time is missing or (2) first_admission_time does not make sense relative to the timestamps of other notes. This work would inform us if we need better error handling, but I'll circle back to this once I push on other aspects of the PR.
(2) Added a text tokenizer to @Rian354's pyhealth/processors/tuple_time_text_processor.py script
- In pyhealth/tasks/ehr_foundational_model_mimic4.py you can also now pass the tokenizer settings in kwargs (see example #5 from tasks docs)
v0.1
Per John's feedback, I've incorporated a few changes:
- tuple_time_text_processor, which now feeds in radiology and discharge notes as (note_text, time_diff_hours) tuples
- time_diff, in the sense that whether the timestamps from note.timestamp give a proper chronology of time or not. If we need to arrange in chronological order, I probably need to add a .sort(key=lambda x: x['time_stamp']) function or something equivalent.
v0
More of a draft PR as I'm still fairly new to the inner workings of the package, but here are a few things that I still think need to be done:
but when I run it outside of dev mode, I run into this error:
Claude says that this relates to the number of notes varying from patient to patient (e.g., patient A might have 4 radiology notes and 2 discharge notes, whereas patient B might have 2 radiology notes and 5 discharge notes), but I'm a little stuck as I am still getting comfortable with the architecture of the package.
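For the chronology and edge-case questions raised in the v0.2 and v0.1 notes, a rough sketch of a time-difference helper is shown below. It is named `compute_time_diffs_sketch` to make clear it is not the PR's `_compute_time_diffs`; returning the placeholder when `first_admission_time` is missing, and leaving negative differences for the caller to inspect, are assumptions:

```python
from datetime import datetime
from typing import List, Optional, Tuple


def compute_time_diffs_sketch(
    notes_with_timestamps: List[Tuple[str, datetime]],
    first_admission_time: Optional[datetime],
) -> Tuple[List[str], List[float]]:
    """Return (note_texts, hours_since_first_admission) in chronological order."""
    # Assumption: fall back to the placeholder tuple when there are no notes
    # or no admission time to measure against.
    if not notes_with_timestamps or first_admission_time is None:
        return ["<missing>"], [0.0]

    # Sort by timestamp so the output is chronological regardless of input order.
    ordered = sorted(notes_with_timestamps, key=lambda pair: pair[1])
    texts = [text for text, _ in ordered]
    # Differences may be negative if a note predates the first admission;
    # this sketch leaves that for the caller to detect.
    hours = [
        (timestamp - first_admission_time).total_seconds() / 3600.0
        for _, timestamp in ordered
    ]
    return texts, hours


admit = datetime(2020, 1, 1, 8, 0)
notes = [
    ("CXR follow-up", datetime(2020, 1, 2, 8, 0)),
    ("CXR on arrival", datetime(2020, 1, 1, 9, 30)),
]
texts, hours = compute_time_diffs_sketch(notes, admit)
empty_texts, empty_hours = compute_time_diffs_sketch([], admit)
```

Because a patient with no notes of a given type still yields a one-element tuple, every sample has the same fields even when note counts differ between patients.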
Testing Notes
examples/foundation_ehr/multimodal_task.py