[Contribution] pyhealth.graph: Native knowledge graph support + GraphProcessor #853

Open
joshuasteier wants to merge 2 commits into sunlabuiuc:master from joshuasteier:feature/graph-module

Conversation

@joshuasteier
Contributor

Contributor Information

  • Name: Joshua Steier
  • Contribution Type: Core Infrastructure + Processor + Tests + Documentation

Description

Adds native knowledge graph support to PyHealth via a new pyhealth.graph module and a GraphProcessor in pyhealth.processors. This is foundational infrastructure enabling graph-based EHR models (GraphCare, G-BERT, KAME, etc.) to consume standard PyHealth datasets through the existing processor/schema pipeline.

Builds on Patrick Jiang's earlier BaseKGDataset work on the kg_embedding branch, updated for PyHealth 2.0's processor architecture.

What's included:

pyhealth.graph.KnowledgeGraph — Core data structure for medical knowledge graphs:

  • Loads (head, relation, tail) triples from Python lists or CSV/TSV files
  • Builds entity2id/relation2id mappings automatically
  • Stores edges as PyG-compatible edge_index + edge_type tensors
  • Provides k-hop subgraph() extraction via torch_geometric.utils.k_hop_subgraph
  • Supports optional pre-computed node features (TransE, LLM embeddings, etc.)
  • Neighbor lookup via BFS
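
As a rough illustration of the indexing described in the list above (not the PR's actual implementation), the sketch below builds entity and relation ID maps from triples, stores edges in the 2 x E layout that PyG uses for edge_index, and runs a BFS neighbor lookup. Plain Python lists stand in for tensors so the example has no dependencies. The names build_index and k_hop_neighbors are hypothetical, and treating edges as undirected during the BFS is an assumption.

```python
from collections import deque


def build_index(triples):
    """Assign integer IDs in first-seen order and build PyG-style edge lists."""
    entity2id, relation2id = {}, {}
    heads, tails, edge_type = [], [], []
    for head, relation, tail in triples:
        h = entity2id.setdefault(head, len(entity2id))
        t = entity2id.setdefault(tail, len(entity2id))
        r = relation2id.setdefault(relation, len(relation2id))
        heads.append(h)
        tails.append(t)
        edge_type.append(r)
    # Row 0 holds source nodes, row 1 holds target nodes (shape 2 x E).
    edge_index = [heads, tails]
    return entity2id, relation2id, edge_index, edge_type


def k_hop_neighbors(edge_index, seeds, num_hops):
    """Return all node IDs within num_hops of the seeds (seeds included)."""
    adjacency = {}
    for h, t in zip(*edge_index):
        # Assumption: edges are traversed in both directions.
        adjacency.setdefault(h, set()).add(t)
        adjacency.setdefault(t, set()).add(h)
    visited = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == num_hops:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return visited


# Toy triples, made up for illustration only.
triples = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("warfarin", "treats", "thrombosis"),
]
entity2id, relation2id, edge_index, edge_type = build_index(triples)
one_hop = k_hop_neighbors(edge_index, [entity2id["headache"]], num_hops=1)
two_hop = k_hop_neighbors(edge_index, [entity2id["headache"]], num_hops=2)
```

In the PR itself, subgraph extraction is delegated to torch_geometric.utils.k_hop_subgraph; the hand-written BFS here only shows the idea.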

pyhealth.processors.GraphProcessor — Registered processor bridging EHR codes to graphs:

  • Takes patient medical codes → looks up in KG → returns PyG Data subgraph
  • Handles flat code lists and multi-visit (list of list) code inputs
  • Optional max_nodes pruning (seeds always kept)
  • Implements FeatureProcessor interface: is_token(), schema(), dim(), spatial()
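
The code-to-node step in the list above could look roughly like the sketch below. It is only a sketch under stated assumptions: flatten_codes and select_nodes are hypothetical helpers, codes missing from the KG are silently skipped, and the pruning rule (keep seeds, then fill the remaining budget with the lowest-ID neighbors) is a placeholder for whatever rule the PR actually uses.

```python
def flatten_codes(codes):
    """Accept a flat list of codes or a list of per-visit code lists."""
    flat = []
    for item in codes:
        if isinstance(item, (list, tuple)):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


def select_nodes(codes, entity2id, neighbors_fn, max_nodes=None):
    """Map codes to seed node IDs, expand, then prune without dropping seeds."""
    seeds = []
    for code in flatten_codes(codes):
        node = entity2id.get(code)  # Assumption: unknown codes are ignored.
        if node is not None and node not in seeds:
            seeds.append(node)
    others = sorted(neighbors_fn(seeds) - set(seeds))
    if max_nodes is not None:
        # Seeds are always kept, even if they alone exceed max_nodes.
        budget = max(max_nodes - len(seeds), 0)
        others = others[:budget]
    return seeds + others


# Toy KG, made up for illustration only.
entity2id = {"I10": 0, "E11": 1, "N18": 2, "I50": 3, "metformin": 4}
adjacency = {0: {2, 3}, 1: {2, 4}}


def toy_neighbors(seeds):
    """One-hop expansion over the toy adjacency map."""
    found = set(seeds)
    for seed in seeds:
        found |= adjacency.get(seed, set())
    return found


visits = [["I10"], ["E11", "Z99"]]  # multi-visit input; "Z99" is not in the KG
pruned = select_nodes(visits, entity2id, toy_neighbors, max_nodes=3)
seeds_only = select_nodes(visits, entity2id, toy_neighbors, max_nodes=1)
unpruned = select_nodes(["I10", "E11"], entity2id, toy_neighbors)
```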

Collation update — Added PyG Data branch to collate_fn_dict_with_padding:

  • Detects PyG Data objects and batches via Batch.from_data_list()
  • torch-geometric is optional — gracefully skips when not installed
  • Zero impact on existing non-graph pipelines
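
A minimal sketch of how an optional PyG branch like this can be guarded is shown below. It is not the code added to collate_fn_dict_with_padding: collate_field is a hypothetical helper, and returning the values unchanged stands in for the existing padding logic.

```python
try:
    # torch-geometric is optional; fall back cleanly when it is missing.
    from torch_geometric.data import Batch, Data
    HAS_PYG = True
except ImportError:
    Batch = Data = None
    HAS_PYG = False


def collate_field(values):
    """Batch PyG Data objects; leave every other field type untouched."""
    if HAS_PYG and values and all(isinstance(v, Data) for v in values):
        return Batch.from_data_list(values)
    # Placeholder for the existing non-graph path (padding, stacking, etc.).
    return values


non_graph = collate_field([1, 2, 3])
empty = collate_field([])
```

Because the graph branch only triggers when every value is a Data instance, fields from non-graph pipelines take the same path whether or not torch-geometric is installed.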

Usage

from pyhealth.graph import KnowledgeGraph

kg = KnowledgeGraph(triples="data/umls_triples.csv")

input_schema = {
    "conditions": ("graph", {
        "knowledge_graph": kg,
        "num_hops": 2,
        "max_nodes": 500,
    }),
}

Design decisions:

  • torch-geometric is an optional dependency — imported only when graph features are used
  • User provides the KG — PyHealth does not generate it (per discussion with @jhnwu3)
  • Follows the same registration and integration patterns as TimeImageProcessor

Files to Review

New files:

  • pyhealth/graph/__init__.py — module exports
  • pyhealth/graph/knowledge_graph.py — KnowledgeGraph class
  • pyhealth/processors/graph_processor.py — GraphProcessor
  • tests/core/test_knowledge_graph.py — 22 unit tests
  • tests/core/test_graph_processor.py — 19 unit tests
  • docs/api/graph.rst — graph module docs
  • docs/api/graph/pyhealth.graph.KnowledgeGraph.rst — KG API docs
  • docs/api/processors/pyhealth.processors.GraphProcessor.rst — processor docs

Edited files:

  • pyhealth/processors/__init__.py — added GraphProcessor import
  • pyhealth/datasets/utils.py — added PyG collation branch
  • docs/api/processors.rst — added to toctree
  • docs/index.rst — added graph section

Testing

python -m unittest tests/core/test_knowledge_graph.py -v
# 22 tests

python -m unittest tests/core/test_graph_processor.py -v
# 19 tests (skip gracefully if torch-geometric not installed)

# Smoke tests
python pyhealth/graph/knowledge_graph.py
python pyhealth/processors/graph_processor.py
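
One common way to get the "skip gracefully" behavior noted above is a class-level unittest.skipUnless guard. The sketch below assumes that approach; the PR's test files may do it differently, and TestGraphFeatures is a hypothetical test case.

```python
import importlib.util
import unittest

# True only when the optional dependency can be imported.
HAS_PYG = importlib.util.find_spec("torch_geometric") is not None


@unittest.skipUnless(HAS_PYG, "torch-geometric is not installed")
class TestGraphFeatures(unittest.TestCase):
    def test_edge_count(self):
        # Imports live inside the test so the module loads without PyG.
        import torch
        from torch_geometric.data import Data

        edge_index = torch.tensor([[0, 1], [1, 2]])
        data = Data(edge_index=edge_index, num_nodes=3)
        self.assertEqual(data.num_edges, 2)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestGraphFeatures)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With this guard the run reports the test as skipped rather than failed when torch-geometric is absent.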

Next steps (separate PRs)

  1. GraphCare model consuming this infrastructure
  2. Pre-built KG artifacts for MIMIC (UMLS subgraphs, LLM-generated KGs)
  3. Additional graph models (G-BERT, KAME)
