[WIP] Load TE core with RTLD_LOCAL to stop rocroller symbol leak #589
sudhu2k wants to merge 2 commits
Switch libtransformer_engine.so from RTLD_GLOBAL to RTLD_LOCAL and link the torch/jax extensions against it explicitly via DT_NEEDED. This prevents librocroller.so symbols from interposing with HIP and fixes `free(): invalid size` in hipModuleLoad when TE is imported before MORI's shmem init on ROCm.
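To illustrate the loader-side half of this change, here is a minimal sketch of choosing `dlopen` visibility flags from Python. The helper names `dlopen_flags` and `load_shared_library` are hypothetical and are not taken from this PR; the actual loading code in `transformer_engine.common` may look different.

```python
import ctypes
import os


def dlopen_flags(expose_symbols: bool = False) -> int:
    """Build dlopen flags (hypothetical helper, not from the PR).

    RTLD_LOCAL keeps the library's symbols, and those of the dependencies it
    pulls in, out of the process-wide lookup scope. RTLD_GLOBAL makes them
    available to libraries loaded later.
    """
    visibility = os.RTLD_GLOBAL if expose_symbols else os.RTLD_LOCAL
    return os.RTLD_NOW | visibility


def load_shared_library(path: str, expose_symbols: bool = False) -> ctypes.CDLL:
    """Load a shared object with the chosen symbol visibility."""
    return ctypes.CDLL(path, mode=dlopen_flags(expose_symbols))
```

With local loading, an extension module can no longer rely on the core library's symbols being globally visible, which is why the extensions have to name it as an explicit link-time dependency.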
Claude Walkthrough
Intent. On ROCm, ...
Testing. No tests added. This is a loader/link-line change; verification is implicit (the extensions must link, and the original ...
Generated by Claude.
Replace `from transformer_engine.common import _get_shared_object_file`
in `build_tools/{pytorch,jax}.py` with a new `installed_te_core_lib_dir`
helper that locates an already-installed libtransformer_engine.so via
sysconfig. The previous import eagerly loaded framework extensions and
asserted they were installed, which broke builds of the framework
extension itself (e.g. `transformer_engine_rocm_torch.tar.gz`).
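A rough sketch of how such a lookup could work without importing `transformer_engine` is shown below. This is not the PR's `installed_te_core_lib_dir` implementation, which is not included here: the function name `find_core_lib_dir`, the `search_roots` parameter, and the assumed install layout (the library sitting either in a `transformer_engine/` package directory or directly in site-packages) are all assumptions made for illustration.

```python
import sysconfig
from pathlib import Path
from typing import Iterable, Optional

CORE_LIB_NAME = "libtransformer_engine.so"


def find_core_lib_dir(search_roots: Optional[Iterable[Path]] = None) -> Path:
    """Return the directory holding the core library (hypothetical helper).

    Only the filesystem is inspected, so no framework extension is imported
    or required to be installed.
    """
    if search_roots is None:
        paths = sysconfig.get_paths()
        roots = [Path(paths["platlib"]), Path(paths["purelib"])]
    else:
        roots = [Path(root) for root in search_roots]

    for root in roots:
        # Assumed layouts: inside the package directory, or next to it.
        for candidate in (root / "transformer_engine", root):
            if (candidate / CORE_LIB_NAME).is_file():
                return candidate

    searched = ", ".join(str(root) for root in roots)
    raise FileNotFoundError(f"{CORE_LIB_NAME} not found under: {searched}")
```

Because the lookup never imports the package, it should not trigger the eager extension loading that broke the framework-extension builds.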
Description
Switch libtransformer_engine.so from RTLD_GLOBAL to RTLD_LOCAL and link the torch/jax extensions against it explicitly via DT_NEEDED. This prevents librocroller.so symbols from interposing with HIP and fixes `free(): invalid size` in hipModuleLoad when TE is imported before MORI's shmem init on ROCm.

https://amd-hub.atlassian.net/browse/AIMORI-12
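For the link-line half of the change, a DT_NEEDED entry is what the linker records when an extension is linked with `-ltransformer_engine`. The sketch below shows one way the corresponding `setuptools.Extension` keyword arguments might be assembled. The helper name `core_link_kwargs` is hypothetical, and using `runtime_library_dirs` for the rpath is an assumption rather than what the PR necessarily does.

```python
from pathlib import Path
from typing import Dict, List, Union


def core_link_kwargs(core_lib_dir: Union[str, Path]) -> Dict[str, List[str]]:
    """Keyword arguments for setuptools.Extension (hypothetical helper).

    `libraries` adds -ltransformer_engine, which produces the DT_NEEDED entry.
    `library_dirs` tells the linker where to find the library at build time.
    `runtime_library_dirs` (an assumption here) embeds a runtime search path.
    """
    lib_dir = str(core_lib_dir)
    return {
        "libraries": ["transformer_engine"],
        "library_dirs": [lib_dir],
        "runtime_library_dirs": [lib_dir],
    }
```

These could then be passed as `Extension(name, sources, **core_link_kwargs(lib_dir))`. The resulting dependency can be checked with `readelf -d` on the built extension, which should list libtransformer_engine.so as NEEDED.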
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: