[WIP] Load TE core with RTLD_LOCAL to stop rocroller symbol leak #589

Draft. sudhu2k wants to merge 2 commits into dev from sudhu/fix-rocroller-global-symbol-leak

@sudhu2k sudhu2k commented May 15, 2026

Description

Switch libtransformer_engine.so from RTLD_GLOBAL to RTLD_LOCAL and link the torch/jax extensions against it explicitly via DT_NEEDED. This prevents librocroller.so symbols from interposing with HIP and fixes free(): invalid size in hipModuleLoad when TE is imported before MORI's shmem init on ROCm.

https://amd-hub.atlassian.net/browse/AIMORI-12

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Switch libtransformer_engine.so from RTLD_GLOBAL to RTLD_LOCAL and
link the torch/jax extensions against it explicitly via DT_NEEDED.
This prevents librocroller.so symbols from interposing with HIP and
fixes `free(): invalid size` in hipModuleLoad when TE is imported
before MORI's shmem init on ROCm.
@sudhu2k sudhu2k added the ci-level 3 CI test level 3 label May 15, 2026
@sudhu2k sudhu2k self-assigned this May 15, 2026
@github-actions

Claude Walkthrough

Intent. On ROCm, libtransformer_engine.so was being loaded with RTLD_GLOBAL, which promoted its transitive librocroller.so symbols into the global namespace and let them interpose with HIP runtime helpers — causing hipModuleLoad to abort with free(): invalid size when TE was imported before MORI's shmem init (AIMORI-12). The PR switches the core library to RTLD_LOCAL and links the JAX/PyTorch framework extensions against it explicitly via ELF DT_NEEDED, so symbol resolution flows through the NEEDED graph instead of through global visibility.

Key changes.

  • Load the TE core C library with RTLD_LOCAL instead of RTLD_GLOBAL in transformer_engine/common/__init__.py:421.
  • Add transformer_engine to the link line of the JAX Pybind11 extension and resolve a library_dirs entry from a previously-installed core lib for incremental builds — build_tools/jax.py:139-158.
  • Same wiring for the PyTorch extension — build_tools/pytorch.py:131-138.
  • In the shared CMake build runner, collect each CMakeExtension's install directory and prepend it to every non-CMake extension's library_dirs so a clean build can find the freshly-built libtransformer_engine.so (build_tools/build_ext.py:118-153).

Walkthrough.

transformer_engine/common/__init__.py — the one-line behavioral change. _load_core_library() now passes mode=ctypes.RTLD_LOCAL. The docstring spells out the rationale (rocroller interposition with HIP) so future maintainers do not revert it. With RTLD_LOCAL, framework extensions can no longer rely on the loader walking the global namespace to find TE core symbols — they must declare the dependency explicitly, which the build_tools changes do.
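
The loader change described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `_load_core_library()` body: the function names and the `path` argument are hypothetical, and only `ctypes.CDLL` with its `mode` parameter is standard-library API.

```python
import ctypes


def load_core_library_local(path):
    """Load a shared library without exporting its symbols process-wide.

    `path` is a hypothetical argument. Passing None opens the running
    process itself (POSIX only), which is handy for a quick smoke test.
    """
    # RTLD_LOCAL keeps the library's symbols (and those of its dependencies)
    # out of the global lookup scope, so they cannot interpose on symbols
    # that libraries loaded later resolve.
    return ctypes.CDLL(path, mode=ctypes.RTLD_LOCAL)


def load_core_library_global(path):
    """The previous behavior, shown only for contrast."""
    # RTLD_GLOBAL makes every exported symbol visible to later loads.
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
```

With the local variant, anything that needs the core library's symbols has to reach them through its own link-time dependency rather than through the global scope.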

build_tools/build_ext.py: get_build_ext.run() is split into two phases (CMake extensions, then everything else). The new code records each CMake extension's install_dir in cmake_install_dirs, and after CMake extensions are pulled out of self.extensions it appends every recorded directory into each remaining extension's library_dirs (dedup-guarded). This is necessary because the framework extensions now name transformer_engine in libraries, so the linker must be able to find libtransformer_engine.so on a clean build where the core lib was just produced in the same pip install invocation.
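
The directory injection in this paragraph could look something like the sketch below. `FakeExtension` and `add_cmake_install_dirs` are hypothetical names standing in for setuptools extension objects and the logic inside `run()`. The sketch appends, as this paragraph describes; whether the real code appends or prepends is not confirmed here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeExtension:
    """Hypothetical stand-in for a setuptools Extension."""
    name: str
    library_dirs: List[str] = field(default_factory=list)


def add_cmake_install_dirs(extensions, cmake_install_dirs):
    """Make each remaining extension search the CMake install directories."""
    for ext in extensions:
        for install_dir in cmake_install_dirs:
            # Dedup guard: skip directories the extension already lists.
            if install_dir not in ext.library_dirs:
                ext.library_dirs.append(install_dir)
    return extensions
```

Calling the helper twice with the same directories leaves `library_dirs` unchanged after the first call, which is the point of the guard.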

build_tools/jax.py and build_tools/pytorch.py — symmetric changes. Each builds a libraries list that now includes "transformer_engine" (jax also keeps "nccl" on the CUDA path). Each also tries to import _get_shared_object_file("core") and append its parent to library_dirs inside a try/except (ImportError, FileNotFoundError). The comment explains the redundancy: build_ext.py covers the clean-build path; the import-based lookup covers incremental builds and standalone tooling where the CMake step does not run beforehand.
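
A rough sketch of the symmetric link settings described above, under stated assumptions: `framework_link_settings` and its `locate_core_lib_dir` callback are hypothetical names, and the real build scripts may assemble these values differently. The sketch only shows the optional lookup guarded by `except (ImportError, FileNotFoundError)`.

```python
def framework_link_settings(locate_core_lib_dir, extra_libraries=()):
    """Build the libraries/library_dirs pair for a framework extension.

    `locate_core_lib_dir` is a hypothetical zero-argument callable that
    returns the directory of an installed core library or raises.
    """
    # The extension now names the core library explicitly on its link line.
    libraries = ["transformer_engine", *extra_libraries]
    library_dirs = []
    try:
        library_dirs.append(str(locate_core_lib_dir()))
    except (ImportError, FileNotFoundError):
        # No prior install found: rely on the directories that the build
        # runner injects on a clean build.
        pass
    return {"libraries": libraries, "library_dirs": library_dirs}
```

On the JAX CUDA path, `extra_libraries=["nccl"]` would mirror the list the walkthrough mentions.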

Testing. No tests added. This is a loader/link-line change; verification is implicit (the extensions must link, and the original free(): invalid size crash must no longer reproduce when importing TE before MORI shmem init).
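
Since verification is implicit, one lightweight check a reviewer might add is to confirm that a built extension actually carries the DT_NEEDED entry. The helper below is a sketch, not part of the PR: `needed_libraries` is a hypothetical name, and it assumes the text layout that binutils `readelf -d` prints for the dynamic section.

```python
import re

# Matches dynamic-section lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [libc.so.6]
_NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library:\s+\[(?P<name>[^\]]+)\]")


def needed_libraries(readelf_dynamic_output):
    """Return the DT_NEEDED library names found in `readelf -d` text output."""
    return [m.group("name") for m in _NEEDED_RE.finditer(readelf_dynamic_output)]
```

The input text could come from `subprocess.run(["readelf", "-d", path_to_extension], capture_output=True, text=True, check=True).stdout`, where `path_to_extension` is whichever extension `.so` the build produced.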

Notes for reviewers.

  • PR is marked [WIP]. The body uses an unchecked "Type of change" list, so the author has not classified it yet — effectively a bug fix + infra/build change.
  • Behavior under RTLD_LOCAL is a real semantic shift: any out-of-tree code that was relying on libtransformer_engine.so symbols being globally visible after import transformer_engine will now fail to resolve them. Reviewers should confirm no in-repo or known downstream code depends on that.
  • The _get_shared_object_file("core") fallback imports from transformer_engine.common at build time, which only succeeds if a prior install exists; the except clause makes that silently optional. On a truly fresh clean build the build_ext.py injection is doing the real work.
  • CUDA/NVIDIA path: libtransformer_engine.so will now also be loaded RTLD_LOCAL there. The PR description frames this purely as a ROCm fix, but the change is unconditional — worth confirming the CUDA extensions still resolve all needed symbols through the NEEDED edge alone.
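
To check the global-visibility concern raised in these notes, a reviewer could probe the process-wide symbol scope directly. This is a sketch under assumptions: `is_globally_visible` is a hypothetical helper, it works on POSIX platforms only, and the symbol names to probe would have to come from the core library's actual exports.

```python
import ctypes


def is_globally_visible(symbol_name):
    """Report whether `symbol_name` resolves through the process-wide scope."""
    # CDLL(None) wraps dlopen(NULL): lookups search the main program, the
    # libraries loaded at startup, and libraries opened with RTLD_GLOBAL.
    process_scope = ctypes.CDLL(None)
    try:
        getattr(process_scope, symbol_name)
    except AttributeError:
        return False
    return True
```

After `import transformer_engine` with the RTLD_LOCAL change, a core-library symbol would be expected to report `False` here unless something else loaded the library globally.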

Generated by Claude. To request a code review, comment /claude review.

Replace `from transformer_engine.common import _get_shared_object_file`
in `build_tools/{pytorch,jax}.py` with a new `installed_te_core_lib_dir`
helper that locates an already-installed libtransformer_engine.so via
sysconfig. The previous import eagerly loaded framework extensions and
asserted they were installed, which broke builds of the framework
extension itself (e.g. `transformer_engine_rocm_torch.tar.gz`).
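
The idea behind this commit, locating an installed core library through sysconfig without importing the package, might be sketched as below. This is not the PR's `installed_te_core_lib_dir`: the function name, parameters, and the assumed directory layout are all hypothetical.

```python
import sysconfig
from pathlib import Path


def find_installed_core_lib_dir(search_roots=None,
                                lib_name="libtransformer_engine.so"):
    """Return the directory holding an installed core library, or None."""
    if search_roots is None:
        paths = sysconfig.get_paths()
        search_roots = [paths["platlib"], paths["purelib"]]
    for root in search_roots:
        root = Path(root)
        if not root.is_dir():
            continue
        # Assumed layout: the library sits directly in site-packages or one
        # package level below it.
        for candidate in (root / lib_name, *sorted(root.glob(f"*/{lib_name}"))):
            if candidate.is_file():
                return candidate.parent
    return None
```

Because nothing from `transformer_engine` is imported, the lookup cannot trigger loading of framework extensions, which is the failure mode this commit message describes.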