[Draft] Newton-Schulz via cuSOLVERMp by vcherepanov-nv · Pull Request #2706 · NVIDIA/TransformerEngine

vcherepanov-nv · 2026-02-25T23:37:21Z

Description

Adds an API to call the Newton-Schulz method on a distributed tensor.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Integrate cuSOLVERMp as a new dependency
Add corresponding API to TE/common
Add PyTorch binding and tests

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Add a new distributed Newton-Schulz inverse square root API to Transformer Engine's common C library. This wraps the cusolverMpNewtonSchulz library function, following the same pattern as the existing cuBLASMp integration for comm_gemm.

New files:
- newton_schulz.h: Public C API header with context management and computation functions
- newton_schulz/newton_schulz.cpp: Implementation with RAII wrappers for cuSolverMp handles

Build integration:
- New NVTE_WITH_CUSOLVERMP CMake option and CUSOLVERMP_HOME env var
- NVTE_CHECK_CUSOLVERMP error checking macro in logging.h
- Conditional compilation guarded by NVTE_WITH_CUSOLVERMP

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Add PyTorch-level bindings for the cuSolverMp Newton-Schulz inverse square root API introduced in the previous commit.

New files:
- pytorch/csrc/extensions/newton_schulz.cpp: C++ extension wrapping the C API with PyTorch tensor support
- pytorch/newton_schulz.py: Python wrapper that extracts NCCL communicator from torch.distributed ProcessGroup
- tests/pytorch/distributed/test_newton_schulz.py: pytest launcher
- tests/pytorch/distributed/run_newton_schulz.py: distributed test worker with reference implementation for numerical validation

Modified files:
- pytorch/csrc/extensions.h: Function declarations
- pytorch/csrc/extensions/pybind.cpp: pybind11 registrations
- pytorch/__init__.py: Public API export

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Fix API mismatches discovered during compilation:
- cusolverMpCreate takes (handle*, deviceId, stream), not (handle*, stream)
- cusolverMpCreateDeviceGrid takes handle as first arg with different parameter order
- Use cusolverMpGridMapping_t (not cusolverMpGridLayout_t) and CUSOLVERMP_GRID_MAPPING_COL_MAJOR
- cusolverMpCreateMatrixDesc has different parameter order: (desc*, grid, dtype, M, N, MB, NB, RSRC, CSRC, LLD)
- cusolverMpNewtonSchulzDescriptorCreate takes only (nsDesc*) with no iteration/coefficient args
- No cusolverMpStreamSet exists; create handle per-call with user stream
- cusolverMpNewtonSchulz requires computeType and info parameters
- Switch from generic template RAII to explicit deleter structs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

…build

Add NVTE_WITH_CUSOLVERMP compiler define and cusolverMp include/library paths to the PyTorch C++ extension build, following the same pattern as NVTE_UB_WITH_MPI and NVTE_ENABLE_NVSHMEM. Without this, the #ifdef NVTE_WITH_CUSOLVERMP guards in the PyTorch extension code would never be active since the define was only set as PRIVATE in the CMake build for the common library.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Two fixes:
- Use ProcessGroupNCCL._comm_ptr() to extract the raw NCCL communicator pointer instead of the non-existent get_nccl_comm() method
- Pass global matrix dimensions (m, n) from Python to C++ instead of using local tensor dimensions, which would produce incorrect ScaLAPACK block sizes in the distributed computation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

cuSolverMp handle and grid creation are expensive operations. Move them from per-call creation in nvte_newton_schulz into the NVTECusolverMpCtx, which is their natural home, since the context exists to encapsulate the grid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

cuSolverMp cannot work with the default CUDA stream. Create a dedicated stream inside nvte_cusolvermp_ctx_create and remove the stream parameter from both C API functions since the context now owns its stream.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

The internal dedicated stream was reading the input tensor before the caller's stream had finished producing it, resulting in all-zero output. Add event-based synchronisation: the internal stream waits for the caller's input to be ready, and the caller's stream waits for the output to be written. Replaces the blocking cudaStreamSynchronize.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

cuSolverMp is asynchronous and uses the host workspace during multi-GPU execution. The event-based output sync did not block the host, so the local workspace_host vector was destroyed while the GPU was still reading from it. Restore cudaStreamSynchronize to ensure the host workspace remains valid for the full duration of the operation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Avoid creating and destroying a cudaEvent_t on every nvte_newton_schulz call by making it a persistent member of NVTECusolverMpCtx, matching the existing pattern for the stream.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Replace single event with in_ready and out_ready events. After the cuSolverMp call, record out_ready on the internal stream and make the caller's stream wait on it, ensuring the output tensor is ready before the caller uses it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Replace reference-comparison test with a direct arithmetic check: if X is the inverse square root of A, then X @ A @ X must equal the identity matrix. This is more robust and removes the need for a separate reference implementation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-02-25T23:42:44Z

Greptile Summary

This PR integrates cuSolverMp to provide distributed Newton-Schulz matrix orthogonalization. The implementation adds ~650 lines across build configuration, C++ bindings, Python API, and tests.

Key changes:

Adds cuSolverMp as optional dependency with auto-detection from common library symbols
Implements C API wrapper in transformer_engine/common/newton_schulz/
Provides PyTorch binding with context management and stream synchronization
Includes distributed tests for float32/bfloat16 with 5 and 15 iterations

Major concerns from prior review rounds:

Documentation and tests claim "inverse square root" but implementation computes orthogonal matrix (polar decomposition)
Uses private PyTorch APIs (_get_backend, _comm_ptr) subject to breakage
Creates/destroys heavyweight context on every call instead of caching
Test verification checks wrong property (orthogonality vs inverse square root)
Missing validation for even row distribution across ranks
num_coefficients parameter validated but never used by cuSolverMp
Synchronous cudaMalloc/cudaFree on hot path cause device synchronization

Additional issue found:

Missing dtype validation despite documented float32/bfloat16 requirement

Confidence Score: 2/5

Multiple functional and performance issues need resolution before merge
Score reflects accumulated issues from multiple review rounds: incorrect test verification, misleading documentation (inverse square root vs orthogonalization), performance problems (per-call context creation, synchronous CUDA allocations), API stability risks (private PyTorch APIs), and missing validation (dtype, tensor distribution). While the build integration is solid, the core functionality has correctness and design issues that must be addressed.
Primary attention needed on transformer_engine/pytorch/newton_schulz.py (API design, validation), tests/pytorch/distributed/run_newton_schulz.py (incorrect verification), and transformer_engine/common/newton_schulz/newton_schulz.cpp (performance optimizations)

Important Files Changed

Filename	Overview
transformer_engine/common/newton_schulz/newton_schulz.cpp	Core C++ implementation using cuSolverMp. Previous reviews identified synchronous cudaMalloc on hot path and unused `num_coefficients` parameter (only validated, never passed to cuSolverMp)
transformer_engine/pytorch/newton_schulz.py	Python API with several issues: uses private PyTorch APIs, creates/destroys context per call (performance issue), missing dtype validation, and documents "inverse square root" but actually computes orthogonal matrix
tests/pytorch/distributed/run_newton_schulz.py	Test worker with incorrect verification - checks orthogonality (X @ X.t() ≈ I) instead of inverse square root (X @ A @ X ≈ I), and coefficient count mismatch with API defaults
transformer_engine/common/include/transformer_engine/newton_schulz.h	Header file with misleading documentation claiming "inverse square root" when it actually computes orthogonal matrix (polar decomposition)
transformer_engine/pytorch/__init__.py	Unconditionally imports `newton_schulz` even when feature not built, exposing it in public API despite potential runtime errors

Sequence Diagram

sequenceDiagram
    participant User as Python User
    participant PyAPI as newton_schulz.py
    participant PyExt as C++ Extension
    participant Common as newton_schulz.cpp
    participant cuSolver as cuSolverMp Library
    participant NCCL as NCCL Communicator

    User->>PyAPI: newton_schulz(x, group, iterations, coeffs)
    PyAPI->>PyAPI: Extract NCCL comm from ProcessGroup
    PyAPI->>PyAPI: Calculate global dims (m, n)
    PyAPI->>PyExt: cusolvermp_ctx_create(comm, nranks, rank)
    PyExt->>Common: nvte_cusolvermp_ctx_create()
    Common->>Common: Create CUDA stream & events
    Common->>cuSolver: cusolverMpCreate()
    Common->>cuSolver: cusolverMpCreateDeviceGrid()
    Common-->>PyExt: Return context pointer
    PyExt-->>PyAPI: Return context handle
    
    PyAPI->>PyExt: newton_schulz(ctx, m, n, x, iters, coeffs)
    PyExt->>Common: nvte_newton_schulz()
    Common->>Common: Stream synchronization (events)
    Common->>cuSolver: cusolverMpNewtonSchulz_bufferSize()
    cuSolver-->>Common: Workspace size
    Common->>Common: Allocate/grow workspace (cudaMalloc)
    Common->>cuSolver: cusolverMpNewtonSchulz()
    cuSolver->>NCCL: Distributed matrix operations
    NCCL-->>cuSolver: Sync results
    cuSolver-->>Common: Modified matrix (in-place)
    Common->>Common: Stream synchronization (events)
    Common-->>PyExt: Success
    PyExt-->>PyAPI: Success
    
    PyAPI->>PyExt: cusolvermp_ctx_destroy(ctx)
    PyExt->>Common: nvte_cusolvermp_ctx_destroy()
    Common->>Common: Free workspace
    Common->>cuSolver: Destroy grid & handle
    Common->>Common: Destroy stream & events
    Common-->>PyExt: Done
    PyExt-->>PyAPI: Done
    PyAPI-->>User: Modified tensor x

Last reviewed commit: d3740fb

greptile-apps

15 files reviewed, 13 comments


greptile-apps · 2026-02-25T23:42:48Z

tests/pytorch/distributed/run_newton_schulz.py

+    # Check: if X = A^{-1/2}, then X @ A @ X should be the identity matrix
+    if rank == 0:
+        XXT = X @ X.t()
+        I = torch.eye(N, device=XXT.device, dtype=XXT.dtype)
+        max_diff = (XXT - I).abs().max().item()
+        print(f"Max |X @ X.t() - I|: {max_diff:.6e}", flush=True)


verification doesn't match the comment - if X = A^{-1/2}, the check should be X @ A @ X ≈ I, not X @ X.t() ≈ I. The current check verifies X is orthogonal, not that X is the inverse square root of A. Note that A_orig is created on line 76 but never used.

Suggested change

-    # Check: if X = A^{-1/2}, then X @ A @ X should be the identity matrix
-    if rank == 0:
-        XXT = X @ X.t()
-        I = torch.eye(N, device=XXT.device, dtype=XXT.dtype)
-        max_diff = (XXT - I).abs().max().item()
-        print(f"Max |X @ X.t() - I|: {max_diff:.6e}", flush=True)
+    # Check: if X = A^{-1/2}, then X @ A @ X should be the identity matrix
+    XAX = X @ A_orig @ X
+    I = torch.eye(N, device=XAX.device, dtype=XAX.dtype)
+    max_diff = (XAX - I).abs().max().item()
+    print(f"Max |X @ A @ X - I|: {max_diff:.6e}", flush=True)
+    if torch.allclose(XAX, I, atol=args.atol, rtol=args.rtol):

greptile-apps · 2026-02-25T23:42:49Z

transformer_engine/pytorch/newton_schulz.py

+    nccl_backend = group._get_backend(torch.device("cuda"))
+    return nccl_backend._comm_ptr()


uses private PyTorch APIs (_get_backend, _comm_ptr) that may change in future versions
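One way to limit the breakage risk is to confine the private-API access to a single helper that fails with a clear message. The sketch below assumes `_get_backend` and `_comm_ptr` behave as in the quoted snippet. The helper name `get_nccl_comm_ptr` and its error messages are hypothetical, and the device is passed in as an argument so the sketch does not need to import torch.

```python
def get_nccl_comm_ptr(group, device):
    """Return the raw NCCL communicator pointer for `group`.

    Hypothetical helper: `_get_backend` and `_comm_ptr` are the private
    PyTorch attributes named in this PR; everything else is illustrative.
    """
    get_backend = getattr(group, "_get_backend", None)
    if get_backend is None:
        raise RuntimeError(
            "ProcessGroup has no _get_backend(); this PyTorch version may be unsupported"
        )
    backend = get_backend(device)
    comm_ptr = getattr(backend, "_comm_ptr", None)
    if comm_ptr is None:
        raise RuntimeError(
            "Backend has no _comm_ptr(); cannot extract the NCCL communicator"
        )
    return comm_ptr()
```

In the PR's wrapper the call would presumably look like `get_nccl_comm_ptr(group, torch.device("cuda"))`, so a future rename in PyTorch surfaces as one readable error instead of an `AttributeError` deep in the call stack.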

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

greptile-apps · 2026-02-25T23:56:23Z

tests/pytorch/distributed/run_newton_schulz.py

+    quintic_coefficients = [
+        4.0848,
+        -6.8946,
+        2.9270,
+        3.9505,
+        -6.3029,
+        2.6377,
+        3.7418,
+        -5.5913,
+        2.3037,
+        2.8769,
+        -3.1427,
+        1.2046,
+        2.8366,
+        -3.0525,
+        1.2012,
+    ]
+    coefficients = (
+        quintic_coefficients if args.num_iterations == 5 else [1.5, -0.5, 0.0] * args.num_iterations
+    )


coefficients mismatch with API defaults - test uses 15 coefficients for 5 iterations, but newton_schulz.py defaults to 5 coefficients. This inconsistency means default API behavior isn't tested.

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-02-26T00:52:53Z

transformer_engine/common/include/transformer_engine/newton_schulz.h

+ *  \brief Functions for distributed Newton-Schulz inverse square root.
+ *
+ *  This API is a TE-native binding to the cuSolverMp library.
+ *  It computes an iterative Newton-Schulz inverse square root
+ *  approximation on a distributed matrix.


Documentation claims this computes "inverse square root" but the test validates orthogonality (X @ X.t() ≈ I), and commit dd1dd0b states "it approximates orthogonal matrix, not inverse square root". If this computes the polar decomposition (orthogonal factor), the documentation should be updated to reflect that. Inverse square root would satisfy X @ A @ X ≈ I, which is different from orthogonality.
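The difference between the two properties can be checked numerically on a small matrix. The following is a single-process, pure-Python illustration, not the cuSolverMp implementation. It assumes the common polynomial update X ← aX + b(XXᵀ)X + c(XXᵀ)²X with one (a, b, c) triple per iteration, which is consistent with the `num_iterations * 3` coefficient count in this PR, though the exact update cuSolverMp applies is an assumption. It uses the cubic triple (1.5, -0.5, 0.0) from the test script and pre-scales the input by its Frobenius norm. All helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def max_abs_diff_from_identity(m):
    return max(abs(m[i][j] - (1.0 if i == j else 0.0))
               for i in range(len(m)) for j in range(len(m[0])))

def newton_schulz_orthogonalize(a, coefficients):
    """Apply one (a, b, c) polynomial step per coefficient triple."""
    # Scale so every singular value is at most 1 (inside the convergence region).
    norm = math.sqrt(sum(v * v for row in a for v in row))
    x = [[v / norm for v in row] for row in a]
    for k in range(0, len(coefficients), 3):
        ca, cb, cc = coefficients[k:k + 3]
        g = matmul(x, transpose(x))  # X X^T
        gx = matmul(g, x)            # (X X^T) X
        ggx = matmul(g, gx)          # (X X^T)^2 X
        x = [[ca * x[i][j] + cb * gx[i][j] + cc * ggx[i][j]
              for j in range(len(x[0]))] for i in range(len(x))]
    return x

A = [[3.0, 1.0], [-1.0, 2.0]]
X = newton_schulz_orthogonalize(A, [1.5, -0.5, 0.0] * 20)

orth_err = max_abs_diff_from_identity(matmul(X, transpose(X)))
inv_sqrt_err = max_abs_diff_from_identity(matmul(matmul(X, A), X))
print(f"max |X X^T - I| = {orth_err:.3e}")
print(f"max |X A X - I| = {inv_sqrt_err:.3e}")
```

Under these assumptions the iterate converges to the orthogonal polar factor of A, so the first residual is close to zero while the second stays large. A test that passes on X @ X.t() ≈ I therefore says nothing about X being an inverse square root.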

greptile-apps · 2026-02-26T00:52:55Z

transformer_engine/pytorch/newton_schulz.py

+    m = x.size(0) * nranks  # rows are distributed across ranks
+    n = x.size(1)


Assumes rows are evenly distributed (m = x.size(0) * nranks) but doesn't validate this. If matrix size isn't divisible by nranks, the computed global size m will be incorrect, leading to wrong results from cuSOLVERMp. Consider adding validation:

Suggested change

-    m = x.size(0) * nranks  # rows are distributed across ranks
-    n = x.size(1)
+    # Global matrix dimensions
+    # Rows must be evenly distributed across ranks
+    local_rows = x.size(0)
+    m = local_rows * nranks
+    n = x.size(1)

Then add a validation check that all ranks have the same local_rows via dist.all_reduce.
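A minimal sketch of such a check follows. The reductions are injected as callables so the logic can run without a process group. In a real distributed run they would presumably wrap `torch.distributed.all_reduce` with `ReduceOp.MIN` and `ReduceOp.MAX` on a one-element tensor, which is an assumption about how the PR would wire it. The names `global_rows_if_even` and `simulate` are hypothetical.

```python
def global_rows_if_even(local_rows, nranks, reduce_min, reduce_max):
    """Return the global row count m, or raise if ranks disagree on local_rows.

    reduce_min / reduce_max stand in for MIN and MAX all-reduce collectives:
    each takes this rank's value and returns the reduction over all ranks.
    """
    lo = reduce_min(local_rows)
    hi = reduce_max(local_rows)
    if lo != hi:
        raise ValueError(
            "Rows must be evenly distributed across ranks: "
            f"local row counts range from {lo} to {hi}"
        )
    return local_rows * nranks

def simulate(rows_per_rank):
    """Single-process stand-in: run the check once per simulated rank."""
    reduce_min = lambda _local: min(rows_per_rank)
    reduce_max = lambda _local: max(rows_per_rank)
    return [
        global_rows_if_even(rows, len(rows_per_rank), reduce_min, reduce_max)
        for rows in rows_per_rank
    ]

print(simulate([4, 4, 4, 4]))
```

Because both reductions return the same values on every rank, all ranks raise together when the shards are uneven, which avoids one rank erroring out while the others block in a later collective.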

greptile-apps · 2026-02-26T00:52:56Z

transformer_engine/pytorch/newton_schulz.py

+    num_iterations: int = 5,
+    coefficients: Optional[List[float]] = None,
+) -> None:
+    """Compute Newton-Schulz inverse square root in-place on a distributed matrix.


Docstring says "inverse square root" but test checks orthogonality. Update to match actual behavior (see comment on header file).

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

greptile-apps · 2026-02-26T01:09:12Z

transformer_engine/common/newton_schulz/newton_schulz.cpp

+void nvte_newton_schulz(NVTECusolverMpCtx* ctx, int64_t m, int64_t n, NVTETensor x,
+                        int64_t num_iterations, const float* coefficients, int64_t num_coefficients,
+                        cudaStream_t caller_stream) {


num_coefficients parameter is unused

The num_coefficients parameter is accepted but never referenced in the function body. Neither cusolverMpNewtonSchulz_bufferSize (line 166) nor cusolverMpNewtonSchulz (line 183) receive this value. If cuSolverMp infers the count from num_iterations internally, then num_coefficients is dead code that should be removed from the API. If cuSolverMp actually needs it, then it should be passed to the cuSolverMp calls — otherwise the library may read out of bounds on the coefficients array.

greptile-apps · 2026-02-26T01:09:13Z

transformer_engine/pytorch/__init__.py

 from transformer_engine.pytorch import optimizers
 from transformer_engine.pytorch.export import onnx_export
 from transformer_engine.pytorch.cross_entropy import parallel_cross_entropy
+from transformer_engine.pytorch.newton_schulz import newton_schulz


Unconditional import of optional feature

newton_schulz is unconditionally imported and exported as part of the public API, even when TE is built without NVTE_WITH_CUSOLVERMP. While the function itself raises a runtime error when called, this exposes the symbol to all users and makes it appear as a supported feature in auto-complete and docs. Consider guarding this import behind a check (similar to how other optional features are handled), or at minimum adding a note in the docstring that the function requires NVTE_WITH_CUSOLVERMP=1 at build time.


greptile-apps · 2026-02-26T01:09:14Z

transformer_engine/common/newton_schulz/newton_schulz.cpp

+  // Allocate/grow device workspace
+  if (ctx->workspace_size < wrksp_size_device) {
+    if (ctx->workspace) {
+      NVTE_CHECK_CUDA(cudaFree(ctx->workspace));
+    }
+    NVTE_CHECK_CUDA(cudaMalloc(&ctx->workspace, wrksp_size_device));
+    ctx->workspace_size = wrksp_size_device;
+  }


Synchronous cudaFree/cudaMalloc on hot path

cudaFree followed by cudaMalloc inside nvte_newton_schulz will synchronize with the device each time the workspace needs to grow. Since the context is recreated on every call from newton_schulz.py (line 82-86 creates + destroys ctx each invocation), the workspace will never be reused across calls — the grow-only caching here is ineffective. Consider either:

Allowing callers to keep the context alive across calls, or

Using cudaMallocAsync/cudaFreeAsync on ctx->stream to avoid synchronous stalls.

greptile-apps · 2026-02-26T01:09:16Z

transformer_engine/pytorch/newton_schulz.py

+    ctx_ptr = tex.cusolvermp_ctx_create(nccl_comm_ptr, nranks, rank)
+    try:
+        tex.newton_schulz(ctx_ptr, m, n, x, num_iterations, coefficients)
+    finally:
+        tex.cusolvermp_ctx_destroy(ctx_ptr)


Context created/destroyed per call wastes resources

A new NVTECusolverMpCtx is created and destroyed on every invocation of newton_schulz. Context creation involves cudaStreamCreate, two cudaEventCreate calls, cusolverMpCreate, and cusolverMpCreateDeviceGrid — all of which are heavyweight operations. And since the context is destroyed afterward, the grow-only workspace caching in the C++ layer (lines 170-177 of newton_schulz.cpp) is never actually reused.

Consider caching the context (e.g., in a module-level dict keyed by (nccl_comm_ptr, nranks, rank)) and reusing it across calls, or exposing the context lifecycle to callers so they can amortize the cost when calling newton_schulz repeatedly in a training loop.
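The module-level cache suggested above could look roughly like this. It is a sketch under stated assumptions: `create` and `destroy` stand in for `tex.cusolvermp_ctx_create` and `tex.cusolvermp_ctx_destroy`, and the names `get_cached_ctx` and `clear_ctx_cache` are hypothetical.

```python
from typing import Any, Callable, Dict, Tuple

# Cache of live contexts, keyed by (nccl_comm_ptr, nranks, rank).
_ctx_cache: Dict[Tuple[int, int, int], Any] = {}

def get_cached_ctx(comm_ptr: int, nranks: int, rank: int,
                   create: Callable[[int, int, int], Any]) -> Any:
    """Return the context for this communicator, creating it on first use."""
    key = (comm_ptr, nranks, rank)
    if key not in _ctx_cache:
        _ctx_cache[key] = create(comm_ptr, nranks, rank)
    return _ctx_cache[key]

def clear_ctx_cache(destroy: Callable[[Any], None]) -> None:
    """Destroy every cached context, e.g. before the process group is torn down."""
    while _ctx_cache:
        _, ctx = _ctx_cache.popitem()
        destroy(ctx)
```

One caveat the sketch does not handle: a communicator pointer value can be reused after its process group is destroyed, so a real cache would need to be invalidated when the group goes away.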

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-02-26T01:33:17Z

transformer_engine/pytorch/newton_schulz.py

+    assert (
+        len(coefficients) == num_iterations * 3
+    ), f"Unexpected number of coefficients: {len(coefficients)} for {num_iterations} iterations"


use ValueError instead of assert for validation - assert can be disabled with Python's -O flag

Suggested change

-    assert (
-        len(coefficients) == num_iterations * 3
-    ), f"Unexpected number of coefficients: {len(coefficients)} for {num_iterations} iterations"
+    if len(coefficients) != num_iterations * 3:
+        raise ValueError(
+            f"Unexpected number of coefficients: {len(coefficients)} for {num_iterations} iterations"
+        )
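The length check can be folded into a small helper that also supplies a default when no coefficients are given. This is a sketch, not the PR's code: the actual defaults in newton_schulz.py are not shown here, so the cubic triple (1.5, -0.5, 0.0) from the test script is used as an assumed fallback, and `resolve_coefficients` is a hypothetical name.

```python
from typing import List, Optional, Sequence

# Assumed fallback: the cubic triple used by the test script in this PR.
CUBIC_TRIPLE = (1.5, -0.5, 0.0)

def resolve_coefficients(num_iterations: int,
                         coefficients: Optional[Sequence[float]] = None) -> List[float]:
    """Return a flat list holding three coefficients per iteration."""
    if num_iterations <= 0:
        raise ValueError(f"num_iterations must be positive, got {num_iterations}")
    if coefficients is None:
        return list(CUBIC_TRIPLE) * num_iterations
    if len(coefficients) != num_iterations * 3:
        raise ValueError(
            f"Unexpected number of coefficients: {len(coefficients)} "
            f"for {num_iterations} iterations"
        )
    return [float(c) for c in coefficients]

print(resolve_coefficients(2))
```

Unlike an `assert`, the `ValueError` still fires when Python runs with `-O`.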

greptile-apps · 2026-02-26T01:33:18Z

transformer_engine/pytorch/newton_schulz.py

+    if x.dim() != 2:
+        raise ValueError(f"Expected 2D tensor, got {x.dim()}D")
+    if not x.is_cuda:
+        raise ValueError("Input tensor must be on CUDA device")


missing contiguity check - C++ code uses data_ptr() which requires contiguous memory. Non-contiguous tensors will cause incorrect results.

Suggested change

-    if x.dim() != 2:
-        raise ValueError(f"Expected 2D tensor, got {x.dim()}D")
-    if not x.is_cuda:
-        raise ValueError("Input tensor must be on CUDA device")
+    if x.dim() != 2:
+        raise ValueError(f"Expected 2D tensor, got {x.dim()}D")
+    if not x.is_cuda:
+        raise ValueError("Input tensor must be on CUDA device")
+    if not x.is_contiguous():
+        raise ValueError("Input tensor must be contiguous")

Instead of requiring NVTE_WITH_CUSOLVERMP env var to be set for both the common library and PyTorch extension builds, inspect the already-built libtransformer_engine.so for exported symbols. This is more robust for incremental builds and CI environments where the env var may not be propagated to the extension build step. The PyTorch extension only calls nvte_* C API functions, so it does not need cusolverMp headers or libraries, only the compile definition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

When NVTE_WITH_CUSOLVERMP is not defined, omit the Newton-Schulz functions entirely from the pybind module instead of registering stubs that throw runtime errors. The Python wrapper checks for the attribute at call time and raises a clear error message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
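The call-time attribute check described in this commit can be sketched as follows. The wrapper name `call_newton_schulz` and the error text are hypothetical, and `types.SimpleNamespace` objects stand in for the compiled extension module built with and without the feature.

```python
import types

def call_newton_schulz(tex, *args):
    """Dispatch to tex.newton_schulz if the extension was built with it."""
    fn = getattr(tex, "newton_schulz", None)
    if fn is None:
        raise RuntimeError(
            "newton_schulz is unavailable: Transformer Engine was built "
            "without NVTE_WITH_CUSOLVERMP"
        )
    return fn(*args)

# Stand-ins for the extension module (illustrative only).
tex_without = types.SimpleNamespace()
tex_with = types.SimpleNamespace(newton_schulz=lambda *a: "ok")

print(call_newton_schulz(tex_with, 1, 2, 3))
```

Checking at call time keeps the module importable on builds without cuSolverMp while still producing an actionable message for anyone who calls the function.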

for more information, see https://pre-commit.ci

greptile-apps · 2026-02-27T03:15:45Z

transformer_engine/pytorch/newton_schulz.py

+    if x.dim() != 2:
+        raise ValueError(f"Expected 2D tensor, got {x.dim()}D")
+    if not x.is_cuda:
+        raise ValueError("Input tensor must be on CUDA device")


Missing dtype validation - docstring on line 36 states tensor must be float32 or bfloat16, but this isn't enforced. Passing unsupported dtypes leads to confusing errors from cuSolverMp.

Suggested change

-    if x.dim() != 2:
-        raise ValueError(f"Expected 2D tensor, got {x.dim()}D")
-    if not x.is_cuda:
-        raise ValueError("Input tensor must be on CUDA device")
+    if x.dim() != 2:
+        raise ValueError(f"Expected 2D tensor, got {x.dim()}D")
+    if not x.is_cuda:
+        raise ValueError("Input tensor must be on CUDA device")
+    if x.dtype not in (torch.float32, torch.bfloat16):
+        raise ValueError(f"Input tensor must be float32 or bfloat16, got {x.dtype}")

vcherepanov-nv and others added 19 commits February 8, 2026 22:38

Correct coefficients

e8c51f8

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

No stream synchronize

412445c

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Change test - it approximates orthogonal matrix, not inverse square root

dd1dd0b

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Generalize number of iterations in tests

85d33fb

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Remove extra info diag - everything should be in logs

a011231

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Add Newton-Schulz tests to the QA script

7c8a656

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

59e8aff

for more information, see https://pre-commit.ci

greptile-apps bot reviewed Feb 25, 2026


vcherepanov-nv added 2 commits February 25, 2026 15:47

Fix outdated comments

e433f06

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Remove unused variable

276b841

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

greptile-apps bot reviewed Feb 25, 2026


vcherepanov-nv and others added 2 commits February 26, 2026 00:45

Move magic numbers from tests to impl

7fad894

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

1e726ce

for more information, see https://pre-commit.ci

greptile-apps bot reviewed Feb 26, 2026


Fix outdated comments

fac55db

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

greptile-apps bot reviewed Feb 26, 2026


Check num_coefficients

0732fc2

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

8eb6028

for more information, see https://pre-commit.ci

greptile-apps bot reviewed Feb 26, 2026


vcherepanov-nv and others added 3 commits February 27, 2026 03:01

[pre-commit.ci] auto fixes from pre-commit.com hooks

d3740fb

for more information, see https://pre-commit.ci

greptile-apps bot reviewed Feb 27, 2026


