Conversation
…ipe state
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Greptile Summary: This PR introduces QuantizerRole for fine-grained quantization control.
Confidence Score: 5/5
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[TE Module/Op] -->|emits| B[QuantizerRole]
B -->|module_type<br/>tensor_type<br/>name| C{CustomRecipe<br/>qfactory}
C -->|dispatches based<br/>on role fields| D[Quantizer Instance]
subgraph "QuantizerRole Fields"
B1[module_type:<br/>linear, grouped_linear, dpa]
B2[tensor_type:<br/>input, weight, grad_output]
B3[name:<br/>qkv, proj, fc1, fc2]
end
subgraph "Module Examples"
M1[Linear] -->|get_quantizer_roles| B
M2[GroupedLinear] -->|get_quantizer_roles| B
M3[LayerNormMLP] -->|get_quantizer_roles| B
M4[DotProductAttention] -->|get_quantizer_roles| B
end
subgraph "Quantizer Factories"
C -->|role-based dispatch| F1[NVFP4Quantizer]
C -->|role-based dispatch| F2[MXFP8Quantizer]
C -->|role-based dispatch| F3[Float8CurrentScalingQuantizer]
C -->|role-based dispatch| F4[Float8BlockQuantizer]
end
style B fill:#e1f5ff
style C fill:#fff4e6
style D fill:#e8f5e9
Last reviewed commit: 343f653
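The dispatch path in the flowchart can be sketched as follows. This is a minimal stand-in, not TE's actual API: `QuantizerRole` is reduced to a plain dataclass, and the returned strings merely label the quantizer classes named in the flowchart.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QuantizerRole:
    """Minimal stand-in for the role object a TE module/op emits."""
    module_type: str   # e.g. "linear", "grouped_linear", "dpa"
    tensor_type: str   # e.g. "input", "weight", "grad_output"
    name: str = ""     # e.g. "qkv", "proj", "fc1", "fc2"

def qfactory(role: QuantizerRole) -> str:
    """Role-based dispatch: pick a quantizer label from the role's fields."""
    if role.module_type == "dpa":
        return "Float8CurrentScalingQuantizer"
    if role.tensor_type == "weight":
        return "NVFP4Quantizer"
    return "MXFP8Quantizer"

print(qfactory(QuantizerRole("linear", "weight", "fc1")))  # NVFP4Quantizer
print(qfactory(QuantizerRole("dpa", "input", "qkv")))      # Float8CurrentScalingQuantizer
```

The split between role emission (modules) and dispatch (the factory) is what lets one recipe mix formats per module and per tensor.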
timmoon10 left a comment:
Overall this design is quite clean and generalizable.
transformer_engine/pytorch/custom_recipes/quantization_nvfp4.py (outdated)
    base = [
        QuantizerRole(module_type="linear", tensor_type="input", name=name),
        QuantizerRole(module_type="linear", tensor_type="weight", name=name),
        QuantizerRole(module_type="linear", tensor_type="output", name=name),
    ]
else:
    base = [
        QuantizerRole(module_type="linear", tensor_type="grad_output", name=name),
        QuantizerRole(module_type="linear", tensor_type="grad_input", name=name),
    ]
"output" and "grad_input" roles don't make sense. In reality, we are implicitly assuming that the tensor will be consumed by another linear-like layer.
Suggested change:

      base = [
          QuantizerRole(module_type="linear", tensor_type="input", name=name),
          QuantizerRole(module_type="linear", tensor_type="weight", name=name),
  -       QuantizerRole(module_type="linear", tensor_type="output", name=name),
  +       QuantizerRole(module_type="linear", tensor_type="input", name=name),
      ]
  else:
      base = [
          QuantizerRole(module_type="linear", tensor_type="grad_output", name=name),
  -       QuantizerRole(module_type="linear", tensor_type="grad_input", name=name),
  +       QuantizerRole(module_type="linear", tensor_type="grad_output", name=name),
      ]
Alternatively, if we want to use the output in FP8 DPA, the right role would be module_type="dpa" and tensor_type="input". We should probably make this configurable. I kind of like that this design exposes the hidden assumptions we've been making.
I agree about the "output" and "grad_input" roles. Setting the roles for those slots to None (the safest default) and making them configurable. Also configured it in MHA.
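The agreed direction can be sketched like this. The function name `get_quantizer_roles` comes from the PR, but its signature and the `quantize_output` flag here are illustrative assumptions: the "output" and "grad_input" slots default to None, and quantizing the output is opt-in and expressed as the consumer's role (dpa input).

```python
from typing import List, Optional

def get_quantizer_roles(
    name: str, forward: bool, quantize_output: bool = False
) -> List[Optional[dict]]:
    """Roles for a linear layer; None means 'do not quantize this slot'."""
    if forward:
        roles = [
            {"module_type": "linear", "tensor_type": "input", "name": name},
            {"module_type": "linear", "tensor_type": "weight", "name": name},
            # Output slot is None by default: quantizing it implicitly assumes
            # a consumer (e.g. FP8 DPA), so it is opt-in, phrased as the
            # consumer's own role rather than a "linear output" role.
            None,
        ]
        if quantize_output:
            roles[2] = {"module_type": "dpa", "tensor_type": "input", "name": name}
        return roles
    return [
        {"module_type": "linear", "tensor_type": "grad_output", "name": name},
        None,  # grad_input slot: not quantized by default
    ]
```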
tests/pytorch/test_custom_recipe.py (outdated)
assert counts["input"] == 1
assert counts["weight"] == 1
assert counts["output"] == 1
assert counts["grad_output"] == 1
assert counts["grad_input"] == 1
Suggested change:

  - assert counts["input"] == 1
  - assert counts["weight"] == 1
  - assert counts["output"] == 1
  - assert counts["grad_output"] == 1
  - assert counts["grad_input"] == 1
  + assert counts["input"] == 2
  + assert counts["weight"] == 1
  + assert counts["output"] == 0
  + assert counts["grad_output"] == 2
  + assert counts["grad_input"] == 0
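For context, the suggested counts follow from tallying tensor_type over the roles a linear layer emits once the output slot is re-expressed as a consumer "input" and grad_input as "grad_output". The role tuples below are stand-in data, not the test's actual fixtures.

```python
from collections import Counter

# Stand-in roles from a linear layer: forward emits input, weight, and the
# former "output" slot as a second "input"; backward emits grad_output twice
# (the former "grad_input" slot is also a grad_output for the producing GEMM).
roles = [
    ("linear", "input"), ("linear", "weight"), ("linear", "input"),
    ("linear", "grad_output"), ("linear", "grad_output"),
]
counts = Counter(tensor_type for _, tensor_type in roles)
assert counts["input"] == 2
assert counts["weight"] == 1
assert counts["grad_output"] == 2
assert counts["output"] == 0 and counts["grad_input"] == 0
```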
def is_gemm(self) -> bool:
    """Whether this role belongs to a GEMM-based module."""
    return self.module_type in self.GEMM_MODULE_TYPES
I think this is baking in assumptions about which formats are similar (our recent experience with grouped tensors makes me wonder whether the requirements for "linear" and "grouped_linear" will diverge in the future), and it's also not giving us that much convenience.
Suggested change (remove):

  - def is_gemm(self) -> bool:
  -     """Whether this role belongs to a GEMM-based module."""
  -     return self.module_type in self.GEMM_MODULE_TYPES
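A sketch of the alternative the comment implies: drop the is_gemm() catch-all and let factories branch on module_type explicitly, so "linear" and "grouped_linear" can diverge later without touching a shared helper. Function name and quantizer labels are illustrative, not TE's actual classes.

```python
def pick_quantizer(module_type: str) -> str:
    """Explicit per-module dispatch instead of an is_gemm() catch-all."""
    if module_type == "linear":
        return "NVFP4Quantizer"
    if module_type == "grouped_linear":
        # Grouped tensors may grow different requirements than plain linear,
        # so they get their own branch rather than sharing a GEMM bucket.
        return "MXFP8Quantizer"
    return "Float8CurrentScalingQuantizer"  # e.g. "dpa"
```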
Description

Introducing QuantizerRole: an API that allows going as fine-grained as "set this LayerNormLinear in this transformer layer to be less aggressively quantized" (a per-module/per-tensor quantization control mechanism). See test_custom_recipe.py::test_custom_recipe_quantization_targets(). The quantizer factory uses roles to dispatch according to its needs.

Each TE module/op emits a list of QuantizerRole:
- Linear, LayerNormLinear, and LayerNormMLP emit module_type="linear" with tensor_type in {"input", "weight", "grad_output"}.
- GroupedLinear emits module_type="grouped_linear".
- CustomRecipe accepts a qfactory callable that receives a QuantizerRole and returns a quantizer.
- Factories can be composed, e.g. dispatch (optionally to different sub-factories) based on module_type (dpa vs linear), then refine based on tensor_type.
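The composed-factory pattern described in the PR can be sketched as follows. This is a hedged illustration: the sub-factory names are invented here, and the returned strings only label quantizer classes, they are not TE's actual constructors.

```python
def linear_factory(tensor_type: str) -> str:
    """Sub-factory for linear-like modules; refines on tensor_type."""
    return "NVFP4Quantizer" if tensor_type == "weight" else "MXFP8Quantizer"

def dpa_factory(tensor_type: str) -> str:
    """Sub-factory for dot-product attention modules."""
    return "Float8CurrentScalingQuantizer"

def qfactory(role: dict) -> str:
    """Top-level dispatch on module_type, then refine inside the sub-factory."""
    sub = dpa_factory if role["module_type"] == "dpa" else linear_factory
    return sub(role["tensor_type"])

print(qfactory({"module_type": "linear", "tensor_type": "weight"}))  # NVFP4Quantizer
```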