Conversation
…ipe state
Signed-off-by: Evgeny <etsykunov@nvidia.com>
for more information, see https://pre-commit.ci
Greptile Summary
This PR introduces an experimental API for custom, fine-grained quantization recipes. Key changes:
The implementation allows users to create custom quantization strategies like "use NVFP4 for all linear layers except attention projection layers, which should use MXFP8" by inspecting role fields in the factory function. The API is marked as experimental with appropriate warnings.
Confidence Score: 5/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User creates CustomRecipe with qfactory] --> B[autocast context with recipe]
    B --> C[Module forward pass begins]
    C --> D[Module emits QuantizerRole objects]
    D --> E{CustomRecipe?}
    E -->|Yes| F[Call qfactory for each role]
    E -->|No| G[Use built-in recipe state]
    F --> H[QuantizerRole inspection]
    H --> I{Dispatch logic}
    I -->|module_type='linear'| J[Return NVFP4Quantizer]
    I -->|module_type='grouped_linear'| K[Return MXFP8Quantizer]
    I -->|tensor_type='grad_output'| L[Return E5M2 quantizer]
    I -->|Other roles| M[Return default quantizer]
    J --> N[Quantizer used for tensor operations]
    K --> N
    L --> N
    M --> N
    G --> N
    N --> O[Forward/backward computation]
    style A fill:#e1f5ff
    style F fill:#fff4e1
    style H fill:#ffe1f5
    style N fill:#e1ffe1
```
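Sketched in Python, the dispatch in the flowchart could look roughly like the following. This is a hedged illustration, not code from the PR: `QuantizerRole` here is a minimal local stand-in for Transformer Engine's class (same `module_type`/`tensor_type`/`name` fields described in this PR), and plain strings stand in for the actual NVFP4/MXFP8/E5M2 quantizer objects. The gradient check is placed first on the assumption that `grad_output` roles take precedence over the module-type branches.

```python
from dataclasses import dataclass


@dataclass
class QuantizerRole:
    """Local stand-in for TE's QuantizerRole (illustration only)."""

    module_type: str
    tensor_type: str
    name: str = ""


def qfactory(role: QuantizerRole) -> str:
    """Dispatch roles to quantizers, mirroring the flowchart.

    Returned strings stand in for real quantizer objects.
    """
    # Assumption: gradient tensors are checked first so they always get
    # the wider-dynamic-range format regardless of module type.
    if role.tensor_type == "grad_output":
        return "e5m2"
    if role.module_type == "linear":
        return "nvfp4"
    if role.module_type == "grouped_linear":
        return "mxfp8"
    return "default"


print(qfactory(QuantizerRole("linear", "weight")))       # nvfp4
print(qfactory(QuantizerRole("linear", "grad_output")))  # e5m2
```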
Last reviewed commit: 41656ab
Signed-off-by: Evgeny <etsykunov@nvidia.com>
timmoon10 left a comment:
Overall this design is quite clean and generalizable.
transformer_engine/pytorch/custom_recipes/quantization_nvfp4.py
```python
    base = [
        QuantizerRole(module_type="linear", tensor_type="input", name=name),
        QuantizerRole(module_type="linear", tensor_type="weight", name=name),
        QuantizerRole(module_type="linear", tensor_type="output", name=name),
    ]
else:
    base = [
        QuantizerRole(module_type="linear", tensor_type="grad_output", name=name),
        QuantizerRole(module_type="linear", tensor_type="grad_input", name=name),
    ]
```
"output" and "grad_input" roles don't make sense. In reality, we are implicitly assuming that the tensor will be consumed by another linear-like layer.
Suggested change:

```diff
     base = [
         QuantizerRole(module_type="linear", tensor_type="input", name=name),
         QuantizerRole(module_type="linear", tensor_type="weight", name=name),
-        QuantizerRole(module_type="linear", tensor_type="output", name=name),
+        QuantizerRole(module_type="linear", tensor_type="input", name=name),
     ]
 else:
     base = [
         QuantizerRole(module_type="linear", tensor_type="grad_output", name=name),
-        QuantizerRole(module_type="linear", tensor_type="grad_input", name=name),
+        QuantizerRole(module_type="linear", tensor_type="grad_output", name=name),
     ]
```
Alternatively, if we want to use the output in FP8 DPA, the right role would be module_type="dpa" and tensor_type="input". We should probably make this configurable. I kind of like that this design is exposing the hidden assumptions we've been making.
I agree about the "output" and "grad_input" roles. I am setting the roles for those slots to None (the safest option) and enabling configuration. Also configured it in MHA.
tests/pytorch/test_custom_recipe.py
```python
assert counts["input"] == 1
assert counts["weight"] == 1
assert counts["output"] == 1
assert counts["grad_output"] == 1
assert counts["grad_input"] == 1
```
Suggested change:

```diff
-assert counts["input"] == 1
-assert counts["weight"] == 1
-assert counts["output"] == 1
-assert counts["grad_output"] == 1
-assert counts["grad_input"] == 1
+assert counts["input"] == 2
+assert counts["weight"] == 1
+assert counts["output"] == 0
+assert counts["grad_output"] == 2
+assert counts["grad_input"] == 0
```
Signed-off-by: Evgeny Tsykunov <etsykunov@nvidia.com>
Signed-off-by: Evgeny <etsykunov@nvidia.com>
```python
def is_gemm(self) -> bool:
    """Whether this role belongs to a GEMM-based module."""
    return self.module_type in self.GEMM_MODULE_TYPES
```
I think this is baking in assumptions about what formats are similar (our recent experiences with grouped tensors make me wonder if the requirements for "linear" and "grouped_linear" will diverge in the future), and it's also not giving us that much convenience.
Suggested change:

```diff
-def is_gemm(self) -> bool:
-    """Whether this role belongs to a GEMM-based module."""
-    return self.module_type in self.GEMM_MODULE_TYPES
```
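If the helper is dropped as suggested, one way to keep call sites readable without baking the grouping into `QuantizerRole` itself is an explicit membership check where the grouping is actually needed. A sketch under stated assumptions: `QuantizerRole` is a local stand-in, and `pick_quantizer` is a hypothetical call site, not code from the PR.

```python
from dataclasses import dataclass


@dataclass
class QuantizerRole:
    """Local stand-in for TE's QuantizerRole (illustration only)."""

    module_type: str
    tensor_type: str


def pick_quantizer(role: QuantizerRole) -> str:
    # The caller decides which module types it treats alike, instead of
    # QuantizerRole hard-coding a GEMM_MODULE_TYPES set. If "linear" and
    # "grouped_linear" requirements diverge later, only this site changes.
    if role.module_type in ("linear", "grouped_linear"):
        return "gemm-path"
    return "other-path"


print(pick_quantizer(QuantizerRole("grouped_linear", "weight")))  # gemm-path
```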
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: Evgeny <etsykunov@gmail.com>
Description
Introducing `QuantizerRole`

This is an API that allows going down to "set this `LayerNormLinear` in this transformer layer to be less aggressively quantized" (a fine-grained, per-module/per-tensor quantization control mechanism). The quantizer factory uses roles to dispatch according to its needs.

- Each TE module/op emits a list of `QuantizerRole`:
  - `Linear`, `LayerNormLinear`, and `LayerNormMLP` emit `module_type="linear"` with `tensor_type` in `{"input", "weight", "grad_output"}`.
  - `GroupedLinear` emits `module_type="grouped_linear"`.
- `CustomRecipe` accepts a `qfactory` callable that receives a `QuantizerRole` and returns a quantizer.
- Factories can be composed, e.g., dispatch to different sub-factories based on `module_type` (`dpa` vs `linear`) and then refine based on `tensor_type`.

Type of change
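The composed, role-based dispatch described in this PR's description could be sketched as below. This is a hedged illustration, not the PR's code: `QuantizerRole` is a local stand-in for Transformer Engine's class, the sub-factory names are hypothetical, and strings stand in for real quantizer objects.

```python
from dataclasses import dataclass


@dataclass
class QuantizerRole:
    """Local stand-in for TE's QuantizerRole (illustration only)."""

    module_type: str
    tensor_type: str
    name: str = ""


def linear_factory(role: QuantizerRole) -> str:
    # Refine on tensor_type within linear-like modules.
    return "e5m2" if role.tensor_type == "grad_output" else "nvfp4"


def dpa_factory(role: QuantizerRole) -> str:
    return "mxfp8"


# Top-level dispatch on module_type, then each sub-factory refines
# on tensor_type.
SUB_FACTORIES = {"linear": linear_factory, "dpa": dpa_factory}


def qfactory(role: QuantizerRole):
    sub = SUB_FACTORIES.get(role.module_type)
    # Assumption: returning None falls back to default (unquantized) behavior.
    return sub(role) if sub is not None else None


print(qfactory(QuantizerRole("linear", "grad_output", "fc1")))  # e5m2
```

With the real API, a factory like this would presumably be passed as `CustomRecipe(qfactory=...)` and used under the autocast context, as in the flowchart above.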
Changes
Please list the changes introduced in this PR:
Checklist: