
[PyTorch] Introduce quantizer roles #2620

Open
negvet wants to merge 25 commits into NVIDIA:main from negvet:semantic_quantizer_roles

Conversation

@negvet
Collaborator

@negvet negvet commented Jan 23, 2026

Description

Introducing QuantizerRole

import dataclasses

@dataclasses.dataclass(frozen=True)
class QuantizerRole:
    module_type: str = ""   # e.g. "linear", "grouped_linear", "dpa"
    tensor_type: str = ""   # e.g. "input", "weight", "grad_output", "qkv", "s"
    name: str = ""          # instance name, e.g. "qkv", "proj", "fc1", "fc2"

This API makes it possible to express control as fine-grained as "set this LayerNormLinear in this transformer layer to be less aggressively quantized" (a per-module/per-tensor quantization control mechanism).

A quantizer factory uses these roles to dispatch quantizers according to its needs.

Each TE module/op emits a list of QuantizerRole objects:

  • Linear, LayerNormLinear, LayerNormMLP emit module_type="linear" with tensor_type in {"input", "weight", "grad_output"}.
  • GroupedLinear emits module_type="grouped_linear".

CustomRecipe accepts a qfactory callable that receives a QuantizerRole and returns a quantizer.

Factories can be composed: e.g., dispatch to different sub-factories based on module_type (dpa vs. linear), then refine based on tensor_type.
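A minimal sketch of such a composed factory, assuming the QuantizerRole dataclass from this PR; the quantizer classes below are hypothetical stand-ins for TE's real quantizers, not the actual API:

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class QuantizerRole:
    module_type: str = ""
    tensor_type: str = ""
    name: str = ""

# Hypothetical stand-in quantizer classes (placeholders, not TE classes).
class MXFP8Quantizer: ...
class FP8E4M3Quantizer: ...
class FP8E5M2Quantizer: ...

def dpa_factory(role: QuantizerRole):
    # In this sketch, all DPA tensors get MXFP8.
    return MXFP8Quantizer()

def linear_factory(role: QuantizerRole):
    # Refine on tensor_type: gradients get the wider-range E5M2 format.
    if role.tensor_type == "grad_output":
        return FP8E5M2Quantizer()
    return FP8E4M3Quantizer()

def qfactory(role: QuantizerRole):
    # Top-level dispatch on module_type, delegating to sub-factories.
    sub_factories = {"dpa": dpa_factory, "linear": linear_factory}
    return sub_factories[role.module_type](role)
```

Such a callable would then be handed to the recipe, e.g. CustomRecipe(qfactory=qfactory), and invoked once per emitted role.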

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

negvet and others added 4 commits January 23, 2026 15:14
…ipe state

Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: Evgeny <etsykunov@nvidia.com>
@negvet negvet requested review from cyanguwa and timmoon10 January 23, 2026 15:32
@greptile-apps
Contributor

greptile-apps bot commented Jan 23, 2026

Greptile Summary

This PR introduces QuantizerRole, a frozen dataclass that enables fine-grained, per-module/per-tensor quantization control through the new CustomRecipe API. The role object encapsulates three optional fields: module_type (e.g., "linear", "grouped_linear"), tensor_type (e.g., "input", "weight", "grad_output"), and name (instance identifier).

Key changes:

  • Added QuantizerRole dataclass in quantization.py with module_type, tensor_type, and name fields
  • Introduced CustomRecipe class accepting a qfactory callable that receives QuantizerRole and returns quantizer instances
  • Implemented get_quantizer_roles() method across all TE modules (Linear, GroupedLinear, LayerNormLinear, LayerNormMLP, attention modules) and fusible operations
  • Created CustomRecipeState to handle role-based quantizer dispatch
  • Added factory examples in quantization_recipes_base.py that mirror built-in recipes (current_scaling_quantizer_factory, mxfp8_quantizer_factory, float8_block_scaling_quantizer_factory, nvfp4_quantizer_factory)
  • Provided advanced example in quantization_factory_examples.py showing mixed quantization (NVFP4 for Linear, MXFP8 for GroupedLinear)
  • Comprehensive tests validate factory equivalence with built-in recipes and role-based dispatch behavior
  • Module files renamed for clarity: quantization_nvfp4.py → quantization_ref_nvfp4.py, quantization_current_scaling.py → quantization_ref_current_scaling.py

The implementation allows users to create custom quantization strategies like "use NVFP4 for all linear layers except attention projection layers, which should use MXFP8" by inspecting role fields in the factory function. The API is marked as experimental with appropriate warnings.
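The strategy described above can be sketched by inspecting role fields in the factory. This is an illustrative sketch only, assuming the QuantizerRole dataclass from this PR; the quantizer classes and the "proj" instance name are placeholders, not the actual TE API:

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class QuantizerRole:
    module_type: str = ""
    tensor_type: str = ""
    name: str = ""

# Hypothetical stand-ins for TE's NVFP4/MXFP8 quantizer classes.
class NVFP4Quantizer: ...
class MXFP8Quantizer: ...

def qfactory(role: QuantizerRole):
    # Attention projection layers, matched by instance name, use MXFP8;
    # every other linear-like module uses NVFP4.
    if role.module_type == "linear" and role.name == "proj":
        return MXFP8Quantizer()
    return NVFP4Quantizer()
```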

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-designed, thoroughly tested, and maintains backward compatibility. The API is marked as experimental with appropriate warnings. All changes follow consistent patterns across modules, and comprehensive tests validate equivalence with built-in recipes.
  • No files require special attention

Important Files Changed

Filename Overview
transformer_engine/common/recipe/__init__.py Added CustomRecipe class with qfactory callable parameter for fine-grained quantization control
transformer_engine/pytorch/quantization.py Introduced QuantizerRole dataclass and CustomRecipeState for role-based quantizer dispatch
transformer_engine/pytorch/module/base.py Added get_quantizer_roles() method and role properties for per-module quantizer configuration
transformer_engine/pytorch/custom_recipes/quantization_recipes_base.py New file with factory functions mirroring built-in recipes (current scaling, MXFP8, FP8 block, NVFP4)
transformer_engine/pytorch/custom_recipes/quantization_factory_examples.py Added example factory showing mixed quantization (NVFP4 for Linear, MXFP8 for GroupedLinear)
tests/pytorch/test_custom_recipe.py Comprehensive tests covering role-based dispatch, factory equivalence, and multiple module types
tests/pytorch/distributed/run_numerics_exact.py Updated factory to use QuantizerRole object with role.tensor_type instead of string parsing

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User creates CustomRecipe with qfactory] --> B[autocast context with recipe]
    B --> C[Module forward pass begins]
    C --> D[Module emits QuantizerRole objects]
    D --> E{CustomRecipe?}
    E -->|Yes| F[Call qfactory for each role]
    E -->|No| G[Use built-in recipe state]
    F --> H[QuantizerRole inspection]
    H --> I{Dispatch logic}
    I -->|module_type='linear'| J[Return NVFP4Quantizer]
    I -->|module_type='grouped_linear'| K[Return MXFP8Quantizer]
    I -->|tensor_type='grad_output'| L[Return E5M2 quantizer]
    I -->|Other roles| M[Return default quantizer]
    J --> N[Quantizer used for tensor operations]
    K --> N
    L --> N
    M --> N
    G --> N
    N --> O[Forward/backward computation]
    
    style A fill:#e1f5ff
    style F fill:#fff4e1
    style H fill:#ffe1f5
    style N fill:#e1ffe1

Last reviewed commit: 41656ab


Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: Evgeny <etsykunov@nvidia.com>

Signed-off-by: Evgeny <etsykunov@nvidia.com>

Signed-off-by: Evgeny <etsykunov@nvidia.com>

Collaborator

@timmoon10 timmoon10 left a comment


Overall this design is quite clean and generalizable.

Comment on lines 1320 to 1329
    base = [
        QuantizerRole(module_type="linear", tensor_type="input", name=name),
        QuantizerRole(module_type="linear", tensor_type="weight", name=name),
        QuantizerRole(module_type="linear", tensor_type="output", name=name),
    ]
else:
    base = [
        QuantizerRole(module_type="linear", tensor_type="grad_output", name=name),
        QuantizerRole(module_type="linear", tensor_type="grad_input", name=name),
    ]
Collaborator

@timmoon10 timmoon10 Feb 20, 2026


"output" and "grad_input" roles don't make sense. In reality, we are implicitly assuming that the tensor will be consumed by another linear-like layer.

Suggested change
    base = [
        QuantizerRole(module_type="linear", tensor_type="input", name=name),
        QuantizerRole(module_type="linear", tensor_type="weight", name=name),
        QuantizerRole(module_type="linear", tensor_type="output", name=name),
    ]
else:
    base = [
        QuantizerRole(module_type="linear", tensor_type="grad_output", name=name),
        QuantizerRole(module_type="linear", tensor_type="grad_input", name=name),
    ]

    base = [
        QuantizerRole(module_type="linear", tensor_type="input", name=name),
        QuantizerRole(module_type="linear", tensor_type="weight", name=name),
        QuantizerRole(module_type="linear", tensor_type="input", name=name),
    ]
else:
    base = [
        QuantizerRole(module_type="linear", tensor_type="grad_output", name=name),
        QuantizerRole(module_type="linear", tensor_type="grad_output", name=name),
    ]

Alternatively, if we want to use the output in FP8 DPA, the right role would be module_type="dpa" and tensor_type="input". We should probably make this configurable. I kind of like that this design is exposing the hidden assumptions we've been making.

Collaborator Author

@negvet negvet Feb 25, 2026


I agree about the "output" and "grad_input" roles. I set the roles for those slots to None (the safest default) and made them configurable. Also configured this in MHA.

Comment on lines 310 to 314
assert counts["input"] == 1
assert counts["weight"] == 1
assert counts["output"] == 1
assert counts["grad_output"] == 1
assert counts["grad_input"] == 1
Collaborator


Suggested change
assert counts["input"] == 1
assert counts["weight"] == 1
assert counts["output"] == 1
assert counts["grad_output"] == 1
assert counts["grad_input"] == 1

assert counts["input"] == 2
assert counts["weight"] == 1
assert counts["output"] == 0
assert counts["grad_output"] == 2
assert counts["grad_input"] == 0

negvet and others added 2 commits February 20, 2026 14:31
Signed-off-by: Evgeny Tsykunov <etsykunov@nvidia.com>

negvet and others added 5 commits February 20, 2026 15:05
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment


15 files reviewed, no comments


Comment on lines 85 to 88
def is_gemm(self) -> bool:
    """Whether this role belongs to a GEMM-based module."""
    return self.module_type in self.GEMM_MODULE_TYPES

Collaborator


I think this is baking in assumptions about what formats are similar (our recent experiences with grouped tensors makes me wonder if the requirements for "linear" and "grouped_linear" will diverge in the future), and it's also not giving us that much convenience.

Suggested change
def is_gemm(self) -> bool:
    """Whether this role belongs to a GEMM-based module."""
    return self.module_type in self.GEMM_MODULE_TYPES

Collaborator Author


Sure, removed

Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment


17 files reviewed, no comments


@negvet negvet changed the title [PyTorch] Introduce semantic quantizer roles [PyTorch] Introduce quantizer roles Feb 25, 2026
Evgeny and others added 3 commits February 25, 2026 14:30
Contributor

@greptile-apps greptile-apps bot left a comment


24 files reviewed, no comments


Contributor

@greptile-apps greptile-apps bot left a comment


24 files reviewed, no comments


Evgeny and others added 2 commits February 25, 2026 16:43
Signed-off-by: Evgeny <etsykunov@gmail.com>
@negvet negvet requested review from ptrendx and timmoon10 February 25, 2026 16:45
Contributor

@greptile-apps greptile-apps bot left a comment


24 files reviewed, no comments

