Conversation
Greptile Summary

This PR extends the grouped MXFP8 quantization to compute a separate dbias for each tensor in the group, producing a grouped dbias tensor with per-tensor values.

Important limitation: Grouped dbias remains unsupported for the VARYING_LAST_DIM and VARYING_BOTH_DIMS shape representations.

Confidence Score: 4/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Input: Grouped Tensor] --> B{Check Shape Rep}
    B -->|SAME_BOTH_DIMS or<br/>VARYING_FIRST_DIM| C[Supported: Single Tensor Mode]
    B -->|VARYING_LAST_DIM or<br/>VARYING_BOTH_DIMS| D[Not Supported: Skip DBias]
    C --> E[Validate: First dim % 128 == 0]
    E --> F[group_quantize_mxfp8_kernel]
    F --> G[Compute partial dbias<br/>per 128x128 chunk]
    G --> H[Store to workspace<br/>M/128 x K layout]
    H --> I[group_reduce_dbias_kernel]
    I --> J{Shape Rep}
    J -->|SAME_BOTH_DIMS| K[Offset = tensor_id * rows/128]
    J -->|VARYING_FIRST_DIM| L[Offset = offsets_ptr / cols / 128]
    K --> M[Reduce rows in workspace]
    L --> M
    M --> N[Output: Per-tensor dbias<br/>num_tensors x K]
```
Last reviewed commit: a7a06d2

/te-ci
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
for more information, see https://pre-commit.ci
```cpp
    ? (first_logical_dim / num_tensors)
    : first_dims_ptr[tensor_id];

const size_t rows = tensor_rows / chunk_dim_Y;
```
Verify that tensor_rows is always divisible by chunk_dim_Y (128), otherwise this division silently truncates and skips tail row reduction.
```cpp
if (global_dim_X % CHUNK_DIM_X != 0) {
  NVTE_DEVICE_ERROR(
      "The grouped tensor must be divisible by 128x128 tiles without a tail tile.");
}
```
Let's see the performance impact of having this here.
On B300, the difference is within measurement noise. Over 3 runs, nsys shows ~59.69 µs with the check vs. ~59.62 µs without.
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
```cpp
const size_t dbias_in_offset_Y = (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS)
                                     ? (tensor_id * (tensor_rows / chunk_dim_Y))
                                     : (offsets_ptr[tensor_id] / cols / chunk_dim_Y);
```
For VARYING_FIRST_DIM, the offset computation offsets_ptr[tensor_id] / cols / chunk_dim_Y assumes the data offset is divisible by cols * chunk_dim_Y. However, when tensors have varying first dimensions, the cumulative offset offsets_ptr[tensor_id] equals the sum of M_i * K for all previous tensors. If any M_i % chunk_dim_Y != 0, this division will truncate and compute an incorrect workspace offset, causing data corruption.
The kernel in group_quantize_mxfp8.cuh:109-111 validates each tensor's first dimension is divisible by 128, which ensures M_i % chunk_dim_Y == 0, but the workspace offset depends on the sum of all previous tensor sizes being correctly aligned. Verify this is always satisfied for VARYING_FIRST_DIM case.
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
/te-ci

/te-ci
Description
This PR adds a new kernel that computes `dbias` separately for each tensor in a group and outputs a grouped `dbias` tensor containing per-tensor `dbias` values.

Fixes # (issue)
Type of change
Changes
Checklist: