@nastya236 commented Jan 20, 2026

  1. Add a per-tensor scale for nvfp4 quantization on CUDA and CPU.

`qqmm`, `quantize`, and `dequantize` now take an optional 1D float32 array (`global_scale`) when `mode == "nvfp4"`.

The tensor-wide scale helps with small inputs:

```python
import mlx.core as mx

x = mx.random.uniform(shape=(2, 16)) / 1e5

# Without a global scale, the block scales underflow to zero:
xq_ns, scales_ns = mx.quantize(x, mode="nvfp4")

# With a per-tensor scale, the block scales stay representable:
global_scale = mx.absmax(x).astype(mx.float32)
xq_s, scales_s = mx.quantize(x, mode="nvfp4", global_scale=global_scale)

print(mx.allclose(scales_ns, mx.zeros_like(scales_ns)))  # True
print(mx.allclose(scales_s, mx.zeros_like(scales_s)))    # False
```
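To make the underflow concrete, here is a minimal NumPy sketch assuming the usual two-level NVFP4 scheme (FP4 E2M1 values with a max magnitude of 6.0, FP8 E4M3 block scales with a smallest positive subnormal of 2^-9). The constants and the `block_scale` helper are illustrative and are not the MLX implementation:

```python
import numpy as np

# Illustrative sketch (not the MLX kernel) of why tiny inputs need a
# per-tensor scale in nvfp4. Assumed format parameters:
FP4_MAX = 6.0            # max magnitude of FP4 E2M1
FP8_MAX = 448.0          # max magnitude of FP8 E4M3
FP8_MIN_SUBNORMAL = 2.0 ** -9

def block_scale(block_amax, global_scale=None):
    """Decoded scale a dequantizer would see for one block of values."""
    if global_scale is None:
        s = block_amax / FP4_MAX
        # Underflows to zero once it drops below what FP8 can represent.
        return s if s >= FP8_MIN_SUBNORMAL else 0.0
    # Two-level scheme: encode the block scale relative to a
    # tensor-wide scale derived from the global amax.
    tensor_scale = global_scale / (FP4_MAX * FP8_MAX)
    s = block_amax / FP4_MAX / tensor_scale
    s = s if s >= FP8_MIN_SUBNORMAL else 0.0
    return s * tensor_scale

amax = 1e-5                     # typical magnitude of x / 1e5 above
print(block_scale(amax))        # 0.0 -> the whole block dequantizes to zero
print(block_scale(amax, amax))  # non-zero -> the values survive
```

Only the underflow behavior is modeled here; the real encoder also rounds the block scale to an FP8 value.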
  2. `AbsMax` reduction type and `mx.absmax` op

For nvfp4 training we will compute the absolute maximum (amax) often, so this adds a new reduction type that applies `abs` inside the `all_reduce` kernel instead of materializing `|x|` first. There may be a better way to do this.

```
x.shape = (4 * 4096, 11008)
mx.absmax(x):   0.000166 s
x.abs().max():  0.000284 s
```

TODO: we probably want to support `global_scale` in Metal as well, but that requires changing all the quantized operations.

@nastya236 closed this Jan 20, 2026
@nastya236 reopened this Jan 20, 2026
@nastya236 changed the title Tensor scale nvfp4 [WIP] Tensor scale nvfp4 Jan 20, 2026
@nastya236 changed the title [WIP] Tensor scale nvfp4 Tensor scale nvfp4 Jan 23, 2026
@nastya236 marked this pull request as ready for review January 23, 2026 22:35