feat: Add native FP8 model support with scale_inv dequantization #443
Summary
Add native FP8 quantized model support for models like Qwen3-FP8. This enables loading and running FP8 models with per-block scale factors (`scale_inv`) for dequantization.
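As a toy illustration of what per-block `scale_inv` dequantization means, each stored weight is multiplied by the inverse scale of the block it belongs to. The 2×2 block size and values below are made up for readability; real FP8 checkpoints store FP8 weights and use much larger blocks:

```elixir
# Toy example: a 4x4 weight split into 2x2 blocks, so scale_inv is 2x2.
# Plain f32 values stand in for FP8 here to show only the arithmetic.
weight = Nx.iota({4, 4}, type: :f32)
scale_inv = Nx.tensor([[0.5, 2.0], [1.0, 4.0]])

{rows, cols} = Nx.shape(weight)
{block_rows, block_cols} = Nx.shape(scale_inv)

# Expand each block scale so it covers its region of the weight matrix.
scales =
  scale_inv
  |> Nx.reshape({block_rows, 1, block_cols, 1})
  |> Nx.broadcast({block_rows, div(rows, block_rows), block_cols, div(cols, block_cols)})
  |> Nx.reshape({rows, cols})

dequantized = Nx.multiply(weight, scales)
```

The same reshape-and-broadcast expansion works for any block size, as long as the weight dimensions are multiples of it.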
Changes
bumblebee.ex
- Add `:preserve_source_types` option to `load_model/2` to keep FP8 types during loading

pytorch_params.ex
- Pass `preserve_source_types` through the param loading pipeline
- Update `ensure_type/3` to preserve FP8 types when the option is set

layers.ex
- Add `fp8_aware_dense/3` layer that handles FP8 quantized weights with a `scale_inv` parameter (see the sketch after this list)

layers/transformer.ex
- Add `:attention_dense` option to `blocks/2`, `block/2`, `multi_head_attention/4`

text/qwen3.ex
- Use `fp8_aware_dense` for attention via the `attention_dense` option
- Update `gated_ffn` to use `fp8_aware_dense` for FFN layers
- Add `scale_inv` to `params_mapping` for all attention and FFN layers
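For orientation, here is a rough sketch of what an FP8-aware dense layer can look like using Axon's custom-layer API (`Axon.param` plus `Axon.layer`). The parameter names, the `{in, out}` kernel layout, the shape of `scale_inv`, and the fixed 128 block size are assumptions for illustration, not the exact code in this PR:

```elixir
defmodule FP8LayersSketch do
  # Hypothetical sketch: a dense layer that declares a kernel plus a per-block
  # `scale_inv` parameter and dequantizes the kernel before the matmul.
  # Assumes the kernel dimensions divide evenly by the block size.
  def fp8_aware_dense(%Axon{} = x, units, opts \\ []) do
    block = opts[:block_size] || 128
    last = fn shape -> elem(shape, tuple_size(shape) - 1) end

    kernel = Axon.param("kernel", fn shape -> {last.(shape), units} end)

    # One inverse scale per block of the kernel.
    scale_inv =
      Axon.param("scale_inv", fn shape ->
        {div(last.(shape), block), div(units, block)}
      end)

    Axon.layer(&fp8_dense_impl/4, [x, kernel, scale_inv],
      name: opts[:name],
      op_name: :fp8_aware_dense
    )
  end

  defp fp8_dense_impl(input, kernel, scale_inv, _opts) do
    {in_features, units} = Nx.shape(kernel)
    {blocks_in, blocks_out} = Nx.shape(scale_inv)

    # Expand the per-block inverse scales to the full kernel shape.
    scales =
      scale_inv
      |> Nx.reshape({blocks_in, 1, blocks_out, 1})
      |> Nx.broadcast({blocks_in, div(in_features, blocks_in), blocks_out, div(units, blocks_out)})
      |> Nx.reshape({in_features, units})

    # Cast the (possibly FP8) kernel up, apply the scales, then the usual
    # dense matmul (no bias, for brevity).
    kernel = Nx.multiply(Nx.as_type(kernel, :f32), scales)
    Nx.dot(input, kernel)
  end
end
```

In this PR the transformer blocks pick up such a layer through the new `:attention_dense` option rather than calling it directly.

Test plan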
Dependencies
Requires (merge in order):
Usage
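A rough sketch of how loading and running an FP8 checkpoint might look once this and the dependency PRs are merged; the repository name and the option value are illustrative:

```elixir
# Hypothetical usage: load a natively FP8-quantized checkpoint, keeping the
# FP8 parameter types so dequantization happens via scale_inv at run time.
{:ok, model_info} =
  Bumblebee.load_model({:hf, "Qwen/Qwen3-8B-FP8"},
    preserve_source_types: true
  )

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-8B-FP8"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "Qwen/Qwen3-8B-FP8"})

serving = Bumblebee.Text.generation(model_info, tokenizer, generation_config)
Nx.Serving.run(serving, "Hello!")
```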