-
Notifications
You must be signed in to change notification settings - Fork 123
Add Qwen3-VL vision-language model support #442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
nyo16
wants to merge
11
commits into
elixir-nx:main
Choose a base branch
from
nyo16:feat/qwen3-vl
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add support for Qwen3-VL/Qwen2-VL vision-language models with: - Multimodal model (lib/bumblebee/multimodal/qwen3_vl.ex): - Combines vision encoder with Qwen3 text decoder - Visual embedding substitution (replaces image/video tokens) - Supports both image and video inputs via temporal dimension - Uses Qwen3 text model as decoder backbone - Vision encoder (lib/bumblebee/vision/qwen3_vl_vision.ex): - Patch embedding with 3D conv support (temporal + spatial) - Uses Layers.Transformer.blocks/2 as per best practices - Spatial patch merger with MLP projection - Rotary position embeddings (no learned pos embeds) - Featurizer (lib/bumblebee/vision/qwen3_vl_featurizer.ex): - Image and video preprocessing - Temporal dimension handling for video frames - Bicubic resize and normalization - Registrations in bumblebee.ex: - Qwen2VLForConditionalGeneration architecture - Qwen3VLForConditionalGeneration architecture - Featurizer and tokenizer mappings Test outputs match Python reference values to 4 decimal places. Note: Test is marked @Skip pending upload of tiny-random checkpoint to bumblebee-testing HuggingFace organization.
- Remove "model." prefix from text model HF paths since the loader infers and adds this prefix automatically - Fix vision encoder FFN layer names (fc1/fc2 -> linear_fc1/linear_fc2) - Fix vision merger layer names to match Qwen3VL checkpoint structure - Re-enable QK-norm for text model (Qwen3-VL does use it, unlike Qwen2VL) The model now loads correctly with all text and vision encoder parameters properly mapped. Only DeepStack merger and position embedding params remain unused (expected - these are optional features).
- Fix process_frame argument order (frame, featurizer) to match pipe usage - Add automatic image resizing to dimensions compatible with patch_size * merge_size - Handle different size config formats (height/width vs shortest_edge) - Update batch_template to handle various size formats Note: Vision encoder currently requires square images. Non-square support needs grid dimension tracking in patch merger.
…n encoder The vision encoder was producing incorrect image descriptions because it used 1D sequential positions for rotary embedding instead of 2D spatial coordinates. Changes: - Implement compute_2d_rotary_embedding/4 that computes separate row and column frequencies for each patch based on its grid position - Create custom vision_transformer_blocks/5 with 2D rotary support since Layers.Transformer.blocks only supports 1D positions - Add vision_attention_with_2d_rotary/5 for self-attention with 2D rotary - Implement apply_2d_rotary_embedding/4, split_rotary/2, rotate_half/1 - Add bilinear interpolation for learned position embeddings to match Python's fast_pos_embed_interpolate (48x48 grid to actual grid size) - Update parameter mapping for new layer names The fix ensures the vision encoder correctly captures spatial relationships between image patches, producing descriptions that match Python's output.
- Fix vision config loader to handle both embed_dim (Qwen2-VL) and hidden_size (Qwen3-VL) config formats - Also read intermediate_size directly from config when available - Update test with correct reference values from Python (transformers 4.57.3)
- Remove @tag :skip from test - Use roulis/tiny-random-Qwen3VLForConditionalGeneration checkpoint - Test validates text-only inference matches Python reference values
Qwen2-VL uses different parameter names (mlp.fc1 vs mlp.linear_fc1) so the current implementation only supports Qwen3-VL.
- Interactive example for image description with Qwen3-VL - Python code to generate tiny test model - Reference values comparison table (Python vs Elixir) - Implementation notes on 2D spatial rotary embeddings
- Add deepstack_merger function to vision encoder with postshuffle norm - Extract hidden states from encoder layers and pass through mergers - Add post_block_hook option to Layers.Transformer.blocks for injection - Document DeepStack decoder injection as TODO (not critical for function)
- Build text decoder directly to enable post_block_hook usage - Extract deepstack features from vision encoder output - Create visual position mask from image/video token IDs - Inject deepstack features at text decoder layers 0, 1, 2 - Add gated_ffn helper function for Qwen3 architecture DeepStack adds multi-scale visual information by: 1. Extracting hidden states from vision encoder layers [5, 11, 17] 2. Passing through separate merger MLPs (postshuffle norm) 3. Adding features to visual token positions in decoder layers
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds full support for the Qwen3-VL vision-language model family, enabling image-to-text generation with Bumblebee.
Model:
Qwen/Qwen3-VL-2B-Instruct(and other sizes)Architecture:
Qwen3VLForConditionalGenerationFeatures
Vision Encoder (
Bumblebee.Vision.Qwen3VLVision)Text Decoder
Featurizer (
Bumblebee.Vision.Qwen3VLFeaturizer){num_patches, channels * temporal * patch_h * patch_w}DeepStack Implementation
DeepStack provides multi-scale visual information by:
hidden_states[visual_mask] += deepstack_features[layer_idx]Infrastructure Changes
post_block_hookoption toLayers.Transformer.blocksfor per-layer injectionFiles Changed
New Files
lib/bumblebee/multimodal/qwen3_vl.ex- Main VL modellib/bumblebee/vision/qwen3_vl_vision.ex- Vision encoderlib/bumblebee/vision/qwen3_vl_featurizer.ex- Image preprocessingtest/bumblebee/multimodal/qwen3_vl_test.exs- Testsnotebooks/qwen3_vl.livemd- Usage examplesModified Files
lib/bumblebee.ex- Model/featurizer registrationslib/bumblebee/layers/transformer.ex- Addedpost_block_hookoptionTest Results
Usage Example
Parameter Loading
All parameters load correctly with no warnings:
References