Add Qwen3-VL vision-language model support #442

nyo16 · 2026-01-07T00:07:28Z

Summary

This PR adds full support for the Qwen3-VL vision-language model family, enabling image-to-text generation with Bumblebee.

Model: Qwen/Qwen3-VL-2B-Instruct (and other sizes)
Architecture: Qwen3VLForConditionalGeneration

Features

Vision Encoder (`Bumblebee.Vision.Qwen3VLVision`)

3D convolution patch embedding (supports video temporal dimension)
2D spatial rotary position embeddings for accurate spatial understanding
Bilinear interpolation for position embeddings (handles variable image sizes)
Patch merger with spatial reduction (2x2 → 1)
DeepStack feature extraction from layers [5, 11, 17]

Text Decoder

Based on Qwen3 architecture with QK-norm
Visual token substitution (replaces image placeholder tokens with vision embeddings)
DeepStack injection at decoder layers [0, 1, 2]
Full rotary position embedding support

Featurizer (`Bumblebee.Vision.Qwen3VLFeaturizer`)

Image preprocessing with configurable resize
Automatic padding to patch-aligned dimensions
Support for both images and video frames
Outputs flattened patches: {num_patches, channels * temporal * patch_h * patch_w}

DeepStack Implementation

DeepStack provides multi-scale visual information by:

Extracting hidden states from vision encoder layers [5, 11, 17] (1-indexed)
Passing each through separate merger MLPs with postshuffle norm (norm AFTER spatial merge)
Injecting features into text decoder at layers [0, 1, 2]
Formula: hidden_states[visual_mask] += deepstack_features[layer_idx]

Infrastructure Changes

Added post_block_hook option to Layers.Transformer.blocks for per-layer injection

Files Changed

New Files

lib/bumblebee/multimodal/qwen3_vl.ex - Main VL model
lib/bumblebee/vision/qwen3_vl_vision.ex - Vision encoder
lib/bumblebee/vision/qwen3_vl_featurizer.ex - Image preprocessing
test/bumblebee/multimodal/qwen3_vl_test.exs - Tests
notebooks/qwen3_vl.livemd - Usage examples

Modified Files

lib/bumblebee.ex - Model/featurizer registrations
lib/bumblebee/layers/transformer.ex - Added post_block_hook option

Test Results

# Unit test with tiny model
mix test test/bumblebee/multimodal/qwen3_vl_test.exs
1 test, 0 failures

# Real model test with image (448x448)
Image: 448x448x3
Patches: 784 → 196 visual tokens (after merge)
Generated: "This image shows a close-up of a single, small, dark-colored object..."
✓ DeepStack injection test with real image PASSED!

Usage Example

{:ok, model_info} = Bumblebee.load_model({:hf, "Qwen/Qwen3-VL-2B-Instruct"}, type: :bf16)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "Qwen/Qwen3-VL-2B-Instruct"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "Qwen/Qwen3-VL-2B-Instruct"},
  module: Bumblebee.Vision.Qwen3VLFeaturizer)

# Load and process image
image = StbImage.read_file!("photo.jpg")
image_inputs = Bumblebee.apply_featurizer(featurizer, image)

# Build prompt with image placeholder
prompt = """
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>
<|im_start|>assistant
"""

text_inputs = Bumblebee.apply_tokenizer(tokenizer, prompt)
inputs = Map.merge(text_inputs, image_inputs)

# Run inference
outputs = Axon.predict(model_info.model, model_info.params, inputs)

Parameter Loading

All parameters load correctly with no warnings:

Vision encoder: patch_embed, pos_embed, blocks.{0-23}, merger
DeepStack mergers: deepstack_merger_list.{0-2} (9 params total)
Text decoder: embedder, decoder.blocks.{0-27}, output_norm, lm_head

References

@Skip

Add support for Qwen3-VL/Qwen2-VL vision-language models with: - Multimodal model (lib/bumblebee/multimodal/qwen3_vl.ex): - Combines vision encoder with Qwen3 text decoder - Visual embedding substitution (replaces image/video tokens) - Supports both image and video inputs via temporal dimension - Uses Qwen3 text model as decoder backbone - Vision encoder (lib/bumblebee/vision/qwen3_vl_vision.ex): - Patch embedding with 3D conv support (temporal + spatial) - Uses Layers.Transformer.blocks/2 as per best practices - Spatial patch merger with MLP projection - Rotary position embeddings (no learned pos embeds) - Featurizer (lib/bumblebee/vision/qwen3_vl_featurizer.ex): - Image and video preprocessing - Temporal dimension handling for video frames - Bicubic resize and normalization - Registrations in bumblebee.ex: - Qwen2VLForConditionalGeneration architecture - Qwen3VLForConditionalGeneration architecture - Featurizer and tokenizer mappings Test outputs match Python reference values to 4 decimal places. Note: Test is marked @Skip pending upload of tiny-random checkpoint to bumblebee-testing HuggingFace organization.

- Remove "model." prefix from text model HF paths since the loader infers and adds this prefix automatically - Fix vision encoder FFN layer names (fc1/fc2 -> linear_fc1/linear_fc2) - Fix vision merger layer names to match Qwen3VL checkpoint structure - Re-enable QK-norm for text model (Qwen3-VL does use it, unlike Qwen2VL) The model now loads correctly with all text and vision encoder parameters properly mapped. Only DeepStack merger and position embedding params remain unused (expected - these are optional features).

- Fix process_frame argument order (frame, featurizer) to match pipe usage - Add automatic image resizing to dimensions compatible with patch_size * merge_size - Handle different size config formats (height/width vs shortest_edge) - Update batch_template to handle various size formats Note: Vision encoder currently requires square images. Non-square support needs grid dimension tracking in patch merger.

…n encoder The vision encoder was producing incorrect image descriptions because it used 1D sequential positions for rotary embedding instead of 2D spatial coordinates. Changes: - Implement compute_2d_rotary_embedding/4 that computes separate row and column frequencies for each patch based on its grid position - Create custom vision_transformer_blocks/5 with 2D rotary support since Layers.Transformer.blocks only supports 1D positions - Add vision_attention_with_2d_rotary/5 for self-attention with 2D rotary - Implement apply_2d_rotary_embedding/4, split_rotary/2, rotate_half/1 - Add bilinear interpolation for learned position embeddings to match Python's fast_pos_embed_interpolate (48x48 grid to actual grid size) - Update parameter mapping for new layer names The fix ensures the vision encoder correctly captures spatial relationships between image patches, producing descriptions that match Python's output.

- Fix vision config loader to handle both embed_dim (Qwen2-VL) and hidden_size (Qwen3-VL) config formats - Also read intermediate_size directly from config when available - Update test with correct reference values from Python (transformers 4.57.3)

@tag

- Remove @tag :skip from test - Use roulis/tiny-random-Qwen3VLForConditionalGeneration checkpoint - Test validates text-only inference matches Python reference values

Qwen2-VL uses different parameter names (mlp.fc1 vs mlp.linear_fc1) so the current implementation only supports Qwen3-VL.

- Interactive example for image description with Qwen3-VL - Python code to generate tiny test model - Reference values comparison table (Python vs Elixir) - Implementation notes on 2D spatial rotary embeddings

- Add deepstack_merger function to vision encoder with postshuffle norm - Extract hidden states from encoder layers and pass through mergers - Add post_block_hook option to Layers.Transformer.blocks for injection - Document DeepStack decoder injection as TODO (not critical for function)

- Build text decoder directly to enable post_block_hook usage - Extract deepstack features from vision encoder output - Create visual position mask from image/video token IDs - Inject deepstack features at text decoder layers 0, 1, 2 - Add gated_ffn helper function for Qwen3 architecture DeepStack adds multi-scale visual information by: 1. Extracting hidden states from vision encoder layers [5, 11, 17] 2. Passing through separate merger MLPs (postshuffle norm) 3. Adding features to visual token positions in decoder layers

nyo16 added 11 commits January 5, 2026 22:03

Enable Qwen3-VL test with tiny model from HuggingFace

35479af

- Remove @tag :skip from test - Use roulis/tiny-random-Qwen3VLForConditionalGeneration checkpoint - Test validates text-only inference matches Python reference values

Remove Qwen2-VL mappings (not tested, different param naming)

c805c32

Qwen2-VL uses different parameter names (mlp.fc1 vs mlp.linear_fc1) so the current implementation only supports Qwen3-VL.

Add Qwen3-VL Livebook with examples and test documentation

b07ac6b

- Interactive example for image description with Qwen3-VL - Python code to generate tiny test model - Reference values comparison table (Python vs Elixir) - Implementation notes on 2D spatial rotary embeddings

Remove appendix from Qwen3-VL Livebook

57553e7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add Qwen3-VL vision-language model support #442

Add Qwen3-VL vision-language model support #442

Uh oh!

nyo16 commented Jan 7, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Add Qwen3-VL vision-language model support #442

Are you sure you want to change the base?

Add Qwen3-VL vision-language model support #442

Uh oh!

Conversation

nyo16 commented Jan 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Features

Vision Encoder (Bumblebee.Vision.Qwen3VLVision)

Text Decoder

Featurizer (Bumblebee.Vision.Qwen3VLFeaturizer)

DeepStack Implementation

Infrastructure Changes

Files Changed

New Files

Modified Files

Test Results

Usage Example

Parameter Loading

References

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

nyo16 commented Jan 7, 2026 •

edited

Loading

Vision Encoder (`Bumblebee.Vision.Qwen3VLVision`)

Featurizer (`Bumblebee.Vision.Qwen3VLFeaturizer`)