
Conversation

@zhangtao0408 zhangtao0408 commented Jan 22, 2026

What does this PR do?

Fixes: #13015
Fixes: #13016

Test code:

import torch
import torch_npu
import torch.distributed as dist

import os, time
from PIL import Image
from diffusers import QwenImageEditPlusPipeline, ContextParallelConfig
from diffusers.utils import load_image

# Initialize the distributed environment
rank = int(os.getenv("RANK", 0))
world_size = int(os.getenv("WORLD_SIZE", 1))

if world_size > 1 and not dist.is_initialized():
    dist.init_process_group(backend="hccl")
    rank = dist.get_rank()
    device = torch.device("npu", rank % torch.npu.device_count())
    torch.npu.set_device(device)
else:
    device = "npu"

image1 = load_image("https://github.com/vipshop/cache-dit/raw/main/examples/data/edit2509_1.jpg")
image2 = load_image("https://github.com/vipshop/cache-dit/raw/main/examples/data/edit2509_2.jpg")
prompt = "The magician bear is on the left, the alchemist bear is on the right, facing each other in the central park square"

pipe = QwenImageEditPlusPipeline.from_pretrained(
    "/PATH/TO/Qwen-Image-Edit-2509",
    torch_dtype=torch.bfloat16
).to(device)
pipe.transformer.set_attention_backend("_native_npu")

pipe.set_progress_bar_config(disable=rank != 0)
pipe.enable_model_cpu_offload(device=device)

if world_size > 1:
    pipe.transformer.enable_parallelism(
        config=ContextParallelConfig(ulysses_degree=world_size)
    )

with torch.inference_mode():
    # Inference
    torch.npu.synchronize()
    start_time = time.time()
    output = pipe(
        image=[image1, image2],
        prompt=prompt,
        generator=torch.Generator(device="cpu").manual_seed(0),
        true_cfg_scale=4.0,
        negative_prompt=" ",
        num_inference_steps=20,
        num_images_per_prompt=1,
        height=1024,
        width=1024,
    )
    torch.npu.synchronize()
    end_time = time.time()
    
    inference_time = end_time - start_time
    if rank == 0:
        output_image = output.images[0]
        output_image.save(f"qwen-image-ulysses{world_size}-time{inference_time:.2f}s.png")
        print(f"image saved at qwen-image-ulysses{world_size}-time{inference_time:.2f}s.png")
  • Run the code

# 1 card
python3 qwen_image_edit_test.py

# 4 cards
torchrun --nproc_per_node=4 qwen_image_edit_test.py

Results are in the comment below.

Who can review?

cc @yiyixuxu @sayakpaul @asomoza @DN6

@zhangtao0408 (Contributor Author)

Results Log

  • Before PR
Traceback (most recent call last):
  File "/home/qwen_image_edit_test.py", line 42, in <module>
    _ = pipe(
        ^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/diffusers/src/diffusers/pipelines/qwenimage/pipeline_qwenimage_edit_plus.py", line 803, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.13/lib/python3.11/site-packages/accelerate/hooks.py", line 175, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/diffusers/src/diffusers/models/transformers/transformer_qwenimage.py", line 923, in forward
    text_seq_len, _, encoder_hidden_states_mask = compute_text_seq_len_from_mask(
                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/diffusers/src/diffusers/models/transformers/transformer_qwenimage.py", line 167, in compute_text_seq_len_from_mask
    per_sample_len = torch.where(has_active, active_positions.max(dim=1).values + 1, torch.as_tensor(text_seq_len))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device. Expected NPU tensor, please check whether the input tensor device is correct.
[ERROR] 2026-01-22-02:30:34 (PID:633567, Device:0, RankID:-1) ERR01002 OPS invalid type
  • After PR

1 card

python3 qwen_image_edit_test.py
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 13.13it/s]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 15.45it/s]
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:02<00:00,  2.12it/s]
Attention backends are an experimental feature and the API may be subject to change.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:14<00:00,  7.43s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [01:34<00:00,  4.70s/it]
image saved at qwen-image-ulysses1-time109.56s.png

4 cards

torchrun --nproc_per_node=4 qwen_image_edit_test.py
W0121 16:48:33.258000 627474 site-packages/torch/distributed/run.py:774] 
W0121 16:48:33.258000 627474 site-packages/torch/distributed/run.py:774] *****************************************
W0121 16:48:33.258000 627474 site-packages/torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0121 16:48:33.258000 627474 site-packages/torch/distributed/run.py:774] *****************************************
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.11it/s]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 14.83it/s]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 14.30it/s]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 15.67it/s]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 13.18it/s]
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:03<00:00,  1.95it/s]
Attention backends are an experimental feature and the API may be subject to change.
`enable_parallelism` is an experimental feature. The API may change in the future and breaking changes may be introduced at any time without warning.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00,  5.91s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:35<00:00,  1.77s/it]
image saved at qwen-image-ulysses4-time49.67s.png

Results images

1 card

[image: qwen-image-ulysses1-time109.56s.png]

4 cards

[image: qwen-image-ulysses4-time49.67s.png]

@sayakpaul (Member) left a comment


Thanks, left some comments.

Comment on lines 1132 to 1135

if (
    attn_mask is not None
    and torch.all(attn_mask != 0).item()
):
Member

Suggested change

if (
    attn_mask is not None
    and torch.all(attn_mask != 0).item()
):

if attn_mask is not None and torch.all(attn_mask != 0):

Won't it work?

Contributor Author

Won't it work?

# Skip Attention Mask if all values are 1, `None` mask can speedup the computation
if (
    attn_mask is not None
    and torch.all(attn_mask != 0).item()
):
    attn_mask = None

Thanks for the reply!

Since NPU fused attention does not support the [B, seq_len_kv] mask shape passed by QwenImageEditPlus, and the unsqueeze/expand operations slow down execution, I added logic to bypass those steps when the mask is all ones. This optimization significantly improves speed under context parallelism, as shown in the test results below (a small equivalence sketch follows the table):

| Stage | Cards | End-to-end time (s) |
| --- | --- | --- |
| Skip expand mask (set to None) | 1 | 108.22 |
| Skip expand mask (set to None) | 4 | 49.83 |
| Expand mask | 1 | 108.62 |
| Expand mask | 4 | 57.74 |
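
For intuition, here is a minimal CPU sketch of why an all-ones padding mask can safely be replaced with None. It uses standard torch.nn.functional.scaled_dot_product_attention rather than the NPU kernel, and the shapes are illustrative:

import torch
import torch.nn.functional as F

# All-ones [B, Skv] keep-mask, expanded to the [B, 1, Sq, Skv] boolean form
# (True = attend) accepted by scaled_dot_product_attention. With no masked
# positions, mask=None is mathematically equivalent and skips materializing
# and expanding the mask tensor.
B, H, Sq, Skv, D = 2, 4, 16, 16, 8
q = torch.randn(B, H, Sq, D)
k = torch.randn(B, H, Skv, D)
v = torch.randn(B, H, Skv, D)

pad_mask = torch.ones(B, Skv, dtype=torch.bool)
full_mask = pad_mask[:, None, None, :].expand(B, 1, Sq, Skv)

out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=full_mask)
out_none = F.scaled_dot_product_attention(q, k, v, attn_mask=None)
assert torch.allclose(out_masked, out_none, atol=1e-6)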

@sayakpaul (Member) commented Jan 22, 2026

That's fine. I am asking if this condition would work (i.e., no item()):
if attn_mask is not None and torch.all(attn_mask != 0):

Contributor Author

Thanks, that worked. I've removed item() and pushed the update.

# Skip Attention Mask if all values are 1, `None` mask can speedup the computation
if (
    attn_mask is not None
    and torch.all(attn_mask != 0).item()
Member

Same as above.

Comment on lines -167 to +171

per_sample_len = torch.where(has_active, active_positions.max(dim=1).values + 1, torch.as_tensor(text_seq_len))
per_sample_len = torch.where(
    has_active,
    active_positions.max(dim=1).values + 1,
    torch.as_tensor(text_seq_len, device=encoder_hidden_states.device)
)
Member

Seems like an unrelated change? If so, could you undo it?

Contributor Author

This change fixes #13015: torch.as_tensor(text_seq_len) without a device argument creates a CPU scalar tensor, which triggers the cross-device RuntimeError in torch.where shown in the "Before PR" traceback above (a minimal repro sketch follows).
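
A hypothetical minimal reproduction of that pattern, for context. The surrounding computation of active_positions here is illustrative, not the actual diffusers code; the point is only that the fallback branch of torch.where must live on the same device as the other operands:

import torch

# On accelerator backends that do not auto-promote across devices (such as NPU),
# torch.as_tensor(int) yields a CPU scalar and torch.where raises a cross-device error.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
text_seq_len = 4
# [B, S] keep-mask; the second row has no active tokens
mask = torch.tensor([[1, 1, 0, 0], [0, 0, 0, 0]], device=device, dtype=torch.bool)

has_active = mask.any(dim=1)
positions = torch.arange(mask.shape[1], device=device)
active_positions = positions * mask  # zero out inactive positions (illustrative)
per_sample_len = torch.where(
    has_active,
    active_positions.max(dim=1).values + 1,
    torch.as_tensor(text_seq_len, device=device),  # the fix: pin the fallback to the same device
)
print(per_sample_len)  # tensor([2, 4]) on `device`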

Comment on lines 2450 to 2458

if (
    attn_mask is not None
    and attn_mask.ndim == 2
    and attn_mask.shape[0] == query.shape[0]
    and attn_mask.shape[1] == key.shape[1]
):
    B, Sq, Skv = attn_mask.shape[0], query.shape[1], key.shape[1]
    attn_mask = ~attn_mask.to(torch.bool)
    attn_mask = attn_mask.unsqueeze(1).expand(B, Sq, Skv).unsqueeze(1).contiguous()
Member

Would it make sense to have a small utility named _maybe_modify_attn_mask_npu() so that it can be reused in the two places (here and above)?

Contributor Author

Thanks for your suggestion! I've added the _maybe_modify_attn_mask_npu() helper:

def _maybe_modify_attn_mask_npu(
    query: torch.Tensor,
    key: torch.Tensor,
    attn_mask: Optional[torch.Tensor] = None,
):
    # Skip the attention mask if all values are 1; a `None` mask speeds up the computation
    if attn_mask is not None and torch.all(attn_mask != 0):
        attn_mask = None
    # Reshape the attention mask: [batch_size, seq_len_k] -> [batch_size, 1, seq_len_q, seq_len_k]
    # https://www.hiascend.com/document/detail/zh/Pytorch/730/apiref/torchnpuCustomsapi/docs/context/torch_npu-npu_fusion_attention.md
    if (
        attn_mask is not None
        and attn_mask.ndim == 2
        and attn_mask.shape[0] == query.shape[0]
        and attn_mask.shape[1] == key.shape[1]
    ):
        B, Sq, Skv = attn_mask.shape[0], query.shape[1], key.shape[1]
        attn_mask = ~attn_mask.to(torch.bool)
        attn_mask = attn_mask.unsqueeze(1).expand(B, Sq, Skv).unsqueeze(1).contiguous()
    return attn_mask
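
For reference, a small CPU-only sketch of what the reshape branch produces; the shapes are illustrative and the on-device behavior of npu_fusion_attention is not exercised here:

import torch

# A [B, Skv] keep-mask (1 = attend) becomes an inverted [B, 1, Sq, Skv] boolean
# mask (True = masked out), matching the ~mask + unsqueeze/expand steps above.
B, Sq, Skv = 2, 3, 4
keep_mask = torch.tensor([[1, 1, 1, 0], [1, 0, 0, 0]])  # [B, Skv]

masked_out = ~keep_mask.to(torch.bool)  # [B, Skv], True = masked out
full = masked_out.unsqueeze(1).expand(B, Sq, Skv).unsqueeze(1).contiguous()
print(full.shape)  # torch.Size([2, 1, 3, 4])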

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

