Describe the bug
Dear NVIDIA experts,

When an attention mask is used, three lines in the attention path cause a very significant slowdown, because each of them loops over all the items in the batch. This is a performance bottleneck in TransformerLayer and Attention:
- the `get_indices` function
- an assertion on the type of the attention mask
- a second, similar assertion on the type of the attention mask
See the profiling trace below:

As the largest bars in the bottom rows show, the operation taking most of the time in each transformer layer is now `transformer_engine/pytorch/attention/dot_product_attention/utils.py(1518): get_indices`. The other two assertions similarly cause a significant slowdown.
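To illustrate why the per-batch-item loop is costly, here is a minimal sketch (not Transformer Engine's actual code; the function names are mine) of the loop pattern versus a vectorized alternative that extracts the same indices with a single call:

```python
import numpy as np

def get_indices_loop(mask):
    # Per-item Python loop: one nonzero() call per batch element,
    # plus interpreter overhead that scales with batch size.
    out = []
    for b in range(mask.shape[0]):
        idx = np.nonzero(mask[b])[0]
        out.append(idx + b * mask.shape[1])
    return np.concatenate(out)

def get_indices_vectorized(mask):
    # Single vectorized call: flatten the (batch, seq) mask and take
    # the nonzero positions in one shot, with no Python loop.
    return np.nonzero(mask.reshape(-1))[0]

mask = np.array([[1, 0, 1],
                 [0, 1, 1]], dtype=bool)
assert np.array_equal(get_indices_loop(mask), get_indices_vectorized(mask))
```

The two functions return the same flattened indices, but the vectorized form does the batch iteration inside one library call instead of in Python, which is the usual fix for this class of bottleneck.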
Steps/Code to reproduce bug
This problem should show up in any profiling run in which the input to forward() includes an attention mask.
Expected behavior
Calling forward() of TransformerLayer should not become multiple times slower when an attention mask is passed.