[Performance]: loading bandwidth abnormal #482

@kangkai98

Proposal to improve performance

No response

Report of performance regression

docker: vllm-ascend v0.9.1
ucm: v0.1.0
model: qwq-32b
storage: OceanStore A800
nfs: 192.168.192.101:/kv_cache on /mnt/test1 type nfs (rw,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,nolock,proto=rdma,nconnect=8,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.192.101,mountvers=3,mountproto=tcp,local_lock=all,addr=192.168.192.101)
start model: vllm serve /root/.cache/modelscope/hub/models/Qwen/QwQ-32B --max-model-len 32768 --max-num-batched-tokens 8192 --port 8000 --tensor-parallel-size 4 --block_size 128 --pipeline-parallel-size 1 --trust-remote-code --no-enable-prefix-caching --kv-transfer-config '{ "kv_connector": "UCMConnector",
"kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"UCM_CONFIG_FILE":"/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"
}
}'
start test: vllm bench serve --model /root/.cache/modelscope/hub/models/Qwen/QwQ-32B --served-model-name /root/.cache/modelscope/hub/models/Qwen/QwQ-32B --dataset-name random --random-input-len 32767 --random-output-len 1 --num-prinf --seed 1

normal:
INFO: 120.9.5.51:57544 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 12-08 02:40:37 [async_llm.py:271] Added request cmpl-b7f05408b1724c569a8a289ff71cc0bb-0.
[2025-12-08 02:40:37] - ucm.integration.vllm.ucm_connector - INFO [ucm_connector.py:228] request_id: cmpl-b7f05408b1724c569a8a289ff71cc0bb-0, total_blocks_num: 255, hit hbm: 0, hit external: 0
[2025-12-08 02:40:41.799016][UC][D] Task(17,NFS::D2S,32640,2139095040) finished, wait=0.012346s, exec=1.280748s, bw=1.555488GB/s. [97606,98549][task_shard.h:108,operator()]
[2025-12-08 02:40:42.572807][UC][D] Task(17,NFS::D2S,32640,2139095040) finished, wait=0.019430s, exec=1.650939s, bw=1.206700GB/s. [97604,98548][task_shard.h:108,operator()]
[2025-12-08 02:40:42.686206][UC][D] Task(17,NFS::D2S,32640,2139095040) finished, wait=0.019045s, exec=1.577807s, bw=1.262630GB/s. [97605,98546][task_shard.h:108,operator()]
[2025-12-08 02:40:42.698518][UC][D] Task(17,NFS::D2S,32640,2139095040) finished, wait=0.022446s, exec=1.638335s, bw=1.215983GB/s. [97603,98547][task_shard.h:108,operator()]
INFO 12-08 02:41:00 [loggers.py:118] Engine 000: Avg prompt throughput: 3276.6 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 12-08 02:41:10 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

abnormal:
INFO: 120.9.5.51:57546 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 12-08 02:41:59 [async_llm.py:271] Added request cmpl-73b5601ea0e04fe7b9f1e0941821d20b-0.
[2025-12-08 02:41:59] - ucm.integration.vllm.ucm_connector - INFO [ucm_connector.py:228] request_id: cmpl-73b5601ea0e04fe7b9f1e0941821d20b-0, total_blocks_num: 255, hit hbm: 0, hit external: 255
[2025-12-08 02:42:04.854685][UC][D] Task(18,NFS::S2D,32640,2139095040) finished, wait=0.021372s, exec=5.516705s, bw=0.361119GB/s. [97603,98567][task_shard.h:108,operator()]
[2025-12-08 02:42:04.916599][UC][D] Task(18,NFS::S2D,32640,2139095040) finished, wait=0.014467s, exec=5.573084s, bw=0.357466GB/s. [97604,98585][task_shard.h:108,operator()]
[2025-12-08 02:42:04.939635][UC][D] Task(18,NFS::S2D,32640,2139095040) finished, wait=0.012428s, exec=5.651431s, bw=0.352510GB/s. [97606,98571][task_shard.h:108,operator()]
[2025-12-08 02:42:04.960851][UC][D] Task(18,NFS::S2D,32640,2139095040) finished, wait=0.013781s, exec=5.602387s, bw=0.355596GB/s. [97605,98565][task_shard.h:108,operator()]
INFO 12-08 02:42:10 [loggers.py:118] Engine 000: Avg prompt throughput: 3276.6 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
[2025-12-08 02:42:12.211663][UC][I] All blocks are hotness. [97465,100007][hotness_set.cc:64,UpdateHotness]
INFO 12-08 02:42:20 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
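
To put a number on the gap between the two runs, here is a minimal sketch that parses the `Task(...)` lines quoted above and compares the two directions. It rests on assumptions that the logs do not state explicitly: that `D2S` is the dump direction and `S2D` the load direction, that the fourth `Task` field is the transfer size in bytes, and that `GB/s` here means GiB/s. The function names and the `SAMPLE_LOG` variable are hypothetical helpers for illustration, not part of UCM. The log lines are trimmed to the part the regex needs.

```python
import re
from collections import defaultdict

# Task lines copied from the logs above, trimmed to the fields we parse.
SAMPLE_LOG = """\
Task(17,NFS::D2S,32640,2139095040) finished, wait=0.012346s, exec=1.280748s, bw=1.555488GB/s.
Task(17,NFS::D2S,32640,2139095040) finished, wait=0.019430s, exec=1.650939s, bw=1.206700GB/s.
Task(17,NFS::D2S,32640,2139095040) finished, wait=0.019045s, exec=1.577807s, bw=1.262630GB/s.
Task(17,NFS::D2S,32640,2139095040) finished, wait=0.022446s, exec=1.638335s, bw=1.215983GB/s.
Task(18,NFS::S2D,32640,2139095040) finished, wait=0.021372s, exec=5.516705s, bw=0.361119GB/s.
Task(18,NFS::S2D,32640,2139095040) finished, wait=0.014467s, exec=5.573084s, bw=0.357466GB/s.
Task(18,NFS::S2D,32640,2139095040) finished, wait=0.012428s, exec=5.651431s, bw=0.352510GB/s.
Task(18,NFS::S2D,32640,2139095040) finished, wait=0.013781s, exec=5.602387s, bw=0.355596GB/s.
"""

# The third Task field is captured as "count" without assuming its meaning;
# the fourth is ASSUMED to be the transfer size in bytes.
TASK_RE = re.compile(
    r"Task\((?P<task_id>\d+),NFS::(?P<direction>D2S|S2D),"
    r"(?P<count>\d+),(?P<size_bytes>\d+)\) finished, "
    r"wait=(?P<wait_s>[\d.]+)s, exec=(?P<exec_s>[\d.]+)s, "
    r"bw=(?P<bw>[\d.]+)GB/s"
)


def parse_tasks(log_text):
    """Return one dict per Task completion line found in log_text."""
    tasks = []
    for m in TASK_RE.finditer(log_text):
        tasks.append({
            "task_id": int(m["task_id"]),
            "direction": m["direction"],
            "count": int(m["count"]),
            "size_bytes": int(m["size_bytes"]),
            "wait_s": float(m["wait_s"]),
            "exec_s": float(m["exec_s"]),
            "bw": float(m["bw"]),
        })
    return tasks


def mean_by_direction(tasks, field):
    """Average a numeric field separately for each transfer direction."""
    grouped = defaultdict(list)
    for t in tasks:
        grouped[t["direction"]].append(t[field])
    return {d: sum(v) / len(v) for d, v in grouped.items()}


def recomputed_bw(task):
    """Bandwidth as size / exec time, in GiB/s (ASSUMED unit of the log's bw)."""
    return task["size_bytes"] / task["exec_s"] / 2**30


tasks = parse_tasks(SAMPLE_LOG)
mean_bw = mean_by_direction(tasks, "bw")
mean_exec = mean_by_direction(tasks, "exec_s")
slowdown = mean_bw["D2S"] / mean_bw["S2D"]

print("mean bw per direction:", mean_bw)
print("mean exec time per direction:", mean_exec)
print("D2S / S2D bandwidth ratio:", round(slowdown, 2))
```

On the eight lines above, the recomputed `size / exec` value agrees with the logged `bw` for every task, which supports reading the fourth field as bytes. The mean `S2D` bandwidth comes out at roughly 0.36 GB/s against roughly 1.31 GB/s for `D2S`, so each worker moves the same 2139095040 bytes about 3.7 times more slowly in the `S2D` direction, while `wait` stays around 0.01 to 0.02 s in both cases.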

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`
