3 changes: 3 additions & 0 deletions .pre-commit-config.yaml
@@ -24,7 +24,9 @@ repos:
hooks:
- id: ruff-check
args: [--fix, --exit-non-zero-on-fix]
exclude: ^examples/specdec_bench/specdec_bench/datasets/speed\.py$
- id: ruff-format
exclude: ^examples/specdec_bench/specdec_bench/datasets/speed\.py$

- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.17.1
@@ -93,6 +95,7 @@ repos:
examples/llm_eval/modeling.py|
examples/llm_qat/main.py|
examples/llm_sparsity/weight_sparsity/finetune.py|
examples/specdec_bench/specdec_bench/models/specbench_medusa.py|
examples/speculative_decoding/main.py|
examples/speculative_decoding/medusa_utils.py|
examples/speculative_decoding/server_generate.py|
110 changes: 107 additions & 3 deletions examples/specdec_bench/README.md
@@ -28,17 +28,121 @@ MTBench is available [here](https://huggingface.co/datasets/HuggingFaceH4/mt_ben
Download `nvidia/gpt-oss-120b-Eagle3` to a local directory `/path/to/eagle`.

```bash
python3 run.py \
--model_dir openai/gpt-oss-120b \
--tokenizer openai/gpt-oss-120b \
--draft_model_dir /path/to/eagle \
--mtbench question.jsonl \
--tp_size 1 \
--ep_size 1 \
--draft_length 3 \
--output_length 4096 \
--num_requests 80 \
--engine TRTLLM \
--concurrency 1 \
--postprocess gptoss
```

### Running random IDs on GPT OSS + Eagle3

Download `nvidia/gpt-oss-120b-Eagle3` to a local directory `/path/to/eagle`.

```bash
python3 run.py \
--model_dir openai/gpt-oss-120b \
--tokenizer openai/gpt-oss-120b \
--draft_model_dir /path/to/eagle \
--random_isl 1024 \
--tp_size 1 \
--ep_size 1 \
--draft_length 3 \
--output_length 4096 \
--num_requests 40 \
--engine TRTLLM \
--concurrency 1
```
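The `--random_isl 1024` flag replaces a dataset with synthetic prompts of random token IDs at a fixed input sequence length. A minimal sketch of the idea (the vocabulary size and sampling below are illustrative, not the benchmark's actual implementation):

```python
import random

# Sketch: build one synthetic prompt of --random_isl random token IDs.
# The real benchmark would draw IDs valid for the model's tokenizer;
# vocab_size here is only a placeholder.
vocab_size = 32000
random_isl = 1024  # matches --random_isl above

prompt_ids = [random.randrange(vocab_size) for _ in range(random_isl)]
print(len(prompt_ids))  # 1024
```

Because the prompts carry no meaning, this mode isolates raw engine speed from dataset effects, which is why no `--postprocess` step is needed here.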

### Running [SPEED-Bench](https://huggingface.co/datasets/nvidia/SPEED-Bench) on Llama 3.3 70B + Eagle 3

1. Install the requirements with `pip install -r requirements_speed.txt`.

2. Prepare the data using the provided script:

```bash
python3 prepare_data.py --dataset speed --config all
```

The data will be saved to the `data/` directory, with each config type (qualitative, throughput_1k, ...) in its own subdirectory.
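Before launching a run, it can be worth confirming the splits were prepared. A minimal pre-flight check, assuming the `data/speed/<config>` layout used by the run commands below:

```python
from pathlib import Path

# Split directories the run commands in this README expect.
expected = ["qualitative", "throughput_1k", "throughput_16k"]
missing = [s for s in expected if not (Path("data/speed") / s).is_dir()]
if missing:
    print(f"run prepare_data.py first; missing splits: {missing}")
```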

#### License

GOVERNING TERMS: This dataset is governed by the NVIDIA Evaluation Dataset License Agreement.

ADDITIONAL INFORMATION: MIT for bigcode/humanevalpack, RUCAIBox/MMATH, RUCAIBox/BAMBOO and EQ-Bench. Apache 2.0 for Writing Bench and Spec-Bench. CC BY 4.0 for FBK-MT/MCIF. MIT and Apache 2.0 for tianyang/repobench_python_v1.1, JetBrains-Research/lca-project-level-code-completion and tianyang/repobench_java_v1.1.

NOTICE: For each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose. The `prepare_data.py` script automatically fetches data from all the source datasets.

Additional details are available in the [HuggingFace dataset repository](https://huggingface.co/datasets/nvidia/SPEED-Bench).

#### Qualitative split

```bash
python3 run.py \
--model_dir meta-llama/Llama-3.3-70B-Instruct \
--tokenizer meta-llama/Llama-3.3-70B-Instruct \
--draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--dataset speed \
--dataset_path data/speed/qualitative \
--tp_size 8 \
--ep_size 1 \
--draft_length 3 \
--output_length 4096 \
--engine TRTLLM \
--concurrency 32 \
--show_progress
```

#### Throughput split

```bash
python3 run.py \
--model_dir meta-llama/Llama-3.3-70B-Instruct \
--tokenizer meta-llama/Llama-3.3-70B-Instruct \
--draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--dataset speed \
--dataset_path data/speed/throughput_1k \
--tp_size 8 \
--ep_size 1 \
--draft_length 3 \
--output_length 4096 \
--engine TRTLLM \
--concurrency 32 \
--show_progress
```

For longer contexts (>8192 tokens) with TRTLLM, save the following configuration as `runtime_args_long_context.yaml` and pass it via `--runtime_params`:

```yaml
engine_args:
max_seq_len: 131072 # Model max context length (for Llama 3.3 70B)
enable_chunked_prefill: true
```
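A quick sketch of the arithmetic behind this requirement (the 16k input length and 8192 default window are illustrative):

```python
# Rough per-request sequence budget for the throughput_16k split.
input_len = 16 * 1024   # ~16k prompt tokens (illustrative for this split)
output_len = 4096       # matches --output_length
draft_len = 3           # matches --draft_length

# The engine must hold prompt + generated + in-flight draft tokens.
required = input_len + output_len + draft_len
print(required)          # 20483

assert required > 8192     # exceeds an 8k context window
assert required <= 131072  # but fits within max_seq_len above
```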

```bash
python3 run.py \
--model_dir meta-llama/Llama-3.3-70B-Instruct \
--tokenizer meta-llama/Llama-3.3-70B-Instruct \
--draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--dataset speed \
--dataset_path data/speed/throughput_16k \
--tp_size 8 \
--ep_size 1 \
--draft_length 3 \
--output_length 4096 \
--engine TRTLLM \
--concurrency 32 \
--show_progress \
--runtime_params runtime_args_long_context.yaml
```

## Notes