
LP: vllm benchmarking with quantised models #3207

Open

almayne wants to merge 4 commits into ArmDeveloperEcosystem:main from almayne:vllm_bench_quantised

Conversation


@almayne almayne commented Apr 24, 2026

Before submitting a pull request for a new Learning Path, please review Create a Learning Path

  • I have reviewed Create a Learning Path

Please do not include any confidential information in your contribution. This includes confidential microarchitecture details and unannounced product information.

  • I have checked my contribution for confidential information

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the Creative Commons Attribution 4.0 International License.

Signed-off-by: Anna Mayne <anna.mayne@arm.com>

@fadara01 fadara01 left a comment

Thank you for your work!

I added some initial comments.

```bash
vllm serve \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-num-batched-tokens 16000 \
--max-num-seqs 4 \
```

`--max-num-batched-tokens` is too big and `--max-num-seqs` is too small.
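For illustration, a rebalanced invocation along the lines of this comment might look like the following; the concrete values are assumptions, not the reviewer's recommendation:

```bash
# Illustrative rebalance only: the values are assumptions, not the
# reviewer's numbers. A smaller per-batch token budget and more
# concurrent sequences.
vllm serve \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-num-batched-tokens 4096 \
--max-num-seqs 64
```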

```bash
lm_eval --model vllm --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8,dtype=bfloat16,max_model_len=4096 --tasks gsm8k --batch_size 4 --limit 10
```

We would expect the accuracy to be slightly lower with INT8.

We should expect to see numbers similar to the ones reported for INT8 here: https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8


Repeat with the quantised model.
```bash
lm_eval --model vllm --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8,dtype=bfloat16,max_model_len=4096 --tasks mmlu,gsm8k --batch_size auto --limit 10
```

`--limit 10` is not good enough for a solid accuracy comparison.
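For example, omitting `--limit` evaluates the full task set; a sketch (expect a much longer runtime than the `--limit 10` smoke test):

```bash
# Sketch: evaluate on the full gsm8k test set by omitting --limit.
lm_eval --model vllm \
  --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8,dtype=bfloat16,max_model_len=4096 \
  --tasks gsm8k \
  --batch_size auto
```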


## Set up access to Llama 3.1-8B models

To access the Llama models hosted by Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and the harness can download what it needs. You should create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and log in:
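For reference, that step typically looks like the following (the exact commands in the Learning Path may differ):

```bash
# Install the Hugging Face CLI and authenticate with your access token.
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```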

Is it worth adding an instruction that you should also sign the licence agreement, etc., for the meta-llama model?

Author

Requesting access to the model is covered in the paragraph below. Is there an additional step I've forgotten?

* Accuracy: --limit mmlu=10,gsm8k=500

### Throughput ratios: INT8/BF16
| Requests/s | Total Tokens/s | Output Tokens/s |

Given that we ran a serving benchmark, I think we should report latency here too.
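As a sketch, vLLM's serving benchmark reports latency metrics (TTFT, TPOT, ITL) alongside throughput, assuming a recent vLLM and a server already running:

```bash
# Sketch: benchmark an already-running `vllm serve` endpoint. Alongside
# request and token throughput, this reports TTFT/TPOT/ITL latency
# percentiles that could populate a latency table.
vllm bench serve \
  --model RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 \
  --dataset-name random \
  --num-prompts 100
```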


## Install build dependencies

Install the following packages required for running inference with vLLM on Arm64:
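As an illustration only (the Learning Path's actual list may differ), a typical dependency install on an Ubuntu Arm64 host looks like:

```bash
# Illustrative package set only; the Learning Path's actual list may differ.
sudo apt-get update
sudo apt-get install -y build-essential python3-dev python3-venv libnuma-dev
```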
Author

That's only if we're building vLLM ourselves. We have another learning path that covers that. I have been told to cover only installing vLLM. I can add a link to that part of the other learning path that covers this if you like?

Contributor

@nikhil-arm nikhil-arm left a comment


I think we need to redo the inference and benchmarking pages from scratch. Also, I did not find any mention of Whisper, which was one of the requirements if I understand correctly.

```bash
python -m pip install --upgrade pip
```

## Install vLLM for CPU
Contributor


A source build is not required for this use case, as all the changes are merged. Linking to the docs should be enough, if needed at all.
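If the relevant changes are in a released wheel, a plain pip install may be enough; a sketch, assuming published wheels cover aarch64 CPU for the version you need:

```bash
# Sketch: assumes a released vLLM wheel with aarch64 CPU support.
# If no suitable wheel exists, fall back to the source-build docs.
pip install vllm
```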

```python
scheme = {
    "targets": ["Linear"],
    "weights": {
        "num_bits": 4,
```
Contributor


Shouldn't it be 8-bit weights and w8a8 scripts instead of 4 bits?


This page is about int4 quantization, not a recipe for int8 quantization. I think we should remove this page.


The original request was to cover w8a8 results and link across to the previous int4 LP, so we need a good way to do that.

Author

The quantised model we are using in this LP already uses GPTQModifier, and the linked Hugging Face page has the w8a8 recipe. I'm not sure what the point would be of adding another w8a8 recipe here. It makes more sense to me to mention that we can use GPTQModifier to further quantise to int4 without the accuracy loss we observe using the recipe in the pre-existing LP. Or should we just scrap this page entirely?
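For context, the W8A8 recipe on that model card centres on GPTQModifier; a minimal sketch (import paths, dataset, and calibration settings are assumptions that vary by llmcompressor version; see the model card for the exact recipe):

```python
# Minimal W8A8 sketch with llmcompressor's GPTQModifier. The dataset and
# calibration settings are illustrative assumptions, not the model card's
# exact recipe.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumes gated-model access
    dataset="open_platypus",                   # illustrative calibration set
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```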

```bash
pip install vllm[bench]

vllm bench throughput \
```
Contributor

I think we discussed running the inference via the OpenAI APIs and not using the vllm throughput or latency tooling?
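For reference, exercising the OpenAI-compatible endpoint that `vllm serve` exposes looks like this (default host and port assumed):

```bash
# Query vLLM's OpenAI-compatible chat endpoint; assumes the default
# localhost:8000 and the model name used at serve time.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
```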

Author

@almayne almayne Apr 27, 2026


That's why I switched from `vllm bench throughput` to `serve` in the benchmarking section. This is just to demonstrate that the environment is all set up OK. Let's catch up offline.

@nSircombe

> I think we need to redo the inference and benchmarking pages from scratch. Also, I did not find any mention of Whisper, which was one of the requirements if I understand correctly.

Llama and/or Whisper, I think.
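If Whisper ends up in scope, recent vLLM versions can serve it through the same OpenAI-compatible server; a sketch, assuming a vLLM build with Whisper support (the audio file is a placeholder):

```bash
# Sketch: serve Whisper, then hit the transcription endpoint. Assumes a
# vLLM version with Whisper support; sample.wav is a placeholder file.
vllm serve openai/whisper-large-v3-turbo

curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=openai/whisper-large-v3-turbo
```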

almayne added 3 commits April 27, 2026 08:30
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
Signed-off-by: Anna Mayne <anna.mayne@arm.com>
