---
title: Setup vLLM
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What is vLLM?

[vLLM](https://docs.vllm.ai/en/latest/) is an open-source, high-throughput inference and serving engine for large language models (LLMs). It’s designed to maximise hardware efficiency, making LLM inference faster, more memory-efficient, and scalable.

## Understanding the Llama models

Llama 3.1 8B is an open-weight, text-only LLM with 8 billion parameters that can understand and generate text. You can view the model card at https://huggingface.co/meta-llama/Llama-3.1-8B.

Quantised models have their weights converted to a lower precision data type, which reduces the memory requirements of the model and can improve performance significantly. In the [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path we have covered how to quantise a model yourself. There are also many publicly available quantised versions of popular models, such as https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8.

The notation w8a8 means that the weights have been quantised to 8-bit integers and the activations (the input data) are dynamically quantised to 8-bit integers as well. This allows the kernels to use Arm's 8-bit integer matrix multiply feature, I8MM. You can learn more about this in the [KleidiAI and matrix multiplication](/learning-paths/cross-platform/kleidiai-explainer/) Learning Path.

The RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 model used in this Learning Path only applies quantisation to the weights and activations in the linear layers of the transformer blocks. Activation quantisation is applied per token, and the weights are quantised per channel; that is, each output channel has its own scaling factor mapping between the INT8 and BF16 representations.
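To illustrate what per-channel weight quantisation means, the following minimal sketch quantises a BF16 weight matrix to INT8 with one scale per output channel and dequantises it again. It is illustrative only and is not the recipe used to produce the published model:

```python
import torch

# Illustrative sketch of symmetric per-channel INT8 weight quantisation.
# This is not the exact recipe used for the published w8a8 model.
weights = torch.randn(4096, 4096, dtype=torch.bfloat16)  # [out_channels, in_channels]

# One scale per output channel, chosen so the largest weight maps to 127
scales = weights.abs().amax(dim=1, keepdim=True).float() / 127.0
int8_weights = torch.clamp(torch.round(weights.float() / scales), -128, 127).to(torch.int8)

# Dequantising recovers an approximation of the original BF16 weights
dequantised = (int8_weights.float() * scales).to(torch.bfloat16)
print((weights.float() - dequantised.float()).abs().max())
```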

## Set up your environment

Before you begin, make sure your environment meets these requirements:

- Python 3.12 on Ubuntu 22.04 LTS or newer
- At least 32 vCPUs, 96 GB RAM, and 64 GB of free disk space

This Learning Path was tested on an AWS Graviton4 c8g.12xlarge instance with 200 GB of attached storage.

## Install build dependencies

Install the following packages required for running inference with vLLM on Arm64:
```bash
sudo apt-get update -y
sudo apt install -y python3.12-venv python3.12-dev
```

Now install tcmalloc, a fast memory allocator from Google’s gperftools, which improves performance under high concurrency:
```bash
sudo apt-get install -y libtcmalloc-minimal4
```
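Installing the package on its own does not change the allocator used by Python; a common approach is to preload the library when you launch vLLM. The library path below is an assumption for Ubuntu on aarch64 — verify it on your system before exporting:

```bash
# Check the actual path on your system first, for example with:
#   dpkg -L libtcmalloc-minimal4 | grep libtcmalloc
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
```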

## Create and activate a Python virtual environment

It’s best practice to install vLLM inside an isolated environment to prevent conflicts between system and project dependencies:
```bash
python3.12 -m venv vllm_env
source vllm_env/bin/activate
python -m pip install --upgrade pip
```

## Install vLLM for CPU

Install a recent CPU-specific build of vLLM:
```bash
export VLLM_VERSION=0.19.1
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_aarch64.whl
```
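You can optionally confirm that the wheel installed correctly by importing vLLM and printing its version:

```bash
python -c "import vllm; print(vllm.__version__)"
```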

If you wish to build vLLM from source you can follow the instructions in the [Build and Run vLLM on Arm Servers Learning Path](/learning-paths/servers-and-cloud-computing/vllm/vllm-setup/).

Your environment is now set up to run inference with vLLM. Next, you'll use vLLM to run inference on both quantised and non-quantised Llama models.
---
title: Run inference with vLLM
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Set up access to Llama 3.1-8B models

To access the Llama models hosted by Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and download the models you need. You should create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and log in:
```bash
curl -LsSf https://hf.co/cli/install.sh | bash
hf auth login
```

Paste your access token into the terminal when prompted. To access Llama3.1-8B you need to request access on the Hugging Face website. Visit https://huggingface.co/meta-llama/Llama-3.1-8B and select "Expand to review and access". Complete the form and you should be granted access in a matter of minutes.
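You can optionally confirm that the CLI is authenticated before continuing; the `hf auth whoami` subcommand assumes a recent version of the Hugging Face CLI:

```bash
hf auth whoami
```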

Now you can check that you are able to run inference on the non-quantised Llama model.

## Run inference on Llama 3.1-8B

We will use the vLLM bench CLI to measure the throughput of our models later on. Install the required library and use a limited number of prompts to validate your environment. The first run will be slower because the model weights are downloaded.
```bash
pip install vllm[bench]
vllm bench throughput \
--num-prompts 10 \
--dataset-name random \
--model meta-llama/Llama-3.1-8B
```

This reports the number of requests per second, the total number of tokens processed per second (prompt plus output), and the number of output tokens generated per second.

You can do the same for the quantised model:
```bash
vllm bench throughput \
--num-prompts 10 \
--dataset-name random \
--model RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8
```

You now have the quantised and non-quantised Llama models on your local machine. You have installed vLLM and demonstrated that you can run inference on both models. Now you can move on to benchmarking these models and comparing their performance.
---
title: Evaluate Llama3.1-8B throughput and accuracy
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Llama performance benchmarking

We will use the vLLM bench CLI to measure the throughput of our models. First, start the server and keep it running in the background:
```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--max-num-batched-tokens 16000 \
--max-num-seqs 4 \
--data-parallel-size 1 \
--max-model-len 2048 &
```

vLLM uses dynamic continuous batching to maximise hardware utilisation. Three key parameters govern this process:

* `max-model-len` is the maximum sequence length (number of tokens per request). No single prompt or generated sequence can exceed this limit.
* `max-num-batched-tokens` is the total number of tokens processed in one batch across all requests. The sum of input and output tokens from all concurrent requests must stay within this limit.
* `max-num-seqs` is the maximum number of requests the scheduler can place in one batch iteration.
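Once the server logs show it is ready, you can optionally send a single request to its OpenAI-compatible completions endpoint to confirm it is serving. The port (8000) is the vLLM default and the model name matches the serve command above:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "The Arm architecture is",
    "max_tokens": 32
  }'
```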

Now that the server is running, you can benchmark it using the public ShareGPT dataset.
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

vllm bench serve \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 128 \
--request-rate 8 \
--max-concurrency 8 \
--percentile-metrics ttft,tpot \
--metric-percentiles 50,95,99 \
--save-result --result-dir bench_out --result-filename serve.json

```
The most interesting results are request throughput, output token throughput, total token throughput, TTFT (time to first token), and TPOT (time per output token).

Repeat with the quantised model. Stop the running BF16 server first (for example, bring it to the foreground with `fg` and stop it with Ctrl+C), then start the server again and rerun the benchmark. You should see a significant improvement in the throughput results (increased tokens/s).
```bash
vllm serve RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 \
--max-num-batched-tokens 16000 \
--max-num-seqs 4 \
--data-parallel-size 1 \
--max-model-len 2048 &

vllm bench serve \
--model RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 \
--dataset-name sharegpt \
--dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 128 \
--request-rate 8 \
--max-concurrency 8 \
--percentile-metrics ttft,tpot \
--metric-percentiles 50,95,99 \
--save-result --result-dir bench_out --result-filename serve_int8.json
```

## Llama accuracy benchmarking

The lm-evaluation-harness is the standard way to measure model accuracy across common academic benchmarks (for example MMLU, HellaSwag, GSM8K) and runtimes (such as Hugging Face, vLLM, and llama.cpp). In this section, you’ll run accuracy tests for both BF16 and INT8 deployments of your Llama models served by vLLM on Arm-based servers.

You will:
- Install the lm-eval harness with vLLM support
- Run benchmarks on a BF16 model and an INT8 (weight- and activation-quantised) model
- Interpret key metrics and compare quality across precisions

First install the required libraries for benchmarking with lm_eval.
```bash
pip install ray lm_eval[vllm]
```

Then use a limited number of prompts to validate your environment. This will be slower the first time through as you will download the test data associated with your selected task:
```bash
lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16,max_model_len=4096 --tasks mmlu --batch_size auto --limit 10

lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16,max_model_len=4096 --tasks gsm8k --batch_size 4 --limit 10
```

The [MMLU task](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu) is a set of multiple-choice questions split into subject subgroups (STEM, humanities, social sciences, and other). It allows you to measure the ability of an LLM to understand questions and select the right answers.

The [GSM8k task](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/gsm8k) is a set of math problems that test an LLM's mathematical reasoning ability.

Repeat with the quantised model.
```bash
lm_eval --model vllm --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8,dtype=bfloat16,max_model_len=4096 --tasks mmlu --batch_size auto --limit 10
lm_eval --model vllm --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8,dtype=bfloat16,max_model_len=4096 --tasks gsm8k --batch_size 4 --limit 10
```

You should expect the measured accuracy to be slightly lower with INT8 than with BF16, and broadly in line with the results reported on the [model card](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8).
## Summary of results

The benchmarking results you generate will depend on the hardware you are using. The values below were measured on an AWS Graviton4 c8g.12xlarge instance and are provided as an example only. We've applied limits to the number of samples used to make the results easily reproducible. A proper accuracy benchmark should be run over the whole dataset, though this can be time-consuming. Using the INT8 quantised Llama 3.1-8B model, we observe throughput improvements of up to ~3x at a cost of up to ~7% in accuracy.
Llama benchmark config:
* Throughput: `--num-prompts 128`
* Accuracy: `--limit 10` for MMLU, `--limit 500` for GSM8K

### Throughput ratios: INT8/BF16
| Requests/s | Total Tokens/s | Output Tokens/s |
| -------- | -------- | -------- |
| 3.17x | 2.30x | 1.44x |

### Accuracy delta: (BF16-INT8)/BF16
| MMLU | GSM8k |
| -------- | -------- |
| 3% | 6-7% |
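If you saved both serving benchmark results (for example `bench_out/serve.json` for BF16 and `bench_out/serve_int8.json` for INT8, as in the commands above), a short script like the sketch below can compute these ratios. The JSON field names are assumptions about the vllm bench serve output format and may need adjusting for your vLLM version:

```python
import json

# Field names below are assumptions; inspect your result files and adjust if needed.
with open("bench_out/serve.json") as f:
    bf16 = json.load(f)
with open("bench_out/serve_int8.json") as f:
    int8 = json.load(f)

for key in ("request_throughput", "total_token_throughput", "output_throughput"):
    if key in bf16 and key in int8:
        print(f"{key}: {int8[key] / bf16[key]:.2f}x (INT8/BF16)")

# Accuracy delta, entered manually from your lm_eval runs (placeholder values shown)
bf16_mmlu, int8_mmlu = 0.66, 0.64
print(f"MMLU delta: {(bf16_mmlu - int8_mmlu) / bf16_mmlu:.1%} ((BF16-INT8)/BF16)")
```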
---
title: Quantisation techniques
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Further quantisation

We have used a publicly available w8a8 quantised model to improve performance with a small decrease in accuracy. How to quantise a model to even lower precision (INT4) is covered in the [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path. Further quantisation of the model incurs additional accuracy losses due to the loss in precision, but other quantisation techniques can reduce this accuracy loss.

The [quantisation recipe](/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model/) in that Learning Path uses QuantizationModifier to quantise the model weights. The [w8a8 quantised model](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8#creation) you've been using in this Learning Path was instead quantised with GPTQModifier, which uses a calibration dataset to quantise the model weights. We have found that GPTQModifier produces smaller accuracy degradations than QuantizationModifier, and we recommend a recipe like the one below for INT4 quantisation.

You will need to install the required packages before running the quantisation script.
```bash
pip install compressed-tensors==0.14.0.1
pip install llmcompressor==0.10.0.1
pip install datasets==4.6.0

python w4a8_quant.py
```

Where w4a8_quant.py contains:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization import QuantizationType, QuantizationStrategy

model_id = "meta-llama/Meta-Llama-3.1-8B"

num_samples = 256
max_seq_len = 4096

tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(example):
    return {"text": example["text"]}

# Calibration data used by GPTQ when choosing the quantised weights
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

# INT4 per-channel weights, INT8 dynamic per-token activations (w4a8)
scheme = {
    "targets": ["Linear"],
    "weights": {
        "num_bits": 4,
        "type": QuantizationType.INT,
        "strategy": QuantizationStrategy.CHANNEL,
        "symmetric": True,
        "dynamic": False,
        "group_size": None,
    },
    "input_activations": {
        "num_bits": 8,
        "type": QuantizationType.INT,
        "strategy": QuantizationStrategy.TOKEN,
        "dynamic": True,
        "symmetric": False,
        "observer": None,
    },
    "output_activations": None,
}

recipe = GPTQModifier(
    targets="Linear",
    config_groups={"group_0": scheme},
    ignore=["lm_head"],
    dampening_frac=0.01,
    block_size=512,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

model.save_pretrained("Meta-Llama-3.1-8B-quantized.w4a8")
# Save the tokenizer alongside the model so the output directory is self-contained
tokenizer.save_pretrained("Meta-Llama-3.1-8B-quantized.w4a8")
```
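If your vLLM build supports serving INT4 compressed-tensors checkpoints on your hardware, you should then be able to point vLLM at the saved directory in the same way as the published quantised model; the path below assumes the default output location from the save_pretrained call above:

```bash
vllm serve ./Meta-Llama-3.1-8B-quantized.w4a8 --max-model-len 2048
```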

## Next steps

Now that you have your environment set up for benchmarking and quantising different models, you can experiment with:
- Longer benchmarking runs
- Benchmarking accuracy with different tasks, such as HellaSwag and Winogrande
- Different quantisation techniques
- Different models

Your results will allow you to balance accuracy and performance when making decisions about model deployment.
---
title: Run vLLM inference with quantised models and benchmark on Arm servers

minutes_to_complete: 60

who_is_this_for: This is an introductory topic for developers interested in running inference on quantised models. This Learning Path shows you how to run inference on Llama 3.1-8B, with and without quantisation, and benchmark model performance and accuracy with vLLM's bench CLI and the LM Evaluation Harness.

learning_objectives:
- Install a recent release of vLLM
- Run both quantised and non-quantised variants of Llama3.1-8B using vLLM
- Evaluate and compare model performance and accuracy using vLLM's bench CLI and the LM Evaluation Harness

prerequisites:
- An Arm-based Linux server (Ubuntu 22.04+ recommended) with a minimum of 32 vCPUs, 64 GB RAM, and 64 GB free disk space
- Python 3.12 and basic familiarity with Hugging Face Transformers and quantisation

author: Anna Mayne

### Tags
skilllevels: Introductory
subjects: ML
armips:
- Neoverse
tools_software_languages:
- vLLM
- LM Evaluation Harness
- LLM
- Generative AI
- Python
- PyTorch
- Hugging Face
operatingsystems:
- Linux



further_reading:
- resource:
title: vLLM Documentation
link: https://docs.vllm.ai/
type: documentation
- resource:
title: vLLM GitHub Repository
link: https://github.com/vllm-project/vllm
type: website
- resource:
title: Hugging Face Model Hub
link: https://huggingface.co/models
type: website
- resource:
title: Build and Run vLLM on Arm Servers
link: /learning-paths/servers-and-cloud-computing/vllm/
type: website
- resource:
title: LM Evaluation Harness (GitHub)
link: https://github.com/EleutherAI/lm-evaluation-harness
type: website



### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---