---
title: Setup vLLM
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What is vLLM?

[vLLM](https://docs.vllm.ai/en/latest/) is an open-source, high-throughput inference and serving engine for large language models (LLMs). It’s designed to maximise hardware efficiency, making LLM inference faster, more memory-efficient, and scalable.

## Understanding the Llama models

Llama 3.1 8B is an open-weight, text-only LLM with 8 billion parameters that can understand and generate text. You can view the model card at https://huggingface.co/meta-llama/Llama-3.1-8B.

Quantised models have their weights converted to a lower precision data type, which reduces the memory requirements of the model and can improve performance significantly. In the [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path we have covered how to quantise a model yourself. There are also many publicly available quantised versions of popular models, such as https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8.

The notation w8a8 means that the weights have been quantised to 8-bit integers and the activations (the input data) are dynamically quantised to 8-bit integers as well. This allows the kernels to use Arm's 8-bit integer matrix multiply feature, I8MM. You can learn more about this in the [KleidiAI and matrix multiplication](/learning-paths/cross-platform/kleidiai-explainer/) Learning Path.

The RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 model used in this Learning Path only applies quantisation to the weights and activations in the linear layers of the transformer blocks. Activation quantisation is applied per token, and the weights are quantised per channel; that is, each output channel has its own scaling factor mapping between the INT8 and BF16 representations.
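To illustrate what per-channel weight quantisation means, the following minimal sketch quantises a BF16 weight matrix to INT8 with one scale per output channel and dequantises it again. It is illustrative only and is not the recipe used to produce the published model:

```python
import torch

# Illustrative sketch of symmetric per-channel INT8 weight quantisation.
# This is not the exact recipe used for the published w8a8 model.
weights = torch.randn(4096, 4096, dtype=torch.bfloat16)  # [out_channels, in_channels]

# One scale per output channel, chosen so the largest weight maps to 127
scales = weights.abs().amax(dim=1, keepdim=True).float() / 127.0
int8_weights = torch.clamp(torch.round(weights.float() / scales), -128, 127).to(torch.int8)

# Dequantising recovers an approximation of the original BF16 weights
dequantised = (int8_weights.float() * scales).to(torch.bfloat16)
print((weights.float() - dequantised.float()).abs().max())
```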

## Set up your environment

Before you begin, make sure your environment meets these requirements:

- Python 3.12 on Ubuntu 22.04 LTS or newer
- At least 32 vCPUs, 96 GB RAM, and 64 GB of free disk space

This Learning Path was tested on an AWS Graviton4 c8g.12xlarge instance with 200 GB of attached storage.

## Install build dependencies

Install the following packages required for running inference with vLLM on Arm64:
```bash
sudo apt-get update -y
sudo apt install -y python3.12-venv python3.12-dev
```

Now install tcmalloc, a fast memory allocator from Google’s gperftools, which improves performance under high concurrency:
```bash
sudo apt-get install -y libtcmalloc-minimal4
```
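Installing the package on its own does not change the allocator used by Python; a common approach is to preload the library when you launch vLLM. The library path below is an assumption for Ubuntu on aarch64 — verify it on your system before exporting:

```bash
# Check the actual path on your system first, for example with:
#   dpkg -L libtcmalloc-minimal4 | grep libtcmalloc
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
```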

## Create and activate a Python virtual environment

It’s best practice to install vLLM inside an isolated environment to prevent conflicts between system and project dependencies:
```bash
python3.12 -m venv vllm_env
source vllm_env/bin/activate
python -m pip install --upgrade pip
```

## Install vLLM for CPU

Install a recent CPU-specific build of vLLM:
```bash
export VLLM_VERSION=0.19.1
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_aarch64.whl
```
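You can optionally confirm that the wheel installed correctly by importing vLLM and printing its version:

```bash
python -c "import vllm; print(vllm.__version__)"
```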

If you wish to build vLLM from source you can follow the instructions in the [Build and Run vLLM on Arm Servers Learning Path](/learning-paths/servers-and-cloud-computing/vllm/vllm-setup/).

Your environment is now set up to run inference with vLLM. Next, you'll use vLLM to run inference on both quantised and non-quantised Llama models.
---
title: Run inference with vLLM
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Set up access to Llama 3.1-8B models

To access the Llama models hosted by Hugging Face, you will need to install the Hugging Face CLI so that you can authenticate yourself and download the models you need. You should create an account on https://huggingface.co/ and follow the instructions [in the Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and log in:
```bash
curl -LsSf https://hf.co/cli/install.sh | bash
hf auth login
```

Paste your access token into the terminal when prompted. To access Llama3.1-8B you need to request access on the Hugging Face website. Visit https://huggingface.co/meta-llama/Llama-3.1-8B and select "Expand to review and access". Complete the form and you should be granted access in a matter of minutes.
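You can optionally confirm that the CLI is authenticated before continuing; the `hf auth whoami` subcommand assumes a recent version of the Hugging Face CLI:

```bash
hf auth whoami
```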

Now you can check that you are able to run inference on the non-quantised Llama model.

## Run inference on Llama 3.1-8B

We will use the vLLM bench CLI to measure the throughput of our models later on. Install the required library and use a limited number of prompts to validate your environment. The first run will be slower because the model weights are downloaded.
```bash
pip install vllm[bench]
vllm bench throughput \
--num-prompts 10 \
--dataset-name random \
--model meta-llama/Llama-3.1-8B
```

This reports the number of requests per second, the total number of tokens processed per second (prompt plus output), and the number of output tokens generated per second.

You can do the same for the quantised model:
```bash
vllm bench throughput \
--num-prompts 10 \
--dataset-name random \
--model RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8
```

You now have the quantised and non-quantised Llama models on your local machine. You have installed vLLM and demonstrated that you can run inference on both models. Now you can move on to benchmarking these models and comparing their performance.
---
title: Evaluate Llama3.1-8B throughput and accuracy
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Llama performance benchmarking

We will use the vLLM bench CLI to measure the throughput of our models. First, start the server and keep it running in the background:
```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--max-num-batched-tokens 16000 \
--max-num-seqs 4 \
--data-parallel-size 1 \
--max-model-len 2048 &
```

vLLM uses dynamic continuous batching to maximise hardware utilisation. Three key parameters govern this process:

* `max-model-len` is the maximum sequence length (number of tokens per request). No single prompt or generated sequence can exceed this limit.
* `max-num-batched-tokens` is the total number of tokens processed in one batch across all requests. The sum of input and output tokens from all concurrent requests must stay within this limit.
* `max-num-seqs` is the maximum number of requests the scheduler can place in one batch iteration.
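Once the server logs show it is ready, you can optionally send a single request to its OpenAI-compatible completions endpoint to confirm it is serving. The port (8000) is the vLLM default and the model name matches the serve command above:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "The Arm architecture is",
    "max_tokens": 32
  }'
```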

Now that the server is running, you can benchmark it using the public ShareGPT dataset.
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

vllm bench serve \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 128 \
--request-rate 8 \
--max-concurrency 8 \
--percentile-metrics ttft,tpot \
--metric-percentiles 50,95,99 \
--save-result --result-dir bench_out --result-filename serve.json

```
The most interesting results are request throughput, output token throughput, total token throughput, TTFT (time to first token), and TPOT (time per output token).

Repeat with the quantised model. Stop the running BF16 server first (for example, bring it to the foreground with `fg` and stop it with Ctrl+C), then start the server again and rerun the benchmark. You should see a significant improvement in the throughput results (increased tokens/s).
```bash
vllm serve RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 \
--max-num-batched-tokens 16000 \
--max-num-seqs 4 \
--data-parallel-size 1 \
--max-model-len 2048 &

vllm bench serve \
--model RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 \
--dataset-name sharegpt \
--dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 128 \
--request-rate 8 \
--max-concurrency 8 \
--percentile-metrics ttft,tpot \
--metric-percentiles 50,95,99 \
--save-result --result-dir bench_out --result-filename serve_int8.json
```

## Llama accuracy benchmarking

The lm-evaluation-harness is the standard way to measure model accuracy across common academic benchmarks (for example MMLU, HellaSwag, GSM8K) and runtimes (such as Hugging Face, vLLM, and llama.cpp). In this section, you’ll run accuracy tests for both BF16 and INT8 deployments of your Llama models served by vLLM on Arm-based servers.

You will:
- Install the lm-eval harness with vLLM support
- Run benchmarks on a BF16 model and an INT8 (weight- and activation-quantised) model
- Interpret key metrics and compare quality across precisions

First install the required libraries for benchmarking with lm_eval.
```bash
pip install ray lm_eval[vllm]
```

Then use a limited number of prompts to validate your environment. This will be slower the first time through as you will download the test data associated with your selected task:
```bash
lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16,max_model_len=4096 --tasks mmlu --batch_size auto --limit 10

lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16,max_model_len=4096 --tasks gsm8k --batch_size 4 --limit 10
```

The [MMLU task](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu) is a set of multiple-choice questions split into subject subgroups (STEM, humanities, social sciences, and other). It allows you to measure the ability of an LLM to understand questions and select the right answers.

The [GSM8k task](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/gsm8k) is a set of math problems that test an LLM's mathematical reasoning ability.

Repeat with the quantised model.
```bash
lm_eval --model vllm --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8,dtype=bfloat16,max_model_len=4096 --tasks mmlu --batch_size auto --limit 10
lm_eval --model vllm --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8,dtype=bfloat16,max_model_len=4096 --tasks gsm8k --batch_size 4 --limit 10
```

You should expect the measured accuracy to be slightly lower with INT8 than with BF16, and broadly in line with the results reported on the [model card](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8).
## Summary of results

The benchmarking results you generate will depend on the hardware you are using. The values below were measured on an AWS Graviton4 c8g.12xlarge instance and are provided as an example only. We've applied limits to the number of samples used to make the results easily reproducible. A proper accuracy benchmark should be run over the whole dataset, though this can be time-consuming. Using the INT8 quantised Llama 3.1-8B model, we observe throughput improvements of up to ~3x at a cost of up to ~7% in accuracy.
Llama benchmark config:
* Throughput: `--num-prompts 128`
* Accuracy: `--limit 10` for MMLU, `--limit 500` for GSM8K

### Throughput ratios: INT8/BF16
| Requests/s | Total Tokens/s | Output Tokens/s |
| -------- | -------- | -------- |
| 3.17x | 2.30x | 1.44x |

### Accuracy delta: (BF16-INT8)/BF16
| MMLU | GSM8k |
| -------- | -------- |
| 3% | 6-7% |
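If you saved both serving benchmark results (for example `bench_out/serve.json` for BF16 and `bench_out/serve_int8.json` for INT8, as in the commands above), a short script like the sketch below can compute these ratios. The JSON field names are assumptions about the vllm bench serve output format and may need adjusting for your vLLM version:

```python
import json

# Field names below are assumptions; inspect your result files and adjust if needed.
with open("bench_out/serve.json") as f:
    bf16 = json.load(f)
with open("bench_out/serve_int8.json") as f:
    int8 = json.load(f)

for key in ("request_throughput", "total_token_throughput", "output_throughput"):
    if key in bf16 and key in int8:
        print(f"{key}: {int8[key] / bf16[key]:.2f}x (INT8/BF16)")

# Accuracy delta, entered manually from your lm_eval runs (placeholder values shown)
bf16_mmlu, int8_mmlu = 0.66, 0.64
print(f"MMLU delta: {(bf16_mmlu - int8_mmlu) / bf16_mmlu:.1%} ((BF16-INT8)/BF16)")
```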
---
title: Quantisation techniques
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Further quantisation

We have used a publicly available w8a8 quantised model to improve performance with a small decrease in accuracy. How to quantise a model to even lower precision (INT4) is covered in the [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path. Further quantisation of the model incurs additional accuracy losses due to the loss in precision, but other quantisation techniques can reduce this accuracy loss.

The [quantisation recipe](/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model/) in that Learning Path uses QuantizationModifier to quantise the model weights. The [w8a8 quantised model](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8#creation) you've been using in this Learning Path was instead quantised with GPTQModifier, which uses a calibration dataset to quantise the model weights. We have found that GPTQModifier produces smaller accuracy degradations than QuantizationModifier, and we recommend a recipe like the one below for INT4 quantisation.

You will need to install the required packages before running the quantisation script.
```bash
pip install compressed-tensors==0.14.0.1
pip install llmcompressor==0.10.0.1
pip install datasets==4.6.0

python w4a8_quant.py
```

Where w4a8_quant.py contains:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization import QuantizationType, QuantizationStrategy

model_id = "meta-llama/Meta-Llama-3.1-8B"

num_samples = 256
max_seq_len = 4096

tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(example):
    return {"text": example["text"]}

# Calibration data used by GPTQ when choosing the quantised weights
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

# INT4 per-channel weights, INT8 dynamic per-token activations (w4a8)
scheme = {
    "targets": ["Linear"],
    "weights": {
        "num_bits": 4,
        "type": QuantizationType.INT,
        "strategy": QuantizationStrategy.CHANNEL,
        "symmetric": True,
        "dynamic": False,
        "group_size": None,
    },
    "input_activations": {
        "num_bits": 8,
        "type": QuantizationType.INT,
        "strategy": QuantizationStrategy.TOKEN,
        "dynamic": True,
        "symmetric": False,
        "observer": None,
    },
    "output_activations": None,
}

recipe = GPTQModifier(
    targets="Linear",
    config_groups={"group_0": scheme},
    ignore=["lm_head"],
    dampening_frac=0.01,
    block_size=512,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

model.save_pretrained("Meta-Llama-3.1-8B-quantized.w4a8")
# Save the tokenizer alongside the model so the output directory is self-contained
tokenizer.save_pretrained("Meta-Llama-3.1-8B-quantized.w4a8")
```
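If your vLLM build supports serving INT4 compressed-tensors checkpoints on your hardware, you should then be able to point vLLM at the saved directory in the same way as the published quantised model; the path below assumes the default output location from the save_pretrained call above:

```bash
vllm serve ./Meta-Llama-3.1-8B-quantized.w4a8 --max-model-len 2048
```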

## Next steps

Now that you have your environment set up for benchmarking and quantising different models, you can experiment with:
- Longer benchmarking runs
- Benchmarking accuracy with different tasks, such as HellaSwag and Winogrande
- Different quantisation techniques
- Different models

Your results will allow you to balance accuracy and performance when making decisions about model deployment.
---
title: Run vLLM inference with quantised models and benchmark on Arm servers

minutes_to_complete: 60

who_is_this_for: This is an introductory topic for developers interested in running inference on quantised models. This Learning Path shows you how to run inference on Llama 3.1-8B, with and without quantisation, and benchmark model performance and accuracy with vLLM's bench CLI and the LM Evaluation Harness.

learning_objectives:
- Install a recent release of vLLM
- Run both quantised and non-quantised variants of Llama3.1-8B using vLLM
- Evaluate and compare model performance and accuracy using vLLM's bench CLI and the LM Evaluation Harness

prerequisites:
- An Arm-based Linux server (Ubuntu 22.04+ recommended) with a minimum of 32 vCPUs, 64 GB RAM, and 64 GB free disk space
- Python 3.12 and basic familiarity with Hugging Face Transformers and quantisation

author: Anna Mayne

### Tags
skilllevels: Introductory
subjects: ML
armips:
- Neoverse
tools_software_languages:
- vLLM
- LM Evaluation Harness
- LLM
- Generative AI
- Python
- PyTorch
- Hugging Face
operatingsystems:
- Linux



further_reading:
- resource:
title: vLLM Documentation
link: https://docs.vllm.ai/
type: documentation
- resource:
title: vLLM GitHub Repository
link: https://github.com/vllm-project/vllm
type: website
- resource:
title: Hugging Face Model Hub
link: https://huggingface.co/models
type: website
- resource:
title: Build and Run vLLM on Arm Servers
link: /learning-paths/servers-and-cloud-computing/vllm/
type: website
- resource:
title: LM Evaluation Harness (GitHub)
link: https://github.com/EleutherAI/lm-evaluation-harness
type: website



### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---