LP: vllm benchmarking with quantised models #3207
---
title: Set up vLLM
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What is vLLM?

[vLLM](https://docs.vllm.ai/en/latest/) is an open-source, high-throughput inference and serving engine for large language models (LLMs). It is designed to maximise hardware efficiency, making LLM inference faster, more memory-efficient, and scalable.

## Understanding the Llama models

Llama 3.1 8B is an open-weight, text-only LLM with 8 billion parameters that can understand and generate text. You can view the model card at https://huggingface.co/meta-llama/Llama-3.1-8B.

Quantised models have their weights converted to a lower-precision data type, which reduces the memory requirements of the model and can improve performance significantly. The [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path covers how to quantise a model yourself. There are also many publicly available quantised versions of popular models, such as https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8.

The notation w8a8 means that the weights have been quantised to 8-bit integers and the activations (the input data) are dynamically quantised to the same precision. This allows the kernels to use Arm's 8-bit integer matrix multiply (I8MM) feature. You can learn more about this in the [KleidiAI and matrix multiplication](/learning-paths/cross-platform/kleidiai-explainer/) Learning Path.

The RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 model used in this Learning Path only applies quantisation to the weights and activations in the linear layers of the transformer blocks. Activation quantisation is applied per-token and weight quantisation per-channel; that is, each output channel has its own scaling factor mapping between the INT8 and BF16 representations.
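To make the scaling concrete, here is a small illustrative sketch in plain Python of how symmetric per-channel weight quantisation and dynamic per-token activation quantisation combine in an INT8 dot product. The helper function and all numbers are invented for illustration; this is not vLLM or KleidiAI code.

```python
# Illustrative sketch of the w8a8 scheme (hypothetical helper, not vLLM
# or KleidiAI code; all numbers are invented).

def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers with a single scale factor."""
    qmax = 2 ** (num_bits - 1) - 1                  # 127 for INT8
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

# Weights: one scale per output channel, computed offline (per-channel)
weights = [0.12, -0.53, 0.31, -0.08]
qw, w_scale = quantize_symmetric(weights)

# Activations: one scale per token, computed at run time (per-token, dynamic)
activations = [1.7, -0.4, 2.2, 0.9]
qa, a_scale = quantize_symmetric(activations)

# The dot product runs entirely in integers (this is the part I8MM
# accelerates), then a single rescale recovers an approximate float result.
int_dot = sum(w * a for w, a in zip(qw, qa))
approx = int_dot * w_scale * a_scale
exact = sum(w * a for w, a in zip(weights, activations))
print(f"approx={approx:.4f} exact={exact:.4f}")
```

The integer result stays close to the floating-point result because each scale factor is chosen to cover the full range of its channel or token.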
## Set up your environment

Before you begin, make sure your environment meets these requirements:

- Python 3.12 on Ubuntu 22.04 LTS or newer
- At least 32 vCPUs, 96 GB RAM, and 64 GB of free disk space

This Learning Path was tested on an AWS Graviton4 c8g.12xlarge instance with 200 GB of attached storage.

## Install build dependencies

Install the following packages required for running inference with vLLM on Arm64:

```bash
sudo apt-get update -y
sudo apt-get install -y python3.12-venv python3.12-dev
```

Now install tcmalloc, a fast memory allocator from Google's gperftools, which improves performance under high concurrency:

```bash
sudo apt-get install -y libtcmalloc-minimal4
```
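To make sure vLLM picks up tcmalloc, you can preload it before starting any vLLM process. The library path below is the usual location on Ubuntu for Arm64, but it may differ on your system; verify it first:

```shell
# Confirm where the distribution installed tcmalloc (path can vary):
#   ldconfig -p | grep tcmalloc
# Then preload it so allocations in vLLM are served by tcmalloc.
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
echo "LD_PRELOAD=${LD_PRELOAD}"
```

If the path differs on your system, substitute the one reported by `ldconfig`.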
## Create and activate a Python virtual environment

It is best practice to install vLLM inside an isolated environment to prevent conflicts between system and project dependencies:

```bash
python3.12 -m venv vllm_env
source vllm_env/bin/activate
python -m pip install --upgrade pip
```

## Install vLLM for CPU

Install a recent CPU-specific build of vLLM:

```bash
export VLLM_VERSION=0.19.1
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_aarch64.whl
```

If you wish to build vLLM from source, you can follow the instructions in the [Build and Run vLLM on Arm Servers Learning Path](/learning-paths/servers-and-cloud-computing/vllm/vllm-setup/).

Your environment is now set up to run inference with vLLM. Next, you'll use vLLM to run inference on both quantised and non-quantised Llama models.
---
title: Run inference with vLLM
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Set up access to the Llama 3.1 8B models

To access the Llama models hosted on Hugging Face, you need the Hugging Face CLI so that you can authenticate and download the required files. Create an account at https://huggingface.co/ and follow the instructions in the [Hugging Face CLI guide](https://huggingface.co/docs/huggingface_hub/en/guides/cli) to set up your access token. You can then install the CLI and log in:

```bash
curl -LsSf https://hf.co/cli/install.sh | bash
hf auth login
```

Paste your access token into the terminal when prompted. To access Llama 3.1 8B you also need to request access on the Hugging Face website. Visit https://huggingface.co/meta-llama/Llama-3.1-8B and select "Expand to review and access". Complete the form and you should be granted access within minutes.

Now check that you can run inference on the non-quantised Llama model.

## Run inference on Llama 3.1 8B

You will use the vLLM bench CLI to measure the throughput of your models later on. Install the required library and use a limited number of prompts to validate your environment. The first run will be slower while the models download:

```bash
pip install vllm[bench]

vllm bench throughput \
    --num-prompts 10 \
    --dataset-name random \
    --model meta-llama/Llama-3.1-8B
```

This reports the number of requests per second, the total number of tokens processed per second, and the number of output tokens generated per second.
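As a quick illustration of how these three numbers relate, each is the corresponding raw count divided by the elapsed time. The counts below are invented for illustration, not real benchmark output:

```python
# How the three throughput numbers relate (invented raw counts, not
# real output from vllm bench throughput).
elapsed_s = 40.0
num_requests = 10
total_tokens = 16000     # prompt + generated tokens across all requests
output_tokens = 4000     # generated tokens only

requests_per_s = num_requests / elapsed_s
total_tokens_per_s = total_tokens / elapsed_s
output_tokens_per_s = output_tokens / elapsed_s
print(f"{requests_per_s:.2f} requests/s, {total_tokens_per_s:.0f} total tok/s, "
      f"{output_tokens_per_s:.0f} output tok/s")
```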
You can do the same for the quantised model:

```bash
vllm bench throughput \
    --num-prompts 10 \
    --dataset-name random \
    --model RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8
```

You now have the quantised and non-quantised Llama models on your local machine. You have installed vLLM and demonstrated that you can run inference on both models. Next, you'll benchmark these models and compare their performance.
---
title: Evaluate Llama 3.1 8B throughput and accuracy
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Llama performance benchmarking

You will use the vLLM bench CLI to measure the throughput of your models. First, start the server and keep it running:

```bash
vllm serve \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-num-batched-tokens 16000 \
    --max-num-seqs 4 \
    --data-parallel-size 1 \
    --max-model-len 2048 &
```

vLLM uses dynamic continuous batching to maximise hardware utilisation. Three key parameters govern this process:

* max-model-len is the maximum sequence length (number of tokens per request). No single prompt or generated sequence can exceed this limit.
* max-num-batched-tokens is the total number of tokens processed in one batch across all requests. The sum of input and output tokens from all concurrent requests must stay within this limit.
* max-num-seqs is the maximum number of requests the scheduler can place in one iteration.
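As a rough illustration of how these limits interact, the toy loop below admits waiting requests into a batch until one of the limits is hit. This is not vLLM's actual scheduler, and the request sizes are invented:

```python
# Toy illustration of the three batching limits; NOT vLLM's real scheduler.
max_model_len = 2048
max_num_batched_tokens = 16000
max_num_seqs = 4

waiting = [1500, 900, 2048, 400, 1200]   # token count of each waiting request

batch, budget = [], max_num_batched_tokens
for tokens in waiting:
    if tokens > max_model_len:
        continue                          # exceeds the per-sequence limit
    if len(batch) < max_num_seqs and tokens <= budget:
        batch.append(tokens)              # admit request into this iteration
        budget -= tokens

print(batch)   # → [1500, 900, 2048, 400]; the fifth request waits
```

Here the fifth request is deferred because max-num-seqs is reached, even though the token budget would still allow it.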
Now that the server is running, you can benchmark it using the public ShareGPT dataset:

```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

vllm bench serve \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 128 \
    --request-rate 8 \
    --max-concurrency 8 \
    --percentile-metrics ttft,tpot \
    --metric-percentiles 50,95,99 \
    --save-result --result-dir bench_out --result-filename serve.json
```

The key results are request throughput, output token throughput, total token throughput, TTFT (time to first token), and TPOT (time per output token).
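To see how these latency metrics relate, TPOT can be derived from the end-to-end latency once the first token has arrived: TTFT covers the first token, and TPOT covers each token generated after it. The numbers below are invented for illustration:

```python
# How TPOT relates to TTFT and end-to-end latency (invented numbers).
ttft = 0.35            # time to first token, in seconds
e2e_latency = 6.75     # total request latency, in seconds
output_tokens = 129    # tokens generated for the request

# TPOT averages the generation time over the tokens after the first one
tpot = (e2e_latency - ttft) / (output_tokens - 1)
print(f"TPOT: {tpot * 1000:.1f} ms/token")
```

A low TTFT matters most for interactive use, while a low TPOT determines how quickly long responses stream.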
Repeat the process with the quantised model. You should see a significant improvement in throughput (more tokens per second):

```bash
vllm serve \
    --model RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 \
    --max-num-batched-tokens 16000 \
    --max-num-seqs 4 \
    --data-parallel-size 1 \
    --max-model-len 2048 &

vllm bench serve \
    --model RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8 \
    --dataset-name sharegpt \
    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 128 \
    --request-rate 8 \
    --max-concurrency 8 \
    --percentile-metrics ttft,tpot \
    --metric-percentiles 50,95,99 \
    --save-result --result-dir bench_out --result-filename serve.json
```

## Llama accuracy benchmarking

The lm-evaluation-harness is the standard way to measure model accuracy across common academic benchmarks (for example MMLU, HellaSwag, and GSM8K) and runtimes (such as Hugging Face, vLLM, and llama.cpp). In this section, you'll run accuracy tests for both BF16 and INT8 deployments of your Llama models served by vLLM on Arm-based servers.

You will:
- Install the lm-eval harness with vLLM support
- Run benchmarks on a BF16 model and an INT8 (weight-quantised) model
- Interpret key metrics and compare quality across precisions

First, install the required libraries for benchmarking with lm_eval:

```bash
pip install ray lm_eval[vllm]
```

Then use a limited number of prompts to validate your environment. The first run will be slower while the test data for your selected task downloads:

```bash
lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16,max_model_len=4096 --tasks mmlu --batch_size auto --limit 10

lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16,max_model_len=4096 --tasks gsm8k --batch_size 4 --limit 10
```

The [MMLU task](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu) is a set of multiple-choice questions split into subject subgroups. It measures the ability of an LLM to understand questions and select the right answers.

The [GSM8K task](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/gsm8k) is a set of math problems that tests an LLM's mathematical reasoning ability.

Repeat with the quantised model:

```bash
lm_eval --model vllm --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8,dtype=bfloat16,max_model_len=4096 --tasks mmlu --batch_size auto --limit 10

lm_eval --model vllm --model_args pretrained=RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8,dtype=bfloat16,max_model_len=4096 --tasks gsm8k --batch_size 4 --limit 10
```

You should expect slightly lower accuracy with INT8.

## Summary of results

The benchmarking results you generate will depend on the hardware you are using. The values below were measured on an AWS Graviton4 c8g.12xlarge instance and are provided as an example only. Limits were applied to the number of samples to make the results easily reproducible; a thorough accuracy benchmark should be run over the whole dataset, though this can be time-consuming. Using the INT8-quantised Llama 3.1 8B model, throughput improved by up to ~3x at a cost of up to ~7% in accuracy.

Llama benchmark configuration:
* Throughput: --num-prompts 128
* Accuracy: --limit mmlu=10,gsm8k=500

### Throughput ratios: INT8/BF16

| Requests/s | Total tokens/s | Output tokens/s |
| -------- | -------- | -------- |
| 3.17x | 2.30x | 1.44x |

### Accuracy delta: (BF16-INT8)/BF16

| MMLU | GSM8K |
| -------- | -------- |
| 3% | 6-7% |
---
title: Quantisation techniques
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Further quantisation

You have used a publicly available w8a8-quantised model to improve performance with a small decrease in accuracy. The [Run vLLM inference with INT4 quantization on Arm servers](/learning-paths/servers-and-cloud-computing/vllm-acceleration/) Learning Path covers how to quantise a model to even lower precision (INT4). Further quantisation incurs additional accuracy losses due to the reduced precision, but other quantisation techniques can limit this loss. The [quantisation recipe](/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model/) in the referenced Learning Path uses QuantizationModifier to quantise the model weights. The [w8a8-quantised model](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8#creation) you've been using in this Learning Path was instead quantised with GPTQModifier, which uses a calibration dataset to quantise the model weights. GPTQModifier has been found to produce smaller accuracy degradations than QuantizationModifier, so a recipe like the one below is recommended for INT4 quantisation.

Install the required packages, then run the quantisation script:

```bash
pip install compressed-tensors==0.14.0.1
pip install llmcompressor==0.10.0.1
pip install datasets==4.6.0

python w4a8_quant.py
```

Where w4a8_quant.py contains:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization import QuantizationType, QuantizationStrategy

model_id = "meta-llama/Meta-Llama-3.1-8B"

num_samples = 256
max_seq_len = 4096

tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(example):
    return {"text": example["text"]}

# Build a calibration dataset for GPTQ
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

# INT4 per-channel weights, dynamic per-token INT8 activations (w4a8)
scheme = {
    "targets": ["Linear"],
    "weights": {
        "num_bits": 4,
        "type": QuantizationType.INT,
        "strategy": QuantizationStrategy.CHANNEL,
        "symmetric": True,
        "dynamic": False,
        "group_size": None,
    },
    "input_activations": {
        "num_bits": 8,
        "type": QuantizationType.INT,
        "strategy": QuantizationStrategy.TOKEN,
        "dynamic": True,
        "symmetric": False,
        "observer": None,
    },
    "output_activations": None,
}

recipe = GPTQModifier(
    targets="Linear",
    config_groups={"group_0": scheme},
    ignore=["lm_head"],
    dampening_frac=0.01,
    block_size=512,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

# Apply the GPTQ recipe using the calibration samples
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)
model.save_pretrained("Meta-Llama-3.1-8B-quantized.w4a8")
```

## Next steps

Now that your environment is set up for benchmarking and quantising models, you can experiment with:
- Longer benchmarking runs
- Benchmarking accuracy with different tasks, such as HellaSwag and WinoGrande
- Different quantisation techniques
- Different models

Your results will allow you to balance accuracy and performance when making decisions about model deployment.
---
title: Run vLLM inference with quantised models and benchmark on Arm servers

minutes_to_complete: 60

who_is_this_for: This is an introductory topic for developers interested in running inference on quantised models. This Learning Path shows you how to run inference on Llama 3.1 8B, with and without quantisation, and benchmark model performance and accuracy with vLLM's bench CLI and the LM Evaluation Harness.

learning_objectives:
    - Install a recent release of vLLM
    - Run both quantised and non-quantised variants of Llama 3.1 8B using vLLM
    - Evaluate and compare model performance and accuracy using vLLM's bench CLI and the LM Evaluation Harness

prerequisites:
    - An Arm-based Linux server (Ubuntu 22.04+ recommended) with a minimum of 32 vCPUs, 96 GB RAM, and 64 GB free disk space
    - Python 3.12 and basic familiarity with Hugging Face Transformers and quantisation

author: Anna Mayne

### Tags
skilllevels: Introductory
subjects: ML
armips:
    - Neoverse
tools_software_languages:
    - vLLM
    - LM Evaluation Harness
    - LLM
    - Generative AI
    - Python
    - PyTorch
    - Hugging Face
operatingsystems:
    - Linux

further_reading:
    - resource:
        title: vLLM Documentation
        link: https://docs.vllm.ai/
        type: documentation
    - resource:
        title: vLLM GitHub Repository
        link: https://github.com/vllm-project/vllm
        type: website
    - resource:
        title: Hugging Face Model Hub
        link: https://huggingface.co/models
        type: website
    - resource:
        title: Build and Run vLLM on Arm Servers
        link: /learning-paths/servers-and-cloud-computing/vllm/
        type: website
    - resource:
        title: LM Evaluation Harness (GitHub)
        link: https://github.com/EleutherAI/lm-evaluation-harness
        type: website

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1                       # _index.md always has weight of 1 to order correctly
layout: "learningpathall"       # All files under learning paths have this same wrapper
learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---