| Browse Adapters | Models on HF | Tutorials |
Most AI models are monolithic — all capabilities baked into one set of weights. Granite Switch lets you compose a model from independent, task-specific components: pick the capabilities you need, compose a single checkpoint in minutes, then swap or upgrade individual components as your needs change.
Browse available libraries in the Granite Libraries collection on Hugging Face.
- Composable — Combine independently developed adapters into one checkpoint, whether IBM's or yours. Swap, upgrade, or customize without retraining.
- Fast — Built on IBM's Activated LoRA technology for efficient KV cache reuse, low latency, and high inference throughput.
- Accurate — Task-specific adapters can match or even surpass the accuracy of significantly larger generalist models at a fraction of the serving cost. See the adapter catalog for benchmark comparisons across all 12 adapters.
- Inference-ready — Support for Hugging Face and vLLM.
```bash
python -m venv venv && source venv/bin/activate

# Granite Switch installation depends on your use case:
pip install "granite-switch[compose]"      # Compose modular models
pip install "granite-switch[hf]"           # HuggingFace inference
pip install "granite-switch[vllm]"         # vLLM production inference (0.19.x)
pip install "granite-switch[vllm20]"       # vLLM 0.20+ (requires CUDA 13+)
pip install "granite-switch[dev]"          # Everything (uses vLLM 0.19.x by default)
pip install "granite-switch[dev-vllm20]"   # Dev environment with vLLM 0.20+
```

Requires Python 3.9+ and PyTorch 2.0+.
vLLM version note: This project currently defaults to vLLM 0.19.1 because vLLM 0.20 depends on CUDA 13.0+ (via PyTorch 2.11), which is incompatible with many existing environments running CUDA 12.x drivers. Use the vllm20 extra if your environment supports CUDA 13+.
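If you are unsure which vLLM extra applies to your environment, a quick way to check is to inspect the CUDA version your PyTorch build targets and whether a GPU driver is visible. This is a minimal sketch using only standard PyTorch APIs:

```python
# Minimal environment check before picking a vLLM extra (standard PyTorch APIs only).
import torch

print(torch.__version__)           # installed PyTorch version
print(torch.version.cuda)          # CUDA version this PyTorch build was compiled against
print(torch.cuda.is_available())   # whether a usable GPU driver/device is visible
```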
Compose a base Granite model with adapter libraries into a single deployable checkpoint:
```bash
python -m granite_switch.composer.compose_granite_switch \
    --base-model ibm-granite/granite-4.1-3b \
    --adapters ibm-granite/granitelib-core-r1.0 ibm-granite/granitelib-rag-r1.0 ibm-granite/granitelib-guardian-r1.0 \
    --output ./my-model
```

Use the Adapter Composer to browse available adapters, compare benchmarks, and generate a ready-to-run compose command.
This downloads the base model, embeds compatible LoRA adapters (with a preference towards activated LoRA), adds control tokens and a chat template, and produces a model directory that works with both HuggingFace and vLLM.
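As a quick sanity check (not part of the documented workflow), the composed directory can be loaded like any other local Hugging Face checkpoint. The sketch below assumes the compose command above wrote its output to `./my-model` and mirrors the HuggingFace usage shown later in this README:

```python
# Minimal sketch: confirm the composed checkpoint loads and generates.
import granite_switch.hf  # registers the Granite Switch HF backend (see the HuggingFace example below)
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./my-model")
model = AutoModelForCausalLM.from_pretrained("./my-model", device_map="auto")

inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```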
For convenience, pre-composed Granite Switch models for the Granite 4.1 model family are available here:
- ibm-granite/granite-switch-4.1-3b-preview
- ibm-granite/granite-switch-4.1-8b-preview
- ibm-granite/granite-switch-4.1-30b-preview
vLLM + Mellea (recommended):
```bash
pip install mellea

# Example with the 3B model
python -m vllm.entrypoints.openai.api_server --model ibm-granite/granite-switch-4.1-3b-preview --port 8000
```

```python
from mellea.backends.openai import OpenAIBackend
from mellea.stdlib.components.intrinsic import rag
from mellea.stdlib.context import ChatContext
backend = OpenAIBackend(
model_id="ibm-granite/granite-switch-4.1-3b-preview",
base_url="http://localhost:8000/v1",
api_key="unused",
)
backend.register_embedded_adapter_model("ibm-granite/granite-switch-4.1-3b-preview")
query = "I want to ask you something. what is...mmmm the the main city(capital you call it,right?) of France?"
ctx = ChatContext()
rewritten = rag.rewrite_question(query, ctx, backend)
print(f"original: {query}")
print(f"rewritten: {rewritten}")
# => "What is the capital of France?"
```
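Independently of Mellea, you can sanity-check the served model with any OpenAI-compatible client. A minimal sketch using the openai Python package, with the endpoint and model name taken from the launch command above:

```python
# Minimal sketch: plain chat completion against the vLLM server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="ibm-granite/granite-switch-4.1-3b-preview",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```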
HuggingFace:

```python
import granite_switch.hf  # Register HF backend
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-switch-4.1-3b-preview", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-switch-4.1-3b-preview")
messages = [{"role": "user", "content": "What is the capital of France?"}]
documents = [{"doc_id": "1", "text": "Paris is the capital of France."}]
prompt = tokenizer.apply_chat_template(
messages,
documents=documents,
adapter_name="answerability", # activates the answerability adapter
add_generation_prompt=True,
tokenize=False,
)
outputs = model.generate(**tokenizer(prompt, return_tensors="pt").to(model.device))
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# => "answerable"
```

Granite Switch uses a switch layer: a small attention-based mechanism that reads control tokens from the input and determines which adapter's LoRA weights to apply at each position.
What makes composition work:
- KV cache normalization — each adapter sees only the base model's KV cache, never another adapter's internal state
- No joint training required — adapters are developed, tested, and published independently
- Standard inference — the entire model loads in vLLM with zero code changes
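The following is a purely conceptual sketch of the per-position routing idea described above. All names, shapes, and the routing computation are hypothetical, and the attention mechanism that produces the routing scores from control tokens is omitted; this is not the actual switch-layer implementation:

```python
# Conceptual sketch only: per-position selection of adapter LoRA deltas.
# Names, shapes, and the routing computation are hypothetical, not the real implementation.
import torch

def apply_switch(base_out, lora_deltas, control_logits):
    """Blend per-adapter LoRA contributions into the base output, position by position.

    base_out:       (batch, seq, dim)               output of the frozen base projection
    lora_deltas:    (num_adapters, batch, seq, dim) each adapter's LoRA contribution
    control_logits: (batch, seq, num_adapters)      routing scores derived from control tokens
    """
    weights = torch.softmax(control_logits, dim=-1)              # which adapter is active at each position
    mixed = torch.einsum("bsa,absd->bsd", weights, lora_deltas)  # per-position weighted sum of deltas
    return base_out + mixed                                      # base output plus the selected adapter's delta
```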
For detailed tutorials and many working examples, see the Tutorials section.
Granite Switch was started by IBM Research.
Granite Switch has an Apache-2.0 license, as found in the LICENSE file.