Changes from all commits
35 commits
19c6c8a
fixes
jlamypoirier Mar 20, 2026
1b6fcd0
fix
jlamypoirier Mar 20, 2026
573c6d8
fix
jlamypoirier Mar 20, 2026
3658c02
fix
jlamypoirier Mar 21, 2026
ab39e26
stuff
jlamypoirier Mar 24, 2026
2255845
stuff
jlamypoirier Mar 25, 2026
68b68b2
Fix intermittent test_data_streaming failure with fakeredis 2.34+
bigximik Mar 25, 2026
ce3e85a
Reduce losses with token counts instead of sequence length
jlamypoirier Mar 25, 2026
ac7fa09
Merge remote-tracking branch 'origin/jlp_batch_fixes' into jlp_reduce…
jlamypoirier Mar 25, 2026
b429c5e
Fix tflops units, loss reduce ops, and CPU test support
jlamypoirier Mar 26, 2026
647fbb7
Fix MTP Llama converter to map head.final_norm to model.mtp_norms.0
jlamypoirier Mar 26, 2026
d26d4ec
Fix DistributedDim pickling to allow DataLoader workers with streamin…
jlamypoirier Mar 26, 2026
fa607fc
Expand test_preprocessing.py with comprehensive coverage
jlamypoirier Mar 26, 2026
9af9675
Fix cross-document masking bounds and padding document count bugs, ex…
jlamypoirier Mar 26, 2026
f3975a5
Rename FieldUpdate to FieldOverride, convert derived fields to cached…
jlamypoirier Mar 27, 2026
fe63066
Various fixes and improvements across data, schedule, docs, and build
jlamypoirier Mar 27, 2026
b935eb5
Fix bugs in engine/base_model and engine/config_utils
jlamypoirier Mar 27, 2026
cada39f
Add docstrings to config classes, update docs and mkdocs
jlamypoirier Mar 27, 2026
d752f03
Add config docs generator script and MkDocs hook, ignore generated ou…
jlamypoirier Mar 27, 2026
d47904c
Add unit tests for generate_config_docs.py
jlamypoirier Mar 27, 2026
e6eea7b
Add Triton GRPO loss kernel with vocab-parallel support and tests
jlamypoirier Mar 27, 2026
8ae107a
Fix bugs in fast_llm/engine and related modules
jlamypoirier Mar 27, 2026
9513bca
Fix NameError in lora_linear forward_only with out_channel_begin
jlamypoirier Mar 27, 2026
eb11394
Fix swapped args in MoE _add_shared_experts call and wrong error mess…
jlamypoirier Mar 27, 2026
edfe59f
Remove duplicate NotImplementedError check in LanguageModelDPOLoss
jlamypoirier Mar 27, 2026
a0b91d2
Remove dead code in MLP and StochasticMixer
jlamypoirier Mar 27, 2026
ac72582
Fix grad scaler load and Apriel dt_rank auto formula
jlamypoirier Mar 27, 2026
021a71a
Add parallelism documentation (user guide and developer guide)
jlamypoirier Mar 28, 2026
8a665fd
Fix .bin checkpoint loading in HuggingFace handler
jlamypoirier Mar 28, 2026
beca881
Remove unnecessary .value calls on StrEnum members
jlamypoirier Mar 28, 2026
7d594ef
Fix GatedDeltaNetConfig losing value_heads validation due to duplicat…
jlamypoirier Mar 30, 2026
0de1b6c
Fix five logic bugs found in audit of core/, layers/block/, data/data…
jlamypoirier Mar 30, 2026
d32b6b0
Fix streaming tests: keep ProcessGroupPool alive and use dynamic xrea…
jlamypoirier Mar 31, 2026
08e0b5e
Various fixes across data, layers, and conversions
jlamypoirier Mar 31, 2026
28f14d8
Merge remote-tracking branch 'origin/main' into jlp_general_improvements
jlamypoirier Mar 31, 2026
1 change: 1 addition & 0 deletions .gitignore
@@ -9,6 +9,7 @@ __pycache__/
# Doc build
.cache
site
docs/reference/

# Distribution / packaging
*.egg-info/
10 changes: 5 additions & 5 deletions README.md
@@ -60,12 +60,12 @@ As a truly open-source project, Fast-LLM allows full customization and extension

We'll walk you through how to use Fast-LLM to train a large language model on a cluster with multiple nodes and GPUs. We'll show an example setup using a Slurm cluster and a Kubernetes cluster.

For this demo, we will train a Mistral-7B model from scratch for 100 steps on random data. The config file `examples/mistral-4-node-benchmark.yaml` is pre-configured for a multi-node setup with 4 DGX nodes, each with 8 A100-80GB or H100-80GB GPUs.
For this demo, we will train a Mistral-7B model from scratch for 100 steps on random data. The config file `examples/mistral.yaml` defines the model architecture and training settings, while the example launch scripts are pre-configured for a 4-node setup with 8 GPUs per node.

> [!NOTE]
> Fast-LLM scales from a single GPU to large clusters. You can start small and expand based on your resources.

Expect to see a significant speedup in training time compared to other libraries! For training Mistral-7B, Fast-LLM is expected to achieve a throughput of **9,800 tokens/s/H100** (batch size 32, sequence length 8k) on a 4-node cluster with 32 H100s.
Expect to see a significant speedup in training time compared to other libraries! For training Mistral-7B, Fast-LLM is expected to achieve a throughput of **9,800 tokens/s/H100** (micro-batch size 8k tokens, total batch size 256k tokens) on a 4-node cluster with 32 H100s.

### Running Fast-LLM on a Slurm Cluster

@@ -77,7 +77,7 @@ Expect to see a significant speedup in training time compared to other libraries

#### Steps

1. Deploy the [nvcr.io/nvidia/pytorch:24.07-py3](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) Docker image to all nodes (recommended), because it contains all the necessary dependencies.
1. Deploy the [nvcr.io/nvidia/pytorch:25.11-py3](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) Docker image to all nodes (recommended), because it contains all the necessary dependencies.
2. Install Fast-LLM on all nodes:

```bash
@@ -88,7 +88,7 @@ Expect to see a significant speedup in training time compared to other libraries
#SBATCH --ntasks=$(scontrol show node | grep -c NodeName)
#SBATCH --exclusive

srun bash -c 'pip install --no-cache-dir -e "git+https://github.com/ServiceNow/Fast-LLM.git#egg=llm[CORE,OPTIONAL,DEV]"'
srun bash -c 'pip install --no-cache-dir "fast-llm[CORE,OPTIONAL] @ git+https://github.com/ServiceNow/Fast-LLM.git"'
EOF
```

@@ -115,7 +115,7 @@ Now, you can sit back and relax while Fast-LLM trains your model at full speed!

#### Steps

1. Create a Kubernetes [PersistentVolumeClaim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) (PVC) named `fast-llm-home` that will be mounted to `/home/fast-llm` in the container using [examples/fast-llm-pvc.yaml](examples/fast-llm-pvc.yaml):
1. Create a Kubernetes [PersistentVolumeClaim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) (PVC) named `pvc-fast-llm-home` that will be mounted to `/home/fast-llm` in the container using [examples/fast-llm-pvc.yaml](examples/fast-llm-pvc.yaml):

```bash
kubectl apply -f examples/fast-llm-pvc.yaml
194 changes: 85 additions & 109 deletions docs/developer_guide/conversion.md
@@ -76,124 +76,99 @@ class AwesomeHuggingfaceCheckpointHandler(HuggingfaceStateDictCheckpointHandler)

### Configuration conversion

The configuration conversion utility interfaces between two configurations in the form of nested dictionaries:
a serialized Fast-LLM configuration and an external configuration.
The `_load_config` method is expected to read the configuration on disk, in the layout expected by the checkpoint format,
and return the same configuration in the form of a nested dictionary,
with `_save_config` handling the reverse operation.
See the [Hugging Face implementation](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/engine/checkpoint/huggingface.py) for an example.
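
For instance, for a checkpoint format that stores its configuration in a single JSON file, the pair could look roughly like the sketch below (the class and the `config.json` file name are hypothetical, not the actual Fast-LLM implementation):

```python
import json
import pathlib


class JsonConfigHandlerSketch:
    """Hypothetical handler sketch storing the external configuration in `config.json`."""

    @classmethod
    def _load_config(cls, directory: pathlib.Path) -> dict:
        # Read the configuration from disk and return it as a nested dictionary.
        return json.loads((directory / "config.json").read_text())

    @classmethod
    def _save_config(cls, directory: pathlib.Path, config: dict) -> None:
        # Reverse operation: serialize the nested dictionary back to disk.
        directory.mkdir(parents=True, exist_ok=True)
        (directory / "config.json").write_text(json.dumps(config, indent=2))
```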

To perform the conversion, the checkpoint handler relies on a list of `ParamConverter` objects,
which describe how individual parameters (or in some cases multiple ones) should be converted.
The `ParamConverter` base interface is a dataclass consisting of two variables and two methods:

* `fast_llm_names: tuple[tuple[str, ...], ...]`: An array of entry names on the Fast-LLM side, in tuple format.
For example, `((transformer, head_groups),)` refers to the single entry `config["transformer"]["head_groups"]`.
* `export_names: tuple[tuple[str, ...], ...]`: An array of entry names on the external side, in the same tuple format.
* `export_params(self, fast_llm_values: tuple[typing.Any, ...]) -> tuple[typing.Any, ...]`:
This method takes the configuration parameters corresponding to `fast_llm_names` (in the same order),
and returns converted parameters corresponding to `export_names`.
* `import_params(self, export_values: tuple[typing.Any, ...]) -> tuple[typing.Any, ...]`:
The converse of `export_params`, converting parameters corresponding to `export_names` into those corresponding to `fast_llm_names`.

While not strictly part of the interface, it may also be useful to define a dataclass `__post_init__`,
for example to restrict the number of parameters in `fast_llm_names` and `export_names`.

Fast-LLM offers several generic configuration converter classes, including:

* `RenameParamConverter`: A simple 1-1 mapping between parameters, with optional renaming but identical value.
Typically, most converters are of this type.
* `ConstantImportParamConverter`: A 1-0 mapping for Fast-LLM parameters without an equivalent in the external format,
which must take a specific value `fast_llm_value` for the conversion to make sense (i.e., they have a hard-coded value in the external format).
This type of converter is common for Hugging Face converters, as Hugging Face models support far fewer configuration parameters.
* `ConstantExportParamConverter`: A 0-1 mapping, the converse of `ConstantImportParamConverter`.
* `MappedConfigParamConverter`: A 1-1 mapping similar to `RenameParamConverter`, but with a non-trivial relation between values.

In addition to those, you may need to implement your own custom converter.
Here is an example that associates several Fast-LLM variables with a tuple.
Configuration conversion is handled by a `HuggingFaceBaseModelConverter` subclass,
which is linked to the handler via a `base_model_converter_class` class variable.
The converter implements three class methods:

```python
@dataclasses.dataclass(kw_only=True)
class PackingParamConverter(ParamConverter):
    def __post_init__(self):
        # There may be any number of Fast-LLM variables, but only one external one.
        Assert.eq(len(self.export_names), 1)

    def export_params(self, fast_llm_values):
        # Pack the values into a single tuple.
        return (fast_llm_values,)

    def import_params(self, export_values):
        # Unpack the values. We can safely assume `export_values` has length one because of the assertion in `__post_init__`.
        return export_values[0]
```
* `import_config(cls, config: dict) -> dict`:
Reads the external (e.g., Hugging Face) configuration dict and returns a Fast-LLM `base_model` config dict.
* `export_config(cls, config: BaseModelConfig) -> dict`:
Takes a Fast-LLM `BaseModelConfig` object and returns the corresponding external configuration dict.
* `get_converters(cls, config: BaseModelConfig, exported_config: dict) -> list[WeightConverter]`:
Returns the list of weight converters for this model (described in the next section).

Now that we've seen how parameter converters work, we're ready to add them to our handler class.
We do so by creating a list of converters in the `_create_config_converters` class method.
Continuing our `AwesomeModel` handler example, we define:
The `_load_config` and `_save_config` methods on the handler read and write the external configuration file.
See the [Hugging Face implementation](https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/engine/checkpoint/huggingface.py) for their default implementation.

Continuing our `AwesomeModel` example, the base model converter class could look like:

```python
class AwesomeBaseModelConverter(HuggingFaceBaseModelConverter):
    @classmethod
    def _create_config_converters(cls) -> list[ParamConverter]:
        # For Hugging Face handlers, we need to call the superclass method.
        return super()._create_config_converters() + [
            # A trivial example where both the name and value are the same on both sides.
            RenameParamConverter(
                fast_llm_names=(("vocab_size",),),
                export_names=(("vocab_size",),),
            ),
            # A non-trivial example of `RenameParamConverter` with renaming and handling of nested dictionaries.
            RenameParamConverter(
                fast_llm_names=(("transformer", "rotary", "theta"),), export_names=(("rope_theta",),)
            ),
            # A constant import example indicating that the external format does not support absolute positional embeddings.
            ConstantImportParamConverter(fast_llm_names=(("use_position_embeddings",),), fast_llm_value=False),
            # The `architectures` parameter is a common use case for `ConstantExportParamConverter` in Hugging Face models.
            ConstantExportParamConverter(export_names=(("architectures",),), export_value=["AwesomeModelForCausalLM"]),
            # A value mapping example, where we match Fast-LLM activation types with their Hugging Face equivalents.
            MappedConfigParamConverter(
                fast_llm_names=(("transformer", "activation_type"),),
                export_names=(("hidden_act",),),
                fast_llm_value=ActivationType.from_hf_name,
                export_value=lambda activation_type: activation_type.hf_name,
            ),
            # A more hypothetical example using `PackingParamConverter` to pack two parameters `epsilon_1`, `epsilon_2` into a tuple `eps`.
            PackingParamConverter(
                fast_llm_names=(("epsilon_1",), ("epsilon_2",)),
                export_names=(("eps",),),
            ),
        ]
```
```python
@classmethod
def import_config(cls, config: dict) -> dict:
    # Build and return a Fast-LLM base_model config dict from the external config.
    return {
        "hidden_size": config["hidden_size"],
        "embeddings": {"vocab_size": config["vocab_size"]},
        "decoder": {
            "num_blocks": config["num_hidden_layers"],
            "block": {
                "mixer": {
                    "heads": config["num_attention_heads"],
                    "head_groups": config.get("num_key_value_heads", config["num_attention_heads"]),
                    "rotary": {"type": "default", "theta": config.get("rope_theta", 10000)},
                    "add_linear_biases": False,
                },
                "mlp": {
                    "intermediate_size": config["intermediate_size"],
                    "gated": True,
                    "activation": ActivationType.from_hf_name(config["hidden_act"]),
                    "add_linear_biases": False,
                },
                "normalization": {"type": "rms_norm", "epsilon": config["rms_norm_eps"]},
            },
        },
        "head": {"normalization": {"type": "rms_norm", "epsilon": config["rms_norm_eps"]}},
        "tied_embedding_weight": config.get("tie_word_embeddings", False),
    }
```

!!! note "How conversion works"
Once the converters are defined, the conversion utility takes it from there.
Exporting works as follows (importing works similarly): the handler creates an empty export config dict, then loops over its list of converters. For each converter, it:

* Reads the value of each parameter defined in `fast_llm_names` and gathers them in a tuple.
* Calls `converter.export_params`, providing the tuple of read values as argument.
* Ensures that the returned value has the correct length (that of `export_names`).
* Sets the respective values in the export config dict.
```python
@classmethod
def export_config(cls, config: AwesomeBaseModelConfig) -> dict:
    # Build and return the external config dict from the Fast-LLM config object.
    decoder_block = config.decoder.block
    return {
        "model_type": "awesome_model",
        "architectures": ["AwesomeModelForCausalLM"],
        "hidden_size": config.hidden_size,
        "vocab_size": config.embeddings.vocab_size,
        "num_hidden_layers": config.decoder.num_blocks,
        "num_attention_heads": decoder_block.mixer.heads,
        "num_key_value_heads": decoder_block.mixer.head_groups,
        "rope_theta": decoder_block.mixer.rotary.theta,
        "intermediate_size": decoder_block.mlp.intermediate_size,
        "hidden_act": decoder_block.mlp.activation.hf_name,
        "rms_norm_eps": decoder_block.normalization.epsilon,
        "tie_word_embeddings": config.tied_embedding_weight,
    }
```

!!! note "About `MISSING` and `DEFAULT`"
If a value is not found during import, it is replaced by the `MISSING` tag.
The converter's `import_params` has the opportunity to handle this missing value;
if it returns `MISSING` anyway, the handler throws an error because it does not know what value to set on the Fast-LLM side.
```python
@classmethod
def get_converters(cls, config: AwesomeBaseModelConfig, exported_config: dict) -> list[WeightConverter]:
    # Described in the next section.
    ...
```

The `MISSING` tag is also supported during export, but with a different meaning,
as the value is always expected to be found in the Fast-LLM configuration.
Instead, `export_params` may return a `MISSING` tag to indicate that no value should be added to the exported config.
It may also return `DEFAULT`, which is replaced by the default value of the configuration parameter.
Then wire the converter into the handler via `base_model_converter_class`:

Note that the handling of `MISSING` and `DEFAULT` is experimental and may be improved in the future.
```python
class AwesomeHuggingfaceCheckpointHandler(HuggingfaceStateDictCheckpointHandler):
    _model_class = AwesomeModelConfig
    architecture = "AwesomeModelForCausalLM"
    base_model_converter_class = AwesomeBaseModelConverter

    @classmethod
    def get_transformers_configuration_class(cls):
        from transformers import AutoConfig

        return AutoConfig
```

### State conversion

State conversion follows the same principle as configuration conversion, but acts on flat dictionaries of state tensors.
Converters are defined by subclassing `WeightConverter`, with the interface:

* `fast_llm_name: str | tuple[str, ...]`: An entry name or array of entry names on the Fast-LLM side.
For example, `((transformer, head_groups),)` refers to the single entry `config["transformer"]["head_groups"]`.
* `export_name: str | tuple[str, ...]`: An entry name or array of entry names on the external side.
* `fast_llm_name: str | tuple[str, ...]`: A state dict key, or tuple of keys, on the Fast-LLM side.
For example, `"layers.0.mixer.weight"` or `("layers.0.weight_1", "layers.0.weight_2")`.
* `export_name: str | tuple[str, ...]`: A state dict key, or tuple of keys, on the external side.
* `export_weight(self, weight: tuple[torch.Tensor | SafeTensorSlice, ...]) -> tuple[torch.Tensor | SafeTensorSlice, ...]`:
This method takes the state dict entries corresponding to `fast_llm_name` (in the same order),
and returns converted entries corresponding to `export_name`.
@@ -225,19 +200,20 @@ class TransposeWeightConverter(WeightConverter):
return (weight[0][:].transpose().contiguous(),)
```

We define the list of weight converters in the `_create_weight_converters` method.
Continuing our `AwesomeModel` handler example, we define:
We define the list of weight converters in the `get_converters` class method of the base model converter.
Continuing our `AwesomeModel` example, we define:

```python
def _create_weight_converters(self) -> list[WeightConverter]:
@classmethod
def get_converters(cls, config: AwesomeBaseModelConfig, exported_config: dict) -> list[WeightConverter]:
    converters = []
    # The set of converters may depend on the base model configuration, which is accessible through `self._model.base_model_config`.
    num_layers = len(self._model.config.base_model.decoder)
    # The set of converters may depend on the base model configuration.
    num_layers = config.decoder.num_blocks

    # A simple renaming example, for the word embeddings.
    converters.append(WeightConverter("layers.0.word_embeddings_weight", "model.embed_tokens.weight"))

    # We usually want to loop dynamically over layers.
    for i in range(num_layers):
        # A `SplitWeightConverter` example, splitting a weight in two.
        converters.append(SplitWeightConverter(