fix: Bumps to transformers==5.0.0 #418

Open

nrfulton wants to merge 11 commits into generative-computing:main from nrfulton:bobby_and_nathan/transformers_v5

Conversation

@nrfulton
Member

@nrfulton nrfulton commented Feb 5, 2026

Bump to transformers v5

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

This PR updates our KV smash code to use transformers v5. This requires moving away from the legacy Cache implementation. The code here is originally from @csbobby.

This PR is still a draft. There are several changes needed:

  • Investigate the reason for our vllm transformers pin
  • Ask Docling folks if they plan on removing their transformers version upper bound / investigate work-arounds if not.
  • Update huggingface.py backend and unit tests.

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation once the rest of the PR is populated)

@nrfulton nrfulton requested a review from guicho271828 February 5, 2026 23:19
@github-actions
Contributor

github-actions Bot commented Feb 5, 2026

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@nrfulton
Member Author

nrfulton commented Feb 5, 2026

@guicho271828 Why do we have a version pin for transformers in the vllm dependency?

@mergify

mergify Bot commented Feb 5, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@psschwei
Member

psschwei commented Feb 6, 2026

Just a heads up, when I tried bumping the transformers version it opened a whole can of worms: it required a vllm bump, which required a bump to outlines, which would've required code changes in the backends.
I didn't dig into it too deeply though (had that as one of my todos)

@nrfulton nrfulton assigned nrfulton and unassigned nrfulton Feb 6, 2026
@nrfulton nrfulton force-pushed the bobby_and_nathan/transformers_v5 branch from 7de778d to 32fcaab Compare February 6, 2026 22:31
@nrfulton
Member Author

nrfulton commented Feb 6, 2026

Just a heads up, when I tried bumping the transformers version it opened a whole can of worms: it required a vllm bump, which required a bump to outlines, which would've required code changes in the backends. I didn't dig into it too deeply though (had that as one of my todos)

Yeah.

The vllm/outlines thing seems like something we should be able to work around. @guicho271828 has already looked into removing outlines entirely. Now might be the time to do that if we need to make changes to bump the version.

There's also a transformers v4 dependency in docling; I'm not sure what their status is on supporting the latest transformers, though.

I think those are the only two blockers. They are both pretty annoying blockers, though :(

@nrfulton
Member Author

nrfulton commented Apr 2, 2026

Docling is no longer a blocker, but vllm still has a dependency on transformers v4.

Watching vllm-project/vllm#30566

@nrfulton
Member Author

nrfulton commented Apr 2, 2026

Alternatively, we can implement #777 and then we don't need to wait for vllm.

@psschwei
Member

psschwei commented Apr 2, 2026

Alternatively, we can implement #777 and then we don't need to wait for vllm.

Dropping the vllm backend would mean we just access vllm through its OpenAI endpoint, right? I'm on board with that.

@github-actions github-actions Bot added the bug Something isn't working label Apr 8, 2026
@avinash2692
Member

Tracking it here; blocked by huggingface/transformers#45507.

@avinash2692 avinash2692 marked this pull request as ready for review April 29, 2026 14:45
@avinash2692 avinash2692 requested review from a team as code owners April 29, 2026 14:45
@avinash2692
Member

Waiting on this from the HF repo: huggingface/transformers#45813

merged_attention=torch.ones_like(output_complete).to(self._device),
q_end=len(input_ids[0]), # type: ignore
q_end=input_ids["input_ids"].shape[1], # type: ignore
scores=hf_output.scores,

post_processing() at L1217 (and L1247) does input_ids["input_ids"].shape[1], but when called from the KV-cache path (_generate_from_context_with_kv_cache), input_ids is a plain torch.Tensor — subscripting it with a string raises TypeError. processing() already handles this with the isinstance guard at L1144; post_processing() needs the same treatment. L1247 is inside a try/except Exception: pass so it silently loses token-usage telemetry rather than crashing, but L1217 is unguarded.
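A minimal sketch of the shared guard this comment is asking for, assuming input_ids is either the tokenizer's mapping output or a plain tensor (the helper name is hypothetical; only the length lookup is shown):

```python
import torch

def _prompt_length(input_ids) -> int:
    """Prompt length for either a tokenizer mapping or a raw tensor (hypothetical helper)."""
    if isinstance(input_ids, torch.Tensor):
        # KV-cache path (_generate_from_context_with_kv_cache) passes a plain
        # tensor of shape (batch, seq_len).
        return input_ids.shape[1]
    # Normal path: BatchEncoding / dict from the tokenizer; same shape under "input_ids".
    return input_ids["input_ids"].shape[1]
```

Both post_processing() call sites could then pass q_end=_prompt_length(input_ids).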

Comment thread pyproject.toml
"peft>=0.18.1", # Native aLoRA support added in PEFT 0.18.0
"transformers>=4.53.2,<5",
"transformers (>=4.42.0,<6.0.0,!=5.0.*,!=5.1.*,!=5.2.*,!=5.3.*,!=5.5.0)", # 5.5.0 broke granite 4.0
"trl==0.19.1",

The new constraint >=4.42.0,<6.0.0,!=5.0.*,!=5.1.*,!=5.2.*,!=5.3.*,!=5.5.0 widens the lower bound back into v4 territory, but the v4 code paths (legacy_cache_smash, merge_dynamic_caches, tokens_to_legacy_cache) were removed in this PR. DynamicCache.layers and CacheLayerMixin do not exist in v4 — anyone resolving to the old supported range will hit AttributeError at the KV smash call site with no hint about needing to upgrade. If the intent is v5-only (which the code implies), I would suggest >=5.4.0,!=5.5.0,<6.0.0 and drop the exclusion list.
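Whichever range is chosen, a cheap import-time guard in the KV-smash module (a sketch, not part of this PR) would turn the eventual AttributeError into an actionable message:

```python
from packaging.version import Version

import transformers

# Hypothetical fail-fast check for the v5-only code paths (DynamicCache.layers,
# CacheLayerMixin): raise a clear error instead of an AttributeError at the call site.
if Version(transformers.__version__) < Version("5.4.0"):
    raise ImportError(
        f"KV cache smashing requires transformers>=5.4.0, found {transformers.__version__}."
    )
```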

getattr(layer, "is_sliding", False) for layer in getattr(c, "layers", [])
):
raise ValueError("Check the issue.")


raise ValueError("Check the issue.") is a placeholder that will reach users. Granite 4 models (called out in the version-pin comment) use sliding-window attention, so this is a realistic hit. Could this be replaced with something descriptive before the PR leaves draft? e.g. "KV cache smashing does not currently support sliding-window attention layers. Disable prefix caching for this model." — and ideally a link to a tracking issue if one exists.



if __name__ == "__main__":
from mellea.backends.huggingface import LocalHFBackend

The __main__ block is handy as a smoke test but does not belong in a library module — it creates a circular import (kv_block_helpers → huggingface → kv_block_helpers) and would trigger a full model download if anyone runs python -m mellea.backends.kv_block_helpers by accident. scratchpad/ (git-ignored) or docs/kv_smash/ (where kvcache.py already lives) would be a better home for it.

"""Merge multiple v5 DynamicCache objects by concatenating KV states along the time axis."""
caches = list(caches)
assert len(caches) >= 1


assert len(caches) >= 1 is stripped under python -O — an empty list would silently return an empty DynamicCache(). Since this is a public function, a ValueError would be safer: if not caches: raise ValueError("caches must be non-empty").
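Concretely, the top of the merge helper could look like this (a sketch; the signature is assumed from the diff and the body is elided):

```python
from transformers import DynamicCache

def merge_dynamic_caches_v5(caches) -> DynamicCache:
    """Merge DynamicCache objects along the time axis."""
    caches = list(caches)
    if not caches:
        # Unlike an assert, this survives `python -O`.
        raise ValueError("caches must be non-empty")
    ...
```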

Provides functions for merging ``DynamicCache`` and legacy tuple caches along the
time axis (``merge_dynamic_caches``, ``legacy_cache_smash``), and
time axis (``merge_dynamic_caches_v5``, ``legacy_cache_smash``), and
``tokens_to_legacy_cache`` for converting a tokenized prompt into a prefilled KV

The module docstring still references legacy_cache_smash and tokens_to_legacy_cache, both removed in this PR. Worth updating to reflect the new exports (prefill_cache_v5, merge_dynamic_caches_v5, merge_v5) before this lands.

Comment thread mellea/backends/openai.py
)

if TYPE_CHECKING:
from transformers.tokenization_utils import PreTrainedTokenizer

Minor: removing the if TYPE_CHECKING: block leaves TYPE_CHECKING imported but unused at L9 (from typing import TYPE_CHECKING, Any, overload). Ruff F401 will catch this — just needs dropping from the typing import.

Comment thread cli/alora/train.py
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_special_tokens = False


tokenizer.add_special_tokens = False was assigning False to a method reference — a no-op in v4 that had no effect on tokenisation. Removing it is correct, but a one-line comment would help future readers who might assume the removal changed training behaviour and try to restore it properly.
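For anyone tempted to restore it: the flag is a per-call argument, not a tokenizer attribute, so the originally intended effect would look like this (sketch with a placeholder model id):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model-id")  # placeholder
tokenizer.pad_token = tokenizer.eos_token

# Correct way to skip special tokens: pass the flag on the encode call.
ids = tokenizer("example text", add_special_tokens=False)["input_ids"]
```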

from .utils import to_chat, to_tool_calls

_TRANSFORMERS_V5 = Version(_transformers_module.__version__) >= Version("5.0")


_TRANSFORMERS_V5 is only used in one place (the MPS adapter kwargs guard). If the constraint is tightened to v5-only (see version constraint comment above), this flag becomes redundant and can be dropped. If dual v4/v5 support is intended, it would need to gate all the divergence points — the input_ids dict access, the c.layers iteration, and the merge function dispatch.
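If the dual-support route is taken, the flag would have to dispatch at each of those points, roughly like this (illustrative only; legacy_cache_smash is the v4 path this PR removes):

```python
if _TRANSFORMERS_V5:
    merged = merge_dynamic_caches_v5(caches)
else:
    merged = legacy_cache_smash(caches)  # v4 fallback, removed in this PR
```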

Contributor

@planetf1 planetf1 left a comment

comment
