fix: Bumps to transformers==5.0.0 #418

Open

nrfulton wants to merge 11 commits into generative-computing:main from nrfulton:bobby_and_nathan/transformers_v5

Conversation

@nrfulton
Member

@nrfulton nrfulton commented Feb 5, 2026

Bump to transformers v5

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

This PR updates our KV smash code to use transformers v5. This requires moving away from the legacy Cache implementation. The code here is originally from @csbobby.

This PR is still a draft. There are several changes needed:

  • Investigate the reason for our vllm transformers pin
  • Ask Docling folks if they plan on removing their transformers version upper bound / investigate work-arounds if not.
  • Update huggingface.py backend and unit tests.

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation once the rest of the PR is populated)

@nrfulton nrfulton requested a review from guicho271828 February 5, 2026 23:19
@github-actions
Contributor

github-actions Bot commented Feb 5, 2026

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@nrfulton
Member Author

nrfulton commented Feb 5, 2026

@guicho271828 Why do we have a version pin for transformers in the vllm dependency?

@mergify

mergify Bot commented Feb 5, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@psschwei
Member

psschwei commented Feb 6, 2026

Just a heads up, when I tried bumping the transformers version it opened a whole can of worms: it required a vllm bump, which required a bump to outlines, which would've required code changes in the backends.
I didn't dig into it too deeply though (had that as one of my todos)

@nrfulton nrfulton assigned nrfulton and unassigned nrfulton Feb 6, 2026
@nrfulton nrfulton force-pushed the bobby_and_nathan/transformers_v5 branch from 7de778d to 32fcaab Compare February 6, 2026 22:31
@nrfulton
Member Author

nrfulton commented Feb 6, 2026

Just a heads up, when I tried bumping the transformers version it opened a whole can of worms: it required a vllm bump, which required a bump to outlines, which would've required code changes in the backends. I didn't dig into it too deeply though (had that as one of my todos)

Yeah.

The vllm/outlines thing seems like something we should be able to work around. @guicho271828 has already looked into removing outlines entirely. Now might be the time to do that if we need to make changes to bump the version.

There's also a transformers v4 dependency in docling; I'm not sure what their status is on supporting the latest transformers, though.

I think those are the only two blockers. They are both pretty annoying blockers, though :(

@nrfulton
Member Author

nrfulton commented Apr 2, 2026

Docling is no longer a blocker, but vllm still has a dependency on transformers v4.

Watching vllm-project/vllm#30566

@nrfulton
Member Author

nrfulton commented Apr 2, 2026

Alternatively, we can implement #777 and then we don't need to wait for vllm.

@psschwei
Member

psschwei commented Apr 2, 2026

Alternatively, we can implement #777 and then we don't need to wait for vllm.

Dropping the vllm backend would mean we just access vllm through its OpenAI endpoint, right? I'm on board with that.

@github-actions github-actions Bot added the bug Something isn't working label Apr 8, 2026
@avinash2692
Member

Tracking it here; blocked by huggingface/transformers#45507.

@avinash2692 avinash2692 marked this pull request as ready for review April 29, 2026 14:45
@avinash2692 avinash2692 requested review from a team as code owners April 29, 2026 14:45
@avinash2692
Member

Waiting on this from the HF repo: huggingface/transformers#45813

merged_attention=torch.ones_like(output_complete).to(self._device),
q_end=len(input_ids[0]), # type: ignore
q_end=input_ids["input_ids"].shape[1], # type: ignore
scores=hf_output.scores,

post_processing() at L1217 (and L1247) does input_ids["input_ids"].shape[1], but when called from the KV-cache path (_generate_from_context_with_kv_cache), input_ids is a plain torch.Tensor — subscripting it with a string raises TypeError. processing() already handles this with the isinstance guard at L1144; post_processing() needs the same treatment. L1247 is inside a try/except Exception: pass so it silently loses token-usage telemetry rather than crashing, but L1217 is unguarded.
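A minimal sketch of the shared guard this comment is asking for, assuming input_ids is either the tokenizer's mapping output or a plain tensor (the helper name is hypothetical; only the length lookup is shown):

```python
import torch

def _prompt_length(input_ids) -> int:
    """Prompt length for either a tokenizer mapping or a raw tensor (hypothetical helper)."""
    if isinstance(input_ids, torch.Tensor):
        # KV-cache path (_generate_from_context_with_kv_cache) passes a plain
        # tensor of shape (batch, seq_len).
        return input_ids.shape[1]
    # Normal path: BatchEncoding / dict from the tokenizer; same shape under "input_ids".
    return input_ids["input_ids"].shape[1]
```

Both post_processing() call sites could then pass q_end=_prompt_length(input_ids).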

Comment thread pyproject.toml
"peft>=0.18.1", # Native aLoRA support added in PEFT 0.18.0
"transformers>=4.53.2,<5",
"transformers (>=4.42.0,<6.0.0,!=5.0.*,!=5.1.*,!=5.2.*,!=5.3.*,!=5.5.0)", # 5.5.0 broke granite 4.0
"trl==0.19.1",

The new constraint >=4.42.0,<6.0.0,!=5.0.*,!=5.1.*,!=5.2.*,!=5.3.*,!=5.5.0 widens the lower bound back into v4 territory, but the v4 code paths (legacy_cache_smash, merge_dynamic_caches, tokens_to_legacy_cache) were removed in this PR. DynamicCache.layers and CacheLayerMixin do not exist in v4 — anyone resolving to the old supported range will hit AttributeError at the KV smash call site with no hint about needing to upgrade. If the intent is v5-only (which the code implies), I would suggest >=5.4.0,!=5.5.0,<6.0.0 and drop the exclusion list.
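Whichever range is chosen, a cheap import-time guard in the KV-smash module (a sketch, not part of this PR) would turn the eventual AttributeError into an actionable message:

```python
from packaging.version import Version

import transformers

# Hypothetical fail-fast check for the v5-only code paths (DynamicCache.layers,
# CacheLayerMixin): raise a clear error instead of an AttributeError at the call site.
if Version(transformers.__version__) < Version("5.4.0"):
    raise ImportError(
        f"KV cache smashing requires transformers>=5.4.0, found {transformers.__version__}."
    )
```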

getattr(layer, "is_sliding", False) for layer in getattr(c, "layers", [])
):
raise ValueError("Check the issue.")


raise ValueError("Check the issue.") is a placeholder that will reach users. Granite 4 models (called out in the version-pin comment) use sliding-window attention, so this is a realistic hit. Could this be replaced with something descriptive before the PR leaves draft? e.g. "KV cache smashing does not currently support sliding-window attention layers. Disable prefix caching for this model." — and ideally a link to a tracking issue if one exists.



if __name__ == "__main__":
from mellea.backends.huggingface import LocalHFBackend

The __main__ block is handy as a smoke test but does not belong in a library module — it creates a circular import (kv_block_helpers → huggingface → kv_block_helpers) and would trigger a full model download if anyone runs python -m mellea.backends.kv_block_helpers by accident. scratchpad/ (git-ignored) or docs/kv_smash/ (where kvcache.py already lives) would be a better home for it.

"""Merge multiple v5 DynamicCache objects by concatenating KV states along the time axis."""
caches = list(caches)
assert len(caches) >= 1


assert len(caches) >= 1 is stripped under python -O — an empty list would silently return an empty DynamicCache(). Since this is a public function, a ValueError would be safer: if not caches: raise ValueError("caches must be non-empty").
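Concretely, the top of the merge helper could look like this (a sketch; the signature is assumed from the diff and the body is elided):

```python
from transformers import DynamicCache

def merge_dynamic_caches_v5(caches) -> DynamicCache:
    """Merge DynamicCache objects along the time axis."""
    caches = list(caches)
    if not caches:
        # Unlike an assert, this survives `python -O`.
        raise ValueError("caches must be non-empty")
    ...
```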

Provides functions for merging ``DynamicCache`` and legacy tuple caches along the
time axis (``merge_dynamic_caches``, ``legacy_cache_smash``), and
time axis (``merge_dynamic_caches_v5``, ``legacy_cache_smash``), and
``tokens_to_legacy_cache`` for converting a tokenized prompt into a prefilled KV

The module docstring still references legacy_cache_smash and tokens_to_legacy_cache, both removed in this PR. Worth updating to reflect the new exports (prefill_cache_v5, merge_dynamic_caches_v5, merge_v5) before this lands.

Comment thread mellea/backends/openai.py
)

if TYPE_CHECKING:
from transformers.tokenization_utils import PreTrainedTokenizer

Minor: removing the if TYPE_CHECKING: block leaves TYPE_CHECKING imported but unused at L9 (from typing import TYPE_CHECKING, Any, overload). Ruff F401 will catch this — just needs dropping from the typing import.

Comment thread cli/alora/train.py
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_special_tokens = False


tokenizer.add_special_tokens = False was assigning False to a method reference — a no-op in v4 that had no effect on tokenisation. Removing it is correct, but a one-line comment would help future readers who might assume the removal changed training behaviour and try to restore it properly.
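For anyone tempted to restore it: the flag is a per-call argument, not a tokenizer attribute, so the originally intended effect would look like this (sketch with a placeholder model id):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model-id")  # placeholder
tokenizer.pad_token = tokenizer.eos_token

# Correct way to skip special tokens: pass the flag on the encode call.
ids = tokenizer("example text", add_special_tokens=False)["input_ids"]
```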

from .utils import to_chat, to_tool_calls

_TRANSFORMERS_V5 = Version(_transformers_module.__version__) >= Version("5.0")


_TRANSFORMERS_V5 is only used in one place (the MPS adapter kwargs guard). If the constraint is tightened to v5-only (see version constraint comment above), this flag becomes redundant and can be dropped. If dual v4/v5 support is intended, it would need to gate all the divergence points — the input_ids dict access, the c.layers iteration, and the merge function dispatch.
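If the dual-support route is taken, the flag would have to dispatch at each of those points, roughly like this (illustrative only; legacy_cache_smash is the v4 path this PR removes):

```python
if _TRANSFORMERS_V5:
    merged = merge_dynamic_caches_v5(caches)
else:
    merged = legacy_cache_smash(caches)  # v4 fallback, removed in this PR
```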

Contributor

@planetf1 planetf1 left a comment

comment
