fix: Bumps to transformers==5.0.0 #418
Conversation
The PR description has been updated. Please fill out the template for your PR to be reviewed.
@guicho271828 Why do we have a version pin for transformers in the vllm dependency?
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Just a heads up, when I tried bumping the transformers version it opened a whole can of worms: it required a vllm bump, which required a bump to outlines, which would've required code changes in the backends.
Force-pushed from 7de778d to 32fcaab
Yeah. The vllm/outlines thing seems like something we should be able to work around. @guicho271828 has already looked into removing outlines entirely. Now might be the time to do that if we need to make changes to bump the version. There's also a transformers v4 dependency in docling; I'm not sure what their status is on supporting the latest transformers, though. I think those are the only two blockers. They are both pretty annoying blockers, though :(
Docling is no longer a blocker, but vllm still has a dependency. Watching vllm-project/vllm#30566
Alternatively, we can implement #777 and then we don't need to wait for vllm.
Dropping the vllm backend would mean we just access vllm through its OpenAI endpoint, right? I'm on board with that.
Tracking it here and blocked by this: huggingface/transformers#45507
Waiting on this from the HF repo: huggingface/transformers#45813
```diff
  merged_attention=torch.ones_like(output_complete).to(self._device),
- q_end=len(input_ids[0]),  # type: ignore
+ q_end=input_ids["input_ids"].shape[1],  # type: ignore
  scores=hf_output.scores,
```
post_processing() at L1217 (and L1247) does input_ids["input_ids"].shape[1], but when called from the KV-cache path (_generate_from_context_with_kv_cache), input_ids is a plain torch.Tensor — subscripting it with a string raises TypeError. processing() already handles this with the isinstance guard at L1144; post_processing() needs the same treatment. L1247 is inside a try/except Exception: pass so it silently loses token-usage telemetry rather than crashing, but L1217 is unguarded.
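A minimal sketch of the guard described above, assuming `post_processing()` can receive either a plain tensor (KV-cache path) or a tokenizer `BatchEncoding`/dict (regular path); the helper name is hypothetical:

```python
import torch

def _prompt_length(input_ids) -> int:
    # Handles both call sites: a plain tensor from the KV-cache path and a
    # BatchEncoding/dict from the normal tokenizer path.
    if isinstance(input_ids, torch.Tensor):
        return input_ids.shape[1]
    return input_ids["input_ids"].shape[1]
```

Both `q_end` computations (L1217 and L1247) could then call `_prompt_length(input_ids)` instead of indexing by key directly.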
| "peft>=0.18.1", # Native aLoRA support added in PEFT 0.18.0 | ||
| "transformers>=4.53.2,<5", | ||
| "transformers (>=4.42.0,<6.0.0,!=5.0.*,!=5.1.*,!=5.2.*,!=5.3.*,!=5.5.0)", # 5.5.0 broke granite 4.0 | ||
| "trl==0.19.1", |
The new constraint >=4.42.0,<6.0.0,!=5.0.*,!=5.1.*,!=5.2.*,!=5.3.*,!=5.5.0 widens the lower bound back into v4 territory, but the v4 code paths (legacy_cache_smash, merge_dynamic_caches, tokens_to_legacy_cache) were removed in this PR. DynamicCache.layers and CacheLayerMixin do not exist in v4 — anyone resolving to the old supported range will hit AttributeError at the KV smash call site with no hint about needing to upgrade. If the intent is v5-only (which the code implies), I would suggest >=5.4.0,!=5.5.0,<6.0.0 and drop the exclusion list.
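If the wider range is kept for some reason, one option is to fail fast at import time with a clear message instead of the late `AttributeError`. A rough sketch, where the threshold and message are assumptions:

```python
from packaging.version import Version

import transformers

# Guard the KV-smash module against the v4 range still allowed by the constraint;
# DynamicCache.layers / CacheLayerMixin do not exist there.
if Version(transformers.__version__) < Version("5.4.0"):
    raise ImportError(
        "KV cache smashing requires transformers>=5.4.0; "
        f"found {transformers.__version__}. Please upgrade transformers."
    )
```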
```python
getattr(layer, "is_sliding", False) for layer in getattr(c, "layers", [])
):
    raise ValueError("Check the issue.")
```
raise ValueError("Check the issue.") is a placeholder that will reach users. Granite 4 models (called out in the version-pin comment) use sliding-window attention, so this is a realistic hit. Could this be replaced with something descriptive before the PR leaves draft? e.g. "KV cache smashing does not currently support sliding-window attention layers. Disable prefix caching for this model." — and ideally a link to a tracking issue if one exists.
```python
if __name__ == "__main__":
    from mellea.backends.huggingface import LocalHFBackend
```
The __main__ block is handy as a smoke test but does not belong in a library module — it creates a circular import (kv_block_helpers → huggingface → kv_block_helpers) and would trigger a full model download if anyone runs python -m mellea.backends.kv_block_helpers by accident. scratchpad/ (git-ignored) or docs/kv_smash/ (where kvcache.py already lives) would be a better home for it.
| """Merge multiple v5 DynamicCache objects by concatenating KV states along the time axis.""" | ||
| caches = list(caches) | ||
| assert len(caches) >= 1 | ||
assert len(caches) >= 1 is stripped under python -O — an empty list would silently return an empty DynamicCache(). Since this is a public function, a ValueError would be safer: if not caches: raise ValueError("caches must be non-empty").
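In context, the suggested change would look roughly like this (signature assumed from the surrounding diff):

```python
def merge_dynamic_caches_v5(caches):
    """Merge multiple v5 DynamicCache objects by concatenating KV states along the time axis."""
    caches = list(caches)
    # Explicit check instead of `assert`, so the error survives `python -O`.
    if not caches:
        raise ValueError("caches must be non-empty")
    ...
```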
```diff
  Provides functions for merging ``DynamicCache`` and legacy tuple caches along the
- time axis (``merge_dynamic_caches``, ``legacy_cache_smash``), and
+ time axis (``merge_dynamic_caches_v5``, ``legacy_cache_smash``), and
  ``tokens_to_legacy_cache`` for converting a tokenized prompt into a prefilled KV
```
The module docstring still references legacy_cache_smash and tokens_to_legacy_cache, both removed in this PR. Worth updating to reflect the new exports (prefill_cache_v5, merge_dynamic_caches_v5, merge_v5) before this lands.
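A possible replacement docstring using the export names listed in this comment; the wording, and the assumption that `prefill_cache_v5` takes over the old `tokens_to_legacy_cache` role, are guesses:

```python
"""Helpers for merging transformers v5 KV caches.

Provides ``merge_dynamic_caches_v5`` and ``merge_v5`` for concatenating
``DynamicCache`` objects along the time axis, and ``prefill_cache_v5`` for
converting a tokenized prompt into a prefilled KV cache.
"""
```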
```python
)

if TYPE_CHECKING:
    from transformers.tokenization_utils import PreTrainedTokenizer
```
Minor: removing the if TYPE_CHECKING: block leaves TYPE_CHECKING imported but unused at L9 (from typing import TYPE_CHECKING, Any, overload). Ruff F401 will catch this — just needs dropping from the typing import.
```python
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_special_tokens = False
```
tokenizer.add_special_tokens = False was assigning False to a method reference — a no-op in v4 that had no effect on tokenisation. Removing it is correct, but a one-line comment would help future readers who might assume the removal changed training behaviour and try to restore it properly.
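One way to leave that breadcrumb at the removal site; the note itself is just a suggestion:

```python
tokenizer.pad_token = tokenizer.eos_token
# NOTE: `tokenizer.add_special_tokens = False` used to be assigned here, but that
# only shadowed the bound method and never changed tokenization. To actually skip
# special tokens, pass `add_special_tokens=False` to the tokenizer call instead.
```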
```python
from .utils import to_chat, to_tool_calls

_TRANSFORMERS_V5 = Version(_transformers_module.__version__) >= Version("5.0")
```
_TRANSFORMERS_V5 is only used in one place (the MPS adapter kwargs guard). If the constraint is tightened to v5-only (see version constraint comment above), this flag becomes redundant and can be dropped. If dual v4/v5 support is intended, it would need to gate all the divergence points — the input_ids dict access, the c.layers iteration, and the merge function dispatch.
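If dual support were kept, the gating would look something like the sketch below; this is illustrative only, the dict-vs-tensor difference is taken from the q_end diff earlier in this review, and the helper name is made up:

```python
import transformers
from packaging.version import Version

_TRANSFORMERS_V5 = Version(transformers.__version__) >= Version("5.0")

def _q_end(input_ids) -> int:
    # The new code indexes the tokenizer output as a dict; the old path used a tensor.
    if _TRANSFORMERS_V5:
        return input_ids["input_ids"].shape[1]
    return len(input_ids[0])
```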
Bump to transformers v5
Type of PR
Description
transformers==4.57.6 and also figure out a migration path for legacy cache code #367

This PR updates our KV smash code to use transformers v5. This requires moving away from the Legacy Cache implementation. The code here is originally from @csbobby.
This PR is still a draft. There are several changes needed:
Testing