Conversation
Code Review
This pull request implements a comprehensive "Strict Resume" feature for Transformers models, enabling the restoration of full training state including optimizer, scheduler, scaler, RNG states, and data progress. Key changes involve implementing load_training_state and read_training_progress across the model, server, and client layers, alongside dataloader enhancements to support sample-level skipping for map-style datasets. Feedback highlights several critical improvements: ensuring deterministic RNG in distributed settings by avoiding unseeded random states, addressing the deprecated use of StopIteration in generators, improving security by using weights_only=True during checkpoint loading, and removing an accidental BOM character in the client generator. Additionally, a more robust approach for re-initializing the dataloader is suggested to avoid modifying private PyTorch attributes.
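The summary above flags deterministic RNG in distributed settings as a critical point. A minimal, torch-free sketch of the idea using only the stdlib `random` module (the `seed_worker` name and `rank` parameter are illustrative, not from the PR):

```python
import random

def seed_worker(base_seed: int, rank: int) -> random.Random:
    """Derive a deterministic, per-rank RNG from a single base seed.

    Using an explicitly seeded Random instance avoids the unseeded
    global state that makes distributed runs non-reproducible.
    """
    return random.Random(base_seed + rank)

# Two workers with the same base seed and rank produce identical streams.
rng_a = seed_worker(1234, rank=0)
rng_b = seed_worker(1234, rank=0)
assert [rng_a.random() for _ in range(3)] == [rng_b.random() for _ in range(3)]

# Different ranks diverge, so per-worker shuffles differ but stay reproducible.
assert seed_worker(1234, rank=1).random() != seed_worker(1234, rank=0).random()
```

The same offset-by-rank pattern applies to `torch.manual_seed` and `numpy.random.default_rng` in the real training loop.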
```python
self.dataloader.__initialized = False
self._rebuild_sampler_stack()
self.dataloader.__initialized = True
```
Accessing and modifying the private attribute `__initialized` of `torch.utils.data.DataLoader` is brittle and relies on internal implementation details of PyTorch that could change. A safer way to update the sampler stack after the dataloader has been created is simply to re-instantiate the underlying `self.dataloader` from the stored `self.dataloader_params`.
```diff
- self.dataloader.__initialized = False
- self._rebuild_sampler_stack()
- self.dataloader.__initialized = True
+ if self.dataloader is not None:
+     self.dataloader = None
+ self._lazy_init_dataloader()
```
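The suggested fix follows a general pattern: instead of toggling a private flag on a live object, keep the constructor arguments around and rebuild the object whenever its configuration changes. A torch-free sketch of that pattern (the `Loader` and `Trainer` names are stand-ins; the real code would re-create `torch.utils.data.DataLoader` from `self.dataloader_params`):

```python
class Loader:
    """Stand-in for torch.utils.data.DataLoader in this sketch."""
    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size

class Trainer:
    def __init__(self, dataset, batch_size):
        # Store the params needed to rebuild the loader later.
        self.dataloader_params = {"dataset": dataset, "batch_size": batch_size}
        self.dataloader = None
        self._lazy_init_dataloader()

    def _lazy_init_dataloader(self):
        if self.dataloader is None:
            self.dataloader = Loader(**self.dataloader_params)

    def update_sampler(self, new_batch_size):
        # Rebuild from stored params instead of mutating private internals.
        self.dataloader_params["batch_size"] = new_batch_size
        self.dataloader = None
        self._lazy_init_dataloader()

t = Trainer(dataset=[1, 2, 3], batch_size=2)
old = t.dataloader
t.update_sampler(new_batch_size=1)
assert t.dataloader is not old          # a fresh loader was constructed
assert t.dataloader.batch_size == 1
```

The cost is re-running the loader constructor, which is cheap relative to relying on a private attribute that any PyTorch release may rename.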
/gemini summary
```python
        )
        response.raise_for_status()

    def load_training_state(self, name: str, **kwargs) -> Dict[str, Any]:
```
What is the difference between `load_training_state` and `read_training_progress`? Could they be merged into one?
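One possible answer to the merge question, sketched as a single hypothetical method that returns both the restored state and the progress counters in one call (the signature and dict keys below are assumptions, not the PR's actual API):

```python
from typing import Any, Dict

def load_training_state(name: str, **kwargs) -> Dict[str, Any]:
    """Hypothetical merged API: return restored state and progress together.

    Instead of two round-trips (load_training_state + read_training_progress),
    a single call could return one dict with both sections.
    """
    # In a real implementation these would be read from the checkpoint `name`.
    state = {"optimizer": {}, "scheduler": {}, "rng": {}}
    progress = {"consumed_train_samples": 0, "epoch": 0}
    return {"state": state, "progress": progress}

result = load_training_state("twinkle-epoch-0")
assert set(result) == {"state", "progress"}
assert "consumed_train_samples" in result["progress"]
```

A counterargument for keeping them separate: the client may want to read progress (a small payload) without transferring the full optimizer state.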
```python
twinkle_path = model.save(
    name=f'twinkle-epoch-{epoch}',
    save_optimizer=True,
    consumed_train_samples=consumed_train_samples,
```
Should this value come from `dataloader.get_consumed_samples()`?
Or `dataloader.get_state()`, which would be more general.
Also, please additionally test torchrun/ray compatibility here, as well as compatibility with both the Megatron and Transformers models.
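The `dataloader.get_state()` suggestion could look roughly like this: a wrapper that tracks consumed samples and exposes generic save/restore hooks (all names here are illustrative, not the PR's API):

```python
class StatefulLoader:
    """Illustrative dataloader wrapper with generic state save/restore."""
    def __init__(self, dataset):
        self.dataset = dataset
        self.consumed_samples = 0

    def __iter__(self):
        # Skip samples already consumed, enabling sample-level resume
        # for map-style datasets.
        for i in range(self.consumed_samples, len(self.dataset)):
            self.consumed_samples += 1
            yield self.dataset[i]

    def get_state(self):
        # More general than get_consumed_samples(): new fields can be
        # added here later without changing the checkpointing call sites.
        return {"consumed_samples": self.consumed_samples}

    def load_state(self, state):
        self.consumed_samples = state["consumed_samples"]

loader = StatefulLoader([10, 20, 30, 40])
it = iter(loader)
next(it), next(it)                       # consume two samples
ckpt = loader.get_state()
assert ckpt == {"consumed_samples": 2}

resumed = StatefulLoader([10, 20, 30, 40])
resumed.load_state(ckpt)
assert list(resumed) == [30, 40]         # resumes after the skipped samples
```

Returning a dict from `get_state()` also gives both the Megatron and Transformers backends a single shape to serialize into the checkpoint.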
```python
adapter_name = kwargs.pop('adapter_name', _default_adapter_name)
optimizer_config = self.optimizer_group[adapter_name]

if not Platform.is_master():
```
Both ray and torchrun need to be verified as correct here, and the Megatron side needs corresponding consideration as well.
…s with resume_from_checkpoint
…lti_lora, and docs
…sumeFromCheckpointRequest
PR type
PR information
Implement full training-state restoration in TransformersModel and MultiLoraModel, including the optimizer, scheduler, RNG configuration, and dataset skipping.