
[feat] Resume from ckpt #135

Draft
kevssim wants to merge 52 commits into modelscope:main from kevssim:resume_from_ckpt

Conversation

Collaborator

@kevssim kevssim commented Mar 31, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Implement full restoration of training state in TransformersModel and MultiLoraModel, including the optimizer, scheduler, RNG configuration, and dataset skipping.
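The RNG part of the restoration follows a capture/restore pattern. A minimal stdlib-only sketch of that pattern is below (the helper names are illustrative, not code from this PR); for PyTorch the analogous calls are `torch.get_rng_state()`/`torch.set_rng_state()` and, on GPU, `torch.cuda.get_rng_state_all()`/`torch.cuda.set_rng_state_all()`.

```python
import random

def capture_rng():
    # Snapshot every RNG source the training loop draws from.
    return {'python': random.getstate()}

def restore_rng(state):
    # Restoring the snapshot makes the resumed run draw identical values.
    random.setstate(state['python'])

random.seed(0)
saved = capture_rng()
first = [random.random() for _ in range(3)]
restore_rng(saved)
second = [random.random() for _ in range(3)]
assert first == second  # identical draws after restoring the saved state
```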

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request implements a comprehensive "Strict Resume" feature for Transformers models, enabling restoration of the full training state: optimizer, scheduler, scaler, RNG states, and data progress. Key changes implement load_training_state and read_training_progress across the model, server, and client layers, alongside dataloader enhancements that support sample-level skipping for map-style datasets. Feedback highlights several critical improvements:

  • ensure deterministic RNG in distributed settings by avoiding unseeded random states
  • replace the deprecated use of StopIteration inside generators
  • improve security by passing weights_only=True when loading checkpoints
  • remove an accidental BOM character in the client generator

Additionally, a more robust approach for re-initializing the dataloader is suggested, avoiding modification of private PyTorch attributes.
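One of the flagged items, the deprecated use of StopIteration inside a generator, stems from PEP 479: since Python 3.7, a StopIteration raised in a generator body is converted into a RuntimeError. A minimal sketch of the correct pattern (the `take` helper is illustrative, not code from this PR):

```python
def take(iterable, n):
    """Yield at most n items; ends cleanly when the source is exhausted."""
    it = iter(iterable)
    for _ in range(n):
        try:
            yield next(it)
        except StopIteration:
            # PEP 479: `raise StopIteration` here would surface as a
            # RuntimeError; a plain `return` is how a generator finishes.
            return
```

With this fix, `list(take(range(2), 5))` simply ends after two items instead of crashing the consuming loop.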

Comment thread src/twinkle/dataloader/retry_sampler.py Outdated
Comment thread client_tools/client_generator.py Outdated
Comment on lines +137 to +139
self.dataloader.__initialized = False
self._rebuild_sampler_stack()
self.dataloader.__initialized = True
Contributor


Severity: medium

Accessing and modifying the private attribute __initialized of torch.utils.data.DataLoader is brittle and relies on internal implementation details of PyTorch that could change. A safer approach to update the sampler stack after the dataloader has been created is to simply re-instantiate the underlying self.dataloader using the stored self.dataloader_params.

Suggested change — replace

self.dataloader.__initialized = False
self._rebuild_sampler_stack()
self.dataloader.__initialized = True

with

if self.dataloader is not None:
    self.dataloader = None
self._lazy_init_dataloader()
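The re-instantiation approach can be sketched as below; the holder class, the `factory` argument, and the method names are hypothetical stand-ins for the PR's actual wrapper around `torch.utils.data.DataLoader`.

```python
class LoaderHolder:
    """Rebuilds its dataloader from stored params instead of mutating
    the private DataLoader.__initialized flag."""

    def __init__(self, factory, **dataloader_params):
        self._factory = factory            # e.g. torch.utils.data.DataLoader
        self._params = dataloader_params   # kept so the loader can be rebuilt
        self.dataloader = None

    def _lazy_init_dataloader(self):
        # Create the loader on first use (or after it was dropped).
        if self.dataloader is None:
            self.dataloader = self._factory(**self._params)
        return self.dataloader

    def rebuild(self, **updated_params):
        # Drop the old instance and recreate it with the new sampler config.
        self._params.update(updated_params)
        self.dataloader = None
        return self._lazy_init_dataloader()
```

Because the factory and parameters are stored, rebuilding touches no PyTorch internals and survives upstream refactors of DataLoader.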

Comment thread src/twinkle/dataloader/retry_sampler.py Outdated
Comment thread src/twinkle/model/transformers/transformers.py Outdated
@kevssim
Collaborator Author

kevssim commented Apr 1, 2026

/gemini summary

Comment thread cookbook/client/twinkle/self_host/self_cognition.py
Comment thread client_tools/client_generator.py Outdated
)
response.raise_for_status()

def load_training_state(self, name: str, **kwargs) -> Dict[str, Any]:
Collaborator


What is the difference between load_training_state and read_training_progress? Could they be merged into one?
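One way the two calls could be merged, sketched with hypothetical names and fields (not the PR's actual API): a single call returns the full state, with the lightweight progress counters nested under one key so callers that only need progress read just that key.

```python
from typing import Any, Dict

def load_training_state(checkpoint: Dict[str, Any]) -> Dict[str, Any]:
    # Heavyweight state (optimizer/scheduler/RNG) and lightweight progress
    # counters come back in one payload.
    return {
        'optimizer': checkpoint.get('optimizer'),
        'scheduler': checkpoint.get('scheduler'),
        'rng': checkpoint.get('rng'),
        'progress': {
            'consumed_train_samples': checkpoint.get('consumed_train_samples', 0),
            'global_step': checkpoint.get('global_step', 0),
        },
    }
```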

twinkle_path = model.save(
name=f'twinkle-epoch-{epoch}',
save_optimizer=True,
consumed_train_samples=consumed_train_samples,
Collaborator


dataloader.get_consumed_samples()?

Collaborator


Alternatively, dataloader.get_state() would be more generic.
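A get_state()/load_state() pair could look like the sketch below (the wrapper class and its fields are hypothetical); the consumed-samples counter doubles as the sample-skip offset when resuming a map-style dataset.

```python
class ResumableLoader:
    def __init__(self, dataset):
        self.dataset = list(dataset)
        self.consumed = 0  # samples already yielded (the resume offset)

    def __iter__(self):
        # Start after the last consumed sample: sample-level skipping.
        for i in range(self.consumed, len(self.dataset)):
            self.consumed = i + 1
            yield self.dataset[i]

    def get_state(self):
        return {'consumed_train_samples': self.consumed}

    def load_state(self, state):
        self.consumed = state['consumed_train_samples']
```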

Collaborator


Also, please additionally test compatibility with torchrun/ray here, as well as compatibility across both the megatron and transformers model backends.

Comment thread docs/source_en/Components/Model/TransformersModel.md Outdated
Comment thread src/twinkle/model/transformers/transformers.py Outdated
adapter_name = kwargs.pop('adapter_name', _default_adapter_name)
optimizer_config = self.optimizer_group[adapter_name]

if not Platform.is_master():
Collaborator


This needs to be verified for both ray and torchrun; the megatron side needs corresponding consideration as well.

Comment thread src/twinkle/model/transformers/strategy/accelerate.py
@kevssim kevssim marked this pull request as draft April 21, 2026 03:47
