
fix: CPU OOM issue during LoRA training #41

Open

SongwuJob wants to merge 1 commit into OpenMOSS:main from SongwuJob:main

Conversation

@SongwuJob

When fine-tuning with the provided LoRA training script, CPU memory usage continuously increases over time, and the process is eventually killed by the system due to out-of-memory (OOM).

The issue is caused by enabling torch.cuda.memory._record_memory_history(enabled="all"), which records CUDA memory allocation events and keeps them in host (CPU) memory. As training progresses, the accumulated memory history grows without bound, leading to excessive CPU memory consumption and ultimately CPU OOM.
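The actual one-commit diff is not shown here, but a minimal sketch of the kind of change described is below. It assumes the training script calls `torch.cuda.memory._record_memory_history(enabled="all")` unconditionally; the two options shown (disabling recording, or bounding it via the `max_entries` parameter of the same PyTorch API) are illustrative, not the confirmed patch.

```python
import torch

# The training script enables full CUDA memory-history recording, e.g.:
#   torch.cuda.memory._record_memory_history(enabled="all")
# Every allocation/free event is then retained in host (CPU) memory,
# so the history grows without bound over a long LoRA run.

# Option 1 (assumed fix): disable recording during normal training.
# Passing enabled=None stops recording further memory events.
torch.cuda.memory._record_memory_history(enabled=None)

# Option 2 (assumed alternative): if snapshots are still wanted for
# debugging, cap the number of retained events so host memory stays bounded.
torch.cuda.memory._record_memory_history(enabled="all", max_entries=100_000)
```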

