Conversation
Code Review
This pull request integrates the ACE-Step 1.5 music generation model into the framework, adding the necessary DiT, VAE, text encoder, and tokenizer models along with a dedicated pipeline for text-to-music, audio cover, and repainting tasks. The update includes VRAM management support, comprehensive documentation, and various inference and training examples. Technical feedback identifies a missing torch import in the new audio loading operator, an unassigned tensor operation in the audio utility, unused debugging code in the pipeline, and a type hint mismatch in the DiT model's attention mechanism.
```diff
@@ -3,6 +3,7 @@
 import imageio.v3 as iio
 from PIL import Image
+import torchaudio
```

```python
"""
if waveform.dim() == 3:
    waveform = waveform[0]
waveform.cpu()
```
The call to waveform.cpu() returns a new tensor on the CPU but does not modify the original tensor in place. Since the return value is discarded, the waveform tensor remains on its original device (e.g., CUDA), which may cause errors if subsequent code expects a CPU tensor.
```diff
-waveform.cpu()
+waveform = waveform.cpu()
```
```python
# TODO: remove this
newtext = prompt + "\n\n" + lyric_text
```
```python
    position_embeddings: tuple[torch.Tensor, torch.Tensor] = None,
    output_attentions: Optional[bool] = False,
    **kwargs: Unpack[FlashAttentionKwargs],
) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
```
The type hint for the forward method in AceStepAttention specifies a 3-element tuple, but the implementation only returns 2 elements (attn_output, attn_weights). This mismatch can cause issues with static analysis tools or type checkers.
```diff
-) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
+) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
```
ACE-Step Integration
Summary
Integrate ACE-Step 1.5 into DiffSynth-Studio. ACE-Step 1.5 is an open-source music generation model built on the DiT (Diffusion Transformer) architecture; it supports text-to-music, audio cover, repainting, and other tasks, and runs efficiently on consumer-grade hardware (minimum ~3GB VRAM with offload).
Model Components
- `diffsynth/models/ace_step_dit.py`
- `diffsynth/models/ace_step_conditioner.py`
- `diffsynth/models/ace_step_text_encoder.py`
- `diffsynth/models/ace_step_vae.py`
- `diffsynth/models/ace_step_tokenizer.py`

Inference Scripts
- `examples/ace_step/model_inference/Ace-Step1.5.py`
- `examples/ace_step/model_inference/acestep-v15-turbo-shift1.py`
- `examples/ace_step/model_inference/acestep-v15-turbo-shift3.py`
- `examples/ace_step/model_inference/acestep-v15-turbo-continuous.py`
- `examples/ace_step/model_inference/acestep-v15-base.py`
- `examples/ace_step/model_inference/acestep-v15-sft.py`
- `examples/ace_step/model_inference/acestep-v15-xl-base.py`
- `examples/ace_step/model_inference/acestep-v15-xl-sft.py`
- `examples/ace_step/model_inference/acestep-v15-xl-turbo.py`
- `examples/ace_step/model_inference/acestep-v15-base-CoverTask.py`
- `examples/ace_step/model_inference/acestep-v15-base-RepaintTask.py`

Training Scripts
- `examples/ace_step/model_training/full/`
- `examples/ace_step/model_training/validate_full/`
- `examples/ace_step/model_training/lora/`
- `examples/ace_step/model_training/validate_lora/`

Dependencies Added
- audio: `torchaudio`, `torchcodec`

Notes
- Sampling is controlled by `shift` and `num_inference_steps`. Turbo variants use 8 steps; base variants use 16-32 steps.
- The pipeline supports the `text2music`, `cover`, and `repaint` task types via the `task_type` parameter.
- Each model is registered in `model_configs.py` with its weight file SHA256 hash for automatic code variant selection during loading.