
Ace step#1408

Open
mi804 wants to merge 17 commits into modelscope:main from mi804:ace-step

Conversation

Collaborator

@mi804 mi804 commented Apr 23, 2026

ACE-Step Integration

Summary

Integrate ACE-Step 1.5 into DiffSynth-Studio, supporting:

  • Text-to-Music: Generate complete music from text descriptions + lyrics
  • Audio Cover: Keep the rhythm/melody of source audio, change style or lyrics
  • Repaint: Regenerate audio in specified time ranges while keeping the rest intact
  • LoRA Training: Fine-tune the DiT model with LoRA for custom styles

ACE-Step 1.5 is an open-source music generation model built on the DiT (Diffusion Transformer) architecture. It supports text-to-music, audio cover, repainting, and other tasks, and runs efficiently on consumer-grade hardware (minimum ~3GB VRAM with offload).
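To make the inputs concrete, here is a hypothetical parameter set for a text2music call. The field names mirror the structured parameters described in the notes below (caption, lyrics, bpm, keyscale); the actual pipeline signature in diffsynth/pipelines/ace_step.py may differ.

```python
# Hypothetical input for a text2music generation; the real pipeline
# API in diffsynth/pipelines/ace_step.py may name or group these
# parameters differently.
params = {
    "task_type": "text2music",   # also supported: "cover", "repaint"
    "caption": "upbeat synth-pop with bright pads",
    "lyrics": "[verse]\nCity lights are calling out my name",
    "bpm": 120,
    "keyscale": "C major",
    "num_inference_steps": 8,    # turbo variants use 8 steps
    "shift": 3,                  # flow-matching scheduler shift
}
```

For the cover and repaint tasks, a source audio input and (for repaint) a time range would be added on top of these fields.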

Model Components

| Component | File | Params |
| --- | --- | --- |
| DiT | diffsynth/models/ace_step_dit.py | ~2.8B (base) / ~4B (XL) |
| ACEStepConditioner | diffsynth/models/ace_step_conditioner.py | ~500M (includes ConditionEncoder, LyricEncoder, TimbreEncoder) |
| Text Encoder | diffsynth/models/ace_step_text_encoder.py | 0.6B (Qwen3-Embedding wrapper) |
| VAE | diffsynth/models/ace_step_vae.py | ~168M (CNN architecture with Snake1d activation) |
| Tokenizer | diffsynth/models/ace_step_tokenizer.py | ~210M (ResidualFSQ + AttentionPooler, cover mode only) |

Inference Scripts

| Script | Description |
| --- | --- |
| examples/ace_step/model_inference/Ace-Step1.5.py | Base text2music (turbo variant) |
| examples/ace_step/model_inference/acestep-v15-turbo-shift1.py | Shift=1 turbo variant |
| examples/ace_step/model_inference/acestep-v15-turbo-shift3.py | Shift=3 turbo variant |
| examples/ace_step/model_inference/acestep-v15-turbo-continuous.py | Continuous duration turbo variant |
| examples/ace_step/model_inference/acestep-v15-base.py | Base (non-distilled) variant |
| examples/ace_step/model_inference/acestep-v15-sft.py | SFT fine-tuned variant |
| examples/ace_step/model_inference/acestep-v15-xl-base.py | XL base variant (~4B DiT) |
| examples/ace_step/model_inference/acestep-v15-xl-sft.py | XL SFT variant (~4B DiT) |
| examples/ace_step/model_inference/acestep-v15-xl-turbo.py | XL turbo variant (~4B DiT) |
| examples/ace_step/model_inference/acestep-v15-base-CoverTask.py | Audio cover task |
| examples/ace_step/model_inference/acestep-v15-base-RepaintTask.py | Repaint task |

Training Scripts

  • Full Training: 9 scripts in examples/ace_step/model_training/full/
  • Full Training Validation: 9 scripts in examples/ace_step/model_training/validate_full/
  • LoRA Training: 9 scripts in examples/ace_step/model_training/lora/
  • LoRA Training Validation: 9 scripts in examples/ace_step/model_training/validate_lora/

Dependencies Added

  • Optional dependency group audio: torchaudio, torchcodec

Core dependencies (transformers, einops, accelerate, peft, etc.) were already present in pyproject.toml.

Notes

  • VRAM Management: VRAM offload is enabled by default. Minimum ~3GB VRAM required with offload configuration.
  • Scheduler: Flow Matching scheduler with configurable shift and num_inference_steps. Turbo variants use 8 steps; base variants use 16-32 steps.
  • Task Types: The pipeline supports text2music, cover, and repaint task types via the task_type parameter.
  • LM Not Integrated: The ACE-Step LLM component (used for Simple Mode / prompt expansion) is not included in this integration. Users should provide structured parameters (caption, lyrics, bpm, keyscale, etc.) directly.
  • 9 DiT Variants: The integration supports 9 DiT checkpoint variants: turbo (default), turbo-shift1, turbo-shift3, turbo-continuous, base, sft, xl-base, xl-sft, xl-turbo.
  • Model Hash Routing: Each DiT variant is registered in model_configs.py with its weight file SHA256 hash for automatic code variant selection during loading.
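The shift in the turbo variant names (shift1, shift3) refers to the flow-matching scheduler's timestep shift. As a sketch, the SD3-style shift is a common formulation that warps the schedule toward higher noise levels for larger shift values; whether ACE-Step uses this exact form is an assumption, not something stated in this PR.

```python
def shift_timestep(t: float, shift: float) -> float:
    """SD3-style flow-matching timestep shift (assumed form).

    shift=1 leaves the schedule unchanged; larger shifts spend more
    of the step budget at high noise levels.
    """
    return shift * t / (1 + (shift - 1) * t)

print(shift_timestep(0.5, 1))  # 0.5 (identity)
print(shift_timestep(0.5, 3))  # 0.75
```

Under this formulation, the 8-step turbo schedule and the 16-32-step base schedule would sample `t` uniformly and then apply the shift before querying the model.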

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request integrates the ACE-Step 1.5 music generation model into the framework, adding the necessary DiT, VAE, text encoder, and tokenizer models along with a dedicated pipeline for text-to-music, audio cover, and repainting tasks. The update includes VRAM management support, comprehensive documentation, and various inference and training examples. Technical feedback identifies a missing torch import in the new audio loading operator, an unassigned tensor operation in the audio utility, unused debugging code in the pipeline, and a type hint mismatch in the DiT model's attention mechanism.

@@ -3,6 +3,7 @@
import imageio.v3 as iio
from PIL import Image
import torchaudio

high

The LoadPureAudioWithTorchaudio class introduced in this file uses torch.nn.functional.pad, but torch is not imported. This will cause a NameError at runtime when the operator is executed.

Suggested change:
- import torchaudio
+ import torch, torchaudio
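To illustrate why the torch import is needed: a minimal, hypothetical sketch of the padding call that LoadPureAudioWithTorchaudio would make. The shapes and target length here are invented for the example.

```python
import torch
import torch.nn.functional as F

# Hypothetical mono waveform: 1 channel, 1 second at 44.1 kHz.
waveform = torch.zeros(1, 44100)
target_len = 48000

# Right-pad the last dimension with zeros up to target_len.
# Without `import torch`, this line raises NameError at runtime.
padded = F.pad(waveform, (0, target_len - waveform.shape[-1]))
print(padded.shape)  # torch.Size([1, 48000])
```

Note that `F.pad` takes padding amounts for the last dimension first, as a `(left, right)` pair, which is why only the final axis is padded here.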

Comment thread: diffsynth/utils/data/audio.py (Outdated)
"""
if waveform.dim() == 3:
waveform = waveform[0]
waveform.cpu()

medium

The call to waveform.cpu() returns a copy of the tensor on the CPU but does not modify the original tensor in place. Since the return value is ignored, the waveform tensor remains on its original device (e.g., CUDA), which may lead to errors if subsequent code expects a CPU tensor.

Suggested change:
- waveform.cpu()
+ waveform = waveform.cpu()
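A minimal illustration of the pitfall: `Tensor.cpu()`, like `Tensor.to()`, returns a tensor rather than moving the original in place, so the result must be rebound to the name.

```python
import torch

w = torch.randn(2, 3)
w.cpu()        # return value discarded: w is unchanged on its device
w = w.cpu()    # correct pattern: rebind the name to the CPU tensor
assert w.device.type == "cpu"
```

On a CPU-only machine the bug is invisible, since the tensor is already on the CPU; it only surfaces when the waveform lives on CUDA and downstream code expects a CPU tensor.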

Comment thread: diffsynth/pipelines/ace_step.py (Outdated)
Comment on lines +292 to +293
# TODO: remove this
newtext = prompt + "\n\n" + lyric_text

medium

The variable newtext is assigned but never used within the process method. This appears to be leftover debugging code, as indicated by the # TODO: remove this comment, and should be removed to maintain code quality.

        position_embeddings: tuple[torch.Tensor, torch.Tensor] = None,
        output_attentions: Optional[bool] = False,
        **kwargs: Unpack[FlashAttentionKwargs],
    ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:

medium

The type hint for the forward method in AceStepAttention specifies a 3-element tuple, but the implementation only returns 2 elements (attn_output, attn_weights). This mismatch can cause issues with static analysis tools or type checkers.

Suggested change:
- ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
+ ) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
