
Ace step#1408

Open
mi804 wants to merge 17 commits into modelscope:main from mi804:ace-step

Conversation

Collaborator

@mi804 mi804 commented Apr 23, 2026

ACE-Step Integration

Summary

Integrate ACE-Step 1.5 into DiffSynth-Studio, supporting:

  • Text-to-Music: Generate complete music from text descriptions + lyrics
  • Audio Cover: Keep the rhythm/melody of source audio, change style or lyrics
  • Repaint: Regenerate audio in specified time ranges while keeping the rest intact
  • LoRA Training: Fine-tune the DiT model with LoRA for custom styles

ACE-Step 1.5 is an open-source music generation model built on the DiT (Diffusion Transformer) architecture. It supports text-to-music, audio cover, repainting, and other tasks, and runs efficiently on consumer-grade hardware (minimum ~3GB VRAM with offload).
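To make the inputs concrete, here is a hypothetical parameter set for a text2music call. The field names mirror the structured parameters described in the notes below (caption, lyrics, bpm, keyscale); the actual pipeline signature in diffsynth/pipelines/ace_step.py may differ.

```python
# Hypothetical input for a text2music generation; the real pipeline
# API in diffsynth/pipelines/ace_step.py may name or group these
# parameters differently.
params = {
    "task_type": "text2music",   # also supported: "cover", "repaint"
    "caption": "upbeat synth-pop with bright pads",
    "lyrics": "[verse]\nCity lights are calling out my name",
    "bpm": 120,
    "keyscale": "C major",
    "num_inference_steps": 8,    # turbo variants use 8 steps
    "shift": 3,                  # flow-matching scheduler shift
}
```

For the cover and repaint tasks, a source audio input and (for repaint) a time range would be added on top of these fields.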

Model Components

| Component | File | Params |
| --- | --- | --- |
| DiT | diffsynth/models/ace_step_dit.py | ~2.8B (base) / ~4B (XL) |
| ACEStepConditioner | diffsynth/models/ace_step_conditioner.py | ~500M (includes ConditionEncoder, LyricEncoder, TimbreEncoder) |
| Text Encoder | diffsynth/models/ace_step_text_encoder.py | 0.6B (Qwen3-Embedding wrapper) |
| VAE | diffsynth/models/ace_step_vae.py | ~168M (CNN architecture with Snake1d activation) |
| Tokenizer | diffsynth/models/ace_step_tokenizer.py | ~210M (ResidualFSQ + AttentionPooler, cover mode only) |

Inference Scripts

| Script | Description |
| --- | --- |
| examples/ace_step/model_inference/Ace-Step1.5.py | Base text2music (turbo variant) |
| examples/ace_step/model_inference/acestep-v15-turbo-shift1.py | Shift=1 turbo variant |
| examples/ace_step/model_inference/acestep-v15-turbo-shift3.py | Shift=3 turbo variant |
| examples/ace_step/model_inference/acestep-v15-turbo-continuous.py | Continuous duration turbo variant |
| examples/ace_step/model_inference/acestep-v15-base.py | Base (non-distilled) variant |
| examples/ace_step/model_inference/acestep-v15-sft.py | SFT fine-tuned variant |
| examples/ace_step/model_inference/acestep-v15-xl-base.py | XL base variant (~4B DiT) |
| examples/ace_step/model_inference/acestep-v15-xl-sft.py | XL SFT variant (~4B DiT) |
| examples/ace_step/model_inference/acestep-v15-xl-turbo.py | XL turbo variant (~4B DiT) |
| examples/ace_step/model_inference/acestep-v15-base-CoverTask.py | Audio cover task |
| examples/ace_step/model_inference/acestep-v15-base-RepaintTask.py | Repaint task |

Training Scripts

  • Full Training: 9 scripts in examples/ace_step/model_training/full/
  • Full Training Validation: 9 scripts in examples/ace_step/model_training/validate_full/
  • LoRA Training: 9 scripts in examples/ace_step/model_training/lora/
  • LoRA Training Validation: 9 scripts in examples/ace_step/model_training/validate_lora/

Dependencies Added

  • Optional dependency group audio: torchaudio, torchcodec

Core dependencies (transformers, einops, accelerate, peft, etc.) were already present in pyproject.toml.

Notes

  • VRAM Management: VRAM offload is enabled by default. Minimum ~3GB VRAM required with offload configuration.
  • Scheduler: Flow Matching scheduler with configurable shift and num_inference_steps. Turbo variants use 8 steps; base variants use 16-32 steps.
  • Task Types: The pipeline supports text2music, cover, and repaint task types via the task_type parameter.
  • LM Not Integrated: The ACE-Step LLM component (used for Simple Mode / prompt expansion) is not included in this integration. Users should provide structured parameters (caption, lyrics, bpm, keyscale, etc.) directly.
  • 9 DiT Variants: The integration supports 9 DiT checkpoint variants: turbo (default), turbo-shift1, turbo-shift3, turbo-continuous, base, sft, xl-base, xl-sft, xl-turbo.
  • Model Hash Routing: Each DiT variant is registered in model_configs.py with its weight file SHA256 hash for automatic code variant selection during loading.
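The shift in the turbo variant names (shift1, shift3) refers to the flow-matching scheduler's timestep shift. As a sketch, the SD3-style shift is a common formulation that warps the schedule toward higher noise levels for larger shift values; whether ACE-Step uses this exact form is an assumption, not something stated in this PR.

```python
def shift_timestep(t: float, shift: float) -> float:
    """SD3-style flow-matching timestep shift (assumed form).

    shift=1 leaves the schedule unchanged; larger shifts spend more
    of the step budget at high noise levels.
    """
    return shift * t / (1 + (shift - 1) * t)

print(shift_timestep(0.5, 1))  # 0.5 (identity)
print(shift_timestep(0.5, 3))  # 0.75
```

Under this formulation, the 8-step turbo schedule and the 16-32-step base schedule would sample `t` uniformly and then apply the shift before querying the model.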

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request integrates the ACE-Step 1.5 music generation model into the framework, adding the necessary DiT, VAE, text encoder, and tokenizer models along with a dedicated pipeline for text-to-music, audio cover, and repainting tasks. The update includes VRAM management support, comprehensive documentation, and various inference and training examples. Technical feedback identifies a missing torch import in the new audio loading operator, an unassigned tensor operation in the audio utility, unused debugging code in the pipeline, and a type hint mismatch in the DiT model's attention mechanism.

@@ -3,6 +3,7 @@
import imageio.v3 as iio
from PIL import Image
import torchaudio

high

The LoadPureAudioWithTorchaudio class introduced in this file uses torch.nn.functional.pad, but torch is not imported. This will cause a NameError at runtime when the operator is executed.

Suggested change:
- import torchaudio
+ import torch, torchaudio
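To illustrate why the torch import is needed: a minimal, hypothetical sketch of the padding call that LoadPureAudioWithTorchaudio would make. The shapes and target length here are invented for the example.

```python
import torch
import torch.nn.functional as F

# Hypothetical mono waveform: 1 channel, 1 second at 44.1 kHz.
waveform = torch.zeros(1, 44100)
target_len = 48000

# Right-pad the last dimension with zeros up to target_len.
# Without `import torch`, this line raises NameError at runtime.
padded = F.pad(waveform, (0, target_len - waveform.shape[-1]))
print(padded.shape)  # torch.Size([1, 48000])
```

Note that `F.pad` takes padding amounts for the last dimension first, as a `(left, right)` pair, which is why only the final axis is padded here.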

Comment thread: diffsynth/utils/data/audio.py (Outdated)
"""
if waveform.dim() == 3:
waveform = waveform[0]
waveform.cpu()

medium

The call to waveform.cpu() returns a copy of the tensor on the CPU but does not modify the original tensor in place. Since the return value is ignored, the waveform tensor remains on its original device (e.g., CUDA), which may lead to errors if subsequent code expects a CPU tensor.

Suggested change:
- waveform.cpu()
+ waveform = waveform.cpu()
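A minimal illustration of the pitfall: `Tensor.cpu()`, like `Tensor.to()`, returns a tensor rather than moving the original in place, so the result must be rebound to the name.

```python
import torch

w = torch.randn(2, 3)
w.cpu()        # return value discarded: w is unchanged on its device
w = w.cpu()    # correct pattern: rebind the name to the CPU tensor
assert w.device.type == "cpu"
```

On a CPU-only machine the bug is invisible, since the tensor is already on the CPU; it only surfaces when the waveform lives on CUDA and downstream code expects a CPU tensor.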

Comment thread: diffsynth/pipelines/ace_step.py (Outdated)
Comment on lines +292 to +293
# TODO: remove this
newtext = prompt + "\n\n" + lyric_text

medium

The variable newtext is assigned but never used within the process method. This appears to be leftover debugging code, as indicated by the # TODO: remove this comment, and should be removed to maintain code quality.

        position_embeddings: tuple[torch.Tensor, torch.Tensor] = None,
        output_attentions: Optional[bool] = False,
        **kwargs: Unpack[FlashAttentionKwargs],
    ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:

medium

The type hint for the forward method in AceStepAttention specifies a 3-element tuple, but the implementation only returns 2 elements (attn_output, attn_weights). This mismatch can cause issues with static analysis tools or type checkers.

Suggested change:
- ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
+ ) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
