Singing Demos for "VISinger: Variational Inference with adversarial learning for end-to-end singing voice synthesis"

Authors: Yongmao Zhang, Jian Cong, Heyang Xue, Lei Xie, Pengcheng Zhu, Mengxiao Bi

In this paper, we propose VISinger, a complete end-to-end, high-quality singing voice synthesis (SVS) system that directly generates the audio waveform from lyrics and a musical score. Our approach is inspired by VITS, which adopts a VAE-based posterior encoder augmented with a normalizing-flow-based prior encoder and an adversarial decoder to realize complete end-to-end speech generation. VISinger follows the main architecture of VITS but makes substantial improvements to the prior encoder based on the characteristics of singing. First, instead of using phoneme-level mean and variance, we introduce a length regulator and a frame prior network to obtain frame-level mean and variance, modeling the rich acoustic variation in singing. Second, we introduce an F0 predictor to guide the frame prior network, leading to more stable singing performance. Finally, we modify the duration predictor to predict the phoneme-to-note duration ratio, aided by singing note normalization. Experiments on a professional Mandarin singing corpus show that VISinger significantly outperforms both the FastSpeech+neural-vocoder two-stage approach and the oracle VITS.
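The interplay between the duration-ratio prediction and the length regulator described above can be illustrated with a small sketch. This is not the authors' implementation: the function names `split_note_frames` and `length_regulate`, the cumulative-rounding scheme, and the use of plain Python lists in place of tensors are all assumptions made for illustration. It assumes that each note's length in frames is known from the score, that the predictor outputs one ratio per phoneme within the note, and that the length regulator simply repeats each phoneme-level feature for its number of frames.

```python
def split_note_frames(note_frames, ratios):
    """Split one note's frame count among its phonemes.

    note_frames: length of the note in frames (taken from the score).
    ratios: hypothetical predicted phoneme-to-note duration ratios.
    Ratios are renormalized and rounded cumulatively so that the
    per-phoneme frame counts always sum to note_frames.
    """
    total = sum(ratios)
    frames, prev_bound, cum = [], 0, 0.0
    for r in ratios:
        cum += r / total                    # cumulative share of the note
        bound = round(note_frames * cum)    # frame index where this phoneme ends
        frames.append(bound - prev_bound)
        prev_bound = bound
    return frames


def length_regulate(phoneme_feats, frames):
    """Expand phoneme-level features to frame level by repetition."""
    expanded = []
    for feat, n in zip(phoneme_feats, frames):
        expanded.extend([feat] * n)
    return expanded


# Example: a note lasting 8 frames sung with two phonemes.
frames = split_note_frames(8, [0.25, 0.75])
frame_level = length_regulate(["sh", "u"], frames)
```

In this sketch the expanded sequence only repeats phoneme-level features; in VISinger, a frame prior network then operates on the frame-level sequence to produce frame-level mean and variance.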

Selected clips from the subjective evaluation (all clips are Mandarin Chinese singing)

Recording | FS+WaveRNN (Finetuned) | FS+HiFiGAN (Finetuned) | VITS | VISinger

Some long singing sequences from recent popular Chinese songs synthesized by the proposed VISinger

《柠檬树》(Lemon Tree)

《胆小鬼》(Coward)

《没那么简单》(Not So Simple)

《易燃易爆炸》(Flammable and Explosive)