FINALLY: fast and universal speech enhancement with studio-like quality


Contents

  1. Abstract
  2. Architecture
  3. Real data demo
  4. Comparison with other methods
  5. Examples of clusters obtained during LMOS studies



Abstract

In this paper, we address the challenge of speech enhancement in real-world recordings, which often contain various forms of distortion, such as background noise, reverberation, and microphone artifacts. We revisit the use of Generative Adversarial Networks (GANs) for speech enhancement and theoretically show that GANs are naturally inclined to seek the point of maximum density within the conditional clean speech distribution, which, as we argue, is essential for the speech enhancement task. We study various feature extractors for perceptual loss to facilitate the stability of adversarial training, developing a methodology for probing the structure of the feature space. This leads us to integrate a WavLM-based perceptual loss into the MS-STFT adversarial training pipeline, creating an effective and stable training procedure for the speech enhancement model. The resulting model, which we refer to as FINALLY, builds upon the HiFi++ architecture, augmented with a WavLM encoder and a novel training pipeline. Empirical results on various datasets confirm our model's ability to produce clear, high-quality speech at 48 kHz, achieving state-of-the-art performance in the field of speech enhancement.
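Below is a minimal PyTorch sketch of how a WavLM-based perceptual loss can be combined with MS-STFT adversarial training in the generator objective. The loss weights (lambda_feat, lambda_adv), the plain L1 distance over hidden states, and the LS-GAN formulation are illustrative assumptions; the discriminators argument is a stand-in for the MS-STFT sub-discriminators, not the paper's exact implementation.

import torch
import torch.nn.functional as F
from transformers import WavLMModel

# Frozen WavLM-large encoder used purely as a feature extractor (16 kHz mono input).
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()
for p in wavlm.parameters():
    p.requires_grad_(False)

def perceptual_loss(enhanced, clean):
    # L1 distance between WavLM hidden representations of enhanced and clean speech.
    # enhanced, clean: (batch, num_samples) waveforms.
    f_enh = wavlm(enhanced).last_hidden_state
    f_cln = wavlm(clean).last_hidden_state
    return F.l1_loss(f_enh, f_cln)

def generator_loss(enhanced, clean, discriminators, lambda_feat=1.0, lambda_adv=1.0):
    # discriminators: iterable of MS-STFT sub-discriminators (placeholders here),
    # each mapping a waveform to real/fake logits. LS-GAN generator objective.
    adv = sum(((d(enhanced) - 1.0) ** 2).mean() for d in discriminators)
    return lambda_feat * perceptual_loss(enhanced, clean) + lambda_adv * adv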



Architecture



The model is a six-component neural network consisting of the WavLM-large, SpectralUNet, Upsampler, WaveUNet, SpectralMaskNet, and Upsample WaveUNet modules. SpectralUNet performs initial preprocessing of the audio in the spectral domain using two-dimensional convolutions. Additionally, the SSL features obtained from WavLM-large are added to the spectral ones. The Upsampler is a HiFi-GAN-generator-based module that increases the temporal resolution of the input tensor, mapping it to the waveform domain. WaveUNet performs post-processing in the waveform domain, improving the output of the Upsampler by incorporating phase information gleaned directly from the raw input waveform. SpectralMaskNet then applies spectrum-based post-processing to remove any artifacts left by WaveUNet. The model thus alternates between the time and frequency domains, allowing for effective audio restoration. Finally, the Upsample WaveUNet is a learnable sampling-rate upsampler: a WaveUNet with an additional convolutional upsampling block in the decoder that increases the temporal resolution by a factor of three.
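A structural sketch of this data flow is given below. The class name and the submodules passed to the constructor are placeholders (assumptions standing in for the actual HiFi++-style implementations), and details such as feature projection and shape alignment between WavLM outputs and spectral features are omitted; only the ordering of operations follows the description above.

import torch.nn as nn

class FinallyModel(nn.Module):
    # Placeholder composition of the six modules; submodule internals are assumed.
    def __init__(self, wavlm, spectral_unet, upsampler, wave_unet,
                 spectral_masknet, upsample_wave_unet):
        super().__init__()
        self.wavlm = wavlm                            # frozen SSL feature extractor
        self.spectral_unet = spectral_unet            # 2-D conv preprocessing of the spectrogram
        self.upsampler = upsampler                    # HiFi-GAN-generator-based upsampler
        self.wave_unet = wave_unet                    # waveform-domain post-processing
        self.spectral_masknet = spectral_masknet      # spectrum-based artifact removal
        self.upsample_wave_unet = upsample_wave_unet  # 3x sampling-rate upsampler

    def forward(self, wav, spec):
        x = self.spectral_unet(spec) + self.wavlm(wav)  # add SSL features to spectral ones
        x = self.upsampler(x)                           # map to the waveform domain
        x = self.wave_unet(x, wav)                      # reuse phase info from the raw input
        x = self.spectral_masknet(x)                    # remove residual spectral artifacts
        return self.upsample_wave_unet(x)               # upsample the sampling rate 3x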



Real data demo

The model is designed to solve the universal speech enhancement problem and is therefore capable of handling diverse distortion types.
[Audio examples: Input / Output]



Comparison with other methods


Comparison with diffusion

As can be seen, the diffusion model hallucinates when the input is significantly degraded. Our model, in contrast, provides more precise content restoration, sometimes slightly sacrificing perceptual quality, and thus achieves better overall enhancement quality.
[Audio examples: Input / UNIVERSE / Ours / Ground truth]


All example samples

[Audio examples: Input / UNIVERSE / Ours / Ground truth]

[Audio examples: Input / HiFi-GAN-2 / Ours]



Examples of clusters obtained during LMOS studies

As mentioned in our paper, we generated the clusters with the help of VITS. In this section we provide examples from different clusters. As can be heard, the diversity of samples within a single cluster is not caused by mismatches in phrase, speaker, or phoneme duration. WavLM tends to preserve this structure, whereas the L2 waveform space, for instance, usually does not.
[Audio examples: samples from Cluster 1 / Cluster 2 / Cluster 3 / Cluster 4]
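As a rough illustration of this kind of feature-space probe, the sketch below compares how compactly the utterances of one cluster sit in WavLM feature space versus raw-waveform L2 space. The time-pooling of WavLM features and the mean pairwise distance statistic are our own illustrative assumptions, not the exact LMOS methodology from the paper.

import torch
from transformers import WavLMModel

wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

@torch.no_grad()
def cluster_spread(waveforms):
    # waveforms: (n, num_samples) equal-length 16 kHz utterances from one cluster.
    feats = wavlm(waveforms).last_hidden_state.mean(dim=1)  # time-pooled WavLM features
    return {
        "wavlm_dist": torch.pdist(feats).mean().item(),       # mean pairwise L2 in feature space
        "raw_l2_dist": torch.pdist(waveforms).mean().item(),  # mean pairwise L2 on raw waveforms
    }

# A cluster with low WavLM spread but high raw-L2 spread matches the behavior
# described above: WavLM preserves the perceptual structure, while plain L2 does not.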