1. Abstract
Text-to-speech (TTS) synthetic data augmentation has been widely used in speech processing tasks, but its effectiveness in speech separation remains understudied. In this paper, we present SpeakerAugment+ (SA+), a neural TTS-based dynamic data augmentation framework for speech separation. The SA+ framework consists of two modules: a speaker module, which learns a Gaussian Mixture Model (GMM) to characterize the distribution over speaker embeddings in the training data and samples unseen speaker embeddings during inference; and a speech module, which conditions speech synthesis on speaker embeddings with controllable speaker parameters. SA+ incorporates three augmentation techniques (speaker generation, parameter manipulation, and utterance generation) that enhance speaker and utterance diversity from different perspectives. Following the SA+ framework, we design FS2-SA+ and Matcha-SA+, which are based on FastSpeech 2 and Matcha-TTS, respectively. We evaluate SA+ across multiple separation models and datasets, and the results demonstrate a substantial improvement in speech separation performance. Matcha-SA+ generates higher-quality speech and achieves better separation performance in intra-corpus tests. Conversely, FS2-SA+ supports a broader range of speaker parameter adjustments, leading to better generalization. In addition, the relationship between harmonicity and speech separation is a widely researched topic, and our findings indicate that speech generated by neural TTS can function as augmented data to improve speech separation even when it lacks an explicit harmonic structure. This research underscores the effectiveness of neural TTS-based data augmentation in speech separation tasks. We hope that our work offers insights for future studies of data augmentation strategies in speech separation.

Fig. 1: The SA+ framework. The speaker module learns the GMM distribution over speaker embeddings from the training data and samples unseen speaker embeddings during inference. The speech module uses these speaker embeddings to condition the synthesis of speech waveforms, which allows speaker parameters to be manipulated via parameter factors.
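The speaker module described above can be illustrated with a minimal sketch. It assumes a diagonal-covariance GMM fitted with a hand-written EM loop in NumPy and random toy vectors standing in for real speaker embeddings; the function names `fit_diag_gmm` and `sample_speakers`, the number of components, and the embedding size are all hypothetical and not taken from the SA+ implementation.

```python
import numpy as np

def fit_diag_gmm(x, n_components, n_iter=50, seed=0, eps=1e-6):
    """Fit a diagonal-covariance GMM to the rows of x with plain EM."""
    rng = np.random.default_rng(seed)
    n, _ = x.shape
    # Initialise means from random training points, variances from the data.
    means = x[rng.choice(n, n_components, replace=False)].copy()
    variances = np.tile(x.var(axis=0) + eps, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: log-density of every point under every component, shape (n, k).
        diff = x[:, None, :] - means[None, :, :]
        log_prob = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                           + np.sum(np.log(2 * np.pi * variances), axis=1))
        log_resp = log_prob + np.log(weights)
        log_resp -= log_resp.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0) + eps
        weights = nk / nk.sum()
        means = (resp.T @ x) / nk[:, None]
        variances = np.maximum((resp.T @ x ** 2) / nk[:, None] - means ** 2, eps)
    return weights, means, variances

def sample_speakers(weights, means, variances, n, seed=0):
    """Draw n new (unseen) speaker embeddings from the fitted GMM."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(weights), size=n, p=weights)
    noise = rng.standard_normal((n, means.shape[1]))
    return means[comp] + noise * np.sqrt(variances[comp])

# Toy stand-in for training-set speaker embeddings (200 speakers, 8 dims).
rng = np.random.default_rng(0)
train_embeddings = np.concatenate([
    rng.normal(-2.0, 0.5, size=(100, 8)),
    rng.normal(2.0, 0.5, size=(100, 8)),
])
weights, means, variances = fit_diag_gmm(train_embeddings, n_components=2)
new_speakers = sample_speakers(weights, means, variances, n=5, seed=1)
```

In practice a library implementation such as scikit-learn's `GaussianMixture` would likely replace the hand-written EM loop; the sketch only shows the fit-then-sample pattern.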

Fig. 2: The architecture of FS2-SA+. The speaker module learns the GMM distribution over speaker embeddings from the training data and samples unseen speaker embeddings during inference. Following FastSpeech 2, the speech module uses speaker embeddings to condition the synthesis of mel-spectrograms. Pitch, energy, and duration can be manipulated via the variance adaptor, which increases speaker diversity. HiFi-GAN is used to estimate the waveform signals.
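The parameter manipulation in the caption above can be sketched as scaling the phoneme-level pitch, energy, and duration predictions by per-utterance factors before length regulation. This is a rough illustration under stated assumptions: the factor range (0.8 to 1.2), the function names, and the toy values are hypothetical, and whether pitch and energy are scaled in raw or normalized units depends on the actual FS2-SA+ implementation.

```python
import numpy as np

def sample_factors(rng, low=0.8, high=1.2):
    """Draw one scaling factor per speaker parameter (range is hypothetical)."""
    return {name: float(rng.uniform(low, high))
            for name in ("pitch", "energy", "duration")}

def apply_factors(pitch, energy, duration, factors):
    """Scale phoneme-level predictions; durations stay non-negative integers."""
    new_pitch = pitch * factors["pitch"]
    new_energy = energy * factors["energy"]
    new_duration = np.maximum(np.rint(duration * factors["duration"]), 0).astype(int)
    return new_pitch, new_energy, new_duration

def length_regulate(phoneme_features, duration):
    """Repeat each phoneme-level feature vector for its number of frames."""
    return np.repeat(phoneme_features, duration, axis=0)

# Toy phoneme-level predictions for a three-phoneme utterance.
pitch = np.array([120.0, 150.0, 180.0])
energy = np.array([0.5, 0.8, 0.6])
duration = np.array([2, 3, 1])                     # frames per phoneme
hidden = np.arange(6, dtype=float).reshape(3, 2)   # phoneme-level features

# Fixed factors make the effect easy to inspect; sample_factors draws random ones.
fixed = {"pitch": 1.1, "energy": 0.9, "duration": 2.0}
p, e, d = apply_factors(pitch, energy, duration, fixed)
frames = length_regulate(hidden, d)

random_factors = sample_factors(np.random.default_rng(0))
```

With a duration factor of 2.0 every phoneme lasts twice as many frames, so the same text and speaker embedding yield a slower utterance; drawing fresh factors for each training mixture is one plausible way to obtain the dynamic augmentation the abstract describes.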

Fig. 3: The architecture of Matcha-SA+. The speaker module generates diverse speaker embeddings. Following Matcha-TTS, the speech module, which is based on optimal-transport conditional flow matching (OT-CFM), uses speaker embeddings to condition the generation of mel-spectrograms. The duration factor is manipulated through the duration predictor to enhance speaker diversity. HiFi-GAN is used to estimate the waveform signals.
2. FS2-SA+ Samples
We present audio samples randomly synthesized with FS2-SA+ to showcase the diversity of the synthesized speech. We also provide the ground truth and SA samples for comparison with the SA+ results.
1. For the first time in years the Republicans also captured both houses of Congress. | |||||||
Ground Truth | SA-Vocoder | FS2-SA+1 | FS2-SA+2 | FS2-SA+3 | FS2-SA+4 | FS2-SA+5 | FS2-SA+6 |
---|---|---|---|---|---|---|---|
2. Much of the ground beef consumed in the United States comes from dairy cows. | |||||||
Ground Truth | SA-Vocoder | FS2-SA+1 | FS2-SA+2 | FS2-SA+3 | FS2-SA+4 | FS2-SA+5 | FS2-SA+6 |
3. The process by which the lens focuses on external objects is called accommodation. | |||||||
Ground Truth | SA-Vocoder | FS2-SA+1 | FS2-SA+2 | FS2-SA+3 | FS2-SA+4 | FS2-SA+5 | FS2-SA+6 |
4. Hear the waves crashing against the shore. | |||||||
Ground Truth | SA-Vocoder | FS2-SA+1 | FS2-SA+2 | FS2-SA+3 | FS2-SA+4 | FS2-SA+5 | FS2-SA+6 |
5. People are scared says Richard Ross executive director of the Center for the Study of Investor Behavior a research organization in Chicago. | |||||||
Ground Truth | SA-Vocoder | FS2-SA+1 | FS2-SA+2 | FS2-SA+3 | FS2-SA+4 | FS2-SA+5 | FS2-SA+6 |
6. The population lives by herding goats and sheep or by trading. | |||||||
Ground Truth | SA-Vocoder | FS2-SA+1 | FS2-SA+2 | FS2-SA+3 | FS2-SA+4 | FS2-SA+5 | FS2-SA+6 |
7. Two narrow gauge railroads from China enter the city from the northeast and northwest. | |||||||
Ground Truth | SA-Vocoder | FS2-SA+1 | FS2-SA+2 | FS2-SA+3 | FS2-SA+4 | FS2-SA+5 | FS2-SA+6 |
8. Both petroleum and natural gas deposits are scattered through eastern Ohio. | |||||||
Ground Truth | SA-Vocoder | FS2-SA+1 | FS2-SA+2 | FS2-SA+3 | FS2-SA+4 | FS2-SA+5 | FS2-SA+6 |
3. Matcha-SA+ Samples
We present audio samples that are randomly synthesized using Matcha-SA+.
1. He never obtained a secure academic position or permanent employment. | |||||||
Ground Truth | FS2-SA+ | Matcha-SA+1 | Matcha-SA+2 | Matcha-SA+3 | Matcha-SA+4 | Matcha-SA+5 | Matcha-SA+6 |
---|---|---|---|---|---|---|---|
2. For the first time in years the Republicans also captured both houses of Congress. | |||||||
Ground Truth | FS2-SA+ | Matcha-SA+1 | Matcha-SA+2 | Matcha-SA+3 | Matcha-SA+4 | Matcha-SA+5 | Matcha-SA+6 |
3. The process by which the lens focuses on external objects is called accommodation. | |||||||
Ground Truth | FS2-SA+ | Matcha-SA+1 | Matcha-SA+2 | Matcha-SA+3 | Matcha-SA+4 | Matcha-SA+5 | Matcha-SA+6 |
4. Instead of a modest profit at low cost continental by the second quarter the newly expanded unit has struggled with losses. | |||||||
Ground Truth | FS2-SA+ | Matcha-SA+1 | Matcha-SA+2 | Matcha-SA+3 | Matcha-SA+4 | Matcha-SA+5 | Matcha-SA+6 |
5. Military policy was to keep the travel routes open and protect the settled areas. | |||||||
Ground Truth | FS2-SA+ | Matcha-SA+1 | Matcha-SA+2 | Matcha-SA+3 | Matcha-SA+4 | Matcha-SA+5 | Matcha-SA+6 |
6. "Most complaints," he adds, "are made by other lawyers" whose clients have received letters from other attorneys. | |||||||
Ground Truth | FS2-SA+ | Matcha-SA+1 | Matcha-SA+2 | Matcha-SA+3 | Matcha-SA+4 | Matcha-SA+5 | Matcha-SA+6 |
4. Separation Demo
Separated speech using separation models trained with and without SA+.
1. Mixture | Ground Truth | w/o SA+, SI-SNRi=8.2 dB | w/ SA+, SI-SNRi=17.6 dB |
---|---|---|---|
2. Mixture | Ground Truth | w/o SA+, SI-SNRi=4.7 dB | w/ SA+, SI-SNRi=16.4 dB |
3. Mixture | Ground Truth | w/o SA+, SI-SNRi=-0.3 dB | w/ SA+, SI-SNRi=12.9 dB |
4. Mixture | Ground Truth | w/o SA+, SI-SNRi=9.6 dB | w/ SA+, SI-SNRi=15.9 dB |
5. Mixture | Ground Truth | w/o SA+, SI-SNRi=0.4 dB | w/ SA+, SI-SNRi=15.4 dB |
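For readers unfamiliar with the metric reported above, the following is a minimal sketch of the commonly used definition of SI-SNRi: the scale-invariant signal-to-noise ratio of the separated estimate minus that of the unprocessed mixture, both measured against the ground truth. The function names and the synthetic signals are illustrative assumptions; the evaluation code behind the numbers on this page may differ in details such as the epsilon value.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between a 1-D estimate and its reference."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps)
                         / (np.dot(noise, noise) + eps))

def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)

# Synthetic example: white noise stands in for the two speakers.
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
interferer = rng.standard_normal(16000)
mixture = reference + interferer
estimate = reference + 0.1 * interferer   # interferer attenuated tenfold

improvement = si_snr_improvement(estimate, mixture, reference)
```

Because the interferer amplitude is reduced by a factor of ten in this toy setup, the improvement should come out close to 20 dB. A negative value, as in the "w/o SA+" column of example 3, means the output is further from the ground truth than the mixture itself.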