Challenge Description

The WildSpoof Challenge aims to advance the use of in-the-wild data in two speech processing tasks that generate and detect spoofed speech: Text-to-Speech (TTS) and Spoofing-aware Automatic Speaker Verification (SASV).

TTS models usually require clean, high-fidelity studio recordings, posing scalability and accessibility barriers. Recently, with the improved robustness of TTS models, training on in-the-wild speech data has become feasible. SASV, by contrast, is a more recent field that emerged as ASV technology matured. As the malicious use of TTS technology grows, extending ASV to reject synthesized speech has become a critical research focus.

The lack of appropriate speech resources has hindered progress in both tasks. Most TTS studies using in-the-wild data still rely on artificially synthesized noisy datasets rather than real-world recordings, making results difficult to compare and reproduce. Interest in SASV has been demonstrated by the success of the SASV 2022 Challenge, organized by our team. However, the absence of large-scale datasets with both speaker and spoof labels has restricted participants from developing integrated SASV systems; only fusion systems were explored. Furthermore, despite advancements in TTS, the limited spoofing detection datasets, mostly from clean, controlled environments, have dissuaded participants from building single, end-to-end systems. This lack of diverse, real-world data has not only impeded the development of more robust systems but also slowed progress in the field. Consequently, current SASV models are prone to overfitting on existing datasets and may fail to keep up with the evolving sophistication of modern TTS techniques.

To this end, we propose the “WildSpoof” challenge, featuring two tracks with coherent training and evaluation protocols, datasets, and baselines. This challenge is made possible by the recently introduced SpoofCeleb dataset. Its human speech subset provides in-the-wild data for TTS training, alongside multiple baseline TTS models. SpoofCeleb, for the first time, offers participants the opportunity to develop single SASV models with data from over a thousand speakers and synthesized speech samples generated by more than 20 different TTS systems. The two tracks are designed to interact, enabling SASV participants to evaluate their models on synthesized speech from the TTS track, and vice versa.

Why do we need to foster research on TTS and SASV using “in-the-wild” data?

  • Democratizing TTS development: Research on in-the-wild TTS has the potential to democratize TTS technology by eliminating the dependence on professional recording environments. This drastically reduces data collection costs and enables broader accessibility for users.
  • TTS with robustness and diversity: In-the-wild data adds variability in channel conditions, speaking styles, accents, and noise, leading to more generalized and adaptable TTS models.
  • Creating realistic threats in SASV: The majority of SASV systems are trained and tested under clean conditions, which do not reflect real-world deployment environments. In-the-wild TTS data simulates more realistic attack scenarios, so SASV systems trained and evaluated on it should be more secure and better prepared for future attacks.
  • Encouraging co-evolution: By creating a shared benchmark where the output of in-the-wild TTS systems becomes input for SASV evaluation, and vice versa, we simulate an adversarial co-evolution between synthesis and detection systems, which mirrors real-world dynamics.

Why is this challenge timely?

  • Recent advances in in-the-wild TTS: Until just a few years ago, training TTS models on uncurated data was considered too noisy and unreliable. However, breakthroughs in model architectures and self-supervised learning have now made this a tractable and promising direction.
  • Urgent need for more realistic SASV evaluation: As synthetic speech becomes harder to detect, the mismatch between controlled training data and real-world conditions is widening. This challenge provides a critical testbed to stress-test detection systems under realistic scenarios.