The WildSpoof Challenge aims to advance the use of in-the-wild data in two speech processing tasks that respectively generate and detect spoofed speech: Text-to-Speech (TTS) synthesis and Spoofing-aware Automatic Speaker Verification (SASV).
TTS models have traditionally required clean, high-fidelity studio recordings, posing scalability and accessibility barriers. Recently, as TTS models have become more robust, training on in-the-wild speech data has become feasible. SASV, in contrast, is a more recent field that emerged as ASV technology matured. As the malicious use of TTS technology grows, extending ASV to reject synthesized speech has become a critical research focus.
The lack of appropriate speech resources has hindered progress in both tasks. Most TTS studies using in-the-wild data still rely on datasets with artificially added noise rather than genuine real-world recordings, making results difficult to compare and reproduce. Interest in SASV was demonstrated by the success of the SASV 2022 Challenge, organized by our team. However, the absence of large-scale datasets with both speaker and spoof labels prevented participants from developing integrated SASV systems; only fusion systems were explored. Furthermore, despite advancements in TTS, the limited spoofing detection datasets, collected mostly in clean, controlled environments, discouraged participants from building single, end-to-end systems. This lack of diverse, real-world data has not only impeded the development of more robust systems but also slowed progress in the field. Consequently, current SASV models are prone to overfitting on existing datasets and may fail to keep up with the evolving sophistication of modern TTS techniques.
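To make the distinction between fusion and integrated systems concrete, the following is a minimal sketch of score-level fusion in the style explored at SASV 2022: two independently trained subsystems, an ASV model and a spoofing countermeasure (CM), each emit a score, and the scores are combined by a weighted sum. The function names, equal weighting, and threshold are illustrative assumptions, not part of any challenge specification.

```python
# Minimal sketch of the score-level fusion approach explored at SASV 2022,
# contrasted with the integrated (single-model) SASV systems this challenge
# targets. Names, weights, and threshold are illustrative assumptions.

def sasv_fusion_score(asv_score: float, cm_score: float,
                      asv_weight: float = 0.5) -> float:
    """Combine scores from two independently trained subsystems:
    an ASV score (e.g., cosine similarity between speaker embeddings)
    and a countermeasure (CM) score, where higher means more likely
    bona fide speech."""
    return asv_weight * asv_score + (1.0 - asv_weight) * cm_score


def accept_trial(asv_score: float, cm_score: float,
                 threshold: float = 0.5) -> bool:
    """Accept only when the fused score clears the threshold, so a trial
    can be rejected either for a speaker mismatch or for spoofing."""
    return sasv_fusion_score(asv_score, cm_score) >= threshold


if __name__ == "__main__":
    # Bona fide target speaker: both subsystems score high -> accept.
    print(accept_trial(asv_score=0.82, cm_score=0.91))  # True
    # Spoofed speech mimicking the target: ASV is fooled, CM is not -> reject.
    print(accept_trial(asv_score=0.85, cm_score=0.05))  # False
```

An integrated SASV model, by contrast, would produce a single decision score directly from the enrollment and test utterances; training such a model requires data carrying both speaker and spoof labels, which is precisely what has been missing.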
To this end, we propose the “WildSpoof” challenge, featuring two tracks with coherent training and evaluation protocols, datasets, and baselines. The challenge is made possible by the recently introduced SpoofCeleb dataset. Its human speech subset provides in-the-wild data for TTS training, alongside multiple baseline TTS models. For the first time, SpoofCeleb offers participants the opportunity to develop single SASV models using data from over a thousand speakers and synthesized speech generated by more than 20 different TTS systems. The two tracks are designed to interact, enabling SASV participants to evaluate their models on synthesized speech from the TTS track, and vice versa.