Participants will train models using the TTS In the Wild (TITW) dataset, a large-scale, in-the-wild corpus created by applying a fully automated pipeline to VoxCeleb1, a diverse dataset of over 1,250 speakers recorded in real-world conditions. TITW offers two training subsets: TITW-Easy and TITW-Hard.
Together, they comprise approximately 360 hours of paired speech–text data (173 h Hard, 189 h Easy), making TITW well suited for benchmarking TTS models under realistic “wild” conditions. Baseline experiments demonstrate that modern TTS architectures can train successfully on TITW-Easy (achieving UTMOS > 3.0), while TITW-Hard remains a stretch goal for future methods. Two evaluation protocols, Known Speaker, Known Text (KSKT) and Known Speaker, Unknown Text (KSUT), allow systematic comparison across models.
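The distinction between the two protocols can be sketched as a simple partition rule over evaluation utterances. The function and field names below are illustrative assumptions, not part of any official TITW tooling: both protocols assume the speaker was seen during training, and they differ only in whether the text was also seen.

```python
def classify_protocol(utterance, train_speakers, train_texts):
    """Hypothetical helper assigning an evaluation utterance to a protocol.

    KSKT: speaker and text both appeared in training data.
    KSUT: speaker appeared in training, but the text did not.
    """
    if utterance["speaker"] not in train_speakers:
        return None  # unknown speaker: outside both protocols
    if utterance["text"] in train_texts:
        return "KSKT"  # known speaker, known text
    return "KSUT"      # known speaker, unknown text


# Toy example with made-up identifiers
train_speakers = {"spk_001", "spk_002"}
train_texts = {"hello world"}

print(classify_protocol({"speaker": "spk_001", "text": "hello world"},
                        train_speakers, train_texts))  # KSKT
print(classify_protocol({"speaker": "spk_001", "text": "a new sentence"},
                        train_speakers, train_texts))  # KSUT
```

In practice the split is defined by the dataset's released partitions; this sketch only captures the membership logic behind the two protocol names.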
The SASV track is built upon SpoofCeleb, a large-scale in-the-wild dataset designed to advance speech deepfake detection and spoofing-robust automatic speaker verification (SASV). SpoofCeleb combines natural “bona fide” speech from over 1,250 speakers with synthetic speech generated by 23 state-of-the-art TTS systems. Unlike prior datasets recorded in clean, controlled environments, SpoofCeleb draws its speech from the VoxCeleb1 corpus, collected under real-world, diverse, and noisy conditions. It provides:
SpoofCeleb enables the development of single-system SASV models by providing speaker labels alongside diverse spoofed speech samples. This unique resource allows researchers to build and evaluate SASV systems under realistic, challenging conditions, fostering progress in robust, secure, and accessible speech technologies.
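A single-system SASV model ultimately emits one score per trial, rewarding speech that both matches the enrolled speaker and appears bona fide. One common recipe, shown here as a hedged sketch rather than SpoofCeleb's official baseline, fuses a speaker-verification score (cosine similarity between embeddings) with a countermeasure (CM) score; the threshold and embedding values are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two speaker embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sasv_score(enroll_emb, test_emb, cm_score):
    """Score-sum fusion: high only when the test speech matches the
    enrolled speaker AND the CM rates it as bona fide."""
    asv = cosine_similarity(enroll_emb, test_emb)
    return asv + cm_score


# Toy trial: identical embeddings, confident bona-fide CM score
score = sasv_score([1.0, 0.0], [1.0, 0.0], cm_score=0.9)
accept = score > 1.5  # decision threshold is an illustrative assumption
```

Score-sum fusion is only the simplest option; jointly trained single-system models replace the two hand-combined scores with one learned decision, which is precisely what SpoofCeleb's speaker labels plus spoofed samples make possible.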