Challenge Tracks

1. TTS Track

Participants train models on the TTS In the Wild (TITW) dataset, a large-scale in-the-wild corpus created by applying a fully automated pipeline to VoxCeleb1, a diverse dataset of over 1,250 speakers recorded in real-world conditions. TITW offers two training subsets:

  • TITW‑Hard: Raw speech segments that are automatically transcribed, segmented, and filtered. This subset is highly challenging because of background noise and varied recording conditions.
  • TITW‑Easy: A refined subset in which the audio is enhanced (e.g., via DEMUCS) and further filtered using DNSMOS scores. It remains noisy compared to studio-quality recordings.
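The score-based selection step behind the Easy subset can be illustrated with a minimal sketch. It is not the TITW pipeline itself: the `Utterance` type, the `quality` field, the function name, and the threshold of 3.0 are all hypothetical, chosen only to show how filtering on a per-utterance quality score (such as one produced by DNSMOS) narrows a larger pool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    utt_id: str      # hypothetical utterance identifier
    quality: float   # hypothetical quality score (e.g. a DNSMOS-style value)


def select_easy_subset(utterances, min_quality=3.0):
    """Keep utterances whose quality score meets the (assumed) threshold."""
    return [u for u in utterances if u.quality >= min_quality]


# Toy pool standing in for already-enhanced segments; the scores are made up.
hard = [
    Utterance("utt_a", 2.1),
    Utterance("utt_b", 3.4),
    Utterance("utt_c", 3.0),
]
easy = select_easy_subset(hard)
print([u.utt_id for u in easy])  # ['utt_b', 'utt_c']
```

In a real setup the scores would come from running a quality estimator on the enhanced audio, and the threshold would be tuned to trade data quantity against quality.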

Together, they comprise approximately 360 hours of paired speech–text data (189 h Hard, 173 h Easy), suited to benchmarking TTS models under realistic “wild” conditions. Baseline experiments show that modern TTS architectures can be trained successfully on TITW‑Easy (achieving UTMOS > 3.0), while TITW‑Hard remains a stretch goal for future methods. Two evaluation protocols allow systematic comparison across models: Known Speaker, Known Text (KSKT) and Known Speaker, Unknown Text (KSUT).

2. SASV Track

The SASV track is built upon SpoofCeleb, a large-scale in-the-wild dataset designed to advance speech deepfake detection and spoofing-aware automatic speaker verification (SASV). SpoofCeleb combines natural “bona fide” speech from over 1,250 speakers with synthetic speech generated by 23 state-of-the-art TTS systems. Unlike prior datasets recorded in clean, controlled environments, SpoofCeleb draws on the VoxCeleb1 corpus, whose speech was collected in diverse, noisy, real-world conditions. It provides:

  • Over 2.5 million audio segments, including both human and spoofed speech.
  • Wide variability in channel conditions, recording environments, and speaking styles, closely reflecting real-world deployment scenarios.
  • Training, validation, and test partitions with carefully designed evaluation protocols.
  • Baseline models and results for both speech deepfake detection and SASV tasks to facilitate fair comparisons and benchmarking.

SpoofCeleb enables the development of single-system SASV models by providing speaker labels alongside diverse spoofed speech samples. This unique resource allows researchers to build and evaluate SASV systems under realistic, challenging conditions, fostering progress in robust, secure, and accessible speech technologies.
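To make the single-system idea concrete, the sketch below scores a toy trial list with one threshold. It assumes the three trial types commonly used in SASV evaluation (target, nontarget, spoof), which are not spelled out in the text above; the function name, scores, and threshold are hypothetical and do not reproduce the official protocol or metric.

```python
from collections import Counter

# Assumed trial types:
#   "target"    : bona fide speech from the claimed speaker    -> should be accepted
#   "nontarget" : bona fide speech from a different speaker    -> should be rejected
#   "spoof"     : synthetic speech imitating the claimed speaker -> should be rejected


def sasv_error_rates(trials, threshold):
    """Return the error rate per trial type for one accept/reject threshold.

    trials: iterable of (score, trial_type) pairs from a single SASV system.
    """
    totals = Counter()
    errors = Counter()
    for score, trial_type in trials:
        accepted = score >= threshold
        should_accept = trial_type == "target"
        totals[trial_type] += 1
        if accepted != should_accept:
            errors[trial_type] += 1
    return {t: errors[t] / totals[t] for t in totals}


# Made-up scores for illustration only.
trials = [
    (0.9, "target"), (0.4, "target"),
    (0.2, "nontarget"), (0.7, "nontarget"),
    (0.1, "spoof"), (0.3, "spoof"),
]
rates = sasv_error_rates(trials, threshold=0.5)
print(rates)  # {'target': 0.5, 'nontarget': 0.5, 'spoof': 0.0}
```

A single score per trial has to separate target speech from both impostors and spoofs, which is why speaker labels and spoofed samples are needed in the same training set.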