Part of Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021), Round 2
Zhiyuan Tang, Dong Wang, Yanguang Xu, Jianwei Sun, Xiaoning Lei, Shuaijiang Zhao, Cheng Wen, Xingjun Tan, Chuandong Xie, Shuran Zhou, Rui Yan, Chenjia Lv, Yang Han, Wei Zou, Xiangang Li
This paper introduces an open-source speech dataset, KeSpeech, which contains 1,542 hours of speech recorded by 27,237 speakers in 34 cities in China, covering standard Mandarin and eight of its subdialects. The new dataset possesses several properties. First, it provides multiple labels, including content transcription, speaker identity, and subdialect, and thus supports a variety of speech processing tasks, such as speech recognition, speaker recognition, and subdialect identification, as well as advanced techniques like multi-task learning and conditional learning. Second, some text samples were recorded in parallel in both standard Mandarin and a particular subdialect, enabling new applications such as subdialect style conversion. Third, the number of speakers is much larger than in other open-source datasets, making KeSpeech suitable for tasks that require training data from a large number of speakers. Finally, the speech was recorded in two phases, which opens the opportunity to study the time-variance property of human speech. We present the design principles of the KeSpeech dataset and four baseline systems built on the new data resource: speech recognition, speaker verification, subdialect identification, and voice conversion. The dataset is free for all academic use.
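To illustrate how the multiple labels (content transcription, speaker identity, subdialect, and recording phase) might be consumed together for the tasks listed above, here is a minimal Python sketch. It assumes a hypothetical tab-separated metadata file with fields `utt_id`, `speaker_id`, `subdialect`, `phase`, and `text`; these names and the layout are illustrative assumptions, not the official KeSpeech distribution format.

```python
# Sketch of organizing a multi-label, KeSpeech-style corpus in memory.
# The metadata layout and field names below are assumptions for illustration only.
import csv
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Utterance:
    utt_id: str      # unique utterance identifier (assumed field)
    speaker_id: str  # speaker label, usable for speaker verification
    subdialect: str  # e.g. standard Mandarin or one of the subdialects
    phase: int       # recording phase (1 or 2), useful for time-variance studies
    text: str        # content transcription, usable for speech recognition


def load_metadata(path: str) -> list[Utterance]:
    """Read a hypothetical TSV file with one labeled utterance per row."""
    utts = []
    with open(path, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            utts.append(Utterance(
                utt_id=row["utt_id"],
                speaker_id=row["speaker_id"],
                subdialect=row["subdialect"],
                phase=int(row["phase"]),
                text=row["text"],
            ))
    return utts


def parallel_pairs(utts: list[Utterance]) -> dict[str, list[Utterance]]:
    """Group utterances by (speaker, text) so that parallel Mandarin/subdialect
    recordings of the same sentence land in the same bucket, the raw material
    for subdialect style (voice) conversion experiments."""
    buckets: dict[str, list[Utterance]] = defaultdict(list)
    for u in utts:
        buckets[f"{u.speaker_id}|{u.text}"].append(u)
    return {k: v for k, v in buckets.items() if len(v) > 1}
```

The same record structure also supports multi-task or conditional learning setups, since each utterance carries the transcription, speaker, and subdialect labels jointly rather than in separate corpora.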