Addressing "Documentation Debt" in Machine Learning: A Retrospective Datasheet for BookCorpus

Part of Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021), Round 1

Authors

John Bandy, Nicholas Vincent

Abstract

This paper contributes a formal case study in retrospective dataset documentation and pinpoints several problems with the influential BookCorpus dataset. Recent work has underscored the importance of dataset documentation in machine learning research, including by addressing "documentation debt" for datasets that have been used widely but documented sparsely. BookCorpus is one such dataset. Researchers have used BookCorpus to train OpenAI's GPT-N models and Google's BERT models, but little to no documentation exists about the dataset's motivation, composition, collection process, etc. We offer a retrospective datasheet with key context and information about BookCorpus, including several notable deficiencies. In particular, we find evidence that (1) BookCorpus violates copyright restrictions for many books, (2) BookCorpus contains thousands of duplicated books, and (3) BookCorpus exhibits significant skews in genre representation. We also find hints of other potential deficiencies that call for future research, such as lopsided author contributions. While more work remains, this initial effort to provide a datasheet for BookCorpus offers a cautionary case study and adds to the growing literature that urges more careful, systematic documentation of machine learning datasets.