Ramya Srinivasan, Emily Denton, Jordan Famularo, Negar Rostamzadeh, Fernando Diaz, Beth Coleman
Machine learning (ML) techniques are increasingly being employed within a variety of creative domains. For example, ML tools are being used to analyze the authenticity of artworks, to simulate artistic styles, and to augment human creative processes. While this progress has opened up new creative avenues, it has also paved the way for adverse downstream effects such as cultural appropriation (e.g., cultural misrepresentation, offense, and undervaluing) and representational harm. Many such concerning issues stem from the training data in ways that diligent evaluation can uncover, prevent, and mitigate. We posit that, when developing an arts-based dataset, it is essential to consider the social factors that influenced the process of conception and design, and the resulting gaps must be examined in order to maximize understanding of the dataset's meaning and future impact. Each dataset creator's decision produces opportunities, but also omissions. Each choice, moreover, builds on preexisting histories of the data's formation and handling across time by prior actors including, but not limited to, art collectors, galleries, libraries, archives, museums, and digital repositories. To illuminate the aforementioned aspects, we provide a checklist of questions customized for use with art datasets in order to help guide assessment of the ways that dataset design may either perpetuate or shift exclusions found in repositories of art data. The checklist is organized to address the dataset creator's motivation together with dataset provenance, composition, collection, pre-processing, cleaning, labeling, use (including data generation), distribution, and maintenance. Two case studies exemplify the value and application of our questionnaire.