Part of Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021) round2
Thibaut Horel, Lorenzo Masoero, Raj Agrawal, Daria Roithmayr, Trevor Campbell
The lack of accessibility to data on policing has severely limited researchers’ ability to conduct thorough quantitative analyses on police activity and behavior, particularly with regard to predicting and explaining police violence. In the present work, we provide a new dataset that contains information on the personnel, activities, use of force, and complaints in the Chicago Police Department (CPD). The raw data, obtained from the CPD via a series of requests under the Freedom of Information Act (FOIA), consists of 35 unlinked, inconsistent, and undocumented spreadsheets. Our paper provides a cleaned, linked, and documented version of this data that can be reproducibly generated via open source code. We provide a detailed description of the dataset contents, the procedures for cleaning the data, and summary statistics. The data have a rich variety of uses, such as prediction (e.g., predicting misconduct from officer traits, experience, and assigned units), network analysis (e.g., detecting communities within the social network of officers co-listed on complaints), spatiotemporal data analysis (e.g., investigating patterns of officer shooting events), causal inference (e.g., tracking the effects of new disciplinary practices, new training techniques, and new oversight on complaints and use of force), and much more. Access to this dataset will enable the machine learning community to meaningfully engage with the problem of police violence.