'Transparency and reproducibility in artificial intelligence' by Benjamin Haibe-Kains, George Alexandru Adam, Ahmed Hosny, Farnoosh Khodakarami, Massive Analysis Quality Control (MAQC) Society Board of Directors, Levi Waldron, Bo Wang, Chris McIntosh, Anna Goldenberg, Anshul Kundaje, Casey S. Greene, Tamara Broderick, Michael M. Hoffman, Jeffrey T. Leek, Keegan Korthauer, Wolfgang Huber, Alvis Brazma, Joelle Pineau, Robert Tibshirani, Trevor Hastie, John P. A. Ioannidis, John Quackenbush & Hugo J. W. L. Aerts, Nature 586, E14–E16 (2020)
Breakthroughs in artificial intelligence (AI) hold enormous potential because they can automate complex tasks and even exceed human performance. In their study, McKinney et al. showed the high potential of AI for breast cancer screening. However, the lack of detail about the methods and algorithm code undermines its scientific value. Here, we identify obstacles that hinder transparent and reproducible AI research, as faced by McKinney et al., and provide solutions to these obstacles with implications for the broader field.
Haibe-Kains et al. argue that the work demonstrates both the potential of AI and the challenges of making such work reproducible: 'the absence of sufficiently documented methods and computer code underlying the study effectively undermines its scientific value' and 'limits the evidence required for others to prospectively validate and clinically implement such technologies'.
Scientific progress depends on the ability of independent researchers to scrutinize the results of a research study, to reproduce the study’s main results using its materials, and to build on them in future studies (https://www.nature.com/nature-research/editorial-policies/reporting-standards). Publication of insufficiently documented research does not meet the core requirements underlying scientific discovery. Purely textual descriptions of deep-learning models can hide their high level of complexity. Nuances in the computer code may have marked effects on the training and evaluation of results, potentially leading to unintended consequences. Therefore, transparency in the form of the actual computer code used to train a model and arrive at its final set of parameters is essential for research reproducibility. McKinney et al. stated that the code used for training the models has “a large number of dependencies on internal tooling, infrastructure and hardware”, and claimed that the release of the code was therefore not possible. Computational reproducibility is indispensable for high-quality AI applications; more complex methods demand greater transparency. In the absence of code, reproducibility falls back on replicating methods from their textual description. Although McKinney and colleagues claim that all experiments and implementation details were described in sufficient detail in the supplementary methods section of their Article to “support replication with non-proprietary libraries”, key details about their analysis are lacking. Even with extensive description, reproducing complex computational pipelines based purely on text is a subjective and challenging task.
In addition to the reproducibility challenges inherent to purely textual descriptions of methods, McKinney et al.'s description of the model development, data-processing and training pipelines lacks crucial details. The definitions of several hyperparameters for the model’s architecture (composed of three networks referred to as the breast, lesion and case models) are missing (Table 1). McKinney et al. also did not disclose the settings for the augmentation pipeline; the transformations used are stochastic and can considerably affect model performance. Details of the training pipeline were likewise missing. Without this key information, independent reproduction of the training pipeline is not possible.
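As an illustration of the kind of disclosure that would make such a stochastic pipeline reproducible, the sketch below spells out an augmentation and training configuration in code. The parameter names and values (learning rate, flip probability, noise level, random seed) are illustrative assumptions, not the settings used by McKinney et al.

```python
import numpy as np

# Illustrative configuration only; these are NOT the settings used by McKinney et al.
CONFIG = {
    "learning_rate": 1e-4,    # assumed optimizer setting
    "batch_size": 4,          # assumed batch size
    "flip_probability": 0.5,  # chance of a horizontal flip
    "noise_std": 0.01,        # standard deviation of additive Gaussian noise
    "random_seed": 42,        # fixing the seed makes the stochastic pipeline reproducible
}

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply the documented stochastic transformations to a single image."""
    if rng.random() < CONFIG["flip_probability"]:
        image = np.flip(image, axis=1)  # horizontal flip
    return image + rng.normal(0.0, CONFIG["noise_std"], size=image.shape)

rng = np.random.default_rng(CONFIG["random_seed"])
augmented = augment(np.zeros((64, 64)), rng)
```

With every transformation, probability and seed written down explicitly, an independent group can rerun the same stochastic pipeline rather than guess at it from prose.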
Numerous frameworks and platforms exist to make artificial intelligence research more transparent and reproducible (Table 2). For the sharing of code, these include Bitbucket, GitHub and GitLab, among others. The many software dependencies of large-scale machine learning applications require appropriate control of the software environment, which can be achieved through package managers such as Conda, as well as container and virtualization systems, including Code Ocean, Gigantum, Colaboratory and Docker. If virtualization of the McKinney et al. internal tooling proved difficult, they could have released the computer code and documentation. The authors could also have created small artificial examples or used small public datasets to show how new data must be processed to train the model and generate predictions. Sharing the fitted model (the architecture along with its learned parameters) should be straightforward, aside from the privacy concern that the model may reveal sensitive information about the patients used to train it.
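A minimal sketch of this practice, assuming a scikit-learn-style workflow rather than the authors' actual pipeline: a model is fitted on synthetic data, persisted together with its learned parameters, and reloaded to generate predictions on new, identically processed inputs. All names and values here are illustrative.

```python
import numpy as np
import joblib
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; real mammography features are not needed to document the workflow.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 16))
y_train = (X_train[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# Persist the fitted model (architecture plus learned parameters) for others to reuse.
joblib.dump(model, "fitted_model.joblib")

# A reader can reload the artefact and apply it to new, identically processed inputs.
reloaded = joblib.load("fitted_model.joblib")
X_new = rng.normal(size=(5, 16))
print(reloaded.predict_proba(X_new)[:, 1])
```

Even when the real training data cannot leave the institution, an artefact like this, accompanied by a small artificial example, lets others verify how inputs must be prepared and how predictions are produced.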
Haibe-Kains et al. note that techniques for achieving differential privacy exist, and that many platforms allow the sharing of deep-learning models; in addition to improving accessibility and transparency, such sharing can accelerate the translation of models into production and clinical implementation.
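One widely used technique of this kind is differentially private stochastic gradient descent (DP-SGD), which clips per-example gradients and adds calibrated noise before updating the model. The sketch below illustrates only that core idea with assumed parameter values; it is not an audited implementation and is not drawn from McKinney et al.

```python
import numpy as np

def dp_gradient_step(per_example_grads: np.ndarray,
                     clip_norm: float = 1.0,        # assumed clipping threshold
                     noise_multiplier: float = 1.1, # assumed noise scale
                     rng: np.random.Generator = None) -> np.ndarray:
    """Clip each example's gradient, then average and add calibrated Gaussian noise."""
    if rng is None:
        rng = np.random.default_rng()
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / per_example_grads.shape[0]

# Example: a batch of 8 per-example gradients for a model with 10 parameters.
grads = np.random.default_rng(1).normal(size=(8, 10))
update = dp_gradient_step(grads)
```

Bounding each patient's contribution in this way limits how much a shared model can reveal about any individual in the training set.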
Another crucial aspect of ensuring reproducibility is access to the data from which the models were derived. In their study, McKinney et al. used two large datasets under license, properly disclosing this limitation in their publication. The sharing of patient health information is highly regulated owing to privacy concerns. Despite these challenges, the sharing of raw data has become more common in the biomedical literature, increasing from under 1% in the early 2000s to 20% today. However, if the data cannot be shared, the model predictions and data labels themselves should be released, allowing further statistical analyses. Above all, concerns about data privacy should not be used to distract from the requirement to release code.
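As a sketch of the kind of further analysis that released predictions and labels enable, the snippet below recomputes a ROC curve and AUC from a released outputs file; the file name "predictions.csv" and its column names are hypothetical.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical released file with one row per case: a true label and a model score.
released = pd.read_csv("predictions.csv")  # columns assumed: "label", "score"

auc = roc_auc_score(released["label"], released["score"])
fpr, tpr, _ = roc_curve(released["label"], released["score"])
print(f"AUC recomputed from released outputs: {auc:.3f}")
```

Releasing per-case predictions and labels in this form discloses no raw images, yet it lets independent researchers verify reported metrics and probe subgroup performance.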