'Envisioning data sharing for the biocomputing community' by Enrico Riccardi, Sergio Pantano and Raffaello Potestio, Interface Focus 9(3) (2019)
The scientific community is facing a revolution in several aspects of its modus operandi, ranging from the way science is done, in terms of data production, collection and analysis, to the way it is communicated and made available to the public, be that an academic audience or a general one. These changes have been largely determined by two key players: the big data revolution or, less triumphantly, the impressive increase in computational power and data storage capacity; and the accelerating paradigm shift in science publication, with people and policies increasingly pushing towards open access frameworks. All these factors prompt initiatives aimed at maximizing the effectiveness of the computational efforts carried out worldwide. Starting from these observations, we here propose a coordinated initiative, focused on the computational biophysics and biochemistry community yet general and flexible in its defining characteristics, which aims to address the growing need to collect, rationalize, share and exploit the data produced in this scientific environment.
The authors argue:
The power of computational methods in the study of living systems has grown immensely in the past few decades. The size of the systems that can be studied by means of computer-based approaches has expanded from a few to millions of atoms, while the accuracy of force fields has systematically improved thanks to ab initio calculations and the integration of experimental data, making these in silico models predictive. A prominent role has been, and is being, played by coarse-grained models, that is, simplified representations of complex biological systems whose resolution is lower than atomistic (crossing all scales up to the continuum) but which enable the simulation of larger objects for longer times. Small yet powerful workstations, GPU cards, small-sized computer clusters and, above all, ever-improving algorithms have made cutting-edge research in computational biophysics accessible to groups at all scales, from single researchers to teams of hundreds. By the beginning of 2018, 90% of the data then in existence had been created in the preceding two years.
All these advancements have contributed tremendously to pushing forward our understanding of the numerous, intricate, intermingled and multi-scale processes and phenomena that can be encompassed by the broad definition of life. From the self-protonization reaction spontaneously turning water into its charged constituents to the flux of red blood cells in the vascular stream, computer models are now numerous and accurate enough, in spite of the obvious limitations and shortcomings deriving from the approximations they entail, to permit a remarkably deeper insight into the functioning of biological entities. Above all, these models are becoming increasingly predictive.
The cosmic inflation of computational biophysics inevitably has its downsides. We identify the most prominent issues to be addressed as follows:
- The distribution of immense treasures of data (molecular dynamics trajectories, to name just the most obvious), which would otherwise remain confined to the laboratories that produced them.
- The storage of data in a compact, reliable and secure protocol, to counteract data loss due to hardware failures and outdated software.
- The development of common procedures to rationalize data storage, avoiding data overflow and limiting the cost of backups.
- The limitation of redundant research efforts, which arise whenever a group must redo the work of another merely to obtain data needed as 'input' for further investigation rather than as an objective in themselves.
- The reduction of the plurality of standards for input and output files, metadata, algorithms, etc., which often degenerates into a detrimental incompatibility between data and codes that it would otherwise be sensible and useful to connect. This phenomenon creates quasi-closed communities of single-software users and prevents researchers from building simple, effective pipelines or assemblies of algorithms to obtain new results (see the sketch after this list).
- The mitigation of research opacity, a consequence of the difficulty of accessing the raw data, metadata, input files and detailed documentation of the procedures and algorithms of many works, as well as of reproducing their results.
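To make the interoperability issue concrete, the sketch below, which is not part of the authors' proposal, shows how the open-source MDAnalysis library can bridge two simulation ecosystems by rewriting a trajectory in a different format and attaching a small JSON metadata sidecar. The file names, the choice of formats and the metadata fields are illustrative assumptions rather than an established community standard.

```python
import json

import MDAnalysis as mda  # open-source library for trajectory I/O and analysis

# Hypothetical input files: a GROMACS topology/trajectory pair produced by one
# group, to be shared with users of a different tool chain.
u = mda.Universe("system.gro", "run.xtc")

# Rewrite the trajectory in another widely readable format (here DCD), so that
# analysis codes built around a different engine can consume it directly.
with mda.Writer("run.dcd", u.atoms.n_atoms) as writer:
    for ts in u.trajectory:
        writer.write(u.atoms)

# Minimal, illustrative metadata sidecar recording provenance alongside the
# converted data; the fields below are assumptions, not a fixed schema.
metadata = {
    "source_engine": "GROMACS",
    "n_atoms": int(u.atoms.n_atoms),
    "n_frames": int(u.trajectory.n_frames),
    "frame_spacing_ps": float(u.trajectory.dt),
}
with open("run.meta.json", "w") as fh:
    json.dump(metadata, fh, indent=2)
```

Such ad hoc conversions only patch the incompatibility after the fact; shared formats and metadata conventions of the kind advocated here would make them largely unnecessary.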
The last of these issues, research opacity, is particularly worrisome, as it is not merely a limitation of current approaches or a gap with respect to an otherwise achievable optimum; rather, it carries the risk of distorted, misguided and even intentionally falsified scientific behaviour. Indeed, alarming reports indicate that 'bad science' is becoming a prominent problem because of the current publishing format, which incentivizes publication for its own sake, e.g. by bringing ground-breaking claims to prominence while doing little to select against false positives and poorly designed experiments and analyses. The phenomenon is especially grave when it is not unintentional but originates from deliberate cheating or loafing. Normalized misbehaviour generates an increasing deviance from ethical norms; furthermore, science is endangered by statistical misunderstanding.
One of the worst consequences of dubious or opaque work is that it undermines the credibility and integrity of research as a whole, a phenomenon with dramatic societal consequences that is made even more critical by intentional scientific misconduct, for whose detection no framework currently exists. An intuitive yet simplistic measure of the scale of scientific misconduct is the number of retracted papers; this figure is at best conservative, which makes quantifying the cost of the phenomenon, for the academic community as well as for society at large, extremely difficult, if not impossible.
Consequently, there is a growing urgency to improve meta-research, that is, to develop instruments to evaluate and improve research methods and practices. As is natural for a young and growing field, efforts to date have been uncoordinated and fragmented. Even where 'good practices' are employed, a vast quantity of research output is not fully exploited or simply goes to waste, with the file drawer problem, that is, the tendency to refrain from publishing evidence contrary to an author's hypotheses, being one of the most representative issues. The lack of protocols, standards and procedures granting wide data accessibility leads to substantial costs (in terms of personnel time, facility resources and research funds).
A lively discussion among researchers and policy makers is ongoing at several national and international levels in order to reverse this trend. Among the most recent initiatives undertaken at the European level, it is worth mentioning the creation of the European Open Science Cloud (https://www.eosc-portal.eu), whose objective is to provide a safe environment for researchers to store, analyse and re-use data for research, innovation and educational purposes.
Making a very restrictive selection from a heterogeneous literature, we here consider and build upon two proposals in particular: Reformulating Science (methodological, cultural and structural reforms) and Science Utopia. With the suggestions therein in mind, and limiting ourselves to the research fields of computational biophysics and chemistry, we consider it a strategic priority to coordinate, at a supranational level, the availability of scientific data and software in order to increase research efficiency, reproducibility and openness. This is certainly not a novel idea, and several successful examples of curated databases integrating biological information can be found in the life sciences, such as GenBank, UniProt and the Worldwide Protein Data Bank (PDB). An initiative similar to the one proposed here is the NoMaD project, which maintains one of the largest open repositories for computational materials science. We believe that a global effort has to be undertaken in order to rationalize the complex ecosystem of software and the goldmine of information that is emerging from the collective, albeit often independent, work of a steadily growing research community. Moreover, the availability of the data would also help boost scientific progress in developing countries, where the scarcity of resources impairs the training of highly specialized researchers.
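As a purely illustrative sketch, not taken from the paper, the snippet below shows the kind of frictionless programmatic access that curated open repositories such as the PDB already provide, and that the proposal would like to see extended to simulation data. It retrieves a single entry through the public RCSB download service using only the Python standard library; the entry identifier is an arbitrary example.

```python
import urllib.request

# Arbitrary example entry (1UBQ, ubiquitin); any valid PDB ID would do.
pdb_id = "1UBQ"
url = f"https://files.rcsb.org/download/{pdb_id}.pdb"

# Download the entry from the RCSB PDB public download service.
with urllib.request.urlopen(url) as response:
    pdb_text = response.read().decode("utf-8")

# Trivial sanity check: count the ATOM records in the retrieved file.
n_atoms = sum(1 for line in pdb_text.splitlines() if line.startswith("ATOM"))
print(f"{pdb_id}: {n_atoms} ATOM records retrieved")
```

A comparably simple, openly documented access path for trajectories, input files and metadata is precisely what is still missing for most simulation data.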