Making the collective knowledge of chemistry open and machine actionable

To be practical, the data capture step needs to be as close as possible to the way chemists work, and it must ensure that the chemical data generated can be readily reused by other researchers. We give examples of what ‘machine-actionable data’ means in Box 1.

In chemistry, most samples in the lab are produced by a chemical reaction. Predicting the conditions under which a reaction proceeds optimally remains one of the major challenges in chemistry. Machine-learning methods are expected to help us in this area14. However, for this to work we need to report data in a format that can be used in machine learning, and we also need to report ‘failed’ experiments15,16. One can easily see the dilemma here: if an experiment—after 99 ‘failed’ attempts—finally works, there is little motivation, if any, for a researcher to spend 1% of their time reporting the one successful experiment and the remaining 99% reporting the ‘failed’ ones.

Capturing synthetic data

In chemistry, the number of possible steps and combinations of steps is nearly infinite. For example, the order in which the reagents are added can determine whether a reaction succeeds17,18—and any machine-learning effort will fail if such information is not reported correctly. This is exactly what is missing in many of the existing databases. For example, by mining the patent literature19 one can obtain a wealth of information on which chemicals can be synthesized20. However, the actual procedure of the syntheses cannot be mined systematically: the order of addition, the heating, the stirring and, of course, the workup and purification. The situation is even more dire for inorganic chemistry21. Similarly, none of these databases contains information about the attempts that did not work, and all are biased towards certain reaction types22,23,24. This lack of reports on ‘failed’ reactions adds to other factors that make certain types of reactions more prominent than others—for example, looking into the most used reactions in medicinal chemistry, Brown and Boström found that amide formation was mentioned at least once in about half of a selected set of manuscripts published in the Journal of Medicinal Chemistry in 2014 (ref. 25).
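As a minimal sketch of what such a report could look like (our own illustration with hypothetical field names, not the schema of any existing database or ELN), a reaction record that preserves the order of addition and explicitly stores ‘failed’ attempts might be serialized as follows:

```python
from dataclasses import dataclass, field, asdict
from typing import List
import json

@dataclass
class ReactionStep:
    action: str          # e.g. 'add', 'stir', 'heat', 'workup'
    reagent: str = ""    # empty for steps without a reagent
    amount: str = ""
    duration_min: float = 0.0
    temperature_c: float = 25.0

@dataclass
class ReactionRecord:
    reaction_id: str
    steps: List[ReactionStep] = field(default_factory=list)  # list order records the order of addition
    outcome: str = "failed"   # 'failed' attempts are kept, not discarded
    yield_percent: float = 0.0

# Two attempts of the same reaction: only the order of addition differs
attempt_1 = ReactionRecord(
    reaction_id="rxn-042-a",
    steps=[
        ReactionStep("add", reagent="amine", amount="1.0 mmol"),
        ReactionStep("add", reagent="acid chloride", amount="1.1 mmol"),
        ReactionStep("stir", duration_min=120, temperature_c=0),
    ],
    outcome="failed",
)

attempt_2 = ReactionRecord(
    reaction_id="rxn-042-b",
    steps=[
        ReactionStep("add", reagent="acid chloride", amount="1.1 mmol"),
        ReactionStep("add", reagent="amine", amount="1.0 mmol"),  # reversed order of addition
        ReactionStep("stir", duration_min=120, temperature_c=0),
    ],
    outcome="success",
    yield_percent=87.0,
)

print(json.dumps([asdict(attempt_1), asdict(attempt_2)], indent=2))
```

Even this simple structure makes the two attempts directly comparable for a learning algorithm, which a free-text write-up of only the successful run would not.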

Ideally, to capture synthesis information we need to find a balance between the flexibility of a sheet of paper, on which chemists can record anything they want in any format they like26, and imposing enough structure that the captured data can easily be reused for machine-learning applications. The flexibility is key to ensuring that chemists will widely adopt the tool10,27, whereas from a data-management perspective a highly structured database (for example, filled via a long form) would be much easier to use. In high-throughput experimentation settings the latter is clearly a natural approach, but for many manually created, small datasets1 it is not feasible: capturing all the possible scenarios would result in a form so gigantic that chemists would need special training to navigate it.

Among the different ELNs, no consensus has been reached on this design point. Some allow complete flexibility and have the look and feel of a typical note-taking app, so that natural-language processing is needed to make the information machine-readable, which unavoidably leads to information loss. At the other end of the spectrum are ELNs that impose a lot of structure, with a new form designed for every eventuality; this might be ideal for machine learning but becomes a burden for non-routine chemistry.

A possible solution to these challenges, which is implemented in the chemotion and cheminfo ELNs (Table 1), is to stick to the text-based form chemists are used to, but to combine it with templates that structure the text. This hybrid approach is described in Box 2. In practice, we found that some free-text fields are always required to give chemists the necessary flexibility to express their motivation, thought process and interpretation. Parts of this can be captured via specific fields, for instance the related literature or spectral annotations. For many other parts, the free, potentially unstructured, thought process is exactly what one would like to capture (for example, to note that an experiment failed for an unexpected reason, such as a beam drop at the synchrotron).
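To give a flavour of such a hybrid entry (a deliberately simplified sketch; Box 2 describes the actual approach, and the template and field names below are hypothetical), a templated sentence can be split into structured fields while the free-text observation is kept verbatim:

```python
import re

# Hypothetical template: structured fields embedded in a chemist-readable sentence
TEMPLATE = re.compile(
    r"Add (?P<amount>[\d.]+ m?g) of (?P<reagent>.+?) to (?P<solvent>.+?), "
    r"stir at (?P<temperature>[\d.]+) ?°C for (?P<duration>[\d.]+) ?h"
)

entry = (
    "Add 250 mg of benzaldehyde to 10 ml ethanol, stir at 60 °C for 2 h. "
    "Solution turned yellow; suspect trace water from the solvent bottle."
)

match = TEMPLATE.search(entry)
structured = match.groupdict() if match else {}          # machine-actionable part
free_text = entry[match.end():].strip(" .") if match else entry  # observation kept verbatim

print(structured)
print(free_text)
```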

Characterization data formats and metadata

After a sample has been synthesized, it needs to be characterized. In doing so, we want to ensure that researchers all over the world, as well as their computational agents, can use the data. Clearly, data models, which describe how data are stored in a data format, and metadata, which describe datasets, are not the typical focus of a chemist. However, a lot of chemical data is currently stored in a wide variety of proprietary files (Supplementary Table 2). In the short term this might not look like a real problem, but in the long term it is not sustainable. For example, one can lose access to all the files once the software license associated with a particular instrument expires, or collaborators at another institute who want to use the data might not have access to the same software. Moreover, a hodgepodge of inconsistent formats clearly hampers data-mining efforts.

Requiring all individual researchers to manually convert all their spectra into a standard format would place a large, potentially insurmountable and non-scalable burden on them. Therefore, an essential step towards such an open platform is to convert the data into a standardized, structured form before they even enter the ELN (thesis 2 in Fig. 1). This is an essential service an ELN must provide to its users. That is, the ELN takes the data as they are provided by the spectrometer and converts them into a standardized form. The cheminfo implementation, for example, uses JCAMP-DX files (Joint Committee on Atomic and Molecular Physical Data Exchange format; see Extended Data Fig. 1 for an example) as a standard representation for most spectra. This format, together with a recommended vocabulary, has been recommended by IUPAC (International Union of Pure and Applied Chemistry) for many spectra28; it is also recommended by the chemotion ELN and used in the Open Spectral Database29. However, in principle, any other format (Supplementary Table 4) can be used as long as it is standardized and openly documented. Indeed, some newer formats have native support for advanced features, such as linking to standardized vocabularies, and might be preferable (see Extended Data Fig. 2 for an example). For example, there have been efforts (spearheaded by the pharmaceutical industry) to develop a ‘unified data model’ for compound synthesis and testing, or the ‘Allotrope data format’, which tries to collect the full data life cycle in one file. Some, like Autoprotocol or XDL30, even try to capture the link between hardware (such as reaction vessels) and synthesis steps in a way that can be understood (and executed) by both robots and humans.
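As an illustration of what such a conversion amounts to, the sketch below serializes plain x/y data into a minimal JCAMP-DX file; it shows only the basic label=value structure, whereas a production converter would follow the full IUPAC specification and the recommended vocabulary for each technique.

```python
def write_jcamp_dx(path, title, x_units, y_units, xs, ys, data_type="INFRARED SPECTRUM"):
    """Write x/y data as a minimal JCAMP-DX file (label=value header, ##END= terminator).

    Bare-bones sketch of the label-data-record structure, not a full implementation
    of the IUPAC specification (no data compression, no audit trail, no validation).
    """
    lines = [
        f"##TITLE={title}",
        "##JCAMP-DX=4.24",
        f"##DATA TYPE={data_type}",
        f"##XUNITS={x_units}",
        f"##YUNITS={y_units}",
        f"##FIRSTX={xs[0]}",
        f"##LASTX={xs[-1]}",
        f"##NPOINTS={len(xs)}",
        "##XYPOINTS=(XY..XY)",
    ]
    lines += [f"{x}, {y}" for x, y in zip(xs, ys)]
    lines.append("##END=")
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")

# Example: a proprietary export reduced to two columns, re-serialized as JCAMP-DX
write_jcamp_dx("sample_0042_ir.jdx", "sample 0042, IR", "1/CM", "TRANSMITTANCE",
               xs=[400.0, 402.0, 404.0], ys=[0.91, 0.88, 0.93])
```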

One can argue that some existing formats and data schemas are old-fashioned and that we should develop new ones. However, anyone proposing a new format should realize that if a characterization method has N formats provided by the instrument manufacturers and M ‘standard’ formats are invented, we need to write and maintain N × M conversion programs, plus M² programs to compare the different ‘standard’ formats with one another. This suggests that it is often more productive to update existing solutions and make them interoperable than to create new ones (thesis 5 in Fig. 1).

It is important to note that data become much more useful, and interoperable, if they are linked and described using a controlled, hierarchical vocabulary, that is, an ontology. Using a formal ontology allows us to infer information from the context encoded in the vocabulary. For example, we might have Raman and infrared spectra, as well as the cities of the measurements, stored in our database. The ontology will not only remove ambiguities in the spelling of the cities, but it will also tell us which cities to include if we search for, say, all organic samples with vibrational spectra measured in a particular country. At the technical level, this is enabled by the fact that the ontology encodes that both infrared and Raman spectroscopy are forms of vibrational spectroscopy and that cities are located in countries. That is, it allows us to go from machine-readable to machine-interpretable on a global scale (global because the terms are standardized and shared via uniform resource identifiers (URIs)). In practice, however, ontologies (and related semantic web technologies) remain underused. The main reasons are probably that the diversity of ontologies is too large and that existing ones are not well integrated31. Clearly, we cannot expect chemists to manually annotate their data using an ontology. This is something an ELN needs to do automatically in the background. However, for this to be practical, ELN developers need to connect with other initiatives to register, standardize, link32 and adopt ontologies.
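The following toy sketch mimics the reasoning described above with hand-written dictionaries (a real implementation would rely on shared, URI-identified ontologies and semantic web tooling rather than code like this): because the hierarchy records that infrared and Raman spectroscopy are kinds of vibrational spectroscopy and that cities lie in countries, a query for vibrational spectra measured in Switzerland finds both records automatically.

```python
# Toy 'ontology': subclass and located-in relations (a real system would use shared,
# URI-identified ontologies and a triple store instead of hand-written dictionaries)
IS_A = {
    "infrared spectroscopy": "vibrational spectroscopy",
    "raman spectroscopy": "vibrational spectroscopy",
    "vibrational spectroscopy": "spectroscopy",
}
LOCATED_IN = {"Basel": "Switzerland", "Lausanne": "Switzerland", "Karlsruhe": "Germany"}

records = [
    {"sample": "S-1", "technique": "infrared spectroscopy", "city": "Basel"},
    {"sample": "S-2", "technique": "raman spectroscopy", "city": "Lausanne"},
    {"sample": "S-3", "technique": "nmr spectroscopy", "city": "Basel"},
]

def is_kind_of(technique, ancestor):
    """Walk the subclass hierarchy to decide whether `technique` falls under `ancestor`."""
    while technique is not None:
        if technique == ancestor:
            return True
        technique = IS_A.get(technique)
    return False

hits = [r["sample"] for r in records
        if is_kind_of(r["technique"], "vibrational spectroscopy")
        and LOCATED_IN.get(r["city"]) == "Switzerland"]
print(hits)  # ['S-1', 'S-2'] — Raman and IR both count as vibrational spectroscopy
```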

Let us now assume the ideal situation in which most chemists have settled on a standard data-reporting form (for the most important characterization techniques in a subdomain, such as gas adsorption isotherms, X-ray absorption spectroscopy and cyclic voltammetry), and also accept that open science should not be an afterthought. This implies that the ELN must take the file in whatever form it comes from the instrument, convert it into a standard form and permanently connect it to the chemical that was characterized (Fig. 2). Such conversion tools (see Supplementary Table 2 for examples) can be developed independently of each other and reused in all ELNs. For instance, the chemotion ELN reuses some of the libraries that we have been developing for the cheminfo ELN (cheminfo.github.io). Having such common conversion tools would also create an incentive to adopt a common schema.

Fig. 2: Overview of a possible importation procedure of the ELN.

If an instrument is coupled to the network, one can, by scanning the barcode on the sample, upload the analysis result directly into a database. Alternatively, one can upload files via drag and drop through a web interface (front end). In both cases, the ELN ensures that the data are converted into a standard form such that anyone with a web browser can visualize and further analyse them. Other parties can access the data, for example, using an access token mechanism70, via a representational state transfer (REST) application programming interface (API), or published on a repository. Importantly, all the steps can take place from different locations, hence enabling collaboration. This data infrastructure is implemented in the open-source cheminfo ELN. Folder icon reproduced from image designed using resources from Flaticon.com; laptop photo by Scott Graham on Unsplash.
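To make the access path in Fig. 2 concrete, a collaborator (or a script) could retrieve the standardized data over the REST API with an access token. The sketch below is purely illustrative: the routes, field names and token scheme are hypothetical and do not correspond to the actual cheminfo API.

```python
import requests

# Hypothetical endpoint and token; the exact routes depend on the ELN deployment.
BASE_URL = "https://eln.example.org/api/v1"
TOKEN = "read-only-access-token"

# Fetch the standardized spectra attached to a sample, from any location with HTTP access
resp = requests.get(
    f"{BASE_URL}/samples/sample-0042/spectra",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for spectrum in resp.json():
    print(spectrum["technique"], spectrum["download_url"])
```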

Provenance of data

One crucial step in this process is to match the spectrum with the correct sample. A URI system (which can be printed as barcodes) helps to avoid mistakes in this step. For instance, in the cheminfo ELN, scanning the barcode creates the upload information for automatic importation from the computers to which the spectrometers are connected. From there, the system can take the file from the computer, convert it into the standard form and store it as an attachment to a sample that has been created in the ELN (for example, as the product of some reaction). This automatic importation not only makes it much easier, and less error-prone, for the chemist to store the data in the ELN, but it also allows us to automatically record a lot of metadata—for example, the importation workflow can fill in information about the instrument (such as the manufacturer, serial number, and the humidity and temperature of the room) that is not always recorded in the output files of the measurements (see Extended Data Figs. 1 and 2 for examples).
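A rough sketch of the metadata enrichment such an importation workflow can perform is shown below; the instrument registry, identifiers and field names are all hypothetical, and a real ELN would persist the record and attach the converted file rather than print it.

```python
import json
import pathlib
from datetime import datetime, timezone

# Hypothetical instrument registry: metadata that the measurement files themselves
# often do not contain, filled in automatically during importation.
INSTRUMENTS = {
    "ir-lab2": {"manufacturer": "ExampleCorp", "serial_number": "XC-1234",
                "room_humidity_percent": 42, "room_temperature_c": 21.5},
}

def import_measurement(sample_uri: str, instrument_id: str, raw_file: str) -> dict:
    """Attach a measurement to the sample identified by the scanned barcode/URI."""
    return {
        "sample": sample_uri,                      # from the scanned barcode
        "instrument": INSTRUMENTS[instrument_id],  # metadata added automatically
        "imported_at": datetime.now(timezone.utc).isoformat(),
        "raw_file": pathlib.Path(raw_file).name,
        # conversion to a standard format (for example JCAMP-DX) would happen here
    }

print(json.dumps(import_measurement("https://eln.example.org/samples/0042", "ir-lab2",
                                    "C:/instrument/export/run_981.spc"), indent=2))
```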

Data processing

After data have been produced and imported into the ELN, they usually need to be analysed further. At present, chemists have to switch between different, often proprietary, software packages to carry out this analysis. They might rely on the software provided by the instrument manufacturer to perform peak picking or baseline correction, and then use another plotting tool to overlay the data. In an open-science vision, one would like to ensure not only that the data can be accessed but, equally importantly, that the subsequent analyses can be reproduced. Likewise, if the chemistry community embraces the view that the ELN should convert data into a commonly agreed standard form, the analysis tools become independent of a particular instrument or even characterization technique (Box 3).
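Once the data arrive in a standard x/y form, the same few lines of analysis code serve any instrument; the sketch below (synthetic data and a deliberately naive linear baseline, not the algorithms of any particular ELN) performs baseline correction and peak picking on such a spectrum.

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic spectrum: two bands on a gentle linear background
x = np.linspace(400, 4000, 1801)                       # wavenumber / cm^-1
y = np.exp(-((x - 1700) / 15) ** 2) + 0.5 * np.exp(-((x - 2900) / 30) ** 2) + 0.01 * x / 4000

baseline = np.polyval(np.polyfit(x, y, 1), x)          # naive linear baseline (illustrative only)
corrected = y - baseline
peaks, _ = find_peaks(corrected, prominence=0.2)       # pick the prominent bands
print(x[peaks])                                        # positions of the detected bands
```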

If we design the platform with a common interface, a modular architecture and reusable key components, we have taken the first step towards an ecosystem in which libraries of specific tools are developed that accelerate the workflow of chemists (thesis 4 in Fig. 1 and Box 3). The modular design allows experts in one technique, for example NMR spectroscopy, to develop tools that can then be reused by other ELNs. An example is the NMRium project33, a reusable web component that can, with three lines of code, be plugged into another ELN system. To make this work, it is important that the components can talk to each other via standardized protocols.

In an open-science vision, the code for these components should be open. One of the concerns regarding open-source software is the danger that a project might ‘die out’ if one maintainer leaves it, whereas successful commercial software might seem to promise continuity. However, there are many successful examples (such as Linux and Python) of open-source projects that are maintained by the community, yet leave many options open for commercial initiatives (for instance, support contracts and maintenance of a custom installation). Similarly, at universities a common analytical infrastructure (such as the routine NMR service) is often supported with institutional funding—a similar model might also be appropriate for a digital infrastructure. Importantly, open-source code has the advantage that the underlying assumptions and equations of any analysis are documented, and everyone can verify, replicate or even improve the analysis. Also, in contrast to closed-source (commercial) tools that are discontinued because of a change in business interests, development can be revived at any time, as the code remains openly accessible and reusable.

Publication of reusable and machine-actionable data

The work of a scientist is not completed when all the materials are synthesized and characterized. An essential part of the scientific process is the dissemination of the results to make sure that others can build on top of one’s work. Typically, we are used to thinking of ‘others’ as other scientists in the same field. However, science is increasingly multidisciplinary, and hence non-specialists might also need to understand the data. Additionally, the move towards open science is a logical consequence of the notion that if the taxpayer paid for the research, the research data should be owned by the public at large, which can empower citizen (data) science34,35. We get a glimpse of the power of data reuse from the discoveries of Don Swanson, an information scientist without formal training in medicine, who analysed literature from the Medline database and found previously undiscovered knowledge, such as links between magnesium deficiency and migraine35. Clearly, there is nothing fundamental about chemistry that prohibits us from leveraging such approaches to science.

Usually, however, in contrast to the publication of an article, the publication of all the scientific data on which the article is based is reduced to an afterthought. Most of us have been educated with the idea that we need to be selective about which data to publish, instead of embracing the idea that all the scientific data we generate are an integral part of the science we do: data are typically published only to fulfil the requirements of a journal policy or a data management plan—without reuse in mind. This probably explains why many ELNs do not offer an option to export data to a repository.

In the open-science platform we propose, the publication of the scientific data is simply seen as one of the applications of the ELN. Users can select the samples they want to publish and create an entry in a repository that contains all the relevant raw data (Fig. 3). The application ensures that the data are reported in a form that can easily be reused by other researchers as well as by machines. For chemists writing a publication, this means that they can provide a DOI (digital object identifier) for the supplementary material and augment every figure with a link at which readers can interact with the raw data or download them for follow-up studies. Both the chemotion and cheminfo ELNs implement parts of this functionality: the cheminfo ELN exports data to the general-purpose Zenodo36 repository, whereas the chemotion ELN can export data to the chemotion repository37, which focuses on chemical synthesis and characterization data.
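As an illustration of how little glue code such an export requires, the sketch below follows Zenodo’s public deposition API (create a deposition, upload the compiled export into its file bucket, attach metadata, publish to mint a DOI); the token, file name and metadata are placeholders, error handling is omitted, and this is not the actual export code of any ELN.

```python
import requests

ZENODO = "https://zenodo.org/api"
TOKEN = {"access_token": "YOUR-ZENODO-TOKEN"}   # placeholder personal access token

# 1. Create an empty deposition
dep = requests.post(f"{ZENODO}/deposit/depositions", params=TOKEN, json={}).json()

# 2. Upload the compiled ELN export (raw data + standardized files) into its bucket
with open("sample_0042_export.zip", "rb") as fh:
    requests.put(f"{dep['links']['bucket']}/sample_0042_export.zip", data=fh, params=TOKEN)

# 3. Attach minimal metadata, then publish to mint a citable DOI
metadata = {"metadata": {"title": "Raw and processed data for sample 0042",
                         "upload_type": "dataset",
                         "description": "Synthesis and characterization data exported from the ELN.",
                         "creators": [{"name": "Doe, Jane"}]}}
requests.put(f"{ZENODO}/deposit/depositions/{dep['id']}", params=TOKEN, json=metadata)
published = requests.post(f"{ZENODO}/deposit/depositions/{dep['id']}/actions/publish",
                          params=TOKEN).json()
print(published["doi"])  # DOI that can be cited in the paper and supplementary material
```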

Fig. 3: Example of the flow of data from an ELN to an interactive visualization for the reader of a paper.

Once all the chemicals whose synthesis and characterization data need to be published are selected, the ELN compiles the data and uploads them to a repository (in this case Zenodo36). These data are not only machine-readable, but can also be accessed through a browser, where a human reader can use the same visualization tools as the authors of the article71. The workflow sketched in this figure is implemented in the open-source cheminfo ELN. Panel b screenshot reproduced from Zenodo under a Creative Commons license CC BY 4.0.

In a similar vein, an ELN might also allow entries to be imported from a repository. This means that researchers could import the entire lab notebook used to produce the published results. Importantly, as the characterization data are also provided in the repository, researchers have access to the original measurements and can overlay them with their new results. To our knowledge, no ELN currently implements this automatic reimportation procedure in full.
