Data Descriptor

A Compendium of Chemical Class and Use Type Open Access Databases

1
iES Landau, Institute for Environmental Sciences, University of Koblenz-Landau, D-76829 Landau, Germany
2
EERES, Ökosystemforschung Anlage Eußerthal, D-76857 Eußerthal, Germany
*
Author to whom correspondence should be addressed.
Submission received: 12 October 2020 / Revised: 25 November 2020 / Accepted: 2 December 2020 / Published: 4 December 2020
(This article belongs to the Section Chemoinformatics)

Abstract

With the ever-increasing production and registration of chemical substances, obtaining reliable and up-to-date information on their use types (UT) and chemical classes (CC) is of crucial importance. We evaluated the current status of open access chemical substance databases (DBs) regarding UT and CC information using the “Meta-analysis of the Global Impact of Chemicals” (MAGIC) graph as a benchmark. A decision tree-based selection process was used to choose the most suitable out of 96 databases. To compare the DB content, an extensive quantitative and qualitative analysis was performed for 100 weighted, randomly selected chemical substances. Four DBs yielded more UT and CC results, in both quantity and quality, than the current MAGIC graph: the European Bioinformatics Institute DB, ChemSpider, the English Wikipedia, and the National Center for Biotechnology Information (NCBI). The NCBI, along with its subsidiary DBs PubChem and Medical Subject Headings (MeSH), showed the best performance according to the defined criteria. For the analysis of large datasets, harmonisation of the available information might be beneficial, as the available DBs mostly aggregate information without harmonising it.
Dataset: Chemical Class and Use Type Compendium available online at www.mdpi.com/xxx/s1.
Dataset License: CC-BY-SA.

1. Summary

In the 21st century, a continuously increasing number of compounds and substances is used and potentially released into the environment [1]. Understanding how they affect the environment is the main objective of ecotoxicological research [2]. To further our understanding of their impact and to better comprehend environmental chemicals in general, additional information detailing their use type and associated chemical classes is urgently needed. In this context, substance databases (DBs) play an essential role as information sources for research, governmental institutions, regulation, citizen science, and companies alike ([3]; see also Appendix A Database Compendium References). To date, an overwhelmingly broad range of DBs is publicly available (see Appendix A Database Compendium References).
In ecotoxicology, DBs are vital to easily store, manage, and access information, for instance, concerning adverse effects on a trophic and species level [4]. Generally, DBs may contain similar information on substances but serve different purposes. For example, the Kyoto Encyclopedia of Genes and Genomes (KEGG), the European Chemicals Agency (ECHA), and DrugBank convey information on genomes, chemical registration data, and pharmaceuticals, respectively [5,6,7], whereas other DBs focus particularly on chemical structures and the ontology of chemicals [8]. Therefore, information obtained from DBs often needs to be harmonised and verified before use [9]. For chemical substances, the use type (UT; a descriptor of the purpose(s) for which a chemical is used) and the chemical class (CC; a descriptor of the group to which a chemical belongs according to its molecular structure) are relevant for understanding their pathways into the environment, as well as their potential environmental effects and behaviour in the environment [10,11]. However, little research is available on the quality and quantity of UT and CC information in open access chemical substance DBs.
A decision tree-based selection process (Figure 1) was applied that evaluated relevant factors such as accessibility and extent, followed by a final UT and CC analysis of the best performing DBs. The decision tree was designed to find DBs that are frequently curated and updated and that are open access for researchers worldwide. The DBs remaining after the application of the selection criteria were evaluated along with the “Meta-analysis of the Global Impact of Chemicals” (MAGIC) graph, a labelled property graph DB [12]. The MAGIC graph harmonises and integrates multiple DBs and focuses on potential environmental impact chemicals (PEIC), accounting for synonyms of substance and compound names. It currently relies on the Pesticide Action Network (PAN) DB for UT, which provides only limited UT and CC information. Therefore, we evaluated which other DBs would be a suitable addition with respect to the quality and quantity of their UT and CC information, using a set of 100 weighted, randomly selected chemicals (see Methods, Section 3.3). The goal was to identify the UT and CC DBs most suitable for integration into the MAGIC graph in terms of data quality and quantity, thereby enabling, e.g., the interpretation of exposure patterns in the environment across CC and UT. UT and CC information of high quantity and quality will subsequently be integrated into the MAGIC graph.

2. Data Description

2.1. The “Chemical Class and Use Type Compendium” Dataset

The current version of the compendium of CC and UT DBs can be accessed via https://static.magic.eco/Compendium. Table 1 and Table 2 provide an overview of the compendium’s columns used in its two different worksheets, with Table 1 explaining the first worksheet (decision tree selection process; see Section 2.2.) and Table 2 explaining the second worksheet (UT and CC validation process) of the compendium.

2.2. Database Selection Process

Using the criteria described in Table 1, a total of 96 DBs were evaluated. Of those, 18 were discarded under decision tree criteria 1–3 (Table 3), because they no longer existed, did not provide a search function, or were not uniquely identifiable. Four DB websites were mentioned in the literature but no longer existed or were no longer accessible when checked in November 2019 (criterion 1). Seven DBs were discarded because no search could be performed (criterion 2), e.g., because the website still existed but had discontinued its DB or search function due to funding [13], or because it was undergoing maintenance during the analysis period [14]. Seven DBs were discarded due to their URL (criterion 3). In two cases, the URL was identical, while in the other cases different URLs led to the same DB. Twenty-four DBs had a different focus (criterion 4), meaning that they did not provide substance property information. BindingDB.org [15] is an example of a DB that may have other uses (interactions between small molecules and proteins) but did not fulfil the fourth criterion and was therefore excluded. Ten DBs were discarded under criterion 5 because they were updated infrequently. Twelve DBs were either not openly accessible or not free of charge (criteria 6 and 7) and were discarded because they did not fulfil the open access criterion. Such websites may generally allow manual searches, like Cayman [17], but they forbid crawling the DB webpage (i.e., automatically querying large subsets of data) or extracting data through batch searches without buying an annual licence (e.g., Herts [18]; criterion 7). Although some DBs had no unique data and only mirrored another DB (criterion 8), none were discarded at this step, as they had already been discarded before. An example is the LookChem shop page, which displayed substance information taken directly from Wikipedia [16].
Further, five DBs were discarded under criterion 9 for comprising fewer than 1500 substances. Seven were omitted because they did not fulfil criterion 10, i.e., they did not include relevant UT or CC information. Seven DB websites were discarded under criterion 11, which checked whether a DB was wholly integrated into another DB or website. Examples are PubChem and Medical Subject Headings (MeSH), which are completely integrated into the National Center for Biotechnology Information (NCBI) [19], and the major restructuring of some U.S.-based DBs, like TOXNET [20], which was integrated into PubChem [21] and PubMed [22], amongst others, on 16 December 2019 [20]. Under the final criterion 12, five DBs were discarded because they did not contain large sets of PEIC information of global relevance. After this decision tree-based evaluation, nine DBs were analysed further (see also Table 2 and Section 2.3) and their suitability for integration into the MAGIC graph was assessed (see Methods, Section 3.3). Detailed information on the further evaluated DBs can be found in the Chemical Class and Use Type Compendium Sheet 2 (https://static.magic.eco/Compendium) and in Table 4 and Table 5.

2.3. UT and CC Database Quantitative Extent and Quality

For the DBs left after the decision tree analysis (i.e., NCBI, Comptox, Wikipedia, European Bioinformatics Institute (EBI), ChemSpider, KEGG, National Pesticide Information Center Product Research Online (NPRO), ECHA, and DrugBank) and for the MAGIC graph (Table 4), the content was qualitatively and quantitatively evaluated using a total of 100 weighted, randomly selected chemical substances (for detailed methods, see Section 3.3). Between 15 and 100 of the 100 chemical substances were listed in each of these DBs. Among the selected DBs, DrugBank, with 15 substance entries, had the smallest number of substances matched, whereas the NCBI website, with 100 out of 100, had the highest (Table 4). ChemSpider and Comptox had the second and third highest numbers (99 and 98, respectively). Wikipedia (English version), as a general DB with a broad range of topics and entries [23], scored better than many DBs with a specific topical focus (e.g., NPRO, KEGG, DrugBank) [5,7,24], but less well than DBs designed with a general focus on chemical substances and their properties (NCBI, ECHA, Comptox, EBI, ChemSpider) [6,19,25,26,27]. This was expected, as the criteria were designed to find a DB that has detailed, up-to-date information on a broad range of relevant substances. Concerning quantitative UT and CC information, the NCBI DB collection has the most substance entries that include this information (Table 4). It is the primary domain for various substance-related DBs, like PubChem and MeSH, with millions of compound and substance entries [21]. The current MAGIC graph DB search resulted in 56 (out of the total of 100) substance entries with UT information and 48 substance entries with CC information. Hence, NCBI performed better than the MAGIC graph concerning UT information (86 substances with UT information), as did Comptox (85 substances) and Wikipedia (63 substances).
Concerning CC information, NCBI contained 82 substances with CC information, compared to ChemSpider (68), Wikipedia (58), and EBI (50), all of which performed better than the current MAGIC graph (48 substances).
The qualitative evaluation assessed how detailed the information contained in the DB is; e.g., herbicide or insecticide was considered qualitatively better UT information than pesticide (for detailed methods, see Section 3.3). The results can be seen in Table 5. The NCBI DB group performed best overall.
The MAGIC graph had the lowest ratio of high-quality CC entries to all entries with CC information (56%) among all assessed DBs. NPRO did not provide any CC information and, therefore, could not be evaluated in this respect. In general, the NCBI ranked highest, with 78 substances with high-quality CC information (95% of its CC entries) and 79 substances with high-quality UT information (92% of its UT entries), as it included 22 substances more than the next highest ranked DB. No DB had a higher share of high-quality entries than the NCBI regarding CC. However, five DBs exceeded the NCBI’s 92% rate of high-quality UT information, with NPRO ranking first (97%; 37 of 38 substances with UT information had high-quality information). This could be explained by the fact that the NCBI has more entries with UT information overall (79 of its 86 substance entries with UT information also had high-quality UT information; see also the number of quality UT entries as a percentage of the total substance list in Table 5). Consequently, the NCBI provides the highest rate of UT and CC quality information relative to the original set of 100 substances.
Therefore, the NCBI, with its subdomains PubChem and MeSH, is the best performing of all 96 tested DBs regarding available UT and CC substance information and appears to be the most suitable. As the most promising candidate, it was selected for integration into the MAGIC graph. However, other DBs might serve as valuable additional sources in the future, like NPRO for specific pesticide UT information, or KEGG, which offered comparably fewer but high-quality results. The unique strengths and weaknesses of all 96 DBs can be viewed in the Compendium.

3. Methods

3.1. Database Compilation

The search for adequate DBs was performed from November until December 2019. First, an extensive search for DBs focusing on substance property data was performed. DBs were compiled through a literature search using search engines like “Google.com”, “Google Scholar”, or “Web of Science”. The search was performed using the keywords “Substance”, “Use type”, “Chemical compound”, “Ecotoxicology”, “Toxicology”, “Chemical Class”, and “Database”. Keywords were also combined into search strings, and different variations of some keywords were used. The search was limited to substance DBs that were available in German or English. More than 50 relevant scientific papers and ~500 websites, domains, and sub-domains were examined in the process. Generally, any DB found was added to the initial dataset (see Supplementary Material Use type and Chemical Class Compendium sheet 1).

3.2. Decision Tree-Based Database Evaluation

DBs were evaluated using the decision tree given in Figure 1 between December 2019 and August 2020. The criteria were assessed in an order that accounts for their importance and verifiability, proceeding from general to more in-depth criteria. The decision tree was designed with the aim of finding chemical substance DBs that are frequently updated, provide a large, unique dataset, and are open access for researchers worldwide. However, DBs not fulfilling a particular criterion could still be of value for other research questions and purposes. In the following paragraphs, the criteria and the reasons for their implementation are described in detail and in decision tree order. The website hosting the DB first had to fulfil the following three criteria. Firstly, it needed to have some measures to ensure data integrity and, if, e.g., extracted from a dated paper, still needed to exist. Secondly, it was required that a substance search could be performed in the DB. The substance used for this criterion was dichlorodiphenyltrichloroethane (DDT) (CAS: 50-29-3) due to its extensive documentation and high ecotoxicological relevance. However, this criterion was met regardless of whether the DB included the substance or not, as long as the search could be accessed and yielded a result. As a third criterion, a unique URL was mandatory, as multiple DB searches can refer to the same larger DB. Likewise, a website might just refer to other websites without hosting a DB itself. All DBs with duplicate URLs were excluded. Websites that had the same primary domain, and were thus connected DBs, were manually compared and only one of them was kept in the dataset.
The fourth criterion focused on the question of whether the DB had a direct relation to the topic, i.e., whether it concerns chemical substances and their properties. In the next criterion, the date of the last update that the DB received was considered. Frequent updates were required (i.e., updated within the last 18 months). In the sixth criterion, the respective DB’s copyright and open access policy were verified. For criterion 7, the pricing model of the website was checked, as only free of charge DBs were desired. This criterion ensured that the method presented in this paper is reproducible on a broad scale. Criterion 8 focused on the uniqueness of the information in a DB, i.e., that the DB does not merely reproduce information from another DB, unless it harmonises more than one source. As an example, a substance web store has a DB including information obtained from Wikipedia for all relevant substances, and both DBs are listed in the DB set. In this case, the original DB would be chosen. Another example would be a DB that unites multiple other DBs or their substance information without information loss.
The next criterion assessed whether the DB listed more than 1500 PEICs. If not, the question was asked whether it conveyed information on unique, relevant substances or compounds. For criterion 10, we assessed whether the DB contained CC or UT information or even both. As with criterion 4, DBs that were discarded during this step might be useful for other purposes. Criterion 11 evaluated whether DBs were wholly integrated into other DBs, in order to account for complete overlap of DBs. In this step, even original data sources were discarded to favour meta-databases and DB collections, like NCBI or ChemSpider, as they accumulate information. The last criterion (12) assessed the spatial applicability of the DB, distinguishing DBs with global, universal relevance from regional DBs that include substances of only local importance and were therefore deemed less relevant for the present study.
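The sequential logic of the decision tree can be illustrated with a minimal Python sketch. It is not the procedure used to build the compendium, which was assessed manually; the record fields (`exists`, `searchable`, `months_since_update`, `free`, `n_peic`, `unique_substances`, `has_ut`, `has_cc`) and the example DB names are hypothetical, and only a subset of the twelve criteria is shown. Each DB is checked against the criteria in order and is discarded at the first criterion it fails.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Database = Dict[str, Any]          # hypothetical record describing one DB
Criterion = Tuple[str, Callable[[Database], bool]]

# Subset of the twelve criteria, in decision tree order.
# Field names are assumptions made for this sketch only.
CRITERIA: List[Criterion] = [
    ("1 DB exists", lambda db: bool(db.get("exists"))),
    ("2 search function", lambda db: bool(db.get("searchable"))),
    ("5 frequent updates", lambda db: db.get("months_since_update", 999) <= 18),
    ("7 free of charge", lambda db: bool(db.get("free"))),
    ("9 more than 1500 PEICs",
     lambda db: db.get("n_peic", 0) > 1500 or bool(db.get("unique_substances"))),
    ("10 UT or CC", lambda db: bool(db.get("has_ut") or db.get("has_cc"))),
]


def first_failed_criterion(db: Database) -> Optional[str]:
    """Return the name of the first criterion the DB fails, or None."""
    for name, check in CRITERIA:
        if not check(db):
            return name
    return None


def run_decision_tree(dbs: List[Database]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Split DBs into those kept and those discarded, grouped by criterion."""
    kept: List[str] = []
    discarded: Dict[str, List[str]] = {}
    for db in dbs:
        failed = first_failed_criterion(db)
        if failed is None:
            kept.append(db["name"])
        else:
            discarded.setdefault(failed, []).append(db["name"])
    return kept, discarded


# Invented example records, not entries from the compendium.
EXAMPLE_DBS: List[Database] = [
    {"name": "db_a", "exists": True, "searchable": True, "months_since_update": 3,
     "free": True, "n_peic": 20000, "has_ut": True},
    {"name": "db_b", "exists": True, "searchable": False},
    {"name": "db_c", "exists": True, "searchable": True, "months_since_update": 40,
     "free": True},
    {"name": "db_d", "exists": True, "searchable": True, "months_since_update": 6,
     "free": True, "n_peic": 900, "has_cc": True},
]

kept, discarded = run_decision_tree(EXAMPLE_DBS)
```

Recording the first failed criterion per DB corresponds to the "Initial discard criteria" column of the compendium, whereas applying each check independently would give the per-criterion counts marked with an asterisk in Table 3.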

3.3. Chemical Class and Use Type Verification

The DBs that passed the decision tree analysis (n = 9) and the MAGIC graph were then thoroughly tested for their quantitative and qualitative properties using 100 randomly selected substances from the MAGIC graph DB. The majority of chemicals in the MAGIC graph could be found in 1–3 DBs. Substances of high importance for ecotoxicological research (e.g., pesticides) were often found in >7 DBs. Substances were generally considered to be of higher importance if they had a larger number of linked DBs in the MAGIC graph, because the MAGIC graph primarily aggregates ecotoxicological DBs. A list of all substances in the MAGIC graph was compiled (n = 19,069), including the number of datasets in which each substance was found. From this list, a weighted random sample was drawn, in which the weight of each substance equals the number of datasets in which it was contained. Figure 2a shows the original dataset with the total of 19,069 substances found in the MAGIC graph, and Figure 2b shows the weighted random sample. In this way, the 100 substances were weighted to represent ecotoxicological importance, with the number of datasets linked to a substance within the MAGIC graph serving as a proxy.
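A weighted draw of this kind can be sketched in Python as follows. The sketch assumes that each substance can be drawn at most once (sampling without replacement), which the text does not state explicitly, and it uses the Efraimidis-Spirakis key method as one possible technique; the tool actually used for the sampling is not specified. The function name `weighted_sample` and the toy substance names are hypothetical.

```python
import random
from typing import Dict, List, Optional


def weighted_sample(weights: Dict[str, int], k: int,
                    seed: Optional[int] = None) -> List[str]:
    """Draw up to k distinct items, favouring items with larger weights.

    weights maps a substance identifier to the number of linked datasets.
    Items with a weight of zero are never drawn.
    """
    rng = random.Random(seed)
    # Efraimidis-Spirakis: give each item the key u ** (1 / w) with u uniform
    # in [0, 1), then keep the k items with the largest keys.
    keyed = [(rng.random() ** (1.0 / w), name)
             for name, w in weights.items() if w > 0]
    keyed.sort(reverse=True)
    return [name for _, name in keyed[:k]]


# Toy input: substance -> number of linked datasets (invented values).
toy_weights = {"s1": 8, "s2": 3, "s3": 1, "s4": 1, "s5": 0}
toy_sample = weighted_sample(toy_weights, k=3, seed=1)
```

With the real list, `weights` would hold the 19,069 substances and `k` would be 100; fixing `seed` makes the draw reproducible.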
Subsequently, manual searches were performed in each DB to determine whether it contained entries for the designated substances. These manual searches were performed in late December 2019 and January 2020. Afterwards, each DB was rated by assigning every entry to one of three categories:
  • (0) no relevant information,
  • (1) basic information, and
  • (2) detailed information.
The number of substances with any information regarding UT and CC (basic information (1) plus detailed information (2)) was calculated for each of the 9 DBs and the MAGIC graph. This is the foundation for the quantitative analysis, in which this number was compared to the number of substances found in the DB and to the total number of substances given (n = 100) (the results can be found in Table 4). The number of substances with detailed information (2) was calculated for the qualitative analysis (the results can be found in Table 5). Here, the number of substances with detailed UT and CC information was compared to the number of substances found with any (basic or detailed) UT and CC information, respectively, as well as to the total number of substances given (n = 100).
Subsequently, the results were compared across DBs, and the DBs were ranked according to the quality and quantity of information available. From this, the DB with the best performance was chosen.
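The quantitative and qualitative ratios described above can be expressed as a short Python sketch. It assumes that each substance is coded per DB as 0, 1, or 2 for the three categories, and as `None` when the substance is not found in the DB; this coding, the function name `summarise`, and the toy data are assumptions for illustration.

```python
from typing import Dict, Optional


def pct(part: int, whole: int) -> int:
    """Rounded percentage, or 0 when the denominator is zero."""
    return round(100 * part / whole) if whole else 0


def summarise(codes: Dict[str, Optional[int]], n_total: int) -> Dict[str, int]:
    """Summarise UT (or CC) category codes of one DB.

    codes maps substance -> 0 (no relevant information), 1 (basic),
    2 (detailed), or None if the substance is not listed in the DB.
    """
    known = sum(1 for c in codes.values() if c is not None)
    with_info = sum(1 for c in codes.values() if c in (1, 2))
    detailed = sum(1 for c in codes.values() if c == 2)
    return {
        "known": known,
        "with_info": with_info,
        "detailed": detailed,
        # Quantitative analysis: entries with information vs. known substances.
        "pct_info_of_known": pct(with_info, known),
        # Qualitative analysis: detailed entries vs. entries with information
        # and vs. the full substance list.
        "pct_detailed_of_info": pct(detailed, with_info),
        "pct_detailed_of_total": pct(detailed, n_total),
    }


# Toy example with four substances (invented values).
toy_codes = {"a": 2, "b": 1, "c": 0, "d": None}
toy_summary = summarise(toy_codes, n_total=4)
```

Applied to the NCBI figures reported in Section 2.3, `pct(79, 86)` gives 92 and `pct(78, 82)` gives 95, matching the stated high-quality rates for UT and CC.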

4. User Notes

The website https://magic.eco/ provides access to the most recent version of the MAGIC graph. It offers the possibility to visit individual chemical identifiers and to discover the synonyms and generalisations and the data that is currently connected to the chemical. The website also provides a user with an option to download an up-to-date version of the Microsoft® Excel worksheet published with this data descriptor (https://static.magic.eco/Compendium).

Supplementary Materials

Chemical Class and Use Type Compendium available online at https://www.mdpi.com/2306-5729/5/4/114/s1.

Author Contributions

Conceptualisation, N.H., S.B., and J.W.; methodology, N.H., S.B., and J.W.; formal analysis, N.H.; investigation, N.H.; resources, R.S.; data curation, N.H., S.B., and J.W.; writing—original draft preparation, N.H., S.B., and R.S.; writing—review and editing, N.H., S.B., J.W., S.S., L.L.P., and R.S.; visualisation, N.H.; supervision, S.B. and R.S.; project administration, S.S. and R.S.; funding acquisition, S.S. and R.S. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the German Society for the Advancement of Sciences (DFG SCHU 2271/6-2).

Acknowledgments

We thank Felix Hoegerl for technical support.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Database Compendium References

References

  1. Bernhardt, E.S.; Rosi, E.J. Synthetic chemicals as agents of global change. Front. Ecol. Environ. 2017, 15, 84–90.
  2. Newman, M.C. Fundamentals of Ecotoxicology. In Fundamentals of Ecotoxicology, 4th ed.; CRC Press: Boca Raton, FL, USA, 2014.
  3. Lukyanenko, R.; Parsons, J.; Wiersma, Y. Citizen Science 2.0: Data Management Principles to Harness the Power of the Crowd. In Service-Oriented Perspectives in Design Science Research; DESRIST 2011, Lecture Notes in Computer Science; Jain, H., Sinha, A.P., Vitharana, P., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6629.
  4. Bejarano, A.C.; Farr, J.K.; Jenne, P.; Chu, V.; Hielscher, A. The Chemical Aquatic Fate and Effects database (CAFE), a tool that supports assessments of chemical spills in aquatic environments. Environ. Toxicol. Chem. 2016, 35, 1576–1586.
  5. KEGG COMPOUND Database. 2020. Available online: https://www.genome.jp/kegg/compound/ (accessed on 5 August 2020).
  6. Information on Chemicals—ECHA. 2020. Available online: https://echa.europa.eu/information-on-chemicals (accessed on 5 August 2020).
  7. DrugBank. 2020. Available online: https://www.drugbank.ca/ (accessed on 5 August 2020).
  8. Feunang, Y.D.; Eisner, R.; Knox, C.; Chepelev, L.L.; Hastings, J.; Owen, G.; Fahy, E.; Steinbeck, C.; Subramaniam, S.; Bolton, E.; et al. ClassyFire: Automated chemical classification with a comprehensive, computable taxonomy. J. Cheminform. 2016, 8, 61.
  9. Connors, K.A.; Beasley, A.; Barron, M.G.; Belanger, S.E.; Bonnell, M.; Brill, J.L.; de Zwart, D.; Kienzler, A.; Krailler, J.; Otter, R.; et al. Creation of a Curated Aquatic Toxicology Database: EnviroTox. Environ. Toxicol. Chem. 2019, 38, 1062–1073.
  10. NAP. Strategies to Protect the Health of Deployed U.S. Forces: Detecting, Characterizing, and Documenting Exposures. Washington (DC). 2000. Available online: https://www.nap.edu/read/9767/chapter/6#71 (accessed on 10 August 2020).
  11. Wood, A. Compendium of Pesticide Common Names. 2020. Available online: http://www.alanwood.net/pesticides/index.html (accessed on 15 May 2020).
  12. Bub, S.; Wolfram, J.; Stehle, S.; Petschick, L.L.; Schulz, R. Graphing Ecotoxicology: The MAGIC Graph for Linking Environmental Data on Chemicals. Data 2019, 4, 34.
  13. CSAR. 2019. Available online: http://csardock.org/ (accessed on 5 August 2020).
  14. SCRIPDB. SCRIPDB University of Toronto. 2020. Available online: http://dcv.uhnres.utoronto.ca/SCRIPDB/search/ (accessed on 5 August 2020).
  15. The Binding Database. Available online: http://bindingDB.org/bind/index.js (accessed on 5 August 2020).
  16. LookChem. Look for Chemicals all over the World. 2020. Available online: https://www.lookchem.com/last (accessed on 5 August 2020).
  17. Home | Cayman Chemical. 2020. Available online: https://www.caymanchem.com/ (accessed on 5 August 2020).
  18. University of Hertfordshire. PPDB—Pesticides Properties DataBase. 2020. Available online: http://sitem.herts.ac.uk/aeru/ppDB/ (accessed on 5 August 2020).
  19. National Center for Biotechnology Information. 2020. Available online: https://www.ncbi.nlm.nih.gov/ (accessed on 5 August 2020).
  20. Toxnet. Available online: https://www.nlm.nih.gov/toxnet/index.html (accessed on 20 April 2020).
  21. Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B.A.; Thiessen, P.A.; Yu, B.; et al. PubChem 2019 update: Improved access to chemical data. Nucleic Acids Res. 2019, 47, D1102–D1109.
  22. Pubmeddev. Home-PubMed-NCBI. 2020. Available online: https://www.ncbi.nlm.nih.gov/pubmed/ (accessed on 5 August 2020).
  23. Wikipedia (Hg.) Main Page. 2020. Available online: https://en.wikipedia.org/w/index.php?title=Main_Page&oldid=939357440 (accessed on 5 August 2020).
  24. NPRO. 2019. Available online: http://npic.orst.edu/NPRO/ (accessed on 5 August 2020).
  25. Chemistry Dashboard | Home. 2020. Available online: https://comptox.epa.gov/dashboard/ (accessed on 5 August 2020).
  26. EMBL-EBI. The European Bioinformatics Institute (EMBL-EBI). 2020. Available online: https://www.ebi.ac.uk/ (accessed on 5 August 2020).
  27. ChemSpider Search and Share Chemistry. 2020. Available online: http://www.ChemSpider.com/Default.aspx (accessed on 5 August 2020).
Figure 1. Decision tree that was used to evaluate the initial set of databases (DBs) (n = 96).
Figure 2. Number of linked datasets for individual substances in the MAGIC graph. (a) Distribution of the number of linked DBs per substance in the MAGIC graph (n = 19,069). (b) Distribution of the number of linked DBs in the weighted sample drawn from the MAGIC graph (n = 100).
Table 1. Description of the columns and their content as used in the Chemical Class (CC) and Use Type (UT) Compendium sheet 1—DB selection process (https://static.magic.eco/Compendium). Each row in the compendium describes a different DB.
Column | Description
Name DB | Full name of the DB, according to the information provided by the website.
Criterion 1 | Whether the DB still exists.
Criterion 2 | Whether a search can be performed in the DB.
Criterion 3 | The URL used is unique or the DB is unique.
Criterion 4 | The website has a DB with a visible connection to the topic (chemical substance information, pesticide information, etc.).
Criterion 5 | Whether the website and DB are current (updated in the last six months) or whether extensive updates occur once a year (in that case, the last 18 months are reviewed and taken into account).
Criterion 6 | Legal status: whether scraping of data is allowed, under what copyright the data is available, and whether the website allows download/usage of data and integration into another website or use of the majority of the DB information for a research paper.
Criterion 7 | The website is free of charge for academic/scientific purposes or in general, meaning no associated license fees.
Criterion 8 | The DB data is unique and original to the website or, if not, the website harmonises multiple data sources.
Criterion 9 | There are more than 1500 PEICs available on the website or the DB has unique information on individual substances that are not available in other DBs.
Criterion 10 | Uses chemical class or use type information and provides one or both of these.
Criterion 11 | Is not entirely integrated by another DB or has additional information that is not part of another DB (e.g., original DB) that reached the same stage in the decision tree.
Criterion 12 | Softer criterion: the website/DB is global and not only of regional relevance. (Below 10k entries, and specifically for a particular state or country; DBs with above 50k substance entries are generally seen to be of international relevance.)
URL | The URL to the landing page through which the DB can be publicly accessed.
Initial discard criteria | The criteria why the DB was discarded.
Chemical Abstracts Service (CAS) Number | Information on the availability of CAS information within the DB.
Last access | The point in time when the DB/website was last accessed.
Table 2. Description of the columns and their content as used in the Chemical Class and Use Type Compendium sheet 2—UT and CC analysis (https://static.magic.eco/Compendium).
Column | Description
CAS Number | The CAS number of the substance.
Substance Name | The common name of the substance (IUPAC/product name).
SMILES | Simplified Molecular Input Line Entry Specification (SMILES) structure description of the substance.
Substance known * | Indicates if the substance was found in the DB (1 = yes, 0 = no).
DB UT * | If and which UT information could be found regarding the substance (“Unknown” indicates that no information was available).
DB CC * | If and which CC information could be found regarding the substance (CAS & Substance name) (“Unclassified” indicates that no information was available).
* These columns are provided for each of the ten DBs more closely evaluated.
Table 3. The twelve selection criteria, their description, the number of discarded DBs during each decision step, the number of DBs remaining after each decision step. The right column describes the number of DBs (out of the total of 96 DBs evaluated) that fulfilled the respective criterion. Those nine DBs left were further analysed along with the “Meta-analysis of the Global Impact of Chemicals” (MAGIC) graph DB. (For more details on decision criteria, see Table 1.)
Criterion | Decision (Criterion) | DBs That Fulfil the Respective Criterion * [n] | Discarded DBs [n] | Remaining DBs [n]
Start | - | | 0 | 96
1 | DB exists | 96 | 4 | 92
2 | Search function | 80 | 7 | 85
3 | Unique DB/URL | 88 | 7 | 78
4 | A relevant focus of DB | 65 | 24 | 55
5 | Frequent updates ( 18 months) | 74 | 10 | 45
6 | Legal status | 73 | 10 | 35
7 | Free of charge access DB | 74 | 2 | 33
8 | Data(set) | 76 | 0 | 33
9 | >1500 PEICs | 63 | 5 | 28
10 | UT or CC or both | 41 | 7 | 21
11 | Overlap | 29 | 7 | 14
12 | DB of global relevance | 59 | 5 | 9
Result | | | | 9
* Effect of each filter on the total number of DBs (n = 96) when the criteria are applied independently. Possible results are Yes, No, and NA; the number of “Yes” results is shown in this column.
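Table 3 reports two different counts per criterion: how many DBs fulfil it when it is applied on its own, and how many are discarded when the criteria are applied one after another. A minimal sketch of that distinction follows. It uses four hypothetical DBs and three abbreviated criterion names (toy data, not from the compendium) and assumes, as a simplification, that a DB is discarded as soon as it fails a criterion. How “NA” results were handled is not modelled here.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int, int, int]


def run_cascade(dbs: List[Dict[str, bool]], criteria: List[str]) -> List[Row]:
    """Return (criterion, fulfilled independently, discarded, remaining) per step."""
    remaining = list(dbs)
    rows: List[Row] = []
    for crit in criteria:
        # Independent count: evaluated against all DBs, ignoring earlier steps.
        fulfilled = sum(1 for db in dbs if db[crit])
        # Sequential step: only DBs that survived the earlier steps are filtered.
        kept = [db for db in remaining if db[crit]]
        rows.append((crit, fulfilled, len(remaining) - len(kept), len(kept)))
        remaining = kept
    return rows


# Hypothetical toy input: one dict of criterion results per DB.
toy_dbs = [
    {"exists": True, "search": True, "relevant_focus": True},
    {"exists": True, "search": False, "relevant_focus": True},
    {"exists": False, "search": False, "relevant_focus": True},
    {"exists": True, "search": True, "relevant_focus": False},
]

for row in run_cascade(toy_dbs, ["exists", "search", "relevant_focus"]):
    print(row)
```

In the toy data, three of the four DBs fulfil `relevant_focus` on its own, yet only one DB is left after the third step, because the other two were already removed earlier. This is why the independent counts in the column marked * cannot be derived from the discarded and remaining columns.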
Table 4. Quantitative analysis of the nine selected DBs and the current MAGIC graph. The number of substances found in the DB is provided; because 100 substances were searched, this number is also the percentage of substances found. Furthermore, the number of substance entries with UT information and the number with CC information are given. Lastly, each of these numbers is expressed as a percentage of the substances found in the DB.
Database | Known Substances (100 = max) | Substances with UT Information | Percentage of Substances with UT Relative to the Number of Known Substances [%] | Substances with CC Information | Percentage of Substances with CC Relative to the Number of Known Substances [%]
ChemSpider | 99 | 48 | 48 | 68 | 69
Comptox EPA | 98 | 85 | 87 | 41 | 42
DrugBank | 15 | 12 | 80 | 10 | 67
EBI | 66 | 48 | 74 | 50 | 77
ECHA | 88 | 32 | 37 | 2 | 2
KEGG | 53 | 43 | 83 | 41 | 79
MAGIC-Graph | 100 | 56 | 56 | 48 | 48
NCBI group | 100 | 86 | 86 | 82 | 82
NPRO | 38 | 38 | 100 | 0 | 0
Wikipedia | 64 | 63 | 98 | 58 | 90
Table 5. Qualitative analysis of the nine selected DBs and the current MAGIC graph. The number of substances with detailed (quality) UT information in the DB and the percentage of substances with quality UT information compared to all substances with UT information that were found are provided. Furthermore, the number of substance entries with quality CC, as well as the percentage of substances with quality CC information, compared to all substances with CC information, is described.
DB Name | Substances with Quality UT Information | Percentage of Substances with Quality UT Relative to the Number of All Substances with UT Information [%] | Substances with Quality CC | Percentage of Substances with Quality CC Relative to the Number of All Substances with CC Information [%]
ChemSpider | 45 | 94 | 60 | 88
Comptox EPA | 55 | 65 | 30 | 73
DrugBank | 11 | 92 | 9 | 90
EBI | 45 | 94 | 47 | 94
ECHA | 22 | 69 | 0 | 0
KEGG | 40 | 93 | 37 | 90
MAGIC-Graph | 52 | 93 | 27 | 56
NCBI group | 79 | 92 | 78 | 95
NPRO | 37 | 97 | 0 | 0
Wikipedia | 57 | 91 | 47 | 81
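The percentages in Table 5 are taken relative to the substances that have any UT or CC information, which are the counts listed in Table 4, rather than relative to all known substances. A minimal sketch of that calculation for three rows follows. The function name, rounding to the nearest integer, and reporting 0 when no substance has information are assumptions, so recomputed values for other rows may differ slightly from the published ones.

```python
def quality_share(quality: int, with_info: int) -> int:
    """Share of entries with quality information among all entries with information, in %."""
    if with_info == 0:
        # No entries with information at all (e.g. NPRO has no CC entries):
        # report 0 instead of dividing by zero (assumed convention).
        return 0
    return round(100 * quality / with_info)


# db: (with UT [Table 4], quality UT [Table 5], with CC [Table 4], quality CC [Table 5])
ROWS = {
    "MAGIC-Graph": (56, 52, 48, 27),
    "NCBI group": (86, 79, 82, 78),
    "NPRO": (38, 37, 0, 0),
}

for db, (with_ut, quality_ut, with_cc, quality_cc) in ROWS.items():
    print(db, quality_share(quality_ut, with_ut), quality_share(quality_cc, with_cc))
```

For these three rows the sketch reproduces the tabulated percentages.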
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Heinemann, N.; Bub, S.; Wolfram, J.; Stehle, S.; Petschick, L.L.; Schulz, R. A Compendium of Chemical Class and Use Type Open Access Databases. Data 2020, 5, 114. https://0-doi-org.brum.beds.ac.uk/10.3390/data5040114
