Chemistry and Biology databases

We now have a number of very large virtual libraries of compounds that are available for make-on-demand synthesis. Providers include Enamine (70 billion molecules), WuXi (200 million molecules), and Liverpool ChiroChem (1 billion 3D-rich molecules).

Probably the largest database for commercial compounds is eMolecules, the "Google for Molecules". Draw chemical structures using drawing packages such as JME (embedded applet), ISIS/Draw, ChemDraw or ChemSketch, and then instantly search over 8.0 million unique chemical structures from more than 140 leading chemical suppliers. Search results include reference links to properties and spectra from sources such as DrugBank, National Cancer Institute, NIST WebBook, PubChem, EPA and more.

ZINC is a free database of commercially available compounds, ideal for virtual screening. One really nice feature is the property-defined subsets, such as lead-like (Teague, Davis, Leeson, Oprea, Angew Chem Int Ed Engl. 1999 Dec 16;38(24):3743-3748.), drug-like (Lipinski, J Pharmacol Toxicol Methods. 2000 Jul-Aug;44(1):235-49.), etc. These can all be downloaded and searched locally.

MMsINC is a free web-oriented database of commercially available compounds for virtual screening and chemoinformatic applications. MMsINC contains over 4 million non-redundant chemical compounds in 3D formats. MMsINC is provided by the Molecular Modeling Section in the Department of Pharmaceutical Sciences at the University of Padova (Italy) in collaboration with the Software Support Services & Development Laboratory (S3D) at the Center for Advanced Studies, Research and Development (CRS4) in Sardinia.

The largest chemical database is PubChem, which is organized as three linked databases within the NCBI's Entrez information retrieval system: PubChem Substance, PubChem Compound, and PubChem BioAssay. PubChem also provides a fast chemical structure similarity search tool. The database also contains a variety of calculated physicochemical properties for each molecule. Many compounds have links to primary literature, and increasingly other databases are providing links to PubChem.
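As a rough illustration of how the calculated properties can be retrieved programmatically, the sketch below builds a request URL for PubChem's PUG REST interface. PUG REST is not mentioned in the text above; the function name `pubchem_property_url` and the default property list are illustrative choices. The code only constructs the URL and does not contact the server.

```python
from urllib.parse import quote

# Base address of PubChem's PUG REST service (an assumption of this sketch;
# the surrounding text does not name a programmatic interface).
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_property_url(name, properties=("MolecularWeight", "XLogP", "TPSA")):
    """Build a URL requesting calculated properties for a compound name as JSON."""
    # Percent-encode the name so spaces and punctuation are safe in the URL path.
    encoded_name = quote(name, safe="")
    property_list = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/name/{encoded_name}/property/{property_list}/JSON"


if __name__ == "__main__":
    # Print the URL; fetching it (e.g. with urllib.request) is left to the reader.
    print(pubchem_property_url("aspirin"))
```

Keeping URL construction separate from the network call makes the request easy to inspect before sending it and to reuse with whichever HTTP client you prefer.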

ChemSpider is a free-access service providing a structure-centric community for chemists. Providing access to millions of chemical structures and integration with a multitude of other online services, ChemSpider is the richest single source of structure-based chemistry information and an invaluable source of spectral information. It also hosts a property prediction service.

Guide to Pharmacology, created in a collaboration between the British Pharmacological Society (BPS) and the International Union of Basic and Clinical Pharmacology (IUPHAR) and now developed jointly with funding from the Wellcome Trust, is intended to become a “one-stop shop” portal to pharmacological information. The human targets covered are shown in the image below.

[Chart: human targets covered by the Guide to Pharmacology]

BindingDB is a public, web-accessible database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be drug targets with small, drug-like molecules. BindingDB contains 2.9M data for 1.3M compounds and 9.3K targets. Of those, 1,397K data for 655K compounds and 4.5K targets were curated by BindingDB curators.

Chem-TCM is a digital database of individual molecules that are constituents of plants used in traditional Chinese herbal medicine. The database consists of four major parts: chemical identification, botanical information, predicted activity against common Western therapeutic targets, and estimated molecular activity according to traditional Chinese herbal medicine categories.

ChEMBL is a database of bioactive drug-like small molecules. It contains 2-D structures, calculated properties (e.g. logP, molecular weight, Lipinski parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data).

The PDBbind database is designed to provide a collection of experimentally measured binding affinity data (Kd, Ki, and IC50) exclusively for the protein-ligand complexes available in the Protein Data Bank (PDB). All of the binding affinity data compiled in this database are cited from original references.

The GRAC database is a searchable online database of information from the 5th (2011) edition of the BPS Guide to Receptors and Channels (GRAC) [1], which provides a succinct overview of the key properties of over 1600 established or potential pharmacological targets.

SuperTarget is an extensive web resource for analyzing 332,828 drug-target interactions.

The Therapeutic Target Database provides information about the known and explored therapeutic protein and nucleic acid targets, the targeted disease, pathway information and the corresponding drugs directed at each of these targets. Also included in this database are links to relevant databases containing information about target function, sequence, 3D structure, ligand binding properties, enzyme nomenclature and drug structure, therapeutic class, and clinical development status. All information provided is fully referenced.

The Centre for Therapeutic Target Validation (CTTV) platform brings together information on the relationships between potential drug targets and diseases. The core concept is to identify evidence of an association between a target and disease from various data types. The Centre for Therapeutic Target Validation is a pre-competitive public-private venture that aims to provide evidence on the biological validity of therapeutic targets and provide an initial assessment of the likely effectiveness of pharmacological intervention on these targets, using genome-scale experiments and analysis. The platform currently contains 28,931 targets and 3,049,882 associations for 10,053 diseases.

MACiE, which stands for Mechanism, Annotation and Classification in Enzymes, is a database of enzyme reaction mechanisms (G. L. Holliday, C. Andreini, J. D. Fischer, S. A. Rahman, D. E. Almonacid, S. T. Williams and W. R. Pearson. Nucleic Acids Research, 40, D783-D789, 2012. Medline ID: 22058127). The current version of MACiE (Version 3.0) contains 335 fully annotated enzyme reaction mechanisms.

ChEBI Release 145 is live with 50089 fully annotated entities. ChEBI stands for 'Chemical Entities of Biological Interest'. It is a freely available database of 'small molecular entities', developed at the EBI. The term 'molecular entity' encompasses any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc., identifiable as a separately distinguishable entity.

The Chemical Structure Lookup Service allows you to search through 39 million indexed structures from 80 different databases. It is very fast, and again you can create the query in a variety of formats (including SMILES) or with an embedded Java applet.

DrugBank is a richly annotated database of drug and drug target information. It contains extensive data on the nomenclature, ontology, chemistry, structure, function, action, pharmacology, pharmacokinetics, metabolism and pharmaceutical properties of both small molecule and large molecule drugs. As the table below shows, the amount of information available in DrugBank has increased considerably since the first version. For Safari users this information is instantly available using the DrugBank Safari extension.

Comparison between the coverage in DrugBank 1.0, 2.0 and DrugBank 3.0

Category	1.0	2.0	3.0
No. of data fields	88	108	148
No. of search types	8	12	16
No. of drug-action pathways	0	0	223
No. of drugs with metabolizing enzyme data	0	0	762
No. of drug metabolites	0	0	811
No. of drugs with drug transporter data	0	0	516
No. of SNP-associated drug effects	0	0	113
No. of drugs with patent/pricing/manufacturer data	0	0	1208
No. of food–drug interactions	0	714	1039
No. of drug–drug interactions	0	13242	13795
No. of ADMET parameters (Caco-2, LogS)	0	276	890
No. of QSAR parameters per drug	5	6	14
No. of FDA-approved small molecule drugs	841	1344	1424
No. of biotech drugs	113	123	132
No. of nutraceutical drugs	61	69	82
No. of withdrawn drugs	0	57	68
No. of illicit drugs	0	188	189
No. of experimental drugs	2894	3116	5210
Total No. of experimental and FDA small molecule drugs	3796	4774	6684
Total No. of experimental and FDA drugs	3909	4897	6816
No. of names/brands/synonyms	18304	28447	37171
No. of approved-drug drug targets (unique)	524	1565	1768
No. of all drug targets (unique)	2133	3037	4326
No. of approved-drug enzymes/carriers (unique)	0	0	164
No. of all drug enzymes/carriers (unique)	0	0	169
No. of external database links	12	18	31
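To put a number on the increase between versions, the short sketch below computes the percentage growth from DrugBank 1.0 to 3.0 for a few rows copied from the table above. The helper name `percent_growth` and the choice of rows are illustrative only. Rows that start at zero are left out because relative growth is undefined for them.

```python
def percent_growth(old, new):
    """Return the relative change from old to new as a percentage."""
    if old == 0:
        # Relative growth from a zero baseline is undefined.
        raise ValueError("growth from zero is undefined")
    return 100.0 * (new - old) / old


# (DrugBank 1.0 value, DrugBank 3.0 value) taken from the comparison table.
rows = {
    "No. of experimental drugs": (2894, 5210),
    "No. of all drug targets (unique)": (2133, 4326),
    "No. of names/brands/synonyms": (18304, 37171),
}

if __name__ == "__main__":
    for label, (v1, v3) in rows.items():
        print(f"{label}: {percent_growth(v1, v3):.1f}% growth from 1.0 to 3.0")
```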

SuperNatural is a natural product database: http://bioinformatics.charite.de/supernatural/

ChemBioFinder is the latest version of the suite of databases provided by CambridgeSoft. About 500,000 compounds are indexed from a variety of databases, including The Merck Index, R&D Insight/Chemists, ChemINDEX Database, NCI, AIDS & Cancer, Traditional Chinese Medicines, Ashgate Drugs: Synonyms & Properties, Nanogen Index, ChemACX, and Sigma Aldrich MSDS.

CoCoCo is a suite of molecular databases for high throughput virtual screening purposes. CoCoCo collects molecular structural information of commercial compounds from various chemical vendors by providing it in a ready-to-use format. The main characteristic of CoCoCo is to include structural information about conformational states of the compounds.

RxList (www.rxlist.com) provides electronic versions of the FDA’s drug-product data sheets.

SkinSensDB is a curated database for skin sensitization assays. Skin sensitization is an important toxicological endpoint in drug development and regulatory decision making. Chemical sensitizers act as haptens, binding to protein molecules to trigger immune responses that could induce allergic contact dermatitis. To facilitate development of AOP-based computational prediction methods, a novel curated database named SkinSensDB has been constructed by manual curation of the published literature. DOI.

DisGeNET is a discovery platform integrating information on gene-disease associations (GDAs) from several public data sources and the literature doi.

Reframe.db is a screening library of 12,000 molecules assembled by combining three databases (Clarivate Integrity, GVK Excelra GoStar and Citeline Pharmaprojects) to facilitate drug repurposing.

Superdrug2 is a comprehensive resource for approved/marketed drugs. It contains details of 4,500 active pharmaceutical ingredients annotated with regulatory details, chemical structures (2D and 3D), dosage, biological targets, physicochemical properties, external identifiers, side-effects and pharmacokinetic data.

KLIFS is a kinase database that dissects experimental structures of catalytic kinase domains and the way kinase inhibitors interact with them. The KLIFS structural alignment enables the comparison of all structures and ligands to each other. Moreover, the KLIFS residue numbering scheme capturing the catalytic cleft with 85 residues enables the comparison of the interaction patterns of kinase-inhibitors, for example, to identify crucial interactions determining kinase-inhibitor selectivity. DOI.

There is also SwissBioisostere, a web service designed to give ideas about potential bioisosteres. It is derived from a matched molecular pair (MMP) analysis of ChEMBL 17. Two different queries are possible: you can ask for a range of possible replacements for a single substructure (e.g. replacements for an amide group), or for details about a particular substructural replacement of interest (e.g. carboxylic acid vs. tetrazole). Whilst this is very comprehensive, it contains a lot of transformations that were never intended to be bioisosteric replacements.

It is also worth noting the Enamine REAL Database. The current release of the REAL database comprises over 700 million compounds that comply with the “rule of 5” and Veber criteria: MW≤500, SlogP≤5, HBA≤10, HBD≤5, rotatable bonds≤10, and TPSA≤140. This is a database of enumerated synthetically accessible structures.
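The “rule of 5” and Veber thresholds quoted above amount to a simple property filter. The sketch below applies them to descriptor values that are assumed to have been calculated already, for example with a cheminformatics toolkit. The `Descriptors` class, the function names and the two example molecules are hypothetical and only serve the illustration. This is not Enamine's own implementation.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Pre-computed descriptors for one molecule (values supplied by the caller)."""
    mw: float              # molecular weight
    slogp: float           # calculated logP (SlogP)
    hba: int               # hydrogen-bond acceptors
    hbd: int               # hydrogen-bond donors
    rotatable_bonds: int
    tpsa: float            # topological polar surface area


# Upper bounds as quoted in the text.
LIMITS = {"mw": 500, "slogp": 5, "hba": 10, "hbd": 5, "rotatable_bonds": 10, "tpsa": 140}


def failed_criteria(d):
    """Return the names of all descriptors that exceed their upper bound."""
    return [name for name, limit in LIMITS.items() if getattr(d, name) > limit]


def passes_filter(d):
    """True when every descriptor is within its limit."""
    return not failed_criteria(d)


if __name__ == "__main__":
    # Hypothetical descriptor values, not taken from any real database entry.
    small = Descriptors(mw=350.4, slogp=2.1, hba=5, hbd=2, rotatable_bonds=4, tpsa=78.0)
    large = Descriptors(mw=612.7, slogp=5.8, hba=11, hbd=3, rotatable_bonds=12, tpsa=150.0)
    print(passes_filter(small), failed_criteria(small))
    print(passes_filter(large), failed_criteria(large))
```

Returning the list of failed criteria, not just a yes/no answer, makes it easier to see why a compound was rejected when tuning a screening collection.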

Worth reading

The NAR Database issue is an annual update of available databases. The focus is largely on biological databases, but it includes chemistry, toxicology, and target validation resources.

Who are the good suppliers?

Derek Lowe, on his “In the Pipeline” blog, has been compiling a list of chemical suppliers based on user feedback; the most up-to-date list can be found here. However, I’ve included the “Good Suppliers” below. If a supplier is not listed here, it might be worth heading over to the up-to-date list to see whether any feedback is available; it might save you money and time!

Good Suppliers

ABCR: good prices and hit rate on orders. Very professional.
Activate: expensive, but what’s there is there, and it’s the right stuff.
Adesis: not cheap, but very reliable and willing to work with customers to deliver similar compounds.
Advanced Chem Tech: recommended for peptide/amino acid stuff.
AK Scientific: several good reports on availability and purity.
Alinda: have ordered one thing from them, which was fine.
Anaspec: good reports on reliability.
Apollo: good stuff, but catalog needs to be a bit more in line with their real stock.
Array: very pricey, but it’s all there.
Astatech: good experience reported.
Bionet: interesting catalog, doesn’t back-order you.
Chembridge: a big catalog, but it’s all real. Occasional purity problem.
Chem/Impex: good hit rate on availability. Some questions on their chiral purities.
Combi-Blocks: good list of useful intermediates, delivers on them.
Enamine: similar to ChemBridge in many ways. Big catalog. Not the fastest out there.
Florida Center for Heterocyclics: occasional purity issues, but they do deliver.
Frontier: great source for boronic acids and the like.
Life Chemicals: have had good experiences with compound purity here.
Lu: good source for custom peptides.
Matrix: interesting catalog, which they will really ship to you.
Maybridge: on the border of being one of the big guys. Very reliable.
Midwest: good reports on reliability.
Netchem: custom synthesis, but (for once!) with good turnaround and purity.
Oakwood/Fluorochem: good prices and reliability.
Peptide Protein Research: good for custom peptides.
Pharmacore: good stock of intermediates.
Rieke: reliable, only game in town for many odd reagents.
Strem: well known for quality inorganics and organometallics.
Synquest: used to be PCR. Good customer service.
Synthonix: stuff is in stock, customer service is responsive.
TCI: has always delivered, and quickly.
Transworld: very reliable and responsive.
Tyger: have never had a problem with them.
Waterstone Chemicals: good experience on pricing and availability.

Worth reading: Drug- and Lead-likeness, Target Class, and Molecular Diversity Analysis of 7.9 Million Commercially Available Organic Compounds Provided by 29 Suppliers

Structural Databases

The Cambridge Crystallographic Data Centre (CCDC) compiles and distributes the Cambridge Structural Database (CSD), the world's repository of experimentally determined organic and metal-organic crystal structures.

The Crystallography Open Database (COD) provides an open-access collection of crystal structures of organic, inorganic, and metal-organic compounds and minerals. There are currently 214,780 entries in COD.

GPCRdb contains data, diagrams and web tools for G protein-coupled receptors (GPCRs). Users can browse all GPCR crystal structures and the largest collections of receptor mutants.

The RCSB Protein Data Bank contains 226,262 biomolecular structures. Of the 210 drugs registered by the FDA between 2010 and 2016, the molecular targets of 94% of these new molecular entities (NMEs) are known. The PDB contains 5,914 structures containing one of the known targets and/or a new drug, providing structural coverage for 88% of the recently approved NMEs across all therapeutic areas. DOI.

The Protein Data Bank in Europe contains 226,262 entries.

Binding MOAD is a subset of the Protein Data Bank (PDB), containing every high-quality example of ligand-protein binding. Hence, we call it the Mother of All Databases (MOAD).

MINICRYST is a Crystallographic and Crystallochemical Database for Minerals and their Structural Analogues

There is also a Database of Zeolite Structures

Updated 4 May 2021