
Building a Fragment Collection

One of the attractions of fragment-based screening is the observation that “Fragment Space” is smaller than “Chemical Space” and can therefore be probed more effectively with a relatively small library.

  • A million compounds cover only a small fraction of the suggested 10^60 Chemical Space, whilst 2000 compounds can probe much of the 10^6 Fragment Space

Several approaches have been described in the design of fragment libraries. Most comply with the commonly accepted Astex “Rule-of-Three” (MW <300, H-bond donors/acceptors <=3, cLogP <3). Ideally they should also have solubility measured.
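As a rough illustration of how the Rule-of-Three thresholds quoted above might be applied, the following Python sketch checks precomputed properties against them. The `Descriptors` container and its field names are hypothetical, and the property values are assumed to have been calculated elsewhere by whatever descriptor software is to hand.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties for one compound (hypothetical container)."""
    mol_weight: float
    h_donors: int
    h_acceptors: int
    clogp: float


def passes_rule_of_three(d: Descriptors) -> bool:
    """Apply the thresholds as quoted in the text:
    MW < 300, H-bond donors <= 3, H-bond acceptors <= 3, cLogP < 3."""
    return (
        d.mol_weight < 300
        and d.h_donors <= 3
        and d.h_acceptors <= 3
        and d.clogp < 3
    )
```

A filter like this only covers the calculated properties; it says nothing about measured solubility, which still has to be determined experimentally.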

Since fragments would be predicted to have only modest affinity for the molecular target, screening has to be carried out at relatively high concentrations. For this reason solubility is an absolutely critical property. Whilst there are several algorithms to predict solubility, the data generated by Selcia during the construction of their fragment library suggest they leave something to be desired. The plot below shows calculated solubility (WSKOWWIN software, part of the SRC EPI Suite) plotted against the experimental data from a turbidimetric assay.

The lack of predictability is perhaps unsurprising since the algorithm was probably trained using larger molecules than these fragments. Selcia and Maybridge have thus taken the approach that all fragments should have measured solubility.

Another consequence of screening at high concentrations is that the influence of impurities can be amplified, so careful quality control, particularly checking for the presence of metals, is critical. Strong acids or bases may also overwhelm the buffer solution.

Fragment Library Design

One approach to choosing compounds for the fragment collection is to take a collection of biologically active compounds (e.g. known drugs) and fragment them (e.g. using RECAP from CCG). The most common fragments containing 8 or more heavy atoms are then used to search commercial compound collections.

BindingDB is a public, web-accessible database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be drug targets with small, drug-like molecules. BindingDB contains 910,836 binding data for 6,263 protein targets and 378,980 small molecules.

The first step was to tidy up the BindingDB structures using sdwash.

sdwash prepares SD files by carrying out a number of operations on the molecular data field, which include 2D depiction layout, hydrogen correction, salt and solvent removal, chirality and bond type normalization, tautomer generation, adjustment and enumeration of protonation states.

The file was then filtered to remove reactive or very high molecular weight compounds.
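The molecular weight part of this filtering step can be sketched in plain Python. This is not the tool used above; it is a minimal sketch that assumes each SD record carries a data item holding the molecular weight (the field name `MW` is an assumption) and that the cut-offs are supplied by the caller. It relies only on two features of the SD format: records end with a `$$$$` line, and data items are introduced by a `> <NAME>` header line.

```python
from typing import Iterator, List, Optional


def iter_sd_records(text: str) -> Iterator[str]:
    """Yield each record of an SD file; every record ends with a '$$$$' line."""
    record: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield "\n".join(record)
            record = []
        else:
            record.append(line)


def get_field(record: str, name: str) -> Optional[str]:
    """Return the first value line of the data item '> <name>', if present."""
    lines = record.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(">") and f"<{name}>" in line:
            return lines[i + 1].strip() if i + 1 < len(lines) else None
    return None


def filter_by_weight(text: str, low: float, high: float,
                     field: str = "MW") -> List[str]:
    """Keep records whose weight field lies within [low, high].

    The field name 'MW' is an assumption; records without it are skipped.
    """
    kept = []
    for record in iter_sd_records(text):
        value = get_field(record, field)
        if value is None:
            continue  # no weight recorded: skip rather than guess
        if low <= float(value) <= high:
            kept.append(record)
    return kept
```

Removing reactive compounds needs substructure matching and is outside the scope of this sketch.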

The compounds were then fragmented using sdfrag, and the very low molecular weight fragments (MWt < 50) were filtered out, resulting in an sdf file containing around 100,000 fragments, many of which were duplicates. This file was then converted to canonical SMILES format using OpenBabel. Datadesk was then used to generate the frequency breakdown for all fragments; the top 100 most common fragments are shown in the table below.

Frequency breakdown of recap 

Total Cases 98221 Number of Categories 19721

Group                             Count
Cc1ccccc1                         2802
c1ccncc1                          2348
C1CNCCN1                          1815
NC(C)(C)C                         1496
c1ccccc1                          1351
C1COCCN1                          1293
c1ccc(cc1)O                       1048
Nc1ccc(cc1)O                      739
CCOC                              729
CC(=O)O                           574
C(F)(F)F                          556
CC(=O)[O-]                        539
Nc1ccccc1                         526
OCc1ccccc1                        515
C1CN(CCN1)C                       509
C[S](=O)=O                        505
NC(C=O)C(C)C                      504
c1c(cccc1Cl)Cl                    456
NCc1ccccc1                        443
N(CC)CC                           432
CCCC                              392
NCCCC                             384
NCC=O                             375
C1CCCN1                           366
NCC#N                             319
C1CC(CN1)F                        314
OCCOC                             290
C1CCCCN1                          428
Cc1ccncc1                         262
Cc1ccc(cc1)O                      255
Cc1ccc(cc1)C#N                    253
c1ccc(cc1)F                       250
Cc1ccccc1O                        250
CCc1ccc(cc1Cl)Cl                  238
Nc1ccc(cc1)Cl                     236
Cc1ccc(cc1)F                      233
C1CCCC1                           230
C1CCCCC1                          222
C1NC(CC1Cl)C=O                    222
Oc1ccc2c(c1)CCC2CC(=O)O           222
c1ccc(s1)Cl                       216
CC(Cc1c[nH]c2c1cccc2)N            215
NC(C)C                            214
C1CCC(N1)C=O                      212
c1ccc(cc1)OC                      209
Cc1cc(c(cc1)O)O                   199
c1(ccc2c(c1)ccc(c2)Cl)[S](=O)=O   188
NCCC                              188
CCN(C)C                           184
CCCOC                             183
[S](=O)(=O)c1ccc(cc1)O            178
Cc1ccc(cc1)Br                     169
Cc1cc(ccc1)C(N)=N                 157
C1CNCCN1CC(=O)O                   156
Nc1ccc(cc1)OC(F)(F)F              150
CCN1CCOCC1                        149
C(=O)CC                           148
Cc1ccc(cc1)Cl                     147
C1C(CC(N1)C)F                     144
Oc1ccc(cc1)C(=O)O                 144
C1CNC(C1N)=O                      141
COCCOC                            140
NCC(=O)[O-]                       140
NC1CCNCC1                         134
Nc1ccc(cn1)Cl                     132
CC1CCCCC1                         131
C(=O)c1ccccc1                     130
Nc1cc(ccc1)C                      128
Cc1ccc(cc1)C                      125
Cc1ccccc1OC                       125
C(=O)c1c(cncc1Cl)Cl               123
CC(C)O                            122
N1C(CC2C(C1)CCCC2)C=O             122
C1C(NCC1)C#N                      121
CCCO                              121
c1cc(cc(c1)O)O                    120
C1Cc2c(ccc(c2)F)N1                117
C(=O)CC#N                         116
Nc1ccc(cc1O)CC=O                  116
Nc1ncccc1N                        116
c1cnc[nH]1                        113
Cc1c(cccc1F)F                     113
Sc1ccccc1                         112
[S](=O)(=O)c1ccccc1N              112
C1NC(CC1Cl)C(=O)NC(C)(C)C         111
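A frequency breakdown of this kind does not need a dedicated statistics package. The Python sketch below is one possible way to reproduce the totals and ranking from a file of canonical SMILES; it is not the Datadesk procedure used here. It assumes one fragment per line, with the SMILES as the first whitespace-separated token, and the function name and returned keys are illustrative only.

```python
from collections import Counter
from typing import Iterable, Optional


def frequency_breakdown(lines: Iterable[str], top: Optional[int] = None) -> dict:
    """Count identical SMILES strings and rank them by frequency.

    Only the first whitespace-separated token of each line is used, so a
    trailing title column (if any) is ignored. Blank lines are skipped.
    """
    counts: Counter = Counter()
    for line in lines:
        parts = line.split()
        if parts:
            counts[parts[0]] += 1
    return {
        "total_cases": sum(counts.values()),
        "categories": len(counts),
        "ranked": counts.most_common(top),
    }
```

Counting strings in this way only works because the SMILES are canonical: two different spellings of the same fragment would otherwise be counted as separate categories.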

The SMILES strings in the table can be viewed as structures using chemicalize.org, a public web resource developed by ChemAxon which uses ChemAxon’s Name to Structure parsing to identify chemical structures on webpages and in other text. Structure-based predictions are available for each structure, and a search interface is provided to find substructures or similar structures.

Using the PDB

Whilst the analysis of the BindingDB structures gives us a view of the fragments present in small molecules, we don’t necessarily know much about how these fragments bind to the target protein. So for comparison I repeated the analysis with the ligands found as part of the crystal structures in the Protein Data Bank (http://www.rcsb.org/pdb/home/home.do). In this case we can map the fragments back onto the original ligand(s) and explore potential binding interactions. The PDB also contains interesting biomolecules that may provide novel fragments. All 125907 identified ligands were downloaded in sdf format. In order to process these ligands it was first necessary to clean up the file using sdwash.

The resulting file was then filtered using sdfilter to screen out ligands containing unusual elements.
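The options passed to sdfilter are not given here, so the following is not its syntax. It is a rough Python sketch of the same idea, assuming the ligands are available as SMILES strings and that “unusual” means anything outside a caller-defined set of elements. The default set below is an assumption, not the list actually used.

```python
import re
from typing import FrozenSet, Set

# Assumed default: common organic elements. Adjust to taste.
DEFAULT_ALLOWED = frozenset({"H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I"})

# Either a bracket atom such as [O-], [nH] or [Fe+2] (optional isotope, then
# the symbol), or an unbracketed atom from the SMILES organic subset.
_ATOM = re.compile(
    r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"
    r"|(Cl|Br|[BCNOPSFI]|[bcnops])"
)


def elements_in_smiles(smiles: str) -> Set[str]:
    """Return the set of element symbols appearing in a SMILES string."""
    found = set()
    for match in _ATOM.finditer(smiles):
        symbol = match.group(1) or match.group(2)
        # Aromatic atoms are lower case in SMILES (c, n, se); normalise them.
        found.add(symbol.capitalize())
    return found


def has_only_allowed_elements(
    smiles: str, allowed: FrozenSet[str] = DEFAULT_ALLOWED
) -> bool:
    """True if every element in the SMILES string is in the allowed set."""
    return elements_in_smiles(smiles) <= allowed
```

A regular expression is a crude substitute for a proper SMILES parser, but it is usually enough to flag metal-containing ligands before fragmentation.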

The fragments were then generated using sdfrag, a program that generates molecular fragments from each molecule in a concatenated sequence of SD files. sdfrag is an SVL program intended to be run in MOE/batch. This generated over 4 million fragments, many of which were duplicates, in a 2.3 GB file.

Handling an sdf file of this size can become an issue, so I converted it to canonical SMILES format using OpenBabel, reducing the file size to 32 MB.

I then opened the file in BBEdit, sorted it, and created a separate file of duplicate structures. It then becomes a relatively easy, if somewhat tedious, exercise to browse through the file looking for series of identical SMILES strings. There were a number of large “fragments” obviously derived from steroids, porphyrins and staurosporine analogues, and there were also a large number of amino acids. Perhaps not surprisingly, there were also a significant number of nucleotide analogues.
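The sort-and-browse step can also be scripted. The sketch below mirrors what was done by hand in the text editor, under the assumption that the file holds one canonical SMILES string per line; the function name is illustrative only.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def split_duplicates(
    smiles_lines: Iterable[str],
) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Sort the SMILES, then separate singletons from repeated structures.

    Returns (unique, duplicated), where duplicated holds (smiles, count)
    pairs for every string that occurs more than once.
    """
    ordered = sorted(line.strip() for line in smiles_lines if line.strip())
    unique: List[str] = []
    duplicated: List[Tuple[str, int]] = []
    # After sorting, identical strings sit next to each other, so groupby
    # sees each run of duplicates as a single group.
    for smiles, run in groupby(ordered):
        count = len(list(run))
        if count > 1:
            duplicated.append((smiles, count))
        else:
            unique.append(smiles)
    return unique, duplicated
```

Sorting the duplicated list by count then gives the recurring fragments in order of frequency, which is a quicker way to find the steroid, porphyrin and sugar families than scrolling through the sorted file.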

Another group that was very well represented was the sugars, with many examples of deoxygenated or amino sugars. These perhaps represent an interesting group of compounds: they are usually very soluble and have a number of groups capable of interacting with the target protein, arranged in a usually well-defined 3D structure.

A number of organic fragments also appeared regularly, a selection of which are shown below. All contain heteroatoms capable of interacting with the target protein, and many contain the sort of functionality that might be expected to interact with a kinase. Notably, there are also a significant number of aromatic fragments.

An alternative approach is to take common frameworks or privileged structures (e.g. biphenyl) and decorate them with small functional groups (carboxylate, amine, hydroxyl, halo).
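The decoration idea can be illustrated with a few lines of Python that attach each functional group to a marked position on a framework. This is a minimal sketch: the scaffold template, the `{R}` placeholder convention and the substituent SMILES are all assumptions chosen for illustration, and only a single attachment point (the 4-position of biphenyl) is enumerated.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical templates: {R} marks the single attachment point.
SCAFFOLDS: Dict[str, str] = {
    "biphenyl": "c1ccc(cc1)c1ccc(cc1){R}",
}

# Small functional groups written as SMILES that can be appended directly.
SUBSTITUENTS: Dict[str, str] = {
    "carboxylic acid": "C(=O)O",
    "amine": "N",
    "hydroxyl": "O",
    "fluoro": "F",
    "chloro": "Cl",
    "bromo": "Br",
}


def decorate(
    scaffolds: Dict[str, str], substituents: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (scaffold name, substituent name, SMILES) for every pairing."""
    results = []
    for (s_name, template), (r_name, group) in product(
        scaffolds.items(), substituents.items()
    ):
        results.append((s_name, r_name, template.format(R=group)))
    return results
```

The strings produced this way are valid SMILES but not canonical, so they should be canonicalised (for example with OpenBabel) before being compared with other fragment lists or used to search supplier catalogues.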

An increasing number of commercial companies now offer well-defined fragments for screening. As might be expected, there is significant overlap between the various companies’ collections; however, most also contain unique fragments.

Another approach is to look at fragments that have already been reported as hits.