
Selecting Compounds from a Virtual Screening Run

Whilst high-throughput screening (HTS) has been the starting point for many successful drug discovery programs, the cost of screening, the accessibility of a large, diverse sample collection, or the throughput of the primary assay may preclude HTS as a starting point. In those cases, identifying a smaller selection of compounds with a higher probability of being a hit may be desirable. Directed or virtual screening is a computational technique used in drug discovery research to identify potential hits for evaluation in primary assays. It involves the rapid in silico assessment of large libraries of chemical structures in order to identify those structures that are most likely to be active against a drug target. The key question is then: how many molecules do you select from your virtual screen?

The results of a virtual screening run are effectively a rank ordering of the virtual screening deck, ordered by whatever scoring function(s) have been used. The task then becomes the selection of molecules for experimental determination of activity.

I posed this question on the website and the results are shown below. Whilst this is obviously a limited snapshot, it is interesting that there is a wide variety of responses.

Some people also emailed me with further information. For companies with large internal physical screening collections, and the ability to cherry pick samples, it effectively costs the same to fill a high density plate (>1000 compounds) as it does to select a handful of compounds. On the other hand if the scientist has to purchase compounds then the logistics and cost become a significant obstacle. It would have been interesting to compare different virtual screening techniques, academic versus biotech versus large pharma etc. but I doubt I’d get as many answers from a multi-page questionnaire.

There is an interesting publication, “Predictiveness curves in virtual screening” by Charly Empereur-mot et al. (DOI), in which they compare several docking methods and use the predictiveness curve to quantify the predictive performance of virtual screening methods on a fraction of a given molecular dataset. They use the Directory of Useful Decoys (DUD) datasets for comparison and were kind enough to provide me with the results; I’ve just used the data generated using Autodock Vina.

DUD consists of a total of 2,950 active compounds against a total of 40 targets. For each active there are 36 “decoys” with similar physical properties (e.g. molecular weight, calculated LogP) but dissimilar topology.

As an aside, the DUD dataset was designed to evaluate docking algorithms, and the decoys were intentionally chosen to be structurally distinct from the actives. This was done to ensure that the decoys were truly inactive. While this makes DUD (and its successor DUD-E) an excellent benchmark for docking, it makes it a poor choice for machine learning. Pat Walters highlights this on his blog https://patwalters.github.io/Please-Stop-Fishing/

Compared to typical high-throughput screening, where the hit rate is usually <1%, DUD contains an unusually high concentration of actives (2-5%), as the table below shows. The results of the virtual screening are nonetheless very informative.

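| Target | No. of actives | No. of compounds | Prevalence |
|---|---|---|---|
| ACE | 49 | 1846 | 0.0265 |
| ACHE | 107 | 3999 | 0.0268 |
| ADA | 39 | 966 | 0.0404 |
| ALR2 | 26 | 1021 | 0.0255 |
| AMPC | 21 | 807 | 0.0260 |
| AR | 79 | 2933 | 0.0269 |
| CDK2 | 72 | 2146 | 0.0336 |
| COMT | 11 | 479 | 0.0230 |
| COX-1 | 25 | 936 | 0.0267 |
| COX-2 | 426 | 13715 | 0.0311 |
| DHFR | 410 | 8777 | 0.0467 |
| EGFR | 475 | 16471 | 0.0288 |
| ER ago | 67 | 2637 | 0.0254 |
| ER antago | 39 | 1487 | 0.0262 |
| FGFR1 | 120 | 4670 | 0.0257 |
| FXA | 146 | 5891 | 0.0248 |
| GART | 40 | 919 | 0.0435 |
| GPB | 52 | 2192 | 0.0237 |
| GR | 78 | 3025 | 0.0258 |
| HIVPR | 62 | 2100 | 0.0295 |
| HIVRT | 43 | 1562 | 0.0275 |
| HMGR | 35 | 1515 | 0.0231 |
| HSP90 | 37 | 1016 | 0.0364 |
| INHA | 86 | 3352 | 0.0257 |
| MR | 15 | 651 | 0.0230 |
| NA | 49 | 1923 | 0.0255 |
| P38 | 454 | 9595 | 0.0473 |
| PARP | 35 | 1386 | 0.0253 |
| PDE5 | 88 | 2066 | 0.0426 |
| PNP | 50 | 1086 | 0.0460 |
| PPAR | 85 | 3212 | 0.0265 |
| PR | 27 | 1068 | 0.0253 |
| RXR | 20 | 770 | 0.0260 |
| SAHH | 33 | 1379 | 0.0239 |
| SRC | 159 | 6478 | 0.0245 |
| THR | 72 | 2528 | 0.0285 |
| TK | 22 | 913 | 0.0241 |
| TRP | 49 | 1713 | 0.0286 |
| VEGFR2 | 88 | 2994 | 0.0294 |
| Minimum | 11 | 479 | 0.0230 |
| Maximum | 475 | 16471 | 0.0473 |
| Mean | 97 | 3134 | 0.0294 |
| Median | 50 | 1923 | 0.0265 |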

Table 1 shows a summary of the partial metrics at 2% and 5% of the ordered dataset for virtual screens performed using Autodock Vina: partial total gain (pTG), partial area under the curve (pAUC) and enrichment factor (EF).

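Table 1. Autodock Vina results for the top 2% and the top 5% of each ranked dataset.

| Target | pTG 2% | pAUC 2% | EF 2% | Actives 2% | Cpds 2% | pTG 5% | pAUC 5% | EF 5% | Actives 5% | Cpds 5% |
|---|---|---|---|---|---|---|---|---|---|---|
| ACE | 0.020 | 0.048 | 3.05 | 3 | 37 | 0.019 | 0.075 | 2.84 | 7 | 93 |
| ACHE | 0.024 | 0.038 | 3.74 | 8 | 80 | 0.019 | 0.107 | 4.11 | 22 | 200 |
| ADA | 0.020 | 0.000 | 0.00 | 0 | 20 | 0.018 | 0.000 | 0.00 | 0 | 49 |
| ALR2 | 0.098 | 0.028 | 3.74 | 2 | 21 | 0.071 | 0.154 | 6.80 | 9 | 52 |
| AMPC | 0.021 | 0.013 | 2.26 | 1 | 17 | 0.019 | 0.034 | 0.94 | 1 | 41 |
| AR | 0.161 | 0.157 | 11.96 | 19 | 59 | 0.108 | 0.268 | 7.83 | 31 | 147 |
| CDK2 | 0.087 | 0.117 | 9.70 | 14 | 43 | 0.063 | 0.190 | 5.24 | 19 | 108 |
| COMT | 0.000 | 0.091 | 4.35 | 1 | 10 | 0.000 | 0.182 | 5.44 | 3 | 24 |
| COX-1 | 0.154 | 0.113 | 11.82 | 6 | 19 | 0.102 | 0.250 | 7.17 | 9 | 47 |
| COX-2 | 0.322 | 0.234 | 18.03 | 154 | 275 | 0.193 | 0.397 | 10.14 | 216 | 686 |
| DHFR | 0.215 | 0.070 | 5.47 | 45 | 176 | 0.150 | 0.118 | 3.56 | 73 | 439 |
| EGFR | 0.048 | 0.038 | 3.26 | 31 | 330 | 0.036 | 0.071 | 2.19 | 52 | 824 |
| ER ago | 0.314 | 0.192 | 17.08 | 23 | 53 | 0.186 | 0.383 | 9.84 | 33 | 132 |
| ER antago | 0.059 | 0.110 | 8.90 | 7 | 30 | 0.040 | 0.173 | 5.08 | 10 | 75 |
| FGFR1 | 0.012 | 0.003 | 0.83 | 2 | 94 | 0.010 | 0.016 | 0.67 | 4 | 234 |
| FXA | 0.029 | 0.011 | 1.37 | 4 | 118 | 0.023 | 0.036 | 1.50 | 11 | 295 |
| GART | 0.108 | 0.000 | 0.00 | 0 | 19 | 0.087 | 0.005 | 1.00 | 2 | 46 |
| GPB | 0.113 | 0.026 | 2.87 | 3 | 44 | 0.081 | 0.101 | 4.22 | 11 | 110 |
| GR | 0.023 | 0.099 | 5.72 | 9 | 61 | 0.019 | 0.111 | 2.55 | 10 | 152 |
| HIVPR | 0.147 | 0.038 | 4.73 | 6 | 43 | 0.099 | 0.091 | 3.51 | 11 | 106 |
| HIVRT | 0.047 | 0.121 | 7.95 | 7 | 32 | 0.038 | 0.161 | 4.14 | 9 | 79 |
| HMGR | 0.015 | 0.035 | 2.79 | 2 | 31 | 0.012 | 0.049 | 1.14 | 2 | 76 |
| HSP90 | 0.039 | 0.000 | 0.00 | 0 | 21 | 0.032 | 0.004 | 0.54 | 1 | 51 |
| INHA | 0.079 | 0.191 | 12.04 | 21 | 68 | 0.051 | 0.257 | 6.50 | 28 | 168 |
| MR | 0.346 | 0.229 | 18.60 | 6 | 14 | 0.215 | 0.517 | 14.47 | 11 | 33 |
| NA | 0.019 | 0.000 | 0.00 | 0 | 39 | 0.018 | 0.000 | 0.00 | 0 | 97 |
| P38 | 0.031 | 0.012 | 1.54 | 14 | 192 | 0.026 | 0.049 | 2.29 | 52 | 480 |
| PARP | 0.114 | 0.071 | 4.24 | 3 | 28 | 0.080 | 0.091 | 3.39 | 6 | 70 |
| PDE5 | 0.047 | 0.009 | 1.68 | 3 | 42 | 0.037 | 0.043 | 1.81 | 8 | 104 |
| PNP | 0.011 | 0.000 | 0.00 | 0 | 22 | 0.009 | 0.000 | 0.00 | 0 | 55 |
| PPAR | 0.304 | 0.219 | 16.28 | 28 | 65 | 0.183 | 0.372 | 10.33 | 44 | 161 |
| PR | 0.012 | 0.009 | 1.80 | 1 | 22 | 0.010 | 0.027 | 1.47 | 2 | 54 |
| RXR | 0.653 | 0.330 | 26.47 | 11 | 16 | 0.362 | 0.620 | 14.81 | 15 | 39 |
| SAHH | 0.126 | 0.069 | 8.95 | 6 | 28 | 0.086 | 0.174 | 4.84 | 8 | 69 |
| SRC | 0.099 | 0.053 | 5.64 | 18 | 130 | 0.070 | 0.135 | 4.78 | 38 | 324 |
| THR | 0.129 | 0.097 | 7.57 | 11 | 51 | 0.091 | 0.149 | 3.87 | 14 | 127 |
| TK | 0.019 | 0.000 | 0.00 | 0 | 19 | 0.015 | 0.000 | 0.00 | 0 | 46 |
| TRP | 0.037 | 0.037 | 3.00 | 3 | 35 | 0.029 | 0.069 | 2.03 | 5 | 86 |
| VEGFR2 | 0.007 | 0.062 | 4.54 | 8 | 60 | 0.006 | 0.101 | 2.72 | 12 | 150 |
| Minimum | 0.000 | 0.000 | 0.00 | 0 | 10 | 0.000 | 0.000 | 0.00 | 0 | 24 |
| Maximum | 0.653 | 0.330 | 26.47 | 154 | 330 | 0.362 | 0.620 | 14.81 | 216 | 824 |
| Mean | 0.105 | 0.076 | 6.20 | 12 | 63 | 0.070 | 0.143 | 4.20 | 20 | 157 |
| Median | 0.048 | 0.048 | 4.24 | 6 | 39 | 0.038 | 0.101 | 3.51 | 10 | 97 |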

Perhaps the first thing to note is that the enrichment factor (after selecting the top 2% of the dataset) varies across the targets from 0 to a maximum of 26, with a mean of 6. Enrichment factors were computed as follows:

$$EF_{x\%} = \frac{Hits_{x\%} / N_{x\%}}{Hits_{t} / N_{t}}$$

where $Hits_{x\%}$ is the number of active compounds in the top x% of the ranked dataset, $Hits_{t}$ is the total number of active compounds in the dataset, $N_{x\%}$ is the number of compounds in the top x% of the dataset and $N_{t}$ is the total number of compounds in the dataset. Unfortunately it is not possible to predict in advance how much enrichment might be achieved for a given target.
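As a quick check on the numbers in Table 1, here is a minimal Python sketch of the enrichment factor, reading the definition as the hit rate in the top x% divided by the hit rate in the whole dataset. The function name and argument names are my own; the inputs are taken from the ACE and RXR rows of the tables above.

```python
def enrichment_factor(hits_top, n_top, hits_total, n_total):
    """Hit rate in the selected top slice divided by the hit rate overall."""
    if n_top <= 0 or n_total <= 0 or hits_total <= 0:
        raise ValueError("counts must be positive to define an enrichment factor")
    return (hits_top / n_top) / (hits_total / n_total)


# ACE, top 2%: 3 actives among 37 compounds; 49 actives among 1846 overall
ef_ace = enrichment_factor(3, 37, 49, 1846)

# RXR, top 2%: 11 actives among 16 compounds; 20 actives among 770 overall
ef_rxr = enrichment_factor(11, 16, 20, 770)

print(f"ACE EF 2%: {ef_ace:.2f}")
print(f"RXR EF 2%: {ef_rxr:.2f}")
```

Both values agree with the EF 2% column to two decimal places, which supports this reading of the definition.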

Another way to look at the results is to sort the dataset by score and then plot the number of ligands selected versus the number of actives identified. For DHFR, active ligands were identified among the highest scoring structures, but for GART the top 40 or so scoring ligands were inactives. The diagonal line gives an idea of the prevalence of hits with random picking.
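The data behind such a plot can be sketched in a few lines of Python. This is an illustration only, with made-up scores and activity labels rather than the DUD results; it assumes that a lower (more negative) score is better, as with docking energies, and the function names are hypothetical.

```python
def enrichment_curve(scores, is_active):
    """Cumulative number of actives found after each compound, best score first.

    Assumes lower scores are better; reverse the sort for scoring
    functions where higher is better.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    found = 0
    curve = []
    for i in order:
        if is_active[i]:
            found += 1
        curve.append(found)
    return curve


def random_expectation(n_compounds, n_actives):
    """Expected actives after k random picks: the diagonal line on the plot."""
    return [n_actives * k / n_compounds for k in range(1, n_compounds + 1)]


# Toy data (not from DUD): eight compounds, three of them active
scores = [-9.1, -8.7, -8.5, -7.9, -7.2, -6.8, -6.1, -5.5]
actives = [True, False, True, False, False, True, False, False]

curve = enrichment_curve(scores, actives)
diagonal = random_expectation(len(scores), sum(actives))
for k, (found, expected) in enumerate(zip(curve, diagonal), start=1):
    print(f"top {k}: {found} actives found, {expected:.2f} expected at random")
```

A curve that sits above the diagonal early on (as for DHFR) indicates useful early enrichment; one that stays flat for the first 40 or so compounds (as for GART) does not.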

The objective of HTS analysis is not to identify every active compound in the screening set, but rather to identify sufficient active series to support the chemistry effort available. Similarly, the aim of virtual screening is not to identify every hit but to identify enough active series for the available chemistry effort. If we assume the percentage of true actives in the virtual library is 0.5%, then the enrichment due to virtual screening might take it up to 3% (a six-fold enrichment, in line with the mean EF at 2% in Table 1). So if you select 100 compounds for experimental determination you might expect 3 actives; if you want multiple series (in case a series is lost due to off-target activity), you would probably want to evaluate 1000 compounds, which at the same hit rate would be expected to yield around 30 actives.
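The arithmetic here can be written out as a small Python sketch. The 0.5% base rate and six-fold enrichment are the assumptions stated in the text; treating each pick as an independent trial with the same hit rate (a binomial model) is my own simplification, and the function names are hypothetical.

```python
from math import comb


def expected_actives(n_selected, base_rate, enrichment):
    """Expected number of actives among n_selected compounds."""
    hit_rate = min(1.0, base_rate * enrichment)  # a rate cannot exceed 100%
    return n_selected * hit_rate


def prob_at_least(k, n_selected, hit_rate):
    """Probability of at least k actives, assuming independent picks."""
    p_fewer = sum(
        comb(n_selected, i) * hit_rate**i * (1 - hit_rate) ** (n_selected - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


base_rate, enrichment = 0.005, 6  # 0.5% true actives, six-fold enrichment
hit_rate = base_rate * enrichment
for n in (100, 1000):
    print(
        f"{n} compounds: {expected_actives(n, base_rate, enrichment):.1f} expected actives, "
        f"P(at least 3) = {prob_at_least(3, n, hit_rate):.3f}"
    )
```

The expected value is only an average: with 100 compounds there is a real chance of finding fewer than three actives, which is part of the argument for testing a larger selection.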

It is probably not wise to simply select the first 1000 compounds, since it is likely that some chemotypes will be repeated; it is better to aim to select diverse chemotypes.
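One simple way to do this, sketched below in Python, is to walk down the ranked list and cap the number of compounds taken from any one chemotype. This is an illustration rather than a recommended protocol: the chemotype labels are assumed to come from some prior grouping step (for example scaffold or fingerprint clustering) that is not covered here, and the function name, cap and compound identifiers are all hypothetical.

```python
from collections import defaultdict


def pick_diverse(ranked, max_per_chemotype, n_wanted):
    """Select compounds in rank order, limiting how many share a chemotype.

    ranked: list of (compound_id, chemotype_label) tuples, best score first.
    """
    taken = defaultdict(int)
    picked = []
    for compound_id, chemotype in ranked:
        if taken[chemotype] >= max_per_chemotype:
            continue  # this chemotype is already well represented
        picked.append(compound_id)
        taken[chemotype] += 1
        if len(picked) == n_wanted:
            break
    return picked


# Toy ranked list: chemotype "A" dominates the top of the ranking
ranked = [
    ("c1", "A"), ("c2", "A"), ("c3", "A"), ("c4", "B"),
    ("c5", "A"), ("c6", "C"), ("c7", "B"),
]
print(pick_diverse(ranked, max_per_chemotype=2, n_wanted=4))
```

With a cap of two per chemotype, the third and fourth "A" compounds are skipped in favour of lower-ranked compounds from chemotypes "B" and "C".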

This might seem like a lot of compounds, but a back-of-the-envelope calculation puts the cost of a virtual screen at around $10,000 (taking into account hardware costs, licenses, maintenance and support, and salaries). In addition, you are probably going to commit substantial biology and chemistry resources to any hits, so why would you want to penny pinch on the purchase of compounds?