Cross-referencing the Project Moonshot compounds
The COVID Moonshot project is generating a significant amount of data: biochemical data distributed by PostEra, and crystallographic data generated and distributed by the team at Diamond.
The COVID Moonshot is an ambitious crowdsourced initiative to accelerate the development of a COVID antiviral. We work in the open with no intellectual property constraints. This way, any scientist can view submitted drug designs and experimental data to inspire new design ideas. We use our cutting-edge machine learning tools and Folding@home's crowdsourced supercomputer to determine which drug designs to send to our partners to make and test in the lab. With each drug design tested, we get closer to our goal.
It is sometimes difficult to cross-reference compounds between multiple sources, so I've downloaded the compounds with their associated data, calculated InChIKeys, and then used the InChIKey to link compounds from different sources within Vortex. This means you have the biochemical data together with the PDB code (if available) or the Fragalysis code for the crystal structure. I've also annotated the compounds with identifiers from multiple databases (ChEMBL, PubChem etc.), calculated physicochemical properties (LogP/D, TPSA, HBD/A etc.), and then exported everything in SDF format. I've also clustered the structures to aid navigation.
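For anyone who wants to try the same kind of linking outside Vortex, here is a minimal Python sketch of the idea. It is not the workflow I used. It assumes the InChIKeys have already been calculated (for example with RDKit's Chem.MolToInchiKey), and the field names, compound IDs, structure codes and IC50 values are all made-up placeholders, not Moonshot data.

```python
from collections import defaultdict

# Placeholder records: IDs, codes and activity values are invented for illustration.
# The InChIKeys are those of aspirin and caffeine, used only as stand-ins.
biochemical = [
    {"compound_id": "CMPD-001", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "ic50_uM": 1.5},
    {"compound_id": "CMPD-002", "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N", "ic50_uM": 20.0},
]
structures = [
    {"structure_id": "XTAL-0001", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
    {"structure_id": "XTAL-0002", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
]


def link_by_inchikey(biochemical, structures):
    """Attach structure IDs to each biochemical record that shares its InChIKey."""
    # Index the crystallographic records by InChIKey; one compound may have
    # several crystal structures, so each key maps to a list.
    ids_by_key = defaultdict(list)
    for rec in structures:
        ids_by_key[rec["inchikey"]].append(rec["structure_id"])

    linked = []
    for rec in biochemical:
        merged = dict(rec)  # copy so the input records are left untouched
        # Compounds with no crystal structure get an empty list.
        merged["structure_ids"] = list(ids_by_key.get(rec["inchikey"], []))
        linked.append(merged)
    return linked


for row in link_by_inchikey(biochemical, structures):
    print(row["compound_id"], row["ic50_uM"], row["structure_ids"])
```

Keeping every biochemical record and leaving the structure list empty when there is no match mirrors the "if available" behaviour described above. Bear in mind that the standard InChIKey normalises some tautomers and ignores certain structural details, so two records sharing a key are very likely, but not guaranteed, to be the same registered compound.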
You can download the zipped SDF file here.
Update: I was asked if I could provide this file in SMILES format, so here it is.
I plan to try and have a look at ways to visualise the data when I can find some free time.