protein-structure

Comparison of protein structure prediction algorithms

01 12 24

The majority of drug targets are proteins and knowledge of the 3D structure of the protein can be very helpful for structure based design. Whilst the PDB contains 227,933 structures there are still a number of structures that lack structural information. In 2018 Deepmind released AlphaFold an artificial Intelligence program design to predict protein 3D structure from the amino-acid sequence DOI. Since then there have a series of updates that have included the ability to handle small molecules, co-factors, nucleic acids, protein complexes etc. AlphaFold has been used in collaboration with the EBI to create AlphaFold DB which provides open access to over 200 million protein structures, covering the human proteome and the proteomes of 47 other key organisms important in research and global health. A recent addition is Foldseek a protein structural search program that allows users to search the AlphaFold Database.

pfCAalphafold

David Baker, Demis Hassabis and John Jumper were awarded the 2024 Nobel Prize for Chemistry. One half of the prize has been awarded to David Baker “for computational protein design” and the other half jointly to Demis Hassabis and John M. Jumper “for protein structure prediction.”

Whilst AphaFold gets much of the publicity, it has served to spawn a number of related programs, comparison of the different options is difficult especially when looking at the various licensing options. Fortunately, Brian Naughton has posted a very useful summary. http://blog.booleanbiotech.com/alphafold3-boltz-chai1.html.

AlphaFold Protein Structure Database in 2024

03 11 23

A recent publication describes the continued evolution of the AlphaFold Protein Structure Database created by EMBL-EBI and DeepMind. From an initial 300K structures it now contains 214 million predicted protein structures.

You can read the paper here DOI.

The AlphaFold Database Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) has significantly impacted structural biology by amassing over 214 million predicted protein structures, expanding from the initial 300k structures released in 2021. Enabled by the groundbreaking AlphaFold2 artificial intelligence (AI) system, the predictions archived in AlphaFold DB have been integrated into primary data resources such as PDB, UniProt, Ensembl, InterPro and MobiDB. Our manuscript details subsequent enhancements in data archiving, covering successive releases encompassing model organisms, global health proteomes, Swiss-Prot integration, and a host of curated protein datasets. We detail the data access mechanisms of AlphaFold DB, from direct file access via FTP to advanced queries using Google Cloud Public Datasets and the programmatic access endpoints of the database. We also discuss the improvements and services added since its initial release, including enhancements to the Predicted Aligned Error viewer, customisation options for the 3D viewer, and improvements in the search engine of AlphaFold DB.

AlphaFold predicts structure of almost every catalogued protein known to science

29 07 22

A little over a year ago I highlighted the AlphaFold Protein Structure Database in which AlphaFold DB provided open access to protein structure predictions for the human proteome and 20 other key organisms to accelerate scientific research. Well things have moved on.

DeepMind and EMBL’s European Bioinformatics Institute (EMBL-EBI) have made AI-powered predictions of the three-dimensional structures of nearly all catalogued proteins known to science freely and openly available to the scientific community, via the AlphaFold Protein Structure Database.

The database is being expanded by approximately 200 times, from nearly 1 million protein structures to over 200 million, covering almost every organism on Earth that has had its genome sequenced. The expansion of the database includes predicted structures for a wide range of species, including plants, bacteria, animals, and other organisms.

The full dataset of all predictions is available at no cost and under a CC-BY-4.0 licence from Google Cloud Public Datasets. We've grouped this by single-species for ease of downloading subsets or all of the data. We suggest that you only download the full dataset if you need to process all the data with local computing resources (the size of the dataset is 23 TiB, ~1M tar files).

Downloads can be found here https://alphafold.ebi.ac.uk/download#full-dataset-section.

It is worth noting that AlphaFold2 is not the only protein structure prediction tool available, there is also RoseTTAFold, OpenFold, and FastFold.

CASP15 details

04 05 22

The details of the latest Critical Assessment of Structure Prediction (CASP) experiment to determine and advance the state of the art in modeling biomolecular structures have been published https://predictioncenter.org/casp15/index.cgi.

Modeling categories

The core of CASP remains the same: blind testing of methods with independent assessment against experiment to establish the state-of-art in modeling proteins and protein complexes. CASP15 will include following categories.

Single Protein and Domain Modeling As in previous CASPs, the accuracy of single proteins and where appropriate single protein domains will be assessed, using the established metrics. Two changes will be the elimination of the distinction between template-based and template-free modeling, and an emphasis on the fine-grained accuracy of models, such as local main chain motifs and side chains. Because of the high accuracy of the new modeling methods, we expect assessment against high resolution experimental structures will be most informative.
Assembly As in recent CASPs, the ability of current methods to correctly model domain-domain, subunit-subunit, and protein-protein interactions will be assessed. We will again work in close collaboration with our CAPRI partners. Because of the promising deep learning results reported so far, substantial progress is expected.
Accuracy Estimation Members of the community will be invited to submit accuracy estimates for multimeric complexes and inter-subunit interfaces. There will no longer be a category for estimating the accuracy of single protein models, since it has become clear these cannot compete with modeling method specific estimates. Instead, there will be increased emphasis on assessment of self-reported accuracy estimates at the atomic level. Note the units will now be pLDDT, not Angstroms.
RNA structures and complexes There will be a pilot experiment to assess the accuracy of modeling for RNA models and protein-RNA complexes. The assessment will be done in collaboration with the RNA-Puzzles and Marta Szachniuk's group in Poznan.
Protein-ligand complexes Subject to the availability of adequate resources, there will also be a pilot experiment in this area. Deep-learning is already having an impact here, and there is high interest because of the relevance to drug design.
Data Assisted As in recent CASPs, there will be assessment of the extent to which the accuracy of models can be increased by the provision of sparse data, particularly that provided by SAXS and mass spectroscopy/chemical crosslinking. Only targets where these low-resolution data are likely to be useful will be considered, that is, large single proteins and complexes. As previously, we will work with collaborators to obtain the necessary experimental data. Targets will initially be released without the experimental data, followed by a second round of prediction including those data.
Protein conformational ensembles Following the success of deep-learning methods for single structures, it is increasingly important to assess methods for predicting structure ensembles. This is a huge area, ranging from the many conformations of disordered regions to the small number of conformations that may be involved in allosteric transitions and enzyme excited states to local protein dynamics. While it is clear that deep learning and other methods have the potential to generate ensembles in some circumstances, the difficulty is in finding cases where there are sufficiently accurate and extensive experimental data to allow rigorous assessment. One promising avenue is modeling sets of conformations in regions of cryo-EM structures where there is evidence of local conformational heterogeneity. If suitable cases arise, we will present these as a special type of sub-target. First requesting conformational ensembles that will be evaluated against the election density map and then in a possible second stage providing the map for data assisted ensemble prediction. A second possibility is for cases where detailed NMR data have already established the structure of two or more conformations. We have a good lead for a few targets of this type. In addition to this, we are considering a non-blind experiment (a departure from normal CASP practice), where we will first ask those interested to reproduce the known conformations. We will also ask participants to identify any additional conformations that appear to be present. It may then be possible to test these against existing or new experimental data.

Details of the targets will be made available over the next week https://predictioncenter.org/casp15/targetlist.cgi.

AlphaFold Protein Structure Database

25 07 21

The AlphaFold Protein Structure Database Developed by DeepMind and EMBL-EBI is now available online.

AlphaFold DB provides open access to protein structure predictions for the human proteome and 20 other key organisms to accelerate scientific research.

AlphaFold DB currently provides predicted structures for the organisms listed below and includes human, laboratory species, and key pathogens. All the predictions for all the species can be downloaded from the EBI FTP site ftp://ftp.ebi.ac.uk/pub/databases/alphafold.

Species	Common Name	Reference Proteome	Predicted Structures	Download
Arabidopsis thaliana	Arabidopsis	UP000006548	27,434	Download (3642 MB)
Caenorhabditis elegans	Nematode worm	UP000001940	19,694	Download (2601 MB)
Candida albicans	C. albicans	UP000000559	5,974	Download (965 MB)
Danio rerio	Zebrafish	UP000000437	24,664	Download (4141 MB)
Dictyostelium discoideum	Dictyostelium	UP000002195	12,622	Download (2150 MB)
Drosophila melanogaster	Fruit fly	UP000000803	13,458	Download (2174 MB)
Escherichia coli	E. coli	UP000000625	4,363	Download (448 MB)
Glycine max	Soybean	UP000008827	55,799	Download (7142 MB)
Homo sapiens	Human	UP000005640	23,391	Download (4784 MB)
Leishmania infantum	L. infantum	UP000008153	7,924	Download (1481 MB)
Methanocaldococcus jannaschii	M. jannaschii	UP000000805	1,773	Download (171 MB)
Mus musculus	Mouse	UP000000589	21,615	Download (3547 MB)
Mycobacterium tuberculosis	M. tuberculosis	UP000001584	3,988	Download (421 MB)
Oryza sativa	Asian rice	UP000059680	43,649	Download (4416 MB)
Plasmodium falciparum	P. falciparum	UP000001450	5,187	Download (1132 MB)
Rattus norvegicus	Rat	UP000002494	21,272	Download (3404 MB)
Saccharomyces cerevisiae	Budding yeast	UP000002311	6,040	Download (960 MB)
Schizosaccharomyces pombe	Fission yeast	UP000002485	5,128	Download (776 MB)
Staphylococcus aureus	S. aureus	UP000008816	2,888	Download (268 MB)
Trypanosoma cruzi	T. cruzi	UP000002296	19,036	Download (2905 MB)
Zea mays	Maize	UP000007305	39,299	Download (5014 MB)

The search bar at the top of the query page accepts queries based on protein name, gene name, UniProt identifier, or organism name. At present you can't search using a sequence and look for similar proteins. You would first need to do a BLAST search and use the results from that as queries.

Here I searched for Plasmodium falciparum carbonic anhydrase (Q8IHW5) a potential Malaria target. As you can see there is no crystal structure in the PDB. Whilst the active site is predicted with high confidence there are clearly regions for which there is very low confidence.

pfCAalphafold

You can then download the structure in PDB or mmCIF format.

I made a homology model (in purple below) of this protein a while back and it has little sequence similarity with any proteins in the PDB. Despite not including a Zinc the Alphafold Predicted Structure includes histidines in positions to potentially coordinate to the Zinc. If it is possible to include the Zinc in the structure prediction I'd be interested in finding out.

PfCA

Overall I'd say this is a very useful starting point.