Today, DeepMind announced that it has largely solved one of biology's most significant outstanding problems: how the string of amino acids in a protein folds into the three-dimensional shape that enables its complex functions. It's a computational challenge that has resisted the efforts of many very smart biologists for decades, despite the application of supercomputer-level hardware to the calculations. DeepMind instead trained its system using 128 specialized processors over several weeks; it now returns potential structures within a few days.
The system's limitations aren't yet clear – DeepMind says a peer-reviewed paper is in the works, and so far it has only provided a blog post and some press materials. But the system clearly performs far better than anything that came before it, having more than doubled the performance of the best system of just four years ago. Even if it isn't useful in every circumstance, the advance likely means that the structures of many proteins can now be predicted from nothing more than the DNA sequence of the gene that encodes them, which would be a major change for biology.
Between the folds
To make proteins, our cells (and those of every other organism) chemically link amino acids into a chain. This works because every amino acid shares a common backbone that can be chemically bonded to form a polymer. But each of the 20 amino acids used by life has a distinct set of atoms attached to that backbone. These can be charged or neutral, acidic or basic, and so on, and these properties determine how each amino acid interacts with its neighbors and with the environment.
Those interactions among amino acids determine the three-dimensional structure the chain adopts after it is made. Hydrophobic amino acids bury themselves in the interior of the structure to avoid the aqueous environment. Positively and negatively charged amino acids attract each other. Hydrogen bonds drive the formation of regular helices or parallel sheets. Collectively, these forces cause what would otherwise be a disordered chain to fold into an ordered structure. And that ordered structure, in turn, determines the protein's behavior, allowing it to act as a catalyst, bind to DNA, or drive muscle contraction.
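The balance of forces described above can be illustrated with a toy scoring function. This is a deliberately crude sketch, not a real force field: the residue groupings and weights are assumptions chosen only to show how pairwise charge and hydrophobic terms push a chain toward a low "free energy".

```python
# Toy illustration (not a real force field): score a chain by summing
# simple pairwise interaction terms of the kinds described above.
CHARGE = {"D": -1, "E": -1, "K": +1, "R": +1}   # acidic / basic residues
HYDROPHOBIC = set("AVLIFMW")                    # residues that avoid water

def toy_interaction_score(seq, contacts):
    """seq: one-letter amino-acid codes; contacts: pairs (i, j) of residue
    indices assumed to be close in 3D. Lower score = more favorable,
    mimicking free-energy minimization."""
    score = 0.0
    for i, j in contacts:
        a, b = seq[i], seq[j]
        # Opposite charges attract (lower score); like charges repel.
        score += CHARGE.get(a, 0) * CHARGE.get(b, 0)
        # Burying two hydrophobic residues together is favorable.
        if a in HYDROPHOBIC and b in HYDROPHOBIC:
            score -= 0.5
    return score

# A lysine-glutamate contact lowers the score; two lysines raise it.
print(toy_interaction_score("KAEL", [(0, 2)]))  # → -1.0
print(toy_interaction_score("KAKL", [(0, 2)]))  # → 1.0
```

Real energy functions involve many more terms (bond angles, solvation, van der Waals forces), but the principle is the same: the folded structure is the one that best satisfies all of these competing preferences at once.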
Determining the order of amino acids in a protein's chain is relatively easy: it is set by the order of the DNA bases in the gene that encodes the protein. And because we've gotten very good at sequencing whole genomes, we now have a glut of gene sequences and, therefore, a huge surplus of protein sequences available to us. For many of them, though, we have no idea what the folded protein looks like, which makes it difficult to determine how they function.
Since a protein's backbone is very flexible, nearly any two amino acids in a protein could potentially interact with each other. So figuring out which of them actually do interact in the folded protein, and how that interaction minimizes the free energy of the final configuration, becomes a computationally intractable problem once the number of amino acids gets too large. Essentially, when any amino acid could occupy almost any point in three-dimensional space, it becomes extremely hard to work out where each one should go.
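The scale of that intractability is easy to sketch. The numbers below are an assumed back-of-the-envelope model (in the spirit of Levinthal's paradox), not a measured figure: grant each residue just three backbone conformations and count the combinations.

```python
# Back-of-the-envelope illustration of the combinatorial explosion:
# assume each residue can adopt only 3 backbone conformations.
def conformation_count(n_residues, states_per_residue=3):
    """Number of distinct chain conformations under the toy assumption."""
    return states_per_residue ** n_residues

# Even a small 100-residue protein has ~5e47 possible conformations.
n = conformation_count(100)
print(f"{n:.2e} conformations")

# Sampling a billion conformations per second would still take
# astronomically longer than the age of the universe.
years = n / 1e9 / 3.15e7
print(f"{years:.2e} years of brute-force search")
```

This is why brute-force enumeration was never an option, and why the field turned to heuristics, distributed computing, and now machine learning.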
Despite the difficulties, some progress has been made, including through distributed computing and the gamification of structure prediction. But an ongoing biennial event, called the Critical Assessment of protein Structure Prediction (CASP), has seen rather fitful progress throughout its existence. And in the absence of a successful algorithm, researchers are left with the laborious task of purifying the protein and then using X-ray diffraction or cryo-electron microscopy to determine the structure of the purified form, an effort that can take years.
DeepMind enters the fray
DeepMind is an artificial intelligence company that Google acquired in 2014. Since then, it has made a number of splashes, developing systems that have successfully beaten humans at Go, chess, and even StarCraft. In several of its notable successes, the system was trained simply by being given the rules of the game before being set loose to play itself.
That approach is incredibly powerful, but it wasn't clear it would work for protein folding. For one thing, there's no obvious external measure of "victory" – obtaining a structure with a very low free energy doesn't guarantee there isn't one with an energy slightly lower still. There also isn't much in the way of rules. Yes, amino acids with opposite charges will lower the free energy if they sit next to each other. But that won't happen if it comes at the cost of dozens of hydrogen bonds and hydrophobic amino acids left sticking out into the water.
So how do you adapt AI to work under these conditions? For their new algorithm, called AlphaFold, the DeepMind team treated the protein as a spatial network graph, with each amino acid a node and the connections between them mediated by their proximity in the folded protein. The AI then learned the task of estimating the configuration and strength of those connections by being trained on the previously determined structures of more than 170,000 proteins drawn from a publicly available database.
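The spatial-graph idea itself is simple to sketch. The code below is an assumed, minimal illustration – not DeepMind's representation – in which residues become nodes and an edge links any pair whose coordinates fall within a contact-distance cutoff (the 8 Å value is a common convention in contact-map work, assumed here).

```python
# Minimal sketch of "protein as spatial graph": amino acids are nodes,
# and an edge links residues that are close together in the folded shape.
from itertools import combinations
import math

def build_contact_graph(residues, coords, cutoff=8.0):
    """residues: one-letter codes; coords: one (x, y, z) point per residue
    (e.g. its C-alpha atom). Returns adjacency {index: set of neighbors}."""
    graph = {i: set() for i in range(len(residues))}
    for i, j in combinations(range(len(residues)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            graph[i].add(j)
            graph[j].add(i)
    return graph

# Three residues spaced 5 Å apart on a line: neighbors connect,
# but the two ends (10 Å apart) do not.
g = build_contact_graph("KAE", [(0, 0, 0), (5, 0, 0), (10, 0, 0)])
print(g)  # → {0: {1}, 1: {0, 2}, 2: {1}}
```

Predicting the structure then amounts to predicting this graph – which pairs are in contact and how strongly – rather than placing every atom in space directly.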
When given a new protein, AlphaFold searches for any proteins with a related sequence and aligns the related portions of the sequences. It also searches for proteins with known structures that share regions of similarity. As a rule, these approaches are great at optimizing local features of the structure but less capable of predicting a protein's overall structure – stitching a bunch of highly optimized pieces together doesn't necessarily produce an optimal whole. And this is where the deep-learning portion of the algorithm came in, ensuring that the overall structure was coherent.
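To give a feel for the "find proteins with a related sequence" step, here is a deliberately crude toy: ranking database entries by shared length-3 subsequences. Real pipelines use far more sophisticated alignment tools; the database names and sequences below are invented for illustration.

```python
# Toy sequence-similarity search: rank database proteins by how many
# length-k subsequences (k-mers) they share with a query sequence.
def kmers(seq, k=3):
    """Set of all overlapping length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def rank_by_similarity(query, database):
    """database: {name: sequence}. Returns (shared_kmers, name) pairs,
    best match first."""
    q = kmers(query)
    scored = [(len(q & kmers(s)), name) for name, s in database.items()]
    return sorted(scored, reverse=True)

# Hypothetical database: protA differs from the query by one residue,
# protB by several, protC is unrelated.
db = {"protA": "MKVLAAGIT", "protB": "MKVLQQGIT", "protC": "GGGGGGGGG"}
print(rank_by_similarity("MKVLAAGIA", db))
```

The known structures of the best hits can then seed local fragments of the prediction – the "highly optimized pieces" that something else must stitch into a coherent whole.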
Obvious success, but with limitations
For this year’s CASP, AlphaFold and the other entrants’ algorithms were set loose on a series of proteins whose structures either had not yet been solved (and were being worked out as the contest went along) or had been solved but not yet published. The algorithms’ creators therefore had no way to prime their systems with the real-world information, and the algorithms’ output could be compared against the best real-world data as part of the challenge.
AlphaFold did quite well – in fact, far better than any other entry. For about two-thirds of the proteins whose structure it predicted, the prediction was within the experimental error you'd get if you tried to replicate the structural studies in a lab. Overall, on an accuracy scale that runs from zero to 100, it averaged 92 – again, within the range you'd see if you tried to obtain the structure twice under two different conditions.
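CASP's zero-to-100 accuracy scale is based on the GDT score; the sketch below is an assumed simplification of that idea (real GDT includes an optimal superposition step, omitted here): count the fraction of residues whose predicted positions land within each of several distance cutoffs of the experimental structure, then average.

```python
# Simplified GDT_TS-style accuracy score: fraction of residues predicted
# within several distance thresholds of the true structure, averaged.
import math

def gdt_ts(predicted, experimental, thresholds=(1.0, 2.0, 4.0, 8.0)):
    """predicted/experimental: matched lists of (x, y, z) coordinates,
    assumed already superimposed. Returns a score from 0 to 100."""
    n = len(predicted)
    fractions = []
    for t in thresholds:
        within = sum(1 for p, e in zip(predicted, experimental)
                     if math.dist(p, e) <= t)
        fractions.append(within / n)
    return 100 * sum(fractions) / len(thresholds)

# A perfect prediction scores 100; errors cost partial credit per threshold.
exact = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (9, 0, 0)]
print(gdt_ts(exact, exact))  # → 100.0
```

Averaging over multiple thresholds is what lets the score distinguish "almost right everywhere" from "perfect in one region, wildly wrong in another".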
By any reasonable standard, the computational challenge of determining a protein's structure is solved.
Unfortunately, there are a lot of badly behaved proteins. Some immediately embed themselves in the membrane; others quickly pick up chemical modifications. Still others require extensive interactions with specialized enzymes that burn energy to force other proteins to refold. AlphaFold most likely can't handle all of these edge cases, and without a scientific paper describing the system, it will take some time – and some real-world use – to figure out its limitations. That's not to take away from an incredible accomplishment, just to caution against unrealistic expectations.
The big question now is how quickly the system will be made available to the biological research community, so that its limitations can be identified and we can start applying it in cases where it's likely to work well and deliver significant value, such as the structures of proteins from pathogens or of the mutated forms found in cancer cells.