Stability Oracle: A Powerful Tool for Engineering Stable Proteins

October 2024


In biotechnology, engineering proteins with enhanced stability remains a critical challenge. Whether the goal is an industrial biocatalyst or a more effective pharmaceutical biologic, proteins must withstand unfolding and aggregation. Computational methods that accurately predict how amino acid mutations affect a protein’s thermodynamic stability could greatly accelerate protein engineering. Until now, however, existing computational tools have struggled to reliably identify stabilizing mutations.

Enter Stability Oracle, a new deep learning framework that outperforms state-of-the-art methods in predicting thermodynamically stabilizing protein mutations. Developed by a team of researchers at the University of Texas at Austin, Stability Oracle represents a major leap forward in our ability to computationally engineer proteins with enhanced stability.

The Need for Improved Stability Prediction

Proteins are the workhorses of biology, carrying out a vast array of critical functions within living organisms. Their ability to fold into complex three-dimensional structures, and maintain those structures, is essential to their function. Proteins that are prone to unfolding or aggregation are often less effective or even completely non-functional.

This is a major challenge in the development of protein-based biotechnologies. Industrial enzymes used in manufacturing processes, for example, need to be able to withstand harsh conditions like high temperatures or the presence of organic solvents. Pharmaceutical protein drugs, likewise, must maintain their structural integrity during production, storage, and administration. Improving the thermodynamic stability of these proteins is a key priority.

Traditionally, this has been a laborious, trial-and-error process. Researchers would make iterative mutations to a protein’s sequence, test the effects on stability experimentally, and gradually work towards a more stable variant. But this approach is time-consuming and resource-intensive. Computational methods that can accurately predict the effects of mutations on protein stability could vastly accelerate this process.

Over the past 15 years, a variety of computational stability prediction tools have been developed, employing both physics-based and machine learning approaches. However, these methods have struggled with several key issues that have prevented them from having a transformative impact on protein engineering.

“The lack of data and machine learning engineering issues have prevented deep learning algorithms from having a similarly revolutionary impact on protein stability prediction as they have had in other areas of biology and chemistry,” explains Daniel Diaz, one of the lead authors of the Stability Oracle study.

The primary challenges include data scarcity, bias, and leakage, as well as the use of inappropriate performance metrics. Current datasets are heavily biased towards destabilizing mutations, with stabilizing mutations making up only 30% or less of the data. There is also significant overlap between training and test sets, leading to overly optimistic performance evaluations that do not reflect real-world generalization.

Perhaps most importantly, the field has been overly focused on metrics like Pearson correlation and root mean squared error (RMSE), which do not adequately capture a model’s ability to identify stabilizing mutations – the key goal for protein engineering applications.

“Improvements in these metrics do not necessarily translate into improvements for identifying stabilizing mutations,” Diaz notes. “Metrics like precision, recall, and area under the receiver operating characteristic curve are much more relevant for evaluating the practical usefulness of these models.”
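To make the distinction between these metrics concrete, here is a minimal Python sketch. The numbers are hypothetical toy values, not data from the study, and the stabilizing cutoff of ΔΔG < -0.5 kcal/mol is assumed for illustration. The toy model tracks destabilizing mutations closely but never predicts a stabilizing one.

```python
import math

# Assumed cutoff: a mutation counts as stabilizing when ddG < -0.5 kcal/mol.
STABILIZING_CUTOFF = -0.5


def pearson(measured, predicted):
    """Pearson correlation between measured and predicted ddG values."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_m * var_p)


def rmse(measured, predicted):
    """Root mean squared error in the same units as the inputs."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


def precision_recall(measured, predicted, cutoff=STABILIZING_CUTOFF):
    """Precision and recall for the 'stabilizing' class only."""
    true_pos = sum(1 for m, p in zip(measured, predicted) if m < cutoff and p < cutoff)
    predicted_pos = sum(1 for p in predicted if p < cutoff)
    actual_pos = sum(1 for m in measured if m < cutoff)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical toy data in kcal/mol: two truly stabilizing mutations,
# both predicted as roughly neutral.
measured = [-1.2, -0.8, 0.5, 1.5, 2.5, 3.0]
predicted = [0.1, 0.2, 0.6, 1.4, 2.6, 2.9]

r = pearson(measured, predicted)
error = rmse(measured, predicted)
precision, recall = precision_recall(measured, predicted)
print(f"Pearson r = {r:.2f}, RMSE = {error:.2f} kcal/mol")
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy case the correlation is above 0.9, yet precision and recall for stabilizing mutations are both zero, which is the failure mode the quote describes.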

Introducing Stability Oracle

To address these longstanding challenges, the Stability Oracle team took a multi-pronged approach, developing new data curation techniques, innovative deep learning architectures, and more appropriate performance evaluation methods.

The foundation of Stability Oracle is a graph-transformer neural network that learns to extract structural features from the local chemical environment surrounding a target amino acid residue. This “masked microenvironment” is then combined with embeddings representing the wild-type and mutant amino acids to predict the change in thermodynamic stability (ΔΔG) resulting from that mutation.

“Rather than relying on computationally generated mutant structures, which can be expensive and error-prone, Stability Oracle learns to implicitly model how the ‘from’ and ‘to’ amino acids interact with the local chemistry,” explains Chengyue Gong, another lead author.

This design choice allows Stability Oracle to generate predictions for all 380 possible amino acid substitution types (20 “from” residues × 19 “to” residues) starting from a single protein structure – a vast improvement in computational efficiency over previous structure-based methods that need a separately modeled structure for each mutant.
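The figure of 380 is simply the number of ordered pairs of distinct amino acids drawn from the 20 standard ones. The short Python sketch below enumerates them; the function names are illustrative and are not part of the Stability Oracle code.

```python
from itertools import permutations

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def all_substitution_types():
    """Every ordered ('from', 'to') pair of distinct amino acids."""
    return list(permutations(AMINO_ACIDS, 2))


def substitutions_at(wild_type):
    """The 19 substitutions available at one position with a known residue."""
    return [(wild_type, aa) for aa in AMINO_ACIDS if aa != wild_type]


pairs = all_substitution_types()
print(len(pairs))                  # 20 * 19 ordered pairs
print(len(substitutions_at("A")))  # 19 alternatives to alanine
```

At any one position with a fixed wild-type residue, 19 of these pairs apply; the full set of 380 covers every possible “from” and “to” combination.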

To tackle the data challenges, the researchers curated several new training and test datasets. They used sequence clustering to ensure minimal overlap between proteins in the training and test sets, a critical step to properly evaluate generalization.
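A cluster-aware split can be sketched as follows. This is a minimal illustration, not the authors’ pipeline: it assumes cluster assignments have already been produced by an external sequence-clustering tool, and the protein and cluster identifiers are made up. The key point is that whole clusters, not individual proteins, are assigned to either the training or the test set.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, test_fraction=0.2, seed=0):
    """Split protein IDs so that no sequence cluster spans both sets.

    cluster_of maps protein ID -> cluster ID (assumed to be precomputed).
    """
    members = defaultdict(list)
    for protein_id, cluster_id in cluster_of.items():
        members[cluster_id].append(protein_id)

    # Shuffle cluster IDs reproducibly, then hold out whole clusters.
    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)
    n_test = max(1, round(test_fraction * len(cluster_ids)))

    test_ids = sorted(p for c in cluster_ids[:n_test] for p in members[c])
    train_ids = sorted(p for c in cluster_ids[n_test:] for p in members[c])
    return train_ids, test_ids


# Hypothetical cluster assignments for seven proteins in five clusters.
cluster_of = {
    "P1": "c1", "P2": "c1", "P3": "c2", "P4": "c3",
    "P5": "c3", "P6": "c4", "P7": "c5",
}
train_ids, test_ids = split_by_cluster(cluster_of, test_fraction=0.4)
print("train:", train_ids)
print("test:", test_ids)
```

Splitting at the cluster level means a test protein never has a close sequence relative in the training data, which is what guards against the leakage described earlier.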

They also introduced a novel data augmentation technique called “thermodynamic permutations” (TP). TP leverages the state-function property of Gibbs free energy to expand a relatively small set of experimental ΔΔG measurements into a much larger, thermodynamically valid dataset. Importantly, TP generates a balanced distribution of stabilizing and destabilizing mutations, rather than the heavily skewed datasets used in prior work.

“TP allows us to better assess a model’s ability to identify stabilizing mutations, which is the key goal for protein engineering applications,” Diaz notes.
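The idea behind thermodynamic permutations can be illustrated with a short Python sketch. Because Gibbs free energy is a state function, if ΔΔG has been measured for wild type → B and wild type → C at the same position, then ΔΔG for B → C equals ΔΔG(wild type → C) minus ΔΔG(wild type → B). The code below is one reading of that principle under simplifying assumptions; the authors’ exact procedure may differ, and the values are hypothetical.

```python
from itertools import permutations


def thermodynamic_permutations(ddg_from_wild_type):
    """Derive ddG for every ordered pair of measured mutants at one position.

    ddg_from_wild_type maps mutant amino acid -> measured ddG (kcal/mol)
    for the wild type -> mutant substitution at a single position.
    """
    derived = {}
    for src, dst in permutations(sorted(ddg_from_wild_type), 2):
        # State-function identity: ddG(src -> dst) = ddG(wt -> dst) - ddG(wt -> src)
        derived[(src, dst)] = ddg_from_wild_type[dst] - ddg_from_wild_type[src]
    return derived


# Hypothetical measurements at one position (negative = stabilizing).
measured = {"V": 1.0, "L": -0.5, "G": 2.5}

derived = thermodynamic_permutations(measured)
for (src, dst), ddg in sorted(derived.items()):
    print(f"{src} -> {dst}: {ddg:+.1f} kcal/mol")

n_stabilizing = sum(1 for v in derived.values() if v < 0)
n_destabilizing = sum(1 for v in derived.values() if v > 0)
print(n_stabilizing, n_destabilizing)
```

Three measurements yield six derived data points, and every derived pair has a mirror image with the opposite sign, so the augmented set is balanced between stabilizing and destabilizing mutations by construction.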

Beyond the TP-augmented datasets, the team fine-tuned the Stability Oracle framework on a massive dataset of over 2 million protein stability measurements, derived from a high-throughput proteolysis assay on natural and de novo mini-protein domains.

Outperforming the State of the Art

With these innovations in data curation and model architecture, Stability Oracle demonstrates a remarkable ability to predict thermodynamically stabilizing protein mutations. On a rigorously curated test set, Stability Oracle achieved a precision of 0.70 and a recall of 0.69 in identifying stabilizing mutations (defined as ΔΔG < -0.5 kcal/mol).

Importantly, this performance surpasses that of existing state-of-the-art computational tools, which typically achieve only around 20% success in identifying stabilizing mutations. Stability Oracle’s precision in this task is on par with free energy perturbation (FEP) methods, which are considered the gold standard for computational stability prediction but are prohibitively computationally expensive for large-scale protein engineering applications.

“Stability Oracle’s ability to match the performance of FEP methods, while being several orders of magnitude faster, is a major breakthrough,” says Adam Klivans, senior author on the study.

The team also developed a sequence-based counterpart to Stability Oracle, called Prostata-IFML, by fine-tuning the powerful protein language model ESM-2. While Prostata-IFML also demonstrated impressive performance, Stability Oracle’s structure-based approach still outperformed the sequence-only model across a range of metrics.

“The fact that Stability Oracle, with far fewer parameters than Prostata-IFML, can outperform a state-of-the-art sequence model highlights the value of incorporating structural information,” Gong explains. “Protein structures contain critical information beyond just the amino acid sequence.”

Stability Oracle’s structural awareness is further evidenced by its ability to accurately predict stabilizing mutations across different regions of a protein. Analysis of the model’s predictions showed no bias towards identifying stabilizing mutations on the protein surface versus the core, a common limitation of previous structure-based methods.

“Stability Oracle is able to generalize well to mutations in both solvent-exposed and buried regions of the protein,” Diaz notes. “This is an important capability for engineering proteins with enhanced stability.”

Accelerating Protein Engineering

The implications of Stability Oracle’s performance go far beyond just academic benchmarking. This tool has the potential to dramatically accelerate the development of a wide range of protein-based biotechnologies.

“Accurate identification of stabilizing mutations will impact everything from predicting protein therapeutics with greater shelf-life, to engineering enzymes that can withstand harsh industrial conditions,” says Andrew Ellington, a co-author and expert in protein engineering.

For example, in the development of protein-based drugs, the ability to computationally screen millions of potential mutations and identify the most stabilizing ones could vastly reduce the time and cost of experimental optimization. Similarly, in industrial biocatalysis, Stability Oracle could guide the engineering of enzymes that are more resistant to denaturation, expanding the range of processes they can be applied to.

Beyond just predicting the effects of single-point mutations, the Stability Oracle team is already working on extending the framework to handle higher-order mutations. “Data scarcity is an even bigger challenge for predicting the effects of multiple simultaneous mutations,” Diaz explains. “But the innovations we’ve developed with Stability Oracle, like thermodynamic permutations, provide a roadmap for tackling this problem.”

The researchers also see Stability Oracle as a stepping stone towards a broader goal of using deep learning to guide the de novo design of highly stable protein scaffolds. “If we can accurately model how mutations impact stability, the next frontier is using that knowledge to computationally design entirely new protein structures from scratch,” Klivans says.

A Path Forward for Protein Engineering

The development of Stability Oracle represents a significant milestone in the quest to harness the power of deep learning for protein engineering. By addressing longstanding challenges in data quality, model architecture, and performance evaluation, this framework demonstrates the potential for AI-guided protein design to transform a wide range of biotechnologies.

“Stability Oracle establishes a new benchmark for computational stability prediction, and provides a clear path forward for fine-tuning structure-based transformers to virtually any protein phenotype,” Diaz concludes. “This is a necessary task for accelerating the development of protein-based biotechnologies in the years to come.”

As the field of protein engineering continues to evolve, tools like Stability Oracle will undoubtedly play an increasingly central role. By empowering researchers to engineer more stable, more effective protein-based products, this technology could have far-reaching impacts on industries ranging from pharmaceuticals to clean energy. The future of biotechnology is looking more stable than ever.
