Cost-effective ML-based solutions to help you optimize enzymes for temperature performance
CUSTOMER CHALLENGE
A Ginkgo partner recently reduced the amount of exogenous enzyme they needed to add to their manufacturing process by ten-fold, saving them millions of dollars in input costs by using Ginkgo’s enzyme discovery services.
The challenge was that their whole-cell catalyst performance was dropping in their high-temperature process — requiring supplementation with expensive, exogenous enzymes to maintain yields. Here is how they solved that challenge with speed and flexibility leveraging the power of Ginkgo’s tools.
PROTEIN SOURCING: Finding a Better Needle in Nature’s Haystack
With Ginkgo’s tools you don’t need to rely on the obvious—you can dive deeper into nature’s diversity to uncover solutions others miss. Our partner had hit a wall, focusing only on enzyme variants close to their starting point enzyme sequence.
Using our computational tools, we sifted through billions of enzymes, narrowing the field to 942 diverse candidates for testing. The result? A broader search that led to the perfect enzyme that performed well at high temperature, reducing the need for exogenous supplementation by 10x.
OUTCOMES: Better Performance Through Design and Scale
After testing these candidates, we identified a novel enzyme with remarkable properties. The new enzyme was five times more active at elevated temperature than the starting point. Furthermore, the improved enzyme had only 35% sequence identity to the original enzyme.
With this novel enzyme, our partner reduced costs and increased flexibility in downstream applications. Here’s why this success matters:
Reduced Manufacturing Costs: An improved enzyme led our customer to a 90% reduction in exogenous enzyme use and a 10% increase in product yield. This translated to substantial decreases in cost of goods for their product.
Increase in Flexibility: Additionally, with a quarter of candidates being improved over the starting point in bench-scale testing, our customer gained flexibility in their final application. Multiple high-performing candidates enabled them to scale with confidence that they’d have a solution even after testing for parameters that are difficult to assess at the bench scale.
Compatibility with Industrial Processes: Optimizing enzymes for thermostability (and other metrics like solvent or pH tolerance) enables the use of biocatalysis in a wider range of demanding industrial processes.
Talk to us about ML-guided tools for protein engineering
At Ginkgo, we are committed to helping our partners unlock the full potential of enzymes for their industrial processes. Whether it’s finding a novel enzyme or optimizing existing ones, our expertise in scale, diversity, and advanced ML-guided protein engineering ensures you get the most out of your enzyme engineering efforts. Learn more about Ginkgo’s enzyme services here.
Get in touch
Schedule a 30-minute technical consultation with our enzyme engineers
A member of our team will reach out to find a convenient time.
Sign up here to join our community and be the first to know about model releases and new features!
This is a new chapter for Ginkgo, and we’re just getting started. As we continue to develop and release more models and services, we’re excited to see how you’ll use these tools to drive innovation in biology.
Large Language Models (LLMs), when trained with large collections of protein sequence data, have proven effective for protein engineering tasks including structure prediction [1], functional annotation [2], and generation of diverse enzymes [3]. The biological codebase at Ginkgo Bioworks includes the Unified Metagenomic Database (UMDB), a collection of metagenomic sequence data with more than 2 billion protein sequences, most of which do not appear in public repositories.
Here we introduce AA-0, a 650M parameter model following the ESM-2 architecture, trained on public data combined with proprietary sequences from the UMDB. We compare the performance of AA-0 to ESM-2 on popular benchmarks as well as a collection of internal benchmarks relevant to our commercial work in the Ginkgo Bioworks foundry.
AA-0 performs comparably to ESM-2 across a range of 235 external and 73 internal protein engineering tasks. Although the UMDB added 112M distinct sequence clusters to the 51M UniRef clusters available for training, the additional data did not result in uniform improvements across all tasks. These results suggest that modern protein LLMs are not limited strictly by the size of their training dataset. Reaching the full potential of AI for protein engineering may require more specialized forms of task-specific training data.
Why we built AA-0
Ginkgo’s mission is to make biology easier to engineer. Over the years, we’ve worked with more than 100 commercial partners to support R&D projects ranging from therapeutics and pharmaceutical manufacturing to industrial enzymes and agriculture.
By releasing AA-0 to the public, we hope to make Ginkgo’s capabilities and resources more accessible to biotechnology developers. We’re excited to see what you’ll build with them!
The first release supports the common use cases of embedding calculation and generation via masked language modeling. The platform supports calls to both ginkgo-aa0-650M and esm2-650M, so that users can compare their performance as we have done here.
Users can access a free tier and competitive pricing for larger jobs.
About Ginkgo’s Unified Metagenomic Database (UMDB)
We developed AA-0 using the 2023 UMDB corpus of about 2B protein sequences. The UMDB is derived primarily from microbial DNA extracted from soil samples and sourced from diverse geographic regions. The sequence collection was initially assembled to support R&D projects for our customers including microbial strain engineering, enzyme discovery and protein engineering.
Importantly, the UMDB was not created with the primary goal of training a general-purpose protein LLM. The resource is heavily biased toward microbial genomes and includes few sequences from other taxa. One of our goals for creating AA-0 was to better understand how the composition of the training dataset impacts downstream model performance across different protein engineering tasks.
Since 2023, the UMDB has continued to grow and now includes about 3.3B unique protein sequences, spread across 416M clusters at a clustering threshold of 50% sequence identity (SeqID50). Recent additions include public resources like MGnify [4] as well as new proprietary collections of extremophiles and strains relevant to agriculture. Future releases may include models trained with this larger dataset.
Structuring the combined dataset
The AA-0 training dataset was constructed following an approach similar to that described for ESM-2 [1]. We started by collecting the publicly available UniRef50/90 clusters [5] from the September 2021 release. These sequences are clustered at two different levels of sequence identity, 50% (seqID50) and 90% (seqID90), allowing a hierarchical sampling procedure. Sequences are selected first from the larger seqID50 clusters, then from the smaller seqID90 clusters, to ensure representative diversity for training.
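The hierarchical sampling idea can be illustrated with a minimal sketch. The nested dictionary layout, the function name, and the uniform choice at each level are assumptions made for illustration; the actual sampling code is not described here.

```python
import random

def sample_hierarchical(clusters50, rng):
    """Draw one sequence: seqID50 cluster, then seqID90 cluster, then member.

    clusters50 maps a seqID50 cluster id to a dict of seqID90 cluster ids,
    each holding a list of member sequences (hypothetical layout).
    Each level is chosen uniformly, which is an assumption of this sketch.
    """
    c50 = rng.choice(sorted(clusters50))
    c90 = rng.choice(sorted(clusters50[c50]))
    return rng.choice(clusters50[c50][c90])

# Toy data: one small, varied cluster and one large, redundant cluster.
toy = {
    "c50_a": {"c90_a1": ["MKT", "MKS"], "c90_a2": ["MRT"]},
    "c50_b": {"c90_b1": ["GAV"] * 10},
}
rng = random.Random(0)
draws = [sample_hierarchical(toy, rng) for _ in range(1000)]
share_gav = draws.count("GAV") / len(draws)
```

In this toy setup the redundant cluster holds most of the sequences, yet it supplies only about half of the draws, because sampling starts at the cluster level rather than the sequence level.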
We added sequences from the UMDB to the UniRef dataset by assigning them, when possible, to existing UniRef90 clusters meeting the 90% identity threshold. Representative sequences were chosen for each cluster and similarly assigned to the existing UniRef50 clusters. When clustering criteria weren’t satisfied, new clusters were spawned to contain the UMDB sequences. Clustering was performed using the easy-linclust workflow of MMseqs2 [6] with 80% coverage.
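For readers unfamiliar with MMseqs2, the sketch below builds a basic easy-linclust command line with an identity threshold and 80% coverage. It only covers clustering a FASTA file from scratch; the step of assigning UMDB sequences to existing UniRef clusters is not shown, and the file names and helper function are hypothetical.

```python
def linclust_command(fasta, out_prefix, tmp_dir, min_seq_id, coverage=0.8):
    """Build an argument list for `mmseqs easy-linclust`.

    --min-seq-id sets the sequence identity threshold and -c the
    alignment coverage threshold. Paths are placeholders.
    """
    return [
        "mmseqs", "easy-linclust", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),
        "-c", str(coverage),
    ]

# Two clustering levels, mirroring seqID90 and seqID50.
cmd90 = linclust_command("sequences.fasta", "clusters90", "tmp", 0.9)
cmd50 = linclust_command("clusters90_rep_seq.fasta", "clusters50", "tmp", 0.5)
```

The argument lists could be passed to subprocess.run on a machine where MMseqs2 is installed.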
The clustering process resulted in 172M seqID50 clusters, a substantial increase from the ~60M found in the original UniRef50. Looking inside the new clusters, we found remarkably little overlap between the public and UMDB sequences (Fig. 1). These results indicate that the combined dataset includes many novel sequences unlike anything used to train previous models. New sequences mean new information and, potentially, new opportunities for AA-0 to learn the patterns that occur in naturally evolved proteins.
Figure 1. Sequence novelty in the UMDB. 65% of protein sequence clusters used to train AA-0 included only sequences from the UMDB, 30% included only UniRef50 sequences, and 5% included sequences from both sources. The low degree of overlap indicates that the UMDB supplied many novel sequences for training.
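A breakdown like the one in Figure 1 can be computed by recording which sources contribute to each cluster. The sketch below assumes a hypothetical mapping from cluster id to the set of contributing sources; the cluster ids and data are toy values, not the real dataset.

```python
from collections import Counter

def provenance_breakdown(cluster_sources):
    """Return the fraction of clusters that are UMDB only, UniRef only, or both.

    cluster_sources maps a cluster id to the set of sources
    ({"UMDB"}, {"UniRef"}, or both) found among its members.
    """
    counts = Counter()
    for sources in cluster_sources.values():
        if sources == {"UMDB"}:
            counts["UMDB only"] += 1
        elif sources == {"UniRef"}:
            counts["UniRef only"] += 1
        elif sources == {"UMDB", "UniRef"}:
            counts["both"] += 1
        else:
            raise ValueError(f"unexpected sources: {sources!r}")
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

toy_clusters = {
    "c1": {"UMDB"},
    "c2": {"UMDB"},
    "c3": {"UniRef"},
    "c4": {"UMDB", "UniRef"},
}
fractions = provenance_breakdown(toy_clusters)
```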
Selecting a strategy for filtering, sampling and training
We explored a variety of approaches to filter sequences for quality and sample them from the combined dataset for training (Table 1). To evaluate the impact of each strategy, we used it to train a smaller model of 150M parameters, with the 150M-parameter version of ESM-2 serving as a similarly powered baseline. Two kinds of benchmarking tests were used to evaluate performance: ProteinGym and Owl, our in-house benchmark, which we describe in more detail below. The sampling strategies we tried included:
Sequence quality filter. We removed sequences with indications of low quality, for example the inclusion of non-amino-acid characters.
Minimum cluster size. We removed SeqID50 clusters containing fewer than the indicated number of sequences, reasoning they might not provide representative data.
Samples per cluster. We sampled either 1 or the indicated number of sequences from each SeqID50 cluster, trading off wider cluster diversity for deeper cluster sampling.
Sequence length reweighting. We adjusted sampling to reduce the probability of choosing sequences shorter than the indicated length, which are more likely to represent sequences of lower utility (e.g. short non-structural proteins) or fragments.
Single-representative sampling. We sampled only the representative sequences for each SeqID50 cluster as determined by the clustering algorithm, simplifying sampling but losing finer in-cluster variations.
| Setting | ESM2 150M | Trial 0 | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5 |
|---|---|---|---|---|---|---|---|
| Sequence quality filtering | | False | True | True | True | True | True |
| SeqID50 min cluster size | | 1 | 1 | 1 | 2 | 100 | 2 |
| Samples per SeqID50 cluster | | 1 | 1 | 1 | 1 | 50 | 1 |
| Sequence length reweighting threshold | | 1 | 1 | 100 | 100 | 100 | 100 |
| Only return cluster representatives | | False | False | False | False | False | True |
| Owl Score | 0.204 | 0.173 | 0.161 | 0.185 | 0.223 | 0.240 | 0.231 |
| ProteinGym Score | 0.318 | 0.292 | 0.293 | 0.291 | 0.318 | 0.257 | 0.302 |
Table 1. Model comparisons under different filtering and sampling strategies. Performance metrics are reported as a Spearman correlation between model scores and experimental measurements. The top performing strategies for each benchmark are indicated in bold.
Although no strategy was the unambiguous winner on both benchmarks, we chose the strategy in Trial 3 as giving an effective balance of performance. This entailed removing all seqID50 clusters with only 1 sequence and introducing a length reweighting threshold of 100 amino acids to sample fewer short sequences. The maximum length for training sequences was set to 512, with random cropping of sequences longer than this length.
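The selected strategy combines a quality filter, a minimum cluster size, length reweighting, and random cropping. The sketch below shows one way these pieces could fit together. The 20-letter alphabet, the linear down-weighting of short sequences, the order of the filtering steps, and all function names are assumptions for illustration; the exact rules used for AA-0 are not specified in the text.

```python
import random

# Assumed alphabet: the 20 standard amino acids.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def passes_quality(seq):
    """Reject empty sequences and those with non-amino-acid characters."""
    return len(seq) > 0 and set(seq) <= VALID_RESIDUES

def length_weight(seq, threshold=100):
    """Assumed reweighting: sequences shorter than the threshold are
    sampled with probability proportional to their length."""
    return min(1.0, len(seq) / threshold)

def random_crop(seq, max_len, rng):
    """Return a random window of max_len residues from a longer sequence."""
    if len(seq) <= max_len:
        return seq
    start = rng.randrange(len(seq) - max_len + 1)
    return seq[start:start + max_len]

def sample_from_cluster(members, rng, min_cluster_size=2,
                        threshold=100, max_len=512):
    """Sample one training sequence from a seqID50 cluster, or None."""
    if len(members) < min_cluster_size:
        return None  # singleton clusters are dropped
    kept = [s for s in members if passes_quality(s)]
    if not kept:
        return None
    weights = [length_weight(s, threshold) for s in kept]
    chosen = rng.choices(kept, weights=weights, k=1)[0]
    return random_crop(chosen, max_len, rng)

rng = random.Random(0)
singleton = sample_from_cluster(["A" * 600], rng)
cropped = sample_from_cluster(["A" * 600, "C" * 600], rng)
```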
AA-0 was trained on an 8×8 configuration on Google Cloud Platform with A100 GPUs. Except as noted below, training followed the guidelines described for ESM-2 [1]. In hyperparameter search experiments, we didn’t find any settings that meaningfully improved outcomes. We implemented two primary changes which, in our hands, were essential for reliable training:
We made use of Xavier uniform initializations for KVQ weights in the attention layers with gain set to 1/sqrt(2).
We used the AdamW optimizer with settings lr=4e-4, weight_decay=1e-5.
Like ESM-2, we used a linear learning rate scheduler with 2000 warmup steps, decaying to 10% of the maximum learning rate over the training duration. Following the sampling and filtering pattern selected above, we trained for 1M steps on the combined dataset followed by 150k steps of fine-tuning on UniRef50 sequences. We found that this fine-tuning improved performance on some downstream tasks for a select number of targets, as described below.
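To make the training settings concrete, the sketch below computes a linear warmup and decay schedule and the sampling bound for Xavier uniform initialization. It assumes warmup starts from zero and that decay runs linearly over the 1M base training steps; the function names and the 1280-wide projection in the example are illustrative choices, not details confirmed by the text.

```python
import math

def learning_rate(step, peak_lr=4e-4, warmup_steps=2000,
                  total_steps=1_000_000, final_fraction=0.1):
    """Linear warmup to peak_lr, then linear decay to final_fraction * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return peak_lr * (1.0 - (1.0 - final_fraction) * progress)

def xavier_uniform_bound(fan_in, fan_out, gain=1 / math.sqrt(2)):
    """Weights are drawn from U(-bound, bound) under Xavier uniform init."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))

lr_start = learning_rate(0)
lr_peak = learning_rate(2000)
lr_end = learning_rate(1_000_000)
bound = xavier_uniform_bound(1280, 1280)  # example projection size (assumed)
```

With a gain of 1/sqrt(2), the initial attention weights are drawn from a range about 29% narrower than the default gain of 1 would give.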
Model evaluation on standard and in-house protein engineering tasks
To evaluate the performance of AA-0, we made use of the public benchmark collections DGEB [7] and ProteinGym [8]. We were also interested in testing the model specifically against the kind of protein engineering workflows that we encounter at Ginkgo. For this, we used the internally developed Owl benchmark. In the plots below, we compare the performance of three models:
ESM-2 refers to esm2_t33_650M_UR50D, the model documented in the original paper [1].
AA-0-base indicates ginkgo-aa-0-650m, the model trained on the combined dataset including our UMDB sequences.
AA-0 is ginkgo-aa-0-650m-finetune-UR50-150k, in which AA-0-base underwent an additional 150k steps of fine-tuning with sequences from UniRef50.
The Diverse Genomic Embedding Benchmark (DGEB), composed by TattaBio, is a collection of tasks that make use of the embeddings from a protein sequence encoder model, for example using pooled representations to search a sequence collection for similar proteins.
Figure 2. Comparison of model performance using DGEB. The tasks on the left belong to six types: BiGene Mining, Evolutionary Distance Similarity (EDS), Classification, Pair Classification, Clustering and Retrieval. The reported scoring metric varies by task type, with higher scores representing better performance.
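Retrieval with pooled embeddings can be sketched in a few lines. The example below assumes mean pooling over per-residue vectors and cosine similarity for ranking; DGEB tasks may use other pooling or scoring choices, and the vectors here are toy values rather than real model output.

```python
import math

def mean_pool(token_embeddings):
    """Average per-residue vectors into one fixed-length protein vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, library):
    """Return library names ordered from most to least similar to query."""
    return sorted(library, key=lambda name: cosine(query, library[name]),
                  reverse=True)

query = mean_pool([[1.0, 0.0], [1.0, 0.0]])
library = {
    "protein_a": [0.9, 0.1],
    "protein_b": [0.0, 1.0],
    "protein_c": [-1.0, 0.0],
}
ranking = rank_by_similarity(query, library)
```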
ProteinGym is a collection of benchmarks that challenge a model to predict the effect of mutations on the measured function of a protein sequence [8]. We focused on the collections of protein substitution variants created with Deep Mutational Scanning (DMS). The 217 total assays were collected into five assay categories: organismal fitness, enzyme activity, protein binding, protein expression and protein stability. The distribution of scores within each category gives an overview of the performance of each model.
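One common way to score a substitution with a masked language model is the log-likelihood ratio between the mutant and wild-type residues at the masked position. The text does not state which scoring rule was used for AA-0, so the sketch below is an illustration under that assumption, with a hypothetical table of per-position probabilities standing in for model output.

```python
import math

def parse_mutation(mutation):
    """Split a label like 'A2V' into (wild type, zero-based position, mutant)."""
    return mutation[0], int(mutation[1:-1]) - 1, mutation[-1]

def substitution_score(wild_type, mutation, position_probs):
    """Log-likelihood ratio log p(mutant) - log p(wild type).

    position_probs[i] is a dict of residue probabilities predicted for
    position i when that position is masked (hypothetical model output).
    """
    wt, pos, mut = parse_mutation(mutation)
    if wild_type[pos] != wt:
        raise ValueError(f"{mutation} does not match the wild-type sequence")
    probs = position_probs[pos]
    return math.log(probs[mut]) - math.log(probs[wt])

toy_probs = [
    {"M": 0.9, "L": 0.1},
    {"A": 0.6, "V": 0.3, "G": 0.1},
    {"K": 0.8, "R": 0.2},
]
score = substitution_score("MAK", "A2V", toy_probs)
```

A negative score means the model considers the mutant residue less likely than the wild type at that position.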
Broadly speaking, the AA-0 and ESM-2 models performed comparably (Fig. 3). When examining the medians of the distributions, AA-0 was marginally better at tasks relating to predicting protein stability and marginally worse at predicting enzyme activity (though there is high overlap in the performance distributions). Tasks related to protein binding were challenging for both models, highlighting the difficulty of predicting interactions from sequence data.
Figure 3. Comparison of model performance using ProteinGym. The indicated models were used to score collections of protein sequences representing DMS substitutions. For each collection, performance is reported as a Spearman correlation between the model-derived score and the measured activity.
The 217 assays are grouped into five categories by the type of property being measured. Box plots indicate the mean score for each category, as well as standard deviations and outliers.
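The Spearman correlation used throughout these comparisons is the Pearson correlation of the ranks of two lists. The sketch below is a plain Python version with average ranks for ties; the function names are our own, and a library routine such as scipy.stats.spearmanr would normally be used instead.

```python
import math

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length lists with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(model_scores, measurements):
    """Spearman correlation between model scores and measured values."""
    return pearson(average_ranks(model_scores), average_ranks(measurements))
```

Because only ranks matter, a model can reach a high Spearman correlation without predicting the measured values themselves, as long as it orders the variants correctly.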
The Owl benchmark, named for our in-house protein design software suite, was developed at Ginkgo to reflect tasks relevant for our work in commercial protein engineering. AI-guided protein discovery uses the model as an embedder to identify functionally similar proteins. Protein engineering is aided by scoring potential sequence variations that may be functionally relevant.
Owl includes 73 collections of protein sequence variants, each labeled with a functional measurement performed during the course of a real customer program. Examples of functional measurements include enzyme activity, specificity or expression titer. As above, we report model performance as a Spearman correlation between model scores and empirical measurements, grouping scores into categories to provide high-level overview (Fig. 4).
Figure 4. Comparison of model performance using Ginkgo’s Owl benchmark. The indicated models were used to score collections of engineered protein sequences. For each collection, performance is reported as a Spearman correlation between the model-derived score and the measured activity.
The 73 assays are grouped into three categories by the type of property being measured. Box plots indicate the mean score for each category, as well as standard deviations and outliers.
Overall, we find roughly comparable results between the different models. Interestingly, we find many examples of a negative correlation between model scores and experimental outcome, particularly for the use case of predicting enzyme specificity.
Why might enzymes with improved specificity tend to have lower model-derived scores? The datasets collected for the Owl benchmark come from different kinds of enzymes being engineered for different functional goals, making generalizations difficult. But this result might indicate important differences in the kinds of sequences that result from natural evolution and protein engineering. For example, an enzyme engineering project might seek to focus an enzyme activity on a particular target that is disfavored in a natural context. If evolution and engineering tend to move sequences in different directions, model-derived scores might negatively correlate with actual measured performance.
Fine-tuning improves performance on viral sequences
The UMDB does not represent a uniform sample of all naturally evolved protein sequences. It is primarily a collection of microbial DNA extracted from soil. As we explored AA-0, we were interested in how this bias in the training data might impact its performance.
The ProteinGym benchmark assays include proteins sourced from humans, other eukaryotes, prokaryotes and viruses. Breaking out the performance of AA-0 by taxon, we found substantially weaker performance on viral proteins (Fig. 5). We suspect this is a result of viral sequences being poorly represented in our training data. Viral sequences are particularly diverse, fast-evolving, and often unlike proteins found in cellular life forms. This result emphasizes the importance of learning from viral sequences directly to be able to model them accurately.
Performance on viral sequences improved markedly following 150k steps of additional fine tuning with the UniRef50 sequences. This improvement motivated us to include the UniRef50 fine-tuning in the model now available through the Ginkgo AI developer portal.
Figure 5. Model performance by taxon. The 217 assays of the ProteinGym DMS collection are grouped by taxon of origin: Human, non-human Eukaryote, Prokaryote or Virus. For each assay, performance is reported as a Spearman correlation between the model-derived score and the measured activity. Box plots indicate the mean score for each category, as well as standard deviations and outliers.
Conclusions
What drives the performance of an LLM? In different contexts, AI researchers have identified model size, training data, and compute as fundamental resources that govern a model’s scaling behavior [9]. Here we investigated the impact of training data on the performance of a protein sequence LLM. We supplemented the ~60M UniRef50 sequence clusters used to train ESM-2 with an additional 112M clusters from the Ginkgo UMDB. The resulting model, AA-0, showed comparable performance across a range of benchmarking tasks, indicating that training data alone was not a limiting resource.
Our experience with AA-0 holds lessons for the development of AI models for applied protein engineering:
The importance of data quality. In preparing AA-0 we explored a variety of strategies for filtering and sampling sequences from the very large UMDB. The selected strategy significantly impacted model performance, suggesting that further exploration in this area might lead to continued improvements. DNA sequencing technology is advancing quickly, leading to exponential growth in datasets and rapid proliferation in data collection techniques. Sequence-based AI models will benefit from standardized and optimized approaches to curate all this data.
The value of data representation. We found the AA-0-base model performed poorly on viral sequences, probably because they were sparsely represented in its training data. This weakness was partially corrected by additional fine tuning with UniRef50 sequences, and could also be improved by curating more representative datasets for future models.
The particular challenges of protein engineering. AA-0 performed well when predicting enzyme activity, a common task in the Ginkgo foundry. Interestingly, the model struggled to predict enzyme specificity, often producing scores that were negatively correlated with measured outcome. This suggests that engineered proteins may include sequence features unlike the evolved proteins used for model training. Future models may require new datasets that capture the features of successful engineered proteins, or may need other strategies to accommodate protein engineering as a use case.
The need for more task-specific data. In commercial protein engineering projects at the Ginkgo foundry, LLMs are not used to generate functional proteins de novo. Instead, libraries of generated sequences are built and tested for a particular desired function. These results from assay-labeled libraries become training data for additional rounds of AI-guided engineering, leading to performance improvements greater than those achieved with sequence-based models alone. Future models will benefit from new datasets assay-labeled for functional outcomes of interest including substrate affinity, enzyme specificity, and expression in particular microbial hosts.
AI can make biology easier to engineer. This is the first of many intended releases from the Ginkgo AI team. We are excited to begin peeling back the curtain and enabling bioengineers across the world to access our technologies. As we scale up our training efforts (we are currently training models 10x larger than these and more!), we will be eager to share our findings and plan to make the resultant models available to the community.
Ready to see what’s possible? Visit our developer portal to access everything you need to start using the API’s free tier, including detailed documentation, tutorials, and sample code. Access the portal today and be among the first to explore our new API. To get you started, we’re offering 2,000 sequences (i.e. ~1M tokens) of free inference with our initial language model! Just fill out the form below.
References
1. Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123-1130. doi:10.1126/science.ade2574
2. Yu T, Cui H, Li JC, Luo Y, Jiang G, Zhao H. Enzyme function prediction using contrastive learning. Science. 2023;379(6639):1358-1363. doi:10.1126/science.adf2465
3. Ruffolo JA, Nayfach S, Gallagher J, et al. Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences. bioRxiv 2024.04.22.590591. doi:10.1101/2024.04.22.590591
4. Richardson L, Allen B, Baldi G, et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Research. 2023;51(D1):D753-D759. doi:10.1093/nar/gkac1080
5. Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31(6):926-932. doi:10.1093/bioinformatics/btu739
6. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35(11):1026-1028. doi:10.1038/nbt.3988
7. West-Roberts J, Kravitz J, Jha N, Cornman A, Hwang Y. Diverse Genomic Embedding Benchmark for functional evaluation across the tree of life. bioRxiv 2024.07.10.602933. doi:10.1101/2024.07.10.602933
8. Notin P, Kollasch AW, Ritter D, et al. ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design. bioRxiv 2023.12.07.570727. doi:10.1101/2023.12.07.570727
9. Hoffmann J, Borgeaud S, Mensch A, et al. Training Compute-Optimal Large Language Models. arXiv:2203.15556. doi:10.48550/arXiv.2203.15556
Acknowledgements
Thanks to the Ginkgo protein engineers, software developers and AI experts who helped to build AA-0: Zachary Kurtz, Matt Chamberlin, Eric Danielson, Alex Carlin, Michal Jastrzebski, Dana Merrick, Dmitriy Ryaboy, Emily Wrenbeck & Ankit Gupta.
When shifting from plant-based extraction to a fermentation-based production to scale to global demand for their nutraceutical, our partner encountered an enzymatic bottleneck in engineering their yeast host.
Ginkgo Bioworks’ Enzyme Intelligence suite of tools, equipped with AI-guided design and a vast proprietary protein database, engineered an enzyme achieving over 600% performance improvement. This leap in enzyme activity enabled our partner to establish sustainable and efficient production through fermentation, allowing them to meet their market demand.
Executive Summary
A nutraceutical company partnered with Ginkgo Bioworks because they needed to shift their production from plant-extraction to fermentation, thereby lowering cost of production and allowing them to scale production to meet global demand.
Our work started with Metagenomic Sourcing to identify alternatives to the natural enzyme that was a bottleneck in production.
We leveraged Protein Engineering to develop chimeric proteins from the metagenomic campaign with increased function.
Ginkgo’s protein engineering team identified enzymes with 600% increase in activity compared to the native starting protein.
As a result of our engineering, the collaboration developed a yeast strain with high titers of our partner’s target molecules. Our approach achieved a significant cost reduction and increased efficiency in their production, enabling them to scale production to meet global demand.
The Challenge:
To address the inefficient production of their molecules from plants and to reduce production costs, our partner sought to switch from traditional plant extraction to synthesis in yeast, using the power of fermentation. Together, we identified biological pathways capable of producing these rare molecules in large quantities. However, one enzyme from the original plant pathway was a bottleneck in yeast. To overcome this, our partner utilized Ginkgo’s Enzyme Intelligence suite of engineering tools to optimize this enzyme and meet their production targets.
Our Work:
We searched for natural equivalents of the bottleneck enzyme through metagenomic analysis, identifying homologs in Ginkgo’s proprietary database of more than 2 billion genes. Although some of these homologs displayed increased activity, none passed the required threshold. We therefore launched a Protein Engineering campaign to develop sequences with a high probability of function.
Our engineers developed a combinatorial library based on these homologs, using sequence-based statistical models for sequence redesign and scanning mutagenesis of the native enzyme.
This 3,000-member library was subjected to screening in our yeast host for enzymes with increased activity.
We iterated on this initial library and carried forward our learnings into additional rounds of protein engineering.
After five cycles of engineering, we identified a sequence with over 600% improvement in activity. This enzyme showed <80% sequence identity to the original sequence.
Outcome:
Our identification of this key enzyme provided a strategic pathway for our partner to carve a unique space and meet demand through fermentation. They expanded their project with us further to achieve their production goals.
Ginkgo’s initial success led to an expansion of the project to take our partner’s proof-of-concept to commercial product.
Ginkgo’s engineers continued Strain Improvement on their initial strain, further optimizing their biosynthesis pathway. Throughout the project, we screened over 19,000 enzymes across the entire pathway to develop their commercial production strain.
In the final stages of strain engineering, Ginkgo started Fermentation and Downstream Processing to make certain that the strain we were developing could be transferred to production facilities so our partner could hit the ground running.
Our final piece of work was completing a Technology Transfer to our partner’s production facility. We verified that our strain and process functioned at scale, leaving our partner with peace of mind as they grew production to meet global demand.
Work With Us
Ginkgo Bioworks’ Enzyme Intelligence suite of tools enables our partners to achieve their production goals through biology, turning significant technical challenges into opportunities. We work with our partners to optimize the pathways and production processes for their target small molecules. We encourage potential partners to discover how our tools and expertise can elevate their projects, lower costs, and enhance sustainability. Shape the future of biotechnology and drug development with us: contact us to begin exploring revolutionary solutions.
We’re so excited to announce our new partnership with Prozomix, a UK-based biotech company focused on novel biocatalyst discovery and manufacturing!
Together, we aim to build out the production of next generation enzyme plates for active pharmaceutical ingredient (API) manufacturing. This collaboration aims to leverage Ginkgo’s Enzyme Services and industry-leading AI/ML models along with Prozomix’s existing enzyme libraries and deep experience manufacturing enzyme plates.
Ginkgo’s Technology Network brings together a diverse array of partners, spanning AI, genetic medicines, biologics, and manufacturing, with the aim of integrating their capabilities to provide customers with robust end-to-end solutions for successful R&D outcomes. With Prozomix now in the Technology Network, Ginkgo customers will have access to Prozomix’s scalable contract manufacturing services, including enzyme samples from mg to kg scale.
For several decades, demands for both improved supply chain sustainability and reduced cost of goods sold have driven the pharma industry towards the adoption of biocatalysts in commercial API manufacturing. Existing enzyme plates offer users an opportunity to rapidly screen potential candidates early in development to identify and de-risk the use of biocatalysts capable of supporting specific reactions in API manufacturing routes. As such, biocatalyst adoption largely depends on the diversity and performance of the enzymes available in these plates.
Prozomix and Ginkgo are partnering to usher in a new generation of biocatalysts built off of sequences and activity data from previous enzyme libraries.
Ginkgo will build class-specific AI models informed by enzyme sequences and data from its own massive metagenomic database as well as Prozomix’s enzyme libraries and associated screening data. These models can then be used to discover novel functional enzyme sequences. Prozomix intends to then use next-gen enzyme libraries, designed by these models, to manufacture novel enzyme plates.
Together, we expect these next-gen enzyme plates to have a diversity and performance that traditional plates lack, potentially unlocking biocatalytic opportunities where previous plates have failed. These plates will be freely available to all pharma process chemistry groups, provided that screening data is shared back with Ginkgo to drive further refinement of the Ginkgo AI/ML models.
“With a global reputation for de-risking early stage biocatalytic processes, we believe the Ginkgo partnership will keep Prozomix at the forefront of best in class biocatalyst provision throughout the AI revolution, enabling our customers to continue saving and improving more lives.”
Simon J. Charnock, CEO of Prozomix
API manufacturing is poised to greatly benefit from the latest in enzyme engineering and AI/ML enzyme models.
We are so excited to partner with Prozomix to get enzymes into as many API routes as possible and help partners meet both their COGS savings and sustainability goals.
We’re thrilled to announce our new partnership with GreenLab, an emerging next generation plant-biotechnology company!
GreenLab is developing a product with the purpose of degrading PFAS and will leverage Ginkgo Enzyme Services to discover a novel enzyme of critical importance for use in this application.
Ginkgo Enzyme Services provides companies with end-to-end enzyme discovery and optimization R&D services. Given its extensive expertise in this space and the nature of this particular project, Ginkgo is providing these services under its success-based pricing model, created to help companies de-risk their research and development efforts.
Cornfield Factories
GreenLab’s proprietary technology allows it to grow enzymes and other proteins inside a corn kernel. By producing proteins in a cultivated crop, GreenLab can readily scale up production across acres of cornfields, with little additional up-front capital and infrastructure.
After the protein of interest is extracted from the kernel with minimal waste, most of the corn used will then proceed along the existing value chain, including food, feed or fuel. GreenLab already has two transformative proteins in commercial production, including manganese peroxidase (a multipurpose environmental remediation solution) and brazzein (which delivers a high-intensity sweetness).
The PFAS Problem
PFAS, short for “per- and polyfluoroalkyl substances”, describes a group of manufactured chemicals that have commonly been used in nonstick and waterproofing agents for decades. They bear the moniker of ‘forever chemicals’ owing to their enduring nature and inability to break down in the environment. They are associated with many dangerous health effects including cancer, reproductive and immune system harm, and other diseases.
A Kernel of Hope: Bio-Based Solutions to Break Down PFAS
There is currently no known commercial process for degrading these forever chemicals, but GreenLab is on a mission to change that and reverse their perpetual environmental buildup. PFAS degradation is a significantly complex problem, and currently no PFAS-degrading enzymes have been commercialized. GreenLab aims to tackle this difficult enzymological problem by leveraging Ginkgo Enzyme Services to discover and develop a novel enzyme for use in their PFAS degradation application. This project is the first step in a journey that could potentially lead to the first deployment of a commercially viable enzymatic solution that can degrade one of the most recalcitrant chemicals in existence.
Ginkgo Enzyme Services
Ginkgo will lead a metagenomic discovery campaign leveraging its vast metagenomic database to identify a library of PFAS-degrading enzymes. Ginkgo will then use advanced ultra-high throughput screening methods to identify unique enzymes with desired activity and transfer the best candidates to GreenLab. In later stages of this collaboration, Ginkgo will engage in AI-enabled enzyme engineering to further improve on the discovered enzyme.
“GreenLab is eager to work with Ginkgo towards solving such a massive and prevalent environmental and health problem. By leveraging Ginkgo Enzyme Services to conduct our enzyme discovery and development, we believe we’re enabling our R&D team to produce, pilot, and deploy our product faster and with less risk than any other option we considered.”
Karen Wilson, CEO of GreenLab
At Ginkgo, we say that our partners can find the needle they’re looking for in our tech stack.
We are thrilled to be working with GreenLab on PFAS degradation, and are ready to utilize our platform to solve such a challenge. We’ll be deploying our powerful AI-enabled in-house computational tools, best-in-class enzyme Codebase, and ultra-high throughput screening methods as we seek to find a novel enzyme fit for GreenLab to address this globally important enzymological problem.
Allonnia, the bio-ingenuity company™ dedicated to extracting value where others see waste, plans to work with Ginkgo and GreenLab to help discover a novel enzyme to combat PFAS, and will work with GreenLab as a commercial partner deploying the enzyme in their end-to-end PFAS solution. In doing so, Allonnia is furthering its commitment to the identification of a biological solution for PFAS degradation. The company has already introduced a PFAS separation and concentration solution with EPOC Enviro’s SAFF unit, a sustainable PFAS remediation technology. Integrating a process for the biodegradation of PFAS concentrate discovered through this project into Allonnia’s solution would represent a breakthrough closed-loop approach. Additionally, Allonnia believes that this solution could be expanded in the future to serve as a degradation technology for other applications where there is a significant unmet need today, such as in-situ soil remediation.
To learn more about Ginkgo Enzyme Services and how you can access Ginkgo’s success-based pricing, visit ginkgobioworks.com/enzyme-services/.
Interested in leveraging Ginkgo Enzyme Services for your R&D? Get in touch here!
Enzyme Engineering and Artificial Intelligence: A New Frontier
Enzymes are the heroes of biotechnology, serving as biological catalysts that make life’s complex reactions look easy. Inside of the cell, enzymes direct the flow of molecules through metabolic pathways, orchestrating biological functions. Outside of their cellular context, enzymes have been co-opted for specialized roles in manufacturing, speeding up processes that would otherwise be painstakingly slow. In pharmaceuticals, enzymes are custom-engineered to act as targeted therapeutics. Whether in life sciences or industrial applications, enzymes elevate our ability to engineer processes and enact chemistries by facilitating reactions with speed and specificity.
For years, scientists have used a variety of tools to design and optimize these crucial biological components. Traditional methods have often hinged on exploiting evolutionary pressures—letting nature do the heavy lifting over generations and then picking the winners. Structure-based prediction techniques, like Rosetta, also made a significant impact, allowing researchers to model how tweaks to an enzyme’s structure could influence its activity.
But we’re entering a new era–one in which we can train Artificial Intelligence (AI) models based on large biological data sets. This is where Ginkgo Bioworks comes in. Our expansive cell engineering platform is a data-generating powerhouse, churning out the kind of high-quality, voluminous data that AI algorithms thrive on. The marriage of this large-scale data generation with AI models allows us to transcend previous limitations, making Ginkgo an ideal environment to train and deploy machine learning tools for the complex art of enzyme engineering.
The AI Story: Big Data, Bigger Breakthroughs
AI learns from large data sets. Ginkgo Bioworks generates these types of data: we make it possible for you to produce and learn from large data sets. Our extensive repositories of enzymes not only cover a wide range of protein sequences but are also complemented by highly targeted data, revealing precise sequence-function correlations. This dual-data approach is implemented through machine learning cycles in our enzyme engineering projects, enabling us to iteratively refine predictive models.
Ginkgo has developed an AI tool, Owl, to fine-tune enzymes for a specialized role. An expansive data set provides the foundational architecture. To construct the intricate details, however, we employ data that is calibrated to the specific enzyme and its intended function. This enables Owl, our machine learning tool, to not merely “learn” but to “apply” its learnings, writing the intricate, detailed novel enzyme that our scientists require. Owl can “see in the dark” and discern viable paths in complex enzyme design landscapes.
Ginkgo’s approach to enzyme design isn’t merely data accumulation; it’s strategic data deployment. Our Foundry is equipped to generate an extensive range of high-quality biological data at scale. From DNA design and synthesis to high-throughput screening, we create vast data sets corroborating structure-function relationships. Owl thrives in this environment, allowing us to design enzyme variants tailored to our partners’ unique specifications, whether that’s enzyme activity, specificity, or other parameters.
As we navigate the complexities of enzyme design and optimization, think of Owl as the expert navigator and our robust data sets and data-generating capabilities as the compass and map. Together, they form a symbiotic alliance that not only challenges but also redefines the boundaries of traditional R&D.
Tackling Enzymes in Central Carbon Metabolism: the power of iteration and integration
Enzymes that regulate flux through Central Carbon Metabolism (CCM) are biological masterpieces. These proteins have been shaped by billions of years of evolutionary refinement to execute their functions with unmatched precision and, in many cases, maintain high sequence and structure conservation throughout the tree of life.
In one example of Owl-guided enzyme optimization, we were asked to improve the reaction kinetics of an enzyme involved in CCM. While this enzyme had been studied for the past 50 years, the best improvement we found in the literature was a 2-fold increase in catalytic efficiency (kcat/KM); our customer needed a 10-fold improvement in the efficiency of this enzyme in order to meet their economic targets.
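Catalytic efficiency and fold improvement are both simple ratios. The sketch below shows how a fold improvement in kcat/KM could be computed from measured kinetic constants. The kcat and KM values are hypothetical placeholders chosen for illustration (the case study reports only fold changes), and the function names are our own.

```python
def catalytic_efficiency(kcat_per_s: float, km_molar: float) -> float:
    """Return kcat/KM in M^-1 s^-1 from kcat (s^-1) and KM (M)."""
    if kcat_per_s < 0 or km_molar <= 0:
        raise ValueError("kcat must be non-negative and KM must be positive")
    return kcat_per_s / km_molar


def fold_improvement(variant_eff: float, parent_eff: float) -> float:
    """Ratio of a variant's catalytic efficiency to the parent's."""
    if parent_eff <= 0:
        raise ValueError("parent efficiency must be positive")
    return variant_eff / parent_eff


# Hypothetical kinetic constants, not taken from the program described here.
parent = catalytic_efficiency(kcat_per_s=10.0, km_molar=1e-3)
variant = catalytic_efficiency(kcat_per_s=25.0, km_molar=2.5e-4)
fold = fold_improvement(variant, parent)
print(f"{fold:.1f}-fold improvement in kcat/KM")
```

Note that a variant can reach the same fold improvement by raising kcat, lowering KM, or both, which is why the two constants are usually reported alongside the ratio.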
Our approach to this project leveraged our Foundry’s ability to generate and test large libraries of strains. In our initial data-generation phase, we created a first-generation library featuring 2,000 distinct enzyme variants crafted using structure-based design, as well as semi-rational methods like active-site mutagenesis for targeted alterations. This was an important step because it generated a data set for initial Owl training. With this information in hand, we designed a second-generation library to give Owl more information: we maintained the library size of the first but incorporated insights from the previous round, resulting in an exciting 3.9-fold improvement, a leap that surpassed anything we had seen before.
But the real improvements were just beginning. The third generation of this program brought us to a pivotal point in our optimization journey. Leveraging Owl’s predictive analytics, we strategically developed a broad library of 4,000 enzyme variants, generating diversity where it mattered most. The result was an unprecedented 4.5-fold improvement in enzyme efficiency, serving as a testament to Owl’s growing mastery in predictive capability.
Data from these three consecutive generations positioned us to make our biggest improvements yet. Given the data that our scientists had generated, Owl continued to generate increasingly sophisticated models of enzyme function. The final iteration culminated in a fourth generation where only 100 enzyme variants needed to be tested. The result, which marked the successful completion of this customer program, was astonishing: a 10-fold improvement in enzyme function, verified through meticulous arrayed activity assays and detailed protein characterization. By integrating the large data sets generated by Ginkgo’s cell engineering platform with Owl’s predictive power, we surpassed the bounds of natural evolution and decades of research reported in the literature to meet our customer’s targets.
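The four generations described above follow a design, test, and retrain loop. The toy sketch below illustrates that general pattern only; it is not Owl and makes no claim about how Owl works. The assay is a simulated additive fitness function, the surrogate is a simple per-site average, and all names, the parent sequence, and the scaled-down library sizes are hypothetical.

```python
import random
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_hidden_assay(length, rng):
    """Toy stand-in for a wet-lab assay: additive per-site effects."""
    effects = [{aa: rng.gauss(0.0, 1.0) for aa in AMINO_ACIDS}
               for _ in range(length)]
    return lambda seq: sum(effects[i][aa] for i, aa in enumerate(seq))


def mutate(seq, rng, n_mut):
    """Return a copy of seq with n_mut positions redrawn at random."""
    chars = list(seq)
    for pos in rng.sample(range(len(chars)), n_mut):
        chars[pos] = rng.choice(AMINO_ACIDS)
    return "".join(chars)


def fit_surrogate(measured):
    """Surrogate model: mean measured activity per (position, residue)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for seq, activity in measured.items():
        for key in enumerate(seq):
            sums[key] += activity
            counts[key] += 1
    overall = sum(measured.values()) / len(measured)

    def predict(seq):
        # Unseen (position, residue) pairs fall back to the overall mean.
        return sum(sums[k] / counts[k] if k in counts else overall
                   for k in enumerate(seq))
    return predict


def run_campaign(parent, library_sizes, seed=0):
    """Run one design/test/retrain round per entry in library_sizes."""
    rng = random.Random(seed)
    assay = make_hidden_assay(len(parent), rng)
    measured = {parent: assay(parent)}
    best_per_round = []
    for round_idx, size in enumerate(library_sizes):
        # Propose candidates near the best variants measured so far.
        top = sorted(measured, key=measured.get, reverse=True)[:10]
        pool = set()
        while len(pool) < 5 * size:
            cand = mutate(rng.choice(top), rng, rng.randint(1, 3))
            if cand not in measured:
                pool.add(cand)
        if round_idx == 0:
            # No training data yet: sample the first library at random.
            library = rng.sample(sorted(pool), size)
        else:
            # Later rounds: keep the candidates the surrogate ranks highest.
            predict = fit_surrogate(measured)
            library = sorted(pool, key=predict, reverse=True)[:size]
        for seq in library:
            measured[seq] = assay(seq)
        best_per_round.append(max(measured.values()))
    return best_per_round


# Library sizes are scaled down 10x from the case study for a quick run.
best = run_campaign("M" + "A" * 29, library_sizes=(200, 200, 400, 10))
print([round(b, 2) for b in best])
```

The design choice worth noting is that every measurement, including poor variants, is fed back into the surrogate, which is what allows a later round to test far fewer candidates.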
The future of enzyme engineering: large data and machine learning at Ginkgo Bioworks
The confluence of big data and AI accelerates the pace of innovation to unprecedented speeds. Ginkgo’s cell engineering platform is an ecosystem designed for generating expansive, high-quality data sets customized for complex biological inquiries. This data, in turn, fuels the predictive power of AI models. Together, they form a symbiotic relationship that enables us to challenge the limitations of natural evolution and traditional research methods.
As stakeholders in the biotechnology industry, navigating complex R&D challenges requires more than just robust tools; it requires effective partnerships. Ginkgo Bioworks offers the specialized machine learning models and data-generation capabilities necessary to advance your research and overcome bottlenecks. Our suite of resources is designed to integrate seamlessly with your objectives, providing actionable insights and solutions tailored to your specific challenges.
Ginkgo is investing in the future of AI for biotech: see our recent announcement with Google about developing foundation generative AI models for DNA and protein. Leverage our expertise and technology for your next project, and join us in pushing the boundaries of what is possible in synthetic biology.
We’re thrilled to announce our new collaboration with Factorial Biotechnologies, an emerging single-cell sequencing company with a novel intracellular library preparation technology.
Through this partnership, Factorial will leverage Ginkgo Enzyme Services to develop a novel isothermal DNA polymerase for use in their single-cell next-generation sequencing (NGS) library prep kit. Given our extensive expertise in this space, Ginkgo will provide these services under our success-based pricing model, created to help companies de-risk their research and development efforts.
Single-cell sequencing is a promising technique to better understand genetic and functional diversity within complex tissues and biological systems, but its impact has been limited, due to complex laboratory workflows and high cost.
Factorial Biotechnologies aims to dramatically simplify the workflow of single-cell sequencing with an extraction-free technology that makes it possible for complete NGS libraries to be prepared inside of intact cells within a mixed cell population. The potential for this scalable, high-throughput, and cost-efficient technology spans scientific research in the healthcare and life science industries, including precision oncology, immunology, cell and gene therapy, and quality control and screening for synthetic biology. With Factorial’s in-cell library prep technology and barcoding scheme, single-cell libraries can also be prepared using digital PCR workflows.
To support this promising technology, Ginkgo will lead a campaign in P. pastoris to develop a novel enzyme, an isothermal DNA polymerase, that is instrumental to Factorial’s innovative NGS library prep kit. Our advanced ultra-high throughput screening methods can help identify unique enzymes and valuable reagents with desired activity and functions for innovative life science tools and research. Performing discovery and high-throughput screening in our proprietary P. pastoris expression system enables synergy between early innovation and manufacturability of these valuable reagents.
“We look forward to working with Ginkgo to develop and optimize a unique and important piece of our workflow. We’re eager to see our cost-effective, high throughput technology help researchers and clinicians deliver on the promise of single-cell genomics.”
John Wells, Co-Founder and CEO of Factorial Biotechnologies
We are so excited to power Factorial’s differentiated technology on our platform. We believe Factorial’s extraction-free library prep will be a game-changer for single-cell sequencing, and we’re proud to help play a part in it. Ginkgo Enzyme Services is uniquely suited to rapidly enable novel molecular diagnostic assays through broad metagenomic searches and efficient AI-enabled enzyme engineering.
Enzymes power a diverse array of applications across industries, from industrial processing and chemical manufacturing to therapeutics, as well as applications such as Factorial’s innovation in life sciences and molecular diagnostics. Ginkgo’s platform supports the discovery and development of enzymes to enable innovators across industries who seek to make better technologies more accessible.
To learn more about Ginkgo Enzyme Services and how you can access Ginkgo’s success-based pricing, please visit ginkgobioworks.com/enzyme-services/.
Find the full press release here along with all of the latest news from the Ginkgo team.
Today, we’re announcing our new partnership with Voodoo Scientific!
Voodoo plans to leverage Ginkgo Enzyme Services to help produce a component of ultra-premium spirit products that are truly smooth.
Most distilled alcoholic beverages produce some degree of harsh sensation, or “bite,” when consumed, which is a major deterrent for many potential customers. Voodoo identified the scientific cause for this harshness and created an enzymatic solution to give distillers the ability to manage it. Distillers can use Voodoo’s novel enzymatic solution to produce more premium products by creating smooth spirits.
Our extensive protein discovery and design capabilities will be used to help develop and optimize the enzyme critical to Voodoo’s product for a wide range of conditions in spirits manufacturing, from craft to global-scale production environments.
“Providing distillers with a means to eliminate, or control, the harshness of their spirits products is very gratifying,” said Joana Montenegro, co-founder and Chief Science Officer at Voodoo. “We believe we can enable new innovation in this large global industry and in ways that are truly meaningful to consumers seeking premium experiences. Ginkgo was the best choice of partners for us among the ones we considered because of their unique combination of strong scientific capabilities and a business model that fits an early-stage company like ours.”
Engineering this class of enzyme to operate under the unique conditions required for distilled alcoholic beverages is a great application for Ginkgo Enzyme Services. Improving the functionality of enzymes underpinning critical production processes – making enzymes work better – is an area we’re passionate about because it opens up real business opportunities for our customers, especially as they push into new product development.
To learn more about Ginkgo’s work in this space, join us on June 30th from 10:00 – 11:00 am ET for our Functional Food Proteins with Microbial Expression Systems virtual event.
Reducing the Environmental Footprint of Enzymatic Production of APIs
Today we’re pleased to announce an expansion of our existing partnership with Centrient, the global business-to-business leader in sustainable antibiotics, next-generation statins and anti-fungals. The partnership is aimed at broadening Centrient’s portfolio of environmentally friendly active pharmaceutical ingredients (APIs), following the success of previous work together.
Our ongoing partnership with Centrient focuses on improving the sustainability of fermentation and enzymatic syntheses of beta-lactam antibiotic APIs. In the first phase of this project, we delivered an enzyme with significantly improved efficiency, reducing the environmental footprint of enzymatic production of amoxicillin and cephalexin APIs. These semi-synthetic beta-lactam antibiotics are widely prescribed to both children and adults and are on the World Health Organization’s List of Essential Medicines. Centrient aims to build on these improvements through ongoing strain projects on our platform which focus on reducing carbon emissions and waste production compared to traditional chemical routes.
“Our partnership with Ginkgo is fully aligned with our main purpose: to improve lives through innovative and sustainable manufacturing of medicines,” said Jorge Gil-Martinez, Chief Scientific Officer at Centrient. “The initial success of this collaboration has led us to expand our joint efforts to design new ways of producing essential medicines, minimizing the environmental impact of antibiotic manufacturing. Moreover, as we design and execute our Open Innovation business model, this strategic collaboration creates synergies to accelerate the diversification of our portfolio, a strategic pillar for the future of our company. Access to external disruptive technologies, focusing on enzymes and fermentation, contributes to our vision to be a diversified and integrated partner of choice for generic medicines.”
Our partnership with Centrient, which began in 2021, underscores Ginkgo’s commitment to supporting biopharma companies in bringing much-needed innovation to the field. We are inspired by the early success we’ve already seen in our partnership and look forward to expanding our joint efforts to ultimately support better patient outcomes.
Optimizing enzymes for Zymtronix’s cell-free manufacturing platform
Today we are pleased to announce our partnership with Zymtronix, a developer of cell-free process technologies. Together we aim to optimize enzymes used in Zymtronix’s proprietary cell-free platform for the production of important ingredients in food, agriculture, cosmetics and pharmaceuticals.
Enzymatic biocatalysis is a powerful manufacturing technology that can enable the production of a wide range of chemicals and molecules. Zymtronix’s cell-free platform is designed to solve challenges associated with traditional biocatalysis and seeks to enable the production of a wide range of products with precision and productivity. By partnering with us to build and produce bioengineered market-ready enzymes, Zymtronix anticipates being able to extend its solutions into the pharma, nutrition, and agriculture markets, among others.
Zymtronix aims to leverage Ginkgo Enzyme Services to discover, optimize, and produce enzymes
Ginkgo Enzyme Services offers partners end-to-end support for the discovery, optimization, and production of enzymes for diverse applications. Through the partnership, we will leverage our suite of enzyme services to engineer enzymes for Zymtronix’s applications using metagenomic enzyme discovery as well as improve enzyme expression and production host performance.
We’re thrilled to welcome Zymtronix to the platform and support their applications in sustainable ingredients and beyond. We’ve built out our platform to serve a wide variety of enzyme discovery, engineering, optimization and scale up efforts, and we’re so excited for the work to come in this partnership. Zymtronix’s cell-free biomanufacturing platform is pioneering solutions for various industries, and we’re eager to leverage our end-to-end capabilities and help expand its efforts in transforming the way enzymes are used.
“This partnership will greatly accelerate our work of bringing the precision and scalability of cell-free biomanufacturing and sustainable ingredients to market starting with alternatives to animal sources; Ginkgo is uniquely able to support us with both enzyme engineering and strain expression, helping us continue to accelerate commercialization,” said Stéphane Corgié, CEO-CTO and founder, Zymtronix. “We hope to extend this partnership in the future to facilitate the production of multiple end-market products.”