Launching Ginkgo Datapoints: Transforming AI Model Training in Biology

 

Today, we’re proud to announce the launch of Ginkgo Datapoints to usher in the next era of biotechnology by making the training of AI models easier and more efficient. 

Ginkgo Datapoints specializes in generating large, high-quality biological datasets with fast turnaround, a competitive price per datapoint, and a streamlined deal structure. Ginkgo Datapoints will launch with several data generation products this fall, including protein characterization and functional genomics.

The Functional Genomics product, Ginkgo Datapoints’ flagship offering, is available now. It delivers large-scale perturbation datasets that power our partners’ AI models of cell and disease biology for use in target identification, target validation, and drug discovery. Ginkgo Datapoints addresses some of the industry’s most significant challenges for AI model training: data availability, quality, and uniformity.

The launch of Ginkgo Datapoints marks a significant step forward in our mission to make biology easier to engineer. With Ginkgo Datapoints, we’re passing on our economies of scale, generating large, high-quality datasets for our customers at a price per dataset that makes training biological foundation models feasible. Ginkgo Datapoints is more than a service: it’s a commitment to driving innovation and accelerating the development of new therapies and solutions across the biotech industry.

Jason Kelly, CEO of Ginkgo Bioworks

There’s a burgeoning market of drug and product developers who want to take advantage of AI, and their models are ravenous for data. With Ginkgo Datapoints, we’re answering the call for how to generate the data this new era of life sciences needs. We’re focusing Ginkgo’s massive infrastructure to do biological data generation at AI scale, allowing our Datapoints customers to make bold bets in model training that will meaningfully impact drug discovery for target ID or fields like antibody therapeutics. Our goal is to eliminate the data bottlenecks slowing down AI-driven advancements in biology, and I’m excited about the team and technologies we’ve assembled for it.

John Androsavich, General Manager of Ginkgo Datapoints

Ginkgo Datapoints products are the next evolution of our previously launched Lab Data-as-a-Service offerings. They offer key features of critical importance to customers building AI for biology, from biopharma to techbio to big tech companies. Ginkgo Datapoints’ Functional Genomics product is designed for:

  • Flexibility and Customizability: Customers provide their sequences or library inputs and choose the dataset design that’s right for them, selecting from a number of ready-to-go product parameters. Variables include dataset size, cell lines or primary cells of choice, assay readouts, and data format and labeling preferences. Customers can also consult with Ginkgo’s AI experts to receive design recommendations for their application, customized to whether the datasets will be used to train foundation models or task-specific models, to validate existing models, or to pursue another outcome such as hit discovery.
  • Scalability and Speed: Leveraging our state-of-the-art automation and backend data management infrastructure, Ginkgo Datapoints can routinely deliver millions of data points by screening customer libraries to generate rich, high-quality, neatly compiled high-throughput transcriptome, cell painting, or other omics profiling data in as little as three weeks.
  • Data Quality and Cost Efficiency: Due to the scale economies of Ginkgo’s highly automated lab, Ginkgo Datapoints can offer volume discounts as datasets grow in size, making it feasible to feed data-hungry AI models.
  • Attractive Deal Terms: Customers own the data generated by Ginkgo Datapoints, acquiring it through fee-for-service pricing.

Ginkgo Datapoints embodies our mission of making biology easier to engineer by unlocking the full potential of AI in biology. By generating the data that the industry needs, Ginkgo Datapoints is poised to be a vital resource for researchers and companies leveraging AI and machine learning for drug discovery and beyond.

Unlock AI to unlock Biology. Learn more about Ginkgo Datapoints and how we can accelerate your AI efforts with our cutting-edge data services here.

Introducing Ginkgo’s Model API: A Programmable Interface for Ginkgo’s AI Research

Today, we’re thrilled to announce a bold new step for Ginkgo: the launch of our model API, a powerful tool designed to bring biological AI models directly to machine learning scientists.

Powered by our partnership with Google Cloud, we’ll be making this API publicly available on our website today. Researchers, developers, and enterprise teams will also be able to access the models that power our API on Google’s Model Garden soon.

With this programmer-friendly, ultra-low-cost API, Ginkgo is making our internally developed AI tools available to anyone, and we couldn’t be more excited to begin sharing our work. The interface provides an easy and scalable way to access sophisticated models trained on protein and DNA data, starting with our first release: a machine learning model trained on a proprietary Ginkgo dataset. Read more about our first model, AA-0, a large-scale model trained on 2+ billion proprietary Ginkgo protein sequences, here.

We’re excited to see how the community builds on top of these models and our API to enable a wide range of applications in biology.

While our mission is to make biology easier to engineer, we don’t have a monopoly on interesting and creative uses of language models and other AI innovations in biology. That’s why we’re making these models as affordable and accessible as possible, including our first model trained on a proprietary Ginkgo dataset, so that you can build new applications on top of them today. We’re excited to see users build tools like iterative protein design programs that call our protein generation API, or use our embedding API to compute features for a clustering algorithm.
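As a sketch of that second workflow, the snippet below embeds a handful of sequences and clusters them. The endpoint URL, authentication header, and response field are illustrative assumptions, not the documented API; see the developer portal for the actual interface.

```python
# Minimal sketch: fetch per-sequence embeddings, then cluster them.
# The endpoint URL, auth header, and response field below are hypothetical
# placeholders -- consult the developer portal for the real API contract.
import requests
import numpy as np
from sklearn.cluster import KMeans

API_URL = "https://api.example-ginkgo.dev/v1/embeddings"  # hypothetical
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}        # hypothetical

sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MKQLEDKVEELLSKNYHLENEVARLKKLVGER",
    "MSDNDSIVKQGFAGDDAPRAVFPSIVGRPRHQ",
]

embeddings = []
for seq in sequences:
    resp = requests.post(API_URL, headers=HEADERS,
                         json={"model": "ginkgo-aa0-650M", "sequence": seq})
    resp.raise_for_status()
    embeddings.append(resp.json()["embedding"])  # assumed mean-pooled vector

# Cluster the embedding vectors as features.
labels = KMeans(n_clusters=2, n_init="auto").fit_predict(np.array(embeddings))
print(labels)
```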

This model and future protein LLMs empower companies to generate novel insights and accelerate the discovery of new therapeutics. By harnessing the power of AI to analyze and understand complex protein structures and interactions, researchers and enterprises can streamline their research pipelines, optimize lead identification, and ultimately bring life-saving medicines to market faster and more efficiently. Building on models that learn from Ginkgo’s private data, unavailable to the public, can enable companies to unlock hidden patterns and potential therapeutic targets that would otherwise remain elusive.

This is a new chapter for Ginkgo, and we’re just getting started. As we continue to develop and release more models and services, we’re excited to see how you’ll use these tools to drive innovation in biology. Sign up below to join our community and be the first to know about model releases and new features.

We have a multitude of models under development, spanning machine learning methods like language modeling and diffusion for conditional design. To begin, our first protein language model release will support two use cases:

  • Generation via Masked Language Modeling: given a sequence of amino acids with one or more <mask> tokens, the model will complete the sequence (sketched after this list).
  • Embedding calculation: Calculate the final hidden layer of the trained model to extract valuable representations for downstream tasks. To begin, our model returns the mean-pooled representation across the length axis.
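Here is a minimal sketch of the masked-completion call; the route and the payload and response field names are assumptions for illustration, so check the documentation for the real request shape.

```python
# Minimal sketch of generation via masked language modeling.
# The endpoint and payload/response fields are hypothetical placeholders.
import requests

API_URL = "https://api.example-ginkgo.dev/v1/masked-inference"  # hypothetical
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}              # hypothetical

# Ask the model to fill in three masked positions.
masked = "MKTAYIAK<mask><mask><mask>SFVKSHFSRQLEERLGLIEVQ"

resp = requests.post(API_URL, headers=HEADERS,
                     json={"model": "ginkgo-aa0-650M", "sequence": masked})
resp.raise_for_status()
print(resp.json()["completed_sequence"])  # assumed response field
```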

Over the next year, we’ll roll out more models and expand the API’s capabilities, building a robust suite of tools that will enable you to solve complex problems in drug discovery, synthetic biology, genomics, and more using the latest machine learning methods. Visit our portal to access our model API and explore our first model.

Flexibility is everything. Alongside our first proprietary model, which leverages unique datasets from Ginkgo, you’ll also have access to publicly available models like ESM-2. This means you can explore and experiment with different approaches, all through a single streamlined platform.

We’re committed to making advanced machine learning tools accessible, which is why our API comes with competitive pricing and a free tier. We’ve structured our costs to make it easy for you to jump in, experiment, and get predictions without worrying about high fees. Our initial models will have a free tier, and our introductory pricing will be approximately $0.18 per million tokens. This means for a protein with around 500 amino acids, users should be able to get predictions on 2,000 sequences for roughly 20 cents.
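That estimate is easy to verify with back-of-envelope arithmetic:

```python
# Back-of-envelope check of the quoted introductory pricing.
price_per_million_tokens = 0.18             # USD, introductory rate
tokens = 2_000 * 500                        # 2,000 sequences x ~500 amino acids
cost = tokens / 1_000_000 * price_per_million_tokens
print(f"{tokens:,} tokens -> ${cost:.2f}")  # 1,000,000 tokens -> $0.18
```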

Ready to see what’s possible? Visit our developer portal to access everything you need to start using the API’s free tier, including detailed documentation, tutorials, and sample code. Access the portal today and be among the first to explore our new API and first protein LLM. To get you started, we’re offering 2,000 sequences (i.e., ~1M tokens) of free inference in our initial language model! Just fill out the form below.

AminoAcid-0 (AA-0): A Protein LLM Trained with 2 Billion Proprietary Sequences

Reviewing the design and performance of the first model released for Ginkgo’s AI developer platform

by Seth Ritter and Jake Wintermute




Large Language Models (LLMs), when trained with large collections of protein sequence data, have proven effective for protein engineering tasks including structure prediction [1], functional annotation [2], and generation of diverse enzymes [3]. The biological codebase at Ginkgo Bioworks includes the Unified Metagenomic Database (UMDB), a collection of metagenomic sequence data with more than 2 billion protein sequences, most of which do not appear in public repositories.

Here we introduce AA-0, a 650M-parameter model following the ESM-2 architecture, trained on public data combined with proprietary sequences from the UMDB. We compare the performance of AA-0 to ESM-2 on popular benchmarks as well as a collection of internal benchmarks relevant to our commercial work in the Ginkgo Bioworks foundry.

AA-0 performs comparably to ESM-2 across a range of 235 external and 73 internal protein engineering tasks. Although the UMDB added 112M distinct sequence clusters to the 51M UniRef clusters available for training, the additional data did not result in uniform improvements across all tasks. These results suggest that modern protein LLMs are not limited strictly by the size of their training dataset; reaching the full potential of AI for protein engineering may require more specialized forms of task-specific training data.

Why we built AA-0

Ginkgo’s mission is to make biology easier to engineer. Over the years, we’ve worked with more than 100 commercial partners to support R&D projects ranging from therapeutics and pharmaceutical manufacturing to industrial enzymes and agriculture. 

Like many in biotech, we’re excited about AI-based tools, and we have used them extensively for projects including enzyme discovery and protein engineering.

By releasing AA-0 to the public, we hope to make Ginkgo’s capabilities and resources more accessible to biotechnology developers. We’re excited to see what you’ll build with them!

Accessing the AA-0 model

The AA-0 model API is available through Ginkgo’s AI developer portal. Read more about Ginkgo’s model API here.

The first release supports the common use cases of embedding calculation and generation via masked language modeling. The platform supports calls to both ginkgo-aa0-650M and esm2-650M, so that users can compare their performance as we have done here.
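As a sketch, scoring the same masked sequence against both hosted models could look like this; the endpoint and field names are hypothetical placeholders, as in the earlier examples.

```python
# Minimal sketch: query the same masked sequence against both models.
# Endpoint and payload/response fields are hypothetical placeholders.
import requests

API_URL = "https://api.example-ginkgo.dev/v1/masked-inference"  # hypothetical
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}              # hypothetical
masked = "MKTAYIAK<mask>RQISFVKSHFSRQ"

for model in ("ginkgo-aa0-650M", "esm2-650M"):
    resp = requests.post(API_URL, headers=HEADERS,
                         json={"model": model, "sequence": masked})
    resp.raise_for_status()
    print(model, resp.json()["completed_sequence"])  # assumed field
```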

Users can access a free tier and competitive pricing for larger jobs.

About Ginkgo’s Unified Metagenomic Database (UMDB)

We developed AA-0 using the 2023 UMDB corpus of about 2B protein sequences. The UMDB is derived primarily from microbial DNA extracted from soil samples and sourced from diverse geographic regions. The sequence collection was initially assembled to support R&D projects for our customers, including microbial strain engineering, enzyme discovery, and protein engineering.

Importantly, the UMDB was not created with the primary goal of training a general-purpose protein LLM. The resource is heavily biased toward microbial genomes and includes few sequences from other taxa. One of our goals for creating AA-0 was to better understand how the composition of the training dataset impacts downstream model performance across different protein engineering tasks.

Since 2023, the UMDB has continued to grow and now includes about 3.3B unique protein sequences, spread across 416M clusters at a clustering threshold of 50% sequence identity (SeqID50). Recent additions include public resources like MGnify [4] as well as new proprietary collections of extremophiles and strains relevant to agriculture. Future releases may include models trained with this larger dataset.

Structuring the combined dataset

The AA-0 training dataset was constructed following an approach similar to that described for ESM-2 [1]. We started by collecting the publicly available UniRef50/90 clusters [5] from the September 2021 release. These sequences are clustered at two levels of sequence identity, 50% (SeqID50) and 90% (SeqID90), allowing a hierarchical sampling procedure: sequences are selected first from the larger SeqID50 clusters, then from the smaller SeqID90 clusters, to ensure representative diversity for training.
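In code, that two-level sampling might look like the sketch below; the nested-dict layout of clusters is our own simplification for illustration, not Ginkgo’s production pipeline.

```python
# Sketch of hierarchical sampling: pick a SeqID50 cluster uniformly, then a
# SeqID90 cluster within it, then a member sequence. The nested-dict layout
# is a simplification for illustration only.
import random

clusters = {
    "seqid50_A": {"seqid90_A1": ["MKTAYIAKQR", "MKSAYIAKQR"],
                  "seqid90_A2": ["MQTAYLAKHR"]},
    "seqid50_B": {"seqid90_B1": ["GSHMLEDKVE", "GSYMLEDKVE"]},
}

def sample_sequence(clusters):
    c50 = random.choice(list(clusters))       # uniform over SeqID50 clusters
    c90 = random.choice(list(clusters[c50]))  # then over its SeqID90 clusters
    return random.choice(clusters[c50][c90])  # then over member sequences

print(sample_sequence(clusters))
```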

We added sequences from the UMDB to the UniRef dataset by assigning them, when possible, to existing UniRef90 clusters meeting the 90% identity threshold. Representative sequences were chosen for each cluster and similarly assigned to the existing UniRef50 clusters. When clustering criteria weren’t satisfied, new clusters were spawned to contain the UMDB sequences. Clustering was performed using the easy-linclust workflow of MMseqs2 [6] with 80% coverage.
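For reference, an easy-linclust run at these thresholds looks roughly like the invocation below; the file names are placeholders, and the exact options used in the production pipeline are not specified in this post.

```python
# Sketch: cluster a FASTA file at 50% identity and 80% coverage with
# MMseqs2's easy-linclust workflow, invoked from Python. File names are
# placeholders; the production pipeline's exact options are an assumption.
import subprocess

subprocess.run(
    ["mmseqs", "easy-linclust", "combined.fasta", "clu50", "tmp",
     "--min-seq-id", "0.5", "-c", "0.8"],
    check=True,
)
```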

The clustering process resulted in 172M SeqID50 clusters, a substantial increase from the ~60M found in the original UniRef50. Looking inside the new clusters, we found remarkably little overlap between the public and UMDB sequences (Fig. 1). These results indicate that the combined dataset includes many novel sequences unlike anything used to train previous models. New sequences mean new information and, potentially, new opportunities for AA-0 to learn the patterns that occur in naturally evolved proteins.

Figure 1. Sequence novelty in the UMDB. 65% of protein sequence clusters used to train AA-0 included only sequences from the UMDB, 30% included only UniRef50 sequences, and 5% included sequences from both sources. The low degree of overlap indicates that the UMDB supplied many novel sequences for training.

Selecting a strategy for filtering, sampling and training

We explored a variety of approaches for filtering sequences for quality and sampling them from the combined dataset for training (Table 1). To evaluate the impact of different strategies, we used each to train a smaller, 150M-parameter model, with a 150M-parameter version of ESM-2 as a similarly powered baseline. Two benchmarks were used to evaluate performance: ProteinGym and Owl, our in-house benchmark, which we describe in more detail below. The sampling strategies we tried included:

  • Sequence quality filter. We removed sequences with indications of low quality, for example the inclusion of non-amino-acid characters.
  • Minimum cluster size. We removed SeqID50 clusters containing fewer than the indicated number of sequences, reasoning they might not provide representative data.
  • Samples per cluster. We sampled the indicated number of sequences from each SeqID50 cluster, trading off wider cluster diversity against deeper cluster sampling.
  • Sequence length reweighting. We adjusted sampling to reduce the probability of choosing sequences shorter than the indicated length, which are more likely to represent sequences of lower utility (e.g. short non-structural proteins) or fragments.
  • Single-representative sampling. We sampled only the representative sequences for each SeqID50 cluster as determined by the clustering algorithm, simplifying sampling but losing finer in-cluster variations.
| | ESM-2 150M | Trial 0 | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Sequence quality filtering | n/a | False | True | True | True | True | True |
| SeqID50 min cluster size | n/a | 1 | 1 | 1 | 2 | 100 | 2 |
| Samples per SeqID50 cluster | n/a | 1 | 1 | 1 | 1 | 50 | 1 |
| Sequence length reweighting threshold | n/a | 1 | 1 | 100 | 100 | 100 | 100 |
| Only return cluster representatives | n/a | False | False | False | False | False | True |
| Owl score | 0.204 | 0.173 | 0.161 | 0.185 | 0.223 | **0.240** | 0.231 |
| ProteinGym score | 0.318 | 0.292 | 0.293 | 0.291 | **0.318** | 0.257 | 0.302 |

Table 1. Model comparisons under different filtering and sampling strategies. Performance metrics are reported as a Spearman correlation between model scores and experimental measurements. The top-performing strategy for each benchmark is indicated in bold.

Although no strategy was the unambiguous winner on both benchmarks, we chose the strategy in trial 3 as giving an effective balance of performance. This entailed removing all SeqID50 clusters with only one sequence and introducing a length reweighting threshold of 100 amino acids to sample fewer short sequences. The maximum length for training sequences was set to 512, with random cropping of sequences longer than this length.
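A simplified rendering of that recipe appears below; it is our own reconstruction of trial 3 for illustration, and the exact reweighting function is an assumption.

```python
# Sketch of the trial-3 recipe: drop singleton SeqID50 clusters, down-weight
# sequences shorter than 100 residues, and randomly crop anything longer
# than 512. The linear reweighting function is an assumption.
import random

MAX_LEN = 512
LEN_THRESHOLD = 100

def keep_cluster(cluster):
    return len(cluster) >= 2          # min cluster size of 2

def sample_weight(seq):
    # Down-weight short sequences; one plausible implementation.
    return min(len(seq), LEN_THRESHOLD) / LEN_THRESHOLD

def crop(seq):
    if len(seq) <= MAX_LEN:
        return seq
    start = random.randrange(len(seq) - MAX_LEN + 1)
    return seq[start:start + MAX_LEN]  # random crop of long sequences

def draw_training_sequence(cluster):
    weights = [sample_weight(s) for s in cluster]
    return crop(random.choices(cluster, weights=weights)[0])
```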

AA-0 was trained on an 8×8 configuration of A100 GPUs on Google Cloud Platform. Except as noted below, training followed the guidelines described for ESM-2 [1]. Our hyperparameter search experiments did not find settings that meaningfully improved outcomes. We implemented two primary changes which, in our hands, were essential for reliable training:

  • We used Xavier uniform initialization for the query, key, and value (QKV) weights in the attention layers, with gain set to 1/sqrt(2).
  • We used the AdamW optimizer with lr=4e-4 and weight_decay=1e-5.

Like ESM-2, we used a linear learning rate schedule with 2,000 warmup steps, decaying to 10% of the maximum learning rate over the training duration. Following the sampling and filtering strategy selected above, we trained for 1M steps on the combined dataset, followed by 150k steps of fine-tuning on UniRef50 sequences. We found that this fine-tuning improved performance on some downstream tasks for a select number of targets, as described below.
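In PyTorch terms, these choices map onto something like the sketch below; the tiny placeholder model and the total step count stand in for the real training setup.

```python
# Sketch of the stated training configuration: Xavier-uniform QKV init with
# gain 1/sqrt(2), AdamW (lr=4e-4, weight_decay=1e-5), and linear warmup for
# 2,000 steps followed by linear decay to 10% of the peak learning rate.
# The tiny placeholder model stands in for the real 650M-parameter network.
import math
import torch
from torch import nn

model = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
nn.init.xavier_uniform_(model.self_attn.in_proj_weight, gain=1 / math.sqrt(2))

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=1e-5)

TOTAL_STEPS, WARMUP = 1_000_000, 2_000

def lr_lambda(step):
    if step < WARMUP:
        return step / WARMUP                          # linear warmup
    frac = (step - WARMUP) / (TOTAL_STEPS - WARMUP)
    return 1.0 - 0.9 * frac                           # decay to 10% of peak

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```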

Model evaluation on standard and in-house protein engineering tasks

To evaluate the performance of AA-0, we used the public benchmark collections DGEB [7] and ProteinGym [8]. We were also interested in testing the model against the kinds of protein engineering workflows we encounter at Ginkgo; for this, we used the internally developed Owl benchmark. In the plots below, we compare the performance of three models:

  • ESM-2 refers to esm2_t33_650M_UR50D, the model documented here and in the original paper [1].
  • AA-0-base indicates ginkgo-aa-0-650m, the model trained on the combined dataset including our UMDB sequences.
  • AA-0 is ginkgo-aa-0-650m-finetune-UR50-150k, in which AA-0-base underwent 150k additional steps of fine-tuning with sequences from UniRef50.

The Diverse Genomic Embedding Benchmark (DGEB), composed by TattaBio, is a collection of tasks that make use of the embeddings from a protein sequence encoder model, for example using pooled representations to search a sequence collection for similar proteins.

Figure 2. Comparison of model performance using DGEB. The tasks on the left belong to six types: BiGene Mining, Evolutionary Distance Similarity (EDS), Classification, Pair Classification, Clustering and Retrieval. The reported scoring metric varies by task type, with higher scores representing better performance.

ProteinGym is a collection of benchmarks that challenge a model to predict the effect of mutations on the measured function of a protein sequence [8]. We focused on the collection of protein substitution variants created with Deep Mutational Scanning (DMS). The 217 assays were grouped into five categories: organismal fitness, enzyme activity, protein binding, protein expression, and protein stability. The distribution of scores within each category gives an overview of each model’s performance.

Broadly speaking, the AA-0 and ESM-2 models performed comparably (Fig. 3). When examining the medians of the distributions, AA-0 was marginally better at tasks relating to predicting protein stability and marginally worse at predicting enzyme activity (though there is high overlap in the performance distributions). Tasks related to protein binding were challenging for both models, highlighting the difficulty of predicting interactions from sequence data.

Figure 3. Comparison of model performance using ProteinGym. The indicated models were used to score collections of protein sequences representing DMS substitutions. For each collection, performance is reported as a Spearman correlation between the model-derived score and the measured activity. 

The 217 assays are grouped into five categories by the type of property being measured. Box plots indicate the mean score for each category, as well as standard deviations and outliers.

The Owl benchmark, named for our in-house protein design software suite, was developed at Ginkgo to reflect tasks relevant for our work in commercial protein engineering. AI-guided protein discovery uses the model as an embedder to identify functionally similar proteins. Protein engineering is aided by scoring potential sequence variations that may be functionally relevant.

Owl includes 73 collections of protein sequence variants, each labeled with a functional measurement performed during the course of a real customer program. Examples of functional measurements include enzyme activity, specificity, and expression titer. As above, we report model performance as a Spearman correlation between model scores and empirical measurements, grouping scores into categories to provide a high-level overview (Fig. 4).
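Concretely, each reported benchmark score reduces to a rank correlation of this form (toy numbers for illustration):

```python
# Toy illustration of the reported metric: Spearman rank correlation between
# model-derived scores and experimental measurements for the same variants.
from scipy.stats import spearmanr

model_scores = [-1.2, 0.4, 0.9, -0.3, 1.5]     # e.g. masked-LM log-likelihoods
measurements = [0.10, 0.35, 0.80, 0.20, 0.95]  # e.g. measured enzyme activity

rho, pval = spearmanr(model_scores, measurements)
print(f"Spearman rho = {rho:.2f}")
```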

Figure 4. Comparison of model performance using Ginkgo’s Owl benchmark. The indicated models were used to score collections of engineered protein sequences. For each collection, performance is reported as a Spearman correlation between the model-derived score and the measured activity. 

The 73 assays are grouped into three categories by the type of property being measured. Box plots indicate the mean score for each category, as well as standard deviations and outliers.

Overall, we find roughly comparable results between the different models. Interestingly, we find many examples of a negative correlation between model scores and experimental outcome, particularly for the use case of predicting enzyme specificity.

Why might enzymes with improved specificity tend to have lower model-derived scores? The datasets collected for the Owl benchmark come from different kinds of enzymes being engineered for different functional goals, making generalizations difficult. But this result might indicate important differences between the kinds of sequences that result from natural evolution and those that result from protein engineering. For example, an enzyme engineering project might seek to focus an enzyme’s activity on a particular target that is disfavored in a natural context. If evolution and engineering tend to move sequences in different directions, model-derived scores might negatively correlate with actual measured performance.

Fine-tuning improves performance on viral sequences

The UMDB does not represent a uniform sample of all naturally evolved protein sequences. It is primarily a collection of microbial DNA extracted from soil. As we explored AA-0, we were interested in how this bias in the training data might impact its performance.

The ProteinGym benchmark assays include proteins sourced from humans, other eukaryotes, prokaryotes and viruses. Breaking out the performance of AA-0 by taxon, we found substantially weaker performance on viral proteins (Fig. 5). We suspect this is a result of viral sequences being poorly represented in our training data. Viral sequences are particularly diverse, fast-evolving, and often unlike proteins found in cellular life forms. This result emphasizes the importance of learning from viral sequences directly to be able to model them accurately.

Performance on viral sequences improved markedly following 150k steps of additional fine-tuning with the UniRef50 sequences. This improvement motivated us to include the UniRef50 fine-tuning in the model now available through the Ginkgo AI developer portal.

Figure 5. Model performance by taxon. The 217 assays of the ProteinGym DMS collection are grouped by taxon of origin: Human, non-human Eukaryote, Prokaryote, or Virus. For each assay, performance is reported as a Spearman correlation between the model-derived score and the measured activity. Box plots indicate the mean score for each category, as well as standard deviations and outliers.

Conclusions

What drives the performance of an LLM? In different contexts, AI researchers have identified model size, training data, and compute as fundamental resources that govern a model’s scaling behavior [9]. Here we investigated the impact of training data on the performance of a protein sequence LLM. We supplemented the ~60M UniRef50 sequence clusters used to train ESM-2 with an additional 112M clusters from the Ginkgo UMDB. The resulting model, AA-0, showed comparable performance across a range of benchmarking tasks, indicating that training data alone was not a limiting resource.

Our experience with AA-0 holds lessons for the development of AI models for applied protein engineering:

The importance of data quality. In preparing AA-0 we explored a variety of strategies for filtering and sampling sequences from the very large UMDB. The selected strategy significantly impacted model performance, suggesting that further exploration in this area might lead to continued improvements. DNA sequencing technology is advancing quickly, leading to exponential growth in datasets and rapid proliferation in data collection techniques. Sequence-based AI models will benefit from standardized and optimized approaches to curate all this data.

The value of data representation. We found the AA-0-base model performed poorly on viral sequences, probably because they were sparsely represented in its training data. This weakness was partially corrected by additional fine-tuning with UniRef50 sequences, and could also be improved by curating more representative datasets for future models.

The particular challenges of protein engineering. AA-0 performed well when predicting enzyme activity, a common task in the Ginkgo foundry. Interestingly, the model struggled to predict enzyme specificity, often producing scores that were negatively correlated with measured outcome. This suggests that engineered proteins may include sequence features unlike the evolved proteins used for model training. Future models may require new datasets that capture the features of successful engineered proteins, or may need other strategies to accommodate protein engineering as a use case.

The need for more task-specific data. In commercial protein engineering projects at the Ginkgo foundry, LLMs are not used to generate functional proteins de novo. Instead, libraries of generated sequences are built and tested for a particular desired function. These results from assay-labeled libraries become training data for additional rounds of AI-guided engineering, leading to performance improvements greater than those achieved with sequence-based models alone. Future models will benefit from new datasets assay-labeled for functional outcomes of interest including substrate affinity, enzyme specificity, and expression in particular microbial hosts.

AI can make biology easier to engineer. This is the first of many intended releases from the Ginkgo AI team. We are excited to begin peeling back the curtain and enabling bioengineers across the world to access our technologies. As we scale up our training efforts (we are currently training models 10x larger than these and more!), we will be eager to share our findings and plan to make the resultant models available to the community.




References

1. Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123-1130. doi:10.1126/science.ade2574
2. Yu T, Cui H, Li JC, Luo Y, Jiang G, Zhao H. Enzyme function prediction using contrastive learning. Science. 2023;379(6639):1358-1363. doi:10.1126/science.adf2465
3. Ruffolo JA, Nayfach S, Gallagher J, et al. Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences. bioRxiv 2024.04.22.590591. doi:10.1101/2024.04.22.590591
4. Richardson L, Allen B, Baldi G, et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Research. 2023;51(D1):D753-D759. doi:10.1093/nar/gkac1080
5. Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31(6):926-932. doi:10.1093/bioinformatics/btu739
6. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35(11):1026-1028. doi:10.1038/nbt.3988
7. West-Roberts J, Kravitz J, Jha N, Cornman A, Hwang Y. Diverse Genomic Embedding Benchmark for functional evaluation across the tree of life. bioRxiv 2024.07.10.602933. doi:10.1101/2024.07.10.602933
8. Notin P, Kollasch AW, Ritter D, et al. ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design. bioRxiv 2023.12.07.570727. doi:10.1101/2023.12.07.570727
9. Hoffmann J, Borgeaud S, Mensch A, et al. Training Compute-Optimal Large Language Models. arXiv:2203.15556. doi:10.48550/arXiv.2203.15556

Acknowledgements

Thanks to the Ginkgo protein engineers, software developers and AI experts who helped to build AA-0: Zachary Kurtz, Matt Chamberlin, Eric Danielson, Alex Carlin, Michal Jastrzebski, Dana Merrick, Dmitriy Ryaboy, Emily Wrenbeck & Ankit Gupta.


Request Free Inference Tokens for Ginkgo’s Model API

To get you started exploring our recently announced model API, we’re offering 2,000 sequences of free inference.

Ready to see what’s possible? Visit our developer portal to create an account and access everything you need to start using the API’s free tier, including detailed documentation, tutorials, and sample code. Access the portal today and be among the first to explore our new API and first protein LLM.

Fill out the form below to request 2,000 sequences (i.e. ~1M tokens) of free inference in our initial language model!