AminoAcid-0 (AA-0): A Protein LLM Trained with 2 Billion Proprietary Sequences

Reviewing the design and performance of the first model released for
Ginkgo’s AI developer platform

by Seth Ritter and Jake Wintermute



This is a new chapter for Ginkgo, and we’re just getting started. As we continue to develop and release more models and services, we’re excited to see how you’ll use these tools to drive innovation in biology. 


Large Language Models (LLMs), when trained with large collections of protein sequence data, have proven effective for protein engineering tasks including structure prediction [1], functional annotation [2], and generation of diverse enzymes [3]. The biological codebase at Ginkgo Bioworks includes the Unified Metagenomic Database (UMDB), a collection of metagenomic sequence data with more than 2 billion protein sequences, most of which do not appear in public repositories.

Here we introduce AA-0, a 650M parameter model following the ESM-2 architecture, trained on public data combined with proprietary sequences from the UMDB. We compare the performance of AA-0 to ESM-2 on popular benchmarks as well as a collection of internal benchmarks relevant to our commercial work in the Ginkgo Bioworks foundry.

AA-0 performs comparably to ESM-2 across a range of 235 external and 73 internal protein engineering tasks. Although the UMDB added 112M distinct sequence clusters to the 51M UniRef clusters available for training, the additional data did not result in uniform improvements across all tasks. These results suggest that modern protein LLMs are not limited strictly by the size of their training dataset. Reaching the full potential of AI for protein engineering may require more specialized forms of task-specific training data.

Why we built AA-0

Ginkgo’s mission is to make biology easier to engineer. Over the years, we’ve worked with more than 100 commercial partners to support R&D projects ranging from therapeutics and pharmaceutical manufacturing to industrial enzymes and agriculture. 

Like many in biotech, we’re excited about AI-based tools, and we have used them extensively in projects including enzyme discovery and protein engineering.

By releasing AA-0 to the public, we hope to make Ginkgo’s capabilities and resources more accessible to biotechnology developers. We’re excited to see what you’ll build with them!

Accessing the AA-0 model

The AA-0 model API is available through Ginkgo’s AI developer portal. Read more about Ginkgo’s model API here.

The first release supports the common use cases of embedding calculation and generation via masked language modeling. The platform supports calls to both ginkgo-aa0-650M and esm2-650M, so that users can compare their performance as we have done here.
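As a rough illustration, an embedding request might look like the snippet below. The endpoint URL, field names, and response format are assumptions for illustration only; the developer portal documentation describes the actual interface.

```python
# Hypothetical sketch of an embedding request to the AA-0 model API.
# The endpoint, field names, and response format are illustrative assumptions;
# see the developer portal documentation for the real interface.
import os
import requests

API_URL = "https://api.ginkgobioworks.ai/v1/embeddings"  # placeholder URL
API_KEY = os.environ["GINKGO_API_KEY"]                    # placeholder auth scheme

def get_embedding(sequence: str, model: str = "ginkgo-aa0-650M"):
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "sequence": sequence},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()

# Comparing AA-0 and ESM-2 on the same sequence, as done in the benchmarks below.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
aa0_emb = get_embedding(seq, model="ginkgo-aa0-650M")
esm_emb = get_embedding(seq, model="esm2-650M")
```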

Users can access a free tier and competitive pricing for larger jobs.

About Ginkgo’s Unified Metagenomic Database (UMDB)

We developed AA-0 using the 2023 UMDB corpus of about 2B protein sequences. The UMDB is derived primarily from microbial DNA extracted from soil samples and sourced from diverse geographic regions. The sequence collection was initially assembled to support R&D projects for our customers including microbial strain engineering, enzyme discovery and protein engineering.

Importantly, the UMDB was not created with the primary goal of training a general-purpose protein LLM. The resource is heavily biased toward microbial genomes and includes few sequences from other taxa. One of our goals for creating AA-0 was to better understand how the composition of the training dataset impacts downstream model performance across different protein engineering tasks.

Since 2023, the UMDB has continued to grow and now includes about 3.3B unique protein sequences, spread across 416M clusters at a clustering threshold of 50% sequence identity (SeqID50). Recent additions include public resources like MGnify [4] as well as new proprietary collections of extremophiles and strains relevant to agriculture. Future releases may include models trained with this larger dataset.

Structuring the combined dataset

The AA-0 training dataset was constructed following an approach similar to that described for ESM-2 [1]. We started by collecting the publicly available UniRef50/90 clusters [5] from the September 2021 release. These sequences are clustered at two levels of sequence identity, 50% (SeqID50) and 90% (SeqID90), allowing a hierarchical sampling procedure: sequences are selected first from the larger SeqID50 clusters, then from the smaller SeqID90 clusters within them, to ensure representative diversity for training.
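As a concrete illustration, the hierarchical sampling can be sketched as follows, assuming clusters are stored as a nested mapping from each SeqID50 cluster to its member SeqID90 clusters and their sequences (the data layout is illustrative, not our production format):

```python
# Minimal sketch of hierarchical sampling: pick a SeqID50 cluster uniformly,
# then a SeqID90 cluster inside it, then a sequence from that cluster.
# The nested-dict layout {seqid50_id: {seqid90_id: [sequences]}} is illustrative.
import random

def sample_sequence(clusters: dict) -> str:
    seqid50_id = random.choice(list(clusters))            # uniform over SeqID50 clusters
    seqid90_members = clusters[seqid50_id]
    seqid90_id = random.choice(list(seqid90_members))      # uniform over nested SeqID90 clusters
    return random.choice(seqid90_members[seqid90_id])      # uniform over member sequences
```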

We added sequences from the UMDB to the UniRef dataset by assigning them, when possible, to existing UniRef90 clusters meeting the 90% identity threshold. Representative sequences were chosen for each cluster and similarly assigned to the existing UniRef50 clusters. When clustering criteria weren’t satisfied, new clusters were spawned to contain the UMDB sequences. Clustering was performed using the easy-linclust workflow of MMseqs2 [6] with 80% coverage.
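For reference, a de novo clustering run at the 50% identity and 80% coverage thresholds might be invoked roughly as follows. The file paths are placeholders, and the assignment of UMDB sequences to existing UniRef clusters is a separate step not shown here.

```python
# Sketch of an MMseqs2 easy-linclust run at 50% identity and 80% coverage.
# Input/output paths are placeholders; assigning sequences to existing UniRef
# clusters is a separate step not shown in this snippet.
import subprocess

subprocess.run(
    [
        "mmseqs", "easy-linclust",
        "combined_sequences.fasta",   # placeholder input FASTA
        "seqid50_clusters",           # placeholder output prefix
        "tmp",                        # scratch directory
        "--min-seq-id", "0.5",
        "-c", "0.8",
    ],
    check=True,
)
```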

The clustering process resulted in 172M SeqID50 clusters, a substantial increase from the ~60M found in the original UniRef50. Looking inside the new clusters, we found remarkably little overlap between the public and UMDB sequences (Fig. 1). These results indicate that the combined dataset includes many novel sequences unlike anything used to train previous models. New sequences mean new information and, potentially, new opportunities for AA-0 to learn the patterns that occur in naturally evolved proteins.

Figure 1. Sequence novelty in the UMDB. 65% of protein sequence clusters used to train AA-0 included only sequences from the UMDB, 30% included only UniRef50 sequences, and 5% included sequences from both sources. The low degree of overlap indicates that the UMDB supplied many novel sequences for training.

Selecting a strategy for filtering, sampling and training

We explored a variety of approaches for filtering sequences by quality and sampling them from the combined dataset for training (Table 1). To evaluate the impact of each strategy, we used it to train a smaller model of 150M parameters, with a 150M-parameter version of ESM-2 providing a similarly powered baseline. Two benchmarks were used to evaluate performance: ProteinGym and Owl, our in-house benchmark, both described in more detail below. The sampling strategies we tried, sketched in code after Table 1, included:

  • Sequence quality filter. We removed sequences with indications of low quality, for example the inclusion of non-amino-acid characters.
  • Minimum cluster size. We removed SeqID50 clusters containing fewer than the indicated number of sequences, reasoning they might not provide representative data.
  • Samples per cluster. We sampled either 1 or the indicated number of sequences from each SeqID50 cluster, trading off wider cluster diversity for deeper cluster sampling.
  • Sequence length reweighting. We adjusted sampling to reduce the probability of choosing sequences shorter than the indicated length, which are more likely to represent sequences of lower utility (e.g. short non-structural proteins) or fragments.
  • Single-representative sampling. We sampled only the representative sequences for each SeqID50 cluster as determined by the clustering algorithm, simplifying sampling but losing finer in-cluster variations.
                                        ESM-2 150M   Trial 0   Trial 1   Trial 2   Trial 3   Trial 4   Trial 5
Sequence quality filtering              n/a          False     True      True      True      True      True
SeqID50 min cluster size                n/a          1         1         1         2         100       2
Samples per SeqID50 cluster             n/a          1         1         1         1         50        1
Sequence length reweighting threshold   n/a          1         1         100       100       100       100
Only return cluster representatives     n/a          False     False     False     False     False     True
Owl Score                               0.204        0.173     0.161     0.185     0.223     0.240*    0.231
ProteinGym Score                        0.318        0.292     0.293     0.291     0.318*    0.257     0.302

Table 1. Model comparisons under different filtering and sampling strategies. Performance metrics are reported as the Spearman correlation between model scores and experimental measurements. The top-performing strategy for each benchmark is marked with an asterisk.
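To make these strategies concrete, the sketch below shows how the quality filter, minimum cluster size, and length reweighting might be applied when drawing training samples, using the trial 3 thresholds. The cluster layout and the down-weighting factor for short sequences are illustrative assumptions, not our exact pipeline.

```python
# Sketch of the filtering and reweighted sampling strategy (trial 3 settings):
# drop sequences with non-amino-acid characters, drop singleton SeqID50 clusters,
# and down-weight sequences shorter than 100 residues when sampling.
import random

VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")

def passes_quality_filter(seq: str) -> bool:
    """Reject sequences containing non-amino-acid characters."""
    return set(seq) <= VALID_AA

def filter_clusters(clusters: dict, min_cluster_size: int = 2) -> dict:
    """Keep only SeqID50 clusters with at least min_cluster_size quality-filtered members."""
    filtered = {}
    for cluster_id, seqs in clusters.items():
        kept = [s for s in seqs if passes_quality_filter(s)]
        if len(kept) >= min_cluster_size:
            filtered[cluster_id] = kept
    return filtered

def sample_with_length_reweighting(seqs, length_threshold: int = 100, short_weight: float = 0.1):
    """Sample one sequence, down-weighting those shorter than the threshold.
    The down-weighting factor (0.1) is an illustrative assumption."""
    weights = [short_weight if len(s) < length_threshold else 1.0 for s in seqs]
    return random.choices(seqs, weights=weights, k=1)[0]
```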

Although no strategy was the unambiguous winner for both benchmarks, we chose the strategy in trial 3 as giving an effective balance of performance. This entailed removing all SeqID50 clusters containing only one sequence and introducing a length reweighting threshold of 100 residues to sample fewer short sequences. The maximum length for training sequences was set to 512, with random cropping of sequences longer than this length.
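A minimal sketch of the cropping step, assuming sequences are handled as plain strings:

```python
# Sketch of random cropping: sequences longer than 512 residues are cropped to a
# random contiguous window of length 512 before tokenization.
import random

MAX_LEN = 512

def random_crop(seq: str, max_len: int = MAX_LEN) -> str:
    if len(seq) <= max_len:
        return seq
    start = random.randint(0, len(seq) - max_len)
    return seq[start:start + max_len]
```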

AA-0 was trained on an 8×8 configuration of A100 GPUs on Google Cloud Platform. Except as noted below, training followed the guidelines described for ESM-2 [1]. Hyperparameter search experiments did not identify settings that meaningfully improved outcomes. We implemented two primary changes which, in our hands, were essential for reliable training (sketched in code after this list):

  • We made use of Xavier uniform initializations for KVQ weights in the attention layers with gain set to 1/sqrt(2).
  • We used the AdamW optimizer with settings lr=4e-4, weight_decay=1e-5. 
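In PyTorch terms, these two changes might look roughly like the following; the module attribute names (q_proj, k_proj, v_proj) are illustrative stand-ins for an ESM-2-style attention layer, not the exact names in our codebase.

```python
# Sketch of the two training changes described above: Xavier-uniform initialization
# of the key/query/value projections with gain 1/sqrt(2), and the AdamW settings.
import math
import torch
import torch.nn as nn

def init_attention_weights(model: nn.Module) -> None:
    """Apply Xavier-uniform init with gain 1/sqrt(2) to Q/K/V projection layers."""
    gain = 1.0 / math.sqrt(2.0)
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and any(
            key in name for key in ("q_proj", "k_proj", "v_proj")   # illustrative names
        ):
            nn.init.xavier_uniform_(module.weight, gain=gain)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

def make_optimizer(model: nn.Module) -> torch.optim.AdamW:
    """AdamW with the settings used for AA-0 training."""
    return torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=1e-5)
```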

Like ESM-2, we used a linear learning rate scheduler with 2000 warmup steps, decaying to 10% of the maximum learning rate over the training duration. Following the sampling and filtering strategy selected above, we trained for 1M steps on the combined dataset followed by 150k steps of fine-tuning on UniRef50 sequences. We found that this fine-tuning improved performance on a subset of downstream tasks, as described below.
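A minimal sketch of that schedule, assuming the optimizer from the previous snippet and an illustrative total step count:

```python
# Sketch of the learning-rate schedule: linear warmup over 2,000 steps, then
# linear decay to 10% of the peak rate by the end of training.
from torch.optim.lr_scheduler import LambdaLR

def make_scheduler(optimizer, warmup_steps: int = 2_000, total_steps: int = 1_000_000):
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)                       # 0 -> 1 during warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return max(0.1, 1.0 - 0.9 * progress)                        # 1 -> 0.1 over training
    return LambdaLR(optimizer, lr_lambda)
```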

Model evaluation on standard and in-house protein engineering tasks

To evaluate the performance of AA-0, we made use of the public benchmark collections DGEB [7] and ProteinGym [8]. We were also interested in testing the model specifically against the kinds of protein engineering workflows that we encounter at Ginkgo. For this, we used the internally developed Owl benchmark. In the plots below, we compare the performance of three models.

  • ESM-2 refers to esm2_t33_650M_UR50D, the model documented here and in the original paper [1].
  • AA-0-base indicates ginkgo-aa-0-650m, the model trained on the combined dataset including our UMDB sequences.
  • AA-0 is ginkgo-aa-0-650m-finetune-UR50-150k, in which AA-0-base underwent an additional 150k steps of fine-tuning with sequences from UniRef50.

The Diverse Genomic Embedding Benchmark (DGEB), composed by TattaBio, is a collection of tasks that make use of the embeddings from a protein sequence encoder model, for example using pooled representations to search a sequence collection for similar proteins.
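To illustrate that use case, here is a minimal sketch of mean-pooled embedding retrieval. The array shapes assume per-residue embeddings have already been computed with a model such as AA-0 or ESM-2; the function names are illustrative.

```python
# Sketch of embedding-based retrieval: mean-pool per-residue embeddings into one
# vector per protein, then rank a library by cosine similarity to a query.
import numpy as np

def mean_pool(residue_embeddings: np.ndarray) -> np.ndarray:
    """Collapse an (L, d) array of per-residue embeddings into a single (d,) vector."""
    return residue_embeddings.mean(axis=0)

def most_similar(query_vec: np.ndarray, library_vecs: np.ndarray, top_k: int = 5):
    """Return indices and scores of the top_k library proteins by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    lib = library_vecs / np.linalg.norm(library_vecs, axis=1, keepdims=True)
    scores = lib @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]
```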

Figure 2. Comparison of model performance using DGEB. The tasks on the left belong to six types: BiGene Mining, Evolutionary Distance Similarity (EDS), Classification, Pair Classification, Clustering and Retrieval. The reported scoring metric varies by task type, with higher scores representing better performance.

ProteinGym is a collection of benchmarks that challenge a model to predict the effect of mutations on the measured function of a protein sequence [8]. We focused on the collections of protein substitution variants created with Deep Mutational Scanning (DMS). The 217 total assays were collected into five assay categories: organismal fitness, enzyme activity, protein binding, protein expression and protein stability. The distribution of scores within each category gives an overview of the performance of each model.
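For context, a common way to score substitution variants with a masked language model is the masked-marginal log-odds of the mutant versus wild-type residue at the masked position; performance is then reported as the Spearman correlation with the measured activity. The sketch below illustrates that recipe, with predict_masked_logprobs() standing in as a hypothetical model call; treat it as one plausible scoring scheme rather than a description of our exact benchmark code.

```python
# Sketch of masked-marginal variant scoring and Spearman evaluation.
# predict_masked_logprobs(seq, pos) is a hypothetical model call that returns
# a dict of log-probabilities over amino acids at a masked position.
from scipy.stats import spearmanr

def score_variant(wt_seq: str, pos: int, mut_aa: str, predict_masked_logprobs) -> float:
    """Log-odds of the mutant vs. wild-type residue at a masked position."""
    logprobs = predict_masked_logprobs(wt_seq, pos)
    return logprobs[mut_aa] - logprobs[wt_seq[pos]]

def benchmark(variants, measurements, wt_seq, predict_masked_logprobs) -> float:
    """Spearman correlation between model scores and experimental measurements.
    variants is a list of (position, mutant_amino_acid) pairs."""
    scores = [score_variant(wt_seq, pos, aa, predict_masked_logprobs) for pos, aa in variants]
    return spearmanr(scores, measurements).correlation
```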

Broadly speaking, the AA-0 and ESM-2 models performed comparably (Fig. 3). When examining the medians of the distributions, AA-0 was marginally better at tasks relating to predicting protein stability and marginally worse at predicting enzyme activity (though there is high overlap in the performance distributions). Tasks related to protein binding were challenging for both models, highlighting the difficulty of predicting interactions from sequence data.

Figure 3. Comparison of model performance using ProteinGym. The indicated models were used to score collections of protein sequences representing DMS substitutions. For each collection, performance is reported as a Spearman correlation between the model-derived score and the measured activity. 

The 217 assays are grouped into five categories by the type of property being measured. Box plots indicate the mean score for each category, as well as standard deviations and outliers.

The Owl benchmark, named for our in-house protein design software suite, was developed at Ginkgo to reflect tasks relevant to our work in commercial protein engineering. In AI-guided protein discovery, the model is used as an embedder to identify functionally similar proteins; in protein engineering, it helps by scoring candidate sequence variants that may be functionally relevant.

Owl includes 73 collections of protein sequence variants, each labeled with a functional measurement performed during the course of a real customer program. Examples of functional measurements include enzyme activity, specificity or expression titer. As above, we report model performance as a Spearman correlation between model scores and empirical measurements, grouping scores into categories to provide a high-level overview (Fig. 4).

Figure 4. Comparison of model performance using Ginkgo’s Owl benchmark. The indicated models were used to score collections of engineered protein sequences. For each collection, performance is reported as a Spearman correlation between the model-derived score and the measured activity. 

The 73 assays are grouped into three categories by the type of property being measured. Box plots indicate the mean score for each category, as well as standard deviations and outliers.

Overall, we find roughly comparable results between the different models. Interestingly, we find many examples of a negative correlation between model scores and experimental outcome, particularly for the use case of predicting enzyme specificity.

Why might enzymes with improved specificity tend to have lower model-derived scores? The datasets collected for the Owl benchmark come from different kinds of enzymes being engineered toward different functional goals, making generalizations difficult. But this result might indicate important differences in the kinds of sequences that result from natural evolution and protein engineering. For example, an enzyme engineering project might seek to focus an enzyme activity on a particular target that is disfavored in a natural context. If evolution and engineering tend to move sequences in different directions, model-derived scores might negatively correlate with actual measured performance.

Fine-tuning improves performance on viral sequences

The UMDB does not represent a uniform sample of all naturally evolved protein sequences. It is primarily a collection of microbial DNA extracted from soil. As we explored AA-0, we were interested in how this bias in the training data might impact its performance.

The ProteinGym benchmark assays include proteins sourced from humans, other eukaryotes, prokaryotes and viruses. Breaking out the performance of AA-0 by taxon, we found substantially weaker performance on viral proteins (Fig. 5). We suspect this is a result of viral sequences being poorly represented in our training data. Viral sequences are particularly diverse, fast-evolving, and often unlike proteins found in cellular life forms. This result emphasizes the importance of learning from viral sequences directly to be able to model them accurately.

Performance on viral sequences improved markedly following 150k steps of additional fine-tuning with the UniRef50 sequences. This improvement motivated us to include the UniRef50 fine-tuning in the model now available through the Ginkgo AI developer portal.

Figure 5. Model performance by taxon. The 217 assays of the ProteinGym ESM collection are grouped by taxon of origin: Human, non-human Eukaryote, Prokaryote or Virus. For each assay, performance is reported as a Spearman correlation between the model-derived score and the measured activity. Box plots indicate the mean score for each category, as well as standard deviations and outliers.

Conclusions

What drives the performance of an LLM? In different contexts, AI researchers have identified model size, training data, and compute as fundamental resources that govern a model’s scaling behavior [9]. Here we investigated the impact of training data on the performance of a protein sequence LLM. We supplemented the ~60M UniRef50 sequence clusters used to train ESM-2 with an additional 112M clusters from the Ginkgo UMDB. The resulting model, AA-0, showed comparable performance across a range of benchmarking tasks, indicating that training data alone was not a limiting resource.

Our experience with AA-0 holds lessons for the development of AI models for applied protein engineering:

The importance of data quality. In preparing AA-0 we explored a variety of strategies for filtering and sampling sequences from the very large UMDB. The selected strategy significantly impacted model performance, suggesting that further exploration in this area might lead to continued improvements. DNA sequencing technology is advancing quickly, leading to exponential growth in datasets and rapid proliferation in data collection techniques. Sequence-based AI models will benefit from standardized and optimized approaches to curate all this data.

The value of data representation. We found the AA-0-base model performed poorly on viral sequences, probably because they were sparsely represented in its training data. This weakness was partially corrected by additional fine-tuning with UniRef50 sequences, and could also be improved by curating more representative datasets for future models.

The particular challenges of protein engineering. AA-0 performed well when predicting enzyme activity, a common task in the Ginkgo foundry. Interestingly, the model struggled to predict enzyme specificity, often producing scores that were negatively correlated with measured outcome. This suggests that engineered proteins may include sequence features unlike the evolved proteins used for model training. Future models may require new datasets that capture the features of successful engineered proteins, or may need other strategies to accommodate protein engineering as a use case.

The need for more task-specific data. In commercial protein engineering projects at the Ginkgo foundry, LLMs are not used to generate functional proteins de novo. Instead, libraries of generated sequences are built and tested for a particular desired function. These results from assay-labeled libraries become training data for additional rounds of AI-guided engineering, leading to performance improvements greater than those achieved with sequence-based models alone. Future models will benefit from new datasets assay-labeled for functional outcomes of interest including substrate affinity, enzyme specificity, and expression in particular microbial hosts.

AI can make biology easier to engineer. This is the first of many intended releases from the Ginkgo AI team. We are excited to begin peeling back the curtain and enabling bioengineers across the world to access our technologies. As we scale up our training efforts (we are currently training models 10x larger than these and more!), we will be eager to share our findings and plan to make the resultant models available to the community.


Ready to see what’s possible? Visit our developer portal to access everything you need to start using the API’s free tier, including detailed documentation, tutorials, and sample code. To get you started, we’re offering 2,000 sequences (~1M tokens) of free inference with our initial language model.


References

1. Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123-1130. doi:10.1126/science.ade2574
2. Yu T, Cui H, Li JC, Luo Y, Jiang G, Zhao H. Enzyme function prediction using contrastive learning. Science. 2023;379(6639):1358-1363. doi:10.1126/science.adf2465
3. Ruffolo JA, Nayfach S, Gallagher J, et al. Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences. bioRxiv 2024.04.22.590591. doi:10.1101/2024.04.22.590591
4. Richardson L, Allen B, Baldi G, et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Research. 2023;51(D1):D753-D759. doi:10.1093/nar/gkac1080
5. Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31(6):926-932. doi:10.1093/bioinformatics/btu739
6. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35(11):1026-1028. doi:10.1038/nbt.3988
7. West-Roberts J, Kravitz J, Jha N, Cornman A, Hwang Y. Diverse Genomic Embedding Benchmark for functional evaluation across the tree of life. bioRxiv 2024.07.10.602933. doi:10.1101/2024.07.10.602933
8. Notin P, Kollasch AW, Ritter D, et al. ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design. bioRxiv 2023.12.07.570727. doi:10.1101/2023.12.07.570727
9. Hoffmann J, Borgeaud S, Mensch A, et al. Training Compute-Optimal Large Language Models. arXiv:2203.15556. doi:10.48550/arXiv.2203.15556

Acknowledgements

Thanks to the Ginkgo protein engineers, software developers and AI experts who helped to build AA-0: Zachary Kurtz, Matt Chamberlin, Eric Danielson, Alex Carlin, Michal Jastrzebski, Dana Merrick, Dmitriy Ryaboy, Emily Wrenbeck & Ankit Gupta.

