Annotate article

7/26/2023

Clustering huge protein sequence sets in linear time. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. An interactive notebook showing use of the trained models to produce Pfam class predictions as well as embeddings is available on GitHub at.

Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.

The TensorFlow API, specifically tensorflow-gpu v.1.15.4, was used to implement and train all deep models using the architectures described in the Methods. Code that documents model training using Python v.3.7 is available on GitHub at. The training and validation datasets used for creating each model are available as described in the preceding section. Trained models are available in Google Cloud Storage at, including the ensembles trained on the Pfam-seed random split, Pfam-seed clustered split, Pfam-full random split (all Pfam v.32.0) and the models used to generate Pfam-N v.34.0. ProtCNN inference was run using a custom Python script that (1) read in FASTA records and (2) ran inference of the ProtCNN as a TensorFlow SavedModel. An interactive notebook that demonstrates inference using ProtCNN is available at.
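
The two steps of the inference script mentioned above, reading FASTA records and then running them through the model, can be sketched roughly as follows. This is only an illustration, not the authors' script: the function names (read_fasta, batched, annotate), the batch size and the predict_fn callable are all assumptions made here. In particular, predict_fn is a hypothetical stand-in for whatever call feeds sequences to the loaded TensorFlow SavedModel, which this post does not show.

```python
import io
from typing import Callable, Iterable, Iterator, List, Sequence, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from a FASTA-formatted text stream."""
    name = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            chunks = []
        elif name is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Group items into lists of at most batch_size elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def annotate(
    handle: TextIO,
    predict_fn: Callable[[Sequence[str]], Sequence],
    batch_size: int = 32,
) -> List[Tuple[str, object]]:
    """Run predict_fn over every FASTA record and pair ids with predictions.

    predict_fn is hypothetical: it must accept a list of amino acid strings
    and return one prediction per input sequence, in the same order.
    """
    results = []
    for batch in batched(read_fasta(handle), batch_size):
        names = [name for name, _ in batch]
        sequences = [seq for _, seq in batch]
        results.extend(zip(names, predict_fn(sequences)))
    return results


if __name__ == "__main__":
    # Placeholder predictor that reports sequence length instead of a
    # Pfam family, so the sketch runs without TensorFlow installed.
    demo = io.StringIO(">example1 toy record\nMKVLAAGIV\nGLLS\n>example2\nGSHM\n")
    for name, prediction in annotate(demo, lambda seqs: [len(s) for s in seqs]):
        print(name, prediction)
```

Passing the predictor in as a callable keeps the FASTA handling independent of the model, so the same loop could serve either class predictions or embeddings, depending on what the supplied function returns.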
