Abstract
Proteomics aims to achieve a comprehensive understanding of biological systems by characterizing proteins and their modifications. A central computational challenge in mass spectrometry (MS)-based proteomics is the identification of peptides from tandem MS spectra. Conventional database search methods are limited by their dependence on existing protein sequence databases, leaving a large fraction of spectra unidentified. De novo peptide sequencing offers a solution by generating peptide sequences directly from spectra, but current approaches suffer from limited accuracy, restricted generalizability across MS platforms, and difficulty handling complex peptide classes such as immunopeptides.
Casanovo addresses these limitations by leveraging recent advances in deep learning and natural language processing. Built on a transformer-based architecture, Casanovo translates spectra into amino acid sequences with unprecedented accuracy, significantly surpassing existing academic and commercial de novo sequencing tools. This project will further enhance Casanovo's performance by compiling the most comprehensive training dataset to date, optimizing model architectures for accuracy and efficiency, and developing specialized strategies for immunopeptidomics applications. In particular, the work will focus on overcoming the unique challenges of non-tryptic peptides and diverse MS instrumentation. Additionally, new methods for confidence estimation will be developed, establishing a statistical framework for interpreting de novo identifications.
By advancing the accuracy, robustness, and accessibility of de novo sequencing, this project will establish a next-generation AI framework for peptide discovery, enabling deeper biological insights and accelerating research in proteomics, immunology, and biotechnology.