Research team

Expertise

Dr. Bittremieux's research centers on developing advanced machine learning techniques to uncover novel knowledge from mass spectrometry-based proteomics and metabolomics data. While his current research mainly focuses on how deep learning can be used to analyze mass spectrometry data, he is interested in a wide variety of bioinformatics problems. An important part of his work involves developing insights and computational approaches for quality control in biological mass spectrometry.

Exposomics: A holistic approach to assess environmental exposures and their impact on endocrine and metabolic disorders (EXPOSOME 2.0). 01/01/2026 - 31/12/2031

Abstract

Background: The exposome encompasses the totality of environmental exposures of an individual or organism throughout life (including exposure to chemicals, diet, lifestyle, climate factors, and stress), and how these exposures impact biology (e.g., metabolites and hormones) and health. In particular, exposure to endocrine disrupting chemicals (EDCs), including metabolic disrupting chemicals (MDCs), has been linked to a broad range of non-communicable diseases and environmental health effects. Workflows for gathering and interpreting exposome data are still in development and currently focus on elucidating physiological pathways that link exposure to adverse effects. Ultimately, this will lead to a holistic understanding of how exposures interact with the phenotype to cause adverse health outcomes with potentially large societal, economic, and ecological costs.

Aims: We will use innovative approaches to decipher the human exposome from early life through adulthood and its association with endocrine and metabolic alterations (leading to disorders such as liver diseases, metabolic syndrome, diabetes, and obesity), as well as effects on other important physiological processes mostly driven by endocrine and metabolic signaling.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Molecular Discovery in Untargeted Metabolomics through Advanced Data Science and Machine Learning 01/11/2025 - 31/10/2029

Abstract

Although our capacity for molecular discovery from biological samples via untargeted small molecule mass spectrometry (MS) has profoundly advanced over the past decades, the field still grapples with a fundamental challenge: the vast majority of MS/MS spectra remain unannotated, significantly limiting the amount of insights these studies can generate. To address this gap, my research envisions a paradigm shift from conventional heuristic-driven analysis to a robust, data-driven approach, capable of unveiling novel molecular insights from MS data. To this end, I propose a three-pronged approach to enhance MS data interpretation. First, I will develop a novel spectral library searching framework that leverages target–decoy strategies and semi-supervised machine learning to improve annotation sensitivity and confidence. Second, I will address the challenge of chimeric spectra by creating a deep learning-based deconvolution framework, enabling accurate resolution of overlapping isotopic envelopes. Third, I will design an AI-driven repository-scale molecular networking approach to uncover previously uncharacterized molecular analogs, expanding our capacity for small molecule discovery. By unlocking the wealth of unannotated MS data, this project will provide important advances for biomedical and environmental research, empowering the scientific community with next-generation tools for molecular discovery.
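As an illustration of the target-decoy idea mentioned in this abstract, the sketch below estimates false discovery rates for a ranked list of library matches. It is a generic textbook formulation, not the project's framework: the function name `target_decoy_q_values`, the input layout, and the example scores are all assumptions made for illustration.

```python
def target_decoy_q_values(matches):
    """Estimate q-values for spectrum matches with a target-decoy strategy.

    `matches` is a list of (score, is_decoy) pairs (hypothetical layout).
    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)

    # FDR at each score threshold: decoys accepted / targets accepted.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the lowest FDR at which a match would still be accepted,
    # i.e. the running minimum of the FDR taken from the bottom of the list.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, q_values)]


# Made-up example scores, for illustration only.
example = [(0.9, False), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
scored = target_decoy_q_values(example)
accepted = [m for m in scored if not m[1] and m[2] <= 0.01]
```

Here `accepted` keeps only the target matches whose q-value is at most 1%. A semi-supervised approach, as proposed in the abstract, would typically re-score the matches before this step; that part is not shown.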

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Privacy in proteomics: Safeguarding personally identifiable information in clinical omics data. 01/11/2025 - 31/10/2028

Abstract

Advancements in mass spectrometry-based proteomics have revolutionized the study of complex biological systems, enabling the characterization of thousands of proteins from human samples in a single experiment. However, important questions have been raised on the potential ability to re-identify individuals via MS-based proteomics data. While genomic and transcriptomic data have been extensively studied for privacy risks, the privacy implications of proteomics data are mostly unknown. Inspired by facial recognition techniques, I propose a novel approach to identify privacy risks within clinical proteomics data. I then aim to mitigate these risks by developing an approach to de-identify the data, while preserving data utility. By addressing both the identification of privacy risks and the development of mitigation strategies, this project stands at the forefront of an emerging field. It tackles a problem that is poised to become critical in the near future as clinical proteomics data becomes more detailed and widely shared. The outcomes will provide a much-needed roadmap for secure and ethical data sharing in proteomics, ensuring that this field continues to drive scientific innovation while safeguarding individual privacy. This work has the potential to set new standards for privacy-conscious research in proteomics, establishing a foundation for ethically sustainable and impactful biomedical science.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

De novo mass spectrometry peptide sequencing with a transformer large language model. 01/11/2025 - 31/10/2027

Abstract

Proteomics aims to achieve a comprehensive understanding of biological systems by characterizing proteins and their modifications. A central computational challenge in mass spectrometry (MS)-based proteomics is the identification of peptides from tandem MS spectra. Conventional database search methods are limited by their dependence on existing protein sequence databases, leaving a large fraction of spectra unidentified. De novo peptide sequencing offers a solution by generating peptide sequences directly from spectra, but current approaches suffer from limited accuracy, restricted generalizability across MS platforms, and challenges in handling complex peptide classes such as immunopeptides. Casanovo addresses these limitations by leveraging recent advances in deep learning and natural language processing. Built on a transformer-based architecture, Casanovo translates spectra into amino acid sequences with unprecedented accuracy, significantly surpassing existing academic and commercial de novo sequencing tools. This project will further enhance Casanovo's performance by compiling the most comprehensive training dataset to date, optimizing model architectures for accuracy and efficiency, and developing specialized strategies for immunopeptidomics applications. In particular, the work will focus on overcoming the unique challenges of non-tryptic peptides and diverse MS instrumentation. Additionally, new methods for confidence estimation will be developed, establishing a statistical framework for interpreting de novo identifications. By advancing the accuracy, robustness, and accessibility of de novo sequencing, this project will establish a next-generation AI framework for peptide discovery, enabling deeper biological insights and accelerating research in proteomics, advancing immunology, and driving innovation in biotechnology.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Deep Reinforcement Learning for Mass Spectrometry Data Acquisition in Metabolomics. 01/10/2025 - 31/07/2026

Abstract

Mass spectrometry-based metabolomics is a powerful analytical technique used for identifying small molecules in complex biological samples. However, current data acquisition methods have limitations in capturing all relevant molecules. To address this issue, we propose using artificial intelligence (AI) to optimize mass spectrometry data acquisition in real time, maximizing the number and quality of identified metabolites. First, large amounts of publicly available mass spectrometry data will be used to develop a deep neural network that can predict the quality of generated fragmentation spectra based on instrument configurations. Second, we will use offline reinforcement learning to explore novel instrument configurations to enhance the data acquisition process. A critical focus will be placed on defining a suitable reward function that guides the AI agent's exploration, considering factors such as spectrum quality, novelty of acquired spectra, and resource utilization. Third, we will use a virtual mass spectrometry environment to simulate the fragmentation process and allow the AI agent to control data acquisition. This will enable thorough assessment and comparison against baseline approaches and alternative strategies. Once fully trained and validated, the AI agent will be deployed onto a mass spectrometer to autonomously control the data acquisition process in real time, evaluating its performance in detecting putative metabolites compared to traditional approaches. By utilizing AI to optimize molecular discovery from untargeted metabolomics experiments, we will enhance the identification of metabolites that were previously overlooked, unlocking valuable biological insights. These advances will have transformative implications for precision medicine, drug discovery, and many other areas of the life sciences.
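To make the reward function described in this abstract concrete, here is a minimal sketch that combines the three named factors (spectrum quality, novelty, and resource utilization) into one scalar. The weighted-sum form, the `AcquisitionOutcome` fields, and the weight values are assumptions chosen for illustration; the abstract does not specify how the factors are combined.

```python
from dataclasses import dataclass


@dataclass
class AcquisitionOutcome:
    """Hypothetical summary of one acquisition decision."""
    quality: float  # predicted spectrum quality, in [0, 1]
    novelty: float  # 1.0 if unlike previously acquired spectra, 0.0 if redundant
    cost: float     # fraction of the instrument time budget consumed, in [0, 1]


def reward(outcome, w_quality=1.0, w_novelty=0.5, w_cost=0.25):
    """Weighted sum: reward quality and novelty, penalize resource use.

    The weights are illustrative placeholders, not tuned values.
    """
    return (
        w_quality * outcome.quality
        + w_novelty * outcome.novelty
        - w_cost * outcome.cost
    )


# A high-quality, novel spectrum should score above a redundant, poor one.
good = reward(AcquisitionOutcome(quality=0.8, novelty=1.0, cost=0.2))
poor = reward(AcquisitionOutcome(quality=0.2, novelty=0.0, cost=0.2))
```

With these placeholder weights, `good` exceeds `poor`, so an agent maximizing this reward is steered toward informative, non-redundant acquisitions. In practice the balance between the terms would have to be tuned against real or simulated acquisition data.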

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

FoodOmics Philippines: studying the Molecular Composition of Endemic Heirloom and Unconventional Foods in the Philippines 01/09/2025 - 31/08/2027

Abstract

The Philippines is home to a large diversity of endemic and heirloom foods, yet their nutritional and bioactive properties remain largely unexplored, limiting their potential for improving nutrition, public health, and sustainable food systems. This project seeks to uncover the molecular composition of these unique foods using advanced metabolomics and bioinformatics. As a collaborative effort, Filipino researchers will lead food sampling and mass spectrometry analyses, leveraging their expertise in local biodiversity and food systems, while Flemish partners will contribute bioinformatics expertise, co-developing data analysis workflows and training programs. Through workshops and joint research activities, the project will enhance foodomics and computational biology expertise across both regions. By actively engaging local communities and policymakers, we aim to translate scientific findings into real-world applications, fostering long-term global innovation in sustainable food systems.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Accurate and scalable AI-driven de novo sequencing for immunopeptide discovery. 01/09/2025 - 31/08/2026

Abstract

Casanovo is an AI-powered software platform for de novo peptide sequencing, enabling the identification of novel peptides directly from tandem mass spectrometry data without relying on predefined protein databases. This capability is critical for applications like immunopeptidomics, where conventional workflows leave most spectra unassigned due to the unpredictability of in vivo peptide processing and the presence of post-translational modifications. Casanovo leverages transformer-based deep learning to achieve state-of-the-art performance, significantly outperforming existing academic and commercial solutions. This project will advance Casanovo from a high-performing research prototype to a widely usable and robust software platform. We will introduce support for post-translational modifications, accelerate analysis through optimized cloud-native infrastructure, and deliver an intuitive web interface that democratizes access to AI-powered proteomics. In parallel, we will engage end users through structured pilot studies and market research to refine the product's positioning, validate pricing models, and define a viable business strategy. By addressing a critical bottleneck in mass spectrometry data interpretation, this project positions Casanovo to transform peptide discovery across immunotherapy, vaccine development, infectious disease research, and other key areas of biomedical innovation.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Novel 3D multi-electrode technology to record from complex electrically excitable tissues and organoids. 01/06/2025 - 31/05/2027

Abstract

This application requests funding to purchase a state-of-the-art 3D multi-electrode array (MEA) platform to enable electrophysiological recordings from complex electrically excitable tissues and organoids. Patch-clamping is considered the gold standard for studying the electrophysiological properties of excitable cells, but it is an extremely labor-intensive and invasive technique that is limited to short-term measurements of individual cells or small numbers of cells at a single time point. In contrast, MEAs enable high-throughput, non-invasive, longitudinal, real-time measurements of functional cellular networks without disrupting important cell-cell contacts. Because they record from many hundreds to thousands of cells simultaneously, they provide greater insight into important physiological processes. Current MEA systems at the University of Antwerp only include setups with arrays of planar electrodes, which are not suitable for recording from complex tissues such as brain and cardiac organoids or tissue sections, because the electrodes do not get close to the active cells. The 3D MEA system instead consists of arrays of ~0.1 mm raised electrodes, which allow repeated recordings from active cells within these organoids and tissues, which can be grown under various experimental conditions. The need is urgent: increasing numbers of research groups at the University of Antwerp use such tissue models but have no means to record from them. The 3D MEA platform is the most suitable instrument and will help many groups to functionally elucidate the pathomechanisms of neurological and cardiac disorders, as well as provide the opportunity to rapidly screen large drug libraries.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Bioinformatics Solutions For the Comprehensive Study of the Human Immunopeptidome. 01/01/2025 - 31/12/2028

Abstract

The adaptive immune system detects infected or malignant cells by recognizing peptides bound to major histocompatibility complex (MHC) molecules. This recognition induces an immune response in which antibodies are produced or the infected or abnormal cells are attacked directly to eliminate the threat. Mass spectrometry-based immunopeptidomics is a key approach to understand the adaptive immune system by identifying and characterizing peptides presented on MHC molecules. However, there is a lack of optimized bioinformatics tools for immunopeptidomics data analysis, resulting in very low spectrum annotation rates and missed insights into the immune system. To overcome this challenge, we will develop a powerful de novo immunopeptide sequencing solution using deep learning to uncover more biological knowledge from immunopeptidomics data. We will apply this tool to study the presence of aberrant peptides, e.g., due to errors in translation or transcriptional splicing, and non-human peptides, originating from pathogens and other organisms, in the human immunopeptidome. These innovations have the potential to unlock new biological and biomedical insights into the adaptive immune system that will catalyze the development of novel immunotherapies and vaccines.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Artificial Intelligence to Uncover Patterns in Mass Spectrometry Data Across Repositories. 01/01/2025 - 31/12/2028

Abstract

The relentless growth of data in the life sciences, notably in small molecule mass spectrometry (MS), presents a unique opportunity for groundbreaking discoveries. This project will introduce powerful artificial intelligence (AI) techniques to transcend traditional analysis paradigms that treat datasets in isolation, integrating fragmented data from large public databases to reveal insights that individual studies alone cannot uncover. At its core, our aim is to innovate by shifting from analyzing individual MS experiments to a comprehensive analysis across large repositories. This paradigm shift will unlock the untapped potential of public MS data, interpreting new observations within the context of the extensive molecular diversity documented in data repositories. To achieve this goal, we will develop AI-driven tools for simulating spectral libraries and incorporating statistical confidence in molecular identification. Additionally, we will employ multimodal representation learning techniques to bridge the gap between spectra and molecules on a repository scale. Standing at the intersection of AI, machine learning, and computational MS, our objective is to provide an integrated analysis of complex molecular data. This will pave the way for transformative advances across various scientific domains in the life sciences, including metabolomics, drug discovery, and environmental sciences, revolutionizing the approach to molecular discovery in the era of big data.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

The Live Mouse Tracker (LMT) as a versatile drug screening platform for rare neurological diseases. 01/01/2025 - 31/12/2025

Abstract

Establishing effective therapies for rare neurodevelopmental diseases remains one of the greatest challenges in molecular medicine. Although advances in next-generation sequencing technologies have led to the discovery of hundreds of novel genetic syndromes over the past decade, the development of individualized therapies continues to lag behind. Each rare disorder, while affecting a small group, contributes to a global burden estimated to impact over 300 million individuals. The complexity arises from the fact that these disorders, often caused by mutations in different genes, affect multiple cellular pathways, generating an overwhelming volume of data that must be analyzed to inform therapeutic strategies. Current drug interventions have seen limited success in translating promising preclinical findings into patient-ready treatments. The rapid rise of AI technologies, however, has the potential to transform this landscape. AI-driven algorithms are increasingly capable of navigating vast biomedical datasets, revealing drug candidates for rare diseases at an unprecedented pace. Many start-ups are already capitalizing on this potential, generating a flood of drug candidates for preclinical evaluation. However, this surge in candidate therapies has shifted the bottleneck from drug discovery to preclinical testing. Traditional murine test batteries are labor-intensive, expensive, and time-consuming, necessitating a standardized, scalable, and efficient platform to meet the growing demand for drug screening. We propose the development and commercialization of our Live Mouse Tracker (LMT) platform, a cutting-edge tool designed to address this critical need. The LMT system automates behavioral analysis and can track up to 39 different behaviors in groups of mice over 24-hour periods. This high-throughput capability provides a rapid and comprehensive assessment of drug efficacy in preclinical models.
Our initial validation will focus on the fragile X syndrome, a widely studied neurodevelopmental disorder for which no effective treatment currently exists. By evaluating drugs that target multiple affected pathways simultaneously, we aim to pioneer a new approach to rare disease therapy development. During this project, we will validate the robustness of the LMT platform and extend it into a fully integrated service, as well as explore collaboration with other university partners to offer comprehensive preclinical drug testing solutions. This service platform has the potential to revolutionize the drug development pipeline, ensuring that AI-generated candidate drugs can be rapidly and reliably assessed, accelerating the path from bench to bedside. Through this initiative, we aim to bridge the gap between drug discovery and therapeutic application, bringing hope to millions of patients with rare neurological diseases.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Deep Learning for Comprehensive Small Molecule Discovery From Untargeted Mass Spectrometry Data. 01/10/2024 - 30/09/2027

Abstract

Although small molecule mass spectrometry (MS) is a vital tool in various life sciences domains, its potential is hindered by the low annotation rate of MS/MS spectra, limiting our ability to uncover critical biological insights. This research project aims to revolutionize small molecule MS by harnessing the power of deep learning and multimodal integration to overcome this challenge. I will develop several complementary deep learning strategies for small molecule identification. First, I will develop a learned spectrum similarity score for the discovery of structurally related analogs. Second, I will use generative AI techniques to simulate comprehensive spectral libraries. Third, I will develop a solution for de novo molecule identification directly from MS/MS spectra, reducing the reliance on spectral libraries and expanding the range of discoverable molecules. Furthermore, I will introduce a holistic approach to MS by integrating three disparate data sources—MS/MS spectra, molecular structures, and natural language descriptions—into a shared latent space using multimodal representation learning. This paradigm shift will allow for direct linking of MS/MS observations to molecular structures and expert knowledge, enabling semantic search and retrieval of molecular information. Moreover, I will employ explainable AI techniques to interpret model decisions and provide insights into MS experimentation patterns.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Bioinformatics network for proteomics and mass spectrometry 01/01/2024 - 31/12/2028

Abstract

Proteomics, the study of proteins and their functions, is a critical area in biology and medicine. With mass spectrometry (MS), researchers can analyze large amounts of proteomics samples, leading to valuable insights into complex biological processes. MS datasets require specialized data analysis techniques, which has led to the development of several powerful bioinformatics tools and pipelines for mass spectrometry-based proteomics. Nevertheless, the increasingly large volume and complex nature of MS-based proteomics data pose significant challenges that hinder progress in the field. To address these, there is a need for an open and collaborative approach to science. We have identified four key challenges that we will address through this Scientific Research Network (SRN):

  • Highly performant bioinformatics tools: As proteomics datasets grow in size, computational bottlenecks arise. Through this SRN, we will foster the development of highly performant and interoperable bioinformatics tools and workflows to process these datasets efficiently, enabling faster and more transparent analyses.
  • Machine learning integration: While machine learning holds great promise for proteomics data analysis, integrating it into practical workflows remains complex. Our SRN will work to bridge this gap, making machine learning techniques more accessible and seamlessly integrated into routine analyses.
  • Effective benchmarking: The diversity of analysis approaches makes it challenging to compare methods effectively. Our objective is to establish standardized benchmarking methods that allow researchers to systematically evaluate and improve their analysis pipelines.
  • Community building and educational resources: Proteomics data analysis requires specialized knowledge that is continuously evolving, making it difficult for young scientists and data science experts to enter the field. Our proposed SRN aims to build a supportive community for early-career researchers and create high-quality educational resources that facilitate the learning curve and provide accessible pathways for newcomers.

With three research units in Flanders that are global leaders in MS-based proteomics, this SRN will make Flanders a focal point in the field of proteomics bioinformatics. Our collaboration with international partners will further enhance the visibility of Flemish research and contribute to a competitive position in the international research landscape, making the region attractive for ambitious and talented young researchers to work in. The six partnering research units have strong ties with the proteomics bioinformatics community within Europe and beyond, which we aim to maximally exploit to achieve our long-term goals. Indeed, instead of tackling these challenges alone, each of the six research units intends to take up a leading role in the wider research community to reach our objectives. Through this SRN, we will formalize the existing connections between the six partners and provide a clear collaborative vision and structure to drive progress and effectively mobilize the wider research community. The scope of our goals underscores the necessity of a community-scale effort. All six partners have taken up central roles in existing initiatives, such as the European Bioinformatics Community for Mass Spectrometry (EuBIC-MS), the Human Proteome Organization's Proteomics Standards Initiative (HUPO-PSI), the ELIXIR Life Science Infrastructure, and the Computational Mass Spectrometry (CompMS) interest group of the International Society for Computational Biology (ISCB), providing the critical mass of researchers required to achieve our goals.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Reference data-driven metabolomics to study the molecular composition of South African foods. 01/01/2024 - 31/12/2026

Abstract

Understanding the molecular composition of food is essential for studying its impact on human health. We have recently developed a new approach called reference data-driven metabolomics, which can perform diet readouts from untargeted metabolomics data. However, this approach currently lacks diverse and geographically representative reference data. To address this, we will expand our reference food molecular database to include indigenous and locally cultivated foods from South Africa, a region with rich cultural and culinary traditions and nutritional diversity, analyze their molecular composition using mass spectrometry, and integrate the data into the Global FoodOmics reference database. Additionally, we will develop user-friendly bioinformatics tools that simplify the data analysis process, making reference data-driven metabolomics accessible to researchers with diverse backgrounds, and study the molecular composition of indigenous South African foods. Through collaboration between South African universities and the University of Antwerp, we will combine expertise in analytical chemistry, bioinformatics, nutrition, and agricultural sciences to advance metabolomics research, expand scientific knowledge of South African diets, and provide evidence-based insights for improving nutrition and health in South African populations.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Computational mass spectrometry and artificial intelligence to unravel the immunopeptidome. 01/10/2023 - 30/09/2027

Abstract

The adaptive immune system is a crucial component of the immune response, providing specific defense against a wide range of pathogens and contributing to the development of immunological memory. Immunopeptidomics is a rapidly evolving field that uses mass spectrometry-based approaches to identify and quantify immunopeptides, which play a vital role in the recognition and elimination of infected or malignant cells by T cells. However, the annotation rate of immunopeptides from mass spectrometry data is currently severely limited, resulting in a significant loss of biological information. To overcome this challenge, we will develop specialized bioinformatics tools for analyzing mass spectrometry immunopeptidomics data. Specifically, we will develop an efficient and sensitive open modification search engine to identify immunopeptides that have undergone post-translational modifications. Furthermore, we will develop a deep learning-based de novo peptide sequencing approach optimized for the analysis of immunopeptidomics data. The tools developed in this project have the potential to significantly expand the amount of biological information that can be obtained from immunopeptidomics experiments, leading to transformational breakthroughs in the field.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Artificial intelligence-powered knowledge base of the observed molecular universe. 01/12/2022 - 30/11/2027

Abstract

Despite recent breakthroughs in artificial intelligence (AI) that have led to disruptive advances across many scientific domains, there are still challenges in adopting state-of-the-art AI techniques in the life sciences. Notably, analysis of small molecule untargeted mass spectrometry (MS) data is still based on expert knowledge and manually compiled rules, and each experiment is analyzed in isolation without taking into account prior knowledge. Instead, this project will develop more powerful approaches in which untargeted MS data is interpreted within the context of the vast background of previously generated, publicly available data. The research hypothesis driving the proposed project is that advanced AI techniques can uncover hidden knowledge from large amounts of open MS data in public repositories to gain a deeper understanding of the molecular composition of complex biological samples. We will develop machine learning solutions to explore the observed molecular universe and build a comprehensive small molecule knowledge base. These ambitious goals build on our unique expertise in both AI and MS to create next-generation, data-driven software solutions for molecular discovery from untargeted MS data.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Bioinformatics and machine learning for large-scale metabolomics data analysis. 01/12/2022 - 30/11/2026

Abstract

Despite recent breakthroughs in artificial intelligence (AI) that have led to disruptive advances across many scientific domains, there are still challenges in adopting state-of-the-art AI techniques in the life sciences. Notably, analysis of small molecule untargeted mass spectrometry (MS) data is still based on expert knowledge and manually compiled rules, and each experiment is analyzed in isolation without taking into account prior knowledge. Instead, this project will develop more powerful approaches in which untargeted MS data is interpreted within the context of the vast background of previously generated, publicly available data. The research hypothesis driving the proposed project is that advanced AI techniques can uncover hidden knowledge from large amounts of open MS data in public repositories to gain a deeper understanding of the molecular composition of complex biological samples. We will develop machine learning solutions to explore the observed molecular universe and build a comprehensive small molecule knowledge base. These ambitious goals build on our unique expertise in both AI and MS to create next-generation, data-driven software solutions for molecular discovery from untargeted MS data.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Precision Medicine Technologies (PreMeT) 01/01/2021 - 31/12/2026

Abstract

Precision medicine is an approach that tailors healthcare to the individual on the basis of their genes, lifestyle, and environment. It is based on technologies that allow clinicians to predict more accurately which treatment and prevention strategies for a given disease will work in which group of affected individuals. Key drivers for precision medicine are advances in technology, such as next-generation sequencing in genomics, the increasing availability of health data, and the growth of data science and artificial intelligence. In these domains, six strong research teams at UAntwerpen are now joining forces to translate their research and offer a technology platform for precision medicine (PreMeT) to industry, hospitals, research institutes, and society. The mission of PreMeT is to enable precision medicine through an integrated approach of genomics and big data analysis.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Tracing ions in mass spectra to identify small molecules (TractION). 01/11/2017 - 31/12/2025

Abstract

Currently, data analysis and interpretation is the most time-consuming step in the structural elucidation of small molecules. It still requires a lot of manual intervention by highly trained MS experts. Moreover, the manual nature of this step makes it vulnerable to human error. The goal of this project is to reduce the current data interpretation bottleneck by evaluating and developing an automatic identification pipeline. This pipeline is based on advanced spectral libraries together with adapted search algorithms and state-of-the-art pattern mining technology.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

De novo mass spectrometry peptide sequencing with a transformer large language model. 01/05/2024 - 30/04/2025

Abstract

The primary challenge in proteomics is identifying amino acid sequences from tandem mass spectra, which has traditionally been achieved using sequence database searching. As this method is limited to known protein sequences, de novo peptide sequencing presents an interesting alternative for the discovery of unexpected peptides. Casanovo is a state-of-the-art tool for de novo peptide sequencing that harnesses technologies similar to those underpinning large language models to translate mass spectra into amino acid sequences. The goal of this project is to enhance Casanovo and make it the preferred solution for de novo peptide sequencing. This will be achieved by compiling an extensive training dataset from diverse biological samples and mass spectrometry instruments and scaling up Casanovo's neural network to increase its learning capacity. Additionally, we will create a tailored model for the analysis of immunopeptidomics data by fine-tuning Casanovo. Finally, we will develop a user-friendly web interface, making Casanovo accessible to a broad range of researchers and overcoming hardware limitations through cloud computing.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Enabling mobile and data-driven pathogen monitoring through a paired nanopore squiggle–genome sequence database. 01/05/2023 - 31/12/2024

Abstract

Infectious disease monitoring is a global need, and the threat of existing and emerging pathogens poses a major challenge to public health. Nanopore sequencing is a revolutionary technology that enables portable sequencing and has shown its merit in the COVID-19 pandemic. This technology could enable existing laboratories that have no or limited infectious disease surveillance capacity to 'leapfrog' to sequencing-based pathogen monitoring. However, this potential hinges on the ability to operate in resource-limited settings, which is, to date, hindered by data storage and processing needs. The raw data, referred to as 'squiggles,' requires significant storage space, and decoding it to DNA sequences requires graphics processing units (GPUs) that consume significant amounts of power. In this pandemic preparedness proof-of-concept project, we will build on advances from our IOF-SBO funded project LeapSEQ to remove significant hurdles to mobile and data-driven pathogen monitoring. These hurdles include: (1) the need for scalable storage solutions for squiggle data, (2) the lack of available pathogen data, and (3) the need for improved computational solutions for interacting with squiggle data. We will tackle these problems by engineering and populating a proof-of-concept paired nanopore squiggle–genome sequence database using our portable LeapSEQ lab and by developing efficient data-driven algorithms for rapid pathogen monitoring. We will develop this database with strategic partners at ITM and UA and further explore the valorization potential of LeapSEQ in the context of global pathogen monitoring.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Transferable deep learning for sequence based prediction of molecular interactions. 01/10/2019 - 30/09/2023

Abstract

Machine learning can be used to elucidate the presence or absence of interactions. For life science research in particular, the prediction of molecular interactions that underlie the mechanics of cells, pathogens and the immune system is a problem of great relevance. Here we aim to establish a fundamentally new technology that can predict unknown interaction graphs with models trained on the vast amount of molecular interaction data that is nowadays available thanks to high-throughput experimental techniques. This will be accomplished using a machine learning workflow that can learn the patterns in molecular sequences that underlie interactions. We will tackle this problem in a generalizable way using the latest generation of neural network approaches by establishing a generic encoding for molecular sequences that can be readily translated to various biological problems. This encoding will be fed into an advanced deep neural network to model general molecular interactions, which can then be fine-tuned to highly specific use cases. The features that underlie the successful network will then be translated into novel visualisations to allow interpretation by biologists. We will assess the performance of this framework using both computationally simulated and real-life experimental sequence and interaction data from a diverse range of relevant use cases.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project

Intelligent quality control for mass spectrometry-based proteomics 01/10/2017 - 31/07/2021

Abstract

As mass spectrometry proteomics has matured over the past few years, a growing emphasis has been placed on quality control (QC), which is becoming a crucial factor to endorse the generated experimental results. Mass spectrometry is a highly complex technique, and because its results can be subject to significant variability, suitable QC is necessary to model the influence of this variability on experimental results. Nevertheless, extensive quality control procedures are currently lacking due to the absence of QC information alongside the experimental data and the high degree of difficulty in interpreting this complex information. For mass spectrometry proteomics to mature further, a systematic approach to quality control is essential. To this end, we will first provide the technical infrastructure to generate QC metrics as an integral element of a mass spectrometry experiment. We will develop the qcML standard file format for mass spectrometry QC data, and we will establish procedures to include detailed QC data alongside all data submissions to PRIDE, a leading public repository for proteomics data. Second, we will use this newly generated wealth of QC data to develop advanced machine learning techniques to uncover novel knowledge on the performance of a mass spectrometry experiment. This will make it possible to improve the experimental set-up, optimize the spectral acquisition, and increase the confidence in the generated results, massively empowering biological mass spectrometry.

Researcher(s)

Research team(s)

Project type(s)

  • Research Project