Data driven approaches towards computational genome interpretation for identification of disease causing mutations
27 June 2018
UAntwerp - Campus Drie Eiken - Building O - Auditorium O6 - Universiteitsplein 1 - 2610 WILRIJK
Prof B. Loeys, Prof G. Vandeweyer & M. Alaerts, PhD
PhD defence Ajay Kumar - Faculty of Medicine and Health Sciences
Abstract (Presentation in English)
The advent of next generation sequencing (NGS) technologies has made it possible to study disease mechanisms at a much faster pace than previously anticipated, leading to the discovery of a growing number of disease causing genetic variants (mutations). The successful application of NGS technologies can be seen as an example of an inductive reasoning process, in which the underlying genomic data drive the hypothesis that explains the mechanism of disease. In this thesis we introduce data driven approaches for the identification of disease causing mutations and address the challenges associated with their development. NGS technologies are fast and generate high throughput genomic data. They therefore require automated procedures that chart the path from the discovery of these variants to their functional interpretation.
This thesis makes two main contributions. The first is the development of novel computational tools for the detection and interpretation of single nucleotide variations (SNVs) associated with bicuspid aortic valve (BAV) with thoracic aortic aneurysm (TAA) disease from NGS data. The second is the development of a novel statistical method for the identification of copy number variations (CNVs) from targeted resequencing (TR) NGS data.
For the identification of SNVs from NGS data, we developed a novel gene prioritization tool named pBRIT, which integrates 10 different annotation sources to prioritize candidate genes through a Bayesian regression model. The utility of this method was examined on several retrospective and prospective benchmark datasets, and its performance was compared with several existing methods. The dynamic implementation of pBRIT enables users to perform large scale exome prioritization and to explore the results intuitively. Additionally, an automated bioinformatics pipeline was developed for rare variant association analysis (RVSA). Together, these two complementary strategies helped to pinpoint the SMAD6 gene as associated with BAV/TAA disease.
Similarly, for the identification of CNVs, another novel statistical method named varAmpliCNV was developed, designed specifically to detect CNVs from amplicon-based TR data. It incorporates a PCA/MDS based method to control the variance present in the data. Comparison with three existing tools demonstrates the superior performance of varAmpliCNV, with higher sensitivity and specificity.
Finally, we conclude that in the era of high throughput NGS, the huge amount of data being generated underscores the need for robust data driven approaches that scale easily and generalize to a wide range of problems in genetics research.