This course offers an introduction to the advanced computational analysis of complex and/or large biomedical datasets. It addresses the foundations of the partially overlapping fields of multivariate statistics and data mining, from both a theoretical perspective and an applied, hands-on point of view. The course extends earlier courses on bioinformatics and univariate statistics and addresses the following topics:
I. Introduction to different data types and data mining problems
- A formal overview of different data types in biology and medicine: quantitative data (e.g. from 'omics' platforms), string data (mainly DNA and protein sequences), text, graph data (biological networks) and image data
- An introduction to the challenges of data mining and machine learning.
II. Overview of data mining techniques
- Introduction: preprocessing and basic exploratory analysis (univariate statistics) of quantitative data. Within this course, this part is limited to a revision of statistical concepts.
- Unsupervised learning: clustering, principal component analysis (PCA)
- An introduction to classification methods: overview of classification systems, model validation (e.g. different cross-validation techniques)
- Biomedical feature selection and dimensionality reduction
- Supervised learning techniques (a solid introduction to commonly used techniques and algorithms): regression techniques, discriminant analysis, support vector machines, random forests, ensemble classifiers, decision trees, neural networks, naive Bayes, association rule mining
- Biomedical text mining
- Visual data mining
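As an illustration of the model validation topic listed above, k-fold cross-validation can be sketched as follows. This is only a minimal sketch under stated assumptions, not course material: it is written in Python with the standard library only, and the function names, the toy nearest-centroid classifier and the example data are all hypothetical. The data are split into k folds; each fold is held out once while the classifier is fitted on the remaining folds, and the held-out accuracy is recorded.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split the indices 0..n_samples-1 into k shuffled, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def nearest_centroid_predict(train_x, train_y, x):
    """Toy classifier (hypothetical): pick the class whose training mean is closest to x."""
    sums, counts = {}, {}
    for value, label in zip(train_x, train_y):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: sums[label] / counts[label] for label in sums}
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def cross_validate(xs, ys, k=5, seed=0):
    """Return the classification accuracy obtained on each held-out fold."""
    accuracies = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        # Evaluate only on samples the classifier has not seen.
        correct = sum(
            nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]
            for i in held_out
        )
        accuracies.append(correct / len(held_out))
    return accuracies


# Made-up example data: one measurement per sample, two well-separated classes.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
ys = ["control"] * 5 + ["case"] * 5

fold_accuracies = cross_validate(xs, ys, k=5)
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(fold_accuracies, mean_accuracy)
```

The point of the design is that every sample is used for testing exactly once and never appears in the training set of its own fold, so the averaged accuracy is an estimate of performance on unseen data.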
III. Biomedical data mining applications
A number of case studies and real research results will show how these techniques can be employed to extract novel insights from biomedical data. These lectures should cover diverse data types (e.g. quantitative molecular data, molecular sequences, molecular interactions, ontologies, text, physiological measurements, patient metadata, …) and several of the techniques addressed above.
The practical part will familiarize the students with the statistical programming language R. First, students should be able to read in a dataset correctly, generate graphs and perform elementary data manipulations. Subsequently, some techniques for statistical data analysis (linear regression, ANOVA, multivariate techniques, ...) are illustrated; here the students should be able to use the help files and search the internet for the code needed to solve a particular problem. Finally, programming techniques including for-loops and custom-made functions will be illustrated to facilitate repetitive analyses.