Abstract
Over the last two decades, the availability of data has grown exponentially thanks to technological advances such as cheaper and larger storage and a growing number of information-collecting devices. As a result, datasets have increased tremendously in size, often containing millions of observations and variables. This development has created new challenges for the fields of statistics and machine learning, which aim to analyze these large datasets in an efficient and comprehensive way. In this project we focus on regression analysis, one of the most popular tools for modelling a response variable as a function of a number of predictor variables.
A major difficulty in regression analysis is that the quality of the data is generally unknown. In particular, the data may contain anomalies, measurement errors and other types of contamination. Ignoring this fact can have disastrous effects on the results of any method for data analysis. On the other hand, detecting contamination is very difficult, and even more so as the size of the dataset increases. This motivates the need for regression methodology that is robust to data contamination, so that reliable results can be obtained even when the dataset is contaminated.
Traditionally, robust statistics considered "casewise" contamination, which appears at the level of the observation. This means that an observation is either contaminated or completely free from contamination. More recently, it has been argued that "cellwise" contamination, at the level of the individual cell, is a more appropriate assumption in the context of big data. A cellwise contamination model implies that for a given observation, certain variables may be reliable whereas others may not be. The challenge thus becomes to identify the uncontaminated data cells and use those for the estimation, while limiting the influence of the contaminated ones.
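The difference between the two contamination models can be illustrated with a short calculation, which is not part of the project itself. As a minimal sketch, assume purely for illustration that every cell is contaminated independently with the same probability; the function name and the rates used below are hypothetical. Under that assumption, the share of observations containing at least one contaminated cell grows quickly with the number of variables.

```python
def fraction_rows_affected(cell_rate: float, n_vars: int) -> float:
    """Probability that an observation has at least one contaminated cell.

    Illustrative assumption: each of the n_vars cells in a row is
    contaminated independently with probability cell_rate.
    """
    if not 0.0 <= cell_rate <= 1.0:
        raise ValueError("cell_rate must lie in [0, 1]")
    if n_vars < 1:
        raise ValueError("n_vars must be at least 1")
    # A row is clean only if all of its cells are clean.
    return 1.0 - (1.0 - cell_rate) ** n_vars


if __name__ == "__main__":
    # Hypothetical setting: 5% of cells contaminated, increasing dimension.
    for n_vars in (1, 10, 100):
        share = fraction_rows_affected(0.05, n_vars)
        print(f"{n_vars:>3} variables: {share:.1%} of rows affected")
```

With 5% of cells contaminated and 100 variables, more than 99% of observations contain at least one contaminated cell under this assumption, so a casewise method that treats whole observations as either clean or contaminated would have almost no clean observations left to work with.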
While several proposals have been made for regression under cellwise contamination, this line of research as a whole lacks direction and general foundations. For casewise contamination, general frameworks for the development of robust estimators exist, and they include tools for analyzing the statistical and computational properties of those estimators. The lack of cellwise counterparts to these frameworks leaves the problem of cellwise contamination poorly understood in general.
This proposal bridges knowledge from robust statistics, machine learning and optimization, and builds on my very recent work on robust covariance estimation to fundamentally tackle the problem of cellwise outliers in regression. The project starts by creating a clear overview of the state of the art through a benchmark study and a summary of the existing theory. It will then investigate a general framework for cellwise robust linear regression, derive the properties of the framework and design efficient optimization strategies. The framework allows for extensions in the direction of regularized estimation and nonlinear modelling. In addition to the development of methodology, the project aims to assess the severity of cellwise contamination in practical challenges by collaborating with experts on macro-economic time series modelling and drug development.
Given the ubiquity of regression analysis, the anticipated results have broad potential impact, reaching far beyond the foundational disciplines of statistics and computer science into disciplines including epidemiology, omics, physics, chemometrics, and economic policy.