Data integration for clinical coding algorithms

Date: 27 November 2017

Venue: Campus Drie Eiken, Promotiezaal Q0.02 - Universiteitsplein 1 - 2610 Antwerpen-Wilrijk (route: UAntwerpen, Campus Drie Eiken)

Time: 4:00 PM

Organization / co-organization: Department of Mathematics and Computer Science

PhD candidate: Elyne Scheurwegs

Principal investigators: Bart Goethals and Walter Daelemans

Short description: PhD defence Elyne Scheurwegs - Faculty of Science - Department of Mathematics and Computer Science


During a hospital visit, a patient stay is manually assigned codes that reflect the diagnoses and procedures recorded for the patient during that stay. This process, which affects hospital reimbursement, can be assisted by (semi-)automatically predicting the codes. The research in this thesis focuses on methods that tackle the issues that arise when these codes are predicted from all the information in a patient's electronic health record. An electronic health record contains heterogeneous data, ranging from unstructured text (e.g. discharge files) to rigidly structured data (e.g. lab results), and each type requires its own representation method. Prominent issues are finding an optimal representation for each data type and determining how information from these sources can be combined into a unified data model. Further questions are how complementary the data sources are to each other and whether the developed models can be made independent of a rigid underlying structure. This research is conducted on a real-life dataset, in a language (Dutch) for which annotated (medical) resources are only partially available.

Several methods for data integration are proposed. An early data integration method attempts to find an optimal representation of all information at once, while reducing redundancy in the extracted information. A late data integration method uses a meta-learner on top of predictions made for each source in isolation, and can decide which sources individually provide the most accurate prediction. An intermediate data integration method focuses on reducing the information overlap between individual features using database coverage, so that a filtered list of features can be presented to a primary classifier. This last technique is a compromise between late data integration, where each source is overgeneralised, and early data integration, where the input from different sources is often unbalanced in both the number and the quality of features.
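
As a rough illustration of late data integration, the sketch below trains a small logistic-regression meta-learner on the probabilities that two per-source classifiers assign to a single code. It is a minimal sketch under stated assumptions, not the method used in the thesis: the choice of logistic regression, the two sources (text and lab results), the function names and all numbers are hypothetical and invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_meta_learner(source_probs, labels, epochs=500, lr=0.5):
    """Fit a logistic-regression meta-learner by batch gradient descent.

    source_probs: one row per patient stay; each row holds the probability
                  that each per-source classifier assigned to the code.
    labels:       1 if the code was really assigned to the stay, else 0.
    """
    n_sources = len(source_probs[0])
    n = len(labels)
    weights = [0.0] * n_sources
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_sources
        grad_b = 0.0
        for row, y in zip(source_probs, labels):
            p = sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)
            err = p - y
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, row):
    """Combined probability for one stay, given its per-source probabilities."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


# Hypothetical held-out predictions: column 0 = text source, column 1 = lab source.
# The text source separates the classes; the lab source is uninformative here.
held_out_probs = [
    [0.90, 0.6], [0.80, 0.4], [0.85, 0.5],   # stays where the code applies
    [0.20, 0.6], [0.10, 0.4], [0.15, 0.5],   # stays where it does not
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_meta_learner(held_out_probs, labels)
print("learned source weights:", weights)
print("combined prediction:", predict(weights, bias, [0.9, 0.5]))
```

In this toy setup the meta-learner ends up relying mostly on the text source, which mirrors the idea that a late integration method can learn which sources to trust for a given code.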

To represent information found in textual sources, we experimented with unsupervised and dictionary-based methods for extracting multi-word expressions (MWEs), which are used to represent medical facts. We propose alternative methods that reach results equivalent to the state of the art while requiring few annotated data sources. This makes our techniques particularly useful for processing text in languages other than English, where MWEs are only partially covered by dictionaries.
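
To give a feel for what unsupervised MWE extraction can look like, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI), a common association measure that needs no annotated data. PMI is used here as a stand-in and is an assumption; the abstract does not say which measures the thesis uses. The toy corpus, the function name and the `min_count` threshold are likewise hypothetical.

```python
import math
from collections import Counter


def pmi_bigrams(sentences, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information.

    sentences: list of token lists.
    min_count: pairs seen fewer times than this are ignored, because PMI
               is unreliable for rare events.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    # Highest-scoring pairs are the strongest MWE candidates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy corpus (English tokens for readability).
corpus = [
    "patient has diabetes mellitus".split(),
    "diabetes mellitus type two".split(),
    "patient has chest pain".split(),
    "patient reports chest pain".split(),
    "patient has fever".split(),
]

ranked = pmi_bigrams(corpus)
for pair, score in ranked:
    print(" ".join(pair), round(score, 2))
```

On this corpus the pairs "diabetes mellitus" and "chest pain" score higher than the frequent but less cohesive "patient has", which is the behaviour one wants from an MWE candidate ranker. A dictionary-based method would instead match token sequences against a terminology list.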