When (Not) To Pool Data (or, how to alleviate the dirty little secret of machine learning)

Jun 24, 2024·
Jilles Vreeken

Suppose we wish to develop a machine learning model for our favourite medical application, e.g. for detecting a rare disease, or suggesting individualized treatment such as for decreasing cholesterol levels. To do so, we need training data. When we dive into the IKIM data lake we find data for eight cohorts, all measured over the same variables. Our machine learning model can only handle a single training dataset, however. Each of these cohorts alone is too small to learn a good model, so instead, we throw all data on one big heap and start training. After a little while, we have a well-trained model, which confidently tells us… we should do less exercise if we want to decrease cholesterol. Wait, what? That does not make sense? Oh shoot, we ran into Simpson’s paradox! We grouped data we should not have! Which datasets could we have safely grouped? Or, how could we have learned a single model from different datasets without running this risk? These, and related questions on how we can discover which datasets share the same causal mechanism, are exactly what I will try to answer in this presentation.
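The reversal described above is easy to reproduce. Below is a minimal synthetic sketch (all numbers are invented for illustration): in each cohort more exercise goes with lower cholesterol, yet naively pooling the cohorts flips the sign of the fitted trend, because the older cohort both exercises more and has a higher cholesterol baseline.

```python
# A tiny synthetic illustration of Simpson's paradox. Within each cohort,
# more exercise is associated with lower cholesterol; pooling the cohorts
# reverses the sign of the trend. All numbers are made up.

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Cohort A: younger patients -- little exercise, low baseline cholesterol.
exercise_a, chol_a = [1, 2, 3], [195, 190, 185]
# Cohort B: older patients -- more exercise, higher baseline cholesterol.
exercise_b, chol_b = [6, 7, 8], [200, 195, 190]

print(slope(exercise_a, chol_a))   # -5.0: exercise lowers cholesterol
print(slope(exercise_b, chol_b))   # -5.0: same mechanism in cohort B
# Pooled, the slope turns positive -- the "do less exercise" conclusion.
print(slope(exercise_a + exercise_b, chol_a + chol_b))
```

Here age acts as a confounder shared by both exercise and cholesterol; only cohorts generated by the same causal mechanism can be pooled without this risk.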

About Jilles:

Jilles is faculty (W3, tenured) at the CISPA Helmholtz Center for Information Security, where he leads the research group on Exploratory Data Analysis. He is Honorary Professor of Computer Science at Saarland University and ELLIS Faculty of the Saarbrücken Unit on Artificial Intelligence and Machine Learning. His research is concerned with causality and unsupervised learning. In particular, he enjoys developing theory and algorithms for answering fundamentally exploratory questions, such as ‘what is going on in my data?’, ‘what causes what and how?’, and ‘what can we learn from this model?’, without having to make unnecessary or unjustified assumptions. To identify what is worth knowing, he likes to take a principled approach, such as one based on information theory, and then proceed to develop efficient algorithms for extracting useful interpretable results. He is interested in causal inference under realistic conditions, such as under hidden confounding or selection bias, when the i.i.d. assumption does not hold, or while making use of background knowledge. He is always interested in how to summarize the essence of complex data and models in easily understandable and actionable terms, and in using these to obtain better, more robust, and more useful models.

Visit his website for more: https://vreeken.eu