Full design of our model lake Amalur accepted at TKDE 2024

Extended from our vision paper at ICDE 2024, the full version of Amalur is now available at TKDE!

Amalur: The Convergence of Data Integration and Machine Learning

By Ziyu Li, Wenbo Sun, Danning Zhan, Yan Kang, Lydia Chen, Alessandro Bozzon, and Rihan Hai

Abstract—Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demand a lot of manual work and computational resources. With data privacy constraints, data often cannot leave the premises of data silos; hence model training should proceed in a decentralized manner. In this work, we present a vision of bridging traditional data integration (DI) techniques with the requirements of modern machine learning systems. We explore the possibilities of utilizing metadata obtained from data integration processes for improving the effectiveness, efficiency, and privacy of ML models. Towards this direction, we analyze ML training and inference over data silos. Bringing data integration and machine learning together, we highlight new research opportunities from the aspects of systems, representations, factorized learning, and federated learning.


Index Terms—Machine learning, data integration, federated learning.

Rihan Hai
Assistant professor

My research focuses on data integration and related dataset discovery in large-scale data lakes.