Wenbo Sun will give a Lightning Talk on Cross-Source ML Model Training at ICDE24

Cross-Source ML Model Training

By Wenbo Sun, Rihan Hai

Machine learning (ML) training data is often fragmented across multiple sources. Typically, ML models over disparate sources are trained under one of two paradigms: distributed or centralized. In this talk, I will focus on the underexplored role of data integration (DI) metadata in both paradigms. I will highlight three challenges: (1) How can we formalize multi-dataset relationships in ML applications such as federated learning? (2) Given DI metadata, how can we automatically reformulate ML algorithms to work with different data sources? (3) Given the data sources, ML algorithm, and hardware setting, which is more efficient for training: a distributed or a centralized setup? Our solution is threefold. First, we formalize the complex relationships among the data sources involved in model training using the widely adopted mapping formalism of tuple-generating dependencies (tgds). Second, we propose an approach that transforms DI metadata into matrix representations, streamlining data transformation and linear algebra operations over source datasets. Third, we present an optimization method that effectively decides between a distributed and a centralized setting.
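To make the matrix-representation idea concrete, here is a minimal sketch (my own illustration, not the authors' implementation): mapping metadata that relates source rows to target rows, such as a simple tgd, can be encoded as a 0/1 indicator matrix, so that data transformation becomes a matrix multiply that composes with the linear algebra of ML training. All names and shapes below are hypothetical.

```python
import numpy as np

# Source A: 3 rows of 2 features (entity ids 0, 1, 2).
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# The target table has 4 rows (entity ids 0..3). Suppose the mapping
# metadata says target rows 0..2 come from source rows 0..2, and row 3
# has no matching source tuple. Encode this as an indicator matrix M.
M = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)

# Materializing the mapped data is a single matrix product.
T = M @ A  # shape (4, 2); row 3 is all zeros

# Downstream linear algebra (e.g., the Gram matrix X^T X used in many
# training algorithms) can be rewritten over the factored form without
# materializing T at all:  T^T T = A^T (M^T M) A.
lhs = T.T @ T
rhs = A.T @ (M.T @ M) @ A
assert np.allclose(lhs, rhs)
```

The key point of such a rewriting is the last identity: once the mapping is a matrix, whether to ship source data to a central site (materialize `T`) or push computation to the sources (evaluate the factored form) becomes a cost decision over linear algebra expressions.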

Rihan Hai
Assistant Professor

My research focuses on data integration and related dataset discovery in large-scale data lakes.