Data science workflows require extracting, preparing, and integrating data from multiple sources. Due to the lack of proper tooling, this is a cumbersome process that hinders the productivity of data scientists. It is also slow: most of the time, data scientists prepare data in a data processing system or a data lake and export it in the form of a table so that it can be consumed by a machine learning (ML) algorithm.
Recent advances in factorized ML allow us to push down certain linear algebra (LA) operators and execute them closer to the data sources. At the same time, novel data exploration and discovery tools, as well as dataset relatedness and matching algorithms, are proliferating. With this work we argue that this is the right moment to revisit the components of classic data integration (DI) systems and to examine how they fit into modern data lakes that support LA as a first-class citizen.
In this paper, we first investigate how advances in factorized ML and modern data integration techniques influence and can benefit from each other, opening new research opportunities. We then describe Amalur, a reference architecture for a next-generation data lake system that facilitates linear algebra processing over heterogeneous sources. We propose a formal representation based on matrices that connects to the schema-mapping formalism in first-order logic and enables LA factorization over joinable or unionable data in a data lake. Finally, we outline future research challenges for next-generation data lake systems.
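To make the idea of LA factorization over joinable data concrete, the following is a minimal NumPy sketch of the standard factorized-multiplication rewrite (not Amalur's actual implementation; all table names, shapes, and weights are illustrative assumptions). Instead of materializing the join of a fact table S with a dimension table R and then multiplying by a weight vector, the multiplication is pushed down to the base tables, avoiding the redundancy the join would introduce.

```python
import numpy as np

# Hypothetical toy data: fact table S joins dimension table R via foreign key fk.
S = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # 3 rows, 2 features
R = np.array([[10.0],
              [20.0]])              # 2 rows, 1 feature
fk = np.array([0, 1, 0])            # each row of S references a row of R

w_S = np.array([0.5, -1.0])         # weights for S's features
w_R = np.array([2.0])               # weights for R's feature

# Materialized approach: build the join result T, then multiply.
T = np.hstack([S, R[fk]])           # shape (3, 3): S's columns plus R's, join materialized
w = np.concatenate([w_S, w_R])
materialized = T @ w

# Factorized approach: multiply each base table by its slice of the weights,
# then replicate R's partial result through the foreign key.
factorized = S @ w_S + (R @ w_R)[fk]

assert np.allclose(materialized, factorized)
```

The same rewrite underlies factorized training loops, where the multiplication is the dominant per-iteration cost: the factorized form touches each row of R once, however many rows of S reference it.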