Amalur: Next-generation Data Integration in Data Lakes
Data science workflows require extracting, preparing, and integrating data from multiple data sources. For lack of proper tooling, this process is cumbersome and hinders the productivity of data scientists. It is also slow: most of the time, data scientists prepare data in a data processing system or a data lake and export it as a table so that a Machine Learning (ML) algorithm can consume it. Recent advances in factorized ML allow us to push down certain linear algebra (LA) operators and execute them closer to the data sources. At the same time, novel data exploration and discovery tools, as well as dataset relatedness and matching algorithms, are proliferating. With the Amalur project, we believe that this is the right moment to revisit all the components of classic data integration (DI) systems and to see how they fit into modern data lakes that are meant to support LA as a first-class citizen. In this project, we investigate how the advances in factorized ML and modern data integration techniques influence and can benefit from each other, forming new research opportunities.
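To make the idea of pushing an LA operator down to the sources concrete, the following is a minimal sketch, not Amalur's implementation. It assumes a two-table scenario in which a fact table S references a dimension table R through a foreign-key index array fk; the names S, R, fk, w and both function names are hypothetical and chosen only for illustration. The joined table T = [S | R[fk]] is never built in the factorized variant: the matrix-vector product is computed per source table and then combined.

```python
import numpy as np

def materialized_matvec(S, R, fk, w):
    """Baseline: materialize the join T = [S | R[fk]], then compute T @ w."""
    T = np.hstack([S, R[fk]])
    return T @ w

def factorized_matvec(S, R, fk, w):
    """Factorized variant: multiply each base table separately.

    T @ w = S @ w_S + (R @ w_R)[fk], so R is multiplied once per
    distinct row instead of once per joined row.
    """
    d_s = S.shape[1]              # number of feature columns coming from S
    partial_R = R @ w[d_s:]       # one value per row of R
    return S @ w[:d_s] + partial_R[fk]

# Illustrative data (randomly generated, not from any real dataset).
rng = np.random.default_rng(0)
S = rng.normal(size=(6, 2))          # 6 fact rows, 2 features
R = rng.normal(size=(3, 4))          # 3 dimension rows, 4 features
fk = np.array([0, 2, 1, 0, 2, 2])    # row of R matched by each row of S
w = rng.normal(size=6)               # weights over the 2 + 4 joined columns

same = np.allclose(materialized_matvec(S, R, fk, w),
                   factorized_matvec(S, R, fk, w))
```

Both functions return the same vector; the factorized one avoids copying each row of R once per matching row of S, which is where the savings would come from when the dimension table is much smaller than the join result.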