Data Integration in Data Lakes

Abstract

Although big data has been discussed for some years, it still poses many research challenges, such as the variety of data. Diverse data sources often exist in information silos, which are collections of non-integrated data management systems with heterogeneous schemas, query languages, and data models. Efficiently integrating, accessing, and querying the large volume of diverse data in these silos is very difficult with traditional ‘schema-on-write’ approaches such as data warehouses. Data lake systems, which are repositories that store raw data in its original formats and provide a common access interface, have been proposed as a solution to this problem. In this talk, I will discuss the landscape of existing data lake problems and our solutions for integrating multiple heterogeneous data sources in data lakes. I will also introduce recent advances in supporting AI in data lakes.

Date
Jun 27, 2022 2:53 PM — Jun 28, 2022 10:00 PM
Location
Basel, Switzerland (HYBRID EVENT)

Slides: https://drive.google.com/file/d/18yJPuC1_8wJFC8emXFfGGcOLQjGDqm9p/view?usp=sharing

Rihan Hai
Assistant professor

My research focuses on data integration and related dataset discovery in large-scale data lakes.