Data lake concept and systems: a survey

Abstract

Although big data has been discussed for some years, it still has many research challenges, especially the variety of data. It poses a huge difficulty to efficiently integrate, access, and query the large volume of diverse data in information silos with the traditional ‘schema-on-write’ approaches such as data warehouses. Data lakes have been proposed as a solution to this problem. They are repositories storing raw data in its original formats and providing a common access interface. This survey reviews the development, definition, and architectures of data lakes. We provide a comprehensive overview of research questions for designing and building data lakes. We classify the existing data lake systems based on their provided functions, which makes this survey a useful technical reference for designing, implementing and applying data lakes. We hope that the thorough comparison of existing solutions and the discussion of open research challenges in this survey would motivate the future development of data lake research and practice

Publication
arXiv preprint arXiv:2106.09592
Rihan Hai
Rihan Hai
Assistant professor

My research focuses on data integration and related dataset discovery in large-scale data lakes.