Our paper on feature discovery was accepted at ICDE 2024

AutoFeat: Transitive Feature Discovery over Join Paths

by Andra Ionescu, Kiril Vasilev, Florena Buse, Rihan Hai, Asterios Katsifodimos

Abstract: Can we automatically discover machine learning (ML) features in a large data lake in order to increase the accuracy of a given ML model? Existing solutions either focus on simple star schemata, failing to discover features in more complex real-world schemata, or consider only PK-FK relationships in clean, curated databases. However, real-world data lakes can contain long join paths of uncurated joinability relationships resulting from automated dataset discovery methods. This paper proposes a novel ranking-based feature discovery method called AutoFeat. Given a base table with a target label, AutoFeat explores multi-hop, transitive join paths to find relevant features in order to augment the base table with additional features, ultimately leading to increased accuracy of an ML model. AutoFeat is general: it evaluates the predictive power of features without the need to train an ML model, ranking join paths using the concepts of relevance and redundancy. Our experiments on real-world open data show that AutoFeat is efficient: it can find features of high predictive power on data lakes with an increased number of dataset joinability relationships 5x-44x faster than baseline approaches. In addition, AutoFeat is effective, improving accuracy by 16% on average compared to the baseline approaches, even in noisy, uncurated data lakes.
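
To make the idea of multi-hop, transitive join paths more concrete, here is a minimal sketch of how candidate paths could be enumerated from a base table over a joinability graph. It is an illustration only, not AutoFeat's actual implementation: the function name `enumerate_join_paths`, the `max_hops` parameter, and the table names are all hypothetical, and the ranking by relevance and redundancy described in the abstract is not shown.

```python
from collections import deque


def enumerate_join_paths(joinable, base, max_hops):
    """Breadth-first enumeration of acyclic join paths that start at `base`.

    `joinable` maps a table name to the set of tables it can be joined with
    (an assumed, simplified view of discovered joinability relationships).
    Each returned path is a tuple of table names with at most `max_hops` joins.
    """
    paths = []
    queue = deque([(base,)])
    while queue:
        path = queue.popleft()
        # A path with k tables uses k - 1 joins; stop extending at the limit.
        if len(path) - 1 >= max_hops:
            continue
        # Sort neighbours so the enumeration order is deterministic.
        for neighbour in sorted(joinable.get(path[-1], ())):
            if neighbour in path:
                continue  # never revisit a table on the same path
            extended = path + (neighbour,)
            paths.append(extended)
            queue.append(extended)
    return paths


# Hypothetical joinability graph over five tables.
joinable = {
    "base": {"customers", "regions"},
    "customers": {"base", "orders"},
    "regions": {"base"},
    "orders": {"customers", "products"},
    "products": {"orders"},
}

paths = enumerate_join_paths(joinable, "base", max_hops=2)
for p in paths:
    print(" -> ".join(p))
```

In a full pipeline, each enumerated path would then be scored, for example by how relevant its features are to the target label and how redundant they are with features already selected, so that only the most promising paths are joined to the base table.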

Rihan Hai
Assistant professor

My research focuses on data integration and related dataset discovery in large-scale data lakes.