Shuffle Overhead Analysis for the Layered Data Abstractions

Barhate, Pratik Narhar

Apache Spark is one of the most widely adopted open-source Big Data processing engines. High performance and ease of use for a wide class of users are some of the primary reasons for the wide adoption. Although data partitioning increases…

Apache Spark is one of the most widely adopted open-source Big Data processing engines. High performance and ease of use for a wide class of users are some of the primary reasons for the wide adoption. Although data partitioning increases the performance of the analytics workload, its application to Apache Spark is very limited due to layered data abstractions. Once data is written to a stable storage system like Hadoop Distributed File System (HDFS), the data locality information is lost, and while reading the data back into Spark’s in-memory layer, the reading process is random which incurs shuffle overhead. This report investigates the use of metadata information that is stored along with the data itself for reducing shuffle overload in the join-based workloads. It explores the Hyperspace library to mitigate the shuffle overhead for Spark SQL applications. The article also introduces the Lachesis system to solve the shuffle overhead problem. The benchmark results show that the persistent partition and co-location techniques can be beneficial for matrix multiplication using SQL (Structured Query Language) operator along with the TPC-H analytical queries benchmark. The study concludes with a discussion about the trade-offs of using integrated stable storage to layered storage abstractions. It also discusses the feasibility of integration of the Machine Learning (ML) inference phase with the SQL operators along with cross-engine compatibility for employing data locality information.

Copyright Statement