URL of JLESC: https://jlesc.github.io

URL of the 5th JLESC workshop: https://jlesc.github.io/events/5th-jlesc-workshop/

Talk by Gabriel Antoniu: Spark versus Flink: Understanding Performance in Big Data Analytics Frameworks

Abstract: Big Data analytics has recently gained popularity as a tool to process large amounts of data on demand. Spark and Flink are two Apache data analytics frameworks that facilitate the development of multi-step data pipelines using directed acyclic graph (DAG) patterns. Making the most of these frameworks is challenging because efficient execution relies heavily on complex parameter configurations and on an in-depth understanding of the underlying architectural choices.

Although extensive research has been devoted to improving and evaluating the performance of such analytics frameworks, most studies benchmark them against Hadoop as a baseline, a rather unfair comparison given the fundamentally different design principles. This work aims to bring some justice in this respect by directly comparing the performance of Spark and Flink. Our goal is to identify and explain the impact of the different architectural choices and parameter configurations on the perceived end-to-end performance. To this end, we develop a methodology for correlating the parameter settings and the operator execution plan with the resource usage. We use this methodology to dissect the performance of Spark and Flink with several representative batch and iterative workloads on up to 100 nodes. We highlight how performance correlates with operators, with resource usage, and with the specifics of the internal framework design.