February 11, 2021
Published by Neville Li, Claire McGinty, Sahith Nallapareddy, & Joel Östlund
In this post we’ll discuss how Spotify optimized and sped up elements from our largest Dataflow job, Wrapped 2019, for Wrapped 2020 using a technique called Sort Merge Bucket (SMB) join. We’ll present the design and implementation of SMB and how we incorporated it into our data pipelines.
Introduction
Shuffle is the core building block for many big data transforms, such as a join, GroupByKey, or other reduce operations. Unfortunately, it’s also one of the most expensive steps in many pipelines. Sort Merge Bucket is an optimization that reduces shuffle by doing work up front on the producer side. The intuition is that for datasets commonly and frequently joined on a known key, e.g., user events with user metadata on a user ID, we can write them in bucket files with records bucketed and sorted by that key. By knowing which files contain a subset of keys and in what order, shuffle becomes a matter of merge-sorting values from matching bucket files, completely eliminating costly disk and network I/O of moving key–value pairs around. Andrea Nardelli carried out the original investigation on Sort Merge Buckets for his 2018 master’s thesis, and we started looking into generalizing the idea as a Scio module afterwards.