This class implements sort-based shuffle's hash-style shuffle fallback path. This write path
writes incoming records to separate files, one file per reduce partition, then concatenates these
per-partition files to form a single output file, regions of which are served to reducers.
Records are not buffered in memory. The output is written in a format
that can be served and consumed via
org.apache.spark.shuffle.IndexShuffleBlockResolver.
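The write-then-concatenate flow described above can be sketched roughly as follows. This is a minimal illustration, not Spark's implementation: the class BypassMergeSketch, its methods, the string records, and the hash partitioner are all hypothetical, and only plain java.nio file APIs are used. The returned per-partition lengths stand in for the information a reader would need to locate each partition's region in the single output file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class BypassMergeSketch {

    /** Hypothetical partitioner: non-negative hash of the key modulo the partition count. */
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /**
     * Writes each record straight to its partition's temporary file, then concatenates
     * those files into `output`. Returns the byte length of each partition's region.
     */
    static long[] write(List<String> keys, int numPartitions, Path output) throws IOException {
        Path[] files = new Path[numPartitions];
        OutputStream[] streams = new OutputStream[numPartitions];
        // One open stream per reduce partition, all at the same time.
        for (int i = 0; i < numPartitions; i++) {
            files[i] = Files.createTempFile("shuffle-part-" + i + "-", ".tmp");
            streams[i] = Files.newOutputStream(files[i]);
        }
        try {
            // Records go directly to the partition's stream; nothing is held in memory.
            for (String key : keys) {
                byte[] bytes = (key + "\n").getBytes(StandardCharsets.UTF_8);
                streams[partitionOf(key, numPartitions)].write(bytes);
            }
        } finally {
            for (OutputStream s : streams) {
                s.close();
            }
        }
        // Concatenate the per-partition files in partition order and record their lengths.
        long[] lengths = new long[numPartitions];
        try (OutputStream out = Files.newOutputStream(output)) {
            for (int i = 0; i < numPartitions; i++) {
                lengths[i] = Files.copy(files[i], out);
                Files.delete(files[i]);
            }
        }
        return lengths;
    }
}
```

In this sketch, the region for partition i starts at the sum of the lengths of partitions 0 to i-1, so a reader needs only the lengths array and the single output file.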
This write path is inefficient for shuffles with large numbers of reduce partitions because it
simultaneously opens separate serializers and file streams for all partitions. As a result,
SortShuffleManager only selects this write path when
- no Ordering is specified,
- no Aggregator is specified, and
- the number of partitions is less than or equal to
spark.shuffle.sort.bypassMergeThreshold.
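The selection rule above amounts to a simple predicate. The sketch below is illustrative only: BypassDecisionSketch and shouldBypass are hypothetical names, not Spark's API, and the threshold is passed in explicitly instead of being read from configuration. Treating the threshold as inclusive (`<=`) is an assumption about the boundary case.

```java
public class BypassDecisionSketch {

    /**
     * Returns true when the bypass-merge write path would be eligible:
     * no ordering, no aggregator, and a partition count within the threshold.
     */
    static boolean shouldBypass(boolean hasOrdering, boolean hasAggregator,
                                int numPartitions, int bypassMergeThreshold) {
        // Assumption: the threshold itself still qualifies (inclusive comparison).
        return !hasOrdering && !hasAggregator && numPartitions <= bypassMergeThreshold;
    }
}
```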
This code used to be part of
org.apache.spark.util.collection.ExternalSorter but was
refactored into its own class in order to reduce code complexity; see SPARK-7855 for details.
There have been proposals to completely remove this code path; see SPARK-6026 for details.