Hello,

Why does Spark use Snappy by default when compressing data within Parquet? I've read that, for shuffling, speed is prioritized over compression ratio. Is that true, and are there other factors to consider?
Also, are there any recent benchmarks from the community that compare Spark's performance with Snappy against other codecs? I'd be interested in the impact not only on the intermediate and shuffle files, but also on storage at rest. For example, I know there are separate configuration options for setting the codec used for these internal files and for the final Parquet files stored in the lakehouse.

Before I switch away from the default codec in my work, I want to better understand the tradeoffs.

Thanks,
Nik
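For reference, these are the two configuration properties I believe you are referring to: one controls the codec for Spark's internal data (shuffle spills, broadcasts, etc.) and the other controls the codec written into Parquet files. A minimal `spark-defaults.conf` sketch, with the stock defaults shown for comparison:

```
# Codec for internal/intermediate data (shuffle, broadcast, RDD spills).
# Default is lz4; other accepted values include snappy, zstd, and lzf.
spark.io.compression.codec        lz4

# Codec embedded in Parquet files written by Spark SQL.
# Default is snappy; other accepted values include gzip, zstd, lz4,
# and uncompressed.
spark.sql.parquet.compression.codec  snappy

# Shuffle output compression is on by default and uses the codec above.
spark.shuffle.compress            true
```

The same keys can be set per-session via `spark.conf.set(...)`, or per-write with the Parquet writer option `option("compression", "zstd")`, so you can experiment on a single job before changing cluster-wide defaults.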