Hello,

Why does Spark use Snappy by default when compressing data within Parquet?
I’ve read that for shuffle data, speed is prioritized over compression
ratio. Is that true, and are there other factors to consider?

Also, are there any recent benchmarks from the community that evaluate
Spark’s performance with Snappy compared to other codecs? I’d be
interested in the impact not only for the intermediate and shuffle files,
but also for data at rest. For example, I know there are different
configuration options that let me set the codec for these internal files,
or for the final Parquet files stored in the lakehouse.
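
For context, these are the kinds of settings I mean (names taken from the
Spark configuration docs; the defaults are as I understand them for recent
3.x releases, so correct me if I have them wrong):

```properties
# Codec for internal data: shuffle spills, broadcast variables, RDD blocks.
# I believe the default here is lz4, not snappy.
spark.io.compression.codec            lz4

# Codec for Parquet files written by Spark SQL (default snappy);
# other supported values include gzip and zstd.
spark.sql.parquet.compression.codec   snappy
```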

Before I switch from the default codec in my own work, I want to better
understand the tradeoffs.

Thanks,
Nik