Thank you for the detailed response. This is helpful. I’ll read your article and test my data as you’ve described.
On Tue, Aug 26, 2025 at 3:05 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi Nikolas,
>
> *Why Spark defaults to Snappy for Parquet.* In analytics scans the
> bottleneck is usually *CPU to decompress Parquet pages*, not raw I/O.
> Snappy gives *very fast decode* at a decent ratio, so end-to-end query
> latency is typically better than with heavier codecs like GZIP. For colder
> data, GZIP (or ZSTD) can make sense if you’re chasing storage savings and
> can afford slower reads.
>
> Two different codec decisions to make
>
> 1. Intermediates (shuffle/spill/broadcast) — speed > ratio
>    I keep fast codecs here; changing them rarely helps unless the
>    network/disk is the bottleneck and I have spare CPU:
>
>    spark.conf.set("spark.shuffle.compress", "true")
>    spark.conf.set("spark.shuffle.spill.compress", "true")
>    spark.conf.set("spark.io.compression.codec", "lz4") // snappy or zstd are also viable
>
> 2. Storage at rest (final Parquet in the lake/lakehouse) — pick by hot vs cold
>
>    - *Hot / frequently scanned:* *Snappy* for fastest reads.
>    - *Cold / archival:* *GZIP* (or try *ZSTD*) for much smaller files;
>      accept slower scans.
>
>    spark.conf.set("spark.sql.parquet.compression.codec", "snappy") // or "gzip" or "zstd"
>
> This mirrors what I wrote up for *BigQuery external Parquet on object
> storage* (attached; different engine, same storage trade-off): I used
> *Parquet + GZIP* when exporting to Cloud Storage (great size reduction)
> and noted that *external tables read slower than native*, so I keep hot
> data “native” and push colder tiers to cheaper storage with heavier
> compression. In that piece, a toy query ran ~*190 ms* on native vs
> ~*296 ms* on the external table (≈56% slower), which is the kind of
> latency gap you trade for cost/footprint savings on colder data.
>
> *Bigger levers than the codec*
> The codec choice matters, but *reading fewer bytes* matters more!
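[The speed-vs-ratio trade-off described above can be sketched without Spark, using only Python's stdlib codecs. This is an illustration, not a Parquet benchmark: zlib at level 1 loosely stands in for a fast codec like Snappy/LZ4, zlib at level 9 and lzma for heavier codecs like GZIP/ZSTD at high levels. Timings and sizes will vary with the payload.]

```python
# Stdlib-only sketch of the compression speed-vs-ratio trade-off.
# zlib level 1 ~ "fast" codec; zlib level 9 / lzma ~ "dense" codecs.
import lzma
import time
import zlib

# Compressible sample payload (repetitive, like typical columnar data).
payload = b"user_id,event,ts\n" + b"1001,click,1693000000\n" * 50_000

def measure(name, compress, decompress):
    t0 = time.perf_counter()
    blob = compress(payload)
    t1 = time.perf_counter()
    decompress(blob)
    t2 = time.perf_counter()
    return name, len(blob), t1 - t0, t2 - t1

results = [
    measure("zlib-1 (fast)",  lambda d: zlib.compress(d, 1), zlib.decompress),
    measure("zlib-9 (dense)", lambda d: zlib.compress(d, 9), zlib.decompress),
    measure("lzma (densest)", lzma.compress, lzma.decompress),
]
for name, size, ct, dt in results:
    print(f"{name:16s} {size:>9,d} B  "
          f"compress {ct * 1e3:7.1f} ms  decompress {dt * 1e3:7.1f} ms")
```

The same shape of experiment, run against real Parquet files with Snappy/GZIP/ZSTD, is what the benchmark section below describes.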
> In my article I lean heavily on *Hive-style partition layouts* for
> external Parquet (multiple partition keys, strict directory order), and
> call out gotchas like keeping *non-Parquet junk out of leaf directories*
> (external table creation/reads can fail/slow if the layout’s messy).
>
> How I would benchmark on your data
> Write the same dataset three ways (snappy, gzip, zstd), then measure:
>
> - total bytes on storage,
> - Spark SQL *scan time* and *CPU time* in the UI,
> - effect of *partition pruning* with realistic filters.
>
> Keep the shuffle settings fast (above) so you’re testing scan costs,
> not an artificially slow shuffle.
>
> My rules of thumb
>
> - If *latency* and interactive work matter → *Snappy* Parquet.
> - If *storage $$* dominates and reads are rare → *GZIP* (or *ZSTD* as a
>   middle ground).
> - Regardless of codec, *partition pruning + sane file sizes* move the
>   needle the most (that’s the core of my “Hybrid Curated Storage”
>   approach).
>
> HTH
>
> Regards
>
> Dr Mich Talebzadeh,
> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>
> (P.S. The background and examples I referenced are from my article on
> using *GCS external Parquet* with *Snappy/GZIP/ZSTD* and Hive
> partitioning for cost/perf balance — feel free to skim the
> compression/export and partitioning sections.)
>
> view my LinkedIn profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> On Tue, 26 Aug 2025 at 17:59, Nikolas Vanderhoof
> <nikolasrvanderh...@gmail.com> wrote:
>
>> Hello,
>>
>> Why does Spark use Snappy by default when compressing data within
>> Parquet? I’ve read that when shuffling, speed is prioritized above
>> compression ratio. Is that true, and are there other things to consider?
>>
>> Also, are there any recent benchmarks that the community has performed
>> that evaluate the performance of Spark when using Snappy compared to
>> other codecs?
>> I’d be interested not only in the impact of using other codecs for the
>> intermediate and shuffle files, but also for the storage at rest. For
>> example, I know there are different configuration options that allow me
>> to set the codec for these internal files, or for the final Parquet
>> files stored in the lakehouse.
>>
>> Before I decide to use a codec other than the default in my work, I
>> want to understand the tradeoffs better.
>>
>> Thanks,
>> Nik