Thank you for the detailed response. This is helpful. I’ll read your
article and test my data as you’ve described.

On Tue, Aug 26, 2025 at 3:05 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Nikolas,
>
> *Why Spark defaults to Snappy for Parquet.* In analytics scans the
> bottleneck is usually *CPU to decompress Parquet pages*, not raw I/O.
> Snappy gives *very fast decode* at a decent ratio, so end-to-end query
> latency is typically better than with heavier codecs like GZIP. For colder data,
> GZIP (or ZSTD) can make sense if you’re chasing storage savings and can
> afford slower reads.
>
> Two different codec decisions to make
>
>    1. Intermediates (shuffle/spill/broadcast): speed > ratio.
>    I keep fast codecs here; changing them rarely helps unless the
>    network/disk is the bottleneck and I have spare CPU:
>
>    spark.conf.set("spark.shuffle.compress", "true")
>    spark.conf.set("spark.shuffle.spill.compress", "true")
>    spark.conf.set("spark.io.compression.codec", "lz4")  // snappy or zstd are also viable
>
>    2. Storage at rest (final Parquet in the lake/lakehouse): pick by hot
>    vs cold.
>
>       - *Hot / frequently scanned:* *Snappy* for fastest reads.
>       - *Cold / archival:* *GZIP* (or try *ZSTD*) for much smaller files;
>         accept slower scans.
>
>    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")  // or "gzip" or "zstd"
>
>
> This mirrors what I wrote up for *BigQuery external Parquet on object
> storage* (attached; different engine, same storage trade-off): I used
> *Parquet + GZIP* when exporting to Cloud Storage (great size reduction)
> and noted that *external tables read slower than native*, so I keep hot
> data “native” and push colder tiers to cheaper storage with heavier
> compression. In that piece, a toy query ran ~*190 ms* on native vs
> ~*296 ms* on the external table (≈56% slower), which is the kind of
> latency gap you trade for cost/footprint savings on colder data.
>
> *Bigger levers than the codec*
> The codec choice matters, but *reading fewer bytes* matters more! In my
> article I lean heavily on *Hive-style partition layouts* for external
> Parquet (multiple partition keys, strict directory order), and call out
> gotchas like keeping *non-Parquet junk out of leaf directories* (external
> table creation and reads can fail or slow down if the layout is messy).
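>
> As a concrete illustration (the column names and paths below are made up),
> a partitioned write plus a pruned read might look like:
>
>    // write with multiple partition keys; directories come out as
>    // year=.../month=.../day=...
>    df.write
>      .partitionBy("year", "month", "day")
>      .option("compression", "snappy")
>      .parquet("gs://my-lake/curated/events")   // hypothetical path
>
>    // a realistic filter lets Spark prune whole directories before reading bytes
>    val recent = spark.read.parquet("gs://my-lake/curated/events")
>      .filter("year = 2025 AND month = 8")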
>
> How I would benchmark on your data
> Write the same dataset three ways (snappy, gzip, zstd), then measure
> (a rough sketch follows this list):
>
>    - total bytes on storage,
>    - Spark SQL *scan time* and *CPU time* in the UI,
>    - effect of *partition pruning* with realistic filters.
>
>    Keep the shuffle settings fast (above) so you’re testing scan costs,
>    not an artificially slow shuffle.
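>
> A minimal Scala sketch of that loop (the DataFrame df, bucket, and filter
> are placeholders; the Spark UI / event log give better scan and CPU
> numbers than wall-clock, but this is a first cut):
>
>    val codecs = Seq("snappy", "gzip", "zstd")
>    codecs.foreach { codec =>
>      val path = s"gs://my-bench-bucket/events_$codec"   // hypothetical bucket/prefix
>      df.write.option("compression", codec).mode("overwrite").parquet(path)
>
>      val t0 = System.nanoTime()
>      val n = spark.read.parquet(path)
>        .filter("year = 2025 AND month = 8")             // realistic pruning filter
>        .count()                                         // forces a full scan job
>      val ms = (System.nanoTime() - t0) / 1e6
>      println(f"$codec%-6s rows=$n scanMs=$ms%.0f")      // compare bytes on storage separately
>    }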
>
> My rules of thumb
>
>    - If *latency* and interactive work matter → *Snappy* Parquet.
>    - If *storage $$* dominates and reads are rare → *GZIP* (or *ZSTD* as
>      a middle ground).
>    - Regardless of codec, *partition pruning + sane file sizes* move the
>      needle the most (that’s the core of my “Hybrid Curated Storage”
>      approach).
> HTH
>
> Regards
> Dr Mich Talebzadeh,
> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>
> (P.S. The background and examples I referenced are from my article on
> using *GCS external Parquet* with *Snappy/GZIP/ZSTD* and Hive
> partitioning for cost/perf balance—feel free to skim the compression/export
> and partitioning sections.)
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> On Tue, 26 Aug 2025 at 17:59, Nikolas Vanderhoof <
> nikolasrvanderh...@gmail.com> wrote:
>
>> Hello,
>>
>> Why does Spark use Snappy by default when compressing data within
>> Parquet? I’ve read that when shuffling, speed is prioritized above
>> compression ratio. Is that true, and are there other things to consider?
>>
>> Also, are there any recent benchmarks that the community has performed
>> that evaluate the performance of Spark when using Snappy compared to other
>> codecs? I’d be interested not only in the impact when using other codecs
>> for the intermediate and shuffle files, but also for the storage at rest.
>> For example, I know there are different configuration options that allow me
>> to set the codec for these internal files, or for the final parquet files
>> stored in the lakehouse.
>>
>> Before I decide to use a codec other than the default in my work, I want
>> to understand any tradeoffs better.
>>
>> Thanks,
>> Nik
>>
>
