Hi Kane,

It really depends on your use case. I generally use Parquet because it seems to have broader support outside of Spark. One caveat: if you are writing to partitioned Hive tables, the current versions of Spark have an issue where compression will not be applied. This will be fixed in version 2.3.0. See https://issues.apache.org/jira/browse/SPARK-21786 for more details. If you are writing files directly (not through a partitioned Hive table), compression is applied just fine.
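For those direct file writes you can set the codec explicitly. A rough sketch (Scala, Spark 2.x; the DataFrame df and the paths are just placeholders):

// File-level writes, where compression is applied correctly.
// "snappy" is already Parquet's default codec in Spark 2.x;
// shown explicitly here for clarity.
df.write
  .option("compression", "snappy")
  .parquet("hdfs:///tmp/events_parquet")

// ORC takes its own codec names (zlib, snappy, lzo, none).
df.write
  .option("compression", "zlib")
  .orc("hdfs:///tmp/events_orc")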
I also agree with Stephen Joung's recommendation, and I would also watch the accompanying video on O'Reilly to get more context around that slide deck; it gives an in-depth look at how Spark takes advantage of sorting. A quick sketch of saving sorted Parquet and checking the result is below the quoted message.

Regards,
Kurt

On Wed, Feb 21, 2018 at 1:54 PM, Kane Kim <kane.ist...@gmail.com> wrote:
> Hello,
>
> Which format is better supported in spark, parquet or orc?
> Will spark use internal sorting of parquet/orc files (and how to test
> that)?
> Can spark save sorted parquet/orc files?
>
> Thanks!
>
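As promised above, a rough sketch of saving sorted Parquet and spot-checking it (Scala, Spark 2.x; the column name "id" and the paths are placeholders, not from any real job):

// Sort rows within each output partition so each Parquet file is
// internally ordered; the writer then records per-row-group min/max
// statistics that Parquet filter pushdown can use to skip data.
df.sortWithinPartitions("id")
  .write
  .parquet("hdfs:///tmp/events_sorted")

// One way to test: dump the row-group metadata and look at the
// min/max ranges on the sort column, e.g. with parquet-tools:
//   parquet-tools meta <one of the part files under /tmp/events_sorted>

// Or read it back with the source file attached to each row and
// eyeball the ordering within a file:
import org.apache.spark.sql.functions.input_file_name
spark.read.parquet("hdfs:///tmp/events_sorted")
  .withColumn("file", input_file_name())
  .show(20, false)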