Hi Kane,

It really depends on your use case. I generally use Parquet because it seems to have broader support outside of Spark. One caveat: if you are writing to partitioned Hive tables, the current versions of Spark have an issue where compression will not be applied. This will be fixed in version 2.3.0. See https://issues.apache.org/jira/browse/SPARK-21786 for more details. If you are writing files directly (not through a partitioned Hive table), compression is applied just fine.
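For those direct file writes you can set the codec explicitly. A rough sketch (Scala, Spark 2.x; the DataFrame df and the paths are just placeholders):

// File-level writes, where compression is applied correctly.
// "snappy" is already Parquet's default codec in Spark 2.x;
// shown explicitly here for clarity.
df.write
  .option("compression", "snappy")
  .parquet("hdfs:///tmp/events_parquet")

// ORC takes its own codec names (zlib, snappy, lzo, none).
df.write
  .option("compression", "zlib")
  .orc("hdfs:///tmp/events_orc")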
I also agree with Stephen Joung's recommendation, and I would also watch the accompanying video on O'Reilly to get more context around that slide deck; it gives an in-depth look at how Spark takes advantage of sorting. A quick sketch of saving sorted Parquet and checking the result is below the quoted message.

Regards,
Kurt

On Wed, Feb 21, 2018 at 1:54 PM, Kane Kim <kane.ist...@gmail.com> wrote:
> Hello,
>
> Which format is better supported in spark, parquet or orc?
> Will spark use internal sorting of parquet/orc files (and how to test
> that)?
> Can spark save sorted parquet/orc files?
>
> Thanks!
>
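As promised above, a rough sketch of saving sorted Parquet and spot-checking it (Scala, Spark 2.x; the column name "id" and the paths are placeholders, not from any real job):

// Sort rows within each output partition so each Parquet file is
// internally ordered; the writer then records per-row-group min/max
// statistics that Parquet filter pushdown can use to skip data.
df.sortWithinPartitions("id")
  .write
  .parquet("hdfs:///tmp/events_sorted")

// One way to test: dump the row-group metadata and look at the
// min/max ranges on the sort column, e.g. with parquet-tools:
//   parquet-tools meta <one of the part files under /tmp/events_sorted>

// Or read it back with the source file attached to each row and
// eyeball the ordering within a file:
import org.apache.spark.sql.functions.input_file_name
spark.read.parquet("hdfs:///tmp/events_sorted")
  .withColumn("file", input_file_name())
  .show(20, false)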