Re: parquet vs orc files

Jörn Franke Thu, 22 Feb 2018 11:42:24 -0800

Look at the documentation of the formats. In any case:
* additionally partition the data on the filesystem
* sort the data on the filter columns, otherwise you do not benefit from min/max 
indexes and bloom filters
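
To illustrate the two points above, here is a rough sketch rather than a description of the actual ORC or Parquet internals. The first part is a toy model of min/max pruning: values are split into fixed-size groups, each group records its minimum and maximum, and a lookup only has to read groups whose range could contain the target. The group size, the sample data and the helper names are assumptions made for the illustration. The second part shows how a sorted, partitioned ORC write with a bloom filter might look in PySpark; the column names `event_date` and `user_id` are hypothetical, and the `orc.bloom.filter.columns` option is assumed to be passed through to the ORC writer as in recent Spark versions.

```python
import random


def build_row_groups(values, group_size):
    """Split values into fixed-size groups and record each group's min and max."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups


def groups_to_read(groups, target):
    """Count the groups whose [min, max] range could contain the target value."""
    return sum(1 for g in groups if g["min"] <= target <= g["max"])


random.seed(0)
values = list(range(10_000))
shuffled = values[:]
random.shuffle(shuffled)

# Same data, once written in sorted order and once in random order.
sorted_groups = build_row_groups(sorted(values), 1_000)
unsorted_groups = build_row_groups(shuffled, 1_000)

sorted_reads = groups_to_read(sorted_groups, 4_242)
unsorted_reads = groups_to_read(unsorted_groups, 4_242)
print("sorted data: groups to read =", sorted_reads, "of", len(sorted_groups))
print("unsorted data: groups to read =", unsorted_reads, "of", len(unsorted_groups))


def write_sorted_orc(df, path):
    """Hypothetical helper: write a Spark DataFrame as partitioned, sorted ORC.

    Assumes df has the (made-up) columns event_date and user_id.
    """
    (
        df.repartition("event_date")           # group rows of one directory together
        .sortWithinPartitions("user_id")       # sort on the filter column
        .write.partitionBy("event_date")       # partition directories on the filesystem
        .option("orc.bloom.filter.columns", "user_id")  # assumed ORC writer option
        .orc(path)
    )
```

With sorted data the ranges of the groups do not overlap, so a point lookup touches a single group. With shuffled data nearly every group's range covers the target and nothing can be skipped.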




> On 21. Feb 2018, at 22:58, Kane Kim <kane.ist...@gmail.com> wrote:
> 
> Thanks, how does min/max index work? Can spark itself configure bloom filters 
> when saving as orc?
> 
>> On Wed, Feb 21, 2018 at 1:40 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>> In the latest version both are equally well supported.
>> 
>> You need to insert the data sorted on the filtering columns.
>> Then you will benefit from min/max indexes and, in the case of ORC, additionally 
>> from bloom filters, if you configure them.
>> In any case I also recommend partitioning the files (do not confuse this with 
>> Spark partitioning).
>> 
>> You have to figure out in a test what is best for you. This highly depends 
>> on the data and the analysis you want to do.
>> 
>> > On 21. Feb 2018, at 21:54, Kane Kim <kane.ist...@gmail.com> wrote:
>> >
>> > Hello,
>> >
>> > Which format is better supported in spark, parquet or orc?
>> > Will spark use internal sorting of parquet/orc files (and how to test 
>> > that)?
>> > Can spark save sorted parquet/orc files?
>> >
>> > Thanks!
>
