To add: schema evolution is better supported in Parquet than in ORC (at the cost of somewhat slower reads), since ORC resolves columns strictly by position; this matters especially if you may want to drop a column later.
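For illustration, a minimal Spark (Scala) sketch of Parquet schema merging; the paths and column names here are made up, not from this thread:

    // Sketch: Parquet schema evolution via mergeSchema (hypothetical paths/columns).
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("schema-evolution").getOrCreate()
    import spark.implicits._

    // Two writes with different schemas under the same base path.
    Seq((1, "a")).toDF("id", "col_a")
      .write.parquet("/tmp/evolve/part=1")
    Seq((2, "a", "b")).toDF("id", "col_a", "col_b")
      .write.parquet("/tmp/evolve/part=2")

    // mergeSchema reconciles both schemas; col_b is null for the older rows.
    val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/evolve")
    merged.printSchema()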
Regards,
Sushrut Ikhar
about.me/sushrutikhar <https://about.me/sushrutikhar?promo=email_sig>

On Fri, Feb 23, 2018 at 1:10 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> Look at the documentation of the formats. In any case:
> * additionally use partitions on the filesystem
> * sort the data on the filter columns - otherwise you do not benefit from
> min/max indexes and bloom filters
>
> On 21. Feb 2018, at 22:58, Kane Kim <kane.ist...@gmail.com> wrote:
>
> Thanks, how does the min/max index work? Can spark itself configure bloom
> filters when saving as orc?
>
> On Wed, Feb 21, 2018 at 1:40 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> In the latest version both are equally well supported.
>>
>> You need to insert the data sorted on the filtering columns.
>> Then you will benefit from min/max indexes and, in the case of ORC,
>> additionally from bloom filters, if you configure them.
>> In any case I also recommend partitioning the files (not to be confused
>> with Spark partitioning).
>>
>> What is best for you, you have to figure out in a test. This highly
>> depends on the data and the analysis you want to do.
>>
>> > On 21. Feb 2018, at 21:54, Kane Kim <kane.ist...@gmail.com> wrote:
>> >
>> > Hello,
>> >
>> > Which format is better supported in spark, parquet or orc?
>> > Will spark use internal sorting of parquet/orc files (and how to test
>> > that)?
>> > Can spark save sorted parquet/orc files?
>> >
>> > Thanks!
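To make the write-side advice above concrete, a minimal Spark (Scala) sketch combining filesystem partitions via partitionBy, data sorted on the filter column, and ORC bloom filters configured through writer options. Paths, column names, and the input source are assumptions; the orc.* options apply to Spark's native ORC writer:

    // Sketch: sorted, partitioned ORC write with bloom filters (hypothetical names).
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("orc-sorted-write").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("/data/events")        // hypothetical input

    df.repartition($"event_date")                      // group rows by the partition key
      .sortWithinPartitions("user_id")                 // sorted data -> tight min/max stats
      .write
      .partitionBy("event_date")                       // filesystem-level partitions
      .option("orc.bloom.filter.columns", "user_id")   // build a bloom filter on user_id
      .option("orc.bloom.filter.fpp", "0.05")          // target false-positive rate
      .orc("/data/events_orc")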