Hi.

I have been hearing a fair bit about Parquet versus ORC tables.

In a nutshell, Parquet and ORC emerged at roughly the same time (both provide
columnar storage), and I notice that Parquet is still widely used,
especially among Spark users.

That said, it appears that Spark users are reluctant to use ORC, despite the
fact that its built-in storage index offers superior optimisation, with data
and statistics at the file, stripe and row-group level. Both Parquet and ORC
offer SNAPPY compression as well. ORC uses ZLIB by default.
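To illustrate the idea behind those stripe-level statistics, here is a toy
Python sketch (the stripe layout and function name are hypothetical stand-ins,
not the actual ORC on-disk format or API): each stripe carries min/max values
for a column, so a reader can skip whole stripes that provably contain no
matching rows.

```python
# Toy model of ORC-style stripe statistics: each "stripe" records the
# min and max of a column alongside its rows. A reader can skip any
# stripe whose statistics prove that no row can satisfy the predicate.
stripes = [
    {"min": 1,   "max": 100, "rows": [5, 42, 99]},
    {"min": 101, "max": 200, "rows": [150, 180]},
    {"min": 201, "max": 300, "rows": [250, 260, 299]},
]

def scan_greater_than(stripes, threshold):
    """Return rows > threshold, skipping stripes where max <= threshold."""
    matches, skipped = [], 0
    for s in stripes:
        if s["max"] <= threshold:  # statistics prove no row here matches
            skipped += 1
            continue
        matches.extend(r for r in s["rows"] if r > threshold)
    return matches, skipped

rows, skipped = scan_greater_than(stripes, 200)
print(rows, skipped)  # → [250, 260, 299] 2
```

The real storage index applies the same principle at file, stripe and
row-group granularity, which is where the I/O savings come from.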

There may be reasons other than technical ones for this adoption, for example
too much reliance on Hive, plus the claim that it is easier to flatten
Parquet than ORC (whatever that means).

For myself, I use either text files or ORC with Hive and Spark, and I don't
really see any reason to adopt others like Avro, Parquet etc.

I would appreciate any verification or experience on this.

Thanks,

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com
