Re: ORC v/s Parquet for Spark 2.0

Sudhir Babu Pothineni Tue, 26 Jul 2016 15:19:44 -0700

Just a correction:

The ORC Java libraries from Hive have been forked into Apache ORC. Vectorization is the default.

I do not know whether Spark is leveraging this new repo.

<dependency>
    <groupId>org.apache.orc</groupId>
    <artifactId>orc</artifactId>
    <version>1.1.2</version>
    <type>pom</type>
</dependency>

Sent from my iPhone
> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
> 
> parquet was inspired by dremel but written from the ground up as a library 
> with support for a variety of big data systems (hive, pig, impala, cascading, 
> etc.). it is also easy to add new support, since it's a proper library.
> 
> orc has been enhanced while deployed at facebook in hive and at yahoo in 
> hive. just hive. it didn't really exist by itself. it was part of the big 
> java soup that is called hive, without an easy way to extract it. hive does 
> not expose proper java apis. it never cared for that.
> 
>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU 
>> <ovidiu-cristian.ma...@inria.fr> wrote:
>> Interesting opinion, thank you
>> 
>> Still, according to the websites, Parquet was basically inspired by Dremel 
>> (Google) [1], and parts of ORC were enhanced while deployed at Facebook and 
>> Yahoo [2].
>> 
>> Other than this presentation [3], do you know of any other benchmarks?
>> 
>> [1] https://parquet.apache.org/documentation/latest/
>> [2] https://orc.apache.org/docs/
>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>> 
>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>> 
>>> when parquet came out it was developed by a community of companies, and was 
>>> designed as a library to be supported by multiple big data projects. nice
>>> 
>>> orc on the other hand initially only supported hive. it wasn't even 
>>> designed as a library that can be re-used. even today it brings in the 
>>> kitchen sink of transitive dependencies. yikes
>>> 
>>> 
>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>> I think both are very similar, but with slightly different goals. While 
>>>> they work transparently for each Hadoop application, you need to enable 
>>>> specific support in the application for predicate pushdown. 
>>>> In the end you have to check which application you are using and do some 
>>>> tests (with the correct predicate pushdown configuration). Keep in mind 
>>>> that both formats work best if they are sorted on the filter columns 
>>>> (which is your responsibility) and if their optimizations are correctly 
>>>> configured (min/max index, bloom filter, compression, etc.).
>>>> 
>>>> If you need to ingest sensor data, you may want to store it first in HBase 
>>>> and then batch process it in large files in ORC or Parquet format.
>>>> 
>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com> wrote:
>>>>> 
>>>>> Just wondering about the advantages and disadvantages of converting data 
>>>>> into ORC or Parquet. 
>>>>> 
>>>>> The Spark documentation has numerous examples using the Parquet format. 
>>>>> 
>>>>> Are there any strong reasons to choose Parquet over the ORC file format?
>>>>> 
>>>>> Also: the current data compression is bzip2.
>>>>> 
>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>>>  
>>>>> This seems biased.
>
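
To illustrate the point made earlier in the thread about sorting on filter columns: both formats keep min/max statistics per chunk of rows, and a reader with predicate pushdown can skip any chunk whose range cannot match the filter. The following is a minimal pure-Python sketch of that idea, not ORC or Parquet code. The function names, the group size of 100, and the synthetic data are all assumptions made for illustration.

```python
from typing import List, Tuple


def build_min_max_index(values: List[int], group_size: int) -> List[Tuple[int, int]]:
    """Split values into fixed-size groups and record (min, max) for each group."""
    index = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        index.append((min(group), max(group)))
    return index


def groups_to_read(index: List[Tuple[int, int]], lo: int, hi: int) -> int:
    """Count groups whose [min, max] range overlaps the filter lo <= x <= hi."""
    return sum(1 for gmin, gmax in index if gmax >= lo and gmin <= hi)


# Hypothetical data: a deterministic permutation of 0..999 (37 is coprime with 1000),
# standing in for values written in arrival order rather than sorted order.
unsorted_values = [(i * 37) % 1000 for i in range(1000)]
sorted_values = sorted(unsorted_values)

unsorted_index = build_min_max_index(unsorted_values, group_size=100)
sorted_index = build_min_max_index(sorted_values, group_size=100)

# Filter: 250 <= x <= 260
unsorted_reads = groups_to_read(unsorted_index, 250, 260)
sorted_reads = groups_to_read(sorted_index, 250, 260)

print(f"unsorted: read {unsorted_reads} of {len(unsorted_index)} groups")
print(f"sorted:   read {sorted_reads} of {len(sorted_index)} groups")
```

With the unsorted layout every group spans nearly the whole value range, so the min/max check cannot rule any of them out. After sorting, only the single group covering 200..299 overlaps the filter. Real readers apply the same test at their own granularity, so actual savings depend on the configured chunk sizes and on the data.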
