From my impression, Kudu has been designed to offer something between HBase 
and Parquet for write-intensive loads - it is not faster for warehouse-type 
querying compared to Parquet (merely slower, because that is not its use case). 
I assume this is still its strategy.

For some scenarios it could make sense together with Parquet and ORC. However, I 
am not sure what the advantage is compared to using HBase + Parquet/ORC.
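
For illustration, a rough sketch of what that hybrid setup could look like with 
the kudu-spark integration (untested; the master address, table, column and path 
names are all made up, and the exact connector API differs between versions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kudu-plus-parquet").getOrCreate()
import spark.implicits._

// stand-in for a batch of incoming sensor rows
val freshReadings = Seq(("s1", "2016-07-27", 21.5)).toDF("sensorId", "day", "reading")

// write-friendly, mutable path: append into Kudu
freshReadings.write
  .options(Map("kudu.master" -> "kudu-master:7051", "kudu.table" -> "sensor_readings"))
  .mode("append")
  .format("org.apache.kudu.spark.kudu")
  .save()

// scan-friendly, immutable path: age cold data out into Parquet
spark.read
  .options(Map("kudu.master" -> "kudu-master:7051", "kudu.table" -> "sensor_readings"))
  .format("org.apache.kudu.spark.kudu")
  .load()
  .filter($"day" < "2016-07-01")
  .write.parquet("/warehouse/sensor_readings_parquet")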

> On 27 Jul 2016, at 11:47, "u...@moosheimer.com" <u...@moosheimer.com> wrote:
> 
> Hi Gourav,
> 
> Kudu (if you mean Apache Kudu, the Cloudera-originated project) is an in-memory 
> db with data storage, while Parquet is "only" a columnar storage format.
> 
> As I understand it, Kudu is a BI db meant to compete with Exasol or HANA (ok ... 
> that's more a wish :-).
> 
> Regards,
> Uwe
> 
> Mit freundlichen Grüßen / best regards
> Kay-Uwe Moosheimer
> 
>> Am 27.07.2016 um 09:15 schrieb Gourav Sengupta <gourav.sengu...@gmail.com>:
>> 
>> Gosh,
>> 
>> whether ORC came from this or that, it runs queries in Hive with Tez faster 
>> than Spark does.
>> 
>> Has anyone heard of Kudu? It's better than Parquet. But I think that someone 
>> might just start saying that Kudu has a difficult lineage as well. After all, 
>> dynastic rules dictate.
>> 
>> Personally, I feel that if something stores my data compressed and lets me 
>> access it faster, I do not care where it comes from or how difficult the 
>> childbirth was :)
>> 
>> 
>> Regards,
>> Gourav
>> 
>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni 
>>> <sbpothin...@gmail.com> wrote:
>>> Just a correction:
>>> 
>>> The ORC Java libraries from Hive have been forked into Apache ORC, with 
>>> vectorization enabled by default. 
>>> 
>>> I don't know if Spark is leveraging this new repo yet?
>>> 
>>> <dependency>
>>>     <groupId>org.apache.orc</groupId>
>>>     <artifactId>orc</artifactId>
>>>     <version>1.1.2</version>
>>>     <type>pom</type>
>>> </dependency>
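>>> 
>>> A quick way to exercise ORC from Spark itself and see which code path your 
>>> build pulls in (untested sketch; the path is made up, and whether this goes 
>>> through the new org.apache.orc code or Hive's bundled copy depends on your 
>>> Spark version):
>>> 
>>> import org.apache.spark.sql.SparkSession
>>> 
>>> // note: in older Spark versions ORC support lives in the Hive module,
>>> // so the spark-hive jars must be on the classpath
>>> val spark = SparkSession.builder().appName("orc-check").getOrCreate()
>>> import spark.implicits._
>>> 
>>> // round-trip a small DataFrame through ORC files
>>> Seq((1, "a"), (2, "b")).toDF("id", "value").write.orc("/tmp/orc_check")
>>> spark.read.orc("/tmp/orc_check").show()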
>>> 
>>> Sent from my iPhone
>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>> 
>>>> parquet was inspired by dremel but written from the ground up as a library 
>>>> with support for a variety of big data systems (hive, pig, impala, 
>>>> cascading, etc.). it is also easy to add new support, since it's a proper 
>>>> library.
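>>>> 
>>>> for example, parquet-avro can be used standalone, with no hive or cluster 
>>>> runtime at all. untested sketch (the file path and record fields are made up):
>>>> 
>>>> import org.apache.avro.SchemaBuilder
>>>> import org.apache.avro.generic.{GenericData, GenericRecord}
>>>> import org.apache.hadoop.fs.Path
>>>> import org.apache.parquet.avro.AvroParquetWriter
>>>> 
>>>> // define a schema and write one record; no metastore, no serde soup
>>>> val schema = SchemaBuilder.record("Event").fields()
>>>>   .requiredString("id").requiredLong("ts").endRecord()
>>>> val writer = AvroParquetWriter.builder[GenericRecord](new Path("/tmp/events.parquet"))
>>>>   .withSchema(schema)
>>>>   .build()
>>>> val rec = new GenericData.Record(schema)
>>>> rec.put("id", "a"); rec.put("ts", 1L)
>>>> writer.write(rec)
>>>> writer.close()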
>>>> 
>>>> orc has been enhanced while deployed at facebook in hive and at yahoo in 
>>>> hive. just hive. it didn't really exist by itself. it was part of the big 
>>>> java soup that is called hive, without an easy way to extract it. hive 
>>>> does not expose proper java apis. it never cared for that.
>>>> 
>>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU 
>>>>> <ovidiu-cristian.ma...@inria.fr> wrote:
>>>>> Interesting opinion, thank you
>>>>> 
>>>>> Still, per the websites, Parquet is basically inspired by Dremel (Google) 
>>>>> [1], and part of ORC has been enhanced while deployed at Facebook and Yahoo 
>>>>> [2].
>>>>> 
>>>>> Other than this presentation [3], do you guys know of any other benchmarks?
>>>>> 
>>>>> [1] https://parquet.apache.org/documentation/latest/
>>>>> [2] https://orc.apache.org/docs/
>>>>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>> 
>>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>> 
>>>>>> when parquet came out it was developed by a community of companies, and 
>>>>>> was designed as a library meant to be used by multiple big data projects. 
>>>>>> nice
>>>>>> 
>>>>>> orc on the other hand initially only supported hive. it wasn't even 
>>>>>> designed as a library that could be re-used. even today it brings in the 
>>>>>> kitchen sink of transitive dependencies. yikes
>>>>>> 
>>>>>> 
>>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>>>>> I think both are very similar, but with slightly different goals. While 
>>>>>>> they work transparently for each Hadoop application, you need to enable 
>>>>>>> specific support in the application for predicate push-down. 
>>>>>>> In the end you have to check which application you are using and do 
>>>>>>> some tests (with the predicate push-down configured correctly). Keep in 
>>>>>>> mind that both formats work best if they are sorted on filter columns 
>>>>>>> (which is your responsibility) and if their optimizations are 
>>>>>>> correctly configured (min/max indexes, bloom filters, compression, etc.). 
>>>>>>> 
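>>>>>>> As a concrete illustration, something like this in Spark (untested 
>>>>>>> sketch; the config keys are the Spark ones as I remember them and may 
>>>>>>> differ by version, and the column and path names are made up):
>>>>>>> 
>>>>>>> // make sure the readers are allowed to push filters down
>>>>>>> spark.conf.set("spark.sql.parquet.filterPushdown", "true")
>>>>>>> spark.conf.set("spark.sql.orc.filterPushdown", "true")
>>>>>>> 
>>>>>>> // sort on the filter column before writing, so the min/max statistics
>>>>>>> // per row group / stripe can actually skip data
>>>>>>> df.sortWithinPartitions("sensorId")
>>>>>>>   .write.orc("/data/readings_orc_sorted")
>>>>>>> 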
>>>>>>> If you need to ingest sensor data you may want to store it first in 
>>>>>>> HBase and then batch-process it into large files in ORC or Parquet format.
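>>>>>>> 
>>>>>>> Roughly like this (untested sketch; I leave the HBase scan itself to 
>>>>>>> whichever connector you use, so the DataFrame below is a stand-in and 
>>>>>>> all names are made up):
>>>>>>> 
>>>>>>> import spark.implicits._
>>>>>>> 
>>>>>>> // stand-in for sensor rows scanned out of HBase
>>>>>>> val sensors = Seq(("s1", "2016-07-26", 21.5), ("s2", "2016-07-26", 19.8))
>>>>>>>   .toDF("sensorId", "day", "reading")
>>>>>>> 
>>>>>>> // compact into a few large, partitioned files for analytical scans
>>>>>>> sensors.repartition($"day")
>>>>>>>   .write.partitionBy("day").mode("append")
>>>>>>>   .parquet("/warehouse/sensors_parquet")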
>>>>>>> 
>>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Just wondering about the advantages and disadvantages of converting data 
>>>>>>>> into ORC or Parquet. 
>>>>>>>> 
>>>>>>>> In the documentation of Spark there are numerous examples of the Parquet 
>>>>>>>> format. 
>>>>>>>> 
>>>>>>>> Any strong reasons to choose Parquet over the ORC file format?
>>>>>>>> 
>>>>>>>> Also: the current data compression is bzip2.
>>>>>>>> 
>>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>>>>>>  
>>>>>>>> This seems biased.
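>>>>>>>> 
>>>>>>>> One way to compare on your own data and queries instead (untested 
>>>>>>>> sketch; the paths are made up, and the per-write compression option 
>>>>>>>> may vary by Spark version):
>>>>>>>> 
>>>>>>>> // bzip2 is splittable, so reading the raw data parallelizes fine
>>>>>>>> val raw = spark.read.json("/data/raw_json_bz2")
>>>>>>>> 
>>>>>>>> // write the same data in both formats, then time your real queries on each
>>>>>>>> raw.write.option("compression", "snappy").parquet("/data/bench_parquet")
>>>>>>>> raw.write.option("compression", "zlib").orc("/data/bench_orc")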
>> 
