From my impression, Kudu has been designed to offer something between HBase and Parquet for write-intensive loads - it is not faster than Parquet for warehouse-type querying (if anything, slower, because that is not its use case). I assume this is still its strategy.
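To make that write-pattern difference concrete, here is a rough Scala sketch, assuming the kudu-spark integration (KuduContext) and entirely made-up master address, table name, and paths: Kudu accepts row-level upserts in place, while Parquet files are immutable, so a Parquet-based store can only append new files or rewrite partitions.

import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kudu-vs-parquet-sketch").getOrCreate()

// Incoming batch of records (hypothetical path)
val updates = spark.read.json("/incoming/sensor-batch.json")

// Kudu: mutable storage, row-level upserts into an existing table (good for write-heavy loads)
val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)
kuduContext.upsertRows(updates, "sensor_readings")

// Parquet: immutable columnar files, so new data is appended (or a partition is rewritten)
// and then read back as a warehouse-style scan
updates.write.mode("append").parquet("/warehouse/sensor_readings_parquet")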
For some scenarios it could make sense together with Parquet and ORC. However, I am not sure what the advantage would be over using HBase + Parquet and ORC.

> On 27 Jul 2016, at 11:47, "u...@moosheimer.com" <u...@moosheimer.com> wrote:
>
> Hi Gourav,
>
> Kudu (if you mean Apache Kudu, the Cloudera-originated project) is an in-memory db with data storage, while Parquet is "only" a columnar storage format.
>
> As I understand it, Kudu is a BI db to compete with Exasol or Hana (ok ... that's more a wish :-).
>
> Regards,
> Uwe
>
> Mit freundlichen Grüßen / best regards
> Kay-Uwe Moosheimer
>
>> On 27.07.2016 at 09:15, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>
>> Gosh,
>>
>> whether ORC came from this or that, it runs queries in HIVE with TEZ at a speed that is better than SPARK.
>>
>> Has anyone heard of KUDA? It's better than Parquet. But I think that someone might just start saying that KUDA has a difficult lineage as well. After all, dynastic rules dictate.
>>
>> Personally, I feel that if something stores my data compressed and lets me access it faster, I do not care where it comes from or how difficult the childbirth was :)
>>
>> Regards,
>> Gourav
>>
>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <sbpothin...@gmail.com> wrote:
>>> Just a correction:
>>>
>>> The ORC Java libraries from Hive are forked into Apache ORC. Vectorization is on by default.
>>>
>>> Don't know if Spark is leveraging this new repo?
>>>
>>> <dependency>
>>>     <groupId>org.apache.orc</groupId>
>>>     <artifactId>orc</artifactId>
>>>     <version>1.1.2</version>
>>>     <type>pom</type>
>>> </dependency>
>>>
>>> Sent from my iPhone
>>>
>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>> parquet was inspired by dremel but written from the ground up as a library with support for a variety of big data systems (hive, pig, impala, cascading, etc.). it is also easy to add new support, since it's a proper library.
>>>>
>>>> orc has been enhanced while deployed at facebook in hive and at yahoo in hive. just hive. it didn't really exist by itself. it was part of the big java soup that is called hive, without an easy way to extract it. hive does not expose proper java apis. it never cared for that.
>>>>
>>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr> wrote:
>>>>> Interesting opinion, thank you.
>>>>>
>>>>> Still, on the website Parquet is basically inspired by Dremel (Google) [1], and part of ORC has been enhanced while deployed for Facebook, Yahoo [2].
>>>>>
>>>>> Other than this presentation [3], do you guys know any other benchmark?
>>>>>
>>>>> [1] https://parquet.apache.org/documentation/latest/
>>>>> [2] https://orc.apache.org/docs/
>>>>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>
>>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>
>>>>>> when parquet came out it was developed by a community of companies, and was designed as a library to be supported by multiple big data projects. nice
>>>>>>
>>>>>> orc on the other hand initially only supported hive. it wasn't even designed as a library that can be re-used. even today it brings in the kitchen sink of transitive dependencies. yikes
>>>>>>
>>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>>>>> I think both are very similar, but with slightly different goals.
>>>>>>> While they work transparently for each Hadoop application, you need to enable specific support in the application for predicate push-down.
>>>>>>> In the end you have to check which application you are using and do some tests (with the correct predicate push-down configuration). Keep in mind that both formats work best if they are sorted on filter columns (which is your responsibility) and if their optimizations are correctly configured (min/max index, bloom filter, compression, etc.).
>>>>>>>
>>>>>>> If you need to ingest sensor data you may want to store it first in HBase and then batch-process it into large files in ORC or Parquet format.
>>>>>>>
>>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Just wondering about the advantages and disadvantages of converting data into ORC or Parquet.
>>>>>>>>
>>>>>>>> In the documentation of Spark there are numerous examples of the Parquet format.
>>>>>>>>
>>>>>>>> Any strong reasons to choose Parquet over the ORC file format?
>>>>>>>>
>>>>>>>> Also: current data compression is bzip2
>>>>>>>>
>>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>>>>>>
>>>>>>>> This seems biased.
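For the original question (bzip2-compressed input, converting to a columnar format), here is a rough Spark (Scala) sketch of one way to do the conversion while following the advice above about predicate push-down and sorting on filter columns. The paths, column names, and the header option are made up for illustration; the two filterPushdown settings are standard Spark SQL options.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("convert-to-columnar-sketch").getOrCreate()

// Make sure predicate push-down is enabled for both formats
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")

// Hypothetical bzip2-compressed CSV input; Spark/Hadoop decompress .bz2 files transparently
val raw = spark.read.option("header", "true").csv("/data/raw/sensors/*.csv.bz2")

// Sort on the columns you filter by most often, so min/max statistics
// can skip Parquet row groups / ORC stripes at read time
val sorted = raw.sortWithinPartitions("sensor_id", "event_time")

// Write out splittable, compressed columnar files
// (ORC output may require a Hive-enabled Spark build)
sorted.write.mode("overwrite").option("compression", "snappy").parquet("/data/parquet/sensors")
sorted.write.mode("overwrite").option("compression", "zlib").orc("/data/orc/sensors")

Whether push-down actually kicks in still depends on the Spark version and data source, so it is worth checking query plans and timings, as Jörn suggests.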