Re: Format dilemma

Edward Capriolo Tue, 20 Jun 2017 10:13:09 -0700

It is whack that two optimized row columnar formats exist and each
respective project (Hive/Impala) has good support for one and lame/no
support for the other.

Impala is now an Apache project.  Also 'whack' and 'lame' are technical
terms often used by the people in the real world that have to use TEXT
format because they care about interoperability.

As the world's hugest Hive fan I can say: Impala is a really nice tool.
Many queries work at interactive speeds on large datasets. (Anecdotal)  I
highly doubt Hive + LLAP will be in that ballpark of performance for maybe
2 years.

Presto. Ha ha, I call Presto the "tease". It teases you by letting you think
you will not have to rewrite your queries, then you do have to, to deal
with nulls and try_cast. It "teases" you because some queries work at
interactive speeds. Then it reaches a point where, based on your data
size, it goes from "interactive speed" to "kinda slow". Then it reaches the
point where it goes from "kinda slow" to "fail after 20 minutes". Then you
just switch back to Hive because, regardless of the speed (4 minutes, 10
minutes, whatever), you are about 99.999% certain the query will actually run.

On Tue, Jun 20, 2017 at 12:51 PM, Owen O'Malley <owen.omal...@gmail.com>
wrote:

> You should also try LLAP. With ORC or text, it will cache the hot columns
> and partitions in memory. I can't seem to find the slides yet, but the
> Comcast team had good results with LLAP:
>
> https://dataworkssummit.com/san-jose-2017/sessions/hadoop-query-performance-smackdown/
>
> https://twitter.com/thejasn/status/875065727056715776
>
> Now that ORC has a C++ reader (and soon a writer), someone could write a
> patch for Impala to support ORC. You'd need to talk to the Impala project
> though.
>
> .. Owen
>
> On Tue, Jun 20, 2017 at 1:00 AM, Furcy Pin <furcy....@flaminem.com> wrote:
>
>> Another option would be to try Facebook's Presto https://prestodb.io/
>>
>> Like Impala, Presto is designed for fast interactive querying over Hive
>> tables, but it is also capable of querying data from many other sources
>> (MySQL, PostgreSQL, Kafka, Cassandra, ...
>> https://prestodb.io/docs/current/connector.html)
>>
>> In terms of performance on small queries, it seems to be as fast as
>> Impala, a league above Spark-SQL, and of course two leagues above Hive.
>>
>> Unlike Impala, Presto is also able to read the ORC file format, and make the
>> most of it (e.g. read pre-aggregated column statistics from ORC file footers).
>>
>> It can also make use of Hive's bucketing feature, while Impala still
>> cannot:
>> https://github.com/prestodb/presto/issues/6666
>> https://issues.apache.org/jira/browse/IMPALA-3118
>>
>> Regards,
>>
>> Furcy
>>
>>
>>
>>
>>
>> On Tue, Jun 20, 2017 at 5:36 AM, Sruthi Kumar Annamneedu <
>> sruthikumar...@gmail.com> wrote:
>>
>>> Try using Parquet with Snappy compression; Impala works with this
>>> combination.
>>>
>>> On Sun, Jun 18, 2017 at 3:35 AM, rakesh sharma <
>>> rakeshsharm...@hotmail.com> wrote:
>>>
>>>> We are facing a format issue. We would like to run BI-style queries on
>>>> Hive tables using Impala, which supports Parquet, but we would also like
>>>> the data compressed at the best ratio, as with ORC. However, Impala cannot
>>>> query ORC files. What design options should we consider? Any help is
>>>> appreciated.
>>>>
>>>> Thanks
>>>> Rakesh
>>>>
>>>>
>>>>
>>>
>>
>
