Re: Hive and Impala

Mich Talebzadeh Wed, 02 Mar 2016 10:27:11 -0800

OK two questions here please:


   1. Which version of Hive are you running?
   2. Have you tried Hive on Spark, which does both DAG and in-memory
   calculation?

Query Hive on Spark job[1] stages:
INFO  : 2
INFO  : 3
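
For anyone who has not tried it, the engine is a session-level setting. A minimal sketch, assuming Hive on Spark is already installed and configured on the cluster; the table name below is hypothetical and only there for illustration:

```sql
-- Choose the execution engine for the current session.
-- Valid values are mr, tez and spark.
set hive.execution.engine=spark;

-- Hypothetical table name (an assumption, not from this thread).
select count(1) from sales;
```

Switching back to MapReduce or Tez for comparison is the same statement with mr or tez as the value.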



HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 2 March 2016 at 18:14, Dayong <will...@gmail.com> wrote:

> Tez is kind of outdated and ORC is tied closely to Hive. In addition, the Hive
> metastore can be decoupled from Hive as well. In reality, we do suffer
> from Hive's performance even for ETL jobs. As a result, we'll switch to
> Impala + Spark/Flink.
>
> Thanks,
> Dayong
>
> On Mar 2, 2016, at 10:35 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> I forgot: besides LLAP, you are going to have Hive Hybrid Procedural SQL On
> Hadoop (HPL/SQL), which is going to add another dimension to Hive.
>
> Dr Mich Talebzadeh
>
> On 2 March 2016 at 15:30, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> SQL plays an increasingly important role on Hadoop. As of today Hive IMO
>> provides the best and most robust solution to anything resembling a Data
>> Warehouse "solution" on Hadoop, chiefly by means of its powerful metastore,
>> which can be hosted on a variety of mission-critical databases, plus Hive's
>> ever-increasing support for a variety of file types on HDFS, from the humble
>> textfile to ORC. The remaining tools are little more than query tools that
>> crucially rely on the Hive Metastore for their needs. Take away the Hive
>> component and they are more or less lame ducks.
>>
>> Hive on MR was perceived to be slow, but what the heck, we are talking
>> about a Data Warehouse here, which for the most part should be batch
>> oriented and not user-facing. In Hive 0.14 and 2.0 you can use
>> Spark and Tez as the execution engine, and if you are well into functional
>> programming, you can deploy Spark on Hive. If you look around, from Impala
>> to Spark the architecture is essentially a query tool.
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>> On 2 March 2016 at 13:52, Dayong <will...@gmail.com> wrote:
>>
>>> As I remember, a few weeks ago in the Hadoop weekly news feed, Cloudera
>>> had a benchmark showing Impala is a little better than Spark SQL and Hive
>>> with Tez. You can check that. From my experience, Hive is still the leading
>>> tool for regular ETL jobs since it is stable. The other tools are better for
>>> ad hoc and interactive query use cases. Cloudera is betting on Impala,
>>> especially with its new Kudu project.
>>>
>>> Thanks,
>>> Dayong
>>>
>>> On Mar 1, 2016, at 5:14 PM, Edward Capriolo <edlinuxg...@gmail.com>
>>> wrote:
>>>
>>> My knocks on Impala. (Not intended to be a post knocking Impala.)
>>>
>>> Impala really has not delivered on the complex types that hive has
>>> (after promising it for quite a while), also it only works with the
>>> 'blessed' input formats, parquet, avro, text.
>>>
>>> It is very annoying to work with Impala. In my version, if you create a
>>> partition in Hive, Impala does not see it. You have to run "refresh".
>>>
>>> In Impala I do not have all the UDFs that Hive has, like percentile, etc.
>>>
>>> Impala is fast. Many data-analyst / data-scientist types can't
>>> wait 10 seconds for a query, so when I need to produce something for them I
>>> make sure the data has no complex types and uses a table type that Impala
>>> understands.
>>>
>>> But for my work I still work primarily in Hive, because I do not want to
>>> deal with all the things that Impala does not have or might have, and when I
>>> need something special like my own UDFs it is easier to whip up the
>>> solution in Hive.
>>>
>>> Having worked with M$ SQL Server and Vertica, Impala is on par with
>>> them, but I don't think of it like I think of Hive. To me it just feels like a
>>> Vertica that I can cheat loading sometimes because it is backed by HDFS.
>>>
>>> Hive is something different: I am making pipelines, transforming
>>> data, doing streaming, writing custom UDFs, querying JSON directly. It's
>>> != Impala.
>>>
>>> ::random message of the day::
>>>
>>>
>>>
>>>
>>> On Tue, Mar 1, 2016 at 4:38 PM, Ashok Kumar <ashok34...@yahoo.com>
>>> wrote:
>>>
>>>>
>>>> Dr Mich,
>>>>
>>>> My two cents here.
>>>>
>>>> I don't have direct experience of Impala, but in my humble opinion I
>>>> share your view that Hive provides the best metastore of all Big Data
>>>> systems. Looking around, almost every product in one shape or form uses Hive
>>>> code somewhere. My colleagues inform me that Hive is one of the most stable
>>>> Big Data products.
>>>>
>>>> With the capabilities of Spark on Hive and Hive on Spark or Tez plus of
>>>> course MR, there is really little need for many other products in the same
>>>> space. It is good to keep things simple.
>>>>
>>>> Warmest
>>>>
>>>>
>>>> On Tuesday, 1 March 2016, 11:33, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>
>>>> I have not heard of Impala anymore. I saw an article in LinkedIn titled
>>>>
>>>> "Apache Hive Or Cloudera Impala? What is Best for me?"
>>>>
>>>> "We can access all objects from Hive data warehouse with HiveQL which
>>>> leverages the map-reduce architecture in background for data retrieval and
>>>> transformation and this results in latency."
>>>>
>>>> My response was
>>>>
>>>> This statement is no longer valid as you now have a choice of three
>>>> engines: MR, Spark and Tez. I have not used Impala myself as I don't think
>>>> there is a need for it, with Hive on Spark or Spark using the Hive metastore
>>>> providing whatever is needed. Hive is for Data Warehouse and provides what it
>>>> says on the tin. Please also bear in mind that Hive offers ORC storage
>>>> files that provide storage index capabilities, further optimizing the queries
>>>> with additional stats at file, stripe and row group levels.
>>>>
>>>> Anyway, the question is: with Hive on Spark or Spark using the Hive
>>>> metastore, what can we not achieve that we can achieve with Impala?
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>
>>
>
