ok i am sure there is some way to do it. i am going to guess: snippets of hive code stuck together with oozie jobs or whatever, where the oozie jobs become the re-usable pieces, perhaps? now you've got sql and xml, completely lacking any benefit of a compiler to catch errors. unit tests will be slow, if available at all. so yeah, i am sure it can be made to *work*, just like you can get a nail into a wall with a screwdriver if you really want to.
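to make the point about re-usable pieces concrete, here is a minimal sketch of the kind of single, unit-testable function i mean: one named function wrapping two joins plus a group-by aggregation. in spark this would be a DataFrame => DataFrame function; this sketch uses plain python dicts as a stand-in so it stays self-contained, and all table and column names are made up for illustration:

```python
# Sketch: wrap several relational steps (two joins + a group-by sum) into
# one named function, so it can be unit tested and re-used as a whole.
# Rows are plain dicts standing in for DataFrame rows; names are hypothetical.
from collections import defaultdict

def monthly_sales(sales, times, channels):
    """Join sales to times and channels, then sum amount_sold
    per (calendar_month_desc, channel_desc)."""
    time_by_id = {t["time_id"]: t for t in times}          # build join index
    channel_by_id = {c["channel_id"]: c for c in channels}  # build join index
    totals = defaultdict(float)
    for s in sales:  # join each sale to its month and channel, accumulate
        month = time_by_id[s["time_id"]]["calendar_month_desc"]
        channel = channel_by_id[s["channel_id"]]["channel_desc"]
        totals[(month, channel)] += s["amount_sold"]
    return [
        {"calendar_month_desc": m, "channel_desc": c, "total_sales": v}
        for (m, c), v in sorted(totals.items())
    ]

# Because the whole pipeline is one function, a unit test is one call:
sales = [
    {"time_id": 1, "channel_id": 10, "amount_sold": 100.0},
    {"time_id": 1, "channel_id": 10, "amount_sold": 50.0},
    {"time_id": 2, "channel_id": 20, "amount_sold": 25.0},
]
times = [{"time_id": 1, "calendar_month_desc": "2016-01"},
         {"time_id": 2, "calendar_month_desc": "2016-02"}]
channels = [{"channel_id": 10, "channel_desc": "Internet"},
            {"channel_id": 20, "channel_desc": "Direct"}]

result = monthly_sales(sales, times, channels)
assert result == [
    {"calendar_month_desc": "2016-01", "channel_desc": "Internet", "total_sales": 150.0},
    {"calendar_month_desc": "2016-02", "channel_desc": "Direct", "total_sales": 25.0},
]
```

the point is not the toy join implementation, it is that the whole composite operation has a name, a signature, and a fast test — which is what gets lost when the pipeline lives as copy-pasted sql inside xml workflow definitions.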
On Tue, Feb 2, 2016 at 11:49 PM, Koert Kuipers <ko...@tresata.com> wrote:

> yeah but have you ever seen someone write a real analytical program in
> hive? how? where are the basic abstractions to wrap up a large number of
> operations (joins, groupBys) into a single function call? where are the
> tools to write nice unit tests for that?
>
> for example in spark i can write a DataFrame => DataFrame function that
> internally does many joins, groupBys and complex operations, all unit
> tested and perfectly re-usable. and in hive? copy-paste sql queries
> around? that's just dangerous.
>
> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> Hive has numerous extension points; you are not boxed in by a long shot.
>>
>> On Tuesday, February 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> uuuhm with spark using the Hive metastore you actually have a real
>>> programming environment and you can write real functions, versus just
>>> being boxed into some version of sql and limited udfs?
>>>
>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>>>
>>>> When comparing performance, you need to compare apples to apples. In
>>>> another thread, you mentioned that Hive on Spark is much slower than
>>>> Spark SQL. However, you had configured Hive such that only two tasks
>>>> could run in parallel, and you didn't provide information on how much
>>>> Spark SQL was utilizing. Thus, it's hard to tell whether it's just a
>>>> configuration problem in your Hive setup or whether Spark SQL is
>>>> indeed faster. You should be able to see the resource usage in the
>>>> YARN resource manager URL.
>>>>
>>>> --Xuefu
>>>>
>>>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>>
>>>>> Thanks Jeff.
>>>>>
>>>>> Obviously Hive is much more feature-rich compared to Spark. Having
>>>>> said that, in certain areas, for example where the same SQL feature
>>>>> is available in Spark, Spark seems to deliver results faster.
>>>>> This may be because:
>>>>>
>>>>> 1. Spark does both the optimisation and the execution seamlessly.
>>>>> 2. Hive on Spark has to invoke YARN, which adds another layer to
>>>>>    the process.
>>>>>
>>>>> I did some simple tests against a 100-million-row ORC table,
>>>>> available through Hive to both.
>>>>>
>>>>> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>>>>>
>>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  1       xxxxxxxxxx
>>>>> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  5       xxxxxxxxxx
>>>>> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
>>>>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>>>>
>>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  1       xxxxxxxxxx
>>>>> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  5       xxxxxxxxxx
>>>>> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
>>>>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>>>>
>>>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>>> 1       0   0    63   rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  1       xxxxxxxxxx
>>>>> 5       0   4    31   vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  5       xxxxxxxxxx
>>>>> 100000  99  999  188  abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  100000  xxxxxxxxxx
>>>>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>>>>
>>>>> So three runs, each returning the three rows in just over 50 seconds.
>>>>>
>>>>> *Hive 1.2.1 on Spark 1.3.1 execution engine*
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id
>>>>> in (1, 5, 100000);
>>>>> INFO  : Query Hive on Spark job[4] stages:
>>>>> INFO  : 4
>>>>> INFO  : Status: Running (Hive on Spark job[4])
>>>>> INFO  : Status: Finished successfully in 82.49 seconds
>>>>> | dummy.id | dummy.clustered | dummy.scattered | dummy.randomised | dummy.random_string                                | dummy.small_vc | dummy.padding |
>>>>> | 1        | 0               | 0               | 63               | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi | 1              | xxxxxxxxxx    |
>>>>> | 5        | 0               | 4               | 31               | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA | 5              | xxxxxxxxxx    |
>>>>> | 100000   | 99              | 999             | 188              | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe | 100000         | xxxxxxxxxx    |
>>>>> 3 rows selected (82.66 seconds)
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id
>>>>> in (1, 5, 100000);
>>>>> INFO  : Status: Finished successfully in 76.67 seconds
>>>>> | dummy.id | dummy.clustered | dummy.scattered | dummy.randomised | dummy.random_string                                | dummy.small_vc | dummy.padding |
>>>>> | 1        | 0               | 0               | 63               | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi | 1              | xxxxxxxxxx    |
>>>>> | 5        | 0               | 4               | 31               | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA | 5              | xxxxxxxxxx    |
>>>>> | 100000   | 99              | 999             | 188              | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe | 100000         | xxxxxxxxxx    |
>>>>> 3 rows selected (76.835 seconds)
>>>>>
>>>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id
>>>>> in (1, 5, 100000);
>>>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>>> | dummy.id | dummy.clustered | dummy.scattered | dummy.randomised | dummy.random_string                                | dummy.small_vc | dummy.padding |
>>>>> | 1        | 0               | 0               | 63               | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi | 1              | xxxxxxxxxx    |
>>>>> | 5        | 0               | 4               | 31               | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA | 5              | xxxxxxxxxx    |
>>>>> | 100000   | 99              | 999             | 188              | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe | 100000         | xxxxxxxxxx    |
>>>>> 3 rows selected (80.718 seconds)
>>>>>
>>>>> Three runs returning the same three rows in around 80 seconds.
>>>>> It is possible that my Spark engine with Hive, at 1.3.1, is out of
>>>>> date and that this causes the lag.
>>>>>
>>>>> There are certain queries that one cannot run with Spark. Besides,
>>>>> it does not recognise CHAR fields, which is a pain.
>>>>>
>>>>> spark-sql> CREATE TEMPORARY TABLE tmp AS
>>>>>          > SELECT t.calendar_month_desc, c.channel_desc,
>>>>>          >        SUM(s.amount_sold) AS TotalSales
>>>>>          > FROM sales s, times t, channels c
>>>>>          > WHERE s.time_id = t.time_id
>>>>>          > AND   s.channel_id = c.channel_id
>>>>>          > GROUP BY t.calendar_month_desc, c.channel_desc;
>>>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7.
>>>>> You are likely trying to use an unsupported Hive feature.
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>>> Author of the book *"A Practitioner’s Guide to Upgrading to Sybase
>>>>> ASE 15"*, ISBN 978-0-9563693-0-7.
>>>>> Co-author of *"Sybase Transact SQL Guidelines Best Practices"*, ISBN
>>>>> 978-0-9759693-0-4
>>>>> *Publications due shortly:*
>>>>> *Complex Event Processing in Heterogeneous Environments*, ISBN
>>>>> 978-0-9563693-3-8
>>>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN 978-0-9563693-1-4,
>>>>> volume one out shortly
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> NOTE: The information in this email is proprietary and confidential.
>>>>> This message is for the designated recipient only; if you are not
>>>>> the intended recipient, you should destroy it immediately. Any
>>>>> information in this message shall not be understood as given or
>>>>> endorsed by Peridale Technology Ltd, its subsidiaries or their
>>>>> employees, unless expressly so stated. It is the responsibility of
>>>>> the recipient to ensure that this email is virus free; neither
>>>>> Peridale Technology Ltd, its subsidiaries nor their employees accept
>>>>> any responsibility.
>>>>>
>>>>> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com]
>>>>> *Sent:* 02 February 2016 23:12
>>>>> *To:* user@hive.apache.org
>>>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>>>
>>>>> I think the difference is not only about which layer does the
>>>>> optimization, but more about feature parity. Hive on Spark offers
>>>>> all the functional features that Hive offers, and these features play
>>>>> out faster. However, Spark SQL is far from offering this parity, as
>>>>> far as I know.
>>>>>
>>>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> My understanding is that with the Hive on Spark engine, one gets the
>>>>> Hive optimizer and the Spark query engine.
>>>>>
>>>>> With Spark using the Hive metastore, Spark does both the optimization
>>>>> and the query execution. The only value-add is that one can access
>>>>> the underlying Hive tables from spark-sql etc.
>>>>>
>>>>> Is this assessment correct?
>>>>> Thanks
>>>>>
>>>>> Dr Mich Talebzadeh
>>
>> --
>> Sorry this was sent from mobile. Will do less grammar and spell check
>> than usual.
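[Editorial footnote on the `CREATE TEMPORARY TABLE ... AS` failure quoted above: in Spark 1.x the usual workaround is to run the SELECT as a DataFrame and call `registerTempTable` on the result, rather than using the Hive CTAS syntax. The semantics of the query itself can be checked with the stdlib `sqlite3` module and hypothetical toy data; this is a sketch of the query's meaning, not of any Spark API:]

```python
# Sketch: the same join + GROUP BY as the failing CTAS, minus the
# TEMPORARY TABLE wrapper, run against an in-memory sqlite3 database.
# All data below is made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE sales (time_id INT, channel_id INT, amount_sold REAL);
CREATE TABLE times (time_id INT, calendar_month_desc TEXT);
CREATE TABLE channels (channel_id INT, channel_desc TEXT);
INSERT INTO sales VALUES (1, 10, 100.0), (1, 10, 50.0), (2, 20, 25.0);
INSERT INTO times VALUES (1, '2016-01'), (2, '2016-02');
INSERT INTO channels VALUES (10, 'Internet'), (20, 'Direct');
""")

cur.execute("""
SELECT t.calendar_month_desc, c.channel_desc,
       SUM(s.amount_sold) AS TotalSales
FROM sales s, times t, channels c
WHERE s.time_id = t.time_id
  AND s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc
ORDER BY 1, 2
""")
rows = cur.fetchall()
assert rows == [('2016-01', 'Internet', 150.0), ('2016-02', 'Direct', 25.0)]
```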