Re: Hive on Spark Engine versus Spark using Hive metastore

Edward Capriolo Tue, 02 Feb 2016 17:10:13 -0800

Hive has numerous extension points, you are not boxed in by a long shot.

On Tuesday, February 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:


> uuuhm with spark using Hive metastore you actually have a real
> programming environment and you can write real functions, versus just being
> boxed into some version of sql and limited udfs?
>
> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com
> <javascript:_e(%7B%7D,'cvml','xzh...@cloudera.com');>> wrote:
>
>> When comparing the performance, you need to do it apple vs apple. In
>> another thread, you mentioned that Hive on Spark is much slower than Spark
>> SQL. However, you configured Hive such that only two tasks can run in
>> parallel. However, you didn't provide information on how much Spark SQL is
>> utilizing. Thus, it's hard to tell whether it's just a configuration
>> problem in your Hive or Spark SQL is indeed faster. You should be able to
>> see the resource usage in YARN resource manage URL.
>>
>> --Xuefu
>>
>> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk
>> <javascript:_e(%7B%7D,'cvml','m...@peridale.co.uk');>> wrote:
>>
>>> Thanks Jeff.
>>>
>>>
>>>
>>> Obviously Hive is much more feature rich compared to Spark. Having said
>>> that in certain areas for example where the SQL feature is available in
>>> Spark, Spark seems to deliver faster.
>>>
>>>
>>>
>>> This may be:
>>>
>>>
>>>
>>> 1.    Spark does both the optimisation and execution seamlessly
>>>
>>> 2.    Hive on Spark has to invoke YARN that adds another layer to the
>>> process
>>>
>>>
>>>
>>> Now I did some simple tests on a 100Million rows ORC table available
>>> through Hive to both.
>>>
>>>
>>>
>>> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>>>
>>>
>>>
>>>
>>>
>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>
>>> 1       0       0       63
>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1
>>> xxxxxxxxxx
>>>
>>> 5       0       4       31
>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5
>>> xxxxxxxxxx
>>>
>>> 100000  99      999     188
>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000
>>> xxxxxxxxxx
>>>
>>> Time taken: 50.805 seconds, Fetched 3 row(s)
>>>
>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>
>>> 1       0       0       63
>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1
>>> xxxxxxxxxx
>>>
>>> 5       0       4       31
>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5
>>> xxxxxxxxxx
>>>
>>> 100000  99      999     188
>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000
>>> xxxxxxxxxx
>>>
>>> Time taken: 50.358 seconds, Fetched 3 row(s)
>>>
>>> spark-sql> select * from dummy where id in (1, 5, 100000);
>>>
>>> 1       0       0       63
>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi               1
>>> xxxxxxxxxx
>>>
>>> 5       0       4       31
>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA               5
>>> xxxxxxxxxx
>>>
>>> 100000  99      999     188
>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe          100000
>>> xxxxxxxxxx
>>>
>>> Time taken: 50.563 seconds, Fetched 3 row(s)
>>>
>>>
>>>
>>> So three runs returning three rows just over 50 seconds
>>>
>>>
>>>
>>> *Hive 1.2.1 on spark 1.3.1 execution engine*
>>>
>>>
>>>
>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>> (1, 5, 100000);
>>>
>>> INFO  :
>>>
>>> Query Hive on Spark job[4] stages:
>>>
>>> INFO  : 4
>>>
>>> INFO  :
>>>
>>> Status: Running (Hive on Spark job[4])
>>>
>>> INFO  : Status: Finished successfully in 82.49 seconds
>>>
>>>
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>
>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
>>> |                 dummy.random_string                 | dummy.small_vc  |
>>> dummy.padding  |
>>>
>>>
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>
>>> | 1         | 0                | 0                | 63                |
>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      |
>>> xxxxxxxxxx     |
>>>
>>> | 5         | 0                | 4                | 31                |
>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      |
>>> xxxxxxxxxx     |
>>>
>>> | 100000    | 99               | 999              | 188               |
>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      |
>>> xxxxxxxxxx     |
>>>
>>>
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>
>>> 3 rows selected (82.66 seconds)
>>>
>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>> (1, 5, 100000);
>>>
>>> INFO  : Status: Finished successfully in 76.67 seconds
>>>
>>>
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>
>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
>>> |                 dummy.random_string                 | dummy.small_vc  |
>>> dummy.padding  |
>>>
>>>
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>
>>> | 1         | 0                | 0                | 63                |
>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      |
>>> xxxxxxxxxx     |
>>>
>>> | 5         | 0                | 4                | 31                |
>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      |
>>> xxxxxxxxxx     |
>>>
>>> | 100000    | 99               | 999              | 188               |
>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      |
>>> xxxxxxxxxx     |
>>>
>>>
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>
>>> 3 rows selected (76.835 seconds)
>>>
>>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
>>> (1, 5, 100000);
>>>
>>> INFO  : Status: Finished successfully in 80.54 seconds
>>>
>>>
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>
>>> | dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised
>>> |                 dummy.random_string                 | dummy.small_vc  |
>>> dummy.padding  |
>>>
>>>
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>
>>> | 1         | 0                | 0                | 63                |
>>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  |          1      |
>>> xxxxxxxxxx     |
>>>
>>> | 5         | 0                | 4                | 31                |
>>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  |          5      |
>>> xxxxxxxxxx     |
>>>
>>> | 100000    | 99               | 999              | 188               |
>>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  |     100000      |
>>> xxxxxxxxxx     |
>>>
>>>
>>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+
>>>
>>> 3 rows selected (80.718 seconds)
>>>
>>>
>>>
>>> Three runs returning the same rows in 80 seconds.
>>>
>>>
>>>
>>> It is possible that My Spark engine with Hive is 1.3.1 which is out of
>>> date and that causes this lag.
>>>
>>>
>>>
>>> There are certain queries that one cannot do with Spark. Besides it does
>>> not recognize CHAR fields which is a pain.
>>>
>>>
>>>
>>> spark-sql> *CREATE TEMPORARY TABLE tmp AS*
>>>
>>>          > SELECT t.calendar_month_desc, c.channel_desc,
>>> SUM(s.amount_sold) AS TotalSales
>>>
>>>          > FROM sales s, times t, channels c
>>>
>>>          > WHERE s.time_id = t.time_id
>>>
>>>          > AND   s.channel_id = c.channel_id
>>>
>>>          > GROUP BY t.calendar_month_desc, c.channel_desc
>>>
>>>          > ;
>>>
>>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
>>>
>>> .
>>>
>>> You are likely trying to use an unsupported Hive feature.";
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>
>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>
>>>
>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>
>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>>> 15", ISBN 978-0-9563693-0-7*.
>>>
>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>> 978-0-9759693-0-4*
>>>
>>> *Publications due shortly:*
>>>
>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>> 978-0-9563693-3-8
>>>
>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>>> one out shortly
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> NOTE: The information in this email is proprietary and confidential.
>>> This message is for the designated recipient only, if you are not the
>>> intended recipient, you should destroy it immediately. Any information in
>>> this message shall not be understood as given or endorsed by Peridale
>>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>>> stated. It is the responsibility of the recipient to ensure that this email
>>> is virus free, therefore neither Peridale Technology Ltd, its subsidiaries
>>> nor their employees accept any responsibility.
>>>
>>>
>>>
>>> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com
>>> <javascript:_e(%7B%7D,'cvml','xzh...@cloudera.com');>]
>>> *Sent:* 02 February 2016 23:12
>>> *To:* user@hive.apache.org
>>> <javascript:_e(%7B%7D,'cvml','user@hive.apache.org');>
>>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore
>>>
>>>
>>>
>>> I think the diff is not only about which does optimization but more on
>>> feature parity. Hive on Spark offers all functional features that Hive
>>> offers and these features play out faster. However, Spark SQL is far from
>>> offering this parity as far as I know.
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk
>>> <javascript:_e(%7B%7D,'cvml','m...@peridale.co.uk');>> wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> My understanding is that with Hive on Spark engine, one gets the Hive
>>> optimizer and Spark query engine
>>>
>>>
>>>
>>> With spark using Hive metastore, Spark does both the optimization and
>>> query engine. The only value add is that one can access the underlying Hive
>>> tables from spark-sql etc
>>>
>>>
>>>
>>>
>>>
>>> Is this assessment correct?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> *Sybase ASE 15 Gold Medal Award 2008*
>>>
>>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>>
>>>
>>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>>
>>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>>> 15", ISBN 978-0-9563693-0-7*.
>>>
>>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>>> 978-0-9759693-0-4*
>>>
>>> *Publications due shortly:*
>>>
>>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>>> 978-0-9563693-3-8
>>>
>>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>>> one out shortly
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> NOTE: The information in this email is proprietary and confidential.
>>> This message is for the designated recipient only, if you are not the
>>> intended recipient, you should destroy it immediately. Any information in
>>> this message shall not be understood as given or endorsed by Peridale
>>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>>> stated. It is the responsibility of the recipient to ensure that this email
>>> is virus free, therefore neither Peridale Technology Ltd, its subsidiaries
>>> nor their employees accept any responsibility.
>>>
>>>
>>>
>>>
>>>
>>
>>
>

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.

Re: Hive on Spark Engine versus Spark using Hive metastore

Reply via email to