Yes, regardless of which Spark mode you're running in, the Spark AM web UI should show you how many tasks are running concurrently. I'm a little surprised that your Hive configuration only allows 2 map tasks to run in parallel. If your cluster has the capacity, you should parallelize all the tasks to achieve optimal performance. Since I don't know your Spark SQL configuration, I can't tell how much parallelism you have over there, so I'm not sure your comparison is valid.
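For reference, the degree of parallelism for Hive on Spark is governed by the Spark executor settings, which can go in hive-site.xml or be set per session. A sketch (the values here are illustrative, not a recommendation — size them to your cluster):

```sql
-- Hypothetical session-level settings; tune the values to your cluster capacity.
SET hive.execution.engine=spark;
SET spark.executor.instances=8;   -- executors YARN will allocate for the session
SET spark.executor.cores=4;       -- concurrent tasks per executor
SET spark.executor.memory=4g;     -- heap per executor
-- With these illustrative values, up to 8 x 4 = 32 tasks can run in parallel,
-- rather than the 2 you are seeing.
```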
--Xuefu

On Tue, Feb 2, 2016 at 5:08 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi Jeff,

In the below:

"…. You should be able to see the resource usage in the YARN resource manager URL."

Just to be clear, are we talking about port 8088/cluster?

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the book "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7
Co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
"Complex Event Processing in Heterogeneous Environments", ISBN 978-0-9563693-3-8
"Oracle and Sybase, Concepts and Contrasts", ISBN 978-0-9563693-1-4, volume one out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility.
From: Koert Kuipers [mailto:ko...@tresata.com]
Sent: 03 February 2016 00:09
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

Uhm, with Spark using the Hive metastore you actually have a real programming environment and you can write real functions, versus being boxed into some version of SQL and limited UDFs?

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:

When comparing performance, you need to compare apples to apples. In another thread, you mentioned that Hive on Spark is much slower than Spark SQL. However, you configured Hive such that only two tasks can run in parallel, and you didn't provide information on how much parallelism Spark SQL is utilizing. Thus, it's hard to tell whether it's just a configuration problem in your Hive setup or whether Spark SQL is indeed faster. You should be able to see the resource usage in the YARN resource manager URL.

--Xuefu

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Thanks Jeff.

Obviously Hive is much more feature-rich compared to Spark. Having said that, in certain areas, for example where the SQL feature is available in Spark, Spark seems to deliver faster. This may be because:

1. Spark does both the optimisation and the execution seamlessly
2. Hive on Spark has to invoke YARN, which adds another layer to the process

Now I did some simple tests, running the same query against a 100-million-row ORC table available through Hive, on both.
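For context, the column names visible in the result sets below suggest a test table along these lines (a hypothetical reconstruction; the actual DDL and column types are not given in this thread):

```sql
-- Hypothetical sketch of the test table; column types are assumptions.
CREATE TABLE dummy (
    id            INT,
    clustered     INT,
    scattered     INT,
    randomised    INT,
    random_string VARCHAR(50),
    small_vc      VARCHAR(10),
    padding       VARCHAR(10)
)
STORED AS ORC;
```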
Spark 1.5.2 using the Hive 1.2.1 metastore:

spark-sql> select * from dummy where id in (1, 5, 100000);
1       0    0    63    rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi    1         xxxxxxxxxx
5       0    4    31    vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA    5         xxxxxxxxxx
100000  99   999  188   abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe    100000    xxxxxxxxxx
Time taken: 50.805 seconds, Fetched 3 row(s)

Two further runs of the same query took 50.358 and 50.563 seconds. So: three runs, each returning three rows in just over 50 seconds.

Hive 1.2.1 with Spark 1.3.1 as the execution engine:

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
INFO  : Query Hive on Spark job[4] stages:
INFO  : 4
INFO  : Status: Running (Hive on Spark job[4])
INFO  : Status: Finished successfully in 82.49 seconds
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                 | dummy.small_vc  | dummy.padding  |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
| 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xxxxxxxxxx     |
| 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xxxxxxxxxx     |
| 100000    | 99               | 999               | 188              | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 100000          | xxxxxxxxxx     |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
3 rows selected (82.66 seconds)

Two further runs finished in 76.67 seconds (76.835 seconds end to end) and 80.54 seconds (80.718 seconds end to end). So: three runs returning the same rows, each in around 80 seconds.

It is possible that the lag is down to the Spark engine I use with Hive being 1.3.1, which is out of date.

There are also certain queries that one cannot run with Spark SQL. Besides, it does not recognise CHAR fields, which is a pain:

spark-sql> CREATE TEMPORARY TABLE tmp AS
         > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
         > FROM sales s, times t, channels c
         > WHERE s.time_id = t.time_id
         > AND s.channel_id = c.channel_id
         > GROUP BY t.calendar_month_desc, c.channel_desc;
Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
You are likely trying to use an unsupported Hive feature.

From: Xuefu Zhang [mailto:xzh...@cloudera.com]
Sent: 02 February 2016 23:12
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

I think the difference is not only about which component does the optimization, but more about feature parity. Hive on Spark offers all the functional features that Hive offers, and those features run faster. However, Spark SQL is far from offering that parity, as far as I know.
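As an aside on the `CREATE TEMPORARY TABLE` error above: Spark SQL of that era supports `CACHE TABLE ... AS SELECT`, which can stand in for a temporary table in many cases. A sketch, not tested against the cluster in this thread:

```sql
-- Hypothetical workaround for the failing CREATE TEMPORARY TABLE ... AS SELECT.
-- CACHE TABLE materialises the query result in memory for the current session:
CACHE TABLE tmp AS
SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
FROM   sales s, times t, channels c
WHERE  s.time_id = t.time_id
AND    s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc;
```

A permanent `CREATE VIEW` over the same SELECT would achieve a similar effect without caching.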
On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi,

My understanding is that with Hive on the Spark engine, one gets the Hive optimizer and the Spark query engine.

With Spark using the Hive metastore, Spark does both the optimization and the query execution. The only value-add is that one can access the underlying Hive tables from spark-sql etc.

Is this assessment correct?

Thanks

Dr Mich Talebzadeh