Yes, regardless of which Spark mode you're running in, the Spark AM web UI should show you how many tasks are running concurrently. I'm a little surprised that your Hive configuration only allows 2 map tasks to run in parallel. If your cluster has the capacity, you should parallelize all the tasks to achieve optimal performance. Since I don't know your Spark SQL configuration, I can't tell how much parallelism you have over there, so I'm not sure your comparison is valid.
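For reference, the degree of parallelism for Hive on Spark is governed by the Spark executor settings, which can go in hive-site.xml or be set per session. A sketch (the values here are illustrative, not a recommendation — size them to your cluster):

```sql
-- Hypothetical session-level settings; tune the values to your cluster capacity.
SET hive.execution.engine=spark;
SET spark.executor.instances=8;   -- executors YARN will allocate for the session
SET spark.executor.cores=4;       -- concurrent tasks per executor
SET spark.executor.memory=4g;     -- heap per executor
-- With these illustrative values, up to 8 x 4 = 32 tasks can run in parallel,
-- rather than the 2 you are seeing.
```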
--Xuefu

On Tue, Feb 2, 2016 at 5:08 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi Jeff,

In the below:

"…. You should be able to see the resource usage in the YARN resource manager URL."

Just to be clear, are we talking about port 8088/cluster?

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the book "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7
Co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
"Complex Event Processing in Heterogeneous Environments", ISBN 978-0-9563693-3-8
"Oracle and Sybase, Concepts and Contrasts", ISBN 978-0-9563693-1-4, volume one out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility.
From: Koert Kuipers [mailto:ko...@tresata.com]
Sent: 03 February 2016 00:09
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

Uhm, with Spark using the Hive metastore you actually have a real programming environment and you can write real functions, versus being boxed into some version of SQL and limited UDFs?

On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com> wrote:

When comparing performance, you need to compare apples to apples. In another thread, you mentioned that Hive on Spark is much slower than Spark SQL. However, you configured Hive such that only two tasks can run in parallel, and you didn't provide information on how much parallelism Spark SQL is utilizing. Thus, it's hard to tell whether it's just a configuration problem in your Hive setup or whether Spark SQL is indeed faster. You should be able to see the resource usage in the YARN resource manager URL.

--Xuefu

On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Thanks Jeff.

Obviously Hive is much more feature-rich compared to Spark. Having said that, in certain areas, for example where the SQL feature is available in Spark, Spark seems to deliver faster. This may be because:

1. Spark does both the optimisation and the execution seamlessly
2. Hive on Spark has to invoke YARN, which adds another layer to the process

Now I did some simple tests, running the same query against a 100-million-row ORC table available through Hive, on both.
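For context, the column names visible in the result sets below suggest a test table along these lines (a hypothetical reconstruction; the actual DDL and column types are not given in this thread):

```sql
-- Hypothetical sketch of the test table; column types are assumptions.
CREATE TABLE dummy (
    id            INT,
    clustered     INT,
    scattered     INT,
    randomised    INT,
    random_string VARCHAR(50),
    small_vc      VARCHAR(10),
    padding       VARCHAR(10)
)
STORED AS ORC;
```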
Spark 1.5.2 using the Hive 1.2.1 metastore:

spark-sql> select * from dummy where id in (1, 5, 100000);
1       0    0    63    rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi    1         xxxxxxxxxx
5       0    4    31    vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA    5         xxxxxxxxxx
100000  99   999  188   abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe    100000    xxxxxxxxxx
Time taken: 50.805 seconds, Fetched 3 row(s)

Two further runs of the same query took 50.358 and 50.563 seconds. So: three runs, each returning three rows in just over 50 seconds.

Hive 1.2.1 with Spark 1.3.1 as the execution engine:

0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in (1, 5, 100000);
INFO  : Query Hive on Spark job[4] stages:
INFO  : 4
INFO  : Status: Running (Hive on Spark job[4])
INFO  : Status: Finished successfully in 82.49 seconds
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
| dummy.id  | dummy.clustered  | dummy.scattered  | dummy.randomised  | dummy.random_string                                 | dummy.small_vc  | dummy.padding  |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
| 1         | 0                | 0                | 63                | rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi  | 1               | xxxxxxxxxx     |
| 5         | 0                | 4                | 31                | vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA  | 5               | xxxxxxxxxx     |
| 100000    | 99               | 999               | 188              | abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  | 100000          | xxxxxxxxxx     |
+-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+
3 rows selected (82.66 seconds)

Two further runs finished in 76.67 seconds (76.835 seconds end to end) and 80.54 seconds (80.718 seconds end to end). So: three runs returning the same rows, each in around 80 seconds.

It is possible that the lag is down to the Spark engine I use with Hive being 1.3.1, which is out of date.

There are also certain queries that one cannot run with Spark SQL. Besides, it does not recognise CHAR fields, which is a pain:

spark-sql> CREATE TEMPORARY TABLE tmp AS
         > SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
         > FROM sales s, times t, channels c
         > WHERE s.time_id = t.time_id
         > AND s.channel_id = c.channel_id
         > GROUP BY t.calendar_month_desc, c.channel_desc;
Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
You are likely trying to use an unsupported Hive feature.

From: Xuefu Zhang [mailto:xzh...@cloudera.com]
Sent: 02 February 2016 23:12
To: user@hive.apache.org
Subject: Re: Hive on Spark Engine versus Spark using Hive metastore

I think the difference is not only about which component does the optimization, but more about feature parity. Hive on Spark offers all the functional features that Hive offers, and those features run faster. However, Spark SQL is far from offering that parity, as far as I know.
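As an aside on the `CREATE TEMPORARY TABLE` error above: Spark SQL of that era supports `CACHE TABLE ... AS SELECT`, which can stand in for a temporary table in many cases. A sketch, not tested against the cluster in this thread:

```sql
-- Hypothetical workaround for the failing CREATE TEMPORARY TABLE ... AS SELECT.
-- CACHE TABLE materialises the query result in memory for the current session:
CACHE TABLE tmp AS
SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
FROM   sales s, times t, channels c
WHERE  s.time_id = t.time_id
AND    s.channel_id = c.channel_id
GROUP BY t.calendar_month_desc, c.channel_desc;
```

A permanent `CREATE VIEW` over the same SELECT would achieve a similar effect without caching.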
On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Hi,

My understanding is that with Hive on the Spark engine, one gets the Hive optimizer and the Spark query engine.

With Spark using the Hive metastore, Spark does both the optimization and the query execution. The only value-add is that one can access the underlying Hive tables from spark-sql etc.

Is this assessment correct?

Thanks

Dr Mich Talebzadeh