Hive has numerous extension points, you are not boxed in by a long shot. On Tuesday, February 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:
> uuuhm with spark using Hive metastore you actually have a real > programming environment and you can write real functions, versus just being > boxed into some version of sql and limited udfs? > > On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang <xzh...@cloudera.com > <javascript:_e(%7B%7D,'cvml','xzh...@cloudera.com');>> wrote: > >> When comparing the performance, you need to do it apple vs apple. In >> another thread, you mentioned that Hive on Spark is much slower than Spark >> SQL. However, you configured Hive such that only two tasks can run in >> parallel. However, you didn't provide information on how much Spark SQL is >> utilizing. Thus, it's hard to tell whether it's just a configuration >> problem in your Hive or Spark SQL is indeed faster. You should be able to >> see the resource usage in YARN resource manage URL. >> >> --Xuefu >> >> On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh <m...@peridale.co.uk >> <javascript:_e(%7B%7D,'cvml','m...@peridale.co.uk');>> wrote: >> >>> Thanks Jeff. >>> >>> >>> >>> Obviously Hive is much more feature rich compared to Spark. Having said >>> that in certain areas for example where the SQL feature is available in >>> Spark, Spark seems to deliver faster. >>> >>> >>> >>> This may be: >>> >>> >>> >>> 1. Spark does both the optimisation and execution seamlessly >>> >>> 2. Hive on Spark has to invoke YARN that adds another layer to the >>> process >>> >>> >>> >>> Now I did some simple tests on a 100Million rows ORC table available >>> through Hive to both. >>> >>> >>> >>> *Spark 1.5.2 on Hive 1.2.1 Metastore* >>> >>> >>> >>> >>> >>> spark-sql> select * from dummy where id in (1, 5, 100000); >>> >>> 1 0 0 63 >>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi 1 >>> xxxxxxxxxx >>> >>> 5 0 4 31 >>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA 5 >>> xxxxxxxxxx >>> >>> 100000 99 999 188 >>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe 100000 >>> xxxxxxxxxx >>> >>> Time taken: 50.805 seconds, Fetched 3 row(s) >>> >>> spark-sql> select * from dummy where id in (1, 5, 100000); >>> >>> 1 0 0 63 >>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi 1 >>> xxxxxxxxxx >>> >>> 5 0 4 31 >>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA 5 >>> xxxxxxxxxx >>> >>> 100000 99 999 188 >>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe 100000 >>> xxxxxxxxxx >>> >>> Time taken: 50.358 seconds, Fetched 3 row(s) >>> >>> spark-sql> select * from dummy where id in (1, 5, 100000); >>> >>> 1 0 0 63 >>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi 1 >>> xxxxxxxxxx >>> >>> 5 0 4 31 >>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA 5 >>> xxxxxxxxxx >>> >>> 100000 99 999 188 >>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe 100000 >>> xxxxxxxxxx >>> >>> Time taken: 50.563 seconds, Fetched 3 row(s) >>> >>> >>> >>> So three runs returning three rows just over 50 seconds >>> >>> >>> >>> *Hive 1.2.1 on spark 1.3.1 execution engine* >>> >>> >>> >>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in >>> (1, 5, 100000); >>> >>> INFO : >>> >>> Query Hive on Spark job[4] stages: >>> >>> INFO : 4 >>> >>> INFO : >>> >>> Status: Running (Hive on Spark job[4]) >>> >>> INFO : Status: Finished successfully in 82.49 seconds >>> >>> >>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+ >>> >>> | dummy.id | dummy.clustered | dummy.scattered | dummy.randomised >>> | dummy.random_string | dummy.small_vc | >>> dummy.padding | >>> >>> >>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+ >>> >>> | 1 | 0 | 0 | 63 | >>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi | 1 | >>> xxxxxxxxxx | >>> >>> | 5 | 0 | 4 | 31 | >>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA | 5 | >>> xxxxxxxxxx | >>> >>> | 100000 | 99 | 999 | 188 | >>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe | 100000 | >>> xxxxxxxxxx | >>> >>> >>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+ >>> >>> 3 rows selected (82.66 seconds) >>> >>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in >>> (1, 5, 100000); >>> >>> INFO : Status: Finished successfully in 76.67 seconds >>> >>> >>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+ >>> >>> | dummy.id | dummy.clustered | dummy.scattered | dummy.randomised >>> | dummy.random_string | dummy.small_vc | >>> dummy.padding | >>> >>> >>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+ >>> >>> | 1 | 0 | 0 | 63 | >>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi | 1 | >>> xxxxxxxxxx | >>> >>> | 5 | 0 | 4 | 31 | >>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA | 5 | >>> xxxxxxxxxx | >>> >>> | 100000 | 99 | 999 | 188 | >>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe | 100000 | >>> xxxxxxxxxx | >>> >>> >>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+ >>> >>> 3 rows selected (76.835 seconds) >>> >>> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in >>> (1, 5, 100000); >>> >>> INFO : Status: Finished successfully in 80.54 seconds >>> >>> >>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+ >>> >>> | dummy.id | dummy.clustered | dummy.scattered | dummy.randomised >>> | dummy.random_string | dummy.small_vc | >>> dummy.padding | >>> >>> >>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+ >>> >>> | 1 | 0 | 0 | 63 | >>> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi | 1 | >>> xxxxxxxxxx | >>> >>> | 5 | 0 | 4 | 31 | >>> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA | 5 | >>> xxxxxxxxxx | >>> >>> | 100000 | 99 | 999 | 188 | >>> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe | 100000 | >>> xxxxxxxxxx | >>> >>> >>> +-----------+------------------+------------------+-------------------+-----------------------------------------------------+-----------------+----------------+--+ >>> >>> 3 rows selected (80.718 seconds) >>> >>> >>> >>> Three runs returning the same rows in 80 seconds. >>> >>> >>> >>> It is possible that My Spark engine with Hive is 1.3.1 which is out of >>> date and that causes this lag. >>> >>> >>> >>> There are certain queries that one cannot do with Spark. Besides it does >>> not recognize CHAR fields which is a pain. >>> >>> >>> >>> spark-sql> *CREATE TEMPORARY TABLE tmp AS* >>> >>> > SELECT t.calendar_month_desc, c.channel_desc, >>> SUM(s.amount_sold) AS TotalSales >>> >>> > FROM sales s, times t, channels c >>> >>> > WHERE s.time_id = t.time_id >>> >>> > AND s.channel_id = c.channel_id >>> >>> > GROUP BY t.calendar_month_desc, c.channel_desc >>> >>> > ; >>> >>> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7 >>> >>> . >>> >>> You are likely trying to use an unsupported Hive feature."; >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> Dr Mich Talebzadeh >>> >>> >>> >>> LinkedIn * >>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>> >>> >>> >>> *Sybase ASE 15 Gold Medal Award 2008* >>> >>> A Winning Strategy: Running the most Critical Financial Data on ASE 15 >>> >>> >>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf >>> >>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE >>> 15", ISBN 978-0-9563693-0-7*. >>> >>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN >>> 978-0-9759693-0-4* >>> >>> *Publications due shortly:* >>> >>> *Complex Event Processing in Heterogeneous Environments*, ISBN: >>> 978-0-9563693-3-8 >>> >>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume >>> one out shortly >>> >>> >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> >>> NOTE: The information in this email is proprietary and confidential. >>> This message is for the designated recipient only, if you are not the >>> intended recipient, you should destroy it immediately. Any information in >>> this message shall not be understood as given or endorsed by Peridale >>> Technology Ltd, its subsidiaries or their employees, unless expressly so >>> stated. It is the responsibility of the recipient to ensure that this email >>> is virus free, therefore neither Peridale Technology Ltd, its subsidiaries >>> nor their employees accept any responsibility. >>> >>> >>> >>> *From:* Xuefu Zhang [mailto:xzh...@cloudera.com >>> <javascript:_e(%7B%7D,'cvml','xzh...@cloudera.com');>] >>> *Sent:* 02 February 2016 23:12 >>> *To:* user@hive.apache.org >>> <javascript:_e(%7B%7D,'cvml','user@hive.apache.org');> >>> *Subject:* Re: Hive on Spark Engine versus Spark using Hive metastore >>> >>> >>> >>> I think the diff is not only about which does optimization but more on >>> feature parity. Hive on Spark offers all functional features that Hive >>> offers and these features play out faster. However, Spark SQL is far from >>> offering this parity as far as I know. >>> >>> >>> >>> On Tue, Feb 2, 2016 at 2:38 PM, Mich Talebzadeh <m...@peridale.co.uk >>> <javascript:_e(%7B%7D,'cvml','m...@peridale.co.uk');>> wrote: >>> >>> Hi, >>> >>> >>> >>> My understanding is that with Hive on Spark engine, one gets the Hive >>> optimizer and Spark query engine >>> >>> >>> >>> With spark using Hive metastore, Spark does both the optimization and >>> query engine. The only value add is that one can access the underlying Hive >>> tables from spark-sql etc >>> >>> >>> >>> >>> >>> Is this assessment correct? >>> >>> >>> >>> >>> >>> >>> >>> Thanks >>> >>> >>> >>> Dr Mich Talebzadeh >>> >>> >>> >>> LinkedIn * >>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>> >>> >>> >>> *Sybase ASE 15 Gold Medal Award 2008* >>> >>> A Winning Strategy: Running the most Critical Financial Data on ASE 15 >>> >>> >>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf >>> >>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE >>> 15", ISBN 978-0-9563693-0-7*. >>> >>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN >>> 978-0-9759693-0-4* >>> >>> *Publications due shortly:* >>> >>> *Complex Event Processing in Heterogeneous Environments*, ISBN: >>> 978-0-9563693-3-8 >>> >>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume >>> one out shortly >>> >>> >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> >>> NOTE: The information in this email is proprietary and confidential. >>> This message is for the designated recipient only, if you are not the >>> intended recipient, you should destroy it immediately. Any information in >>> this message shall not be understood as given or endorsed by Peridale >>> Technology Ltd, its subsidiaries or their employees, unless expressly so >>> stated. It is the responsibility of the recipient to ensure that this email >>> is virus free, therefore neither Peridale Technology Ltd, its subsidiaries >>> nor their employees accept any responsibility. >>> >>> >>> >>> >>> >> >> > -- Sorry this was sent from mobile. Will do less grammar and spell check than usual.